
Springer Atmospheric Sciences

Abdelwaheb Hannachi

Patterns Identification and Data Mining in Weather and Climate
Springer Atmospheric Sciences
The Springer Atmospheric Sciences series seeks to publish a broad portfolio
of scientific books, aiming at researchers, students, and everyone interested in
this interdisciplinary field. The series includes peer-reviewed monographs, edited
volumes, textbooks, and conference proceedings. It covers the entire area of
atmospheric sciences including, but not limited to, Meteorology, Climatology,
Atmospheric Chemistry and Physics, Aeronomy, Planetary Science, and related
subjects.

More information about this series at http://www.springer.com/series/10176


Abdelwaheb Hannachi

Patterns Identification and Data Mining in Weather and Climate
Abdelwaheb Hannachi
Department of Meteorology, MISU
Stockholm University
Stockholm, Sweden

ISSN 2194-5217 ISSN 2194-5225 (electronic)


Springer Atmospheric Sciences
ISBN 978-3-030-67072-6 ISBN 978-3-030-67073-3 (eBook)
https://doi.org/10.1007/978-3-030-67073-3

© Springer Nature Switzerland AG 2021


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To the memory of my father and mother who
taught me big principles, and to my little
family Houda, Badr, Zeid and Ahmed for
their patience.
Preface

Weather and climate form a fascinating system, which affects our daily lives and is
closely interlinked with the environment, society and infrastructure. It has a large
impact on our lives and activities, climate change being a typical example. It is a
high-dimensional, highly complex system involving nonlinear interactions between
very many modes or degrees of freedom. This made weather and climate look
mysterious in ancient societies. Complex high-dimensional systems are difficult to
comprehend by our three-dimensional concept of the physical world. Humans have
sought out patterns in order to describe the workings of the world around us. This
task, however, has proved difficult and challenging.
In the climate context, the quest to identify patterns is driven by the desire to
find structures embedded in state space, which can lead to a better understanding
of the system dynamics, and eventually learn its behaviour and predict its future
state. With the advent of computers and observing systems, massive amounts of data
from the atmosphere and ocean are obtained, which beg for exploration and analysis.
Pattern identification in atmospheric science has a long history. It began in the 1920s
with Gilbert Walker, who identified the southern oscillation and the atmospheric
component of the ENSO teleconnection, although the latter concept seems to have been
mentioned for the first time by Ångström in the mid-1930s. The correlation analysis
of Gilbert Walker used to identify the southern oscillation is akin to the iterative
algorithm used to compute empirical orthogonal functions. However, the earliest
known eigenanalysis in atmospheric science goes back to the former USSR school
of Obukhov and Bagrov around the late 1940s and early 1950s, respectively. But
it was Edward Lorenz who coined the term ‘empirical orthogonal
functions’ (EOFs) in the mid-1950s. Since then, research on the topic has been
expanding, and a number of textbooks have been written, notably by Preisendorfer
in the late 1980s, followed by texts by Thiebaux, and von Storch and Zwiers about
a decade later, and another one by Jolliffe in the early 2000s. These texts did an
excellent job in presenting the theory and methods, particularly those related to
eigenvalue problems in meteorology and oceanography.
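For readers curious about that iterative algorithm, the following minimal Matlab sketch (an illustration, not one of the book's own skeleton codes) shows a power-type iteration that converges to the leading EOF; it assumes an anomaly data matrix X of size (time × grid points), and the variable names and iteration count are arbitrary choices.

% Minimal power-iteration sketch for the leading EOF (illustrative only).
C = (X' * X) / (size(X, 1) - 1);   % sample covariance matrix of the anomalies
e = randn(size(C, 1), 1);          % random initial spatial pattern
for it = 1:200                     % fixed number of iterations, for simplicity
    e = C * e;                     % apply the covariance operator
    e = e / norm(e);               % renormalise to unit length
end
lambda1 = e' * C * e;              % leading eigenvalue (variance explained)
pc1 = X * e;                       % leading principal component time series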
Weather and climate data analysis has witnessed a fast growth in the last few
decades both in terms of methods and applications. This growth was driven by the


need to analyse and interpret the fast-growing volume of climate data using both
linear and nonlinear methods. In this book, I attempt to give an up-to-date text by
presenting linear and nonlinear methods that have been developed in the last two
decades, in addition to including conventional ones.
The text is composed of 17 chapters. Apart from the first two introductory and
setting up chapters, the remaining 15 chapters present the different methods used to
analyse spatio-temporal data from atmospheric science and oceanography. The EOF
method, a cornerstone of eigenvalue problems in meteorology and oceanography, is
presented in Chap. 3. The next four chapters present derivatives of EOFs, including
eigenvalue problems involved in the identification of propagating features. A whole
chapter is devoted to predictability and predictable patterns, and another one to
multidimensional scaling, which discusses various dissimilarity measures used in
pattern identification, followed by a chapter on factor analysis. Nonlinear methods
of space-time pattern identification, with different perspectives, are presented in the
next three chapters. The previous chapters deal essentially with discrete gridded
data, as is usually the case, with no explicit discussion of the continuous case, such
as curves or surfaces. This topic is presented and discussed in the next chapter.
Another whole chapter is devoted to presenting and discussing the topic of coupled
patterns using conventional and newly developed approaches. A number of other
methods are not presented in the previous chapters. Those methods are collected
and presented in the penultimate chapter. Finally, and to take into account the recent
interest in automatic methods, the last chapter presents and discusses a few
commonly used methods in machine learning. To make this a stand-alone text, a number of
technical appendices are given at the end of the book.
This book can be used in teaching data analysis in atmospheric science, or
other topics such as advanced statistical methods in climate research. Apart from
Chap. 15, in the context of coupled patterns and regression, and Appendix C, I
did not discuss explicitly statistical modelling/inference. This topic of statistical
inference in climate science is covered in a number of other books reported in
the reference list. To help students and young researchers in the field explore the
topics, I have included a number of small exercises, with hints, embedded within
the different chapters, in addition to basic skeleton Matlab codes for some of the
methods. Full Matlab codes can be obtained from the author upon request.
A list of software links is also given at the end of the book.

Stockholm, Sweden Abdelwaheb Hannachi


Pattern Identification and Data Mining in
Weather and Climate

Complexity, nonlinearity and high-dimensionality constitute the main characteristic
features of the weather and climate dynamical system. Advances in computer
power and observing systems have led to the generation and accumulation of
large-scale weather and climate data, which beg for exploration and analysis.
Pattern Identification and Data Mining in Weather and Climate presents, from
different perspectives, most of the available approaches, both novel and conventional,
used to analyse multivariate time series in atmospheric and oceanographic science to
identify patterns of variability and teleconnections, and to reduce dimensionality. The
book discusses in detail linear and nonlinear methods to identify stationary and
propagating patterns of spatio-temporal, single and combined fields. The book also
presents machine learning with a particular focus on the main methods used in
climate science. Applications to real atmospheric and oceanographic examples are
also presented and discussed in most chapters. To help guide students and beginners
in the field of weather and climate data analysis, basic Matlab skeleton codes are
given in some chapters, complemented with a list of software links towards the end
of the textbook. A number of technical appendices are also provided, making the
text particularly suitable for didactic purposes.
Abdelwaheb Hannachi is an associate professor in the Department of Meteorology
(MISU) at Stockholm University. He currently serves as editor-in-chief of
Tellus A: Dynamic Meteorology and Oceanography. Abdel teaches a number of
undergraduate and postgraduate courses, including dynamic meteorology, statistical
climatology, numerical weather prediction and data assimilation, and boundary
layer turbulence. His main research interests are large-scale dynamics, teleconnec-
tions, nonlinearity in weather and climate, in addition to extremes and forecasting.

Over the last few decades, we have amassed an enormous amount of weather and climate
data of which we have to make sense now. Pattern identification methods and modern data
mining approaches are essential for better understanding how the atmosphere and the climate
system work. These topics are not traditionally taught in meteorology programmes. This
book will prove a valuable source for students as well as active researchers interested
in these topics. The book provides a broad overview over modern pattern identification
methods and an introduction to machine learning.
– Christian Franzke, ICCP, Pusan National University
The topic of EOFs and associated pattern identification in space-time data sets has gone
through an extraordinarily fast development, both in terms of new insights and the breadth
of applications. For this reason, we need a text approximately every 10 years to summarize
the field. Older texts by, for instance, Jolliffe and Preisendorfer need to be succeeded by
an up-to-date new text. We welcome this new text by Abdel Hannachi who not only has a
deep insight in the field but has himself made several contributions to new developments in
the last 15 years.
– Huug van den Dool, Climate Prediction Center, NCEP, College Park, MD
Now that weather and climate science is producing ever larger and richer data sets, the topic
of pattern extraction and interpretation has become an essential part. This book provides an
up-to-date overview of the latest techniques and developments in this area.
– Maarten Ambaum, Department of Meteorology, University of Reading, UK
The text is very ambitious. It makes considerable effort to collect together a number
of classical methods for data analysis, as well as newly emerging ones addressing the
challenges of the modern huge data sets. There are not many books covering such a
wide spectrum of techniques. In this respect, the book is a valuable companion for many
researchers working in the field of climate/weather data analysis and mining. The author
deserves congratulations and encouragement for his enormous work.
– Nickolay Trendafilov, Open University, Milton Keynes
This nicely and expertly written book covers a lot of ground, ranging from classical
linear pattern identification techniques to more modern machine learning methodologies,
all illustrated with examples from weather and climate science. It will be very valuable both
as a tutorial for graduate and postgraduate students and as a reference text for researchers
and practitioners in the field.
– Frank Kwasniok, College of Engineering, Mathematics and Physical Sciences,
University of Exeter

We will show them Our signs in the horizons and within
themselves until it becomes clear to them that it is the truth
Holy Quran Ch. 41, V. 53
Acknowledgements

This text is a collection of work I have been conducting over the years on weather
and climate data analysis, in collaboration with colleagues, complemented with
other methods from the literature. I am especially grateful to all my teachers,
colleagues and students, who contributed directly or indirectly to this work. I would
like to thank, in particular, Zoubeida Bargaoui, Bernard Legras, Keith Haines, Ian
Jolliffe, David B. Stephenson, Nickolay Trendafilov, Christian Franzke, Thomas
Önskog, Carlos Pires, Tim Woollings, Klaus Fraedrich, Toshihiko Hirooka, Grant
Branstator, Lesley Gray, Alan O’Neill, Waheed Iqbal, Andrew Turner, Andy Heaps,
Amro Elfeki, Ahmed El-Hames, Huug van den Dool, Léon Chafik and all my MISU
colleagues, and many other colleagues I did not mention by name. I acknowledge the
support of Stockholm University and the Springer team, in particular Robert Doe,
executive editor, and Neelofar Yasmeen, production coordinator, for their support
and encouragement.

Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Complexity of the Climate System. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Data Exploration, Data Mining and Feature Extraction. . . . . . . . . . . 3
1.3 Major Concern in Climate Data Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.1 Characteristics of High-Dimensional Space
Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.2 Curse of Dimensionality and Empty Space
Phenomena . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.3 Dimension Reduction and Latent Variable Models . . . . 11
1.3.4 Some Problems and Remedies in Dimension
Reduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4 Examples of the Most Familiar Techniques . . . . . . . . . . . . . . . . . . . . . . . 13
2 General Setting and Basic Terminology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Simple Visualisation Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Data Processing and Smoothing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1 Preliminary Checking. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.2 Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.3 Simple Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Data Set-Up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Basic Notation/Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5.1 Centring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5.2 Covariance Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.5.3 Scaling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5.4 Sphering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5.5 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6 Stationary Time Series, Filtering and Spectra . . . . . . . . . . . . . . . . . . . . . 26
2.6.1 Univariate Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6.2 Multivariate Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28


3 Empirical Orthogonal Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31


3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Eigenvalue Problems in Meteorology: Historical Perspective . . . . 33
3.2.1 The Quest for Climate Patterns: Teleconnections . . . . . . 33
3.2.2 Eigenvalue Problems in Meteorology. . . . . . . . . . . . . . . . . . . 34
3.3 Computing Principal Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.1 Basis of Principal Component Analysis . . . . . . . . . . . . . . . . 35
3.3.2 Karhunen–Loéve Expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.3 Derivation of PCs/EOFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.4 Computing EOFs and PCs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4 Sampling, Properties and Interpretation of EOFs . . . . . . . . . . . . . . . . . 45
3.4.1 Sampling Variability and Uncertainty . . . . . . . . . . . . . . . . . . 45
3.4.2 Independent and Effective Sample Sizes . . . . . . . . . . . . . . . 50
3.4.3 Dimension Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.4.4 Properties and Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5 Covariance Versus Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.6 Scaling Problems in EOFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.7 EOFs for Multivariate Normal Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.8 Other Procedures for Obtaining EOFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.9 Other Related Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.9.1 Teleconnectivity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.9.2 Regression Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.9.3 Empirical Orthogonal Teleconnection . . . . . . . . . . . . . . . . . . 67
3.9.4 Climate Network-Based Methods . . . . . . . . . . . . . . . . . . . . . . . 67
4 Rotated and Simplified EOFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2 Rotation of EOFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2.1 Background on Rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2.2 Derivation of REOFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2.3 Computing REOFs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3 Simplified EOFs: SCoTLASS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.3.2 LASSO-Based Simplified EOFs . . . . . . . . . . . . . . . . . . . . . . . . 82
4.3.3 Computing the Simplified EOFs . . . . . . . . . . . . . . . . . . . . . . . . 83
5 Complex/Hilbert EOFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2 Conventional Complex EOFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.2.1 Pairs of Scalar Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.2.2 Single Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.3 Frequency Domain EOFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.3.2 Derivation of FDEOFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.4 Complex Hilbert EOFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.4.1 Hilbert Transform: Continuous Signals . . . . . . . . . . . . . . . . . 101

5.4.2 Hilbert Transform: Discrete Signals . . . . . . . . . . . . . . . . . . . . 103


5.4.3 Application to Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.4.4 Complex Hilbert EOFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.5 Rotation of HEOFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6 Principal Oscillation Patterns and Their Extension . . . . . . . . . . . . . . . . . . . . 117
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.2 POP Derivation and Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.2.1 Spatial Patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.2.2 Time Coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.2.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.3 Relation to Continuous POPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.3.1 Basic Relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.3.2 Finite Time POPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.4 Cyclo-Stationary POPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.5 Other Extensions/Interpretations of POPs . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.5.1 POPs and Normal Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.5.2 Complex POPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.5.3 Hilbert Oscillation Patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.5.4 Dynamic Mode Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.6 High-Order POPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.7 Principal Interaction Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7 Extended EOFs and SSA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.2 Dynamical Reconstruction and SSA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.2.2 Dynamical Reconstruction and SSA . . . . . . . . . . . . . . . . . . . . 148
7.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.3.1 White Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.3.2 Red Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
7.4 SSA and Periodic Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
7.5 Extended EOFs or Multivariate SSA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.5.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.5.2 Definition and Computation of EEOFs . . . . . . . . . . . . . . . . . 157
7.5.3 Data Filtering and Oscillation Reconstruction . . . . . . . . . 161
7.6 Potential Interpretation Pitfalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
7.7 Alternatives to SSA and EEOFs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
7.7.1 Recurrence Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
7.7.2 Data-Adaptive Harmonic Decomposition . . . . . . . . . . . . . . 169
8 Persistent, Predictive and Interpolated Patterns . . . . . . . . . . . . . . . . . . . . . . . . 171
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
8.2 Background on Persistence and Prediction of Stationary
Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
8.2.1 Decorrelation Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
8.2.2 The Prediction Problem and Kolmogorov Formula . . . . 174

8.3 Optimal Persistence and Average Predictability. . . . . . . . . . . . . . . . . . . 176


8.3.1 Derivation of Optimally Persistent Patterns . . . . . . . . . . . . 176
8.3.2 Estimation from Finite Samples. . . . . . . . . . . . . . . . . . . . . . . . . 179
8.3.3 Average Predictability Patterns. . . . . . . . . . . . . . . . . . . . . . . . . . 183
8.4 Predictive Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
8.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
8.4.2 Optimally Predictable Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . 185
8.4.3 Computational Aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
8.5 Optimally Interpolated Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
8.5.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
8.5.2 Interpolation and Pattern Derivation . . . . . . . . . . . . . . . . . . . . 189
8.5.3 Numerical Aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
8.5.4 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
8.6 Forecastable Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
9 Principal Coordinates or Multidimensional Scaling . . . . . . . . . . . . . . . . . . . . 201
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
9.2 Dissimilarity Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
9.3 Metric Multidimensional Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
9.3.1 The Problem of Classical Scaling . . . . . . . . . . . . . . . . . . . . . . . 204
9.3.2 Principal Coordinate Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . 205
9.3.3 Case of Non-Euclidean Dissimilarity Matrix . . . . . . . . . . . 207
9.4 Non-metric Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
9.5 Further Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
9.5.1 Replicated and Weighted MDS . . . . . . . . . . . . . . . . . . . . . . . . . 210
9.5.2 Nonlinear Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
9.5.3 Application to the Asian Monsoon. . . . . . . . . . . . . . . . . . . . . . 212
9.5.4 Scaling and the Matrix Nearness Problem . . . . . . . . . . . . . . 215
10 Factor Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
10.2 The Factor Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
10.2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
10.2.2 Model Definition and Terminology . . . . . . . . . . . . . . . . . . . . . 220
10.2.3 Model Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
10.2.4 Non-unicity of Loadings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
10.3 Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
10.3.1 Maximum Likelihood Estimates . . . . . . . . . . . . . . . . . . . . . . . . 224
10.3.2 Expectation Maximisation Algorithm . . . . . . . . . . . . . . . . . . 225
10.4 Factor Rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
10.4.1 Oblique and Orthogonal Rotations . . . . . . . . . . . . . . . . . . . . . . 229
10.4.2 Examples of Rotation Criteria. . . . . . . . . . . . . . . . . . . . . . . . . . . 230
10.5 Exploratory FA and Application to SLP Anomalies . . . . . . . . . . . . . . 232
10.5.1 Factor Analysis as a Matrix Decomposition Problem. . 232
10.5.2 A Factor Rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234

10.6 Basic Difference Between EOF and Factor Analyses . . . . . . . . . . . . . 235


10.6.1 Comparison Based on the Standard Factor Model . . . . . 236
10.6.2 Comparison Based on the Exploratory Factor
Analysis Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
11 Projection Pursuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
11.2 Definition and Purpose of Projection Pursuit . . . . . . . . . . . . . . . . . . . . . . 242
11.2.1 What Is Projection Pursuit? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
11.2.2 Why Projection Pursuit? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
11.3 Entropy and Structure of Random Variables . . . . . . . . . . . . . . . . . . . . . . 244
11.3.1 Shannon Entropy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
11.3.2 Differential Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
11.4 Types of Projection Indexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
11.4.1 Quality of a Projection Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
11.4.2 Various PP Indexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
11.4.3 Practical Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
11.5 PP Regression and Density Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
11.5.1 PP Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
11.5.2 PP Density Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
11.6 Skewness Modes and Climate Application of PP . . . . . . . . . . . . . . . . . 260
12 Independent Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
12.2 Background and Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
12.2.1 Blind Deconvolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
12.2.2 Blind Source Separation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
12.2.3 Definition of ICA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
12.3 Independence and Non-normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
12.3.1 Statistical Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
12.3.2 Non-normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
12.4 Information-Theoretic Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
12.4.1 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
12.4.2 Kullback–Leibler Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
12.4.3 Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
12.4.4 Negentropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
12.4.5 Useful Approximations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
12.5 Independent Component Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
12.5.1 Choice of Objective Function for ICA . . . . . . . . . . . . . . . . . . 276
12.5.2 Numerical Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
12.6 ICA via EOF Rotation and Weather and Climate Application . . . 284
12.6.1 The Standard Two-Way Problem. . . . . . . . . . . . . . . . . . . . . . . . 284
12.6.2 Extension to the Three-Way Data . . . . . . . . . . . . . . . . . . . . . . . 291
12.7 ICA Generalisation: Independent Subspace Analysis . . . . . . . . . . . . . 293

13 Kernel EOFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295


13.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
13.2 Kernel EOFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
13.2.1 Formulation of Kernel EOFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
13.2.2 Practical Details of Kernel EOF Computation . . . . . . . . . 301
13.2.3 Illustration with Concentric Clusters. . . . . . . . . . . . . . . . . . . . 302
13.3 Relation to Other Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
13.3.1 Spectral Clustering. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
13.3.2 Modularity Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
13.4 Pre-images in Kernel PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
13.5 Application to An Atmospheric Model and Reanalyses . . . . . . . . . . 309
13.5.1 Application to a Simplified Atmospheric Model . . . . . . . 309
13.5.2 Application to Reanalyses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
13.6 Other Extensions of Kernel EOFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
13.6.1 Extended Kernel EOFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
13.6.2 Kernel POPs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
14 Functional and Regularised EOFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
14.1 Functional EOFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
14.2 Functional PCs and Discrete Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
14.3 An Example of Functional PCs from Oceanography . . . . . . . . . . . . . 321
14.4 Regularised EOFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
14.4.1 General Setting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
14.4.2 Case of Spatial Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
14.5 Numerical Solution of the Full Regularised EOF Problem . . . . . . . 327
14.6 Application of Regularised EOFs to SLP Anomalies . . . . . . . . . . . . . 331
15 Methods for Coupled Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
15.2 Canonical Correlation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
15.2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
15.2.2 Formulation of CCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
15.2.3 Computational Aspect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
15.2.4 Regularised CCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
15.2.5 Use of Correlation Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
15.3 Canonical Covariance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
15.4 Redundancy Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
15.4.1 Redundancy Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
15.4.2 Redundancy Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
15.5 Application: Optimal Lag Between Two Fields and
Other Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
15.5.1 Application of CCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
15.5.2 Application of Redundancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
15.6 Principal Predictors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
15.7 Extension: Functional Smooth CCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
15.7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352

15.7.2 Functional Non-smooth CCA and Indeterminacy . . . . . . 352


15.7.3 Smooth CCA/MCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
15.7.4 Application of SMCA to Space–Time Fields. . . . . . . . . . . 359
15.8 Some Points on Coupled Patterns and Multivariate Regression . . 363
16 Further Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
16.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
16.2 EOFs and Random Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
16.3 Cyclo-stationary EOFs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
16.3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
16.3.2 Theory of Cyclo-stationary EOFs . . . . . . . . . . . . . . . . . . . . . . . 372
16.3.3 Application of CSEOFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
16.4 Trend EOFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
16.4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
16.4.2 Trend EOFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
16.4.3 Application of Trend EOFs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
16.5 Common EOF Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
16.5.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
16.5.2 Formulation of Common EOFs . . . . . . . . . . . . . . . . . . . . . . . . . 384
16.6 Continuum Power CCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
16.6.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
16.6.2 Continuum Power CCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
16.6.3 Determination of the Degree Parameter . . . . . . . . . . . . . . . . 390
16.7 Kernel MCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
16.7.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
16.7.2 Kernel MCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
16.7.3 An Alternative Way . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
16.8 Kernel CCA and Its Regularisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394
16.8.1 Primal and Dual CCA Formulation . . . . . . . . . . . . . . . . . . . . . 394
16.8.2 Regularised KCCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
16.8.3 Some Computational Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
16.9 Archetypal Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
16.9.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
16.9.2 Derivation of Archetypes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
16.9.3 Numerical Solution of Archetypes . . . . . . . . . . . . . . . . . . . . . . 399
16.9.4 Archetypes and Simplex Visualisation. . . . . . . . . . . . . . . . . . 403
16.9.5 An Application of AA to Climate . . . . . . . . . . . . . . . . . . . . . . . 404
16.10 Other Nonlinear PC Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
16.10.1 Principal Nonlinear Dynamical Modes . . . . . . . . . . . . . . . . . 410
16.10.2 Nonlinear PCs via Neural Networks . . . . . . . . . . . . . . . . . . . . 412
17 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
17.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
17.2 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416
17.2.1 Background and Rationale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416
17.2.2 General Structure of Neural Networks. . . . . . . . . . . . . . . . . . 419

17.2.3 Examples of Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423


17.2.4 Learning Procedure in NNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
17.2.5 Cost Functions for Multiclass Classification . . . . . . . . . . . . 428
17.3 Self-organising Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
17.3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
17.3.2 SOM Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
17.4 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
17.4.1 Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
17.4.2 Random Forest: Definition and Algorithm . . . . . . . . . . . . . 437
17.4.3 Out-of-Bag Data, Generalisation Error and Tuning . . . . 437
17.5 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
17.5.1 Neural Network Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
17.5.2 SOM Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444
17.5.3 Random Forest Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450

A Smoothing Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453


A.1 Smoothing Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453
A.1.1 More on Smoothing Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
A.1.2 Choice of the Smoothing Parameter . . . . . . . . . . . . . . . . . . . . 458
A.2 Radial Basis Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
A.2.1 Exact Interpolation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
A.2.2 RBF and Noisy Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
A.2.3 Relation to PDEs and Other Techniques . . . . . . . . . . . . . . . . 464
A.3 Kernel Smoother . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465

B Introduction to Probability and Random Variables . . . . . . . . . . . . . . . . . . . . 467


B.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
B.2 Sets Theory and Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 468
B.2.1 Elements of Sets Theory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 468
B.2.2 Definition of Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
B.3 Random Variables and Probability Distributions . . . . . . . . . . . . . . . . . . 470
B.3.1 Discrete Probability Distributions. . . . . . . . . . . . . . . . . . . . . . . 470
B.3.2 Continuous Probability Distributions . . . . . . . . . . . . . . . . . . . 471
B.3.3 Joint Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 473
B.3.4 Expectation and Covariance Matrix of
Random Vectors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
B.3.5 Conditional Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
B.4 Examples of Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
B.4.1 Discrete Case. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
B.4.2 Continuous Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
B.5 Stationary Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 480

C Stationary Time Series Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483


C.1 Autocorrelation Structure: One-Dimensional Case . . . . . . . . . . . . . . . 483
C.1.1 Autocovariance/Correlation Function. . . . . . . . . . . . . . . . . . . 483
C.1.2 Time Series Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484

C.2 Power Spectrum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487


C.3 The Multivariate Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 490
C.3.1 Autocovariance Structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 490
C.3.2 Cross-Spectrum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
C.4 Autocorrelation Structure in the Sample Space . . . . . . . . . . . . . . . . . . . 492
C.4.1 Autocovariance/Autocorrelation Estimates . . . . . . . . . . . . . 492
C.4.2 The Periodogram. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493

D Matrix Algebra and Matrix Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499


D.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
D.1.1 Matrices and Linear Operators . . . . . . . . . . . . . . . . . . . . . . . . . . 499
D.1.2 Operation on Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 500
D.2 Most Useful Matrix Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
D.3 Matrix Derivative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
D.3.1 Vector Derivative. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
D.3.2 Matrix Derivative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
D.3.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
D.4 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
D.4.1 MLE of the Parameters of a Multinormal
Distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
D.4.2 Estimation of the Factor Model Parameters . . . . . . . . . . . . 513
D.4.3 Application to Results from PCA . . . . . . . . . . . . . . . . . . . . . . . 515
D.5 Common Algorithms for Linear Systems and
Eigenvalue Problems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
D.5.1 Direct Methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
D.5.2 Iterative Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517

E Optimisation Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521


E.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521
E.2 Single Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
E.2.1 Direct Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
E.2.2 Derivative Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523
E.3 Direct Multivariate Search. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524
E.3.1 Downhill Simplex Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524
E.3.2 Conjugate Direction/Powell’s Method . . . . . . . . . . . . . . . . . . 524
E.3.3 Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525
E.4 Multivariate Gradient-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
E.4.1 Steepest Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
E.4.2 Newton–Raphson Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
E.4.3 Conjugate Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528
E.4.4 Quasi-Newton Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529
E.4.5 Ordinary Differential Equations-Based Methods. . . . . . . 530
E.5 Constrained Minimisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 530
E.5.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 530
E.5.2 Approaches for Constrained Minimisation . . . . . . . . . . . . . 532

F Hilbert Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535


F.1 Linear Vector and Metric Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535
F.1.1 Linear Vector Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535
F.1.2 Metric Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 536
F.2 Norm and Inner Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 536
F.2.1 Norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 536
F.2.2 Inner Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 536
F.2.3 Consequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 536
F.2.4 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537
F.3 Hilbert Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
F.3.1 Completeness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
F.3.2 Hilbert Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
F.3.3 Application to Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 539

G Systems of Linear Ordinary Differential Equations . . . . . . . . . . . . . . . . . . . . 543


G.1 Case of a Constant Matrix A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
G.1.1 Homogeneous System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
G.1.2 Non-homogeneous System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544
G.2 Case of a Time-Dependent Matrix A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
G.2.1 General Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
G.2.2 Particular Case of Periodic Matrix A: Floquet Theory. 546

H Links for Software Resource Material . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 553

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 583
Chapter 1
Introduction

Abstract This chapter describes the characteristic features of high dimensionality
and introduces the problem of dimensionality reduction in high-dimensional
systems with a particular focus on the importance of its application to the highly
complex climate system.

Keywords Complexity of the climate system · High dimensionality · Curse of
dimensionality · Dimension reduction · Data exploration

1.1 Complexity of the Climate System

Our atmosphere is composed of an innumerable collection of interacting
molecules. For instance, the number of “particles” composing the Earth’s atmosphere
is astronomic and is estimated to be of the order of O(10⁴⁵) molecules.¹ This
astronomical number of interacting molecules does not move randomly, but moves
coherently to some extent, giving rise to the atmospheric motion and weather
systems.
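Spelling out the footnote's estimate with round numbers (a back-of-envelope sketch; the total mass used here is only an order-of-magnitude assumption):

\[
N \;\approx\; \frac{m}{m_a}\, N_a \;\approx\; \frac{10^{22}\ \mathrm{g}}{29\ \mathrm{g\,mol^{-1}}} \times 6.023 \times 10^{23}\ \mathrm{mol^{-1}} \;\approx\; 2 \times 10^{44},
\]

i.e. an astronomically large number, of the order of 10⁴⁴–10⁴⁵ molecules.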
The climate system is the aggregation of daily weather. Put mathematically, the
climate, as opposed to weather, may be defined as the collection of all long-term
statistics of the state of the atmosphere. Heinlein (1973) quotes² “climate is what
we expect but weather is what we get.” Figure 1.1 shows a simple illustration of
the weather/climate system. Small marbles are dropped from the top; they flow
through the space between the pins and are collected in small containers. The
trajectory of each marble describes the daily weather, and the collective behaviour,
represented by the shape formed by the marbles in the containers, describes the
probability density function of the system.

1 The total mass m of the atmosphere is of the order of O(10^22) g. The total number of molecules, m/m_a × N_a, is of the order O(10^45). The constants N_a and m_a are respectively the Avogadro number 6.023 × 10^23 and the molar air mass, i.e. the mass of N_a molecules (29 g).
2 The quotation appears in the section “More from the Notebooks of Lazarus Long” of Robert A.

Heinlein’s novel. Some sources, however, attribute the quotation to the American writer/lecturer
Samuel L. Clemens known by pen name Mark Twain (1835–1910), although this seems perhaps
unlikely, since this concept of climate as average of weather was not readily available around 1900.


Fig. 1.1 Illustration of a highly simplified paradigm for the weather/climate system

represented by the shape of the marbles in the containers describes the probability
density function of the system.
The weather/climate system of our rotating Earth consists of the evolution of the
coupled Atmosphere–Land–Ocean–Ice system driven by radiative forcing from the
Sun as well as from the earth’s interior, e.g. volcanoes. The climate, as a complex
nonlinear dynamical system varies on a myriad of interacting space/time scales. It
is characterised by its high number of degrees of freedom (dof) and their complex
nonlinear interactions. It also displays nontrivial sensitivity to initial as well as
boundary conditions. Weather and climate at one location can be related to those at
another distant location, that is, weather and climate are not local but have a global
character. This is known as teleconnections in atmospheric science, and represents
links between climate anomalies occurring at one specific location and at large
distances. They can be seen as patterns connecting widely separated regions, such as the El-Niño Southern Oscillation (ENSO), the North Atlantic Oscillation (NAO)
and the Pacific North America (PNA) pattern. More discussions on teleconnections
are given in the beginning of Chap. 3.
In practice, various climate variables, such as sea level pressure, wind field and
ozone concentrations are measured at different time intervals and at various spatial
locations. In general, however, these measurements are sparse both in space and
time. Climate models are usually used, via data assimilation techniques, to produce
regular data in space and time, known as the “reanalyses”. The analysis of climate
data is not solely restricted to reanalyses but also includes other observed records,
e.g. balloon measurements, satellite irradiances, in situ recordings such as rain
gauges, ice cores for carbon dating, etc. Model simulations are also extensively used
for research purposes, e.g. for investigating/understanding physical mechanisms,

studying anthropogenic climate change and climate prediction, and also for climate
model validation, etc.
The recent explosive growth in the amount of observed and simulated atmo-
spheric data that are becoming available to the climate scientist has created an ever
increasing need for new mathematical data analysis tools that enable researchers
to extract the most out of these data in order to address key questions relating to
major concerns in climate dynamics. Understanding weather and climate involves
a genuine investigation of the nature of nonlinearities and causality in the system.
Predicting the climate is also another important reason that drives climate research.
Because of the high dimensionality involved in the weather/climate system, advanced
tools are required to analyse and hence understand various aspects of the system. A
major objective in climate research is the identification of major climate modes,
patterns, regimes, etc. This is precisely one of the most challenging problems in
data mining/feature extraction.

1.2 Data Exploration, Data Mining and Feature Extraction

In climate research and other scientific fields we are faced with large datasets,
typically multivariate time series with large dimensionality where the objective
is to identify or find out interesting or more prominent patterns of variability.
A basic step in the analysis of multivariate data is exploration (Tukey 1977).
Exploratory data analysis (EDA) provides tools for hypothesis formulation and
feature selection (Izenman 2008). For example, according to Seltman (2018) one
reads: “Loosely speaking, any method of looking at data that does not include formal
statistical modeling and inference falls under the term exploratory data analysis.”
For moderately large to large multivariate data, simple EDA, such as scatter plots and
boxplots, may not be possible, and “advanced” EDA is needed. This includes for
instance the identification (or extraction) of structures, patterns, trends, dimension
reduction, etc. For some authors (e.g. Izenman 2008) this may be categorised as
“descriptive data mining”, to distinguish it from “predictive data mining”, which is
based on model building including classification, regression and machine learning.
The following quotes show examples of excerpts of data mining from various
sources. For instance, according to the glossary of data mining3 we read:
• “data mining is an information extraction activity whose goal is to discover
hidden facts contained in data bases. Using a combination of machine learning,
statistical analysis, modelling techniques and database technology, data mining
finds patterns and subtle relationships in data and infers rules that allow the
prediction of future state”.

3 http://twocrows.com/data-mining/dm-glossary/.

The importance of patterns is reflected through their interpretation into knowl-


edge which can be used to help understanding, interpretation and decision-making.
Technically speaking, the term “data mining” has been stretched beyond its limits
to apply to any form of data analysis. Here are some more definitions of data
mining/knowledge discovery from various sources:4
• “Data mining, or knowledge discovery in databases (KDD) as it is also known,
is the nontrivial extraction of implicit, previously unknown, and potentially
useful information from data. This encompasses a number of different technical
approaches, such as clustering, data summarisation, learning classification
rules, . . . ” (Frawley et al. 1992).
• “Data mining is the search for relationships and global patterns that exist in
large databases but are ‘hidden’ among the vast amount of data, such as rela-
tionship between patient data and their medical diagnosis. These relationships
represent valuable knowledge about the database and the objects in the database
and, if the database is a faithful mirror, of the real world registered by the
database” (Holsheimer and Siebes 1994).
• Also, according to the Clementine 11.1 User’s Guide,5 (Chap. 4), one reads:
“Data mining refers to using a variety of techniques to identify nuggets of
information or decision-making knowledge in bodies of data, and extracting
these in such a way that they can be put to use in the areas such as decision
support, prediction, forecasting and estimation. The data is often voluminous,
but as it stands of low value as no direct use can be made of it; it is the hidden
information in the data that is useful”.
It is clear therefore that data mining is basically a process concerned with the use
of a variety of data analysis tools and software techniques to identify/discover/find
the main/hidden/latent patterns of variability and relationships in the data that may
enhance our knowledge of the system. It is based on computing power and aims at
explaining a large part of a large-dimensional dataset through dimension reduction. In
this context climate data constitute a fertile field for data analysis and data mining.
The literature in climate research clearly contains various classical techniques
that attempt to address some of the previous issues. Some of these tools
have been adopted from other fields and others have been developed within
a climate context. Examples of these techniques include principal component
analysis/empirical orthogonal function (PCA/EOF), Rotated EOFs (REOF),
complex EOFs (CEOFs), canonical correlation analysis (CCA), factor analysis
(FA), blind separation/deconvolution techniques, regression models, clustering
techniques, multidimensional scaling (MDS), . . . etc.

4 http://www.pcc.qub.ac.uk/tec/courses/datamining.
5 https://home.kku.ac.th/wichuda/DMining/ClementineUsersGuide_11.1.pdf.

1.3 Major Concern in Climate Data Analysis

1.3.1 Characteristics of High-Dimensional Space Geometry


Volume Paradox of Hyperspheres

A d-dimensional Euclidean space is a mathematical continuum that can be described


by d independent coordinates. In this space a point A is characterised by a set of d numbers, or coordinates (a_1, a_2, ..., a_d), and the distance between any two points A and B is given by $\sqrt{\sum_{k=1}^{d}(a_k - b_k)^2}$. An important concept in
this space is the definition of hyperspheres. A d-dimensional hypersphere of radius
r, centred at the origin, is the (continuum) set of all points X with coordinates
x_1, x_2, ..., x_d satisfying

$\sum_{k=1}^{d} x_k^2 = r^2$ .     (1.1)

Note that in the Euclidean three-dimensional space the hypersphere is simply the
usual sphere. The definition of hyperspheres allows another set of coordinates,
namely spherical coordinates, in which a point (x_1, x_2, ..., x_d) can be defined using an equivalent set of coordinates (r, θ_1, ..., θ_{d−1}), where r ≥ 0, $-\frac{\pi}{2} \le \theta_k \le \frac{\pi}{2}$ for 1 ≤ k ≤ d − 2, and 0 ≤ θ_{d−1} ≤ 2π. The relationship between the two sets of coordinates is given by:

$x_1 = r\,\cos\theta_1 \cos\theta_2 \ldots \cos\theta_{d-1}$
$x_2 = r\,\cos\theta_1 \cos\theta_2 \ldots \cos\theta_{d-2}\,\sin\theta_{d-1}$
$\vdots$
$x_k = r\,\cos\theta_1 \cos\theta_2 \ldots \cos\theta_{d-k}\,\sin\theta_{d-k+1}$     (1.2)
$\vdots$
$x_d = r\,\sin\theta_1$ .

It can be shown that the Jacobian J of this transformation6 satisfies $|J| = r^{d-1}\cos^{d-2}\theta_1 \ldots \cos\theta_{d-2}$ for d ≥ 3 and |J| = r for d = 2. The volume $V_d^\circ(r)$ of the hypersphere of radius r in d dimensions can be calculated using transformation (1.2):

$V_d^\circ(r) = \int_0^r d\rho \int_{-\pi/2}^{\pi/2} d\theta_1 \ldots \int_{-\pi/2}^{\pi/2} d\theta_{d-2} \int_0^{2\pi} \rho^{d-1}\cos^{d-2}\theta_1 \ldots \cos^2\theta_{d-3}\cos\theta_{d-2}\, d\theta_{d-1}$     (1.3)

6 That is $|(\partial x_k / \partial\theta_l)_{kl}|$, for k = 1, ..., d, and l = 0, ..., d − 1, where θ_0 = r.


and yields

$V_d^\circ(r) = C_\circ(d)\, r^d$ ,     (1.4)

where
$C_\circ(d) = \frac{\pi^{d/2}}{\Gamma(\frac{d}{2}+1)} = \frac{2\pi}{d}\, C_\circ(d-2)$ .     (1.5)

Table 1.1 shows the values of C◦ (d) for the first 8 values of d.
Comparing the volume $V_d(2r) = 2^d r^d$ of the hypercube of side 2r to that of the hypersphere $V_d^\circ(r)$, we see that both hypervolumes depend exponentially on the linear scale r but with totally different coefficients. For instance, the coefficient $C_\circ(d)$ in Eq. (1.5) is not monotonic and tends to zero rapidly for large dimensions. The volume of a hypersphere of any radius will decrease towards zero when increasing the dimension d from the moment $d > \pi r^2$. The decrease is sharper when the space dimension is even. The corresponding coefficient for the hypercube, $2^d$, on the other hand, increases monotonically with d.
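As a quick numerical check (a minimal sketch in Matlab, the language used for the code examples in this book; the range of dimensions is an arbitrary choice), the coefficient $C_\circ(d)$ of Eq. (1.5) can be evaluated with the built-in gamma function, reproducing the first eight values of Table 1.1 and illustrating the decay with d:
>> d = 1:20;
>> C = pi.^(d/2)./gamma(d/2+1);      % coefficient of Eq. (1.5)
>> C(1:8)                            % reproduces the values listed in Table 1.1
>> C(20)                             % already small: the hypersphere coefficient decays with d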
The previous result seems paradoxical, and one might ask what happens to the
content of the hyperspheres in high dimensions, and where does it go? To answer
this question, let us first look at the concentration of the hypercube content. Consider
the d-dimensional hypercube of side 2r and the inscribed hypersphere of radius r
(Fig. 1.2). The fraction of the residual volume Vd (2r) − Vd◦ (r) to the hypercube
volume goes to one, i.e.

$\frac{V_d(2r) - V_d^\circ(r)}{V_d(2r)} = 1 - \frac{\pi^{d/2}}{2^d\,\Gamma(\frac{d}{2}+1)} \longrightarrow 1$     (1.6)

as d increases to infinity. Hence with increasing space dimension most of the


hypercube volume concentrates in its 2^d corners whereas the centre becomes less
important or empty! This is one aspect of the empty space phenomenon in high-
dimensional spaces (Scott and Thompson 1983).
Consider now what happens to the hyperspheres. The fraction of the volume of a
spherical shell of thickness ε and radius r, Vd◦ (r) − Vd◦ (r − ε), to that of the whole
hypersphere is

Table 1.1 Values of C◦ as a function of the space dimension d


d 1 2 3 4 5 6 7 8
C◦ 2 3.1416 4.1888 4.9348 5.2638 5.1677 4.7248 4.0587

Fig. 1.2 Representation of the volume concentration in the hypercube corners in two dimensions

$\frac{V_d^\circ(r) - V_d^\circ(r-\varepsilon)}{V_d^\circ(r)} = 1 - \left(1 - \frac{\varepsilon}{r}\right)^d \longrightarrow 1 \quad \text{as } d \to \infty$ .     (1.7)

Hence the content of the hypersphere becomes more and more concentrated close
to its surface. A direct consequence is that data distributed uniformly in the hypersphere or hypercube are mostly concentrated near their edges. In other words, to
sample uniformly in a hypercube or hypersphere (with large dimensions) we need
extremely large sample sizes.
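A short Monte Carlo experiment illustrates this concentration (a sketch; the sample size n = 1e5 and the chosen dimensions are arbitrary): the fraction of points drawn uniformly in the hypercube [−1, 1]^d that fall inside the inscribed unit hypersphere collapses rapidly as d grows, so almost all points end up towards the corners.
>> n = 1e5;
>> for d = [2 5 10 20], X = 2*rand(n,d)-1; disp([d mean(sum(X.^2,2) <= 1)]); end   % dimension and fraction inside the hypersphere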
Exercise Using Eq. (1.1) compute the area Sd◦ (a) of the hypersphere of dimension
d and radius a.
Answer Use the fact that $V_d^\circ(a) = \int_0^a S_d^\circ(\rho)\, d\rho$ and, keeping in mind that $S_d^\circ(a) = a^{d-1} S_d^\circ(1)$, this yields $S_d^\circ(a) = \frac{d\,\pi^{d/2}\, a^{d-1}}{\Gamma(\frac{d}{2}+1)}$.

Two Further Paradoxical Examples

• Inscribed spheres (Hamming 1980)


We consider the square of side 4 divided into four squares of side 2 each. In each
square we consider the inscribed circle of radius one. These circles are tangent to
each other and also to the original square. Now we fit in the small space at the centre
of the original square a small circle tangent to the four inscribed circles (Fig. 1.3). The radius of the small circle is $r_2 = \sqrt{2} - 1 \approx 0.41$. Doing the same thing with the 4^3-cube and the corresponding eight unit spheres, the radius of the small inner

Fig. 1.3 Inscribed spheres in two dimensions


sphere is $r_3 = \sqrt{3} - 1$. Extending the same procedure to the d-dimensional 4^d-hypercube and its corresponding 2^d unit hyperspheres yields $r_d = \sqrt{d} - 1$. Hence for d ≥ 10, the small inner hypersphere reaches outside the hypercube. Note also, as pointed out earlier, that the volume of this inner hypersphere goes to zero as d
increases.
• Diagonals in hypercubes (Scott 1992)
Consider the d-dimensional hypercube [−1, 1]^d. Any diagonal vector v joining the origin to one of the corners is of the form (±1, ±1, ..., ±1)^T. Now the angle θ between v and any basis vector i, i.e. the i'th coordinate axis, is given by:

$\cos\theta = \frac{\mathbf{v}\cdot\mathbf{i}}{\|\mathbf{v}\|} = \frac{\pm 1}{\sqrt{d}}$ ,     (1.8)

which goes to zero as d goes to infinity. Hence the diagonals are nearly orthogonal
to the coordinate axes in high-dimensional spaces.7 An important consequence is
that any data that tend to cluster near the diagonals in hyperspaces will be mapped
onto the origin in every paired scatter plot. This points to the importance of the
choice of the coordinate system in multivariate data analysis.
• Consequence for the multivariate Gaussian distribution (Scott 1992; Carreira-
Perpiñán 2001)

7 Note that because of the square root, this is true for very large dimensions d, e.g. d ≥ 10^3.

Data scattered uniformly in high-dimensional spaces will always be concentrated


on the edge, and this is the emptiness paradox mentioned earlier, i.e. spherical
neighbourhood of uniformly distributed data on hypercubes is empty. Similar
behaviour also occurs with the multivariate normal distribution. The standard
multinormal probability density function is given by

$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}} \exp\left(-\frac{1}{2}\mathbf{x}^T\mathbf{x}\right)$ .     (1.9)

Equiprobable sets from (1.9) are hyperspheres and the origin is the mode of the
distribution. Now consider the set of points y within the hypersphere associated
with a pdf of εf (0), i.e. points satisfying f (y) ≥ εf (0) for small ε. The probability
of this set is
   
$\Pr\left(\|\mathbf{x}\|^2 \le -2\log\varepsilon\right) = \Pr\left(\chi_d^2 \le -2\log\varepsilon\right)$ .     (1.10)

For a given ε this probability falls off rapidly as d increases. This decrease becomes
sharp after d = 5. Hence the probability of points not in the tail decreases rapidly
as d increases. Consequently, unlike our prediction from low-dimensional intuition,
most of the mass of the multivariate normal in high-dimensional spaces is in the tail.
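This falloff can be verified numerically (a sketch using only the base-Matlab incomplete gamma function, which gives the χ² cumulative distribution as gammainc(x/2, d/2), so no toolbox is assumed; the threshold ε = 0.01 is an arbitrary choice):
>> eps0 = 0.01;                        % keep points with density above 1% of the mode
>> d = [1 2 5 10 20];
>> p = gammainc(-log(eps0), d/2);      % Pr(chi2_d <= -2 log eps0), cf. Eq. (1.10)
>> disp([d; p])                        % the probability drops rapidly with d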

1.3.2 Curse of Dimensionality and Empty Space Phenomena

Data is a valuable source of information which ultimately provides the way to


knowledge and wisdom. The concept of flow diagram from data to wisdom through
information and knowledge (DIKW) goes back to Zeleny (1987), and Ackoff (1989)
and has become known as the DIKW hierarchy. This hierarchy can have various
representations such as the one given by the understanding–context independence
diagram schematised in Fig. 1.4. The figure clearly reflects the importance of data,
which allow, through understanding relationships, access to information. This latter
allows for pattern understanding to get knowledge, and ultimately wisdom through
understanding principles.
However, because of the nature of the problem in climate, and also various
other fields, data analysis cannot escape the major challenges of the empty space
phenomenon, discussed above, and what has become known as the “curse of
dimensionality”. To grasp the idea in more depth, let us suppose that a sample size of 10 data points of a one-dimensional time series, e.g. 10 years of the yearly North Atlantic Oscillation (NAO) index, is considered enough to analyse and learn something about the one-dimensional process. Now, in the two-dimensional case, the minimum or satisfactory sample size would be 10^2.
So we see that if we are to analyse for example two variables: the yearly NAO index
and the Northern Hemisphere (NH) surface temperature average, one would require

Fig. 1.4 Data–Information–Knowledge–Wisdom (DIKW) hierarchy

of the order of a century of data record. In various climate analysis studies we use at least 4 to 5 dimensions, which translates into an astronomically long record, e.g. a million years.
The moral is that if n1 is the sample size required in one dimension, then in
d dimensions we require $n_d = n_1^d$, i.e. a power law in the linear dimension. The
curse of dimensionality, coined by Bellman (1961), refers to this phenomenon,
i.e. the sample size needed to have an accurate estimate of a function in a high-
dimensional space grows exponentially with the number of variables or the space
dimension. The curse of dimensionality also refers to the paradox of neighbourhood
in high-dimensional spaces, i.e. empty space phenomenon (Scott and Thompson
1983). Local neighbourhoods are almost empty whereas nonempty neighbourhood
are certainly nonlocal (Scott 1992). This is to say that high-dimensional spaces are
inherently empty or sparse.
Example For the uniform distribution in the unit 10-dimensional hypersphere the
probability of a point falling in the hypersphere of radius 0.9 is only 0.35 whereas
the remaining 0.65 probability is for points on the outer shell of thickness 0.1.
The above example shows that density estimation in high dimensions can be
problematic. This is because regions of relatively very low density can contain
considerable part of the distribution whereas regions of apparently high density
can be completely devoid of observations in samples of moderate sizes (Silverman
1986). For example, 70% of the mass in the standard normal distribution is within
one standard deviation of the mean whereas the same domain contains only 0.02%
1.3 Major Concern in Climate Data Analysis 11

of the mass, in 10 dimensions, and has to take a radius of more than three standard
deviations to reach 70%. Consequently, and contrary to our intuition, the tails are
much more important in high dimensions than in low dimensions. This has a serious
consequence, namely the difficulty in probability density function (pdf) estimation
in high-dimensional spaces (see e.g. Silverman 1986). Therefore since most density
estimation methods are based on local concepts, e.g. local averages (Silverman
1986), in order to find enough neighbours in high dimensions, the neighbourhood
has to extend far beyond local neighbours and hence locality is lost (Carreira-
Perpiñán 2001) since local neighbourhoods are mostly empty.
The above discussion finds natural application in the climate system. If, for
example, we are interested in studying a phenomenon, such as the El-Niño Southern
Oscillation (ENSO) using say observations from 40 grid points of monthly sea
surface temperature (SST) data in the Tropical Pacific region, then theoretically one
would necessarily need a sample size of O(10^40). This metaphoric observational
period is far beyond the age of the Universe.8 This is another facet of the emptiness
phenomenon related to the inherent sparsity of high-dimensional spaces. As has
been pointed out earlier, this has a direct consequence on the probability distribution
of high-dimensional data. For example, in the one-dimensional case the probability
density function of the uniform distribution over [−1, 1] is a box of height 2^{−1}, whereas in 10 dimensions the hyperbox height is only 2^{−10} ≈ 9.8 × 10^{−4}.

1.3.3 Dimension Reduction and Latent Variable Models

Given the complexity, e.g. high dimensionality, involved in the climate system one
major challenge in atmospheric data analysis is the reduction of the system size.
The basic objective behind dimension reduction is to enable data mapping onto a
lower-dimensional space where data analysis, feature extraction, visualisation, inter-
pretation, . . ., etc. become manageable. Figure 1.5 shows a schematic representation
of the target/objective of data analysis, namely knowledge or KDD.
In probabilistic terminology, the previous concepts have become known as latent
variable problems. Latent variable models are probabilistic models that attempt

Fig. 1.5 Knowledge gained via data reduction

8 In practice, of course, data are not totally independent, and the sample size required is far less
than the theoretical one.

to explain processes happening in high-dimensional spaces in terms of a small


number of variables or degrees of freedom (dof). This is based of course on the
assumption that the observed high-dimensional data are the result of underlying
lower-dimensional processes.9 The original concept of latent variable modelling
appeared in psychometrics with Spearman (1904b). The technique eventually
evolved into what is known as factor analysis. See, e.g. Bartholomew (1987) for
historical details on the subject. Dimensionality reduction problems are classified
into three classes (Carreira-Perpiñán 2001), namely:
• Hard dimensionality reduction problems where the dimension d of the data is of
the order 10^2–10^5.
• Soft dimensionality reduction problems in which d ≈ (2 − 9) × 10.
• Visualisation problems where the dimension d is small but reduction is required
for visual purposes. Examples include Asimov’s (1985) grand tour and Cher-
noff’s (1973) faces, etc.

1.3.4 Some Problems and Remedies in Dimension Reduction


Known Difficulties

Various problems exist in dimension reduction, some of which are in general


unavoidable due to the complex nature in the approaches used. The following list
gives some examples:
• Difficulty related to large dimensions—the larger the dimension the more
difficult the problem.
• Non-unicity—No single method exists to deal with the data. In general different
approaches (exploratory and probabilistic) can give different results. For exam-
ple, in probabilistic modelling the choice of the latent variable model is not
unique.
• Unknown intrinsic, latent or hidden dimension—There is indeed no effective
and definitive way to know in general the minimum number of dimensions to
represent the data.
• Nonlinear association/relationship between the variables—This is a difficult
problem since there is no systematic way to find out these associations.
• Nature of uncertainties underlying the data, and the loss of information resulting
from the reduction procedure.

9 This high dimensionality can arise from various causes, such as uncertainty related for example
to nonlinearity and stochasticity, e.g. measurement errors.

Some Remedies

Some of the previous problems can be tackled using some or a combination of the
following approaches:
• Occam’s razor or model simplicity.
• Parsimony principle.
• Arbitrary choices of e.g. the latent (hidden) dimensions.
• Prior physical knowledge of the process under investigation. For example,
when studying tropical climate one can make use of the established ENSO
phenomenon linking the tropical Pacific SST to the see-saw in the large-scale
pressure system.

1.4 Examples of the Most Familiar Techniques

Various techniques have been used/adapted/developed in climate analysis studies to


find and identify major patterns of variability. It is fair to say that most of these
techniques/tools are basically linear. These and various other techniques will be
presented in more detail in the next chapters. Below we mention some of the most
widely used methods in atmospheric science.
• Empirical Orthogonal Functions (EOFs).
EOFs are the most widely used methods in atmospheric science. EOF analysis is
also known as principal component analysis (PCA), and is based on performing an
eigenanalysis of the sample covariance matrix of the data.
• Rotated EOFs (REOFs).
EOFs are constrained to be orthogonal and as such problems related to physical
interpretability may arise. REOF method is one technique that helps find more
localised structures by rotating the EOFs in such a manner to get patterns with
simple structures.
• Complex EOFs (CEOFs).
The CEOF method is similar to that of EOFs and aims at detecting propagating
structures from the available data.
• Principal Oscillation Patterns (POPs).
As for CEOFs, POP method aims also at finding propagating patterns without
recourse to using the complex space.

• Canonical Correlation Analysis (CCA).


Unlike the previous techniques where only one single field e.g. sea level pressure
(SLP) is used, in CCA the objective is to find the most (linearly) correlated structures
between two fields,10 e.g. SLP and SST. It is used to identify (linearly) “coupled
modes”.

10 Although it is also possible to perform an EOF analysis of more than one field, for example SST
and surface air temperature combined (Kutzbach 1967). This has been labelled combined principal
component analysis by Bretherton et al. (1992).
Chapter 2
General Setting and Basic Terminology

Abstract This chapter introduces some basic terminologies that are used in
subsequent chapters. It also presents some basic summary statistics of data sets and
reviews basic methods of data filtering and smoothing.

Keywords Data processing · Smoothing · Scaling and sphering · Filtering ·


Stationarity · Spectra · Singular value decomposition

2.1 Introduction

By its nature, climate data analysis is a large multivariate (high-dimensional)


problem par excellence. When atmospheric data started to accumulate since the
beginning of the twentieth century, the first tools that atmospheric scientists tried
were basically exploratory, and they included simple one-dimensional time series
plots, then two-dimensional scatter plots and later contour plots. Fisher (1925)
quotes, “The preliminary examination of most data is facilitated by the use of
diagrams. Diagrams prove nothing, but bring outstanding features readily to the eye;
they are therefore no substitute for such critical tests as may be applied to data, but
are valuable in suggesting such tests, and in explaining conclusions founded upon
them”. This same feeling is also shared by other scientists. Hunter (1988) also
quotes “The most effective statistical technique for analysing environmental data
are graphical methods. They are useful in the initial stage for checking the quality
of the data, highlighting interesting features of the data, and generally suggesting
what statistical analyses should be done. Interesting enough, graphical methods
are useful again after intermediate quantitative analyses have been completed and
again in the final stage for providing complete and readily understood summaries of
the main findings of investigations”. Finally we quote Tukey’s (1977) declaration,
also quoted in Spence and Garrison (1993) and Berthouex and Brown (1994): “The
greatest value of a picture is when it forces us to notice what we never expected
to see”.


It was not until the early part of the twentieth century that correlation started to
be used in meteorology by Gilbert Walker1 (Walker 1909, 1923, 1924; Walker and
Bliss 1932). It is fair to say that most of the multivariate climate data analyses are
based mainly on the analysis of the covariance between the observed variables of the
system. The concept of covariance in atmospheric science has become so important
that it is routinely used in climate analysis. Data, however, have to be processed
before getting to this slightly advanced stage. Some of the common processing
techniques are listed below.

2.2 Simple Visualisation Techniques

In their basic form multivariate data are normally composed of many unidimen-
sional time series. A time series is a sequence of values x1 , x2 . . . xn , in which each
datum represents a specific value of the variable x. In probabilistic terms x would
represent a random variable, and the value xi is the ith realisation of x in some
experimental set-up. In everyday language xt represents the observation at time t of
the variable x.
In order to get a basic idea about the data, one has to be able to see at least some
of their aspects. Plotting some aspects of climate data is therefore a recommended
first step in the analysis. Certainly this cannot be done for the entire sample; for
example, simple plots for certain “key” variables could be very useful. Trends, for
instance, are examples of aspects that are best inspected visually before quantifying
them.
Various plotting techniques exist for the purpose of visualisation. Examples
include:
• Simple time series plots—this constitutes perhaps the primary step to data
exploration.
• Single/multiple scatter plots between pairs of variables—these simple plots
provide information on the relationships between various pairs of variables.
• Histogram plots—they are a very useful first step towards exploring the distribu-
tions of individual variables (see e.g. Silverman 1986).

1 The modern concept of correlation can be traced as far back as the late nineteenth century with
Galton (1885), see e.g. Stigler (1986). The use of the concept of correlation is actually older than
Galton’s (1885) paper and goes back to 1823 with the German mathematician Carl Friedrich Gauss
who developed the normal surface of N correlated variates. The term “correlation” appears to
have been first quoted by Auguste Bravais, a French naval officer and astronomer who worked on
bivariate normal distributions. The concept was also used later in 1868 by Charles Darwin, Galton’s
cousin, and towards the end of the nineteenth century, Pearson (1895) defined the (Galton-)
Pearson’s product-moment correlation coefficient. See Rodgers and Nicewander (1988) for some
details and Pearson (1920) for an account on the history of correlation. Rodgers and Nicewander
(1988) list thirteen ways to interpret the correlation coefficients.

• Contour plots of variables in two dimensions—contour plots are also very useful
in analysing, for example, individual maps or exploring smoothed histograms
between two variables.
• Box plots—these are very useful visualisation tools (Tukey 1977) used to display
and compare the distributions of an observed sample for a number of variables.
Other useful methods, such as sunflower scatter plots (Cleveland and McGill
1984), Chernoff faces (Chernoff 1973), brushing (Venables and Ripley 1994;
Cleveland 1993) and colour histograms (Wegman 1990) are also often used in high-
dimensional data exploration and visualisation. A list of these and other methods
with a brief description and further references can be found in Martinez and
Martinez (2002), see also Everitt (1978).

2.3 Data Processing and Smoothing

2.3.1 Preliminary Checking

Climate data are the direct result of experimentation, which translate via our senses
into observations or (in situ) measurements and represent therefore information. By
its very nature, data are always subject to uncertainties and are hence deeply rooted
in probabilistic concepts. It is in general recommended to process the data before indulging in advanced analysis techniques. The following list provides familiar
examples of processing that are normally applied at each grid point.
• Checking for missing/outlier values—these constitute simple processing tech-
niques that are routinely applied to data. For example, unexpectedly large values
or outliers can either be removed or replaced. Missing values are normally
interpolated using observations from the neighbourhood.
• Detrending—if the data indicate evidence of a trend, linear or polynomial, then it
is in general recommended to detrend the data, by calculating the trend and then
removing it.
• Deseasonalising—seasonality constitutes one of the major sources of variability
in climate data and is ubiquitous in most climate time series. For example, with
monthly data a smooth seasonal cycle can be estimated by fitting a sine wave.
Alternatively, the seasonal cycle can be obtained by the collection of the 12
monthly averages.
A more advanced way is to apply Fourier analysis and consider the few leading sine waves as the seasonal cycle. The deseasonalised data are then obtained by
subtracting the seasonal cycle from the original (and perhaps detrended) data. If
the seasonal component is thought to change over time, then techniques based on
X11, for example, can be used. This technique is based on a local fit of a seasonal
component using a simple moving average. Pezzulli et al. (2005) have investigated
the spectral properties of the X11 filter and applied it to sea surface temperature.

The method uses a Henderson filter for the moving average and provides a more
flexible alternative to the standard (constant) seasonality.

2.3.2 Smoothing

Smoothing is the operation that allows removing “irregular” or more precisely,


spiky features and sudden changes from a time series, which otherwise will hamper
any possibility of recognising and identifying special features. In the process of
smoothing the data are implicitly assumed to be composed2 of a smooth component
plus an irregular component. The smoothed data are generally easier to interpret.
Below is a list of the most widely used smoothing techniques applied in climate
research.

Moving Average

It is a simple local average using a sliding window. If we designate by xt , t =


1, 2, . . . n, the sample of the time series and M the length of the window, then the
smoothed time series is given by

$y_k = \frac{1}{M}\sum_{i=k}^{k+M-1} x_i$ ,     (2.1)

for k = 1, 2, . . . n − M + 1. Note that to estimate the seasonal cycle, a non-


overlapping 30-day3 moving average is normally applied to the data. Note also that
instead of simple average, one can have a weighted moving average.
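As an illustration (a minimal sketch; the synthetic monthly series and the window length M = 12 are arbitrary choices), the running mean of Eq. (2.1) can be obtained with the built-in conv function:
>> M = 12;
>> x = sin(2*pi*(1:240)'/12) + randn(240,1);   % synthetic monthly series: seasonal cycle plus noise
>> y = conv(x, ones(M,1)/M, 'valid');          % y(k) = mean of x(k:k+M-1), Eq. (2.1)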

Exponential Smoothing

Unlike the simple moving average where the weights are uniform, the exponential
smoothing uses an exponentially decaying weighting function of the past observa-
tions as


$y_k = (1-\phi)\sum_{i=0}^{\infty} \phi^i x_{k-i}$ ,     (2.2)

2 Similar to fitting a probability model where the data is decomposed as data = fit + residuals.
3 Depending on the calendar month; 28, 29, 30 or 31.

for an infinite time series. The smoothing parameter φ satisfies |φ| < 1, and the closer |φ| is to one, the smoother the obtained curve. The coefficient (1 − φ) in Eq. (2.2)
is introduced to make the weights sum to one. In practice for finite time series, the
previous equation is truncated to yield

$y_k = \frac{1-\phi}{1-\phi^{m+1}}\sum_{i=0}^{m} \phi^i x_{k-i}$ ,     (2.3)

where m depends on k.
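A sketch of the exponential smoother, with φ = 0.7 as an arbitrary choice; filter implements the recursive form y_k = (1 − φ)x_k + φ y_{k−1} of Eq. (2.2) (with the series taken as zero before the first observation), and the last line applies the renormalisation of Eq. (2.3) with m = k − 1:
>> phi = 0.7;  x = randn(200,1);
>> y = filter(1-phi, [1 -phi], x);       % y(k) = (1-phi)*x(k) + phi*y(k-1)
>> k = (1:200)';
>> y = y./(1 - phi.^k);                  % weights now sum to one, as in Eq. (2.3)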

Spline Smoothing

Unlike moving averages or exponential smoothing, which are locally linear, splines
(Appendix A) provide a nonlinear smoothing using polynomial fitting. The most
popular spline is the cubic spline based on fitting a twice continuously differentiable
piece-wise cubic polynomial. If xk = x(tk ) is the observed time series at time tk ,
k = 1, . . . n, then the cubic spline f () is defined by
(i) $f(t) = f_k(t) = a_k + b_k t + c_k t^2 + d_k t^3$ for t in the interval $[t_k, t_{k+1}]$, k = 1, ..., n − 1.
(ii) at each point $t_k$, f() and its first two derivatives are continuous, i.e. $\frac{d^\alpha}{dt^\alpha} f_k(t_k) = \frac{d^\alpha}{dt^\alpha} f_{k-1}(t_k)$, α = 0, 1, 2.

Remark The cubic spline (Appendix A) can also be obtained from an optimisation
problem.
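For evaluation, the base-Matlab functions spline and ppval construct and evaluate such a piecewise cubic polynomial (a sketch; this is the interpolating cubic spline, whereas the smoothing spline of the optimisation formulation mentioned in the Remark would require, e.g., a curve-fitting toolbox):
>> t = 1:10;  x = randn(1,10);          % sample values at times t_k
>> pp = spline(t, x);                   % piecewise cubic with continuous first and second derivatives
>> ti = 1:0.1:10;  xi = ppval(pp, ti);  % evaluate the spline on a finer grid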

Kernel Smoothing

The kernel smoothing is a global weighted average and is often used to estimate
pdfs, see Appendix A for more advanced smoothing methods. The weights are
obtained as the value of a specific kernel function, e.g. exponential or Gaussian,
applied to the target point. Designate again by xk = x(tk ), k = 1, . . . n the finite
sample time series, and the smoothed time series is given by


$y_l = \sum_{i=1}^{n} \kappa_{li}\, x_i$ ,     (2.4)

where $\kappa_{li} = K\left(\frac{t_i - t_l}{h}\right)$, and K() is the smoothing kernel. The most widely used kernel is the Gaussian function:

$K(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-x^2/2\right)$ .


The parameter h in κli is a smoothing parameter and plays a role equivalent to that
of a window width.
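A sketch of the Gaussian kernel smoother (2.4), assuming a recent Matlab version with implicit array expansion; the bandwidth h = 5 time steps is an arbitrary choice, and the weights are here additionally normalised to sum to one at each target time, a common practical variant of Eq. (2.4):
>> n = 200;  t = (1:n)';  x = randn(n,1) + 0.02*t;   % noisy series with a weak trend
>> h = 5;                                            % smoothing parameter (bandwidth)
>> K = exp(-(t - t').^2/(2*h^2))/sqrt(2*pi);         % kappa(l,i) = K((t_i - t_l)/h)
>> y = (K*x)./sum(K,2);                              % weighted average with normalised weights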
The list provided here is not exhaustive, and other methods exist, see e.g.
Chatfield (1996) or Tukey (1977).
Once the gridded data have been processed, advanced techniques can be applied
depending on the specific objective of the analysis. In general, the processed data
have to be written as an array to facilitate computational procedures, and this is
presented in the next section. But before that, let us define first a few characteristics
of time series.

2.3.3 Simple Descriptive Statistics

Given a time series x_1, x_2, ..., x_n, the sample mean $\bar{x}$ is given by

$\bar{x} = \frac{1}{n}\sum_{k=1}^{n} x_k$     (2.5)

and the sample variance sx2 by

$s_x^2 = \frac{1}{n-1}\sum_{k=1}^{n} (x_k - \bar{x})^2$ .     (2.6)

See Appendix B for the properties of these estimators. The sample standard deviation of the time series is s_x. The time series is termed centred when the mean
is removed. When the time series is scaled by its standard deviation, it is termed
standardised and consequently has unit variance. Sometimes when the time series is
centred and standardised, it is termed scaled. Often the time series is supposed to be
a realisation of some random variable X with cumulative distribution function (cdf)
FX () with finite mean μX and finite variance σX2 (Appendices B and C). In this case
the sample mean and variance x and sx2 are regarded as estimates of the (population)
mean and variance μX and σX2 , respectively. Now let the time series be sorted into
x(1) ≤ x(2) ≤ . . . ≤ x(n) , and then the following function

$\hat{F}_X(u) = \begin{cases} 0 & \text{if } u < x_{(1)} \\ k/n & \text{if } x_{(k)} \le u < x_{(k+1)} \\ 1 & \text{if } u \ge x_{(n)} \end{cases}$     (2.7)

provides an estimator of the cdf FX () and is referred to as the empirical distribution


function (edf). Note that the edf can be smoothed to yield a smooth approximation
of FX ().

Now let y1 , y2 , . . . , yn be another time series supposed to be also a realisation of


another random variable Y with mean μY and variance σY2 . The sample covariance
cxy between the two time series is given by

$c_{xy} = \frac{1}{n-1}\sum_{k=1}^{n} (x_k - \bar{x})(y_k - \bar{y})$ .     (2.8)

Similarly, the sample correlation coefficient rxy between the two time series is the
covariance between the corresponding scaled time series, i.e.
$r_{xy} = \frac{c_{xy}}{s_x s_y}$ .     (2.9)

Note that the correlation always satisfies −1 ≤ rxy ≤ 1.


Now if both time series are sorted, then the rank correlation, ρxy , also known
as Spearman’s rank correlation coefficient (Kendall and Stuart 1961, p. 476), is
obtained as the ordinary (or product moment) correlation between the ranks of the
corresponding time series instead of the actual values. This rank correlation can
also be computed using the differences dt , t = 1, . . . n, between the ranks of the
two sample time series and yields

$\rho_r = 1 - \frac{6}{n(n^2-1)}\sum_{t=1}^{n} d_t^2$ .     (2.10)

It can be seen that the transform of the sample x_1, x_2, ..., x_n using the empirical distribution function (edf) F̂_X() in (2.7) is precisely p_1/n, p_2/n, ..., p_n/n, where
p1 , p2 , . . . , pn are the ranks of the time series and similarly for the second time
series y1 , y2 . . . , yn . Therefore, the rank correlation is an estimator of the correlation
corr (FX (X), FY (Y )) between the transformed uniform random variables FX (X)
and FY (Y ).
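These quantities are readily computed in Matlab (a minimal sketch on synthetic data; only base functions are used, with the ranks obtained by sorting and assuming no ties):
>> n = 100;  x = randn(n,1);  y = 0.5*x + randn(n,1);
>> xbar = mean(x);  sx2 = var(x);                       % Eqs. (2.5) and (2.6)
>> cxy = cov(x,y);  rxy = corrcoef(x,y);                % Eqs. (2.8) and (2.9), returned as 2x2 matrices
>> [~, ix] = sort(x);  px(ix,1) = (1:n)';               % ranks of x
>> [~, iy] = sort(y);  py(iy,1) = (1:n)';               % ranks of y
>> rho = 1 - 6*sum((px-py).^2)/(n*(n^2-1));             % Spearman rank correlation, Eq. (2.10)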

2.4 Data Set-Up

Most analysis methods in climate are described in a matrix form, which is the
essence of multivariate analyses. A given spatio-temporal field, e.g. sea level pres-
sure, is composed of a multivariate time series, where each time series represents the
values of the field X at a given spatial location, e.g. grid point s, taken at different
times4 t noted by X(s, t). The spatial locations are often represented by grid points
that are regularly spaced on the spherical earth at a given vertical level. For example,

4 It could be daily, monthly, etc.



a continuous time series at the jth grid point sj can be noted xj (t), where t spans a
given period. The resulting continuous field represents then a multivariate, or vector,
time series:

$\mathbf{x}(t) = \left(x_1(t), \ldots, x_p(t)\right)^T$ .

When the observations are sampled at discrete times, e.g. t = t1 , t2 , . . . tn , one gets
a finite sample xk , k = 1, . . . n, of our field, where xk = x(tk ). In our set-up the
j’th grid point sj represents the j’th variable. Now if we assume that we have p such
variables, then the sampled field X can be represented by an array X = (xij ), or
data matrix, as
$\mathbf{X} = \left(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\right)^T = \begin{pmatrix} x_{11} & x_{12} & \ldots & x_{1p} \\ x_{21} & x_{22} & \ldots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \ldots & x_{np} \end{pmatrix}$ .     (2.11)

In (2.11) n is the number of observations or sample size and p is the number of


variables. The j’th column (x1j , . . . xnj )T is the time series at the grid point location
sj , whereas the ith row (xi1 , . . . , xip ) is the observed field xTi at time t = ti , which
is also known as a map at t = ti . One can also write (2.11) alternatively as
 
$\mathbf{X} = \left[\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p\right]$ .     (2.12)

Suppose now that we have another observed field Y = (yij ), observed at the same
times as X but perhaps at different grid points s∗k , k = 1, . . . q, think, for example,
of sea surface temperature. Then one can form the combined field obtained by
combining both data matrices Z = [X, Y] as
$\mathbf{Z} = (z_{ij}) = \begin{pmatrix} x_{11} & \ldots & x_{1p} & y_{11} & \ldots & y_{1q} \\ x_{21} & \ldots & x_{2p} & y_{21} & \ldots & y_{2q} \\ \vdots & & \vdots & \vdots & & \vdots \\ x_{n1} & \ldots & x_{np} & y_{n1} & \ldots & y_{nq} \end{pmatrix}$ .     (2.13)

This combination is useful when, for example, one is looking for combined patterns
such as empirical orthogonal functions (Jolliffe 2002; Hannachi et al. 2007). The
ordering or set-up of the data matrix shown in (2.11) or (2.13) where the temporal
component is treated as observation and the space component as variable is usually
referred to as S-mode convention. In the alternative convention, namely the T-mode,
the previous roles are swapped (see e.g. Jolliffe 2002).

2.5 Basic Notation/Terminology

In the general framework of multivariate analyses, each observation xj k = xk (tj )


is considered as a realisation of a random variable xk and therefore the observed
vector xk = (xk1 , . . . xkp )T , for k = 1, . . . n as a realisation of the multivariate
random variable x. We denote hereafter by p(x) the probability density function of
the variable x (Appendix B). Various operations are often required prior to applying
advanced mathematical methods to find patterns from our high-dimensional sam-
pled field. Some of the preprocessing steps have been presented in the previous
section for the unidimensional case.

2.5.1 Centring

Since the observed field is a realisation of some multivariate random variable, one
can speak of the mean μ of the field, also known as climatology, which is the
expectation of x, i.e.

$\boldsymbol{\mu} = E(\mathbf{x}) = \int \mathbf{x}\, p(\mathbf{x})\, dx_1 \ldots dx_p$ .     (2.14)

The mean is estimated using the observed sample by


$\bar{\mathbf{x}} = \left(\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p\right)^T$ ,     (2.15)

where $\bar{x}_k$ is the time mean of the observed k'th time series.


The centring operation consists of transforming the data matrix X to have zero
mean and yields the centred matrix Xc . The centred field is normally referred to
as the anomaly field with respect to the time mean. This is to differentiate it from
anomalies with respect to other components such as mean annual cycle. The centred
matrix is then
 
$\mathbf{X}_c = \mathbf{X} - \mathbf{1}_n\bar{\mathbf{x}}^T = \left(\mathbf{I}_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^T\right)\mathbf{X}$ ,     (2.16)

where 1n = (1, 1, . . . , 1)T is the column vector of length n containing only ones
and In is the n × n identity matrix. The Matlab command to compute the mean of X
and get the anomalies is
>> [n p] = size(X);
>> Xbar = mean(X,1);
>> Xc = X-ones(n,1)*Xbar;

2.5.2 Covariance Matrix

The covariance (or variance–covariance) matrix is the second-order centred moment


of the multivariate random variable x and is given by

$\boldsymbol{\Sigma} = \mathrm{cov}(\mathbf{x}, \mathbf{x}) = \mathrm{var}(\mathbf{x}) = E\left[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T\right] = \int (\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T p(\mathbf{x})\, d\mathbf{x}$ .     (2.17)

The (i, j)'th element of $\boldsymbol{\Sigma}$ is simply the covariance γ(x_i, x_j) between the i'th and the j'th variables, i.e. the time series at the i'th and the j'th grid points, respectively. Note that the diagonal elements of $\boldsymbol{\Sigma}$ are simply the variances σ_1², σ_2², ..., σ_p² of the individual
random variables. The sample estimate of the covariance matrix is given by5

$\mathbf{S} = \frac{1}{n-1}\sum_{k=1}^{n} (\mathbf{x}_k - \bar{\mathbf{x}})(\mathbf{x}_k - \bar{\mathbf{x}})^T = \frac{1}{n-1}\mathbf{X}_c^T\mathbf{X}_c$ .     (2.18)

The Matlab command is


>> S = cov(X);
As for $\boldsymbol{\Sigma}$, the diagonal elements s_1, ..., s_p of S are also the sample variances of the
individual time series x1 (t), x2 (t), . . . xp (t).
Remark The sample covariance matrix is sometimes referred to as the dispersion
matrix, although, in general, the latter is taken to mean the matrix of non-centred second-order moments $\frac{1}{n-1}\mathbf{X}^T\mathbf{X}$.
The correlation matrix is the covariance matrix of the standardised (or scaled) random variables; here, each random variable x_k is standardised by its standard deviation σ_k to yield unit variance. Hence the (i,j)'th element ρ_ij of the correlation matrix is the
correlation between the i’th and the j’th time series:

ρij = ρ(xi , xj ). (2.19)

If we designate

$\mathbf{D} = \mathrm{Diag}(\boldsymbol{\Sigma}) = \begin{pmatrix} \sigma_{11}^2 & 0 & \ldots & 0 \\ 0 & \sigma_{22}^2 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \sigma_{pp}^2 \end{pmatrix}$ ,     (2.20)

5 The coefficient $\frac{1}{n-1}$ used in (2.18) instead of $\frac{1}{n}$ is to make the estimate unbiased, but the difference
in practice is in general insignificant.

then we get

$\left(\rho_{ij}\right) = \mathbf{D}^{-\frac{1}{2}}\, \boldsymbol{\Sigma}\, \mathbf{D}^{-\frac{1}{2}}$ .     (2.21)

The sample correlation matrix is also obtained in a similar way by standardising


each variable.

2.5.3 Scaling

This operation consists in dividing the variables x1 , x2 , . . . xp by their respec-


tive standard deviations σ_1, σ_2, ..., σ_p. Using the matrix $\mathbf{D} = \mathrm{Diag}(\boldsymbol{\Sigma}) = \mathrm{Diag}\left(\sigma_{11}^2, \sigma_{22}^2, \ldots, \sigma_{pp}^2\right)$, the scaled data matrix takes the form:

$\mathbf{X}_s = \mathbf{X}\,\mathbf{D}^{-\frac{1}{2}}$ ,     (2.22)

so each variable in Xs is unit variance, but the correlation structure among the
variables has not changed. Note that the centred and scaled data matrix is
$\mathbf{X}_{cs} = \mathbf{X}_c\,\mathbf{D}^{-\frac{1}{2}}$ .     (2.23)

Note also that the correlation matrix of X is the covariance matrix of the scaled data
matrix Xs .

2.5.4 Sphering

It is an affine transformation by which the covariance matrix of the sample data


becomes the identity matrix. Sphering destroys, therefore, all the first- and second-
order information of the sample. For our data matrix (2.11), the sphered data matrix
X◦ takes the form:
$\mathbf{X}_\circ = \boldsymbol{\Sigma}^{-\frac{1}{2}}\,\mathbf{X}_c = \boldsymbol{\Sigma}^{-\frac{1}{2}}\left(\mathbf{X} - \mathbf{1}_n\bar{\mathbf{x}}^T\right)$ .     (2.24)

In (2.24) $\boldsymbol{\Sigma}^{-\frac{1}{2}}$ represents the inverse of the square root6 of $\boldsymbol{\Sigma}$. The covariance matrix of $\mathbf{X}_\circ$ is the identity matrix, i.e. $\frac{1}{n}\mathbf{X}_\circ^T\mathbf{X}_\circ = \mathbf{I}_p$. Because sphering destroys the first

6 The square root of a symmetric matrix $\boldsymbol{\Sigma}$ is a matrix R such that $\mathbf{R}\mathbf{R}^T = \boldsymbol{\Sigma}$. The square root of $\boldsymbol{\Sigma}$, however, is not unique since for any orthogonal matrix Q, i.e. $\mathbf{Q}\mathbf{Q}^T = \mathbf{Q}^T\mathbf{Q} = \mathbf{I}$, the matrix RQ is also a square root. The standard square root is obtained via a congruential relationship with respect to orthogonality and is obtained using the singular value decomposition theorem.

two moments of the data, it can be useful when the covariance structure in the data is
not desired, e.g. when we are interested in higher order moments such as skewness.
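A sketch of the scaling and sphering operations on a data matrix X (n × p, assuming n > p so that the sample covariance, used here in place of Σ, is invertible); sqrtm gives the symmetric matrix square root, and in this n × p convention the transformation is applied row-wise:
>> [n, p] = size(X);
>> Xc = X - ones(n,1)*mean(X,1);             % centred (anomaly) data matrix, Eq. (2.16)
>> S = cov(X);                               % sample covariance matrix, Eq. (2.18)
>> Xs = X*diag(1./sqrt(diag(S)));            % scaled data, Eq. (2.22)
>> Xo = Xc/sqrtm(S);                         % sphered data, cf. Eq. (2.24): Xc*S^(-1/2)
>> cov(Xo)                                   % identity matrix up to round-off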

2.5.5 Singular Value Decomposition

The singular value decomposition (see also Appendix D) is a powerful tool that
decomposes any n × p matrix X into the product of two orthogonal matrices and a
diagonal matrix as

$\mathbf{X} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}^T$ ,

which can be written alternatively as


$\mathbf{X} = \sum_{k=1}^{r} \lambda_k\, \mathbf{u}_k \mathbf{v}_k^T$ ,     (2.25)

where uk and vk , k = 1, . . . r, are, respectively, the left and right singular vectors of
X and r is the rank of X.
The SVD theorem can also be extended to the complex case. If X is a n × p
complex matrix, we have a similar decomposition to (2.25), i.e. $\mathbf{X} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}^{*T}$, where now U and V satisfy $\mathbf{U}^{*T}\mathbf{U} = \mathbf{V}^{*T}\mathbf{V} = \mathbf{I}_r$ and the superscript (∗) denotes the
complex conjugate.
Application
If $\mathbf{X} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}^T$ is the SVD decomposition of the n × p (centred) data matrix X, then $\sum_k \mathbf{u}_k\mathbf{u}_k^T = \mathbf{I}_n$ and the covariance matrix (2.18) is $\mathbf{S} = \frac{1}{n-1}\sum_k \lambda_k^2\, \mathbf{v}_k\mathbf{v}_k^T$. The Matlab routine is called svd, which provides all the singular values and associated singular
vectors.
>> [u s v] = svd (X);
The routine svds is more economical and provides a preselected number of singular values (see Chap. 3).
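A quick numerical check of the link between the singular values of the centred data matrix and the sample covariance matrix (2.18) (a sketch; svds is shown for the economy computation of a few leading singular triplets, here arbitrarily five, assuming the field has at least five variables):
>> Xc = X - ones(size(X,1),1)*mean(X,1);     % centred data matrix
>> [U, L, V] = svd(Xc, 'econ');
>> S1 = V*(L.^2)*V'/(size(X,1)-1);           % reconstructs cov(X), cf. Eq. (2.18)
>> norm(S1 - cov(X))                         % close to zero
>> [u5, l5, v5] = svds(Xc, 5);               % only the five leading singular values/vectors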

2.6 Stationary Time Series, Filtering and Spectra

2.6.1 Univariate Case

Let us consider a continuous stationary time series (or signal) x(t) with autocovari-
ance function γx () and spectral density function fx () (see Appendix C). A linear
filter L is a linear operator transforming x(t) into a filtered time series y(t) = Lx(t).
This linear filter can be written formally as a convolution, i.e.


$y(t) = L x(t) = \int h(u)\, x(t-u)\, du$ ,     (2.26)

where the function h() is known as the transfer function of the filter or its impulse
response function. The reason for this terminology is that if x(t) is an impulse,
i.e. a Dirac delta function, then the response is simply h(t). From (2.26), the
autocovariance function γy () of the filtered time series is
 
$\gamma_y(\tau) = \int\!\!\int h(u)\, h(v)\, \gamma_x(\tau + u - v)\, du\, dv$ .     (2.27)

Taking the Fourier transform of (2.27), the spectral density function of the response
y(t) is

$f_y(\omega) = f_x(\omega)\, |\Gamma(\omega)|^2$ ,     (2.28)

where

$\Gamma(\omega) = \int h(u)\, e^{-iu\omega}\, du = |\Gamma(\omega)|\, e^{i\phi(\omega)}$     (2.29)

is the Fourier transform of the transfer function and is known as the frequency response function. Its amplitude |Γ(ω)| is the gain of the filter, and φ(ω) is its phase.
In the discrete case the transfer function is simply a linear combination of Dirac
pulses as

$h(u) = \sum_k a_k\, \delta_k$     (2.30)

giving as output

$y_t = \sum_k a_k\, x_{t-k}$ .     (2.31)

The frequency response function is then the discrete Fourier transform of h() and is
given by

$\Gamma(\omega) = \sum_k a_k\, e^{-i\omega k}$ .     (2.32)

Exercise
1. Derive the frequency response function of the moving average filter (2.1).
2. Derive the same function for the exponential smoothing filter (2.2).
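As a numerical complement to the exercise (a minimal sketch; the window length M = 12 is an arbitrary choice), the gain of the moving average filter (2.1) can be evaluated directly from its weights with base Matlab, showing its low-pass character:
>> M = 12;  a = ones(M,1)/M;                  % moving-average weights of Eq. (2.1)
>> omega = linspace(0, pi, 200);
>> G = abs(a'*exp(-1i*(0:M-1)'*omega));       % |sum_k a_k exp(-i*omega*k)|, the gain
>> plot(omega, G)                             % gain 1 at omega = 0, small at high frequencies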

Using (2.26) or (2.31), one can compute the cross-covariance function:



$\gamma_{xy}(\tau) = \sum_u h(u)\, \gamma_x(\tau - u)$ .     (2.33)

The cross-covariance function satisfies

γxy (τ ) = γyx (−τ ). (2.34)

Note that the cross-covariance function is not symmetric in general. The Fourier
transform of the cross-covariance function, i.e. the cross-spectrum $f_{xy}(\omega) = \frac{1}{2\pi}\sum_k \gamma_{xy}(k)\, e^{-i\omega k}$, is given by

$f_{xy}(\omega) = \Gamma(\omega)\, f_x(\omega)$ .     (2.35)

Note that the cross-covariance function is not limited to time series defined, e.g. via
Eq. (2.26), but is defined for any two time series.

2.6.2 Multivariate Case

The previous concepts can be extended to the multivariate time series in the same
manner (Appendix C). Let us suppose that xt , t = 1, 2, . . ., is a d-dimensional time
series with zero mean (for simplicity). The lagged autocovariance matrix is
 
$\boldsymbol{\Gamma}(\tau) = E\left(\mathbf{x}_t\, \mathbf{x}_{t+\tau}^T\right)$ .     (2.36)

Using (2.34), we get

$\boldsymbol{\Gamma}^T(\tau) = \boldsymbol{\Gamma}(-\tau)$ .     (2.37)

Exercise Derive (2.37).


(Hint. Use stationarity).
The cross-spectrum matrix F is given by

$\mathbf{F}(\omega) = \frac{1}{2\pi}\sum_k e^{-i\omega k}\, \boldsymbol{\Gamma}(k)$ .     (2.38)

The inverse of this Fourier transform yields


$\boldsymbol{\Gamma}(k) = \int_{-\pi}^{\pi} \mathbf{F}(\omega)\, e^{i\omega k}\, d\omega$ .     (2.39)

The cross-spectrum matrix F(ω) is Hermitian, i.e.

F∗T (ω) = F(ω), (2.40)

where the notation (∗ ) represents the complex conjugate. Note that the diagonal
elements of the cross-spectrum matrix, [F]ii (ω), i = 1, . . . d, represent the
individual power spectrum of the i'th component x_{ti} of x_t. The real part $\mathbf{F}_R = \mathrm{Re}(\mathbf{F}(\omega))$ is the co-spectrum, and the imaginary part $\mathbf{F}_I = \mathrm{Im}(\mathbf{F}(\omega))$ is the
quadrature spectrum. The co-spectrum is even and satisfies
  
$\mathbf{F}_R(\omega) = \frac{1}{2\pi}\left[\boldsymbol{\Gamma}(0) + \sum_{k\ge 1}\cos(k\omega)\left(\boldsymbol{\Gamma}(k) + \boldsymbol{\Gamma}^T(k)\right)\right]$ ,     (2.41)

and the covariance matrix can also be written as


$\boldsymbol{\Gamma}(0) = \int_{-\pi}^{\pi} \mathbf{F}(\omega)\, d\omega = 2\int_0^{\pi} \mathbf{F}_R(\omega)\, d\omega$ .     (2.42)

The relations (2.27) and (2.35) can also be extended naturally to the multivariate
filtering problem. In fact, if the multivariate signal x(t) is passed through a linear filter L to yield $\mathbf{y}(t) = L\mathbf{x}(t) = \int \mathbf{H}(u)\, \mathbf{x}(t-u)\, du$, then the covariance matrix of the output is

$\boldsymbol{\Gamma}_y(\tau) = \int\!\!\int \mathbf{H}(u)\, \boldsymbol{\Gamma}_x(\tau + u - v)\, \mathbf{H}^T(v)\, du\, dv$ .

Similarly, the cross-covariance between input and output is



$\boldsymbol{\Gamma}_{xy}(\tau) = \int \mathbf{H}(u)\, \boldsymbol{\Gamma}_x(\tau - u)\, du$ .     (2.43)

By expanding the cross-spectrum matrix $\mathbf{F}_{xy}(\omega) = \frac{1}{2\pi}\sum_\tau \boldsymbol{\Gamma}_{xy}(\tau)\, e^{-i\omega\tau}$, using (2.43), and similarly for the output spectrum matrix $\mathbf{F}_y(\omega)$, one gets

$\mathbf{F}_{xy}(\omega) = \boldsymbol{\Gamma}(\omega)\, \mathbf{F}_x(\omega)$
$\mathbf{F}_y(\omega) = \boldsymbol{\Gamma}(\omega)\, \mathbf{F}_x(\omega)\, \boldsymbol{\Gamma}^{*T}(\omega)$ ,     (2.44)

where $\boldsymbol{\Gamma}(\omega) = \int \mathbf{H}(u)\, e^{-i\omega u}\, du$ is the multivariate frequency response function.
Chapter 3
Empirical Orthogonal Functions

Abstract This chapter describes the idea behind, and develops the theory of
empirical orthogonal functions (EOFs) along with a historical perspective. It also
shows different ways to obtain EOFs and provides examples from climate and
discusses their physical interpretation. Strength and weaknesses of EOFs are also
mentioned.

Keywords Empirical orthogonal functions · Teleconnection · Arctic oscillation ·


Sampling uncertainties · Teleconnectivity · Adjacency matrix

3.1 Introduction

The inspection of multivariate data with a few variables can be addressed easily
using the techniques listed in Chap. 2. For atmospheric data, however, where one
deals with many variables, those techniques become impractical, see Fig. 3.1 for an
example of data cube of sea level pressure. In the sequel, we adopt the notation and
terminology presented in Chap. 2. In general, and before any advanced analysis, it
is recommended to inspect the data using simple exploratory tools such as:
• Plotting the mean field x of xk , k = 1, . . . n.
• Plotting the variance of the field, i.e. diag(S) = s11 , . . . spp , see Eq. (2.18).
• Plotting time slices of the field, or the time evolution of the field at a given latitude
or longitude, that is the Hovmöller diagram.
• Computing and plotting one-point correlation maps between the field and a
specific time series. This could be a time series from the same field at, say,
a specific location sk in which case the field to be plotted is simply the k’th
column of the correlation matrix  (Eq. (2.21)). Alternatively, the time series
could be any climate index zt , t = 1, 2, . . . n in which case the field to be plotted
is simply ρ1 , ρ2 , . . . , ρp , where ρk = ρ(xk , zt ) is the correlation between the
index and the k’th variable of the field. An example of one-point correlation map
for DJF NCEP/NCAR sea level pressure is shown in Fig. 3.1 (bottom), the base


Fig. 3.1 An example of space–time representation of winter monthly (December–January–


February) sea level pressure from the National Center for Atmospheric Research/National center
for Environmental Prediction (NCAR/NCEP) reanalyses for the period Dec 1991–Dec 1995 (top
and middle, unit: hPa), and one-point correlation map shown in percentage (bottom)

point is also shown. This map represents the North Atlantic Oscillation (NAO)
teleconnection pattern, discussed below (a simple way of computing such a map is sketched after this list).
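As a minimal sketch (in Matlab, assuming the data matrix X is arranged as in Eq. (2.11), time × grid points; the base-point index jb is an arbitrary choice, and the reshaping of the resulting vector onto the latitude–longitude grid for plotting is omitted):
>> jb = 100;                      % index of the chosen base grid point
>> R = corrcoef(X);               % p x p sample correlation matrix, Eq. (2.21)
>> rmap = R(:, jb);               % one-point correlation map for the base point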
As stated in Chap. 1, when we have multivariate data the objective is often to
find coherent structures or patterns and to examine possible dependencies and
relationships among them for various purposes such as exploration, identifica-
tion of physical and dynamical mechanisms and prediction. This can only be
achieved through “simplification” or reduction of the data structure. The words
simplification/reduction will become clear later. This chapter deals with one of the
most widely used techniques to simplify/reduce and interpret the data structure,
namely principal component analysis (PCA). This is an exploratory technique for
multivariate data, which is in essence an eigenvalue problem, aiming at explaining
and interpreting the variability in the data.

3.2 Eigenvalue Problems in Meteorology: Historical


Perspective

3.2.1 The Quest for Climate Patterns: Teleconnections

The climate system is studied using observations as well as model simulations. The
weather and climate system is not an isolated phenomenon, but is characterised by
high interconnections, namely climate anomalies at one location on the earth can be
related to climate anomalies at other distant locations. This is the basic concept
of what is known as teleconnection. In simple words, teleconnections represent
patterns connecting widely separated regions (e.g. Hannachi et al. 2017). Typical
examples of teleconnections include The El-Niño Southern Oscillation, ENSO (e.g.
Trenberth et al. 2007), the North Atlantic Oscillation, NAO (Hurrell et al. 2003;
Hannachi and Stendel 2016; Franzke and Feldstein 2005) and the Pacific-North
American (PNA) patterns (Hannachi et al. 2017; Franzke et al. 2011).
ENSO is a recurring ocean-atmosphere coupled pattern of interannual fluc-
tuations characterised by changes in sea surface temperature in the central and
eastern tropical Pacific Ocean associated with large scale changes in sea level
pressure and also surface wind across the maritime continent. The Ocean part
of ENSO embeds El-Niño and La-Niǹa, and the atmospheric part embeds the
Southern Oscillation. An example of El-Niño is shown in Chap. 16 (Sect. 16.9).
El-Niño and La-Niña represent, respectively, the warming (or above average) and
cooling (or below average) phases of the central and eastern Pacific Ocean surface
temperature. This process has a period of about three to 7 years where the sea surface
temperature changes by about 1–3 ◦ C. The Southern Oscillation (SO) involves
changes in pressure, and other variables such as wind, temperature and rainfall, over
the tropical Indo-Pacific region, and is measured by the difference in atmospheric
pressure between Australia/Indonesia and eastern South Pacific. An example of
SO is discussed in Chap. 8 (Sect. 8.5). ENSO, as a teleconnection, has an impact
34 3 Empirical Orthogonal Functions

over considerable parts of the globe, especially North and South America and parts
of east Asia and the summer monsoon region. Although ENSO teleconnection,
precisely the Southern Oscillation, seems to have been discovered by Gilbert Walker
(Walker 1923), through correlation between surface pressure, temperature and
rainfall, the concept of teleconnection, however, seems to have been mentioned for
the first time in Ångström (1935). The connection between the Southern Oscillation
and El-Niño was only recognised later by Bjerknes in the 1960s, see, e.g. Bjerknes
(1969).
The NAO (Fig. 3.1, bottom) is a kind of see-saw in the atmospheric mass
between the Azores and the extratropical North Atlantic. It is the dominant mode of
near surface pressure variability over the North Atlantic, Europe and North Africa
(Hurrell et al. 2003; Hannachi and Stendel 2016), and has an impact on considerable
parts of the northern hemisphere (Hurrell et al. 2003; Hannachi and Stendel 2016).
The two main centres of action of the NAO are located, respectively, near the Azores
and Iceland. For example, in its positive phase the pressure difference between the
two main centres of action is enhanced, compared to the climatology, resulting in
stronger than normal westerly flow.

3.2.2 Eigenvalue Problems in Meteorology

PCA has been used since the beginning of the twentieth century by statisticians
such as Pearson (1901) and later Hotelling (1933, 1935). For statistical and more
general application of PCA, the reader is referred, e.g., to the textbooks by Seal
(1967), Morrison (1967), Anderson (1984), Chatfield and Collins (1980), Mardia
et al. (1979), Krzanowski (2000), Jackson (2003) and Jolliffe (2002) and more
references therein. In the atmospheric science, it is difficult to get the exact origin of
eigenvalue problems. According to Craddock (1973), the earliest1 recognisable use
of eigenvalue problems in meteorology seems to have been mentioned in Fukuoka
(1951). The earliest known and comprehensive development of eigenvalue analyses
and orthogonal expansion in atmospheric science are the works of Obukhov (1947)
and Bagrov (1959) from the previous USSR and Lorenz (1956) from the US Weather
Service. Fukuoka (1951) also mentioned the usefulness of these methods in weather
prediction. Obukhov (1947) used the method for smoothing purposes whereas
Lorenz (1956), who coined the name empirical orthogonal functions (EOFs), used
it for prediction purposes.
Because of the relatively large number of variables involved and the low
speed/memory of computers that were available in the mid-1950s, Gilman (1957),
for example, had to partition the atmospheric pressure field by slicing the north-
ern hemisphere into slices, and the data matrix was thus reduced substantially
and allowed an eigenanalysis. Later developments were conducted by various

1 Wallace (2000) maintains the view that the way Walker (1923) computed the Southern Oscillation
bears resemblance to iterative techniques used to compute empirical orthogonal functions.
3.3 Computing Principal Components 35

researchers: Obukhov (1960), Grimmer (1963), Craddock (1965, 1973), Kutzbach


(1967), Wallace and Dickinson (1972). The question of mixed variables, for
example, was raised by Kutzbach (1967), which has led to the issue of scaling the
different variables. From then onwards, and with the advent of powerful computers
and the increase of data leading to big data, the domain of eigenanalysis methods
in atmospheric science has grown rapidly. The available large amounts of data in
weather and climate need to be analysed in an efficient way. Empirical orthogonal
functions provide one such tool to deal with such big data.
Other extended forms of EOFs have been introduced later. These include, for
example, rotated EOFs (Horel 1981; Richman 1981, 1986), complex EOFs (Horel
1984) and extended EOFs (Weare and Nasstrom 1982). Rotated EOFs for instance
have been introduced to obtain more geographically compact patterns with more
robustness (see e.g. Barnston and Livezey 1987) whereas complex EOFs, for
example, are EOFs of complexified fields.

3.3 Computing Principal Components

3.3.1 Basis of Principal Component Analysis

PCA aims to find a new set of variables that explain most of the variance observed
in the data.2 Figure 3.2 shows the axes that explain most of the variability in the
popular three-variable Lorenz model. It has been extensively used in atmospheric
research to analyse particularly large scale and low frequency variability. The first
seminal work on PCA in atmospheric science goes back to the mid-1950s with
Ed. Lorenz. The method, however, has been used before by Obukhov (1947), see
e.g. North (1984), and was mentioned later by Fukuoka (1951), see e.g. Craddock
(1973). Here and elsewhere we will use both terminologies, i.e. PCA or EOF
analysis, interchangeably. Among the very few earliest textbooks on EOFs in
atmospheric science, the reader is referred to Preisendorfer and Mobley (1988), and
to later textbooks by Thiebaux (1994), Wilks (2011), von Storch and Zwiers (1999),
and Jolliffe (2002).
The original aim of EOF analysis (Obukhov 1947; Fukuoka 1951; Lorenz 1956)
was to achieve a decomposition of a continuous space–time field X(t, s), where t
and s denote respectively time and spatial position, as

X(t, s) = ck (t)ak (s) (3.1)
k≥0

using an optimal set of orthogonal basis functions of space ak (s) and expansion
functions of time ck (t). When the field is discretised in space and/or time a similar

2 This is based on the main assumption in data analysis, that is variability represents information.
36 3 Empirical Orthogonal Functions

Fig. 3.2 Empirical orthogonal functions of the Lorenz model attractor

expansion to (3.1) is also sought. For example, if the field is discretised in both time
and space the expansion above is finite, and the obtained field can be represented by
a data matrix X as in (2.11). In this case the sum extends to the rank r of X. The
basis functions ak (s) and expansion coefficients ck (t) are obtained by minimising
the residual:

   
M
2
R1 = X(t, s) − ck (t)ak (s) dtds, (3.2)
T S k=1

where the integration is performed over the time T and spatial S domains for the
continuous case and for a given expansion order M. A similar residual is minimised
for the discrete case except that the integrals are replaced by discrete sums.

3.3.2 Karhunen–Loéve Expansion

In probability theory expansion (3.1), for a given s (and as such the parameter s is
dropped here for simplicity) is known as Karhunen–Loève expansion associated
with a continuous zero-mean3 stochastic process X(t) defined over an interval
[a, b], and which is square integrable, i.e. E|X(t)|2 < ∞, for all t in the interval
[a, b]. Processes having these properties constitute a Hilbert space (Appendix F)
with the inner product < X1 (t), X2 (t) >= E (X1 (t)X2 (t)). The covariance
function of X(t):

3 If it is non-zero mean it can be centred by removing the stochastic process mean.


3.3 Computing Principal Components 37

γ (s, t) = E (X(t)X(s))

is symmetric and non-negative for a ≤ t1 , t2 ≤ b. This covariance function is at the


root of expressing the stochastic process X(t), for t in [a, b], in terms of a sum of
uncorrelated random variables. Let us consider the space of square integrable (real)
functions defined over the interval [a, b], noted L2 ([a, b]). This functional space is
a Hilbert space with the inner product:
 b
< f, g >= f (t)g(t)dt.
a

The linear transformation defined over L2 ([a, b]):


 b
Af (t) = γ (t, s)f (s)ds
a

is self-adjoint (Appendix F) because the covariance function is symmetric. The


main consequence of this is that the kernel function γ (s, t) can be expanded into
an absolutely and uniformly convergent series, i.e.


γ (t, s) = λk φk (t)φk (s),
k=1

where λ1 , λ2 . . . and φ1 (), φ2 (), . . . are respectively the eigenvalues and associated
orthonormal eigenfunctions of the Fredholm eigen problem:
 b
Aφ(t) = γ (t, s)φ(s)ds = λφ(t)
a

and satisfying < φi , φj >= δij . This result is due to Mercer (1909), see also
Basilevsky and Hum (1979), and the covariance function γ (t, s) is known as Mercer
kernel. Accordingly the stochastic process X(t) is then expanded as:


X(t) = Xk φk (t),
k=1

where Xk , k = 1, 2 . . . are zero-mean uncorrelated random variables given by


b
the stochastic integral Xk = a X(t)φk (t)dt and hence E Xi Xj = δij λi . The
Karhunen–Loève expansion has the following two important properties (Loève
1963, p. 477; Parzen 1963, see also Basilevsky and Hum 1979) namely:

(i) it minimises the Shannon entropy − k λk ln λk , and
(ii) it minimises the mean square error
38 3 Empirical Orthogonal Functions

 b 
k 
k
|X(t) − Xi φi (t)|2 dt = 1 − λi
a i=1 i=1

when the first k terms of the expansion are used. Note that when the stochastic
process is stationary, i.e. γ (s, t) = γ (s − t) then the previous expansion
becomes


γ (s − t) = λk φk (s)φk (t)
k=1

and the integral operator A becomes a convolution.


Now we come back to stochastic process X(t, s). The basis functions ak (s)
in Eq. (3.1) are the empirical orthogonal functions (EOFs) and the expansion
coefficients ck (t) are the principal components (PCs). For the discrete case, where
the data matrix X has dimensions n × p the k’th EOF is a vector ak of length
p, whereas the associated PC is a time series fk (t), t = 1, . . . n. Because the
continuous case requires a special treatment, it will be discussed in a later chapter,
and we focus here on the discrete case. In the literature, EOFs are also known as
principal component loadings, or vector of loadings, and the PCs as EOF time series,
EOF amplitudes, expansion coefficients and scores. In this and subsequent chapters
we will reserve the term EOFs or EOF patterns for the spatial patterns and PCs for
the corresponding time series.
EOFs and PCs (and their derivatives) are multipurpose. They are used as an
exploratory tool to analyse multivariate time series in climate or any other field
and identify the dominant modes of variability (Jolliffe 2002; Hannachi et al. 2007).
In weather and climate, in particular, they can be used in forecasting, downscaling,
regression analysis, dimension reduction and analysing nonlinear features in state
space (Tippett et al. 2008; Franzke et al. 2005; Kim and North 1999; Kim et al.
2015; Hannachi et al. 2017; Önskog et al. 2018, 2020).

3.3.3 Derivation of PCs/EOFs

Given the (centred) data matrix (2.11), the objective of EOF/PC analysis is to find
the linear combination of all the variables explaining maximum variance, that is to
T
find a unit-length direction a = a1 , . . . , ap that captures maximum variability.
The projection of the data onto the vector a yields the centred time series Xa, whose
variance is simply the average of the squares of its elements, i.e. aT XT Xa/n. The
EOFs are therefore obtained as the solution to the quadratic optimisation problem

max F (a) = aT Sa
(3.3)
subject to aT a = 1.
3.3 Computing Principal Components 39

Fig. 3.3 Illustration of a pair of EOFs: a simple monopole EOF1 and dipole EOF2

Eq. (3.3) can be solved using a Lagrange multiplier μ to yield

max aT Sa − μ(1 − aT a)
a

which is also equivalent to maximising (aT Sa)/(aT a). The EOFs are therefore
obtained as the solution to the eigenvalue problem:

Sa = λ2 a. (3.4)

The EOFs are the eigenvectors of the sample covariance matrix S arranged in
decreasing order of the eigenvalues. The first eigenvector a1 gives the first principal
component, i.e. the linear function Xa1 , with the largest variance; the second EOF
a2 gives the second principal component with the next largest variance subject to
being orthogonal to a1 as illustrated in Fig. 3.3, etc.
Remark In PCA one usually defines the PCs first as linear combinations of the
different variables explaining maximum variability from which EOFs are then
derived. Alternatively, one can similarly define EOFs as linear combinations vT X,
where v is a vector of weights, of the different maps of the field that maximise
the norm squared. Applying this definition one obtains a similar equation to (3.3),
namely:

vT Pv
max , (3.5)
vT v

where P = XXT is the matrix of scalar product between the different maps.
Equation (3.5) yields automatically the (standardised) principal components. Note
that Eq. (3.5) is formulated using a duality argument to Eq. (3.3), and can be useful
for numerical purposes when, for example, the sample size is smaller than the
number of variables.
40 3 Empirical Orthogonal Functions

3.3.4 Computing EOFs and PCs


Singular Value Decomposition and Similar Algorithms

Since the covariance matrix is symmetric by construction, it is diagonalisable and


the set of its eigenvectors forms an orthogonal basis of the p-dimensional Euclidean
space, defined with the natural scalar product. This is a classical result in linear
algebra, which is summarised by the following decomposition of the covariance
matrix:

S = U2 UT , (3.6)

where U is a p × p orthogonal 
4 matrix, i.e. UT U = UUT = I, and 2 is a diagonal

matrix, i.e. 2 = Diag λ21 , . . . , λ2p , containing the eigenvalues5 of S. The EOFs
u1 , u2 , . . . up are therefore the columns of U. It is clear that if p < n, then there are
at most p positive eigenvalues whereas if n < p there are at most n − 1 positive
eigenvalues.6 To be more precise, if r is the rank of S, then there are exactly r
positive eigenvalues. To be consistent with the previous maximisation problem, the
eigenvalues are sorted in decreasing order, i.e. λ21 ≥ λ22 ≥ . . . ≥ λ2p , so that the first
EOF yields the time series with maximum variance, the second one with the next
largest variance, etc. The solution of the above eigenvalue problem, Eq. (3.6), can
be obtained using either direct methods such as the singular value decomposition
(SVD) algorithm or iterative methods based on Krylov subspace solvers using
Lanczos or Arnoldi algorithms as detailed in Appendix D, see also Golub and van
Loan (1996) for further methods and more details. The Krylov subspace method is
particularly efficient for large and/or sparse systems.
In Matlab programming environment, let X(n, p1, p2) designate the two-
dimensional (p1 × p2) gridded (e.g. lat-lon) field, where n is the sample size. The
field is often assumed to be anomalies (though not required). The following simple
code computes the leading 10 EOFs, PCs and the associated covariance matrix
eigenvalues:

>> [n p1 p2] = size(X); p12=p1*p2;


>> X = reshape(X, n, p12);
>> if(n>p12) X = X’; end
>> [PCs S EOFs] = svds(X, 10, ’L’);
>> if(n>p12) PCs = A; PCs = EOFs; EOFs = A; end
>> S = diag(diag(S).*diag(S))/n;
>> EOFs = reshape (EOFs, p1, p2, 10).

4 This is different from a normal matrix U, which commutes with its transpose, i.e. UT U = UUT .
5 We use squared values because S is semi-definite positive, and also to be consistent with the SVD
of S.
6 Why n − 1 and not n?
3.3 Computing Principal Components 41

Note also that Matlab has a routine pca, which does PCA analysis of the data matrix
X(n, p12) (see also Appendix H for resources):

>> [PCs EOFs S] = pca (X).

The proportion of variance explained by the kth principal component is usually


given by the ratio:

100λ2
r k 2 %, (3.7)
j =1 λj

which is often expressed in percentage. An example of spectrum is displayed in


Fig. 3.4, which shows the percentage of explained variance of the winter months
Dec–Jan–Feb (DJF) sea level pressure. The vertical bars show the approximate 95%
confidence limits discussed in the next section. The PCs are obtained by projecting
the data onto the EOFs, i.e.:

C = XU, (3.8)

so the k’th PC ck = (ck (1), ck (2), . . . , ck (n)) is simply Xuk whose elements are


p
ck (t) = xtj uj k
j =1

Fig. 3.4 Percentage of explained variance of the leading 40 EOFs of winter months (DJF)
NCAR/NCEP sea level pressure anomalies for the period Jan 1940–Dec 2000. The vertical bars
provide approximate 95% confidence interval of the explained variance. Adapted from Hannachi
et al. (2007)
42 3 Empirical Orthogonal Functions

for t = 1, . . . , n and where uj k is the j’th element of the kth EOF uk . It is clear
from (3.8) that the PCs are uncorrelated and that

1 2
cov(ck , cl ) = λ δkl . (3.9)
n k

Exercise Derive Eq. (3.9)


Note that when using (3.5) instead we get automatically uncorrelated PCs, and then
a similar relationship to (3.9) can be derived for the EOFs.
There are various algorithms to obtain the eigenvalues and eigenvectors of S, see
e.g. Jolliffe (2002). The most efficient and widely used algorithm is based on the
SVD theorem (2.25), which, when applied to the data matrix X, yields

1
X = √ VUT , (3.10)
n

where  = Diag (λ1 , λ2 , . . . , λr ), and λ1 ≥ λ2 ≥ . . . λr ≥ 0 are the singular


values of X. Note that the term √1n in (3.10) is used for consistency with (3.6), but
the term is absorbed by the singular values and most of the time it does not appear.
The SVD algorithm is a standard computing routine provided in most software, e.g.
MATLAB (Linz and Wang 2003) and does not require computing the covariance
matrix. To be efficient the SVD is applied to X if n < p otherwise XT is used
instead. Using (3.8) the PCs are given by

1
C = √ V, (3.11)
n

hence the columns of V are the standardised, i.e. unit variance principal components.
One concludes therefore that the EOFs and standardised PCs are respectively the
right and left singular vectors of X.
Figure 3.5 shows the leading two EOFs of DJF SLP anomalies (with respect to
the mean seasonal cycle) from NCAR/NCEP. They explain respectively 21% and
13% of the total winter variability of the SLP anomalies, see also Fig. 3.4. Note, in
particular, that the leading EOF reflects the Arctic Oscillation mode (Wallace and
Thompson 2002), and shows the North Atlantic Oscillation over the North Atlantic
sector. This is one of the many features of EOFs, namely mixing and is discussed
below. The corresponding two leading PCs are shown in Fig. 3.6.
The SVD algorithm is reliable, as pointed out by Toumazou and Cretaux (2001),
and the computation of the singular values is governed by the condition number of
the data matrix. Another strategy is to apply the QR algorithm (see Appendix D) to
the symmetric matrix XT X or XXT , depending on the smallest dimension of X. The
algorithm, however, can be unstable as the previous symmetric matrix has a larger
condition number compared to that of the data matrix. In this regard, Toumazou and
Cretaux (2001) suggest an algorithm based on a Lanczos eigensolver technique.
3.3 Computing Principal Components 43

Fig. 3.5 Leading two


empirical orthogonal
functions of the winter (DJF)
monthly sea level pressure
anomalies for the period Jan
1940–Dec 2000 (a) EOF1
(21%). (b) EOF2 (13%).
Adapted from Hannachi et al.
(2007)

The method is based on using a Krylov subspace (see Appendix D), and reduces to
computing some eigen-elements of a small symmetric matrix.

Basic Iterative Approaches

Beside the SVD algorithm, iterative methods have also been proposed to compute
EOFs (e.g. van den Dool 2011). The main advantage of these methods is that they
avoid computing the covariance matrix, which may be prohibitive at high resolution
and large datasets, or even dealing directly with the data matrix as is the case with
44 3 Empirical Orthogonal Functions

Fig. 3.6 Leading two principal components of the winter (DJF) monthly sea level pressure
anomalies for the period Jan 1940–Dec 2000 (a) DJF sea level pressure PC1. (b) DJF sea level
pressure PC2. Adapted from Hannachi et al. (2007)

SVD. The iterative approach makes use of the identities linking EOFs and PCs.
 EOF Em (s) and corresponding PC cm (t) of a field X(t, s) satisfy Em (s) =
An
t cm (t)X(t, s), and similarly for cm (t). The method then starts with an initial
guess of a time series, say c(0) (t), scaled to unit variance, and obtains the associated
pattern E (0) (s) following the previous identity. This pattern is then used to compute
an updated time series c(1) (t), etc. As pointed out by van den Dool (2011), the
process normally converges quickly to the leading EOF/PC. The process is then
continued with the residuals, after removing the contribution of the leading mode, to
get the subsequent modes of variability. The iterative method can be combined with
spatial weighting to account for the grid (e.g. Gaussian grid in spherical geometry)
and subgrid processes, and to maximise the signal-to-noise ratio of EOFs (Baldwin
et al. 2009).
3.4 Sampling, Properties and Interpretation of EOFs 45

3.4 Sampling, Properties and Interpretation of EOFs

3.4.1 Sampling Variability and Uncertainty

There are various ways to estimate or quantify uncertainties associated with the
EOFs and corresponding eigenvalues. These uncertainties can be derived based on
asymptotic approximation. Alternatively, the EOFs and associated eigenvalues can
be obtained using a probabilistic framework where uncertainties are comprehen-
sively modelled. Monte-Carlo method is another approach, which can be used, but
can be computationally expensive. Cross-validation and bootstrap are examples of
Monte-Carlo methods, which are discussed below. But other methods of Monte-
Carlo technique also exist, such as surrogate data, which will be commented on
below.

Uncertainty Based on Asymptotic Approximation

Since the data matrix is subject to sampling uncertainty, so do the eigenvalues


and corresponding EOFs of the covariance matrix. Because of the existence of
correlations among the different variables, the hypothesis of independence is simply
not valid. One would expect for example that the eigenvalues are not entirely
separated since each eigenvalue will have a whole uncertainty interval. So what we
estimate is an interval rather than a single point value. Using asymptotic arguments
based on the central limit theorem (CLT) in the limit of large samples (see e.g.
Anderson 1963), it can be shown (Girshik 1939; Lawley 1956; Mardia et al. 1979;
North et al. 1982) that if λ̂21 , . . . , λ̂2p denote the eigenvalues estimated from the
sample covariance matrix S, obtained from a sample of size n, then we have the
approximation:
 
2 2 4
λ̂2k ∼ N λk , λk , (3.12)
n

where N (μ, σ 2 ) stands for the normal distribution with mean μ and variance σ 2 (see
Appendix B), and λ2k , k = 1, . . . p are the eigenvalues of the underlying population
covariance matrix . The standard error of the estimated eigenvalue λ̂2k is then

2
δ λ̂2k ≈ λ2k . (3.13)
n
46 3 Empirical Orthogonal Functions

For a given significance level α the interval λ̂2k ± δ λ̂2k 1− α2 , where the notation
a refers to the a’th quantile,7 provides therefore the asymptotic 100(1 − α)%
confidence limits of the population eigenvalue λ2k , k = 1, 2, . . . p. For example, the
95% confidence interval is [λ̂2k − 1.96δ λ̂2k , λ̂2k + 1.96δ λ̂2k ]. Figure 3.4 displays these
limits for the winter seal level pressure anomalies. A similar approximation can also
be derived for the eigenvectors uk , k = 1, . . . p:

δ λ̂2k
δuk ≈ uj , (3.14)
λ2j − λ2k

where λ2j is the closest eigenvalue to λ2k . Note that in practice the number n used in
the previous approximation corresponds to the size of independent data also known
as effective sample size, see below.

Probabilistic PCA

A more comprehensive method to obtain EOFs and explained variances along with
their uncertainties is to use a probability-based method. This has been done by
Tipping and Bishop (1999), see also Goodfellow et al. (2016). In this case EOFs
are computed via maximum likelihood. The model is based on a latent variable as
in factor analysis discussed in Chap. 10. Note, however, that the method relies on
multinormality assumption. More discussion on the method and its connection to
factor analysis is discussed in Sect. 10.6 of Chap. 10.

Monte-Carlo Resampling Methods

The asymptotic uncertainty method discussed above relies on quite large sample
sizes. In practice, however, this assumption may not be satisfied. An attractive
and easy to use alternative is Monte-Carlo resampling method, which has become
invaluable tool in modern statistics (James et al. 2013; Goodfellow et al. 2016).
The method involves repeatedly drawing subsamples from the training set at hand,
refitting the model to each of these subsamples, and obtain thereafter an ensemble of
realisations of the parameters of interest enabling the computation of uncertainties
on those parameters. Cross-validation and bootstrap are the most commonly used
resampling methods. The bootstrap method goes back to the late 1970s with Efron
(1979). The textbooks by Efron and Tibshirani (1994), and also Young and Smith
(2005) provide a detailed account of Monte-Carlo methods and their application. A
summary of cross-validation and bootstrap methods is given below, and for deeper

7
a = −1 (a) where () is the cumulative distribution function of the standard normal
distribution (Appendix B).
3.4 Sampling, Properties and Interpretation of EOFs 47

analysis the reader is referred to the more recent textbooks of James et al. (2013),
Goodfellow et al. (2016), Brownlee (2018) and Watt et al. (2020).
Cross-Validation
Cross-validation is a measure of performance, and involves splitting the available
(training) data sample (assumed to have a sample size n) into two subsets, one
is used for training (or model fitting) and the other for validation. That is, the
fitted model (on the training set) is used to get responses via validation with the
second sample, enabling hence the computation of the test error rate. In this way,
cross-validation (CV) can be used to get the test error, and yields a measure of
model performance and model selection, see, e.g., Sect. 14.5 of Chap. 14 for an
application to parameter identification. There are basically two types of cross-
validation methods, namely, the leave-one-out CV and k-fold CV. The former deals
with leaving one observation out (validation set), and fitting the statistical model
on the remaining n − 1 observations (training set), and computing the test error
at the left-one-out observation. This error is simply measured by the mean square
error (test error) between the observation and the corresponding value given by the
fitted model. This procedure is then repeated with every observation, and then the
leave-one-out CV is estimated by the average of the obtained n test errors.
In the k-fold CV, the dataset is first divided randomly into k subsamples of
similar sizes. Training is then applied to one subsample and validation applied to the
remaining k-1 subsamples, yielding one test error. The procedure is then repeated
with each subsample, and the final k-fold CV is obtained as the average of the
obtained k test errors. The leave-one-out approach is obviously a special case of the
k-fold CV, and therefore the latter is advantageous over the former. For example,
k-fold CV gives more accurate estimates of the test errors, a result that is related to
the bias-variance trade-off. James et al. (2013) suggest the empirical values k = 5
or k = 10.
The Bootstrap
The bootstrap method is a powerful statistical tool used to estimate uncertainties
on a given statistical estimator from a given dataset. The most common use of
bootstrap is to provide a measure of accuracy of the parameter estimate of interest.
In this context, the method is used to estimate summary statistics of the parameter
of interest, but can also yield approximate distribution of the parameter. The
bootstrap involves constructing a random subsample from the dataset, which is
used to construct an estimate of the parameter of interest. This procedure is then
repeated a large number of times, yielding hence an ensemble of estimates of the
parameter of interest. Each sample used in the bootstrap is constructed from the
dataset by drawing observations, one at a time, and returning the drawn sample to
the dataset, until the required size is reached. This procedure is known as sampling
with replacement and enables observations to appear possibly more than once in a
bootstrap sample. In the end, the obtained ensemble of estimates of the parameter of
interest is used to compute the statistics, e.g. mean and variance, etc., and quantify
the uncertainty on the parameter (James et al. 2013). In summary, a bootstrap
sample, with a chosen size, is obtained by drawing observations, one at a time, from
48 3 Empirical Orthogonal Functions

the pool of observations of the training dataset. In practice, the number of bootstrap
samples should be large enough of the order of O(1000). Also, for reasonably large
data, the bootstrap sample size can be of the order 50–80% of the size of the dataset.
The algorithm of resampling with replacement goes as follows:
(1) Select the number of bootstrap samples, and a sample size of these samples.
(2) Draw the bootstrap sample
(3) Compute the parameter of interest
(4) Go to (2) until the number of bootstrap samples is reached.
The application of the above algorithm yields a distribution, e.g. histogram, of the
parameter.
Remarks on Surrogate Data Method The class of Monte-Carlo method is quite
wide and includes other methods than CV and bootstrap resampling. One partic-
ularly powerful method used in time series is that of surrogate data. The method
of surrogate data (Theiler et al. 1992) involves generating surrogate datasets, which
share some characteristics with the original time series. The method is mostly used
in nonlinear and chaotic time series analysis to test linearity null hypotheses (e.g.
autoregressive moving-average ARMA processes) versus nonlinearity. The most
common algorithm for surrogate data is phase randomisation and amplitude adjusted
Fourier transform (Theiler et al. 1992). Basically, the original time series is Fourier
transformed, the amplitudes of this transform are then used with new uniformly
distributed random phases, and finally an inverse Fourier transform is applied to
get the surrogate sample. For real time series, the phases are constrained to be
antisymmetric. By construction, these surrogates preserve the linear structure of
the original time series (e.g. autocorrelation function and power spectrum). Various
improvements and extensions have been proposed in the literature (e.g. Breakspear
et al. 2003, Lucio et al. 2012). Surrogate data method has also been applied in
atmospheric science and oceanography. For example, Osborne et al. (1986) applied
it to identify signatures of chaotic behaviour in the Pacific Ocean dynamics.

Bootstrap Application to EOFs of Atmospheric Fields

The bootstrap method can easily be applied to obtain uncertainties on the EOFs of
a given atmospheric field, as shown in the following algorithm:
(1) Select the number of bootstrap samples, and the sample size of these samples.
(2) For each drawn bootstrap sample:
(2.1) Compute the EOFs (e.g. via SVD) and associated explained variance.
(2.2) Rank the explained variances (and associated EOFs) in decreasing order.
(3) Calculate the mean, variance (and possibly histograms, etc.) of each explained
variance and associated EOFs (at each grid point).
3.4 Sampling, Properties and Interpretation of EOFs 49

The application of this algorithm yields an ensemble of EOFs and associated


eigenvalues, which can be used to quantify the required uncertainties.
Remarks
1. It is also possible to apply bootstrap without replacement. This can affect
probabilities, but experience shows that the difference with the bootstrap with
replacement is in general not large. In a number of applications, other, not
standard, sampling methods have also been used. An example would be to choose
a subset of variables then scramble them by breaking the chronological order then
apply EOFs, and so on.
2. Another test, also used in atmospheric science and oceanography, consists of
generating random samples, e.g. red noise data (see Appendix C) having the same
spectral (or autocorrelation) characteristics as the original data, then computing
the eigenvalues and the eigenvectors from the various samples and obtain an
uncertainty estimate for the covariance matrix spectra.
The Monte-Carlo bootstrap method is commonly used in atmospheric science
(e.g. von Storch and Zwiers 1999), and has been applied to study nonlinear flow
regimes in atmospheric low frequency variability (e.g. Hannachi 2010), and climate
change effect on teleconnection (e.g. Wang et al. 2014). For example, Wang et al.
(2014) estimated uncertainties in the NAO and applied it to the winter sea level
pressure from the twentieth century reanalyses. They used the bootstrap sampling
to obtain spatial patterns of NAO uncertainties. The methodology is based on
computing the standard deviation of the leading EOF of the sampled covariance
matrix. Wang et al. (2014) used a slightly modified version of the bootstrap method.
To replicate the correlation structure, they sampled blocks of data instead of
individual observations. This is common practice in atmospheric science because
of the non-negligible correlation structure in weather and climate data. In their
analysis, Wang et al. (2014) used non-overlapping blocks of 20-yr winter time
monthly SLP anomalies and 2- and 4-month bootstrap blocks. Their results indicate
that the largest uncertainties are located between the centres of action of the NAO,
particularly in the first half of the record. Figure 3.7 shows the evolution of the
longitude of the northern (Fig. 3.7a) and southern (Fig. 3.7b) nodes of the NAO over
20-yr running windows for the period 1871–2008. There is a clear zonal shift of
the NAO nodes across the record. For example, the poleward centre of action of the
NAO shows an eastward shift during the last quarter of the record. Furthermore, the
southern node shows larger uncertainties compared to the northern node.
50 3 Empirical Orthogonal Functions

Fig. 3.7 Evolution of the frequency distribution of the north (a) and south (b) centres of action
of the NAO pattern computed over 20-yr running windows. The yellow line corresponds to the
longitude of the original sample. Adapted from Wang et al. (2014). ©American Meteorological
Society. Used with permission

3.4.2 Independent and Effective Sample Sizes


Serial Correlation

Given a univariate time series xt , t = 1, . . . n, with serial correlation, the number


of degrees of freedom (d.o.f) n∗ is the independent sample size of the time series.
Although it is not very easy to define exactly the effective sample size from a given
sample, one can use approximations based on probabilistic models. For example,
using a AR(1) process, with autocorrelation ρ(.), Leith (1973) suggested
n n
n∗ = = − log(ρ(1)), (3.15)
2T0 2

where T0 is the e-folding time of ρ(.). The idea behind Leith (1973) is that if xt ,
t = 1, 2, . . . n is a realisation of an independent and identically distributed
 (IID)
random variables X1 , . . . Xn , with variance σ 2 then the mean x = n−1 nt=1 xt has
variance σx2 = n−1 σ 2 . Now, consider a continuous time series x(t) defined for all
3.4 Sampling, Properties and Interpretation of EOFs 51

values of t with lagged autocovariance γ (τ ) = E [x(t)x(t + τ )] = σ 2 ρ(τ ). An


estimate of the mean x t for a finite time interval [0, T ] is
 t− T2
1
xt = x(u)du. (3.16)
T t− T2

The variance of (3.16) can easily be derived and (3.15) can be recovered from a red
noise.
Exercise
1. Compute the variance σT2 of (3.16).
σ2
2. Derive σT2 for a red noise and show that = T
2T0 .
σT2

Hint
1. From (3.16) we have
  
t+ T2 t+ T2
T 2 σT2 =E x(s1 )x(s2 )ds1 ds2
t− T2 t− T2
 T  T

2 2
=E x(t + s1 )x(t + s2 )ds1 ds2
− T2 − T2
 T  T
2 2
=σ 2
ρ(s2 − s1 )ds1 ds2 .
− T2 − T2

This integral can be computed by a change of variable u = s1 and v = s2 − s1 ,


which transforms the square [− T2 , T2 ]2 into a parallelogram R (see Fig. 3.8) i.e.
     
σ2 0 T
T 2 T2 = ρ(u)dudv = dv ρ(v)dudv + dv ρ(v)dudv
σ R −T 0
 0  T  T  −v+ T2  T
2 v
= dv ρ(v)dv + dv du = 2T (1 − )ρ(v)dv.
−T −v− T2 0 − T2 0 T

σT2 T
Hence σ2
= 2
T 0 (1 − v
T )ρ(v)dv.

Remark Note that for a red noise or (discrete) AR(1) process, xt = φ1 xt−1 + εt ,
−τ |τ |
ρ(τ ) = e T0 (= φ1 ), see Appendix C, and the e-folding time T0 is given by the

integral 0 ρ(τ )dτ . In the above formulation, the time interval was assumed to be
unity. If the time series is sampled every t then as T0 = − log t
ρ(t) , and then one
gets
n
n∗ = − log ρ(t).
2
52 3 Empirical Orthogonal Functions

Fig. 3.8 Domain change

Fig. 3.9 Effective sample


size n∗ vs ρ(1) for Eq. (3.17)
(continuous) and Eq. (3.18)
(dashed)

Jones (1975) suggested an effective sample size of order-1, which, for a red noise,
boils down to
1 − ρ(1)
n∗ = n (3.17)
1 + ρ(1)

while Kikkawa and Ishida (1988) suggested

1 − ρ(1)2
n∗ = n (3.18)
1 + ρ(1)2

which can be twice as large compared to (3.17) (Fig. 3.9).


3.4 Sampling, Properties and Interpretation of EOFs 53

Time Varying Fields

For time varying fields, or multivariate time series, with N grid points or variables
x(t) = (x1 (t), . . . , xN (t)T observed over some finite time interval, Bretherton et
al. (1999) discuss two measures of effective numbers of spatial d.o.f or number
of independently varying spatial patterns. For example, for isotropic turbulence a
similar equation to (3.15) was given by Taylor (1921) and Keller (1935). Using the
“moment matching” (mm) method of Bagrov (1969), derived from a χ 2 distribution,
an estimate of the effective number of d.o.f can be derived, namely

∗ 2
Nmm = 2E /E 2 , (3.19)

where () is a time mean and E is a quadratic measure of the field, e.g. the quadratic
norm of x(t), E = x(t)T x(t).
An alternative way was also proposed by Bagrov (1969) and TerMegreditchian
(1969) based on the covariance matrix of the field. This estimate, which is also
discussed in Bretherton et al. (1999) takes the form
N 2
 λk tr()2

Neff = N = , (3.20)
k=1 i=1 λi tr( 2 )

where λk , k = 1, . . . N are the eigenvalues of the covariance matrix  of the field


∗ and N ∗
x(t). Bretherton et al. (1999) investigated the relationship between Neff mm
in connection to non-Gaussianity, yielding

∗ κ −1 ∗
Neff = Nmm ,
2
where κ is the kurtosis assumed to be the same for all PCs. This shows, in particular,
that the two values can be quite different.

3.4.3 Dimension Reduction

Since the leading order EOFs explain more variance than the lowest order ones, one
would then be tempted to focus on the few leading ones and discard the rest as being
noise variability. This is better assessed by the percentage of the explained variance
by the first, say, m retained EOFs:
m 2 m
k=1 λk k=1 var (Xuk )
p 2
= . (3.21)
k=1 λk
tr ()
54 3 Empirical Orthogonal Functions

In this way one can choose a pre-specified percentage of explained variance, e.g.
70%, then keep the first m EOFs and PCs that explain altogether this amount.
Remark Although this seems a reasonable way to truncate the spectrum of the
covariance matrix, the choice of the amount of explained variance remains, however,
arbitrary.
We have seen in Chap. 1 two different types of transformations: scaling and
sphering. The principal component transformation, obtained by keeping a subset of
EOFs/PCs, is yet another transformation that can be used in this context to reduce
the dimension of the data. The transformation is given by

Y = XU.

To keep the leading EOFs/PCs that explain altogether a specific amount of


variability, say β, one uses
m m−1
λ2 λ2
100 trk=1
()
k
≥ β and 100 trk=1
()
k
< β.

The reduced data is then given by

Ym = [y1 , y2 , . . . , ym ] = XUm , (3.22)

where Um = [u1 , u2 , . . . , um ] is the matrix of the leading m EOFs.


Remark If some of the original variables are linearly dependent the data matrix
cannot be of full rank, which is min(n, p). In this case the covariance matrix is not
invertible, and will have zero eigenvalues. If p0 is the number of zero eigenvalues,
then min(n, p) − p0 is the dimension of the space containing the observations.
NB
As it was mentioned earlier, the EOFs are also known as loadings, and the loading
coefficients are the elements of the EOFs.

3.4.4 Properties and Interpretation

The main characteristic features of EOF analysis is the orthogonality of EOFs and
uncorrelation of PCs. These are nice geometric properties that can be very useful
in modelling studies using PCs. For example, the covariance matrix of any subset
of retained PCs is always diagonal. These constraints, however, yield partially
predictable relationships between an EOF and the previous ones. For instance, as
pointed out by Horel (1981), if the first EOF has a constant sign over its domain,
then the second one will generally have both signs with the zero line going through
3.4 Sampling, Properties and Interpretation of EOFs 55

the maxima of the first EOF (Fig. 3.3 ). The orthogonality constraint also makes the
EOFs domain-dependent and can be too non-local (Horel 1981; Richman 1986).
Perhaps one of the main properties of EOFs is mixing. Assume, for example,
that our signal is a linear superposition of signals, not necessarily uncorrelated, then
EOF analysis tends to mix these signals in order to achieve optimality (i.e. maximum
variance) yielding patterns that are mixture of the original signals. This is known as
the mixing problem in EOFs. This problem can be particularly serious when the data
contain multiple signals with comparable explained variance (e.g. Aires et al. 2002;
Kim and Wu 1999). Figure 3.10 shows the leading EOF of the monthly sea surface
temperature (SST) anomalies over the region 45.5◦ S–45.5◦ N. The anomalies are
computed with respect to the monthly mean seasonal cycle. The data are on a
1◦ × 1◦ latitude-longitude grid and come from the Hadley Centre Sea Ice and
Sea Surface Temperature8 spanning the period Jan 1870–Dec 2014 (Rayner et al.
2003). The EOF shows a clear signal of El-Niño in the eastern equatorial Pacific. In
addition we also see anomalies located on the western boundaries of the continents
related to the western boundary currents. These are discussed in more detail in
Chap. 16 (Sect. 16.9). Problems related to mixing are conventionally addressed
using, e.g. EOF rotation (Chap. 4), independent component analysis (Chap. 12) and
also archetypal analysis (see Chap. 16).
Furthermore, although the truncated EOFs may explain a substantial amount
of variance, there is always the possibility that some physical modes may not be
represented by these EOFs. EOF analysis may lead therefore to an underestimation
of the complexity of the system (Dommenget and Latif 2002). Consequently,
these constraints can cause limitations to any possible physical interpretation of
the obtained patterns (Ambaum et al. 2001; Dommenget and Latif 2002; Jolliffe

-6 -4 -2 0 2 4 6 8 10 12 14 16 18 20

Fig. 3.10 Leading EOF of SST anomalies equatorward of 45◦

8 www.metoffice.gov.uk/hadobs/hadisst.
56 3 Empirical Orthogonal Functions

2003) because physical modes are not necessarily orthogonal. Normal modes
derived for example from linearised dynamical/physical models, such as barotropic
models (Simmons et al. 1983) are not orthogonal since physical processes are not
uncorrelated. The Arctic Oscillation/North Atlantic Oscillation (AO/NAO) EOF
debate is yet another example that is not resolved using (hemispheric) EOFs
(Wallace 2000; Ambaum et al. 2001, Wallace and Thompson 2002). Part of the
difficulty in interpretation may also be due to the fact that, although uncorrelated,
the PCs are not independent and this is particularly the case when the data are
not Gaussian, in which case other approaches exist and will be presented in later
chapters.
It is extremely difficult and perhaps not possible to get, using techniques based
solely on purely mathematical/statistical concepts, physical modes without prior
knowledge of their structures (Dommenget and Latif 2002) or other dynamical
constraints. For example, Jolliffe (2002, personnel communication) points out that
in general EOFs are unsuccessful to capture modes of variability, in case where
the number of variables is larger than the number of modes, unless the latter
are orthogonally related to the former. In this context we read the following
quotation9 (Everitt and Dunn, 2001 p. 305; also quoted in Jolliffe 2002, personnel
communication): “Scientific theories describe the properties of observed variables
in terms of abstraction which summarise and make coherent the properties of
observed variables. Latent variables (modes), are, in fact one of this class of
abstract statements and the justification for the use of these variables (modes) lies
not in an appeal to their “reality” or otherwise but rather to the fact that these
variables (modes) serve to synthesise and summarise the properties of the observed
variables”.
One possible way to evaluate EOFs is to compare them with a first-order
spatial autoregressive process (e.g. Cahalan et al. 1996), or more generally using
a homogeneous diffusion process (Dommenget 2007; Hannachi and Dommenget
2009). The simplest homogeneous diffusion process is given by

d
u = −λu + ν∇ 2 u + f (3.23)
dt
and is used as a null hypothesis to evaluate the modes of variability of the
data. The above process represents an extension of the simple spatial first-order
autoregressive model. In Eq. (3.23) λ and μ represent, respectively, damping and
diffusion parameters, and f is a spatial and temporal white noise process. Figure 3.11
shows the leading EOF of SST anomalies along with its PC and the time series of
the data at a point located in the south western part of the Indian Ocean. The data
span the period 1870–2005.
Figure 3.12 compares the data covariance matrix spectrum with that of a fitted
homogeneous diffusion process and suggests consistency with the null hypothesis,

9 Attributed to D.M. Fergusson, and L. J. Harwood.


3.4 Sampling, Properties and Interpretation of EOFs 57

2
PC1

−2
Jan80 Jan00 Jan20 Jan40 Jan60 Jan80 Jan00

Indian Ocean SSTA at (0.5S, 56.5E)


4

2
SSTA ( C)
o

−2

Jan80 Jan00 Jan20 Jan40 Jan60 Jan80 Jan00

Fig. 3.11 Leading EOF of the Indian Ocean SST anomalies (top), the associated PC (middle) and
the SST anomaly time series at the centre of the domain (0.5◦ S, 56.5◦ E) (bottom). Adapted from
Hannachi and Dommenget (2009)
58 3 Empirical Orthogonal Functions

Eigenvalue spectrum
60

50
Eigenvalue (%)
40

30

20

10

0
0 5 10 15
Rank

Fig. 3.12 Spectrum of the covariance matrix of the Indian Ocean SST anomalies, with the
approximate 95% confidence limits, along with the spectrum of the fitted homogeneous diffusion
process following (3.17). Adapted from Hannachi and Dommenget (2009)

particularly for the leading few modes of variability. The issue here is the existence
of a secular trend, which invalidates the test. For example, Fig. 3.13 shows the time
series distribution of the SST anomalies averaged over the Indian Ocean, which
shows significant departure from normality. This departure is ubiquitous in the basin
as illustrated in Fig. 3.14. Hannachi and Dommenget (2009) applied a differencing
operator to the data to remove the trend. Figure 3.15 shows the spectrum compared
to that of similar diffusion process of the differenced fall SST anomalies. The
leading EOF of the differenced data (Fig. 3.16), reflecting the Indian Ocean dipole,
can be interpreted as an intrinsic mode of variability.
Another possible geometric interpretation of EOFs is possible with multinormal
data. In fact, if the underlying probabilistic law generating the data matrix is the
multivariate Gaussian, or multinormal, i.e. the probability density function of the
vector x is

1 1
f (x) = p 1
exp − (x − μ)T  −1 (x − μ) , (3.24)
(2π ) 2 || 2 2

where μ and  are respectively the mean and the covariance matrix of x and || is
the determinant of , then the interpretation of the EOFs is straightforward. Indeed,
in this case the EOFs represent the principal axes of the ellipsoid of the distribution.
3.4 Sampling, Properties and Interpretation of EOFs 59

a) SST anomalies
4
SST anomalies ( C)
o

−2

Jan80 Jan00 Jan20 Jan40 Jan60 Jan80 Jan00

b) Histogram c) Quantile−quantile plot


1.5 5
SST quantiles
Frequency

1
0
0.5

0 −5
−1 0 1 −4 −2 0 2 4
o Standard Normal Quantiles
SST anomalies ( C)

Fig. 3.13 Time series of the SST anomalies averaged over the box (0–4◦ S, 62–66◦ E) (a), its
histogram (b) and its quantile-quantile (c). Adapted from Hannachi and Dommenget (2009)

Fig. 3.14 Grid points where the detrended SST anomalies over the Indian Ocean are non-
Gaussian, based on a Lilliefors test at the 5% significance level. Adapted from Hannachi and
Dommenget (2009)
60 3 Empirical Orthogonal Functions

Eigenvalue spectrum
60

50
Eigenvalue (%)
40

30

20

10

0
0 5 10 15
Rank

Fig. 3.15 Same as Fig. 3.12 but for the detrended Indian Ocean fall SST anomalies. Adapted from
Hannachi and Dommenget (2009)

Fig. 3.16 Leading EOF of the detrended fall Indian Ocean SST anomalies. Adapted from
Hannachi and Dommenget (2009)
3.6 Scaling Problems in EOFs 61

This is discussed below in Sect. 3.7. The ellipsoid is given by the isolines10 of (3.24).
Furthermore, the PCs in this case are independent.

3.5 Covariance Versus Correlation

EOFs from the covariance matrix find new variables that successively maximise
variance. By contrast the EOFs from the correlation matrix C, i.e. the sample version
of , attempt to maximise correlation instead. The correlation-based EOFs are
obtained using the covariance matrix of the standardised or scaled data matrix (2.22)
Xs = XD−1/2 , where D = diag(S). Therefore all the variables have the same
weight as far as variance is concerned. The correlation-based EOFs can also be
obtained by solving the generalised eigenvalue problem:

D−1 Sa = λ2 a, (3.25)

then u = D1/2 a is the correlation-based EOF corresponding to the eigenvalue λ2 .


Exercise Derive the above Eq. (3.25).
The individual eigenvalues of the correlation matrix cannot be interpreted in a
simple manner like the case with the covariance matrix. Both analyses yield in
general different information and different results. Consequently, there is no genuine
and systematic way of choosing between covariance and correlation, and the choice
remains a matter of individual preference guided, for example, by experience or
driven by a particular need or focus. For example, Overland and Preisendorfer
(1982) found, by analysing cyclone frequencies, that the covariance-based EOFs
provide a better measure for cyclonic frequency variability whereas correlation-
based EOFs provide a better measure to identify storm tracks, see also Wilks (2011)
for more discussions.

3.6 Scaling Problems in EOFs

One of the main features of EOFs is that the PCs of a set of variables depend
critically on the scale used to measure the variables, i.e. the variables’ units. PCs
change, in general, under the effect of scaling and therefore do not constitute a
unique characteristic of the data. This problem does not occur in general when all
the variables have the same unit. Note also that this problem does not occur when the

10 This interpretation extends also to a more general class of multivariate distributions, namely
the elliptical distributions. These are distributions whose densities are constant on ellipsoids. The
multivariate t-distribution is an example.
62 3 Empirical Orthogonal Functions

correlation matrix is used instead. This is particularly useful when one computes for
example EOFs of combined fields such as 500 mb heights and surface temperature.
Consider for simplicity two variables: geopotential height x1 at one location, and
zonal wind x2 at another location. The variables x1 and x2 are expressed respectively
in P a and ms−1 . Let z1 and z2 be the obtained PCs. The PCs units will depend on
the original variables’ units. Let us assume that one wants the PCs to be expressed
in hP a and km/ h, then one could think of either premultiply x1 and x2 respectively
by 0.01 and 3.6 then apply EOF analysis or simply post-multiply the PCs z1 and
z2 respectively by 0.01 and 3.6. Now the question is: will the results be identical?
The answer is no. In fact, if C is the diagonal scaling matrix containing the scaling
constants, the scaled variables are given by the data matrix Xs = XC whose
PCs are given by the columns of Z obtained from a SVD of the scaled data, i.e.
Xs = As ZT . Now one can post-multiply by C the SVD decomposition of X to
yield: XC = U (CV)T . What we have said above is that Z = CV, which is
true since U (CV)T is no longer a SVD of XC. This is because CV is no longer
orthogonal unless C is of the form aIp , i.e. isotropic. This is known as the scaling
problem in EOF/PCA. One simple way to get around the problem is to use the
correlation matrix. For more discussion on the scaling problem in PCA refer, for
example, to Jolliffe (2002), Chatfield and Collins (1980), and Thacker (1996).

3.7 EOFs for Multivariate Normal Data

EOFs can be difficult to interpret in general. However, there are cases in which
EOFs can be understood in a geometric sense, and that is when the data come
from a multivariate normal random variable, e.g. Y , with distribution N (μ, ) with
probability density function given by (3.24). Let λk , and ak , k = 1, . . . , p, be the
eigenvalues and associated (normalised) eigenvectors of the covariance matrix ,
i.e.  = AAT , with A = a1 , . . . , ap ,  = diag λ1 , . . . , λp and AT A = Ip .
Now from a sample data matrix X, the sample mean μ̂ and sample covariance matrix
S are maximum likelihood estimates of μ and  respectively. Furthermore, when the
eigenvalues of  are all distinct the eigenvalues and EOFs of S are also maximum
likelihood estimate (MLE) of λk , and ak , k = 1, . . . , p, respectively (see e.g.
Anderson 1984; Magnus and Neudecker 1995; Jolliffe 2002). Using the pdf f (y) of
Y , see Eq. (3.24), the joint probability density function of the PCs Z = AT (Y − μ)
is given by

 − 12  
!
p
1  zk2
p
− p2
f (z) = (2π ) λk exp − , (3.26)
2 λk
k=1 k=1

which is the product of p independent normal probability density functions. The


multivariate probability density function is constant over the ellipsoids
3.8 Other Procedures for Obtaining EOFs 63

Fig. 3.17 Illustration of a


two-dimensional Gaussian
distribution along with the
two EOFs

(y − μ)T  −1 (y − μ) = α

for a given positive


p constant α. Using the PCs coordinates the above equation
simplifies to k=1 zk2 /λk = α.
These ellipsoids have therefore the eigenvalues λk and EOFs ak , k = 1, . . . , p,
as the length and directions, respectively, of their principal axes. Figure 3.17 shows
an illustration of a two-dimensional Gaussian distribution with the two EOFs. EOFs
constitute therefore a new rotated coordinate system going through the data mean
and directed along the principal axes of the distribution ellipsoid. Note that the PCs
of a multivariate normal are independent because they are uncorrelated. This is not
true in general, however, if the data are not Gaussian and other techniques exist to
find independent components (see Chap. 12.)

3.8 Other Procedures for Obtaining EOFs

It is shown above that EOFs are obtained as the solution of an eigenvalue problem.
EOFs can also be formulated through a matrix optimisation problem.
p √ Let again X be
a n × p data matrix which is decomposed using SVD as X = k=1 λk uk vTk . The
1 p
sample covariance matrix is also written as S = n−1 T
k=1 λk uk uk . Keeping the
first r < p EOFs is equivalent to truncating the
 previous
√ sum by keeping the first r
terms to yield the filtered data matrix Xr = rk=1 λk uk vTk , and similarly for the
associated covariance matrix Sr . The covariance matrix Sr of the filtered data can
also be obtained as the solution to the following optimisation problem:

min φ (Y) = tr (S − Y)2 (3.27)


Y

over the set of positive semi-definite matrices Y of rank r (see Appendix D). So Sr
provides the
pbest approximation to S in the above sense, and the minimum is in fact
φ (Sr ) = k=r+1 λk .
The expression of the data matrix as the sum of the contribution from different
EOFs/PCs provides a direct way of filtering the data. The idea of filtering the data
64 3 Empirical Orthogonal Functions

using PCs can also be formulated through finding an approximation of X of the


form:

X = ZAT + E (3.28)

for a n × r data matrix Z and a p × r semi-orthogonal matrix A of rank r, i.e.


AT A = Ir . The matrices Z and A can be obtained indeed by minimising the error
variance: tr EE T from (3.27), i.e.
   T
min φ (Z, A) = tr X − ZAT X − ZAT . (3.29)

The solution to (3.29) is obtained (see Appendix D) for A = (u1 , . . . , ur ), and


Z = XA. The minimum of (3.29) is then given by


p
min φ (Z, A) = λk .
k=r+1

In other words A is the matrix of the first r EOFs, and Z is the matrix of the
associated PCs. This way of obtaining Z and A is referred to as the one-mode
component analysis (Magnus and Neudecker 1995), and attempts to reduce the
number of variables from p to r. Magnus and Neudecker (1995) also extend it to
two- and more mode analysis.
Remark Let xt , t = 1, . . . , n, be the data time series that we suppose to be centred,
and define zt = AT xt , where A is a p × m matrix. Let also S = Uλ2 UT be the
decomposition of the samplecovariance matrix into eigenvectors U = u1 , . . . , up ,
and eigenvalues λ2 = diag λ21 , . . . , λ2p , and where the eigenvalues are arranged
in decreasing order. Then the following three optimisation problems are equivalent
in that they yield the same solution:
• Least square sum of errors of reconstructed data (Pearson 1901), i.e.
 n 
 
m
min xt − AA xt T 2
=n λ2k .
A
t=1 k=1

• Maximum variance in the projected space, subject to orthogonality (Hotelling


1933), i.e.
  n 
 
m
max tr xt xTt =n λ2k .
AT A=Ip
t=1 k=1

• Maximum mutual information, based on normality, between the original random


variables x and their projection z, generating respectively xt and zt , t = 1, . . . , n,
3.9 Other Related Methods 65

(Kapur 1989, p. 502; Cover and Thomas 1991), i.e.


m
1 !
max [I (z, x)] = Log 2π eλ2k ,
A 2
k=1
 
where I (z, x) = E Log fxf(x)f (x,z)
z (z) is the mutual information of z and
x (Chap. 12), fx () and fz () are respectively the marginal probability density
functions of x and z, and f (x, z) is the joint pdf of x and z. The application to the
sample is straightforward and gives a similar expression. All these optimisations
yield the common solution, namely, the leading m EOFs, A = (u1 , . . . , um ),
of S.

3.9 Other Related Methods

3.9.1 Teleconnectivity

Teleconnection maps (Wallace and Gutzler 1981) are obtained using one-point
correlation where a base point is correlated to all other points. A teleconnection
map is simply a map of row (or column) of the correlation matrix C = (cij ) and
is characterised particularly by a (nearly) elliptical region of positive correlation
around the base point with correlation one at the base point, featuring a bullseye
to use the term of Wallace and Gutzler (1981). Occasionally, however, this main
feature can be augmented by another centre with negative correlations forming
hence a dipolar structure. It is this second centre that makes a big difference
between base points. Figure 3.18 shows an example of correlation between All

45° N

30° N
0° 30° E

−30 −25 −20 −15 −10 −5 0 5 10 15 20 25 30 35 40 45

Fig. 3.18 Correlation between All India Rainfall (AIR) and September Mediterranean evaporation
66 3 Empirical Orthogonal Functions

India monsoon Rainfall (AIR) index, a measure of the Asian Summer Monsoon
strength, and September Mediterranean evaporation. AIR11 is an area-averaged
of 29 subdivisional rainfall amounts for all months over the Indian subcontinent.
The data used in Fig. 3.18 is for Jun–Sep (JJAS) 1958–2014. There is a clear
teleconnection between Monsoon precipitation and Mediterranean evaporation with
an east–west dipole. Stronger monsoon precipitation is normally associated with
stronger (weaker) evaporation over the western (eastern) Mediterranean and vice
versa. There will be always differences between teleconnection patterns even
without the second centre. For instance some will be localised and others will be
spread over a much larger area. One could also have more than one positive centre.
Using the teleconnection map, one can define the teleconnectivity Ti at the ith
grid point and is defined by

Ti = − min cij .
j

The obtained teleconnectivity map is a special pattern and provides a simple way to
locate regions that are significantly inter-related in the correlation context.
The idea of one-point correlation can be extended to deal with linear relationships
between fields such as SST variable at a given grid point, or even any climate index,
correlated with another field such as geopotential height. These simple techniques
are widely used in climate research and do reveal sometimes interesting features.

3.9.2 Regression Matrix

Another alternative way to using the correlation or the covariance matrices is


to use the regression matrix R = (rij ), where rij is the regression coefficient
obtained from regressing the jth grid point onto the ith grid point. The “EOFs”
of the regression matrix R = D−1 S are the solution to the generalised eigenvalue
problem:

D−1 Sv = λ2 v (3.30)

The regression matrix is no more symmetric; however, it is still diagonalisable.


Furthermore R and S have the same spectrum.
Exercise Show that R and S have the same spectrum and compute the eigenvectors
of R.
Answer The above generalised eigenvalue problem can be transformed (see, e.g.
1 1 1
Hannachi (2000)) to yield D− 2 SD− 2 a = λ2 a, where a = D 2 v. Hence the spectrum

11 http://www.m.monsoondata.org/india/allindia.html.
3.9 Other Related Methods 67

1 1
of R is the same as that of the symmetric matrix D− 2 SD− 2 . Furthermore, from
1 1 1 1 T 1
the SVD of S we get D− 2 SD− 2 = D− 2 U2 D− 2 U . Now since T = D− 2 U
1 1
is orthogonal (but not unitary), 2 provides the spectra of D− 2 SD− 2 , and the
1
eigenvectors of R are vk = D− 2 ak where ak , k = 1, . . . p are the eigenvectors
1 1
of D− 2 SD− 2 .
Remark The EOFs of the correlation matrix C are linearly related to the regression-
1 1
based EOFs. The correlation matrix is C = D− 2 SD− 2 , and therefore the eigenvec-
1
tors vk of R are related to the eigenvectors ak of C by vk = D− 2 ak , k = 1, . . . p.

3.9.3 Empirical Orthogonal Teleconnection

Empirical orthogonal teleconnection (EOT) is reminiscent of the teleconnectivity


map for a chosen base point. The method finds that specific base point in space that
explains as much as possible of the variance of all other points (van den Dool et al.
2000). For example, the first EOT consists of the regression coefficients between the
selected base point and all other points. The remaining EOTs are obtained similarly
after removing the effect of the base point by linearly regressing every grid point
onto it.
The EOT algorithm is iterative and chooses such base grid points successively
based on how well the grid point can explain residual variations at all other grid
points. As pointed out by Jolliffe (2002, personal communication) the results are
difficult to interpret because the regressions are made on non-centred data.

3.9.4 Climate Network-Based Methods

The analysis based on similarity matrices, such as covariance or correlation matrices


for single or coupled (see Chap. 15) fields, can be transformed into binary similarity
or adjacency matrices, which are interpreted in terms of climate networks (Tsonis
and Roebber 2004; Tsonis et al. 2006; Donges et al. 2009, 2015). In this framework,
the grid points of the analysed field are taken as nodes of a network. Compared
to linear methods, which rely effectively on dimensionality reduction, the wisdom
behind these network techniques is that they allow full exploration of the complexity
of the inter-dependencies in the data. Donges et al. (2015), for example, argue
that the climate networks can provide additional information to those given by
standard linear methods, particularly on the higher order structure of statistical inter-
relationships in climate data.
If S = (Sij ) designates the pairwise measure of a given statistical association,
e.g. correlation, the climate network adjacency matrix A = (Aij ) is given by
68 3 Empirical Orthogonal Functions

Aij = 1{Sij −Tij ≥0} (1 − δij ), where 1X is the indicator function of set X, δij is
the Kronecker symbol and Tij is a threshold parameter, which may be constant.
Note that self-interactions are not included in the adjacency matrix. A number
of parameters are then defined from this adjacency matrix, such as closeness and
betweenness, which can be compared to EOFs, for example, and identify hence
processes and patterns which are not accessible from linear measures of association.
Examples of those processes include synchronisation of climatic extreme events
(Malik et al. 2012; Boers et al. 2014), and reconstruction of causal interactions, from
a statistical information perspective, between climatic sub-processes (e.g. Ebert-
Uphoff and Deng 2012; Runge et al. 2012, 2014). More discussion is given in
Chap. 7 (Sect. 7.7) in relation to recurrence networks.
An example of connection is shown in Fig. 3.19 (left panel) based on monthly
sea level pressure field (1950–2015) from NCEP/NCAR reanalysis. The figure
shows connections between two locations, one in Iceland (60N, 330E) and the
other in north east Pacific (30N, 220E), and all other grid points. A connection
is defined when the correlation coefficient is larger than 0.3. Note, in particular,
the connections between the northern centre of the NAO (around Iceland) and
remote places in the Atlantic, North Africa and Eurasia. The middle panel of
Fig. 3.19 shows a measure of the total number of connections at each grid point.
It is essentially proportional to the fraction of the total area that a point is connected
to (Tsonis et al. 2008, 2008). This is similar to the degree defined in climate network,
see e.g. Donges et al. (2015). High connections are located in the NAO and PNA
regions, and also Central Asia. Note that if, for example, the PNA is removed from
the SLP field (e.g. Tsonis et al. 2008), by regressing out the PNA time series, then
the total number of connections (Fig. 3.19, left panel) mostly features the NAO
pattern. We note here that the PNA pattern is normally defined, and better obtained
with, the geopotential height anomalies at mid-tropospheric level, and therefore
results with, say 500-hPa geopotential heights, gives clearer pictures (Tsonis et al.
2008).

Fig. 3.19 Connections between two points (one in Iceland and one in north east Pacific) and all
other gridpoints for which monthly SLP (1950–2015) correlation is larger than 0.3 superimposed
on the SLP climatology (left), total number of connections (see text for details) defined at each
grid point (middle), and same as middle but when the PNA time series was regressed out from the
SLP field (right). Units (left) hPa
3.9 Other Related Methods 69

Another network-related method was also applied by Capua et al. (2020),


based on causal inter-dependencies, using the so-called causal effect networks
(CEN). Broadly speaking, CEN generalises correlation analysis by removing the
confounding autocorrelation and common source effects. Capua et al. (2020)
applied CEN to the analysis of tropical-midlatitude relationships. They point out,
in particular, the general two-way causal interaction, with occasional time scale
dependency of the causal effect.
Chapter 4
Rotated and Simplified EOFs

Abstract This chapter describes further the drawbacks of EOFs mentioned in


Chap. 3. It also provides different ways to overcome those drawbacks, including
EOF rotation and simplified EOFs. A number of applications to climate data are
also provided.

Keywords Simplification · Rotation · Varimax · Quartimin ·


LASSO-algorithm · Ordinary differential equations · North Atlantic Oscillation

4.1 Introduction

In the previous chapter we have listed some problems that can be encountered
when working with EOFs, not least the physical interpretation caused mainly by
the geometric constraint imposed upon EOFs and PCs, such as orthogonality,
uncorrelatedness, and domain dependence. Physical modes are inter-related and
tend to be mostly non-orthogonal, or correlated. As an example, normal modes
derived from linearised physical models (Simmons et al. 1983) are non-orthogonal,
and this does not apply to EOFs. Furthermore, EOFs tend to be size and shape
domain-dependent (Horel 1981; Richman 1986, 1993; Legates 1991, 1993). For
instance, the first EOF pattern tends to have wavenumber one sitting on the whole
domain. The second EOF, on the other hand, tends to have wavenumber two and
be orthogonal to EOF1 regardless of the nature of the physical process involved
in producing the data, and this applies to subsequent EOFs. In his detailed review,
Richman (1986) maintains that EOFs exhibit four characteristics that hamper their
utility to isolate individual modes of variation. These are
• domain dependence,
• subdomain instability,
• sampling problems and
• inaccurate relationship to physical phenomena.

© Springer Nature Switzerland AG 2021 71


A. Hannachi, Patterns Identification and Data Mining in Weather and Climate,
Springer Atmospheric Sciences, https://doi.org/10.1007/978-3-030-67073-3_4
72 4 Rotated and Simplified EOFs

If the objective of EOFs is to reduce the data dimension, then the analysis can be
acceptable. If, however, one is looking to isolate patterns for physical interpretation,
then clearly as stated above EOFs may not be the best choice.
To overcome some of the drawbacks caused by the geometric constraints,
researchers have looked for an alternative through linear transformation of the
EOFs. The concept of rotation emerged in factor analysis and has been proposed
since the work of Thurstone (1947) in social science. In atmospheric science, rotated
EOFs (REOFs) have been applied nearly three decades later and continue to be
widely used (Horel 1981; Richman 1981, 1986; Preisendorfer and Mobley 1988;
Cheng et al. 1995). The review of Richman (1986) provides a particularly detailed
discussion of the characteristics of unrotated EOFs. REOFs yield simpler structures,
compared to EOFs, by rotating the vector of loadings or EOFs hence losing some
of the nice geometric properties of EOFs, in favour of yielding better interpretation.
REOFs, however, have their own shortcomings such as how to choose the number
of EOFs to be rotated and the rotation criteria that specify the simplicity.
The objective of pattern simplicity is manifold. Most important is perhaps the
fact that simple patterns avoid the trap of mixing, which is a main feature of EOFs.
Simple patterns and their time amplitude series cannot be spatially orthogonal
and temporally uncorrelated simultaneously. Furthermore, propagating planetary
waves Hoskins and Karoly 1981 tend to follow wave guides (Hoskins and Ambrizzi
(1993); Ambrizzi et al. (1995)) because of the presence of critical lines (Held 1983;
Killworth and McIntyre 1985). Physically relevant patterns are therefore expected
to be more local or simple, i.e. with zeros outside the main centres of action.
A number of alternatives have been developed to construct simple structure
patterns without compromising the nice properties of EOFs, namely variance
maximisation and space–time orthogonality (Jolliffe et al. 2002; Trendafilov and
Jolliffe 2006; Hannachi et al. 2006). This chapter discusses these methods and their
usefulness in atmospheric science.

4.2 Rotation of EOFs

4.2.1 Background on Rotation

Horel (1981) and Richman (1981, 1986) argued that EOFs can be too non-local
and dependent on the size and the shape of the spatial domain. Thurstone (1947, p.
360–61) applied rotated factors and pointed out that, invariance or constancy of a
solution, e.g. factors or EOFs, when the domain changes is a fundamental necessity
if the solution is to be physically meaningful (see also Horel 1981). The previous
problems encountered with EOFs have led atmospheric researchers to geometrically
transform EOFs by introducing the concept of rotation in EOF analysis.
Rotated EOF (REOF) technique is based on rotating the EOF patterns or the PCs,
and has been adopted by atmospheric scientists since the early 1980s (Horel 1981;
4.2 Rotation of EOFs 73

Richman 1981, 1986). The technique, however, is much older and goes back to the
early 1940s when it was first suggested and applied in the field of social science1
(Thurstone 1940, 1947; Carroll (1953)). The technique is also known in factor
analysis as factor rotation and aims at getting simple structures. In atmospheric
science the main objective behind rotated EOFs is to obtain
• a relaxation of some of the geometric constraints
• simple and more robust spatial patterns
• simple temporal patterns
• an easier interpretation.
In this context simplicity refers in general to patterns with compact/confined
structure. It is in general accepted that simple/compact structures tend to be more
robust and more physically interpretable. To aid interpretation one definition of
simplicity is to drive the EOF coefficients (PC loadings) to have either small or
large magnitude with few or no intermediate values. Rotation of EOFs, among other
approaches, attempts precisely to achieve this.

4.2.2 Derivation of REOFs

Simply put, rotated EOFs are obtained by applying a rotation to a selected set of
EOFs explaining say a given percentage of the total variance. Rotation has been
applied extensively in social science and psychometry, see for example Carroll
(1953), Kaiser (1958), and Saunders (1961), and later in atmospheric science (e.g.
Horel 1981; Richman 1986). Let us denote by Um the p × m matrix containing
the first m EOFs u1 , u2 , . . . um that explain a given amount of variance, i.e. Um =
(u1 , u2 , . . . um ). Rotating these EOFs yields m rotated patterns Bm given by

Bm = Um R = (b1 , b2 , . . . , bm ) , (4.1)

where R = (rij ) is a m×m rotation matrix. The obtained patterns bk = m j =1 rj i uj ,
k = 1, . . . m are the rotated EOFs (REOFs). In (4.1) the rotation matrix R has to
satisfy various constraints that reflect the simplicity criterion of the rotation, which
will be discussed in the next section.
As for EOFs, the amplitudes or the time series associated with the REOFs are
also obtained by projecting the data onto the REOFs, or equally by similarly rotating
the PCs matrix using the same rotation matrix R. The rotated principal components
C = (c1 , c2 , . . . , cm ) are given by

1 Beforethe availability of high speed computers, pattern rotation used to be done visually, which
made it somehow subjective because of the lack of a quantitative measure and the possibility of
non-reproducibility of results.
74 4 Rotated and Simplified EOFs

C = XBm = VUT Um R = Vm m R, (4.2)

where Vm is the matrix of the leading (standardised) PCs, and m is the diagonal
matrix containing the leading m singular values. It is also clear from (4.1) that

Bm BTm = RRT (4.3)

and therefore the rotated patterns will be orthonormal if and only if R is unitary,
i.e. RRT = Im . In this case the rotation is referred to as orthogonal, otherwise it is
oblique.
From Eq. (4.2) we also get a similar result for the rotated PCs. The covariance
matrix of the rotated PCs is proportional to

C T C = RT 2m R. (4.4)

Equation (4.4) shows that if the rotation is orthogonal the rotated PCs (RPCs)
are no longer uncorrelated. If one choses the RPCs to be uncorrelated, then the
REOFs are non-orthogonal. In conclusion REOFs and corresponding RPCs cannot
be simultaneously orthogonal and uncorrelated respectively. In summary rotation
compromises some of the nice geometric properties of EOFs/PCs to gain perhaps a
better interpretation.

4.2.3 Computing REOFs


Rotation or Simplicity Criteria

Rotation of the EOF patterns can systematically alter the structures of EOFs. By
constraining the rotation to maximise a simplicity criterion the rotated EOF patterns
can be made simple in the literal sense. Given a p ×m matrix Um = (u1 , u2 , . . . um )
of the leading m EOFs (or loadings), the rotation is formally achieved by seeking a
m × m rotation matrix R to construct the rotated EOFs B given by Eq. (4.1): The
criterion for choosing the rotation matrix R is what constitutes the rotation algorithm
or the simplicity criterion, and is expressed by the maximisation problem:

max f (Um R) (4.5)

over a specified subset or class of m × m square rotation matrices R. The functional


f () represents the rotation criterion. Various rotation criteria exist in the literature
(Harman, 1976; Reyment and Jvreskog 1996). Richman (1986), for example, lists
five simplicity criteria. Broadly speaking there are two large families of rotation:
orthogonal and oblique rotations. In orthogonal rotation (Kaiser 1958; Jennrich
2001) the rotation matrix R in (4.1) is chosen to be orthogonal, and the problem
is to solve (4.5) subject to the condition:
4.2 Rotation of EOFs 75

RRT = RT R = Im , (4.6)

where Im is the m × m identity matrix. In oblique rotation (Jennrich, 2001; Kiers


1994) the rotation matrix R is chosen to be non-orthogonal.
Various simplicity criteria exist in the literature such as the VARIMAX and
QUARTIMAX discussed below. Chapter 10 contains more rotation criteria. The
most well known and used rotation algorithm is the VARIMAX criterion (Kaiser
1958, see also Krzanowski and Marriott 1994). Let us designate by bij , i = 1, . . . p,
and j = 1, . . . m, the elements of the rotated EOFs matrix B in (4.1), i.e. bij = [B]ij ,
then the VARIMAX orthogonal rotation maximises a simplicity criterion according
to:
⎛ ⎡ ⎛ ⎞2 ⎤⎞
⎜ m
⎢ p 
p
⎥⎟
max ⎝f (B) = ⎣p bj4k − ⎝ bj2k ⎠ ⎦⎠ , (4.7)
k=1 j =1 j =1

where m is the number of EOFs chosen for rotation. The quantity inside the square
brackets in (4.7) is proportional to the (spatial) variance of the square of the rotated
T
vector bk = b1k , . . . , bpk . Therefore the VARIMAX attempts to simplify the
structure of the patterns by tending the loadings coefficients towards zero, or ±1. In
various cases, the loading of the rotated EOFs B are weighted by the communalities
of the different variables (Walsh and Richman 1981). The communalities h2j ,

j = 1, . . . p, are directly proportional to m 2
k=1 aj k , i.e. the sum of squares of the
−1/2
loadings for a particular variable (grid point). Hence if C = Diag Um UTm ,
then in the weighted or normalised VARIMAX, the matrix B in (4.7) is simply
replaced by BC. This normalisation is generally used to reduce the bias toward the
first EOF with the largest eigenvalue.
Another familiar orthogonal rotation method is based on the QUARTIMAX
simplicity criterion (Harman 1976). It seeks to maximise the variance of the
patterns, i.e.
⎡ ⎤2
1 ⎣ 2 1  2 ⎦
m p m p
f (B) = bj k − bj k . (4.8)
mp mp
k=1 j =1 k=1 j =1

Because of the orthogonality property (4.6) required by R, the rotated EOFs matrix
also satisfies BT B = Im . Therefore the sum of the squared elements of B is constant,
and the QUARTIMAX simply boils down to maximising the fourth-order moment
of the loadings, hence the term QUARTIMAX, and is based on the following
maximisation problem:
76 4 Rotated and Simplified EOFs

⎡ ⎤
1 
m 
p
max ⎣f (B) = .bj4k ⎦ . (4.9)
mp
k=1 j =1

Equations (4.7) or (4.9) are then to be optimised subject to the orthogonality


constraint (4.6). The VARIMAX is in general preferred to the QUARTIMAX
because it is slightly less sensitive to changes in the number of variables (Richman
1986), although the difference in practice is not significant.
In oblique rotation the matrix R need not be orthogonal and in general the
problem to be solved is

max [f (Bm = Um R)]


(4.10)
subject to Diag RT R = Im ,

where f () is the chosen simplicity criterion. Most rotations used in atmospheric


science are orthogonal. Few oblique rotations (Richman 1981) have been used in
atmospheric science perhaps because the algorithms involved are slightly more
complex. The QUARTIMIN (Harman 1976) is the most widely used oblique
rotation particularly in fields such as psychometry. In QUARTIMIN the simplicity
criterion is directly applied not to the rotated patterns themselves but to the
transformed EOFs using the inverse matrixR−T of RT , i.e. Um R−T . Denoting
 again
by b1 , . . . , bm the rotated patterns (using m a 2 ), i.e. b = U R−T , then in
k=1 j k ij m ij
this case the rotation matrix R is obtained using the following optimisation criterion:
⎡ ⎤

min ⎣f (Um R−T ) = b2ir b2is ⎦ (4.11)
r=s i

subject to the 2nd equation in (4.10).

Computation of REOFs

Since the criterion to be optimised is non-quadratic and cannot be performed


analytically, numerical methods have to be applied. There are various algorithms
to minimise (or equivalently maximise) a multivariate function f (x1 , x2 , . . . , xm ).
There are two classes of minimisations: constrained and unconstrained. Any con-
strained minimisation problem can be transformed, using e.g. Lagrange multipliers,
to yield an unconstrained problem. Appendix E reviews various algorithms of
optimisation. For the rotation problem the constraints are relatively simple since
they are equalities. Our problem is of the following form:
4.2 Rotation of EOFs 77

min f (x)
(4.12)
s.t. gk (x) = ck , k = 1, . . . p,

where f (.) and gk (.), k = 1, . . . , p, are multivariate function. To solve the


above problem one first introduces a new extended multivariate function, called
Lagrangian, given by


p
H (x, λ) = f (x) + λk (gk (x) − ck ) = f (x) + λT g, (4.13)
k=1

where g(x) = (g1 (x) − c1 , . . . , gp (x) − cp )T and λ = (λ1 , . . . , λp )T , which


represents the Lagrange multipliers. Next the optimisation of the unconstrained
problem (4.13) is carried out with respect to x and λ, see Appendix E for details
on how to solve (4.13).
An example of the application of VARIMAX to winter monthly mean SLP over
the northern hemisphere, north of 20◦ N is shown in Fig. 4.1. The data come from
NCEP/NCAR reanalyses and are described in Chap. 3. Figure 4.1 shows three
REOFs obtained by rotating m = 6 SLP EOFs. It can be noted that REOF1 and
REOF2 reflect features of the NAO. The northern centre of action is better reflected
with REOF1 and the southern centre of action is better reflected with REOF2. The
Pacific pattern is shown in REOF3, which is quite similar to EOF2 (Fig. 3.5).
Another example, with m = 20, is shown in Fig. 4.2. The sensitivity to the
parameter m can be seen. The reason behind this sensitivity is that the EOFs have all
the same weight when the rotation is applied. A simple way out is to weigh the EOFs
by the square root of the corresponding eigenvalues. The obtained rotated patterns
are shown in Fig. 4.3, which is quite different. Figures 4.3a,b,c show, respectively,
clear signals of the NAO, the Pacific pattern and the Siberian high (Panagiotopoulos
et al. 2005). These rotated patterns are also quite robust to changes in m.
It is noted by Hannachi et al. (2006, 2007) that orthogonal rotation is computa-
tionally more efficient than oblique rotation to matrix inversion. These authors also
found that orthogonal and oblique rotations of (non-weighted or non-scaled) EOFs
produce quite similar results. Figure 4.4 shows a scatter plot of rotated loadings
using VARIMAX versus rotated loadings using QUARTIMIN, with m = 30. A
similar feature is also obtained using other rotation criteria.
Matlab computes EOF rotation using different rotation criteria. If EOFs is a
matrix containing say m EOFs, i.e. EOF s(p12, m), then the varimax rotated EOFs
are given in REOFs:

>> REOFs = rotatefactors (EOFs, ’Method’,’varimax’);


78 4 Rotated and Simplified EOFs

Fig. 4.1 VARIMAX rotated


EOFs using the leading
m = 6 winter SLP EOFs. The
order shown is based on the
variance of the corresponding
time series. Positive contours
solid, and negative contours
dashed. (a) REOF1. (b)
REOF3. (c) REOF4. Adapted
from Hannachi et al. (2007)
4.2 Rotation of EOFs 79

Fig. 4.2 Same as Fig. 4.1 but


using m = 20. (a) REOF1.
(b) REOF6. (c) REOF10.
Adapted from Hannachi et al.
(2007)
80 4 Rotated and Simplified EOFs

Fig. 4.3 Leading three


VARIMAX rotated EOFs
obtained based on the leading
m = 20 EOFs weighted by
the square root of the
corresponding eigenvalues.
(a) REOF1 (m = 20). (b)
REOF2 (m = 20). (c) REOF3
(m = 20). Adapted from
Hannachi et al. (2007)
4.3 Simplified EOFs: SCoTLASS 81

Fig. 4.4 Scatter plot of VARIMAX REOFs versus QUARTIMIN REOFs using m = 30 EOFs.
Note that scatter with negative slopes correspond to similar REOFs but with opposite sign. Adapted
from Hannachi et al. (2007)

4.3 Simplified EOFs: SCoTLASS

4.3.1 Background

REOFs have been introduced mainly to improve interpretation through obtaining


simpler patterns than EOFs. Building objective simplicity criteria, however, turns
out to be a difficult problem. In fact, Jolliffe et al. (2002) point out that concentrating
the EOF coefficients close to 0 or ±1 is not the only possible definition of simplicity.
For example, a pattern with only ones is simple though it could rarely be of much
interest in atmospheric science. Although REOFs attempt to achieve this using a
simple and practical criterion they have a number of difficulties which make the
method quite controversial (Richman 1986, 1987; Jolliffe 1987, 1995; Mestas-
Nuñez 2000).
When we apply the rotation procedure we are usually faced with the following
questions:
• How to fix the number of EOFs or PCs to be rotated?
• What type of rotation, e.g. orthogonal or oblique, should be used?
• Which of the large number of simplicity criteria should be used?
• How to choose the normalisation constraint (Jolliffe 1995)?
82 4 Rotated and Simplified EOFs

Another problem in REOFs, not often stressed in the literature, is that after rotation
the order is lost, and basically all REOFs become equivalent2 in that regard. It is
clear that addressing some of these concerns will depend to some extent on what the
rotated patterns will be used for. A simplification technique that can overcome most
of these problems, and which in the meantime retains some of the nice properties of
EOFs, is desirable. Such a technique is described next.

4.3.2 LASSO-Based Simplified EOFs

Various simplification techniques have been suggested, see also Jolliffe (2002,
chapter 11). Most of these techniques attempt to reduce the two-step procedure of
rotated PCA into just one step. Here we discuss a particularly interesting method
of simplicity that is rooted in regression analysis. A common problem that arises
in multiple linear regression is instability of regression coefficients because of
colinearity or high dimensionality. Tibshirani (1996) has investigated this problem
and proposed a technique known as the Least Absolute Shrinkage and Selection
Operator (LASSO). In a least-square multiple linear regression:

y = Xβ + ε,

where the parameters β = (β1 , . . . , βp )T are estimated by minimising


n   2
(y − Xβ)T (y − Xβ) = t=1 yt − j j tj , the additional constraint
β x
p
j =1 |βj | ≤ τ has, for suitable choices of τ , the property of shrinking some
of the regression coefficients to zero. The LASSO approach attempts to shrink
some regression coefficients exactly to zero, hence implicitly selecting variables.
The same idea was adapted in PCA later by Jolliffe et al. (2003) who used it to
shrink loadings to zero. They label it Simplified Component Technique-LASSO
(SCoTLASS). For notational convenience, however, and to keep the acronym short
we refer to the SCoTLASS EOF method as simplified3 EOFs (SEOFs).
The SEOF method attempts to use the main properties of EOFs and REOFs
simultaneously by successively maximising variance and constraining the patterns
to be orthogonal and simple. Hence the objective of SEOFs is to seek directions
T
ak = ak1 , ak2 , . . . , akp , k = 1, . . . , p, maximising the quadratic function:

F (ak ) = aTk Sak (4.14)

subject to

2 Anexception is with weighted EOFs, see Sect. 4.2.3


3 Thereader should note the use of adjectives “simple” and “simplified” to describe other different
techniques in the literature.
4.3 Simplified EOFs: SCoTLASS 83

aTk al = δkl . (4.15)

To achieve simplicity the LASSO technique requires the following extra constraint
to be satisfied (Jolliffe et al. 2003):


d
ak 1 = |akj | = aTk sign(ak ) ≤ τ (4.16)
j =1

for some tuning parameter τ . In Eq. (4.16) sign(ak ) = (sign(ak1 ), . . . , sign(akp ))T
p  2
2 = 1 ≤ p
is the sign of ak . Because j =1 akj j =1 |akj | , it is clear that the
optimisation problem (4.14–4.16) is only ppossible for τ ≥ 1. Furthermore, since
a 1 is maximised over the unit sphere, i=1 ai2 = 1, only when all the components
√ √
are equal we get a 1 ≤ p. Hence if τ ≥ p we regain conventional EOFs.
Consequently EOFs can be regarded as a particular case of SEOFs. Figure 4.5
shows an example of the leading two SEOFs obtained with a threshold parameter
τ = 8. These patterns are orthogonal and they represent respectively the NAO and
the Pacific patterns. The centres of action are quite localised. These centres get
broader as τ increases. This is discussed in the next section.

4.3.3 Computing the Simplified EOFs

Since the optimisation problem (4.14–4.16) is non-quadratic and nondifferentiable


due to the LASSO condition (4.16), the solution can only be obtained numerically
using a suitable descent algorithm. The nondifferentiability condition (4.16) is
a particular nuisance for the optimisation, and it is desirable to smooth it out.
Trendafilov and Jolliffe (2006) used the fact that tanh(x) ∼ |x|
x = sign(x) for large
values of |x|, and transformed (4.16) to yield a smooth constraint


d
aTk tanh (γ ak ) − τ = akj tanh γ akj − τ ≤ 0 (4.17)
j =1

for some fixed large number γ . The problem (4.14–4.16) was solved by Trendafilov
and Jolliffe (2006), see also Hannachi et al. (2005), using the projected gradient
approach (Gill et al. 1981). To ease the problem further and to make it look like the
standard EOF problem (4.14–4.15), the nonlinear condition (4.17) is incorporated
into the function F () in Eq. (4.14) as an exterior penalty function, see e.g. Gill et al.
(1981). This means that this condition will be explicitly taken into account only if it
is violated. Hence if we designate by

Pe (x) = max(0, x),


84 4 Rotated and Simplified EOFs

Fig. 4.5 Leading two


simplified EOFs of the winter
SLP anomalies using τ = 8.
(a) SEOF1 (τ = 8). (b)
SEOF2 (τ = 8). Adapted
from Hannachi et al. (2006)

the exterior penalty function, then condition (4.17) can be incorporated into (4.14)
to yield the extended objective function:

1 T  
Fμ (ak ) = ak Sak − μPe aTk tanh(γ ak ) − τ (4.18)
2
to be maximised, and where μ is a large positive number. It is now clear from (4.18)
that (4.17) is not taken into account if it is satisfied, but when it is positive it is
penalised and is sought to be minimised. Note again that (4.18) is not differentiable,
and to make it so we use the fact that max(x, y) = 12 (x + y + |x − y|), and hence
4.3 Simplified EOFs: SCoTLASS 85

the exterior penalty function is replaced by P (x) = 12 x [1 + tanh(γ x)]. Hence the
smooth objective function to maximise becomes:

1 T  
Fμ (ak ) = ak Sak − μP aTk tanh(γ ak ) − τ (4.19)
2
subject to the orthogonality condition (4.15). Figure 4.6 shows a plot of Fμ (EOF 1)
versus γ for μ = 1000, where EOF1 is the leading EOF of the winter SLP field. The
function becomes independent of γ for large values of this parameter. Hannachi et
al. (2006) found that the solution is invariant to changes in μ (for large values).
Various methods exist to solve the nonlinear constrained maximisation prob-
lem (4.15) and (4.19), such as steepest ascent, and projected/reduced gradient
methods (Gill et al. 1981). These methods look for linear directions of ascent
to achieve the optimum solution. In various problems, however, the search for
suitable step sizes (in line search) can be problematic particularly when the objective
function to be maximised is not quadratic, for which the algorithm can converge to
the wrong local maximum.
An elegant alternative approach to the linear search method is to look for a
smooth curvilinear trajectory to achieve the optimum. For instance the minimum
of an objective function F (x) can be achieved by integrating the system of ordinary
differential equations (ODE)

dx
= −∇F (x) (4.20)
dt

Fig. 4.6 Function Fμ (EOF 1) versus γ for μ = 1000. EOF1 is the leading EOF of winter SLP
anomalies. Adapted from Hannachi et al. (2006)
86 4 Rotated and Simplified EOFs

forward in time for a sufficiently long time using a suitably chosen initial condition
(Evtushenko 1974; Botsaris 1978; Brown 1986). In fact, if x∗ is an isolated local
minimum of F (x), then x∗ is a stable fixed point of the dynamical system (4.20),
see e.g. Hirsch and Smale (1974), and hence can be reached by integrating (4.20)
from some suitable initial condition. Such methods have been around since the
mid 1970 (Evtushenko 1974; Botsaris and Jacobson 1976, 1978) and can make
use of efficient integration algorithms available for dynamical systems. Trajectories
defined by second-order differential equations have also been suggested (Snyman
1982).
In the presence of constraints the gradient of the objective function to be
minimised (or maximised) has to be projected onto the tangent space of the feasible
set, i.e. the manifold or hypersurface satisfying the constraints (Botsaris 1979, 1981;
Evtushenko and Zhadan 1977; and Brown 1986). This is precisely what projected
gradient stands for (Gill et al. 1981). Now if Ak−1 = (a1 , a2 , . . . , ak−1 ), k ≥ 2,
is the set of the first k − 1 SEOFs, then the next kth SEOF ak has to satisfy the
following orthogonality constraints:

aTk al = δkl , l = 1, . . . k. (4.21)

Therefore the feasible set is simply the orthogonal complement to the space spanned
by the columns of Ak−1 . This can be expressed conveniently using projection
operators. In fact, the following matrix:


k−1
πk = Id − al aTl (4.22)
l=1

provides the projection operator onto this space. Furthermore, the condition aTk ak =
1 is equivalent to (Id − ak aTk )ak = 0. Therefore the projection onto the feasible set
is achieved by applying the operator πk (Id − ak aTk ) to the gradient of the objective
function (4.19). Hence the solution to the SEOF problem (4.14–4.16) is provided by
the solution to the following system of ODEs:

d  
ak = πk Id − ak aTk ∇Fμ (ak ) = πk+1 ∇Fμ (ak ). (4.23)
dt
The kth SEOF ak is obtained as the limit, when t → ∞, of the solution to Eq. (4.23).
This approach has been successfully applied by Jolliffe et al. (2003) and Trendafilov
and Jolliffe (2006) to a simplified example, and by Hannachi et al. (2005) to the sea
level pressure (SLP) field.
Figure 4.7 shows the leading SLP SEOF for τ = 18. The patterns get broader
as τ increases. The SEOF patterns depend on τ and they converge to the EOFs
as τ increases as shown in Fig. 4.8. Figure 4.9 shows the third SLP EOF pattern
for τ = 12 and τ = 16 respectively. For the latter value the pattern becomes
4.3 Simplified EOFs: SCoTLASS 87

Fig. 4.7 As in Fig. 4.5 but


with τ = 18. (a) SEOF1. (b)
SEOF2. Adapted from
Hannachi et al. (2006)

nearly hemispheric. The convergence of SEOF1 to EOF1, as shown in Fig. 4.7, starts

around τ = 23 p
Hannachi et al. (2006) modified slightly the above system of ODEs. The kth
SEOF is obtained after removing the effect of the previous k − 1 SEOFs by
computing the residuals:


k−1
Yk = X Id − al aTl = Xπk (4.24)
l=0
88 4 Rotated and Simplified EOFs

Fig. 4.8 Variance ratio of simplified PC1 to that of PC1 versus parameter τ . Adapted from
Hannachi et al. (2007)

with the convention a0 = 0. The covariance matrix of the residuals is


 
1 T 
k−1 
k−1
Sk = Yk Yk = Id − al aTl S Id − al aTl . (4.25)
n
l=0 l=0

The k’th SEOF ak is then obtained as an asymptotic limit when t tends to infinity ,
i.e. stationary solution to the dynamical system:

d  
ak = Id − ak aTk ∇Fμ(k) (ak ), (4.26)
dt

where Fμ(k) is defined as in (4.19) except that S is replaced by Sk .


Remark The variety of simplicity criteria that can be used in REOFs is appealing
from a conceptual point of view. However, and unless the simplicity criteria are
chosen to reflect physical significance of the patterns the approach remains ad hoc.
Examples of criteria that are physically relevant include flow tendency induced
by the patterns using e.g. various simplified dynamical models (e.g. Haines and
Hannachi 1995, Hannachi 1997).
4.3 Simplified EOFs: SCoTLASS 89

Fig. 4.9 SEOF3 of winter


SLP anomalies using τ = 12
(top) and τ = 16 (bottom).
Adapted from Hannachi et al.
(2007)
Chapter 5
Complex/Hilbert EOFs

Abstract Weather and Climate data contain a myriad of processes including


oscillating and propagating features. In general EOF method is not suited to identify
propagating patterns. In this chapter describes a spectral method based on Hilbert
transform to identify propagating features, with application to the stratospheric
quasi-biennial oscillation.

Keywords Propagating patterns · Quasi-biennial oscillation · Cross-spectra ·


Complex EOFs · Hilbert transformation · Hilbert EOFs · Phase portrait

5.1 Background

The introduction of EOF analysis into meteorology since the late 1940s (Obukhov
1947; Fukuoka 1951; Lorenz 1956) had a strong impact on the course of weather
and climate research. This is because one major concern in climate research is the
extraction of patterns of variability from observations or model simulations, and the
EOF method is one such technique that provides a simple tool to achieve this. The
EOF patterns are stationary patterns in the sense that they do not evolve or propagate
but can only undergo magnitude and sign change. This is certainly a limitation if
one is interested in inferring the space–time characteristics of weather and climate,
since EOFs or REOFs, for example assume a time–space separation as expressed by
the Karhunen–Loéve expansion (3.1). For instance, one does not expect in general
EOFs to reveal to us the structure of the space–time characteristics of propagating
phenomena1 such as Madden–Julian oscillation (MJO) or quasi-biennial oscillation
(QBO), etc.
The QBO, for example, represents a clear case of oscillating phenomenon that
takes place in the stratosphere, which can be identified using stratospheric zonal

1 Inreality all depends on the variance explained by those propagating patterns. If they have
substantial variance these propagating patterns can actually be revealed by a EOF analysis when
precisely they appear as degenerate pair of eigenvalues and associated eigenvectors in quadrature.

© Springer Nature Switzerland AG 2021 91


A. Hannachi, Patterns Identification and Data Mining in Weather and Climate,
Springer Atmospheric Sciences, https://doi.org/10.1007/978-3-030-67073-3_5
92 5 Complex/Hilbert EOFs

wind. This wind is nearly zonally symmetric. Figure 5.1 shows the climatology
of the zonal wind for January and July from the surface to 1 mb level using the
European Reanalyses (ERA-40) from the European Centre for Medium Range
Weather Forecasting (ECMWF), for the period January 1958–December 2001. A
number of features can be seen. The tropospheric westerly jets are located near
250-mb height near 30–35◦ latitude. In January the Northern Hemisphere (NH) jet
is only slightly stronger than the Southern Hemisphere (SH) counterpart. In July,

Fig. 5.1 Climatology of the ERA-40 zonal mean zonal wind for January (a) and July (b) for the
period Jan 1958–Dec 2001. Adapted from Hannachi et al. (2007)
5.1 Background 93

however, the NH jet is weaker than the SH jet. This latter is stronger due in part to
the absence of boundary layer friction caused by mountains and land masses.
In the stratosphere, on the other hand, both easterly and westerly flows are
present. Stratospheric westerlies (easterlies) exist over most winter (summer)
hemispheres. The stratospheric westerly flow represents the polar vortex, which
is stronger on the winter time because of the stronger equator-pole temperature
gradient. Note also the difference in winter stratospheric wind speed between the
northern hemisphere, around 40–50 m/s at about 1-mb and the southern hemisphere,
around 90 m/s at the same height.
The above figure refers mainly to the seasonality of the zonal flow. The variability
of the stratospheric flow can be analysed after removing the seasonality. Figure 5.2
shows the variance of the zonal wind anomalies over the ERA-40 period. Most of the
variance is concentrated in a narrow latitudinal band around the region equatorward
of 15◦ and extends from around 70-mb up to 1-mb.
Figure 5.3 shows a time–height plot of the zonal wind anomalies at the equator
over the period January 1994–December 2001. A downward propagating signal is
identified between about 3 and 70-mb. The downward propagating speed is around
1.2 km/month. The period at a given level varies between about 24 and 34 months,
yielding an average of 28 months, hence quasi-biennial periodicity, see e.g. Baldwin
et al. (2001) and Hannachi et al. (2007) for further references.
To get better insight into space–time characteristics of various atmospheric
processes one necessarily has to incorporate time information into the analysis. This
is backed by the fact that atmospheric variability has significant auto- and cross-

Fig. 5.2 Variance of monthly zonal mean zonal wind anomalies, with respect to the mean seasonal
cycle, over the ERA-40 period. Adapted from Hannachi et al. (2007)
94 5 Complex/Hilbert EOFs

Fig. 5.3 Time–height plot of equatorial zonal mean zonal wind anomalies for the period January
1992–December 2001. Adapted from Hannachi et al. (2007)

correlations in time (and space). Among the earliest contributions in meteorology


along this line one finds methods based on cross-spectral analysis (Kao 1968;
Madden and Julian 1971; Hayashi 1973), complex cross-covariances obtained using
complex wind time series whose zonal and meridional components constitute
respectively the real and imaginary parts of the complex field (Kundu and Allen
1976), and complex principal components in the frequency domain using cross-
spectral covariances (Wallace and Dickinson 1972; Wang and Mooers 1977).
Extended EOFs (Weare and Nasstrom 1982) is another time domain method
that incorporates the lagged information in the data matrix before computing
EOFs and is discussed in the next chapter. Complex EOFs (CEOFs) in the time
domain was introduced as an alternative to the CEOFs in the frequency domain
(Brillinger 1981; Rasmusson et al. 1981; Horel 1984). Principal oscillation patterns
(Hasselmann 1976) is another widely used method based also on the lagged
covariance matrix and finds propagating structures in a quasi-linear system. The
time domain complex EOFs is conceptually close to EOFs, except that the field is
complex or complexified.
Conventional EOF analysis can be applied to a single space–time field or a
combination of fields. EOF analysis finds “stationary” patterns in the sense that they
are not evolving. It yields a varying time series for any obtained EOF pattern, which
means that the spatial EOF pattern will only decrease or increase in magnitude
whereas the spatial structure remains unchanged. Because EOFs are based on
(simultaneous) covariances, the way time is arranged is irrelevant. In fact, if xt and
5.2 Conventional Complex EOFs 95

yt , t = 1, . . . n, are two univariate time series, then any permutation of xt and yt


will yield the same covariance, i.e.

ρxy = cov(xt , yt ) = cov(xπ(t) , yπ(t) ), (5.1)

where π is any permutation of {1, 2, . . . n}.


This can lead sometimes to difficulties capturing propagating structure by EOFs.
Extended EOFs and principal oscillation pattern (POP) analysis (e.g. von Storch and
Zwiers 1999) can in general easily extract these structures. These methods will be
discussed in the coming chapters. Here we discuss another method similar to EOF
analysis, which is based on the complexified field. The method does not involve
explicitly the lagged information, hence avoiding the use of large (extended) data
matrices as in EEOFs.
It is known that any wave can be expressed using a complex representation as

x(t) = aeiωt+φ , (5.2)

where a is the wave amplitude and ω and φ are respectively its frequency and
phase shift (at the origin). Complex EOFs (CEOFs) are based on this representation.
There are, in principle, two ways to perform complex EOFs, namely “conventional”
complex EOFs and “Hilbert” EOFs. When we deal with a pair of associated climate
fields then conventional complex EOFs are obtained. Hilbert EOFs correspond to
the case when we deal with a single field, and where we are interested in finding
propagating patterns. In this case the field has to be complexified by introducing an
imaginary part, which is a transform of the actual field.

5.2 Conventional Complex EOFs

5.2.1 Pairs of Scalar Fields

The method is similar to conventional EOFs except that it is applied to the complex
field obtained from a pair of variables such as the zonal and meridional components
u and v of the wind field U = (u, v) (Kundu and Allen 1976; Brink and Muench
1986; von Storch and Zwiers 1999; Preisendorfer and Mobley 1988). The wind field
Utl = U (t, sl ), defined at each location sl , l = 1, . . . p, and time t, t = 1, . . . n, can
be written using a compact complex form as

Utl = u(t, sl ) + iv(t, sl ) = utl + ivtl . (5.3)

The complex covariance matrix is then obtained using the data matrix U = (Utl ) by

1
S= U ∗T U , (5.4)
n−1
96 5 Complex/Hilbert EOFs

and the elements skl , k, l = 1, . . . p, of S in (5.4) are given by

1 ∗
n
skl = Utk Utl ,
n
t=1

where (∗ ) is the complex conjugate operator. The (complex) covariance matrix,


Eq. (5.4), is Hermitian, i.e. S∗T = S, and is therefore diagonalisable. The matrix
has therefore a set of orthonormal complex eigenvectors U = u1 , . . . up and a real
non-negative2 eigenspectrum λ21 , . . . λ2p . The complex amplitude of the kth EOF is
the kth complex principal component (CPC) ek and is given by

ek = U u∗k (5.5)

This immediately yields non-correlation of the CPCs:

e∗T
k el = λk δkl .
2
(5.6)

The complex EOFs and associated complex PCs are also obtained using the singular
value decomposition of U .
Any CEOF uk has a pattern amplitude and phase. The pattern of phase informa-
tion is given by

I m(uk )
φ k = arctan , (5.7)
Re(uk )

where Re() and I m() stand respectively for the real and imaginary parts, and where
the division is performed componentwise. The pattern amplitude of uk is given by
its componentwise amplitudes. This method of doing CEOFs seems to have been
originally applied by Kundu and Allen (1976) to the velocity field of the Oregon
coastal current. The conventional CEOFs are similar to conventional EOFs in the
sense that time ordering is irrelevant, and hence the method is mostly useful to
capture covarying spatial patterns between the two fields.

5.2.2 Single Field

If one is dealing with a single field xt = (xt1 , . . . , xtp )T , t = 1, 2 . . . n, such as


sea surface temperature, and one is interested in propagating patterns one can still
use the conventional complex EOFs applied to the complexified field obtained from

2 Since u∗T
k Suk = λk =
2 1
n−1 [Uuk ]∗T [Uuk ] ≥ 0.
5.3 Frequency Domain EOFs 97

the pair of lagged variables (xt , xt+τ ) for some chosen lag τ . The complex field is
defined by

yt = xt + ixt+τ . (5.8)

This is a natural way to define a homogeneous complexified field using lagged


information. The corresponding complex data matrix defined from (5.8) is then
given at each grid point sl and each time t by (Y)tl = (xtl + ixt+τ,l ). The obtained
complex data matrix Y = (ytl ) can then be submitted to the same complex EOF
analysis as in the previous section.
The obtained CEOFs provide the propagating structures and the corresponding
CPCs provide the phase information. This procedure is based on the choice of
the time lag τ , which reflects the characteristic time of the propagating feature.
In general, however, this parameter is not precisely known, and requires some
experience. The choice of this parameter remains, in practice, subject to some
arbitraryness. One way to determine the approximate value of τ is to compute the
CEOFs for many values of the lag τ then plot the leading eigenvalue vs lag and look
for the lag corresponding to the maximum value. An elegant alternative to choosing
the lag in the time domain is to use the Hilbert transform, which is based on phase
shift in the frequency domain and is discussed in the coming sections.

5.3 Frequency Domain EOFs

5.3.1 Background

Complex EOFs in spectral or time domain is a natural extension to EOFs and aims
at finding travelling patterns. In spectral domain, the method is based on an eigen-
decomposition of the cross-spectral matrix and therefore makes use of the whole
structure of the (complex) cross-spectral matrix. Ordinary EOFs method is simply
an application of frequency domain EOFs (FDEOFs) to contemporaneous data only.
It appears that the earliest introduction of complex frequency domain EOFs
(FDEOFs) in atmospheric context dates back to the early 1970s with Wallace and
Dickinson. Their work has stimulated the introduction of Hilbert EOFs, and we start
by reviewing FDEOFs first. The spectrum gives a measure of the contribution to the
variance across the whole frequency range. EOF analysis in the frequency domain
(Wallace and Dickinson 1972; Wallace 1972; Johnson and McPhaden 1993), see
also Brillinger (1981) for details, attempts to analyse propagating disturbances by
concentrating on a specific frequency band allowing thus the decomposition of
variance in this band while retaining phase relationships between locations.
98 5 Complex/Hilbert EOFs

5.3.2 Derivation of FDEOFs

For a univariate stationary time series xt , t = 1, 2, . . . , we know the relationship


between the auto-covariance function γ () and the spectral density function f (), i.e.
(see Appendix C):

1  −iωk
f (ω) = e γ (k), (5.9)

k

and
 π
γ (τ ) = eiτ ω f (ω)dω. (5.10)
−π

T
For a multivariate time series xt = xt1 , xt2 , . . . xtp , t = 1, 2, . . . , the
previous equations extend to yield respectively the cross-spectrum matrix F and
the autocovariance or lagged covariance matrix  given by

1  −iωk
F(ω) = e (k) (5.11)

k

and
 π
(τ ) = eiτ ω F(ω)dω. (5.12)
−π

The elements ij (τ ), i, j = 1, . . . p, of the lagged covariance matrix are given by

ij (τ ) = cov xti , xt+τ,j (5.13)

and gives the lagged covariance between the ith and jth variables. Because the cross-
spectrum matrix is Hermitian it is therefore diagonalizable, and can be factorised as

F = EDE∗T , (5.14)

where E is a unitary complex matrix containing the (complex) eigenvectors, and D


is a diagonal matrix containing the real positive eigenvalues of F.
The idea behind spectral EOFs of a multivariate time series xt , t = 1, . . . , n, is to
find a linear transformation that has the diagonal matrix D as cross-spectra. Wallace
and Dickinson (1972) filtered3 the time series by keeping only a given frequency
ω using a narrow band pass filter that retains only the frequencies in ω ± dω. A

3 Tohave a spectral representation of a continuous time series x(t), Wallace and Dickinson (1972)
used the concept of stochastic integrals as
5.3 Frequency Domain EOFs 99

complexified time series y(t) is then obtained, which involves the filtered time series
and its time derivative as its real and complex parts respectively. The EOFs and PCs
are then obtained from the real time series:

zt = Re [E(yt )] , (5.15)

where E() and Re[] stand, respectively, for the expectation and the real part
operators.
In practice, FDEOFs are based on performing an eigenanalysis of the cross-
spectrum matrix calculated in a small frequency band. Let u(ω) be the Fourier
transform (FT) of the (centred) field xt , t = 1, . . . n at frequency ω, i.e.


n
u(ω) = xt e−iωt . (5.16)
t=1

The cross-spectral matrix at ω is (ω) = u(ω)u(ω)T , and can be written in terms


of the lagged covariance matrix

1 
n−τ
Sτ = xt xTt+τ
n−τ
t=1

as

(ω) = Sτ e−iωτ .
τ

Note that the covariance matrix satisfies S = −π (ω)dω, and therefore the
spectrum gives a measure of the contribution to the variance across the whole
frequency range. The average of the cross-spectral matrix over the frequency band
[ω0 , ω1 ], i.e.
 ω1
C= (ω)dω (5.17)
ω0

provides a measure of the contribution to the covariance matrix in that frequency


band. The spectral domain EOFs are given by the complex eigenvectors of F. The

 ∞
x(t) = Re eiωt dε(ω) ,
0

where ε is an independent random noise and Re[.] stands for the real part. The filtered time
series
 (i.e. spectral
 components outside [ω, ω + dω])  by defining first x (t) =
 is thend obtained
f

Re eiωt dε(ω) dω, from which they get z(t) = Re (1 − ωi dt )E(xf ) . This new time series then
 
satisfies E z(t)zT (t + τ ) = cos(ωτ )D(ω)dω.
100 5 Complex/Hilbert EOFs

“principal components” resulting from the FDEOF are obtained by projecting the
complexified time series onto the spectral domain EOFs.
Now since waves are coherent structures with consistent phase relationships at
various lags, and given that FDEOFs represent patterns that are uniform across
a frequency band, the leading FDEOF provides coherent structures with most
wave variance. The FDEOFs are then obtained as the EOFs of C (Brillinger
1981). Johnson and McPhaden (1993) have applied FDEOFs to study the spatial
structure of intraseasonal Kelvin wave structure in the Equatorial Pacific Ocean.
They identified coherent wave structures with periods 59–125 days. Because most
climate data spectra look reddish, FDEOF analysis may be cumbersome in practice
(Horel 1984). This is particularly the case if the power spectrum of an EOF, for
example is spread over a wide frequency band, requiring an averaging of the cross-
spectrum over this wide frequency range, where the theory behind FDEOFs is no
longer applicable (Wallace and Dickinson 1972).
To summarise the following bullet points provide the essence of FDEOFs:
• Conventional EOFs are simply frequency domain EOFs applied to contempora-
neous data only.
• FDEOFs are defined as the eigenvectors of the cross-spectrum matrix defined at
a certain frequency band ω ± dω.
• This means that all frequencies outside an infinitesimal interval around ω have to
be filtered.
The method, however is difficult to apply in practice. For instance, if the power
in the data is spread over a wide range, it is not clear how FDEOFs can be applied.4
There is also the issue related to the choice of the “right” frequency. Averaging
the cross-spectrum over a wider range is desirable but then the theory is no longer
valid (Wallace and Dickinson 1972). Note that averaging the cross-spectrum matrix
over the whole positive/negative frequency domain simply yields ordinary EOFs.
In addition to the previous difficulties there is also the problem of estimating
the power spectrum at a given frequency, given that the spectrum estimate is in
general highly erratic (see Chatfield 1996). Also and as pointed out by Barnett
(1983), the interactions between various climate components involve propagation
of information and irregular short term as well as cyclo-stationary, e.g. seasonality,
interactions. This complicated (non-stationary) behaviour cannot be analysed using
spectral techniques. These difficulties have led to the method being abandoned.
Many of the above problems can be handled by Hilbert EOFs discussed next.

4 For example, Horel (1984) suggests that many maps, one for each spectral estimate may be
studied.
5.4 Complex Hilbert EOFs 101

5.4 Complex Hilbert EOFs

An elegant alternative to FDEOFs is the complex EOFs in the time domain


introduced into atmospheric science by Rasmusson et al. (1981), see also Barnett
(1983) and Horel (1984), using Hilbert singular decomposition. The method has
been refined later by Barnett (1983) and applied to the monsoon (Barnett 1984a,b),
atmospheric angular momentum (Anderson and Rosen 1983), the QBO in northern
hemispheric SLP (Trenberth and Shin 1984) and coastal ocean currents (Merrifield
and Winant 1989). The method is based on Hilbert transform and is therefore
referred to as Hilbert EOFs (HEOFs).

5.4.1 Hilbert Transform: Continuous Signals


∞
Let x(t) be a continuous time series, the integral may not exist in the
−∞ x(t)dt

ordinary sense, i.e. it may be divergent. However, one can calculate e.g. −λ x(t)dt
for any finite value of λ. The limit of this integral, when λ → ∞,  ∞exists in
many cases and is known as the Cauchy principal value of the integral −∞ x(t)dt,
∞
denoted as P −∞ x(t)dt, i.e.
 ∞  λ
P x(t)dt = lim x(t)dt. (5.18)
−∞ λ→∞ −λ

∞ λ
For example, P −∞ tdt = limλ→∞ −λ tdt = 0. Note that when the integral
is already convergent then it is identified to its Cauchy principal value. A direct
application of this integral is the Hilbert transform (Thomas 1969; Brillinger 1981).
Definition The Hilbert transform Hx(t), of the continuous signal x(t), is defined
by

1 x(s)
Hx(t) = P ds. (5.19)
π t −s

Note that the inverse of this transform is simply its opposite. This transform is
defined for every
 ∞signal x(t) in Lp , the space of functions whose pth power is
integrable, i.e. −∞ |x(t)|p dt < ∞. This result derives from Cauchy’s integral
formula5 and the function z(t) = x(t) − iHx(t) = a(t)eiθ(t) is analytic. In fact,
the Hilbert transform is the unique transform that defines an imaginary part so that

5 If f () is analytic over a domain D in the complex plane containing a simple path C0 , then

1 f (u)
f (z) = du
2iπ C0 u − z
102 5 Complex/Hilbert EOFs

the result is analytic. The Hilbert transform is defined as a convolution of x(t) with
the function 1t emphasizing therefore the local properties of x(t). Furthermore, using
the polar expression, it is seen that z(t) provides the best local fit of a trigonometric
function to x(t) and yields hence an instantaneous frequency of the signal and
provides information about the local rate of change of x(t).
The Hilbert transform y(t) = Hx(t) is related to the Fourier transform Fx(t) by
 ∞ 
1
y(t) = Hx(t) = Im Fx(s)e−ist ds , (5.20)
π 0

where I m() stands for the imaginary part.


∞
Exercise Derive (5.20) keeping in mind that 0 sinu u du = π2 .
N
Hint Use the fact that x(t) = limN →∞ −N e−ist Fx(s)ds. The result is then
 λ  N x(s) −ist N  λ i(u−t)s
obtained from the equality −λ −N Fu−t e dtds= −N Fx(s)e−ius −λ e u−t dt
ds after tending N and λ to ∞.
Remark Based on Eq. (5.20), the analytic Hilbert transform y(t) of x(t) can be
obtained using: (i) Fourier transforming x(t), (ii) substituting the amplitude of
negative frequencies with zero, and doubling the amplitude of positive frequencies
and (iii) taking inverse Fourier transform.
In the language of signals the Hilbert transform is a linear filter that removes
precisely the zero frequency from the spectrum and has the simplest response
function. In fact, the transfer, or frequency, response function (Appendix C) of the
Hilbert filter is given by
 ∞ 
1 1 eisλ i sign(λ) if λ = 0
h(λ) = P ds = (5.21)
π π −∞ s 0 if λ = 0.

Equation (5.21) is obtained after remembering that the Hilbert transform is a simple
convolution of the signal x(t) with the function 1t . The filter transfer function is
therefore 1t , and the frequency response function is given by the principal value of
the Fourier transform of 1t . It is therefore clear from (5.21) that the Hilbert filter6
precisely removes the zero frequency but does not affect the modulus of all others.
The analytic signal z(t) has the same positive frequencies as x(t) but zero negative
frequencies.

for any z inside C0 . Furthermore, the expression f (t) = 12 (f (t) + i H[f (t)]) +
2 (f (t) − i H[f (t)]) provides a unique decomposition of f (t) into the sum of two analytic
1

functions, see, e.g., Polya and Latta (1974).


6 This transform is used also in other circumstances to define the envelop of a time series, (Hannan

1970), given by |x(t) − i Hx(t)|.


5.4 Complex Hilbert EOFs 103

Table 5.1 Examples of time series and their Hilbert transforms


e−x
sin t 1 2
x(t) sin t cos t t δ(t) 1+t 2
−1+cos t
Hx(t) cos t − sin t t − π1t − 1+tt
4 − √2π D(t)

Remark: Difference Between Fourier and Hilbert Transform Like Fourier trans-
form the Hilbert transform provides the energy 12 a 2 (t) = |z(t)|2 , and frequency
ω = − dθ(t)dt . In Hilbert transform these quantities are local (or instantaneous)
where at any given time the signal has one amplitude and one frequency. In
Fourier transform, however, the previous quantities are global in the sense that
each component in the Fourier spectrum covers the whole time span uniformly. For
instance, a spike in Fourier spectrum reflects a sine wave in a narrow frequency
band in the whole time span. So if we represent this in a frequency-time plot one
gets a vertical narrow band. The same spike in the Hilbert transform, however,
indicates the existence of a sine wave somewhere in the time series, which can be
represented by a small square in the frequency-time domain if the signal is short
lived. The Hilbert transform is particularly useful for transient signals, and is similar
to wavelet transform in this respect.7
Table 5.1 gives examples of familiar functions and their Hilbert transforms. The
function D(t) in Table 5.1 is known as Dawson’s integral defined by D(t) =
2 t
e−t 0 ex dx, and the symbol δ() refers to the spike function or Dirac delta
2

function, i.e. δ(0) = 1 and δ(t) = 0 for non-vanishing values of t.

5.4.2 Hilbert Transform: Discrete Signals

We suppose here that our continuous signal x(t) has been discretised at various
times tk = kt to yield the discrete time series xk , k = 0, ±1, ±2, . . . , where
xk = x(kt). To get the Hilbert transform of this time series, we make use of the
transfer function of the filter in (5.21) but now defined over [−π, π ] (why?), i.e.

7 The wavelet transform of a signal x(t) is given by


  
1 t −b
W (a, b, ψ, x(t)) = |a|− 2 x(t)ψ dt,
a

where ψ() is the basis wavelet, a is a dilation factor and b is the translation of the origin. In this
transform higher frequencies are more localised and there is a uniform temporal resolution for
all frequency scales. A major problem here is that the resolution is limited by the basis wavelet.
Wavelets are particularly useful for characterising gradual frequency changes.
104 5 Complex/Hilbert EOFs


⎨ −1 for −π ≤ λ < 0
h(λ) = i sign(λ)I[−π,π ] (λ) = 0 for λ = 0 (5.22)

1 for 0 < λ < π.

To get the filter coefficients (in the time domain) we compute the frequency
response function, which is then expanded into Fourier series (Appendix C). The
transfer function (5.22) can now be expanded into Fourier series, after extending
it by periodicity to the real line then applying the discrete
 π Fourier transform
(e.g. Stephenson 1973). The Fourier coefficients, ak = π1 −π h(λ)eikλ dλ, k =
0, 1, 2, . . ., become:8

0 if k = 2p
ak = (5.23)
− kπ
4
if k = 2p + 1.

The frequency response function is therefore


 ak
(λ) = − eikλ
2
k

and hence the time series response is given by


 ak 2  x((t − (2k + 1))t)
yk = − xt−k = −
2 π 2k + 1
k k

that is,

2  xt−(2k+1) 2 1
yt = − = (xt+2k+1 − xt−2k−1 ) . (5.24)
π 2k + 1 π 2k + 1
k k≥0

The discrete Hilbert transform formulae (5.24) was also derived by Kress and
Martensen (1970), see also Weideman (1995), using the rectangular rule of inte-
gration applied to (5.19). Now the time series is expanded into Fourier series as

xt = a(ω)e−2iπ ωt (5.25)
ω

8 Note that this yields the following expansion of the transfer function into Fourier series as

4  sin (2k + 1)x 2  sin (2k + 1)x


−ih(λ) = = .
π 2k + 1 π 2k + 1
k≥0 k
5.4 Complex Hilbert EOFs 105

then its Hilbert transform is obtained by multiplying (5.25) by the transfer function
to yield:

1
Hxt = yt = h(ω)a(ω)e−2π iωt . (5.26)
π ω

Note that in (5.26) the Hilbert transform has removed the zero frequency and phase
rotated the time series by π2 . The analytic (complex) Hilbert transform

  
i
zt = xt − iHxt = a(ω) 1 − h(ω) e−2iπ ωt (5.27)
ω
π

has the same positive frequencies as xt and zero negative spectrum.

5.4.3 Application to Time Series

Let xt , t = 1, 2, . . . , be a univariate stationary signal, and yt = Hxt , t = 1, 2, . . .,


its Hilbert transform. Then using (5.21), and (5.22) we get respectively

fx (ω) = fy (ω) (5.28)

and

fxy (ω) = i sign(ω)fx (ω) (5.29)

for ω = 0. Note that the first of these two equations is already known. It is therefore
clear that the co-spectrum is zero. This means that the cross-correlation between the
signal and its transform is odd, hence

γxy (−τ ) = −γxy (τ ) = γyx (τ ). (5.30)

Note in particular that (5.30) yields γxy (0) = 0, hence the two signals are
uncorrelated.
For the multivariate case similar results hold. Let us designate by xt , t = 1, 2, . . .
a d-dimensional signal and yt = Hxt , its Hilbert transform. Then the cross-spectrum
matrix, see Eqs. (5.28)–(5.29), is given by

Fy (ω) = Fx (ω)
Fxy (ω) = i sign(ω)Fx (ω), (5.31)

for ω = 0. Note that the Hilbert transform for multivariate signals is isotropic.
Furthermore, since the inverse of the transform is its opposite we get, using again
106 5 Complex/Hilbert EOFs

Eqs. (5.30) and (5.31), the following identity:

Fyx (ω) = −isign(ω)Fy (ω) = −Fxy (ω). (5.32)

Using (5.31), the latter relationship yields, for the cross-covariance matrix, the
following property:
 π  π
 xy = Fxy (ω)dω = −2 FIx = − yx . (5.33)
−π 0

Exercise Derive the above equation (5.33).


(Hint Use (5.31) plus the fact that FR is even and FI is odd.).
Let us now consider the complexified multivariate signal:

zt = xt − iHxt = xt − iyt (5.34)

then the Hermitian covariance matrix  z of zt is


 
 z = E zt z∗T
t =  x +  y + i  xy −  yx = 2 x + 2i xy (5.35)

The covariance matrix of the complexified signal is also related to the cross-spectra
Fx via
 π  π
 
z = 2 1 + sign(ω) Fx (ω)dω = 4 F∗x (ω)dω, (5.36)
−π 0

where (∗ ) stands for the complex conjugate.


Exercise Derive (5.36).
Hint Use (5.31), (5.33) and (5.35), in addition to the fact that the real and imaginary
parts of Fx are respectively even and odd.

Exercise If in (5.34) zt is defined instead by xt +iyt , show that  z = 4 0 Fx (ω)dω

5.4.4 Complex Hilbert EOFs


Complexified Field

Given a scalar field xt = (xt1 , . . . , xtp )T , t = 1, . . . n, with Fourier representation:



xt = a(ω) cosωt + b(ω) sinωt, (5.37)
ω
5.4 Complex Hilbert EOFs 107

where a(ω) and b(ω) are vector Fourier coefficients, and since propagating distur-
bances require complex representation as in (5.2), Eq. (5.37) can be transformed to
yield the general (complex) Fourier decomposition:

zt = c(ω) e−iωt , (5.38)
ω

where precisely Re(zt ) = xt , and c(ω) = a(ω) + ib(ω). The new complex field
T
zt = zt1 , . . . ztp can therefore be written as

zt = xt − iH(xt ). (5.39)

The imaginary part of zt



H(xt ) = b(ω) cos ωt − a(ω) sin ωt, (5.40)
ω

is precisely the Hilbert transform, or quadrature function of the scalar field xt and
is seen to represent a simple phase shift by π2 in time. In fact, it can be seen that
the Hilbert transform, considered as a filter, removes the zero frequency without
affecting the modulus of all the others, and is as such a unit gain filter. Note that
if the time series (5.37) contains only one frequency, then the Hilbert transform is
simply proportional to the time derivative of the time series. Therefore, locally in
the frequency domain H(xt ) provides information about the rate of change of xt
with respect to time t.

Computational Aspects

In practice, various methods exist to compute the finite Hilbert transform. For a
scalar field xt of finite length n, the Hilbert transform H(xt ) can be estimated using
 
the discrete Fourier transform (5.40) in which ω becomes ωk = 2πn k , k = 1, . . . n2 .
Alternatively, H(xt ) can be obtained by truncating the infinite sum in Eq. (5.24).
This truncation can also be written using a convolution or a linear filter as (see e.g.
Hannan 1970):


L
H(xt ) = αk xt−k (5.41)
k=−L

with the filter weights

2 πk
αk = sin2 .
kπ 2
108 5 Complex/Hilbert EOFs

Barnett (1983) found that 7 ≤ L ≤ 25 provides adequate values for L. For example
for L = 23 the frequency response function (Appendix C) is a band pass filter with
periods between 6 and 190 time units with a particular excellent response obtained
between 19 and 42 time units (Trenberth and Shin 1984). The Hilbert transform has
also been extended to vector fields, i.e. two or more fields, through concatenation
of the respective complexified fields (Barnett 1983). Another interesting method to
compute Hilbert transform of a time series is presented by Weideman (1995), using a
series expansion in rational eigenfunctions of the Hilbert transform operator (5.19).
The HEOFs uk , k = 1, . . . p, are then obtained as the eigenvectors of the
Hermitian covariance matrix

1  ∗T
n
S= zt zt = Sxx + Syy + i Sxy − Syx , (5.42)
n−1
t=1

where Sxx and Syy are covariance matrices of xt and yt respectively, and Sxy is
the cross-covariance between xt and yt and similarly for Syx . Alternatively, these
HEOFs can also be obtained as the right complex singular vectors of the data matrix
Z = (ztk ) using SVD, i.e.


p
Z = UV∗T = λk uk v∗T
k .
k=1

Note that (5.33) is the main difference between FDEOFs, where the integration is
performed over a very narrow (infinitesimal) frequency range ω0 ± dω, and Hilbert
EOFs, where the whole spectrum is considered. The previous SVD decomposition
T
expresses the complex map zt = zt1 , zt2 , . . . , ztp at time t as


r
zt = λk vtk u∗k , (5.43)
k=1

where vtk is the value of the k’th complex PC (CPC) vk at time t and r is the rank of
X. The computation of Hilbert EOFs is quite similar to conventional EOFs. Given
the gridded data matrix X(n, p12), the Hilbert transform in Matlab is given by

 XH = hilbert(X);

and the Hilbert EOFs are given by the EOFs of XH .


Figure 5.4 shows the leading 30 eigenvalues, expressed in percentage of
explained variance, of the spectrum of the Hermitian matrix S, Eq. (5.42), using the
zonal mean zonal wind anomalies over the ERA-40 period. The vertical bars refer
to the approximate 95% confidence interval based on the rule-of-thumb given by
Eq. (3.13). The spectrum has an outstanding feature reflected by the high percentage
of the leading eigenvalue, with a substantial amount of variance of the order of 70%.
5.4 Complex Hilbert EOFs 109

Fig. 5.4 Spectrum of the Hermitian covariance matrix given by Eq. (5.42), of the zonal mean zonal
wind anomalies of the ERA-40 data. Vertical bars represent approximate 95% confidence limits.
Adapted from Hannachi et al. (2007)

The leading Hilbert EOF (real and complex parts) is shown in Fig. 5.5. The patterns
are in quadrature reflecting the downward propagating signal.
The time and spectral structure can be investigated further using the associated
complex (or Hilbert) PC. Figure 5.6 shows a plot of the real and complex parts of
the Hilbert PC1 along with the power spectrum of the former. The period of the
propagating signal comes out about 30 months. In a conventional EOF analysis
the real and complex parts of Hilbert EOF1 would come out approximately as
a pair of degenerate EOFs. Table 5.2 shows the percentage of the cumulative
explained variance of the leading 5 EOFs and Hilbert EOFs. The leading EOF pair
explain respectively about 39% and 37% whereas the third one explains about 8%
of the total. Table 5.2 also shows the efficiency of Hilbert EOFs in reducing the
dimensionality of the data compared to EOFs. This comes of course at a price,
namely the double size of the Hilbert covariance matrix.
Remark The real and imaginary parts of the CPC’s are Hilbert transform of
each other. In fact using the identity λk vk =
p ∗
j =1 ukj zj , where zk is the k’th
complexified variable (or time series at the kth grid point) and uk = (uk1 , . . . , ukp )
is the kth HEOF, we can apply the Hilbert transform to yield


p   p  
λk Hvk = ukj Hx ∗j + iHy ∗j = ukj y ∗j − ix ∗j = iλk vk ,
j =1 j =1

which completes the proof.


110 5 Complex/Hilbert EOFs

Fig. 5.5 Real (a) and imaginary (b) parts of the leading Hilbert EOF of the ERA-40 zonal mean
zonal wind anomalies. Adapted from Hannachi et al. (2007)

From this decomposition we get the spatial amplitude and phase functions
respectively:

ak = uk  u∗k = Diag uk u∗T


k

I m(uk )
θ k = arctan . (5.44)
Re(uk )
5.4 Complex Hilbert EOFs 111

Fig. 5.6 Time series of the Hilbert PC1 (a), phase portrait of Hilbert PC1 and the power spectrum
of the real part of Hilbert PC1 of the Era-40 zonal mean zonal wind anomalies. Adapted from
Hannachi et al. (2007). (a) Complex PC1: Real and imaginary parts. (b) Phase portrait of CPC1.
(c) Spectrum of real (CPC1)

Table 5.2 Percentage of explained variance of the leading 5 EOFs and Hilbert EOFs of the ERA-
40 zonal mean zonal wind anomalies
Eigenvalue rank 1 2 3 4 5
EOFs 39.4 37.4 7.7 5.5 2.3
Hilbert EOFs 71.3 10.0 7.7 2.8 1.9

Similarly, one also gets the temporal amplitude and phase functions as

bk = vk  v∗k = Diag(vk v∗T


k )

I m(vk )
φ k = arctan , (5.45)
Re(vk )

where the vector product and division in (5.44) and (5.45) are performed compo-
nentwise. For each eigenmode, the amplitude map can be interpreted as a variability
map as in ordinary EOFs. The function θ k gives information on the relative
phase. For “simple” fields, its spatial derivative provides a measure of the local
wavenumber. Its interpretation for moderately complex fields/waves can be difficult
112 5 Complex/Hilbert EOFs

(Wallace 1972), and can be made easier by applying a prior filtering (Barnett 1983).
Also for simple waves, the time derivative of the temporal phase gives a measure of
the instantaneous frequency. Note that the phase speed of the wave of the kth mode
at time t and position x can be measured by dθ k (x)/dx
dφ k (t)/dt .
The amplitude and phase of the leading Hilbert EOF Eq. (5.44) are shown in
Fig. 5.7. The spatial amplitude shows that the maximum of wave amplitude is round
25-mb on the equator. It also shows the asymmetry of the amplitude in the vertical

Fig. 5.7 Spatial modes of the amplitude and phase of leading Hilbert EOF of the ERA-40 zonal
mean zonal wind anomalies. Adapted from Hannachi et al. (2007). (a) Spatial amplitude of
complex EOF1. (b) Spatial phase of complex EOF1
5.4 Complex Hilbert EOFs 113

direction. The spatial phase shows banded structure from 1-mb height down to
around 50-mb, where propagation stops, and indicates the direction of propagation
of the disturbance where the phase changes between −180◦ and 180◦ in the course
of complete cycle.
The temporal amplitude and phase, Eq. (5.45), are shown in Fig. 5.8 of the first
Hilbert PC of the ERA-40 zonal mean zonal wind anomalies for the period January
1992–December 2001. For example, the amplitude is larger in the middle of the
wave life cycle. The temporal phase, on the other hand, provides information on the
phase of the wave. For every wave lifecycle the phase is nearly quasi-linear, with
the slope providing a measure of the instantaneous frequency.
As with conventional EOFs, Hilbert EOFs can be used to filter the data. For the
example of the ERA-40 zonal mean zonal wind, the leading Hilbert EOF/PC can be
used to filter out the QBO signal. Figure 5.9 shows the obtained filtered anomalies
for the period Jan 1992–Dec 2001. The downward signal propagating is clearer than
the signal shown in Fig. 5.3. The average downward phase propagation is about
1 km/month.
The covariance matrix S in (5.42) is related to the cross-spectrum matrix

1 
n−τ
(ω) = zt+τ z∗T
t e
−iωτ
n τ
t=1

Fig. 5.8 Time series of the amplitude and phase of the leading Hilbert EOF of ERA-40 zonal mean
zonal wind anomalies. Adapted from Hannachi et al. (2007). (a) Temporal amplitude of complex
PC 1. (b) Temporal phase of complex PC 1
114 5 Complex/Hilbert EOFs

Fig. 5.9 Filtered field of ERA-40 zonal mean zonal wind anomalies using Hilbert EOF1 for the
period January 1992–December 2001. Adapted from Hannachi et al. (2007)

via
 ωN
S=2 (ω)dω, (5.46)
0

where ωN = 2t 1
and represents the Nyquist frequency and t is the time
interval between observations. This can be compared to (5.36). Note that since
the covariance matrix of xt is only related to the co-spectrum (i.e. the real part
of the spectrum) of the time series it is clear that conventional EOFs, based on
covariance or correlation matrix, does not take into consideration the quadrature
part of the cross-spectrum matrix, and therefore EOFs based on the cross-spectrum
matrix generalise conventional EOFs.
It is also clear from (5.46) that HEOFs are equivalent to FDEOFs with the cross-
spectrum integrated over all frequencies. Note that the frequency band of interest
can be controlled by prior smoothing. Horel (1984) points out that HEOFs can fail
to detect irregularly occurring progressive waves, see also Merrifield and Winant
(1989). Merrifield and Guza (1990) have shown that complex EOF analysis in the
time domain (HEOFs) is not appropriate for non-dispersive and broad-banded waves
in wavenumber κ relative to the largest separation measured (array size x). In
fact Merrifield and Guza (1990), see also Johnson and McPhaden (1993), identified
κx as the main parameter causing spread of propagating variability into more
than one HEOF mode, and the larger the parameter the lesser the captured data
variance. Barnett (1983) applied HEOFs to study relationship between the monsoon
5.5 Rotation of HEOFs 115

and the Pacific trade winds and found strong coupling particularly at interannual
time scales.

5.5 Rotation of HEOFs

Although HEOFs constitute a very useful tool to study and identify propagating
phenomena (Barnett 1983; Horel 1984; Lazante 1990; Davis et al. 1991) such as
Kelvin waves in sea level or forced Rossby waves, the method suffers various
drawbacks. For example, HEOFs is unable to isolate, in one single mode, irregular
disturbance progressive waves Horel (1984). This point was also pointed out by
Merrifield and Guza (1990), who showed that the method can be inappropriate for
non-dispersive waves that are broadband in wavenumber relative to the array size.
More importantly, the drawbacks of EOFs, e.g. non-locality and domain depen-
dence (see Chap. 3) are inherited by HEOFs. Here again, rotation can come to their
rescue. Horel (1984) suggested the varimax rotation procedure to rotate HPCs using
real orthogonal rotation matrices, which can yield more realistic modes of variation.
The varimax rotation was applied later by Lazante (1990). Davis et al. (1991)
showed, however, that the (real) varimax procedure suffers a drawback related to a
lack of invariance to arbitrary complex rephasing of HEOFs. Bloomfield and Davis
(1994) proposed a remedy to rotation by using a complex unitary rotation matrix.
Bloomfield and Davis (1994) applied the complex orthogonal rotation to synthetic
examples and to sea level pressure. They argue that the rotated HPCs are easier to
interpret than the varimax rotation.
Chapter 6
Principal Oscillation Patterns and Their
Extension

Abstract EOF method is essentially an exploratory method to analyse the modes of


variability of multivariate weather and climate data, with no model is involved. This
chapter describes a different method, Principal Oscillation Pattern (POP) analysis,
that seeks the simplest dynamical system that can explain the main features of the
space–time data. The chapter also provides further extension of POPs by including
nonlinearity. Examples from climate data are also given.

Keywords Autoregressive model · Feedback matrix · Principal oscillation


pattern · Fluctuation–dissipation · Cyclo-stationarity · Baroclinic structure ·
e-folding time · Principal interaction patterns

6.1 Introduction

As was pointed out in the previous chapters, EOFs and closely related methods
are based on contemporaneous information contained in the data. They provide
therefore patterns that are by construction stationary in the sense that they do
not allow in general the detection of propagating disturbances. In some cases,
little dynamical information can be gained. The PCs contain of course temporal
information except that they only reflect the amplitude of the corresponding EOF.1
Complex Hilbert EOFs, on the other hand, have been conceived to circumvent
this shortcoming and allow for the detection of travelling waves. As pointed out
in Sect. 5.3.2, HEOFs are not exempt from drawbacks, including, for example,
difficulty in the interpretation of the phase information such as in the case of non-
dispersive waves or nonlinear dynamics, in addition to the drawbacks shared with
EOFs.
It is a common belief that propagating disturbances can be diagnosed using the
lagged structure of the observed field. For example, the eigenvectors of the lagged

1 There are exceptions, and these happen when, for example, there is a pair of equal eigenvalues
separated from the rest of the spectrum.

© Springer Nature Switzerland AG 2021 117


A. Hannachi, Patterns Identification and Data Mining in Weather and Climate,
Springer Atmospheric Sciences, https://doi.org/10.1007/978-3-030-67073-3_6
118 6 Principal Oscillation Patterns and Their Extension

covariance matrix at various lags can provide information on the dynamics of the
propagating structures. The atmospheric system is a very complex dynamical system
par excellence, whose state can be well approximated by an evolution equation or
dynamical system:

d
x = F(x, t), (6.1)
dt

where the vector x(t) represents the state of the atmosphere at time t, and F
is a nonlinear function containing all physical and dynamical processes such as
nonlinear interactions like nonlinear advection, and different types of forcing such
as radiative forcing, etc.
Our knowledge of the state of the system is and will be always partial, and this
reflects back on our exploration of (6.1). For example, various convective processes
will only be parametrised, and subscale processes are considered as noise, i.e.
known statistically. A first and important step towards exploring (6.1) consists in
looking at a linearised version of this dynamical system. Hence a more simplified
system deriving from (6.1) reads

d
x = Bx + εt , (6.2)
dt
where ε t is a random forcing taking into account the non-resolved processes, which
cannot be represented by the deterministic part of (6.2). These include subgrid scales
and nonlinear effects. This model goes along with the assumption of Hasselmann
(1988), namely that the climate system can be split into two components: a
deterministic or signal part and a nondeterministic or noise part. Equation (6.2) is a
simple linear stochastic system that can be studied analytically and can be compared
to observed data. This model is known as continuous Markov process or continuous
first-order (multivariate) autoregressive (AR(1)) model and has nice properties. It
has been frequently used in climate studies (Hasselmann 1976, 1988; Penland 1989;
Frederiksen 1997; Frederiksen and Branstator 2001, 2005).
In practice Eq. (6.2) has to be discretised to yield a discrete multivariate AR(1),
which may look like:

xt+1 = (I + Bt)xt + tεt = Axt + ε t+1 , (6.3)

in which t can be absorbed in B and can assume that t = 1, see also next sections
for time-dependent POPs.
Remark Note that when model data are used the vector xt may be complex,
containing, for example, spectral coefficients derived, for instance, from spherical
harmonics. Similarly, the operator A may also be complex. In the sequel, however,
it is assumed that the operator involved in Eq. (6.2) is real.
6.2 POP Derivation and Estimation 119

The AR(1) model (6.3) is very useful from various perspectives, such as the explo-
ration of observed data or the analysis of climate models or reduced atmospheric
models. This model constitutes the corner stone of principal oscillation pattern
(POP) analysis (Hasselmann 1988; von Storch et al. 1988, 1995; Schnur et al.
1993). According to Hasselmann (1988), the linear model part in (6.3), within the
signal subspace, is the POP model. If the simplified model takes into account the
system nonlinearity, then it is termed principal interaction pattern (PIP) model by
Hasselmann (1988). The main concern of POP analysis consists of an Eigen analysis
of the linear part of (6.3), see e.g. Hasselmann (1988), von Storch et al. 1988), and
Wikle (2004). POP analysis has been applied mostly to climate variability but has
started recently to creep towards other fields such as engineering and biomedical
science, see e.g. Wang et al. (2012).

6.2 POP Derivation and Estimation

6.2.1 Spatial Patterns

We consider again the basic multivariate AR(1) model (6.3), in which the multivari-
ate process (εt ) is a white noise (in time) with covariance matrix Q = E εε T . The
matrix A is known as the feedback matrix (von Storch et al. 1988) in the discrete case
and as Green’s function in the continuous case (Riskin 1984; Penland 1989). POP
analysis attempts to infer empirical space–time characteristics of the climate system
using a simplified formalism expressed by (6.3). These characteristics are provided
by the normalised eigenvectors of the matrix A (or the empirical normal modes of
the multivariate AR(1) model (6.3)) and are referred to as the POPs. In (6.3) we
suppose that xt , like ε t , is zero mean. The autoregression matrix is then given by
    −1
A = E xt+1 xTt E xt xTt =  1  −1
0 , (6.4)

T
where  1 is the lag-1 autocovariance matrix of xt = x1t , x2t , . . . xpt . That is, if
γij (τ ) is the lagged autocovariance between xit and xi,t+τ , then [ 1 ]ij = γij (1).
Recall that  1 is not symmetric.
If we now have a finite sample xt , t = 1, 2, . . . n, then an estimate of A is given
by

 = S1 S−1
0 , (6.5)

where S1 and S0 are, respectively, the sample lag-1 autocovariance matrix and the
covariance matrix of the time series. Precisely, we have the following result.
120 6 Principal Oscillation Patterns and Their Extension

Proposition The matrix  minimises the residual:


n−1 
n−1
F (A) = xt+1 − Axt 2
= Ft (A),
t=1 t=1

where x 2 = E(xT x).


Proof We first note the following few simple identities. For any p × 1 vectors a and
b and p × p matrix A, we have aT b = tr abT and aT (Ab) = tr abT AT =
tr AbaT . Now Ft () and also F () are quadratic forms in A and
 
Ft (A) = E xTt xt − xTt Axt−1 − xTt−1 AT xt + xt−1 AT Axt−1 .

Let us forget for a moment the expectation operator E(.). The differential of Ft (A)
is obtained by computing Ft (A + H) − Ft (A) for any matrix H, with small norm.
We get

Ft (A + H) − Ft (A)
 
= −xTt Hxt−1 − xTt−1 HT xt + xTt−1 AT Hxt−1 + xTt−1 HT Axt−1 + O H 2

 
= DFt (A).H + O H 2 ,

where DFt (A) is the differential of Ft () at A, which can be simplified to yield

DFt (A).H = −2xTt−1 HT xt +2xTt−1 HT Axt−1 = −2tr xt−1 xTt H − xt−1 xTt−1 AT H .

Now we can bring back either the expectation operator E() or the summa-
 use the expectation operator, we get DFt (A).H =
tion. If, for example, we
−2tr  −1 −  0 AT H . If the summation over the sample is used instead, we
obtain the same expression except that  1 and  0 are, respectively, replaced by S1
and S0 . Hence the minimum of F () satisfies DFt (A).H = 0 for all H, and this
yields (6.4).
The next step consists of computing the eigenvalues and the corresponding
eigenvectors.2 Let us denote by λk and uk , k = 1, . . . p, the eigenvalues and the
associated eigenvectors of A, respectively. The eigenvectors can be normalised to
have unit norm but are not orthogonal.

2 Thisdecomposition can be sometimes problematic due to the possible existence of small


eigenvalues of  0 . In general it is common practice to filter the data first using, for example,
EOFs and keeping the leading PCs (Barnett and Preisendorfer 1987).
6.2 POP Derivation and Estimation 121

Remarks
• Because A is not symmetric, the eigenvalues/eigenvectors can be complex, in
which case they come in conjugate pairs.
• The eigenvalues/eigenvectors of A (or similarly Â) are also solution of a
generalised eigenvalue problem.
Exercise Let L be an invertible linear transformation, and yt = Lxt . Show that the
eigenvalues of the feedback matrix are invariant under this transformation.
Hint We have yt+1 = LAL−1 yt , and the eigenvalues of A and LAL−1 are identical.
γij (τ )
Exercise Let ρij (τ ) = √ be the lagged cross-correlation between xit and
γii (0)γjj (0)
xj,t+τ , then |ρij (τ )| ≤ 1.
Hint Use |E(XY )|2 ≤ E(X2 )E(Y 2 ).
In the noise-free case (6.3) yields xt = At x0 , from which we can easily derive the
condition of stationarity. The state xt can be decomposed into a linear combination
of the eigenvectors of A as


r
xt = at(i) ai ,
i=1

where r is the rank of A and ai , i = 1, . . . r, are the eigenvectors of A. Equation (6.3)


then yields

(i)
at+1 = ci λti ,

where ci is a constant. It is clear therefore that under stationarity conditions the


eigenvalues of A satisfy |λi | < 1, i = 1, . . . p.
Exercise Assume that λ is a real eigenvalue of A. Show, using a different method,
that (under stationarity condition) the eigenvalues of A satisfy |λ| < 1. Furthermore,
when the eigenvalues are estimated from the data, i.e. using Â, then these sample
eigenvalues are in fact inside the unit circle.
Hint Let λ be an eigenvalue of A =  1  −1 0 and u the corresponding eigenvector,
then we have  1 v = λ 0 v, where v =  −1 0 u. Now consider the random variable
zt = vT xt . We have var (zt ) = vT  0 v and γz (1) = vT  1 v. That is γz (1) =
λvar (zt ), which, under stationarity assumption, yields

−1 ≤ λ = ρz (1) ≤ 1.

The last condition is also satisfied for the sample feedback matrix. In particular,
the above inequality becomes strict in general since in a finite sample the lag-
1 autocorrelation is in general less that one. Even for the population, the above
122 6 Principal Oscillation Patterns and Their Extension

inequality tends to be strict. In fact the equality ρij (τ ) = ±1 holds only when
xi,t+τ = αxj,t .
The POPs U = u1 , . . . , up are the normalised (right) eigenvectors of A
satisfying AU = U, where  = Diag λ1 , . . . , λp . Since the feedback matrix
is not symmetric, it also has left eigenvectors V = v1 , . . . , vp satisfying VT A =
VT . These left eigenvectors are the eigenvectors of the adjoint AT of A and are
known as the adjoint vectors of U (Appendix F). They satisfy VT U = UVT = Ip .
Precisely, U and V are the left and right singular vectors of A, i.e.


p
A = UVT = λk uk vTk .
k=1

The interpretation of POPs is quite different from EOFs. For example, unlike
POPs, the EOFs are orthogonal and real. von Storch et al. (1988) interpret the
real and imaginary parts of POPs as standing oscillatory and propagating patterns,
respectively. Also, in EOFs the importance of the patterns is naturally dictated by
their explained variance, but this is not the case for the POPs. In fact, there is no a
priori unique rule of pattern selection in the latter case. One way forward is to look
at the time series of the corresponding POP.

6.2.2 Time Coefficients

As for EOFs, each POP has an associated time series, or POP coefficients, showing
their amplitudes as a function of time. The k’th complex coefficient zk (t) at time t,
associated with the k’th POP,3 is the projection of xt onto the adjoint vk of the k’th
POP uk

zk (t) = zkt = xTt vk . (6.6)

Unlike PCs, the POP coefficients satisfy a stochastic model dictated by Eq. (6.3).
In fact, it can be seen by post-multiplying (6.3) by vk that zk (t) yields a (complex)
AR(1) model:

zk (t + 1) = λk zk (t) + k,t+1 . (6.7)

Remark The expected values of (6.7) decouple completely, and the dynamics are
described by damped spirals.

3 Note that there is no natural order for the POPs so far.


6.2 POP Derivation and Estimation 123

One would like to find out the covariance matrix of the new noise term in Eq. (6.7).
To this end, one assumes that the feedback matrix is diagonalisable so A = UU−1 .
The state vector xt is decomposed in the basis U of the state space following:


p
xt uk = Ux+
(k)
xt = t .
k=1

(k)
Note that the components xt are the coordinates of xt in this new basis. They are
not, in general, identical to the original coordinates. Similarly we get for the noise
term:


p
εt uk = Uε +
(k)
εt = t ,
k=1

where x+ +
(1) (p)
t = (xt , . . . , xt ) and similarly for ε t . After substituting the above
T

expression into (6.3), one obtains

x+ + +
t+1 = xt + ε t+1 .

Component-wise this takes the form:


(k) (k) (k)
xt+1 = λk xt + εt+1 k = 1, . . . p.

Exercise Derive the noise characteristics of ε+


t .

Hint Use the expression ε+ −1


t = U ε t to get the covariance matrix C = (cij ) =
U−1 QU−T . Hence E(εt εs ) = δts ckl .
(k) (l)

Because λk = |λk |e−iωk is inside the unit circle (for the observed sample), the
evolution of zk (t), in the noise-free case (zk (t) = λtk zk (0)), describes in the complex
plane a damped spiral with period Tk = 2π ωk and e-folding time τk = − log |λk | .
1

Within the normal modes, the variance of the POP coefficients, i.e. σk2 =
E(zk2 (t)), reflects the dynamical importance of the k’th POP. It can be shown that

ckl
σk2 = . (6.8)
1 − |λk |2

Exercise Derive Eq. (6.8).


Answer Use the identity  0 = A 0 AT + Q plus the decompositions A = UU−1
and C = U−1 QU−T . The covariance matrix then satisfies  0 = U 00 UT where
 00 is the solution of the matrix equation:  00 =  00 T + C, which can be
solved component-wise leading (for |λk | < 1) to ( 00 )kl = ckl /(1 − λk λ∗l ).
124 6 Principal Oscillation Patterns and Their Extension

Remark The above excitation, which reflects a measure of the dynamics of eigen-
modes, is at odds with the traditional measure in which the mode with the least
damping (or largest |λk |) is the most important. The latter view is based on the noise-
free dynamics, whereas the former takes into consideration the stochastic forcing
and therefore seems more relevant.
Let us write the k’th POP uk as uk = urk + iuik and similarly for the POP
coefficient zk (t) = zkr (t) + izki (t). The noise-free evolution of the k’th POP mode
(taking for simplicity zk (0) = 1) is given by
 
zk (t)uk + zk∗ (t)u∗k = λtk uk + (λ∗k )t u∗k = 2|λk |t urk cos ωk t + uik sin ωk t .

Therefore, within the two-dimensional space spanned by urk and uik , the evolution
can be described by a succession of patterns, with decreasing amplitudes, given by

urk → uik → −urk → −uik → urk . . . ,

and these represent the states occupied by the POP mode at successive times tm =
2ωk , for m = 0, 1, 2, . . . (Fig. 6.1), see also von Storch et al. (1995). The result of

travelling features is a consequence of the above equations, and therefore, a AR(1)


model is inappropriate to model a standing wave.
A familiar property of the AR(1) model (6.7) is the form of its spectral density
function (Appendix C). In fact, if from (6.7) we regard t as the output of a linear
digital filter (Chap. 2), then we get the following relationship between the spectral
density functions f () and fz () of (t) and zk (t), respectively:

f (ω)
fz (ω) = . (6.9)
|λk − eiω |2

Fig. 6.1 The evolution of the Evolution of one POP


k’th POP within the plane
(urk , uik ), showing the ui
positions of uk (t) at k
successive times t0 , t1 , . . . , tm
uk(t2) u (t )
k 1

ur
k

O u (t )
k 0

uk(tm)
6.2 POP Derivation and Estimation 125

Exercise Derive the above relationship, Eq. (6.9).



Hint k (t) can be regarded as a “filtered” version of zk (t), i.e. k (t) = h(u)zk (t −
u)du, where the transfer function is given by h(u) = δ−1 (u)−λk δ(u), where δ(u) is
the Dirac (spike) distribution.4 After applying Fourier transform (Chap. 2), one gets
f (ω) = |(ω)|2 fz (ω), and hence (6.9). The function (ω) is the Fourier transform
of the transfer function h().
The spectrum of zk (t) is a rational function of λk . If the noise k (t) is white, then
it is clear that when λk is real (6.9) is the usual AR(1) model with its red spectrum.
However, when λk is complex, then (6.9) has a spectral peak whose width increases
as |λk | decreases and represents a second-order autoregressive model AR(2), see
e.g. von Storch and Zwiers (1999, chap 15).
The interpretation of POPs can be made easy by analysing the spectral informa-
tion from the POP coefficients (or amplitudes). An example where tropical surface
wind and SST are combined with stratospheric zonal wind and submitted to a POP
analysis is investigated by Xu (1993). See also von Storch and Xu (1990), von
Storch and Baumhefner (1991), Xue et al. (1994) and von Storch et al. (1995)
(and references therein) who used POPs for prediction purposes. POP analysis with
cyclo-stationary time series has also been addressed in von Storch et al. (1995).

6.2.3 Example

POP analysis (Hasselmann 1988) was applied by a number of authors, e.g. von
Storch et al. (1988), von Storch et al. (1995), Xu (1993). Schnur et al. (1993),
for example, investigated synoptic- and medium-scale (3–25 days) wave activity in
the atmosphere. They analysed geopotential heights from ECMWF analyses for the
Dec–Jan–Feb (DJF) period in 1984–1987 for wavenumbers 5–9. Their main finding
was a close similarity between the most significant POPs and the most unstable
waves, describing the linear growth of unstable baroclinic structures with period 3–
7 days. Figure 6.2 shows POP1 (phase and amplitude) of twice-daily geopotential
height for zonal wavenumber 8. Significant amplitude is observed around 45◦ N
associated with a 90◦ phase shift with the imaginary part westward of the real
part. The period of POP1 is 4 days with an e-folding of 8.6 days. Combined with
the propagating structure of POP evolution, as is shown in Sect. 6.2.2, the figure
manifests an eastward propagating perturbation with a westward phase tilt with
height. The eastward propagation is also manifested in the horizontal cross-section
of the upper level POP1 pattern shown in Fig. 6.3.
Frederiksen and Branstator (2005), for example, investigated POPs of 300-
hPa streamfunction fields from NCAR/NCEP reanalysis and general circulation


4 TheDirac function is defined by the property δa (u)f (u)du = f (a), and in general, δ0 (u) is
simply noted as δ(u) .
126 6 Principal Oscillation Patterns and Their Extension

Fig. 6.2 Leading POP, showing phase (upper row) and amplitude (lower row), of twice-daily
ECMWF geopotential height field during DJF 1984–1987 for zonal wavenumber 8. Adapted from
Schnur et al. (1993). ©American Meteorological Society. Used with permission

Fig. 6.3 Cross-section at the 200-hPa level of POP1 of Fig. 6.2. Adapted from Schnur et al. (1993).
©American Meteorological Society. Used with permission
6.3 Relation to Continuous POPs 127

Fig. 6.4 Leading EOF (a) and POP (b) of the NCAR/NCEP 300-hPa streamfunction for March.
Adapted from Frederiksen and Branstator (2005). ©American Meteorological Society. Used with
permission

model (GCM) simulations. Figure 6.4 shows the leading EOF and POP patterns
of NCAR/NCEP March 300-hPa streamfunction field. There is similarity between
EOF1 and POP1. For example, both are characterised by approximate large scale
zonal symmetry capturing midlatitude and subtropical variability. The leading POPs
are in general real with decay e-folding times. Figure 6.5 shows the POP2 for
January. It shows a Pacific North America (PNA) pattern signature and is similar
to EOF2. The leading POPs can be obtained, in general, from a superposition of the
first 5 to 10 EOFs as pointed out by Frederiksen and Branstator (2005). Figure 6.6
shows the average global growth rate of the leading 5 POPs across the different
months. The figure shows in particular that the leading POPs are all damped.

6.3 Relation to Continuous POPs

6.3.1 Basic Relationships

Various interesting relationships can be derived from (6.3), and the relationship
given in Eq. (6.4) is one of them. Furthermore by computing xt+1 xTt+1 and taking
expectation, one gets

 0 = A 0 AT + Q. (6.10)

Also, expanding (xt+1 − ε t+1 ) xTt+1 − ε Tt+1 and taking expectation, after
using (6.10), yield E ε t xTt + E xt ε Tt = 2Q. On a computer, the continuous
128 6 Principal Oscillation Patterns and Their Extension

Fig. 6.5 POP2 (a) and EOF2 (b) of January 300-hPa NCEP/NCAR streamfunction field. Adapted
from Frederiksen and Branstator (2005). ©American Meteorological Society. Used with permis-
sion

–0.045

–0.050

–0.055
∼ (t)
ω i
day–1
–0.060

–0.065

–0.070

–0.075
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan

Fig. 6.6 Average global growth rate of the leading 5 POPs (continuous) and FTPOPs (dashed).
Adapted from Frederiksen and Branstator (2005). ©American Meteorological Society. Used with
permission

AR(1) model (6.2) has to be discretised, and when this is done, one gets a similar
equation to (6.3). So if we discretise (6.2) using a unit time interval, which can be
made after some scaling is applied, then Eq. (6.2) can be roughly approximated by
6.3 Relation to Continuous POPs 129

xt+1 = Ip + B xt + εt+1 , (6.11)

which is equivalent to (6.3) with A = Ip + B. The relationship (6.10) now becomes

B 0 +  0 BT + B 0 BT + Q = O. (6.12)

This last relationship can be regarded as a generalisation of the fluctuation–


dissipation relation, derived in the context of linear inverse modelling, LIM (e.g.
Penland 1989),

B 0 +  0 BT + Q = O, (6.13)

which can be obtained using (6.12) by dropping the nonlinear term in B. A more
accurate discrete approximation of (6.2) can be obtained using an infinitesimal time
step δτ to obtain the following process:

xt+nδτ = Ip + δτ B xt+(n−1)δτ + ε t+nδτ . (6.14)

The iteration of (6.14) yields


n
xt+nδτ = Ip + δτ B xt + ηt,τ , (6.15)

where ηt,τ is now an autocorrelated noise but uncorrelated with xt . Now, for large
enough n, such that τ = nδτ remains finite, (6.15) can be approximated by

xt+τ = eBτ xt + ηt,τ . (6.16)

Note that now (6.16) is similar to (6.3) except that the noise is autocorrelated.
Multiplying both sides of (6.16) by xTt and taking expectation yield

 τ = eBτ  0 . (6.17)

The matrix G(τ ) = exp (Bτ ) =  τ  −1 0 is known as the Green’s function or


resolvent. Furthermore, from (6.16), we can also obtain the covariance matrix of
the noise, i.e.
 
σ (τ ) = E ηt,τ ηTt,τ =  0 − G(τ ) 0 [G(τ )]T . (6.18)

Relation (6.18) can be useful, for example, in deriving the conditional probability5
p (xt+τ |xt ). If we suppose that B is diagonalisable, and we decompose it using its

the noise term εt is Gaussian, then (6.15) implies that if xt is given the state vector xt+τ is also
5 If

normal with mean G(τ )xt and covariance matrix σ (τ ), i.e.


130 6 Principal Oscillation Patterns and Their Extension

left and right eigenvectors, i.e. B = LR, then the “continuous” POPs are given by
the right eigenvectors R of B. Note that this is not exactly the SVD decomposition
of B. Because RLT = Ip , we also have a similar decomposition of the Green’s
function:

G(τ ) = Leτ R. (6.19)

Note that, in practice, the (feedback) matrix G(τ ) of the continuous POP is
calculated first before the matrix B. Note also that Eq. (6.18) can be useful, for
example, in
 forecasting and provides a natural measure of forecasting accuracy,
namely E xt+τ − x̂t+τ /tr ( 0 ), where x̂t+τ = G(τ )xt , see e.g. Penland (1989)
and von Storch et al. (1995).

6.3.2 Finite Time POPs

Finite time POPs (FTPOPs) or empirical finite time normal modes were introduced
by Frederiksen and Branstator (2005) as the analogue of finite time normal modes
in which the linear operator or feedback matrix A is obtained by direct linearisation
of the nonlinear equations (Frederiksen 1997; Frederiksen and Branstator 2001). In
FTPOPs, the linear operator in Eq. (6.2) is time-dependent, i.e.

d
xt = B(t)xt + ε t . (6.20)
dt
The solution to Eq. (6.20) is obtained as an integral (Appendix G):
 t
xt = S(t, t0 )xt0 + S(t, u)ε u du, (6.21)
t0

where S(., .) is the propagator. Note that when the operator B is time independent,
then

S(t, t0 ) = e(t−t0 )B . (6.22)

An explicit expression of S(t, u) can be obtained when B(t) and B(u) commute, i.e.
B(t)B(u) = B(u)B(t), for all t and u, then (Appendix G):
t
S(t, u) = e u B(τ )dτ
. (6.23)


p 1 1
P r (xt+τ = x|xt ) = (2π )− 2 |σ (τ )|− 2 exp − (x − G(τ )xt )T [σ (τ )]−1 (x − G(τ )xt ) ,
2

which, under stationarity, tends to the multinormal distribution N (0,  0 ) when τ tends to infinity.
6.3 Relation to Continuous POPs 131

The propagator satisfies the property (Appendix G):

S(t, u)S(u, v) = S(t, v). (6.24)

The FTPOPs are defined as the eigenvectors of S(t, 0). Over an interval [0, T ], the
propagator can be approximated via Eq. (6.24) using a second-order finite difference
scheme of S(tk , tk−1 ), for tk = T − (n − k)δt, k = 1, . . . n, and δt is a half-
hour time step. The eigenvalues λ = λr + iλi (and associated eigenvectors) of
the propagator S(T , 0) are used to compute the global growth rate ωi and phase
frequency ωr following:

ωi = 2T
1
log |λ|
(6.25)
ωr = − T1 arctan( λλri ).

In their analysis, Frederiksen and Branstator (2005) considered one year period
(T = 1 yr) for the global characteristics and examined also local characteristics for
each month using daily data. As for POPs, the eigenvalues determine the nature
of the FTPOPs, namely travelling when λi = 0 or recurring when λi = 0.
Figure 6.7 shows the leading 300-hPa streamfunction FTPOP during March using
NCEP/NCAR reanalysis for the northern and southern hemispheres. As for the
leading POP, the leading FTPOP has an approximate zonally symmetric state with
a particular high correlation between EOF1, FTPOP1 and POP1 (Fig. 6.4). There
is also a similarity between the growth rates of the leading POPs and leading
FTPOPs (Fig. 6.6). The leading FTPOP for January (Fig. 6.8) shows many common
features with EOF2 and POP2 (Fig. 6.5) especially over the PNA region and has
high correlation with both EOF2 and POP2.

Fig. 6.7 Leading NCEP/NCAR 300-hPa streamfunction FTPOP for March for the northern (left)
and southern (right) hemispheres. Adapted from Frederiksen and Branstator (2005). ©American
Meteorological Society. Used with permission
132 6 Principal Oscillation Patterns and Their Extension

Fig. 6.8 January FTPOP1 obtained from the NCAR/NCEP 300-hPa streamfunction. Adapted
from Frederiksen and Branstator (2005). ©American Meteorological Society. Used with permis-
sion

6.4 Cyclo-Stationary POPs

Cyclo-stationary POPs have been implicitly mentioned in the previous section when
finite time POPs were discussed. In this section more details on cyclo-stationarity
are given. In the POP model (6.3) the time series was assumed to be second-order
stationary.6 This can be a reasonable assumption when we analyse, for example,
data on an intraseasonal time scale for a given, e.g. winter, season. When the data
contain a (quasi-) periodic signal, such as the annual or biennial cycles, then the
AR(1) model can no longer be supported. An appropriate extension of the POP
model, which takes into account the periodic signal, is the cyclo-stationary POP
(CPOP) model. CPOP analysis was published first7 by Blumenthal (1991), who
applied it to analyse El-Niño Southern Oscillation (ENSO) from a climate model,
and was applied later by various authors, such as von Storch et al. (1995), who
applied it to the Madden Julian Oscillation (MJO), and Kim and Wu (1999), who
compared it to other EOF techniques.
We assume now that our data contain, say, T cycles, and in each cycle we have
l sample making the total sample size n = T l. For example, with a 10-year data
of monthly SST, we have T = 10 and l = 12. Given a time series x1 , . . . , xn ,

6 In practice POP analysis has been applied to many time series not necessarily strictly stationary.
7 see von Storch et al. (1995) for other unpublished works on CPOPs.
6.4 Cyclo-Stationary POPs 133

with n = T l, any observation x_s, s = 1, . . . , n, belongs to the t'th cycle, where t = t′ − δ^0_{τ′}, and t′ and τ′ are obtained from the relation:

s = (t′ − 1)l + τ′,      (6.26)

with 0 ≤ τ′ ≤ l − 1 and 1 ≤ t′ ≤ T + 1. The nested time within the cycle is τ = τ′ + l δ^0_{τ′}, where δ_i^j denotes the Kronecker symbol. The observation x_s can be identified alternatively by t and τ, i.e. x_s = x_{t,τ}. Note that the noise term is not
cyclo-stationary. The CPOP (or cyclo-stationary AR(1)) model then reads

xt,τ +1 = A(τ )xt,τ + ε t,τ +1 (6.27)

with the property xt,τ +l = xt+1,τ and A(τ + l) = A(τ ). The iteration of Eq. (6.27)
yields

xt+1,τ = xt,τ +l = B(τ )xt,τ + εt+1,τ , (6.28)

where the system matrix B(τ ) is given by

B(τ) = A(τ + l − 1) · · · A(τ) = ∏_{k=1}^{l} A(τ + l − k).      (6.29)

The CPOPs are the eigenvectors of the system matrix B(τ), for τ = 1, . . . , l, i.e.

B(τ)u = λu.      (6.30)

It can be shown that the eigenvalues of B(τ) in (6.30) do not depend on τ.


Exercise Let λ(τ ) and u(τ ) denote an eigenvalue and associated eigenvector of
B(τ ), respectively. Show that λ(τ ) is also an eigenvalue of B(τ + q) for any q.
Hint Just take q = 1 and compute B(τ + 1) [A(τ )u(τ )], using Eq. (6.29), and the
periodicity of A(τ ).
The above exercise shows that if u(τ ) is an eigenvector of B(τ ) with eigenvalue
λ, then A(τ )u(τ ) is an eigenvector of B(τ + 1) with the same eigenvalue. The
periodicity of A(τ ) is inherited by the eigenvectors of B(τ ), see the next exercise.
Exercise Assume u(τ ) to be an eigenvector of B(τ ) with eigenvalue λ, and let
u(τ + k) = A(τ + k − 1)u(τ + k − 1), for k = 1, . . . , l. Show that u(τ + l) = λu(τ ).
Hint Use the result u(τ + l) = B(τ )u(τ ) plus the fact that u(τ ) is an eigenvector of
B(τ ).
From the above exercise, it can be seen that using a proper normalisation we can make u(τ) periodic. Precisely, let the unit-length vector u(τ) satisfy (6.30) with λ = |λ|e^{−iφ}. Then the vector A(τ)u(τ) is an eigenvector of B(τ + 1), with the same eigenvalue λ. Let u(τ + 1) = cA(τ)u(τ), where c is a complex coefficient. We can choose c = ρ(τ)e^{iθ}, so that u(τ + 1) is unit-length and also periodic. The choice ρ(τ) = ‖A(τ)u(τ)‖^{−1} yields ‖u(τ + 1)‖ = 1, and to achieve periodicity, we compute u(τ + l) recursively, yielding

u(τ + l) = ∏_{k=1}^{l} ρ(τ + l − k) e^{ilθ} B(τ)u(τ) = |λ| ∏_{k=1}^{l} ρ(τ + l − k) e^{i(lθ−φ)} u(τ).      (6.31)

Now, since ‖u(τ + l)‖ = 1 by construction, the above equation yields |λ| ∏_{k=1}^{l} ρ(τ + l − k) = 1, and then u(τ + l) = e^{i(lθ−φ)} u(τ). By choosing θ = φ/l, the eigenvectors u(τ) become periodic. To summarise, the CPOPs are obtained from the set of simultaneous eigenvalue problems B(τ)u(τ) = λu(τ), τ = 1, . . . , l, and one only needs to solve the problem for τ = 1.
Once the CPOPs are obtained, we proceed as for the POPs and compute the
CPOP coefficients z(t, τ ) = v(τ )∗T x(t, τ ), by projecting the data x(t, τ ) onto the
adjoint pattern v(τ ), i.e. the eigenvector of BT (τ ) with the eigenvalue λ, yielding

z(t + 1, τ ) = λz(t, τ ) + εt+1,τ . (6.32)

Remark The adjoint patterns v(τ ), τ = 1, . . . , l − 1, can be calculated recursively


starting from those of B(l), in a similar manner to the eigenvectors u(τ ), but going
backward, see the exercise below.
Exercise Let v(τ + 1) be an adjoint vector of B(τ + 1) with eigenvalue λ and v(τ )
the adjoint vector of B(τ ) with the same eigenvalue.
Show that v(τ ) = αAT (τ )v(τ + 1), where α is a complex number.
Find α that makes the adjoint unit-length and periodic.
Hint Keeping in mind that B^T(τ + 1)v(τ + 1) = λv(τ + 1), along with (6.29) plus the periodicity of A(τ), we get

B^T(τ)[A^T(τ)v(τ + 1)] = A^T(τ)[B^T(τ + 1)v(τ + 1)] = λA^T(τ)v(τ + 1).

Therefore, A^T(τ)v(τ + 1) is an eigenvector of B^T(τ), and v(τ) = αA^T(τ)v(τ + 1) (where we have assumed that all the eigenvalues are distinct). For the normalisation, assuming ‖v(τ + 1)‖ = 1, we can obtain v(τ) = r(τ + 1)^{−1} e^{iθ} A^T(τ)v(τ + 1), where r(τ + 1) = ‖A^T(τ)v(τ + 1)‖ and θ is as before.
Remark The estimation of A(τ) is obtained, for each τ = 1, . . . , l, using S_{τ,1} S_{τ,0}^{−1}, where S_{τ,0} is the sample covariance matrix of x_{t,τ}, and S_{τ,1} is the sample lagged-1 cross-covariance matrix between x_{t,τ} and x_{t,τ+1}, t = 1, . . . , T.

6.5 Other Extensions/Interpretations of POPs

6.5.1 POPs and Normal Modes

Here we attempt to compare two different but related techniques, POPs and normal
modes, to evaluate the effects of various feedback systems on the dynamics of

waves and disturbances. POP is an empirically based technique that attempts to


gain knowledge of the spatio-temporal characteristics of a complex system. The normal mode approach, on the other hand, is a (physical) model-based technique that aims
at analysing the (linearly) fastest growing modes. The model could be a full GCM
or another simple model, such as the quasi-geostrophic vorticity equations. The
normal modes are obtained as the eigenfunctions of the linearised system equations
by performing a stability analysis. Note that in this case there is no need for data
but only the physical model. The linearisation is generally performed using a basic
state8 (Simmons et al. 1983). The most unstable modes are the normal modes
(eigenfunctions) of the linear system:

dx/dt = F′(x_0) x      (6.33)

associated with eigenvalues λ of F′(x_0) satisfying |λ| > 1. In (6.33) the vector x_0 is the basic state and F′(x_0) is the differential of F() at x_0, i.e. F′(x_0) = ∂F/∂x (x_0). Note that the eigenvalues and the corresponding normal modes are in general complex and therefore have standing and propagating patterns.
It is commonly accepted that POPs also provide estimates of normal modes.
Schnur et al. (1993), see also von Storch and Zwiers (1999), investigated normal
modes from a quasi-geostrophic model and computed POPs using data simulated
by the model. They also investigated and compared normal modes and POPs using
the quasi-geostrophic vorticity equations on the sphere and concluded that POPs
can be attributed to the linear growing phase of baroclinic waves. Unstable normal
modes have eigenvalues with magnitude greater than unity. We have seen, however,
that the eigenvalues of A = Γ_1 Γ_0^{−1}, or its estimate Â, are inside the unit circle.
Part of this inconsistency is due to the fact that the normal modes are (implicitly)
derived from a continuous system, whereas (6.3) is a discrete AR(1). A useful way
is perhaps to compare the modes of (6.19) with the continuous POPs.

6.5.2 Complex POPs

In POP analysis the state vector of the system is real. POPs are not appropriate
to model and identify standing oscillations (Bürger 1993). A standing oscillation
corresponds to a real POP associated with a real eigenvalue. But this implies a
(real) multivariate AR(1) model, i.e. a red spectrum or damped system. Since
the eigenvalues are inside the unit circle, a linear first-order system is unable to
model standing oscillations. Two alternatives can be considered to overcome this
shortcoming, namely:

8 Often taken to be the mean state or climatology, although conceptually it should be a stationary
state of the system. This choice is adopted because of the difficulties in finding stationary states.

(i) increase the order of the system,


(ii) incorporate further information by including the state of the system and its
tendency.
The first solution is clearly not parsimonious since it requires more unknowns and
can be expensive to run and difficult to interpret. The second alternative is more
appropriate since the other required information is contained in the momentum
or tendencies of the system in the continuous case. In the discrete case, the
“momentum” or conjugate information is readily provided by the Hilbert transform.
The complexified field

zt = xt + iyt , (6.34)

where yt = H(xt ) is then used to compute the POPs, yielding Hilbert POPs
(HPOPs). The Hilbert transform H(xt ) contains information about the system state
tendency. Both xt and H(xt ) play, respectively, the roles of position and momentum
in Hamiltonian systems. Hence the use of the complexified system states becomes
equivalent to using a second-order system but without increasing the size of the
unknown parameters. The HPOPs are therefore obtained as the eigenvectors of the
new complex matrix:

A = Γ_1 Γ_0^{−1},      (6.35)

where Γ_0 = E(z_t z_t^{*T}) and Γ_1 = E(z_{t+1} z_t^{*T}). The matrix A in (6.35) represents the equivalent of the feedback matrix of a multivariate complex AR(1) model z_{t+1} = Az_t + ε_{t+1}. Unlike usual POPs, HPOPs do not come in conjugate pairs
and are able to resolve a maximum number of independent oscillations equal to the
space dimensions.
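A minimal sketch of the HPOP computation, under the assumption that X is an n × p anomaly matrix and that the hilbert function of Matlab's Signal Processing Toolbox is available to form the complexified field (6.34), could read:

>> Z  = hilbert(X);                     % z_t = x_t + i H(x_t), column by column
>> Z0 = Z(1:end-1,:); Z1 = Z(2:end,:); N = size(Z0,1);
>> G0 = (Z0.'*conj(Z0))/N;              % sample estimate of Gamma_0 = E(z_t z_t^{*T})
>> G1 = (Z1.'*conj(Z0))/N;              % sample estimate of Gamma_1 = E(z_{t+1} z_t^{*T})
>> A  = G1/G0;                          % feedback matrix, Eq. (6.35)
>> [P, lam] = eig(A);                   % HPOPs (columns of P) and their eigenvalues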

6.5.3 Hilbert Oscillation Patterns

POP analysis is based on a simple model representing a linearised version of a


complex nonlinear model expressed in discrete form given by (6.3), which can also
be written as

xt+1 − xt = Bxt + ε t+1 ,

in which one recognises the left hand side as a “time derivative”. Since Hilbert
transform can also be interpreted as a (special) time derivative, an alternative way is
to consider a similar model except that the left hand side is now a Hilbert transform
of the original field (Irene Oliveira, personal communication). The model reads

H (xt ) = Axt + ε t+1 , (6.36)



where y_t = H(x_t) is the Hilbert transform of x_t. The matrix A is estimated using the sample covariance matrix S of the data and the sample cross-covariance matrix S_xy between x_t and y_t, t = 1, . . . , n, as

Â = S_xy S^{−1}.      (6.37)

Hilbert oscillation patterns (HOPs) are obtained as the eigenvectors of Â. If Szz is
the (Hermitian) covariance matrix of the complexified field zt = xt + iyt , then one
has
 
A = i (I_p − (1/2) S_zz S^{−1});      (6.38)

hence, the eigenvalues of A are pure imaginary since the eigenvalues of Szz S−1 are
real (non-negative). It is also straightforward to check that if Szz is not invertible,
then i is an eigenvalue of A with associated eigenvectors given by SN, where N is
the null space of Szz .
Now let u be an eigenvector of S_zz S^{−1} with eigenvalue λ, so that S_zz v = λSv, where v = S^{−1}u. Taking the projected series x̃_t = v^{*T}x_t, ỹ_t = v^{*T}y_t and z̃_t = v^{*T}z_t, one gets, since var(z̃_t) = var(x̃_t) + var(ỹ_t) = 2var(x̃_t):

var(z̃_t) = v^{*T} S_zz v = 2 v^{*T} S v;      (6.39)

hence, λ = 2, and the corresponding eigenvalue of A is β = 0. Hence the spectrum of A contains λ = 0. In addition, when S_zz is not invertible, then the spectrum also
contains λ = i. These HOPs can be used to define Hilbert canonical correlation
analysis (HCCA). Classical CCA (see Chap. 15) looks for most correlated patterns
between two fields xt and yt . In HCCA, the field yt is taken to be H(xt ). The
equations involved in HCCA are similar to those of CCA and are given by

S−1 Sxy S−1 Syx a = λa, (6.40)

and a similar equation for the other patterns, with Sxy and Syx , exchanged. Defining
u = Sa, and noting that Syx = −Sxy , the above equation becomes

A2 u = −λu. (6.41)

Hence the eigenvalues of HCCA are the (opposite of the) square of the eigenvalues
associated with HOPs, and the associated eigenvectors are given by S−1 uk , where
uk , k = 1, . . . , p, are the HOPs.

6.5.4 Dynamic Mode Decomposition

An interesting method of analysing the dynamics of (linear or nonlinear) dynamical


systems was introduced by Schmid (2010), under the name “dynamic mode decomposition” (DMD), originally to deal with fluid flows. The method was extended later by a number of authors (e.g. Tu et al. 2014; Williams et al. 2015). The method
seeks to decompose the data using a set of modes characterised by oscillation
frequencies and growth rates. The DMD modes are the analogues of normal modes
for linear systems. These modes are obtained through analysing the eigenfunctions
and associated eigenvalues of the composition operator, also known as Koopman
operator in dynamical system theory. Because the DMD modes have temporal
characteristics (e.g. growth/decay rates) and are not in general orthogonal, they
can outperform EOF analysis in terms of data representation or dimensionality
reduction.
Briefly, if x_1, . . . , x_n represent the set of d-dimensional time series, assumed to be related via an operator A, i.e. X_{2,n} = AX_{1,n−1} + E, where X_{k,l} is the d × (l−k+1) matrix [x_k, x_{k+1}, . . . , x_l], and E is an error term, then the DMD eigenvalues and modes are provided, respectively, by the eigenvalues and the associated eigenvectors
of A. The DMD modes can be computed using the Krylov method, via, for example,
the Arnoldi algorithm (Appendix D), or simply using the SVD algorithm. Tu et al.
(2014) present the theory behind DMD and extend it to a large class of datasets
along with a number of improved algorithms. They also showed the utility of the
method using numerical examples.
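A possible SVD-based sketch of the DMD computation, in the spirit of Tu et al. (2014), is given below; the d × n data matrix X (columns ordered in time) and the truncation level r are assumptions of the example.

>> X1 = X(:,1:end-1); X2 = X(:,2:end);  % snapshot pairs (x_t, x_{t+1})
>> [U, S, V] = svd(X1, 'econ');
>> r = 10;                              % truncation level (user choice)
>> Ur = U(:,1:r); Sr = S(1:r,1:r); Vr = V(:,1:r);
>> Atil = Ur'*X2*Vr/Sr;                 % operator A projected onto the leading r modes
>> [W, lam] = eig(Atil);                % DMD eigenvalues
>> Phi = X2*Vr/Sr*W;                    % DMD modes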

6.6 High-Order POPs

The POP model, Eq. (6.3), which is based on the multivariate AR(1), involves only the lag-1 autocorrelation of the process. This model can be extended by including m consecutive lags:


x_t = Σ_{l=1}^{m} A_l x_{t−l} + ε_t.      (6.42)

Equation (6.42) is the vector autoregressive, VAR(m), model of order m. The matrices A_k, k = 1, . . . , m, can be estimated using various approaches such as the (stepwise) least squares (Neumaier and Schneider 2001) or state space models (e.g. Lütkepohl 2006).
As for the AR(1) POP model, the VAR(m) model can be decomposed using mp p-dimensional normal (or eigen) modes of (6.42), e.g. Neumaier and Schneider (2001) and Schneider and Neumaier (2001). These multivariate POPs are characterised as damped oscillators having characteristic features and e-folding or damping times. This decomposition yields a system of mp coupled univariate

AR(1) models in which the coupling is through the noise covariance. The idea
is to use the extended, or delay, state space as for extended EOFs, which is
presented in Chap. 7. Denoting by x_t the delayed state vector using m lags, i.e. x_t = (x_t, x_{t−1}, . . . , x_{t−m+1})^T, and ε_t = (ε_t, 0, . . . , 0)^T, the model (6.42) can be written as a generalised AR(1) model:

x_t = A x_{t−1} + ε_t,      (6.43)

where

A = [ A_1  A_2  . . .  A_{m−1}  A_m
      I    O    . . .  O        O
      O    I    . . .  O        O
      ...  ...  . . .  ...      ...
      O    O    . . .  I        O ].      (6.44)

The same decomposition can now be applied as for the VAR(1) case. Note, however, that because of the Frobenius (companion) structure of the mp × mp matrix A in (6.44) (e.g. Wilkinson 1988, chap. 1.3), the eigenvectors v_k, k = 1, . . . , mp, have a particular structure:

v_k = (λ^{m−1} u_k^T, . . . , λ u_k^T, u_k^T)^T,      (6.45)

where u_k is a p-dimensional vector. It can be seen, by partitioning the vector v_k into m p-dimensional vectors, that u_k satisfies

(A_m + λA_{m−1} + . . . + λ^{m−1} A_1) u_k = λ^m u_k.      (6.46)
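The construction of the companion form (6.44) and its eigen-decomposition are easily sketched in Matlab; here the p × p × m array A holding A_1, . . . , A_m is an assumed input, e.g. obtained from a least squares VAR fit.

>> [p, ~, m] = size(A);
>> Acomp = [reshape(A, p, p*m); eye(p*(m-1)), zeros(p*(m-1), p)];   % Eq. (6.44)
>> [V, lam] = eig(Acomp);               % mp eigenvectors with the structure (6.45)
>> U = V(end-p+1:end, :);               % the bottom p rows give the patterns u_k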

6.7 Principal Interaction Patterns

The principal interaction pattern (PIP) method was proposed originally by Has-
selmann (1988). Slight modifications of the PIP method were introduced later by, e.g., Kwasniok (1996, 1997, 2004) and Wu (1996). A description is given below, and for more technical details, the reader is referred to Kwasniok (1996) and later papers. The PIP method takes into account the (nonlinear) dynamics of the system. In its simplest form, the PIP method attempts to project the dynamics of an N-
dimensional (autonomous) dynamical system living in a Hilbert space E with basis
ei , i = 1, . . . N:

du/dt = F(u),      (6.47)

where u = Σ_{i=1}^{N} u_i e_i, onto a lower L-dimensional Hilbert space P. This space is
spanned by the PIP patterns p1 , . . . , pL , with


p_i = Σ_{k=1}^{N} p_{ki} e_k,   i = 1, . . . , L,      (6.48)

where the N × L matrix P = (pij ) contains the PIP patterns. The state vector u is
then projected onto the PIP patterns to yield


z = Proj(u) = Σ_{i=1}^{L} z_i p_i.      (6.49)

The PIPs are normally chosen to be orthonormal, with respect to a predefined scalar
product, i.e.

[p_i^*, p_j] = δ_{ij},   i, j = 1, . . . , L.      (6.50)

The dynamics within the reduced PIP space is then given by

ż_i = [p_i^*, Proj(F(u))],   i = 1, . . . , L.      (6.51)

The patterns are then sought by minimising a costfunction measuring the discrep-
ancy between the (true) tendency of the PIP coefficients, żit , and the projection of
the tendency u̇ onto the PIP space, i.e.


PIPs = argmin{ F = Σ_{i=1}^{L} < |ż_i^t − ż_i|² > },      (6.52)

where < · > is an ensemble average. In simple terms, given an initial condition u_0, system (6.47) is integrated for a specified time interval τ. The obtained trajectory is then projected onto the PIPs. Similarly, the dynamics of Eq. (6.47) is projected onto the PIPs, which is then integrated forward using the projection Proj(u_0) = u_0^P of u_0 onto the PIPs as initial condition. The norm of the difference between the two obtained trajectories is then computed. More specifically, let u_τ^P be the state of the trajectory of (6.47) at time τ, starting from u_0, and projected onto the PIP space. Let also z_τ be the state of the trajectory of (6.51) starting from u_0^P. The discrepancy between the two trajectories at time τ, ‖u_τ^P − z_τ‖, is then computed, and a global measure ε² of the discrepancy is obtained. Kwasniok (2004), for example, used ε² = ‖u_{τ_max}^P − z_{τ_max}‖², for some integration time τ_max, whereas Crommelin and Majda (2004) used the total integral of this discrepancy, i.e. ε² = ∫_0^{τ_max} ‖u_τ^P − z_τ‖² dτ. The costfunction to be minimised with respect to the matrix P is then

F(P) = < ε²(u_0, τ_max, P) >.      (6.53)

The minimisation of (6.53), subject to the orthonormality conditions on the PIPs and a prespecified τ_max, is performed numerically with a quasi-Newton algorithm.
Kwasniok (2004) applied PIP analysis to a barotropic quasi-geostrophic model.
Figure 6.9 shows an example of the leading two PIPs along with the leading
two rotated EOFs of the streamfunction field. The rotated EOFs are quite similar
to the leading PIPs. The reason for this similarity, as suggested by Kwasniok
(2004), is the possible dominance of the linear terms by the forcing chosen in the
model. In general, however, this is not the case as shown in the example discussed
by Crommelin and Majda (2004). These authors considered the six-dimensional

Fig. 6.9 Leading two PIPs (a,b) and two rotated EOFs (c,d) of the streamfunction from a
barotropic quasi-geostrophic model on the sphere. Adapted from Kwasniok (2004). ©American
Meteorological Society. Used with permission

truncation model of the beta-plane quasi-geostrophic model of Charney and Devore


(1979), also discussed in De Swart (1988). Crommelin and Majda (2004) compared
several optimal bases applied to this six-dimensional system. Figure 6.10 shows the
phase portrait and the model trajectory based on the leading 4 EOFs compared to
the reference model trajectory. A similar plot is also obtained with the leading four
PIPs (Fig. 6.11). The figures show clearly that the PIP trajectory reproduces the reference model trajectory better than the EOF-based one. The PIP model was also found to be superior to models based on optimal persistent patterns.

Fig. 6.10 Trajectory of the integration of the Charney and Devore (1979) model using the leading
four-EOF model projected onto the leading two EOFs (top left) and onto the leading EOF (middle),
and the reference trajectory projected onto the leading two EOFs (top right) and onto the leading
EOF (bottom). Adapted from Crommelin and Majda (2004). ©American Meteorological Society.
Used with permission

Fig. 6.11 Same as Fig. 6.10 but using a four-PIP model. The trajectories from the original
reference model are also shown for comparison. Adapted from Crommelin and Majda (2004).
©American Meteorological Society. Used with permission
Chapter 7
Extended EOFs and SSA

Abstract Hilbert EOFs presented in Chap. 5 are based on a spectral method to


identify propagating or oscillating features. This chapter describes a time domain
method, the extended EOFs, to identify propagating patterns from spatio-temporal
data sets. The method is similar to the EOF method except that the spatial dimension
is extended to include lagged information. Examples from the Madden–Julian
oscillation are also provided.

Keywords Time–space EOFs · Singular spectrum analysis · Multichannel SSA ·


Extended EOFs · Dynamical reconstruction · OLR · Madden–Julian oscillation ·
Recurrence networks · Harmonic decomposition

7.1 Introduction

Atmospheric fields are very often significantly correlated in both the space and time
dimensions. EOF technique, for example, finds patterns that are both spatially and
temporally uncorrelated. Such techniques make use of the spatial correlation but do
not take into account the significant auto- and cross-correlations in time. As a result,
travelling waves, for example, cannot be identified easily using these techniques as
was pointed out in the previous chapter.
Complex (or Hilbert) EOFs (HEOFs) (Rasmusson et al. 1981; Barnett 1983;
Horel 1984; von Storch and Zwiers 1999) presented in Chap. 5, have been intro-
duced to detect propagating structures. In HEOFs, a phase information is introduced
using the conjugate part of the field, which is provided by its Hilbert transform.
Chapter 5 provides illustration of Hilbert EOFs with the QBO signal. So the
information regarding the “propagation” is contained in the phase-shifted complex
part of the system. However, despite this extra information provided by the Hilbert
transform, the HEOF approach does not take into consideration the temporal auto- and cross-correlation in the field (Merrifield and Guza 1990).
POPs and HPOPs (Hasselmann 1988; von Storch et al. 1995; Bürger 1993) are
methods that aim at empirically inferring space–time characteristics of a complex
field. These methods are based on a first-order autoregressive model and can


therefore be used to identify travelling structures and in some cases forecast the
future system states. The eigenfunctions of the system feedback matrix in POPs
and HPOPs, however, do not provide an orthogonal and complete basis. Besides
being linear, another drawback of the AR(1) model, which involves only lag-1
autocorrelations, is that it may be sometimes inappropriate to model higher order
systems.
The extended EOFs introduced by Weare and Nasstrom (1982) combine both the
aspects of spatial correlation of EOFs and the temporal auto- and cross-correlation
obtained from the lagged covariance matrix. Subsequent works by Broomhead and
colleagues (e.g. Broomhead and King (1986a,b)) and Fraedrich (1986) focussed
on the dynamical aspects of extended EOFs as a way of dynamical reconstruction
of the attractors of a dynamical system that is partially observed and termed it
singular system analysis (SSA). Multichannel singular spectrum analysis (MSSA)
was used later by Vautard et al. (1992) and Plaut and Vautard (1994), who applied it to atmospheric fields.
warming, whereas an application to Rossby wave breaking and Greenland blocking
can be found in Hannachi et al. (2012), see also the review of Hannachi et al. (2007).
An appropriate example where propagating signals are prominent and where
extended EOFs can be applied to reveal these signals is the Madden–Julian
oscillation (MJO). MJO is an eastward propagating planetary-scale wave of tropical
convective anomalies and is a dominant mode of intraseasonal tropical variability.
The oscillation has a broadband spectral signature with a period between 40 and 60 days and has been identified in zonal and divergent winds, sea level pressure and outgoing long wave radiation (OLR) (Madden and Julian 1994). Figure 7.1 shows the OLR field over
the tropical band on 25 December 1996. Note, for example, the low OLR values over the warm pool, an area of large convective activity, in addition to other regions such as Amazonia and tropical
Africa. The OLR data come from NCEP/NCAR reanalyses over the tropical region

Fig. 7.1 Outgoing long wave radiation distribution over the tropics on 25 December 1996. Units W/m². Adapted from Hannachi et al. (2007)

Fig. 7.2 Leading EOF of OLR anomalies. Units arbitrary. Adapted from Hannachi et al. (2007)

equatorward from 30◦ latitude. Seasonality of OLR is quite complex and depends
on the latitude band (Hannachi et al. 2007).
Figure 7.2 shows the leading EOF pattern explaining about 15% of the total
variability. It has opposite signs north and south of the equator and represents the
seasonal cycle, mostly explained by the inter-tropical convergence zone (ITCZ).

7.2 Dynamical Reconstruction and SSA

7.2.1 Background

The issue of dynamical reconstruction of attractors is rooted in the theory of dynam-


ical systems and aims at reconstructing the multidimensional system’s trajectory, or
more widely the attractor of a chaotic system. In a nutshell a chaotic system is a
system that can be described by a set1 of ordinary differential equations

d
x = F (x),
dt
which cannot be analytically integrated. A chaotic trajectory is the trajectory of the
chaotic system within its phase space. A characteristic feature of a chaotic system is
its sensitivity to initial conditions, i.e. trajectories corresponding to two very close
initial conditions diverge exponentially in a finite time. A chaotic system gives rise
to a chaotic attractor, a set with extremely complex topology. Figure 7.3 shows an
example of the popular Lorenz (1963) system (Fig. 7.3a) and a contaminated (or

1 Of at least 3 variables.


Fig. 7.3 Time series of the Lorenz (1963) model (a) and its contamination obtained by adding a
red noise. Adapted from Hannachi and Allen (2001)

noisy) time series (Fig. 7.3b). SSA can be used to obtain the hidden signal from the
noisy data.
The general problem of dynamical reconstruction consists in inferring dynamical
and geometric characteristics of a chaotic attractor from a univariate time series,
x1 , x2 , . . . , xn , sampled from the system. The solution to this problem is based on
transforming the one-dimensional sequence into a multivariate time series using the
so-called method of delays, or delay coordinates first proposed by Packard et al.
(1980), and is obtained using a sliding window through the time series to yield

x_t = (x_t, x_{t+τ}, . . . , x_{t+(M−1)τ})^T,      (7.1)

where τ is the time delay or lag and M is known as the embedding dimension.
A basic result in the theory of dynamical systems indicates that the characteristics
of the dynamical attractor can be faithfully recovered using the delay coordinate
method provided that the embedding dimension is large enough (Takens 1981).

7.2.2 Dynamical Reconstruction and SSA

In the sequel we suppose for simplicity that τ = 1. The analysis of the multivariate
time series xt , t = 1, 2, . . . n − M + 1, for dynamical reconstruction is once
more faced with the problem of dimensionality. Attractors of low-order chaotic
systems are in general low-dimensional and can in principle be explored within
a smaller dimension than that of the embedding space. A straightforward approach
is to first reduce the space dimension using, for example, EOFs of the data matrix
obtained from the multivariate time series given in Eq. (7.1), i.e. time EOFs. This
approach has been in fact adopted by Broomhead and King (1986a,b). They used
SVD to calculate an optimal basis for the trajectory of the reconstructed attractor.
If the dynamics is in fact chaotic, then the spectrum will be discontinuous or
singular, with the first few large singular values well separated from the floor (or

background) spectrum associated with the noise level, hence the label singular
system or spectrum analysis (SSA). At the same time, Fraedrich (1986) applied
SSA, using a few climate records, in an attempt to analyse their chaotic nature
and estimate their fractal dimensions. Vautard et al. (1992) analysed the spectral
properties of SSA and applied it to the Central England temperature (CET) time series in an attempt to find an oscillation buried in a noise background. They claimed that
CET contains a 10-year cycle. Allen and Smith (1997) showed, however, that the
CET time series consists of a long-term trend in a coloured background noise. SSA
is in fact a useful tool to find a periodic signal contaminated with a background white
noise. For example, if the time series consists of a sine wave plus a white noise, then
asymptotically, the spectrum will have a leading pair of equal eigenvalues and a flat
spectrum. The time EOFs corresponding to the leading eigenvalues will consist of
two sine waves (in delay space) in quadrature. The method, however, can fail when,
for example, the noise is coloured or the system is nonlinear. A probability-based
approach is proposed in Hannachi (2000), see also Hannachi and Allen (2001).
The anomaly data matrix, referred to as trajectory matrix by Broomhead and
King (1986a,b), obtained from the delayed time series of Eq. (7.1), that we suppose
already centred, is given by
X = [ x_1          x_2          . . .  x_M
      x_2          x_3          . . .  x_{M+1}
      ...          ...                 ...
      x_{n−M+1}    x_{n−M+2}    . . .  x_n ].      (7.2)

The trajectory matrix expressed by Eq. (7.2) has a special form, namely constant anti-diagonals (a Hankel structure). This property gets transmitted to the covariance matrix:

C = (1/(n−M+1)) X^T X = (1/(n−M+1)) Σ_{t=1}^{n−M+1} x_t x_t^T,      (7.3)

which is a Toeplitz matrix, i.e. with constant diagonals corresponding to the same
lags. This Toeplitz structure is known to have useful properties, see e.g. Graybill
(1969). If σ 2 is the variance of the time series, then C becomes
⎛ ⎞
1 ρ1 . . . ρM−1
⎜ . . ⎟
1 ⎜ ρ1 1 . . .. ⎟
C==⎜
⎜ .
⎟,
⎟ (7.4)
σ 2
⎝ .. .. ..
. . ρ1 ⎠
ρM−1 . . . ρ1 1

where  is the lagged autocorrelation matrix. If U = (u1 , . . . , uM ) is the set of


(time) EOFs, or similarly the right singular vectors of X, the PCs are given by C =
XU. The i’th PC ci = (ci (1), ci (2), . . . , ci (n − M + 1)) is then given by


c_i(t) = x_{M+t−1}^T u_i = Σ_{l=1}^{M} u_{il} x_{M+t−l}      (7.5)

for t = 1, . . . , n − M + 1. Hence the PCs are obtained as weighted moving averages


of the time series. Note, however, that this is not a conventional moving average
since the coefficients are function of the time series itself, and therefore, the PCs
and also the EOFs are somehow “special” moving averages of the time series. One
is tempted to say that the attribute “special” here points to nonlinearity, but of course
once the weights are computed then the filtering takes the form of an ordinary
moving average. The method, however, looks rather like an adaptive filter. It is
pointed out in von Storch and Zwiers (1999) that the filter (7.5) causes a frequency-
dependent phase shift because the weights are not in general symmetric. However,
the symmetric Toeplitz matrix C has an interesting property, namely the symmetry
of its eigenvectors (see the exercise below).
Figure 7.4 shows the spectra and the leading pair of EOFs of the covariance
matrix of the extended data with a window lag M = 30, similar to Eq. (7.3),
except that it is weighted by the inverse of a similar covariance matrix of the
noise (Hannachi and Allen 2001). The leading two eigenvalues are equal and well
separated from the rest of the spectrum. The associated (extended) EOFs show two
sine waves in quadrature reflecting the (quasi-) periodic nature of the hidden signal.
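A basic (unweighted) SSA computation can be sketched in Matlab as follows; the univariate series x (a column vector of length n) and the window M are assumed inputs, and the weighting by a noise covariance matrix used in Fig. 7.4 is not included.

>> n = length(x); K = n - M + 1;
>> X = zeros(K, M);
>> for t = 1:K, X(t,:) = x(t:t+M-1)'; end   % trajectory matrix, Eq. (7.2)
>> X = X - mean(X);                         % centre the columns
>> C = (X'*X)/K;                            % lagged covariance matrix, Eq. (7.3)
>> [U, D] = eig(C);                         % time EOFs and eigenvalues
>> [L, idx] = sort(diag(D), 'descend'); U = U(:,idx);
>> PCs = X*U;                               % time PCs, cf. Eq. (7.5)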
Exercise Show that the eigenvectors of the covariance matrix (7.3) are symmetric,
i.e. the elements of each eigenvector are symmetric.
Hint Let a = (a0 , a1 , . . . , aM−1 )T be an eigenvector of the lagged autocorrelation
matrix  with eigenvalue λ, i.e. a = λa. Now let us define the vector b =
(aM−1 , aM−2 , . . . , a0 )T , i.e. the “reverse” of a, and let us compute b. Clearly the
first element [b]1 of b is precisely [a]M , i.e.


Fig. 7.4 Spectrum of the grand covariance matrix, Eq. (7.3), of the noisy time series of Fig. 7.3b
weighted by the inverse of the same covariance matrix of the noise (a) and the leading two extended
EOFs. Adapted from Hannachi and Allen (2001)


M−1 
M−1
[b]1 = b0 + ρk bk = aM−1 + ρk aM−k−1 = λaM−1 = λb0
k=1 k=1

and similarly for the remaining elements. Hence if the eigenvalues are distinct, then
clearly a = b, and therefore, a is symmetric.
This conclusion indicates therefore that the SSA filter is symmetric and does
not cause a frequency-dependent phase shift. This phase shift remains, however,
possible particularly for singular covariance matrices.
Remarks
• SSA enjoys various nice properties, such as being an adaptive filter.
• The major drawback in SSA, however, is related to the problem of choosing
the embedding dimension M. In general the bigger the value of M the more
accurate the reconstruction. However, in the case of extraction of periodic signals,
M should not be much larger than the period, see e.g. Vautard et al. (1992) for
discussion.
As we have mentioned above, SSA, like EOFs, can be used to reconstruct the
original time series using a time reconstruction, instead of space reconstruction as
in EOFs. Since xt , t = 1, . . . n − M + 1, can be decomposed as


x_t = Σ_{k=1}^{M} c_k(t) u_k,      (7.6)

one gets


x_{t+l−1} = Σ_{k=1}^{M} c_k(t) u_{k,l}      (7.7)

for l = 1, . . . , M. It is interesting to note that for a given x_t, t = 1, . . . , n − M + 1, Eq. (7.7) yields different reconstructions, as will be discussed in Sect. 7.5.3. Each decomposition, however, has its own variance distribution among the different SSA vectors.

7.3 Examples

7.3.1 White Noise

For a white noise time series with variance σ 2 , the covariance matrix of the delayed
vectors is simply σ 2 IM , and the eigenvectors are simply the degenerate unitary
vectors.

7.3.2 Red Noise

The red noise is defined by

xt+1 = ρxt + εt+1 , (7.8)

for t = 1, 2, . . ., with independent and identically distributed (IID) noise with zero
mean and variance σ 2 . The autocorrelation function is

ρx (τ ) = ρ |τ | , (7.9)

and the Toeplitz covariance matrix C of xt = (xt , xt+1 , . . . , xt+M−1 )T takes the
form
(1/σ²) C = [ 1          ρ      . . .  ρ^{M−1}
             ρ          1      . . .  ...
             ...        ...    ...    ρ
             ρ^{M−1}    . . .  ρ      1 ].      (7.10)

To compute the eigenvalues of (7.10), one could, for example, start first by inverting
C and then compute its eigenvalues. The inverse of C has been computed, see e.g.
Whittle (1951) and Wise (1955). For example, Wise (1955) provided a general way
to compute the autocovariance matrix of a general ARMA(p, q) time series.
Following Wise (1955), Eq. (7.8) can be written in a matrix form as

(I − ρJ) xt = ε t , (7.11)

where xt = (xt , xt+1 , . . .)T , and ε t = (εt , εt+1 , . . .)T are semi-infinite vectors, I is
the semi-infinite identity matrix, and J is the semi-infinite auxiliary matrix whose
finite counterpart is
J_n = [ 0  1  0  . . .  0
        0  0  1  . . .  0
        ...            ...
        0  0  0  . . .  1
        0  0  0  . . .  0 ],

that is with ones on the superdiagonal and zeros elsewhere. The matrix J_n is nilpotent, with J_n^n = O. Using Eq. (7.11), we get

E(x_t x_t^T) = Σ = σ² (I − ρJ)^{−1} (I − ρJ^T)^{−1},

that is

σ² Σ^{−1} = (I − ρJ^T)(I − ρJ).      (7.12)

By taking the finite counterpart of Eq. (7.12) corresponding to a finite sample x_1, x_2, . . . , x_n, we get the following tridiagonal inverse matrix of C:

σ² C^{−1} = [ 1      −ρ      0       . . .    0
             −ρ      1+ρ²    −ρ      . . .    0
             0       −ρ      1+ρ²    . . .    0
             ...                     . . .    −ρ
             0       . . .   −ρ      1+ρ²     −ρ
             0       . . .   0       −ρ       1+ρ² ].      (7.13)
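The structure of (7.13) is easy to verify numerically; the short snippet below (with arbitrary illustrative values of ρ and n) builds the finite matrix J_n and evaluates the product appearing in (7.12).

>> n = 6; rho = 0.7;
>> J = diag(ones(n-1,1), 1);                % ones on the superdiagonal
>> P = (eye(n) - rho*J')*(eye(n) - rho*J);  % right-hand side of Eq. (7.12)
>> disp(P)                                  % tridiagonal: 1 and 1+rho^2 on the diagonal, -rho off it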

Another direct way to compute C−1 is given, for example, in Graybill (1969,
theorem 8.3.7, p. 180). Now the eigenvalues of C−1 are given by solving the
polynomial equation:

Dn = |C−1 − λIn | = 0.

Again, a simple way to solve this determinantal equation is to decompose D_n as D_n = (1 − λ)Δ_{n−1} + ρδ_{n−1}, where

Δ_{n−1} = | 1+ρ²−λ    −ρ        . . .   0
            −ρ        1+ρ²−λ    . . .   0
            ...                 . . .   −ρ
            0         . . .     −ρ      1+ρ²−λ |

is the determinant of C^{−1} − λI after removing the first line and first column, and

δ_{n−1} = | −ρ    0         0        . . .   0
            −ρ    1+ρ²−λ    −ρ       . . .   0
            ...                      . . .   −ρ
            0     −ρ        1+ρ²−λ   . . .   −ρ
            0     . . .     0        −ρ      1+ρ²−λ |

is the determinant of C^{−1} − λI after removing the second line and first column. We now have the recurrence relationships:

δ_n = −ρ Δ_{n−1},
Δ_n = (1 + ρ² − λ) Δ_{n−1} + ρ δ_{n−1},      (7.14)

so that Δ_n = (1 + ρ² − λ) Δ_{n−1} − ρ² Δ_{n−2}, which yields, in general, a solution of the form:

Δ_n = a μ_1^{n−2} + b μ_2^{n−2},

where μ_{1,2} are the roots (supposed to be distinct) of the quadratic equation x² − (1 + ρ² − λ)x + ρ² = 0. Note that the constants a and b are obtained from initial conditions such as Δ_2 and Δ_3.

7.4 SSA and Periodic Signals

The use of SSA to find periodic signals from time series has been known since
the early 1950s with Whittle (1951). The issue was reviewed by Wise (1955),
who showed that for a stationary periodic (circular) time series of period n, the
autocovariance matrix of the time series contains at least one eigenvalue with multiplicity greater than or equal to two. Wise2 (1955) considered a stationary zero-mean random vector x = (x_n, x_{n−1}, . . . , x_1) having a circular covariance matrix, i.e.

γn+l = γn−l = γl ,

where γn = E (xt xt+n ) = σ 2 ρn , and σ 2 is the common variance of the xk ’s. The
corresponding circular autocorrelation matrix is given by
⎛ ⎞
1 ρ1 ρ2 . . . ρ1
⎜ ρ1 1 ρ1 . . . ρ2 ⎟
⎜ ⎟
=⎜ .. .. .. ⎟. (7.15)
⎝ ρ2 . . . ρ1 ⎠
ρ1 ρ2 . . . ρ1 1

Exercise
1. Show that the matrix
W = [ 0  1  0  . . .  0
      0  0  1  . . .  0
      ...            ...
      0  0  0  . . .  1
      1  0  0  . . .  0 ]

is unitary, i.e. W^T W = WW^T = I.

2 Wise also extended the analysis to calculate the eigenvalues of a circular ARMA process.

   
2. Show that Γ = I + ρ_1(W + W^T) + ρ_2(W² + (W²)^T) + . . . + qρ_p(W^p + (W^p)^T), where p = [n/2], i.e. the integer part of n/2, and q equals 1 if n is odd, and 1/2 otherwise.
3. Compute the eigenvalues of W and show that it is diagonalisable, i.e. W = AΩA^{−1}.
Hint The eigenvalues of W are the unit roots ω_k, k = 1, . . . , n, of ω^n − 1 = 0. Hence W = AΩA^{−1}, and W^α + W^{−α} = A(Ω^α + Ω^{−α})A^{−1}. This means in particular that W^α + W^{−α} is diagonalisable with ω_k^α + ω_k^{−α} as eigenvalues.
Exercise (Wise 1955) Compute the eigenvalues (or latent roots) λ_k, k = 1, 2, . . . , n, of Γ and show, in particular, that they can be expressed as an expansion into Fourier sine–cosine functions as

λ_k = 1 + 2ρ_1 cos(2πk/n) + 2ρ_2 cos(4πk/n) + . . . + 2qρ_p cos(2πpk/n)      (7.16)

and that λ_k = λ_{n−k}.
and that λk = λn−k .
Remark The eigenvalues of Γ can also be obtained directly by calculating the determinant Δ = |Γ − λI| by writing (see e.g. Mirsky 1955, p. 36)

Δ = ∏_{k=1}^{n} [ (1 − λ) + Σ_{j=1}^{n−1} ρ_j ω_k^j ].

The eigenvalues of Γ are therefore given by λ_k = f(θ_k), θ_k = 2πk/n, where f(ω) is the spectral density function of the circular process. As a consequence, we see that λ_{n−k} = λ_k.
After the eigenvalues have been found, it can now be checked that the vector

u_k = (1/√n) ( cos(2πk/n) + sin(2πk/n), . . . , cos(2πnk/n) + sin(2πnk/n) )      (7.17)

is the unitary eigenvector associated with λ_k. Furthermore, any two eigenvectors of a degenerate eigenvalue are in quadrature. Note that when n is even (7.16) becomes

λ_k = 1 + 2 Σ_{j=1}^{n/2−1} ρ_j cos(2πkj/n) + ρ_{n/2} cos(πk).      (7.18)

The same procedure can be applied when the time series contains, for example,
a periodic signal but is not periodic. This was investigated by Basilevsky (1983)
and Basilevsky and Hum (1979) using the Karhunen–Loève decomposition, which
consists precisely of an EOF analysis of the lagged covariance (Toeplitz) matrix, i.e.

SSA. Basilevsky (1983) applied the procedure to UK unemployment time series.


Note that, as in the circular case, the eigenvalues of the lagged covariance matrix
provide a discrete approximation of the power spectrum of the time series. The
eigenvectors, on the other hand, are time lagged series.
Remark (SSA and Wavelets) Wavelet analysis has been used to study time series
with intermittent phenomena including chaotic signals. A characteristic feature of
wavelet transform is that it can focus on specific parts of the signal, to examine
local structure by some sort of algebraic zooming (see Sect. 5.4.1, footnote 7.) For
a continuous signal, the wavelet transform is a special integral transform whose
kernel is obtained by a translation and dilation of a localised function, which is the
mother wavelet (Meyer 1992; Daubechies 1992). Wavelet transform is comparable
to SSA, with main differences related to locality. For example, SSA modes, such
as EOFs and Fourier transform, contain information from the whole time series,
whereas wavelet transform is localised and can therefore analyse complex signals,
such as self-similar signals and singularities. A detailed account of the difference
between the two methods, with more references, can be found in Ghil et al. (2002),
who also discuss local SSA as a substitute to conventional SSA.
We now consider the (lagged) autocovariance matrix of a stationary time series
xt , t = 1, 2 . . . with autocovariance function γ () and spectral density f (),
Γ = [ γ_0        γ_1        . . .  γ_{n−1}
      γ_1        γ_0        . . .  γ_{n−2}
      ...        ...        ...    ...
      γ_{n−1}    γ_{n−2}    . . .  γ_0 ],      (7.19)

and we suppose, without loss of generality, n to be odd. We also let u_0 = (2^{−1/2}, 1, 0, 1, . . . , 0)^T, and for k = 1, . . . , n − 1, u_k = (2^{−1/2}, cos(2πk/n), sin(2πk/n), cos(4πk/n), sin(4πk/n), . . . , sin(2π(n−1)k/(2n)))^T. We now form the following two arrays:

U = √(2/n) (u_0, u_1, . . . , u_{n−1})^T      (7.20)

and

Λ = Diag(λ_1, λ_2, . . . , λ_n),      (7.21)

where λ_1 = (1/2π) Σ_j γ_j, and λ_{2k} = λ_{2k+1} = (1/2π) Σ_j γ_j e^{−2iπjk/n} for k = 1, 2, . . . , (n−1)/2.

Remark When n is even, an additional row, n^{−1/2}(1, −1, 1, . . . , 1, −1), is added as the last row of U, and the last entry of Λ is λ_n = (1/2π) Σ_j γ_j cos(πj).

Now for Γ defined in (7.15) the matrix UΓU^T is diagonal and converges to 2πΛ as n increases. The nice thing is that the same result also extends to the lagged autocovariance matrix Γ in (7.19). Precisely, we have the following theorem, see e.g. Fuller (1976, p. 138).

Theorem Let Γ be defined by (7.19), with absolutely summable autocovariance function γ(), and U and Λ defined by (7.20) and (7.21), respectively; then

UΓU^T − 2πΛ

converges to O as n goes to infinity.

Hence for large sample sizes n, the eigenvalues λ_j of the lagged autocovariance matrix Γ of x_1, . . . , x_n are approximately equal to 2πf(θ_j), where θ_j = 2πj/n, j = 0, 1, . . . , n − 1.
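This asymptotic link between the eigenvalues of a Toeplitz autocovariance matrix and the spectral density can be illustrated numerically. The sketch below uses an AR(1) process with unit innovation variance; all numerical values are arbitrary choices for the illustration.

>> rho = 0.6; n = 201;
>> gam = rho.^(0:n-1)/(1-rho^2);            % AR(1) autocovariances
>> G   = toeplitz(gam);                     % matrix of the form (7.19)
>> lam = sort(eig(G), 'descend');
>> th  = 2*pi*(0:n-1)/n;
>> f   = 1./(2*pi*(1 - 2*rho*cos(th) + rho^2));   % AR(1) spectral density
>> plot(sort(2*pi*f, 'descend'), lam, '.')  % points lie close to the one-to-one line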

7.5 Extended EOFs or Multivariate SSA

7.5.1 Background

The extension of SSA to two or more variables was performed by Broomhead


and King (1986a,b), who applied multichannel SSA (MSSA) to the Lorenz system
for dynamical reconstruction of the attractor. It was also applied later by Kimoto
et al. (1991) to look for propagating structures from 500-mb geopotential height
reanalyses. MSSA makes use of space and time information simultaneously to find
coherent structures, referred to sometimes as space–time EOFs. Extended EOF
analysis is another name for MSSA and was introduced in the meteorological
literature some years earlier by Weare and Nasstrom (1982). They applied it to the
300-mb relative vorticity and identified westward propagating Rossby waves with
phase speeds of the order 0.4 m/s. They also applied it to tropical Pacific SST and
identified strong persistence in El-Niño. Below we discuss the computation and use
of extended EOFs, see also Hannachi et al. (2007).

7.5.2 Definition and Computation of EEOFs

In EEOF analysis the atmospheric state vector at time t, i.e. x_t = (x_{t1}, . . . , x_{tp}), t = 1, . . . , n, used in traditional EOF analysis, is extended to include temporal information as

x_t = (x_{t1}, . . . , x_{t+M−1,1}, x_{t2}, . . . , x_{t+M−1,2}, . . . , x_{tp}, . . . , x_{t+M−1,p})      (7.22)



with t = 1, . . . , n − M + 1. The new data matrix takes now the form:


X = [ x_1
      x_2
      ...
      x_{n−M+1} ].      (7.23)

It is now clear from (7.22) that time is incorporated in the state vector side by side
with the spatial dimension. If we denote by

x_t^s = (x_{ts}, x_{t+1,s}, . . . , x_{t+M−1,s}),      (7.24)

then the extended state vector (7.22) is written in a similar form to the conventional
state vector, i.e.
 
x_t = (x_t^1, x_t^2, . . . , x_t^p)      (7.25)

except that now the elements x kt , k = 1, . . . p, of this grand state vector, Eq. (7.24),
are themselves temporal-lagged values. The data matrix X in Eq. (7.23) now takes
the form
X = [ x_1^1          x_1^2          . . .  x_1^p
      ...            ...            ...    ...
      x_{n−M+1}^1    x_{n−M+1}^2    . . .  x_{n−M+1}^p ],      (7.26)

which is again similar to traditional data matrix X, see Chap. 2, except that now its
elements are (temporal) vectors.
The vector x st in (7.24) is normally referred to as the delayed vector obtained
from the time series (xts ), t = 1, . . . n of the field value at grid point s. The new data
matrix (7.26) is now of order (n − M + 1) × pM, which is significantly larger than
the original matrix dimension n × p.
We suppose that X in (7.26) has been centred and weighted, etc. The covariance
matrix of time series (7.25) obtained using the grand data matrix (7.26) is
Σ = (1/(n−M+1)) X^T X = [ C_11   C_12   . . .  C_1M
                          C_21   C_22   . . .  C_2M
                          ...    ...    ...    ...
                          C_M1   C_M2   . . .  C_MM ],      (7.27)

where each Cij , 1 ≤ i, j ≤ M, is the lagged covariance matrix between the i’th and
j’th gridpoints and is given by3

C_ij = (1/(n−M+1)) Σ_{t=1}^{n−M+1} x_t^i (x_t^j)^T.      (7.28)

If the elements of the data matrix (7.26) were random variables, then each submatrix C_ij = E(x^i (x^j)^T), from the covariance matrix Σ = E(X^T X), would exactly take a symmetric Toeplitz form, i.e. with constant diagonals, and consequently, Σ would be block Toeplitz. Due to finite sampling, however, C_ij is only approximately Toeplitz for large values of n compared to the window length M. This is in general the case when we deal with high frequency data, e.g. daily observations or even monthly averages from long climate model simulations. The symmetric covariance matrix Σ is therefore approximately block Toeplitz for large values of n.
An alternative form of the data matrix is provided by re-writing the state
vector (7.22) in the form

x t = xt1 , . . . xt,p , xt+1,1 , . . . xt+1,p , . . . xt+M−1,1 , . . . xt+M−1,p , (7.29)

that is

x t = (xt , xt+1 , . . . , xt+M−1 ) (7.30)

where xt is the state vector at time t, t = 1, . . . n − M + 1, i.e.

xt = xt1 , . . . , xt,p .

Hence the matrix (7.26) now takes the alternative form4


X_1 = [ x_1          x_2          . . .  x_M
        ...          ...          ...    ...
        x_{n−M+1}    x_{n−M+2}    . . .  x_n ].      (7.31)

This form is exactly equivalent to (7.26) since it is obtained from (7.26) by a


permutation of the columns as

X 1 = X P, (7.32)

3 Other alternatives to compute C_ij also exist, and they are related to the way the lagged covariance between two time series is computed, see e.g. Priestley (1981) and Jenkins and Watts (1968).
4 Used by Weare and Nasstrom (1982).

where P = (p_ij), i, j = 1, . . . , Mp, is a permutation matrix5 given by

p_ij = δ_{i,α},      (7.33)

where α is a function of j given by α = rM + [(j−1)/p] + 1, with r defined by j − 1 ≡ r (mod p), and [x] is the integer part of x.
In order to compute the extended EOFs, the OLR anomalies, discussed in
Sect. 7.1, with respect to the long-term climatology for the period 1 Jan 1996–31
Dec 2000, are analysed. The dimension of the data is reduced by keeping the leading
10 EOFs/PCs. Figure 7.5 shows the obtained spectrum of the grand covariance
matrix of Eq. (7.27) using the leading 10 OLR PCs, with a window lag M = 80 days.
The anomalies are computed with respect to the daily climatology over the period 1
Jan 1996–31 Dec 2000. The leading two eigenvalues correspond to the annual cycle.
They do not look nearly equal and separated from the rest of the spectrum. This is
due to the relatively small sample size and the choice of the window lag, which is
much smaller than the length of the seasonal cycle.
EEOFs are the EOFs of the extended data matrix (7.23), i.e. the eigenvectors of the grand covariance matrix Σ given in (7.27). They can be obtained directly by computing the eigenvalues/eigenvectors of (7.27). Alternatively, we can use again the SVD of the grand data matrix X in (7.26). This yields

X = V Θ U^T,      (7.34)

where U = (uij ) = (u1 , u2 , . . . , ud ) represents the matrix of the d extended


EOFs or left singular vectors of X . Here, d = Mp represents now the number

Fig. 7.5 Spectrum of the grand covariance matrix, Eq. (7.3), of the leading 10 OLR PCs using a
window lag M = 80 days. The vertical bars show the approximate 95% confidence interval using
a hand-waving effective sample size of 116. Adapted from Hannachi et al. (2007)

5 That is, a matrix containing exactly one 1 in every row and every column and zeros elsewhere. A permutation matrix P is orthogonal, i.e. PP^T = P^T P = I.

of the newly obtained variables, i.e. the number of columns of the grand data
matrix. The diagonal matrix  contains the singular values θ1 , . . . θd of X , and
V = (v 1 , v 2 , . . . , v d ) is the matrix of the right singular vectors or extended PCs
where the k’th extended PC is v k = (vk (1), . . . , vk (n − M + 1))T .
The computation of extended EOFs is again similar to conventional EOFs. Given the gridded data matrix X(n, p12) and the window length m, the cornerstone of EEOFs is to compute the extended data matrix EX as shown in the simple Matlab code:

>> EX = [];                           % extended data matrix, (n-m+1) x (m*p12)
>> for t = 1:(n-m+1)
>>   test0 = [];
>>   for s = 1:p12
>>     test1 = X(t:(t+m-1), s)';      % m-long window of grid point s starting at t
>>     test0 = [test0 test1];         % concatenate grid points side by side
>>   end
>>   EX = [EX; test0];                % append the extended state vector, Eq. (7.22)
>> end

The extended EOFs and extended PCs along with associated explained variance
are then computed as for EOFs using the new data matrix EX from the above
piece of Matlab code. These extended EOFs and PCs can be used to filter the
data by removing the contribution from nonsignificant components and also for
reconstruction purposes as detailed below and illustrated with the OLR example.
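For completeness, a possible continuation of the above snippet, following the SVD description of Eq. (7.34), is

>> EX = EX - mean(EX);                      % centre the extended data matrix
>> [V, Theta, U] = svd(EX, 'econ');         % U: extended EOFs, V: extended PCs (scaled)
>> expvar = 100*diag(Theta).^2/sum(diag(Theta).^2);   % explained variance (%)

where the columns of U are the extended EOFs and θ_k v_k(t) gives the amplitudes used in (7.35).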

7.5.3 Data Filtering and Oscillation Reconstruction

The extended EOFs U can be used as a filter exactly like EOFs. For instance, the
SVD decomposition (7.34) yields the expansion of each row x t of X in (7.26)


x_t^T = Σ_{k=1}^{d} θ_k v_k(t) u_k,      (7.35)

for t = 1, . . . n − M + 1, or in terms of the original variables xt as


x_{t+j−1}^T = Σ_{k=1}^{d} θ_k v_k(t) u_k^j      (7.36)

for t = 1, . . . n − M + 1, and j = 1, . . . M, and where

u_k^j = (u_{j,k}, u_{j+M,k}, . . . , u_{j+(p−1)M,k})^T.      (7.37)

Note that the expression of the vector u_k^j in the expansion (7.36) depends on the form of the data matrix. The one given above corresponds to (7.26), whereas for the data matrix X_1 in Eq. (7.31) we get

u_k^j = (u_{(j−1)p+1,k}, u_{(j−1)p+2,k}, . . . , u_{jp,k})^T.      (7.38)

Note also that when we filter out higher EEOFs, expression (7.36) is to be truncated
to the required order d1 < d.
Figure 7.6, for example, shows PC1 and its reconstruction based on Eq. (7.40)
using the leading 5 extended PCs. Figure 7.5 shows another pair of approximately
equal eigenvalues, namely, eigenvalues 4 and 5. Figure 7.7 shows a time plot
of extended PC4/PC5, and their phase plot. This figure reflects the semi-annual
oscillation signature in OLR. Figure 7.8 shows the extended EOF8 pattern along
10◦ N as a function of the time lag. Extended EOF8 shows an eastward propagating

Fig. 7.6 Time series of the raw and reconstructed OLR PC1. Adapted from Hannachi et al. (2007)

Fig. 7.7 Time series of OLR extended PC4 and PC5 and their phase plot. Adapted from Hannachi
et al. (2007)

Fig. 7.8 Extended EOF 8 of the OLR anomalies along 10°N as a function of time lag. Units arbitrary. Adapted from Hannachi et al. (2007)

wave with an approximate phase speed of about 9 m/s, comparable to that of Kelvin
waves.
The expansion (7.36) is exact by construction. However, when we truncate it by
keeping a smaller number of EEOFs for filtering purposes, e.g. when we reconstruct
the field components from a single EEOF or a pair of EEOFs corresponding, for
example, to an oscillation, then the previous expansion does not give a complete
picture. This is because when (7.36) is truncated to a smaller subset K of EEOFs
yielding, e.g.
y_{t+j−1}^T = Σ_{k∈K} θ_k v_k(t) u_k^j,      (7.39)

where y_t = (y_{t,1}, . . . , y_{t,p}) is the filtered or reconstructed state space vector, then one obtains a multivalued function. For example, for t = 1 and j = 2 we get one value of y_{2,1}, and for t = 2 and j = 1 we get another value of y_{2,1}.
Note that this is due to the fact that the EEOFs have time lagged components.
To get a unique reconstructed value, we simply take the average of those multiple

values. The number of multiple values depends6 on the value of time t = 1, . . . , n. The reconstructed variables using a subset K of EEOFs are then easily obtained from (7.39) by

y_t^T = (1/t) Σ_{j=1}^{t} Σ_{k∈K} θ_k v_k(t−j+1) u_k^j                for 1 ≤ t ≤ M − 1,
y_t^T = (1/M) Σ_{j=1}^{M} Σ_{k∈K} θ_k v_k(t−j+1) u_k^j                for M ≤ t ≤ n − M + 1,      (7.40)
y_t^T = (1/(n−t+1)) Σ_{j=t−n+M}^{M} Σ_{k∈K} θ_k v_k(t−j+1) u_k^j      for n − M + 2 ≤ t ≤ n.
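A direct Matlab transcription of (7.40) is sketched below. It assumes U, V and th = diag(Theta) (a column vector of singular values) come from the SVD of the extended data matrix in the form (7.26), K is the chosen subset of EEOF indices, and M and p denote the window length and the number of variables (m and p12 in the earlier snippet); the averaging over the multiple values is done by accumulating contributions and counts.

>> Y = zeros(n, p); cnt = zeros(n, 1);
>> for t = 1:(n-M+1)
>>   for j = 1:M
>>     ukj = U(j:M:j+(p-1)*M, :);                            % u_k^j for all k, Eq. (7.37)
>>     Y(t+j-1,:) = Y(t+j-1,:) + (th(K)'.*V(t,K))*ukj(:,K)'; % sum over k in K
>>     cnt(t+j-1) = cnt(t+j-1) + 1;
>>   end
>> end
>> Y = Y./cnt;                                               % average, Eq. (7.40)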

The eigenvalues related to the MJO are reflected by the pair (8,9), see Fig. 7.5. The time series and phase plot of extended PCs 8 and 9 are shown in Fig. 7.9. This oscillating pair has a period of about 50 days. This MJO signal projects onto many PCs. Figure 7.10 shows time series of the reconstructed PCs 1 to 8 using extended EOFs/PCs 8 and 9. For example, PCs 5 to 8 are the most energetic components regarding the MJO. Notice, in particular, the weak projection of the MJO onto the annual cycle. Figure 7.11 shows the reconstructed OLR field for the period 3 March 1997 to 14 March 1997, at 5°N, using the 8th and 9th extended EOFs/PCs.
The figure reveals that the MJO is triggered near 25–30°E over the African jet region and matures over the Indian Ocean and Bay of Bengal due to the moisture excess there. The MJO becomes particularly damped near 150°E. Another feature that is

Fig. 7.9 Time series of OLR extended PCs 8 and 9 and their phase portrait. Adapted from
Hannachi et al. (2007)

6 These numbers can be obtained by constructing an M × n array A = (a_{jt}) with entries a_{jt} = t − j + 1. Next, all entries that are non-positive or greater than n − M + 1 are set to zero. Then, for each time t, one takes all the indices j with positive entries.

Fig. 7.10 Reconstructed OLR PCs 1 to 8 using the extended EOFs 8 and 9. Adapted from
Hannachi et al. (2007)

clear from Fig. 7.11 is the dispersive nature of the MJO, with a larger phase speed
during the growth phase compared to that during the decay phase.
Note that these reconstructions can also be obtained using least squares (see e.g.
Vautard et al. 1992, and Ghil et al. 2002). The reconstructed components can also
be restricted to any subset of the Eigen elements of the grand data matrix (7.26) or
similarly the grand covariance matrix . For example, to reconstruct the time series
associated with an oscillatory Eigen element, i.e. a pair of degenerate eigenvalues,
the subset K in the sum (7.39) is limited to that pair.
The reconstructed multivariate time series yt , t = 1, . . . n, can represent the
reconstructed (or filtered) values of the original field at the original p grid points.
In general, however, the number of grid points is too large to warrant an eigen
decomposition of the grand data or covariance matrix. In this case a dimension
reduction of the data is first sought by using, say, the leading p_0 PCs, and MSSA is then applied to these retained PCs. In this case the dimension of X becomes (n − M + 1) × Mp_0, which may be considerably smaller than the original dimension. To get the reconstructed space–time field, one then uses the reconstructed PCs in conjunction with the p_0 leading EOFs.

Fig. 7.11 Reconstructed OLR anomaly field using reconstructed EOFs/PCs 1 to 8 shown in
Fig. 7.10. Adapted from Hannachi et al. (2007)

Remark The previous sections discuss the fact that extended EOFs are used
essentially for identifying propagating patterns, filtering and also data reduction.
Another example where extended EOFs can be used is when the data contain some
kind of breaks. This includes studies focussing on synoptic and large-scale processes
in a particular season. In these studies the usual procedure is to restrict the analyses
to data restricted to the chosen period. This includes analysing, for example, winter
(e.g. DJF) daily data over a number of years. If care is not taken, an artificial
oscillatory signal emerges. An example was given in Hannachi et al. (2011) who
used geometric moments of the polar vortex using ERA-40 reanalyses. They used
December–March (DJFM) daily data of the aspect ratio, the centroid latitude and
the area of the vortex from ERA-40 reanalyses for the period 1958–2002.
Figure 7.12a shows the spectrum of the extended time series in the delay
coordinates of the vortex area time series using a window lag M = 400 days. A
pair of nearly identical eigenvalues emerges and is well separated from the rest
of the spectrum. The associated extended EOFs are shown in 7.11b, and they
show a clear pair of sine waves in quadrature. The associated extended PCs are
shown in 7.12c, revealing again a phase quadrature supported also by the phase
plot (7.12d). The time series is then filtered by removing the leading few extended
EOFs/PCs. The result is shown in Fig. 7.13. Note that in Fig. 7.13 the leading four
extended EOFs/PCs were filtered out from the original vortex area time series.
7.5 Extended EOFs or Multivariate SSA 167

Fig. 7.12 Spectrum of the grand covariance matrix, Eq. (7.3), of the northern winter (DJFM) polar
vortex (a) using a window lag M = 400 days, the leading two extended EOFs (b), the extended
PC1/PC2 (c) and the phase portrait of the extended PC1/PC2 (d). Adapted from Hannachi et al.
(2007)

Fig. 7.13 Raw polar vortex area time series and the reconstructed signal using the leading four
extended EOFs/PCs (a) and the filtered time series obtained by removing the reconstructed signal
of (a). Adapted from Hannachi et al. (2011). ©American Meteorological Society. Used with
permission
168 7 Extended EOFs and SSA

7.6 Potential Interpretation Pitfalls

EEOFs are useful tools to detect propagating features, but some care needs to be
taken when interpreting the patterns. There are two main difficulties of interpretation
related, respectively, to the standing waves and the relationship between the
EEOFs substructures. The method finds one or more EEOFs where each EEOF
is composed of a number of patterns or substructures. These patterns are taken
to represent propagating features, and this would imply some sort of coherence
between individual features within a given EEOF. The fact that the method attempts
to maximise the variance of each EEOF (without considering extra constraints on
the correlation, or any other measure of association, between the substructures of a
given EEOF) suggests the existence of potential caveats. Chen and Harr (1993) used
a two-variable model to show that the partition of the loadings is much sensitive
to the variance ratio than to the correlation between the two variables. This may
yield some difficulties in interpretation particularly when some sort of relationships
are expected between the substructures of a given EEOF. Chen and Harr (1993)
constructed a 6-variable toy model data to show that interpretation of EEOF patterns
can be misleading.
In the same token and like POPs, EEOFs interpretation can also be difficult when
the data contains a standing wave. The problem arises for particular choices of
the delay parameter τ . Monahan et al. (1999) showed that if the dataset contains
a standing wave the EEOFs describing this wave will be degenerate if τ coincides
with a zero of the autocovariance function of the wave’s time series. The model used
by Monahan et al. (1999) takes the simple form:

xt = ayt + ε t ,

where E ε t ε Tt+τ = η(τ )I, aT a = 1 and E (yt yt+τ ) = a(τ ), and ε t and yt
T
are uncorrelated. For example, if zt = xTt , xTt+τ , the corresponding covariance
matrix is
 
(0) (τ )
z = ,
(τ ) (0)

where (τ ) = a(τ )aaT + η(τ )I is the covariance matrix function of xt . Denoting
γ (τ ) = a(τ )+η(τ ), then the eigenvalue λ = γ (0)+γ (τ ) is degenerate if γ (τ ) = 0.
Using more lags, the degeneracy condition becomes slightly more complicated. For
T
example, when using two time lags, i.e. zt = xTt , xTt+τ , xTt+2τ , then one gets
two obvious eigenvalues, λ1 = γ (0) + γ (τ ) + γ (2τ ), and λ2 = γ (0) − γ (2τ )
T T
with respective eigenvectors aT , aT , aT and aT , 0T , −aT . If γ (τ ) = γ (2τ ),
then the second eigenvalue degenerates, and if in addition γ (τ ) = 0, then the first
eigenvalue degenerates. When this happens, the substructures within a single EEOF
can be markedly different, similar to the case mentioned above. Monahan et al.
7.7 Alternatives to SSA and EEOFs 169

(1999) performed EOF and a single-lag EEOF analysis of the monthly averages
of the Comprehensive Ocean-Atmosphere Data Set (COADS) SLP from January
1952 to June 1997. The first EOF was identified as an east–west dipole and was
interpreted as a standing mode, with its PC representing the ENSO time series. They
obtained degeneracy of the leading eigenvalue when the lag τ is chosen near to the
first zero of the sample autocovariance function of the PC of the standing mode,
i.e. at 13 months. The result was a clear degradation, and in some substructures
a suppression, of the standing wave signal leading to a difficulty of interpretation.
Therefore, in the particular case of (unknown) standing wave, it is recommended to
try various lags and check the consistency of the EEOF substructures.

7.7 Alternatives to SSA and EEOFs

7.7.1 Recurrence Networks

Recurrence networks are defined in a similar way to climate networks, discussed in


Sect. 3.9 of Chap. 3. Recurrence in phase space was explored by Reik Donner and
colleagues in a different direction (Marwan et al. 2009; Donner et al. 2010). They
viewed and interpreted recurrences matrix as adjacency matrix of some complex
network. The adjacency (binary) matrix is defined, for a specific threshold distance
ε, as Dij = 1{ε− xti −xtj ≥0} − δij , where xti , and xtj are M-dimensional time delay
embedding of time series xt (Eq. 7.1). Donner et al. (2010) showed, in particular,
that the recurrence network can provide quantitative information on the statistical
properties of the dynamical system trajectory in phase space. Marwan et al. (2009)
applied the approach to marine climate proxy records over Africa during the last
4.5Ma and identified various regime transitions.

7.7.2 Data-Adaptive Harmonic Decomposition

Another EEOF-related method, the data-adaptive harmonic decomposition


(DAHD), was proposed by Chekroun and Kondrashov (2017), and also by
Kondrashov et al. (2018a,b), to deal with univariate and multivariate time series.
The DAHD seeks to decompose the data in terms of elementary modes. In its
simplest form, the method constructs a convolutional linear (integral) operator with
the time series autocorrelation function as kernel. For multivariate time series,
the kernel is replaced by the lagged autocorrelation matrix. The DAH modes are
then given by the eigenelements of the integral operator. For a d-dimensional time
series x1 , . . . xn (xt = (xt,1 , . . . xt,d ), t = 1, . . . n), with (lagged) window length
M, the numerical approximation of the operator yields a grand bloc, symmetric
d(2M −1)×d(2M −1) Hankel Matrix H = (H(p,q) ), p, q = 1, . . . , d. Each H(p,q)
170 7 Extended EOFs and SSA

is a (2M −1)×(2M −1) Hankel matrix. It is defined using the (2M −1)×(2M −1)
cyclic permutation matrix P = (pij ), (i, j = 1, . . . 2M − 1), given by

pij = δi+1,j + δi,2M−1 δ1,j , (7.41)

where δij is the Kronecker symbol. Precisely, if c is the vector contain-


ing the lagged correlations between the pth and qth variables, i.e. c =
(p,q) (p,q) (p,q) (p,q)
(ρ−M+1 , ρ−M+2 , . . . ρM−1 )T , where ρm = corr(xt,p , xt+m,q ), m = −M +
1, . . . , M − 1, then

H(p,q) = [c, Pc, . . . P2M−2 c]. (7.42)

The DAH modes and associated eigenvalues are then given by the eigenelements
of the grand (symmetric) Hankel matrix H. In a similar fashion to EEOFs, the
eigenvalues of H come in pairs of equal in magnitude but with opposite sign, and
the associated modes coefficients (time series) are shifted by quarter of a period.
The obtained modes, and their coefficients, are used, in similar way to EEOFs, to
identify oscillating features and to reconstruct the data. Kondrashov et al. (2018a)
used DAH to analyse Arctic sea ice and predict September sea ice extent, whereas
Kondrashov et al. (2018b) used it to analyse wind-driven ocean gyres. The authors
argue, in particular, that the model has some predictive skill.
Chapter 8
Persistent, Predictive and Interpolated
Patterns

Abstract Previous chapters discuss methods with no particular predictive power.


This chapter describes further advanced linear methods that are more related
to prediction. Three related methods, involving extrapolation (or prediction) and
interpolation, with climate applications are discussed in this chapter.

Keywords Power spectrum · Decorrelation time · Kolmogorov formula ·


Optimal persistence · Average predictability · Predictive patterns · Interpolated
patterns · Southern oscillation · Forecastable component

8.1 Introduction

Time-varying atmospheric fields are characterised, beside their high dimensionality


and complexity, by high spatial and significant temporal correlations. We have
encountered in the previous chapters various methods that take into consideration
either or both of the last two characteristics, i.e. spatial and temporal correlations.
EOFs, for example, take into account the spatial correlation, whereas EEOFs, or
MSSA, take into account spatial as well as temporal correlations.
The temporal correlation is examined, in principle, using auto- and cross-
correlations in time between different variables and involves the concept of persis-
tence or serial correlation. Persistence, for example, is a local property and reflects
the fact that atmospheric fields do not change substantially from 1 day to the next.
Given the nature of the atmosphere,1 this local property propagates with time, and
we end up with fields that persist for much longer than the local time scale of
persistence. Sometimes, we talk about the system’s memory, or decorrelation time,
given by the time lag beyond which atmospheric variables become uncorrelated.
Persistence is a very useful concept in forecasting and can therefore be used in a
similar fashion to EOFs.

1 As a moving fluid masses.

© Springer Nature Switzerland AG 2021 171


A. Hannachi, Patterns Identification and Data Mining in Weather and Climate,
Springer Atmospheric Sciences, https://doi.org/10.1007/978-3-030-67073-3_8
172 8 Persistent, Predictive and Interpolated Patterns

Early attempts along this direction were explored by Renwick and Wallace
(1995) to determine patterns that maximise correlation between forecast and
analysis. This is an example of procedure that attempts to determine persistent
patterns. Persistent patterns are very useful in prediction but do not substitute
predictable patterns. In fact, prediction involves in general (statistical and/or dynam-
ical) models, whereas in persistence there is no need for modelling. This chapter
reviews techniques that find most persistent and also most predictable patterns,
respectively. The former attempt to maximise the decorrelation time, whereas the
latter attempt to minimise the forecast error. We can also increase forecastability
by reducing uncertainties of time series. A further technique based on smoothing is
also presented, which attempts to find most smooth patterns.

8.2 Background on Persistence and Prediction of Stationary


Time Series

8.2.1 Decorrelation Time

Here we consider the case of a univariate stationary time series (see Appendix C)
xt , t = 1, 2, . . . , with zero mean and variance σ 2 . The autocorrelation function
ρ(τ ) is defined by ρ(τ ) = σ −2 E (xt xt+τ ), where xt , t = 1, 2 . . ., are considered as
random variables. Suppose now that the centred time series is a realisation of these
random variables, i.e. an observed time series of infinite sample, then an alternative
definition of the autocorrelation is

1
n
1
ρ(τ ) = 2 lim xt xt+τ , (8.1)
σ n→∞ n
k=1
  
where σ 2 = limn→∞ n1 nk=1 xt2 is the variance of this infinite sequence. The
connection between both the definitions is assured via what is known as the ergodic
theorem. Note that the first definition involves a probabilistic framework, whereas
the second is merely functional. Once the autocorrelation function is determined,
the power spectrum is defined in the usual way (see Appendix C) as

 ∞

f (ω) = ρ(τ )e 2π iτ ω
=1+2 ρ(τ ) cos2π ωτ. (8.2)
τ =−∞ τ =1

The autocorrelation function of a stationary time series goes to zero at large lags.
The decorrelation time is defined theoretically as the smallest lag τ0 beyond which
the time series becomes decorrelated. This definition may be, however, meaningless
8.2 Background on Persistence and Prediction of Stationary Time Series 173

since in general the autocorrelation does not have a compact support,2 and therefore
the decorrelation time is in general infinite.
Alternatively, the decorrelation time can be defined using the integral of the
autocorrelation function when it exists:
 ∞  ∞
T = ρ(τ )dτ = 2 ρ(τ )dτ (8.3)
−∞ 0

or similarly, for the discrete case,




T =1+2 ρ(τ ). (8.4)
τ =1

It is clear from Eq. (8.2) that

T = f (0). (8.5)

The integral T in Eqs. (8.3) and (8.4) represents in fact a characteristic time between
effectively independent sample values (Leith 1973) and can therefore be used as a
measure of the decorrelation time. This can be seen, for example, when one deals
with a AR(1) or Markovian time series. The autocorrelation function of the red
noise is ρ(τ ) = exp (−|τ |/τ0 ), which yields T = 2τ0 . Therefore, in this case
the integral T plays the role of twice the e-folding time of ρ(τ ), as presented
in Sect. 3.4.2 of Chap. 3. Some climatic time series are known, however, not to
have finite decorrelation time. These are known to have long memory, and the
corresponding time series are also known as long-memory time series. By contrast
short-memory time series have autocorrelation functions that decay exponentially
with increasing lag, i.e.

lim τ k ρ(τ ) = 0 (8.6)


τ →∞

for every integer k ≥ 0. Long-memory time series have autocorrelations that decay
hyperbolically, i.e.

ρ(τ ) ∼ aτ −α (8.7)

for large lag τ and for some 0 < α < 1, and hence T = ∞. Their power spectrum
also behaves in a similar way as

f (ω) ∼ bωα−1 (8.8)

2A function with compact support is a function that is identically zero outside a bounded subset of
the real axis.
174 8 Persistent, Predictive and Interpolated Patterns

as ω → 0. In Eqs. (8.7) and (8.8) a and b are constant (e.g. Brockwell and Davis
2002, chap. 10).

8.2.2 The Prediction Problem and Kolmogorov Formula

Consider again a stationary zero-mean time series xt , t = 1, 2, . . ., with variance


σ 2 and autocorrelation function ρ(). A familiar problem in time series analysis
is the prediction of xt+τ from previous values, i.e. from xs , s ≤ t. The τ -step
ahead prediction x̂t+τ of xt+τ in the mean square sense is given by the conditional
expectation:

x̂t+τ = E [xt+τ |ss , s ≤ t] (8.9)


 2
and minimises the mean square error E xt+τ − x̂t+τ . It is expressed as a linear
combination of previous values as

x̂t+τ = αk xt−k = h(B)xt , (8.10)
k≥0


where h(z) = k≥0 αk zk , and B is the backward shift operator, i.e. Bzt = zt−1 .
The prediction error covariance is given by
 2
σ12 = min E xt+τ − x̂t+τ . (8.11)

The prediction theory based on the entire past of the time series using least square
estimation has been developed in the early 1940s by Kolmogorov and Wiener and
is known as the Kolmogorov–Wiener approach. Equation (8.10) represents a linear
filter with response function H (ω) = h(eiω ), see Appendix C. Now the prediction
error ετ = xt+τ − x̂t+τ has (1 − H (ω)) as response function. Hence the prediction
error variance becomes
 π
σετ = |1 − H (ω)|2 f (ω)dω, (8.12)
−π

where f (ω) is the power spectrum of xt , t = 1, 2, . . . , at frequency ω. Equation


(8.12) requires an explicit knowledge of the filter (8.10) satisfying (8.11). Note,
however, that if we use a finite past to predict xt+1 , e.g.

x̂t+1 = a1 xt + a2 xt−1 + . . . ap xt−p+1 + εt+1 , (8.13)


8.2 Background on Persistence and Prediction of Stationary Time Series 175

then the coefficients a1 , a2 , . . . , ap can be obtained explicitly using the Yule–Walker


equations, and therefore σε2 in (8.12) becomes explicit and is a function of the lagged
covariance matrix of xt , t = 1, . . . , n.
The one-step ahead prediction error variance can be computed, in fact, without
need to know the optimal filter. It is given by an elegant formula, the Kolmogorov
formula, which is based solely on the knowledge of the power spectrum f (ω) of the
time series. The Kolmogorov formula reads (Kolmogorov 1939, 1941)
  π
1
σ12 = exp log 2πf (ω) dω , (8.14)
2π −π

1

which can also be written as 2π exp 2π −π log f (ω) dω . The logarithm used here
is the natural logarithm with base e. Note that the expression given by Eq. (8.14) is
always finite for a stationary process.

Exercise Show that for a stationary time series, 0 ≤ exp 2π 1
−π log f (ω) dω <
∞.

Hint Use the fact that log x ≤ x along with the identity σ 2 = −π f (ω) dω,

yielding −∞ ≤ −π log f (ω) dω ≤ σ 2 .

Note that −π log f (ω) dω can be identical to −∞, in which case σ1 = 0, and the
time series xt , t = 1, 2, . . ., is known as singular or deterministic; see e.g. Priestly
(1981) or (Brockwell and Davis 1991, 2002).
The Kolmogorov formula can be extended to the multivariate case, see e.g.
Whittle (1953a,b) and Hannan (1970). Given a nonsingular spectral density matrix
F(ω), i.e. |F(ω)| = 0, the one-step ahead prediction error is given by (Hannan 1970,
theorem 3” p. 158, and theorem 3”’ p. 162)
  π   π
1 1
σ12 = exp tr (log 2π F(ω)) dω = exp log |2π F(ω)|dω .
2π −π 2π −π

(8.15)

Recall that in (8.15) |2π F(ω)| = (2π )p |F(ω)|, where p is the dimension of the
multivariate time series. Furthermore, if x̂t+τ is the minimum square error τ -step
ahead prediction of xt and  ετ the error covariance matrix3 of ε τ = xt+τ − x̂t+τ ,
and then σ12 = det( ε1 ).

3 The error covariance matrix  ετ can also be expressed explicitly using an expansion in
power series of the spectral density matrix. In particular, if F(ω) is factorised as F(ω) =
∗ iω
2π 0 (e )0 (e ), then  ε1 takes the form:
1 iω

 ε1 = 0 (0)∗0 (0);

see Hannan (1970, theorem 1, p. 129).


176 8 Persistent, Predictive and Interpolated Patterns

Table 8.1 Comparison between time domain and spectral domain characteristics of multivariate
stationary time series
Time domain Spectral domain
 
τ = 0 :  = F(ω)dω ω = 0 : F(0) = τ  τ
π −2iπ ω dω

τ = 0 :  τ = 2π 1
−π F(ω)e ω = 0 : F(ω) = τ e2π iωτ
[ 0 ]ii = σi is the ith variance
2 [F(0)]ii = Ti is the ith decorrelation time

Remark Stationary time series can be equally analysed in physical space or spectral
space and yields the same results. There is a duality between these two spaces.
Table 8.1 shows a comparison between time and spectral domains.
The table shows that spectral domain and time domain analyses of stationary
time series are mirrors of each other assured using Fourier transform. It is also clear
from the table that the image of EOFs is persistent patterns, whereas the image of
frequency domain EOFs is POPs.

8.3 Optimal Persistence and Average Predictability

8.3.1 Derivation of Optimally Persistent Patterns

The objective here is similar to that of EOFs where, instead of looking for patterns
that maximise the observed variance of the space–time field, one seeks patterns
that persist most. The method has been introduced and applied by DelSole (2001).
Formally speaking, the spatial patterns themselves, like EOFs, are stationary. It
is the time component of the patterns that is most persistent, i.e. has the largest
decorrelation time.
Given a space–time field xt , t = 1, 2, . . . , the objective is to find a pattern u,
the optimally persistent pattern (OPP),  ∞whose time series coefficients display the
maximum decorrelation time T = 2 0 ρ(τ )dτ . Precisely, given a p-dimensional
times series xt , t = 1, 2, . . . , we let yt , t = 1, 2, . . . , n, be the univariate time
series obtained by projecting the field onto the pattern u, i.e. yt = uT xt . The
autocovariance function of this time series is given by

γ (τ ) = E (yt+τ yt ) = uT  τ u, (8.16)

where  τ is the autocovariance matrix function of xt . Hence the autocorrelation


function of the time series is

uT  τ u
ρ(τ ) = , (8.17)
uT u
where  is the covariance matrix of the time series xt , t = 1, 2, . . . . Using the
identity  −τ =  Tτ , the decorrelation time of (yt ) reduces to
8.3 Optimal Persistence and Average Predictability 177

 ∞ 
uT  τ +  Tτ dτ u
T = 0
. (8.18)
uT u
The maximum of T in (8.18) is given by the generalised eigenvalue problem:

 −1 Mu = λu, (8.19)
∞
where M = 0  τ +  Tτ dτ . Note that Eq. (8.19) can also be transformed to
yield a simple eigenvalue problem of a symmetric matrix as

Mv = C−T MC−1 v = λv, (8.20)

where C =  1/2 is a square root of the covariance matrix . The optimal patterns
are then given by

uk =  −1/2 vk , (8.21)

where vk , k = 1, . . . , p, are the eigenvectors of the symmetric matrix M in


Eq. (8.20). This matrix is easily obtained via an SVD decomposition of , and hence
the eigenvalue problem, Eq. (8.20), becomes easier to solve than the generalised
eigenvalue problem in Eq. (8.19). Note that by analogy with , it is appropriate to
refer to the matrix M = F(0) as the co-decorrelation time matrix. Finally, it is clear
from Eq. (8.18) that when the process is not long memory, the decorrelation time is
proportional to the ratio of the power at zero frequency (i.e. very low frequency) to
the total integrated power, i.e. variance. Equation (8.20) can be written in terms of
the power spectral matrix F(ω), and the equivalent generalised eigenvalue problem
is

F(0)u = λu. (8.22)

So the eigenvalues maximise the Raleigh quotient, and the eigenvectors maximise
the ratio of the zero frequency to the total power.
Exercise Show that M is semi-definite positive, and consequently M is also
symmetric and positive semi-definite.
Hint Use the autocovariance function γ () of yt .
Remark If F(ω) is the spectral density matrix of xt , then M = F(0) and is symmet-
ric semi-definite positive. Let zt , t = 1, 2, . . . , n, be a stationary multivariate time
series with M as covariance matrix, and then the generalised eigenvalue problem,
Eq. (8.19), represents the solution to a filtering problem based on signal-to-noise
maximisation, in which the input is xt , and the output is xt + zt .
The eigenvalue problem, Eq. (8.20), produces a set of optimal patterns that can
be ordered naturally according to the magnitude of the (non-negative) eigenvalues
of M. The patterns uk , k = 1, . . . , p, are not orthogonal, but vk , k = 1, . . . , p,
178 8 Persistent, Predictive and Interpolated Patterns

are. That is the optimal persistent patterns uk , k = 1, . . . , p, are orthogonal with


respect to the inner product < a, b >= aT b. The existence of the two sets of
patterns is a classical result from generalised eigenvalue problems, where uk and vk ,
k = 1, . . . , p, are known, respectively, as the signal patterns and the filter patterns.
The time series xt , t = 1, 2, . . . , n, can be decomposed using the orthogonal
basis vk , k = 1, . . . , p, to yield


p 
p
xt = αk (t)vk = xTt vk vk . (8.23)
k=1 k=1

In this case the time series coefficients are not uncorrelated. Alternatively, it is
possible to compromise the orthogonality of the filter patterns and get instead
uncorrelation of the time coefficients in a similar way to REOFs. This is achieved
using again the orthogonality of the filter patterns, or in other words the bi-
orthogonality of the optimal persistent pattern uk , and the associated filter wk =
uk , i.e.

wk ul = δkl

for k = 1, . . . , p. This property can be used to obtain the alternative expansion:


p
xt = βk (t)wk , (8.24)
k=1

where now

βk (t) = xTt uk . (8.25)

Note that the patterns (or filters) wk , k = 1, . . . , p, are not orthogonal, but the new
time series coefficients βk (t), k = 1, . . . , p, are uncorrelated. In fact, we have

E [βk (t)βl (t)] = uTk ul = δkl . (8.26)

The sampling errors associated with the decorrelation times T = λ, resulting from
a finite sample of data, can be calculated in a manner similar to to that of EOFs
(DelSole 2001; Girshik 1939; Lawley 1956; Anderson 1963; and North et al. 1982).
In some time series with oscillatory autocorrelation functions, such as those
produced by a AR(2) model, the correlation integral T in (8.3) can tend to zero as
the theoretical decorrelation time goes to infinity.4 DelSole (2001) proposes using
the integral of the square of the autocorrelation function, i.e.

∞
4 The example of a AR(2) model where ρ(τ ) = e−|τ |/τ0 cos ω0 τ and T1 = 0 ρ(τ )dτ =
−1
τ0 1 + ω02 τ02 was given in DelSole (2001).
8.3 Optimal Persistence and Average Predictability 179

 ∞
T2 = ρ 2 (τ )dτ. (8.27)
0

In this case, the maximisation problem cannot be solved analytically as in Eq. (8.18),
and the solution has to be found numerically. Note also that the square of the auto-
correlation function does not solve the problem related to the infinite decorrelation
time for a long-memory time series.
A comparison between the performance of T-optimals, Eq. (8.3), or T2-optimals,
Eq. (8.27), applied to the Lorenz (1963) model by DelSole (2001) shows that the T2-
optimal remains correlated well beyond 10 time units, compared to that obtained
using EOFs or Eq. (8.3). The latter patterns become uncorrelated after three time
units as shown in Fig. 8.1. This may have implication on forecasting models. For
example, the optimal linear prediction of the Lorenz (1963) system (Penland 1989)
had skill about 12 time unit, which makes the T2-optimal provide potential for
statistical forecast models. The T2-optimal mode (Fig. 8.1), however, cannot be well
modelled by the first-order Markov model. It can be better modelled by a second-
order Markov model or AR(2) as suggested by DelSole (2001). Furthermore, the
T2-optimal mode cannot also be produced by the POP model.

8.3.2 Estimation from Finite Samples

In practice we deal with discrete time series, and the matrix M is normally written
as a sum  +  1 +  T1 +  2 +  T2 + . . . , which is normally limited to some
lag τ0 beyond which the autocorrelations of the (individual) variables become non-
significant, i.e.

First EOF T1 Optimal T2 Optimal


62% T1 = 0.5 T2 = 1.9
1.0

0.5
ρτ

0.0

–0.5
0 10 0 10 0 10
τ τ τ

Fig. 8.1 Autocorrelation function of the leading PC (left), the leading time series of the T-optimal
(middle) and that of the T2 -optimal (right) of the Lorenz model. Adapted from DelSole (2001).
©American Meteorological Society. Used with permission
180 8 Persistent, Predictive and Interpolated Patterns

M =  +  1 +  T1 + · · · +  τ0 +  Tτ0 . (8.28)

Remark Note that, in general, τ0 need not be large, and in some cases it can be
limited to the first few lags. For example, for daily geopotential heights τ0 is around
a week. For monthly sea level pressure, τ0 ≈ 9 − 12 months. For sea surface
temperature, one expects a larger value of τ0 . Again here we suppose that the data
are short memory. There are of course exceptions with variables that may display
signature of long memory such as wind speed or perhaps surface temperature. In
those cases the lag τ0 will be significantly large. Some time series may have long-
term trends or periodic signals, in which case it is appropriate to filter out those
signals prior to the analysis.
Persistent patterns from atmospheric fields, e.g. reanalyses, using T or T2
measures may not be too different, particularly, for the prominent modes. In fact,
DelSole (2001) finds that the leading few optimal T- or T2-persistent patterns are
similar for allowable choices of EOF truncation (about 30 EOFs), but with possible
differing ordering between the two methods. For example, the leading T2-optimal
of daily NCEP 500–mb heights for the period 1950–1999 is shown in Fig. 8.2. This
pattern is also the second T-optimal pattern. The pattern bears resemblance to the
Arctic Oscillation (AO) Thompson and Wallace (1998). The trend signature in the
time series is not as strong as in the AO pattern as the T2-optimal is based on
all days not only winter days. Note also the 12–15 days decorrelation time from
the autocorrelation function of the time series (Fig. 8.2). An interesting result of
OPP is that it can also identify other low-frequency signatures such as trends and
discontinuities. For example, the trailing OPP patterns are found to be associated
with synoptic eddies along the storm track (Fig. 8.3).
In reality, the above truncation in Eq. (8.28) can run into problems when we
compute the optimal decorrelation time. In fact, a naive truncation of Eq. (8.18),
giving equal weights to the different lagged covariances, can yield a negative
decorrelation time as the lagged covariance matrix is not reliable due to the small
sample size used when the lag τ is large. Precisely, to obtain the finite sample of
Eq. (8.22), a smoothing of the spectrum is required as the periodogram is not a
consistent estimator of the power spectrum.
When we have a finite sample xt , t = 1, . . . , n, we use the sample lagged
covariance matrix
 1 n−τ
xt+τ xTt 0 ≤ τ < n
Cτ = n1 t=1n+τ (8.29)
t=1 xt xt−τ −n < τ < 0
T
n

In this case, Eq. (8.22) takes the form

t F̂(0)u = λC0 u, (8.30)


8.3 Optimal Persistence and Average Predictability 181

Fig. 8.2 The leading filter and T2-optimal persistent patterns (top), the associated time series
(middle) and its autocorrelation function (bottom) of the daily 500-hPa geopotential height
anomalies for the period 1950–1999. The analysis is based on the leading 26 EOFs/PCs. Adapted
from DelSole (2001). ©American Meteorological Society. Used with permission

where the sample power spectrum is given by


M
F̂(ω) = ατ Cτ e−iωk τ (8.31)
τ =−M
182 8 Persistent, Predictive and Interpolated Patterns

Fig. 8.3 Same as Fig. 8.2 but for the trailing filter and T2-optimal persistent pattern. Adapted from
DelSole (2001). ©American Meteorological Society. Used with permission

with ω√k = 2π k/n, and ατ is a lag window at lag τ . DelSole (2006) considered
M = n (Chatfield 1989) and used a Parzen window,
8.3 Optimal Persistence and Average Predictability 183

,
τ 2
1−6 +6 τ
if 0 ≤ τ ≤ M/2
ατ = M
τ 2
M (8.32)
2 1− M if M/2 ≤ τ ≤ M,

because it cannot give negative power spectra compared, for example, to the Tukey
window. Note that here again, through an inverse Fourier transform of Eq. (8.31),
the finite sample eigenvalues maximise a similar Raleigh quotient to that derived
from Eq. (8.22).
DelSole (2006) applied OPP analysis to surface temperature using the observed
data HadCRUT2, compiled jointly by the Climate Research Unit and the Met
Office’s Hadley Centre, and 17 IPCC AR4 (Intergovernmental Panel for Climate
Change 4th Assessment Report) climate models. He found, in particular, that the
leading two observed OPPs are statistically distinguishable from noise and can
explain most changes in surface temperature. On the other hand, most model
simulations produce one single physically significant OPP.
Remark A similar method based on the lag-1 autocorrelation ρ(1) was proposed by
Wiskott and Sejnowski (2002). They labelled it slow feature analysis (SFA), and it
is discussed more in Sect. 8.6.

8.3.3 Average Predictability Patterns

Average predictability pattern (APP) analysis was presented by DelSole and Tippett
(2009a,b) based on the concept of average predictability time (APT) decomposition,
which is a metric for the average predictability. Let us designate by  τ the
covariance matrix of  time series xt , t = 1, 2, . . . , at
 the forecast of a p-dimensional
lead time τ , i.e. E (x̂t+τ − xt+τ )(x̂t+τ − xt+τ )T , and the APT is given by



S=2 Sτ , (8.33)
τ =1

 
where Sτ = p1 tr ( ∞ −  τ ) −1 ∞ , also known as the Mahalanobis signal, and
 ∞ is the climatological covariance.
APT decomposition analysis seeks vectors v such that the scalar time series yt =
xTt v, t = 1, 2, . . ., has maximum APT. Keeping in mind that for these univariate
time series στ2 = vT  τ v and σ∞2 = vT  v, the pattern v is obtained as the solution

of the following generalised eigenvalue problem:


2 ( ∞ −  τ ) v = λ ∞ v. (8.34)
τ =1

Note that Eq. (8.34) can be transformed to yield the following eigenvalue problem:
184 8 Persistent, Predictive and Interpolated Patterns

∞ 
 
−1/2 T −1/2
2 I − ∞ τ ∞ w = λw, (8.35)
τ =1

1/2
where w =  ∞ v and is taken to be unit norm. The APP u is then obtained by
projecting the time series onto v, i.e. E(xt yt ), and

u =  ∞ v. (8.36)

A similar argument to the OPP can be applied here to get the decomposition of the
time series xt , t = 1, . . . , n, using Eqs. (8.24) and (8.25) after substituting uk and
p
vk , Eq. (8.36), for uk and wk , Eq. (8.24), respectively, i.e. xt = k=1 (vTk xt )uk .
To estimate the APPs, the linear model xt+τ = Axt + ε t , with A = Cτ C−1 0 , is
used, and for which  τ = C0 − Cτ C−1 0 C T . The patterns are then solution to the
τ
eigenproblem:

Cτ C−1
0 Cτ v = λC0 v.
T
(8.37)
τ

The estimation from a finite sample is obtained in a similar manner to the OPPs.
In fact, to avoid getting negative APT values, which could result from a “naive”
truncation of Eq. (8.37), DelSole and Tippett (2009b) suggest using a Parzen lagged
window ατ , given in Eq. (8.32), to weight the lagged covariance matrices, which
does not produce negative spectra. The APT is then estimated using


M
S=2 ατ Sατ . (8.38)
τ =1

DelSole and Tippett (2009b) applied APP analysis to the National Center
for Environmental Prediction/National Center for Atmospheric Research
(NCEP/NCAR) 6-hourly 1000-hPa zonal velocity during the 50-year period
from 1 January 1956 to 31 December 2005, providing a sample size of 73,052.
Their analysis reveals that the prominent predictable patterns reflect the dominant
low-frequency modes, including a climate change signal (leading pattern), a ENSO-
related signal (second pattern) in addition to the annular mode (third pattern). For
example, the second predictable component (Fig. 8.3) has an average predictability
time of about 5 weeks, and predictability is mostly captured by ENSO signal. They
also obtained the MJO signal when the zonal wind was reconstructed based on
the leading seven predictable patterns. The remaining patterns were identified with
weather predictability having time scales less than a week.
Fischer (2015, 2016) provides an alternative expression of the OPP and APP
analyses based on the reduced rank multivariate regression Y = XB + E, where E
is a n × p residual matrix and B a p × d matrix of regression coefficients. For OPP
8.4 Predictive Patterns 185

M
and APP analyses, the tth row of Y is given by yt = τ =−M ατ xt+τ , where ατ
represents the Parzen lag window, Eq. (8.32), at lag τ .

8.4 Predictive Patterns

8.4.1 Introduction

Each of the methods presented so far serves a particular goal. EOFs, for example, are
defined without constraint on the temporal structure (since they are based on using
contemporaneous maps) and are forced to be correlated with the original field. On
the other hand, POPs, for example, compromise the last property, i.e. correlation
with the field, but use temporal variation. EEOFs, or MSSA, use both the spatial
and the temporal structure and produce instead a set of patterns that can be used to
filter the field or find propagative patterns. They do not, however, make use of the
of the temporal structure, e.g. autocovariance, in an optimal way. This means that
there is no constraint on predictability.
Because they are formulated using persistence, the OPPs use the autocovariance
structure of the field. They achieve this by finding patterns whose time series evo-
lution decays very slowly. These patterns, however, are not particularly predictable.
Patterns that maximise covariance or correlation between forecast and analysis
(Renwick and Wallace 1995) deal explicitly with predictability since they involve
some measure of forecast skill. These patterns, however, are not the most predictable
since they are model-dependent.
An alternative way to find predictable patterns is to use a conventional measure
of forecast skill, namely the prediction error variance. In fact this is a standard
measure used in prediction and yields, for example, the Yule–Walker equations
for a stationary univariate time series. Predictive Oscillation Patterns (PrOPs)
Kooperberg and O’Sullivan (1996) achieve precisely this. PrOPs are patterns whose
time coefficients minimise the one-step ahead prediction error and as such are
considered as the Optimally Persistent Patterns or simply the most predictable
patterns. When this approach was introduced, Kooperberg and O’Sullivan (1996)
were not motivated by predicting the weather but mainly by working out a hybrid of
EOFs and POPs that attempt to retain desirable aspects of both, but not by predicting
the weather.

8.4.2 Optimally Predictable Patterns

Let xt , t = 1, 2 . . . , be a multivariate stationary time series with spectral density


matrix F(ω). The objective is to find a pattern u with the most predictable time
series yt = uT xt , t = 1, 2 . . . , in terms of the one-step ahead prediction. That is,
186 8 Persistent, Predictive and Interpolated Patterns

the time series yt , t = 1, 2 . . . , has the smallest one-step ahead prediction error
variance. Using the autocovariance of this time series, as a function of the lagged
covariance matrix of xt , see Eq. (8.16), the spectral density function f (ω) of yt ,
t = 1, 2 . . . , becomes

f (ω) = uT F(ω)u, (8.39)

and the one-step ahead prediction error Eq. (8.15) yields


  π  
1
σ12 = exp log 2π uT F(ω)u dω , (8.40)
2π −π

which has to be minimised. Because log is a monotonically increasing function, it


is simpler to minimise log(σ12 ) instead. The required pattern u is then obtained as
the solution to the optimisation problem:
 π 
minu −π log 2π uT F(ω)u dω s.t. uT u = 1, (8.41)

which can be transformed to yield the unconstrained problem:


  π uT F(ω)u
min F(u) = log dω . (8.42)
u −π uT u

Remark Equivalent formula can also be derived, such as


  π
uT u 1 uT F(ω)u
min T
exp log dω , (8.43)
u u u 2π −π uT Fu

where F = 1
2π −π F(ω)dω = 2π .
1

Higher order PrOPs are obtained in a similar manner under the constraint of being
orthogonal to the previous PrOPs. For instance, the k + 1th PrOP is given by
 π uT F(ω)u
uk+1 = argmin log dω. (8.44)
u, uT uα = δα,k+1 −π uT u

Suppose, for example, that the first k PrOPs, k ≥ 1, have been identified. The next
one is obtained as the first PrOP of the residuals:


k
zt = xt − yt,j uj (8.45)
j =1

for t = 1, 2, . . . , and where yt,j = xTt uj , j = 1, 2, . . . , k. In fact, Eq. (8.45) is


obtained by removing recursively the contributions from previous PrOPs and can be
rewritten as
8.4 Predictive Patterns 187

⎛ ⎞

k
zt = ⎝Ip − uj uTj ⎠ xt = Ak xt , (8.46)
j =1

and Ak is simply the projection operator onto the orthogonal complement to the
space spanned by the first k PrOPs. The PrOP optimisation problem derived from
the residual time series zt , t = 1, 2, . . . , yields
 π uT ATk F(ω)Ak u
min ln dω, (8.47)
u −π uT u

which provides the k + 1th PrOP uk+1 . Note that Ak uk+1 = uk+1 , so that uk+1
belongs to the null space of (u1 , . . . , uk ).

8.4.3 Computational Aspects

The optimisation problem in Eq. (8.42) or Eq. (8.47) can only be carried out
numerically using some sort of descent algorithm (Appendix E) because of the
nonlinearity involved. Given a finite sample xt , t = 1, 2, . . . , n, the first step consists
in estimating the spectral density matrix using, for example, the periodogram (see
Appendix C):

1  −itωp  T itωp
n n
I(ωp ) = xt e xt e (8.48)

t=1 t=1

for ωp = 2πpn , p = −[ 2 ], . . . , [ 2 ], where [x] is the integer part of x. Note that the
n−1 n

first sum in the rhs of Eq. (8.48) is the Fourier transform x̂(ωp ) of xt , t = 1, . . . , n
at frequency ωp , and I(ωp ) = π1 x̂(ωp )x̂∗T (ωp ), where the notation x ∗ stands for
the complex conjugate of x. However, since the periodogram is not a consistent
estimator of the power spectrum, a better estimator of the spectral density matrix
F(ω) can be obtained by smoothing the periodogram (see Appendix C):
n
[2]

F̂(ω) = (ω − ωp )I(ωp ), (8.49)
p=−[ n−1
2 ]

where () is a spectral window. The smoothing makes F̂(ω) asymptotically


consistent (see e.g. Jenkins and Watts 1968 and Chatfield 1996). Furthermore, the
smoothed periodogram F̂(ω) will be in general full rank and avoids the integrals in
Eq. (8.42) or Eq. (8.47) to be singular. The function to be optimised in Eq. (8.42) is
then approximated by
188 8 Persistent, Predictive and Interpolated Patterns

[ n2 ]−1 
π  uT F̂(ωk )u uT F̂(ωk+1 )u
F(u) = log + (8.50)
n uT u uT u
k=−[ n−1
2 ]

and similarly for Eq. (8.47). A gradient type (Appendix E) can be used in the
minimisation. The gradient of F(u) is given by
  π
1 1 F(ω)
− ∇F(u) = Ip − T
dω u
4π 2π −π u F(ω)u
⎡ ⎤
[ n2 ]−1
⎢  F̂(ωk ) F̂(ωk+1 ) ⎥
≈ ⎣I p − + ⎦ u. (8.51)
T T uT F̂(ωk+1 )uT
n−1 u F̂(ωk )u
k=−[ 2 ]

The application of PrOPs to 47-year 500-mb height anomalies using the National
Meteorology Center (NMC) daily analyses (Kooperberg and O’Sullivan 1996)
shows some similarities with EOFs. For example, the first PrOP is quite similar
to the leading EOF, and the second PrOP is rather similar to the third EOF. An
investigation of forecast errors (Fig. 8.4) suggests that PrOPs perform, for example,
better than POPs and have similar performance to EOFs.

Fig. 8.4 Forecast error versus the number of the patterns using PCA (continuous), POPs (dotted)
and PrOPs (dashed) of the NMC northern extratropical daily 500-hPa geopotential height
anomalies for the period of Jan 1947–May 1989. Adapted from Kooperberg and O’Sullivan (1996)
8.5 Optimally Interpolated Patterns 189

8.5 Optimally Interpolated Patterns

8.5.1 Background

Let again xt , t = 1, 2, . . ., be a zero-mean stationary multivariate time series with


covariance matrix  and spectral density matrix F(ω). In the prediction theory, the
τ -step ahead prediction xt,τ of xt given xt−k , for k ≥ τ , is given by the conditional
expectation:

xt,τ = E [xt |xt−τ , xt−τ −1 , . . .] . (8.52)

Equation (8.52) is known to minimise the mean square error variance


T
E xt − xt,τ xt − xt,τ . The prediction formula in Eq. (8.52) corresponds
also to a linear prediction using the past values of the times series, i.e.

xt,τ = h(B)xt , (8.53)


 τ +k and A , k = 1, 2, . . ., are p × p matrices.
where h(z) = k≥1 Ak z k
Equation (8.53) is a linear filter with frequency response function given by

H(ω) = h eiω . (8.54)

Accordingly, the error xt − xt,τ has Ip − H(ω) as frequency response function, and
hence the error covariance matrix takes the form (see Sect. 2.6, Chap. 2):
 π    ∗T
τ = Ip − H(ω) F(ω) Ip − H(ω) dω. (8.55)
−π

8.5.2 Interpolation and Pattern Derivation

This section discusses a method identifying patterns using interpolation, optimally


interpolated patterns (OIP) Hannachi (2008). Interpolation problems are related,
but not identical to the prediction problem. The objective here is to interpolate a
stationary time series, i.e. obtain a replacement of one or several missing values.
More details on interpolation in stationary time series can be found in Grenander
and Rosenblatt (1957), Bonnet (1965) and Hannan (1970). We suppose that we are
given xt−j , for all j = 0, the objective is then to seek an estimate x̂t of xt , which is
estimated by
190 8 Persistent, Predictive and Interpolated Patterns


x̂t = αj xt−j = h(B)xt , (8.56)
j =0


where h(z) = j =0 αj z
j, such that the mean square error

T
xt − x̂t 2
=E xt − x̂t xt − x̂t (8.57)

is minimised. This is equivalent (see Sect. 2.6) to minimising


 π    ∗T
tr Ip − H(ω) F(ω) Ip − H(ω) dω (8.58)
−π

under reasonable assumptions of continuity of the spectral density matrix F(ω) of


the stochastic process xt , t = 0, ±1, ±2, . . . , and integrability of F−1 (ω). The
T
covariance matrix  = E x̂t − xt x̂t − xt of the interpolation error xt − x̂t ,
which is also represented by Eq. (8.55), can be computed and is given by
 π −1
 = 2π (2π F(ω))−1 dω . (8.59)
−π

Furthermore,  is nonsingular, and the optimal interpolation filter is given by

H(ω) = Ip − 2π  −1 F−1 (ω). (8.60)

We give below an outline of the proof and, for more details, refer to the above
references. We let xk = (Xk1 , . . . , Xkp ), k = 0, ±1, ±2, . . . , be a sequence of zero-
mean second-order random vectors, i.e. with components having finite variances.
Let also Ht be the space spanned by the sequence {Xkj , k = t, j = 1, . . . , p, k =
0, ±1, ±2, . . . , k = t} known also as random function. Basically, Ht is composed
of finite linear combinations of elements of this random function. Then Ht has
the structure of a Hilbert space with respect to a generalised scalar product (see
Appendix F). The estimator x̂t in Eq. (8.56) can be seen as the projection5 of xt
onto this space. Therefore xt − x̂t is orthogonal to x̂t and also to xs for all s = t.
The first of these two properties yields

5 Not exactly so, because Ht is the set of all finite linear combinations of elements from
the sequence. However, this can be easily overcome by considering partial sums of the form
N
hN (B)xt = k=−N,k=0 αk xt−k and then make N approach infinity. The limit h() of hN () is
then obtained from
 π
lim tr (H(ω) − HN (ω)) F(ω) (H(ω) − HN (ω))∗ dω = 0, (8.61)
N →∞ −π

where HN (ω) = hN (eiω ).


8.5 Optimally Interpolated Patterns 191

 
E xt − x̂t x∗T
s = O for all s = t, (8.62)

where the notation (∗ ) stands for the complex conjugate. This equation can be
expressed using the  spectral density
 matrix F(ω) and the multivariate frequency
response function Ip − H(ω) of xt − x̂t , refer to Sect. 2.6 in Chap. 2. Note that
t can be set to zero because of stationarity. This necessarily implies
π  
−π Ip − H(ω) F(ω)e
isω dω = O for s = 0. (8.63)

This necessarily implies that the matrix inside the integral is independent of ω, i.e.
 
Ip − H(ω) F(ω) = A, (8.64)

where
 A is a constant matrix. The second orthogonality property, i.e.
E xt − x̂t x̂∗T
t = O, implies a similar relationship; namely,
 π
Ip − H(ω) F(ω)H∗T (ω)dω = O. (8.65)
−π

Now, by expanding the expression of the covariance matrix  τ in Eq. (8.55) and
using the last orthogonality property in Eq. (8.65), one gets
 π
= Ip − H(ω) F(ω)dω = 2π A. (8.66)
−π
 
Finally, substituting the expression of Ip − H(ω) from Eq. (8.64) into the expres-
sion of  in Eq. (8.55) and noting that A is real (see Eq. (8.66)) yield
 π
1
A−1 = F−1 (ω)dω, (8.67)
2π −π

where the invertibility of  is guaranteed by the integrability of F−1 .


Now given a unitary vector u, the previous interpolation problem can be cast
in terms of the univariate time series yt = uT xt , t = 1, 2 . . . . The mean square
interpolation error for yt is

2
E yt − ŷt = uT u. (8.68)

The optimally interpolated pattern (OIP) u is the vector minimising the interpolation
error variance, Eq. (8.68). Hence the OIPs are given by the eigenvectors of the
interpolation error covariance matrix  in Eq. (8.59) associated with its successively
increasing eigenvalues.
192 8 Persistent, Predictive and Interpolated Patterns

Exercise In general, Eq. (8.66) can only be obtained numerically, but a simple
example where  can be calculated analytically is given in the following model
(Hannachi 2008):

xt = uαt + ε t ,

where u is a constant p-dimensional vector, αt a univariate stationary time series


with spectral density function g(ω) and εt a p-dimensional random noise, uncorre-
lated in time and independent of αt , with covariance  ε . Assume var(αt ) = 1 =
u . Show that

β
 = ε + uuT ,
1 − βuT  ε −1 u

and find the expression of β.

8.5.3 Numerical Aspects

Given a sample of observed multivariate time series xt , t = 1, 2 . . . , n, to find the


OIPs of these data, one first estimates the interpolation error covariance matrix using
an estimate of the spectral density matrix as in the previous section using
 π  −1
ˆ −1 = 1
 F̂(ω) dω , (8.69)
4π 2 −π

where F̂(ω) is an estimate of the spectral density matrix given, for example, by
Eq. (8.49). Note that, as it was mentioned above, smoothing the periodogram makes
F̂(ω) full rank. A trapezoidal rule can then be used to approximate this integral to
yield
n
2 −1

ˆ −1 ≈ 1
 F̂−1 (ωk ) + F̂−1 (ωk+1 ) , (8.70)
4π n  n−1 
k=− 2

where [x] is the integer part of x, and ωk = 2π k/n represents the kth discrete
frequency. Any other form of quadrature, e.g. Gaussian, or interpolation (e.g. Linz
and Wang 2003) can also be used to evaluate the previous integral.
In Hannachi (2008) the spectral density was estimated using smoothed peri-
odogram given in Eq. (8.49), where () is a smoothing spectral window and
I(ωk ) = π −1 x̂(ωk )x̂∗T (ωk ), where  x̂(ωk ) is the Fourier transform of xt , t =
1, . . . , n at ωk , that is x̂(ωk ) = n−1/2 nt=1 xt e−iωk t .
8.5 Optimally Interpolated Patterns 193

8.5.4 Application

The following low-dimensional example was analysed by Hannachi (2008). The


system is a three-variable time series given by
⎧ 3/2
⎨ xt = α t + 1.6εt1
y = 32 αt + 2.4εt2 (8.71)
⎩ t
zt = 12 + 1.5εt3 ,

where α = 0.004, εt1 , εt2 and εt3 are first-order AR(1) models with respective lag-1
autocorrelation of 0.5, 0.6 and 0.3. Figure 8.5 shows a simulated example from this
model. The trend is shared between PC1 and PC2 but is explained solely by OIP1,
as shown in Fig. 8.6, which shows the histograms of the correlation coefficients
between the linear trend and the PCs and OIP time series.

Fig. 8.5 Sample time series of system, Eq. (8.71), giving xt , yt and zt (upper row), the three
PCs (middle row) and the three OIPs (bottom row). Adapted from Hannachi (2008). ©American
Meteorological Society. Used with permission. (a) xt . (b) yt . (c) zt . (d) PC 1. (e) PC 2. (f) PC 3.
(g) OIP 1. (h) OIP 2. (i) OIP 3
194 8 Persistent, Predictive and Interpolated Patterns

Fig. 8.6 Histograms of the absolute value of the correlation coefficient between the linear trend in
Eq. (8.71) and the PCs (upper row) and the OIP time series (bottom row). Adapted from Hannachi
(2008). ©American Meteorological Society. Used with permission

Another example of OIP application is discussed next using the northern


hemispheric sea level pressure (Hannachi 2008). Given the large dimensionality
involved in the computation of the spectral density matrix, an EOF truncation is
used. For example, with the retained leading m = 5 EOFs/PCs, the leading OIP has
95% total interpolation error variance, whereas the next two OIPs have, respectively,
2% and 1%. The leading two OIPs are shown in Fig. 8.7 along with the leading two
EOFs. The leading OIP, in particular, shows clearly the NAO signal. Compare this
with the mixed AO/NAO EOF1. The second OIP represents (neatly) the Pacific
oscillation pattern, compared again to the mixed EOF2 pattern.
It was noted in Hannachi (2008) that as the number of retained EOFs/PCs
increases, the spectrum of the interpolation covariance matrix isolates the leading
two OIPs, reflecting the low-frequency patterns. An example of spectrum is shown
in Fig. 8.8 with m = 20 EOFs/PCs. The stability of these two OIP patterns for
different values of m can be checked by comparing, for example, the patterns shown
in Fig. 8.7, with the same patterns for different values of m. This is shown in Fig. 8.9.
8.5 Optimally Interpolated Patterns 195

Fig. 8.7 Leading two OIPs (upper row) and EOFs (bottom row). The OIPs are based on the leading
5 EOFs/PCs of northern hemispheric SLP anomalies. Adapted from Hannachi (2008). ©American
Meteorological Society. Used with permission

Fig. 8.8 Leading 20


eigenvalues of the
interpolation error covariance
matrix shown in percentage
of total interpolation error
variance based on the leading
20 EOFs/PCs of northern
hemispheric SLP anomalies.
Adapted from Hannachi
(2008). ©American
Meteorological Society. Used
with permission
196 8 Persistent, Predictive and Interpolated Patterns

Fig. 8.9 Spatial and temporal correlation of the leading 3 OIP patterns (thin) and associated IPC
time series (bold), obtained based on the leading 5 EOFs/PCs, with the same OIPs and IPCs, for
m=5,6, . . . 25 EOFs/PCs of northern hemispheric SLP anomalies. Adapted from Hannachi (2008).
©American Meteorological Society. Used with permission

Power spectra of the leading five OIP time series, i.e. interpolated PCs (IPCs),
is shown in Fig. 8.10 based on two estimation methods, namely Welch and Burg
methods. The decrease of low-frequency power is clear as one goes from IPC1
to IPC5. Another example where OIP method was applied is the tropical SLP.
The leading two OIPs (and EOFs) are shown in Fig. 8.11. The leading OIP of
tropical SLP anomalies, with respect to the seasonal cycle, represents the Southern
Oscillation mode. Compare, for example, this mode with the leading EOF, which is
a monopole pattern. The second EOF is more associated with OIP1.

8.6 Forecastable Component Analysis

Forecastable patterns are patterns that are derived based on uncertainties in time
series. Forecastable component analysis (ForeCA) was presented by Goerg (2013)
and is based on minimising a measure of uncertainty represented by the (differential)
entropy of a time series. For a second-order stationary time series xt , t =
1, 2, . . ., with autocorrelation function ρ(.) and spectral density f (.), a measure
of uncertainty is given by the “spectral” entropy6

6 The idea behind this definition is that if U is a uniform distribution over [−π, π ] and V a random

variable,
√ independent of U , and with probability density function g(.), then the time series yt =
2 cos(2π V t + U ) has precisely g(.) as its power spectrum (Gibson 1994).
8.6 Forecastable Component Analysis 197

Fig. 8.10 Power spectra of


the leading five IPCs based on
the leading 5 EOFs/PCs of
northern hemispheric SLP
anomalies using the Welch
and Burg methods. Adapted
from Hannachi (2008).
©American Meteorological
Society. Used with
permission. (a) Power
spectral density (Welch). (b)
Power spectral density (Burg)

 π
1
H (x) = − f (ω) log f (ω)dω. (8.72)
log 2π −π

The forecastability of the time series is given by

F (x) = 1 − H (x). (8.73)

Note that the factor log 2π in Eq. (8.72) corresponds to the spectral entropy of a
white noise.
198 8 Persistent, Predictive and Interpolated Patterns

Fig. 8.11 Leading two OIPs and EOFs of the northern hemispheric SLP anomalies. The OIPs
are obtained based on the leading 5 EOFs/PCs. Adapted from Hannachi (2008). ©American
Meteorological Society. Used with permission. (a) Tropical SLP OIP 1. (b) Tropical SLP OIP
2. (c) EOF 1. (d) EOF 2
8.6 Forecastable Component Analysis 199

Given a n × p data matrix X, the ForeCA pattern is defined as the vector w that
maximises the forecastability of the time series x = XT w, subject to the constraint
wT Sw = 1, with S being the sample covariance matrix. Before proceeding, the data
matrix is first whitened, and then the multivariate spectrum matrix F(ω) is obtained.
The univariate spectrum along the direction w is then given by

F (ω) = wT F(ω)w. (8.74)

The ForeCA patterns are then obtained by minimising the uncertainty, i.e.
 π
w∗ = argmax F (ω) log F (ω)dω. (8.75)
w,wT w=1 −π

In application the integral in Eq. (8.74) is approximated by a sum over the discrete
frequencies ωj = j/n, j = 0, 1, . . . , n − 1. Goerg (2013) used the weighted
overlapping segment averaging (Nuttal and Carter 1982) to compute the power
spectrum. The estimate from Eq. (8.49) can be used to compute the multivariate
spectrum. Fischer (2015, 2016) compared five methods pertaining to predictability,
namely OPP, APP, ForeCA, principal trend analysis (PTA) and slow feature analysis
(SFA). PTA or trend EOF analysis (Hannachi 2007) is presented in Chap. 15, and
SFA (Wiskott and Sejnowski 2002), mentioned in the end of Sect. 8.3, is based
on the lag-1 autocorrelation. In Fischer (2016) these methods were applied to a
global dataset of speleothem δ 18 O for the last 22 ka, whereas in Fischer (2015) the
methods were applied to the Australian daily near-surface minimum air temperature
for the period of 1910–2013. He showed, in particular, that the methods give
comparable results with some subtle differences. Figure 8.12 shows the leading
three predictable component time series of δ 18 O dataset for the five methods. It is
found, for example, that OPP analysis, SFA and PTA tend to identify low-frequency
persistent components, whereas APP analysis and ForeCA represent more measure
of predictability. Of particular interest from this predictability analysis is the
association of these signals with North American deglaciation, summer insolation
and the Atlantic meridional overturning circulation (AMOC), see Fig. 8.12.
200 8 Persistent, Predictive and Interpolated Patterns

Component 1 Component 2 Component 3

OPA
0

–2

APTD
0

–2
Standardized δ18O

ForeCA
0

–2

SFA
0

–2

PTA
0

–2
20

15

10

20

15

10

20

15

10
5

Time (ka BP)

Fig. 8.12 The leading three predictable component time series (black) obtained from the five
methods, OPP (first row), APP (second row), ForeCA (third row), SFA (fourth row) and PTA
(last row) applied to the global δ 18 O dataset. The yellow bars refer, respectively, to the timing of
Heinrich Stadial 1, centred around 15.7 ka BP, and the Younger Dryas, centred around 12.2 ka BP.
The red curves represent the percentage of deglaciated area in North America (left), the summer
insolation at 65◦ N (middle) and the AMOC reconstruction (right). Adapted from Fischer (2016)
Chapter 9
Principal Coordinates or
Multidimensional Scaling

Abstract This chapter describes patterns obtained based on proximity or similarity


measures, i.e. multidimensional scaling (MDS). Conventional EOFs correspond
to the case of quadratic distance. In this chapter other forms of similarities are
discussed, with climate application, and which can yield structures that cannot be
revealed by classical MDS.

Keywords Principal coordinate analysis · Dissimilarity measure ·


Multidimensional scaling · Classical scaling · Stress function · Non-metric
scaling · Isomap · Kernel smoothing · Asian summer monsoon

9.1 Introduction

Most multivariate data analysis techniques involve in a way or another a measure


of proximity or similarity, between variables. Take for example ordinary EOFs or
PCA, the covariance (or correlation) matrix involves a measure of covariability
between all pairs of variables. The correlation between two variables provides
in fact a measure of proximity between them. A correlation value of one means
that the two variables are proportional whereas a zero correlation means no linear
association. This proximity, however, is not a standard distance in the coordinate
space but provide a measure proximity with respect to time (for time series) or
realisations in general. In fact, a high correlation between two time series means
that the two variables evolve in a coherent (or similar) manner in time hence the
concept of similarity. On various occasions data are given in terms of distances
between the different variables instead of their actual coordinates or their time
variations. A familiar problem is then to find a configuration of the points in a low-
dimensional space using these pairwise distances. This has given rise to the concept
of multidimensional scaling.
Multidimensional scaling (MDS) is a geometric method for reconstructing a
configuration from its interpoint distances, and enables one to uncover the structure
embedded in high-dimensional data. It is an exploratory technique that aims at
visualising proximities in low-dimensional spaces. It attempts therefore to preserve,

© Springer Nature Switzerland AG 2021 201


A. Hannachi, Patterns Identification and Data Mining in Weather and Climate,
Springer Atmospheric Sciences, https://doi.org/10.1007/978-3-030-67073-3_9
202 9 Principal Coordinates or Multidimensional Scaling

as much as possible, the observed proximities in the high-dimensional data space.


In visualisation MDS plots similar objects close together. For example, EOFs can
find a low-dimensional embedding of the data points that preserves their variances
as measured in the high-dimensional input space. MDS, on the other hand, finds that
embedding which preserves the interpoint distances. As it will be shown later, these
two methods are equivalent when the distances are Euclidean.
The MDS method has been developed originally in Psychology in the mid
1930s by Schoenberg (1935). The fundamental and classical paper on MDS is
that of Young and Householder (1938, 1941). Later, the method was developed
further by Torgerson (1952, 1958) by incorporating ideas due to Jucker and others,
see, e.g., Sibson (1979), and is sometimes referred to as classical scaling. Gower
(1966) popularised the method under the name of principal coordinate analysis.
The method was later extended to other fields such as sociology, economy and
meteorology, etc. MDS is not only a method of visualisation of proximities, but also
a method of dimension reduction like PCA and others (see e.g. Carreira-Perpiñán
2001, Tenenbaum et al. 2000). Various text books present in more detail MDS (Cox
and Cox 1994; Young 1987; Mardia et al. 1979; Borg and Groenen 2005; Chatfield
and Collins 1980).

9.2 Dissimilarity Measures

The starting point of MDS is a set of interpoint distances, or precisely a matrix


consisting of all pairwise similarities. The Euclidean distance is a special case of
dissimilarity measure. In general, however, dissimilarities need not be distances in
the usual Euclidean sense.
Definition A n × n matrix D = (dij ) is a distance matrix if it is symmetric and
satisfies dij ≥ 0, and dii = 0, for i, j = 1, 2, . . . , n
Remark Some authors define a distance matrix as the opposite of the matrix given
in the above definition, i.e. symmetric matrix with zero-diagonal and non-positive
off-diagonal elements. The above definition provides the general form of distance
or dissimilarity matrix. For example, the metric inequality dij ≤ dik + dkj , for all
i, j and k, is not required.
The choice of a dissimilarity measure depends in general on the type of data and
the problem at hand. For example, in the case of quantitative measurements, such as
continuous data, the most common dissimilarity measures include the Minkowski
distance, where the distance between two points xi = (xi1 , . . . , xin ) and xj =
xj 1 , . . . , xj n is given by

 n 1
 λ
dij = xi − xj λ = |xik − xj k |λ (9.1)
k=1
9.2 Dissimilarity Measures 203

for λ > 0. The Minkowski distance is a generalisation of the ordinary Euclidean


distance obtained with λ = 2. Note also that when λ = 1 we get the L1 -norm and
when λ → ∞ we get the L∞ -norm, i.e. xi − xj ∞ = maxk |xik − xj k |.
Other dissimilarity measures also exist such as Canberra and Mahalanobis metrics.
The latter is closely related to the Euclidean metric and is given by dij =
xi − xj S−1 xi − xj , where S is the data covariance matrix. Table 9.1 provides
T

some of the most widely used metrics for continuous variables.


Euclidean and related distances are not particularly useful when, for example,
the measurements have different units, or when they are categorical, i.e. not of
numerical type. For binary data, a similarity coefficient is usually defined as the
rate of common attributes to the total number of attributes. Assume each individual
takes the values 0 or 1, and let a and d designate the number of common attributes
between individuals i and j , i.e. a attributes when i = j = 1 (i.e. the individuals
co-occur), and d attributes when i = j = 0 as summarised in the contingency table,
see Table 9.2. Then the similarity coefficient is defined by

a+d
sij = , (9.2)
a+b+c+d

and the dissimilarity (or distance) can be defined by dij = 1 − sij . As an illustration
we consider a simple example consisting of six patients labelled A, B, . . . , F .
The variables (or attributes) considered are sex (M/F), age category (old/young),
employment (employed/unemployed) and marital status (married/single) and we use
the value 1 and 0 to characterise the variables. We then construct the following data
matrix (Table 9.3):
a+d
The similarity between, for example, A and B is a+b+c+d = 2+1+1+0
2+1
= 34 .
For categorical data the most common dissimilarity measure is given by Sneath’s
coefficient:
1
sij = 1 − δij k ,
p
k

Table 9.1 Examples of the Metric Formula d(xi , xj )


most widely used metrics for  1
k (xik − xj k )
Euclidean 2 2
continuous variables
 1
k (xik − xj k )
Minkowski λ λ

xi − xj S−1 xi − xj
T
Mahalanobis
 |xik −xj k |
Canberra k |xik |+|xj k |
One minus Pearson correlation 1 - corr(xi , xj )

Table 9.2 Contingency table i


1 0
j 1 a b
0 c d
204 9 Principal Coordinates or Multidimensional Scaling

Table 9.3 Example of a Variables


binary data matrix
Attributes A B C D E F
Sex 1 0 1 0 0 1
Age 1 1 0 0 1 1
Employment 1 1 1 1 1 1
Marital status 0 0 1 0 1 1

where δij k is 1 if i and j agree on variable k and 0 otherwise. An overview of other


coefficients used to obtain various types of proximities can be found, e.g. in Cox
and Cox (1994).

9.3 Metric Multidimensional Scaling

9.3.1 The Problem of Classical Scaling

We assume that we are given a n × p data matrix, i.e. a set of n measurements


in a p-dimensional Euclidean space. It is always possible to compute a p × p (or
n × n) symmetric association matrix between the variables. Now given a matrix of
distances D = (dij ) between the different n points the converse problem consists
in finding the original coordinates or configuration of the points as well as the
dimension of the space embedding the data. This is the original metric or classical
scaling problem (Torgerson 1952; Young and Householder 1938) when the data
are quantitative. We suppose that our data matrix is X = (x1 , x2 , . . . , xn )T where
T
xk = xk1 , . . . , xkp is the kth point with its p coordinates. The objective is then
to find X given the distance matrix D. Note that all we needed for MDS is the
matrix of proximities between the units without worrying about the measurements
on individual variables.
Before we carry on, let us go back a bit to EOFs. We know that EOFs correspond
to an orthogonal projection of the data from a p-dimensional Euclidean space Ep
onto a lower r-dimensional subspace Er where the overall variability is maximised.
Now we let dij and dij∗ designate respectively the distances in Ep and Er . Then it is
 Er defined by taking the leading r PCs is also the
possible to show that the subspace
 ∗2
space for which ij dij − dij is minimised. The idea in MDS is also similar. We
2

seek a low-dimensional representation of the data such that the obtained distances
give a faithful representation of the true distances or dissimilarities between units.
9.3 Metric Multidimensional Scaling 205

9.3.2 Principal Coordinate Analysis

So far we did not impose extra constraints on the distance matrix D. However, to
be able to solve the problem, we need the nondegenerate1 distance or dissimilarity
matrix  = (δij ) to satisfy the triangular metric inequality:

δij ≤ δik + δkj (9.3)

for all i, j and k. In fact, when (3) is satisfied one can always find n points in a
(n − 1)-dimensional Euclidean space En−1 in which the interpoint distances satisfy
dij = δij for all pairs of points. For what follows, and also for simplicity, we suppose
that the data have been centred by requiring that the centroid of the set of n points
lie at the origin of the coordinates:


n
xk = 0. (9.4)
k=1

Note that if the dissimilarity matrix  does not fulfil nondegeneracy or the
triangular metric inequality (9.3), then the matrix has to be processed to achieve
these properties. In the sequel we assume, unless otherwise stated, that these two
properties are satisfied. It is possible to represent the information contained in  by
n points in En−1 , i.e. using a n × (n − 1) matrix of coordinates. An application of
PCA analysis to this set of points can yield a lower dimensional representation in
  2
which the dissimilarity matrix ∗ minimises i,j δij − δij∗ . Since EOF analysis
involves an eigen decomposition of a covariance-type matrix, we will express the
distances δij , that we suppose Euclidean, using the n × n scalar product matrix A=
p
XXT , where the element at the ith line and j th column of A is aij = k=1 xik xj k .
p 2
Using Eq. (9.4) and recalling that δij2 = k=1 xik − xj k , we get

1 2 
aij = −
δij − δi.2 − δ.j2 + δ..2 , (9.5)
2
 
where δi.2 = δ.i 2 = n1 nj=1 δij2 and δ..2 = n12 ni,j =1 δij2 .
Exercise Derive Eq. (9.5).
Hint By summing δij2 = aii + ajj − 2aij over one index then both indexes one gets
 n
δ.j2 = ajj + n1 ni=1 aii = δj. 2 and
i=1 aii = 2 δ.. . Hence the diagonal terms are
n 2

aii = δi.2 − 12 δ..2 , which yields −2aij = δij2 − δi.2 − δ.j2 + δ..2 .

1 That is zero-diagonal; δii = 0 for all i.


206 9 Principal Coordinates or Multidimensional Scaling

 
The expression aij = 1
2 aii + ajj − δij2 represents the cosine law from triangles.2
Also, the matrix with elements (δij2 − δi.2 − δ.j2 + δ..2 ), i, j = 1, . . . n, is known as
the double-centred dissimilarity matrix. Denote by 1 = (1, 1, . . . , 1)T the vector of
length n containing only ones, the scalar product matrix A in (9.5) can be written in
the more compact form
 
1 1 1
A=− In − 11T 2 In − 11T , (9.6)
2 n n

where 2 = (δij2 ), and In is the identity matrix.


 
Remark The operator In − n1 11T is the projection operator onto the orthogonal
complement 1⊥ to 1. Furthermore, Eq. (9.6) can be inverted to yield 2 . From the
expression of δij2 shown in the previous exercise we have

2 = 11T Diag(A) + Diag(A)11T − 2A (9.7)

which can be regarded formally as the inverse of Eq. (9.6).


By construction, A is symmetric positive semi-definite, and can be decomposed as
A = U2 UT where U is orthogonal. Hence the principal coordinate matrix X is
given:3

X = U, (9.8)

of order n × d, if d is the rank of A. Hence we get a configuration in d dimensions,


and the components (λ1 u1i , λ2 u2i , . . . , λn uni ) provide the coordinates of the ith
point. Note in passing that the solution X in (9.8) minimises the matrix norm A −
XXT .
Remark: Relation to EOFs Given an Euclidean distance matrix, the classical MDS
solution is obtained analytically and corresponds to a SVD decomposition of A, i.e.
A = XXT . In EOFs one seeks an eigenanalysis of XT X. But these two methods are
equivalent, see Sect. 3.3 of Chap. 3. One could anticipate that for a data matrix X,
the EOFs correspond to the right singular vectors of X, and the principal coordinates
correspond to the left singular vectors of X, i.e. the PCs. Although this is true in
principle, it is also misleading because in EOFs we have the data matrix and we seek

2 This is because in a triangle with vertices indexed by i, j and k, and side lengths dij , dik
d 2 +d 2 −d 2
and dj k the angle θi at vertex i satisfies cos θi = ij 2dijikdik j k . Hence, if we define bij k =
 
dij2 + dik
2 − d 2 /2, then b
jk ij k = dij dik cos θi , i.e. a scalar product.

decomposition of a positive semi-definite symmetric matrix A as A = QQT is known as


3 The

Young–Householder decomposition.
9.3 Metric Multidimensional Scaling 207

EOFs whereas in MDS we have the distance matrix and we seek the data matrix.
One should, however, ask: “how do we obtain EOFs from the same dissimilarity
matrix?” The eigenvectors of the matrix of scalar products provide, in fact, the PCs,
see Sect. 3.3 of Chap. 3. So one could say the PCs are the principal coordinates, and
the EOFs are therefore given by a linear transformation of the principal coordinates.

Classical Scaling in Presence of Errors

When the matrix A has zero eigenvalues the classical procedure considers only the
part of the spectrum corresponding to positive eigenvalues. The existence of zero
eigenvalues is implicitly assured by the double-centred structure of A since A1 = 0.
A natural question arises here, which is related to the robustness of this procedure
when errors are present. Sibson (1972, 1978, 1979) has investigated the effect of
errors on scaling. It turns out, in particular, that classical scaling is quite robust,
where observed dissimilarities remain approximately linearly related to distances
(see also Mardia et al. 1979).

9.3.3 Case of Non-Euclidean Dissimilarity Matrix

When the dissimilarity matrix is not Euclidean the matrix A obtained from (9.5) may
cease to be positive semi-definite, hence some of its eigenvalues may be negative.
The classical solution to this problem is simply to choose the first k largest (positive)
eigenvalues of A, and the corresponding normalised eigenvectors, which are taken
as principal coordinates. Another situation that appears in practice corresponds to
the case when one is provided with similarities rather than dissimilarities. A n × n
symmetric matrix C = (cij ) is a similarity matrix if cij ≤ cii for all i and j . A
standard way to obtain a distance matrix D = (dij ) from the similarity matrix C is
to use the transformation:

dij2 = cii − 2cij + cjj . (9.9)

It turns out that if C is positive semi-definite, then D is Euclidean, and the


corresponding scalar product matrix A is given by
   
1 1
A = In − 11T C In − 11T . (9.10)
n n

Note that the matrix A in (9.10) is positive semi-definite, and it is straightforward


to show that (9.9) leads to (9.10) as in the previous case. When the matrix C is not
positive semi-definite, a solution is presented in Sect. 5.3 towards the end of this
chapter.
208 9 Principal Coordinates or Multidimensional Scaling

Remark: Invariance to Linear Transformations The MDS configuration is indeter-


minate with respect to translation, and rotation. The most famous example of MDS
application is the road map where distances between cities are provided and the
objective is to find a map showing the cities location. The obtained MDS solution
is the same if for example we reflect the map. More generally, the MDS solution is
invariant to any affine transformation of the form y = Bx + b.

9.4 Non-metric Scaling

In classical MDS the dissimilarity between pairs of objects is assumed to be


linearly or quadratically related to the distances between the corresponding pairs of
individual points in some geometric configuration. However, perfect reproduction
of the Euclidean distance is not always the best choice, particularly when we are
dealing with ordinal data. In fact, it appears that in many situations the actual
numerical values of similarities/dissimilarities have little intrinsic meaning, e.g. in
ordinal data. In this case the rank order of the distances between objects becomes
more meaningful.
The objective of non-metric or ordinal MDS is to attempt to recover the order
of the proximities, and not the proximities or their linear transformation. The
projection we are seeking should therefore attempt to match the rank order of
the distances in the projected (quite often two-dimensional) space to the rank
order of the distances in the original space. This can be achieved by introducing
monotonically increasing mappings that act on the original dissimilarities, hence
preserving the ordering of the dissimilarities. The method of non-metric MDS,
developed by Shepard (1962a,b) and Kruskal (1964a,b), is based on the assumption
that the interpoint distances dij , in the projection space are monotonically related to
the dissimilarities δij in the original configuration. In order to find a configuration
such that the ordering of the distances dij and the dissimilarities δij are as close
as possible Shepard (1962a,b) and Kruskal (1964a,b) developed a way to measure
how well the dissimilarities and distances depart from monotonicity. This measure
is provided by the so-called stress function, and measures the difference between
the distances dij and, not the dissimilarities, but the best monotonic transformation
δ̂ij of the dissimilarities δij . The stress is given by

  2
i<j dij − d̂ij
S=  2
, (9.11)
i<j dij

where (d̂ij ) are obtained as the least square monotone regression of the distances
(dij ) on the dissimilarities (δij ). Note that if (δij ) and (dij ) have the same ranking
order then the stress is zero, and is between 0 and 1 otherwise. Unlike the
classical scaling where the dimension space of the new configuration can be chosen
9.4 Non-metric Scaling 209

explicitly by fixing, say, the percentage of total variance explained by the principal
coordinates, in the non-metric case the dimension is unknown. In practice two or
three dimensions are normally chosen. The monotonic regression (increasing in
this case) requires that the order is preserved after transformation. Two possible
definitions of monotonicity exist:
• If δij < δkl , then d̂ij ≤ d̂kl , this is known as the primary monotone condition.
• If δij ≤ δkl , then d̂ij ≤ d̂kl , which is the secondary monotone condition.
The true requirements differ only in the presence of ties. The secondary monotone
approach requires, in particular, that if δij = δkl , then one should have d̂ij = d̂kl .
The primary approach is more general, with less convergence problems and is also
referred to as the global scaling (Chatfield and Collins 1980).
For a given configuration of distances (dij ), the distances (d̂ij ) are obtained, as
stated above, from the (primary/secondary) least squares monotone regression of
(dij ) on (δij ). That is, they are the set of values that minimise (Kruskal 1964a,b):

 2
dij − d̂ij (9.12)
i<j

subject to being (primary/secondary) monotone increasing over the dissimilarities


(δij ). In other words d̂ij is the point on the (monotone increasing) regression curve
corresponding to δij .
Remark If the “dissimilarity” matrix  is not symmetric, a new dissimilarity matrix
can be defined by taking the symmetric part 12  + T of .
The non-metric MDS algorithm is iterative by construction. One starts by choosing
the desired dimension for the configuration. Normally low-dimensions are desired
for obvious reasons. Within this low dimensional space one starts by selecting
an initial configuration, i.e. a set of coordinates and compute the corresponding
Euclidean distances (dij ). The distances (d̂ij ) are then obtained by a monotonic
regression of (dij ) on (δij ). The goodness of fit is provided by the stress S
in Eq. (9.11). The coordinates are updated so as to minimise S. Kruskal (1964)
suggests the values 0.05 and 0.2 for S as good and poor respectively. The situation,
however, is more complex since S increases with the sample size n and the
dimension p. Note that since the stress S is a differentiable function of the distances
(dij ) (and also of the corresponding coordinates) descent algorithms can be adopted
for the minimisation. Note also that the coordinates of the configuration are obtained
from the distances (dij ) using classical scaling.
In the minimisation procedure one should be prepared to accept that the solution
may not be global. In fact, the obtained solution can be a simple local minimum
and that a global minimum may not exist. Remember that the stress function is non-
quadratic. One way to proceed with the minimisation is to try various initial random
configurations and then choose the best in terms of stress. The choice of the type
of monotonic increase can also affect the convergence. Some authors, e.g. Sibson
210 9 Principal Coordinates or Multidimensional Scaling

(1972), and Lingoes and Roskam (1973), point out that the secondary definition
of monotonicity is usually less satisfactory than the primary definition. Gordon
(1981) further argues that the secondary configuration appears to be less readily
interpretable. More discussion on the subject can be found in Young (1987), Cox
and Cox (1994), Borg and Groenen (2005), and Mardia et al. (1979).
Remark: Non-metric Scaling and Errors The non-metric scaling solution can also
be viewed as a solution to the classical scaling problem except that the distances
are not known perfectly but are corrupted with various types of noise such as
measurement and distortion errors. In this case what we observe is a distance matrix
D = (dij ) of the form:

dij = f (δij + εij ), (9.13)

where f () is an unknown monotonic increasing function and (εij ) are error terms.
The objective is then to reconstruct this unknown configuration (δij ). It is clear
in this example that a better and more reliable information is not the exact value of
these distances but their rank order. Consequently, we are better off using non-metric
or ordinal scaling. In some simple cases where, for example, f () is the identity and
the noise characteristics are known, then one could attempt to use a linear filter
and then obtain an estimate of δij . In general, however, this information is seldom
available in which case we opt for the former solution.

9.5 Further Extensions

9.5.1 Replicated and Weighted MDS

In replicated MDS, RMDS, (McGee 1968, Young and Hamer 1994) we are given
m similarity matrices that have to be analysed simultaneously. In metric RMDS
the distance matrices are determined by minimising the sum of squared elements
  k ∗
2
ij k δ ij − δij where the sum includes not only indices related to individuals
but also indices of the individual distance matrices. Similarly, non-metric RMDS
also attempts to minimise the total (including individual dissimilarity matrices)
sum of squared residuals, similar to Eq. (9.12), between distances and monotonic
transformations of the dissimilarities. The RMDS generates one unique distance
matrix D that is “simultaneously like” all the dissimilarity matrices, which we can
write as:

Fk (k ) = D + E k

for k = 1, 2, . . . m, where Fk () is a transformation; linear for classical MDS and


monotonic for non-metric case, and E k is an error matrix.
9.5 Further Extensions 211

Another major development in MDS includes weighted MDS (WMDS), which


involves the following definition of weighted Euclidean distances:
1
T 2
dk,ij = xi − xj Wk xi − xj ,

where Wk , k = 1, 2, . . . m, are diagonal matrices. WMDS generates m distance


matrices Dk , k = 1, 2, . . . m, one for each dissimilarity matrix k . They are
calculated so that they are as close as possible to their corresponding dissimilarity
matrix. Note that in WMDS we need to solve for the matrix of coordinates, the m
weight matrices Wk , and distances Dk , k = 1, 2, . . . m. The objective function to be
minimised is similar to that of RMDS.

9.5.2 Nonlinear Structure

Many datasets may contain nonlinear structures that cannot be revealed by classical
MDS. This includes data distributed on lower dimensional manifolds, such as Swiss-
roll, tori, etc., where only geodesic distances, i.e. distances on the manifold, can
reflect the true low-dimensionality of the data. Various methods have attempted
to solve this problem. These include local linear techniques and other nonlin-
ear techniques such as local linear embedding, that computes low-dimensional
neighbourhood-preserving embeddings of high-dimensional data (Roweis and Saul
2000). An MDS-related method, the ISOMAP, was proposed by Tenenbaum et al.
(2000). Their approach builds on classical MDS but, in addition, seeks to preserve
the intrinsic geometry of the data, hence including the global geometry of the data.
Their definition of geodesic distance is based on computing shortest paths in a graph
with edges connecting neighbouring data points. This is particularly helpful when
the data contain for example holes. In their algorithm each point is connected to its
“nearest” neighbours using the provided distances dij , which are used to define a
new weighted distance dw (i, j ) given by4

dij if i and j are neighbours
dw (i, j ) = (9.14)
∞ otherwise.

The geodesic distance dG is then given by

dG (i, j ) = min {dw (i, j ), dw (i, k), dw (k, j )} . (9.15)


k

This condition is to satisfy the triangular inequality. The procedure to construct


the distance matrix DG = (dG (i, j )) is known as Floyd’s algorithm and is known

4A similar distance dictated by time proximity, i.e. using the autocorrelation can be used.
212 9 Principal Coordinates or Multidimensional Scaling

to require O(n3 ) operations, see e.g. Borg and Groenen (2005) for algorithms
that exploit the sparsity structure of the neighbourhood graph. The configuration
 2
of the points is then obtained by minimising δij − δij∗ , which yields the
solution, Eq. (9.8), obtained using the scalar product matrix Eq. (9.6) with [2 ]ij =
dG (i, j )2 .
Remark Note that the main part of the algorithm consists in computing the
neighbourhood using the k-nearest neighbours or using a ball of small fixed radius
. The choice of the parameters k or  is discussed, e.g. in Ross et al. (2008), Gamez
et al. (2004) and Hannachi and Turner (2013b).

9.5.3 Application to the Asian Monsoon

Nonlinear MDS via Isomap or related methods can reveal structures that cannot be
obtained using linear MDS. We discuss here an application to the Asian summer
monsoon (ASM) using the European Re-analyses (ERA-40) products from the
European Centre for Medium Weather Forecast, ECMWF, (Uppala et al. 2005). The
example is taken from Hannachi and Turner (2013b) using sea level pressure (SLP)
and 850-hPa wind fields over the Asian monsoon region (50–145◦ E, 20S –20◦ N) for
June–September (JJAS) 1958–2001. The JJAS climatology of monsoon is shown
in Fig. 9.1, which is dominated by the low-level Somali jet bringing moisture from
the Indian Ocean into the Indian peninsula with a general low pressure over land
masses. The leading two EOFs of the SLP anomalies, with respect to the mean
seasonal cycle, are shown in Fig. 9.2. The leading mode of variability reflects one
sign over the whole region associated with an overall positive or negative phase of

Fig. 9.1 JJAS climatology of ERA-40 sea level pressure and 850-hPa wind over the ASM region.
SLP contour interval: 2-hPa, and maximum wind speed: 18 m/s. Adapted from Hannachi and
Turner (2013b)
9.5 Further Extensions 213

Fig. 9.2 Leading two EOFs


of JJAS ERA-40 SLP
anomalies over the ASM
region. (a) EOF1 (22%). (b)
EOF2 (15%). Adapted from
Hannachi and Turner (2013b)

Fig. 9.3 Gaussian kernel


estimate of the probability
density function of the
leading two PCs of ASM
ERA-40 SLP anomalies.
Adapted from Hannachi and
Turner (2013b)

the EOF. The second EOF reflects a south-west/north-east dipole reflecting broadly
the variability of the strength between the Indian and east Asian monsoon.
A large volume of Asian monsoon literature deals with the existence of two main
monsoon states: break (or weak) and active (or strong) phases. Figure 9.3 shows a
kernel PDF estimate of JJAS daily SLP anomalies, where no signature of bimodality
exists. Because Isomap operates locally, it can follow the nonlinear manifold by
building a network using local neighbourhood. An example of such neighbourhood
of SLP anomalies is shown in Fig. 9.4. The two leading isomap time series and the
corresponding PDF are shown in Fig. 9.5, where a clear bimodality emerges. The
214 9 Principal Coordinates or Multidimensional Scaling

Fig. 9.4 Neighbourhood


graph for the first 100 ASM
SLP anomalies within the
space spanned by the leading
two Isomap time series. Axis
units arbitrary. Adapted from
Hannachi and Turner (2013b)

Fig. 9.5 Leading two Isomap time series of the ASM SLP anomalies (a) and the corresponding
two-dimensional kernel PDF estimate (b). Adapted from Hannachi and Turner (2013b)

two modes reflect in fact the break and active ASM phases (Hannachi and Turner
2013a). A close examination of the PDF of the two time series within the probability
space (Fig. 9.6) shows that there are indeed three robust modes. These modes are
also identified using a Gaussian mixture model as shown in Fig. 9.7. They show,
in addition to the active and break phases, a third mode, i.e. the west-north Pacific
(WNP) active phase. The precipitation map associated with those phases is shown
in Fig. 9.8. In the active phase most precipitation occurs over western India along
the Western Ghats in addition to the northern part of India and south Pakistan. The
WNP active phase is associated with precipitation over south east and north east
China with dry conditions over east China. This latter region receives precipitation
during the active and break phases.
9.5 Further Extensions 215

Fig. 9.6 Kernel PDF of the


leading two Isomap time
series of ASM SLP anomalies
within the probability space.
Adapted from Hannachi and
Turner (2013b)

Fig. 9.7 Three-component Gaussian mixture model of the leading two Isomap time series of the
ASM SLP anomalies showing three regimes of ASM phases (a), namely the WNP active phase
(b), the active phase (c) and the break phase (d) of SLP anomalies and 850-hPa wind (maximum
wind arrow: 2.5 m/s). Adapted from Hannachi and Turner (2013b)

9.5.4 Scaling and the Matrix Nearness Problem

We have seen in Sect. 3.3 that when the dissimilarity matrix  is non-Euclidean, one
solution is to restrict oneself to the leading positive eigenvalues and corresponding
eigenvectors for principal coordinates. An alternative and elegant way to solve the
problem has been proposed by Mathar (1985) and consists in finding the nearest
Euclidean distance matrix. Note that an Euclidean distance matrix D = (dij ) in
216 9 Principal Coordinates or Multidimensional Scaling

Fig. 9.8 Composites of APHRODITE precipitation and 850-hPa wind anomalies based on the 300
closest data points to the centres of the ASM monsoon phases of Fig. 9.7. (a) WNP active phase.
(b) Active phase. (c) Break phase. Adapted from Hannachi and Turner (2013b)

k-dimensions is defined by dij = 12 xi − xj 2 , for a set of points x1 , x2 , . . . , xn .


The problem is then formulated simply as follows. Given a distance matrix  =
(δij ), the objective is to find an Euclidean distance matrix D = (dij ) such that
D −  is minimum, where . is any matrix norm. The solution to this problem is
based on the characterisation
 of Euclidean
  distance matrices
 expressed by Eq. (9.6),
namely that A = − 12 In − n1 11T  In − n1 11T is positive semi-definite, and
rank(A) ≤ k. Let z be any n-dimensional real vector satisfying zT z = 1, and
denote by Pz the orthogonal projection onto the orthogonal complement z⊥ of z, i.e.
Pz = In − 1zT . Let us also denote by . p , p ≥ 1, the Minkowski matrix norm, i.e.
 1
A p= |μk |p p , where the μk ’s are the eigenvalues of A. Mathar (1985) finds
the nearest Euclidean distance matrix D∗ to , which is given by

D∗ = argmin  − D (z)
p , (9.16)
D
(z)
where A p = Pz APTz p. One first decomposes Pz APTz using, e.g. a SVD
procedure as:

Pz APTz = UUT (9.17)

with  = Diag (λ1 , . . . , λn ) and λ1 ≥ λ2 ≥ . . . ≥ λn . Note that  is not


necessarily positive semi-definite, hence some of the eigenvalues may be negative.
From Eq. (9.17) one selects the first k eigenvalues and define λ+ +
1 , . . . , λk , where
+
λj = max λj , 0 , for j = 1, . . . k. The next step consists in constructing the
“positive part” of (17), i.e. U+ + + +
k U , where k = Diag λ1 , . . . , λk , 0, . . . , 0 .
T

Finally, letting b = 12 Diag U+ T


k U , the nearest matrix is then given by

− D∗ = U+
k U − b1 − 1b
T T T
(9.18)

which solves the problem of scaling.


9.5 Further Extensions 217

When we are provided with a similarity matrix C, refer to Eq. (9.10) in Sect. 3.3,
which is positive semi-definite (PSD) the solution is already given by Eq. (9.10).
However, if C is not PSD, then again one has to find the nearest PSD matrix to
C. In the Minkowski norm, the nearest PSD matrix to C = UUT is given by
C+ = U+ UT where + contains only the positive eigenvalues defined as above.
This means that one keeps only the positive eigenvalues and associated eigenvectors.
Remark Other solutions
 exist for other norms. The solution for the Fröbenius
2   12
norm, A F = i,j aij , and 2-norm; A 2 = ρ A A
2 T , where ρ(A) =
max{|λ|, |A − λIn | = 0} is the spectral radius of A, has been given for example in
Halmos (1972) and Higham (1988). Let us first consider the Fröbenius norm, and
we give the solution for a general matrix C not necessarily symmetric. Denote by
CS = 12 C + CT , and CK = 12 C − CT the symmetric and skew-symmetric
parts of C respectively. Then one can always decompose CS as

CS = UH,

where UT U = In , and H is a unique symmetric PSD. The above decomposition


is known as polar decomposition of C. Note that the polar decomposition can be
extended and applied to any complex matrix. The main result is then (Higham 1988):
Theorem The matrix
1
CF = (CS + H)
2
is the unique symmetric positive approximant of C in the Fröbenius norm.
1
For the 2-norm case, if one lets for any real α, Gα = CS + α 2 In + C2K 2 ,
and also let δ2 (C) = minα ≥ 0; α 2 In + C2K ≥ 0, and Gα ≥ 0 then we have
Theorem (Halmos 1972) The matrix C2 = G (δ2 (C) is a 2-norm positive
approximant of C.
Chapter 10
Factor Analysis

Abstract This chapter describes a model-based method that attempts to explain


the variability and correlation structure in the data, and reducing its dimension,
using hidden factors. Various algorithms to identify the factor model are discussed,
along with various factor rotations in addition to the link to conventional EOFs. An
application to climate data is also provided.

Keywords Factor model · Hidden variables · Communality · Uniqueness ·


Maximum likelihood · Expectation-maximisation · Factor rotation · Sea level
pressure · NAO · Scandinavian pattern

10.1 Introduction

EOF analysis or PCA and related methods are examples of exploratory techniques
aiming, among other things, at reducing the dimension of the system by finding
a small number of patterns associated with maximum variance from the high
dimensional data. In these methods no explicit probability model is involved. Factor
analysis is a multivariate method that aims at finding patterns or factors from the
data based on a proper statistical model. Whereas EOFs, for example, are concerned
with explaining the observed variances, factor analysis (FA) attempts to explain the
observed covariances between the variables through a set of variables or factors.
By assuming that each observed variable is a linear combination of these factors,
together with a random error, the relationship between the factors and the original
variables yields a possible explanation of the observed association.
An example of methods that is based on a model was given by POP analysis
in Chap. 6, using the multivariate AR(1) model. This model focuses, however,
more on the temporal correlations rather than the covariability between variables.
Furthermore, the model was not explicitly formulated by specifying, for example,
the probability distribution of the noise. FA was developed in the turn of the century
in psychology and social science. It has been extended to other fields of science
recently. In meteorology, for example, and to the best of our knowledge, the method
was not applied extensively.

© Springer Nature Switzerland AG 2021 219


A. Hannachi, Patterns Identification and Data Mining in Weather and Climate,
Springer Atmospheric Sciences, https://doi.org/10.1007/978-3-030-67073-3_10
220 10 Factor Analysis

FA bears some similarities with PCA regarding, for example, dimension reduc-
tion and determination of prominent modes of variability or covariability. FA was
proposed around the same time as Pearson’s work on PCA (Pearson 1902) by
Spearman (1904a,b). The method was formally and mathematically formulated later
by Hotelling (1935, 1936a,b). PCA/EOF method was always regarded by most
data analysts and even some statisticians as an exploratory technique. The method,
however, can also be regarded as model-based, and this brings it closer to FA. It
turns out, indeed that EOF/PCA method can be regarded as a special case of FA.

10.2 The Factor Model

10.2.1 Background

We suppose that we have a multivariate times series xt , t = 1, 2, . . . , n, where each


T
xt is composed of p variables, i.e. xt = xt1 , . . . , xtp . We suppose that the time
series is zero mean and has  = E xt xTt as covariance matrix, which is supposed
to be full rank.1

10.2.2 Model Definition and Terminology

To explain the covariance observed in the multivariate time series one assumes
that there are m hidden variables or factors yt = (yt1 , yt2 , . . . , ytm )T whose linear
combination can explain the observed variability in the time series xt . This can be
written as

xt = yt + ε t + μ, (10.1)

where  = (λij ) is a p × m constant matrix, and εt a p-dimensional random


noise. Here, we will take μ to be zero except in the estimation section. The matrix
coefficients λij of  are known as factor loadings. The latent, or hidden variables
yt1 , . . . ytm are referred to as the (common) factors (or factor scores matrix).
Componentwise, model (10.1) can be written as


m
xti = λij ytj + εti (10.2)
j =1

1 If
this is not the case it is always possible to find q new variables, q < p, such that the new
covariance matrix is full rank.
10.2 The Factor Model 221

so that λij represents the loading of the ith variable on the j th factor. Note that
in model (10.1) or (10.2) only the left hand side is known, whereas the right hand
side is entirely unknown. Model (10.1) is known as the linear factor model and is
described in many text books, e.g. Everitt (1984), Bartholomew (1987), Mardia et
al. (1979), and Lawley and Maxwell (1971).
Remarks
• Although model (10.1) looks like a regression model it is not so because, as we
stated above, the whole rhs is unknown. In fact, we need to estimate the factor
loadings, the common factors, the error terms and even the number of factors m.
In addition, the factors in (10.1) are also random variables unlike the regression
model in which the independent variables are non-random.
• If U = (u1 , . . . , ur ) are the leading r EOFs, (r < p), of the time series xt ,
t = 1, 2, . . ., and ct = (ct1 , . . . ctr ) are the corresponding r PCs, then we can
write

xt = Uct + rt , (10.3)

where rt is the remaining part, i.e. the p-r EOFs/PCs. Now there is a clear
similarity between (10.1) and (10.3). The main difference is that U is orthogonal
and the PCs uncorrelated, and this is not necessarily the case for model (10.1).

10.2.3 Model Identification

To identify model (10.1) one normally requires the following assumptions. Since xt
is assumed to be zero mean it is reasonable to assume E (yt ) = E (ε t ) = 0. One
also assumes that the factors and the noise are independent, implying
 
cov (yt , ε t ) = E yt ε Tt = O. (10.4)

These two assumptions are basic requirements but they do not permit a complete
model identification. What we need is an assumption on the probability distributions
of the unknown terms. The most “affordable” and common assumption is to
consider a multivariate Gaussian noise with diagonal covariance matrix, i.e.
 
E ε t ε Tt =  = Diag ψ1 , . . . , ψp . (10.5)

In addition, we also assume that the factors are standard multivariate normal:
 
E yt yTt = Ip . (10.6)
222 10 Factor Analysis

These assumptions automatically lead to the conclusion that xt is also multivariate


normal, xt ∼ N (0, ), with covariance matrix given by

 = T + . (10.7)

Exercise
Derive Eq. (10.7).
Equations (10.2) or (10.7) yield


m
var (xti ) − ψi = σii − ψi = λ2ik (10.8)
k=1

which represents the part of the variance explained by the common factors known
as the communality. The diagonal elements of  are referred to as the uniqueness.
The factor loadings matrix  is therefore the covariance between xt and yt :

 = cov (xt , yt ) . (10.9)

Now using Eqs. (10.6), (10.7) and (10.9), the zero-mean multivariate normal vector
T
zt = xTt , yTt has covariance:

   T  
E zt zTt = . (10.10)
 T Ip

Case of Autocorrelated Factors

If the common factors are autocorrelated with E yt yTt = R, i.e. yt ∼ N (0, R),
then (10.7) becomes

 =  + T R. (10.11)

Remark In stochastic modelling with continuous AR(1) model:

dx
= Bx + ε (10.12)
dt
one gets the famous fluctuation-dissipation relation (e.g. Penland 1989; Penland and
Sardeshmukh 1995):

C(0) = G(τ ) + G(τ )C(0) [G(τ )]T , (10.13)

where C(0) = E xt xTt , G(τ ) = eBτ = E xt+τ xTt [C(0)]−1 . The discretised
version of Eq. (10.12), i.e. xt+τ = G(τ )xt + ζ t,τ , is similar to Eq. (10.1) when
10.2 The Factor Model 223

yt is replaced by xt . Equation (10.11) then becomes E ζ t,τ ζ Tt,τ = C(0) −


G(τ )C(0) [G(τ )]T ,

10.2.4 Non-unicity of Loadings

One of the major drawbacks of the factor model is the non-unicity of the solution.
In fact, if H is any m × m orthogonal matrix, i.e. HHT = HT H = I, then the factor
model (10.1) can also be written as

xt = HHT yt + ε t =  yt + ε t , (10.14)

where yt = HT yt and  = H. The covariance matrix  of xt is invariant:

 = T +  =  T + . (10.15)

We have therefore constructed from the original model another factor model with
new common factors yt and new loadings  . Since an orthogonal matrix is a
rotation matrix, factor models are indeterminate with respect to rotations. One
possible solution to this drawback is presented later in Sect. 10.3.2.
Remark In data mining the factor model is regarded as a map f () from the latent
space, i.e. the space of the y’s, onto the data space (see, e.g. Carreira-Perpiñán 2001).
The map is simply given by

f (y) = y + μ (10.16)

with a normal probability distribution of p(y), i.e. y ∼ N(0, I). The probability
p(y) is known as the prior. The mapping f () describes an ideal measurement
process, mapping each point from the latent space onto exactly one point in a
smaller dimension manifold on the observed space. For the classical factor model
this map is simply linear. Other possibilities will be discussed in the next chapters.
The noise from the data space is given by the conditional distribution of (x|y),
which is N (f (y), ), and the marginal distribution in data space is that of x, i.e.
T
N μ, T +  . Now given the normality of the joint distribution z = xT , yT ,
with covariance matrix given by Eq. (10.10), the conditional distribution of (y|x),
known as the posterior in latent space is also normal (see Appendix B) with mean:
 −1
E (y|x) = A (x − μ) = T T +  (x − μ) (10.17)

and covariance:
 −1
cov (y|x) = I + T  −1  . (10.18)
224 10 Factor Analysis

10.3 Parameter Estimation

10.3.1 Maximum Likelihood Estimates

The most common way to estimate the parameters ,  and μ of the factor
model is to use the maximum likelihood method (MLE). Given a sample x1 , . . . , xn
of independent and identically distributed (IID) random variables drawn from a
probability distribution p (x) with a set of unknown parameters θ , the likelihood
of these parameters given the sample is given by

!
n
L (θ ) = P r (x1 , . . . , xn |θ) = p (xk |θ) . (10.19)
k=1

Usually one takes the Logarithm of Eq. (10.19) because the Logarithm is mono-
tonically increasing, and simplifies the computation substantially. The function to
be maximised becomes the log-likelihood L (θ ) = log [L (θ )]. Since the factor
model in Eq. (10.1) is based on normality assumption, the log-likelihood of the data
x1 , . . . , xn given the parameters ,  and μ is easy to compute and takes the form:
n  
L (, , μ) = − p log(2π ) + log || + tr S −1 , (10.20)
2

where  = T +  and S is the sample covariance matrix of the data, i.e.

S = n−1 k (xk − μ) (xk − μ)T .
Exercise Derive the expression of the log-likelihood L , Eq. (10.20).
(n −p/2 ||− 12 exp −(x − μ)T  −1 (x − μ) ,
Hint Expand L = log k=1 (2π ) k k
n −1
 −1 n
then use the fact that k=1 (xk − μ)  (xk − μ) = tr 
T
k=1 (xk − μ)

(xk − μ)T = ntr  −1 S .
Hence the parameters ,  and μ are provided by the minimiser of L, i.e.
 
min [−L (, )] = min log |T + | + tr S(T + )−1 . (10.21)
, ,

An efficient maximum likelihood algorithm in factor analysis was developed by


Jöreskog (1969).
Remark By differentiating Eq. (10.20) with respectto μ, which is implicitly
included in the expression of S, we get x = n−1 nk=1 xk as the MLE of μ.
The optimisation problem given by Eq. (10.20) or (10.21) involves the inverse
−1
matrix T +  . Therefore in any algorithm this inverse has to be computed
efficiently. Fortunately this is readily provided using the following formula:
10.3 Parameter Estimation 225

 −1
(A + BCD)−1 = A−1 − A−1 B C−1 + DA−1 B DA−1 (10.22)

whenever the involved inverses exist (see e.g. Golub and van Loan 1996).
Exercise
Check that Eq. (10.22) holds.

Hint Proceed by direct substitution, i.e. (A + BCD) A−1 − A−1 B C−1 +
−1 −1
DA−1 B DA−1 = I − B C−1 +DA−1 B DA−1 + BCDA−1 −
−1
BCDA−1 B C−1 + DA−1 B DA−1 . But the last term in this sum can be
 −1  −1
written as BC C + DA−1 B − C−1 C−1 + DA−1 B DA−1 which yields
−1
BCDA−1 − B C−1 + DA−1 B DA−1 .
Hence the inverse of  becomes
 −1  −1
T +  =  −1 −  −1  Ip + T  −1  T  −1 (10.23)

which is easy to compute given that  is diagonal.

10.3.2 Expectation Maximisation Algorithm


Necessary Optimality Condition

The MLE of the factor model parameters leads to the following system of equations:

∇, L (, ) = 0. (10.24)

Equation (10.24) yields the following conditions (see Appendix D):

 −1 ( − S)  −1  = O
  (10.25)
diag  −1 ( − S)  −1 = O.

Equations (10.24) or (10.25) represent a system of nonlinear equations that can only
be solved numerically using a suitable descent algorithm. We know also that the
solution is invariant to any orthogonal rotation. Hence the log-likelihood has an
infinite set of equivalent maxima, but one cannot tell whether these maxima are
global or local. Because of this nonuniqueness an additional constraint was required
to obtain a unique solution. The most commonly used constraint is
 
diag T  −1  = T  −1 , (10.26)
226 10 Factor Analysis

that is the matrix T  −1  is diagonal. Note that even with this additional
constraint the equations remain difficult to solve analytically. The solution can
be made easier by maximising the log-likelihood in two stages, i.e. by fixing, for
example,  then maximising with respect to . This approach was successfully
developed by Jöreskog (1967) who used a second-order Fletcher–Powell method
for the maximisation. Precisely, we have the following theorem:
ˆ the MLE of  satisfying the extra constraint (10.26)
Theorem For fixed  = ,
is given by
1 1
ˆ =
 ˆ 2 U  − Ip 2 , (10.27)

1 1
ˆ − 2 S
where  is a diagonal matrix containing the eigenvalues of  ˆ − 2 and the
columns of U are the associated eigenvectors.
Proof Outline Since  is the only unknown we can transform the problem in which
the matrix  = T +  becomes the new unknown. The MLE estimate of 
(see Appendix D) is  ˆ = S. Now let us choose one solution among the many
solutions (obtained simply through orthogonal rotations). The MLE estimate ˆ is
T
simply obtained from S =  ˆˆ + .ˆ We now have the following identity, namely:

1 1 1 1
ˆ − 2 S
 ˆ − 2 − Ip = 
ˆ −2 
ˆ ˆ −2 .
ˆT

1 1
If UUT represents the SVD decomposition of  ˆ − 2 S
ˆ − 2 , then we obtain
1
ˆ −2 
ˆˆ T  − 12 = U  − Ip UT , which yields Eq. (10.27). Note that to guarantee
the property of semi-definite positivity the matrix  − Ip has to be diagonal with
elements given by max (φi − 1, 0), where the φk s are the eigenvalues of . Of
course, any R ˆ is also a solution for any orthogonal R, which can be chosen so
that Eq. (10.26) is satisfied.
An iterative procedure based on (10.27) can be used to get the estimates as
follows. From an estimate  ˆ k using (10.27). From
ˆ k of  at the kth step obtain 
this estimate maximise the log-likelihood numerically, subject to (10.26), to get a
new estimate ˆ k+1 , and so on until convergence is achieved.

Expectation Maximisation Algorithm

An alternative and elegant iterative approach to maximum likelihood has been


developed in the late 1970s and is based on the expectation maximisation (EM)
algorithm, see Dempster et al. (1977), and McLachlan and Krishnan (1997), see
also Moon (1996) for an illustration. The EM algorithm has been applied in factor
analysis by Rubin and Thayer (1982, 1983) and by Bentler and Tanaka (1983).
10.3 Parameter Estimation 227

In general the algorithm is simple to implement, and also has the advantage
of increasing the likelihood monotonically, and converges in general to a local
maximum2 and does not require the second derivatives to be calculated (Rubin and
Thayer 1982). The EM algorithm can be used therefore to explore the local maxima
in the state space of the parameters. The algorithm can be slow, however, particularly
after the first few iterations, which are usually quite effective.
The EM algorithm consists of two steps, namely, the expectation (E) step and
the maximisation (M) step. The E-step consists of computing the expectation of the
complete-data log-likelihood
   
Q y|x, θ (k) = E L y|x, θ (k)
 
with respect to the current posterior distribution p y|x, θ (k) , where θ (k) refers to
the current parameter estimates. The M-step consists of determining new parameters
θ (k+1) that maximise the expected complete-data log-likelihood:
 
θ (k+1) = argmin −Q y|x, θ (k) . (10.28)
θ
For the factor model setting the E-step results in computing the first two moments
E (y|xk ) and E yyT |xk for each data point xk given the parameters  and 
as given in Eqs. (10.17) and (10.18), which can also be written using the efficient
−1
inversion formula of T +  .
Remark The second moment E yyT |x provides a measure of uncertainty in the
factors.
The expected log-likelihood of the factor model with respect to the posterior distri-
bution is obtained using
 the sample-space
 noise model, i.e. (x|y) ∼N (y + μ, ),
( p 1
and takes the form: Q y|x, θ (k)
= E log i (2π )− 2 ||− 2 exp − 12 (xi − y)T

 −1 (xi − y) , which, after some little algebra, leads to
  n   n   
Q y|x, θ (k) = − log (2π )p || − tr  −1 S − xTi  −1 E (y|xi )
2 2
i
1  
+ tr T  −1 E yyT |xi . (10.29)
2
The M-step is then to maximise Eq. (10.29) with  to  andT . The derivative
 respect
−1 x E

of Eq. (10.29) with respect to , i.e. ∂Q
∂ is − i  i (y|xi ) + E yy |xi .
T

Equating this to zero yields

2 Which quite often depends on the starting point.


228 10 Factor Analysis

  T 
 E yyT |xi = xi E (y|xi )T . (10.30)
i i

Similarly, the derivative of Q in Eq. (10.29) with respect to  −1 yields

n  1 1  
− xi xTi − E (y|xi ) xTi + E yyT |xi T = 0. (10.31)
2 2 2
i

Using Eq. (10.30), Eq. (10.31) yields

n 1
= xi xTi − E (y|xi ) xTi . (10.32)
2 2
i

The iterative EM procedure, taking into consideration that  diagonal, yields

n  n T −1
(k+1) = i=1 xi E (y|xi )
T
i=1 E yyT |xi ,
  n  (10.33)
 (k+1) = 1
n Diag i=1 xi xi
T − (k+1)
(E (y|xi )) xTi .

Equation (10.33) can be used iteratively to update the parameters of the factor
model. The conditional expectations in Eq. (10.33) are given by

−1
E (y|xi ) = A(k) (xi − μ) = (k)T (k) (k)T +  (k) (xi − μ)
E yyT |xi = Ip − A(k) (k) + A(k) (xi − μ) (xi − μ)T A(k)T .
n
Remember that the MLE of μ is the sample mean x = n−1 i=1 xi .

Model Assessment

Once the model parameters are estimated the model fit is normally assessed using
the deviance between the fitted model and the full model in which  has no
particular structure. The likelihood ratio statistic λ is given by
 −1 
− 2 log λ = n tr  ˆ −1 S| − p .
ˆ SD−1 − log | (10.34)

It can be shown, see, e.g. Mardia et al. (1979), that the asymptotic distribution of
this statistic, under the null hypothesis Hm that there are m factors, is

−2 log λ ∼ χ 21 .
2 [(p−m)2 −p−m]
10.4 Factor Rotation 229

Hence fixing a significance level α, this χ 2 distribution permits either rejection or


non-rejection of the null hypothesis Hm depending on whether the value of −2 log λ
falls inside or outside the critical region.
Remark: Factor Scores Estimation The factor scores can be estimated using the
conditional expectation (10.17) by
 −1  −1 T −1
ˆT 
ŷk =  ˆˆT + ˆ (xk − x) = Ip +  ˆTˆ −1 
ˆ ˆ 
 ˆ (xk − x) .
(10.35)
This estimate can also be obtained as follows. If for example we assume that μ is
known (= 0 for simplicity), and that  and  are known, then for a factor score y the
vector x = y + ε is multinormal with mean y and covariance matrix . The log-
likelihood of x is therefore l (x, y) = − 12 log |2π | − 12 (x − y)T  −1 (x − y),
and the MLE of y is simply
 −1
ŷ = T  −1  T x. (10.36)

The estimate given by Eq. (10.36) is known as Bartlett’s factor score. The estimate
−1 T −1
provided by ŷ = Ip + T  −1    x is known as Thompson’s factor
score. Refer, e.g., to Mardia et al. (1979) for more discussion on other estimates.
Exercise
Derive (10.35).
Hint Using (10.23) we get

−1  
T T +  = I − T  −1 (I + T  −1 )−1 T  −1
 
= I − (T  −1  + I − I)(I + T  −1 )−1 T  −1 .

10.4 Factor Rotation

10.4.1 Oblique and Orthogonal Rotations

A useful tool often applied in EOF and factor analysis to facilitate interpretation is
rotation. We know, for example, from Sect. 10.2.3 that if Y is the matrix of factor
scores and H is an orthogonal matrix (compatible with Y), then HT Y is also a
solution to the factor model and similarly for H where  is the matrix of factor
loadings. In other words the rotated factors are also solution. Rotation of EOFs
has already been presented in Chap. 4 using a few examples of rotation, and we
give here a number of other rotation criteria. The reason for doing this here and
not in Chap. 4 is because rotation has historically been developed in connection
with factor analysis. Rotation of factor loadings goes back to the early 1940s with
230 10 Factor Analysis

Thurstone (1940, 1947), and also Carroll (1953) where the objective was to get
simple structures. This is similar to EOF rotation in atmospheric science, which
was introduced to alleviate some of the constraints imposed upon EOFs, and also to
ease physical interpretation.
There are two main rotation criteria: orthogonal and oblique. The general
orthogonal rotation problem is to find an orthogonal matrix H solution to the
following optimisation problem:

minH f (H) s.t. HT H = I, (10.37)

where  is the factor loading matrix. The function f () is what defines the rotation,
and H is the new (rotated) loading matrix. Note that in a number of cases the
matrix H need not be square, but is known as semi-orthogonal. In oblique rotation
the problem is to find the matrix H satisfying

minH f H−T s.t. Diag HT H = I. (10.38)

The condition imposed in (10.38) simply means that the columns of H are unit-
length. Note that what distinguishes (10.37) from (10.38) is the formulation, but
the rotation criteria given by f () can be the same. In general, however, criteria
appropriate for oblique rotation are also appropriate for orthogonal rotation but the
converse is not true. The most well known and popular example in atmospheric
science is the varimax criterion (Kaiser 1958) applied in meteorology by Richman
(1986), see Chap. 4. Below we give other examples of rotation criteria along with
their gradient since a wide range of numerical algorithms for minimisation require
the gradient of the function, see also Appendix E.

10.4.2 Examples of Rotation Criteria


• Quartimax

The quartimax rotation criterion (Carroll 1953, Ferguson 1954) used in orthogonal
rotation is based on the Fröbenius norm. The Fröbenius product of two m × m
matrices X and Y is defined by [X, Y]F = tr XT Y . The Fröbenius norm is then
-
defined by X F = tr XT X . Let now the element-wise product between X =
xij and Y = yij be given by X  Y = xij yij , then the quartimax criterion is
defined by minimising the function:

1  4
m
1
f (X) = − XX F =− xij . (10.39)
4 4
i,j =1
10.4 Factor Rotation 231


The gradient of f (X) can be obtained by writing df (X) = − ij xij3 dxij which

can also be written as − ij [X  X  X]Tji dxij , hence we get

df (X)
= −X  X  X. (10.40)
dX

• Quartimin

Quartimin is another rotation criterion often used in oblique rotation, see e.g. Carroll
(1953) and Harman (1976, p. 305). The function to be minimised is given by

1  2 2 1
f (X) = xik xil = [X  X, (X  X) J]F , (10.41)
4 4
i k=l

where J = 1m 1Tm − Im is the zero-diagonal square matrix whose off-diagonal


elements are ones and 1m is the vector of length m containing only ones. To
compute the gradient one can either use the quartimax argument above, or simply
write 2df (X) = [X  dX, (X  X) J]F + [X  X, (X  dXJ)]F , which can be
simplified to [(X  dXJ) , X  dX]F = [X  ((X  dX) J) , dX]F , hence

df (X)
= X  [(X  dX) J] . (10.42)
dX

• Oblimin

It is based on minimising the function (Harman 1976):

1
f (X) = [X  X, (Im − γ K) (X  X) J]F . (10.43)
4
The gradient of the oblimin is given by

df (X)
= X  [(Im − γ K) (X  dX) J] , (10.44)
dX

where K = (kij ) is a p × p matrix, with kij = p1 , and γ is some parameter, see


Harman (1976, p. 322) for recommendation on the choice of this parameter. Note
that for γ = 0 one obtains the quartimin. The case γ = 12 yields the so-called
bi-quartimin criterion.
232 10 Factor Analysis

• Oblimax

This criterion aims at minimising the function:


 
XX 2
f (X) = − log 4
F
(10.45)
X F

whose gradient is given by

df (X) (X  X  X) X
= −4 +4 . (10.46)
dX XX F 2 X 2F

• Entropy

It is based on minimising the entropy (Jennrich 2004):

1 
f (X) = − X  X, log (X  X) F (10.47)
2
whose gradient is given by

df (X)
= −X  log (X  X) − X. (10.48)
dX
Other rotation criteria can be found in Harman (1976) and Jennrich (2001, 2002).

10.5 Exploratory FA and Application to SLP Anomalies

10.5.1 Factor Analysis as a Matrix Decomposition Problem

Factor analysis can also be formulated, in a similar way to PC analysis, using matrix
decomposition. This was initially proposed by Henk A. L. Kiers as described in
Sočan (2003, pp. 19–20), and mentioned in Adachi and Trendafilov (2019). It was
applied e.g. by Unkel et al. (2010), see Adachi and Trendafilov (2019) for further
details and references. The model given by Eq. (10.11), with m factors, can be
written using the n × p data matrix X as

X = YT + U + EF A , (10.49)

where Y is the (n × m) matrix of common factors,  is the (p × m) factor loading


matrix, U is the n × p matrix of unique factor scores,  is the p × p diagonal matrix
10.5 Exploratory FA and Application to SLP Anomalies 233

containing the unique variances and EF A is the unsystematic error matrix. We also
have the following properties: UT U = Ip×p , YT Y = Im and ET Y = Op×m .
Unkel et al. (2010) solved for Y,  and  by minimising the costfunction

X − YT − U 2
F = X − BA 2
F = X + tr(BT BAT A) − 2tr(BT XA),
2
F
(10.50)
where B = [Y : U] and A = [ : ] are block matrices, . F is the Fröbenius
matrix norm and tr(.) is the trace operator. Unkel et al. (2010) then minimised
Eq. (10.50) iteratively. Note that here the first four matrices appearing in the rhs
of Eq. (10.49) are treated as fixed unknown matrices whereas in the standard
formulation of the factor model, e.g. Eq. (10.1), the elements of Y and U are treated
as random quantities.
Unkel et al. (2010) applied exploratory factor analysis (EFA) to gridded SLP
anomalies from NCEP/NCAR reanalysis. The data consist of monthly winter (Dec–
Jan–Feb) anomalies of SLP field for the period Dec 1948 to Feb 2006, i.e. a sample
size of n = 174, with a 2.5◦ × 2.5◦ lat×lon grid, north of 20◦ N, yielding p = 4176
grid points or variables.
Figure 10.1 shows an example of the costfunction (10.50) versus the iteration
number for m = 5 factors. Practically, the costfunction reaches its minimum after
10 to 20 iterations. Figure 10.2 shows the maps of four factor loading patterns. An
arctic-like oscillation and the North Pacific oscillation patterns can be recognised.
For comparison, Fig. 10.3 shows the leading two EOFs. These two EOFs have quite
similar structures to the two factor loadings (i) and (ii) shown in Fig. 10.2.

Fig. 10.1 Costfunction (10.58) versus iteration number. Adapted from Unkel et al. (2010)
234 10 Factor Analysis

Fig. 10.2 Spatial patterns of 4 factor loadings of winter NCEP/NCAR SLP anomalies. Adapted
from Unkel et al. (2010)

10.5.2 A Factor Rotation

Unkel et al. (2010) applied a special rotation, as in Hannachi et al. (2009), that yield
approximate statistical independent factor scores. This subject is discussed in more
detail in Sect. 12.6 of Chap. 12. The rotated factor scores are obtained from the
rotation:

R = FT, (10.51)

where T is an orthogonal rotation matrix. As in Hannachi et al. (2009), the objective


function requires that the covariance matrix of the squared components be diagonal.
The function to be minimised is discussed in detail in Sect. 12.6. Unkel et al. (2010)
applied this rotation to the obtained 5 factors of the winter SLP anomalies. Four of
the rotated factors are shown in Fig. 10.4. Familiar patterns emerge like the North
Atlantic Oscillation, the North Pacific Oscillation and the Scandinavian patterns.
10.6 Basic Difference Between EOF and Factor Analyses 235

Fig. 10.3 Leading EOFs of monthly winter NCEP/NCAR SLP anomalies. Adapted from Han-
nachi et al. (2007)

10.6 Basic Difference Between EOF and Factor Analyses

The difference between EOF (or PC) and FA analyses is one of the confusing issues
in data analysis. The two methods are quite similar in many ways. Both methods
are data reduction procedure, which allow to capture the data variance in a smaller
set of variables. The application of both methods yields in general quite similar
results. Yet there is a basic difference between the two methods. This difference is
236 10 Factor Analysis

Fig. 10.4 Four rotated patterns of the anomalous SLP factor loadings. Adapted from Unkel et al.
(2010)

discussed below based on the conventional FA model, e.g. Eq. (10.1), and the EFA
model given by Eq. (10.49).

10.6.1 Comparison Based on the Standard Factor Model

In Chap. 3 we have discussed the fact that EOF analysis constitutes an exploratory
tool that produces patterns and corresponding time series successively explaining
maximum variance in the data. EOFs are also computationally easy to obtain since it
is a simple eigenvalue problem. The standard factor model, on the other hand, has a
number of parameters that have to be estimated as discussed in the previous sections.
The estimation procedure is not trivial since it is based on maximum likelihood
where a descent numerical algorithm is required for the optimisation.
10.6 Basic Difference Between EOF and Factor Analyses 237

The main point here is that EOF/PC analysis, unlike factor analysis, is not model-
based, see e.g. Mardia et al. (1979, p. 275), Chatfield and Collins (1980, p. 87),
Krzanowski (2000, p. 502) and Hinton et al. (1997) to name just a few. Early
literature on factor analysis has noticed, however, that PCA can also be seen as
a particular case of FA. This seems to have gone unnoticed until very recently,
see, e.g., Roweis (1998), Tipping and Bishop (1999), Carreira-Perpiñán (2001) and
Jolliffe (2002). In fact, EOFs can be obtained as the MLE of a factor model with
isotropic uniqueness matrix, i.e.

 = σ 2 I. (10.52)
∂L
Consider the likelihood equation ∂ = 0 provided by the first equation of
system (10.25), namely,

S −1  = . (10.53)

Now we write  = σ 2 I + T as σ 2 I + A2 AT , that is  = A σ 2 Ip + 2 AT ,


where A is orthogonal, and  is diagonal. Keeping in mind the invariance of the
solution when multiplied by any orthogonal matrix, Eq. (10.53) yields
 
S =  σ 2 I + 2 . (10.54)

Exercise
Derive Eq. (10.54) and point out the expression of  in this equation in relation to
the invariance principle mentioned above.
Hint Multiply both sides of Eq. (10.53) by T then add σ 2 S −1 . This yields
S = A σ 2 I + 2 AT . Lastly, multiplying both sides by A yields the answer
where A is the new loading matrix.
Equation (10.54) is precisely an eigenvalue problem, and the factor loadings, i.e.
the column vectors of  are the eigenvectors of the sample covariance matrix S, i.e.
the EOFs, whereas the factor scores, see Eq. (10.35), are the corresponding PCs. In
fact, if S = UDUT where U = u1 , . . . , up are the eigenvectors of the sample
covariance matrix, and D = α1 , . . . , αp is the diagonal matrix of the associated
eigenvalues arranged in decreasing order, then from Eq. (10.54) we can choose the
factor loadings
 1
2
 = Um Dm − σ 2 Im , (10.55)

where Dm = Diag (α1 , . . . , αm ), and Um = (u1 , . . . , um ). To find the MLE of σ 2 ,


we use the likelihood in Eq. (10.20) where now we take
  
Dm Om,p−m
 = Um Dm − σ Im Um + σ Ip = U
2 T 2
UT (10.56)
Op−m,m σ 2 Ip−m
238 10 Factor Analysis

or similarly:

−1 Im Om,p−m
S =U −2 UT , (10.57)
Op−m,m σ Dp−m

where Dp−m = Diag αm+1 , . . . , αp .


Exercise
Derive Eq. (10.56).
Hint Start fromthe last (rhs) expression of Eq. (10.56) and use the decomposition
U = Um Up−m to get Um Dm UTm + σ 2 Up−m UTp−m . Next use the orthogonality of
U, i.e. Um UTm + Up−m UTp−m = Ip to get Eq. (10.56).
Exercise
Derive Eq. (10.57).
Hint Replace Dm and σ 2 in Eq. (10.56) respectively by D−1
m and σ
−2 then use the

SVD decomposition S = UDU in S .


T −1

The log-likelihood then becomes


 
n m p
−2
L=− p log 2π + log αk + 2(p − m) log σ + m + σ αk
2
k=1 k=m+1
(10.58)
and the MLE of σ 2 yields then

1 
p
σ̂ 2 = αk . (10.59)
p−m
k=m+1

In practice and as it was pointed out in Chap. 3, EOFs are efficiently obtained using
SVD and no MLE is required. This makes the problem of EOFs easier to solve than
the FA model. Among other things, the maximum likelihood estimates of the factor
model overcomes the scaling problem encountered in EOFs. In EOF/PC analysis
the number of retained components can be fixed by choosing e.g. the percentage of
explained variance. The choice of this percentage, although arbitrary, does not alter
the PCs because they are unique. In factor model, on the other hand, the number of
factors m is difficult to estimate. One way to estimate m is to use the EM algorithm
with a cross-validation procedure rather like fitting a mixture model (Hannachi
and O’Neill 2001). This is achieved by sequentially increasing m and taking the
best model with the largest likelihood. Also, and unlike EOF/PC analysis, when m
changes the form of the factors may also change! (Chatfield and Collins 1980).
10.6 Basic Difference Between EOF and Factor Analyses 239

10.6.2 Comparison Based on the Exploratory Factor Analysis


Model

Adachi and Trendafilov (2019) used the matrix formulation of exploratory factor
analysis, as given by Eq. (10.49), and provided a number of inequalities, which are
used to contrast the parameters’ estimates in PCA and FA. Given a n × p data
matrix X = [x1 , . . . , xp ], a PCA formulation is given for example in Eq. (3.28),
i.e. X = ZAT + E, with the n × m and p × m matrices Z and A representing
respectively the PC scores and loadings. This formulation of PCA can be used as a
basis for a comparison with the exploratory factor analysis model. This is illustrated
in Fig. 10.5, with p = 3 and m = 2, as described also in Adachi and Trendafilov
(2019).
For the PC problem (Fig. 10.5a) the variables x1 , x2 and x3 are commonly
explained by the PC scores z1 and z2 with, in general, unequal weights given by

Fig. 10.5 Illustration of a PCA model as given by Eq. (3.28) (a) and the exploratory factor model
given by Eq. (10.49) (b) with p = 3 and m = 2, along with the relative size of the different
components for the PCA and the exploratory FA models
240 10 Factor Analysis

the loadings aij plus an error term E. In a similar way, Eq. (10.49) is illustrated in
Fig. 10.5b and shows the common part, namely YT , just like ZAT in Eq. (3.28).
Unlike PCA, however, there is the unique part, namely U, which is included in
the exploratory FA model Eq. (10.49) where each unique factor uk is weighted by
ψk , and affects only the variable xk . Adachi and Trendafilov (2019) provide a kind
of quantification of the different contributions arising in Eqs. (3.28) and (10.49) to
the total variance of the data matrix as illustrated in Fig. 10.5c. For a given number
m of principal components and factors, the authors conclude that: (1) the common
part of PCA is larger than that of FA, and (2) the residual part for FA is smaller than
that for PCA. In addition, they suggested that the unique part for FA is often larger
than the residual part for PCA.
Chapter 11
Projection Pursuit

Abstract This chapter describes a method that attempts to identify, sequentially,


‘interesting’ patterns by seeking directions in the data state space that optimises
a specific projection index. A number of projection indexes, including indexes
measuring non-normality such as skewness, are discussed and application to climate
data presented.

Keywords Unsupervised technique · Interestingness · Projection index ·


Projection pursuit · Entropy · Information index · Clustering index · Density
estimation · Skewness modes · PNA pattern · Polar vortex

11.1 Introduction

Revealing interesting structures from high-dimensional hyperspaces is an extremely


challenging task. This is because human mind can only perceive three-dimensional
objects and as such mapping high-dimensional data onto low-dimensional spaces for
visual inspection becomes highly desirable. The complexity of the physical world,
including our climate system, means that one is dealing with very high-dimensional
physical phenomena where dimensionality reduction becomes an invaluable tool
of understanding and analysis where the human gift for pattern recognition can
be applied efficiently. EOF/PC analysis constitutes an example of multivariate
exploratory data analysis for dimensionality reduction. Given a data matrix X and
associated covariance matrix S we recall that the EOF method is a linear projection
in which the objective is to seek directions a such that the (variance) function:

I (a) = var (Xa) = aT Sa,

is maximised. In a similar manner one can find other directions that optimise various
other criteria. For example, in optimally persistent patterns (OPP), see Chap. 8, the
function I (a) is the decorrelation time of the projected time series Xa. Similarly,
for the optimally interpolated pattern (OIP) the function I (a) represents the mean
square interpolation error. The procedure of finding these directions constitute what

© Springer Nature Switzerland AG 2021 241


A. Hannachi, Patterns Identification and Data Mining in Weather and Climate,
Springer Atmospheric Sciences, https://doi.org/10.1007/978-3-030-67073-3_11
242 11 Projection Pursuit

is known as projection pursuit in multivariate exploratory data analysis. Projection


pursuit is a powerful tool to explore high-dimensional data, and is discussed below.

11.2 Definition and Purpose of Projection Pursuit

11.2.1 What Is Projection Pursuit?

Projection pursuit (PP) is an unsupervised technique that attempts to find “interest-


ing” low-dimensional structures or linear projections from a high-dimensional cloud
of data points. The concept of interestingness will be made clear later. This projec-
tion is achieved by maximising an objective function of direction a, I (a) referred to
as projection index. Friedman and Tukey1 (1974) introduced PP as an exploratory
analysis tool for multivariate data and pioneered its successful implementation. It
seems, however, that the method goes back to Kruskal (1969, 1972), Wright and
Switzer (1971), see Huber (1985) for further references. Historically, PCA and also
factor analysis have been used as conventional tools for finding meaningful low-
dimensional spaces. As such, most of the conventional tools represent special cases
of PP.
Unlike PCA, which maximises variance, projection pursuit attempts, in general,
to find interesting and informative projections which includes, but not limited
to, clustering, holes, departure from normality, nonlinear relationship between
variables, etc. The structure we are after is measured by a projection index I (a)
by solving the optimisation problem:2

max I (a) s.t. aT a = 1. (11.1)

The maximisation problem (1) is in general analytically intractable and is therefore


to be solved numerically.
A particular advantage of PP over other techniques based on interpoint distances
such as multidimensional scaling and other clustering techniques is its ability to
ignore non-informative variables (Huber 1985). Most projection indexes are chosen
to be differentiable to make use of the available advanced optimisation algorithms
such as gradient and reduced gradient, and ODE-based methods. Most PP methods
find one-dimensional projection at a time. But in some other cases, two-dimensional

1 Who coined it projection pursuit. In projection pursuit, the term “projection” refers to the fact
that the data X is first projected onto the direction a, i.e. aT X, whereas “pursuit” refers to the
optimisation used to find the correct direction.
2 In projection pursuit the term “projection” refers to the fact that the data X is first projected

onto the direction a, i.e. aT X, whereas “pursuit” refers to the optimisation used to find the correct
direction.
11.2 Definition and Purpose of Projection Pursuit 243

(Friedman and Tukey 1974), and three-dimensional (Nason 1995) projections,


where the projected indexes take the respective forms, I (a, b), and I (a, b, c), are
sought. The precise purpose and reason of PP are made clear next.

11.2.2 Why Projection Pursuit?

We have seen in Chap. 2 various properties of high-dimensional spaces such


as the empty space phenomenon. For instance, data distributed uniformly over
hypervolumes tend to be mostly distributed near their boundaries. Most importantly,
because diagonals in hyperspaces are nearly orthogonal, data distributed near these
diagonals project onto the origin in any pairwise scatter plots (Scott 1992). Seeking
interesting structures, e.g. clusters from high-dimensional data using conventional
linear projections, such as EOFs/PCA, will fail to reveal these structures. PP on
the other hand does not suffer from this problem since it uses low-dimensional
linear projections particularly when appropriate projection indexes are chosen.
As pointed out by Huber (1985) PP constitutes one of the very few multivariate
methods which are able to overcome the difficulties related to phenomenon of
hyperspaces’ emptiness. PP is therefore unavoidable if one is interested in finding
nontrivial structures from multivariate data by visually inspecting low-dimensional
projections.
An extremely important element of PP is the choice of interesting projection
indexes. Interestingness is a tricky concept since there is no universal definition for
it. It is, however, generally accepted that any structure departing from multivariate
normality is considered to be interesting. As such, clusters, holes, rings, etc. are all
interesting. The basic reasons for this are twofold:
(i) Among the continuous probability distributions with fixed mean and variance
the normal distribution maximises entropy, and therefore has the least infor-
mation in both the senses of Fisher information and negative Shannon entropy
(Cover and Thomas 1991).
(ii) Because of the empty space phenomena most low-dimensional projections of
high-dimensional clouds are approximately Gaussian, even if the underlying
density is for example multimodal (Diaconis and Freedman 1984).
PP does not exist without specifying the projection index. This latter characterises
precisely the projection. We reserved therefore a whole section that deals with
projection indexes. We start first by reviewing some concepts from information
theory dealing with entropy.
244 11 Projection Pursuit

11.3 Entropy and Structure of Random Variables

11.3.1 Shannon Entropy

It is well accepted that probability is a measure of uncertainty about the occurrence


of an event. A fair coin, when tossed, yields head or tail with the same probability
1
2 , so one has 50% chance that any experiment will yield say head. Let Z be a
random variable taking values z1 , . . . , zm with respective probabilities p1 , . . . , pm .
The entropy of Z is given by


m
H (Z) = − pk log pk (11.2)
k=1

and is also known as Shannon entropy (Shannon 1948). The entropy, Eq. (11.2), is
a measure of the uncertainty of the collection of events of Z. It also represents
the average amount of surprise upon learning the value of Z (Hamming 1980;
Ross 1998). In information theory, entropy can also be interpreted as the average
information content , or similarly a measure of the disorder in the system. So the
concept of entropy, uncertainty and information are all equivalent. In fact, as pointed
out, e.g., by McEliece (1977), given a random variable X entropy can be thought of
as a measure of
• The amount of information gained by observing X.
• Our uncertainty about X.
• The randomness of X.
It is easy to show that the entropy given in Eq. (11.2) is maximised for a (discrete)
uniform distribution, i.e. one in which pk = m1 , k = 1, . . . , m. This is to say that
a discrete uniform distribution contains the least information content, or similarly
that this distribution has the largest uncertainty, and therefore has no interesting
structure. Hence it would seem that a reasonable way to define interestingness is
as departure from discrete uniformity. This is different from the continuous case,
presented next.

11.3.2 Differential Entropy

It is clear from its relationship with probability that entropy is a measure of


our ignorance about the actual structure of the system. Definition (11.2) can be
extended to the continuous case to yield the differential entropy. From now on,
and unless otherwise stated, we will take the opposite of (11.2) to mean entropy.3

3 In fact, it is called negentropy.


11.3 Entropy and Structure of Random Variables 245

The differential entropy of a continuous random variable x with probability density


function f (x) is defined by

 
H (X) = E log f (X) = f (x) log f (x)dx. (11.3)

The entropy (11.3) is also known as Boltzmann’s H-function, see e.g. Hamming
(1980) and Cover and Thomas (1991). We have the following theorem:
Theorem Among all continuous random variables with zero mean and unit vari-
ance the standard normal distribution has the smallest entropy.
A proof outline is given as an exercise, see below.
Exercise Compute H for the normal random variable N(0, σ 2 ).
Answer H = − 12 log(2π eσ 2 ).
Exercise Using arguments from the calculus of variations show that among all
continuous distributions with zero mean and unit variance the normal distribution
has the smallest negentropy, Eq. (11.3).
The theorem can also be extended to the multivariate normal N (μ, ) with
probability density function given by

1 1
f (x) = m 1
exp − (x − μ)T  −1 (x − μ) (11.4)
(2π ) 2 || 2 2

whose entropy is given by

1  
H (x) = − log || (2π e)m . (11.5)
2
Exercise Prove (11.5).
Hint Make a variable change in (11.3) then use the result of the previous exercise.
It is therefore reasonable that in the continuous case one way of defining interest-
ingness is to consider departure from normality.
Another related measure of entropy is the Fisher information defined as the
variance of the efficient score in maximum likelihood estimate. For a probability
d 2
density function f (x), the Fisher information is given by I (f ) = E dx log f (x) ,
i.e.
  2   2
1 df (x) d
I (f ) = dx = log f (x) f (x)dx. (11.6)
f (x) dx dx

Fisher information, also like differential entropy, is minimised with the normal
probability density function.
246 11 Projection Pursuit

Exercise Compute I (f ) for the normal distribution N(0, σ 2 ).


Answer σ −2
Exercise Show that among all continuous distributions with zero mean and unit
variance the normal distribution minimises (11.6).
Hint As for differential entropy consider I = I (f + δf ) − I (f ) where f () is any
such distribution, and δf is small perturbation
 of f (). After neglecting
 the second-
 2
df (x)/dx d df (x)/dx
order terms in δf , we obtain I = 2 f (x) dx δf (x) − f (x) + λx 2
  2
 d df (x)/dx df (x)/dx
+μx + α) δf (x)] dx, which boils down to − 2 dx f (x) + f (x) + λx 2

+μx + α] δf (x)dx = 0 for all “perturbations” δf . Hence one gets 2 dg(x)


/dx +
df (x)/dx
g 2 (x) + λx 2 + μx + α = 0, where g(x) = f (x) . The solution takes the form
g(x) = ax + b.

11.4 Types of Projection Indexes

11.4.1 Quality of a Projection Index

A projection index of a one-dimensional PP of a p-dimensional random vector (or


a p × n data matrix) is a real function I (a) defined on the p-dimensional space
Rp . For each direction in this p-dimensional space one associates a number. The
one-dimensional PP is the most obvious to examine. However, two-dimensional
(Friedman and Tukey 1974) and three-dimensional (Nason 1992) projection pursuit
are also useful to pick up special two- and three-dimensional interesting structures.
For example, in two-dimensional PP, one looks for two p-dimensional vectors a and
b satisfying:

max I (a, b) s.t. aT a = bT b = 1, and aT b = 0. (11.7)

The orthogonality, although not necessary, is a nice and convenient constraint.


A projection index I (x) is a nonlinear (and non-quadratic) function of the vector
x. Its optimisation can yield in general several local optima, and therefore the
choice of the optimiser is important. The discussion on interestingness presented in
Sect. 11.4.2 below provides guidelines for choosing appropriate projection indexes.
Nice qualities for projection indexes are in general desirable. Huber (1985) lists the
following two qualities:
• Quality 1: invariance to nonsingular affine transformation, i.e.

I (Ax + b) = I (x) (11.8)

for any nonsingular matrix A and translation vector b.


11.4 Types of Projection Indexes 247

• Quality 2:

I (x1 + x2 ) ≤ max (I (x1 ), I (x2 )) . (11.9)

The reason for the last quality is that from the central limit theorem, we know that
x1 + x2 is more normal, and hence less interesting, than the more normal of x1 and
x2 . Consequently, if x1 , . . . , xm are IID random vectors having the same distribution
as the random vector x, then (11.9) yields

I (x1 + . . . + xm ) ≤ I (x) . (11.10)

Huber (1985) points out that an index I () satisfying (11.8) essentially amounts to a
normality test.
It is also desirable to have a continuously differentiable index to increase
computational efficiency by using powerful hill climbing algorithms (Friedman
and Tukey 1974) or gradient and projected gradient methods (Appendix E). In the
following, and unless otherwise stated, we will concentrate on one-dimensional
projection pursuit, but the method is straightforward to extend to two or three
dimensions.

11.4.2 Various PP Indexes

Projection pursuit finds low-dimensional projections of high-dimensional data that


capture interesting features. Interesting projections may include, but not limited
to clustering, skewness/kurtosis, and other nonlinear structures such as nonlinear
manifolds. All these features are strongly non-normal, and consequently a suitable
projection index essentially gives a measure of departure from normality. Differ-
ential entropy and Fisher information also constitute appropriate measures for the
same purpose although in practice different projection indexes achieve different
results. Below is a list of some of the most popular PP indexes used in the literature.

Friedman and Tukey’s Index

Historically, the first PP index is that of Friedman and Tukey (1974) for one and two
dimensions. In one dimension, their index takes the form:

I (x) = s(x)d(x), (11.11)

where s(x) is a measure of spread of the projected data, and d(x) describes the
local density of the projected scatter. In their application s(x) was taken to be the
trimmed standard deviation of the scatter, whereas for d(x) they took an average
248 11 Projection Pursuit


nearness function of the form i,j f (rij 1x≥0 (R − rij ) , where rij is the interpoint
distance between the ith and j th projected data, f () is a monotonically decreasing
function satisfying F (R) = 0, and 1x≥0 (x) = 1 for x ≥ 0, and zero otherwise
(the indicator function). Friedman and Tukey’s index has stimulated thoughts about
indexes and various indexes have been proposed since then.

Jones and Sibson’s Index

Jones (1983) and Jones and Sibson (1987) proposed using the projection index:

I (x) = f 2 (x)dx, (11.12)

where f () is the probability density function of the projected data onto the direction
x. Huber (1985) and also Jones and Sibson (1987) pointed out, in particular,
that the function d(x) in (11.12) is an estimate of (11.11). Now among all the
probability distributions with
√ √ zero mean and unit variance
√ the parabolic density
function defined over [− 5, 5], i.e. f (x) = 0.003 5 5 − x 2 1[−√5,√5] (x),
known also as Epanechnikov kernel (Silverman 1986), minimises the index (11.11),
see e.g., Hodges and Lehmann (1956). Therefore maximising (11.11) can only
lead to departure from the parabolic distribution rather than finding, for example,
clusters, and this is the main critic addressed by Jones and Sibson (1987). They also
note that the Friedman and Tukey two-dimensional index is not invariant to rotation.
That is, the projection index depends on the orientation of the plane of projection,
which is a serious set back. Jones (1983), however, found little difference in practice
between the entropy index Eq. (11.3) and index Eq. (11.11).

Entropy/Information Index

Since interestingness is related to non-normality measures given by the entropy4


Eq. (11.3) and Fisher information Eq. (11.6) suggest themselves naturally as the
appropriate projection indexes. The entropy has been used as a measure for testing
normality (Vasicek 1976), but several other tests also exist, see e.g. Mardia (1980).
These indexes can be normalised to yield positive functions having zero value for
a normal distribution. For example, the standardised (negative) differential entropy
index becomes
  √ 
I (x) = f (x) log f (x)dx + log σ 2π e (11.13)

whereas the standardised Fisher information index (Huber 1985) becomes

4 Following Rényi (1961; 1970 p. 592)’s introduction of the concept of order-α entropy; the

differential entropy, e.g., − f log f , is of order 1, whereas index (11) is of order 2.
11.4 Types of Projection Indexes 249

  2
1 df (x)
I (x) = σ 2 dx − 1, (11.14)
f (x) dx

where σ is the standard deviation of the projected data. Jee (1985) compares various
indices applied to synthetic multimodal data as well as the seven-dimensional
particle physics data of Friedman and Tukey (1974). He finds that the Fisher
information index achieves a better result with significantly higher information
content.

Moments-Based Indexes

The previous projection indexes are based on the probability density function of
the projected data. Another straightforward way to compute the index is to estimate
the pdf, which will be presented later. Jones (1983), and Jones and Sibson (1987)
suggested expanding the pdf in terms of its cumulants (see Appendix B) using
Hermite polynomials.
Hermite polynomials Hk (.), k = 0, 1, . . ., are defined from the successive
2 d k −x 2 /2
derivatives of the Gaussian PDF, i.e. Hk (x) = (−1)k ex /2 dx ke . For example,
H0 (x) = 1, H1 (x) = x and H2 (x) = x − 1. Figure 11.1 shows the first
2

five Hermite polynomials. These polynomials are orthogonal with respect to the

Hermite polynomials
H
0
15 H
1
H
2
10 H
3
H
H (x)

4
5
k

−5

−2 0 2 4
x
Fig. 11.1 Graphs of the first five Hermite polynomials
250 11 Projection Pursuit


Gaussian PDF φ(x) as weight function, namely, R φ(x)Hk (x)Hl (x)dx = k!δkl .
This allows
 any PDF f (.) to be expanded in terms of those polynomials, i.e.
f (x) = k≥0 ak φ(x)Hk (x). This expansion is known as the Gram–Charlier, or
Edgeworth polynomial expansion of f (.) (Kendall and Stuart 1977, p.169).
This expansion leads to the expression f (x) = φ(x) [1 + ε(x)] where φ(x) is
the normal PDF and ε(x) is a combination
 of Hermite polynomials. The entropy
index is then approximated by 12 φ(x)ε2 (x)dx. This expression can be simplified
further using the polynomial expansion of f (.).

1 1
f (x) = φ(x) 1 + κ3 H3 (x) + κ4 H4 (x) + . . . , (11.15)
6 24

where Hk (x) is the kth Hermite polynomial and κα is the cumulant of order α of
f (x). Using the properties of these polynomials, the projection entropy index yields
  
1 1
I (x) = f (x) log f (x)dx ≈ κ32 + κ42 . (11.16)
12 4

The approximation (11.16) is efficient in computing the entropy index. Similar


expansion can also be applied to the planar entropy index (Jones and Sibson 1987),
i.e.
1  2   
I (x, y) = κ30 + 3κ212
+ 3κ12
2
+ κ03
2
+ κ40 2
+ 4κ31
2
+ 6κ22
2
+ 4κ13
2
+ κ04
2
,
12
(11.17)
where κrs is the bivariate cumulant5 of order (r, s). The planar index (17) has been
shown by Jones (1983) to be rotationally invariant. The major drawback of these
approximations is the fact that the cumulants are in general sensitive to outliers.
One way to address this setback is to use robust estimates of moments (Hosking
1990; Hannachi et al. 2003). One problem with this is that the obtained index
becomes non-smooth, where efficient gradient algorithms cannot be used. There
are, of course, other sophisticated optimisation algorithms for nondifferentiable
functions, such as simulated annealing. Alternatively, one can use the approximation
of Hyvärinen (1998), which is based on a first-order approximation of the density
maximising the entropy (see also Carreira-Perpiñán 2001). Friedman (1987) also
proposes another alternative to overcome the problem encountered with heavier tail,
than the normal distribution.

the projected data is centred and scaled to have zero mean and unit variance, then κ31 = μ31 ,
5 If

κ13 = μ13 , κ22 = μ22 − 1, κ40 = μ40 − 3, κ04 = μ04 − 3, κ12 = μ12 , κ21 = μ21 , κ03 = μ03 , and
κ30 = μ30 , where the μs refer to the centred cumulants.
11.4 Types of Projection Indexes 251

Friedman and Related L2 Norm-Based Indices

Let X be a random variable with zero mean and unit variance. The projected index
can be considered as realisations of this random variable. Friedman (1987) considers
the transformation of this random variable using

U = 2(X) − 1, (11.18)

where (x) is the cumulative distribution function of the standard normal distribu-
x
tion, i.e. (x) = √1 −∞ e−t /2 dt. The transformation (11.18) maps the projected
2

data onto the interval [−1, 1]. The probability density function of U is given by
 
f −1 ( u+1
2 )
p(u) =  , (11.19)
2φ −1 ( 2 )
u+1

where f () and φ() are respectively the pdfs of X and the standard normal
distribution. Furthermore, if X is normal, then U is the uniform distribution over
[−1, 1]. Friedman’s index measures departure of the transformed variable from the
uniform distribution over [−1, 1], in the L2 sens, and is given by
 1 2  1
1 1
I (X) = p(u) − du = p2 (u)du − , (11.20)
−1 2 −1 2

where p(u) is the density function of U . For efficient computation Friedman (1987)
expanded p(u) into Legendre polynomials6 to yield

1
I (X) = (2k + 1) [E (Pk (U ))]2 , (11.21)
2
k=1

where E[] stands for the expectation operator. For a sample of projected data index
x = (x1 , . . . xn ), the previous expression yields, when truncated to the first K terms,
the expression:
 2

K
1
n
I (x) = (2k + 1) Pk (2(xt ) − 1) . (11.22)
n
k=1 t=1

Friedman (1987) also extended his index to two-dimensional projection, but the bi-
dimensional index is not rotationally invariant (see Morton 1989).

6 These are orthogonal polynomials over [−1, 1], satisfying P0 (y) = 1, P1 (y) = y, and for k ≥ 2,
kPk (y) − (2k − 1)yPk−1 (y) + (k − 1)Pk−2 (y) = 0. They are orthogonal with respect to the normal
density function φ(x).
252 11 Projection Pursuit

Equation (11.20) can be simplified using the expression of the density function
p(u) given by (11.19) to yield
 ∞  ∞
1 1 1 f 2 (x) 1
I (X) = [f (x) − φ(x)] dx = 2
dx − . (11.23)
2 −∞ φ(x) 2 −∞ φ(x) 2

In order for Eq. (11.23) to converge, the density function f (x) has to decrease at
least as fast as the normal density. Hall (1989) notes, as a consequence, that this
index is not very useful for heavy tailed distributions. He suggests the following
index as a remedy:
 ∞
I (X) = [f (x) − φ(x)]2 dx. (11.24)
−∞

Using an expansion into Hermite polynomials the previous index becomes



 21/2
I (X) = E [Hk (X)] = E [H0 (X)] . (11.25)
π 1/4
k=1

Another suggestion came from Cook et al. (1993) who propose the following index
instead:
 ∞
I (X) = [f (x) − φ(x)]2 φ(x)dx (11.26)
−∞

by using the polynomial expansion in terms of Hermite polynomials. The two-


dimensional index of Hall (1989) takes the form:
 2 2
 
K K−j
1
n
1
n
I (x, y) = Hj (xi )φ(xi ) Hk (yi )φ(yi )
n n
j =0 k=0 i=1 i=1

1 
n
1
− φ(xi )φ(xj ) + (11.27)
n2 4π
i,j =1

whereas Cook et al. (1993) index takes the form:


 2
 
K K−j
1 1
n
I (x, y) = √ Nj (xi )Nk (yi )φ2 (xi , yi ) − bj bk , (11.28)
j =0 k=0
n j !k!
i=1

where√ Nj () is the natural Hermite polynomial of order j , b2k+1 = 0, and b2k =


(−1) k (2k)!
√ .
π k!22k+1
11.4 Types of Projection Indexes 253

Chi-Square Index

Posse (1995) proposes a two-dimensional projection index based on the chi-squared


distance. The plane is divided radially into 48 boxes Bk , k = 1, . . . , 48. The boxes
have the same angular width of 45o , and the inner 40 boxes have the same radial
1√
width equal to 5 2 log 6. This choice provides regions with the same probability
under the standard bivariate Gaussian distribution. The regions in the outer ring have
1
equal probability of 48 . This construction accounts for the isotropy of N(0, I2 ). The
index gives a measure of the χ 2 -distance between the number of data points in each
box and the expected number from a bivariate standard normal distribution in the
same box. The index takes the form:
  2

48
Bk dF (x, y) − Bk φ 2 (x, y)dxdy
Iχ 2 (x, y) =  , (11.29)
φ
Bk 2 (x, y)dxdy
k=1

where F (x, y) is the probability distribution of the two-dimensional projected


(and sphered) data, and φ2 (x, y) = φ(x)φ(y) is the probability density function
of the standard bivariate normal. To make it rotationally invariant, the projection
index (11.29) is averaged over all planar rotations up to 45◦ . This yields
 n 2
11 1 
8 48
(j ) (j )
Iχ 2 (x, y) = 1Bi (xi , yi ) − nk n (11.30)
9 n nk n
j =0 k=1 i=1


where nk = Bk φ2 (x, y)dxdy, 1Bk is the indicator function of Bk , i.e. it equals
(j ) πj πj (j )
one inside Bk and zero elsewhere, and xi = xi cos 4×9 − yi sin 4×9 and yi =
πj πj (j ) (j )
xi sin 4×9 + yi cos 4×9 . The numbers xi and yi result from averaging the
projections over [0, 4 ] using nine equal angles π36k , k = 0, 1, . . . , 8. In practice
π

if a and b are two directions, x = Xa, y = Xb, and if a(j ) = a cos πj πj


36 − b sin 36 ,
and b(j ) = a sin πj πj
36 + b cos 36 , then x
(j ) = Xa(j ) and y(j ) = Xb(j ) . Posse (1995)

argues that the χ 2 index appears very efficient and fast to compute. Other measures
can also be used instead of the χ 2 . One can use for example the Kolmogorov–
Smirnov distance,
 KS = max |F (x) − (x)| which is a L∞ -norm, or else use the
L1 -norm |F (x) − (x)|dx.

Clustering Index

Although the previous projection indexes attempt to maximise departure from nor-
mality, they can also be used to find clusters. There are, however, various studies that
investigated indexes specifically designed for uncovering low-dimensional cluster
structures (Eslava and Marriott 1994; Kwon 1999). Looking at the connection
254 11 Projection Pursuit

between multidimensional scaling, MSD (Chap. 9) and cluster analysis, Bock


(1986) and Heiser and Groenen (1997) introduced stress function in MDS that
accounts for the presence of clusters in the data. Bock (1987) developed a MDS
clustering method based on maximising the between-groups sums-of-squares in
the low-dimensional projection step and minimising the within-groups sum-of-
squares in the clustering step using k-means (Gordon 1981). Now, if B, W and
T are respectively the between-group, within-group and total sums-of-squares and
products, then

T = B + W. (11.31)

Let X = xij be our n × p data matrix and suppose that there are
K
K groups with nk units in group k, k = 1, . . . , K, with k=1 nk =
n, and that x
 k ik is the p-dimensional vector of unit i in group k with
xk = n1k ni=1 xik . Then the within- and between-group covariance matrices
1 K n k
are respectively W = n−K k=1 i=1 (xik − xk ) (xik − xk ) , and B =
T
 K
k=1 nk (xk − x) (xk − x) , where x is the total mean. The eigenvectors
1 T
K−1
−1
of W B are supposed to be ordered by size of eigenvalues given the complete
set of canonical variables. Now, given that in Eq. (11.31) T is constant, Bock’s
procedure essentially maximises I0 (A) = tr ABAT , subject to AT A = Ip , that
 T 
k ak Bak , subject to ak al = δkl , where A = a1 , . . . , ap is the matrix
is max T

of the low-dimensional basis vectors. This bears some similarities to discriminant


analysis, which attempts to identify canonical variates7 by maximising the ratio
aT Ba
aT Wa
. Bolton and Krzanowski (2003) propose the orthogonal canonical variates as
an ideal candidate for projection pursuit clustering index:


p
aTk Bak
I1 (A) = I1 a1 , . . . , ap = (11.32)
k=1
aTk Wak

subject to AAT = Ip . Note that using B or T in Eq. (11.32) yields the same
result. Because Eq. (11.32) is a sum of some sort of signal-to-noise ratios, Bolton
and Krzanowski (2003) point out that when the groups are not well defined, the
separation between the groups will occur only on the first projection, and no
separation thereafter, i.e. on subsequent projections. They propose as a remedy the
following PP clustering index:
p
aT Tak
I2 a1 , . . . , ap = pk=1 Tk s.t. aTk al = δkl . (11.33)
k=1 k Wak
a

7 For example, Fisher’s linear discrimination function when there are only two groups.
11.4 Types of Projection Indexes 255

Note that only a1 can be identified as the first canonical variate whereas subsequent
vectors are found numerically. In this approach the number of clusters is supposed to
be known, and the optimal clustering can be found using, e.g. k-means, or any other
algorithm, see e.g., Gordon (1981) or Everitt (1993). Note also that when the data
are sphered, then I1 (A) becomes identical to I2 (A). In their application, Bolton
and Krzanowski (2003) argue that their refined index I2 gives better results than the
previous index. When the number of clusters is unknown they use the rule of thumb
of Hartigan8 (1975).

11.4.3 Practical Implementation

Suppose that we have a data matrix X = xij = (x1 , . . . , xn ), i = 1, . . . p, j =


1, . . . n where n is the sample size and p the number of variables. The direction of
projection is a p-dimensional vector a, and the 1-dimensional projection index is
the vector of observations given by a linear combination of the original variables:
 
x = aT X = aT x1 , . . . , aT xn . (11.34)

Since many of the projection indexes are based on the probability distribution
function of the index it is necessary therefore to have an estimate of the pdf from the
sample index x = (x1 , . . . , xn ). Let us denote by fa () an estimate of the pdf of the
index (11.34). If, for example, one in interested in the entropy index (11.13), then
the sample index is
  √ 
I (a) = fa (x) log fa (x)dx + log σa 2π e , (11.35)

where σa is the standard deviation of x. The one-dimensional projection vector a is


then obtained by maximising I (a) subject to being unitary, i.e. aT a = 1. The first
direction is then obtained as the solution of max I (fˆa ) − λ 1 − aT a , where λ is
a Lagrange multiplier. Similar procedure applies to other indexes.
A suitable and convenient method to estimate the pdf of a time series is provided
by the kernel density estimation (Silverman 1986). This estimator is given by (see
Appendix A):
 
1 
n
x − xk
fa (x) = K , (11.36)
nh h
k=1

8 If W is the total within-groups sums-of-squares obtained using k-means, with k clusters, then it
k
is acceptable to take k + 1 clusters if (n − k + 1) (−1 + Wk /Wk+1 ) > 0.
256 11 Projection Pursuit

where K() is a kernel smoother, typically a symmetric probability density function,


and h is the kernel bandwidth. This parameter controls the smoothness of the
estimator (11.36) and its optimal value is a function of the sample size, and is
1
given by h0 = 1.06n− 5 (in the univariate case), see, e.g., Silverman (1986) and
Scott (1992). The estimator (36) has a number of desirable properties. For instance,
for a suitable kernel choice this estimator is asymptotically unbiased and efficient
and converges to the true pdf under various measures (Silverman 1986). If K() is
differentiable, then this property gets inherited by (11.35), hence efficient gradient
optimisation algorithms can be used for the maximisation of I (a). A common and
most widely used choice for the kernel smoother is the standard normal density
function φ(). In this case (11.35) becomes
 
1 
n
x − aT xk
fa (x) = φ . (11.37)
nh h
k=1

Before applying the projection, it is recommended that the data be centred and
sphered (Jones and Sibson 1987; Huber 1985; Tukey and Tukey 1981). Centring
yields a zero-mean data matrix whereas sphering yields a data matrix whose
covariance matrix is the identity matrix. The centred data matrix (see Chap. 2) is

Xc = XH, (11.38)

where H = In − n1 1n 1Tn is the centring matrix. The sphered data matrix can be
obtained by multiplying the centred data matrix by the inverse of a square root of
the sample covariance matrix S = n1 Xc XTc . For example if X = VUT , then one
√ 1
 1 T
can take S− 2 = nU− 2 UT . Note that S− 2 S− 2
1 1
= S−1 . The sphered data
matrix is then X◦ = UVT . The gradient of (11.37) with respect to a is given by
n    x − aT x 
1   k
∇a fa (x) = x − aT
xk φ xk , (11.39)
nh3 h
k=1

and that of I (a) in Eq. (11.35) can be obtained easily by differentiating the
expression under the integral.
Once the first direction a1 is found the next direction can be obtained in a
similar fashion from the residuals X − 1aT1 X where 1 is a vector of length k
containing one, etc. The other alternative is to extend the definition of the index
to data projected onto a plane where the index I (a, b) is a function of xa = aT X
and xb = bT X, with aT a = bT b = 1, and aT b = 0. The 2D PP is useful
if we are interested in finding 2D structures simultaneously. Nason (1992) has
extended this to the 3D case to analyse multispectral images through RGB colours
representation. The same procedure can still be applied to find further 2- or 3-
dimensional structures after removing the previously found structures. This is
known as structure removal (Friedman 1987). Structure removal is in general an
11.5 PP Regression and Density Estimation 257

iterative procedure and repeatedly transforms data that are projected to the current
solution plane to standard normal.

11.5 PP Regression and Density Estimation

11.5.1 PP Regression

Projection pursuit has been extended beyond exploratory data analysis to regression
analysis and density estimation. PP regression is a non-parametric regression
approach that attempts to identify nonlinear relationships between variables, e.g.,
a response and explanatory variables y and x. This is achieved by looking for m
vectors a1 , . . . , am , and nonlinear transformations φ1 (x), . . . , φm (x) such that


m  
yt = y + αk φk aTk xt + εt (11.40)
k=1

provides an accurate model for the data (yt , xt ), t = 1, . . . , n. Formally, y and x are
presumed to satisfy the conditional expectation:


m  
E [y|x] = μy + αk φk aTk x (11.41)
k=1
 
where μy = E(y) and φk , k = 1, . . . m, are normalised so that E φk aTk x = 0,
 2
and E φk aTk x = 1. The functions φk (x), k = 1, . . . , m are known as the ridge
functions.
The parameters of the model (11.40), i.e. the ridge functions and projection
vectors, are obtained by minimising the mean squared error:
 2

m  
E y − μy − αk φk aTk x . (11.42)
k=1

Model (11.40) is the basic PP regression model.9 It includes the possibility of having
interactions between the explanatory variables. The problem is normally solved
using a forward stepwise procedure (Friedman and Stuetzle 1981). First, a trial
direction a1 is chosen to compute the projected data zt = aT1 xt , t = 1, . . . , n, then
a curve fitting is obtained between yt and zt . This amounts
 to a simple 2D scatter
plot analysis, and is achieved by finding φ1 (x) such that nt=1 wt (yt − φ1 (zt ))2 is

9 Notethat when a1 = (1, 0, . . . 0), . . . am = (0, . . . , 0, 1) and αk = 1, k = 1, . . . m, then


model (11.40) is known as an additive model.
258 11 Projection Pursuit

minimised. This procedure is iterative in a1 and φ1 . Once a1 and φ1 are found, the
same procedure is repeated with the residuals yt − y − β1 φ1 aT1 xt where φ1 has
been standardised as above.
As for spline smoothing, to make the ridge function smooth one can impose a
smoothing constraint. For instance, if in (11.41) or (11.42) we let μy = 0, then the
quantity to be minimised takes the form:
  m   2 2

n 
m   2  d
yi − T
αk φk ak xi +λ φ k (u) du (11.43)
du2
i=1 k=1 k=1

 d2
2
over all ridge functions φk for which du < ∞. Given ak , k =
φ (u)
du2 k
ˆ
1, . . . m, Eq. (11.43) provides estimates φk for the ridge functions. From these
estimates one can compute better projections by minimising Eq. (11.43) but with
λ = 0, i.e.
 2

n 
m  
min yi − αˆk φk aTk xi .
i=1 k=1

This procedure can be re-iterated until convergence is achieved. In the backfitting


algorithm we first estimate initial projections and ridge functions a(0) (0)
k , φk , k =
(1)
1, . . . m. Next, to compute for example a1 we minimise
⎡ ⎤2

n     
⎣yt − φ (0) a(1)T xt − φk ak xt ⎦
(0) (0)T
1 1
t=1 k=1

(1) (1)
with respect to a1 . The ridge function φ1 is then obtained from minimising the
1D spline smoothing problem


n   2
ut − φ1(1) a(1)T
1 xt ,
t=1

  
(0) (0)T (1) (1)
where ut = yt − k=1 φk ak xt . This is then repeated to estimate ak , φk ,
(p) (p)
for k ≥ 2. The whole procedure is then re-iterated to yield estimates ak , φk , k =
1, . . . m, until convergence is achieved. Note that the forward stepwise procedure
is different from the backfitting in that in the latter the sequentially estimated ridge
functions are re-used to estimate new ridge functions.
11.5 PP Regression and Density Estimation 259

11.5.2 PP Density Estimation

Probability density functions can also be estimated using projection pursuit. This
has been presented in Friedman et al. (1984) and in Huber (1985). The PP density
estimation of a pdf f (x) is based on the following multiplicative decomposition:

!
K  
fK (x) = f0 (x) gm aTm x , (11.44)
m=1

where f0 (x) is a given standard probability density function such as the normal
having the same first two moments as the unknown density function f (x). The
objective is to determine the directions am and the univariate functions gm () such
that fK () converges to f () in some metric. From (11.44) one gets the recursion
 
fK (x) = fK−1 (x)gK aTK x , (11.45)

given the initial density function f0 (). At each step a direction aK and an aug-
menting function gK () are computed so as to maximise the goodness of fit of fK ().
Various measures can be used to evaluate the approximation. Huber (1985) mentions
two discrepancy measures that provide proximity between two density functions:

(i) the relative entropy or Kullback–Leibler distance E (f, g) = f (x) log fg(x)
(x)
dx,
 √ √ 2
and (ii) the Hellinger distance H (f, g) = f (x) − g(x) dx. One of the
advantages of the relative entropy, despite not being a metric because E (f, g) =
E (g, f ), is that any probability density function f () with finite second-order
moments has one unique Gaussian distribution closest to it in the sense of entropy.
This Gaussian distribution has the same first- and second-order moments, i.e. mean
and covariance matrix, as that of f (). The relative entropy is also invariant under
arbitrary affine transformation and is suitable for multiplicative approximation such
as (11.44), see Huber (1985).
 Friedman and Tukey (1974) used, in fact, a version of
the relative entropy W = f (x) log fK (x)dx, which has to be maximised, i.e.
   
max f (x) log gK aTK x dx s.t. fK (x)dx = 1. (11.46)

Note using the available sample,the relative entropy W above can be estimated from
the sample by the expression n1 nt=1 log fK (xt ). Now given the probability density
function fK−1 () and the direction aK the solution to (46) is obtained from

  f (aK ) aT x
gK aTK x = K(a ) K (11.47)
fK−1K
aTK x

where fK(aK ) () and fK−1


(aK )
() are the respective marginal probability density functions
of fK () and fK−1 () along direction aK . The estimation procedure from the sample
260 11 Projection Pursuit

goes like this. Given a direction a, and since fK−1 () is supposed to be known,
one computes gK aT x according to Eq. (11.47). The estimation of fK(a) () is
obtained by projecting the data onto direction a, then computing an estimate of the
corresponding probability density function using e.g. kernel smoother. The direction
a is then chosen so to maximise (11.46).
(a)
Numerical computation of fK−1 (), however, may not be efficient in an iterative
process. Friedman et al. (1984) propose a Monte Carlo approach based on replacing
(a)
the estimate fK−1 () by a random sample from fK−1 (), from which fK−1 () is
(a)
estimated in the same way as fK (), and this is found to speed up the computation.
See also Huber (1985) for a few other alternatives.
Remark: How to Construct Clusters at the Vertices of a High-Dimensional Simplex
A 14-dimensional simplex was used by Friedman and Tukey (1974). Sammon
(1969) used Gaussian data distributed at the vertices of a 4D simplex to test their
clustering, or cluster revealing, algorithm, which is a multidimensional clustering
algorithm. The simplex can be constructed as follows. One has to fix first the origin
O and the intervertex distance r, then
(1) Choose A1 = (x1 = r, 0, . . . , 0).
(2) Compute the centre of mass of the interval [O, A1 ], i.e. G1 = g11 , 0, . . . , 0 .
Note that g11 = r/2.
(3) Choose A2 = x21 = g11 , x22 , 0, . . . , 0 , where x22 is obtained from OA2 = r,

(i.e. r 3/2 ).
(4) Compute the centre of mass G2 = g21 , g22 , 0, . . . , 0, of (O, A1 , A2 ) using the
fact that G2O + G2A1 + G2A2 = 0. Note that g21 = g11 .
(5) Now choose A3 = x31 = g21 , x32 = g22 , x33 , 0, . . . , 0 where again x33 is chosen
from using OA3 = r. Next compute the centre of the 3D simplex
(O, A1 , A2 , A3 ) using G3O + G3A1 + G3A2 + G3A3 = 0, i.e. G3 =
g31 = g11 , g32 = g22 , g33 , 0, . . . , 0 .
(6) We then choose A4 = x41 = g31 , x42 = g32 , x43 = g33 , x44 , 0, . . . , 0 , where
as before x44 is obtained from OA4 = r, then compute G4 =
g41 = g11 , g42 = g22 , g43 = g33 , g44 , 0, . . . , 0 .

11.6 Skewness Modes and Climate Application of PP

A number of PP studies have been applied to weather and climate. Chan and
Shi (1997) compared PP based on a graduation index (Huber 1981), which is a
robust measure of scatter, with EOF analysis. They suggest that PP analysis is more
robust than EOF analysis particularly vis-a-vis outliers. Christiansen (2009) applied
projection pursuit to 20- and 500-hPa geopotential heights from NCEP/NCAR
reanalyses for the period 1948–2005 by optimising five indices. In terms of the
500-hPa heights Christiansen (2009) identified a pattern resembling a combination

of the PNA and a European pattern when kurtosis is maximised, and a NAO pattern
when the negentropy is maximised. The PDF estimated from the first PP index was
unimodal but skewed when daily data are used. When seasonal means of 20-hPa
data were used the kernel PDF along the first PP direction was bimodal.
Figure 11.2 shows the kernel PDF estimates of the seasonal mean 20-hPa winter
geopotential height within the (PC1, PC2) (left) and (PC2, PC3) (right) spaces along
with the PP directions based on the leading three PCs and using five indices:
“kurtosis” (orange), “negentropy” (red), “depth” (yellow), “multi” (brown) and
“critic” (grey). The “depth” index provides a measure of the depth of the largest
trough in the PDF and targets hence bimodality of the PDF, the “critic” index was
proposed by Silverman (1986) and provides a measure of the degree of smoothing
required to get a unimodal PDF, i.e. related to the smoothing parameters used in
kernel PDF estimates, whereas the “multi” index was introduced by Nason and
Sibson (1992) and provides a measure that targets multimodality. It is clear that
the different indices do not necessarily lead to the same directions. For example,
Fig. 11.2 shows that the “kurtosis”, “negentropy” and “critic” indices provide the same direction, aligned with the bimodality of the PDF, whereas the other two PP indices provide a direction nearly along the diagonal.
Figure 11.3 shows the histograms and kernel PDF estimates of the time series
obtained by maximising the “depth” (left) and “critic” (right) indices using the
leading three PCs. The associated PP patterns are also shown in Fig. 11.3. These
results suggest that projection maximising the “critic” index is associated with
weakening and strengthening of the polar vortex whereas the “depth” index is

Fig. 11.2 PDF of the winter seasonal mean 20-hPa geopotential height within the leading two PCs
along with the directions maximising the “kurtosis”, “negentropy”, “critic” (overlapping—grey
colour), “depth” and “multi” (overlapping—red colour) within the space of the leading three PCs,
projected onto the space of (PC1,PC2) (left) and (PC2,PC3) (right). Adapted from Christiansen
(2009). ©American Meteorological Society. Used with permission

Fig. 11.3 Histograms and kernel PDF estimates (top) of the projection of 20-hPa winter mean
geopotential height on the direction maximising “depth” (left) and “critic” (right) PP indices, and
the associated PP patterns (bottom). Adapted from Christiansen (2009). ©American Meteorologi-
cal Society. Used with permission

dominated by the existence of two centres of the same polarity sitting over North America and Siberia, respectively.
Pasmanter and Selten (2010) applied PP using skewness and proposed a numer-
ical algorithm to solve a system of nonlinear equations to obtain skewness modes.
Denoting by $z_t = \mathbf{a}^T\mathbf{x}_t$ the projection of the zero-mean data $\mathbf{x}_t = (x_1(t), \ldots, x_p(t))^T$, $t = 1, \ldots, n$, onto the pattern $\mathbf{a} = (a_1, \ldots, a_p)^T$, the skewness modes are obtained by maximising the third-order moment of $z_t$, $t = 1, \ldots, n$, subject to having unit variance, i.e.

$$\max_{\mathbf{a}} E(z_t^3) \quad \text{subject to} \quad E(z_t^2) = \mathbf{a}^T\mathbf{C}\mathbf{a} = 1, \qquad (11.48)$$

where E(.) is the expectation operator and C = (cij ) is the covariance matrix of xt ,
t = 1, . . . , n.

After differentiation, Eq. (11.48) boils down to a (quadratic) nonlinear system of equations:

$$3\sum_{i,j=1}^{p} a_i\, s_{ijm}\, a_j - 2\lambda\sum_{k=1}^{p} c_{mk}\, a_k = 0, \quad m = 1, \ldots, p, \qquad (11.49)$$

where $s_{ijm} = E\left[x_i(t)x_j(t)x_m(t)\right]$, $i, j, m = 1, \ldots, p$, represent the elements of the skewness tensor. In general, it is convenient to reduce the data using the EOFs and PCs, in which case Eq. (11.49) becomes

$$3\sum_{i,j=1}^{p} \alpha_i\, S_{ijm}\, \alpha_j - 2\lambda\alpha_m = 0, \quad m = 1, \ldots, p, \quad \text{subject to} \quad \sum_{i=1}^{p}\alpha_i^2 = 1, \qquad (11.50)$$

where $S_{ijm}$, $i, j, m = 1, \ldots, p$, represent the elements of the skewness tensor obtained using the leading $p$ PCs. The quadratic system of equations (11.50) can only
be solved numerically10 when p > 2. Pasmanter and Selten (2010) proposed a
numerical algorithm to solve (11.50) using a Lyapunov function and integrated a
system of ordinary differential equations with a fourth-order Runge–Kutta scheme.
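The following Python sketch illustrates the constrained problem (11.48)/(11.50) using a simple projected gradient ascent on standardised PCs. It is not the Lyapunov-function/Runge–Kutta algorithm of Pasmanter and Selten (2010); the toy data, learning rate and iteration count are illustrative assumptions.

```python
import numpy as np

def skewness_mode(pcs, n_iter=2000, lr=0.1, seed=0):
    """Maximise E[(pcs @ a)**3] subject to unit variance. `pcs` holds
    standardised PCs (n x p), so the constraint reduces to ||a|| = 1 and is
    enforced by renormalisation after each gradient step."""
    rng = np.random.default_rng(seed)
    n, p = pcs.shape
    a = rng.standard_normal(p)
    a /= np.linalg.norm(a)
    for _ in range(n_iter):
        z = pcs @ a
        grad = 3.0 * (pcs * z[:, None]**2).mean(axis=0)   # d E[z^3] / d a
        a = a + lr * grad
        a /= np.linalg.norm(a)                            # project back onto the unit sphere
    z = pcs @ a
    return a, (z**3).mean()

# toy example with one built-in skewed direction
rng = np.random.default_rng(1)
pcs = rng.standard_normal((5000, 3))
pcs[:, 0] = (rng.gamma(2.0, 1.0, 5000) - 2.0) / np.sqrt(2.0)   # skewed, unit variance
a, sk = skewness_mode(pcs)
print(np.round(a, 2), round(sk, 2))   # pattern close to the skewed PC, positive skewness
```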
Pasmanter and Selten (2010) applied this to the winter (December–January–
February) daily 500-hPa streamfunction from ERA-40 reanalyses. The first maximal
skewness mode (Fig. 11.4, left) represents a Rossby wave-like pattern propagating from the central Pacific along the west North American coast. The second skewness mode
(Fig. 11.4, right) projects onto the PNA pattern and contains a west Pacific pattern
and a particular phase of the wave in the intermediate frequency-range of Rennert

Fig. 11.4 Leading skewness modes of daily 500-hPa streamfunction from ERA-40. Skew values are shown on top of each panel (left: 1.13, right: 0.96). Adapted from Pasmanter and Selten (2010)

10 When p = 2, the solution can be easily obtained exactly.



and Wallace (2009) (see their Fig. 14). These modes explain altogether around 8%
of the total variance. Note that by construction the skewness modes are temporally
uncorrelated, so it makes sense to talk about them explaining a certain fraction of
the total variance.
The message from these techniques is that although the amount of variance
explained by those patterns is not very large, it is still significant compared to
extratropical EOFs’ variance. In addition, it is important to notice also that skewness
(or kurtosis) modes project well onto large scale flows pointing to the presence of
non-Gaussianity in large scale atmosphere (Sura and Hannachi 2015; Franzke et al.
2007). As pointed out by Sura and Hannachi (2015), it is likely that “interesting
structures” of large scale flow may lie on nonlinear manifolds, for which nonlinear
methods, despite their shortcomings (Monahan and Fyfe 2007), can still provide
useful information (Hannachi and Turner 2013b; Monahan 2001). Monahan and
DelSole (2009) also point out that when dealing with non-Gaussian statistics
“linear” approaches, i.e. operating on hyperplanes, are not efficient and they suggest
using different measures based on information-theoretic metrics.
Chapter 12
Independent Component Analysis

Abstract Conventional EOFs yield orthogonal spatial patterns and uncorrelated


time series. Non-correlation does not necessarily imply independence, which is a
stronger constraint than non-correlation. This chapter discusses the concept
of independence, and its relation to non-normality and describes different ways
to obtain independent components. The chapter also discusses the application to
climate data.

Keywords Blind deconvolution · Cocktail-party problem · Independent


components · Non-normality · Shannon entropy · Kullback–Leibler divergence ·
Mutual information · Negentropy · Infomax · FastICA · Three-way analysis ·
NAO · Polar vortex · High-order SVD

12.1 Introduction

The climate system displays a great deal of complexity. Any atmospheric/climate


variable, e.g. surface temperature or sea level pressure, includes nonlinear contri-
butions from different sources such as SST, solar variability (e.g. 11-year cycle),
volcanic eruptions, etc. In addition to this complexity there is also the difficulty
related to the dimension problem. In many atmospheric studies a challenging task is
to be able to isolate the influence of those different sources. This is very important
particularly when it comes to model comparison/validation. One way to ease the
task is to consider that the measured/observed variables result from a mixture of
different sources. To ease the problem further, one can assume that these sources are
independent. The objective then is to find a way to separate and obtain the effect of
these independent sources. The separation of these independent sources is a long-
standing problem in pattern recognition. It boils down to finding a suitable linear
transformation of the multivariate data in order to make the structure of the data
“transparent”. This procedure can also be interpreted within the framework of data
mining and dimension reduction.


We recall here that the scope of EOF/PCA is to find a linear transformation

S = WX (12.1)

of the data matrix X that faithfully represents the data in a particular way (see
Chap. 3). In fact, EOF/PCA is only one way to determine the matrix W in (12.1),
and is based on the second-order moments, namely the covariance matrix $\frac{1}{n}\mathbf{X}^T\mathbf{X}$,
and yields uncorrelated variables. Alternative ways, based on higher order moments,
also exist. Independent component analysis constitutes one such alternative and is
the subject of this chapter.

12.2 Background and Definition

In a nutshell, independent component analysis (ICA) is a technique similar to


PCA except that (statistical) independence is substituted for uncorrelation. Before
discussing ICA, it is helpful to put the method in its historical context. Two examples
are historically relevant to ICA, namely blind deconvolution of signals and blind
source separation.

12.2.1 Blind Deconvolution

A classical problem in signal processing is concerned with the task of recovering an


unknown input signal xn , n = 0, 1, . . ., given its output yn , n = 0, 1, . . ., resulting
from a time-invariant linear filter:

$$y_n = \sum_{k \in N} f_k\, x_{n-k}, \qquad (12.2)$$

where fk , for k in N, represents the sequence of the unit-impulse response of the


filter, and N is the set of integers for which (12.2) is defined. The filtered signal
yn , n = 0, 1, . . ., results therefore from a convolution of the input signal with the
unit-impulse responses. Given a realisation of the response yn , n in N, the classical
deconvolution problem consists in obtaining the signal using also a convolution of
the form:

$$z_n = \sum_{k} g_k\, y_{n-k}. \qquad (12.3)$$

The ideal deconvolution operation is the one for which there is a constant $\alpha$ and an integer $m_0$ such that $z_n = \alpha x_{n-m_0}$, for $n$ in $N$, and this is satisfied when $\sum_k f_k\, g_{n-k} = \alpha\,\delta(n - m_0)$. Note that a similar problem formulation arises in signal

detection where the objective is to estimate the input signal xt given that we observe
yt = xt + εt , for which the noise εt is supposed to be independent of xt . In this
detection problem we suppose, for example, that the second-order moments of εt
are known. The solution to this problem is obtained through a mean square error
(MSE) minimisation and takes a similar form to (12.3). If in the original problem,
Eqs. (12.2)–(12.3), the unit-impulse response of the convolving operator (12.2), i.e.
fk , (k in N), is known, one talks about deconvolution when we seek the input signal
xt , t in N. The deconvolving signal, Eq. (12.3), can be obtained by inverting the
linear filter (12.2) using, for example, the z-transform or the frequency response
function.1
A more challenging task corresponds to the case when the convolving filter (12.2)
is unknown. In this case one talks about blind deconvolution (Shalvi and Weinstein
1990). Although this is not the main concern here, it is of interest to briefly discuss
the solution to this problem in a particular case where the signal xk , k in N , is
supposed to be independent white noise, i.e. IID sample from a random variable
X. The solution to this problem can be obtained using higher order moments of X
estimated from the sample (see Appendix B). Precisely, the normalised cumulants,² see e.g. Kendall (1994, chap. 3), of the sample $y_t$, $t$ in $N$, are used in this case. Note that if $y_t = \sum_k f_k x_{t-k}$, with $x_t$, $t$ in $N$, being IID realisations of $X$, then the cumulants of $y_t$ are related to those of $x_t$ via $c_y(p) = c_x(p)\sum_k (f_k)^p$, see e.g. Brillinger and
Rosenblatt (1967) and Bartlett (1955). It can be shown, in fact, that the magnitude
of the normalised response cumulant κy (p, q), of order (p, q), for any even q > 0,
is bounded from above, and for any even p > 0, it is bounded from below as (see
e.g. Cadzow 1996)

|κy (p, q)| ≤ |κx (p, q)| for all p > q and,
(12.4)
|κx (p, q)| ≤ κy (p, q)| for all q > p.

An important particular case is obtained with $p = 4$ and $q = 2$, which yields the kurtosis $\kappa_y(4, 2) = \frac{\mu_y(4)}{\sigma_y^4} - 3$. The objective is then to maximise the kurtosis since this corresponds to the case $p > q$ in the first inequality of Eq. (12.4) above. Note that in this procedure $X$ is supposed to be non-Gaussian. Using the sample $y_1, \ldots, y_n$, the kurtosis of (12.3) is then computed and is a function of the deconvolving coefficients $\mathbf{g} = (g_1, \ldots, g_K)^T$. The gradient of the kurtosis can be


¹ Writing (12.2) as $y_t = \psi(B)x_t$, where $\psi(z) = \sum_k f_k z^k$ and $B$ is the backward shift operator, i.e. $Bx_t = x_{t-1}$, one gets $f_y(\omega) = |\Psi(\omega)|^2 f_x(\omega) = |\psi(e^{i\omega})|^2 f_x(\omega)$. In this expression $|\Psi(\omega)|$ is the gain of the filter, and $f_x()$ and $f_y()$ are the power spectra of $x_t$ and $y_t$, respectively. The Fourier transform of the deconvolving filter, i.e. its frequency response function, is then given by $[\Psi(\omega)]^{-1}$.
² Of order $(p, q)$ of a random variable $Y$, $\kappa_y(p, q)$ is defined by $\kappa_y(p, q) = c_y(p)\,|c_y(q)|^{-p/q}$, where $c_y(p)$ is the cumulant of order $p$ of $Y$, and where it is assumed that $c_y(q) \neq 0$. The following are examples of cumulants: $c_y(1) = \mu$ (the mean), $c_y(2) = \mu(2) = \sigma^2$, $c_y(3) = \mu(3)$, $c_y(4) = \mu(4) - 3\sigma^4$ and $c_y(5) = \mu(5) - 10\mu(3)\mu(2)$, where the $\mu$'s are the centred moments.

easily computed, and the problem is solved using a suitable gradient-based method
(Appendix E), see e.g. Cadzow (1996), Cadzow and Li (1995), Haykin (1999) and
Hyvärinen (1999) for further details.
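As an illustration of this kurtosis-based blind deconvolution, the following Python sketch performs a plain gradient ascent of the excess kurtosis of the deconvolved signal; the unknown filter, number of taps, learning rate and sample size are illustrative assumptions and not taken from the cited references.

```python
import numpy as np

def blind_deconv_kurtosis(y, n_taps=8, n_iter=500, lr=0.05, seed=0):
    """Gradient ascent on the excess kurtosis of z_t = sum_j g_j y_{t-j};
    a sketch of the idea rather than a production algorithm."""
    rng = np.random.default_rng(seed)
    # lagged design matrix: column j holds y delayed by j samples
    Y = np.column_stack([y[n_taps - 1 - j: len(y) - j] for j in range(n_taps)])
    g = rng.standard_normal(n_taps)
    for _ in range(n_iter):
        z = Y @ g
        m2, m4 = np.mean(z**2), np.mean(z**4)
        dm2 = 2.0 * Y.T @ z / len(z)
        dm4 = 4.0 * Y.T @ z**3 / len(z)
        g += lr * (dm4 / m2**2 - 2.0 * m4 / m2**3 * dm2)   # gradient of m4/m2^2 - 3
        g /= np.linalg.norm(g)                             # kurtosis is scale invariant
    return g

# toy example: super-Gaussian (Laplace) input passed through an unknown filter
rng = np.random.default_rng(1)
x = rng.laplace(size=20000)
y = np.convolve(x, [1.0, 0.5], mode="full")[:len(x)]
g = blind_deconv_kurtosis(y)
z = np.convolve(y, g, mode="same")
kurt = lambda s: np.mean(s**4) / np.mean(s**2)**2 - 3.0
print(round(kurt(y), 2), round(kurt(z), 2))   # kurtosis of z typically exceeds that of y
```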

12.2.2 Blind Source Separation

The previous discussion concerns univariate time series. In the multivariate case
there is also a similar problem of data representation, which is historically related
to ICA. In blind source separation (BSS) we suppose that we observe a multivariate
time series xt = (x1t , . . . , xmt )T , t = 1, . . . , n, which is assumed to be a mixture of
$m$ source signals. This is also equivalent to saying that $\mathbf{x}_t$ is a linear transformation of an $m$-dimensional unobserved time series $\mathbf{s}_t = (s_{1t}, \ldots, s_{mt})^T$, $t = 1, \ldots, n$, i.e. $x_{kt} = \sum_{i=1}^{m} a_{ki}\, s_{it}$, or in matrix form:

xt = Ast , (12.5)

where A = (aij ) and represents the matrix of unknown mixing coefficients. The
components of the vector st are supposed to be independent. The objective is then
to estimate this time series st , t = 1, . . . , n, or similarly the mixing matrix A. This
problem is also known in speech separation as the cocktail-party problem. Briefly,
BSS is a technique that is used whenever one is in the presence of an array of m
receivers recording linear mixtures of m signals.
Remark An interesting solution to this problem is obtained from non-Gaussian
signals st , t = 1, . . . , n. In fact, if st is Gaussian so is xt , and the solution to
this problem is trivially obtained by pre-whitening xt using principal components,
i.e. using only the covariance matrix of xt , t = 1, . . . , n. The solution to the BSS
problem is also obtained as a linear combination of xt , t = 1, . . . , n, as presented
later. Note also that the BSS problem can be analysed as an application to ICA.

12.2.3 Definition of ICA

We assume that we are given an m-dimensional unobserved non-Gaussian random


variable s = (S1 , . . . , Sm )T , with independent components, and we suppose instead
that one observes the mixed random variable x = (X1 , . . . , Xm )T :

x = As, (12.6)

where A = (aij ) and represents the m×m mixing matrix. The objective of ICA is to
estimate both the underlying independent components3 of s and the mixing matrix
A. If we denote the inverse of A by W = A−1 , the independent components are
obtained as a linear combination of x:

u = Wx. (12.7)

In practice one normally observes a sample time series xt , t = 1, . . . , n, and


the objective is to find the linear transformation st = Wxt , such that the
components st1 , . . . , stm are “as statistically independent as possible”. Because the
basic requirement is the non-normality of st , ICA is sometimes considered as non-
Gaussian factor analysis (Hyvärinen et al. 2001).
Historically, ICA originated around the early 1980s as a BSS problem in
neurophysiology (see e.g. Jutten and Herault 1991), in neural networks and spectral
analysis (Comon et al. 1991; Comon 1994) and later in image analysis and signal
processing (Mansour and Jutten 1996; Bell and Sejnowski 1997; Hyvärinen 1999).
A detailed historical review of ICA can be found in Hyvärinen et al. (2001). In
atmospheric science, ICA has been applied much later, see e.g. Basak et al. (2004).
Since non-Gaussianity is a basic requirement in ICA, methods used in ICA are
necessarily based on higher (than two) order moments using numerical methods,
see e.g. Belouchrani et al. (1997) for second-order methods in BSS. Hence, as in
projection pursuit, criteria for non-Gaussianity as well as appropriate algorithms for
optimisation are required to do ICA. These issues are discussed next.

12.3 Independence and Non-normality

12.3.1 Statistical Independence

The objective of ICA is that the transformation s = Wx, see Eq. (12.7), be as
statistically independent as possible. If one denotes by f (s) = f (s1 , . . . , sm ) the
joint probability density of s, then independence means that f (s) may be factorised
as
$$f(s_1, \ldots, s_m) = \prod_{k=1}^{m} f_k(s_k), \qquad (12.8)$$

where fk (sk ) is the marginal density function of sk , k = 1, . . . , m, given by


$$f_k(s_k) = \int_{\mathbb{R}^{m-1}} f(s_1, \ldots, s_m) \prod_{i=1,\, i\neq k}^{m} ds_i. \qquad (12.9)$$

3 Also hidden factors or variables.



It is well known that independence implies uncorrelatedness, but the converse is in


general incorrect.⁴ The Gaussian random variable constitutes an exception where both properties are equivalent. Uncorrelatedness between any two components $s_i$ and $s_j$ means $\mathrm{cov}(s_i, s_j) = 0$. Independence, however, yields non-correlation between $g(s_i)$ and $h(s_j)$ for any⁵ functions $g()$ and $h()$ (Hyvärinen 1999, and references therein) as

$$E\left[g(s_i)h(s_j)\right] - E\left[g(s_i)\right]E\left[h(s_j)\right] = 0,$$

i.e. si and sj are nonlinearly uncorrelated for all types of nonlinearities. This is
clearly difficult to satisfy in practice when using independence. We seek instead a
simpler way to measure independence, and as it will be seen later, this is possible
using information-theoretic approaches.

12.3.2 Non-normality

We have already touched upon non-normality in the previous chapter in relation to


projection pursuit, which is also related to independence as detailed below. In order
to have a measure of non-normality, it is useful to know the various properties of the
Gaussian distribution. This distribution (see Appendix B) is completely specified
by its first- and second-order moments, i.e. its mean and its covariance matrix. In
statistical testing theory various ways exist to test whether a given finite sample of
data comes from a multivariate normal distribution, see e.g. Mardia (1980).6 The
most classical property of the Gaussian distribution, apart from its symmetry (i.e.
zero skewness), is related to its fourth-order cumulant or kurtosis. If X is a zero-
mean Gaussian random variable, then the (excess) kurtosis
$$\kappa_4(X) = E(X^4) - 3\left[E(X^2)\right]^2 \qquad (12.10)$$

vanishes. A random variable with positive kurtosis is called super-Gaussian or


leptokurtic.7 It is typically characterised by a spike at the origin and with fatter
tail than the normal distribution with the same variance. A random variable with

4 Take X to be a standard normal random variable and define Y = X 2 , then X and Y are
uncorrelated but not independent.
5 Measurable.
6 Who referred to the statement “Normality is a myth; there never was, and never will be a normal

distribution” due to Geary (1947). As Mardia (1980) points out, however, this is an overstatement
from a practical point of view, and it is important to know when √
a sample departs from normality.
7 Such as the Laplace probability density function f (x) = √1 e− 2|x| .
2

negative kurtosis is known as sub-Gaussian or platykurtic.8 It is characterised by a


relatively flat density function at the origin and decays fast for large values.
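A quick numerical illustration of the sample version of the excess kurtosis (12.10) and of the super-/sub-Gaussian distinction (the distributions and sample size are arbitrary illustrative choices):

```python
import numpy as np

def excess_kurtosis(x):
    """Sample version of the fourth-order cumulant in Eq. (12.10)."""
    x = x - x.mean()
    return np.mean(x**4) - 3.0 * np.mean(x**2)**2

rng = np.random.default_rng(0)
print(round(excess_kurtosis(rng.standard_normal(10**6)), 3))    # ~ 0   (Gaussian)
print(round(excess_kurtosis(rng.laplace(size=10**6)), 3))       # > 0   (super-Gaussian)
print(round(excess_kurtosis(rng.uniform(-1, 1, 10**6)), 3))     # < 0   (sub-Gaussian)
```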
It is clear that the squared kurtosis can serve as a measure of non-normality and
can be used to find independent components from a sample of time series. Note
that this is similar to blind deconvolution (Shalvi and Weinstein 1990) where the
kurtosis is maximised. The serious drawback of kurtosis is that it is dominated by
a few outliers. A simple alternative is to use robust measures of kurtosis based on
order statistics. The only problem in this case, however, is that this measure becomes
nondifferentiable, and gradient type methods cannot be used. One way forward
is to smooth the empirical distribution function by fitting a smooth cumulative
distribution function and then estimate the order statistics using this smoothed
distribution. There are also other measures that characterise a normal random
variable, such as maximising differential entropy, which is presented next.

12.4 Information-Theoretic Measures

Most measures applied in ICA use information-theoretic quantities, which are


rooted in information theory. The most important quantity involved in those mea-
sures is the differential entropy defined for continuous random variables. Entropy
has already been presented in the previous chapter, and, for convenience, we give
below a brief description of it.

12.4.1 Entropy

The entropy of a random variable can be defined as the average amount of


information given by observing the variable. The more predictable the smaller the
entropy and vice versa. For a discrete variable U taking on the values u1 , . . . , uq ,
with respective probabilities p1 , . . . , pq , the (Shannon) entropy is given by


$$H(U) = -\sum_{k=1}^{q} p_k \ln p_k. \qquad (12.11)$$

For a continuous variable with probability density function f (u), the differential
entropy or Boltzmann H-function is given by

$$H(U) = -\int_{\mathbb{R}} f(u) \ln f(u)\, du. \qquad (12.12)$$

8 Such as the uniform distribution over [−1, 1].



The entropy can also be extended naturally to the multivariate case. If u =


(u1 , . . . , um )T is an m-dimensional random vector with probability density function
f (u), the entropy of u is
$$H(\mathbf{u}) = -\int_{\mathbb{R}^m} f(\mathbf{u}) \ln f(\mathbf{u}) \prod_{k=1}^{m} du_k. \qquad (12.13)$$

A particular property of the differential entropy is that it is maximised for a normal


distribution among all the distributions with the same covariance structure, see
Sect. 3 of the previous chapter. The entropy also enjoys another property related
to independence, namely if the components $u_1, \ldots, u_m$ of $\mathbf{u}$ are independent, then $H(\mathbf{u}) = \sum_{k=1}^{m} H(u_k)$. The entropy, however, is not invariant under covariance
changes. More generally, if u = g(v) is an invertible transformation between the
multivariate random variables u and v whose probability density functions are,
respectively, fu () and fv (), then using the fact that H (u) = −E [ln fu (u)], one gets
 
$$H(\mathbf{u}) = H(\mathbf{v}) + E\left[\ln|J|\right] = -\int_{\mathbb{R}^m} f_v(\mathbf{v})\ln f_v(\mathbf{v})\, d\mathbf{v} + \int_{\mathbb{R}^m} \ln|J|\, f_u(\mathbf{u})\, d\mathbf{u}, \qquad (12.14)$$

where $J$ is the determinant of the Jacobian of the transformation, i.e. $J = \det\left(\frac{\partial\mathbf{u}}{\partial\mathbf{v}}\right)$.


Equation (12.14) derives from the relationship between the densities fu () and
fv () of u and v, respectively,9 i.e. |J |fu (u) = fv (v). In particular, for a linear
transformation u = Av, one gets

H (u) = H (v) + ln |detA|. (12.15)

The entropy gives rise to a number of important measures useful for ICA as detailed
next, see also their usefulness in projection pursuit.

12.4.2 Kullback–Leibler Divergence

We consider here the set of all continuous probability density functions of m-


dimensional random vectors. The Kullback–Leibler (K–L) divergence for the two
probability density functions f () and g() is given by

$$D_{f\|g} = \int f(\mathbf{u}) \ln\frac{f(\mathbf{u})}{g(\mathbf{u})}\, d\mathbf{u} \qquad (12.16)$$

 
⁹ Writing first $E[h(\mathbf{u})] = \int h(\mathbf{u}) f_u(\mathbf{u})\, d\mathbf{u}$ and $E[h(g(\mathbf{v}))] = \int h(g(\mathbf{v})) f_v(\mathbf{v})\, d\mathbf{v} = \int h(\mathbf{u}) f_v(\mathbf{v}) |J|\, d\mathbf{u}$, one obtains the required result.

Properties of the K–L Divergence

• $D_{f\|f} = 0$.
• $D_{f\|g} \ge 0$ for all $f()$ and $g()$.
Proof The classical proof for the second property uses the so-called Gibbs inequality¹⁰ (Gibbs 1902, chap XI, theorem 2), see also Fraser and Dimitriadis (1994). Here we give another sketch of the proof using, as before, ideas from the calculus of variations. Let $f()$ be any given probability density function, and we set the task to compute the minimum of $D_{f\|g}$ considered as a functional of $g()$. Let $\varepsilon()$ be a “small perturbation” function such that $[g + \varepsilon]()$ is still a probability density for a given probability density $g()$, that is, $g + \varepsilon \ge 0$. This necessarily means that $\int \varepsilon(\mathbf{u})\, d\mathbf{u} = 0$. Given the constraint satisfied by $g()$, we consider the objective function $G(g) = D_{f\|g} - \lambda\left(1 - \int g(\mathbf{u})\, d\mathbf{u}\right)$, where $\lambda$ is a Lagrange multiplier. Now using the approximation $\ln\left(1 + \frac{\varepsilon(x)}{g(x)}\right) \approx \frac{\varepsilon(x)}{g(x)}$, one gets $G(g + \varepsilon) - G(g) \approx \int\left(-\frac{f(x)}{g(x)} + \lambda\right)\varepsilon(x)\, dx$. The necessary condition of optimum yields $f \propto g$, i.e. $f = g$ since $f()$ and $g()$ both integrate to 1. It remains to show that $g = f$ is indeed the minimum. Let $F(g) = \int f(x)\ln\frac{f(x)}{g(x)}\, dx$. For any “small” perturbation $\varepsilon()$, we have, keeping in mind that $\ln(1 + \varepsilon)^{-1} \approx -\varepsilon + \varepsilon^2/2$, $F(f + \varepsilon) - F(f) = \frac{1}{2}\int\frac{\varepsilon^2(x)}{f(x)}\, dx + o(\varepsilon^2) \ge 0$. Hence $g = f$ minimises the functional $F()$.
The K–L divergence is sometimes regarded as a distance between probability density functions, when in reality it is not because it is not symmetric.¹¹ The K–L divergence yields an important measure used in ICA, namely mutual information.
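A small numerical check of Eq. (12.16) and of this asymmetry, using two Gaussian densities evaluated on a grid (the grid and the distribution parameters are arbitrary illustrative choices):

```python
import numpy as np

u = np.linspace(-10, 10, 20001)
du = u[1] - u[0]
f = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)                 # N(0, 1)
g = np.exp(-0.5 * (u - 1)**2 / 4) / np.sqrt(2 * np.pi * 4)   # N(1, 4)

def kl(p, q):
    """Discretised version of D_{p||q} = int p ln(p/q)."""
    return np.sum(p * np.log(p / q)) * du

print(round(kl(f, f), 4))   # 0
print(round(kl(f, g), 4))   # > 0
print(round(kl(g, f), 4))   # > 0, but different from kl(f, g): no symmetry
```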

12.4.3 Mutual Information

We have seen that if the components of $\mathbf{u} = (u_1, \ldots, u_m)$ are independent, then the entropy of $\mathbf{u}$, $H(\mathbf{u})$, reduces to $\sum_{i=1}^{m} H(u_i)$. For a general multivariate random variable $\mathbf{u}$, the difference between the two quantities is precisely the mutual information:¹²

$$I(\mathbf{u}) = \sum_{k=1}^{m} H(u_k) - H(\mathbf{u}). \qquad (12.17)$$

¹⁰ Some authors attribute this inequality to Kullback and Leibler (1951).
¹¹ One way to make it symmetric is to consider the extended K–L divergence: $D(f, g) = D_{f\|g} + D_{g\|f} = \int (f - g)\log\frac{f}{g}$.
¹² Bell and Sejnowski (1995) define the mutual information $I(X, Y)$ between input $X$ and output $Y$ of a neural network by $I(X, Y) = H(Y) - H(Y|X)$, where $H(Y|X)$ is the entropy of the output that does not come from the input $X$. The objective in this case is to maximise $I(X, Y)$.

The mutual information is also known as redundancy (Bell and Sejnowski 1995),
and the process of minimising I () is known as redundancy reduction (Barlow 1989).
The mutual information I (u) provides a natural measure of the dependence between
the components of $\mathbf{u}$. In fact, $I(\mathbf{u}) \ge 0$, and $I(\mathbf{u}) = 0$ if and only if $(u_1, \ldots, u_m)$
are independent.
Exercise Show that I (u1 , . . . , um ) ≥ 0.
Hint See below.
The mutual information can also be defined using the K–L divergence. In fact, if $f()$ is the probability density of $\mathbf{u}$, $f_k()$ is the marginal density function of $u_k$, $k = 1, \ldots, m$, and $\tilde{f}(\mathbf{u}) = \prod_{k=1}^{m} f_k(u_k)$, then

$$I(\mathbf{u}) = D_{f\|\tilde{f}} = \int f(\mathbf{u}) \ln\frac{f(\mathbf{u})}{\tilde{f}(\mathbf{u})}\, d\mathbf{u}, \qquad (12.18)$$

and hence $I(\mathbf{u}) \ge 0$.


Exercise Derive the last equation (12.18).

Hint We can rewrite (12.18) to yield $I(\mathbf{u}) = -H(\mathbf{u}) - \int f(\mathbf{u})\ln\tilde{f}(\mathbf{u})\, d\mathbf{u}$. The last term can be expanded as $\int f(\mathbf{u})\ln\tilde{f}(\mathbf{u})\, d\mathbf{u} = \sum_{i=1}^{m}\int_{\mathbb{R}}\ln f_i(u_i)\left[\int_{\mathbb{R}^{m-1}} f(\mathbf{u})\prod_{j\neq i} du_j\right] du_i = -\sum_i H(u_i)$.

12.4.4 Negentropy

Since the entropy is not invariant to variable changes, see Eq. (12.15), it is desirable to construct a similar function that is invariant to linear transformations. Such a measure is provided by the negentropy. If $\mathbf{u}$ is an $m$-dimensional random variable with covariance matrix $\mathbf{\Sigma}$ and $\mathbf{u}_{\Sigma}$ is the multinormal random variable with the same covariance, then the negentropy of $\mathbf{u}$ is given by

$$J(\mathbf{u}) = H(\mathbf{u}_{\Sigma}) - H(\mathbf{u}). \qquad (12.19)$$

Recall that $\mathbf{u}_{\Sigma}$ is the closest to $\mathbf{u}$ and that $H(\mathbf{u}_{\Sigma}) = \frac{1}{2}\ln\left(|\mathbf{\Sigma}|(2\pi e)^m\right)$. Given that the multinormal distribution has maximum entropy among distributions with the same covariance, the negentropy is zero only for a multinormal distribution. The negentropy has already been presented in projection pursuit as a measure of non-normality, see Eq. (11.34) in the previous chapter. Furthermore, from (12.19) one gets $J(\mathbf{u}) - \sum_i J(u_i) = \sum_i H(u_i) - H(\mathbf{u}) + H(\mathbf{u}_{\Sigma}) - \sum_i H(u_{\sigma_i^2})$, where $u_{\sigma_i^2}$ is the normal variable with zero mean and variance $\sigma_i^2 = (\mathbf{\Sigma})_{ii}$; hence,

$$I(\mathbf{u}) = J(\mathbf{u}) - \sum_{i=1}^{m} J(u_i) + \frac{1}{2}\ln\frac{\prod_{i=1}^{m}\sigma_i^2}{|\mathbf{\Sigma}|}. \qquad (12.20)$$

Note that if $u_1, \ldots, u_m$ are uncorrelated, the last term vanishes.
If $\mathbf{u} = \mathbf{W}\mathbf{x}$, then since $I(\mathbf{u}) = \sum_i H(u_i) - H(\mathbf{x}) - \ln|\det\mathbf{W}|$, one gets, when the covariance matrix of $\mathbf{u}$ is the identity matrix, i.e. $\mathbf{W}^T\mathbf{W} = \mathbf{\Sigma}_x^{-1}$, equivalence between negentropy and mutual information:

$$I(\mathbf{u}) = -\sum_{k=1}^{m} J(u_k) + c, \qquad (12.21)$$

where c is a constant not depending on W. Equation (12.21) is very useful when


optimising the objective function with respect to W.

12.4.5 Useful Approximations

One of the main drawbacks of the previously defined information-theoretic mea-


sures is that they all rely on an estimate of the probability density function
of the data. Probability density function estimation is a well known difficult
problem particularly in high dimension. Some useful approximations have been
constructed for the entropy/mutual information based on the Edgeworth expansion,
see Eq. (11.15) in the preceding chapter. Hyvärinen (1998) presents a more robust
approximation of the negentropy:

$$J(y) \approx \sum_i k_i\left[E\left(G_i(y)\right) - E\left(G_i(\nu)\right)\right]^2 \qquad (12.22)$$

with ν a standard normal variable, where ki are positive constants, y is supposed


to be scaled (with zero mean and unit variance) and Gi () are some non-quadratic
functions. Note that Gi () cannot be quadratic, otherwise J (y) would be identically
zero. Example of such functions includes (Hyvärinen 1998; Hyvärinen and Oja
2000)
2
G1 (u) = ln cosh(au), and G2 (u) = − exp − y2 (12.23)

for some $1 \le a \le 2$. For the multivariate case, Hyvärinen (1998) also provides an approximation to the mutual information involving the third- and fourth-order cumulants, namely

$$I(\mathbf{u}) = c + \frac{1}{48}\sum_{i=1}^{m}\left[4\left(\kappa_3(u_i)\right)^2 + \left(\kappa_4(u_i)\right)^2 + 7\left(\kappa_4(u_i)\right)^4 - 6\left(\kappa_3(u_i)\right)^2\kappa_4(u_i)\right] \qquad (12.24)$$

for uncorrelated $u_i$, $i = 1, \ldots, m$, and $c$ is a constant.
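The one-term form of the approximation (12.22) with $G(u) = \ln\cosh(au)$ is easy to sketch in code; the positive constant $k_i$ is dropped (only relative comparisons are intended), and the sample sizes are illustrative assumptions:

```python
import numpy as np

def negentropy_approx(y, a=1.0, n_mc=10**6, seed=0):
    """One-term version of (12.22) with G(u) = ln cosh(a*u); y is assumed
    centred and scaled to unit variance, and the constant k_i is dropped."""
    rng = np.random.default_rng(seed)
    nu = rng.standard_normal(n_mc)                 # standard normal reference
    G = lambda u: np.log(np.cosh(a * u))
    return (G(y).mean() - G(nu).mean())**2

rng = np.random.default_rng(1)
gauss = rng.standard_normal(50000)
lapl = rng.laplace(size=50000) / np.sqrt(2.0)      # unit-variance Laplace
print(negentropy_approx(gauss))                    # close to zero
print(negentropy_approx(lapl))                     # clearly larger (non-Gaussian)
```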

12.5 Independent Component Estimation

Given a multivariate time series of dimension m, xt , t = 1, . . . , n, obtained by


mixing $p$ independent components of an unknown time series $\mathbf{s}_t$, $t = 1, \ldots, n$, the
objective is to find the mixing matrix A and the independent components satisfying
xt = Ast , or in matrix form:

X = AS, (12.25)

where S = (s1 , . . . , sn ). The usual procedure to solve this problem is to find instead
a matrix W, known as matrix of filters, such that ut = Wxt , or in matrix form:

U = WX (12.26)

recovers the underlying independent components13 S. The matrix W is also known


as the weight or decorrelating matrix. In the ICA model (12.6) and (12.7) the filter
matrix is obtained in such a way that the components u1 , . . . , up of U are as
statistically independent as possible. A basic requirement for the ICA problem to be
identifiable is that the components of st are non-Gaussian, see e.g. Hyvärinen et al.
(2001). Hence, in the search process the vector ut = Wxt will have, in particular, to
be maximally non-Gaussian. Various objective functions exist to obtain independent
components, and these are discussed next.

12.5.1 Choice of Objective Function for ICA

Unlike EOFs/PCA where the objective function is quadratic and the solution
is easily obtained, in ICA various objective functions exist along with various
algorithms to compute them. Many objective functions are based on an
information-theoretic approach, but there are also other objective functions using
ideas from neural networks.
Furthermore, these objective functions can be either “one unit” where each ICA
component is estimated one at a time or “several”/“whole units” where several/all ICA

13 Possibly in a different order and rescaled.



components are obtained at once (Hyvärinen 1999). The one-unit case is particularly
similar to projection pursuit where an index of non-normality is maximised. A
particularly interesting independence index for one-unit ICA is provided by the
following criteria:

Negentropy

For a given direction w, the negentropy of the index

y = wT X (12.27)

is obtained using Eq. (12.19) and estimated using, for example, Eq. (12.22). The
maximisation of J (y) yields the first ICA direction. Subsequent ICA directions
can be computed in a similar manner after removing the effect of the previously
identified directions as in PP. Alternatively, one can also estimate negentropy using a
non-parametric estimation of the data probability density function discussed below.

Non-normality

There is a subtle relationship between projection pursuit and ICA. This subtlety
can be made clear using an argument based on the central limit theorem (CLT).
The linear expression in Eq. (12.27) can be expressed as a linear combination of
the (unknown) independent components using Eq. (12.6). A linear combination
of these components is more Gaussian than any of the individual (non-Gaussian)
components. Hence to achieve the objective, one has to maximise non-normality
of the index (12.27) using, e.g. kurtosis, Eq. (12.10) or any other index of non-
normality from the preceding chapter.

Information-Theoretic Approach

Information-theoretic-based approaches, presented in Sect. 12.4, can be used for


both: the one-unit and multi-unit search method. The mutual information, which
incorporates entropy and the Kullback–Leibler divergence, can be used particularly
for the multi-unit case where all independent components are estimated simultane-
ously. This can be made possible owing to the approximation shown in Eq. (12.24).

Likelihood Maximisation Approach

Another approach, based on maximum likelihood estimation (MLE), was presented


by Pham et al. (1992) using the likelihood of model, Eq. (12.7). Using the same
arguments as for variable changes in Sect. 12.4.1, e.g. Eq. (12.14), and recalling that

the components u1 , . . . , um of u are independent, the log-likelihood of model (12.7),


based on the sample, is


$$L = \sum_{t=1}^{n}\ln f_x(\mathbf{x}_t) = \sum_{t=1}^{n}\sum_{k=1}^{m}\ln f_k(\mathbf{w}_k^T\mathbf{x}_t) + n\ln|\det\mathbf{W}|, \qquad (12.28)$$

where wk is the kth column of WT , and fk () is the probability density function of


the kth component uk of u. One major difficulty with (12.28) is that fu () is unknown
and has to be estimated, see e.g. Pham et al. (1992). It is possible to overcome this
difficulty by using non-parametric methods for probability density estimation, as
discussed next.

Information Maximisation Approach

A different approach to solve the ICA problem was developed by Bell and
Sejnowski (1995) by maximising the input–output information (info-max) from a
neural network system. The info-max approach goes as follows. An input x, with
probability density f (), is passed through a (m → m) neural network system with
weight matrix W and bias vector w0 and a sigmoid function g() = (g1 (), . . . , gm ()).
In practice the sigmoids are taken to be identical.14 The output is then given by

y = g (Wx + w0 ) (12.29)

with probability density fy (). The objective is to maximise the information


transferred. This can be achieved by minimising redundancy, a generalisation
of mutual information, or by maximising the output entropy. Noting y =
$(g_1(u_1), \ldots, g_m(u_m))$, where $(u_1, \ldots, u_m)^T = \mathbf{W}\mathbf{x} + \mathbf{w}_0$, and using Eq. (12.14) along with the fact that the Jacobian of transformation (12.29) is $J = |\mathbf{W}|\prod_{k=1}^{m}\frac{d}{du}g_k(u_k)$, the maximisation of $H(\mathbf{y})$ with respect to $\mathbf{W}$ is then achieved
by maximising the second term in the right hand side of Eq. (12.14), i.e. using the
gradient of this term with respect to W. The updating rule is then given by
$$\Delta\mathbf{W} \sim \frac{\partial}{\partial\mathbf{W}}\ln|J| = \frac{\partial}{\partial\mathbf{W}}\ln|\mathbf{W}| + \frac{\partial}{\partial\mathbf{W}}\sum_{k=1}^{m}\ln\left|\frac{d}{du}g_k(u_k)\right|, \qquad (12.30)$$

where we have noted u = (u1 , . . . , um )T = Wx + w0 . The first term in the right


hand side of Eq. (12.30) has been calculated (see Appendix D) and is W−T . For the
second term, the derivative with respect to the individual element wij of W is given
by

¹⁴ For example, $\tanh()$ or the logistic function $g(u) = \frac{e^u}{1+e^u}$.


$$\frac{\partial}{\partial w_{ij}}\sum_{k=1}^{m}\ln\left|\frac{d}{du}g_k(u_k)\right| = \frac{\frac{d^2}{du^2}g_i(u_i)}{\frac{d}{du}g_i(u_i)}\, x_j. \qquad (12.31)$$

Exercise Derive Eq. (12.31).


Similarly we get the derivative with respect to the second argument $\mathbf{w}_0 = \left(w_{1,0}, \ldots, w_{m,0}\right)^T$ in Eq. (12.29): $\frac{\partial\ln|J|}{\partial w_{k,0}} = \frac{\frac{d^2}{du^2}g_k(u_k)}{\frac{d}{du}g_k(u_k)}$. The learning rule for the network is then

$$\Delta\mathbf{W} = \mathbf{W}^{-T} + \mathbf{a}\mathbf{x}^T, \qquad \Delta\mathbf{w}_0 = \mathbf{a}, \qquad (12.32)$$

where

$$\mathbf{a}^T = \left(\frac{\frac{d^2}{du^2}g_1(u_1)}{\frac{d}{du}g_1(u_1)}, \ldots, \frac{\frac{d^2}{du^2}g_m(u_m)}{\frac{d}{du}g_m(u_m)}\right).$$

For example, if the logistic function is used, then $\mathbf{a} = \mathbf{1} - 2\mathbf{y}$. Note that when one has a sample $\mathbf{x}_1, \ldots, \mathbf{x}_n$, the function to be maximised is $E\left[\ln|J|\right]$, where the expectation of $\ln|J|$, and also the gradient, i.e. $E\left[\Delta\mathbf{W}\right]$ and $E\left[\Delta\mathbf{w}_0\right]$, is simply a sample mean.
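As an illustration, here is a minimal Python sketch of this learning rule with the logistic sigmoid on centred, sphered data (the bias $\mathbf{w}_0$ is therefore omitted). For numerical robustness it uses the relative-gradient variant obtained by right-multiplying the update by $\mathbf{W}^T\mathbf{W}$ (see Sect. 12.5.2), i.e. $\Delta\mathbf{W} = \left(\mathbf{I} + (\mathbf{1} - 2g(\mathbf{u}))\mathbf{u}^T\right)\mathbf{W}$ with $\mathbf{u} = \mathbf{W}\mathbf{x}$; the toy sources, mixing matrix, learning rate and iteration count are illustrative assumptions.

```python
import numpy as np

def infomax_ica(Z, lr=0.1, n_iter=300, seed=0):
    """Infomax sketch with logistic sigmoid and relative-gradient update.
    Z is m x n, assumed centred and sphered; lr and n_iter are illustrative."""
    m, n = Z.shape
    rng = np.random.default_rng(seed)
    W = np.eye(m) + 0.01 * rng.standard_normal((m, m))
    for _ in range(n_iter):
        U = W @ Z
        Y = 1.0 / (1.0 + np.exp(-U))                      # logistic outputs
        W += lr * (np.eye(m) + (1.0 - 2.0 * Y) @ U.T / n) @ W
    return W

# toy example: two super-Gaussian (Laplace) sources, mixed then sphered
rng = np.random.default_rng(3)
S = rng.laplace(size=(2, 20000))
A = np.array([[2.0, 1.0], [1.0, 1.5]])
X = A @ S
X -= X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(X))
Z = np.diag(d**-0.5) @ E.T @ X                            # sphered data
U = infomax_ica(Z) @ Z
print(np.round(np.corrcoef(U, S)[:2, 2:], 2))             # ~ permutation of +/-1 and ~0
```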
Remark Using Eq. (12.14) along with Eq. (12.17), the mutual information I (u) is
related to the entropy of the output v = g(u) from a sigmoid function g() via
$$I(\mathbf{u}) = -H\left(g(\mathbf{u})\right) + E\left[\sum_{k=1}^{m}\ln\frac{\left|\frac{d}{du}g_k(u_k)\right|}{f_k(u_k)}\right], \qquad (12.33)$$

where fk () is the marginal probability density of uk .


Exercise Derive the above equation (12.33).
 
Hint First, use Eqs. (12.14) and (12.17): $I(\mathbf{u}) = \sum H(u_i) - H(\mathbf{u}) = \sum H(u_i) - H(\mathbf{v}) + E(\ln|J|)$, where $J = \prod\frac{d}{du}g_k(u_k)$ (i.e. $J = \left|\frac{\partial\mathbf{v}}{\partial\mathbf{u}}\right|$). Next use $H(u_i) = -E[\ln f_i(u_i)]$.


Equation (12.33) indicates that maximum information transmission (i.e. redun-
dancy reduction) can be achieved by optimally aligning or matching the sloping part
of the squashing (sigmoid) function with the input density function, i.e. $\frac{d}{du}g_k() = f_k()$. This means that the sigmoid is precisely the cdf of the input probability density

function. This coding principle applied in a neurobiological context (Laughlin 1981)


has been found to maximise neuron’s information capacity.15
The same equation also points to the fact that if $u_k$, $k = 1, \ldots, m$, are independent with $\frac{d}{du}g_k()$ as respective pdfs, then maximising the joint entropy is
equivalent to minimising mutual information; hence, Infomax and ICA become
equivalent. As pointed out by Jung et al. (2001), there is a difficulty in practice
since this means that the sigmoids have to be estimated. They argue, however,
that for super Gaussian independent component sources, super-Gaussian sigmoids,
e.g. logistic function, give good results. Similarly, for sub-Gaussian independent
component sources, a sub-Gaussian sigmoid is also recommended.
Remarks It has been shown by Cardoso (1997) that Infomax is equivalent to MLE
when the input cumulative distribution functions match the sigmoid functions, see
also Hyvärinen (1999).
The previous approach can be applied simply to just one unit to find individual
independent components at a time, which results precisely in a PP problem.

A Non-parametric Approach

A major difficulty in some of the previous methods is related to the estimation of the
pdf. This is a well known difficult problem particularly in high dimensions. A useful
and practical way to overcome this problem is to use a non-parametric approach for
the pdf estimation using kernel  methods. The objective is to minimise the mutual
information of y = Wx, I (y) = H (yi ) − ln |W| − H (x), see Eq. (12.15). This is
equivalent to minimising


m 
m
F (W) = H (yk ) − ln |W| = − E ln fyk (wTk x) − ln |W|, (12.34)
k=1 k=1

where wk , k = 1, . . . , m, are the columns of WT . Note that Eq. (12.34) is identical


to the likelihood (12.28). Also the inclusion of ln |W| in (12.34) means that this
term will be maximised making W full rank. It is also recommended to choose
wk , k = 1, . . . , m, to be unitary, which makes the objective function bounded, as
in projection pursuit. The most common way to estimate the marginal probability
densities fyk () is to use the kernel smoother. Given a sample xt , t = 1, . . . , n, the
kernel density estimate p() is
 
1 
n
x − xt
p(x) = φ , (12.35)
nh h
t=1

15 By ensuring that all response levels are used with equal frequency.

where $\phi()$ is a kernel density function, usually taken to be the standard Gaussian kernel $\phi(x) = (2\pi)^{-1/2}\exp\left(-x^2/2\right)$, and $h$ is a smoothing parameter. Consequently, using this estimator, the minimisation of Eq. (12.34) becomes

$$\max\left\{F(\mathbf{W}) = \sum_{t=1}^{n}\sum_{k=1}^{m}\log\frac{1}{nh}\sum_{s=1}^{n}\phi\left(\frac{\mathbf{w}_k^T(\mathbf{x}_t - \mathbf{x}_s)}{h}\right) + n\log|\mathbf{W}|\right\} \quad \text{s.t.} \quad \|\mathbf{w}_j\| = 1, \qquad (12.36)$$
for j = 1, . . . , m. The advantage of this approach is that one needs only to estimate
the marginal pdfs and not the joint pdf. Note that Eq. (12.36) can also be used in
projection pursuit to estimate individual directions.

Other Methods

Various other methods exist that deal with ICA, see e.g. Hyvärinen et al. (2001)
for details. But before closing this section, it is worth mentioning a particularly
interesting and easy method to use, based on a weighted covariance matrix, see e.g.
Cardoso (1989). The method is based on finding the eigenvectors of $E\left[\|\mathbf{x}\|^2\mathbf{x}\mathbf{x}^T\right]$, which can be estimated by

$$\hat{\mathbf{\Sigma}} = \frac{1}{n}\sum_{k=1}^{n}\|\mathbf{x}_k\|^2\,\mathbf{x}_k\mathbf{x}_k^T,$$

where the data have been sphered prior to computing this covariance matrix. The
method is based on the assumption that the kurtosis of the different components is
different.
The above nonlinear measures of association between variables can also be used,
in a similar way to the linear covariances or Pearson correlation, to define climate
networks, see Sect. 3.9 (Chap. 3). For example, mutual information (Donges et
al. 2009; Barreiro et al. 2011) and transfer entropy (Runge et al. 2012) have been
used to define climate networks and quantify statistical association between climate
variables.

12.5.2 Numerical Implementation


Sphering/Whitening

Prior to any numerical procedure, in ICA it is important to preprocess the data. The
most common way is to sphere the data, see Chap. 2. After centring, the sphered
(or whitened) variable is obtained from the transformation z = Q (x − E(x)), such
that the covariance matrix of $\mathbf{z}$ is the identity. For our sample covariance matrix $\mathbf{S} = \mathbf{E}\mathbf{\Lambda}^2\mathbf{E}^T$, one can take for $\mathbf{Q}$ the inverse of the square root¹⁶ of $\mathbf{S}$, i.e. $\mathbf{Q} = \mathbf{\Lambda}^{-1}\mathbf{E}^T$. From the sample data matrix $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$, the sphered data matrix $\mathbf{Z}$ corresponds to the standardised PC matrix.
The benefit of sphering has been nicely illustrated in various places, e.g.
Hyvärinen et al. (2001) and Jenssen (2000), and helps simplify calculations. For
example, if the new mixing matrix B, for which z = Bs = QAs, is orthogonal, then
the number of degrees of freedom is reduced from $m^2$ to $m(m-1)/2$.
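A minimal sketch of this sphering step (the toy covariance matrix is an arbitrary illustrative choice):

```python
import numpy as np

# sphering: z = Q (x - E[x]) with Q built from the eigendecomposition of the
# sample covariance matrix, so that cov(z) is the identity (standardised PCs);
# here lam holds the eigenvalues (Lambda^2 in the notation of the text)
rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -2.0], [[4.0, 1.5], [1.5, 2.0]], size=5000).T   # 2 x n
Xc = X - X.mean(axis=1, keepdims=True)
lam, E = np.linalg.eigh(np.cov(Xc))
Q = np.diag(lam**-0.5) @ E.T                        # inverse square root of S
Z = Q @ Xc
print(np.round(np.cov(Z), 2))                       # identity, up to sampling error
```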

Optimisation Algorithms

Once the data have been sphered, the ICA problem can be solved using one of
the previous objective functions. When the objective function is relatively simple,
e.g. the kurtosis (12.10) in the unidimensional case, the optimisation can be
achieved using any algorithm such as gradient type algorithms. For other objective
functions such as those based on information-theoretic approaches, commonly used
algorithms include the Infomax (Bell and Sejnowski 1995), the maximum likelihood
estimation and the FastICA algorithm (Hyvärinen and Oja 2000).
• Infomax
The gradient of the objective function based on Infomax has already been given
in Eq. (12.32). For example, when the logistic sigmoid is used, the learning rule
or algorithm is given by

$$\Delta\mathbf{W} = \mathbf{W}^{-T} + \left[\mathbf{1}_m - 2g(\mathbf{W}\mathbf{x})\right]\mathbf{x}^T, \qquad (12.37)$$

whereas for the tangent hyperbolic, particularly useful for super-Gaussian inde-
pendent components (Hyvärinen 1999), the learning rule becomes $\mathbf{W}^{-T} - 2\tanh(\mathbf{W}\mathbf{x})\mathbf{x}^T$. The algorithm used in Infomax is based on the (stochastic)
gradient ascent of the objective function.17
• FastICA
The FastICA algorithm was introduced by Hyvärinen and Oja (2000) in order to
accelerate convergence, compared to some cases with Infomax using stochastic
gradient. FastICA is based on a fixed point algorithm similar to the Newton
iteration procedure (Hyvärinen 1999). For the one-unit case, i.e. one ICA
at a time, FastICA finds directions of maximal non-normality. FastICA was
developed in conjunction with information-theoretic approaches based on the
negentropy approximation in (12.22), using a non-quadratic function G(), with
derivative g(), and yields, after discarding the constant E (G(ν)),

¹⁶ Note that the square root is not unique; $\mathbf{E}\mathbf{\Lambda}$ and $\mathbf{E}\mathbf{\Lambda}\mathbf{E}^T$ are two square roots, and the last one is symmetric.
17 Hyvärinen (1999) points out that algorithm (12.37) can be simplified by a right multiplication by

WT W to yield the relative gradient method (Cardoso and Hvam Laheld 1996).

 
$$J(\mathbf{w}) = E\left[G\left(\mathbf{w}^T\mathbf{x}_t\right)\right],$$

where $\mathbf{w}$ is unitary. Because of sphering, one has $E\left[(\mathbf{w}^T\mathbf{x}_t)^2\right] = \mathbf{w}^T\mathbf{w} = 1$. Using a Lagrange multiplier, the solution corresponds to the stationary points of the extended objective function $J(\mathbf{w}) - \lambda\left(\|\mathbf{w}\|^2 - 1\right)$, given by

$$F(\mathbf{w}) = E\left[\mathbf{x}_t\, g\left(\mathbf{w}^T\mathbf{x}_t\right)\right] - \lambda\mathbf{w} = \mathbf{0}. \qquad (12.38)$$

The Jacobian of $F(\mathbf{w})$ is $J_F(\mathbf{w}) = E\left[\mathbf{x}_t\mathbf{x}_t^T\frac{dg}{du}(\mathbf{w}^T\mathbf{x}_t)\right] - \lambda\mathbf{I}_m$. Hyvärinen (1999) uses the approximation $E\left[\mathbf{x}_t\mathbf{x}_t^T\frac{dg}{du}(\mathbf{w}^T\mathbf{x}_t)\right] \approx E\left[\mathbf{x}_t\mathbf{x}_t^T\right]E\left[\frac{dg}{du}(\mathbf{w}^T\mathbf{x}_t)\right]$, which is isotropic and easily invertible. Now, using the Newton algorithm, this yields the iterative form:

$$\mathbf{w}_{k+1}^* = \mathbf{w}_k - \frac{E\left[\mathbf{x}_t\, g(\mathbf{w}_k^T\mathbf{x}_t)\right] - \lambda\mathbf{w}_k}{E\left[\frac{dg}{du}(\mathbf{w}_k^T\mathbf{x}_t)\right] - \lambda}, \qquad \mathbf{w}_{k+1} = \frac{\mathbf{w}_{k+1}^*}{\|\mathbf{w}_{k+1}^*\|}. \qquad (12.39)$$

The parameter $\lambda$ can be estimated using Eq. (12.38) through multiplication by $\mathbf{w}^T$, to yield $\lambda = E\left[\mathbf{w}_k^T\mathbf{x}_t\, g\left(\mathbf{w}_k^T\mathbf{x}_t\right)\right]$ at each iteration (a minimal code sketch of this one-unit iteration is given at the end of Sect. 12.5.2).
To estimate several independent components, one applies the same procedure
using projected gradient, as in simplified EOFs (Chap. 4), by imposing an
extra constraint of orthogonality. Note also that if the matrix W is chosen to
be orthogonal, the Infomax learning rule (12.37) becomes comparable to the
algorithm (12.39) written in matrix form. Hyvärinen (1999) argues that the
FastICA algorithm seems to have quadratic and cubic convergence compared to
the linear convergence of gradient-based algorithms. Furthermore, unlike other
gradient methods, the algorithm has no step size to fix. More details on the
properties of the algorithm can be found in Hyvärinen et al. (2001).
• Non-parametric methods
Apart from neural-based algorithms, e.g. Infomax and FastICA, there is also the
gradient-based approach, which can be applied to the non-parametric estima-
tion (12.36). In this approach the smoothing is normally a function of sample
size, and its optimal value is $h_o = 1.06\, n^{-1/5}$ (Silverman 1986). The algorithm can be made more efficient by using an FFT to evaluate the kernel sums entering the objective function and its gradient. The optimisation of (12.36) can be transformed into an unconstrained problem via the change of variable $\mathbf{w}_k = \|\mathbf{p}_k\|^{-1}\mathbf{p}_k$ and using the identity $\frac{\partial}{\partial p_{ij}}\left(\ln|\mathbf{W}|\right) = \left[\mathbf{P}^{-T}\right]_{ij} - \|\mathbf{p}_i\|^{-2}\, p_{ij}$, i.e.

$$\frac{\partial\ln|\mathbf{W}|}{\partial\mathbf{P}} = -\left[\mathrm{diag}\left(\|\mathbf{p}_1\|^2, \ldots, \|\mathbf{p}_m\|^2\right)\right]^{-1}\mathbf{P} + \mathbf{P}^{-T},$$

which can be reported back into the gradient of (12.36). Alternatively, if W


is orthogonal, the last term in (12.36) vanishes, and one can use the projected
gradient algorithm as in simplified EOFs. A few other algorithms have also been
developed, and the reader is directed to Hyvärinen (1999) for further details.
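The one-unit FastICA iteration referred to above can be sketched as follows, with $G(u) = \ln\cosh u$, so that $g = \tanh$ and $g' = 1 - \tanh^2$. For simplicity it uses the commonly used simplified form of the update, which coincides with (12.39) up to the normalisation step, and omits convergence tests and deflation; the toy sources, mixing matrix and iteration count are illustrative assumptions.

```python
import numpy as np

def fastica_one_unit(Z, n_iter=50, seed=0):
    """One-unit FastICA: w <- E[z g(w^T z)] - E[g'(w^T z)] w, then normalise.
    Z is m x n, assumed centred and sphered."""
    m, n = Z.shape
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(m)
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        u = w @ Z
        g, dg = np.tanh(u), 1.0 - np.tanh(u)**2
        w_new = Z @ g / n - dg.mean() * w
        w = w_new / np.linalg.norm(w_new)
    return w

# toy example: two Laplace sources mixed, then sphered, one component extracted
rng = np.random.default_rng(4)
S = rng.laplace(size=(2, 20000))
X = np.array([[2.0, 1.0], [1.0, 1.5]]) @ S
X -= X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(X))
Z = np.diag(d**-0.5) @ E.T @ X
w = fastica_one_unit(Z)
print(np.round(np.corrcoef(w @ Z, S)[0, 1:], 2))   # one entry near +/-1, the other near 0
```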

12.6 ICA via EOF Rotation and Weather and Climate


Application

12.6.1 The Standard Two-Way Problem

The use of ICA in climate research is quite recent, and the number of research
papers is limited. Philippon et al. (2007) employ ICA to extract independent
modes of interannual and intraseasonal variability of the West African vegetation.
Mori et al. (2006) applied ICA to monthly sea level pressures to find the main
independent contributors to the AO signal. See also Basak et al. (2004) for an
analysis of the NAO, Fodor and Kamath (2003) for an application of ICA to
global temperature series and Aires et al. (2000) for an analysis of tropical sea
surface temperatures. The ICA technique has the potential to avoid the PCA “mixing
problem”. PCA has the tendency to mix several modes of comparable magnitude,
often generating spurious regional overlaps or teleconnections where none exists or
distorting existing overlaps or teleconnections (Aires et al. 2002).
There is a wide class of ICA algorithms that achieve approximate independence
by optimising criteria involving higher order cumulants; for example, the JADE
criterion proposed by Cardoso and Souloumiac (1993) performs joint diagonal-
isation of a set of fourth-order cumulant matrices. The orthomax-based criteria
proposed in Kano et al. (2003) are, respectively, quadratic and linear functions of
fourth-order statistics. Unlike higher order cumulant-based methods, the popular
FastICA algorithm chooses a single non-quadratic smooth function (e.g. g(x) =
log cosh x), such that the expectations of this function yield a robust approximation
to negentropy (Hyvärinen et al. 2001). In the next section a criterion is introduced,
which requires the minimisation of the sum of squared fourth-order statistics formed
by covariances computed from squared components.
ICA can be interpreted in terms of EOF rotation. Aires et al. (2002) used a neural
network-based approach to obtain independent components. Hannachi et al. (2009)
presented a more analytical way to ICA via rotation by minimising a criterion based
on the sum of squared fourth-order statistics. The optimal rotation matrix is then
used to rotate the matrix of initial EOFs to enhance interpretation. The data are first
pre-whitened using EOFs, and the ICA problem is then solved by rotating the matrix
of the uncorrelated component scores (PCs), i.e.

Ŝ = YT , (12.40)

for some orthogonal matrix T. The method then uses the fact that if the components
$s_1, \ldots, s_k$ are independent, their squares $s_1^2, \ldots, s_k^2$ are also independent. Therefore
the model covariance matrix of the squared components is diagonal. Given any
orthogonal matrix V, and letting G = YV, the sample covariance matrix between
the element-wise squares of G is

$$\mathbf{R} = \frac{1}{n-1}\left[(\mathbf{G}\circ\mathbf{G})^T\mathbf{H}(\mathbf{G}\circ\mathbf{G})\right], \qquad (12.41)$$

where $\mathbf{H}$ is the centring operator, and $\circ$ denotes the element-wise (Hadamard) matrix product. Hannachi et al. (2009) minimised the rotation criterion:

$$\mathcal{F}(\mathbf{V}) = \frac{1}{2}\left(\|\mathbf{R}\|_F - \|\mathrm{diag}(\mathbf{R})\|_F\right), \qquad (12.42)$$

where $\|\mathbf{R}\|_F = \mathrm{trace}(\mathbf{R}\mathbf{R}^T)$ is the Frobenius norm of $\mathbf{R}$. The matrix $\mathbf{T}$ in (12.40) satisfies $\mathcal{F}(\mathbf{T}) = 0$ and can therefore be estimated by minimising the functional $\mathcal{F}(\cdot)$.
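A small Python sketch of the criterion (12.41)–(12.42) for a given rotation, together with a brute-force scan over 2-D rotation angles on synthetic uncorrelated scores (the toy signals, the 30° angle and the 1° scan step are illustrative assumptions):

```python
import numpy as np

def ica_rotation_criterion(Y, V):
    """Covariance matrix R of the element-wise squares of G = Y V, and the
    sum of its squared off-diagonal entries (criterion (12.41)-(12.42)).
    Y holds uncorrelated component scores (n x k); V is orthogonal."""
    n = Y.shape[0]
    G2 = (Y @ V)**2
    G2c = G2 - G2.mean(axis=0)              # centring operator H applied to G o G
    R = G2c.T @ G2c / (n - 1)
    return 0.5 * (np.sum(R**2) - np.sum(np.diag(R)**2))

# toy example: two independent uniform signals mixed by a rotation;
# scanning the rotation angle recovers (approximately) the unmixing angle
rng = np.random.default_rng(5)
S = rng.uniform(-1, 1, (5000, 2))
theta_true = np.deg2rad(30.0)
Rot = lambda t: np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
Y = S @ Rot(theta_true)                      # plays the role of uncorrelated PC scores
angles = np.deg2rad(np.arange(0, 90, 1.0))
crit = [ica_rotation_criterion(Y, Rot(-t)) for t in angles]
print(np.rad2deg(angles[int(np.argmin(crit))]))   # close to 30 degrees
```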
The following is a simple illustrative example that compares two models of the
Arctic Oscillation (AO), see Hannachi et al. (2009) for more details and further
references.

Model AO1: x = A1 s (12.43)

and

Model AO2: x = A2 s, (12.44)

where x is a three-dimensional vector of the observed time series representing the


SLP anomalies at the centres of action of the AO, namely, Arctic, North Pacific and
North Atlantic, $\mathbf{s}$ is a two-element vector of independent variables, and $\mathbf{A}_1$ and $\mathbf{A}_2$ are
two mixing matrices given by
$$\mathbf{A}_1 = \begin{pmatrix} 2 & 0 \\ -1 & 1 \\ -1 & -1 \end{pmatrix} \quad \text{and} \quad \mathbf{A}_2 = \begin{pmatrix} 1 & 1 \\ -1 & 0 \\ 0 & -1 \end{pmatrix}.$$

Assuming the components of s to be uniformly distributed over [−1, 1], Fig. 12.1
shows PC1 vs PC2 of the above models. The figure shows clearly that the two
models are actually distinguishable. However, this is not the case when we compute
the covariance matrix of these models, which are proportional since

$$\mathbf{A}_1\mathbf{A}_1^T = 2\,\mathbf{A}_2\mathbf{A}_2^T. \qquad (12.45)$$
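The proportionality (12.45) is easy to verify numerically, and a quick simulation (sample size is an arbitrary choice) shows that second-order statistics cannot distinguish the two models:

```python
import numpy as np

# the two AO mixing matrices of models (12.43)-(12.44); the covariance
# matrices of the mixed signals are proportional, Eq. (12.45)
A1 = np.array([[2.0, 0.0], [-1.0, 1.0], [-1.0, -1.0]])
A2 = np.array([[1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
print(A1 @ A1.T)
print(2 * A2 @ A2.T)

# simulated versions of the models with uniform sources on [-1, 1]
rng = np.random.default_rng(6)
s = rng.uniform(-1, 1, (2, 20000))
x1, x2 = A1 @ s, A2 @ s
print(np.round(np.cov(x1) - 2 * np.cov(x2), 2))   # ~ 0: second moments are identical
```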



Fig. 12.1 Scatter plot of PC1 vs PC2 (a, b) and IPC1 vs IPC2 (c, d) of models AO1 (a, c) and
AO2 (b, d). Adapted from Hannachi et al. (2009) ©American Meteorological Society. Used with
permission

The joint density of the latent signals s1 and s2 is uniform on a square. In Fig. 12.1a
the joint density of the scores of EOF 1 and EOF 2 for Model AO1 is uniform on
a square, and PC 1 and PC 2 are independent. In Fig. 12.1b, the joint density of
the scores of EOF 1 and EOF 2 cannot be expressed as the product of the marginal
densities, and the components are not independent. When the above procedure is
applied, the independent components are extracted successfully (Fig. 12.1c, d). An
example of application of ICA via rotation to monthly SLP from NCEP-NCAR was
presented by Hannachi et al. (2009).
Monthly SLP anomalies for the period Jan 1948–Nov 2006 were used. Fig-
ure 12.2 shows grid points where the time series of SLP anomalies depart signifi-
cantly, at the 5% level, from normality. It is clear that non-Gaussianity is a dominant
feature of the SLP anomalies. The data were first reduced by applying an EOF
analysis, and ICA rotation was applied to various number of EOFs. Figure 12.3
shows the first five independent principal components (IPCs) obtained by rotating
the first five component scores. Their cross-correlations have been checked and
found to be zero according to some prescribed level of machine precision. The cross-
correlations of various nonlinear transformations of IPCs have also been computed
and compared to those obtained using PCs.

Fig. 12.2 Grid points in the northern hemisphere where sea level pressure anomalies depart from Gaussianity at the 1% significance level using a KS test. ©American Meteorological Society. Used with permission


Fig. 12.3 Independent principal components obtained via EOF rotation of the first five PCs of the
monthly mean SLP anomalies. Adapted from Hannachi et al. (2009). ©American Meteorological
Society. Used with permission

The upper triangular part of the matrix in Table 12.1 shows the absolute values of
the correlations between the element-wise fourth power of IPCs 1–5. The lower part
shows the corresponding correlations using the PCs instead of the IPCs. Significant

Table 12.1 Correlation matrix of the fourth power elements of ICs 1 to 5 (above the diagonal)
and the same correlation but for the PCs (below the diagonal). The sign of the correlations has
been dropped. Bold faces and underlined values refer to significant correlations at 1% and 5%
levels, respectively
              IPC1/PC1   IPC2/PC2   IPC3/PC3   IPC4/PC4   IPC5/PC5
IPC1/PC1      1          0          0.01       0          0
IPC2/PC2      0.08       1          0.02       0.02       0.01
IPC3/PC3      0          0.01       1          0          0.01
IPC4/PC4      0.01       0.01       0.08       1          0.02
IPC5/PC5      0.1        0.05       0.04       0.01       1

Table 12.2 As in Table 12.1 but using the third power of the absolute value function

              IPC1/PC1   IPC2/PC2   IPC3/PC3   IPC4/PC4   IPC5/PC5
IPC1/PC1      1          0.02       0          0.02       0.03
IPC2/PC2      0.13       1          0.04       0.05       0.03
IPC3/PC3      0.02       0.05       1          0.03       0
IPC4/PC4      0.05       0.04       0.11       1          0.05
IPC5/PC5      0.14       0.09       0.07       0          1

correlations at the 1% and 5% levels, respectively, are indicated below the diagonal.
Table 12.2 is similar to Table 12.1 except that now the nonlinear transformation is
the absolute value of the third power law. Note again the significant correlations
in the transformed PCs in the lower triangular part of Table 12.2, whereas no such
significant correlations are obtained with the IPCs.
Hannachi et al. (2009) computed those same correlations using various other
nonlinear functions and found consistent results with Tables 12.1 and 12.2. Note
that the IPCs reflect also large non-normality, as can be seen from the skewness
of the IPCs. For example, Fig. 12.4 shows the q–q plots of all SLP IPCs. The
straight diagonal lines in Fig. 12.4 are for the normal distribution, and any departure
from these lines reflects non-normality. Clearly, all the q–q plots display strong
nonlinearity, i.e. non-normality. A formal KS test reveals that the null hypothesis
of normality is rejected for the first three IPCs at 1% significance level and for the
last two IPCs at 5% level. The non-normality of the PCs has also been checked and
compared with that of the IPCs using the q–q plot (not shown). It is found that the
IPCs are more non-normal than the PCs.
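The normality checks just described can be reproduced along the following lines. This is only an illustrative sketch using standard scipy/matplotlib calls; the component time series is assumed to be available as a one-dimensional array.

# Sketch: KS test against the standard normal and a q-q plot for one component.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def check_normality(series, name="IPC1"):
    z = (np.asarray(series) - np.mean(series)) / np.std(series)
    ks_stat, p_value = stats.kstest(z, "norm")
    print(f"{name}: KS statistic = {ks_stat:.3f}, p-value = {p_value:.4f}")
    stats.probplot(z, dist="norm", plot=plt)    # q-q plot against N(0, 1)
    plt.title(f"Quantile-quantile plot, {name}")
    plt.show()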
The spatial patterns associated with the IPCs are shown in Fig. 12.5. In principle,
the rotated EOFs have no natural ranking, but the order of the rotated EOFs in
Fig. 12.1 is based on the amount of variance explained by those patterns. The first
REOF looks like the Arctic Oscillation (AO) pattern,18 and the second REOF repre-
sents the NAO. The fourth pattern, for example, is reminiscent of the Scandinavian
teleconnection pattern. Figure 12.6 shows, respectively, the correlation map between

18 Note, however, that the middle centre of action is displaced from the pole and shifted towards
northern Russia.


Fig. 12.4 Quantile plots on individual IPCs of the monthly SLP anomalies. Adapted from
Hannachi et al. (2009). ©American Meteorological Society. Used with permission

Fig. 12.5 Spatial patterns associated with the leading five IPCs. The order is arbitrary. Adapted
from Hannachi et al. (2009). ©American Meteorological Society. Used with permission

Fig. 12.6 Correlation map of monthly SLP anomaly IPC1 (top) and IPC2 (bottom) with HadISST
monthly SST anomaly for the period January 1948–November 2006. Only significant values, at
1% level, are shown. Correlations are multiplied by 100. Adapted from Hannachi et al. (2009).
©American Meteorological Society. Used with permission

SLP IPC1 (Fig. 12.6, top) and SLP IPC2 (Fig. 12.6, bottom) with the Hadley Centre
Sea Ice and Sea Surface Temperature (HadISST). It can be seen, in particular, that
the correlation pattern with IPC1 reflects the North Pacific oscillation, whereas the
correlation with IPC2 reflects well the NAO pattern. The same rotation was also
applied in the context of exploratory factor analysis by Unkel et al. (2010), see
Chap. 10.

12.6.2 Extension to the Three-Way Data

The above independent component analysis applied to two-way data was extended
to three-way climate data by Unkel et al. (2011). Three-way data consist of data that
are indexed by three indices, such as time, horizontal space and vertical levels. For
this four-dimensional case (3 spatial dimensions plus time), with measurements on J
horizontal grid points at K vertical levels for a sample size n, the data are represented
by the third-order tensor:

X = (x_jtk),   j = 1, . . . , J,  t = 1, . . . , n,  k = 1, . . . , K.   (12.46)

A standard model for the three-way data, Eq. (12.46), is the three-mode Parafac
model with R components (e.g. Carroll and Chang 1970; Harshman 1970):


x_jtk = Σ_{r=1}^{R} a_jr b_tr c_kr + ε_jtk,   j = 1, . . . , J,  t = 1, . . . , n,  k = 1, . . . , K.   (12.47)
In Eq. (12.47), A = (a_jr), B = (b_tr) and C = (c_kr) are the component matrices (or modes) of the model, and ε = (ε_jtk) is an error term.
A slight modification of the Parafac model, Eq. (12.47), was used by Unkel et
al. (2011), based on the Tucker3 model (Tucker 1966), which yields the following
model:

X_A = A (C |⊗| B)^T + E_A,   (12.48)

where the J × (nK) matrices X_A and E_A are reshaped versions of X = (x_jtk) and ε = (ε_jtk), obtained from the K frontal slices of those tensors, and |⊗| is the columnwise Kronecker (or Khatri–Rao) matrix product. The matrices A, B and C are obtained by minimising the cost function:

F = ‖X_A − A A^T X_A (C C^T ⊗ B B^T)‖²_F,   (12.49)

where ⊗ is the Kronecker matrix product (see Appendix D), by using an alternating least squares optimisation algorithm. The estimate Â of A is then rotated towards
independence based on the algorithm of the two-way method discussed above, see
Eq. (12.41).
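A minimal numpy sketch of the two ingredients of this fit, the columnwise Kronecker (Khatri–Rao) product and the cost function of Eq. (12.49), is given below. It is not the implementation of Unkel et al. (2011), and the column ordering of the Kronecker product is an assumption that must match the chosen reshaping convention for X_A.

# Khatri-Rao product and the cost function F of Eq. (12.49) (illustrative only).
import numpy as np

def khatri_rao(C, B):
    # columnwise Kronecker product: column r is kron(C[:, r], B[:, r])
    assert C.shape[1] == B.shape[1]
    return np.column_stack([np.kron(C[:, r], B[:, r]) for r in range(C.shape[1])])

def tucker_like_cost(XA, A, B, C):
    # F = || XA - A A^T XA (C C^T kron B B^T) ||_F^2
    M = np.kron(C @ C.T, B @ B.T)      # ordering must follow the unfolding of XA
    resid = XA - A @ A.T @ XA @ M
    return np.linalg.norm(resid, "fro") ** 2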
Unkel et al. (2011) applied the method to the NCEP/NCAR geopotential heights
by using 5 components in modes A and C and 6 components in B. The data represent
winter (Nov–Mar) means for the period 1949–2009 with 2.5◦ ×2.5◦ horizontal grid,
north of 20◦ N , sampled over 17 vertical levels, so J = 4176, n = 61 and K = 17.
Figure 12.7 shows the rotated patterns. Patterns (i) and (iii) suggest modes related
to stratospheric activity, showing two different phases of the winter polar vortex
during sudden stratospheric warming (e.g. Hannachi and Iqbal 2019). Pattern (ii)

Fig. 12.7 Rotated three-way independent component analysis of the winter means geopotential
heights from NCEP/NCAR. The order is arbitrary. Adapted from Unkel et al. (2011)

is dominated by the Scandinavian pattern, prominent in mid-troposphere, while the


last mode represents the North Atlantic Oscillation, a prominent pattern in the lower
and mid-troposphere.

12.7 ICA Generalisation: Independent Subspace Analysis

Like projection pursuit, conventional ICA is a one-dimensional algorithm; that is, the algorithm attempts to identify unidimensional (statistically) independent
components. Given that the climate system is characterised by complex nonlinear
interactions between many processes, the concept of (conventional) ICA may not be
realistic. A more reasonable approach is to relax the ICA assumption in favour of
looking for groups of independent sources, leading to the concept of independent
subspace analysis (Pires and Ribeiro 2017; Pires and Hannachi 2017). In Pires
and Hannachi (2017), for example, high-order cumulant tensors of coskewness and
cokurtosis, given, respectively, by

S(y)_ijk = E(Y_i Y_j Y_k)
K(y)_ijkl = E(Y_i Y_j Y_k Y_l) − E(Y_i Y_j) E(Y_k Y_l) − E(Y_i Y_k) E(Y_j Y_l) − E(Y_i Y_l) E(Y_j Y_k),   (12.50)

for i, j, k, l in {1, . . . , m}, where m is the dimension of the random vector y = (Y_1, . . . , Y_m)^T, are analysed via the Tucker decomposition or high-order singular
value decomposition (Tucker 1966; Lathauwer et al. 2000). A generalised mutual
information was then constructed and minimised. To overcome the handicap of
serial correlation, sample size and high dimensionality statistical tests were used
to obtain robust non-Gaussian subspaces, which are taken to represent independent
subspaces.
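For illustration, the cumulant tensors of Eq. (12.50) can be estimated from a zero-mean data matrix with a few lines of numpy, replacing expectations by sample means; the sketch below is not the code of Pires and Hannachi (2017).

# Sample co-skewness and co-kurtosis tensors of Eq. (12.50); y has shape (n, m)
# and is assumed to have zero-mean (pre-centred) columns.
import numpy as np

def coskewness(y):
    n = y.shape[0]
    return np.einsum("ti,tj,tk->ijk", y, y, y) / n

def cokurtosis(y):
    n = y.shape[0]
    C = y.T @ y / n                                    # E(Yi Yj)
    K = np.einsum("ti,tj,tk,tl->ijkl", y, y, y, y) / n
    K -= (np.einsum("ij,kl->ijkl", C, C)
          + np.einsum("ik,jl->ijkl", C, C)
          + np.einsum("il,jk->ijkl", C, C))
    return K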
Pires and Hannachi (2017) applied the methodology to global sea surface
temperature (SST) anomalies and identified a number of independent components
and independent subspaces. In particular, the second component IC2 was found to
project onto the Atlantic Niño and the Atlantic Multidecadal Oscillation (AMO)
and was significantly separated from the remaining four components. These latter
components were not entirely independent and are dominated primarily by the
dyad (IC1 and IC5). The pattern IC1 represents El-Niño conditions combined with
negative Pacific Decadal Oscillation (PDO) and the positive phase of the North
Pacific Gyre Oscillation (NPGO). The pattern IC5 has a positive sign when SST
anomalies in the South Pacific have a zonal dipole. The patterns associated with IC1
and IC5 share overlapping projection onto the NPGO.
Chapter 13
Kernel EOFs

Abstract This chapter describes a different way to obtain nonlinear EOFs via
kernel EOFs based on kernel methods. The kernel EOF method is based on mapping
the data onto a feature space and helps delineate complex structures. The chapter
discusses various types of transformations to obtain kernel EOFs, with particular
focus on the Gaussian kernel and its application to data from models and reanalyses.

Keywords Discriminant analysis · Feature space · Kernel EOFs · Kernel trick ·


Reproducing kernels · Gaussian kernel · Spectral clustering · Modularity
clustering · Quasi-geostrophic model · Flow tendency · Bimodality · Zonal
flow · Blocking

13.1 Background

It has been suggested that the large scale atmospheric flow lies on a nonlinear
manifold due to the nonlinearity involved in the dynamics of the system. Never-
theless, it is always possible to embed this manifold in a high-dimensional linear
space. The system may have two or more substructures, e.g. attractors, that one
would like to identify and separate. This problem belongs to the field of pattern
recognition. Linear spaces have the characteristic that “linear” patterns can be, in
general, identified efficiently using, for example, linear discriminant analysis or
LDA (e.g. McLachlan 2004). In general patterns are characterised by an equation of
the form:

f (x) = 0. (13.1)

For linear patterns, the function f (.) is normally linear up to a constant. Figure 13.1
illustrates a simple example of discrimination where patterns are obtained as the
solution of f (x) =< w, x > +b = 0, where “<, >” denotes a scalar product in the
linear space.
Given, however, the complex and nonlinear nature involved in weather and
climate, one expects in general the presence of different forms of nonlinearity and



Fig. 13.1 Illustration of linear discriminant analysis in linear spaces permitting a separation
between two sets or groups shown, respectively, by crosses and circles

where patterns or coherent structures in the input space are not easily separable. The
possibility then arises of attempting to embed the data of the system into another
space we label “feature space” where complex relationships can be simplified and
discrimination becomes easier and efficient, e.g. linear, through the application of
a nonlinear transformation φ(.). Figure 13.2 shows an example sketch of the
situation where the (nonlinear) manifold separating two groups in the input space
becomes linear in the feature space. Consider, as an example, and for the sake of
illustration, the case where the input space contains data that are “polynomially”
nonlinear. Then, it is possible that a transformation into the feature space involving
separately all monomials constituting the nonlinear polynomial would lead to
a simplification of the initial complex relationship. For example, if the input
space (x1 , x2 ) is two-dimensional and contains quadratic nonlinearities, then the
five-dimensional space obtained by considering all monomials of degree smaller
than 3, i.e. (x_1, x_2, x_1^2, x_2^2, x_1x_2), would disentangle the initial complex
relationships. Figure 13.2 could be an (ideal) example of polynomial nonlinearity
where the feature becomes a linear combination of the coordinates and linear
discriminant analysis, for example, could be applied. In general, however, the
separation will not be that trivial, and the hypersurface separating the different
groups could be nonlinear, but the separation remains feasible. An illustration of
this case is sketched in Fig. 13.3. This could be the case, for instance, when the
nonlinearity is not polynomial, as will be detailed later. It is therefore the objective


Fig. 13.2 Illustration of the effect of the nonlinear transformation φ(.) from the input space
(left) into the feature space (right) for a trivial separation between groups/clusters in the feature
space. Adapted from Hannachi and Iqbal (2019). ©American Meteorological Society. Used with
permission


Fig. 13.3 As in Fig. 13.2 but for a curved separation between groups or clusters in the feature
space

of kernel EOF method (or kernel PCA) to find an embedding or a transformation that
allows such easier discrimination or pattern identification to be revealed. The kernel
method can also be extended to include other related approaches such as principal
oscillation patterns (POPs), see Chap. 6, or maximum covariance analysis (MCA),
see Sect. 14.3.
Kernel PCA is not very well known in atmospheric science; it was first
applied by Richman and Adrianto (2010). These authors applied kernel EOFs to

classify sea level pressure over North America and Europe. They identified two
clusters for each domain in January and also in July. Their study suggests that kernel
PCA captures the essence of data more accurately compared to conventional PCA.

13.2 Kernel EOFs

13.2.1 Formulation of Kernel EOFs

Let xt , t = 1, . . . , n be a sample of a p-dimensional time series with zero mean


and covariance matrix S. Kernel EOF method (Schölkopf et al. 1998) is a type
of nonlinear EOF analysis. It is based on an EOF analysis in a different space,
the feature space, obtained using a (nonlinear) transformation φ(.) from the p-
dimensional input data space into a generally higher dimensional space F. This
transformation associates to every element x from the input data space an element
ξ = φ(x) from the feature space. Hence, if φ(xt ) = ξ t , t = 1, . . . , n, the kernel
EOFs are obtained as the EOFs of this newly transformed data. The covariance
matrix in the feature space F is computed in the usual way and is given by

S = (1/n) Σ_{t=1}^{n} ξ_t ξ_t^T = (1/n) Σ_{t=1}^{n} φ(x_t) φ(x_t)^T.   (13.2)

Now, as for the covariance matrix S in the input space, an eigenvector v of S with
eigenvalue λ satisfies


nλ v = Σ_{t=1}^{n} φ(x_t) (φ(x_t)^T v),   (13.3)

and therefore, any eigenvector must lie in the subspace spanned by φ(xt ),
t = 1, . . . , n. Denoting α_t = φ(x_t)^T v, and K = (K_ij), where K_ij = ξ_i^T ξ_j = φ(x_i)^T φ(x_j), an eigenvector v of S, with non-zero eigenvalue λ, is found by solving
the following eigenvalue problem:

Kα = nλα. (13.4)

In fact, letting α = (α_1, . . . , α_n)^T and assuming it satisfies (13.4), and letting v = Σ_{s=1}^{n} α_s φ(x_s), it is straightforward to see that

Sv = (1/n) Σ_{t=1}^{n} ξ_t (Σ_{s=1}^{n} ξ_t^T ξ_s α_s) = (1/n) Σ_{t=1}^{n} ξ_t (Σ_{s=1}^{n} K_ts α_s),

which is precisely (1/n) Σ_{t=1}^{n} nλ ξ_t α_t = λ v.


Exercise Starting from v = Σ_{s=1}^{n} α_s φ(x_s), show, without using (13.4), that α = (α_1, . . . , α_n)^T is an eigenvector of K.

Answer Since v is an eigenvector of S, then Σ_{t=1}^{n} φ(x_t) φ(x_t)^T v = nλ v, yielding, after a little algebra, Σ_k Σ_s α_s K_ks φ(x_k) = nλ Σ_k α_k φ(x_k). This equality is valid for any eigenvector (one can also assume the vectors φ(x_k), k = 1, . . . , n, to be independent), and thus we get Σ_s α_s K_ks = nλ α_k, which is precisely (13.4).
Denoting the kth EOF by v_k = Σ_{t=1}^{n} α_t^(k) φ(x_t), the PCs in the feature space are computed using the dot product

φ(x)^T v_k = Σ_{t=1}^{n} α_t^(k) φ(x)^T φ(x_t).   (13.5)

The main problem with this expression is that, in general, the nonlinear mapping
φ() lies in a very high-dimensional space, and the computation is very expensive
in terms of memory and CPU time. For example, if one considers a polynomial
transformation of degree q, i.e. the mapping φ() will contain all the monomials of order q, then we get a feature space of dimension (p+q−1)!/(q!(p−1)!) ∝ p^q.

An elegant alternative considered by Boser et al. (1992) and Schölkopf et al.


(1998) is to use the so-called kernel trick. It is based on a kernel representation of
the dot product and finds a kernel function K satisfying

K (x, y) = φ(x)T φ(y). (13.6)

So, rather than choosing φ() first, one chooses K instead, which then yields φ()
using (13.6). Of course, many kernels exist, but not all of them satisfy Eq. (13.6).
Let us consider a positive self-adjoint (Appendix F) integral operator K() defined
for every square integrable function over the p-dimensional space Rp , i.e.

K(f) = ∫_{R^p} K(x, y) f(y) dy,   (13.7)

with the property


 
< K(f ), f >= f (x)K(x, y)f (y)dxdy ≥ 0, (13.8)

where the symbol <, > denotes a scalar product defined on the space of integrable
functions defined on Rp . Then, using Mercer’s theorem, see e.g. Mikhlin (1964)
and Moiseiwitsch (1977), and solving for the eigenfunctions of the linear integral
operator (13.7), K(f ) = λf , i.e.

∫_{R^p} K(x, y) f(y) dy = λ f(x),   (13.9)

Eqs. (13.7)–(13.9) then yield the expansion

K(x, y) = Σ_{k=1}^{∞} λ_k φ_k(x) φ_k(y),   (13.10)

where λk ≥ 0 and φk (.), k = 1, 2, . . ., are, respectively, the eigenvalues and


corresponding eigenfunctions of the integral operator1 K(.) given in Eq. (13.7) and
satisfying the eigenvalue problem (13.9). Such kernels are known as reproducing
kernels. Note that the eigenfunctions are orthonormal with respect to the scalar
product (13.8).
Remark The above result given in Eq. (13.10) is also known in the literature of
functional analysis as the spectral decomposition theorem.
Combining Eqs. (13.6) and (13.10), it is possible to define the mapping φ(.) by
φ(x) = (√λ_1 φ_1(x), √λ_2 φ_2(x), . . .)^T.   (13.11)

We can further simplify the problem of generating kernels by taking

K(x, y) = K1T (x)K1 (y), (13.12)

where K1 is any arbitrary integral operator. For example, if K1 (.) is self-adjoint,


then K_1^2(.) is reproducing. There are two main classes of reproducing kernels: those
with finite-dimensional feature spaces such as polynomial kernels and those with
infinite-dimensional feature spaces involving non-polynomial kernels. A number of
examples are given below for both types of kernels. These different kernels have
different performances, which are discussed below in connection with an example
of concentrated clusters, see Sect. 13.2.3.
Examples of Kernels
• Consider the case of the three-dimensional input space, i.e. p = 3, and K(x, y) = (x^T y)^2; then the transformation φ(.) is given by

φ(x) = (x_1^2, x_2^2, x_3^2, √2 x_1x_2, √2 x_1x_3, √2 x_2x_3)^T.

The dimension of φ(x) is p(p + 1)/2 = 6.


• The previous example can be extended to a polynomial of degree q by consider-
ing K(x, y) = (x^T y)^q. In this case the dimension of φ(x) (see above) is of the order O(p^q).

1 Note that the convergence in (13.10) is in the space of square integrable functions, i.e. lim_{k→∞} ∫∫ |K(x, y) − Σ_{i=1}^{k} λ_i φ_i(x) φ_i(y)|² dx dy = 0.

• The Gaussian kernel: K(x, y) = exp(−‖x − y‖²/(2σ²)).
• Other examples include K(x, y) = exp((x^T y)²/(2a²)) and K(x, y) = tanh(α x^T y + β).

In these examples the vectors φ(x) are infinite-dimensional.
One of the main benefits of using the kernel transformation is that the kernel
K(.) avoids computing the large (and may be infinite) covariance matrix S, see
Eq. (13.2), and permits instead the computation of the n × n matrix K = (K_ij) and the associated eigenvalues/eigenvectors λ_k and α_k, k = 1, . . . , n. The kth extracted PC of a target point x from the input data space is then obtained using the expression:

v_k^T φ(x) = Σ_{t=1}^{n} α_{k,t} K(x_t, x).   (13.13)

This means that the computation is not performed in the very high-dimensional
feature space, but rather in the lower n-dimensional space spanned by the images of
xt , t = 1, . . . , n.
Remark One sees that in ordinary PCA, one obtains min(n, p) patterns, whereas in
kernel PCA one can obtain up to max(n, p) patterns.

13.2.2 Practical Details of Kernel EOF Computation

The computation of kernel EOFs/PCs within the feature space is based on the
assumption that the transformed data φ(x1 ), . . . , φ(xn ) have zero mean, i.e.

(1/n) Σ_{t=1}^{n} φ(x_t) = 0,   (13.14)

so that S represents genuinely a covariance matrix. In general, however, the mapping φ(.) is rarely explicitly accessible, and therefore the zero-mean property may not be verified directly on the images within the feature space. Since we are using the kernel K(.) to form the kernel matrix K and compute the kernel EOFs, we go back to the original expression of the kernel matrix K = (K_ij), see Eq. (13.6), i.e. K_ij = φ(x_i)^T φ(x_j), and use φ(x_t) − φ̄, where φ̄ = (1/n) Σ_{t=1}^{n} φ(x_t). The centred kernel matrix then becomes

K̃_ij = (φ(x_i) − φ̄)^T (φ(x_j) − φ̄).   (13.15)

This yields the (centred) Gram matrix:



K̃ = K − (1/n) 1_{n×n} K − (1/n) K 1_{n×n} + (1/n²) 1_{n×n} K 1_{n×n},   (13.16)

where 1n×n is the n × n matrix containing only ones.


Exercise Derive the above expression of the centred Gram matrix.

Indication The above expression yields K̃_ij = K_ij − (1/n) Σ_{t=1}^{n} K_tj − (1/n) Σ_{t=1}^{n} K_it + (1/n²) Σ_{t,s=1}^{n} K_ts. In matrix form this expression is identical to the one given in Eq. (13.16).
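The whole procedure can be summarised in a short numpy sketch: build the Gaussian Gram matrix, centre it as in Eq. (13.16), solve the eigenvalue problem (13.4) and evaluate the kernel PCs of the training points via Eq. (13.13). This is a minimal illustration, not an optimised implementation, and the bandwidth σ is left as a free parameter.

# Minimal kernel EOF/PC computation with a Gaussian kernel (illustrative).
import numpy as np

def gaussian_gram(X, sigma):
    # X: (n, p) data matrix; Kij = exp(-||xi - xj||^2 / (2 sigma^2))
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma**2))

def kernel_eofs(X, sigma, n_components=2):
    n = X.shape[0]
    K = gaussian_gram(X, sigma)
    J = np.ones((n, n)) / n
    K_tilde = K - J @ K - K @ J + J @ K @ J        # centring, Eq. (13.16)
    eigval, eigvec = np.linalg.eigh(K_tilde)       # ascending order
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]
    alphas = eigvec[:, :n_components]              # eigenvectors of Eq. (13.4)
    # kernel PCs of the training points (Eq. 13.13 with unit-norm feature-space EOFs)
    kpcs = alphas * np.sqrt(np.maximum(eigval[:n_components], 0.0))
    return kpcs, alphas, eigval / n                # eigval/n are the eigenvalues of S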

13.2.3 Illustration with Concentric Clusters

We mentioned above that different kernels have different performances. Consider,


for example, polynomial kernels. One main weakness of these kernels is the lack
of a prior knowledge of the dimension of the feature space. And trying different
dimensions is not practical and useful as the dimension increases as a power law
of the degree of the polynomial. Another weakness is that polynomial kernels
are not localised, which can be problematic given the importance of locality
and neighbourhood in nonlinear dynamics. One way to overcome some of these
weaknesses is to use kernels with localised structures such as kernels with compact support. A typical localised kernel is the Gaussian kernel. Despite not having truly compact support, the Gaussian kernel is well localised, because of the
exponential decay, a nice and very useful feature. Furthermore, compared to other
kernels (e.g. tangent hyperbolic used in neural networks), the Gaussian kernel has
only one single parameter σ .
Another point to mention is that polynomial kernels are normally obtained by
constructing polynomials of scalar products xT y, which is also the case for the
tangent hyperbolic and similar functions like cumulative distribution functions used
in neural networks. Within this context, another advantage of the Gaussian kernel
over the above kernels is that it uses distances ‖x − y‖², which translates into
(local) tendencies when applied to multivariate time series in weather and climate,
as is illustrated later. The Gaussian kernel is used nearly everywhere such as in
PDF estimation, neural networks, spectral clustering, etc. Note also that there are nonlinear structures that cannot be resolved using, for example, polynomial kernels, as shown next. Consider the example of three concentric clusters. Two clusters
are distributed on concentric spheres of radii 50 and 30, respectively, and the third
one is a spherical cluster of radius 10. Figure 13.4a shows the data projected onto
the first two coordinates. Note that the outer two clusters are distributed near the
surface of the associated spheres. The conventional PC analysis (Fig. 13.4b) does
not help as no projective technique can dismantle these clusters. The kernel PC
analysis with a Gaussian kernel, however, is able to discriminate these clusters


Fig. 13.4 First two coordinates of a scatter plot of three concentric clusters (a), the same scatter
projected onto the leading two PCs (b), and the leading two (Gaussian) kernel PCs (c), and the data
PDF within the leading two kernel PCs (d). Adapted from Hannachi and Iqbal (2019). ©American
Meteorological Society. Used with permission

(Fig. 13.4c). Figure 13.4c is obtained with σ² = 2.3, but the structure is quite robust
to a wide range of the smoothing parameters σ . We note here that non-local kernels
cannot discriminate between these structures, pointing to the importance of the local
character of the Gaussian kernel in this regard. The kernel PDF of the data within
the space spanned by kernel PC1 and kernel PC2 is shown in Fig. 13.4d. Note, in
particular, the curved shape of the two outer clusters (Figs. 13.4c,d), reflecting the
nonlinearity of the corresponding manifold.
To compare the above result with the performance of polynomial kernels, we now
apply the same procedure to the above concentric clusters using two polynomial
kernels of respective degrees 4 and 9. Figure 13.5 shows the obtained results. The
polynomial kernels fail drastically to identify the clusters. Instead, it seems that the
polynomial kernels attempt to project the data onto the outer sphere resulting in the
clusters being confounded. Higher polynomial degrees have also been applied with
no improvement.
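A small synthetic experiment in the same spirit can be set up as follows, reusing the kernel_eofs sketch given earlier; the radii, sample sizes and bandwidth below are illustrative choices and not those used for the figures.

# Three groups: points near two spherical shells plus a compact central cluster.
import numpy as np
rng = np.random.default_rng(1)

def shell(radius, n, noise=1.0):
    v = rng.standard_normal((n, 3))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    return radius * v + noise * rng.standard_normal((n, 3))

X = np.vstack([shell(50.0, 300), shell(30.0, 300), 3.0 * rng.standard_normal((300, 3))])
X -= X.mean(axis=0)

# ordinary PCs: projection onto the leading covariance eigenvectors
_, _, Vt = np.linalg.svd(X, full_matrices=False)
pcs = X @ Vt[:2].T            # the three groups remain overlapped in this plane

# Gaussian kernel PCs (kernel_eofs as sketched above); with a suitable, local
# bandwidth the radial structure should become separable in the leading KPCs
kpcs, _, _ = kernel_eofs(X, sigma=10.0, n_components=2)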


Fig. 13.5 Same as Fig. 13.4c, but with polynomial kernels of degree 4 (a) and 9 (b). Adapted from
Hannachi and Iqbal (2019). ©American Meteorological Society. Used with permission

13.3 Relation to Other Approaches

13.3.1 Spectral Clustering

The kernel EOF analysis is closely related to what is known as spectral clustering
encountered, for example, in pattern recognition. Spectral clustering takes roots
from the spectral analysis of linear operators through the construction of a set
of orthonormal bases used to decompose the spatio-temporal field at hand. This
basis is provided precisely by a set of orthonormal eigenfunctions of the Laplace–
Beltrami differential operator, using the diffusion map algorithm (e.g. Coifman and
Lafon 2006; Belkin and Niyogi 2003; Nadler et al. 2006). Spectral clustering (e.g.
Shi and Malik 2000) is based on using a similarity matrix S = (Sij ), where Sij
is the similarity, such as spatial correlation, between states xi and xj . There are
various versions of spectral clustering such as the normalised and non-normalised
(Shi and Malik 2000). The normalised version of spectral clustering considers the
(normalised) Laplacian matrix:

L = I − D^{−1/2} S D^{−1/2},   (13.17)


where D = diag(d_1, . . . , d_n) with d_k = Σ_{j=1}^{n} S_kj. This matrix is known as the degree matrix in graph theory (see below). The wisdom here is that when
the clusters are well separated (or independent), then L will have, after some
rearrangement, a block diagonal form, i.e.

L = diag (L1 , . . . , Lk ) . (13.18)

Furthermore, it can be shown that L has k zero eigenvalues (Luxburg 2007).



Remark The non-normalised Laplacian corresponds to the matrix M = D − S


instead of L.
The eigenvectors u1 , . . . , uk associated with the k smallest eigenvalues of L are

U = (u_1, . . . , u_k) = D^{1/2} E,   (13.19)

with
E = diag(e_1, . . . , e_k),

where e_1, . . . , e_k are vectors of ones with different lengths. The procedure of cluster identification (Shi and Malik 2000) then consists of applying any clustering procedure, e.g. k-means, to the rows of U to partition them into k groups. Now, by
considering an isotropic similarity matrix S = (Sij ), e.g. based on the Gaussian
kernel, the similarity matrix S is exactly the kernel matrix K. The eigenvalue
problem associated with the (normalised) matrix L in (13.17), or similarly the non-
normalised matrix M, then becomes equivalent to the eigenvalue problem of the
centred matrix K̃.
Remark Denoting by M = D − S, the matrix L, Eq. (13.17), is related to the matrix
P = (pij ) given by

P = D^{−1} M = I − D^{−1} S.

The matrix D^{−1} S appearing in the expression of P is linked to Markov chains.


Precisely, this matrix represents a stochastic matrix whose rows add up to one, and
thus each element pij of P can be interpreted as a transition probability from the ith
state to the j th state of the chain. So spectral clustering has a natural probabilistic
interpretation (Meila and Shi 2000).
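A compact sketch of the normalised spectral clustering recipe described above (Gaussian similarity, Laplacian of Eq. (13.17), k smallest eigenvectors, then k-means) might look as follows; the row normalisation is a common variant of the algorithm, and scipy's kmeans2 is used purely for brevity.

# Normalised spectral clustering (illustrative sketch).
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(X, k, sigma):
    sq = np.sum(X**2, axis=1)
    S = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X @ X.T) / (2.0 * sigma**2))
    d = S.sum(axis=1)
    Dm12 = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(d)) - Dm12 @ S @ Dm12          # Eq. (13.17)
    _, eigvec = np.linalg.eigh(L)                 # eigenvalues in ascending order
    U = eigvec[:, :k]                             # k smallest eigenvectors
    U /= np.linalg.norm(U, axis=1, keepdims=True) # row normalisation (common variant)
    _, labels = kmeans2(U, k, minit="++")
    return labels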

13.3.2 Modularity Clustering

The problem of cluster identification, as discussed above in spectral clustering,


can also be viewed from a graph theory perspective. Graph theory is a theoretical
framework used to study and investigate complex networks. Here, each data point
xi in state space is considered as a vertex of some network, and edges between pairs
of vertices become substitutes of similarities obtained precisely from the adjacency
matrix A = (a_ij) (see Chap. 3, Sect. 9.4). Each vertex i has a degree k_i, defined as the number of vertices to which it is connected (the number of edges linked to it) or the number of adjacent vertices. It is given by k_i = Σ_j a_ij.

The idea of community or group identification was suggested by Newman


and Girvan (2004) and Newman (2006). The procedure starts with a division of
the network into two groups determined by a vector s containing only +1 and
−1 depending on whether the vertex belongs to groups 1 and 2, respectively.
The number of edges a_ij between vertices i and j is then compared to that obtained when edges are placed randomly, k_i k_j/(2m), where m is the total number of edges (m = 0.5 Σ_i k_i). Newman and Girvan (2004) defined the modularity Q = Σ_ij (a_ij − k_i k_j/(2m)) s_i s_j, that is,

Q = s^T B s = s^T (A − (1/(2m)) k k^T) s,

where B = A − (1/(2m)) k k^T is the modularity matrix. The modularity is a measure reflecting the extent to which nodes in a graph are connected to those of their own groups, thus reflecting the presence of clusters known as communities.
The objective then is to maximise the modularity. The symmetric modularity matrix has the vector 1 (containing only ones) as an eigenvector with associated zero eigenvalue, as in the Laplacian matrix in spectral clustering. By expanding the vector s in terms of the eigenvectors u_j of B, s = Σ_j (s^T u_j) u_j, the modularity reduces to

Q = Σ_j (u_j^T s)² λ_j,

with λ_j being the eigenvalues of B. It is clear that maximisation of Q requires s to be proportional to the leading eigenvector u_1, yielding s = sign(u_1) and providing, hence, a clustering algorithm into two groups, given λ_1 > 0. It is immediate to see that if the eigenvalues are non-positive, λ_j ≤ 0, the leading eigenvector is 1, meaning that the data cannot be clustered. Newman (2006) extended this to divide the network into more than two groups, by introducing the contribution ΔQ due to dividing a group G (with n_G elements or nodes) into two subgroups:

ΔQ = s^T B^(G) s,

with the elements of the new n_G × n_G modularity matrix B^(G) given by B^(G)_ij = B_ij − δ_ij Σ_{k∈G} B_ik, with δ_ij being the Kronecker delta. The procedure is then repeated by maximising the modularity using ΔQ.
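The two-group step of the algorithm can be sketched as follows; the adjacency matrix A is assumed symmetric with zero diagonal, and the 1/(4m) normalisation sometimes applied to Q is omitted, in line with the definition used above.

# Modularity-based split into two groups from the leading eigenvector of B.
import numpy as np

def modularity_split(A):
    k = A.sum(axis=1)                      # vertex degrees
    m = A.sum() / 2.0                      # total number of edges
    B = A - np.outer(k, k) / (2.0 * m)     # modularity matrix
    eigval, eigvec = np.linalg.eigh(B)
    if eigval[-1] <= 0:                    # leading eigenvector is 1: no division
        return np.ones(len(k), dtype=int), 0.0
    s = np.where(eigvec[:, -1] >= 0, 1, -1)
    return s, float(s @ B @ s)             # group labels and Q = s^T B s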

13.4 Pre-images in Kernel PCA

The construction of the feature space and the computation of the EOF patterns and
associated PCs within it are a big leap in the analysis of the system producing
the data. What we need next is, of course, to be able to examine the associated

patterns in the input space. In many cases, therefore, one needs to be able to map
back, i.e. perform an “inverse mapping”, onto the input space. The inverse mapping
terminology is taken here in a rather broad sense not literally as the mapping may not
be one-to-one. Let us first examine the case of the standard (linear) EOF analysis.
In standard PCA the data x_t are expanded as

x_t = Σ_{k=1}^{p} (x_t^T u_k) u_k = Σ_{k=1}^{p} c_tk u_k,   (13.20)

where uk is the kth EOF and ctk is the kth PC of xt . Since this is basically a linear
projection, given a point x in the EOF space for which only the l leading coordinates
β1 , . . . , βl , (l ≤ p) are observed, the pre-image is simply obtained as


x* = Σ_{k=1}^{l} β_k u_k.   (13.21)

The above expression minimises ‖x − x*‖², i.e. the quadratic distance to the exact
point. Of course, if all the EOFs are used, then the distance is zero and the pre-image
is exact (see Chap. 3).
Now, because the mapping may not be invertible,2 the reconstruction of the
patterns back in the p-dimensional input data space can only be achieved approx-
imately (or numerically). Let us assume again a nonlinear transformation φ(.)
mapping the input (physical) space onto the (high-dimensional) feature space F,
where the covariance matrix is obtained using Eq. (13.2). Let vk be the kth EOF,
within the feature space, with associated eigenvalue λk , i.e.

Svk = λk vk . (13.22)

Like for the standard (linear) EOF analysis, and as pointed out above, the
EOFs are combination of the data; that is, they lie within the space spanned by
φ(x1 ), . . . , φ(xn ), e.g.


v_k = Σ_{t=1}^{n} a_kt φ(x_t).   (13.23)

It can be seen, after inserting (13.23) back into (13.22), see also Eq. (13.3), that
the vector ak = (ak1 , . . . , akn )T is an eigenvector of the matrix K = (Kij ), where
Kij = K(xi , xj ) = φ(xi )T φ(xj ), i.e.

Kak = nλk ak . (13.24)

2 Even if it is, it would be prohibitive to compute it.



Now, given a point x in the input space, the kernel PC of x in the feature space is the
usual PC of φ(x) within this feature space. Hence, the kth kernel PC is given by


β_k(x) = φ(x)^T v_k = Σ_{t=1}^{n} a_kt K(x, x_t).   (13.25)

Remark Denoting by V = (v_1, . . . , v_n) and Φ = (φ(x_1), . . . , φ(x_n)), Eq. (13.23) is written as

V = Φ A,

where A = (a_1, . . . , a_n) represents the matrix of the eigenvectors of K.


Let us now assume that we have a pattern in the feature space:


w = Σ_{k=1}^{m} β_k v_k,   (13.26)

where β1 , . . . , βm are scalar coefficients, and we would like to find a point x from
the input space that maps onto w in Eq. (13.26). Following Schölkopf et al. (1999),
one attempts to find the input x from the input space such that its image φ(x)
approximates w through maximising the ratio r:
r = (w^T φ(x))² / (φ(x)^T φ(x)).   (13.27)
Precisely, the problem can be solved approximately through a least squares minimisation:

x = argmin_x J = argmin_x ‖φ(x) − Σ_{k=1}^{m} β_k v_k‖².   (13.28)

To solve Eq. (13.28), one makes use of Eq. (13.23) and expresses w as

w = Σ_{t=1}^{n} (Σ_{k=1}^{m} β_k a_kt) φ(x_t) = Σ_{t=1}^{n} α_t φ(x_t)   (13.29)

(with α_t = Σ_{k=1}^{m} β_k a_kt), which is then inserted into (13.28). Using the property of the kernel (kernel trick), one gets

‖φ(x) − w‖² = K(x, x) − 2 Σ_{t=1}^{n} α_t K(x, x_t) + c,   (13.30)

where c is a constant independent of x.



If the kernel is isotropic, i.e. K(x, y) = H(‖x − y‖²), then the gradient of the squared error (13.30) is easy to compute, and the necessary condition for the minimum of (13.30) is given by

∇_x J = Σ_{t=1}^{n} α_t H′(‖x − x_t‖²) (x − x_t) = 0,   (13.31)

where H′(u) = dH/du. The optimum then satisfies

x = (1 / Σ_{t=1}^{n} α_t H′(‖x − x_t‖²)) Σ_{t=1}^{n} α_t H′(‖x − x_t‖²) x_t.   (13.32)

The above equation can be solved using the fixed point algorithm via the iterative scheme:

x^(m+1) = (Σ_{t=1}^{n} α_t (dH/du)(‖x^(m) − x_t‖²) x_t) / (Σ_{t=1}^{n} α_t (dH/du)(‖x^(m) − x_t‖²)).

For example, in the case of the Gaussian kernel K(x, y) = exp(−‖x − y‖²/(2σ²)), we get the iterative scheme:

z^(m+1) = (1 / Σ_{t=1}^{n} α_t exp(−‖z^(m) − x_t‖²/(2σ²))) Σ_{t=1}^{n} α_t exp(−‖z^(m) − x_t‖²/(2σ²)) x_t,   (13.33)
which can be used to find an optimal solution x∗ by taking x∗ ≈ z(m) for large
enough m. Schölkopf et al. (1998, 1999) show that kernel PCA outperforms ordinary
PCA. This can be understood since kernel PCA can include higher order moments,
unlike PCA where only second-order moments are used. Note that PCA corresponds
to kernel PCA with φ(.) being the identity.
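The fixed-point scheme (13.33) is straightforward to code; the sketch below assumes the expansion coefficients α_t of Eq. (13.29) and the training points x_t are available as arrays, and it starts from a user-supplied initial guess (e.g. the sample mean or a nearby training point).

# Fixed-point pre-image iteration for the Gaussian kernel, Eq. (13.33).
import numpy as np

def gaussian_preimage(alpha, X, sigma, z0, n_iter=200, tol=1e-8):
    # alpha: (n,) expansion coefficients; X: (n, p) training points; z0: (p,) initial guess
    z = np.asarray(z0, dtype=float).copy()
    for _ in range(n_iter):
        w = alpha * np.exp(-np.sum((X - z)**2, axis=1) / (2.0 * sigma**2))
        denom = w.sum()
        if abs(denom) < 1e-12:         # iteration undefined; restart from another guess
            break
        z_new = w @ X / denom
        if np.linalg.norm(z_new - z) < tol:
            return z_new
        z = z_new
    return z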

13.5 Application to An Atmospheric Model and Reanalyses

13.5.1 Application to a Simplified Atmospheric Model

The example discussed here consists of a 3-level quasi-geostrophic model of the


atmosphere. The model describes the evolution of the potential vorticity qi at the
ith level, where levels i = 1, 2 and 3 represent, respectively, 200-, 500- and 800-
hPa surfaces. The potential vorticity equations are given by

∂q_1/∂t = −J(ψ_1, q_1) + D_1(ψ_1, ψ_2) + S_1
∂q_2/∂t = −J(ψ_2, q_2) + D_2(ψ_1, ψ_2, ψ_3) + S_2   (13.34)
∂q_3/∂t = −J(ψ_3, q_3) + D_3(ψ_2, ψ_3) + S_3,

where the potential vorticities are given by

q_1 = ∇²ψ_1 − R_1^{−2}(ψ_1 − ψ_2) + f
q_2 = ∇²ψ_2 + R_1^{−2}(ψ_1 − ψ_2) − R_2^{−2}(ψ_2 − ψ_3) + f   (13.35)
q_3 = ∇²ψ_3 + R_2^{−2}(ψ_2 − ψ_3) + f(1 + h/H_0).

In the above equations, ψ_i, i = 1, 2, 3, represents the streamfunction at level i, R_1 = 700 km and R_2 = 450 km represent the Rossby radii of deformation, f = 2Ω sin φ represents the Coriolis parameter, Ω is the Earth's rotation rate, φ is the latitude and H_0 is a height scale fixed to 9 km. The terms D_i, i = 1, 2, 3, represent the dissipation rates and include contributions from temperature relaxation, the Ekman dissipation and horizontal diffusion. The latter is a hyper-diffusion with an e-folding time of 1.5 days. The term J(·) represents the nonlinear Jacobian operator, J(ψ, q) = (∂ψ/∂x)(∂q/∂y) − (∂ψ/∂y)(∂q/∂x), and ∇² is the horizontal Laplacian on

the sphere. The forcing terms Si , i = 1, 2, 3, are calculated in such a way that the
January climatology of the National Center for Environmental Prediction/National
Center for Atmospheric Research (NCEP/NCAR) streamfunction fields at 200 hPa,
500 hPa and 800 hPa levels is a stationary solution of the system, Eq. (13.34). The
term h() in Eq. (13.35) represents the real topography of the Earth in the northern
hemisphere (NH). The model is spectral with a triangular truncation T21 resolution,
i.e. 32 × 64 lat × lon grid resolution. The model is symmetrical with respect to the
equator, leading to slightly more than 3000 grid points or 693 spherical harmonics
(or spectral degrees of freedom). The model was shown by many authors to simulate
faithfully the main dynamical processes in the extratropics, see Hannachi and Iqbal
(2019) for more details and references. The point here is to show that the model
reveals nonlinearity when analysed using kernel EOFs and is therefore consistent
with conceptual low-order chaotic models such as the case of the Lorenz (1963)
model (Fig. 13.6).
The model run is described in Hannachi and Iqbal (2019) and consists of
one million-day trajectory. The averaged flow tendencies are computed within
the PC space and compared to those obtained from the kernel PC space. Mean
flow tendencies have been applied in a number of studies using the leading
modes of variability to reveal possible signature of nonlinearity, see e.g. Hannachi
(1997), Branstator and Berner (2005) and Franzke et al. (2007). For example,
using a simple (toy) stochastic climate model, Franzke et al. (2007) found that
the interaction between the resolved planetary waves and the unresolved waves is
the main responsible for the nonlinearity. Figure 13.7a shows an example of the
mean tendencies of the mid-level (500-hPa) streamfunction within the PC1-PC5
state space. The flow tendencies (Fig. 13.7a) reveal clear nonlinearities, which can
be identified by examining both the tendencies and their amplitudes. Note that


Fig. 13.6 PDF of a long simulation of the Lorenz (1963) model shown by shaded and solid
contours within the (x, z) plane (a), and the flow tendencies plotted in terms of magnitude (shaded)
and direction (normalised vectors) within the same (x, z) plane (b). A chunk of the model trajectory
is also shown in both panels along with the fixed points. Note that the variables are scaled by 10,
and the value z0 = 25.06 of the fixed point is subtracted from z. Adapted from Hannachi and Iqbal
(2019). ©American Meteorological Society. Used with permission

in linear dynamics the tendencies are antisymmetric with respect to the origin,
and the tendency amplitudes are normally elliptical, as shown in Fig. 13.7b. The
linear model is fitted to the trajectory within the two-dimensional PC space using
a first-order autoregressive model, as explained in Chap. 6. The departure of the
linear tendencies from the total tendencies in Fig. 13.7c reveals two singular (or
fixed) points representing (quasi-)stationary states. The PDF of the system trajectory
is shown in Fig. 13.7d and is clearly unimodal, which is not consistent with the
conceptual low-order chaotic models such as Fig. 13.6.
The same procedure can be applied to the trajectory within the leading kernel
PCs. Figure 13.8a shows the departure of the tendencies from the linear component
within the kernel PC1/PC4 space and reveals again two fixed points. Figure 13.8b
shows the PDF of the mid-level streamfunction within kernel PC1/PC4 state space.
In agreement with low-order conceptual models, e.g. Fig. 13.6, the figure now
reveals strong bimodality, where the modes correspond precisely to regions of low
tendencies. Figure 13.9 displays the two circulation flows corresponding to the PDF
modes of Fig. 13.8 showing the anomalies (top) and the total (bottom) flows. The
first stationary state shows a low over the North Pacific associated with a dipole over
the North Atlantic reflecting the negative NAO phase (Woollings et al. 2010). The
second anomalous stationary solution represents approximately the opposite phase,
with a high pressure over the North Pacific associated with an approximate positive


Fig. 13.7 Total flow tendency of the mid-level streamfunction within the conventional PC1/PC5
state space. (b) Linear tendency based on a first-order autoregressive model fitted to the same
data. (c) Difference between the two tendencies of (a) and (b) showing the departure of the total
tendencies from the linear part. (d) Kernel PDF of the same data within the same two-dimensional
state space. Adapted from Hannachi and Iqbal (2019)


Fig. 13.8 As in Fig. 13.7c(a) and 13.7d(b), but for the kernel PC1/PC4. Adapted from Hannachi
and Iqbal (2019). ©American Meteorological Society. Used with permission

NAO phase. In both cases the anomalies over the North Atlantic are shifted slightly
poleward compared to the NAO counterparts.


Fig. 13.9 Anomalies (a,b) and total (c,d) flows of mid-level streamfunction field obtained by
compositing over states within the neighbourhood of the modes of the bimodal PDF. Contour
interval 29.8 × 108 m2 /s (top) and 29.8 × 106 m2 /s (bottom). Adapted from Hannachi and Iqbal
(2019). ©American Meteorological Society. Used with permission

The total flow of the stationary solutions, obtained by adding the climatology
to the anomalous stationary states, is shown in the bottom panels of Fig. 13.9. The
first solution shows a ridge over the western coast of North America associated
with a diffluent flow over the North Atlantic with a strong ridge over the eastern
North Atlantic. This latter flow is reminiscent of a blocked flow over the North
Atlantic. Note the stronger North Atlantic ridge compared to that of the western

North American continent. The second stationary state (Fig. 13.9) shows a clear
zonal flow over both basins.

13.5.2 Application to Reanalyses

In the next example, kernel PCs are applied to reanalyses. The data used in
this example consist of sea level pressure (SLP) anomalies from the Japanese
Reanalyses, JRA-55 (Harada et al. 2016; Kobayashi et al. 2015). The anomalies
are obtained by removing the mean daily annual cycle from the SLP field and
keeping unfiltered winter (December–January–February, DJF) daily anomalies over
the northern hemisphere. The kernel PCs of daily SLP anomalies are computed, and
the PDF is estimated.
Figure 13.10a shows the daily PDF of SLP anomalies over the NH poleward
of 27◦ N using KPC1/KPC7. Strong bimodality stands out from this PDF. To
characterise the flow corresponding to the two modes, a composite analysis is
performed by compositing over the points within the neighbourhood of the two
modes A and B. The left mode (Fig. 13.11a) shows a polar high stretching south
over Greenland accompanied with a low pressure system over the midlatitude
North Atlantic stretching from eastern North America to most of Europe and
the Mediterranean. This circulation regime projects strongly onto the negative
NAO. The second mode (Fig. 13.11b) shows a polar low with high pressure over
midlatitude North Atlantic, with a small high pressure over the northern North West
Pacific, and projects onto the positive NAO. The regimes are not exactly symmetric with respect to each other; regime A is stronger than regime B. Hannachi and Iqbal (2019)
also examined the hemispheric 500-hPa geopotential height. Their two-dimensional
PDFs (not shown) reveal again strong bimodality associated, respectively, with polar
low and polar high. The mode associated with the polar high is stronger however.


Fig. 13.10 (a) Kernel PDF of the daily winter JRA-55 SLP anomalies within the kernel PC1/PC7
state space. (b) Difference between the PDFs of winter daily SLP anomalies of the first and second
halves of the JRA-55 record. Adapted from Hannachi and Iqbal (2019)

Fig. 13.11 SLP anomaly composites over states close to the left (a) and right (b) modes of the
PDF shown in Fig. 13.10. Units hPa. Adapted from Hannachi and Iqbal (2019).

Table 13.1 Correlation coefficients between the 10 leading KPCs and PCs for JRA-55 SLP
anomalies. The R-square of the regression between each individual KPCs and the leading 10 PCs
is also shown. Correlations larger than 0.2 are shown in bold faces
KPC1 KPC2 KPC3 KPC4 KPC5 KPC6 KPC7 KPC8 KPC9 KPC10
PC1 0.928 0.008 −0.031 −0.032 −0.090 −0.017 0.009 −0.017 −0.015 −0.030
PC2 0.031 −0.904 0.070 −0.037 0.251 0.006 −0.015 −0.054 0.063 0.003
PC3 −0.020 −0.002 0.561 −0.464 −0.271 0.471 −0.029 0.146 0.117 0.037
PC4 −0.035 −0.172 −0.287 0.400 −0.453 0.543 0.210 −0.108 −0.204 0.012
PC5 0.097 0.168 0.359 0.384 0.554 0.365 −0.043 −0.303 −0.050 0.084
PC6 0.011 0.079 −0.074 0.071 0.191 0.134 0.499 0.291 0.386 −0.489
PC7 0.027 −0.009 −0.104 −0.021 0.261 0.178 −0.091 0.615 −0.469 0.092
PC8 −0.018 −0.027 −0.079 0.072 −0.033 0.146 −0.644 0.120 0.032 −0.537
PC9 0.036 0.002 −0.129 0.155 0.052 0.079 −0.108 0.293 0.485 0.431
PC10 −0.003 −0.003 −0.040 0.072 0.042 0.097 −0.060 0.163 0.115 −0.052
R-square 0.88 0.88 0.57 0.57 0.77 0.74 0.73 0.72 0.68 0.73

The above results, derived from the quasi-geostrophic model and also from reanalyses, clearly show that the leading KPCs, like the leading PCs, reflect large scale structure and hence can explain a substantial amount of variance. This explained variance is already there in the feature space but is not apparent in the original space. Table 13.1 shows the correlations between the leading 10 KPCs and leading 10 PCs of the sea level pressure anomalies, along with the R-square obtained from multiple regression between each KPC and the leading 10 PCs. It is clear that these KPCs are large scale and also explain a substantial amount of variance.
The kernel PCs can also be used to check for any change in the signal over
the reanalysis period. An example of this change is shown in Fig. 13.10b, which
represents the change in the PDF between the first and last halves of the data. A
clear regime shift is observed with a large decrease (increase) of the frequency of

the polar high (low) between the two periods. This change projects onto the +NAO
(and +AO) and is consistent with an increase in the probability of the zonal wind
speed over the midlatitudes.

13.6 Other Extensions of Kernel EOFs

13.6.1 Extended Kernel EOFs


Direct Formulation

Extended EOFs seek propagating patterns from a multivariate times series xt , t =


1, . . . n (Chap. 7). By transforming the data into the feature space, there is the
possibility that extended EOFs could yield a better signal. We let φ(xt ), t =
1, . . . n, be the transformed data and, as in standard extended EOFs, Z denotes
the “extended” data matrix in the feature space:
    ⎛ φ(x_1)^T        . . .   φ(x_M)^T ⎞     ⎛ ϕ_1^T       ⎞
Z = ⎜      .           .         .     ⎟  =  ⎜     .       ⎟ ,   (13.36)
    ⎝ φ(x_{n−M+1})^T  . . .   φ(x_n)^T ⎠     ⎝ ϕ_{n−M+1}^T ⎠

where M is the embedding dimension and ϕ_k^T = (φ(x_k)^T, . . . , φ(x_{k+M−1})^T), k =


1, . . . n − M + 1. The covariance matrix of (13.36) is

S = (1/n) Z^T Z = (1/n) Σ_{s=1}^{n−M+1} ϕ_s ϕ_s^T.   (13.37)

By applying the same argument as for kernel EOFs, any eigenvector of S is a linear
combination of ϕ t , t = 1, . . . , n − M + 1, i.e.


v = Σ_{t=1}^{n−M+1} α_t ϕ_t.   (13.38)

Using (13.37) and (13.38), the eigenvector v satisfies

nλ v = Σ_{s=1}^{n−M+1} ϕ_s ϕ_s^T (Σ_{t=1}^{n−M+1} α_t ϕ_t) = Σ_{s=1}^{n−M+1} (Σ_{t=1}^{n−M+1} Σ_{k=0}^{M−1} K_{s+k,t+k} α_t) ϕ_s.   (13.39)

Keeping in mind the expression of v, see Eq. (13.38), Eq. (13.39) yields the following eigenvalue problem for α = (α_1, . . . , α_{n−M+1}):

K̄ α = nλ α,   (13.40)

where K̄ = (K̄_ij) with K̄_ij = Σ_{k=0}^{M−1} K_{i+k,j+k}. The vector α represents the (kernel)
extended PC in the feature space.
The reconstruction within the feature space can be done in a similar fashion to
the standard extended EOFs (Chap. 7, Sect. 7.5.3), and the transformation back to
the input space can again be used as in Sect. 13.4, but this is not expanded further
here.

Alternative Formulations

An easier alternative to applying extended EOFs in the feature space is to consider


the kernel PCs within the latter space as given by the eigenvectors of Eq. (13.4).
The next step consists of selecting, say, the leading N kernel PCs and applying
extended EOFs. We can use the extended PCs and apply a transformation back to
the kernel EOF space, using Eqs. (7.39) and (7.40) from Chap. 7. The reconstructed
pattern will then be a combination of a number of kernel EOFs, see Eq. (13.26).
The transformation back to the input space is then performed using the fixed point
algorithm as outlined in Eq. (13.33).
An alternative inverse approach can be followed here. Instead of constructing
delay coordinates (extended data matrix) in the feature space, these coordinates can
be constructed straightforward in the physical space, as in extended EOFs (Chap. 7),
and then either kernel EOFs or its analogue, the Laplacian matrix of spectral
clustering (Eq. 13.17) can be used. This latter approach was followed by Giannakis
and Majda (2012, 2013) who labelled it “nonlinear Laplacian spectral analysis”
(NLSA). They applied it to the upper ocean temperature in the North Pacific sector
using a long control run of a climate model. They identified, in addition to the annual
and decadal modes, a family of intermittent processes associated with the Kuroshio
current and subtropical/subpolar gyres. Because these latter processes explain little
variance, Giannakis and Majda (2013) suggest that the NLSA method can uncover
processes that conventional singular spectrum analysis cannot capture.

13.6.2 Kernel POPs

As described in Chap. 6, the POP analysis is based on a linear model, the first
order autoregressive model AR(1). Although POPs were quite successful in various
climate applications, the possibility remains that, as we discussed in the previous
sections, the existence of nonlinearity can hinder the validity of the linear model.
The POP, or AR(1), model can be defended when nonlinearity is weak. If, however,
we think or we have evidence that nonlinearity is important and cannot be neglected,
then one solution is to use the kernel transformation and get kernel POPs.

Since the formulation of POPs involves inverting the covariance matrix, and to
avoid this complication in the feature space, a simple way is to apply POPs using
kernel PCs by selecting, say, the leading N KPCs. The computation, as it turns out,
becomes greatly simplified as the KPCs are uncorrelated. Like for kernel extended
EOFs, patterns obtained from the POP analysis are expressed as a combination of
kernel EOFs. The transformation back to the input space can be obtained using again
the fixed point algorithm.
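As a hedged illustration of this strategy, the propagator of a first-order autoregressive model can be estimated directly from the leading kernel PCs and its eigenvectors taken as POPs in the kernel EOF space (cf. Chap. 6); the sketch below uses lag-0 and lag-1 covariances and is not tied to any particular dataset.

# Kernel POP sketch: AR(1) propagator fitted to the leading kernel PCs.
import numpy as np

def kernel_pops(kpcs):
    # kpcs: (n_time, N) matrix of leading kernel PCs, assumed centred
    x0, x1 = kpcs[:-1], kpcs[1:]
    C0 = x0.T @ x0 / len(x0)           # lag-0 covariance (diagonal if KPCs are uncorrelated)
    C1 = x1.T @ x0 / len(x0)           # lag-1 covariance
    A = C1 @ np.linalg.inv(C0)         # AR(1) propagator
    eigval, eigvec = np.linalg.eig(A)  # possibly complex POPs with decay/oscillation information
    return eigval, eigvec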
Chapter 14
Functional and Regularised EOFs

Abstract Weather and climate data are in general discrete and result from sampling
a continuous system. This chapter attempts to take this into account when computing
EOFs. The first part of the chapter describes methods to construct EOFs/PCs of
profiles with application to oceanography. The second part of the chapter describes
regularised EOFs with application to reanalysis data.

Keywords Functional PCs · Ocean salinity · Mixed layer · Regularised EOFs ·


Generalized eigenvalue problem · Lagrangian · Smoothing parameter ·
Cross-validation · Siberian high · AO

14.1 Functional EOFs

Functional EOF/PC analysis (e.g. Ramsay and Silverman 2006; Jolliffe and Cadima
2016) is concerned with EOF/PC analysis applied to data which consist of curves
or surfaces. In atmospheric science, for example, the observations consist of
spatio-temporal fields that represent a discrete sampling of continuous variables,
such as pressure or temperature at a set of finite grid points. In a number of cases
methods of coupled patterns (Chap. 15) can be applied to single fields simply by
assuming that the two fields are identical. For instance, functional and smooth EOFs
correspond to smooth maximum covariance analysis (MCA), as given by Eq. (15.30)
in Chap. 15, when the left and right fields are identical. Precisely, we suppose that
we are given a sample of n curves that constitute the coordinates of a vector curve x(t) = (x_1(t), . . . , x_n(t))^T, with zero mean, i.e. Σ_{k=1}^{n} x_k(t) = 0 for all values of t.
The covariance function is given by

1
S(s, t) = xT (t)x(s), (14.1)
n−1

where t and s are in the domain of definition of the curves. The question then is to find smooth functions (EOFs) a(t) maximising ⟨a, Sa⟩ = ∫∫ S(s, t) a(s) a(t) ds dt subject to a normalisation constraint condition of the type


⟨a, a⟩ + α ⟨D²a, D²a⟩ − 1 = 0. The solution to this problem is then given by the following integro-differential equation:

∫ S(t, s) a(s) ds = μ (1 + αD⁴) a(t).   (14.2)

Consider first the case of functional EOFs, that is when α = 0, which yields a homogeneous Fredholm equation of the second kind. We suppose that the curves can be expanded in terms of a number of basis functions φ_1(·), . . . , φ_p(·) so that x_i(t) = Σ_{k=1}^{p} λ_{i,k} φ_k(t), i = 1, . . . , n. In vector form this can be written as x(t) = Λ φ(t), where we let Λ = (λ_ij). The covariance function becomes

S(t, s) = (1/(n−1)) φ^T(t) Λ^T Λ φ(s).   (14.3)

Assuming that the functional EOF a(t) is expanded using the basis functions as a(t) = Σ_k a_k φ_k(t) = φ^T(t) a, the above integro-differential equation yields the following system:

φ^T(t) Λ^T Λ G a = μ φ^T(t) a,   (14.4)

where G = (n − 1)^{−1} ⟨φ(s), φ^T(s)⟩. This equality has to be satisfied for all t in its domain of definition. Hence the solution a is given by the solution to the eigenvalue problem:

Λ^T Λ G a = μ a.   (14.5)

Exercise
1. Derive the above eigenvalue problem satisfied by the vector coefficients a =
(a1 , . . . , ap )T .
2. Now we can formulate the problem by going back to the original problem as we
did above. Write the generalised eigenvalue problem using V = (< φi , Sφj >)
and A = (< φi , φj >).
3. Are the two equations similar? Explain your answer.
Given that the covariance-like matrix G is symmetric and semi-definite positive, one can compute its square root, or alternatively, one can use its Cholesky decomposition to transform the previous equation into a symmetric eigenvalue problem by multiplying both sides by the square root of G.
Exercise
1. Show that G is symmetric.
2. Show that a^T G b = (1/(n−1)) ∫ (a^T φ(t)) (b^T φ(t)) dt.
3. Deduce that G is semi-definite positive. Hint: G_ij = ⟨φ_i, φ_j⟩ = ∫ φ_i(t) φ_j(t) dt = G_ji.

14.2 Functional PCs and Discrete Sampling

The above section presents functional EOFs applied to a finite number of curves
x1 (t), . . . , xn (t), for t varying in a specified interval. The parameter may represent
time or a conventional or curvilinear coordinate, e.g. height. In practice, however,
continuous curves or profiles are not commonly observed, but can be obtained from
a set of discrete values at a regular grid.1 To construct continuous curves or profiles
from these samples a linear combination of a number of basis functions can be used
as outlined above. Examples of basis functions commonly used include radial basis
functions and splines (Appendix A).
The profile xi (t) is projected onto the basis φk (t), k = 1, . . . K, as


K
xi (t) = λi,k φk (t). (14.6)
k=1

The functional PCs are then given by solving the eigenvalue problem (Eq. (14.5)).
The problem is normally solved in two steps. First, the coefficients λi,k , i =
1, . . . n, k = 1, . . . K, are obtained from Eq. (14.6) using for example least squares
estimation. The matrix  = (λi,k ) is then used as data matrix, and a SVD procedure
can be applied to get the eigenvectors (functional PCs) of the covariance matrix of
.

14.3 An Example of Functional PCs from Oceanography

An appropriate area for functional PC analysis is when we have measurements


of vertical profiles in the atmosphere or the ocean. Below we discuss precisely
vertical profiles of ocean temperature and salinity, which were studied in detail
by Pauthenet (2018). Temperature and salinity are two thermodynamic variables
of great importance in ocean studies as they control the stability and thermohaline
circulation of the ocean. They can also be used to identify frontal region in the ocean
as a function of depth.
The three-dimensional structure of the ocean is quite complex to analyse, e.g.
via conventional EOFs. A simpler way to analyse for example the frontal structure
of the ocean is to use the vertical profiles of salinity and temperature and draw a
summary of the three-dimensional ocean structure. Pauthenet (2018) investigated
the thermohaline structure of the southern ocean using functional PCs of vertical
profiles of temperature and salinity using various data products.

1 When the sampling is not regular an interpolation can be applied to obtain regular sampling.
322 14 Functional and Regularised EOFs

Given vertical profiles of temperature T and salinity S of the ocean at a given


grid point in the ocean surface, defined at discrete depths, a continuous profile is
obtained using Eq. (14.6) with the index i representing the location and plays the
role of time in conventional EOFs. The n × K coefficient matrices T and S for
temperature T and salinity S, respectively, are used to compute the functional PCs
as the solution of the eigenvalue problem:

T Mu = λu, (14.7)

where  = (T , S ), and represents the n × (2K) coefficient matrix, and M


is a (2K) × (2K) diagonal weighting (or scaling) matrix reflecting the different
units (and variances) of variables. The eigenvectors can be used to construct vertical
modes of salinity and temperature after being properly weighted, i.e. −1/2 M−1/2 u,
and can also be used to filter the data using a few leading functional PCs, e.g.
M−1/2 −1/2 u. Pauthenet used B-splines as basis functions.
Figure 14.1 shows an example of two vertical profiles of T and S along with
the leading 20 used B-splines used for fitting. The (non-equidistant) measurement
locations are also shown. Note that the vertical profiles are interpolated into a
regular vertical grid prior to B-spline fitting. Figure 14.2 shows an example of the
leading three functional PCs obtained from Monthly Isopycnal and Mixed-layer
Ocean Climatology (MIMOC), see Schmidtko et al. (2013). The figure reflects
broadly known water masses distribution as indicated in Fig. 14.2e (Talley 2008).
For example, functional PC2 and PC3 reflect the low- and high-salinity intermediate
water mass while functional PC1 reflects more the wind-driven gyres at upper levels.

Fig. 14.1 An example of two vertical profiles of temperature (left), salinity (middle) constructed
using 20 B-splines (right). The dots represent the measurements. Adapted from Pauthenet (2018)
14.3 An Example of Functional PCs from Oceanography 323

Fig. 14.2 MIMOC global ocean temperature at 340 dbar (a), salinity at 750 dbar (c) climatology,
along with the spatial distribution of functional PC1 (b), PC2 (d) and PC3 (f). Panel (e) shows the
low- and high-salinity intermediate water mass (Courtesy of Talley 2008). Adapted from Pauthenet
(2018)

The functional PCs of vertical profiles can provide invaluable information on


the ocean fronts. An example of the vertical profiles or PCs of the southern ocean
is shown in Fig. 14.3 along with their explained variance. An example is given in
Fig. 14.4 showing the spatial distribution of the leading four functional PCs in the
Southern ocean. The functional PCs reflect well the oceanic fronts, e.g. the polar
front. The leading functional PC correlates well with temperature at 250 m and
salinity at 25 m and 1355 m, whereas functional PC2 correlates better with salinity
at 610 m depth. Note, in particular, the low salinity near Antarctica. Salinity fronts
shift northward with depth (Fig. 14.4). The other two functional PCs are associated
with subantarctic mode water and southern ocean fronts.
324 14 Functional and Regularised EOFs

Fig. 14.3 The leading four vertical mode or PCs of temperature and salinity along with their
percentage explained variance. Individual explained variance is also shown for T and S separately.
The mean profile (solid) is shown along with the envelope of the corresponding PCs. (a) PC1
(72.52%). (b) PC2 (19.89%). (c) PC3 (3.43%). (d) PC4 (1.35%). Adapted from Pauthenet et al.
(2017). ©American Meteorological Society. Used with permission

14.4 Regularised EOFs

14.4.1 General Setting

We consider here the more general problem of regularised (or smooth) EOFs as
discussed in Hannachi (2016). As for the smoothed MCA (see Chap. 15), we let
V = (Vij ) = (< φi , Sφj >), and  = ij = < D 2 φi , D 2 φj > , the vector a is
then obtained by solving the generalised eigenvalue problem:

Va = μ ( + α) a

Remark Note that we used the original maximisation problem and not the previous
integro-differential equation.
It can be seen that the associated eigenvalues are real non-negative, and the
eigenvectors are real. The only parameter required to solve this problem is the
14.4 Regularised EOFs 325

Fig. 14.4 Spatial structure of the leading four functional PCs of the vertical profiles of the 2007
annual mean temperature and salinity. Data are taken from the Southern Ocean State Estimate
(SOSE, Mazloff et al. 2010). (a) PC1 (72.52%). (b) PC2 (19.89%). (c) PC3 (3.43%). (d) PC4
(1.35%). Adapted from Pauthenet et al. (2017). ©American Meteorological Society. Used with
permission

smoothing parameter α. A number of methods can be used to get the regularisation


or smoothing parameter. In general, smoothing parameters can be obtained, for
example, from experience or using cross-validation. This is discussed below.
Exercise Show that the operator V is symmetric semi-definite positive.
  
Hint Vij = n1 φ( t)xT (t)x(s)φj (s)dtds = φi (t)xT (t)dt φj (s)x(s)ds =
  
Vj i . For the positivity, consider aT Va = ij ai aj Vij = ij ai aj φi (t)xT (t)dt
   
φj x(s)ds, which equals i,j,k ai aj  φi (t)xk (t)dt φj (s)xk (s)ds =
    
k ij ai φi (t)xk (t)dt aj φj (s)xk (s)ds . This last expression is simply
   2
k i ai φi (t)xk (t)dt ≥ 0.
The procedure based on radial basis functions (RBFs, Appendix A) mentioned
previously, see also Chap. 15, can also be used here in a similar fashion. The nice
thing about these functions is that they can be used to smooth curves. Suppose, for
326 14 Functional and Regularised EOFs

example, that the curves x(t) = (x1 (t), . . . , xn (t)) are observed at discrete times
t = tk , k = 1, . . . , p. The interpolation, or smoothing, using radial basis functions
is similar to, but simpler than the B-spline smoothing and is given by


p
xi (t) = λi,k φ(|t − tk |). (14.8)
k=1

Perhaps one main advantage of using this kind of smoothing, compared to splines,
is that it involves one single radial function, which can be chosen from a list of
functions given in the Appendix A. The coefficients of the smoothed curves can
be easily computed by solving a linear problem, as shown in Appendix A. The
covariance matrix can also be computed easily in terms of the matrix  = λij ,
p
and the radial function φ. The smoothed EOF curves a(t) = k=1 uk φ(|t − tk |)
are then sought by solving the above generalised eigenvalue problem to get u =
(u1 , . . . , up ). Note that the function φ(|t − tk |) now plays the role of the basis
function φk (t) used previously.

14.4.2 Case of Spatial Fields

Here we suppose that we observe a sample of space–time field F (x, tk ) at times tk ,


k = 1, . . . , n and x represents a spatial position. The covariance function of the field
is given by

1 
n
S (x, y) = F (x, tk ) F (y, tk ) . (14.9)
n−1
k=1

The objective of smooth EOFs (Hannachi 2016) is to find the “EOFs” of the
covariance matrix (14.9). Denoting by  the spatial domain, which may represent
the entire globe or a part of it, an EOF is a continuous pattern u(x) maximising:
 
u(x)S (x, y) u(y)dxdy
 
  
subject to  u(x)2 + α(∇ 2 u(x))2 dx = 1. This yields a similar integro-
differential equation to (14.2), namely
   
S (x, y) u(x)dx = S (x, y) u(x)d = μ 1 + α∇ 4 u(y). (14.10)
 

In spherical coordinates for example d = R 2 cos(φ)dφdλ, where R is the Earth’s


radius, φ is the latitude and λ is the longitude. Here we suppose that the smoothing
14.5 Numerical Solution of the Full Regularised EOF Problem 327

parameter α is given, and we concentrate more on the application to observed fields.


The choice of this parameter is discussed later using the Lagrangian function.
There are two ways to solve the above integro-differential equation when we have
a finite sample of data, namely the RBF and the direct numerical solutions. The RBF
solution is briefly discussed below, and the full numerical solution is discussed in
detail in Sect. 14.5.

The Example of the RBF Solution

This is exactly similar to the approximation used in Sects. 14.1 and 14.2 above.
We use two-dimensional
 RBFs φi (x) = φ( x − xi ) and expand u(x) in terms of
φi (x), e.g. u(x) = uk φk (x) = φ T (x)u. For example, for the case when  α = 0, the
sample F(x) = (F1 (x), . . . , Fn (x))T is similarly expanded as Ft (x) = λt,k φk (x),
i.e. F(x) = φ(x) and from S(x, y) = (n − 1)−1 FT (x)F(y) we get exactly
a similar
 eigenvalue problem to that of Sect. 14.1, i.e. T u = μu, where
 = S 2 φ(x)φ (x)dx. The set S 2 represents the spherical Earth. The advantage
T

of this is that we can use spherical RBFs, which are specific to the sphere, see e.g.
Hubbert and Baxter (2001).

14.5 Numerical Solution of the Full Regularised EOF


Problem

To integrate Eq. (14.10) one starts by defining the sampled space–time field through
a (centred) data matrix X = (x1 , . . . , xd ), where xk = (x1k , . . . , xnk )T is the time
series of the field at the kth grid point. The associated sample covariance matrix is
designated by S. The pattern u = (u1 , . . . , ud )T satisfying the discretised version of
Eq. (14.10) is given by the solution to the following generalised eigenvalue problem
(see also Eq. (15.34)):
 
Su = μ Id + αD4 u. (14.11)

In Eq. (14.11) D4 is the square of the (discretised) Laplacian operator ∇ 2 , which is


self-adjoint (see, e.g. Roach 1970). The Laplacian in spherical coordinates takes the
2 ∂2f
form: ∇ 2 f (ϕ, λ) = R12 ∂∂ϕf2 + R 2 cos
1 tan ϕ ∂f
2 ϕ ∂λ2 − R 2 ∂ϕ , where R is the Earth’s radius.
The discretised Laplacian yields a matrix whose elements vary with the latitude ϕ.
Let u(ϕ, λ) be a function on the sphere. At grid point (ϕk , λl ) the Laplacian ∇ 2 u
can be explicitly computed and yields
328 14 Functional and Regularised EOFs

 
tan ϕk
R 2 (D 2 u)k,l = 1
u
(δϕ)2 k−1,l
− (δϕ)
2
+ (δλ)2 cos
2
2 ϕ − δϕ uk,l
2 k

+ δϕ δϕ − tan ϕk uk+1,l + (δλ)2 cos2 ϕ uk,l+1 + uk,l−1


1 1 1
k
(14.12)
for k = 1, . . . , p, and l = 1, . . . , q and where uk,l = u (ϕk , λl ), and ϕk ,
k = 1, . . . , p, and λl , l = 1, . . . , q, are the discretised latitude and longitude
coordinates respectively.2 Furthermore, the matrix S in Eq. (14.11) is given by
a Hadamard product, S = S  , where  = φ1, with φ being the pq × 1
column vector containing q replicates of the vector (cos ϕ1 , . . . cos ϕp )T , i.e. φ =
T
cos ϕ1 , . . . cos ϕp , . . . cos ϕ1 , . . . cos ϕp and 1 is the 1 × pq vector of ones.
Remarks The above Hadamard product represents the area weighting used in
conventional EOF analysis accounting for the poleward converging meridians.
Another point worth mentioning is that the latitudinal discretisation in Eq. (14.12) is
uniform. A non-uniform discretisation, e.g. Gaussian grid used in spectral methods,
can be easily incorporated in Eq. (14.12).
Because Eq. (14.10) is a integro-differential operator its integration requires
boundary conditions to compute the matrix D4 in Eq. (14.11). Three basic types
of boundary conditions are discussed here, as in Hannachi (2016). We consider first
the case of hemispheric fields for which one can take ϕ0 = π/2, ϕk+1 = ϕk − δϕ
and λq = λ1 (periodicity). One can also take u0,l = 0, and up+1,l = up,l ,
l = 1, . . . , q, plus the periodic boundary condition uk,q+1 = uk,1 . Letting u =
 T
uT1 , uT2 , . . . , uTq , which represents the spatial pattern measured at the grid points
(ϕk , λl ), k = 1, . . . , p, and l = 1, . . . , q, where uTl = u1,l , u2,l , . . . , up,l . A little
algebra yields
⎛ ⎞⎛ ⎞
A C O OC u1
... O
⎜C O⎟ ⎜ ⎟
⎜ A C O ⎟ ⎜ u2 ⎟
... O
⎜ ⎟⎜ ⎟
⎜O C A CO ⎟ ⎜ u3 ⎟
... O
R D u = Au = ⎜
2 2
⎜ .. .. .. ..
⎟ ⎜
.. ⎟ ⎜ .. ⎟
.. ⎟ .. (14.13)
⎜ . . . . ⎟⎜ . ⎟
. . .
⎜ ⎟⎜ ⎟
⎝O O O O . . . A C ⎠ ⎝ uq−1 ⎠
C O O O ... C A uq

where C and A are p × p matrices given, respectively, by

C = Diag c1 , c2 , . . . , cp

and

2 Note that here q and p represent respectively the resolutions in the zonal and meridional directions

respectively, so the total number of grid points is pq.


14.5 Numerical Solution of the Full Regularised EOF Problem 329

⎛ ⎞
a1 b1 0 ... 0 0 0
⎜ (δφ)−2 a ⎟
⎜ 2 b2 ... 0 0 0 ⎟
⎜ ⎟
⎜ 0 (δφ)−2 a3 ... 0 0 0 ⎟
A=⎜
⎜ .. .. .. .. .. .. .. ⎟

⎜ . . . . . . . ⎟
⎜ ⎟
⎝ 0 0 0 . . . (δφ)−2 ap−1 bp−1 ⎠
0 0 0 . . . 0 (δφ)−2 ap + bp
 
tan ϕk tan ϕk
where ak = − 2
(δϕ)2
+ 2
(δλ)2 cos2 ϕk
− δϕ , bk = 1
(δϕ)2
− δϕ , and ck =
−2
(δλ cos ϕk ) . The eigenvalue problem (14.11) yields (Hannachi 2016):
   α 
Su = μ Ipq + αD4 u = μ Ipq + 4 A2 u, (14.14)
R
where Ipq is the pq × pq identity matrix. For a given smoothing parameter α
Eq. (14.14) is a generalised eigenvalue problem.
Exercise The objective of this exercise is to derive Eq. (14.13).
Denote by v thevector R 2 D 2 u, and write v in a form similar to u, i.e. vT =
1. 
vT1 , vT2 , . . . , vTq where vTl = (v1l , v2l , . . . , vpl ), for l = 1, . . . q. Write
Eq. (14.12) for v1l , v2l and vpl for l = 1, 2 and q.
2. Show in particular that v1 = Au1 + Cu2 + Cuq , v2 = Au2 + Cu1 + Cu3 and
vq = Auq + Cuq−1 + Cu1 .
3. Derive Eq. (14.13).
(Hint. 1. v11 = δϕ1 2 u01 + a1 u11 + b1 u21 + c1 (u10 + u12 ) and v21 = δϕ1 2 u11 +
a2 u21 + b2 u31 + c2 (u20 + u22 ), and similarly vp1 = δϕ1 2 up−1,1 + ap up1 +
bp up+1,1 + cp (up0 + up2 ), and recall that u01 = 0, u10 = u1q , u20 = u2q , etc.).
Note that in Eq. (14.14) it is assumed that u0l = 0, for l = 1, . . . q, meaning that
the pattern is zero at the pole. As it is explained in Hannachi (2016) this may be
reasonable for the wind field but not for other variables such as pressure. Another
type of boundary condition is to consider

u0l = u1,l

(for l = 1, . . . , q), which is reasonable because of the poleward convergence of the


meridians. This yields

R 2 D2 u = Bu, (14.15)

where B is similar to A, see Eq. (14.13), in which A = (aij ) is replaced by B =


(bij ) where b11 = δϕ1 2 + a1 and bij = aij for all other i and j .
330 14 Functional and Regularised EOFs

Exercise Derive Eq. (14.15) for the case when u0l = u1l for l = 1, . . . q
Hint Follow the previous exercise.
The third type of boundary conditions represents the non-periodic conditions,
such as the case of a local region on the globe. This means

uk0 = uk1 and uk,q+1 = uk,q for k = 1, . . . p

in the zonal direction, and one keeps the same condition in the meridional direction,
u0l = u1l and up+1,l = upl for l = 1, . . . q. Using the expression of matrix B given
in Eq. (14.15), Eq. (14.13) yields a bloc tridiagonal matrix as
⎛ ⎞⎛ ⎞
B+C C O O ... O O u1
⎜ C ⎟⎜ ⎟
⎜ B C O ⎟⎜... ⎟ O O u2
⎜ ⎟⎜ ⎟
⎜ O C B C ⎟⎜... ⎟ O O u3
R D u = Cu = ⎜
2 2
⎜ .. .. .. ..
⎟⎜
⎟⎜..

⎟ .. .. .. (14.16)
⎜ . . . . ⎟⎜ . ⎟ . . .
⎜ ⎟⎜ ⎟
⎝ O O O O . . . B C ⎠ ⎝ uq−1 ⎠
O O O O ... C B + C uq

which goes into Eq. (14.11).


Exercise Derive Eq. (14.16).
Hint Follow the above exercises. In particular, we get v1 = (B + C)u1 + Cu2 and
vq = Cuq−1 + (B + C)uq
Choice of the Smoothing Parameter
Generally, the smoothing parameters in regularisation problems can be obtained
from experience using for example trial and error or using cross-validation (CV),
as outlined previously, by minimising the total misfit obtained by removing each
time onedata point from  the sample (leave-one-out approach). For a given α we let
(k) (k)
U = u1 , . . . , um , which represents the set of the leading m (m ≤ pq = d)
(k)

smooth EOFs obtained by discarding the kth observation (where the observations
can be assumed to be independent). Note that U(k) is a function of α. The residuals
obtained from approximating a data vector x using U(k) is


m
ε (k) (α) = x − βi u(k)
i ,
i=1

(k)
where βi , i = 1, . . . , m are obtained from the system of equations < x, ul >=
m (k) (k) m  −1  (k)
j =1 βj < uj , ul >, l = 1, . . . , m, i.e. βi = j =1 G ij
< x, uj >. Here
(k) (k)
G is the scalar product matrix with elements [G]ij =< ui , uj >. The optimal

value of α is then the one that minimises the CV, where CV = nk=1 tr  (k) , with
14.6 Application of Regularised EOFs to SLP Anomalies 331

 (k) being the covariance matrix3 of ε (k) . SVD can be used to efficiently extract the
first few leading smooth EOFs then optimise the total sum of squared residuals.
We can also use instead the explained variance as follows. If one designates by
 (k)
σ(k) = m i=1 μi the total variance explained by the leading m EOFs  when the kth
observation is removed, then the best α is the one that maximises nk=1 σ(k) . Any
descent algorithm can be used to optimise this one-dimensional problem.
It turns out, however, and as pointed out by Hannachi (2016), that cross-
validation does not work in this particular case simply because EOFs minimise
precisely the residual variance, see Chap. 3. Hannachi (2016) used the Lagrangian
L of the original regularised EOF problem:

 
max   u(x)S (x, y) u(y)dxdy
  (14.17)
subject to  u(x)2 + α(∇ 2 u(x))2 dx = 1,

that is
    
2
L= u(x)S (x, y) u(y)dxdy − μ 1 − [u(x)]2 dx − α ∇ 2 u(x) dx .
   

14.6 Application of Regularised EOFs to SLP Anomalies

Regularised EOF analysis was applied by Hannachi (2016) to the 2◦ × 2◦


NCEP/NCAR SLP field. Monthly SLP anomalies for the winter (DJF) months for
the period Jan 1948–Dec 2015 were used. The resolution was then reduced to obtain
sparser data with respective resolutions 5◦ × 5◦ , 5◦ × 10◦ and 10◦ × 10◦ latitude–
longitude grid in addition to the original data. Figure 14.5 shows the Lagrangian

Fig. 14.5 Lagrangian L of the optimisation problem eq (14.17) versus the smoothing parameter
α, based on the leading smooth EOF of the extratropical NH SLP anomalies for 2.5◦ × 2.5◦ (a),
5◦ × 5◦ (b), 5◦ × 10◦ (c) and 10◦ × 10◦ (d) latitude–longitude grid. Adapted from Hannachi (2016)

3 Note that ε (k) is a field, i.e. a residual data matrix.


332 14 Functional and Regularised EOFs

Fig. 14.6 Eigenvalues of the generalised eigenvalue problem Eq. (14.11) for the winter SLP
anomalies over the northern hemisphere for the regularised problem (filled circles) and the
conventional (α = 0) problem (open circle). The eigenvalues are scaled by the total variance
of the SLP anomalies and transformed to a percentage, so for example the open circles provide the
percentage of explained variance of the EOFs. Adapted from Hannachi (2016)

L computed based on the leading smooth or regularised EOF as a function of the


smoothing parameter α. The figure shows indeed the existence of an optimum
value of the parameter, which increases with decreasing resolution. Figure 14.6
also shows the percentage of explained variance for both the conventional and
regularised EOFs for the 10◦ × 10◦ grid downgraded SLP anomalies. Interestingly,
the leading explained variance of the leading smooth EOF is about one and a half
times larger than that of the conventional EOF.
The leading two conventional and smooth EOFs are shown in Fig. 14.7. The
leading EOF is accepted in the literature as the Arctic Oscillation (AO). Figure 14.7
shows that EOF1 has a strong component of the NAO reflected by the asymmetry
between the North Atlantic and the North Pacific centres of action. This is related
to the mixing property, a characteristic feature of EOFs. The smoothing shows
that the leading pattern (Fig. 14.7b) is quasi-symmetric and represents more the
AO pattern, which explains the large explained variance compared to EOF1. If the
smoothing parameter is increased the pattern becomes more symmetric (not shown).
The second patterns (EOF2 and smooth EOF2) are quite similar.
Smooth EOFs have also been applied to Hadley Centre SST anomalies, HadISST.
But one of the clear applications was in connection to trend EOFs (Sect. 16.4). When
the data resolution is degraded it is found that the TEOF method missed the second
14.6 Application of Regularised EOFs to SLP Anomalies 333

Fig. 14.7 Leading two conventional (a,c) and regularised (b,d) EOFs of the northern hemisphere
SLP anomalies based on the 10◦ × 10◦ latitude–longitude grid. The smoothing parameter is based
on the optimal value obtained from the Lagrangian (Fig. 14.5d). Adapted from Hannachi (2016)

trend pattern, namely the Scandinavian pattern. Figure 14.8 shows the eigenspec-
trum associated with the inverse rank matrix, see Eq. (16.21), corresponding to the
5◦ × 5◦ downgraded resolution of the winter SLP anomalies. It is clear that when
the trend EOF method is applied with the regularisation procedure, i.e. applying
Eq. (14.11) to the data matrix Z of Eq. (16.21), the eigenvalue of the second TEOF
is raised off the “noise” floor (Fig. 14.8b), and the second trend pattern regained. The
leading two smooth trend PCs are shown in Fig. 14.8a,b. The leading two smooth
EOFs along with the associated smooth trend patterns are shown in Fig. 14.9, which
can be compared to Fig. 16.8 in Chap. 16 (see also Hannachi 2016).
334 14 Functional and Regularised EOFs

Fig. 14.8 Eigenspectrum, given in terms of percentage of explained variance, of the covariance
(or correlation) matrix of the inverse ranks of the SLP anomalies with a reduced resolution of
5◦ × 5◦ grid for the non-regularised (a) and regularised (b) cases, along with the regularised first
(c) and second (d) trend PCs associated with the leading two eigenvalues shown in (b). The optimal
smoothing parameter, α = 60, is used in (b). Adapted from Hannachi (2016)
14.6 Application of Regularised EOFs to SLP Anomalies 335

Fig. 14.9 Leading two regularised trend EOFs (a,b) of the SLP anomalies, corresponding to the
leading two eigenvalues of Fig. 14.8b, and the associated trend patterns (c,d). Contour interval in
(c,d) 1 hPa. Adapted from Hannachi (2016)
Chapter 15
Methods for Coupled Patterns

Abstract Previous chapters focussed mostly on single fields, such as EOFs of


seal level pressure. This chapter is an extension of previous methods. It describes
different methods that mostly deal with two fields to identify coupled patterns that
covary coherently. The chapter discusses both the conventional and regularised
problems. It also explores the predictive power of coupled pattern analysis.

Keywords Canonical correlation analysis · Regularised CCA · Canonical


covariance analysis · Redundancy analysis · Principal predictors · Functional
CCA · Multivariate regression

15.1 Introduction

We have presented in the previous chapters methods aiming at extracting various


spatial patterns and associated time series satisfying specific properties. Those
techniques were presented in the context of a single space–time field.1 The focus
of this chapter is on presenting alternative methods that deal with finding coupled
patterns of variability from two or more space time fields. PCA was first proposed
by Pearson (1902), and a few years later Spearman (1904a,b) introduced techniques
to deal with more than one set of variables, and in 1936, Hotelling formulated
mathematically the problem, which has become known as canonical correlation
analysis (CCA) and was mainly developed in social science. A closely related
method is the maximum covariance analysis (MCA). Both MCA and CCA use both
data sets in a symmetric way in the sense that we are not seeking to explain one
data set as a function of the other. Methods that fit in the regression framework exist
and are based on considering one data set as predictor and the other as predictand.
Three main methods attempt to achieve this, namely, redundancy analysis (RDA),

1 It
is possible to apply these techniques to two combined fields, e.g. SLP and SST, by combining
them into a single space–time field. In this way the method does not explicitly take into account
the co-variability of both fields.

© Springer Nature Switzerland AG 2021 337


A. Hannachi, Patterns Identification and Data Mining in Weather and Climate,
Springer Atmospheric Sciences, https://doi.org/10.1007/978-3-030-67073-3_15
338 15 Methods for Coupled Patterns

principal predictor analysis (PPA) and principal regression analysis (PRA). RDA
(von Storch and Zwiers 1999; Wang and Zwiers 1999) aims at selecting predictors
that maximise the explained variance. PPA (Thacker 1999) seeks to select predictors
that maximise the sum of squared correlations. PRA (Yu et al. 1997), as its name
indicates, fits regression models between the principal components of the predictor
data and each of the predictand elements individually. Tippett et al. (2008) discuss
the connection between the different methods of finding coupled patterns and
multivariate regression through a SVD of the regression matrix.
In atmospheric science, it seems that the first idea to combine two or more sets
of variables in an EOF analysis was mentioned in Lorenz (1956) and was first
applied by Glahn (1962) and few others in statistical prediction of weather, see
e.g. Kutzbach (1967) and Bretherton et al. (1992). This combined EOF/PC analysis
is obtained by applying a standard EOF analysis to the combined space–time field
T
zt = xTt , yTt of xt and yt , t = 1, . . . n. The grand covariance matrix of zt ,
t = 1, . . . n, is given by

1
n
Sx,y = (zt − z) (zt − z)T , (15.1)
n
t=1


where z = n−1 nt=1 zt is the mean of the combined field. However, because of the
scaling problem, and in order for the combined field to be consistent the individual
fields are normally scaled by their respective variances. Hence if Sxx and Syy are the
respective covariance matrices of xt and yt , t = 1, . . . n, and Dx = Diag(Sxx ) and
−1/2 −1/2
Dy = Diag(Syy ), the scaled variables become xt Dx and yt Dy , respectively.
The grand covariance matrix of the combined scaled fields is
 
−1/2 −1/2
Dx O Dx O
−1/2 Sx,y −1/2 .
O Dy O Dy

The leading combined EOFs/PCs provide patterns that maximise variance, or


correlation to be more precise, of the combined fields. The obtained leading
EOFs of each field, however, do not necessarily reflect particular association,
e.g. covariability, between the fields. Canonical correlation analysis (CCA) and
canonical covariance analysis (CCOVA) achieve this by maximising correlation and
covariance between the fields.
15.2 Canonical Correlation Analysis 339

15.2 Canonical Correlation Analysis

15.2.1 Background

Canonical correlation analysis (CCA) dates back to Hotelling (1936a) and attempts
to find relationships between two space–time fields. Let xt = xt1 , . . . , xtp1 , and
yt = yt1 , . . . , ytp2 , t = 1, 2, . . ., be two multidimensional stationary time series
with respective dimensions p1 and p2 . The objective of CCA is to find a pair of
patterns a1 and b1 such that the time series obtained by projecting xt and yt onto a1
(1)
and b1 , respectively, have maximum correlation. In the following we let at = aT1 xt
(1)
and bt = bT1 yt , t = 1, 2, . . ., be such time series.

Definition
 
The patterns a1 and b1 maximising corr at(1) , bt(1) are the leading canonical
(1) (1)
correlation patterns, and the associated time series at and bt , t = 1, 2, . . ., are
the canonical variates.

15.2.2 Formulation of CCA

We suppose in the sequel that both time series have zero mean. Let  xx and  yy
be the respective covariance matrices of xt and yt . We also let  xy be the cross-
covariance matrix between xt and yt , and similarly for  yx , i.e.  xy = E xt yTt =
 Tyx . The objective is to find a1 and b1 such that

  aT1  xy b1
(1) (1)
ρ = corr at , bt = 1 1
(15.2)
aT1  xx a1 2
bT1  yy b1 2

is maximum. Note that if a1 and b1 maximise (15.2) so are also αa1 and βb1 . To
(1) (1)
overcome this indeterminacy, we suppose that the time series at and bt , t =
1, 2, . . ., are scaled to have unit variance, i.e.

aT1  xx a1 = bT1  yy b1 = 1. (15.3)

Another way to look at (15.3) is by noting that (15.2) is independent of the scaling of
a1 and b1 , and therefore maximising (15.2) is also equivalent to maximising (15.2)
subject to (15.3). The CCA problem (15.2) is then written as
 
maxa,b ρ = aT  xy b s.t aT  xx a = bT  yy b = 1. (15.4)
340 15 Methods for Coupled Patterns

The solution is then provided by the following theorem.


Theorem We suppose that  xx and  yy are of full rank. Then the canonical
correlation patterns are given by the solution to the eigenvalue problems:
Mx a =  −1 −1
xx  xy  yy  yx a = λa
−1 −1
My b =  yy  yx  xx  xy b = λb (15.5)
.

Proof By introducing Lagrange multipliers α and β, Eq. (15.4) reduces to


   
max aT  xy b − α aT  xx a − 1 − β bT  yy b − 1 .
a,b

After differentiating the above equation with respect to a and b, one gets

 xy b = 2α xx a and  yx a = 2β yy b,

which after combination yields (15.5). Note also that one could obtain the same
result without Lagrange multipliers (left as an exercise). We notice here that α and
β are necessarily equal. In fact, multiplying by aT and bT the respective previous
equalities (obtained after differentiation), one gets, keeping in mind Eq. (15.3), 2α =
2β = aT  xy b = λ. Hence the obtained Eqs. (15.5) can be written as a single
(generalised) eigenvalue problem:
     
Op1 ,p1  xy a  xx Op1 ,p2 a
=λ . (15.6)
 yx Op2 ,p2 b Op2 ,p1  yy b

Remarks
1. The matrices involved in (15.5) have the same spectrum. In fact, we note first
that the matrices Mx and My have the same rank, which is that of  xy . From the
1
SVD of  xx , one can easily compute a square root2  xx
2
of  xx and similarly for
 yy . The matrix Mx becomes then

−1 1
Mx =  xx2 AAT  xx
2
, (15.7)

1 1
2 If xx = UUT , where U is orthogonal, then one can define this square root by  xx
2
= U 2 . In
1
 1 T
this case we have  xx =  xx  xx . Note that this square root is not symmetric. A symmetric
2 2

1 1 1 1
square root can be defined by  xx
2
= U 2 UT , in which case  xx =  xx
2
 xx
2
, and hence the square
root of  xx is not unique.
15.2 Canonical Correlation Analysis 341

−1 −1
where A =  xx2  xy  yy2 . Hence, the eigenvalues of Mx are identical to the
−1 1
eigenvalues of AAT (see Appendix D). Similarly we have My =  yy2 AAT  yy
2
,
and this completes the proof.
2. The matrices Mx and My are positive semi-definite (why?).
1 1
3. Taking u =  xx 2
a and v =  xx2
b, then (15.5) yields AAT u = λu and AAT v =
λv, i.e. u and v are, respectively, the left and right singular vectors of A.
−1 −1
In conclusion, using the SVD of A, i.e.  xx2  xy  yy2 = UVT , where  =
Diag (λ1 , . . . , λm ), and m is the rank of  xy , and where the eigenvalues have been
arranged in decreasing order, the canonical correlation patterns are given by

−1 −1
ak =  xx2 uk and bk =  yy2 vk , (15.8)

where uk and vk , k = 1, . . . , m, are the columns of U and V, respectively. Note


that the canonical correlation patterns ak are not orthogonal, but their transforms uk
are, and similarly for bk , k = 1, . . . , m. The eigenvalues λ1 ≥ λ2 ≥ . . . ≥ λm
are the canonical correlations. Note that the eigenvalues of Mx and My are λ2k , k =
1, . . . , m.

15.2.3 Computational Aspect


Estimation Using Sample Covariance Matrix

Given a finite sample of two multivariate time series xt , and yt , t = 1, . . . , n,


representing two space–time fields, one can form the data matrices X = (xti ) and
Y = ytj , t = 1, . . . n, i = 1, . . . p1 , and j = 1, . . . , p2 . We suppose that the
respective data matrices were centred by removing the respective time averages,
i.e. applying the centring operator In − n1 1n 1Tn to both matrices. The respective
covariance and cross-covariance matrices are then computed

Sxx = n1 XT X, Syy = n1 YT Y, Sxy = n1 XT Y. (15.9)


 
−1/2 −1/2
The pairs of canonical correlation patterns are then given by Sxx uk , Syy vk ,
where uk and vk , k = 1, . . . , m, are, respectively, the left and right singular vectors
−1/2 −1/2
of Sxx Syy Syy , and m = rank Sxy .
Remarks The CCA problem, Eq. (15.5), is based on the orthogonal projection onto
−1 T
the respective column space of the data matrices X and Y, i.e. X XT X X and
similarly for Y. So the CCA problem simply consists of the spectral analysis of
−1 T −1 T
X XT X X y YT Y Y .
342 15 Methods for Coupled Patterns

−1
Exercise Show that CCA indeed consists of the spectral analysis of X XT X XT
−1
y YT Y YT .
Hint Multiply both sides of the first equation of (15.5) by X.

Estimation Using EOFs

The various operations outlined above can be expensive in terms of CPU time,
particularly the matrix inversion in the case of a large number of variables. This
is not the only problem that can occur in CCA. For example, it may happen that
either Sxx or Syy or both can be rank deficient in which case the matrix inversion
breaks down.3 The most common way to address these problems is to project both
fields onto their respective leading EOFs and then apply CCA to the obtained PCs.
The use of PCs can be regarded as a filtering operation and hence can reduce the
effect of sampling fluctuation. This approach was first proposed by Barnett and
Preisendorfer (1987) and also Bretherton et al. (1992) and is often used in climate
research. When using the PCs, the covariance matrices Sxx and Syy become the
identity matrices with respective orders corresponding to the number of PCs retained
for each field. Hence the canonical correlation patterns simply reduce to the left
and right singular vectors of the cross-covariance Sxy between PCs. For example, if
T
α = α1 , . . . , αq is the left eigenvector of this cross-covariance, where q is the
number of retained EOFs of the left field xt , t = 1, . . . , n, then the corresponding
canonical correlation pattern is given by


q
a= αk ek , (15.10)
k=1

where ek , k = 1, . . . , q are the q leading EOFs of the left field. The canonical
correlation pattern for the right field is obtained in a similar way. Note that since
the PCs are in general scaled to unit variance, in Eq. (15.10) the EOFs have to be
scaled so they have the same units as the original fields. To test the significance of
the canonical correlations, Bartlett (1939) proposed to test the null hypothesis that
the leading, say r, canonical correlations are non-zero, i.e. H0 : λr+1 = λr+2 =
. . . = λm = 0 using the statistic:
  m 
! 
1
T = − n − (p1 + p2 + 3) Log 1 − λ̂2k .
2
k=r+1

3 Generalised inverse can be used as an alternative, see e.g. Khatri (1976), but as pointed out by
Bretherton et al. (1992) the results in this case will be difficult to interpret.
15.2 Canonical Correlation Analysis 343

Under H0 and joint multinormality, one gets the asymptotic chi-square approxima-
tion, i.e.

T ∼ χ(p
2
1 −r)(p2 −r)
,

see also Mardia et al. (1979). Alternatively, Monte Carlo simulations can be used
to test the significance of the sample canonical correlations λ̂1 , . . . λ̂m , see e.g. von
Storch and Zwiers (1999) and references therein.
Remark
1. It is possible to extend CCA to a set of more than two variables, see Kettenring
(1971) for details.
2. Principal prediction patterns (PPPs) correspond to CCA applied to xt and yt =
xt+τ , t = 1, . . . , n − τ . See, e.g. Dorn and von Storch (1999) and von Storch and
Zwiers (1999) for examples.
3. The equations of CCA can also be derived using regression arguments. We first
let xt = aT xt and yt = bT yt . Then to find a and b we minimize the error variance
obtained from regressing yt onto xt , t = 1, . . . , n. Writing

yt = αxt + εt ,
2
it is easy to show that α = cov(x t ,yt ) [cov(xt ,yt )]
var(xt ) and var (εt ) = var (yt )− var(xt ) ; hence,
the sample variance estimate of the noise term is

2
aT Sxy b
σ̂ = b Syy b − T
2 T
.
a Sxx a

σ̂ 2
The patterns a and b minimising bT Syy b
are then obtained after differentiation
yielding

Sxy b = λSxx a and Syx a = λSyy b ,

which is equivalent to (15.5).

15.2.4 Regularised CCA

In several instances the data can have multi-colinearity, which occurs when there are
near-linear relationships between the variables. When this happens, problems can
occur when looking to invert the covariance matrix of X and/or Y. One common
solution to this problem is to use regularised CCA (RCCA). This is very similar to
344 15 Methods for Coupled Patterns

ridge regression (Hoerl and Kennard 1970). In ridge regression4 the parameters of
the model y = Xβ + ε are obtained by adding a regularising term in the residual
sum of squares, RSS = (y − Xβ)T (y − Xβ) + λβ T β, leading to the estimate
β̂ = (XT X + λI)−1 Xy.
In regularised CCA the covariance matrices are similarly replaced by (XT X +
λI)−1 and (XT X + λI)−1 . Therefore, as for CCA, RCCA consists of a spectral
analysis of X(XT X + λ1 I)−1 XT Y(YT Y + λ2 I)−1 YT . In functional CCA, discussed
later, regularisation (or smoothing) is required to get meaningful solution. The
choice of the regularisation parameters is discussed later.

15.2.5 Use of Correlation Matrices

An alternative to using the covariance and cross-covariance matrices is to use


correlation and cross-correlation matrices Rxx , Ryy and Rxy . CCA obtained using
correlation matrices is the same as covariance-based CCA but applied to the
scaled fields. The equations remain the same, but the results will in general be
different. The difference between the two can be compared to the difference between
covariance-based or correlation-based EOFs. Note that CCA is invariant to scale
changes in the variables.

15.3 Canonical Covariance Analysis

Canonical covariance analysis (CCOVA) is similar to CCA except that canonical


variates are constrained to have maximum covariance instead of maximum correla-
tion. The method has been introduced into climate research by Wallace et al. (1992)
and Bretherton et al. (1992) under the name of SVD analysis. Some authors, see e.g.
von Storch and Zwiers (1999), have proposed to rename it as “maximum covariance
analysis” (MCA).
Using the same notation as before, the covariance between the respective
canonical variates (or expansion coefficient), at = aT xt , and bt = bT yt , t =
1, . . . , n, obtained from the sample is cov(at , bt ) = aT Sxy b. The CCOVA, or
MCA, problem is then formulated as follows. Find the patterns a and b maximising
cov(at ,bt )
a b , or similarly

4 Ridge regression is closely related to Tikhonov regularisation in Hilbert spaces. Tikhonov


regularisation consists of finding an approximate solution to an “ill-posed” problem, Af = u,
by solving instead a “regularised” problem, (A + λI) = u. This yields the approximate solution
f̂ = (A∗ A + λI)−1 A∗ u, which is obtained using “penalised least squares” by minimising
Af − u 2 + λ f 2 . The matrix A∗ is the adjoint of A.
15.3 Canonical Covariance Analysis 345

 
maxu,v γ = uT Sxy v s.t uT u = vT v = 1. (15.11)

The solution to Eq. (15.11) is provided by

Sxy v = λu and uT Sxy = λvT . (15.12)

Hence u and v correspond, respectively, to the leading eigenvectors of Sxy Syx and
Syx Sxy , or equivalently to the left and right singular vectors of the cross-covariance
matrix Sxy .
In summary, from the cross-covariance data matrix Sxy = n1 XT Y where X and Y
are the centred data matrices of the respective fields, the set of canonical covariances
are provided by the singular values γ1 , . . . , γm , of Sxy arranged in decreasing order.
The set of canonical covariance pattern pairs is provided by the associated left and
right singular vectors U = (u1 , . . . , um ) and V = (v1 , . . . , vm ), respectively, where
m is the rank of Sxy . Unlike CCA patterns, the canonical covariance pattern pairs
are orthogonal to each other by construction.
The CCOVA can be easily computed in Matlab. If X(n, p) and Y (n, q) are the
two data matrices of both fields, the SVD of XT Y gives the left and right singular
vectors along with the singular values, which can be arranged in decreasing order,
as for EOFs (Chap. 3):

>> [u s v] = svds (X’ * Y, 10, ‘L’);

A typical example of regional climate and related teleconnection is the Mediter-


ranean evaporation, which has been discussed, e.g. in Zveryaev and Hannachi
(2012, 2016). For example, Zveryaev and Hannachi (2012) identified different
teleconnection patterns associated with Mediterranean evaporation depending on
the season. East Atlantic pattern (e.g. Woollings et al. 2010) was found in the winter
season, whereas in the summer season a teleconnection of tropical origin was found.
It has been suggested that the heating associated with the Asian summer monsoon
(ASM) can initiate Rossby wave that can force climate variability over the east
Mediterranean (Rodwell and Hoskins 1996). Correlation between all India rainfall,
an index measuring the strength of the Indian summer monsoon, and summer MEVA
reveals an east–west dipolar structure. A more general procedure is to apply SVD
analysis between MEVA and outgoing long wave radiation (OLR) over the Asian
summer monsoon region.
Figure 15.1 shows the leading left and right singular vectors of the covariance
matrix between September MEVA and August Asian summer monsoon OLR. The
figure shows clear teleconnection between the Indian monsoon and MEVA with a
lead–lag of 1 month. Stronger (than normal) Indian monsoon yields weaker (resp.,
stronger) than normal evaporation over the eastern (resp., western) Mediterranean.
The time series associated with the left and right singular vectors (or left and right
singular PCs) have maximum covariance. Figure 15.2 shows these time series,
scaled to have unit standard deviation. The correlation between the time series is
0.64.
346 15 Methods for Coupled Patterns

7
Left singular vector (Sep Med. Evap.) 5

-1

-3

-5

-7

-9

° -11
30 N
° -13
0
-15

Right singular vector (Aug. Asian Summer Monsoon OLR)


°
8
30 N
7
6
5
4
3
2
1
0
-1
-2
° -3
0
-4
-5
-6
-7

°
60 E

Fig. 15.1 Left (top) and right (bottom) leading singular vectors of the covariance matrix between
September Mediterranean evaporation and August Asian monsoon outgoing long wave radiation

Remark Both CCA and CCOVA can be used as predictive tools. In fact, if, for
example, the right field yt lags the left field xt , t = 1, . . . , n, with a lag τ , then
both these techniques can be used to predict yt from xt as in the case of principal
predictive patterns (PPPs), see e.g. Dorn and von Storch (1999). In this case, the
leading pair of canonical correlation patterns, for example, yield the corresponding
time series that are most cross-correlated at lag τ , see also Barnston and Ropelewski
(1992).
15.4 Redundancy Analysis 347

Scaled left/right singular PCs of Aug. OLR/Sep. MEVA singular vector


2

-1

-2

-3
1958 1969 1979 1989 1999

Fig. 15.2 Leading time series associated with the leading singular vectors of Fig. 15.1. Red: left
singular vector (MEVA), blue: right singular vector (OLR)

15.4 Redundancy Analysis

15.4.1 Redundancy Index

We have seen in Sect. 15.2.3 (see the remarks in that section) that CCA can be
obtained from a regression perspective by minimising the error variance. This error
variance, however, is not the only way to express the degree of dependence between
two variables. Consider two multivariate time series xt and yt , t = 1, 2, . . . (with
zero mean for simplicity). The regression of yt on xt is written as

yt = xt + εt , (15.13)

where  =  yx  −1xx is the regression matrix obtained by multiplying (15.13) by


xt and then taking expectation. The error covariance matrix can also be obtained
from (15.13) and is given by

 εε =  yx  −1
xx  xy .

The error variance is then given by tr ( εε ). An alternative way to measure the


dependence or the redundancy of one variable relative to another is through the
redundancy index (Stewart and Love 1968):

tr  yy −  εε tr  yx  −1
xx  xy
R 2 (yt , xt ) = = . (15.14)
tr  yy tr  yy
348 15 Methods for Coupled Patterns

Hence the redundancy index represents the proportion of the variance in yt


explained by xt .
Remark
• The redundancy index is invariant under nonsingular transformation of the
independent variable xt and orthogonal transformation of the dependent variable
yt .
• The redundancy index represents the fraction of variance explained by the
regression. In one dimension the redundancy index represents the R-square.

15.4.2 Redundancy Analysis

Redundancy analysis was introduced first by van den Wollenberg (1977) and
extended later by Johansson (1981) and put in a unified frame by Tyler (1982). It
aims at finding pattern transformations (matrices) P and Q such that R 2 (Qyt , Pxt )
is maximised. To simplify the calculation, we will reduce the search to one single
pattern p such that R 2 yt , pT xt is maximised. Now from (15.14) this redundancy
−1
tr pT  xx p  yx ppT  xy
index takes the form: tr ( yy )
, which, after simplification using a
little algebra, yields the redundancy analysis problem:
 
  pT   p
xy yx
max R yt , p xt =
2 T
. (15.15)
p pT  xx p

The solution to Eq. (15.15) is obtained by solving the eigenvalue problem:

 −1
xx  xy  yx p = λp. (15.16)

A better way to solve (15.16) is to transform it to a symmetric eigenvalue problem


by multiplying (15.16) through by  yx . In fact, letting q = η yx p, where η is a
number to be found later, Eq. (15.16) becomes

Aq =  yx  −1
xx  xy q = λq. (15.17)

The qs are then the orthogonal eigenvectors of the symmetric positive semi-definite
matrix A =  yx  −1 xx  xy . Furthermore, from (15.16) the redundancy (actually
minus redundancy) becomes R 2 yt , pT xt = λ.
van den Wollenberg (1977) solved (15.16) and its equivalent, where the roles of x
and y are exchanged, i.e. the eigenvalue problem of  −1
yy  yx  xy . As pointed out by
Johansson (1981) and Tyler (1982), the transformations of xt and yt are not related.
Johansson (1981) suggested using successive linear transformations of yt , i.e. bT yt ,
such that bT  yx b is maximised with bs being orthogonal. These vectors are in fact
15.5 Application: Optimal Lag Between Two Fields and Other Extensions 349

the q vectors defined above, which are to be unitary and orthogonal. Since one
wishes the vectors qk = η yx pk to be unitary, i.e. η2 pTk  xy  yx pk = 1, we must
have ηk2 = λ−1
k where the λk s are the eigenvalues of the matrix A (see Eq. (15.17)).
Here we have taken pTk  xx pk = 1, which does not change the optimisation
−1/2
problem (15.15). Therefore the q vectors are given by qk = λk  yx pk , and
−1/2
similarly pk = λk  −1xx  xy qk .

Exercise Given that q = λ−1/2  yx p, show that p = λ−1/2  −1


xx  xy q. (Hint. Use
Eq. (15.16).)
Remark It can be seen from (15.16) that if xt = yt one gets the singular vectors
of the (common) covariance matrix of xt and yt . Hence EOFs represent a particular
case of redundancy analysis.

15.5 Application: Optimal Lag Between Two Fields and


Other Extensions

15.5.1 Application of CCA

1. Given two zero-mean fields xt and yt , t = 1, . . . n, one wishes to find the optimal
lag τ between the two fields along with the associated patterns. If one chooses
the correlation between the two time series aT xt and bT yt+τ as a measure of
association, then the problem becomes equivalent to finding patterns a and b
satisfying
 2 
aT  xy (τ )b
max φ (a, b) = , (15.18)
a,b aT  xx a bT  yy b

where  xy (τ ) is the lagged covariance matrix, i.e.  xy (τ ) = E (xt yt+τ ). The


problem becomes similar to (15.4) and (15.5) except that  xy is now replaced
by  xy (τ ) and  yx by  yx (−τ ) =  Txy (τ ). By using the following SVD
decomposition:

−1 −1
 =  xx2  xy (τ ) yy2 = UVT ,

we get max [φ (a, b)] = λ21 , where λ1 is the largest singular value of .
Exercise Derive the above result, i.e. max [φ (a, b)] = λ21 .
In conclusion a simple way to find the best τ is to plot λ21 (τ ) versus τ and find
the maximum, if it exists. This can be achieved by differentiating this univariate
function and looking for its zeros. The associated patterns are then given by a =
350 15 Methods for Coupled Patterns

−1/2 −1/2
 xx u1 and b =  yy v1 , where u1 and v1 are the leading left and right singular
vectors of , respectively.
2. Note that one could also have taken  xy (τ ) +  yx (−τ ) instead of  xy (τ )
in (15.18) so the problem becomes symmetric. This extension simply means
considering the case y leading x by (−τ ) beside x leading y by τ , which are
the same. 
3. Another extension is to look for patterns that maximise ρxy (τ )dτ . This is like
in OPP (chap. 8), which looks for patterns maximising the persistence time for a
single field, but applied to coupled patterns. In this case the numerator in (15.18)
  2
M
is replaced by aT τ =−M  xy (τ ) b for some lag M. The matrix involved
here is also symmetric.
Remark If in the previous extension one considers the case where yt = xt , then one
obviously recovers the OPP.

15.5.2 Application of Redundancy

The previous extension can also be accomplished using the redundancy


index (15.14). For example, Eq. (15.16) applied to xt and yt+τ yields

 −1
xx  xy (τ ) yx (−τ )u = λu. (15.19)

If one takes yt = xt+τ as a particular case, the redundancy problem (15.19) yields
the generalised eigenvalue problem:

 xx (τ ) xx (−τ )u = λ xx u.

Note also that the matrix involved in (15.8) and (15.19), e.g.  xy (τ ) yx (−τ ), is also
symmetric positive semi-definite. Therefore, to find the best τ (see above), one can
−1/2
plot λ2 (τ ) versus τ , where λ(τ ) is the leading singular value of  xx  xy (τ ), and
choose the lag associated with the maximum (if there is one).
Exercise Derive the leading solution of (15.19) and show that it corresponds to the
−1/2
leading singular value of  xx  xy (τ ).

15.6 Principal Predictors

As it can be seen from (15.5), the time series xt and yt play exactly symmetric
roles, and so in CCA both the time series are treated equally. In redundancy
analysis, however, the first (or left) field xt plays the role of the predictor variable,
15.6 Principal Predictors 351

whereas the second field represents the predictand or response variable. Like
redundancy analysis, principal predictors (Thacker 1999) are based on finding a
linear combination of the predictor variables that efficiently describe collectively
the response variable. Unlike redundancy analysis, however, in principal predictors
the newly derived variables are required to be uncorrelated. Principal predictors
can be used to predict one field from another as presented by Thacker (1999). A
principal predictor is therefore required to be maximally correlated with all the
response variables. This can be achieved by maximising the sum of the squared
correlations with these response variables. Let yt = yt1 , . . . , ytp be the response
field, where p is the dimension of the problem or the number of variables of the
response field, and let a be a principal predictor. The squared correlation between
xt = aT xt and the kth response variable ytk , t = 1, 2, . . . n, is
 2
cov {ytk }t=1,...n , {aT xt }t=1,...n
rk2 = 2 aT S a
, (15.20)
σyk xx

2 = S
 
where σyk yy kk and represents the variance of the kth response variable ytk ,
t = 1, . . . n. The numerator of (15.20) is aT sk sTk a, where sk is the kth column of the
cross-covariance matrix Sxy . Letting Dyy = Diag Syy , we then have

p
1
s sT = Sxy D−1
2 k k yy Syx .
σ
k=1 yk

Exercise Derive the above identity.


(Hint. Take Dyy to be the identity matrix [for simplicity] and compute
p the (i, j )th
element of Sxy Syx and compare it to the corresponding element of k=1 sk sTk .)
p
The maximisation of k=1 rk2 yields
  
Sxy D−1
yy Syx
max aT a . (15.21)
a aT Sxx a

The principal predictors are therefore given by the solution to the generalised
eigenvalue problem:

Sxy D−1
yy Syx a = λSxx a. (15.22)

If μk and uk represent, respectively, the kth eigenvalue and associated left singular
−1/2 −1/2
vector of Sxx Sxy Dyy , then the kth eigenvalue λk of (15.22) is μ2k and the kth
−1/2
principal predictor is ak = Sxx uk . Furthermore, since aTk Sxx al = uk ul = δkl , the
new variables aTk xt are standardised and uncorrelated.
352 15 Methods for Coupled Patterns

Remark The eigenvalue problem associated with EOFs using the correlation matrix,
e.g. Eqs. (2.21) or (3.25), can be written as D−1/2 Sxx D−1/2 u = λu, where D =
Diag (Sxx ) can also be written, after the transformation u = D1/2 v, as D−1 Sxx v =
λv. This latter eigenvalue problem is identical to (15.22) when xt = yt . Therefore
principal predictors reduce, when the two fields are identical, to a simple linear
(diagonal) transformation of the correlation-based EOFs.

15.7 Extension: Functional Smooth CCA

15.7.1 Introduction

Conventional CCA and related methods are basically formulated to deal with
multivariate observations of classical statistics where the data are sampled at
discrete, often regular, time intervals. In a number of cases in various branches of
science, however, the data can be observed/monitored continuously. In medicine, for
example, we have EEG records, etc. In meteorology, barometric pressure records
at a given location provide a good example. In the latter case, we can in fact
have barometric pressure records at various locations. Similarly, we can have a
continuous function of space, e.g. continuous surface temperature observed at
different times. Most space–time atmospheric fields belong to this category when
the coverage is dense enough. This also applies to slowly varying fields in space
and time where, likewise, the obtained patterns are expected to be smooth, see e.g.
Ramsay and Silverman (2006) for an introduction to the subject and for further
examples.
When time series are looked at from the angle of conceptual stochastic processes,
then one could attempt to look for smooth time series. The point is that there is no
universal definition of such a process, although, some definitions of smoothness
have been proposed. For example, if the first difference of a random field is zero
mean multivariate normal, then the field can be considered as smooth (Pawitan
2001, chap. 1). Leurgans et al. (1993) point out that in the context of continuous
CCA, smoothing is particularly essential and that without smoothing every possible
function can become a canonical variate with perfect canonical correlation.

15.7.2 Functional Non-smooth CCA and Indeterminacy

In functional or continuous CCA one assumes that the discrete space–time fields xk
and yk , k = 1, . . . n, are replaced by continuous curves xk (t) and yk (t), k = 1, . . . n,
and t is now a continuous parameter, in some finite interval T. For simplicity we
suppose that the curves have been centred to have zero mean, i.e. nk=1 xk (t) =
15.7 Extension: Functional Smooth CCA 353

n
k=1 yk (t) = 0, for all values of t within the above interval. Linear combinations
5

of xk (t) using, for example, a curve or a continuous function a(t) take now the form
of an integral, i.e.

< a, xk >= a(t)xk (t)dt.
T
In the following we suppose that x(t) and y(t) are two random functions and that
xk (t) and yk (t), k = 1, . . . n, are two finite sample realisations (of length n) drawn
from x(t) and y(t), respectively.6 The covariance between < a, x > and < b, y >
is given by
 
E [< a, x >< b, y >] = a(t)E [x(t)y(s)] b(s)dtds
T T
 
= a(t)Sxy (t, s)b(s)dtds, (15.23)
T T
where Sxy (t, s) = E [x(t)y(s)] represents the cross-covariance between x(t) and
y(s). The sampleestimate  of this cross-covariance is defined in the usual way by
n
Ŝxy (t, s) = n−1 1
k=1 T T a(t)xk (t)yk (s)b(s)dtds = t T a(t)Ŝxy (t, s)a(s)dtds.
A similar expression can be obtained also for the remaining covariances, i.e.
Ŝyx (t, s), Ŝxx (t, s) and Ŝyy (t, s).
Remarks
• The functions Sxx (t, s) and Syy (t, s) are symmetric functions and similarly for
their sample estimates.
• Sxy (t, s) = Syx (s, t)
• Note that by comparison to the standard statistics of covariances the index k
plays the role of “time” or sample (realisation) as pointed out earlier, whereas
the variables t and s in the above integral mimic “space” or variable.
In a similar manner to the standard discrete case, the objective of functional CCA
is to find functions a(t) and b(t) such that the cross-correlation between the linear
combinations < a, xk > and < b, yk > is maximised. The optimisation problem
applied to the population yields

$$\max_{a,b} \int_T\int_T a(t)\, S_{xy}(t,s)\, b(s)\, dt\, ds \qquad (15.24)$$

5 This is like removing the ensemble mean of each field from each curve. Note that "t" here plays
the role of variables in the discrete case and the index k refers to observations or realisations.
6 In the standard notation of stochastic processes x(t) may be better noted as x(t, ω), where ω
refers to the random part. That is, for fixed ω, i.e. ω = ω0 (a realisation), we get a usual function
x(t, ω0) of t, and for fixed t, i.e. t = t0, we get a random variable x(t0, ω).

subject to the condition:

$$\int_T\int_T a(t)\, S_{xx}(t,s)\, a(s)\, dt\, ds = \int_T\int_T b(t)\, S_{yy}(t,s)\, b(s)\, dt\, ds = 1 . \qquad (15.25)$$

The system of equations (15.24)–(15.25) is equivalent to maximising

$$\left(\int_T\int_T a(t)\, S_{xy}(t,s)\, b(s)\, dt\, ds\right)^2 \left(\int_T\int_T a(t)\, S_{xx}(t,s)\, a(s)\, dt\, ds \; \int_T\int_T b(t)\, S_{yy}(t,s)\, b(s)\, dt\, ds\right)^{-1} .$$

If one is interested in functional canonical (or maximum) covariance analysis, then


we obtain the same equations except that the covariances of the individual variables
are reduced to unity, i.e. Sxx = Syy = 1.
As for conventional CCA, the optimisation problem given by Eqs. (15.24)–
(15.25) yields the eigenvalue problem:
 
$$\begin{array}{l} \int_T S_{xy}(t,s)\, b(s)\, ds = \mu \int_T S_{xx}(t,s)\, a(s)\, ds \\ \int_T S_{xy}(t,s)\, a(t)\, dt = \mu \int_T S_{yy}(t,s)\, b(t)\, dt , \end{array} \qquad (15.26)$$
which can be written as a single generalised eigenvalue problem (as for the discrete
case, Sect. 15.2); see next section for a proof outline of a similar result.
When the above result is applied to the sample curves, it turns out that there
are always functions a(t) and b(t) that guarantee perfect correlation between the
corresponding linear combinations < a, xk > and < b, yk >. Furthermore,
Leurgans et al. (1993) show that any linear combination of xk can be made perfectly
correlated with the corresponding linear combination of yk , k = 1, . . . n. This
points to a conceptual problem in estimating functional CCA from a finite sample
using Eqs. (15.24)–(15.25). This problem is overcome by introducing some sort of
smoothing in the estimation as is shown next.

15.7.3 Smooth CCA/MCA


Canonical Correlation

When one deals with continuous or functional data, smoothing becomes a useful
tool to gain some insights into the data and can also ease the interpretation of the
results. A widely known example of nonlinear smooth surface fitting to a scatter
of data points is the spline. For a given scatter of data points, the smoothing spline attempts
to minimise a penalised residual sum of squares, using a smoothing parameter that
controls the balance between goodness of fit and smoothness. In general terms, the

smoothing takes the form of an integral of the squared second derivative of the
smoothing function. This smoothing derives from the theory of elastic rods and
is proportional to the energy of the rod when stressed, see Appendix A for more
details.
The smoothing procedure in CCA is similar to the idea of spline smoothing. To
achieve smooth CCA, the constraints (15.25) are penalised by a smoothing condition
taking the following form:

$$\int_T\int_T a(t)\, S_{xx}(t,s)\, a(s)\, dt\, ds + \alpha\int_T \left(\frac{d^2}{dt^2}a(t)\right)^2 dt = \int_T\int_T b(t)\, S_{yy}(t,s)\, b(s)\, dt\, ds + \alpha\int_T \left(\frac{d^2}{dt^2}b(t)\right)^2 dt = 1 , \qquad (15.27)$$

where α is a smoothing parameter and is also unknown, see also Ramsay and
Silverman (2006).
To solve the optimisation problem (15.24) subject to the smoothing con-
straints (15.27), a few assumptions on the regularity of the functions involved are
required. To ease things, one considers the notations $\langle a, b\rangle_1 = \int_T a(t)\, b(t)\, dt$ for
the natural scalar product between smooth functions a() and b(), and $\langle a, b\rangle_S = \int_T\int_T a(t)\, S(t,s)\, b(s)\, ds\, dt$ as the “weighted” scalar product between a() and b().
The functions involved, as well as their mth derivatives, m = 1, . . . 4, are supposed
to be square integrable over the interval T. It is also required that the functions
and their first four derivatives have periodic boundary conditions, i.e. if T = [τ0 , τ1 ],
$\frac{d^m a}{dt^m}(\tau_0) = \frac{d^m a}{dt^m}(\tau_1)$, m = 1, . . . 4. With these conditions, we have the following
result.
Theorem The solution to the problem (15.24), (15.27), i.e.

$$\max \langle a, b\rangle_{S_{xy}} \quad \mbox{s.t.} \quad \langle a, a\rangle_{S_{xx}} + \alpha\langle D^2 a, D^2 a\rangle_1 = 1 = \langle b, b\rangle_{S_{yy}} + \alpha\langle D^2 b, D^2 b\rangle_1 ,$$

where $D^m a(t) = \frac{d^m a}{dt^m}(t)$ and similarly for $D^m b$, is necessarily given by the solution
to the following eigenvalue problem:

$$\begin{array}{l} \int_T S_{xy}(t,s)\, a(t)\, dt = \mu\left(\int_T S_{yy}(t,s)\, b(t)\, dt + \alpha D^4 b(s)\right) \\ \int_T S_{xy}(t,s)\, b(s)\, ds = \mu\left(\int_T S_{xx}(t,s)\, a(s)\, ds + \alpha D^4 a(t)\right) . \end{array} \qquad (15.28)$$

Proof Outline 1 We present here an outline of the proof using arguments from the
calculus of variations (e.g. Gupta 2004). The general approach used in the calculus of
variations is to assume the solution to be known and then work out a way to find the
conditions that it satisfies. Let a(t) and b(t) be the solutions to (15.24) and (15.27),
then for any functions â(t) and b̂(t) defined on T and satisfying the above properties
the function $g(\epsilon_1, \epsilon_2) = \langle a + \epsilon_1\hat{a},\, b + \epsilon_2\hat{b}\rangle_{S_{xy}}$ is maximised when $\epsilon_1 = \epsilon_2 = 0$,
subject, of course, to the constraint (15.27). In fact, letting

$$G_1(a, \hat{a}, \epsilon, S) = \langle a + \epsilon\hat{a},\, a + \epsilon\hat{a}\rangle_S + \alpha\langle D^2 a + \epsilon D^2\hat{a},\, D^2 a + \epsilon D^2\hat{a}\rangle_1 - 1 ,$$

the extended function to be maximised is given by

$$G(\epsilon_1, \epsilon_2) = g(\epsilon_1, \epsilon_2) - \frac{1}{2}\mu_1 G_1(a, \hat{a}, \epsilon_1, S_{xx}) - \frac{1}{2}\mu_2 G_1(b, \hat{b}, \epsilon_2, S_{yy}) ,$$

where μ1 and μ2 are Lagrange multipliers. The necessary conditions of the
maximum of $G(\epsilon_1, \epsilon_2)$, obtained using the gradient ∇G at $\epsilon_1 = \epsilon_2 = 0$, yield

$$\begin{array}{l} \langle\hat{a}, b\rangle_{S_{xy}} - \mu_1\left(\langle a, \hat{a}\rangle_{S_{xx}} + \alpha\langle D^2 a, D^2\hat{a}\rangle_1\right) = 0 \\ \langle a, \hat{b}\rangle_{S_{xy}} - \mu_2\left(\langle b, \hat{b}\rangle_{S_{yy}} + \alpha\langle D^2 b, D^2\hat{b}\rangle_1\right) = 0 , \end{array}$$

and this is true for all â and b̂ satisfying the required properties. Now the periodic
boundary conditions imply, using integration by parts:

$$\int_T D^2 a\, D^2\hat{a} = \int_T \hat{a}\, D^4 a ,$$

where the integration is over T. This result is a direct consequence of the fact that
the operator $D^2 = \frac{d^2}{dt^2}$ in the space of the functions satisfying the above properties
is self-adjoint. Therefore the first of the two above equations leads to

$$\int_T\left[\int_T S_{xy}(t,s)\, b(s)\, ds - \mu_1\left(\int_T S_{xx}(t,s)\, a(s)\, ds + \alpha D^4 a(t)\right)\right]\hat{a}(t)\, dt = 0$$

for all functions â() with the required properties, and similarly for the second
equation. Hence the functions a(t) and b(t) are solutions to the integro-differential
equations:

$$\begin{array}{l} \int_T S_{xy}(t,s)\, b(s)\, ds = \mu_1\left(\int_T S_{xx}(t,s)\, a(s)\, ds + \alpha D^4 a(t)\right) \\ \int_T S_{xy}(t,s)\, a(t)\, dt = \mu_2\left(\int_T S_{yy}(t,s)\, b(t)\, dt + \alpha D^4 b(s)\right) . \end{array}$$

Furthermore, after multiplying these two integro-differential equations by a(t) and
b(s), respectively, and then integrating, using the periodic boundary conditions, one
gets μ1 = μ2 .
Proof Outline 2 Another shortcut to the proof can be used as follows. One consid-
ers the extended functional G() with $\epsilon_1 = \epsilon_2 = 0$, considered as a function of a()
and b(). After differentiation7 with respect to a() and b(), the (necessary) optimality
condition is

7 This is a formal differentiation noted as δa and operates as in the usual case. Note that the
differential δa is also a function of the same type.

$$\int \delta a\, S_{xy}\, b - \mu_1\left(\int a\, S_{xx}\, \delta a + \alpha\int D^2 a\, D^2(\delta a)\right) + \int a\, S_{xy}\, \delta b - \mu_2\left(\int b\, S_{yy}\, \delta b + \alpha\int D^2 b\, D^2(\delta b)\right) = 0 .$$

The first part of the equation yields in particular

$$\int \delta a\, S_{xy}\, b - \mu_1\left(\int a\, S_{xx}\, \delta a + \alpha\int D^2 a\, D^2(\delta a)\right) = 0$$

and similarly for the second part. These equalities are satisfied for all perturbations
δa and δb having the required properties. Expanding the integrals using the periodic
boundary conditions yields (15.28).

Application

Ramsay and Silverman (2006), for example, discuss an approximation based on a
numerical integration scheme. In practice one does not need to solve the integro-
differential equations (15.28). The problem can be substantially simplified by
expanding the functions as well as the functional and smooth canonical functions
in terms of basis functions (Appendix A) $\boldsymbol{\phi}(t) = (\phi_1(t), \ldots, \phi_p(t))^T$ and
$\boldsymbol{\psi}(t) = (\psi_1(t), \ldots, \psi_q(t))^T$ as $a(t) = \sum_{k=1}^p u_k\phi_k(t) = \mathbf{u}^T\boldsymbol{\phi}(t)$ and
$b(s) = \sum_{k=1}^q v_k\psi_k(s) = \mathbf{v}^T\boldsymbol{\psi}(s)$. This method was particularly explored by Ramsay and
Silverman (2006).
Basis functions include Fourier and radial basis functions (Appendix A). In the
next section we consider the case of radial basis functions. Here I would like to go
back to the original problem based on (15.24) and (15.27) in the above theorem. We
also consider here the case of different spaces for x and y, hence φ and ψ.
Let us define the following matrices: V = (vij ), A = (aij ), B = (bij ), C = (cij )
and D = (dij ) given, respectively, by vij =< φi , Sxy ψj >, aij =< φi , Sxx φj >,
bij =< D 2 φi , D 2 φj >, cij =< ψi , Syy ψj > and dij =< D 2 ψi , D 2 ψj >,
where the notation <, > refers to the natural scalar product. Then the optimisation
problem (15.24), (15.27) can be transformed (see exercises below) to yield the
following generalised eigenvalue problem for the coefficients u = (u1 , . . . , up )T
and v = (v1 , . . . , vq )T :
     
$$\begin{pmatrix} \mathbf{V}^T & \mathbf{O} \\ \mathbf{O} & \mathbf{V} \end{pmatrix}\begin{pmatrix} \mathbf{u} \\ \mathbf{v} \end{pmatrix} = \mu\begin{pmatrix} \mathbf{O} & \mathbf{C} + \alpha\mathbf{D} \\ \mathbf{A} + \alpha\mathbf{B} & \mathbf{O} \end{pmatrix}\begin{pmatrix} \mathbf{u} \\ \mathbf{v} \end{pmatrix} .$$

There are various ways to determine the smoothing parameter α, and we discuss this
in the next two sections. In the following exercises we attempt to derive the above
system and propose a simple solution.

Exercise 1
1. Using the above notation associated with the expansion in terms of basis
functions, show that the maximisation problem (15.24) and (15.27) (see also the
above theorem) boils down to

$$\max_{\mathbf{u},\mathbf{v}} \mathbf{u}^T\mathbf{V}\mathbf{v} \quad \mbox{s.t.} \quad \mathbf{u}^T\left(\mathbf{A} + \alpha\mathbf{B}\right)\mathbf{u} = 1 = \mathbf{v}^T\left(\mathbf{C} + \alpha\mathbf{D}\right)\mathbf{v} .$$

2. Using Lagrange multipliers μ1 and μ2 , transform the above system to an
unconstrained maximisation problem. To simplify notation, take P = A + αB
and Q = C + αD.
3. Show that μ1 = μ2 .
4. Derive the generalised eigenvalue problem satisfied by the vector a = (uT , vT )T .
(Hint. 3. Use the fact that uT Pu = vT Qv = 1 in association with the gradient,
with respect to u and also v, of the function to be maximised.)
Exercise 2 Since P and Q are symmetric and also positive semi-definite, they have
square roots, e.g. $\mathbf{P} = \mathbf{P}^{\frac{1}{2}}\mathbf{P}^{\frac{1}{2}^T}$.
1. Show that the above maximisation problem can be transformed to

$$\max \boldsymbol{\alpha}^T\mathbf{E}\boldsymbol{\beta} \quad \mbox{s.t.} \quad \boldsymbol{\alpha}^T\boldsymbol{\alpha} = 1 = \boldsymbol{\beta}^T\boldsymbol{\beta} .$$

2. Give the expression of E.
3. Find the solution α and β to the above problem.
4. Derive the expression of u and v as a function of α and β.
(Hint. 2. $\mathbf{E} = \mathbf{P}^{-\frac{1}{2}}\mathbf{V}\mathbf{Q}^{-\frac{1}{2}^T}$; 3. Use the SVD theorem.)
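A minimal Matlab sketch of this basis-function route is given below. It assumes that the matrices V, A, B, C and D have already been assembled from a chosen set of basis functions and that a value of the smoothing parameter alpha has been selected; it then follows the whitening-plus-SVD strategy of Exercise 2.

% Hypothetical sketch: smooth CCA via the basis-function expansion.
P  = A + alpha*B;   Q  = C + alpha*D;    % penalised metric matrices of Exercise 1
Ph = sqrtm(P);      Qh = sqrtm(Q);       % symmetric square roots
E  = Ph \ V / Qh;                        % E = P^(-1/2) V Q^(-1/2)
[Ua, S, Ub] = svd(E);                    % SVD theorem (Exercise 2, part 3)
u  = Ph \ Ua(:,1);                       % leading coefficient vector of a(t)
v  = Qh \ Ub(:,1);                       % leading coefficient vector of b(s)
mu = S(1,1);                             % associated (penalised) canonical correlation

The leading smooth canonical functions are then recovered as a(t) = u'φ(t) and b(s) = v'ψ(s).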

Maximum Covariance

Smooth maximum covariance analysis (SMCA) is similar to SCCA except that the
constraint conditions (15.27) are reduced to
$$\int_T a^2 + \alpha\int_T\left(D^2 a\right)^2 = \int_T b^2 + \alpha\int_T\left(D^2 b\right)^2 = 1 . \qquad (15.29)$$

The optimisation problem (15.24) subject to (15.29) yields the following system of
integro-differential equations:

$$\begin{array}{l} \int_T S_{xy}(t,s)\, b(s)\, ds = \mu\left(1 + \alpha D^4\right) a(t) \\ \int_T S_{xy}(t,s)\, a(t)\, dt = \mu\left(1 + \alpha D^4\right) b(s) . \end{array} \qquad (15.30)$$

Exercise Using (15.24) and (15.29), derive (15.30).


(Hint. Use the remarks above.)
In those equations the smoothing parameter is also unknown, but in practice
various values can be tested to have an idea of the range of appropriate values. The
eigenvalue μ and the eigenfunctions a() and b() are the actual unknowns. Note also that
when α = 0, Eqs. (15.28) and (15.30) yield examples of (non-smooth) functional
CCA and functional MCA. In particular, the functional MCA (15.30) when α = 0
is given by

$$\int \mathbf{K}(s,t)\, \mathbf{u}(t)\, dt = \mu\mathbf{u}(s) , \qquad (15.31)$$

where

$$\mathbf{K}(s,t) = \begin{pmatrix} 0 & S_{xy}(s,t) \\ S_{xy}(t,s) & 0 \end{pmatrix}$$

and u(s) = (a(s), b(s))T . Equation (15.31) is a vector version of the Fredholm
homogeneous integral equation of the second kind8 (see e.g. Roach 1970). Note that
when the continuous variables are equal, i.e. x = y, then Eq. (15.30) yields
smooth/functional PCA, see next section.
Contrary to conventional CCA/MCA, where the obtained solution is a finite set
of vectors, here the solution is an infinite, in general denumerable, set of functions. In
application, however, the integro-differential equations (15.28) or (15.30) will have
to be discretised, hence yielding a finite set of equations.
In a similar manner to the case of canonical correlation, the previous integro-
differential equations can be simplified by using similar basis functions. The
obtained generalised eigenvalue problem is similar to the generalised eigenvalue
problem shown above (see also Exercise 1 above) except that here A =< φ, φ T >
and C =< ψ, ψ T >. Note that in general terms, the basis functions can be chosen
to be identical, i.e. φ = ψ, here and in canonical correlations.
Exercise Consider the above Exercise 2 again. Derive the maximisation problem
for this case and give the expression of E.

15.7.4 Application of SMCA to Space–Time Fields

We now suppose that we have two continuous space–time fields F (x, tk ) and
G (y, tk ), observed at times tk , k = 1, . . . n, where x and y represent spatial
locations. We also suppose that x and y vary in two spatial domains Dx and Dy ,

8 If μ is fixed to 1, (15.31) becomes of first kind.



respectively. The covariance function between fields F and G (with zero mean) at x
and y is given by

$$S(\mathbf{x}, \mathbf{y}) = \frac{1}{n}\sum_{k=1}^n F(\mathbf{x}, t_k)\, G(\mathbf{y}, t_k) . \qquad (15.32)$$

The objective is to find (continuous) spatial patterns (functions) u(x) and v(y)
maximising the integrated covariance:

$$\int_{D_x}\int_{D_y} u(\mathbf{x})\, S(\mathbf{x}, \mathbf{y})\, v(\mathbf{y})\, d\mathbf{x}\, d\mathbf{y}$$

subject to the smoothing constraint

$$\int_{D_x}\left[u^2(\mathbf{x}) + \alpha\left(\nabla^2 u(\mathbf{x})\right)^2\right] d\mathbf{x} = 1 = \int_{D_y}\left[v^2(\mathbf{y}) + \alpha\left(\nabla^2 v(\mathbf{y})\right)^2\right] d\mathbf{y} ,$$

where $\nabla^2 u = \Delta u$ is the Laplacian of u(). In two dimensions, for example, where
$\mathbf{x} = (x_1, x_2)$, the Laplacian of u(x) is $\nabla^2 u(x_1, x_2) = \left(\frac{\partial^2}{\partial x_1^2} + \frac{\partial^2}{\partial x_2^2}\right)u(x_1, x_2)$. A
similar constraint for v(y) is also required. The situation is exactly similar to (15.24)
and (15.29) except that now one is dealing with spatial fields, and one gets the
following integro-differential system:

$$\begin{array}{l} \int_{D_x} S(\mathbf{x}, \mathbf{y})\, u(\mathbf{x})\, d\mathbf{x} = \mu\left(1 + \alpha\nabla^4\right) v(\mathbf{y}) \\ \int_{D_y} S(\mathbf{x}, \mathbf{y})\, v(\mathbf{y})\, d\mathbf{y} = \mu\left(1 + \alpha\nabla^4\right) u(\mathbf{x}) . \end{array} \qquad (15.33)$$

To solve (15.33), one can apply, for example, the method of expansion in terms of
radial basis functions (RBFs), see Appendix A. For global fields over the spherical
earth, one can use spherical RBFs, and one ends up with a generalised eigenvalue
problem similar to that presented in the application in Sect. 15.7.3. This will be
discussed further in more detail in the next chapter in connection with smooth EOFs.
An alternative (easy) method is to discretise the left hand side of the system. In
practice, the fields are provided by their respective (centred) data matrices $\mathbf{X} = (x_{tk})$, t = 1, . . . n, k = 1, . . . p, and $\mathbf{Y} = (y_{tj})$, t = 1, . . . n, j = 1, . . . q. The
cross-covariance matrix is then approximated by

$$\mathbf{S}_{xy} = \frac{1}{n}\mathbf{X}^T\mathbf{Y} ,$$

where the fields are supposed to have been centred. The objective is then to find
patterns $\mathbf{a} = (a_1, \ldots, a_p)^T$ and $\mathbf{b} = (b_1, \ldots, b_q)^T$ satisfying

$$\begin{array}{l} \mathbf{S}_{xy}^T\mathbf{a} = \mu\left(\mathbf{I}_q + \alpha\mathbf{D}^4\right)\mathbf{b} \\ \mathbf{S}_{xy}\mathbf{b} = \mu\left(\mathbf{I}_p + \alpha\mathbf{D}^4\right)\mathbf{a} , \end{array} \qquad (15.34)$$

which can be written in the form of a single generalised eigenvalue problem. This is
the “easy” solution and is also discussed further in the next chapter. Note that if one
is interested in smooth CCA, the identity matrices Ip and Iq above are to be replaced
by Sxx and Syy , respectively. The Laplacian operator ∇ 2 in (15.33) is approximated
using a finite difference scheme. In one dimension, for example, if the real function
a(x) is observed at discrete positions x1 , . . . xm , then

$$\frac{d^2}{dx^2}a(x_k) \approx \frac{1}{(\delta x)^2}\left[a(x_{k+1}) - 2a(x_k) + a(x_{k-1})\right] = \frac{1}{(\delta x)^2}\left(a_{k+1} - 2a_k + a_{k-1}\right) ,$$

where δx = xk+1 − xk is the length of the discretised mesh. Hence $\mathbf{D}^2$ will be
a tridiagonal matrix. With appropriate boundary conditions, taking for simplicity
a(x0) = a(xm+1) = 0, one gets

$$\mathbf{D}^2 = \frac{1}{(\delta x)^2}\begin{pmatrix} -2 & 1 & 0 & \ldots & 0 \\ 1 & -2 & 1 & \ldots & 0 \\ \vdots & \ddots & \ddots & \ddots & 0 \\ 0 & \ldots & 1 & -2 & 1 \\ 0 & \ldots & 0 & 1 & -2 \end{pmatrix} .$$
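As a concrete illustration, the Matlab lines below assemble this one-dimensional discretisation and solve (15.34) as a single generalised eigenvalue problem. The centred data matrices X (n × p) and Y (n × q), the grid spacings dx and dy and the value of alpha are assumed to be given.

% Hypothetical sketch of the discretised smooth MCA (15.34) in one dimension.
[n, p] = size(X);   q = size(Y, 2);
Sxy = X' * Y / n;                                      % cross-covariance matrix
D2x = (-2*eye(p) + diag(ones(p-1,1),1) + diag(ones(p-1,1),-1)) / dx^2;
D2y = (-2*eye(q) + diag(ones(q-1,1),1) + diag(ones(q-1,1),-1)) / dy^2;
D4x = D2x * D2x;    D4y = D2y * D2y;                   % discretised fourth derivatives
L = [zeros(p,p) Sxy; Sxy' zeros(q,q)];                 % couples a and b as in (15.34)
R = blkdiag(eye(p) + alpha*D4x, eye(q) + alpha*D4y);   % smoothing operators
[W, M] = eig(L, R);                                    % generalised eigenvalue problem
[~, i1] = max(real(diag(M)));                          % leading coupled mode
a = W(1:p, i1);   b = W(p+1:end, i1);                  % smooth patterns a and b

For smooth CCA the two identity matrices would simply be replaced by Sxx and Syy, as noted above.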

Once a and b are found, the corresponding smooth functions can be obtained using,
for example, radial basis functions as
$$a(\mathbf{x}) = \sum_{k=1}^p a_k\, \phi(|\mathbf{x} - \mathbf{x}_k|) \quad \mbox{and} \quad b(\mathbf{y}) = \sum_{l=1}^q b_l\, \phi(|\mathbf{y} - \mathbf{y}_l|) , \qquad (15.35)$$

where φ() is a radial basis function. One could also use other smoothing procedures
such as splines or kernel smoothers.
In two dimensions, the discretised Laplacian in the plane is approximated by

$$\nabla^2 u(k,l) = \frac{u(k-1,l) - 2u(k,l) + u(k+1,l)}{(\delta x)^2} + \frac{u(k,l-1) - 2u(k,l) + u(k,l+1)}{(\delta y)^2} .$$

In spherical geometry where λ and ϕ are the longitudinal and latitudinal coordinates,
i.e. x = r cos ϕ cos λ, y = r cos ϕ sin λ and z = r sin ϕ, the Laplacian takes the
form:

$$\nabla^2 u = \frac{1}{r^2}\frac{\partial^2 u}{\partial\varphi^2} + \frac{1}{r^2\cos^2\varphi}\frac{\partial^2 u}{\partial\lambda^2} - \frac{\tan\varphi}{r^2}\frac{\partial u}{\partial\varphi} ,$$

which can be easily discretised. The discretised Laplacian in spherical geometry is
given in Chap. 14. Note that in (15.34) the discretised Laplacian depends on the

variables. For example, xt and yt may be observed at different grid points and from
different regions, e.g. hemispheric vs. regional, as in the example when hemispheric
SLP and tropical SST fields are used. Consequently, in (15.34) the Laplacian with
respect to x and y may be denoted, for example, by D1 and D2 , respectively.
The choice of the smoothing parameter α is slightly more complicated. The
experimentalist can always choose α based on previous experience. Another more
efficient approach is based on cross-validation, see e.g. chap. 7 of Ramsay and Sil-
verman (2006). The cross-validation procedure, introduced originally in statistical
estimation problems, can be extended in the same way to deal with SMCA. Letting
$\mathbf{z}_t^T = \left(\mathbf{x}_t^T, \mathbf{y}_t^T\right)$, we decompose the field zt using the eigenvectors $\mathbf{u}_j = \left(\mathbf{a}_j^T, \mathbf{b}_j^T\right)^T$
obtained from the generalised eigenvalue problem (15.34) and then compute the
“residuals”9 $\boldsymbol{\varepsilon}_t = \mathbf{z}_t - \sum_{j=1}^m \beta_j\mathbf{u}_j$. If $\mathbf{u}_j^{(k)}$, j = 1, . . . , m, are the eigenvectors
obtained after removing the kth observation, and εt,k are the resulting residuals,
then the cross-validation is computed as

$$C_v(\alpha) = \sum_{k=1}^n \frac{1}{n-1}\sum_{t=1}^n \mathrm{tr}\left(\boldsymbol{\varepsilon}_{t,k}(\alpha)\,\boldsymbol{\varepsilon}_{t,k}^T(\alpha)\right) = \sum_{k=1}^n \frac{1}{n-1}\sum_{t=1}^n \boldsymbol{\varepsilon}_{t,k}^T\boldsymbol{\varepsilon}_{t,k} ,$$

which can be minimised to yield the cross-validation estimate of the smoothing
parameter α. As for the case of EOFs, one can use instead the eigenvalues to find
α. If $\mu_j^{(k)}$, j = 1, . . . , m, are the m leading eigenvalues of (15.34) when the kth
observation is removed from the analysis, and $\sigma^{(k)} = \sum_{j=1}^m \mu_j^{(k)}$, then the optimal
value of α corresponds to the value that maximises $\sum_{k=1}^n \sigma^{(k)}$.
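A leave-one-out sketch of this eigenvalue-based criterion could look as follows; smca_eigs is a hypothetical helper returning the m leading eigenvalues of (15.34) for given (centred) data matrices, and the candidate range of alpha is an arbitrary choice for illustration.

% Hypothetical sketch: choosing alpha by leave-one-out cross-validation.
alphas = logspace(-4, 2, 20);             % candidate smoothing parameters (assumed range)
score  = zeros(size(alphas));
n = size(X, 1);
for ia = 1:numel(alphas)
    for k = 1:n                           % remove the kth observation
        keep = [1:k-1, k+1:n];
        mu_k = smca_eigs(X(keep,:), Y(keep,:), alphas(ia));  % m leading eigenvalues
        score(ia) = score(ia) + sum(mu_k);                   % accumulate sigma^(k)
    end
end
[~, ibest] = max(score);
alpha_opt  = alphas(ibest);               % alpha maximising the summed eigenvalues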
In these methods, the smoothing parameter is either known a priori or estimated
separately. Another interesting approach used in SMCA was presented by Salim et
al. (2005). It attempts to find the maximum covariance patterns and the smoothing
parameter simultaneously using a penalised likelihood approach. Their smoothing
definition for spatial fields a and b is based on the statistical assumption of
multivariate normal IID errors (Pawitan 2001, chap. 18).
Assuming X and Y to be realisations of p- and q-dimensional random variables,
respectively, Salim et al. (2005) consider the regression model

$$\begin{array}{l} \mathbf{X} = \mathbf{v}\mathbf{a}^T + \boldsymbol{\varepsilon}_X \\ \mathbf{Y} = \mathbf{u}\mathbf{b}^T + \boldsymbol{\varepsilon}_Y , \end{array} \qquad (15.36)$$

with $\boldsymbol{\varepsilon}_X \sim N\left(\mathbf{0}, \sigma^2\mathbf{I}_p\right)$ and $\boldsymbol{\varepsilon}_Y \sim N\left(\mathbf{0}, \zeta^2\mathbf{I}_q\right)$, and attempt to maximise a profile
likelihood. The procedure is iterative and aims at maximising the
normalised covariance between Xa and Yb, see Salim et al. (2005) for the algorithm.

9 The residuals here do not have the same meaning as those used to construct EOFs via
minimisation. Here instead, these residuals are used as an approximation to compute the
“misfit”.

Smooth MCA can be used to derive a new set of EOFs, smooth EOFs. These
represent, in fact, a particular case of smooth MCA where the spatial fields are
identical. This is discussed further in the next chapter.

15.8 Some Points on Coupled Patterns and Multivariate


Regression

In multivariate regression one attempts to explain the variability of the predictand


in terms of a linear model of the predictor variables:

$$\mathbf{y} = \mathbf{A}\mathbf{x} + \boldsymbol{\varepsilon} . \qquad (15.37)$$

The regression matrix A is given by

A =< yxT >< xxT >−1 , (15.38)

where the bracket stands for the expectation operator, i.e. < . >= E(.). The least
squares fitted value of the predictand is given by

ŷ = Ax =< yxT >< xxT >−1 x. (15.39)

Now, given a p×n data matrix X of predictors and a q×n matrix Y of predictand
or dependent variables, the least squares regression matrix is given by

$$\mathbf{A} = \mathbf{Y}\mathbf{X}^T\left(\mathbf{X}\mathbf{X}^T\right)^{-1} . \qquad (15.40)$$

Note that in (15.40) X and Y are assumed to be scaled by n − 1. In many instances
one transforms one or both data sets using a linear transformation for various
purposes, such as reducing the number of variables etc., and one would like
to identify the regression matrix of the transformed variables. If, in model (15.37),
x and y are replaced, respectively, by $\mathbf{x}' = \mathbf{L}\mathbf{x}$ and $\mathbf{y}' = \mathbf{M}\mathbf{y}$, where L and M are
two matrices, then the model becomes

$$\mathbf{y}' = \mathbf{A}'\mathbf{x}' + \boldsymbol{\varepsilon}' . \qquad (15.41)$$

Using the data matrices $\mathbf{X}' = \mathbf{L}\mathbf{X}$ and $\mathbf{Y}' = \mathbf{M}\mathbf{Y}$, the new regression matrix is

$$\mathbf{A}' = \mathbf{Y}'\mathbf{X}'^T\left(\mathbf{X}'\mathbf{X}'^T\right)^{-1} = \mathbf{M}\mathbf{Y}\mathbf{X}^T\mathbf{L}^T\left(\mathbf{L}\mathbf{X}\mathbf{X}^T\mathbf{L}^T\right)^{-1} . \qquad (15.42)$$

In the particular case where L is invertible, one gets

$$\mathbf{A}' = \mathbf{M}\mathbf{A}\mathbf{L}^{-1} . \qquad (15.43)$$

Remarks When L is invertible, one gets a number of useful properties:

• The predicted value of the transformed variables is directly linked to the predicted
value of the original variables as

$$\hat{\mathbf{y}}' = \mathbf{A}'\mathbf{x}' = \mathbf{M}\hat{\mathbf{y}} . \qquad (15.44)$$

• We also get, via Eqs. (15.37) and (15.41), a relationship between the sums of
squared errors:

$$\boldsymbol{\varepsilon}'^T\boldsymbol{\varepsilon}' = (\mathbf{y}' - \hat{\mathbf{y}}')^T(\mathbf{y}' - \hat{\mathbf{y}}') = (\mathbf{y} - \hat{\mathbf{y}})^T\mathbf{M}^T\mathbf{M}(\mathbf{y} - \hat{\mathbf{y}}) . \qquad (15.45)$$

The last equality in (15.45) means, in particular, that both the squared error (y −
ŷ)T (y − ŷ) and the positive semi-definite quadratic forms of the error, (y −
ŷ)T MT M(y − ŷ), are minimised.
From the above remarks, it can be seen, by choosing particular forms for the matrix
M, such as rows of the identity matrix, that the error covariance of each variable
separately is also minimised. Consequently the full regression model also embeds
the individual models for the predictand variables (Tippett et al. 2008). By using
the SVD of the regression matrix A, the regression coefficients can be interpreted
in terms of correlation, explained variance, standardised explained variance and
covariance (Tippett et al. 2008). For example, by using a whitening transformation,
i.e. using the scaled PCs of both variables, then (as in the unidimensional case) the
regression matrix A simply becomes the correlation matrix between the (scaled)
predictand and predictors.
We now consider the SVD of the matrix A', i.e. $\mathbf{A}' = \mathbf{U}\mathbf{S}\mathbf{V}^T$; then $\mathbf{U}^T\mathbf{Y}'$
and $\mathbf{V}^T\mathbf{X}'$ decompose the (pre-whitened) data into time series that are maximally
correlated, and uncorrelated with subsequent ones. In terms of the original variables
the weight vectors satisfy, respectively, $\mathbf{Q}_x^T\mathbf{X} = \mathbf{V}^T\mathbf{X}'$ and $\mathbf{Q}_y^T\mathbf{Y} = \mathbf{U}^T\mathbf{Y}'$ and are
given by

$$\mathbf{Q}_x = \left(\mathbf{X}\mathbf{X}^T\right)^{-1/2}\mathbf{V} \quad \mbox{and} \quad \mathbf{Q}_y = \left(\mathbf{Y}\mathbf{Y}^T\right)^{-1/2}\mathbf{U} . \qquad (15.46)$$

Associated with these weight vectors are the pattern vectors

$$\mathbf{P}_x = \left(\mathbf{X}\mathbf{X}^T\right)^{1/2}\mathbf{V} \quad \mbox{and} \quad \mathbf{P}_y = \left(\mathbf{Y}\mathbf{Y}^T\right)^{1/2}\mathbf{U} . \qquad (15.47)$$

Remark The pattern vectors are similar to EOFs (associated with PCs) and satisfy
Px QTx X = X, and similarly for Y. These equations are solved in a least square sense,
see e.g. Tippett et al. (2008). Note that the above condition leads to PTx Qx = I, and
similarly for the predictand variables.
Hence the data are decomposed into patterns with maximally correlated time series
and uncorrelated with subsequent predictor and predictand ones. The regression
matrix is also decomposed as

$$\mathbf{A} = \mathbf{P}_y\mathbf{S}\mathbf{Q}_x^T ,$$

and hence CCA diagonalises the regression (Tippett et al. 2008).
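The Matlab lines below sketch this whitening-plus-SVD construction, assuming centred data matrices X (p × n) and Y (q × n) scaled as above; the use of symmetric matrix square roots is an implementation choice for illustration only.

% Hypothetical sketch of the decomposition (15.46)-(15.47) of the regression matrix.
Cx = X * X';    Cy = Y * Y';              % (scaled) covariance matrices
Wx = sqrtm(Cx); Wy = sqrtm(Cy);           % symmetric square roots
Ap = (Wy \ Y) * (Wx \ X)';                % regression matrix of the pre-whitened data
[U, S, V] = svd(Ap);                      % maximally correlated, mutually uncorrelated modes
Qx = Wx \ V;    Qy = Wy \ U;              % weight vectors (15.46)
Px = Wx * V;    Py = Wy * U;              % pattern vectors (15.47)
A  = Py * S * Qx';                        % recovers the full regression matrix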


As for the univariate case, when only the predictor variables are pre-whitened
the regression matrix10 represents the explained variance of the corresponding
(individual) regression. This is what RDA attempts to achieve. If in addition the
predictand variables are scaled by the corresponding standard deviation, then the
regression matrix is precisely what PPA is about.
When the predictor variables are scaled by the covariance matrix, i.e. $\mathbf{X}' = \left(\mathbf{X}\mathbf{X}^T\right)^{-1}\mathbf{X}$, then the regression matrix becomes

$$\mathbf{A}' = \mathbf{Y}\mathbf{X}^T ,$$

which represents the covariances between predictand and predictors, hence MCA.
Tippett et al. (2008) applied different transformations (or filtering) to a statistical
downscaling problem of precipitation over Brazil using a GCM. They found that
CCA provided the best overall results based on correlation as a measure of skill.

Fig. 15.3 Scaled SVD and coupled patterns: CCA, RDA and MCA within the α–β plane of scaled SVD

10 In fact, the absolute value of the elements of the matrix.
Remark MCA, CCA and RDA can be brought together in a unified-like approach
through the scaled SVD (Swenson 2015). If $\mathbf{X} = \mathbf{U}_x\mathbf{S}_x\mathbf{V}_x^T$ is a SVD of the data
matrix X, then the data are scaled as

$$\mathbf{X}^* = \mathbf{U}_x\mathbf{S}_x^{\alpha-1}\mathbf{V}_x^T\mathbf{X} = \mathbf{U}_x\mathbf{S}_x^{\alpha}\mathbf{V}_x^T$$

and similarly for Y. Scaled SVD is then obtained by applying the SVD to the
cross-covariance matrix $\mathbf{X}^*\mathbf{Y}^{*T}$. It can be seen that CCA, MCA and RDA can be
recovered from scaled SVD by using, respectively, α = β = 0, α = β = 1 and
α = 0, β = 1 (Fig. 15.3). Swenson (2015) points out that other intermediate values
of 0 ≤ α, β ≤ 1 can isolate coupled signals better. This is discussed in more detail
in the next chapter.
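A compact Matlab sketch of this scaling, assuming centred full-rank data matrices X (p × n) and Y (q × n) and chosen exponents alpha and beta in [0, 1], could read:

% Hypothetical sketch of the scaled SVD (Swenson 2015).
[Ux, Sx, Vx] = svd(X, 'econ');
[Uy, Sy, Vy] = svd(Y, 'econ');
Xs = Ux * diag(diag(Sx).^alpha) * Vx';    % alpha = 0: whitened (CCA); alpha = 1: unchanged (MCA)
Ys = Uy * diag(diag(Sy).^beta)  * Vy';
[U, S, V] = svd(Xs * Ys');                % coupled patterns of the scaled fields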
Chapter 16
Further Topics

Abstract This chapter describes a number of further methods that have been
developed and applied to weather and climate. They include random projection,
which deals with very large data size; trend EOFs, which finds trend patterns in
gridded data; common EOFs, which identifies common patterns between several
fields; and archetypal analysis, which finds extremes in gridded data. The chapter
also discusses other methods that deal with nonlinearity.

Keywords Random projection · Cyclo-stationary EOFs · Trend EOFs · NAO ·


Siberian high · Common EOF analysis · Aleutian low · Continuum power CCA ·
Kernel MCA · Kernel CCA · Archetypal analysis · Riemannian manifold ·
Simplex visualisation · El-Nino · La-Nina · Western boundary currents ·
Principal nonlinear dynamical modes · Nonlinear PCs

16.1 Introduction

The research in multivariate data analysis has led to further development in various
topics in EOF analysis. Examples of such development include EOFs of large
datasets or data containing quasi-stationary signals. Also, sometimes we seek
to identify trends from gridded climate data without resorting to simple linear
regression. Another example includes the case when we seek to compute, for
instance, common EOFs from different groups of (similar) datasets.
Computer power has lately witnessed an unprecedented explosion, which has
impacted different branches of science. In atmospheric science, climate modelling
has increased in complexity, which has made it possible to run climate models
at high resolution. Datasets with high spatial and/or temporal resolution are being
produced currently by various weather and climate centres across the world. This
has led to the need for ways to analyse these data efficiently. Projection methods
can be used to address this problem. When the data contain quasi-stationary signals
then results from the theory of cyclo-stationary processes can be applied yielding
cyclo-stationary EOFs. Trend EOF analysis is another method that can be used
to identify trend patterns from spatio-temporal climate data. Also, when we have


several datasets, as is frequently encountered in climate simulations from CMIP
(the Coupled Model Intercomparison Project), then common EOF analysis can be
used to efficiently compare these data. This chapter discusses these and other
methods not yet widely applied in weather and climate, including smooth EOFs,
kernel CCA and archetypal analysis.

16.2 EOFs and Random Projection

EOF analysis has proved to be an easy and cheap way to reduce the dimension of
climate data retaining only a small set of the leading modes of variability usually
explaining a substantial amount of variance. This is particularly the case when the
size of the data matrix X is not too high, e.g. O(103 ). Advances in high performance
computers has led to the recent explosion in the volume of data from climate model
simulations, which beg for analysis tools. In particular, dimensionality reduction is
required in order to handle and analyse these climate simulations.
There are various ways to reduce the dimension of a data matrix. Perhaps the
most straightforward method is that based on “random projection” (RP). In simple
terms, RP is based on some sort of “sampling”. Precisely, given a n × p data matrix
X, RP is based on constructing a p × k data matrix R (k < p) referred to as random
projection matrix then projecting X onto R, i.e.

P = XR. (16.1)

By choosing k much smaller than p, the new n × k data matrix P becomes much
smaller than X where EOF analysis, or any other type of pattern identification
method, can be applied much more efficiently. Note that the “rotation matrix” is
approximately orthogonal because the vectors are drawn randomly. It can, however,
be made exactly orthogonal but this will be at the expense of saving memory and
CPU time.
Random projection takes its origin from the so-called Johnson and Lindenstrauss
(1984) lemma (see also Dasgupta and Gupta 2003):
Johnson-Lindenstrauss Lemma
Given a n × p data matrix X = (x1 , x2 , . . . xn )T , for any ε > 0 and integer
$k > O\left(\frac{\log n}{\varepsilon^2}\right)$, there exists a mapping $f: \mathbb{R}^p \rightarrow \mathbb{R}^k$, such that for any 1 ≤ i, j ≤ n,
we have:

$$(1 - \varepsilon)\left\|\mathbf{x}_i - \mathbf{x}_j\right\|^2 \leq \left\|f(\mathbf{x}_i) - f(\mathbf{x}_j)\right\|^2 \leq (1 + \varepsilon)\left\|\mathbf{x}_i - \mathbf{x}_j\right\|^2 . \qquad (16.2)$$

The message from the above lemma is that it is always possible to embed the data
into a lower dimensional space such that the interpoint distance is conserved up to
any desired accuracy. One way to construct such a mapping is to generate random
vectors that make up the rows of the projection matrix R. Seitola et al. (2014) used

the standard normal distribution N (0, 1) and normalised the random row vectors of
R to have unit-length. Other distributions have also been used, see e.g. Achlioptas
(2003), and Frankl and Maehara (1988). Refinement of the lower limit value of the
dimension k, provided in the lemma, was also provided by a number of authors.
For example, the values k = 1 + 9(ε2 − 2ε3 /3)−1 , and k = 4(ε2 /2 − ε3 /3)−1 log n
were provided, respectively, by Frankl and Maehara (1988) and Dasgupta and Gupta
(2003).
Seitola et al. (2014) applied EOF analysis to the randomly projected data. They
reduced the data volume down to 10% and 1% of the original volume, and recovered
the spatial structures of the modes of variability and their associated PCs. Let
the SVD of X be X = USVT , where U and V are the PCs and EOFs of
the full data matrix X, respectively. When the spatial dimension p is to be reduced,
one first obtains an approximation of the PCs U of X by taking the PCs of the
reduced/projected matrix P, i.e. U ≈ Upr where P = Upr Spr VTpr . The EOFs of X
are then approximated by projecting the PCs of the projected matrix P onto the data
matrix X, i.e.

$$\mathbf{V} \approx \mathbf{X}^T\mathbf{U}_{pr}\mathbf{S}_{pr}^{-1} . \qquad (16.3)$$

When the time dimension is to be reduced the same procedure can be applied to XT
using P = XT R where R is now n × k.
Remark Note that random projection can be applied only to reduce one dimension
but not both. This, however, is not a major obstacle since it is sufficient to reduce
only one dimension.
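A few Matlab lines sketch the procedure for reducing the spatial dimension, assuming a centred n × p data matrix X and a target dimension k ≤ min(n, p):

% Hypothetical sketch of EOFs via random projection (Eqs. 16.1 and 16.3).
R = randn(p, k);                          % Gaussian random projection matrix
R = R ./ vecnorm(R, 2, 2);                % normalise its row vectors to unit length
P = X * R;                                % projected n x k data matrix (16.1)
[Upr, Spr, Vpr] = svd(P, 'econ');         % PCs of the projected data
Vapprox = X' * Upr / Spr;                 % approximate EOFs of the full data (16.3)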
Seitola et al. (2014) applied the procedure to monthly surface temperature from
a millennial Earth System Model simulation (Jungclaus 2008) using two cases with
n × p = 4608 × 4607 and n × p = 4608 × 78336. They compared the results
obtained using 10% and 1% reductions. An example of plot of EOF patterns of
the original and reduced data is shown in Fig. 16.1. Figure 16.2 shows the (spatial)
correlation between the EOFs of the original data and those approximated from the
10% (top) and 1% (bottom) reduction of the spatial dimension. Clearly, the leading
EOFs are well reproduced even with a 1% reduction. Figure 16.3 compares the
spectra of the PCs from the original data and those from 10% and 1% reduction,
respectively. The main peaks associated with periods 1, 1/2, 1/3 and 1/4 yr are very
well reproduced in the PCs of the projected data. They also obtained similar results
with the reduction of the 4608 × 78336 data matrix. When the time dimension is
reduced, Seitola et al. (2014) find similar results to those obtained by reducing the
spatial dimension. The random projection was also extended to delay coordinates
yielding randomised multichannel SSA by Seitola et al. (2015). They applied it
to the twentieth century reanalysis data and two climate model (HadGEM2-ES and
MPI-ESM-MR) simulations from the CMIP5 data archive. They found, in particular,
that the 2–6 year timescale variability in the central Pacific was well captured by
HadGEM2.
Fig. 16.1 Ninth to the twelfth EOF patterns obtained from the original model simulation (left),
randomly projected data with 10% (middle) and 1% (right) reduction. Adapted from Seitola et al.
(2014)

16.3 Cyclo-stationary EOFs

16.3.1 Background

Conventional EOF analysis decomposes a space-time field into a finite sum of


“stationary” patterns modulated by time series amplitudes. The stationarity of EOFs
simply refers to the fact that these patterns do not change with time, but only their
amplitude and sign do. It is argued, however, that weather and climate variables are
not stationary but are characterised by specific temporal scales/cycles, such as the
annual cycles, making them rather cyclo-stationary. The conventional way to deal
with this is to either subtract a smoothed or modulated version of the cycle or to
focus on a specific season, such as the winter season.
The argument, however, is that the various cycles present in weather and climate
variables are not entirely additive but do interact with the variability of other
time scales, i.e. non-cyclic variations. This makes the variance as well as the
spatial and temporal autocorrelation characteristics season-dependent. This suggests
a wholistic analysis of weather and climate variables instead of, e.g. subtracting
the various cycles, as it was discussed in cyclo-stationary POP (CSPOP) analysis,
see Chap. 6. One way to achieve this wholistic analysis is through the application
of cyclo-stationary EOF (CSEOF) analysis as it was suggested by, e.g. Kim et al.
(1996) and Kim and North (1997).
Fig. 16.2 Spatial correlations between the original EOFs and RP with 10% (top) and 1% (bottom)
reduction. Adapted from Seitola et al. (2014)
Fig. 16.3 Spectra of the 10 leading original PCs (a), and RP with 10% (b) and 1% (c) reduction.
Adapted from Seitola et al. (2014)

16.3.2 Theory of Cyclo-stationary EOFs

In its simplest form cyclo-stationary EOF analysis bears some similarities with
CSPOP analysis in the sense that the variables have two distinct time scales
representing, respectively, the cycle and the nested time within the cycle, as in
the theory of cyclo-stationary processes encountered in signal processing (Gardner
and Franks 1975; Gardner 1994; Gardner et al. 2006). Perhaps the main difference
between conventional EOFs and CSEOFs is that the loading patterns (or EOFs) in
the latter method depend on time and are periodic.
If T is the nested period, then the field X(x, t) is decomposed as:

$$X(\mathbf{x}, t) = \sum_k E_k(\mathbf{x}, t)\, v_k(t) , \qquad (16.4)$$

with the property

Ek (x, t) = Ek (x, t + T ). (16.5)

The cyclo-stationary loading vectors, i.e. the CSEOFs, are obtained as the solution
of a Karhunen–Loève equation (Loève 1978):

$$\int\!\!\int K(\mathbf{x}, t; \mathbf{x}', t')\, E(\mathbf{x}', t')\, dt'\, d\mathbf{x}' = \lambda E(\mathbf{x}, t) , \qquad (16.6)$$

where K(.) is the covariance function, i.e.

$$K(\mathbf{x}, t; \mathbf{x}', t') = \mathrm{cov}\left(X(\mathbf{x}, t), X(\mathbf{x}', t')\right) . \qquad (16.7)$$

Cyclo-stationarity implies that the covariance function is doubly periodic, i.e.

$$K(\mathbf{x}, t + T; \mathbf{x}', t' + T) = K(\mathbf{x}, t; \mathbf{x}', t') . \qquad (16.8)$$
Theoretically speaking the periodicity in t × t', Eq. (16.8), implies that K(.) can
be expanded using double Fourier series in t × t', and the CSEOFs can then be
computed in spectral space, which are then back-transformed into physical space.
This is feasible in simple examples such as the unidimensional case of Kim et al.
(1996) who constructed CSEOFs using Bloch’s wave functions encountered in solid
state physics. The CSEOFs are defined by

ψnm (t) = e2π int/N Unm (t), (16.9)

where Unm (.) is periodic with period T , and can be expanded as:

Unm (t) = unmk e2π ikt/T . (16.10)
k

The coefficients unmk are obtained by solving an eigenvalue problem involving the
cyclic spectrum.
The application to weather and climate fields is, however, very expensive in terms
of memory and CPU time. An approximate solution was suggested by Kim and
North (1997), based on the assumption of independence of the PCs. The method is
based on the Fourier expansion of the field X(x, t):

−1
T
X(x, t) = ak (x, t)e2π ikt/T . (16.11)
k=0

The CSEOFs are then obtained as the eigenvectors of the covariance matrix of the
extended data (χ (x, t)), t = 1, . . . , N, where:

χ (x, t) = (a0 (x, t), a1 (x, t), . . . , aT −1 (x, t)) . (16.12)

16.3.3 Application of CSEOFs

CSEOFs have been applied to various atmospheric fields. Hamlington et al. (2014),
for example, argue that CSEOFs are able to minimise mode mixing, a common
problem in conventional EOFs. Hamlington et al. (2011, 2014) applied CSEOFs

to reconstructed sea level. They suggest, using a nested period of 1 year, that
CSEOF analysis is able to extract the modulated annual cycle and the ENSO
signals from the Archiving, Validation and Interpretation of Satellite Oceanographic
(AVISO) altimetry data. Like many other methods, CSEOF analysis has been used
in forecasting (Kim and North 1999; Lim and Kim 2006) and downscaling (Lim et
al. 2010).
Kim and Wu (1999) conducted a comparative study between CSEOF analysis
and other methods based on EOFs and related techniques including extended EOFs,
POPs and cyclo-stationary POPs. Their study suggests that CSEOFs is quite akin to
extended EOFs where the lag is not unity, as in extended EOFs, but is equal to the
nested period T . Precisely, the extended data take the form:

χ (x, t) = (X(x, 1 + tT ), X(x, 2 + tT ), . . . , X(x, T + tT )) , (16.13)

and the associated covariance matrix takes the form:

$$\mathbf{C} = \begin{pmatrix} \mathbf{C}_{11} & \ldots & \mathbf{C}_{1T} \\ \vdots & \ddots & \vdots \\ \mathbf{C}_{T1} & \ldots & \mathbf{C}_{TT} \end{pmatrix} , \qquad (16.14)$$

where Ckl , k, l = 1, . . . , T , is the spatial covariance matrix between X(x, k + tT )
and X(x', l + tT ), i.e.

$$\mathbf{C}_{kl} = \mathrm{cov}\left(X(\mathbf{x}, k + tT), X(\mathbf{x}', l + tT)\right) . \qquad (16.15)$$
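The following Matlab lines sketch this extended-data construction, assuming a centred data matrix X of size (N·T) × p containing N complete cycles of the nested period T:

% Hypothetical sketch of the extended data (16.13) and the resulting EOFs.
[nT, p] = size(X);   N = nT / T;          % N complete cycles of length T
Chi = zeros(N, T*p);                      % extended data: one row per cycle
for t = 0:N-1
    block = X(t*T + (1:T), :);            % X(x, 1+tT), ..., X(x, T+tT)
    Chi(t+1, :) = reshape(block', 1, []); % concatenate the T maps into one row
end
Chi = Chi - mean(Chi);                    % centre the extended data
[U, S, V] = svd(Chi, 'econ');             % EOFs of the extended data
CSEOF1 = reshape(V(:,1), p, T);           % leading mode: one spatial pattern per phase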

16.4 Trend EOFs

16.4.1 Motivation

The original context of EOFs (Obukhov 1947; Fukuoka 1951; Lorenz 1956) was to
achieve a decomposition of a continuous space-time field X(t, s), such as sea level
pressure, where t and s denote, respectively, time and spatial location, as


$$X(t, s) = \sum_{k=1}^M c_k(t)\, a_k(s) \qquad (16.16)$$

using an optimal set of basis functions of space ak () and expansion functions of time
ck (). As it was discussed in Chap. 3 the EOF method has some useful properties,
e.g. orthogonality in space and time. These properties yield, however, a number of
difficulties, as it is discussed in Chap. 3, such as:

• Physical interpretability—this is caused mainly by the predictable geometric


constraints and the mixing property of EOFs.
• Truncation order—which comes about when we seek to reduce the dimension-
ality of the data. Ideally, the best truncation order is provided by the “elbow” of
the spectrum of the covariance matrix. This, however, does not occur in practice
as the spectrum most often looks rather “continuous”.
• Second-order statistics and ellipticity—it is straightforward to see that the
restriction to second-order statistics rests on implicit assumptions
regarding the distribution of the data, namely the Gaussianity of the data,1 since
the probability distribution function (pdf) of a multivariate Gaussian random
variable is completely specified by its mean and its covariance matrix. In fact, if
the data come from such a distribution, then the EOFs are simply the directions of
the principal axes of the distribution (Chatfield and Collins 1980; Jolliffe 2002).
This interpretation extends to a large class of probability distributions, namely
elliptically contoured distributions (Fang and Zhang 1990). A pdf is elliptical if
there exists a linear transformation that transforms the distribution to be spherical,
i.e. a function of the radial coordinate only. In brief, an elliptical pdf is one
whose contour body, i.e. points having the same value of the probability density
function, is an ellipsoid. Elliptically contoured distributions behave somewhat
like Gaussian distributions, but allow for fat tails; the t-distribution is an example.
So EOFs can be easily interpreted when the data are elliptically distributed.
There are other remaining problems with the EOF method related, for example, to
linearity, since the covariance or correlation are “linear” measures of association
and nonlinearity is not taken into consideration. Consequently, information from
high-order moments is excluded from the analysis. Furthermore, EOFs based on the
covariance matrix are in general different from those obtained from the correlation
matrix hence the choice between the two matrices remains arbitrary (Wilks 2011;
Jolliffe 2002).
All the above difficulties are detailed in Hannachi (2007). Ideally, of course,
one would like to find a method that can address all those issues. Note also that
as the measure of association based on conventional covariance or correlation is
characterised by a non-invariance to monotonic transformation, it is desirable to
have a measure of association that is invariant under such a transformation. Such a
method may not exist in reality, and if it does it may not be useful in data reduction.
Now, we know that trends exist in the atmosphere and can characterise weather
and climate variables. We also know that, in general, EOFs do not capture trends
as they are not conceived for that purpose,2 in addition to the mixing problem
characterising EOFs. It turns out, in fact, that the method that addresses the
difficulties listed above identifies trends, and was dubbed trend EOF (TEOF) method
by Hannachi (2007).

1 In practice, of course, EOFs of any data, not necessarily Gaussian, can be computed. But the point

is that using only the covariance matrix is consistent with normality.


2 There are of course exceptions such as when the trend has the largest explained variance.

16.4.2 Trend EOFs

The trend EOF method was introduced as a way to find trend patterns from gridded
data through overcoming the drawbacks of EOFs by addressing somehow the
difficulties listed above, see Hannachi (2007) for details. In essence, the method uses
some concepts from rank correlation. It is based on the rank correlation between the
time positions of the sorted data.
Precisely, let x1 , x2 , . . . xp , where xk = (x1k , . . . xnk ), k = 1, . . . p, be the p
variables, i.e. the p time series, or the columns forming our spatio-temporal field or data
matrix $\mathbf{X} = [\underline{\mathbf{x}}_1, \ldots \underline{\mathbf{x}}_n]^T = (x_{ij})$, i = 1, . . . n, j = 1, . . . p. For each k, 1 ≤ k ≤ p,
we also designate by p1k , p2k , . . . , pnk the ranks of the corresponding kth time
series x1k , . . . xnk . The matrix of rank correlations is obtained by constructing first
the new variables:

yk = gk (xk ) = (p1k , p2k , . . . , pnk ) (16.17)

for k = 1, 2, . . . , p, where gk () is the transformation mapping the time series onto


its ranks. These ranks are given by

yk = (p1k , p2k , . . . , pnk ) = (pk (1), pk (2), . . . , pk (n)) , (16.18)

where pk () is a permutation of {1, 2, . . . , n}. As an illustration, let us consider a


simple time series with only five elements; x = (5, 0, 1, −3, 2). Then the new
variable is given by the ranks of this time series, i.e. y = (5, 2, 3, 1, 4), which is
a permutation p1 () of {1, 2, 3, 4, 5}.
The original data are first sorted in increasing order. Then the position in time
of each datum from the sorted series is taken. These (time) positions (from the
sorted data) constitute our new data. So the newly transformed data z1 , z2 , . . . zp
are composed of p time series, each of which is some permutation of {1, 2, . . . , n},
i.e.

zk = (q1k , q2k , . . . , qnk ) = (qk (1), qk (2), . . . , qk (n)) (16.19)

for some permutation qk () of {1, 2, . . . , n}. It can be seen (see the appendix in
Hannachi (2007)) that this permutation is precisely the reciprocal of the rank
permutation pk (), i.e.

qk = pk−1 = pk(n−2) , (16.20)

(m)
where pk = pk opk o . . . opk = pk (pk (. . . (pk )) . . .) is the mth iteration of the
permutation pk ().
As an illustration consider again the previous simple 5-element time series. The
new transformed time series is obtained by sorting first x to yield (−3, 0, 1, 2, 5).
Then, z consists of the (time) position of these sorted elements as given in the

original time series x, to yield z = (4, 2, 3, 5, 1), which is also a permutation q1 ()


of {1, 2, 3, 4, 5}. It can be easily checked that the permutations p1 () and q1 () are
simply reciprocal of each other, i.e. p1 oq1 () is the identity over the set {1, 2, 3, 4, 5}.
Now by looking for maximum correlation between the time positions of the
sorted data we are attempting to find times when the different time series are
increasing (or decreasing) altogether, i.e. coherently. The leading modes based on
this new correlation (or covariance) are expected therefore to capture these slowly
varying structures or trends. We should note obviously that if there is a very strong
trend with a large explained variance, then most probably it will be captured by
ordinary EOFs as the leading mode. However, this is the exception rather than the
rule.
In summary our new data matrix is now given by
$$\mathbf{Z} = \left(\mathbf{z}_1^T, \mathbf{z}_2^T, \ldots, \mathbf{z}_p^T\right) = \begin{pmatrix} q_{11} & q_{12} & \ldots & q_{1p} \\ q_{21} & q_{22} & \ldots & q_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ q_{n1} & q_{n2} & \ldots & q_{np} \end{pmatrix} \qquad (16.21)$$

and we are looking for correlations (or covariances) between (time) positions from
the sorted data as:

ρT (xk , xl ) = cov (zk , zl ) (16.22)

for k, l = 1, 2, . . . p. The trend EOFs are then obtained as the “EOFs/PCs” of the
newly obtained covariance matrix, which is also identical to the correlation matrix
(up to a multiplicative constant):

$$\boldsymbol{\Sigma}_T = \left(\rho_T(\mathbf{x}_k, \mathbf{x}_l)\right) = \frac{1}{n}\mathbf{Z}^T\mathbf{H}^T\mathbf{H}\mathbf{Z} , \qquad (16.23)$$

where $\mathbf{H} = \mathbf{I}_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^T$ represents the centring operator, In being the n × n
identity matrix, and 1n = (1, 1, . . . , 1)T the column vector of length n
containing only ones. The application of TEOFs is then based on the SVD of the
new data matrix HZ to obtain the trend EOFs/PCs. The Matlab code for trend
EOFs is quite straightforward. Again, if the data matrix is X(n, p), the following
code obtains the required field TX, which is submitted to SVD exactly as in the
conventional EOFs of the data matrix X presented in Chap. 3:
>> [X, TX] = sort (X);   % TX holds the time positions of the sorted data
>> TX = scale (TX);      % centre/scale the columns, as in Chap. 3

Fig. 16.4 Time series of the first variables simulated from Eq. (16.24) (first row), PCs 1, 2 and 4
(second row) and the leading three trend PCs (third row). (a) Time series wt . (b) Time series xt . (c)
Time series yt . (d) PC1. (e) PC2. (f) PC4. (g) Trend PC1. (h) Trend PC2. (i) Trend PC3. Adapted
from Hannachi (2007)

16.4.3 Application of Trend EOFs


Illustration with a Simple Example

The TEOF method was illustrated using simple examples by Hannachi (2007),
as shown in Fig. 16.4. The first row in Fig. 16.4 shows an example of time series
from the following 4-variable model containing a quadratic trend plus a periodic
wave contaminated by an additive AR(1) noise:

$$\left\{\begin{array}{l} w_t = 1.8 a_t + 2\beta b(t) + 1.6\epsilon_{t1} \\ x_t = 1.8 a_t + 1.8\beta b(t) + 2.4\epsilon_{t2} \\ y_t = 0.5 a_t + 1.7\beta b(t) + 1.5\epsilon_{t3} \\ z_t = 0.5 a_t + 1.5\beta b(t) + 1.7\epsilon_{t4} \end{array}\right. \qquad (16.24)$$

Fig. 16.5 Histograms of the correlation coefficients between the quadratic trend, Eq. (16.24), and
the correlation-based PCs 1 (a), 2 (b) and 4 (c), and the trend PCs 1 (d), 2 (e) and 3 (f). Adapted
from Hannachi (2007)

where at is a quadratic trend proportional to t 2 , at ∝ t 2 , β = 2, and $b(t) = \cos\frac{4t}{5} + \sin 5t$.
The noises εtk , k = 1, . . . 4, are AR(1) with respective lag-1
autocorrelations 0.5, 0.6, 0.3 and 0.35. The second row in Fig. 16.4 shows PC1,
PC2 and PC4 of the model whereas the last row shows the leading three “PCs”
obtained from the new data matrix, Eq. (16.21).
The trend is clearly captured by the first (trend) PC (Fig. 16.4g,h,i). To check the
significance of this result a Monte-Carlo approach is used to calculate the correlation
between the quadratic trend and the different PCs. Figure 16.5 shows the histograms
of these correlations obtained using the leading (conventional) PCs (Fig. 16.5a,b,c)
and the trend PCs (Fig. 16.5d,e,f). Clearly the trend is shared between all PCs.
However, when the PCs from the new data matrix, Eq. (16.21), are used the figure is
different (Fig. 16.5d,e,f), where now only the leading PC entirely captures the trend.
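A short Matlab sketch reproducing the flavour of this experiment is given below; the sample size, the scaling of the quadratic trend and the noise generation details are assumptions chosen only for illustration.

% Hypothetical sketch of the synthetic example (16.24) and its trend PCs.
n = 400;  t = (1:n)';  beta = 2;
a_t = (t/n).^2;                              % quadratic trend, proportional to t^2
b_t = cos(4*t/5) + sin(5*t);                 % periodic component
rho = [0.5 0.6 0.3 0.35];  noise = zeros(n, 4);
for j = 1:4                                  % AR(1) noise for each variable
    for i = 2:n
        noise(i,j) = rho(j)*noise(i-1,j) + randn;
    end
end
X = [1.8*a_t + 2.0*beta*b_t + 1.6*noise(:,1), 1.8*a_t + 1.8*beta*b_t + 2.4*noise(:,2), ...
     0.5*a_t + 1.7*beta*b_t + 1.5*noise(:,3), 0.5*a_t + 1.5*beta*b_t + 1.7*noise(:,4)];
[~, TX] = sort(X);                           % time positions of the sorted data
TX = TX - mean(TX);                          % centre the columns
[U, S, V] = svd(TX, 'econ');                 % trend EOFs (V) and trend PCs
trendPC1 = U(:,1) * S(1,1);                  % expected to capture the quadratic trend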

Fig. 16.6 Eigenspectra of the covariance (or correlation) matrix given in Eq. (16.23) of SLP anomalies

Fig. 16.7 As in Fig. 16.6 but for the sea level pressure anomalies

Application to Reanalysis Data

The method was applied to reanalysis data using SLP and 1000-mb, 925-mb and
500-mb geopotential heights from NCAR/NCEP (Hannachi 2007). The application
to the NH SLP anomaly field for the (DJFM) winter months shows a discontinuous
spectrum of the new matrix (Eq. (16.21)) with two well separated eigenvalues and
a noise floor (Fig. 16.6). By contrast, the spectrum of the original data matrix
(Fig. 16.7) is “continuous”, i.e. with no noise floor. It can be seen that the third
eigenvalue is in this noise floor. In fact, Fig. 16.8 shows the third EOF of the new
data matrix where it is seen that clearly there is no coherent structure (or noise),
i.e. no trend. The leading two trend EOFs, with eigenvalues above the noise floor
(Fig. 16.6), are shown in Fig. 16.9. It is clear that the structure of these EOFs is
different from that of the third EOF shown in Fig. 16.8.
Of course, to obtain the “physical” EOF pattern from those trend EOFs some
form of back-transformation is applied from the space of the transformed data

Fig. 16.8 Third EOF of winter monthly (DJFM) NCEP/NCAR SLP anomalies based on
Eq. (16.23)

Fig. 16.9 As in Fig. 16.8 but for the first (left) and second (right) trend EOFs of SLP anomalies

matrix (Eq. (16.21)) into the original (data) space. Hannachi (2007) applied a simple
regression between the trend PC and the original field to obtain the trend pattern,
but it is possible to apply more sophisticated methods for this back-transformation.
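For instance, assuming the centred anomaly field is held in an n × p matrix X0 and a trend PC in an n × 1 vector tpc, the regression map could be computed in one Matlab line:

% Hypothetical sketch of the back-transformation by regression onto the anomalies.
trend_pattern = X0' * tpc / (tpc' * tpc);    % regression coefficient at each grid point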
The obtained patterns are quite similar to the same patterns obtained based on
winter monthly (DJF) SLP anomalies (Hannachi 2007). These leading two patterns
associated with the leading two eigenvalues are shown in Fig. 16.10, and reveal,
respectively, the NAO pattern as the first trend (Fig. 16.10a), and the the Siberian
high as the second trend (Fig. 16.10b). This latter is known to have a strong trend in

Fig. 16.10 The two leading trend modes obtained by projecting the winter SLP anomalies onto
the first and second EOFs of Eq. (16.23) then regressing back onto the same anomalies. (a) Trend
mode 1 (SLP). (b) Trend mode 2 (SLP). Adapted from Hannachi (2007)

the winter time, see e.g. Panagiotopoulos et al. (2005) that is not found in the mid-
or upper troposphere.
The application to geopotential height shows somehow a different signature to
that obtained from the SLP. At 500-mb, for example, there is one single eigenvalue
separated from the noise floor. The pattern of the trend represents the NAO.
However, the eigenspectrum at 925-mb and 1000-mb yield two eigenvalues well

separated from the noise floor. The leading one is associated with the NAO as above,
but the second one is associated with the Siberian high, a well known surface
feature (Panagiotopoulos et al. 2005). The method was applied also to identify
the trend structures of global sea surface temperature by Barbosa and Andersen
(2009), regional mean sea level (Losada et al. 2013), and latent heat fluxes over
the equatorial and subtropical Pacific Ocean (Li et al. 2011a). The method was also
applied to other fields involving diurnal and seasonal time scales by Fischer and
Paterson (2014). They used the trend EOFs in combination with a linear trend model
for diurnal and seasonal time scales of Vinnikov et al. (2004). The TEOF method
was extended by Li et al. (2011b) to identify coherent structures between two fields,
and applied it to global ocean surface latent heat fluxes and SST anomalies.

16.5 Common EOF Analysis

16.5.1 Background

EOF (or PC) analysis deals with one data matrix, and as such it is also known as
one-sample method. The idea then arose to extend this to two- or more samples,
which is the objective of common EOF (or PC) analysis. The idea of using two-
sample PCA goes back to Krzanowski (1979) who computed the angles between the
leading EOFs for each group. Flury (1983) extended the two-sample EOF analysis
by computing the eigenvectors of $\boldsymbol{\Sigma}_1^{-1}\boldsymbol{\Sigma}_2$, where $\boldsymbol{\Sigma}_k$, k = 1, 2, is the covariance
matrix of the kth group, to obtain, simultaneously, uncorrelated variables in the two
groups. Common PC (or EOF) analysis was suggested back in the late 1980s with
Flury (1984, 1988).
Given a number of groups, or population, common EOFs arise when we think
that the covariance matrices of these groups may have the same EOFs but with
different weights (or importance) in different groups. Sometimes this condition
is relaxed in favour of only a subset of similar EOFs with different weights in
different groups. This problem is quite common in weather and climate analysis.
Comparing, for example, the large scale flow patterns from a number of climate
model simulations, such as those from CMIP5, is a common problem in weather and
climate research. The basic belief underlying these models is that they may have the
same modes of variability, but with different prominence, e.g. explained variance,
in different models. This can also be used to compare different reanalysis products
from various weather centres, such as the National Center for Environmental
Prediction (NCEP), and the National Center for Atmospheric Research or the Japan
Meteorological Agency. Another direction where common EOF analysis can be
explored is the analysis of ensemble forecast. This can help identify any systematic
error in the forecasts.
The problem of common EOF analysis consists of finding common EOFs of a
set of M data matrices X1 , . . . , XM , where Xk , k = 1, . . . , M, is the nk × p data

matrix of the kth group. Note that data matrices can have different sample sizes.
The problem of common EOFs emerges naturally in climate research, particularly
in climate model evaluation. In CMIP5, for example, one important topic is often
to seek a comparison between the modes of variability of large scale flow of
different models. The way this is done in climate research literature is via a
simple comparison between the (conventional) EOFs of the different climate model
simulations. One of the main weaknesses of this approach is that the EOFs tend to be
model-dependent, leading to difficulties of comparison. It would be more objective
if the modes of variability were constructed on common ground, and a natural way
to do this is via common EOFs.

16.5.2 Formulation of Common EOFs

We assume here that we are given $M$ different groups or populations with
corresponding covariance matrices $\boldsymbol{\Sigma}_k$, $k = 1, \ldots, M$. If the covariance matrices
are equal, then the within-population EOFs are also the same for the $M$ different
populations. The more interesting case is when these covariance matrices are
different. In this case an attempt by statisticians was to identify an orthogonal matrix
$\mathbf{A}$ and $M$ diagonal matrices $\boldsymbol{\Lambda}_k$, $k = 1, \ldots, M$, satisfying the following hypothesis
$H_c$:

$$H_c: \quad \mathbf{A}^T \boldsymbol{\Sigma}_k \mathbf{A} = \boldsymbol{\Lambda}_k, \quad k = 1, \ldots, M. \tag{16.25}$$

The column vectors of $\mathbf{A} = \left(\mathbf{a}_1, \ldots, \mathbf{a}_p\right)$ are the common EOFs, and $\mathbf{U}_k = \mathbf{X}_k\mathbf{A}$ are
the common PCs.
Remark Note that in the case when the covariance matrices are different there is no
unique way to define the set of within-population EOFs (or PCs), i.e. the common
EOFs.
Note, however, that unlike conventional EOFs, the diagonal elements of $\boldsymbol{\Lambda}_k$, $k =
1, \ldots, M$, need not have the same order, and may not be monotonic simultaneously
for the different groups.
The common PCA model (16.25) appeared in Flury (1988) as a particular model
among five models with varying complexity. The first model (level-1) is based on the
equality of all covariance matrices $\boldsymbol{\Sigma}_k$, $k = 1, \ldots, M$. The second level model assumes that
the covariance matrices $\boldsymbol{\Sigma}_k$, $k = 1, \ldots, M$, are proportional to $\boldsymbol{\Sigma}_1$. The third model
is precisely Eq. (16.25), whereas in the fourth model only a subset of EOFs are
common eigenvectors, hence partial common PC model. The last model has no
restriction on the covariance matrices. Those models are described in Jolliffe (2002).
Flury (1984) estimated the common EOFs from Eq. (16.25) using maximum
likelihood based on the normality assumption $N(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$, $k = 1, \ldots, M$, of the
$k$th $p$-variate random vector generating the data matrix $\mathbf{X}_k$, $k = 1, \ldots, M$. Letting

$\mathbf{S}_k$, $k = 1, \ldots, M$, denote the sample (unbiased) estimate$^3$ of the covariance matrix
$\boldsymbol{\Sigma}_k$, the likelihood function (Flury 1984) takes the form:

$$L(\boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_M) = \alpha \prod_{k=1}^{M} \frac{1}{|\boldsymbol{\Sigma}_k|^{n_k/2}} \exp\left[\mathrm{tr}\left(-\frac{n_k}{2}\boldsymbol{\Sigma}_k^{-1}\mathbf{S}_k\right)\right], \tag{16.26}$$

where $\mathrm{tr}(.)$ is the trace function and $\alpha$ is a constant. Denoting by $\boldsymbol{\Lambda}_k =
\mathrm{diag}\left(\lambda_{k1}, \ldots, \lambda_{kp}\right)$, the maximisation of the logarithm of the likelihood $L(.)$ in
Eq. (16.26) yields the following system of equations:

$$\begin{array}{l} \mathbf{a}_k^T \left[\sum_{i=1}^{M} n_i \frac{\lambda_{ik} - \lambda_{il}}{\lambda_{ik}\lambda_{il}}\, \mathbf{S}_i\right] \mathbf{a}_l = 0, \quad 1 \leq k < l \leq p \\ \text{subject to } \mathbf{A}^T\mathbf{A} = \mathbf{I}_p. \end{array} \tag{16.27}$$

Flury (1984) computed the likelihood ratio $LR$ to test the hypothesis $H_c$, see
Eq. (16.25):

$$LR = -2\log\frac{L(\hat{\boldsymbol{\Sigma}}_1, \ldots, \hat{\boldsymbol{\Sigma}}_M)}{L(\mathbf{S}_1, \ldots, \mathbf{S}_M)} = \sum_{k=1}^{M} n_k \log\frac{|\hat{\boldsymbol{\Sigma}}_k|}{|\mathbf{S}_k|}, \tag{16.28}$$

where $\hat{\boldsymbol{\Sigma}}_k = \hat{\mathbf{A}}\hat{\boldsymbol{\Lambda}}_k\hat{\mathbf{A}}^T$ represents the maximum likelihood estimator of $\boldsymbol{\Sigma}_k$, $k = 1, \ldots, M$.
This ratio behaves, asymptotically, as a chi-square:

$$LR \sim \chi^2_{(M-1)p(p-1)/2}. \tag{16.29}$$

As mentioned above, the estimator maximising the likelihood Eq. (16.26)
yields no guarantee on the simultaneous monotonic ordering of the eigenvalues of
$\boldsymbol{\Lambda}_k$, $k = 1, \ldots, M$. This can be seen as a weakness of the method, particularly
if the objective is to use the method for dimensionality reduction. Various attempts
have been made to overcome this drawback. For example, Krzanowski (1984)
computed the EOFs of the pooled sample covariance matrix and the total sample
covariance matrix followed by a comparison of their EOFs. In fact, Krzanowski
(1984) computed the EOFs of S1 + . . . + SM , and compared them with those
obtained using some kind of weighted sums of Sk , k = 1, . . . M, in order to assess
whether the hypothesis Eq. (16.25) is true, see also the description in Jolliffe (2002).
Schott (1988) developed an approximate method for testing the equality of the EOF
subspaces from two groups, and extended it later to several groups (Schott 1991).
The method is based on using the sum of the corresponding covariance matrices,
which is somehow related to that of Krzanowski (1984). Other extensions of, and
tests for common EOF analysis are given in Jolliffe (2002).

$^3$ The sample covariance matrix follows a Wishart distribution, i.e. $n_k\mathbf{S}_k \sim W_p(n_k, \boldsymbol{\Sigma}_k)$.



All the methods described above including those mentioned in Jolliffe (2002)
share, however, the same drawback mentioned earlier, related to the lack of simulta-
neous monotonic change of the eigenvalues for all groups. A more precise method
to deal with the problem was proposed later (Trendafilov 2010) by computing the
common EOFs based on a stepwise procedure. The common EOFs are estimated
sequentially one-by-one allowing hence for a monotonic decrease (or increase) of
the eigenvalues of k , k = 1, . . . , M, in all groups simultaneously. The method
is similar to the stepwise procedure applied in simplifying EOFs (Hannachi et
al. 2006), and is based on projecting the gradient of the common EOF objective
function onto the orthogonal complement of the space spanned by the common EOFs identified
in the previous steps.
The reformulation of the common EOF problem is again based on the likelihood
Eq. (16.26), which takes a similar form, namely:

$$\begin{array}{l} \min_{\mathbf{A}} \sum_{k=1}^{M} n_k \log\left|\mathrm{diag}(\mathbf{A}^T\mathbf{S}_k\mathbf{A})\right| \\ \text{subject to } \mathbf{A}^T\mathbf{A} = \mathbf{I}_p. \end{array} \tag{16.30}$$

If the eigenvalues are to decrease monotonically in all groups simultaneously, then


Eq. (16.30) can be transformed (Trendafilov 2010) to yield

$$\begin{array}{l} \max_{\mathbf{a}} \sum_{k=1}^{M} n_k \log\left(\mathbf{a}^T\mathbf{S}_k\mathbf{a}\right) \\ \text{subject to } \mathbf{a}^T\mathbf{a} = 1, \text{ and } \mathbf{a}^T\mathbf{Q}_{j-1} = \mathbf{0}^T, \end{array} \tag{16.31}$$
 
where $\mathbf{Q}_{j-1} = \left(\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_{j-1}\right)$ and $\mathbf{Q}_0 = \mathbf{0}$. The first-order optimality condition
of (16.31) is

$$\left(\sum_{k=1}^{M} \frac{n_k\mathbf{S}_k}{n\,\mathbf{a}_1^T\mathbf{S}_k\mathbf{a}_1} - \mathbf{I}_p\right)\mathbf{a}_1 = \mathbf{0}, \tag{16.32}$$

and for $j = 2, \ldots, p$:

$$\left(\mathbf{I}_p - \mathbf{Q}_{j-1}\mathbf{Q}_{j-1}^T\right)\left(\sum_{k=1}^{M} \frac{n_k\mathbf{S}_k}{n\,\mathbf{a}_j^T\mathbf{S}_k\mathbf{a}_j} - \mathbf{I}_p\right)\mathbf{a}_j = \mathbf{0}. \tag{16.33}$$

Trendafilov (2010) solved Eqs. (16.32–16.33) using the standard power method
(Golub and van Loan 1996), which is a special case of the more general gradient
ascent method of quadratic forms on the unit sphere (e.g. Faddeev and Faddeeva
1963).
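To make this concrete, the following is a minimal Python sketch of the stepwise estimation of the common EOFs through a projected power-type iteration of Eqs. (16.32)–(16.33); the function name, the random initialisation and the convergence test are illustrative choices rather than part of the published algorithm.

```python
import numpy as np

def common_eofs(S_list, n_list, n_modes, n_iter=500, tol=1e-10):
    """Stepwise common EOFs via a projected power-type iteration, Eqs. (16.32)-(16.33).

    S_list : list of p x p sample covariance matrices S_k (one per group)
    n_list : list of sample sizes n_k
    Returns a p x n_modes matrix A whose columns are the common EOFs.
    """
    p = S_list[0].shape[0]
    n = float(sum(n_list))
    A = np.zeros((p, n_modes))
    for j in range(n_modes):
        Q = A[:, :j]                                 # previously extracted common EOFs
        P = np.eye(p) - Q @ Q.T                      # projector onto their orthogonal complement
        a = P @ np.random.randn(p)
        a /= np.linalg.norm(a)
        for _ in range(n_iter):
            # power step on sum_k n_k S_k / (n a' S_k a), cf. Eqs. (16.32)-(16.33)
            M = sum(nk * Sk / (n * (a @ Sk @ a)) for Sk, nk in zip(S_list, n_list))
            a_new = P @ (M @ a)
            a_new /= np.linalg.norm(a_new)
            if np.linalg.norm(a_new - np.sign(a_new @ a) * a) < tol:
                a = a_new
                break
            a = a_new
        A[:, j] = a
    return A
```

The projector $\mathbf{I}_p - \mathbf{Q}_{j-1}\mathbf{Q}_{j-1}^T$ keeps each new vector in the orthogonal complement of the previously found common EOFs, so that the extracted modes explain a monotonically decreasing amount of variance in all groups simultaneously, as intended by the stepwise formulation.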
The solution to Eqs. (16.32–16.33) is applied to the NCAR/NCEP monthly SLP
anomalies over the period 1950–2015. The data are divided into 4 seasons to yield
4 datasets for monthly DJF, MAM, JJA and SON, respectively. The data used here
are on a 5◦ × 5◦ grid north of 20◦ N. The objective is to illustrate the common EOFs
from these data, that is to get the matrix of loadings that simultaneously diagonalise

Fig. 16.11 Explained variance of the common EOFs for the winter (dot), spring (circle), summer
(asterisk), and fall (plus)

the covariance matrices of those four datasets. The algorithm is quite fast: with four $1080 \times 1080$
covariance matrices it takes only a few minutes on a MacPro.
The results are illustrated in Figs. 16.11 and 16.12. The obtained diagonal matrix
(or spectrum) for each dataset, is transformed into percentage of explained variance
(Fig. 16.11). There is a clear jump in the spectrum between the first and the
remaining modes for the four datasets. The leading common mode explains a little
more variance for spring and summer compared to winter and autumn. The spatial
patterns of the leading 4 common EOFs are shown in Fig. 16.12. The first common
EOF (top left) projects well onto the Arctic oscillation, mainly over the polar region,
with an opposite centre in the North Atlantic centred around 40N. Common EOF2
(top right) shows a kind of wave train with its main centre of action located over northern
Russia. The third common EOF (bottom right) reflects more the NAO pattern,
with around 10% explained variance. The NAO is weak in the spring and summer
seasons, which explains the reduced explained variance. The fourth common EOF
shows mainly the North Pacific centre, associated most likely with the Aleutian low
variability. As pointed out earlier, the common EOF approach is more suited,
for example, to the analysis of outputs from comparable GCMs, such as the case of
CMIP models, where the objective is to evaluate and quantify what is common in
those models in terms of modes of variability.

Fig. 16.12 The leading 4 common EOFs of the four datasets, namely (monthly) DJF, MAM, JJA,
and SON NCEP/NCAR sea level pressure anomalies over the period 1950–2015

16.6 Continuum Power CCA

16.6.1 Background

Conventional CCA, MCA and RDA are standard linear methods that are used to
isolate pairs of coupled patterns from two datasets. They are all based on the SVD,
and the patterns obtained from the different methods are linked through linear
transformations. In fact, it is possible to view CCA, MCA and RDA within a
unified framework in which each of them becomes a particular case. This is obtained
through what is known as partial whitening transformation. Partial whitening, with
degree α, aims at partially decorrelating the variables. This transformation is used
in continuum power regression, which links partial least squares (PLS) regression,
ordinary least squares regression and principal component regression (PCR), e.g.
Stone and Brooks (1990). Swenson (2015) extended continuum power regression to
CCA to get continuum power CCA (CPCCA).

16.6.2 Continuum Power CCA

Let $\mathbf{X}$ be a $n \times p$ (anomaly) data matrix ($p$ is the number of variables and $n$ is
the sample size), and $\mathbf{C} = n^{-1}\mathbf{X}^T\mathbf{X}$ the covariance matrix. CPCCA (Swenson 2015) is
based on the partial whitening (with power $\alpha$) of $\mathbf{X}$, i.e.

$$\mathbf{X}_* = \mathbf{A}_{\alpha,x}\mathbf{X}^T, \tag{16.34}$$

where

$$\mathbf{A}_{\alpha,x} = \mathbf{C}^{-\frac{1-\alpha}{2}}. \tag{16.35}$$

We suppose here that $\mathbf{C}$ is full rank so that its non-integer power exists.$^4$

Remark The standard whitening transformation corresponds to $\alpha = 0$.

CPCCA patterns $\mathbf{u}$ and $\mathbf{v}$ are obtained via:

$$\begin{array}{l} \max\; \mathbf{u}^T\mathbf{X}^T\mathbf{Y}\mathbf{v} \\ \text{s.t. } \mathbf{u}^T\left(\mathbf{X}^T\mathbf{X}\right)^{1-\alpha}\mathbf{u} = \mathbf{v}^T\left(\mathbf{Y}^T\mathbf{Y}\right)^{1-\beta}\mathbf{v} = 1. \end{array} \tag{16.36}$$

Remark It is straightforward to check that:


• α = β = 0 corresponds to conventional CCA.
• α = β = 1 corresponds to MCA.
• α = 0, β = 1 corresponds to RDA.
The above optimisation problem is equivalent to the following “MCA” problem:
 
$$\begin{array}{l} \max\; \mathbf{u}^T\mathbf{X}_*\mathbf{Y}_*^T\mathbf{v} \\ \text{s.t. } \mathbf{u}^T\mathbf{u} = \mathbf{v}^T\mathbf{v} = 1, \end{array} \tag{16.37}$$

where $\mathbf{X}_* = \mathbf{A}_{\alpha,x}\mathbf{X}^T$ and $\mathbf{Y}_* = \mathbf{A}_{\beta,y}\mathbf{Y}^T$. As for CCA, the CPCCA patterns (in the
partially whitened space) are given by the SVD of $\mathbf{X}_*\mathbf{Y}_*^T$, i.e. $\mathbf{X}_*\mathbf{Y}_*^T = \mathbf{U}_+\mathbf{S}\mathbf{V}_+^T$,
and the associated cross-covariance by the diagonal of $\mathbf{S}$. The CPCCA time series
are provided by projecting the partially whitened variables onto the singular vectors
$\mathbf{U}_+$ and $\mathbf{V}_+$, yielding $\mathbf{T}_x = \mathbf{X}_*^T\mathbf{U}_+$ and $\mathbf{T}_y = \mathbf{Y}_*^T\mathbf{V}_+$. The CPCCA patterns within
the original space are obtained by using the inverse of $\mathbf{A}_{\alpha,x}$ and $\mathbf{A}_{\beta,y}$, i.e.

$$\mathbf{U} = \left(\mathbf{X}^T\mathbf{X}\right)^{\frac{1-\alpha}{2}}\mathbf{U}_+ \quad \text{and} \quad \mathbf{V} = \left(\mathbf{Y}^T\mathbf{Y}\right)^{\frac{1-\beta}{2}}\mathbf{V}_+. \tag{16.38}$$

$^4$ If $\mathbf{C} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T$, with $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$, then $\mathbf{C}^\delta = \mathbf{U}\boldsymbol{\Lambda}^\delta\mathbf{U}^T$, where $\boldsymbol{\Lambda}^\delta = \mathrm{diag}(\lambda_1^\delta, \ldots, \lambda_p^\delta)$.



Fig. 16.13 Fraction of signal variance explained (FSVE) for the leading CPCCA mode in the
partial whitening parameters (α,β) plane for different values of the signal amplitudes a and b
(maximum shown by ‘*’) for a = b = 1(a), 0.75(b), 0.6(c). Also shown are the maxima of the
cross-correlation, squared covariance fraction and the fraction of variance of Y explained by X,
associated, respectively, with CCA, MCA and RDA. Adapted from Swenson (2015). ©American
Meteorological Society. Used with permission

Remark The time series $\mathbf{T}_x$ and $\mathbf{T}_y$ are (cross-)uncorrelated, i.e. $\mathbf{T}_x^T\mathbf{T}_y = \mathbf{S}$. The
time series $\mathbf{T}_x$, however, are not uncorrelated.
Partial whitening provides more flexibility through varying the parameters α and
β. CPCCA is similar, in some way, to the partial phase transform, PHAT-β,
encountered in signal processing (Donohue et al. 2007). Partial whitening can be
shown to yield an increase in performance when applied to CCA regarding the S/N
ratio. Figure 16.13 shows the performance of CPCCA as a function of α and β for
an artificial example in which the “common component” is weighted by specific
numbers a and b in the data matrices X and Y, respectively.
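To illustrate Eqs. (16.34)–(16.38), here is a minimal Python sketch of CPCCA via partial whitening followed by an SVD; the function names are illustrative, full-rank covariance matrices are assumed, and the $1/n$ factor of the covariance is dropped since it only rescales the singular values.

```python
import numpy as np

def cpcca(X, Y, alpha, beta, n_modes=3):
    """Sketch of continuum power CCA (cf. Swenson 2015), Eqs. (16.34)-(16.38).

    X, Y : centred (anomaly) data matrices of shape (n, p) and (n, q).
    alpha, beta : partial-whitening powers (0 -> CCA, 1 -> MCA).
    """
    def mat_power(C, power):
        lam, U = np.linalg.eigh(C)                   # C assumed full rank
        return U @ np.diag(lam ** power) @ U.T

    Xs = mat_power(X.T @ X, -(1 - alpha) / 2) @ X.T  # partially whitened data, Eq. (16.34)
    Ys = mat_power(Y.T @ Y, -(1 - beta) / 2) @ Y.T
    U_, s, Vt = np.linalg.svd(Xs @ Ys.T)             # "MCA" in the whitened space, Eq. (16.37)
    Up, Vp = U_[:, :n_modes], Vt[:n_modes, :].T
    Tx, Ty = Xs.T @ Up, Ys.T @ Vp                    # CPCCA time series
    U = mat_power(X.T @ X, (1 - alpha) / 2) @ Up     # patterns in the original space, Eq. (16.38)
    V = mat_power(Y.T @ Y, (1 - beta) / 2) @ Vp
    return U, V, Tx, Ty, s[:n_modes]
```

Setting `alpha = beta = 0` recovers conventional CCA, `alpha = beta = 1` recovers MCA, and `alpha = 0, beta = 1` gives RDA, consistent with the remark above.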

16.6.3 Determination of the Degree Parameter

Various methods can be used to determine the degree of partial whitening (or
regularisation) parameter $\alpha$. Perhaps an intuitive approach is to consider the
simultaneous optimisation of $\mathbf{u}^T\mathbf{X}_*\mathbf{Y}_*^T\mathbf{v}$ with respect to $\mathbf{u}$, $\mathbf{v}$, $\alpha$ and $\beta$, where the
optimum solution is given by

$$\{\mathbf{u}_o, \mathbf{v}_o, \alpha_o, \beta_o\} = \mathrm{argmax}\; \mathbf{u}^T\mathbf{X}_*\mathbf{Y}_*^T\mathbf{v} \quad \text{s.t. } \mathbf{u}^T\mathbf{u} = \mathbf{v}^T\mathbf{v} = 1. \tag{16.39}$$

It is possible to solve the above equation numerically. This was used by Salim et al.
(2005) to estimate the smoothing parameter in regularised MCA in addition to the

spatial patterns. They applied the method to analysing the association between the
Irish winter precipitation and sea surface temperature. They found clear association
between Irish precipitation anomalies, El-Niño Southern Oscillation and the North
Atlantic Oscillation. We note, however, that the calculation can be cumbersome.
The other, and common, method is to use cross-validation (CV). CV is
feasible in practice but requires a relatively extended computation as it is based
on a leave-one-out procedure. We explain the procedure here for conventional
CCA; the application to CPCCA is similar. In CCA we seek the spectral
analysis of $\mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{Y}\left(\mathbf{Y}^T\mathbf{Y}\right)^{-1}\mathbf{Y}^T$, whereas in regularised CCA, with regularisation
parameter $\boldsymbol{\lambda} = (\lambda_1, \lambda_2)$, we are interested in the spectral analysis
of $\mathbf{X}\left(\mathbf{X}^T\mathbf{X} + \lambda_1\mathbf{I}\right)^{-1}\mathbf{X}^T\mathbf{Y}\left(\mathbf{Y}^T\mathbf{Y} + \lambda_2\mathbf{I}\right)^{-1}\mathbf{Y}^T$, where $\mathbf{I}$ is the identity matrix. We
designate by $\mathbf{X}_{-i}$ and $\mathbf{Y}_{-i}$ the data matrices derived from $\mathbf{X}$ and $\mathbf{Y}$, respectively,
by removing the $i$th rows $\mathbf{x}_i$ and $\mathbf{y}_i$. We also let $\rho_{\boldsymbol{\lambda}}^{(-i)}$ be
the leading canonical correlation from the CCA of $\mathbf{X}_{-i}$ and $\mathbf{Y}_{-i}$, with corresponding
patterns (eigenvectors) $\mathbf{u}_{\boldsymbol{\lambda}}^{(-i)}$ and $\mathbf{v}_{\boldsymbol{\lambda}}^{(-i)}$. The cross-validation score can be defined,
in general, as a measure of the squared error of a test set evaluated for an eigenvector
from a training set. The CV score is defined (Leurgans et al. 1993) by
 
$$CV(\boldsymbol{\lambda}) = \mathrm{corr}\left(\{\mathbf{x}_i\mathbf{u}_{\boldsymbol{\lambda}}^{(-i)}\}_{i=1,\ldots,n}\,,\; \{\mathbf{y}_i\mathbf{v}_{\boldsymbol{\lambda}}^{(-i)}\}_{i=1,\ldots,n}\right). \tag{16.40}$$

Note that in the above equation we consider xi as a 1 × p row vector. The cross-
validated parameter λ̂ is then given by

$$\hat{\boldsymbol{\lambda}} = \underset{\boldsymbol{\lambda}=(\lambda_1,\lambda_2)}{\mathrm{argmax}}\; CV(\boldsymbol{\lambda}). \tag{16.41}$$
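A minimal Python sketch of this leave-one-out score, Eqs. (16.40)–(16.41), is given below for regularised CCA; the leading canonical pair is obtained here by whitening with the regularised covariance matrices, which is one standard way of solving regularised CCA, and the function names are illustrative.

```python
import numpy as np

def _inv_sqrt(C):
    """Inverse square root of a symmetric positive definite matrix."""
    lam, U = np.linalg.eigh(C)
    return U @ np.diag(1.0 / np.sqrt(lam)) @ U.T

def rcca_leading(X, Y, lam1, lam2):
    """Leading pattern pair of regularised CCA (a sketch)."""
    Wx = _inv_sqrt(X.T @ X + lam1 * np.eye(X.shape[1]))
    Wy = _inv_sqrt(Y.T @ Y + lam2 * np.eye(Y.shape[1]))
    A, _, Bt = np.linalg.svd(Wx @ X.T @ Y @ Wy)
    return Wx @ A[:, 0], Wy @ Bt[0, :]               # patterns u and v

def cv_score(X, Y, lam1, lam2):
    """Leave-one-out CV score of Eq. (16.40)."""
    n = X.shape[0]
    px, py = np.empty(n), np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        u, v = rcca_leading(X[keep], Y[keep], lam1, lam2)
        px[i], py[i] = X[i] @ u, Y[i] @ v            # project the held-out observation
    return np.corrcoef(px, py)[0, 1]
```

The cross-validated $\hat{\boldsymbol{\lambda}}$ of Eq. (16.41) is then obtained by evaluating `cv_score` over a grid of $(\lambda_1, \lambda_2)$ values and taking the maximiser.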

The third method to estimate the optimal parameter is related to ridge regression,
in which a regularisation parameter, in the form of $\lambda\mathbf{I}$, is added to the covariance
matrix of the predictor variables before computing the inverse. In ridge regression
a transformation is applied using $\mathbf{T}_{ridge} = \left[(1-\rho)\mathbf{X}^T\mathbf{X} + \rho\mu\mathbf{I}\right]^{-1/2}$, with $\mu =
\frac{1}{p}\|\mathbf{X}\|_F^2$. An estimate of $\rho$ is derived by Ledoit and Wolf (2004):

$$\rho_{LW} = \frac{\sum_{t=1}^{n}\left\|(n-1)\,\mathbf{x}_t^T\mathbf{x}_t - \mathbf{X}^T\mathbf{X}\right\|_F^2}{\left\|\mathbf{X}^T\mathbf{X} - \mu\mathbf{I}\right\|_F^2}, \tag{16.42}$$

with $\|.\|_F$ being the Frobenius norm ($\|\mathbf{C}\|_F = \sqrt{\mathrm{tr}(\mathbf{C}\mathbf{C}^T)}$). For CPCCA, Swenson
(2015) suggests the following estimator for the parameter $\alpha$:

$$\hat{\alpha} = \underset{\alpha}{\mathrm{argmin}}\; \nu\left\|\left(\mathbf{X}^T\mathbf{X}\right)^{1-\alpha} - (1-\rho_{LW})\mathbf{X}^T\mathbf{X} - \rho_{LW}\mu\mathbf{I}\right\|_F^2, \tag{16.43}$$

with $\nu = \|\mathbf{X}^T\mathbf{X}\|_F^2 / \|\left(\mathbf{X}^T\mathbf{X}\right)^{1-\alpha}\|_F^2$.

16.7 Kernel MCA

16.7.1 Background

Given two data matrices X and Y, classical or standard MCA looks for patterns a
and b such that Xa and Yb have maximum covariance. These patterns are given,
respectively, by the left and right singular vectors of the cross-covariance5 matrix
B = XT Y. These vectors satisfy XT Yb = nλa and YT Xa = nλb. In addition, the
associated time series x = Xa and y = Yb satisfy, respectively:

$$\begin{array}{l} \mathbf{X}\mathbf{X}^T\mathbf{Y}\mathbf{Y}^T\mathbf{x} = n^2\lambda^2\mathbf{x} \\ \mathbf{Y}\mathbf{Y}^T\mathbf{X}\mathbf{X}^T\mathbf{y} = n^2\lambda^2\mathbf{y}. \end{array} \tag{16.44}$$

Exercise Derive Eq. (16.44).

In practice, of course, we do not solve Eq. (16.44),
but we apply the SVD algorithm to XT Y. The above derivation is useful for what
follows.

16.7.2 Kernel MCA

Kernel MCA takes its roots from kernel EOF where a transformation φ(.) is used
to map the input data space onto a feature space, then EOF analysis applied to
the transformed data. In kernel MCA the X and Y feature spaces are spanned by
$\phi(\mathbf{x}_1), \ldots, \phi(\mathbf{x}_n)$ and $\phi(\mathbf{y}_1), \ldots, \phi(\mathbf{y}_n)$, respectively. The objective
is similar to standard MCA but applied to the feature spaces.
We designate by $\mathcal{X}$ and $\mathcal{Y}$ the matrices (or rather operators) defined, respectively,
by

$$\mathcal{X} = \begin{pmatrix} \phi(\mathbf{x}_1)^T \\ \vdots \\ \phi(\mathbf{x}_n)^T \end{pmatrix} \quad \text{and} \quad \mathcal{Y} = \begin{pmatrix} \phi(\mathbf{y}_1)^T \\ \vdots \\ \phi(\mathbf{y}_n)^T \end{pmatrix}, \tag{16.45}$$

and we seek "feature" patterns $\mathbf{u}$ and $\mathbf{v}$ from the feature space such that $\mathcal{X}\mathbf{u}$ and $\mathcal{Y}\mathbf{v}$
have maximum covariance.

The cross-covariance matrix between $\phi(\mathbf{x}_k)$ and $\phi(\mathbf{y}_k)$, $k = 1, \ldots, n$, is

$$\mathbf{C} = \frac{1}{n}\sum_{t=1}^{n}\phi(\mathbf{x}_t)\phi(\mathbf{y}_t)^T = \frac{1}{n}\mathcal{X}^T\mathcal{Y} \tag{16.46}$$

$^5$ X and Y are supposed to be centred.



and the left and right singular vectors satisfy

$$\begin{array}{l} \mathbf{C}\mathbf{v} = \lambda\mathbf{u} \\ \mathbf{C}^T\mathbf{u} = \lambda\mathbf{v}. \end{array} \tag{16.47}$$

As in kernel EOF we see that u and v take, respectively, the following forms:
$$\mathbf{u} = \sum_{t=1}^{n} a_t\phi(\mathbf{x}_t) \quad \text{and} \quad \mathbf{v} = \sum_{t=1}^{n} b_t\phi(\mathbf{y}_t). \tag{16.48}$$

Inserting (16.48) into (16.47), using (16.46), and denoting by $\mathbf{K}_x$ and $\mathbf{K}_y$ the
matrices with respective elements $K_{ij}^x = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$ and $K_{ij}^y = \phi(\mathbf{y}_i)^T\phi(\mathbf{y}_j)$,
we get

$$\begin{array}{l} \mathbf{K}_x\mathbf{K}_y\mathbf{b} = n\lambda\mathbf{K}_x\mathbf{a} \\ \mathbf{K}_y\mathbf{K}_x\mathbf{a} = n\lambda\mathbf{K}_y\mathbf{b}. \end{array} \tag{16.49}$$

One can solve (16.49) simply by considering the necessary condition, i.e.

$$\begin{array}{l} \mathbf{K}_y\mathbf{b} = n\lambda\mathbf{a} \\ \mathbf{K}_x\mathbf{a} = n\lambda\mathbf{b}, \end{array} \tag{16.50}$$

which yields $\mathbf{K}_x\mathbf{K}_y\mathbf{b} = n^2\lambda^2\mathbf{b}$ and similarly for $\mathbf{a}$. Alternatively, we can still use


Eq. (16.49) to obtain, for example:

$$\mathbf{K}_y\mathbf{K}_x\mathbf{K}_y\mathbf{b} = n\lambda\mathbf{K}_y\mathbf{K}_x\mathbf{a} = n^2\lambda^2\mathbf{K}_y\mathbf{b}, \tag{16.51}$$

which is an eigenvalue problem with respect to Ky b.


Remark
• Note that Eq. (16.51) can also be seen as a generalised eigenvalue problem.
• With isotropic kernels of the form $H(\|\mathbf{x}-\mathbf{y}\|^2)$, such as the Gaussian kernel,
with $H(0) \neq 0$, $\mathbf{K}_x$ and $\mathbf{K}_y$ are, in general, invertible and Eq. (16.49) becomes
straightforward.

16.7.3 An Alternative Way

One can use the (data) matrices within the feature space, as in the standard case (i.e.
without transformation), and directly solve the system:

$$\begin{array}{l} \frac{1}{n}\mathcal{X}^T\mathcal{Y}\mathbf{v} = \lambda\mathbf{u} \\ \frac{1}{n}\mathcal{Y}^T\mathcal{X}\mathbf{u} = \lambda\mathbf{v}, \end{array} \tag{16.52}$$

which, as in the standard MCA, leads to

$$\begin{array}{l} \mathcal{X}\mathcal{X}^T\mathcal{Y}\mathcal{Y}^T\mathcal{X}\mathbf{u} = n^2\lambda^2\mathcal{X}\mathbf{u} \\ \mathcal{Y}\mathcal{Y}^T\mathcal{X}\mathcal{X}^T\mathcal{Y}\mathbf{v} = n^2\lambda^2\mathcal{Y}\mathbf{v}. \end{array} \tag{16.53}$$

Now, $\mathbf{x} = \mathcal{X}\mathbf{u}$ is a time series of length $n$, and similarly for $\mathbf{y} = \mathcal{Y}\mathbf{v}$. Also, we have
$\mathcal{X}\mathcal{X}^T = \mathbf{K}_x$ and $\mathcal{Y}\mathcal{Y}^T = \mathbf{K}_y$, and then Eq. (16.53) becomes

$$\begin{array}{l} \mathbf{K}_x\mathbf{K}_y\mathbf{x} = n^2\lambda^2\mathbf{x} \\ \mathbf{K}_y\mathbf{K}_x\mathbf{y} = n^2\lambda^2\mathbf{y}. \end{array} \tag{16.54}$$

So the time series $\mathbf{x}$ and $\mathbf{y}$ having maximum covariance are given, respectively, by
the right and left eigenvectors of $\mathbf{K}_x\mathbf{K}_y$.

Remark Comparing Eqs. (16.51) and (16.54) one can see that $\mathbf{x} = \mathbf{K}_x\mathbf{a}$ and
$\mathbf{y} = \mathbf{K}_y\mathbf{b}$, which can be verified keeping in mind that $\mathbf{u} = \sum_{t=1}^{n} a_t\phi(\mathbf{x}_t)$ and
$\mathbf{v} = \sum_{t=1}^{n} b_t\phi(\mathbf{y}_t)$, in addition to the fact that $\mathbf{x} = \mathcal{X}\mathbf{u}$ and $\mathbf{y} = \mathcal{Y}\mathbf{v}$.
One finds either $\mathbf{a}$ and $\mathbf{b}$ (Eq. (16.51)) or $\mathbf{x}$ and $\mathbf{y}$ (Eq. (16.54)). We then construct
the feature patterns $\mathbf{u}$ and $\mathbf{v}$ using Eq. (16.48). The corresponding patterns from the
input spaces can be obtained by seeking $\mathbf{x}$ and $\mathbf{y}$ such that $\mathbf{u}^T\phi(\mathbf{x})$ and $\mathbf{v}^T\phi(\mathbf{y})$ are
maximised. This leads to the maximisation problem:

$$\max_{\mathbf{x}} \sum_{t=1}^{n} a_t K(\mathbf{x}, \mathbf{x}_t) \quad \text{and} \quad \max_{\mathbf{y}} \sum_{t=1}^{n} b_t K(\mathbf{y}, \mathbf{y}_t). \tag{16.55}$$

This is exactly like the pre-image for Kernel EOFs, and therefore the same fixed
point algorithm can be used.

16.8 Kernel CCA and Its Regularisation

16.8.1 Primal and Dual CCA Formulation

As above we let $\mathbf{X}$ and $\mathbf{Y}$ denote two $n \times p$ and $n \times q$ (anomaly) data matrices. The
conventional CCA is written in the primal form as:

$$\rho = \max_{\mathbf{u},\mathbf{v}} \frac{\mathbf{u}^T\mathbf{X}^T\mathbf{Y}\mathbf{v}}{\sqrt{(\mathbf{u}^T\mathbf{X}^T\mathbf{X}\mathbf{u})(\mathbf{v}^T\mathbf{Y}^T\mathbf{Y}\mathbf{v})}}. \tag{16.56}$$

By denoting u = XT α and v = YT β, the above form can be cast in the dual form:

$$\rho = \max_{\boldsymbol{\alpha},\boldsymbol{\beta}} \frac{\boldsymbol{\alpha}^T\mathbf{K}_x\mathbf{K}_y\boldsymbol{\beta}}{\sqrt{(\boldsymbol{\alpha}^T\mathbf{K}_x^2\boldsymbol{\alpha})(\boldsymbol{\beta}^T\mathbf{K}_y^2\boldsymbol{\beta})}}, \tag{16.57}$$

where Kx = XXT and Ky = YYT .


Exercise Show that the above problem is equivalent to

$$\begin{array}{l} \max\; \boldsymbol{\alpha}^T\mathbf{K}_x\mathbf{K}_y\boldsymbol{\beta} \\ \text{s.t. } \boldsymbol{\alpha}^T\mathbf{K}_x^2\boldsymbol{\alpha} = \boldsymbol{\beta}^T\mathbf{K}_y^2\boldsymbol{\beta} = 1. \end{array} \tag{16.58}$$

This system can be analysed using Lagrange multipliers yielding a system of linear
equations in α and β:

$$\begin{array}{l} \mathbf{K}_x\mathbf{K}_y\boldsymbol{\beta} - \lambda_1\mathbf{K}_x^2\boldsymbol{\alpha} = \mathbf{0} \\ \mathbf{K}_y\mathbf{K}_x\boldsymbol{\alpha} - \lambda_2\mathbf{K}_y^2\boldsymbol{\beta} = \mathbf{0}. \end{array} \tag{16.59}$$

Verify that λ1 = λ2 , which can be denoted by λ.


Show that $\rho^2$ is indeed the maximum of the Rayleigh quotient:

$$R = \frac{\begin{pmatrix}\mathbf{u}^T & \mathbf{v}^T\end{pmatrix}\begin{pmatrix}\mathbf{0} & \mathbf{K}_x\mathbf{K}_y \\ \mathbf{K}_y\mathbf{K}_x & \mathbf{0}\end{pmatrix}\begin{pmatrix}\mathbf{u} \\ \mathbf{v}\end{pmatrix}}{\begin{pmatrix}\mathbf{u}^T & \mathbf{v}^T\end{pmatrix}\begin{pmatrix}\mathbf{K}_x^2 & \mathbf{0} \\ \mathbf{0} & \mathbf{K}_y^2\end{pmatrix}\begin{pmatrix}\mathbf{u} \\ \mathbf{v}\end{pmatrix}}. \tag{16.60}$$

Remark Note that in the dual formulation, i.e. the Rayleigh quotient (16.60) and
also (16.57), the computation of the cross-correlation (or cross-covariance) is
avoided. This has implications when computing kernel CCA, as shown later.

Exercise Assume that $\mathbf{K}_x$ and $\mathbf{K}_y$ are invertible; show that we have $\lambda = 1$.

The conclusion from the above exercise is that when $\mathbf{K}_x$ and $\mathbf{K}_y$ are invertible
perfect correlation can be obtained and the CCA problem becomes useless. This
is a kind of "overfitting".
Remark In CCA this problem occurs whenever Kx and Ky are invertible. This
means that rank(X) = n =rank(Y), i.e. n < q and n < p. This also means that the
covariance matrices XT X and YT Y are singular.
The solution to this problem is regularisation, as discussed in Sect. 16.6 (see also
Chap. 15), by adding $\lambda_1\mathbf{I}$ and $\lambda_2\mathbf{I}$ to the correlation matrices of $\mathbf{X}$ and $\mathbf{Y}$,
respectively, as in ridge regression. In ridge regression with a regression model $\mathbf{Y} =
\mathbf{X}\mathbf{B} + \boldsymbol{\varepsilon}$, the estimated matrix $\hat{\mathbf{B}} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{Y}$ is replaced by $(\mathbf{R} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{Y}$,
with $\lambda > 0$, where $\mathbf{R}$ is the correlation matrix. The diagonal elements of $\mathbf{R}$ are
increased by $\lambda$, and this is where the name ridge comes from.
Remark The standard CCA problem can be cast into a generalised eigenvalue
problem as
$$\begin{pmatrix}\mathbf{O} & \mathbf{C}_{xy} \\ \mathbf{C}_{yx} & \mathbf{O}\end{pmatrix}\begin{pmatrix}\mathbf{u} \\ \mathbf{v}\end{pmatrix} = \rho^2\begin{pmatrix}\mathbf{C}_{xx} & \mathbf{O} \\ \mathbf{O} & \mathbf{C}_{yy}\end{pmatrix}\begin{pmatrix}\mathbf{u} \\ \mathbf{v}\end{pmatrix}$$

(see exercise above). The above form can be used to extend CCA to multiple
datasets. For example, for three data one form of this generalisation is given by
the following generalised eigenvalue problem:
$$\begin{pmatrix}\mathbf{O} & \mathbf{C}_{xy} & \mathbf{C}_{xz} \\ \mathbf{C}_{yx} & \mathbf{O} & \mathbf{C}_{yz} \\ \mathbf{C}_{zx} & \mathbf{C}_{zy} & \mathbf{O}\end{pmatrix}\begin{pmatrix}\mathbf{u} \\ \mathbf{v} \\ \mathbf{w}\end{pmatrix} = \rho^2\begin{pmatrix}\mathbf{C}_{xx} & \mathbf{O} & \mathbf{O} \\ \mathbf{O} & \mathbf{C}_{yy} & \mathbf{O} \\ \mathbf{O} & \mathbf{O} & \mathbf{C}_{zz}\end{pmatrix}\begin{pmatrix}\mathbf{u} \\ \mathbf{v} \\ \mathbf{w}\end{pmatrix}.$$

16.8.2 Regularised KCCA

In canonical covariance analysis no scaling (i.e. correlation) was used, and therefore
no regularisation was required. As with conventional CCA we denote, respectively,
the Gram matrices of $\mathbf{X}$ and $\mathbf{Y}$ by $\mathbf{K} = (K_{ij})$ and $\mathbf{L} = (L_{ij})$, with $K_{ij} =
\phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$ and $L_{ij} = \psi(\mathbf{y}_i)^T\psi(\mathbf{y}_j)$. Note that here we can use a different
map $\psi(.)$ for $\mathbf{Y}$. The solution of KCCA looks for patterns $\mathbf{a} = \sum_i \alpha_i\phi(\mathbf{x}_i)$ and $\mathbf{b} = \sum_i \beta_i\psi(\mathbf{y}_i)$
that are maximally correlated. This leads to maximising the Lagrangian:

$$\mathcal{L} = \boldsymbol{\alpha}^T\mathbf{K}\mathbf{L}\boldsymbol{\beta} - \frac{1}{2}\lambda\left(\boldsymbol{\alpha}^T\mathbf{K}^2\boldsymbol{\alpha} - 1\right) - \frac{1}{2}\lambda\left(\boldsymbol{\beta}^T\mathbf{L}^2\boldsymbol{\beta} - 1\right) \tag{16.61}$$
and also to maximising the Rayleigh quotient (in the dual form). The obtained system of
equations is similar to Eq. (16.59). Again, if, for example, $\mathbf{K}$ is of full rank, which is
typically the case in practice, then a naive application of KCCA leads to $\lambda = 1$. This
shows the need to regularise the kernel, which leads to the regularised Lagrangian

$$\mathcal{L} = \boldsymbol{\alpha}^T\mathbf{K}\mathbf{L}\boldsymbol{\beta} - \frac{1}{2}\lambda\left(\boldsymbol{\alpha}^T\mathbf{K}^2\boldsymbol{\alpha} + \eta_1\boldsymbol{\alpha}^T\boldsymbol{\alpha} - 1\right) - \frac{1}{2}\lambda\left(\boldsymbol{\beta}^T\mathbf{L}^2\boldsymbol{\beta} + \eta_2\boldsymbol{\beta}^T\boldsymbol{\beta} - 1\right). \tag{16.62}$$
The associated Rayleigh quotient is similar to that shown in the exercise above, except
that $\mathbf{K}^2$ and $\mathbf{L}^2$ are replaced by $\mathbf{K}^2 + \eta_1\mathbf{I}$ and $\mathbf{L}^2 + \eta_2\mathbf{I}$, respectively, with the
associated generalised eigenvalue problem modified accordingly. Note that we can also take $\eta_1 = \eta_2 = \eta$.
Remarks
• The dual formulation allows us to use different kernels, e.g. $\phi(.)$ and $\psi(.)$
for $\mathbf{X}$ and $\mathbf{Y}$, respectively. For example, one can kernelize only one variable and
leave the other without kernel.
• The regularisation parameter $\eta$ can be estimated using the cross-validation procedure.

16.8.3 Some Computational Issues

The solution to the regularised KCCA is given, e.g. for α, assuming that K is
invertible, by the standard eigenvalue problem:

$$(\mathbf{K} + \eta\mathbf{I})^{-1}\mathbf{L}(\mathbf{L} + \eta\mathbf{I})^{-1}\mathbf{K}\boldsymbol{\alpha} = \lambda^2\boldsymbol{\alpha}. \tag{16.63}$$

The above eigenvalue problem can be solved by standard Cholesky decomposition


(Golub and van Loan 1996) when the sample size is not very large. When we have
a large dataset an alternative is to use the incomplete Cholesky decomposition of
kernel matrices (Bach and Jordan 2002). Unlike the standard decomposition, in
the incomplete Cholesky decomposition all pivots below a selected threshold are
skipped. This leads to a lower triangular matrix with only m non-zero columns if
m is the number of pivots used, i.e. non-skipped. Another alternative to incomplete
Cholesky decomposition is to use the partial Gram-Schmidt orthogonalization (Cris-
tianini et al. 2001). This orthogonalization was applied with KCCA by Hardoon et
al. (2004) to analyse semantic representation of web images.
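For moderate sample sizes where a direct solve is affordable, the regularised KCCA problem of Eq. (16.63) can be sketched in Python as follows; the recovery of $\boldsymbol{\beta}$ shown here is one consistent choice, the function names are illustrative, and the (incomplete) Cholesky route described above is not implemented.

```python
import numpy as np

def rkcca(K, L, eta, n_modes=2):
    """Regularised kernel CCA via the eigenvalue problem of Eq. (16.63) (a sketch).

    K, L : n x n Gram matrices of the two fields; eta : regularisation parameter.
    """
    n = K.shape[0]
    I = np.eye(n)
    # (K + eta I)^{-1} L (L + eta I)^{-1} K alpha = lambda^2 alpha
    M = np.linalg.solve(K + eta * I, L) @ np.linalg.solve(L + eta * I, K)
    w, V = np.linalg.eig(M)
    order = np.argsort(-w.real)[:n_modes]
    rho = np.sqrt(np.clip(w[order].real, 0.0, None))     # canonical correlations
    alphas = V[:, order].real
    betas = np.linalg.solve(L + eta * I, K @ alphas)     # one consistent choice for beta
    return alphas, betas, rho
```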

16.9 Archetypal Analysis

16.9.1 Background

Archetypal analysis (Cutler and Breiman 1994) is another method of multivariate


data exploration. Given a multivariate time series xt = (xt1 , . . . , xtm )T , t = 1, . . . n,
of m-dimensional variables we know that EOF analysis provides directions in this
m-dimensional space that maximise variance, making them not directly interpretable
in terms of the data values themselves. Clustering, for example, via k-means yields
centroids or typical prototypes of the observations. In archetypal analysis (AA)
the objective is to express the data in terms of a small number of “pure” types
or archetypes. The data are then approximately expressed as weighted average of
these archetypes. In addition, the archetypes themselves are also weighted average
of the data, and are not necessarily observed. The main feature, however, is that
the archetypes are in general extremal, and this is what distinguishes them from EOFs
and other closely related methods. AA therefore attempts to combine the virtues of
EOF analysis and clustering, in addition to dealing with extremes or corners of the
data in its state space. The archetypes are obtained by estimating the convex hull or
envelope of the data in state space. AA has been applied mostly in pattern recognition,
benchmarking and market research, physics (astronomical spectra), computer vision,
neuro-imaging and biology, but not much in weather and climate research, where it
has been applied only recently, see Steinschneider and Lall (2015)
and Hannachi and Trendafilov (2017).

16.9.2 Derivation of Archetypes

We let our $n \times m$ data matrix be denoted by $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)^T$. For a given number
$p$, $1 \leq p \leq n$, AA finds archetypes $\mathbf{z}_1, \ldots, \mathbf{z}_p$ that are mixtures, or convex
combinations, of the data:

$$\begin{array}{l} \mathbf{z}_k = \sum_{j=1}^{n}\beta_{kj}\mathbf{x}_j, \quad k = 1, \ldots, p \\ \text{s.t. } \beta_{kj} \geq 0 \text{ and } \sum_{j=1}^{n}\beta_{kj} = 1. \end{array} \tag{16.64}$$

The above equations make the patterns $\mathbf{z}_1, \ldots, \mathbf{z}_p$ of data (or pure) types. The data,
in turn, are also approximated by a similar weighted average of the archetypes. That
is, each $\mathbf{x}_t$, $t = 1, \ldots, n$, is approximated by a convex combination $\sum_{j=1}^{p}\alpha_{tj}\mathbf{z}_j$, with
$\alpha_{tj} \geq 0$ and $\sum_{j=1}^{p}\alpha_{tj} = 1$. The archetypes are therefore the solution of a convex
least squares problem obtained by minimising a residual sum of squares (RSS):

$$\begin{array}{l} \{\mathbf{z}_1, \ldots, \mathbf{z}_p\} = \underset{\alpha,\beta}{\mathrm{argmin}}\sum_t \left\|\mathbf{x}_t - \sum_{k=1}^{p}\alpha_{tk}\mathbf{z}_k\right\|_2^2 = \underset{\alpha,\beta}{\mathrm{argmin}}\sum_t \left\|\mathbf{x}_t - \sum_{k=1}^{p}\sum_{j=1}^{n}\alpha_{tk}\beta_{kj}\mathbf{x}_j\right\|_2^2 \\ \text{s.t. } \alpha_{tk} \geq 0,\; \sum_{k=1}^{p}\alpha_{tk} = 1,\; t = 1, \ldots, n,\; \beta_{kj} \geq 0, \text{ and } \sum_{j=1}^{n}\beta_{kj} = 1,\; k = 1, \ldots, p, \end{array} \tag{16.65}$$
where $\|.\|_2$ stands for the Euclidean norm.

The above formulation of archetypes can be cast in terms of matrices. Letting
$\mathbf{A}^T = (\alpha_{ij})$ and $\mathbf{B}^T = (\beta_{ij})$, with $\mathbf{A}$ in $\mathbb{R}^{p\times n}$ and $\mathbf{B}$ in $\mathbb{R}^{n\times p}$, the above equation transforms
into the following matrix optimisation problem:

$$\begin{array}{l} \min_{\mathbf{A},\mathbf{B}} R = \left\|\mathbf{X} - \mathbf{A}^T\mathbf{B}^T\mathbf{X}\right\|_F^2 \\ \mathbf{A}, \mathbf{B} \geq 0,\; \mathbf{A}^T\mathbf{1}_p = \mathbf{1}_n, \text{ and } \mathbf{B}^T\mathbf{1}_n = \mathbf{1}_p. \end{array} \tag{16.66}$$

In the above system $\mathbf{A}^T$ and $\mathbf{B}^T$ are row stochastic matrices, $\mathbf{1}_x$ stands for the $x$-column
vector of ones and $\|.\|_F$ stands for the Frobenius norm (Appendix D). The inferred
archetypes are then convex combinations of the observations, which are given by
$\mathbf{Z} = \left(\mathbf{z}_1, \ldots, \mathbf{z}_p\right) = \mathbf{X}^T\mathbf{B}$, and they lie in the convex hull of the data $\mathbf{x}_1, \ldots, \mathbf{x}_n$.
Furthermore, letting $\mathbf{A} = (\boldsymbol{\alpha}_1, \ldots, \boldsymbol{\alpha}_n)$, then for each data point $\mathbf{x}_t$, $t = 1, \ldots, n$,
$\mathbf{Z}\boldsymbol{\alpha}_t$ represents its projection on the convex hull of the archetypes, as each $\boldsymbol{\alpha}_t$ is a
probability vector.
For a given $p$, Cutler and Breiman (1994) show that the minimisers of the RSS $R$,
Eq. (16.66), provide archetypes $\mathbf{Z} = \left(\mathbf{z}_1, \ldots, \mathbf{z}_p\right)$ that are, theoretically, located on
the boundary of the convex hull (or envelope) of the data. The convex hull of a given
dataset is the smallest convex set containing the data. Archetypes therefore provide
a typical representation of the "corners" or extremes of the observations. Figure 16.14
shows an illustration of a two-dimensional example of a set of points with its convex
hull and its approximation using five archetypes. The sample mean $\overline{\mathbf{x}} = \frac{1}{n}\sum_t\mathbf{x}_t$
provides the unique archetype for $p = 1$, and for $p = 2$ the pattern $\mathbf{z}_2 - \mathbf{z}_1$ coincides
with the leading EOF of the data. Unlike EOFs, archetypes are not required to be
nested (Cutler and Breiman 1994; Bauckhage and Thurau 2009). However, like k-

Fig. 16.14 Two-dimensional illustration of a set of 25 points along with the convex hull (dashed),
an approximate convex hull (solid) and 5 archetypes (yellow). The blue colour refers to points that
contribute to the RSS. Adapted from Hannachi and Trendafilov (2017). ©American Meteorological
Society. Used with permission

means clustering (and unlike EOFs) AA is invariant to translation and scaling and
to rotational ambiguity (Morup and Hansen 2012). In summary, AA combines the
virtues of EOFs and clustering and, most importantly, deals with extremes in high
dimensions.
Exercise Show that the mean $\overline{\mathbf{x}}$ is the unique archetype for $p = 1$.

Hint Letting $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)^T = (\mathbf{y}_1, \ldots, \mathbf{y}_m)$, $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_n)^T$, and $\varepsilon^2 =
\|\mathbf{X} - \mathbf{1}_n\boldsymbol{\beta}^T\mathbf{X}\|_F^2 = \sum_{t=1}^{n}\sum_{k=1}^{m}(x_{tk} - \boldsymbol{\beta}^T\mathbf{y}_k)^2$, and differentiating $\varepsilon^2$ with respect
to $\boldsymbol{\beta}$ one obtains the system $\sum_{k=1}^{m}(\overline{y}_k - \boldsymbol{\beta}^T\mathbf{y}_k)x_{tk} = 0$, for $t = 1, \ldots, n$, where $\overline{y}_k$ is the mean of $\mathbf{y}_k$, that is
$\left(\mathbf{x}_i^T\mathbf{x}_j\right)\boldsymbol{\beta} = \frac{1}{n}\mathbf{X}\mathbf{X}^T\mathbf{1}_n$. The only solution (satisfying the constraint) is $\boldsymbol{\beta} = \frac{1}{n}\mathbf{1}_n$,
that is $\mathbf{z} = \mathbf{X}^T\boldsymbol{\beta} = \overline{\mathbf{x}}$.

16.9.3 Numerical Solution of Archetypes

There are mainly two algorithms to solve the archetypes problem, which are
discussed below. The first one is based on the alternating algorithm (Cutler and
Breiman 1994), and the second one is based on an optimisation algorithm on
Riemannian manifolds (Hannachi and Trendafilov 2017).

Alternating Algorithm

To solve the above optimisation problem Cutler and Breiman (1994) used, starting
from an initial set of archetypes, an alternating minimisation between finding the
best A for a given set of archetypes, or equivalently B, and the best B for a given A.
The algorithm involves the following steps:

(1) Determine A, for fixed Z, by solving a constrained least squares problem.


(2) Using the obtained A, from step (1), solve, based on Eq. (16.66), $\mathbf{Z}\mathbf{A} = \mathbf{X}^T$ for the
archetypes, i.e. $\mathbf{Z} = \mathbf{X}^T\mathbf{A}^T(\mathbf{A}\mathbf{A}^T)^{-1}$.
(3) Using the obtained archetypes, from step (2), estimate the matrix B again by
solving a constrained least squares problem.
(4) Obtain an update of Z through Z = XT B, then go to step (1) unless the residual
sum of squares is smaller than a prescribed threshold. Basically each time one
solves several convex least square problems, as follows:
 
$$\begin{array}{l} \mathbf{A}_{l+1}^T = \underset{\mathbf{A}\geq 0}{\mathrm{argmin}}\left(\left\|\mathbf{X} - \mathbf{A}^T\mathbf{Z}_l^T\right\|_F^2 + \lambda\left\|\mathbf{A}^T\mathbf{1}_p - \mathbf{1}_n\right\|_2^2\right) \\ \text{and} \\ \mathbf{B}_{l+1}^T = \underset{\mathbf{B}\geq 0}{\mathrm{argmin}}\left(\left\|\mathbf{Z}_l^T - \mathbf{B}^T\mathbf{X}\right\|_F^2 + \mu\left\|\mathbf{B}^T\mathbf{1}_n - \mathbf{1}_p\right\|_2^2\right). \end{array} \tag{16.67}$$

After each iteration the archetypes are updated. For example, after finding $\mathbf{A}_{l+1}$
from Eq. (16.67) and before solving the second equation, $\mathbf{Z}$ is updated using
$\mathbf{X} = \mathbf{A}_{l+1}^T\mathbf{Z}^T$, i.e. $\mathbf{Z}^T = \left(\mathbf{A}_{l+1}\mathbf{A}_{l+1}^T\right)^{-1}\mathbf{A}_{l+1}\mathbf{X}$, which is then used in the second
equation of (16.67). After optimising this second equation, $\mathbf{Z}$ is then updated
using $\mathbf{Z} = \mathbf{X}^T\mathbf{B}_{l+1}$. This algorithm has been widely used since it was proposed
by Cutler and Breiman (1994).
Remark Both equations in (16.67) can be transformed to n + p individual convex
least square problems. For example, the first equation is equivalent to

$$\begin{array}{l} \min\; \frac{1}{2}\mathbf{a}_i^T\mathbf{Z}^T\mathbf{Z}\mathbf{a}_i - (\mathbf{x}_i^T\mathbf{Z})\mathbf{a}_i, \quad i = 1, \ldots, n \\ \text{s.t. } \mathbf{a}_i \geq 0 \text{ and } \mathbf{1}^T\mathbf{a}_i = 1, \end{array}$$

and similarly for the second equation.


The above optimisation problem is of the order $O(n^2)$, where $n$ is the sample size.
Bauckhage and Thurau (2009) reduced the complexity of the problem to $O(n'^2)$
with $n' < n$ by using only the set of data points that are not exact, but approximate,
convex combinations of the archetypes, referred to as the working set.
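As an illustration, the following is a minimal Python sketch of the alternating scheme, in which the convexity constraints of Eq. (16.67) are enforced through a large penalty combined with non-negative least squares; the initialisation, penalty value and number of iterations are arbitrary choices.

```python
import numpy as np
from scipy.optimize import nnls

def aa_alternating(X, p, n_iter=50, penalty=200.0):
    """Sketch of the alternating archetype algorithm (cf. Cutler and Breiman 1994).

    X : (n, m) data matrix;  p : number of archetypes (returned as rows of Z).
    """
    n, m = X.shape
    rng = np.random.default_rng(0)
    Z = X[rng.choice(n, p, replace=False)]          # initial archetypes (p, m)

    def convex_ls(F, G):
        # min ||G - F W||_F^2 over columns of W with W >= 0 and (penalised) 1'W = 1
        Fp = np.vstack([F, penalty * np.ones((1, F.shape[1]))])
        Gp = np.vstack([G, penalty * np.ones((1, G.shape[1]))])
        return np.column_stack([nnls(Fp, Gp[:, j])[0] for j in range(G.shape[1])])

    for _ in range(n_iter):
        A = convex_ls(Z.T, X.T)                     # (p, n): X' ~ Z' A
        Z = np.linalg.solve(A @ A.T, A @ X)         # intermediate update of the archetypes
        B = convex_ls(X.T, Z.T)                     # (n, p): Z' ~ X' B
        Z = B.T @ X                                 # archetypes as convex combinations of data
    rss = np.linalg.norm(X - A.T @ Z, 'fro') ** 2
    return Z, A, B, rss
```

Each call to `convex_ls` corresponds to the set of individual convex least squares problems mentioned in the remark above.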

Riemannian Manifold-Based Optimisation

Hannachi and Trendafilov (2017) proposed a new non-alternating algorithm based


on Riemannian manifold optimisation. Riemannian optimisation on a differential
manifold M, e.g. Smith (1994) and Absil et al. (2010), seeks solutions to the
problem

$$\min_{x\in\mathcal{M}} f(x). \tag{16.68}$$

Examples of differential manifolds include the (n − 1)-dimensional sphere S n−1 , a


submanifold of Rn . Of particular interest here is the set of matrices with unit-vector
rows (or columns), i.e.

Ob(n, p) = {X ∈ Rn×p , ddiag(XT X) = Ip },

where ddiag(Y) is the double diagonal operator, which transforms a square matrix
Y into a diagonal matrix with the same diagonal elements as Y. This manifold
is known as the oblique manifold and is topologically equivalent to the Cartesian
product of spheres, with a natural inner product given by

< X, Y >= tr(XYT ). (16.69)

The problem is transformed into an optimisation onto the oblique manifold


$Ob(n, p)$ and $Ob(p, n)$. Let $\mathbf{A}\odot\mathbf{C}$ denote the "Hadamard" or element-wise
product of $\mathbf{A} = (a_{ij})$ and $\mathbf{C} = (c_{ij})$, i.e. $\mathbf{A}\odot\mathbf{C} = (a_{ij}c_{ij})$. Because the weight
matrices $\mathbf{A}$ and $\mathbf{B}$ involved in the residual $R$, Eq. (16.66), are positive and satisfy
the convex (or stochasticity) constraint, i.e. $\mathbf{A}^T\mathbf{1}_p = \mathbf{1}_n$ and $\mathbf{B}^T\mathbf{1}_n = \mathbf{1}_p$, these
matrices can be written as (element-wise) squares of two matrices from $Ob(n, p)$
and $Ob(p, n)$, respectively, e.g. $\mathbf{A} = \mathbf{E}\odot\mathbf{E}$ and $\mathbf{B} = \mathbf{F}\odot\mathbf{F}$. For convenience, we
will again be using the notation $\mathbf{A}$ and $\mathbf{B}$ instead of $\mathbf{E}$ and $\mathbf{F}$. Therefore, the problem
can be transformed by replacing $\mathbf{A}$ and $\mathbf{B}$ by $\mathbf{A}\odot\mathbf{A}$ and $\mathbf{B}\odot\mathbf{B}$, with $\mathbf{A}$ and $\mathbf{B}$ in
$Ob(p, n)$ and $Ob(n, p)$, respectively, that is:

$$R = \left\|\mathbf{X} - (\mathbf{A}\odot\mathbf{A})^T(\mathbf{B}\odot\mathbf{B})^T\mathbf{X}\right\|_F^2 = \mathrm{tr}(\mathbf{Z}) - 2\,\mathrm{tr}(\mathbf{Z}\mathbf{W}) + \mathrm{tr}(\mathbf{Z}\mathbf{W}^T\mathbf{W}), \tag{16.70}$$

where $\mathbf{Z} = \mathbf{X}\mathbf{X}^T$ and $\mathbf{W} = (\mathbf{A}\odot\mathbf{A})^T(\mathbf{B}\odot\mathbf{B})^T$.


Exercise Derive the following identities:

$$\begin{array}{l} \frac{1}{2}\nabla\,\mathrm{tr}\left[\mathbf{C}(\mathbf{Y}\odot\mathbf{Y})^T\mathbf{D}\right] = \mathbf{Y}\odot(\mathbf{D}\mathbf{C}) \\[4pt] \frac{1}{2}\nabla\,\mathrm{tr}\left[\mathbf{C}(\mathbf{Y}\odot\mathbf{Y})(\mathbf{Y}\odot\mathbf{Y})^T\mathbf{D}\right] = \left[(\mathbf{D}\mathbf{C})^T(\mathbf{Y}\odot\mathbf{Y})\right]\odot\mathbf{Y} + \left[(\mathbf{D}\mathbf{C})(\mathbf{Y}\odot\mathbf{Y})\right]\odot\mathbf{Y} \\[4pt] \frac{1}{2}\nabla\,\mathrm{tr}\left[\mathbf{C}(\mathbf{Y}\odot\mathbf{Y})\mathbf{D}(\mathbf{Y}\odot\mathbf{Y})^T\right] = \left[\mathbf{C}^T(\mathbf{Y}\odot\mathbf{Y})\mathbf{D}^T\right]\odot\mathbf{Y} + \left[\mathbf{C}(\mathbf{Y}\odot\mathbf{Y})\mathbf{D}\right]\odot\mathbf{Y} \end{array} \tag{16.71}$$
Exercise Consider the hypersphere $S^{n-1}$ in $\mathbb{R}^n$, $S^{n-1} = \{\mathbf{x}\in\mathbb{R}^n, \|\mathbf{x}\| = 1\}$, and
the oblique manifold, Ob(n, p) = {X ∈ Rn×p , ddiag(XT X) = Ip }. The tangent
space at x ∈ S n−1 is the set of all vectors orthogonal to x, i.e.

Tx S n−1 ≡ {u ∈ Rn , uT x = 0}, (16.72)

and the orthogonal projection u∗ of any vector u onto Tx S n−1 is

u∗ = u − (xT u)x = (I − xxT )u. (16.73)



Using the topological equivalence between Ob(n, p) and the Cartesian product
of p hyperspheres S n−1 , i.e. Ob(n, p) ∼ S n−1 × · · · × S n−1 derive the projection
U∗ of any U from Rn×p onto TX Ob(n, p), namely
 
$$\mathbf{U}_* = \mathbf{U} - \mathbf{X}\,\mathrm{ddiag}\left(\mathbf{X}^T\mathbf{U}\right). \tag{16.74}$$

Let us denote $\mathbf{A}^{2.} = \mathbf{A}\odot\mathbf{A}$ and similarly for $\mathbf{B}$. Then we have the following
expression of the gradient of the costfunction $R$ (see Appendix D):

$$\begin{array}{l} \nabla_{\mathbf{A}}R = 4\left[(\mathbf{B}^{2.})^T\mathbf{Z}(-\mathbf{I}_n + \mathbf{W}^T)\right]\odot\mathbf{A} \\ \nabla_{\mathbf{B}}R = 4\left[\mathbf{Z}(-\mathbf{I}_n + \mathbf{W}^T)(\mathbf{A}^{2.})^T\right]\odot\mathbf{B} \end{array}. \tag{16.75}$$

Finally, the projection of the gradient of $R$, $\nabla_{\mathbf{A},\mathbf{B}}R$, onto the tangent space of the
oblique manifolds yields the final gradient $\mathrm{grad}_{\mathbf{A},\mathbf{B}}R$, namely:

$$\begin{array}{l} \mathrm{grad}_{\mathbf{A}}R = \nabla_{\mathbf{A}}R - \mathbf{A}\,\mathrm{ddiag}(\mathbf{A}^T\nabla_{\mathbf{A}}R) \\ \mathrm{grad}_{\mathbf{B}}R = \nabla_{\mathbf{B}}R - \mathbf{B}\,\mathrm{ddiag}(\mathbf{B}^T\nabla_{\mathbf{B}}R) \end{array}. \tag{16.76}$$

After minimising the costfunction $R$ using the expression of the gradient,
Eq. (16.76), the archetypes are then given by

$$\mathbf{Z} = \mathbf{X}^T(\mathbf{B}\odot\mathbf{B}) = \mathbf{X}^T\mathbf{B}^{2.} \tag{16.77}$$

Exercise Derive the above expressions of the gradient.
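For illustration, a minimal Python sketch of one projected-gradient step based on Eqs. (16.70) and (16.75)–(16.76) is given below; the fixed step size and the simple column-renormalisation retraction are illustrative simplifications of the full Riemannian algorithm.

```python
import numpy as np

def aa_oblique_gradient_step(X, A, B, step=1e-4):
    """One projected-gradient step for R, Eqs. (16.70) and (16.75)-(16.76) (a sketch).

    A : (p, n) and B : (n, p) have unit-norm columns (oblique manifolds), so the
    stochastic weight matrices are the Hadamard squares A*A and B*B.
    """
    n = X.shape[0]
    Z = X @ X.T
    A2, B2 = A * A, B * B
    W = A2.T @ B2.T                                  # W = (A^2.)^T (B^2.)^T
    M = Z @ (W.T - np.eye(n))                        # Z(-I_n + W^T)
    dA = 4.0 * ((B2.T @ M) * A)                      # Eq. (16.75)
    dB = 4.0 * ((M @ A2.T) * B)
    gA = dA - A * np.sum(A * dA, axis=0)             # Eq. (16.76): project onto tangent space
    gB = dB - B * np.sum(B * dB, axis=0)
    A, B = A - step * gA, B - step * gB              # descent step
    A /= np.linalg.norm(A, axis=0)                   # retraction: renormalise the columns
    B /= np.linalg.norm(B, axis=0)
    return A, B
```

Iterating such steps (with a suitable step-size rule) decreases $R$, after which the archetypes follow from Eq. (16.77) as `X.T @ (B * B)`.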


Other developments have been obtained including weighted and robust AA
(Eugster and Leisch 2011), and probabilistic archetype analysis (Seth and Eugster
2015). In weighted AA a weight matrix $\mathbf{W}$ is applied to $\mathbf{X} - \mathbf{A}^T\mathbf{B}^T\mathbf{X}$ in Eq. (16.66)
in order to reduce the sensitivity of AA to outliers. The problem is still equivalent
to Eq. (16.66), though with $\mathbf{X}$ replaced by $\mathbf{W}\mathbf{X}$, i.e. $RSS = \|\mathbf{W}(\mathbf{X} - \mathbf{A}^T\mathbf{B}^T\mathbf{X})\|_F^2 =
\|\mathbf{W}\mathbf{X} - \mathbf{A}^T\mathbf{B}^T\mathbf{W}\mathbf{X}\|_F^2$, and hence the same algorithm can be used.

Exercise Show that indeed $\mathbf{W}(\mathbf{X} - \mathbf{A}^T\mathbf{Z}) = \mathbf{W}\mathbf{X} - \mathbf{A}^T\mathbf{B}^T\mathbf{W}\mathbf{X}$, and hence it is of the
form $\tilde{\mathbf{X}} - \mathbf{A}^T\tilde{\mathbf{Z}}$ with $\tilde{\mathbf{Z}} = \mathbf{B}^T\tilde{\mathbf{X}}$.
Indication Proceed columnwise, i.e. use the columns of $\mathbf{X} - \mathbf{A}^T\mathbf{Z}$. For example, the
$i$th column of this matrix is of the form $\mathbf{y}_i = \mathbf{x}_i - \sum_{j=1}^{p}\sum_{k=1}^{n}a_{ij}b_{jk}\mathbf{x}_k$, and, consequently,
the $i$th column of $\mathbf{W}(\mathbf{X} - \mathbf{A}^T\mathbf{Z})$ is $\mathbf{W}\mathbf{y}_i = \mathbf{W}\mathbf{x}_i - \sum_{j=1}^{p}\sum_{k=1}^{n}a_{ij}b_{jk}(\mathbf{W}\mathbf{x}_k)$.
In probabilistic AA the procedure is based on analysing the convex hull in the
parameter space instead of the sample space. That is, the archetypes and associated
factors are expressed in terms of the parameters of the distribution of the data. For
example, the original AA problem can be formulated in terms of a (simplex) latent
variable model with normal observations. Eugster and Leisch (2011) extended the
probabilistic formulation to other distributions from the exponential family.

Remark (Relation to Non-negative Matrix Factorisation) AA has some similarities


to what is known as non-negative matrix factorisation (NMF), e.g. Lee and Seung
(1999). NMF seeks to decompose a non-negative n × p matrix X into a product of
two non-negative n × q and q × p matrices Z and H, i.e. X ≈ ZH, through, e.g. the
following optimisation problem:

$$\{\mathbf{Z}, \mathbf{H}\} = \underset{\mathbf{Z},\mathbf{H}\geq 0}{\mathrm{argmin}}\left\|\mathbf{X} - \mathbf{Z}\mathbf{H}\right\|_F^2. \tag{16.78}$$

Lee and Seung (1999) minimised the cost function:


$$F(\mathbf{Z}, \mathbf{H}) = \sum_{i=1}^{n}\sum_{j=1}^{p}\left[x_{ij}\log(\mathbf{Z}\mathbf{H})_{ij} - (\mathbf{Z}\mathbf{H})_{ij}\right],$$

subject to the non-negativity constraint, and used a multiplicative updating rule. But
other algorithms, e.g. alternating rules as in AA, can also be used. It is clear that the
main difference between AA and NMF is the stochasticity of the matrices Z and H.
In terms of patterns, AA yields archetypes whereas NMF yields characteristic parts
(Bauckhage and Thurau 2009). To bring it closer to AA, NMF has been extended
to convex NMF (C-NMF), see e.g. Ding et al. (2010), where the non-negativity of
X is relaxed, and the non-negative matrix Z takes the form Z = XW, with W a
non-negative matrix.
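For comparison with AA, a minimal sketch of NMF using the classical multiplicative updates is shown below; note that it targets the Frobenius cost of Eq. (16.78) rather than the divergence-type cost quoted above, and the initialisation and iteration count are arbitrary.

```python
import numpy as np

def nmf(X, q, n_iter=200, eps=1e-9):
    """Sketch of NMF, Eq. (16.78), with multiplicative updates for the Frobenius cost.

    X : non-negative (n, p) data matrix; q : number of components.
    """
    n, p = X.shape
    rng = np.random.default_rng(0)
    Z, H = rng.random((n, q)), rng.random((q, p))
    for _ in range(n_iter):
        H *= (Z.T @ X) / (Z.T @ Z @ H + eps)     # multiplicative update for H
        Z *= (X @ H.T) / (Z @ H @ H.T + eps)     # multiplicative update for Z
    return Z, H
```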

16.9.4 Archetypes and Simplex Visualisation

One of the nice and elegant features of simplexes is the two-dimensional visual-
isation of any m-simplex,6 i.e. m-dimensional polytope that is the convex hull of
its m + 1 vertices. This projection is well known in algebraic topology, sometimes
referred to as “skew orthogonal” projection7 and shows all the vertices of a regular
simplex on a circle where all vertices pairs are connected by edges. For example,
the regular 3-simplex (tetrahedron) projects onto a square (Fig. 16.15). Any point
$\mathbf{y} = (y_1, \ldots, y_{m+1})^T$ in $\mathbb{R}^{m+1}$ can be projected onto the $m$-simplex. The projection
of $\mathbf{y}$ onto the standard $m$-simplex is the closest point $\mathbf{t} = (t_1, \ldots, t_{m+1}) \geq 0$,
$\sum_i t_i = 1$, to $\mathbf{y}$. The point $\mathbf{t}$ satisfies $t_i = \max(y_i + e, 0)$, $i = 1, \ldots, m+1$. The
number $e$ can be obtained through a sorting algorithm of complexity $O(n\log n)$
(Michelot 1986; Malozemov and Pevnyi 1992; Causa and Raciti 2013).
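A minimal Python implementation of this sorting-based projection reads as follows; the variable names are illustrative.

```python
import numpy as np

def project_to_simplex(y):
    """Euclidean projection of y onto the standard simplex {t >= 0, sum(t) = 1}."""
    m = len(y)
    u = np.sort(y)[::-1]                                  # sort in decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, m + 1) > 0)[0][-1]
    e = (1.0 - css[rho]) / (rho + 1.0)                    # the shift e of the text
    return np.maximum(y + e, 0.0)                         # t_i = max(y_i + e, 0)
```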
As pointed out above, $\mathbf{Z}\boldsymbol{\alpha}_i$, $i = 1, \ldots, n$, is the best approximation of $\mathbf{x}_i$ on
the convex hull of the archetypes $\mathbf{Z}$, i.e. the $(p-1)$-simplex with vertices the

$^6$ Given $m+1$ points (vertices) $\mathbf{c}_1, \ldots, \mathbf{c}_{m+1}$ in $\mathbb{R}^{m+1}$, the $m$-simplex is the set of points of the form $\sum_{i=1}^{m+1}\theta_i\mathbf{c}_i$, with $\theta_i \geq 0$, $i = 1, \ldots, m+1$, and $\sum_{i=1}^{m+1}\theta_i = 1$.
$^7$ Also known as Petrie polygon.

Fig. 16.15 An example of a 3-simplex or tetrahedron (a), its two-dimensional projection (b), and
a two-dimensional projection of a 4-simplex (c)

p archetypes z1 , . . . , zp . The components of α i represent the “coordinates” with


respect to these vertices, and can be visualised on a 2D plane using a skew
orthogonal projection. Cutler and Breiman (1994) provide an illustration of the
simplex visualisation with only three archetypes, known as a ternary plot. Later
authors extended the visualisation to more than three archetypes (Bauckhage and
Thurau 2009; Eugster and Leisch 2013; Seth and Eugster 2015).
The following example (Hannachi and Trendafilov 2017) shows an application
of AA to a 5-cluster model located on the vertices of a 3D polytope. As described
in Sect. 16.2, AA is not nested, and therefore various values of p, the number of
archetypes, are used. For a given value of p the archetypes are obtained and the
residual sum of squares is computed. Figure 16.16a shows a kind of scree plot
of the (relative) RSS versus $p$, from which the most probable number
of archetypes can be suggested. Figure 16.16b shows the
costfunction R along with the gradient norm. The costfunction reaches its floor
during the first few hundred iterations although the gradient norm continues to
decrease with increasing iteration. Figure 16.16c shows the skew projection of the
elements of the probability matrix A2. = A  A, namely the two-dimensional
simplex projection where the archetypes hold the simplex vertices. The clusters
are associated with some extent with the archetypes although the centroids are, as
expected, different from the archetypes.

16.9.5 An Application of AA to Climate

Hannachi and Trendafilov (2017) applied AA to sea surface temperature (SST) and
Asian summer monsoon. The SST anomaly data come from the Hadley Centre Sea
Ice and Sea Surface Temperature8 (Rayner et al. 2003). The data are on a 1◦ ×

8 www.metoffice.gov.uk/hadobs/hadisst/.

a)
100

Relative RSS
80

60

40

20

0
1 5 10 15
Number of archetypes p
b)
100
Gradient norm and

gradient
costfunction
costfunction

-1
10

10-2

10-3

10-4
0 100 200 300 400 500
Iteration number
c)
1

-1
-1 0

Fig. 16.16 Scree plot (a) of a five Gaussian-clusters, costfunction and gradient norm of R (b) for
5 archetypes, and the skew projection (c) using the same 5 archetypes. Adapted from Hannachi
and Trendafilov (2017). ©American Meteorological Society. Used with permission

1◦ latitude–longitude grid from Jan 1870 to Dec 2014, over the region 45.5◦ S–
45.5◦ N. The scree plot (Fig. 16.17) shows a break- (or elbow-)like feature at p = 3
suggesting three archetypes.
The three archetypes suggested by Fig. 16.17 are shown in Fig. 16.18. The
first two archetypes show, respectively, El-Niño and La-Niña, the third archetype
shows the western boundary currents, namely Kuroshio, Gulf Stream and Agulhas
currents, in addition to the Brazil, East Australian and few other currents. It is
known that western boundary currents are driven by major gyres, which transport
warm tropical waters poleward along narrow, and sometimes deep, currents. These


Fig. 16.17 Scree plot of the SST anomalies showing the relative RSS versus the archetypes
number. Adapted from Hannachi and Trendafilov (2017). ©American Meteorological Society.
Used with permission


Fig. 16.18 The obtained three archetypes of the SST anomalies showing El-Niño (a), La-Niña
(b) and the western currents (c). Contour interval 0.2◦ C. Adapted from Hannachi and Trendafilov
(2017). ©American Meteorological Society. Used with permission


Fig. 16.19 Mixture weights of the three archetypes of SST anomalies, El-Niño (a), La-Niña
(b), and the western boundary currents (c). Adapted from Hannachi and Trendafilov (2017).
©American Meteorological Society. Used with permission

currents are normally fast and are referred to as the western intensification (e.g.
Stommel 1948, Munk 1950). This strongly suggests that these western boundary
currents project onto extreme events, which are located on the outer boundary of
the system state space. It should be recalled here that the SST field is different
from the surface currents, which better capture the boundary currents. Records of
surface currents, however, are not long enough, in addition to the non-negligible
uncertainties in these currents.
The mixture weights of these archetypes are shown in Fig. 16.19.
For the El-Niño archetype (Fig. 16.19a) the contribution comes from various observations
scattered over the observation period, most notably from the first half
of the record. Those events correspond to prototype El-Niños, with the largest weights
taking place at the end of the nineteenth and early twentieth centuries and in the last few
decades.
For the La-Niña archetype (Fig. 16.19b) there is a decreasing contribution with
time, with most weights located in the first half of the record, with particularly high
contribution from the event of the year 1916–1917. One can also see contributions
from La-Niña events of 1955 and 1975. It is interesting to note that these
contributing weights are clustered (in time). Unlike the previous two archetypes


Fig. 16.20 Time series amplitudes of the leading three archetypes (a, b, c) of the SST anomalies.
Adapted from Hannachi and Trendafilov (2017). ©American Meteorological Society. Used with
permission

the third, western current, archetype (Fig. 16.19c) is dominated by the last quarter
of the observational period starting around the late 1970s.
The time series of the archetypes, i.e. the columns of the stochastic matrix $\mathbf{A}^{2.}$,
show the "amplitudes" of the archetypes, somewhat similar to the PCs, and are
shown in Fig. 16.20. The time series of El-Niño shows a slight weakening of the
archetype, although the events of the early 80s and late 90s clearly show up.
There is a decrease from the 90s to the end of the record. Prior to about 1945
the signal seemed quite stationary in terms of strength and frequency. The time
series of the La-Niña archetype shows a general decrease in the last 50 or so years. The
signal was somewhat "stationary" (with no clear trend) before about 1920. Unlike the
previous El-Niño and La-Niña archetypes, the third (or western current) archetype
time series has an increasing trend starting immediately after an extended period of
weak activity around 1910. The trend is not gradual, with the existence of a period
of moderate activity around the 1960s. The strongest activity occurs during the last
two decades, starting around the late 1990s. Figure 16.21 shows the simplex projection
of the data using three archetypes. The colours refer to the points that are closest to
each of the three archetypes.


Fig. 16.21 Two-dimensional simplex projection of the SST anomalies using three archetypes. The
200 points that are closest to each of the three archetypes are coloured, respectively, red, blue and
black and the remaining points are shown by light grey-shading. Adapted from Hannachi and
Trendafilov (2017). ©American Meteorological Society. Used with permission


Fig. 16.22 (a) Relative RSS of 8 subsamples of sizes ranging from 1500 to 100, by increments of
200 (continuous) as well as the curves for the whole sample (black dotted-dashed), and the original
data less the top 1st (blue diamond), 2.5th (red square) and the 5th (black triangle) percentiles. (b)
Projection of the three archetypes of the subsamples described in (a) onto the leading three EOFs
of the SST anomalies (star), along with the same plots of the whole sample (filled circle), and the
original data less the top 1st (diamond), 2.5th (square) and 5th (triangle) percentiles. Adapted from
Hannachi and Trendafilov (2017). ©American Meteorological Society. Used with permission

Hannachi and Trendafilov (2017) also applied AA to the detrended SST anoma-
lies. Their finding suggests only two main archetypes, namely El-Niño and La Niña.
This once again strongly suggests that the archetype associated with the western
boundary currents is the pattern that mostly explains the trend in extremes. They also
show that the method is quite robust to sample size and extremes (Fig. 16.22).

16.10 Other Nonlinear PC Methods

16.10.1 Principal Nonlinear Dynamical Modes

Unlike conventional (linear) PCA nonlinear methods seek an approximation of the


multivariate data in terms of a set of nonlinear manifolds (or curves) maximising
a specific criterion. These nonlinear manifolds constitute an approximation of the
d-dimensional data vector at each time, x(t), as:


$$\mathbf{x}(t) = \sum_{k=1}^{q}\mathbf{f}_k(p_k(t)) + \boldsymbol{\varepsilon}_t, \tag{16.79}$$

where fk (.) is the kth trajectory, i.e. a map from the set of real numbers into the d-
dimensional data space, pk (.) is the associated time series, and εt is a residual term.
Conventional EOF method corresponds to linear maps in Eq. (16.79).
An interesting nonlinear EOF analysis method, namely nonlinear dynamical
mode (NDM) decomposition was presented by Mukhin et al. (2015). In their
decomposition Mukhin et al. (2015) used Eq. (16.79) with extra parameters and
fitted the model to the multivariate time series xt , t = 1, . . . n, with the nonlinear
trajectory fk (.) being the kth NDM. The NDMs are computed recursively by
identifying one NDM at a time, then compute the next one from the residuals, etc.,
that is:

xt = f(a, pt ) + ε t , t = 1, . . . n, (16.80)

with εt being multinormal with diagonal covariance matrix. The component f(.), a
and pt , are first estimated from the sample xt , t = 1, . . . n, then the next compo-
nents, similar to f, a, and pt , t = 1, . . . n, are estimated from the residuals εt , etc.
Each component of the NDM, e.g. $\mathbf{f}(.)$ in Eq. (16.80), is basically a combination of
polynomials that are orthogonal with respect to the Gaussian probability density function,
namely Hermite polynomials.
As in most nonlinear PC methods, the data dimension was reduced via (linear)
PCA, which simplifies the computation significantly. In vector form the principal
components at time $t$, $\mathbf{y}_t$, $t = 1, \ldots, n$, are expanded as:
   
$$\mathbf{y}_t = \begin{pmatrix}\mathbf{A}\mathbf{w}(p_t) \\ \mathbf{O}\end{pmatrix} + \begin{pmatrix}\sigma\boldsymbol{\varepsilon}_1 \\ \mathbf{D}\boldsymbol{\varepsilon}_2\end{pmatrix}, \tag{16.81}$$

where $\mathbf{w}$ is an $m$-dimensional vector containing the leading $m$ Hermite polynomials
($m$ being the number of PCs retained for the analysis), $\mathbf{A} = (a_{ij})$ represents
the coefficients of these polynomials, whereas $p_t$, $t = 1, \ldots, n$, is the hidden time
series. Also, $\boldsymbol{\varepsilon}_1$ and $\boldsymbol{\varepsilon}_2$ are white noise vectors, $\sigma$ is the (common) amplitude of
each component of $\boldsymbol{\varepsilon}_1$, and $\mathbf{D} = \mathrm{diag}(\sigma_1, \ldots, \sigma_{D-m})$ contains the amplitudes

of $\boldsymbol{\varepsilon}_2$. The last term on the right-hand side of Eq. (16.81) represents the residuals,
which are used in the next step to get the next hidden time series and associated
NDM.
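To illustrate the structure of the deterministic part of Eq. (16.81), the short sketch below evaluates $\mathbf{A}\mathbf{w}(p_t)$ with $\mathbf{w}$ built from the first $m$ probabilists' Hermite polynomials; the inclusion of the constant polynomial and the normalisation are assumptions made here for illustration only.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermevander

def ndm_deterministic_part(A, p):
    """Evaluate A w(p_t) of Eq. (16.81) for a hidden time series p (a sketch).

    A : (d, m) matrix of polynomial coefficients a_ij.
    p : (n,) hidden time series p_t.
    Returns an (n, d) array, one reconstructed vector per time step.
    """
    m = A.shape[1]
    W = hermevander(p, m - 1)      # W[t, j] = He_j(p_t), j = 0, ..., m-1
    return W @ A.T
```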
Mukhin et al. (2015) used a Bayesian framework and maximised the likelihood
function

$$L\left((a_{ij}), p_1, \ldots, p_n, \sigma, (\sigma_k)\right) = Pr\left(\mathbf{x}_1, \ldots, \mathbf{x}_n \,|\, (a_{ij}), p_1, \ldots, p_n, \sigma, (\sigma_k)\right)\, P_{pr}\left((a_{ij}), p_1, \ldots, p_n, \sigma, (\sigma_k)\right), \tag{16.82}$$

where the last term is a prior distribution. The prior distribution of the latent
variables p11 , . . . p1n , i.e. p1 (t), t = 1, . . . n, was taken to be multinormal based
on the assumption of a first-order autoregressive model. They also assumed a
multinormal distribution with diagonal covariance matrix for the prior distribution
of the parameter vector a1 . One of the interesting properties of NDMs, compared
to other methods such as kernel EOFs and Isomap method (Hannachi and Turner
2013b), is that the method provides simultaneously the NDMs and associated time
series.
Mukhin et al. (2015), see also Gavrilov et al. (2016), applied the NDM method
to monthly National Oceanic and Atmospheric Administration (NOAA) optimal
interpolation OI.v2 SST data for the period 1981–2015. Substantial explained
variance was found to be captured by the leading few NDMs (Fig. 16.23). The
leading NDM captures the annual cycle (Fig. 16.23a). An interesting feature they
identified in the second NDM was a shift that occurred in 1997–1998 (Fig. 16.23b).

Fig. 16.23 The leading three time series p1t , p2t , p3t , t = 1, . . . n, of the leading three NDMs (a)
and the trajectory of the system within the obtained three-dimensional space (b). Blue colour refers
to the period 1981–1997 and the red refers to 1998–2015. Adapted from Mukhin et al. (2015)

Fig. 16.24 Spatial patterns associated with the leading two NDMs showing the difference between
winter and summer averages of SST explained by NDM1 (a) and the difference between SST
explained by NDM2 (b) averaged over the periods before and after 1998 and showing the opposite
phase of the PDO. Adapted from Mukhin et al. (2015)

The second NDM captures also some parts of the Pacific decadal oscillation
(PDO), North Tropical Atlantic (NTA) and the North Atlantic indices. The third
NDM (Fig. 16.23a) captures parts of the PDO, NTA and the Indian Ocean dipole
(IOD). The spatial patterns of the leading two NDMs are shown in Fig. 16.24. As
expected, the nonlinear modes capture larger explained variance compared to those
explained by conventional EOFs. For example, the leading three NDMs are found to
explain around 85% of the total variance versus 80% from the leading three EOFs.
The leading NDM by itself explains around 78%, compared to 69% for the leading
EOF.
Gavrilov et al. (2016) computed the NDMs of SSTs from a 250-year pre-
industrial control run. They obtained several non-zero nonlinear modes. The leading
mode came out as the seasonal cycle whereas the ENSO cycle was captured by a
combination of the second and third NDMs. The combination of the fourth and
the fifth nonlinear modes yielded a decadal mode. The time series of these modes
are shown in Fig. 16.25. The leading five PCs of the SSTs are also shown for
comparison. The effect of mixing, characterising EOFs and PCs, can be clearly seen
in the figure. The time series of the nonlinear modes do not suffer from the mixing
drawback.

16.10.2 Nonlinear PCs via Neural Networks

An interesting method to obtain nonlinear PCs is to apply techniques taken from


machine (or deep) learning using neural networks. These methods, described in
the following chapter, use various nonlinear transformations to model complex
architectures. Hsieh (2009) describes the application of neural networks to compute
nonlinear PCs. An example is discussed in the last section of Chap. 17, presented
below.

Fig. 16.25 Times series of the first five nonlinear dynamical modes (left) and the PCs (right)
obtained from the SST simulation of a climate model. Adapted from Gavrilov et al. (2016)
Chapter 17
Machine Learning

Abstract This last chapter discusses a relatively new method applied in atmo-
spheric and climate science: machine learning. Put simply, machine learning refers
to the use of algorithms allowing the computer to learn from the data and use
this learning to identify patterns or draw inferences from the data. The chapter
describes briefly the flavour of machine learning and discusses three main methods
used in weather and climate, namely neural networks, self-organising maps and
random forests. These algorithms can be used for various purposes, including
finding structures in the data and making predictions.

Keywords Machine learning · Neural networks · Training sets ·


Back-propagation · Self-organising · Random forests · Decision trees

17.1 Background

Unprecedented advances in technology and computer power have led to a remark-


able increase in the amount of massive data generated from observational instru-
ments or model simulations. This data volume is obtained in countless domains
such as medicine (Keogh et al. 2001; Matsubara et al. 2014), finance (Zhu and
Shasha 2002) and climate (e.g. Hsieh and Tang 1998; Scher and Messori 2019).
This unequivocal increase in data volume brings a unique opportunity to scientists
to explore and analyse, in a comprehensive way, the available amount of information
to get valuable knowledge and gain wisdom. This kind of knowledge is usually
achieved under the heading of machine learning or artificial intelligence. The insight
behind this is to allow the computer to explore “all” the possibilities and figure
out the optimum solution. There is no universal definition, however, of machine
learning. For example, one of the earliest definitions is that of Samuel (1959),
who defined machine learning as the field of study that enables computers to
learn without explicit programming. More recently, one reads from Wikipedia:
“machine learning is the study of computer algorithms that improve automatically
through experience”. This is more like the definition of Mitchell (1998) who


presents machine learning as a well-posed learning problem based on some kind
of performance measure to complete a certain task, where performance improves with experience.
Machine learning algorithms construct mathematical models based on a sample,
or training, dataset (e.g. Bishop 2006). Machine learning is a subset of artificial
intelligence, which refers to “intelligence” exhibited by machines. Machine learning
contains supervised (e.g. classification) and unsupervised (e.g. clustering) learning,
in addition to artificial neural networks, while deep learning (Chollet 2018) is a subset of neural networks (Scher 2020). Some computer scientists, however, consider that artificial intelligence and machine learning can be seen as two faces of the same coin (e.g. Bishop 2006), and in the rest of the chapter, unless otherwise stated, I will refer to machine learning.
Machine learning is used to solve many problems, ranging from pattern recognition and feature extraction (e.g. clustering) to dimensionality reduction, identification of relationships among variables, nonlinear modelling and time series forecasting. The algorithms used in machine learning include mostly (different types of) neural networks (NNs), self-organising maps (SOMs), decision trees and random forests.
of textbooks have been written on machine learning, e.g. Bishop (2006), Hastie et
al. (2009), Goodfellow et al. (2016), Buduma (2017) and Haykin (1999, 2009). This
chapter provides an introduction to the above few algorithms, which are among those most used in machine learning; for a more comprehensive treatment the reader is referred to the above textbooks.

17.2 Neural Networks

17.2.1 Background and Rationale

Neural networks (NNs) originate from an attempt by scientists to mimic the human
brain during the process of learning and pattern recognition (McCulloch and Pitts
1943; Rosenblatt 1962; Widrow and Stearns 1985). The cornerstone of NNs is
the so-called universal approximation theorem (Cybenko 1989; Hornik 1991). The
theorem states that any regular multivariate real valued function f (x) can be
approximated at any given precision by a NN with one hidden layer and a finite
number of neurons with the same activation function and one linear output neuron.
That is, $f(\mathbf{x}) \approx \sum_{k=1}^{m} \alpha_k\, g\left(\mathbf{w}_k^T \mathbf{x} + b_k\right)$, with $g(\cdot)$ being a bounded function with specific properties, referred to as a sigmoid function, see below. NNs are supposed to
be capable of performing various tasks, e.g. pattern recognition and regression (Watt
et al. 2020). In addition, NNs can also be used for other purposes such as dimension
reduction, time series prediction (Wan 1994; Zhang et al. 1997), classification and
pdf estimation. And as put by Nielsen (2015), “Neural networks are one of the most
beautiful programming paradigms ever invented”.

The idea of introducing NN as computing machine was proposed by McCulloch


and Pitts (1943), but it was Rosenblatt (1958) who proposed the perceptron as
the first model for supervised learning (Haykin 2009). The perceptron is based
on a single neuron and was used in a binary classification problem. The model
of this perceptron is shown in Fig. 17.1 (upper panel). Given an input vector
$\mathbf{x} = (x_1, \ldots, x_m)^T$, a set of weights $\mathbf{w} = (w_1, \ldots, w_m)^T$ is used to form a linear combination $\mathbf{w}^T\mathbf{x} = \sum_k w_k x_k$, possibly with a bias term b, which is then fed to a sigmoid function $g(\cdot)$ to get an output $o = g(\mathbf{w}^T\mathbf{x} + b)$. Two examples of sigmoid functions are shown in Fig. 17.1 (middle panel).
There are basically two main types of statistical models, namely regression and classification. Using a fixed set of basis functions $\phi_j(\cdot)$, $j = 1, \ldots, M$, these models can be written as $y = \sum_{j=1}^{M} w_j \phi_j(\mathbf{x})$ for linear regression and $y = g\left(\sum_{j=1}^{M} w_j \phi_j(\mathbf{x})\right)$ for classification, where $g(\cdot)$ is a nonlinear sigmoid function. To
illustrate, for example, how supervised learning works, let us consider the case of a
binary classification. We have a set of inputs x1 , . . . xn with corresponding classes
(or targets) c1 , . . . cn that are 0 or 1. The objective is to predict the class co of a new
observation xo . A naive solution is to fit a linear regression c(x) = wT x and use
it to get the predicted class of xo . This, however, gives unrealistic values of co and
can yield wrong classification. The correct alternative is to consider the so-called
logistic regression model by using the logistic function:

$$c(\mathbf{x}) = g(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{x}}}.$$

Remark Note, in particular, that the logistic function (and similar sigmoids) has nice properties. For example, it has a simple derivative, $g'(x) = g(x)\left(1 - g(x)\right)$, and $g(x)$ is approximately linear for small $|x|$.
Now c(x) can be interpreted as a probability of x belonging to class 1, i.e. c(x) =
P r(c = 1|x; w). The logistic regression model shows the importance of using the
sigmoid function to determine the class of a new observation. Note that this model
works only for linearly separable classes. The application of the binary classification
problem in NN becomes straightforward, given a training set (x1 , c1 ), . . . (xn , cn ).
For the Rosenblatt perceptron, this is obtained by finding the weights, w1 , . . . wm
(and possibly a bias b), minimising an objective function measuring the distance
between the model output and the target. Since the target is binary, the costfunction consistent with the logistic regression is $\frac{1}{2n}\sum_{i=1}^{n} f\left(g(\mathbf{w}^T\mathbf{x}_i), c_i\right)$, with the distance function given by $f(y, z) = -z\log y - (1-z)\log(1-y)$. This convex function
is known as cross-entropy error (Bishop 2006) and is derived based on statistical
arguments, see discussion in Sect. 17.2.5 below. The convexity is particularly
useful in optimisation. Rosenblatt (1962) showed precisely that the perceptron
algorithm converges and identifies the hyperplane between the two classes, that is,
the perceptron convergence theorem. The single perceptron NN can be extended to
include two or more neurons, which will form a layer, the hidden layer, yielding
the single-layer perceptron (Fig. 17.1, bottom panel). This latter model can also be

Fig. 17.1 Basic model of a simple nonlinear neuron, or perceptron (top), two examples of sigmoid functions, the logistic function g(x) = 1/(1 + exp(−x)) and the threshold function g(x) = 1 for x > 0 (middle), and a feedforward perceptron with one input, one output and one hidden layer, along with the weights W(1) and W(2) linking the input and output layers to the hidden layer (bottom)

extended to include more than one hidden layer, yielding the multilayer perceptron
(Fig. 17.2). Now, recalling the universal approximation theorem, a network with
one hidden layer can approximate any bounded function with arbitrary accuracy.
Similarly, a network with two hidden layers can approximate any function with
arbitrary accuracy.
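To make the above discussion concrete, the following short Python/NumPy sketch (illustrative only, not taken from the references above) trains a single-neuron perceptron as a logistic regression by gradient descent on the cross-entropy error; the toy data, learning rate and number of iterations are arbitrary choices.

# A minimal sketch (not from the text) of a single-neuron perceptron trained
# as a logistic regression by gradient descent on the cross-entropy error.
import numpy as np

rng = np.random.default_rng(0)

# two linearly separable 2-D classes (toy data)
n = 100
x0 = rng.normal(loc=[-1.5, -1.0], scale=0.7, size=(n, 2))   # class 0
x1 = rng.normal(loc=[+1.5, +1.0], scale=0.7, size=(n, 2))   # class 1
X = np.vstack([x0, x1])
c = np.hstack([np.zeros(n), np.ones(n)])

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

w = np.zeros(2)   # weights of the perceptron
b = 0.0           # bias
eta = 0.5         # learning rate (illustrative value)

for it in range(2000):
    y = sigmoid(X @ w + b)            # output o = g(w^T x + b)
    # gradient of the mean cross-entropy J = -mean[c log y + (1-c) log(1-y)]
    err = y - c
    w -= eta * (X.T @ err) / len(c)
    b -= eta * err.mean()

pred = (sigmoid(X @ w + b) > 0.5).astype(int)
print("training accuracy:", (pred == c).mean())

The simple form of the gradient used in the loop, X^T(y − c)/n, follows from the derivative of the logistic function mentioned in the Remark above.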

17.2.2 General Structure of Neural Networks

The progress in NNs is mainly triggered by the advent in computing power.


Figure 17.2 shows a schematic of a general representation of a neural network
model. The model shown in Fig. 17.2 represents an example of a feedforward NN
because the flow is forward. The circles represent the neurons, or units, and the lines
or arrows represent the weights. Each NN comprises the input and output layers
as well as a finite number of hidden layers represented, respectively, by filled and
open circles (Fig. 17.2). The output of the ith layer is used as input to the next
(i + 1)th layer. The response of a unit is called activation. Figure 17.3 shows a
diagram reflecting the relationship between the j th neuron of the (i + 1)th layer,
denoted (j, i + 1), and the neurons of the previous layer. In general, the output (or activation) $y_j^{(i+1)}$ of the unit (j, i + 1) is a nonlinear function, which may depend on the outputs $y_k^{(i)}$ of the units (k, i), $k = 1, \ldots, m$, as

$$y_j^{(i+1)} = g_{(i+1)}\left(\sum_{k=1}^{m} w_{kj}^{i}\, y_k^{(i)} + \varepsilon_j^{(i+1)}\right), \qquad (17.1)$$

where the weights $(w_{kj}^{i})$ represent interconnecting parameters between units, $\varepsilon_j^{(i+1)}$ is the bias (or offset) and $g_{(i+1)}(\cdot)$ is the transfer function characterising the (i + 1)th layer (Rumelhart et al. 1994; Haykin 2009; Bishop 2006). Note that the transfer (or sigmoid) function $g_{(1)}(\cdot)$ of the input layer, which contains the input data, is simply the identity.
The transfer functions are normally chosen from a specific family referred to as
sigmoid or squashing functions. Sigmoid functions were considered in Chap. 12.
The main property of these sigmoid functions is that they are positive and go to
zero and one as the argument goes, respectively, to minus and plus infinity, i.e.
g(x) ≥ 0, and g(x) → 0 as x → −∞ and g(x) → 1 as x → +∞. In
most cases the sigmoid functions are monotonic increasing. The hyperbolic tangent
function is often the most used sigmoid. Other transfer functions, such as the logistic
and cumulative distribution functions, are also commonly used. The cumulative
distribution function of the logistic distribution is

$$g(\mathbf{x}) = \left(1 + \exp\left(\mathbf{a}^T\mathbf{x} + b\right)\right)^{-1} = \frac{1}{2}\left(1 - \tanh\left(\frac{\mathbf{a}^T\mathbf{x} + b}{2}\right)\right),$$

Fig. 17.2 Schematic representation of an example of a multilayer feedforward neural network model with 5 layers

Fig. 17.3 Relation between a given neuron of a given layer and neurons from the previous layer

The following two functions are also used sometimes:

• threshold function, e.g. $g(x) = 1_{\{x > 0\}}$, and
• piece-wise linear, $g(x) = \left(x + \frac{1}{2}\right) 1_{]-\frac{1}{2}, \frac{1}{2}[}(x) + 1_{[\frac{1}{2}, \infty[}(x)$.

One of the main reasons behind using squashing functions is to reduce the effect
of extreme input values on the performance of the network (Hill et al. 1994).
Many of the sigmoid functions have other nice properties, such as having simple
derivatives, see above for the example of the sigmoid function. This property is
particularly useful during the optimisation process.
For the output layer, the activation can also be linear or threshold function. For
example, in the case of classification, the activation of the output layer is a threshold
function equaling one if the input belongs to its class and zero otherwise. This has
the effect of identifying a classifier, i.e. a function g(.) from the space of all objects
(inputs) into the set of K classes where the value of each point is either zero or one
(Ripley 1994).
Remarks
• The parameter a in the sigmoid function determines the steepness of the response.

• In the case of scalar prediction of a time series, the output layer contains only
one unit (or neuron).

• Note that there are also forward–backward as well as recurrent NNs. In recurrent
NNs connections exist between output units.
NNs can have different designs depending on the tasks. For instance, to compare
two multivariate time series xk and zk , k = 1, 2, . . . n, where xk is p-dimensional
and zk is q-dimensional, a three-layer NN can be used. The input layer will then have p units (neurons) and the output layer will have q units. For such a three-layer NN, where the transfer functions of the second and third layers $g_{(2)}(\cdot)$ and $g_{(3)}(\cdot)$ are, respectively, hyperbolic tangent and linear, Cybenko (1989), Hornik et al. (1989) and Hornik (1991) showed that it is possible to approximate any continuous function from $\mathbb{R}^p$ to $\mathbb{R}^q$ if the second layer contains a sufficiently large number of neurons.
The NN is trained by finding the optimal values of the weight and bias parameters
which minimise the error or costfunction. This error function is a measure of the
proximity of the NN output O to the desired target T and can be computed using the squared error $\|\mathbf{O} - \mathbf{T}\|^2$, which is a function of the weights $(w_{ij})$. For example,
when comparing two time series x = {xt , t = 1, . . . , n} and z = {zt , t = 1, . . . , n},
the costfunction takes the form:

$$J = \left\langle \| \mathbf{z} - \mathbf{O}(\mathbf{x}, \boldsymbol{\theta}) \|^2 \right\rangle, \qquad (17.2)$$

where θ represents the set of parameters, i.e. weights and biases, O is the output
from the NN and “< >” is a time-average operator. The parameters are then
required to satisfy

∇θ J = 0. (17.3)

The minimisation is generally carried out using quasi-Newton or conjugate gradient


algorithms, see Appendix E. These types of algorithms involve iteratively changing
the weights by small steps in the direction of −∇θ J . The other and mostly used
alternative is to apply a backpropagation algorithm (Rumelhart et al. 1986; Hertz et
al. 1991), which is similar to a backward integration of the adjoint operator of the
linearised equations in numerical weather prediction.
In the case of prediction via NNs, the predictor variables are used as inputs and
the predictands as outputs. In this case the costfunction of a three-layer model, as in
Fig. 17.1 (bottom panel), is

$$J = \sum_k \left(z_k - z_k^{\circ}\right)^2, \qquad (17.4)$$

where $z_k = \sum_j w_{jk}^{(2)} y_j + b_k^{(2)}$, $y_j = \tanh\left(\sum_i w_{ij}^{(1)} x_i + b_j^{(1)}\right)$, and $z_k^{\circ}$ is the
kth observed value (target). The NN model is normally trained, i.e. its parameters estimated, using a first subset of the data (the training set), while a second subset (the testing subset) is used to assess the forecasts.
Remark Different architectures can yield different networks. Examples of special
networks include convolutional (LeCun et al. 2015) and recurrent (e.g. Haykin
2009) networks. Convolutional networks act directly on matrices or tensors (for
images) overcoming, hence, the difficulty resulting from transforming those struc-
tures into vectors, as is the case in conventional fully connected networks (Watt et
al. 2020). Transforming, for example, images into vectors yields a loss of the spatial
information.
Not all networks are feedforward. There are networks that contain feedback con-
nections. These are the recurrent NNs. Recurrent networks differ from conventional
feedforward NN by the fact that they have at least one feedbackward loop. They can
be seen as multiple copies of the same network and are used to analyse sequential
data such as texts and time series.
Another type of network is support vector machine (SVM) pioneered by Boser
et al. (1992). SVM is basically a learning machine with a feedforward network
having a single hidden layer of nonlinear units. It is used in pattern recognition
and nonlinear regression, through a nonlinear mapping from the input space into
a high-dimensional feature space (see Chap. 13). Within this feature space, the
problem becomes linear and could be solved at least theoretically. For example, in
binary classification the problem boils down to constructing a hyperplane maximally
separating the two classes (Haykin 2009; Bishop 2006).

Fig. 17.4 An example of a linear perceptron showing a feedforward neural network with one input layer, one output layer and no hidden layers

17.2.3 Examples of Architectures

The simplest NN architecture corresponds to a single-layer feedforward network with a single or multiple outputs (Fig. 17.4), where the activation or output $z_j(\mathbf{x})$ of the j th neuron, $j = 1, \ldots, m$, is given by

$$z_j(\mathbf{x}) = g\left(a_0 + \sum_i w_{ij} x_i\right), \qquad (17.5)$$

where $a_0$ is the bias parameter. The term $a_0 + \sum_i w_{ij} x_i$ is the net input to the j th neuron. Intuitively speaking, $a_0 + \sum_i w_{ij} x_i$ represents the total level of voltage exciting the j th neuron, and $z_j(\mathbf{x})$ represents the intensity of the resulting output (the activation level) of the neuron (Werbos 1990). The simplest case is obtained when $g(\cdot)$ is the identity and with one output. In this case we get

$$z(\mathbf{x}) = a_0 + \sum_{i=1}^{d} w_i x_i,$$

known in classification as the discriminant function. In general, however, the


function g() is chosen to be of sigmoid/squashing type, well suited for regression or
classification problems.
Other forms also exist, such as splines or RBFs (Appendix A), when the output takes the form $z(\mathbf{x}) = \sum_{i=1}^{m} w_i \phi_i(\mathbf{x})$, as with support vector machines, mentioned above. The multilayer perceptron (MLP) is an extension of the single layer and

Fig. 17.5 Schematic representation of neural network model with multiple layers

includes one or more hidden layers (Fig. 17.5). An architecture with more than one
hidden layer leads to the nested sigmoid scheme, see e.g. Poggio and Girosi (1990):

$$z_l(\mathbf{x}) = g\left(a_0 + \sum_i w_{il}^{(1)}\, g\left(a_1 + \sum_k w_{ki}^{(2)}\, g\left(\ldots\, g\left(a_p + \sum_\alpha w_\alpha x_\alpha\right)\ldots\right)\right)\right). \qquad (17.6)$$
Note that each layer can have its own sigmoid although, in general, the same sigmoid
is used for most layers. When the transfer function is an RBF φ(.) (Appendix A),
such as the Gaussian probability density function, one obtains an RBF network, see
e.g. chap. 5 of Haykin (2009), with the mapping:

$$z(\mathbf{x}) = \sum_{i=1}^{m} w_i \phi_i(\mathbf{x}), \qquad (17.7)$$

where $\phi_i(\mathbf{x}) = \phi\left(\frac{1}{d_i}\|\mathbf{x} - \mathbf{c}_i\|\right)$. The parameter $\mathbf{c}_i$ is the centre of $\phi_i(\cdot)$ and $d_i$ its scaling factor. In this case the distance between the input $\mathbf{x}$ and the centres (or weights) is used instead of the standard dot product $\sum_i w_{ij} x_i$.
Common choices of RBFs include the Cauchy distribution, multiquadratic and
its inverse, the Gaussian function and thin-plate spline. RBF networks¹ can model any shape of function, and therefore the number of hidden layers can be reduced substantially compared to MLP. RBF networks can also be trained extremely quickly compared to MLP. The learning process is achieved through the training set. If no training set is available, the learning is unsupervised, also referred to sometimes as self-organisation, which is presented later.

¹ There is also another related class of NNs, namely probabilistic NNs, derived from RBF networks, most useful in classification problems. They are based on estimating the pdfs of the different classes (Goodfellow et al. 2016).
Example (Autoregressive Model) A two-layer NN model connected by linear
transfer functions with inputs being the lagged values of a time series, xt , . . . xt−d
and whose output is the best prediction x̂t+1 of xt+1 from previous values reduces
to a simple autoregressive model. The application section below discusses other
examples including nonlinear principal component analysis.
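As a simple illustration of this example, the following Python/NumPy sketch (with synthetic data and arbitrary coefficient values) fits such a linear two-layer network by least squares and recovers the AR coefficients of the generating process.

# Illustrative sketch (not from the text): a two-layer network with linear
# transfer function and lagged values of x(t) as inputs reduces to an AR(d)
# model; training it by least squares gives the AR coefficients.
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 2
# synthetic AR(2) series: x_t = 0.6 x_{t-1} - 0.3 x_{t-2} + noise
x = np.zeros(n)
for t in range(2, n):
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + 0.5 * rng.standard_normal()

# design matrix of lagged inputs (plus a bias column) and target x_{t+1}
X = np.column_stack([x[d - 1:-1], x[d - 2:-2], np.ones(n - d)])
z = x[d:]

# "training" the linear network is ordinary least squares
w, *_ = np.linalg.lstsq(X, z, rcond=None)
print("estimated AR coefficients:", w[:2], " bias:", w[2])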

17.2.4 Learning Procedure in NNs

The process of changing the weights during the minimisation of the costfunction in NNs constitutes the training or learning algorithm. The best known minimisation algorithm in NNs is backpropagation (e.g. Watt et al. 2020). It is based on taking small steps $\Delta w_{ij}$ in the direction of the steepest descent, i.e. following $-\nabla J$, as

$$w_{ij}^{new} = w_{ij}^{old} + \Delta w_{ij} = w_{ij}^{old} - \eta \frac{\partial J}{\partial w_{ij}}.$$

This descent is controlled by the learning rate η. Consider, for example, the
feedforward NN with two (input and output) layers and one hidden layer (Fig. 17.1).
Let x1 , . . . xd be the actual values of the input units (in the input layer), which will
propagate to the hidden layer. The response (or activation) value hl at unit l of the
hidden layer takes the form

$$h_l = g_h\left(\sum_{i=1}^{d} w_{il}^{(1)} x_i + \varepsilon_l^{(1)}\right),$$

where gh (.) is the activation function of the hidden layer. These responses will then
make the inputs to the output layer “o” (Fig. 17.1) so that the kth output $o_k$ takes the form

$$o_k = g_o\left(\sum_{l} w_{lk}^{(2)} h_l + \varepsilon_k^{(2)}\right) = g_o\left(\varepsilon_k^{(2)} + \sum_{l} w_{lk}^{(2)}\, g_h\left(\sum_{i=1}^{d} w_{il}^{(1)} x_i + \varepsilon_l^{(1)}\right)\right),$$

where go (.) is the activation function of the output layer.


Various algorithms exist to minimise the objective (or loss) function. These
include conjugate gradient and quasi-Newton algorithms (Appendix E), in addition

to other algorithms such as simulated annealing, see e.g. Hertz et al. (1991) and
Hsieh and Tang (1998). Because of the nested sigmoid architecture, the conventional
chain rule to compute the gradient can easily become confusing and erroneous
particularly when the network grows complex. The most popular algorithm used for
supervised learning is the (error) backpropagation algorithm (see e.g. Amari 1990).
At its core, backpropagation is an efficient (and exact) way to compute the gradient
of the costfunction in only one pass through the system. Backpropagation is the
equivalent of adjoint method, i.e. backward integration of the adjoint equations, in
variational data assimilation (Daley 1991).
Backpropagation proceeds as follows. Let $y_i^{\alpha}$ denote the activation of the ith unit from layer $\alpha$ (Fig. 17.6), with $\alpha = 1, \ldots, L$ (the values 1 and L correspond, respectively, to the input and output layers). Let also $x_i^{\alpha}$ denote the input to the ith neuron of layer $\alpha + 1$ prior to the sigmoid activation and $w_{ij}^{\alpha}$ the weights between layers $\alpha$ and $\alpha + 1$ (Fig. 17.6); that is,

$$x_i^{\alpha} = \sum_j w_{ji}^{\alpha}\, y_j^{\alpha} \quad \text{and} \quad y_i^{\alpha+1} = g(x_i^{\alpha}). \qquad (17.8)$$

Here we have dropped the bias term for simplicity, and we consider only one
sigmoid function g(.). If there are different sigmoids for different layers, then g(.)
will be replaced by gα (.). The costfunction then takes the form
$$J = \|\mathbf{z}^{\circ} - \mathbf{y}^{L}\|^2 = \sum_{k=1}^{m}\left(z_k^{\circ} - y_k^{L}\right)^2 \qquad (17.9)$$

for one input–output pair, and more generally,


$$J = \sum_{n=1}^{C} \|\mathbf{z}_n^{\circ} - \mathbf{y}_n^{L}\|^2 \qquad (17.10)$$

for more than one input–output pair. Differentiating J , we get

$$\frac{\partial J}{\partial w_{ji}^{\alpha}} = \frac{\partial J}{\partial x_i^{\alpha}}\frac{\partial x_i^{\alpha}}{\partial w_{ji}^{\alpha}} = \frac{\partial J}{\partial x_i^{\alpha}}\, y_j^{\alpha}, \qquad (17.11)$$

and using Eq. (17.8), we get $\frac{\partial J}{\partial x_i^{\alpha}} = \frac{\partial J}{\partial y_i^{\alpha+1}}\, g'(x_i^{\alpha})$, i.e.

$$\frac{\partial J}{\partial w_{ji}^{\alpha}} = \frac{\partial J}{\partial y_i^{\alpha+1}}\, g'(x_i^{\alpha})\, y_j^{\alpha}. \qquad (17.12)$$

For $\alpha = L - 1$, the partial derivative $\frac{\partial J}{\partial y_i^{\alpha+1}}$ in Eq. (17.12) is easily computed for the output layer using Eq. (17.9) or Eq. (17.10). For the $\alpha$th hidden layer, $1 \leq \alpha \leq L-2$, we observe that $y_j^{\alpha+2} = g(x_j^{\alpha+1}) = g\left(\sum_k w_{kj}^{\alpha+1} y_k^{\alpha+1}\right)$, i.e. the term $y_i^{\alpha+1}$ appears in

Fig. 17.6 Illustration of a structure in a feedforward multilayer NN used to explain how backpropagation works (see text for details)

$x_j^{\alpha+1}$ for all j. Therefore,

$$\frac{\partial J}{\partial y_i^{\alpha+1}} = \sum_j \frac{\partial J}{\partial x_j^{\alpha+1}}\frac{\partial x_j^{\alpha+1}}{\partial y_i^{\alpha+1}} = \sum_j \frac{\partial J}{\partial x_j^{\alpha+1}}\, w_{ij}^{\alpha+1}. \qquad (17.13)$$

The term $\frac{\partial J}{\partial x_j^{\alpha+1}}$ is computed as in Eq. (17.12), i.e. $\frac{\partial J}{\partial x_j^{\alpha+1}} = \frac{\partial J}{\partial y_j^{\alpha+2}}\, g'(x_j^{\alpha+1})$, and Eq. (17.13) becomes

$$\frac{\partial J}{\partial y_i^{\alpha+1}} = \sum_j \frac{\partial J}{\partial y_j^{\alpha+2}}\, g'(x_j^{\alpha+1})\, w_{ij}^{\alpha+1} \qquad (17.14)$$

for 1 ≤ α ≤ L − 2. That is, the gradient is computed recursively starting


from the output layer and going backward. Hence in the process of learning of a
feedforward neural network, activation values are propagated forward as signals to
the output layer. The error signal is then computed and propagated backward to
enable computing the gradient of the costfunction. The weights (and consequently
the costfunction) are then updated according to the rule given in the beginning of this
section. This process is repeated until the costfunction falls below some tolerance
level. To avoid overfitting, among other issues, it is desirable to reduce the network size by limiting the connections between neurons, fixing, for example, some weights to zero so that they do not add to the computational burden.
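The following Python/NumPy sketch (illustrative only; the toy regression problem, layer sizes and learning rate are arbitrary choices) implements this forward and backward propagation for a network with a single hidden layer, a logistic sigmoid and a quadratic costfunction, in the spirit of Eqs. (17.11)-(17.14).

# A minimal sketch (not from the text) of backpropagation for a network with
# one hidden layer, quadratic costfunction and logistic sigmoid.
import numpy as np

rng = np.random.default_rng(2)

def g(t):            # logistic sigmoid
    return 1.0 / (1.0 + np.exp(-t))

def g_prime(t):      # derivative g'(t) = g(t)(1 - g(t))
    s = g(t)
    return s * (1.0 - s)

# toy regression problem: approximate z = sin(x1) + 0.5 x2
X = rng.uniform(-2, 2, size=(200, 2))
Z = np.sin(X[:, :1]) + 0.5 * X[:, 1:2]

d, m, q = 2, 8, 1                        # input, hidden and output sizes
W1 = 0.5 * rng.standard_normal((d, m))   # weights input -> hidden
b1 = np.zeros(m)
W2 = 0.5 * rng.standard_normal((m, q))   # weights hidden -> output (linear)
b2 = np.zeros(q)
eta = 0.05                               # learning rate

for it in range(5000):
    # forward pass: activations propagate to the output layer
    a1 = X @ W1 + b1              # net input to the hidden layer
    h = g(a1)                     # hidden activations
    out = h @ W2 + b2             # linear output layer
    # backward pass: the error signal is propagated backward
    delta_out = 2.0 * (out - Z) / len(X)        # dJ/d(out) for J = mean ||z - out||^2
    grad_W2 = h.T @ delta_out
    grad_b2 = delta_out.sum(axis=0)
    delta_h = (delta_out @ W2.T) * g_prime(a1)  # error propagated to the hidden layer
    grad_W1 = X.T @ delta_h
    grad_b1 = delta_h.sum(axis=0)
    # gradient-descent update of the weights
    W1 -= eta * grad_W1; b1 -= eta * grad_b1
    W2 -= eta * grad_W2; b2 -= eta * grad_b2

print("final mean squared error:", np.mean((out - Z) ** 2))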

17.2.5 Costfunctions for Multiclass Classification

The background section illustrates briefly how supervised learning works. We


continue here that discussion on the multiclass classification with a little more focus
on the costfunction to be minimised. The binary (two-class) case is related precisely
to the Bernoulli random variable (Appendix B). The Bernoulli distribution has two
outcomes, e.g. 0 and 1, with respective probabilities q and p = 1 − q, and its pdf is $p^x(1-p)^{1-x}$ (Appendix B). Based on this probabilistic interpretation, the conditional distribution of the target z, given the input $\mathbf{x}$ parametrised by the weights $\mathbf{w}$, is precisely a Bernoulli distribution with pdf $y^z(1-y)^{1-z}$, where y is the output from the perceptron using the sigmoid (logistic) function. Now, given a training set $(\mathbf{x}_1, z_1), \ldots, (\mathbf{x}_n, z_n)$, the likelihood is precisely $\prod_{i=1}^{n} y_i^{z_i}(1-y_i)^{1-z_i}$. The weights are then obtained by maximising the likelihood or minimising minus the log-likelihood, i.e.

$$\mathbf{w} = \operatorname{argmin}_{\mathbf{w}} \left(-\sum_{i=1}^{n}\left[z_i \log y_i + (1 - z_i)\log(1 - y_i)\right]\right).$$

Remark The inverse of the logistic function is known as the logit function. It is used
in generalised linear models (McCullagh and Nelder 1989) in the logistic regression
model, i.e. $\log\frac{y}{1-y} = \mathbf{w}^T\mathbf{x}$, in which y is the binary/Bernoulli response variable,
i.e. the probability of belonging to one of the two classes. The logit is also known as
the link function of this generalised linear model. The above illustration shows the
equivalence between the single-layer perceptron and the logistic regression.
The above Bernoulli distribution can be extended to K outcomes with respective probabilities $p_1, \ldots, p_K$, $\sum_k p_k = 1$. Similarly, for a K-class classification, the likelihood for a given input $\mathbf{x}$ with targets (classes) $z_1, \ldots, z_K$ in $\{0, 1\}$ and outputs $y_k = Pr(z_k = 1|\mathbf{x}, \mathbf{w})$, $k = 1, \ldots, K$, is obtained as a generalisation of the two-class model to yield $\prod_{k=1}^{K} y_k(\mathbf{x}, \mathbf{w})^{z_k}$. With a training set $(\mathbf{x}_i, z_{ik})$, $i = 1, \ldots, n$, $k = 1, \ldots, K$, the K-class cross-entropy takes the form $-\sum_{i}\sum_{k} z_{ik} \log y_k(\mathbf{x}_i, \mathbf{w})$. The K-class logistic function can be obtained based on the Bayes theorem, i.e. $Pr(k|\mathbf{x}) \propto Pr(\mathbf{x}|k)Pr(k)$, where $Pr(k|\mathbf{x})$ is the probability of class k, given $\mathbf{x}$, and hence $y_k$ can be written as $y_k(\mathbf{x}, \mathbf{w}) = \exp(a_k)/\sum_j \exp(a_j)$, referred to as the softmax function (e.g. Bishop 2006).
Remark As outlined in Sect. 17.2.1, for regression problems the activation function
of the output unit is linear (in the weights), and the costfunction can simply be the
quadratic norm of the error.
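A minimal Python/NumPy sketch of the softmax output and the K-class cross-entropy discussed above is given below; the data, weights and targets are random and purely illustrative.

# Illustrative sketch (not from the text) of the softmax output and the
# K-class cross-entropy for a single-layer network with weights W and bias b.
import numpy as np

def softmax(a):
    a = a - a.max(axis=1, keepdims=True)       # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)    # y_k = exp(a_k) / sum_j exp(a_j)

def cross_entropy(Y, Z):
    # Y: predicted class probabilities (n x K); Z: one-hot targets (n x K)
    return -np.sum(Z * np.log(Y + 1e-12)) / len(Z)

rng = np.random.default_rng(3)
n, d, K = 5, 4, 3
X = rng.standard_normal((n, d))
W = rng.standard_normal((d, K))
b = np.zeros(K)
Z = np.eye(K)[rng.integers(0, K, size=n)]      # one-hot targets z_1,...,z_K

Y = softmax(X @ W + b)
print("K-class cross-entropy:", cross_entropy(Y, Z))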

17.3 Self-organising Maps

17.3.1 Background

Broadly speaking, there are two main classes of training networks, namely super-
vised and unsupervised training. The former refers to the case in which a target
output exists for each input pattern (or observation) and the network learns to
produce the required outputs. When the pair of input and output patterns is not
available, and we only have a set of input patterns, then one deals with unsupervised
training. In this case, the network learns to identify the relevant information from
the available training sample. Clustering is a typical example of unsupervised
learning. A particularly interesting type of unsupervised learning is based on what
is known as competitive learning, in which the output network neurons compete
among themselves for activation resulting, through self-organisation, in only one
activated neuron: the winning neuron or the best matching unit (BMU), at any time.
The obtained network is referred to as self-organising map (SOM) (Kohonen 1982,
2001). SOM or a Kohonen network (Rojas 1996) has two layers, the input and output
layers. In SOM, the neurons are positioned at the nodes of a usually low-dimensional
(one- or two-dimensional) lattice. The positions of the neurons follow the principle
of neighbourhood so that neurons dealing with closely related input patterns are
kept close to each other following a meaningful coordinate system (Kohonen 1990;
Haykin 1999, 2009). In this way, SOM maps the input patterns onto the spatial
locations (or coordinates) of the neurons in the low-dimensional lattice. This kind of discrete map is referred to as a topographic map. SOM can be viewed as a nonlinear
generalisation of principal component analysis (Ritter 1995; Haykin 1999, 2009).
SOM is often used for clustering and dimensionality reduction (or mapping) and
also prediction (Vesanto 1997).

17.3.2 SOM Algorithm


Mapping Identification and Kohonen Network

Each input pattern x = (x1 , . . . , xd )T from the usually high-dimensional input


space is mapped onto i(x) in the output space, i.e. neuron lattice (Fig. 17.7, left
panel).
A particularly popular SOM model is the Kohonen network, which is a feedfor-
ward system in which the output layer is organised in rows and columns (for a 2D
lattice). This is shown schematically in Fig. 17.7 (right panel).

Fig. 17.7 Schematic of the SOM mapping from the high-dimensional feature space into the
low-dimensional space (left) and the Kohonen SOM network showing the input layer and the
computational layer (or SOM grid) linked by a set of weights (right)

Training of SOM

Each neuron in the low-dimensional (usually two-dimensional) SOM grid is


associated with a d-dimensional weight vector, also known as prototype, codebook
vector and synaptic weight. The training of SOM is done iteratively. In each step one
sample vector x from the input data is chosen randomly, and the distance $\|\mathbf{x} - \mathbf{w}_j\|$ between x and all the prototype (weight) vectors is computed. The best matching
unit (BMU), or the winning neuron, is identified. So the BMU is the unit whose
weight vector is the closest to input sample x. The weight vectors are then updated
as detailed below, in such a way that the BMU gets closer to the input vector x
and similarly for the BMU neighbours. Figure 17.8 (left panel) shows an example
of updating units. So, both the weight vector of the BMU and its neighbours are
updated, by moving them a little closer to the input data sample (Fig. 17.8, left).
This yields a kind of stretching of the grid around the BMU as the associated weight
vector moves towards the input data vector. In other words, during this training
process, SOM constructs a kind of elastic net, which folds onto the input data clouds,
approximating, hence, the data probability distribution function (Kohonen 2001).
This is comparable somewhat to the non-metric multidimensional scaling (MDS)
presented in Chap. 9 (Sect. 9.4). The goal in non-metric MDS is to preserve the
monotonicity of interpoint distances, whereas the objective here is to preserve the
proximity with no particular concern for monotonicity. The neighbours are defined
based on a neighbourhood function. The neighbourhood function is a kernel-type
function centred at the winner unit, such as a Gaussian or box (bubble) function,
i.e. one around the BMU and zero elsewhere. Discrete neighbourhood function can
also be used, including rectangular (Fig. 17.8, middle) or hexagonal (Fig. 17.8, right)
lattice. These can have different sizes, e.g. 0, 1 and 2, where the innermost polygon
corresponds to the 0-neighbour (only the winner unit is considered), the second
polygon to the 1-neighbour and the largest to the 2-neighbour.
Let $\mathbf{w}_j = (w_{j1}, \ldots, w_{jd})^T$ be the synaptic weight corresponding to the j th, $j = 1, \ldots, M$, output neuron, with d being the input space dimension. The index i(x) of the neuron associated with the image of $\mathbf{x}$ (Fig. 17.7, right panel) is the one maximising the scalar product $\mathbf{x}^T\mathbf{w}_j$ or, equivalently (for normalised weight vectors), minimising the Euclidean distance, i.e.

Fig. 17.8 Left: schematic of the updating process of the best matching unit and its neighbours.
The solid and dashed lines represent, respectively, the topological relationships of the SOM before
and after updating, the units are at the crossings of the lines and the input sample vector is shown
by x. The neighbourhood is shown by the eight closest units in this example. Middle and right:
rectangular and hexagonal lattices, respectively, and the units are shown by the circles. In all panels
the BMU is shown by the blue filled circles

$$i(\mathbf{x}) = \operatorname{argmin}_{j = 1, \ldots, M} \|\mathbf{x} - \mathbf{w}_j\|. \qquad (17.15)$$

Equation (17.15) summarises the competitive process among the M output neurons
in which i(x) is the winning neuron or BMU (Fig. 17.8, left). SOM has a number of
components that set up the SOM algorithm. Next to the competitive process, there
are two more processes, namely co-operative and adaptive processes, as detailed
below.
The co-operative process concerns the neighbourhood structure. The winning
neuron determines a topological neighbourhood since a firing neuron tends to excite
nearby neurons more than those further away. A topology is then defined in the SOM
lattice, reflecting the lateral interaction among a set of excited neurons (Fig. 17.8).
Let dkl denote the lateral distance between neurons k and l on the SOM grid. This
then allows to define the topological neighbourhood hj,i(x) . As is mentioned above,
various neighbourhood functions can be used. A typical choice for this topology or
neighbourhood is given by the Gaussian function:

$$h_{j,i(\mathbf{x})} = \exp\left(-\frac{d_{j,i(\mathbf{x})}^2}{2\sigma^2}\right), \qquad (17.16)$$

where $d_{j,i(\mathbf{x})} = \|\mathbf{r}_j - \mathbf{r}_{i(\mathbf{x})}\|$ denotes the lateral distance on the SOM grid between, respectively, the winning and excited neurons i(x) and j, and $\mathbf{r}_j$ and $\mathbf{r}_{i(\mathbf{x})}$ are the positions of neurons j and i(x) on the same grid. A characteristic feature of SOM is that the neighbourhood size shrinks with time (or iteration). A typical choice of
is that the neighbourhood size shrinks with time (or iteration). A typical choice of
this shrinking is given by

 
$$\sigma(n) = \sigma_0 \exp\left(-\frac{n}{\tau_0}\right), \qquad (17.17)$$

where n is the iteration time step, σ0 is an initial value of σ and τ0 is a constant.


Note also that, in some algorithms, the left hand side of Eq. (17.16) is multiplied by
a factor α(n), which is a linearly and slowly decreasing function of the iteration n,
such as α(n) = 0.95(1 − n/1000) when using 1000 iterations, in order to accelerate
the convergence.
Remark It is interesting to note that if the neighbourhood function is a delta function
at the BMU, i.e. hj,i(x) = δi(x) (1 at the BMU and 0 elsewhere), then SOM simply
reduces to the k-means clustering algorithm (Moody and Darken 1989). So, k-means
is a simple particular case of SOM.
The adaptive process regards the way of learning process leading to the self-
organisation of the outputs, and the feature map is formed. The topographic
neighbourhood mirrors the fact that the weights of the winning and also neighbour-
ing neurons get adapted, though not by an equal amount. In practice, the weight
update is given by

$$\Delta\mathbf{w}_j = \mathbf{w}_j(n+1) - \mathbf{w}_j(n) = \eta(n)\, h_{j,i(\mathbf{x})}\left(\mathbf{x} - \mathbf{w}_j(n)\right), \qquad (17.18)$$

which is applied to all neurons in the lattice within the neighbourhood of the winning neuron (Fig. 17.8). The discrete time learning rate is given by $\eta(n) = \eta_0 \exp(-n/\tau_1)$.
The values η0 = 0.1 and τ1 = 1000 are typical examples that can be used in
practice (Haykin 1999). Also, in Eq. (17.17), the value τ0 = 1000/ log(σ0 ) can be
adapted in practice (Haykin 1999). The neighbourhood of neurons can be defined by
a radius within the 2D topological map of neurons. This neighbourhood decreases
monotonically with iterations. A typical initial value of this neighbourhood radius could be of the order $O(\sqrt{N})$, e.g. $\sqrt{N}/2$, for a sample size N. The size of the topological map, that is, the number of neurons M, can be learned from experience, but typical values can be of the order $O(\sqrt{N})$. For example, for a two-dimensional SOM, the grid can have a total number of, say, $5\sqrt{N}$ neurons, e.g. (Vesanto and Alhoniemi 2000).
Remark The SOM error can be computed based on the input data, weight vectors and neighbourhood function. For a fixed neighbourhood function, the SOM error function is $E_{SOM} = \sum_{t=1}^{N}\sum_{j=1}^{M} h_{j,i(\mathbf{x}_t)}\|\mathbf{x}_t - \mathbf{w}_j\|^2$, where $h_{j,i(\mathbf{x}_t)}$ represents the neighbourhood function centred at unit $i(\mathbf{x}_t)$, i.e. the BMU of input vector $\mathbf{x}_t$.
In order not to end in a metastable state, the adaptive process has two identifiable
phases, namely the ordering or self-organising and the convergence phases. During
the former phase, the topological ordering of the weight vectors takes place within
around 1000 iterations of the SOM algorithm. In this context, the choice of η0 =
0.1 and τ1 = 1000 is satisfactory, and the previous choice of τ0 in Eq. (17.17)
can also be adopted. During the convergence phase, the feature map is fine tuned
providing, hence, an accurate statistical quantification of the input patterns, which

takes a number of iterations, around 500 times the number of neurons in the network
(Haykin 1999).

Summary of SOM Algorithm

The different steps of SOM algorithm can be summarised as follows:


• (1) Initialisation—Randomly choose initial values of the weights wj (0), j =
1, . . . M, associated with the M neurons in the lattice. Note that these weights
can also be chosen from the input data xt , t = 1, . . . N, where N is the sample
size.
• (2) Sampling—Draw a sample input vector x from the input space.
• (3) Neuron identification—Find the winning neuron i(x) based on Eq. (17.15).
• (4) Updating—Update the synaptic weights wj , j = 1, . . . , M, following
Eq. (17.18).
• (5) Continuation—Go to step 2 above until the feature map stops changing
significantly.
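The following compact Python/NumPy sketch (illustrative only) implements the above steps for a small two-dimensional lattice, using the Gaussian neighbourhood of Eq. (17.16), the shrinking width of Eq. (17.17) and the update rule of Eq. (17.18); the grid size, decay constants and toy input data are arbitrary choices.

# A compact sketch (not from the text) of the SOM algorithm summarised above.
import numpy as np

rng = np.random.default_rng(4)

# toy input data: N points in d = 2 dimensions, drawn from two clouds
N, d = 400, 2
X = np.vstack([rng.normal([-2, 0], 0.5, (N // 2, d)),
               rng.normal([+2, 1], 0.5, (N // 2, d))])

# a small 6 x 6 SOM lattice (M = 36 neurons)
nrow, ncol = 6, 6
grid = np.array([[i, j] for i in range(nrow) for j in range(ncol)], float)
M = len(grid)
W = X[rng.integers(0, N, M)].copy()      # (1) initialise weights from the data

n_iter, sigma0, eta0 = 3000, 3.0, 0.1
tau0 = n_iter / np.log(sigma0)
tau1 = float(n_iter)

for n in range(n_iter):
    x = X[rng.integers(0, N)]                        # (2) sampling
    bmu = np.argmin(np.linalg.norm(W - x, axis=1))   # (3) winning neuron, Eq. (17.15)
    sigma = sigma0 * np.exp(-n / tau0)               # shrinking width, Eq. (17.17)
    eta = eta0 * np.exp(-n / tau1)                   # decaying learning rate
    d2 = np.sum((grid - grid[bmu]) ** 2, axis=1)     # squared lateral distances
    h = np.exp(-d2 / (2.0 * sigma ** 2))             # Gaussian neighbourhood, Eq. (17.16)
    W += eta * h[:, None] * (x - W)                  # (4) update, Eq. (17.18)

print("trained prototype vectors (first five):\n", W[:5])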

17.4 Random Forest

In machine learning, supervised learning refers to applications in which the training


set is composed of pairs of input data and their corresponding target output.
Classification, in which the target (or class) variable takes a finite number of discrete categories and the classes of the training data are known, is a good example. When the target data are composed of continuous variables, one gets regression.
Random forest (RF) is a supervised learning algorithm for classification and
regression problems (Breiman 2001), based on what is known as decision trees,
which are briefly described below.

17.4.1 Decision Trees


What Are They?

Decision trees are the building blocks of random forests. They aim, based on a
training set, at predicting the output of any data from the input space. A decision
tree is based on a sequence of binary selections and looks like a (reversed) tree
(Bishop 2006). Precisely, the input sample is (sequentially) split into two or more
homogeneous sets based on the main features of the input variables. The following
simple example illustrates the basic concept. Consider the set {−5, −3, 1, 2, 6},

Fig. 17.9 Decision tree of the simple dataset shown in the top box

which is to be separated using the main features (or variables), namely sign (+/−)
and parity (even/odd). Starting with the sign, the binary splitting yields the two sets
{1, 2, 6} and {−3, −5}. The last set is homogeneous, i.e. a class of negative odd numbers.
The other set is not, and we use the parity feature to yield {1} and {2, 6}. This can be
summarised by a decision tree shown in Fig. 17.9. Each decision tree is composed
of:
• root node—containing the entire sample,
• splitting node,
• decision node—a subnode that splits into further subnodes,
• terminal node or leaf—a node that cannot split further,
• branch—a subtree of the whole tree and
• parent and child nodes.
In the above illustrative example, the root node is the whole sample. The set {1, 2, 6},
for example, is a decision node, whereas {2, 6} is a terminal node (or leaf). Also, the
last subset, i.e. {2, 6}, is a child node of the parent node {1, 2, 6}. The splitting rule
in the above example is quite simple, but for real problems more criteria are used
depending on the type of problem at hand, namely regression or classification, which
are discussed next.

Classification and Regression Trees

As is mentioned above, decision trees attempt to partition the input (or pre-
dictor) space using a sequence of binary splitting. This splitting is chosen to
optimise a splitting criterion, which depends on the nature of the predictor, e.g.
discrete/categorical versus continuous, and the type of problem at hand, e.g.
classification versus regression. In the remaining part of this chapter we assume
that our training set is composed of n observations (x1 , y1 ), . . . , (xn , yn ), where,
for each k, $k = 1, \ldots, n$, $\mathbf{x}_k = (x_{k1}, \ldots, x_{kd})^T$ contains the d variables (or
features), and yk is the response variable. The splitting proceeds recursively with
each unsplit node by looking for the “best” binary split.
Case of Categorical Predictors
We discuss here the case of categorical predictors. A given unsplit (parent) node,
containing a subset x1 , . . . , xm of the training set, is split into two child nodes: a
left (L) node and a right (R) node. For regression, a common criterion F is the mean
square residual or variance of the response variable. For this subsample in the parent
node, the costfunction (variance) is

$$F = \frac{1}{m}\sum_{i=1}^{m}(y_i - \overline{y})^2, \qquad (17.19)$$

where $\overline{y}$ is the average response of this subset. Now, if $F_L$ and $F_R$ are the
corresponding variance for the (left and right) child nodes, then the best split
is based on the variable (or feature) yielding maximum variance reduction, i.e.
$\max|F - F_{LR}|$, or equivalently minimising $F_{LR}$, i.e. $\min(F_{LR})$, where $F_{LR} = m^{-1}(m_L F_L + m_R F_R)$, with $m_L$ and $m_R$ being the sample sizes of the left and right
(child) nodes, respectively. Note again that to obtain the best result, every possible
split based on every variable (among the d variables) is considered.
The splitting criterion for classification, with, say, K classes, is different. Let $p_k$ designate the proportion of the kth class in a given (unsplit) node, i.e. $p_k = m^{-1}\sum_{i=1}^{m} 1_{\{y_i = k\}}$, where $1_A$ is the indicator of set A, i.e. 1 if A is true and 0 otherwise. Then the most commonly used criterion is the Gini index (e.g. Cutler et al. 2012):

$$F = \sum_{k \neq k'} p_k p_{k'} = 1 - \sum_{k=1}^{K} p_k^2, \qquad (17.20)$$

and the “best” split corresponds again to the one maximising |F − FLR | or
minimising FLR .
Case of Continuous Predictors
In the case of continuous predictor variables, the splitting involves sorting the values
of the predictor subset in the node of interest and considering all splits between
consecutive pairs (Cutler et al. 2012). Precisely, denote again by x1 , . . . , xm the
continuous d-dimensional subset (of the training set) in the node at hand. For each
variable j , j = 1, . . . , d, the observations are sorted as x1,j ≤ x2,j ≤ . . . ≤ xm,j ,
and a set of thresholds $\theta_{0,j}, \theta_{1,j}, \ldots, \theta_{m,j}$ is chosen, with $\theta_{0,j} \leq x_{1,j}$, $x_{m,j} \leq \theta_{m,j}$ and $x_{t,j} \leq \theta_{t,j} \leq x_{t+1,j}$ for $t = 1, \ldots, m-1$. Note that any choice of the θ parameters satisfying the above inequalities yields the same results. For example, a common choice (after sorting) corresponds to the midpoint of any two consecutive values, i.e. $\theta_{t,j} = 0.5(x_{t,j} + x_{t+1,j})$ for $t = 1, \ldots, m-1$. Next, for each t and j, $t = 0, \ldots, m$, and $j = 1, \ldots, d$, a binary split is defined based on the condition
$1_{\{x_{\cdot j} \leq \theta_{t,j}\}}$, where $x_{\cdot j}$ stands for the j th variable of any observation. In other words, for any choice of θ (among the $(m+1)d$ choices of $\theta_{t,j}$) and any choice of variable, a split is performed on the node under consideration. This requires around $O(m^2 d)$
operations. For each of these splits, the costfunction FLR is computed, and the best
split is again chosen based on minimising FLR . Using a fast algorithm, the number
of operations can be reduced to O(dm log(m)). We remind again that each split is
based on one variable or feature.
In summary, the above steps can be summarised in the following algorithm
(e.g. Cutler et al. 2012). Given a training set (x1 , y1 ), . . . , (xn , yn ), where xt =
(xt1 , . . . , xtd )T and yt , t = 1, . . . n, denote, respectively, the d predictors and
associated responses, the decision tree algorithm proceeds as follows:
(1) Starting—Begin with the training set in the root node.
(2) Tree construction—Each unsplit node is split into two child nodes based on the
best split, as detailed above by finding the predictor variable among 1, . . . d,
which minimises FLR .
(3) Prediction—When the tree is obtained, a new input variable x (not from the
training set) is passed through the tree until it falls in a terminal node (or
leaf), denoted l, with responses yl1 , . . . , ylp (obtained from the training set).
The prediction for $\mathbf{x}$ is then given by $y_x = p^{-1}\sum_{t=1}^{p} y_{l_t}$ for regression and $y_x = \operatorname{argmax}_y \sum_{t=1}^{p} 1_{\{y_{l_t} = y\}}$ for classification, where y refers to any one class.
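As an illustration of the splitting step, the following Python/NumPy sketch (not taken from the references above) searches for the best binary split of a toy classification sample by trying every variable and every midpoint threshold and minimising the weighted child Gini index of Eq. (17.20).

# Illustrative sketch (not from the text) of a single binary split in a
# classification tree: every variable and every midpoint threshold is tried,
# and the split minimising the weighted Gini index F_LR is kept.
import numpy as np

def gini(y):
    # Gini index 1 - sum_k p_k^2 of the class labels in y
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # return (variable index, threshold, F_LR) minimising the weighted child Gini
    m, d = X.shape
    best = (None, None, np.inf)
    for j in range(d):                          # every predictor variable
        xs = np.sort(np.unique(X[:, j]))
        thresholds = 0.5 * (xs[:-1] + xs[1:])   # midpoints of consecutive values
        for theta in thresholds:
            left = X[:, j] <= theta
            mL, mR = left.sum(), (~left).sum()
            F_LR = (mL * gini(y[left]) + mR * gini(y[~left])) / m
            if F_LR < best[2]:
                best = (j, theta, F_LR)
    return best

# toy example with two features and two classes
rng = np.random.default_rng(5)
X = rng.uniform(0, 1, size=(50, 2))
y = (X[:, 0] > 0.4).astype(int)                 # class depends on the first variable
j, theta, F_LR = best_split(X, y)
print(f"best split: variable {j}, threshold {theta:.3f}, weighted Gini {F_LR:.3f}")

A full tree is obtained by applying this split recursively to each child node until the leaves are homogeneous or some stopping rule is met.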
Remarks
• Besides the two optimisation criteria mentioned above, there are a number of
other criteria that are used in decision trees. These include chi-square, entropy
and gain ratio.
• Decision trees have a number of advantages but also disadvantages. The follow-
ing are the main advantages:
– it is a non-parametric method;
– it is easy to understand and implement;
– it can use any data type (categorical or continuous).
• Although decision trees can be applied to continuous variables, they are not ideal
because of categorisation. But the main disadvantage is that of overfitting, related
to the fact that the tree can grow unnecessarily—e.g. ending with one leaf for
each single observation.
One way to overcome the main downside mentioned above, i.e. overfitting, is to
apply what is known as pruning. Pruning consists of trimming off unnecessary
branches starting from the leaf nodes in such a way that accuracy is not much
reduced. This can be achieved, for example, by dividing the training set into
two subsets, fitting the (trimmed) tree with one subset and validating it with the
other subset. The best trimmed tree corresponds to optimising the accuracy of the
validation set. There is another more attractive way to overcome overfitting, namely
random forest discussed next.

17.4.2 Random Forest: Definition and Algorithm

Random forest is an algorithm based on an ensemble of a large number of individual decision trees, aimed at obtaining better prediction skill. The random element in the
random forest comes from the bootstrap sampling, which is based on two key
rules:
• Tree building is based on random subsamples of the training set.
• Splitting nodes is based on random subsets of the variables (or features).
Note that no pruning is done in random forest. The final outcome is then obtained
by averaging the predictions of the individual trees from continuous response (for
regression) or by combined voting for categorical response (for classification). The
algorithm for RF is given below:
Random Forest Algorithm
(1) Begin with the training set (x1 , y1 ), . . . , (xn , yn ), and decide a number I for the
bootstrap samples and a number N of predictor variables to be chosen among
the original d variables.
(2) Set i = 1.
    Draw a bootstrap subsample $B_i$ and place it in a root node.
    • Randomly select N predictors.
    • Identify the best binary split based on the N selected variables, and split the parent node accordingly.
    • Continue until the ith decision tree is obtained.
    • If i < I, set i = i + 1 and go to (2); otherwise go to (3).

(3) For a new data point $\mathbf{x}$, the prediction is given by $\hat{y}_x = I^{-1}\sum_{i=1}^{I} y_{x,i}$ for regression and $\hat{y}_x = \operatorname{argmax}_y \sum_{i=1}^{I} 1_{\{y_{x,i} = y\}}$ for classification, where as before $y_{x,i}$ is the predicted response at $\mathbf{x}$ from the ith tree.
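In practice, random forests are rarely coded from scratch. The following short Python sketch uses the scikit-learn library, assuming it is available; the sample data and parameter values are purely illustrative. The forest is grown from bootstrap samples with a random subset of about √d features tried at each split, and the out-of-bag data (see Sect. 17.4.3 below) provide an internal estimate of the prediction accuracy.

# A short sketch using scikit-learn (assumed available); parameters are
# illustrative choices, not prescriptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(6)
n, d = 300, 10
X = rng.standard_normal((n, d))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.standard_normal(n) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=500,      # I: number of trees
                            max_features="sqrt",   # N: variables tried per split
                            bootstrap=True,
                            oob_score=True,        # keep out-of-bag estimates
                            random_state=0)
rf.fit(X, y)
print("out-of-bag accuracy:", rf.oob_score_)
print("predicted classes for two new points:", rf.predict(rng.standard_normal((2, d))))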

17.4.3 Out-of-Bag Data, Generalisation Error and Tuning


Out-of-Bag Data and Generalisation Error

A convenient and robust way to get an independent estimate of the prediction error
is to use, for each tree in the random forest, the observations that do not get selected
in the bootstrap; a tool similar to cross-validation. These data are referred to as out-
of-bag (oob) data. It is precisely these data that are used to estimate the performance of the algorithm, and this works as follows. For each input data pair $(\mathbf{x}_t, y_t)$ from the training set, find the set $T_t$ of those ($N_t$) trees that did not include this observation. Then, the oob prediction at $\mathbf{x}_t$ is given by $y_{oob,t}$, i.e. $N_t^{-1}\sum_{i \in T_t} y_{\mathbf{x}_t,i}$ for regression and $\operatorname{argmax}_y \sum_{i \in T_t} 1_{\{y_{\mathbf{x}_t,i} = y\}}$ for classification, where $y_{\mathbf{x}_t,i}$ is the prediction of $\mathbf{x}_t$ based on the ith tree of the random forest. These predictions are then used to calculate the oob error, $\varepsilon_{oob}$, given by (e.g. Cutler et al. 2012) $\varepsilon_{oob} = n^{-1}\sum_{t=1}^{n}(y_t - y_{oob,t})^2$ for regression or $n^{-1}\sum_{t=1}^{n} 1_{\{y_t \neq y_{oob,t}\}}$ for classification.

Parameter Selection

To get the best out of the random forest, experience shows that the number of trees in
the forest can be chosen to be large enough (Breiman 2001), e.g. several hundreds.
Typical values of the selected number of variables N in the random forest depend on the type of the problem at hand. For classification, a standard value of N is $\sqrt{d}$, whereas for regression it is of the order $d/3$.
Remarks Random forest algorithm has a number of advantages inherited from
decision trees. The main advantages are accuracy, robustness (to outliers) in addition
to being easy to use and reasonably fast. The algorithm can also handle missing
values in the predictors and scale well with large samples (Hastie et al. 2009).
The algorithm, however, has a few main disadvantages. Random forest tends to be
difficult to interpret, in addition to being not very good at capturing relationships
involving linear combinations of predictor variables (Cutler et al. 2012).

17.5 Application

Machine learning has been applied in weather and climate since the late 90s, with
an application of neural network to meteorological data analysis (e.g. Hsieh and
Tang 1998). The interest in machine learning applications in weather and climate has grown recently in academia and also in weather centres, see e.g. Scher
(2020). The application spans a wide range of topics ranging from exploration to
weather/climate prediction. This section discusses some examples, and for more
examples and details, the reader is referred to the references provided.

17.5.1 Neural Network Application


NN Nonlinear PCs

Nonlinear EOF, or nonlinear principal component analysis, can take various forms.
We have already presented some of the methods in the previous chapters, such as

Fig. 17.10 Schematic representation of the NN design for nonlinear PCA. Adapted from Hsieh
(2001b)

independent PCA, PP and kernel EOFs. Another class of nonlinear PCA has also
been presented, which originates from the field of artificial intelligence. These are
based on neural networks (Oja 1982; Diamantaras and Kung 1996; Kramer 1991;
Bishop 1995) and have also been applied to climate data (Hsieh 2001a,b; Hsieh
and Tang 1998; Monahan 2000). As is discussed in Sect. 17.2, the neural network
model is based on linking an input layer containing the input data to an output
layer using some sort of “neurons” and various intermediate layers. The textbook of
Hsieh (2009) provides a detailed account of the application of NN to nonlinear PC
analysis, which is briefly discussed below.
The NN model used by Hsieh and Tang (1998) and Monahan (2000) to construct
nonlinear PCs contains five layers, three intermediate (or hidden), one input and one
output layers. A schematic representation of the model is shown in Fig. 17.10.
A nonlinear function maps the high-dimensional state vector $\mathbf{x} = (x_1, \ldots, x_p)^T$ onto a low-dimensional state vector u (one-dimensional in this case). Then, another nonlinear transformation maps u onto the state space vector $\mathbf{z} = (z_1, \ldots, z_p)^T$ from the original p-dimensional space. This mapping is achieved by minimising the costfunction $J = \langle\|\mathbf{x} - \mathbf{z}\|^2\rangle$. These transformations (or weighting functions) are

$$\begin{aligned}
\mathbf{f}(\mathbf{x}) &= \mathbf{f}_1\left(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1\right), \\
u &= f_2\left(\mathbf{W}_2\mathbf{f}(\mathbf{x}) + b_2\right), \\
\mathbf{g}(u) &= \mathbf{f}_3\left(\mathbf{w}_3 u + \mathbf{b}_3\right), \\
\mathbf{z} &= \mathbf{f}_4\left(\mathbf{W}_4\mathbf{g}(u) + \mathbf{b}_4\right),
\end{aligned} \qquad (17.21)$$

where f1 (.) and f3 (.) are m-dimensional functions. Also, W1 and W4 are m × p and
p × m weight matrices, respectively. The objective of NN nonlinear PCA is to find
the scalar function u(t) = F (x(t)) that minimises J . Note that if F (.) is linear, i.e.
F (x) = aT x, then u(t) corresponds to the conventional PC time series. Monahan


(2000, 2001) and Hsieh (2001a,b) have applied this NN model to find nonlinear PCs
from atmospheric data. They chose f1 (.) and f3 (.) to be hyperbolic tangent, whereas
f2 (.) and f4 (.) correspond to the identity function (in one- and p-dimensional spaces,
respectively). During the minimisation procedure, u was normalised by choosing $\langle u^2\rangle = 1$, and the costfunction was taken to be $\langle\|\mathbf{x} - \mathbf{z}\|^2\rangle + \left(\langle u^2\rangle - 1\right)^2$. Hsieh (2001a,b) penalised the costfunction by the norm of the matrix $\mathbf{W}_1$ as

$$J = \langle\|\mathbf{x} - \mathbf{z}\|^2\rangle + \left(\langle u^2\rangle - 1\right)^2 + \lambda\sum_{ij}\left(w_{ij}^{(1)}\right)^2, \qquad (17.22)$$

in order to stabilise the algorithm through smoothing of J (see Appendix A). An


account of the problems encountered in the application of NNs to meteorological
data is given in Hsieh and Tang (1998).
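A rough sketch of the idea, though not the exact architecture of Eqs. (17.21)-(17.22), can be obtained with a bottleneck autoencoder; the following Python example uses scikit-learn's MLPRegressor, assuming it is available. Unlike the model above, the same tanh activation is also applied at the bottleneck, and no normalisation of u or weight penalty is included; the toy data, layer sizes and iteration numbers are arbitrary illustrative choices.

# A rough sketch (not the Hsieh/Monahan model) of nonlinear PCA as a
# bottleneck autoencoder using scikit-learn, assumed available.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(7)

# toy data lying close to a one-dimensional curve in 3-D space
t = rng.uniform(-1, 1, 500)
X = np.column_stack([t, t ** 2, np.sin(np.pi * t)]) + 0.05 * rng.standard_normal((500, 3))

# encoder (3 -> 8 -> 1) and decoder (1 -> 8 -> 3) trained to reconstruct X
ae = MLPRegressor(hidden_layer_sizes=(8, 1, 8), activation="tanh",
                  solver="adam", max_iter=5000, random_state=0)
ae.fit(X, X)

# manual forward pass through the first two layers to read off the
# one-dimensional nonlinear PC u at the bottleneck
h = np.tanh(X @ ae.coefs_[0] + ae.intercepts_[0])
u = np.tanh(h @ ae.coefs_[1] + ae.intercepts_[1]).ravel()
print("reconstruction error:", np.mean((ae.predict(X) - X) ** 2))
print("first five nonlinear PC values:", u[:5])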
Hsieh and Tang (1998), and also Hsieh (2001a,b) applied the NN nonlinear PC
analysis to the Lorenz model and used it to analyse and forecast tropical Pacific
SST. In particular, Monahan (2000) found that the leading two nonlinear PCs of
the Lorenz model capture about 99.5% of the total variance compared to 95% from
conventional EOFs. Monahan et al. (2000) applied NN to compute the nonlinear
PCs of the northern hemisphere SLP from a long integration of the Canadian
climate model. They identified, in particular, a bimodal behaviour from the control
simulation (Fig. 17.11); the corresponding SLP patterns are shown in Fig. 17.12.
This bimodal behaviour disappeared in the climate change experiment (Fig. 17.13).

Application to Weather Forecasting

Weather and climate prediction is another topic that attracted the interest of climate
scientists. Neural networks can in principle approximate any nonlinear function
(Nielsen 2015; Goodfellow et al. 2016) and can be used to approximate the
nonlinear relationships involved in weather forecasting. An example was presented
by Scher (2018) to approximate a simple GCM using a convolutional neural
network. This example was used as a proof of concept to show that it is possible
to consider NN to learn the time evolution of atmospheric fields and hence provide
a potential for weather prediction. Convolutional NN was also used to study
precipitation predictability on regional scales and discharge extremes by Knighton
et al. (2019). Another example was presented by Subashini et al. (2019) to forecast
weather variables using data collected from the National Climatic Data Centre. They
used a recurrent NN, based on a long short-term memory (LSTM) algorithm, to
forecast wind, temperature and cloud cover.
Weyn et al. (2019) developed an elementary deep learning NN to forecast a few
meteorological fields. Their model was based on a convolutional NN architecture
and was used to forecast 500-hPa geopotential height for up to 14 days lead
time. The model was found to outperform persistence, climatology and barotropic
Fig. 17.11 Nonlinear PCs from the control simulation of the Canadian climate model showing the
temporal variability along the NN PC1 with its PDF (top) and the nonlinear PC approximation of
the data projected onto the leading two PCs (bold curve, bottom) along with the PDFs associated
with the two branches. Adapted from Monahan et al. (2000)

Fig. 17.12 Composite SLP anomaly maps associated with the three representative points on the
nonlinear PC shown in Fig. 17.11. Adapted from Monahan et al. (2000)

vorticity models, in terms of root mean square errors (RMSEs) at forecast lead time
of 3 days. The model does not, however, beat an operational weather forecasting
system and climate forecasting system (CFS), as expected, as the latter contains
full physics. Weyn et al. (2019) found, however, that the model does a good job of
forecasting realistic states at a lead time of 14 days and captures reasonably well the
Fig. 17.13 As in Fig. 17.11, but for the climate change simulation. Adapted from Monahan
et al. (2000)

500-hPa climatology and annual variability. Figure 17.14 shows the RMSE of 500-
hPa heights for up to 3 days lead time of different convolutional NN architectures
compared to the other models. An example of 24-hr 500-hPa forecast is shown in
Fig. 17.15. The main features of the field are very well captured.
Another important topic in weather prediction is forecasting uncertainty. Forecast
uncertainty is an integral component of weather (and climate) prediction, which
is used by the end users for planning and design. In weather forecasting centres,
forecast uncertainty is usually obtained using a computationally expensive ensemble
of numerical weather predictions. A number of authors have proposed machine
learning as an alternative to ensemble methods. An example where this is important
is tropical cyclone (Richman and Leslie 2010, Richman et al. 2017) and Typhoon
(Haghroosta 2019) forecast. For example, Richman and Leslie (2012) used machine
learning approaches, based on support vector regression (a subclass of support
vector machine), to provide seasonal prediction of tropical cyclone frequency and
intensity over Australia. We remind that the support vector regression is a special
architecture of neural net with two layers, an input and an output layer, and
where each input observation is mapped into a high-dimensional feature space.
As mentioned above, the architecture belongs to the class of radial basis function
Fig. 17.14 RMSE forecast error of 500-hPa height over the test period 2007–2009, obtained from
different models: persistence, climatology, barotropic vorticity, the operational climate forecast
system (CFS) and the different convolutional NN architectures. Adapted from Weyn et al. (2019)

Fig. 17.15 An example of a 24-hr 500-hPa height forecast at 0 UTC 13 April 2007, based on
the NN architectures (bottom) compared to the barotropic (c) and the CFS (d). Coloured shading
shows the difference between forecasts and the verification (b) in dekameter. (a) initial state, (e-h)
forecasts from LSTM neural network. Adapted from Weyn et al. (2019)

networks (Haykin 2009, Chap. 5), in which the mapping is based on nonlinear
radial basis functions (Appendix A). The authors obtained high values of R 2 of the
order 0.56 compared to 0.18 obtained with conventional multiple linear regression.
Fig. 17.16 Schematic illustration of the convolutional NN used by Scher and Messori (2019)
to predict weather forecast uncertainty. The network is fed with gridded atmospheric fields and
generates a scalar representing the uncertainty forecast (Source: Modified from Scher and Messori
(2019))

Richman et al. (2017) used the same machine learning architecture to reduce tropical
cyclone prediction error in the North Atlantic regions. A review of the application
of machine learning to tropical cyclone forecast can be found in Chen et al. (2020).
Scher and Messori (2019) proposed machine learning to predict weather forecast
uncertainty. They considered a convolutional NN (Krizhevsky et al. 2012; LeCun
et al. 2015) trained on past weather forecasts. As is discussed above, convolutional
NN is not fully connected, characterised by local (i.e. not full) connections and also
weight sharing (i.e. sharing similar weights), and involves convolution operation
and hence is faster than fully connected nets. In this network, training is done
with the forecast errors and the ensemble spread of forecasts. An uncertainty is
then predicted, given an initial forecast field (Scher 2020). Figure 17.16 shows a
schematic illustration of the approach. They suggest that while the obtained skill is
lower than that of ensemble methods, the network-based method is computationally
very efficient and hence offers the possibility to be explored.

17.5.2 SOM Application

SOM has been applied since the early 90s and is still being applied, in atmospheric
science and oceanography to reduce the dimensionality of the system and identify
patterns and clustering (e.g. Hewitson and Crane 1994; Ambroise et al. 2000; Liu et
al. 2006; Liu and Weisberg 2005; Cassano et al. 2015; Huva et al. 2015; Gibson et al.
2017; Meza–Padilla 2019). This application spans a wide range of topics including
synoptic climatology, cloud classification, weather/climate extremes, downscaling
and climate change. SOM has also been suggested to be used in time series
prediction (Vesanto 1997). SOM is particularly convenient and quite useful in
synoptic weather categorisation (Sheridan and Lee 2010; Hewitson and Crane
17.5 Application 445

2002). The obtained weather types are then used to study the relationship between
large scale teleconnections and local surface climate variables such as surface
temperature or precipitation. Surface weather maps and mid-tropospheric fields
have been used by a number of authors to study changes in atmospheric synoptic
circulations and their relation to precipitation (e.g. Hewitson and Crane 2002). The
identification of synoptic patterns from reanalysis as well as climate models was also
performed via SOM by Schuenemann et al. (2009) and Schuenemann and Cassano
(2010). Mass and moisture fields were also used to study North American monsoon
(Cavazos et al. 2002), precipitation downscaling (Ohba et al. 2016), Antarctic
climate (Reusch et al. 2005) and local Mediterranean climate (Khedairia and Khadir
2008), see e.g. Huth et al. (2008) for a review of SOM application to synoptic
analysis.
Clouds are known to have complex features and constitute a convenient test-bed
for SOM application and feature extraction (Ambroise et al. 2000; McDonald et
al. 2016). For example, McDonald et al. (2016) show that SOM analysis enables
identification of a wide range of cloud clusters representative of low-level cloud
states, which are related to geographical position. They also suggest that SOM
enables an objective identification of the different cloud regimes. SOM has also
been applied to study weather and climate extremes and climate change (Gibson
et al. 2016; Horton et al. 2015; Gibson et al. 2017; Cassano et al. 2016). These
studies show that SOM can reveal correspondence between changes in the frequency
of geopotential height patterns and temperature and rainfall extremes (Horton et
al. 2015; Cassano et al. 2015, 2016). Gibson et al. (2017) found that synoptic
circulation patterns are well represented during heat waves in Australia but also
highlight the importance of critically assessing the SOM features.
SOM was also explored in oceanography to explore and analyse SST and sea
surface height (Leloup et al. 2007; Telszewski et al. 2009; Iskandar 2009), ocean
circulation (e.g. Meza-Padilla et al. 2019) and ocean current forecasting (Vilibić
et al. 2016), see also the textbook of Thomson and Emery (2014, chapter 4). An
interesting feature revealed by SOM in ocean circulation in the Gulf of Mexico
(Meza-Padilla 2016) is the existence of areas of loop current eddies dominating
the circulation compared to local regimes at the upper slope. Vilibić et al. (2016)
found that SOM-based forecasting system of ocean surface currents was found to be
slightly better than the operational ROMS-derived forecasting system, particularly
during periods of strong wind conditions. Altogether, this shows that SOM, and
machine learning in general, has potential for improving ocean surface current
forecast.
Figure 17.17 illustrates the SOM algorithm application to weather and climate
fields. The two-dimensional (gridded) field data are transformed into a n × d data
matrix, with n and d being the sample size and the number of grid points (or the
number of variables), respectively. Each observation xt (t = 1, . . . n) is then used
to update the weights of SOM following Eqs. (17.15–17.18). The obtained weight
vectors of the SOM lattice are then transformed to physical space to yield the SOM
patterns (Fig. 17.17). To illustrate how SOM organises patterns, large scale flow
based on SLP is a convenient example to discuss. Johnson et al. (2008), for example,
446 17 Machine Learning

Fig. 17.17 Schematic illustration of the different steps used by SOM in meteorological application
(Source: Modified from Liu et al. (2006))

examined the SOM continuum of SLP over the winter (Dec–Mar) NH using NCEP-
NACR reanalysis. Figure 17.18 shows an example of NH SLP 4 × 5 SOM maps
obtained from Johnson et al. (2008), based on daily winter NCEP-NCAR SLP
reanalysis over the period 1958–2005. By construction (a small number of SOM
patterns), the figure shows large scale and low-frequency patterns. One of the main
features of Fig. 17.18 is the emergence of familiar teleconnection patterns, e.g.
−NAO (bottom left) and +NAO (bottom right). The occurrence frequency of those
patterns is used as a measure of climate change signal reflected by the NAO shift.
The SOM analysis also shows that interdecadal SLP variability can be understood
in terms of changes in the frequency of occurrence of the teleconnection patterns.
The SOM analysis of Johnson et al. (2008) reveals a change from a westward-
displaced −NAO-like pattern to an eastward-displaced +NAO-like pattern. More
examples and references of SOM application to synoptic climatology and large scale
phenomena can be found in Hannachi et al. (2017).
Due to its local character and the proximity neighbourhood, SOM seems to offer
some advantages compared to classical methods such as PCA and k-means (Reusch
et al. 2005; Astel et al. 2007; Lin and Chen 2006; Solidoro et al. 2007). Reusch et al.
(2005) compared the performance of SOM and PCA using synthetic climatological
data with and without noise contamination. They conclude that SOM is more robust
than PCA. For instance, SOM is able to isolate the predefined patterns with the
correct explained variance. On the other hand, PCA fails to identify the patterns due
to mixing. This conclusion is shared with other researchers (Liu and Weisberg 2007;
Annas et al. 2007; Astel et al. 2007). Liu and Weisberg (2007) compared SOM and
EOFs in capturing ocean current patterns using velocity field from moored ADCP
array. They found that SOM was readily more accurate to reveal, for example,
17.5 Application 447

Fig. 17.18 Illustration of 4 × 5 SOM maps of daily winter (Dec–Mar) SLP field from NCEP-
NCAR reanalysis for the period 1958–2005. Positive and negative values are shown by continuous
and dashed lines, respectively. Percentages of occurrence of the patterns are shown in the bottom
right corner for the whole period and in the top right corners for the periods 1958–1977 (top),
1978–1997 (middle) and 1998–2005 (bottom). Contour interval: 2 hPa. Adapted from Johnson et
al. (2008). ©American Meteorological Society. Used with permission

the asymmetric features (between upwelling and downwelling current patterns) in


current strength, jet location and the veering velocity with depth.
As a last example, we discuss the application of SOM to rainfall events. Daily
rainfall events in any region of the globe can be analysed in terms of events with a
number of features. Derouiche et al. (2020) transformed winter daily rainfall series
in northern Tunisia into six features (or variables), on seasonal basis, namely number
of events, number of rainy day, seasonal total rainfall, average accumulation per
event, average event duration and average accumulation per rainy days. A rainfall
event is defined as consecutive rainy days separated by at least two no rainy days.
These features are computed based on a 50-year rainfall series observed over the
period 1960–2009, using a rain gauge network of 70 stations in northern Tunisia.
SOM was applied to this feature space, with 3500 space-time observations, using
two-dimensional (2D) hexagonal grid of the SOM map, with 320 neurons.
One of the main outstanding features of SOM is its prominent visualisation
property, which enables a sensible data survey. This is made possible because
448 17 Machine Learning

Fig. 17.19 Schematic representation of the two-level approach of SOM clustering

SOM transforms, using a topology-preserving projection, the data from its orig-
inal (usually high-dimensional) state space onto a low-dimensional (usually two-
dimensional) space, i.e. the SOM map. This SOM map, represented as an ordered
grid, contains prototype vectors representing the data (e.g. Vesanto and Alhoniemi
2000). This map can then be used to construct, for example clusters (Fig. 17.19).
This algorithm of clustering SOM rather than the original data is referred to as two-
level approach (e.g. Vesanto and Alhoniemi 2000).
SOM provides therefore a sensible tool to classify for example rainfall regimes
in a given location. Figure 17.21 (top left) shows the obtained SOM map of rainfall
events in northern Tunisia. Note that each neuron, or prototype vector (individual
hexagon), contains a number of observations in its neighbourhood. These prototype
vectors are then agglomerated to obtain clusters. One particularly interesting method
to obtain the number of clusters is to use the data image method (Minnotte and West
1999). The data image is a powerful visualisation tool showing the dissimilarity
matrix as an image where each pixel shows the distance between two observations
(Martinez et al. 2010). Several variants of this image can be incorporated. Precisely,
rows and columns of the dissimilarity matrix can be reordered, for example, based
on some clustering algorithm, such as hierarchical clustering, allowing clusters to
emerge as blocks along the main diagonal. An example of application of data image
to the stratosphere can be found in Hannachi et al. (2011).
In the hierarchical clustering algorithm a bunch of fully nested sets are obtained.
The smallest sets are the clusters obtained as the individual elements of the dataset,
whereas the largest set is obtained as the whole dataset. Starting, for example from
the individual data elements as clusters, the algorithm then proceeds by successively
merging closest clusters, based on a chosen similarity measure until we are left with
only one single cluster. This can be achieved using a linkage algorithm, such as
single or complete linkages (e.g. Gordon 1999; Hastie et al. 2009). The result of the
hierarchical clustering is presented in the form of a tree-like graph or dendrogram.
This dendrogram is composed of branches linking the whole cluster to the individual
elements. Cutting through the dendrogram at a specific level yields specific number
17.5 Application 449

Fig. 17.20 Two-dimensional scatter plot of two Gaussian clusters, with sample size of 50 each
(top left), dendrogram (top right), data matrix is showing interpoint distance between any two data
points (bottom left) and data matrix when the data are reordered so that the top left and bottom
right blocks represent, respectively, the first and the second clusters. Adapted from Hannachi et al.
(2011). ©American Meteorological Society. Used with permission

of clusters. Figure 17.20 shows an illustrative example of two clusters (Fig. 17.20,
top left) and the associated dendrogram (Fig. 17.20, top right). The interpoint
distance matrix between data points is shown as a data matrix in Fig. 17.20 (bottom
left). Dark and light contrasts represent, respectively, small and large distances. The
dark diagonal line represents the zero value. The figure shows scattered dark- and
light-coloured areas. When the lines and columns of the interpoint distance matrix
are reordered following the two clusters obtained from the dendrogram, see the
vertical dashed line in Fig. 17.20 (top right), the data matrix (Fig. 17.20, bottom
right) now shows two dark blocks along the main diagonal, with light contrast in the
background.
The application of the two-way approach, i.e. SOM+clustering, is shown in
Fig. 17.21. The bottom left panel of Fig. 17.21 shows the data matrix obtained
from the interpoint distances between the SOM prototypes (SOM map). Like the
example above, dark and light contrasts reflect, respectively, a state of proximity
and remoteness of the SOM prototype vectors. Those prototypes that are close
to each other can be agglomerated by the clustering algorithm. Figure 17.21
shows the image when the SOM (prototype) data are reordered based on two
450 17 Machine Learning

Fig. 17.21 SOM map with hexagonal grid of the rainfall events (top left), SOM map with three
clusters on the SOM map (top right), data image of the SOM prototype vectors (bottom left),
data image with two (bottom centre) and three (bottom right) clusters. The numbers in the SOM
map represent the numbers of observations within a neighbourhood of the neurons (or prototype
vectors). Courtesy of Sabrine Derouiche

(Fig. 17.21, bottom centre) and three (Fig. 17.21, bottom right) clusters obtained
from a dendrogram or clustering tree of hierarchical clustering. The contrast
between the diagonal blocs and the background is stronger in the three-cluster case
(Fig. 17.21, bottom right), compared to the two-cluster case (Fig. 17.21, bottom
centre), suggesting three clusters, which are shown in the SOM map (Fig. 17.21,
top right). These clusters are found to represent three rainfall regimes in the studied
area, namely wet, dry and semi-dry.

17.5.3 Random Forest Application

Random forest (RF) is quite new to the field of weather/climate. It has been applied
recently to weather prediction (Karthick et al. 2018), temperature downscaling
(Pang et al. 2017) and a few other related fields such as agriculture, e.g. crop yield
(Jeong et al. 2016), greenhouse soil temperature (Tsai et al. 2020) and forest fire (Su
17.5 Application 451

et al. 2018). In weather prediction, for example, Karthick et al. (2018) compared
few techniques and found that RF was the best with around 87% accuracy, with
only one disadvantage, namely overfitting. Temperature downscaling using RF was
performed by Pang et al. (2017) in the Pearl river basin in southern China. The
method was compared to three other methods, namely artificial NN, multilinear
regression and support vector machines. The authors found, based on five different
criteria, that RF outperforms all the other methods. For example, RF could identify
the best predictor combination compared to all the other methods. In crop yield
production, Jeong et al. (2016) used RF to predict three types of crops in response
to climate and biophysical variables and compared it to multiple linear regression as
a benchmark. RF was found to outperform the multilinear regression. For example,
the root mean square error was in the range of 6−14% compared to 14−49% for
multiple linear regression. Though this suggests that RF is an effective and versatile
tool for crop yield prediction, the authors also caution that it may result in a loss
of accuracy when applied to extremes or responses beyond the boundaries of the
training set, a weakness that characterises machine learning approaches in general.
Appendix A
Smoothing Techniques

A.1 Smoothing Splines

Splines provide a nonlinear and smooth fitting to a unidimensional or multivariate


scatter of points. The spline smoothing can be regarded as a nonlinear and non-
parametric regression model. We assume that at each point xi (independent variable)
we observe yi (dependent variable), i = 1, . . . n, and we are interested in seeking a
nonlinear relationship linking y to x of the form:

y = f (x) + ε, (A.1)

for which the objective is to estimate f (.). It is of course easier if we knew the
general form of the function f (.). In practice, however, this information is very
seldom available. The spline smoothing considers f (.) to be a polynomial. One of
the most familiar polynomial smoothing is the cubic spline and corresponds to the
case of a piece-wise cubic function, i.e.

f (x) = fi (x) = ai + bi x + ci x 2 + di x 3 for xi ≤ x ≤ xi+1 , (A.2)

for i = 1, . . . n − 1. In addition, to get smoothness the first two derivatives are


assumed to be continuous, i.e.

dα dα
α
fi (xi ) = fi−1 (xi ), (A.3)
dx dx α
for α = 0, 1, and 2. The constraints given by Eq. (A.2) and Eq. (A.3) lead to a
smooth function. However, the problem is not closed, and we need
 extra conditions.
The problem is normally simplified by minimising the quantity ni=1 (yi − f (xi ))2
with a smoothness condition that takes the form of an integral of the second
derivative squared. The functional to be minimised is

© Springer Nature Switzerland AG 2021 453


A. Hannachi, Patterns Identification and Data Mining in Weather and Climate,
Springer Atmospheric Sciences, https://doi.org/10.1007/978-3-030-67073-3
454 A Smoothing Techniques


n   2
d 2 f (x)
F = [yk − f (xk )] + λ2
dx. (A.4)
dx 2
k=1

The first part of Eq. (A.4) is a measure of the goodness of fit, whereas the second
part provides a measure of the overall smoothness. In the theory of elastic rods the
latter term is proportional to the energy of the rod when it is bent under constraints.
Note that the functional F (.), Eq. (A.4), can be extended to two dimensions and
the final surface will behave like a smooth plate. The function F in Eq. (A.4) is
known as the penalised residual sum of squares. Also, in Eq. (A.4) λ represents the
smoothing parameter and controls the relative weight given to the roughness penalty
and goodness of fit. It controls therefore the balance between goodness of fit and
smoothness. For example, the larger the parameter λ the smoother the function f .
Remark Note that if ε = 0 in Eq. (A.1) the spline simply interpolates the data. This
means that the spline solves Eq. (A.4) with λ → 0, which is equivalent to

  2
min f (x) dx subject to yk = f (xk ), k = 1, . . . n. (A.5)

Equation (A.4) can be extended to higher dimensions by replacing the second


  2
derivative roughness measure f by high-order derivatives with respect to all
coordinates to ensure consistency with all directions. In this case the quantity to be
minimised takes the form


n
 2    ∂k
2
yj − f (xj ) + λ k
k1 +...+km f (x) dx,
j =1 k1 +...+km =k Rm ∂x1k1 . . . ∂xm
km

(A.6)
where k is a fixed positive integer. The obtained solution is known as thin-plate
spline. The solution is generally obtained as linear combination of the m m+k−1

monomials of degree less than k and a set of n radial basis functions (Wahba 1990).
The minimisation of Eq. (A.4), when λ is known, yields the cubic spline. The
determination of λ, however, is more important since it controls the smoothness of
the fitting. One way to obtain an appropriate estimate of λ is to use cross-validation.
The idea behind cross-validation is to have estimates that minimise the effect of
omitted observations. If fλ,k (x) is an estimate of the spline with parameter λ when
2
the kth observation is omitted, the mis-fit of the point xk is given by yk − fλ,k (x) .
The best value of λ is the one that minimises the cross-validation criterion


n
2
wk yk − fλ,k (x) ,
k=1

where wk , k = 1, . . . n, is a set of weights, see Sect. A.1.2.


A Smoothing Techniques 455

A.1.1 More on Smoothing Splines

Originally, splines were used as a way to smoothly interpolate the set of points
(xk , yk ), k = 1, . . . n, where x1 < x2 < . . . xn belong to some interval [a, b],
by means of piece-wise polynomials. The name spline was coined by Schoenberg
(1964), see also Wahba (2000). The function is obtained as a solution to a variational
problem such as1

1
n b 2
min (yi − f (xi ))2 + μ f (m) (x) dx (A.7)
n a
i=1

for some μ > 0, and over the set of functions with 2m − 2 continuous derivatives
over [a, b], C 2m−2 ([a, b]). The solution is a piece-wise polynomial of degree 2m−1
inside each interval [xi , xi+1 ], i = 1, . . . n − 1, and m − 1 inside the outer intervals
[a, x1 ] and [xn , b].
In general, the smoothing spline can be formulated as a regularisation problem
(Tikhonov 1963; Morozov 1984). Given a set of points x1 , . . . xn in Rd and n
numbers y1 , . . . yn , we seek a smooth function f (x) from Rd into R that best fits the
data (x1 , y1 ), . . . (xn , yn ). This problem is normally solved by seeking to minimise
the functional:


n
F (f ) = (yi − f (xi ))2 + μ L(f ) 2 . (A.8)
i=1

Note that the first part measures the mis-fit and the second part is a penalty
measuring the smoothness of the function f (.). The operator L is in general a
differential operator, and μ is the smoothing parameter, which controls the trade-
off between both attributes. By computing F (f + δf ) − F (f ), where δf is a small
“departure” from f , the stationary solutions of Eq. (A.8) can be shown to satisfy

  n
μ LL∗ (f )(x) = [yi − f (x)] δ (x − xi ) , (A.9)
i=1

where L∗ is the adjoint (Appendix F) of L.

1 The space over which Eq. (A.7) is defined is normally referred to as Sobolev space of functions
b 2
defined over [a, b] with m−1 absolutely continuous derivatives and satisfying a f (m) (x) dx <
∞.
456 A Smoothing Techniques

Exercise Derive Eq. (A.9).


Hint Using the L2 norm where L(f ) =< L(f ), L(f ) >, we obtain,
after discarding the second-order terms in δf , F (f + δf ) − F (f ) =
−2 ni=1 [(yi − f (xi )) δf (xi )] + 2μ < L∗ L(f ), δf >.
The solution is then obtained as the stationary points of the differential operator
F  (f ):

< F  (f ), v >= F  (f ).v = −2 ni=1 [yi − f (xi )] v(xi ) + 2μ < L∗ L(f ), v >=
n
−2 < i=1 (yi − f (xi )) .δ(x − xi ), v > +2μ < L∗ L(f ), v > .

The solution to Eq. (A.9) can be expressed in the form of an integral as


  
1 
n
f (x) = G(x, y) (yi − f (y)) δ(y − xi ) dy,
μ
i=1

where G(x, y) is the Green’s function of the operator L∗ L (see Sect. A.3 below);
hence,

1 
n n
f (x) = (yi − f (xi )) G (x, xi ) = μi G (x, xi ) . (A.10)
μ
i=1 i=1

The coefficients μj , j = 1 . . . n, are computed by applying it to x = xj , i.e. yj =


 n
i=1 μi Gj i or

Gμ = y, (A.11)

where G = (G)ij = G xi , xj , μ = (μ1 , . . . , μn )T and y = (y1 , . . . , yn )T . Note


that Eq. (A.11) can be extended to include a homogeneous solution p(x) of the
partial differential equation (PDE) L∗ L, to yield


n
f (x) = μi G (x, xi ) + p(x)
i=1

with L∗ L(p) = 0. Examples of functions p(.) are given below.


The popular thin-plate spline corresponds to the case where the differential
operator is an extension to that given in Eq. (A.7) to the multidimensional space,
i.e.
  2
 m! ∂m
L(f ) 2
= f (x) dx, (A.12)
k1 ! . . . kd ! Rd ∂x1k1 . . . ∂xdkd
k1 +...kd =m

and where the functional or energy to be minimised takes the form


A Smoothing Techniques 457


n
F (f ) = (yk − f (xk ))2 + λ Lf 2
(A.13)
k=1

for a fixed positive integer m. The function f (x) used in Eq. (A.12) or Eq. (A.13) is
of class C m , i.e. with (m − 1) continuous derivatives and the mth derivative satisfies
Lf (m) < ∞. The corresponding L∗ L operator is invariant under translation and
rotation; hence, the corresponding Green’s function G(x, y) is a radial function and
satisfies

(−1)m m G(x) = δ(x), (A.14)

whose solution, see e.g. Gelfand and Vilenkin (1964), is a thin-plate spline, i.e.

x 2m−d log |x for 2m − d > 0, and d is even
G(x) = 2m−d (A.15)
x when d is odd.

The general thin-plate spline is then given by


n
f (x) = μj G x − xj + p(x), (A.16)
j =1

where p(x) is a polynomial


  of degree m − 1 that can be expressed as a linear
combination of l = dd+m−1 monomials in Rd of degree less than m, i.e. p(x) =
l
k=1 λk pk (x). The parameters μj , j = 1, . . . m, and λk , k = 1, . . . l, are obtained
by taking f (xj ) = yj and imposing further conditions on the polynomial p(x) in
order to close the problem. This is a well known radial basis function problem.
Noisy data have not been explicitly mentioned, but the formulation given
in Eq. (A.7) takes account of an uncorrelated noise (see Sect. A.3) when the
interpolation is not exact. If the noise is not autocorrelated, e.g.

yi = f (xi ) + εi

for i = 1, . . . n, with zero-mean multinormal noise with E εε T = W where


ε = (ε1 , . . . , εn )T , then the functional to be minimised takes the form of a penalised
likelihood:

F (f ) = (y − f)T W−1 (y − f) + μ L(f ) 2 ,

where f = (f (x1 ), . . . , f (xn ))T . In the case of thin-plate spline where L(f ) 2
is given by Eq. (A.12), the solution is similar to Eq. (A.16) and the parameters
μ = (μ1 , . . . , μl )T and λ = (λ1 , . . . , λn )T are the solution of a linear system of
the form:
458 A Smoothing Techniques

    
G + nμW P λ y
= ,
PT O μ 0

where G = (Gij ) = G( xi − xj ) for i, j = 1, . . . n and P = (Pik ) = (pk (xi ))


for i = 1, . . . n and k = 1, . . . l.

A.1.2 Choice of the Smoothing Parameter

So far the parameter μ was assumed to be fixed but unknown. One way to deal
with the problem would be to choose an arbitrary value based on experience. A
more concise way is to compute it from the data using an elegant procedure known
as cross-validation, see also Chap. 15. Suppose that one would like to solve the
problem given in Eq. (A.7) and would like to have an optimal estimate of μ. The
idea of cross-validation is to take one or more points out of the sample and find the
value of μ that minimises the mis-fit. Suppose in fact that xk was taken out. Then
(k)
the spline fμ (.) that fits the remaining data minimises the functional:
 b
1 
n
2
[yi − f (xi )]2 + μ f (m) (t) dt. (A.17)
n a
i=1,i=k

The overall optimal value of μ is the one that minimises the overall mis-fit or cross-
validation:

1
n
2
cv (μ) = yk − fμ(k) (xk ) . (A.18)
n
k=1

Let us designate by fμ (x) the spline function fitted to the whole sample for
a given μ. Let also A(μ) = aij (μ) , i, j = 1, . . . n, be the matrix relating
T
y = (y1 , . . . , yn )T to fμ = fμ (x1 ), . . . fμ (xn ) , i.e. satisfying A(μ)y = fμ .
Then Craven and Wahba (1979) have shown that

yk − fμ (xk )
yk − fμ(k) (xk ) = ,
1 − akk (μ)

and therefore,
n 
1  yk − fμ (xk ) 2
cv (μ) = . (A.19)
n 1 − akk (μ)
k=1

The generalised cross-validation is obtained by substituting a(μ) = n1 tr (A(μ)) for


akk (μ) to yield
A Smoothing Techniques 459

 2
(I − A(μ)) y
cv (μ) = n . (A.20)
tr (I − A(μ))

Then, Eq. (A.19) or Eq. (A.20) is minimised to yield an optimal value of μ.

A.2 Radial Basis Functions

A.2.1 Exact Interpolation

Radial basis functions (RBFs) constitute one of the attractive tools to interpolate
and/or smooth scattered data. RBFs have been formally introduced and coined
by Powell (1987) in exact multivariate interpolation, although the technique was
hanging around before that, see e.g. Hardy (1971), Franke (1982).
Given m distinct points2 xi , i = 1, . . . n, in Rd , and n real numbers fi , i =
1, . . . n, these numbers can be regarded as the values at xi of a certain unknown
function f (x). The problem of RBF is to find a smooth real function s(x) satisfying
the interpolation conditions:

s(xi ) = fi i = 1, . . . n (A.21)

and that the interpolating function is of the form


n 
m
s(x) = λk φk (x) = λk φ ( x − xk ) . (A.22)
k=1 k=1

The functions φk (x) = φ ( x − xk ) are known as radial basis functions. The real
function φ(x) is defined over the positive numbers, and . is any Euclidean norm
or Mahalanobis distance. Thus the radial basis functions s(x) are a simple linear
combination of the shifted radial basis functions φ(x).

Examples of Radial Basis Functions

• φ(r) = r k , for a positive integer k. The cases k = 1, 2, and 3 correspond,


respectively, to the linear, quadratic and cubic RBF.
1
• φ(r) = r 2 + c 2 for c > 0 and corresponds to the multiquadratic case.
• φ(r) = e−ar for a > 0 is the Gaussian RBF.
2

• φ(r) = r 2 log r, which is the thin-plate spline.

2 Generally known as nodes, knots, or centres of interpolation.


460 A Smoothing Techniques

−1
• φ(r) = 1 + r 2 and corresponds to inverse quadratic.
Equations (A.21) and (A.22) lead to the following linear system:

Aλ = f, (A.23)

where A = (aij ) = φ xi − xj , i, j = 1, . . . n, and f = (f1 , . . . fn )T .


The more general RBF interpolation problem is obtained by extending Eq. (A.22)
to yield


n
s(x) = pm (x) + λk φ ( x − xk ) , (A.24)
k=1

where pm (x) is a low-order polynomial of degree at most m in Rd . Apart from


interpolation, RBFs constitute an attractive tool that can be used for various other
purposes such as minimisation of multivariate functions, see e.g. Powell (1987) for
a discussion, filtering and pattern recognition (Carr et al. 1997), and can also be used
in PDEs and neural networks (Larsson and Fornberg 2003). Note also that RBFs are
used naturally in other fields such as gravitation.3
Because there are more parameters than constraints in the case of Eq. (A.24), further
constraints are imposed, namely,


n
λj p(xj ) = 0 (A.25)
j =1

for all polynomials p(x) of degree at most m. Apart from introducing more
equations, system of Eq. (A.25) can be used to measure the smoothness of the
RBF (Powell 1990). It also controls the rate of growth at infinity of the non-
polynomial
  part of s(x) in Eq. (A.24) (Beatson et al. 1999). If (p1 , . . . pl ), with
l= d m+d
= (m+d)!
m!d! , is a basis of the space of algebraic polynomials of degree less
or equal than m in Rd , then Eq. (A.25) becomes

3 For example in the N -body problem, the gravitational potential at a point y takes the form


N
αi
φ(y) = .
xi − y
i=1

Similarly, the heat equation ∂


∂t h − ∇ 2 h = 0, with initial condition h(t, x) = g(x), has, for t > 0,
the solution

3 x−y 2
h(t, x) = (4π t)− 4 e− 4t g(y)dy,

which looks like Eq. (A.22) when it is discretised.


A Smoothing Techniques 461


n
λj pk (xj ) = 0 k = 1, . . . l. (A.26)
j =1

l also that pm (x) in Eq. (A.24) can be substituted for the combination
Note
k=1 μk pk (x), which, when combined with Eq. (A.26), yields the following
system:
      
λ A P λ f
A = = , (A.27)
μ PT O μ 0

where P = (pij ) = (pj (xi )), (i = 1, . . . n, j = 1, . . . l). The next important


equation in RBF is related to the invertibility of the system of Eq. (A.23) and
Eq. (A.27). For many choices of φ(.), the matrix A in Eq. (A.23) is invertible.
For example, for the linear and multiquadratic cases, A is always invertible for
every n and d, provided the points are distinct (Michelli 1986). For the quadratic
case where s(x) becomes quadratic in x, A becomes singular if the number of
points n exceeds the dimension of the space of quadratic polynomials, i.e. n >
2 (d + 1)(d + 2), whereas for the cubic case A can be singular if d ≥ 2 but is always
1

nonsingular for the unidimensional case.4 Powell (1987) also gives further examples
−β
of nonsingularity such as φ(r) = r 2 + 1 , (β > 0).
Consider now the extended interpolation in Eq. (A.24), the matrix A in
Eq. (A.27) is nonsingular only if the columns of P are linearly independent.
Michelli (1986) gives sufficient conditions for the invertibility of the system of
Eq. (A.27) based on conditional positivity,5 i.e. when φ(r) is conditionally strictly

4 In this case the matrix is nonsingular for all φ(r) = r 2α+1 (α positive integer) and the
interpolation function


n
s(x) = λi |x − xi |2α+1
i=1

is a spline function.
5 A real function φ(r) defined on the set of non-negative real numbers is conditionally (strictly)

positive definite of order m + 1 if for any distinct points x1 , . . . xn and scalars satisfying λ1 , . . . λn
satisfying Eq. (A.26) the quadratic form

λT λ = λi φ xi − xj λj
ij

is non-negative (positive). The “conditionally” in the definition refers to Eq. (A.26). The set of
conditionally positive definite functions of order m has been characterised by Michelli (1986). If a
dk
continuous function φ(.) defined on the set of non-negative real numbers is such that (−1)k dr k φ(r)

is completely monotonic, then φ(r 2 ) is conditionally positive definite of order k. Note that if
dk
(−1)k dr k φ(r) ≥ 0 for all positive integers k, then φ(r) is said to be completely monotonic. The
following important result is also given in Michelli (1986). If the continuous and positive function
462 A Smoothing Techniques

positive. The previous two sufficient conditions allow for various choices of radial
−α β
basis functions, such as r 2 + a 2 for α > 0, and r 2 + a 2 for 0 < β < 1. For
3
instance, the functions φ1 (r) = r 2 and φ2 (r) = 4r log r have their second derivatives
completely monotonic for r > 0. The functions φ1 (r 2 ) = r 3 and φ2 (r 2 ) = r 2 log r
can also be used as RBF. The latter case corresponds to the thin-plate spline.
Another case of singularity was provided by Powell (1987) and corresponds to
∞ b
φ(r) = 0 e−xr ψ(x)dx, where ψ(x) is non-negative with a ψ(t)dt > 0 for
2

some constants a and b. There is also another set of functions such as


 2(m+1)−d
r for odd and 2(m + 1) > d
φ(r) = 2(m+1)−d
r log r 2(m+1)−d ,

where m is the largest degree of the polynomial included in s(x).


Remark The system of Eq. (A.27) can be solved using SVD. Alternatively, one can
define the n × (n − l) matrix Q whose columns span the orthogonal complement
of the columns of P. Hence PT λ yields a unique γ such that PT λ = γ . The first
system from Eq. (A.27) yields QT AQγ = QT f, which is invertible since Q is full
rank and A is strictly conditionally positive definite of order m + 1. The vector μ is
then obtained from Pμ = f − AQγ . A possible choice of Q is given by Beatson et
al. (2000), namely
⎡ ⎤
p1,l+1 p1,l+2 . . . p1n
⎢ . .. ⎥
⎢ .. . ⎥
⎢ ⎥
⎢ ⎥
⎢p p . . . pln ⎥
Q = − ⎢ l,l+1 l,l+2 ⎥,
⎢ 1 0 ... 0 ⎥
⎢ ⎥
⎢ .. ⎥
⎣ . ⎦
0 0 ... 1

see also Beatson et al. (1999) for fast fitting and evaluation of RBF.

Example: Thin-Plate Spline



In this case we have φ(r) = r 2 log r and s(x) = p(x) + nk=1 λk φ ( x − xi ) with
p(x) = μ0 + μ1 x + μ2 y. The matrix P in this case is given by

φ(r) defined on the set of non-negative numbers has its first derivative completely monotonic (not
constant), then for any distinct points x1 , . . . xn
  
(−1)n−1 det φ xi − xj 2 > 0.
A Smoothing Techniques 463

⎛ ⎞
1 x1 y1
⎜1 x1 y1 ⎟
⎜ ⎟
P=⎜. .. .. ⎟ .
⎝ .. . . ⎠
1 xn yn

Note that thin-plate (or biharmonic) splines serve to model the deformation of an
infinite thin plate (Bookstein 1989) and are a C 1 function that minimises the energy
  2  2  2 
∂ 2s ∂ 2s ∂ 2s
E(s) = +2 + dxdy.
R2 ∂x 2 ∂x∂y ∂y 2

A.2.2 RBF and Noisy Data

In the previous section the emphasis was on exact interpolation where the fitted
function goes exactly through the data (xi , fi ). This corresponds to the case when
the data are noise-free. If the data are contaminated with noise, then instead of the
condition given by Eq. (A.21) we seek a function s(x) that minimises the functional

1
n
[s(xi ) − fi ]2 + ρ s 2 , (A.28)
n
i=1

where the penalty function, given by ρ s 2 , provides a measure of the smoothness


of s(x) and ρ ≥ 0 is the smoothing parameter. Equation (A.12) provides an example
of norm, which is used in thin-plate splines (Cox 1984; Wahba 1979; Craven and
Wahba 1979).
Remark If we use the semi-norm in Eq. (A.12) with d = 3 and m = 2, the solution
to Eq. (A.28) is given (Wahba 1990) by


n
s(x) = p(x) + λi x − xi ,
i=1

where p(x) is a polynomial of degree 1, i.e. p(x) = μ0 + μ1 x1 + μ2 x2 + μ3 x3 , and


the coefficients are given by the linear system:
    
A − 8nπρI P λ f
= ,
PT O μ 0

where A and P are defined as in Eq. (A.27).


464 A Smoothing Techniques

A.2.3 Relation to PDEs and Other Techniques

In many problems in mathematical physics, one seeks to solve the following PDE:

Lu = f, (A.29)

in a domain D within Rd under specific boundary conditions, and L is a differential


operator. The Green’s function6 G of the operator L is the (generalised) function
satisfying

LG(x, y) = δ(x − y), (A.30)

where δ is the Dirac (or impulse) function. The solution to Eq. (A.29) is then given
by the following convolution:

u(x) = f (y)G (x, y) dy + p(x), (A.31)
D

where p(x) is a solution to the homogeneous equation Lp = 0. Note that Eq. (A.31)
is to be compared with Eq. (A.24). In fact, if there is an operator L satisfying Lφ(x−
xi ) = δ(x − xi ) and also Lpm (x) = 0, then clearly the RBF φ(r) is the Green’s
function of the differential operator L and the radial basis function s(x) given by
Eq. (A.24) is the solution to


n
Lu = λk δ(x − xk ).
k=1

As such, it is possible to use the PDE solver to solve an interpolation problem (see
e.g. Press et al. (1992)). In general, given φ, the operator L can be determined using
filtering techniques from time series.
RBF interpolation is also related to kriging. For example, when pm (x) = 0 in
Eq. (A.24), then the equations are similar to kriging and where φ plays the role of
an (isotropic) covariance function. The relation to splines has also been outlined.
For example, when the radial function φ(.) is cubic or thin-plate spline, then we
have a spline interpolant function. In this case the function minimises the bending
energy of an infinite thin plate in two dimensions, see Poggio and Girosi (1990)
for a review.
n For instance, if φ(r)
m = r 2m+1 (m positive integer), then the function
s(x) = i=1 λi |x − xi | 2m+1 + k=1 μk x k is a natural spline.

6 The Green’s function G depends only on the operator L and has various properties. For example, if

L is self-adjoint, then G is symmetric. If L is invariant under translation, then G(x, y) = G(x − y),
and if L is also invariant under rotation, then G is a radial function, i.e. G(x, y) = G( x − y ).
A Smoothing Techniques 465

A.3 Kernel Smoother

This is a kind of local average smoother where a weighted average is obtained


around each target point. Unlike linear smoothers, the kernel smoother uses a
particular function K() known as kernel. Given the data points (xj , yj ), j = 1, . . . n,
for each target point xi the weighted average ŷi is obtained by


n
ŷi = Kij yj , (A.32)
j =1

where the weights are given by


  −1  

n
xi − xm xi − xj
Kij = K K .
b b
m=1

 the weights are non-negative and add up to one, i.e. Kij ≥ 0, and for each
Clearly
i, nj=1 Kij = 1. The kernel function K(.) satisfies the following properties:

 ∞ ≥ 0, for all t.
(1) K(t)
(2) −∞ K(t)dt = 1.
(3) K(−t) = K(t) for all t.
Hence, K(.) is typically a symmetric probability density function. Note that the
parameter b gives a measure of the size of the neighbourhood in the averaging
process around each target point xi . Basically, the parameter b controls the “width”
of the function K( xb ). In the limit b → 0, we get a Dirac function δ0 . In this case
the smoothed function is identical to the original scatter, i.e. ŷi = yi . On the other
hand, in the limit b → ∞ we get a uniform
 weight function, and the smoothed curve
reduces to the mean, i.e. ŷi = y = n1 yi . A familiar example of kernels is given
by the Gaussian PDF:

1 x2
K(x) = √ e− 2 .

There are several other kernels used in the literature. The following are examples:
• Box kernel K(x) = 1[− 1 , 1 ] .
2 2 
• Triangle kernel K(x) = 1 − |t| a 1[− a1 , a1 ] (for some a > 0).
⎧  3

⎪ − x 2
+ |x|
for x ≤ M

⎨ 1 6 M 6 M 2
 3
• Parzen kernel: K(x) = 2 1 − |x| for 2 ≤ |x| ≤ M
M

⎪ M


0 for |x| > M.
These kernels can also extend easily to the multidimensional case.
Appendix B
Introduction to Probability and Random
Variables

B.1 Background

Probability is a branch of mathematics that deals with chance, randomness or


uncertainty. When tossing a coin, for example, one talks of probability of getting
head or tail. With an operator receiving phone calls at a telephone switch board, one
also talks about the probability of receiving a given number of phone calls within
a given time interval. We also talk about the probability of having rain tomorrow
at 13:00 at a given location. Games of chance also constitute other good examples
involving probability. Games of chance are very old indeed, and it has been found
that cubic dices have been used by ancient Egyptians around 2000 BC (DeGroot
and Shervish 2002). Probability calculus has been popularised apparently around
the mid-fifteen century by Blaire Pascal and Pierre De Fermat, and it was in 1933
that A. N. Kolmogorov axiomatised the probability theory using sets and measures
theory (Kolmogorov 1933).
Despite its common use by most scientists, no unique interpretation of prob-
ability exists among scientists and philosophers. There are two main schools of
thought:
(i) The frequentists school, led by von Mises (1928) and Reichenback (1937),
holds that the probability p of an event is the relative frequency nk of the
occurrence of that event in an infinite sequence of similar (and independent)
trials, i.e.

k
p = lim ,
n→∞ n
where k is the number of times that event occurred in n trials.
(ii) The subjective or “Bayesian” school, which holds that the probability of an
event is a subjective or personal judgement of the likelihood of that event.
This interpretation goes back to Thomas Bayes (1763) and Pierre Simon

© Springer Nature Switzerland AG 2021 467


A. Hannachi, Patterns Identification and Data Mining in Weather and Climate,
Springer Atmospheric Sciences, https://doi.org/10.1007/978-3-030-67073-3
468 B Introduction to Probability and Random Variables

Laplace in the early sixteenth century (see Laplace 1951). This trend argues
that randomness is not an objectively measurable phenomenon but rather a
“knowledge” phenomenon, i.e. they regard probability as an epistemological
rather than ontological concept.
Besides these two main schools, there is another one: the classical school, which
interprets probability based on the concept of equally likely outcomes. According
to this interpretation, when performing a random experiment, one can assign the
same probability to events that are equally likely. This interpretation can be useful in
practice, although it has a few difficulties such as how to define equally likely events
before even computing their probabilities, and also how to define probabilities of
events that are not equally likely.

B.2 Sets Theory and Probability

B.2.1 Elements of Sets Theory


Sets and Subsets

Let S be a set of elements s1 , s2 , . . . , sn , finite or infinite. Note that in the case of


infinite sets one distinguishes two types: (1) countable sets whose elements can be
counted using natural numbers 1, 2, . . . and (2) uncountable sets that are infinite but
one cannot count their elements. The set of rational numbers is countable, whereas
the set of real numbers is uncountable. A subset A of S, noted A ⊂ S, is a set whose
elements belong to S. The empty set Ø and the set S itself are examples of (trivial)
subsets.

Operations on Subsets

Given a set S and subsets A, B, and C, one can perform the following operations:
• Union—The union of A and B, note A ∪ B, is the subset containing the elements
from A or B. It is clear that A ∪ Ø = A, and A ∪ S = S. Also, if A ⊂ B, then
A ∪ B = B. This definition can be extended to an infinite sequence of subsets
A1 , A2 , . . . to yield ∪∞
k=1 Ak .
• Intersection—The intersection of two subsets A and B, noted as A ∩ B, is the
set containing only common elements to A and B. If no common elements exist,
then A∩B = Ø, and the two subsets are said to be mutually exclusive or disjoint.
It can be seen that A ∩ S = A and that if D ⊂ B then D ∩ B = D. The definition
also extends to an infinite sequence of subsets.
• Complements—The complement of A, noted as Ac , is the subset of elements that
are not in A. One has (Ac )c = A; S c = Ø; A ∪ Ac = S and A ∩ Ac = Ø.
B Introduction to Probability and Random Variables 469

B.2.2 Definition of Probability


Link to Sets Theory

An experiment involving different outcomes when it is repeated under similar


conditions is a random experiment. For example, throwing a dice yields in general1
a random experiment. The outcome of a random experiment is called an event. The
set of all possible outcomes of a random experiment is named as the sample space
S. For the case of the dice, S = {1, 2, 3, 4, 5, 6}, and any subset of S is an event.
For example, A = {1, 3, 5} corresponds to the event of odd outcomes. The empty
subset Ø corresponds to the impossible event. A sample space is discrete if it is
finite or countably infinite. All the operations on subsets mentioned above can be
readily transferred to operations between events. For example, two events A and B
are disjoint or mutually exclusive if A ∩ B = Ø.

Definition/Axioms of Probability

Given a sample space S, a probability is a function defined on the events of S,


assigning a number P r (A) to each event A, satisfying the following properties
(axioms):
(1) For any event A, 0 ≤ P r(A) ≤ 1.
(2) P r(S) = 1. ∞
(3) P r(∪∞
i=1 Ai ) = i=1 P r(Ai ), for any sequence of disjoint events A1 , A2 , . . ..

Properties of Probability

• Direct consequences
(1) P r(Ø) = 0.
(2) P r(Ac ) = 1 − P r(A).
(3) P r(A ∪ B) = P r(A) + P r(B) − P r(A ∩ B).
(4) If A ⊂ B, then P r(A) ≤ P r(B).
(5) If A and B are exclusive, then P r(A ∩ B) = 0.
Exercise Derive the above properties.
Exercise Compute P r(A ∪ B ∪ C).
Answer P r(A) + P r(B) + P r(C) − P r(A ∩ B) − P r(A ∩ C) − P r(B ∩ C) +
P r(A ∩ B ∩ C).

1 When the dice is fair.


470 B Introduction to Probability and Random Variables

• Conditional Probability
Given two events A and B, with P r(B) > 0, the conditional probability of A
given by B, denoted by P r(A|B) is defined by

P r(A|B) = P r(A ∩ B)/P r(B).

• Independence—Two events A and B are independent if and only if P r(A ∩ B) =


P r(A)P r(B). This is equivalent to P r(A|B) = P r(A). This definition also
extends to more than two independent events. As a consequence, one has the
following property:

P r(A|B) = P r(B|A)P r(A)/P r(B).

Note the difference between independent and exclusive/disjoint events.


• Bayes theorem
For n events B1 , . . . , Bn forming a partition of the sample space S, i.e.
mutually exclusive whose union is S, and A any event, then

P r(Bi )P r(A|Bi )
P r(Bi |A) = n .
j =1 P r(Bj )P r(A|Bj )

B.3 Random Variables and Probability Distributions

Definition A random variable is a real valued function defined on a sample space


S of a random experiment. A random variable is usually noted by a capital letter,
e.g. X, Y or Z, and the values it takes by a lower case, e.g. x, y or z. Hence a random
variable X assigns a value x to each outcome in S. Depending on the sample space,
one can have either discrete or continuous random variables. Sometimes we can also
have a mixed random variable. Here we mainly describe discrete and continuous
random variables.

B.3.1 Discrete Probability Distributions

Let X be a discrete random variable taking discrete values x1 , . . . , xk , and pj =


pr(X = xj ), j = 1, . . . , k. Then the function

f (x) = P r(X = x)
k
is the probability function of X. One immediately sees that j =1 f (xj ) =
k
j =1 pj = 1. The function F (x) defined by
B Introduction to Probability and Random Variables 471


F (x) = P r(X ≤ x) = f (xi )
xi ≤x

is the cumulative distribution function (cdf) of X. The cdf of a discrete random vari-
able is piece-wise constant function between 0 and 1. Various other characteristics
can be defined from X, which are included in the continuous case discussed below.

B.3.2 Continuous Probability Distributions

Definition Let X be a continuous random variable taking values in a continuous


subset I of the real axis. The function f (x) defined by
 b
P r(a ≤ x ≤ b) = f (x)dx
a

for any interval [a, b] in I is the probability density function (pdf) of X. Hence
the quantity f (x)dx represents the probability of the event x ≤ X ≤ x + dx, i.e.
P r(x ≤ X ≤ x + dx) = f (x)dx. The pdf satisfies the following properties:
(1) f
∞(x) ≥ 0 for all x.
(2) −∞ f (x)dx = 1.
The cumulative distribution function of X is given by
 x
F (x) = f (x)dx.
−∞

Remark Let X be a discrete random variable taking values x1 , . . . , xk , with prob-


abilities p1 , . . . , pk . Designate by δx () the Dirac impulse function, i.e. δx (y) = 1,
only if y = x, and zero otherwise. Then the probability function f (x) can be written
as f (x) = kj =1 pj δxj (x). Hence by using the rule of integration of a Dirac impulse
function, i.e.

δx (y)g(y)dy = g(x)1I (x),
I

where 1I () is the indicator of the interval I , then X can be analysed as if it were


continuous.

Moments of a Random Variable

Let X be a continuous random variable with pdf f () and cdf F (). The quantity:
472 B Introduction to Probability and Random Variables


E(X) = xf (x)dx

is the expected value or first-order moment of X. Notethat for a discrete random


variable one obtains, using the above remark, E(X) = xi pi . 
The kth-order moment of X is defined by mk = E(Xk ) = x k f (x)dx. The
centred kth-order moment is

μk = E (X − E(X))k = (x − μ)k f (x)dx.

The second-order centred moment μ2 is the variance, var(X) = σ 2 of X, and we


have σ 2 = E(X2 ) − E(X)2 . One can define addition and multiplication of two (or
more) random variables over a sample space S and also multiply a random variable
by a scalar. The expectation operator is a linear operator over the set of random
variables on S, i.e. E(λX + Y ) = λE(X) + E(Y ). We also have var(λX) =
λ2 var(X).

Cumulants

The (non-centred) moments μm , m = 1, 2, . . ., of a random variable X with pdf


f (x), are defined by

μm = E X m
= x m f (x)dx.

The centred moments are defined with respect to the centred random variable X −
E(X). The characteristic function is given by
  
φ(s) = E eisX = eisx f (x)dx,

and the moment generating function is given by


  
g(s) = E esX = esx f (x)dx.

dm
We have, in particular, μm = ds m φ(s)|s=0 . The cumulant of order m of X, κm , is
given by

dm
κm = log (φ(s)) |s=0 .
i m ds m
For example, the third-order moment is the skewness, which provides a measure of
the symmetry of the pdf (with respect to the mean when centred moment is used),
B Introduction to Probability and Random Variables 473

and κ3 = μ3 − 3μ2 μ1 + 2μ31 . For the fourth-order cumulant, also called the kurtosis
of the distribution, κ4 = μ4 − 4μ3 μ1 − 3μ22 + 12μ21 μ2 − 6μ41 . Note that for zero-
mean distributions κ4 = μ4 − 3μ22 . A distribution with zero kurtosis is known as
mesokurtic, like the normal distribution.
A distribution with positive kurtosis is known as super-Gaussian or leptokurtic. This
distribution is characterised by a higher maximum and heavier tail than the normal
distribution with the same variance. A distribution with negative kurtosis is known
as sub-Gaussian or platykurtotic and has lower peak and higher tails than the normal
distribution with the same variance.

B.3.3 Joint Probability Distributions

Let X and Y be two random variables over a sample space S with respective pdfs
fX () and fY (). For any x and y, the function f (x, y) defined by
 x  y
P r(X ≤ x; Y ≤ y) = f (u, v)dudv
−∞ −∞

is the joint probability density function. The definition can be extended in a similar
T
fashion to p random variables X1 , . . . , Xp . The vector x = X1 , . . . , Xp is called
a random vector, and its pdf is given by the joint pdf f (x) of these random variables.
Two random variables X and Y are said to be independent if

f (x, y) = fX (x)fY (y),

for all x and y. The pdfs fX () and fY () and associated cdfs FX () and FY () are called
marginal pdfs and marginal cdfs of X and Y , respectively. The marginal pdfs and
cdfs are linked to the joint cdf via

FX (x) = F (x, ∞) and fX (x) = d


dx FX (x),

and similarly for the second variable.


The expectation of any function h(X, Y ) is given by
 
E(h(X, Y )) = h(x, y)f (x, y)dxdy.

The covariance between X and Y is given by

cov(X, Y ) = E(XY ) − E(X)E(Y ).

The correlation between X and Y is given by


474 B Introduction to Probability and Random Variables

cov(X, Y )
ρX,Y = √
var(X)var(Y )

and satisfies −1 ≤ ρX,Y ≤ 1. If ρX,Y = 0, the random variables X and Y are


said to be uncorrelated. Two independent random variables are uncorrelated, but the
converse is not true.
For a random vector x = X1 , . . . , Xp with joint pdf f (x) = f (x1 , . . . , xp ),
the joint probability (or cumulative) distribution function is given by
 xp  x1
F (x1 , . . . , xp ) = ... f (u1 , . . . , up )du1 . . . dup .
−∞ −∞

The joint pdf is then given by

∂ p F (x1 , . . . , xp )
f (x1 , . . . , xp ) = .
∂x1 . . . ∂xp

Like the bivariate case, p random variables X1 , . . . , Xp are independent if the joint
cdf F () can be factorised into a product of the marginal cdfs as: F (x1 , . . . , xp ) =
F 1 (x1 ) . . . FXp (xp ), and similarly for the joint pdf. Also, we have fX1 (x1 ) =
X
∞ ∞
−∞ . . . −∞ f (x)dx2 . . . dxp , and similarly for the remaining marginal pdfs.

B.3.4 Expectation and Covariance Matrix of Random Vectors

T
Let x = X1 , . . . , Xp be a random vector with pdf f () and cdf F (). The
expectation of a function g(x) is defined by

E [g(x)] = g(x)f (x)dx.


The mean μ of x is obtained when g() is the identity, i.e. μ = xf (x)dx. Assuming
the random variables X1 , . . . , Xp have finite variance, the covariance matrix,  xx ,
of x is given by
 
 xx = E (x − μ) (x − μ)T = E xx T − μμT ,

with components [ xx ]ij = cov Xi , Xj . The covariance matrix is symmetric


 
positive semi-definite. Let us now designate by Dxx = diag σ12 , . . . , σp2 , the
diagonal matrix containing the individual variances of X1 , X2 , . . . , Xp , then the
correlation matrix ρXi ,Xj is given by:
B Introduction to Probability and Random Variables 475

−1/2 −1/2 −1/2 −1/2


 xx = E Dxx (x − μ) (x − μ)T Dxx = Dxx  xx Dxx .

B.3.5 Conditional Distributions

Let x and y be two random vectors over some state space with joint pdf fx,y (.). The
conditional probability density of y given x = x is given by

fy|x (y|x) = fx,y (x, y)/fx (x),

when fx (x) = 0; otherwise, one takes fy|x (x, y) = fx,y (x, y). Using this
conditional pdf, one can obtain the expectation of any function h(y) given x = x,
i.e.
 ∞
E (h(y)|x = x) = h(y)fy|x (y|x)dy,
−∞

which is a function of x only. As in the two-variable case, x and y are independent


if fy|x (y|x) = fy (y) or equally fx,y (.) = fx (.)fy (.). In particular, two (zero-
mean) random vectors x and y are uncorrelated if the covariance matrix vanishes,
i.e. E xy T = O.

B.4 Examples of Probability Distributions

B.4.1 Discrete Case


Bernoulli Distribution

A Bernoulli random variable X takes only two values, 0 and 1, i.e. X has two
outcomes: success or failure (true or false) with respective probabilities P r(X =
1) = p and P r(X = 0) = q = 1 − p. The pdf of this distribution can be written as
f (x) = px (1 − p)1−x , where x is either 0 or 1. A familiar example of a Bernoulli
trial is the tossing of a coin.

Binomial Distribution

A binomial random variable X with parameters n and 0 ≤ p ≤ 1, noted as X ∼


B(n, p), takes n + 1 values 0, 1, . . . , n with probabilities
 
P r(X = j ) = n
j pj (1 − p)n−j ,
476 B Introduction to Probability and Random Variables

 
where nj = j !(n−j n!
)! . Given a Bernoulli trial with probability of success p, a
Binomial trial B(n, p) consists of n repeated and independent Bernoulli trials.
Formally, if X1 , . . . , Xn are independent and identically
 distributed (IID) Bernoulli
random variables with probability of success p, then nk=1 Xk follows a binomial
distribution B(n, p). A typical example consists of tossing a coin n times, and the
number of heads is a binomial random variable.
Exercise Let X ∼ B(n, p), show that μ = E(X) = np, and σ 2 = var(X) =
np(1 − p). Show that the characteristic function φ(t) = E(eiXt ) is (peit + q)n .

Negative Binomial Distribution

In a series of independent Bernoulli trials, with constant probability of success p, the


random variable X representing the number of trials until r successes are obtained
is a negative binomial with parameters p and r. The parameter r can take values
1, 2, . . ., and for each value, we have a distribution, e.g.
 
j −1
P r(X = j ) = r−1 (1 − p)j −r pr ,

for j = r, r + 1, . . .. If we are interested in the first success, i.e. r = 1, one gets the
geometric distribution.
Exercise Show that the mean and variance of the negative binomial distribution are,
respectively, μ = r/p and σ 2 = r(1 − p)/p2 .

Poisson Distribution

A Poisson random variable with parameter λ > 0 can take all the integer numbers
and satisfies

λk −λ
P r(X = k) = k! e k = 0, 1, . . . .

Poisson distributions are typically used to analyse processes involving counts.


Exercise Show
 that for a Poisson distribution one has E(X) = λ = var(X). and
φ(t) = exp λ(eit − 1) .

B.4.2 Continuous Distributions


The Uniform Distribution

A continuous uniform random variable over the interval [a, b] has the following pdf:
B Introduction to Probability and Random Variables 477

1
f (x) = 1[a,b] (x),
b−a

where 1I () is the indicator of I , i.e. with a value of one inside the interval and zero
elsewhere.
Exercise Show that for a uniform random variable X over [a, b], E(X) = (a+b)/2
and var(X) = (a − b)2 /12.

The Normal Distribution

The normal (or Gaussian) distribution, N(μ, σ 2 ), has the following pdf:

1 (x − μ)2
f (x) = √ exp − .
σ 2π 2σ 2

Exercise Show that for the above normal distribution E(X) = μ and var(X) = σ 2 .
For a normal distribution, the random variable X−μ
σ has zero mean and unit variance
and is referred
x to as the standard normal. The cdf of X is generally noted as
(x) = −∞ f (u)du and is known as the error function. The normal distribution
is very useful and can be reached using a number of ways. For example, if Y is
Y −np
binomial B(n, p), Y ∼ B(n, p), then √np(1−p) approximates the standard normal
for large np. The same result holds for Y√−λ when Y follows a Poisson distribution
λ
with parameter λ. This result constitutes a particular case of a more general result,
namely the central limit theorem (see e.g. DeGroot and Shervish 2002, p. 282).
The Central Limit Theorem
Let X1 , . . . , Xn be a sequence of n IID random variables with mean μ and variance
0 < σ 2 < ∞ each, then for every number x
 
Xn − μ
lim P r √ ≤ x = (x),
n→∞ σ/ n

where () is the standard normal cdf, and Xn = n1 nk=1 Xk . The theorem says that
the (properly scaled) sum of a sequence of independent random variables with same
mean and (finite) variance is approximately normal.

The Exponential Distribution

The pdf of the exponential distribution with parameter λ > 0 is given by


478 B Introduction to Probability and Random Variables


λe−λx if x≥0
f (x) =
0 otherwise.

The Gamma Distribution

The pdf of the gamma distribution with parameters λ > 0 and β > 0 is given by
,
λβ β−1 −βx
(β) x e if x>0
f (x) =
0 otherwise,
∞
where (y) = 0 e−t t y−1 dt, for y > 0. If the parameter β < 0, the distribution is
known as Erlang distribution.
Exercise Show that for the above gamma distribution E(X) = β/λ, and var(X) =
β/λ2 . Show that φ(t) = (1 − it/λ)−β .

The Chi-Square Distribution

The chi-square random variable with n degrees of freedom (dof), noted as χn2 , has
the following pdf:
,
2−n/2 n2 −1 −x/2
(n/2) x e if x>0
f (x) =
0 otherwise.

Exercise Show that E(χn2 ) = n and var(χn2 ) = 2n.



If X1 , . . . , Xn are independent N(0, 1), the random variable nk=1 Xk2 is distributed
as χn2 with n dof. If Xk ∼ N(0, σ 2 ), then the obtained χn2 follows the σ 2 chi-square
distribution.
Exercise Find the pdf of the σ 2 chi-square distribution.
x
σ n 2−n/2 n2 −1 − 2σ 2
Answer f (x) = (n/2) x e for x > 0.

The Student Distribution

The student T distribution with n dof has the following pdf:

 − n+1
n−1/2 ( n+1
2 ) x2 2
f (x) = 1+ .
(1/2)(n/2) n
B Introduction to Probability and Random Variables 479

If X ∼ N (0, 1) and Y ∼ χn2 are independent, then T = √X


Y /n
has a student
distribution with n dofs.

The Fisher–Snedecor Distribution

The Fisher–Snedecor random variable with n and m dof, Fn,m , has the following
pdf:
,
( mn )n/2 ( n+m
2 )
n n+m
nx − 2
f (x) = (n/2)(m/2) x 2 −1 1 + m if x>0
0 otherwise.

2m2 (n+m−2)
Exercise Show that E(Fn,m ) = m
m−2 and var(Fm,n ) = 4(m−2)(m−4) .
X/n
If X ∼ χn2 and Y ∼ χm2 are independent, then Fn,m = Y /m follows a Fisher–
Snedecor distribution with n and m dof.

The Multivariate Normal Distribution

A multinormally distributed random vector x, noted as x ∼ Np (μ, ), has the pdf

1 1
f (x) = p exp − (x − μ)T  −1 (x − μ) ,
1
(2π ) ||
2 2
2

where μ and  are, respectively, the mean and the covariance matrix of x. The
characteristic function of this distribution is φ(t) = exp iμT t − 12 tT t . The
multivariate normal distribution is widely used and has some very useful properties
that are given below:
• Let A be a m × p matrix, and y = Ax, and then y ∼ Nm (Aμ, AAT ).
• If x ∼ Np (μ, ), and rank() = p, then

(x − μ)T  −1 (x − μ) ∼ χp2 .

• Let the random vector x ∼ Np (μ, ) partitioned as x T = (x T1 , x T2 ), where x 1


is q-dimensional (q < p), and similarly for the mean and the covariance matrix,
i.e. μ = μT1 , μT2 , and
 
 11  12
= ,
 21  22

then
480 B Introduction to Probability and Random Variables

(1) the marginal distribution of x 1 is multinormal Nq (μ1 ,  11 );


(2) x 1 and x 2 are independent if and only if  12 = O;
(3) if  22 is of full rank, then the conditional distribution of x 1 given x 2 = x2
is multinormal with
E (x 1 |x 2 =x2 ) =μ1 + 12  −1 −1
22 (x2 −μ2 ) and var (x 1 |x 2 ) = 11 − 12  22  21 .

The Wishart Distribution

The Wishart distribution with n dofs and parameter , a p × p symmetric positive


semi-definite matrix (essentially a covariance matrix), is the distribution of a p × p
random matrix X (a matrix whose elements are random variables) with pdf

⎨2np/2 π p(p−1)/4 |X| 2
n−p−1
(p exp 1
2 tr  −1 X if X is positive definite
f (X) = ||n/2 k=1 (
n+1−k
)
⎩ 2
0 otherwise.

If X1 , . . . , Xn are IID Np (0, ), p ≤ n, then the p × p random matrix


n
W= Xk XTk
k=1

has a Wishart probability distribution with n dof.

B.5 Stationary Processes

A (discrete) stochastic process is a sequence of random variables X1 , X2 , . . ., which


is a realisation of some random variable X, noted as Xk ∼ X. This stochastic
process is entirely characterised by specifying the joint probability distribution
of any finite set (Xk1 , . . . , Xkm ) from the sequence. The sequence is also called
sometimes time series, when the indices t = 1, 2, . . . are identified with “time”.
When one observes a finite realisation x1 , x2 , . . . xn of the previous sequence,
one also talks of a finite sample time series. Let μt = E(Xt ) and γ (t, k) =
cov(Xt , Xt+k ), for t = 1, , 2, . . ., and k = 0, 1, . . .. The process (or time series) is
said to be stationary if μt and γ (t, k) are independent of t. In this case one obtains

E(Xt ) = μ and γ (k) = γk = cov(Xt , Xt+k ) .

The function γ (), defined on the integers, is the autocovariance function of the
stationary stochastic process. The function
B Introduction to Probability and Random Variables 481

γk
ρk =
γ0

is the autocorrelation function.


We assume that we have a finite sample x1 , . . . , xn , supposed to be an inde-
pendent realisation of some random variable X of finite mean and variance. Let
x and s 2 be the sample meanand the sample variance, respectively, i.e. x =
1 n n
k=1 xk , and s = n−1 k=1 (xk − x) . Note that because the sample is
2 1 2
n−1
random, these estimators are also random. These estimators satisfy, respectively,
E(X) = E(X), and E(s 2 ) = var(X), and for this reason, they are referred to
as unbiased estimators of the mean and the variance of X, respectively. Also, the
function F̂ (x) = n1 #{xk , xk ≤ x} represents an estimator of the cdf F () of X, or
empirical distribution function (edf).
Given a finite sample x1 , . . . , xn , an estimator of the autocovariance function is
given by

1
n−k
γ̂k = (xi − x)(xi+k − x),
n
t=1

and the estimator


 of the autocorrelation is ρ̂k = γ̂k /γ̂0 . It can be shown that
var(ρ̂k ) = n1 ∞ i=1 ρ 2+ρ
i i+k ρi−k − 4ρk ρi ρi−k + 2ρi ρk . This expression can
2 2

be simplified further if the autocorrelation decays, e.g. exponentially. The compu-


tation of the variance of the sample estimators is useful in defining a confidence
interval for the estimators.
Appendix C
Stationary Time Series Analysis

This appendix gives a brief introduction to stationary time series analysis for the
univariate and multivariate cases.

C.1 Autocorrelation Structure: One-Dimensional Case

A (discrete) time series is a sequence of numbers xt , t = 1, 2 . . . , n. In time series


exploration and modelling, a time series is considered as a realisation of some
stochastic process, i.e. a sequence of random variables. So, conceptually the time
series xt , t = 1, 2, . . . is considered as a sequence of random variables and the
corresponding observed series is simply a realisation of these random variables.
The time series is said to be (second-order) stationary if the mean is constant and
the covariance between any xt and xs is a function of t − s, i.e.

E (xt ) = μ and cov (xt , xs ) = γ (t − s). (C.1)

C.1.1 Autocovariance/Correlation Function

The autocovariance function of a stationary time series xt , t = 1, 2 . . ., is defined


by

γ (τ ) = cov (xt+τ , xt ) = E (xt+τ − μ) (xt − μ) . (C.2)

It is clear that the variance of the time series is simply σ 2 = γ (0). The
autocorrelation function ρ() is given by

© Springer Nature Switzerland AG 2021 483


A. Hannachi, Patterns Identification and Data Mining in Weather and Climate,
Springer Atmospheric Sciences, https://doi.org/10.1007/978-3-030-67073-3
484 C Stationary Time Series Analysis

γ (τ )
ρ(τ ) = . (C.3)
σ2

Properties of the Autocovariance Function

The autocovariance function γ (.) satisfies the following properties:


• |γ (τ )| ≤ γ (0) = σ 2 .
• γ (τ ) = γ (−τ ).
• For any p integers τ1 , τ2 , . . . τp and real numbers a1 , a2 , . . . ap , we have


p
γ (τi − τj )ai aj ≥ 0, (C.4)
i,j =1

and the autocovariance function is said to be non-negative definite or positive


semi-definite.
Exercise Derive the above properties.
Hint Use the fact that var (λxt + xt+τ ) ≥ 0 for any real λ. For the last one, use the
fact that var( i ai xτi ) ≥ 0.

C.1.2 Time Series Models

Let εt , t ≥ 1, be a sequence of IID random variables with zero mean and variance
σε2 . This sequence is called white noise. The autocovariance of such a process is
simply a Dirac pulse, i.e.

γε (τ ) = δτ ,

i.e. one at τ = 0, and zero elsewhere. Although the white noise process is the
simplest time series, it remains, however, hypothetical because it does not exist
in practice. Climate and other time series are autocorrelated. Simple linear time
series models have been formulated to explain this autocorrelation. The models
we are reviewing here have been formulated in the early 1970s and are known as
autoregressive moving average (ARMA) models (Box and Jenkins 1970; see also
Box et al. (1994)).

Some Basic Notations

Given a time series (xt ) where t is either continuous or discrete, various operations
can be defined.
C Stationary Time Series Analysis 485

• Backward shift or delay operator B—This is defined for discrete time series by

Bxt = xt−1 . (C.5)

More generally, for any integer m ≥ 1, B m xt = xt−m . By analogy, one can define
the inverse operator B −1 , which is the forward operator. It is clear from Eq. (C.5)
that for a constant c, Bc = c. Also for any integers m and n, B m B n = B m+n .
Furthermore, for any time series (xt ), t = 1, 2, . . ., we have

1
xt = (1 + αB + α 2 B 2 + . . .)xt = xt + αxt−1 + . . . (C.6)
1 − αB

whenever |α| < 1.


Remark Consider the mean x of a finite time series (xt ), t = 1, . . . n. Then,
 n 
1 1  i
n
1 1 − Bn
x= xi = B xn = xn .
n n n 1−B
i=1 i=0

• Differencing operator ∇ = 1 − B—This is defined by

∇xt = (1 − B)xt = xt − xt−1 . (C.7)

For example, ∇ 2 xt = (1 − B)2 xt = xt − 2xt−1 + xt−2 .


• Seasonal differencing ∇k = 1 − B k —This operator is frequently used to deal
with seasonality.
• Gain operator—It is a simple linear multiplication of the time series, i.e. axt . The
parameter a is referred to as gain.
• Differencing operator D—For a continuous time series, {y(t), a ≤ t ≤ b}, the
differencing operator is simply a differentiation D, i.e.

dy(t)
Dy(t) = (C.8)
dt
whenever this differentiation is possible.
• Continuous shift operator—Another useful operator normally encountered in
filtering is the shift operator in continuous time series, B u , defined by

B u y(t) = y(t − u). (C.9)

This operator is equivalent to the backward shift operator in discrete time series.
It can be shown that

B u = e−uD = e−u dt .
d
(C.10)
486 C Stationary Time Series Analysis

Fig. C.1 Examples of time series of AR(1) models with lag-1 autocorrelation 0.5 (a) and −0.5
(b)

Exercise Derive Eq. (C.10).


Hint Use a Taylor expansion of y(t − u).

ARMA Models

• Autoregressive schemes: AR(p)


Autoregressive models of order p are given by
 p

xt = φ1 xt−1 + φ2 xt−2 + . . . + φp xt−p + εt = φk B k xt + εt . (C.11)
k=1

The white noise εt is only correlated with xs for s ≥ t. When p = 1, one gets
the familiar Markov or first-order autoregressive, AR(1), model also known as red
noise. Figure C.1 shows an example of generated time series of an AR(1) model with
opposite lag-1 autocorrelations. The red noise is a particularly simple model that is
frequently used in climate research and constitutes a reasonably good model for
many climate processes, see e.g. Hasselmann (1976, 1988), von Storch (1995a,b),
Penland and Sardeshmukh (1995), Hall and Manabe (1997), Feldstein (2000) and
Wunsch (2003) to name just a few.
• Moving average scheme: MA(q)
Moving average models of order q, MA(q), are defined by


q
xt = εt + φ1 εt−1 + . . . + φq εt−q = 1 + φk B k εt . (C.12)
k=1
C Stationary Time Series Analysis 487

It is possible to combine both the above models, AR(p) and MA(q), into just one
single model, the ARMA model.
• Autoregressive moving average scheme: ARMA(p, q)
It is given by
 

p 
q
1− φk B k
xt = 1 + θk B k εt . (C.13)
k=1 k=1

The ARMA(p, q) model can also be written in a more compact


q form as φ(B)xt =
p
θ (B)εt , where φ(z) = 1 − k=1 φk zk and θ (z) = 1 + k=1 θk zk . Stationarity of
the ARMA(p, q) model, Eq. (C.13), requires that the roots of

φ(z) = 0 (C.14)

be outside the unit circle, see e.g. Box et al. (1994) for details.
Various ways exist to identify possible models for a given time series. For
example, the autocorrelation function of an ARMA model is a damped exponential
and/or sine waves that could be used as a guide to select models. Another useful
measure is the partial autocorrelation function. It exploits the fact that, for example,
for an AR(p) model the autocorrelation function can be entirely described by
the first p lagged autocorrelations whose behaviour is described by the partial
autocorrelation, which is a function that cuts off after lag p for the AR(p) model.
Alternatively, one can use concepts from information theory (Akaike 1969, 1974)
by fitting a whole range of models, computing the residual estimates ε̂ and their
variances (the mean squared errors) σ̂ 2 and then deriving, for example, the Akaike
information criterion (AIC) given by

2
AI C = log(σ̂ 2 ) + (P + 1), (C.15)
n
where P is the number of parameters to be estimated. The best model corresponds
to the smallest AIC.

C.2 Power Spectrum

We assume that we have a stationary time series xt , t = 1, 2 . . ., with summable


autocovariance function γ (.), i.e. k γ (k) < ∞. The spectral density function, or
power spectrum, f (ω) is defined by

1 
f (ω) = γ (k)e−ikω . (C.16)

k=−∞
488 C Stationary Time Series Analysis

Fig. C.2 Autocorrelation function of AR(1) models with lag-1 autocorrelations 0.5 (a) and
−0.5(b)

Using the symmetry of the autocovariance function, the power spectrum becomes
 ∞
σ2 
f (ω) = 1+2 ρ(k)coskω . (C.17)

k=1

Remark Similar to power spectrum, the bispectrum is the Fourier transform of the
bicovariance function, and is related to the skewness (e.g. Pires and Hannachi 2021)
Properties of the Power Spectrum
• f () is even, i.e. f (−ω) = f (ω).
• f (ω) ≥ 0 for all ω in [−π, π].
π π
• γ (τ ) = −π eiωτ f (ω)dω = −π cosτ ωf (ω)dω, i.e. the autocovariance function
is the inverse Fourier transform of the power spectrum. Note π that from the last
property, one gets, in particular, the familiar result σ 2 = −π f (ω)dω, i.e. the
power spectrum distributes the variance.
Examples
2
• The power spectrum of a white noise is constant, i.e. f (ω) = 2π
σ
.
• For a red noise time series (of zero mean), xt = αxt−1 + εt , the auto-
correlation function is ρ(τ ) = α |τ | , and its power spectrum is f (ω) =
σ2 2 −1 (Figs. C.2, C.3).
2π 1 − 2αcosω + α

Exercise Derive the relationship between the variance of xt and that of the
innovation εt in the red noise model.
Hint σ 2 = σε2 (1 − α 2 )−1 .
Exercise

 autocorrelation function ρ(.) of the AR(1) model xt = φ1 xt−1 + εt .


1. Compute the
2. Compute k≥0 ρ(k).
C Stationary Time Series Analysis 489

Fig. C.3 Power spectra of two AR(1) models with lag-1 autocorrelation 0.5 and −0.5

3. Write ρ(τ ) = e−τ/T0 , and calculate T0 as a function of φ1 .



4. Calculate 0 e−τ/T0 dτ.
5. Reconcile the expression from 2 with that from 4.
Hint 1. For k ≥ 1, xt xt−k = φ1 xt−1 xt−k + εt xt−k yields ρ(k) = φ1k .
1
2. 1−φ 1
.
3. T0 = − logφ
1
1
.
4. T0 .
5. T0−1 = −log(1 − (1 − φ1 )) ≈ 1 − φ1 .
General Expression of the ARMA Spectra
A direct method to compute the spectra of ARMA processes is to make use of results
from linear filtering as outlined in Sect. 2.6 of Chap. 2.
Exercise Consider the delay operation yt = Bxt . Find the relation between
the Fourier transforms y(ω) and x(ω) of yt and xt , respectively. Find a similar
relationship when yt = αxt + βBxt .
Answer y(ω) = (α + βeiω )x(ω).
Let xt , t = 0, 1, 2, . . ., be a stationary time series, and consider the filtering
equation:


p
yt = αk xt−k .
k=1
490 C Stationary Time Series Analysis

Using the above exercise, we get y(ω) = A(eiω )x(ω), where


p
A(z) = αk zk ,
k=1

where the function a(ω) = A(eiω ) is the frequency response function, which is the
Fourier transform of the transfer function. Now, the power spectrum of yt is linked
to that of xt following:

fy (ω) = |a(ω)|2 fx (ω).

The application of this to the ARMA time series model (C.13), see also Chap. 2,
yields
) )
) θ (eiω ) )2
fx (ω) = σε2 )) ) . (C.18)
φ(eiω ) )

In the above equation it is assumed that the roots of φ(z) are outside unit circle
(stationarity) and similarly for θ (z) (for invertibility, i.e. εt is written as a convergent
power series in xt , xt−1 , . . .).

C.3 The Multivariate Case

Let xt , t = 1, 2, . . ., be a multivariate time series where each element xt =


xt1 , xt2 , . . . xtp is now p-dimensional. We suppose that xt is of mean zero and
covariance matrix  0 .

C.3.1 Autocovariance Structure

The lagged cross- or autocovariance matrix (τ ) is defined by


 
(τ ) = E xt+τ xTt . (C.19)

The elements of (τ ) are [(τ )]ij = E xt+τ,i xt,j . The diagonal elements are the
autocovariances of the individual unidimensional time series forming xt , whereas
its off-diagonal elements are the lagged cross-covariances. The lagged covariance
matrix has the following properties:
• (−τ ) = [(τ )]T .
• (0) is the covariance matrix  0 of xt .
C Stationary Time Series Analysis 491

• (τ ) is positive semi-definite, i.e. for any integer m > 0, and real vectors
a1 , . . . , am ,


m
aTi (i − j )aj ≥ 0. (C.20)
i,j =1

Similarly, the lagged cross-correlation matrix

−1/2 −1/2
ϒ(τ ) =  0 (τ ) 0 ,

 − 1
whose elements ρij (τ ) are [ϒ(τ )]ij = γij (τ ) γii (0)γjj (0) 2 , has similar
properties. Furthermore, we have

|ρij (τ )| ≤ 1.

Note that the inequality γij (τ ) ≤ γij (0), for i = j , is not true in general.

C.3.2 Cross-Spectrum

As for the univariate case, we can define the spectral density matrix F(ω) of xt ,
t = 1, 2, . . . for −π ≤ ω ≤ π as the Fourier transform of the autocovariance
matrix:

1  −iτ ω
F(ω) = e (τ ) (C.21)
2π τ =−∞

whenever
 τ (τ ) < ∞, where . is a matrix norm. For example, if
τ |γ ij (τ )| < ∞, for i, j = 1, 2, . . . p, then F(ω) exists. Unlike the univariate case,
however, the spectral density matrix can be complex because () is not symmetric.
The diagonal elements of F(ω) are real because they represent the power spectra of
the individual univariate time series that constitute xt . The real part of F(ω) is the
co-spectrum matrix, whereas the imaginary part is the quadrature spectrum matrix.
The spectral density matrix has the following properties:
• F(ω) is Hermitian, i.e.

F(−ω) = [F(ω)]∗T ,

where (∗ ) represents the complex conjugate.


• The autocovariance matrix is the inverse Fourier transform of F(ω), i.e.
492 C Stationary Time Series Analysis

 π
(τ ) = F(ω)eiτ ω dω. (C.22)
−π
π 
•  0 = −π F(ω)dω, and 2π F(0) = k (k).
• F(ω) is semi-definite (Hermitian), i.e. for any integer m > 0, and complex
numbers c1 , c2 , . . . , cp , we have


p
∗T
c F(ω)c = ci∗ Fij (ω)cj ≥ 0, (C.23)
i,j =1

T
where c = c1 , c2 , . . . , cp . The coherence and phase between xt,i and xt,j ,
t = 1, 2, . . ., for i = j , are, respectively, given by

|Fij (ω)|2
cij (ω) = , (C.24)
Fii (ω)Fjj (ω)

and

I m(Fij (ω))
φij (ω) = Atan . (C.25)
Re(Fij (ω))

The coherence, Eq. (C.24), gives essentially a measure of the square of the
correlation coefficient between both the time series in the frequency domain. The
phase, Eq. (C.25), on the other hand, gives a measure of the time lag between the
time series.

C.4 Autocorrelation Structure in the Sample Space

C.4.1 Autocovariance/Autocorrelation Estimates

We assume that we have a finite sample of a time series, xt , t = 1, 2 . . . n. There


are various ways to estimate the autocovariance function γ (). The most widely used
estimators are

1
n−τ
γ̂1 (τ ) = (xt − x) (xt+τ − x) (C.26)
n
t=1

and

1 
n−τ
γ̂2 (τ ) = (xt − x) (xt+τ − x) . (C.27)
n−τ
t=1
C Stationary Time Series Analysis 493

We can assume for simplicity that the sample mean is zero. It is clear from
Eq. (C.26) and Eq. (C.27) that γ̂1 () is slightly biased, with bias of order n1 , i.e.
asymptotically unbiased, whereas γ̂2 () is unbiased. The asymptotically unbiased
estimator γ̂1 () is, however, consistent, i.e. its variance goes to zero as the sample
size goes to infinity, whereas the estimator γ̂2 () is inconsistent with its variance
tending to infinity with the sample size (see e.g. Jenkins and Watts 1968). But, for
a fixed lag both the estimators are asymptotically unbiased and with approximate
variances satisfying (Priestly 1981, p. 328)

var γ̂1 (τ ) ≈ O( n1 ) and var γ̂2 (τ ) ≈ O( n−k


1
).

Similarly, the autocorrelation function can be estimated by

γ̂ (τ )
ρ̂(τ ) = , (C.28)
σ̂ 2

whereγ̂ () is an estimator of the autocovariance function, and σ̂ 2 = (n −


1)−1 nt=1 (xt − x)2 is the sample variance. The sample estimate ρ̂1 (τ ), τ =
0, 1, . . . n − 1, is positive semi-definite, see Eq. (C.4), whereas this is not in general
true for ρ̂2 (.), see e.g. Priestly (1981, p. 331). The graph showing the sample
autocorrelation function ρ̂(τ ) versus τ is normally referred to as correlogram. A
simple and useful significance test for the sample autocorrelation function is one
based on asymptotic normality and white noise, namely,

E ρ̂(τ ) ≈ 0 for τ = 0

and

var ρ̂(τ ) ≈ 1
n for τ = 0.

These approximations can be used to construct confidence intervals for the sample
autocorrelation function.

C.4.2 The Periodogram


Raw Periodogram

We consider again a centred sample of a time series, xt , t = 1, 2 . . . n, with sample


autocovariance function γ̂ (). In spectral estimate we normally consider the Fourier
frequencies ωk = 2πn k , for k = −[ n−1
2 ], . . . , [ 2 ], where [x] is the integer part of x.
n

The frequency 2t (radians/time unit), where t is the sampling interval, is known
494 C Stationary Time Series Analysis

as the Nyquist frequency.1 The Nyquist frequency represents the highest frequency
that can be resolved, and therefore, the power spectrum can only be estimated for
frequencies less than the Nyquist frequency.
The sequence of the following complex vectors:

1  T
ck = √ eiωk , e2iωk , . . . , einωk
n

for k = 1, 2, . . . n, is orthonormal, i.e. c∗T


k cl = δkl , and therefore, any n-
dimensional complex vector x can be expressed as
n
[2]

x= αk ck , (C.29)
k=−[ n−1
2 ]

where αk = c∗T k x. The application of Eq. (C.29) to the vector (x


T
1 , . . . , xn ) yields

the discrete Fourier transform of the time series, i.e. αk = √1n nt=1 xt e−itωk . The
periodogram of the time series is defined as the squared amplitude of the Fourier
coefficients, i.e.
) n )
1 ))  −itωk ))2
In (ωk ) = ) xt e ) . (C.30)
n
t=1

Now, from Eq. (C.29) one easily gets


n n
[2] [2]

n  
(n − 1)σ̂ =
2
xt2 = |αk | =
2
In (ωk ). (C.31)
t=1 k=−[ n−1 k=−[ n−1
2 ] 2 ]

As for the power spectrum, the periodogram also distributes the sample variance, i.e.
the periodogram In (ωk ) represents the contribution to the sample variance from the
frequency ωk . The periodogram can be seen as an estimator of the power spectrum,
Eq. (C.16). In fact, by expanding Eq. (C.30) one gets
⎡ ⎤
1 ⎣ 2  
n n−1  
In (ωp ) = xt + xt xτ eikωp + e−ikωp ⎦
n
t=1 k=1 |t−τ |=k

1 Or 1
2t if the frequency is expressed in (1/time unit). For example, if the sampling time interval
is unity, then the Nyquist frequency is 12 .
C Stationary Time Series Analysis 495


n−1
= γ̂ (k) cos(ωp k). (C.32)
k=−(n−1)

1
Therefore 2π In (ωp ) is a candidate estimator for the power spectrum f (ωp ). Fur-

thermore, it can be seen from Eq. (C.32) that E In (ωp ) = n−1 k=−(n−1) E γ̂ (k)
cos(ωp k), i.e.

E In (ωp ) ≈ 2πf (ωp ). (C.33)

The periodogram is therefore an asymptotically unbiased estimator of the power


spectrum. However, it is not consistent because it can be shown to have a constant
variance. The periodogram is also highly erratic with sampling fluctuations that do
not vanish as the sample size increases, and therefore, some smoothing is required.

Periodogram Smoothing

Various ways exist to construct a consistent estimator of the spectral density


function. Smoothing is the most widely used way to achieve consistency. The
smoothed periodogram is obtained by convolving the (raw) periodogram using a
“spectral window” W () as
n
[2]
1 
fˆ(ω) = W (ω − ωk )In (ωk ). (C.34)

k=−[ n−1
2 ]

The spectral window is a symmetric kernel function that integrates to unity and
decays at large values. This smoothing is equivalent to a discrete Fourier transform
of the weighted autocovariance estimator using a (time domain) lag window λ(.) as

1 
n−1
fˆ(ω) = λ(k)γ̂ (k) cos(ωp k). (C.35)

k=−(n−1)

The sum in Eq. (C.35) is normally truncated at the truncation point of the lag
window.
The spectral window W () is the Fourier transform of the lag window, whose
aim is to neglect the contribution, in the sample autocovariance function, from
large lags. This means that localisation in time is associated with broadness in the
spectral domain and vice versa. Figure C.4 illustrates the relationship between time
(or lag) window and spectral window. Various lag/spectral windows exist in the
literature. Two examples are given below, namely, the Bartlett (1950) and Parzen
(1961) windows:
496 C Stationary Time Series Analysis

Fig. C.4 Illustration of the relationship between time and spectral windows

• Bartlett window for which the lag window is defined by



1− M
τ
for |τ | < M
λ(τ ) = (C.36)
0 otherwise,

and the corresponding spectral window is


 2
M sin(π Mω)
W (ω) = . (C.37)
n π Mω

• Parzen window:

⎪ τ 2
⎨1 − 6 M + 6
τ 3
for |τ | ≤ M
M 2
λ(τ ) = 2 1− Mτ 3 (C.38)


0 otherwise,

and
 4
6 sin(Mω/4)
W (ω) = . (C.39)
πM3 sinω/2
C Stationary Time Series Analysis 497

Fig. C.5 Parzen window showing W (ω) in ordinate versus ω in abscissa for different values of
the parameter M

Figure C.5 shows an example of the Parzen window for different values of the
parameter M. Notice in particular that as M increases the lag window becomes
narrower. Since M can be regarded as a time resolution, it is clear that the variance
increases with M and vice versa.
Remark There are other ways to estimate the power spectrum such as the maximum
entropy method (MEM). The MEM estimator is achieved by fitting an autoregres-
sive model to the time series and then using the model parameters to compute
the power spectrum, see e.g. Burg (1972), Ulrych and Bishop (1975), and Priestly
(1981).
The cross-covariance and the cross-spectrum can be estimated in a similar way to the
sample covariance function and sample spectrum. For example, the cross-covariance
between two zero-mean time series samples xt , and yt , t = 1, . . . n, can be estimated
using

1
n−τ
γ̂12 (τ ) = xt yt+τ (C.40)
n
t=1

for τ = 0, 1, . . . , n − 1, which is then complemented by symmetry, i.e. γ̂12 (−τ ) =


γ̂21 (τ ). Similarly, the cross-spectrum can be estimated using

1 
M
fˆ12 (ω) = λ(k)γ̂12 (k)eiωk . (C.41)

k=−M
Appendix D
Matrix Algebra and Matrix Function

D.1 Background

D.1.1 Matrices and Linear Operators


Matrices

Given two n-dimensional vectors x = (x1 , . . . , nn )T and y = (y1 , . . . , yn )T and


scalar λ, then x + y and λx are also n-dimensional vectors given, respectively,
by (x1 + y1 , . . . , xn + yn )T and (λx1 , . . . , λxn )T . The set En of all n-dimensional
vectors is called a linear (or vector) space. It is n-dimensional if it is real and 2n-
dimensional if it is complex. For the real case, for example, a natural basis of the
space is (e1 , . . . , en ), where ek contains zero everywhere except at the kth position
where it is one.
A matrix X of order n × p is a collection of (real or complex) numbers xij ,
i = 1, . . . , n, j = 1, . . . , p, taking the following form:
⎛ ⎞
x11 x12 . . . x1p
⎜ x21 x22 . . . x2p ⎟
⎜ ⎟
X=⎜ . .. .. ⎟ = xij .
⎝ .. . . ⎠
xn1 xn2 . . . xnp

When p = 1, one obtains a n-dimensional vector, i.e. one column of n numbers


x = (x1 , . . . , xn )T . When n = p, the matrix is called square.
Similar operations can be defined on matrices, i.e. for any two n × p matrices
X = xij and Y = yij , and scalar λ, we have X + Y = xij + yij and λX =
λxij . The set of all n × p real matrices is a linear space with dimension np.

© Springer Nature Switzerland AG 2021 499


A. Hannachi, Patterns Identification and Data Mining in Weather and Climate,
Springer Atmospheric Sciences, https://doi.org/10.1007/978-3-030-67073-3
500 D Matrix Algebra and Matrix Function

Matrices and Linear Operators

Any n×p matrix X is a representation of a linear operator from a linear space Ep into
a linear space En . For example, if the space En is real, then one gets En = Rn . Let us
denote by xk = (x1k , . . . , xnk )T , and then the matrix is written as X = x1 , . . . , xp .
The kth column xk of X represents the image of the kth basis vector ek of Ep , i.e.

Xek = xk .

D.1.2 Operation on Matrices


Transpose

The transpose of a n × p matrix X = (xij ) is the p × n matrix XT = yij where


yij = xj i .

Product

The product of n × p and p × q matrices X


and Y, respectively, is the n × q matrix
p
Z = XY, defined by Z = zij , with zij = k=1 xik ykj .

Diagonal

A diagonal matrix is n × n matrix of the form A = xij δij , where δij is the
Kronecker symbol. For a n × p matrix A, the main diagonal is given by all the
elements aii , i = 1, . . . , min(n, p).

Trace
n
The trace of a square n × n matrix X = (xij ) is given by tr (X) = k=1 xkk .

Determinant

Let X = (xij ) be a p × p matrix, and then, the determinant |X| of X is a multilinear


function of the columns of X and is defined by

 !
p
det (X) = |X| = (−1)|π | xkπ(k) , (D.1)
π k=1
D Matrix Algebra and Matrix Function 501

where the sum is over all permutations π() of {1, 2, . . . , p} and |π | is either +1 or
−1 depending on whether π() is written as the product of an even or odd number
of transpositions, respectively. The determinant can also be defined in a recurrent
manner as follows. For a scalar x, the determinant is simply x, i.e. det (x) = x.
Then, for a p × p matrix X, the determinant is given by
 
|X| = (−1)i+j xij ij = (−1)i+j xij ij ,
j i

where ij is the determinant of the (p − 1) × (p − 1) matrix X−(i,j ) obtained by


deleting the ith line and j th column. The determinant ij is referred to as the minor
of xij , and the term cij = (−1)i+j ij as the cofactor of xij . It can be shown that


p
xik cj k = |X|δij , (D.2)
k=1

where δij is the Kronecker symbol. The matrix C = (cij ) is the matrix of cofactors
of X.

Matrix Inversion

• Conventional inverse
When |X| = 0, the square p × p matrix X = (xij ) is invertible and its inverse
X−1 satisfies XX−1 = X−1 X = Ip . It is clear from Eq. (D.2) that when X is
invertible the inverse is given by

1 T
X−1 = C , (D.3)
|X|

where C is the matrix of cofactors of X. In what follows, the elements of X−1 are
denoted by x ij , i, j = 1, . . . n, i.e. X−1 = (x ij ).
• Generalised inverse
Let X be a n × p matrix, and the generalised inverse of X is the p × n matrix X−
satisfying the following properties:

XX− and X− X are symmetric


XX− X = X, and X− XX− = X− .

The generalised inverse is unique and is also known as pseudoinverse or Moore–


Penrose inverse.
• Rank
The rank of a n × p matrix X is the number of columns of X or its transpose that
are linearly independent. It is the number of rows (and columns) of the largest
502 D Matrix Algebra and Matrix Function

invertible square submatrix of X. We have automatically: rank(X) ≤ min(n, p).


The matrix is said to be of full rank if rank(X) = min(n, p).

Symmetry, Orthogonality and Normality

Let X be a real p × p square matrix, and then


• X is symmetric when XT = X;
• it is orthogonal (or unitary) when XXT = XT X = Ip ;
• it is normal when it commutes with its transpose, i.e. XXT = XT X,
when the matrix is complex;
• X is Hermitian when X∗T = X.
For the complex case, the other two properties remain the same except that the
transpose (T ) is replaced by the complex conjugate transpose (∗T ).

Direct Product

Let A = (aij ) and B = (bij ) two matrices of respective order n × p and q × r. The
direct product of A and B, noted as A × B or A ⊗ B, is the nq × pr matrix defined
by
⎛ ⎞
a11 B a12 B . . . a1p B
⎜ a21 B a22 B . . . a2p B ⎟
⎜ ⎟
A⊗B=⎜ . .. .. ⎟ .
⎝ .. . . ⎠
an1 B an2 B . . . anp B.

The above product is indeed a left direct product. A direct product is also known
as Kronecker product. There is also another type of product between two n × p
matrices of the same order A = (aij ) and B = (bij ), and that is the Hadamard
product given by

A  B = aij bij .

Positivity

A square p × p matrix A is positive semi-definite if xT Ax ≥ 0, for any p-


dimensional vector x. It is definite positive when xT Ax > 0 for any non-zero
p-dimensional vector x.
D Matrix Algebra and Matrix Function 503

Eigenvalues/Eigenvectors

Let A a p ×p matrix. The eigenvalues of A are given by the set of complex numbers
λ1 , . . . , λp solution to the algebraic polynomial equation:

|A − λIp | = 0.

The eigenvectors of u1 , . . . , up of A are the solutions to the eigenvalue problem:

Au = λu,

where λ is an eigenvalue. The eignevectors are normally chosen to have unit-length.


For any invertible p × p matrix B, the eigenvalues of A and B−1 AB are identical.

Some Properties of Square Matrices

Let A and B be two p × p matrices. Then we have:


• tr(αA + B) = αtr(A) + tr(B), for any number α.
• tr(AB) = tr(BA).
• tr(A) = tr(P−1 AP) for any nonsingular p × p matrix P.
• tr AxxT = xT Ax, where x is a vector.
• (AB)−1 = B−1 A−1 .
• det (AB) = |AB| = |A||B|.
• |A ⊗ B| =|A|p |B|p and tr (A ⊗ B) = tr(A)tr(B).
p
• tr (A) = k=1 λk , where λ1 , . . . , λp are the eigenvalues of A.
• The eigenvectors corresponding to different eigenvalues are orthogonal.
• rank (A) = #{λk ; λk = 0}.
• If A is (real) symmetric, then its eigenvalues λ1 , . . . , λp and its eigenvectors
P = u1 , . . . , up are real. If it is positive semi-definite, then its eigenvalues
are all non-negative. If the matrix is Hermitian, then its eigenvalues are all non-
negative. For both cases, we have A = PP∗T , where  = diag λ1 , . . . , λp .
• If A is normal, i.e. commuting with its Hermitian transpose, then it is diagonal-
isable and has a complete set of orthogonal eigenvectors.

Singular Value Decomposition (SVD)

The SVD theorem has different forms, see e.g. Golub and van Loan (1996), and
Linz and Wang (2003). In its simplest form, any n × p real matrix X, of rank r, can
be decomposed as

X = UDVT , (D.4)
504 D Matrix Algebra and Matrix Function

where the n × r and p × r matrices U and V are orthogonal, i.e. UT U = VT V = Ir


and D = diag (d1 , . . . , dr ) where dk > 0, k = 1, . . . r, are the singular values of X.

Theorem of Sums of Products

Let A, B, C and D be p × p, p × q, q × q and q × p matrices, respectively, then


−1
• (A + BCD)−1 = A−1 − A−1 B C−1 + DA−1 B DA−1 and
• |A + BD| = |A||Ip + A−1 BD| = |A||Iq + DA−1 B|,
when all necessary inverses exist.

Theorem of Partitioned Matrices

Let A be a block partitioned matrix as


 
A11 A12
A= ,
A21 A22

then we have

|A| = |A11 ||A22 − A21 A−1 −1


11 A12 | = |A22 ||A11 − A12 A22 A21 |,

when all necessary inverses exist. Furthermore, if A is invertible with inverse


denoted by
 
−1 A11 A12
A = ,
A21 A22

then
 −1
A11 = A11 − A12 A−1
22 A21 ,

A12 = −A11 A12 A−1 −1


22 = −A11 A12 A ,
22

 −1
A22 = A22 − A21 A−111 A12 , and

A21 = −A22 A21 A−1 −1 −1


11 = −A22 A21 A11 .
D Matrix Algebra and Matrix Function 505

D.2 Most Useful Matrix Transformations

• LU decomposition
For any nonsingular n × n matrix A, there exists some permutation matrix P such
that

PA = LU,

where L is a lower triangular matrix with ones in the main diagonal and U is an
upper triangular matrix.
• Cholesky factorisation
For any symmetric positive semi-definite matrix A, there exists a lower triangular
matrix L such that

A = LLT .

• QR decomposition
For any m × n matrix A, with m ≥ n say, there exist a m × m unitary matrix Q and
a m × n upper triangular matrix R such that

A = QR. (D.5)

The proof of this result is based on Householder transformation and finds a sequence
of n unitary matrices Q1 , . . . , Qn such that Qn . . . Q1 A = R. If at step k, we have
say
 
Lk | B
Qk . . . Q1 A = ,
Om−k,k | c|C

where Lk is a k × k upper triangular matrix, then Qk+1 will transform the vector c =
(ck+1 , . . . , cm )T into the vector d = (d, 0, . . . , 0)T without changing the structure
of Lk and the (m − k) × k null matrix Om−k,k . This matrix is known as Householder
transformation and has the form:
 
Ik Ok,m−k
Qk+1 = ,
Om−k,k Pm−k

where Pm−k = Im−k − 2


uT u
uuT , where u = c + c (1, 0, . . . , 0).
Remark The following formula can be useful when expressing matrix products.
Consider two p×q and r ×q matrices U and V, respectively, with U = u1 , . . . , uq
q j
and V = v1 , . . . , vq . Since the ith and j th element of UVT is k=1 uik vk , and
506 D Matrix Algebra and Matrix Function

j
because uik vk is the ith and j th element of uk vTk , one gets


q
UVT = uk vTk .
k=1

Similarly, if  = λ1 , . . . , λq is a diagonal matrix, then we also have UVT =


q T
k=1 λk uk vk .

D.3 Matrix Derivative

D.3.1 Vector Derivative


T
Let f (.) be a scalar function of a p-dimensional vector x = x1 , . . . , xp . The
∂f
partial derivative of f (.) with respect to xk is noted ∂x k
and is defined in the usual
way. The derivative of f (.) with respect to x is given by
 T
∂f ∂f ∂f
= ∇f (x) = ,..., (D.6)
∂x ∂x1 ∂xp

and is also known as the gradient of f (.) at x. The differential of f () is then written
p ∂f T
as df = K=1 ∂x k
dxk = ∇f (x)T dx, where dx = dx1 , . . . , dxp .
Examples
• For a linear form f (x) = aT x, ∇f (x) = a.
• For a quadratic form f (x) = xT Ax, ∇x f = 2Ax.
For a vector function f(x) = f1 (x), . . . , fq (x) , where f1 (.), . . . , fq (.) are scalar
functions of x, the gradient in this case is called the Jacobian matrix of f(.) and is
given by
   ∂f 
j
Df(x) = ∇f1 (x) , . . . , ∇fq (x) =
T T
(x) . (D.7)
∂xi

D.3.2 Matrix Derivative

Definition Let X = xij = x1 , . . . , xq be a p ×q matrix and Y = yij = F (X)


a r × s matrix function of X. We assume that the elements yij of Y are differentiable
scalar function with respect to the elements xij of X. We distinguish two cases:
D Matrix Algebra and Matrix Function 507

1. Scalar Case
If Y = F (X) is a scalar function, then to define the derivative of F () we first use
 T
the vec (.) notation given by vec (X) = xT1 , . . . , xTq transforming X into a pq-
dimensional vector. The differential of F (X) is then obtained by considering F ()
as a function of vec (X). One gets the following expression:
 
∂F ∂F
= . (D.8)
∂X ∂xij

The derivative ∂F
∂X is then a p × q matrix.
2. Matrix Case
If Y = F (X) is a r × s matrix, where each yij = Fij (X) is a differentiable scalar
function of X, the partial derivative of Y with respect to xmn is the r × s matrix:
 
∂Y ∂Fij (X)
= . (D.9)
∂xmn ∂xmn

The partial derivative of Y with respect to X, based on Eq. (D.9), is the pr × qs


matrix given by
⎛ ∂Y ∂Y

∂x11 ... ∂x1q
∂Y ⎜ .. ⎟
= ⎜ .. ⎟
. ⎠. (D.10)
∂X ⎝ . ...
∂Y ∂Y
∂xp1 ... ∂xqq

Equation (D.10) also defines the Jacobian matrix DY (X) of the transformation.
Another definition of the Jacobian matrix is given in Magnus and Neudecker (1995,
p. 173) based on the vec transformation, namely,

∂vecF (X)
DF (X) = . (D.11)
∂ (vecX)T

Equation (D.11) is useful to compute the Jacobian matrices using the vec trans-
formation of X and Y and then get the Jacobian of a vector function. Note that
Eqs. (D.9) or (D.10) can also be written as a Kronecker product:
 
∂Y ∂ ∂yij
=Y⊗ = . (D.12)
∂X ∂X ∂X

In this appendix we adopt the componentwise derivative concept as in Dwyer


(1967).
508 D Matrix Algebra and Matrix Function

D.3.3 Examples

In the following examples the p × q matrix Jij will denote the matrix whose ith
and j th element is one and zero elsewhere, i.e. Jij = δm−i,n−j = δmi δnj , and
similarly for the r × s matrix Kαβ . For instance, if X = (xmn ), then Y = Jij X is the
matrix whose ith line is the j th line of X and zero elsewhere (i.e. ymn = δmi xj n ), and
Z = XJij is the matrix whose j column is the ith column of X and zero elsewhere
(i.e. zmn = δj n xmi ). The matrices Jij and Kij are essentially identical, but they are
obtained differently, see the remark below.

Case of Independent Elements

We assume that the matrix X = xij is composed of pq independent variables. We


j
will also use interchangeably the notation δij or δi for Kronecker symbol.

• Let X be a p × p matrix, and f (X) = tr(X) = k xkk . The derivative of f () is
∂xmn (tr (X)) = δmn . Hence,

∂ ∂  
tr (X) = Ip = tr XT . (D.13)
∂X ∂X
  ∂f 
• f (X) = tr (AX). Here f (X) = i k aik xki , and ∂xmn = k n
i,k aik δm δi =
anm ; hence,

∂f
= AT . (D.14)
∂X
• g (X) = g (f (X)), where f (.) is a scalar function of X and g(y) is a
differentiable scalar function of y. In this case we have

∂g dg ∂f
= (f (X)) .
∂X dy ∂X

∂ tr(XA)
For example, ∂X e = etr(XA) AT .
(X) = det (X) = |X|. For this case, one can use Eq. (D.2), i.e. |X| =
• f
j xαj Xαj where Xαj is the cofactor of xαj . Since Xαj is independent of xαk ,
∂|X|
for k = 1, . . . n, one gets ∂xαβ = Xαβ , and using Eq. (D.3), one gets

∂|X|
= |X|X−T . (D.15)
∂X
Consequently, if g(y) is any real differentiable scalar function of y, then we get
D Matrix Algebra and Matrix Function 509

∂ dg
g (|X|) = (|X|) |X|X−T . (D.16)
∂X dy

• f (X) = g (H(X)), where g(Y) is a scalar function of matrix Y and H(X) is


a matrix function of X, both differentiable. Using a similar argument from the
derivative of a scalar function of a vector, one gets

∂f (X)  ∂g ∂yij  ∂g ∂ (H(X))ij


= (H(X)) = (H(X)) .
∂xαβ ∂yij ∂xαβ ∂yij ∂xαβ
i,j i,j

For example, if Y = Y (X) is any differentiable matrix


 T  function of X, then
∂|Y(X)|  ∂|Y| ∂yij  ∂yij 
∂xαβ = i,j ∂yij ∂xαβ = Y
i,j ij ∂xαβ = Y ∂Y
i,j ij ∂xαβ . That is,
ji

 T   
∂|Y(X)| −T ∂Y ∂Y −1
= tr |Y|Y = |Y|tr Y . (D.17)
∂xαβ ∂xαβ ∂xαβ

Remark We can also compute the derivative with respect to an element and
derivative of an element with respect to a matrix as in the following examples.
∂XT
• Let f (X) = X, then ∂X
∂xαβ = Jαβ , and
∂xαβ = Jβα = Jαβ .
T

∂y
• For a r × s matrix Y = yij , we have ∂Yij = Kij .
∂[f (X)]ij
• For f (X) = AXB, one obtains ∂f (X)
∂xαβ = AJαβ B and ∂X = AT Kij BT .
∂Xn
Exercise Compute ∂xαβ .

∂XXn−1
Hint Use a recursive relationship. Write Un = ∂xαβ , and then Un = Jαβ Xn−1 +
n−1
∂xαβ = Jαβ X
X ∂X n−1 + XU
n−1 . By induction, one finds that

Un = Xn−1 Jαβ + XJαβ Xn−2 + X2 Jαβ Xn−3 + . . . + Xn−2 Jαβ X.

Application 1. f (X) = XAX, in this case

∂f (X)
= Jαβ AX + XAJαβ . (D.18)
∂xαβ

This could be proven by expressing the ith and j th element [XAX]ij of XAX.
Application2. g(X)  = tr(f (X)) where f (X)

= XAX.
Since tr ∂f (X)
∂xαβ = tr Jαβ AX + XAJ αβ = [AX]βα + [XA]βα , hence

∂tr (XAX)
= (AX + XA)T . (D.19)
∂X
510 D Matrix Algebra and Matrix Function

Application 3.
 −1  −1
∂|XAXT |
= |XAXT | XAT XT XAT + XAXT XA . (D.20)
∂X

∂|XAXT | −1
In particular, if A is symmetric, then ∂X = 2|XAXT | XAT XT XA .
∂ XAXT
One can use the fact that = Jαβ AXT + XAJβα , see also Eq. (D.18),
∂xαβ
     
∂ XAXT  ∂xik  T  ∂ AXT
which can be proven by writing ∂xαβ ij = k ∂x αβ
AX kj + xik ∂xαβ kj .
 β  
The first sum in the right hand side of this expression is simply k δk δiα AXT kj ,
which is the (i, j )th element of Jαβ AXT (and also the (α, β)th element of Jij XAT ).
Similarly, the second sum is the (i, j )th element of XAJαβ and, by applying the
trace operator, provides the required answer.
Exercise Complete the proof of Eq. (D.20).
∂|XAXT | −1
∂xαβ =|XAX |tr Jαβ AXT+XAJβα .
Hint First use Eq. (D.17), i.e. T XAXT
−1
Next use the same argument as that used in Eq. (D.19) to get tr XAXT Jαβ
 −1  T −T
AX T = i XAX
T AX βi = XAX T XA T . A similar
iα αβ
reasoning yields
 −1   −1  −1
tr XAXT
XAJαβ = tr XAJβα XAXT
= XAXT XA ,
αβ

which again yields Eq. (D.20).


Application 4.

∂|AXB|
= |AXB|AT (AXB)−T BT . (D.21)
∂X
 
In fact, one has ∂x∂αβ [AXB]ij = aiα bβj = AJαβ B ij . Hence ∂|AXB|
∂xαβ =
−1   −1

|AXB|tr AJαβ B (AXB) . The last term equals i aiα B (AXB) =
  −1
  −1
βi
B (AXB) βi aiα , which can be easily recognised as B (AXB) A βα =
 iT 
A (AXB)−T BT αβ .
−1
−1
∂xαβ , one can use the fact that X X =
• Derivative of the inverse. To compute ∂X
 ik  ∂x ik
Ip , i.e. k x xkj = δij , which yields after differentiation: k ∂xαβ xkj =
 ik   ∂X−1 −1
− k x Jαβ kj , i.e. ∂xαβ X = −X Jαβ or
D Matrix Algebra and Matrix Function 511


X−1 = −X−1 Jαβ X−1 . (D.22)
∂xαβ

• f (X) = Y = AX−1 B −. First, we have ∂x∂αβ Y = −AX−1 Jαβ X−1 B.


Let us now find the derivative of each element of Y with respect to X. We
first note that for any two matrices of respective
 orders
 n × p and q × m,
β
A = aij and B = bij , one has AJαβ = aiα δj and AJαβ B = aiα bβj .
∂yij      
Now, ∂xαβ = − AX−1 Jαβ X−1 B ij = − AX−1 iα X−1 B βj , which is also
T T T T
− AX−1 X−1 B = AX−1 Jij X−1 B , that is
αi jβ αβ

∂yij
= −X−T AT Jij BT X−T . (D.23)
∂X

• f (X) = y = tr X−1 A −. One uses the previous argument, i.e. ∂x∂yαβ =


       
−tr X−1 Jαβ X−1 A , which is − i X−1 iα X−1 A βi = − i X−T αi
 T −T 
A X iβ
. Therefore,

∂  
tr X−1 A = −X−T AT X−T . (D.24)
∂X

Alternatively, one can also use the identity tr(X) = |X|tr(X−1 ) (e.g. Graybill
1969, p. 227).

Case of Symmetric Matrices

The matrices dealt with in the previous examples have independent elements. When,
however, the elements are not independent, the rules change. Here we consider
the case of symmetric matrices, but there are various other dependencies such as
normality, orthogonality etc. Let X = xij be a symmetric matrix, and J ij =
Jij + Jj i − diag Jij , i.e. the matrix with one for the (i, j )th and (j, i)th elements
and zero elsewhere. We have ∂x ∂X
ij
= J ij . Now, if f (X) is a scalar function of
the symmetric matrix X, then we can start first with the scalar function f (Y) for a
general matrix, and we get (e.g. Rogers 1980)
  
∂f (X) ∂f (Y) ∂f (Y) ∂f (Y)
= + − diag .
∂X ∂Y ∂YT ∂Y Y=X

The following examples illustrate this change.


 ∂xki
• ∂x∂αβ tr (AX) = i,k aik ∂x αβ
= aαβ + aβα ; therefore,
512 D Matrix Algebra and Matrix Function


tr (AX) = A + AT . (D.25)
∂X

Exercise Show that ∂


∂xαβ AXB = AJαβ B + BJβα B − AJαα B δαβ .
• Derivative of determinants

∂  
|X| = |X| 2X−1 − diag X−1 . (D.26)
∂X

∂   
|AXB|=|AXB| AT (AXB)−T BT +B (AXB)−1 A−diag B (AXB)−1 A .
∂X
(D.27)
Exercise Derive Eq. (D.26) and Eq. (D.27).
Hint Apply (D.17) to the transformation Y(X) = X1 + XT1 − diag (X), where
X1 is the lower triangular matrix whose elements are xij1 = xij , for i ≤ j . Then
 ∂|Y| ∂yij   
one gets ∂x∂αβ |Y(X)| = ij ∂yij ∂xαβ = ij |Y|y
ji J
αβ ij . Keeping in mind
that Y = X, the previous expression yields |X|tr X−1 J αβ . To complete the
proof,
 remember that J αβ = Jαβ + Jβα − diag Jαβ ; hence, tr X−1 J αβ =
 
x βα + x αβ − x αβ δx = 2X−1 − diag X−1 αβ .
αβ

Similarly, Eq. (D.27) is similar to Eq. (D.21) but involves symmetry, i.e.
−1
∂xαβ AXB = AJ αβ B. Therefore, ∂xαβ |AXB| = |AXB|tr AJ αβ B (AXB)
∂ ∂
.
• Derivative of a matrix inverse

∂X−1
= −X−1 J αβ X−1 . (D.28)
∂xαβ

The proof is similar to the non-symmetric case, but using J αβ instead.


• Trace involving matrix inverse

∂      
tr X−1 A = −X−1 A + AT X−1 + diag X−1 AX−1 . (D.29)
∂X

D.4 Application

D.4.1 MLE of the Parameters of a Multinormal Distribution

Matrix derivatives find straight application in multivariate analysis. The most famil-
iar example is perhaps the estimation of a p-dimensional multinormal distribution
D Matrix Algebra and Matrix Function 513

N (μ, ) from a given sample of data. Let x1 , . . . , xn be a sample from such a


distribution. The likelihood of this sample (see e.g. Anderson 1984) is

!
n !n   
−p/2 −1/2 1 −1
L= f (xt ; μ, ) = (2π ) || exp − (xt − μ)  (xt −μ) .
T
2
t=1 t=1
(D.30)
The objective is then to estimate μ and  by maximising L. It is usually simpler to
use the log-likelihood L = log L, which reads

1
n
np n
L = log L = log 2π − log || − (xt − μ)T  −1 (xt − μ) . (D.31)
2 2 2
t=1

L
The estimates are obtained by solving the system of equations given by ∂∂μ = 0 and
∂L
∂ = O. The first of these yields is (assuming that  −1 exists)


n
(xt − μ) = 0, (D.32)
t=1

which provides the sample mean. For the second, one can use Eqs. (D.16)–(D.26),
and Eq. (D.29) for the last term, which can be written as a trace of a matrix product.
This yields
   
2 −1 − diag  −1 − 2 −1 S −1 + diag  −1 S−1  −1 = O,

which can be simplified to 2 −1 Ip − S −1 − diag  −1 Ip − S −1 = O,


i.e.
 
 −1 Ip − S −1 = O, (D.33)

yielding the sample covariance matrix S.

D.4.2 Estimation of the Factor Model Parameters

The estimation of the parameters of a factor model can be found in various text
books, e.g. Anderson (1984), Mardia et al. (1979). The log-likelihood of the model
has basically the same form as Eq. (D.31) except that now  is given by  =
 + T , where  is a diagonal covariance matrix (see Chap. 10, Eq. (10.11)).
Using Eq. (D.16) along with results from Eq. (D.20), we get ∂ ∂
log |T +
−T
| = 2 T +  . In a similar way we get ∂
∂ log |T + | =
514 D Matrix Algebra and Matrix Function

−1
diag T +  ∂
. Furthermore, using Eq. (D.27), one gets ∂ log |T +
−1
 −1

| = 2T T +   − diag T T +   .

Exercise Let H = XAXT . Show that ∂


∂X tr H−1 S = H−T ST H−T XAT −
H−1 SH−1 XA.
Hint Let H = (hij ), then using arguments from Eq. (D.24) and Eq. (D.19) one gets

∂    ∂tr(H−1 S) ∂
tr H−1 S = hij
∂xαβ ∂hij ∂xαβ
ij

= −H−T ST H−T Jαβ AXT + XAJβα .
ij ij
ij

 
This is precisely tr −H−1 SH−1 Jαβ AXT + XAJβα . Using an argument similar
to that presented in Eqs. (D.23)–(D.24), one gets


trH−1 S = − H−T ST H−T XAT + H−1 SH−1 XA .
∂xαβ αβ

Applying the above exercise and keeping in mind the symmetry of , see
Eq. (D.29), yield

∂     
tr  −1 S = 2 −2 −1 S −1 + diag  −1 S −1 .
∂

Exercise Let H = AXB, and show that ∂


∂X tr H−1 S = −AT H−T ST H−T BT .
    
Hint As before, one finds − ij H−1 SH−1 ij AJαβ B ij = −tr AJαβ BH−T ST H−T ,
    
and this can be written as − i aiα BH−T ST H−T βi = − BH−T ST H−T A βα .
Using the above result with A = , B = T and X = , keeping in mind the
symmetry of  and , one gets
 

∂ tr −1 S = 2T −2 −1 S −1 + diag  −1 S −1 
 
+diag T −2 −1 S −1 + diag  −1 S −1  .

For  diagonal, one simply gets

∂       
tr  −1 S = −2 −1 S −1 + diag  −1 S −1 , that is − diag  −1 S −1 .
∂ []αα αα

Finally, one gets


D Matrix Algebra and Matrix Function 515

∂L
 
= − n2 2 −1  + 2 −2 −1 S −1 + diag  −1 S −1 
∂  −1 
= −n  ( − 2S)  −1 + diag  −1 S −1 
∂L
 
= − n2 2T  −1  − diag T  −1  + 2T −2 −1 S −1 + diag  −1 S −1 
∂  
− n2 diag T −2 −1 S −1 + diag  −1 S −1 
∂L
 
∂ = − n2 diag −1 − diag  −1 S −1 = − n2 diag  −1 ( − S) −1 .
(D.34)
Note that if one removes the terms pertaining to symmetry one finds what has been
presented in the literature, e.g. Dwyer (1967), Magnus and Neudecker (1995). For
example, in Dwyer (1967) symmetry was not explicitly taken into account in the
differentiation. The symmetry condition can be considered via Lagrange multipliers
(Magnus and Neudecker 1995). It turns out, in fact, that the stationary points of a
scalar function f (X) of the symmetric p × p matrix X, i.e. ∂f∂X
(X)
= O are also the
solutions to (Rogers 1980, th. 101, p. 80)

∂f (Y)
= O, (D.35)
∂Y Y=X

where Y is non-symmetric, whenever ∂f∂Y (Y)


= ∂f∂Y(Y)T , which is
Y=X Y=X
straightforward based on the definition of the derivative given in Eq. (D.8), and
where the first differentiation reflects positional aspect, whereas the second one
refers to the ordinary differentiation. This result simplifies calculation considerably.
The stationary solutions to L are given by

 −1 ( − S)  −1  = O
T  −1 ( − S)  −1  = O (D.36)
diag  −1 ( − S)  −1 = O.

D.4.3 Application to Results from PCA

Matrix derivative also finds application in various other subjects. The eigenvalue
problem of EOFs is a straightforward application. An interesting alternative to this
eigenvalue problem, which uses matrix derivative, is provided by the following
result (see Magnus and Neudecker 1995, th. 3, p. 355). For a given p × p positive
semi-definite matrix , of rank r, the minimum of

 (Y) = tr ( − Y)2 , (D.37)

where Y is positive semi-definite matrix of rank q ≤ p, is obtained for


q
Y= λ2k vk vTk , (D.38)
k=1
516 D Matrix Algebra and Matrix Function

where λ2k , and vk , k = 1, . . . q, are the leading eigenvalues and associated


eigenvectors of . The matrix Y thus defined provides the best approximation to
. Consequently, if represents the covariance matrix of some data matrix, then Y
is simply the covariance matrix of the filtered data matrix obtained by removing the
contribution from the last r − q eigenvectors of .
Exercise Show that Y defined by Eq. (D.38) minimises Eq. (D.37).
Hint Write Y = AAT and find A.
A p × r matrix A is semi-orthogonal1 if AT A = Ir . Another connection to EOFs
is provided by the following theorem (Magnus and Neudecker 1995).
Theorem Let X be a n × p data matrix. The minimum of the following real valued
function:
  T
φ(X) = tr X − ZAT X − ZAT = X − ZAT 2
F, (D.39)

where the p × r matrix A is semi-orthogonal and . F stands for the Fröbenius


norm, is obtained for A = (v1 , . . . , vr ) and Z = XA, where v1 , . . . , vr are the
eigenvectors associated with the leading eigenvalues λ21 , . . . , λ2r of XT X. Further,
p
the minimum is k=r+1 λ2k .
In other words, A is the set of the leading eigenvectors, and Z the matrix of the
associated PCs.
Exercise Find the stationary points of Eq. (D.39).
Hint Use a Lagrange function (see exercise below).
Exercise Show that the Lagrangian function of

min φ(X) s.t. F(X) = O,

where F(X) is a symmetric matrix function of X, is

(X) = φ(X) − tr (LF(X)) ,

where L is a symmetric matrix.



Hint The Lagrangian function is simply φ(X) − i,j lij [F(X)]ij , where lij = lj i
since F(X) is symmetric.

1 The set of these matrices is known as Stiefel manifold.


D Matrix Algebra and Matrix Function 517

D.5 Common Algorithms for Linear Systems and Eigenvalue


Problems

D.5.1 Direct Methods

A number of algorithms exist to solve linear systems of the kind Ax = b and


Ax = λx. For the linear system, the m × m matrix A is normally assumed to be
nonsingular. A large number of algorithms exists to solve those problems, see e.g.
Golub and van Loan (1996). Some of those algorithms are more suited than others
particularly for large and/or sparse matrices. Broadly speaking, two main classes
of methods exist for solving linear systems and also eigenvalue problems, namely,
direct and iterative. Direct methods are based mostly on decompositions. The main
direct methods include essentially the SVD, LU and QR decompositions (Golub and
van Loan 1996).

D.5.2 Iterative Methods


Case of Eigenvalue Problems

Iterative methods are based on what is known as Krylov subspaces. Given a


m × m matrix A and a non-vanishing m-dimensional vector y, the sequence
y, Ay, . . . , An−1 y is referred to as Krylov sequence. The Krylov subspace Kn (A, y)
is defined as the space spanned by a Krylov sequence, i.e.
 
Kn (A, y) = Span y, Ay, . . . , An−1 y . (D.40)

Iterative methods to solve systems of linear equations (see below) or eigenvalue


problems are generally referred to as Krylov space solvers. The Krylov sequence
b, Ab, . . . , An−1 b can be used in the approximation process of the eigenelements,
but it is in general ill-conditioned, and an orthonormalisation is required. This is
obtained using two main algorithms: Lanczos and Arnoldi algorithms for Hermitian
and non-Hermitian matrices, respectively (Watkins 2007; Saad 2003).
The basic idea is to construct an orthonormal basis given by the m × n matrix
Qn = [q1 , . . . , qn ] of Kn , which is used to obtain a projection of the operator A
onto Kn , Hn = Qn^H A Qn , where Qn^H is the conjugate transpose of Qn , i.e. Qn^{*T}.
The pair (λ, x), with Hn x = λx, provides an approximate pair of eigenvalue and
associated eigenvector2 of A.

2 The number λ and vector Qk x are, respectively, referred to as Ritz value and Ritz vector of A.

Lanczos Method
Lanczos method is based on a tridiagonalisation algorithm of a Hermitian matrix (or
symmetric for real cases) A, as

AQn = Qn Hn ,     (D.41)

where Hn = [h1 , . . . , hn ] is a tridiagonal matrix. Let us designate by
[α1 , . . . , αn ] the main diagonal of Hn and by [β1 , . . . , βn−1 ] its upper and sub-
diagonals. Identifying the j th columns from both sides of Eq. (D.41) yields

βj qj +1 = Aqj − βj −1 qj −1 − αj qj . (D.42)

The algorithm then starts from an initial vector q1 (taking q0 = 0) and obtains
αj , βj and qj +1 at each iteration step. (The vectors qi , i = 1, . . . n, are orthonor-
mal.) After k steps, one gets

AQk = Qk Hk + βk qk+1 eTk (D.43)

with ek being the k-element vector (0, . . . , 0, 1)T . The algorithm stops when βk =
0.
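The three-term recurrence of Eq. (D.42) is short enough to write down directly. The following Python/NumPy sketch is purely illustrative (a production implementation would add re-orthogonalisation to control round-off); the test matrix and starting vector are arbitrary.

import numpy as np

def lanczos(A, q1, k):
    # Lanczos tridiagonalisation of a real symmetric A: returns Q (m x k) and tridiagonal T (k x k)
    m = A.shape[0]
    Q = np.zeros((m, k))
    alpha = np.zeros(k)
    beta = np.zeros(k)
    q_prev = np.zeros(m)
    q = q1 / np.linalg.norm(q1)
    b = 0.0
    for j in range(k):
        Q[:, j] = q
        w = A @ q - b * q_prev           # Eq. (D.42) before removing the alpha_j q_j component
        alpha[j] = q @ w
        w = w - alpha[j] * q
        b = np.linalg.norm(w)
        beta[j] = b
        if b == 0.0:                     # breakdown: an invariant subspace has been found
            break
        q_prev, q = q, w / b
    T = np.diag(alpha) + np.diag(beta[:-1], 1) + np.diag(beta[:-1], -1)
    return Q, T

# the extreme Ritz values of T approximate the extreme eigenvalues of A
A = np.cov(np.random.default_rng(1).standard_normal((50, 200)))
Q, T = lanczos(A, np.ones(A.shape[0]), 20)
print(np.sort(np.linalg.eigvalsh(T))[-3:], np.sort(np.linalg.eigvalsh(A))[-3:])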
Arnoldi Method
Arnoldi algorithm is similar to Lanczos’s except that the matrix Hn = (hij ) is an upper
Hessenberg matrix, which satisfies hij = 0 for i ≥ j + 2. As for the Lanczos
method, Eq. (D.42) yields h_{j+1,j} q_{j+1} = Aq_j − Σ_{i=1}^{j} h_{ij} q_i . After k steps, one
obtains

AQk = Qk Hk + hk+1,k qk+1 eTk . (D.44)

The above Eq. (D.44) can be written in a compact form as AQk = Qk+1 H̄k ,
where H̄k is the resulting (k + 1) × k Hessenberg matrix; the matrix Hk is
obtained from H̄k by deleting the last row. Note that Arnoldi (and also Lanczos)
methods are modified versions of the Gram–Schmidt orthogonalisation procedure,
with Hessenberg and tridiagonal matrices involved, respectively, in the two methods.
Note also that a non-symmetric Lanczos method exists, which yields a non-
symmetric tridiagonal matrix (e.g. Parlett et al. 1985).

Case of Linear Systems

The simplest iterative method is the Jacobi iteration, which solves a fixed point
problem. It transforms the linear system Ax = b into x = Âx + b̂, where Â =
Im − D^{−1} A and b̂ = D^{−1} b, with D being either the diagonal matrix of A or simply
the identity matrix. The fixed point iterative algorithm is then given by xn+1 =
Âxn +b̂, with a given initial condition x0 . The algorithm converges when the spectral

radius of Â is less than unity, i.e. ρ(Â) < 1. The computation of the nth residual
vector rn = b − Axn involves the Krylov subspace Kn+1 (A, r0 ). Other methods,
like gradient and semi-iterative methods, also belong to the class of Krylov space solvers.
From an initial condition x0 , the approximate solution takes the form

xn − x0 = p_{n−1}(A) r0 ,     (D.45)

for a polynomial p_{n−1}(·) of degree n − 1, i.e. xn − x0 belongs to Kn (A, r0 ). The problem
is then to find a good choice of xn in the Krylov space. There are essentially two
methods for this (Saad 2003), namely, Arnoldi’s method (described above) or FOM
(Full Orthogonalisation Method) and the GMRES (Generalised Minimum Residual
Method) algorithms.
The FOM algorithm is based on the above Arnoldi orthogonalisation procedure
and looks for xn − x0 within Kn (A, r0 ) such that (b − Axn ) is orthogonal to
this Krylov space (Galerkin condition). From the initial residual r0 , and letting
r0 = βq1 , with β = ‖r0‖_2 , the algorithm yields a similar equation to (D.41), i.e.
Qk^T A Qk = Hk and Qk^T r0 = βe1 . The approximate solution at step k is then given
by

xk = x0 + Qk yk = x0 + βQk Hk^{−1} Qk^T q1 = x0 + βQk Hk^{−1} e1 ,     (D.46)

where e1 = Qk^T q1 , i.e. e1 = (1, 0, . . . , 0)^T .


Exercise Derive the above approximation (Eq. (D.46)).
Hint At the kth step, the vector xk − x0 belongs to Kk and is therefore of the form
Qk y. The above Galerkin condition means that Qk^T (b − Axk ) = 0, that is, Qk^T r0 −
Qk^T A Qk y = 0, and using Eq. (D.41) yields Eq. (D.46).
The GMRES algorithm seeks vectors xn − x0 within Kn (A, r0 ) such that b −
Axn is orthogonal to AKn . This condition implies the minimisation of ‖Axn − b‖_2
(e.g. Saad and Schultz 1985). Variants of these methods and other algorithms exist
for particular choices of the matrix A, see e.g. Saad (2003) for more details. An
approximate solution at step k, which is in the space x0 + Kk , is given by

xk = x0 + Qk z∗ . (D.47)

Using Eq. (D.44), one gets

b − Axk = r0 − AQk z∗ = βq1 − Qk+1 H̄k z∗ ,     (D.48)

which yields b − Axk = Qk+1 (βe1 − H̄k z∗ ). The vector z∗ is then given by

z∗ = argmin_z ‖βq1 − Qk+1 H̄k z‖_2 = argmin_z ‖βe1 − H̄k z‖_2 ,     (D.49)

where the last equality holds because Qk+1 has orthonormal columns (‖Qk+1 x‖_2 = ‖x‖_2 ), and
H̄k is the matrix defined below Eq. (D.44).
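In practice one rarely codes GMRES by hand; SciPy provides it in its sparse linear algebra module. A minimal usage sketch (the diagonally dominant test matrix is arbitrary and only serves as an illustration):

import numpy as np
from scipy.sparse.linalg import gmres

rng = np.random.default_rng(2)
n = 200
A = rng.standard_normal((n, n)) + n * np.eye(n)   # a well-conditioned test matrix
b = rng.standard_normal(n)

x, info = gmres(A, b)          # info = 0 signals convergence
print(info, np.linalg.norm(A @ x - b))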
Remark The Krylov space can be used, for example, to approximate the exponential
of a matrix, which is useful particularly for large matrices. Given a matrix A and a
vector v, an approximation of e^{tA} v, using Kk (A, v), is given by (Saad 1990)

e^{tA} v ≈ βQk e^{tHk} e1 ,     (D.50)

with β = ‖v‖_2 . Equation (D.50) can be used, for example, to compute the solution
of an inhomogeneous system of first-order ODEs. Also, and as pointed out by Saad
(1990), Eq. (D.50) can be used to approximate the (matrix) integral:

X = ∫_0^∞ e^{uA} b b^T e^{uA^T} du,     (D.51)

which provides a solution to the Lyapunov equation AX^T + XA^T + bb^T = O.
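The following Python sketch illustrates Eq. (D.50): an Arnoldi loop builds Qk and Hk, and the small exponential e^{tHk} replaces the large one. Everything here (matrix, dimensions, the use of scipy.linalg.expm to check the result) is illustrative only.

import numpy as np
from scipy.linalg import expm

def arnoldi(A, v, k):
    # Arnoldi orthogonalisation (modified Gram-Schmidt): A Q_k = Q_{k+1} Hbar_k
    m = A.shape[0]
    Q = np.zeros((m, k + 1))
    H = np.zeros((k + 1, k))
    Q[:, 0] = v / np.linalg.norm(v)
    for j in range(k):
        w = A @ Q[:, j]
        for i in range(j + 1):
            H[i, j] = Q[:, i] @ w
            w = w - H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        Q[:, j + 1] = w / H[j + 1, j]
    return Q[:, :k], H[:k, :k]          # Q_k and the square Hessenberg H_k

rng = np.random.default_rng(3)
m, k, t = 400, 30, 0.5
A = -np.eye(m) + 0.1 * rng.standard_normal((m, m))   # a stable test matrix
v = rng.standard_normal(m)

Q, H = arnoldi(A, v, k)
beta = np.linalg.norm(v)
approx = beta * Q @ expm(t * H)[:, 0]   # Eq. (D.50): beta Q_k e^{tH_k} e_1
exact = expm(t * A) @ v
print(np.linalg.norm(approx - exact) / np.linalg.norm(exact))   # small relative error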


Appendix E
Optimisation Algorithms

E.1 Background

Optimisation problems are ubiquitous in all fields of science. Various optimisation


algorithms exist in the literature, and they depend on whether the first and/or second
derivative of the function to be optimised is available. These algorithms also depend
on the function to be optimised and the type of constraints. Since minimisation is
the opposite of maximisation, we will mainly focus on the former. In general there
are four types of objective functions (and constraints):
• linear;
• quadratic;
• smooth nonlinear;
• non-smooth.
A minimisation problem without constraints is an unconstrained minimisation
problem. When the objective function is linear and the constraints are linear
inequalities, one has a linear programming problem, see e.g. Foulds (1981, chap. 2). When
the objective function and the constraints are nonlinear, one gets a nonlinear
programme. Minimisation algorithms also vary according to the nature of the
problem. For instance in the unconstrained case, Newton’s method can be used in
the multivariate case when the gradient and the second derivative of the objective
function are provided. When only the first derivative is provided, a quasi-Newton
method can be applied. When the dimension of the problem is large, conjugate
gradient algorithms can be used.
In the presence of nonlinear constraints, a whole class of gradient methods, such
as reduced and projected gradient methods, can be used. Convex programming
methods can be used when we have linear or nonlinear inequalities as constraints.
In most cases, a constrained problem can be transformed into an unconstrained
problem using Lagrange multipliers. In this appendix we provide a short review of
the most widely used optimisation algorithms that are used in atmospheric science.


For a detailed review of the various optimisation problems and algorithms, the
reader is referred to Gill et al. (1981).
There is in general a large difference between one- and multidimensional
problems. The univariate and bivariate minimisation problems are in general not
difficult to solve since the function can be plotted and visualised, particularly when
the function is smooth. When the first derivative is not provided, methods like the
golden section can be used. The problem gets more difficult for many variables
when there are multiple minima. In fact, the main obstacle to minimisation in the
multivariate case is the problem of local minima. For example, the global minimum
can be attained when the function is quadratic:

f(x) = (1/2) x^T A x − b^T x + c,     (E.1)
where A is a symmetric matrix. The quadratic Eq. (E.1) is a typical example that
deserves attention. The gradient of f (.) is Ax − b, and a necessary condition for
optimality is given by ∇f (x) = 0. The solution to this linear equation provides a
partial answer, however. To get a complete answer, one has to compute the second
derivative at the solution of the necessary condition, to yield the Hessian:
 
H = (h_{ij}) = (∂²f/∂x_i ∂x_j) = A.     (E.2)

Clearly, if A is positive semi-definite, the solution of the necessary condition is


a global minimum. In general, however, the function to be minimised is non-
quadratic, and more advanced tools have to be applied, and this is the aim of this
appendix.

E.2 Single Variable

In general the one-dimensional case is the easiest minimisation problem, particu-


larly when the objective function is smooth.

E.2.1 Direct Search

A simple direct search method is based on successive function evaluation, aiming at


reducing the length of the interval containing the minimum. The most widely used
methods are:
• Dichotomous search—It is based on successively halving the interval containing
the minimum. After n iterations, the length In of the interval containing the
minimum is

I_n = I_0 / 2^{n/2},

where I_0 = x_2 − x_1 if [x_1 , x_2 ] is the initial interval.
• Golden section—It is based on subdividing the initial interval [x1 , x2 ] into three
subintervals using two extra points x3 and x4 , with x1 < x3 < x4 < x2 . For
example, if f (x3 ) ≤ f (x4 ), then the minimum is expected to lie within [x1 , x4 ];
otherwise, it is in [x3 , x2 ]. The iterative procedure takes the form:
 
x_3^(i) = ((τ − 1)/τ) (x_2^(i) − x_1^(i)) + x_1^(i)
x_4^(i) = (1/τ) (x_2^(i) − x_1^(i)) + x_1^(i),

where τ is the Golden number1 (1 + √5)/2. There are various other methods such as
quadratic interpolation and Powell’s method, see Everitt (1987) and Box et al.
(1969).
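A minimal Python sketch of the golden-section iteration described above (the test function is arbitrary and only illustrative):

import numpy as np

def golden_section(f, x1, x2, tol=1e-8):
    # golden-section search for the minimum of a unimodal function on [x1, x2]
    tau = (1.0 + np.sqrt(5.0)) / 2.0          # the golden number
    while x2 - x1 > tol:
        x3 = x1 + (tau - 1.0) / tau * (x2 - x1)
        x4 = x1 + (x2 - x1) / tau
        if f(x3) <= f(x4):
            x2 = x4        # the minimum is expected to lie within [x1, x4]
        else:
            x1 = x3        # otherwise it lies within [x3, x2]
    return 0.5 * (x1 + x2)

print(golden_section(lambda x: (x - 1.3)**2 + np.sin(x), 0.0, 4.0))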

E.2.2 Derivative Methods

When the first and perhaps the second derivatives are available, then it is known that
the two conditions:

df(x*)/dx = 0
d²f(x*)/dx² > 0     (E.3)

are sufficient conditions for x ∗ to be a minimum of f (). In this case the most widely
used method is based on Newton algorithm, also known as Newton–Raphson, and
aims at computing the zero of df(x)/dx using the tangent line to df(x)/dx at the current
iterate. The algorithm reads

x_{k+1} = x_k − [df(x_k)/dx] / [d²f(x_k)/dx²]     (E.4)

when d²f(x_k)/dx² ≠ 0. Note that when the second derivative is not provided, then
the denominator of Eq. (E.4) can be approximated using a finite difference scheme:

x_{k+1} = x_k − (x_k − x_{k−1}) / [df(x_k)/dx − df(x_{k−1})/dx] · df(x_k)/dx.

1 It is the limit of u_{n+1}/u_n when n → ∞, where u_0 = u_1 = 1 and u_{n+1} = u_n + u_{n−1}.
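A short Python sketch of Eq. (E.4), falling back to the finite-difference (secant) form when the second derivative is not supplied; the test function and names are illustrative only.

def newton_1d(df, x0, d2f=None, tol=1e-10, maxit=50):
    # Newton-Raphson for a zero of df; secant approximation of d2f when it is not given
    x_prev, x = x0, x0 + 1e-4
    for _ in range(maxit):
        if d2f is not None:
            step = df(x) / d2f(x)                                 # Eq. (E.4)
        else:
            step = (x - x_prev) / (df(x) - df(x_prev)) * df(x)    # secant form
        x_prev, x = x, x - step
        if abs(step) < tol:
            break
    return x

# minimise f(x) = x**4 - 3*x**2 + x, so df = 4x^3 - 6x + 1 and d2f = 12x^2 - 6
print(newton_1d(lambda x: 4*x**3 - 6*x + 1, 1.0, d2f=lambda x: 12*x**2 - 6))
print(newton_1d(lambda x: 4*x**3 - 6*x + 1, 1.0))    # finite-difference denominator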

E.3 Direct Multivariate Search

As for the one-dimensional case, there are direct search methods and gradient-
based algorithms. Among the most widely used direct search methods, one finds
the following:

E.3.1 Downhill Simplex Method

This method is due to Nelder and Mead (1965) and was originally described by
Spendley et al. (1962). The method is based on a simplex,2 generally with mutually
equidistant vertices, from which a new simplex is formed simply by reflection of
the vertex (where the objective function is largest) through the opposite facet, i.e.
through the hyperplane formed by the remaining m points (or vertices), to a “lower”
vertex where the function is smaller. Details on the method can be found, e.g. in Box
et al. (1969) and Press et al. (1992). The method can be useful for a quick search
but can become inefficient for large dimensions.

E.3.2 Conjugate Direction/Powell’s Method

Most multivariate minimisation algorithms attempt to find the best search direction
along which the function can be minimised. The conjugate direction method is
based on minimising a quadratic function and is known as quadratically convergent.
Consider the quadratic function:

f(x) = (1/2) x^T G x + b^T x + c.     (E.5)

The directions u and v are said to be conjugate (with respect to G) if uT Gv = 0. The


method is then based on finding a set of mutually conjugate search directions along
which minimisation can proceed. Powell (1964) has shown that if a set of search
directions u_1^(i), . . . , u_n^(i) at the ith iteration are normalised so that u_k^(i)T G u_k^(i) = 1,
k = 1, . . . , n, then det(u_1^(i), . . . , u_n^(i)) is maximised only when the vectors are
(linearly independent) mutually conjugate. This provides a way of finding a new
search direction that can replace an existing one. The procedure is therefore to
minimise the function along individual lines and proceeds as follows. Starting from

2 A simplex in an m-dimensional space is a geometric figure with m + 1 vertices, or m + 1
facets. Triangles and tetrahedra (pyramids) are examples of simplexes in two- and three-dimensional spaces,
respectively.

x0 and a direction u0 , one minimises the univariate function f (x0 + λu0 ) and then
replaces x0 and u0 by x0 + λu0 and λu0 , respectively. Powell’s algorithm runs as
follows:
0. Initialise u_i = e_i , i.e. the canonical basis vectors, i = 1, . . . , m.
1. Initialise x = x_0 .
2. Minimise f(x_{i−1} + λu_i), and set x_i = x_{i−1} + λu_i , for i = 1, . . . , m.
3. Set u_i = u_{i+1}, i = 1, . . . , m − 1, and u_m = x_m − x_0 .
4. Minimise f(x_m + λu_m), set x_0 = x_m + λu_m , and then go to 1.
Powell (1964) showed that the procedure yields a set of k mutually conjugate
directions after k iterations. The procedure has to be reinitialised with new vectors
after every m iterations in order to escape dependency of the obtained vectors, see
Press et al. (1992) for further details.
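In practice Powell's derivative-free conjugate-direction method is available directly in SciPy. A minimal usage sketch on the library's own Rosenbrock test function (purely illustrative):

import numpy as np
from scipy.optimize import minimize, rosen

x0 = np.array([-1.2, 1.0, 0.5])
res = minimize(rosen, x0, method='Powell')    # conjugate-direction search, no gradients needed
print(res.x, res.fun)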
Remark The reason for using one-dimensional minimisation is conjugacy. In fact,
if u_1, . . . , u_m are mutually conjugate with respect to G, the required minimum is
taken to be

x_1 = x_0 + Σ_{k=1}^m α_k u_k ,

where the parameters α_k , k = 1, . . . , m, are chosen to minimise f(x_1) = f(x_0 +
Σ_{k=1}^m α_k u_k). These coefficients therefore minimise

f(x_1) = Σ_{i=1}^m [ (1/2) α_i² u_i^T G u_i + α_i u_i^T (Gx_0 + b) ] + f(x_0)     (E.6)

based on conjugacy of u_i , i = 1, . . . , m. Hence, the effect of searching along u_i
is to find α_i that minimises (1/2) α_i² u_i^T G u_i + α_i u_i^T (Gx_0 + b), and this value of α_i is
independent of the other terms in Eq. (E.6).
It is shown precisely by Fletcher (1972) that for a quadratic function, with positive
definite Hessian G, when the search directions u_i , i = 1, . . . , m, are conjugate
to each other, then the minimum is found in at most m iterations. Furthermore,
x^(i+1) = x^(i) + α^(i) u_i is the minimum point in the subspace generated by the initial
approximation x^(1) and the directions u_1, . . . , u_i , and the identities g_{i+1}^T u_j = 0,
j = 1, . . . , i, also hold (g_{i+1} = ∇f(x^(i+1))).

E.3.3 Simulated Annealing

This algorithm is based on concepts from statistical mechanics and makes use
of Boltzmann probability of energy distribution in thermodynamical systems in
equilibrium (Metropolis et al. 1953). The method uses Monte Carlo simulation to

generate moves and is particularly useful because it can escape local minima. The
algorithm can be applied to continuous and discrete problems (Press et al. 1992), see
also Hannachi and Legras (1995) for an application to atmospheric low-frequency
variability.
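SciPy offers a generalised simulated annealing through scipy.optimize.dual_annealing; the sketch below (with a standard multimodal Rastrigin-type test function, illustrative only) shows a minimal usage.

import numpy as np
from scipy.optimize import dual_annealing

def f(x):
    # a simple multimodal (Rastrigin-type) objective with many local minima
    return 10 * len(x) + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

bounds = [(-5.12, 5.12)] * 4
res = dual_annealing(f, bounds, seed=7)
print(res.x, res.fun)        # the global minimum is at the origin, f = 0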

E.4 Multivariate Gradient-Based Methods

Unlike direct search methods, gradient-based approaches use the gradient of the
objective function. Here we assume that the (smooth) objective function can be
approximated by

f(x + δx) = f(x) + g(x)^T δx + (1/2) δx^T H δx + o(|δx|²),     (E.7)

where g(x) = ∇f(x) and H = (∂²f(x)/∂x_i ∂x_j) are, respectively, the gradient vector
and Hessian matrix of f (x). Gradient methods also belong to the class of descent
algorithms where the approximation of the desired minimum at various iterations is
perturbed in an additive manner as

xm+1 = xm + λu. (E.8)

Descent algorithms are distinguished by the manner in which the search direction
u is chosen. Most gradient methods use the gradient as search direction since the
gradient ∇f (x) points in the direction where the function increases most rapidly.

E.4.1 Steepest Descent


The steepest descent uses u = −g/‖g‖ and chooses λ that minimises the univariate
objective function:

h(λ) = f (xm + λu). (E.9)

The solution at iteration m + 1 is then given by

x_{m+1} = x_m − λ ∇f(x_m)/‖∇f(x_m)‖.     (E.10)

Note that Eq. (E.9) is quadratic when Eq. (E.7) is used, in which case the solution
is given by3

λ = ‖∇f(x_m)‖³ / (∇f(x_m)^T H ∇f(x_m)).     (E.11)

Note that because of the one-dimensional minimisation at each step, the method
can be computationally expensive. Some authors use decreasing step-size selection
λ = α^k (0 < α < 1), for k = 1, 2, . . ., until the first k where f(·) has decreased
(Cadzow 1996).
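A minimal Python sketch of steepest descent with the decreasing step-size rule just mentioned (the step is shrunk by a factor α until f decreases; the quadratic test problem is illustrative only):

import numpy as np

def steepest_descent(f, grad, x0, alpha=0.5, maxit=500, tol=1e-8):
    # steepest descent: search along -g/||g||, shrink the step until f decreases
    x = x0.astype(float)
    for _ in range(maxit):
        g = grad(x)
        gn = np.linalg.norm(g)
        if gn < tol:
            break
        u = -g / gn
        lam = 1.0
        while f(x + lam * u) >= f(x):      # decreasing step sizes alpha^k
            lam *= alpha
            if lam < 1e-14:
                return x
        x = x + lam * u
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
print(steepest_descent(f, grad, np.array([5.0, 5.0])), np.linalg.solve(A, b))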

E.4.2 Newton–Raphson Method

This is a generalisation of the one-dimensional Newton–Raphson method and is


based on the minimisation of the quadratic form Eq. (E.1), where the search
direction is given by

u = −H−1 ∇f (x). (E.12)

At the (m + 1)th iteration, the approximation to the minimum is given by

xm+1 = xm − H−1 (xm )∇f (xm ). (E.13)

Note that it is also possible to choose xm+1 = xm − λH−1 ∇f (xm ), where λ can be
found through a one-dimensional minimisation as in the steepest descent.
Newton method requires the inverse of the Hessian at each iteration, and this can
be quite expensive particularly for large problems. There is also another drawback
of the approach, namely, the convergence towards the minimum can be secured
only if the Hessian is positive definite. Similarly, the steepest descent is no better
since it is known to exhibit a linear convergence, i.e. a slow convergence rate. These
drawbacks have led to the development of more advanced and improved algorithms.
Among these methods, two main classes of algorithms stand out, namely, the
conjugate gradient and the quasi-Newton methods discussed next.

3 One can eliminate the Hessian from Eq. (E.11) by choosing a first guess λ_0 for λ and then using
Eq. (E.7) with δx = −λ_0 g/‖g‖, which yields λ = (λ_0²/2) ‖g‖ [f(x − λ_0 g/‖g‖) − f(x) + λ_0‖g‖]^{−1}.

E.4.3 Conjugate Gradient

It is possible that the descent direction −g = −∇f(x) and the direction to the
minimum are nearly orthogonal, which can explain the slow convergence
rate of the steepest descent. For a quadratic function, for example, the best search
direction is conjugate to that taken at the previous step (Fletcher 1972, th. 1). This is
the basic idea of conjugate gradient for which the new search direction is constructed
to be conjugate to the gradient of the previous step. The method can be thought of as
an association of conjugacy with steepest descent (Fletcher 1972), and is also known
as Fletcher–Reeves (or projection) method. From the set of conjugate gradients −gk ,
k = 1, . . . , m, a new set of conjugate directions is formed via linear combination as


u_k = −g_{k−1} + Σ_{j=1}^{k−1} α_{jk} u_j ,     (E.14)

where αj k = −gTk−1 Huj /uTj Huj , j = 1, . . . , k − 1, and gk = ∇f (xk ). Since in a


quadratic form, e.g. Eq. (E.5), one has

δgk = gk+1 − gk = Hδx = H(xk+1 − xk ),

and because in a linear (one-dimensional) search δxk = λk uk , one gets

α_{jk} = − g_{k−1}^T δg_{j−1} / (u_j^T δg_{j−1})     (E.15)

for j = 1, . . . , k − 1. Furthermore, α_{jk}, j = 1, . . . , k − 2, vanishes,4 yielding

u_k = −g_{k−1} + (g_{k−1}^T δg_{k−2} / u_{k−1}^T δg_{k−2}) u_{k−1} ,

which simplifies to

u_k = −g_{k−1} + (g_{k−1}^T g_{k−1} / g_{k−2}^T g_{k−2}) u_{k−1} ,     (E.16)

4 After k − 1 one-dimensional searches in (u_1, . . . , u_{k−1}), the quadratic form is minimised at x_{k−1};
then g_{k−1} is orthogonal to u_j, j = 1, . . . , k − 2, because of the one-dimensional requirement for
minimisation in each direction (d/dα f(x_{k−2} + αu_j) = g_{k−1}^T u_j = 0). Furthermore, since the
vectors u_j are linear combinations of g_i, i = 1, . . . , j, the vectors g_j are also linear combinations
of u_1, . . . , u_j, i.e. g_j = Σ_{i=1}^j α_i u_i, hence g_{k−1}^T g_j = 0, j = 1, . . . , k − 2.

for k = 1, . . . , n with u1 = −g0 . For a quadratic function, the algorithm converges


in at most n iterations, where n is the problem dimension. For a general function,
Eq. (E.16) can be used to update the search direction at every iteration; in
practice u_k is reset to −g_{k−1} after every n iterations.
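A compact Python sketch of the Fletcher-Reeves update of Eq. (E.16), with a simple line search and the periodic restart mentioned above (illustrative only; in practice scipy.optimize.minimize with method='CG' can be used instead):

import numpy as np
from scipy.optimize import minimize_scalar

def fletcher_reeves(f, grad, x0, maxit=200, tol=1e-8):
    x = x0.astype(float)
    g = grad(x)
    u = -g
    n = x.size
    for it in range(maxit):
        if np.linalg.norm(g) < tol:
            break
        lam = minimize_scalar(lambda a: f(x + a * u)).x   # one-dimensional minimisation along u
        x = x + lam * u
        g_new = grad(x)
        if (it + 1) % n == 0:
            u = -g_new                                    # periodic restart every n iterations
        else:
            beta = (g_new @ g_new) / (g @ g)              # Eq. (E.16)
            u = -g_new + beta * u
        g = g_new
    return x

# quadratic test problem: the minimum is A^{-1} b, reached in at most n iterations
A = np.diag([1.0, 10.0, 100.0])
b = np.array([1.0, 1.0, 1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
print(fletcher_reeves(f, grad, np.zeros(3)), np.linalg.solve(A, b))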

E.4.4 Quasi-Newton Method

The Newton–Raphson direction −H−1 g may be thought of as an improvement (or


correction) to the steepest descent direction −g = −∇f (x). The quasi-Newton
approach attempts to take advantage of the steepest descent and quadratic conver-
gence rates of the basic second-order Newton method. It is based on approximating
the inverse of the Hessian matrix H. In the modified Newton–Raphson, Goldfeld et
al. (1966) propose to use the following iterative scheme:

xk+1 = xk − (λIn + H)−1 g (E.17)

based on the approximation:


(I_n + λ^{−1} H)^{−1} ≈ I_n − λ^{−1} H + λ^{−2} H² + . . . .     (E.18)

The most widely used quasi-Newton procedure, however, is the Davidon–Fletcher–
Powell method (Fletcher and Powell 1963), sometimes referred to as the variable metric
method, which is based on approximating the inverse H^{−1} by an iterative procedure
for which the kth iteration reads

xk+1 = xk − λk Sk gk , (E.19)

where Sk is a sequence that converges to H−1 and is given by

S_{k+1} = S_k − (S_k δg_k δg_k^T S_k)/(δg_k^T S_k δg_k) + (δx_k δx_k^T)/(δx_k^T δg_k),     (E.20)

where δg_k = g_{k+1} − g_k and δx_k = x_{k+1} − x_k = −λ_k S_k g_k . Note that there exist in
the literature various other formulae for updating S_k , see e.g. Adby and Dempster
(1974) and Press et al. (1992). These techniques can be adapted and simplified
further depending on the objective function, such as the case of the sum of squares
encountered in least squares regression analysis, see e.g. Everitt (1987).
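A sketch of the Davidon-Fletcher-Powell iteration, Eqs. (E.19)-(E.20), in Python (illustrative only; for routine work scipy.optimize.minimize with method='BFGS', a closely related variable metric scheme, would normally be used):

import numpy as np
from scipy.optimize import minimize_scalar

def dfp(f, grad, x0, maxit=100, tol=1e-8):
    x = x0.astype(float)
    S = np.eye(x.size)                  # current approximation of H^{-1}
    g = grad(x)
    for _ in range(maxit):
        if np.linalg.norm(g) < tol:
            break
        u = -S @ g
        lam = minimize_scalar(lambda a: f(x + a * u)).x   # line search for lambda_k
        dx = lam * u
        x_new = x + dx
        g_new = grad(x_new)
        dg = g_new - g
        # Eq. (E.20): rank-two update of S
        S = S - np.outer(S @ dg, S @ dg) / (dg @ S @ dg) + np.outer(dx, dx) / (dx @ dg)
        x, g = x_new, g_new
    return x

# Rosenbrock-type test function and its gradient (illustrative)
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([-2*(1 - x[0]) - 400*x[0]*(x[1] - x[0]**2),
                           200*(x[1] - x[0]**2)])
print(dfp(f, grad, np.array([-1.2, 1.0])))     # converges towards (1, 1)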

E.4.5 Ordinary Differential Equations-Based Methods

Optimisation techniques based on solving systems of ordinary differential equations


have also been proposed and used for some time, although not much in atmospheric
science, see e.g. Hannachi et al. (2006); Hannachi (2007). The approach seeks the
solution to

min_x F(x),     (E.21)

where x is an n-dimensional real vector, by following trajectories of a system of


ordinary differential equations. For instance, we know that if x∗ is a solution to
Eq. (E.21), then ∇F (x∗ ) = 0. Therefore by integrating the dynamical system

dx
= −∇F (x), (E.22)
dt
starting from a suitable initial condition, one should converge in principle to x∗ .
This method can be regarded as the continuous version of the steepest descent
algorithm. In fact, Eq. (E.22) becomes equivalent to the steepest descent algorithm when
dx/dt is approximated by the simple finite difference (x_{t+h} − x_t)/h. The system of ODE,
Eq. (E.22), can be interpreted as the equation describing a particle moving in
a potential well given by F (.). Note that Eq. (E.22) can also be replaced by a
continuous Newton equation of the form:

dx
= −H−1 (x) ∇F (x), (E.23)
dt

where H is the Hessian matrix of F () at x. Some of these techniques have


been reviewed in Brown (1986), Botsaris (1978), Aluffi-Pentini et al. (1984) and
Snyman (1982). It is argued, for example, in Brown (1986) that ordinary differential
equation-based methods compare very favourably with conventional Newton and
quasi-Newton algorithms, see Hannachi et al. (2006) for an application to simplified
EOFs.
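A minimal sketch of Eq. (E.22) using scipy.integrate.solve_ivp: the gradient system is integrated from an initial guess over a long time interval until it settles near a stationary point of F (the test function and parameter values are illustrative only).

import numpy as np
from scipy.integrate import solve_ivp

F = lambda x: (x[0] - 1)**2 + 5 * (x[1] + 2)**2 + 0.1 * np.sin(3 * x[0])
gradF = lambda x: np.array([2 * (x[0] - 1) + 0.3 * np.cos(3 * x[0]),
                            10 * (x[1] + 2)])

# dx/dt = -grad F(x), Eq. (E.22), integrated from an arbitrary starting point
sol = solve_ivp(lambda t, x: -gradF(x), (0.0, 50.0), np.array([4.0, 3.0]), rtol=1e-8)
x_star = sol.y[:, -1]
print(x_star, np.linalg.norm(gradF(x_star)))   # the gradient is close to zero at x_star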

E.5 Constrained Minimisation

E.5.1 Background

Constrained minimisation problems are more subtle than unconstrained problems.


We give here a brief review, and for more details the reader is referred to more
specialised textbooks on optimisation, see e.g. Gill et al. (1981). A typical (smooth)
constrained minimisation problem takes the form:

min_x f(x)  s.t.
g_i(x) = 0,  i = 1, . . . , r     (E.24)
h_j(x) ≤ 0,  j = 1, . . . , m.

When the functions involved in Eq. (E.24) are convex or polynomials, the problem
is known under the name of mathematical programming. For instance, if f (.) is
quadratic or convex and the constraints are linear, efficient programming procedures
exist for the minimisation.
In general, most algorithms attempt to transform Eq. (E.24) into an unconstrained
problem. This can be done easily, via a change of variable, when the constraints are
simple. The following examples illustrate this.
• For constraints of the form x ≥ 0, the change of variable is x = y².
• For a ≤ x ≤ b, one can have x = (a + b)/2 + ((b − a)/2) sin y.

In a number of cases the inequality h(x) ≤ 0 can be handled by introducing a slack


variable y yielding

h(x) + y² = 0.

Equality constraints can be handled in general by introducing Lagrange multipliers.


Under some regularity conditions,5 a necessary condition for x∗ to be a constrained
local minimum of Eq. (E.24) is the existence of Lagrange multipliers u∗ =
(u∗1 , . . . , u∗r )T and v∗ = (v∗1 , . . . , v∗m )T such that
 
∇f(x*) + Σ_{i=1}^r u*_i ∇g_i(x*) + Σ_{j=1}^m v*_j ∇h_j(x*) = 0
v*_j h_j(x*) = 0,  j = 1, . . . , m     (E.25)
v*_j ≥ 0,  j = 1, . . . , m;

the conditions given by Eq. (E.25) are known as Kuhn–Tucker optimality conditions
and express the stationarity of the Lagrangian:


L(x; u, v) = f(x) + Σ_{i=1}^r u_i g_i(x) + Σ_{j=1}^m v_j h_j(x)     (E.26)
i=1 j =1

at x∗ for the optimum values u∗ and v∗ . Note that the first vector equation in
Eq. (E.25) can be solved by minimising the sum of squares of its elements, i.e.
min Σ_{k=1}^n (∂L/∂x_k)². In mathematical programming, system Eq. (E.25) is generally
referred to as the dual problem of Eq. (E.24).

5 Namely, linear independence between ∇hj (x∗ ) and ∇gi (x∗ ), i = 1, . . . , r, for all j satisfying
hj (x∗ ) = 0.

E.5.2 Approaches for Constrained Minimisation


Lagrangian Method

It is based on minimising, at each iteration, the Lagrangian:

L (x; u, v) = f (x) + uTk g(x) + vTk h(x), (E.27)

yielding a minimum xk+1 at the next iteration step k + 1. The multipliers uk+1 and
vk+1 are taken to be the optimal multipliers for the linearised constraints:

g(x_{k+1}) + (x − x_{k+1})^T ∇g(x_{k+1}) = 0     (E.28)
h(x_{k+1}) + (x − x_{k+1})^T ∇h(x_{k+1}) ≤ 0.

This method is based on linearising the constraints about the current point xk+1 .
More details can be found in Adby and Dempster (1974) and Gill et al. (1981).
Note that in most iterative techniques, an initial feasible point can be obtained by
minimising Σ_j h_j(x) + Σ_i g_i²(x).

Penalty Function

The basic idea of penalty is as follows. In the search of a constrained minimum of


some function, one often encounters the common situation where the constraints
are of the form g(x) ≤ 0, and at each iteration the newly formed x has to satisfy
these constraints. A simple way to handle this is by forming a linear combination
of the elements of g(x), i.e. uT g(x), known as penalty function that accounts for
the positive components of g(x). The components of u are zero whenever the
corresponding components of g(x) do not violate the constraints (i.e. non-positive)
and large positive otherwise. One then has to minimise the sum of the original
objective function and the penalty function, i.e. the penalised objective function.
Minimising the penalised objective function can prevent the search algorithm from
choosing directions where the constraints are violated.
In general terms, the penalised method is based on sequentially minimising an
unconstrained problem of the form:
 
F(x) = f(x) + Σ_j w_j G(h_j(x), ρ) + Σ_i H(g_i(x), ρ),     (E.29)

where wj , j = 1, . . . , m, and ρ are parameters that can change value during the
minimisation, and usually ρ decreases to zero as the iteration number increases.
The functions G() and H () are penalty functions. For example, the function

G(u, ρ) = u²/ρ     (E.30)

is one of the widely used penalties. When inequality constraints are present, and
for a fixed ρ, the barrier function G() is non-zero in the interior of the feasible
region (hj (x) ≤ 0, j = 1, . . . , m) and infinite on its border. This maintains
iterates xk inside the feasible set, and as ρ → 0 the constrained minimum is
approached. Examples of barrier functions in this case include log(−h(x)) and
ρ/h²(x).
. The following penalty function

Σ_j ρ³ w_j / h_j²(x) + (1/ρ) Σ_i g_i²(x)     (E.31)

has also been used for problem Eq. (E.24).
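A minimal Python sketch of the penalty idea, Eqs. (E.29)-(E.30): an equality-constrained problem is solved by minimising a sequence of penalised unconstrained problems while ρ decreases (SciPy's Nelder-Mead is used for the inner minimisation; the test problem and all names are illustrative).

import numpy as np
from scipy.optimize import minimize

# minimise f(x) = x1^2 + 2 x2^2 subject to g(x) = x1 + x2 - 1 = 0
f = lambda x: x[0]**2 + 2 * x[1]**2
g = lambda x: x[0] + x[1] - 1.0

x = np.array([0.0, 0.0])
for rho in [1.0, 0.1, 0.01, 0.001]:
    penalised = lambda x: f(x) + g(x)**2 / rho        # G(u, rho) = u^2 / rho
    x = minimize(penalised, x, method='Nelder-Mead').x   # warm start from the previous solution
print(x)        # approaches the constrained minimum (2/3, 1/3)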

Gradient Projection

This method is based on finding search directions by projecting the gradient −g =


−∇f (x) onto the hyperplane tangent to the feasible set, i.e. the set satisfying
the constraints, at the current point x. The inequality constraints (that are not
satisfied) and the equality constraints are linearised around the current point, i.e.
by considering

Kx = 0, (E.32)

where K is an (r + l1 ) × n matrix, and l1 is the number of constraints with hj (x) ≥ 0. Then at


each iteration the constraints are linear, and the direction is obtained by projecting
−g onto the tangent space to obtain u, i.e.

− g = u + KT w. (E.33)

Using Eq. (E.32), one gets


w = −(KK^T)^{−1} K g,     (E.34)

and the negative projected gradient reads


u = −(I_n − K^T (KK^T)^{−1} K) g.     (E.35)

The algorithm goes as follows:


0. Choose x0 from the feasible set.

1. Linearise the equations g_i(·), i = 1, . . . , r, and the currently binding inequalities, i.e.
those for which h_j(x_k) ≥ 0, to compute K in Eq. (E.32).
2. Compute the projected gradient u from Eq. (E.35).
3. Minimise the one-dimensional function f(x_k + λ_{k+1}u), and then set x^(1)_{k+1} =
x_k + λ_{k+1}u. If x^(1)_{k+1} is feasible, then set x_{k+1} = x^(1)_{k+1}; otherwise use a suitable
version of Newton’s method applied to (1/ρ) Σ_j h_j²(x) to find a point x^(2)_{k+1} on
the boundary of the feasible region.
4. If f(x^(2)_{k+1}) ≤ f(x_k), set x_{k+1} = x^(2)_{k+1} and then go to 1. Otherwise go to 3,
and solve for λ_{k+1} by generating, e.g., a sequence x^(t)_{k+1} = x_k + (1/τ^{t−2}) λ_{k+1} u until
f(x^(t)_{k+1}) ≤ f(x_k) is satisfied.
5. Iterate steps 1 to 4 until the constrained minimum is obtained. The optimal
multipliers v_i corresponding to the binding inequalities and u* are given by w
in Eq. (E.34).
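The projection of Eqs. (E.33)-(E.35) is a one-liner in NumPy. The sketch below (illustrative only, and restricted to linear equality constraints Kx = 0 with a feasible starting point) performs a few projected-gradient iterations for a simple quadratic objective.

import numpy as np

rng = np.random.default_rng(5)
n = 6
K = rng.standard_normal((2, n))            # two linear equality constraints K x = 0
c = rng.standard_normal(n)
f = lambda x: 0.5 * np.sum((x - c)**2)
grad = lambda x: x - c

P = np.eye(n) - K.T @ np.linalg.inv(K @ K.T) @ K   # projector onto {x : K x = 0}
x = np.zeros(n)                                    # feasible starting point
for _ in range(200):
    u = -P @ grad(x)                               # Eq. (E.35): negative projected gradient
    x = x + 0.5 * u                                # fixed step (a line search in general)

print(K @ x)                          # the constraints remain satisfied (numerically zero)
print(np.linalg.norm(P @ grad(x)))    # the projected gradient vanishes at the solution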

Other Gradient-Related Methods

Another gradient-related approach is the multiple gradient summation where the


search direction is given by

u = − ∇f(x_k)/‖∇f(x_k)‖ − Σ_j ∇h_j(x_k)/‖∇h_j(x_k)‖,     (E.36)

where the summation is taken over the violated constraints at the current point xk .
Another search method, based on small step gradient, is given by


u = −∇f(x_k) − Σ_{j=1}^m w_j(x_k) ∇h_j(x_k),     (E.37)

where wj (xk ) = w if hj (xk ) > 0 (w is a suitably chosen large constant) and zero
otherwise, see Adby and Dempster (1974).
The ordinary differential equations-based method can also be used in constrained
minimisation in a similar way after the problem has been transformed into an
unconstrained minimisation problem, see e.g. Brown (1986) and Hannachi et al.
(2006) for the case of simplified EOFs.
Appendix F
Hilbert Space

This appendix introduces some concepts of linear vector spaces, metrics and Hilbert
spaces.

F.1 Linear Vector and Metric Spaces

F.1.1 Linear Vector Space

A linear vector space X is a set of elements x, y, . . . on which one can define


addition x + y between elements x and y of X satisfying, for all elements x, y,
and z, the following properties:
• x + y = y + x.
• (x + y) + z = x + (y + z).
• The null element 0, satisfying x + 0 = x, belongs to X .
• The “inverse” −x of x, satisfying x + (−x) = 0, is also in X .
The first two properties are known, respectively, as commutativity and associativity.
These properties make X a commutative group. In addition, a scalar multiplication
has to be defined on X with the following properties:
• α (x + y) = αx + αy and (α + β) x = αx + βx,
• α (βx) = (αβ) x,
• 1x = x,
for any x and y in X , and scalars α and β.


F.1.2 Metric Space

A metric d(., .) defined on a set X is a real valued function defined over X × X with
the following properties:
(i) d(x, y) = d(y, x),
(ii) d(x, y) = 0 if and only if x = y,
(iii) d(x, y) ≤ d(x, z) + d(z, y), for all x, y and z in X .
A set X with a metric d(., .) is referred to as a metric space (X , d).

F.2 Norm and Inner Products

F.2.1 Norm

A norm on a linear vector space X , denoted by ‖·‖, is a real valued function
satisfying, for all vectors x and y in X and scalar λ, the following properties:
(i) ‖x‖ ≥ 0, and ‖x‖ = 0 if and only if x = 0,
(ii) ‖λx‖ = |λ| ‖x‖,
(iii) ‖x + y‖ ≤ ‖x‖ + ‖y‖.
A linear vector space with a norm is called a normed space.

F.2.2 Inner Product

An inner product defined on a linear vector space X is a scalar function defined on


X × X , denoted by < ., . >, satisfying, for all vectors x and y in X and scalar λ,
the following properties:
(i) < λx + y, z >= λ < x, z > + < y, z >,
(ii) < x, y >= (< y, x >)∗ , where the superscript (∗ ) stands for the complex
conjugate,
(iii) < x, x >≥ 0, and < x, x >= 0 if and only if x = 0.
A linear vector space X with an inner product is an inner product space.

F.2.3 Consequences

The existence of a metric and/or a norm leads to defining various topologies.


For example, given a metric space (X , d), a sequence {x_n}_{n=1}^∞ of elements in X
converges to an element x_0 in X if

lim_{n→∞} d(x_n, x_0) = 0.

Similarly, a sequence {x_n}_{n=1}^∞ of elements in a normed space X , with norm ‖·‖, is
said to converge to x_0 in X if

lim_{n→∞} ‖x_n − x_0‖ = 0.

Also, a sequence {x_n}_{n=1}^∞ of elements in an inner product space X converges to an
element x_0 in X if

lim_{n→∞} < x_n − x_0 , x_n − x_0 > = 0.

The existence of an inner product in a linear vector space X allows the definition of
orthogonality as follows. Two vectors x and y are orthogonal, denoted by x ⊥ y, if
< x, y >= 0.

F.2.4 Properties

1. A normed linear space, with norm ‖·‖, defines a metric space with the metric
given by d(x, y) = ‖x − y‖.
2. An inner product space X , with inner product < ., . >, is a normed linear space
with the norm defined by ‖x‖ = < x, x >^{1/2}, and is consequently a metric space.
3. For any x and y in an inner product space X , the following properties hold.
• | < x, y > | ≤ ‖x‖ ‖y‖,
• ‖x + y‖² + ‖x − y‖² = 2‖x‖² + 2‖y‖². This is known as the parallelogram
identity.
4. Given an n-dimensional linear vector space with an inner product, one can always
construct an orthonormal basis (u1 , . . . , un ), i.e. < uk , ul >= δkl .
5. Also, the limit of the sum of two sequences in an inner product space is the sum
of the limit of the sequences. Similarly, the limit of the inner product of two
sequences is the inner product of the limits of the corresponding sequences.

F.3 Hilbert Space

F.3.1 Completeness

Let (X , d) be a metric space. A sequence {x_n}_{n=1}^∞ of elements in X is a Cauchy
sequence if for every real ε > 0 there exists a positive integer N for which
d(x_n, x_m) < ε for all m ≥ N and n ≥ N. A metric space is said to be complete
if every Cauchy sequence in the metric space converges to an element within the
space.

F.3.2 Hilbert Space

A complete inner product space X with the metric

d(x, y) = < x − y, x − y >^{1/2} = ‖x − y‖

is a Hilbert space. A number of results can be drawn from a Hilbert space.
For example, if the sequence {x_n}_{n=1}^∞ of elements in a given Hilbert space X is
orthogonal, then the sequence {y_n}_{n=1}^∞, given by y_n = Σ_{k=1}^n x_k , converges in X
if and only if the scalar series Σ_{k=1}^∞ ‖x_k‖² converges, see e.g. Kubáčkouá et al.
(1987).
A fundamental result in Hilbert spaces concerns the concept of approximation of
vectors from the Hilbert space by vectors from subspaces. This result is expressed
under the so-called projection theorem, given below (see e.g. Halmos 1951).
Projection Theorem Let U be a Hilbert space and V a Hilbert subspace of U . Let
also u be a vector in U but not in V, and v a vector in V. Then there exists a unique
vector v̂ in V such that

‖u − v̂‖ = min_{v in V} ‖u − v‖.

Furthermore, the vector v̂ is uniquely determined by the property that < u − v̂, v > = 0
for all v in V.
The vector v̂ is called the (orthogonal) projection of u onto V.
The concept of Hilbert space finds its natural way in time series and prediction
theory, and we provide a few examples below.

Examples of Hilbert Space

• Example 1

Consider the collection U of all (complex) random variables U , with zero mean and
finite variance, i.e. E(U) = 0 and E(|U|²) < ∞, defined on some sample space.
The following operation, defined for all random variables U and V in U by

< U, V > = E(U*V),

where U ∗ is the complex conjugate of U , defines a scalar product over U and makes
U a Hilbert space, see e.g. Priestly (1981, p. 190).
Exercise Show that the above operation is well defined.
Hint Use the fact that V ar(λU + V ) ≥ 0 for all scalar λ to deduce that < U, V >
is well defined.
The theory of Hilbert space in stochastic processes and time series started towards
the late 1940s (Loève 1948) and was lucidly formulated by Parzen (1959, 1961) in
the context of random function (stochastic processes). The concept of Hilbert space,
and in particular the projection theorem, finds natural application in the theory of
time series prediction.
• Example 2
Designate by T a subset of the real numbers, and let {Xt , for t in T } be a stochastic
process (or random function) satisfying E|X_t|² < ∞ for t in T . Such a stochastic
process is said to be second order. Let U be the set of random variables of the form
U = Σ_{k=1}^n c_k X_{t_k}, where n is a positive integer, c_1, . . . , c_n are scalars and t_1, . . . , t_n
are elements in T . That is, U is the set of all finite linear combinations of random
variables Xt for t in T and is known as the space spanned by the random function
{Xt , for t in T }. The inner product < U, V >= E(U V ∗ ) induces an inner product
on U . The space U , extended by including all random variables that are limit of
sequences in U , i.e. random variables W satisfying

lim_{n→∞} ‖W_n − W‖ = 0

for some sequence {W_n}_{n=1}^∞ in U , is a Hilbert space (see e.g. Parzen 1959).

F.3.3 Application to Prediction


The Univariate Case

Let H be the Hilbert space defined in example 1 above and {Xt , t = 0, ±1, ±2, . . .}
a (discrete) stochastic process. Let now Ht be the subset spanned by the sequence
Xt , Xt−1 , Xt−2 , . . .. Using the same reasoning as in example 2 above, Ht is a
Hilbert space.

Let now m be a given positive integer, and our objective is to estimate Xt+m
using elements from Ht . This is the classical prediction problem, which seeks an
element X̂t+m in Ht satisfying
 
‖X_{t+m} − X̂_{t+m}‖² = E(|X_{t+m} − X̂_{t+m}|²) = min_{Y in H_t} ‖X_{t+m} − Y‖².

Hence X̂t+m is simply the orthogonal projection of Xt+m onto Ht . From the
projection theorem, we get
 
E[(X_{t+m} − X̂_{t+m}) Y] = 0,

that is, Xt+m − X̂t+m is orthogonal to Y , for any random variable Y in Ht .


In prediction theory, the set Ht is sometimes referred to as the set of all
possible predictors and that the predictor X̂t+m provides the minimum mean square
prediction error (see e.g. Karlin and Taylor 1975, p. 464). Since X̂t+m is an
element of Ht , the previous orthogonality also yields another orthogonality between
Xt+m − X̂t+m and Xt+n − X̂t+n for n < m. This is because Ht is a subspace of
Hs for t ≤ s. This also yields a general orthogonality between Xt+k − X̂t+k and
Xt+l − X̂t+l , i.e.
  
E[(X_{t+k} − X̂_{t+k})(X_{t+l} − X̂_{t+l})] = 0, for k ≠ l.

In probabilistic terms we consider the stochastic process (Xt ) observed for t ≤ n,


and we seek to “estimate” the random variable Xn+h . The conditional probability
distribution of the random variable Xn+h given In = {Xt , t ≤ n} is fh (xn+h |xt , t ≤
n) = P r(Xn+h ≤ x|Xt , t ≤ n) = fh (x). The knowledge of fh (.) permits the
determination of all the conditional properties of Xn+h |In . The estimate X̂n+h of
Xn+h |In is then chosen as a solution to the minimisation problem:
  
min_Y E[(X_{n+h} − Y)² | I_n] = min_y ∫ (x − y)² f_h(x) dx.     (F.1)

The solution is automatically given by

X̂n+h = E (Xn+h |Xt , t ≤ n) , (F.2)

and the term εn+h = Xn+h − X̂n+h is known as the forecast error.
Exercise Show that the solution to Eq. (F.1) is given by Eq. (F.2).

Hint Recall the condition ∫ f_h(x) dx = 1 and use a Lagrange multiplier.
An important result emerges when {X_t} is Gaussian, namely, E(X_{n+h}|X_t, t ≤ n)
is a linear function of X_t , t ≤ n, and this is the reason behind choosing
linear predictors. The general linear predictor is then

X̂_{n+h} = Σ_{k≥0} α_k X_{n−k}.     (F.3)

The predictor X̂_{n+h} is meant to optimally approximate the (unobserved) future value
X_{n+h} of the time series. In stationary time series the forecast error ε_{n+h} is also
stationary, and its variance σ² = E(ε²_{n+h}) is the forecast error variance.
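As a concrete illustration of Eqs. (F.2)-(F.3), consider a stationary AR(1) process X_t = φX_{t−1} + ε_t. The conditional mean E(X_{n+h}|X_t, t ≤ n) is then simply φ^h X_n, and the h-step forecast error variance is σ²(1 − φ^{2h})/(1 − φ²), with σ² the noise variance. The short Python sketch below (synthetic data, illustrative only) checks this empirically.

import numpy as np

rng = np.random.default_rng(8)
phi, sigma, n, h = 0.8, 1.0, 200000, 3
eps = rng.normal(0.0, sigma, n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t-1] + eps[t]         # AR(1) process

pred = phi**h * x[:-h]                   # linear predictor of x[t+h] given x[t]
err = x[h:] - pred                       # forecast error
print(err.var(), sigma**2 * (1 - phi**(2*h)) / (1 - phi**2))   # the two values are close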

The Multivariate Case

Prediction of multivariate time series is more subtle than single variables time series
not least because matrices are involved. Matrices have two main features, namely,
they do not (in general) commute, and they can be singular without being null.
In this appendix a brief review of the multivariate prediction problem is given.
For a full discussion on prediction of vector time series, the reader is referred to
Doob (1953), Wiener and Masani (1957, 1958), Helsen and Lowdenslager (1958),
Rozanov (1967), Masani (1966), Hannan (1970) and Koopmans (1974), and the
up-to-date text by Wei (2019).
Let x_t = (X_{t1}, . . . , X_{tp})^T denote a p-dimensional second-order (E|x_t|² <
∞) zero-mean random vector. Let also {x_t , t = 0, ±1, ±2, . . .} be a second-order
vector random function (or stochastic process, or time series), H the Hilbert space
spanned by this random function, i.e. the space spanned by X_{t,k}, k = 1, . . . , p,
t = 0, ±1, ±2, . . ., and finally, H_n the Hilbert space spanned by X_{t,k}, k = 1, . . . , p,
t ≤ n. A p-dimensional random vector y = (Y_1, . . . , Y_p)^T is an element of H_n
if each component Y_k , k = 1, . . . , p, belongs to H_n . Stated otherwise H_n can be
regarded as composed of random vectors y that are finite linear combinations of
elements of the vector random functions of the form:


y = Σ_{k=1}^m A_k x_{t_k}

for some integers m, t1 , . . . , tm , and p ×p (complex) matrices Ak , k = 1, . . . , m. To


be consistent with the definition of uncorrelated random vectors, a generalised inner
product on H, known as the Gramian matrix (see e.g. Koopmans 1974), is defined
by

⟨u, v⟩_p = E(uv*),

where (∗ ) stands for the transpose complex conjugate.1 Note that the norm over H
is the trace of the Gramian matrix, i.e.


‖x‖² = Σ_{k=1}^p E|X_k|² = tr(⟨x, x⟩_p).

It can be seen that the orthogonality of random vectors is equivalent to non-


correlation (as in the univariate case). Let u = (U1 , . . . , Up ) be a random vector
in H. The projection û = (Û1 , . . . , Ûp ) of u onto Hn is a random vector whose
components are the projection of the associated components of u. The projection
theorem yields, for any y in Hn ,
 
⟨u − û, y⟩_p = E[(u − û) y*] = O.

As for the univariate case, the predictor x̂t+m of xt+m is given by the orthogonal
projection of xt+m onto Ht . The prediction error εt+m = xt+m − x̂t+m is orthogonal
to all vectors in H_t . Also, ε_k is orthogonal to ε_l for l ≠ k, i.e.

E(ε_k ε_l^T) = δ_{kl} Σ,

where Σ is the covariance matrix of the prediction error ε_k . The prediction error
variance tr(E[ε_{t+1} ε_{t+1}^T]) of the one-step ahead prediction is given in Chap. 8.

1 That is, the Gramian matrix consists of all the inner products between the individual components
of u and v.
Appendix G
Systems of Linear Ordinary Differential
Equations

This appendix gives the solutions of systems of ordinary differential equations


(ODEs) of the form:

dx/dt = Ax + b     (G.1)

with the initial condition x0 = x(t0 ), where A is a m × m real (or complex) matrix
and b is a m-dimensional real (or complex) vector. When A is constant, the solution
is quite simple, but when it is time-dependent the solution is slightly more elaborate.

G.1 Case of a Constant Matrix A

G.1.1 Homogeneous System

By using the exponential form of matrices:


e^A = Σ_{k≥0} (1/k!) A^k,     (G.2)

which can also be extended to e^{tA} for any scalar t, one gets

d(e^{tA})/dt = e^{tA} A = A e^{tA}.

Remark In general e^{A+B} ≠ e^A e^B. However, if A and B commute, then equality holds.


Using the above result, the solution of

dx/dt = Ax     (G.3)
with initial condition x0 is

x(t) = etA x0 . (G.4)

Remark The above result can be used to solve the differential equation:

d^m y/dt^m + a_{m−1} d^{m−1}y/dt^{m−1} + . . . + a_1 dy/dt + a_0 y = 0     (G.5)
with initial conditions y(t_0), dy(t_0)/dt, . . . , d^{m−1}y(t_0)/dt^{m−1}. The above ODE can be trans-
formed into a system similar to Eq. (G.3), with the Frobenius (companion) matrix A given by
A = (  0      1      0     · · ·   0
       0      0      1     · · ·   0
       ⋮                           ⋮
       0      0      0     · · ·   1
      −a_0   −a_1   −a_2   · · ·  −a_{m−1} ),

and x(t) = (y(t), dy(t)/dt, . . . , d^{m−1}y(t)/dt^{m−1})^T, with initial condition x_0 = x(t_0).
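A short Python sketch of Eq. (G.4) using scipy.linalg.expm, checked against a direct numerical integration of the same system (the matrix and initial condition are arbitrary, for illustration only):

import numpy as np
from scipy.linalg import expm
from scipy.integrate import solve_ivp

A = np.array([[0.0, 1.0], [-2.0, -0.3]])     # a damped oscillator written as a first-order system
x0 = np.array([1.0, 0.0])
t = 5.0

x_exact = expm(t * A) @ x0                   # Eq. (G.4): x(t) = e^{tA} x0
x_num = solve_ivp(lambda s, x: A @ x, (0.0, t), x0, rtol=1e-10).y[:, -1]
print(x_exact, x_num)                        # the two solutions agree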

G.1.2 Non-homogeneous System

Here we consider the non-homogeneous case corresponding to Eq. (G.1), and we


assume that b = b(t), i.e. b is time-dependent. By noting that dx/dt − Ax =
e^{tA} d/dt (e^{−tA} x), the solution is given by

x(t) = e^{tA} x_0 + ∫_{t_0}^t e^{(t−s)A} b(s) ds.     (G.6)

Remark Equation (G.6) can be used to integrate an mth-order non-homogeneous
differential equation.

G.2 Case of a Time-Dependent Matrix A

G.2.1 General Case

We consider now the following system of ODEs:

dx/dt = A(t)x     (G.7)
with initial condition x0 . The theory behind the integration of Eq. (G.7) is
based on using a set of independent solutions of the differential equation. If
x1 (t), . . . , xm (t) is a set of m solutions of Eq. (G.7) with respective initial conditions
x1 (t0 ), . . . , xm (t0 ), assumed to be linearly independent, then the matrix M(t) =
(x1 (t), . . . , xm (t)) satisfies the following system of ODEs:

dM/dt = AM.     (G.8)

It turns out that if M(t0 ) is invertible the solution to (G.8) is also invertible.
Remark It can be shown, see e.g. Said-Houari (2015) or Teschl (2012), that the
Wronskian W (t) = det (M(t)) satisfies the ODE:

dW/dt = tr(A)W,     (G.9)

(or W(t) = W(t_0) exp(∫_{t_0}^t tr(A(u))du)). The Wronskian can be used to show that,
like M(t0 ), M(t) is also invertible.
The solution to Eq. (G.7) then takes the form:

x(t) = S(t, t0 )x(t0 ), (G.10)

where S(., .) is the propagator of Eq. (G.7) and is given by

S(t, u) = M(t)M−1 (u). (G.11)

These results can be extended to the case of a non-homogeneous system:

dx/dt = A(t)x + b(t),     (G.12)
with initial condition x0 , which takes the form:
x(t) = S(t, t_0)x(t_0) + ∫_{t_0}^t S(t, u)b(u)du.     (G.13)

The above solution can again be used to integrate an mth-order non-homogeneous
differential equation with varying coefficients.
A useful simplification of Eq. (G.11) can be obtained when the matrix A satisfies
A(t)A(s) = A(s)A(t) for all t and s. In this case the propagator S(·, ·) takes a
simple expression, namely

S(t, s) = e^{∫_s^t A(u)du}.     (G.14)

It is worth mentioning here that Eq. (G.13) can be extended to the case when the
term b is a random forcing in relation to time-dependent multivariate autoregressive
models.

G.2.2 Particular Case of Periodic Matrix A: Floquet Theory

A particularly important case in physical sciences corresponds to a periodic A(t),


with period T , i.e. A(t +T ) = A(t). This case is particularly relevant to atmospheric
science because of the strong seasonality. The theory of the solution of

ẋ = A(t)x, (G.15)

with initial condition x0 = x(t0 ), for a periodic m × m matrix A(t) is covered by the
so-called Floquet theory (Floquet 1883). The solution takes the form:

x(t) = eμt y(t), (G.16)

for some periodic function y(t), and therefore need not be periodic. A set of m
independent solutions x1 (t), . . . xm (t) make what is known as the fundamental
matrix X(t), i.e. X(t) = [x1 (t), . . . , xm (t)], and if the initial condition X0 = X(t0 )
is the identity matrix, i.e. X0 = Im , then X(t) is called the principal fundamental
matrix. It is therefore clear that the solution to Eq. (G.15) is x(t) = X(t)X−1 0 x0 ,
where X(t) is a fundamental matrix.
An important result from Floquet theory is that if X(t) is a fundamental
matrix so is X(t + T ), and that there exists a nonsingular matrix B such that
X(t + T) = X(t)B. Using the Wronskian, Eq. (G.9), one gets the determinant
of B, i.e. |B| = exp(∫_0^T tr(A(u))du). Furthermore, the eigenvalues of B, or
characteristic multipliers, which can be written as eμ1 T , . . . , eμm T , yield the so-
called characteristic (or Floquet) exponents μ1 , . . . , μm .
Remark In terms of the resolvent, see Sect. G2, the propagator S(t, τ ) is the
principal fundamental matrix.
The characteristic exponents, which may be complex, are not unique but the
characteristic multipliers are. In addition, the system (or the origin) is asymptotically

stable if the characteristic exponents have negative real parts. It can be seen that
if u is an eigenvector of B with eigenvalue ρ = eμT , then x(t) = X(t)u is a
solution to Eq. (G.15), and that x(t + T ) = ρx(t). The solution then takes the form
x(t) = e^{μt}(e^{−μt}x(t)) = e^{μt}y(t), where the vector y(t) = e^{−μt}x(t) is precisely T-periodic.
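The characteristic multipliers can be computed numerically by integrating Eq. (G.15) over one period from the columns of the identity matrix, which yields the monodromy matrix B = X(T) (with X(0) = I_m). A minimal Python sketch for a Mathieu-type equation (the parameter values are arbitrary and only illustrative):

import numpy as np
from scipy.integrate import solve_ivp

# Mathieu-type system: y'' + (a + b cos t) y = 0, written as x' = A(t) x with period T = 2*pi
a, b, T = 1.0, 0.3, 2.0 * np.pi
A = lambda t: np.array([[0.0, 1.0], [-(a + b * np.cos(t)), 0.0]])

def column(j):
    # integrate over one period starting from the j-th column of the identity matrix
    sol = solve_ivp(lambda t, x: A(t) @ x, (0.0, T), np.eye(2)[:, j], rtol=1e-10)
    return sol.y[:, -1]

B = np.column_stack([column(0), column(1)])     # monodromy matrix X(T)
multipliers = np.linalg.eigvals(B)              # characteristic multipliers e^{mu_j T}
exponents = np.log(multipliers) / T             # Floquet exponents (defined up to 2*pi*i/T)
print(multipliers, exponents)
print(np.linalg.det(B))      # equals exp(int_0^T tr A(u) du) = 1 here, since tr A(t) = 0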
Appendix H
Links for Software Resource Material

An EOF primer by the author can be found here:


https://pdfs.semanticscholar.org/f492/b48483c83f70b8e6774d3cc88bec918ab630.pdf.

A CRAN (R programming language) package for EOFs and EOF rotation by Alan
Jassby is here:
https://www.rdocumentation.org/packages/wq/versions/0.4.8/topics/eof.

The site of David M. Kaplan provides Matlab codes for EOFs and varimax rotation:
https://websites.pmc.ucsc.edu/~dmk/notes/EOFs/EOFs.html.

Mathworks provides codes for PCA, factor analysis and factor rotation using
different rotation criteria at:
https://uk.mathworks.com/help/stats/rotatefactors.html.
https://uk.mathworks.com/help/stats/analyze-stock-prices-using-factor-analysis.html.

There are also freely available Matlab source codes of factor analysis at
freesourcecode.net:
http://freesourcecode.net/matlabprojects/57962/factor-analysis-by-the-principal-components-method.--in-matlab#.XysoXfJS80o.

Python (and R) PCA and varimax rotation can be found at this site:
https://mathtuition88.com/2019/09/13/python-code-for-pca-rotation-varimax-matrix/.

A R package provided by MetNorway, including EOF, CCA and more, can be found
here:


https://rdrr.io/github/metno/esd/man/ERA5.CDS.html.

The site of Imad Dabbura from HMS provides coding implementation in R and
Python at:
https://imaddabbura.github.io/.

A step-by-step introduction to NN with programming codes in Python is provided by


Dr Andy Thomas: “An Introduction to Neural Networks for Beginners” at
https://adventuresinmachinelearning.com/wp-content/uploads/2017/07/.

Mathworks also provides software for recurrent NN used in time series forecasting:
https://uk.mathworks.com/help/deeplearning/.

The following site provides various Matlab codes in Machine learning:


http://codeforge.com/s/0/self-organizing-map-matlab-code.

The site of Dr Qadri Hamarsheh “Neural Network and Fuzzy Logic: Self-
Organizing Map Using Matlab” here:
http://www.philadelphia.edu.jo/academics/qhamarsheh/uploads/Lecture%2016_Self-organizing%20map%20using%20matlab.pdf.

Self-organising Maps Using Python, by James McCaffrey here:


https://visualstudiomagazine.com/articles/2019/01/01/self-organizing-maps-python.aspx.

The book by Nielsen (2015) provides hands-on approach on NN (and deep learning)
with Python (2.7) here:
http://neuralnetworksanddeeplearning.com/about.html.

The book by Buduma (2017) provides codes for deep learning in Tensorflow at:
https://github.com/darksigma/Fundamentals-of-Deep-Learning-Book.

The book by Chollet (2018) provides an exploration of deep learning from scratch
with Python codes here:
https://www.manning.com/books/deep-learning-with-python.

Random Forest: Simple Implementation with Python:


https://holypython.com/rf/random-forest-simple-implementation/.

Random Forest (Easily Explained), with Python, by Shubham Gupta:


https://medium.com/@gupta020295/random-forest-easily-explained-4b8094feb90.

An Implementation and Explanation of the Random Forest in Python by Will


Koehrsen:
https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76.

Forecasting with Random Forest (Python implementation), by Eric D. Brown:


https://pythondata.com/forecasting-with-random-forests/.

Time series forecasting with random forest via time delay embedding (In R
programming language), by Manuel Tilgner:
https://www.statworx.com/at/blog/time-series-forecasting-with-random-forest/.

A Python-based learning library to evaluate mathematical expression efficiently is


found here:
http://deeplearning.net/software/theano/.

Other programming languages.

Yann Lecun provides a set of softwares in Lush at:


http://yann.lecun.com/ex/downloads/index.html.

A toolkit for recurrent NN applied to language modelling is given by Tomas


Mikolov at:
http://www.fit.vutbr.cz/~imikolov/rnnlm/.

A recurrent NN library for LSTM, multidimensional RNN, and more, can be found
here:
https://sourceforge.net/projects/rnnl/.

A Matlab 5 SOM toolbox by Juha Vesanto et al. can be found here:


http://www.cis.hut.fi/projects/somtoolbox/.

The following site provides links to a number of software packages on deep learning:


http://deeplearning.net/software_links/.
References

Absil P-A, Mahony R, Sepulchre R (2010) Optimization on manifolds: Methods and applications.
In: Diehl M, Glineur F, Michiels EJ (eds) Recent advances in optimizations and its application
in engineering. Springer, pp 125–144
Achlioptas D (2003) Database-friendly random projections: Johnson-Lindenstrauss with binary
coins. J Comput Syst Sci 66:671–687
Ackoff RL (1989) From data to wisdom. J Appl Syst Anal 16:3–9
Adby PR, Dempster MAH (1974) Introduction to optimization methods. Chapman and Hall,
London
Aires F, Rossow WB, Chédin A (2002) Rotation of EOFs by the independent component analysis:
toward a solution of the mixing problem in the decomposition of geophysical time series. J
Atmospheric Sci 59:111–123
Aires F, Chédin A, Nadal J-P (2000) Independent component analysis of multivariate time series:
application to the tropical SST variability. J Geophys Res 105(D13):17437–17455
Akaike H (1969) Fitting autoregressive models for prediction. Ann Inst Stat Math 21:243–247
Akaike H (1974) A new look at the statistical model identification. IEEE Trans Auto Control
19:716–723
Allen MR, Smith LA (1997) Optimal filtering in singular spectrum analysis. Phys Lett A 234:419–
423
Allen MR, Smith LA (1996) Monte Carlo SSA: Detecting irregular oscillations in the presence of
colored noise. J Climate 9:3373–3404
Aluffi-Pentini F, Parisi V, Zirilli F (1984) Algorithm 617: DAFNE: a differential-equations
algorithm for nonlinear equations. Trans Math Soft 10:317–324
Amari S-I (1990) Mathematical foundation of neurocomputing. Proc IEEE 78:1443–1463
Ambaum MHP, Hoskins BJ, Stephenson DB (2001) Arctic oscillation or North Atlantic oscilla-
tion? J Climate 14:3495–3507
Ambaum MHP, Hoskins BJ, Stephenson DB (2002) Corrigendum: Arctic oscillation or North
Atlantic oscillation? J Climate 15:553
Ambrizzi T, Hoskins BJ, Hsu H-H (1995) Rossby wave propagation and teleconnection patterns in
the austral winter. J Atmos Sci 52:3661–3672
Ambroise C, Seze G, Badran F, Thiria S (2000) Hierarchical clustering of self-organizing maps for
cloud classification. Neurocomputing 30:47–52. ISSN: 0925–2312
Anderson JR, Rosen RD (1983) The latitude-height structure of 40–50 day variations in atmo-
spheric angular momentum. J Atmos Sci 40:1584–1591
Anderson TW (1963) Asymptotic theory for principal component analysis. Ann Math Statist
34:122–148
Anderson TW (1984) An introduction to multivariate statistical analysis, 2nd edn. Wiley, New
York
Angell JK, Korshover J (1964) Quasi-biennial variations in temperature, total ozone, and
tropopause height. J Atmos Sci 21:479–492
Ångström A (1935) Teleconnections of climatic changes in present time. Geografiska Annaler
17:242–258
Annas S, Kanai T, Koyama S (2007) Principal component analysis and self-organizing map for
visualizing and classifying fire risks in forest regions. Agricul Inform Res 16:44–51. ISSN:
1881–5219
Asimov D (1985) The grand tour: A tool for viewing multidimensional data. SIAM J Sci Statist
Comp 6:128–143
Adachi K, Trendafilov N (2019) Some inequalities contrasting principal component and factor
analyses solutions. Jpn J Statist Data Sci. https://doi.org/10.1007/s42081-018-0024-4
Astel A, Tsakouski S, Barbieri P, Simeonov V (2007) Comparison of self-organizing maps
classification approach with cluster and principal components analysis for large environmental
data sets. Water Research 41:4566–4578. ISSN: 0043-1354
Bach FR, Jordan MI (2002) Kernel independent component analysis. J Mach Learn Res 3:1–48
Bagrov NA (1959) Analytical presentation of the sequences of meteorological patterns by means
of the empirical orthogonal functions. TSIP Proceeding 74:3–24
Bagrov NA (1969) On the equivalent number of independent data (in Russian). Tr Gidrometeor
Cent 44:3–11
Baker CTH (1974) Methods for integro-differential equations. In: Delves LM, Walsh J (eds)
Numerical solution of integral equations. Oxford University Press, Oxford
Baldwin MP, Gray LJ, Dunkerton TJ, Hamilton K, Haynes PH, Randel WJ, Holton JR, Alexander
MJ, Hirota I, Horinouchi T, Jones DBA, Kinnersley JS, Marquardt C, Sato K, Takahashi M (2001)
The Quasi-biennial oscillation. Rev Geophys 39:179–229
Baldwin MP, Stephenson DB, Jolliffe IT (2009) Spatial weighting and iterative projection methods
for EOFs. J Climate 22:234–243
Barbosa SM, Andersen OB (2009) Trend patterns in global sea surface temperature. Int J Climatol
29:2049–2055
Barlow HB (1989) Unsupervised learning. Neural Computation 1:295–311
Barnett TP (1983) Interaction of the monsoon and Pacific trade wind system at interannual time
scales. Part I: The equatorial case. Mon Wea Rev 111:756–773
Barnston AG, Livezey RE (1987) Classification, seasonality, and persistence of low-frequency
atmospheric circulation patterns. Mon Wea Rev 115:1083–1126
Barnett TP (1984a) Interaction of the monsoon and the Pacific trade wind systems at interannual
time scales. Part II: The tropical band. Mon Wea Rev 112:2380–2387
Barnett TP (1984b) Interaction of the monsoon and the Pacific trade wind systems at interannual
time scales. Part III: A partial anatomy of the Southern Oscillation. Mon Wea Rev 112:2388–
2400
Barnett TP, Preisendorfer R (1987) Origins and levels of monthly and seasonal forecast skill for
United States surface air temperatures determined by canonical correlation analysis. Mon Wea
Rev 115:1825–1850
Barnston AG, Ropelewski CF (1992) Prediction of ENSO episodes using canonical correlation
analysis. J Climate 5:1316–1345
Barreiro M, Marti AC, Masoller C (2011) Inferring long memory processes in the climate network
via ordinal pattern analysis. Chaos 21:13,101. https://doi.org/10.1063/1.3545273
Bartholomew DJ (1987) Latent variable models and factor analysis. Charles Griffin, London
Bartlett MS (1939) The standard errors of discriminant function coefficients. J Roy Statist Soc
Suppl. 6:169–173
Bartlett MS (1950) Periodogram analysis and continuous spectra. Biometrika 37:1–16
Bartlett MS (1955) An introduction to stochastic processes. Cambridge University Press, Cam-
bridge
Basak J, Sudarshan A, Trivedi D, Santhanam MS (2004) Weather data mining using independent
component analysis. J Mach Lear Res 5:239–253
Basilevsky A, Hum PJ (1979) Karhunen-Loève analysis of historical time series with application
to Plantation birth in Jamaica. J Am Statist Ass 74:284–290
Basilevsky A (1983) Applied matrix algebra in the statistical science. North Holland, New York
Bauckhage C, Thurau C (2009) Making archetypal analysis practical. In: Pattern recognition,
Lecture Notes in Computer Science, vol 5748. Springer, Berlin, Heidelberg, pp 272–281.
https://doi.org/10.1007/978-3-642-03798-6-28
Bayes T (1763) An essay towards solving a problem in the doctrine of chances. Phil Trans 53:370
Beatson RK, Cherrie JB, Mouat CT (1999) Fast fitting of radial basis functions: Methods based on
preconditioned GMRES iteration. Adv Comput Math 11:253–270
Beatson RK, Light WA, Billings S (2000) Fast solution of the radial basis function interpolation
equations: Domain decomposition methods. SIAM J Sci Comput 200:1717–1740
Belkin M, Niyogi P (2003) Laplacian eigenmaps for dimensionality reduction and data represen-
tation. Neural Comput 15:1373–1396
Bellman R (1961) Adaptive control processes: A guided tour. Princeton University Press, Princeton
Bell AJ, Sejnowski TJ (1995) An information-maximisation approach to blind separation and blind
deconvolution. Neural Computing 7:1004–1034
Bell AJ, Sejnowski TJ (1997) The “independent components” of natural scenes are edge filters.
Vision Research 37:3327–3338
Belouchrani A, Abed-Meraim K, Cardoso J-F, Moulines E (1997) A blind source separation
technique using second order statistics. IEEE Trans Signal Process 45:434–444
Bentler PM, Tanaka JS (1983) Problems with EM algorithms for ML factor analysis. Psychome-
trika 48:247–251
Berthouex PM, Brown LC (1994) Statistics for environmental engineers. Lewis Publishers, Boca
Raton
Bishop CM (1995) Neural networks for pattern recognition. Clarendon Press, Oxford, 482 p.
Bishop CM (2006) Pattern recognition and machine learning. Springer series in information
science and statistics. Springer, New York, 758 p.
Bjerknes J (1969) Atmospheric teleconnections from the equatorial Pacific. Mon Wea Rev 97:163–
172
Björnsson H, Venegas SA (1997) A manual for EOF and SVD analyses of climate data. Report
No 97-1, Department of Atmospheric and Oceanic Sciences and Centre for Climate and Global
Change Research, McGill University, p 52
Blumenthal MB (1991) Predictability of a coupled ocean-atmosphere model. J Climate 4:766–784
Bloomfield P, Davis JM (1994) Orthogonal rotation of complex principal components. Int J
Climatol 14:759–775
Bock H-H (1986) Multidimensional scaling in the framework of cluster analysis. In: Degens P,
Hermes H-J, Opitz O (eds) Studien zur Klassifikation. INDEKS-Verlag, Frankfurt, pp 247–
258
Bock H-H (1987) On the interface between cluster analysis, principal component analysis, and
multidimensional scaling. In: Bozdogan H, Kupta AK (eds) Multivariate statistical modelling
and data analysis. Reidel, Boston
Boers N, Donner RV, Bookhagen B, Kurths J (2014) Complex network analysis helps to identify
impacts of the El Niño Southern Oscillation on moisture divergence in South America. Clim
Dyn. https://doi.org/10.1007/s00382-014-2265-7
Bolton RJ, Krzanowski WJ (2003) Projection pursuit clustering for exploratory data analysis. J
Comput Graph Statist 12:121–142
Bonnet G (1965) Théorie de l’information − sur l’interpolation optimale d’une fonction aléatoire
échantillonnée. C R Acad Sci Paris 260:297–343
Bookstein FL (1989) Principal warps: thin plate splines and the decomposition of deformations.
IEEE Trans Pattern Anal Mach Intell 11:567–585
Borg I, Groenen P (2005) Modern multidimensional scaling. Theory and applications, 2nd edn.
Springer, New York
Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In:
Haussler D (ed) Proceedings of the 5th annual ACM workshop on computational learning
theory. ACM Press, Pittsburgh, pp 144–152
Botsaris CA, Jacobson HD (1976) A Newton-type curvilinear search method for optimisation. J
Math Anal Applics 54:217–229
Botsaris CA (1978) A class of methods for unconstrained minimisation based on stable numerical
integration techniques. J Math Anal Applics 63:729–749
Botsaris CA (1979) A Newton-type curvilinear search method for constrained optimisation. J Math
Anal Applics 69:372–397
Botsaris CA (1981) Constrained optimisation along geodesics. J Math Anal Applics 79:295–306
Box MJ, Davies D, Swann WH (1969) Non-linear optimization techniques. Oliver and Boyd,
Edinburgh
Box GEP, Jenkins MG, Reinsel CG (1994) Time series analysis: forecasting and control. Prentice
Hall, New Jersey
Box GEP, Jenkins MG (1970) Time series analysis. Forecasting and control. Holden-Day, San
Francisco (Revised and published in 1976)
Branstator G, Berner J (2005) Linear and nonlinear Signatures in the planetary wave dynamics of
an AGCM: Phase space tendencies. J Atmos Sci 62:1792–1811
Breakspear M, Brammer M, Robinson PA (2003) Construction of multivariate surrogate sets from
nonlinear data using the wavelet transform. Physica D 182:1–22
Breiman L (2001) Random forests. Machine Learning 45:5–32
Bretherton CS, Smith C, Wallace JM (1992) An intercomparison of methods for finding coupled
patterns in climate data. J Climate 5:541–560
Bretherton CS, Widmann M, Dymnikov VP, Wallace JM, Bladé I (1999) The effective number of
spatial degrees of freedom of a time varying field. J Climate 12:1990–2009
Brillinger DR, Rosenblatt M (1967) Computation and interpretation of k-th order spectra. In: Harris
B (ed) Spectral analysis of time series. Wiley, New York, pp 189–232
Brillinger DR (1981) Time series-data: analysis and theory. Holden-Day, San-Francisco
Brink KH, Muench RD (1986) Circulation in the point conception-Santa Barbara channel region.
J Geophys Res C 91:877–895
Brockwell PJ, Davis RA (1991) Time series: theory and methods, 2nd edn. Springer, New York
Brockwell PJ, Davis RA (2002) Introduction to time series and forecasting. Springer, New York
Brown AA (1986) Optimisation methods involving the solution of ordinary differential equations.
Ph.D. thesis, the Hatfield polytechnic, available from the British library
Brownlee J (2018) Statistical methods for machine learning. e-learning. ISBN-10. https://www.
unquotebooks.com/get/ebook.php?id=386nDwAAQBAJ
Broomhead DS, King GP (1986a) Extracting qualitative dynamics from experimental data. Physica
D 20:217–236
Broomhead DS, King GP (1986b) On the qualitative analysis of experimental dynamical systems.
In: Sarkar S (ed) Nonlinear phenomena and chaos. Adam Hilger, pp 113–144
Buduma N (2017) Fundamentals of deep learning, 1st edn. O’Reilly, Beijing
Bürger G (1993) Complex principal oscillation pattern analysis. J Climate 6:1972–1986
Burg JP (1972) The relationship between maximum entropy spectra and maximum likelihood
spectra. Geophysics 37:375–376
Cadzow JA, Li XK (1995) Blind deconvolution. Digital Signal Process J 5:3–20
Cadzow JA (1996) Blind deconvolution via cumulant extrema. IEEE Signal Process Mag (May
1996), 24–42
Cahalan RF, Wharton LE, Wu M-L (1996) Empirical orthogonal functions of monthly precipitation
and temperature over the United States and homogeneous stochastic models. J Geophys
Res 101(D21):26309–26318
Capua GD, Runge J, Donner RV, van den Hurk B, Turner AG, Vellore R, Krishnan R, Coumou
D (2020) Dominant patterns of interaction between the tropics and mid-latitudes in boreal
summer: Causal relationships and the role of time scales. Weather Climate Discuss. https://
doi.org/10.5194/wcd-2020-14.
Cardoso J-F (1989) Source separation using higher order moments. In: Proc. ICASSP’89, pp 2109–
2112
Cardoso J-F (1997) Infomax and maximum likelihood for source separation. IEEE Lett Signal
Process 4:112–114
Cardoso J-F, Souloumiac A (1993) Blind beamforming for non-Gaussian signals. IEE Proc F
140:362–370
Cardoso J-F, Hvam Laheld B (1996) Equivariant adaptive source separation. IEEE Trans Signal
Process 44:3017–3030
Carr JC, Fright RW, Beatson KR (1997) Surface interpolation with radial basis functions for
medical imaging. IEEE Trans Med Imag 16:96–107
Carreira-Perpiñán MA (2001) Continuous latent variable models for dimensionality reduction and
sequential data reconstruction. Ph.D. dissertation. Department of Computer Science, University
of Sheffield
Carroll JB (1953) An analytical solution for approximating simple structure in factor analysis.
Psychometrika 18:23–38
Carroll JD, Chang JJ (1970) Analysis of individual differences in multidimensional scaling via an
n-way generalization of ’Eckart-Young’ decomposition. Psychometrika 35:283–319
Cassano EN, Glisan JM, Cassano JJ, Gutowski Jr. WJ, Seefeldt MW (2015) Self-organizing map
analysis of widespread temperature extremes in Alaska and Canada. Clim Res 62:199–218
Cassano JJ, Cassano EN, Seefeldt MW, Gutowski WJ, Glisan JM (2016) Synoptic conditions
during wintertime temperature extremes in Alaska. J Geophys Res Atmos 121:3241–3262.
https://doi.org/10.1002/2015JD024404
Causa A, Raciti F (2013) A purely geometric approach to the problem of computing the projection
of a point on a simplex. JOTA 156:524–528
Cavazos T, Comrie AC, Liverman DM (2002) Intraseasonal variability associated with wet
monsoons in southeast Arizona. J Climate 15:2477–2490. ISSN: 0894-8755
Chan JCL, Shi J-E (1997) Application of projection-pursuit principal component analysis method
to climate studies. Int J Climatol 17(1):103–113
Charney JG, Devore J (1979) Multiple equilibria in the atmosphere and blocking. J Atmos Sci
36:1205–1216
Chatfield C (1996) The analysis of time series. An introduction 5th edn. Chapman and Hall, Boca
Raton
Chatfield C, Collins AJ (1980) Introduction to multivariate analysis. Chapman and Hall, London
Chatfield C (1989) The analysis of time series: An introduction. Chapman and Hall, London, 241 p
Chekroun MD, Kondrashov D (2017) Data-adaptive harmonic spectra and multilayer Stuart-
Landau models. Chaos 27:093110
Chen J-M, Harr PA (1993) Interpretation of extended empirical orthogonal function (EEOF)
analysis. Mon Wea Rev 121:2631–2636
Chen R, Zhang W, Wang X (2020) Machine learning in tropical cyclone forecast modeling: A
Review. Atmosphere 11:676. https://doi.org/10.3390/atmos11070676
Cheng X, Nitsche G, Wallace MJ (1995) Robustness of low-frequency circulation patterns derived
from EOF and rotated EOF analysis. J Climate 8:1709–1720
Chernoff H (1973) The use of faces to represent points in k-dimensional space graphically. J Am
Stat Assoc 68:361–368
Coifman RR, Lafon S (2006) Diffusion maps. Appl Comput Harmon Anal 21:5–30. https://doi.
org/10.1016/j.acha.2006.04.006
Chollet F (2018) Deep learning with Python. Manning Publications, New York, 361 p
Christiansen B (2009) Is the atmosphere interesting? A projection pursuit study of the circulation
in the northern hemisphere winter. J Climate 22:1239–1254
Cleveland WS, McGill R (1984) The many faces of a scatterplot. J Am Statist Assoc 79:807–822
Cleveland WS (1993) Visualising data. Hobart Press, New York
Comon P, Jutten C, Herault J (1991) Blind separation of sources, Part II: Problems statement.
Signal Process 24:11–20
Comon P (1994) Independent component analysis, a new concept? Signal Process 36:287–314
Cook D, Buja A, Cabrera J (1993) Projection pursuit indices based on expansions with orthonormal
functions. J Comput Graph Statist 2:225–250
Cover TM, Thomas JA (1991) Elements of information theory. Wiley Series in Telecommunica-
tion. Wiley, New York
Cox DD (1984) Multivariate smoothing spline functions. SIAM J Num Anal 21:789–813
Cox TF, Cox MAA (1994) Multidimensional scaling. Chapman and Hall, London
Craddock JM (1965) A meteorological application of principal component analysis. Statistician
15:143–156
Craddock JM (1973) Problems and prospects for eigenvector analysis in meteorology. Statistician
22:133–145
Craven P, Wahba G (1979) Smoothing noisy data with spline functions: estimating the correct
degree of smoothing by the method of generalized cross-validation. Numer Math 31:377–403
Cristianini N, Shawe-Taylor J, Lodhi H (2001) Latent semantic kernels. In: Brodley C, Danyluk
A (eds) Proceedings of ICML-01, 18th international conference in machine learning. Morgan
Kaufmann, San Francisco, pp 66–73
Crommelin DT, Majda AJ (2004) Strategies for model reduction: Comparing different optimal
bases. J Atmos Sci 61:2206–2217
Gupta AS (2004) Calculus of variations with applications. PHI Learning, India, 256p. ISBN:
9788120311206
Cutler A, Breiman L (1994) Archetypal analysis. Technometrics 36:338–347
Cutler A, Cutler DR, Stevens JR (2012) Random forests. In: Zhang C, Ma YQ (eds) Ensemble
machine learning. Springer, New York, pp 157–175
Cybenko G (1989) Approximation by superpositions of a sigmoidal function. Math Control Signal
Sys 2:303–314
Daley R (1991) Atmospheric data assimilation. Cambridge University Press, Cambridge, 457 p
Dasgupta S, Gupta A (2003) An elementary proof of a theorem of Johnson and Lindenstrauss.
Random Struct Algorithm 22:60–65
Daubechies I (1992) Ten lectures on wavelets. Soc. for Ind. and Appl. Math., Philadelphia, PA
Davis JM, Estis FL, Bloomfield P, Monahan JF (1991) Complex principal components analysis of
sea-level pressure over the eastern USA. Int J Climatol 11:27–54
de Lathauwer L, de Moor B, Vandewalle J (2000) A multilinear singular value decomposition.
SIAM J Matrix Analy Appl 21:1253–1278
DeGroot MH, Shervish MJ (2002) Probability and statistics, 4th edn. Addison–Wesley, Boston,
p 893
DelSole T (2001) Optimally persistent patterns in time varying fields. J Atmos Sci 58:1341–1356
DelSole T (2006) Low-frequency variations of surface temperature in observations and simula-
tions. J Climate 19:4487–4507
DelSole T, Tippett MK (2009a) Average predictability time. Part I: theory. J Atmos Sci 66:1172–
1187
DelSole T, Tippett MK (2009b) Average predictability time. Part II: seamless diagnoses of
predictability on multiple time scales. J Atmos Sci 66:1188–1204
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM
algorithm. J Roy Statist Soc B 39:1–38
De Swart HE (1988) Low-order spectral models of the atmospheric circulation: A survey. Acta
Appl Math 11:49–96
Derouiche S, Mallet C, Hannachi A, Bargaoui Z (2020) Rainfall analysis via event features and
self-organizing maps with application to northern Tunisia. J Hydrol, revised
Diaconis P, Freedman D (1984) Asymptotics of graphical projection pursuit. Ann Statist 12:793–
815
Diamantaras KI, Kung SY (1996) Principal component neural networks. Wiley, New York
Ding C, Li T, Jordan MI (2010) Convex and semi-nonnegative matrix factorizations. IEEE Trans
Pattern Anal Mach Intell 32:45–55
Donges JF, Petrova I, Loew A, Marwan N, Kurths J (2015) How complex climate networks
complement eigen techniques for the statistical analysis of climatological data. Clim Dyn
45:2407–2424
Donges JF, Zou Y, Marwan N, Kurths J (2009) Complex networks in climate dynamics. Eur Phys
J Spec Top 174:157–179. https://doi.org/10.1140/epjst/e2009-01098-2
Dommenget D, Latif M (2002) A cautionary note on the interpretation of EOFs. J Climate 15:216–
225
Dommenget D (2007) Evaluating EOF modes against a stochastic null hypothesis. Clim Dyn
28:517–531
Donner RV, Zou Y, Donges JF, Marwan N, Kurths J (2010) Recurrence networks—a novel
paradigm for nonlinear time series analysis. New J Phys 12:033025. https://doi.org/10.1088/
1367-2630/12/3/033025
Donohue KD, Hennemann J, Dietz HG (2007) Performance of phase transform for detecting
sound sources with microphone arrays in reverberant and noisy environments. Signal Process
87:1677–1691
Doob JL (1953) Stochastic processes. Wiley, New York
Dorn M, von Storch H (1999) Identification of regional persistent patterns through principal
prediction patterns. Beitr Phys Atmos 72:15–111
Dwyer PS (1967) Some applications of matrix derivatives in multivariate analysis. J Am Statist
Ass 62:607–625
Ebdon RA (1960) Notes on the wind flow at 50 mb in tropical and subtropical regions in January
1957 and in 1958. Q J R Meteorol Soc 86:540–542
Ebert-Uphoff I, Deng Y (2012) Causal discovery for climate research using graphical models. J
Climate 25:5648–5665. https://doi.org/10.1175/JCLI-D-11-00387.1
Efron B (1979) Bootstrap methods: Another look at the Jackknife. Ann Stat 7:1–26
Efron B, Tibshirani RJ (1994) An introduction to bootstrap. Chapman-Hall, Boca-Raton. ISBN-13:
978-0412042317
Eslava G, Marriott FHC (1994) Some criteria for projection pursuit. Stat Comput 4:13–20
Eugster MJA, Leisch F (2011) Weighted and robust archetypal analysis. Comput Stat Data Anal
55:1215–1225
Eugster MJA, Leisch F (2013) Archetypes: Archetypal analysis. http://CRAN.R-project.org/
package=archetypes. R package version 2.1-2
Everitt BS (1978) Graphical techniques for multivariate data. Heinemann Educational Books,
London
Everitt BS (1984) An introduction to latent variable models. Chapman and Hall, London
Everitt BS (1987) Introduction to optimization methods and their application in statistics. Chapman
and Hall, London
Everitt BS (1993) Cluster analysis, 3rd edn. Academic Press, London, 170pp
Everitt BS, Dunn G (2001) Applied Multivariate Data Analysis, 2nd edn. Arnold, London
Evtushenko JG (1974) Two numerical methods of solving non-linear programming problems. Sov
Math Dokl 15:20–423
Evtushenko JG, Zhadan GV (1977) A relaxation method for solving problems of non-linear
programming. USSR Comput Math Math Phys 17:73–87
Fang K-T, Zhang Y-T (1990) Generalized multivariate analysis. Springer, 220p
Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy U (eds) (1996) Advances in knowledge
discovery and data mining. AAAI Press/The MIT Press, Menlo Park, CA
Ferguson GA (1954) The concept of parsimony in factor analysis. Psychometrika 18:23–38
Faddeev DK, Faddeeva NV (1963) Computational methods of linear algebra. W.H. Freeman and
Company, San Francisco
Fisher RA (1925) Statistical methods for research workers. Oliver & Boyd, Edinburgh
Fischer MJ, Paterson AW (2014) Detecting trends that are nonlinear and asymmetric on diurnal
and seasonal time scales. Clim Dyn 43:361–374
Fischer MJ (2016) Predictable components in global speleothem δ18O. Q Sci Rev 131:380–392
Fischer MJ (2015) Predictable components in Australian daily temperature data. J Climate
28:5969–5984
Fletcher R (1972) Conjugate direction methods. In: Murray W (ed) Numerical methods for
unconstrained optimization. Academic Press, London, pp 73–86
Fletcher R, Powell MJD (1963) A rapidly convergent descent method for minimization. Comput J
6:163–168
Floquet G (1883) Sur les équations différentielles linéaires à coefficients periodiques. Annales de
l’École Normale Supérieure 12:47–88
Flury BN (1988) Common principal components and related mutivariate models. Wiley, New York
Flury BN (1984) Common principal components in k groups. J Am Statist Assoc 79:892–898
Flury BN (1983) Some relations between the comparison of covariance matrices and principal
component analysis. Comput Statist Dana Anal 1:97–109
Fodor I, Kamath C (2003) On the use of independent component analysis to separate meaningful
sources in global temperature series. Technical Report, Lawrence Livermore National Labora-
tory
Foulds LR (1981) Optimization techniques: An introduction. Springer, New York
Frankl P, Maehara H (1988) The Johnson-Lindenstrauss lemma and the sphericity of some graphs.
J Combin Theor 44:355–362
Fraedrich K (1986) Estimating the dimensions of weather and climate attractors. J Atmos Sci
43:419–432
Franke R (1982) Scattered data interpolation: tests of some methods. Math Comput 38(157):181–
200
Franzke C, Feldstein SB (2005) The continuum and dynamics of Northern Hemisphere telecon-
nection patterns. J Atmos Sci 62:3250–3267
Franzke C, Majda AJ, Vanden-Eijnden E (2005) Low-order stochastic mode reduction for a
realistic barotropic model climate. J Atmos Sci 62:1722–1745
Franzke C, Majda AJ, Branstator G (2007) The origin of nonlinear signatures of planetary wave
dynamics: Mean phase space tendencies and contributions from non-Gaussianity. J Atmos Sci
64:3987–4003
Franzke C, Feldstein SB, Lee S (2011) Synoptic analysis of the Pacific-North American telecon-
nection pattern. Q J R Meterol Soc 137:329–346
Fraser AM, Dimitriadis A (1994) Forecasting probability densities by using hidden Markov models
with mixed states. In: Weigend SA, Gershenfeld NA (eds) Time series prediction: forecasting
the future and understanding the past. Persus Books, Reading, MA, pp 265–282
Frawley WJ, Piatetsky-Shapiro G, Mathews CJ (1992) Knowledge discovery in databases: an
overview. Al Magazine 13:57–70
Frederiksen JS (1997) Adjoint sensitivity and finite-time normal mode disturbances during
blocking. J Atmos Sci 54:1144–1165
Frederiksen JS, Branstator G (2001) Seasonal and intraseasonal variability of large-scale barotropic
modes. J Atmos Sci 58:50–69
Frederiksen JS, Branstator G (2005) Seasonal variability of teleconnection patterns. J Atmos Sci
62:1346–1365
Friedman JH, Tukey JW (1974) A projection pursuit algorithm for exploratory data analysis. IEEE
Trans Comput C23:881–890
Friedman JH, Stuetzle W, Schroeder A (1984) Projection pursuit density estimation. J Am Statist
Assoc 79:599–608
Friedman JH (1987) Exploratory projection pursuit. J Am. Statist Assoc 82:249–266
Fuller WA (1976) Introduction to statistical time series. Wiley, New York
Feldstein SB (2000) The timescale, power spectra, and climate noise properties of teleconnection
patterns. J Climate 13:4430–4440
Friedman JH, Stuetzle W (1981) Projection pursuit regression. J Amer Statist Assoc 76:817–823
Fukunaga K, Koontz WLG (1970) Application of the Karhunen-Loève expansion to feature
selection and ordering. IEEE Trans Comput C-19:311–318
Fukuoka A (1951) A study of 10-day forecast (A synthetic report). Geophys Mag Tokyo XXII:177–
218
Galton F (1885) Regression towards mediocrity in hereditary stature. J Anthropological Inst
15:246–263
Gámez AJ, Zhou CS, Timmermann A, Kurths J (2004) Nonlinear dimensionality reduction in
climate data. Nonlin Process Geophys 11:393–398
Gardner WA, Napolitano A, Paura L (2006) Cyclostationarity: Half a century of research. Signal
Process 86:639–697
Gardner WA (1994) Cyclostationarity in communications and signal processing. IEEE Press, 504 p
Gardner WA, Franks LE (1975) Characterization of cyclostationary random signal processes. IEEE
Trans Inform Theory 21:4–14
Gavrilov A, Mukhin D, Loskutov E, Volodin E, Feigin A, Kurths J (2016) Method for reconstruct-
ing nonlinear modes with adaptive structure from multidimensional data. Chaos 26:123101.
https://doi.org/10.1063/1.4968852
Geary RC (1947) Testing for normality. Biometrika 34:209–242
Gelfand IM, Vilenkin NYa (1964) Generalized functions-vol 4: Applications of harmonic analysis.
Academic Press
Ghil M, Allen MR, Dettinger MD, Ide K, Kondrashov D, Mann ME, Robertson AW, Saunders
A, Tian Y, Varadi F, Yiou P (2002) Advanced spectral methods for climatic time series. Rev
Geophys 40:1.1–1.41
Giannakis D, Majda AJ (2012) Nonlinear laplacian spectral analysis for time series with intermit-
tency and low-frequency variability. Proc Natl Sci USA 109:2222–2227
Giannakis D, Majda AJ (2013) Nonlinear laplacian spectral analysis: capturing intermittent and
low-frequency spatiotemporal patterns in high-dimensional data. Stat Anal Data Mining 6.
https://doi.org/10.1002/sam.11171
Gibbs JW (1902) Elementary principles in statistical mechanics developed with especial reference
to the rational foundation of thermodynamics. Yale University Press, New Haven, CT.
Republished by Dover, New York in 1960
Gibson J (1994) What is the interpretation of spectral entropy? In: Proceedings of IEEE
international symposium on information theory, p 440
Gibson PB, Perkins-Kirkpatrick SE, Uotila P, Pepler AS, Alexander LV (2017) On the use of self-
organizing maps for studying climate extremes. J Geophys Res Atmos 122:3891–3903. https://
doi.org/10.1002/2016JD026256
Gibson PB, Perkins-Kirkpatrick SE, Renwick JA (2016) Projected changes in synoptic weather
patterns over New Zealand examined through self-organizing maps. Int J Climatol 36:3934–
3948. https://doi.org/10.1002/joc.4604
Gill PE, Murray W, Wright HM (1981) Practical optimization. Academic Press, London
Gilman DL (1957) Empirical orthogonal functions applied to thirty-day forecasting. Sci Rep No 1,
Department of Meteorology, Mass Inst of Tech, Cambridge, Mass, 129pp.
Girshik MA (1939) On the sampling theory of roots of determinantal equations. Ann Math Statist
43:128–136
Glahn HR (1962) An experiment in forecasting rainfall probabilities by objective methods. Mon
Wea Rev 90:59–67
Goerg GM (2013) Forecastable components analysis. J Mach Learn Res Workshop Conf Proc
28:64–72
Goldfeld SM, Quandt RE, Trotter HF (1966) Maximization by quadratic hill-climbing. Economet-
rica 34:541–551
Golub GH, van Loan CF (1996) Matrix computation. John Hopkins University Press, Baltimore,
MD
Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press, Cambridge, MA, 749 p.
http://www.deeplearningbook.org
Gordon AD (1999) Classification, 2nd edn. Chapman and Hall, 256 p
Gordon AD (1981) Classification: methods for the exploratory analysis of multivariate data.
Chapman and Hall, London
Gower JC (1966) Some distance properties of latent root and vector methods used in multivariate
analysis. Biometrika 53:325–338
Graybill FA (1969) Introduction to matrices with application in statistics. Wadsworth, Belmont, CA
Graystone P (1959) Meteorological office discussion−Tropical meteorology. Meteorol Mag
88:113–119
Grenander U, Rosenblatt M (1957) Statistical analysis of time series. Wiley, New York
Grimmer M (1963) The space-filtering of monthly surface temperature anomaly data in terms of
pattern using empirical orthogonal functions. Q J Roy Meteorol Soc 89:395–408
Hackbusch W (1995) Integral equations: theory and numerical treatment. Birkhauser Verlag, Basel
Haghroosta T (2019) Comparative study on typhoon’s wind speed prediction by a neural networks
model and a hydrodynamical model. MethodsX 6:633–640
Haines K, Hannachi A (1995) Weather regimes in the Pacific from a GCM. J Atmos Sci 52:2444-
2462
Hall A, Manabe S (1997) Can local linear stochastic theory explain sea surface temperature and
salinity variability? Clim Dyn 13:167–180
Hall P (1989) On polynomial-based projection indices for exploratory projection pursuit. Ann
Statist 17:589–605
Halmos PR (1951) Introduction to Hilbert space. Chelsea, New York
Halmos PR (1972) Positive approximants of operators. Indiana Univ Math J 21:951–960
Hamlington BD, Leben RR, Nerem RS, Han W, Kim K-Y (2011) Reconstructing sea level using
cyclostationary empirical orthogonal functions. J Geophys Res 116:C12015. https://doi.org/10.
1029/2011JC007529
Hamlington BD, Leben RR, Strassburg MW, Kim K-Y (2014) Cyclostationary empirical orthogo-
nal function sea-level reconstruction. Geosci Data J 1:13–19
Hamming RW (1980) Coding and information theory. Prentice-Hall, Englewood Cliffs, New Jersey
Hannachi A, Allen M (2001) Identifying signals from intermittent low-frequency behaving
systems. Tellus A 53A:469–480
Hannachi A, Legras B (1995) Simulated annealing and weather regimes classification. Tellus
47A:955–973
Hannachi A, Iqbal W (2019) On the nonlinearity of winter northern hemisphere atmospheric
variability. J Atmos Sci 76:333–356
Hannachi A, Turner AG (2013a) Isomap nonlinear dimensionality reduction and bimodality of
Asian monsoon convection. Geophys Res Lett 40:1653–1658
Hannachi A, Turner AG (2013b) 20th century intraseasonal Asian monsoon dynamics viewed from
isomap. Nonlin Process Geophys 20:725–741
Hannachi A, Dommenget D (2009) Is the Indian Ocean SST variability a homogeneous diffusion
process. Clim Dyn 33:535–547
Hannachi A, Unkel S, Trendafilov NT, Jolliffe IT (2009) Independent component analysis of
climate data: A new look at EOF rotation. J Climate 22:2797–2812
Hannachi A (2010) On the origin of planetary-scale extratropical winter circulation regimes. J
Atmos Sci 67:1382–1401
Hannachi A (1997) Low frequency variability in a GCM: three-dimensional flow regimes and their
dynamics. J Climate 10:1357–1379
Hannachi A, O’Neill A (2001) Atmospheric multiple equilibria and non-Gaussian behaviour in
model simulations. Q J R Meteorol Soc 127:939–958
Hannachi A (2008) A new set of orthogonal patterns in weather and climate: Optimally interpolated
patterns. J Climate 21:6724–6738
Hannachi A, Jolliffe IT, Trendafilov N, Stephenson DB (2006) In search of simple structures in
climate: Simplifying EOFs. Int J Climatol 26:7–28
Hannachi A, Jolliffe IT, Stephenson DB (2007) Empirical orthogonal functions and related
techniques in atmospheric science: A review. Int J Climatol 27:1119–1152
Hannachi A (2007) Pattern hunting in climate: A new method for finding trends in gridded climate
data. Int J Climatol 27:1–15
Hannachi A (2000) A probabilistic-based approach to optimal filtering. Phys Rev E 61:3610–3619
Hannachi A, Stephenson DB, Sperber KR (2003) Probability-based methods for quantifying
nonlinearity in the ENSO. Clim Dyn 20:241–256
Hannachi A, Mitchell D, Gray L, Charlton-Perez A (2011) On the use of geometric moments to
examine the continuum of sudden stratospheric warmings. J Atmos Sci 68:657–674
Hannachi A, Woollings T, Fraedrich K (2012) The North Atlantic jet stream: a look at preferred
positions, paths and transitions. Q J Roy Meteorol Soc 138:862–877
Hannachi A (2016) Regularised empirical orthogonal functions. Tellus A 68:31723. https://doi.
org/10.3402/tellusa.v68.31723
Hannachi A, Stendel M (2016) What is the NAO? In: Colijn (ed) Appendix 1 in Quante, North sea
region climate change assessment. Springer, Berlin, Heidelberg, pp 489–493
Hannachi A, Trendafilov N (2017) Archetypal analysis: Mining weather and climate extremes. J
Climate 30:6927–6944
Hannachi A, Straus DM, Franzke CLE, Corti S, Woollings T (2017) Low-frequency nonlinearity
and regime behavior in the Northern Hemisphere extratropical atmosphere. Rev Geophys
55:199–234. https://doi.org/10.1002/2015RG000509
Hannan EJ (1970) Multiple time series. Wiley, New York
Harada Y, Kamahori H, Kobayashi C, Endo H, Kobayashi S, Ota Y, Onoda H, Onogi K, Miyaoka
K, Takahashi K (2016) The JRA-55 reanalysis: Representation of atmospheric circulation and
climate variability. J Meteor Soc Jpn 94:269–302
Hardy RL (1971) Multiquadric equations of topography and other irregular surfaces. J Geophys
Res 76:1905–1915
Hardoon DR, Szedmak S, Shawe-Taylor J (2004) Canonical correlation analysis: An overview
with application to learning methods. Neural Comput 16:2639–2664
Harman HH (1976) Modern factor analysis, 3d edn. The University of Chicago Press, Chicago
Harshman RA (1970) Foundation of the PARAFAC procedure: models and methods for an
’Explanatory’ multi-mode factor analysis. In: UCLA working papers in phonetics, vol 16, pp 1–
84
Hartigan JA (1975) Clustering algorithms. Wiley, New York
Hasselmann K (1976) Stochastic climate models. Part I. Theory. Tellus 28:474–485
Hasselmann K (1988) PIPs and POPs−A general formalism for the reduction of dynamical systems
in terms of principal interaction patterns and principal oscillation patterns. J Geophys Res
93:11015–11020
Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining,
inference, and prediction, 2nd edn. Springer series in statistics. Springer, New York
Haykin S (1999) Neural networks: A comprehensive foundation, 2nd edn. Prentice Hall Interna-
tional, New Jersey, 897 p
Haykin S (2009) Neural networks and learning machines, 3rd edn. Prentice Hall, New York, 938 p
Hayashi Y (1973) A method of analyzing transient waves by space-time cross spectra. J Appl
Meteorol 12:404–408
Haykin S (ed) (1994) Blind deconvolution. Prentice-Hall, Englewood Cliffs, New Jersey
Heinlein RA (1973) Time enough for love. New English Library, London
Heiser WJ, Groenen PJF (1997) Cluster differences scaling with a within-clusters loss component
and a fuzzy successive approximation strategy to avoid local minima. Psychometrika 62:63–83
Held IM (1983) Stationary and quasi-stationary eddies in the extratropical troposphere: theory. In:
Hoskins BJ, Pearce RP (eds) Large-scale dynamical processes in the atmosphere. Academic
Press, pp 127–168
Helsen H, Lowdenslager D (1958) Prediction theory and Fourier series in several variables. Acta
Math 99:165–202
Hendon HH, Salby ML (1994) The life cycle of the Madden-Julian oscillation. J Atmos Sci
51:2225–2237
Hertz JA, Krogh AS, Palmer RG (1991) Introduction to the theory of neural computation. Lecture
Notes Volume I, Santa Fe Institute Series. Addison-Wesley Publishing Company, Reading, MA
Hewitson BC, Crane RG (2002) Self-organizing maps: applications to synoptic climatology. Clim
Res 22:13–26. ISSN: 0936-577X
Hewitson BC, Crane RG (1994) Neural nets: Applications in geography. Springer, New York.
ISBN: 978-07-923-2746-2
Higham NJ (1988) Computing nearest symmetric positive semi-definite matrix. Linear Algebra
Appl 103:103–118
Hill T, Marquez L, O’Connor M, Remus W (1994) Artificial neural network models for forecasting
and decision making. Int J Forecast 10:5–15
Hinton GE, Dayan P, Revow M (1997) Modeling the manifolds of images of hand written digits.
IEEE Trans Neural Netw 8:65–74
Hirsch MW, Smale S (1974) Differential equations, dynamical systems, and linear algebra.
Academic Press, London
Hochstadt H (1973) Integral equations. Wiley, New York
Hodges JL, Lehmann EL (1956) The efficiency of some non-parametric competitors of the t-test.
Ann Math Statist 27:324–335
Hoerl AE, Kennard RW (1970) Ridge regression: biased estimation for nonorthogonal problems.
Technometrics 12:55–67
Holsheimer M, Siebes A (1994) Data mining: The search for knowledge in databases. Technical
Report CS-R9406, CWI Amsterdam
Horel JD (1981) A rotated principal component analysis of the interannual variability of the
Northern Hemisphere 500 mb height field. Mon Wea Rev 109:2080–2092
Horel JD (1984) Complex principal component analysis: Theory and examples. J Climate Appl
Meteor 23:1660–1673
Horn RA, Johnson CA (1985) Matrix analysis. Cambridge University Press, Cambridge
Hornik K (1991) Approximation capabilities of multilayer feedforward networks. Neural Networks
4:251–257
Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are universal
approximators. Neural Networks 2:359–366
Horton DE, Johnson NC, Singh D, Swain DL, Rajaratnam B, Diffenbaugh NS (2015) Contribution
of changes in atmospheric circulation patterns to extreme temperature trends. Nature 522:465–
469. https://doi.org/10.1038/nature14550
Hosking JRM (1990) L-moments: analysis and estimation of distributions using linear combina-
tions of order statistics. J R Statist Soc B 52:105–124
Hoskins BJ, Karoly DJ (1981) The steady linear response of a spherical atmosphere to thermal and
orographic forcing. J Atmos Sci 38:1179–1196
Hoskins BJ, Ambrizzi T (1993) Rossby wave propagation on a realistic longitudinally varying
flow. J Atmos Sci 50:1661–1671
Hotelling H (1933) Analysis of a complex of statistical variables into principal components. J Educ
Psych 24:417–441, 498–520
Hotelling H (1935) The most predictable criterion. J Educ Psych 26:139–142
Hotelling H (1936a) Simplified calculation of principal components. Psychometrika 1:27–35
Hotelling H (1936b) Relation between two sets of variables. Biometrika 28:321–377
Hsieh WW (2001a) Nonlinear canonical correlation analysis of the tropical Pacific climate
variability using a neural network approach. J Climate 14:2528–2539
Hsieh WW (2001b) Nonlinear principal component analysis by neural networks. Tellus 53A:599–
615
Hsieh WW (2009) Machine learning methods in the environmental sciences: neural networks and
kernels. Cambridge University Press, Cambridge
Hsieh W, Tang B (1998) Applying neural network models to prediction and data analysis in
meteorology and oceanography. Bull Am Meteorol Soc 79:1855–1870
Hubbert S, Baxter B (2001) Radial basis functions for the sphere. In: Haussmann W, Jetter
K, Reimer M (eds) Recent progress in multivariate approximation, 4th international confer-
ence, September 2000, Witten-Bommerholz. International Series of Numerical Mathematics,
vol. 137. Birkhäuser, Basel, pp 33–47
Huber PJ (1985) Projection pursuit. Ann Statist 13:435–475
Huber PJ (1981) Robust statistics. Wiley, New York, 308 p
Hunter JS (1988) The digidot plot. Am Statistician 42:54
Hurrell JW (1996) Influence of variations in extratropical wintertime teleconnections on Northern
Hemisphere temperature. Geophys Res Lett 23:665–668
Hurrell JW, Kushnir Y, Ottersen G, Visbeck M (2003) An overview of the North Atlantic
Oscillation. In: Hurrell JW, Kushnir Y, Ottersen G, Visbeck M (eds) The North Atlantic
Oscillation, climate significance and environmental impact, Geophysical Monograph, vol 134.
American Geophysical Union, Washington, pp 1–35
Huth R, Beck C, Philipp A, Demuzere M, Ustrnul Z, Cahynová M, Kyselý J, Tveito OE (2008)
Classifications of atmospheric circulation patterns: Recent advances and applications.
Ann NY Acad Sci 1146(1):105–152. ISSN: 0077-8923
Huva R, Dargaville R, Rayner P (2015) The impact of filtering self-organizing maps: A case study
with Australian pressure and rainfall. Int J Climatol 35:624–633. https://doi.org/10.1002/joc.
4008
Hyvärinen A (1998) New approximations of differential entropy for independent component
analysis and projection. In: Jordan MA, Kearns MJ, Solla SA (eds) Advances in neural
information processing systems, vol 10. MIT Press, Cambridge, MA, pp 273–279
Hyvärinen A (1999) Survey on independent component analysis. Neural Comput Surv 2:94–128
Hyvärinen A, Oja E (2000) Independent component analysis: Algorithms and applications. Neural
Net 13:411–430
Hyvärinen A, Karhunen J, Oja E (2001) Independent component analysis. Wiley, 481pp
Iskandar I (2009) Variability of satellite-observed sea surface height in the tropical Indian Ocean:
comparison of EOF and SOM analysis. Makara Seri Sains 13:173–179. ISSN: 1693-6671
Izenman AJ (2008) Modern multivariate statistical techniques, regression, classification and
manifold learning. Springer, New York
Jackson JE (2003) A user’s guide to principal components. Wiley, New Jersey, 569pp
James G, Witten D, Hastie T, Tibshirani R (2013) An introduction to statistical learning: with
application in R. Springer texts in statistics. Springer, New York. https://doi.org/10.1007/978-
1-4614-7138-7-5
Jee JR (1985) A study of projection pursuit methods. Technical Report TR 776-311-4-85, Rice
University
Jenkins MG, Watts DG (1968) Spectral analysis and its applications. Holden-Day, San Francisco
Jennrich RI (2001) A simple general procedure for orthogonal rotation. Psychometrika 66:289–306
Jennrich RI (2002) A simple general procedure for oblique rotation. Psychometrika 67:7–19
Jennrich RI (2004) Rotation to simple loadings using component loss function: The orthogonal
case. Psychometrika 69:257–273
Jenssen R (2000) Image denoising based on independent component analysis. M.Sc. Thesis, the
University of Tromso
Jeong JH, Resop JP, Mueller ND, Fleisher DH, Yun K, Butler EE, et al. (2016) Random forests for
global and regional crop yield predictions. PLOS ONE 11:e0156571. https://doi.org/10.1371/
journal.pone.0156571
Johnson W, Lindenstrauss J (1984) Extensions of Lipschitz mappings into a Hilbert space. In:
Conference in modern analysis and probability (New Haven, Conn., 1982). Contemporary
mathematics, vol 26. American Mathematical Society, pp 189–206
Johnson ES, McPhaden MJ (1993) On the structure of intraseasonal Kelvin waves in the equatorial
Pacific ocean. J Phys Oceanogr 23:608–625
Johnson NC, Feldstein SB, Tremblay B (2008) The continuum of northern hemisphere teleconnec-
tion patterns and a description of the NAO Shift with the use of self-organizing maps. J Climate
21:6354–6371
Johansson JK (1981) An extension of Wollenberg’s redundancy analysis. Psychometrika 46:93–103
Jolliffe IT (2003) A cautionary note on artificial examples of EOFs. J Climate 16:1084–1086
Jolliffe IT, Cadima J (2016) Principal components analysis: a review and recent developments. Phil
Trans R Soc A 374:20150202
Jolliffe IT, Uddin M, Vines KS (2002) Simplified EOFs−three alternatives to rotation. Clim Res
20:271–279
Jolliffe IT, Trendafilov TN, Uddin M (2003) A modified principal component technique based on
the LASSO. J Comput Graph Stat 12:531–547
Jolliffe IT (1987) Rotation of principal components: Some comments. J Climatol 7:507–510
Jolliffe IT (1995) Rotation of principal components: Choice of normalization constraints. J Appl
Stat 22:29–35
Jolliffe IT (2002) Principal component analysis, 2nd edn. Springer, New York
Jones MC (1983) The projection pursuit algorithm for exploratory data analysis. Ph.D. Thesis,
University of Bath
Jones MC, Sibson R (1987) What is projection pursuit? J R Statist Soc A 150:1–36
Jones RH (1975) Estimating the variance of time averages. J Appl Meteor 14:159–163
Jöreskog KG (1967) Some contributions to maximum likelihood factor analysis. Psychometrika
32:443–482
Jöreskog KG (1969) A general approach to confirmatory maximum likelihood factor analysis.
Psychometrika 34:183–202
Jung T-P, Makeig S, Mckeown MJ, Bell AJ, Lee T-W, Sejnowski TJ (2001) Imaging brain dynamics
using independent component analysis. Proc IEEE 89:1107–1122
Jungclaus J (2008) MPI-M earth system modelling framework: millennium full forcing experiment
(ensemble member 1). World Data Center for climate. CERA-DB “mil0010”. http://cera-www.
dkrz.de/WDCC/ui/Compact.jsp?acronym=mil0010
Jutten C, Herault J (1991) Blind separation of sources, part i: An adaptive algorithm based on
neuromimetic architecture. Signal Process 24:1–10
Kaiser HF (1958) The varimax criterion for analytic rotation in factor analysis. Psychometrika
23:187–200
Kano Y, Miyamoto Y, Shimizu S (2003) Factor rotation and ICA. In: Proceedings of the 4th
international symposium on independent component analysis and blind source separation
(Nara, Japan), pp 101–105
Kao SK (1968) Governing equations and spectra for atmospheric motion and transports in
frequency-wavenumber space. J Atmos Sci 25:32–38
Kapur JN (1989) Maximum-entropy models in science and engineering. Wiley, New York
Karlin S, Taylor HM (1975) A first course in stochastic processes, 2nd edn. Academic Press,
Boston
Karthick S, Malathi D, Arun C (2018) Weather prediction analysis using random forest algorithm.
Int J Pure Appl Math 118:255–262
Keller LB (1935) Expanding of limit theorems of probability theory on functions with continuous
arguments (in Russian). Works Main Geophys Observ 4:5–19
Kendall MG (1994) Advanced theory of statistics. Vol I: distribution theory, 6th edn. In: Stuart A,
Ord JK (eds). Arnold, London.
Kendall MG, Stuart A (1961) The advanced theory of statistics: Inference and relationships, 3rd
edn. Griffin, London.
Kendall MG, Stuart A (1977) The advanced Theory of Statistics. Volume 1: distribution theory,
4th edn. Griffin, London
Keogh EJ, Chu S, Hart D, Pazzani MJ (2001) An online algorithm for segmenting time series. In:
Proceedings 2001 IEEE international conference on data mining, pp 289–296. https://doi.org/
10.1109/ICDM.2001.989531
Kettenring JR (1971) Canonical analysis of several sets of variables. Biometrika 58:433–451
Khatri CG (1976) A note on multiple and canonical correlation for a singular covariance matrix.
Psychometrika 41:465–470
Khedairia S, Khadir MT (2008) Self-organizing map and k-means for meteorological day type
identification for the region of Annaba–Algeria. In: 7th computer information systems and
industrial management applications, Ostrava, pp 91–96. ISBN: 978-0-7695-3184-7
Kiers HAL (1994) Simplimax: Oblique rotation to an optimal target with simple structure.
Psychometrika 59:567–579
Kikkawa S, Ishida M (1988) Number of degrees of freedom, correlation times, and equivalent
bandwidths of a random process. IEEE Trans Inf Theory 34:151–155
Kiladis GN, Weickmann KM (1992) Circulation anomalies associated with tropical convection
during northern winter. Mon Wea Rev 120:1900–1923
Killworth PD, McIntyre ME (1985) Do Rossby-wave critical layers absorb, reflect or over-reflect?
J Fluid Mech 161:449–492
Kim K-Y, Hamlington B, Na H (2015) Theoretical foundation of cyclostationary EOF analysis for
geophysical and climate variables: concept and examples. Eart Sci Rev 150:201–218
Kim K-Y, North GR (1999) A comparison of study of EOF techniques: analysis of non-stationary
data with periodic statistics. J Climate 12:185–199
Kim K-Y, Wu Q (1999) A comparison study of EOF techniques: Analysis of nonstationary data
with periodic statistics. J Climate 12:185–199
Kim K-Y, North GR, Huang J (1996) EOFs of one-dimensional cyclostationary time series:
Computations, examples, and stochastic modeling. J Atmos Sci 53:1007–1017
Kim K-Y, North GR (1997) EOFs of harmonizable cyclostationary processes. J Atmos Sci
54:2416–2427
Kimoto M, Ghil M, Mo KC (1991) Spatial structure of the extratropical 40-day oscillation. In:
Proc. 8’th conf. atmos. oceanic waves and stability. Amer. Meteor. Soc., Boston, pp 115–116
Knighton J, Pleiss G, Carter E, Walter MT, Steinschneider S (2019) Potential predictability of
regional precipitation and discharge extremes using synoptic-scale climate information via
machine learning: an evaluation for the eastern continental United States. J Hydrometeorol
20:883–900
Knutson TR, Weickmann KM (1987) 30–60 day atmospheric oscillation: Composite life cycles of
convection and circulation anomalies. Mon Wea Rev 115:1407–1436
Kobayashi S, Ota Y, Harada Y, Ebita A, Moriya M, Onoda H, Onogi K, Kamahori H, Kobayashi
C, Endo H, Miyaoka K, Takahashi K (2015) The JRA-55 Reanalysis: General specifications
and basic characteristics. J Meteor Soc Jpn 93:5–48
Kohonen T (2001) Self-organizing maps, 3rd edn. Springer, Berlin, 501 p
Kohonen T (1982) Self-organized formation of topologically correct feature maps. Biological
Cybernetics 43:59–69
Kohonen T (1990) The self-organizing map. Proc IEEE 78:1464–1480
Kolmogorov AN (1933) Foundations of the theory of probability (Grundbegriffe der Wahrschein-
lichkeitsrechnung). Translated by Nathan Morrison and Published by Chelsea Publishing
Company, New York, 1950
Kolmogorov AN (1939) Sur l’interpolation et l’extrapolation des suites stationaires. Comptes
Rendus Acad Sci Paris 208:2043–2045
Kolmogorov AN (1941) Stationary sequences in Hilbert space. Bull Math Univ Moscow 2:1–40
Kondrashov D, Chekroun MD, Yuan X, Ghil M (2018a) Data-adaptive harmonic decomposition
and stochastic modeling of Arctic sea ice. Dyn Statist Clim Syst 3:179–205
Kondrashov D, Chekroun MD, Berloff P (2018b) Multiscale Stuart-Landau emulators:
Application to wind-driven ocean gyres. Fluids 3:21. https://doi.org/10.3390/fluids3010021
Kooperberg C, O’Sullivan F (1996) Predictive oscillation patterns: A synthesis of methods for
spatial-temporal decomposition of random fields. J Am Statist Assoc 91:1485–1496
Koopmans LH (1974) The spectral analysis of time series. Academic Press, New York
Kramer MA (1991) Nonlinear principal component analysis using autoassociative neural networks.
AIChE J 37:233–243
Kress R, Martensen E (1970) Anwendung der rechteckregel auf die reelle Hilbertransformation
mit unendlichem intervall. Z Angew Math Mech 50:T61–T64
Krishnamurti TN, Chakraborty DR, Cubucku N, Stefanova L, Vijaya Kumar TSV (2003) A
mechanism of the Madden-Julian oscillation based on interactions in the frequency domain.
Q J R Meteorol Soc 129:2559–2590
Kruskal JB (1964a) Multidimensional scaling by optimizing goodness of fit to a nonmetric
hypothesis. Psychometrika 29:1–27
Kruskal JB (1964b) Nonmetric multidimensional scaling: a numerical method. Psychometrika
29:115–129
Kruskal JB (1969) Toward a practical method which helps uncover the structure of a set of
multivariate observations by finding the linear transformation which optimizes a new ‘index
of condensation’. In: Milton RC, Nelder JA (eds) Statistical computation, New York
Kruskal JB (1972) Linear transformations of multivariate data to reveal clustering. In: Multidimensional
scaling: theory and application in the behavioural sciences, I, theory. Seminar Press,
New York
Krzanowski WJ, Marriott FHC (1994) Multivariate analysis, Part 1. Distributions, ordination and
inference. Arnold, London
Krzanowski WJ (2000) Principles of multivariate analysis: A user’s perspective, 2nd edn. Oxford
University Press, Oxford
Krzanowski WJ (1984) Principal component analysis in the presence of group structure. Appl
Statist 33:164–168
Krzanowski WJ (1979) Between-groups comparison of principal components. J Am Statist Assoc
74:703–707
Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional
neural networks. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ (eds) Advances in neural
information processing systems, vol 25. Curran Associates, Red Hook, NY, pp 1097–1105
Kubáčková L, Kubáček L, Kukuča J (1987) Probability and statistics in geodesy and geophysics.
Elsevier, Amsterdam
Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22:79–86
Kundu PK, Allen JS (1976) Some three-dimensional characteristics of low-frequency current
fluctuations near the Oregon coast. J Phys Oceanogr 6:181–199
Kutzbach JE (1967) Empirical eigenvectors of sea-level pressure, surface temperature and
precipitation complexes over North America. J Appl Meteor 6:791–802
Kwon S (1999) Clustering in multivariate data: visualization, case and variable reduction. Ph.D.
Thesis, Iowa State University
Kwasniok F (1996) The reduction of complex dynamical systems using principal interaction
patterns. Physica D 92:28–60
Kwasniok F (1997) Optimal Galerkin approximations of partial differential equations using
principal interaction patterns. Phys Rev E 55:5365–5375
Kwasniok F (2004) Empirical low-order models of barotropic flow. J Atmos Sci 61:235–245
Labitzke K, van Loon H (1999) The stratosphere. Springer, New York
Laplace PS (1951) A philosophical essay on probabilities. Dover Publications, New York
Larsson E, Fornberg B (2003) A numerical study of some radial basis function based solution
methods for elliptic PDEs. Comput Math Appli 47:37–55
Laughlin S (1981) A simple coding procedure enhances a neuron’s information capacity. Z
Natureforsch 36c:910–912
Lawley DN (1956) Tests of significance for the latent roots of covariance and correlation matrices.
Biometrika 43:128–136
Lawley DN, Maxwell AE (1971) Factor analysis as a statistical method, 2nd edn. Butterworth,
London
Lanzante JR (1990) The leading modes of 10–30 day variability in the extratropics of the Northern
Hemisphere during the cold season. J Atmos Sci 47:2115–2140
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444. https://doi.org/10.1038/
nature14539
Ledoit O, Wolf M (2004) A well-conditioned estimator for large-dimensional covariance matrices.
J Multivar Anal 88:365–411
Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization.
Nature 401:788–791
Legates DR (1991) The effect of domain shape on principal components analyses. Int J Climatol
11:135–146
Legates DR (1993) The effect of domain shape on principal components analyses: A reply. Int J
Climatol 13:219–228
Leith CE (1973) The standard error of time-average estimates of climatic means. J Appl Meteorol
12:1066–1069
Leloup JA, Lachkar Z, Boulanger JP, Thiria S (2007) Detecting decadal changes in ENSO using
neural networks. Clim Dyn 28:147–162. https://doi.org/10.1007/s00382-006-0173-1. ISSN:
0930-7575
Leurgans SE, Moyeed RA, Silverman BW (1993) Canonical correlation analysis when the data are
curves. J R Statist Soc B 55:725–740
Li G, Ren B, Yang C, Zheng J (2011a) Revisiting the trend of the tropical and subtropical Pacific
surface latent heat flux during 1977–2006. J Geophys Res 116:D10115. https://doi.org/10.1029/
2010JD015444
Li G, Ren B, Zheng J, Yang C (2011b) Trend singular value decomposition analysis and its
application to the global ocean surface latent heat flux and SST anomalies. J Climate 24:2931–
2948
Lin G-F, Chen L-H (2006) Identification of homogeneous regions for regional frequency analysis
using the self-organizing map. J Hydrology 324:1–9. ISSN: 0022-1694
Lingoes JC, Roskam EE (1973) A mathematical and empirical analysis of two multidimensional
scaling algorithms. Psychometrika 38(Monograph supplement):1–93
Linz P, Wang RLC (2003) Exploring numerical methods: An introduction to scientific computing
using MATLAB. Jones and Bartlett Publishers, Sudbury, MA
Lim Y-K, Kim K-Y (2006) A new perspective on the climate prediction of Asian summer monsoon
precipitation. J Climate 19:4840–4853
Lim Y-K, Cocke S, Shin DW, Schoof JT, LaRow TE, O’Brien JJ (2010) Downscaling large-scale
NCEP CFS to resolve fine-scale seasonal precipitation and extremes for the crop growing
seasons over the southeastern United States. Clim Dyn 35:449–471
Liu Y, Weisberg RH (2007) Ocean currents and sea surface heights estimated across the West
Florida Shelf. J Phys Oceanog 37:1697–1713. ISSN: 0022-3670
Liu Y, Weisberg RH, Mooers CNK (2006) Performance evaluation of the self-organizing map for feature extraction. J Geophys Res 111:C05018. https://doi.org/10.1029/2005JC003117. ISSN: 0148-0227
Liu Y, Weisberg RH (2005) Patterns of ocean current variability on the West Florida Shelf using
the self-organizing map. J Geophys Res 110:C06003. https://doi.org/10.1029/2004JC002786
Loève M (1948) Fonctions aléatoires du second ordre. Supplement to P. Lévy: Processus Stochastiques et Mouvement Brownien. Gauthier-Villars, Paris
Loève M (1963) Probability theory. Van Nostrand Reinhold, New York
Loève M (1978) Probability theory, vol II, 4th edn. Springer, 413 p
Lorenz EN (1963) Deterministic non-periodic flow. J Atmos Sci 20:130–141
Lorenz EN (1970) Climate change as a mathematical problem. J Appl Meteor 9:325–329
Lorenz EN (1956) Empirical orthogonal functions and statistical weather prediction. Technical
report, Statistical Forecast Project Report 1, Dept. of Meteor., MIT, 49 p
Losada IJ, Reguero BG, Méndez FJ, Castanedo S, Abascal AJ, Minguez R (2013) Long-term
changes in sea-level components in Latin America and the Caribbean. Global Planetary Change
104:34–50
Lucio JH, Valdés R, Rodríguez LR (2012) Improvements to surrogate data methods for nonstation-
ary time series. Phys Rev E 85:056202
Luxburg U (2007) A tutorial on spectral clustering. Statist Comput 17:395–416
Lütkepohl H (2006) New introduction to multiple time series analysis. Springer, Berlin
Madden RA, Julian PR (1971) Detection of a 40–50 day oscillation in the zonal wind in the tropical Pacific. J Atmos Sci 28:702–708
Madden RA, Julian PR (1972) Description of global-scale circulation cells in the tropics with a
40–50 day period. J Atmos Sci 29:1109–1123
Madden RA, Julian PR (1994) Observations of the 40–50-day tropical oscillation−A review. Mon
Wea Rev 122:814–837
Magnus JR, Neudecker H (1995) Matrix differential calculus with applications in statistics and
econometrics. Wiley, Chichester
Malik N, Bookhagen B, Marwan N, Kurths J (2012) Analysis of spatial and temporal extreme
monsoonal rainfall over South Asia using complex networks. Clim Dyn 39:971–987. https://
doi.org/10.1007/s00382-011-1156-4
Malozemov VN, Pevnyi AB (1992) Fast algorithm of the projection of a point onto the simplex.
Vestnik St. Petersburg University 1(1):112–113
Mansour A, Jutten C (1996) A direct solution for blind separation of sources. IEEE Trans Signal
Process 44:746–748
Mardia KV, Kent JT, Bibby JM (1979) Multivariate analysis. Academic Press, London
Mardia KV (1980) Tests of univariate and multivariate normality. In: Krishnaiah PR (ed) Handbook
of statistics 1: Analysis of variance. North-Holland Publishing, pp 279–320
Martinez WL, Martinez AR (2002) Computational statistics handbook with MATLAB. Chapman
and Hall, Boca Raton
Martinez AR, Solka J, Martinez WL (2010) Exploratory data analysis with MATLAB, 2nd edn. CRC Press, 530 p
Maruyama T (1997) The quasi-biennial oscillation (QBO) and equatorial waves−A historical
review. Pap Meteorol Geophys 47:1–17
Marwan N, Donges JF, Zou Y, Donner RV, Kurths J (2009) Complex network approach for
recurrence analysis of time series. Phys Lett A 373:4246–4254
Mathar R (1985) The best Euclidean fit to a given distance matrix in prescribed dimensions. Linear
Algebra Appl 67:1–6
Matsubara Y, Sakurai Y, van Panhuis WG, Faloutsos C (2014) FUNNEL: automatic mining
of spatially coevolving epidemics. In: KDD, pp 105–114 https://doi.org/10.1145/2623330.
2623624
Matthews AJ (2000) Propagation mechanisms for the Madden-Julian oscillation. Q J R Meteorol
Soc 126:2637–2651
Masani P (1966) Recent trends in multivariate prediction theory. In: Krishnaiah P (ed) Multivariate
analysis – I. Academic Press, New York, pp 351–382
Mazloff MR, Heimbach P, Wunsch C (2010) An eddy-permitting Southern Ocean State Estimate. J Phys Oceanogr 40:880–899
McCullagh P, Nelder JA (1989) Generalized linear models, 2nd edn. Chapman & Hall, London,
511 p
McCulloch WS, Pitts W (1943) A logical calculus of the ideas immanent in nervous activity. Bull
Math Biophys 5:115–133
McDonald AJ, Cassano JJ, Jolly B, Parsons S, Schuddeboom A (2016) An automated satellite cloud classification scheme using self-organizing maps: Alternative ISCCP weather states. J Geophys Res Atmos 121:13,009–13,030. https://doi.org/10.1002/2016JD025199
McEliece RJ (1977) The theory of information and coding. Addison-Wesley, Reading, MA
McGee VE (1968) Multidimensional scaling of N sets of similarity measures: a nonmetric
individual differences approach. Multivar Behav Res 3:233–248
McLachlan GJ, Krishnan T (1997) The EM algorithm and extensions. Wiley, New York
McLachlan GJ (2004) Discriminant analysis and statistical pattern recognition. Wiley Interscience,
545 p
Meila M, Shi J (2000) Learning segmentation by random walks. In: Proceedings of NIPS, pp 873–
879
Mercer J (1909) Functions of positive and negative type and their connection with the theory of integral equations. Phil Trans R Soc Lond A 209:415–446
Merrifield MA, Winant CD (1989) Shelf circulation in the Gulf of California: A description of the
variability. J Geophys Res 94:18133–18160
Merrifield MA, Guza RT (1990) Detecting propagating signals with complex empirical orthogonal
functions: A cautionary note. J Phys Oceanogr 20:1628–1633
Mestas-Nuñez AM (2000) Orthogonality properties of rotated empirical modes. Int J Climatol
20:1509–1516
Metropolis N, Rosenbluth A, Rosenbluth M, Teller A, Teller E (1953) Equation of state calculation
by fast computing machines. J Chem Phys 21:1087–1092
Meyer Y (1992) Wavelets and operators. Cambridge University Press, New York, 223 p
Meza-Padilla R, Enriquez C, Liu Y, Appendini CM (2019) Ocean circulation in the western Gulf of Mexico using self-organizing maps. J Geophys Res Oceans 124:4152–4167. https://doi.org/10.1029/2018JC014377
Micchelli CA (1986) Interpolation of scattered data: Distance matrices and conditionally positive definite functions. Constr Approx 2:11–22
Michelot C (1986) A finite algorithm for finding the projection of a point onto the canonical
simplex of Rn . JOTA 50:195–200
Mirsky L (1955) An introduction to linear algebra. Oxford University Press, Oxford, 896pp
Mitchell TM (1998) Machine learning. McGraw-Hill, New York, 432 p
Mikhlin SG (1964) Integral equations, 2nd edn. Pergamon Press, London
Minnotte MC, West RW (1999) The data image: A tool for exploring high dimensional data sets.
In: Proc. ASA section on stat. graphics, Dallas, TX, American Statistical Association, pp 25–33
Moiseiwitsch BL (1977) Integral equations. Longman, London
Monahan AH, DelSole T (2009) Information theoretic measures of dependence, compactness, and
non-Gaussianity for multivariate probability distributions. Nonlin Process Geophys 16:57–64
Monahan AH, Fyfe CJ (2007) Comment on the shortcomings of nonlinear principal component
analysis in identifying circulation regimes. J Climate 20:374–377
Monahan AH, Pandolfo L, Fyfe JC (2001) The preferred structure of variability of the Northern Hemisphere atmospheric circulation. Geophys Res Lett 27:1139–1142
Monahan AH, Tangang FT, Hsieh WW (1999) A potential problem with extended EOF analysis of
standing wave fields. Atmosphere-Ocean 3:241–254
Monahan AH, Fyfe JC, Flato GM (2000) A regime view of northern hemisphere atmospheric
variability and change under global warming. Geophys Res Lett 27:1139–1142
Monahan AH (2000) Nonlinear principal component analysis by neural networks: theory and
application to the Lorenz system. J Climate 13:821–835
Monahan AH (2001) Nonlinear principal component analysis: tropical Indo–Pacific sea surface
temperature and sea level pressure. J Climate 14:219–233
Moody J, Darken CJ (1989) Fast learning in networks of locally-tuned processing units. Neural
Comput 1:281–294
Moon TK (1996) The expectation maximization algorithm. IEEE Signal Process Mag, 47–60
Mori A, Kawasaki N, Yamazaki K, Honda M, Nakamura H (2006) A reexamination of the northern
hemisphere sea level pressure variability by the independent component analysis. SOLA 2:5–8
Morup M, Hansen LK (2012) Archetypal analysis for machine learning and data mining. Neurocomputing 80:54–63
Morozov VA (1984) Methods for solving incorrectly posed problems. Springer, Berlin. ISBN: 3-
540-96059-7
Morrison DF (1967) Multivariate statistical methods. McGraw-Hill, New York
Morton SC (1989) Interpretable projection pursuit. Technical Report 106. Department of Statistics,
Stanford University, Stanford. https://www.osti.gov/biblio/5005529-interpretable-projection-
pursuit
Mukhin D, Gavrilov A, Feigin A, Loskutov E, Kurths J (2015) Principal nonlinear dynamical
modes of climate variability. Sci Rep 5:15510. https://doi.org/10.1038/srep15510
Munk WH (1950) On the wind-driven ocean circulation. J Meteorol 7:79–93
Nadler B, Lafon S, Coifman RR, Kevrekidis I (2006) Diffusion maps, spectral clustering, and reaction coordinates of dynamical systems. Appl Comput Harmon Anal 21:113–127
Nason G (1992) Design and choice of projection indices. Ph.D. Thesis, The University of Bath
Nason G (1995) Three-dimensional projection pursuit. Appl Statist 44:411–430
Nason GP, Sibson R (1992) Measuring multimodality. Stat Comput 2:153–160
Nelder JA, Mead R (1965) A simplex method for function minimization. Comput J 7:308–313
Newman MEJ (2006) Modularity and community structure in networks. PNAS 103:8577–8582.
www.pnas.org/cgi/doi/10.1073/pnas.0601602103
Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Phys
Rev E 69:026113. https://doi.org/10.1103/PhysRevE.69.026113
Nielsen MA (2015) Neural networks and deep learning. Determination Press
North GR (1984) Empirical orthogonal functions and normal modes. J Atmos Sci 41:879–887
North GR, Bell TL, Cahalan FR, Moeng JF (1982) Sampling errors in the estimation of empirical
orthogonal functions. Mon Wea Rev 110:699–706
Neumaier A, Schneider T (2001) Estimation of parameters and eigenmodes of multivariate
autoregressive models. ACL Trans Math Soft 27:27–57
Nuttall AH, Carter GC (1982) Spectral estimation using combined time and lag weighting. Proc IEEE 70:1111–1125
Obukhov AM (1947) Statistically homogeneous fields on a sphere. Usp Mat Nauk 2:196–198
Obukhov AM (1960) The statistically orthogonal expansion of empirical functions. Bull Acad Sci
USSR Geophys Ser (English Transl.), 288–291
Ohba M, Kadokura S, Nohara D, Toyoda Y (2016) Rainfall downscaling of weekly ensemble
forecasts using self-organising maps. Tellus A 68:29293. https://doi.org/10.3402/tellusa.v68.
29293
Oja E (1982) A simplified neuron model as a principal component analyzer. J Math Biol 15:267–
273
Önskog T, Franzke C, Hannachi A (2018) Predictability and non-Gaussian characteristics of the
North Atlantic Oscillation. J Climate 31:537–554
Önskog T, Franzke C, Hannachi A (2020) Nonlinear time series models for the North Atlantic
Oscillation. Adv Statist Clim Meteorol Oceanog 6:1–17
Osborne AR, Kirwan AD, Provenzale A, Bergamasco L (1986) A search for chaotic behavior in large and mesoscale motions in the Pacific Ocean. Physica D Nonlinear Phenomena 23:75–83
Overland JE, Preisendorfer RW (1982) A significance test for principal components applied to a
cyclone climatology. Mon Wea Rev 110:1–4
Packard NH, Crutchfield JP, Farmer JDR, Shaw RS (1980) Geometry from a time series. Phys Rev
Lett 45:712–716
Palmer CE (1954) The general circulation between 200 mb and 10 mb over the equatorial Pacific. Weather 9:3541–3549
Pang B, Yue J, Zhao G, Xu Z (2017) Statistical downscaling of temperature with the random forest
model. Hindawi Adv Meteorol Article ID 7265178:11 p. https://doi.org/10.1155/2017/7265178
Panagiotopoulos F, Shahgedanova M, Hannachi A, Stephenson DB (2005) Observed trends and
teleconnections of the Siberian High: a recently declining center of action. J Climate 18:1411–
1422
Parlett BN, Taylor DR, Liu ZS (1985) A look-ahead Lanczos algorithm for nonsymmetric matrices.
Math Comput 44:105–124
Parzen E (1959) Statistical inference on time series by Hilbert space methods, I. Technical Report No. 23, Department of Statistics, Stanford University. (Published in Time Series Analysis Papers by E. Parzen, Holden-Day, San Francisco)
Parzen E (1961) An approach to time series. Ann Math Statist 32:951–989
Parzen E (1963) A new approach to the synthesis of optimal smoothing and prediction systems. In:
Bellman R (ed) Proceedings of a symposium on optimization. University of California Press,
Berkeley, pp 75–108
Pasmanter RA, Selten MF (2010) Decomposing data sets into skewness modes. Physica D
239:1503–1508
Pauthenet E (2018) Unraveling the thermohaline structure of the Southern Ocean using functional data analysis. Ph.D. thesis, Stockholm University
Pauthenet E, Roquet F, Madec G, Nerini D (2017) A linear decomposition of the Southern Ocean
thermohaline structure. J Phys Oceano 47:29–47
Pavan V, Tibaldi S, Brankovich C (2000) Seasonal prediction of blocking frequency: Results from
winter ensemble experiments. Q J R Meteorol Soc 126:2125–2142
Pawitan Y (2001) In all likelihood: statistical modelling and inference using likelihood. Oxford
University Press, Oxford
Pearson K (1901) On lines and planes of closest fit to systems of points in space. Phil Mag 2:559–
572
Pearson K (1895) Notes on regression and inheritance in the case of two parents. Proc R Soc
London 58:240–242
Pearson K (1920) Notes on the history of correlation. Biometrika 13:25–45
Pham D-T, Garrat P, Jutten C (1992) Separation of mixture of independent sources through
maximum likelihood approach. In: Proc EUSIPCO, pp 771–774
Pires CAL, Hannachi A (2021) Bispectral analysis of nonlinear interaction, predictability and
stochastic modelling with application to ENSO. Tellus A 73, 1–30
Plaut G, Vautard R (1994) Spells of low-frequency oscillations and weather regimes in the northern
hemisphere. J Atmos sci 51:210–236
Penland C (1989) Random forcing and forecasting using principal oscillation patterns. Mon Wea
Rev 117:2165–2185
Penland C, Sardeshmukh PD (1995) The optimal growth of tropical sea surface temperature
anomalies. J Climate 8:1999–2024
Pezzulli S, Hannachi A, Stephenson DB (2005) The variability of seasonality. J Climate 18:71–88
Philippon N, Jarlan L, Martiny N, Camberlin P, Mougin E (2007) Characterization of the
interannual and intraseasonal variability of west African vegetation between 1982 and 2002
by means of NOAA AVHRR NDVI data. J Climate 20:1202–1218
Pires CAL, Hannachi A (2017) Independent subspace analysis of the sea surface temperature
variability: non-Gaussian sources and sensitivity to sampling and dimensionality. Complexity.
https://doi.org/10.1155/2017/3076810
Pires CAL, Ribeiro AFS (2017) Separation of the atmospheric variability into non-Gaussian
multidimensional sources by projection pursuit techniques. Climate Dynamics 48:821–850
Poggio T, Girosi F (1990) Networks for approximation and learning. Proc IEEE 78:1481–1497
Polya G, Latta G (1974) Complex variables. Wiley, New York, 334pp
Powell MJD (1964) An efficient method for finding the minimum of a function of several variables
without calculating derivatives. Comput J 7:155–162
Powell MJD (1987) Radial basis functions for multivariate interpolation: a review. In: Mason JC,
Cox MG (eds) Algorithms for the approximation of functions and data. Oxford University
Press, Oxford, pp 143–167
Powell MJD (1990) The theory of radial basis function approximation in 1990. In: Light W (ed) Advances in numerical analysis, vol 2: Wavelets, subdivision algorithms and radial basis functions. Oxford University Press, Oxford
Preisendorfer RW, Mobley CD (1988) Principal component analysis in meteorology and oceanog-
raphy. Elsevier, Amsterdam
Press WH, et al (1992) Numerical recipes in Fortran: The Art of scientific computing. Cambridge
University Press, Cambridge
Priestley MB (1981) Spectral analysis of time series. Academic Press, London
Posse C (1995) Tools for two-dimensional exploratory projection pursuit. J Comput Graph Statist
4:83–100
Ramsay JO, Silverman BW (2006) Functional data analysis, 2nd edn. Springer Series in Statistics,
New York
Rasmusson EM, Arkin PA, Chen W-Y, Jalickee JB (1981) Biennial variations in surface tempera-
ture over the United States as revealed by singular decomposition. Mon Wea Rev 109:587–598
Rayner NA, Parker DE, Horton EB, Folland CK, Alexander LV, Rowell DP, Kent EC, Kaplan A (2003) Global analyses of sea surface temperature, sea ice, and night marine air temperature since the late nineteenth century. J Geophys Res 108(D14):4407
Rényi A (1961) On measures of entropy and information. In: Neyman J (ed) Proceedings of the Fourth Berkeley symposium on mathematical statistics and probability, vol I. The University of California Press, Berkeley, pp 547–561
Rényi A (1970) Probability theory. North Holland, Amsterdam, 666pp
Reed RJ, Campbell WJ, Rasmussen LA, Rogers RG (1961) Evidence of a downward propagating
annual wind reversal in the equatorial stratosphere. J Geophys Res 66:813–818
Reichenbach H (1937) Les fondements logiques du calcul des probabilités. Ann Inst H Poincaré 7:267–348
Rennert KJ, Wallace MJ (2009) Cross-frequency coupling, skewness and blocking in the Northern
Hemisphere winter circulation. J Climate 22:5650–5666
Renwick AJ, Wallace JM (1995) Predictable anomaly patterns and the forecast skill of northern
hemisphere wintertime 500-mb height fields. Mon Wea Rev 123:2114–2131
Reusch DB, Alley RB, Hewitson BC (2005) Relative performance of self-organizing maps and
principal component analysis in pattern extraction from synthetic climatological data. Polar
Geography 29(3):188–212. https://doi.org/10.1080/789610199
Reyment RA, Jöreskog KG (1996) Applied factor analysis in the natural sciences. Cambridge University Press, Cambridge
Richman MB (1981) Obliquely rotated principal components: An improved meteorological map
typing technique. J Appl Meteor 20:1145–1159
Richman MB (1986) Rotation of principal components. J Climatol 6:293–335
Richman MB (1987) Rotation of principal components: A reply. J Climatol 7:511–520
Richman M (1993) Comments on: The effect of domain shape on principal components analyses.
Int J Climatol 13:203–218
Richman M, Adrianto I (2010) Classification and regionalization through kernel principal compo-
nent analysis. Phys Chem Earth 35:316–328
Richman MB, Leslie LM (2012) Adaptive machine learning approaches to seasonal prediction of
tropical cyclones. Procedia Comput Sci 12:276–281
Richman MB, Leslie LM, Ramsay HA, Klotzbach PJ (2017) Reducing tropical cyclone prediction
errors using machine learning approaches. Procedia Comput Sci 114:314–323
Ripley BD (1994) Neural networks and related methods for classification. J Roy Statist Soc B
56:409–456
Risken H (1984) The Fokker-Planck equation. Springer
Ritter H (1995) Self-organizing feature maps: Kohonen maps. In: Arbib MA (ed) The handbook
of brain theory and neural networks. MIT Press, Cambridge, MA, pp 846–851
Roach GF (1970) Green’s functions: introductory theory with applications. Van Nostrand Reinhold Company, London
Rodgers JL, Nicewander WA (1988) Thirteen ways to look at the correlation coefficient. Am Statistician 42:59–66
Rodwell MJ, Hoskins BJ (1996) Monsoons and the dynamics of deserts. Q J Roy Meteorol Soc
122:1385–1404
Rogers GS (1980) Matrix derivatives. Marcel Dekker, New York
Rojas R (1996) Neural networks: A systematic introduction. Springer, Berlin, 509 p
Rosenblatt F (1962) Principles of neurodynamics. Spartman, New York
Rosenblatt F (1958) The perceptron: A probabilistic model for information storage and organiza-
tion in the brain. Psychological Rev 65:386–408
Ross SM (1998) A first course in probability, 5th edn. Prentice-Hall, New Jersey
Roweis ST (1998) The EM algorithm for PCA and SPCA. In: Jordan MI, Kearns MJ, Solla SA
(eds) Advances in neural information processing systems, vol 10. MIT Press, Cambridge, MA
Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290:2323–2326
Rozanov YuA (1967) Stationary random processes. Holden-Day, San-Francisco
Rubin DB, Thayer DT (1982) EM algorithms for ML factor analysis. Psychometrika 47:69–76
Rubin DB, Thayer DT (1983) More on EM for ML factor analysis. Psychometrika 48:253–257
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagation
errors. Nature 323:533–536
Rumelhart DE, Widrow B, Lehr AM (1994) The basic ideas in neural networks. Commun ACM
37:87–92
Runge J, Petoukhov V, Kurths J (2014) Quantifying the strength and delay of climatic interactions:
the ambiguities of cross correlation and a novel measure based on graphical models. J Climate
27:720–739
Runge J, Heitzig J, Kurths J (2012) Escaping the curse of dimensionality in estimating multivariate
transfer entropy. Phys Rev Lett 108:258701. https://doi.org/10.1103/PhysRevLett.108.258701
Saad Y (2003) Iterative methods for sparse linear systems, 2nd edn. SIAM, Philadelphia
Saad Y (1990) Numerical solution of large Lyapunov equations. In: Kaashoek AM, van Schuppen
JH, Ran AC (eds) Signal processing, scattering, operator theory, and numerical methods, Pro-
ceedings of the international symposium MTNS-89, vol III, pp 503–511, Boston, Birkhauser
Saad Y, Schultz MH (1985) Conjugate gradient-like algorithms for solving nonsymmetric linear
systems. Math Comput 44:417–424
Said-Houari B (2015) Differential equations: Methods and applications. Springer, Cham, 212pp
Salim A, Pawitan Y, Bond K (2005) Modelling association between two irregularly observed
spatiotemporal processes by using maximum covariance analysis. Appl Statist 54:555–573
Sammon JW Jr (1969) A nonlinear mapping for data structure analysis. IEEE Trans Comput C-
18:401–409
Samuel AL (1959) Some studies in machine learning using the game of checkers. IBM J Res Dev 3:211–229
Saunders DR (1961) The rationale for an “oblimax” method of transformation in factor analysis.
Psychometrika 26:317–324
Scher S (2020) Artificial intelligence in weather and climate prediction. Ph.D. Thesis in Atmospheric Sciences and Oceanography, Stockholm University, Sweden
Scher S (2018) Toward data-driven weather and climate forecasting: Approximating a simple
general circulation model with deep learning. Geophys Res Lett 45:12,616–12,622. https://
doi.org/10.1029/2018GL080704
Scher S, Messori G (2019) Weather and climate forecasting with neural networks: using general
circulation models (GCMs) with different complexity as a study ground. Geosci Model Dev
12:2797–2809
Schmidtko S, Johnson GC, Lyman JM (2013) MIMOC: A global monthly isopycnal upper-ocean
climatology with mixed layers. J Geophys Res, 118. https://doi.org/10.1002/jgrc.20122
Schneider T, Neumaier A (2001) Algorithm 808: ARFit − A Matlab package for the estimation
of parameters and eigenmodes of multivariate autoregressive models. ACM Trans Math Soft
27:58–65
Schölkopf B, Smola A, Müller K-R (1998) Nonlinear component analysis as a kernel eigenvalue
problem. Neural Comput 10:1299–1319
Schölkopf B, Mika S, Burgers CJS, Knirsch P, Müller K-R, Rätsch G, Smola A (1999) Input space
vs. feature space in kernel-based methods. IEEE Trans Neural Netw 10:1000–1017
Schoenberg IJ (1935) Remarks to Maurice Fréchet’s article ‘Sur la définition axiomatique d’une classe d’espaces distanciés vectoriellement applicables sur l’espace de Hilbert’. Ann Math (2nd series) 36:724–732
Schoenberg IJ (1964) Spline interpolation and best quadrature formulae. Bull Am Math Soc 70:143–148
Schott JR (1991) Some tests for common principal component subspaces in several groups.
Biometrika 78:771–778
Schott JR (1988) Common principal component subspaces in two groups. Biometrika 75:229–236
Scott DW (1992) Multivariate density estimation: theory, practice, and vizualization. Wiley, New
York
Schuenemann KC, Cassano JJ (2010) Changes in synoptic weather patterns and Greenland precipitation in the 20th and 21st centuries: 2. Analysis of 21st century atmospheric changes using self-organizing maps. J Geophys Res 115:D05108. https://doi.org/10.1029/2009JD011706. ISSN: 0148-0227
Schuenemann KC, Cassano JJ, Finnis J (2009) Forcing of precipitation over Greenland: Synoptic
climatology for 1961–99. J Hydrometeorol 10:60–78. https://doi.org/10.1175/2008JHM1014.
1. ISSN: 1525-7541
Scott DW, Thompson JR (1983) Probability density estimation in higher dimensions. In: Computer
science and statistics: Proceedings of the fifteenth symposium on the interface, pp 173–179
Seal HL (1967) Multivariate statistical analysis for biologists. Methuen, London
Schmid PJ (2010) Dynamic mode decomposition of numerical and experimental data. J Fluid Mech
656(1):5–28
Seitola T, Mikkola V, Silen J, Järvinen H (2014) Random projections in reducing the dimension-
ality of climate simulation data. Tellus A, 66. Available at www.tellusa.net/index.php/tellusa/
article/view/25274
Seitola T, Silén J, Järvinen H (2015) Randomized multi-channel singular spectrum analysis of the
20th century climate data. Tellus A 67:28876. Available at https://doi.org/10.3402/tellusa.v67.
28876.
Seltman HJ (2018) Experimental design and analysis. http://www.stat.cmu.edu/~hseltman/309/
Book/Book.pdf
Seth S, Eugster MJA (2015) Probabilistic archetypal analysis. Machine Learning. https://doi.org/
10.1007/s10994-015-5498-8
Shalvi O, Weinstein E (1990) New criteria for blind deconvolution of nonminimum phase systems
(channels). IEEE Trans Inf Theory 36:312–321
Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27:379–423, 623–
656
Shepard RN (1962a) The analysis of proximities: multidimensional scaling with unknown distance
function. Part I. Psychometrika 27:125–140
Shepard RN (1962b) The analysis of proximities: multidimensional scaling with unknown distance
function. Part II. Psychometrika 27:219–246
Sheridan SC, Lee CC (2010) Synoptic climatology and the general circulation model. Progress
Phys Geography 34:101–109. ISSN: 1477-0296
Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach
Intell 22:888–905
Schnur R, Schmitz G, Grieger N, von Storch H (1993) Normal modes of the atmosphere as
estimated by principal oscillation patterns and derived from quasi-geostrophic theory. J Atmos
Sci 50:2386–2400
Sibson R (1972) Order invariant methods for data analysis. J Roy Statist Soc B 34:311–349
Sibson R (1978) Studies in the robustness of multidimensional scaling: procrustes statistics. J Roy
Statist Soc B 40:234–238
Sibson R (1979) Studies in the robustness of multidimensional scaling: Perturbational analysis of
classical scaling. J Roy Statist Soc B 41:217–229
Sibson R (1981) Multidimensional scaling. Wiley, Chichester
Silverman BW (1986) Density estimation for statistics and data analysis. Chapman and Hall,
London
Simmons AJ, Wallace MJ, Branstator WG (1983) Barotropic wave propagation and instability, and
atmospheric teleconnection patterns. J Atmos Sci 40:1363–1392
Smith S (1994) Optimization techniques on Riemannian manifolds. In: Bloch A (ed) Hamiltonian and gradient flows, algorithms and control. Fields Institute Communications, vol 3. Amer Math Soc, pp 113–136
Snyman JA (1982) A new and dynamic method for unconstrained optimisation. Appl Math Modell
6:449–462
Solidoro C, Bandelj V, Barbieri P, Cossarini G, Fonda Umani S (2007) Understanding dynamic of
biogeochemical properties in the northern Adriatic Sea by using self-organizing maps and k-
means clustering. J Geophys Res 112:C07S90. https://doi.org/10.1029/2006JC003553. ISSN:
0148-0227
Sočan G (2003) The incremental value of minimum rank factor analysis. Ph.D. Thesis, University
of Groningen, Groningen
Spearman C (1904a) General intelligence, objectively determined and measured. Am J Psy
15:201–293
Spearman C (1904b) The proof and measurement of association between two things. Am J Psy
15:72, and 202
Spence I, Garrison RF (1993) A remarkable scatterplot. Am Statistician 47:12–19
Spendley W, Hext GR, Himsworth FR (1962) Sequential applications of simplex designs in optimization and evolutionary operations. Technometrics 4:441–461
Stewart D, Love W (1968) A general canonical correlation index. Psy Bull 70:160–163
Steinschneider S, Lall U (2015) Daily precipitation and tropical moisture exports across the eastern United States: An application of archetypal analysis to identify spatiotemporal structure. J Climate 28:8585–8602
Stephenson G (1973) Mathematical methods for science students, 2nd edn. Dover Publication,
Mineola, 526 p
Stigler SM (1986) The History of Statistics: The Measurement of Uncertainty Before 1900.
Harvard University Press, Cambridge, MA
Stommel H (1948) The westward intensification of wind-driven ocean currents. EOS Trans Amer
Geophys Union 29:202–206
Stone M, Brooks RJ (1990) Continuum regression: cross-validation sequentially constructed
prediction embracing ordinary least squares, partial least squares and principal components
regression. J Roy Statist Soc B52:237–269
Su Z, Hu H, Wang G, Ma Y, Yang X, Guo F (2018) Using GIS and Random Forests to identify fire
drivers in a forest city, Yichun, China. Geomatics Natural Hazards Risk 9:1207–1229. https://
doi.org/10.1080/19475705.2018.1505667
Subashini A, Thamarai SM, Meyyappan T (2019) Advanced weather forecasting prediction using deep learning. Int J Res Appl Sci Eng Tech IJRASET 7:939–945. www.ijraset.com
Sura P, Hannachi A (2015) Perspectives of non-Gaussianity in atmospheric synoptic and low-frequency variability. J Climate 28:5091–5114
Swenson ET (2015) Continuum power CCA: A unified approach for isolating coupled modes. J
Climate 28:1016–1030
Takens F (1981) Detecting strange attractors in turbulence. In: Rand D, Young LS (eds) Dynamical systems and turbulence, Warwick 1980. Lecture Notes in Mathematics, vol 898. Springer, New York, pp 366–381
Talley LD (2008) Freshwater transport estimates and the global overturning circulation: shallow, deep and throughflow components. Progress in Oceanography 78:257–303
Taylor GI (1921) Diffusion by continuous movements. Proc Lond Math Soc 20(2):196–212
Telszewski M, Chazottes A, Schuster U, Watson AJ, Moulin C, Bakker DCE, Gonzalez-Davila
M, Johannessen T, Kortzinger A, Luger H, Olsen A, Omar A, Padin XA, Rios AF, Steinhoff
T, Santana-Casiano M, Wallace DWR, Wanninkhof R (2009) Estimating the monthly pCO2
distribution in the North Atlantic using a self-organizing neural network. Biogeosciences
6:1405–1421. ISSN: 1726–4170
Tenenbaum JB, de Silva V, Langford JC (2000) A global geometric framework for nonlinear
dimensionality reduction. Science 290:2319–2323
TerMegreditchian MG (1969) On the determination of the number of independent stations which
are equivalent to prescribed systems of correlated stations (in Russian). Meteor Hydrol 2:24–36
Teschl G (2012) Ordinary differential equations and dynamical systems. Graduate Studies in
Mathematics, vol 140, Amer Math Soc, Providence, RI, 345pp
Thacker WC (1996) Metric-based principal components: data uncertainties. Tellus 48A:584–592
Thacker WC (1999) Principal predictors. Int J Climatol 19:821–834
Tikhonov AN (1963) Solution of incorrectly formulated problems and the regularization method.
Sov Math Dokl 4:1035–1038
Theiler J, Eubank S, Longtin A, Galdrikian B, Farmer JD (1992) Testing for nonlinearity in time
series: the method of surrogate data. Physica D 58:77–94
Thiebaux HJ (1994) Statistical data analyses for ocean and atmospheric sciences. Academic Press
Thomas JB (1969) An introduction to statistical communication theory. Wiley
Thomson RE, Emery WJ (2014) Data analysis methods in physical oceanography, 3rd edn.
Elsevier, Amsterdam, 716 p
Thompson DWJ, Wallace MJ (1998) The Arctic Oscillation signature in wintertime geopotential height and temperature fields. Geophys Res Lett 25:1297–1300
Thompson DWJ, Wallace MJ (2000) Annular modes in the extratropical circulation. Part I: Month-
to-month variability. J Climate 13:1000–1016
Thompson DWJ, Wallace JM, Hegerl GC (2000) Annular modes in the extratropical circulation,
Part II: Trends. J Climate 13:1018–1036
Thurstone LL (1940) Current issues in factor analysis. Psychological Bulletin 37:189–236
Thurstone LL (1947) Multiple factor analysis. The University of Chicago Press, Chicago
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc B 58:267–288
Tippett MK, DelSole T, Mason SJ, Barnston AG (2008) Regression based methods for finding
coupled patterns. J Climate 21:4384–4398
Tipping ME, Bishop CM (1999) Probabilistic principal components. J Roy Statist Soc B 61:611–
622
Toumazou V, Cretaux J-F (2001) Using a Lanczos eigensolver in the computation of empirical
orthogonal functions. Mon Wea Rev 129:1243–1250
Torgerson WS (1952) Multidimensional scaling I: Theory and method. Psychometrika 17:401–419
Torgerson WS (1958) Theory and methods of scaling. Wiley, New York
Trenberth KE, Jones DP, Ambenje P, Bojariu R, Easterling D, Klein Tank A, Parker D, Rahimzadeh F, Renwick AJ, Rusticucci M, Soden B, Zhai P (2007) Observations: surface and atmospheric climate change. In: Solomon S, Qin D, Manning M, et al (eds) Climate change 2007: The physical science basis. Contribution of Working Group I to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change. Cambridge University Press, pp 235–336
Trenberth KE, Shin W-TK (1984) Quasi-biennial fluctuations in sea level pressures over the Northern Hemisphere. Mon Wea Rev 111:761–777
Trendafilov NT (2010) Stepwise estimation of common principal components. Comput Statist Data
Anal 54:3446–3457
Trendafilov NT, Jolliffe IT (2006) Projected gradient approach to the numerical solution of the
SCoTLASS. Comput Statist Data Anal 50:242–253
Tsai YZ, Hsu K-S, Wu H-Y, Lin S-I, Yu H-L, Huang K-T, Hu M-C, Hsu S-Y (2020) Application of
random forest and ICON models combined with weather forecasts to predict soil temperature
and water content in a greenhouse. Water 12:1176
Tsonis AA, Roebber PJ (2004) The architecture of the climate network. Phys A 333:497–504.
https://doi.org/10.1016/j.physa.2003.10.045
Tsonis AA, Swanson KL, Roebber PJ (2006) What do networks have to do with climate? Bull Am
Meteor Soc 87:585–595. https://doi.org/10.1175/BAMS-87-5-585
Tsonis AA, Swanson KL, Wang G (2008) On the role of atmospheric teleconnections in climate. J
Climate 21(2990):3001
Tu JH, Rowley CW, Luchtenburg DM, Brunton SL, Kutz JN (2014) On dynamic mode decompo-
sition: Theory and applications. J Comput Dyn 1:391–421. https://doi.org/10.3934/jcd.2014.1.
391
Tucker LR (1966) Some mathematical notes on three-mode factor analysis. Psychometrika 31:279–
311
Tukey JW (1977) Exploratory data analysis. Addison-Wesley, Reading, MA
Tukey PA, Tukey JW (1981) Preparation, prechosen sequences of views. In: Barnett V (ed)
Interpreting multivariate data. Wiley, Chichester, pp 189–213
Tyler DE (1982) On the optimality of the simultaneous redundancy transformations. Psychome-
trika 47:77–86
Ulrych TJ, Bishop TN (1975) Maximum entropy spectral analysis and autoregressive decomposi-
tion. Rev Geophys Space Phys 13:183–200
Unkel S, Trendafilov NT, Hannachi A, Jolliffe IT (2010) Independent exploratory factor analysis
with application to atmospheric science data. J Appl Stat 37:1847–1862
Unkel S, Trendafilov NT, Hannachi A, Jolliffe IT (2011) Independent component analysis for three-
way data with an application from atmospheric science. J Agr Biol Environ Stat 16:319–338
Uppala SM, Kallberg PW, Simmons AJ, Andrae U, Bechtold VDC, Fiorino M, Gibson JK, Haseler
J, Hernandez A, Kelly GA, Li X, Onogi K, Saarinen S, Sokka N, Allan RP, Andersson E, Arpe
K, Balmaseda MA, Beljaars ACM, Berg LVD, Bidlot J, Bormann N, Caires S, Chevallier F,
Dethof A, Dragosavac M, Fisher M, Fuentes M, Hagemann S, Hólm E, Hoskins BJ, Isaksen
L, Janssen PAEM, Jenne R, Mcnally AP, Mahfouf J-F, Morcrette J-J, Rayner NA, Saunders
RW, Simon P, Sterl A, Trenberth KE, Untch A, Vasiljevic D, Viterbo P, Woollen J (2005) The
ERA-40 re-analysis. Q J Roy Meteorol Soc 131:2961–3012
van den Dool HM, Saha S, Johansson Å (2000) Empirical orthogonal teleconnections. J Climate 13:1421–1435
van den Dool HM (2011) An iterative projection method to calculate EOFs successively without
use of the covariance matrix. In: Science and technology infusion climate bulletin NOAA’s
National Weather Service. 36th NOAA annual climate diagnostics and prediction workshop,
Fort Worth, TX, 3–6 October 2011. www.nws.noaa.gov/ost/climate/STIP/36CDPW/36cdpw-
vandendool.pdf
van den Wollenberg AL (1977) Redundancy analysis: an alternative to canonical correlation
analysis. Psychometrika 42:207–219
Vasicek O (1976) A test for normality based on sample entropy. J R Statist Soc B 38:54–59
Vautard R, Ghil M (1989) Singular spectrum analysis in nonlinear dynamics, with applications to
paleoclimatic time series. Physica D 35:395–424
Vautard R, Yiou P, Ghil M (1992) Singular spectrum analysis: A toolkit for short, noisy chaotic
signals. Physica D 58:95–126
Venables WN, Ripley BD (1994) Modern applied statistics with S-plus. McGraw-Hill, New York
Vesanto J, Alhoniemi E (2000) Clustering of the self-organizing map. IEEE Trans Neural Net
11:586–600
Vesanto J (1997) Using the SOM and local models in time series prediction. In Proceedings of
workshop on self-organizing maps (WSOM’97), Espo, Finland, pp 209–214
Vinnikov KY, Robock A, Grody NC, Basist A (2004) Analysis of diurnal and seasonal cycles and trends in climate records with arbitrary observation times. Geophys Res Lett 31. https://doi.org/10.1029/2003GL019196
Vilibić I, et al (2016) Self-organizing maps-based ocean currents forecasting system. Sci Rep
6:22924. https://doi.org/10.1038/srep22924
von Mises R (1928) Wahrscheinlichkeit, Statistik und Wahrheit, 3rd rev. edn. Springer, Vienna,
1936; trans. as Probability, statistics and truth, 1939. W. Hodge, London
von Storch H (1995a) Spatial patterns: EOFs and CCA. In: von Storch H, Navarra A (eds) Analysis
of climate variability: Application of statistical techniques. Springer, pp 227–257
von Storch J (1995b) Multivariate statistical modelling: POP model as a first order approximation.
In: von Storch H, Navarra A (eds) Analysis of climate variability: application of statistical
techniques. Springer, pp 281–279
von Storch H, Zwiers FW (1999) Statistical analysis in climate research. Cambridge University
Press, Cambridge
von Storch H, Xu J (1990) Principal oscillation pattern analysis of the tropical 30- to 60-day
oscillation. Part I: Definition of an index and its prediction. Climate Dynamics 4:175–190
von Storch H, Bruns T, Fisher-Bruns I, Hasselmann KF (1988) Principal oscillation pattern analysis
of the 30- to 60-day oscillation in a general circulation model equatorial troposphere. J Geophys
Res 93:11022–11036
von Storch H, Bürger G, Schnur R, Storch J-S (1995) Principal oscillation patterns: A review. J Climate 8:377–400
von Storch H, Baumhefner D (1991) Principal oscillation pattern analysis of the tropical 30- to
60-day oscillation. Part II: The prediction of equatorial velocity potential and its skill. Climate
Dynamics 5:1–12
Wahba G (1979) Convergence rates of “Thin Plate” smoothing splines when the data are noisy.
In: Gasser T, Rosenblatt M (eds) Smoothing techniques for curve estimation. Lecture notes in
mathematics, vol 757. Springer, pp 232–246
Wahba G (1990) Spline models for observational data SIAM. Society for Industrial and Applied
Mathematics, Philadelphia, PA, 169 p
Wahba G (2000) Smoothing splines in nonparametric regression. Technical Report No 1024,
Department of Statistics, University of Wisconsin. https://www.stat.wisc.edu/sites/default/files/
tr1024.pdf
Walker GT (1909) Correlation in seasonal variation of climate. Mem Ind Met Dept 20:122
Walker GT (1923) Correlation in seasonal variation of weather, VIII, a preliminary study of world
weather. Mem Ind Met Dept 24:75–131
Walker GT (1924) Correlation in seasonal variation of weather, IX. Mem Ind Met Dept 25:275–332
Walker GT, Bliss EW (1932) World weather V. Mem Roy Met Soc 4:53–84
Wallace JM (2000) North Atlantic Oscillation/annular mode: Two paradigms–one phenomenon.
QJR Meteorol Soc 126:791–805
Wallace JM, Dickinson RE (1972) Empirical orthogonal representation of time series in the
frequency domain. Part I: Theoretical consideration. J Appl Meteor 11:887–892
Wallace JM (1972) Empirical orthogonal representation of time series in the frequency domain.
Part II: Application to the study of tropical wave disturbances. J Appl Meteor 11:893–900
Wallace JM, Gutzler DS (1981) Teleconnections in the geopotential height field during the
Northern Hemisphere winter. Mon Wea Rev 109:784–812
Wallace JM, Smith C, Bretherton CS (1992) Singular value decomposition of wintertime sea
surface temperature and 500-mb height anomalies. J Climate 5:561–576
Wallace JM, Thompson DWJ (2002) The Pacific Center of Action of the Northern Hemisphere
annular mode: Real or artifact? J Climate 15:1987–1991
Walsh JE, Richman MB (1981) Seasonality in the associations between surface temperatures over
the United States and the North Pacific Ocean. Mon Wea Rev 109:767–783
Wan EA (1994) Time series prediction by using a connectionist network with internal delay lines.
In: Weigend AS, Gershenfeld NA (eds) Time series prediction: forecasting the future and
understanding the past. Addison-Wesley, Boston, MA, pp 195–217
Wang D, Arapostathis A, Wilke CO, Markey MK (2012) Principal-oscillation-pattern analysis of gene expression. PLoS ONE 7(7):1–10. https://doi.org/10.1371/journal.pone.0028805
Wang Y-H, Magnusdottir G, Stern H, Tian X, Yu Y (2014) Uncertainty estimates of the EOF-
derived North Atlantic Oscillation. J Climate 27:1290–1301
Wang D-P, Mooers CNK (1977) Long coastal-trapped waves off the west coast of the United States, summer 1973. J Phys Oceanogr 7:856–864
Wang XL, Zwiers F (1999) Interannual variability of precipitation in an ensemble of AMIP climate
simulations conducted with the CCC GCM2. J Climate 12:1322–1335
Watkins DS (2007) The matrix eigenvalue problem: GR and Krylov subspace methods. SIAM,
Philadelphia
Watt J, Borhani R, Katsaggelos AK (2020) Machine learning refined: foundation, algorithms and
applications, 2nd edn. Cambridge University Press, Cambridge, 574 p
Weare BC, Nasstrom JS (1982) Examples of extended empirical orthogonal function analysis. Mon
Wea Rev 110:481–485
Wegman E (1990) Hyperdimensional data analysis using parallel coordinates. J Am Stat Assoc
78:310–322
Wei WWS (2019) Multivariate time series analysis and applications. Wiley, Oxford, 518 p
Weideman JAC (1995) Computing the Hilbert transform on the real line. Math Comput 64:745–762
Weyn JA, Durran DR, Caruana R (2019) Can machines learn to predict weather? Using deep learning to predict gridded 500-hPa geopotential height from historical weather data. J Adv Model Earth Syst 11:2680–2693. https://doi.org/10.1029/2019MS001705
Werbos PJ (1990) Backpropagation through time: What it does and how to do it. Proc IEEE,
78:1550–1560
Whittle P (1951) Hypothesis testing in time series. Almqvist and Wicksell, Uppsala
Whittle P (1953a) The analysis of multiple stationary time series. J Roy Statist Soc B 15:125–139
Whittle P (1953b) Estimation and information in stationary time series. Ark Math 2:423–434
Whittle P (1983) Prediction and regulation by linear least-square methods, 2nd edn. University of
Minnesota, Minneapolis
Widrow B, Stearns PN (1985) Adaptive signal processing. Prentice-Hall, Englewood Cliffs, NJ
Wikle CK (2004) Spatio-temporal methods in climatology. In: El-Shaarawi AH, Jureckova J (eds)
UNESCO encyclopedia of life support systems (EOLSS). EOLSS Publishers, Oxford, UK.
Available: https://pdfs.semanticscholar.org/e11f/f4c7986840caf112541282990682f7896199.
pdf
Wiener N, Masani P (1957) The prediction theory of multivariate stochastic processes, I. Acta Math 98:111–150
Wiener N, Masani P (1958) The prediction theory of multivariate stochastic processes, II. Acta Math 99:93–137
Wilkinson JH (1988) The algebraic eigenvalue problem. Clarendon Oxford Science Publications,
Oxford
Wilks DS (2011) Statistical methods in the atmospheric sciences. Academic Press, San Diego
Williams MO, Kevrekidis IG, Rowley CW (2015) A data-driven approximation of the Koopman
operator: extending dynamic mode decomposition. J Nonlin Sci 25:1307–1346
Wiskott L, Sejnowski TJ (2002) Slow feature analysis: unsupervised learning of invariances.
Neural Comput 14:715–770
Wise J (1955) The autocorrelation function and the spectral density function. Biometrika 42:151–
159
Woollings T, Hannachi A, Hoskins BJ, Turner A (2010) A regime view of the North Atlantic
Oscillation and its response to anthropogenic forcing. J Climate 23:1291–1307
Wright RM, Switzer P (1971) Numerical classification applied to certain Jamaican eocene nummulitids. Math Geol 3:297–311
Wunsch C (2003) The spectral description of climate change including the 100 ky energy. Clim
Dyn 20:353–363
Wu C-J (1996) Large optimal truncated low-dimensional dynamical systems. Discr Cont Dyn Syst
2:559–583
Xinhua C, Dunkerton TJ (1995) Orthogonal rotation of spatial patterns derived from singular value
decomposition analysis. J Climate 8:2631–2643
Xu J-S (1993) The joint modes of the coupled atmosphere-ocean system observed from 1967 to
1986. J Climate 6:816–838
Xue Y, Cane MA, Zebiak SE, Blumenthal MB (1994) On the prediction of ENSO: A study with a
low order Markov model. Tellus 46A:512–540
Young GA, Smith RL (2005) Essentials of statistical inference. Cambridge University Press, New
York, 226 p. ISBN-10: 0-521-54866-7
Young FW (1987) Multidimensional scaling: history, theory and applications. Lawrence Erlbaum,
Hillsdale, New Jersey
Young FW, Hamer RM (1994) Theory and applications of multidimensional scaling. Eribaum
Associates, Hillsdale, NJ
Young G, Householder AS (1938) Discussion of a set of points in terms of their mutual distances.
Psychometrika 3:19–22
Young G, Householder AS (1941) A note on multidimensional psycho-physical analysis. Psy-
chometrika 6:331–333
Yu Z-P, Chu P-S, Schroeder T (1997) Predictive skills of seasonal to annual rainfall variations
in the U.S. Affiliated Pacific Islands: Canonical correlation analysis and multivariate principal
component regression approaches. J Climate 10:2586–2599
Zveryaev II, Hannachi AA (2012) Interannual variability of Mediterranean evaporation and its
relation to regional climate. Clim Dyn. https://doi.org/10.1007/s00382-011-1218-7
Zveryaev II, Hannachi AA (2016) Interdecadal changes in the links between Mediterranean
evaporation and regional atmospheric dynamics during extended cold season. Int J Climatol.
https://doi.org/10.1002/joc.4779
Zeleny M (1987) Management support systems: towards integrated knowledge management. Human Syst Manag 7:59–70
Zhang G, Patuwo BE, Hu MY (1997) Forecasting with artificial neural networks: The state of the
art. Int J Forecast 14:35–62
Zhu Y, Shasha D (2002) Statstream: Statistical monitoring of thousands of data streams in real time.
In: VLDB, pp 358–369. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.19.8732
Index

A Artificial intelligence, 415


Absolutely summable, 157 Artificial neural network, 416
Activation, 419 Asimov, D., 12
Activation function, 416, 425 Aspect ratio, 167
Activation level, 423 Assimilation, 2
Active phase, 214 Asymptotically unbiased, 256
Adaptive filter, 150 Asymptotic approximation, 45
Adjacancy matrix, 67, 169, 305 Asymptotic distribution, 228
Adjoint, 344 Asymptotic limit, 88
Adjoint operator, 422 Asymptotic uncertainty, 46
Adjoint patterns, 134 Atlantic Multidecadal Oscillation (AMO), 293
Adjoint vector, 134 Atlantic Niño, 293
Advection, 118 Atmosphere-Land-Ocean-Ice system, 2
African jet, 164 Atmospheric models, 119
Agulhas currents, 405 Attractors, 146–148, 295
Air temperature, 199 Augmenting function, 259
Algebraic topology, 403 Autocorelation function, 176
Altimetry data, 374 Autocorrelation, 49, 50
Amplitude, 96 Autocorrelation function, 48, 152, 172, 481
Analytic functions, 102 Autocovariance function, 26, 27, 98, 156, 483
Analytic signal, 102 Autoregression matrix, 119
Angular momentum, 101 Autoregressive model, 125, 311, 425, 486, 546
Annual cycle, 160 Autoregressive moving-average (ARMA)
Annular mode, 184 processes, 48
Anthropogenic, 3 Auxiliary matrix, 152
APHRODITE, 216 Average information content, 244
Approximation theorem, 416 Average predictability, 183
Archetypal analysis, 55, 397 Average predictability pattern (APP), 183
Arctic-like oscillation, 233
Arctic Oscillation (AO), 42, 56, 180, 285, 288,
332, 387 B
ARMA process, 154 Back fitting algorithm, 258
Arnoldi algorithm, 40, 138 Background noise, 149
Arnoldi method, 518 Backpropagation, 422, 425, 426
AR(1) process, 51 Back-transformation, 381

© Springer Nature Switzerland AG 2021 583


A. Hannachi, Patterns Identification and Data Mining in Weather and Climate,
Springer Atmospheric Sciences, https://doi.org/10.1007/978-3-030-67073-3
584 Index

Band pass filter, 108 Canonical correlation, 339


Bandwidth, 256 Canonical correlation analysis (CCA), 4, 14
Baroclinic structures, 125 Canonical correlation patterns, 341
Baroclinic waves, 135 Canonical covariance analysis (CCOVA), 344
Barotropic models, 56 Canonical covariance pattern, 345
Barotropic quasi-geostrophic model, 141 Canonical variates, 254, 339
Barotropic vorticity, 440 Caotic time series, 48
Bartlett’s factor score, 229 Carbone dating, 2
Basis functions, 35, 320, 357 Categorical data, 203
Bayesian framework, 411 Categorical predictors, 435
Bayes theorem, 470 Cauchy distribution, 424
Bernoulli distribution, 475 Cauchy principal value, 101
Beta-plane, 142 Cauchy sequence, 538
Between-groups sums-of-squares, 254 Cauchy’s integral formula, 101
Betweenness, 68 Causal interactions, 68
Bias, 419 Caveats, 168
Bias parameter, 421, 423 Centered, 256
Bias-variance trade-off, 47 Centering, 23
Biennial cycles, 132 Centering operator, 341
Bimodal, 259 Central England temperature, 149
Bimodal behaviour, 440 Central limit theorem (CLT), 45, 247, 277
Bimodality, 213, 259, 311, 314 Centroid, 167, 205
Binary split, 435 Chaotic attractor, 147
Binomial distribution, 475–476 Chaotic system, 147
Bi-orthogonality, 178 Characteristic multipliers, 546
Bi-quartimin criterion, 231 Chernoff, H., 17
Bivariate cumulant, 250 Chi-square, 342, 385, 436
Blind deconvolution, 267, 271 Chi-squared distance, 253
Blind separation, 4 Chi-square distribution, 478
Blind source separation (BSS), 266, 268 Cholesky decomposition, 320, 397
Bloch’s wave, 373 Cholesky factorisation, 505
Blocked flow, 313 Circular, 154
Bloc-Toeplitz, 159 Circular autocorrelation matrix, 154
Bock’s procedure, 254 Circular covariance matrix, 154
Boltzmann H-function, 245, 271 Circulation patterns, 445
Bootstrap, 45, 46, 49, 437 Circulation regime, 314
Bootstrap blocks, 49 Classical MDS, 206
Botstrap, 47 Classical scaling, 207–209
Botstrap resampling, 48 Classical scaling problem, 204
Boundary condition, 329 Classification, 3, 434
Boundary currents, 405 Climate analysis, 16
Box plots, 17 Climate change, 442
Branch, 434 Climate change signal, 184
Break phases, 214 Climate dynamics, 3
Broadband, 115 Climate extremes, 444, 445
Broad-banded waves, 114 Climate forecast system (CFS), 443
B-slpines, 322, 326 Climate Modeling Intercomparison Project
Bubble, 430 (CMIP), 368
Burg, 196 Climate models, 2, 445
Climate modes, 3
Climate networks, 67, 68, 169, 281
C Climate prediction, 440
Calculus of variations, 273 Climatic extreme events, 68
Canberra, 203 Climatic sub-processes, 68
Index 585

Climatological covariance, 183 Conjugate information, 136


Climatology, 23, 160 Connection, 68
Climbing algorithms, 247 Constrained minimization, 76, 530
Closeness, 68 Contingency table, 203
Clouds, 243 Continuous AR(1), 222
Cluster analysis, 254 Continuous CCA, 352
Clustering, 4, 260, 397, 416, 429 Continuous curves, 321
Clustering index, 254 Continuous predictor, 435
Clustering techniques, 242 Continuum, 5
Clusters, 243, 253 Continuum power regression, 388
CMIP5, 369, 383 Convective processes, 118
CMIP models, 387 Convex combination, 398
Cocktail-party problem, 268 Convex hull, 397, 398
Codebook, 430 Convex least square problems, 398, 400
Co-decorrelation time matrix, 177 Convex set, 398
Coding principle, 280 Convolution, 26, 38, 107, 266
Coherence, 492 Convolutional, 422
Coherent structures, 33, 296 Convolutional linear (integral) operator, 169
Cokurtosis, 293 Convolutional NN, 440, 442
Colinearity, 82 Convolving fIlter, 267
Combined field, 338 Coriolis parameter, 310
Common EOF analysis, 368 Correlation, 16
Common EOFs, 383 Correlation coefficient, 16, 21, 193
Common factors, 220, 221 Correlation integral, 178
Common PC, 383 Correlation matrix, 24, 61
Communality, 222 Coskewness, 293
Communities, 306 Co-spectrum, 29, 114
Compact support, 173 Costfunction, 402, 435
Competitive process, 431 Coupled, 2
Complex conjugate, 26, 29, 106, 187, 191 Coupled pattern, 33, 337
Complex conjugate operator, 96 Coupled univariate AR(1), 138
Complex covariance matrix, 95 Covariability, 201, 338
Complex data matrix, 97 Covariance, 16, 21, 24
Complex EOFs (CEOFs), 4, 13, 94, 95 Covariance function, 36
Complex frequency domain, 97 Covariance matrix, 24, 25, 40, 53, 54, 58, 61,
Complexified fileds, 35, 95, 96 88
Complexified multivariate signal, 106 Covariance matrix spectra, 49
Complexified signal, 106 Covariance matrix spectrum, 57
Complexified time series, 99, 100 Critical region, 229
Complexity, 1–3, 11, 55, 67, 241, 265 Cross-correlations, 93
Complex network, 169 Cross-covariance function, 27
Complex nonlinear dynamical system, 2 Cross-covariance matrix, 106, 134, 137, 339,
Complex principal components, 94, 96 341, 392
Composition operator, 138 Cross-entropy, 417
Comprehensive Ocean-Atmosphere Data Set Cross-spectra, 98, 106
(COADS), 169 Cross-spectral analysis, 93
Conditional distribution, 223 Cross-spectral covariances, 94
Conditional expectations, 189, 228 Cross-spectral matrix, 97
Conditional probability, 129, 470 Cross-spectrum, 28, 100, 497
Condition number, 42 Cross-spectrum matrix, 28, 29, 98, 105, 113,
Confidence interval, 109 114
Confidence limits, 41, 46, 58 Cross-validation (CV), 45–47, 238, 325, 330,
Conjugate directions, 525 362, 437, 454, 458
Conjugate gradient, 422, 528 Cross-validation score, 391
586 Index

Cubic convergence, 283 Delay operator, 485


Cubic spline, 19, 453, 454 Delays, 148
Cumulant, 249, 250, 267, 275, 472–473 Delay space, 149
Cumulative distribution function, 20, 251, 302, Dendrogram, 448
419, 471 Descent algorithm, 83, 187, 209, 331, 526
Currents, 405 Descent numerical algorithm, 236
Curse of dimensionality, 9 Descriptive data mining, 3
Curvilinear coordinate, 321 Determinant, 500
Curvilinear trajectory, 85 Determinantal equation, 153
Cyclone frequencies, 61 Deterministic, 175
Cyclo-stationarity, 132, 373 Detrending, 17
Cyclo-stationary, 100, 125, 132, 370 Diagonalisable, 40, 66, 96
Cyclo-stationary EOFs, 367, 372 Diagonal matrix, 40
Cyclo-stationary processes, 372 Dicholomus search, 522–523
Differenced data, 58
Differencing operator, 58
D Differentiable function, 209
Damp, 164 Differential entropy, 244, 271
Damped oscillators, 138 Differential manifold, 400
Damped system, 135 Differential operator, 464
Damping, 56 Diffierentiable, 242
Damping times, 138 Diffusion, 56, 310
Data-adaptive harmonic decomposition Diffusion map, 304
(DAHD), 169 Diffusion process, 56–58
Data analysis, 11 Digital filter, 124
Data assimilation, 426 Dimensionality reduction, 138, 429
Database, 3 Dimension reduction, 3, 11, 38
Data image, 448 Dirac delta function, 27, 103
Data-Information-Knowledge-Wisdom Direct product, 502
(DIKW), 10 Discontinuous spectrum, 380
Data mapping, 11 Discrepancy measures, 259
Data matrix, 22, 25, 36, 38, 41, 42, 45, 58, 61, Discrete categories, 433
63, 158 Discrete fourier transform, 104, 107
Data mining, 3, 4 Discretised Laplacian, 362
Data space, 223 Discriminant analysis, 254, 295
Davidson–Fletcher–Powell, 529 Discriminant function, 423
Dawson’s integral, 103 Disjoint, 469
Deaseasonalised, 17 Disorder, 244
Decadal modes, 317 Dispersive, 114, 164
Decay phase, 164 Dispersive waves, 115
Decision making, 4 Dissimilarities, 202
Decision node, 434 Dissimilarity matrix, 202, 205, 448
Decision trees, 433 Distance matrix, 207
Deconvolution, 4, 266, 267 Distortion errors, 210
Decorrelating matrix, 276 Distribution ellipsoid, 63
Decorrelation time, 171, 172, 176, 178, 180, 241 Domain dependence, 71
Double centered dissimilarity matrix, 206
Degeneracy, 168, 169 Double diagonal operator, 401
Degenerate, 91, 110, 152, 165, 168 Doubly periodic, 373
Degenerate eigenvalue, 155 Downscaling, 38, 374, 445, 451
Degrees of freedom (dof), 2, 11, 50 Downward propagating signal, 93, 110
Delay coordinates, 148, 167, 317 Downward signal propagating, 113
Delayed vector, 158 Downwelling current patterns, 447
Delay embedding, 169 Dual form, 396
Duality, 176 Entropy index, 248, 250, 255
Dynamical mode decomposition (DMD), 138 Envelope, 103, 397
Dynamical reconstruction, 147, 157 EOF rotation, 55
Dynamical systems, 86, 88, 118, 138, 146, 147, 169 Epanechnikov kernel, 248
Equiprobable, 9
ERA-40 reanalyses, 167, 263
ERA-40 zonal mean zonal wind, 113
E Error covariance matrix, 175, 189
Earth System Model, 369 E-step, 227
East Atlantic pattern, 345 Euclidean distance, 203
Easterly, 93 European Centre for Medium Range Weather Forecasting (ECMWF), 92
Eastward propagation, 125
Eastward shift, 49 European Re-analyses (ERA-40), 92, 212
East-west dipolar structure, 345 Explained variance, 109
ECMWF analyses, 125 Expansion coefficients, 38
Edgeworth expansion, 275 Expansion functions, 35
Edgeworth polynomial expansion, 250 Expectation (E), 227
Effective number of d.o.f, 53 Expectation maximisation (EM), 226
Effective numbers of spatial d.o.f, 53 Expectation operator, 251, 262
Effective sample size, 46, 50 Explained variance, 41, 53, 238
Efficiency, 247 Exploratory data analysis (EDA), 3
E-folding, 138 Exploratory factor analysis (EFA), 233, 239, 290
E-folding time, 50, 51, 123, 127, 173, 310
Eigenanalysis, 35 Exponential distribution, 477–478
Eigenmode, 111 Exponential family, 402
Eigenspectrum, 96, 382 Exponential smoothing, 18
Eigenvalue, 37, 39, 503 Exponential smoothing filter, 27
Eigenvalue problems, 34, 134 Extended EOFs, 35, 94, 139, 146, 316
Eigenvector, 39, 503 Extremes, 397, 398
Ekman dissipation, 310
Elbow, 405
Ellipsoid of the distribution, 58 F
Ellipsoids, 62 Factor analysis (FA), 4, 12, 46, 219, 224
Elliptical, 311 Factor loading matrix, 234
Elliptical distributions, 61 Factor loading patterns, 233
Elliptically contoured distributions, 375 Factor loadings, 220, 221, 230, 237
Elliptical region, 65 Factor model, 223, 228, 233, 238
Ellipticity, 375 Factor model parameters, 513–515
El-Niño, 33, 55, 157, 293, 405 Factor rotation, 73
El-Niño Southern Oscillation (ENSO), 2, 11, 33, 132, 391, 412 Factors, 219, 223
Factor scores, 229, 237
EM algorithm, 238 Factor-scores matrix, 220
Embeddings, 148, 151, 202, 204, 211 Fastest growing modes, 135
Embedding space, 148 FastICA, 282, 283
Empirical distribution function (edf), 20, 21 FastICA algorithm, 282
Empirical orthogonal functions (EOFs), 13, 22, 34, 38 Fat spectrum, 149
Feasible set, 86
Empirical orthogonal teleconnection (EOT), 67 Feature analysis, 199
Emptiness paradox, 9 Feature extraction, 3, 11, 416
Emptiness, 243 Feature space, 296, 297, 306, 392, 422
Empty space phenomena, 243 Feedback matrix, 119, 121, 122, 130, 146
Empty space phenomenon, 6, 9, 10 Filter, 174
Energy, 456, 463 Filtered data matrix, 63
Entropy, 196, 232, 243, 244, 271, 278, 436 Filtered time series, 99
Filtering, 112, 166, 342 Fröbenius norm, 217, 230, 285, 391, 398
Filtering problem, 177 Fröbenius product, 230
Filter matrix, 276 Fröbenius structure, 139
Filter patterns, 178 Full model, 228
Finear filter, 26 Full rank, 54, 187
Finite difference scheme, 361 Funcional EOFs, 321
First-order auto-regressive model, 56 Functional analysis, 300
First-order Markov model, 179 Functional CCA, 353
First-order optimality condition, 386 Functional EOF, 319
First-order spatial autoregressive process, 56 Functional PCs, 322
First-order system, 135 Fundamental matrix, 546
Fisher information, 243, 245
Fisher’s linear discrimination function, 254
Fisher–Snedecor distribution, 479 G
Fitted model, 228 Gain, 27
Fixed point, 394 Gamma distribution, 478
Fletcher-Powell method, 226 Gaussian, 19, 63, 129, 192, 243
Fletcher–Reeves, 528 Gaussian grid, 44, 328
Floquet theory, 546 Gaussianity, 375
Floyd’s algorithm, 211 Gaussian kernel, 301, 393
Fluctuation-dissipation relation, 129 Gaussian mixture, 214
Forecast, 172, 185 Gaussian noise, 221
Forecastability, 172, 197 General circulation models (GCMs), 387
Forecastable component analysis (ForeCA), 196 Generalised AR(1), 139
Forecastable patterns, 196 Generalised eigenvalue problem, 61, 66, 177, 324, 327, 357, 361, 396
Forecasting, 38, 416, 422 Generalised inverse, 501
Forecasting accuracy, 130 Generalised scalar product, 190
Forecasting models, 179 Generating kernels, 300
Forecasting uncertainty, 442 Geodesic distances, 211
Forecast models, 179 Geometric constraints, 70, 71
Forecast skill, 185 Geometric moments, 167
Forward-backward, 421 Geometric properties, 72
Forward stepwise procedure, 257 Geometric sense, 62
Fourier analysis, 17 Geopotential height, 62, 66, 68, 125, 180, 188, 260, 262, 382, 445
Fourier decomposition, 107
Fourier series, 104, 373 Geopotential height anomalies, 181
Fourier spectrum, 103 Geopotential height re-analyses, 157
Fourier transform (FT), 27, 48, 99, 102, 125, 176, 183, 187, 192, 267, 494 Gibbs inequality, 273
Gini index, 435
Fourth order cumulant, 170 Global scaling, 209
Fourth order moment, 75 Global temperature, 284
Fractal dimensions, 149 Golden section, 523
Fredholm eigen problem, 37 Goodness-of-fit, 209, 259
Fredholm equation, 320 Gradient, 85, 242, 386
Fredholm homogeneous integral equation, 359 Gradient ascent, 282
Frequency-band, 97 Gradient-based algorithms, 283
Frequency domain, 94, 97, 176 Gradient-based approaches, 526
Frequency response function, 29, 104, 108, 189, 191, 267 Gradient-based method, 268
Gradient methods, 247
Frequency-time, 103 Gradient optimisation algorithms, 256
Friction, 93 Gradient types algorithms, 282
Friedman’s index, 251 Gram matrix, 301
Fröbenius matrix norm, 233 Gram-Schmidt orthogonalization, 397
Grand covariance matrix, 160, 165, 338 Hybrid, 185
Grand data matrix, 161 Hyperbolic tangent, 421, 440
Grand tour, 12 Hypercube, 6, 8
Green function, 119 Hyperspaces, 241
Greenhouse, 450 Hypersphere, 5, 7
Greenland blocking, 146 Hypersurface, 86, 296
Green’s function, 129, 457, 464 Hypervolumes, 6, 243
Growing phase, 135 Hypothesis of independence, 45
Growth phase, 164
Growth rates, 127, 138
Gulf Stream, 405 I
Gyres, 405 ICA rotation, 286
Ill-posed, 344
Impulse response function, 27
H Independence, 266, 470
Hadamard, 285, 401 Independent and identically distributed (IID), 50, 224
Hadamard product, 328, 502
HadCRUT2, 183 Independent component analysis (ICA), 55, 266
HadGEM2-ES, 369
HadISST, 332 Independent components, 63, 268
Hadley Centre ice and sea surface temperature (HadISST), 290 Independent principal components, 286
Independent sample size, 50
Hamiltonian systems, 136 Independent sources, 265, 293
Hankel matrix, 169 Indeterminacy, 352
Heavy tailed distributions, 252 India monsoon rainfall, 66
Hellinger distance, 259 Indian Ocean dipole (IOD), 58, 412
Henderson filter, 18 Indian Ocean SST anomalies, 58
Hermite polynomials, 249, 252, 410 Inference, 3
Hermitian, 29, 96, 98, 137, 492, 502 Infomax, 280, 282, 283
Hermitian covariance matrix, 106, 109 Info-max approach, 278
Hermitian matrix, 109 Information, 244
Hessian matrix, 526 Information capacity, 280
Hexagonal, 430 Information-theoretic approaches, 270
Hidden, 4 Information theory, 243, 244, 271
Hidden dimension, 12 Initial condition, 140
Hidden factors, 269 Initial random configurations, 209
Hidden variables, 220 Inner product, 36, 401, 536
Hierarchical clustering, 448 Insolation, 199
High dimensionality, 3, 11 Instability, 71
Higher-order cumulants, 284 Instantaneous frequency, 102, 112, 113
Higher order moments, 266, 267 Integrability, 190, 191
High-order singular value decomposition, 293 Integrable functions, 299
Hilbert canonical correlation analysis (HCCA), 137 Integral operator, 299, 300
Integrated power, 177
Hilbert EOFs, 95, 97, 100, 109, 113, 161 Integro-differential equations, 320, 326, 357, 359
Hilbert filter, 102
Hilbert PC, 113 Integro-differential operator, 328
Hilbert POPs (HPOPs), 136 Integro-differential system, 360
Hilbert singular decomposition, 101 Interacting molecules, 1
Hilbert space, 36, 138, 190, 344 Interacting space/time scales, 2
Hilbert transform, 97, 101, 102, 105, 107, 109, 136, 145 Inter-dependencies, 67, 69
Interesting, 242, 243
Homogeneous diffusion processes, 56 Interesting features, 15
Hovmoller diagram, 31 Interestingness, 243–245
Interesting structures, 243, 264 Kelvin wave, 100, 162
Intergovernmental Panel for Climate Change (IPCC), 183 Kernel, 169, 300, 465
Kernel CCA, 395
Intermediate water, 322 Kernel density estimate, 280
Interpoint distance, 201, 368, 430 Kernel density estimation, 255
Interpoint distance matrix, 449 Kernel EOF, 297
Interpolated, 17 Kernel function, 19, 37, 299
Interpolation, 189 Kernel matrix, 301, 397
Interpolation covariance matrix, 194 Kernel MCA, 392
Interpolation error, 190, 241 Kernel methods, 280
Interpolation filter, 190 Kernel PCA, 297
Interpretation, 11 Kernel PDF, 259
Inter-tropical convergence zone (ITCZ), 147 Kernel POPs, 317
Intraseasonal time scale, 132 Kernel smoother, 256, 260, 280, 465
Intrinsic mode of variability, 58 Kernel smoothing, 19
Invariance, 72 Kernel transformation, 301
Invariance principle, 237 Kernel trick, 299
Inverse Fourier transform, 48 k-fold CV, 47
Inverse mapping, 307 Khatri-Rao, 291
Invertibility, 191 K-L divergence, 273, 274
Invertible, 395 k-means, 254, 305, 397
Invertible linear transformation, 121 k-means clustering, 399
Irish precipitation, 391 k-nearest neighbors, 212
ISOMAP, 211 Kohonen network, 429
Isomap, 212, 411 Kolmogorov formula, 175
Isopycnal, 322 Kolmogorov-Smirnov distance, 253
Isotropic, 105, 283 Kolmogorov-Wiener approach, 174
Isotropic kernels, 393 Koopman operator, 138
Isotropic turbulence, 53 Kriging, 464
Isotropic uniqueness, 237 Kronecker matrix product, 291
Isotropy, 253 Kronecker symbol, 133
Iterative methods, 43 Krylov method, 138
Iterative process, 260 Krylov subspace, 40, 43
Kullback-Leibler distance, 259
Kullback-Leibler (K-L) divergence, 272, 277
J Kuroshio, 405
Jacobian, 272, 278, 283 Kuroshio current, 317
Jacobian operator, 310 Kurtosis, 53, 170, 259, 264, 267, 281, 282
JADE, 284
Japanese reanalyses, 314
Japan Meteorological Agency, 383 L
Johnson-Lindenstrauss Lemma, 368–370 Lag-1 autocorrelations, 379
Joint distribution, 223 Lagged autocorrelation matrix, 150
Joint entropy, 280 Lagged autocovariance, 50
Joint probability density, 269 Lagged covariance matrix, 94, 98, 99, 159, 175, 180
Joint probability density function, 473
JRA-55, 314 Lagrange function, 516
Jump in the spectrum, 387 Lagrange multiplier, 39, 76, 255, 273, 283, 340, 395
Lagrangian, 77, 331, 396
K Lagrangian function, 327
Karhunen–Loéve decomposition, 155 Lagrangian method, 532
Karhunen–Loéve equation, 373 Lagrangian multipliers, 358
Karhunen–Loéve expansion, 36, 37, 91 Lanczos, 40, 42
Lanczos method, 517–518 Local averages, 11
La-Niña, 33, 405 Local concepts, 11
Laplace probability density function, 270 Localized kernel, 302
Laplace-Beltrami differential operator, 304 Local linear embedding, 211
Laplacian, 305, 310, 360, 361 Logistic, 419
Laplacian matrix, 304, 317 Logistic function, 279, 280
Laplacian operator, 327, 361 Logistic regression, 417
Laplacian spectral analysis, 317 Log-likelihood, 224, 238
Large scale atmosphere, 264 Long-memory, 173, 179, 180
Large scale atmospheric flow, 295 Long short-term memory (LSTM), 440
Large scale flow, 384 Long-term statistics, 1
Large scale processes, 167 Long term trends, 180
Large scale teleconnections, 445 Lorenz, E.N., 147
Latent, 220 Lorenz model, 440
Latent heat fluxes, 383 Lorenz system, 157
Latent patterns, 4 Low frequency, 35, 180
Latent space, 223 Low-frequency modes, 184
Latent variable, 11, 12, 56 Low-frequency patterns, 194, 446
Lattice, 431 Low-frequency persistent components, 199
Leading mode, 44 Low-level cloud, 445
Leaf, 434 Low-level Somali jet, 212
Learning, 416, 425 Low-order chaotic models, 310
Least square, 82 Low-order chaotic systems, 148
Least Absolute Shrinkage and Selection Operator (LASSO), 82 Lyapunov equation, 520
Lyapunov function, 263
Least square, 64, 165, 174
Least squares regression, 388
Leave-one-out CV, 47 M
Leave-one-out procedure, 391 Machine learning, 3, 415
Legendre polynomials, 251 Madden-Julian oscillation (MJO), 91, 132, 146, 164, 184
Leptokurtic, 270, 473
Likelihood, 385 Mahalanobis distance, 459
Likelihood function, 385 Mahalanobis metrics, 203
Likelihood ratio statistic, 228 Mahalanobis signal, 183
Lilliefors test, 59 Manifold, 86, 211, 223, 295, 303
Linear convergence, 283 Map, 22
Linear discriminant analysis, 296 Marginal density function, 269
Linear filter, 29, 102, 189, 210, 266, 267 Marginal distribution, 223
Linear growth, 125 Marginal pdfs, 473
Linear integral operator, 299 Marginal probability density, 279, 280
Linear inverse modeling (LIM), 129 Marginal probability density functions, 65
Linearisation, 135 Markov chains, 305
Linearised physical models, 71 Markovian time series, 173
Linear operator, 26, 500 Markov process, 118
Linear programming, 521 Matching unit, 430
Linear projection, 241, 243 Mathematical programming, 531
Linear space, 295 Matlab, 23, 24, 26, 40, 77, 109, 161, 345, 377
Linear superposition, 55 Matrix derivative, 506–512
Linear transformation, 268 Matrix inversion, 342
Linear trend, 193 Matrix norm, 216
Linkage, 448 Matrix of coordinates, 205
Loading coefficients, 54 Matrix optimisation, 398
Loadings, 38, 54, 74, 75, 223 Matrix optimisation problem, 63
Loading vectors, 372 Maximization, 227
Maximization problem, 74 MPI-ESM-MR, 369
Maximum covariance analysis (MCA), 337, 344 M-step, 227
Multichannel SSA (MSSA), 157
Maximum entropy, 274 Multi-collinearity, 343
Maximum likelihood, 46, 62 Multidimensional scaling (MDS), 4, 201, 242, 254
Maximum likelihood method (MLE), 224
Maximum variance, 38, 40 Multilayer perceptron (MLP), 423
Mean sea level, 383 Multimodal, 243
Mean square error, 37, 47 Multimodal data, 249
Mediterranean evaporation, 66, 345 Multinormal, 58
Mercer kernel, 37 Multinormal distribution, 130
Mercer’s theorem, 299 Multinormality, 46
Meridional, 94 Multiplicative decomposition, 259
Meridional overturning circulation, 199 Multiplicity, 154
Mesokurtic, 473 Multiquadratic, 424
Metric, 202 Multispectral images, 256
Mid-tropospheric level, 68 Multivariate filtering problem, 29
Minimum-square error, 175 Multivariate Gaussian distribution, 8
Minkowski distance, 202 Multivariate normal, 62, 245
Minkowski norm, 217 Multivariate normal distribution, 9
Mis-fit, 458 Multivariate normal IID, 362
Mixed-layer, 322 Multivariate POPs, 138
Mixing, 42, 55, 412 Multivariate random variable, 23, 24
Mixing matrix, 268, 269, 276 Multivariate spectrum matrix, 199
Mixing problem, 55, 284, 375 Multivariate t-distribution, 61
Mixing property, 375 Multivariate AR(1), 219
Mixture, 268 Mutual information, 64, 273–275, 278
Mixture model, 238 Mutually exclusive, 469
Mode analysis, 64
Model evaluation, 384
Model simulations, 2 N
Mode mixing, 373 Narrow band, 103
Modes, 56 Narrow band pass filter, 98
Modes of variability, 38 Narrow frequency, 103
Modularity, 306 National Center for Environmental Prediction (NCEP), 383
Modularity matrix, 306 National Oceanic and Atmospheric Administration (NOAA), 411
Moisture, 164, 445
Moment matching, 53
Moments, 250, 259, 269 N-body problem, 460
Momentum, 136 NCEP-NCAR reanalysis, 68, 146, 233, 260, 446
Monomials, 299
Monotone regression, 208 NCEP/NCAR, 31, 184, 310
Monotonicity, 430 Negentropy, 245, 259, 274, 277, 282, 284
Monotonic order, 385 Neighborhood graph, 212
Monotonic transformation, 210, 375 Nested period, 374
Monsoon, 114, 345 Nested sets, 448
Monte Carlo, 45, 46, 48, 343, 379, 525 Nested sigmoid architecture, 426
Monte Carlo approach, 260 Networks, 67, 68
Monte-Carlo bootstrap, 49 Neural-based algorithms, 283
Monthly mean SLP, 77 Neural network-based, 284
Moore–Penrose inverse, 501 Neural networks (NNs), 276, 278, 302, 415, 416
Most predictable patterns, 185
Moving average, 18, 150 Neurons, 419
Moving average filter, 27 Newton algorithm, 283
Newton–Raphson, 529 Null space, 137
Newton–Raphson method, 527 Nyquist frequency, 114, 494
Noise, 118
Noise background, 149
Noise covariance, 139
Noise floor, 382 O
Noise-free dynamics, 124 Objective function, 84
Noise variability, 53 Oblimax, 232
Non-alternating algorithm, 400 Oblimin, 231
Nondegeneracy, 205 Oblique, 74, 81, 230
Nondifferentiable, 83 Oblique manifold, 401
Non-Gaussian, 267 Oblique rotation, 76, 77, 231
Non-Gaussian factor analysis, 269 Occam’s rasor, 13
Non-Gaussianity, 53, 264 Ocean circulation, 445
Non-integer power, 389 Ocean current forecasting, 445
Nonlinear, 296 Ocean current patterns, 446
Nonlinear association, 12 Ocean currents, 101
Nonlinear dynamical mode (NDM), 410 Ocean fronts, 322
Nonlinear features, 38 Ocean gyres, 170
Nonlinear interactions, 2 Oceanic fronts, 323
Nonlinearity, 3, 12 Ocean temperature, 321
Nonlinear manifold, 213, 247, 295, 410 Ocillating phenomenon, 91
Nonlinear mapping, 299 Offset, 419
Nonlinear MDS, 212 OLR, 161, 162, 164
Nonlinear ow regimes, 49 OLR anomalies, 160
Nonlinear PC analysis, 439 One-mode component analysis, 64
Nonlinear programme, 521 One-step ahead prediction, 175, 185
Nonlinear smoothing, 19 Operator, 118
Nonlinear system of equations, 263 Optimal decorrelation time, 180
Nonlinear units, 422 Optimal interpolation, 411
Nonlocal, 10 Optimal lag between two fields, 349–350
Non-locality, 115 Optimal linear prediction, 179
Non-metric MDS, 208 Optimally interpolated pattern (OIP), 189, 191
Non-metric multidimensional scaling, 430 Optimally persistent pattern (OPP), 176, 178,
Non-negative matrix factorisation (NMF), 403 180, 183, 185, 241
Non-normality, 269, 288 Optimisation algorithms, 521
Non-parametric approach, 280 Optimization criterion, 76
Nonparametric estimation, 277 Order statistics, 271
Non parametric regression, 257, 453 Ordinal MDS, 208
Non-quadratic, 76, 83, 209, 275 Ordinal scaling, 210
Nonsingular affine transformation, 246 Ordinary differential equations (ODEs), 85,
Normal, 502 147, 263, 530, 543
Normal distribution, 245, 477 Orthogonal, 74, 81, 230, 502
Normalisation constraint, 81 Orthogonal complement, 86, 206
Normal matrix, 40 Orthogonalization, 397
Normal modes, 119, 123, 130, 134, 138 Orthogonal rotation, 74, 75, 77
North Atlantic Oscillation (NAO), 2, 9, 31, 33, Orthomax-based criterion, 284
42, 49, 56, 68, 77, 83, 234, 259, 284, Orthonormal eigenfunctions, 37
288, 289, 293, 311, 312, 332, 381, 387, Oscillatory, 122
391, 446 Outgoing long-wave radiation (OLR), 146, 345
North Pacific Gyre Oscillation (NPGO), 293 Outlier, 17, 250, 260, 271
North Pacific Oscillation, 233, 234 Out-of-bag (oob), 437
Null hypothesis, 48, 56, 57, 228, 288, 342 Overfitting, 395
P Polar vortex, 93, 167, 259, 291
Pacific decadal oscillation (PDO), 293, 412 Polynomial equation, 153
Pacific-North American (PNA), 2, 33, 68, 127, 259, 263 Polynomial fitting, 19
Polynomial kernels, 302
Pacific patterns, 83 Polynomially, 296
Pairwise distances, 201 Polynomial transformation, 299
Pairwise similarities, 202 Polytope, 403, 404
Parabolic density function, 248 POP analysis, 219
Paradox, 10 POP model, 179
Parafac model, 291 Positive semi-definite, 177, 206, 207, 216, 502
Parsimony, 13 Posterior distribution, 227
Partial least squares (PLS) regression, 388 Potential vorticity, 309
Partial phase transform, 390 Powell’s algorithms, 525
Partial whitening, 388, 390 Power law, 286
Parzen lagged window, 184 Power spectra, 196, 267
Parzen lag-window, 185 Power spectrum, 48, 100, 110, 156, 172–174, 180, 199, 487
Parzen window, 182
Pattern recognition, 295, 416 Precipitation, 445
Patterns, 3, 22 Precipitation predictability, 440
Pattern simplicity, 72 Predictability, 184, 199
Patterns of variability, 91 Predictable relationships, 54
Pdf estimation, 416 Predictand, 338, 363
Pearson correlation, 281 Prediction, 3, 189, 416, 421
Penalised, 354 Prediction error, 174
Penalised likelihood, 457 Prediction error variance, 174, 185
Penalised objective function, 532 Predictive data mining, 3
Penalized least squares, 344 Predictive Oscillation Patterns (PrOPs), 185
Penalty function, 83, 532 Predictive skill, 170
Perceptron convergence theorem, 417 Predictor, 338, 363
Perfect correlation, 354 Pre-image, 394
Periodic signal, 149, 151, 155, 180 Prewhitening, 268
Periodogram, 180, 187, 192, 495 Principal axes, 58, 63
Permutation, 95, 159, 376 Principal component, 39
Permutation matrix, 160, 170 Principal component analysis (PCA), 4, 13, 33
Persistence, 157, 171, 185, 350, 440 Principal component regression (PCR), 388
Persistent patterns, 142, 172 Principal coordinate analysis, 202
Petrie polygon, 403 Principal coordinate matrix, 206
Phase, 27, 96, 492 Principal coordinates, 206–208, 215
Phase changes, 113 Principal fundamental matrix, 546
Phase functions, 110 Principal interaction pattern (PIP), 119, 139
Phase propagation, 113 Principal oscillation pattern (POP), 15, 94, 95, 119, 126
Phase randomization, 48
Phase relationships, 97, 100 Principal prediction patterns (PPP), 343
Phase shift, 95, 150, 151 Principal predictors, 351
Phase space, 169 Principal predictors analysis (PPA), 338
Phase speeds, 112, 157 Principal regression analysis (PRA), 338
Physical modes, 55, 56 Principal trend analysis (PTA), 199
Piece-wise, 19 Principal component transformation, 54
Planar entropy index, 250 Prior, 223
Planetary waves, 310 Probabilistic archetype analysis, 402
Platykurtic, 271 Probabilistic concepts, 17
Platykurtotic, 473 Probabilistic framework, 45
Poisson distribution, 476 Probabilistic models, 11, 50
Polar decomposition, 217 Probabilistic NNs, 424
Probability, 467 Quadratic trend, 379
Probability-based approach, 149 Quadrature, 91, 110, 114, 149, 150, 155, 167, 192
Probability-based method, 46 Quadrature function, 107
Probability density function (pdf), 2, 11, 23, 58, 65, 196, 245, 471 Quadrature spectrum, 29
Probability distribution, 219, 243 Quantile, 46
Probability distribution function, 255 Quantile-quantile, 58
Probability function, 470 QUARTIMAX, 75
Probability matrix, 404 Quartimax, 230
Probability space, 213 QUARTIMIN, 76, 77, 81
Probability vector, 398 Quartimin, 231
Product-moment correlation coefficient, 16 Quasi-biennial periodicity, 93
Product of spheres, 401 Quasi-biennial oscillation (QBO), 91, 101, 113, 145
Profile likelihood, 363
Profiles, 321 Quasi-geostrophic, 315
Progressive waves, 114, 115 Quasi-geostrophic model, 135, 309
Projected data, 251 Quasi-geostrophic vorticity, 135
Projected gradient, 83, 86, 283, 533 Quasi-Newton, 422, 437
Projected gradient algorithm, 284 Quasi-Newton algorithm, 141, 425
Projected matrix, 369 Quasi-stationary signals, 367
Projected/reduced gradient, 85
Projection index, 242, 246
Projection methods, 367 R
Projection operators, 86, 206 Radial basis function networks, 442
Projection pursuit (PP), 242, 269, 272, 280, 293 Radial basis functions (RBFs), 321, 325, 357
Radial coordinate, 375
Projection theorem, 538 Radial function, 457
Propagating, 91, 97 Radiative forcing, 2, 118
Propagating disturbances, 97, 107, 117 Rainfall, 447
Propagating features, 168 Rainfall extremes, 445
Propagating patterns, 91, 95, 96, 122, 135, 166 Rayleigh quotient, 177, 183, 395, 396
Propagating planetary waves, 72 Random error, 219
Propagating signal, 110 Random experiment, 469
Propagating speed, 93 Random forest (RF), 433, 450
Propagating structures, 94, 95, 118, 145, 157 Random function, 190, 353
Propagating wave, 162 Randomness, 244
Propagation, 113, 145 Random noise, 192, 220
Propagator, 130, 546 Random projection, 369
Prototype, 397, 430 Random projection matrix, 368
Prototype vectors, 448 Random samples, 49
Proximity, 201, 430 Random variable, 24, 224, 470
Pruning, 436, 437 Random vector, 154, 474
Pseudoinverse, 501 Rank, 501
Rank correlation, 21
Rank order, 210
Q Ranndom projection (RP), 368
QR decomposition, 505–506 Rational eigenfunctions, 109
Quadratic equation, 154 RBF networks, 424
Quadratic function, 82 Reanalysis, 2, 77, 131, 445
Quadratic measure, 53 Reconstructed attractor, 148
Quadratic nonlinearities, 296 Reconstructed variables, 164
Quadratic norm, 53 Rectangular, 430
Quadratic optimisation problem, 38 Recurrence networks, 68, 169
Quadratic system of equations, 263 Recurrence relationships, 153
Recurrences matrix, 169 Rotationally invariant, 250, 251
Recurrent, 422 Rotation criteria, 74
Recurrent NNs, 421 Rotation matrix, 73, 74, 115, 223, 284
Recursion, 259 Roughness measure, 454
Red-noise, 49, 51, 52, 152 R-square, 315, 348
Red spectrum, 125, 135 Runge Kutta scheme, 263
Reduced gradient, 242 Running windows, 50
Redundancy, 274, 278
Redundancy analysis (RDA), 337, 348
Redundancy index, 347 S
Redundancy reduction, 279 Salinity, 321
Regimes, 3 Sample covariance matrix, 237
Regime shift, 315 Sample-space noise model, 227
Regime transitions, 169 Sampling errors, 178
Regression, 3, 38, 67, 82, 184, 221, 315, 337, 381, 395, 416, 422, 434 Sampling fluctuation, 342
Sampling problems, 71
Regression analysis, 257 Scalar product, 140
Regression curve, 209 Scalar product matrix, 207
Regression matrix, 66, 338, 347 Scaled SVD, 365
Regression matrix A, 363 Scaling, 25, 54, 62, 207, 396, 399
Regression models, 4 Scaling problem, 62, 238, 338
Regularisation, 325, 390, 395, 396 Scandinavian pattern, 234, 293, 330
Regularisation parameter, 391 Scandinavian teleconnection pattern, 288
Regularisation problem, 330, 455 Scores, 37
Regularised EOFs, 331 Scree plot, 404
Regularised Lagrangian, 396 Sea level, 374
Regularization parameters, 344 Sea level pressure (SLP), 21, 31, 33, 41, 68, 115, 180, 194, 212, 284, 314, 315
Regularized CCA (RCCA), 343
Replicated MDS, 210 See-saw, 13
Reproducing kernels, 300 Seasonal cycle, 42, 160
Resampling, 46 Seasonality, 17, 93
Residual, 6, 36, 44, 87 Sea surface temperature (SST), 11, 33, 55, 132, 180, 284, 293, 383, 391, 404, 445
Residual sum of squares (RSS), 344, 398, 404
Resolvent, 129, 546 Second kind, 359
Response, 27 Second order centered moment, 24
Response function, 102 Second-order differential equations, 86
Response variables, 351 Second-order Markov model, 179
RGB colours, 256 Second order moments, 259, 266
Ridge, 313, 395 Second-order stationary, 132
Ridge functions, 257, 258 Second-order stationary time series, 196
Ridge regression, 344, 391, 395 Second-order statistics, 375
Riemannian manifold, 400 Self-adjoint, 37, 299, 300
Robustness, 438 Self-interactions, 68
Rominet patterns, 3 Self-organisation, 425, 429
Root node, 434 Self-organising maps (SOMs), 416, 429
Rosenblatt perceptron, 417 Semi-annual oscillation, 162
Rossby radii, 310 Semi-definite matrices, 63
Rossby wave, 115, 263 Semi-definite positivity, 226
Rossby wave breaking, 146 Semi-orthogonal matrix, 64
Rotated EOF (REOF), 4, 13, 35, 73, 141 Sensitivity to initial as well as boundary conditions, 2
Rotated factors, 72, 230
Rotated factor scores, 234 Sea surface height, 445
Rotated principal components, 73 Sensitivity to initial conditions, 147
Rotation, 72, 73, 229 Sequentially, 386
Serial correlation, 50, 171 Smooth maximum covariance analysis (SMCA), 319, 358
Shannon entropy, 37, 243, 244 Smoothness, 256, 352
Skew orthogonal projection, 404 Smoothness condition, 453
Short-memory, 180 Smooth time series, 352
Siberian high, 381, 383 Sampling with replacement, 47
Sigmoid, 279, 421, 423 Sneath’s coefficient, 203
Sigmoid function, 278 Smoothing condition, 355
Signal patterns, 178 Southern Oscillation, 33
Signal-to-noise maximisation, 177 Southern Oscillation mode, 196
Signal-to-noise ratio, 44, 254 Spacetime orthogonality, 70
Significance, 343 Sparse systems, 40
Significance level, 59 Spatial derivative, 111
Similarity, 201 Spatial weighting, 44
Similarity coefficient, 203 Spearman’s rank correlation coefficient, 21
Similarity matrix, 207, 217 Spectral analysis, 344, 391
Simple structure, 72 Spectral clustering, 302, 304, 306, 317
Simplex, 260, 402, 403, 524 Spectral decomposition theorem, 300
Simplex method, 524 Spectral density, 156, 196
Simplex projection, 408 Spectral density function, 26, 27, 98, 124, 186
Simplex vertices, 404 Spectral density matrix, 175, 185, 187, 190–192
Simplicity, 72, 73
Simplicity criteria, 81 Spectral domain, 97
Simplicity criterion, 73, 74 Spectral domain EOFs, 99
Simplified Component Technique-LASSO (SCoTLASS), 82 Spectral entropy, 197
Spectral EOFs, 98
Simplified EOFs (SEOFs), 82 Spectral methods, 328
Simulated, 3 Spectral peak, 125
Simulated annealing, 250 Spectral radius, 217
Simulations, 33 Spectral space, 373
Singular, 175 Spectral window, 187
Singular covariance matrices, 151 Spectrum, 54, 58, 66, 137, 148, 160, 167, 180, 375, 380
Singular system analysis (SSA), 146
Singular value, 26, 42, 74, 161, 345, 350 Sphered, 256
Singular value decomposition (SVD), 26, 40, 42, 96, 503–504 Sphering, 283
Spherical cluster, 302
Singular vectors, 109, 122, 149, 160, 206, 341, 342, 389, 393 Spherical coordinates, 326, 327
Spherical geometry, 44, 361
Skewness, 26, 264, 288 Spherical harmonics, 118, 310
Skewness modes, 262 Spherical RBFs, 360
Skewness tensor, 263 Sphering, 25, 54
Skew-symmetric, 217 Splines, 19, 321, 326, 354, 423
Sliding window, 148 Spline smoothing, 258
SLP anomalies, 42, 49, 213, 233, 285, 331, 381 Split, 436
S-mode, 22 Splitting, 434
Smooth EOFs, 319, 326, 332 Splitting node, 434
Smooth functions, 319 Squared residuals, 210
Smoothing, 18, 19 Square integrable, 36
Smoothing constraint, 258 Square root of a symmetric matrix, 25
Smoothing kernel, 19 Square root of the sample covariance matrix, 256
Smoothing parameter, 303, 326, 362
Smoothing problem, 258 Squashing, 279
Smoothing spectral window, 192 Squashing functions, 421
Smoothing spline, 354 Surrogate data, 48
SST anomalies, 60 Supervised, 416
Stability analysis, 135 Support vector machine (SVM), 422
Standard error, 45 Support vector regression, 442
Standard normal distribution, 10, 46 Surface temperature, 62, 183, 369, 445
Standing mode, 169 Surface wind, 33
Standing oscillation, 135 Surrogate, 48
Standing waves, 124, 168, 169 Surrogate data, 45
State space, 38 Swiss-roll, 211
Stationarity, 121, 370 Symmetric, 502
Stationarity conditions, 121 Synaptic weight, 430
Stationary, 26, 94, 154 Synchronization, 68
Stationary patterns, 91 Synoptic, 167
Stationary points, 283 Synoptic eddies, 180
Stationary solution, 88, 310 Synoptic patterns, 445
Stationary states, 311, 313 Synoptic weather, 444
Stationary time series, 98, 156 System’s memory, 171
Statistical downscaling, 365
Steepest descent, 425, 526
Steepness, 421 T
Stepwise procedure, 386 Tail, 9, 250
Stochastic climate model, 310 Training dataset, 48
Stochastic integrals, 99 Tangent hyperbolic, 282
Stochasticity, 12 Tangent space, 86
Stochastic matrix, 305, 398 t-distribution, 375
Stochastic model, 122 Teleconnection, 2, 33, 34, 49, 67, 284, 345, 446
Stochastic modelling, 222
Stochastic process, 36, 37, 190, 352, 480 Teleconnection pattern, 31
Stochastic system, 118 Teleconnectivity, 66, 67
Storm track, 61, 180 Tendencies, 136, 302, 310
Stratosphere, 91, 448 Terminal node, 434
Stratospheric activity, 291 Ternary plot, 404
Stratospheric flow, 93 Tetrahedron, 403
Stratospheric warming, 146 Thermohaline circulation, 321
Stratospheric westerlies, 93 Thin-plate, 463
Stratospheric westerly flow, 93 Thin-plate spline, 424, 454
Stratospheric zonal wind, 125 Third-order moment, 262
Streamfunction, 127, 132, 141, 263, 310, 311 Thompson’s factor score, 229
Stress function, 208, 209 Three-way data, 291
Structure removal, 256 Tikhonov regularization, 344
Student distribution, 478 T-mode, 22
Subantarctic mode water, 323 Toeplitz, 156, 159
Sub-Gaussian, 271, 280 Toeplitz covariance matrix, 152
Subgrid processes, 44 Toeplitz matrix, 149
Subgrid scales, 118 Topographic map, 429
Subjective/Bayesian school, 467 Topological neighbourhood, 431
Submatrix, 159 Topology, 431
Subscale processes, 118 Topology-preserving projection, 448
Substructures, 168 T-optimals, 179
Subtropical/subpolar gyres, 317 Tori, 211
Sudden stratospheric warming, 291 Trace, 500
Summer monsoon, 34, 66, 212, 404 Training, 46, 416, 425
Sum-of-squares, 254 Training set, 47, 391
Sum of the squared correlations, 351 Trajectory, 147
Super-Gaussian, 270, 473 Trajectory matrix, 149
Transfer entropy, 281 Variational problem, 455
Transfer function, 27, 104, 105, 125, 419, 424 VARIMAX, 73, 77, 81
Transition probability, 305 Varimax, 115
Transpose, 500 Vector autoregressive, 138
Trapezoidal rule, 192 Vertical modes, 322
Travelling features, 124 Visual inspection, 241
Travelling waves, 117, 145 Visualization, 11
Tree, 448 Visualizing proximities, 201
Trend, 377 Volcanoes, 2
Trend EOF (TEOF), 375 Vortex, 167
Trend pattern, 381 Vortex area, 167
Trends, 3, 180 Vorticity, 135
Triangular matrix, 397
Triangular metric inequality, 205
Triangular truncation, 310 W
Tridiagonal, 153 Water masses, 322
Tropical cyclone forecast, 444 Wave amplitude, 113
Tropical cyclone frequency, 442 Wavelet transform, 103
Tropical Pacific SST, 440 Wave life cycle, 113
Tucker decomposition, 293 Wavenumber, 71, 111, 115
Tukey’s index, 248 Weather forecasting, 440
Tukey two-dimensional index, 248 Weather predictability, 184
Tukey window, 183 Weather prediction, 34, 451
Two-sample EOF, 383 Weighted covariance matrix, 281
Two-way data, 291 Weighted Euclidean distances, 211
Weight matrix, 278
Welch, 196
U Westerly flows, 34, 93
Uncertainty, 45, 49, 244 Westerly jets, 92
Unconstrained problem, 76, 186 Western boundary currents, 55
Uncorrelatedness, 270 Western current, 406
Understanding-context independence, 9 Westward phase tilt, 125
Uniform distribution, 196, 244, 251, 271 Westward propagating Rossby waves, 157
Uniformly convergent, 37 Whitened, 389
Uniform random variables, 21 Whitening transformation, 364
Unimodal, 259, 311 White noise, 56, 131, 149, 197, 267
Uniqueness, 222 Wind-driven gyres, 322
Unit, 419, 421 Wind fields, 212
Unitary rotation matrix, 115 Window lag, 160, 167
Unit circle, 135 Winning neuron, 430
Unit gain filter, 107 Wishart distribution, 385, 480
Unit-impulse response, 266 Working set, 400
Unit sphere, 386 Wronskian, 545
Unresolved waves, 310
Unstable modes, 135
Unstable normal modes, 135
Unsupervised, 416 X
Upwelling, 447 X11, 17

V Y
Validation, 3 Young-Householder decomposition, 206
Validation set, 47 Yule-Walker equations, 175, 185
Z Zonal velocity, 184
Zero frequency, 107, 177 Zonal wavenumber, 126
Zero-skewness, 170 Zonal wind, 92, 184
Zonally symmetric, 92 z-transform, 267
Zonal shift of the NAO, 49