STATISTICAL METHODS FOR CLIMATE SCIENTISTS
This book provides a comprehensive introduction to the most commonly used statistical
methods relevant in atmospheric, oceanic, and climate sciences. Each method is described
step-by-step using plain language, and illustrated with concrete examples, with relevant
statistical and scientific concepts explained as needed. Particular attention is paid to nuances
and pitfalls, with sufficient detail to enable the reader to write relevant code. Topics covered
include hypothesis testing, time series analysis, linear regression, data assimilation, extreme
value analysis, Principal Component Analysis, Canonical Correlation Analysis, Predictable
Component Analysis, and Covariance Discriminant Analysis. The specific statistical chal-
lenges that arise in climate applications are also discussed, including model selection prob-
lems associated with Canonical Correlation Analysis, Predictable Component Analysis, and
Covariance Discriminant Analysis. Requiring no previous background in statistics, this is
a highly accessible textbook and reference for students and early career researchers in the
climate sciences.
TIMOTHY M. DELSOLE
George Mason University
MICHAEL K. TIPPETT
Columbia University
University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314-321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India
103 Penang Road, #05–06/07, Visioncrest Commercial, Singapore 238467
www.cambridge.org
Information on this title: www.cambridge.org/9781108472418
DOI: 10.1017/9781108659055
© Cambridge University Press 2022
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2022
Printed in the United Kingdom by TJ Books Limited, Padstow Cornwall
A catalogue record for this publication is available from the British Library.
Library of Congress Cataloging-in-Publication Data
Names: DelSole, Timothy M., author.
Title: Statistical methods for climate scientists / Timothy M. DelSole and Michael K. Tippett.
Description: New York : Cambridge University Press, 2021. | Includes
bibliographical references and index.
Identifiers: LCCN 2021024712 (print) | LCCN 2021024713 (ebook) |
ISBN 9781108472418 (hardback) | ISBN 9781108659055 (epub)
Subjects: LCSH: Climatology–Statistical methods. | Atmospheric
science–Statistical methods. | Marine sciences–Statistical methods. |
BISAC: SCIENCE / Earth Sciences / Meteorology & Climatology
Classification: LCC QC866 .D38 2021 (print) | LCC QC866 (ebook) |
DDC 551.601/5118–dc23
LC record available at https://lccn.loc.gov/2021024712
LC ebook record available at https://lccn.loc.gov/2021024713
ISBN 978-1-108-47241-8 Hardback
Additional resources for this publication at www.cambridge.org/9781108472418.
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet websites referred to in this publication
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
Contents
2 Hypothesis Tests
2.1 The Problem
2.2 Introduction to Hypothesis Testing
2.3 Further Comments on the t-test
2.4 Examples of Hypothesis Tests
2.5 Summary of Common Significance Tests
2.6 Further Topics
2.7 Conceptual Questions
3 Confidence Intervals
3.1 The Problem
Appendix
A.1 Useful Mathematical Relations
A.2 Generalized Eigenvalue Problems
A.3 Derivatives of Quadratic Forms and Traces
References
Index
Preface
This book provides an introduction to the most commonly used statistical methods
in atmospheric, oceanic, and climate sciences. The material in this book assumes
no background in statistical methods and can be understood by students with only a
semester of calculus and physics. Also, no advanced knowledge about atmospheric,
oceanic, and climate sciences is presumed. Most chapters are self-contained and
explain relevant statistical and scientific concepts as needed. A familiarity with
calculus is presumed, but the student need not solve calculus problems to perform
the statistical analyses covered in this book.
The need for this book became clear several years ago when one of us joined
a journal club to read “classic” papers in climate science. Specifically, students in
the club had difficulty understanding certain papers because these papers contained
unfamiliar statistical concepts, such as empirical orthogonal functions (EOFs), sig-
nificance tests, and power spectra. It became clear that our PhD curriculum was
not adequately preparing students to be “literate” in climate science. To rectify this
situation, we decided that students should take a statistics class. However, at that
time, there did not exist a single self-contained course that covered all the topics
that we considered to be essential for success in climate science. Therefore, we
designed a single course that covered these topics (which eventually expanded into
a two-semester course). This book is based on this course and embodies over a
decade of experience in teaching this material.
This book covers six key statistical methods that are essential to understanding
modern climate research: (1) hypothesis testing; (2) time series models and power
spectra; (3) linear regression; (4) Principal Component Analysis (PCA) and related
multivariate decomposition methods such as Canonical Correlation Analysis (CCA)
and Predictable Component Analysis; (5) data assimilation; and (6) extreme value
analysis. Chapter 1 reviews basic probabilistic concepts that are used throughout the
book. Chapter 2 discusses hypothesis testing. Although the likelihood ratio provides
a general framework for hypothesis testing, beginners often find this framework
difficulties are rarely discussed in standard statistics texts. In the climate literature,
the standard approach to this problem is to apply these techniques to a few principal
components of the data, so that the time dimension is much bigger than the state
dimension. The main obstacle in this approach is choosing the number
of principal components. Unfortunately, no standard criterion for selecting the number
of principal components exists for these multivariate techniques. This gap was
sorely felt each time this material was taught and motivated us to conduct our own
independent research into this problem. This research culminated in the discovery
of a criterion that was consistent with standard information criteria and could be
applied to all of the problems discussed in this book. For regression models and
CCA, this criterion is called Mutual Information Criterion (MIC) and is introduced
in Chapter 14 (for full details, see DelSole and Tippett, 2021a). After formulating
this criterion, we discovered that it was consistent with many of the criteria derived
by Fujikoshi et al. (2010) based on likelihood ratio methods, which supports the
soundness of MIC. However, MIC is considerably easier to derive and apply. We
believe that MIC will be of wide interest to statisticians and to scientists in other
fields who use these multivariate methods.
The development of this book was somewhat unusual. Initially, we followed our
own personal experience by giving formal lectures on each chapter. Inspired by
recent educational research, we began using a “flipped classroom” format, in which
students read each chapter and sent questions and comments electronically before
coming to class. The class itself was devoted to going over the questions/comments
from students. We explicitly asked students to tell us where the text failed to help
their understanding. To invite feedback, we told students that we needed their help
in writing this book, because over the ten years that we have been teaching this topic,
we have become accustomed to the concepts and could no longer see what is wrong
with the text. The resulting response in the first year was more feedback than we had
obtained in all the previous years combined. This approach not only revolutionized
the way we teach this material but gave us concrete feedback about where precisely
the text could be improved. With each subsequent year, we experimented with new
material and, if it did not work, tried different ways. This textbook is the outcome
of this process over many years, and we feel that it introduces statistical concepts
much more clearly and in a more accessible manner than most other texts.
Each chapter begins with a brief description of a statistical method and a concrete
problem to which it can be applied. This format allows a student to quickly ascertain
if the statistical method is the one that is needed. Each problem was chosen after
careful thought based on intrinsic interest, importance in real climate applications,
and instructional value.
Each statistical method is discussed in enough detail to allow readers to write
their own code to implement the method (except in one case, namely extreme value
theory, for which there exists easy-to-use software in R). The reason for giving this
level of detail is to ensure that the material is complete, self-contained, and covers
the nuances and points of confusion that arise in practice. Indeed, we, as active
researchers, often feel that we do not adequately understand a statistical method
unless we have written computer code to perform that method. Our experience is
that students gain fundamental and long-lasting confidence by coding each method
themselves. This sentiment was expressed in an end-of-year course evaluation, in
which one of our students wrote, “Before this course, I had used someone else’s
program to compute an EOF, but I didn’t really understand it. Having to write my
own program really helped me understand this method.”
The methods covered in this book share a common theme: to quantify and exploit
dependencies between X and Y . Different methods arise because each method is tai-
lored to a particular probability distribution or data format. Specifically, the methods
depend on whether X and Y are scalar or vector, whether the values are categorical
or continuous, whether the distributions are Gaussian or not, and whether one vari-
able is held fixed for multiple realizations of the other. The most general method for
quantifying X-Y dependencies for multivariate Gaussian distributions is Canonical
Correlation Analysis. Special cases include univariate regression (scalar Y ), field
significance (scalar X), or correlation (scalar X and scalar Y ). In climate studies,
multiple realizations of Y for fixed X characterize ensemble data sets. The most gen-
eral method for quantifying X-Y dependencies in ensemble data sets is Predictable
Component Analysis (or equivalently, Multivariate Analysis of Variance). Special
cases include Analysis of Variance (scalar Y ), and the t-test (scalar X and scalar Y ).
Many of these techniques have non-Gaussian versions. Linear regression provides
a framework for exploiting dependencies to predict one variable from the other.
Autoregressive models and power spectra quantify dependencies across time. Data
assimilation provides a framework for exploiting dependencies to infer Y given X
while incorporating “prior knowledge” about Y . The techniques for the different
cases, and the chapter in which they are discussed, are summarized in Table 0.1.
1 Basic Concepts in Probability and Statistics

This chapter reviews some essential concepts of probability and statistics.
Figure 1.1 A time series of the monthly Niño 3.4 index over the period 1990–2000.
Figure 1.2 Histograms of the monthly mean Niño 3.4 index over the period 1948–
2017. The two histograms show the same data, but the left histogram uses a wider
bin size than the right.
rectangle over each bin such that the area of each rectangle equals the empirical
frequency with which samples fall into the bin. The total area of the rectangles equals
one. (Sometimes, histograms may be defined such that the total area of the rectangles
equals the total number of samples, in which case the area of each rectangle equals
the number of samples that fall into that bin.)
Histograms of the Niño 3.4 index for different bin sizes are shown in Figure 1.2.
The figure shows that this index varied between 24◦ C and 29.5◦ C over the period
1948–2017. Also, values around 27◦ occur more frequently than values around 25◦
or 29◦ . However, the shape of the histogram is sensitive to bin size (e.g., compare
Figures 1.2a and b); hence, the conclusions one draws from a histogram can be
sensitive to bin size. There exist guidelines for choosing the bin size, e.g., Sturges’
rule and the Freedman–Diaconis rule, but we will not discuss these. They often are
implemented automatically in standard statistical software.
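As a concrete illustration, the following Python sketch draws the same data with two different bin widths; the data array is a synthetic stand-in, since the Niño 3.4 values themselves are not reproduced here, and the bin counts are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for a monthly index (the real Nino 3.4 data are not
# included here); any 1-D array of values works the same way.
rng = np.random.default_rng(0)
x = 26.5 + rng.normal(size=840)  # roughly 70 years of monthly values

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, nbins in zip(axes, (6, 24)):
    # density=True scales the bars so that the total area equals one,
    # matching the definition of a histogram used in the text
    ax.hist(x, bins=nbins, density=True, edgecolor="k")
    ax.set_xlabel("index value")
    ax.set_ylabel("frequency")
    ax.set_title(f"{nbins} bins")
plt.tight_layout()
plt.show()
```

Comparing the two panels makes the sensitivity to bin size visible directly: the coarse binning smooths over features that the fine binning exaggerates.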
The scatterplot provides a way to visualize the relation between two variables. If
X and Y are two time series over the same time steps, then each point on the scatter-
plot shows the point (X(t),Y (t)) for each value of t. Some examples of scatterplots
are illustrated in Figure 1.3. Scatterplots can reveal distinctive relations between X
and Y . For instance, Figure 1.3a shows a tendency for large values of X to occur
at the same time as large values of Y . Such a tendency can be used to predict one
variable based on knowledge of the other. For instance, if X were known to be
at the upper extreme value, then it is very likely that Y also will be at its upper
extreme. Figure 1.3b shows a similar tendency, except that the relation is weaker,
and therefore a prediction of one variable based on the other would have more
uncertainty. Figure 1.3c does not immediately reveal a relation between the two
variables. Figure 1.3d shows that X and Y tend to be negatively related to each other,
Figure 1.3 Scatterplots of X versus Y for various types of relation. The correlation
coefficient ρ, given in the title of each panel, measures the degree of linear
relation between X and Y . The data were generated using the model discussed
in Example 1.7, except for data in the bottom right panel, which was generated by
the model Y = X 2 , where X is drawn from a standardized Gaussian.
when one goes up, the other goes down. Methods for quantifying these relations are
discussed in Section 1.7.
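A minimal Python sketch of such scatterplots is given below. The linear and quadratic relations are assumed illustrative stand-ins for two of the panels of Figure 1.3, not the book's actual data; the sample size and the value 0.9 are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.standard_normal(200)
z = rng.standard_normal(200)                     # independent of x

y_linear = 0.9 * x + np.sqrt(1 - 0.9**2) * z     # strong positive relation
y_quad = x**2                                    # dependent, yet correlation near zero

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, y, label in zip(axes, (y_linear, y_quad), ("linear", "quadratic")):
    ax.scatter(x, y, s=10)
    rho = np.corrcoef(x, y)[0, 1]                # sample correlation coefficient
    ax.set_title(f"{label}: correlation = {rho:.2f}")
    ax.set_xlabel("X")
    ax.set_ylabel("Y")
plt.tight_layout()
plt.show()
```

The quadratic panel previews a point made later in the chapter: a small correlation does not imply that the variables are independent.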
Definition 1.2 (Sample Mean) The sample mean (or average) of N numbers
X1, . . . ,XN is denoted μ̂X and equals the sum of the numbers divided by N
\[
\hat{\mu}_X = \frac{X_1 + X_2 + \cdots + X_N}{N} = \frac{1}{N}\sum_{n=1}^{N} X_n. \tag{1.1}
\]

Figure 1.4 Histogram of the monthly mean (raw) Niño 3.4 index over the period
1948–2017, as in Figure 1.2, but with measures of central value and dispersion
superimposed. The mean and median are indicated by dashed and dotted vertical
lines, respectively. The dash-dotted lines indicate the 5th and 95th percentiles. The
horizontal “error bar” at the top indicates the mean plus or minus two standard
deviations. The empirical mode is between 27°C and 27.5°C.
The mean of the Niño 3.4 index is indicated in Figure 1.4 by the dashed vertical
line. The mean is always bounded by the largest and smallest elements.
Another measure of central value is the median.
Definition 1.3 (Sample Median) The sample median of N numbers X1, . . . ,XN
is the middle value when the data are arranged from smallest to largest. If N is odd,
the median is the unique middle value. If N is even, then two middle values exist and
the median is defined to be their average.
The median effectively divides the data into two equal halves: 50% of the data
lie above the median, and 50% of the data lie below the median. The median of the
Niño 3.4 index is shown by the dotted vertical line in Figure 1.4 and is close to the
mean. In general, the mean and median are equal for symmetrically distributed data,
but differ for asymmetrical distributions, as the following two examples illustrate.
Example 1.1 (The Sample Median and Mean for N Odd) Question: What is
the mean and median of the following data?
2 8 5 9 3. (1.2)
Answer: To compute the median, first order the data:
2 3 5 8 9. (1.3)

The middle value is 5, so the median is 5. The mean is (2 + 3 + 5 + 8 + 9)/5 = 5.4.
The median is a special case of a percentile: It is the 50th percentile (i.e., p = 0.5).
The above definition states merely that at least p · 100% of the data lies below the
100p’th percentile, hence the sample percentile is not unique. There are several
definitions of sample quantiles; for instance, Hyndman and Fan (1996) discuss nine
different algorithms for computing sample quantiles. The differences between these
sample quantiles have no practical importance for large N and will not be of concern
in this book. Mathematical software packages such as Matlab, R, and Python have
built-in functions for computing quantiles.
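For example, the 5th, 50th, and 95th percentiles can be obtained in one call with NumPy; the data array below is synthetic and only stands in for the Niño 3.4 index.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=26.5, scale=1.0, size=840)   # stand-in data

# 5th and 95th percentiles; the 50th percentile is the median
p5, p50, p95 = np.percentile(x, [5, 50, 95])
print(f"5th percentile:  {p5:.2f}")
print(f"median:          {p50:.2f}")
print(f"95th percentile: {p95:.2f}")
```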
The percentile range is the interval between two specified percentile points. For
instance, the 5–95% range includes all values between the 5th and 95th percentiles.
This percentile range is a measure of variation in the sense that it specifies an interval
in which a random number from the population will fall 90% of the time. The 5th
and 95th percentiles of the Niño 3.4 index are indicated in Figure 1.4 by the two
dash-dot lines.
Another measure of variation is the variance.
Definition 1.5 (Sample Variance) The sample variance of N numbers X1, . . . ,XN
is denoted σ̂X2 and defined as
\[
\hat{\sigma}_X^2 = \frac{1}{N-1}\sum_{n=1}^{N} \left(X_n - \hat{\mu}_X\right)^2 . \tag{1.7}
\]
The reader ought to be curious why the sum in (1.7) is divided by N − 1, whereas
the sum for the mean (1.1) was divided by N. The reason for this will be discussed in
Section 1.10 (e.g., see discussion after Theorem 1.4). Based on its similarity to the
definition of the mean, the variance is approximately the average squared difference
from the sample mean.
Definition 1.6 (Standard Deviation) The standard deviation is the (positive)
square root of the variance:
\[
\hat{\sigma}_X = \sqrt{\hat{\sigma}_X^2}\, . \tag{1.8}
\]
The standard deviation has the same units as X.
Among the different measures listed above, the ones that will be used most often
in this book are the mean for central tendency, and the variance for variation. The
main reason for this is that the mean and variance are algebraic combinations of
the data (i.e., they involve summations and powers of the data); hence, they are
easier to deal with theoretically compared to mode, median, and percentiles (which
require ranking the data). Using the mean and variance, a standard description of
variability is the mean value plus and minus one or two standard deviations. For the
Niño 3.4 index shown in Figure 1.4, the mean plus or minus two standard deviations
is indicated by the error bar at the top of the figure.
With (1.9), the sample variance can be computed from one pass of the data,
but requires tracking two quantities, namely the means of X and X2 . The sample
variance is nonnegative, but in practice (1.9) can be (slightly) negative owing to
numerical precision error.
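Equation (1.9) itself is not reproduced in this excerpt; the sketch below assumes the usual one-pass identity σ̂²_X = N/(N−1) [mean(X²) − (mean X)²] and guards against the small negative values mentioned above.

```python
import numpy as np

def one_pass_variance(x):
    """Sample variance from a single pass, tracking the means of X and X**2.

    Assumes the identity var = N/(N-1) * (mean(X**2) - mean(X)**2); the result
    is clipped at zero to guard against small negative values from round-off.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    mean_x = 0.0
    mean_x2 = 0.0
    for k, value in enumerate(x, start=1):
        # running (incremental) means of X and X**2
        mean_x += (value - mean_x) / k
        mean_x2 += (value**2 - mean_x2) / k
    var = n / (n - 1) * (mean_x2 - mean_x**2)
    return max(var, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
print(one_pass_variance(x), np.var(x, ddof=1))  # the two values should agree closely
```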
can differ considerably from 50%. Asserting that heads occurs with 50% probability
is tantamount to asserting knowledge of the “inner machinery” of nature. We refer
to the “50% probability” as a population property, to distinguish it from the results
of a particular experiment, e.g., “6 out of 10 tosses,” which is a sample property.
Much confusion can be avoided by clearly distinguishing population and sample
properties. In particular, it is a mistake to equate the relative frequency with which
an event occurs in an experiment with the probability of the event in the population.
A random variable is a function that assigns a real number to each outcome of
an experiment. If the outcome is numerical, such as the temperature reading from
a thermometer, then the random variable often is the number itself. If the outcome
is not numerical, then the role of the function is to assign a real number to each
outcome. For example, the outcome of a coin toss is heads or tails, i.e., not a number,
but a function may assign 1 to heads and 0 to tails, thereby producing a random
variable whose only two values are 0 and 1. This is an example of a discrete random
variable, whose possible values can be counted. In contrast, a random variable is said
to be continuous if its values can be any of the infinitely many values in one or more
line intervals.
Sometimes a random variable needs to be distinguished from the value that it
takes on. The standard notation is to denote a random variable by an uppercase
letter, i.e. X, and denote the specific value of a random draw from the population by
a lowercase letter, i.e. x. We will adopt this notation in this chapter. However, this
notation will be adhered to only lightly, since later we will use uppercase letters to
denote matrices and lowercase letters to denote vectors, a distinction that is more
important in multivariate analysis.
If a variable is discrete, then it has a countable number of possible realizations
X1,X2, . . .. The corresponding probabilities are denoted p1,p2, . . . and called the
probability mass function. If a random variable is continuous, then we consider a
class of variables X such that the probability of {x1 ≤ X ≤ x2 }, for all values of
x1 ≤ x2 , can be expressed as
\[
P(x_1 \le X \le x_2) = \int_{x_1}^{x_2} p_X(x)\, dx, \tag{1.10}
\]
where pX (x) is a nonnegative function called the density function. By this definition,
the probability of X falling between x1 and x2 corresponds to the area under the
density function. This area is illustrated in Figure 1.5a for a particular distribution.
If an experiment always yields some real value of X, then that probability is 100%
and it follows that
\[
\int_{-\infty}^{\infty} p_X(x)\, dx = 1. \tag{1.11}
\]
Figure 1.5 Schematic showing (a) a probability density function for X and the fact
that the probability that X lies between 1/2 and 1 is given by the area under the
density function p(x), and (b) the corresponding cumulative distribution function
F (x) and the values at x = 0.5 and x = 1, the difference of which equals the area
of the shaded region in (a).
The histogram provides an estimate of the density function, provided the histogram
is expressed in terms of relative frequencies. Another function is
\[
F(x) = P(X \le x) = \int_{-\infty}^{x} p_X(u)\, du, \tag{1.12}
\]
which is called the cumulative distribution function and illustrated in Figure 1.5b.
The probability that X lies between x1 and x2 can be expressed equivalently as
P (x1 ≤ X ≤ x2 ) = F (x2 ) − F (x1 ). (1.13)
The above properties do not uniquely specify the density function pX (x), as there
is more than one pX (x) that gives the same left-hand side of (1.10) (e.g., two density
functions could differ at isolated points and still yield the same probability of the
same event). A more precise definition of the density function requires measure
theory, which is beyond the scope of this book. Such subtleties play no role in the
problems discussed in this book. Suffice it to say that distributions considered in
this book are absolutely continuous. A property of this class is that the probability
of the event {X = x1 } vanishes:
\[
P(X = x_1) = \int_{x_1}^{x_1} p_X(x)\, dx = 0. \tag{1.14}
\]
Although continuous random variables are defined with integrals, you need not
explicitly evaluate integrals to do statistics – all integrals needed in this book can
be obtained from web pages, statistical software packages, or tables in the back of
most statistical texts.
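As an illustration of Equation (1.13), the probability that a random variable falls in an interval can be read off from its cumulative distribution function with a statistical package; the choice of SciPy and of the standard normal distribution here is only an example, not something specified by the text.

```python
from scipy.stats import norm

# Probability that a standard normal variable falls between 0.5 and 1.0,
# computed as F(x2) - F(x1) per Equation (1.13)
x1, x2 = 0.5, 1.0
prob = norm.cdf(x2) - norm.cdf(x1)
print(f"P({x1} <= X <= {x2}) = {prob:.4f}")   # about 0.1499
```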
1.6 Expectation
Just as a sample can be characterized by its mean and variance, so too can the
population.
If the random variable is discrete and takes on discrete values X1,X2, . . . ,XN
with probabilities p1,p2, . . . ,pN , then the expectation is defined as
\[
E_X[X] = \sum_{n=1}^{N} X_n\, p_n. \tag{1.17}
\]
times, respectively. Then, the sum in (1.1) involves N1 terms equal to x1 , N2 terms
equal to x2 , and so on; hence, the sample mean is
\[
\hat{\mu}_X = \frac{x_1 N_1 + x_2 N_2 + \cdots + x_K N_K}{N_1 + N_2 + \cdots + N_K} = \sum_{n=1}^{K} x_n f_n, \tag{1.18}
\]
Notation
We use carets ˆ to distinguish sample quantities from population quantities; for
example,
\[
\hat{\mu}_x = \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad \hat{\mu}_X = \frac{1}{N}\sum_{n=1}^{N} X_n, \qquad \text{and} \qquad \mu_X = E_X[X]. \tag{1.19}
\]
The sample mean μ̂x is a specific numerical value obtained from a given sample,
μ̂X is a random variable because it is a sum of random variables, and μX is a fixed
population quantity. Because μ̂X is a random variable, it can be described by a
probability distribution with its own expectation. This fact can lead to potentially
confusing terminology, such as “the mean of the mean.” In such cases, we say
“expectation of the sample mean.”
• EX [k1 ] = k1
• EX [k1 X] = k1 EX [X]
• EX [k1 X + k2 Y ] = k1 EX [X] + k2 EX [Y ]
Definition 1.8 (Variance) The variance of the random variable X is defined as
var[X] = EX [(X − EX [X])2 ]. (1.20)
The variance of X often is denoted by σX2 . The standard deviation σX is the positive
square root of the variance.
Interpretation
Variance is a measure of dispersion or scatter of a random variable about its mean.
Small variance indicates that the variables tend to be concentrated near the mean.
• var[k] = 0
• var[X] = EX [X 2 ] − (EX [X])2
• var[kX] = k 2 var[X]
• var[X + k] = var[X]
In general, a comma separating two events is shorthand for “and.” The probability
of the single event {x1 ≤ X ≤ x2 } can be computed from the joint density p(x,y)
by integrating over all outcomes of Y :
\[
P(x_1 \le X \le x_2,\ -\infty \le Y \le \infty) = \int_{x_1}^{x_2}\!\int_{-\infty}^{\infty} p_{XY}(x,y)\, dy\, dx. \tag{1.23}
\]
However, this probability already appears in (1.10). Since this is true for all x1 and
x2 , it follows that
\[
p_X(x) = \int_{-\infty}^{\infty} p_{XY}(x,y)\, dy. \tag{1.24}
\]
To emphasize that only a single variable is considered, the density pX (x) often is
called the unconditional or marginal probability density of X.
Notation
Technically, the distributions of X and Y should be denoted by separate functions,
say pX (x) and pY (y). Moreover, the arguments can be arbitrary, say pX (w) and
pY (z). For conciseness, the subscripts are dropped with the understanding that the
argument specifies the specific function in question. For example, p(x) denotes the
density function pX (x). Similarly, p(x,y) denotes the joint density pXY (x,y). The
expectation follows a similar convention: The expectation is taken with respect to the
joint distribution of all random variables that appear in the argument. For instance,
E[f (X)] denotes the expectation of f (x) with respect to pX (x), and E[g(X,Y )]
denotes the expectation with respect to the joint distribution of X and Y ; that is,
\[
E[g(X,Y)] = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} g(x,y)\, p_{XY}(x,y)\, dx\, dy. \tag{1.25}
\]
The variance of a sum generally differs from the sum of the individual variances.
The covariance of X and Y depends on the (arbitrary) units in which the variables
are measured. However, if the two variables are standardized, then the resulting
covariance is independent of measurement units, and has other attractive properties,
as discussed next.
1.8 Independence
A fundamental concept in statistical analysis is independence:
If two variables are independent, then any functions of them are also independent.
If two random variables are not independent, then they are dependent. Dependence
between two random variables can be quantified by its conditional distribution.
Important property
If X and Y are independent, then cov[X,Y ] = 0. This fact can be shown as follows:
\begin{align*}
\mathrm{cov}[X,Y] &= \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} (x - \mu_X)(y - \mu_Y)\, p(x,y)\, dx\, dy \\
&= \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} (x - \mu_X)(y - \mu_Y)\, p(x)\, p(y)\, dx\, dy \\
&= \int_{-\infty}^{\infty} (x - \mu_X)\, p(x)\, dx \int_{-\infty}^{\infty} (y - \mu_Y)\, p(y)\, dy \\
&= E[X - \mu_X]\, E[Y - \mu_Y] \\
&= 0. \tag{1.34}
\end{align*}
It follows as a corollary that if X and Y are independent, then ρXY = 0. This fact is
one of the most important facts in statistics!
While covariance vanishes if two variables are independent, the converse of this
statement is not true: the covariance can vanish even if the variables are dependent.
Figure 1.3f shows a counter example (see also Section 1.11). The fact that indepen-
dence implies vanishing of the covariance is a valuable property that is exploited
repeatedly in statistics.
Example 1.6 (Variance of a Sum of Independent Variables) Question: What
is the variance of X + Y , where X and Y are independent? Answer: According to
example (1.4)
\begin{align*}
\mathrm{var}[X+Y] &= \mathrm{var}[X] + \mathrm{var}[Y] + 2\,\mathrm{cov}[X,Y] \tag{1.35}\\
&= \mathrm{var}[X] + \mathrm{var}[Y], \tag{1.36}
\end{align*}
where we have used cov[X,Y ] = 0 for independent X and Y . This result shows that
if variables are independent, the variance of the sum equals the sum of the variances.
This result will be used repeatedly throughout this book.
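A quick Monte Carlo check of this additivity, using assumed variances of 4 and 9 and an arbitrary sample size, might look as follows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(scale=2.0, size=n)   # var[X] = 4
y = rng.normal(scale=3.0, size=n)   # var[Y] = 9, independent of X

# For independent variables the sample variance of the sum should be close to 4 + 9
print(np.var(x + y, ddof=1))        # roughly 13
```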
The last equality follows from the fact that var[X] = var[Z] = 1. The covariance is
\[
\mathrm{cov}[X,Y] = \mathrm{cov}\!\left[X,\ \rho X + \sqrt{1-\rho^2}\, Z\right] = \rho\,\mathrm{var}[X] + \sqrt{1-\rho^2}\,\mathrm{cov}[X,Z] = \rho, \tag{1.39}
\]
where we have used selected properties of the covariance, cov[X,X] = var[X], and
cov[X,Z] = 0 because X and Z are independent. Consolidating these results yields
\[
\mathrm{cor}[X,Y] = \frac{\mathrm{cov}[X,Y]}{\sqrt{\mathrm{var}[X]\,\mathrm{var}[Y]}} = \rho. \tag{1.40}
\]
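This construction (assumed here to be the model of Example 1.7, Y = ρX + √(1 − ρ²) Z with X and Z independent standard normals) is easy to verify numerically: the sample correlation should be close to the chosen ρ. The values of ρ and the sample size below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.7
n = 50_000

x = rng.standard_normal(n)
z = rng.standard_normal(n)            # independent of x
y = rho * x + np.sqrt(1 - rho**2) * z

# Sample correlation should be close to the population value rho
print(np.corrcoef(x, y)[0, 1])        # roughly 0.7
```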
This example shows that the expectation of the sample mean equals the popu-
lation mean. As a result, the sample mean is a useful estimator of the population
mean. The expectation E[X] = μ is a population parameter, while the sample mean
μ̂ is an estimator of μ.
Definition 1.13 (Unbiased Estimator) If the expectation of an estimator equals
the corresponding population parameter, then it is called an unbiased estimator. Oth-
erwise it is called a biased estimator.
Example 1.10 The sample variance (1.7) is an unbiased estimator; that is, E[σ̂X2 ] =
σX2 . Had the sample variance been defined by dividing the sum by N instead of N − 1,
as in
\[
\hat{\sigma}_B^2 = \frac{1}{N}\sum_{n=1}^{N} \left(x_n - \hat{\mu}_X\right)^2, \tag{1.42}
\]
then the resulting estimator σ̂B² would have been biased, in the sense that E[σ̂B²] ≠ σX².
In fact, it can be shown that E[σ̂B²] = σX² (N − 1)/N.
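A short simulation, with an assumed population variance of 4 and an assumed sample size N = 5, illustrates the bias of dividing by N rather than N − 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 100_000
sigma2 = 4.0                                   # population variance (assumed)

samples = rng.normal(scale=np.sqrt(sigma2), size=(trials, n))
var_unbiased = samples.var(axis=1, ddof=1)     # divide by N - 1
var_biased = samples.var(axis=1, ddof=0)       # divide by N

print(var_unbiased.mean())   # close to 4.0
print(var_biased.mean())     # close to 4.0 * (N - 1) / N = 3.2
```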
\begin{align*}
\mathrm{var}[\hat{\mu}_X] &= E\!\left[(\hat{\mu}_X - \mu_X)^2\right] \\
&= E\!\left[\left(\frac{1}{N}\sum_{i=1}^{N}(X_i - \mu_X)\right)\left(\frac{1}{N}\sum_{j=1}^{N}(X_j - \mu_X)\right)\right] \\
&= \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} E\big[(X_i - \mu_X)(X_j - \mu_X)\big] \\
&= \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} \mathrm{cov}[X_i, X_j]. \tag{1.43}
\end{align*}
The first line follows by definition of variance. The second line follows by definition
of sample mean, and the fact that μX is constant. Importantly, the two summations
in the second line have different indices. The reason for this is that the sum should be
computed first then squared, which is equivalent to computing the sum two separate
times then multiplying them together. The third line follows by algebra. The last line
follows from the definition of covariance. Since the variables are independent and
identically distributed,
\[
\mathrm{cov}[X_i, X_j] =
\begin{cases}
0 & \text{if } i \ne j \\
\sigma_X^2 & \text{if } i = j
\end{cases}. \tag{1.44}
\]
Thus, the double sum in (1.43) vanishes whenever i ≠ j, and it equals σX² whenever
i = j. It follows that (1.43) can be simplified to
\[
\mathrm{var}[\hat{\mu}_X] = \frac{1}{N^2}\sum_{i=1}^{N} \sigma_X^2 = \frac{\sigma_X^2}{N}. \tag{1.45}
\]
The standard deviation σX /√N is known as the standard error of the mean.
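The σX /√N behavior of the sample mean is easy to confirm by simulation; the values of N and σX below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 25, 100_000
sigma = 2.0

# Many independent sample means, each based on N = 25 iid draws
means = rng.normal(scale=sigma, size=(trials, n)).mean(axis=1)

print(means.std(ddof=1))      # close to sigma / sqrt(N) = 0.4
print(sigma / np.sqrt(n))     # 0.4
```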
mean tend to cancel, yielding a number that is closer to the mean (on average) than
the individual random variables.
This result embodies a fundamental principle in statistics: arithmetic averages
have less variability than the variables being averaged. One way or another, every
statistical method involves some type of averaging to reduce random variability.
Acronym
Samples that are drawn independently from the same population are said to be
independent and identically distributed. This property is often abbreviated as iid.
Table 1.1. Values of (α, zα/2 ) that satisfy P (−zα/2 < Z < zα/2 ) = 1 − α for a standardized
normal distribution. The value of 1 − α is the fractional area under a standardized normal
distribution contained in the interval (−zα/2 , zα/2 ).
Figure 1.7 Illustration of the definition of zα/2 for a standardized normal distribu-
tion (i.e., a normal distribution with zero mean and unit variance).
\[
P\!\left(-z_{\alpha/2} \le Z < z_{\alpha/2}\right) = 1 - \alpha, \tag{1.47}
\]
or equivalently, as
P (μ − σ zα/2 < X < μ + σ zα/2 ) = 1 − α, (1.48)
where Z = (X − μ)/σ (i.e., a standardized Gaussian). The meaning of zα/2 is illus-
trated in Figure 1.7, and values of zα for some common choices of α are tabulated
in Table 1.1.
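The numerical entries of Table 1.1 did not survive extraction here, but the values of z_{α/2} can be recovered from the normal quantile function; the sketch below uses SciPy and the common choices α = 0.10, 0.05, and 0.01.

```python
from scipy.stats import norm

# z_{alpha/2} satisfies P(-z < Z < z) = 1 - alpha for a standard normal Z
for alpha in (0.10, 0.05, 0.01):
    z = norm.ppf(1 - alpha / 2)
    print(f"alpha = {alpha:4.2f}   z_alpha/2 = {z:.3f}")
# prints roughly 1.645, 1.960, and 2.576
```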
Notation
The statement that X is normally distributed with mean μ and variance σ 2 is denoted
as X ∼ N (μ,σX2 ). The symbol ∼ means “is distributed as.” A standardized normal
distribution is a normal distribution with zero mean and unit variance, N (0,1).
In essence, the Central Limit Theorem states that the sum of iid variables tends
to have a normal distribution, even if the original variables do not have a normal
distribution.
To illustrate the Central Limit Theorem, consider a discrete random variable X
that takes on only two values, −1 or 1, with equal probability. Thus, the probability
mass function of X is P (X = −1) = P (X = 1) = 1/2, and is zero otherwise.
A histogram of samples from this distribution is shown in the far left panel of
Figure 1.8. This histogram is very unlike a normal distribution. The mean and vari-
ance of this distribution are derived from (1.17):
\[
E[X] = \tfrac{1}{2}(-1) + \tfrac{1}{2}(1) = 0 \tag{1.51}
\]
\[
\mathrm{var}[X] = \tfrac{1}{2}(-1)^2 + \tfrac{1}{2}(1)^2 = 1. \tag{1.52}
\]
Now suppose the arithmetic mean of N = 10 random samples of X is computed.
Computing this arithmetic average repeatedly for different samples yields the his-
togram in the middle. The histogram now has a clear Gaussian shape. Although the
Figure 1.8 An illustration of the Central Limit Theorem using randomly gen-
erated ±1s. The random number is either +1 or −1 with equal probability. A
histogram of a large number of samples of this random variable is shown in
the far left panel – it is characterized by two peaks at ±1, which looks very
different from a Gaussian distribution (e.g., it has two peaks rather than one).
The mean of the population distribution is zero and the variance is 1. The
middle panel shows the result of taking the average of N =10 random ±1s over
many repeated independent trials. Superimposed on this histogram is a Gaussian
distribution with mean zero and variance 1/N, as predicted by the Central Limit
Theorem (1.50). The right panel shows the result of averaging N = 100 random
variables.
Central Limit Theorem applies only for large N, this example shows that the theorem
is relevant even for N = 10. For reference, the middle panel of Figure 1.8 also
shows a normal distribution evaluated from (1.50), using μX = 0 and σX = 1 from
(1.51) and (1.52). The histogram matches the predicted normal distribution fairly
well. Repeating this experiment with N = 100 yields the histogram on the far right
panel, which matches a normal distribution even better.
This example is an illustration of a Monte Carlo technique. A Monte Carlo tech-
nique is a computational procedure in which random numbers are generated from
a prescribed population and then processed in some way. The technique is useful
for solving problems in which the population is known but the distribution of some
function of random variables from the population is difficult or impossible to derive
analytically. In the example, the distribution of a sum of variables from a discrete
distribution was not readily computable, but was easily and quickly estimated using
a Monte Carlo technique.
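A sketch of this Monte Carlo experiment, in the spirit of Figure 1.8, is given below; the number of trials and bins are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
trials = 20_000

fig, axes = plt.subplots(1, 3, figsize=(10, 3))
for ax, n in zip(axes, (1, 10, 100)):
    # average of n random +/-1 values, repeated over many trials
    x = rng.choice([-1, 1], size=(trials, n)).mean(axis=1)
    ax.hist(x, bins=40, density=True)

    # normal density with mean 0 and variance 1/n, as predicted by the CLT
    grid = np.linspace(-1, 1, 200)
    ax.plot(grid, np.sqrt(n / (2 * np.pi)) * np.exp(-n * grid**2 / 2))
    ax.set_title(f"N = {n}")
plt.tight_layout()
plt.show()
```

As in the text, the histogram for N = 1 shows two spikes at ±1, while the histograms for N = 10 and N = 100 become progressively closer to the predicted normal curve.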
An important property of the normal distribution is that a sum of independent
normally distributed random variables also has a normal distribution.
Comment
From our theorems about expectations, we already knew the mean and variance of
iid variables. What is new in Theorem 1.2 is that if the X’s are normally distributed,
then Y is also normally distributed. In other words, we now know the distribution
of Y .
Example 1.13 (Distribution of the Sample Mean of Gaussian Variables) Ques-
tion: Let X1 , X2 , . . ., XN be independent random variables drawn from the normal
distribution N (μ,σX2 ). What is the distribution of the sample mean of these variables?
Answer: The sample mean is a sum of independent normally distributed random
variables. Therefore, by Theorem 1.2, the sample mean also has a normal distribu-
tion. Moreover, the expectation of the sample mean was shown in Example 1.8 to be
μX , and the variance of the sample mean was shown in Example 1.11 to be σX2 /N .
Therefore, the sample mean is normally distributed as
\[
\hat{\mu}_X \sim N\!\left(\mu_X,\ \frac{\sigma_X^2}{N}\right). \tag{1.58}
\]
The distributions (1.50) and (1.58) are identical, but the latter is exact because
the original variables were known to be normally distributed, whereas the former
holds only for large N because the original variables were not necessarily normally
distributed.
Recall that the sample variance (1.7) involves squares of a variable. Importantly,
X and X2 do not have the same distribution. The relevant distribution for squares
of normally distributed random variables is the chi-squared distribution.
\[
Y^2 \sim \chi_N^2. \tag{1.60}
\]
A corollary of Theorem 1.3 is that if Z1, . . . ,ZN are independent variables from
a standardized normal distribution, then

\[
Z_1^2 + Z_2^2 + \cdots + Z_N^2 \sim \chi_N^2. \tag{1.61}
\]
Figure 1.9 Illustration of the chi-squared distribution for three different values of
the degrees of freedom (Q = 1, 2, and 5).
\[
p(x) =
\begin{cases}
\dfrac{x^{N/2-1}\, e^{-x/2}}{2^{N/2}\, \Gamma(N/2)}, & x > 0 \\[1ex]
0, & \text{otherwise,}
\end{cases} \tag{1.62}
\]
where Γ(·) denotes the gamma function (a standard function in mathematics).
Computations involving this distribution rarely require working with the explicit
form (1.62). Instead, they can be performed using standard tables or statistical
packages.
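For instance, probabilities and quantiles of the chi-squared distribution are available directly in SciPy; the degrees of freedom and evaluation points below are arbitrary examples.

```python
from scipy.stats import chi2

# Probabilities and quantiles of the chi-squared distribution with
# N degrees of freedom, without touching the density formula (1.62)
N = 5
print(chi2.cdf(6.0, df=N))      # P(X <= 6) for 5 degrees of freedom
print(chi2.ppf(0.95, df=N))     # 95th percentile, roughly 11.07
```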
• var[χ²_N] = 2N
• If χ²_{N1} and χ²_{N2} are two independent random variables with chi-squared distributions
having N1 and N2 degrees of freedom, respectively, then χ²_{N1} + χ²_{N2} also
has a chi-squared distribution with N1 + N2 degrees of freedom. This additivity
property implies that the sum of any number of independent chi-squared variables
is also chi-squared distributed, with the degrees of freedom equal to the sum of
the degrees of freedom of the individual variables.
\[
\frac{(N-1)\,\hat{\sigma}_X^2}{\sigma_X^2} \sim \chi_{N-1}^2. \tag{1.63}
\]
\[
\sum_{n=1}^{N} \left(X_n - \hat{\mu}_X\right) = \sum_{n=1}^{N} X_n - \sum_{n=1}^{N} \hat{\mu}_X = N\hat{\mu}_X - N\hat{\mu}_X = 0. \tag{1.64}
\]
This constraint holds for all realizations of X1, . . . ,XN and does not depend on the
population or on whether the variables are iid. The constraint is a simple consequence of the
definition of the sample mean. Because of this constraint, the variables (Xn − μ̂X )
are not independent, even if X1, . . . ,XN are themselves independent. After all,
if we know any N − 1 values of (Xn − μ̂X ), we know the N’th value exactly,
hence the N ’th value is not random. Because the variables are not independent,
Theorem 1.3 cannot be invoked directly. The constraint (1.64) is a linear function