STATISTICAL LEARNING FOR BIG
DEPENDENT DATA
WILEY SERIES IN PROBABILITY AND STATISTICS
Established by Walter A. Shewhart and Samuel S. Wilks
The Wiley Series in Probability and Statistics is well established and authoritative.
It covers many topics of current research interest in both pure and applied statistics
and probability theory. Written by leading statisticians and institutions, the titles
span both state-of-the-art developments in the field and classical methods.
Reflecting the wide range of current research in statistics, the series encompasses
applied, methodological and theoretical statistics, ranging from applications and
new techniques made possible by advances in computerized practice to rigorous
treatment of theoretical approaches.
This series provides essential and invaluable reading for all statisticians, whether in
academia, industry, government, or research.
DANIEL PEÑA
Universidad Carlos III de Madrid
Madrid, Spain
RUEY S. TSAY
University of Chicago
Chicago, United States
This edition first published 2021
© 2021 John Wiley and Sons, Inc.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or
transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise,
except as permitted by law. Advice on how to obtain permission to reuse material from this title is
available at http://www.wiley.com/go/permissions.
The right of Daniel Peña and Ruey S. Tsay to be identified as the authors of this work has been asserted
in accordance with law.
Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
Editorial Office
111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products
visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content
that appears in standard print versions of this book may not be available in other formats.
Names: Peña, Daniel, 1948- author. | Tsay, Ruey S., 1951- author.
Title: Statistical learning for big dependent data / Daniel Peña, Ruey S.
Tsay.
Description: First edition. | Hoboken, NJ : Wiley, 2021. | Series: Wiley
series in probability and statistics | Includes bibliographical
references and index.
Identifiers: LCCN 2020026630 (print) | LCCN 2020026631 (ebook) | ISBN
9781119417385 (cloth) | ISBN 9781119417392 (adobe pdf) | ISBN
9781119417415 (epub) | ISBN 9781119417408 (obook)
Subjects: LCSH: Big data–Mathematics. | Time-series analysis. | Data
mining–Statistical methods. | Forecasting–Statistical methods.
Classification: LCC QA76.9.B45 P45 2021 (print) | LCC QA76.9.B45 (ebook)
| DDC 005.7–dc23
LC record available at https://lccn.loc.gov/2020026630
LC ebook record available at https://lccn.loc.gov/2020026631
To our good friend & mentor George C. Tiao (DP & RST)
PREFACE
For the first time in human history, we are collecting data everywhere and every
second. These data grow exponentially, are produced and stored at minimal cost,
and are changing the way we learn things and control our activities, including
monitoring our health and using our leisure time. Statistics, as a scientific discipline,
was created in a different environment, in which data were scarce, and it focused
mainly on extracting the available information from small, structured data sets as
efficiently as possible. New methods are therefore needed to extract useful
information from big data sets, which are heterogeneous and unstructured. These
methods are being developed in statistics, computer science, machine learning,
operations research, artificial intelligence, and other fields. They constitute what
is usually called data science. Most
advances in big data analysis so far have assumed that the data are collected from
independent subjects, so that they can be treated as independent observations. On
the other hand, empirical data are often generated over time or in space, and, hence,
they have dynamic and/or spatial dependence.
The main goal of this book is to learn from big dependent data. Data with tem-
poral dynamics have been studied in statistics as time series analysis, whereas spatial
dependence belongs to the newer area of spatial statistics. Several key ideas that
formed the core of recent developments in big data methods are from time series
analysis. For instance, the first model selection criterion was proposed by Akaike for
selecting the order of an autoregressive process, an iterative model building proce-
dure with nonlinear estimation was proposed by Box and Jenkins for autoregressive
integrated moving-average (ARIMA) models, and methods for combining forecasts,
which opened the way to model averaging and ensemble methods in machine learning,
can be traced back to Bayesian forecasting. Today, analyzing big data with temporal
and spatial dependence is needed in many scientific fields, ranging from economics
and business to environmental and health science, and to engineering and computer
vision. It is our belief that modeling and processing big dependent data is a key
emerging area of data science.
Several excellent books have been written for statistical learning with inde-
pendent data, but much less work has been done for dependent data. This book
tries to stimulate the use of available methods in forecasting and classification, to
promote further research in statistical methods for extracting useful information
from big dependent data, and to point out the potential weaknesses when the data
dependence is overlooked. We start with brief reviews of time series analysis, both
univariate and multivariate, followed by methods for handling heterogeneity when
analyzing many time series. We then discuss methods for classification and clustering
many time series and introduce dynamic factor models for modeling multivariate
or high-dimensional time series. This is followed by a thorough discussion on
forecasting in a data-rich environment, including nowcasting. We then turn to deep
learning and analysis of spatio-temporal data.
The book is applications oriented, but it also provides some basic theory to
help readers better understand the methods and procedures used. Due to the
high-dimensional nature of the problems discussed, we include some technical
derivations in a few sections, but for most parts, we refer readers interested in
rigorous mathematical proofs to the available literature. Real examples are used
throughout the book to demonstrate the analysis and applications. For empirical
data analysis, we use R extensively and provide the necessary instructions and R
scripts so that readers can reproduce the results shown in the book.
The book is organized as follows. Chapter 1 provides some examples of the data
sets considered in the book and introduces the basic ideas of temporal dependency,
stochastic process, and time series. It also discusses principal component analysis
and illustrates the weaknesses of traditional statistical inference when the data
dependence is overlooked. Chapter 2 starts with new ways to visualize large sets
of time series, summarizes the main properties of ARIMA and state space models,
including spectral analysis and Kalman filter, and presents a methodology for build-
ing univariate models for large sets of time series. Chapter 3 reviews the available
methods for building multivariate vector ARMA models and their limitations with
high-dimensional time series. It also discusses cointegration and ways to handle
unit-root multivariate nonstationary series. Real time series data are typically
heterogeneous, with clustering and certain common features, contain missing values
and outliers, and may encounter structural breaks. We address these issues in the
next two chapters. Chapter 4 presents missing value estimation and some new
methods for detecting and modeling outliers in univariate and multivariate time
series. The outliers considered include additive outliers, innovative outliers, level
shifts, transitory shifts, and parameter changes. Chapter 5 analyzes clustering, or
unsupervised classification methods, for time series, including recent procedures
for clustering time series based on their dependency or for selecting the number
of clusters. The chapter also considers time series discrimination, or supervised
classification, and covers approaches developed not only in statistics, but also in
machine learning such as support vector machines. Chapter 6 focuses on an active
research topic for modeling large sets of time series, namely dynamic factor models
and other types of factor models. The chapter describes in detail the developments
of the topic and discusses the pros and cons of several models available in the
literature, including those for the high-dimensional setting. Chapter 7 concentrates
on the key problem of forecasting large sets of time series. The application of Lasso
regression to time series is investigated, as are procedures developed mostly in
the statistics and econometrics literature for forecasting high-dimensional time series.
It also studies the method of nowcasting and demonstrates its usefulness with real
examples. Chapter 8 considers machine learning for forecasting and classification,
including neural networks and deep learning. It also studies methods developed
in the interface between Statistics and Machine Learning, commonly known as
Data Science, including Classification and Regression Trees (CART) and Random
Forests. Finally, Chapter 9 studies spatio-temporal data and their applications,
including various kriging methods for spatial predictions.
Dependent data cover a wide range of topics and methods, especially in the pres-
ence of high-dimensional data. It is too much to expect that a book can cover all
these important topics. This book is no exception. We needed to make decisions on the
material to include. Like other books, our choices depend heavily on our experience,
research areas, and preferences. There are some important subjects, especially those
under rapid development, that we do not cover, for instance, functional data analysis
for dependent data. We hope to include these and other relevant topics in a future
edition.
The book takes advantages of many available R packages. Yet, there remain some
methods discussed in the book that cannot be implemented with any existing pack-
age. We have developed some R scripts to perform such analyses. For instance, we
have developed a new automatic modeling procedure for a large set of time series,
which is used and demonstrated in Chapter 2. With the great help of Angela Caro and
Antonio Elías, we have collected these R scripts into a new package for the book,
which we refer to as the SLBDD package. In addition, the data sets used, except for
those subject to copyright protection, are also included. Some R scripts are provided
on the web page of the book at https://www.rueytsay.com/slbdd.
We are grateful to many people who have contributed to our research in general
and to this book in particular. We have dedicated the book to Professor George C.
Tiao, a giant in time series analysis and a pioneer of many procedures presented here.
He is a generous mentor and a good friend to us. We have also learned a lot from
our coauthors on topics covered in the book. In particular, we would like to acknowledge
the contributions from Andrés Alonso, Stevenson Bolívar, Jorge Caiado, Angeles
Carnero, Rong Chen, Nuno Crato, Chaoxing Dai, Soudeep Deb, Pedro Galeano,
Zhaoxing Gao, Victor Guerrero, Yuefeng Han, Nan-Jung Hsu, Hsin-Cheng Huang,
Ching-Kang Ing, Ana Justel, Mladen Kolar, Tengyuan Liang, Agustin Maravall,
Fabio Nieto, Pilar Poncela, Javier Prieto, Veronika Rockova, Julio Rodríguez,
Rosario Romera, Juan Romo, Esther Ruiz, Ismael Sánchez, Maria Jesús Sánchez,
Ezequiel Smucler, Victor Yohai, and Rubén Zamar. Some of the programs used
in our analyses were written by Angela Caro, Pedro Galeano, Carolina Gamboa,
Yuefeng Han, and Javier Prieto. We sincerely thank them for their wonderful work.
Chaoxing Dai helped us in many ways with the keras package for deep learning.
Without his help, we would not have been able to demonstrate its applications.
Finally, we have always had the constant support of our families. DP wants to
thank his wife, Jette Bohsen, for her constant love and encouragement in all his
projects, and, in particular, for her generous continued help during the years he was
writing this book. RST thanks his wife for her love and care and his children for their
support. Our families are the source of our energy and inspiration. Without their
unlimited and unselfish support, it would not have been possible for us to complete
this book, especially during the challenging time of the COVID-19 pandemic, during
which the book was finished. We hope that the methods presented in this book can
help, by learning from big, complex, dependent data, to prevent and manage the
challenging problems we will face in the future.
INTRODUCTION TO BIG DEPENDENT DATA
We hope that the book can help researchers in different scientific fields in their data
analysis. We also hope that it attracts more attention to, and motivates further research
on, the challenging problem of analyzing big dependent data.
EXAMPLES OF DEPENDENT DATA

In this section, we present some examples of big dependent data considered in the
book, introduce some statistical methods for describing such data, and provide a
framework for making statistical inference. Details of the methods introduced will
be given in later chapters. Our goal is to demonstrate the data we intend to analyze
and to highlight the consequences of ignoring their dependency. Whenever possi-
ble, we also illustrate the difficulties statistical methods developed for independent
observations are facing when they are applied to big dependent data.
By dependent data, we mean that observations of the variables of interest are taken
over time, across space, or both. Consequently, in these data the order in time or
space matters. By big data, we mean that the number of variables, k, is large, or the
number of data points, T, is large, or both; it may even be that k > T. Figure 1.1 shows a data
set of three time series representing the relative average temperature of November
in Europe, North America, and South America from 1910 to 2014. The data are in
the file Temperatures.csv of the book web site. The measurement is relative
to the average temperature of November in the twentieth century. These data are
dependent because observations of consecutive years tend to be closer to each other
than observations that are many years apart. Furthermore, the relative temperature
in South America shows a growing trend since 1980, whereas the trend can only be
seen in Europe starting from 2000, yet it does not appear in the North America data.
Statistical analysis of these temperature series is relatively easy, as the
number of series is only 3 and the sample size is small. It belongs to the multivariate
time series analysis in statistics. See, for instance, Tsay (2014).

Figure 1.1 Relative average temperatures of November in Europe, North America, and
South America from 1910 to 2014. The measurement is relative to the average temperature of
November in the twentieth century.

The main concern of
this book is statistical analysis when the number of time series is large or the time
intervals between observations are small, resulting in high-frequency data with large
sample sizes. Modeling big temperature data is important in many applications. As a
matter of fact, daily (even hourly) temperature is routinely taken at many locations
around the world. These temperatures affect the demands of electricity and heating
oil, are highly related to air pollution measurements, such as the particulate mat-
ter PM2.5 and ozone concentration, and play an important role in commodity prices
around the world.
Figure 1.2 shows the standardized daily stock indexes of the 99 most important
financial markets around the world, from 3 January 2000 to 16 December 2015
for 4163 observations. The data are in the file Stock_indexes_99world.csv.
The magnitudes of the stock indexes vary markedly from one market to another, so
we standardize the indexes in the plot. Specifically, each standardized index series
has mean zero and variance one. In applications, we often analyze the returns of
stock indexes. Figures 1.3 and 1.4 show the time plots of the first six indexes and their
log returns, respectively. The log returns are obtained by taking the first difference
of the logarithm of stock index.
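The book carries out these computations in R; purely as an illustrative sketch (ours, not the book's code, using a simulated index in place of a column of Stock_indexes_99world.csv), the two transformations, standardizing a series to mean zero and variance one and taking log returns as the first difference of the logged index, look as follows in Python:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate one "stock index" as an exponentiated random walk, a stand-in
# for a column of the data file used in the book.
log_price = np.cumsum(rng.normal(0.0005, 0.01, size=4163))
index = 100 * np.exp(log_price)

# Standardization: subtract the mean and divide by the standard deviation,
# so the plotted series has mean zero and variance one.
standardized = (index - index.mean()) / index.std()

# Log returns: first difference of the logarithm of the index.
log_returns = np.diff(np.log(index))

print(standardized.mean(), standardized.std())
print(log_returns.shape)  # one fewer observation than the index
```

Note that differencing shortens the series by one observation, which is why return plots start one period later than index plots.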
These time plots demonstrate several features of multiple time series. First, they
show the overall evolution and cross dependence of the financial markets. All mar-
kets exhibited a trough in 2003, had a steady increase reaching a peak around early
2008, then showed a dramatic drop caused by the 2008 financial crisis. Yet some mar-
kets experienced a gradual recovery after 2009. Second, the variabilities of the world
financial markets appear to be higher when the markets were down; see Figure 1.4.
This is not surprising as the fear factor, i.e. market volatility, is likely to be dominant
during a bear market. On the other hand, the ranges of the world market indexes at a
given time seem to be smaller when the markets were down; see Figure 1.2.
Figures 1.5 and 1.6 show the time plots of the daily financial indexes for 22 Asian
and 47 European markets, respectively, for the same time period, but with the x-axis
now being in time index.

Figure 1.2 Time plots of standardized daily stock market indexes of the 99 most important
financial markets around the world from 3 January 2000 to 16 December 2015.

Figure 1.3 Time plots of the first six standardized daily stock market indexes: 3 January 2000
to 16 December 2015.

Figure 1.4 Time plots of the log returns of the first six daily stock market indexes: 3 January
2000 to 16 December 2015.

Figure 1.5 Time plots of daily stock market indexes of 22 Asian financial markets from 3
January 2000 to 16 December 2015.

Figure 1.6 Time plots of daily stock market indexes of 47 European financial markets from
3 January 2000 to 16 December 2015.

The difference between Asian and European financial markets
is easily seen, implying that care must be exercised when analyzing similar time series
across different continents.
As a third example, Figure 1.7 shows 33 monthly consumer price indexes of
European countries from January 2000 to October 2015. The data are in the file
CPIEurope2000-15.csv. In the plot, the series have been standardized to have
zero mean and unit variance. As expected, the plot shows an upward trend in the
price indexes, but it also reveals notable differences between the series. For
example, some series show a clear seasonal pattern, whereas others do not.
Large sets of data are available in marketing research. As an example, Figure 1.8
shows the time plots of daily sales, in natural logarithms, of a clothing brand in 25
provinces in China from 1 January 2008 to 9 December 2012 for 1805 observations.
The data are from Chang et al. (2014) and are given in the file clothing.csv.
The plots exhibit a certain annual pattern, referred to as seasonal behavior in time
series analysis.

Figure 1.7 Time plots of 33 monthly price indexes of European countries from January 2000
to October 2015.

Figure 1.8 Time plots of daily sales in natural logarithms of a clothing brand in 25 provinces
in China from 1 January 2008 to 9 December 2012.

To gain further insight, Figure 1.9 shows the same time plots for the
first eight provinces. They are Beijing, Fujian, Guangdong, Guangxi, Hainan, Hebei,
Henan, and Hubei. The levels of the plots are adjusted so that there is no overlapping
in the figure. A special characteristic of the plots is that local peaks occur irregularly
in the early part of each year, followed by a drop in sales. This is caused by the
Chinese New Year holidays, whose dates vary from year to year. In addition, the peaks do not
occur for all provinces; see the second plot (from the top) of Figure 1.9. Analyzing
these series jointly requires modeling the common features as well as the variations
from one province to another.
Figure 1.9 Time plots of daily sales in natural logarithms of a clothing brand in eight
provinces in China from 1 January 2008 to 9 December 2012.
Figure 1.10 Time plots of hourly observations of the particulate matter PM2.5 from two mon-
itoring stations in southern Taiwan from 1 January 2006 to 31 December 2015.
Figure 1.10 shows the time plots of hourly observations of the particulate matter
PM2.5 from two monitoring stations in southern Taiwan from 1 January 2006 to 31
December 2015. We drop the data of February 29 for simplicity, resulting in a sample
size of 87 600. The data are given in the file TaiwanPM25.csv. The two series come
from two of the many monitoring stations throughout the island. Also available at each
station are measurements of temperature, dew point, and other air pollution indexes.
Thus, this is a simple example of big dependent data. The time plots exhibit certain strong
annual patterns and seem to indicate a minor decreasing trend. Figure 1.11 shows
the first 30 000 sample autocorrelations of the PM2.5 series of Station 1. The annual
cycle with periodicity s = 24 × 365 = 8760 is clearly seen from the autocorrelations.
Figure 1.12, on the other hand, shows the first 120 sample autocorrelations of the
same time series. The plot shows that there also exists a daily cyclic pattern in the
PM2.5 series with a periodicity s = 24.

Figure 1.11 First 30 000 sample autocorrelations of the PM2.5 series of Station 1.

Figure 1.12 First 120 sample autocorrelations of the PM2.5 series of Station 1.

Consequently, analysis of such big dependent
data should consider not only the overall trend, but also various cyclic patterns with
different frequencies.
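To see how such cyclic patterns show up in sample autocorrelations, here is a toy sketch in Python (ours; the book's analyses use R), with a simulated hourly series containing a period-24 cycle in place of the PM2.5 data:

```python
import numpy as np

rng = np.random.default_rng(7)

def sample_acf(x, max_lag):
    """Sample autocorrelations r_1, ..., r_max_lag with divisor T (the usual estimator)."""
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)])

# Simulated hourly series with a daily (period-24) cycle plus noise,
# a stand-in for one year of hourly PM2.5 readings.
t = np.arange(24 * 365)
x = 10 + 5 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 2, size=t.size)

acf = sample_acf(x, 48)
# The ACF peaks at multiples of 24 (lags 24, 48, ...) and is strongly
# negative at half-period lags such as 12 and 36.
print(acf[23], acf[11])
```

With an annual cycle added as well, the same computation carried out to 30 000 lags would reveal the periodicity s = 8760 discussed in the text.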
Finally, Figure 1.13 plots the monitoring stations (in circles) and the ozone levels
(in color-coded squares) of US Midwestern states on 20 June 1987. The data set is
available from the R package fields. The states are added to the plot so that readers
can understand the geographical features of ozone levels in the Midwest. Clearly,
on this particular day, high ozone readings appeared in the southwest of Illinois and
south of Indiana. Also, ozone levels of neighboring monitoring stations are similar.
This is an example of spatial process discussed in Chapter 9.
The six data sets presented are examples of multivariate spatio-temporal series,
where we observe the values of several variables of interest over time and space.
Figure 1.13 Locations and ozone levels at various monitoring stations in US Midwestern
states. The measurements are 8-hour averages (9 a.m. to 4 p.m.) on 20 June 1987.
The time sequence introduces dynamic dependency in a given series, which may be
time varying. The data also show cross dependence between series and the strength
of cross dependence may, in turn, depend on geographical locations or economic
relations among countries. One of the main focuses of the book is to investigate both
the dynamic and cross dependence between multiple time series.
The objectives of analyzing big dependent data include (a) to study the relation-
ships between the series, (b) to explore the dynamic dependence within a series and
across series, and (c) to predict the future values of the series. For spatial processes,
one might be interested in predicting the variable of interest at a location where
no observations are available. The first two objectives are referred to as data anal-
ysis and modeling, whereas the third objective involves statistical inference. Both
modeling and inference require not only methods for handling data, but also the
theory on which proper inference can be drawn. To this end, we need a framework
for understanding the characteristics of the time series under study and for making
sound predictions. We discuss this framework by starting with stochastic processes
in the next section.
STOCHASTIC PROCESSES

Theory and properties of stochastic processes form the foundation for statistical
analysis and inference of dependent data.
Define the mean of zt as

𝜇t = E(zt) = ∫R z ft(z) dz,

provided that the integral exists, where R denotes the real line and ft(·) denotes the
marginal density function of zt. The sequence 𝝁T1 = (𝜇1, 𝜇2, . . . , 𝜇T) is the mean
function of zT1. Define the variance of zt as

𝜎t2 = Var(zt) = ∫R (z − 𝜇t)2 ft(z) dz,

provided that the integral exists. The sequence {𝝈2t}T1 = (𝜎12, . . . , 𝜎T2) is the variance
function of zT1. Higher order moments can be defined similarly, assuming that they
exist.
A special case of interest is that all elements of zT1 have the same mean. In this
case, 𝝁T1 is a constant function. If all elements of zT1 have the same variance, then
{𝝈 2t }T1 is a constant function. If both 𝝁T1 and {𝝈 2t }T1 are constants for all finite T, then
the stochastic process zt is stable and we say that it is a homogeneous process. In
some applications, 𝝁T1 or {𝝈 2t }T1 may not exist.
In what follows, we assume that 𝜎t2 exists for all t. The linear dynamic dependence
of a stochastic process zt is determined by its autocovariance or autocorrelation func-
tion (ACF). For zt and zt−j , where j is a given integer, we define their autocovariance
as
𝛾(t, t − j) = Cov(zt, zt−j) = E[(zt − 𝜇t)(zt−j − 𝜇t−j)].     (1.1)

When j = 0, 𝛾(t, t) = Var(zt) = 𝜎t2. Also, from the definition, it is clear that 𝛾(t, t − j) =
𝛾(t − j, t). The autocorrelation between zt and zt−j is defined as

𝜌(t, t − j) = 𝛾(t, t − j)∕(𝜎t 𝜎t−j).     (1.2)
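Equations (1.1) and (1.2) define population quantities; in practice they are estimated by sample analogues. A minimal Python sketch (ours; the book works in R), using a simulated AR(1) process whose theoretical lag-j autocorrelation is 0.8**j:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a stationary AR(1) process z_t = 0.8 z_{t-1} + a_t with N(0, 1) shocks.
T, phi = 100_000, 0.8
a = rng.normal(size=T)
z = np.zeros(T)
for t in range(1, T):
    z[t] = phi * z[t - 1] + a[t]

def sample_autocov(z, j):
    """Sample analogue of gamma(t, t - j) for a stationary series (divisor T)."""
    zc = z - z.mean()
    return np.dot(zc[j:], zc[:len(z) - j]) / len(z)

# Sample autocorrelations, the analogue of Eq. (1.2): rho_j = gamma_j / gamma_0.
gamma0 = sample_autocov(z, 0)
rho = [sample_autocov(z, j) / gamma0 for j in (1, 2, 3)]
print(rho)  # close to the theoretical values 0.8, 0.64, 0.512
```

For a stationary series the estimates depend only on the lag j, not on t, which is why a single averaged estimate per lag makes sense.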
In this book, we use stationarity to denote weak stationarity of a stochastic process. For this process
the autocovariance and autocorrelation functions depend only on the difference of
lags and, in particular, we have 𝛾0 = 𝜎 2 , 𝛾j = 𝛾−j and 𝜌j = 𝜌−j , where j is an arbitrary
integer.
Assume that zt is stationary. For a given positive integer m, the covariance matrix
of (zt , zt−1 , . . . , zt−m+1 )′ is defined as
M z(m) = E{[(zt − 𝜇), (zt−1 − 𝜇), . . . , (zt−m+1 − 𝜇)]′ [(zt − 𝜇), (zt−1 − 𝜇), . . . , (zt−m+1 − 𝜇)]}

          ⎡ 𝛾0      𝛾1      · · ·  𝛾m−1 ⎤
        = ⎢ 𝛾1      𝛾0      · · ·  𝛾m−2 ⎥ .     (1.3)
          ⎢ ⋮       ⋮        ⋱     ⋮   ⎥
          ⎣ 𝛾m−1    𝛾m−2    · · ·  𝛾0   ⎦
This covariance matrix has the special pattern that (a) all diagonal elements are the
same, (b) all elements above and below the main diagonal are the same, and so on.
Such a matrix is called a Toeplitz matrix. The corresponding autocorrelation matrix
is
          ⎡ 1       𝜌1      · · ·  𝜌m−1 ⎤
Rz(m) =   ⎢ 𝜌1      1       · · ·  𝜌m−2 ⎥ ,     (1.4)
          ⎢ ⋮       ⋮        ⋱     ⋮   ⎥
          ⎣ 𝜌m−1    𝜌m−2    · · ·  1    ⎦
Consider the linear combination yt = c′zt,m = c1zt + c2zt−1 + · · · + cmzt−m+1,
where c = (c1, . . . , cm)′ ≠ 𝟎 and zt,m = (zt, zt−1, . . . , zt−m+1)′. Since yt is stationary, its
variance exists and is given by Var(yt) = c′M z(m)c ≥ 0, where M z(m) is defined in Eq. (1.3).
Consequently, M z(m) must be a nonnegative definite matrix. Similarly, the correla-
tion matrix Rz(m) must also be nonnegative definite.
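The Toeplitz structure of Eq. (1.3) and its nonnegative definiteness can be checked numerically. A short Python sketch (ours), using the theoretical autocovariances of an AR(1) process with coefficient 0.8 and unit shock variance:

```python
import numpy as np

# Autocovariances of a stationary AR(1) with phi = 0.8 and sigma_a = 1:
# gamma_j = phi**j / (1 - phi**2).
phi, m = 0.8, 5
gamma = np.array([phi**j / (1 - phi**2) for j in range(m)])

# Build the Toeplitz matrix M_z(m) of Eq. (1.3): entry (i, j) is gamma_{|i - j|},
# so all diagonals are constant.
M = gamma[np.abs(np.subtract.outer(np.arange(m), np.arange(m)))]

# Every quadratic form c' M c is a variance, so M must be nonnegative
# definite: all eigenvalues are >= 0 (here strictly positive).
eigvals = np.linalg.eigvalsh(M)
print(eigvals.min() > 0)
```

The same construction with gamma replaced by the autocorrelations (dividing by gamma[0]) produces the matrix Rz(m) of Eq. (1.4).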
Strict stationarity implies weak stationarity provided that the first two moments
of the process exist. But weak stationarity does not guarantee strict stationarity of a
stochastic process. Consider the following simple example. Let {zt } be a sequence
of independent and identically distributed standard normal random variables,
and define

xt = zt if t is even, and xt = (zt² − 1)∕√2 if t is odd.

It is easy to show that (a) E(xt) = 0, (b) Var(xt) = 1, and (c) E(xt xt−j) = 0 for j ≠ 0.
Therefore, {xt} is a weakly stationary process. On the other hand, xt is Gaussian if t
is even and is a scaled and shifted 𝜒² random variable with one degree of freedom when
t is odd. The process {xt} is not identically distributed and, hence, is not strictly stationary. If we assume
that zT1 follows a multivariate normal distribution for all T, then weak stationarity
is equivalent to strict stationarity, because a normal (or Gaussian) distribution is
determined by its first two moments.
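The example can be verified by simulation. The sketch below (ours) uses the classical construction xt = zt for even t and xt = (zt² − 1)∕√2 for odd t, which is consistent with the properties stated above; in Python:

```python
import numpy as np

rng = np.random.default_rng(1)

# Our reconstruction of the classical example: with z_t iid N(0, 1), define
# x_t = z_t for even t and x_t = (z_t**2 - 1)/sqrt(2) for odd t.
T = 1_000_000
z = rng.normal(size=T)
t = np.arange(T)
x = np.where(t % 2 == 0, z, (z**2 - 1) / np.sqrt(2))

# Weak stationarity: mean near 0, variance near 1, no serial correlation.
lag1 = np.mean(x[:-1] * x[1:])
print(x.mean(), x.var(), lag1)

# Not strictly stationary: odd-indexed observations are skewed (a scaled
# and shifted chi-square), even-indexed ones are symmetric Gaussian.
even, odd = x[0::2], x[1::2]
print(np.mean(even**3), np.mean(odd**3))
```

The third moments differ sharply between the two subsequences, confirming that the marginal distribution depends on t even though the first two moments do not.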
1.2.1.2 White Noise Process An important stationary process is the white noise
process. A stochastic process zt is a white noise process if it satisfies the following
conditions: (1) E(zt ) = 0; (2) Var(zt ) = 𝜎 2 < ∞; and (3) Cov(zt , zt−j ) = 0 for all j ≠ 0.
In other words, zt is a white noise series if and only if it has a zero mean and finite
variance, and is not serially correlated. Thus, a white noise is not necessarily strictly
stationary. If we impose the additional condition that zt and zt−j are independent and
have the same distribution, where j ≠ 0, then zt is a strict white noise process.
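The three defining conditions involve only the first two moments, so any iid sequence with mean zero and finite variance qualifies, Gaussian or not. A small Python check (ours) with uniform noise scaled to unit variance:

```python
import numpy as np

rng = np.random.default_rng(3)

# A non-Gaussian white noise: iid Uniform(-sqrt(3), sqrt(3)) has mean 0,
# variance (2*sqrt(3))**2 / 12 = 1, and no serial correlation.
T = 200_000
w = rng.uniform(-np.sqrt(3), np.sqrt(3), size=T)

# Empirically check the three conditions: zero mean, finite (unit)
# variance, and zero autocorrelation at low lags.
wc = w - w.mean()
var = wc.var()
acf = [np.dot(wc[:-k], wc[k:]) / np.dot(wc, wc) for k in (1, 2, 3)]
print(w.mean(), var, acf)
```

Because this particular sequence is iid, it is in fact a strict white noise; a series with uncorrelated but dependent observations (e.g. with time-varying conditional variance) would satisfy the three conditions without being strict.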
For a k-dimensional stochastic process zt = (z1t, . . . , zkt)′, define the mean function
as 𝝁t = E(zt), provided that the expectations exist, and the lag-𝓁 autocovariance
function as

𝚪(t, t − 𝓁) = Cov(zt, zt−𝓁) = E[(zt − 𝝁t)(zt−𝓁 − 𝝁t−𝓁)′],     (1.5)

provided that the covariances involved all exist, where 𝓁 is an integer. We say that the
vector process {zt } is weakly stationary if (a) 𝝁t = 𝝁, a constant vector, and (b) 𝚪(t, t −
𝓁) depends only on 𝓁. In other words, a k-dimensional stochastic process zt is weakly
stationary if its first two moments are time-invariant. In this case, we write 𝚪(t, t −
𝓁) = E[(zt − 𝝁)(zt−𝓁 − 𝝁)′ ] = 𝚪(𝓁) = [𝛾ij (𝓁)], where 1 ≤ i, j ≤ k. In particular, the lag-0
autocovariance matrix of a weakly stationary k-dimensional stochastic process zt is
given by
⎡ 𝛾11 (0) 𝛾12 (0) · · · 𝛾1k (0) ⎤
⎢ 𝛾 (0) 𝛾 (0) · · · 𝛾 (0) ⎥
𝚪(0) = ⎢ ⎥
21 22 2k
(1.6)
⎢ ⋮ ⋮ ⋱ ⋮ ⎥
⎢ ⎥
⎣ 𝛾k1 (0) 𝛾k2 (0) · · · 𝛾kk (0) ⎦
which is symmetric, because 𝛾ij (0) = 𝛾ji (0). Note that the diagonal element 𝛾ii (0) =
𝜎i2 = Var(zit ).
For a stationary vector process zt , its lag-𝓁 autocovariance matrix 𝚪(𝓁) = [𝛾ij (𝓁)]
measures the linear dependence among the component series zit and their lagged
variables zj,t−𝓁 . It pays to study the exact meaning of the individual elements 𝛾ij (𝓁) of 𝚪(𝓁).
First, the diagonal element 𝛾ii (𝓁) is the lag-𝓁 autocovariance of the scalar process zit
(i = 1, . . . , k). Second, the off-diagonal element 𝛾ij (𝓁) is

𝛾ij (𝓁) = Cov(zit , zj,t−𝓁 ),

which measures the linear dependence of zit on the lagged value zj,t−𝓁 for 𝓁 > 0.
Thus, 𝛾ij (𝓁) quantifies the linear dependence of zit on the 𝓁th past value of zjt . Third,
in general, 𝚪(𝓁) is not symmetric if 𝓁 > 0. As a matter of fact, 𝛾ij (𝓁) ≠ 𝛾ji (𝓁) for most
stochastic processes zt because, as stated before, 𝛾ij (𝓁) denotes the linear dependence
of zit on zj,t−𝓁 , whereas 𝛾ji (𝓁) quantifies the linear dependence of zjt on zi,t−𝓁 . Finally,
by stationarity we have

𝛾ij (−𝓁) = Cov(zit , zj,t+𝓁 ) = Cov(zjt , zi,t−𝓁 ) = 𝛾ji (𝓁).

Therefore, we have 𝛾ij (−𝓁) = 𝛾ji (𝓁) for all 𝓁. Consequently, the autocovariance matri-
ces of a stationary vector process zt satisfy

𝚪(−𝓁) = [𝚪(𝓁)]′ ,
and, hence, it suffices to consider 𝚪(𝓁) for 𝓁 ≥ 0 in practice. If the dimension of 𝚪(𝓁) is
needed to avoid any confusion, we write 𝚪(𝓁) ≡ 𝚪k (𝓁) with the subscript k denoting
the dimension of the underlying stochastic process zt .
A global measure of the overall variability of the process zt is given by its general-
ized variance, which is the determinant of the lag-0 covariance matrix 𝚪(0), i.e. |𝚪(0)|.
The measure standardized by the dimension is the effective variance, defined by

EV(zt ) = |𝚪(0)|^(1/k).

For instance, if zt is two-dimensional, then EV(zt ) = [𝛾11 (0)𝛾22 (0){1 − 𝜌²12 (0)}]^(1/2),
where 𝜌12 (0) is the contemporaneous correlation coefficient between z1t and z2t .
Clearly, for given 𝛾11 (0) and 𝛾22 (0), EV(zt ) assumes its largest value when the two series
are uncorrelated, i.e. 𝜌12 (0) = 0. For a k-dimensional process zt , as the determinant of
a square matrix is the product of its eigenvalues, the effective variance is the geometric
mean of the eigenvalues of 𝚪(0).
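To make the definitions concrete, here is a small Python sketch (an illustration; the covariance matrix below is made-up, not from the book) that computes the generalized and effective variance and verifies both the geometric-mean identity and the k = 2 formula above:

```python
import numpy as np

# Hypothetical lag-0 covariance matrix of a 2-dimensional process
gamma0 = np.array([[4.0, 1.2],
                   [1.2, 1.0]])

k = gamma0.shape[0]
gen_var = np.linalg.det(gamma0)      # generalized variance |Gamma(0)|
eff_var = gen_var ** (1.0 / k)       # effective variance EV = |Gamma(0)|^(1/k)

# EV equals the geometric mean of the eigenvalues of Gamma(0)
eigvals = np.linalg.eigvalsh(gamma0)
assert np.isclose(eff_var, np.prod(eigvals) ** (1.0 / k))

# For k = 2, EV = [g11 * g22 * (1 - rho12^2)]^(1/2)
rho12 = gamma0[0, 1] / np.sqrt(gamma0[0, 0] * gamma0[1, 1])
assert np.isclose(eff_var, np.sqrt(4.0 * 1.0 * (1 - rho12 ** 2)))

print(round(gen_var, 2), round(eff_var, 2))  # prints: 2.56 1.6
```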
We can summarize the properties of a stationary vector process zt by using linear
combinations of its components. Consider, for example, the linear process
yt = c′zt = ∑_{i=1}^{k} ci zit ,

or, more generally,

yt = ∑_{i=1}^{m} b′i zt−i+1 = b′1 zt + b′2 zt−1 + · · · + b′m zt−m+1 .

Given a realization {z1 , . . . , zT }, the sample mean of the process is

z̄ = (̄z1 , . . . , z̄ k )′ = (1/T) ∑_{t=1}^{T} zt . (1.14)
For an independent and identically distributed random sample, it is easy to see that
E(̄zi ) = E(zit ) = 𝜇i , and the variance of z̄ i is 𝜎i2 ∕T, where 𝜎i2 = Var(zit ). Therefore,
the sample mean in this case is a consistent estimator of the population mean 𝜇i as
T → ∞.
Note that for a stationary process, the consistency of the sample mean does not
necessarily hold. For example, let z11 be a random variable following a normal distribution
with mean zero and variance 𝜎12 . Let z1t = z11 for t = 2, . . . , T. This special process
{z1t } is stationary because (1) E(z1t ) = 0, which is time-invariant, (2) Var(z1t ) = 𝜎12 ,
which is also time-invariant, and (3) Cov(z1t , z1,t−j ) = 𝜎12 , which is time-invariant for all j.
However, in this particular case we have z̄ 1 = z11 and Var(̄z1 ) = 𝜎12 , no matter what
the sample size T is. Therefore, the sample mean is not a consistent estimator for this
special process as T → ∞.
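This failure of consistency can be simulated directly. In the Python sketch below (illustrative; the replication count, seed, and 𝜎1 = 2 are arbitrary choices), the variance of z̄ 1 across replications stays near 𝜎12 = 4 no matter how large T is, whereas an iid sample would give 𝜎12 /T:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma1 = 2.0
n_rep = 5_000
var_of_mean = {}

for T in (10, 500):
    z11 = rng.normal(0.0, sigma1, size=n_rep)   # one draw of z11 per replication
    paths = np.tile(z11[:, None], (1, T))       # z1t = z11 for t = 1, ..., T
    zbar1 = paths.mean(axis=1)                  # sample mean of each realization
    var_of_mean[T] = zbar1.var()

# Var(zbar1) stays near sigma1^2 = 4 whatever T is
print({T: round(v, 2) for T, v in var_of_mean.items()})
```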
The condition under which the sample mean z̄ i is a consistent estimator of the
population mean 𝜇i is referred to as the ergodic condition of a stochastic process. The
ergodic theory is concerned with conditions under which the time average of a long
realization (sample mean) converges to the space average (average of many random
samples at a given time index) of a dynamic system. For a scalar weakly stationary zt ,
the ergodic condition for z̄ to converge to 𝜇 = E(zt ) is that ∑_{i=0}^{∞} |𝛾i | < ∞. We assume
this condition in the book from now on.
For a weakly stationary vector process zt , the covariance matrix of the sample
mean z̄ is given by
Var(̄z) = E[(̄z − 𝝁)(̄z − 𝝁)′] = E[{(1/T) ∑_{t=1}^{T} (zt − 𝝁)}{(1/T) ∑_{j=1}^{T} (zj − 𝝁)}′]
       = (1/T²) E[∑_{t=1}^{T} (zt − 𝝁) ∑_{j=1}^{T} (zj − 𝝁)′]
       = (1/T²) ∑_{j=−(T−1)}^{T−1} (T − |j|) 𝚪(j). (1.15)
This implies that the asymptotic variance of the scalar sample mean z̄ i is

T × Var(̄zi ) → 𝛾ii (0) + 2 ∑_{𝓁=1}^{∞} 𝛾ii (𝓁), as T → ∞, (1.16)

where, again, 𝛾ii (𝓁) is the lag-𝓁 autocovariance of zit . From Equation (1.2), if {zit } is
white noise the variance of the sample mean z̄ i is 𝜎i2 ∕T. In general, we have

Var(̄zi ) ≈ (1/T)[𝛾ii (0) + 2 ∑_{l=1}^{T} 𝛾ii (l)], as T → ∞.
Therefore, Var(̄zi ) could be large if the sum of the autocovariances of zit is large. A
necessary (although not sufficient) condition for the sum of autocovariances of zit to
converge is
lim_{l→∞} 𝛾ii (l) = 0,
which implies that the dependence of zit on its past values zi,t−l vanishes as l increases.
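Equation (1.16) can be checked by simulation. In the Python sketch below (illustrative; the AR(1) model, its parameters, and the seed are my choices, picked because the AR(1) autocovariances 𝛾(l) = 𝜎² 𝜙^l /(1 − 𝜙²) are known in closed form), the Monte Carlo value of T × Var(̄z) matches the long-run variance 𝛾(0) + 2 ∑ 𝛾(𝓁) = 𝜎²/(1 − 𝜙)²:

```python
import numpy as np

phi, sigma, T, n_rep = 0.5, 1.0, 300, 10_000
rng = np.random.default_rng(1)

# Simulate n_rep paths of the AR(1) process z_t = phi * z_{t-1} + eps_t
eps = rng.normal(0.0, sigma, size=(n_rep, T))
z = np.zeros((n_rep, T))
for t in range(1, T):
    z[:, t] = phi * z[:, t - 1] + eps[:, t]

mc_var = z.mean(axis=1).var()          # Monte Carlo estimate of Var(zbar)

# For AR(1), gamma(l) = sigma^2 * phi^l / (1 - phi^2), so the limit in
# Eq. (1.16) is gamma(0) + 2 * sum_{l>=1} gamma(l) = sigma^2 / (1 - phi)^2
lrv = sigma**2 / (1 - phi) ** 2
print(round(T * mc_var, 2), lrv)       # the two numbers should be close
```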
The mean 𝝁 of a weakly stationary process zt is estimated by the sample mean z̄ of
Equation (1.14), and the lag-𝓁 autocovariance matrix 𝚪(𝓁) by its sample counterpart

𝚪̂(𝓁) = (1/T) ∑_{t=𝓁+1}^{T} (zt − z̄ )(zt−𝓁 − z̄ )′. (1.17)

In particular, the sample covariance matrix of zt is

𝚪̂(0) = (1/T) ∑_{t=1}^{T} (zt − z̄ )(zt − z̄ )′.
The lag-𝓁 sample cross-correlation matrix (CCM) is then defined as

R̂(𝓁) = D̂^{−1/2} 𝚪̂(𝓁) D̂^{−1/2}, (1.18)

where D̂ = diag{̂𝛾11 (0), . . . , 𝛾̂kk (0)}. From Eq. (1.18), we have

R̂(𝓁) = [𝜌̂ij (𝓁)] with 𝜌̂ij (𝓁) = 𝛾̂ij (𝓁) / √(̂𝛾ii (0) 𝛾̂jj (0)).
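The estimators in Equations (1.17) and (1.18) are straightforward to code. A Python sketch (illustrative; the book's own tools are in R) is:

```python
import numpy as np

def sample_ccm(z, lag):
    """Lag-`lag` sample cross-correlation matrix R_hat(lag).

    z : (T, k) array, one row per time point. Follows Eqs. (1.17)-(1.18):
    the sample autocovariance uses the divisor 1/T, then is standardized
    by the lag-0 sample variances.
    """
    T, k = z.shape
    zc = z - z.mean(axis=0)
    gamma = zc[lag:].T @ zc[: T - lag] / T     # Gamma_hat(lag), Eq. (1.17)
    d = np.sqrt(np.diag(zc.T @ zc / T))        # sqrt of gamma_ii(0)
    return gamma / np.outer(d, d)              # R_hat(lag), Eq. (1.18)

# Quick check on white noise: lag-1 cross-correlations are O(1/sqrt(T))
rng = np.random.default_rng(7)
z = rng.normal(size=(2000, 3))
print(np.round(sample_ccm(z, 1), 2))
```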
Under some mild conditions, R̂(𝓁) is a consistent estimator of R(𝓁) for a weakly
stationary process zt . The conditions are met if zt is a stationary Gaussian process.
The asymptotic properties of R̂(𝓁), however, are rather complicated; they depend
on the dynamic dependence of the process zt . The results for some special
cases, though, are easier to understand. Suppose the vector time series zt is a white noise
process so that R(𝓁) = 𝟎 for 𝓁 ≠ 0. Then, we have
Var[𝜌̂ij (𝓁)] ≈ 1/T, 𝓁 ≠ 0,

and, for i ≠ j,

Var[𝜌̂ij (0)] ≈ [1 − 𝜌2ij (0)]² / T. (1.19)
For further details, see, for instance, Reinsel (1993, section 4.1.2). These simple results
are useful in exploratory data analysis of vector time series. This is particularly so
when the dimension k is large. An important feature in analyzing time series data is
to explore the dynamic dependence of the observed time series. For a k-dimensional
series, each sample CCM R̂(𝓁) is a k × k matrix. If k = 10, then each CCM contains
100 sample correlations. It would not be easy to decipher the pattern embedded in
the sample CCMs R̂(𝓁) simultaneously for 𝓁 = 1, . . . , m, where m is a given positive
integer. To aid the reading of sample CCMs, Tiao and Box (1981) devised a simplified
approach to inspect the features of a sample CCM. Specifically, consider R̂(𝓁) for 𝓁 >
0. Tiao and Box define a corresponding simplified matrix S(𝓁) = [Sij (𝓁)], where
Sij (𝓁) = ⎧ +  if 𝜌̂ij (𝓁) ≥ 2∕√T,
          ⎨ ⋅  if |𝜌̂ij (𝓁)| < 2∕√T,
          ⎩ −  if 𝜌̂ij (𝓁) ≤ −2∕√T,

where it is understood that 1∕√T is the sample standard error of the elements of R̂(𝓁)
provided that zt is a Gaussian white noise. It is much easier to comprehend the
dynamic dependence of zt by inspecting the simplified CCM S(𝓁) for 𝓁 = 1, . . . , m.
We demonstrate the effectiveness of the simplified CCM via an example.
Example 1.1
Consider the three temperature series of Figure 1.1. The sample means of the three
series are 0.15, 0.14, and 0.12, respectively, for Europe, North America, and South
America. The sample standard deviations are 1.02, 1.06, and 0.47, respectively. The
average temperatures of November for South America appear to be lower and rel-
atively less variable compared with those of Europe and North America. Turning to
the sample CCMs: in this particular instance, k = 3, so each CCM contains 9 real
numbers. Suppose we would like to examine the dynamic dependence of the three series
using the first 12 lags of sample CCMs. This would require us to decipher 108 num-
bers simultaneously. On the other hand, the dynamic dependence pattern is relatively
easy to comprehend by examining the simplified CCMs, which are given below.
Simplified matrix:
CCM at lag: 1
. + +
. . +
. . +
CCM at lag: 2
. . +
. . +
+ . +
CCM at lag: 3
. . +
. . .
. + +
CCM at lag: 4
. . .
. . .
. + +
CCM at lag: 5
. . .
. . +
. + +
CCM at lag: 6
. . +
. . .
. . +
CCM at lag: 7
. . .
. . +
. . +
CCM at lag: 8
. . .
. . .
. . +
CCM at lag: 9
. . +
. . .
. . +
CCM at lag: 10
. . .
. . .
. . +
CCM at lag: 11
. . .
. . +
. . +
CCM at lag: 12
. . .
. . +
. . +
From the simplified CCM, we make the following observations. First, the dynamic
dependence of the temperature series of South America appears to be persistent as
all of its sample ACFs are large (shown by the + signs). This is not surprising as
the time plot of the series in Figure 1.1 shows an upward trend. Second, the tem-
perature series of Europe and North America are dynamically correlated with that
of South America because there exist several + signs at the (1,3)th and (2,3)th ele-
ments of the CCMs. Finally, there seems to be no strong dynamic dependence in
November temperatures between Europe and North America, because most sample
cross-correlations at the (1,2)th and (2,1)th positions are small. ◾
da <- read.table("temperatures.txt", header=TRUE)
da <- da[,-1]   # remove the time index
require(MTS)    # the MTS package provides ccm()
ccm(da)
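For readers working outside R, the Tiao-Box simplification takes only a few lines in any language. A hypothetical Python sketch (the matrix below is made-up data at some lag, not the temperature CCM of the example):

```python
import numpy as np

def simplify_ccm(R, T):
    """Map a sample CCM to Tiao-Box symbols using the 2/sqrt(T) threshold."""
    thr = 2.0 / np.sqrt(T)
    S = np.full(R.shape, ".", dtype=object)
    S[R >= thr] = "+"
    S[R <= -thr] = "-"
    return S

# Hypothetical sample CCM for T = 100 observations (threshold 2/sqrt(100) = 0.2)
R = np.array([[0.05, 0.30, -0.25],
              [0.10, -0.02, 0.18],
              [-0.01, 0.21, 0.40]])
for row in simplify_ccm(R, T=100):
    print(" ".join(row))
# prints:
# . + -
# . . .
# . + +
```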
• If the dimension k is close to or larger than the sample size T, then the afore-
mentioned properties of sample CCMs do not hold. We shall discuss the situa-
tion in later chapters, e.g. Chapter 6.
• When the dimension k is large, even the simplified CCM S(𝓁) becomes hard
to comprehend and some summary statistics must be used. In this book, we
provide three summary statistics to extract helpful information embedded in
the sample CCMs. They are given below:
1. P-value plot: A scatterplot of the p-values of the null hypothesis H0 ∶ R(𝓁) =
𝟎 versus lag 𝓁. The test statistic used asymptotically follows a 𝜒² distribution
with k² degrees of freedom under the null hypothesis that there are no serial or
cross-correlations in zt for 𝓁 > 0. Details of the test statistic are given in Chapter 3.
2. Diagonal element plot: A scatterplot of the fraction of significant diagonal
elements of R̂(𝓁) versus 𝓁. Here an element of R̂(𝓁) is said to be significant
if it is greater than 2∕√T in absolute value.
3. Off-diagonal element plot: A scatterplot of the fraction of significant
off-diagonal elements of R̂(𝓁) versus 𝓁. Again, the significance of an
element of R̂(𝓁) is with respect to 2∕√T.
The R command for the three summary plots is Summaryccm in the SLBDD pack-
age, the package associated with this book.
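A Python sketch of summary statistics (2) and (3) may also be helpful (illustrative only; the book's Summaryccm in the SLBDD R package produces all three plots, and the chi-square test behind (1) is deferred to Chapter 3):

```python
import numpy as np

def significant_fractions(z, max_lag):
    """Fractions of significant diagonal / off-diagonal elements of R_hat(lag),
    with significance judged against the 2/sqrt(T) white-noise yardstick."""
    T, k = z.shape
    thr = 2.0 / np.sqrt(T)
    zc = z - z.mean(axis=0)
    d = np.sqrt(np.diag(zc.T @ zc / T))            # sqrt of gamma_ii(0)
    diag_frac, off_frac = [], []
    for lag in range(1, max_lag + 1):
        R = (zc[lag:].T @ zc[: T - lag] / T) / np.outer(d, d)
        sig = np.abs(R) > thr
        diag_frac.append(np.diag(sig).mean())
        off_frac.append(sig[~np.eye(k, dtype=bool)].mean())
    return diag_frac, off_frac

# On simulated white noise, both fractions hover near the nominal 5% level
rng = np.random.default_rng(3)
d_frac, o_frac = significant_fractions(rng.normal(size=(1000, 5)), max_lag=12)
print(np.round(d_frac, 2), np.round(o_frac, 2))
```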
Example 1.2
Consider the daily log returns of the 99 financial market indexes of the world in
Figure 1.2. Here k = 99 and it would be hard to comprehend any 99-by-99 cross cor-
relation matrix. Figure 1.14 shows the three summary statistics of the daily log returns.
The upper plot is the scatterplot of p-values. From the plot, all p-values are close to
zero indicating that there are serial and cross-correlations in the 99-dimensional log
returns for lags 1 to 12. The middle plot shows the fraction of significant diagonal
elements of R̂(𝓁). This plot shows that there are significant serial correlations in the
daily log returns of the 99 world financial indexes. But, as expected, the serial correla-
tions decay quickly so that the fraction approaches zero as 𝓁 increases. The bottom
plot shows the fraction of significant off-diagonal elements of R̂(𝓁) versus 𝓁. The
plot suggests that there is significant cross dependence among the daily log returns
of world financial market indexes when 𝓁 is small. As expected, the cross dependence
also decays to zero quickly as 𝓁 increases.
require(SLBDD)
da <- read.csv2("Stock_indexes_99world.csv", header=FALSE)
da <- as.matrix(da[,-1])   # remove the time index
rt <- diff(log(da))        # log returns
Summaryccm(rt)
◾
Figure 1.14 Scatterplots of three summary statistics of lagged CCMs (lags 1-12) of the daily
log returns of 99 world financial market indexes. The summary statistics are (a) p-values of
testing R(𝓁) = 𝟎, (b) fractions of significant diagonal elements of R̂(𝓁), and (c) fractions of
significant off-diagonal elements of R̂(𝓁).
Figure 1.15 Time plots of the first-differenced series of the 33 log monthly price indexes of
European countries from January 2000 to October 2015.
found in Dahlhaus (2012). In contrast to conventional time series analysis,
statistical inference for locally stationary processes is based on in-filled asymptotics.
To illustrate, consider the time-varying parameter autoregressive process
zt = 𝜙t zt−1 + 𝜎t 𝜖t , (1.20)
where {𝜖t } is a sequence of standard normal random variates, and 𝜙t and 𝜎t are
rescaled smooth functions such that 𝜙t ≡ 𝜙(t∕T) ∶ [0, 1] → (−1, 1) and 𝜎t ≡ 𝜎(t∕T) ∶
[0, 1] → (0, ∞), where T is the sample size. When T increases, the observations on
𝜙(t∕T) and 𝜎(t∕T) become denser in their domain [0,1] so that we can obtain bet-
ter estimates of the time-varying parameters. Figure 1.17 shows the time plot of a
realization with 300 observations from the time-varying parameter AR(1) model
of Eq. (1.20), where 𝜙(t∕T) = 0.2 + 0.4(t∕T) − 0.2(t∕T)2 and 𝜎(t∕T) = exp(t∕T) with
T = 300. As expected, the series fluctuates around zero with an increasing variabil-
ity. The process is locally stationary because it can be approximated by a stationary
series within a small time interval. For instance, in this particular example, the first
120 observations of Figure 1.17 appear to be weakly stationary.
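The simulation behind this example is easy to reproduce. A Python sketch under the stated parameter functions (the seed is my choice; the book's Figure 1.17 uses its own random draw) is:

```python
import numpy as np

T = 300
rng = np.random.default_rng(5)
u = np.arange(1, T + 1) / T                 # rescaled time t/T in (0, 1]
phi = 0.2 + 0.4 * u - 0.2 * u**2            # phi(t/T): [0, 1] -> (-1, 1)
sigma = np.exp(u)                           # sigma(t/T): [0, 1] -> (0, inf)

z = np.zeros(T)
eps = rng.standard_normal(T)
for t in range(1, T):
    z[t] = phi[t] * z[t - 1] + sigma[t] * eps[t]   # Eq. (1.20)

# The variability grows over time, as in the description of Figure 1.17:
# compare the sample standard deviations of the first and last 100 points
print(round(z[:100].std(), 2), round(z[-100:].std(), 2))
```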
Figure 1.16 Time plots of the first and seasonally differenced series of the 30 monthly price
indexes of European countries from January 2000 to October 2015.