Improving Earthquake Prediction With Principal Component Analysis: Application To Chile
All content following this page was uploaded by Francisco Martínez-Álvarez on 09 April 2015.
G. Asencio-Cortés (1), F. Martínez-Álvarez (1), A. Morales-Esteban (2),
J. Reyes (3), and A. Troncoso (1)
(1) Department of Computer Science, Pablo de Olavide University of Seville, Spain
{guaasecor,fmaralv,ali}@upo.es
(2) Department of Building Structures and Geotechnical Engineering, University of Seville, Spain
ame@us.es
(3) NT2 Labs, Chile
daneel@geofisica.cl
1 Introduction
2 Related work
The prediction of earthquakes has been thoroughly studied because of the
devastating effect they may have on human activity, as discussed by Panakkat
and Adeli in 2008 [17] and, later in 2012, by Tiampo and Shcherbakov [23]. These
two surveys reveal the vast number of approaches proposed based on geophysical
assumptions and statistical procedures.
However, a new collection of techniques based on data mining is emerging
as a powerful set of tools, as pointed out in recent reviews [3, 15].
Seismicity indicators were first proposed as inputs for supervised classifiers
in [16]. The authors selected artificial neural networks as the classifier
to make predictions. The zone studied was Southern California. The same authors
refined their own approach two years later and obtained even better results for
the same location [18].
Another new set of seismicity indicators was proposed in [21]. This time
most of them were based on the well-known Gutenberg-Richter b-value [8], as
well as on the Båth and Omori-Utsu laws. The zones under analysis were four
Chilean cities and their surroundings. The same set of seismicity indicators was
used to predict earthquakes in the Iberian Peninsula [12] and even to discover
seismic zoning in Croatia [13].
Another active zone, India, has been analyzed with supervised classifiers [5].
The tectonic regions of Northeast India have also been explored [22]. The
authors retrieved earthquake data from NOAA and USGS catalogues and proposed
two non-linear forecasting models. Both approaches are stable and suggest the
existence of a certain seasonality in earthquake occurrence in this area.
Zamani et al. [24] also proposed a set of seismicity indicators but, this time,
their performance was assessed before the Qeshm earthquake in South Iran. They
used artificial neural networks and an adaptive neuro-fuzzy inference system.
The seismicity in Greece has also been studied [10]. In this work, the authors
used only the magnitude of the previous earthquakes as input and obtained a high
accuracy rate for medium earthquakes. However, the rate decreased considerably
when major seismic events were considered. The northern Red Sea area is no
exception and has also been analyzed in [1]. The authors compared the
performance of their approach to several Box-Jenkins models.
All the methods mentioned have obtained, in general terms, good results.
However, they all share one feature: none of them questioned the quality of
the input sets used or tried to transform them. This gap in the literature
justifies the present research study.
3 Methodology
[Figure: flow diagram of the methodology. Set of inputs #1 and set of inputs #2
-> PCA -> new sets -> ANN, J48, RF models -> predictions.]
Please note that a prediction is made every time an earthquake of magnitude
larger than 3.0 occurs. In [21], it was shown that the cutoff magnitude for
the earthquake database of Chile is 3.0 (M0 = 3.0). Owing to the high seismic
activity of the areas under study, a prediction is made almost daily.
4 Experimental study
This section presents the results obtained. First, the quality measures used to
assess the methodology's performance are introduced in Section 4.1. Section 4.2
describes the datasets used in this work. Then, the setup of all the algorithms
used is reported in Section 4.3. Finally, the results for Santiago and
Pichilemu are presented in Section 4.4.
To assess the performance of the ANNs designed, several parameters have been
used. In particular:
1. True positives (TP). The number of times that an upcoming earthquake has
been properly predicted.
2. True negatives (TN). The number of times that neither the ANN triggered
an alarm nor an earthquake occurred.
3. False positives (FP). The number of times that the ANN erroneously
predicted the occurrence of an earthquake.
4. False negatives (FN). The number of times that the ANN did not trigger an
alarm but an earthquake did occur.
The combination of these parameters leads to the calculation of:

$$\mathrm{NPV} = \frac{TN}{TN + FN} \tag{1}$$

$$\mathrm{PPV} = \frac{TP}{TP + FP} \tag{2}$$

where NPV denotes the well-known negative predictive value, and PPV the
well-known positive predictive value.
Additionally, two more parameters that correspond to common statistical
measures of supervised classifier performance have been used to evaluate the
performance of the ANNs. These two parameters, sensitivity or rate of actual
positives correctly identified as such (denoted by Sn) and specificity or rate
of actual negatives correctly identified (denoted by Sp), are defined as:

$$S_n = \frac{TP}{TP + FN} \tag{3}$$

$$S_p = \frac{TN}{TN + FP} \tag{4}$$

To take all these measures into consideration globally, the arithmetic mean
of the four parameters will be calculated. This average could be weighted to
reinforce, for instance, specificity (high reliability when no alarms are
triggered) or PPV (high reliability when an alarm is triggered). However, this
would require choosing a subjective weight for each of them. For this reason,
the authors have decided to calculate just the simple arithmetic average.
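
As an illustration of Equations (1) to (4) and of the unweighted average, the following minimal Python sketch computes the four quality measures from the confusion counts. The class and function names, and the counts in the example, are invented for illustration and do not come from the experiments reported here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # earthquakes correctly predicted
    tn: int  # no alarm and no earthquake
    fp: int  # alarm without earthquake
    fn: int  # earthquake without alarm


def _ratio(numerator, denominator):
    # Return NaN instead of raising when a denominator is zero.
    return numerator / denominator if denominator else float("nan")


def quality_measures(c):
    """Compute NPV, PPV, Sn, Sp and their simple arithmetic mean."""
    npv = _ratio(c.tn, c.tn + c.fn)  # Eq. (1)
    ppv = _ratio(c.tp, c.tp + c.fp)  # Eq. (2)
    sn = _ratio(c.tp, c.tp + c.fn)   # Eq. (3)
    sp = _ratio(c.tn, c.tn + c.fp)   # Eq. (4)
    average = (npv + ppv + sn + sp) / 4  # unweighted mean of the four
    return {"NPV": npv, "PPV": ppv, "Sn": sn, "Sp": sp, "Average": average}


# Hypothetical counts, chosen only to exercise the formulas.
counts = ConfusionCounts(tp=40, tn=50, fp=10, fn=20)
measures = quality_measures(counts)
for name, value in measures.items():
    print(f"{name}: {100 * value:.2f}%")
```

A weighted variant would only need to replace the last line of `quality_measures` with a weighted sum, which is exactly where the subjective choice of weights would enter.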
4.2 Datasets description
This section describes the datasets used. Two cities of Chile especially
affected by significant quakes have been analyzed. In particular, Santiago and
Pichilemu earthquake data have been retrieved from a public repository managed
by the University of Chile's National Service of Seismology [4].
The areas analyzed around each city are 1° × 1° for Santiago and 0.5° × 0.5°
for Pichilemu, following the description in [21]. Moreover, these cities
are the main cities in seismic regions #5 and #4, as determined in [20]. This
information is summarized in Table 2.
As for the length of the datasets, Santiago's training set contains the
linearly independent vectors generated from August 10th 2005 to March 31st 2010.
Analogously, the test set includes the vectors generated from April 1st 2010 to
October 8th 2011. As for Pichilemu, the training set contains the linearly
independent vectors generated from August 10th 2005 to March 31st 2010. Its test
set includes the vectors generated from April 1st 2010 to October 8th 2011.
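
The chronological split described above can be sketched as follows. This is only an illustration under stated assumptions: each record is taken to be a `(date, vector)` pair, the function name is hypothetical, and the sample records are invented. The removal of linearly dependent vectors is not shown.

```python
from datetime import date

# Window boundaries as stated in the text (same for both cities).
TRAIN_START = date(2005, 8, 10)
TRAIN_END = date(2010, 3, 31)
TEST_START = date(2010, 4, 1)
TEST_END = date(2011, 10, 8)


def split_by_date(records):
    """Split (day, vector) pairs into training and test lists by date."""
    train = [r for r in records if TRAIN_START <= r[0] <= TRAIN_END]
    test = [r for r in records if TEST_START <= r[0] <= TEST_END]
    return train, test


# Invented records, used only to show the boundaries.
records = [
    (date(2005, 8, 9), [0.1]),    # before the training window: dropped
    (date(2007, 1, 15), [0.2]),   # training
    (date(2010, 3, 31), [0.3]),   # last training day
    (date(2010, 4, 1), [0.4]),    # first test day
    (date(2011, 10, 9), [0.5]),   # after the test window: dropped
]
train, test = split_by_date(records)
print(len(train), len(test))
```

Splitting by date rather than at random keeps every test vector strictly later than every training vector, which matches the way predictions would be issued in practice.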
Table 3. Setup for the methods combined with PCA to assess the methodology's
performance.

Method  Setup
ANN     numLayers=AUTO, learningRate=0.2, momentum=0.2
RF      maxDepth=0, numFeatures=0, numTrees=10
J48     confidenceFactor=0.25, minNumObject=2, numFolds=3, unpruned=FALSE
The PCA has been run in the R environment [19]. In particular, the
psych package, version 1.4.2.3, has been used with Varimax rotation. The
analysis has been carried out with two to thirteen principal components in
order to find the number of components that best transforms the data. This
range has been chosen because the number of considered attributes is fourteen.
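
The sweep over the number of components can be sketched in Python as shown below. This is not the R/psych procedure used in this work: it is a plain PCA computed with NumPy's SVD, the Varimax rotation is omitted, the function name is hypothetical, and the input matrix is random data standing in for the fourteen seismicity indicators.

```python
import numpy as np


def pca_scores(X, n_components):
    """Project standardized data onto its first n_components principal axes.

    Assumes no column of X is constant (the standard deviation must be > 0).
    """
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(Xs, full_matrices=False)
    return Xs @ vt[:n_components].T


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 14))  # synthetic stand-in: 200 vectors, 14 attributes

# One transformed input set per number of components, from 2 to 13.
new_sets = {k: pca_scores(X, k) for k in range(2, 14)}
for k, scores in new_sets.items():
    print(k, scores.shape)
```

Each entry of `new_sets` would then be passed to the classifiers (ANN, J48, RF), and the number of components giving the best average of the four quality measures would be retained.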
4.4 Study cases: Santiago and Pichilemu
Fig. 2. Feature distribution into the eleven principal components for Santiago.
(Heat map; x-axis: principal components 1 to 11; y-axis: features 1 to 14;
color scale from 10 to 60.)
Table 5 summarizes the results for the city of Pichilemu. The best results,
this time, have been obtained when PCA has been used with three components
prior to the application of ANN, J48 and RF, as proposed in the methodology.
The average result for the four quality parameters without previous PCA is
64.08%, whereas an average of 73.03% has been obtained when PCA has
been previously applied, that is, a relative improvement of 13.97%. Figure 3
illustrates the degree of membership of the fourteen attributes listed in
Table 1 (same order in the figure as in the table) in the three principal
components generated.
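
For reference, the relative improvement quoted for Pichilemu follows directly from the two averages:

$$\frac{73.03 - 64.08}{64.08} = \frac{8.95}{64.08} \approx 0.1397 = 13.97\%$$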
Fig. 3. Feature distribution into the three principal components for Pichilemu.
(Heat map; x-axis: principal components 1 to 3; y-axis: features 1 to 14;
color scale from 10 to 60.)
The best result has been obtained when RF has been applied after PCA
with three components, with an average result of 75.57%. Except for J48
without PCA, again three out of the four best results correspond to
configurations with previous PCA. Particularly remarkable is the low number of
FP that RF generated when PCA has been applied: 12 versus 126 generated without
previous PCA. This fact is especially important because too many false alarms
generally make the predictions quite unreliable.
Again, the sum of FP and FN is significantly lower when PCA is applied. This
number is 149 without PCA (FP + FN = 149) and 65 with PCA (FP + FN = 65),
a value 56.38% lower. That is, the inclusion of PCA in the general prediction
scheme makes its outputs more reliable, at the cost of detecting fewer
earthquakes.
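
Taking 149 and 65 as the two error totals, the quoted reduction can be checked as:

$$\frac{149 - 65}{149} = \frac{84}{149} \approx 0.5638 = 56.38\%$$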
Figures 4 and 5 illustrate the average results of applying PCA with two to
thirteen components. The best results are achieved with eleven and three
components for Santiago and Pichilemu, respectively. As for the rest of the
values, an adequate application of PCA may lead to major improvements in terms
of accuracy, as this work set out to show. Even though some particular numbers
of components decrease the classifiers' performance, a suitable choice can lead
to much better results.
Additionally, in order to avoid possible smoothing of the results when averaging
all quality parameters, Figures 6 and 7 depict the value of every parameter
separately for the cities of Santiago and Pichilemu, respectively. It is
noteworthy that Sn and NPV reach values verging on 90% in both cases. In other
words, the FP rate remains at a very low value, which is highly desirable given
the social alarm that FP generally cause. By contrast, Sp and PPV are not
as high as desired but high enough to be considered satisfactory.
Fig. 4. Average results for Santiago, when PCA varies from 2 to 13 components.
(Line plot; x-axis: number of principal components; y-axis: average (%);
series: without PCA, with PCA.)
Fig. 5. Average results for Pichilemu, when PCA varies from 2 to 13 components.
(Line plot; x-axis: number of principal components; y-axis: average (%);
series: with PCA, without PCA.)
5 Conclusions
The use of PCA has been shown to be useful for improving earthquake prediction
in Chile. By including this step in an existing methodology, the results
obtained, in terms of average accuracy, have improved significantly. In
particular, artificial neural networks, classification trees and random forest
algorithms have been successfully applied to Santiago and Pichilemu, two of the
Chilean cities with the highest seismic activity. The results reported in this
work suggest that this methodology could be applied to any other region in the
world. As future work, given the imbalanced nature of the data to be classified
[7], techniques that address this imbalance are intended to be applied,
especially when considering higher magnitude thresholds, where the imbalance
between the classes becomes more evident.
Acknowledgments.
The financial support from the Junta de Andalucía, under project P12-TIC-1728,
and from the Pablo de Olavide University of Seville, under grant APPB813097,
is acknowledged.
[Two line plots, one per city: value (%) of NPV, PPV, Sn and Sp against the
number of principal components, from 2 to 13.]
References