Introduction to


Diego Quiros, Ph.D.

Lecturer in Geophysics
September 2022
Geophysics Module
Lecture 3 - Inverse Theory and Earthquake Location

1. Inverse Theory

1. Linear regression

2. The method of least squares

3. Derivation of least squares for a line

4. Least squares solution for linear inverse problems

5. Solution of least squares for a line

2. Earthquake Location

1. Linearization of the earthquake location problem

2. Least squares solution for earthquake location

3. Example: Numerical Earthquake location for constant velocity

4. Examples from the literature and complexities

Inverse Theory
Scientists frequently try to relate the physical parameters that characterize a model, m, to a set of collected observations, the data d. Assuming that the fundamental physics relating m and d is adequately understood, we can write this relationship as a function G

G (m) = d

Finding d given m is known as a forward problem. This essentially uses a physical theory to predict the outcome of some measurements. For example, using Newton's law of gravity to uniquely predict the gravity field around a planet, given the distribution of mass inside the planet, is a forward problem.

An inverse problem is finding m given a set of observations d. Back to the example of gravity, the inverse problem uses measurements of the gravity field to infer the mass distribution inside the planet. The complication here is that different distributions of mass can give exactly the same gravity field. This is called non-uniqueness.

Figure. A forward problem goes from the model (physical properties, the unknowns) through the physics to the observables (measurements, the data). An inverse problem goes the other way, from the data back to the model.

Linear Regression

A simple yet helpful example of an inverse problem is that of regression. Everyone should be familiar with the equation for a straight line y = a + b x. The forward problem would be: given a and b, find the value of y. The inverse problem is then to find the values of a and b given a set of observations (xi, yi).

Because observations contain errors, the data will not fit any single pair a and b exactly, and we need to find the best fit model (m = [a, b]), that is, the line that best fits the data. A useful approximate solution can be found by finding a particular model m that minimizes the misfit between the actual data and Gm. The points do not fall on a perfect line because of noise, instrument errors and analytical errors.

Figure. Observations are dots while asterisks are predictions for the best fit model.

The residual vector (r = d - Gm) is the vector of differences between the observed data and the corresponding model predictions. For example, in the figure above, the i-th observation $y_i^{obs}$ can be compared with the predictions of many possible models (m1 = [a1, b1], m2 = [a2, b2], ... ), each of which gives its own prediction $y_i^{pre}$ for that observation. For each model (m1, m2, ... ) we can then calculate the residual vector.
The Method of Least Squares

Let's first use the model m1 to illustrate how the residual vector looks for a dataset of 8 observations (see Figure). Each residual is the observed (measured) value minus the predicted value.

$$r(m_1) = d - G m_1 = \begin{bmatrix} y_1^{obs} - y_1^{pre,\,m_1} \\ y_2^{obs} - y_2^{pre,\,m_1} \\ \vdots \\ y_8^{obs} - y_8^{pre,\,m_1} \end{bmatrix}$$

We can similarly write the residual vector for any other model (e.g. m2).

Figure. Observations are dots while asterisks are predictions for the best fit model.

The question becomes, which of the models (m1, m2, ... ) results in
the smallest error when compared to the observations?
First we need to define what we mean by error. In the method of least squares we define the error E as the squared L2-norm of the residual vector:

$$E(m) = \sum_{i=1}^{n} r_i(m)^2 = r^T r$$

In this equation E(m) is the total error for a particular model m and ri(m) are the elements of the residual vector, just as defined above in the example for 8 data points, except that here we have generalized to n observations. Finally, $r^T$ stands for the transpose of the vector r.
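
As a small illustration of this definition, the sketch below computes E for two candidate line models. The observations and both models are made-up values chosen only for illustration, and the function name `squared_misfit` is our own.

```python
import numpy as np

def squared_misfit(d_obs, d_pre):
    """Total error E = r^T r for the residual r = d_obs - d_pre."""
    r = np.asarray(d_obs, dtype=float) - np.asarray(d_pre, dtype=float)
    return float(r @ r)

# Hypothetical observations (x_i, y_i), invented for this example.
x = np.array([0.0, 1.0, 2.0, 3.0])
y_obs = np.array([1.0, 3.0, 5.0, 7.0])

# Two candidate models m = [a, b] for the line y = a + b x.
m1 = (1.0, 2.0)  # reproduces the data exactly
m2 = (0.0, 2.0)  # intercept is off by 1

E1 = squared_misfit(y_obs, m1[0] + m1[1] * x)
E2 = squared_misfit(y_obs, m2[0] + m2[1] * x)
print(E1, E2)
```

The model with the smaller E is the better fit in the least squares sense; here m1 has zero error because the invented data lie exactly on its line.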

Now that we have a definition of error how do we find which model m gives the smallest error?
Least Squares Solution for a line

The problem is the elementary calculus problem of locating the minimum of the function E(m) = E(a, b). In calculus, to find the minimum of a function we set the derivative(s) to zero and solve the resulting equations. In the case of the linear regression problem the function E has two variables, a and b, so we need to take the partial derivative of E with respect to each variable and set it to zero. Here di are the data observed at the different points xi, and a and b are the parameters of the model.

$$\frac{\partial E}{\partial a} = \frac{\partial}{\partial a} \sum_{i}^{N} \left[ d_i - a - b x_i \right]^2 = 2 N a + 2 b \sum_{i}^{N} x_i - 2 \sum_{i}^{N} d_i = 0$$

$$\frac{\partial E}{\partial b} = \frac{\partial}{\partial b} \sum_{i}^{N} \left[ d_i - a - b x_i \right]^2 = 2 a \sum_{i}^{N} x_i + 2 b \sum_{i}^{N} x_i^2 - 2 \sum_{i}^{N} x_i d_i = 0$$


These two equations are then solved simultaneously for a and b yielding the classic formulas for the
least squares fitting of a line.
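
For reference, dividing both equations by 2 and solving the resulting 2 x 2 system gives the closed-form expressions below. This is standard algebra rather than something written on the slide, and it uses the same symbols (N data di observed at xi).

```latex
\begin{aligned}
N a + b \sum_i x_i &= \sum_i d_i, \\
a \sum_i x_i + b \sum_i x_i^2 &= \sum_i x_i d_i, \\[4pt]
b &= \frac{N \sum_i x_i d_i - \sum_i x_i \sum_i d_i}
          {N \sum_i x_i^2 - \left( \sum_i x_i \right)^2},
\qquad
a = \frac{\sum_i d_i - b \sum_i x_i}{N}.
\end{aligned}
```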

Least squares can be extended to the general linear inverse problem (i.e., not just fitting a line).
That is, as long as the problem can be written as d = Gm (as opposed to d = G(m) ) we can find the
general solution in the least squares sense.
Least Squares Solution of the Linear Inverse Problem

To compute the least squares solution we follow the same recipe as before. We need to compute
the derivative of the error E with respect to one of the model parameters say mq and set the result
to zero.

Note that we changed the notation slightly: mq now refers to the q-th parameter of the model m, while before m1, m2, ... , referred to different models from a set of models.

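$$E(m) = r^T r = (d - Gm)^T (d - Gm) = \sum_i \Big[ d_i - \sum_j G_{ij} m_j \Big] \Big[ d_i - \sum_k G_{ik} m_k \Big]$$

Taking the derivative of E with respect to mq and setting it to zero yields:

$$\frac{\partial E}{\partial m_q} = 0 = 2 \sum_k m_k \sum_i G_{iq} G_{ik} - 2 \sum_i G_{iq} d_i$$

Writing this equation in matrix notation, where $G^T$ is the transpose of the matrix G, yields:

$$G^T G m - G^T d = 0$$

We solve this equation for m, the best model according to the error equation.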

Note that the quantity $G^T G$ is a square M x M matrix and that it multiplies a vector m of length M. $G^T d$ is also a vector of length M. Presuming that $[G^T G]^{-1}$ exists, we have the estimate of the model parameters

$$m^{est} = [G^T G]^{-1} G^T d$$

This is the least squares solution to the inverse problem Gm = d.
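
A minimal NumPy sketch of this solution is given below. The function name `least_squares` and the small quadratic test problem are our own choices for illustration. The result is compared with NumPy's `np.linalg.lstsq`, which should agree when $G^T G$ is invertible.

```python
import numpy as np

def least_squares(G, d):
    """Solve the normal equations G^T G m = G^T d for m."""
    G = np.asarray(G, dtype=float)
    d = np.asarray(d, dtype=float)
    # Solving the linear system avoids forming [G^T G]^-1 explicitly.
    return np.linalg.solve(G.T @ G, G.T @ d)

# Hypothetical problem: N = 5 data, M = 3 parameters, d = m1 + m2 z + m3 z^2.
z = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
G = np.column_stack([np.ones_like(z), z, z**2])
m_true = np.array([1.0, -2.0, 0.5])
d = G @ m_true  # noise-free synthetic data

m_est = least_squares(G, d)
m_ref, *_ = np.linalg.lstsq(G, d, rcond=None)
print(m_est)
print(m_ref)
```

With noise-free data the estimate recovers the model used to generate the data; with noisy data it would return the model with the smallest squared misfit instead.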

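In the straight line problem the model is di = m1 + m2 zi, so there are two model parameters (intercept and slope), and G is the matrix that connects them to the data. The equation Gm = d has the form

$$\begin{bmatrix} 1 & z_1 \\ 1 & z_2 \\ \vdots & \vdots \\ 1 & z_N \end{bmatrix} \begin{bmatrix} m_1 \\ m_2 \end{bmatrix} = \begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_N \end{bmatrix}$$

The matrix products required by the least squares solution are $G^T G$ and $G^T d$

$$G^T G = \begin{bmatrix} N & \sum_i z_i \\ \sum_i z_i & \sum_i z_i^2 \end{bmatrix}, \qquad G^T d = \begin{bmatrix} \sum_i d_i \\ \sum_i z_i d_i \end{bmatrix}$$

Here N is the number of data points, $\sum_i z_i$ is the sum of all z values, $\sum_i z_i^2$ is the sum of the squares of z, $\sum_i d_i$ is the sum of the data values, and $\sum_i z_i d_i$ is the sum of the products of the data values and the z values.

This gives the least squares solution (the best fitting model)

$$m^{est} = [G^T G]^{-1} G^T d = \begin{bmatrix} N & \sum_i z_i \\ \sum_i z_i & \sum_i z_i^2 \end{bmatrix}^{-1} \begin{bmatrix} \sum_i d_i \\ \sum_i z_i d_i \end{bmatrix}$$

where the superscript T denotes the transpose and the superscript -1 denotes the inverse.
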
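
The structure of these matrix products can be checked numerically. In the sketch below the z values and data are invented for illustration (they lie exactly on d = 1 + 2 z), and the variable names are our own.

```python
import numpy as np

# Hypothetical data lying exactly on the line d = 1 + 2 z.
z = np.array([1.0, 2.0, 3.0, 4.0])
d = 1.0 + 2.0 * z

# G has a column of ones (for the intercept m1) and a column of z (for the slope m2).
G = np.column_stack([np.ones_like(z), z])

GTG = G.T @ G
GTd = G.T @ d

# The same quantities written as explicit sums.
GTG_sums = np.array([[len(z), z.sum()],
                     [z.sum(), (z**2).sum()]])
GTd_sums = np.array([d.sum(), (z * d).sum()])

m_est = np.linalg.solve(GTG, GTd)
print(GTG, GTd, m_est)
```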

Earthquake Location

We first consider the classic inverse problem of locating an earthquake and finding its origin time
using the arrival times of seismic waves at various stations. The velocity structure, which determines
the ray paths and hence the travel times, is crucial. Here we regard the velocity structure as known.

Assume that an earthquake occurred at an unknown time t, at an unknown position x = (x, y, z). The position x is the hypocenter. The point (x, y) on the surface above the hypocenter is called the epicenter. The event is recorded by n seismic stations at locations xi = (xi, yi, zi), and each station detects the earthquake with an arrival time di'. The arrival time depends on the origin time t and the travel time T(x, xi) between the hypocenter and the station.

di' = T(x, xi) + t

Figure. Geometry of the problem: the hypocenter at (x, y, z), the epicenter on the surface above it, and station i at (xi, yi, 0).

For the earthquake location problem the relation between the model parameters and the data is non-linear, even for the simple example of a constant velocity medium. For a constant velocity v, the arrival time at a station with coordinates (xi, yi, zi) from a hypocenter at (x, y, z) is

$$d_i' = \frac{1}{v} \sqrt{(x - x_i)^2 + (y - y_i)^2 + (z - z_i)^2} + t$$

Clearly di' does not scale linearly with x, y, or z in this equation. The result is that we cannot use standard methods of solving linear equations. However, the problem can be linearized.
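
The forward problem for a constant velocity medium can be sketched in a few lines. The function name `arrival_times`, the single station, and the source positions below are hypothetical values chosen so the numbers are easy to check by hand.

```python
import numpy as np

def arrival_times(m, stations, v):
    """Predicted arrival times d_i = T(x, x_i) + t for constant velocity v.

    m        : (x, y, z, t), hypocenter in km and origin time in s
    stations : array of shape (n, 3) with (x_i, y_i, z_i) in km
    v        : velocity in km/s
    """
    x, y, z, t = m
    stations = np.asarray(stations, dtype=float)
    dist = np.sqrt((x - stations[:, 0])**2
                   + (y - stations[:, 1])**2
                   + (z - stations[:, 2])**2)
    return dist / v + t

# One surface station 3 km east and 4 km north of a source at the origin:
# 5 km at 5 km/s is 1 s of travel time, plus an origin time of 1 s.
stations = np.array([[3.0, 4.0, 0.0]])
d_near = arrival_times((0.0, 0.0, 0.0, 1.0), stations, v=5.0)

# Non-linearity: doubling the x coordinate of the source does not double the time.
d_a = arrival_times((10.0, 0.0, 0.0, 0.0), stations, v=5.0)
d_b = arrival_times((20.0, 0.0, 0.0, 0.0), stations, v=5.0)
print(d_near, d_a, d_b)
```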
Linearization of the Earthquake Location Problem

To linearize the earthquake location problem, let's begin with a starting model m°, which is an estimate of (or guess at) a model that we hope is close to the solution we seek. The starting model predicts that we would have observed data di°, calculated from the arrival time formula above. Unless we are very lucky, these predicted data are not what was actually observed.

Hence we seek changes Δmj in the starting model

$$m_j = m_j^{\circ} + \Delta m_j$$

that will make the predicted data closer to the observed data. Since for this problem the data do not depend linearly on the model parameters, we linearize the problem by expanding the data in a Taylor series about the starting model m° and keeping only the linear term

$$d_i \approx d_i^{\circ} + \sum_j \left. \frac{\partial d_i}{\partial m_j} \right|_{m^{\circ}} \Delta m_j$$

This equation can be written in terms of the difference between the observed data and the predicted data

$$\Delta d_i^{\circ} = d_i' - d_i^{\circ} \approx \sum_j \left. \frac{\partial d_i}{\partial m_j} \right|_{m^{\circ}} \Delta m_j^{\circ}$$

For simplicity we drop the superscripts and define the partial derivative matrix as

$$G_{ij} = \frac{\partial d_i}{\partial m_j}$$

Linearized Earthquake Location

Using the partial derivative matrix notation the equation for Δdi° becomes

$$\Delta d = G \, \Delta m \quad \text{or} \quad \Delta d_i = \sum_j G_{ij} \, \Delta m_j$$

This equation is a linear inverse problem, and to solve it we can apply the same method used for the least squares solution of the linear inverse problem.

We generally have arrival time observations at many (often several hundred) seismic stations, and
are solving for only 4 model parameters. This means that in the equation above j = 1, ... , 4 and
i = 1, ... , n. As we said, generally n is much greater than 4.

The matrix G therefore has a number of rows equal to the number of arrival time observations and a number of columns equal to the number of model parameters, so G generally has many more rows than columns.

The vector Δm is the difference between the current guess of the hypocenter (and origin time) and the next guess.

Least Square Solution for Earthquake Location

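Because G is not a square matrix it does not have an inverse. To obtain a solution we can minimize the error function, just as we did before

$$E = \sum_i \Big( \Delta d_i - \sum_j G_{ij} \, \Delta m_j \Big)^2$$

Setting the derivative of E with respect to each Δmq to zero and writing the result in matrix notation leads to the familiar equation

$$G^T G \, \Delta m - G^T \Delta d = 0$$

for which we know the solution for Δm, the change in the model

$$\Delta m^{est} = [G^T G]^{-1} G^T \Delta d$$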

The operator $[G^T G]^{-1} G^T$, which acts on the data to yield the best fit model, is called the generalized inverse of G and is written $G^{-g}$. It provides the "best" solution in a least squares sense, because it gives the smallest squared misfit.

The generalized inverse is the analog of the inverse for a matrix that is not square and hence does not have a conventional inverse. If G is square and has an inverse then $G^{-1} = G^{-g}$.
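
These properties can be checked with a short NumPy sketch. The function name `generalized_inverse` and both small matrices are hypothetical; the comparison with `np.linalg.pinv` relies on G having linearly independent columns, in which case the two should coincide.

```python
import numpy as np

def generalized_inverse(G):
    """Return G^-g = [G^T G]^-1 G^T (assumes G^T G is invertible)."""
    G = np.asarray(G, dtype=float)
    return np.linalg.inv(G.T @ G) @ G.T

# Hypothetical 4 x 2 matrix: more rows (data) than columns (parameters).
G = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
G_g = generalized_inverse(G)

# G^-g has the transposed shape of G and acts as a left inverse: G^-g G = I.
left_identity = G_g @ G

# For a square, invertible matrix it reduces to the ordinary inverse.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
A_g = generalized_inverse(A)

print(G_g.shape)
print(np.allclose(left_identity, np.eye(2)))
print(np.allclose(A_g, np.linalg.inv(A)))
```

In practice it is usually preferable to solve the system $G^T G \, \Delta m = G^T \Delta d$ directly rather than forming the inverse, but the explicit operator is shown here to mirror the text.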
Example: Earthquake Location for constant velocity

To make the concepts discussed less abstract we can consider the simple case of locating an earthquake in a homogeneous medium of uniform velocity v. In this case the raypaths connecting an earthquake and the seismic stations are straight lines.

As before we can write the arrival time at station i

$$d_i = T(x, x_i) + t = \frac{1}{v} \left[ (x - x_i)^2 + (y - y_i)^2 + (z - z_i)^2 \right]^{1/2} + t$$

For simplicity assume all the stations are at the surface (zero elevation), so zi = 0.

To solve the inverse problem we form the matrix Gij. The partial derivatives of the elements of the data vector di (the arrival times at each station) with respect to the model parameters mj (the location of the hypocenter and the origin time) are easily found.

Differentiating the i-th element of the data vector with respect to the first model parameter (the x position of the hypocenter) gives

$$G_{i1} = \frac{\partial d_i}{\partial m_1} = \frac{\partial d_i}{\partial x} = \frac{\partial T(x, x_i)}{\partial x} = \frac{(x - x_i)}{v} \left[ (x - x_i)^2 + (y - y_i)^2 + z^2 \right]^{-1/2}$$

Similar expressions, with (x - xi) in the numerator replaced by (y - yi) or z, give the partial derivatives with respect to y and z. The final partial derivative is with respect to the origin time, which is just

$$G_{i4} = \frac{\partial d_i}{\partial m_4} = \frac{\partial d_i}{\partial t} = 1$$
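
One way to gain confidence in these derivatives is to compare them with finite differences of the arrival time formula. The sketch below does this for surface stations (zi = 0); the function names, the three stations and the trial model are all hypothetical.

```python
import numpy as np

def arrival_times(m, stations, v):
    """Arrival times for surface stations (z_i = 0) and constant velocity v."""
    x, y, z, t = m
    dist = np.sqrt((x - stations[:, 0])**2 + (y - stations[:, 1])**2 + z**2)
    return dist / v + t

def partial_derivatives(m, stations, v):
    """Matrix G_ij = d(d_i)/d(m_j); columns correspond to x, y, z, t."""
    x, y, z, _ = m
    dx = x - stations[:, 0]
    dy = y - stations[:, 1]
    dist = np.sqrt(dx**2 + dy**2 + z**2)
    return np.column_stack([dx / (v * dist),
                            dy / (v * dist),
                            z / (v * dist),
                            np.ones(len(stations))])

# Hypothetical surface stations (km), velocity (km/s) and trial model.
stations = np.array([[10.0, 0.0], [0.0, 20.0], [-15.0, -5.0]])
v = 5.0
m = np.array([1.0, 2.0, 8.0, 0.5])

G = partial_derivatives(m, stations, v)

# Central finite differences, one model parameter at a time.
h = 1e-6
G_fd = np.empty_like(G)
for j in range(4):
    step = np.zeros(4)
    step[j] = h
    G_fd[:, j] = (arrival_times(m + step, stations, v)
                  - arrival_times(m - step, stations, v)) / (2 * h)

print(np.max(np.abs(G - G_fd)))
```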
Example: How to use the Generalized Inverse
To use the method we can follow a 'recipe'.

Step 1 is to begin with a starting model m° = (x°, y°, z°, t°) and predict the expected values of the data d°.

Step 2 is to form the residual vector giving the misfit of the model to the data, Δd° = d' - d°, where d' are the arrival times observed on the seismograms and d° are the predicted arrival times. This is done for all stations.

Step 3 is to evaluate the matrix of partial derivatives Gij about the starting model

$$G_{ij} = \left. \frac{\partial d_i}{\partial m_j} \right|_{m^{\circ}}$$

Step 4 is to use the generalized inverse $G^{-g} = [G^T G]^{-1} G^T$ to find Δm°, the change in the starting model that gives a better fit to the data.

Step 5 is to calculate the new model $m^1$ = m° + Δm° and predict the values of the data $d^1$. This new prediction of the data should be closer to the observations.

Step 6 is to form a new residual vector $\Delta d^1 = d' - d^1$.

Step 7 is to examine the error (squared misfit) as defined before

$$E^1 = \sum_i (\Delta d_i^1)^2 = \sum_i (d_i' - d_i^1)^2$$

This should be less than the corresponding misfit for the starting model, E° = Σ (Δdi°)². This process is repeated until successive iterations produce only small changes in the model and hence in the total misfit to the data.
Example: Using 10 stations to locate an event
The figure illustrates a hypothetical example of locating an earthquake with 10 stations located within a 100 km square. The earthquake occurred at time t = 0 s at the point (0, 0, 10) km.

We are going to try to locate it based on the arrival times at the 10 stations; this is our data d'. The data can be computed from the known hypocenter, d'i = T(x, xi). The station locations are

r1 = (35, 9), r2 = (-44, 10), r3 = (-11, -25), r4 = (23, -39),
r5 = (42, -27), r6 = (-12, 50), r7 = (-45, 16), r8 = (5, -19),
r9 = (-1, -11), r10 = (20, 11) km.

With the station coordinates, the known hypocenter, and a uniform velocity (5 km/s) we can calculate d'i. Notice that in the real world we observe d'i and never calculate it; however, this is an example and we need to create the data first. The arrival times, in seconds, are

d' = [7.499, 9.243, 5.817, 9.273, 10.184, 10.476, 9.759, 4.409, 2.979, 4.983]T

From here on we ignore the fact that we know the actual hypocenter and origin time of the event and start with a complete guess as the initial model, for example m° = (x, y, z, t) = (3, 4, 20, 2).

Now we follow the recipe given in the previous slide.
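
The synthetic data d' can be reproduced with a few lines of NumPy. The sketch below uses the station coordinates, hypocenter and velocity given above; the function and variable names are our own. The computed values should agree with the listed d' to within about a millisecond, since the listed values appear to be cut off at three decimals.

```python
import numpy as np

# Station coordinates (x_i, y_i) in km from the example; all stations at z_i = 0.
stations = np.array([[35, 9], [-44, 10], [-11, -25], [23, -39], [42, -27],
                     [-12, 50], [-45, 16], [5, -19], [-1, -11], [20, 11]],
                    dtype=float)
v = 5.0                                    # uniform velocity, km/s
m_true = np.array([0.0, 0.0, 10.0, 0.0])   # (x, y, z) in km, origin time in s

def arrival_times(m, stations, v):
    """Arrival times for surface stations and constant velocity v."""
    x, y, z, t = m
    dist = np.sqrt((x - stations[:, 0])**2 + (y - stations[:, 1])**2 + z**2)
    return dist / v + t

d_obs = arrival_times(m_true, stations, v)

# Values of d' as listed in the example, for comparison.
d_listed = np.array([7.499, 9.243, 5.817, 9.273, 10.184,
                     10.476, 9.759, 4.409, 2.979, 4.983])

print(np.round(d_obs, 3))
print(np.max(np.abs(d_obs - d_listed)))
```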

Example: Following the Recipe Steps 1 - 3
Step 1 is to calculate the predicted data for

m° = (x, y, z, t) = (3, 4, 20, 2)

Using the arrival time formula, the predicted data (arrival times) for the starting model are

d° = [9.613, 12.285, 9.581, 12.293, 12.736, 12.470, 12.673, 8.109, 7.063, 7.433]T

Step 2 is to calculate the residual vector (and the misfit or error while we are at it)

Δd° = d' - d° = [-2.1, -3.0, -3.7, -3.0, -2.5, -1.9, -2.9, -3.6, -4.0, -2.4]T
E° = Σ (Δdi°)² = 92.4 s²

The error is the sum of the squared residuals. This value is large, meaning that our model prediction is quite far off.

Step 3 is to evaluate G about m°.

For our case G has 10 rows and 4 columns, one row per station. Let's write the first row

$$G_{11} = \frac{(x - x_1)}{v} \left[ (x - x_1)^2 + (y - y_1)^2 + z^2 \right]^{-1/2}, \qquad G_{12} = \frac{(y - y_1)}{v} \left[ (x - x_1)^2 + (y - y_1)^2 + z^2 \right]^{-1/2}$$

$$G_{13} = \frac{z}{v} \left[ (x - x_1)^2 + (y - y_1)^2 + z^2 \right]^{-1/2}, \qquad G_{14} = 1$$

Similarly, we can set up the other 9 rows of the G matrix, one for each of the other stations.
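
Steps 1 to 3 can be checked numerically with the sketch below. It regenerates d' from the known hypocenter (an assumption of this sketch, so the values differ from the listed ones in the last digit), and the function and variable names are our own.

```python
import numpy as np

stations = np.array([[35, 9], [-44, 10], [-11, -25], [23, -39], [42, -27],
                     [-12, 50], [-45, 16], [5, -19], [-1, -11], [20, 11]],
                    dtype=float)  # km, surface stations
v = 5.0                           # km/s

def arrival_times(m, stations, v):
    x, y, z, t = m
    dist = np.sqrt((x - stations[:, 0])**2 + (y - stations[:, 1])**2 + z**2)
    return dist / v + t

def partial_derivatives(m, stations, v):
    x, y, z, _ = m
    dx, dy = x - stations[:, 0], y - stations[:, 1]
    dist = np.sqrt(dx**2 + dy**2 + z**2)
    return np.column_stack([dx / (v * dist), dy / (v * dist),
                            z / (v * dist), np.ones(len(stations))])

# "Observed" data, generated here from the known hypocenter (0, 0, 10) km, t = 0 s.
d_obs = arrival_times(np.array([0.0, 0.0, 10.0, 0.0]), stations, v)

m0 = np.array([3.0, 4.0, 20.0, 2.0])      # starting model
d0 = arrival_times(m0, stations, v)        # Step 1: predicted data
delta_d0 = d_obs - d0                      # Step 2: residual vector
E0 = float(delta_d0 @ delta_d0)            #         squared misfit
G0 = partial_derivatives(m0, stations, v)  # Step 3: G evaluated about m0

print(np.round(d0, 3))
print(round(E0, 1))
print(G0.shape, G0[0])
```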
Example: Following the Recipe Steps 4 - 7
Step 4 is to use the generalized inverse $G^{-g} = [G^T G]^{-1} G^T$ to find the change in the model, Δm° = $[G^T G]^{-1} G^T$ Δd°. This change is added to the original model to obtain a new model m1 = m° + Δm°.

Step 5 is to calculate the new model. The new model turns out to be m1 = (-0.5, -0.6, 10.1, 0.2), which is closer to the true hypocenter than the initial guess. From the new model we have to predict the data d1.

d1 = [7.827, 9.379, 5.883, 9.427, 10.408, 10.772, 9.911, 4.539, 3.101, 5.325]T

Step 6 is to calculate the residual vector (observed minus predicted data) and the misfit.

Δd1 = d' - d1 = [-0.4, -0.2, -0.1, -0.2, -0.3, -0.3, -0.2, -0.2, -0.2, -0.4]T
E1 = Σ (Δdi1)² = 0.5 s²

We started with a very large misfit; this one is much smaller, so we are close to the actual location.

If we do one more iteration, that is, go to Step 3 and evaluate G about m1 we can then obtain Δm1
and obtain a new model m2 = m1 + Δm1.

It turns out that m2 is equal to the actual hypocenter, that is m2 = (0, 0, 10, 0). Now obtaining the
exact solution does not happen in the real world because our observed data d' has noise that
arises from different sources. Nonetheless, in our example the data are noise-free and the
estimated model yields the true model exactly, which fits the data perfectly.
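
The whole recipe can be run as a short loop. The sketch below is one possible implementation under the assumptions of the example (constant velocity, surface stations, noise-free data regenerated from the known hypocenter); the names, the iteration limit and the stopping tolerance are our own choices.

```python
import numpy as np

stations = np.array([[35, 9], [-44, 10], [-11, -25], [23, -39], [42, -27],
                     [-12, 50], [-45, 16], [5, -19], [-1, -11], [20, 11]],
                    dtype=float)  # km, surface stations
v = 5.0                           # km/s

def arrival_times(m, stations, v):
    x, y, z, t = m
    dist = np.sqrt((x - stations[:, 0])**2 + (y - stations[:, 1])**2 + z**2)
    return dist / v + t

def partial_derivatives(m, stations, v):
    x, y, z, _ = m
    dx, dy = x - stations[:, 0], y - stations[:, 1]
    dist = np.sqrt(dx**2 + dy**2 + z**2)
    return np.column_stack([dx / (v * dist), dy / (v * dist),
                            z / (v * dist), np.ones(len(stations))])

# Noise-free "observed" data from the true hypocenter (0, 0, 10) km, t = 0 s.
d_obs = arrival_times(np.array([0.0, 0.0, 10.0, 0.0]), stations, v)

m = np.array([3.0, 4.0, 20.0, 2.0])  # starting model m0
history = []                          # (model, misfit) at each iteration
for iteration in range(10):
    delta_d = d_obs - arrival_times(m, stations, v)    # residual vector
    history.append((m.copy(), float(delta_d @ delta_d)))
    G = partial_derivatives(m, stations, v)            # G about current model
    delta_m = np.linalg.solve(G.T @ G, G.T @ delta_d)  # same as G^-g delta_d
    m = m + delta_m                                    # updated model
    if np.linalg.norm(delta_m) < 1e-8:                 # negligible change: stop
        break

for model, misfit in history:
    print(np.round(model, 2), round(misfit, 4))
print(np.round(m, 3))
```

Printing the history shows the model and squared misfit at each iteration, which can be compared with the values quoted in Steps 1 to 7.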
Example: Southern California Seismicity
Earthquake catalogs are dominated by small earthquakes yet catalogs are missing a much larger
number of even smaller earthquakes because they are harder to detect on seismograms. To
overcome this, Ross et al. (2019) applied a template-matching detection technique to the entire
catalog of the regional seismic network in Southern California. Their effort resulted in 1.81 million
events, a 10-fold increase that provides insights into the geometry of fault zones at depth.

Template-matching refers to using a known earthquake to look for undetected smaller events
within a catalog that are similar to the known event. Doing this for different types of earthquakes
results in many more events being detected.

Figure. Seismicity in the San Jacinto fault zone in southern California shown by two earthquake
catalogs. (left) Standard Southern California Seismic Network (SCSN) catalog. (right) Catalog
derived with an earthquake template matching algorithm by Ross et al. (2019).
Example: Double-Difference and Cross-correlation Earthquake Location
The goal of the previous example was to detect more events than in a standard catalog. Here the
goal is to relocate events in a catalog using advanced algorithms and techniques to better
delineate faults.

With the double-difference technique, a more advanced mathematical method for locating earthquakes, it is possible to obtain very precise relative locations, while the cross-correlation of waveforms reduces inaccuracies in the phase arrivals (picking P or S on a seismogram is not trivial).

Figure. (left) Seismicity of the San Andreas Fault system. (right) Relocated catalog using double-difference and cross-correlation. Notice how seismicity "clouds" are sharpened.

Example: Probabilistic Earthquake Location

As the last example, recall that the earthquake location problem is inherently non-linear. The approach we took was to linearize the problem using a Taylor series expansion. However, this is not the only way of solving the problem. There are a few methods that treat earthquake location as a non-linear problem; one of these is to adopt a probabilistic point of view.

Here a priori information on the model parameters is represented by a probability distribution over the 'model space'. The idea is that this a priori probability distribution is transformed into the a posteriori probability distribution by incorporating a physical theory.

This probabilistic formulation can be applied to any kind of inverse problem, including strongly
non-linear problems. The probabilistic formulation implies that the solution of an inverse
problem is not a model but a collection of models.

The figure shows the solution to the problem of finding the epicenter of an earthquake using a probabilistic approach. The black dots are the stations. The crescent shape indicates the solution to the problem: the earthquake is somewhere in this area. Every point within the crescent-shaped probability density is a solution mi that fits the data. Each point is a model, thus the solution to the inverse problem is a collection of models.

The crescent shape of the solution indicates that the azimuth of the event is not well resolved; on the other hand, the distance seems to be well determined (~ 15 km away from the network).

Figure. Probability density for the epicenter of an earthquake.

