Linear and Nonlinear Fitting Through Mathcad


EXAMPLES

Calibrating a Thermocouple
by Jean Giraud

Thermocouples measure temperature indirectly by measuring the voltage produced across dissimilar metals when there is a temperature gradient present at the junction. Converting any measured voltage to a temperature requires calibration of the thermocouple. Typically, thermocouples are calibrated using standard reference tables of coefficients published by NIST (http://srdata.nist.gov/its90/main/its90_main_page.html), which fit a piecewise polynomial curve to the standard ITS-90 reference temperatures (http://srdata.nist.gov/its90/tables/table_iii.html).

There are several problems with the temperature conversions obtained in this manner. First, the curve describing the voltage-to-temperature data is highly nonlinear and is, perhaps, better fit with a more complicated fit function. The calibrations published by NIST are piecewise for this reason, so you must know which temperature range you will measure before you can choose a set of calibration parameters. Next, the NIST data only represents the calibration of the thermocouple wires themselves, not any extended cabling, gradient thermal effects, time-dependent degradation, or other sources of error.

We will use a rational function to fit the calibration data. The fit
parameters will be stored so that subsequent measurements at
non-fixed-point temperatures can be calculated whenever new
measurements are taken.

It is unlikely that all the calibration points will be measured, but


you can only fit as many parameters as there are data points. For
demonstration purposes, use the calibrated values from NIST.

− − 
 −6.25629 −268.935 
  Values at the fixed points
 −6.22919 −259.340 
Fixed points °C mV
 −6.19773 −252.870 
  Helium NBP -268.935 -6.25629
 −6.17138 −248.595
 Hydrogen TP -259.340 -6.22919
 −6.15358 −246.048  Hydrogen NBP -252.870 -6.19773
  Neon TP -248.595 -6.17138
 −5.87302 −218.789
 Neon NBP -246.048 -6.15358
 −5.75328 −210.002  Oxygen TP -218.789 -5.87302
 −5.53559 −195.802
 Nitrogen TP -210.002 -5.75328
  Nitrogen NBP -195.802 -5.53559
 −5.31472 −182.962  Oxygen NBP -182.962 -5.31472
 −2.74070 −78.476 
Cabon Dioxide SP -78.476 -2.74070
  Mercury FP -38.862 -1.43494
ET :=  −1.43494 −38.862  Ice Point 0.00 0.00
 0.00000 0.000 
Ether TP 26.87 1.0679
  Water BP 100.00 4.2773
 1.06790 26.870  Benzoic TP 122.37 5.3414
 4.27730 100.000 
Indium FP 156.634 7.0364
  Tin FP 231.9681 11.0133
 5.34140 122.370  Bismuth FP 271.442 13.2188
 7.03640 156.634 
Cadmium FP 321.108 16.0953
  Lead FP 327.502 16.4733
 11.01330 231.968  Mercury BP 356.66 18.2179
 13.21880 271.442 
  type ' T ' Thermocouple (T/C) from NBS
 16.09530 321.108 
Monograph 125
 16.47330 327.502 
 
 18.21790 356.660 

i := 0 .. rows ( ET) − 1

Divide the matrix into X and Y vectors:

X := ET⟨0⟩        Y := ET⟨1⟩

[Plot: Y (temperature, °C) versus X (voltage, mV)]

Regression
Here, rationalfit is used to fit the data.

top_order := 7 bottom_order := 6

These would need to be reduced if there were fewer data points available.

cfit := rationalfit(X, Y, 0.95, top_order, bottom_order, 10⁻⁵, "noscale")

Poly(x, c, On, Od) := ( Σ_{n = 0 .. On} c_n·x^n ) / ( 1 + Σ_{n = 1 .. Od} c_{n+On}·x^n )

yfit_i := Poly(X_i, cfit⟨0⟩, top_order, bottom_order)

Look at the final fit. This seems to have good convergence at first
glance.
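
For readers working outside Mathcad, here is a minimal Python sketch of the same idea using scipy.optimize.curve_fit. It is only an analogue of rationalfit (a Data Analysis Extension Pack function): the names rational and temp are invented for this sketch, the initial guess is a plain polynomial fit, and convergence of 14 parameters on 21 points is not guaranteed without some tuning.

import numpy as np
from scipy.optimize import curve_fit

# Fixed-point calibration data from the table above: voltage (mV), temperature (degC).
mV = np.array([-6.25629, -6.22919, -6.19773, -6.17138, -6.15358, -5.87302,
               -5.75328, -5.53559, -5.31472, -2.74070, -1.43494, 0.0,
               1.0679, 4.2773, 5.3414, 7.0364, 11.0133, 13.2188,
               16.0953, 16.4733, 18.2179])
degC = np.array([-268.935, -259.340, -252.870, -248.595, -246.048, -218.789,
                 -210.002, -195.802, -182.962, -78.476, -38.862, 0.0,
                 26.87, 100.0, 122.37, 156.634, 231.968, 271.442,
                 321.108, 327.502, 356.66])

TOP, BOT = 7, 6   # numerator and denominator orders, as in the worksheet

def rational(x, *c):
    # (c0 + c1*x + ... + c_TOP*x^TOP) / (1 + c_{TOP+1}*x + ... + c_{TOP+BOT}*x^BOT)
    num = np.polyval(c[TOP::-1], x)
    den = 1.0 + x * np.polyval(c[:TOP:-1], x)
    return num / den

c0 = np.zeros(TOP + 1 + BOT)
c0[:TOP + 1] = np.polyfit(mV, degC, TOP)[::-1]   # polynomial fit as a starting guess
coef, _ = curve_fit(rational, mV, degC, p0=c0, maxfev=20000)

def temp(v_mV):
    # Voltage-to-temperature conversion using the fitted calibration.
    return rational(np.asarray(v_mV, dtype=float), *coef)

print(temp(-6.17138))   # compare with the Neon TP fixed point, -248.595 degC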

[Plot: Y and yfit versus X]

Residuals

[Plot: Y − yfit versus X, on a ±0.1 scale]

Use the stored fitted parameters for subsequent calculations:

K := cfit⟨0⟩
Define a temperature conversion function:

temp(mV) := Poly(mV, cfit⟨0⟩, top_order, bottom_order)

Input the measured voltage here:        point := 3

measured := ET_{point,0}        measured = −6.1714

Calculate the measured temperature given a measured voltage (in mV) here. We've used one of the fixed-point voltages to compare results, but any temperature in the range of the calibration is acceptable.

temp(measured) = −248.5921        ET_{point,1} = −248.595

Let's compare this with the polynomial expression. For a type T thermocouple, the coefficients from the table are

NIST := (0.100860910, 25727.94369, −767345.8295, 78025595.81, −9247486589, 6.97688·10¹¹, −2.66192·10¹³, 3.94078·10¹⁴)ᵀ

T(V) := NIST_0 + V·(NIST_1 + V·(NIST_2 + V·(NIST_3 + V·(NIST_4 + V·(NIST_5 + V·(NIST_6 + V·NIST_7))))))

in units of volts.

T(measured/10³) = −227.5048        temp(measured) = −248.5921

Actual value:        ET_{point,1} = −248.595

Calculate the residuals for the NIST fit and compare them with the new calibrated fit.

NISTY := T(X/10³)   (element-wise)

Comparison of the rational fit χ² error with the polynomial fit parameters:

Σ (Y − temp(X))² = 0.0135        Σ (Y − NISTY)² = 3647.8134

Improving the fit further

We can further improve and stabilize this fit by transforming the data. The closer the trend is to a straight line, the faster and better the rational polynomial will converge. We should also be able to use a smaller number of fit coefficients, which will give more stable results from the solvers and allow us to measure fewer data points for calibration. We can achieve this with the current data set by transforming the y variable to be y/x. To accomplish this, we'll strip out the (0,0) point from the data set.

 −6.25629 −268.935 

 −6.22919 −259.34 
 −6.19773 −252.87 
 
 −6.17138 −248.595 
 −6.15358 −246.048 
 
 −5.87302 −218.789 
 −5.75328 −210.002 
 
 −5.53559 −195.802 
 −5.31472 −182.962 
 
−2.7407 −78.476 
ET1 :=  i := 0 .. rows ( ET1) − 1
 −1.43494 −38.862 
 
 1.0679 26.87 
〈 0〉
 4.2773 100  X1 := ET1
 
 5.3414 122.37
 

 7.0364 156.634   ET1〈1〉 
 11.0133  Y1 :=
231.968  ET1〈0〉 
   
 13.2188 271.442 
 16.0953 321.108

 
 16.4733 327.502  top_order := 6
 18.2179 
 356.66  bottom_order := 6
[Plot: Y1 versus X1]

Finally, we'll use genfit to do the fit, so we can use analytical


derivatives, which will give a few more places of accuracy than
the numerical derivatives. To do this, we'll need to construct the
vector of analytical derivatives required by genfit.

f'Poly(x, c, k) :=   BottomSum ← 1 + Σ_{n = 1 .. bottom_order} c_{n+top_order}·x^n
                     return  −x^(k−top_order)·( Σ_{n = 0 .. top_order} c_n·x^n ) / BottomSum²   if k > top_order
                     return  x^k / BottomSum

GenfitFunctionMatrix(x, C) :=   A_0 ← Poly(x, C, top_order, bottom_order)
                                for i ∈ 0 .. top_order + bottom_order
                                    A_{i+1} ← f'Poly(x, C, i)
                                return A

To construct a guess value, calculate a set of polynomial fit coefficients to the data, which is the same as a numerator in the rational polynomial with a denominator of 1 (order 0).

j := 0 .. top_order        cguess_{top_order+bottom_order} := 0

M_{i,j} := (X1_i)^j        q := (Mᵀ·M)⁻¹·Mᵀ·Y1        cguess_j := q_j
Calculate the rational fit with genfit.

cfit := genfit(X1, Y1, cguess, GenfitFunctionMatrix)

y1_i := Poly(X1_i, cfit, top_order, bottom_order)
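
As a hedged Python analogue of this step, here is a sketch that uses scipy.optimize.least_squares with an analytic Jacobian in place of genfit. The helper names (model, jac_matrix) and the reuse of a plain polynomial fit as the starting guess are choices made for this sketch, not part of the worksheet.

import numpy as np
from scipy.optimize import least_squares

# Transformed calibration data (see ET1): X1 = voltage (mV), Y1 = temperature/voltage.
mV = np.array([-6.25629, -6.22919, -6.19773, -6.17138, -6.15358, -5.87302,
               -5.75328, -5.53559, -5.31472, -2.74070, -1.43494,
               1.0679, 4.2773, 5.3414, 7.0364, 11.0133, 13.2188,
               16.0953, 16.4733, 18.2179])
degC = np.array([-268.935, -259.34, -252.87, -248.595, -246.048, -218.789,
                 -210.002, -195.802, -182.962, -78.476, -38.862,
                 26.87, 100.0, 122.37, 156.634, 231.968, 271.442,
                 321.108, 327.502, 356.66])
X1, Y1 = mV, degC / mV

TOP = BOT = 6

def model(c, x):
    num = np.polyval(c[TOP::-1], x)
    den = 1.0 + x * np.polyval(c[:TOP:-1], x)
    return num / den

def jac_matrix(c, x):
    # Analytic derivatives of the model, mirroring f'Poly above.
    num = np.polyval(c[TOP::-1], x)
    den = 1.0 + x * np.polyval(c[:TOP:-1], x)
    cols = [x**k / den for k in range(TOP + 1)]                                   # numerator coeffs
    cols += [-x**(k - TOP) * num / den**2 for k in range(TOP + 1, TOP + BOT + 1)] # denominator coeffs
    return np.column_stack(cols)

c0 = np.zeros(TOP + 1 + BOT)
c0[:TOP + 1] = np.polyfit(X1, Y1, TOP)[::-1]        # polynomial fit as the initial guess
fit = least_squares(lambda c: model(c, X1) - Y1, c0, jac=lambda c: jac_matrix(c, X1))
print(np.sum(fit.fun**2))    # sum of squared residuals; the worksheet obtains ~2e-6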

[Plot: Y1 and y1 versus X1]

Residuals on the "reduced" data set

[Plot: Y1 − y1 versus X1, on a ±0.001 scale]

Σ (Y1 − y1)² = 2.0373 × 10⁻⁶        corr(Y1, y1) = 1.0000000

Go back and look at this fit with respect to the original,


untransformed data, and its residuals, by multiplying by the
x values.

fitrs(x) := Poly(x, cfit, top_order, bottom_order)        x := −6.26, −6.25 .. 20

[Plot: Y and X1·y1 (element-wise) versus X1 and X, together with x·fitrs(x)]

residual := Y − X·fitrs(X)   (element-wise)

Residuals on the original data set

[Plot: residual versus X, on a ±0.005 scale]

Transforming back to the original data:        Σ (residual)² = 0.0001

The χ² from above was        Σ (Y − temp(X))² = 0.0135

One extra operation, that of transforming the data, has


improved the fit by a factor of approximately 600%, with fewer
fit parameters to calculate.

References

The Omega Instruments Web Site:
http://www.omega.com/temperature/Z/pdf/z021-032.pdf

The NIST ITS-90 Calibration Standard page:
http://srdata.nist.gov/its90/main/its90_main_page.html
EXAMPLES

Comparing Splines with a Polynomial Fit


by Robert Adair

The statistical B-spline functions introduced in the data


analysis extension pack will calculate a string of knots using the
Durbin-Watson statistic to accept or reject spline fits. In this
way, statistical B-splines supply a minimal number of knots to
reflect all of the data features.

a := READPRN ( "example1.txt" )
x := a⟨0⟩        y := a⟨1⟩        w := a⟨2⟩

where w is a vector of weights giving the estimated standard


deviations of the random error in y. We'll fit with spline
polynomials of order 3.

SplineDegree := 3

b := Spline2 ( x , y , SplineDegree , w)

The number of knots returned for the optimal spline fit is


b1 = 40 , which is a good compression of the data from the
original length( x) = 536 points. We can take a look at the
quality of the spline fit:

i := 0 .. b_1        knots_i := b_{i+2}

i := 0 .. 100        range_i := i·(max(x) − min(x))/101 + min(x)

Spline⟨i⟩ := Binterp(range_i, b)        Spline := Splineᵀ
[Plot: original data and interpolated spline versus x]
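
Spline2 and Binterp come from the Data Analysis Extension Pack, so here is only a rough, hedged Python analogue using scipy's smoothing B-splines. The data file example1.txt is not reproduced, so the sketch builds a synthetic two-peak data set, and scipy's smoothing factor s stands in for the Durbin-Watson knot selection; none of this is the worksheet's actual algorithm.

import numpy as np
from scipy.interpolate import splrep, splev

# Synthetic stand-in for example1.txt (columns x, y, w): two peaks plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 2000, 536)
y = 6e4 * np.exp(-((x - 900) / 120) ** 2) + 2e4 * np.exp(-((x - 1400) / 60) ** 2)
y = y + rng.normal(scale=500, size=x.size)
w = np.full(x.size, 1 / 500)                 # weights ~ 1/sigma of the noise

k = 3                                        # cubic B-spline, as SplineDegree above
tck = splrep(x, y, w=w, k=k, s=len(x))       # s controls smoothness / number of knots
print("interior knots:", len(tck[0]) - 2 * (k + 1))

# Evaluate the smoothing spline on a regular grid, like Binterp over `range`:
xs = np.linspace(x.min(), x.max(), 101)
spline_vals = splev(xs, tck)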

Polynomial fits
But why couldn't we fit this same data with an equivalent global
polynomial? Do a global fit to the data using the same number
of free parameters as the spline has knots, to see why the
spline is a better option.

NumFreeParm := b1 PolyDegree := NumFreeParm − 1

vs := regress ( x , y , PolyDegree)

fglobal.regress(x1) := interp(vs, x, y, x1)

yregress := fglobal.regress(x)   (element-wise)

The regress function fails in this fit, which is not surprising


given the high polynomial degree.

[Plot: yregress versus x — the fitted values blow up to the order of 10¹⁹]
Try fitting to Chebyshev polynomials for better numerical stability.

xmin := min(knots)        xmax := max(knots)

fPoly(n, x) := Tcheb(n, 2·(x − xmin)/(xmax − xmin) − 1)

i := 0 .. last(x)        j := 0 .. PolyDegree

M_{i,j} := fPoly(j, x_i)        c := (Mᵀ·M)⁻¹·Mᵀ·y

The matrix inversion is nearly singular in a numerical sense, as can be seen by evaluating the determinant:

|Mᵀ·M| = 2.528236 × 10⁹²

This is testament to the numerical complexity of fitting a


polynomial with enough coefficients to represent all the
inflections in the data, as well as the large number of points.

fglobal(x) := Σ_j c_j·fPoly(j, x)

yglobal := fglobal(x)   (element-wise)
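
A hedged Python sketch of the same Chebyshev least-squares fit, using numpy.polynomial.chebyshev rather than an explicit design matrix. It rebuilds the same synthetic stand-in data as the previous sketch (the real example1.txt is not reproduced) and reports a condition number instead of a determinant, which is just another way of seeing the same ill-conditioning.

import numpy as np
from numpy.polynomial import chebyshev as C

rng = np.random.default_rng(0)
x = np.linspace(0, 2000, 536)
y = (6e4 * np.exp(-((x - 900) / 120) ** 2)
     + 2e4 * np.exp(-((x - 1400) / 60) ** 2)
     + rng.normal(scale=500, size=x.size))

deg = 39                                          # PolyDegree = NumFreeParm - 1 for 40 knots
t = 2 * (x - x.min()) / (x.max() - x.min()) - 1   # rescale to [-1, 1], as fPoly does
coef = C.chebfit(t, y, deg)                       # least-squares Chebyshev coefficients
yglobal = C.chebval(t, coef)

M = C.chebvander(t, deg)                          # design matrix M[i, j] = T_j(t_i)
print(np.linalg.cond(M.T @ M))                    # huge value => nearly singular normal equations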

[Plot: original data, interpolated spline, and global polynomial versus x]
Even the more stable Chebyshev polynomials will oscillate on the tails of the data because of the large number of terms, and fail to accurately represent the peaks and finer features
of the data. While the splines are perhaps less informative as
a physical model, they are a much better predictor of the
behavior of the data at any arbitrary point in the range.
EXAMPLES

Complex Cubic Splines of Prescribed Length


by Alex Kushkuley

Cubic splines are a common method of approximating a curve


with cubic polynomials. This article discusses a method for
specifying the length of a section of the cubic spline fit between
two endpoints. Splines are used to describe a curve without a
generating function. Spline interpolation can also be used to
approximate a curve which has an exceptionally complicated
function associated with it. The utility of splines is based on the
fact that a cubic polynomial is completely determined if its value
and the value of its derivative are specified at two distinct points.
This means that one can prescribe directions of the curve defined
by a cubic polynomial at its endpoints. This property of cubic
polynomials is used for creating continuous and well-behaved
interpolation curves.

To be more precise, examine the third-order polynomial

p(x) = a₀ + a₁·x + a₂·x² + a₃·x³

and assume the initial and boundary conditions

p(0) = 0        p(1) = 0        (d/dx) p(0) = c        (d/dx) p(1) = b

Plugging these values back into the cubic equation, we get a system of linear equations

a₀ = 0        a₁ + a₂ + a₃ = 0

a₁ = c        a₁ + 2·a₂ + 3·a₃ = b

which yields the equation of an approximating cubic spline in the form

spline(x, c, b) = c·x − (2·c + b)·x² + (c + b)·x³

The curve defined by this equation passes through points (0,0)


and (1,0); and its slopes at these points are equal to c and b.
On the other hand, suppose that we want the arc of our curve
between the endpoints to have a prescribed length — this
requirement is also quite natural for various CAD/CAM problems.
Now we have a nonlinear equation

(1)        L(c, b) = s        where        L(c, b) = ∫₀¹ √( 1 + ((d/dx) spline(x, c, b))² ) dx

and s is a specified length. Fixing c, we can solve this equation for


b in order to obtain a cubic curve which has a given length and
which satisfies all the conditions of the "plain" cubic spline, except
that now we can't prescribe the slope of the curve at its
destination point. The integral involved is an elliptic one, so it is
impossible to solve equation (1) analytically (see, e. g. [1] ).
Mathcad can be used to get a numeric solution.

The important thing to understand is that the length function on the left-hand side of equation (1) is a convex function of its second argument, i.e. the second derivative of L in b is positive and hence dL/db is a monotonically increasing function of b. This property means that equation (1) has no more than two solutions. This is a rather general property of such functions (cf. [2]), but in our simple case we can use live symbolics to verify that it is correct.

The integrand as a function of its second argument looks like the following expression, where x and y are arbitrary constants. Using the symbolic processor, differentiate twice with respect to b to get

d²/db² √(1 + (x + b·y)²)   simplify →   y² / (1 + x² + 2·x·b·y + b²·y²)^(3/2)

This expression confirms that the second derivative of the length function is positive, and therefore that equation (1) has no more than two solutions. We can find these solutions by the following procedure.

First, solve the equation   (d/db) L(b) = 0

which has only one solution, since the left-hand side is a monotone function. Using the previously derived expression for spline(x, c, b) we get

L(c, b) := ∫₀¹ √( 1 + (c − 2·(2·c + b)·x + 3·(c + b)·x²)² ) dx

Fix c := −4 and define L1(b) := L(c, b)

The derivative of L(c, b) in b is

DerivL(c, b) := ∫₀¹ [ (c − 2·(2·c + b)·x + 3·(c + b)·x²)·(−2·x + 3·x²) ] / √( 1 + (c − 2·(2·c + b)·x + 3·(c + b)·x²)² ) dx

Define DerivL1(b) := DerivL(c, b)

and take the initial value b := 0

and compute the resulting value of the first derivative at


the endpoint:

Slope := root ( DerivL1 ( b) , b) Slope = −0.834195

And find the length corresponding to this value of the derivative:

L1(Slope) = 1.599284        DerivL1(Slope) = −2.320678 × 10⁻⁴

Note that L1(Slope) is the minimum possible length of the cubic


curve which satisfies three given conditions (values at the
endpoints and the slope at the initial point). If we want our curve
to have a given length, this length should be no less than
L1(Slope). In this case the number Slope separates the two roots
of our equation (1).
Let's say we look for a spline with the length

L0 := 1.8

then we will have one root on the left of Slope

guess := Slope − 1 b1 := root ( L1 ( guess) − L0 , guess)

Check the result: b1 = −3.26509 L1 ( b1) = 1.80003

The second root is on the right of Slope

guess := Slope + 1 b2 := root ( L1 ( guess) − L0 , guess)

Check the result: b2 = 1.335276 L1 ( b2) = 1.799278
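
The same computation can be reproduced outside Mathcad; here is a minimal Python sketch (assuming scipy) that evaluates L(c, b) by quadrature, locates the minimizing slope, and brackets the two roots of L1(b) = L0. The search bounds and bracketing intervals are assumptions of the sketch.

import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq, minimize_scalar

def arc_length(c, b):
    # L(c, b) = integral over [0, 1] of sqrt(1 + spline'(x)^2)
    ds = lambda x: np.sqrt(1.0 + (c - 2*(2*c + b)*x + 3*(c + b)*x**2)**2)
    return quad(ds, 0.0, 1.0)[0]

c = -4.0
L1 = lambda b: arc_length(c, b)

# Slope that minimizes the length (the root of dL/db):
slope = minimize_scalar(L1, bounds=(-3.0, 1.0), method="bounded").x
print(slope, L1(slope))              # compare with -0.834 and 1.599 above

# For L0 > L1(slope) there is one root on each side of the minimum:
L0 = 1.8
b1 = brentq(lambda b: L1(b) - L0, slope - 5.0, slope)
b2 = brentq(lambda b: L1(b) - L0, slope, slope + 5.0)
print(b1, L1(b1), b2, L1(b2))        # compare with -3.265 and 1.335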

This gives us two curves, corresponding to the two roots of L1(b):

spline1(x) := c·x − (2·c + b1)·x² + (c + b1)·x³

spline2(x) := c·x − (2·c + b2)·x² + (c + b2)·x³

which we plot below:        x := 0, .01 .. 1

[Plot: spline1(x) and spline2(x) versus x on [0, 1]]

So far our methodology is of limited use, since we cannot


change the endpoints of the curve, which, for the sake of
simplicity, were fixed at (0,0) and (1,0). Below is a more
general formulation. After the formulations given above, only a
few changes are necessary.
Arbitrary splines of prescribed length

Points on the plane will now be represented as complex


numbers, and cubic curves as cubic polynomials with complex
coefficients. So, in specifying the initial conditions, use complex
numbers as follows:

starting point:        start := 1 + i

end point:        end := −1 − i

derivative of the curve at the starting point (the velocity of the curve):        c := −1 + 5·i

To solve for the derivative at the end point, we can fix the ratio between its imaginary and real parts and solve the length equation for the real part. This will also ensure that the curve has a prescribed tangent line at the end point. Note that we were unable to achieve this with real cubic polynomials. If we set, for example, the slope at the finishing point to be

Slope := 1

then the velocity at the finishing point is Slope·x·i + x, where x is the real part of a complex variable. Define the coefficients of the complex spline by setting up a system of equations from the initial conditions, just as we did for the non-complex case. In this instance, the second and third order coefficients will be functions of the real variable x, since the velocity at the endpoint is written in terms of x, as above. This yields:

A ( x) := ( −3⋅ start + 3⋅ end − 2⋅ c − x − Slope⋅ x⋅ i)

B ( x) := ( x + Slope⋅ x⋅ i + 2⋅ start − 2⋅ end + c)

so that the curve we are looking for has the form

spline(x, t) := start + c·t + A(x)·t² + B(x)·t³

where the real parameter t belongs to the interval [0, 1].

The velocity vector is equal to the derivative of spline(x, t):

C(x, t) := c + 2·A(x)·t + 3·B(x)·t²

Hence, the length of the curve is equal to

L(x) := ∫₀¹ |C(x, t)| dt

There are two ways to proceed at this juncture. One way to find
the minimum possible length is to use a solve block and the
Minerr function. The accuracy of the value for the root of the
function L(x) will be poor with the low tolerance used, but it will
calculate quickly, and allow you to play with the values start,
end, c and slope above. The second, more familiar way is to find
the derivative of L(x), set it to zero, and solve for x. This
method will be covered later, since it raises some interesting
issues.
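
As a hedged illustration of the first approach, here is a Python sketch that evaluates L(x) by quadrature and minimizes it directly; scipy's bounded scalar minimizer plays the role of the solve block with Minerr, and the bounds and bracketing intervals are assumptions of the sketch.

import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar, brentq

start, end = 1 + 1j, -1 - 1j          # endpoints as complex numbers
c = -1 + 5j                           # velocity at the starting point
slope = 1.0                           # prescribed slope at the finishing point

A = lambda x: -3*start + 3*end - 2*c - x - slope*x*1j
B = lambda x: x + slope*x*1j + 2*start - 2*end + c
C = lambda x, t: c + 2*A(x)*t + 3*B(x)*t**2      # velocity of spline(x, t)

def L(x):
    # Curve length: integral of |C(x, t)| for t in [0, 1]
    return quad(lambda t: abs(C(x, t)), 0.0, 1.0)[0]

d = minimize_scalar(L, bounds=(-5.0, 5.0), method="bounded").x
print(d, L(d))                        # compare with d ~ 0.215 and L(d) ~ 3.602

L0 = 4.2                              # prescribed length, larger than L(d)
r1 = brentq(lambda x: L(x) - L0, d - 5.0, d)
r2 = brentq(lambda x: L(x) - L0, d, d + 5.0)
print(r1, L(r1), r2, L(r2))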

x := −5, −4.5 .. 5

To find a guess for the minimum possible length for the spline, plot L over a range of x values (you may need to change the range of x above to see the minimum).

[Plot: L(x) versus x]

Pick a value of x near the minimum and enter it as a guess for the solve block:

x := 0

Given
    L(x) = 0
d := Minerr(x)        d = 0.137445        L(d) = 3.601786

The value of L(d) found this way will still be good because the minimum is shallow.

Method 2:

Differentiating on x we get the derivative of the length function

DerivL(x) := ∫₀¹ Re[ (1 + Slope·i)·C̄(x, t) ]·(−2·t + 3·t²) / |C(x, t)| dt

(the overbar denotes the complex conjugate).
DerivL(x) is differentiable even when |C(x,t)| =0, and in many
such cases Mathcad will compute this integral correctly, despite
the apparent singularity. To avoid this issue and still get a good
approximation for DerivL(x), remove small intervals of
integration in the neighborhood of the roots of |C(x,t)|. The
following expressions calculate the roots of |C(x,t)|, and create
three integrals that leave out the intervals ±ε around each root.

To do this, compute the roots t1 and t2 of C(x,t):

D(x) := √( A(x)² − 3·c·B(x) )

t1(x) := ( −A(x) + D(x) ) / ( 3·B(x) )        t2(x) := ( −A(x) − D(x) ) / ( 3·B(x) )

We look for those roots which are real and belong to the interval [0, 1]:

Q(x) := if[ ( |Im(x)| < TOL )·( Re(x) ≥ 0 )·( Re(x) ≤ 1 ) , Re(x) , 1 ]

t1(x) := Q(t1(x))        t2(x) := Q(t2(x))

Order the roots in such a way that T1 < T2, using the
intermediate values t1 and t2:

T1 ( x) := if ( t1 ( x) < t2 ( x) , t1 ( x) , t2 ( x) )

T2 ( x) := if ( t2 ( x) > t1 ( x) , t2 ( x) , t1 ( x) )

Define the size of the intervals to be removed from the integral:

λ := 10        ε := λ·TOL

and rewrite the expression for the derivative in the following way:

DerivL1(x) := ∫₀^(T1(x)−ε) Re[ (1 + Slope·i)·C̄(x, t) ]·(−2·t + 3·t²) / |C(x, t)| dt

DerivL2(x) := ∫_(T1(x)+ε)^(T2(x)−ε) Re[ (1 + Slope·i)·C̄(x, t) ]·(−2·t + 3·t²) / |C(x, t)| dt

DerivL3(x) := ∫_(T2(x)+ε)^1 Re[ (1 + Slope·i)·C̄(x, t) ]·(−2·t + 3·t²) / |C(x, t)| dt

The new expression for the derivative becomes:

DerivL(x) := DerivL1(x) + (T1(x) + ε < T2(x) − ε)·DerivL2(x) + (T2(x) + ε < 1)·DerivL3(x)

Now we can find the root of the derivative.

guessx := 0.7        d := root(DerivL(guessx), guessx)

d = 0.215161        DerivL(d) = 6.637192 × 10⁻⁴

With either method, the minimal possible length for our curve
is approximately

L ( d) = 3.602334

Choose a length L0 to be greater than L(d): L0 := 4.2


Find two roots of equation L(x) = L0.

guessp := d − 5 guessq := d + 5 Check:

x := guessp r1 := root ( L ( x) − L0 , x) L ( r1) = 4.199986

x := guessq r2 := root ( L ( x) − L0 , x) L ( r2) = 4.20001

So the first curve is

spline1(t) := start + c·t + A(r1)·t² + B(r1)·t³

and the second one is

spline2(t) := start + c·t + A(r2)·t² + B(r2)·t³

To graph these splines we need to plot the imaginary part of our polynomials against their real parts.

range variable:        I := 100        i := 0 .. I        s_i := i/I

X := Re(spline1(s))        Y := Im(spline1(s))        (element-wise)
U := Re(spline2(s))        W := Im(spline2(s))        (element-wise)

[Plot: Y_i versus X_i and W_i versus U_i]

This is an "equilateral fish" — the red solid curve has the same
length as the blue dotted one. Both curves start at the same
point on the "fish tail." Their starting velocities are equal. They
finish simultaneously at the same point. Their finishing
velocities are different. However the curves have the same
tangent line at the finishing point which has a prescribed angle
with the x-axis (45 degrees in this case).
To verify the results let's calculate the velocities of our curves at
the endpoints (these match the complex value of c set at the
start of this example):

t := 0        (d/dt) spline1(t) = −1 + 5i        (d/dt) spline2(t) = −1 + 5i

t := 1        vel1 := arg( (d/dt) spline1(t) )        vel2 := arg( (d/dt) spline2(t) )

To see that the slope at the finishing point really is equal to 45 degrees, compare the tangents of the two velocities (the tangent will equal the value of Slope, set earlier):

tan(vel1) = 1        tan(vel2) = 1

The reader is advised to play with all the parameters in this document. For example, setting the parameter Slope to be a big number will force both curves to be perpendicular to the x-axis at the finishing point (the correct way of doing this is, of course, to reparametrize the problem in terms of y instead of x). Changing the velocity at the starting point can radically change the shape of the solutions even if the slope at the starting point remains the same. Try, for example, multiplying the starting velocity by 5 and changing the length L0 to 12. Note also that if the required length approaches L(d) the curves move closer and closer to each other until they merge when the length is approximately equal to L(d). When the length is less than L(d) the root functions fail to converge (bear in mind, however, that some precision loss is inevitable).

Another experiment is to impose other linear initial conditions or


to relax some of these conditions. This article can be used for the
purposes of practical curve design. Some experimentation will be
required in choosing initial guesses for the Mathcad root
functions, but the information on the length function presented
here is sufficient for doing this quickly. The calculations performed
here are applicable to any parametric families of curves which
depend linearly on a parameter which controls the length. (cf. [2]).

The author is grateful to Frank Purcell, Paul Lorczak and Leslie


Bondaryk for very useful comments.

References
[1] Handbook of Mathematical Functions, edited by M. Abramowitz and I. A. Stegun, National Bureau of Standards, 1964.
[2] A. Kushkuley and S. Rozenberg. Length function on a
parametric family of curves, Latvian Math. Ezhegodnik, vol. 27,
Riga, pp. 154-159, 1983 (Russian).
EXAMPLES

Conical Surface Regression and Analysis


by Xavier Colonna de Lega
Zygo Corporation

Introduction

This worksheet illustrates how to calculate the parameters


defining a conical surface that best matches a set of (x,y,z)
points in a least-squares sense. The deviation of the original data
points with respect to the best-fit cone is then calculated to
create radial and tangential profiles of the surface deviation.

Valves used for hydraulic and fuel injection systems in cars and
trucks are typically made of a moving ball or needle that mates to
a conical surface. The region on the cone where the two surfaces
contact in the closed position is called the valve seat. In order for
the seat to be an effective sealing surface its deviation from a
perfect cone is tightly toleranced. Dedicated profilers are used on
the production floor and in the QC lab to verify that these
tolerances are met. The data used in this worksheet are the result
of a measurement of a machined valve seat with a surface
profiling interferometer.
Coordinate system
The (X,Y,Z) coordinate system is defined by the interferometer
geometry, see figure. The center is the point P. The axes are
defined by the unitless X, Y and Z normed vectors.

In the figure the conical surface is perfectly aligned to the instrument: its axis, defined by the cone center point C and the unit vector D, is collinear with the Z-axis. The cone semi-included angle is γ. The inspection diameter where roundness deviation must be measured is φ. Finally, the location of the cone center on the Z-axis is such that the normal to the surface at a point M belonging to the inspection diameter passes through the center of the coordinate system P. In other words, the conical surface is tangent to a sphere of radius MP centered at P. The nominal distance MP is defined by the instrument. It is called "radius" in the worksheet.

[Figure: cone geometry showing the center P, the X, Y, Z axes, the cone center C, the axis direction D, the semi-included angle γ, the inspection diameter φ, and a surface point M]

Notations: Vector variables are bold faced. Small caps variables are
scalar values. Large caps variables correspond to (x,y,z) points.

Predefined units

Coordinate system:

Unit vectors:        X := (1, 0, 0)ᵀ        Y := (0, 1, 0)ᵀ        Z := X × Y

Center position:        P := (0, 0, 0)ᵀ·m

Nominal cone semi-included angle:        γ0 ≡ 45·deg

Nominal cone direction: we define the normed vector D by two angles α and β that correspond to two successive rotations of the cone axis about the X and Y axes.

Nominal rotations:        α0 := 0·mrad        β0 := 0·mrad

Rotation matrix:

rot(α, β) :=  [ 1      0        0     ]   [ cos β   0   −sin β ]
              [ 0    cos α   −sin α   ] · [   0     1      0   ]
              [ 0    sin α    cos α   ]   [ sin β   0    cos β ]

Nominal cone direction:        D0 := rot(α0, β0)·(−Z)

Radius of tangent sphere:        radius ≡ 2.0·mm

Nominal cone center or apex:        C0 := P + (radius / sin(γ0))·Z

We have defined the conical surface and its location in space by 6 scalar values: the semi-included angle γ, the (x, y, z) coordinates of the cone center or apex, and the two rotation angles describing the normed vector D.

In practice, the measured data corresponds to a part that is not


perfectly aligned to the instrument. The least-squares fit cone will
provide estimates for these 6 parameters.

Import measured data


Load measured (x,y,z) coordinates:

MeasuredXYZ := READBIN( "ConeXYZ.bin" , "float" , 0 , 3 , 0)⋅ mm

Number of (x,y,z) measurement points: rows (MeasuredXYZ ) = 51518

Extract coordinates:

x := MeasuredXYZ⟨0⟩        y := MeasuredXYZ⟨1⟩        z := MeasuredXYZ⟨2⟩

Best-fit conical surface

We first define a merit function to minimize. Here we choose to


calculate the sum of the squared distances from the measured
data points to the best-fit cone. The distance is measured
along the local cone normal.

The figures at right illustrate the calculation of the distance HM for a given measured point M.

We have:        CM' = CM·D > 0

The signed length M'M'' is defined as positive:

M'M'' = CM'·tan(γ)

The signed length M'M is also positive:

M'M = |MM'| = √( (CM)² − (CM')² )

This yields an expression for the signed length MM'':

MM'' = M'M'' − M'M

We have:        MM'' = MH / cos(γ)

If we define the surface deviation as HM, this signed quantity is given by:

HM = cos(γ)·√( (CM)² − (CM')² ) − sin(γ)·CM'

The sum calculation has been optimized in this program. It turns out that calculation with vectors is slower than scalar calculations. The least-squares sum is scaled to mm² in order to avoid large exponents.

SumSquares(points, cx, cy, cz, α, β, γ) :=
    Σ ← 0
    sγ ← sin(γ)/mm
    cγ ← cos(γ)/mm
    (Dx, Dy, Dz) ← (sin(β), sin(α)·cos(β), −cos(α)·cos(β))
    "Loop on all data points"
    for k ∈ 0 .. rows(points) − 1
        CMx ← points_{k,0} − cx
        CMy ← points_{k,1} − cy
        CMz ← points_{k,2} − cz
        CM2 ← CMx² + CMy² + CMz²
        cmz ← CMx·Dx + CMy·Dy + CMz·Dz
        HM ← cγ·√(CM2 − cmz·cmz) − sγ·cmz
        Σ ← Σ + HM²
    return Σ
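
For readers outside Mathcad, here is a hedged Python sketch of the same least-squares cone fit using scipy.optimize.least_squares with the signed-distance formula above as the residual. The ConeXYZ.bin data are not reproduced, so the sketch fabricates a hypothetical cone-like point cloud purely to exercise the code; cone_residuals and the synthetic parameters are inventions of this sketch.

import numpy as np
from scipy.optimize import least_squares

def cone_residuals(p, pts):
    # Signed normal distance HM of each point to the cone defined by
    # p = (cx, cy, cz, alpha, beta, gamma); pts is an (N, 3) array in mm.
    cx, cy, cz, alpha, beta, gamma = p
    D = np.array([np.sin(beta),
                  np.sin(alpha) * np.cos(beta),
                  -np.cos(alpha) * np.cos(beta)])
    CM = pts - np.array([cx, cy, cz])
    cm2 = np.einsum("ij,ij->i", CM, CM)      # |CM|^2
    cmz = CM @ D                             # CM . D
    return np.cos(gamma) * np.sqrt(np.maximum(cm2 - cmz**2, 0.0)) - np.sin(gamma) * cmz

# Hypothetical stand-in for ConeXYZ.bin: points near a 45.05-degree half-angle cone
# whose apex sits near the nominal position, plus ~0.2 um of noise.
rng = np.random.default_rng(1)
n = 5000
theta = rng.uniform(0.0, 2.0 * np.pi, n)
rho = rng.uniform(1.7, 2.1, n)                        # radial distance from the axis, mm
apex = np.array([0.01, -0.01, 2.82])
z = apex[2] - rho / np.tan(np.deg2rad(45.05))
pts = np.column_stack([apex[0] + rho * np.cos(theta),
                       apex[1] + rho * np.sin(theta), z])
pts += rng.normal(scale=2e-4, size=pts.shape)

p0 = np.array([0.0, 0.0, 2.828, 0.0, 0.0, np.deg2rad(45.0)])   # nominal guess
fit = least_squares(cone_residuals, p0, args=(pts,))
cx, cy, cz, alpha, beta, gamma = fit.x
print("included angle:", np.degrees(2.0 * gamma), "deg")
print("sum of squares:", np.sum(fit.fun**2))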

Guess values for the fit:

α0 = 0 mrad        β0 = 0 mrad        γ0 = 45 deg

C0 = (0, 0, 2.828)ᵀ mm        (cx0, cy0, cz0) := C0

Minimize the sum of squares using a solve block.

Set convergence tolerance:        TOL := 10⁻⁴

On slower computers we can use a data subset to perform the fit.

sub := 20        i := 0, sub .. rows(MeasuredXYZ) − 1        T0 := MeasuredXYZᵀ

subMeasuredXYZ⟨i/sub⟩ := T0⟨i⟩        subMeasuredXYZ := subMeasuredXYZᵀ

Sum of Squares Solve Block

Enable one or the other equality statement, depending on whether you wish to use the truncated data or the full set. The Quasi-Newton algorithm gives the best results here. The right-click option on Minerr was used to force this choice.

Given
    SumSquares(subMeasuredXYZ, cx0, cy0, cz0, α0, β0, γ0) = 0
    SumSquares(MeasuredXYZ, cx0, cy0, cz0, α0, β0, γ0) = 0
(cx cy cz α β γ) := Minerr(cx0, cy0, cz0, α0, β0, γ0)ᵀ

Verify the improvement of the sum of squares:

SumSquares(MeasuredXYZ, cx0, cy0, cz0, α0, β0, γ0) = 441.274195467072

SumSquares(MeasuredXYZ, cx, cy, cz, α, β, γ) = 0.010398

Best fit cone included angle: 2⋅ γ = 90.091 deg

Note: A typical cone angle tolerance is on the order of 1°.

Tilt of cone axis: α = −0.296 deg β = 0.244 deg

Best fit cone parameters:

Direction:        D := rot(α, β)·D0

Cone center:        C := (cx, cy, cz)ᵀ

Displacement from nominal:        C − C0 = (−0.011, 0.013, −0.133)ᵀ mm

3D plot of measured data and fitted cone

Angular width of displayed best fit cone surface:        ∆θ := 6·deg

Best-fit 3D plot

The red line corresponds to the best-fit cone axis. The light-blue wireframe surface represents a section of the best-fit cone.

[3D plot: (x/mm, y/mm, z/mm), (FitX/mm, FitY/mm, FitZ/mm), (AxisX/mm, AxisY/mm, AxisZ/mm)]

Calculation of deviation from best fit cone

This step consists of calculating the distance of each data point to the best-fit cone along the local normal. This distance is associated with the (X, Y) projection of the data point onto a plane perpendicular to the cone axis.

Residuals

Each line in the residual matrix contains the height deviation from the best-fit cone, X, Y, ρ and ω. (X, Y) are the rectangular coordinates of the projection of the data point onto a plane perpendicular to the best-fit cone axis. (ρ, ω) are the distance of the data point from the cone center and the angular position of the data point projection in the XY plane.

Residuals := GetResiduals(MeasuredXYZ, C, D, γ)

Surface deviation:        δ := Residuals⟨0⟩

Look at the distribution of values:

hh := histogram(1000, δ·nm⁻¹)

[Histogram: frequency (count) versus height deviation (nm)]

Cone flatness deviation

Peak-to-valley surface deviation:        Range(δ) = 3.661 µm

Projected points coordinates:

Px := Residuals⟨1⟩        Py := Residuals⟨2⟩
Plots

φ = 2.68 mm        ψ = 45 deg

[3D scatter plot: (Px/mm, Py/mm, δ/µm), (Cx/mm, Cy/mm, Cz/µm), (strx/mm, stry/mm, strz/µm)]

This surface is plotted using a 3D scatter plot because the surface plot can not handle data with holes. The style for the 3 plots is scatter plot.

The blue line position is defined by the angle ψ. The so-called


straightness profile is generated as the intersection of the
surface with a plane that contains the blue segment and is
perpendicular to the plane of the figure.

Similarly, the red trace is the location where we want to extract


a roundness profile, which is the intersection of the measured
surface with a cylinder of diameter φ that has its axis
perpendicular to the plane of the figure.

In the following we will use the projection shown above to


extract a straightness and a roundness profile. Note that the
projection introduces geometrical distortions that we will ignore in
this worksheet.
Straightness profile

Straightness profile at angular position: ψ ≡ 45⋅ deg

We select a subset of data points based on their proximity to the generatrix of interest. Here we retain points whose angular position is within +/−1° of the generatrix direction ψ.

Straightness

subData := SelectDataSt(ψ, Residuals, 2·deg)

subX := subData⟨1⟩        subY := subData⟨2⟩        subZ := subData⟨0⟩

subρ := subData⟨3⟩        subω := subData⟨4⟩ / m

We use loess to create an interpolated profile. The loess proximity parameter is chosen to be small to limit lowpass filtering:

lp := 0.05

Interpolation sampling step:        dρ := 2·µm

Straightness := StProf(ψ, subData, γ, lp, dρ)

stZ := Straightness⟨0⟩        stρ := Straightness⟨3⟩

We also create an interpolated profile by fitting a plane to the four closest neighbors of each interpolation location. The corresponding function needs data in a new coordinate system. The first parameter X is the position of the data points projected on the plane of the profile. The second parameter Y is the distance of the data points to the plane of the profile. The third parameter Z is the surface deviation. The last two parameters are the maximum distance from the profile plane for a point to be valid and the sampling step for the interpolation.

4-point interpolation

X := subρ        Y := subρ·mod(subω − ψ, 2·π)   (element-wise)        Z := subZ

Straightness' := Interpolate4PtPlane(X, Y, Z, min(subρ), max(subρ), 10·µm, dρ)

stZ' := Straightness'⟨1⟩        stρ' := Straightness'⟨0⟩

Range(stZ') = 1.861 µm

Radial profile (from cone center)

[Plot: surface deviation (µm) versus distance from cone center (mm) — loess interpolation, 4-point linear interpolation, and projection of selected data points]

Roundness profile

Cone diameter where roundness is measured:


φ ≡ 2.68⋅ mm

Roundness

We first select the data points that fall within a certain distance
of the circle of diameter φ. The last parameter of the function call
below defines this maximum distance.
subData2 := SelectDataRd ( φ , Residuals , 10⋅ µm)

subX := subData2⟨1⟩        subY := subData2⟨2⟩        subZ := subData2⟨0⟩

subρ := subData2⟨3⟩        subω := subData2⟨4⟩ / m

Nominal interpolation sampling step along the perimeter:        dp := 5·µm

Exact sampling step:

dp := π·φ / ( 2·round( π·φ / (2·dp) ) )        dp = 5 µm

Use 4-point linear interpolation:

X := (φ/2)·(subω + π)        Y := subρ·sin(γ) − φ/2        Z := subZ

Because the data wrap around in this case we pad both ends
of the data sent to the interpolation function.

cut := round ( 0.5⋅ last( X) )

X := stack( submatrix( X , cut , last( X) , 0 , 0) − π⋅ φ , X, submatrix( X, 0 , cut , 0 , 0) + π⋅ φ )

Y := stack( submatrix( Y , cut , last( Y) , 0 , 0) , Y, submatrix(Y , 0 , cut , 0 , 0))

Z := stack(submatrix( Z, cut , last( Z) , 0 , 0) , Z, submatrix( Z, 0 , cut , 0 , 0) )

Roundness := Interpolate4PtPlane ( X , Y , Z, 0⋅ mm, π⋅ φ − dp , 10⋅ µm, dp)


rdZ := Roundness⟨1⟩        rdp := Roundness⟨0⟩

The roundness error at diameter φ is equal to:

max(rdZ) − min(rdZ) = 0.381 µm        Range(rdZ) = 0.381 µm

Note: A typical roundness tolerance is on the order of 1 micrometer.

Roundness profile

[Plot: surface deviation (µm) versus position along the perimeter (mm) — 4-point linear interpolation]
Profile filtering

Roundness profiles are usually lowpass, band-pass and highpass


filtered to extract form, waviness and roughness information. We
illustrate here how to extract form with a Gaussian filter. A typical
cutoff frequency (50% transfer function) for this type of data is
"50 upr", which corresponds to 50 undulations per revolution.

Since there may be missing data points in the interpolated


roundness profile, we use linear interpolation to fill in the missing
data and then use a Fourier transform to apply the Gaussian
filter. Note that since the data is effectively periodic for a
roundness profile there are no edge discontinuities issues for the
Fourier transformation.

Create regularly sampled data

Generate the Gaussian filter for the frequency domain:

Npt := rows(rdp)        j := 0 .. Npt/2

filt_j := exp( −ln(2)·(j/50)² )        filt_{mod(Npt−j, Npt)} := filt_j

Filtered profile:        rdZ2 := icfft( cfft(rdZ2)·filt )   (element-wise product)

The roundness form error at diameter φ is equal to:

max(rdZ2) − min(rdZ2) = 0.277 µm
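
Here is a minimal Python sketch of the same frequency-domain Gaussian filtering, assuming only numpy; the 50% cutoff at 50 undulations per revolution mirrors the filt definition above, while the synthetic trace and the helper name gaussian_lowpass_upr are inventions of the sketch (the measured profile itself is not reproduced).

import numpy as np

def gaussian_lowpass_upr(profile, cutoff_upr=50.0):
    # Lowpass a closed (periodic) roundness profile with a Gaussian transfer
    # function whose 50% point sits at cutoff_upr undulations per revolution.
    n = profile.size
    k = np.fft.fftfreq(n, d=1.0 / n)                  # signed harmonic number (upr)
    transfer = np.exp(-np.log(2.0) * (k / cutoff_upr) ** 2)
    return np.real(np.fft.ifft(np.fft.fft(profile) * transfer))

# Synthetic example: low-order form (3 upr) + waviness (80 upr) + noise.
rng = np.random.default_rng(2)
omega = np.linspace(0.0, 2.0 * np.pi, 1684, endpoint=False)
trace = (0.10 * np.sin(3 * omega) + 0.03 * np.sin(80 * omega)
         + rng.normal(scale=0.01, size=omega.size))
form = gaussian_lowpass_upr(trace, cutoff_upr=50.0)
print(form.max() - form.min())                        # peak-to-valley of the filtered form
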
Roundness profiles are frequently presented as polar plots.

Offset radius:        R := 1.5·µm

Plot range:        sc := ceil( (max(rdZ) + R) / (0.2·µm) )·0.2·µm

The number of grid points on the polar axis should be set manually to sc·10/µm = 14 to get a scale of 0.1·µm per radial division.

[Polar plot: 4-point interpolation and 50-upr profile]

3D plot detail

As a final step, let's display the subregion of the data where the
roundness and straightness profiles intersect. Rotate the figure
to observe how the profiles follow the original surface deviation.
Select data

[3D plot: (subX/mm, subY/mm, subZ/µm), (rdX/mm, rdY/mm, rdZ/µm), (stX/mm, stY/mm, stZ/µm)]

References

Metrology data courtesy of Zygo Corporation, all rights reserved,


used by permission.

For more information on surface metrology: D. J. Whitehouse


(1994), Handbook of Surface Metrology, Institute of Physics
Publishing, ISBN 0-7503-0039-6.

For more information on optical profilers: Peter de Groot, Xavier


Colonna de Lega (May 2003), “Valve cone measurement using
white light interference microscopy in a spherical measurement
geometry,” Optical Engineering, Vol. 42, No. 5, pp. 1232-1237.
EXAMPLES

Cosine Smoothing
by James C. (Jim) Bach
Delphi Delco Electronic Systems
Kokomo, IN, USA

Introduction

Mathcad provides a number of built-in smoothing filters,


which are often sufficient for cleaning-up your
data/waveforms prior to the primary processing operation.
Sometimes, however, the built-in filters oversmooth data,
removing features that are not noise, and you cannot obtain
the desired level or quality of filtering you need. For example,
here is a noisy signal which we might want to filter before
analyzing its key characteristics.

Signal := READPRN("signal.prn")        i := 0 .. last(Signal)

Time_i := i/300

Passing the signal through the three built-in smoothing filters, we see:

NFilt1 := 41        NFilt2 := NFilt1/300

Filtered_Med := medsmooth(Signal, NFilt1)

Filtered_K := ksmooth(Time, Signal, NFilt2)

Filtered_Sup := supsmooth(Time, Signal)


[Plot: Median Smoothing — filtered signal and original signal]

[Plot: K (Gaussian Weighted) Smoothing — filtered signal and original signal]

[Plot: Super (Variable Bandwidth) Smoothing — filtered signal and original signal]

With this dataset, there are problems with all of the above smoothing techniques. Median smoothing artificially 'squared up' the high-frequency ripple. Gaussian smoothing overattenuated the high-frequency ripple and artificially 'rounded off' the square pulse. Super smoothing removed the high-frequency ripple (highly over-filtered) and artificially rounded the square pulse.

[Detail plots: zoomed views of the high-frequency ripple and the square pulse for each filter]
Mean smoothing
In cases such as this, you can write your own sliding window
filters using a Mathcad program block (multiline function), as
shown in the example below. This "Mean" smoother will filter
your data (input vector "Data") by taking the average (mean)
of the data points within the sliding window (width specified
by "Width"):

MeanSmooth(Data, Width) ≡
    "Calculate 1/2-width of filtering window"
    WidthHalf ← trunc(Width/2)
    "Calculate where we need to start collapsing the window"
    NearEnd ← last(Data) − WidthHalf
    "Iterate through all of the data points"
    for Pt ∈ 0 .. last(Data)
        "Calculate beginning (Start) of sliding window"
        Start ← Pt − WidthHalf  if Pt ≥ WidthHalf
        Start ← 0  otherwise
        "Calculate ending (Stop) of sliding window"
        Stop ← Pt + WidthHalf  if Pt ≤ NearEnd
        Stop ← last(Data)  otherwise
        "Use 'submatrix' to extract the window of data, then take the mean of that chunk"
        Out_Pt ← mean( submatrix(Data, Start, Stop, 0, 0) )
    "Return the filtered data"
    return Out

Testing this filter with the signal yields

Filtered_Mean := MeanSmooth(Signal, NFilt1)

[Plot: Mean Smoothing — filtered signal and original signal]
Cosine smoothing

A variant of the mean smoother, shown above, is the "Cosine Weighted Mean" smoother, which weights the averaged data points so that those near the center of the window have more weight (importance) than those near the edges of the window:

CosMeanSmooth(Data, Width) :=
    "Calculate 1/2-width of filtering window"
    WidthHalf ← trunc(Width/2)
    "Calculate where we need to start collapsing the window"
    NearEnd ← last(Data) − WidthHalf
    "Store away pi divided by 2 because we'll need it often"
    Π ← π/2
    "Iterate through all of the data points"
    for Pt ∈ 0 .. last(Data)
        "Calculate beginning (Start) of sliding window"
        Start ← Pt − WidthHalf  if Pt ≥ WidthHalf
        Start ← 0  otherwise
        "Calculate ending (Stop) of sliding window"
        Stop ← Pt + WidthHalf  if Pt ≤ NearEnd
        Stop ← last(Data)  otherwise
        "Calculate 'Area Correction Factor' to compensate for the size of the sliding window"
        η ← 1 / Σ_{j = Start .. Stop} cos( Π·(j − Pt)/WidthHalf )
        "Take the sum of the cosine-weighted data points in the window"
        Out_Pt ← η·Σ_{j = Start .. Stop} Data_j·cos( Π·(j − Pt)/WidthHalf )
    "Return the filtered data"
    return Out
Note the truncated (narrowed) window widths at the ends of the
data, as controlled by the variables "Start" (beginning of sliding
window) and "Stop" (ending of sliding window). At the 1st data
point, only 1/2 of the specified window width is available for
averaging. As the program block iterates through the data, the
averaging window expands until it is the specified width, which
occurs when it has reached the "WidthHalf" data point. It
continues to slide across the data until it nears the end of the
data, at which point (at "NearEnd" data point) it begins to
collapse. When it finally reaches the last data point, the
averaging window is once again only 1/2 of the specified width. It
is because of this automatically expanding/collapsing window
that the summation of weighted points needs to be multiplied by
the scaling factor containing the "Start" and "Stop" indices:

η ← 1 / Σ_{j = Start .. Stop} cos( Π·(j − Pt)/WidthHalf )
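
For use outside Mathcad, here is a hedged NumPy sketch of the same cosine-weighted sliding mean; the function name cos_mean_smooth and the synthetic test signal are inventions of the sketch (signal.prn is not reproduced), but the window collapse at the ends and the per-window renormalization follow the program above.

import numpy as np

def cos_mean_smooth(data, width):
    # Cosine-weighted sliding mean: the window collapses to half width at the
    # ends of the data, and the weights are renormalized (the eta factor)
    # for every window position.
    data = np.asarray(data, dtype=float)
    half = width // 2
    last = data.size - 1
    out = np.empty_like(data)
    for pt in range(data.size):
        start = max(pt - half, 0)
        stop = min(pt + half, last)
        j = np.arange(start, stop + 1)
        w = np.cos((np.pi / 2) * (j - pt) / half)
        out[pt] = np.sum(w * data[j]) / np.sum(w)
    return out

# Usage on a synthetic noisy signal: a square wave plus high-frequency ripple.
t = np.linspace(0, 3.5, 1050)
signal = np.sign(np.sin(2 * np.pi * 0.5 * t)) * 0.5 + 0.2 * np.sin(2 * np.pi * 20 * t)
smoothed = cos_mean_smooth(signal, 41)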

Testing this filter, we see:

Filtered_Cos := CosMeanSmooth(Signal, NFilt1)

[Plot: Cosine Mean Smoothing — filtered signal and original signal]

Note, however, that the area compensation factor (η) needs to


be adjusted for each shape by performing an iterated sum of the
shape's values in order to find the area under the curve. Since
the curve (window) expands and collapses dynamically at the
data endpoints, this needs to be calculated on-the-fly, or, at
least when near the ends. This process can be made more
efficient (since the collapsed window occurs a small fraction of
the time) by only repeatedly calculating the window area when
near the ends of the data, as shown in the following revision of
the program block:
CosMeanSmooth(Data, Width) :=
    "Calculate 1/2-width of filtering window"
    WidthHalf ← trunc(Width/2)
    "Calculate where we need to start collapsing the window"
    NearEnd ← last(Data) − WidthHalf
    "Store away pi divided by 2 because we'll need it often"
    Π ← π/2
    "Calculate the window area when free and clear of the ends of the data"
    η_mid ← 1 / Σ_{j = 0 .. 2·WidthHalf} cos( Π·(j − WidthHalf)/WidthHalf )
    "Iterate through all of the data points"
    for Pt ∈ 0 .. last(Data)
        "Calculate beginning (Start) of sliding window"
        Start ← Pt − WidthHalf  if Pt ≥ WidthHalf
        Start ← 0  otherwise
        "Calculate ending (Stop) of sliding window"
        Stop ← Pt + WidthHalf  if Pt ≤ NearEnd
        Stop ← last(Data)  otherwise
        "Calculate 'Area Correction Factor' to compensate for the size of the sliding window"
        η ← η_mid  if WidthHalf ≤ Pt ≤ NearEnd
        η ← 1 / Σ_{j = Start .. Stop} cos( Π·(j − Pt)/WidthHalf )  otherwise
        "Take the sum of the cosine-weighted data points in the window"
        Out_Pt ← η·Σ_{j = Start .. Stop} Data_j·cos( Π·(j − Pt)/WidthHalf )
    "Return the filtered data"
    return Out
Testing this more efficient version of the smoother:

Filtered_Cos2 := CosMeanSmooth(Signal, NFilt1)

[Plot: Cosine Mean Smoothing — filtered signal and original signal]

Notice that this filter doesn't suffer (as much) from the same problems as the built-in smoothing filters. The high-frequency ripple isn't as severely attenuated, and the square pulse is more accurately reproduced:

[Detail plots: zoomed views of the high-frequency ripple and the square pulse]

The graph below illustrates three variants of "Weighted Mean" smoothing windows that are easily implemented with the programming block above.

We can find the area under the curves, presuming untruncated (full-width) windows, as:

Π := π/2        Width := 10000        WidthHalf := 50%·Width

"Cosine" Window

Width
 j − WidthHalf 
∑ cos Π ⋅
 WidthHalf


j =0
= 63.661977 %
Width

"Cosine Squared" Window

2
Width
 j − WidthHalf 
∑ cos Π ⋅
 WidthHalf


j =0
= 50 %
Width
"Linear Ramp" Window

Width
 j − WidthHalf  
∑   1 − Width
  Half


j =0
= 50 %
Width
EXAMPLES

Limitations of Linear Correlation and Extrapolation


by Paul Lorczak

Linear correlation has the potential to be misinterpreted. This example shows the potential for misunderstanding, and the dangers of extrapolation from fitted parameters. Consider the following example. Let z and y be related by the following formula:

y(z) := 0.4·z² + 5·z + 1

This is not a linear relationship. However, if we have only data from a small interval, we might compute the correlation coefficient based on a sampling that is insufficient to see the relationship:

i := 0 .. 19        z_i := 0.001·i − 0.01        fz_i := y(z_i)

It is interesting to note that the covariance is quite small:

cvar(z, fz) = 1.662367 × 10⁻⁴

But from Pearson's correlation coefficient, it looks as if the points are collinear:

corr(z, fz) = 1

The intercept and slope of this line are

(a b)ᵀ := line(z, fz)        a = 1.000013        b = 4.9996

When we plot the points and the line, we get this graph:

[Plot: fz_i and a + b·z_i versus z_i]
It seems conclusive that z and fz are linearly correlated
over this small range. Once we move to larger values of
z, however, the difference between the true functional
value and the assumed linear one becomes apparent.

zi := 10⋅ i − 100 fz i := y ( zi)

corr ( z , fz ) = 0.048599
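
The same effect is easy to reproduce in a few lines of Python; this is only a hedged sketch, with numpy's corrcoef and polyfit standing in for Mathcad's corr and line:

import numpy as np

# The same quadratic sampled over a narrow and a wide interval; the Pearson
# correlation coefficient only "sees" the local, nearly linear behaviour.
y = lambda z: 0.4 * z**2 + 5 * z + 1

z_narrow = 0.001 * np.arange(20) - 0.01
z_wide = 10.0 * np.arange(20) - 100.0

for z in (z_narrow, z_wide):
    fz = y(z)
    r = np.corrcoef(z, fz)[0, 1]
    slope, intercept = np.polyfit(z, fz, 1)
    print(f"z in [{z.min()}, {z.max()}]: corr = {r:.6f}, line = {intercept:.4f} + {slope:.4f}*z")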

[Plot: graph of y(z) and extrapolation of the best-fitting line for |z| small]

Here is another set of data with what looks like a linear functional relationship.

S := (0, 1, 2, 3, 4, 5, 6)ᵀ        T := (1, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6)ᵀ

Again, we compute the intercept and slope of the best-fitting line.

SL := slope(S, T)        INT := intercept(S, T)

SL = 0.1        INT = 1

j := 0 .. 6        range := 0.01, 0.05 .. 6.01

F(u) := 3·sin( 12·π / (u − 3) ) + 1 + u/10
When we graph the points, we see that they
might be drawn from any number of functions
that pass through them. Adequate sampling
frequency and range of measurement is critical
to draw meaningful conclusions about your data.

[Plot: T_j versus S_j, INT + SL·range versus range, and F(range) versus range]
EXAMPLES

Numerical Integration of Data


by Jean Giraud, Richard Jackson, and Leslie Bondaryk

One common problem in data analysis is that of


integrating over a data set which can't be accurately fit
with an analytical function. Various discrete techniques
can be used, or the data can be interpolated with a
function that can then be used to integrate over an
arbitrarily close spacing of points, yielding more accurate
results.

Here are two vectors that represent the same exponentially


shaped data. They present several challenges for
integration routines. First, they are sparse, that is, there is
wide spacing between points and they cover a large range
of magnitudes. X1 has approximately half the number of
points as X2, spanning the same range of values. We wish
to integrate over the same limits for both vectors so that
we can compare five methods of numerical integration.

f(x) := 25·exp(−41·x) + 0.015

n1 := 25        n2 := 51

Create a range variable for each data set:

xstart := 0        xend := 0.5

ix1 := 0 .. n1 − 1        ix2 := 0 .. n2 − 1

Generate vectors for the 25- and 51-point data sets in x and y:

X1_{ix1} := (xend − xstart)·ix1/(n1 − 1)        Y1_{ix1} := f(X1_{ix1})

X2_{ix2} := (xend − xstart)·ix2/(n2 − 1)        Y2_{ix2} := f(X2_{ix2})

∆x1 := X1_1 − X1_0 = 0.020833        ∆x2 := X2_1 − X2_0 = 0.01

The first few values of each vector:

Y1 = (25.015, 10.65594354, 4.54418717, 1.942793, 0.83554146, ...)ᵀ
Y2 = (25.015, 16.60625625, 11.02579136, 7.32231444, 4.86450106, ...)ᵀ

The spacing of the first (25-point) data set is about 0.021, and that of the second (51-point) set is 0.01.

[Plot: Y1 and Y2 versus X1 and X2]

Ideally, when integrated, these two sets of data should yield the
same area under the curve.

Trapezoidal and Simpson's rules


Traditional numerical integration methods use quadrature rules
to obtain approximate areas. The simplest of these quadrature
methods are the Trapezoidal rule and Simpson's rule. The
Trapezoidal rule can be applied to any vector of data, whereas
Simpson's rule requires that the number of data points is odd
(i.e. an even number of divisions). While the Trapezoidal rule
fits a straight line to each division (a straight line is drawn from
data point to data point) forming trapezoidal areas, which are
summed, Simpson's rule fits second degree polynomials or
parabolas to three subsequent points, in many instances giving
a better approximation to the integrated area.

Trapezoidal(V, h) := h·( (V_{last(V)} + V_0)/2 + Σ_{i = 1 .. last(V)−1} V_i )

Trapezoidal(Y1, ∆x1) = 0.65389        Trapezoidal(Y2, ∆x2) = 0.62577

Simpsons(V, h) := (h/3)·( V_{last(V)} + V_0 + 4·Σ_{i = 1 .. last(V)/2} V_{2·i−1} + 2·Σ_{i = 1 .. (last(V)−2)/2} V_{2·i} )

Simp51 := Simpsons(Y1, ∆x1)        Simp25 := Simpsons(Y2, ∆x2)

Simp51 = 0.618914        Simp25 = 0.61735

Note that there is much better agreement between the two values using Simpson's rule. Sometimes it may even be warranted to fit higher-degree polynomials to the data, in which case the Newton-Cotes formulas may be used.
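
For reference, the same two quadrature rules are available in scipy; this hedged sketch regenerates the two sampled data sets and applies them (the printed values should be close to the worksheet's, up to each implementation's handling of the endpoints):

import numpy as np
from scipy.integrate import trapezoid, simpson

f = lambda x: 25 * np.exp(-41 * x) + 0.015
X1 = np.linspace(0.0, 0.5, 25)        # sparse data set
X2 = np.linspace(0.0, 0.5, 51)        # dense data set
Y1, Y2 = f(X1), f(X2)

print(trapezoid(Y1, X1), trapezoid(Y2, X2))   # compare with 0.65389 and 0.62577
print(simpson(Y1, x=X1), simpson(Y2, x=X2))   # compare with the Simpson values above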

Integration using interpolating functions

A better approach is to create an interpolating function, for example using splines, and then apply Mathcad's numerical integration methods to the interpolating function. The advantage of this approach is that the numerical integration methods can have an arbitrarily small step size. What is important in this approach is that the chosen interpolation function is well suited to the shape of the curve it must interpolate. We'll compare three interpolation methods: a cubic spline with parabolic endpoints (pspline), linear interpolation, and a rational function interpolation.

The pspline:

sx1 := pspline(X1, Y1)        sx2 := pspline(X2, Y2)

psp(x) := interp(sx1, X1, Y1, x)        PSP(x) := interp(sx2, X2, Y2, x)

The linear interpolation:

lin(x) := linterp(X1, Y1, x)        Lin(x) := linterp(X2, Y2, x)

And the rational function interpolation:

r(x) := rationalint(X1, Y1, x)_0        R(x) := rationalint(X2, Y2, x)_0

Compare the three methods and the previous results from Simpson's rule. Since both sets of data represent the same curve, we would like the values for a similar method to yield similar results.

                              Using X1, Y1                      Using X2, Y2
pspline interpolation:        ∫₀^0.5 psp(x) dx = 0.62081        ∫₀^0.5 PSP(x) dx = 0.617497
linear interpolation:         ∫₀^0.5 lin(x) dx = 0.65388        ∫₀^0.5 Lin(x) dx = 0.62572
rational interpolation:       ∫₀^0.5 r(x) dx = 0.617256         ∫₀^0.5 R(x) dx = 0.617256

Simp51 = 0.618914        Simp25 = 0.61735

The integral of the actual function used to generate this data is

∫₀^0.5 f(x) dx → 0.61725609679868727660

Integration of the rational interpolant produces a correct value to six significant figures, and, what's more, it produces consistent values for both numbers of data points. Rational function interpolation is particularly well suited to asymptotic data, and interpolates between the values well even in the sparse case.

The parabolic-endpoint cubic spline does well in the case where there are more points, but fails somewhat when the data is sparse, as it cannot adequately predict the interim values for this type of curve. Linear interpolation produces much the same result as the Trapezoidal method, as we might expect, since both are drawing straight lines from one data point to the next and finding the area of the resulting trapezoid.
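
As a rough cross-check in Python (a hedged sketch: scipy's CubicSpline is not Mathcad's pspline and scipy has no rationalint equivalent, so the numbers will only be comparable, not identical):

import numpy as np
from scipy.integrate import quad
from scipy.interpolate import CubicSpline, interp1d

f = lambda x: 25 * np.exp(-41 * x) + 0.015
X1 = np.linspace(0.0, 0.5, 25)                 # the sparse data set
Y1 = f(X1)

spl = CubicSpline(X1, Y1)                      # cubic spline interpolant
lin = interp1d(X1, Y1)                         # linear interpolant
print(spl.integrate(0.0, 0.5))                 # spline-based integral
print(quad(lin, 0.0, 0.5, limit=200)[0])       # ~ trapezoidal result
print(quad(f, 0.0, 0.5)[0])                    # exact value, 0.617256...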

This should convince you of the importance of choosing the most appropriate interpolating method you can, as the errors propagate when you make subsequent use of the data. For example, look at the errors generated by two of the interpolation methods for the missing data points in the sparse vector. Generate the interpolated points for the 51-point spacing using the interpolation for the 26-point spacing:

I := R(X1)   (element-wise)

Find the errors associated with each of these points, using the second value returned by rationalint.

E_{ix2} := rationalint(X1, Y1, X2_{ix2})_1        err := 2·E

The error is 0 at the points which overlap in Y2, since these were used to generate the interpolating values, but at the interleaved points the errors are very small.

err/Y2 (element-wise), in %:

 0     0
 1     1.03136·10⁻⁸
 2     2.676739·10⁻¹⁰
 3    −5.796653·10⁻¹⁰
 4    −5.205209·10⁻¹¹
 5     6.770412·10⁻¹¹
 6     9.668197·10⁻¹²
 7    −1.082638·10⁻¹¹
 8    −2.72565·10⁻¹²
 9     2.064213·10⁻¹²
10     6.988379·10⁻¹³

If we do the same analysis using the pspline values for the sparse case,

Ipsp := psp(X2)        Epsp := Y2 − Ipsp

we see much larger errors, particularly at the hard-to-fit initial values on the asymptotic slope.

Epsp/Y2 (element-wise), in %:

0     0
1    −2.077208
2    −0.220314
3     1.471206
4     0.324831
5    −0.758896
6    −0.236028
7     0.583565
8     0.296593

This difference accounts for the difference in the


integration values.
EXAMPLES

Optimal Spacing of Interpolation Points


by Robert Adair

In his seminal book on spline fitting, deBoor says,


"Polynomial interpolation at appropriately chosen points (e.g.
the Chebyshev points) produces an approximation which ...
differs very little from the best possible approximant by
polynomials of the same order". That is, if your points are
well chosen, a single polynomial of appropriate order will fit
all your data well. If they are not, then a spline fitting
method may be a better choice. In particular, uniformly
spaced points can have bad consequences. Further, "...If the
function to be approximated is badly behaved anywhere in
the interval of approximation, then the approximation is poor
everywhere. This global dependence on local properties can
be avoided when using piecewise polynomial approximants."
by which he means cubic splines or bsplines.

What this means is that a polynomial of the same order as


the number of the data points will pass through all the
points, but can give a terrible approximation to the function
elsewhere if the points are badly chosen. For example,
consider the following uniformly spaced sample points across
a Lorentzian (Runge) function:

Endpoints:   a := −0.9    b := 1

n = 8 points     i := 0 .. n     vx_i := a + i·(b − a)/n

The function to sample:    g1(x) := 1/(1 + 25·x^2)

Sample the function:       vy_i := g1(vx_i)

Create a global polynomial interpolation using this data spacing

fit ( x) := polyint ( vx , vy , x) 0

j := 0 .. 200     step := (b − a)/200     x_j := a + j·step
You might expect that as you increase the number of
sampled points and the order of the polynomial, n, the
approximation would improve, but it does not. Try n = 20.

n≡8

[Plot: the sampled points vy_i, the interpolating polynomial fit(x), and g1(x), versus vx_i and x.]

What deBoor recommends is instead choosing the interpolating


points as the zeros of the Chebyshev polynomial of degree n,

vxCheb_i := (a + b − (a − b)·cos((2·i + 1)·π / (2·(n + 1)))) / 2

vxCheb := sort ( vxCheb) vyChebi := g1 ( vxChebi)

data^〈0〉 := vxCheb        data^〈1〉 := vyCheb

Interpolate the same function using these new points. Go back and
try a few values of n, to see that these points generally provide a
nicer fit everywhere for larger values of n.

fitCheb ( x) := polyint ( vxCheb , vyCheb , x) 0
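The same experiment is easy to sketch in Python. The following is a minimal
NumPy/SciPy version (BarycentricInterpolator standing in for Mathcad's polyint)
that you can use to compare the uniform and Chebyshev spacings for various n.

    import numpy as np
    from scipy.interpolate import BarycentricInterpolator

    a, b, n = -0.9, 1.0, 8
    g1 = lambda x: 1.0 / (1.0 + 25.0 * x**2)

    vx = a + np.arange(n + 1) * (b - a) / n                             # uniform nodes
    i = np.arange(n + 1)
    vx_cheb = (a + b - (a - b) * np.cos((2*i + 1) * np.pi / (2*(n + 1)))) / 2

    fit      = BarycentricInterpolator(vx,      g1(vx))                 # global polynomial
    fit_cheb = BarycentricInterpolator(vx_cheb, g1(vx_cheb))

    xs = np.linspace(a, b, 201)
    print("max |error|, uniform  :", np.max(np.abs(fit(xs)      - g1(xs))))
    print("max |error|, Chebyshev:", np.max(np.abs(fit_cheb(xs) - g1(xs))))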


[Plot: vyCheb_i, fitCheb(x), and g1(x), versus vxCheb_i and x.]

The Chebyshev point spacing does indeed provide a better


interpolation for higher values of n, as we expect. However,
it would be nice to know what the optimal spacing of points
was so that we could use a lower value of n and still get
good results. This can be calculated if we recall that a global
polynomial interpolation is the same as a polynomial
regression of the same order as the number of data points.

Optimal point spacing


Based on the notion that a polynomial of order n can always be
found that passes through n points, we can find the polynomial
that best approximates the given function g, and find which
points on the curve it fits exactly. These points are the optimal
interpolation points, that is, the points that, when interpolated
with a polynomial, provide the best possible approximation
everywhere on the interval.

We need to work with the integral version of the least squares


problem to get the optimal solution for a polynomial fit.

Error := ∫[a,b] ( Σ_{n=0..degree} c_n·x^n − g(x) )^2 dx

The normal equations are generated by taking the derivative of


the error with respect to the polynomial coefficients and
setting the result equal to 0.

(d/dc_k) Error := 2·∫[a,b] ( Σ_{n=0..degree} c_n·x^n − g(x) )·x^k dx
To solve the equations, we will need to evaluate the integrals

fI0(n, k) := ∫[a,b] x^(n+k) dx        fI1(k) := ∫[a,b] g1(x)·x^k dx

In general, the integral fI1 will require numerical evaluation.


In the case of a Runge function, it could be evaluated
symbolically, but the speed of evaluation is good enough that
both fI0 and fI1 will be left as numerical integrals.

The least squares matrix equation for the coefficients then


becomes

M⋅ c = yintegral

where:
degree := n     k := 0 .. degree     k1 := 0 .. degree

yintegral_k := fI1(k)        M_{k,k1} := fI0(k, k1)

Solving for the optimal interpolation coefficients, c gives:


L := cholesky(M)        Linverse := L^−1

c := Linverse^T · Linverse · yintegral
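For reference, here is a minimal Python sketch of the same construction: build
the normal-equation matrix M and right-hand side from the integrals, then solve
with a Cholesky factorization. The closed form used for fI0 and SciPy's quad
for fI1 are choices of this sketch, not part of the worksheet.

    import numpy as np
    from scipy.integrate import quad
    from scipy.linalg import cho_factor, cho_solve

    a, b, degree = -0.9, 1.0, 8          # degree := n, as in the worksheet
    g1 = lambda x: 1.0 / (1.0 + 25.0 * x**2)

    # fI0 has a closed form; fI1 is evaluated numerically with quad
    M = np.array([[(b**(k + k1 + 1) - a**(k + k1 + 1)) / (k + k1 + 1)
                   for k1 in range(degree + 1)] for k in range(degree + 1)])
    yintegral = np.array([quad(lambda x, k=k: g1(x) * x**k, a, b)[0]
                          for k in range(degree + 1)])

    # M is symmetric positive definite (though ill-conditioned for large degree)
    c = cho_solve(cho_factor(M), yintegral)
    fit_opt = lambda x: np.polyval(c[::-1], x)   # c[k] is the coefficient of x**k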

Then the optimal fit equation is given by the new polynomial of


order n

fitOpt(x) := Σ_k c_k·x^k

Take a look at the interpolated error over the data range:


ChebErr ( x) := g1 ( x) − fitCheb ( x) OptErr ( x) := g1 ( x) − fitOpt ( x)

[Plot: ChebErr(x) and OptErr(x) over the data range.]
Finding the zeros of the error will give the points at which
the optimal fit polynomial is exactly equal to the function,
and these are the optimal interpolation points. To do this,
let's bracket the zero crossings, then use these brackets as
guesses for the root solver.

Bracket all the roots

fGuess(a, b) :=   npts ← 200
                  dx ← (b − a)/npts
                  x0 ← a
                  y0 ← OptErr(x0)
                  k ← 0
                  for i ∈ 1 .. npts
                      x1 ← a + i·dx
                      y1 ← OptErr(x1)
                      if y0·y1 ≤ 0
                          if y0·y1 = 0
                              x1 ← x1 + dx/2
                              y1 ← OptErr(x1)
                          RootBracket_{k,0} ← x0
                          RootBracket_{k,1} ← x1
                          k ← k + 1
                      x0 ← x1
                      y0 ← y1
                  return RootBracket

Refine the roots from the initial bracketing guesses


xOpt :=   guess ← fGuess(a, b)
          nRoots ← rows(guess)
          for i ∈ 0 .. nRoots − 1
              x ← (guess_{i,0} + guess_{i,1})/2
              r_i ← root(OptErr(x), x, guess_{i,0}, guess_{i,1})
          return r
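The bracket-then-refine step can be sketched in Python with SciPy's brentq
taking the place of Mathcad's root; it assumes an error function such as
g1(x) − fitOpt(x) from the sketches above.

    import numpy as np
    from scipy.optimize import brentq

    def find_crossings(err, a, b, npts=200):
        """Bracket sign changes of err on [a, b], then refine each root."""
        xs = np.linspace(a, b, npts + 1)
        ys = np.array([err(x) for x in xs])
        roots = []
        for x0, x1, y0, y1 in zip(xs[:-1], xs[1:], ys[:-1], ys[1:]):
            if y1 == 0.0:
                roots.append(x1)                 # landed exactly on a zero
            elif y0 * y1 < 0.0:
                roots.append(brentq(err, x0, x1))
        return np.array(roots)

    # x_opt = find_crossings(lambda x: g1(x) - fit_opt(x), a, b)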

yOpt := fitOpt ( xOpt)


Compare the optimal fit to the Chebyshev fit. Overall the two
fits have the same general appearance, and the ripple is much
smaller than for the fit through equally spaced data points. Comparing the
location of the interpolation points, one finds the Chebyshev
points have a similar distribution to the optimal points which
helps to reduce the overall ripple. Adjust the value of n to see
how these two approximations change with sampling.

[Plot: Chebyshev points, optimal points, the optimal polynomial interpolation, the original function, and the Chebyshev-points polynomial interpolation.]

If you have the luxury of choosing where your data will be


sampled, you may wish to use guidelines of this sort when
choosing your data spacing.

Reference
de Boor, Carl (1978), A Practical Guide to Splines,
Springer-Verlag, Chapter 2.
EXAMPLES

Principal Component Regression of NIR Spectra for Alcohol


Mixtures
by Richard Jackson
Bruker Optics

This data represents the near-infrared (NIR) spectra of 15


mixtures of three alcohols: methanol, ethanol, and propanol.
The first column is wavenumber (1/wavelength), in cm^-1; the
remaining 30 columns are the spectral absorbances, 2 spectra
of each mixture. The objective is to produce a calibration
based on this data that can later be used to predict the
concentrations of the alcohols in an unknown mixture.

DATA :=
8900 -0.00673 -0.00682 -0.00586 -0.00572
8896.144 -0.00638 -0.00646 -0.00528 -0.00513
8892.287 -0.00603 -0.00611 -0.00465 -0.00449
8888.431 -0.00573 -0.00583 -0.00401 -0.00386
8884.575 -0.00552 -0.0056 -0.00338 -0.00324
8880.719 -0.00537 -0.00542 -0.00273 -0.00261
8876.862 -0.00525 -0.00528 -0.00206 -0.00193
8873.006 -0.00512 -0.00517 -0.00134 -0.00119
8869.15 -0.00495 -0.00501 -0.00055 -0.00037
8865.294 -0.00471 -0.00475 0.000305 0.000509
8861.437 -0.00438 -0.00441 0.001213 0.001413
8857.581 -0.00395 -0.00399 0.002172 0.002355
8853.725 -0.00344 -0.00348 0.003187 0.003363
8849.869 -0.00283 -0.00288 0.004242 0.004431
8846.012 -0.00212 -0.00218 0.005344 0.005554

Create Data vectors

First, split X (independent variables - the wavenumbers) and Y


(dependent variables - the measured absorbances) data into a
vector and a matrix. Also transpose the data so that the
spectra are in the rows, so that each column corresponds to an
independent variable.

X := DATA^〈0〉

A1 := submatrix(DATA, 0, rows(DATA) − 1, 1, cols(DATA) − 1)^T

Plot a few of the spectra to see what they look like (by
convention, wavenumbers are plotted in decreasing order):
[Plot: Absorbance vs. Wavenumber, 8500 to 4500 cm^-1, for a few of the spectra.]

Multiple Linear Regression

One method we can use for the calibration is multiple linear


regression (MLR), also sometimes called Inverse Least Squares
(ILS) or k-matrix. The first step is to generate a calibration curve.
We can model the data as

C = A1⋅ B + E

where for l components (in this case 3), m standards (in this case
30), and n variables (in this case absorbances) C is a m x l matrix
of concentrations, A is the m x n matrix of variables, B is the n x l
matrix of calibration coefficients, and E is a m x l matrix of
residuals. The calibration coefficients are found using the equation

B = (A1^T·A1)^−1 · A1^T·C

An estimate of the concentrations for an unknown spectrum, a,


can then be predicted using the equation

c = a^T·B
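As a rough illustration in Python, the calibration and prediction steps can be
written with NumPy; lstsq is used instead of forming (A^T·A)^-1 explicitly,
which is numerically safer but gives the same B when the problem is well posed.

    import numpy as np

    def mlr_calibrate(A, C):
        """Calibration coefficients B for the ILS/MLR model C = A·B + E."""
        B, *_ = np.linalg.lstsq(A, C, rcond=None)
        return B

    def mlr_predict(a, B):
        """Predicted concentrations for a spectrum (row vector) or spectra (matrix) a."""
        return a @ B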

The problem with this methodology is that the dimension n (i.e.


the number of variables) cannot exceed the dimension m (i.e. the
number of standards), otherwise the matrix A1^T·A1 has no
inverse (the problem is underdetermined).
This is clearly a problem with the data above, which has only 30
standards, but 1142 absorbances. Looking at the data, however,
most of it must be redundant. Not only are the spectra at different
wavenumbers interrelated, but we only have three concentrations
varying (which, since they add to 100%, represents only two
degrees of freedom), plus other minor effects due to interactions
between the alcohols, temperature changes, etc. It must therefore
be possible to reduce the number of absorbances so that the
calibration coefficients can be found. One obvious way to do this is
to just keep the absorbances at a few wavenumbers, and throw
the rest of the data away. This leaves us with the problem of which
wavenumbers to choose, and if the absorbances that are chosen
for the calibration are collinear, the matrix A1^T·A1 still has no
inverse.

Now we need to pick a number of wavelengths. We could pick up


to 30, but let's just choose 5 because it's easier (and because
there is a better way to do this anyway).

 4600 
 5700 
 
WaveNumbers :=  6500 
 7500 
 
 8500 

Use the Match function to get a smaller data matrix


corresponding only to those wavenumbers

Anew :=   NewData ← A1^〈Match(WaveNumbers_0, X, "near")_0〉
          for i ∈ 1 .. rows(WaveNumbers) − 1
              NewData ← augment(NewData, A1^〈Match(WaveNumbers_i, X, "near")_0〉)
          NewData

Now we can calibrate for the concentrations. These are the


concentrations of the alcohols methanol, ethanol, and
propanol in each of the 30 samples
C :=
0 1 2
0 0 0 100
1 0 0 100
2 100 0 0
3 100 0 0
4 0 100 0
Calculate the calibration coefficients:

B := (Anew^T·Anew)^−1 · Anew^T·C

Now predict each concentration for all the samples, to see how
good the calibration is:
c := Anew ⋅ B i := −10 , 0 .. 110

[Plots: predicted value vs. reference value for methanol, ethanol, and propanol.]

The r^2 values for the three calibrations are:

corr(C^〈0〉, c^〈0〉)^2 = 0.99895
corr(C^〈1〉, c^〈1〉)^2 = 0.98736
corr(C^〈2〉, c^〈2〉)^2 = 0.99386

These calibrations look OK (although they could certainly be


better!), but this is not a good way to evaluate how well the
data has been modelled. We are interested in the ability of the
model to predict concentrations from data that was not included
in the calibration. We can get a measure of this by performing
what is termed a cross validation. In this scheme we remove the
data for sample 1 (in this case, the first two spectra), calibrate
using the remaining data, then predict the concentrations for
sample 1. The process is then repeated for sample 2, sample 3,
etc. Here is a function that performs a cross validation given a
matrix of data, a matrix of concentrations, and the number of
samples to remove each time.

Function to split matrix into two, by pulling out N rows starting


at index:
SplitMatrix(M, index, N) :=   OUT_0 ← submatrix(M, index, index + N − 1, 0, cols(M) − 1)
                              for i ∈ index .. index + N − 1
                                  ind_{i−index} ← i
                              OUT_1 ← trim(M, ind)
                              OUT

Cr_Val(DATA, C, N) :=   RMSECV ← 0
                        for i ∈ 0, N .. rows(DATA) − N
                            "calibrate"
                            SplitData ← SplitMatrix(DATA, i, N)
                            CalData ← SplitData_1
                            PredData ← SplitData_0
                            CalC ← SplitMatrix(C, i, N)_1
                            B ← (CalData^T·CalData)^−1 · CalData^T·CalC
                            c ← PredData·B
                            PredictedC ← c if i = 0
                            PredictedC ← stack(PredictedC, c) otherwise
                        PredictedC
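A minimal NumPy sketch of this leave-N-out scheme is shown below; it assumes
the rows of A are the spectra and the rows of C the matching concentrations,
and it uses lstsq in place of the explicit normal-equations inverse.

    import numpy as np

    def cross_validate(A, C, N):
        """Leave out N consecutive rows at a time, calibrate on the remainder,
        and return the stacked predictions for the left-out rows."""
        predictions = []
        for i in range(0, A.shape[0], N):
            keep = np.ones(A.shape[0], dtype=bool)
            keep[i:i + N] = False
            B, *_ = np.linalg.lstsq(A[keep], C[keep], rcond=None)
            predictions.append(A[~keep] @ B)
        return np.vstack(predictions)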

Now we can look at the results of the cross validation:


c := Cr_Val( Anew , C , 2 )
[Plots: cross-validated predicted value vs. reference value for methanol, ethanol, and propanol.]

The r^2 values for the three calibrations are:

corr(C^〈0〉, c^〈0〉)^2 = 0.9972
corr(C^〈1〉, c^〈1〉)^2 = 0.96529
corr(C^〈2〉, c^〈2〉)^2 = 0.98241
An indicator of the average error in the cross validation is given
by the root mean square error of cross validation, or RMSECV:

RMSECV(Ref, Pred) := √( Σ_{i=0..last(Ref)} (Ref_i − Pred_i)^2 / rows(Ref) )
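The same figure of merit in NumPy, for reference:

    import numpy as np

    def rmsecv(ref, pred):
        """Root mean square error of cross validation."""
        ref, pred = np.asarray(ref, float), np.asarray(pred, float)
        return np.sqrt(np.mean((ref - pred) ** 2))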

For methanol, ethanol, and propanol the values are:

RMSECV(C^〈0〉, c^〈0〉) = 1.55091
RMSECV(C^〈1〉, c^〈1〉) = 5.48239
RMSECV(C^〈2〉, c^〈2〉) = 3.87515

We could improve the cross validations by changing the number


of absorbances used, and by finding the best wavenumbers to
use, but this is not easy. There are obviously a huge number of
possible combinations we could try. We have also thrown away
a lot of information, and the variables we are left with are not
orthogonal. If they are not chosen carefully they may in fact be
collinear, in which case we cannot obtain the calibration
coefficients.

Fortunately, there is a much better way to compress the data


than just throwing most of it away. Principal Component Analysis
will yield the minimum number of variables required to describe the
data, and the variables are guaranteed to be orthogonal. The
regression is then performed on the new variables (the scores).
This is termed principal component regression, or PCR.

Principal Component Regression

The first step in PCA is to mean center the data. This is done
automatically by the Nipals function. By mean centering the
data we remove everything that is common to all the spectra.
i := 0 .. cols(A1) − 1        MeanA_i := mean(A1^〈i〉)

i := 0 .. rows(A1) − 1        CenteredA^〈i〉 := (A1^T)^〈i〉 − MeanA

CenteredA := CenteredA^T

Since the data has been mean centered, we will also mean
center the concentrations:

i := 0 .. cols(C) − 1         MeanC_i := mean(C^〈i〉)

i := 0 .. rows(C) − 1         CenteredC^〈i〉 := (C^T)^〈i〉 − MeanC

CenteredC := CenteredC^T

We now wish to compress the data to the minimum possible


number of orthogonal variables. To do this we will use the
Nipals function. We do not wish to scale the data to the
standard deviations, so we set the last argument to "noscale".
To start, we will just try a calibration using the first two PCs:

NumPC := 2        Acc := 10^−10        MaxIter := 100

PCA_result := Nipals( A1 , NumPC , MaxIter , "noscale" , Acc)

Get the Scores:

SCORES := scores ( PCA_result)

Get the Loadings:

LOADINGS := loadings( PCA_result)
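If you want to reproduce scores and loadings outside Mathcad, an SVD of the
mean-centered data gives the same subspace as NIPALS (component signs may
differ); the sketch below is a generic NumPy version, not a re-implementation
of the Nipals function.

    import numpy as np

    def pca_scores_loadings(A, n_pc):
        """Scores and loadings of the n_pc leading principal components,
        from an SVD of the mean-centered (unscaled) data matrix A."""
        mean = A.mean(axis=0)
        U, s, Vt = np.linalg.svd(A - mean, full_matrices=False)
        scores = U[:, :n_pc] * s[:n_pc]      # one row of scores per spectrum
        loadings = Vt[:n_pc].T               # one column of loadings per PC
        return scores, loadings, mean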

The loading vectors are vectors of the same length as the


original spectra. Here is what the first two look like. It can
sometimes be instructive to look at the loadings, because
they can indicate which parts of the data are important

[Plot: loading 1 and loading 2 vs. Wavenumber, 8500 to 4500 cm^-1.]
Now perform a cross validation:

Centeredc := Cr_Val( SCORES , CenteredC , 2 )

Add the mean concentrations back in:


i := 0 .. rows(C) − 1        c^〈i〉 := (Centeredc^T)^〈i〉 + MeanC        c := c^T

i := −10 , 0 .. 110

[Plots: cross-validated predicted value vs. reference value for methanol, ethanol, and propanol.]

The r^2 values for the three calibrations are:

corr(C^〈0〉, c^〈0〉)^2 = 0.9959
corr(C^〈1〉, c^〈1〉)^2 = 0.91845
corr(C^〈2〉, c^〈2〉)^2 = 0.94823

For methanol, ethanol, and propanol the RMSECV values are:

RMSECV(C^〈0〉, c^〈0〉) = 1.86236
RMSECV(C^〈1〉, c^〈1〉) = 8.40882
RMSECV(C^〈2〉, c^〈2〉) = 6.63735
These are comparable to the values obtained using MLR. We
can do better, though, by optimizing the number of PCs used
in the calibration. To do this, cross validate using 1 PC, then
2 PCs, then 3 PCs, etc. For each number of PCs we calculate
the sums of the squares of the prediction errors for each
component. The prediction error sum of squares (PRESS) will
tend to decrease until we start to overfit the data, when it
will increase. The PRESS function below will keep adding PCs
until the PRESS for all three components increases. We will
also calculate the r 2 value for each cross validation. To save
calculating the scores and loadings again, the PRESS
function will also return these.

PRESS(DATA, C, N) :=   "Center the concentrations"
                       for i ∈ 0 .. cols(C) − 1
                           MeanC_i ← mean(C^〈i〉)
                       for i ∈ 0 .. rows(C) − 1
                           CenteredC^〈i〉 ← (C^T)^〈i〉 − MeanC
                       CenteredC ← CenteredC^T
                       "Get the first PC and calculate the PRESS and r-squared"
                       PCA_result ← Nipals(DATA, 1, MaxIter, "noscale", Acc)
                       S ← PCA_result_0
                       L ← PCA_result_1
                       Centeredc ← Cr_Val(S, CenteredC, N)
                       for i ∈ 0 .. cols(C) − 1
                           PRESS_{0,i} ← Σ_{j=0..rows(C)−1} [ (CenteredC^〈i〉)_j − (Centeredc^〈i〉)_j ]^2
                           R_squared_{0,i} ← corr(CenteredC^〈i〉, Centeredc^〈i〉)^2
                       "Calculate more PCs until the PRESS increases or we get the maximum number"
                       for k ∈ 1 .. rows(C) − 1
                           PCA_result ← Nipals2(PCA_result, 1)
                           S ← PCA_result_0
                           L ← PCA_result_1
                           Centeredc ← Cr_Val(S, CenteredC, N)
                           for i ∈ 0 .. cols(C) − 1
                               PRESS_{k,i} ← Σ_{j=0..rows(C)−1} [ (CenteredC^〈i〉)_j − (Centeredc^〈i〉)_j ]^2
                               R_squared_{k,i} ← corr(CenteredC^〈i〉, Centeredc^〈i〉)^2
                           Num_Greater ← 0
                           for i ∈ 0 .. cols(C) − 1
                               Num_Greater ← Num_Greater + 1 if PRESS_{k,i} > PRESS_{k−1,i}
                           break if Num_Greater = cols(C)
                       OUT_0 ← PRESS
                       OUT_1 ← R_squared
                       OUT_2 ← S
                       OUT_3 ← L
                       OUT

 {10,3} 
PRESS_result = 
{10,3} 
PRESS_result := PRESS( A1 , C , 2)
 {30,10} 
 {1142,10} 
 
The PRESS values indicate that for all components 6 PCs give the
lowest prediction error:

i := 0 .. rows(PRESS_result_0) − 1

[Plot: PRESS (logarithmic scale) vs. number of PCs, 1 to 10, for methanol, ethanol, and propanol.]
Although the two indicators do not always give the same results,
in this case the r2 values also indicate that 6 PCs are optimum:

[Plot: R_squared vs. number of PCs, 1 to 10, for methanol, ethanol, and propanol.]

It is worth noting here that if we use the nth PC, there is no


requirement that we use all the PCs lower than n. In fact, the
graphs above show that when we include the 4th PC the
RMSECV and r2 for ethanol are worse, but improve again when
we include the 5th PC. The lowest RMSECV for ethanol is in fact
obtained if only PCs 1,2,3,5,6 are used. This is because PCA
compresses the data to the most dominant factors, not the
most relevant factors. The most common method used for
multivariate calibration that compresses the data to the most
relevant factors is Partial Least Squares (PLS).

Final Calibration

We will keep the first 6 scores from those returned by the PRESS
calculation (for simplicity, we will keep the 4th PC for the ethanol
calibration):

SCORES := submatrix(PRESS_result_2, 0, rows(PRESS_result_2) − 1, 0, 5)
Now perform a cross validation:
Centeredc := Cr_Val( SCORES , CenteredC , 2 )
Add the mean concentrations back in:
i := 0 .. rows(C) − 1        c^〈i〉 := (Centeredc^T)^〈i〉 + MeanC        c := c^T
i := −10 , 0 .. 110

[Plots: cross-validated predicted value vs. reference value for methanol, ethanol, and propanol.]

The r^2 values for the three calibrations are:

corr(C^〈0〉, c^〈0〉)^2 = 0.99991
corr(C^〈1〉, c^〈1〉)^2 = 0.99893
corr(C^〈2〉, c^〈2〉)^2 = 0.99943

For methanol, ethanol, and propanol the RMSECV values are:

RMSECV(C^〈0〉, c^〈0〉) = 2.75962 × 10^−1
RMSECV(C^〈1〉, c^〈1〉) = 9.50189 × 10^−1
RMSECV(C^〈2〉, c^〈2〉) = 6.9443 × 10^−1

These results are excellent! The final calibration is

B := (SCORES^T·SCORES)^−1 · SCORES^T·CenteredC

Prediction
We will keep the first 6 loadings from those returned by the PRESS
calculation:

LOADINGS := submatrix(PRESS_result_3, 0, rows(PRESS_result_3) − 1, 0, 5)
To predict the unknown concentrations from a spectrum, first we
subtract the mean spectrum, as the Nipals function did for the
calibration data:
Unknown :=
0
0 -0.0065
1 -0.0062
2 -0.0059

Unk_centered := Unknown − MeanA

Next calculate an estimate of the scores using the first 6 loadings:

S := Unk_centered^T · LOADINGS

S = (6.1464   5.2431 × 10^−1   1.0577 × 10^−1   7.0234 × 10^−2   1.0719 × 10^−2   2.6901 × 10^−2)
Finally, calculate the predicted concentrations, remembering to
add the mean concentrations to get the final answer:

y := (S·B)^T + MeanC

y^T = (5.95299   1.62882 × 10^1   7.775 × 10^1)
We can combine all these steps into a single statement

y := (((Unknown − MeanA)^T · LOADINGS · B)^T + MeanC)^T

y = (5.95299   1.62882 × 10^1   7.775 × 10^1)
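The whole prediction step is just two matrix products once the pieces are in
hand; a hedged NumPy sketch, assuming loadings, B, the mean spectrum, and the
mean concentrations are available as arrays:

    import numpy as np

    def pcr_predict(unknown, loadings, B, mean_spectrum, mean_conc):
        """Center the spectrum, project onto the loadings, apply the regression
        coefficients, and add the mean concentrations back in."""
        scores = (np.asarray(unknown, float) - mean_spectrum) @ loadings
        return scores @ B + mean_conc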
To conclude, we have used PCA to compress the original data to
6 variables, which are used for the final regression. This has the
advantages over MLR that we do not have a wavenumber
selection problem, and the variables are guaranteed to be
orthogonal.

References

Spectroscopic data courtesy of Bruker Optics, Inc., all


rights reserved, used by permission.
EXAMPLES

Savitzky-Golay and Median Filtering


by Erik Esveld
Wageningen University and Research Center

For the analysis of kinetic processes, the derivative of a


signal measured at regular time intervals is often required.
Because of the noise in real measured systems, the data
needs to be smoothed in order to obtain meaningful results
for the derivative of the data.

Savitzky-Golay (SG) filtering is a smoothing technique which


relies on a local polynomial fit of regularly spaced data.
Since the nth order derivative for x=0 is directly determined
by the respective polynomial coefficient, the filtering
method can also conveniently be used to directly obtain the
derivative of the data.

SG filtering is a convolution method. The coefficients of the


convolution window, which follow from the least-squares
polynomial fit, are obtained from the Moore-Penrose inverse of
a fixed matrix.

Real data is often also cluttered with spikes, which can be


efficiently removed by a median filter. This filter is especially
useful for monotonic ascending or descending data or data
which contains sudden level changes. It doesn't work very
well for spectral data with peaks.

Filter function definition

The filter functions have error messages defined in this vector:

 "Odd window width expected" 


CustErrMsg :=  "Order of polynomial should be less \nthan width of the window" 
 
 "Order of derivative cannot be greater \nthan the order of polynomial" 

The following function calculates the Savitzky-Golay


coefficients for the smoothing, with odd-length window width w,
and a kth order polynomial, to obtain the dth order derivative
by convolution.
SGcoef(k, w, d) :=   error(CustErrMsg_0) if mod(w, 2) = 0
                     otherwise
                         error(CustErrMsg_1) if k ≥ w
                         otherwise
                             error(CustErrMsg_2) if d > k
                             otherwise
                                 m ← (w − 1)/2
                                 for r ∈ 0 .. 2·m
                                     for c ∈ 0 .. k
                                         S_{r,c} ← (r − m)^c
                                 MPinverse ← (S^T·S)^−1 · S^T
                                 d!·(MPinverse^T)^〈d〉

Convolution of a vector X with window coefficients c:

Convol(X, c) :=   width ← length(c)
                  error(CustErrMsg_0) if mod(width, 2) = 0
                  otherwise
                      m ← (width − 1)/2
                      Con_{last(X)} ← 0
                      for i ∈ m .. (last(X) − m)
                          Con_i ← c · submatrix(X, i − m, i + m, 0, 0)
                      Con
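For comparison, here is a minimal NumPy version of the two functions.
(SciPy's savgol_coeffs and savgol_filter compute equivalent windows, with
their own ordering and edge-handling conventions.)

    import numpy as np
    from math import factorial

    def sg_coef(k, w, d):
        """Savitzky-Golay window: order-k fit over odd width w, d-th derivative at the center."""
        assert w % 2 == 1 and k < w and d <= k
        m = (w - 1) // 2
        S = np.vander(np.arange(-m, m + 1), k + 1, increasing=True)  # S[r, c] = (r - m)**c
        mp_inv = np.linalg.pinv(S)                                   # Moore-Penrose inverse
        return factorial(d) * mp_inv[d]

    def convol(X, c):
        """Dot the window c with each centered slice of X (ends are left at zero)."""
        X = np.asarray(X, dtype=float)
        m = (len(c) - 1) // 2
        out = np.zeros_like(X)
        for i in range(m, len(X) - m):
            out[i] = np.dot(c, X[i - m:i + m + 1])
        return out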

The error messages are handy to track misuse of the


functions. For example, click on this faulty expression:

SGcoef ( 2 , 5 , 3 ) =
Filter application example

Consider the example data with increasing chirp.

f(t) := 2 − cos(exp(t/300))

Add white noise with stdevwn := 0.2.

whitenoise(t) := qnorm(rnd(1), 0, stdevwn)

Add some outliers with stdevout := 2 every n := 20 points.

outliers(t) := if(rnd(1) ≤ 1/n, qnorm(rnd(1), 0, stdevout), 0)

Store the original data and the noisy data in vectors.

i := 0 .. 1000        F_i := f(i)        X_i := f(i) + whitenoise(i) + outliers(i)

Stdev(X − F) = 0.5403

[Plot: noisy data X_i and the original function F_i vs. i.]
Let's get rid of the spikes by the application of a narrow
(5pt window) median filter.

XM := medsmooth( X , 5)

[Plot: median-filtered data XM_i and F_i vs. i.]

Smooth the result with a 2nd order polynomial fit over 41 points.

k := 2 w := 41
XMSG := Convol( XM , SGcoef ( k , w , 0 ) )

[Plot: smoothed data XMSG_i and F_i vs. i.]

Now look at the first derivative of the smoothed data.

dXMSG := Convol(XM, SGcoef(k, w, 1))        df_i := (d/di) f(i)

[Plot: Savitzky-Golay derivative estimate dXMSG_i and the exact derivative df_i vs. i.]
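The same pipeline can be sketched with SciPy's built-in filters (medfilt and
savgol_filter); the random seed and noise model below merely mimic the
worksheet's setup.

    import numpy as np
    from scipy.signal import medfilt, savgol_filter

    t = np.arange(1001)
    f = 2 - np.cos(np.exp(t / 300))
    rng = np.random.default_rng(0)                       # arbitrary seed for the sketch
    x = f + rng.normal(0.0, 0.2, t.size)                 # white noise
    spikes = rng.random(t.size) < 1 / 20                 # roughly 1 outlier per 20 points
    x = x + np.where(spikes, rng.normal(0.0, 2.0, t.size), 0.0)

    xm   = medfilt(x, kernel_size=5)                                   # despike
    xmsg = savgol_filter(xm, window_length=41, polyorder=2)            # smooth
    dxm  = savgol_filter(xm, window_length=41, polyorder=2, deriv=1)   # first derivative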
The smoothing of the function worked well with the exception
of the region ends, where the window width extends past the
end of the data. The first derivative shows an underestimation
of the steepest slopes. This is due to the limitations of the
second order (parabolic) fit over the large window. With a
smaller window or a higher-order polynomial fit, these steep
features can be better modeled at the cost of decreased noise
smoothing. Try changing the values of k and w to examine
these effects.

Loess smoothing
Finally, it's worth comparing this method with another
polynomial fitting and windowing method, the loess fit.
Loess, with a small span (which plays a role similar to the window
width), provides an effective smoother. The difference with the
Savitzky-Golay method is that loess uses a weighted and
adaptive window to locally fit a 2nd degree polynomial.

I_i := i        smoothed := loess(I, XM, 0.07)        XML_i := interp(smoothed, I, XM, i)
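A comparable smooth in Python can be sketched with statsmodels' lowess (which
fits locally weighted straight lines rather than parabolas, so it is an
analogue rather than an exact equivalent of Mathcad's loess); xm is the
despiked signal from the previous sketch.

    import numpy as np
    from statsmodels.nonparametric.smoothers_lowess import lowess

    I = np.arange(len(xm))
    xml = lowess(xm, I, frac=0.07, return_sorted=False)   # smoothed values at the points I
    dxml = np.gradient(xml)                                # centered first differences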

[Plot: loess-smoothed data XML_i and F_i vs. i.]

Since the loess method cannot be used for extrapolation, and


numerical derivatives require differencing on either side of the
endpoints, we'll contract the range over which we find the
derivatives.

i := 1 .. length ( X) − 2
To obtain the first derivative we can use the simple linear
difference since the local parabolic fit is discontinuous in the
second derivative.

dXML_i := (XML_{i+1} − XML_{i−1}) / 2

[Plot: loess derivative estimate dXML_i and the exact derivative df_i vs. i.]

In the case of loess, the fit is applicable over a wider range,


since it uses an adaptive and weighted window. Unlike the
Savitzky-Golay method, the derivative at the right side is not
underestimated. However, better fidelity in the derivative implies
poorer smoothing. The Savitzky-Golay method, on the other hand,
is capable of using higher order fits to obtain all the derivatives
directly.

Reference
W.H. Press et al. (1992), Numerical recipes in C : the art of
scientific computing, Cambridge University Press, 2nd ed.
Chapter 14.8, page 640.
EXAMPLES

Stabilizing and Normalizing the Error Variance


by Paul Lorczak

Suppose we have a multivariate dataset where the value


of σ2, the variance of the random error, is increasing with respect to one of
the independent variables

sample size: n := 75

σINCR :=   DATA^〈0〉 ← runif(n, 0, 1)
           DATA^〈1〉 ← runif(n, 0, 1)
           "Two columns of random data for x1 and x2"
           σincr ← DATA^〈0〉
           "create a column of errors that depend on the first data column"
           ε ← σincr·rnorm(n, 0, 1)        (element-wise)
           β ← (4   7   2)^T
           "create a linear relationship between x1, x2, and a dep. variable"
           DATA^〈2〉 ← β_0 + β_1·DATA^〈0〉 + β_2·DATA^〈1〉 + ε
           DATA

y := σINCR^〈2〉        i := 0 .. last(y)

X^〈0〉 := σINCR^〈0〉        X^〈1〉 := σINCR^〈1〉
The error term ε is from a normal distribution with a
mean of zero.

To check model assumptions, do a multivariate


polynomial fit with the two columns of independent
variables to the single column of dependent variables:

params := regress(X, y, 2)

yfit_i := interp(params, X, y, (X^T)^〈i〉)

resid := y − yfit

stERR := √( Σ resid^2 / (last(y) − 5) )        stERR = 0.573246
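A NumPy sketch of the same check (synthetic heteroscedastic data, a full
quadratic fit in x1 and x2, and standardized residuals) is given below; the
particular seed and the explicit design matrix are choices of this sketch,
not of the worksheet.

    import numpy as np

    rng = np.random.default_rng(1)                  # arbitrary seed for the sketch
    n = 75
    x1, x2 = rng.random(n), rng.random(n)
    eps = x1 * rng.normal(0.0, 1.0, n)              # error spread grows with x1
    y = 4 + 7 * x1 + 2 * x2 + eps

    # full quadratic model in (x1, x2), analogous to regress(X, y, 2)
    A = np.column_stack([np.ones(n), x1, x2, x1**2, x1 * x2, x2**2])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    st_err = np.sqrt(np.sum(resid**2) / (n - A.shape[1]))
    std_resid = resid / st_err                      # plot against x1, x2, or A @ coef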

If scatterplots of the standardized residuals versus each


of the independent variables or versus the predicted
values show no discernible pattern, then the variance of
the error terms is most likely constant. On the plot of the
residuals versus x1 (shown in red), the points spread out
increasingly from left to right, indicating that the spread of ε
increases with x1.

[Plot: standardized residuals resid/stERR vs. σINCR^〈0〉 (x1).]

On a similar graph for x2, the pattern is not clear.

[Plot: standardized residuals resid/stERR vs. σINCR^〈1〉 (x2).]
The pattern, however, is again in evidence on the graph of
the residuals versus the predicted values,

[Plot: standardized residuals resid/stERR vs. the predicted values yfit.]

To perform a valid regression, we'll need to counteract the


increasing error variance by transforming the dependent
data. The variance stabilizing transformations are listed
below in order of increasing severity. The first equation is
enabled. To see the effect of the other transformations, you
can disable the first equation and enable another equation.



ystab := √y           (enabled)

ystab := ln(y)        (disabled)

ystab := 1/y          (disabled)

Below, we've plotted the new resulting residuals versus the


predicted values, to see the effect of stabilization.

params := regress(X, ystab, 2)

yfits_i := interp(params, X, ystab, (X^T)^〈i〉)

resids := ystab − yfits

stERR := √( Σ resid^2 / (last(y) − 5) )        stERR = 0.573246
[Plot: the new standardized residuals resids/stERR vs. the predicted values yfits.]

You may find that some of the transformations are too


severe, actually causing a violation rather than removing it.
For instance, the graph pattern could be curved rather than
random, indicating that a nonlinear model might provide a
better fit to the data.

Correcting Nonnormality
Besides helping to stabilize the error variance, the same
transformations may also correct any nonnormality of
errors. To see this, we'll generate a sample from a
population having error terms that are exponentially
distributed.

sample size: n_expo := 25

εEXPO :=   DATA^〈0〉 ← runif(n_expo, 0, 1)
           DATA^〈1〉 ← runif(n_expo, 0, 1)
           ε1 ← rexp(n_expo, 10)
           β ← (4   7   2)^T
           DATA^〈2〉 ← β_0 + β_1·DATA^〈0〉 + β_2·DATA^〈1〉 + ε1
           DATA

ye := εEXPO^〈2〉        i := 0 .. last(ye)        (Xe^〈0〉)_i := 1

Xe^〈1〉 := εEXPO^〈0〉        Xe^〈2〉 := εEXPO^〈1〉
Nonnormality can be detected in a normal plot of the
standardized residuals. If the plot resembles a straight
line, then most likely the errors are from a normal
distribution.

params := (Xe^T·Xe)^−1 · (Xe^T·ye)

yfit := Xe·params        params = (4.041308   7.087907   1.991035)^T

reside := ye − yfit

stERR := √( Σ reside^2 / (last(ye) − cols(Xe)) )        stERR = 0.107443

rplot := qqplot(reside/stERR, "normal")
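In Python, scipy.stats.probplot returns the same kind of normal plotting
positions; a minimal sketch:

    import numpy as np
    from scipy import stats

    def qq_points(std_resid):
        """Theoretical normal quantiles and ordered residuals for a QQ plot."""
        (osm, osr), _ = stats.probplot(np.asarray(std_resid, dtype=float), dist="norm")
        return osm, osr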

[Plot: normal quantile-quantile plot of the standardized residuals, rplot^〈1〉 vs. rplot^〈0〉.]

As expected, the residuals are not normally


distributed, as shown on this nonlinear scatter plot.

Transforming for normality


Again, we've defined the transformations in order of
increasing severity. The first equation is enabled. To see
the effect of the other transformations, you can disable
the first equation, and enable another equation.

ynorm := √ye          (enabled)

ynorm := ln(ye)       (disabled)

ynorm := 1/ye         (disabled)
Inspecting the new standardized residual normal plot.

params := (Xe^T·Xe)^−1 · Xe^T·ynorm

yfit := Xe·params        params = (2.109525   1.248504   0.339393)^T

reside := ynorm − yfit

stERR := √( Σ reside^2 / (last(ynorm) − cols(Xe)) )        stERR = 0.028774

rplot := qqplot(reside/stERR, "normal")

[Plot: normal quantile-quantile plot of the transformed standardized residuals, rplot^〈1〉 vs. rplot^〈0〉.]

You can investigate the relationship between the


transformation and the nonnormality of error terms through
enabling and disabling different equations. Again, you'll
probably find that some transformations work better than
others.
EXAMPLES

Statistical Analysis of Water Meter Data


by D. M. Griffin
Louisiana Tech University

The data used here are daily water usage values expressed
as gal/min. These are being collected as part of an ongoing
research project between Louisiana Tech University and the
Louisiana Department of Transportation and Development to
quantify water usage and wastewater generation at
Interstate Rest Areas in Louisiana.

The medsmooth function was used to examine water


flowrates over a period of record from 7/1/97 to the present.
Using a smoothed data set (bandwidth = 7 days) it was
found that daily flows after day 1200 were somewhat lower
and less variable than those before day 1200. This results
in substantial changes in less frequent flow rates before and
after day 1200. A bandwidth of 7 days was used because
significant autocorrelation occurs within a one-week period,
less so after one week.

Read in daily water usage rates at Grand Prairie Interstate


rest area, I-49 in gal/min. Missing values given value of -1.
There are 10 to 20 missing values in the data set.

DATA := GP.PRN

i := 0 .. length(DATA) − 1        length(DATA) = 2059

elapsed_time_i := i·day        gpm := gal/min

[Plot: DAILY FLOW (GPM) VS ELAPSED TIME (DAYS) — average daily flow (gpm) vs. elapsed time in days, 0 to 2059.]
The raw data exhibits no obvious changes in pattern.
There were about 20 days where flow data were not
available and these were given the value -1

We first need to deal with the -1 values. There is no


generally accepted way of doing this. If they are removed,
the data set is discontinuous, which is undesirable. In this
case I will first remove the -1 values to create another set of
"real" values. Then I will compute the mean of that data set.
Finally, I will substitute the median value of that data set for
all -1 values in the original data set. In this way I have a
continuous data set (not a perfect solution, but this is the
real world!) with no missing values.

Get all the indices of the data which are of value -1:
index := match ( −1 , DATA)

Trim them from the data set:


DATAtrim := trim( DATA , index)

rows ( DATA) = 2059 rows ( DATAtrim) = 2024

Now compute median of DATAtrim and substitute it for all -1


values.

medianval := median ( DATAtrim)

DATA := DATA⋅ gpm DATAorig := DATA

j := 0 .. last(index)

DATA_{index_j} := medianval·gpm
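The same missing-value treatment is a few lines of NumPy; the sketch below
returns both the filled series and the median that was substituted.

    import numpy as np

    def fill_missing(data, missing=-1):
        """Replace the 'missing' flag values with the median of the valid data."""
        d = np.asarray(data, dtype=float)
        valid = d != missing
        med = np.median(d[valid])
        return np.where(valid, d, med), med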

As expected, the median changes less in this data transformation,


since it's a more robust measure of centrality.

mean ( DATA) = 4.274 gpm mean ( DATAorig) = 4.199 gpm

median ( DATA) = 3.417 gpm median ( DATAorig) = 3.401 gpm


Autocorrelation
Autocorrelation is a procedure to determine if a data set is
correlated with itself over time. Data which are correlated are not
independent and standard statistical hypothesis tests are not
valid, strictly speaking. Data which are not autocorrelated are said
to be independent.

First, compute and plot the residuals, the differences between


each daily flow value and the overall mean flow for the data
set.

µ := mean ( DATA)
RESID := DATA − µ mark1 := 320 mark2 := 360

[Plot: FLOWRATE RESIDUALS — (data value − data mean) vs. elapsed days, with markers at mark1 = 320 and mark2 = 360.]

Comparing the data points to the y = 0 line, we see that the


flow was generally larger than the mean for the first 90 days
or so; then was less than the mean until about day 240.
Recently the flow has been increasing again.
The high values occurring between 320 and 360 were caused by
watering of trees and grass during May and June 1998. In
addition, problems occurred with the well around 6/15/98, (day
349) which necessitated pumping the well for an extended
period while working on it. High flows around the end of
September (day 450) were probably a result of large numbers of
evacuees heading north from New Orleans and surrounding area
as hurricane Georges approached New Orleans (actually it
turned out that the riser pipe in the well had split). High
residuals occurring near day 880 and day 890 are due to the
conduct of dye tests in the rock plant filters. All of these
events tend to wash out legitimate long-term trends in the
data, which is what we are interested in.

Plot the correlogram for the daily flow. A correlogram is the


sequence of autocorrelation coefficients of flow over a series of
specified time intervals, called lag periods or lags.

n := 0 .. 28        N := length(DATA)        σ := stdev(DATA)

lagcorr_n := (1/(N·σ^2)) · Σ_{i=n..N−1} (DATA_i − µ)·(DATA_{i−n} − µ)
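The correlogram coefficients defined above translate directly into NumPy
(np.var with its default denominator N matches Mathcad's stdev squared):

    import numpy as np

    def lagcorr(data, max_lag=28):
        """Autocorrelation coefficients for lags 0 .. max_lag, as defined above."""
        d = np.asarray(data, dtype=float)
        N, mu, var = d.size, d.mean(), d.var()
        return np.array([np.sum((d[n:] - mu) * (d[:N - n] - mu)) / (N * var)
                         for n in range(max_lag + 1)])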

[Plot: CORRELOGRAM — autocorrelation coefficient vs. lag, 0 to 28 days.]

The autocorrelation coefficient over a lag period of 7 days is


approximately 0.4. This means that the flow on any day is positively
correlated with the flow 7 days previous. Similarly the value at
21 days is somewhat over 0.2. This indicates that the flow is
correlated with the flow 21 days previous but not as well as the flow
7 or 14 days ago.

Now look at the overall curve. It has a regular pattern that


repeats every 7-8 days. This is not immediately obvious from the
raw data. It is also interesting to note that the values of the
coefficients drop with each succeeding cycle.
Here are the upper and lower 95% confidence limits on the
autocorrelation coefficients. Values inside this region cannot be
assumed different than zero in a statistical sense.

 −1 + 1.96⋅ length ( DATA) − 2 


u :=   u = 0.043
 length ( DATA) − 2 

l :=
( −1 − 1.96⋅ length ( DATA) − 2)
l = −0.044
( length ( DATA) − 1)

Median Smoothing

Now use medsmooth to smooth the data. Based on the


serial correlation analysis, it was decided to use a bandwidth
of 7 days.

Q7 := medsmooth ( DATA , 7)

The median of the 7 points surrounding each data point is used to


replace each data point. The procedure is followed for each point
so there are as many point in the smoothed data set as the
original.

[Plot: 7-Day Median Smooth — Q7 vs. elapsed days, 0 to 2400.]
Note the change in water use highlighted by the smoothed data.
The curve above suggests that daily water use appeared to
increase up until about day 1200, which would correspond to
about October of 2000. After that time, water usage dropped
and appeared to be less variable. This is not at all evident in the
original data. The main reason for this change is the fact that
before day 1200, testing was occurring at the site, requiring
extra water to be run through the treatment system. Based on
these results it could be argued that we have, in effect, 2 data
sets, one before day 1200, and one from day 1200 to the
present. It seems reasonable to plot probability curves for the
data before and after day 1200.

Weibull plots

First, we extract the data before and after day 1200.

k1 := 0 .. 1199        k2 := 0 .. last(DATA) − 1200

DATAbef_1200_k1 := DATA_k1        DATAaft_1200_k2 := DATA_{k2+1200}

mean(DATAbef_1200) = 4.708 gal/min        mean(DATAaft_1200) = 3.666 gal/min

median(DATAbef_1200) = 3.785 gal/min      median(DATAaft_1200) = 3.125 gal/min

We sort the data and create the Weibull plotting variable for
three probability curves. The first is for all data, the second
for data collected before day 1200, and the third for data
collected after day 1200.

x1 := sort(DATA)/gpm              y1 := sort(Rank(DATA))/length(DATA)·100

x2 := sort(DATAbef_1200)/gpm      y2 := sort(Rank(DATAbef_1200))/length(DATAbef_1200)·100

x3 := sort(DATAaft_1200)/gpm      y3 := sort(Rank(DATAaft_1200))/length(DATAaft_1200)·100
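A NumPy sketch of one of these plotting-position curves (using the worksheet's
rank/N convention; the classical Weibull position would divide by N + 1
instead):

    import numpy as np

    def weibull_curve(data):
        """Sorted values and 'percent of flows less than or equal' plotting positions."""
        x = np.sort(np.asarray(data, dtype=float))
        y = np.arange(1, x.size + 1) / x.size * 100.0   # rank/N, as in the worksheet
        return x, y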
[Plot: percent of flows less than the stated value vs. water use, for all data, data before day 1200, and data after day 1200.]

The curves may be interpreted as follows:

1. The curve for all the data combined shows that the flow rate
with a 50% probability of occurrence is 3.42 gal/min. That
is, 50% of future flows will be less than this value.
Corresponding values for data collected before and after
day 1200 are 3.8 and 3.1 gal/min respectively.

percentile(DATA, 50%) = 3.417 gpm

percentile(DATAbef_1200, 50%) = 3.785 gpm

percentile(DATAaft_1200, 50%) = 3.125 gpm

2. Using the curve for data collected after day 1200,


90% of future flows will be less than 5.8 gal/min.
The corresponding value for data collected before
day 1200 is 8.9 gal/min

percentile(DATAbef_1200, 90%) = 8.981 gpm

percentile(DATAaft_1200, 90%) = 5.833 gpm


Now, let's examine changes in flowrate as a function of how
often the flow occurs. We will look at 20 percentile flows,
median flows, and 90 percentile flows.

20th percentile flow

Before day 1200:   B1200_20 := percentile(DATAbef_1200, 20%)
After day 1200:    A1200_20 := percentile(DATAaft_1200, 20%)

change20 := −(B1200_20 − A1200_20)/A1200_20        change20 = −7.845 %

Median flows

Before day 1200:   B1200_50 := percentile(DATAbef_1200, 50%)
After day 1200:    A1200_50 := percentile(DATAaft_1200, 50%)

changemed := −(B1200_50 − A1200_50)/A1200_50        changemed = −21.111 %

90th percentile flow

Before day 1200:   B1200_90 := percentile(DATAbef_1200, 90%)
After day 1200:    A1200_90 := percentile(DATAaft_1200, 90%)

change90 := −(B1200_90 − A1200_90)/A1200_90        change90 = −53.96 %

Summary:

change20 = −7.8 % changemed = −21.1 % change90 = −54 %


These data suggest that changes in the more frequent flows
before and after day 1200 are small, about -8%. However,
the change in median flows before and after day 1200 is
-21%, and the change in the 90th percentile flowrate is more
than twice that, about -54%. This suggests that:
1. changes in flow before and after day 1200 were
substantial. The flow dropped considerably after day
1200.
2. Percentage changes in the mean or median flowrate
underpredict changes at higher, less frequent, flow rates.
This makes sense because the testing that was done
required large water flows and thus affected the
probability of occurrence of the less frequent flows.
EXAMPLES

Two-Point, Two-Body Elements for the Planet Jupiter


by Roger L. Mansfield
Astronomical Data Service

The problem to be solved is how to generate a set of orbital


elements for the planet Jupiter that can be used to calculate
Jupiter's position at any instant of the year 2004, i.e., during
the period 2004 January 1.0 to 2005 January 1.0. Orbital
elements are a dynamical astronomer's way of describing an
orbit in a manner that is useful for calculating positions in the
orbit, as we shall see in what follows.

To begin our task of generating elements for Jupiter, we retrieve


a dataset from a U.S. Naval Observatory publication [1].


DataSet :=

  2452880.5   −4.590859157    2.525277766    1.195317055   "2003 Aug 29"    0
  2452920.5   −4.744349733    2.29600822     1.100718447   "2003 Oct 08"    1
  2452960.5   −4.883439395    2.059769646    1.002778822   "2003 Nov 17"    2
  2453000.5   −5.007785421    1.817312563    0.901811754   "2003 Dec 27"    3
  2453040.5   −5.11708814     1.569396401    0.798135691   "2004 Feb 05"    4
  2453080.5   −5.211090635    1.316787749    0.692073192   "2004 Mar 16"    5
  2453120.5   −5.289578417    1.060258691    0.583950212   "2004 Apr 25"    6
  2453160.5   −5.352379063    0.800585218    0.474095406   "2004 Jun 04"    7
  2453200.5   −5.399361849    0.538545706    0.362839474   "2004 Jul 14"    8
  2453240.5   −5.430437358    0.274919463    0.25051452    "2004 Aug 23"    9
  2453280.5   −5.445557097    0.010485323    0.137453446   "2004 Oct 02"   10
  2453320.5   −5.444713104   −0.253979697    0.023989362   "2004 Nov 11"   11
  2453360.5   −5.427937567   −0.517701717   −0.089544984   "2004 Dec 21"   12
  2453400.5   −5.395302447   −0.779911265   −0.202817767   "2005 Jan 30"   13
  2453440.5   −5.346919109   −1.039844529   −0.315498628   "2005 Mar 11"   14
  2453480.5   −5.282937969   −1.296744589   −0.427259208   "2005 Apr 20"   15
  2453520.5   −5.203548148   −1.549862639   −0.537773687   "2005 May 30"   16
DataSet is called an ephemeris (plural: ephemerides),
because it is a table of times (zeroth column) and positions of
the planet Jupiter at those times (the first, second and third
columns are the x, y, and z components of Jupiter's 3D
position vectors). Our ephemeris gives Jupiter's positions at
equal 40-day intervals. The times are Julian dates; their
corresponding calendar dates are given in the fourth column.

Note that calendar date 2004 January 1.0, having Julian


date 2453005.5, lies between the dates in rows 3 and 4 of
the table. Also note that calendar date 2005 January 1.0,
having Julian date 2453371.5, lies between the dates in
rows 12 and 13 of the table.

We want our orbital elements to work for the period 2004


January 1.0 (midnight on 2003 December 31) to 2005 January
1.0 (midnight on 2004 December 31). To generate the orbital
elements, we will need to calculate the positions of Jupiter very
precisely on these two dates.

Aitken-Neville iterated polynomial interpolation [2], as


implemented in polyiter, is useful for this calculation. It allows
us not only to interpolate for the two positions of Jupiter on the
two dates of interest, but also to put a tolerance, ε, on how
good the interpolation must be.

The x, y, and z positional coordinates of Jupiter in DataSet


columns 1, 2, and 3, respectively, have been rounded off to 9
places past the decimal, and are known to be accurate to this
number of decimal places. Therefore, we should specify our
Aitken-Neville interpolation tolerance as

ε := 5·10^−10

We will use eight positions from DataSet, the zeroth through the
7th, to interpolate for Jupiter's position on 2004 January 1.0, and
eight positions, the 9th through the 16th, to interpolate for
Jupiter's position on 2005 January 1.0, as these two sets of eight
positions bracket the two ephemeris points of interest.

With eight data points input, polyiter can compute a


polynomial of, at most, degree seven. Therefore, we input a
maximum iteration count of seven to polyiter.
N := 7

We now extract the column vectors of x, y, and z coordinates


for interpolation. The dot subscript "1" denotes the column
vectors of values as needed for interpolation in x, y, and z, at
the first Julian date. The subscript "2" denotes the column
vectors of values for interpolation at the second Julian date.
(See also additional comments on notation, Note 6, at the
end of the worksheet.)
JD1 := submatrix( DataSet , 0 , 7 , 0 , 0 ) JD2 := submatrix( DataSet , 9 , 16 , 0 , 0)

x1 := submatrix( DataSet , 0 , 7 , 1 , 1 ) x2 := submatrix( DataSet , 9 , 16 , 1 , 1)

y1 := submatrix( DataSet , 0 , 7 , 2 , 2 ) y2 := submatrix( DataSet , 9 , 16 , 2 , 2)

z1 := submatrix( DataSet , 0 , 7 , 3 , 3 ) z2 := submatrix( DataSet , 9 , 16 , 3 , 3)

We specify the Julian dates for the interpolated ephemeris


positions 1 and 2.

JDT1 := 2453005.5 JDT2 := 2453371.5
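Aitken-Neville iterated interpolation is compact enough to sketch directly in
Python; the function below interpolates one coordinate at a single date and
also returns the size of the last correction, which plays the role of
polyiter's convergence check. It assumes at least two data points.

    import numpy as np

    def neville(xs, ys, x):
        """Neville's iterated interpolation of (xs, ys) at x.
        Returns the interpolated value and the magnitude of the last correction."""
        xs = np.asarray(xs, dtype=float).ravel()
        p = np.asarray(ys, dtype=float).ravel().copy()
        last = p[0]
        for k in range(1, len(xs)):
            p = ((x - xs[k:]) * p[:-1] - (x - xs[:-k]) * p[1:]) / (xs[:-k] - xs[k:])
            last, est = p[0], abs(p[0] - last)
        return p[0], est

    # neville(JD1, x1, 2453005.5)[0] should be close to the -5.0222766203
    # reported by polyiter below.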

Aitken-Neville Interpolation

We perform iterated interpolation for Jupiter's position r1 at


the first Julian date JDT1 by invoking polyiter three times,
once for each of x, y, and z.

 1 
 
(
Out := polyiter JD1 , x1 , JDT1 , N , ε ) Out =

6

 −5.0222766203 
r1 := Out
0 2
 1 
 
(
Out := polyiter JD1 , y1 , JDT1 , N , ε ) Out =

6

 1.7866059269 
r1 := Out
1 2

 1 
 
(
Out := polyiter JD1 , z1 , JDT1 , N , ε ) Out =

6

 0.8889938158 
r1 := Out
2 2

 −5.0222766203  This is Jupiter's interpolated position


r1 =  1.7866059269 
vector, in A.U., on the first Julian date,
  2453005.5, corresponding to 2004
 0.8889938158  January 1.0 TT (Terrestrial Time).
What we see is that the x, y, and z coordinates were all
successfully found by Aitken-Neville interpolation, because
the first element of each polyiter output vector, the
"Converged" flag, is set to 1. We see that only six iterations
were needed in all cases, one iteration less than seven, the
maximum degree and iteration count permissible for eight
input data points.

We should note at this point a strength of Aitken-Neville


interpolation: if there is a manual transcription error in the
input data points, iteration will usually go up to the maximum
permissible degree and will not converge (Converged = 0). So
if you are pretty sure you have input enough data points to
meet your convergence criterion, then Aitken-Neville is
probably telling you, "Better check your input data for manual
entry errors."

Now we perform iterated interpolation for Jupiter's position r2 at


the second Julian date JDT2.

 1 
 
(
Out := polyiter JD2 , x2 , JDT2 , N , ε ) Out =

6

 −5.4205399099 
r2 := Out
0 2
 1 
 
(
Out := polyiter JD2 , y2 , JDT2 , N , ε ) Out =

6

 −0.5899925915 
r2 := Out
1 2
 1 
 
(
Out := polyiter JD2 , z2 , JDT2 , N , ε ) Out =

6

 −0.1207351015 
r2 := Out
2 2

 −5.4205399099  This is Jupiter's interpolated position


r2 =  −0.5899925915 
vector, in A.U., on the second Julian
  date, 2453371.5, corresponding to
 −0.1207351015  2005 January 1.0 TT.
Orbit Determination via Two-Body Mechanics

We now have two precise positions for the planet Jupiter on two
dates, 2004 January 1.0, and 2005 January 1.0. These dates are
exactly 366 days apart because the year 2004 is a leap year.

Let us assume for a moment that the solar system consists of just
two bodies, Jupiter and the Sun, and that Jupiter travels from its
position on the first date to its position on the second under the
gravitational influence of the Sun alone. Motion under this
assumption is called "two-body orbital mechanics," and the orbit
that results is called a "two-body orbit." The larger body is called
the primary and the smaller body is called the secondary.

The great mathematician Karl Friedrich Gauss deduced a little


more than 200 years ago a method of determining a two-body
orbit from two position vectors and the time of flight from the
first to the second. We will use Gauss's method below to
determine the orbital path of Jupiter from 2004 January 1.0 to
2005 January 1.0. This path will be quite close to the DataSet
path, as we will see.

Since we will be using the two points of Jupiter's orbit that we


have just calculated above via polyiter, and since we are
assuming two-body mechanics, we will call the resulting orbital
elements "two-point, two-body" (2P2B) orbital elements for the
planet Jupiter.

Gauss's method can be summed up in the two functions, VEL1


and TWOPOS, defined below in a collapsed area. I won't say
too much about these two functions here. The mathematics
that is implemented in these functions is fully documented in
Chapter 9 of [3].

TWOPOS and VEL1 definitions

We now use TWOPOS to calculate Jupiter's heliocentric


equatorial velocity vector at the first interpolated position, as
needed to travel to the second interpolated position under the
two-body assumptions. First we set up the arguments K and
∆t for input to TWOPOS.
k1 := 0.01720209895        Gaussian constant for two-body motion when the
                           Sun is the primary body.

µ := 1.000954786           Sum of the Sun's mass and Jupiter's mass,
                           expressed in solar masses.

K := k1·√µ                 Gravitational parameter composed of k1 and µ.

∆t := JDT2 − JDT1          We specify a one-year time of flight, in this
                           case, 366 days for leap year 2004.

The output from TWOPOS is the position and velocity vectors


side by side. We put these into the 3-by-2 array PV. Then we
extract position r1 and velocity v1 for use in calculating orbital
elements.

PV := TWOPOS(K, ∆t, r1, r2)

r1 := PV^〈0〉        v1 := PV^〈1〉

r1 = (−5.0222766203   1.7866059269   0.8889938158)^T
v1 = (−0.0028745571   −0.0061497325   −0.002567772)^T
Transformation to Ecliptic Coordinates

The components of position are expressed in astronomical units


(A.U.). The components of velocity are expressed in A.U./day.

The two vectors r1 and v1 are "heliocentric equatorial cartesian."


This means that they are the (x, y, z) components of position
and (vx, vy, vz) components of velocity in a reference frame
whose origin is at the Sun and whose fundamental plane is
Earth's equatorial plane. We will want to transform r1 and v1 to a
set of classical orbital elements for the planet Jupiter. Before we
can do this we must make the vectors "heliocentric ecliptic
cartesian." This means that the fundamental reference plane
must be changed to the plane of Earth's orbit around the sun,
called the ecliptic plane. We need function EQ2EC to do the
transformation, and we will need EC2EQ later to go back the
other way, so we define both functions now in a collapsed area.

EC2EQ and EQ2EC definitions

Now we transform r1 and v1 from heliocentric equatorial to


heliocentric ecliptic.

r1 := EQ2EC(r1)        v1 := EQ2EC(v1)

Transformation to Classical Elements

Position and velocity at some epoch (in this case, 2004


January 1.0) are called "fundamental" orbital elements. What
we want are called "classical" orbital elements. Classical orbital
elements are derived and discussed in Chapter 5 of [3]. We
use the function PV2CL ("position and velocity to classical
elements") to do the transformation (see Chapter 8 of [3]). We
define PV2CL in a collapsed area.

PV2CL definition
We invoke PV2CL to transform position and velocity to classical
elements.

Elmts := PV2CL(K, r1, v1)

We now have our 2P2B elements for Jupiter for the year 2004. These
are the classical elements for a body in orbit around the Sun:

Elmts =   5.20183217       "Semimajor axis, in A.U."
          0.04896418       "Orbital eccentricity"
          1.30557531       "Orbital inclination, in deg"
          100.08496669     "Celestial longitude of ascending node, in deg"
          273.99147004     "Argument of perihelion, in deg"
          140.91224419     "Mean anomaly, in deg"

Generation and Plot of 2P2B Ephemeris


We will now use these elements to generate positions of
Jupiter all around its orbit, which we will plot. But before we
do so, let us take a quick look at a plot of the DataSet
that we started with:

[3D scatterplot of the raw DataSet: (DataSet^〈1〉, DataSet^〈2〉, DataSet^〈3〉).]

The points that we see are positions of Jupiter at equal
40-day intervals over the time span 2003 Aug 29 to 2005
May 30.
To generate our 2P2B Jupiter ephemeris for plotting, we use
functions PQ2EQ and Ephem. The math in these two functions
is derived and discussed in Chapter 5 of [3]. Expand the
collapsible area below to see them.

PQ2EQ and Ephem definitions

In order to know how many points to plot, we need to know


the orbital period of Jupiter, in days. This is the amount of
time it takes Jupiter to travel once around the Sun. The
formula from Kepler's Third Law of Planetary Motion is

P := (2·π/K) · (Elmts_0)^(3/2)        P = 4331.374

Let us choose a time increment that gives us 36 equally-spaced
ephemeris points to plot (the 37th gets plotted over the first). We have

∆t := P/36        (this redefinition replaces the earlier value ∆t = 366)
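A quick numerical check of the period formula, with the values copied from the
worksheet:

    import numpy as np

    k1 = 0.01720209895                 # Gaussian gravitational constant
    K = k1 * np.sqrt(1.000954786)      # K = k1*sqrt(mu), Sun + Jupiter masses
    a_jup = 5.20183217                 # semimajor axis from Elmts, in A.U.
    P = 2 * np.pi / K * a_jup**1.5     # orbital period; about 4331.4 days
    dt = P / 36                        # time step for 36 points around the orbit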

We input this time step to Ephem and ask for 37 points (points 0
through 36), then we plot them.

Orbit := Ephem(Elmts, K, JD1, ∆t, 36)

[3D plot: the raw data plus the 2P2B orbit plotted for 37 ephemeris points —
(Orbit^〈1〉, Orbit^〈2〉, Orbit^〈3〉) and (DataSet^〈1〉, DataSet^〈2〉, DataSet^〈3〉).]


We have used our 2P2B orbital elements for Jupiter to
generate position points around the entire orbit (see Note
4). We also see our DataSet points (red) superimposed
upon the 2P2B orbital trace and position points (blue).

Accuracy of the Ephemeris and the DataSet

Why didn't we simply fit three coordinate polynomials to


DataSet and use the coefficients to generate Jupiter's positions
during the year 2004? This is indeed the preferred approach for
highest accuracy. But the orbital elements, as we have seen,
are much more descriptive, and using Aitken-Neville interpolation
on just two points has allowed us to generate 2P2B orbital
elements for all of 2004.

This concludes our data analysis. But the orbital analyst who
wants to generate 2P2B elements for Jupiter for years 2005,
2006, 2007, and so on, will want quantitative answers to the
following two questions.

1. How accurate are the positions generated by Ephem using


2P2B elements?

2. How accurate are the positions in DataSet in the first place?

To answer the first question, let us look at the 2P2B elements


error in the 8th dataset point, expressed in seconds of arc,
i.e., in arcseconds.

SecPerRad := r2d·3600        (SecPerRad is the number of arcseconds in one radian.)

Orbit := Ephem(Elmts, K, JDT1, DataSet_{8,0} − JDT1, 1)

i := 0 .. 3        datarow_i := ((DataSet^T)^〈8〉)_i

datarow = (2453200.5   −5.399361849   0.538545706   0.362839474)^T

(Orbit^T)^〈1〉 = (2453200.5   −5.39935018   0.53854764   0.36283998)^T

|datarow − (Orbit^T)^〈1〉| · SecPerRad = 2.44128        (arcseconds)
This suggests how to write a function that calculates all of
the errors in the 2P2B predicted positions vs. the tabular
points in DataSet during year 2004.

Errors :=   for i ∈ 0 .. 8
                Orbit ← Ephem(Elmts, K, JDT1, DataSet_{i+4,0} − JDT1, 1)
                for j ∈ 0 .. 3
                    datarow_j ← ((DataSet^T)^〈i+4〉)_j
                Out_i ← |datarow − (Orbit^T)^〈1〉| · SecPerRad
            Out

 0.89309  We see that the 2P2B elements


 1.65831  predict the positions of Jupiter at
 2.16386  the tabular dates in 2004 to better
 2.42132  than 3 arcseconds.
 
Errors =  2.44128  This accuracy is quite adequate for
 2.23164  planetary visibility predictions for
  Jupiter made from 2P2B elements in
 1.79887  2004.
 1.14679 
 0.27748 

On to the final question: the positions in DataSet were


obtained from a U.S. Naval Observatory publication from 1951.
So it is quite appropriate to ask how accurate they are today.
It will suffice just to look at the two positions r1 and r2 that we
obtained from DataSet by Aitken-Neville iterated interpolation.

Both r1 and r2 are referred to the mean equator and equinox of the Besselian epoch B1950.0, having Julian date 2433282.423. We precess them to the Julian epoch J2000.0, having Julian date 2451545.0, so that we can compare them with results from the U.S. Naval Observatory's most recent MICA program [5]. We use the following precession function, PRECESS. Expand the collapsible area below to see this function.

PRECESS definition

JD2000.0 := 2451545.0
 −5.046196922  r1m was obtained from the U.S.
r1m :=  1.730326266  Naval Observatory's MICA
  1990-2005 program.
 0.864522613 
We need to convert r1 from
( )
r1 := EC2EQ r1 ecliptic back to equatorial.

This is r1
 −5.04619017  precessed from
PRECESS( r1 , JD2000.0) =  1.730339222  B1950.0 to
  J2000.0.
 0.864536908 

Now we compute the angle between r1m and r1 in arcseconds.

    SecPerRad · acos[ (r1m · PRECESS(r1, JD2000.0)) / (|r1m| · |PRECESS(r1, JD2000.0)|) ] = 0.78015
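The same angle check is easy to reproduce outside the worksheet. A minimal NumPy sketch, using the two vectors shown above (the function name is ours, not the worksheet's):

    import numpy as np

    SEC_PER_RAD = np.degrees(1.0) * 3600.0   # arcseconds per radian

    def angle_arcsec(u, v):
        """Angle between two 3-vectors, in arcseconds (cosine clipped for roundoff)."""
        c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return np.arccos(np.clip(c, -1.0, 1.0)) * SEC_PER_RAD

    # Values taken from the worksheet: r1m from MICA, and r1 precessed to J2000.0.
    r1m = np.array([-5.046196922, 1.730326266, 0.864522613])
    r1_precessed = np.array([-5.04619017, 1.730339222, 0.864536908])
    print(angle_arcsec(r1m, r1_precessed))   # about 0.78 arcseconds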

We do the same for r2m and r2.

 −5.412958671  r2m was obtained from the U.S.


r2m :=  −0.650550743 
Naval Observatory's MICA
  1990-2005 program.
 −0.147064256 
This is r2
 −5.412956867  precessed from
PRECESS( r2 , JD2000.0) = −0.65053459 

B1950.0 to
 
 −0.147050304  J2000.0.

Finally, we compute the angle between r2m and r2 in arcseconds.

    SecPerRad · acos[ (r2m · PRECESS(r2, JD2000.0)) / (|r2m| · |PRECESS(r2, JD2000.0)|) ] = 0.79517

We see that the angle between r1m and r1 and the angle
between r2m and r2 are both less than an arcsecond. That is,
calculations that the U.S. Naval Observatory made in the late
1940s on vacuum-tube computers, and that were published in
1951 in "Coordinates of the Five Outer Planets, 1653-2060", are
still good, more than 50 years later, to better than an
arcsecond.
References and Notes

[1] Eckert, W. J., Brouwer, Dirk and Clemence, G. M., "Coordinates of the Five Outer Planets 1653-2060", Astronomical Papers, Vol. XII, U.S. Naval Observatory, Washington, 1951.

[2] McCalla, Thomas Richard, Introduction to Numerical Methods and FORTRAN Programming, John Wiley (1967). Although I have programmed extensively in Borland's Turbo Pascal and in several implementations of C and C++, I still find this book to be quite useful. The FORTRAN programs are short and elegant; the emphasis of the book is on deriving numerical methods. Try an author search at http://www.alibris.com to find this out-of-print book.

[3] Mansfield, Roger L., Topics in Astrodynamics, Astronomical Data Service, Colorado Springs, Colorado (September 2003). See http://home.att.net/~astrotopics/ for availability.

[4] The 2P2B elements that we have generated for Jupiter are
most accurate during the year 2004, yet we can use them to
see what the entire orbit looks like, and that is what we have
done in the second 3D scatterplot. We could not have plotted
the entire orbit using the DataSet points alone.

[5] Nautical Almanac Office, U.S. Naval Observatory, Multiyear Interactive Computer Almanac 1990-2005, Washington, DC (1998). Available from Willmann-Bell, Richmond, Virginia. See http://www.willbell.com.

[6] A note on notation: In physics and math, and especially in dynamical astronomy, vectors and matrices are typically denoted by boldface type. The convention was originally implemented in this worksheet by using a user-defined Math style called "Vectors & Matrices" to create vector and matrix variables, rather than simply using the standard, default Math style "Variables", which uses the regular typeface.

However, adoption of a user-defined Math style is not a good idea for an Electronic book, because the Ebook user who is not familiar with that Math style might encounter problems (a) when trying to modify math regions in the Ebook, or (b) after copying and pasting math regions into another worksheet, and then trying to modify them.

So a compromise was struck: the math regions do not use boldface for vectors and matrices, but the text regions that describe them still do.
EXAMPLES

Using a Rational Function to Fit Optical Constants


by Robert Adair

The refractive index of a material is a particularly good candidate for fitting with a rational function because the dispersion model is a sum of partial fractions, which is another form of a rational function.

    n(λ)² = A + B·λ²/(C − λ²) + D·λ²/(E − λ²) + ...

which is another form of:

    n(λ)² = (c0 + c1·λ² + c2·λ⁴ + ...) / (1 + d1·λ² + d2·λ⁴ + ...)

where λ is the wavelength of light.

One problem encountered in optics is obtaining this dispersion model from a set of data points. This is normally done using a nonlinear least-squares solver to solve for the unknown constants in the partial fraction expansion. If the data is accurate enough, rationalint can also be used to interpolate between the data points.
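As a rough sketch of that nonlinear least-squares approach outside Mathcad, the fragment below fits a two-term partial-fraction model for n² to the eight fused-silica points tabulated a little further on (InterpData). The model, parameter names, and starting values are illustrative, not part of the worksheet.

    import numpy as np
    from scipy.optimize import least_squares

    # Fused-silica (wavelength in um, refractive index) pairs -- the same
    # eight points used below as InterpData.
    lam = np.array([0.3022, 0.4047, 0.4965, 0.6328, 0.7065, 1.5, 1.8, 2.1])
    n = np.array([1.48719, 1.4696, 1.4625, 1.457, 1.4551, 1.4446, 1.4409, 1.4366])

    def n2_model(p, lam):
        # Two-term partial-fraction dispersion model for n^2:
        # A + B*l^2/(C - l^2) + D*l^2/(E - l^2)
        A, B, C, D, E = p
        l2 = lam ** 2
        return A + B * l2 / (C - l2) + D * l2 / (E - l2)

    def residuals(p):
        return n2_model(p, lam) - n ** 2

    # Illustrative starting values: a combined UV resonance near sqrt(0.005) um
    # and an IR resonance near 10 um (signs follow the C - l^2 convention above).
    p0 = [1.0, -1.1, 0.005, -0.9, 98.0]
    fit = least_squares(residuals, p0)
    print(fit.x)
    print(np.max(np.abs(np.sqrt(n2_model(fit.x, lam)) - n)))  # worst-case error in n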

As an example, consider the dispersion of fused silica. This data was taken from a table in the Melles Griot Optics catalog:

    SiO2_Data :=        (the first five of its 56 rows are shown; column 0 is the wavelength in nm, column 1 is the refractive index)

        0    180      1.5853
        1    190      1.5657
        2    200      1.5505
        3    213.9    1.5343
        4    226.7    1.5228

    nm ≡ 10^−9·m        µm ≡ 10^−6·m

    λtest := SiO2_Data〈0〉·nm        n := SiO2_Data〈1〉        rows(SiO2_Data) = 56

Typically, the experimenter will only have a small number of data points, so extract a subset of data which rationalint will interpolate.

    InterpData :=    302.2    1.48719
                     404.7    1.4696
                     496.5    1.4625
                     632.8    1.457
                     706.5    1.4551
                    1500      1.4446
                    1800      1.4409
                    2100      1.4366

    where λ is in nanometers (first column) and n is dimensionless (second column).

    λinterp := InterpData〈0〉·nm        ninterp := InterpData〈1〉        i := 0 .. rows(InterpData) − 1
Create a rational function that can evaluate the data between
the interpolation points. Note that we expect the index of
refraction squared to follow a rational function, not the index.
It is best, then, to transform the measured data by squaring
it before interpolating, then take the square root when
comparing with the original data.


    vx1 := λinterp²        vy1 := ninterp²        (squared element-by-element with the vectorize operator)

    fnsq(x) := √( rationalint(vx1, vy1, x²)_0 )        fn(x) := rationalint(λinterp, ninterp, x)_0
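Mathcad's rationalint has no drop-in SciPy equivalent, so the sketch below uses a cubic spline purely as a stand-in to illustrate the transform the text describes: interpolate n² against λ², then take the square root on the way back out. The eight points are the InterpData values above.

    import numpy as np
    from scipy.interpolate import CubicSpline

    # The eight InterpData points: wavelength in nm, refractive index n.
    lam = np.array([302.2, 404.7, 496.5, 632.8, 706.5, 1500.0, 1800.0, 2100.0])
    n = np.array([1.48719, 1.4696, 1.4625, 1.457, 1.4551, 1.4446, 1.4409, 1.4366])

    # A cubic spline is NOT rationalint; it only stands in here to show the
    # square-before / square-root-after transform.
    fit_squared = CubicSpline(lam ** 2, n ** 2)
    fit_direct = CubicSpline(lam, n)

    def n_from_squared(wavelength_nm):
        return np.sqrt(fit_squared(np.asarray(wavelength_nm) ** 2))

    # Within the data range both approaches agree closely:
    print(n_from_squared(550.0), fit_direct(550.0))   # both near 1.46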

Compute the refractive index at the full range of wavelengths.


 → →
nfit := fnsq ( λtest) n2fit := fn ( λtest)

Compare the computed values of the refractive index to the experimental data.

[Plot: refractive index n vs. wavelength, 0 to 2500 nm. Traces: full data range; extrapolated from 8 points (squared); 8 points provided; extrapolated from 8 points (direct).]
The interpolation for the squared data lies directly on top of
the actual data, even though it must extrapolate outside of
the data range, and bridge large gaps between points. The
direct interpolation without squaring overpredicts n when it is
used to extrapolate.

Regression
Here is an example using rationalfit to fit the data.

The dispersion formula is given in Melles Griot as

    fε1(λ) := 1 + 0.6961663·λ²/(λ² − (0.0684043·µm)²) + 0.4079426·λ²/(λ² − (0.1162414·µm)²) + 0.8974794·λ²/(λ² − (9.896161·µm)²)

(Note that fε1 returns the squared index n², i.e., the permittivity, consistent with the dispersion model above.)
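For reference, a direct Python transcription of that formula (using the catalog coefficients quoted above, with the wavelength in micrometers and a square root applied so the function returns the index itself) reproduces the tabulated value near 632.8 nm.

    import math

    def n_fused_silica(lam_um):
        """Melles Griot three-term dispersion formula for fused silica;
        lam_um is the wavelength in micrometers, the return value is n."""
        l2 = lam_um ** 2
        eps = (1.0
               + 0.6961663 * l2 / (l2 - 0.0684043 ** 2)
               + 0.4079426 * l2 / (l2 - 0.1162414 ** 2)
               + 0.8974794 * l2 / (l2 - 9.896161 ** 2))
        return math.sqrt(eps)

    print(n_fused_silica(0.6328))   # about 1.457, matching the 632.8 nm data point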

Let's give ourselves a squared dataset to work with that is scaled to microns for numerical accuracy:

    vx1_sc := (λinterp/µm)²        (element-by-element, using the vectorize operator)
The data also has a published accuracy:
    StdYi := 3·10^−5

Fit the data with the rationalfitnp function which creates a rational function with no poles in the data fitting region. If the poles are not in an area of concern, then rationalfit may be used for slightly higher accuracy.

    param := rationalfitnp(vx1_sc, vy1, 0.98, 2, 2, 10^−15)

    param =  [    1.2605      1.2598      1.2612
               −201.8952   −202.0415   −201.7489
                  3.1568      3.1465      3.1671
                −95.9621    −96.0318    −95.8924
                  1.0871      1.0823      1.0919  ]

    (Column 0 of param, used below as param〈0〉, holds the fitted coefficients.)

    f(x, β) := (β0 + β1·x + β2·x²) / (1 + β3·x + β4·x²)

    λmin := min(λtest)        λmax := max(λtest)

    wav := λmin, λmin + 1·nm .. λmax        nfit(λ) := √( f( (λ/µm)², param〈0〉 ) )
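For readers without Mathcad, the same (2,2) rational regression can be sketched with SciPy. Here curve_fit is only a stand-in for rationalfitnp: it does not enforce the pole-free constraint, and the classic "multiply through by the denominator" linear solve is used purely to obtain a starting guess.

    import numpy as np
    from scipy.optimize import curve_fit

    # Micron-scaled squared abscissa and squared ordinate, as in vx1_sc and vy1.
    lam_um = np.array([0.3022, 0.4047, 0.4965, 0.6328, 0.7065, 1.5, 1.8, 2.1])
    n = np.array([1.48719, 1.4696, 1.4625, 1.457, 1.4551, 1.4446, 1.4409, 1.4366])
    x, y = lam_um ** 2, n ** 2

    def f(x, b0, b1, b2, b3, b4):
        # The worksheet's (2,2) rational model f(x, beta).
        return (b0 + b1 * x + b2 * x ** 2) / (1.0 + b3 * x + b4 * x ** 2)

    # Multiply through by the denominator to get a linear least-squares
    # problem; its solution serves as the starting guess for the true fit.
    A = np.column_stack([np.ones_like(x), x, x ** 2, -x * y, -x ** 2 * y])
    p0 = np.linalg.lstsq(A, y, rcond=None)[0]

    beta, _ = curve_fit(f, x, y, p0=p0, maxfev=20000)

    n_fit = np.sqrt(f(x, *beta))
    print(np.max(np.abs(n_fit - n)))   # residuals in n; typically ~1e-5 for this data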
[Plot: refractive index n vs. wavelength, 0 to 2500 nm. Traces: full data set; rationalfit; 8 point data set; Sellmeier equation.]

All the curves, including the well-established Sellmeier equation, appear to lie approximately on top of one another within the range of the measured data.
 →  →
(
residratfit := ninterp − nfit λ interp ( ))
residSell := ninterp − fε1 λ interp ( ( ))
[Plot: residuals residratfit and residSell vs. λinterp (roughly 5×10^−7 m to 2.5×10^−6 m), on a scale of about ±5×10^−5.]

    Σ residratfit² = 8.6934 × 10^−10        Σ residSell² = 4.356 × 10^−9

Note that scaling is critical to the calculation to help the nonlinear solver converge quickly to a numerically accurate solution. Squaring the wavelength also helps in that it creates a rational function that best matches the physics of the refractive index dispersion.

Reference
(1988) Melles Griot Optics Guide, Melles Griot, Irvine, CA, p.3-5.
EXAMPLES

Wilcoxon Signed-Rank Test

The following are the compressive strengths of a material manufactured by two different methods, A and B:

 60.3   56.0 
   
 50.2   56.2 
 56.5   55.1 
 60.6   59.2 
   
 59.3   62.3  The data is paired, that
is, there are an equal
 49.7   54.5  number of measurements
    in each pool.
 50.8   56.5 
A :=  59.8  B :=  57.1  n := length ( A) n = 15
   
 52.5   56.2 
 57.4   56.1 
   
 55.8   58.5 
 54.5   63.5 
   
 53.6   58.2 
 56.8   48.9 
   
 57.1   53.0 

The signed-rank test will tell us, to some degree of statistical significance, whether the means of the two data sets are equal, that is, if they came from the same distribution. The test compares the ranks of the positive and negative differences between data pairs. First, find the differences:

diff := B − A

and determine which are positive and which are negative:

    positive(v) :=  count ← 0
                    for i ∈ 0 .. last(v)
                      if v_i > 0
                        out_count ← i
                        count ← count + 1
                    out
This definition does not count differences that are exactly 0. Then, find the ranks of the absolute values of these differences, and pull out and sum the ranks corresponding to the positive and negative differences independently:

→
ranks := Rank diff ( )
T
diff = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 -4.3 6 -1.4 -1.4 3 4.8 5.7 -2.7 3.7 -1.3 2.7 9 4.6 -7.9 -4.1

T
ranks = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 9 13 2.5 2.5 6 11 12 4 7 1 5 15 10 14 8

indexp := positive ( diff ) indexn := positive ( −diff )

    posranks :=  for i ∈ 0 .. last(indexp)
                   p_i ← ranks_(indexp_i)
                 p

    negranks :=  for i ∈ 0 .. last(indexn)
                   p_i ← ranks_(indexn_i)
                 p

 13 
   9 
6  2.5 
 11   
   2.5 
posranks =   negranks =  4 
12
7  
   1 
5  14 
 15   
   8 
 10 

    T+ := Σ posranks        T- := Σ negranks

    T+ = 79        T- = 41

Note that the variable names, T+ and T-, were made by typing T[Ctrl-Shift-K]+[Ctrl-Shift-K]. This key sequence lets you type literal characters that would otherwise be interpreted as operators in Mathcad.
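For readers working outside Mathcad, the rank bookkeeping above is a one-liner with SciPy's rankdata, which also assigns averaged ranks to ties. A minimal sketch using the same A and B data:

    import numpy as np
    from scipy.stats import rankdata

    A = np.array([60.3, 50.2, 56.5, 60.6, 59.3, 49.7, 50.8, 59.8,
                  52.5, 57.4, 55.8, 54.5, 53.6, 56.8, 57.1])
    B = np.array([56.0, 56.2, 55.1, 59.2, 62.3, 54.5, 56.5, 57.1,
                  56.2, 56.1, 58.5, 63.5, 58.2, 48.9, 53.0])

    diff = B - A
    diff = diff[diff != 0]                 # zero differences are not counted
    ranks = rankdata(np.abs(diff))         # average ranks for ties, as in the worksheet

    T_plus = ranks[diff > 0].sum()
    T_minus = ranks[diff < 0].sum()
    print(T_plus, T_minus)                 # 79.0 and 41.0, matching the worksheet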

The test statistic for smaller sample sizes (n<5) is

T := if ( T+ < T- , T+ , T-) T = 41
but here we will use the percent point function for the
normal distribution, which is reasonably close to the
appropriate shape for this test.

    T0 := qnorm(0.95, 0, 1)        T0 = 1.644854

    µT := n·(n + 1)/4        µT = 60

    σ2T := n·(n + 1)·(2·n + 1)/24        σ2T = 310

    T := (T+ − µT)/√(σ2T)        T = 1.079127

WRStest := "null hypothesis rejected, means are unequal" if T > T0

"null hypothesis OK, means are equal" otherwise

WRStest = "null hypothesis OK, means are equal"
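The normal approximation, and a cross-check against SciPy's built-in test, can be sketched as follows. Note that scipy.stats.wilcoxon reports a p-value rather than the z-type statistic used above, and its tie and zero handling differ slightly from the hand calculation, so treat it as a sanity check rather than an exact reproduction.

    import numpy as np
    from scipy.stats import norm, wilcoxon

    n, T_plus = 15, 79.0

    # The worksheet's normal approximation to the signed-rank statistic.
    mu_T = n * (n + 1) / 4                       # 60
    var_T = n * (n + 1) * (2 * n + 1) / 24       # 310
    z = (T_plus - mu_T) / np.sqrt(var_T)         # about 1.079
    print(z, norm.ppf(0.95))                     # compare with T0 = 1.645

    # Cross-check with SciPy's built-in test (two-sided p-value by default).
    A = np.array([60.3, 50.2, 56.5, 60.6, 59.3, 49.7, 50.8, 59.8,
                  52.5, 57.4, 55.8, 54.5, 53.6, 56.8, 57.1])
    B = np.array([56.0, 56.2, 55.1, 59.2, 62.3, 54.5, 56.5, 57.1,
                  56.2, 56.1, 58.5, 63.5, 58.2, 48.9, 53.0])
    stat, p = wilcoxon(B, A)
    print(stat, p)                               # statistic should be min(T+, T-) = 41 here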

So we may conclude, at the 95% confidence level, that the mean for both material samples is the same. This has implications for the manufacturing processes for these materials.

Reference
Kottegoda and Rosso, Statistics, Probability and Reliability for Civil
and Environmental Engineers, McGraw-Hill, 1997. pp.274-5.
