Msellitto Thesis
err = \frac{\left[\sum_{400}^{675}\left(R_{rs} - R_{rs}^{est}\right)^2 + \sum_{720}^{800}\left(R_{rs} - R_{rs}^{est}\right)^2\right]^{0.5}}{\left[\sum_{400}^{675}\left(R_{rs}\right)^2 + \sum_{720}^{800}\left(R_{rs}\right)^2\right]^{0.5}} \qquad (2.1)
The summation is calculated over the 38 wavelength bands from 400-675 nm and 720-800 nm. The value of R_{rs} represents the measured remote sensing reflectance at each pixel, and R_{rs}^{est} is the estimated remote sensing reflectance, each consisting of a vector of 38 of the 42 wavelength bands from 400-800 nm. R_{rs}^{est} is calculated for each wavelength as follows (where all variables besides P, G, BP, B, and H are input constants unless otherwise defined):
R_{rs}^{est} = 0.5\,r_{rs}/(1 - 1.5\,r_{rs}) \qquad (2.2)

r_{rs} = r_{rs}^{dp}\left[1 - e^{-kH\left(D_u^C + 1/\cos\theta_w\right)}\right] + \frac{1}{\pi}\,B\,e^{-kH\left(D_u^B + 1/\cos\theta_w\right)} \qquad (2.3)

k = a + b_b \qquad (2.4)

D_u^B = 1.04(1 + 5.4u)^{0.5} \qquad (2.5)

D_u^C = 1.03(1 + 2.4u)^{0.5} \qquad (2.6)

r_{rs}^{dp} = (0.084 + 0.17u)u \qquad (2.7)

u = b_b/(a + b_b) \qquad (2.8)

b_b = 0.0038(400/\lambda)^{4.3} + BP(400/\lambda)^{Y} \qquad (2.9)

a = a_w + [a_0 + a_1 \ln(P)]\,P + G\exp[-0.015(\lambda - 400)] \qquad (2.10)
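The chain of equations 2.2-2.10 can be sketched directly in code. The following is a minimal NumPy sketch, not the thesis's OpenCL implementation; the spectral constants a_w, a0, a1, the exponent Y, and the subsurface angle theta_w are placeholder inputs here (real values come from the published model data).

```python
import numpy as np

def rrs_est(P, G, BP, B, H, wavelengths, a_w, a0, a1, Y=1.0, theta_w=0.0):
    """Estimate remote sensing reflectance R_rs^est at each wavelength."""
    lam = np.asarray(wavelengths, dtype=float)

    # Eq. 2.10: total absorption
    a = a_w + (a0 + a1 * np.log(P)) * P + G * np.exp(-0.015 * (lam - 400.0))
    # Eq. 2.9: total backscattering
    bb = 0.0038 * (400.0 / lam) ** 4.3 + BP * (400.0 / lam) ** Y
    # Eq. 2.8
    u = bb / (a + bb)
    # Eq. 2.7: deep-water subsurface reflectance
    rrs_dp = (0.084 + 0.17 * u) * u
    # Eqs. 2.4-2.6
    k = a + bb
    DuB = 1.04 * (1.0 + 5.4 * u) ** 0.5
    DuC = 1.03 * (1.0 + 2.4 * u) ** 0.5
    # Eq. 2.3: subsurface remote sensing reflectance
    path = 1.0 / np.cos(theta_w)
    rrs = (rrs_dp * (1.0 - np.exp(-k * H * (DuC + path)))
           + (1.0 / np.pi) * B * np.exp(-k * H * (DuB + path)))
    # Eq. 2.2: above-surface remote sensing reflectance
    return 0.5 * rrs / (1.0 - 1.5 * rrs)
```

Given a vector of 42 wavelengths from 400-800 nm, this returns one estimated reflectance per band, which can then be compared against the measured spectrum.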
The objective variables P, G, BP, B, and H are given minimum and maximum values for use in the numerical optimization step. This makes the problem a non-linear bound-constrained optimization problem, which has been shown to be quite a computationally intensive task to solve [4]. Typical minimum and maximum values for the objective variables are shown in Table 2.1.
Parameter   Units   Lower Limit   Upper Limit
P           m^-1    0.005         0.5
G           m^-1    0.002         3.5
BP          m^-1    0.001         0.5
B           -       0.01          1.0
H           m       0.2           10.0

Table 2.1: Spectroscopy parameter constraints used in optimization
For more detailed information on the theory of the image spectroscopy algorithm described in this thesis, please see Goodman and Ustin [5].
2.2 Mathematical Optimization
2.2.1 Fundamentals
Mathematical or numerical optimization refers to the minimization (or maximization)
of a given objective function of one or more decision variables, possibly subject to
a series of constraints. These constraints may be equality or inequality constraints.
More formally, an optimization problem can be stated as:

\min_{x \in \mathbb{R}^N} f(x)

subject to: g_i(x) = 0, \quad (i = 0 \ldots (M-1))
h_i(x) \le 0, \quad (i = 0 \ldots (J-1))
L_i \le x_i \le U_i, \quad (i = 0 \ldots (N-1))
Optimization problems have many characteristics that can be used to classify them. Some of these classification criteria are:

Presence of constraints: The optimization problem may be classified as unconstrained or constrained, based on whether the problem has constraints or not.

Number of decision variables: The optimization problem may be classified as one-dimensional or multi-dimensional based on the number of decision variables in its objective function.

Nature of objective function: Based on whether the problem's objective function is linear or not, the optimization problem can be classified as linear or non-linear.

Permissible values of decision variables: Depending on the values permitted for the decision variables, optimization problems can be classified as integer or real-valued programming problems.
There is also a special type of constrained optimization problem that has constraints only on the ranges of the decision variables; these are known as bound-constrained optimization problems. There are many classes of optimization problems, and various techniques are used to solve each one. In this thesis we are most concerned with real, non-linear, bound-constrained optimization problems. The next two sections provide a brief overview of some of the basic theory and techniques used to solve this class of optimization problem, beginning with the simplified unconstrained cases and expanding the methods for dealing with bound constraints.
2.2.2 Single-Variable Optimization Techniques
Optimization techniques aim to find a value x^* that minimizes the objective function f(x). This may be a global minimum, which means that there is no other value of x which produces a value f(x) less than f(x^*). A value x^* is known to be a local minimum if:

f'(x^*) = 0

and

f''(x^*) \ge 0
These are known as the first and second order necessary conditions. Note that if f''(x^*) = 0, x^* may or may not be a local minimizer. Newton's method estimates a zero of a function by following the tangent line at the current estimate x_n down to the x-axis:

f'(x_n) = \frac{\Delta y}{\Delta x} = \frac{f(x_n) - 0}{x_n - x_{n+1}}

Solving for x_{n+1} gives:

Newton's Method: x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}
Newton's method is successively repeated until a satisfactory estimate for x^* is achieved (until the estimate changes by less than a certain tolerance value, for instance). We show two iterations of Newton's method in Figure 2.5.

Figure 2.5: Two iterations of Newton's method
Newton's method can be modified to find an estimate for a local minimizer of a function instead of a zero by estimating the zero of the function's first derivative:

Newton's Method for Optimization: x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}
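As a concrete illustration (not code from the thesis), the iteration above can be sketched in a few lines of Python; the function names and tolerance are illustrative choices:

```python
def newton_minimize(f_prime, f_double_prime, x0, tol=1e-10, max_iter=50):
    """Newton's method for 1-D optimization: find a zero of f'."""
    x = x0
    for _ in range(max_iter):
        step = f_prime(x) / f_double_prime(x)
        x -= step
        if abs(step) < tol:  # stop when the estimate barely changes
            break
    return x

# Minimize f(x) = (x - 3)^2 + 1, whose minimizer is x* = 3.
x_star = newton_minimize(lambda x: 2 * (x - 3), lambda x: 2.0, x0=0.0)
```

For a quadratic the method converges in a single step, since the tangent model of f' is exact.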
2.2.3 Multi-Variable Optimization Techniques
Many optimization techniques require the first and second derivatives of a function. When dealing with a function f(x) of more than one variable, the gradient \nabla f is used instead of the first derivative. It is defined as the vector of partial derivatives of f(x), where f(x) is a function of N variables:

\nabla f = \left[\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_N}\right]
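When an analytic gradient is not available, \nabla f can be approximated numerically by central finite differences. The sketch below is a generic CPU illustration, not the thesis's GPU gradient evaluation:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h  # perturb one coordinate at a time
        grad[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return grad

# Example: f(x) = x1^2 + 3*x2^2 has gradient [2*x1, 6*x2].
g = numerical_gradient(lambda x: x[0]**2 + 3 * x[1]**2, [1.0, 2.0])
```

Each gradient costs 2N function evaluations, which is exactly the kind of bulk evaluation that maps well onto SIMD hardware.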
Similarly, the Hessian matrix \nabla^2 f is used for the second derivative of a function of N variables and is defined as:

\nabla^2 f = \begin{bmatrix}
\frac{\partial^2 f}{\partial x_1 \partial x_1} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_N} \\
 & \frac{\partial^2 f}{\partial x_2 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_N} \\
 & & \ddots & \vdots \\
\text{symmetric} & & & \frac{\partial^2 f}{\partial x_N \partial x_N}
\end{bmatrix}

Newton's Method (multi-variable): x_{n+1} = x_n - \left[\nabla^2 f(x_n)\right]^{-1} \nabla f(x_n)
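A direct sketch of the multi-variable Newton iteration follows; it solves the linear system \nabla^2 f(x_n) d = -\nabla f(x_n) rather than forming the inverse explicitly (an illustrative example, not the thesis's code):

```python
import numpy as np

def newton_multivariate(grad, hess, x0, tol=1e-10, max_iter=50):
    """Multi-variable Newton: x_{n+1} = x_n - [hess(x_n)]^{-1} grad(x_n)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        # Solve hess(x) d = -grad(x) instead of inverting the Hessian.
        d = np.linalg.solve(hess(x), -grad(x))
        x = x + d
        if np.linalg.norm(d) < tol:
            break
    return x

# Minimize f(x) = (x1 - 1)^2 + 2*(x2 + 2)^2.
grad = lambda x: np.array([2 * (x[0] - 1), 4 * (x[1] + 2)])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 4.0]])
x_star = newton_multivariate(grad, hess, [0.0, 0.0])
```

Solving the linear system is the standard numerical practice; explicitly inverting the Hessian is both slower and less accurate.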
Note that in the multi-variable Newton method, the search direction d is equivalent to -\left[\nabla^2 f(x_n)\right]^{-1} \nabla f(x_n). Newton's method can also be extended to handle bound-constrained problems in the same way as the steepest descent method. Though this technique is not commonly used in practice due to its requirement that the objective function's Hessian be provided, it serves as the basis for commonly used quasi-Newton methods such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method [7].
BFGS Method
The Broyden-Fletcher-Goldfarb-Shanno (BFGS) method is a quasi-Newton numerical optimization method. It is categorized as a quasi-Newton method because it does not require the Hessian of the function being optimized, but instead iteratively builds up an approximation to the Hessian from changes in the function's gradient [8]. A summary of the steps of the BFGS method is:

1. Choose an initial starting point x_0.

2. Initialize the approximate Hessian B to the identity matrix.

3. Check for convergence: \nabla f(x_n) = 0. If converged, a local minimum is found and the algorithm is done.

4. Calculate the search direction d = -B^{-1} \nabla f(x_n).

5. Find the step size \alpha that minimizes the function along the search direction by solving the 1-dimensional optimization problem \min_\alpha \phi(\alpha) = f(x_n + \alpha d).

6. Update the approximate Hessian B using the BFGS update formula:

s_n = x_{n+1} - x_n
y_n = \nabla f_{n+1} - \nabla f_n
B_{n+1} = B_n - \frac{B_n s_n s_n^T B_n}{s_n^T B_n s_n} + \frac{y_n y_n^T}{y_n^T s_n}

7. Set the minimizer of step 5 as the new x_n and go to step 3.
Note that care must be taken in step 5 when choosing a step size in order to keep the approximate Hessian matrix B positive-definite. This is analogous to requiring f''(x_n) \ge 0 in the single-variable case, to be certain that the method is moving toward a minimum.
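The steps above can be sketched as follows. This is an illustrative serial implementation, not the BFGS-B CL solver; step 5 is simplified to a backtracking (Armijo) line search rather than an exact 1-D minimization:

```python
import numpy as np

def bfgs(f, grad, x0, tol=1e-8, max_iter=200):
    """Minimal BFGS sketch following steps 1-7 above."""
    x = np.asarray(x0, dtype=float)
    B = np.eye(x.size)                      # step 2: approximate Hessian
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:         # step 3: convergence check
            break
        d = np.linalg.solve(B, -g)          # step 4: search direction
        alpha = 1.0                         # step 5: backtracking search
        while f(x + alpha * d) > f(x) + 1e-4 * alpha * (g @ d):
            alpha *= 0.5
        x_new = x + alpha * d
        s = x_new - x                       # step 6: BFGS update
        y = grad(x_new) - g
        Bs = B @ s
        B = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
        x = x_new                           # step 7: repeat from step 3
    return x

# Example: minimize f(x) = (x1 - 1)^2 + (x2 + 2)^2.
x_min = bfgs(lambda x: (x[0] - 1) ** 2 + (x[1] + 2) ** 2,
             lambda x: np.array([2 * (x[0] - 1), 2 * (x[1] + 2)]),
             [0.0, 0.0])
```

The Armijo condition in step 5 also serves the positive-definiteness concern noted above: accepting only sufficient-decrease steps keeps the curvature term y^T s positive in well-behaved cases.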
err_j = \frac{\left[\sum_{400}^{675}\left(R_{rs} - R_{rs}^{est}\right)^2 + \sum_{720}^{800}\left(R_{rs} - R_{rs}^{est}\right)^2\right]^{0.5}}{\left[\sum_{400}^{675}\left(R_{rs}\right)^2 + \sum_{720}^{800}\left(R_{rs}\right)^2\right]^{0.5}}
where R_{rs}^{est} is defined in terms of P, G, BP, B, and H by equations 2.2-2.10. The water properties P, G, BP, B, and H have minimum and maximum bounds assigned to them. This function can be viewed as the objective function to be optimized for each pixel. Since each pixel's water properties can be estimated independently from one another, the image spectroscopy algorithm can be stated as an array of numerical optimization problems where the goal is to find:
\min\, err_j(P, G, BP, B, H), \quad (j = 0 \ldots (M-1))

L_P \le P \le U_P
L_G \le G \le U_G
L_{BP} \le BP \le U_{BP}
L_B \le B \le U_B
L_H \le H \le U_H
where M is the number of pixels in the hyperspectral image. This array of optimization problems can be passed into and solved in parallel by the BFGS-B CL Solver.
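The per-pixel formulation can be illustrated with SciPy's serial L-BFGS-B solver and the bounds from Table 2.1. This is only a structural sketch: the stand-in objective below is a simple quadratic, not the real err_j of equation 2.1.

```python
import numpy as np
from scipy.optimize import minimize

# Bounds from Table 2.1, ordered (P, G, BP, B, H).
BOUNDS = [(0.005, 0.5), (0.002, 3.5), (0.001, 0.5), (0.01, 1.0), (0.2, 10.0)]

def solve_pixel(err_j, x0=(0.1, 0.1, 0.01, 0.1, 1.0)):
    """Solve one pixel's bound-constrained problem with serial L-BFGS-B."""
    res = minimize(err_j, x0, method="L-BFGS-B", bounds=BOUNDS)
    return res.x

# Stand-in objective: squared distance to a known parameter vector.
# A real err_j would compare measured and estimated R_rs spectra.
true = np.array([0.05, 0.8, 0.02, 0.3, 4.0])
err = lambda x: np.sum((np.asarray(x) - true) ** 2)
recovered = solve_pixel(err)
```

Since every pixel's problem is independent, a list of such problems can be distributed across workers; the BFGS-B CL Solver exploits exactly this independence on the GPU.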
4.2 Image Tests
Three hyperspectral images were used to test the performance of the BFGS-B CL
Solver, a 2MB synthetic image of 12K pixels, a 17MB real-world image of 100K
pixels (referred to as Real Image A), and a 23MB real-world image of 128K pixels
(referred to as Real Image B). All images have 42 spectral bands. The compute time
results as well as the quality of the output of the BFGS-B CL tests are compared
against the Generalized Reduced Gradient (GRG) serial optimization solver. The
GRG solver is provided by the IDL (interactive data language) programming language
which serves as the basis for the ENVI (ENvironment for Visualizing Images) remote
sensing application. The IDL/GRG solver was used by Goodman and Ustin [5] for
their work in developing the spectroscopy algorithm. Serial BFGS-B is also used for
compute time comparison. All water properties for the synthetic image are known
and only the water depth (H) is known for the real-world images. For the BFGS-B
tests, two variations of runs were made, one (referred to as normal-quality) where
the Hessian approximation parameter is set to 6 and another (referred to as higher
quality) where the Hessian approximation parameter is set to 20.
All BFGS-B and BFGS-B CL tests were run on a 6-core Intel Xeon 2.9GHz system
running Linux with 4GB of RAM and an Nvidia Tesla M2070 GPU. The IDL solver
was run on an Intel Xeon Quad 3.00 GHz system running Windows with 8GB of
RAM.
4.2.1 Compute-Time Performance Results
Testing for computation performance was done on the synthetic as well as the two real-world images. Tables 4.1-4.3 show the computation time taken per pixel on each of the images and compare BFGS-B CL against serial BFGS-B as well as the IDL optimization solver. BFGS-B CL was tested with two quality variations and 1, 2, 4, and 8 CPU compute threads. Figures 4.1-4.6 show the relative performance of BFGS-B CL to the IDL/GRG solver and serial BFGS-B.
Synthetic Image Compute-Time Performance
Test Time / Pixel (ms) Time / Pixel (ms) (HQ)
IDL/GRG 118.90 -
Serial BFGS-B 7.19 9.31
BFGS-B CL + 1 CPU Thread 7.19 9.31
BFGS-B CL + 2 CPU Threads 1.17 4.05
BFGS-B CL + 4 CPU Threads 0.83 2.18
BFGS-B CL + 8 CPU Threads 0.44 0.75
Table 4.1: Synthetic image compute-time performance results.
Real Image A Compute-Time Performance
Test Time / Pixel (ms) Time / Pixel (ms) (HQ)
IDL/GRG 15.0 -
Serial BFGS-B 7.39 9.89
BFGS-B CL + 1 CPU Thread 0.81 3.21
BFGS-B CL + 2 CPU Threads 0.49 1.73
BFGS-B CL + 4 CPU Threads 0.40 1.00
BFGS-B CL + 8 CPU Threads 0.30 0.55
Table 4.2: Real image A compute-time performance results.
Real Image B Compute-Time Performance
Test Time / Pixel (ms) Time / Pixel (ms) (HQ)
IDL/GRG 18.6 -
Serial BFGS-B 6.98 9.63
BFGS-B CL + 1 CPU Thread 0.79 3.19
BFGS-B CL + 2 CPU Threads 0.47 1.73
BFGS-B CL + 4 CPU Threads 0.33 0.92
BFGS-B CL + 8 CPU Threads 0.26 0.53
Table 4.3: Real image B compute-time performance results.
Figure 4.1: Synthetic image compute-time performance results using IDL as the base.
Figure 4.2: Synthetic image compute-time performance results using serial BFGS-B as the base.
Figure 4.3: Real image A compute-time performance results using IDL as the base.
Figure 4.4: Real image A compute-time performance results using serial BFGS-B as
the base.
Figure 4.5: Real image B compute-time performance results using IDL as the base.
Figure 4.6: Real image B compute-time performance results using serial BFGS-B as
the base.
The results show that BFGS-B CL is capable of obtaining a relatively high speedup in comparison to both the IDL solver and serial BFGS-B. When using 8 CPU compute threads, BFGS-B CL achieves up to a 270x speedup on the synthetic image and a 70x speedup on the real-world images compared to the IDL solver. Compared to the serial variant, BFGS-B CL is capable of obtaining an 8x speedup when using only the GPU and one CPU compute thread, and achieves over an order of magnitude speedup when using at least 4 CPU compute threads.
4.2.2 Image Scaling Effects on Compute-Time Performance
To explore the impact of image size on performance, the synthetic image was tiled to create larger images of approximately 10MB, 21MB, 104MB, and 209MB. These tiled images were run with both 2 and 8 CPU compute threads at the normal quality settings. The compute-time performance per pixel is shown in Table 4.4 and presented in Figure 4.7.
Image Scaling Effects on Compute-Time
Image Size BFGS-B CL + 2 CPU Threads BFGS-B CL + 8 CPU Threads
(MB) Time / Pixel (ms) Time / Pixel (ms)
2.1 0.83 0.44
10.5 0.78 0.42
20.9 0.75 0.41
104.6 0.66 0.40
209.2 0.63 0.39
Table 4.4: Image scaling effects on compute-time.
The image scaling tests show that image size has little impact on the throughput
of the BFGS-B CL Solver. In fact, increasing the image size results in a slight increase
in throughput of the solver.
Figure 4.7: Image scaling effects on compute-time.
4.2.3 Output Quality Performance
In addition to compute-time performance, the quality of the output water properties of the BFGS-B CL Solver was also tested and compared with the IDL solver. For the synthetic image all five water properties are known, while for the real-world images only the water depth (H) is known. The output water properties were measured for the correlation coefficient (multiple R), coefficient of determination (R²), standard error, and average absolute difference. These measurements are shown in Tables 4.5-4.10 and Figures 4.8-4.21.
Synthetic Image Multiple R
P G BP B H
IDL/GRG 0.997 1.00 0.941 0.625 0.817
BFGS-B CL 0.994 0.999 0.994 0.549 0.823
BFGS-B CL (HQ) 0.995 0.999 0.994 0.592 0.852
Table 4.5: Synthetic image multiple R for the various water properties.
Synthetic Image R²
P G BP B H
IDL/GRG 0.995 1.00 0.886 0.390 0.667
BFGS-B CL 0.987 0.998 0.988 0.302 0.677
BFGS-B CL (HQ) 0.990 0.989 0.989 0.350 0.726
Table 4.6: Synthetic image R² for the various water properties.
Synthetic Image Standard Error
P G BP B H
IDL/GRG 0.0131 0.0142 0.0631 0.2813 1.5770
BFGS-B CL 0.0207 0.0630 0.0201 0.3054 1.7096
BFGS-B CL (HQ) 0.0186 0.0548 0.0195 0.2967 1.5550
Table 4.7: Synthetic image standard error for the various water properties.
Synthetic Image Average Absolute Difference
P G BP B H
IDL/GRG 0.00137 0.00099 0.01071 0.16674 0.90355
BFGS-B CL 0.00306 0.00579 0.00187 0.19577 0.88673
BFGS-B CL (HQ) 0.00256 0.00424 0.00161 0.17635 0.75302
Table 4.8: Synthetic image average absolute difference for the various water properties.
Figure 4.8: Synthetic image multiple R for the various water properties.
Figure 4.9: Synthetic image R² for the various water properties.
Figure 4.10: Synthetic image standard error for P, G, and BP.
Figure 4.11: Synthetic image standard error for B and H.
Figure 4.12: Synthetic image average absolute difference for P, G, and BP.
Figure 4.13: Synthetic image average absolute difference for B and H.
Real Image A Output Quality of Water Property H
Multiple R    R²    Std Error    Avg Absolute Difference
IDL/GRG 0.854 0.729 2.0860 2.8955
BFGS-B CL 0.677 0.459 4.9364 4.1306
BFGS-B CL (HQ) 0.750 0.563 3.2408 2.9241
Table 4.9: Real image A output quality of water property H.
Real Image B Output Quality of Water Property H
Multiple R    R²    Std Error    Avg Absolute Difference
IDL/GRG 0.881 0.776 3.0092 3.6135
BFGS-B CL 0.931 0.866 3.0888 1.8780
BFGS-B CL (HQ) 0.877 0.769 3.3528 3.1363
Table 4.10: Real image B output quality of water property H.
Figure 4.14: Real image A multiple R of water property H.
Figure 4.15: Real image A R² of water property H.
Figure 4.16: Real image A standard error of water property H.
Figure 4.17: Real image A average absolute difference of water property H.
Figure 4.18: Real image B multiple R of water property H.
Figure 4.19: Real image B R² of water property H.
Figure 4.20: Real image B standard error of water property H.
Figure 4.21: Real image B average absolute difference of water property H.
For the synthetic image test, the BFGS-B CL multiple R and R² are quite comparable to the IDL solver. The standard error is very similar between the two solvers for water properties P, B, and H. The IDL solver obtains a lower standard error for property G, while the BFGS-B CL Solver obtains a lower standard error for property BP. The two solvers have comparable average absolute differences for all water properties except property BP, where BFGS-B CL outperforms the IDL solver.

For both real-world images the solvers also achieve comparable multiple R and R² values. For real-world image A, the IDL solver achieves a lower standard error and absolute difference for property H compared to the normal quality BFGS-B CL; however, this difference in quality is diminished in the higher quality test.

For real-world image B the two solvers achieve similar standard error values. The average absolute difference of property H for real-world image B is lower for both BFGS-B CL runs. These results show that the quality of the BFGS-B CL Solver is similar to the IDL solver and therefore of acceptable quality for the purposes of the image spectroscopy algorithm.
Chapter 5
Conclusion and Future Work
5.1 Conclusion
The processing requirements of today's scientific algorithms continue to increase with their complexity and the data sizes that they operate on. New computing platforms such as GPUs and heterogeneous computing systems have emerged to meet the processing demands of these new applications. In this thesis we have shown a method of utilizing today's CPU-GPU heterogeneous computing platforms to accelerate a spectroscopy algorithm for submerged marine environments. We have presented the implementation of a parallel non-linear optimization solver, BFGS-B CL, that takes advantage of the massive SIMD throughput of GPUs, as well as the coarse-grained multi-threading of today's multi-core CPUs, to accelerate the execution of the spectroscopy algorithm. Testing results show an order of magnitude increase in computational throughput compared to the serial CPU version of the solver, and up to a two order of magnitude increase when compared to commercial non-linear solvers such as IDL/GRG. The algorithm output quality when using the BFGS-B CL solver was also shown to be comparable to the IDL/GRG solver when examining the resulting output for two real-world hyperspectral images, as well as a synthetic image. GPU and heterogeneous computing is a powerful processing technology and shows substantial potential for accelerating future remote sensing algorithms, as well as other computationally demanding scientific applications.
5.2 Future Work
5.2.1 Coarse-grained Search for Initial Point
There is potential for the GPU to be further utilized to achieve even higher performance with the BFGS-B CL Solver. Since the GPU is capable of high throughput when performing SIMD operations, it can be employed effectively to evaluate a large number of points of a given mathematical function. This property would allow the GPU to very quickly search for a better initial point for the optimization algorithm by performing a coarse-grained search of the function, which has the potential to decrease the number of iterations required to perform the optimization. This may allow the BFGS-B CL solver to yield additional acceleration and/or improved accuracy.
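A minimal sketch of such a coarse-grained search follows, evaluated serially here for illustration (on a GPU, every grid point could be evaluated in one SIMD pass); the grid resolution n is an arbitrary choice:

```python
import numpy as np
from itertools import product

def coarse_grid_start(err, bounds, n=4):
    """Evaluate err on a coarse grid over the bounds and return the
    best point found, to be used as the optimizer's initial guess."""
    axes = [np.linspace(lo, hi, n) for lo, hi in bounds]
    best_x, best_e = None, np.inf
    for point in product(*axes):  # n**len(bounds) candidate points
        e = err(np.array(point))
        if e < best_e:
            best_x, best_e = np.array(point), e
    return best_x

# Toy 2-D example: the grid contains the true minimizer (0.3, 0.7).
start = coarse_grid_start(lambda x: (x[0] - 0.3)**2 + (x[1] - 0.7)**2,
                          [(0.0, 1.0), (0.0, 1.0)], n=11)
```

Starting the solver from the best of these candidates, rather than from a fixed default, is what could reduce the iteration count described above.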
5.2.2 Pipelining the Execution of the Solver
Another potential modification to explore for further performance gains of the BFGS-B CL Solver is to pipeline its execution. The solver can be thought of as having two stages for each iteration: the first stage is the execution of a single step of the BFGS-B algorithm in a coarse-grained manner by the CPU compute threads, and the second stage is the evaluation of the functions and gradients in a SIMD fashion by the GPU. Currently the solver is designed to run all problem elements through the first stage before executing the evaluation kernel on the GPU. The solver could be modified to pipeline these two stages by waiting only for a batch of problem elements to be completed by the first solver stage instead of waiting for all problem elements to complete it.
With 8 CPU compute threads the BFGS-B solver has been measured to spend about 65% of its execution time in its first stage and about 35% in its second stage. It is estimated with the NVIDIA Occupancy Calculator that about 6K threads (or problem elements) are needed to occupy the GPU sufficiently to hide memory latencies and to cover the overhead of GPU kernel launches. With this knowledge, the solver could be modified to start execution of its second stage on 6K-sized batches of problem elements as they become available by finishing the first stage. The overlapping of these solver stages would lead to less time being spent on each iteration of the problem and would therefore increase the solver's throughput. Ideally the solver would be able to gain about 35% more performance compared to its current design. It can be estimated that on smaller problems such as the synthetic image (which has 12K problem elements), each iteration would see a performance increase of about 17.5%. On larger images such as real image A (which has 100K problem elements), it can be estimated that each iteration would see about a 33% increase in performance.
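The batched hand-off between the two stages could look like the following sketch, with Python threads and a queue standing in for the CPU compute threads and GPU kernel launches; the batch size of 6K follows the occupancy estimate above:

```python
import queue
import threading

BATCH = 6000  # approximate element count needed to occupy the GPU

def pipelined_solve(elements, cpu_stage, gpu_stage):
    """Overlap the two solver stages: elements finished by stage one are
    grouped into batches, and each batch is handed to the (simulated)
    GPU stage as soon as it is full, instead of waiting for all
    elements to finish stage one."""
    batches = queue.Queue()

    def producer():
        buf = []
        for e in elements:
            buf.append(cpu_stage(e))       # stage 1: one BFGS-B step
            if len(buf) >= BATCH:
                batches.put(buf)
                buf = []
        if buf:
            batches.put(buf)               # flush the final partial batch
        batches.put(None)                  # sentinel: no more batches

    threading.Thread(target=producer).start()
    results = []
    while (batch := batches.get()) is not None:
        results.extend(gpu_stage(batch))   # stage 2: SIMD evaluation
    return results
```

In the real solver the producer would be the pool of CPU compute threads and the consumer an OpenCL kernel launch per batch; the queue is simply the synchronization point between them.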
Bibliography

[1] N. Short, NASA Remote Sensing Tutorial. Website, 2010. http://rst.gsfc.nasa.gov.

[2] Z. Lee, K. L. Carder, C. D. Mobley, R. G. Steward, and J. S. Patch, "Hyperspectral remote sensing for shallow waters. 1. A semianalytical model," Appl. Opt., vol. 37, pp. 6329-6338, Sep 1998.

[3] Z. Lee, K. L. Carder, C. D. Mobley, R. G. Steward, and J. S. Patch, "Hyperspectral remote sensing for shallow waters. 2. Deriving bottom depths and water properties by optimization," Appl. Opt., vol. 38, pp. 3831-3843, Jun 1999.

[4] J. Goodman, D. Kaeli, and D. Schaa, "Accelerating an imaging spectroscopy algorithm for submerged marine environments using graphics processing units," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 4, pp. 669-676, Sep 2011.

[5] J. A. Goodman and S. L. Ustin, "Classification of benthic composition in a coral reef environment using spectral unmixing," Journal of Applied Remote Sensing, vol. 1, no. 1, p. 011501, 2007.

[6] J. Nocedal and S. J. Wright, Numerical Optimization, ch. 8. Springer, 2nd ed., 2006.

[7] S. Rao, Engineering Optimization: Theory and Practice, ch. 1. Wiley, 4th ed., 2009.

[8] C. T. Kelley, Iterative Methods for Optimization, ch. 5. Society for Industrial and Applied Mathematics, 1999.

[9] I. Buck, "The Evolution of GPUs for General Purpose Computing." GTC 2010.

[10] R. Fernando, GPU Gems: Programming Techniques, Tips and Tricks for Real-Time Graphics, ch. 28. Pearson Higher Education, 2004.

[11] Nvidia, NVIDIA CUDA Programming Guide Version 3.0, 2010.

[12] Khronos OpenCL Working Group, The OpenCL Specification Version 1.0, 2008.

[13] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu, "A limited memory algorithm for bound constrained optimization," SIAM Journal on Scientific Computing, vol. 16, no. 5, p. 1190, 1995.