Msellitto Thesis

NORTHEASTERN UNIVERSITY
Graduate School of Engineering

Thesis Title: Accelerating an Imaging Spectroscopy Algorithm for Submerged Ma-
rine Environments Using Heterogeneous Computing
Author: Matthew Sellitto
Department: Electrical and Computer Engineering
Approved for Thesis Requirements of the Master of Science Degree:
Thesis Advisor: Professor David Kaeli Date
Thesis Committee: Professor Gunar Schirner Date
Thesis Committee: Dr. James Goodman Date
Head of Department: Date
Graduate School Notied of Acceptance:
Associate Dean of Engineering: Dr. Sara Wadia-Fascetti Date
Copy Deposited in Library:
Reference Librarian Date
ii
ACCELERATING AN IMAGING SPECTROSCOPY
ALGORITHM FOR SUBMERGED MARINE
ENVIRONMENTS USING HETEROGENEOUS
COMPUTING
A Thesis Presented
by
Matthew Sellitto
to
The Department of Electrical and Computer Engineering
in partial fulllment of the requirements
for the degree of
Master of Science
in
Electrical and Computer Engineering
in the eld of
Computer Engineering
Northeastern University
Boston, Massachusetts
January 2012
c Copyright 2012 by Matthew Sellitto
All Rights Reserved
iv
Abstract
Graphics Processing Units (GPUs) have proven to be highly eective at accelerating
processing speed for a large range of scientic and general purpose applications. As
data needs increase, and more complex data analysis methods are used, the processing
requirements for solving scientic problems also correspondingly increase. The mas-
sive parallel processing power of GPUs can be harnessed and used alongside multi-core
CPUs to address these increased needs and allow acceleration of scientic algorithms
to open up new realms of possibilities. As an example, there are many scientic
problems that require solving non-linear optimization problems of multiple variables
across large arrays of data. These types of problems are classied as highly dicult
and require a great deal of computational time to solve using traditional techniques.
By utilizing modern local optimization techniques, such as the iterative quasi-Newton
algorithms, and combining them with the computational throughput of a CPU-GPU
heterogeneous computing platform, we can greatly decrease the processing time re-
quired to solve scientic problems of this form.
Remote sensing, which is utilized across a wide array of disciplines, including re-
source management, disaster relief planning, environmental assessment, and climate
change impact analysis, represents an ideal problem to be addressed using these tech-
niques. The data volume and processing requirements associated with remote sensing
v
are rapidly expanding as a result of the increasing number of satellite and airborne
sensors, greater data accessibility, and expanded utilization of data intensive technolo-
gies, such as imaging spectroscopy. In this thesis we demonstrate the advantages of
this high performance computing technology by accelerating an imaging spectroscopy
algorithm for submerged marine habitats using a CPU-GPU heterogeneous computing
platform and a parallel optimization solver written to take advantage of this platform.
Results indicate that considerable improvement in performance of approximately an
order of magnitude can be achieved using parallel processing on a CPU-GPU com-
puting platform compared to serial processing on the CPU using the same techniques.
This technology has enormous potential for continued growth in exploiting high per-
formance computing, and provides the foundation for signicantly enhanced remote
sensing capabilities.
vi
Contents
Abstract v
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contribution of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Organization of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background 5
2.1 Hyperspectral Imaging . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 An Image Spectroscopy Algorithm for Submerged
Marine Environments . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Mathematical Optimization . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.2 Single-Variable Optimization Techniques . . . . . . . . . . . . 13
2.2.3 Multi-Variable Optimization Techniques . . . . . . . . . . . . 15
2.3 GPGPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.1 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.2 Modern GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
vii
2.3.3 OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3 The Design of the BFGS-B CL Optimization Solver 29
3.1 Design Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.1 Problem Input Form . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.2 Using the BFGS-B CL Solver . . . . . . . . . . . . . . . . . . 31
3.3 The L-BFGS-B Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 Outline of Solver Execution . . . . . . . . . . . . . . . . . . . . . . . 34
3.5 Coarse-Grained Data Parallelism Using Multi-Threading . . . . . . . 35
3.6 Parallel Function and Gradient Evaluations on the GPU . . . . . . . 37
4 Accelerating the Spectroscopy Algorithm with the BFGS-B CL
Optimization Solver 39
4.1 Stating the Spectroscopy Algorithm as an
Array of Optimization Problems . . . . . . . . . . . . . . . . . . . . . 39
4.2 Image Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.1 Compute-Time Performance Results . . . . . . . . . . . . . . 41
4.2.2 Image Scaling Eects on Compute-Time Performance . . . . . 47
4.2.3 Output Quality Performance . . . . . . . . . . . . . . . . . . . 48
5 Conclusion and Future Work 60
5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2.1 Coarse-grained Search for Initial Point . . . . . . . . . . . . . 61
5.2.2 Pipelining the Execution of the Solver . . . . . . . . . . . . . 61
viii
Bibliography 63
ix
List of Figures
2.1 Two-dimensional projection of a hyperspectral cube . . . . . . . . . . 6
2.2 Spectral curves of objects measured from hyperspectral imagery . . . 7
2.3 Marine environment mapping of Kaneohe Bay, Hawaii . . . . . . . . . 9
2.4 Local minimums, global minimums, and inection points . . . . . . . 14
2.5 Two iterations of Newtons method . . . . . . . . . . . . . . . . . . . 15
2.6 The steepest descent method . . . . . . . . . . . . . . . . . . . . . . . 17
2.7 Traditional graphics pipeline . . . . . . . . . . . . . . . . . . . . . . . 21
2.8 Evolution to a wider and more generalized graphics pipeline . . . . . 22
2.9 A modern GPU architecture (including memories) . . . . . . . . . . . 24
2.10 Theoretical oating-point performance of todays GPUs versus CPUs 25
2.11 The platform model of OpenCL . . . . . . . . . . . . . . . . . . . . . 26
2.12 The memory model of OpenCL . . . . . . . . . . . . . . . . . . . . . 27
3.1 Calling arguments for the BFGS-B CL Solver . . . . . . . . . . . . . 32
3.2 Execution Outline of the BFGS-B CL Solver . . . . . . . . . . . . . . 35
3.3 Distributing the solver work items to the CPU compute threads . . . 36
3.4 An example OpenCL evaluation kernel . . . . . . . . . . . . . . . . . 38
x
4.1 Synthetic image compute-time performance results using IDL as the
base. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Synthetic image compute-time performance results using serial BFGS-
B as base. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.3 Real image A compute-time performance results using IDL as the base. 44
4.4 Real image A compute-time performance results using serial BFGS-B
as the base. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.5 Real image B compute-time performance results using IDL as the base. 45
4.6 Real image B compute-time performance results using serial BFGS-B
as the base. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.7 Image scaling eects on compute-time. . . . . . . . . . . . . . . . . . 48
4.8 Synthetic image multiple R for the various water properties. . . . . . 50
4.9 Synthetic image R
2
for the various water properties. . . . . . . . . . . 51
4.10 Synthetic image standard error for P, G, and BP. . . . . . . . . . . . 51
4.11 Synthetic image standard error for B and H. . . . . . . . . . . . . . . 52
4.12 Synthetic image average absolute dierence for P, G, and BP. . . . . 52
4.13 Synthetic image average absolute dierence for B and H. . . . . . . . 53
4.14 Real image A multiple R of water property H. . . . . . . . . . . . . . 55
4.15 Real image A R
2
of water property H. . . . . . . . . . . . . . . . . . 55
4.16 Real image A standard error of water property H. . . . . . . . . . . . 56
4.17 Real image A average absolute dierence of water property H. . . . . 56
4.18 Real image B multiple R of water property H. . . . . . . . . . . . . . 57
4.19 Real image B R
2
of water property H. . . . . . . . . . . . . . . . . . 57
4.20 Real image B standard error of water property H. . . . . . . . . . . . 58
4.21 Real image B average absolute dierence of water property H. . . . . 58
xi
List of Tables
2.1 Spectroscopy parameter constraints used in optimization . . . . . . . 11
4.1 Synthetic image compute-time performance results. . . . . . . . . . . 42
4.2 Real image A compute-time performance results. . . . . . . . . . . . 42
4.3 Real image B compute-time performance results. . . . . . . . . . . . . 43
4.4 Image scaling eects on compute-time. . . . . . . . . . . . . . . . . . 47
4.5 Synthetic image multiple R for the various water properties. . . . . . 49
4.6 Synthetic image R
2
for the various water properties. . . . . . . . . . . 49
4.7 Synthetic image standard error for the various water properties. . . . 49
4.8 Synthetic image average absolute dierence for the various water prop-
erties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.9 Real image A output quality of water property H. . . . . . . . . . . . 54
4.10 Real image B output quality of water property H. . . . . . . . . . . . 54
xii
Chapter 1
Introduction
1.1 Motivation
Remote sensing is utilized across a wide array of applications, including disaster re-
lief planning, environmental assessment, national defense, and climate change impact
analysis. Rapid progress continues to be made in all areas of remote sensing technol-
ogy. The complexity of the algorithms used, the number of available remote sensing
instruments, and the sophistication of those instruments all continue to increase.
With this technological expansion comes increasing data volumes and more complex
data analysis methods which result in the need for increasingly faster processing ca-
pability.
An example of a relatively new remote sensing technology is hyperspectral imag-
ing. Hyperspectral imaging sensors are capable of measuring from 50 to 250 spectral
bands per pixel as compared with 3 bands per pixel by traditional RGB sensors or
fewer than 10 bands for multispectral sensors. This capability allows for increased
data richness and for more sophisticated image analysis algorithms. These advanced
1
algorithms enable a more robust analysis of the images, including the ability to as-
sess a greater number of output parameters and/or improved accuracy. However, the
processing power requirements of these complex algorithms limit their practical use,
especially with large data sets and situations where the results are time critical. It
has become quite evident that more advanced computing techniques are required in
order to address this problem.
Traditional methods for high performance computing centered on using central
processing units (CPUs). These general purpose processors excel at serially structured
computer programs and can be used as building blocks in multi-CPU systems that
range from 2 processor systems all the way up to thousands of processing nodes.
These large systems can be prohibitively expensive in many applications in terms of
processing power per unit power consumed, nancial costs, and often times physical
space requirements.
Recently, a new paradigm in high-performance computing has developed that uses
graphics processing units (GPUs) to perform general purpose computations. This
new paradigm is known as GPGPU (general purpose GPU) computing. Graphics
processing units were specically designed and used for displaying 3-D computer
graphics on the screen, but due to the expansion and advancement of the computer
gaming industry, these processors units have evolved into sophisticated processing
devices that can be provide greater than one teraop in raw performance power for
a cost comprable to that of CPUs at the time of this writing. GPUs, being originally
designed to render many graphic elements concurrently, are inherently parallel in
the design of their processing pipeline. This parallel architecture is what gives these
processors their very high raw computational performance. Thanks to the addition
of programmability to GPU data pipelines and the advent of programming languages
2
such as CUDA and OpenCL that allow for general purpose programming on GPUs,
the computational power of these devices is able to be harnessed in many types of
general purpose applications provided that applications are properly structured to
take advantage of the parallel architecture and very high memory bandwidth of the
GPU.
Since GPUs are specically designed to handle parallel operations in a SIMD
(single instruction, multiple data) fashion where the same operation is preformed on
dierent data elements, their performance can be lacking in parts of applications that
are more serial in nature and/or have many branches in their code. This problem is
solved by the technique of heterogeneous computing, where parts of the application
are selectively executed on certain processing devices based on the inherent structure
of the operations needed to be performed in order to enhance processing eciency and
performance. Heterogeneous computing systems are typically composed of a general
purpose processing device (usually a CPU) and a special-purpose processing unit to
perform a specialized subset of the computing operations. In the case of GPGPU
computing, the special-purpose processor is a GPU. These types of systems open up
new realms of possibilities in accelerating many types of computationally demanding
applications.
The goal of this thesis is to develop a technique to harness the architecture of a
heterogeneous computing platform, specically, multi-core CPU systems with a GPU,
to accelerate the computationally intensive task of executing image spectroscopy al-
gorithms on remotely-sensed images. This technique will be implemented in a gen-
eralized fashion as to be useful to many types of image spectroscopy algorithms and
even to other areas of scientic computing that have a similar problem structure.
3
1.2 Contribution of Thesis
In this work, we develop a parallel mathematical optimization solver that takes ad-
vantage of the architecture of a CPU-GPU heterogeneous computing platform to
accelerate a specic spectroscopy algorithm and approach real-time processing of the
hyperspectral images. One of the goals in developing this solver is to make it general
purpose in design so that it can be used to accelerate similarly structured problems
in other scientic disciplines as well.
1.3 Organization of Thesis
The remainder of the thesis is organized as follows:
Chapter 2 lays out the background information on hyperspectral imaging, image
spectroscopy, mathematical optimization and GPU computing. Related research work
in accelerating imaging spectroscopy is also surveyed.
Chapter 3 presents the methodology and the design of the BFGS-B CL solver,
a general-purpose, parallel, mathematical optimization solver capable of harnessing
the architecture of CPU-GPU systems and able to accelerate applications such as
imaging spectroscopy.
Chapter 4 discusses the results of using the BFGSB-B-CL solver as the tool to
implement an image spectroscopy algorithm on remotely-sensed hyperspectral images
on a heterogeneous computing platform.
Finally, Chapter 5 presents the conclusion of this thesis as well as the future
directions of the BFGS-B CL solver.
4
Chapter 2
Background
This chapter provide background discussion that will help the reader understand the
remainder of this thesis. Section 2.1 describes the basics of hyperspectral imaging
and the image spectroscopy algorithm targeted for acceleration in this thesis. Sec-
tion 2.2 gives a brief introduction to mathematical optimization theory relevant to
performing the image spectroscopy algorithm. Section 2.3 gives a brief history of
GPGPU computing, describes the basic hardware architecture of GPUs, and the fun-
damentals of using the OpenCL language to program general purpose applications on
the GPU. Finally, Section 2.4 describes some previous work done to accelerate image
spectroscopy algorithms.
5
2.1 Hyperspectral Imaging
2.1.1 Overview
Hyperspectral imaging is a remote sensing technology that is able to collect and pro-
cess images from across the electromagnetic spectrum in continuous spectral bands.
This capability allows for a much greater amount information about the remotely
sensed environment to be gathered compared to traditional RGB or multispectral
imaging technology. This increased data richness also allow for sophisticated process-
ing algorithms to be used on hyperspectral images for a more robust analysis of the
environment. The sensors used in hyperspectral imaging are capable of measuring
from 50 to 250 spectral bands per pixel as compared with 3 bands per pixel by tradi-
tional RGB sensors. Figure 2.1 shows a two-dimensional projection of a hyperspectral
cube in which the full front is the scene itself, in x-y coordinates, and the z-axis is
the set of the many individual narrow band images [1].
Figure 2.1: Two-dimensional projection of a hyperspectral cube [1]
Each spectral band covers a portion of the hyperspectral sensors spectral range
and the width of each band is referred to as the sensors spectral resolution. Since
6
data is collected with high-resolution over a continuous spectral range, it is possible
to construct a spectral curve from the hyperspectral data that can then be matched
with spectral signatures of individual materials collected from laboratory or eld
measurements and available in data banks. Figure 2.2 shows a hyperspectral image
and the spectral curves for four of the features of the image [1]. The versatility of
hyperspectral imaging and its advantage of being able to collect data from across
the electromagnetic spectrum explains why it has been deployed in a wide array of
scientic elds including mining, geology, agriculture, ecology, and surveillance.
Figure 2.2: Spectral curves of objects measured from hyperspectral imagery [1]
2.1.2 An Image Spectroscopy Algorithm for Submerged
Marine Environments
Performing remote sensing in submerged marine habitats presents a challenging prob-
lem. This is mainly due to the confounding eects of the water above the environment
7
and the variations in composition of marine habitats. Robust solutions to this prob-
lem are quite complex due to the high absorption of light in the water, which also
limits the range of wavelengths available for analysis. These more complex solutions
can be used across a range of water and habitat conditions because they do not
oversimplify the analysis or require signicant a priori information. However, the
associated computing time required when utilizing these solutions is often extremely
high.
The image spectroscopy algorithm that is the primary focus of this thesis aims to
generate a mapping of submerged marine habitats from imaging spectroscopy data
(hyperspectral imagery). This algorithm is a robust solution that utilizes an inverse
semi-analytical model to derive water properties, water depth, and bottom albedo
from hyperspectral imagery based on the methodology developed by Lee et al. [2,
3]. The water properties that are derived from the hyperspectral data serve as the
input to later stages that generate the nal environment mapping output. Algorithm
testing and validation have demonstrated the validity of this approach, showing strong
agreement between model estimates and measured eld data. Figure 2.3 shows an
example of the nal output of the image spectroscopy algorithm which displays a
mapping of the marine environment of Kaneohe Bay, Hawaii [4].
The stage of the algorithm that derives the water properties is the target for
acceleration in this thesis. The model used to derive the water properties denes R
rs
as a non-linear function of just ve model parameters: R
rs
() = f(P, G, BP, B, H).
The physical meanings of these variables are as follows:
R
rs
: the surface remote sensing reectance, is the ratio of upwelling water leaving
radiance to the downwelling incident irradiance (in sr
1
).
: Wavelength (in nm).
8
Figure 2.3: Marine environment mapping of Kaneohe Bay, Hawaii [5]
.
P: Phytoplankton absorption coecient at 440 nm (in m
1
).
G: Absorption coecient for combined inuences of gelbsto and detritus at 440 nm
(in m
1
).
BP: Absorption coecient for combined inuences of particle-backscattering, view-
angle, and sea state (in m
1
).
B: Bottom albedo at 550 nm.
H: Water depth (in m).
Knowing the value of R
rs
from the hyperspectral data, the water variables can then
be derived by solving the model with a non-linear numerical optimization technique
for each pixel of the image. The objective function that is used in the optimization
is:
9
err =
675
400
(R
rs
R
est
rs
)
2
+
800
720
(R
rs
R
est
rs
)
2
0.5
675
400
(R
rs
)
2
+
800
720
(R
rs
)
2
0.5
(2.1)
The summation is calculated for the 38 wavelength bands from 400-675 nm and
720-800 nm. The value of R
rs
represents the measured remote sensing reectance at
each pixel, and R
est
rs
is the estimated remote sensing reectance, each consisting of
a vector of 38 of the 42 wavelength bands from 400-800 nm. R
rs
is calculated for
each wavelength as follows (where all variables besides P, G, BP, B, and H are input
constants unless otherwise dened):
R
est
rs
= 0.5r
rs
/(1 1.5r
rs
) (2.2)
r
rs
= r
dp
rs
1 e
kH[D
C
u
+1/ cos (w)]
+
1
Be
kH[D
B
u
+1/ cos (w)]
(2.3)
k = a + b
b
(2.4)
D
B
u
= 1.04(1 + 5.4u)
0.5
(2.5)
D
C
u
= 1.03(1 + 2.4u)
0.5
(2.6)
r
dp
rs
= (0.084 + 0.17u)u (2.7)
u = b
b
/(a + b
b
) (2.8)
b
b
= 00038(400/)
4.3
+ BP(400/)
Y
(2.9)
a = a
w
+ [a
0
+ a
1
ln (P)] P + Gexp [0.015( 400)] (2.10)
The objective variables P, G, BP, B, and H are given minimum and maximum
values for use in the numerical optimization step. This turns this problem into a non-
linear bound-constrained optimization problem and it has been shown to be quite
10
a computationally intensive task to solve [4]. Typical values for the minimum and
maximum values of the objective variables are shown in table 2.1.
Parameter Units Lower Limit Upper Limit
P m
1
0.005 0.5
G m
1
0.002 3.5
BP m
1
0.001 0.5
B - 0.01 1.0
H m 0.2 10.0
Table 2.1: Spectroscopy parameter constraints used in optimization
For more detailed information on the theory of the image spectroscopy algorithm
described in this this thesis please see Goodman and Ustin [5].
2.2 Mathematical Optimization
2.2.1 Fundamentals
Mathematical or numerical optimization refers to the minimization (or maximization)
of a given objective function of one or more decision variables, possibly subject to
a series of constraints. These constraints may be equality or inequality constraints.
More formally an optimization problem can be stated as:
11
min
x
N f(x)
subject to: g
i
(x) = 0, (i = 0 . . . (M 1))
h
i
(x) 0, (i = 0 . . . (J 1))
L
i
x
i
U
i
, (i = 0 . . . (N 1))
Optimization problems have many characteristics that can be used to classify
them. Some of these classication criteria for optimization problems are:
Presence of constraints: The optimization problem may be classied as uncon-
strained or constrained, based on weather the problem has constraints or not.
Number of decision variables: The optimization problem may be classied as
one-dimensional or multi-dimensional based on the number of decision variables
in its objective function.
Nature of objective function: Based on weather the problems objective function
is linear or not, the objective problem can be classied as linear or non-linear.
Permissible values of decision variables: Depending on the values permitted for
the design variables, optimization problems can be classied as integer and real-
valued programming problems.
There is also a special type of constrained optimization problem that has con-
straints only on the range of the decision variable x, these types of problems are known
as bound-constrained optimization problems. There is a large number of classes of
optimization problems and there are various techniques that are used to solve each
12
one. In this thesis we are most concerned with real, non-linear, bound-constrained
optimization problems. The next two sections provide a brief overview of some of the
basic theory and techniques used to solve this class of optimization problem, begin-
ning with the simplied unconstrained cases and expanding the methods for dealing
with bound-constraints.
2.2.2 Single-Variable Optimization Techniques
Optimization techniques aim to nd a value x
which minimizes the function f(x).

Depending on the technique and nature of the function being minimized, this value
x
may be a global minimum, which means that there is no other value of x which
produces a value f(x) less than f(x
), or a local minimum, which means there are

no other values x on some interval (i.e., within a localized region of the solution
space) which produce a value f(x) less than f(x
). A value x
is known to be a local
minimum if:
f
(x
) = 0
and
f
(x
) 0
These are known as the rst and second order necessary conditions. Note that
if f
(x
) = 0, x
may be an inection point and more information is needed to

determine if it is a local minimum or not. A pictorial example of local minimums,
global minimums, and inection points is shown in gure 2.4 [6].
There are many techniques to solve one-dimensional unconstrained optimization
problems, some of the most common are bracketing methods, golden section search,
13
Figure 2.4: Local minimums, global minimums, and inection points [6]
and polynomial-t based methods. One of the best known techniques is Newtons
Method and it serves as the foundation for techniques discussed later in this thesis.
Newtons Method
Newtons method is an iterative approach for approximating the roots (or zeros) of
a function. It requires the functions derivative and an initial estimate (x
0
) for the
root. It works by nding the tangent line at the current iterate estimate x
n
and then
using the x-intercept of that line to derive a better approximation of the functions
root to use as the next iterate estimate x
n+1
. Newtons method can be derived by
simple algebra from the denition of the derivative:
f
(x
n
) =
y
x
=
f(xn)0
xnx
n+1
Newtons Method: x
n+1
= x
n
f(xn
f
(xn)
Newtons method is successively repeated until a satisfactory estimate for x
is
14
achieved (until the estimate changes by less than a certain tolerance value for in-
stance). We show two iterations of Newtons method in gure 2.5.
Figure 2.5: Two iterations of Newtons method
Newtons method can be modied to nd an estimate for a local minimizer of a
function instead of a zero by estimating the zero of the functions rst derivative:
Newtons Method for Optimization: x
x+1
= x
n
(xn)
f
(xn)
2.2.3 Multi-Variable Optimization Techniques
Many optimization techniques often require the rst and second derivatives of a func-
tion. When dealing with a function f(x) of more than one variable, the gradient
f is used instead of the rst derivative. This is dened as a vector of the partial
derivatives of f(x), where f(x) is a function of N variables:
f = [
f
x
1
, . . . ,
f
x
N
]
15
Similarly, the Hessian matrix
2
f is used for the second derivative of a function
of N variables and is dened as:
2
f =
2
f
x
1
x
1
2
f
x
1
x
2

2
f
x
1
x
N
2
f
x
2
x
2

2
f
x
2
x
N
.
.
.
symmetric

2
f
x
N
x
N
Steepest Descent Method

Steepest descent (or gradient descent) is a rst-order iterative optimization technique.
After an initial point is chosen, a local minimum is found by taking successive steps
in the negative gradient direction of the function. The steps are taken by nding the
minimizer along the direction of the negative gradient. A summary of the steps of
the steepest descent method are:
1. Choose an initial starting point x
0
.
2. Check for convergence: f(x
n
) = 0. If converged, local minimum is found and
algorithm is done.
3. Calculate search direction d = f(x
n
).
4. Find step size that minimizes the function along the negative gradient direc-
tion by solving the 1-dimensional optimization problem: () = f(x
n
+ d).
5. Set the minimizer of step 4 as the new x
n
and go to step 2.
16
Four iterations of the steepest descent algorithm are shown in gure 2.6. The
steepest descent method can be extended for use with bound-constrained problems
by simply projecting the negative gradient into the feasible region and by setting
bounds on the step size search found in step 4, this modied algorithm is known as
the projected-gradient method. Note that in the bound constrained case the algorithm
would also need to check for convergence when x
n
lies on the boundary of the feasible
region and the negative gradient is perpendicular to the boundary of the feasible
region [7].
Figure 2.6: The steepest descent method
Newtons Method in Multi-Dimensions
Newtons method can be easily extended for use in multi-dimensional optimization
problems by substituting the gradient for the rst derivative and the Hessian matrix
for the second derivative:
17
Newtons Method (single-variable): x
n+1
= x
n
(xn)
f
(xn)
Newtons Method (multi-variable): x
n+1
= x
n
2
f(x
n
)
1
f(x
n
)
Note that in the multi-variable Newton method, the search direction d is equivalent
to
2
f(x
n
)
1
f(x
n
). Newtons method can also be extended to handle bound
constrained problems in the same way that the steepest descent method. Though
this technique is not commonly used in practice due to its requirement that the
objective functions Hessian must be provided, it serves as the basis for commonly used
quasi-Newton type methods such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS)
method [7].
BFGS Method
The Broyden-Fletcher-Goldfarb-Shanno (BFGS) method is a quasi-Newton type nu-
merical optimization method. It is categorized as a quasi-Newton type method be-
cause it does not require the Hessian of the function being optimized but instead
iteratively builds up an approximation to the Hessian by changes in the functions
gradient [8]. A summary of the steps of the BFGS method are:
1. Choose an initial starting point x
0
.
2. Initialize the approximate Hessian B to the identity matrix.
3. Check for convergence: f(x
n
) = 0. If converged, local minimum is found and
algorithm is done.
4. Calculate search direction d = B
1
f(x
n
).
18
5. Find step size that minimizes the function along the negative gradient direc-
tion by solving the 1-dimensional optimization problem: () = f(x
n
+ d).
6. Update the approximate Hessian B by using the BFGS update formula:
S
k
= x
n+1
x
n
y
k
= f
n+1
f
n
B
n+1
= B
n
BnSnS
T
n
Bn
S
T
n
BnSn
+
yny
T
n
y
T
n
sn
7. Set the minimizer of step 5 as the new x
n
and go to step 3.
Note that care must be taken in step 5 when choosing a step size in order to
keep the approximate hessian matrix B positive-denite. This is analogous to having
f
(x
n
) 0 in the single variable case, to be certain that x
will be a local minimum.

The BFGS method can also be extended to be used with bound-constrained problems
in a similar fashion as the projected steepest descent method. This is known as the
projected BFGS method.
2.3 GPGPU
2.3.1 History
Graphics processing units (GPUs) were conceived as special purpose processors de-
signed to rapidly alter and manipulate memory in order to accelerate the building of
2D and 3D computer images for output on a display device. GPUs can today be found
19
in many dierent types of computer systems such as workstations, laptops, tablets,
mobile phones, and video game consoles. Graphics processing involves executing a
large number of independent calculations, generally one for each pixel or block of pix-
els. Todays GPUs are very ecient at manipulating computer graphics due to their
highly parallel architecture which enables them to perform concurrent operations on
many independent graphics elements. This is in contrast to central processing units
(CPUs) which are very much serial in their design relative to that of GPUs [9].
GPUs were traditionally used exclusively to render computer graphics and were
modeled after the concept of a xed graphics pipeline. The pipeline can be thought of
as a conceptual model of the stages that graphics data is sent through as it is processed
by the computer system. The pipelines primary function is to transform coordinates
from 3D space specied by the programmer into 2D pixel space for outputting to
a display. Typically the rst stage performed operations on the individual vertices
of the image, the second stage performed point and line setup, the third stage exe-
cuted operations on the individual pixels such as shading, the fourth stage performed
raster operations such as blending, and nally the last stage wrote the data to the
frame-buer memory for output. Figure 2.7 depicts a traditional non-programmable
graphics pipeline [10].
The programmer is able to access the graphics pipeline from CPU programs
through standardized APIs such as OpenGL (Open Graphics Library). Driven by
the consumer gaming market and enabled by advances in semi-conductor technol-
ogy, graphics processors evolved over the years to have higher clock rates and deeper
pipelines to ooad more processing from the CPU. GPU design also takes advantage
of the parallel nature of graphics processing by increasingly widening the pipelines
to process more vertices and pixels at once in order to achieve higher performance.
20
Figure 2.7: Traditional graphics pipeline [9]
Figure 2.8 depicts the evolution of graphics pipeline to a wider and more parallel
architecture [9].
21
Figure 2.8: Evolution to a wider and more generalized graphics pipeline
GPU pipelines were characterized as xed-function units until the early 2000s,
meaning that no programmability inside the processing pipelines was possible. The
rst GPU with pipelines that were programmable were introduced in the year 2001.
The programmer was then able to write vertex programs (also called shader programs)
that would execute operations on the data while still inside the pipeline. These
small programs were known as shader kernels and were programmed in assembly-like
shader languages in order to run on the GPU. This gave the programmer much more
exibility in terms of what types of processing could be done on the graphics data [9].
Continuing along with the trend of increased programmability, GPU architecture
evolved to unify much of its hardware such as pixel shaders and vertex shaders into
single processing units known as unied shaders. With the unied shader hardware
design, the vast majority of the traditional graphics pipeline model had become only
22
a software abstraction. Modern GPU architecture has expanded this concept even
further to pave the way for general purpose programming on GPUs.
2.3.2 Modern GPUs
Todays GPU architectures are fully programmable and are much more general pur-
pose in terms of their functionality. They contain dozens of symmetrical multiproces-
sors (or SMs) which are composed of many programmable processing elements. The
processing elements in an SM execute instructions in a SIMD fashion, executing the
same instruction over dierent elements. Each SM may contain its own local caches
and managed memory shared by all the processing elements located inside it. All
the SMs share a large global memory located o the main GPU chip. A simplied
architecture of a modern GPUs is shown in gure 2.9. Todays GPU designs are
even more complex, oering more programmable functions, a greater degree of cache
levels, more processing elements, and increased memory sizes [11].
23
Figure 2.9: A modern GPU architecture (including memories) [11]
Modern GPU designs are capable of executing hundreds of oating point opera-
tions every clock cycle due to the extremely high amount of the chip that is allocated
to these small processing elements. GPUs sacrice cache sizes, branch prediction
logic, memory-latency, and single-thread performance compared to CPUs, for very
high parallel processing performance and memory bandwidth. These characteristics
make GPUs quite attractive for applications that can take advantage of this high
degree of parallelism found in todays modern GPU architectures. Theoretical peak
oating-point performance of some of todays latest generations of GPUs versus to-
days CPUs is shown in gure 2.10.
Before the availability of general purpose GPU computing languages, programmers
24
Figure 2.10: Theoretical oating-point performance of todays GPUs versus CPUs
[11]
were relegated to using graphics APIs to leverage the general purpose parallel pro-
cessing potential of GPUs, which was quite awkward and inecient for non-graphics
oriented tasks. With the advent of general purpose GPU computing languages such
as CUDA and OpenCL, programmers are able to easily write compute kernels that
run on the GPU that are capable of performing general purpose computations. These
programming model of these GPGPU compute languages treat each processing ele-
ment of the GPU as a simple CPU core, allowing compute kernels to actively run
up to hundreds of threads of execution concurrently, with groups of these processing
elements executing the same instruction over dierent data. This model is sometimes
referred to as single-instruction multiple thread (SIMT). These GPU compute lan-
guages have opened the door for programmers to be able to utilize the GPGPU as an
25
important new programming platform for high-performance computing applications.
2.3.3 OpenCL
OpenCL (Open Computing Language) is a framework developed by the Khronos
Group for writing programs that can run across heterogeneous platforms composed
of CPUs, GPUs and other processors. It includes a language (based on C99) for
writing kernels on OpenGL devices and a set of APIs that are used to control the
devices. The part of the platform that controls the OpenCL devices is known as the
OpenCL host. Each OpenCL device is composed of groups of processing elements
called compute units, each of these processing elements is able to run one thread of
the compute kernel at the same time. This platform model is shown in 2.11 [12].
Figure 2.11: The platform model of OpenCL [12]
The threads of the OpenCL compute kernels are grouped together into OpenCL
workgroups, where each workgroup runs on a compute unit, allowing its threads to
synchronize and share resources such as memory local to that compute unit. All
threads in the compute kernel share a global memory space with support for a global
26
read-only constant memory as well. This programming model is depicted in g-
ure 2.12 [12]. The SMs, processing elements, and shared memory of the GPU are
mapped to the OpenCL compute units, processing elements, and local memory re-
spectively. The o-chip memory of the GPU functions as the OpenCL global memory.
Figure 2.12: The memory model of OpenCL [12]
OpenCL programs can be written once and run on a wide variety of devices and its
threading model allows programmers to easily take advantage of the parallel process-
ing power of todays GPU. It also enables programs to be written that concurrently
leverage both todays CPUs and GPUs to create a heterogeneous compute platform,
allowing for a great deal of acceleration across a wide variety of computationally
intensive applications [12].
27
2.4 Related Work
The image spectroscopy algorithm used for the primary motivation for this thesis
was demonstrated in Goodman and Ustin [5]. The algorithm uses an inverse semi-
analytical model to derive water properties, water depth, and bottom albedo from
hyperspectral imagery from the methodology developed by Lee et al. [2, 3].
The image spectroscopy algorithm was previously accelerated by Goodman, Schaa,
and Kaeli [4], by using GPUs to execute a parallel coarse-grained search combined
with a coordinate descent algorithm with a two orders magnitude gain in perfor-
mance compared to a similar CPU-based algorithm and performance comparable to
commercial CPU-based solvers such as Generalized Reduced Gradient (GRG) [4].
28
Chapter 3
The Design of the BFGS-B CL
Optimization Solver
3.1 Design Overview
The BFGS-B CL Optimization Solver is a parallel numerical optimization solver that
utilizes heterogeneous computing to accelerate the task of solving arrays of non-linear,
bound-constrained, optimization problems. The solver uses multiple CPUs along
with a GPU to perform the projected-BFGS quasi-newton optimization technique
in parallel over each element of the optimization problem. BFGS-B CL is primarily
written in C++ and uses the POSIX Threads (Pthreads) library to utilize multiple
CPU cores to perform the coarse-grained section of the BFGS-B algorithm over all
the problem elements and utilizes the OpenCL framework to execute function and
gradient evaluations for those elements on the GPU in a SIMD fashion. The solver is
currently built to run on the x86-64 Linux operating system and requires an OpenCL
compatible GPU that supports double-precision oating-point.
29
3.2 User Interface
3.2.1 Problem Input Form
The BFGS-B CL Solver solves arrays of bound-constrained optimization problems of
the form:
min
x
N f
j
(x), (j = 0 . . . (M 1))
L
i
x
i
U
i
, (i = 0 . . . (N 1))
xinit
0
. . . xinit
M1
Where M is the number of problem elements to be solved and N is the dimension
of each optimization problem. The optimization functions f
0
(x) . . . f
M1
(x) must all
be of the same form but are permitted to have diering constant values as inputs. x is
a vector of length N that serves as the variable input to each objective function with
the initial values of xinit
0
. . . xinit
M
. Each vector component of each of the function
inputs x may be optionally bounded with a minimum and/or a maximum value.
The resulting output of the BFGS-B CL Solver is:
xmin
0
. . . xmin
M1
f
0
(xmin
0
) . . . f
M
(xmin
M1
)
Where xmin
0
. . . xmin
M1
is an array of x values for the M problem elements
that the solver has found to best minimize f
0
(x) . . . f
M1
(x) while keeping x within
the supplied bounds and f
0
(xmin
0
) . . . f
M
(xmin
M1
) are the minimum values of the
objective functions found by the solver.
30
3.2.2 Using the BFGS-B CL Solver
The BFGS-B CL Solver provides a C++ interface to the user for performing numer-
ical optimization in parallel on CPUs and GPUs. The most important solver input
arguments are as follows:
M: The number of optimization problems to be solved in parallel (positive integer).
N: The dimension of the input vector x to each optimization problems objective
function (integer).
xinit[N][M]: The initial values of x for each of the M optimization problems (double-
precision oats).
L[N]: The lower bounds for x (double-precision oats).
U[N]: The upper bounds for x (double-precision oats).
fCLSrc: The OpenCL source le that contains the user dened objective functions
f
0
(x) . . . f
M1
(x) and their gradients.
opts: Additional options for the solver including:
P: The number of CPU threads to use for solver execution (positive integer).
T: The maximum number of iterations of the BFGS-B algorithm to run on
each problem element before stopping (positive integer).
H: The maximum number of variable corrections allowed in the limited memory
Hessian matrix (positive integer).
funcArgs: Additional user arguments for the objective function f(x).
31
The output arguments for the solver are as follows:
xmin[N][M]: The resulting output value of the solver giving the x values that min-
imize the objective functions f
0
(x) . . . f
M1
(x) (double-precision oats).
fmin[M]: The resulting output value of the solver that gives the function values
f
0
(xmin
0
) . . . f
M
(xmin
M1
) (double-precision oats).
A summary of the input and output arguments to the BFGS-B CL solver are
shown in gure 3.1.
bf g s b c l (
M, // Number of el ement s of t he probl em
N, // Number of di mensi ons of f unct i on i nput x
x i ni t [ N] [M] , // I n i t i a l val ues f or x
L[ N] , // Lower bounds f or x
U[ N] , // Upper bounds f or x
fCLSrc , // CL source f i l e de f i ni ng f ( x ) and i t s gr adi ent
opts , // Other s o l v e r opt i ons
xmin [ N] [M] , // Output : f i n a l v al ue s of x f ound by s o l v e r
fmi n [M] ) // Output : f i n a l v al ue s of o b j e c t i v e f unc t i ons
Figure 3.1: Calling arguments for the BFGS-B CL Solver
3.3 The L-BFGS-B Algorithm
The BFGS-B CL Solver leverages the limited memory BFGS-B (L-BFGS-B) algo-
rithm developed by Byrd et al. [13]. The L-BFGS-B solver serves as the basic building
block that each CPU thread of the BFGSB-B CL solver uses to perform the numerical
optimizations in parallel across each element of the input problem.
32
The L-BFGS-B algorithm is based on the gradient projection method and uses a
limited memory matrix to approximate the Hessian of the objective function. The
gradient projection method is used to determine the active constraints of the problem
at each iteration. L-BFGS-B uses the line search approach each iteration to deter-
mine the next iteration point. The Hessian is kept positive-denite each iteration by
enforcing the Wolfe conditions each line search:
f(x
n+1
) f(x
n
) +
n
f(x
n
)d
n
|f(x
n+1
)| |f(x
n
)d
n
|
where n is the iteration number,
n
is the step size, d
n
is the search direction and
and are parameters that dene how much decrease in the functions value and
slope is required, they have values 10
4
and 0.9 respectively set by the solver. The
Hessian matrices used in L-BFGS-B are limited memory BFGS matrices that require
only a small amount of memory which is advantageous when the dimensions of the
problem are very large. L-BFGS-B has been shown to perform very well on a variety
of non-linear bound-constrained problems [13].
33
3.4 Outline of Solver Execution
The BFGS-B CL Solver relies on two primary technique to achieve acceleration using
a heterogeneous CPU-GPU computing system. The rst technique uses multiple
CPU to concurrently operate on all elements of the input problem in a coarse-grained
fashion. The second technique is to evaluate the functions and gradients for each
problem element in parallel on the GPU. Figure 3.2 shows a basic execution outline
of the BFGS-B CL Solver.
The solver rst sets up thread structures, initializes OpenCL, and sets up buers
for the parallel function evaluation module. It then initializes a L-BFGS-B solver
driver for each problem element and adds them to the main work list. Every iteration
of the main solver loop iterates through the work list and sends work to a CPU com-
pute thread. These CPU compute threads are responsible for running one step of the
L-BFGS-B algorithm for each solver driver they are given to work on. After one step
is performed on each problem element by the CPU compute threads the objective
functions and their gradients must be evaluated. The solver evaluated the objective
functions and their gradients for all problem elements in parallel on the GPU. This
process of taking one step in the L-BFGS-B algorithm and then evaluating the objec-
tive functions and gradients is repeated until the L-BFGS-B algorithm converges for
every problem element. When a problem element converges or reaches the maximum
number of iterations it is taken o the work list. The main solver loop completes
when the work list is empty meaning that a minimum x value has been found for
each elements objective function.
34
bf g s b c l ( . . . )
{
Parse arguments and setup thread s t r uc t ur e s .
. . .
Setup OpenCL and p a r a l l e l f unc t i on e val uat i on module .
. . .
I n i t i a l i z e LBFGSB s o l ve r dr i ve r s f o r each el ement of the problem .
. . .
Put each s o l ve r dr i ve r on work l i s t W.
. . .
whi l e W != empty
{
f o r each s o l ve r el ement S of l i s t W
{
i f S i s f i ni s he d : remove S from W
e l s e : pl ace S on a CPU compute thread s workqueue
to execut e one s t ep of the LBFGSB al gor i t hm .
}
Wait f o r a l l CPU compute t hr eads to f i n i s h work
Eval uate f unc t i ons and gr adi e nt s f o r a l l el ements i n p a r a l l e l on the GPU
}
}
Figure 3.2: Execution Outline of the BFGS-B CL Solver
3.5 Coarse-Grained Data Parallelism Using Multi-
Threading
The coarse-grained data parallelism in BFGS-B CL is achieved by using the Pthreads
library to distribute work across multiple CPU cores. Each CPU thread is responsi-
ble for executing one step of the L-BFGS-B algorithm for each element placed into
its work queue. Since each problem element is independent from one another, the
execution of one step of the L-BFGS-B algorithm requires no communication or syn-
chronization between CPU work thread. The work items are distributed to the CPU
work threads in a round-robin fashion, assigning each thread work on approximately
35
an equal number of work elements. Figure 3.3 shows how the solver work items are
distributed to P number of CPU compute threads.
Figure 3.3: Distributing the solver work items to the CPU compute threads
Each work element is represented by a C++ class object that contains all of its
relevant state information for the L-BFGS-B algorithm. The step in the L-BFGS-
B algorithm may require updating the approximate Hessian, calculating the search
direction, conducting a line search, or a combination of these. When the L-BFGS-B
algorithm requires an evaluation of the problem elements objective function and/or
its gradient, the step is nished for that iteration of the main solver loop. The
objective function and gradient evaluations are executed later in the main solver loop
on the GPU by the parallel evaluation module. When the L-BFGS-B algorithm has
reached its specied maximum solver iterations or has converged to a solution it is
marked as completed and will be taken o the work list for the next iteration of the
solver loop.
36
3.6 Parallel Function and Gradient Evaluations on
the GPU
Every iteration of the solver loop requires evaluation of all active work items objective
functions and gradients. This task is carried out in parallel in OpenCL on the GPU
by the solvers parallel evaluation module. This module is tasked with managing the
relevant OpenCL buers, reading and writing data back and forth between the CPU
and GPU, handling extra user data needed for the objective function, and managing
execution of the OpenCL evaluation kernel on the GPU.
The evaluation kernel is written in OpenCL and is provided by the user as input to
the BFGS-B CL Solver. The kernel is responsible for calculating the objective func-
tion, as well as its gradient for all elements of the input problem. This is performed
in parallel on the GPU by creating an OpenCL thread for each problem element.
Running this kernel on the GPU allows the function and gradients for each problem
element to be executed in parallel by the many processing elements of the GPU.
Since every problems objective function has the same form, these calculation are
performed very eciently by the GPU in a SIMD fashion. The inputs arguments to
the evaluation kernel is the objective functions x value and the active mask of the
problem elements. The user must set the kernels output arguments to be executed
every iteration on the GPU, f (the objective functions value at point x) and g (the
objective functions gradient at point x). The gradient may either be directly calcu-
lated or estimated by a numerical dierentiation formula. Note that an active mask
is provided as an input argument to the kernel to increase eciency of execution, as
problem elements start to converge or reach their maximum allowed evaluations the
37
GPU threads that are associated with them are set as inactive to save doing unneces-
sary computing. Figure 3.4 shows an example of an OpenCL evaluation kernel that
calculates the function and gradient values for each of the problem elements. The
code of this kernel is run once for each problem element and the threads global ID
serves as the identier of the problem element. In this case each objective function
is one-dimensional in nature. There are M numbers of threads running in parallel
(one for each problem element) and each thread is responsible for evaluating f
i
(x)
and f
i
(x) where i is the corresponding threads global ID.
k e r ne l void
e va l ke r ne l (
gl obal bool amask , // ac t i v e mask
gl obal double x , // o b j e c t i v e f unct i on i nput val ue ( i nput )
gl obal double f , // o b j e c t i v e f unct i on val ue ( out put )
gl obal double g , // gr adi ent of o b j e c t i v e f unct i on ( out put )
. . . // any addi t i onal user arguments
)
{
i nt gi d = g e t g l o ba l i d ( 0 ) ; // get t hread i d e n t i f i e r
i f ( amask [ gi d ] == 0) return ; // i f el ement i s i nact i ve , do no work
f [ gi d ] = obj Funct ( x [ gi d ] ) ; // c a l c ul a t e o b j e c t i v e f unct i on val ue
g [ gi d ] = gradi entObj Funct ( x [ gi d ] ) ; // c a l c ul a t e gr adi ent val ue
}
Figure 3.4: An example OpenCL evaluation kernel
38
Chapter 4
Accelerating the Spectroscopy
Algorithm with the BFGS-B CL
Optimization Solver
4.1 Stating the Spectroscopy Algorithm as an
Array of Optimization Problems
The inversion model stage of the image spectroscopy algorithm attempts to estimate
the water properties (P, G, BP, B, and H) at each pixel that best matches that pixels
measured reectance value (R
rs
). To do this, the algorithm must attempt to minimize
the error between the measured reectance value (R
rs
) and the estimated reectance
value R
est
rs
at each pixel according to the equation:
39
err =
675
400
(R
rs
R
est
rs
)
2
+
800
720
(R
rs
R
est
rs
)
2
0.5
675
400
(R
rs
)
2
+
800
720
(R
rs
)
2
0.5
where R
est
rs
is dened in terms of P, G, BP, B, and H by equations 2.2-2.10. The
water properties P, G, BP, B, and H have minimum and maximum bounds assigned
to them. This function can be viewed as the objective function to be optimized for
each pixel. Since each pixels water properties can be estimated independently from
one another, the image spectroscopy algorithm can be stated as an array of numerical
optimization problems where the goal is to nd:
min(err
j
(P, G, BP, B, H)), (j = 0 . . . (M 1))
L
P
P U
P
L
G
G U
G
L
BP
BP U
BP
L
B
B U
B
L
H
H U
H
Where M is the number of pixels in the hyperspectral image. This array of op-
timization problems can be passed into and solved in parallel by the BFGS-B CL
Solver.
40
4.2 Image Tests
Three hyperspectral images were used to test the performance of the BFGS-B CL
Solver, a 2MB synthetic image of 12K pixels, a 17MB real-world image of 100K
pixels (referred to as Real Image A), and a 23MB real-world image of 128K pixels
(referred to as Real Image B). All images have 42 spectral bands. The compute time
results as well as the quality of the output of the BFGS-B CL tests are compared
against the Generalized Reduced Gradient (GRG) serial optimization solver. The
GRG solver is provided by the IDL (interactive data language) programming language
which serves as the basis for the ENVI (ENvironment for Visualizing Images) remote
sensing application. The IDL/GRG solver was used by Goodman and Ustin [5] for
their work in developing the spectroscopy algorithm. Serial BFGS-B is also used for
compute time comparison. All water properties for the synthetic image are known
and only the water depth (H) is known for the real-world images. For the BFGS-B
tests, two variations of runs were made, one (referred to as normal-quality) where
the Hessian approximation parameter is set to 6 and another (referred to as higher
quality) where the Hessian approximation parameter is set to 20.
All BFGS-B and BFGS-B CL tests were run on a 6-core Intel Xeon 2.9GHz system
running Linux with 4GB of RAM and an Nvidia Tesla M2070 GPU. The IDL solver
was run on an Intel Xeon Quad 3.00 GHz system running Windows with 8GB of
RAM.
4.2.1 Compute-Time Performance Results
Testing for computation performance was done on the synthetic as well as two real-
world images. Tables 4.1-4.3 show the computation time taken per pixel on each
41
of the images and compares BFGS-B CL against serial BFGS-B as well as the IDL
optimization solver. BFGS-B CL was tested with two quality variations and 1, 2,
4, and 8 CPU compute threads. Figures@4.1-4.6 show the relative performance of
BFGS-B CL to the IDL/GRG solver and serial BFGS-B.
Synthetic Image Compute Time-Performance
Test Time / Pixel (ms) Time / Pixel (ms) (HQ)
IDL/GRG 118.90 -
Serial BFGS-B 7.19 9.31
BFGS-B CL + 1 CPU Thread 7.19 9.31
BFGS-B CL + 2 CPU Threads 1.17 4.05
Table 4.1: Synthetic image compute-time performance results.
Real Image A Compute-Time Performance
IDL/GRG 15.0 -
Table 4.2: Real image A compute-time performance results.
42
Real Image B Compute-Time Performance
IDL/GRG 18.6 -
Table 4.3: Real image B compute-time performance results.
Figure 4.1: Synthetic image compute-time performance results using IDL as the base.
43
Figure 4.2: Synthetic image compute-time performance results using serial BFGS-B
as base.
Figure 4.3: Real image A compute-time performance results using IDL as the base.
44
Figure 4.4: Real image A compute-time performance results using serial BFGS-B as
the base.
Figure 4.5: Real image B compute-time performance results using IDL as the base.
45
Figure 4.6: Real image B compute-time performance results using serial BFGS-B as
the base.
46
The results show that BFGS-B CL is capable of obtaining a relatively high speedup
in comparison to both the IDL solver and serial BFGS-B. When using 8 CPU compute
threads, BFGS-B CL achieves up to a 270x speedup on the synthetic image and a
70x speedup on the real-world images compared to the IDL solver. Compared to
the serial variant, BFGS-B CL is capable of obtaining a 8x speedup when using only
the GPU and one CPU compute thread and achieves over two orders of magnitude
speedup when using at least 4 CPU compute threads.
4.2.2 Image Scaling Eects on Compute-Time Performance
To explore the impact of image size on performance, the synthetic image was tiled
to create larger size images of approximately 10MB, 21MB, 104MB, and 209MBs.
These tiled images were run with both 2 and 8 CPU compute threads with normal
quality settings. The compute-time performance per pixel is shown in table 4.4 and
presented in gure 4.7.
Image Scaling Eects on Compute-Time
Image Size BFGS-B CL + 2 CPU Threads BFGS-B CL + 8 CPU Threads
(MB) Time / Pixel (ms) Time / Pixel (ms)
2.1 0.83 0.44
10.5 0.78 0.42
20.9 0.75 0.41
104.6 0.66 0.40
209.2 0.63 0.39
Table 4.4: Image scaling eects on compute-time.
The image scaling tests show that image size has little impact on the throughput
of the BFGS-B CL Solver. In fact, increasing the image size results in a slight increase
in throughput of the solver.
47
Figure 4.7: Image scaling eects on compute-time.
4.2.3 Output Quality Performance
In addition to compute-time performance, the quality of the output water properties
of the BFGS-B CL was also tested and compared with the IDL solver. For the
synthetic image all ve water properties are known, while on the real-world images
only the water depth (H) is known. The output water properties were measured for
the correlation coecient (multiple R), coecient of determination (R
2
), standard
error, and average absolute dierence. These measurements are shown in tables 4.5-
4.10 and gures 4.8-4.21.
48
Synthetic Image Multiple R
P G BP B H
IDL/GRG 0.997 1.00 0.941 0.625 0.817
BFGS-B CL 0.994 0.999 0.994 0.549 0.823
BFGS-B CL (HQ) 0.995 0.999 0.994 0.592 0.852
Table 4.5: Synthetic image multiple R for the various water properties.
Synthetic Image R
2
P G BP B H
IDL/GRG 0.995 1.00 0.886 0.390 0.667
BFGS-B CL 0.987 0.998 0.988 0.302 0.677
BFGS-B CL (HQ) 0.990 0.989 0.989 0.350 0.726
Table 4.6: Synthetic image R
2
for the various water properties.
Synthetic Image Standard Error
P G BP B H
IDL/GRG 0.0131 0.0142 0.0631 0.2813 1.5770
BFGS-B CL 0.0207 0.0630 0.0201 0.3054 1.7096
BFGS-B CL (HQ) 0.0186 0.0548 0.0195 0.2967 1.5550
Table 4.7: Synthetic image standard error for the various water properties.
49
Synthetic Image Average Absolute Dierence
P G BP B H
IDL/GRG 0.00137 0.00099 0.01071 0.16674 0.90355
BFGS-B CL 0.00306 0.00579 0.00187 0.19577 0.88673
BFGS-B CL (HQ) 0.00256 0.00424 0.00161 0.17635 0.75302
Table 4.8: Synthetic image average absolute dierence for the various water proper-
ties.
Figure 4.8: Synthetic image multiple R for the various water properties.
50
Figure 4.9: Synthetic image R
2
for the various water properties.
Figure 4.10: Synthetic image standard error for P, G, and BP.
51
Figure 4.11: Synthetic image standard error for B and H.
Figure 4.12: Synthetic image average absolute dierence for P, G, and BP.
52
Figure 4.13: Synthetic image average absolute dierence for B and H.
53
Real Image A Output Quality of Water Property H
Multiple R R
2
Std Error Avg Absolute Dierence
IDL/GRG 0.854 0.729 2.0860 2.8955
BFGS-B CL 0.677 0.459 4.9364 4.1306
BFGS-B CL (HQ) 0.750 0.563 3.2408 2.9241
Table 4.9: Real image A output quality of water property H.
Real Image B Output Quality of Water Property H
Multiple R R
2
Std Error Avg Absolute Dierence
IDL/GRG 0.881 0.776 3.0092 3.6135
BFGS-B CL 0.931 0.866 3.0888 1.8780
BFGS-B CL (HQ) 0.877 0.769 3.3528 3.1363
Table 4.10: Real image B output quality of water property H.
54
Figure 4.14: Real image A multiple R of water property H.
Figure 4.15: Real image A R
2
of water property H.
55
Figure 4.16: Real image A standard error of water property H.
Figure 4.17: Real image A average absolute dierence of water property H.
56
Figure 4.18: Real image B multiple R of water property H.
Figure 4.19: Real image B R
2
of water property H.
57
Figure 4.20: Real image B standard error of water property H.
Figure 4.21: Real image B average absolute dierence of water property H.
58
For the synthetic image test, the BFGSB-CL multiple R and R
2
are quite compa-
rable to the IDL solver. The standard error is very similar between the two solvers
for water properties P, B, and H. The IDL solver results in a higher standard error for
property BP, and the BFGS-B CL Solver obtains a better standard error for property
BP. The two solvers have comparable average absolute values for all water properties
except property BP where BFGS-B CL outperforms the IDL solver.
For both real-world images the solvers also achieve comparable multiple R and
R
2
values. For real-world image A, the IDL solver achieves a lower standard error
and absolute dierence for property H compared to the normal quality BFGS-B CL,
however this dierence in quality is diminished when compared to the higher quality
test.
For real-world image B the two solvers achieve similar standard error values. The
average absolute dierence of property H for real-world image B is lower for both
BFGS-B CL runs. These results show that the quality of the BFGS-B CL Solver is
similar to the IDL solver and therefore of acceptable quality for the purposes of the
image spectroscopy algorithm.
59
Chapter 5
Conclusion and Future Work
5.1 Conclusion
The processing requirements for todays scientic algorithms continue to increase
with their complexity and the data sizes that they operate on. New computing plat-
forms such as GPUs and heterogeneous computing systems have emerged to meet
the processing demands of these new applications. In this thesis we have shown a
method of utilizing todays CPU-GPU heterogeneous computing platform to accel-
erate a spectroscopy algorithm for submerged marine environments. We have pre-
sented the implementation of a parallel non-linear optimization solver BFGS-B CL,
that takes advantage of the massive SIMD throughput of GPUs, as well as coarse-
grained multi-threading of todays multi-core CPUs to accelerate the execution of the
spectroscopy algorithm. Testing results show a magnitude increase in computational
throughput compared to the serial CPU version of the solver and up to two magnitude
increase when compared to other commercial non-linear solvers such as IDL/GRG.
The algorithm output quality when using the BFGS-B CL solver was also shown to
60
be comparable to the IDL/GRG solver when examining the resulting output for two
real-world hyperspectral images, as well as a synthetic image. GPU and heteroge-
neous computing is a powerful processing technology and shows substantial potential
for accelerating future remote sensing algorithms, as well as other computationally
demanding scientic applications.
5.2 Future Work
5.2.1 Coarse-grained Search for Initial Point
There is potential for the GPU to be further utilized for achieving even higher perfor-
mance with the BFGS-B Solver. Since the GPU is capable of high throughput when
performing SIMD operations, it can be employed eectively when evaluating a large
number of points of a given mathematical function. This property would allow the
GPU to very quickly search for a better initial point for the optimization algorithm by
performing a coarse-grained search of the function which has potential for decreasing
the amount of iterations required to perform the optimization. This may allow the
BFGS-B CL solver to yield additional acceleration and/or improved accuracy.
5.2.2 Pipelining the Execution of the Solver
Another potential modication to explore for further performance gains of the BFGS-
B CL Solver is to pipeline its execution. The solver can be thought of as having two
stages for each iteration, the rst stage is the execution of a single step of the BFGS-B
algorithm in a coarse-grained manner by the CPU compute threads and the second
stage is the evaluation of the functions and gradients in a SIMD fashion by the GPU.
Currently the solver is designed to run all problem elements through the rst stage
61
of the solver before executing the evaluation kernel on the GPU. The solver could be
modied to pipeline these two stages by waiting for a batch of problem elements to
be completed by the rst solver stage instead of waiting for all problem elements to
complete it.
With 8 CPU compute threads the BFGS-B solver has been measured to spend
about 65% of its execution time in its rst stage and about 35% of its execution time
in its second stage. It is estimated with the NVIDIA Occupancy Calculator that it
would take about 6K threads (or problem elements) to occupy the GPU suciently
enough to hide memory latencies and to cover the overhead of GPU kernel launches.
With this knowledge the solver could be modied to start execution of the second
stage of the solver on 6k sized batches of problem elements as they become available
for execution by nishing the rst stage. The overlapping of these solver stages would
lead to less time being spent on each iteration of the problem and therefore would
increase the solvers throughput. Ideally the solver would be able to gain about 35%
more performance compared to its current design. It can be estimated that on smaller
sized problems such as the synthetic image (which has 12K problem elements) that
each iteration would see an increase of performance by about 17.5%. On larger sized
images such as real image A (which has 100K problem elements) it can be estimated
that each iteration would see about a 33% increase in performance.
62
Bibliography
[1] N. Short, NASA Remote Sensing Tutorial. Website, 2010. http://rst.gsfc.
nasa.gov.
[2] Z. Lee, K. L. Carder, C. D. Mobley, R. G. Steward, and J. S. Patch, Hyperspec-
tral remote sensing for shallow waters. 1. a semianalytical model, Appl. Opt.,
vol. 37, pp. 63296338, Sep 1998.
[3] Z. Lee, K. L. Carder, C. D. Mobley, R. G. Steward, and J. S. Patch, Hyper-
spectral remote sensing for shallow waters. 2. deriving bottom depths and water
properties by optimization, Appl. Opt., vol. 38, pp. 38313843, Jun 1999.
[4] J. Goodman, D. Kaeli, and D. Schaa, Accelerating an imaging spectroscopy
algorithm for submerged marine environments using graphics processing units,
Journal of Selected Topics in Earth Observation and Remote Sensing, vol. 4,
pp. 669 676, Sep 2011.
[5] J. A. Goodman and S. L. Ustin, Classication of benthic composition in a coral
reef environment using spectral unmixing, Journal of Applied Remote Sensing,
vol. 1, no. 1, p. 011501, 2007.
[6] J. Nocedal, Numerical Optimization, ch. 8. Springer, 2nd ed., 2006.
63
[7] S. Rao, Engineering Optimization: Theory and Practice, ch. 1. Wiley, 4th ed.,
2009.
[8] C. Kelley, Iterative Methods for Optimization, ch. 5. Society for Industrial Math-
ematics, 1999.
[9] I. Buck, The Evolution of GPUs for General Purpose Computing. GTC 2010.
[10] R. Fernando, GPU Gems: Programming Techniques, Tips and Tricks for Real-
Time Graphics, ch. 28. Pearson Higher Education, 2004.
[11] Nvidia, Nvidia CUDA Programming Guide Version 3.0, 2010.
[12] Khronos OpenCL Working Group, The OpenCL Specication Version 1.0, 2008.
[13] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu, A limited memory algorithm
for bound constrained optimization, SIAM Journal on Scientic Computing,
vol. 16, no. 5, p. 1190, 1995.
64

Msellitto Thesis

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Msellitto Thesis

Uploaded by

Copyright:

Available Formats

NORTHEASTERN UNIVERSITY

Graduate School of Engineering

which minimizes the function f(x).

), or a local minimum, which means there are

may be an inection point and more information is needed to

Steepest Descent Method

will be a local minimum.

You might also like