Implementing SAR Image Processing Using Backprojection on the Cell Broadband Engine


William Lundgren1 (wlundgren@gedae.com), Uttam Majumder2 (uttam.majumder@wpafb.af.mil),
Mark Backues3 (mark.backues@wpafb.af.mil), Kerry Barnes1 (kbarnes@gedae.com),
James Steed1 (jsteed@gedae.com)
1: Gedae, Inc., 1247 N. Church St., Suite 5, Moorestown, NJ 08057, phone: +1 (856) 231-4458
2: Air Force Research Laboratory, Wright-Patterson Air Force Base, OH 45433, phone: +1 (937) 674-9043
3: SET Corporation, 2940 Presidential Dr., Suite 270, Fairborn, OH 45324, phone: +1 (937) 426-9401

Abstract - Many compute-intense algorithms once deemed impractical are now being found realizable on modern multicore processors due to their massive compute power. An example of such an algorithm is the backprojection method for synthetic aperture radar (SAR) image processing. Backprojection provides many benefits compared to the other method of SAR processing, the polar format algorithm (PFA). This paper studies the implementation of convolution backprojection (CBP) on the Cell Broadband Engine™ (Cell/B.E.) architecture. The paper will study how the implementation of CBP must be changed to deal with the programming issues related to using the Cell/B.E. processor. The programming issues that are studied include partitioning the work across a heterogeneous set of processors (SPEs and PPEs), managing the heterogeneous memories (small local storage and large off-chip XDR memory), overlapping processing with communication and reads/writes from the XDR, and optimizing the SPE and PPE code to highly utilize all ALU paths. The Gedae programming language and multithreading compiler are used to address these programming issues. Performance of the CBP implementation on the Cell/B.E. processor using Gedae is presented.

Index Terms - Cell Broadband Engine, SAR, Backprojection

I. INTRODUCTION

Many compute-intense algorithms once deemed impractical are now being found realizable on modern multicore processors due to their massive compute power. An example of such an algorithm is the backprojection method for synthetic aperture radar (SAR) image processing. Backprojection provides many benefits compared to the other method of SAR processing, the polar format algorithm (PFA). This paper studies the implementation of convolution backprojection (CBP) on the Cell Broadband Engine™ (Cell/B.E.) architecture. The Cell/B.E. processor combines one Power Processing Element (PPE) with 8 identical Synergistic Processing Elements (SPEs) [2]. The Cell/B.E. processor is sufficiently powerful to tackle the CBP algorithm, but many programming issues must be addressed to create an implementation of CBP on the architecture. Because the PPE provides relatively modest performance, the key to using the Cell/B.E. processor efficiently is to take best advantage of the SPEs. An SPE can process vector arithmetic very efficiently because each processor has 4 SIMD ALU paths processing at 3.2 GHz; however, each SPE has only 256 KB of dedicated local storage [2] for holding both instructions and data, which requires manual management by the programmer.

The paper will study how the implementation of CBP must be changed to deal with the programming issues related to using the Cell/B.E. processor. The programming issues that are studied include partitioning the work across a heterogeneous set of processors (SPEs and PPEs), managing the heterogeneous memories (small local storage and large off-chip XDR memory), overlapping processing with communication and reads/writes from the XDR, and optimizing the SPE and PPE code to highly utilize all ALU paths. The Gedae programming language and multithreading compiler are used to address these programming issues. Performance of the CBP implementation on the Cell/B.E. processor using Gedae is presented.

II. BACKPROJECTION

The two most common algorithms for SAR image processing are the polar format algorithm (PFA) and backprojection. Data collected by SAR systems is called Fourier space or K-space data and resides on an annular grid [1]. K-space data is related to image data through the Fourier transform. Therefore, to form an image, a 2D FFT is applied to the K-space data. However, before applying the 2D FFT, an FFT-based imaging method (such as PFA) needs to inscribe a rectangular grid within the annular region. This is called polar-to-rectangular (Cartesian) coordinate conversion of K-space data and is accomplished by applying a 2D interpolation to the K-space data. Two major drawbacks of the PFA are: (1) all radar data must be available from the sensed scene to apply polar-to-rectangular conversion, and (2) interpolation inaccuracy leads to degraded image quality. Backprojection image formation uses K-space data directly and does not require polar-to-rectangular conversion. In addition, all radar data need not be available from the sensed scene to start the image formation process. Further, the quality of images formed by backprojection is superior to that of the PFA.

The backprojection algorithm was conceived in the field of computer-aided tomography (CAT). Later, it was found that SAR image reconstruction is mathematically similar to CAT, and researchers developed backprojection for SAR [1][3]. Two variations of backprojection are found in the literature: convolution backprojection (CBP) and backprojection filtering

1-4244-1539-X/08/$25.00 ©2008 IEEE

Authorized licensed use limited to: Carnegie Mellon Libraries. Downloaded on April 9, 2009 at 16:13 from IEEE Xplore. Restrictions apply.
(BpjF) [5]. Some papers refer to CBP as filtered backprojection (FBP) or factored backprojection (FBP) [4]. In CBP, convolution with a kernel function is performed on each projection. The result is then backprojected onto a grid. In the BpjF method, the projections are first backprojected onto an image grid. Then the backprojected image is filtered. This filtering process in BpjF introduces image reconstruction inaccuracy [5]. Hence CBP is the preferred method for backprojection.

III. CELL BROADBAND ENGINE ARCHITECTURE

The current implementation of the Cell Broadband Engine (Cell/B.E.) processor combines one Power Processing Element (PPE) with 8 identical Synergistic Processing Elements (SPEs). The PPE is a dual-threaded PowerPC core with an instruction set that has been extended to include SIMD (Single-Instruction, Multiple-Data) instructions known as VMX. Each SPE contains a high-speed SIMD processor with its own 256 kB local store and DMA engine. The nine cores and the on-chip memory controller and I/O controller are interconnected by the high-speed Element Interconnect Bus (EIB), as shown in Figure 1. The EIB provides a measured peak bandwidth of over 200 GB/s at 3.2 GHz [2].

Figure 1 - The current Cell/B.E. processor combines a PPE core with 8 SPE cores, all interconnected via a high-speed bus.

Because the PPE provides relatively modest performance, the key to using the Cell/B.E. processor efficiently is to take best advantage of the SPEs. While an SPE can process vector arithmetic very efficiently, each SPE has only 256 kB of dedicated local storage, which holds both instructions and data. Thus when processing a large data set, the developer must either distribute the large data set across the processing cores or store the data in off-chip system memory. For instance, an entire 1k-by-1k complex matrix cannot reside in the SPE local storage, even if the program size is zero. Therefore large data sets will initially reside in system memory, and the developer will strip-mine the data, bringing it chunk-by-chunk into the SPE local storage for processing.

IV. ALGORITHM

Gedae is used to implement the CBP algorithm on the Cell/B.E. processor. Figure 2 shows the geometry of the problem. It assumes that the RADAR is flying in a circle surrounding the center of the area to be imaged. The goal of this process is to partition the data such that it fits in the SPEs. We start by showing the symbolic expressions for the backprojection algorithm and then reconstruct them to show the expressions that can be deployed in the SPEs. The expression syntax used makes it easy to express the data reorganization required to fit the data into the SPE fast local store.

Figure 2 - The RADAR is flown in a circle with the RADAR always pointing in the approximate direction of the center of the region of interest.

The range variables (see appendix) that we need to define the algorithm are
p  = 0..Npulses-1
f0 = 0..NrngGates-1
f  = 0..4*NrngGates-1
xi = 0..X-1 /* the x image indices */
yi = 0..Y-1 /* the y image indices */
where we will form an image with X*Y pixels, Npulses is the number of pulses used to form an image, and NrngGates is the number of samples collected for each pulse. We also have 3 parameters
DF = delta frequency between range bins
C4DF = C/4/DF, a convenient constant
(x[xi], y[yi]) = location of the (xi,yi) pixel
where C is the speed of light. The pixel locations in the image are calculated based on parameters set by the user.

The inputs to the algorithm are the RADAR returns and several parameters that define the state of the system. The RADAR data is given as:
in[p][f0]
where p and f0 are defined above. The location of the RADAR at the time it is pulsed is:
(xobs[p], yobs[p], zobs[p])
So there are Npulses observation points, one for each pulse in the data set. We also have the range of the first range bin from the RADAR. That is, the RADAR is pulsed and the analog-to-digital conversion starts a little later, so the first range bin is some distance away from the observation point. It is given by:

rstart[p] = range of the first range bin in the RADAR return
The pulse compression is computed by windowing and zero-filling the input data, then computing the FFT:
in1[p][f] = 0.0
in1[p][f0] = in[p][f0]*W[f0]
in2[p][] = ifft(in1[p][])
where W[] is the windowing vector. The range is computed as follows:
rng[p][yi][xi] = sqrt((xobs[p]-x[xi])^2 +
    (yobs[p]-y[yi])^2 + zobs[p]^2)
where xobs[], yobs[] and zobs[] are the location of the RADAR when it is pulsed for pulse p, and x and y are the locations of the image pixels in the x,y coordinate system. The phase delta of the signal due to the round trip time from the RADAR to the image pixel and back is given by:
dphase[p][yi][xi] = exp(i*4*PI/C*f0[p]*
    rng[p][yi][xi])
The range of the return from the pixel at (xi,yi), in units of bin indices, is given by:
rbin[p][yi][xi] = Nbins * (rng[p][yi][xi] -
    rstart[p]) / 2 / C4DF
where rstart[] is the distance of the first range bin from the RADAR. The range bin index is given by
irbin[p][yi][xi] = floor(rbin[p][yi][xi])
The weights used for linear interpolation are
w1[p][yi][xi] = rbin[p][yi][xi] - irbin[p][yi][xi]
w0[p][yi][xi] = 1 - w1[p][yi][xi]
The output image is the accumulation of phase-adjusted returns from the appropriate bin. If the phases match, then the value will be reinforced. The computation is:
out[yi][xi] +=
    (in2[p][irbin[p][yi][xi]]*w0[p][yi][xi] +
    in2[p][irbin[p][yi][xi]+1]*w1[p][yi][xi]) *
    dphase[p][yi][xi]

V. IMPLEMENTATION

To allow the implementation to run on the SPEs of the Cell/B.E. processor, we must consider how much data can fit into memory at one time. In order to complete the computation we need to have 7 matrices in memory: the range matrix, the delta phase matrix, floating point and integer copies of the range bin matrix, 2 weighting matrices, and an output matrix. These matrices require (6*4*X*Y+8*X*Y)*Npulses bytes of data, since 6 of the matrices are floating point or integer and the other is complex. We also need the input data in memory. Since the input data is complex, it is 8*4*NrngGates*Npulses bytes, so the total memory required is (32*X*Y+32*NrngGates)*Npulses bytes of data. Even if X, Y, Npulses and NrngGates are only 1024, the requirement is for over 32 Gbytes of memory. Since all the data cannot fit into memory, we must do one pulse and one tile of the image at a time. As it turns out, this decomposition of the image means that we have to bring in only a subvector of in2[p] for each pulse/tile combination, further limiting the memory requirements.

Since there is not enough local store memory on the SPEs to hold all the data we need, we will have to process a single tile of the image at a time. Figure 3 shows the geometry of only one tile from the area of interest.

Figure 3 - We will process a tile at a time. The tile shown is representative of the tiles that completely cover the area. It is apparent that only a small portion of the pulse return is used to compute the backprojection.

We define the structure of the algorithm by defining how the output image will be decomposed for processing. We will leave it to the reader to substitute the same decomposition into the range-related arrays required to complete the implementation.

We define several new ranges to define the decomposition of the processing. We have first divided the y index to distribute the work on the SPEs and then divided further for stream processing of the tiles. Since there are multiple SPEs for processing, we define a range variable for the number of processors:
proc = 0..Nprocs-1
where Nprocs in our case will be 8. We will decompose the output image (matrix) into tiles. The new y indices are given by:
yi2 = 0..Y2-1
where Nprocs*Y2 = Y and
yi = proc*Y2 + yi2
and equivalently
[yi] = [proc,yi2]
In this first step of the decomposition we only decompose for the processors. Notice that throughout this decomposition we specify the decomposition of the left-hand side of the equation and then propagate that decomposition to the right-hand side. We leave the indices of the weighting arrays out to simplify the expressions. So the output:
out[yi][xi] += (in2[p][irbin[p][yi][xi]]*w0 +
    in2[p][irbin[p][yi][xi]+1]*w1) *
    dphase[p][yi][xi]
can be rewritten using the comma notation as:
out[proc,yi2][xi] +=
    (in2[p][irbin[p][proc,yi2][xi]]*w0 +
    in2[p][irbin[p][proc,yi2][xi]+1]*w1) *
    dphase[p][proc,yi2][xi]
and then reorganized as a family of submatrices (henceforth tiles), one for each processor:
[proc]out[yi2][xi] +=
    (in2[p][[proc]irbin[p][yi2][xi]]*w0 +
    in2[p][[proc]irbin[p][yi2][xi]+1]*w1) *
    [proc]dphase[p][yi2][xi]
The pre-index indicates that there is a family of sub-images, one for each SPE. We define the additional range variables:

xi0 = 0..X0-1
xi1 = 0..X1-1
yi0 = 0..Y0-1
yi1 = 0..Y1-1
where X0*X1 = X and Y0*Y1 = Y2. We will use these ranges to reorganize the tiles on each processor into smaller tiles that can be streamed from system memory into local memory. That is, converting to comma notation:
[proc]out[yi0,yi1][xi0,xi1] +=
    (in2[p][[proc]irbin[p][yi0,yi1][xi0,xi1]]*w0 +
    in2[p][[proc]irbin[p][yi0,yi1][xi0,xi1]+1]*w1) *
    [proc]dphase[p][yi0,yi1][xi0,xi1]
If we fix proc, p, yi0 and xi0, then we are looking at a single tile of the matrix, and the token size is Yi1*Xi1 elements. We move the tiles into memory one at a time to reduce memory requirements. To represent this tiling we change the slow-moving (leftmost) indices into stream indices, specifying that individual tiles will be processed in sequence:
[proc]out(yi0,xi0)[yi1][xi1] +=
    (in2(p)[[proc]irbin(p,yi0,xi0)[yi1][xi1]]*w0 +
    in2(p)[[proc]irbin(p,yi0,xi0)[yi1][xi1]+1]*w1) *
    [proc]dphase(p,yi0,xi0)[yi1][xi1]
Since the pulse is the slow-moving index and we are accumulating over that index, we have to move output tiles into and out of memory with each iteration. This tile movement will stress the memory bandwidth and limit throughput. To address this issue, we reorder the indices so that the fast-moving index is the pulse, and so we process all pulses for a tile of the image before proceeding to the next tile. This reordering describes a transpose, but rather than actually transposing the data in system memory, a costly operation, we will just read/write the data from/to system memory in non-contiguous order. Since we can fit the output tile into memory, we sum into that tile while processing the pulses and then move it into system memory at the end of the processing. This approach minimizes the movement of data to and from system memory. In the following equation we reorganize the tiles into the appropriate stream order to process all pulses of one tile in sequence:
[proc]out(yi0,xi0)[yi1][xi1] +=
    (in2(p)[[proc]irbin(yi0,xi0,p)[yi1][xi1]]*w0 +
    in2(p)[[proc]irbin(yi0,xi0,p)[yi1][xi1]+1]*w1) *
    [proc]dphase(yi0,xi0,p)[yi1][xi1]
This ordering minimizes the movement of data from local store into system memory. We combine the move with the insert into the output image.

We also note that we are now using only a tile of the weight array at a time and a tile of irbin. To do this processing most efficiently we apply the same reordering to those data items and arrive at the following:
rbin(yi0,xi0,p)[yi1][xi1] = Nbins *
    (rng(yi0,xi0,p)[yi1][xi1] - rstart(p)) / 2 /
    C4DF
and
w1(yi0,xi0,p)[yi1][xi1] =
    rbin(yi0,xi0,p)[yi1][xi1] -
    irbin(yi0,xi0,p)[yi1][xi1]
w0(yi0,xi0,p)[yi1][xi1] = 1 -
    w1(yi0,xi0,p)[yi1][xi1]
These quantities are derived from dphase, and the x and y grid locations. We extend the streaming to those quantities as well:
dphase(yi0,xi0,p)[yi1][xi1] = exp(i*4*PI/C*f0[p]*
    rng(yi0,xi0,p)[yi1][xi1])
and
rng(yi0,xi0,p)[yi1][xi1] = sqrt((xobs(p)-
    x(xi0)[xi1])^2 + (yobs(p)-y(yi0)[yi1])^2 +
    zobs(p)^2)
We are now streaming all of the data into the local stores to minimize memory use.

The result of this reorganization and streaming is the requirement that we store 7 matrices in memory, each with Yi1*Xi1 data elements, as well as the input matrix with 4*NrngGates inputs. Since the input matrix and one of the 7 other matrices are complex data, we will have a total of 6*4*Yi1*Xi1+8*Yi1*Xi1+8*4*NrngGates, or 32*Yi1*Xi1+32*NrngGates, bytes of memory. If we limit the memory used for the 7 range-related matrices to about 100 Kbytes, then Yi1*Xi1 = 3200, so Xi1 and Yi1 are limited to about 56. This restriction allows us to easily fit a tile size of 32 by 32, a size sufficiently large to exploit the SPE's SIMD ALU resource.

VI. RESULTS

For purposes of our timing, we have set X, Y, Npulses and NrngGates to the same value and used values of 256, 512, 1024 and 2048. The processing time for each experiment is shown in Table 1.

A trace table of the processing required for 1 pulse and 1 tile is shown in Figure 4. In the trace table, each row lists a different function in the algorithm, and black bars indicate the duration of the execution of the kernel. Notice that 3 functions' execution times are significantly longer than the other functions' times: the mz_exp function and two instances of the vz_vi_scatter function.

Figure 4 - Execution trace for processing of a single tile. The functions that take the longest are the complex exponent and extracting the range bin from the input data.
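The per-tile accumulation that the trace in Figure 4 measures can be summarized in a short Python sketch. This is an illustrative rendering of the symbolic expressions only; the function and argument names are ours, not part of the Gedae implementation:

```python
def backproject_tile(in2, irbin, w0, w1, dphase):
    """Accumulate one Y1-by-X1 output tile over all pulses.

    in2    : per-pulse list of pulse-compressed complex samples
    irbin  : per-pulse Y1-by-X1 integer range-bin indices for this tile
    w0, w1 : per-pulse Y1-by-X1 linear-interpolation weights
    dphase : per-pulse Y1-by-X1 complex phase-correction factors
    """
    y1, x1 = len(irbin[0]), len(irbin[0][0])
    out = [[0j] * x1 for _ in range(y1)]
    for p, samples in enumerate(in2):   # pulse is the fast-moving index:
        for yi in range(y1):            # all pulses are applied to a tile
            for xi in range(x1):        # before moving to the next tile
                b = irbin[p][yi][xi]
                # linearly interpolate between adjacent range bins, then
                # apply the phase correction so in-phase returns reinforce
                out[yi][xi] += (samples[b] * w0[p][yi][xi] +
                                samples[b + 1] * w1[p][yi][xi]) * dphase[p][yi][xi]
    return out
```

The sum into the local tile mirrors the design choice described above: the output tile stays resident in local store while every pulse is processed, and is written back to system memory only once.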

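The local-store budget arithmetic above can be checked with a small sketch (ours, purely illustrative). The 32 bytes per tile pixel come from the six 4-byte float/integer matrices plus the one 8-byte complex matrix:

```python
import math

def max_tile_side(budget_bytes=100 * 1024):
    """Largest square tile (Yi1 == Xi1) whose 7 per-tile matrices fit
    within the given local-store budget."""
    bytes_per_pixel = 6 * 4 + 8               # 32 bytes per tile pixel
    elems = budget_bytes // bytes_per_pixel   # Yi1*Xi1 <= 3200 for 100 KB
    return math.isqrt(elems)                  # -> 56 for the 100 KB budget
```

For the 100 Kbyte budget this gives a side of 56, matching the bound quoted above, and it comfortably admits the 32-by-32 tile used in the implementation.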
The mz_exp function implements the complex exponential function, a transcendental function that is by nature slow. Optimized vector routines are used in the functions to take advantage of the SPE's 4-wide SIMD ALU. The mz_exp function uses such a routine, and its performance is much better than vanilla C code. Other implementations and approximations of the exponential were considered as optimizations. We considered replacing the mz_exp function with a table lookup, but because the time of the vectorized exponential is approximately the same as the other table lookups, there is little value to be gained by such an experiment.

The vz_vi_scatter function (2 copies) is slow because the function is basically a table lookup, a function not particularly suited for speedup on an SPE due to the processor's reliance on memory alignment in both input and output memory. No vector function is readily available for the SPE to perform this operation. Some speedup was achieved by using SPE intrinsics (the C interface to the SPE's assembly language) in the function. The hand-optimized function does the pointer arithmetic in the table lookup using the SIMD ALU, and uses other intrinsics for extracting and loading values into the quad-word vector data structure.

An additional optimization was required to achieve high performance in the v_outerAdd function. This function calculates the equation
out[i][j] = a[i] + b[j]
This function can be implemented as R calls to the vector routine that performs a scalar-by-vector add; however, the small row size of the matrix tiles leads to a high overhead in filling and emptying the pipeline for each row. To avoid this inefficiency, the function was hand-coded using SPE intrinsics to perform all 1024 additions without emptying the pipeline.

There was initial concern that the square root function would need to be approximated in order to achieve high performance. However, the sqrt vector routine used in the implementation is efficient, and there is little value in replacing that function with an approximation.

Table 1 shows the timing results for each number of pulses. If we assume a pulse interval of 0.1 mSec (a PRF of 10 kHz), the number of processors required is (Time/Pulse) / (0.1), as shown in the fourth row of Table 1. A practical implementation is likely to process only subregions after the data has been decimated. Because practical implementations may vary, we cannot make a general statement here. For the implementation presented, we can say that the number of processors required is approximately:
Np = 2e-6 * X * Y / T
where X and Y are the numbers of pixels in the region to be imaged, T is the pulse repetition interval in mSec (the reciprocal of the PRF), and Np is the number of Cell/B.E.s required for a real-time implementation.

Npulses               256     512     1024      2048
Time (mSec)           35.7    285.1   2,368.8   18,259.4
Time / Pulse (mSec)   0.139   0.557   2.313     8.916
# of Cell/B.E. Rqrd   2       6       24        90

Table 1 - The timing results running on a QS20 blade server from IBM. The third row is the time to process a single RADAR pulse. If we assume a PRF of 10 kHz, then the fourth row is the number of Cell/B.E. processors required.

VII. CONCLUSIONS

Gedae makes the complex implementation required to implement these algorithms on the Cell/B.E. architecture practical. The advent of high-performance multicore hardware makes implementation of backprojection possible, and the use of Gedae to program these new architectures makes the implementation more easily achievable.

REFERENCES

[1] Desai, Mita D. and W. Kenneth Jenkins. "Convolution Backprojection Image Reconstruction for Spotlight Mode Synthetic Aperture Radar." IEEE Transactions on Image Processing, 1(4), October 1992.
[2] IBM, Sony Computer Entertainment, Toshiba. Cell Broadband Engine Programming Handbook, Version 1.1, April 2007. <http://www.ibm.com>.
[3] Jain, Anil K. Fundamentals of Digital Image Processing. Prentice Hall, NJ, 1989.
[4] Ulander, Lars M.H., Hans Hellsten, and Gunnar Stenstroem. "Synthetic-Aperture Radar Processing Using Fast Factorized Back-Projection." IEEE Transactions on Aerospace and Electronic Systems, 39(3), July 2003.
[5] Zeng, Gengsheng L. and Grant T. Gullberg. "Can the Backprojection Filtering Algorithm be as Accurate as the Filtered Backprojection Algorithm?" Proceedings of Nuclear Science Symposium and Medical Imaging Conference, volume 3. IEEE, October 1994.

APPENDIX: SYMBOLIC EXPRESSIONS

The code segments shown in this document are created using Gedae symbolic expressions. In these expressions, ranges are used to express iteration over dimensions. For example, a vector add
out[i] = a[i]+b[i]
is specified in C code as
for (i=0; i<N; i++) out[i] = a[i]+b[i];
and can be specified by the pair of symbolic expressions
i=0..N-1
out[i] = a[i]+b[i]
Data is assumed to be part of an infinite data stream, which allows the expressions to iterate over time as well as space. In addition to array dimensions specified by square brackets ("[" and "]"), data can be grouped temporally using parentheses ("(" and ")"). For example, a vector-to-matrix conversion can be specified as the temporal grouping of R vectors in a stream
i=0..R-1
j=0..C-1
out[i][j] = in(i)[j]
Note that the time dimension is the slower-moving dimension, and therefore is listed to the left of the array index.

Data reorganization can be specified not just between indices but also within a single index by using the comma notation. For example, to transpose the elements in a 1-d vector of length N:
i=0..R-1
j=0..N/R-1
out[j,i] = in[i,j]
Collapsing operators can be used to perform operations where the index being iterated over disappears in the output array, such as a vector sum of elements

i=0..N-1
out += in[i]
The collapsing operators are similar to C for-loops where the initial value is set to a default value, such as zero for "+=" and one for "*=". The for-loop is formed from the collection of all indices specified on the right-hand side of the equation that are absent from the left-hand side.
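The appendix's example expressions can be mirrored with ordinary loops. The following Python sketch (ours, purely illustrative) shows one plausible reading of the vector add, the comma-notation transpose, and the collapsing sum:

```python
# out[i] = a[i]+b[i] with i = 0..N-1: the range variable becomes a loop index
def vector_add(a, b):
    return [a[i] + b[i] for i in range(len(a))]

# out[j,i] = in[i,j] with i = 0..R-1, j = 0..N/R-1: comma notation reorders
# the elements within a single index of a length-N vector (a transpose)
def comma_transpose(vec, r):
    c = len(vec) // r
    # output position [j,i] = j*r+i receives input position [i,j] = i*c+j
    return [vec[i * c + j] for j in range(c) for i in range(r)]

# out += in[i] with i = 0..N-1: a collapsing operator; the index iterated
# over on the right-hand side is absent from the left-hand side
def collapse_sum(vec):
    out = 0            # "+=" starts from its default value, zero
    for x in vec:
        out += x
    return out
```

Under this reading, `comma_transpose([0, 1, 2, 3, 4, 5], 2)` treats the vector as a 2-by-3 row-major array and yields its 3-by-2 transpose, flattened.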
