
2011 Symposium on Application Accelerators in High-Performance Computing

GPU performance comparison for accelerated radar data processing


C. T. Fallen, B. V. C. Bellamy, G. B. Newby
Arctic Region Supercomputing Center, University of Alaska Fairbanks, Fairbanks, AK, USA
e-mail: {ctfallen, bvbellamy, gbnewby}@alaska.edu
Abstract: Radar is a data-intensive measurement technique often requiring significant processing to make full use of the received signal. However, computing capacity is limited at remote or mobile radar installations, thereby limiting the radar data products available for real-time decisions. We used graphics processing units (GPUs) to accelerate processing of high resolution phase-coded radar data from the Modular UHF Ionosphere Radar (MUIR) at the High-frequency Active Auroral Research Program (HAARP) facility in Gakona, Alaska. Previously, this data could not be processed on-site in sufficient time to be useful for decisions made during active experiment campaigns, nor could the data be uploaded for offsite processing to high-performance computing (HPC) resources at the Arctic Region Supercomputing Center (ARSC) in Fairbanks. In this paper, we present a radar data-processing performance comparison of a workstation equipped with dual NVIDIA GeForce GTX 480 GPU accelerator cards and a node from ARSC's PACMAN cluster equipped with dual NVIDIA Tesla M2050 cards. Both platforms meet performance requirements, are relatively inexpensive, and could operate effectively at remote observatories such as HAARP.

Keywords: Radar, HAARP, GPU, OpenCL, OpenMP.

B. J. Watkins
Geophysical Institute, University of Alaska Fairbanks, Fairbanks, AK, USA
e-mail: bjwatkins@alaska.edu

I. INTRODUCTION

The term radar refers to equipment and techniques used for the radio detection and ranging of objects, originally developed in the 1920s by E. V. Appleton and M. A. F. Barnette of Cambridge University for the purpose of detecting ionosphere layer heights. Major efforts to refine radar technology in the late 1930s were motivated by wartime needs to locate and track distant metallic objects such as aircraft. Modern uses of radar incorporate a variety of sophisticated techniques to transmit radio waves (electromagnetic radiation with frequency typically between 30 kHz and 300 GHz) and then receive the radiation scattered by one or more targets. Radar is now used to remotely sense the shapes of objects, the speed and direction of atmospheric winds, the type and magnitude of atmospheric precipitation, ionosphere plasma density and temperature, and the distribution of aboveground and underground structures.

The Modular UHF Ionosphere Radar (MUIR) [1] is a phased-array radar at the DoD High-frequency Active Auroral Research Program (HAARP) facility in Alaska (Figure 1) that can use pulse modulation techniques for high range resolution operation. MUIR is a diagnostic radar that detects strong ionosphere plasma waves driven by the HAARP Ionosphere Research Instrument (IRI), a powerful phased-array HF transmitter capable of producing ionosphere temperature and density irregularities. In the standard long-pulse mode, MUIR transmits a 996 μsec ultra high frequency (UHF) pulse, corresponding to a range resolution of ~150 km, and records the received signal with a sample rate of 250 kHz. At this sample rate, the range resolution can nominally be improved to ~600 m by using a coded long pulse [2], where the phase of the pulse is modulated with a specified pattern of 4 μsec bits. Pulse modulation (or pulse compression) techniques were patented in the early 1950s as a method to increase radar sensitivity by lengthening the transmitted pulse without sacrificing range resolution [3]. However, the increased range resolution is attained at the cost of increased signal processing. Pulse compression techniques generally involve modulating either the frequency or the phase of the transmitted pulses, then correlating the modulation pattern with the received pulse before proceeding with spectral analysis.

A. Motivation
The correlation times of the target plasma waves are typically greater than the baud length but less than the time between successive radar pulses (the inter-pulse period, or IPP), so processing the coded long pulse data essentially requires calculating the spectra of every lag self-product of each sampled pulse.

Figure 1. (a) The IRI at HAARP generates ionosphere plasma waves detected with (b) the MUIR radar. (c) The TAU workstation at ARSC and (d) the PACMAN cluster are equipped with GPU accelerators.

For each ~50 sec of MUIR coded long-pulse (CLP) operation (10 msec IPP and 1100 complex-valued 600 m range-bin samples per pulse), the processing effort to generate typical range-time-intensity (RTI) images is roughly equivalent to performing 11 million 1100-point complex-array multiplies and Fast Fourier Transforms (FFTs) [4]. This can be a prohibitive task at a remote facility with limited computational resources and data transfer rates. Consequently, MUIR high range-resolution measurements are not typically available during HAARP experiment campaigns, slowing the pace of research and increasing the cost of results.

Space weather can affect ionosphere experiment conditions, sometimes resulting in remarkable observations at HAARP. Figure 2 [5, 6] shows an example of MUIR CLP measurements made during exceptionally bright artificial auroral airglow emissions. The radar-detectable plasma turbulence clearly corresponds to regions of intense airglow emissions. However, the high-resolution radar results from this experiment were not available until several weeks following the conclusion of the campaign due to the processing load. Furthermore, sustained power consumption at the HAARP facility can reach ~10 MW, so there are significant costs associated with repeating experiments that each last approximately one hour.

Our goal is to reduce the processing time of a one-hour MUIR CLP experiment to one hour or less using a commodity workstation. We tested two GPU-equipped systems, a workstation and a compute node from a cluster. The MUIR data shown in Figure 2 was used as a test collection for evaluating processing performance. Both systems are pictured in Figure 1 and described in section II.C.
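For reference, one way to arrive at the figure quoted above, using only the numbers already given (a rough estimate, not an exact operation count):

\[
\frac{50\ \mathrm{s}}{10\ \mathrm{ms\ IPP}} = 5000\ \text{pulses}, \qquad
5000\ \text{pulses} \times 1100\ \text{range bins} \approx 5.5\times10^{6}\ \text{array multiplies} + 5.5\times10^{6}\ \text{FFTs} \approx 11\times10^{6}\ \text{1100-point operations per file.}
\]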

B. Objective
The CLP processing task consists of four key algorithmic steps applied to an 1100 × 5000 complex-valued matrix containing ~50 sec of baseband radar data covering the range 0 to ~660 km. Informally, the steps are (1) array multiply, (2) discrete Fourier transform, (3) array power, and (4) peak-find. Standard single-threaded execution on the system CPU requires several hundred core-seconds to process one ~50 s (~40 MB) data file. Multithreaded computation with OpenMP or the Matlab Parallel Toolbox on an 8-core system results in sublinear speedup. Real-time data processing, used here to describe processing that can be accomplished in less time than the length of the radar data set, has been achieved with multi-node computation using the Matlab Distributed Computing Server and 32 processor cores on ARSC's Midnight [7] and PACMAN systems.

Our objective is, first, to process MUIR CLP data faster than real time using a single compute node or workstation equipped with one or more GPU accelerators. Second, we wish to compare real-world (single-precision) processing performance of two GPU-accelerated platforms. Results from single-threaded and multithreaded computation on standard CPU platforms provide a performance baseline for the GPU experiments.

C. Prior Work
Despite the significant data-processing requirements of ionosphere radar systems, geospace researchers generally have not yet attempted to utilize GPU systems to speed access to the measurements. Researchers often rely on quick-look low resolution data products during experiments and obtain high-resolution products following traditional CPU-based post-processing. However, GPUs are beginning to show encouraging results in synthetic aperture radar (SAR) applications [8, 9], and SAR data processing is similar to ionosphere radar processing to the extent that both applications make heavy use of the FFT.

Fundamental to many signal processing applications, the FFT was one of the earliest general-purpose calculations adapted to GPU devices [10]. Subsequent benchmarks demonstrated that GPUs do not necessarily execute FFTs faster than CPUs; specifically, 2D FFTs show significant acceleration on GPUs (due to relatively poor CPU cache utilization) but repeated 1D FFTs do not [11]. Recent work closely related to the experiment reported here suggests that GPU speedup of repeated 1D FFT calculations may still be significant when the size and number of transforms are close to those required for radar data processing [5]. Still, even without significant FFT speedup, overall GPU acceleration of real-world FFT-intensive processing applications ultimately depends strongly on the speedup of the remaining auxiliary algorithms.

Figure 2. High-resolution MUIR coded long-pulse (CLP) measurements of field-aligned HF-enhanced ion-line intensity (copper) superimposed over optical measurements of field-aligned artificial airglow (white) during an ionosphere modification experiment at HAARP. Without processing, the radar signal would be spread over a range of 150 km.


II. METHOD

A. Algorithm
MUIR CLP data is recorded as a sequence of HDF5-format (http://www.hdfgroup.org/HDF5/) files, each containing a single-precision 1100 × 5000 complex-valued matrix X of baseband data along with meta-information. Each column of X represents the signal received from one radar pulse and the rows correspond to 600 m range bins. The algorithm described below is expressed in terms of matrix-vector operations for clarity and brevity, although the actual software implementation of each operation may be ad hoc.

The algorithm outlined in [2] is applied to each radar range bin and may be loosely described as: sample the adjacent range bins, "multiply by the code," calculate the power spectral density, and then analyze the spectra or accumulate with previous pulses. The sampled pulses first need to be decoded before the spectral density is calculated. This step isolates the signal scattered from the range bin of interest from nearby ranges (recall that the MUIR CLP is ~150 km wide).

Let C be an 1100 × 5000 modulation matrix whose columns correspond to the columns of X (radar pulses) and that encodes the modulation pattern of each transmitted pulse. The entries of C are ±1 for transmitter on, with a sign change representing a 180° change in phase; transmitter off is represented by 0. MUIR CLP mode uses a 996 μsec pulse width and 4 μsec baud length, so each column of C starts with 249 non-zero entries followed by 851 zeros. While the transmitter modulation can change from pulse to pulse, this is not typical MUIR operation, so the columns of C are in this case identical. Then C = c 1_{1×5000}, where c is an 1100-element column vector specifying the phase modulation and 1_{1×5000} is a row vector of ones. The nonzero entries of c are (reading left to right, top to bottom):
++-+--+-+--++-++--+++--+++-+----+---++
-++++-++--++---+-----++-+-+++----+++---+-++-+--++---++--++---+++-++++++---+
-+-------++-+++-----++--+++---+--+--++
---+----++--++-++++--+++-+++++---+-+++
+---++++-++++-----+++++--+---+--++++-+++++-+-++---+---+--+

To decode the signal scattered from range bin k (using zero-index notation), the rows of C are shifted by pushing k rows of zeros onto C, and then the element-by-element product of the baseband and shifted code matrix is taken. In block matrix notation, a square permutation matrix P_k can be written

\[
P_k \;=\; \begin{bmatrix} 0_{k\times(1100-k)} & 0_{k\times k} \\ I_{1100-k} & 0_{(1100-k)\times k} \end{bmatrix}
\]

where I is an identity matrix with the specified number of rows and 0 represents corresponding blocks of zeros. The complete "decode" operation, written in terms of matrix-array products that must be performed before calculating the discrete Fourier transform along each column, is

\[
\left( P_k\, \mathbf{c}\, \mathbf{1}_{1\times 5000} \right) * X \;\rightarrow\; Y_k \qquad (1)
\]

Note that * represents the element-by-element (array) product and the remaining multiplication operations are matrix (linear algebra) products. Recall that the array product W = U * V is defined for two identically-sized matrices U and V as a matrix with elements W_ij = U_ij V_ij, so each element costs a single multiplication operation for real-valued matrices. The matrix product M = UV, in contrast, is defined by taking inner products of rows from the left-side matrix with columns of the right-side matrix, M_ij = Σ_l U_il V_lj, and generally costs several multiplication and addition operations per element of the product matrix. The rows of matrix Y_k represent the decoded signal scattered from range bin k and the columns correspond to individual radar pulses.

Next, the power spectral density at range bin k is calculated for each radar pulse. That is, we calculate the FFT of each column of the 1100 × 5000 complex matrix Y_k. The FFT calculation on Y_k is essentially the matrix product
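As a concrete illustration of the distinction between the two products (a small numerical example of ours, not taken from the measurement data):

\[
\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} * \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}
= \begin{bmatrix} 5 & 12 \\ 21 & 32 \end{bmatrix},
\qquad
\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}
= \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}.
\]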
" " %
N

1 1 2 i N FFT 1 e Yk = # # 2 i ( N 1) 1 e

1 e
2 i ( N 1) N

#
2 i ( N 1)( N 1)

" e

FT F Y Y k k N (2)

FFT


where the matrix multiplication is optimized to take advantage of symmetric properties of the transformation matrix F_N. The number of points used in the transform, N = 1100, is in practice reduced to the next smallest power of two, 1024 = 2^10. (Recall that the matrix Y_k is already padded with zeros.) At this point in the calculation, the matrix Y_k can be deleted from the GPU memory and Ŷ_k can be copied to system memory if the entire complex spectrum from range bin k of each received pulse is to be stored for later analysis. This is a large amount of data to store for multiple hour-long experiments, so the complex range-resolved time-dependent spectra are usually processed (reduced) further and then discarded.

To produce a plot of radar intensity vs. range and time (RTI plot) similar to Figure 3, the goal of the data processing task described here, the power at each frequency bin of each pulse (column of Ŷ_k) is first calculated

\[
\operatorname{Re}\hat{Y}_k * \operatorname{Re}\hat{Y}_k \;+\; \operatorname{Im}\hat{Y}_k * \operatorname{Im}\hat{Y}_k \;\rightarrow\; T_k \qquad (3)
\]

Then the frequency bin with maximum power is extracted and normalized by the mean power of each power spectrum,

\[
\frac{\max\left( T_k \right)}{\operatorname{mean}\left( T_k \right)} \;\rightarrow\; \mathbf{z}_k \qquad (4)
\]

The array multiply and matrix addition operations in (3) are performed on the GPU. Intermediate power spectra are stored on the GPU in the temporary matrix T_k. Array (element-by-element) division is applied to the intermediate row vectors containing the maximum and mean power from each pulse (from each column of T_k). The max, mean, and array division functions are performed on the GPU, and the final resulting row vector z_k, containing the intensity of range k for each received radar pulse, is copied to system memory. For each radar data file, GPU-accelerated calculations in (1) through (4) are completed in sequence for each range bin k, and the intensities stored in the 1 × 5000 row vectors z_k are stacked to form a positive real matrix:

\[
\begin{bmatrix} \mathbf{z}_{1} \\ \vdots \\ \mathbf{z}_{1100} \end{bmatrix} \;\rightarrow\; Z \in \mathbb{R}^{1100\times 5000}
\]
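For concreteness, the listing below sketches a single-threaded CPU implementation of steps (1) through (4) for one range bin. It is a minimal illustration under our own assumptions (column-major pulse storage, an n_range-point transform rather than the 1024-point padded transform, and the mean-power normalization included); the function and variable names are hypothetical and this is not the authors' implementation.

```cpp
#include <algorithm>
#include <complex>
#include <vector>
#include <fftw3.h>

// Process one range bin k of a single MUIR CLP file (steps (1)-(4) of section II.A).
// X holds the 1100 x 5000 baseband matrix with each pulse (column) stored contiguously.
std::vector<float> process_range_bin(const std::vector<std::complex<float>>& X,
                                     const std::vector<float>& code,  // +1/-1/0, length n_range
                                     int k, int n_range, int n_pulse)
{
    std::vector<std::complex<float>> col(n_range), spec(n_range);
    std::vector<float> z(n_pulse);

    // Single-precision FFTW plan, created once and reused for every pulse.
    fftwf_plan plan = fftwf_plan_dft_1d(
        n_range,
        reinterpret_cast<fftwf_complex*>(col.data()),
        reinterpret_cast<fftwf_complex*>(spec.data()),
        FFTW_FORWARD, FFTW_ESTIMATE);

    for (int p = 0; p < n_pulse; ++p) {
        const std::complex<float>* x = &X[static_cast<size_t>(p) * n_range];

        // Step (1): decode -- multiply by the phase code shifted down by k range bins.
        for (int r = 0; r < n_range; ++r)
            col[r] = (r >= k) ? x[r] * code[r - k] : std::complex<float>(0.0f, 0.0f);

        // Step (2): repeated 1-D FFT of the decoded pulse.
        fftwf_execute(plan);

        // Steps (3) and (4): power spectrum, then peak power normalized by mean power.
        float peak = 0.0f, sum = 0.0f;
        for (int r = 0; r < n_range; ++r) {
            float pw = std::norm(spec[r]);   // Re^2 + Im^2
            peak = std::max(peak, pw);
            sum += pw;
        }
        z[p] = peak / (sum / static_cast<float>(n_range));
    }
    fftwf_destroy_plan(plan);
    return z;
}
```

Calling such a routine for every range bin k = 0, ..., 1099 of a file would yield the rows of Z.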

Figure 3. RTI image of MUIR coded long-pulse data recorded during a typical HAARP experiment, (a) before and (b) after processing. The HF pump switches from continuous to pulsed operation at 02:48 UTC.

The entries in matrix Z are assigned to a (logarithmic) color-scale and plotted to form a column ~50 sec wide. The pre- and post-processing data from one file is illustrated in Figure 3. RTI images are often concatenated in time from many matrices to form RTI plots of several minutes to hours. For example, Figure 2 is made from the 71-file test collection and is the result of approximately 78,000 iterations of the algorithm described above, each applied to a 5000-pulse data file.

Typical filtering operations that potentially benefit from GPU acceleration have not been considered here. For instance, to eliminate ground clutter effects resulting from radar side-lobes or target features with a correlation time larger than the radar IPP, a two-pulse moving-window difference filter can be applied to the baseband matrix X before proceeding with calculation (1). Similarly, applying a moving-window time-integration filter to the power spectra T_k or the peak power values in Z may help improve signal-to-noise ratio (SNR). Figure 3 shows two range-time-intensity (RTI) images of the same radar data. The first image is of the raw (non-decoded) radar power. The second image is the result of the decoding process described above, in conjunction with difference and integration filters to remove ground clutter and enhance SNR.

B. Parallelization and GPU Acceleration
We implemented the MUIR data processing application in C++. The algorithm sequence (1) through (4) is easily parallelizable since the calculation for each RTI image pixel uses data only from the column containing that pixel and may be performed independently of the calculations for the remaining pixels. The CPU application was parallelized via OpenMP v3.0 (as distributed with GCC v4.5.2) by distributing the algorithm iterations corresponding to each range-bin row in the output RTI image. The GPU application was parallelized via OpenCL by expressing the calculations for a range-bin row as a sequence of OpenCL kernels executed for each pixel (element of X).

OpenMP is a standard API available in C++ for writing shared-memory parallel applications and is well suited for multicore architectures. We used the OpenMP parallel for loop directive to distribute the algorithm sequence in section II.A to multiple threads; each thread calculated one 5000-element row of the RTI output image for each MUIR data file. Figure 4 illustrates how the sequence of main algorithm steps is thread-parallelized on the CPU. Ad hoc code performed the phase-code multiplication and spectral peak-finding; the fftw v3.2.2 library [12] was used to calculate the FFT. An OpenMP critical construct is required to initialize fftw in each thread upon the completion of a row, negatively affecting multithreaded CPU performance. Timing of each algorithm step was measured and recorded with the timer class from the Boost C++ Libraries v1.45.0 (http://www.boost.org/).
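A minimal sketch of this row-parallel structure is given below; it reflects our reading of the description above rather than the authors' source code, and process_range_bin stands for a hypothetical per-range-bin routine such as the one sketched in section II.A.

```cpp
#include <algorithm>
#include <complex>
#include <vector>
#include <omp.h>

// Hypothetical helper from the earlier sketch (not the authors' code).
std::vector<float> process_range_bin(const std::vector<std::complex<float>>& X,
                                     const std::vector<float>& code,
                                     int k, int n_range, int n_pulse);

// Build one RTI image (n_range x n_pulse, row-major) from one data file.
void process_file(const std::vector<std::complex<float>>& X,
                  const std::vector<float>& code,
                  int n_range, int n_pulse,
                  std::vector<float>& rti)
{
    // One loop iteration per range bin (row of the RTI image); iterations are independent.
    // As noted in the text, FFTW plan creation inside each iteration is not thread-safe
    // and would need to be wrapped in an "omp critical" section in a real implementation.
    #pragma omp parallel for schedule(dynamic)
    for (int k = 0; k < n_range; ++k) {
        std::vector<float> z = process_range_bin(X, code, k, n_range, n_pulse);
        std::copy(z.begin(), z.end(),
                  rti.begin() + static_cast<std::ptrdiff_t>(k) * n_pulse);
    }
}
```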



We accelerated the section II.A algorithm by adapting it to the OpenCL programming model. OpenCL is an API for writing data-parallel applications that utilize a heterogeneous collection of computational resources, including both CPUs and GPUs, installed in a workstation or compute-node platform. Code portability requirements led us to choose the OpenCL API over the CUDA API, which is restricted to GPU devices from NVIDIA. However, comparable CUDA-based code may outperform our OpenCL program on the NVIDIA hardware used in this experiment.

Each of the four algorithm steps was implemented as an OpenCL kernel. The FFT kernel used in this experiment is based on the Apple OpenCL_FFT v1.4 demonstration code (http://developer.apple.com/library/mac/#samplecode/OpenCL_FFT/). The data file is first read into the host memory, then the phase-code vector and baseband data matrix are copied to the GPU device static and dynamic shared memories, respectively. A sequence of OpenCL kernels performs (1) through (4) on each GPU accelerator, distributing the calculations for one range bin of many radar pulses to the CUDA cores. An ad hoc kernel calculates the elements of Y_k in (1) from the baseband data matrix X and phase-code vector c. Next, the FFT kernel calculates the repeated 1-D FFT, replacing the range-k decoded baseband matrix Y_k in the GPU shared memory with a matrix Ŷ_k of Fourier coefficients. Ad hoc kernels then calculate the repeated power spectra T_k in (3) and the maximum power from each spectrum, z_k in (4). Note that the normalization of peak spectral powers max(T_k) in (4) by the mean amplitudes was not performed during the benchmark experiments reported here, even though the normalization step is often used in the production MUIR data analysis program. That is, the value of mean(T_k) was not calculated in this experiment, but it will be in future versions of the GPU-accelerated processing software.

The sequence of kernels is then mapped to the individual elements of the baseband matrix and executed. Step (4) is a reduction operation, and the results are stored in the GPU device global memory. After the kernels have executed, the processing program copies the image matrix from GPU global memory to the system memory and then writes it to disk. When the application uses two GPUs, two input files are loaded simultaneously and processed in parallel. We are currently exploring the possibility of processing one file with two GPUs.
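To illustrate how step (1) might map onto OpenCL, a hypothetical decode kernel is sketched below as it could be embedded in the C++ host program. The kernel name, argument list, and column-major data layout are our assumptions rather than the authors' kernel; the host would enqueue it with clEnqueueNDRangeKernel over a 2-D global work size of roughly 1100 × 5000 work-items.

```cpp
// One work-item decodes one element of the baseband matrix for range bin k.
static const char* kDecodeKernelSrc = R"CLC(
__kernel void decode(__global const float2* X,     // baseband matrix, column-major
                     __global const float*  code,  // phase code, +/-1 and 0, length n_range
                     __global float2*       Yk,    // decoded output for range bin k
                     const int n_range,
                     const int k)
{
    int r = get_global_id(0);   // range sample within a pulse
    int p = get_global_id(1);   // pulse index
    int idx = p * n_range + r;

    // Shift the code down by k rows: samples above range bin k contribute nothing.
    float c = (r >= k) ? code[r - k] : 0.0f;
    Yk[idx] = (float2)(X[idx].x * c, X[idx].y * c);
}
)CLC";
```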

C. Systems
1) TAU: Dual GeForce GTX 480
The TAU system is a single-user workstation made by Penguin Computing operating Red Hat Enterprise Linux Client release 5.6 (Tikanga). It contains a 64-bit 2.4 GHz AMD quad-core Opteron CPU and 4 GB of memory. TAU is equipped with two NVIDIA GeForce GTX 480 GPU cards. The GTX 480 contains 480 CUDA cores and 1.5 GB of dedicated memory with a bandwidth of 177 GB/sec. The theoretical peak performance of each GPU is 168 double-precision Gigaflops or 1350 single-precision Gigaflops. NVIDIA drivers version 260.19.36 were installed to allow for an OpenCL v1.0 context.

2) PACMAN: Dual Tesla M2050
The PACMAN (Pacific Area Climate Modeling and Analysis Network) system is a multiuser cluster made by Penguin Computing operating Red Hat Enterprise Linux Server release 5.5 (Tikanga) and the Torque/Moab batch system scheduler. It is composed of 144 nodes, a Mellanox QDR Infiniband interconnect, and a ~100 TB Panasas version 12 file system. PACMAN contains two GPU nodes, each equipped with two 64-bit quad-core 2.4 GHz Intel Xeon CPUs, 64 GB of memory, a QDR Infiniband network card, and two NVIDIA Tesla M2050 GPU cards. The Tesla M2050 contains 448 CUDA cores and 3 GB of dedicated memory with a bandwidth of 149 GB/sec. The theoretical peak performance of each GPU is 515 double-precision Gigaflops or 1030 single-precision Gigaflops. The OpenCL v1.0 context was based on version 260.19.36 of the NVIDIA drivers.



Figure 4. Execution flow and thread parallelism of the high-resolution range-time-intensity (RTI) processing algorithm for MUIR coded long pulse data. Each CPU thread calculates one row of the final RTI image. Each GPU thread distributes the calculations for each row to the CUDA cores.

D. Experiment
We defined 71 HDF5-format data files containing CLP measurements (Figure 2) from the MUIR receiver to be a test collection of input data. Each file occupied 44 MB on disk and contained an 1100 × 5000 complex-valued single-precision matrix of baseband radar data, labeled X in section II.A. Each matrix represents approximately 50 sec of data. For CPU or single-GPU processing, the files were processed serially, using OpenMP or OpenCL to divide the computation among the CPU or GPU cores. Two files were processed simultaneously during dual-GPU experiments. The processed data from each input file was saved to disk in HDF5 format.

Results were verified by visually comparing the resulting images to reference images produced by a separate program and by numerically comparing the respective CPU- and GPU-calculated output matrices. Numerical differences were found between the reference results and the GPU results from the M2050 cards on the PACMAN system. However, the resulting RTI images were visually indistinguishable. We are currently diagnosing the numerical anomaly.
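A minimal sketch of the kind of numerical comparison described above (our illustration, not the verification program used in the experiment) is:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Report the largest absolute and relative element-wise differences between two RTI matrices.
void compare_rti(const std::vector<float>& cpu, const std::vector<float>& gpu)
{
    float max_abs = 0.0f, max_rel = 0.0f;
    const size_t n = std::min(cpu.size(), gpu.size());
    for (size_t i = 0; i < n; ++i) {
        const float d = std::fabs(cpu[i] - gpu[i]);
        max_abs = std::max(max_abs, d);
        if (std::fabs(cpu[i]) > 0.0f)
            max_rel = std::max(max_rel, d / std::fabs(cpu[i]));
    }
    std::printf("max abs diff = %g, max rel diff = %g\n", max_abs, max_rel);
}
```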

The TAU and PACMAN machines executed several GPU-experiment runs and CPU-control runs on the data test collection. Single-precision experiments were executed on both systems using one or two GPU accelerators, for a total of 4 GPU-accelerated runs. Control experiments were executed using one CPU thread on each system, for a total of 2 control runs. Finally, a variety of CPU and CPU+GPU runs were executed on TAU, the PACMAN GPU node, and other PACMAN compute nodes (the CPU+GPU results will be discussed in a subsequent paper). Because PACMAN is a shared multiuser resource, jobs were submitted to the Torque/Moab scheduler and the target node was reserved for the duration of each experiment.

Each experiment produced 71 output data files containing the radar RTI data along with diagnostic and timing information. The total time to read and process the 71-file test collection in each experiment was recorded with GNU time (version 1.7). Additionally, the time for each thread to complete each task for a given range gate and radar pulse was recorded with the timer class from the Boost C++ Libraries.

III. PERFORMANCE RESULTS

A. Metric
Our performance objective is to process MUIR CLP data in real time, i.e., to load and process the one-hour test collection and then to save the RTI data in one hour of wall time or less. The MUIR receiver records the baseband data in single precision, so we focus on single-precision floating-point calculation performance.
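Concretely, the real-time threshold implied by the test collection is roughly

\[
71\ \text{files} \times {\sim}50\ \mathrm{s/file} \approx 3550\ \mathrm{s} \approx 59\ \mathrm{min},
\]

so reading, processing, and writing the entire collection must complete in less than about one hour of wall time.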


Multithreaded speedup was measured relative to single-threaded CPU execution time on each respective system. GPU speedup was measured relative to both single-threaded and multithreaded CPU execution time. Both the CPU and GPU codes are under development, so additional optimizations are likely available, potentially affecting the speedup estimates in either direction. In particular, the GPU speedup reported here is less than that observed in similar experiments reported in [5] due to subsequent improvements in the multithreaded CPU code.

B. Speedup analysis
Since the sequence of processing tasks described in section II.A may be parallelized over the radar pulses (columns of data matrix X) and range bins (rows of X), both OpenMP multithreaded computation and GPU acceleration are expected to significantly reduce the time to solution. In addition to measuring total processing time, we measured the performance of each algorithm task to identify performance bottlenecks.

Figure 5a shows the wall time elapsed (including disk I/O) and parallel speedup for the CPU-processed test collection. Multithreaded speedup over single-threaded performance increases sub-linearly on each platform with increasing CPU threads. The 4-core TAU workstation processed one hour of radar data in approximately six hours using single-threaded single-precision computation. Approximately two hours are required when using 4 threads with OpenMP (one thread per CPU core). Similarly, the 8-core PACMAN GPU node required ~2.5 and ~0.5 hours for 1- and 8-thread OpenMP processing on the CPU, respectively. The OpenMP CPU application executed on PACMAN meets the performance objective but still falls significantly short of the GPU performance.

The GPU performance of both systems (Figure 5b) significantly exceeded the performance objective. The TAU workstation processed the 1-hour dataset in approximately 15 min using single-precision arithmetic with a single GTX 480 card, or ~8 min using both cards. Dual-GPU performance on TAU is nearly twice the single-GPU performance. The PACMAN node required ~20 and ~15 min using one or both M2050 cards, respectively. The marginal performance improvement of using both M2050 cards over one card is somewhat surprising and warrants further study.

As expected from the theoretical peak performance specification, the GTX 480 outperforms the M2050 for processing single-precision MUIR data. However, if double precision is required, the M2050 may outperform the GTX 480 under a similar processing load. Relative to available CPU resources in each system, the M2050 offered only modest speedup on the PACMAN system during this experiment while the GTX 480 speedup on the TAU system was significant. Whether similar results would be observed on these systems using the CUDA interface instead of OpenCL is an open question.

The time to process a single range bin (row) in a 5000-pulse MUIR data file is a measurement (row completion time) that isolates the CPU or GPU performance from disk I/O performance.

Figure 5. Aggregate (a) CPU and (b) GPU performance. Wall time to process one hour of MUIR coded pulse data includes disk I/O time. The horizontal lines in (a) indicate, from top to bottom, the time-to-solution goal and the time scale for the vertical axis in (b). GPU speedup factors in (b) are measured relative to either (middle bar) single-threaded or (right-hand bar) full multithreaded processing time on the host system CPU.


Figure 6 shows a box plot of the distribution of row completion times from each experiment collected over the entire 1-hour test collection. The GPU row-completion times exhibit far less absolute and relative variability than the respective multithreaded CPU times, consistent with the synchronized nature of GPU calculations. The variability in the CPU times may be the result of inefficient use of the cache, and implies that the CPU calculations may benefit somewhat from further optimization.

Finally, it is worthwhile to briefly examine the execution time and relative speedup of each of the section II.A processing steps. Table I shows the time required per processing task per thread for each of the six experiments. Column (0), "setup," refers to initialization of data structures and libraries. In the multithreaded CPU application, data structures are re-initialized at the completion of (4) since multiple rows of the RTI image are calculated in parallel. During GPU processing, the data structures are reused since each RTI row is calculated serially. The phase-code step (1) requires somewhat more processing time on the CPU than the FFT (2); the combined power (3) and peak-find (4) steps on the CPU cost approximately one third of the time spent on the complex FFT.

GPU relative speedup generally decreased with each successive step, consistent with the expected suitability of each task to the GPU architecture. That is, the greatest speedup was observed for the phase-code array-multiply operation and the least speedup was obtained for the peak-find operation, with the FFT performance falling somewhere in between. The lack of GPU speedup observed in step (4) is expected, given the large-scale synchronous computation architecture of GPU accelerators and the asynchronous nature of multiple linear searches. Array or matrix multiply, the fundamental operation of the other algorithm steps, is work that can be distributed evenly to the GPU compute cores, but a peak-finding operation is less efficient since a subset of compute cores must store a new local maximum value in addition to each comparison during the linear search.
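To make the contrast concrete, the sketch below shows one common way a per-pulse peak search can be expressed as an OpenCL work-group reduction (embedded in the C++ host source). It is our illustration of the general pattern, assuming a power-of-two work-group size, and is not the kernel used in this experiment.

```cpp
// One work-group finds the peak power of one spectrum (one column of T_k) using a
// local-memory tree reduction with barriers, rather than a purely element-wise operation.
static const char* kPeakKernelSrc = R"CLC(
__kernel void peak_power(__global const float* T,       // power spectra, one column per pulse
                         __global float*       peak,    // one peak value per pulse
                         __local  float*       scratch,
                         const int n_freq)
{
    int pulse = get_group_id(0);
    int lid   = get_local_id(0);
    int lsz   = get_local_size(0);

    // Each work-item scans a strided subset of the column and keeps a local maximum.
    float m = 0.0f;
    for (int f = lid; f < n_freq; f += lsz)
        m = fmax(m, T[pulse * n_freq + f]);
    scratch[lid] = m;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction in local memory (assumes lsz is a power of two).
    for (int s = lsz / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] = fmax(scratch[lid], scratch[lid + s]);
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        peak[pulse] = scratch[0];
}
)CLC";
```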

Figure 6. Variability of algorithm performance on TAU and PACMAN. The center line of each box indicates the median wall time (per thread) to complete one iteration (one range bin of 5000 pulses), the edges of each box indicate the 25th and 75th percentiles, and the whiskers indicate the measurement range (excluding outliers).


IV. CONCLUSIONS AND FUTURE WORK

The GeForce GTX 480 and Tesla M2050 GPU accelerator cards are cost-effective and portable solutions for high-resolution radar data processing tasks such as the high-resolution RTI product described here.

TABLE I. SUMMARY PROFILE

Time per task, reported as wall time per thread (ms):

Machine  Experiment   Setup (0)  Phase-code (1)  FFT (2)  Power (3)  Peak-find (4)  Row total
TAU      1 thread       53.76       129.51        45.30     29.61       0.01         258.19
TAU      2 thread       63.68       140.05        53.66     29.67       0.01         287.07
TAU      3 thread       72.98       153.81        57.75     30.22       0.02         314.79
TAU      4 thread       85.18       168.65        63.98     31.31       0.02         349.13
TAU      1 GTX 480       0.00         0.17         0.86      2.71       5.94           9.67
TAU      2 GTX 480       0.00         0.17         0.85      2.71       5.95           9.67
PACMAN   1 thread       31.85        33.78        26.95     24.36       0.01         116.94
PACMAN   2 thread       36.24        33.88        27.02     24.40       0.01         121.55
PACMAN   4 thread       42.60        37.31        27.25     24.49       0.02         131.67
PACMAN   8 thread       67.79        57.45        33.52     24.56       0.09         183.40
PACMAN   1 M2050         0.00         0.22         1.25      3.55       7.28          12.31
PACMAN   2 M2050         0.00         0.22         1.26      3.55       7.28          12.30

Speedup, calculated relative to single-threaded wall time on the same system (neglecting disk I/O time):

Machine  Experiment   Setup (0)  Phase-code (1)  FFT (2)  Power (3)  Peak-find (4)  Row total
TAU      1 thread        1.00         1.00         1.00      1.00        1.00          1.00
TAU      2 thread        1.69         1.85         1.69      2.00        1.88          1.80
TAU      3 thread        2.21         2.53         2.35      2.94        1.63          2.46
TAU      4 thread        2.52         3.07         2.83      3.78        2.48          2.96
TAU      1 GTX 480       N/A        783.11        52.66     10.94     -617.56         26.70
TAU      2 GTX 480       N/A       1567.90       106.10     21.88     -309.28         53.38
PACMAN   1 thread        1.00         1.00         1.00      1.00        1.00          1.00
PACMAN   2 thread        1.76         1.99         1.99      2.00        1.52          1.92
PACMAN   4 thread        2.99         3.62         3.96      3.98        0.90          3.55
PACMAN   8 thread        3.76         4.70         6.43      7.94        0.49          5.10
PACMAN   1 M2050         N/A        153.47        21.47      6.87    -1374.01          9.50
PACMAN   2 M2050         N/A        307.08        42.93     13.73     -686.55         19.01


Although dual-GPU configurations increased performance beyond the respective single-GPU configurations, the performance of either card alone significantly exceeded our real-time processing goal. High-resolution low-cost radar data products can now, in principle, be offered on-site during experiment campaigns at HAARP, and batch processing of data from an entire experiment campaign can be completed in several hours rather than days.

The raw MUIR data is single-precision, so the GTX 480 is better suited for our application, both because of its relatively superior single-precision performance and its significantly lower cost than the M2050. However, the M2050 may be a better choice for similar double-precision calculations. Also, a similar program written with CUDA could perform better than our OpenCL program on either the GTX 480 or the M2050.

Future work includes accelerating additional pre- and post-processing tasks. A time-integration filter applied after (4) will improve radar signal-to-noise ratio, and a time-difference filter applied before (1) will reduce the effects of radar side-lobe ground clutter; both tasks are compute-intensive and good candidates for GPU acceleration. Finally, accelerated high-resolution radar processing creates new research opportunities for real-time adaptive control of the MUIR and HAARP systems.

ACKNOWLEDGMENT

Chris Fallen and Beau Bellamy thank Jeremiah Dabney and Rob Cermak for TAU system support; Don Bahls for PACMAN system support; and Oralee Nudson for assistance with editing this manuscript. Hardware and HPC resources were supported by a grant from the Arctic Region Supercomputing Center and the University of Alaska Fairbanks.

REFERENCES

[1] S. Oyama, B.J. Watkins, F.T. Djuth, M.J. Kosch, P.A. Bernhardt, and C.J. Heinselman, "Persistent enhancement of the HF pump-induced plasma line measured with a UHF diagnostic radar at HAARP," J. Geophys. Res., vol. 111, 2006, doi:10.1029/2005JA011363.
[2] M.P. Sulzer, "A radar technique for high range resolution incoherent scatter autocorrelation function measurements utilizing the full average power of klystron radars," Radio Sci., vol. 21, no. 6, 1986, pp. 1033-1040.
[3] C.E. Cook, "Pulse Compression-Key to More Efficient Radar Transmission," Proc. of the IRE, vol. 48, no. 3, 1960, pp. 310-316.
[4] J.W. Cooley and J.W. Tukey, "An Algorithm for the Machine Calculation of Complex Fourier Series," Mathematics of Computation, vol. 19, no. 90, 1965, pp. 297-301.
[5] C.T. Fallen, B.V.M. Bellamy, G.B. Newby, and B.J. Watkins, "GPU accelerators for portable radar data processing," Proc. Users Group Conference, 2011.
[6] T. Pedersen, B. Gustavsson, E. Mishin, E. Kendall, T. Mills, H.C. Carlson, and A.L. Snyder, "Creation of artificial ionospheric layers using high-power HF waves," Geophys. Res. Lett., vol. 37, 2010, doi:10.1029/2009gl041895.
[7] C.T. Fallen, "Applications of a time-dependent polar ionosphere model for radio modification experiments," Ph.D. thesis, Dep. of Phys., Univ. of Alaska Fairbanks, Fairbanks, Alaska, 2010.
[8] C. Clemente, M. di Bisceglie, M. Di Santo, N. Ranaldo, and M. Spinelli, "Processing of synthetic aperture radar data with GPGPU," Proc. IEEE Workshop on Signal Processing Systems, 2009, pp. 309-314.
[9] M. Blom and P. Follo, "VHF SAR image formation implemented on a GPU," Proc. IEEE Int. Geoscience and Remote Sensing Symp., vol. 5, 2005, pp. 3352-3356.
[10] K. Moreland and E. Angel, "The FFT on a GPU," in Graphics Hardware, San Diego, California, 2003.
[11] J.D. Owens, S. Sengupta, and D. Horn, "Assessment of Graphic Processing Units (GPUs) for Department of Defense (DoD) Digital Signal Processing (DSP) Applications," Computer Engineering Research Laboratory, University of California, Davis, California, Rep. ECE-CE-20053, 2005.
[12] M. Frigo and S.G. Johnson, "FFTW: an adaptive software architecture for the FFT," Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, vol. 3, 1998, pp. 1381-1384.
