
Linköping University | Department of Computer and Information Science

Bachelor's thesis, 16 credits | Computer Engineering

Spring term 2020 | LIU-IDA/LITH-EX-G-20/033-SE

Parallelizing Digital Signal Processing for GPU

Hannes Ekstam Ljusegren
Hannes Jonsson

Supervisor: George Osipov


Examiner: Peter Jonsson

External Supervisors: Fredrik Bjurefors & Daniel Fransson

Linköpings universitet
SE-581 83 Linköping
013-28 10 00, www.liu.se

Copyright
The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances.
The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.
According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.
For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its home page: http://www.ep.liu.se/.

© Hannes Ekstam Ljusegren


© Hannes Jonsson

Abstract
Because of the increasing importance of signal processing in today's society, there is a need to easily experiment with new ways to process signals. Usually, fast-performing digital signal processing is done with special-purpose hardware that is difficult to develop for. GPUs pose an alternative for fast-performing digital signal processing. The work in this thesis is an analysis and implementation of a GPU version of a digital signal processing chain provided by SAAB. Through an iterative process of development and testing, a final implementation was achieved. Two benchmarks, both comprised of 4.2 M test samples, were made to compare the CPU implementation with the GPU implementation. The benchmarks were run on three different platforms: a desktop computer, an NVIDIA Jetson AGX Xavier and an NVIDIA Jetson TX2. The results show that the parallelized version can reach several orders of magnitude higher throughput than the CPU implementation.

Table of Contents
Abstract

1. Introduction
1.1 Background
1.2 Problem Statement

2. Theory
2.1 Digital Processing Chain
2.1.1 Electromagnetic Waves
2.1.2 ADC Sampling
2.1.3 Fourier Transform
2.1.4 Window Function
2.1.5 Window Overlap
2.1.6 WOLA
2.1.7 Batching
2.1.8 Pulse Detection
2.2 Hardware
2.2.1 Central Processing Unit
2.2.2 Graphics Processing Unit
2.2.3 NVIDIA Jetson AGX Xavier
2.2.4 Air-T NVIDIA Jetson TX2
2.3 Software
2.3.1 Compute Unified Device Architecture
2.3.2 CUFFT
2.3.3 NVIDIA profiler
2.4 Related Works

3. Method
3.1 Parallelizing DP-chain
3.2 Benchmarks
3.3 Hardware Setup
3.4 Software Setup

4. Results
4.1 Parallelizing DP-chain
4.2 Benchmarks

5. Discussion
5.1 Batch Size
5.2 Latency Sacrifice
5.3 Pulse Activity Dependence
5.4 Throughput Deviation
5.5 Jetson Optimizations
5.6 Power Mode Divergences
5.7 Criticism of Methodology
5.8 Credibility of Sources
5.9 Ethics

6. Conclusion

References

1. Introduction

1.1 Background
Signal processing is an important part of many applications today, such as audio manipulation, image processing, data compression, radar and many more. In many of these cases the processing of signal data needs to be carried out very fast. For instance, when using radar in an airplane or a naval vessel it is of vital importance to quickly identify a recorded signal. Consequently, the usual solution of using a central processing unit (CPU) to carry out the computations is generally not enough to meet the desired performance requirements [1]. Therefore, the use of special-purpose hardware is common when there is a need for time-critical signal processing.

Traditionally, signal processing has mainly been computed on highly specialized digital signal processors (DSP) and field-programmable gate arrays (FPGA) due to their low energy consumption combined with high computational power. A drawback of these technologies is their high development cost, caused by the difficulty of, and technical knowledge required for, parallel programming on them [1]. Graphics processing units (GPU) provide opportunities to lower the development cost [1] and could possibly be used as a replacement for signal processing computations.

GPUs are rapidly improving in performance and efficiency. With their highly parallel structure, they make for excellent general-purpose computing units, especially for large independent data sets as seen in computer graphics and image processing. Modern GPU manufacturers also allow connecting several GPUs to work in unison, further increasing performance. Furthermore, GPUs are relatively simple to work with, using libraries such as CUDA and OpenCL. GPUs are also relatively cheap, both in terms of price and development cost, compared to DSPs and FPGAs. [2]

GPUs and FPGAs have in common that they compute efficiently in parallel. Despite that, FPGAs still outperform GPUs at some specialized tasks in terms of processing speed, and they generally require less power and are thus widely used in many areas.

Saab works with special-purpose hardware to perform signal processing efficiently. However, because of the high development costs associated with special-purpose hardware, they are looking for alternatives that are easier to work with, in order to speed up development and make it possible to test new functionality. Previously, a software-based implementation has been under development which would ease prototyping of new solutions. The first part of the thesis work is to revise certain parts of the software-based implementation to enable GPU utilization. The second part is to compare the current software implementation with a parallelized GPU approach on a number of platforms.

1.2 Problem Statement
The signal processing chain has three steps: an analog-to-digital converter, a channelizer and a pulse detector. The digital processing chain (DP-chain) comprises only the channelizer and the pulse detector. The analog-to-digital converter receives an electromagnetic wave from an antenna and converts it to digital samples. The samples are fed into the channelizer, which consists of a window function and a Fast Fourier Transform (FFT). The channelizer produces bins which represent how well the samples correlate to a certain frequency band. The final step is a pulse detector that takes the information in the frequency bins from the FFT and attempts to detect pulses. These steps will be explained in further detail in chapter 2.

The study contains the following research questions:

1. What parts of the DP-chain are parallelizable on GPU?

2. How does the parallelized DP-chain perform in comparison to the CPU implementation?

The platforms that were used include two embedded computer modules and a desktop computer. The two embedded computer modules were the NVIDIA Jetson AGX Xavier and an AIR-T platform with an NVIDIA Jetson TX2. The desktop had an NVIDIA RTX 2080 Super GPU and an Intel Core i9-9900 CPU. The chain was also measured using different power modes on the Jetson platforms.

2. Theory
To develop a parallel digital processing chain, some theory needs to be understood. This chapter explains the theory behind the digital processing chain and reviews related work.

2.1 Digital Processing Chain
In this part, the entire DP-chain will be discussed, including the details and the theory behind it. The DP-chain consists of the channelizer and the pulse detector of the signal processing chain. Figure 1 shows the signal processing chain.

Figure 1. Visualization of the signal processing chain. Each step shows its input and output. The digital signal processing chain contains two major parts, the channelizer and the pulse detector. However, the channelizer has been split up into smaller parts: the window function and the FFT.

2.1.1 Electromagnetic Waves
The phenomenon of an electromagnetic wave arises because a changing magnetic field induces a changing electric field, which in turn induces a magnetic field, and so on. This process continues and the wave propagates through space at a constant speed. Electromagnetic waves, unlike mechanical waves, do not need a medium to travel through. The speed at which an electromagnetic wave travels is the speed of light, which is approximately 3 · 10^8 m/s. Electromagnetic waves are commonly used to transfer information, by modulation, but they can also be used to locate objects at a distance by sending an electromagnetic pulse in a certain direction and then measuring the time it takes to bounce off the object and return. In this report, the term pulse will refer to artificially made electromagnetic waves.

2.1.2 ADC Sampling
An analog-to-digital converter (ADC) is an electronic component that takes a continuous analog signal and converts it to a digital signal. The digital signal samples are binary numbers that are proportional to the analog signal, with a certain bit precision. An ADC has a limit on how quickly it can retrieve digital signal samples, a sample rate, which is measured in Hz. An antenna will induce a voltage from electromagnetic waves passing by it, which can then be converted to digital signal samples by an ADC. The number of samples per second that can be processed through the DP-chain directly influences what sample rate can be used by the ADC in a real-time environment.

2.1.3 Fourier Transform
The Fourier transform breaks up a function of time into a function of frequency. The transformed function tells how periodic the original function is at certain frequencies. The definition of the Fourier transform is:

F(α) = ∫_{−∞}^{∞} f(t) e^{−i2πtα} dt    (1)

where f denotes the function of time and F denotes the transformed function of frequency α. Both functions output complex numbers. For the transformed function, the magnitude of the complex number tells how well the original function correlates to the given frequency, while the angle of the complex number tells which phase it corresponds best with.

Since the Fourier transform is a continuous mathematical transformation, it needs to be discretized to be used in digital processing. This is called a discrete Fourier transform (DFT) and it works by taking equally-spaced samples of the continuous function. The definition of the DFT is:

F_k = ∑_{n=0}^{N−1} f_n e^{−i2πkn/N}    (2)

where f_n denotes the nth sample from the function of time and F_k denotes the discrete transformed function at the frequency bin indexed by k. Each of the frequency bins, F_k, consists of a band of frequencies. The frequency band per bin is determined by how many samples (referred to as N in the definition) were used in the discrete transform. The DFT can easily be implemented in software based on the definition, but it would require O(N^2) operations. There is a more widely used algorithm to compute the DFT called the fast Fourier transform (FFT), which was first presented by James W. Cooley and John W. Tukey [4]. The FFT algorithm is a divide-and-conquer strategy to compute the DFT; it reduces the complexity to O(N log_2(N)) and thus performs better than the naive DFT as N grows.
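
For illustration, a direct implementation of definition (2) could look as follows in C++. This is a minimal sketch of the naive O(N^2) DFT, not code from the thesis implementation:

    #include <cmath>
    #include <complex>
    #include <vector>

    // Naive O(N^2) DFT, directly following definition (2).
    // Each output bin F_k is the sum of all N input samples
    // weighted by the complex exponential e^(-i*2*pi*k*n/N).
    std::vector<std::complex<double>> dft(const std::vector<std::complex<double>>& f) {
        const std::size_t N = f.size();
        std::vector<std::complex<double>> F(N);
        for (std::size_t k = 0; k < N; ++k) {
            std::complex<double> sum(0.0, 0.0);
            for (std::size_t n = 0; n < N; ++n) {
                const double angle = -2.0 * M_PI * double(k * n) / double(N);
                sum += f[n] * std::complex<double>(std::cos(angle), std::sin(angle));
            }
            F[k] = sum;
        }
        return F;
    }

An FFT computes exactly the same output bins, but reuses intermediate results between bins instead of recomputing the full sum for each k.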

2.1.4 Window Function
When trying to detect pulses, there is a need to analyze only a slice of a signal through time. This slice of the signal is moved through time and is referred to as a window. By analyzing only part of the signal it is possible to catch the majority of the periodic nature of a pulse in the window, so that when a Fourier transform is performed, that periodicity is identifiable.

To pick only a window of a signal, a mathematical function called the window function is applied to the signal. This window function is zero-valued outside of the window interval and is multiplied by the signal, which causes the resulting signal to be zero outside the interval of the window. There are many types of window functions, and they all affect the result of a Fourier transform in different ways. Two of the major properties that the window function influences in the Fourier transform are the amount of spectral leakage and the resolution. Spectral leakage is the extent to which the Fourier transform produces non-zero values outside of a certain signal's frequency. With a lot of spectral leakage, a signal A with a certain frequency and high amplitude may obscure a signal B with a different frequency and lower amplitude. With bad enough spectral leakage, this obfuscation may happen even if the frequencies of A and B are relatively far apart. The resolution is how well a specific frequency is captured by the Fourier transform. With a bad resolution, two frequencies that are very close to each other may not be distinguishable. [11]

The whole windowing process can be discretized just like the Fourier transform. By taking discrete samples from the window function and multiplying them with the samples taken for the DFT, it is possible to make the computations digitally. Figure 2 shows the processing of samples by using windows.

Figure 2. Visualization of the discrete windowing process. Each line of the samples represents a 32-bit floating-point number. The window size in this case is 10 samples. The window function will therefore contain 10 floating-point numbers that are multiplied by the samples in the window. W_N is the Nth position of the window in time. In this case, the window moves its entire size after each digital signal processing step.
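
As an illustration, the discrete windowing step could be sketched as follows in C++. The Hann window below is an arbitrary example choice; the report does not specify which window function the DP-chain uses:

    #include <cmath>
    #include <vector>

    // Apply a window function to one window of samples,
    // following Figure 2: windowed[i] = window[i] * samples[i].
    void applyWindow(const float* samples, const float* window,
                     float* windowed, int windowSize) {
        for (int i = 0; i < windowSize; ++i)
            windowed[i] = window[i] * samples[i];
    }

    // Example coefficients: a Hann window (an illustrative choice only).
    // The coefficients approach zero at both edges of the window.
    std::vector<float> hannWindow(int windowSize) {
        std::vector<float> w(windowSize);
        for (int i = 0; i < windowSize; ++i)
            w[i] = 0.5f * (1.0f - std::cos(2.0f * float(M_PI) * i / (windowSize - 1)));
        return w;
    }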

2.1.5 Window Overlap
Because the window function usually approaches zero at the edges, information is lost when sliding whole window sizes through time. To counteract this, an overlap between windows is used. In other words, instead of sliding the window by whole window sizes, it is advanced by a smaller value. This value, how much the window moves for each processing step, is called the step size.

2.1.6 WOLA
By increasing the size of a window by an integer multiple of the DFT input size, a summation
of multiple DFT inputs can be used as input to a single DFT. This has the effect of widening
the frequency band of the output and also reducing spectral leakage. [14]

This technique will be referred to as WOLA (weight, overlap, add) in this report.
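
A minimal sketch of the WOLA summation, assuming the window covers P DFT-input segments that are weighted by the window function and then summed element-wise (with the benchmark parameters of section 3.2, dftSize = 64 and P = 5 give the 320-sample window):

    // WOLA preprocessing sketch: the window covers P * dftSize samples.
    // After weighting by the window function, the P segments are summed
    // element-wise, producing a single dftSize-long input for one DFT.
    void wola(const float* samples, const float* window,
              float* dftInput, int dftSize, int P) {
        for (int i = 0; i < dftSize; ++i) {
            float sum = 0.0f;
            for (int p = 0; p < P; ++p) {
                const int j = p * dftSize + i;
                sum += window[j] * samples[j];
            }
            dftInput[i] = sum;
        }
    }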

2.1.7 Batching
Batching is the process of taking multiple windows and processing them in parallel. Because of the window overlap, the number of samples processed in each batch will be the product of the step size and how many windows are processed per batch:

samples per batch = step size · batch size    (3)

The batch size is the number of windows processed in a single batch. A visualization of why the samples-per-batch formula is reasonable is given in Figure 3.

Figure 3. Similar to Figure 2, but with window overlapping and batching. W1-W5 represent the 5 windows that will be processed in one batch. The step size shows how far the window should move for the next samples to be processed. The reason why the last samples in W5 (the ones that are not overlapped by W4) are not considered processed is that the next batch will start on those samples; therefore they are not finished being processed.

2.1.8 Pulse Detection
In this report, pulse detection consists of finding the frequencies whose signal strength is above a certain threshold for a period of time. This is accomplished by iterating over each frequency bin from the output of the FFT, computing the magnitude of that bin and then checking if the value is above a constant threshold. This process is then repeated for each FFT output as the window moves through time. A pulse is said to be detected when it has continuously been above the constant threshold for a certain amount of time. In this report, this constant threshold will be referred to as the detection threshold.
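
A sketch of the per-bin detection logic described above, with hypothetical names (the detector actually used in this work is described in section 4.1):

    #include <complex>

    // Serial pulse-detection sketch for one frequency bin.
    // aboveCount tracks for how many consecutive FFT outputs the bin
    // has been above the detection threshold; a pulse is reported once
    // it has stayed above the threshold for minDuration outputs.
    struct BinState { int aboveCount = 0; };

    bool updateBin(BinState& state, std::complex<float> bin,
                   float detectionThreshold, int minDuration) {
        const float magnitude = std::abs(bin);
        if (magnitude > detectionThreshold) {
            ++state.aboveCount;
            return state.aboveCount == minDuration;  // pulse detected
        }
        state.aboveCount = 0;  // pulse (if any) ended
        return false;
    }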

2.2 Hardware

2.2.1 Central Processing Unit
The CPU is one of the main components of a computer. The main function of the CPU is to run programs by executing the set of instructions specified by the program. The CPU is optimized for flexibility and for running a wide range of programs serially. Consequently, it typically lacks performance in concurrent tasks such as parallel computing. [7]

2.2.2 Graphics Processing Unit
Historically, the GPU was designed to process graphical workloads at superior speeds compared to the CPU. Nowadays, the GPU's computational power is also used for general-purpose computations that are not graphics related.

A GPU generally contains more computing cores than a CPU. As a result, the GPU can reach much higher throughput than a CPU. However, the high throughput usually comes at the cost of higher latency compared to a CPU. Thus, a GPU will perform well in applications with a large amount of independent calculations. [7]

The architecture of a GPU varies by generation. On a high level, an NVIDIA GPU consists of a number of streaming multiprocessors (SM). Each SM in turn consists of CUDA cores, a dedicated set of registers and a shared memory. [7]

2.2.3 NVIDIA Jetson AGX Xavier
The NVIDIA Jetson AGX Xavier system has an 8-core NVIDIA Carmel ARM CPU. The AGX also has an integrated 512-core NVIDIA Volta GPU @ 1.37 GHz. The memory is 16 GB of 256-bit LPDDR4x @ 2133 MHz. The internal memory bus allows a data transfer rate of 136.5 GB/s. The CPU and the GPU share memory space, eliminating copy time between the components. For powering the system, there are three modes: 10 W, 15 W and 30 W, allowing for dynamic use of power depending on computational needs. [5]

2.2.4 Air-T NVIDIA Jetson TX2
The Air-T is a system designed with radio transmission and reception in mind. It has several components, including a CPU, an FPGA and an NVIDIA Jetson TX2 embedded on the board. In this study, only the TX2 will be used for measurements.

The Jetson TX2 module has two CPUs, a dual-core NVIDIA Denver CPU and a quad-core ARM Cortex-A57 CPU. The TX2 has a 256-core NVIDIA Pascal GPU. For memory, the TX2 has 8 GB of LPDDR4 @ 1866 MHz with a data transfer rate of 59.7 GB/s. This memory is shared between the CPU and the GPU. Like the Jetson Xavier, the TX2 has different power modes: a 7.5 W mode and a 15 W mode. [6]

2.3 Software

2.3.1 Compute Unified Device Architecture
Compute Unified Device Architecture (CUDA) is an API by NVIDIA for parallel computing on GPUs. The API works by extending the C++ programming language with a set of extensions and a runtime library. [8]

In CUDA, a GPU function call starts a kernel. A kernel is executed in parallel by a number of threads. The programmer organizes these threads in a number of blocks called thread blocks, which all execute the same kernel. Each thread in a thread block executes one instance of the kernel. The minimum amount of scheduled work is 32 threads and is called a warp. The thread blocks are grouped in a grid. An illustration of the structure can be found in Figure 4. [8] [9]

Figure 4. The figure illustrates the thread structure. A thread has private memory. Several threads build a thread block. Several thread blocks build a grid. Every grid runs a kernel. There may be multiple grids.
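
A minimal CUDA example of this structure, showing a kernel and a launch configuration organized into thread blocks (illustrative only, not part of the DP-chain):

    #include <cuda_runtime.h>

    // Minimal kernel: each thread scales one element of the input.
    // blockIdx, blockDim and threadIdx locate the thread in the grid.
    __global__ void scale(float* data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;
    }

    // Launch configuration: a grid of thread blocks with 256 threads
    // each (a multiple of the 32-thread warp size).
    void launchScale(float* d_data, float factor, int n) {
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        scale<<<blocks, threadsPerBlock>>>(d_data, factor, n);
    }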

2.3.2 CUFFT
CUDA fast Fourier transform (cuFFT) is NVIDIA's API for performing FFTs on GPUs. The API comes with two libraries, the cuFFT library and the cuFFTW library. The cuFFT library provides ready-made routines for performing FFTs on GPUs. The cuFFTW library is mainly a porting tool for the commonly used CPU-based Fourier transform library, the Fastest Fourier Transform in the West (FFTW). [10]
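
As an illustration, a batched one-dimensional complex-to-complex FFT with cuFFT might look as follows (a minimal sketch without error handling):

    #include <cufft.h>

    // Batched 1D complex-to-complex FFT with cuFFT.
    // d_data holds `batch` consecutive transforms of length `dftSize`
    // in device memory and is transformed in place.
    void runFFT(cufftComplex* d_data, int dftSize, int batch) {
        cufftHandle plan;
        cufftPlan1d(&plan, dftSize, CUFFT_C2C, batch);
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
        cufftDestroy(plan);
    }

In practice, the plan would be created once and reused for every batch, since plan creation is expensive compared to executing the transform.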

2.3.3 NVIDIA profiler
Nvprof is a tool commonly used when profiling GPU applications. It allows a user to gather CUDA-related information such as kernel execution times, memory transfers and more. Nvprof is run from the command line with optional flags and the application to profile. [8]
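
For example, assuming an application binary named ./dp_chain (a placeholder name), a summary profile and a per-launch trace could be collected with:

    nvprof ./dp_chain
    nvprof --print-gpu-trace ./dp_chain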

2.4 Related Works
F. García-Rial et al. [12] present a solution that performs image processing for a 3-D THz radar in real time, for the purpose of identifying hidden person-borne threats. Their solution contains common signal processing algorithms, such as a window function, an FFT and peak detection, and it achieved a refresh rate of more than 8 FPS for 6000 pixels. The window function was a custom-made CUDA kernel and the cuFFT library was used for the FFT. While many parts of the solution are similar to what is needed for pulse detection in this report, the application is different considering it is image-based.

Yaniv Rubinpur and Sivan Toledo [15] implemented and compared two versions of a wildlife tracking system, ATLAS; one where the signal processing parts (including, but not limited to, an FIR filter, a Fourier transform and an overlap-add) are done by a CPU, and one where they are done by a GPU. The comparison was done on several different setups, including a GeForce TITAN Xp, a Jetson AGX Xavier and a Jetson TX2. The authors found that the GPU implementation resulted in more than 50 times higher performance while using about 5 times less power than the CPU implementation. While relatively similar to this report in both the platforms used and the steps performed, the main process is different. Nevertheless, the authors' results suggest that a large performance increase can be expected in this report as well when comparing the CPU implementation with the GPU implementation.

P. Monsurro et al. [3] discuss bottlenecks when using GPUs as general-purpose computational units. Two major factors are the latency of sending data from the CPU to the GPU through the PCI Express interface, and the computational power. The current industry standard as of 2020 is 16x PCIe 3.0 with a bandwidth of 16 GB/s. This might be limiting on the desktop; however, both the NVIDIA Jetson AGX Xavier and the NVIDIA Jetson TX2 use an integrated GPU with a shared memory space that can receive data with a bandwidth of about 137 GB/s [5] and 60 GB/s [6] respectively, and will consequently not be as affected by this latency. However, the computational power might be throttling the less powerful Jetson cards more than the desktop setup.

3. Method
This chapter will discuss the methodology used to achieve the goal of parallelizing the digital signal processing application, and how the benchmarking was done to analyze its performance.

3.1 Parallelizing DP-chain
The parallelization in this study was conducted through an iterative process of evaluation, implementation and optimization.

The first step in parallelizing the DP-chain was to evaluate which sections of the DP-chain were parallelizable. This was done through discussions, through understanding the different sections of the DP-chain, and by reviewing the CPU implementation. The CPU implementation was done by SAAB and was chosen as the reference for test results. Once the evaluation had been conducted, an implementation was written. The implementation was then tested and optimizations based on the results were made. This process was repeated iteratively several times.

To test that the entire GPU DP-chain worked as expected, its result was compared to that of the CPU implementation after every iteration. The comparisons were based on the complex numbers in the frequency bins, as well as the number of pulses found, when the GPU and CPU implementations were given the same input samples.

3.2 Benchmarks
For the test data, a sample size of roughly 4.2 M 32-bit floating-point samples was chosen. The DP-chain was evaluated using an expected best-case test and an expected worst-case test. These tests will be referred to as best-case and worst-case. The best-case test was set up to not have any pulses. This was considered the best case since having no pulses in the samples should require the least amount of work. The worst-case test was set up to have both long pulses, stretching over most of the runtime, and short, highly repetitive pulses throughout the runtime. This was considered the worst case because the highly repetitive pulses require many frequent calculations while the long pulses require heavy serial computation, thus decreasing thread efficiency on the GPU. Therefore, a combination of long pulses and short frequent pulses is expected to be difficult. The samples tested were exactly the same on all platforms.

The benchmark was measured by using the standard C++ library Chrono with the high_resolution_clock class to measure every individual step of the chain. The throughput, measured in samples per second, of each step was then calculated as

throughput = (batch size · step size) / elapsed time    (4)

where batch size is the number of windows in a batch, step size is the step size used and elapsed time is the amount of time spent in a particular function. The throughput is measured for every batch. Once all the batches have been calculated, the average is taken as the final measurement.
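
A sketch of how such a measurement can be expressed, following equation (4) (a hypothetical helper, not the benchmark code itself):

    #include <chrono>

    // Time one step of the chain and compute its throughput in
    // samples per second, following equation (4). For GPU steps,
    // the step must synchronize with the device before returning,
    // otherwise only the asynchronous launch would be timed.
    template <typename Step>
    double measureThroughput(Step step, int batchSize, int stepSize) {
        auto start = std::chrono::high_resolution_clock::now();
        step();  // run the step being measured
        auto end = std::chrono::high_resolution_clock::now();
        double elapsed = std::chrono::duration<double>(end - start).count();
        return (double(batchSize) * stepSize) / elapsed;
    }

The per-batch throughputs returned by such a helper would then be averaged to produce the final measurement described above.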

For the final benchmarks, the test was run with 17 different batch sizes, ranging from 2^0 to 2^16, on all platforms. A DFT size of 64 with a WOLA of 5 summations was used, meaning a window of 320 samples. The step size was 32. Each platform ran both the best-case test and the worst-case test once. The Jetson platforms also ran the two tests on each power mode and with jetson_clocks, a script that maximizes clock frequency and power usage.

3.3 Hardware Setup
The benchmarks were performed on three different platforms: a desktop computer, an NVIDIA Jetson AGX Xavier and an NVIDIA Jetson TX2.

The desktop computer contained an Intel Core i9-9900 CPU, an NVIDIA RTX 2080 Super GPU with a PCIe 3.0 x16 connection and 64 GB of RAM. The CPU thermal design power is 65 W.

The NVIDIA Jetson AGX Xavier is specified in section 2.2.3.

The NVIDIA Jetson TX2 is specified in section 2.2.4.

3.4 Software Setup
For the desktop computer, Linux CentOS was used as the operating system, CUDA 10.2 for GPU programming and GCC 7.3.0 as the C++ compiler.

For the NVIDIA Jetson AGX Xavier, Linux Ubuntu was used as the operating system, CUDA 10.2 for GPU programming and GCC 7.4.0 as the C++ compiler.

For the NVIDIA Jetson TX2, Linux Ubuntu was used as the operating system, CUDA 10.0 for GPU programming and GCC 7.4.0 as the C++ compiler.

Only the desktop setup was used to benchmark the CPU version of the DP-chain, while the GPU version was benchmarked on all three platforms. The GPU version of the DP-chain uses the same code for all three platforms, but it is compiled on each platform, which could create slight variations in the binaries.

4. Results
This chapter contains the results of the process that was defined in the methodology. It starts out by presenting the parallelized algorithm that resulted from previous studies and iterative development. Then, the benchmarks of the final implementation are presented.

4.1 Parallelizing DP-chain
The CPU implementation of the digital signal processing is serial and deals with one window at a time. It gathers enough samples to fill a window, then applies the window function together with a WOLA. It then runs the FFT on the preprocessed samples, using the FFTW library. After this, the pulse detector runs on the frequency bins, keeping track of frequencies with a magnitude above a constant threshold and writing out pulses if found. This concludes the processing of one window; the process then repeats by gathering new samples to process. The process terminates when there are no more samples to process and prints out all the detected pulses.

Once the CPU implementation was understood, an evaluation based on previous studies and the implementation gave a first version of the GPU-based digital signal processing chain. Based on F. García-Rial et al. [12] and the CUDA API documentation [8], the preprocessing (window function and WOLA) and the FFT could trivially be parallelized on the GPU. For the preprocessing, each GPU thread computes the product of the decimated window with the window function and puts the result into a GPU memory buffer. The FFT was performed using the cuFFT library. In the pulse detection step, one thread for each frequency bin of the FFT output is started. Each of these threads computes the magnitude of the complex number and checks whether it is higher than the detection threshold. The pulse detection threads have a state data structure to keep track of pulses as the digital signal processing is repeated through time. This concludes the processing of one window. Similar to the CPU implementation, the process repeats by gathering new samples for the next window and is done when there are no more samples to process.
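
A sketch of what such a preprocessing kernel could look like, combining the windowing and WOLA steps from sections 2.1.4-2.1.6 (hypothetical names and memory layout; the actual kernel belongs to SAAB's implementation):

    // Preprocessing kernel sketch: one thread per DFT-input element.
    // Each thread weights P segments of its window by the window
    // function and sums them (WOLA), writing one element of the DFT
    // input for its window in the batch. Windows overlap by stepSize,
    // so window w starts at samples + w * stepSize.
    __global__ void preprocess(const float* samples, const float* window,
                               float* dftInput, int dftSize, int P,
                               int stepSize, int batchSize) {
        int w = blockIdx.y;                             // window index in batch
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // element index
        if (w >= batchSize || i >= dftSize) return;
        const float* in = samples + w * stepSize;       // start of this window
        float sum = 0.0f;
        for (int p = 0; p < P; ++p)
            sum += window[p * dftSize + i] * in[p * dftSize + i];
        dftInput[w * dftSize + i] = sum;
    }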

This first implementation did not reach full GPU resource usage with the desired window size. A better throughput could be achieved by also processing whole windows in parallel. The next implementation did exactly this; the number of windows processed in parallel is called the batch size. A higher batch size means waiting for the samples required to fill the batch, which causes a higher latency from the time a sample is gathered to the time it is finished being processed.

In the final iteration, a new detector was written because the detector turned out to be a large bottleneck. This was caused by the fact that for large batch sizes, the detector serially analyzes each FFT output in the batch even though each frequency bin is handled in parallel. To get more parallel utilization, an early-out algorithm was implemented. This algorithm starts one thread per FFT output in the whole batch. Each thread then looks at the previous signal strength and the current signal strength to determine if its position is the start of a pulse. If it is, the thread walks through the batch until it reaches the end of the pulse. This shortens the amount of linear analysis for the detector to the length of a pulse, whereas the previous detector always went linearly through the whole batch. This provided a considerable speedup when there was low pulse activity in the sample data, as will be shown in section 4.2.
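
A simplified sketch of the early-out idea (hypothetical names; the real detector also carries pulse state across batch boundaries, which is omitted here):

    // Early-out detector sketch: one thread per (bin, time) position in
    // the batch. A thread only does serial work if its position is the
    // start of a pulse (below threshold before, above threshold now); it
    // then walks forward through the batch until the pulse ends.
    // magnitudes is laid out as [batchSize][numBins].
    __global__ void detectPulses(const float* magnitudes, int numBins,
                                 int batchSize, float threshold,
                                 int* pulseLengths) {
        int bin = blockIdx.x * blockDim.x + threadIdx.x;
        int t   = blockIdx.y;  // position in the batch
        if (bin >= numBins || t >= batchSize) return;

        float cur  = magnitudes[t * numBins + bin];
        float prev = (t > 0) ? magnitudes[(t - 1) * numBins + bin] : 0.0f;
        if (cur <= threshold || prev > threshold) return;  // early out

        int len = 0;  // walk to the end of the pulse
        while (t + len < batchSize &&
               magnitudes[(t + len) * numBins + bin] > threshold)
            ++len;
        pulseLengths[t * numBins + bin] = len;
    }

With little pulse activity, almost every thread returns immediately, which is exactly the behavior that produced the speedup described above.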

4.2 Benchmarks
The following results show the last iteration of testing when running the test samples through the entire DP-chain. For the desktop throughput data (Figure 5), the best-case test doubled in performance for every doubling of the batch size until a batch size of around 2048, where it started to plateau. The best-case test peaked at roughly 2 billion samples per second.

Figure 5. The figure shows a log-scaled throughput chart when running samples through the entire DP-chain on the desktop setup. The vertical axis is scaled as log_10 and the horizontal axis as log_2. The GPU best-case shows the number of samples processed per second during the best-case test. The GPU worst-case shows the number of samples processed per second during the worst-case test. The CPU shows the number of samples processed with the CPU implementation and is not affected by the signal size or batch size.

For the worst-case test, the GPU peaked at a batch size of 4096 with about 200 M samples per second and started to plateau at a batch size of around 256. The CPU consistently processed 4.6 M samples per second (batch size is not applicable) and serves as a reference.

For the AGX throughput data (Figure 6), the test cases behaved similarly to the desktop throughput data. The best-case test doubled in performance for all power modes until a batch size of about 2048, where it starts to plateau. When measuring the AGX without power limiting, it reaches a peak at about 1 B samples per second. The performance drops about 40% from no limit to the 30 W mode, and a similar relative drop is seen between the other power modes.

For the worst-case test, the AGX peaks at a batch size of about 1024 at about 40-90 M samples processed per second depending on which power mode is used. The performance then declines before starting to increase again as the batch size increases.

Figure 6. The figure shows a log-scaled throughput chart when running samples through the entire DP-chain on the NVIDIA Jetson AGX Xavier setup. The vertical axis is scaled as log_10 and the horizontal axis as log_2. The graph shows the performance of the different power modes in both test cases. The AGX best-case and the AGX worst-case represent the AGX without limited power consumption. The CPU shows the number of samples processed with the CPU implementation and is not affected by signal size or batch size.

The TX2 shows similar patterns in the throughput data (Figure 7) as the AGX. In the best-case test, performance doubles when the batch size is doubled up to 1024 for all power modes, before plateauing for bigger batch sizes. In the worst-case test, the performance also doubles up to a batch size of 1024. For larger sizes, it dips in performance before settling around 50-80 M samples per second.

Furthermore, as shown in Figures 8 and 9, the time taken for each section of the DP-chain is roughly 5-10 microseconds for batch sizes up to 256, except for the detector in the worst-case test. In the best-case test, larger batch sizes show the memory copy as the prominent bottleneck, taking approximately 70% of the total time. After a batch size of 2048, the time starts to double with each doubling of the batch size, meaning the time taken increases linearly beyond this point. In the worst-case test, however, larger batch sizes show the detector as the prominent bottleneck, taking approximately 92% of the total time. The time taken for the different sections on the Jetson platforms is distributed in a similar way as on the desktop for both test cases.

Figure 7. The figure shows a log-scaled throughput chart when running samples through the entire DP-chain on the NVIDIA Jetson TX2 setup. The vertical axis is scaled as log_10 and the horizontal axis as log_2. The graph shows the performance of the different power modes in both test cases. The CPU shows the number of samples processed with the CPU implementation and is not affected by signal size or batch size.

Figure 8. The figure shows the time it took for a batch to be processed through individual sections of the DP-chain during the best-case. The Y-axis shows the time in microseconds and the X-axis the batch size used on a log_2 scale.

Figure 9. The figure shows the time it took for a batch to be processed through individual sections of the DP-chain during the worst-case. The Y-axis shows the time in microseconds and the X-axis the batch size used on a log_2 scale.

5. Discussion
This chapter will discuss noteworthy aspects of the results.

5.1 Batch Size
The most prominent factor for gaining throughput on the GPU was the batch size. As can be seen in Figures 5-7, the throughput got consistently higher with the batch size until it tapered off. This is expected behavior, since a larger batch size will utilize the threads on the GPU more efficiently up to a certain point. This point represents near-maximum concurrent resource usage on the GPU.

At low batch sizes, the throughput of the GPU is quite low, even lower than that of the CPU. This shows that when the parallel power of the GPU is not used, it gets outperformed by the serial performance of the CPU. The minimum batch size needed to perform better than the CPU differs between platforms, but all platforms reached higher throughputs at some point. The desktop reached a higher throughput at a batch size of 4, for both the expected best and worst case. All configurations of the AGX matched or exceeded the throughput of the CPU at a batch size of 32, for both the expected best and worst case. All configurations of the TX2 matched or exceeded the throughput of the CPU at a batch size of 128, for both the expected best and worst case.

5.2 Latency Sacrifice
In a real-time scenario, the process of batching requires waiting for more samples to be gathered. This increases the latency, in the sense that it takes longer to go from receiving an input to getting an output. This is of course only a problem in cases where real-time signal processing with low latency is required. For non-real-time processing, a higher batch size will almost always result in faster processing of samples, depending on the size of the data that needs to be processed.

5.3 Pulse Activity Dependence
A common pattern in the throughput results is the impact of the worst-case data. The worst-case data can reduce the throughput by as much as an order of magnitude compared to the best-case data. The reason for this can be seen in Figure 8 and Figure 9. Figure 8 shows that the largest bottleneck is the memory transfer from host to device, while all the other parts of the DP-chain are by comparison not as time-consuming. On the other hand, Figure 9 shows that the pulse detector is a major bottleneck even compared to the memory transfer. This can be explained by the fact that the detector is the only part of the DP-chain whose performance depends on what is in the sample data. Therefore a lot of pulse activity, especially long pulses, causes the detector to be a major bottleneck. Figure 8 and Figure 9 only show the extreme expected cases of worst and best data, so depending on the application and pulse activity, the average throughput may vary between these cases.

5.4 Throughput Deviation
When running the GPU tests, a large deviation in performance was found from test to test, usually in one or two of the batch sizes. This deviation shows itself in, e.g., Figure 9 and explains why the detector is not completely linear. The reason for this deviation is difficult to determine without scheduling the GPU manually, but it could for instance be that the monitor was rendered during that exact time, or that the GPU overheated and had to throttle for a while.

5.5 Jetson Optimizations
Since the DP-chain was ported directly from the desktop to the Jetson systems without code modification, there may be hardware-specific optimizations that could increase performance. One such optimization is to utilize the fact that the integrated GPU shares memory with the CPU, meaning that the memory transfers are not needed. Considering that there is only one major memory transfer in the final DP-chain algorithm, this should not affect the results presented here in a significant way, but it shows that there are possibilities for platform-specific optimizations.
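
As an illustration, CUDA managed memory is one way to exploit the shared physical memory and drop the explicit host-to-device copy (a sketch; the actual gains are platform-dependent and were not measured in this work):

    #include <cstddef>
    #include <cuda_runtime.h>

    // On the Jetson platforms the CPU and GPU share physical memory,
    // so cudaMallocManaged gives both sides access to one allocation
    // and the explicit cudaMemcpy of input samples can be dropped.
    float* allocateSampleBuffer(std::size_t nSamples) {
        float* buf = nullptr;
        cudaMallocManaged(&buf, nSamples * sizeof(float));
        return buf;  // fill from the CPU, read directly from GPU kernels
    }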

5.6 Power Mode Divergences
The following data is taken from the best-case test.

In Figure 6, the throughputs of the different power modes are presented for the AGX platform. The AGX Xavier, when limited to the 30 W power mode, has a maximum throughput of about 750 M samples per second, which is 25 M samples/s per watt. The 15 W power mode has a maximum throughput of about 410 M samples per second, equaling roughly 27 M samples/s per watt. The 10 W power mode has a maximum throughput of about 230 M samples per second, equaling roughly 23 M samples/s per watt. The AGX Xavier in no-limit mode has a reference power of around ~50 W [13] and a maximum throughput of about 1.1 B samples per second, equaling roughly 22 M samples/s per watt.

In Figure 7, the throughputs of the different power modes are presented for the TX2 platform. The TX2, when running in the 15 W power mode, has a maximum performance of 260 M samples per second, equaling 17.3 M samples/s per watt. The 7.5 W power mode has a maximum performance of 214 M samples per second, equaling 28.5 M samples/s per watt. The TX2 in no-limit mode has a reference power of around ~40 W and a maximum throughput of about 309 M samples per second, reaching about 7.7 M samples/s per watt.

As seen through this analysis (Table 1), the AGX seems to lower its capabilities proportionally with the power supplied. The discrepancy seen in this case could be attributed to the throughput deviation of the test results. For the TX2, however, there seems to be only a small loss of capability for a relatively big decrease in power supplied. The CPU in Table 1 uses its TDP of 65 W as reference power consumption and a maximum throughput of 4.6 M samples per second.

Platform             AGX      AGX 30W  AGX 15W  AGX 10W  TX2      TX2 15W  TX2 7.5W  CPU*
Samples/s per watt   22 M     25 M     27 M     22 M     7.7 M    17.3 M   28.5 M    70 K
Max samples/s        1129 M   754 M    417 M    231 M    309 M    258 M    214 M     4.6 M

Table 1. The table shows comparisons of the different platforms and power modes in terms of sample throughput. The AGX and TX2 columns represent the no-limit mode of the respective platforms. Max samples/s is the highest throughput reached during the benchmarks. *The CPU uses the TDP (65 W) as reference power.

5.7 Criticism of Methodology
The fact that the detector is such a large bottleneck is a problem that should be possible to solve. Given more time, it could have been rewritten to utilize reduction methods that are commonly performed on GPUs. Thus, with some more study of how to solve reduction problems on the GPU, a more effective solution could perhaps have been made.

The benchmark results presented are each from one run of the 4.2 M test samples. This introduces some uncertainty around the results. A more accurate method would have been to run the benchmark multiple times for each batch size and present the average of those runs instead. It would also have been appropriate to include standard deviations in that case.

Another criticism of the method is the lack of a real-life test. The current tests use an expected best-case and an expected worst-case test, which show an accurate range of the capabilities of the parallelized program. However, they do not show a real-life application, i.e., a sample set based on real data. It is therefore impossible to say whether the program will run closer to the best-case test or the worst-case test in terms of performance.

5.8 Credibility of Sources
Some of the sources are from peer-reviewed journals or conferences, making them trustworthy. Other sources, such as NVIDIA's own documentation, are deemed trustworthy as the content regards their own products and programming platform. One of the sources comes from an NVIDIA employee in a forum. Because it is an NVIDIA employee, in the NVIDIA forums, making the claim, it is deemed trustworthy. One of the sources is taken from arXiv, an open-access site for scientific papers. While there is not a full peer review of every article, the papers go through approval by moderators; it is thus deemed reliable. The final questionable source is an article from "The Collaboration for Astronomy Signal Processing and Electronics Research", CASPER for short. This collaboration consists of multiple university institutions and is thus deemed reliable.

5.9 Ethics
As this thesis has mainly contributed by investigating whether a part of the signal processing chain could be parallelized on a GPU, followed by actually parallelizing it, there are not many ethical questions to discuss. However, as seen in Table 1, the CPU has quite poor performance per watt compared to the Jetson cards. Using the algorithm in this report would thus save power.

6. Conclusion
Every part of the digital signal processing chain can be parallelized, although each part shows a different amount of improvement. The parallelized digital signal processing chain is only beneficial in terms of throughput if many samples are batched at a time. The amount of batching required to show improvement compared to the non-parallelized implementation varies depending on the platform, but on all platforms tested, the parallelized version shows improvement after a certain batch size. Using a large batch size will cause a larger latency from input to output, which may affect real-time applications that need fast response times. Therefore, a tradeoff between latency and throughput may be important to consider.

The parts of the digital signal processing chain showed different amounts of improvement, which often reflects how well that part of the parallelized algorithm manages to use all available GPU threads. All parts show very significant improvements except for the pulse detector, which in the worst case can become quite linear. Therefore, future work to improve the pulse detector might be possible with an algorithm that better utilizes the concurrency of the GPU. This may be possible with common reduction algorithms.

Another future work opportunity is to make platform-specific optimizations on the Jetson hardware. For example, a reduction of memory transfers may be possible since these embedded systems have an integrated GPU that shares memory with the CPU. Additionally, other platform-specific optimizations would be interesting to identify.

References
1. A. HajiRassouliha, A. J. Taberner, M. P. Nash, P. M. F. Nielsen, "Suitability of recent hardware accelerators (DSPs, FPGAs and GPUs) for computer vision and image processing algorithms", Signal Process. Image Commun., vol. 68, pp. 101-119, Oct. 2018.
2. R. S. Perdana, B. Sitohang and A. B. Suksmono, "A survey of graphics processing unit (GPU) utilization for radar signal and data processing system", 2017 6th International Conference on Electrical Engineering and Informatics (ICEEI), Langkawi, 2017, pp. 1-6, doi: 10.1109/ICEEI.2017.8312430.
3. P. Monsurro, A. Trifiletti, F. Lannutti, "Implementing radar algorithms on CUDA hardware", 2014 MIXDES, pp. 455-458, 19-21 June 2014.
4. James W. Cooley, John W. Tukey, "An Algorithm for the Machine Calculation of Complex Fourier Series", Mathematics of Computation, vol. 19, no. 90, pp. 287-301, April 1965.
5. NVIDIA Jetson Xavier System-on-Module Data Sheet (retrieved 2020-02-10). Available at: https://developer.nvidia.com/embedded/downloads#?search=Jetson%20AGX%20Xavier&tx=$product,jetson_agx_xavier
6. Jetson TX2 Series Module Data Sheet (retrieved 2020-05-05). Available at: https://developer.nvidia.com/embedded/downloads#?search=Data%20Sheet
7. William Stallings, "Computer Organization and Architecture: Designing for Performance", 10th edition, Pearson Education.
8. CUDA programming API documentation. Available at: https://docs.nvidia.com/cuda/
9. NVIDIA Fermi GPU Architecture whitepaper. Available at: https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
10. NVIDIA cuFFT library API documentation. Available at: https://docs.nvidia.com/cuda/cufft/index.html
11. Gerhard Heinzel, Albrecht Rüdiger and Roland Schilling, "Spectrum and spectral density estimation by the Discrete Fourier transform (DFT), including a comprehensive list of window functions and some new flat-top windows", Internal Report, Max-Planck-Institut für Gravitationsphysik, Hannover, 2002.
12. F. García-Rial, L. Úbeda-Medina and J. Grajal, "Real-time GPU-based image processing for a 3-D THz radar", IEEE Trans. Parallel Distrib. Syst., vol. 28, no. 10, pp. 2953-2964, Oct. 2017.
13. AGX Watt Design Reference (retrieved 2020-05-26). Available at: https://forums.developer.nvidia.com/t/xavier-power-consumption-information-solved/64595/5
14. "The Polyphase Filter Bank Technique". Available at: https://casper.ssl.berkeley.edu/wiki/The_Polyphase_Filter_Bank_Technique
15. Yaniv Rubinpur and Sivan Toledo, "High-Performance GPU and CPU Signal Processing for a Reverse-GPS Wildlife Tracking System", arXiv preprint arXiv:2005.10445 (2020).
