
ABSTRACT
Formation flying synthetic aperture radar (FF-SAR) systems, an important development direction of
multichannel SAR, can achieve high-resolution wide-swath imaging. Coherently combining data from
multiple satellite receivers strains traditional real-time processing systems that are based on individual
satellites. The power consumption of a real-time on-orbit processing platform must be properly balanced
against constrained memory and parallel computational resources. This article proposes a distributed SAR
real-time imaging method based on embedded graphics processing units (GPUs). A parallel computing
method for the chirp scaling algorithm is designed on the parallel programming model of the compute
unified device architecture, and memory and performance optimization methods are proposed for the
hardware architecture of embedded GPUs. In particular, unified memory management is used
to avoid data copying and communication delays between the CPU and GPU. A hardware verification
system for distributed SAR real-time imaging processing based on multiple embedded GPUs is constructed.
The proposed algorithm takes 5.86 s to process single-precision floating-point complex imaging with a data
size of 8192 × 8192 on a single Jetson Nano platform. The actual power consumption is less than 5 W, and
the performance-to-power ratio is greater than 1.7%. The experimental results show that the real-time
processing method based on embedded GPUs proposed in this article achieves high performance with low
power consumption.
Keywords: Chirp scaling (CS) algorithm, distributed architecture, embedded graphics processing unit
(GPU), on-orbit real-time processing, synthetic aperture radar (SAR).
LIST OF CONTENTS
No. Title
1 Introduction
2 Literature Survey
3 Distributed Real-Time Image Processing of Formation Flying SAR Based on Embedded GPUs
3.1 CS Imaging Algorithm
3.2 Design and Optimization of a Distributed SAR Real-Time Imaging System
3.2.1 Hardware System Design
3.2.2 CS Algorithm Rescheduling
3.2.3 CUDA Program Optimization
4 Results and Discussion
4.1 Experimental Results
4.2 Performance and Evaluation
5 Conclusion
6 Future Scope
7 References
LIST OF FIGURES
No. Title
1 Flowchart of CS imaging algorithm
2 Schematic diagram of FF-SAR system with four satellites
3 Distributed system architecture with four embedded GPUs
4 Flowchart of CUDA implementing CS algorithm
5 Heterogeneous architecture models in GPUs. (a) Discrete architecture
6 Heterogeneous architecture models in GPUs. (b) Integrated architecture
7 Schematic diagram of the data processing of CS imaging algorithm in the distributed processing system. (a) Schematic diagram of the first stage of data processing. (b) Schematic diagram of the second stage of data processing. (c) Schematic diagram of the third stage of data processing
8 Imaging results of GF-3 raw data of 8192 × 8192 points by implementing CS algorithm on different platforms. (a) Jetson Nano imaging results. (b) Jetson AGX Orin imaging results. (c) RTX 2060 Max-Q imaging results. (d) MATLAB imaging results
9 Distributed simulation verification system based on the embedded GPUs
10 Schematic diagram of the data transmission pipeline between the master node and slave node at each stage
LIST OF TABLES
No. Title
1 HARDWARE SYSTEM PARAMETERS OF THE EXPERIMENTAL PLATFORMS
2 EXECUTION TIME OF KERNEL FUNCTION AND MEMORY COPY TIME ON DIFFERENT EXPERIMENTAL PLATFORMS
3 COMPARISON BETWEEN THE PROPOSED ARCHITECTURE WITH PREVIOUS SOLUTIONS IMPLEMENTING CS ALGORITHM
Distributed Real-Time Image Processing of Formation Flying SAR Based on Embedded GPUs
Chapter 1
Introduction
1.1 Overview
Spaceborne synthetic aperture radar (SAR) systems provide high-resolution, all-time, all-weather
ground observation capabilities. They are therefore widely used in important fields such as disaster
monitoring, resource exploration, and environmental protection.
Formation flying synthetic aperture radar (FF-SAR) is a new operational mode used to obtain
high-resolution wide-swath SAR images. An FF-SAR system usually consists of a set of very compact,
lightweight satellite platforms. Such a formation has a lower overall cost, makes faulty satellites easier to
replace, and is more adaptable to future fast and flexible launch missions. The TanDEM-X mission and the
CanX-4&5 formation mission have successfully demonstrated important capabilities in this area.
A CubeSat train has been proposed for high-resolution radar detection and imaging missions in
Antarctica. Its formation consists of 50 CubeSats, and the coherent combination of radar echoes collected
by all platforms is expected to guarantee high cross-orbital resolution, demonstrating the significant
potential of FF-SAR for future applications. It is foreseeable that, in future FF-SAR missions, the data
volume to be processed will increase greatly. The processing power of each satellite is limited by
constraints on satellite size and power consumption, so on-orbit real-time imaging processing systems
based on single satellites face considerable pressure. Therefore, it is necessary to explore an on-orbit
real-time imaging processing system suited to the FF-SAR mode.
A multi-satellite distributed data processing system has also been proposed. It effectively reduces the
processing pressure on a single satellite by reasonably assigning the computational tasks of the SAR imaging
algorithm to multiple satellites. That system takes a field programmable gate array (FPGA)
as the core processing unit. FPGAs are attractive for on-orbit real-time processing systems because they can
meet the requirements for high performance and low power consumption, and flexibility is their main
advantage. However, the pursuit of computational accuracy requires floating-point operations,
which consume a large share of the available computational resources.
As a high-performance platform, graphics processing units (GPUs) are often used to accelerate the
processing of SAR imaging.
Dept of ECE, RRCE 2023-24 1

The advantage of GPUs is their parallel computing capability. The challenge is that their performance
is limited by how fast data can be exchanged with them, so the coordination between data throughput and
computation needs to be optimized. Traditional GPUs are not feasible as on-orbit real-time processing
systems because their size and power consumption cannot meet the requirements. However, the emergence
of embedded GPUs in recent years has provided a new opportunity for many real-time data processing tasks.
Embedded GPUs have the advantages of high integration, low power consumption, and high
performance. Thanks to the compute unified device architecture (CUDA) programming method, the
development cycle is also short. Several studies have implemented SAR imaging on
embedded GPUs. In one, two SAR processing algorithms were implemented and tested on the Jetson TX1
platform, showing that these two algorithms run faster on the Jetson TX1 than on a CPU.
However, the overall optimization efficiency was limited because the open-source library ArrayFire
was used for parallel computation. Another study detailed SAR imaging on the Jetson TK1, but its
results suggest that transferring redundant data between the CPU and GPU consumes considerable
processing time. In fact, this data transfer between the CPU and GPU could have been avoided on the
embedded GPU. Notably, the feasibility of operating embedded GPUs on orbit has been verified.
Related studies have shown that embedded GPUs can provide considerable advantages for
computationally intensive data processing in low earth orbit applications. Therefore, embedded GPUs have
excellent application prospects in short-term tasks. This article proposes a distributed SAR imaging method
based on embedded GPUs for FF-SAR systems. The proposed method scales to different embedded
GPU platforms, and the number of GPUs is also flexible. The processing of the chirp scaling (CS)
algorithm has been rescheduled to suit distributed SAR imaging systems. To maximize the
processing performance of the embedded GPU, corresponding optimization methods are proposed for
CUDA parallel computing.
Finally, a distributed simulation system based on embedded GPUs is constructed, and its processing
performance is verified using raw data from Gaofen-3 (GF-3). The rest of this article is organized as
follows. Section 3.1 introduces the CS imaging algorithm. Section 3.2 introduces the design and
optimization of a distributed SAR real-time imaging system. Chapter 4 gives the experimental
results and discussion.

Chapter 2
Literature Survey
2.1 Overview of Literature Survey
A literature survey, often a critical component of research endeavours, involves an extensive
review and analysis of existing academic or scholarly literature related to a specific topic or field of study.
It serves as a foundational step in research, offering a comprehensive understanding of existing
knowledge, theories, methodologies, and gaps in the subject area. It provides context by summarizing the
current state of knowledge, theories, and findings related to the chosen research topic. This context helps
situate the new research within the broader landscape of existing work.
By analysing existing literature, a literature survey helps identify gaps, inconsistencies, or areas
where further research is needed. These gaps could be in knowledge, methodology, or unresolved
questions within the field. It outlines the trends, patterns, and ongoing debates or controversies in the
field. This helps researchers understand differing viewpoints and directions of research. It assesses various
methodologies and approaches used in prior studies, providing insights into the strengths and limitations
of different research methods. A literature survey justifies the significance of the new research by
demonstrating how it builds upon or contributes to the existing body of knowledge.
It informs the design of the new study, guiding the research questions, hypotheses, experimental
design, and methodologies. It offers a framework for analysing and interpreting the new data or findings
in light of what is already known in the field.
2.2 Base Papers
2.2.1 Real-time processing of spaceborne SAR data with nonlinear trajectory based on variable PRF
Author Names: Yew Lam Neo, Frank H. Wong, Ian G. Cumming
Base Paper Methodology: This paper proposes a real-time processing approach for spaceborne synthetic
aperture radar (SAR) data with nonlinear trajectories and variable pulse repetition frequencies (PRFs). The
methodology involves:

1) Utilizing a modified range-Doppler algorithm that accounts for the varying PRF along the azimuth
dimension.
2) Incorporating a trajectory estimator to estimate the nonlinear sensor trajectory based on the range-
compressed data.
3) Performing azimuth compression with the estimated nonlinear trajectory, enabling accurate focusing of
the SAR data.
4) Implementing the processing on a real-time computing platform, demonstrating the feasibility of on-
board processing for spaceborne SAR systems with nonlinear trajectories and variable PRFs.
The key aspects are handling variable PRFs, estimating nonlinear trajectories from the data itself, and
integrating these into a real-time processing chain for on-board SAR data focusing.
2.2.2 Detecting ships in the New Zealand exclusive economic zone: Requirements for a dedicated
small-sat SAR mission
Author Names: J. Krecke, M. Villano, N. Ustalli, A. C. M. Austin, J. E. Cater, and G. Krieger
Base Paper Methodology: This paper discusses the requirements for a dedicated small satellite synthetic
aperture radar (SAR) mission to detect ships within New Zealand's exclusive economic zone (EEZ). The
methodology involves:
1) Analysing the maritime traffic patterns and ship density in the New Zealand EEZ to determine the
required SAR imaging capabilities.
2) Evaluating the performance of different SAR modes (Stripmap, ScanSAR, and TOPSAR) in terms of
resolution, swath width, and coverage rate for ship detection.
3) Assessing the feasibility of using a small satellite platform with a compact SAR payload for this
mission.
4) Determining the optimal orbit parameters, such as altitude and inclination, to achieve the desired
coverage and revisit times.
5) Investigating the use of advanced signal processing techniques, like ship detection algorithms and
constant false alarm rate (CFAR) detectors, to improve ship detection accuracy.
The primary focus is on defining the technical requirements, including SAR modes, satellite platform, and
orbit design, to enable a dedicated small-sat SAR mission for maritime surveillance and ship detection
within the New Zealand EEZ.

2.2.3 Assessments of ocean wind retrieval schemes used for Chinese Gaofen-3 synthetic aperture
radar co-polarized data
Author Names: Yang Zhang, Xiao-Ming Li, Ke-Xin Zhang, Qi Yang, Wei Yang
Base Paper Methodology: This paper evaluates different ocean wind retrieval schemes for the co-polarized
synthetic aperture radar (SAR) data from China's Gaofen-3 satellite. The methodology involves:
1) Collecting and preprocessing Gaofen-3 SAR co-polarized data and corresponding wind measurements
from buoys or numerical models.
2) Implementing and assessing the performance of several ocean wind retrieval algorithms, including
empirical models (CMOD5.N, CMOD-IFR2), semi-empirical models (DWAV-GS, XWAVE), and
physical models (RFSCAT).
3) Evaluating the accuracy of the retrieved wind speeds and directions from these models by comparing
them with the ground truth data from buoys or models.
4) Analyzing the effects of various factors, such as wind speed range, incidence angle, and polarization,
on the retrieval performance of different models.
5) Identifying the most suitable wind retrieval scheme(s) for Gaofen-3 SAR co-polarized data based on
the accuracy assessments and specific application requirements.
The main objective is to comprehensively evaluate and compare the capabilities of different wind retrieval
algorithms in estimating ocean wind fields accurately from the Gaofen-3 SAR co-polarized data,
accounting for various environmental and sensor-related factors.
2.2.4 Spaceborne demonstration of distributed SAR imaging with TerraSAR-X and TanDEM-X
Author Names: Gerhard Krieger, Nico Adam, Mohsen Younis, Marc Rodriguez-Cassola, Pau Prats,
Marco Antweiler
Base Paper Methodology: This paper describes a spaceborne demonstration of distributed synthetic
aperture radar (SAR) imaging using the TerraSAR-X and TanDEM-X satellites. The methodology involves:
1) Developing a distributed SAR imaging concept, where the two satellites act as a large single-pass
interferometric SAR system with an adjustable baseline.
2) Implementing a bi-static synchronization link between TerraSAR-X (transmitter) and TanDEM-X
(receiver) to ensure precise timing and phase synchronization.
3) Conducting experiments with various baseline configurations, ranging from a conventional along-track
interferometric mode to a pendulum mode with large cross-track baselines.
4) Processing the bi-static SAR data collected by the two satellites using specialized distributed SAR
imaging algorithms.
5) Analyzing the focused bi-static SAR images and interferometric products to assess the performance of
the distributed SAR imaging concept.
6) Demonstrating the potential for enhanced capabilities, such as improved spatial resolution, suppressed
ambiguities, and extended imaging opportunities, compared to conventional monostatic SAR systems.
The key aspects are the synchronization between the two satellites, the implementation of distributed SAR
imaging algorithms, and the evaluation of the obtained bi-static SAR images and interferometric products
to validate the concept and its advantages.
2.2.5 Compact and free-floating satellite MIMO SAR formations
Author Names: Sigurd Huber, Marc Rodriguez-Cassola, Paco López-Dekker, Jaan Praks, Marwan
Younis, Gerhard Krieger
Base Paper Methodology: The paper proposes a multiple-input multiple-output synthetic aperture radar
(MIMO SAR) concept using a compact, free-floating satellite formation without precise baseline
requirements. Key aspects include:
1) Developing signal processing techniques to handle arbitrary satellite positions.
2) Implementing multi-channel processing like interferometry and digital beamforming.
3) Analyzing potential advantages like improved resolution, wide-swath imaging, and increased
sensitivity.
4) Investigating feasibility considerations like formation control and synchronization.
5) Exploring applications such as high-resolution mapping and moving target indication.
The focus is on exploiting a free-floating compact satellite formation for MIMO SAR capabilities through
advanced signal processing methods.

Chapter 3
Distributed Real-Time Image Processing of Formation Flying SAR Based
on Embedded GPUs
3.1 CS Imaging Algorithm
In FF-SAR systems, both the resolution accuracy and the computational complexity of the imaging
algorithm must be considered. The CS algorithm is a high-precision imaging algorithm with low
computational complexity that is widely used in spaceborne SAR imaging. The specific process of the
CS algorithm is shown in Fig. 1. The critical processing of the CS algorithm includes three steps: CS
operation, range pulse compression, and azimuth pulse compression [28], [29]. Let $\hat{t}$ and $t_m$
denote the fast time and slow time, respectively, and let $f_r$ and $f_a$ denote the frequencies
corresponding to the fast time (range frequency) and the slow time (Doppler frequency), respectively.
First, an azimuth fast Fourier transform (FFT) is performed to carry the raw data into the range-Doppler
domain. Then, the data array is multiplied by the function $H_1(\hat{t}, f_a; R_s)$ in the
range-Doppler domain.
Fig. 1. Flowchart of CS imaging algorithm

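Next, a range FFT carries the data into the two-dimensional frequency domain, where a second phase
multiplication completes range compression and range migration correction. The range inverse fast
Fourier transform (IFFT) then collapses the signal to the focused range envelope at the correct range
position and returns the data to the range-Doppler domain. There, a further phase multiplication
compensates for the remaining phase and implements azimuth compression, and an azimuth IFFT yields
the focused image.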

The CS algorithm uses phase multiplication instead of interpolation to complete range migration
correction. To make the range migration trajectories of all targets uniform, the CS operation
eliminates the space-varying characteristics of range migration, so that the remaining range migration
can be corrected uniformly for all scattering points. The CS algorithm therefore requires no interpolation
operations and can perform accurate image processing using only complex multiplications and
FFT/IFFT.
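The idea that a position correction can be done by phase multiplication rather than interpolation rests on the Fourier shift theorem. The sketch below is only an analogy for that principle, not the CS phase functions themselves: it circularly delays a short test pulse by an integer number of samples using a forward transform, a linear phase ramp, and an inverse transform. The naive `dft` helper, the function names, and the test pulse are all illustrative assumptions; a real implementation would use an FFT library.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT library)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
                  for m in range(n))
        out.append(acc / n if inverse else acc)
    return out

def shift_by_phase(x, delay):
    """Circularly delay x by an integer `delay` using transform, multiply, inverse."""
    n = len(x)
    spectrum = dft(x)
    # Linear phase ramp: multiplying bin k by exp(-j*2*pi*k*delay/n)
    # moves every sample `delay` positions later, with no interpolation.
    ramp = [cmath.exp(-2j * cmath.pi * k * delay / n) for k in range(n)]
    return dft([s * r for s, r in zip(spectrum, ramp)], inverse=True)

pulse = [0, 0, 1, 2, 1, 0, 0, 0]      # hypothetical test signal
shifted = shift_by_phase(pulse, 3)    # same pulse, three samples later
print([round(abs(v), 6) for v in shifted])
```

In the CS algorithm the phase functions are not simple linear ramps, but the structure is the same: every correction is an element-wise complex multiplication between transforms.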
3.2 Design and Optimization of a Distributed SAR Real-Time Imaging System
Fig. 2. Schematic diagram of FF-SAR system with four satellites.
In this section, a real-time imaging processing system adapted to the FF-SAR mode is proposed. The
system is a distributed architecture based on multiple embedded GPUs. The section covers the
hardware architecture of the system, the rescheduled CS algorithm, and the parallel computing
optimization method. First, the distributed hardware system is introduced. The hardware architecture of
this system is scalable. For convenience of description, this section takes an FF-SAR system consisting of
four embedded GPUs as an example. Second, the processing of the CS algorithm is rescheduled so that it
can be applied to distributed hardware systems. Finally, the CUDA program optimization method for
parallel computing of the CS algorithm on embedded GPUs is introduced.

Fig. 3. Distributed system architecture with four embedded GPUs.
3.2.1 Hardware System Design

The FF-SAR real-time imaging processing system uses multiple satellites as processing units
for collaborative processing, as shown in Fig. 2. One satellite is needed as the master node to complete
the data division and splicing operations. The distributed architecture based on four embedded GPUs is
shown in Fig. 3. The master node is the main control unit; it divides the data and assigns the CS task
operations throughout the processing. The slave nodes carry out the cooperative imaging process according
to the commands of the master node. Fiber optic cables are used in the simulation system to simulate
intersatellite laser communication. The master and slave nodes are connected through optical-fiber-to-PCIe
(OTP) modules, and quad small form-factor pluggable (QSFP) links are used for communication between
the OTP modules.
3.2.2 CS Algorithm Rescheduling
Fig. 4. Flowchart of CUDA implementing CS algorithm.

In the FF-SAR mission, multiple nodes can be employed to jointly process radar data. The
processing flow therefore differs from the case where the entire CS algorithm is performed on a single
embedded GPU. The processing tasks of the different stages of the CS algorithm need to be rescheduled so
that the four embedded GPUs can cooperate to complete the processing. Coprocessing on multiple
embedded GPUs reduces the data volume processed by each embedded GPU and improves processing
efficiency.
Fig. 4 shows the flowchart of the GPU implementation of the CS algorithm. The CS algorithm is
decomposed into three stages for convenience in describing the data flow and processing flow of each
stage.
Fig. 5. Heterogeneous architecture models in GPUs. (a) Discrete architecture.
In the first stage, as shown in Fig. 7(a), the master embedded GPU performs a transposition operation
to obtain data arranged in the azimuth direction. The data are then evenly divided into four parts in the
azimuth direction, and each part is stored contiguously in the azimuth direction. One part of the data is
retained by the master node, and the remaining three parts are sent to the slave nodes through optical
fibers. The master and slave nodes perform a 1-D azimuth FFT to carry the data into the range-Doppler
domain. The CS phase factor, which changes the frequency scale of the linear frequency modulation, is
then calculated, and the data are multiplied by this factor so that the range curvature of all targets
becomes uniform. The data processing steps executed on the master node and the slave nodes are
independent and run in parallel. After the slave nodes complete the first data processing stage, the data are
sent back to the master node. Finally, the master node splices the received data in sequence.

Fig. 6. Heterogeneous architecture models in GPUs. (b) Integrated architecture.

The processing flow of the second stage is shown in Fig. 7(b). First, the data are transposed on the
master node so that they are arranged contiguously in the range direction. Then, the master node divides
the data into four equal parts in the range direction. One part is retained, and the rest are allocated to the
slave nodes. After a range FFT carries the data into the two-dimensional frequency domain, the phase
factor for range compression and range migration correction is calculated. Multiplying the data by this
phase factor completes the range pulse compression operation. Finally, a 1-D IFFT is performed to convert
the data into the range-Doppler domain. After the slave nodes complete the second stage of data
processing, the data are sent back to the master node, which stitches the received data in sequence.
The data flow of the third stage is shown in Fig. 7(c). First, the master node performs a transposition
operation to obtain data arranged in the azimuth direction. Then, the master node evenly divides the data
into four equal parts in the azimuth direction. One part is retained, and the rest are distributed to the slave
nodes. The master and slave nodes calculate the phase factor that compensates for the remaining phase and
performs azimuth compression. Multiplying by this phase factor completes azimuth pulse compression.
Finally, an IFFT is performed on the data. The processed data are returned from the slave nodes to the
master node, which stitches the received data to obtain the final image data.
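The divide, process, and stitch pattern of the three stages can be sketched on the host side as follows. This is a minimal model under stated assumptions, not the report's implementation: threads stand in for the four nodes, plain Python lists stand in for the SAR data matrix, and the three per-line operations (`running_sum`, a reversal, and a scaling) are hypothetical placeholders for the real FFT and phase-multiplication steps. All function names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def transpose(matrix):
    """Return the transposed matrix so the other dimension becomes contiguous."""
    return [list(col) for col in zip(*matrix)]

def split_rows(matrix, parts):
    """Divide the rows evenly into `parts` contiguous blocks."""
    if len(matrix) % parts:
        raise ValueError("row count must be divisible by the number of nodes")
    size = len(matrix) // parts
    return [matrix[i * size:(i + 1) * size] for i in range(parts)]

def run_stage(matrix, process_line, nodes=4):
    """One stage: divide the data, process each block on its own node, stitch."""
    blocks = split_rows(matrix, nodes)
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        # pool.map keeps the block order, so stitching is a concatenation.
        done = list(pool.map(lambda blk: [process_line(row) for row in blk],
                             blocks))
    return [row for blk in done for row in blk]

def three_stage(data, ops, nodes=4):
    """Transpose before every stage so each node reads whole, contiguous lines."""
    m = run_stage(transpose(data), ops[0], nodes)   # stage 1: azimuth lines
    m = run_stage(transpose(m), ops[1], nodes)      # stage 2: range lines
    return run_stage(transpose(m), ops[2], nodes)   # stage 3: azimuth lines

def running_sum(line):
    """Hypothetical per-line operation that depends on the whole line."""
    total, out = 0, []
    for v in line:
        total += v
        out.append(total)
    return out

data = [[r * 8 + c for c in range(8)] for r in range(8)]
ops = [running_sum, lambda line: line[::-1], lambda line: [2 * v for v in line]]
distributed = three_stage(data, ops, nodes=4)
single = three_stage(data, ops, nodes=1)
print(distributed == single)
```

Because every stage only operates along complete lines, splitting the lines across nodes and concatenating the blocks in order gives the same result as processing everything on one node, which is what makes the rescheduling possible.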

3.2.3 CUDA Program Optimization
Embedded GPUs are programmed with CUDA in the same style as traditional GPUs. The hardware
architecture differs, however: the memory space of an embedded GPU is generally small, so memory use
on the embedded GPU must be optimized. Since the CUDA programming model requires the CPU and
GPU to work together, the CS algorithm needs to be decomposed into two parts, one suitable for GPU
parallel computing and one for CPU serial execution.
The main steps in the CS algorithm include matrix transposition operations, FFT operations, IFFT
operations, and phase multiplications.
The FFT and IFFT operations are highly parallel, and the phase (matrix) multiplications are also well
suited to parallel computing. The following optimization methods are adopted in this article.
1) Unified memory management: As shown in Fig. 5, the heterogeneous computing architecture of a
traditional GPU and CPU is generally discrete. The GPU and CPU have separate memories,
and data must be transmitted over the PCIe bus. The heterogeneous computing architecture of an
embedded GPU, however, is an integrated architecture. As shown in Fig. 6, the CPU and GPU share the
same physical memory, and there is no need for data transmission over the PCIe bus.
Therefore, the use of unified memory management avoids duplicate memory allocation and
data transmission and effectively improves the performance of embedded GPUs.
2) Memory reuse: Because the memory resources of embedded GPUs are limited, memory reuse is
adopted in addition to unified memory to further avoid wasting memory space. Address space must be
allocated and freed when calling the cuFFT library, and the time spent on allocating and freeing address
space can even exceed the time of the FFT operation itself. Address space is therefore allocated
only on the first call to the cuFFT library and freed after all cuFFT calls are complete, which is an efficient
means of memory reuse. In-place transposition of the matrix is another method of memory
reuse. The transposed matrix overwrites the address space of the matrix before transposition, so memory
space is saved.
3) Aligned and coalesced (merged) access: Global memory is the largest and most frequently used
memory in GPUs, and most applications are susceptible to memory bandwidth limitations. Therefore,
maximizing the use of global memory bandwidth is the key to optimizing the performance of kernel
functions. Unaligned and uncoalesced memory access wastes bandwidth and reduces the GPU memory
access speed. Matrix transposition can be used to obtain aligned and coalesced memory access. During
azimuth processing, the data are stored contiguously in the azimuth direction. During range processing, the
data are stored contiguously in the range direction, which improves the efficiency with which the processor
reads and writes data in memory.
4) Shared memory: Latency and bandwidth are the major factors when optimizing memory
performance. Shared memory can be used to reduce the effects of global memory latency and bandwidth on
performance. Bank conflicts need to be avoided when using shared memory; otherwise, memory
access efficiency is reduced. If two addresses of a memory request fall in the same memory bank,
there is a bank conflict and the access has to be serialized. Memory padding can avoid bank conflicts:
when declaring shared memory, extra space is padded so that the memory addresses to be accessed fall in
different banks.
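The memory-related points in items 2) to 4) can be illustrated with a small host-side model. The sketch below is not GPU code and makes several assumptions that are labeled in the comments: a square matrix for the in-place transposition, row-major storage for the stride calculation, and a shared memory with 32 banks that are each one 4-byte word wide (the common NVIDIA layout, taken here as an assumption; complex samples occupy two words and would need a matching adjustment). All function names are illustrative.

```python
BANKS = 32  # assumption: 32 shared-memory banks, each one 4-byte word wide

def transpose_in_place(a):
    """Item 2: transpose a square matrix without allocating a second buffer."""
    n = len(a)
    for i in range(n):
        for j in range(i + 1, n):
            a[i][j], a[j][i] = a[j][i], a[i][j]
    return a

def row_major_stride(cols, along_row):
    """Item 3: element distance between consecutive reads of one line."""
    # Reading along a row touches neighbouring addresses (stride 1);
    # reading down a column jumps a full row each time (stride = cols).
    return 1 if along_row else cols

def conflict_degree(tile_width, column, threads=32):
    """Item 4: most threads hitting one bank when they read one tile column."""
    banks = [(r * tile_width + column) % BANKS for r in range(threads)]
    return max(banks.count(b) for b in set(banks))

m = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
transpose_in_place(m)
print(m)                                                    # [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
print(row_major_stride(8192, True), row_major_stride(8192, False))  # 1 8192
print(conflict_degree(32, 0), conflict_degree(33, 0))       # 32 1
```

With a 32-word-wide tile, all 32 threads reading a column land in the same bank and are serialized. Padding each row by one word (width 33) spreads the same column across all 32 banks, which is the padding method described in item 4).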
Fig. 7. Schematic diagram of the data processing of CS imaging algorithm in the distributed processing system. (a)
Schematic diagram of the first stage of data processing. (b) Schematic diagram of the second stage of data processing.
(c) Schematic diagram of the third stage of data processing.

Chapter 4
Results and Discussion
To evaluate the processing performance of the embedded GPU in the FF-SAR task, 3-m-resolution
single-precision floating-point complex raw data from the GF-3 satellite were used. The experimental
platform is the Jetson Nano. For comparison, the same experiment was conducted on the NVIDIA Jetson
AGX Orin platform and the NVIDIA GeForce RTX 2060 Max-Q platform. Table I presents the hardware
parameters of all experimental platforms.
4.1 Experimental Results

Fig. 8 shows the final imaging results of the experimental platforms and the imaging results in
MATLAB. In the experiments with the CS algorithm on a single embedded GPU, the entire data processing
time, excluding data reading, is measured. With a data volume of 0.5 GB, it takes only about 5.86 s to
complete image processing on the Jetson Nano platform, and the power consumption does not exceed 5 W.
It takes 0.395 s to run the imaging algorithm on the Jetson AGX Orin platform, with a power
consumption of 60 W. In addition, the same experiment was conducted on the RTX 2060 Max-Q
platform using the same optimized CUDA program and data volume. It took 0.956 s to complete the entire
imaging process. GPU platforms for computers, such as the RTX 2060 Max-Q, provide
powerful performance, but their power consumption is generally very high. Therefore, they are not suitable
as real-time processing platforms on satellites. Embedded GPUs balance performance and power
consumption, making them suitable as the on-orbit real-time processing platform.
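The stated data volume and the relative speeds follow directly from the figures above. The short calculation below is a worked check, assuming that each single-precision complex sample occupies 8 bytes (a 4-byte real part and a 4-byte imaginary part) and that "0.5 GB" is meant in binary units; the variable names are illustrative.

```python
BYTES_PER_SAMPLE = 8  # assumption: 4-byte real part + 4-byte imaginary part

def raw_size_bytes(n_azimuth, n_range):
    """Size of a complex single-precision data matrix in bytes."""
    return n_azimuth * n_range * BYTES_PER_SAMPLE

size_bytes = raw_size_bytes(8192, 8192)
size_gib = size_bytes / 2**30

# Processing times in seconds, as reported in the text above.
times = {"Jetson Nano": 5.86, "Jetson AGX Orin": 0.395, "RTX 2060 Max-Q": 0.956}
throughput = {name: size_gib / t for name, t in times.items()}  # GiB per second
speedup_vs_nano = {name: times["Jetson Nano"] / t for name, t in times.items()}

print(size_bytes, size_gib)
for name in times:
    print(name, round(throughput[name], 3), round(speedup_vs_nano[name], 1))
```

Under these assumptions the 8192 × 8192 scene is exactly 0.5 GiB, and the Jetson AGX Orin finishes roughly 15 times faster than the Jetson Nano while drawing about 12 times the power quoted for it.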
The time the Jetson Nano spends executing the kernel functions and memory copies of the CS
algorithm is analyzed and compared with the results for the Jetson AGX Orin platform and the RTX 2060
Max-Q platform. The results are shown in Table II. Comparing the execution times of the different tasks on
the different experimental platforms shows that the kernel functions run faster on the Jetson AGX Orin
platform and the RTX 2060 Max-Q platform than on the Jetson Nano. This is related to the number of
CUDA cores and the GPU frequency.
Notably, the Jetson AGX Orin platform and the RTX 2060 Max-Q platform have comparable CUDA core
counts. However, the Jetson AGX Orin platform takes less time than the RTX 2060 Max-Q
platform, which is largely due to CUDA memory copy time. Since embedded GPUs use an integrated
heterogeneous architecture, the CPU and GPU share the same physical storage space.
There is no need to transfer data between the host and the device before and after the execution of
the kernel function. The CPU and GPU in the RTX 2060 Max-Q platform form a discrete architecture, and
data transfer between the CPU and GPU must use the PCIe bus.
Therefore, CUDA memory copies occupy a large share of the run time on the RTX 2060 Max-Q platform
and reduce processing performance. The Jetson Nano and Jetson AGX Orin platforms benefit from the
integrated architecture, saving the time of CUDA memory copies.

Fig. 8. Imaging results of GF-3 raw data of 8192 × 8192 points by implementing CS algorithm on different
platforms. (a) Jetson Nano imaging results. (b) Jetson AGX Orin imaging results. (c) RTX 2060 Max-Q imaging results.
(d) MATLAB imaging results
A distributed embedded GPU simulation system is built using four Jetson Nanos. In this
experiment, optical fiber communication between the data processing units is used to simulate
laser communication between satellites. The raw data used in the experiment are 16 384 × 16 384 points
of complex single-precision floating-point data. Fig. 9 shows the architecture of the distributed
embedded GPU simulation system. The system includes a raw data delivery module, embedded GPUs, and
OTP modules. The raw data delivery module simulates the sending process of the raw
data of the spaceborne SAR. Each data processing unit includes an embedded GPU and an OTP module,
which are connected through the PCIe bus.

The data are transmitted between OTP modules via optical fibers. The imaging result of
implementing the CS algorithm based on the distributed system is shown in Fig. 4. In the three stages of
the CS algorithm, after each data division, the data volume allocated to each node is the same. After the
data are divided on the master node, they are transmitted to the slave nodes for processing, and finally
the data are returned to the master node. In each stage, the data transfer pipeline is shown in Fig. 10.
Since data transfer is pipelined, transfer times can be overlapped. First, the three pieces of data on
the master node are transferred to the OTP module via the PCIe bus. Then, the data are transferred from
the OTP module to the slave nodes. Each slave node starts processing after it has received its complete
data. The processed data are transmitted from the slave node to the OTP module. Finally, the data
processed by each slave node are transmitted to the master node by the OTP module.
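The equal data division described above can be illustrated with a short sketch. It assumes the scene is
cut into row blocks (the text does not state the cutting direction for each stage), and the helper name
split_rows is hypothetical. The scene size and node count come from the text.

```python
# Sketch: split an N x N complex single-precision scene into equal row blocks,
# one per node. The row-wise cut and the helper name are assumptions.

BYTES_PER_SAMPLE = 8  # complex single precision: 4-byte real + 4-byte imaginary


def split_rows(n_rows, n_nodes):
    """Return equal (start, stop) row ranges, one per node."""
    if n_rows % n_nodes != 0:
        raise ValueError("rows must divide evenly among nodes")
    step = n_rows // n_nodes
    return [(i * step, (i + 1) * step) for i in range(n_nodes)]


N = 16384          # scene size used in the distributed experiment
blocks = split_rows(N, 4)

block_bytes = (blocks[0][1] - blocks[0][0]) * N * BYTES_PER_SAMPLE
total_bytes = N * N * BYTES_PER_SAMPLE

print(blocks)
print(block_bytes / 2**30, "GiB per node,", total_bytes / 2**30, "GiB in total")
```

Under these assumptions the master node keeps one block and sends the other three, which matches the
"three pieces of data" mentioned in the text.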
At this point, the OTP module must wait until the data block of slave node 1 has been completely
transferred into the module before it starts the next transmission. The fiber transmission rate is
5 Gbps, whereas the PCIe transfer rate is about 2 GB/s. Therefore, to avoid rate mismatches,
pipelined transmission is not used at this step. The processing time of each stage in the
distributed system was measured on the Jetson Nano, and the results are shown in Table III. The
distributed system takes about 4.5, 5.2, and 4.8 s to complete stages 1, 2, and 3, respectively.
Each figure includes the CPU scheduling time, data transfer time, and GPU parallel computing time of
that stage.
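To make the rate mismatch concrete, the following sketch converts both link rates quoted above to bytes
per second and estimates the transfer time of one block. The block size (a quarter of the 16 384 × 16 384
scene) is an assumption, and protocol overhead is ignored, so the numbers are rough lower bounds only.

```python
# Sketch: time to move one data block over each link, ignoring protocol overhead.

FIBER_BITS_PER_S = 5e9           # 5 Gbps fiber rate quoted in the text
PCIE_BYTES_PER_S = 2e9           # about 2 GB/s PCIe rate quoted in the text
BLOCK_BYTES = 4096 * 16384 * 8   # assumed block: one quarter of the scene, 8 bytes per sample

fiber_bytes_per_s = FIBER_BITS_PER_S / 8   # bits per second -> bytes per second

t_fiber = BLOCK_BYTES / fiber_bytes_per_s
t_pcie = BLOCK_BYTES / PCIE_BYTES_PER_S

print(f"fiber: {t_fiber:.3f} s, PCIe: {t_pcie:.3f} s, ratio: {t_fiber / t_pcie:.1f}")
```

With these values the fiber link is the slower side by a factor of about 3.2, which illustrates why a
block arriving over PCIe cannot simply be streamed onward at the same pace.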
The total time to implement CS algorithm imaging with four Jetson Nanos is about 14.5 s, whereas
the CS algorithm takes about 5.86 s on a single Jetson Nano. The additional time relative to the
single-node implementation is due to CPU scheduling and data transmission during data division
and splicing.
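The two run times refer to different scene sizes (16 384 × 16 384 points for the distributed run and
8192 × 8192 points for the single node), so a rough way to compare them is data volume per second. The
sketch below does this arithmetic. The throughput figure is an illustrative metric introduced here, not
one reported in the source.

```python
# Sketch: relate the reported run times to the amount of data processed.

BYTES_PER_SAMPLE = 8  # complex single precision

stage_times_s = [4.5, 5.2, 4.8]            # stages 1 to 3 on four Jetson Nanos
distributed_total_s = sum(stage_times_s)
single_node_s = 5.86                        # one Jetson Nano

distributed_gib = 16384 * 16384 * BYTES_PER_SAMPLE / 2**30
single_gib = 8192 * 8192 * BYTES_PER_SAMPLE / 2**30

distributed_rate = distributed_gib / distributed_total_s
single_rate = single_gib / single_node_s

print(f"distributed: {distributed_total_s:.1f} s for {distributed_gib:.1f} GiB "
      f"({distributed_rate:.3f} GiB/s)")
print(f"single node: {single_node_s:.2f} s for {single_gib:.1f} GiB "
      f"({single_rate:.3f} GiB/s)")
```

Each of the four nodes handles about the same volume as the single-node run, so with no scheduling or
transfer overhead the total would be expected to stay near the single-node time. The gap between the
two totals gives a rough sense of that overhead.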
4.2 Performance Evaluation and Discussion

The difference between the MATLAB implementation and the Jetson Nano implementation, as shown in
Fig. 11, arises because the intrinsic numerical accuracy of MATLAB and the Jetson Nano differs, even
though both use double-precision floating-point for computation. The difference is trivial at first
but is amplified by the complex computations in the CS algorithm. The calculations in MATLAB are
generally considered accurate enough. Therefore, using the MATLAB results as a reference, the absolute
errors of the Jetson Nano processing results are calculated, and the maximum value is not more than
0.008. According to statistical calculation, the average error is on the order of 10^-5, which is
acceptable. Thus, the reliability of the Jetson Nano is verified. For comparison with other real-time
processing platforms, the performance-to-power ratio is used to measure the processing performance of
different platforms. It considers the quantity of data processed, processing time, and processing power.
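The error check against the MATLAB reference can be sketched as follows. The sample values are made up
for illustration and are not the measured data, and it is an assumption that the error is taken per
complex sample.

```python
# Sketch: maximum and mean absolute error of a result against a reference.
# The sample values below are invented for illustration only.


def abs_errors(reference, result):
    """Per-sample absolute error between two equally long sequences."""
    if len(reference) != len(result):
        raise ValueError("inputs must have the same number of samples")
    return [abs(r - t) for r, t in zip(reference, result)]


reference = [1.0 + 2.0j, 0.5 - 0.25j, -3.0 + 0.0j]    # stands in for the MATLAB output
result = [1.0 + 2.0j, 0.5 - 0.246j, -3.003 + 0.0j]    # stands in for the Jetson Nano output

errors = abs_errors(reference, result)
max_error = max(errors)
mean_error = sum(errors) / len(errors)

print(f"max error: {max_error:.4f}, mean error: {mean_error:.4f}")
```

On real images the same two statistics would be compared with the thresholds quoted above.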
TABLE IV COMPARISON BETWEEN THE PROPOSED ARCHITECTURE WITH PREVIOUS SOLUTIONS IMPLEMENTING CS ALGORITHM
The results of different processing platforms are shown in Table IV. The 0.5 GB of data were
processed using the CS strip imaging algorithm in 5.86 s on the Jetson Nano. The Jtop system
monitoring utility in the Jetson system showed that the peak power consumption of the Jetson Nano
during operation did not exceed 5 W, which is consistent with power meter measurements. The
performance-to-power ratio is as high as 1.706%. The Jetson AGX Orin platform exhibits very high
processing performance. It takes 0.395 s to process 8192 × 8192 points of data in the 60 W power
consumption mode, and the performance-to-power ratio is 2.110%.
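The two ratios quoted above can be reproduced with a short calculation. The formula below is an
assumption inferred from the reported numbers (data volume in GB divided by the product of processing
time in seconds and power in watts, expressed as a percentage), and the function name is hypothetical.

```python
# Sketch: performance-to-power ratio under an assumed definition,
# data volume (GB) / (time (s) * power (W)), shown as a percentage.


def perf_to_power_percent(data_gb, time_s, power_w):
    return 100.0 * data_gb / (time_s * power_w)


DATA_GB = 8192 * 8192 * 8 / 2**30   # 8192 x 8192 complex single-precision samples = 0.5

nano = perf_to_power_percent(DATA_GB, time_s=5.86, power_w=5.0)    # 5 W used as the upper bound
orin = perf_to_power_percent(DATA_GB, time_s=0.395, power_w=60.0)  # 60 W power mode

print(f"Jetson Nano:     {nano:.3f}%")
print(f"Jetson AGX Orin: {orin:.3f}%")
```

Under this assumed definition the results round to 1.706% and 2.110%, in agreement with the values in
the text.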
The processing performance of the RTX 2060 Max-Q platform is also very strong, and for the same
data its processing time is shorter. However, its size and power consumption cannot meet the
requirements of an on-orbit processing platform. The results show that the Jetson Nano and Jetson AGX
Orin have a higher performance-to-power ratio than the other platforms. In addition, the embedded GPU
platform and the FPGA+ASIC platform show high performance-to-power ratios. The optimization method
proposed in this article achieves a significantly higher performance-to-power ratio. For the
platforms with a low performance-to-power ratio, weak platform performance and poor CUDA program
optimization are the main reasons.
However, under the constraints of power consumption, the FPGA and embedded GPU can
have a higher performance-to-power ratio.

Compared with FPGA platforms, embedded GPU platforms have short development cycles
and are easier to implement. Because of the short development cycle, embedded GPUs have great
application potential in future missions that launch large numbers of satellites on short cycles.
The performance analysis of the distributed simulation system based on four Jetson Nanos shows that,
although the use of memory space is optimized through unified memory management, memory reuse,
and in-place storage, the Jetson Nano’s 4 GB of memory is not enough to process the data.
Thus, there is still a bottleneck in distributed processing in the system. In addition, the 1 × 4
PCIe Gen2 interface makes data transfer in the system more time consuming, which affects the
processing performance of the distributed system.
Fig. 9. Distributed simulation verification system based on the embedded GPUs.
Fig. 10. Schematic diagram of the data transmission pipeline between the master node and slave node at each stage

However, with the rapid development of embedded GPUs, NVIDIA’s newly released 64 GB Jetson
AGX Orin can run in 15 W power mode, provides greatly improved memory capacity, and supports
2 × 8 PCIe Gen4. Moreover, the data transmission rate of this platform is much higher. If this
platform can pass the on-orbit environmental reliability tests, it will provide significant advantages in
FF-SAR on-orbit real-time imaging.
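The difference between the two PCIe configurations can be estimated from the per-lane signaling rates
and line encodings of the PCIe generations (5 GT/s with 8b/10b for Gen2, 16 GT/s with 128b/130b for
Gen4). The sketch below gives theoretical link bandwidth only. It considers a single ×8 Gen4 link, and
real throughput is lower because of protocol overhead.

```python
# Sketch: theoretical one-direction PCIe link bandwidth from lane rate and encoding.

PCIE_GENERATIONS = {
    2: (5.0, 8 / 10),       # Gen2: 5 GT/s per lane, 8b/10b encoding
    4: (16.0, 128 / 130),   # Gen4: 16 GT/s per lane, 128b/130b encoding
}


def link_gbytes_per_s(generation, lanes):
    """Theoretical bandwidth in GB/s for one link, before protocol overhead."""
    rate_gt_per_s, efficiency = PCIE_GENERATIONS[generation]
    return rate_gt_per_s * efficiency * lanes / 8   # bits -> bytes


gen2_x4 = link_gbytes_per_s(2, lanes=4)   # Jetson Nano class link
gen4_x8 = link_gbytes_per_s(4, lanes=8)   # one x8 Gen4 link

print(f"Gen2 x4: {gen2_x4:.2f} GB/s, Gen4 x8: {gen4_x8:.2f} GB/s, "
      f"ratio: {gen4_x8 / gen2_x4:.1f}")
```

The Gen2 ×4 result of 2 GB/s agrees with the PCIe rate quoted earlier for the Jetson Nano system, and a
single Gen4 ×8 link is nearly eight times faster on paper.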

CONCLUSION
To explore a more suitable imaging processing method for FF-SAR systems, this article
proposed a distributed real-time imaging processing method for spaceborne SAR based on embedded
GPUs. The original CS algorithm processing was rescheduled to accommodate the distributed
system. According to the hardware and software architecture of the embedded GPU, optimization
methods for memory and parallel computing were proposed to maximize its processing performance.
The simulation system was implemented using the Jetson Nano platform, and the proposed method
was verified using GF-3 raw data. The results show that the proposed method has better real-time
performance under low power consumption. Compared with previous work, it has a higher
performance-to-power ratio. The development cycle of embedded GPU platforms is shorter, and
their scalability is better. Embedded GPUs therefore have good application
prospects in the real-time processing of spaceborne SAR.

FUTURE SCOPE
1. Scaling to larger formations: Extending the distributed processing approach to handle data from larger
formations with more satellites/platforms for increased coverage and resolution.
2. Advanced processing techniques: Incorporating more advanced SAR processing algorithms and
techniques, such as interferometry, polarimetry, and moving target indication, into the distributed real-time
processing pipeline.
3. Heterogeneous computing: Exploring the use of heterogeneous computing architectures, combining
embedded GPUs with other specialized hardware accelerators (FPGAs, ASICs) for even higher performance
and energy efficiency.
4. On-board machine learning: Integrating on-board machine learning capabilities for tasks like automatic
target recognition, change detection, or data compression, leveraging the parallel processing power of
embedded GPUs.
5. Inter-satellite communication: Improving inter-satellite communication and data exchange protocols for
efficient distribution of processing tasks and data sharing within the formation.
6. Fault tolerance and redundancy: Developing fault-tolerant and redundant processing strategies to
ensure reliable operations in case of hardware failures or data losses.
7. Power and thermal management: Optimizing power consumption and thermal management strategies
for the embedded GPU-based processing systems to enable sustainable long-term operations.
8. Hybrid architectures: Investigating hybrid architectures that combine on-board processing with ground-
based processing facilities for more complex or computationally intensive tasks.
9. Application to other domains: Adapting the distributed real-time processing approach to other domains
that involve formation flying platforms, such as astronomical interferometry or multi-robot systems.


