
Application-Specific Soft-Core Vector Processor for Advanced Driver Assistance Systems


Stephan Nolting, Florian Giesemann, Julian Hartig, Achim Schmider, and Guillermo Payá-Vayá
Institute of Microelectronic Systems, Leibniz Universität Hannover, Appelstr. 4, 30167 Hannover, Germany
Email: {nolting,giesemann,hartig,schmider,guipava}@ims.uni-hannover.de

Abstract—Implementing convolutional neural networks for scene labelling is a current hot topic in the field of advanced driver assistance systems. The massive computational demands under hard real-time and energy constraints can only be tackled using specialized architectures. Also, cost-effectiveness is an important factor when targeting lower quantities. In this PhD thesis, a vector processor architecture optimized for FPGA devices is proposed. Amongst other hardware mechanisms, a novel complex operand addressing mode and an intelligent DMA are used to increase performance. Also, C-compiler support for creating applications is introduced.

TABLE I
Number of required giga operations (#GOP) for processing 30 frames (three input image scales each) by all parts of the CNN algorithm: convolution, pooling, and classification. [2]

Scale    Convolve   Pooling   Classify     #GOP
S          107.86      0.39       0.38   108.64
M           27.70      0.10       0.09    27.91
L            7.31      0.03       0.02     7.36
#GOP       142.87      0.52       0.49   143.91

I. INTRODUCTION AND MOTIVATION

Modern cars are increasingly equipped with complex systems to improve driver safety and comfort. Hence, the market for Advanced Driver Assistance Systems (ADAS) is a steadily growing area. Starting from simple cruise control, the focus is shifting more and more towards autonomous driving. For this emerging area, a detection and understanding of the current situation of the vehicle and its environment is required. So-called Convolutional Neural Networks (CNNs) [1] are one promising technique for object detection and scene labelling. These biologically-inspired approaches are based on convolving and subsampling ("Convolve", "Pooling") several scaled versions of the original input image (see Fig. 1). In a final step, the processed images are fed to a fully-connected neural network. Based on proper training, this network then performs the actual classification ("Classify") of each pixel into one of the semantic classes (e.g., "pedestrian" or "free space").

Especially in the field of automotive applications, hard real-time constraints are mandatory. Therefore, very high computational power needs to be provided in order to tackle the processing requirements. Table I shows the required number of operations for the different processing steps of the CNN algorithm. For the computation of 30 frames, a total of 143 billion operations is required. The processing performance as well as the power requirements of several exemplary approaches are presented in Table II. Commercial computing platforms like CPUs or GPUs are less suitable, since power consumption is also a crucial factor. In contrast, an application-specific integrated circuit (ASIC) minimizes the energy requirements, but for lower product quantities the price per unit is high. Field-Programmable Gate Arrays (FPGAs) present a feasible trade-off between energy requirements, processing power, unit prices, and also flexibility for future changes. Therefore, the goal of this PhD thesis is the implementation of a massively parallel vector processor architecture for CNN applications targeting FPGA devices. The main challenges are the efficient mapping of the CNN algorithm to the vector processing elements to exploit maximum data and instruction parallelism, as well as the resource-efficient and high-performance mapping of the architecture to the targeted FPGA devices. In contrast to previous FPGA implementations (see Table II), the proposed architecture provides a fully programmable system that can easily be adapted for new or modified applications in the field of computer vision.

TABLE II
Peak and real number of giga operations per second when executing the CNN algorithm on different GPU, FPGA, and ASIC platforms. Data is taken from [2].

Author      Year   Platform                    GOP/s
Mobile GPU Implementations
Farabet     2011   nVidia GT335m @30W             54
Cavigelli   2015   nVidia Tegra K1 @11W           76
FPGA Implementations
Farabet     2011   Virtex-6 VLX240T @10W         147
Gokhale     2014   Zynq ZC706 –                  227
Zhang       2015   Virtex-7 485t @18.6W        61.62
ASIC Implementations
Chen        2015   Accelerator in 65nm @0.5W     452
Cavigelli   2015   Accelerator in 65nm @1.2W     203

Fig. 1. Simplified structure of the CNN algorithm: three scaled inputs from the image pyramid are convolved & pooled two times and finally classified and labelled using a fully-connected neural network ("fc"; classes: background, free space, vehicle, pedestrian). [2]
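For illustration, the per-frame processing flow of Fig. 1 can be summarized as a short C sketch; the types, routines, and constants below are hypothetical placeholders, not the authors' implementation:

/* Structural sketch of the CNN flow in Fig. 1. All types and
 * routines are assumed placeholders for exposition. */
typedef struct { int w, h; float *data; } fmap_t;

fmap_t convolve(fmap_t in);                     /* "Convolve" stage */
fmap_t pool(fmap_t in);                         /* "Pooling" stage  */
void   classify(fmap_t *f, int n, int *labels); /* "Classify" stage */

enum { NUM_SCALES = 3, NUM_STAGES = 2 };

void process_frame(fmap_t scale[NUM_SCALES], int *labels)
{
    for (int s = 0; s < NUM_SCALES; s++)       /* image pyramid     */
        for (int k = 0; k < NUM_STAGES; k++) { /* conv & pool twice */
            scale[s] = convolve(scale[s]);
            scale[s] = pool(scale[s]);
        }
    /* The fully-connected network assigns each pixel a semantic
     * class (e.g., "pedestrian" or "free space"). */
    classify(scale, NUM_SCALES, labels);
}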

Fig. 2. Block diagram of the vector processor system. (Block labels recovered from the figure: FPGA containing an external memory interface and an Ethernet interface on an on-chip bus with arbiters, the MIPS processor system issuing vector instructions, the iDMA, and the vector processing array of vector units 1..N; each vector unit holds several vector lanes, a FIFO, a scheduler, and a local memory.)
II. DESIGN CHALLENGES AND GOALS
The proposed vector processor system, shown in Fig. 2, is basically built from a scalar MIPS-based CPU and a massively parallel array of vector processing elements. The MIPS processor system [3] serves as global controller and general-purpose CPU, which also computes any kind of flow control for the actual programs. In contrast, the vector processing array is in charge of the computation-intensive tasks. This array consists of a configurable number of vector units (VU). These units contain a number of vector lanes (VL), which represent the actual vertical data processing units. The lanes of one unit are connected via chaining to exchange processing results directly. Each VU contains a configurable local memory, which is accessible by all lanes and serves as fast scratchpad memory. Additionally, scheduling logic and a FIFO are part of each VU to buffer and orchestrate incoming vector operations to the different lanes of the unit. The actual vector operations are sent from the instruction decode stage of the MIPS processor system and are executed in parallel.
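To make the configurable parameters concrete, the following C fragment sketches one plausible way of describing a system instance; the struct and field names, as well as the example values, are illustrative assumptions, not the actual design sources:

/* Illustrative configuration record for one system instance; the
 * names, fields, and example values are assumptions for exposition. */
typedef struct {
    unsigned num_lanes;       /* vector lanes (VLs) per unit */
    unsigned local_mem_bytes; /* shared scratchpad per VU    */
    unsigned fifo_depth;      /* buffered vector operations  */
} vu_config_t;

typedef struct {
    unsigned    num_units;    /* configurable number of VUs  */
    vu_config_t vu;           /* here: all VUs identical     */
} system_config_t;

/* Example mirroring the early-results setup of Sec. II-A: one VU
 * with two VLs and 12 kB local memory (FIFO depth assumed). */
static const system_config_t demo = { 1u, { 2u, 12u * 1024u, 8u } };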
A. Optimized Vector Processor Architecture for FPGA Devices

To increase the computational power for common media processing applications (e.g., 2D convolutions), the VLs provide a complex operand addressing scheme. By this, the classic linear addressing of consecutive vector elements (Fig. 3a) is extended to stride-based operations (Fig. 3b) and even more sophisticated addressing schemes (Fig. 3c). Each of the three operands of a VL (source 1 & 2, destination) is defined by the following formula, which can be efficiently mapped to a single DSP slice: address = offset + α·x + β·y. Each operand features a unique offset and two unique scaling factors (α, β), which are multiplied with the global looping variables (x, y) used for the address computation of all operands.

Fig. 3. Addressing examples of the proposed complex addressing scheme: a) linear addressing, b) stride addressing, c) "special" addressing.
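For illustration, this address generation can be modelled in a few lines of C; grouping the per-operand parameters into a struct, and all names below, are assumptions for exposition:

/* C model of the per-operand address computation
 * address = offset + alpha*x + beta*y (one multiply-add chain,
 * as mapped to a single DSP slice). Names are illustrative. */
typedef struct {
    unsigned offset;      /* unique per operand               */
    int      alpha, beta; /* unique per-operand scale factors */
} operand_addr_t;

static unsigned operand_address(operand_addr_t p, int x, int y)
{
    return p.offset + (unsigned)(p.alpha * x + p.beta * y);
}

/* Parameter choices matching Fig. 3 (illustrative):
 *   a) linear:  alpha = 1, beta = 0  -> consecutive elements
 *   b) stride:  alpha = S, beta = 0  -> every S-th element
 *   c) special: alpha and beta both nonzero -> e.g., 2D window
 *      patterns for convolutions. */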
Challenge: To achieve maximal operating frequencies, the logic for the complex operand addressing as well as for the main ALU of the VLs is mapped entirely onto fully-pipelined DSP slices. Block RAMs are directly instantiated to ensure a correct mapping of the memories (vector register file, local memory) without additional logic overhead. Early results (unconstrained placement & routing) show that a setup of a VU with two VLs and 12 kB local memory (on a Virtex-6 XC6VLX240T-1FFG1156 FPGA) requires 347 logic slices, 8 DSP slices, and 10 block RAMs and can operate at up to 285 MHz, whereas a single VL is capable of operating at 333 MHz. Since most of the critical path delay is caused by pure routing, manual or semi-automatic placement and routing of the individual components using macros [4] is mandatory. By applying these techniques, the goal is to achieve the physically maximal performance (450 MHz for this specific Virtex-6 FPGA). Furthermore, all available FPGA resources shall be utilized to implement vector units: with 37,680 slices available and 347 slices per unit, theoretically 108 VUs with two VLs each fit onto the device.
Challenge: An intelligent DMA controller (iDMA) is required to transfer, modify, and align the necessary data blocks between main memory and the local memories of the several VUs. These local memories provide only two access ports in order to map them optimally to block RAM resources. The goal of the iDMA design is to provide transfer features that optimally map the required data structures to the available local memories.
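Such transfers could, for example, be described by a 2D-strided descriptor; the following C struct is a purely hypothetical sketch of such an interface, not the actual iDMA programming model:

/* Hypothetical 2D-strided transfer descriptor; the descriptor format
 * and all field names are assumptions, not the real iDMA interface. */
typedef struct {
    unsigned src_addr;   /* start address in main memory             */
    unsigned dst_addr;   /* start address in a VU local memory       */
    unsigned row_bytes;  /* contiguous bytes per row                 */
    unsigned num_rows;   /* rows to transfer                         */
    int      src_stride; /* byte distance between rows at the source */
    int      dst_stride; /* byte distance between rows at the sink   */
} idma_desc_t;

/* Example: gathering a tile out of a full-width frame and packing it
 * densely into local memory would use src_stride = frame width in
 * bytes and dst_stride = row_bytes. */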
B. Compiler Support

The high-level (i.e., C) software toolchain for the vector processor is based on the free and open-source LLVM project [5]. By using special intrinsics in combination with the modified MIPS backend of the compiler, the corresponding 96-bit vector instruction words can be constructed directly. All vector intrinsics can also be compiled for a native architecture. In this case, the actual instructions are replaced by a simulation library that emulates the behavior of the vector processor. Thus, application code can be verified and debugged when compiled for the host architecture (e.g., x86-based).

Challenge: The full implementation of a configurable simulation framework representing several VUs with several VLs (including chaining and local memories) is required, as well as the development of a library and a compiler-assisted development framework. The goal is to provide highly optimized CNN processing libraries and further compiler assistance to simplify more sophisticated programming tasks like chaining.
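In application code, this dual compilation path could look as follows; the intrinsic name, its signature, and the simulation-library header are hypothetical illustrations of the described mechanism:

/* Hypothetical illustration of the dual compilation path: the macro,
 * builtin, and header names below are assumptions, not the real API. */
#ifdef TARGET_VECTOR_PROCESSOR
  /* The modified MIPS backend lowers the intrinsic directly to a
   * 96-bit vector instruction word. */
  #define vp_conv2d(dst, src, krn) __builtin_vp_conv2d((dst), (src), (krn))
#else
  /* Native host build (e.g., x86): a simulation library provides a
   * behavioral C implementation of the same operation for debugging. */
  #include "vp_sim.h"
  #define vp_conv2d(dst, src, krn) vp_sim_conv2d((dst), (src), (krn))
#endif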
REFERENCES

[1] Kunihiko Fukushima and Sei Miyake, "Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position," Pattern Recognition, vol. 15, no. 6, pp. 455–469, 1982.
[2] Guillermo Payá-Vayá and Holger Blume, Eds., Towards a Common Software/Hardware Methodology for Future Advanced Driver Assistance Systems. River Publishers, 2017.
[3] S. Nolting, G. Payá-Vayá et al., "Dynamic self-reconfiguration of a MIPS-based soft-processor architecture," in Parallel and Distributed Processing Symposium Workshops, 2016 IEEE International. IEEE, 2016.
[4] Paul Glover and Steve Elzinga, "Relationally placed macros," Xilinx Inc., Tech. Rep., 2008.
[5] The LLVM Compiler Infrastructure. [Online]. Available: www.llvm.org
