Application-Specific Soft-Core Vector Processor For Advanced Driver Assistance Systems
Authorized licensed use limited to: KIT Library. Downloaded on April 16,2023 at 16:59:29 UTC from IEEE Xplore. Restrictions apply.
[Fig. 2: FPGA system with external memory interface, Ethernet interface, on-chip bus, arbiter, MIPS processor system, and iDMA attached to the vector processing array; each vector unit comprises a scheduler, FIFO, local memory, and several vector lanes.]
Fig. 2. Block diagram of the vector processor system.

[Fig. 3: (a) linear addressing, (b) stride addressing, (c) "special" addressing.]
Fig. 3. Addressing examples of the proposed complex addressing scheme.

Challenge: To achieve maximal operating frequencies, the logic for the complex operand addressing as well as the main ALU of the VLs are mapped entirely onto fully pipelined DSP slices. Block RAMs are directly instantiated to ensure correct mapping of the memories (vector register file, local memory) without additional logic overhead. Early results (unconstrained placement & routing) show that a setup (for a Virtex-6 XC6VLX240T-1FFG1156 FPGA) of a VU with two VLs and 12 kB of local memory requires 347 logic slices, 8 DSP slices, and 10 block RAMs, and can operate at up to 285 MHz. A single VL, however, is capable of operating at 333 MHz. Since most of the critical path delay is caused by pure routing, manual or semi-automatic placement and routing of the individual components using macros [4] is mandatory. By applying these techniques, the goal is to achieve the physically maximal performance (450 MHz for the specific Virtex-6 FPGA). Furthermore, all available FPGA resources shall be utilized to implement vector units (37680 slices available → theoretically 108 VUs with two VLs each).
proposed architecture provides a fully programmable system that can easily be adapted for new or modified applications in the field of computer vision.

II. DESIGN CHALLENGES AND GOALS

The proposed vector processor system, shown in Fig. 2, is built from a scalar MIPS-based CPU and a massively parallel array of vector processing elements. The MIPS processor system [3] serves as global controller and general-purpose CPU, which also computes all flow control for the actual programs. In contrast, the vector processing array is in charge of the computation-intensive tasks. This array consists of a configurable number of vector units (VUs). These units contain a number of vector lanes (VLs), which represent the actual vertical data processing units. The lanes of one unit are connected via chaining to exchange processing results directly. Each VU contains a configurable local memory, which is accessible by all lanes and serves as a fast scratchpad memory. Additionally, scheduling logic and a FIFO are part of each VU to buffer and orchestrate incoming vector operations to the different lanes of the unit. The actual vector operations are sent from the instruction decode stage of the MIPS processor system and are executed in parallel.

A. Optimized Vector Processor Architecture for FPGA Devices

To increase the computational power for common media processing applications (e.g., 2D convolutions), the VLs provide a complex operand addressing scheme. By this, the classic linear addressing of consecutive vector elements (Fig. 3a) is expanded to stride-based operations (Fig. 3b) and even more sophisticated addressing schemes (Fig. 3c). Each of the three operands of a VL (source 1 & 2, destination) is defined by the following formula, which can be efficiently mapped to a single DSP slice: address = offset + α·x + β·y. Each operand features a unique offset and two unique scaling factors (α, β), which are multiplied with the global looping variables (x, y) used for the address computation of all operands.

Challenge: An intelligent DMA controller (iDMA) is required to transfer, modify, and align the necessary data blocks between the main memory and the local memories of several VUs. These memories only provide two access ports in order to map them optimally to block RAM resources. The goal of the iDMA design is to provide transfer features that optimally map the required data structures to the available local memories.

B. Compiler Support

The high-level (i.e., C) software toolchain for the vector processor is based on the free and open-source LLVM project [5]. By using a special intrinsic in combination with the modified MIPS backend of the compiler, the corresponding 96-bit vector instruction words can be constructed directly. All vector intrinsics can also be compiled for a native architecture. In this case, the actual instructions are replaced by a simulation library that emulates the behavior of the vector processor. Thus, application code can be verified and further debugged when compiled for the host architecture (e.g., x86-based).

Challenge: A configurable simulation framework representing several VUs with several VLs (including chaining and local memories) must be fully implemented, together with a library and a compiler-assisted development framework. The goal is to provide highly optimized CNN processing libraries and further compiler assistance to simplify more sophisticated programming tasks like chaining.

REFERENCES

[1] Kunihiko Fukushima and Sei Miyake, "Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position," Pattern Recognition, vol. 15, no. 6, pp. 455–469, 1982.
[2] Guillermo Payá-Vayá and Holger Blume, Eds., Towards a Common Software/Hardware Methodology for Future Advanced Driver Assistance Systems. River Publishers, 2017.
[3] S. Nolting, G. Payá-Vayá et al., "Dynamic self-reconfiguration of a MIPS-based soft-processor architecture," in 2016 IEEE International Parallel and Distributed Processing Symposium Workshops. IEEE, 2016.
[4] Paul Glover and Steve Elzinga, "Relationally placed macros," Xilinx Inc., Tech. Rep., 2008.
[5] The LLVM Compiler Infrastructure. [Online]. Available: www.llvm.org