Systolic-Based 2D Convolver For CNN in FPGA: October 2017
All content following this page was uploaded by Jozef Papan on 22 December 2017.
Abstract— Convolution is a primary mathematical operation used in many signal processing and analysis algorithms. The high dependence of complex systems on the correct operation of the convolver demands its continual improvement, mostly related to the decrease of resource consumption. The paper proposes a model of 2D convolution, massively used in image processing algorithms. The paper provides a detailed description of the model structure with focus on the implementation aspect. The model is particularly applied to the convolutional layer of the Convolutional Neural Network, currently the best-known image-based deep learning method. The key difference of the proposed model compared with other common implementations lies in the placement of line buffers. The correctness of the model design is validated through the simulation discussed at the end of the paper.

Keywords- 2d convolution, FPGA, CNN, systolic array.

I. INTRODUCTION

Convolution is a primary mathematical operation used in many signal processing and analysis algorithms, e.g. FIR and IIR filters. Therefore, there will always be a need to improve the convolution implementation in order to meet increasing requirements on the consumed resources of hardware platforms - memory, processing units, and time.

An example of a massive convolution application is the Convolutional Neural Network (CNN), in particular its convolutional layers. CNN is a type of deep neural network applied to image processing and thus it processes a huge amount of data. Furthermore, the method requires the algorithm implementation to run in real time, often on embedded devices with very restricted resources. Therefore, an optimal algorithm implementation for platforms like FPGA is crucial for the overall performance of the system.

The paper presents a particular model of 2D convolution with focus on the effectiveness of computation. The second section introduces the mathematical expression of 2D convolution, a way of decomposing it into a group of general one-dimensional convolutions, and our selected approach. The third and fourth sections provide a detailed description of the proposed model structure composed of two parts - data path and control path. The processing of both parts and the related timing control are described in the fifth section. The last two sections contain the discussion about the correctness of the proposed model, its contribution, limits and future improvements.

II. 2D CONVOLUTION AND ITS DECOMPOSITION

The key operation of the convolutional layer in the CNN model is the 2D convolution. This operation is a variant of convolution that increases the dimension of data from a vector to a matrix while it preserves the basic mathematical way of output computation - the weighted sum. 2D convolution is applied in areas that process multivariate data, such as image processing. Because we selected pipelining as the primary processing approach related to systolic arrays, we expect pixels to flow through the system one by one in a stream. The streaming approach allows us to avoid data preprocessing related to data rearrangement in memory, which would increase the latency. However, the vector-based form does not fit the data structure required for 2D convolution. Thus, the system requires modifications in comparison with the traditional 1D convolution design.

There are various approaches presented in [1]-[5] that deal with 2D convolution from the point of view of implementation optimization regarding the consumed resources. The common idea of streaming data through a system of processing elements is the same for all models described in the mentioned papers, but they differ in the structure and interconnection of the elementary processing units.

According to (1), which defines 2D convolution, we can split the computation of the final output into several steps, where each step returns only the partial result corresponding to one line of the weights matrix - the kernel illustrated in Fig. 1 and presented as the inner sum in (1). This view enables us to decompose the operation into several 1D vector-based convolutions and apply common practical solutions. Before we describe the structure and behavior of our proposed architecture, we state the assumptions about the format of the inputs and outputs of the system. These assumptions specify the overall design of the 2D convolver.

          K-1  K-1
y(i, j) =  ∑    ∑   x(i + m, j + l) × w(m, l)          (1)
          m=0  l=0

Figure 1. Visual form of 2D convolution

A. Input conditions of the proposed model

We assume that the input feature map is a square image with the same width and height and that this shape is a constant known in advance. Therefore, we describe the input image size via only one parameter, N. The same assumption is made about the shape of the kernels, defined via the parameter K.
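The decomposition described above can be made concrete with a small sketch. The following Python functions are purely illustrative (the paper's implementation is VHDL on FPGA): the first computes equation (1) directly, the second decomposes it into K row-wise 1D convolutions (the inner sum of (1)) whose partial results are added, as the proposed architecture does.

```python
import numpy as np

def conv2d_direct(x, w):
    """Reference 2D convolution per (1): y(i,j) = sum_{m,l} x(i+m, j+l) * w(m,l)."""
    N, K = x.shape[0], w.shape[0]
    y = np.zeros((N - K + 1, N - K + 1))
    for i in range(N - K + 1):
        for j in range(N - K + 1):
            y[i, j] = np.sum(x[i:i+K, j:j+K] * w)
    return y

def conv2d_by_rows(x, w):
    """Same result, decomposed into K one-dimensional convolutions:
    kernel row w(m, :) is convolved with image row i+m (the inner sum
    of (1)) and the K partial results are accumulated."""
    N, K = x.shape[0], w.shape[0]
    y = np.zeros((N - K + 1, N - K + 1))
    for m in range(K):                       # one step per kernel line
        for i in range(N - K + 1):
            for j in range(N - K + 1):
                # 1D convolution of image row i+m with kernel row m
                y[i, j] += np.dot(x[i + m, j:j+K], w[m])
    return y
```

Both functions return identical outputs for any square image and kernel, which is exactly the property the SE chain exploits.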
The other set of parameters defines the dimensions of the inputs, weights and outputs. Because we use fixed-point arithmetic, we set the integer and fraction widths for all these operands after consideration of their expected ranges and the whole computational model. Thus, we perform correct arithmetic computations and subsequent numeric adjustments, such as rounding and truncation.

Changes of these parameters lead to modification of the original model and its components. On the other hand, these changes are straightforward due to the chosen design of systolic arrays. Usually they require only the insertion of additional elements while preserving the model core. For that reason, all examples displayed in figures assume K=3 as the elementary case. This elementary case provides simplicity and clear understandability while it preserves the main principle.

B. Architectural approach – Systolic array

Systolic arrays represent an architectural approach that enables massively parallel computing and internal pipelined processing. Both benefits lead to an increase in performance. This design form is applicable to problems that exhibit simplicity and regularity hidden in repeated computational patterns. The modular design consists of a set of basic cells - processing elements (PE) - arranged in a systematic configuration, such as a chain, matrix, or tree. PE units are interconnected through a simple regular network of links. The network controls the correct flow of data through the system of PE units and so it ensures the synchronization. Because the modular design provides scalability and flexibility, a model created for the solution of a particular problem is also usable for a set of bigger problems. They require only minimal modifications without the need to start from scratch. The systolic array pays off when applied to computation-bound problems because its structure maximizes the usage of inputs required for many computational operations.

The principle of the systolic array provides various models that differ in the arrangement and interconnection of the PEs. The paper [6] explains the foundations of systolic arrays with their benefits and presents several examples of models appropriate for a particular task related to the FIR filter – the convolution. This digital signal processing task is an example that deals with the combination of two input data flows - the inputs and the coefficients. Considering the following implementation dependent on the structure and possibilities of the DSP blocks available in FPGA, we chose the design W2 depicted in the mentioned paper. This model benefits from permanent utilization of all computation units, pipelining and a continuous output flow without the requirement for an adder tree to merge the products.

However, not all systems can be effectively designed as a systolic array. A suitable system has to meet particular requirements; convolution meets all of these requirements and therefore is an exemplary candidate.

III. THE STRUCTURE OF THE DATA PATH

The data path of the proposed 2D convolver design consists of two main parts, depicted as an abstract model in Fig. 2 and as an implementation model in Fig. 3. The first part computes all required partial sums of products. It performs a classical digital signal processing (DSP) task based on the multiplication and subsequent addition of signals (Multiply-&-Accumulate - MAC). As an output we get the partial inner products (DPi) that have to be properly combined into the final result (P) afterwards. Because the inner products given at a particular moment belong to different outputs, we need to provide the synchronization accordingly.

The second part addresses the issue of timing. It combines the inner products and at the same time maintains the synchronized state. In other words, it controls the speed of the data that flow through the system. The primary focus is on timing because the related inner products have to be combined into the same output.

A. Part 1 – Systolic Elements

The first computation part contains a chain of systolic elements connected in sequence. Each systolic element (SEi) is represented by a K-tap FIR filter that computes one line of the window - a 1D convolution. That fact allows us to apply all known and recommended approaches to realize a FIR filter in the FPGA. Considering the implementation of the design in an FPGA circuit, we refer to the traditional and time-proven methods in [7], [8] that describe various FPGA models of FIR filters using DSP blocks. The abstract structure is straightforward and represents a systolic array [6]. On the other hand, the implementation design exhibits some distinctions in the synchronization associated with the structure of the DSP in current FPGAs (it uses 2K-1 registers instead of the original K registers, as shown for K=3 in the upper side of Fig. 3 compared to Fig. 2).

Each DSP block executes the multiplication of the actual input and the corresponding weight. The product is then pushed to the adder and attached to the cumulative output. Registers are inserted between the operators to enable pipelining and so they increase the maximal operating frequency. The DSP blocks are introduced in the next section.

1) The Structure and Application of DSP in SE

The DSP block provides features that lead to the optimal implementation of the FIR. The internal multiplier and post-adder support symmetric rounding and quantization of the results to address the bit growth caused by the arithmetic operations.

Figure 2. Abstract model of the data path. Figure 3. Implementation model of the data path (tapped delay line of z-1 registers, weights W[1,1]..W[3,3], BIAS, adders, and Part 2 - the switching block combining DPi via ADD1-ADD3).
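The MAC cascade of one systolic element can be sketched behaviorally. The following Python model is illustrative only (the real design uses DSP blocks in VHDL): it implements a K-tap FIR in transposed form, where the list `regs` stands for the pipeline registers inserted between the adders, and each stage performs one multiply-and-accumulate per input sample.

```python
def fir_transposed(stream, taps):
    """Behavioral model of one systolic element: a K-tap FIR filter in
    transposed form. Each multiplier feeds a register-adder chain, as
    in a cascade of Multiply-&-Accumulate (MAC) DSP blocks."""
    K = len(taps)
    regs = [0.0] * K                 # pipeline registers between adders
    out = []
    for x in stream:
        # every stage multiplies the broadcast input by its weight and
        # adds the partial sum held in the previous stage's register
        products = [x * t for t in taps]
        new_regs = [0.0] * K
        for k in range(K):
            prev = regs[k - 1] if k > 0 else 0.0
            new_regs[k] = products[k] + prev
        regs = new_regs
        out.append(regs[K - 1])      # DPi: partial inner product
    return out
```

After the initial K-1 fill-in cycles, the output at time t is the dot product of the last K stream samples with the taps, i.e. the 1D convolution of one kernel line with the pixel stream.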
Figure 5. FSM of the controller (states include START_UP and VERTICAL_BORDER; transitions are driven by pixel_counter_alert, with actions such as pixel_counter_treshold <= NO_VALID_PIXELS_PER_LINE, pixel_counter_set <= '1' and line_counter_ce <= '1')

The FSM is visually displayed in Fig. 5. It consists of five states that directly map, besides the mentioned window situations, also the initialization and start-up periods. The controller implements two counters with dynamic upper limits - PIXEL_COUNTER and LINE_COUNTER in the VHDL code. Each counter is responsible for a particular transition between states. The current form of the FSM operates only when the input is valid, otherwise it waits in the last state without changing any attribute, such as the current values of the counters. This chosen behavior simplifies its design and in the same way influences the operation of the data path. In other words, the data path waits without action during invalid inputs.

V. DESCRIPTION OF THE PROPOSED SYSTEM AND ITS TIMING CONTROL

A. Data processing by the SE chain

To interpret the data processing by the SE chain, we use a simple visual model. The computation of 2D convolution consists in scanning the input feature map by the technique of a moving window that is gradually used for the computation of individual output features. Considering this process, the SE cluster can be visually expressed as a pattern built of K blocks, shown in Fig. 6. The pattern blocks in the figure are divided into K lines and each contains exactly one highlighted line. The form of the pattern depends on the assignment of weights from the original kernel to the individual pattern blocks. Every pattern block represents one SE that computes its highlighted line. In other words, the particular pattern block always colors the specific line of all windows in the input map. Therefore, we require a window to be finally covered by each block in order to compute all its lines - the partial inner products (the inner sum in equation (1)). We can imagine the task of the pattern blocks as a coloring of the assigned line of windows. The processing of a window is complete when all its lines are colored by the corresponding pattern blocks. Because the visual shape of the pattern specifies different positions of the blocks, the inner products returned from the SE blocks at a particular time point correspond to different windows in the original feature map. Thus, they cannot be directly combined into the final output, but have to be reorganized first. The rearrangement of these products introduces the synchronization performed by appropriately placed delay buffers. The adders finish the computation and merge all related inner products into the common output.

Regarding the form of the pattern (Fig. 6), the blocks visit a window sequentially, in order from the left block first. The gap between two adjacent blocks is exactly N-K time units, based on the line-by-line movement of the pattern. Therefore, the inner products of two adjacent blocks taken at time points t and t+N-K belong to the same result. By applying this principle to the chain of K blocks, the inner products taken from the blocks at t, t+N-K, ..., t+(N-K)×(K-1) constitute one output feature. This time shift is the reason for the deployment of delay buffers to adequately adjust the time points of all inner products. Thus, the length of each delay buffer in the abstract model is equal to N-K, as shown in Fig. 2.
Figure 7. Transition through the vertical border (panels A-D)

Figure 9. Valid and invalid regions (region widths K-1 and N-K+1)

We need to change the perspective to overcome the stated problem. If we add a delay buffer of positive length to a signal path, we actually slow down the signal that spreads inside the system. That means the other signals will be faster than the delayed one. On the contrary, a delay buffer of negative length should have the inverse effect, hence it should speed up the corresponding signal in relation to the other signals of the system. We can accomplish the behavior of the negative-length buffer by slowing down all signals but the particular one. The synchronization state of the system stays untouched. The switching block inserted between the first and second part (Fig. 2) provides the desired operation.

The particular structure of the second part enables us to exploit its symmetry (all delay buffers have the same length and are connected in a row) to design the switching block. In the case of a negative delay, we reverse the connections of the inner products coming out of the SE blocks to the individual adders, i.e. the last inner product will be connected to the first adder and the first inner product will be connected to the last adder. In practice, we achieve this via a set of multiplexers (shown in Fig. 10) that are appropriately controlled via the selector - a signal defined by the sign of the expression N-2K+1.

Figure 10. Switching block (multiplexers route DP1-DP3 to the adders ADD1-ADD3; the selector SEL is the sign of N-2K+1, i.e. Sign(N-6+1) for K=3)

VI. DISCUSSION

A. The desired operation of the proposed model

The correctness of the abstract design is grounded in the mentioned documents [6], [7], [9]. The right behavior of the proposed model, including the control path and data path, was practically validated through a behavioral simulation. The simulator integrated into Vivado [10] was used as the simulation tool. We carried out the controlled simulation with a stimulus and evaluated the results.

Fig. 11 shows the waveform diagram of the proposed model simulation. Because of its length, the waveform is split into two diagrams belonging to the same simulation run. The important signals of the data path and control path are displayed with their values at different time points. The first diagram displays the transition of the window through the vertical border with emphasis on the pixel_counter_alert signal. This signal represents the activity of the PIXEL_COUNTER, which counts the processed input pixels. Its active state (value 1) indicates a coming transition to another state of the control path. In a similar way, the second diagram displays the transition of the window through the horizontal border with emphasis on the line_counter_alert. This signal represents the activity of the LINE_COUNTER, which counts the processed lines of the input image. It indicates the transition to another state of the control path in the same way as the previous alert signal. The progress of the signal valid_out corresponds to the areas in Fig. 9.

Figure 11. Waveform diagram of simulation

The simulation run of the system captured in the diagram matches the desired behavior of both parts of the proposed model. The simulation outputs were also compared with the results of experiments performed on the GPU under the same conditions - the same kernel and input images. The results are almost the same, considering the sufficiency of the accuracy, and so they serve as a proof of the desired behavior of the proposed model.

B. Contribution of the proposed model

The proposed model introduces a novel structure considering the position of the delay buffers. Other common models insert the delay buffers before the computation blocks to arrange the input pixels into matrix format at the entrance to the multipliers. That approach requires an adder tree placed after the sets of multipliers to merge the inner products, which increases the latency. Our approach inserts the delay buffers after the computation blocks. The buffers are interconnected with the adders in a scheme that does not require the adder tree at all. By applying this construction, we save the time of signals spreading through the adder tree and so we decrease the total latency of the outputs.

VII. CONCLUSION AND FUTURE WORK

The paper emphasized the important role of convolution in various areas and the need for its continual improvement from the implementation aspect. Therefore, we proposed a model with increased effectiveness compared to the other common models mentioned in the second section. The chosen approach - the systolic array - suits the architectural structure of FPGA well. In the discussion about the functionality of the model, we proved its correctness in the computation of 2D convolution, and the paper presented the detailed structure of the proposed model together with a description of its processing.

However, the proposed design exhibits drawbacks that limit the format of the supported input data via the static N and K parameters in the whole system. The potential improvements of the model consist in providing additional flexibility through a dynamic setup of N and K at run time. If we assume the variability of these parameters with the purpose of enabling the reuse of 2D convolver blocks for inputs and kernels of distinct sizes, we need to adjust the model architecture. All components that depend on these parameters must be modified to preserve their correct dynamic behavior. These components are the SE chain and the delay buffers.

In the case of the SE chain, we could use the neutral element of summation - zero - to replace all extra inner products considering the current value of K. This functionality could be implemented as part of the switching block.

The dynamic length of a delay buffer could be realized by application of BRAM, as described in [11]. In that implementation of a FIFO as a delay buffer, the variable length is represented by an offset between the address pointers to the current write and read memory positions. If we change the addresses and their offsets appropriately, the buffer length will adapt accordingly.
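The variable-length FIFO idea above can be sketched behaviorally: the delay equals the fixed offset between the write and read pointers into a circular memory, so changing the offset changes the effective buffer length. This Python model is an illustration in the spirit of [11], not the cited BRAM implementation; the class and attribute names are ours.

```python
class VariableDelayBuffer:
    """Circular-buffer model of a BRAM-based delay line: the delay in
    samples equals the offset between the write and read pointers."""
    def __init__(self, depth, delay):
        assert 0 < delay < depth
        self.mem = [0] * depth          # models the BRAM contents
        self.depth = depth
        self.wr = 0                     # write address pointer
        self.delay = delay              # read pointer trails by `delay`

    def push(self, sample):
        # read the sample written `delay` cycles ago, then overwrite
        rd = (self.wr - self.delay) % self.depth
        out = self.mem[rd]
        self.mem[self.wr] = sample
        self.wr = (self.wr + 1) % self.depth
        return out
```

Reconfiguring the buffer for a new N then amounts to changing `delay` (the pointer offset) rather than resynthesizing a fixed-length shift register.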
VIII. ACKNOWLEDGMENT

This paper is supported by the Faculty of Management Science and Informatics of the University of Zilina, funded by research grant number FVG/27/2017.

REFERENCES

[1] J.-J. Lee and G.-Y. Song, "Super-Systolic Array for 2D Convolution," in TENCON 2006 - 2006 IEEE Region 10 Conference, 2006, pp. 1-4.
[2] C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun, "CNP: An FPGA-based processor for Convolutional Networks," in FPL '09 19th Int. Conf. on Field Programmable Logic and Applications, vol. 1, no. 1, pp. 32-37, 2009.
[3] W. Qadeer et al., "Convolution engine," in Proceedings of the 40th Annual International Symposium on Computer Architecture - ISCA '13, 2013, vol. 41, no. 3, pp. 24-35.
[4] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks," in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '15, 2015, pp. 161-170.
[5] J. Qiu et al., "Going Deeper with Embedded FPGA Platform for Convolutional Neural Network," in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '16, 2016, pp. 26-35.
[6] H. T. Kung, "Why systolic architectures?," Computer, vol. 15, no. 1, pp. 37-46, Jan. 1982.
[7] Xilinx, DSP: Designing for Optimal Results, 1.0, 2005.
[8] Xilinx, "XtremeDSP for Virtex-4 FPGAs User Guide," 2008.
[9] Xilinx, "7 Series DSP48E1 Slice User Guide," 2016.
[10] "Xilinx Vivado." [Online]. Available: https://www.xilinx.com/products/design-tools/vivado.html.
[11] D. G. Bailey, Design for Embedded Image Processing on FPGAs, First ed. Wiley-IEEE Press, 2011.