A Heterogeneous Parallel Processor For High-Speed Vision Chip
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCSVT.2016.2618753, IEEE Transactions on Circuits and Systems for Video Technology.
Abstract—This paper proposes a heterogeneous parallel processor for a high-speed vision chip. It contains four levels of processors with different parallelisms and complexities: a processing element (PE) array processor, a patch processing unit (PPU) array processor, a self-organizing map (SOM) neural network processor and a dual-core microprocessor (MPU). The fine-grained PE array processor, the middle-grained PPU array processor and the SOM neural network processor carry out image processing in pixel-parallel, patch-parallel and distributed parallel fashions, respectively. The MPU controls the overall system and executes some serial algorithms. The processor can significantly improve total system performance from low-level to high-level image processing. A prototype is implemented with a 64 × 64 PE array, an 8 × 8 PPU array, a 16 × 24 SOM network and a dual-core MPU. The proposed heterogeneous parallel processor introduces a new degree of parallelism, namely patch parallelism, for parallel local feature extraction and feature detection. It can flexibly perform state-of-the-art computer vision as well as various image processing algorithms at high speed. Various complicated applications including feature extraction, face detection and high-speed tracking are demonstrated.

Keywords—Vision chip, parallel processing, high-speed, SIMD, computer vision, heterogeneous, face detection, object tracking.

Jie Yang, Yongxing Yang, Zhe Chen, Liyuan Liu, Jian Liu and Nanjian Wu are with the Institute of Semiconductors, Chinese Academy of Sciences, Beijing 100083, China (email: nanjian@red.semi.ac.cn).
Copyright 20xx IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.
1051-8215 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. X, NO. X, MARCH 2016

I. INTRODUCTION

High-speed image processing and recognition can be applied to many fields such as industrial assembly line control, defect detection, robot vision and human-computer interaction [1], [2]. Various methods can be used to accomplish image processing tasks. Traditional vision systems cannot achieve a high processing rate due to the serial image transmission and serial image processing bottlenecks. For example, the SIFT algorithm costs more than half a minute to process a 1024 × 768 image on an embedded ARM A8 processor [3]. The GPU is a highly parallel architecture that has become popular in vision processing in recent years [4], but the cost of a GPU system is high and it requires a complex supporting environment. More importantly, CPUs and GPUs usually consume tens to hundreds of watts, which rules them out for many low-power embedded applications.

The vision chip concept was first introduced in the 1990s [5], [6]. A vision chip integrates a high-speed image sensor and a two-dimensional (2D) array of single-instruction-multiple-data (SIMD) pixel-parallel processing elements (PEs) on a single chip. It not only increases the bandwidth of image transmission and the speed of image processing but also brings low power, low cost and compact system features. Vision chips with a 2D array of pixel-parallel PEs can perform simple image processing very efficiently and make high-speed image processing possible. The reported vision chips usually have processing rates from hundreds to thousands of frames per second [7]–[22]. However, the high performance of PE array processors is obtained by trading off processor flexibility. It is therefore difficult to carry out complex algorithms. As a result, early vision chips are usually application specific and only perform certain image processing operations [7], [8], [10], [12], [14]. Later, to increase processor flexibility, programmable vision chips were reported [9], [15]–[19], [21], [22]. Based on the concept that image processing can be classified into low-, middle- and high-level [19], [23], we first proposed the SIMD real-time vision chip (SRVC) architecture that includes both a 2D PE array processor and one-dimensional (1D) row-parallel processors (RPs) [15]. The chip can finish low-, middle- and simple high-level image processing algorithms. The SRVC architecture was further developed in work [19]. It integrates a PE array processor, a row-parallel row processor and a non-parallel microprocessor unit (MPU) to manage low-, middle- and high-level image processing, respectively. To address the high-level processing speed bottleneck of the non-parallel MPU, we proposed a vision chip with a self-organizing map (SOM) neural network to speed up feature classification [21]. The SOM neuron network is reconfigured from the PE array processor, similar to works such as [24], [25]. Although these vision chips achieve good performance in some vision applications, their design concept was still within the scope of the traditional image processing flow. Fig. 1 (a) shows the traditional image processing flow. It is composed of three parts, namely filtering, holistic feature extraction and feature classification. Filtering is used for contour and edge extraction or morphological operations. The holistic feature is a single global vector extracted from the whole image to represent the object. However, holistic features are sensitive to illumination, angle and background changes, which greatly affect classification accuracy. Fig. 1 (b) shows the modern computer vision processing flow: it first extracts areas of interest or divides the image into multiple small patches, and then a local feature is extracted in each area or patch. At last, all the extracted features are used in classification. Compared to the traditional processing flow, it greatly increases the accuracy of detection and classification and is widely used in vision applications [26]. But the vision chips developed so far cannot carry out this flow in an efficient way. Firstly, the row-parallel processor was designed specifically for holistic feature extraction, so it
cannot extract local features efficiently. Secondly, the vision chips lack consideration of the nature of image signal processing and computer vision algorithms, because most algorithms are carried out on image patches rather than the whole image, and different image patches can be processed in parallel. Current vision chips all fail to utilize this degree of parallelism to further improve system performance.

This paper proposes a novel heterogeneous parallel vision chip processor. It consists of three von Neumann-type processors: a PE array processor, a patch processing unit (PPU) array processor and a dual-core microprocessor (MPU), and one non-von Neumann-type processor: a SOM neural processor. The PE array and PPU array processors can process images in pixel-parallel and patch-parallel fashions, respectively. The PE array processor can pipeline the two phases of data transfer and image processing. Pixel-parallel operations can be performed in the PE array with high efficiency. The PPU array can carry out feature extraction in a patch-parallel fashion while it also performs binary classification in a distributed parallel fashion. The non-von Neumann-type SOM neuron network carries out multi-class classification in a vector-parallel way. The PE and PPU arrays can be dynamically reconfigured into the SOM neuron network so that the resource usage of the architecture is greatly reduced. Complicated image processing and computer vision algorithms are realized in this architecture with high performance and in a flexible fashion. The proposed heterogeneous processor architecture introduces a new degree of parallelism, namely patch parallelism, which is designed for parallel local feature extraction and feature detection. The ability of local feature processing enables the

Fig. 1: (a) Traditional image processing flow. (b) Modern computer vision processing flow.

II. ARCHITECTURE

A. Hierarchical Heterogeneous System Architecture

Fig. 2 shows the proposed processor architecture. It mainly consists of three von Neumann-type processors and a bio-inspired non-von Neumann-type SOM neuron network processor. The von Neumann-type processors consist of an N×N pixel-parallel PE array processor, an M×M patch-parallel PPU array processor and a dual-core MPU. The PE array and PPU array perform the low-level pixel-parallel and middle-level patch-parallel algorithms, respectively. The MPU manages the whole system while finishing some serial algorithms. The non-von Neumann-type SOM neuron network is dynamically reconfigured from the PE array and PPU array. The PPU array and the SOM neuron network can carry out feature classification in a distributed parallel fashion.

Fig. 2: Block diagram of the proposed architecture: a high-speed image sensor, the N × N PE array, the M × M PPU array and the SOM neural network dynamically reconfigured from PEs and PPUs.

Fig. 3 gives the hierarchical view of the proposed architecture. The four different kinds of processors have different granularities and complexities. The PE array processor performs pixel-parallel image processing algorithms in a SIMD scheme. Each PE is a simple 1-bit processor. However, the massively parallel PE array exhibits enormous computation power in prerequisite algorithms for many recognition and detection applications, such as 2D filters, background reduction, features from accelerated segment test (FAST) [27] and local binary patterns (LBP) [28]. With the help of the massively parallel PE array processor, the processing speed of these procedures can be largely improved. The algorithm details will be explained later in Section IV.

The PPU is more flexible and powerful than the PE. It is capable of executing complex algorithms such as orientation assignment, local feature extraction and block matching. Each PPU corresponds to a PE sub-array with a mesh structure, as shown by the dashed lines in Fig. 3. Currently, in our implementation, the mapping of PPUs to PEs is fixed and corresponds to an 8 × 8 grid of PEs for each PPU. This mapping relationship significantly differs from that between the PE array and the row processor array in the reported architectures [15], [19], [21], where one row of PEs corresponds to one row processor. It is more suitable for modern image signal processing and computer vision algorithms than the traditional one-row-of-PEs to one-row-processor mapping scheme. Nowadays, local feature representations have become dominant among the various ways to describe image features [29]–[34]. They are widely applied in trackers and classifiers to accomplish tracking, detection and recognition applications. The extraction of a local feature is mainly based on processing image patches and histogram statistics. However, an image consists of significant numbers
of patches; thus the whole procedure can be computationally expensive and time consuming. The mapping relationship between the PE array and PPU array is designed to facilitate the local-feature extraction procedure. The PPU can access the data memory of a PE sub-array that stores an image patch. Thus it can directly perform local feature extraction immediately after the PE array processing. It removes the bottleneck of image data access between the row processor and PE memory in the one-row-processor to one-row-of-PEs mapping relationship. Additionally, the SIMD PPU array can carry out classification algorithms such as AdaBoost in a distributed parallel fashion.

The SOM neuron network is adopted in this architecture to speed up multi-class classification in a vector-parallel fashion. The SOM neuron plane is partitioned into several non-overlapping regions, and each of these regions corresponds to a feature class. In the SOM neural network, each neuron can store a K-dimensional reference vector. On-line training and updating of the reference vector can be completed with the learning vector quantization (LVQ) method. A typical processing flow of the proposed hierarchical heterogeneous architecture is shown in Fig. 4.

B. Architecture Features

1) Patch-parallel Processing: An image processing or computer vision system usually selects and processes a minimum image data unit sequentially. The image data unit within a rectangular structure is called an image patch. Unique local features can be extracted from the image patch. These features are critical for the subsequent image processing. But it is difficult to process image patches in parallel with the previously reported vision chips, because their middle-level processor, namely the row processor, can only directly access one row rather than one patch of image data. The proposed architecture integrates the PPU array processor for patch-parallel processing. Each PPU in the processor corresponds to a PE sub-array with a mesh structure. The PPU can access the local image patch data in the PE sub-array directly. All PPUs operate independently in a SIMD fashion. Thus high-speed patch-parallel processing is achieved. Fig. 5(a) shows the non-patch-parallel and patch-parallel processing methods. In non-patch-parallel processing, image patches are serially processed one by one, so the total processing time is proportional to the number of patches. In patch-parallel processing, image patches are processed concurrently. The algorithm can be accomplished by executing the procedure once. Therefore the speed-up factor is the number of PPUs.

Additionally, adjacent patches usually have a certain degree of overlap to ensure that all possible features or locations are included. Fig. 5(b) shows a schematic diagram of two adjacent image patches overlapping with each other by dx pixels. In our proposed architecture, this overlapping mechanism can be implemented efficiently with the help of the PE array. Instead of shifting the patch window, we shift the data stored in the PE array by dx pixels. With only a few operations, we can shift the data in all patches with an offset of dx simultaneously. Then we execute the algorithm again to extract local features in the new patches. In Section IV, we will demonstrate the efficiency of this mechanism in object tracking applications.

2) Distributed Parallel Classification: Classification is one of the most important and commonly encountered problems in image processing and computer vision. A performance- and accuracy-oriented classifier implementation

Fig. 3: The hierarchical view of the proposed heterogeneous architecture. The four kinds of processors have different granularity and complexity.

Fig. 4: Typical processing flow of the proposed architecture and representative algorithms for each step: pixel-parallel algorithms (2D-filter, mathematical morphology, background subtraction, Fast-9, DoG, LBP, Susan; speed-up by the number of PPUs), patch-parallel algorithms (orientation assignment, block matching, block sum, histogram value, statistical operation), classification (multi-class and binary) and applications (target tracking, face detection, pattern recognition).

Fig. 5: (a) The serial and patch-parallel processing methods; gray boxes indicate active patches. (b) Patch overlapping processing in serial and patch-parallel processing.
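The shift-based patch overlap of Fig. 5(b) can be sketched in NumPy. Here a plain 2D array stands in for the PE-array memory plane, and the 64 × 64 size, 8 × 8 patch and dx value are illustrative, not chip parameters:

```python
import numpy as np

def shift_pe_array(data, dx):
    """Shift the whole data plane left by dx columns, as the PE array
    would with dx neighbor-to-neighbor move operations (zero fill)."""
    out = np.zeros_like(data)
    out[:, :data.shape[1] - dx] = data[:, dx:]
    return out

# After one global shift, every PPU's fixed 8x8 patch sees the window
# that overlaps its previous window by (8 - dx) columns, so the same
# patch-parallel feature-extraction pass covers the shifted locations.
img = np.arange(64 * 64).reshape(64, 64)
shifted = shift_pe_array(img, dx=2)
patch = shifted[0:8, 8:16]               # patch of the second PPU column
assert (patch == img[0:8, 10:18]).all()  # equals the offset window
```

One global shift thus re-targets all M × M patches at once, which is why the overlap mechanism costs only a few PE operations rather than per-patch window moves.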
Fig. 6: Different pixel-PE mapping relationships of the proposed architecture.

Fig. 7: Frame pipeline scheme in the proposed architecture: (a) without pipelining, frame read-out and processing alternate; (b) with pipelining, frame (n+1) is read out while frame (n) is processed.
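The benefit of the frame pipeline in Fig. 7 can be expressed with a toy timing model; the microsecond figures below are made-up illustration values, not measurements from the chip:

```python
def frame_time(readout_us, process_us, pipelined):
    """Time charged per frame: serialized, read-out and processing add
    up; pipelined, processing of frame n hides behind the read-out of
    frame n+1, so the slower of the two phases sets the frame period."""
    if pipelined:
        return max(readout_us, process_us)
    return readout_us + process_us

# When the algorithm execution time equals the CIS read-out latency,
# the processor idle time is zero and the per-frame cost is halved.
assert frame_time(100, 100, pipelined=False) == 200
assert frame_time(100, 100, pipelined=True) == 100
# Processing shorter than read-out: the frame rate is read-out limited.
assert frame_time(100, 60, pipelined=True) == 100
```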
vision chip designs, the PE array is used either for image acquisition [9], [10], [15], [18] or as a memory buffer [19] for the image sensor. Thus, image acquisition and processing cannot be carried out in parallel. Similar to work [24], the proposed architecture adopts a frame pipeline scheme to hide the CIS read-out latency, as shown in Fig. 7 (b). When the image sensor is reading out a new frame of image data, the processors are processing the previous frame of image data. If the algorithm execution time is equal to the CIS read-out latency, the idle time of the processor can be zero. This pipeline scheme is realized by introducing a simple FIFO in each PE. This will be described in the PE circuit section.

5) Enhanced SOM Neuron Network: A SOM neural network can be dynamically reconfigured from the PE array processor and the PPU array processor. Because a 1D RP array is not compatible with the 2D-structured SOM neuron network, the RP array processor in previous work could not be reconfigured into SOM neurons, so the computation power of the middle-level processor was wasted [21]. However, in the proposed architecture, the PPU array intrinsically has the 2D mesh structure and is suitable for SOM neuron network implementation. Thus, the proposed architecture can reconfigure both the PE array and the PPU array into the SOM neural network. As shown in Fig. 8, a dedicated condition generator and 4 × 4 snake-chained PEs constitute a neuron with a 16-bit ALU. With the same condition generator, the two 16-bit ALUs in a PPU can form two SOM neurons. The reconfiguration of the PPU array further increases the computation power of the SOM neuron network. The reconfiguration costs only multiplexers for the PE topological connection switch and condition generators for condition operation. Compared with a standalone SOM neuron network implementation, the reconfiguration cost is negligible (5%) [21]. PEs in [24], [25] can operate the function of a neuron, but their coarse-grained design makes it difficult to carry out pixel-parallel operations.

Fig. 8: The reconfiguration of PEs and PPUs to SOM neurons. Every 4 × 4 PEs can constitute a SOM neuron while each PPU can split into two SOM neurons.

III. MODULE DESIGN OF THE VISION CHIP PROCESSOR

A. PE Circuit

Fig. 9 shows the proposed PE circuit (schematic omitted here). It consists of a main memory, a FIFO, a 1-bit ALU and some multiplexers. The main memory stores temporary values and results during the image processing stage. The FIFO stores the incoming pixel data of a new frame. The PE can process the previous frame in the main memory while storing the data of the incoming frame in the FIFO, without either interfering with the other. After the image processing of the previous frame and the CIS procedure of the current frame are finished, the data stored in the FIFO are automatically moved to the main memory by a hardware interrupt, and this procedure only takes a few instruction cycles.

The ALU performs the operations of a full adder, an inverter, an AND gate and an OR gate. The reconfiguration multiplexers (shaded in Fig. 9) can switch the topological connections between neighboring PEs for the different operation modes. When the reconfiguration multiplexers are selected (R = 1 in Fig. 9), PEs are in SOM neuron mode and the 1-bit ALUs of all PEs in a SOM neuron are connected to form a 16-bit ripple adder. In PE array processor mode (R = 0), PEs are mesh-connected; the multiplexer op1 mux selects the first ALU operand from its own main memory, or from the main memory of its east, south, west or north neighboring PE (E, S, W and N entries in Fig. 9). The register C stores the carry-out of the ALU in the current cycle and acts as the carry-in of the ALU for the successive cycles. Multiplexer op2 mux selects constant zero, constant one, the temporary register or the output of the FIFO as the second operand for the ALU. By concatenating the 1-bit ALU operations, multiple-bit operations can be finished [19]. Table I lists some common functions that the PE can finish, where a, b and c are 8-bit variables in the PE memory; ∂f/∂x and ∂²f/∂x² are the first- and second-order image gradients, respectively. In SOM neuron mode, 16 such PEs are connected in a snake style to form a SOM neuron.

TABLE I: Some Common Functions of the PE Array

Function               Cycles    Function                   Cycles
c = a + b              17        c = a - b                  18
c = max(a, b)          52        c = min(a, b)              52
∂f/∂x or ∂f/∂y         18        ∂²f/∂x² or ∂²f/∂y²         52
c = a < b ? a : 0      26        c = a > b ? a : 0          27
c = ||a||              72        non-maximum suppression    167

B. PPU Circuit

Fig. 10 shows the schematic of the proposed PPU circuit (schematic omitted here). It mainly consists of a PE input buffer (PEIB), general purpose registers (GPR), two 16-bit ALUs and one data memory. The PEIB serves as the interface between PE and PPU. Since the PE and PPU both have independent data memories, once the data exchange is finished, they can work simultaneously. The two ALUs can perform two 16-bit operations or one 32-bit operation by controlling the carry signal. In image processing, 32-bit precision is seldom required; thus we mostly use the ALUs in 2-way SIMD mode to increase PPU performance. In this mode, the two 16-bit ALUs can perform comparison and shift, as well as max/min extraction for non-linear operations. We also implement identical condition generators for the two ALUs so that they are fully compatible with the SOM neuron reconfigured from 16 PEs. A bus interface is used to support data exchange between the MPU and the PPU. The write procedure of the memory is controlled by the condition execution module, which supports eight kinds of condition based on the status of the condition registers.

Two important operations in computer vision processing are orientation assignment and histogram calculation. Usually, orientation assignment involves costly division, inverse tangent and rounding calculations. Fortunately, the required calculation precision of orientation assignment in most algorithms is not high. As in SIFT, SURF and HoG, 8 orientations are often enough for acceptable processing results. This allows us to determine the orientation by simply comparing region boundaries. As shown in Fig. 11 (a), the gradient orientation of a pixel can be assigned to one of eight equally-spaced angular regions based on its gradients in the x and y directions. Given a gradient pair (dx, dy), we can test the three boundary conditions indicated in Fig. 11 (b); the logical results are then combined to generate a corresponding bin number between zero and seven. The two 16-bit ALUs can simultaneously work on two gradient pairs to reduce processing time. More accurate boundary tests can also be implemented; for example, for tan 20° ≈ 0.37 we can test the boundary (16 × dy + 8 × dy) > (8 × dx + 1 × dx), where the multiplications can be performed by shift operations.

For histogram calculation, we first calculate the sum of a base address and the variable as the new address. Then the value stored at this address is loaded into the GPR and incremented by one. At last, the value is stored back to the memory. The above procedure repeats until all variables are counted.

C. SOM Neuron Circuit

There are two types of SOM neuron circuit in the proposed architecture: the PE-based SOM neuron and the PPU-based SOM neuron. The PE-based SOM neuron consists of 16 snake-style connected PEs. In the 16 PEs, the 1-bit ALU carry-out of the PE at the i-th bit-position is directly connected to the carry-in signal of the 1-bit ALU at the (i + 1)-th bit-position. The LL, LH and AH port connections of the PE circuits at the different bit-positions are listed in Table II.
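The boundary-test orientation assignment of Section III-B combines the three 1-bit results into a bin number. The sketch below is one encoding consistent with the bin layout of Fig. 11(a); the exact bit combination used on chip is not specified in the text:

```python
import math

def orientation_bin(dx, dy):
    """Map a gradient pair to one of 8 equally spaced angular bins
    using only the tests dx > 0, dy > 0 and |dx| > |dy|."""
    a, b, c = dx > 0, dy > 0, abs(dx) > abs(dy)
    if a and b:
        return 0 if c else 1      # first quadrant: bins 0 and 1
    if not a and b:
        return 3 if c else 2      # second quadrant: bins 2 and 3
    if not a and not b:
        return 4 if c else 5      # third quadrant: bins 4 and 5
    return 7 if c else 6          # fourth quadrant: bins 6 and 7

# Agrees with a 45-degree binning of atan2 for off-boundary gradients.
for gx, gy in [(2, 1), (1, 2), (-1, 2), (-2, 1),
               (-2, -1), (-1, -2), (1, -2), (2, -1)]:
    ref = int(math.degrees(math.atan2(gy, gx)) % 360 // 45)
    assert orientation_bin(gx, gy) == ref
```

No division or arctangent is needed, which is the point of the boundary-comparison scheme.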
TABLE II: The Connections Between Neighboring PEs in SOM Mode

Port               bp = 0               bp = 15              other bp
LL (Logic Low)     East neuron PE[15]   This neuron PE[14]   This neuron PE[bp-1]
LH (Logic High)    This neuron PE[1]    West neuron PE[0]    This neuron PE[bp+1]
AH (Arith. High)   This neuron PE[1]    This neuron PE[15]   This neuron PE[bp+1]

Fig. 11: Gradient orientation with boundary comparison. (a) Simple orientation assignment: the eight angular bins in the (dx, dy) plane. (b) The three boundary conditions for orientation assignment: dx > 0?, dy > 0? and |dx| > |dy|?

Fig. 13: The prototype system. It consists of an FPGA development board with an image sensor daughter board, and a compilation environment (assembler) on a PC connected via an Ethernet port.
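The 16-bit ripple adder formed by the snake-chained PEs can be modeled bit by bit. Each loop iteration below plays the role of one PE's 1-bit full adder, with the carry passed along the chain of Table II; this is a behavioral sketch, not the circuit:

```python
def ripple_add16(a, b):
    """Add two 16-bit values with a chain of sixteen 1-bit full adders,
    the carry-out of bit i feeding the carry-in of bit i + 1
    (result wraps modulo 2**16, as the hardware adder would)."""
    carry, result = 0, 0
    for i in range(16):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        s = ai ^ bi ^ carry                      # sum bit of this full adder
        carry = (ai & bi) | (carry & (ai ^ bi))  # carry-out to the next PE
        result |= s << i
    return result

assert ripple_add16(1234, 5678) == 6912
assert ripple_add16(0xFFFF, 1) == 0              # overflow wraps around
```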
TABLE III: Performance Comparison With Other Architectures

                           This work                 [3]                  [40]                  [41]
Algorithm                  FAST                      SIFT                 FAST                  FAST
Architecture               Pixel-parallel PE array   Embedded processor   Multicore processor   Other
Frequency                  50 MHz                    168 MHz              1 GHz                 25 MHz
Throughput (pixel/cycle)   1.2                       0.03                 0.04                  0.77

Fig. 14: The FAST feature detector. The numbered pixels (1-16) are used in the feature detection. The pixel p is the candidate pixel. The arc indicated by the dashed line passes through 9 contiguous pixels which are brighter than p by the threshold.
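A plain-software reference of the FAST-9 test of Fig. 14 (in the usual brighter-or-darker variant) is sketched below. On the chip this test runs for all candidate pixels at once on the PE array; the sketch checks a single candidate:

```python
# Offsets of the 16 pixels on the Bresenham circle of radius 3 (Fig. 14).
CIRCLE = [(0, 3), (1, 3), (2, 2), (3, 1), (3, 0), (3, -1), (2, -2), (1, -3),
          (0, -3), (-1, -3), (-2, -2), (-3, -1), (-3, 0), (-3, 1), (-2, 2), (-1, 3)]

def is_fast9_corner(img, x, y, t):
    """True if 9 contiguous circle pixels are all brighter than p + t,
    or all darker than p - t (runs may wrap around the circle)."""
    p = img[y][x]
    for flags in ([img[y + dy][x + dx] > p + t for dx, dy in CIRCLE],
                  [img[y + dy][x + dx] < p - t for dx, dy in CIRCLE]):
        run = 0
        for f in flags + flags:      # doubled list handles wrap-around runs
            run = run + 1 if f else 0
            if run >= 9:
                return True
    return False

dark = [[0] * 7 for _ in range(7)]
dark[3][3] = 100                           # isolated bright pixel
assert is_fast9_corner(dark, 3, 3, t=20)   # all 16 circle pixels darker
assert not is_fast9_corner([[5] * 7 for _ in range(7)], 3, 3, t=20)
```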
Fig. 18: Target window and search windows in the PE array. The search windows are offset from the previous target window by small displacements such as (dx1, dy1) and (dx2, dy2).
result. One iteration of the face detection routine consumes about 60 µs. Cooperating with our proposed pipelined PE scheme and pixel-to-PE mapping, the architecture can continuously execute the algorithm to scan the whole image for face detection with very high efficiency. For example, as shown in Fig. 16, in a single image readout, any 256 × 64 region of the image sensor can be stored in the PE array as four 64 × 64 regions, and processing the four regions consumes only 4 × 60 = 240 µs. In the next frame, we capture the same 256 × 64 region and process the three 64 × 64 regions that overlap with the four processed regions, which consumes 3 × 60 = 180 µs. Using this method, we can slide over all possible regions to perform face detection on the whole 256 × 256 image plane. However, due to the pipelined PE scheme, the total processing time for the 256 × 64 region depends only on the image sensor readout time. Fig. 17 shows some detection results. We also fed our processor the LFW database [46], on which it achieves a 91% detection rate. Using the same method, we can adopt or combine LBP, HOG and other similar local representations with the distributed parallel classification to detect other objects, as long as we train the classifier with a corresponding database. For example, we can train a new classifier with HOG features for pedestrian detection, or combine LBP and HOG for human detection.
The enhanced SOM neural network can also be used for detection and recognition. For comparison, the face detection and recognition algorithms described in [21] were implemented. Due to the increased network scale, the enhanced SOM neural network achieves 6% higher detection accuracy, and 3 additional faces can be recognized without degrading accuracy or frame rate.
To demonstrate the ability of the architecture in high-speed tracking applications, a target tracking algorithm was implemented. It begins with the image acquisition of the sensor; the object within the initial tracking window is then described with an LBP feature histogram. In each successive frame, we select several search windows that surround the tracking window of the previous frame and compute their LBP feature representations. In high-speed tracking, the difference between two successive frames is limited to a few pixels. Thus, as shown in Fig. 18, a selected search window will mostly overlap with the target window of the previous frame, and the offsets dx and dy between the two windows in the horizontal and vertical directions are limited to two or three pixels. The search window with the smallest difference from the target window is regarded as the new location of the target. We update the target feature every frame to adapt to affine and scale changes.
The feature extraction within a search window is accomplished by the PE array and the PPU array cooperating. For example, suppose the search window has offsets dx and dy from the target window in the horizontal and vertical directions. First, we shift the data in the PE array by -dx and -dy in the two directions, respectively. Then, we can obtain the search window feature by running the target window feature extraction routine again. Fig. 19 shows the experimental setup for a target tracking application. The CMOS image sensor and the FPGA were mounted on an actuator with two degrees of freedom. A remote-control jeep model was used as the target. The control signal of the actuator is given directly by the MPU. According to the calculated target position, the actuator adjusts to keep the target in the center of the view. In our experiments, we chose a target size of 56 × 56 and varied dx and dy from -2 to +2 to obtain 25 different search windows. The total processing time for the tracking algorithm is 2 ms. Tracking results in our lab environment are shown in Fig. 20.

V. PERFORMANCE AND COMPARISON

The proposed vision chip can execute traditional image processing as well as complex algorithms such as feature point detection, face detection and target tracking at high speed.
TABLE IV
COMPARISON WITH PREVIOUS ARCHITECTURES

The proposed patch-parallel PPU array accelerates the feature extraction and matching procedures. The experimental results of various complicated algorithms demonstrate that the vision chip achieves high system performance.
In FAST feature detection, the architecture achieves a throughput of 1.2 pixel/cycle, the highest among the listed architectures. In face detection, the system uses the local LBP feature, which is more robust than the PPED feature commonly used in vision chips. The dimension of the obtained feature is ten times that of the PPED feature, and it is more robust and more representative. Thanks to the patch-parallel PPU array, the feature extraction takes only 1300 instruction cycles for all image patches. The AdaBoost classification is performed by the PPU array in a distributed parallel manner. Compared with a serial implementation, the speed-up factor of the algorithm is roughly the number of PPUs, which is 64 in our case. The implemented high-speed tracking is also more robust than its counterparts [5], [8], [10], [17]–[19], because the object feature is invariant to illumination changes and is updated every frame to adapt to scale and rotation. The cooperation mechanism between the PE array and the PPU array greatly facilitates the object search procedure, thus making high-speed robust tracking possible.
Table IV compares the proposed processor with other state-of-the-art digital programmable vision chip processors. The architecture has a patch-parallel PPU array which greatly increases the efficiency of the local feature extraction and matching procedures. It extends the SOM neural network with the help of PE and PPU reconfiguration, thus enlarging the network scale. The frame pipeline scheme of the architecture can overlap the image sensor exposure time to decrease the idle time of the processor. Constrained by their row-parallel architectures, previously reported vision chips use low-dimension PPED features. In contrast, in our architecture, we can extract
tion,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 4, pp. 581–588, 2012.
[21] C. Shi, J. Yang, Y. Han, Z. Cao, Q. Qin, L. Liu, N.-J. Wu, and Z. Wang, “A 1000 fps vision chip based on a dynamically reconfigurable hybrid architecture comprising a PE array processor and self-organizing map neural network,” IEEE Journal of Solid-State Circuits, vol. 49, no. 9, pp. 2067–2082, 2014.
[22] J. Yang, C. Shi, L. Liu, and N. Wu, “Heterogeneous vision chip and LBP-based algorithm for high-speed tracking,” Electronics Letters, vol. 50, no. 6, pp. 438–439, 2014.
[23] S. Kyo, S. Okazaki, and T. Arai, “An integrated memory array processor for embedded image recognition systems,” IEEE Transactions on Computers, vol. 56, no. 5, pp. 622–634, 2007.
[24] F. Gregoretti, R. Passerone, L. M. Reyneri, and C. Sansoé, “A high speed VLSI architecture for handwriting recognition,” Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, vol. 28, no. 3, pp. 259–278, 2001.
[25] A. Broggi, G. Conte, F. Gregoretti, C. Sansoe, R. Passerone, and L. M. Reyneri, “Design and implementation of the PAPRICA parallel architecture,” Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, vol. 19, no. 1, pp. 5–18, 1998.
[26] K. Grauman and B. Leibe, “Visual object recognition,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 5, no. 2, pp. 1–181, 2011.
[27] E. Rosten and T. Drummond, “Machine learning for high-speed corner detection,” in Computer Vision–ECCV 2006. Springer, 2006, pp. 430–443.
[28] T. Ojala, M. Pietikäinen, and T. Mäenpää, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971–987, 2002.
[29] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[30] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1. IEEE, 2005, pp. 886–893.
[31] T. Ahonen, A. Hadid, and M. Pietikainen, “Face description with local binary patterns: Application to face recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 12, pp. 2037–2041, 2006.
[32] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up robust features (SURF),” Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.
[33] M. Calonder, V. Lepetit, M. Ozuysal, T. Trzcinski, C. Strecha, and P. Fua, “BRIEF: Computing a local binary descriptor very fast,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 7, pp. 1281–1298, 2012.
[34] K. Zhang, L. Zhang, and M.-H. Yang, “Real-time compressive tracking,” in Computer Vision–ECCV 2012. Springer, 2012, pp. 864–877.
[35] P. Viola and M. J. Jones, “Robust real-time face detection,” International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.
[36] Z. Cao, Y. Zhou, Q. Li, Q. Qin, L. Liu, and N. Wu, “Design of pixel for high speed CMOS image sensor,” in Proc. International Image Sensor Workshop, 2013, pp. 229–232.
[37] “Altera Arria V.” [Online]. Available: http://www.altera.com/products/devkits/altera/kit-arria-v-starter.html
[38] J. Li and N. M. Allinson, “A comprehensive review of current local features for computer vision,” Neurocomputing, vol. 71, no. 10, pp. 1771–1787, 2008.
[39] E. Rosten, R. Porter, and T. Drummond, “Faster and better: A machine learning approach to corner detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 1, pp. 105–119, 2010.
[40] Y. Moko, Y. Watanabe, T. Komuro, M. Ishikawa, M. Nakajima, and K. Arim, “Implementation and evaluation of FAST corner detection on the massively parallel embedded processor MX-G,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 2011, pp. 157–162.
[41] K. Dohi, Y. Yorita, Y. Shibata, and K. Oguri, “Pattern compression of FAST corner detection for efficient hardware implementation,” in International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2011, pp. 478–481.
[42] C. Shi, J. Yang, L. Liu, N. Wu, and Z. Wang, “A massively parallel keypoint detection and description (MP-KDD) algorithm for high-speed vision chip,” Science China Information Sciences, vol. 57, no. 10, pp. 1–12, 2014.
[43] J. Yang, C. Shi, L. Liu, J. Liu, and N. Wu, “Pixel-parallel feature detection on vision chip,” Electronics Letters, vol. 50, no. 24, pp. 1839–1841, 2014.
[44] C. Shan, “Learning local binary patterns for gender classification on real-world face images,” Pattern Recognition Letters, vol. 33, no. 4, pp. 431–437, 2012.
[45] “ORL human face library.” [Online]. Available: https://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html
[46] “Labeled Faces in the Wild.” [Online]. Available: http://vis-www.cs.umass.edu/lfw/
[47] Z. Chen, J. Yang, C. Shi, and N. Wu, “A novel architecture of local memory for programmable SIMD vision chip,” in IEEE 10th International Conference on ASIC (ASICON). IEEE, 2013, pp. 1–4.

Jie Yang was born in Chongqing, China, in 1987. He received the B.S. degree in electronic science and technology from Tianjin University, Tianjin, China, in 2010, and the Ph.D. degree in microelectronics from the Institute of Semiconductors, Chinese Academy of Sciences, Beijing, China, in 2015.
He is currently a Post-Doctoral Fellow with the Department of Electrical and Computer Engineering, University of Calgary, Calgary, Canada. His research interests include algorithms and very large scale integration architecture of image processing, computer vision, wide dynamic range imaging and deep learning.

Yongxing Yang received the B.S. degree from Tsinghua University, Beijing, China, in 2011. He is currently pursuing the Ph.D. degree with the Institute of Semiconductors, Chinese Academy of Sciences, Beijing.
His research interests include visual target tracking, image processing, computer vision, and deep learning.
Jian Liu received the Dr. rer. nat. degree from Julius Maximilian University of Würzburg, Würzburg, Germany, in 2003.
He joined the Institute of Semiconductors, Chinese Academy of Sciences, where he works on semiconductor spintronics, optoelectronic detectors and nanofabrication. He is presently a full professor in microelectronics and solid-state electronics.