An FPGA Implementation of A Flexible, Parallel Image Processing Architecture Suitable For Embedded Vision Systems
Peter Lee
University of Kent
Abstract. This paper describes the design of a programmable parallel architecture that is to be used for signal pre-processing in intelligent embedded vision systems. The architecture has been implemented and tested using a Celoxica RC1000 Prototyping Platform with a Xilinx XCV2000E FPGA. The system operates at a clock rate of 50 MHz and can perform pre-processing functions such as filtering, correlation and transformation on an image of 256x256 pixels at up to 667 frames/s.

1. Introduction

This paper presents a novel parallel processing architecture which combines the flexibility of general-purpose machines, speed of DSPs, small-size and low-power performance of application-specific cores in a single, balanced platform specifically tailored to serve image processing operations. It describes the architecture and performance of these processors when implemented as a prototype on a Xilinx XCV2000E FPGA, prior to realisation in a complete system-on-a-programmable-chip. The paper will begin with a brief overview of the parallel architecture, followed by a description of the implementation and its use in an example application.
[Figure: vision system overview. Labels: Optical Sensor, Acquisition Layer (Temp. Buffers, Original Buffers), Pixel Pre-Processing, DMA Channel, Image Preprocessing, Image Processing Array (elements 0 to 15).]

[Figure: Image Processing Element (IPE). Labels: Pixel/Address FIFO (256 words) from DMA, Coefficient Memory (256x16), Instruction Memory (256x16), two 16x32 Register Files, Multiplier, ALU & Muxes, 32-bit Result Reg., Controller with READY/ACK signals.]
[...] and act upon the information received. This paper will concentrate on the architecture of the image pre-processing aspect of the system, which comprises the DMA channel and a parallel array of 16 processing elements, detailed in figure 2.

The DMA Channel addresses the source area of interest (AOI) according to a set of 24 addressing modes which were chosen to cover the most commonly used image processing algorithms (e.g., windowing, correlation). The source frame may either be the output of the sensor, or a temporary buffer in which a previous output of the image pre-processing layer was stored. The DMA channel then distributes the source pixels to the Image Processing Array, which comprises 16 identical Processing Elements, each operating on a set of source pixels in accordance with a programmable algorithm. The DMA channel is designed to detect overlapping regions, e.g., in adjacent 3x3 windows, in order to minimise the need for redundant pixel reads. The pre-processed image resulting from the parallel array is pooled into an Image Memory accessible to the host processor, which can then extract the information needed [...]
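The saving from detecting overlapping regions can be illustrated with a small count of pixel reads. The sketch below is only a software model under stated assumptions: the function names, and the choice of 16 horizontally adjacent 3x3 windows (one per processing element), are hypothetical and are not taken from the paper's 24 addressing modes.

```python
# Sketch only: counts pixel reads for a row of adjacent 3x3 windows.
# All names and the 16-windows-in-a-row layout are illustrative assumptions,
# not the DMA channel's actual addressing logic.

def window_coords(cx, cy, size=3):
    """Return the (x, y) coordinates covered by a size x size window centred at (cx, cy)."""
    half = size // 2
    return [(cx + dx, cy + dy)
            for dy in range(-half, half + 1)
            for dx in range(-half, half + 1)]

def reads_without_reuse(centres, size=3):
    """Every window fetches all of its pixels independently."""
    return sum(len(window_coords(cx, cy, size)) for cx, cy in centres)

def reads_with_reuse(centres, size=3):
    """Each distinct pixel is fetched once and shared between overlapping windows."""
    unique = set()
    for cx, cy in centres:
        unique.update(window_coords(cx, cy, size))
    return len(unique)

# 16 horizontally adjacent window centres, one per processing element (assumed layout).
centres = [(x, 1) for x in range(1, 17)]
naive = reads_without_reuse(centres)   # 16 windows * 9 pixels each
shared = reads_with_reuse(centres)     # 3 rows * 18 distinct columns

print(naive, shared)
```

Under these assumptions the 16 windows need 144 reads when fetched independently but only 54 when shared pixels are read once, which is the kind of redundancy the overlap detection is meant to remove.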
The image pre-processing architecture communicates with its host through a program memory. The host processor sends a block of control and data manipulation instructions to the pre-processor’s program memory, and awaits feedback through status reporting from the main controller before picking up data from the image memory.

The figure also indicates the presence of a shifter, which can be configured to compensate for scaling factors or simply to normalise the output of the array. An output sequencer multiplexes between the outputs of the processing elements, and provides the handshake signals necessary to confirm data delivery.

3. The Image Processing Element

The Image Processing Array is composed of 16 identical processing elements. Each element can be thought of as a small DSP specifically intended for image processing algorithms. The processing element is built upon a 16-bit input, 32-bit output datapath, and a RISC-like instruction set composed of 15 instructions.

Figure 3 illustrates the structure of the Image Processing Element (IPE), which operates on 16-bit two’s complement data and produces a 32-bit output. The IPE receives its data manipulation instructions from the main controller, and operates on pixels stored in its local memory according to the decoded instructions. It also comprises a small coefficient memory which can hold multiplication coefficients, convolution and correlation masks as well as matrix constants. It has two register files which may be used for temporary storage during computation. Once the algorithm execution has been completed, the IPE makes the data and target address available on its output, and informs the output sequencer of its readiness.

4. Implementation

The architecture described in the previous section was designed as a soft IP core using VHDL, prior to being embedded with a host processor on the same programmable device. To evaluate the performance of the system, the architecture was first implemented and tested using Celoxica’s RC1000-PP board [8]. This is made up of a single Virtex FPGA with extended memory capability (XCV2000E), and four external memory banks used for frame buffering. The architecture connects directly to a [...]

The verification model of the architecture was synthesised using Synopsys FPGA Express synthesis tools and Xilinx Foundation ISE for Place and Route. Technology mapping and resource estimation as well as processing performance measurements are listed in tables 1-3.

6. Example Application

Figure 5 demonstrates an application which utilises the parallel processor on board the RC1000-PP to pre-process vehicle images for numberplate recognition. The parallel processor removes the image background and unwanted details. It therefore prepares the image for upper layers to locate the plate position, before ‘cutting out’ characters and passing them on to a neural network for classification.

At a clock frequency of 50 MHz, the parallel processor implementation on the FPGA can achieve a throughput of up to 125 frames/s, whereas the original software application, which normally runs on a standard PC, achieves 50 frames/s with a processor clock frequency of 266 MHz. The factor of 2.5 improvement over a CPU which is clocked at more than 5 times the FPGA clock frequency is mainly due to the parallelism of the architecture and its optimised datapath.

7. Conclusion & Further Work

Rapid implementation of parallel structures based on FPGAs using VHDL proves to be a very efficient, cost-effective and attractive methodology for design verification. New multi-million gate FPGAs [9] with extended memory and fast I/O interfaces have made it possible to develop and test a large parallel architecture such as the one described in this paper. Future work will explore the possibility of integrating a host RISC processor into the system so as to complete the processing blocks needed for a complete embedded vision system. This will be an ideal use of System-on-a-Programmable-Chip (SOPC) technology [10], where the host processor is implemented either as a soft or hard core on a high-density Field Programmable Device with sufficiently large amounts of on-chip memories and advanced interfaces.

Acknowledgements

This work is funded by the European Commission’s Marie Curie Host Fellowship contract number HPMI-CT-199-00055 with NeuriCam S.p.A., Italy.
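As a supplement to the description of the Image Processing Element in Section 3, the following is a minimal software sketch of a multiply-accumulate datapath with 16-bit two's complement operands, a 32-bit result and an output shift for normalisation. The helper names, the wrap-around behaviour on 32-bit overflow and the example mask are assumptions made for illustration; the paper does not specify the instruction set or overflow handling at this level.

```python
# Sketch only: software model of a 16-bit in, 32-bit out multiply-accumulate.
# Helper names and wrap-on-overflow behaviour are illustrative assumptions,
# not the IPE's documented instruction semantics.

def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement integer."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def mac(pixels, coefficients):
    """Accumulate pixel * coefficient products into a 32-bit result register."""
    acc = 0
    for p, c in zip(pixels, coefficients):
        product = to_signed(p, 16) * to_signed(c, 16)
        acc = to_signed(acc + product, 32)  # assumed: accumulator wraps at 32 bits
    return acc

def normalise(result, shift):
    """Arithmetic right shift, standing in for the output shifter."""
    return result >> shift

# 3x3 smoothing mask whose weights sum to 16, so a shift of 4 normalises the output.
window = [10, 20, 30, 40, 50, 60, 70, 80, 90]
mask = [1, 2, 1, 2, 4, 2, 1, 2, 1]
raw = mac(window, mask)
smoothed = normalise(raw, 4)

print(raw, smoothed)
```

The mask here plays the role of the contents of the coefficient memory, and the final shift corresponds to the configurable shifter that compensates for scaling factors.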
[Figure: RC1000-PP implementation. Labels: COMMS, Memory Arbiter & Switches, Memory Banks 0, 1, 2, 3, XCV2000E FPGA: Image Pre-Processing Architecture.]
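The throughput figures quoted in the abstract and in Section 6 can be cross-checked with simple arithmetic. The sketch below uses only the numbers stated in the paper; the "cycles per pixel" reading, and the assumption that work is spread evenly over the 16 processing elements, are interpretations added here for illustration.

```python
# Sketch only: arithmetic on figures quoted in the paper.
# The per-pixel cycle budget and the even split over 16 processing elements
# are illustrative assumptions, not measurements from the paper.

FPGA_CLOCK_HZ = 50e6   # FPGA clock rate
PC_CLOCK_HZ = 266e6    # clock of the PC running the software version
FPGA_FPS = 125         # numberplate application on the FPGA
PC_FPS = 50            # same application in software
NUM_PES = 16           # processing elements in the array

speedup = FPGA_FPS / PC_FPS                  # frame-rate improvement
clock_ratio = PC_CLOCK_HZ / FPGA_CLOCK_HZ    # how much faster the PC is clocked
per_clock_gain = speedup * clock_ratio       # work done per clock cycle, relative to the PC

# Abstract: a 256x256 image processed at up to 667 frames/s.
pixels_per_second = 256 * 256 * 667
cycles_per_pixel = FPGA_CLOCK_HZ / pixels_per_second
cycles_per_pixel_per_pe = cycles_per_pixel * NUM_PES

print(round(speedup, 2), round(clock_ratio, 2), round(per_clock_gain, 1))
print(round(cycles_per_pixel, 2), round(cycles_per_pixel_per_pe, 1))
```

With these inputs the frame-rate gain is 2.5 at a clock ratio of about 5.3, i.e. roughly 13 times more work per clock cycle, and the peak rate in the abstract corresponds to a little over one clock cycle per pixel across the whole array.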
References
1. F. Paillet, Design Solutions and Techniques for Vision System on a Chip and Fine-grain Parallelism Circuit Integration, Workshop at the IEEE ASIC / System On Chip Conf., Washington DC, USA, 2000.
2. M. Betke et al., Real-time multiple vehicle detection and tracking from a moving vehicle, Machine Vision and Applications, 12(2), 2000, 69-83.
3. N. Yamashita et al., A 3.84 GIPS Integrated Memory Array Processor with 64 Processing Elements and a 2-Mb SRAM, IEEE J. of Solid-State Circ., 29(11), 1994, 1336-1343.
4. NeuriCam, NC1802 Pupilla 640x480-pixel Digital Camera, Datasheet Preliminary Rel. 11/2001. www.neuricam.com
5. P. Athanas & A. Abbott, Addressing the Computational Requirements of Image Processing with a Custom Computing Machine: An Overview, Workshop on Reconfigurable Architectures, IPPS '95, 1-15, 1995.
6. U. Ramacher et al., A 53-GOPS Programmable Vision Processor For Processing, Coding-Decoding And Synthesizing of Images, Proc. 27th Eur. Solid-State Circ. Conf., Villach, Austria, 2001, 160-163.
7. T. Minami et al., A 300-MOPS Video Signal Processor with a Parallel Architecture, IEEE J. of Solid-State Circ., 26(12), 1991, 1868-1875.
8. Celoxica, RC1000 Product Information Sheet, www.celoxica.com
9. Xilinx, Virtex 2000-E Datasheet, www.xilinx.com
10. Pat Mead, Investigating the Reality of System-On-a-Programmable-Chip, FPL 2001.
11. Anil Jain, Fundamentals of Digital Image Processing (New Jersey: Prentice Hall, 1989).