Chapter 1

Multi-Operand Redundant Adders on FPGAs
CHAPTER-1
INTRODUCTION
VLSI stands for "Very Large Scale Integration". This field involves
packing more and more logic devices into smaller and smaller areas. Thanks to VLSI,
circuits that once would have filled an entire board can now fit into a space a few
millimeters across. This has opened up opportunities to do things that were not
possible before.
Verilog is a capable low-level hardware description language (HDL). Structural models
are easy to design, and behavioral RTL code is well supported. The syntax is regular and
easy to remember, and it is often regarded as the fastest HDL to learn and use. However,
Verilog lacks user-defined data types and lacks the interface-object separation of
VHDL's entity-architecture model.
The use of Field Programmable Gate Arrays (FPGAs) to implement digital circuits
has been growing in recent years. In addition to their reconfiguration capabilities,
modern FPGAs allow highly parallel computing. FPGAs can achieve speedups of two orders
of magnitude over a general-purpose processor for arithmetic-intensive algorithms [1].
Thus, these kinds of devices are increasingly selected as the target technology for
many applications, especially in digital signal processing, hardware accelerators,
cryptography, and much more. Therefore, the efficient implementation of generalized
operators on FPGAs is of great relevance.
The typical structure of an FPGA device is a matrix of configurable logic
elements (LEs), each one surrounded by interconnection resources. In general, each
configurable element is basically composed of one or several N-input lookup tables
(N-LUTs) and flip-flops. However, in modern FPGA architectures, the array of LEs has
been augmented by including specialized circuitry, such as dedicated multipliers, block
RAM, and so on. It has been demonstrated that the intensive use of these new elements
reduces the performance gap between FPGA and ASIC implementations.
One of these resources is the carry-chain system, which is used to improve the
implementation of carry propagate adders (CPAs). It mainly consists of additional
specialized logic to deal with the carry signals, and specific fast routing lines between
consecutive LEs, as shown in Fig. 1.
Sri Venkateswara Institute Of Science & Information Technology Page 1

This resource is present in most current FPGA devices, from low-cost ones to
high-end families, and it accelerates carry propagation by more than one order of
magnitude compared to an implementation using general resources. Apart from CPA
implementation, many studies have demonstrated the importance of using this resource
to achieve designs with better performance and/or lower area requirements, and even for
implementing nonarithmetic circuits.
Multi-operand addition appears in many algorithms, such as multiplication,
filters, sum of absolute differences (SAD), and others. To achieve efficient
implementations of this operation, redundant adders are extensively used. Redundant
representation reduces the addition time by limiting the length of the carry
propagation chains.
The most usual representations are carry-save (CS) and signed-digit (SD). A CS
adder (CSA) adds three numbers using an array of Full-Adders (FAs), but without
propagating the carries. In this case, the FA is usually known as a 3:2 counter. The result
is a CS number, which is composed of a sum-word and a carry-word. Therefore, the CS
result is obtained without any carry propagation in the time taken by only one FA.
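As a minimal bit-level sketch of the 3:2 counter array described above, the Python model below treats each operand as a non-negative integer and applies the full-adder equations to all bit positions at once. The function name `carry_save_add` is hypothetical and chosen only for illustration; it is a behavioral model, not an FPGA mapping.

```python
def carry_save_add(a, b, c):
    """Model an array of full adders (3:2 counters) on non-negative integers.

    Every bit position is processed independently, so no carry ripples
    from one position to the next.
    """
    sum_word = a ^ b ^ c                              # FA sum output per bit
    carry_word = ((a & b) | (a & c) | (b & c)) << 1   # FA carry output, weight 2
    return sum_word, carry_word


# The pair (sum_word, carry_word) is the carry-save form of a + b + c.
s, c = carry_save_add(13, 7, 9)
print(s, c)  # 3 26
```

The two output words only represent the result; their ordinary sum (3 + 26 = 29) equals 13 + 7 + 9.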
The addition of two CS numbers requires an array of 4:2 compressors, which can
be implemented by two 3:2 counters. The conversion to nonredundant representation is
achieved by adding the sum and carry words in a conventional CPA.
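The following sketch illustrates, at word level and under simplifying assumptions, how two cascaded 3:2 counters act as a 4:2 compressor and how the final CPA recovers the nonredundant value. The names `counter_3_2`, `compressor_4_2`, and `to_nonredundant` are hypothetical, and Python's integer `+` stands in for the carry-propagate adder.

```python
def counter_3_2(a, b, c):
    """Word-level 3:2 counter: returns (sum_word, carry_word)."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def compressor_4_2(s1, c1, s2, c2):
    """Add two carry-save numbers (s1, c1) and (s2, c2) with two 3:2 counters."""
    s, c = counter_3_2(s1, c1, s2)   # first level absorbs three of the four words
    return counter_3_2(s, c, c2)     # second level absorbs the remaining word


def to_nonredundant(sum_word, carry_word):
    """Final conversion: a conventional CPA adds the sum and carry words."""
    return sum_word + carry_word


x = (5, 6)   # carry-save pair representing 11
y = (9, 3)   # carry-save pair representing 12
result = to_nonredundant(*compressor_4_2(*x, *y))
assert result == 23
```

Only the last step involves carry propagation; both compressor levels operate on each bit position independently.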
However, due to the efficient implementation of CPAs, the use of redundant
adders has usually been rejected when targeting FPGA technology. A direct
implementation of a 3:2 counter usually doubles the area requirements of its equivalent
CPA, and improved speed is only noticeable for long bit widths. Nevertheless, several
recent studies have demonstrated that redundant adders can be efficiently mapped onto
FPGA structures, reducing area overhead and improving speed.
Despite the important advances represented by these previous studies, the
solutions proposed require either (or sometimes both) the use of a sophisticated heuristic
to generate each compressor tree or a low-level design. The latter impedes portability,
because it is highly dependent on the inner structure of the target device.

In this paper, we study the efficient implementation of multi-operand redundant
compressor trees in modern FPGAs by using their fast carry resources. Our approaches
strongly reduce delay, and they generally present no area overhead compared to a CPA
tree. Moreover, they can be defined at a high level based on an array of standard CPAs.
As a consequence, they are compatible with any FPGA family or brand, and any
improvement in the CPA system of future FPGA families would also benefit them.
Furthermore, due to their simple structure, it is easy to design a parametric HDL core,
which allows synthesizing a compressor tree for any number of operands of any bit
width. Compared to previous approaches, our design presents better performance, is
easier to implement, and offers direct portability.
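To make the notion of a compressor tree for "any number of operands" concrete, here is a behavioral sketch of a generic carry-save reduction tree. It is an assumption-laden illustration of the classic 3:2-counter tree, not the carry-chain-based mapping this paper proposes; the names `csa` and `compressor_tree` are hypothetical.

```python
def csa(a, b, c):
    """Word-level 3:2 counter: returns (sum_word, carry_word)."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def compressor_tree(operands):
    """Reduce any number of non-negative integers to one carry-save pair."""
    words = list(operands)
    while len(words) > 2:
        next_level = []
        # Each group of three words is compressed into two words.
        for i in range(0, len(words) - 2, 3):
            next_level.extend(csa(words[i], words[i + 1], words[i + 2]))
        # Words left over (zero, one, or two) pass through to the next level.
        next_level.extend(words[len(words) - len(words) % 3:])
        words = next_level
    words += [0] * (2 - len(words))   # pad when fewer than two operands are given
    return words[0], words[1]


ops = [3, 14, 15, 92, 65, 35, 89]
s, c = compressor_tree(ops)
assert s + c == sum(ops)   # a single final CPA yields the nonredundant result
```

Because the number of operands is just the length of a list, the same description covers any operand count, which mirrors the idea of a parametric core.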
The rest of the paper focuses on CS representation, because the extension to SD
representation can be achieved simply by inverting certain input and output signals
to and from the compressor tree, as has previously been demonstrated. Since it is
unnecessary to make any internal changes to the array structure, these small
modifications do not significantly affect compressor tree performance.
Adders:
In electronics, an adder or summer is a digital circuit that performs addition of
numbers. In many computers and other kinds of processors, adders are used not only in
the arithmetic logic unit(s), but also in other parts of the processor, where they are used
to calculate addresses, table indices, and similar operations.
Although adders can be constructed for many numerical representations, such
as binary-coded decimal or excess-3, the most common adders operate on binary numbers.
In cases where two's complement or ones' complement is being used to represent
negative numbers, it is trivial to modify an adder into an adder-subtractor. Other signed
number representations require a more complex adder.
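The two's complement case can be sketched as follows: subtraction reuses the same adder by inverting every bit of the second operand and setting the carry-in to 1. This is a behavioral model under assumed parameters; the function name `add_sub` and the default `width` of 8 bits are hypothetical choices for illustration.

```python
def add_sub(a, b, subtract, width=8):
    """Two's complement adder-subtractor on width-bit words.

    When subtract is true, b is bitwise inverted and the carry-in is 1,
    so the adder computes a + (~b) + 1 = a - b modulo 2**width.
    """
    mask = (1 << width) - 1
    b_in = (b ^ mask) if subtract else b   # XOR each bit of b with the control
    carry_in = 1 if subtract else 0
    return ((a & mask) + (b_in & mask) + carry_in) & mask


assert add_sub(5, 3, subtract=False) == 8
assert add_sub(5, 3, subtract=True) == 2
assert add_sub(3, 5, subtract=True) == 0xFE   # -2 as an 8-bit two's complement word
```

The only additions to a plain adder are one XOR per bit of the second operand and a controllable carry-in, which is why the modification is considered trivial.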

1.1 Literature Survey:
B. Cope, P. Cheung, W. Luk, and L. Howes (2010) [1]: A systematic approach to
the comparison of the graphics processor (GPU) and reconfigurable logic is defined in
terms of three throughput drivers. The approach is applied to five case study algorithms,
characterized by their arithmetic complexity, memory access requirements, and data
dependence, and two target devices: the nVidia GeForce 7900 GTX GPU and a Xilinx
Virtex-4 field programmable gate array (FPGA).
A speedup of two orders of magnitude over a general-purpose processor is observed for
each device for arithmetic-intensive algorithms. An FPGA is superior to a GPU for
algorithms requiring large numbers of regular memory accesses, while the GPU is
superior for algorithms with variable data reuse. In the presence of data dependence, the
implementation of a customized data path in an FPGA exceeds GPU performance by up
to eight times. The trends of the analysis with respect to newer and future technologies
are also analyzed.
S. Dikmese, A. Kavak, K. Kucuk, S. Sahin, A. Tangel, and H. Dincer (2010) [2]:
Software radio implementations of beamformers on programmable processors such as
digital signal processors (DSPs) and field programmable gate arrays (FPGAs) still remain
a challenge for the integration of smart antennas into existing wireless base stations for
3G systems. This study presents a comparison of DSP- and FPGA-based
implementations of the space-code correlation (SCC) beamformer, which is practical to use
in CDMA2000 systems. The implementation methodology is demonstrated, and results
regarding beamforming accuracy, weight vector computation time (execution time), and
resource utilization are presented.
The SCC algorithm is implemented on Texas Instruments (TI) TMS320C6713
floating-point digital signal processors (DSPs) and a Xilinx Virtex-4 family FPGA. In
signal modeling, the CDMA2000 reverse link format is employed. The results show that
beamformer weights can be obtained in less than 10 ms via implementation on the
C6713 DSP with direction-of-arrival (DOA) search resolution, whereas this can be achieved
in less than 25 µs on the Virtex-4 FPGA for a five-element uniform linear array (ULA).
These results demonstrate that the FPGA implementation achieves weight vector
computation in much less time (nearly 500 times faster) than the DSP
implementation in this study.

S. Roy and P. Banerjee (2005) [3]: Most practical FPGA designs of digital
signal processing (DSP) applications are limited to fixed-point arithmetic owing to the
cost and complexity of floating-point hardware. While mapping DSP applications onto
FPGAs, a DSP algorithm designer must determine the dynamic range and desired
precision of input, intermediate, and output signals in a design implementation.
The first step in a MATLAB-based hardware design flow is the conversion of the
floating-point MATLAB code into a fixed-point version using "quantizers" from the filter
design and analysis (FDA) toolbox for MATLAB. This paper describes an approach to
automate the conversion of floating-point MATLAB programs into fixed-point
MATLAB programs for mapping to FPGAs, by profiling the expected inputs to estimate
errors. The algorithm attempts to minimize the hardware resources while constraining
the quantization error within a specified limit. Experimental results on five MATLAB
benchmarks are reported for Xilinx Virtex-II FPGAs.
F. Schneider, A. Agarwal, Y.M. Yoo, T. Fukuoka, and Y. Kim (2010) [4]:
Application-specific ICs have traditionally been used to support the high computational
and data rate requirements in medical ultrasound systems, particularly in receive
beamforming.
Utilizing previously developed efficient front-end algorithms, this paper
presents a simple programmable computing architecture, consisting of a
field-programmable gate array (FPGA) and a digital signal processor (DSP), to support core
ultrasound signal processing.
It was found that 97.3% of the FPGA resources and 51.8% of the DSP resources are
needed to support all the front-end and back-end processing for B-mode
imaging with 64 channels and 120 scan lines per frame at 30 frames/s. These results
indicate that this programmable architecture can meet the requirements of low- and
medium-level ultrasound machines while providing a flexible platform for supporting the
development and deployment of new algorithms and emerging clinical applications.

1.2 Organization of report:
