
CHAPTER-1

INTRODUCTION
The use of dedicated decimal hardware has been limited to main-frame
business computers and handheld calculators at the low end. Very recently a renewed
interest in providing hardware acceleration for decimal arithmetic has emerged. It has
been boosted by the numerically intensive computing requirements of new
commercial, financial and Internet applications, such as e-commerce and e-banking.
Because of the prospect of more widespread use of decimal processing, the
revised IEEE Standard for Floating-Point Arithmetic incorporates a specification
for decimal arithmetic. Besides, advances in FPGA technology have opened up new
opportunities to implement efficient floating-point coprocessors for specialized tasks
at a fraction of the cost of an ASIC implementation. Thus, even though most current
research on decimal arithmetic is targeted at high-performance VLSI design, a
few works present implementations of decimal arithmetic units for FPGAs.
The design of decimal units for FPGAs faces several challenges: first, the
inherent inefficiency of decimal representations in systems based on two-state logic,
and the complex mapping of decimal arithmetic rules into Boolean logic. On the
other hand, the special built-in characteristics of FPGA architectures make it difficult
to use many well-known methods to speed up computations (for example, carry-save
and signed-digit arithmetic). Therefore, it may be preferable to develop specific
decimal algorithms more suitable for FPGAs rather than adapting existing ones
targeted at ASIC platforms. In this context, we present the algorithm, architecture
and FPGA implementation of a novel unit to perform fast addition of a large number
of decimal (BCD) fixed-point or integer operands. This operator is also of key
importance for other arithmetic operations such as decimal multiplication and
division.
The use of Field Programmable Gate Arrays (FPGAs) to implement digital
circuits has been growing in recent years. In addition to their reconfiguration
capabilities, modern FPGAs allow highly parallel computing.
FPGAs achieve speedups of two orders of magnitude over a general-purpose
processor for arithmetic-intensive algorithms. Thus, these kinds of devices are
increasingly selected as the target technology for many applications, especially in
digital signal processing, hardware accelerators, cryptography and much more.

Therefore, the efficient implementation of generalized operators on FPGAs is
of great relevance. The typical structure of an FPGA device is a matrix of
configurable logic elements (LEs), each one surrounded by interconnection resources.
In general, each configurable element is basically composed of one or several n-input
lookup tables (n-LUTs) and flip-flops. However, in modern FPGA architectures, the
array of LEs has been augmented by including specialized circuitry, such as dedicated
multipliers, block RAM, and so on. It has been demonstrated that the intensive use
of these new elements reduces the performance gap between FPGA and ASIC
implementations.
One of these resources is the carry-chain system, which is used to improve the
implementation of carry-propagate adders (CPAs). It mainly consists of additional
specialized logic to deal with the carry signals, and specific fast routing lines between
consecutive LEs, as shown in Fig. 1.1. This resource is present in most current FPGA
devices, from low-cost ones to high-end families, and it accelerates the carry
propagation by more than one order of magnitude compared to an implementation
using general resources. Apart from the CPA implementation, many studies have
demonstrated the importance of using this resource to achieve designs with better
performance and/or lower area requirements, and even for implementing nonarithmetic
circuits.
Redundant representation reduces the addition time by limiting the length of
the carry-propagation chains. The most usual representations are carry-save (CS) and
signed-digit (SD). A CS adder (CSA) adds three numbers using an array of Full-
Adders (FAs), but without propagating the carries. In this case, the FA is usually
known as a 3:2 counter. The result is a CS number, which is composed of a sum-word
and a carry-word. Therefore, the CS result is obtained without any carry propagation
in the time taken by only one FA.
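The sum-word/carry-word computation of a 3:2 counter array can be sketched in Python on whole words, treating each operand as an integer (a behavioral model; the function and signal names are illustrative, not from the text):

```python
def csa(a, b, c):
    """Carry-save (3:2 counter) addition: three operands in, a sum-word
    and a carry-word out, with no carry propagation between positions."""
    sum_word = a ^ b ^ c                              # per-bit XOR of the inputs
    carry_word = ((a & b) | (a & c) | (b & c)) << 1   # per-bit majority, weight x2
    return sum_word, carry_word

# The CS pair (sum-word, carry-word) represents the same value as a + b + c.
s, c = csa(0b1011, 0b0110, 0b1101)
assert s + c == 0b1011 + 0b0110 + 0b1101
```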
The addition of two CS numbers requires an array of 4:2 compressors, which
can be implemented with two 3:2 counters. The conversion to nonredundant
representation is achieved by adding the sum and carry words in a conventional CPA.
However, due to the efficient implementation of CPAs, the use of redundant adders
has usually been rejected when targeting FPGA technology. A direct implementation
of a 3:2 counter usually doubles the area requirements of its equivalent CPA, and the
improved speed is only noticeable for long bit widths.
Nevertheless, several recent studies have demonstrated that redundant adders
can be efficiently mapped on FPGA structures, reducing area overhead and improving
speed.

Despite the important advances represented by these previous studies, the
proposed solutions require either (or sometimes both) the use of a sophisticated
heuristic to generate each compressor tree or a low-level design. The latter impedes
portability, because it is highly dependent on the inner structure of the device. In
addition, their area and speed could be improved, because their use of the specialized
fast carry-chain is very limited.

Fig 1.1 Carry-chain resources included in an FPGA


This work presents the efficient implementation of multioperand redundant
compressor trees on modern FPGAs by using their fast carry resources. Our
approaches strongly reduce delay and generally present no area overhead compared to
a CPA tree. Moreover, they can be defined at a high level based on an array of
standard CPAs. As a consequence, they are compatible with any FPGA family or
brand, and any improvement in the CPA system of future FPGA families would also
benefit them. Furthermore, due to their simple structure, it is easy to design a
parametric HDL core, which allows synthesizing a compressor tree for any number of
operands of any bit width. Compared to previous approaches, our design presents
better performance, is easier to implement, and offers direct portability. The rest of
the paper focuses on CS representation, because the extension to SD representation
can be achieved simply by inverting certain input and output signals from and to the
compressor tree, as has been demonstrated previously. Since it is unnecessary to make
any internal changes to the array structure, these small modifications do not
significantly affect compressor tree performance.
VLSI
Very-large-scale integration (VLSI) is the process of creating integrated
circuits by combining thousands of transistor-based circuits into a single chip. VLSI
began in the 1970s when complex semiconductor and communication technologies
were being developed.
The microprocessor is a VLSI device. The term is no longer as common as it
once was, as chips have increased in complexity into the hundreds of millions of
transistors.
Overview
The first semiconductor chips held one transistor each. Subsequent advances
added more and more transistors, and, as a consequence, more individual functions or
systems were integrated over time. The first integrated circuits held only a few
devices, perhaps as many as ten diodes, transistors, resistors and capacitors, making it
possible to fabricate one or more logic gates on a single device.
Now known retrospectively as "small-scale integration" (SSI), these early
devices were followed, as techniques improved, by chips with hundreds of logic gates
and then by large-scale integration (LSI), i.e. systems with at least a thousand logic
gates. Current technology has moved far past this mark, and today's microprocessors
have many millions of gates and hundreds of millions of individual transistors.
At one time, there was an effort to name and calibrate various levels of large-
scale integration above VLSI. Terms like Ultra-large-scale Integration (ULSI) were
used. But the huge number of gates and transistors available on common devices has
rendered such fine distinctions moot. Terms suggesting greater than VLSI levels of
integration are no longer in widespread use. Even VLSI is now somewhat quaint,
given the common assumption that all microprocessors are VLSI or better.
As of early 2008, billion-transistor processors are commercially available, an
example of which is Intel's Montecito Itanium chip. This is expected to become more
commonplace as semiconductor fabrication moves from the current generation of 65
nm processes to the next 45 nm generations (while experiencing new challenges such
as increased variation across process corners). Another notable example is NVIDIA’s
280 series GPU.
This GPU is unique in that its 1.4 billion transistors, capable of a teraflop of
performance, are almost entirely dedicated to logic (Itanium's transistor count is
largely due to its 24 MB L3 cache). Current designs, as opposed to
the earliest devices, use extensive design automation and automated logic synthesis to
lay out the transistors, enabling higher levels of complexity in the resulting logic
functionality. Certain high-performance logic blocks, like the SRAM cell, however,
are still designed by hand to ensure the highest efficiency (sometimes by bending or
breaking established design rules to obtain the last bit of performance, trading away
stability).

VLSI stands for "Very Large Scale Integration". This is the field that
involves packing more and more logic devices into smaller and smaller areas.
 Simply put, an integrated circuit is many transistors on one chip.
 Design/manufacturing of extremely small, complex circuitry using modified
semiconductor material
 An integrated circuit (IC) may contain millions of transistors, each a few µm or less in size
 Applications are wide ranging: most electronic logic devices
Advantages of ICs Over Discrete Components
While we will concentrate on integrated circuits, the properties of integrated
circuits (what we can and cannot efficiently put in an integrated circuit) largely
determine the architecture of the entire system. Integrated circuits improve system
characteristics in several critical ways. ICs have three key advantages over digital
circuits built from discrete components:
Size: Integrated circuits are much smaller; both transistors and wires are shrunk to
micrometer sizes, compared to the millimeter or centimeter scales of discrete
components. Small size leads to advantages in speed and power consumption, since
smaller components have smaller parasitic resistances, capacitances, and inductances.
Speed: Signals can be switched between logic 0 and logic 1 much more quickly within
a chip than between chips. Communication within a chip can occur hundreds
of times faster than communication between chips on a printed circuit board. The high
speed of on-chip circuits is due to their small size: smaller components and wires have
smaller parasitic capacitances to slow down the signal.
Power Consumption: Logic operations within a chip also take much less power.
Once again, lower power consumption is largely due to the small size of circuits on
the chip: smaller parasitic capacitances and resistances require less power to drive
them.

VLSI And Systems
These advantages of integrated circuits translate into advantages at the system level:
 Smaller physical size. Smallness is often an advantage in itself: consider portable
televisions or handheld cellular telephones.
 Lower power consumption. Replacing a handful of standard parts with a single
chip reduces total power consumption. Reducing power consumption has a ripple
effect on the rest of the system: a smaller, cheaper power supply can be used;
since less power consumption means less heat, a fan may no longer be necessary;
and a simpler cabinet with less electromagnetic shielding may be feasible, too.
 Reduced cost. Reducing the number of components, the power supply
requirements, cabinet costs, and so on, will inevitably reduce system cost. The
ripple effect of integration is such that the cost of a system built from custom ICs
can be lower, even though the individual ICs cost more than the standard parts they
replace.
Understanding why integrated circuit technology has such profound influence
on the design of digital systems requires understanding both the technology of IC
manufacturing and the economics of ICs and digital systems.
Applications of VLSI
Electronic systems now perform a wide variety of tasks in daily life.
Electronic systems in some cases have replaced mechanisms that operated
mechanically, hydraulically, or by other means; electronics are usually smaller, more
flexible, and easier to service.
In other cases electronic systems have created totally new applications.
Electronic systems perform a variety of tasks, some of them visible, some more
hidden:
 Personal entertainment systems such as portable MP3 players and DVD
players perform sophisticated algorithms with remarkably little energy.
 Electronic systems in cars operate stereo systems and displays; they also
control fuel injection systems, adjust suspensions to varying terrain, and
perform the control functions required for anti-lock braking (ABS) systems.
 Digital electronics compress and decompress video, even at high-definition
data rates, on-the-fly in consumer electronics.

 Low-cost terminals for Web browsing still require sophisticated electronics,
despite their dedicated function.
 Personal computers and workstations provide word-processing, financial
analysis, and games. Computers include both central processing units (CPUs)
and special-purpose hardware for disk access, faster screen display, etc.
 Medical electronic systems measure bodily functions and perform complex
processing algorithms to warn about unusual conditions. The availability of
these complex systems, far from overwhelming consumers, only creates
demand for even more complex systems.

The growing sophistication of applications continually pushes the design and
manufacturing of integrated circuits and electronic systems to new levels of
complexity. And perhaps the most amazing characteristic of this collection of systems
is its variety: as systems become more complex, we build not a few general-purpose
computers but an ever wider range of special-purpose systems.
Our ability to do so is a testament to our growing mastery of both integrated
circuit manufacturing and design, but the increasing demands of customers continue
to test the limits of design and manufacturing.

CHAPTER-2
LITERATURE SURVEY
2.1 EXISTING SYSTEM
Multipliers are essential components in many circuits, particularly in
arithmetic units, alongside compressors, parity checkers and comparators.
Multipliers consist of three fundamental parts: a partial product generator, a partial
product reduction stage, and a final fast adder. A Booth encoder is used to generate
the partial products, and the partial products are reduced to two rows using compressor
circuits. Finally, a fast adder is used to sum the two rows. The partial product
reduction part of a multiplier contributes the maximum power consumption, delay and
layout area. Various high-speed multipliers use 3-2, 4-2 and 5-2 compressors to lower
the latency of the partial product reduction part. These compressors are used to
minimize delay and area, which increases the performance of the overall system.
Compressors are generally designed with XOR-XNOR gates and multiplexers. A
compressor is a device used to reduce the number of operands while adding the terms
of partial products in multipliers. An X-Y compressor takes X equally weighted input
bits and produces a Y-bit binary number. The most widely used and simplest
compressor is the 3-2 compressor, also known as a full adder. A 3-2 compressor has
three inputs X1, X2, X3 and generates two outputs, the sum and carry bits. The block
diagram of the 3-2 compressor is shown in the figure.

Fig 2.1 3-2 Compressor

Table 2.1 3-2 Compressor Truth Table


A 3-2 compressor cell can be implemented with many different logic structures.
However, in general, it is composed of three main modules. The first module
generates the XOR or XNOR function, or both; the second module generates the
sum; and the last module produces the carry output. The 3-2 compressor can also be
used as a full adder cell when its third input is taken as the carry input from the
preceding compressor block, i.e., X3 = Cin. The basic equation of the 3-2
compressor is:
X1 + X2 + X3 = Sum + 2·Carry
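The defining equation can be verified exhaustively with a short behavioral model (the function name is illustrative, not from the text):

```python
from itertools import product

def compressor_3_2(x1, x2, x3):
    """3-2 compressor (full adder): three equally weighted bits in,
    a sum bit and a carry bit (weighted x2) out."""
    s = x1 ^ x2 ^ x3
    c = (x1 & x2) | (x1 & x3) | (x2 & x3)
    return s, c

# Verify X1 + X2 + X3 = Sum + 2*Carry for all eight input combinations.
for x1, x2, x3 in product((0, 1), repeat=3):
    s, c = compressor_3_2(x1, x2, x3)
    assert x1 + x2 + x3 == s + 2 * c
```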

Fig 2.2 Conventional 3-2 Compressor

Fig 2.3 Improved 3-2 Compressor


The conventional architecture of the 3-2 compressor, shown in Fig. 2.2, has
two XOR gates in the critical path. The sum output is generated by the second XOR
gate and the carry output is generated by the multiplexer (MUX).
The 3-2 compressor architecture shown in Fig. 2.3 has less delay as
compared to other architectures, as some of the XOR circuits are replaced by
multiplexer circuits. In this compressor, the select bit of the multiplexer is
available before the input arrives, which reduces the delay: the switching time of the
transistors in the critical path is decreased [5]. This minimizes the delay by a
significant amount. This architecture shows a critical path delay of Δ-XOR + Δ-MUX.
Regular CS Compressor Tree Design
The classic design of a multioperand CS compressor tree attempts to reduce
the number of levels in its structure. The 3:2 counter and the 4:2 compressor are the
most widely known building blocks used to implement it. We select the 4:2
compressor as the basic building block, because it can be efficiently implemented on
Xilinx FPGAs [28]. The implementation of a generic CS compressor tree requires
⌈Nop/2⌉ − 1 4:2 compressors (because each one eliminates two signals), whereas a
carry-propagate tree uses Nop − 1 CPAs (since each one eliminates one signal). If we
bear in mind that a 4:2 compressor uses practically double the resources of a CPA
[28], both trees require basically the same area. On the other hand, the speed of a
compressor tree is determined by the number of levels required. In this case, because
each level halves the number of input signals, the critical path delay (D) is
approximately

L4:2 = ⌈log2(Nop)⌉ − 1    (1)


D ≈ L4:2 · d4:2    (2)

Fig 2.4 9:2 Compressor Tree based on a Linear array


where L4:2 is the number of levels of the compressor tree and d4:2 is the
delay of a 4:2 compressor level (including routing). This structure is constructed
assuming a similar delay for all paths inside each 4:2 compressor. Nevertheless, in
FPGA devices with dedicated carry resources, the path from the carry input to the
carry output, together with the routing to the next carry input, is usually more than
one order of magnitude faster than the other paths involved in connecting two FAs.
Thus, the connection of FAs through the carry-chain should be preserved as much as
possible to obtain fast circuits. In fact, this is the idea behind the structure of the 4:2
compressor presented in [28] and [29] for Xilinx FPGAs. We now generalize this idea
to compressors of any size by proposing a different approach based on linear arrays.
This reduces the critical path of the compressor tree when it is implemented on
FPGAs with specialized carry-chains.
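The level count of Eq. (1) can be cross-checked against a simple simulation in which each 4:2 compressor level halves the signal count until a carry-save pair remains (a numeric sketch, not a hardware model):

```python
import math

def tree_levels_4_2(nop):
    """Levels of a 4:2 compressor tree reducing nop operands to a CS pair:
    each level halves the number of signals (rounding odd counts up)."""
    n, levels = nop, 0
    while n > 2:
        n = -(-n // 2)   # ceil(n / 2): one more 4:2 level
        levels += 1
    return levels

# Matches L4:2 = ceil(log2(Nop)) - 1 from Eq. (1).
for nop in range(3, 65):
    assert tree_levels_4_2(nop) == math.ceil(math.log2(nop)) - 1
```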
BCD Addition Method
Unlike direct decimal addition methods, methods based on pre- and post-
correction of the binary sum can make full use of any binary adder, for example a
binary carry-ripple adder. However, the area and latency overhead due to these
decimal corrections is not negligible, and the resulting implementations are as
complex as the BCD carry-chain adders proposed by Bioul et al. The method proposed
in this work also obtains the decimal sum from the binary addition of the BCD input
operands, but the post-correction stage is simplified or removed at the expense of a
slightly more complex pre-correction stage. Another advantage of our proposal is
that the pre-correction stage can be integrated with the binary carry-ripple addition
logic in Virtex-5/6 FPGAs at the cost of an extra 6-input LUT per digit. As we show
later in Section 4, each 1-digit (4-bit) adder occupies the area of a 4-bit binary carry-
ripple adder (one Virtex-5/6 slice) plus an additional 6-input LUT.
The algorithm is defined for two BCD input operands but can be applied
recursively to more BCD addends. An example of addition for three BCD operands
X, Y, W is shown in Fig. 4. First, BCD operands X and Y are added as Z = X + Y in
three steps. The first two are pre-correction steps and the third is the subsequent
binary carry-propagate addition. The pre-correction consists of the conditional
addition of +6 (0110 in binary) factors to the input BCD operands in a digit-wise
fashion. These +6 corrections are needed to obtain the correct decimal carry output at
each digit position, since the BCD addition is computed as a binary carry-propagate
addition. In the previous methods [2, 15, 16, 17], the +6 factors were added
independently of the value of the input operands, such that the binary sum had to be
corrected at each decimal position whose corresponding carry-out was zero.
In our method, a correction factor of +6 is added to the BCD input digits Xi, Yi
when the sum XiU + YiU is equal to or greater than 8. We denote by XiU and YiU the
3 most significant bits of Xi and Yi, respectively.
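The classic pre/post-correction scheme that our method improves on can be sketched in Python. This is a behavioral model of the unconditional +6 approach of the previous methods, not of the simplified method proposed here; the function name and carry-detection trick are illustrative choices:

```python
def bcd_add(x, y, digits):
    """Classic pre/post-correction BCD addition: add +6 to every digit
    before a plain binary addition, then subtract 6 from each digit
    whose decimal carry-out turned out to be zero."""
    six = int("6" * digits, 16)      # 0x66...6: a +6 at every digit position
    t = x + six                      # pre-correction
    z = t + y                        # binary carry-propagate addition
    carries = t ^ y ^ z              # bit k is set iff a carry entered bit k
    for i in range(digits):
        if not (carries >> (4 * i + 4)) & 1:   # no decimal carry out of digit i
            z -= 6 << (4 * i)        # post-correction: undo the +6
    return z

assert bcd_add(0x58, 0x47, 2) == 0x105   # 58 + 47 = 105 in BCD
```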
2.2 PROPOSED SYSTEM: 9:2 COMPRESSOR TREE

Fig 2.5 9:2 Compressor Tree

Fig. 2.5 shows an example of a 9:2 compressor tree designed using the proposed
linear structure. In relation to the delay analysis, from a classic point of view
our compressor tree has Nop − 2 levels. This is many more than in a classic Wallace
tree structure and, thus, a longer critical path. Nevertheless, because we are targeting
an FPGA implementation, we temporarily assume that there is no delay for the carry-
chain path.
Under this assumption, the carry signal connections can be eliminated from
the critical path analysis and our linear array can be represented as a hypothetical
tree, as shown in the figure (where the carry-chain is represented in gray). To compute
the number of effective time levels (ETL) of this hypothetical tree, each CSA is
considered a 2:1 adder, except for the first, which is considered a 3:1 adder. Thus, the
first level of adders is formed by the first ⌊(Nop − 1)/2⌋ CSAs (which correspond to
the partial addition of the input operands). This first ETL produces ⌊(Nop − 1)/2⌋
partial sum-words that are added by a second level of CSAs (together with the last
input operand if Nop is even), and so on, in such a way that each ETL of CSAs halves
the number of inputs to the next level.
Multi-Operand Addition Method
In this section we review prior work on binary and decimal multi-operand
addition and analyse different representative FPGA implementations, discussing
their associated advantages and costs. Based on the conclusions extracted from this
survey, we present a proposal that leads to more efficient implementations of
decimal multi-operand adders on FPGA platforms.
Multi-Operand Adders For FPGAs
Binary multi-operand adders are generally arranged in two ways: as an array
of rows or as a tree-like structure. In an array configuration, each row of adders
reduces one further operand, so that m levels of adders are required to reduce m
operands to a final one. On the other hand, in an m-operand adder tree the number
of logic levels is log2(m) (or log2(m) − 1 levels for a signed-digit or carry-save adder
tree, but a final carry-propagate adder is needed in this case). Furthermore, the
hardware cost of both configurations is similar, so tree configurations are usually
preferred, though the array has a more regular routing.
Concerning the type of adder, the delay of each carry-propagate adder (carry-
ripple, carry-lookahead, ...) depends on the length of its input operands (delay
proportional to n for an n-bit binary carry-ripple adder, O(log2 n) for a carry-
lookahead adder).
To reduce the latency of addition on FPGAs, these devices incorporate
dedicated fast carry-chain paths. This favours simple carry-ripple implementations on
FPGAs over other carry-propagate adder topologies, except in the case of non-
pipelined implementations for large operands, which come at the expense of a high
hardware cost [33]. Moreover, pipelining techniques in FPGAs can be applied more
effectively to carry-ripple adders.
On the other hand, signed-digit and carry-save adders have a constant
computation time, delaying the carry propagation until the end. However, a
straightforward implementation on FPGAs requires roughly double the hardware of a
carry-ripple adder, and does not exploit the fast carry chain to improve speed. Several
authors have recently proposed efficient mappings of carry-save adders on FPGAs,
but only for the binary case.
The delays of both carry-ripple and carry-save adder trees are proportional to
n + log2(m) − 1, because the horizontal delays through the log2(m) levels of carry-
ripple chains overlap. In ASIC implementations, the advantage of the carry-save
adder tree lies in the flexibility of its routing, so the critical path delay can be reduced
by optimizing the interconnections between full adders. In FPGAs, a conventional
carry-save adder tree of full adders is slower than a carry-ripple adder tree due to the
complex routing, so the carry-ripple adder tree configuration is generally preferred.
This can be partially solved by efficiently mapping compressors that perform operand
reductions larger than 3:2.
Decimal multi-operand adders can also be implemented as carry-ripple,
signed-digit or carry-save adder trees but they require additional logic for decimal
correction. Therefore, delay and hardware cost are larger than in the binary case.
Next, we describe the most representative methods proposed for decimal fixed-
point/integer carry-propagate (carry-ripple, carry lookahead,...) and carry-free
(signed-digit, carry-save...) addition.

CHAPTER-3
DESIGN CONSIDERATIONS
3.1 COMPRESSOR ARCHITECTURES
Various high-speed multipliers use 3-2, 4-2 and 5-2 compressors to lower the
latency of the partial product reduction part [5]-[7]. These compressors are used to
minimize delay and area, which increases the performance of the overall system.
Compressors are generally designed with XOR-XNOR gates and multiplexers
[8]-[11]. A compressor is a device used to reduce the number of operands
while adding the terms of partial products in multipliers. An X-Y compressor takes X
equally weighted input bits and produces a Y-bit binary number.
The most widely used and simplest compressor is the 3-2 compressor,
also known as a full adder [12]. A 3-2 compressor has three inputs X1, X2,
X3 and generates two outputs, the sum and carry bits. The block diagram of the 3-2
compressor is shown in the figure.

Fig 3.1 3-2 Compressor
3.2 5:2 COMPRESSOR
The block diagram of a 5:2 compressor is shown in Fig. 3.2; it has seven
inputs and four outputs. Five of the inputs are the primary inputs x1, x2, x3, x4 and
x5, and the other two inputs, cin1 and cin2, receive their values from the neighboring
compressor of one binary bit order lower in significance.
All seven inputs have the same weight. The 5:2 compressor generates one
output, sum, of the same weight as the inputs, and three outputs, carry, cout1 and
cout2, weighted one binary order higher. The cout1 and cout2 outputs are fed to the
neighboring compressor of higher significance.

Fig 3.2 5:2 Compressor
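One possible internal structure chains three full adders; the sketch below is a behavioral model used only to check the 5:2 weight equation, not necessarily the gate-level design shown in the figure:

```python
from itertools import product

def full_adder(a, b, c):
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)

def compressor_5_2(x1, x2, x3, x4, x5, cin1, cin2):
    """5:2 compressor built from three chained full adders: seven weight-1
    inputs produce sum (weight 1) and carry, cout1, cout2 (weight 2)."""
    s1, cout1 = full_adder(x1, x2, x3)
    s2, cout2 = full_adder(s1, x4, x5)
    sum_, carry = full_adder(s2, cin1, cin2)
    return sum_, carry, cout1, cout2

# Verify x1+...+x5+cin1+cin2 = sum + 2*(carry + cout1 + cout2) everywhere.
for bits in product((0, 1), repeat=7):
    s, cy, c1, c2 = compressor_5_2(*bits)
    assert sum(bits) == s + 2 * (cy + c1 + c2)
```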


3.3 FPGA
Field-programmable gate arrays (FPGAs) are the modern-day technology for
building a breadboard or prototype from standard parts; programmable logic blocks
and programmable interconnects allow the same FPGA to be used in many different
applications.

Fig 3.3 FPGA Structure

For smaller designs and/or lower production volumes, FPGAs may be more
cost-effective than an ASIC design, even in production.
 An application-specific integrated circuit (ASIC) is an integrated circuit (IC)
customized for a particular use, rather than intended for general-purpose use.
 A structured ASIC falls between an FPGA and a standard cell-based ASIC.
 Structured ASICs are used mainly for mid-volume designs. The design
task for structured ASICs is to map the circuit onto a fixed arrangement of
known cells.
For example, a chip designed solely to run a cell phone is an ASIC. Intermediate
between ASICs and industry-standard integrated circuits, like the 7400 or the 4000
series, are application-specific standard products (ASSPs).
As feature sizes have shrunk and design tools improved over the years, the
maximum complexity (and hence functionality) possible in an ASIC has grown from
5,000 gates to over 100 million. Modern ASICs often include entire 32-bit processors,
memory blocks including ROM, RAM, EEPROM, Flash and other large building
blocks. Such an ASIC is often termed a SoC (system-on-a-chip). Designers of digital
ASICs use a hardware description language (HDL), such as Verilog or VHDL, to
describe the functionality of ASICs.

CHAPTER-4
IMPLEMENTATION
4.1 LINEAR ARRAY STRUCTURE
The carry resources are only used in the design of a single 4:2 compressor, but
these resources have not been considered in the design of the whole compressor tree
structure. To optimize the use of the carry resources, we propose a compressor tree
structure similar to the classic linear array of CSAs .However, in our case, given the
two output words of each adder (sum-word and carry-word), only the carry-word is
connected from each CSA to the next, whereas the sum words are connected to lower
levels of the array. Fig shows an example for a 9:2 compressor tree designed using the
proposed linear structure, where all lines are N bit width buses, and carry signal are
correctly shifted. For the CSA, we have to distinguish between the regular inputs (A

18
and B) and the carry input (Ci in the figure), whereas the dashed line between the
carry input and output represents the fast carry resources. With the exception of the
first CSA, where Ci is used to introduce an input operand, on each CSA Ci is
connected to the carry output (Co) of the previous CSA, as shown in Fig.
Thus, the whole carry-chain is preserved from the input to the output of the
compressor tree (from I0 to Cf). First, the two regular inputs of each CSA are used to
add all the input operands (Ii). When all the input operands have been introduced
into the array, the partial sum-words (Si) generated previously are then added in order
(i.e., the first generated partial sums are added first), as shown in the figure. In this
way, we maximize the overlap between propagation through the regular signals and
the carry-chains.
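The connection scheme above can be modeled behaviorally in Python. The scheduling (input operands first, then partial sums in the order they were produced) follows the text; the function names and the operand values are illustrative:

```python
def csa(a, b, c):
    """N-bit carry-save adder: returns (sum-word, carry-word)."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def linear_compressor_tree(operands):
    """Linear array of CSAs: the carry-word of each CSA feeds the carry
    input (Ci) of the next, input operands are consumed first, and the
    partial sum-words are then re-added in production order.
    Returns the final carry-save pair (S, C)."""
    queue = list(operands)           # input operands I0, I1, ...
    carry = queue.pop(0)             # I0 enters through the first Ci
    sums = []                        # partial sum-words S0, S1, ...
    while len(queue) + len(sums) > 1:
        a = queue.pop(0) if queue else sums.pop(0)
        b = queue.pop(0) if queue else sums.pop(0)
        s, carry = csa(a, b, carry)  # carry chained along the carry-chain
        sums.append(s)
    return sums[0], carry

ops = [3, 14, 15, 9, 26, 5, 35, 8, 9]   # nine operands: a 9:2 reduction
s, c = linear_compressor_tree(ops)
assert s + c == sum(ops)                 # the CS pair preserves the total
```

Note that the model uses exactly Nop − 2 CSAs (seven for nine operands), matching the area analysis below.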
Regarding area, the implementation of a generic compressor tree based on
N-bit-wide CSAs requires Nop − 2 of these elements (because each CSA eliminates
one input signal). Therefore, considering that a CSA can be implemented using the
same number of resources as a binary CPA (as shown below), the proposed linear
array, the 4:2 compressor tree, and the binary CPA tree have approximately the same
hardware cost.

Fig 4.1 9:2 Compressor Tree


In relation to the delay analysis, from a classic point of view our compressor
tree has Nop _ 2 levels. This is much more than a classic Wallace tree structure and,
thus, a longer critical path. Nevertheless, because we are targeting an FPGA

19
implementation, we temporarily assume that there is no delay for the carry-chain path.
Under this assumption, the carry signal connections could be eliminated from the
critical path analysis and our linear array could be represented as a hypothetical tree,
as shown in Fig. 4.2 (where the carry-chain is represented in gray).
To compute the number of effective time levels (ETL) of this hypothetical
tree, each CSA is considered a 2:1 adder, except for the first, which is considered a
3:1 adder. Thus, the first level of adders is formed by the first ⌊(Nop − 1)/2⌋ CSAs
(which correspond to the partial addition of the input operands). This first ETL produces
⌊(Nop − 1)/2⌋ partial sum-words that are added by a second level of CSAs (together
with the last input operand if Nop is even), and so on, in such a way that each ETL of
CSAs halves the number of inputs to the next level. Therefore, the total number of ETLs in this
hypothetical tree is

L = ⌈log2(Nop − 1)⌉    (3)

and the delay of this tree is approximately L times the delay of a single ETL.

Fig 4.2 Proposed CS 9-2 Compressor Tree


However, the delay of the carry-chain is comparatively low, but not null. Let
us consider just two global values for the delay: dcarry, which is the delay of the path
between the carry inputs (Ci) of two consecutive CSAs, and dsum, which is the delay
from a general input of a CSA (A or B) to a general input of a directly connected
CSA, i.e., the time taken by the data to go from one ETL to the next. Even under
this simplified scenario, it is unfeasible to obtain a general analytical expression for
the delay of our compressor tree structure. On each ETL, the propagation through
carry-chains and the general paths are overlapped and this overlap depends on
multiple factors. First, it depends on the relative relationship between the values of
dcarry and dsum (which is associated with the FPGA family used). Second, it depends
on the number of operands that affect both the delay of the carry-chain of each ETL
and the internal structure of the hypothetical tree. Even though the former could be
expressed as an analytical formula, the latter cannot be expressed in this way
(especially when Nop − 1 is not a power of two). However, it is possible to bound the
critical path delay by considering two extreme situations. One extreme occurs
when the delay of the whole carry-chain corresponding to each ETL (dcarry × the
number of CSAs in the ETL) is always greater than the delay from one ETL to the next
(dsum). In this case, the timing behavior corresponds to a linear array and the
critical path is the one represented in Fig. 4.3. Initially, the first carry-out signal is generated
from I1, I2, I3 in the first CSA and then the carry signal is propagated through the
whole carry-chain until the output. Thus, the delay of the critical path has two
components corresponding to the generation of the first carry signal and the
propagation through the carry-chain.

Fig 4.3 Critical path of the 9:2 Compressor Tree


4.2 PIPELINED ARCHITECTURE
We have fully pipelined the combinational architecture as follows: each level
of the adder tree is placed in a pipeline stage. Besides, each carry-ripple adder is
pipelined in chunks of at most k bits. The total number of pipeline stages is equal to
⌈(4p + l)/k⌉ + ⌈log2(m)⌉. In Fig. 8 we show a BCD adder tree with this register
placement for chunks of k = p bits.
A significant number of registers is required for input synchronization. To
reduce the hardware cost, these synchronization registers are placed in the first
pipeline level of the tree and packed together as 16-bit shift-register LUTs.
4.3 ADDER STRUCTURES

Adders are used in many applications. It is generally recognized that most of the
time required by adders is due to carry propagation, so reducing the propagation
time is the focus of today's techniques. Different binary adder schemes have their
own characteristics, such as area and energy dissipation.
No single adder scheme is best for every condition, so choosing one for a
specific context with specific requirements and constraints is important. Because this
thesis does not focus on the delay analysis of different adders, only the
function of some commonly used adders is given here.
Two’s Complement Representation
Two's complement representation uses the most significant bit as a sign bit,
making it easy to test whether an integer is positive or negative. The range of two's
complement representation is from −2^(n−1) to 2^(n−1) − 1. Consider an n-bit integer A in
two's complement representation. If A is positive, then the sign bit an−1 is zero.

The remaining bits represent the magnitude of the number, in the same fashion
as for sign-magnitude:

A = ∑ (i = 0 to n−2) 2^i ai,  for A ≥ 0

The number zero is identified as positive and therefore has a 0 sign bit and a
magnitude of all 0s. We can see that the range of positive integers that may be
represented is from 0 to 2^(n−1) − 1; any larger number would require more bits.
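To make the range concrete, here is a small illustrative Python check of how an n-bit pattern is interpreted in two's complement (`from_twos_complement` is a hypothetical helper name):

```python
def from_twos_complement(bits, n):
    # Interpret an n-bit pattern (given as a non-negative int) as a signed value:
    # subtract 2^n whenever the sign bit a_{n-1} is set.
    return bits - (1 << n) if bits & (1 << (n - 1)) else bits

# 8-bit examples: the representable range is -2^7 .. 2^7 - 1 = -128 .. 127
assert from_twos_complement(0b01111111, 8) == 127
assert from_twos_complement(0b10000000, 8) == -128
assert from_twos_complement(0b11111111, 8) == -1
```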
Fixed Time Type
The most commonly implemented scheme is the fixed time type adder. Its
characteristic is that no signal indicates when the addition is complete; therefore the
worst-case delay must be assumed.
Variable Time Type
Contrary to the fixed time type adder scheme, variable time type adders provide
a completion signal, so the result of the addition can be used as soon as the
completion signal is asserted.
4.4 CARRY-PROPAGATE ADDER
Carry-propagate adders (CPA) produce the result in a conventional number
system, also called a fixed-radix system. The property of a fixed-radix system is that
every number has a unique representation, so that no two digit sequences have the same
numerical value. The digit set ranges from 0 to r − 1, where r is the radix.
4.5 RIPPLE-CARRY ADDER
An n-bit adder used to add two n-bit binary numbers can be built by connecting
n full adders in series. Each full adder represents a bit position i (from 0 to n − 1),
and the carry out of the full adder at position i is connected to the carry in of the full
adder at the next higher position i + 1.
The sum output of the full adder at position i, as shown in Fig. 4.4, is given by:
Si = Xi ⊕ Yi ⊕ Ci
The carry output of each full adder is given by:
Ci+1 = XiYi + XiCi + YiCi

Fig4.4 Ripple-carry adder


In the expression of the sum, Ci must be generated by the full adder at the
lower position i − 1. Let tc be the delay from an input of the full adder to the carry
output and ts the delay from an input to the sum output.
The worst-case delay is then given by:
TCRA = (n − 1)tc + max(tc, ts)
This adder is slow for large n. Its main advantage is the simplicity of its cells
and of the connections among them.
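The cell equations above can be exercised with a small bit-level model in Python; `ripple_carry_add` is a hypothetical helper name, not part of the design:

```python
def ripple_carry_add(x, y, n, cin=0):
    """n-bit ripple-carry adder built from n chained full adders."""
    s, c = 0, cin
    for i in range(n):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        s |= (xi ^ yi ^ c) << i                  # Si = Xi xor Yi xor Ci
        c = (xi & yi) | (xi & c) | (yi & c)      # Ci+1 = XiYi + XiCi + YiCi
    return s, c                                  # n-bit sum and carry out

assert ripple_carry_add(13, 10, 5) == (23, 0)
assert ripple_carry_add(12, 7, 4) == (3, 1)      # 12 + 7 = 19 = 16 + 3
```

Note how the loop index mirrors the chain of full adders: each iteration can only start once the previous carry `c` is known, which is exactly the source of the (n − 1)tc term in the worst-case delay.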
4.6 REDUNDANT ADDERS
The characteristic of redundant adders is that no carry propagation is required;
in other words, their delay is independent of the number of bits of the adder. The
operands are represented using a redundant digit set. The main purpose of the
redundant adder is to reduce the addition time, but this kind of adder has some
disadvantages. The first is the increase in the number of bits needed to represent a
number, which depends on the degree of redundancy.
Another disadvantage is that some operations, such as magnitude comparison
or sign detection, cannot be performed directly on redundant numbers.
4.7 CARRY-SAVE ADDER
The carry-save adder (CSA) has the same circuit as the full adder, as shown in
Fig. 4.5.

Fig 4.5 Function Of Carry-Save Adder

The carry-in signal is considered an input of the CSA, and the carry-out
signal is considered an output of the CSA. Fig. 4.6 shows how n carry-save adders
are arranged to add three n-bit numbers x, y, z into two numbers c and s.

Fig 4.6 CSA Used In N Bit Number


Note in Fig. 4.6 that all full adders are independent.
Fig. 4.7 shows the CSA computation flow, and Table 4.1 shows how the CSA works (on
binary numbers).

Fig 4.7 CSA computation

Table 4.1 CSA Computation


The computation can be divided into two steps: first we compute S and C
using a CSA, then we use a CPA to compute the total sum. From this example, we can
see that the carry signal and the sum signal can be computed independently to obtain
just two n-bit numbers. A CPA is used for the last step of the computation, and carry
propagation exists only in this last step.
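The two-step flow (CSA, then CPA) can be sketched in Python as follows; the function name `carry_save` is illustrative:

```python
def carry_save(x, y, z):
    # One CSA step over whole words: x + y + z == s + c, with no carry
    # propagation; each carry bit is simply written one position to the left.
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c

# Step 1: the CSA produces independent sum and carry words
s, c = carry_save(0b1011, 0b0110, 0b1101)   # 11 + 6 + 13
# Step 2: a CPA adds the two words; only here do carries propagate
assert s + c == 11 + 6 + 13
```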
4.8 CARRY-LOOKAHEAD ADDER
The basic idea of the carry-lookahead adder is to compute the carries
simultaneously, i.e., all the carries in the same group are computed at the same time.
The carry-lookahead adder has two functions: first all the carries are computed, then
the operation Si = Xi ⊕ Yi ⊕ Ci is implemented by a simple 3-input XOR gate. The
design of the lookahead carry generator involves two Boolean functions named
Generate and Propagate. For each pair of input bits these functions are defined as:
Gi = XiYi
Pi = Xi ⊕ Yi
In the first case, the carry bit is activated by the local conditions (the values of
Xi and Yi). In the second, the carry bit is received from the less significant elementary
addition and is propagated further to the more significant elementary addition
depending on the function Pi. Therefore, the carry-out bit corresponding to a pair of
bits Xi and Yi is computed according to the equation:

Ci = Gi + PiCi−1

Hence, the carry signal can be computed from the carry-in, Generate, and Propagate
signals.
For example, consider a four bit adder
C1 = G0 + P0Cin
C2 = G1 + P1G0 + P1 P0Cin
C3 = G2 + P2G1 + P2 P1G0 + P2 P1 P0Cin
C4 = G3 + P3G2 + P3 P2G1 + P3 P2 P1G0 + P3 P2 P1 P0Cin
Fig. 4.8 illustrates the carry-out signal computation procedure more
clearly.

Fig 4.8 Carry out of Carry Lookahead Adder

The sum output of each column is given in Fig. 4.9.

Fig 4.9 Sum of Carry Lookahead Adder

The advantage of the carry-lookahead adder is that if the n-bit input vector is
divided into groups of m bits and the groups are connected like a ripple-carry adder,
the worst-case delay is:

TCLA = (n/m) tgroup + ts

This worst-case delay is less than that of the ripple-carry adder because tgroup is
smaller than m·tc. Hence the carry-lookahead adder is faster than the ripple-carry adder.
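The Generate/Propagate recurrence can be checked with a short Python model (`cla_add` is an illustrative name). Note that the loop below evaluates Ci+1 = Gi + PiCi serially for clarity, whereas the hardware evaluates the expanded expressions (C1 through C4 above) in parallel:

```python
def cla_add(x, y, n, cin=0):
    """Adder model: Gi = Xi*Yi, Pi = Xi xor Yi, Ci+1 = Gi + Pi*Ci, Si = Pi xor Ci."""
    c, s = cin, 0
    for i in range(n):
        g = (x >> i) & (y >> i) & 1          # Generate
        p = ((x >> i) ^ (y >> i)) & 1        # Propagate
        s |= (p ^ c) << i                    # Si = Pi xor Ci
        c = g | (p & c)                      # Ci+1 = Gi + Pi*Ci
    return s, c

assert cla_add(0b1011, 0b0110, 4) == (0b0001, 1)   # 11 + 6 = 17 = 16 + 1
```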
4.9 SIGNED-DIGIT ADDER (SDA)
Signed-digit (SD) number representation systems have been defined for any
radix r with digit values ranging over the set {−α, ..., −1, 0, 1, ..., α}, where
α is an arbitrary integer in the range (r − 1)/2 ≤ α ≤ r − 1. Such number
representation systems possess sufficient redundancy to allow the breakup of carry
or borrow chains and hence result in fast propagation-free addition and subtraction.

27
The result of the addition uses the signed-digit representation: a fixed-radix
representation with digit values from a signed-integer set,

x = ∑ (i = 0 to n−1) xi r^i

with digit set {−α, ..., −1, 0, 1, ..., α}.
The addition algorithm is not described here in detail. The objective of the SDA is to
eliminate carry propagation. A signed-digit addition is performed in two steps.
Step 1: compute the sum (w) and transfer (t); the transfer plays a role similar to that of
the carries in a CPA:
x + y = w + t
At the digit level this corresponds to
xi + yi = wi + r·ti+1
Fig. 4.10 shows the addition of the first two digits of n-digit numbers.

Fig4.10 First step of Signed-digit Addition


Step 2: compute s = w + t. At the digit level,
si = wi + ti
We can compute si without producing a carry, as shown in Fig. 4.11.

Fig 4.11 Second step of Sign-digit Addition

Finally, the complete SDA structure is shown in Fig. 4.12.

Fig 4.12 Sign-Digit Addition


A. Avizienis proposed the redundant binary number (a radix-2 signed-digit
number). With this type of number, the propagation of carries is absorbed by the
redundancy, and the addition process is independent of the number of digits and
can be executed in only two steps. Further details on computing ti and on the
representation of the operands are given in the literature.
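The two-step algorithm can be illustrated in Python for radix 10 with digit set {−6, ..., 6}. The threshold rule used below for choosing the transfers is one common choice, not necessarily the one used in the cited scheme, and the function names are illustrative:

```python
def sd_add(x, y, r=10, alpha=6):
    """Signed-digit addition, least significant digit first.
    Step 1: xi + yi = wi + r*t(i+1), choosing transfers so |wi + ti| <= alpha.
    Step 2: si = wi + ti, which never generates a new carry."""
    n = len(x)
    w, t = [0] * n, [0] * (n + 1)
    for i in range(n):
        p = x[i] + y[i]
        if p >= alpha:                # force a positive transfer
            t[i + 1], w[i] = 1, p - r
        elif p <= -alpha:             # force a negative transfer
            t[i + 1], w[i] = -1, p + r
        else:
            t[i + 1], w[i] = 0, p
    return [w[i] + t[i] for i in range(n)] + [t[n]]   # carry-free second step

def value(digits, r=10):
    return sum(d * r**i for i, d in enumerate(digits))

s = sd_add([5, 3, -2], [4, -6, 1])    # -165 + 44
assert value(s) == -121
assert all(abs(d) <= 6 for d in s)    # every result digit stays in the digit set
```

Because each wi is steered into a range where adding the incoming ti ∈ {−1, 0, 1} cannot overflow the digit set, step 2 is carry-free, regardless of the word length.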

4.10 CARRY SAVE ADDER TREE

Fig 4.13 A [ p :2] Adder

The carry-save adder tree can be used to add three operands in two's
complement representation, producing the result as the sum of two vectors. A 3-to-2
reduction is called a [3:2] adder and, extending this idea into a tree, a [p:2] adder
reduces p bit-vectors to 2 bit-vectors using CSAs.
In Fig. 4.13, each column has k bits and there are p levels. We can use [3:2]
adders to reduce the rows and obtain 2 bit-vectors. No propagation of the carries
is required except in the last two rows, which results in a speedup of the
computation.

Fig 4.14 Reduction By Rows


As shown in Fig. 4.14, the number of input vectors is reduced row by row. Finally,
we can estimate the number of levels of the CSA tree as

levels ≈ log(k/2) / log(3/2)

where k is the number of input operands.
4.11 VIRTEX-5/6 IMPLEMENTATIONS
In this section we detail the Virtex-5/6 implementation of the proposed BCD
carry-ripple adder tree for multi-operand decimal addition. We present the
combinational architecture.
Combinational Architecture

Fig 4.15 Combinational Architecture
The combinational architecture is a tree of BCD carry-ripple adders that reduces m BCD
operands Z[k] into a final BCD sum S. The width of each BCD adder in the first row of
the tree is p digits (or 4p bits). To avoid sum overflows, we need to extend the width of
each BCD adder by an appropriate amount. For example, the width of the final adder
needs to be incremented by at least ⌈log10 m⌉ digits, or l bits, with l given by

l = 4⌊log10 m⌋ + ⌈log2(m / 10^⌊log10 m⌋)⌉

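The formula for l can be evaluated numerically as a quick sanity check; `extra_bits` is an illustrative helper, not part of the hardware:

```python
import math

def extra_bits(m):
    # l = 4*floor(log10 m) + ceil(log2(m / 10^floor(log10 m)))
    d = int(math.floor(math.log10(m)))
    return 4 * d + int(math.ceil(math.log2(m / 10**d)))

assert extra_bits(10) == 4    # summing 10 operands needs one extra BCD digit
assert extra_bits(16) == 5    # 16 operands need 5 extra bits
```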

The result of each two-operand BCD sum is passed to an adder one level down,
and the carry ripples through the length of the adder. However, since the carry-ripple
chains overlap, the signals propagate through at most 4p + l bits in length plus
⌈log2(m)⌉ levels in depth. Each BCD carry-ripple adder adds two BCD operands and
produces an intermediate decimal sum as described.
To implement the sum digit Equation (8) efficiently, we merged the
computation of the pre-correction steps (computation of AUi and the conditional +6
addition) with the 4-bit binary carry-ripple sum. The resulting Virtex-5/6
implementation of the 1-digit BCD carry-ripple adder is shown in Fig. 4.16. This sum cell
adds two decimal digits Xi, Yi and a carry input Ci, and produces a decimal digit Zi
and a carry output Ci+1. The decimal digits are represented in the extended BCD code
of Table 3, though in practice the digits 8 and 9 each have only the following two
representations: {1000₂, 1110₂} for 8, and {1001₂, 1111₂} for 9.
The hardware cost of this 1-digit BCD adder is five 6-input LUTs (1 slice
and a LUT). The 6-input LUTs compute the binary carry-generate (gi,j) and carry-
propagate (pi,j) signals required to implement Zi (Equation (8)).
The Boolean expressions for the binary carry-generate and carry-propagate
signals are given by:

gi,3 = xi,3 yi,3 ∨ xi,3 (yi,2 ∨ yi,1) ∨ yi,3 (xi,2 ∨ xi,1) ∨ xi,2 yi,2 (xi,1 ∨ yi,1)
gi,2 = 0
gi,1 = 0
gi,0 = xi,0 yi,0

Fig 4.16 4Bit Adder


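Functionally, each sum cell performs a decimal digit addition with a +6 adjustment. The following Python sketch captures the arithmetic behavior only, not the merged LUT-level mapping or the extended BCD encoding described above; `bcd_digit_add` is a hypothetical name:

```python
def bcd_digit_add(x, y, cin):
    """Add two decimal digits plus a carry-in; apply the +6 correction when the
    4-bit binary result would leave the decimal range 0..9."""
    t = x + y + cin
    if t > 9:
        return (t + 6) & 0xF, 1    # wrap into a valid BCD digit, carry out 1
    return t, 0

assert bcd_digit_add(8, 5, 0) == (3, 1)   # 8 + 5 = 13 -> digit 3, carry 1
assert bcd_digit_add(4, 3, 1) == (8, 0)
```

Adding 6 makes the 4-bit result skip the six unused codes (1010₂ to 1111₂), so the wrapped low nibble is again a valid decimal digit and the discarded bit becomes the decimal carry.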
4.12 MULTI-OPERAND ADDITION
A common structure for adding several operands is an adder tree, such as a
Wallace tree, a Dadda tree, or a carry-save adder tree. In this thesis, the carry-save
adder tree structure and the Wallace tree are used. The primitive operation performed on
the input bit-array is reduction, to obtain an output bit-array with a small number of
bits. Two methods are used: reduction by rows and reduction by columns; the carry-
save adder tree belongs to the first method and the Wallace tree to the second.
Modules that reduce the rows are called adders, and modules that reduce the columns
are called counters.

4.13 WALLACE TREE


Wallace tree structures are widely used in additions with several operands.
Reduction by columns is similar to reduction by rows if the number of bits in each
column of the array is the same, but this is not always the case: for example, in
the partial products of a multiplier, the least significant column cannot receive bits
from other columns. So reduction by columns is introduced. The basic concept is to
reduce the number of bits in each column at each level, so the full adder and half adder
are used as a (3:2) counter and a (2:2) counter, respectively.

Fig 4.17 FA and HA as (3:2) counter and (2:2) counter


In Fig. 4.17, the three nodes inside the pane represent the FA's three inputs and
the two nodes outside represent the FA's carry out and sum. The half adder has two
inputs, one sum and one carry out.
Here is an example as used in this thesis.
Example: a = [2 3] means 2 bits with weight 2^1 and 3 bits with
weight 2^0. We can use a Wallace tree as shown in Fig. 4.18 to achieve fast addition.
The basic modules in the Wallace tree are the (3:2) counter and the (2:2) counter.

Fig 4.18 Reduction By Columns


The vector changes from a = [2 3] to a = [1 2 1]; the maximum column height
changes from 3 bits to 2 bits. Carry propagation delay is eliminated except for the last
row. The last step uses a CPA, such as a carry-lookahead adder, to compute the sum,
and fast addition is achieved.
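The column reduction in this example can be reproduced with a small Python routine. `wallace_level` is an illustrative name, and the column counts are listed from the most significant column, matching the a = [2 3] notation above:

```python
def wallace_level(counts):
    """Apply one level of (3:2)/(2:2) counters to a column-height vector.
    counts[i] is the number of bits in column i (MSB first); carries move
    one column to the left, creating a new MSB column."""
    sums, carries = [], []
    for c in counts:
        fa, rest = divmod(c, 3)      # as many (3:2) counters as possible
        ha = rest // 2               # one (2:2) counter if exactly 2 bits remain
        sums.append(fa + ha + rest % 2)   # a single leftover bit passes through
        carries.append(fa + ha)
    out = [0] * (len(counts) + 1)    # index 0 is the new MSB column
    for i in range(len(counts)):
        out[i + 1] += sums[i]        # sum bits stay in their column
        out[i] += carries[i]         # carry bits move one column left
    return out

assert wallace_level([2, 3]) == [1, 2, 1]   # a = [2 3] -> [1 2 1]
```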

The Dadda tree is a special case of the Wallace tree in which all the bits are
collected using a minimum number of counters and a minimum critical path. The
Wallace tree was chosen as the basic algorithm of the program for this thesis.

CHAPTER-5
RESULT ANALYSIS

Combinational and pipelined versions of the BCD carry-ripple adder tree presented
in Chapter 4 were designed in VHDL for arbitrary values of m, p and k using the
Xilinx ISE Design Suite. To have more control over the mapping process, the different
adder cells were directly instantiated as components of the Xilinx Virtex-6 library.
The architectures were synthesized for a Virtex-6 XC6VLX75T device with speed
grade −3 using the XST compiler and simulated with ModelSim SE 6.5.
We determined that the most suitable structures for fast BCD multi-operand
addition in FPGAs are trees built of carry-ripple or carry-chain adders. To extract
more precise conclusions from the comparison, we coded and synthesized two different
adder trees: a binary carry-ripple adder tree and a tree built of BCD carry-chain
adders.
The proposed BCD carry-ripple adder practically halves the hardware cost of
the BCD carry-chain adder and is still very competitive in terms of speed. The BCD
carry-chain adders have speed advantages for large carry chains at the expense of a
higher hardware cost.

CONCLUSION
A BCD multi-operand adder has been successfully designed and implemented
on a Virtex-5/6 FPGA. We performed a survey of prior techniques
for BCD multi-operand addition to find the most area-efficient low-latency FPGA
implementation. From this survey we found that tree structures built of BCD carry-
ripple or carry-chain adders are suitable for state-of-the-art FPGA implementations.
Efficiently implementing CS compressor trees on FPGA, in terms of area and
speed, is made possible by using the specialized carry-chains of these devices in a
novel way. Similar to what happens when using ASIC technology, the proposed CS
linear array compressor trees lead to marked improvements in speed compared to
CPA approaches and, in general, with no additional hardware cost. Furthermore, the
proposed high-level definition of CSA arrays based on CPAs facilitates ease-of-use
and portability, even in relation to future FPGA architectures, because CPAs will
probably remain a key element in the next generations of FPGA. We have compared
our architectures, implemented on different FPGA families, to several designs and
have provided a qualitative and quantitative study of the benefits of our proposals.

FUTURE SCOPE

This project can be used in transaction processing systems, ATMs, personal
computers and workstations, medical electronic systems, etc. As future work we
plan to increase the compressor length and to decrease the number of time units. We
also plan to integrate the proposed multi-operand BCD adder into a core of the FloPoCo
project, a generator of arithmetic cores for FPGAs, and to support more FPGA
targets than the Virtex-5/6 families.

REFERENCES

 M. J. Adiletta and V. C. Lamere, BCD Adder Circuit, US patent 4,805,131, Jul
1989.
 D. P. Agrawal, Fast BCD/Binary Adder/Subtractor, Electronics Letters, vol.
10, no. 8, pp. 121–122, Apr. 1974.
 G. Bioul, M. Vazquez, J.-P. Deschamps, G. Sutter, Decimal Addition in
FPGA, V Southern Conference on Programmable Logic (SPL09), pp. 101–
108, Apr. 2009.
 F. Y. Busaba, C. A. Krygowski, W. H. Li, E. M. Schwarz and S. R. Carlough,
The IBM z900 Decimal Arithmetic Unit, Asilomar Conference on Signals,
Systems and Computers, vol. 2, pp. 1335–1339, Nov. 2001.
 Electronics for ‘U’

