Chapter 1
INTRODUCTION
The use of dedicated decimal hardware has historically been limited to mainframe
business computers and, at the low end, handheld calculators. Very recently, renewed
interest in providing hardware acceleration for decimal arithmetic has emerged. It has
been boosted by the numerically intensive computing requirements of new
commercial, financial, and Internet applications, such as e-commerce and e-banking.
Because of the prospect of more widespread use of decimal processing, the
revised IEEE Standard for Floating-Point Arithmetic incorporates a specification for
decimal arithmetic. In addition, advances in FPGA technology have opened up new
opportunities to implement efficient floating-point coprocessors for specialized tasks
at a fraction of the cost of an ASIC implementation. Thus, even though most of the
current research on decimal arithmetic targets high-performance VLSI design, a
few works present implementations of decimal arithmetic units for FPGAs.
The design of decimal units for FPGAs faces several challenges: first, the
inherent inefficiency of decimal representations in systems based on two-state logic,
and the complex mapping of decimal arithmetic rules into Boolean logic. On the
other hand, the special built-in characteristics of FPGA architectures make it difficult
to use many well-known methods to speed up computations (for example, carry-save
and signed-digit arithmetic). Therefore, it may be preferable to develop specific
decimal algorithms better suited to FPGAs rather than adapting existing ones
targeted at ASIC platforms. In this context, we present the algorithm, architecture,
and FPGA implementation of a novel unit to perform fast addition of a large number
of decimal (BCD) fixed-point or integer operands. This operator is also of key
importance for other arithmetic operations such as decimal multiplication and
division.
The use of Field Programmable Gate Arrays (FPGAs) to implement digital
circuits has been growing in recent years. In addition to their reconfiguration
capabilities, modern FPGAs offer a high degree of parallelism.
FPGAs can achieve speedups of two orders of magnitude over a general-purpose
processor for arithmetic-intensive algorithms. Thus, these devices are
increasingly selected as the target technology for many applications, especially
digital signal processing, hardware accelerators, cryptography, and more.
Therefore, the efficient implementation of generalized operators on FPGAs is
of great relevance. The typical structure of an FPGA device is a matrix of
configurable logic elements (LEs), each one surrounded by interconnection resources.
In general, each configurable element is basically composed of one or several n-input
lookup tables (n-LUTs) and flip-flops. However, in modern FPGA architectures, the
array of LEs has been augmented with specialized circuitry, such as dedicated
multipliers, block RAM, and so on. It has been demonstrated that the intensive use
of these new elements reduces the performance gap between FPGA and ASIC
implementations.
One of these resources is the carry-chain system, which is used to improve the
implementation of carry-propagate adders (CPAs). It mainly consists of additional
specialized logic to deal with the carry signals, and specific fast routing lines between
consecutive LEs, as shown in Fig. 1. This resource is present in most current FPGA
devices, from low-cost ones to high-end families, and it accelerates carry
propagation by more than one order of magnitude compared to an implementation
using general resources. Apart from the CPA implementation, many studies have
demonstrated the importance of using this resource to achieve designs with better
performance and/or lower area requirements, and even for implementing non-arithmetic
circuits.
Redundant representation reduces the addition time by limiting the length of
the carry-propagation chains. The most common representations are carry-save (CS) and
signed-digit (SD). A CS adder (CSA) adds three numbers using an array of full
adders (FAs), but without propagating the carries. In this case, the FA is usually
known as a 3:2 counter. The result is a CS number, which is composed of a sum-word
and a carry-word. Therefore, the CS result is obtained without any carry propagation,
in the time taken by only one FA.
The addition of two CS numbers requires an array of 4:2 compressors, each of which
can be implemented by two 3:2 counters. The conversion to nonredundant
representation is achieved by adding the sum and carry words in a conventional CPA.
However, due to the efficient implementation of CPAs, the use of redundant adders
has usually been rejected when targeting FPGA technology. A direct implementation
of a 3:2 counter usually doubles the area requirements of its equivalent CPA, and the
improved speed is only noticeable for long bit widths.
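To make the 3:2/4:2 relationship concrete, here is a small behavioural sketch in Python (purely illustrative; the function names are ours, not from the text) of a 4:2 compressor built from two chained 3:2 counters, with an exhaustive check that the outputs always encode the arithmetic sum of the inputs:

```python
def counter_3to2(a, b, c):
    """3:2 counter (full adder): three bits -> (sum, carry)."""
    s = a ^ b ^ c
    cy = (a & b) | (a & c) | (b & c)
    return s, cy

def compressor_4to2(x1, x2, x3, x4, cin):
    """4:2 compressor built from two chained 3:2 counters.

    Returns (sum, carry, cout); cout depends only on x1..x3,
    so it can feed the next column without rippling.
    """
    t, cout = counter_3to2(x1, x2, x3)
    s, carry = counter_3to2(t, x4, cin)
    return s, carry, cout

# Exhaustive check: x1+x2+x3+x4+cin == sum + 2*(carry + cout).
for v in range(32):
    x1, x2, x3, x4, cin = [(v >> i) & 1 for i in range(5)]
    s, carry, cout = compressor_4to2(x1, x2, x3, x4, cin)
    assert x1 + x2 + x3 + x4 + cin == s + 2 * (carry + cout)
```

The key structural point is visible in the code: `cout` is produced by the first counter alone, which is what lets 4:2 compressors in adjacent columns operate without a ripple between them.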
Nevertheless, several recent studies have demonstrated that redundant adders
can be efficiently mapped onto FPGA structures, reducing area overhead and improving
speed.
VLSI
Very-large-scale integration (VLSI) is the process of creating integrated
circuits by combining thousands of transistor-based circuits into a single chip. VLSI
began in the 1970s when complex semiconductor and communication technologies
were being developed.
The microprocessor is a VLSI device. The term is no longer as common as it
once was, as chips have increased in complexity into the hundreds of millions of
transistors.
Overview
The first semiconductor chips held one transistor each. Subsequent advances
added more and more transistors, and, as a consequence, more individual functions or
systems were integrated over time. The first integrated circuits held only a few
devices, perhaps as many as ten diodes, transistors, resistors and capacitors, making it
possible to fabricate one or more logic gates on a single device.
Now known retrospectively as "small-scale integration" (SSI), these early devices
were followed, as techniques improved, by chips with hundreds of logic gates
(medium-scale integration, MSI) and then by large-scale integration (LSI), i.e.,
systems with at least a thousand logic gates. Current technology has moved far past
this mark, and today's microprocessors have many millions of gates and hundreds of
millions of individual transistors.
At one time, there was an effort to name and calibrate various levels of large-
scale integration above VLSI. Terms like Ultra-large-scale Integration (ULSI) were
used. But the huge number of gates and transistors available on common devices has
rendered such fine distinctions moot. Terms suggesting greater than VLSI levels of
integration are no longer in widespread use. Even VLSI is now somewhat quaint,
given the common assumption that all microprocessors are VLSI or better.
As of early 2008, billion-transistor processors are commercially available, an
example of which is Intel's Montecito Itanium chip. This is expected to become more
commonplace as semiconductor fabrication moves from the current generation of 65
nm processes to the next 45 nm generations (while experiencing new challenges such
as increased variation across process corners). Another notable example is NVIDIA’s
280 series GPU.
This chip is unique in that its 1.4 billion transistors, capable of a teraflop of
performance, are almost entirely dedicated to logic (Itanium's
transistor count is largely due to its 24 MB L3 cache). Current designs, as opposed to
the earliest devices, use extensive design automation and automated logic synthesis to
lay out the transistors, enabling higher levels of complexity in the resulting logic
functionality. Certain high-performance logic blocks like the SRAM cell, however,
are still designed by hand to ensure the highest efficiency (sometimes by bending or
breaking established design rules to obtain the last bit of performance by trading
stability).
VLSI stands for "Very Large Scale Integration". This is the field that
involves packing more and more logic devices into smaller and smaller areas.
Simply put, an integrated circuit is many transistors on one chip.
Design/manufacturing of extremely small, complex circuitry using modified
semiconductor material
An integrated circuit (IC) may contain millions of transistors, each only a few µm (or less) in size
Applications are wide-ranging: most electronic logic devices
Advantages of ICs Over Discrete Components
While we will concentrate on integrated circuits, the properties of integrated
circuits (what we can and cannot efficiently put in an integrated circuit) largely
determine the architecture of the entire system. Integrated circuits improve system
characteristics in several critical ways. ICs have three key advantages over digital
circuits built from discrete components:
Size: Integrated circuits are much smaller; both transistors and wires are shrunk to
micrometer sizes, compared to the millimeter or centimeter scales of discrete
components. Small size leads to advantages in speed and power consumption, since
smaller components have smaller parasitic resistances, capacitances, and inductances.
Speed: Signals can be switched between logic 0 and logic 1 much more quickly within a
chip than between chips. Communication within a chip can occur hundreds
of times faster than communication between chips on a printed circuit board. The high
speed of on-chip circuits is due to their small size: smaller components and wires have
smaller parasitic capacitances to slow down the signal.
Power Consumption: Logic operations within a chip also take much less power.
Once again, lower power consumption is largely due to the small size of circuits on
the chip: smaller parasitic capacitances and resistances require less power to drive
them.
VLSI And Systems
These advantages of integrated circuits translate into advantages at the system level:
Smaller physical size. Smallness is often an advantage in itself; consider portable
televisions or handheld cellular telephones.
Lower power consumption. Replacing a handful of standard parts with a single
chip reduces total power consumption. Reducing power consumption has a ripple
effect on the rest of the system: a smaller, cheaper power supply can be used;
since less power consumption means less heat, a fan may no longer be necessary;
and a simpler cabinet with less electromagnetic shielding may be
feasible, too.
Reduced cost. Reducing the number of components, the power supply
requirements, cabinet costs, and so on, will inevitably reduce system cost. The
ripple effect of integration is such that the cost of a system built from custom ICs
can be less, even though the individual ICs cost more than the standard parts they
replace.
Understanding why integrated circuit technology has such profound influence
on the design of digital systems requires understanding both the technology of IC
manufacturing and the economics of ICs and digital systems.
Applications of VLSI
Electronic systems now perform a wide variety of tasks in daily life.
In some cases, electronic systems have replaced mechanisms that operated
mechanically, hydraulically, or by other means; electronics are usually smaller, more
flexible, and easier to service.
In other cases electronic systems have created totally new applications.
Electronic systems perform a variety of tasks, some of them visible, some more
hidden:
Personal entertainment systems such as portable MP3 players and DVD
players perform sophisticated algorithms with remarkably little energy.
Electronic systems in cars operate stereo systems and displays; they also
control fuel injection systems, adjust suspensions to varying terrain, and
perform the control functions required for anti-lock braking (ABS) systems.
Digital electronics compress and decompress video, even at high-definition
data rates, on-the-fly in consumer electronics.
Low-cost terminals for Web browsing still require sophisticated electronics,
despite their dedicated function.
Personal computers and workstations provide word-processing, financial
analysis, and games. Computers include both central processing units (CPUs)
and special-purpose hardware for disk access, faster screen display, etc.
Medical electronic systems measure bodily functions and perform complex
processing algorithms to warn about unusual conditions. The availability of
these complex systems, far from overwhelming consumers, only creates
demand for even more complex systems.
CHAPTER-2
LITERATURE SURVEY
2.1 EXISTING SYSTEM
Multipliers are essential components in many circuits, used alongside other
arithmetic blocks such as compressors, parity checkers, and comparators.
Multipliers consist of three fundamental parts: a partial product generator, a partial
product reduction stage, and a final fast adder. A Booth encoder is used to generate the
partial products, and the partial products are reduced to two rows using compressor circuits.
Finally, a fast adder is used to sum the two rows. The partial product reduction part of a
multiplier contributes the most to power consumption, delay, and layout area. Various
high-speed multipliers use 3-2, 4-2, and 5-2 compressors to lower the latency of the partial
product reduction part. These compressors minimize delay and area, thereby
increasing the performance of the overall system. Compressors are generally
designed with XOR-XNOR gates and multiplexers. A compressor is a device
used to reduce the number of operands while adding terms of partial products in multipliers. An
X-Y compressor takes X equally weighted input bits and produces a Y-bit binary
number. The most widely used and simplest compressor is the 3-2 compressor,
also known as a full adder. A 3-2 compressor has three inputs X1, X2, X3
and generates two outputs, the sum and carry bits. The block diagram of a 3-2
compressor is shown in the figure.
Fig 2.1 3-2 Compressor
Fig 2.2 Conventional 3-2 Compressor
most widely known building blocks to implement it. We select a 4:2 compressor as
the basic building block, because it can be efficiently implemented on Xilinx
FPGAs [28]. The implementation of a generic CS compressor tree requires
⌈Nop/2⌉ − 1 4:2 compressors (because each one eliminates two signals), whereas a
carry-propagate tree uses Nop − 1 CPAs (since each one eliminates one signal). If we bear in
mind that a 4:2 compressor uses practically double the resources of a CPA
[28], both trees basically require the same area. On the other hand, the speed of a
compressor tree is determined by the number of levels required. In this case, because
each level halves the number of input signals, the critical path delay (D) is
approximately
presented in [28] and [29] for Xilinx FPGA. We now generalize this idea to
compressors of any size by proposing a different approach based on linear arrays.
This reduces the critical path of the compressor tree when it is implemented on
FPGAs with specialized carry-chains.
BCD Addition Method
Unlike direct decimal addition methods, the methods based on pre- and post-corrections
of the binary sum can make full use of any binary adder, for example a
binary carry-ripple adder. However, the area and latency overhead due to these
decimal corrections is not negligible, and the resulting implementations are as
complex as the BCD carry-chain adders proposed by Bioul et al. The method proposed
in this work also obtains the decimal sum from the binary addition of the BCD input
operands, but the post-correction stage is simplified or removed at the expense of a
slightly more complex pre-correction stage. Another advantage of our proposal is
that the pre-correction stage can be integrated with the binary carry-ripple addition
logic in Virtex-5/6 FPGAs at the cost of an extra 6-input LUT per digit. As we show
later in Section 4, each 1-digit (4-bit) adder occupies the area of a 4-bit binary carry-ripple
adder (one Virtex-5/6 slice) plus an additional 6-input LUT.
The algorithm is defined for two BCD input operands but can be applied
recursively to more BCD addends. An example of the addition of three BCD operands
X, Y, W is shown in Fig. 4. First, BCD operands X, Y are added as Z = X + Y in
three steps. The first two are pre-correction steps and the third is the subsequent
binary carry-propagate addition. The pre-correction consists of the conditional addition
of +6 (0110 in binary) factors to the input BCD operands in a digit-wise fashion. These +6
corrections are needed to get the correct decimal carry outputs at each digit position,
since the BCD addition is computed as a binary carry-propagate addition. In the
previous methods [2, 15, 16, 17], the +6 factors were added independently of the
value of the input operands, such that the binary sum had to be corrected at each
decimal position whose corresponding carry-out was zero.
In our method, a correction factor of 6 is added to the BCD input digits Xi, Yi
when the sum XiU + YiU is equal to or greater than 8, where XiU, YiU denote the
values of the 3 most significant bits of Xi and Yi, respectively.
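For contrast, the classical unconditional pre-/post-correction scheme that the proposed method refines can be sketched digit by digit in Python (`bcd_add` and its digit-list interface are our illustrative choices, not the paper's implementation):

```python
def bcd_add(x_digits, y_digits):
    """Classical BCD addition with unconditional +6 pre-correction.

    x_digits, y_digits: BCD digits, least significant first.
    Adding 6 to every digit pair makes a decimal carry appear as the
    binary carry out of the 4-bit group; digits whose group produced
    no carry are post-corrected by subtracting the 6 again.
    """
    carry, out = 0, []
    for xd, yd in zip(x_digits, y_digits):
        t = xd + yd + 6 + carry        # pre-corrected 4-bit binary add
        carry = t >> 4                 # decimal carry = binary group carry
        nib = t & 0xF
        if carry == 0:
            nib -= 6                   # post-correction when no carry-out
        out.append(nib)
    return out, carry

# 958 + 647 = 1605 (digits listed least significant first).
digits, cout = bcd_add([8, 5, 9], [7, 4, 6])
assert digits == [5, 0, 6] and cout == 1
```

The conditional scheme of this work moves the decision (add 6 or not) before the addition, based on XiU + YiU ≥ 8, so that the post-correction step can be simplified or removed.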
2.2 PROPOSED SYSTEM: 9:2 COMPRESSOR TREE
Fig 2.5 9:2 Compressor tree
The figure shows an example of a 9:2 compressor tree designed using the proposed
linear structure, where all lines are N-bit-wide buses. In relation to the delay analysis,
from a classic point of view our compressor tree has Nop − 2 levels. This is many more
than a classic Wallace tree structure and, thus, a longer critical path. Nevertheless,
because we are targeting an FPGA implementation, we temporarily assume that there
is no delay on the carry-chain path.
Under this assumption, the carry signal connections can be eliminated from
the critical path analysis and our linear array can be represented as a hypothetical
tree, as shown in the figure (where the carry-chain is represented in gray). To compute the
number of effective time levels (ETLs) of this hypothetical tree, each CSA is
considered a 2:1 adder, except for the first, which is considered a 3:1 adder. Thus, the
first level of adders is formed by the first ⌊(Nop − 1)/2⌋ CSAs (which correspond to the
partial addition of the input operands). This first ETL produces ⌊(Nop − 1)/2⌋ partial
sum-words that are added by a second level of CSAs (together with the last input
operand if Nop is even), and so on, in such a way that each ETL of CSAs halves the
number of inputs to the next level.
Multi-Operand Addition Method
In this section we review prior work on binary and decimal multi-operand
addition and analyse different representative FPGA implementations, discussing
their associated advantages and costs. Based on the conclusions extracted from this
survey, we then present a proposal that leads to more efficient implementations of
decimal multi-operand adders on FPGA platforms.
Multi-Operand Adders For FPGAs
Binary multi-operand adders are generally arranged in two ways: as an array
of rows or as a tree-like structure. In an array configuration each row of adders
reduces one further operand, so that m levels of adders are required to reduce m
operands to a final one. On the other hand, in an m-operand adder tree the number
of logic levels is log2(m) (or log2(m) − 1 levels for a signed-digit or a carry-save adder
tree, but a final carry-propagate adder is then needed). Furthermore, the
hardware cost of both configurations is similar, so tree configurations are usually
preferred, though the array has more regular routing.
Concerning the type of adder, the delay of each carry-propagate adder (carry-ripple,
carry-lookahead, ...) depends on the length of its input operands (delay
proportional to n for an n-bit binary carry-ripple adder, O(log2 n) for a carry-lookahead
adder).
To reduce the latency of addition on FPGAs, these devices incorporate
dedicated fast carry-chain paths. This favours simple carry-ripple implementations in
FPGAs over other carry-propagate adder topologies, except for non-pipelined
implementations with large operand sizes, which come at the expense of a high
hardware cost [33]. Moreover, pipelining techniques in FPGAs can be applied more
effectively to carry-ripple adders.
On the other hand, signed-digit and carry-save adders have a constant
computation time, delaying the carry propagation until the end.
However, a straightforward implementation on FPGAs requires roughly twice the
hardware of a carry-ripple adder, and does not exploit the fast carry chain to
improve speed. Several authors have recently proposed efficient mappings of carry-save
adders on FPGAs, but only for the binary case.
The delays of both carry-ripple and carry-save adder trees are proportional to
n + log2(m) − 1, because the horizontal delays through the log2(m) levels of carry-ripple
chains overlap. In ASIC implementations the advantage of the carry-save
adder tree lies in the flexibility of routing, so the critical path delay can be reduced
by optimizing the interconnections between full adders. In FPGAs, a
conventional carry-save adder tree of full adders is slower than a carry-ripple adder
tree due to the complex routing, so the carry-ripple adder tree configuration is
generally preferred. This can be partially remedied by efficiently mapping compressors
that perform operand reductions larger than 3:2.
Decimal multi-operand adders can also be implemented as carry-ripple,
signed-digit, or carry-save adder trees, but they require additional logic for decimal
correction. Therefore, delay and hardware cost are larger than in the binary case.
Next, we describe the most representative methods proposed for decimal fixed-point/integer
carry-propagate (carry-ripple, carry-lookahead, ...) and carry-free
(signed-digit, carry-save, ...) addition.
CHAPTER-3
DESIGN CONSIDERATIONS
3.1 COMPRESSORS ARCHITECTURES
Various high-speed multipliers use 3-2, 4-2, and 5-2 compressors to lower the
latency of the partial product reduction part [5]-[7]. These compressors
minimize delay and area, thereby increasing the performance of the overall
system. Compressors are generally designed with XOR-XNOR gates and
multiplexers [8]-[11]. A compressor is a device used to reduce the number of operands
while adding terms of partial products in multipliers. An X-Y compressor takes X
equally weighted input bits and produces a Y-bit binary number.
The most widely used and simplest compressor is the 3-2 compressor,
also known as a full adder [12]. A 3-2 compressor has three inputs X1, X2,
X3 and generates two outputs, the sum and carry bits. The block diagram of a 3-2
compressor is shown in the figure.
Fig 3.1 3-2 Compressor
3.2 5:2 COMPRESSOR
The block diagram of a (5:2) compressor is shown in the figure; it has seven
inputs and four outputs. Five of the inputs are the primary inputs x1, x2, x3, x4, and
x5; the other two inputs, cin1 and cin2, receive their values from the neighboring
compressor one binary bit order lower in significance.
All seven inputs have the same weight. The (5:2) compressor generates
a sum output of the same weight as the inputs, and three outputs, carry, cout1, and cout2,
weighted one binary order higher. The cout1 and cout2 outputs are fed to the neighboring
compressor of higher significance.
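One common way to realize a (5:2) compressor is as a chain of three full adders; the text does not fix the internal structure, so this Python sketch is just one plausible arrangement. The closing loop verifies the defining arithmetic identity over all 128 input combinations:

```python
def full_adder(a, b, c):
    """Full adder: sum = a^b^c, carry = majority(a, b, c)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)

def compressor_5to2(x1, x2, x3, x4, x5, cin1, cin2):
    """(5:2) compressor: seven equally weighted input bits in; one sum
    bit of the same weight out, plus carry, cout1, cout2 one binary
    order higher (chained full-adder arrangement, one of several)."""
    t1, cout1 = full_adder(x1, x2, x3)
    t2, cout2 = full_adder(t1, x4, cin1)
    s, carry = full_adder(t2, x5, cin2)
    return s, carry, cout1, cout2

# All outputs together must encode the arithmetic sum of the inputs.
for v in range(128):
    bits = [(v >> i) & 1 for i in range(7)]
    s, cy, c1, c2 = compressor_5to2(*bits)
    assert sum(bits) == s + 2 * (cy + c1 + c2)
```

Note that cout1 and cout2 depend only on the primary inputs (not on cin1, cin2), which is what allows the compressors in adjacent columns to operate without a ripple.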
Fig 3.3 FPGA Structure
For smaller designs and/or lower production volumes, FPGAs may be more
cost-effective than an ASIC design, even in production.
An application-specific integrated circuit (ASIC) is an integrated circuit (IC)
customized for a particular use, rather than intended for general-purpose use.
A structured ASIC falls between an FPGA and a standard-cell-based ASIC.
Structured ASICs are used mainly for mid-volume designs. The design
task for structured ASICs is to map the circuit onto a fixed arrangement of
known cells.
For example, a chip designed solely to run a cell phone is an ASIC. Intermediate between
ASICs and industry-standard integrated circuits, like the 7400 or the 4000 series, are
application-specific standard products (ASSPs).
As feature sizes have shrunk and design tools improved over the years, the
maximum complexity (and hence functionality) possible in an ASIC has grown from
5,000 gates to over 100 million. Modern ASICs often include entire 32-bit processors,
memory blocks including ROM, RAM, EEPROM, Flash and other large building
blocks. Such an ASIC is often termed a SoC (system-on-a-chip). Designers of digital
ASICs use a hardware description language (HDL), such as Verilog or VHDL, to
describe the functionality of ASICs.
CHAPTER-4
IMPLEMENTATION
4.1 LINEAR ARRAY STRUCTURE
The carry resources are only used in the design of a single 4:2 compressor, but
these resources have not been considered in the design of the whole compressor tree
structure. To optimize the use of the carry resources, we propose a compressor tree
structure similar to the classic linear array of CSAs .However, in our case, given the
two output words of each adder (sum-word and carry-word), only the carry-word is
connected from each CSA to the next, whereas the sum words are connected to lower
levels of the array. Fig shows an example for a 9:2 compressor tree designed using the
proposed linear structure, where all lines are N bit width buses, and carry signal are
correctly shifted. For the CSA, we have to distinguish between the regular inputs (A
and B) and the carry input (Ci in the figure); the dashed line between the
carry input and output represents the fast carry resources. With the exception of the
first CSA, where Ci is used to introduce an input operand, the Ci of each CSA is
connected to the carry output (Co) of the previous CSA, as shown in the figure.
Thus, the whole carry-chain is preserved from the input to the output of the
compressor tree (from I0 to Cf). First, the two regular inputs of each CSA are used to
introduce the input operands (Ii). When all the input operands have been introduced into
the array, the partial sum-words (Si) previously generated are then added in order
(i.e., the first generated partial sums are added first), as shown in the figure. In this way, we
maximize the overlap between propagation through regular signals and through carry-chains.
Regarding the area, the implementation of a generic compressor tree based on
N-bit-wide CSAs requires Nop − 2 of these elements (because each CSA eliminates
one input signal). Therefore, considering that a CSA can be implemented using the
same number of resources as a binary CPA (as shown below), the proposed linear
array, the 4:2 compressor tree, and the binary CPA tree have approximately the same
hardware cost.
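The scheduling just described (carry-word chained from CSA to CSA, sum-words re-entered oldest first) can be simulated at word level in Python. This is a behavioural sketch of our reading of the text; `csa` and `linear_compressor_tree` are illustrative names:

```python
def csa(a, b, c):
    """Word-level carry-save adder: a + b + c == s + cw, with no carry
    propagation (the carry-word is simply shifted one position left)."""
    s = a ^ b ^ c
    cw = ((a & b) | (a & c) | (b & c)) << 1
    return s, cw

def linear_compressor_tree(operands):
    """Linear array: only the carry-word chains from one CSA to the
    next; partial sum-words are queued and re-entered through the
    regular inputs in generation order. Assumes >= 3 operands."""
    vals = list(operands[3:])          # inputs not yet introduced
    s, c = csa(operands[0], operands[1], operands[2])
    sums = [s]                         # partial sum-words, oldest first
    while len(vals) + len(sums) > 1:
        a = vals.pop(0) if vals else sums.pop(0)
        b = vals.pop(0) if vals else sums.pop(0)
        s, c = csa(a, b, c)            # Ci of each CSA takes previous Co
        sums.append(s)
    return sums[0], c                  # final carry-save pair

ops = [5, 9, 13, 7, 21, 3]
s, c = linear_compressor_tree(ops)
assert s + c == sum(ops)               # Nop - 2 = 4 CSAs were used
```

Unbounded Python integers stand in for the N-bit buses, so the invariant "values still pending plus the chained carry always sum to the total" holds exactly at every step.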
implementation, we temporarily assume that there is no delay on the carry-chain path.
Under this assumption, the carry signal connections can be eliminated from the
critical path analysis and our linear array can be represented as a hypothetical tree,
as shown in the figure (where the carry-chain is represented in gray).
To compute the number of effective time levels (ETLs) of this hypothetical
tree, each CSA is considered a 2:1 adder, except for the first, which is considered a
3:1 adder. Thus, the first level of adders is formed by the first ⌊(Nop − 1)/2⌋ CSAs
(which correspond to the partial addition of the input operands). This first ETL produces
⌊(Nop − 1)/2⌋ partial sum-words that are added by a second level of CSAs (together
with the last input operand if Nop is even), and so on, in such a way that each ETL of
CSAs halves the number of inputs to the next level. Therefore, the total number of ETLs
in this hypothetical tree is
L = ⌈log2(Nop − 1)⌉, (3)
and the delay of this tree is approximately L times the delay of a single ETL.
carry and dsum (which are associated with the FPGA family used). Second, it depends
on the number of operands, which affects both the delay of the carry-chain of each ETL
and the internal structure of the hypothetical tree. Even though the former can be
expressed as an analytical formula, the latter cannot be expressed in this way
(especially when Nop − 1 is not a power of two). However, it is possible to bound the
critical path delay by considering two extreme cases. One extreme situation occurs
when the delay of the whole carry-chain corresponding to each ETL (dcarry × the
number of CSAs in the ETL) is always greater than the delay from one ETL to the next
(dsum). In this case, the timing behavior corresponds to a linear array and the
critical path is represented in Fig. 4. Initially, the first carry-out signal is generated
from I1, I2, I3 in the first CSA, and then the carry signal is propagated through the
whole carry-chain until the output. Thus, the delay of the critical path has two
components, corresponding to the generation of the first carry signal and the
propagation through the carry-chain.
Adders are used in many applications. It is generally recognized that most of the
time required by adders is due to carry propagation, so reducing the propagation
time is the focus of most modern techniques. Different binary adder schemes have their
own characteristics, such as area and energy dissipation.
No single adder scheme is best under every condition, so choosing one for a
specific context, with its specific requirements and constraints, is important. Because this
thesis work does not focus on analysing the delay of different adders, here only the
function of some commonly used adders is given.
Two’s Complement Representation
Two's complement representation uses the most significant bit as a sign bit,
making it easy to test whether an integer is positive or negative. The range of two's
complement representation is from −2^(n−1) to 2^(n−1) − 1. Consider an n-bit integer A in
two's complement representation. If A is positive, then the sign bit a_(n−1) is zero.
The remaining bits represent the magnitude of the number, in the same fashion
as for sign-magnitude:
A = Σ(i=0 to n−2) 2^i · a_i, for A ≥ 0
The number zero is identified as positive and therefore has a 0 sign bit and a
magnitude of all 0s. We can see that the range of positive integers that may be
represented is from 0 to 2^(n−1) − 1; any larger number would require more bits.
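The sign-bit behaviour and the representable range can be demonstrated with a short Python sketch (the helper names are ours, for illustration only):

```python
def to_twos_complement(a, n):
    """Encode integer a in n-bit two's complement; the valid range is
    -2**(n-1) .. 2**(n-1) - 1, with the MSB acting as the sign bit."""
    assert -(1 << (n - 1)) <= a <= (1 << (n - 1)) - 1
    return a & ((1 << n) - 1)

def from_twos_complement(bits, n):
    """Decode: a negative value is signalled by sign bit a_(n-1) = 1."""
    sign = (bits >> (n - 1)) & 1
    return bits - (sign << n)

# -5 in 8 bits is 0b11111011; its sign bit is 1.
assert to_twos_complement(-5, 8) == 0b11111011
assert from_twos_complement(0b11111011, 8) == -5
# Zero is "positive": sign bit 0, magnitude all 0s.
assert to_twos_complement(0, 8) == 0
```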
Fixed Time Type
The most commonly implemented scheme is the fixed-time adder. Its
characteristic is that no signal indicates when the addition is complete; therefore,
the worst-case delay must be assumed.
Variable Time Type
In contrast to the fixed-time adder scheme, variable-time adders provide
a completion signal, so the result of the addition can be used as soon as the
completion signal is asserted.
4.4 CARRY-PROPAGATE ADDER
Carry-propagate adders (CPAs) produce the result in a conventional number
system, also called a fixed-radix system. The property of a fixed-radix system is that
every number has a unique representation, so no two digit sequences have the same
numerical value. The digit set ranges from 0 to r − 1, where r is the radix.
4.5 RIPPLE-CARRY ADDER
An n-bit adder for two n-bit binary numbers can be built by connecting
n full adders in series. Each full adder represents a bit position i (from 0 to n − 1),
and the carry-out from the full adder at position i is connected to the carry-in of the
full adder at the higher position i + 1.
The sum output of the full adder at position i, as shown in Figure 1, is given by:
Si = Xi ⊕ Yi ⊕ Ci
The carry output of each FA, as shown in Figure 1, is given by:
Ci+1 = Xi·Yi + Xi·Ci + Yi·Ci
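A bit-level Python model of these two equations, chaining n full adders so that each carry-out feeds the next carry-in (an illustrative sketch, not synthesizable hardware):

```python
def full_adder(x, y, c):
    """Si = Xi ^ Yi ^ Ci ;  Ci+1 = Xi*Yi + Xi*Ci + Yi*Ci."""
    return x ^ y ^ c, (x & y) | (x & c) | (y & c)

def ripple_carry_add(x, y, n, cin=0):
    """n-bit ripple-carry adder: the carry-out of bit position i
    becomes the carry-in of position i + 1."""
    s, c = 0, cin
    for i in range(n):
        si, c = full_adder((x >> i) & 1, (y >> i) & 1, c)
        s |= si << i
    return s, c                        # n-bit sum and final carry-out

assert ripple_carry_add(0b1011, 0b0110, 4) == (0b0001, 1)   # 11 + 6 = 17
```

The worst-case delay corresponds to a carry generated at position 0 rippling through all n stages, which is exactly why the propagation time is proportional to n.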
on the degree of the redundancy.
Another disadvantage is that some operations, such as magnitude comparison
or sign detection, cannot be performed directly on redundant numbers.
4.7 CARRY-SAVE ADDER
A carry-save adder (CSA) has the same circuit as the full adder, as shown in
the figure. The carry-in signal is considered an input of the CSA, and the carry-out
signal is considered an output of the CSA. Figure 5 shows how n carry-save adders
are arranged to add three n-bit numbers x, y, z into two numbers c and s.
Fig 4.7 CSA computation
Pi = Xi ⊕ Yi
In the first case, the carry bit is activated by the local conditions (the values of
Xi and Yi). In the second, the carry bit is received from the less significant elementary
addition and is propagated further to the more significant elementary addition
depending on the function Pi. Therefore, the carry-out bit corresponding to a pair of
bits Xi and Yi is computed according to the equation:
Ci = Gi + Pi Ci−1
Hence, the carry signal can be computed from the carry-in, generate and propagate
signals.
For example, consider a four-bit adder:
C1 = G0 + P0Cin
C2 = G1 + P1G0 + P1 P0Cin
C3 = G2 + P2G1 + P2 P1G0 + P2 P1 P0Cin
C4 = G3 + P3G2 + P3 P2G1 + P3 P2 P1G0 + P3 P2 P1 P0Cin
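The unrolled expansions can be checked with a short sketch (function and variable names are our own; bits LSB first):

```python
def cla_carries(xs, ys, c_in=0):
    # Gi = Xi·Yi (carry generated locally), Pi = Xi xor Yi (carry propagated).
    g = [x & y for x, y in zip(xs, ys)]
    p = [x ^ y for x, y in zip(xs, ys)]
    carries = [c_in]
    for i in range(len(xs)):
        # Ci+1 = Gi + Pi·Ci, which unrolls to the four-bit expressions above.
        carries.append(g[i] | (p[i] & carries[i]))
    return carries  # [C0, C1, ..., Cn]
```

In hardware all carries are computed in parallel from the unrolled product terms; the sequential loop here only verifies that the same logic is implemented. For 11 + 6 (bits [1,1,0,1] and [0,1,1,0]) the carries are [0, 0, 1, 1, 1].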
Figure 4.8 can help us understand the carry-out computation procedure more
clearly.
Fig 4.8 Carry out of Carry Lookahead Adder
The result of the addition uses a signed-digit representation, i.e., a fixed-radix
representation with digit values drawn from a signed-integer set:
x = ∑i=0..n−1 xi r^i
with a digit set {−α, …, −1, 0, 1, …, α}.
The addition algorithm is not described here in detail. The objective of SDA is to
eliminate carry propagation. A signed-digit addition is performed in two steps.
Step 1: compute an interim sum (w) and a transfer (t); the transfer plays a role
similar to the carry in a CPA:
x + y = w + t
At the digit level this corresponds to
xi + yi = wi + r·ti+1
The figure shows the addition of the first two digits of the n-digit numbers.
Finally, the complete SDA structure is as shown in the figure.
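A behavioural sketch of the two steps for radix 10 (the transfer threshold below is one common choice for a digit set with α = 6; it is our assumption, not necessarily the exact rule intended here):

```python
def sd_add(xs, ys, r=10, alpha=6):
    # Step 1: split each digit sum into interim sum wi and transfer ti+1
    # so that xi + yi = wi + r*ti+1 (digits LSB first).
    ws, ts = [], [0]  # ts[0] is the incoming transfer (none)
    for x, y in zip(xs, ys):
        z = x + y
        t = 1 if z >= alpha else (-1 if z <= -alpha else 0)  # assumed threshold
        ws.append(z - r * t)
        ts.append(t)
    # Step 2: si = wi + ti, guaranteed carry-free by the threshold choice.
    return [w + t for w, t in zip(ws + [0], ts)]
```

For example, 27 + 18: `sd_add([7, 2], [8, 1])` yields `[5, 4, 0]`, i.e. 45, and each sum digit depends only on its own position and its right neighbour, never on a long carry chain.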
The carry-save adder tree can be used to add three operands in two's
complement representation, producing the result as the sum of two vectors. A 3-to-2
reduction of this kind is called a [3:2] adder; generalizing, a [p:2] adder built from
CSAs reduces p bit-vectors to 2 bit-vectors.
In the figure, each column contains k bits and there are p levels. [3:2] adders
reduce the rows until 2 bit-vectors remain. No carry propagation is required except
for the final two rows, which speeds up the computation.
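A sketch of the [p:2] reduction using Python integers to stand in for bit-vectors (names are our own): each pass replaces three vectors by one sum vector and one left-shifted carry vector.

```python
def reduce_p_to_2(operands):
    # Repeatedly apply [3:2] adders until only two vectors remain.
    ops = list(operands)
    while len(ops) > 2:
        x, y, z = ops.pop(), ops.pop(), ops.pop()
        s = x ^ y ^ z                           # carry-free sum bits
        c = ((x & y) | (x & z) | (y & z)) << 1  # saved carries, shifted left
        ops += [s, c]
    return ops  # two vectors; one final carry-propagate add gives the result
```

The invariant is that the total value never changes, so `sum(reduce_p_to_2(ops)) == sum(ops)`; only the final two vectors need a carry-propagate addition.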
Fig 4.15 Combinational Architecture
Figure 4.15 shows a tree of BCD ripple adders that reduces m BCD operands
Z[k] into a final BCD sum S. The width of each BCD adder in the first row of the
tree is p digits (or 4p bits). To avoid sum overflow, the width of each BCD adder
must be extended by an appropriate amount. For example, the width of the final
adder needs to be increased by at least ⌈log10 m⌉ digits, or l bits, with l given by
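As a digit-level sketch of one BCD carry-ripple adder in such a tree (our own naming; digits LSB first): each position adds two BCD digits and an incoming carry, and wraps past 9, which is what the hardware +6 correction implements.

```python
def bcd_add(xs, ys):
    # Ripple BCD addition over digit vectors (least-significant digit first).
    out, c = [], 0
    for x, y in zip(xs, ys):
        s = x + y + c            # binary digit sum, 0..19
        c = 1 if s > 9 else 0    # decimal carry out (hardware: +6 correction)
        out.append(s - 10 * c)
    return out + ([c] if c else [])
```

For example, 99 + 10: `bcd_add([9, 9], [0, 1])` returns `[9, 0, 1]`, i.e. 109; the extra output digit is exactly the kind of width growth the tree must budget for.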
signals are given by:
gi,3 = xi,3 yi,3 ∨ xi,3 (yi,2 ∨ yi,1) ∨ yi,3 (xi,2 ∨ xi,1) ∨ xi,2 yi,2 (xi,1 ∨ yi,1)
gi,2 = 0
gi,1 = 0
gi,0 = xi,0 yi,0
the partial products of the multiplier, the least-significant column cannot receive
bits from other columns, so reduction by columns is introduced. The basic concept
is to reduce the number of bits in each column at each level; full adders and half
adders are used as (3:2) and (2:2) counters, respectively.
The Dadda tree is a special case of the Wallace tree in which the bit counts are
collected so that the reduction uses the minimum number of counters for the same
critical path. The Wallace tree was chosen as the basic algorithm for the program
in this thesis.
CHAPTER-5
RESULT ANALYSIS
Combinational and pipelined versions of the BCD carry-ripple adder tree
presented in Section 4 were designed in VHDL for arbitrary values of m, p and k
using the Xilinx ISE Design Suite. To have more control over the mapping process,
the different adder cells were directly instantiated as components of the Xilinx
Virtex-6 library. The architectures were synthesized for a Virtex-6 XC6VLX75T
device with speed grade −3 using the XST compiler and simulated with ModelSim
SE 6.5.
We determined that the most suitable structures for fast BCD multi-operand
addition in FPGAs are trees built of carry-ripple or carry-chain adders. To draw
more precise conclusions from the comparison, we coded and synthesized two
different adder trees: a binary carry-ripple adder tree and a tree built of BCD
carry-chain adders.
The proposed BCD carry-ripple adder practically halves the hardware cost of
the BCD carry-chain adder and is still very competitive in terms of speed. The BCD
carry-chain adders have speed advantages for large carry chains, at the expense of
higher hardware cost.
CONCLUSION
A BCD multi-operand adder has been designed and successfully implemented
on a Virtex-5/6 FPGA. We performed a survey of prior techniques for BCD
multi-operand addition to find the most area-efficient low-latency FPGA
implementation. From this survey we found that tree structures built of BCD
carry-ripple or carry-chain adders are suitable for state-of-the-art FPGA
implementations.
Efficiently implementing CS compressor trees on FPGA, in terms of area and
speed, is made possible by using the specialized carry-chains of these devices in a
novel way. Similar to what happens when using ASIC technology, the proposed CS
linear array compressor trees lead to marked improvements in speed compared to
CPA approaches and, in general, with no additional hardware cost. Furthermore, the
proposed high-level definition of CSA arrays based on CPAs facilitates ease-of-use
and portability, even in relation to future FPGA architectures, because CPAs will
probably remain a key element in the next generations of FPGAs. We have compared
our architectures, implemented on different FPGA families, to several designs and
have provided a qualitative and quantitative study of the benefits of our proposals.
FUTURE SCOPE
This project can be used in transaction processing systems, ATMs, personal
computers and workstations, medical electronic systems, etc. As future work, we
plan to increase the compressor length and decrease the number of time units. We
also plan to integrate the proposed multi-operand BCD adder into a core of the
FloPoCo project, a generator of arithmetic cores for FPGAs, and to support more
FPGA targets than the Virtex-5/6 families.
REFERENCES
M. J. Adiletta and V. C. Lamere, BCD Adder Circuit, US patent 4,805,131, Jul
1989.
D. P. Agrawal, Fast BCD/Binary Adder/Subtractor, Electronics Letters, vol.
10, no. 8, pp. 121–122, Apr. 1974.
G. Bioul, M. Vazquez, J.-P. Deschamps, G. Sutter, Decimal Addition in
FPGA, V Southern Conference on Programmable Logic (SPL09), pp. 101–
108, Apr. 2009.
F. Y. Busaba, C. A. Krygowski, W. H. Li, E. M. Schwarz and S. R. Carlough,
The IBM z900 Decimal Arithmetic Unit, Asilomar Conference on Signals,
Systems and Computers, vol. 2, pp. 1335–1339, Nov. 2001.