
Chapter-1

INTRODUCTION TO VLSI

1.1 Very-large-scale integration

Very-large-scale integration (VLSI) is the process of creating integrated circuits by
combining thousands of transistors into a single chip. VLSI began in the 1970s when
complex semiconductor and communication technologies were being developed. The
microprocessor is a VLSI device.

Fig. 1.1 A VLSI integrated-circuit die

1.2 History

During the 1920s, several inventors attempted devices that were intended to control
the current in solid-state diodes and convert them into triodes. Success, however, had
to wait until after World War II, during which the attempt to improve silicon and
germanium crystals for use as radar detectors led to improvements both in fabrication
and in the theoretical understanding of the quantum mechanical states of carriers in
semiconductors; after the war, the scientists who had been diverted to radar
development returned to solid-state device development. With the invention of the
transistor at Bell Labs in 1947, the field of electronics took a new direction, shifting
from power-consuming vacuum tubes to solid-state devices.

With the small and efficient transistor at their hands, electrical engineers of the 1950s
saw the possibility of constructing far more advanced circuits than before. However,
as the complexity of the circuits grew, problems started arising.

One problem was the size of the circuits. A complex circuit, like a computer, was
dependent on speed. If the components of the computer were too large or the wires
interconnecting them too long, the electric signals could not travel fast enough through
the circuit, making the computer too slow to be effective.

Jack Kilby at Texas Instruments found a solution to this problem in 1958. Kilby's idea
was to make all the components and the chip out of the same block (monolith) of
semiconductor material. As a new employee not yet entitled to vacation, Kilby worked
through the summer break, and when the rest of the workers returned from vacation,
he presented his new idea to his superiors. He was allowed to build a test version of his
circuit, and in September 1958 he had his first integrated circuit ready. Although the
first integrated circuit was crude and had some problems, the idea was groundbreaking.
By making all the parts out of the same block of material and adding the metal needed
to connect them as a layer on top of it, there was no more need for individual discrete
components. No more wires and components had to be assembled manually; the
circuits could be made smaller and the manufacturing process could be automated.
From here the idea of integrating all components on a single silicon wafer came into
existence, leading to small-scale integration (SSI) in the early 1960s, medium-scale
integration (MSI) in the late 1960s, large-scale integration (LSI) in the 1970s, and, by
the early 1980s, VLSI, with tens of thousands of transistors on a chip (later hundreds
of thousands, and now millions).

1.3 Developments

The first semiconductor chips held two transistors each. Subsequent advances added
more and more transistors, and, as a consequence, more individual functions or systems
were integrated over time. The first integrated circuits held only a few devices, perhaps
as many as ten diodes, transistors, resistors and capacitors, making it possible to
fabricate one or more logic gates on a single device.Now known retrospectively as
small-scale integration (SSI), improvements in technique led to devices with hundreds
of logic gates, known as medium-scale integration (MSI). Further improvements led to
large-scale integration (LSI), i.e. systems with at least a thousand logic gates. Current
technology has moved far past this mark and today's microprocessors have many
millions of gates and billions of individual transistors.

At one time, there was an effort to name and calibrate various levels of large-scale
integration above VLSI. Terms like ultra-large-scale integration (ULSI) were used. But
the huge number of gates and transistors available on common devices has rendered
such fine distinctions moot. Terms suggesting greater than VLSI levels of integration
are no longer in widespread use.

As of early 2008, billion-transistor processors are commercially available. This is
expected to become more commonplace as semiconductor fabrication moves from the
current generation of 65 nm processes to the next 45 nm generations (while
experiencing new challenges such as increased variation across process corners). A
notable example is Nvidia's 280 series GPU, which is unusual in that almost all of its
1.4 billion transistors are used for logic, in contrast to the Itanium, whose large
transistor count is largely due to its 24 MB L3 cache. Current designs, unlike the earliest
devices, use extensive design automation and automated logic synthesis to lay out the
transistors, enabling higher levels of complexity in the resulting logic functionality.

Certain high-performance logic blocks, like the SRAM (Static Random Access
Memory) cell, are still designed by hand to ensure the highest efficiency (sometimes
by bending or breaking established design rules to obtain the last bit of performance
by trading stability). VLSI technology is moving towards radical miniaturization with
the introduction of NEMS technology. A lot of problems need to be sorted out before
the transition is actually made.

Structured design

Structured VLSI design is a modular methodology originated by Carver Mead and Lynn
Conway for saving microchip area by minimizing the interconnect-fabric area. This is
obtained by the repetitive arrangement of rectangular macro blocks, which can be
interconnected using wiring by abutment. An example is partitioning the layout of an
adder into a row of equal bit-slice cells. In complex designs this structuring may be
achieved by hierarchical nesting.

Structured VLSI design was popular in the early 1980s, but lost its popularity later with
the advent of placement and routing tools, which waste a great deal of area on routing;
the waste is tolerated because of the progress of Moore's Law. When introducing the
hardware description language KARL in the mid-1970s, Reiner Hartenstein coined the
term "structured VLSI design" (originally "structured LSI design"), echoing
Edsger Dijkstra's structured programming approach, which uses procedure nesting to
avoid chaotic, spaghetti-structured programs.

1.4 Challenges

As microprocessors become more complex due to technology scaling, microprocessor
designers have encountered several challenges which force them to think beyond the
design plane, and look ahead to post-silicon:
 Power usage/Heat dissipation – As threshold voltages have ceased to scale
with advancing process technology, dynamic power dissipation has not scaled
proportionally. Maintaining logic complexity when scaling the design down
only means that the power dissipation per area will go up. This has given rise to
techniques such as dynamic voltage and frequency scaling (DVFS) to minimize
overall power.
 Process variation – As photolithography techniques tend closer to the
fundamental laws of optics, achieving high accuracy in doping concentrations
and etched wires is becoming more difficult and prone to errors due to variation.
Designers now must simulate across multiple fabrication process corners before
a chip is certified ready for production.
 Stricter design rules – Due to lithography and etch issues with scaling, design
rules for layout have become increasingly stringent. Designers must keep ever
more of these rules in mind while laying out custom circuits. The overhead for

custom design is now reaching a tipping point, with many design houses opting
to switch to electronic design automation (EDA) tools to automate their design
process.
 Timing/design closure – As clock frequencies tend to scale up, designers are
finding it more difficult to distribute and maintain low clock skew between these
high frequency clocks across the entire chip. This has led to a rising interest in
multicore and multiprocessor architectures, since an overall speedup can be
obtained by lowering the clock frequency and distributing processing.
 First-pass success – As die sizes shrink (due to scaling), and wafer sizes go up
(to lower manufacturing costs), the number of dies per wafer increases, and the
complexity of making suitable photomasks goes up rapidly. A mask set for a
modern technology can cost several million dollars. This non-recurring expense
deters the old iterative philosophy involving several "spin-cycles" to find errors
in silicon, and encourages first-pass silicon success. Several design philosophies
have been developed to aid this new design flow, including design for
manufacturing (DFM), design for test (DFT), and Design for X.

1.5 VLSI Technology

Gone are the days when huge computers made of vacuum tubes sat humming in entire
dedicated rooms and could do about 360 multiplications of 10-digit numbers in a
second. Though they were heralded as the fastest computing machines of their time,
they surely don't stand a chance when compared to modern machines. Modern
computers are getting smaller, faster, cheaper and more power-efficient every
progressing second. But what drove this change? The whole domain of computing
ushered into a new dawn of electronic miniaturization with the advent of the
semiconductor transistor by Bardeen and Brattain (1947-48) and then the bipolar
junction transistor by Shockley (1949) at Bell Laboratories.

Since the invention of the first IC (integrated circuit), in the form of a flip-flop, by Jack
Kilby in 1958, our ability to pack more and more transistors onto a single chip has
doubled roughly every 18 months, in accordance with Moore's Law. Such exponential
development had never been seen in any other field, and it still continues to be a major
area of research work.

Fig 1.2 A comparison: First Planar IC (1961) and Intel Nehalem Quad Core Die

1.6 History & Evolution of VLSI Technology

The development of microelectronics spans a period even shorter than the average life
expectancy of a human, and yet it has seen as many as four generations. The early
1960s saw low-density fabrication processes classified under small-scale integration
(SSI), in which the transistor count was limited to about 10. This rapidly gave way to
medium-scale integration (MSI) in the late 1960s, when around 100 transistors could
be placed on a single chip.

It was the time when the cost of research began to decline and private firms started
entering the competition, in contrast to the earlier years when the main burden was
borne by the military. Transistor-transistor logic (TTL), offering higher integration
densities, outlasted other IC families like ECL and became the basis of the first
integrated-circuit revolution. It was the production of this family that gave impetus to
semiconductor giants like Texas Instruments, Fairchild and National Semiconductor.
The early seventies marked the growth of the transistor count to about 1,000 per chip,
called large-scale integration (LSI).

By the mid-eighties, the transistor count on a single chip had grown well past 10,000,
and hence came the age of very-large-scale integration, or VLSI. Though many
improvements have been made and the transistor count is still rising, further generation
names like ULSI are generally avoided. It was during this time that TTL lost the battle
to the MOS family, owing to the same problems that had pushed vacuum tubes into
obsolescence: power dissipation and the limit it imposed on the number of gates that
could be placed on a single die.

The second age of the integrated-circuit revolution started with the introduction of the
first microprocessor, the 4004, by Intel in 1971, followed by the 8080 in 1974. Today
many companies like Texas Instruments, Infineon, Alliance Semiconductors, Cadence,
Synopsys, Celox Networks, Cisco, Micron Tech, National Semiconductor, ST
Microelectronics, Qualcomm, Lucent, Mentor Graphics, Analog Devices, Intel,
Philips, Motorola and many other firms have been established, dedicated to the various
fields of VLSI such as programmable logic devices, hardware description languages,
design tools and embedded systems.
VLSI Design
VLSI design today chiefly comprises front-end design and back-end design. Front-end
design includes digital design using an HDL, design verification through simulation
and other verification techniques, gate-level design, and design for testability; back-end
design comprises CMOS library design and its characterization, and also covers
physical design and fault simulation.

While simple logic gates might be considered SSI devices, and multiplexers and parity
encoders MSI, the world of VLSI is much more diverse. Generally, the entire design
procedure follows a step-by-step approach in which each design step is followed by
simulation before actually being put onto hardware or moving on to the next step. The
major design steps are different levels of abstraction of the device as a whole:

1. Problem Specification: It is more of a high-level representation of the system.
The major parameters considered at this level are performance, functionality, physical
dimensions, fabrication technology and design techniques. It has to be a trade-off
between market requirements, the available technology and the economic viability of
the design. The end specifications include the size, speed, power and functionality of
the VLSI system.

2. Architecture Definition: Basic specifications such as whether to include floating-
point units, which architecture to use (RISC, Reduced Instruction Set Computer, or
CISC, Complex Instruction Set Computer), the number of ALUs, cache size, etc.

3. Functional Design: Defines the major functional units of the system and hence
facilitates the identification of the interconnect requirements between units and the
physical and electrical specifications of each unit. A sort of block diagram is decided
upon, with the number of inputs, outputs and timing settled without any details of the
internal structure.

4. Logic Design: The actual logic is developed at this level. Boolean expressions,
control flow, word width, register allocation, etc. are developed, and the outcome is
called a register-transfer level (RTL) description. This part is implemented with
hardware description languages such as VHDL and/or Verilog, as illustrated below.
Gate minimization techniques are employed to find the simplest, or rather the smallest
and most effective, implementation of the logic.
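
As a small illustration of such an RTL description (a sketch of ours, not part of the
original design flow; the module name adder4 is hypothetical), the following Verilog
describes a 4-bit adder at the register-transfer level:

module adder4 (a, b, cin, sum, cout);
  input  [3:0] a, b;     // 4-bit operands
  input        cin;      // carry-in
  output [3:0] sum;      // 4-bit sum
  output       cout;     // carry-out

  // Behavioral RTL: the synthesizer infers the adder logic.
  assign {cout, sum} = a + b + cin;
endmodule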

5. Circuit Design: While the logic design gives the simplified implementation of
the logic, the realization of the circuit in the form of a netlist is done in this step. Gates,

transistors and interconnects are put in place to make a netlist. This again is a software
step and the outcome is checked via simulation.

6. Physical Design: The conversion of the netlist into its geometrical representation
is done in this step and the result is called a layout. This step follows some predefined
fixed rules like the lambda rules which provide the exact details of the size, ratio and
spacing between components. This step is further divided into sub-steps which are:

6.1 Circuit Partitioning: Because of the huge number of transistors involved, it is not
possible to handle the entire circuit all at once due to limitations on computational
capabilities and memory requirements. Hence the whole circuit is broken down into
blocks which are interconnected.
6.2 Floor Planning and Placement: The major design tasks in this step are choosing the
best layout for each block from the partitioning step and for the overall chip,
considering the interconnect area between the blocks and the exact positioning on the
chip, in order to minimize the area while meeting the performance constraints through
an iterative approach.
6.3 Routing: The quality of placement becomes evident only after this step is
completed. Routing involves the completion of the interconnections between modules.
This is completed in two steps. First connections are completed between blocks without
taking into consideration the exact geometric details of each wire and pin. Then, a
detailed routing step completes point to point connections between pins on the blocks.
6.4 Layout Compaction: The smaller the chip size can get, the better it is. The
compression of the layout from all directions to minimize the chip area, thereby
reducing wire lengths, signal delays and overall cost, takes place in this design step.
6.5 Extraction and Verification: The circuit is extracted from the layout and compared
with the original netlist; performance verification, reliability verification and layout-
correctness checks are done before the final step of packaging.

7. Packaging: The chips are put together on a Printed Circuit Board or a Multi Chip
Module to obtain the final finished product.

Initially, design can be done with three different methodologies which provide different
levels of freedom of customization to the programmers. The design methods, in
increasing order of customization support, which also means increased amount of
overhead on the part of the programmer, are FPGAs and PLDs, Standard Cell (Semi
Custom) and Full Custom Design.

While FPGAs have inbuilt libraries and a board already built, with interconnections and
blocks already in place, semi-custom design allows the placement of blocks in a user-
defined custom fashion with some independence, while most libraries are still available
for program development. Full custom design adopts a start-from-scratch approach
where the programmer is required to write the whole set of libraries and also has full
control over block development, placement and routing. This is also the sequence from
entry-level design to professional design.

Fig. 1.3: Future of VLSI

Where do we actually see VLSI technology in action? Everywhere: in personal
computers, cell phones, digital cameras and any electronic gadget. There are certain
key issues that serve as active areas of research and are constantly improving as the
field continues to mature. The figures easily show how Gordon Moore proved to be a
visionary: the trend predicted by his law still continues to hold, with little deviation,
and shows no signs of stopping in the near future. VLSI has come a long way from the
time when chips were truly hand-crafted. But as we near the limit of miniaturization
of silicon wafers, design issues have cropped up.

VLSI is dominated by CMOS technology, and much like other logic families, it has its
limitations, which have been battled and improved upon for years. Taking the example
of a processor, the process technology rapidly shrank from 180 nm in 1999 to 65 nm
in 2008, now stands at 45 nm, and attempts are being made to reduce it further (32 nm),
while the die area, which had shrunk initially, is now increasing owing to the added
benefit of greater packing density: a larger die would mean more transistors on a chip.

As the number of transistors increases, the power dissipation is increasing, and so is
the noise. If heat generated per unit area is considered, the chips have already neared
that of the nozzle of a jet engine. At the same time, scaling of threshold voltages beyond
a certain point poses serious limitations on achieving low dynamic power dissipation
with increased complexity. The number of metal layers and the interconnects, be they
global or local, also tend to get messy at such nano levels.

Even on the fabrication front, we are fast approaching the optical limit of
photolithographic processes, beyond which the feature size cannot be reduced due to
decreased accuracy. This has opened up extreme ultraviolet lithography techniques.
The high-speed clocks used now make it hard to reduce clock skew, imposing tight
timing constraints; this has opened up a new frontier in parallel processing. And above
all, we seem to be fast approaching atom-thin gate oxide thicknesses, where there might
be only a single layer of atoms serving as the oxide layer in CMOS transistors. New
alternatives like gallium arsenide technology are becoming an active area of research
owing to this.

1.7 Overview on VLSI

Very-large-scale integration (VLSI) is the process of creating an integrated circuit (IC)
by combining hundreds of thousands of transistors or devices into a single chip. VLSI
began in the 1970s when complex semiconductor and communication technologies
were being developed. The microprocessor is a VLSI device. Before the introduction
of VLSI technology, most ICs had a limited set of functions they could perform. An
electronic circuit might consist of a CPU, ROM, RAM and other glue logic. VLSI lets
IC designers add all of these into one chip.

1.8 INTRODUCTION TO PARALLEL PREFIX ADDER

1.8.1 Motivation

To humans, decimal numbers are easy to comprehend and use for performing
arithmetic. However, in digital systems such as a microprocessor, DSP (digital signal
processor) or ASIC (application-specific integrated circuit), binary numbers are more
pragmatic for a given computation, because two-state signals map directly onto
switching devices, making binary the most efficient representation in hardware.

The hardware implementation of binary addition is a fundamental architectural
component in many processors, such as microprocessors, digital signal processors,
mobile devices and other hardware applications. In these systems, when building the
arithmetic logic unit (ALU), adders play an important role in performing the basic
arithmetic operations: addition, subtraction, multiplication, division, etc. Therefore,
the hardware implementation of an effective adder is necessary to increase the
performance of the ALU and, consequently, of the processor as a whole. Currently,
the parallel prefix adder (PPA) is considered an effective adder for performing the addition

of two multi-bit numbers. Circuit complexity and speed are important parameters of a
PPA at the stage of efficient hardware implementation, and therefore in recent years
various types of PPA with different parameter characteristics have been developed.
The Kogge-Stone adder [3], one of the fastest known effective PPAs, is investigated
here. The Kogge-Stone adder is widely and efficiently used, and has minimum delay
when performing binary addition. However, in terms of hardware cost, this adder
requires a great number of logic gates and a high Quine complexity in its schematic
implementation. Therefore, in the present work a modified parallel prefix adder is
developed to reduce its hardware complexity. The two presented adders are then
compared on the following parameters: the number of logic gates, the Quine
complexity, and the delay obtained by simulation in the Xilinx ISE 14.4 Design Suite.
A prospective architecture is proposed for the schematic implementation of various
PPAs, and the derivation of formulas is described for computing the hardware
characteristics as functions of the bit width of the input operands of the presented
adders.

1.9 ARCHITECTURE OF THE PARALLEL PREFIX ADDER


The parallel prefix adder (PPA) is a multi-bit carry-propagate adder used for the
parallel addition of two multi-bit numbers. The PPA extends the generate and propagate
logic of the carry look-ahead adder to perform addition even faster [4]. As the basic
schematic structure of the various PPAs, a prospective architecture is analyzed; it
consists of three stages (Figure 1.9.1) [5]: the pre-processing stage, the prefix
computation stage and the final processing stage. Let us consider each stage in more
detail.

Fig. 1.9.1. Architecture of the parallel prefix adder

At the pre-processing stage, carry-generate g_i and carry-propagate h_i signals are
computed for each pair of input operand bits A_i and B_i. The calculation of these
signals is described by the following logical equations:

g_i = A_i · B_i (1.1)
h_i = A_i ⊕ B_i (1.2)

At the prefix computation stage, group carry-generate G[i:k] and group carry-propagate
H[i:k] signals are calculated for each bit by the following equations:

G[i:k] = G[i:j] + H[i:j] · G[j-1:k] (1.3)
H[i:k] = H[i:j] · H[j-1:k] (1.4)

The final processing stage involves the formation of the output carries and sum values
for each individual operand bit. The expressions for P_i and S_i are defined by the
following equations, respectively:

P_i = G[i:0] (1.5)
S_i = h_i ⊕ P_(i-1) (1.6)

where P_(i-1) is the carry into bit i (P_in = 0). In the framework of this work, basic
schematic nodes are used for greater visibility of the given architecture when
constructing the presented adders. Figure 1.9.2 shows the basic schematic nodes: a
black cell, a gray cell, a white cell and a circle. These schematic nodes implement the
logical equations at all stages of the given architecture.

The number of logic gates and the Quine complexity of each schematic node are
computed for the estimation of hardware costs. In this work, Quine complexity is
determined by the total number of inputs of all the logic gates used in the schematic
nodes; the number of logic gates used in each node can be counted likewise.

Fig. 1.9.2. Basic schematic nodes
Figure 1.9.2 shows that a black cell contains 3 logic gates and has a Quine complexity
of 6, while a gray cell has 2 logic gates and a Quine complexity of 4. In these two cells,
the G[i:k] and H[i:k] signals are calculated for each bit by equations (1.3) and (1.4) at
the prefix computation stage. Using these equations, a black cell and a gray cell receive
inputs from the upper part of a block spanning bits i:j and from the lower part spanning
bits j-1:k. These schematic cells are then combined to form a tree of generate and
propagate signals for the entire block spanning bits i:k. The main challenge is to rapidly
compute all the group generate signals G[0:0], G[1:0], G[2:0], G[3:0], ..., G[n-1:0].
These signals, along with the propagate signals H[0:0], H[1:0], H[2:0], H[3:0], ...,
H[n-1:0], are called prefixes. The network of these schematic nodes (black cells and
gray cells) is a prefix tree. The white cell consists of 2 logic gates with a Quine
complexity of 4; it serves to calculate the g_i and h_i signals of the input operands A_i
and B_i with equations (1.1) and (1.2) at the pre-processing stage. At the final stage,
the circle, consisting of one logic gate with a Quine complexity of 2, produces the result
of the binary addition according to equation (1.6).
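
For concreteness, the four schematic nodes can be written directly from equations
(1.1)-(1.4) and (1.6). The following Verilog modules are our own illustrative sketch
(module and port names are hypothetical, not from the referenced works):

// White cell: pre-processing stage, 2 gates (AND, XOR), Quine complexity 4.
module white_cell (input a, b, output g, h);
  assign g = a & b;   // carry-generate, equation (1.1)
  assign h = a ^ b;   // carry-propagate, equation (1.2)
endmodule

// Black cell: prefix stage, 3 gates, Quine complexity 6.
module black_cell (input g_hi, h_hi, g_lo, h_lo, output g_out, h_out);
  assign g_out = g_hi | (h_hi & g_lo);  // equation (1.3)
  assign h_out = h_hi & h_lo;           // equation (1.4)
endmodule

// Gray cell: prefix stage, 2 gates, Quine complexity 4 (no H output needed).
module gray_cell (input g_hi, h_hi, g_lo, output g_out);
  assign g_out = g_hi | (h_hi & g_lo);  // equation (1.3), H omitted
endmodule

// Circle: final stage, 1 gate, Quine complexity 2.
module sum_circle (input h, c_in, output s);
  assign s = h ^ c_in;  // equation (1.6)
endmodule
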
1.10 OVERVIEW ON PPA
PPAs are considered effective combinational circuits for performing the binary
addition of two multi-bit numbers. PPAs are widely used in the ALUs of modern
processors such as microprocessors and digital signal processors.
1.11 Literature review
Athira T.S, Divya R, Karthik M, Manikandan A, "Design of Kogge-Stone adder for
fast addition": they propose a Kogge-Stone adder (KSA) with low power consumption
and delay. Usually, ripple carry adders (RCA) are preferred for the addition of two
N-bit numbers, as RCAs provide the fastest design time among all conventional
methods. However, RCAs have the limitation that every full adder block must wait for
the carry bit generated by the previous full adder block. They implemented the Kogge-
Stone adder, which is a parallel-prefix form of carry look-ahead (CLA) adder. Parallel
prefix adders (PPA) are tree-based structures which speed up binary addition; hence
prefix adders are used for fast addition algorithms.
Geeta Rani, Sachin Kumar, "Delay Analysis of Parallel-Prefix Adders": parallel-prefix
adders, or carry-tree adders, use a prefix operation in order to perform efficient
addition. Parallel-prefix adders are now frequently used owing to their high-speed
computation properties. The so-called carry-tree adder uses the prefix operation to
perform arithmetic addition far faster than simple parallel adders such as the ripple
carry adder, carry skip adder and carry select adder. They discuss various parallel-
prefix adders and analyze their delays with respect to one another, so that the fastest
adder, as well as the most suitable adder for a specific operation, can be identified.
Sunil M, Ankith R.D, Manjunatha G.D and Premananda B.S, "Design and
implementation of faster parallel prefix Kogge-Stone adder": multipliers are important
in many DSP systems and in many hardware blocks. Multipliers are used in various
DSP applications, like digital filtering and digital communication, which require
parallel array multipliers to achieve high execution speed and meet performance
targets. A typical implementation of such an array multiplier is the Braun design. The
Braun multiplier is a type of parallel multiplier whose architecture consists of carry-
save adders, a number of AND gates and a ripple carry adder. A Braun multiplier is
proposed with a high-speed parallel prefix adder in place of the ripple carry adder; this
modified Braun multiplier reduces the delay due to the ripple carry adder.
CH. Sudha Rani, CH. Ramesh, "Design and Implementation of High Performance
Parallel Prefix Adders": high-performance adders (also known as carry-tree or parallel
prefix adders) are known to offer the best performance in VLSI designs. However,
this performance advantage does not translate directly into FPGA implementations
due to constraints on logic block configurations and routing overhead. The paper
investigates four types of carry-tree adders (the Kogge-Stone, sparse Kogge-Stone,
spanning-tree and Brent-Kung adders) and compares them to the simple ripple carry
adder (RCA) and carry skip adder (CSA). These designs of varied bit widths were
implemented on a Xilinx Spartan 3E FPGA, and delay measurements were made
with a high-performance logic analyzer. Due to the presence of a fast carry chain, the
RCA designs exhibit better delay performance up to 128 bits; the carry-tree adders
are expected to gain a speed advantage over the RCA as bit widths approach 256.

1.12 Organization of thesis


Chapter 1 gives a detailed introduction to VLSI, the parallel prefix adder and its
architecture, and presents the literature survey.
Chapter 2 explains the existing system and its drawback.
Chapter 3 explains the working of the proposed system and its advantages.
Chapter 4 describes the Verilog hardware description language.
Chapter 5 describes FPGAs and their design flow.
Chapter 6 presents the simulation results, compiled and simulated using Xilinx.
Chapter 7 provides the conclusion of the work undertaken in this thesis. The references
consulted for this research work are also part of that chapter.

1.13 Summary
In this chapter we discussed VLSI technology in detail, went through the parallel prefix
adder, its architecture and how it works, and surveyed the related literature.

Chapter-2
EXISTING KOGGE-STONE ADDER

2.1 KOGGE-STONE ADDER

The Kogge-Stone adder is a parallel-prefix form of carry look-ahead adder, which has
minimum delay. It was developed by Peter M. Kogge and Harold S. Stone, who
published it in 1973. This adder is widely used in high-performance applications.

2.2 16-bit Kogge-Stone adder

The scheme of a 16-bit Kogge-Stone adder is shown in Figure 2.2.1.

Fig. 2.2.1 Scheme of a 16-bit Kogge-Stone adder

This adder first computes the g_i and h_i signals in the pre-processing stage. Then, at
the first level (L=1) of the prefix tree, G[i:k] and H[i:k] signals spanning 2 bits are
computed in parallel. At the second level (L=2) of the prefix tree, G[i:k] and H[i:k]
spanning 4 bits are calculated using the 2-bit results of level 1; the actual carry-out
value of the 4th bit therefore becomes available as the level-2 calculations complete.
At the third level (L=3) of the prefix tree, the carry-out of the 8th bit is computed using
the 4-bit carry result. The same method adopted at level 3 is applied to get the carry-out
of the 16th bit at the fourth level (L=4), and so on. All other bit carries are also computed in

parallel. Finally, at the final processing stage the sums are computed from these final
carry-out signals of the prefix tree.
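
The level-by-level computation just described can be captured in a few lines of Verilog.
The sketch below is our own vectorized illustration (module and signal names are
hypothetical), not the thesis's schematic implementation: each prefix level combines
the level's G/H vectors with a shifted copy of themselves, and for simplicity it applies
the black-cell equations at every active bit position instead of distinguishing gray cells.

module kogge_stone16 (a, b, sum, cout);
  input  [15:0] a, b;
  output [15:0] sum;
  output        cout;

  // Pre-processing stage: g_i and h_i for every bit.
  wire [15:0] g0 = a & b;
  wire [15:0] h0 = a ^ b;

  // Prefix levels 1-4, spans 1, 2, 4 and 8. Shifting 0s into G and
  // 1s into H makes the low-order bits pass through unchanged.
  wire [15:0] g1 = g0 | (h0 & {g0[14:0], 1'b0});
  wire [15:0] h1 = h0 & {h0[14:0], 1'b1};
  wire [15:0] g2 = g1 | (h1 & {g1[13:0], 2'b00});
  wire [15:0] h2 = h1 & {h1[13:0], 2'b11};
  wire [15:0] g3 = g2 | (h2 & {g2[11:0], 4'h0});
  wire [15:0] h3 = h2 & {h2[11:0], 4'hF};
  wire [15:0] g4 = g3 | (h3 & {g3[7:0], 8'h00});  // g4[i] = G[i:0]

  // Final processing stage: carry into bit i is G[i-1:0]; carry-in is 0.
  assign sum  = h0 ^ {g4[14:0], 1'b0};
  assign cout = g4[15];
endmodule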

In the prefix tree, the number of levels is L = log2 n and the number of schematic nodes
(black cells and gray cells) is K = n·log2 n − n + 1. The Quine complexity S_KQ and
the number of logic gates S_KC of the Kogge-Stone adder are given by corresponding
equations derived from these node counts.

2.3 32-bit Kogge-Stone adder

Figure 2.3.1 shows the scheme of a 32-bit Kogge-Stone adder, for input operand bit
widths greater than 16 bits.

Fig. 2.3.1 Scheme of a 32-bit Kogge-Stone adder

2.4 Drawback

 Large hardware complexity

2.5 Summary
In this chapter we discussed the existing 16-bit and 32-bit Kogge-Stone adders and the
drawback of the existing method.

Chapter-3
MODIFIED PARALLEL PREFIX ADDER

Let ◦ be an arbitrary associative binary operation. A prefix circuit for ◦ is a
combinational circuit which takes n inputs x1, x2, ..., xn and generates the n outputs
x1, x1 ◦ x2, x1 ◦ x2 ◦ x3, ..., x1 ◦ ··· ◦ xn.

To explain what a prefix adder is, it is compared with the full adder, the most familiar
adder of all. The biggest difference between the two adders is that in the full adder,
the sum and carry calculations are done in the same one-bit block, whereas in the prefix
adder the sum and carry calculations are separated from the bit block and all carry
computation is treated as a whole in the carry graph. The carry graph uses the prefix
circuit defined above, and this is the origin of the name "prefix adder". The following
is the mathematical derivation of a prefix adder. Let A = a_(n-1)a_(n-2)...a_0 and
B = b_(n-1)b_(n-2)...b_0 be n-bit binary numbers with sum S = s_(n-1)s_(n-2)...s_0.
The least significant bit (LSB) is bit 0 and the most significant bit (MSB) is bit n-1. If
c_0 is the carry-in signal to the LSB, the following equations can be used to compute
the sum:

g_i = a_i · b_i
p_i = a_i ⊕ b_i
c_(i+1) = g_i + p_i · c_i
s_i = p_i ⊕ c_i
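
Written directly in Verilog, this recurrence becomes a ripple-carry chain in which each
carry waits on the previous one — precisely the serial dependence a prefix tree removes.
This is a sketch of ours for illustration (module and signal names are hypothetical):

module ripple_adder (a, b, c0, s, cn);
  parameter N = 16;
  input  [N-1:0] a, b;   // operands
  input          c0;     // carry into the LSB
  output [N-1:0] s;      // sum bits
  output         cn;     // final carry-out

  wire [N:0] c;
  assign c[0] = c0;

  genvar i;
  generate
    for (i = 0; i < N; i = i + 1) begin : fa
      assign s[i]   = a[i] ^ b[i] ^ c[i];                     // s_i = p_i XOR c_i
      assign c[i+1] = (a[i] & b[i]) | ((a[i] ^ b[i]) & c[i]); // c_(i+1) = g_i + p_i*c_i
    end
  endgenerate

  assign cn = c[N];
endmodule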

3.1 Modified PPA
The modified parallel prefix adder is developed to reduce the hardware complexity of
the Kogge-Stone adder. Figure 3.1.1 shows the scheme of a 16-bit modified parallel
prefix adder. The architecture employs the associative property of the PPA to keep the
number of computation nodes at a minimum, by eliminating the massive overlap
between the prefix sub-terms being computed.

Fig.3.1.1. Scheme of a 16-bit modified parallel prefix adder

The construction of the first level of the prefix tree of this adder is similar to that of the
Kogge-Stone adder. The main structural difference begins at the second level of the
prefix tree: at the second level, groups of two schematic nodes are formed; at the 3rd
level, groups comprise four schematic nodes; at the 4th level, groups include 8
schematic nodes; and so on.

The CMOS logic family can efficiently implement only inverting functions. The
inverting property of CMOS logic is exploited by alternately cascading odd
computation cells and even computation cells. An alternating cascade of odd and even
computation cells eliminates two pairs of inverters between successive stages. This
benefit is obtained when both inputs of a computation node in stage i come from stage
i − (2j + 1), where j ≥ 0. A pair of inverters is introduced in the path when a dot or a
semi-dot computation node in stage i receives any of its inputs from stage i − 2j. Each
such inverter pair is marked explicitly in the prefix graph. From the prefix graph of the
proposed structure shown in Figures 3.1.1 and 3.2.1, it is observed that there are only a
few edges carrying a pair of inverters. Thus, by introducing two cells for the dot
operator and two cells for the semi-dot operator, a large number of inverters are
eliminated. Due to the inverter elimination, the propagation delay in those paths is
reduced. It further yields a power reduction, since these inverters, if not eliminated,
would have contributed a significant amount of switching power dissipation. The
output of an odd-semi-dot cell gives the true value of the carry signal at the
corresponding bit position; the output of an even-semi-dot cell gives the complemented
value of the carry signal at the corresponding bit position.

The prefix computation employed for deriving the carry signals in the proposed
architecture introduces four different computation nodes to achieve improved
performance. The proposed architecture uses two cells for the dot operator and two
cells for the semi-dot operator in the prefix-computation step: the first dot cell is named
odd-dot and the second even-dot; likewise, the first semi-dot cell is named
odd-semi-dot and the second even-semi-dot, each drawn with its own symbol in the
prefix graph. The last computation node in each column of the proposed architecture
is a semi-dot operator. Stages with odd indices use odd-dot and odd-semi-dot cells,
whereas stages with even indices use even-dot and even-semi-dot cells. The lateral
fan-out increases slightly in the proposed architecture, but we gain the advantage of
limited interconnect lengths, since the prefix graph grows only along the main
diagonal. This adder first computes the g_i and h_i signals in the first stage. Then, at
the first level of the prefix tree, G[i:k] and H[i:k] signals spanning 2 bits are computed
in parallel; it then computes G[i:k] and H[i:k] signals for pairs of columns, then for
blocks of 4, then for blocks of 8, then 16, and so on until the final G[i:k] signal for
every column is known. Finally, at the last stage this adder computes the sums from
the generate signals obtained in the preceding prefix computation stage.

The number of levels of the prefix tree is L = log2 n, and the number of schematic
nodes, which follows from the grouping described above, is smaller than that of the
Kogge-Stone adder. The number of logic gates C_ModifiedPPA and the Quine
complexity Q_ModifiedPPA of this modified adder are likewise calculated from
corresponding equations.

3.2 32-bit Modified PPA

The scheme of a 32-bit modified parallel prefix adder is shown in Figure 3.2.1, for
input operand bit widths greater than 16 bits.

Fig. 3.2.1 Scheme of a 32-bit modified parallel prefix adder

The first stage in the architecture of the proposed prefix adder structure involves the
creation of kill, propagate and complementary generate signals for the individual
operand bits, where a_i and b_i denote the input operand bits. Following the standard
definitions, these signals are:

k_i = (a_i + b_i)'
p_i = a_i ⊕ b_i
g_i' = (a_i · b_i)'

The prefix computation in this scheme is responsible for creating group generate and
group kill signals. The odd-dot operator and the odd-semi-dot operator take active-low
group generate and active-high group kill inputs and produce active-high group
generate and active-low group kill outputs. The even-dot operator and the
even-semi-dot operator take active-high group generate and active-low group kill
inputs and yield active-low group generate and active-high group kill outputs. The
computation for the odd-dot operator is defined by its corresponding equation.

The second cell for the dot operator, named even-dot, is defined by its own equation.
Similarly, there are two cells designed for the semi-dot operator: the first, named
odd-semi-dot, and the second, named even-semi-dot, work according to their
respective equations.

The output of an odd-semi-dot cell gives the true value of the carry signal at the
corresponding bit position, while the output of an even-semi-dot cell gives the
complemented value of the carry signal at the corresponding bit position. The final
stage involves the generation of the sum bits from the propagate signals of the
individual operand bits and the carry bits, which are available in true or complemented
form.

3.3 ADVANTAGES

 Reduced hardware complexity
 High speed
 Fewer logic gates compared to existing methods
 Lower delay
 Fewer schematic nodes compared to the existing PPA

3.4 Summary

In this chapter we discussed the modified parallel prefix adder and its advantages over
existing methods.

Chapter-4
Verilog HDL
In the semiconductor and electronic design industry, Verilog is a hardware description
language (HDL) used to model electronic systems. Verilog HDL, not to be confused
with VHDL (a competing language), is most commonly used in the design, verification,
and implementation of digital logic chips at the register-transfer level of abstraction. It
is also used in the verification of analog and mixed-signal circuits.

4.1 Overview on Verilog

Hardware description languages such as Verilog differ from software programming
languages because they include ways of describing the propagation of time and signal
dependencies (sensitivity). There are two assignment operators: a blocking assignment
(=) and a non-blocking (<=) assignment. The non-blocking assignment allows
designers to describe a state-machine update without needing to declare and use
temporary storage variables. Since these concepts are part of Verilog's language
semantics, designers could quickly write descriptions of large circuits in a relatively
compact and concise form. At the time of Verilog's introduction (1984), Verilog
represented a tremendous productivity improvement for circuit designers who were
already using graphical schematic capture software and specially written software
programs to document and simulate electronic circuits.

The designers of Verilog wanted a language with syntax similar to the C programming
language, which was already widely used in engineering software development. Like
C, Verilog is case-sensitive and has a basic preprocessor (though less sophisticated
than that of ANSI C/C++). Its control flow keywords (if/else, for, while, case, etc.) are
equivalent, and its operator precedence is compatible. Syntactic differences include
variable declaration (Verilog requires bit-widths on net/reg types), demarcation of
procedural blocks (begin/end instead of curly braces {}), and many other minor
differences.

A Verilog design consists of a hierarchy of modules. Modules encapsulate design
hierarchy, and communicate with other modules through a set of declared input, output,
and bidirectional ports. Internally, a module can contain any combination of the
following: net/variable declarations (wire, reg, integer, etc.), concurrent and sequential
statement blocks, and instances of other modules (sub-hierarchies). Sequential
statements are placed inside a begin/end block and executed in sequential order within

the block. However, the blocks themselves are executed concurrently, making Verilog
a dataflow language.

Verilog's concept of 'wire' consists of both signal values (4-state: "1, 0, floating,
undefined") and strengths (strong, weak, etc.). This system allows abstract modeling of
shared signal lines, where multiple sources drive a common net. When a wire has
multiple drivers, the wire's (readable) value is resolved by a function of the source
drivers and their strengths.

A subset of statements in the Verilog language is synthesizable. Verilog modules that
conform to a synthesizable coding style, known as RTL (register-transfer level), can be
physically realized by synthesis software. Synthesis software algorithmically
transforms the (abstract) Verilog source into a netlist, a logically equivalent description
consisting only of elementary logic primitives (AND, OR, NOT, flip-flops, etc.) that
are available in a specific FPGA or VLSI technology. Further manipulations of the
netlist ultimately lead to a circuit fabrication blueprint (such as a photomask set for an
ASIC or a bitstream file for an FPGA).

4.2 History
4.2.1 Beginning
Verilog was the first modern hardware description language to be invented. It was
created by Phil Moorby and Prabhu Goel during the winter of 1983/1984 at Automated
Integrated Design Systems (renamed Gateway Design Automation in 1985) as a
hardware modeling language. Gateway Design Automation was purchased by Cadence
Design Systems in 1990. Cadence now has full proprietary rights to Gateway's Verilog
and Verilog-XL, the HDL simulator that would become the de facto standard (of
Verilog logic simulators) for the next decade. Originally, Verilog was intended to
describe and allow simulation; only afterwards was support for synthesis added.

4.2.2 Verilog-95

With the increasing success of VHDL at the time, Cadence decided to make the
language available for open standardization. Cadence transferred Verilog into the
public domain under the Open Verilog International (OVI) (now known as Accellera)
organization. Verilog was later submitted to IEEE and became IEEE Standard 1364-
1995, commonly referred to as Verilog-95.

In the same time frame Cadence initiated the creation of Verilog-A to put standards
support behind its analog simulator Spectre. Verilog-A was never intended to be a
standalone language and is a subset of Verilog-AMS which encompassed Verilog-95.

4.2.3 Verilog 2001

Extensions to Verilog-95 were submitted back to IEEE to cover the deficiencies that
users had found in the original Verilog standard. These extensions became IEEE
Standard 1364-2001 known as Verilog-2001.

Verilog-2001 is a significant upgrade from Verilog-95. First, it adds explicit support
for (2's complement) signed nets and variables. Previously, code authors had to perform
signed operations using awkward bit-level manipulations (for example, the carry-out
bit of a simple 8-bit addition required an explicit description of the Boolean algebra to
determine its correct value). The same function under Verilog-2001 can be more
succinctly described by one of the built-in operators: +, -, /, *, >>>. A
generate/endgenerate construct (similar to VHDL's generate/endgenerate) allows
Verilog-2001 to control instance and statement instantiation through normal decision
operators (case/if/else). Using generate/endgenerate, Verilog-2001 can instantiate an
array of instances, with control over the connectivity of the individual instances. File
I/O has been improved by several new system tasks. And finally, a few syntax additions
were introduced to improve code readability (e.g. always @*, named parameter
override, C-style function/task/module header declaration).

Verilog-2001 is the dominant flavor of Verilog supported by the majority of
commercial EDA software packages.

4.2.4 Verilog 2005

Not to be confused with SystemVerilog, Verilog 2005 (IEEE Standard 1364-2005)
consists of minor corrections, spec clarifications, and a few new language features
(such as the uwire keyword).

A separate part of the Verilog standard, Verilog-AMS, attempts to integrate analog and
mixed signal modeling with traditional Verilog.

Example

A hello world program looks like this:

module main;
  initial
    begin
      $display("Hello world!");
      $finish;
    end
endmodule

A simple example of two flip-flops follows:

module toplevel (clock, reset);
  input clock;
  input reset;

  reg flop1;
  reg flop2;

  always @ (posedge reset or posedge clock)
    if (reset)
      begin
        flop1 <= 0;
        flop2 <= 1;
      end
    else
      begin
        flop1 <= flop2;
        flop2 <= flop1;
      end
endmodule

The "<=" operator in Verilog is another aspect of its being a hardware description
language as opposed to a normal procedural language. This is known as a "non-
blocking" assignment. Its action doesn't register until the next clock cycle. This means
that the order of the assignments is irrelevant and will produce the same result: flop1
and flop2 will swap values every clock.

The other assignment operator, "=", is referred to as a blocking assignment. When "="
assignment is used, for the purposes of logic, the target variable is updated immediately.
In the above example, had the statements used the "=" blocking operator instead of
"<=", flop1 and flop2 would not have been swapped. Instead, as in traditional
programming, the compiler would understand to simply set flop1 equal to flop2 (and
subsequently ignore the redundant logic to set flop2 equal to flop1.)
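
To see this concretely, here is the same process rewritten with blocking assignments
(our own illustrative variant, not from the original text):

module toplevel_blocking (clock, reset);
  input clock;
  input reset;

  reg flop1;
  reg flop2;

  always @ (posedge reset or posedge clock)
    if (reset)
      begin
        flop1 = 0;
        flop2 = 1;
      end
    else
      begin
        flop1 = flop2;  // flop1 is updated immediately
        flop2 = flop1;  // reads the NEW flop1, so both end up equal: no swap
      end
endmodule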

An example counter circuit follows:

module Div20x (rst, clk, cet, cep, count, tc);
  // TITLE 'Divide-by-20 Counter with enables'
  // enable CEP is a clock enable only
  // enable CET is a clock enable and
  // enables the TC output
  // a counter using the Verilog language

  parameter size = 5;
  parameter length = 20;

  input rst;  // These inputs/outputs represent
  input clk;  // connections to the module.
  input cet;
  input cep;

  output [size-1:0] count;
  output tc;

  reg [size-1:0] count;  // Signals assigned
                         // within an always
                         // (or initial) block
                         // must be of type reg

  wire tc;  // Other signals are of type wire

  // The always statement below is a parallel
  // execution statement that
  // executes any time the signals
  // rst or clk transition from low to high

  always @ (posedge clk or posedge rst)
    if (rst)  // This causes reset of the cntr
      count <= {size{1'b0}};
    else
      if (cet && cep)  // Enables both true
        begin
          if (count == length-1)
            count <= {size{1'b0}};
          else
            count <= count + 1'b1;
        end

  // the value of tc is continuously assigned
  // the value of the expression
  assign tc = (cet && (count == length-1));

endmodule

An example of delays:

...
reg a, b, c, d;
wire e;
...
always @(b or e)
  begin
    a = b & e;
    b = a | b;
    #5 c = b;
    d = #6 c ^ e;
  end

The always clause above illustrates the other type of method of use, i.e. it executes
whenever any of the entities in the list (the b or e) changes. When one of these changes,
a is immediately assigned a new value, and due to the blocking assignment, b is
assigned a new value afterward (taking into account the new value of a). After a delay
of 5 time units, c is assigned the value of b and the value of c ^ e is tucked away in an
invisible store. Then after 6 more time units, d is assigned the value that was tucked
away.

Signals that are driven from within a process (an initial or always block) must be of
type reg. Signals that are driven from outside a process must be of type wire. The
keyword reg does not necessarily imply a hardware register.

Definition of constants

The definition of constants in Verilog supports the addition of a width parameter. The
basic syntax is:

<Width in bits>'<base letter><number>

Examples:

 12'h123 - Hexadecimal 123 (using 12 bits)


 20'd44 - Decimal 44 (using 20 bits - 0 extension is automatic)
 4'b1010 - Binary 1010 (using 4 bits)
 6'o77 - Octal 77 (using 6 bits)

Synthesizeable constructs

There are several statements in Verilog that have no analog in real hardware, e.g.
$display. Consequently, much of the language can not be used to describe hardware.
The examples presented here are the classic subset of the language that has a direct
mapping to real gates.

// Mux examples - Three ways to do the same thing.

// The first example uses continuous assignment
wire out;
assign out = sel ? a : b;

// the second example uses a procedure
// to accomplish the same thing.
reg out;
always @(a or b or sel)
  begin
    case (sel)
      1'b0: out = b;
      1'b1: out = a;
    endcase
  end

// Finally - you can use if/else in a
// procedural structure.
reg out;
always @(a or b or sel)
  if (sel)
    out = a;
  else
    out = b;

The next interesting structure is a transparent latch; it will pass the input to the output
when the gate signal is set for "pass-through", and captures the input and stores it upon
transition of the gate signal to "hold". The output will remain stable regardless of the
input signal while the gate is set to "hold". In the example below the "pass-through"
level of the gate would be when the value of the if clause is true, i.e. gate = 1. This is
read "if gate is true, din is fed to out continuously." Once the if clause is false, the last
value of out will remain, independent of the value of din.

// Transparent latch example

reg out;
always @(gate or din)
  if (gate)
    out = din;  // Pass through state
// Note that the else isn't required here. The variable
// out will follow the value of din while gate is high.
// When gate goes low, out will remain constant.

The flip-flop is the next significant template; in Verilog, the D-flop is the simplest, and
it can be modeled as:

reg q;
always @(posedge clk)
  q <= d;

The significant thing to notice in the example is the use of the non-blocking assignment.
A basic rule of thumb is to use <= when there is a posedge or negedge statement within
the always clause.

A variant of the D-flop is one with an asynchronous reset; there is a convention that the
reset state will be the first if clause within the statement.

reg q;
always @(posedge clk or posedge reset)
  if (reset)
    q <= 0;
  else
    q <= d;

The next variant is including both an asynchronous reset and asynchronous set
condition; again the convention comes into play, i.e. the reset term is followed by the
set term.

reg q;
always @(posedge clk or posedge reset or posedge set)
  if (reset)
    q <= 0;
  else
    if (set)
      q <= 1;
    else
      q <= d;

Note: If this model is used to model a Set/Reset flip flop then simulation errors can
result. Consider the following test sequence of events. 1) reset goes high 2) clk goes
high 3) set goes high 4) clk goes high again 5) reset goes low followed by 6) set going
low. Assume no setup and hold violations.

In this example the always @ statement would first execute when the rising edge of
reset occurs which would place q to a value of 0. The next time the always block
executes would be the rising edge of clk which again would keep q at a value of 0. The
always block then executes when set goes high which because reset is high forces q to
remain at 0. This condition may or may not be correct depending on the actual flip flop.
However, this is not the main problem with this model. Notice that when reset goes
low, that set is still high. In a real flip flop this will cause the output to go to a 1.

However, in this model it will not occur because the always block is triggered by rising
edges of set and reset - not levels. A different approach may be necessary for set/reset
flip flops.

The final basic variant is one that implements a D-flop with a mux feeding its input.
The mux has a d-input and feedback from the flop itself. This allows a gated load
function.

// Basic structure with an EXPLICIT feedback path
always @(posedge clk)
  if (gate)
    q <= d;
  else
    q <= q;  // explicit feedback path

// The more common structure ASSUMES the feedback is present
// This is a safe assumption since this is how the
// hardware compiler will interpret it. This structure
// looks much like a latch. The differences are the
// '@(posedge clk)' and the non-blocking '<='
//
always @(posedge clk)
  if (gate)
    q <= d;  // the "else" mux is "implied"

Note that there are no "initial" blocks mentioned in this description. There is a split
between FPGA and ASIC synthesis tools on this structure. FPGA tools allow initial
blocks where reg values are established instead of using a "reset" signal. ASIC synthesis
tools don't support such a statement. The reason is that an FPGA's initial state is
something that is downloaded into the memory tables of the FPGA. An ASIC is an
actual hardware implementation.

Initial and always

There are two separate ways of declaring a Verilog process. These are the always and
the initial keywords. The always keyword indicates a free-running process. The initial
keyword indicates a process executes exactly once. Both constructs begin execution at
simulator time 0, and both execute until the end of the block. Once an always block has
reached its end, it is rescheduled (again). It is a common misconception to believe that
an initial block will execute before an always block. In fact, it is better to think of the
initial-block as a special-case of the always-block, one which terminates after it
completes for the first time.

// Examples:
initial
  begin
    a = 1;  // Assign a value to reg a at time 0
    #1;     // Wait 1 time unit
    b = a;  // Assign the value of reg a to reg b
  end

always @(a or b)  // Any time a or b CHANGE, run the process
  begin
    if (a)
      c = b;
    else
      d = ~b;
  end  // Done with this block, now return to the top (i.e. the @ event-control)

always @(posedge a)  // Run whenever reg a has a low to high change
  a <= b;

These are the classic uses for these two keywords, but there are two significant
additional uses. The most common of these is an always keyword without the @(...)
sensitivity list. It is possible to use always as shown below:

always
  begin  // Always begins executing at time 0 and NEVER stops
    clk = 0;  // Set clk to 0
    #1;       // Wait for 1 time unit
    clk = 1;  // Set clk to 1
    #1;       // Wait 1 time unit
  end  // Keeps executing - so continue back at the top of the begin

The always keyword acts similarly to the C construct while(1) {...} in the sense that it
will execute forever.

The other interesting exception is the use of the initial keyword with the addition of the
forever keyword.

The example below is functionally identical to the always example above.

initial forever   // Start at time 0 and repeat the begin/end forever
begin
    clk = 0;   // Set clk to 0
    #1;        // Wait for 1 time unit
    clk = 1;   // Set clk to 1
    #1;        // Wait 1 time unit
end

Fork/join

The fork/join pair is used by Verilog to create parallel processes. All statements (or
blocks) between a fork/join pair begin execution simultaneously when execution flow
hits the fork. Execution continues after the join upon completion of the longest-running
statement or block between the fork and join.

initial
fork
    $write("A");   // Print char A
    $write("B");   // Print char B
    begin
        #1;            // Wait 1 time unit
        $write("C");   // Print char C
    end
join

As written above, it is possible for either the sequence "ABC" or the sequence "BAC"
to print out. The order of simulation between the first $write and the second $write
depends on the simulator implementation, and may purposefully be randomized by the
simulator. This allows the simulation to contain both accidental race conditions as
well as intentional non-deterministic behavior.

Notice that, unlike Verilog, VHDL cannot dynamically spawn multiple processes in this way.

Race conditions

The order of execution isn't always guaranteed within Verilog. This can best be
illustrated by a classic example. Consider the code snippet below:

initial
    a = 0;

initial
    b = a;

initial
begin
    #1;
    $display("Value of a=%b Value of b=%b", a, b);
end

What will be printed out for the values of a and b? Depending on the order of execution
of the initial blocks, it could be zero and zero, or alternatively zero and some other
arbitrary uninitialized value. The $display statement will always execute after both
assignment blocks have completed, due to the #1 delay.
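
One simple way to avoid this particular race (a sketch, not the only possible fix) is
to perform the dependent initializations in a single initial block, where the ordering
is well defined:

initial
begin
    a = 0;
    b = a;   // b is now guaranteed to see the value 0
end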

Operators

Note: These operators are not shown in order of precedence.

Operator type    Operator symbols    Operation performed
---------------------------------------------------------
Bitwise          ~                   NOT (1's complement)
                 &                   AND
                 |                   OR
                 ^                   XOR
                 ~^ or ^~            XNOR
Logical          !                   NOT
                 &&                  AND
                 ||                  OR
Reduction        &                   AND
                 ~&                  NAND
                 |                   OR
                 ~|                  NOR
                 ^                   XOR
                 ~^ or ^~            XNOR
Arithmetic       +                   Addition
                 -                   Subtraction
                 -                   2's complement (unary)
                 *                   Multiplication
                 /                   Division
                 **                  Exponentiation (*Verilog-2001)
Relational       >                   Greater than
                 <                   Less than
                 >=                  Greater than or equal to
                 <=                  Less than or equal to
                 ==                  Logical equality (bit-value 1'bX is removed from comparison)
                 !=                  Logical inequality (bit-value 1'bX is removed from comparison)
                 ===                 4-state logical equality (bit-value 1'bX is taken as literal)
                 !==                 4-state logical inequality (bit-value 1'bX is taken as literal)
Shift            >>                  Logical right shift
                 <<                  Logical left shift
                 >>>                 Arithmetic right shift (*Verilog-2001)
                 <<<                 Arithmetic left shift (*Verilog-2001)
Concatenation    { , }               Concatenation
Replication      {n{m}}              Replicate value m n times
Conditional      ?:                  Conditional
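
A few of the less C-like operators in use (an illustrative fragment; the values are
arbitrary):

wire [3:0] a = 4'b1010;
wire       all_ones = &a;            // reduction AND  -> 1'b0
wire       any_one  = |a;            // reduction OR   -> 1'b1
wire [7:0] cat      = {a, 4'b0011};  // concatenation  -> 8'b10100011
wire [7:0] rep      = {2{a}};        // replication    -> 8'b10101010
wire [3:0] shr      = a >> 1;        // logical right shift -> 4'b0101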

Four-valued logic
The IEEE 1364 standard defines a four-valued logic with the states 0, 1, Z (high
impedance), and X (unknown logic value). For the competing VHDL, a dedicated
standard for multi-valued logic exists as IEEE 1164, with nine levels.
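
A minimal simulation sketch of how the Z and X values arise (module and signal names
are invented for illustration):

module four_state_demo;
    reg  en_a, en_b, drv_a, drv_b;
    wire bus;

    // Two tri-state drivers sharing one net
    assign bus = en_a ? drv_a : 1'bz;
    assign bus = en_b ? drv_b : 1'bz;

    initial begin
        en_a = 0; en_b = 0; drv_a = 1; drv_b = 0;
        #1 $display("bus=%b", bus);   // z: no driver enabled
        en_a = 1; en_b = 1;
        #1 $display("bus=%b", bus);   // x: conflicting 1 and 0 drivers
    end
endmodule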

4.3 Summary

Hardware description languages such as Verilog differ from software programming
languages in that they include ways of describing propagation time and signal
strengths (sensitivity). There are two types of assignment operators: the blocking
assignment (=) and the non-blocking assignment (<=). The non-blocking assignment
allows designers to describe a state-machine update without needing to declare and use
temporary storage variables. Since these concepts are part of Verilog's language
semantics, designers could quickly write descriptions of large circuits in a relatively
compact and concise form. At the time of Verilog's introduction (1984), Verilog
represented a tremendous productivity improvement for circuit designers who were
already using graphical schematic capture software and specially written software
programs to document and simulate electronic circuits.
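
The distinction is easiest to see in a register swap (an illustrative fragment; clk, a
and b are assumed to be declared as reg elsewhere):

// Non-blocking: both right-hand sides are sampled before either
// register is updated, so a and b swap without a temporary variable.
always @(posedge clk)
begin
    a <= b;
    b <= a;
end

// With blocking assignments (a = b; b = a;) the second statement would
// see the already-updated a, leaving both registers equal to the old b.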

The designers of Verilog wanted a language with syntax similar to the C programming
language, which was already widely used in engineering software development. Like
C, Verilog is case-sensitive and has a basic preprocessor (though less sophisticated than
that of ANSI C/C++). Its control flow keywords (if/else, for, while, case, etc.) are
equivalent, and its operator precedence is compatible with C. Syntactic differences
include required bit-widths for variable declarations and the demarcation of procedural
blocks (Verilog uses begin/end instead of curly braces {}), among many other minor
differences. Verilog requires that variables be given a definite size; in C these sizes
are inferred from the 'type' of the variable (for instance, an integer type may be 8 bits).
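
For example (an illustrative fragment):

reg  [7:0]  counter;    // 8-bit variable; the width is explicit
wire [31:0] data_bus;   // 32-bit net
reg         flag;       // the width defaults to 1 bit when omitted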

A Verilog design consists of a hierarchy of modules. Modules encapsulate design
hierarchy, and communicate with other modules through a set of declared input, output,
and bidirectional ports. Internally, a module can contain any combination of the
following: net/variable declarations (wire, reg, integer, etc.), concurrent and sequential
statement blocks, and instances of other modules (sub-hierarchies). Sequential
statements are placed inside a begin/end block and executed in sequential order within
the block. However, the blocks themselves are executed concurrently, making Verilog
a dataflow language.
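
A minimal sketch of this module structure (module and port names are invented for
illustration):

// A leaf module with declared ports
module and_or (
    input  wire a, b,
    output wire y_and, y_or
);
    assign y_and = a & b;   // concurrent statement
    assign y_or  = a | b;   // executes concurrently with the one above
endmodule

// A parent module instantiating it (a sub-hierarchy)
module top (
    input  wire x, y,
    output wire z
);
    wire n1, n2;
    and_or u0 (.a(x), .b(y), .y_and(n1), .y_or(n2));
    assign z = n1 ^ n2;
endmodule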

Verilog's concept of 'wire' consists of both signal values (4-state: "1, 0, floating,
undefined") and signal strengths (strong, weak, etc.). This system allows abstract
modeling of shared signal lines, where multiple sources drive a common net. When a
wire has multiple drivers, the wire's (readable) value is resolved by a function of the
source drivers and their strengths.

A subset of statements in the Verilog language are synthesizable. Verilog modules that
conform to a synthesizable coding style, known as RTL (register-transfer level), can be
physically realized by synthesis software. Synthesis software algorithmically
transforms the (abstract) Verilog source into a netlist, a logically equivalent description
consisting only of elementary logic primitives (AND, OR, NOT, flip-flops, etc.) that
are available in a specific FPGA or VLSI technology. Further manipulations to the
netlist ultimately lead to a circuit fabrication blueprint (such as a photo mask set for an
ASIC or a bitstream file for an FPGA).
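
As an illustration (a sketch with assumed names, not code from this project), the
following module is written in a synthesizable RTL style; simulation-only constructs
such as # delays, initial blocks and $display are generally ignored or rejected by
synthesis tools:

module counter8 (
    input            clk,
    input            rst,
    output reg [7:0] count
);
    always @(posedge clk)
    begin
        if (rst)
            count <= 8'd0;           // synchronous reset
        else
            count <= count + 8'd1;   // maps to an 8-bit register plus adder
    end
endmodule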

Chapter-5
SYNTHESIS RESULTS

5.1 Synthesis

Figure 5.1.1 FPGA Synthesis

Synthesis is the process that translates VHDL/Verilog code into a device netlist
format, i.e., a complete circuit with logical elements (gates, flip-flops, etc.) for the
design. If the design contains more than one sub-design (for example, to implement a
processor we need a CPU as one design element and a RAM as another, and so on), then
the synthesis process generates a netlist for each design element. The synthesis process
also checks the code syntax and analyzes the hierarchy of the design, which ensures that
the design is optimized for the architecture the designer has selected. The resulting
netlist(s) is saved to an NGC (Native Generic Circuit) file (for Xilinx® Synthesis
Technology (XST)).

5.2 Implementation

This process consists of a sequence of three steps:

 Translate
 Map
 Place and Route

Translate:

This process combines all the input netlists and constraints into a logic design file.
This information is saved as an NGD (Native Generic Database) file, which can be
produced using the NGDBuild program. Here, defining constraints means assigning the
ports in the design to the physical elements (e.g. pins, switches, buttons) of the
targeted device and specifying the timing requirements of the design. This information
is stored in a file named UCF (User Constraints File). Tools used to create or modify
the UCF are PACE, the Constraints Editor, etc.

Figure 5.2.1 FPGA Translate

Map:

This process divides the whole circuit of logical elements into sub-blocks so that
they can fit into the FPGA logic blocks. That is, the map process fits the logic defined
by the NGD file into the targeted FPGA elements (Configurable Logic Blocks (CLB),
Input/Output Blocks (IOB)) and generates an NCD (Native Circuit Description) file,
which physically represents the design mapped to the components of the FPGA. The MAP
program is used for this purpose.

Figure 5.2.2 FPGA Map

Place and Route:

The PAR program is used for this process. The place and route process places the
sub-blocks from the map process into logic blocks according to the constraints and
connects the logic blocks. For example, if a sub-block is placed in a logic block very
near an I/O pin, it may save routing time but may violate some other constraint; the
trade-off between all the constraints is taken into account by the place and route
process.

The PAR tool takes the mapped NCD file as input and produces a completely
routed NCD file as output. The output NCD file contains the routing information.

Figure 5.2.3 FPGA Place and Route
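
For reference, a typical ISE command-line flow corresponding to the three steps above
might look as follows (an illustrative sketch only: the file names are invented and the
exact options vary between ISE versions, so the tool documentation should be consulted):

xst -ifn design.xst                                   # synthesis (XST); produces design.ngc
ngdbuild -uc design.ucf design.ngc design.ngd         # Translate
map -o design_map.ncd design.ngd design.pcf           # Map
par -w design_map.ncd design_routed.ncd design.pcf    # Place and Route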

5.3 Synthesis Result

To investigate the advantages of using our technique in terms of area overhead
against "Fully ECC" and against partial protection, we implemented and synthesized,
for a Xilinx XC3S500E, different versions of a 32-bit, 32-entry, dual-read-port,
single-write-port register file. Once the functional verification is done, the RTL
model is taken through the synthesis process using the Xilinx ISE tool. In the
synthesis process, the RTL model is converted to a gate-level netlist mapped to a
specific technology library. In the Spartan-3E family, many different devices are
available in the Xilinx ISE tool. To synthesize this design, the device named
"XC3S500E" was chosen, with the package "FG320" and the speed grade "-4".
5.4 RTL Schematic

The RTL (Register Transfer Level) design can be viewed as a black box after the
synthesis of the design is done. It shows the inputs and outputs of the system. By
double-clicking on the diagram we can see the gates, flip-flops and MUXes inside.

Figure 5.4.1 RTL block diagram of the existing 16-bit Kogge-Stone adder

From the RTL block diagram we can see that the 16-bit Kogge-Stone adder has three
inputs (a[15:0], b[15:0], and cin) and two outputs (sum[15:0] and the carry out, cout).
The internal RTL schematic of the existing 16-bit Kogge-Stone adder is shown in
Figure 5.4.5 below.
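
For concreteness, a generic 16-bit Kogge-Stone adder with this interface is sketched
below in Verilog. This is a textbook-style structure (the module name ksa16 and all
internal names are invented), not necessarily the exact RTL used in this project:

module ksa16 (
    input  [15:0] a, b,
    input         cin,
    output [15:0] sum,
    output        cout
);
    // Bitwise generate and propagate terms
    wire [15:0] g0 = a & b;
    wire [15:0] p0 = a ^ b;

    // Kogge-Stone prefix tree: log2(16) = 4 stages at distances 1, 2, 4, 8
    wire [15:0] g1, p1, g2, p2, g3, p3, g4, p4;

    genvar i;
    generate
        for (i = 0; i < 16; i = i + 1) begin : st1   // distance 1
            if (i >= 1) begin
                assign g1[i] = g0[i] | (p0[i] & g0[i-1]);
                assign p1[i] = p0[i] & p0[i-1];
            end else begin
                assign g1[i] = g0[i];
                assign p1[i] = p0[i];
            end
        end
        for (i = 0; i < 16; i = i + 1) begin : st2   // distance 2
            if (i >= 2) begin
                assign g2[i] = g1[i] | (p1[i] & g1[i-2]);
                assign p2[i] = p1[i] & p1[i-2];
            end else begin
                assign g2[i] = g1[i];
                assign p2[i] = p1[i];
            end
        end
        for (i = 0; i < 16; i = i + 1) begin : st3   // distance 4
            if (i >= 4) begin
                assign g3[i] = g2[i] | (p2[i] & g2[i-4]);
                assign p3[i] = p2[i] & p2[i-4];
            end else begin
                assign g3[i] = g2[i];
                assign p3[i] = p2[i];
            end
        end
        for (i = 0; i < 16; i = i + 1) begin : st4   // distance 8
            if (i >= 8) begin
                assign g4[i] = g3[i] | (p3[i] & g3[i-8]);
                assign p4[i] = p3[i] & p3[i-8];
            end else begin
                assign g4[i] = g3[i];
                assign p4[i] = p3[i];
            end
        end
    endgenerate

    // g4[i]/p4[i] are now the group generate/propagate over bits [i:0];
    // the carry into bit i is G[i-1:0] | (P[i-1:0] & cin)
    wire [16:0] c;
    assign c[0] = cin;
    generate
        for (i = 1; i <= 16; i = i + 1) begin : cry
            assign c[i] = g4[i-1] | (p4[i-1] & cin);
        end
    endgenerate

    assign sum  = p0 ^ c[15:0];
    assign cout = c[16];
endmodule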

Figure 5.4.2 RTL block diagram of the existing 32-bit Kogge-Stone adder

The above figure shows the RTL schematic of the 32-bit Kogge-Stone adder, which
consists of three inputs (a[31:0], b[31:0], cin) and two outputs, sum[31:0] and cout.

Figure 5.4.3 RTL block diagram of the proposed 16-bit Kogge-Stone adder

The proposed 16-bit Kogge-Stone adder has three inputs (a[15:0], b[15:0], cin) and
two outputs, sum[15:0] and cout.

Figure 5.4.4 RTL block diagram of the proposed 32-bit PPA

Figure 5.4.5 RTL schematic of the existing 16-bit Kogge-Stone adder

Figure 5.4.6 RTL schematic of the existing 32-bit Kogge-Stone adder

Figure 5.4.7 RTL schematic of the proposed 16-bit PPA

Figure 5.4.8 RTL schematic of the proposed 32-bit PPA

When the existing and proposed 16-bit and 32-bit Kogge-Stone adders are synthesized
using Xilinx ISE, we obtain the technology schematic, which shows the number of LUTs
and the amount of hardware used, i.e. the FPGA implementation of the design.

Figure 5.4.9 Technology schematic of the existing 16-bit PPA

Figure 5.4.10 Technology schematic of the existing 32-bit Kogge-Stone adder

Figure 5.4.11 Technology schematic of the proposed 16-bit Kogge-Stone adder

Figure 5.4.12 Technology schematic of the proposed 32-bit Kogge-Stone adder

5.5 Synthesis Report

The device utilization summary includes the following:

 Logic Utilization
 Logic Distribution
 Total Gate Count for the Design

The device utilization summary gives the number of devices used out of those
available, also represented as a percentage. Hence, as a result of the synthesis
process, the device utilization for the chosen device and package is shown below.

Figure 5.5.1 Design summary of the existing 16-bit PPA

Figure 5.5.2 Design summary of the existing 32-bit Kogge-Stone adder

Figure 5.5.3 Design summary of the proposed 16-bit Kogge-Stone adder

Figure 5.5.4 Design summary of the proposed 32-bit Kogge-Stone adder

5.6 Summary

RTL coding is done using Verilog. To synthesize this design, the device named
"XC3S500E" was chosen, with the package "FG320" and the speed grade "-4". In the
synthesis process, the RTL model is converted to a gate-level netlist mapped to a
specific technology library. In the Spartan-3E family, many different devices are
available in the Xilinx ISE tool. From this process we obtain the RTL and technology
schematics, the design summary, and the simulation results of the designs.

Chapter-6
SIMULATION RESULTS

The corresponding simulation results of the existing (16- and 32-bit) and proposed
(16- and 32-bit) Kogge-Stone adders are shown below. The inputs a, b and cin are
supplied from a test bench; we can apply any number of input combinations and observe
the respective outputs to verify whether the design is working properly. A sketch of
such a test bench follows.
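
This is an illustrative test bench only (the unit-under-test name ksa16 refers to the
generic adder sketched in Section 5.4; the actual design's module names may differ):

module tb_ksa16;
    reg  [15:0] a, b;
    reg         cin;
    wire [15:0] sum;
    wire        cout;

    // Instantiate the unit under test
    ksa16 uut (.a(a), .b(b), .cin(cin), .sum(sum), .cout(cout));

    initial begin
        a = 16'h0003; b = 16'h0005; cin = 1'b0;
        #10 $display("a=%h b=%h cin=%b -> sum=%h cout=%b", a, b, cin, sum, cout);
        a = 16'hFFFF; b = 16'h0001; cin = 1'b0;   // exercises the carry out
        #10 $display("a=%h b=%h cin=%b -> sum=%h cout=%b", a, b, cin, sum, cout);
        $finish;
    end
endmodule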

Fig 6.1 Simulation output of the existing 16-bit Kogge-Stone adder

The above Fig 6.1 shows the simulation results of the existing 16-bit Kogge-Stone
adder, with three inputs (a and b of size 16 bits, and cin) and two outputs (sum and
cout), which generates the sum and carry-out values based on the given inputs.

Fig 6.2 Simulation output of the existing 32-bit Kogge-Stone adder

The above Fig 6.2 shows the simulation results of the existing 32-bit Kogge-Stone
adder, with three inputs (a and b of size 32 bits, and cin) and two outputs (sum of
size 32 bits and cout), which generates the sum and carry-out values based on the
given inputs.

Fig 6.3 Simulation output of the proposed 16-bit Kogge-Stone adder

The above Fig 6.3 shows the simulation results of the proposed 16-bit Kogge-Stone
adder, with three inputs (a and b of size 16 bits, and cin) and two outputs (sum of
size 16 bits and cout), which generates the sum and carry-out values based on the
given inputs.

Fig 6.4 Simulation output of the proposed 32-bit Kogge-Stone adder

The above Fig 6.4 shows the simulation results of the proposed 32-bit Kogge-Stone
adder, with three inputs (a and b of size 32 bits, and cin) and two outputs (sum of
size 32 bits and cout), which generates the sum and carry-out values based on the
given inputs. The simulation results were obtained by simulating the code in Xilinx
ISE 14.7.
Comparison table:

Design                              No. of slices   No. of LUTs   No. of IOBs   Delay
Existing 16-bit Kogge-Stone adder   41              72            50            15.171 ns
Existing 32-bit Kogge-Stone adder   84              148           98            21.284 ns
Proposed 16-bit Kogge-Stone adder   27              48            50            16.174 ns
Proposed 32-bit Kogge-Stone adder   75              132           98            22.395 ns
The above table shows the comparison between the existing and proposed 16-bit and
32-bit Kogge-Stone adders. We can observe a considerable difference in hardware
between the proposed and existing designs: the proposed design uses less hardware
(fewer slices and LUTs) than the existing one, which is how the hardware complexity
is reduced.

6.2 Summary

In this chapter we discussed the simulation results of the existing and proposed
methods, executed using the Xilinx ISE 14.7 Design Suite.

Chapter-7
CONCLUSION AND FUTURE SCOPE
7.1 Conclusion
This work presented an analysis of a promising architecture for constructing various
multi-bit PPA schemes, derived formulas for estimating the hardware complexity of
multi-bit PPAs, and produced schematic implementations of the standard 16-bit and
32-bit Kogge-Stone adders as well as of the 16-bit and 32-bit modified parallel prefix
adders. A comparative analysis of the parameters and simulation results of the
presented adders was then carried out. As a result, the research has shown that the
modified parallel prefix adder proposed in this work has an advantage in terms of
hardware complexity in comparison with the known Kogge-Stone adder structure.
Additionally, in terms of speed the proposed parallel prefix adder has an advantage
over group-prefix and carry-lookahead adders, as well as over the well-known Sklansky
and Brent-Kung parallel prefix adders.

7.2 Future Scope

Nowadays all devices need designs with compact, high-speed, portable components.
The KS (Kogge-Stone) adder can be used to design a fast multiplier, and the multiplier
is an important device for high-speed processors and digital image processing. Such
devices can be used in highly efficient convolution and deconvolution, FIR filters,
ALUs, etc. In future work, the parallel prefix approach should also be tested on other
adders, to optimize both area and timing.
