Alajmi Rashed Thesis 2019
Electrical Engineering
By
May 2019
The graduate project of Rashed Hamad Alajmi is approved:
Date
Dr. Xiyi, Hang
Date
Dr. El Naga Nagi , Chair
ACKNOWLEDGEMENTS
I would like to express my special thanks and gratitude to my professor, Dr. Nagi El Naga, for
all his advice on this project and throughout my master's degree.
Secondly I would like to thank my parents and my whole family for all their support and
Table Of Contents
SIGNATURE PAGE ……………………………………………………………………i
ACKNOWLEDGMENT ………………………………………………………………..ii
ABSTRACT........................................................................................................................ x
2.3 Conclusion..................................................................................................... 15
3.1.1 Simple Multiplication Method……………………………………………………….17
6.3 Description of hardware components ……………………………………………………………….60
6.6.1 Introduction…………………………………………………………………………………………………56
Chapter 7……………………………………………………………………………61
References…………………………………………………………………………..63
List of Figures:
3.3.2.2 Modified Operation Rules for recoding by pairs 33
3.4 Schematic for recoding by pairs multiplier 34
3.4 Decoding Operations Truth Table 35
4.1 Gate designs in a recoding by pairs decoder 36
4.3 34-bit Complementing Circuit 37
4.4 Accumulator Unit 38
4.5 Shift left register 39
6.6.6 Segment controller 60
ABSTRACT
By
The goal of this project is to design, model, simulate and ultimately create a high-speed
32-bit multiplication system that uses the recoding by pairs algorithm to speed
up the process. In basic mathematical terms, multiplication is the process by which a number is
scaled by another number. To elaborate on the high-speed 32-bit multiplication process, I have
discussed several methodologies and frameworks in detail. After thoroughly comparing different
algorithms, Booth's algorithm was found to be the most efficient in terms of speed and accuracy.
The logic circuits carry out a set of different actions which depend on
the input fed to the system. Their functions can be classified as shifting,
complementing and control. The specifics of these components are examined, designed
and described in this project. A carry look-ahead adder (CLA) is used in the project due to
its short propagation time. The CLA is used along with a carry save adder because the addition
process involves multiple n-bit numbers. Two different architectures, the one-level CLA adder and the
two-level CLA adder, are discussed, both of which can increase the speed of operation.
In this project, a two-level CLA adder is used because it is much faster than a one-level
CLA adder when dealing with 32-bit numbers.
Chapter 1: Project Overview
1.1 Introduction
and most importantly digital signal processing (DSP). In a general-purpose multiplier, the data
input is a continuous process, which makes the algorithm complex. The complexity of any
algorithm can be classified into cost and time. Therefore, a better algorithm has a low cost
as well as a fast execution time. Multiplication processes are time-consuming since they
involve many complicated computations which increase the execution time, so the
algorithm must be optimized to reduce the time delay. VLSI designers have found that assigning
a large area to the integer and floating-point multipliers helps in speeding up the multiplication
process.
Optimally, the rate of calculations needs to be as fast as one billion arithmetic operations
performed in each second in conjunction with real-time DSP. The average rate required is one
considering the vast computations necessary in real-time DSP. Advancements in Very Large Scale
Integration (VLSI) technology have resolved the heavy time-consumption issues in real-time DSP
computations by incorporating innovative and new methodologies and design architectures which
increase the efficiency of the operation significantly. The theoretical aspects of multiplication
algorithms which seemed far-fetched a few decades ago can now easily be implemented thanks to
the advancement in VLSI technology in both the production of the devices and the articulation of
relevant methodologies that have resolved any issues with fabrication and enhanced the efficiency
of complex design.
1.2 Objective
The primary purpose of this project is the modelling, design, testing and implementation of
a 32-bit high-speed multiplier using the recoding by pairs algorithm on a field programmable gate
array (FPGA), one that is both fast and accurate when solving a large
number of complicated computations. Normally, the speed of the design would be
compromised by its inherent complexity, but a carry look-ahead adder circuit can help reduce the
speed issues.
1.3 Project Outline
The introduction of the project is presented in chapter 1. It helps to give an overview to the
Chapter 2- High Performance Adders: This section briefly touches upon some of the fast
addition techniques that improve multiplier performance, such as carry save adders, and
discusses carry look-ahead adders in greater detail since they are of more importance for this project.
Chapter 3- Multiplication Algorithms: This section introduces, classifies and
discusses the various multiplication methodologies and theories that can be implemented. These
include recoding algorithms, direct multiplication and Booth's algorithm. The main focus of this
project is implementing Booth's algorithm on an FPGA to design the multiplier, therefore the Booth
Chapter 4- Logic Circuits Design: As discussed previously, many logic circuits will be
used in order to design, model, simulate and implement the multiplier which include decoders,
accumulators, right and left shift registers, control circuits and how each of them plays a crucial
Chapter 5- Designing the Multiplier: After laying down the groundwork necessary for the
articulation of our multiplier in chapters 2,3 and 4 in terms of the adders, multipliers and logic
circuits to be implemented, the high speed multiplier is studied and designed so that it is compatible
Chapter 2: High Performance Adders
Fast addition is an essential component in this digital era and especially in real-time digital
signal processing. The efficiency and speed of the adders end up playing a very important role in the
overall speed and accuracy of any arithmetic circuit. In this chapter, some of the fast adding
methods widely used are investigated along with the necessary details regarding Carry Look-
Ahead Adders (CLA) which will be implemented in the final design of the multiplier. It is deduced
that in order to enhance the effectiveness of the addition and the overall system, it would be more
CLA adders are much more complicated than the slow and basic ripple-carry adder
but provide a substantial gain in speed. Ripple-Carry (RC) adders can be compared with
the conventional paper-and-pencil method of addition, in which corresponding digits are added
to one another starting from the units position (the least significant position) until all
corresponding digits have been added and a final result has been obtained. In an RC adder, the
sum of the corresponding digits may exceed the range of a single digit, in which case an
extra carry bit has to be carried to the next more significant position. The main difference
between RC and CLA adders is that although both processes start in the same manner, i.e.
with propagation through a 4-bit segment, in the CLA adder the carry then moves about 4 times
faster, since it jumps from one carry unit to the adjacent one while, for each group that has accepted a
carry in, the carry propagates among the bits inside that segment.
2.1.1 CLA Theory
Based on the concept established of 1-bit full adders, let’s assume a full adder circuit as
shown in Figure 1 in which the operand bits Ai and Bi are being added along with the Carry in bit
As it can be seen, there are two internal signals being generated, namely Pi and Gi which
Subsequently, the sum and carry out functions can be defined as follows:
Where Si, Ci+1 and Ci are the sum, carry out and carry in functions respectively and Pi and
Gi are known as the carry propagate and the carry generate respectively. The carry generate is
known by that term since a carry out is generated whenever the signal is equal to 1,
irrespective of the carry in signal. The carry propagate is known by that term since it propagates
the carry from carry in to carry out whenever the carry propagate is equal to 1. There exist two
different architectures in CLA adders which are known as One-level and Two-level CLA
respectively. A thorough investigation needs to be made in order to decide which of the two will
give the most optimal results. In order to analyze and design these units, there are a few modules
Based on the CLA theory discussed in section 2.1.1, it was deduced how the sum and carry
out signals are determined in a CLA Adder. However, there exist some fan-in restrictions because
of which the adder is split into different 4-bit groups. These 4-bit groups are split across 3 levels
First Level: All the P and G signals are generated here, specifically four sets of
P and G logic (each set consists of an AND gate and an XOR gate).
Second Level: This is the logic block of the CLA that includes 4 different 2-level
implementation circuits. In the above figure, C1, C2, C3 and C4 are generated in this level
Third Level: This consists of the four XOR gates which generate the sum signals S0,
Building upon the carry look-ahead adder theory and the group structure discussed in the previous
sections, and observing the schematic and block diagram shown in Figures 3 and 4 respectively,
The Boolean expressions of the carry outputs at each stage could be determined and
simplified by simply substituting the previous carry output expressions as described below:
\[ C_3 = G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 C_0 \tag{9} \]
\[ C_4 = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 C_0 \tag{11} \]
Thus the equations relevant for this project are equations 5, 7, 9 and 11. The carry output
C4 is the final carry generated from the previous carry generates and propagates and is thus fed
The schematic and block diagram shown in Figure 5 and 6 respectively indicate some of
the differences that exist in CLAU 1 and CLAU 2. Based on these differences, it is possible to
derive modified expressions for the carry generate (G1*) and carry propagate (P1*) which are as
follows:
\[ G_1^* = G_3 + G_2 P_3 + G_1 P_3 P_2 + G_0 P_3 P_2 P_1 \tag{13} \]
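The carry expressions above can be checked with a short behavioural sketch in Python. It is only an illustration under stated assumptions, not the hardware description used in the project: the propagate signal is taken as the XOR and the generate signal as the AND of the operand bits, C1 and C2 are obtained by the same substitution that yields equations 9 and 11, the group generate is taken as the C4 expression without its C0 term, and the group propagate is assumed to be the product P3 P2 P1 P0. The function name `cla_4bit` is hypothetical.

```python
def cla_4bit(a, b, c0):
    """Behavioural model of one 4-bit carry look-ahead group.

    Returns (sum, c4, g_star, p_star). Assumes P_i = A_i XOR B_i and
    G_i = A_i AND B_i, matching the AND/XOR gate pair of the first level.
    """
    a_bits = [(a >> i) & 1 for i in range(4)]
    b_bits = [(b >> i) & 1 for i in range(4)]

    # First level: propagate and generate signals for every bit position.
    p = [x ^ y for x, y in zip(a_bits, b_bits)]
    g = [x & y for x, y in zip(a_bits, b_bits)]

    # Second level: each carry is a two-level AND-OR function of P, G and C0.
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)

    # Group signals (assumed standard forms): C4 = G* + P* C0.
    g_star = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    p_star = p[3] & p[2] & p[1] & p[0]
    c4 = g_star | (p_star & c0)

    # Third level: sum bits S_i = P_i XOR C_i.
    carries = [c0, c1, c2, c3]
    total = sum((p[i] ^ carries[i]) << i for i in range(4))
    return total, c4, g_star, p_star
```

For every combination of 4-bit operands and input carry, the returned sum and C4 agree with ordinary integer addition, which is a quick way to confirm that the expanded carry equations are consistent with one another.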
A one-level carry look-ahead adder is used in this case, which consists of a
four-bit Carry Look-Ahead Unit 1 (CLAU1) and Carry Look-Ahead Adder Groups (CLAAG)
combined in order to enhance the speed of the multiplier. The 4-bit carry generates and propagates are
generated by the CLAAG, which uses the 4-bit carry input from the CLAU1. The CLAU1 in turn
generates the 4-bit carry as its output after taking the carry generates and
propagates from the CLAAG as its inputs. This creates a total of 8 blocks covering a combined 32 bits,
all interconnected, which gives rise to the concept of a 32-bit one-level CLA unit.
2.1.5 32-bit One-Level Carry-Look Ahead Adder
In this technique, as discussed briefly in the previous section, the adder is
divided into groups and the carry look-ahead method is applied at the group level.
The output carry is allowed to propagate along the various groups that exist in this
adder in order to reduce the time delay, which is a significant issue in conventional
adders. The time delay in CLA adders is expressed in terms of $\tau_g$, which enters the total addition
time as follows:
\[ T_{total}\,(\text{for } n\text{-bit adder}) = \tau_g + 2\tau_g \times \frac{n}{4} + \tau_g \tag{14} \]
*note: The first $\tau_g$ is the delay due to the carry propagate and carry
generate in the CLAAGs, the last $\tau_g$ is the delay of the output sum generation in the final
group, and the middle term is the signal propagation delay through the CLAU1.
Figure 7 represents a 32-bit one level CLA adder where A and B are the inputs over 32 bits and S
is the corresponding output. C denotes the flow of carry propagation from one stage to the other.
The final output carry is C32 while the first input carry is C0.
2.1.6 Two Level CLA Adder unit
A two-level CLA adder unit is much faster than a one-level CLA adder unit, which
is why it is usually preferred. There are some key differences between the two-level CLA adder
and the one-level CLA adder, as can be inferred from Figure 8. It forms one one-level CLA unit
at the group level while it forms two two-level adders at the piece level. The carry output is generated
from each of the pieces and is rippled through the later pieces in the adder. When comparing
Figures 7 and 8, the terms A, B and S serve the same purpose in both figures: A and B are the
inputs over 32 bits and S is the corresponding output. The main difference is in the bottom CLAU1
sections, where Cin is the input carry to the CLAU1, C32 is the output carry and C16 serves both
purposes as it propagates from one CLAU1 stage to the other. Another difference is that this system
also includes P1* and G1*, which are the modified carry propagate and generate connections.
The time delay is also measured differently in the two-level CLA adder and is as follows:
\[ T_{total} = \tau_g + 2\tau_g + 2\tau_g \times S + 2\tau_g + \tau_g \tag{16} \]
\[ T_{total} = 10\,\tau_g \]
*note: The first $\tau_g$ is the delay due to the carry propagate and carry
generate in the CLAAGs, the second ($2\tau_g$) is the delay due to the carry propagate and carry
generate in the CLAU2, the third ($2\tau_g \times S$) is the delay which occurs when signals propagate
through the CLAU2 sections, the fourth ($2\tau_g$) is the delay which occurs during signal propagation
through the last section's CLAU2 and the last $\tau_g$ is due to the output sum produced in the
last segment.
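The two delay expressions (equations 14 and 16) can be evaluated side by side with a small Python sketch. The function names are hypothetical, delays are counted in units of the gate delay, and the value S = 2 for the 32-bit case is an assumption inferred from the stated total of 10 gate delays rather than a figure given explicitly in the text.

```python
def one_level_cla_delay(n_bits, group_size=4):
    """Equation 14 in units of tau_g: tau_g + 2*tau_g*(n/4) + tau_g."""
    return 1 + 2 * (n_bits // group_size) + 1


def two_level_cla_delay(sections):
    """Equation 16 in units of tau_g: tau_g + 2*tau_g + 2*tau_g*S + 2*tau_g + tau_g."""
    return 1 + 2 + 2 * sections + 2 + 1


print(one_level_cla_delay(32))  # 18
print(two_level_cla_delay(2))   # 10 (S = 2 is an assumed value)
```

With these inputs the one-level adder needs 18 gate delays and the two-level adder 10, which matches the figures used for the comparison in this chapter.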
As we can see, the time delay is much smaller than that of the one-level CLA unit, which
indicates better performance, at least in terms of speed. However, when n-bit numbers have to be
added and n is large, the CLAU2 might become inefficient, which is where carry save adders can
The main drawback of conventional adders is that they are devised to
add only two numbers at a time. For the current project of a 32-bit multiplier, the
multiplicand multiples must be added quickly and without any restrictions on the
number of operands being added, owing to the size of the operands and the large number
of partial additions required. Carry Save Adders (CSA) can overcome the shortcomings of
conventional adders and even CLAU2 adders, since a CSA adds three n-bit
numbers simultaneously and produces a sum vector and a carry vector, which are then
used as inputs by the following CSA units. A CSA unit has the same number of full adders
as the bit size, i.e. a 32-bit CSA has 32 full adders. The general mechanism behind a CSA
is that it takes three n-bit binary numbers, denoted x, y and z, as inputs
and generates two n-bit output vectors: the sum (S) for the partial sum and the
carry (C) for the partial carry to be used later on. The order in which the inputs are added is irrelevant to the
computations of a CSA. A CSA is highly similar to a 1-bit full adder, as can be seen in Figure 9,
with the difference being that the input carry is now denoted as 'z', the sum output is
denoted as 's' and the output carry is denoted as 'c'. Figure 10 depicts how the full adders in a
CSA unit can add three distinct n-bit numbers (x, y and z) and convert them to two output vectors
(c and s). In a CSA unit, the carry vector is always shifted to the left by one bit.
Figure 9: The Similarities between a 1-bit Full Adder and a Carry Save Adder
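The behaviour of a carry save adder can be illustrated with a few lines of Python. This is a sketch on non-negative Python integers rather than fixed-width hardware vectors, and the names `carry_save_add` and `add_many` are hypothetical. Each bit position acts as an independent full adder, and the carry vector is shifted left by one bit before it is used again.

```python
def carry_save_add(x, y, z):
    """Reduce three operands to a sum vector and a carry vector."""
    s = x ^ y ^ z                            # per-bit sum of the full adders
    c = ((x & y) | (x & z) | (y & z)) << 1   # per-bit carry, moved left by one bit
    return s, c


def add_many(numbers):
    """Add m numbers using m - 2 carry save steps and one final ordinary addition."""
    operands = list(numbers)
    while len(operands) > 2:
        s, c = carry_save_add(*operands[:3])
        operands = [s, c] + operands[3:]
    # The final two vectors go through a carry-propagating adder (a CLA in this project).
    return sum(operands)
```

For example, `carry_save_add(5, 3, 6)` returns the pair `(0, 14)`, whose ordinary sum is 14 = 5 + 3 + 6. No carry ripples inside the carry save step itself, which is why its delay does not depend on the word length.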
The final sum, however, is calculated using a carry look-ahead adder (CLA). Figure 11
shows how many steps it takes to add m different n-bit numbers. Before the numbers are added
in the final CLA, they need to go through m-2 blocks, with each block representing multiple one-bit
CSA units arranged in parallel. Each time a block is passed, the numbers
grow by 1 bit in size. Consequently, if we assume the time delay at each gate to be
$\tau_g$, then each CSA level contributes a time delay of $2\tau_g$, which is the same as the time delay
incurred in a full adder stage. According to Figure 11 shown below, which is a three-level tree,
2.3 Conclusion
In this section, the carry look-ahead technique, which will be used
for the multiplication techniques discussed later, was described in detail. The CLAAG units and their
importance in addition were also discussed. Then two addition implementations commonly applied
to multipliers were discussed, namely the one-level CLA unit and the two-level CLA unit. After
comparing their speeds in terms of time delay, it was deduced that the two-level CLA unit is much
faster, having a time delay of only $10\tau_g$ while the one-level CLA unit has a delay of $18\tau_g$. It was
then investigated how multiple n-bit numbers could be added, and the most feasible technique was the
carry save adder, in which a three-level CSA gives a time delay of only $6\tau_g$, thus the total time
Chapter 3: Multiplication Algorithms
In this chapter, several multiplication algorithms and techniques will be discussed. The
basic method of adding a number of partial products is common to all techniques; however,
they differ in their complexity and speed, both of which are important factors in determining which
is best suited for the task at hand. The primary objective is to achieve a perfect balance in speed
The optimal speed can be determined by the reduction in time delay required, and the
simplicity can be determined by the maximum reduction in gate complexity. These in turn depend
on the cost incurred and the overall performance of the system. Therefore it is important to
determine which broad category of multiplication techniques the required method falls under.
There exist three main categories, namely serial, parallel and serial-parallel multipliers. Serial
multipliers include add-shift techniques and recoding techniques, while parallel techniques include
ROM networks, reduction methods and iterative cellular arrays. The main focus of this project is
serial multipliers, and in the following section, several techniques that have been developed and
Serial multiplication, and the architectures based on it,
use the add-shift method for their operations. The number of bits n determines the
multiplication time and complexity, as they are proportional to the square of n ($n^2$). Thus the
multiplication time and complexity grow quadratically for greater values of n. The
method of serial multiplication entails that the multiplier bits are inspected sequentially, starting with
the least significant bit. If the value of the bit is '1', the multiplicand is added to the most significant
half of the double-length accumulator, which is initially zero. The accumulator shifts one bit to the right
after every sequential inspection and once all the bits have been inspected, the product is held in the
accumulator. The trade-off in this technique is that although it uses fewer resources and is simple, the
computations could become significantly slower should the size of the multiplicand/multiplier increase.
Improvements to this basic mechanism have been devised and are discussed in the
following sections.
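The add-shift procedure described above can be sketched in Python as follows. This is a minimal model for unsigned operands. The name `shift_add_multiply` is hypothetical, and holding the multiplier in the low half of the accumulator is an assumption made to keep the sketch compact.

```python
def shift_add_multiply(multiplicand, multiplier, n=32):
    """Unsigned add-shift multiplication with a double-length accumulator."""
    mask = (1 << n) - 1
    assert 0 <= multiplicand <= mask and 0 <= multiplier <= mask

    # Low half holds the multiplier, high half starts at zero.
    acc = multiplier
    for _ in range(n):
        if acc & 1:                       # inspect the current least significant bit
            acc += multiplicand << n      # add into the most significant half
        acc >>= 1                         # shift the whole accumulator right by one bit
    return acc
```

With `n=4`, `shift_add_multiply(3, 5, n=4)` performs two additions (one per 1 bit in 0101) and four shifts and returns 15. The loop always runs n times, so the running time is fixed by the word length while the number of additions depends on the multiplier bits.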
In the simple multiplication or direct multiplication method, the multiplicand is added once for each
multiplier bit with the value of 1. The number of digits that have the value 1 corresponds to the
exact number of addition operations required, i.e. a multiplier with three 1 digits needs 3 addition operations.
The entire premise of this simple multiplication method is based on the detection of 0's and 1's. For the
former the multiplier takes no action and for the latter it performs an addition operation. The
X=101101010110
In the above example, there are a total of 7 1's detected which means that the multiplier
3.1.2 Booths Algorithm
Booth's algorithm is one of the most widely used algorithms for the multiplication
of two signed binary numbers in two's complement notation. This algorithm is of great
importance in computer architecture. The importance of Booth's algorithm lies in the fact that it
preserves the sign of the result. The entire theory is based on the notion that strings of identical binary
digits in a multiplier only need to be shifted over and not necessarily added. There are four main steps
to be followed in Booth's algorithm, and the foundation of this method is the notion
that an extra 0 is appended next to the least significant bit of the multiplier (b). The multiplier
bits bi and bi-1 are checked sequentially starting from the least significant bit, and the multiplicand
(a) is then added to or subtracted from the partial product (m) based on their values. A single bit of
the multiplier is shifted out to the right at the end of each step until all of its bits have been processed. The actions
bi   bi-1   Operation
0    0      Do Nothing
0    1      Add a
1    0      Subtract a
1    1      Do Nothing
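The decision table above can be turned into a short Python sketch. It is an arithmetic model only: instead of shifting registers, each addition or subtraction is applied with the weight of the bit position being examined. The name `booth_multiply` and the default width of 4 bits are assumptions chosen for illustration.

```python
def booth_multiply(a, b, n=4):
    """Multiply a by the n-bit two's complement multiplier b using Booth's rules."""
    b_bits = b & ((1 << n) - 1)   # n-bit two's complement pattern of the multiplier
    product = 0
    prev = 0                      # the extra 0 appended to the right of the LSB
    for i in range(n):
        cur = (b_bits >> i) & 1
        if (cur, prev) == (0, 1):
            product += a << i     # end of a run of 1s: add the multiplicand
        elif (cur, prev) == (1, 0):
            product -= a << i     # start of a run of 1s: subtract the multiplicand
        prev = cur                # pairs 00 and 11 require no operation
    return product
```

For the multiplier -3 (1101) and multiplicand 2, the sketch subtracts 2, adds 4 and subtracts 8, giving -6, whose 8-bit two's complement pattern is 1111 1010.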
A more concise way of representing this table would be by the expression (bi-1-bi) which if
depicts the flow chart and the method by which Booth's algorithm is carried out generally following
Example:
Q. Multiply 2 (0010) by -3 (1101) using Booth's algorithm with 4-bit numbers. The
answer should be 1111 1010 1. The steps for this computation are shown below in Figure 14:
*note: The colored bits in each iteration are the bits used to determine what the next step
A few examples which compare the number of operations when the multiplication is
carried out on the same number using Booth's algorithm and direct multiplication are shown below:
1. X = 1 1 1 0 1 1 1 1 0 1 1 0
1 1 1 0 1 1 1 1 0 1 1 0
- + - + - +
From this we can see that it will only take 6 operations using Booth's algorithm.
2. X = 1 0 1 0 0 1 0 1 0 1 0 1
When using direct multiplication, the number of operations required would be 6
1 0 1 0 0 1 0 1 0 1 0 1
- + - + - + - + - + -
From this we can see that it will take 11 operations using Booth's algorithm.
This shows that the pattern of the binary digits determines the complexity of the operations
and the speed too. Booth's algorithm is only beneficial when the multiplier contains long runs of identical
bits, and it needs more operations than direct multiplication when 0's and 1's alternate,
which is a shortcoming of this technique; however improvements to this have been
The Modified Booth's algorithm is much faster than the normal Booth's
algorithm, almost twice as fast. This algorithm groups consecutive bits of one of the
two operands to form signed multiples, which decreases the total number of partial products and
ultimately increases the efficiency of the operation. The most popular modified Booth's
algorithm is known as Radix-4, which uses the following procedure to scan
strings of 3 bits:
1. Firstly, in order to ensure that n is even, the sign bit should be extended by 1 position
if need be.
2. Secondly, a '0' is appended to the right of the multiplier's least significant bit.
3. Thirdly, the value of each group determines what the partial product will be. The
partial product can only take the values 0, +X, -X, +2X, -2X where X is the
multiplicand. The bits are grouped in groups of three such that each group overlaps by
one bit with the previous group. This process starts from the least significant bit and
only 2 bits of the multiplier are used for the first group of 3. Figure 15 below shows the
3.2 Recoding Multipliers
In direct multiplication, it was discussed that the multiplicand must be added for each digit
that has the value of one; therefore n operations are required for multipliers that have
n digits with the value of one. The primary objective of using recoding multiplication algorithms
is to simplify the system by reducing the number of operations required, regardless
of how many digits with the value of 1 exist. The uniform shift of one and the uniform
shift of two are the two methods by which recoding multipliers carry out their operations.
The premise of recoding multipliers is similar to that discussed for Booth's algorithm, i.e. both
addition and subtraction operations are carried out, based on the following two fundamental rules:
1. The addition of a multiplicand that has been moved i positions = subtraction of that
multiplicand plus an addition of a multiplicand that has been moved i+1 positions, so they can be
interchanged.
2. The addition of a multiplicand that has been moved i+1 positions = two additions of a
For a multiplier X, the recoding of yi, as discussed before, depends on the values of
yi and yi+1. The final recoding result also depends on the result of recoding the yi-1
multiplier bit. The operation fi that is performed because of the recoding principles mentioned
Operation required   yi   yi+1   fi
None                 0    0      0
None                 0    1      0
Addition             1    0      1
Subtraction          1    1      -1
*note: The above situations only occur when the recoding results of yi-1 have no bearing
From the above table, in the last scenario when both digits have the value of 1, instead of
carrying out the addition with the multiplicand that has moved i positions, a subtraction of that
multiplicand is carried out plus an addition of the multiplicand that has been moved i+1 positions. In the
event that yi+2 has the value of zero, then instead of carrying out two addition
operations at position i+1, all that needs to be done is a single addition
at position i+2; however, as can be seen, the number of operations remains the same for this
scenario. This is due to the fact that the two addition operations for yi and yi+1 are replaced
by an equal number of operations, namely an addition at position i+2 and a subtraction
at position i. The number of operations will actually reduce if yi+2 also
has the value of 1. In this case, instead of having two addition operations at position i+1, a single
addition at position i+2 is sufficient, and instead of the two additions that are then needed
at position i+2, a single addition at position i+3 is sufficient if yi+3 has
the value of 1 too. This implies that as long as this process continues, the total number of steps
The carry propagation theory captures the impact that recoding a bit has on the
bits that follow. The pseudo carry (ci) can be defined as the operation that, as a result of
the recoding applied to yi-1, is pushed forward to the yi position. It takes the value of 1 in the event
that an addition operation is pushed forward and 0 if nothing happens. Figure 18 gives a detailed
yi yi+1 ci ci+1 fi
0 0 0 0 0
0 0 1 0 1
1 0 0 0 1
1 0 1 1 0
0 1 0 0 0
0 1 1 1 -1
1 1 0 1 -1
1 1 1 1 0
Figure 18: Operation rules when both yi and the carry variable are considered
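The rules of Figure 18 can be exercised with a short Python sketch that recodes an unsigned multiplier into signed digits. The names `RECODE_RULES` and `recode` are hypothetical, the initial pseudo carry is assumed to be zero, and one extra digit is produced beyond the n multiplier bits so that a carry pushed out of the top bit is not lost.

```python
# (y_i, y_i+1, c_i) -> (c_i+1, f_i), copied row by row from Figure 18.
RECODE_RULES = {
    (0, 0, 0): (0, 0),
    (0, 0, 1): (0, 1),
    (1, 0, 0): (0, 1),
    (1, 0, 1): (1, 0),
    (0, 1, 0): (0, 0),
    (0, 1, 1): (1, -1),
    (1, 1, 0): (1, -1),
    (1, 1, 1): (1, 0),
}


def recode(y, n):
    """Recode an unsigned n-bit multiplier into n + 1 digits f_i in {-1, 0, +1}."""
    digits = []
    carry = 0                               # assumed initial pseudo carry c_0 = 0
    for i in range(n + 1):
        y_i = (y >> i) & 1
        y_next = (y >> (i + 1)) & 1         # bits above the word are read as 0
        carry, f = RECODE_RULES[(y_i, y_next, carry)]
        digits.append(f)                    # digits[i] has weight 2**i
    return digits
```

For instance, `recode(7, 4)` yields `[-1, 0, 0, 1, 0]`, i.e. 7 = 8 - 1, so the run of three 1s costs two operations instead of three. Summing `f * 2**i` over the digits always reproduces the original multiplier.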
For uniform shift of one operations, it is important to introduce two additional binary
variables, namely fi1, which represents either an addition or a subtraction operation, and fi0, which indicates
whether or not an operation is taking place. The value assignment for these variables can be
1 Operation is required
0 No operations required
Incorporating these more elaborately defined operation variables into Figure 18, the
yi yi+1 ci ci+1 fi0 fi1
0 0 0 0 0 N/A
0 0 1 0 1 0
1 0 0 0 1 0
1 0 1 1 0 N/A
0 1 0 0 0 N/A
0 1 1 1 1 1
1 1 0 1 1 1
1 1 1 1 0 N/A
A few equations and relationships can be derived on the basis of the above information. It
can be seen that fi1 attains the same values as yi+1 given that fi0 has the value of 1. The others are as
follows:
In conjunction with the above equations, Figure 22 shows the design of a uniform shift by
one recoding bit multiplier, the sequence in which it passes the logic gates, logic circuits and the
3.3.2 Uniform Shift of Two method
As discussed previously, the uniform shift of one method improves the speed of the
multiplier, as it reduces the total number of operations in
most situations to a great extent. However, there is more room for speed improvement if multiple
multiplier bits are scanned in each cycle. Scanning two multiplier bits simultaneously could reduce
the number of operations by half, and scanning three multiplier bits could reduce it to a third. This is
known as uniform shift of multiples. For this project, only uniform shift of two will be discussed, which includes non-
In this method, the multiples selected by each pair of multiplier bits are added
to the partial product. A better way to visualize this is through an example where an
even word length of n = 2M is assumed, where M is the number of bit pairs examined. For this example, M
is assumed to be 2, which means that there are a total of 4 bits. The least significant bits,
y1 and y0, are scanned first and there are four possible routes that could be taken
depending on their values. The sums that result from these operations contain more than n bits that
have been shifted multiple times, in either direction. The following rules determine those
operations:
2. If y1 attains the value of 0 and y0 attains the value of 1, then the multiplicand X is added
3. If y1 attains the value of 1 and y0 attains the value of 0, then a multiple of the
multiplicand X (in this case 2X) is added to the following partial product. 2X implies
4. If both y1 and y0 attain the value of 1, both X and 2X are added to the following partial
product.
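The pairwise rules can be sketched in Python as follows. This is an arithmetic model for unsigned operands under the assumption that a y0 of 1 selects X and a y1 of 1 selects 2X, with both selected when the pair is 11. The function name `multiply_two_bits_per_cycle` is hypothetical, and weighting each pair by a left shift is used here in place of shifting the accumulator to the right, which gives the same product.

```python
def multiply_two_bits_per_cycle(x, y, n=4):
    """Unsigned multiplication scanning two multiplier bits per cycle (non-overlapped)."""
    assert n % 2 == 0 and x >= 0 and 0 <= y < (1 << n)
    acc = 0
    for k in range(n // 2):
        y0 = (y >> (2 * k)) & 1        # low bit of the current pair
        y1 = (y >> (2 * k + 1)) & 1    # high bit of the current pair
        multiple = 0
        if y0:
            multiple += x              # pair 01 or 11: add X
        if y1:
            multiple += 2 * x          # pair 10 or 11: add 2X
        acc += multiple << (2 * k)     # the pair at position k has weight 4**k
    return acc
```

With `n=4`, the multiplier 1011 is handled in two cycles: the pair 11 contributes 3X and the pair 10 contributes 2X with weight 4, so `multiply_two_bits_per_cycle(13, 11)` gives 39 + 104 = 143.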
binary digit to the left by i positions with zero digit coming in from the right. The following
equations capture this relationship where y1 and y0 are represented by j and T ranges from 0 to k.
\[ j \times B = 2^1 \times B = 2B \tag{21} \]
Following this addition operation, the accumulator and the multiplier are both moved
together, as if they were one unit, by 2 positions towards the right as per the following
equation:
A carry save adder and a carry propagate adder, which were discussed in Chapter 2, are
implemented in a non-overlapped scanning multiplier as per Figure 23. The flow of operations
shown is that the carry save adder first receives two inputs, which are reserved for the X
and 2X multiples. These multiples are generated by passing them through AND gates
with the y0 and y1 multiplier bits. The third input received by the carry save adder is
the accumulator's existing partial product. The carry propagate adder receives as its inputs
the two carry save adder outputs and thus generates the final product. The newly generated
partial product is moved two positions to the right after every iteration.
3.3.2.2 Overlapped Scanning
Through this technique it is possible to drastically reduce the total number of multiplicand
multiples, which in turn decreases the total number of operations. The main idea behind this
algorithm is recoding by pairs, or overlapped scanning. The basic process starts when the
multiplier is split into paired-bit groups, and only one of these groups is scanned at a time.
As with other multipliers, either no operation happens or an addition or subtraction takes place.
The multiplicand in the addition or subtraction operations appears in multiples of 2 (2 times X, 4
times X etc.). The multiple is obtained by moving the multiplicand from its position of entry in the
adder to the left by either 1 position or 2 positions from the reference bit, which is the low order bit
in the sequence. After this, the partial product that is obtained is moved by 2 positions to the right,
Bit pair    Exact multiple    Operation performed
0 0         No operation      No operation
0 1         +X                +2X
1 0         +2X               +2X
1 1         +3X               +4X
The above figure encapsulates the rules for recoding by pairs. It implies that if the
first bit attains the value of 1, then an error of X is incurred in the partial product. This error
is corrected when the preceding pair is processed and 4 times X is subtracted from the partial
product, where X is the multiplicand. The final set of modified rules is shown in the figure below:
yi+2   yi+1   yi    Operation
0      0      0     No operation
0      0      1     +2X
0      1      0     +2X
0      1      1     +4X
1      0      0     -4X
1      0      1     -2X
1      1      0     -2X
1      1      1     No operation
The least significant bit of the multiplier is assumed to have a value of zero during the
computation, in which case the initial partial product is zero. If the least significant bit has
the value of 1, the initial partial product is equal to the multiplicand. The speed with which this
algorithm carries out operations, especially for larger bit widths, is precisely why it is
preferred for this project.
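As a cross-check of the modified rules, the sketch below models overlapped scanning in Python. It
is a behavioural illustration under stated assumptions, not the project's implementation: the
multiplier y is taken to be a signed value of n + 1 bits with n even, the triples are read at bit
positions (i+2, i+1, i) for i = 0, 2, 4 and so on, and the name recode_by_pairs_multiply is
hypothetical. The least significant bit is handled as the text describes: it is treated as zero
during scanning and sets the initial partial product to X when it is 1.

```python
# Operation table from the modified rules, expressed as a multiple of X.
RULES = {0b000: 0, 0b001: 2, 0b010: 2, 0b011: 4,
         0b100: -4, 0b101: -2, 0b110: -2, 0b111: 0}


def recode_by_pairs_multiply(x, y, n=8):
    """Multiply x by a signed (n+1)-bit multiplier y using overlapped scanning."""
    assert n % 2 == 0 and -(1 << n) <= y < (1 << n)
    bits = y & ((1 << (n + 1)) - 1)       # two's complement pattern of y
    partial = x if bits & 1 else 0        # initial partial product from the LSB
    bits &= ~1                            # the LSB is treated as zero while scanning
    for k in range(n // 2):
        triple = (bits >> (2 * k)) & 0b111
        # Step k is worth 4**k because the hardware shifts right by 2 per step.
        partial += RULES[triple] * x * (4 ** k)
    return partial


print(recode_by_pairs_multiply(234, 108))
```

Each step now performs at most one addition or subtraction, whereas the non-overlapped scheme
needs two additions whenever a pair equals 11.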
3.4 Implementing the Overlapped Scanning multiplier algorithm
The main components of this system are the shifting and complementing circuits, the adder,
the accumulator, the multiplier register and the decoder, as shown in Figure 26. Four control
signals (S1 – S4) are generated when the least significant three bits of the multiplier are
decoded.
The first of these control signals, S1, is the operation control signal. It has the task of
enabling or disabling the results from the shifting and complementing circuits before they are
input to the adder. The S2 signal is the addition/subtraction control signal which, as the name
suggests, indicates which of the two operations is to be performed. S3 is the control signal that
shifts the bits one position to the left (one-bit shift), while S4 is the control signal that
shifts the bits two positions to the left (two-bit shift). The decoding operations take place as
per Figure 27.
yi+2   yi+1   yi    Operation   S1   S2   S3   S4
0      0      1     +2X         1    0    1    0
0      1      0     +2X         1    0    1    0
0      1      1     +4X         1    0    0    1
1      0      0     -4X         1    1    0    1
1      0      1     -2X         1    1    1    0
1      1      0     -2X         1    1    1    0
From this we can infer that S1 attains the value of 1 whenever the three bits are not all
equal; if all three bits are 0 or all three are 1, S1 is 0. S2 always takes the same value as yi+2,
provided that the S1 signal has the value of 1. S3 attains the value of 1 if exactly one of the two
bits yi+1 and yi is 1; if both bits are 0 or both are 1, S3 is 0. S4 takes the opposite value of
S3.
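The decoding rules can be written out as Boolean expressions. The Python sketch below is a
minimal illustration with a hypothetical function name, decode. It assumes that the values of S2,
S3 and S4 do not matter when S1 is 0 (the no-operation cases 000 and 111), since the truth table
does not list them.

```python
def decode(y2, y1, y0):
    """Return (S1, S2, S3, S4) for the triple (yi+2, yi+1, yi)."""
    s1 = 0 if (y2 == y1 == y0) else 1   # 0 only for 000 and 111 (no operation)
    s2 = y2                             # 1 selects subtraction
    s3 = y1 ^ y0                        # one-bit shift, giving 2X
    s4 = 1 - s3                         # two-bit shift, giving 4X
    return s1, s2, s3, s4


for triple in [(0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1), (1, 1, 0)]:
    print(triple, decode(*triple))
```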
Chapter 4: Logic Circuits and Modules
Several modules and circuits have been used in order to carry out this project. These include
decoders, control gates, complementing circuits, an accumulator, and shift left and shift right
registers.
4.1 Decoders
A decoder is a circuit that converts a coded input into a set of output signals, which is
essentially the reverse of encoding. It is built from logic gates such as AND, OR and XOR that take
the inputs and generate a certain number of control signals. In this project, the inputs are yi+2,
yi+1 and yi, which generate 4 signals (S1 – S4). A schematic of the decoder used for this project
with its components, inputs and outputs is shown in Figure 28.
4.2 Control Gates
Control gates are memoryless circuits which generate an output based solely on the
combination of their inputs, each of which can be 0 or 1 at a given time. They have no feedback,
and any change to the signals being fed to them alters the output signals after only the gate
propagation delay. The control gate used in this project consists of a chain of AND gates, which
work on the following principle: the output of an AND gate is 1 only if both of its inputs are 1;
otherwise it is 0. In the circuit implemented for this project, there is a 34-bit control gate
circuit which receives its inputs from the decoder and from the output of the complementing circuit.
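The gating principle can be illustrated with a short Python sketch. It assumes, in line with
section 3.4, that the decoder input to the 34 AND gates is the enable signal S1; the name
control_gate is hypothetical.

```python
WIDTH = 34


def control_gate(word, s1, width=WIDTH):
    """AND every bit of `word` with the enable signal s1."""
    mask = (1 << width) - 1
    enable = mask if s1 else 0    # s1 replicated onto all AND gates
    return word & enable


print(format(control_gate(0b1011, 1), "034b"))
print(format(control_gate(0b1011, 0), "034b"))
```

With S1 = 0 the adder receives an all-zero operand, which is how the "no operation" cases are
realised without a separate bypass path.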
4.3 Complementing Circuit
The complementing circuit used in the project is a chain of XOR (exclusive OR) gates that
operate on the following principle: each gate receives two inputs and produces one output, which is
their exclusive disjunction. The output signal is 1 only if exactly one of the input signals has a
value of 1; if both are 0 or both are 1, the output signal is 0. In this project, a 34-bit
complementing circuit is used, as shown in Figure 29, which receives the input signal S2 from the
decoder.
4.4 Accumulator
An accumulator is an intermediate storage unit for the logic and arithmetic results produced by a
computer's CPU. If accumulators did not exist, each computation result would have to be copied to
main memory and read back again. This is much slower than reading the result from an accumulator,
because every access to main memory incurs controller overhead for reading and writing.
The primary purpose of the accumulator register in this project is to accumulate the
intermediate results. Its contents are initially zero and grow as values enter it from the CLA
unit. Once all the numbers have gone through the necessary operations, the result is held in the
accumulator and the multiplier register. In Figure 30 below, the extension bits E and load_acc, the
binary control input of the accumulator, are fed into the accumulator, whose contents ultimately
exit to the 32-bit multiplier register Q. If the load_acc signal is 1, the incoming data is loaded
into the accumulator; otherwise the accumulator keeps its contents.
Figure 30: Accumulator unit (extension bits E, Load_acc input, output to the Q register)
4.5 Shift Left Register
A shift left register is a circuit that moves the data to the left by either one or two
positions, so that the output is a power-of-two multiple of the multiplicand. This register is
enabled by the S3 (one-bit shift left) and S4 (two-bit shift left) control signals. Figure 31 below
shows the shift left register circuit used in the 32-bit multiplier. The 32-bit input is denoted by
A32. As discussed previously, the control signal S3 comes from the decoder: if it attains the value
of 1, the shifting circuit moves the input to the left by one bit position, and if it has the value
of 0 no one-bit shift occurs. Similarly, if the control signal S4 attains the value of 1, the
shifting circuit moves the input to the left by two bit positions, and if it has the value of 0 no
two-bit shift occurs. In both cases the output is a total of 34 bits, which then serve as the input
for the complementing circuit discussed in section 4.3.
The least significant bit of the output in the figure above is set to 0, while the other bits
depend on the combination of the inputs received and the control signals.
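The shifter, the complementing circuit and the control gates can be chained into one behavioural
model of the operand path. The Python sketch below is an illustration under stated assumptions: the
function name recoded_operand is hypothetical, the multiplicand is taken as an unsigned 32-bit
value, and the XOR chain is assumed to produce the one's complement, so a separate carry-in of 1
(returned as the second value) is needed to complete a subtraction.

```python
WIDTH = 34
MASK = (1 << WIDTH) - 1


def recoded_operand(a32, s1, s2, s3, s4):
    """Return (34-bit operand, carry_in) for a 32-bit multiplicand a32."""
    assert 0 <= a32 < (1 << 32)
    # Shift left register: one position for S3, two positions for S4.
    shifted = (a32 << 1) if s3 else (a32 << 2) if s4 else a32
    # Complementing circuit: every bit is XORed with S2.
    complemented = shifted ^ (MASK if s2 else 0)
    # Control gates: every bit is ANDed with S1.
    operand = complemented & (MASK if s1 else 0)
    carry_in = 1 if (s1 and s2) else 0
    return operand & MASK, carry_in


operand, carry = recoded_operand(234, 1, 1, 0, 1)   # -4X for X = 234
print(format(operand, "034b"), carry)
```

Adding the operand and the carry-in modulo 2^34 gives the signed multiple selected by the decoder,
for example -4X for the signal combination S1 = 1, S2 = 1, S3 = 0, S4 = 1.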
A shift register is required in order to carry out two fundamental tasks: storing the data and
moving it subsequently. It consists of a group of flip-flops that each store a single binary bit
and shift that data from one flip-flop to the next, either within the register or out of it. A
shift right register moves the bits towards the right, one or multiple bits at a time. The
multiplier register and the accumulator behave as shift right registers in this project.
Chapter 5: Designing the Multiplier
A hierarchical modeling methodology has been applied in this project in order to design the
top module of the multiplier. A carry propagate adder, three levels of carry save adders and 4
recoding logic modules are combined to form the high speed recoding multiplier, and an accumulator
and a multiplier register complete the design. This multiplier operates much faster than a design
based on ripple-carry addition due to the extensive CLA circuitry deployed, along with the
attention paid to propagation delays.
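The adder arrangement described above can be sketched in Python to show why carry save adders
defer carry propagation to a single final step. This is a simplified model under stated
assumptions: the operands are plain non-negative integers, carry_save_add and add_operands are
hypothetical names, and Python's + operator stands in for the carry propagate adder.

```python
def carry_save_add(a, b, c):
    """3:2 compressor: reduce three operands to a sum word and a carry word."""
    sum_word = a ^ b ^ c                              # bitwise sum without carries
    carry_word = ((a & b) | (a & c) | (b & c)) << 1   # majority, moved one place left
    return sum_word, carry_word


def add_operands(r1, r2, r3, r4, acc):
    """Add four recoded operands and the accumulator with three CSA levels."""
    s, c = carry_save_add(r1, r2, r3)    # level 1: first three recoding modules
    s, c = carry_save_add(s, c, r4)      # level 2: adds the 4th module output
    s, c = carry_save_add(s, c, acc)     # level 3: adds the accumulator contents
    return s + c                         # carry propagate adder gives the final sum


print(add_operands(1, 2, 3, 4, 5))
```

Each carry save level has the delay of one full adder regardless of word length, so only the last
addition pays for carry propagation.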
The gate-level logic circuit schematic of the multiplier and the block diagram of the
recoding logic components are shown in Figures 32 and 33 respectively. The recoding logic
essentially comprises shifting, logic and complementing circuits. The figures assume that 8 bits
are recoded at a time for a multiplicand of 8 bits, hence a total of 17 bits is generated,
including the sign bit. This can be understood better if the 17 bits are seen as 10 output bits,
one sign bit and the remaining sign extension bits. The results generated depend on the three least
significant bits of the multiplier. The required number of concurrent recoding logic components is
4, since each component handles 8 bits of the multiplier, which adds up to 32 bits in total. They
also play an important part in the 32-bit multiplication process. Four control signals are
generated by the decoder and become the inputs of the recoding logic components. The width of each
of the output signals can be generalized by the following equation, where x is the module number, n
represents the number of bits and m is the number of operands:
$$\text{Width of } x^{\text{th}} \text{ module output} = (n + 2m)\,x + 1 \text{ sign bit} \qquad (22)$$
$$\text{e.g. width of } x^{\text{th}} \text{ module output with 2 operands} = (n + 4)\,x + 1 \text{ sign bit}$$
From the above figures and details, it can be seen that 4 recoding logic components and 4
operands are necessary in order to recode 8 bits at a time. Each of the components generates a
total of n+8 bits plus one sign bit. The outputs of the recoding logic components are as follows:
Module   Output
1st      6-bit sign extension + 1 sign bit + (x+2) bits (exiting from the control gates)
2nd      4-bit sign extension + 1 sign bit + (x+2) bits (exiting from the control gates)
3rd      2-bit sign extension + 1 sign bit + (x+2) bits (exiting from the control gates)
4th      1 sign bit + (x+2) bits (exiting from the control gates) + 6 bits
As already discussed, each recoding logic component produces a total of 17 bits including
the sign bit. The outputs of the first three components are fed to the first carry save adder,
which produces a 17-bit sum vector and a carry vector; these are stored so that they can serve as
inputs to the next carry save adder. Simultaneously, the 4th recoding logic module generates its 17
bits and feeds them to the next carry save adder together with the output from the accumulator.
The result is then fed to the carry look-ahead adder, and its output is subsequently fed to the
accumulator, which carries it to the third carry save adder. This entire process is depicted in
Figure 36. Since 8 bits are recoded at a time, the 8 most significant bits of the output end up in
the accumulator, whereas the 8 least significant bits end up in the multiplier register.
Figure 36: 8-bit multiplication unit using the recoding by pairs algorithm
From the figure, it can be seen that the ALOAD signal controls the loading of the 8 least
significant bits into the multiplier register. The first three of these bits are fed to the
decoder, which then outputs the 4 control signals S1-S4. Once the recoding logic components obtain
the 8 bits loaded in the multiplier register, they generate the 17-bit outputs, which then serve as
the inputs for the adder component in order to give the final output. In the figure above, S21
serves as an input for the second carry save adder along with S22, S23 and S24. To better
understand the workings of the unit, consider the following example:
X (8-bit multiplicand) = 11101010 (234), Y (9-bit multiplier) = 001101100 (108)
1st Transition Cycle: The 1st decoder takes in the y0-y2 bits, the 2nd decoder takes in the y2-
y4 bits, the 3rd decoder takes in the y4-y6 bits and the 4th decoder takes in the y6-y8 bits.
1st recoding logic module: the bits y2 y1 y0 = 100 give S1 = 1, S2 = 1, S3 = 0 and S4 = 1. These
control signals indicate that 4 times the multiplicand is subtracted from the partial product. The
multiplicand is moved to the left by two bit positions and then complemented, which gives the
output 11111110001010111. The sign extension bits are the first 6 bits, which replicate the sign
bit that results from the gating of the control signals.
2nd recoding logic module: the bits y4 y3 y2 = 011 give S1 = 1, S2 = 0, S3 = 0 and S4 = 1. These
control signals indicate that 4 times the multiplicand is added to the partial product. The
multiplicand is moved to the left by two bit positions and passes through the complementing circuit
unchanged, which gives the output 00000111010100000. The sign extension bits are the last 2 bits
and the first 4 bits, which replicate the sign bit.
3rd recoding logic module: the bits y6 y5 y4 = 110 give S1 = 1, S2 = 1, S3 = 1 and S4 = 0. These
control signals indicate that 2 times the multiplicand is subtracted from the partial product. The
multiplicand is moved to the left by one bit position and then complemented, which gives the output
11110001010111111. The sign extension bits are the last 4 bits and the first 2 bits, which
replicate the sign bit.
4th recoding logic module: the bits y8 y7 y6 = 001 give S1 = 1, S2 = 0, S3 = 1 and S4 = 0. These
control signals indicate that 2 times the multiplicand is added to the partial product. The
multiplicand is moved to the left by one bit position and passes through the complementing circuit
unchanged, which gives the output 00111010100000000. The sign extension bits are the last 6 bits.
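The four module outputs of the 1st transition cycle can be reproduced with a short Python model.
The sketch rests on assumptions that are inferred from the bit patterns rather than stated in the
text: module k places its shifted multiple 2k positions to the left, a subtraction complements all
17 bits (one's complement, with the carry-in of 1 supplied elsewhere), and the name module_output
is hypothetical.

```python
RULES = {0b001: 2, 0b010: 2, 0b011: 4, 0b100: -4, 0b101: -2, 0b110: -2}
WIDTH = 17
MASK = (1 << WIDTH) - 1


def module_output(x, y, module):
    """17-bit output of recoding module 0..3 for multiplicand x and multiplier y."""
    triple = (y >> (2 * module)) & 0b111
    multiple = RULES.get(triple, 0)                  # 000 and 111: no operation
    word = (abs(multiple) * x) << (2 * module)       # shift, then place by module
    word &= MASK
    if multiple < 0:
        word ^= MASK                                 # one's complement of all 17 bits
    return format(word, "017b")


X, Y = 0b11101010, 0b001101100
for module in range(4):
    print(module + 1, module_output(X, Y, module))
```

Under these assumptions the four printed strings agree with the module outputs quoted in the
worked example.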
The results from the first three recoding logic components are fed to the first carry save
adder, which produces a sum vector and a carry vector. These are stored so that they can serve as
inputs to the second carry save adder along with the output from the 4th recoding logic module. The
second carry save adder then generates a new sum and carry vector and feeds them to the third carry
save adder together with the output from the accumulator. The result is then fed to the carry
propagate adder to get the final output, 00110001010101111. The two least significant bits of the
accumulator are moved to the right into the multiplier register, where they become its two most
significant bits. The two most significant bits of the accumulator are compared in order to
determine which bits move forward to the two-bit extension register, which marks the end of the 1st
transition cycle.
In the 2nd transition cycle, the control signals are generated from the 4 decoders after the three least
significant bits from the multiplier register are compared. The results from the decoders and final
result from the 2nd transition cycle are described in the figure below:
Figure 37: Outputs for each level in the 2nd transition cycle
This final result is loaded into the accumulator and moved to the right by two bit positions
towards the multiplier register. The same process is repeated for the 3rd and 4th transition cycles.
Figure 38: Outputs for the 3rd and 4th transition cycles
The output loaded into the accumulator at the end of the 4th transition cycle was
00111000011100010, while the output in the multiplier register was 10111000. The final output is
determined by taking the 7 least significant bits from the accumulator and the 8 bits from the
multiplier register, which gives 110001010111000 (25272 = 234 × 108).
5.3 Final Multiplier Design
The final design follows the same method as the one discussed in section 5.2. It is depicted
in Figure 39, which shows the final design of the 32-bit high speed multiplier that uses the
recoding by pairs algorithm. 4 transition cycles are required for the 8 bits of input in the
example in section 5.2, but in this final design the total number of cycles required is 16.
Figure 39: High speed multiplier of 32 bits which uses a recoding by pairs algorithm (Final Design)
Chapter 6: DETAILS OF IMPLEMENTATION
6.1 Introduction
This chapter covers the implementation of Booth's algorithm on an FPGA, including the
circuit diagram and the hardware components used to build the project, and walks through the build
step by step. The circuit is powered by a 3.3 V, 500 mA power supply. The following items are used:
Items Qty
Arty Z7 1
7-segment display 4
NPN transistor 16
Resistor-470 ohms 16
Wires 40
Breadboard 1
6.3 Description of hardware components
The display used is a 4-digit 7-segment display. Pins 1, 2, 6 and 8 are the common anodes.
Pins 14, 16, 13, 3, 5, 11, 15 and 7 are the pins corresponding to the LEDs.
6.3.2 Transistors:
An NPN transistor is used in the project to connect common anode from the 7-segment to the
positive supply. It has been chosen because a NPN transistor avoid some of the voltage base to
The transistor here is used as a switch to control the positive supply going to the display. The
base of the transistor is connected to the FPGA using a 470 ohms resistor. It is enough to operate
6.3.3 ArtyZ7
The ArtyZ7 is a development kit designed around the Zynq-7000 SoC from Xilinx, which combines a
dual-core 650 MHz ARM Cortex-A9 processor with Xilinx 7-series Field Programmable Gate Array
(FPGA) logic. This is the core of the project, where the multiplication algorithm is implemented.
6.4 Schematic Diagram
Figure 44: circuit before connecting to the ArtyZ7
6.5 Functional Description of the Project:
The ArtyZ7 is a programmable SoC (system on chip) board whose dual-core ARM Cortex-A9 processor
runs at a 650 MHz clock rate, which makes it a powerful processor. It also has four buttons and two
switches, which are used in this project to get user input. Switch one selects between input one
and input two; switch two selects the sign of the input that is selected by switch one. It sends a
logic one (high signal) to activate the digit on the seven-segment display. Button 3 is assigned to
change the value of the input and perform the multiplication. Button 0 adds one to the first digit
of the four-digit seven-segment display: every time it is pressed, it increments the value until it
reaches 9 and then starts from zero again, sending a high signal to the base of the transistor that
is connected to the digit. Button 1 controls the second digit and button 2 controls the third digit
of the input that is selected by switch one.
Figure 45: circuit after connecting to the ArtyZ7
6.6 Software Description:
6.6.1 Introduction
After the circuit is designed, it is modeled in VHDL. Since the circuit does not require the
use of memory and the output depends only on the present input, the combinational circuit
design process is used for the implementation of the circuit. This section describes the VHDL code.
The standard packages and libraries from IEEE are used in the code. The clock frequency is set
to the default of 100 MHz. The code consists of several components, including signed multipliers,
seven-segment displays, a BCD display, hex to seven-segment converters and signed to SLV
converters. The following are the components in the code:
This component contains the implementation of Booth's algorithm. The signals input 1 and
input 2 provide the two numbers for multiplication. The clock signal synchronizes the component
with the other components in the circuit. The reset is always set to zero. When the start signal is
high, multiplication occurs; when the start signal is low, the multiplication is complete and the
result is output.
Figure 46: Signed multiplier
6.6.3 Signed_to_SLV:
Following the multiplier is the signed to SLV component. It is used to generate the signal
that lets the segment converter display the sign of the number. There are three signed to SLV
converters, one each for A, B and the result. Each looks at its number and generates 1 if the
number is smaller than zero. The output goes to a multiplexer, which generates either ‘0111111’ or
‘1111111’. The signals are stored in a register and passed on to the segment converter to be
displayed at the output.
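The behaviour of the converter and its multiplexer can be sketched in Python. Only the neg flag
and the two sign patterns are taken from the description above; that the vector output carries the
magnitude of the number is an assumption, and the names signed_to_slv and sign_pattern are
hypothetical.

```python
def signed_to_slv(value, bit_depth=12):
    """Return (bit string, neg flag) for a signed input of bit_depth bits."""
    assert -(1 << (bit_depth - 1)) <= value < (1 << (bit_depth - 1))
    neg = 1 if value < 0 else 0
    magnitude = abs(value)                  # assumption: the output carries |value|
    return format(magnitude, f"0{bit_depth}b"), neg


def sign_pattern(neg):
    """Seven-segment pattern chosen by the multiplexer for the sign digit."""
    return "0111111" if neg else "1111111"


bits, neg = signed_to_slv(-5)
print(bits, neg, sign_pattern(neg))
```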
Figure 48: signed to SLV details circuit
Next follows the BCD display component. This component is a binary to BCD converter, which
converts the 12-bit binary input from the signed to SLV converter into 4-bit BCD signals. It works
by shifting bits from one shift register to another, starting with the MSB. There are three binary
to BCD converters in the code, one each for input A, input B and the result.
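The source of the converter is not reproduced in this section, so the following Python sketch is
only a plausible model of the MSB-first shifting it describes. It uses the well-known shift-and-
add-3 (double dabble) method, which is assumed here rather than confirmed; the function name
binary_to_bcd and its parameters are hypothetical.

```python
def binary_to_bcd(value, n_bits=12, n_digits=4):
    """Convert an unsigned binary value to BCD digits, least significant first."""
    assert 0 <= value < (1 << n_bits)
    bcd = 0
    for i in range(n_bits - 1, -1, -1):          # MSB first
        for d in range(n_digits):
            # Any digit of 5 or more gets 3 added before the shift, so that
            # doubling it carries correctly into the next decimal digit.
            if ((bcd >> (4 * d)) & 0xF) >= 5:
                bcd += 3 << (4 * d)
        bcd = (bcd << 1) | ((value >> i) & 1)    # shift in the next binary bit
    return [(bcd >> (4 * d)) & 0xF for d in range(n_digits)]


print(binary_to_bcd(999))
```

Four digits are enough for a 12-bit input because its largest value is 4095.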
6.6.5 Hex_to_7_Seg:
The hex to seven-segment converter takes the hex input from the BCD display and converts it to
the seven-segment output, which is fed into the segment controller. There is a total of 13 hex to
seven-segment converters in the code. Each consists of predefined outputs for the set of inputs.
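A lookup table of this kind can be sketched in Python as follows. The segment ordering (g down to
a) and the active-low polarity are assumptions, chosen because they make the minus sign pattern
‘0111111’ used elsewhere in the code light only the middle segment; the actual table in the project
is not shown here and may differ. The name hex_to_7_seg mirrors the component name for readability
only.

```python
# Standard active-high patterns for 0-F in the order g f e d c b a.
ACTIVE_HIGH = [0x3F, 0x06, 0x5B, 0x4F, 0x66, 0x6D, 0x7D, 0x07,
               0x7F, 0x6F, 0x77, 0x7C, 0x39, 0x5E, 0x79, 0x71]


def hex_to_7_seg(hex_digit):
    """Return the assumed active-low 7-bit pattern for one hex digit."""
    assert 0 <= hex_digit <= 0xF
    return format(ACTIVE_HIGH[hex_digit] ^ 0x7F, "07b")   # invert for active-low


for digit in range(10):
    print(digit, hex_to_7_seg(digit))
```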
This section of the code controls the output on the seven-segment display. The input signals
are the clock, the refresh rate and the digit inputs derived from the output of the multiplier, and
the output connects directly to the pins of the seven-segment display. The segment refresh rate is
set to 50 kHz. The segment controller works by toggling between the different pins of the
seven-segment display.
The sign_proc process handles the sign to be displayed for A, B and the result, based upon
the value stored in the sign register. With every rising edge of the clock, it reads the value of
the sign register.
The add_sub_proc process accounts for the digits which are displayed for A and B. With
every rising edge of the clock, if one_edge_start is 1 then it changes the LSB (the ones digit) of
the selected input. Similarly, if ten_edge_start is 1, it changes the center digit, the digit at
the tens place. If hundred_edge_start is high, it changes the MSB (the hundreds digit).
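The interplay between button edge detection and the bounded update can be modelled in Python.
The sketch follows the lead/follow registers and the comparison thresholds that appear in the
appendix listing for the ones and tens buttons. The threshold for the hundreds button is not
visible there and is therefore left out, and the names EdgeDetector and step_input are
hypothetical.

```python
class EdgeDetector:
    """Two flip-flops (lead, follow) that flag a rising edge of a button."""

    def __init__(self):
        self.lead = 0
        self.follow = 0

    def clock(self, button):
        # Both registers update on the same clock edge, as in the listing.
        self.lead, self.follow = button, self.lead
        return self.lead & (1 - self.follow)   # lead and (not follow)


LIMITS = {1: 999, 10: 989}   # thresholds taken from the comparisons in the listing


def step_input(value, step, add):
    """Add or subtract `step` unless the threshold for that step is reached."""
    limit = LIMITS[step]
    if add:
        return value + step if value < limit else value
    return value - step if value > -limit else value


detector = EdgeDetector()
print([detector.clock(b) for b in [0, 1, 1, 0, 1]])
```

Because the edge signal is high for a single clock cycle per press, holding a button down changes
the input only once.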
The counter_proc process is a counter that creates the desired multiplexing rate and shifts the
toggle bits.
The toggle_proc process toggles between the various seven-segment displays.
Chapter 7: SUMMARY AND CONCLUSION
Second, also during simulation, the last 32 of the 64 bits in the final multiplication result
appeared in complemented form, that is, there were 1's instead of 0's and vice versa. The error
was caused by the right shift input to the accumulator, which was zero at all times. Whenever 2's
complement was performed on the multiplicand multiples, a one was needed as the right shift input
to the accumulator. To resolve this issue, an AND gate was added to provide the correct shift
input. The accumulator MSB (acc [40]) and the constant "1" were used as inputs to the AND gate.
With this new addition, the functionality of the multiplier was successfully verified.
Throughout the synthesis, many difficulties were overcome. The most significant one was the
definition of constraints. Each step was performed a number of times until the right set of
constraints was found. The debugging capabilities of both synthesis tools made it possible to
identify the critical paths in the design, and inspecting these gave the author a better
understanding of the way constraints were used by the synthesis tools.
Ultimately, the experience and knowledge gained while working on the project have
been very valuable. Although the design sacrifices some uniformity and cost, the recoding
by pairs multiplier was designed, modeled and simulated successfully, thus achieving the
initial goals set for the project.
Appendix
1- VHDL
----------------------------------------------------------------------------------
-- Company:
-- Engineer: Rashed Alajmi
--
-- Create Date: 11/21/2018 11:35:06 AM
-- Design Name:
-- Module Name: top - Behavioral
-- Project Name:
-- Target Devices:
-- Tool Versions:
-- Description:
--
-- Dependencies:
--
-- Revision:
-- Revision 0.01 - File Created
-- Additional Comments:
--
-- ALU Top
-- Libraries
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.numeric_std.all;
use IEEE.STD_LOGIC_UNSIGNED.ALL;
use IEEE.std_logic_signed.all;
entity Booth_Top is
generic (
clock_frequency : integer := 100000000; -- Input clock rate in Hz (100 MHz default)
segment_refresh : integer := 50000); -- Refresh rate in Hz
port (
ja : out std_logic_vector(6 downto 0); -- seg out
dp : out std_logic;
digit_select : out std_logic_vector(15 downto 0);
one : in std_logic;
ten : in std_logic;
hundred : in std_logic;
start : in std_logic;
sel : in std_logic;
add_sub : in std_logic;
clk : in std_logic);
end Booth_Top;
architecture behavior of Booth_Top is
-------------------------------------------------------------------------------
-- COMPONENTS
-------------------------------------------------------------------------------
-- ALU
--component ALU_2
--generic (
-- bit_depth : integer := 8);
--port (
-- opcode : in std_logic_vector(2 downto 0);
-- A : in signed(2 * bit_depth - 1 downto 0);
-- B : in signed(2 * bit_depth - 1 downto 0);
-- execute : in std_logic;
-- result : out signed(2 * bit_depth - 1 downto 0));
--end component;
-- Seg Display
component Seg_Display_16
generic(
input_clk_freq : integer := 100000000; -- Input clock rate in Hz
refresh_rate : integer := 50000); -- Refresh rate in Hz
port(
-- 7 Segment Display Output
seg : out std_logic_vector(6 downto 0);
-- 7 Segment Display Decimal Point
dp : out std_logic;
-- Selects Digit
an : out std_logic_vector(15 downto 0);
-- Input segments 0 through 3
digit_1 : in std_logic_vector(6 downto 0);
digit_2 : in std_logic_vector(6 downto 0);
digit_3 : in std_logic_vector(6 downto 0);
digit_4 : in std_logic_vector(6 downto 0);
digit_5 : in std_logic_vector(6 downto 0);
digit_6 : in std_logic_vector(6 downto 0);
digit_7 : in std_logic_vector(6 downto 0);
digit_8 : in std_logic_vector(6 downto 0);
digit_9 : in std_logic_vector(6 downto 0);
digit_10 : in std_logic_vector(6 downto 0);
digit_11 : in std_logic_vector(6 downto 0);
digit_12 : in std_logic_vector(6 downto 0);
digit_13 : in std_logic_vector(6 downto 0);
digit_14 : in std_logic_vector(6 downto 0);
digit_15 : in std_logic_vector(6 downto 0);
digit_16 : in std_logic_vector(6 downto 0);
-- Input decimal points
-- in_dp : in std_logic_vector(15 downto 0);
-- Input Clock
clk : in std_logic);
end component;
-- BCD Display
component binary_bcd
generic(
N: integer := 16);
port(
clk, reset: in std_logic;
binary_in: in std_logic_vector(N-1 downto 0);
bcd0, bcd1, bcd2, bcd3,
bcd4, bcd5, bcd6 : out std_logic_vector(3 downto 0));
end component;
-- Hex to 7 seg
component Hex_to_7_Seg
port (
seven_seg : out std_logic_vector(6 downto 0);
hex : in std_logic_vector(3 downto 0));
end component;
-- Signed to SLV
component Signed_to_SLV
generic(
bit_depth : integer := 12);
port(
Signed_in : in signed(bit_depth - 1 downto 0);
SLV_out : out std_logic_vector(bit_depth - 1 downto 0);
neg : out std_logic);
end component;
-------------------------------------------------------------------------------
-- SIGNALS & CONSTANTS
-------------------------------------------------------------------------------
signal A_input, B_input : signed(11 downto 0) := (others => '0');
signal Result_out : signed(23 downto 0) := (others => '0');
signal dig_8 : std_logic_vector(6 downto 0) := "0000000";
signal dig_9 : std_logic_vector(6 downto 0) := "0000000";
signal dig_10 : std_logic_vector(6 downto 0) := "0000000";
signal dig_11 : std_logic_vector(6 downto 0) := "0000000";
signal dig_12 : std_logic_vector(6 downto 0) := "0000000";
signal dig_13 : std_logic_vector(6 downto 0) := "0000000";
signal dig_14 : std_logic_vector(6 downto 0) := "0000000";
signal dig_15 : std_logic_vector(6 downto 0) := "0000000";
signal dig_16 : std_logic_vector(6 downto 0) := "0000000";
begin
-- ALU
--ALU : ALU_2
--generic map(12)
--port map(opcode, A_input, B_input, start_start, Result_out);
-- Signed Multiplier
BOOTH: smult_1
generic map(12)
port map(Result_out, ready_data, A_input, B_input, start_start, reset, clk);
-- Binary BCD's
-- BCD Display A
A_BCD: binary_bcd
generic map(12)
port map(clk, reset, A_slv, A_bcd0, A_bcd1, A_bcd2, A_bcd3, A_bcd4, A_bcd5,
A_bcd6);
-- BCD Display B
B_BCD: binary_bcd
generic map(12)
port map(clk, reset, B_slv, B_bcd0, B_bcd1, B_bcd2, B_bcd3, B_bcd4, B_bcd5,
B_bcd6);
-- BCD Display Result
R_BCD: binary_bcd
generic map(24)
port map(clk, reset, Result_slv, R_bcd0, R_bcd1, R_bcd2, R_bcd3, R_bcd4, R_bcd5,
R_bcd6);
-- A Converter
A_CONVERTER: Signed_to_SLV
generic map(12)
port map(A_input, A_slv, A_neg);
B_CONVERTER: Signed_to_SLV
generic map(12)
port map(B_input, B_slv, B_neg);
RESULT_CONVERTER: Signed_to_SLV
generic map(24)
port map(Result_out, Result_slv, R_neg);
DIGIT_2: Hex_to_7_Seg
port map(dig_2, R_bcd6);
DIGIT_3: Hex_to_7_Seg
port map(dig_3, R_bcd5);
DIGIT_4: Hex_to_7_Seg
port map(dig_4, R_bcd4);
DIGIT_5: Hex_to_7_Seg
port map(dig_5, R_bcd3);
DIGIT_6: Hex_to_7_Seg
port map(dig_6, R_bcd2);
DIGIT_7: Hex_to_7_Seg
port map(dig_7, R_bcd1);
DIGIT_8: Hex_to_7_Seg
port map(dig_8, R_bcd0);
DIGIT_9: Hex_to_7_Seg
port map(dig_9, A_bcd1);
--DIGIT_10: Hex_to_7_Seg
-- port map(dig_10, A_sign);
DIGIT_11: Hex_to_7_Seg
port map(dig_11, A_bcd2);
DIGIT_12: Hex_to_7_Seg
port map(dig_12, A_bcd0);
--DIGIT_13: Hex_to_7_Seg
-- port map(dig_13, B_sign);
DIGIT_14: Hex_to_7_Seg
port map(dig_14, B_bcd2);
DIGIT_15: Hex_to_7_Seg
port map(dig_15, B_bcd1);
DIGIT_16: Hex_to_7_Seg
port map(dig_16, B_bcd0);
sign_proc : process(clk)
begin
if(rising_edge(clk)) then
if(A_neg = '1') then
A_sign <= "0111111";
else
A_sign <= "1111111";
end if;
if(B_neg = '1') then
B_sign <= "0111111";
else
B_sign <= "1111111";
end if;
end if;
end process sign_proc;
edge_detect_proc : process(clk)
begin
if(rising_edge(clk)) then
one_lead <= one;
one_follow <= one_lead;
ten_lead <= ten;
ten_follow <= ten_lead;
hundred_lead <= hundred;
hundred_follow <= hundred_lead;
start_lead <= start;
start_follow <= start_lead;
end if;
end process edge_detect_proc;
add_sub_proc : process(clk)
begin
if(rising_edge(clk)) then
if(one_edge_start = '1') then
if(add_sub = '1') then
if(sel = '1' and A_input < 999) then
A_input <= A_input + 1;
elsif(sel = '0' and B_input < 999) then
B_input <= B_input + 1;
end if;
else
if(sel = '1' and A_input > -999) then
A_input <= A_input - 1;
elsif(sel = '0' and B_input > -999) then
B_input <= B_input - 1;
end if;
end if;
end if;
if(ten_edge_start = '1') then
if(add_sub = '1') then
if(sel = '1' and A_input < 989) then
A_input <= A_input + 10;
elsif(sel = '0' and B_input < 989) then
B_input <= B_input + 10;
end if;
else
if(sel = '1' and A_input > -989) then
A_input <= A_input - 10;
elsif(sel = '0' and B_input > -989) then
B_input <= B_input - 10;
end if;
end if;
end if;
end if;
end process add_sub_proc;
end behavior;
2- Booth multiplier
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.numeric_std.all;
entity smult_1 is
generic (
input_size : integer := 8);
port (
product : out signed(2 * input_size - 1 downto 0);
data_ready : out std_logic;
input_1 : in signed(input_size - 1 downto 0);
input_2 : in signed(input_size - 1 downto 0);
start : in std_logic;
reset : in std_logic;
clk : in std_logic);
end smult_1;
----------------------------------------------------------------------------
--
-- BEHAVIOR
--
----------------------------------------------------------------------------
architecture behavior of smult_1 is
-- State Machine states
type state_type is(init, load_state, right_shift, done);
signal state, nxt_state : state_type;
-- Control signals
signal shift : std_logic;
signal add_A : std_logic;
signal add_S : std_logic;
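signal load : std_logic;
-- Edge-detection signals for the start input
signal start_count : std_logic;
signal start_count_lead : std_logic := '0';
signal start_count_follow : std_logic := '0';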
-- Data Signals
constant maxcount : integer := input_size - 1;
signal A_reg : signed((2*input_size) downto 0) := (others => '0');
signal S_reg : signed((2*input_size) downto 0) := (others => '0');
signal P_reg : signed((2*input_size) downto 0) := (others => '0');
signal sum_S : signed((2*input_size) downto 0) := (others => '0');
signal sum_A : signed((2*input_size) downto 0) := (others => '0');
signal count : integer range 0 to maxcount + 1 := 0;
begin
-----------------------------------------
-- STATE MACHINE
-- (Two Process)
--
-- This state machine is used to determine
-- what state smult_1 is in based on the
-- count value and the LSBs of the P
-- register
-----------------------------------------
state_proc: process(clk)
begin
if rising_edge(clk) then
if(reset = '1') then
state <= init;
else
state <= nxt_state;
end if;
end if;
end process state_proc;
state_machine: process(state, start_count, count, P_reg)
begin
-- Initialize nxt_state and control signals
nxt_state <= state;
shift <= '0';
add_A <= '0';
add_S <= '0';
load <= '0';
data_ready <= '0';
case state is
-- Initialization State
when init =>
if(start_count = '1') then
nxt_state <= load_state;
else
nxt_state <= init;
end if;
-- Loading State
when load_state =>
load <= '1';
nxt_state <= right_shift;
end if;
end case;
end process state_machine;
-----------------------------------------
-- EDGE DETECTION
--
-- This is used to detect a rising edge of
-- a signal
-----------------------------------------
start_count <= start_count_lead and (not start_count_follow);
start_count_proc: process(clk)
begin
if(rising_edge(clk)) then
if(reset = '1') then
start_count_lead <= '0';
start_count_follow <= '0';
else
start_count_lead <= start;
start_count_follow <= start_count_lead;
end if;
end if;
end process start_count_proc;
-----------------------------------------
-- COUNT PROCESS
--
-- This process is a counter that keeps
-- track of the number of cycles iterated
-- in the state machine
-----------------------------------------
count_proc: process(clk)
begin
if(rising_edge(clk)) then
if((start_count = '1') or (reset = '1')) then
count <= 0;
elsif(state = right_shift) then
count <= count + 1;
end if;
end if;
end process count_proc;
-----------------------------------------
-- MULTIPLIER PROCESS
--
-- This process is used to apply the
-- actual multiplication via shifts
-- and additions
-----------------------------------------
-- Determine the Sum of S_reg and A_reg
sum_S <= P_reg + S_reg;
sum_A <= P_reg + A_reg;
mult_proc: process(clk)
begin
if(rising_edge(clk)) then
if(reset = '1') then
P_reg <= (others => '0');
A_reg <= (others => '0');
S_reg <= (others => '0');
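elsif(load = '1') then
-- A_reg
A_reg(2*input_size downto input_size + 1) <= input_1;
A_reg(input_size downto 0) <= (others => '0');
-- S_reg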
S_reg(2*input_size downto input_size + 1) <= (not input_1) + 1;
S_reg(input_size downto 0) <= (others => '0');
-- P_reg
P_reg(2*input_size downto input_size + 1) <= (others => '0');
P_reg(input_size downto 1) <= input_2;
P_reg(0) <= '0';
end if;
end if;
end process mult_proc;
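The register scheme used by the multiplier process above (A_reg holding the multiplicand, S_reg its negation, and P_reg the multiplier followed by one extra low-order bit) can be checked in simulation against the built-in signed product. The following is a minimal sketch under stated assumptions, not part of the original design: the names booth_ref_tb, booth_mult, N and W are hypothetical, and the registers are one guard bit wider than in the listing so that the most negative multiplicand can be negated without overflow.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.numeric_std.all;

-- Hypothetical self-checking testbench (not from the thesis)
entity booth_ref_tb is
end booth_ref_tb;

architecture sim of booth_ref_tb is
  -- Operand width; assumed equal to the default input_size of smult_1
  constant N : integer := 8;

  -- Radix-2 Booth multiplication on a combined register.
  -- Assumption: the multiplicand is sign-extended by one guard bit,
  -- so the registers are 2*N + 2 bits wide instead of 2*N + 1.
  function booth_mult(m, r : signed(N - 1 downto 0)) return signed is
    constant W    : integer := 2 * N + 2;
    variable m_ext : signed(N downto 0);
    variable a, s, p : signed(W - 1 downto 0) := (others => '0');
  begin
    m_ext := resize(m, N + 1);
    a(W - 1 downto N + 1) := m_ext;   -- +multiplicand in the upper bits
    s(W - 1 downto N + 1) := -m_ext;  -- -multiplicand in the upper bits
    p(N downto 1) := r;               -- multiplier, with p(0) = '0'
    for i in 1 to N loop
      if p(1) = '0' and p(0) = '1' then
        p := p + a;                   -- bit pair "01": add multiplicand
      elsif p(1) = '1' and p(0) = '0' then
        p := p + s;                   -- bit pair "10": subtract multiplicand
      end if;
      p := shift_right(p, 1);         -- arithmetic shift right by one bit
    end loop;
    return p(2 * N downto 1);         -- drop the extra LSB and the guard bit
  end function;
begin
  check_proc : process
    variable m, r : signed(N - 1 downto 0);
  begin
    -- Compare every operand pair against the numeric_std "*" operator
    for i in -(2 ** (N - 1)) to 2 ** (N - 1) - 1 loop
      for j in -(2 ** (N - 1)) to 2 ** (N - 1) - 1 loop
        m := to_signed(i, N);
        r := to_signed(j, N);
        assert booth_mult(m, r) = m * r
          report "Booth mismatch for " & integer'image(i) &
                 " * " & integer'image(j)
          severity failure;
      end loop;
    end loop;
    report "All operand pairs match the numeric_std product";
    wait;
  end process check_proc;
end sim;
```

Running this testbench in any VHDL-93 simulator exercises all 65,536 operand pairs for N = 8. The guard bit is a design choice of the sketch: without it, negating the most negative multiplicand wraps back to the same value and the subtraction step adds the wrong quantity.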
## Clock signal
##Switches
##RGB LEDs
#set_property -dict { PACKAGE_PIN H4 IOSTANDARD LVCMOS33 } [get_ports { led2_b }]; #IO_L21N_T3_DQS_35 Sch=led2_b
#set_property -dict { PACKAGE_PIN J2 IOSTANDARD LVCMOS33 } [get_ports { led2_g }]; #IO_L22N_T3_35 Sch=led2_g
#set_property -dict { PACKAGE_PIN J3 IOSTANDARD LVCMOS33 } [get_ports { led2_r }]; #IO_L22P_T3_35 Sch=led2_r
#set_property -dict { PACKAGE_PIN K2 IOSTANDARD LVCMOS33 } [get_ports { led3_b }]; #IO_L23P_T3_35 Sch=led3_b
#set_property -dict { PACKAGE_PIN H6 IOSTANDARD LVCMOS33 } [get_ports { led3_g }]; #IO_L24P_T3_35 Sch=led3_g
#set_property -dict { PACKAGE_PIN K1 IOSTANDARD LVCMOS33 } [get_ports { led3_r }]; #IO_L23N_T3_35 Sch=led3_r
##LEDs
##Buttons
##Pmod Header JA
set_property -dict { PACKAGE_PIN D13 IOSTANDARD LVCMOS33 } [get_ports { ja[4] }]; #IO_L6N_T0_VREF_15 Sch=ja[7]
set_property -dict { PACKAGE_PIN B18 IOSTANDARD LVCMOS33 } [get_ports { ja[5] }]; #IO_L10P_T1_AD11P_15 Sch=ja[8]
set_property -dict { PACKAGE_PIN A18 IOSTANDARD LVCMOS33 } [get_ports { ja[6] }]; #IO_L10N_T1_AD11N_15 Sch=ja[9]
set_property -dict { PACKAGE_PIN K16 IOSTANDARD LVCMOS33 } [get_ports { dp }]; #IO_25_15 Sch=ja[10]
##Pmod Header JB
##Pmod Header JC
##Pmod Header JD
##USB-UART Interface
#set_property -dict { PACKAGE_PIN A1 IOSTANDARD LVCMOS33 } [get_ports { ck_an_n[3] }]; #IO_L9N_T1_DQS_AD7N_35 Sch=ck_an_n[3]
#set_property -dict { PACKAGE_PIN B1 IOSTANDARD LVCMOS33 } [get_ports { ck_an_p[3] }]; #IO_L9P_T1_DQS_AD7P_35 Sch=ck_an_p[3]
#set_property -dict { PACKAGE_PIN B2 IOSTANDARD LVCMOS33 } [get_ports { ck_an_n[4] }]; #IO_L10N_T1_AD15N_35 Sch=ck_an_n[4]
#set_property -dict { PACKAGE_PIN B3 IOSTANDARD LVCMOS33 } [get_ports { ck_an_p[4] }]; #IO_L10P_T1_AD15P_35 Sch=ck_an_p[4]
#set_property -dict { PACKAGE_PIN C14 IOSTANDARD LVCMOS33 } [get_ports { ck_an_n[5] }]; #IO_L1N_T0_AD0N_15 Sch=ck_an_n[5]
#set_property -dict { PACKAGE_PIN D14 IOSTANDARD LVCMOS33 } [get_ports { ck_an_p[5] }]; #IO_L1P_T0_AD0P_15 Sch=ck_an_p[5]
##NOTE: These pins should be used when using the analog header signals A0-A5 as digital I/O (Chipkit digital pins 14-19)
#set_property -dict { PACKAGE_PIN R13 IOSTANDARD LVCMOS33 } [get_ports { ck_io[31] }]; #IO_L5N_T0_D07_14 Sch=ck_io[31]
#set_property -dict { PACKAGE_PIN R15 IOSTANDARD LVCMOS33 } [get_ports { ck_io[32] }]; #IO_L13N_T2_MRCC_14 Sch=ck_io[32]
#set_property -dict { PACKAGE_PIN P15 IOSTANDARD LVCMOS33 } [get_ports { ck_io[33] }]; #IO_L13P_T2_MRCC_14 Sch=ck_io[33]
#set_property -dict { PACKAGE_PIN R16 IOSTANDARD LVCMOS33 } [get_ports { ck_io[34] }]; #IO_L15P_T2_DQS_RDWR_B_14 Sch=ck_io[34]
#set_property -dict { PACKAGE_PIN N16 IOSTANDARD LVCMOS33 } [get_ports { ck_io[35] }]; #IO_L11N_T1_SRCC_14 Sch=ck_io[35]
#set_property -dict { PACKAGE_PIN N14 IOSTANDARD LVCMOS33 } [get_ports { ck_io[36] }]; #IO_L8P_T1_D11_14 Sch=ck_io[36]
#set_property -dict { PACKAGE_PIN U17 IOSTANDARD LVCMOS33 } [get_ports { ck_io[37] }]; #IO_L17P_T2_A14_D30_14 Sch=ck_io[37]
#set_property -dict { PACKAGE_PIN T18 IOSTANDARD LVCMOS33 } [get_ports { ck_io[38] }]; #IO_L7N_T1_D10_14 Sch=ck_io[38]
#set_property -dict { PACKAGE_PIN R18 IOSTANDARD LVCMOS33 } [get_ports { ck_io[39] }]; #IO_L7P_T1_D09_14 Sch=ck_io[39]
#set_property -dict { PACKAGE_PIN P18 IOSTANDARD LVCMOS33 } [get_ports { ck_io[40] }]; #IO_L9N_T1_DQS_D13_14 Sch=ck_io[40]
#set_property -dict { PACKAGE_PIN N17 IOSTANDARD LVCMOS33 } [get_ports { ck_io[41] }]; #IO_L9P_T1_DQS_14 Sch=ck_io[41]
## ChipKit SPI
## ChipKit I2C
#set_property -dict { PACKAGE_PIN M17 IOSTANDARD LVCMOS33 } [get_ports { ck_ioa }]; #IO_L10N_T1_D15_14 Sch=ck_ioa
#set_property -dict { PACKAGE_PIN C2 IOSTANDARD LVCMOS33 } [get_ports { ck_rst }]; #IO_L16P_T2_35 Sch=ck_rst
##Quad SPI Flash
##Power Measurements