Alajmi Rashed Thesis 2019

CALIFORNIA STATE UNIVERSITY, NORTHRIDGE
Implementing booth algorithm on FPGA
A graduate project submitted in partial fulfillment of the requirements for the

Degree of Master of Science in
Electrical Engineering
By
Rashed Hamad Alajmi
May 2019
The graduate project of Rashed Hamad Alajmi is approved:
Dr. Xiyi Hang    Date
Dr. Xiaojun (Ashley) Geng    Date
Dr. Nagi El Naga, Chair    Date
CALIFORNIA STATE UNIVERSITY, NORTHRIDGE
ACKNOWLEDGEMENTS
I would like to express my sincere gratitude to my professor, Dr. Nagi El Naga, for
all his advice on this project and throughout my master's degree.
I would also like to thank my parents and my whole family for all their support and
patience until I finished this project.
Table Of Contents
SIGNATURE PAGE
ACKNOWLEDGMENT
LIST OF FIGURES
ABSTRACT
Chapter 1: Project Overview
1.1 Introduction
1.2 Objective
1.3 Project Outline
Chapter 2: High Performance Adders
2.1 Carry Look-Ahead (CLA) Adders
2.1.1 CLA Theory
2.1.2 Carry Look-Ahead Adder Group
2.1.3 Four-bit Carry Look-ahead Unit 1
2.1.4 Four-bit Carry Look-ahead Unit 2
2.1.5 32-bit One-Level Carry-Look Ahead Adder
2.1.6 Two Level CLA Adder unit
2.2 Carry Save Adders
2.3 Conclusion
Chapter 3: Multiplication Algorithms
3.1 Serial Multiplication Methods
3.1.1 Simple Multiplication Method
3.1.2 Booths Algorithm
3.1.3 Modified Booth Algorithm
3.2 Recoding Multipliers
3.2.1 Uniform Shift of One method
3.3.2 Uniform Shift of Two method
3.4 Implementing the Overlapped Scanning multiplier algorithm
Chapter 4- Logic Circuits and Modules
4.1 Decoders
4.2 Control Gates
4.3 Complementing circuits
4.4 Accumulator
4.5 Left-bit shifter
Chapter 5: Designing the Multiplier
5.1 Logical Circuits Schematic
5.2 8-bit Multiplication Model
5.3 Final Multiplier Design
Chapter 6: Details of Implementation
6.1 Introduction
6.2 List of components
6.3 Description of hardware components
6.3.1 Seven segments
6.3.2 Transistors
6.3.3 Arty Z7
6.4 Schematic Diagram
6.5 Functional description of the project
6.6 Software Description
6.6.1 Introduction
6.6.2 Signed Multiplier
6.6.3 Signed to SLV
6.6.4 BCD Display
6.6.5 Hex to 7 SEG
6.6.6 Segment Controller
Chapter 7
References
Appendix: VHDL CODE
List of Figures:
2.2.1 Schematic of a Carry Look Ahead Adder
2.2.2 Schematic of a Carry Look-Ahead Group Unit
2.2.3 Schematic of a Four-bit Carry Look-ahead Unit 1
2.2.3 Block Diagram of a Four-bit Carry Look-ahead Unit 1
2.2.3 Schematic of a Four-bit Carry Look-ahead Unit 2
2.2.3 Block Diagram of a Four-bit Carry Look-ahead Unit 2
2.3.4 Schematic of a 32-bit One Level Carry Look-Ahead Adder Unit
2.2.4 Schematic of a 32-bit Two Level Carry Look-Ahead Adder Unit
2.4 The Similarities between a 1-bit Full Adder and a Carry Save Adder
2.4 Carry Save Adder Blocks for n = 8 bit numbers
2.4 Three-leveled Carry Save Ahead Tree
3.1.2 Booth's Algorithm operations for different combinations of bi-1 and bi
3.1.2 Flow Chart for Booth's Algorithm
3.1.2 Booth's Algorithm example
3.1.3 Radix-4 Booth encoding table
3.1.3 Radix-4 example
3.2.1 Operation Rules when depending on yi
3.2.1 Operation Rules when depending on yi and carry variable is considered
3.2.1 Operations rules for fi0
3.2.1 Operations rules for fi1
3.2.1 Operation rules for Uniform Shift of One method
3.3.2 Uniform shift by one recoding bit multiplier
3.3.2.1 Non-overlapped multibit scanning multiplier
3.3.2.2 Operation Rules for recoding by pairs
3.3.2.2 Modified Operation Rules for recoding by pairs
3.4 Schematic for recoding by pairs multiplier
3.4 Decoding Operations Truth Table
4.1 Gate designs in a recoding by pairs decoder
4.3 34-bit Complementing Circuit
4.4 Accumulator Unit
4.5 Shift left register
4.6 Detailed circuit for recoding by multiple pairs
5.1 Internal resistor-transistor logic circuit schematic of the multiplier
5.1 Block diagram of the recoding logic component circuits
5.1 Outputs of each recoding logic component
5.3 8-bit multiplication unit using the recoding by pairs algorithm
5.3 Outputs for each level in the 2nd transition cycle
5.3 Outputs for the 3rd and 4th transition cycles
5.3 High speed multiplier of 32 bits which uses a recoding by pairs algorithm
6.2 Table of components
6.3.1 The common anode 7-segment Display
6.3.3 Arty Z7
6.4 Schematic of the circuit
6.4 Circuit before connecting to Arty Z7
6.5 Circuit after connecting to Arty Z7
6.6.2 Signed Multiplier
6.6.3 Signed to std logic vector converter
6.6.3 Signed to slv detailed circuit
6.6.4 BCD display components
6.6.5 Hex to 7 segments component
6.6.6 Segment controller
ABSTRACT
Implementing booth algorithm on FPGA
By
Rashed Hamad Alajmi
Master of Science in Electrical Engineering
The goal of this project is to design, model, simulate and ultimately build a high-speed
32-bit multiplication system that uses the recoding-by-pairs algorithm to speed
up the process. In basic mathematical terms, multiplication is the process by which a number is
scaled by another number. To elaborate on the high-speed 32-bit multiplication process, several
methodologies and frameworks are discussed in detail. After comparing the different
algorithms, Booth's algorithm was found to be the most efficient in terms of speed and accuracy.
Logic circuits carry out a set of actions that depend on
the input fed to the system. Their functions can be classified as shifting,
complementing and control. The specifics of these components are examined, designed
and described in this project. A carry look-ahead (CLA) adder is used in the project because of
its short propagation time. The CLA adder is used together with a carry save adder because the addition
process involves multiple n-bit numbers. Two architectures, the one-level CLA adder and the
two-level CLA adder, are discussed as ways to increase the speed of operation.
In this project, a two-level CLA adder is used because it is much faster than a one-level
CLA adder for 32-bit numbers.
Chapter 1: Project Overview
1.1 Introduction
Multiplication is of paramount importance in fields ranging from finance to
engineering. It is central to several areas of electrical engineering, such as
microchip development, telecommunication networks, graphics engine design, image rendering
and, most importantly, digital signal processing (DSP). In a general-purpose multiplier, data
arrives continuously, which makes the algorithm complex. The complexity of any
algorithm can be measured in terms of cost and time. A better algorithm therefore has both a low
cost and a short execution time. Multiplication is time-consuming because it
involves many computations, which increases the execution time, so the
algorithm must be optimized to reduce the delay. VLSI designers have found that assigning
a large area to the integer and floating-point multipliers helps speed up the multiplication
process.
Ideally, real-time DSP calls for rates as high as one billion arithmetic operations
per second, that is, on average one
computation completed every billionth of a second ($10^{-9}$ s), which is common
given the volume of computation in real-time DSP. Advances in Very Large Scale
Integration (VLSI) technology have eased the time-consumption problem in real-time DSP
by introducing new methodologies and design architectures that
increase the efficiency of the operation significantly. Multiplication
algorithms that seemed far-fetched a few decades ago can now be implemented readily thanks to
advances in VLSI technology, both in device fabrication and in the
design methodologies that have resolved fabrication issues and improved the efficiency
of complex designs.
A multiplier requires intensive computation. Multiplication can be performed in three
major steps. In the first step, the partial products are generated. In the second step, the partial
products are reduced to one row of final sums and carries. In the third step, the final sums and
carries are added to generate the result. A modified Booth multiplier should concentrate on
three things: reducing the total number of partial products, reducing the number of two's
complement operations, and optimizing the adder structure.
Booth's algorithm is a crucial improvement in the design of signed binary multipliers.
There has been progress in partial product reduction, adder structures and complementing
methods, but there is still scope to modify Booth's algorithm for further optimization. The
modified Booth multiplier is synthesized and implemented on an FPGA using the VHDL hardware
description language.
1.2 Objective
The primary purpose of this project is to model, design, test and implement
a 32-bit high-speed multiplier that uses the recoding-by-pairs algorithm on a field programmable gate
array (FPGA), and that is both fast and accurate when handling a large
number of complicated computations. Normally, speed would be
compromised by the complexity inherent in such a design, but a carry look-ahead adder circuit
can help reduce the speed issues.
1.3 Project Outline
The project is organized as follows:
Chapter 1 introduces the project and gives the
reader an overview.
Chapter 2- High Performance Adders: This chapter briefly touches upon some of the fast
addition techniques that improve multiplier performance, such as carry save adders, and
discusses carry look-ahead adders in greater detail since they are more important for this project.
Chapter 3- Multiplication Algorithms: This chapter introduces, classifies and
discusses the multiplication methods that can be implemented. These
include recoding algorithms, direct multiplication and Booth's algorithm. The main focus of this
project is implementing Booth's algorithm on an FPGA to design the multiplier, so Booth's
algorithm is discussed in greater detail.
Chapter 4- Logic Circuits Design: Many logic circuits are
used to design, model, simulate and implement the multiplier, including decoders,
accumulators, right and left shift registers and control circuits. This chapter describes how
each of them contributes to the efficiency of the multiplier.
Chapter 5- Designing the Multiplier: After chapters 2, 3 and 4 lay the groundwork
in terms of the adders, multiplication algorithms and logic
circuits to be implemented, the high-speed multiplier is studied and designed so that it is compatible
with inflexible applications and very large scale integration applications.
Chapter 6 illustrates the modeling of the multiplier using the VHDL hardware description
language.
Chapter 7 is the concluding chapter and summarizes the results of the project.
Chapter 2: High Performance Adders
Fast addition is essential in digital systems, especially in real-time digital
signal processing. The efficiency and speed of the adders play a very important role in the
overall speed and accuracy of any arithmetic circuit. This chapter surveys some widely used fast
addition methods and gives the necessary details on carry look-ahead
(CLA) adders, which are used in the final design of the multiplier. It concludes
that a multilevel CLA adder is the most
beneficial choice for the effectiveness of the addition and of the overall system.
2.1 Carry Look-Ahead (CLA) Adders
CLA adders are considerably more complicated than the slow, basic ripple-carry (RC) adder,
but provide a substantial gain in speed. RC adders can be compared with
the conventional paper-and-pencil method of addition, in which corresponding digits are added
to one another starting from the units (least significant) position until all
corresponding digits have been added and a final result has been obtained. In an RC adder,
the sum of two corresponding digits may exceed what a single digit can hold, in which case a
carry bit has to be passed to the next more significant position. The main difference
between RC and CLA adders is that, although both start in the same manner, with the carry
propagating through a 4-bit segment, in the CLA adder the carry then jumps from one carry unit
to the next instead of rippling through every bit, which makes carry propagation about 4 times
faster. Within each group that has accepted a carry in, the carry then propagates among the
bits of that segment.
2.1.1 CLA Theory
Building on the concept of the 1-bit full adder, consider the full adder circuit
shown in Figure 1, in which the operand bits $A_i$ and $B_i$ are added along with the carry-in bit
from the previous column ($C_i$).
Figure.1: Schematic of a Carry Look Ahead Adder
Two internal signals are generated, namely $P_i$ and $G_i$, which
are computed as follows:
$P_i = A_i \oplus B_i$    (1)
$G_i = A_i \cdot B_i$    (2)
The sum and carry-out functions can then be defined as follows:
$S_i = P_i \oplus C_i$    (3)
$C_{i+1} = P_i \cdot C_i + G_i$    (4)
Here $S_i$, $C_{i+1}$ and $C_i$ are the sum, carry out and carry in respectively, and $P_i$ and
$G_i$ are known as the carry propagate and the carry generate respectively. The carry generate is
so called because a carry out is generated whenever the signal is equal to 1,
irrespective of the carry in. The carry propagate is so called because it propagates
the carry from carry in to carry out whenever it is equal to 1. There are two
CLA adder architectures, known as one-level and two-level CLA
respectively. Both are examined in order to decide which of the two
gives the better result. A few modules are needed to analyze and design these units;
they are discussed in the following sections.
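Equations 1 to 4 can be checked with a short behavioural model. The Python sketch below is not part of the VHDL design; it assumes the propagate signal is the XOR of the operand bits (matching the XOR gate of the first logic level), and the function name `full_adder_pg` is chosen here purely for illustration.

```python
def full_adder_pg(a, b, c_in):
    """Return (p, g, s, c_out) for one bit position of a CLA adder."""
    p = a ^ b                # carry propagate, equation (1), taken as XOR
    g = a & b                # carry generate, equation (2)
    s = p ^ c_in             # sum bit, equation (3)
    c_out = g | (p & c_in)   # carry out, equation (4)
    return p, g, s, c_out


# Exhaustive check against ordinary integer addition.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            _, _, s, c_out = full_adder_pg(a, b, c)
            assert 2 * c_out + s == a + b + c
```

The loop at the end compares all eight input combinations with plain integer addition, so any slip in the four equations would raise an assertion error.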
2.1.2 Carry Look-Ahead Adder Group
The CLA theory in Section 2.1.1 showed how the sum and carry
out signals are determined in a CLA adder. However, fan-in restrictions mean that
the adder has to be split into 4-bit groups. Each 4-bit group is built from 3 levels
of logic, as depicted in Figure 2.
Figure.2: Schematic of a Carry Look-Ahead Group Unit
The levels are as follows:
First Level: All the P and G signals are generated here, using four sets of
P and G logic (each set consists of an AND gate and an XOR gate).
Second Level: This is the look-ahead logic block of the CLA, which contains four two-level
logic circuits. In the figure above, C1, C2, C3 and C4 are generated at this level.
Third Level: This consists of the four XOR gates that generate the sum signals S0,
S1, S2 and S3.
2.1.3 Four-bit Carry Look-ahead Unit 1
Building on the carry look-ahead theory and the adder group discussed in the previous
sections, and referring to the schematic and block diagram shown in Figures 3 and 4 respectively,
it is possible to expand equation 4 recursively.
Figure.3: Schematic of a Four-bit Carry Look-ahead Unit 1
Figure.4: Block Diagram of a Four-bit Carry Look-ahead Unit 1
The Boolean expressions for the carry outputs at each stage are obtained and
simplified by substituting the previous carry output expressions, as shown below:
$C_1 = G_0 + P_0 C_0$    (5)
$C_2 = G_1 + P_1 C_1$    (6)
$C_2 = G_1 + P_1 G_0 + P_1 P_0 C_0$    (7)
$C_3 = G_2 + P_2 C_2$    (8)
$C_3 = G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 C_0$    (9)
$C_4 = G_3 + P_3 C_3$    (10)
$C_4 = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 C_0$    (11)
Thus the equations relevant for this project are equations 5, 7, 9 and 11. The carry output
$C_4$ is the final carry, formed from the preceding carry generates and propagates, and is fed
into the next segment.
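As an illustrative check of equations 5, 7, 9 and 11, the following Python sketch evaluates the four look-ahead carries directly from the propagate and generate signals and uses them to add two 4-bit numbers. The names `clau1_carries` and `add4` are hypothetical, the propagate signal is assumed to be the XOR of the operand bits, and the model ignores gate delays.

```python
def clau1_carries(p, g, c0):
    """Carries C1..C4 of a 4-bit look-ahead unit from P0..P3, G0..G3 and C0."""
    c1 = g[0] | (p[0] & c0)                                        # equation (5)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)                 # equation (7)
    c3 = (g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
          | (p[2] & p[1] & p[0] & c0))                             # equation (9)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0])
          | (p[3] & p[2] & p[1] & p[0] & c0))                      # equation (11)
    return c1, c2, c3, c4


def add4(a, b, c0=0):
    """Add two 4-bit integers; returns (4-bit sum, carry out C4)."""
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]   # propagate per bit
    g = [((a >> i) & (b >> i)) & 1 for i in range(4)]   # generate per bit
    c = (c0,) + clau1_carries(p, g, c0)                 # c[i] is the carry into bit i
    s = sum((p[i] ^ c[i]) << i for i in range(4))       # sum bits S0..S3
    return s, c[4]
```

Every carry is written as a flat sum of products of the inputs, which is what allows all four carries to be produced in parallel rather than one after another.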
2.1.4 Four-bit Carry Look-ahead Unit 2
The schematic and block diagram shown in Figures 5 and 6 respectively indicate some of
the differences between CLAU 1 and CLAU 2. Based on these differences, it is possible to
derive modified expressions for the carry generate ($G_1^*$) and carry propagate ($P_1^*$), which are as
follows:
$P_1^* = P_3 \cdot P_2 \cdot P_1 \cdot P_0$    (12)
$G_1^* = G_3 + P_3 \cdot G_2 + P_3 \cdot P_2 \cdot G_1 + P_3 \cdot P_2 \cdot P_1 \cdot G_0$    (13)
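The group signals can be sanity-checked in the same way. The Python sketch below (hypothetical function names, propagate taken as XOR) uses the form of the group generate obtained by setting the carry in to 0 in equation 11, and confirms that the group carry out equals $G_1^* + P_1^* C_0$.

```python
def bit_pg(a, b, width=4):
    """Per-bit propagate (XOR) and generate (AND) lists, index 0 = LSB."""
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]
    return p, g


def group_pg(p, g):
    """Group propagate P* and group generate G* of one 4-bit group."""
    p_star = p[3] & p[2] & p[1] & p[0]
    g_star = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
              | (p[3] & p[2] & p[1] & g[0]))
    return p_star, g_star


def group_carry_out(a, b, c0):
    """Carry out of the 4-bit group computed only from P*, G* and C0."""
    p_star, g_star = group_pg(*bit_pg(a, b))
    return g_star | (p_star & c0)
```

Because the group carry out depends only on two group-level signals and the carry in, a second level of look-ahead logic can treat each 4-bit group the way the first level treats a single bit.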
A one-level carry look-ahead adder is used in this case. It consists of
Four-bit Carry Look-ahead Units 1 (CLAU1) and Carry Look-Ahead Adder Groups (CLAAG)
combined in order to enhance the speed of the multiplier. Each CLAAG generates the 4-bit carry
generate and propagate signals and uses the 4-bit carry supplied by its CLAU 1. The CLAU 1 in turn
takes the carry generate and propagate signals from the CLAAG as its inputs and
produces the 4-bit carry as its output. Eight such blocks, 32 bits in total and
all interconnected, form a 32-bit one-level CLA unit.
Figure.5: Schematic of a Four-bit Carry Look-ahead Unit 2
Figure.6: Block Diagram of a Four-bit Carry Look-ahead Unit 2
2.1.5 32-bit One-Level Carry-Look Ahead Adder
In this technique, as discussed briefly in the previous section, the adder is
divided into segments (groups) and the carry look-ahead method is applied at the group level.
The carry out of each group is allowed to propagate to the next group in the
adder, which reduces the time delay that is a significant issue in conventional
adders. With the delay of one gate denoted by $\tau_g$, the total addition
time for an n-bit adder is as follows:
$T_{total}(\text{for n-bit adder}) = \tau_g + 2\tau_g \times \frac{n}{4} + \tau_g$    (14)
$T_{total}(\text{for n-bit adder}) = (2 + 0.5n) \times \tau_g$    (15)
$T_{total}(\text{for 32-bit adder}) = 18\tau_g$
*note: The first $\tau_g$ is the delay due to generating the carry propagate and carry
generate signals in the CLAAGs, the last $\tau_g$ is the delay of the output sum generation in the
final group, and the middle term is the delay of signal propagation through the CLAU1 units.
Figure 7 represents a 32-bit one-level CLA adder where A and B are the 32-bit inputs and S
is the corresponding output. C denotes the carry propagating from one stage to the next.
The final output carry is $C_{32}$ while the first input carry is $C_0$.
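The structure described above, eight 4-bit groups with the group carry passed from one group to the next, can be modelled functionally as follows. This is a behavioural Python sketch with hypothetical names (`add4`, `add32_one_level`); it reproduces the arithmetic of the adder but not its timing.

```python
def add4(a, b, c_in):
    """One 4-bit look-ahead group: returns (4-bit sum, carry out)."""
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(4)]
    c = [c_in]
    for i in range(4):
        # Unrolling this recurrence yields the flat look-ahead expressions.
        c.append(g[i] | (p[i] & c[i]))
    s = sum((p[i] ^ c[i]) << i for i in range(4))
    return s, c[4]


def add32_one_level(a, b, c0=0):
    """Chain eight 4-bit groups; the carry passes C0 -> C4 -> ... -> C32."""
    total, carry = 0, c0
    for k in range(8):
        nibble_a = (a >> (4 * k)) & 0xF
        nibble_b = (b >> (4 * k)) & 0xF
        s, carry = add4(nibble_a, nibble_b, carry)
        total |= s << (4 * k)
    return total, carry   # (32-bit sum S, output carry C32)
```

For example, `add32_one_level(0xFFFFFFFF, 1)` returns `(0, 1)`: the carry passes through all eight groups and leaves as the output carry.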
Figure.7: Schematic of a 32-bit One Level Carry Look-Ahead Adder Unit
2.1.6 Two Level CLA Adder unit
A two-level CLA adder unit is much faster than a one-level CLA adder unit, which
is why it is usually preferred. The key differences between the two-level CLA adder
and the one-level CLA adder can be inferred from Figure 8. It forms a one-level CLA unit
at the group level, while at the section level it forms two two-level adders. The carry output
generated by each section ripples into the following section of the adder. Comparing
Figures 7 and 8, the terms A, B and S serve the same purpose in both figures: A and B are the
32-bit inputs and S is the corresponding output. The main difference lies in the bottom CLAU 1
sections, where $C_{in}$ is the input carry to the CLAU 1, $C_{32}$ is the output carry and $C_{16}$ serves both
purposes as it propagates from one CLAU1 stage to the other. In addition, this system
includes $P_1^*$ and $G_1^*$, the modified carry propagate and generate connections.
Figure.8: Schematic of a 32-bit Two Level Carry Look-Ahead Adder Unit
The time delay is also computed differently for the two-level CLA adder, as follows:
$T_{total} = \tau_g + 2\tau_g + 2\tau_g \times S + 2\tau_g + \tau_g$    (16)
$T_{total} = (6 + 2S) \times \tau_g$    (17)
Since in our case S = 2,
$T_{total} = 10\tau_g$
*note: The first $\tau_g$ is the delay due to generating the carry propagate and carry
generate signals in the CLAAGs, the second ($2\tau_g$) is the delay due to the carry propagate and carry
generate in the CLAU2, the third ($2\tau_g \times S$) is the delay incurred while signals propagate
through the CLAU2 sections, the fourth ($2\tau_g$) is the delay of signal propagation
through the last section's CLAU2 and the last $\tau_g$ is due to the output sum produced in the
last segment.
The time delay is much smaller than that of the one-level CLA unit, which
indicates better performance, at least in terms of speed. However, when n-bit numbers with a
larger n have to be added, CLAU2 might become inefficient, which is where carry save adders can
replace them as a more efficient option.
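The two delay expressions can be compared numerically. The Python sketch below simply evaluates the one-level formula for an n-bit adder and the two-level formula for S sections, with all results expressed in units of the gate delay; the function names are illustrative only.

```python
def one_level_delay(n):
    """Delay of an n-bit one-level CLA adder, in gate delays."""
    pg_generation = 1             # P and G signals in the CLAAGs
    carry_path = 2 * (n // 4)     # 2 gate delays per 4-bit CLAU1
    sum_generation = 1            # output sum of the final group
    return pg_generation + carry_path + sum_generation


def two_level_delay(sections):
    """Delay of a two-level CLA adder with S sections, in gate delays."""
    pg_generation = 1             # P and G signals in the CLAAGs
    group_pg = 2                  # group propagate/generate in the CLAU2
    section_path = 2 * sections   # propagation through the CLAU2 sections
    last_section = 2              # propagation through the last section
    sum_generation = 1            # output sum of the last segment
    return pg_generation + group_pg + section_path + last_section + sum_generation


print(one_level_delay(32), two_level_delay(2))   # prints: 18 10
```

With n = 32 and S = 2 this gives 18 and 10 gate delays respectively, the figures used in the comparison above.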
2.2 Carry Save Adders
The main drawback of conventional adders is that they are designed to
add only two numbers at a time. A 32-bit multiplier requires that
the multiplicand multiples be added quickly and without any restriction on the
number of operands, because of the size of the operands and the large number
of partial additions required. Carry save adders (CSA) overcome this shortcoming of
conventional adders, and even of CLAU2 adders, since they add three n-bit
numbers simultaneously and produce a sum vector and a carry vector, which are then
used as inputs by the following CSA units. A CSA unit has the same number of full adders
as its bit width, i.e. a 32-bit CSA has 32 full adders. A CSA
takes three n-bit binary numbers as input, denoted x, y and z, and generates two n-bit
output vectors: the sum (S), holding the partial sum, and the
carry (C), holding the partial carry to be used later on. The order in which the inputs are
added is irrelevant to the result of a CSA. A CSA is highly similar to a 1-bit full adder, as can
be seen in Figure 9, the difference being that the input carry is now denoted 'z', the sum output is
denoted 's' and the output carry is denoted 'c'. Figure 10 depicts how the full adders in a
CSA unit add three distinct n-bit numbers (x, y and z) and convert them to two output vectors
(c and s). In a CSA unit, the carry vector is always shifted left by one bit.
Figure 9: The Similarities between a 1-bit Full Adder and a Carry Save Adder
Figure 10: Carry Save Adder Blocks for n = 8 bit numbers
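The behaviour of a single CSA unit can be sketched in a few lines of Python. The function name `carry_save_add` is hypothetical, the operands are assumed to be non-negative integers, and Python's unbounded integers stand in for the n-bit vectors, so no bits are lost when the carry vector is shifted.

```python
def carry_save_add(x, y, z):
    """Compress three operands into a sum vector and a carry vector."""
    s = x ^ y ^ z                               # per-bit sum, no carry propagation
    c = ((x & y) | (x & z) | (y & z)) << 1      # per-bit carry, moved left by one bit
    return s, c


s, c = carry_save_add(5, 3, 6)
assert s + c == 5 + 3 + 6   # the two vectors together hold the full sum
```

No carry travels from one bit position to the next inside the unit, which is why its delay does not depend on the operand width.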
The final sum, however, is calculated by a carry look-ahead adder (CLA). Figure 11
shows how many steps it takes to add m different n-bit numbers. Before the numbers are added
in the final CLA, they pass through m-2 blocks, each block consisting of multiple one-bit
CSA units arranged in parallel. Each time a block is passed, the numbers
grow by 1 bit in size. If the time delay of each gate is assumed to be
$\tau_g$, each CSA level contributes a time delay of $2\tau_g$, the same as the delay
of a full adder stage. For the three-level tree shown in Figure 11 below,
the total time delay is $6\tau_g$.
Figure.11: Carry Save Adder Tree to add m n-bit numbers
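The count of m-2 CSA blocks can be illustrated with a small Python sketch. It repeatedly compresses three operands into two until only two remain, then performs one ordinary addition that stands in for the final carry look-ahead adder. The names `csa` and `add_many` are hypothetical, non-negative operands are assumed, and the order in which operands are grouped is an arbitrary choice made here rather than the arrangement of Figure 11.

```python
def csa(x, y, z):
    """One carry save block: three operands in, sum and carry vectors out."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1


def add_many(numbers):
    """Add m operands with carry save blocks; returns (total, blocks used)."""
    operands = list(numbers)
    blocks = 0
    while len(operands) > 2:
        x, y, z = operands[:3]
        s, c = csa(x, y, z)
        operands = operands[3:] + [s, c]   # three operands replaced by two
        blocks += 1
    return sum(operands), blocks           # final two-operand addition


total, blocks = add_many([3, 5, 7, 9, 11])
assert total == 35 and blocks == 5 - 2
```

Each block reduces the number of operands by one, so m operands need m-2 blocks before the final two-operand addition.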
2.3 Conclusion
This chapter described in detail the carry look-ahead technique, which is used
by the multiplication techniques discussed later. The CLAAG units and their
importance in addition were also discussed. Two adder implementations commonly applied
to multipliers were then discussed, namely the one-level CLA unit and the two-level CLA unit.
Comparing their time delays showed that the two-level CLA unit is much
faster, with a delay of only $10\tau_g$ against $18\tau_g$ for the one-level CLA unit. Finally,
the addition of multiple n-bit numbers was investigated; the most feasible technique is the
carry save adder, in which a three-level CSA tree has a delay of only $6\tau_g$. The total time
delay incurred by the addition process is therefore $16\tau_g$.
Chapter 3: Multiplication Algorithms
In this chapter, several multiplication algorithms and techniques are discussed. The
basic method of adding a number of partial products is common to all techniques; however,
they differ in complexity and speed, both of which are important factors in determining which
is best suited to the task at hand. The primary objective is to achieve a good balance between
speed and simplicity in the design and implementation of the 32-bit multiplier.
Speed is assessed by the reduction in time delay, and
simplicity by the reduction in gate complexity. These in turn are tied to
the cost incurred and the overall performance of the system. It is therefore important to
determine which broad category of multiplication techniques the required method falls under.
There are three main categories, namely serial, parallel and serial-parallel multipliers. Serial
multipliers include add-shift techniques and recoding techniques, while parallel techniques include
ROM networks, reduction methods and iterative cellular arrays. The main focus of this project is
serial multipliers, and the following sections discuss several techniques that have been
developed and used over the years.
3.1 Serial Multiplication Methods
Serial multiplication, and the architectures based on it,
use the add-shift method. The number of bits n determines the
multiplication time and complexity, which are proportional to the square of n ($n^2$). The
multiplication time and complexity therefore grow quadratically with n. In
serial multiplication, the multiplier bits are inspected sequentially, starting from the least
significant bit. If the bit is '1', the multiplicand is added to the most significant half of the
double-length accumulator, which is initially zero. The accumulator shifts one bit to the right
after every inspection, and
once all the bits have been inspected, the product is held in the accumulator. The trade-off
of this technique is that, although it is simple and uses few resources, the computation
becomes significantly longer as the size of the multiplicand and multiplier increases.
Improvements to this basic mechanism have been devised and are discussed in the
following sections.
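The add-shift procedure can be sketched in Python as follows. This is a behavioural model for unsigned n-bit operands under the assumption that the multiplier initially occupies the lower half of the double-length accumulator; the function name `shift_add_multiply` is illustrative and the sketch is not the design implemented later in the project.

```python
def shift_add_multiply(multiplicand, multiplier, n):
    """Multiply two unsigned n-bit integers by add-shift.

    Returns (product, number of additions performed).
    """
    acc = multiplier          # lower half: multiplier, upper half: zero
    additions = 0
    for _ in range(n):
        if acc & 1:                      # inspect the current least significant bit
            acc += multiplicand << n     # add multiplicand into the upper half
            additions += 1
        acc >>= 1                        # shift the whole accumulator right
    return acc, additions


product, additions = shift_add_multiply(13, 11, 4)
assert product == 143 and additions == 3   # 11 = 1011 has three 1 bits
```

The number of additions equals the number of 1 bits in the multiplier, which is the quantity that the recoding methods in the following sections try to reduce.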
3.1.1 Simple Multiplication Method
In the simple (direct) multiplication method, the multiplicand is added to the partial
product once for each multiplier bit with the value 1. The number of bits with the value 1
therefore equals the number of addition operations required, i.e. a multiplier containing three
1's needs 3 addition operations.
The method rests entirely on the detection of 0's and 1's: for a 0
the multiplier takes no action, and for a 1 it performs an addition. The
example below shows how the number of operations is determined:
X=101101010110
In the above example, a total of 7 1's are detected, which means that the multiplicand
is added to the partial product 7 times.
3.1.2 Booths Algorithm
Booth's algorithm is one of the most widely used algorithms for the multiplication
of two signed binary numbers in two's complement notation. The algorithm is of great
importance in computer architecture because it
preserves the sign of the result. The theory is based on the observation that a string of
identical bits in the multiplier only requires shifting, not addition. There are four main steps
in Booth's algorithm, and the method is founded on
appending an extra 0 next to the least significant bit of the multiplier (b). The multiplier
bits $b_i$ and $b_{i-1}$ are examined pair by pair, starting from the least significant bit, and the
multiplicand (a) is added to or subtracted from the partial product (m) depending on their values.
The multiplier is shifted one bit to the right at the end of each step until all of its bits have
been examined. The actions and conditions are summarized in Figure 12:
Value of bi    Value of bi-1    Action
0              0                Do Nothing
0              1                Add a
1              0                Subtract a
1              1                Do Nothing
Figure.12: Booth’s Algorithm operations for different combinations of bi and bi-1
A more concise way of representing this table is the expression (bi-1 - bi): if it is
0 no action is required, if it is +1 a is added, and if it is -1 a is subtracted. Figure 13
depicts the flow chart of the method by which Booth’s algorithm is generally carried out, following
which an example of the application of Booth’s algorithm is described.
Figure.13: Flow Chart for Booth’s Algorithm
Example:
Multiply 2 (0010) by -3 (1101) using Booth’s algorithm with 4-bit numbers. The
answer should be 1111 1010 1, i.e. the 8-bit product 1111 1010 (-6) followed by the extra
bit. The steps for this computation are shown below in Figure 14:
Figure.14: Booth’s Algorithm example
*note: The colored bits in each iteration are the bits used to determine what the next
action will be.
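The example can also be traced with a short register-level sketch in Python. The register names A, Q and Q-1 and the function name booth_multiply are assumptions made for illustration; the sketch follows the textbook form of Booth’s algorithm and is not the project’s hardware description. It does not handle a multiplicand equal to the most negative n-bit value.

```python
def booth_multiply(multiplicand, multiplier, n=4):
    """Multiply two signed n-bit integers with Booth's algorithm.

    Illustrative sketch: A is the accumulator, Q holds the multiplier and
    q_minus1 is the extra bit appended to the right of Q (assumed names).
    """
    mask = (1 << n) - 1
    a = 0
    q = multiplier & mask
    q_minus1 = 0
    m = multiplicand & mask
    for _ in range(n):
        pair = (q & 1, q_minus1)
        if pair == (1, 0):          # start of a run of 1's: A = A - M
            a = (a - m) & mask
        elif pair == (0, 1):        # end of a run of 1's: A = A + M
            a = (a + m) & mask
        # arithmetic shift right of A, Q and Q-1 taken as one register
        q_minus1 = q & 1
        q = (q >> 1) | ((a & 1) << (n - 1))
        a = (a >> 1) | (a & (1 << (n - 1)))
    result = (a << n) | q
    if result & (1 << (2 * n - 1)):  # interpret the 2n-bit result as signed
        result -= 1 << (2 * n)
    return result


print(booth_multiply(2, -3))   # 2 x (-3) with 4-bit operands
```

For the operands of the example, the final contents of A and Q are 1111 and 1010, which is -6 in two’s complement.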
A few examples that compare the number of operations when the multiplication is
carried out with the same multiplier using Booth’s algorithm and direct multiplication are shown below:
1. X = 1 1 1 0 1 1 1 1 0 1 1 0
When using direct multiplication, the number of operations required would be 9;
however, consider the application of Booth’s algorithm to this:
1 1 1 0 1 1 1 1 0 1 1 0
- + - + - +
From this we can see that it will only take 6 operations using Booth’s algorithm.
2. X = 1 0 1 0 0 1 0 1 0 1 0 1
When using direct multiplication, the number of operations required would be 6;
however, consider the application of Booth’s algorithm to this:
1 0 1 0 0 1 0 1 0 1 0 1
- + - + - + - + - + -
From this we can see that it will take 11 operations using Booth’s algorithm.
This shows that the pattern of the binary digits determines the number of operations
and therefore the speed. Booth’s algorithm is only beneficial when the 1’s of the multiplier occur
in long consecutive runs; for alternating patterns of 0’s and 1’s it requires more operations than
direct multiplication, which is a shortcoming of this technique. However, improvements to this have been
suggested, which are discussed in the following section.
3.1.3 Modified Booth Algorithm
The Modified Booth’s algorithm is considerably faster than the normal Booth’s
algorithm, almost twice as fast. This algorithm groups consecutive bits of one of the
two operands to form signed multiples, which decreases the total number of partial products and
ultimately increases the efficiency of the operation. The most popular modified Booth’s
algorithm is known as Radix-4, which scans the multiplier in strings of 3 bits using the
following steps:
1. Firstly, in order to ensure that n is even, the sign bit should be extended by 1 position
if need be.
2. Secondly, a ‘0’ is appended to the right of the multiplier’s least significant bit.
3. Thirdly, the value of each 3-bit group determines what the partial product will be. The
partial product can only take the value of 0, +X, -X, +2X or -2X, where X is the
multiplicand. The bits are grouped in threes such that each group overlaps by
one bit with the previous group. This process starts from the least significant bit, and
only 2 bits of the multiplier are used for the first group of 3, the third being the appended
‘0’. Figure 15 below shows the functional operations of the Booth encoder for Radix-4.
Figure.15: Radix-4 Booth encoding table
An example of the Radix-4 Booth’s algorithm is shown below in Figure 16 for the
multiplication of -73 (10110111) and 90 (01011010):
Figure.16: Radix-4 example
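The three steps of the Radix-4 method can be sketched in Python as follows. The sketch assumes that 90 is taken as the multiplier and -73 as the multiplicand (the text does not state which operand is recoded), and the function names are illustrative. Each digit is computed with the standard radix-4 Booth rule, low bit + middle bit - 2 x high bit, which yields the values 0, +1, -1, +2 and -2 listed above.

```python
def radix4_booth_digits(multiplier, n=8):
    """Recode a signed n-bit multiplier (n even) into radix-4 Booth digits.

    Returns the digits least significant first; each digit is one of
    -2, -1, 0, +1, +2. Illustrative sketch with assumed names.
    """
    bits = [(multiplier >> k) & 1 for k in range(n)]
    bits = [0] + bits                     # step 2: append a 0 right of the LSB
    digits = []
    for j in range(0, n, 2):              # step 3: overlapping groups of three
        low, mid, high = bits[j], bits[j + 1], bits[j + 2]
        digits.append(low + mid - 2 * high)
    return digits


def radix4_booth_multiply(x, y, n=8):
    """Sum the partial products digit * X * 4**j."""
    return sum(d * x * 4 ** j for j, d in enumerate(radix4_booth_digits(y, n)))


print(radix4_booth_digits(90))            # digits for 01011010
print(radix4_booth_multiply(-73, 90))
```

Only four partial products are formed for an 8-bit multiplier, which is half the number required when one bit is examined at a time.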
3.2 Recoding Multipliers
As discussed for direct multiplication, the multiplicand must be added for each multiplier digit
that has the value of one; therefore n operations are required for multipliers that have
n digits with the value of one. The primary objective of using recoding multiplication algorithms
is to simplify the system by reducing the number of operations required regardless
of how many digits with the value of 1 exist. The uniform shift of one and the uniform
shift of two are the two methods by which recoding multipliers carry out their operations.
The premise of recoding multipliers is similar to that discussed for Booth’s algorithm, i.e. both
addition and subtraction operations are carried out, based on the following two fundamental rules:
1. The addition of a multiplicand that has been moved i positions = a subtraction of that
multiplicand plus an addition of a multiplicand that has been moved i+1 positions, so they can be
interchanged.
2. The addition of a multiplicand that has been moved i+1 positions = two additions of a
multiplicand that has been moved i positions, so they can be interchanged.
3.2.1 Uniform Shift of One method
For a multiplier Y, the recoding of bit yi, as discussed before, depends on the values of
yi and yi+1. The final recoding result also depends on the result of recoding yi-1.
The operation fi that is performed as a result of the recoding principles mentioned
above is summarized in Figure 17:
Operation required    yi    yi+1    fi
None                  0     0       0
None                  0     1       0
Addition              1     0       1
Subtraction           1     1       -1
Figure .17: Operation Rules when depending on yi
*note: The above situations only occur when the recoding result of yi-1 has no bearing
on the recoding of yi.
From the above table, in the last scenario, when both digits have the value of 1, instead of
adding the multiplicand that has been moved i positions, that
multiplicand is subtracted and a multiplicand that has been moved i+1 positions is added. If
yi+2 has the value of zero, then instead of carrying out two addition
operations with the multiplicand moved i+1 positions, all that needs to be done is a single addition of the
multiplicand moved i+2 positions. As can be seen, however, the number of operations remains the same in this
scenario, because the two additions for yi and yi+1 are replaced
by an equal number of operations: an addition of the multiplicand moved i+2 positions and a subtraction
of the multiplicand moved i positions. The number of operations is actually reduced if yi+2 also
has the value of 1. In this case, instead of two additions at position i+1, a single
addition at position i+2 is sufficient, and instead of the two additions that are then needed
at position i+2, a single addition at position i+3 is sufficient; the same applies again if yi+3 has
the value of 1 too. This implies that as long as the run of 1’s continues, the total number of steps
required decreases drastically, enhancing the simplicity of the multiplier.
The carry propagation theory encapsulates the entire impact that the recoding of a bit has on the
bits that follow. The pseudo carry (ci) is defined as the operation that, as a result of
the recoding applied to yi-1, is pushed forward to position i. It takes the value of 1 in the event
that an addition operation is pushed forward and 0 if nothing happens. Figure 18 gives a detailed
description of the operations required during recoding.
yi    yi+1    ci    ci+1    fi
0     0       0     0       0
0     0       1     0       1
1     0       0     0       1
1     0       1     1       0
0     1       0     0       0
0     1       1     1       -1
1     1       0     1       -1
1     1       1     1       0
Figure .18: Operation Rules when depending on yi and carry variable is considered
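The table in Figure 18 can be applied bit by bit to recode a whole multiplier, as in the following Python sketch. It assumes an unsigned multiplier padded with two zero guard bits so that the last pseudo carry is absorbed; the names RECODE and recode_shift_of_one are illustrative and do not come from the project.

```python
# Figure 18 as a lookup table: (y_i, y_i+1, c_i) -> (c_i+1, f_i)
RECODE = {
    (0, 0, 0): (0, 0),
    (0, 0, 1): (0, 1),
    (1, 0, 0): (0, 1),
    (1, 0, 1): (1, 0),
    (0, 1, 0): (0, 0),
    (0, 1, 1): (1, -1),
    (1, 1, 0): (1, -1),
    (1, 1, 1): (1, 0),
}


def recode_shift_of_one(y, n):
    """Recode an unsigned n-bit multiplier into digits f_i in {-1, 0, +1}.

    Digits are returned least significant first. Two zero guard bits are
    appended above the most significant bit (an assumption of this sketch).
    """
    bits = [(y >> k) & 1 for k in range(n)] + [0, 0]
    carry = 0
    digits = []
    for i in range(n + 1):
        carry, f = RECODE[(bits[i], bits[i + 1], carry)]
        digits.append(f)
    return digits


digits = recode_shift_of_one(0b0111, 4)
print(digits)                                         # one subtraction, one addition
print(sum(f * 2 ** i for i, f in enumerate(digits)))  # value is unchanged
```

For the multiplier 0111, three additions are replaced by one subtraction at position 0 and one addition at position 3, which illustrates how a run of 1’s reduces the number of operations.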
For the uniform shift of one method, it is important to introduce two additional binary
variables, namely fi1, which represents either an addition or a subtraction operation, and fi0, which indicates
whether or not an operation is taking place. The value assignment for these variables is
summarized in Figures 19 and 20 below:
fi1 value    Indication
1            Subtraction operation required
0            Addition operation required
Figure .19: Operations rules for fi1
fi0 value    Indication
1            Operation is required
0            No operations required
Figure .20: Operations rules for fi0
Incorporating these more elaborately defined operation variables into Figure 18, the
following recoding table is obtained:
yi    yi+1    ci    ci+1    fi0    fi1
0     0       0     0       0      N/A
0     0       1     0       1      0
1     0       0     0       1      0
1     0       1     1       0      N/A
0     1       0     0       0      N/A
0     1       1     1       1      1
1     1       0     1       1      1
1     1       1     1       0      N/A
Figure .21: Operation rules for Uniform Shift of One method
A few equations and relationships can be derived on the basis of the above information. It
can be seen that fi1 takes the same value as yi+1 whenever fi0 has the value of 1. The others are as
follows, where \oplus denotes exclusive OR, \cdot denotes AND and + denotes OR:
\[ f_{i0} = c_i \oplus y_i \tag{18} \]
\[ c_{i+1} = y_i \cdot y_{i+1} + y_{i+1} \cdot c_i + y_i \cdot c_i \tag{19} \]
In conjunction with the above equations, Figure 22 shows the design of a uniform shift by
one recoding bit multiplier, the sequence in which it passes the logic gates, logic circuits and the
flow of operations from start to end.
Figure 22: Uniform shift by one recoding bit multiplier
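Equations 18 and 19 can be checked exhaustively against the recoding table in a few lines of Python. The sketch reads the operator in Equation 18 as exclusive OR, which is an assumption chosen because it reproduces the fi0 column of Figure 21, and it takes fi1 equal to yi+1 as stated in the text. The function names are illustrative.

```python
from itertools import product


def recode_logic(y_i, y_next, c_i):
    """Gate-level recoding for one bit position (illustrative sketch)."""
    f0 = y_i ^ c_i                                           # Eq. 18, read as XOR
    f1 = y_next                                              # valid only when f0 == 1
    c_next = (y_i & y_next) | (y_next & c_i) | (y_i & c_i)   # Eq. 19
    return f0, f1, c_next


def equations_preserve_weight():
    """Check y_i + c_i == f_i + 2 * c_i+1 for all eight input combinations."""
    for y_i, y_next, c_i in product((0, 1), repeat=3):
        f0, f1, c_next = recode_logic(y_i, y_next, c_i)
        digit = 0 if f0 == 0 else (-1 if f1 else 1)
        if y_i + c_i != digit + 2 * c_next:
            return False
    return True


print(equations_preserve_weight())
```

The check confirms that the recoded digit together with the outgoing pseudo carry always carries the same weight as the incoming bit and pseudo carry, so the recoding does not change the value of the multiplier.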
3.3.2 Uniform Shift of Two method
As discussed previously, the uniform shift of one method improves the speed of the
multiplier significantly, as it reduces the total number of operations in
most situations to a great extent. However, there is more room for speed improvement if multiple
multiplier bits are scanned in each cycle. Scanning two multiplier bits simultaneously could reduce
the number of operations by half, and three multiplier bits to one third. This is known as uniform
shift of multiples. For this project, only uniform shift of two will be discussed, which includes
non-overlapped scanning and overlapped scanning.
3.3.2.1 Non-Overlapped Scanning
In this method, the multiples selected by each pair of multiplier bits are added
to the partial product. A better way to visualize this is through an example where an
even word length of n = 2M, where M is the number of 2-bit groups examined, is assumed. For this example, M
is assumed to be 2, which means that there are a total of 4 bits. The two least significant bits,
y1 and y0, are scanned, and there are four possible routes that could be taken
depending on their values. The sums that result from these operations contain more than n bits,
because the operands have been shifted. The following rules determine those
operations:
1. If both y1 and y0 have the value of 0, then no multiples will be added.
2. If y1 has the value of 0 and y0 has the value of 1, then the multiplicand X is added
to the partial product.
3. If y1 has the value of 1 and y0 has the value of 0, then a multiple of the
multiplicand X (in this case 2X) is added to the partial product. 2X implies
that the multiplicand X is moved one place to the left.
4. If both y1 and y0 have the value of 1, both X and 2X are added to the partial
product.
More generally, multiplying a number by a factor of 2^i is equivalent to moving its
binary digits to the left by i positions, with zero digits coming in from the right. The following
equations capture this relationship, where the scanned bits y1 and y0 are represented by j, b_T is
the T-th bit of j, and T ranges from 0 to k. Equation 21 is the case j = 2 (y1 = 1, y0 = 0).
\[ j \times B = \sum_{T=0}^{k} b_T \times 2^{T} \times B \tag{20} \]
\[ j \times B = 2^{1} \times B = 2B \tag{21} \]
Following this addition operation, the accumulator and the multiplier are both moved
together, as if they were one unit, by 2 positions towards the right as per the following
equation:
\[ \mathrm{ACC} \times y = \mathrm{Carry}_{out} + S_n \ldots S_0 \,.\, y_{n-1} \ldots y_2 \tag{21} \]
A carry save adder and a carry propagate adder, which were discussed in Chapter 2, are
implemented in a non-overlapped scanning multiplier as per Figure 23. In the flow of operations
shown, the carry save adder first receives two inputs, which are reserved for the X
and 2X multiples. These multiples are generated by passing X and 2X through AND gates
with the y0 and y1 multiplier bits. The third input received by the carry save adder is
the accumulator’s existing partial product. The carry propagate adder receives as its inputs
the two carry save adder outputs and thus generates the sum for that iteration. The newly generated
partial product is moved two positions to the right after every iteration.
Figure .23: Non-overlapped multibit scanning multiplier
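The behaviour of the non-overlapped scanning multiplier can be sketched in Python as below. The sketch is purely arithmetic: it assumes unsigned operands, and instead of shifting the accumulator right by two positions it shifts each selected multiple left by the same amount, which gives the same product. The function name is illustrative.

```python
def multiply_two_bits_per_cycle(x, y, n=8):
    """Unsigned n-bit multiply (n even) scanning two multiplier bits per cycle.

    Illustrative sketch: the X and 2X multiples are gated by y0 and y1,
    as the AND gates in Figure 23 are described to do.
    """
    acc = 0
    for cycle in range(n // 2):
        y0 = (y >> (2 * cycle)) & 1
        y1 = (y >> (2 * cycle + 1)) & 1
        multiple_x = x if y0 else 0             # X gated by y0
        multiple_2x = (x << 1) if y1 else 0     # 2X gated by y1
        acc += (multiple_x + multiple_2x) << (2 * cycle)
    return acc


print(multiply_two_bits_per_cycle(234, 108))
```

An 8-bit multiplier is processed in four cycles instead of eight, at the cost of adding up to two multiples in each cycle.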
3.3.2.2 Overlapped Scanning
Through this technique it is possible to drastically reduce the number of multiplicand
multiples, which in turn decreases the total number of operations. The main idea behind this
algorithm is recoding by pairs, or overlapped scanning. The process starts when the
multiplier is split into paired-bit groups, and only one of these groups is scanned at a time.
As with other multipliers, either no operation happens or an addition or subtraction takes place.
The multiplicand in the addition or subtraction operations appears in multiples of 2 (2 times X, 4
times X, etc.). The multiple is obtained by moving the multiplicand from its position of entry in the
adder to the left by either 1 position or 2 positions from the reference bit, which is the low order bit
in the sequence. After this, the partial product that is obtained is moved 2 positions to the right,
as is the multiplier.
yi+1    yi    Supposed to Add    Actually Added
0       0     No operation       No operation
0       1     +X                 +2X
1       0     +2X                +2X
1       1     +3X                +4X
Figure .24: Operation Rules for recoding by pairs
The above figure encapsulates the rules for recoding by pairs. What it implies is that if the
first bit has the value of 1, then an error of X is incurred in the partial product, which is
corrected when the preceding pair is processed and 4 times X is subtracted from the partial product,
where X is the multiplicand. The final set of modified rules is shown in the figure below:
yi+2    yi+1    yi    Operation
0       0       0     No operation
0       0       1     +2X
0       1       0     +2X
0       1       1     +4X
1       0       0     -4X
1       0       1     -2X
1       1       0     -2X
1       1       1     No operation
Figure .25: Modified Operation Rules for recoding by pairs
The least significant bit is assumed to have a value of zero during the computation, in which case
the initial partial product is zero. If the least significant bit has the value of 1, then the initial
partial product is equal to the multiplicand. The speed with which this algorithm carries out
operations, especially for larger bit widths, is precisely why it is preferred for this project.
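The modified rules of Figure 25 can be sketched in Python as follows. This is one reading of the rules, stated here as an assumption: the overlapping groups start at bit positions 0, 2, 4 and so on, the multiples in the table are applied at the weight of the lowest bit of each group, and the least significant bit is handled through the initial partial product as described above. The multiplier is taken as non-negative, so a 0 is assumed above its most significant bit. All names are illustrative.

```python
# Figure 25 as a lookup table: (y_i+2, y_i+1, y_i) -> multiple of X
PAIR_RULES = {
    (0, 0, 0): 0, (0, 0, 1): 2, (0, 1, 0): 2, (0, 1, 1): 4,
    (1, 0, 0): -4, (1, 0, 1): -2, (1, 1, 0): -2, (1, 1, 1): 0,
}


def multiply_recoding_by_pairs(x, y, n=8):
    """Multiply x by a non-negative n-bit multiplier y (n even).

    Illustrative sketch of overlapped scanning under the assumptions
    stated in the accompanying text.
    """
    bits = [(y >> k) & 1 for k in range(n)] + [0]   # y_n = 0 (assumption)
    partial = x if bits[0] else 0    # LSB = 1: start from the multiplicand
    bits[0] = 0                      # LSB is then treated as 0
    for i in range(0, n, 2):
        multiple = PAIR_RULES[(bits[i + 2], bits[i + 1], bits[i])]
        partial += (multiple * x) << i
    return partial


print(multiply_recoding_by_pairs(234, 108))
```

Only one addition or subtraction is performed for each pair of multiplier bits, and groups of three equal bits require no operation at all.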
3.4 Implementing the Overlapped Scanning multiplier algorithm
The main components of this system are the shifting and complementing circuits, the adder,
accumulator, multiplier register and decoder, as shown in Figure 26. Four control signals (S1 – S4) are
generated after the three least significant bits of the multiplier are decoded.
The first of these control signals, S1, is the operation control signal that is input
to the adder. It has the task of enabling or disabling the results from the shifting and complementing
circuits. The S2 signal is the addition/subtraction control signal that, as the name
suggests, indicates which of the two operations is to be performed. S3 is the signal
that shifts the bits one position to the left (one-bit shift), while S4 is the signal that shifts
the bits two positions to the left (two-bit shift). The decoding operations take place as per Figure
27.
Figure .26: Schematic for recoding by multiplier pairs
yi+2    yi+1    yi    Operation    S1    S2     S3     S4
0       0       0     0            0     N/A    N/A    N/A
0       0       1     +2X          1     0      1      0
0       1       0     +2X          1     0      1      0
0       1       1     +4X          1     0      0      1
1       0       0     -4X          1     1      0      1
1       0       1     -2X          1     1      1      0
1       1       0     -2X          1     1      1      0
1       1       1     0            0     N/A    N/A    N/A
Figure .27: Decoding Operations Truth Table
From this we can infer that S1 has the value of 1 unless all three bits are equal (all 0 or
all 1), in which case it is 0. S2 always takes the same value as yi+2, provided that the
S1 signal has the value of 1. If exactly one of the two bits yi+1 and yi is 1, then S3 has the value of 1,
but if both bits are 0 or both are 1 then it is 0; S4 takes the opposite value of S3.
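The decoding rules inferred from Figure 27 can be written as a small Python function. The function name decode and the use of None for the N/A entries are illustrative assumptions; the project itself realises the decoder with logic gates.

```python
def decode(y2, y1, y0):
    """Return (S1, S2, S3, S4) for the bits (y_i+2, y_i+1, y_i).

    None marks the N/A entries of Figure 27. Illustrative sketch only.
    """
    s1 = 0 if y2 == y1 == y0 else 1     # no operation when all bits are equal
    if not s1:
        return (0, None, None, None)
    s2 = y2                             # 1 = subtraction, 0 = addition
    s3 = y1 ^ y0                        # one-bit shift (2X)
    s4 = 1 - s3                         # two-bit shift (4X)
    return (s1, s2, s3, s4)


for bits in [(0, 1, 1), (1, 0, 0), (1, 1, 0), (1, 1, 1)]:
    print(bits, decode(*bits))
```

Evaluating the function for all eight input combinations reproduces the rows of the truth table.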
Chapter 4- Logic Circuits and Modules
Many different modules and circuits have been used in order
to carry out this project effectively. These include decoders, control gates, left-bit shifters,
right-bit shifters, accumulator and complementing circuits.
4.1 Decoders
A decoder is defined as a circuit that converts a code into a set of signals,
which is essentially the reverse of encoding. It includes different logic gates such as AND, OR and
XOR that take different inputs and generate a certain number of control signals. In this project, the
inputs are yi+2, yi+1 and yi, which generate 4 signals (S1 – S4). A schematic of the decoder used for
this project, with its components, inputs and outputs, is shown in Figure 28.
Figure .28: Gate designs in a recoding by pairs decoder
4.2 Control Gates
Control gates are memoryless circuits which generate an output based solely on the
combination of their inputs, each of which can be 0 or 1 at a given time. They have no feedback, and any
change to the signals being fed to them alters the output signals immediately. The control
gate used in this project is a chain of AND gates, which works on the following principle: only if
both inputs to an AND gate are 1 will the output signal be 1; otherwise it will be 0.
In the circuit implemented for this project, there is a 34-bit control gate circuit which receives
its inputs from the decoder and from the complementing circuit’s output.
4.3 Complementing circuits
The complementing circuit used in the project is a chain of XOR (Exclusive OR)
gates that operate on the following principle: each gate receives two inputs and has one output that is
their exclusive disjunction. If exactly one of the input signals has a value of 1, the
output signal is 1, but if both are 0 or both are 1, the output signal is 0. In this project, a
34-bit complementing circuit is used, as shown in Figure 29, which receives the signal S2 from
the decoder and the output of the left-bit shifting circuit.
Figure .29: 34-bit Complementing Circuit
4.4 Accumulator
An accumulator is defined as a register which serves as a temporary,
intermediate storage unit for the logic and arithmetic results of the computer’s CPU. If
accumulators did not exist, it would be necessary to write each intermediate result
to the main memory, which would be very time consuming, as accessing the main memory over
and over again in order to read the results is a much slower process than reading the
results from an accumulator because of the controller overhead involved in reading and writing
memory elements.
The primary purpose that the accumulator register serves in this project is the accumulation
of the intermediate results. The value in the accumulator is initially zero and grows as numbers
enter it from the CLA unit. The result is stored in the accumulator and the multiplier register
once all the numbers have gone through the necessary operations. In Figure 30 below, it can be
seen that the extension bits (E) and load_acc, which is the binary control input of the accumulator,
are fed into the accumulator, whose contents ultimately exit to the 32-bit multiplier register Q. If the input
signal load_acc is 1, then the data is added to the accumulator; otherwise the
value in the accumulator stays as it is.
Figure .30: Accumulator Unit (labels: E, Load_acc, Accumulator, To Q register)
4.5 Shift Left Register
A shift left register is a circuit that moves the data in the opposite
direction to the control signal flow (towards the left) by either one or two positions, and the output
obtained is a multiple of 2 of the multiplicand. This register is enabled by the S3 (one-bit shift left
control signal) and S4 (two-bit shift left control signal) control signals. Figure 31 below shows the
shift left register circuit used in the 32-bit multiplier. The 32-bit input is denoted by A32.
As discussed previously, the control signal S3 comes from the decoder; if it has the value of
1 then the shifting circuit moves the input to the left by one bit position, and if it has the value of 0 then no
one-bit shift occurs. Similarly, if the control signal S4 has the value of
1 then the shifting circuit moves the input to the left by two bit positions, and if it has the value of 0 then no
two-bit shift occurs. In both cases the output is a total of 34 bits, which then serves as the input of the
complementing circuit discussed in section 4.3.
Figure .31: Shift left register
The very first least significant bit in the figure above is set at 0 while the others are
dependent on the combination of the inputs received and the control signals.
4.6 Shift Right Register
A shift register is required in order to carry out two fundamental tasks: storing the data and
moving it subsequently. It consists of a group of flip-flops that each stores a single binary bit and
then shifts that data from one flip-flop to the other within itself or outside it. A shift right register
moves the bits towards the right, one or multiple bits at a time in the direction of the control signal.
The multiplier register and the accumulators behave as the shift right registers in this project and
they move two bits in each transition.
Figure 32: Detailed circuit for recoding by multiplier pairs
Chapter 5: Designing the Multiplier
A hierarchical modeling methodology has been applied in this project in order to design a
top module multiplier. A carry propagate adder, 3 levels of carry save adders and 4 recoding logic
modules are combined in order to design the high speed recoding multiplier, and an accumulator
and a multiplier register complete the multiplier. This multiplier operates much faster than one
based on a ripple-carry adder due to the extensive CLA circuitry deployed along with considerations
for propagation delays.
5.1 Logical Circuits Schematic
The gate-level logic circuit schematic of the multiplier and the block diagram of the
recoding logic components are shown in Figures 33 and 34 respectively; they essentially consist
of shifting, logic and complementing circuits. These figures assume that 8 bits are
recoded at a time for a multiplicand of 8 bits, hence a total of 17 bits are
generated including the sign bit. This can be understood better if the 17 bits are seen as 10
output bits, one sign bit and the rest sign extension bits. The results generated are dependent
on the three least significant bits of the multiplier. The required number of concurrent recoding
logic components is 4, since the multiplier is split into 8 bits in each component, which adds up to 32
bits in total. They also play an important part in the 32-bit multiplication process. Four control
signals are generated from the decoder, which then become the inputs of the recoding logic
components. The width of each of the output signals can be generalized by the following
equation, where x is the module number, n represents the number of bits and m is the number
of operands:
\[ \text{Width of } x^{\text{th}} \text{ module output} = (n + 2m)_x + 1 \text{ sign bit} \tag{22} \]
e.g. \[ \text{Width of } x^{\text{th}} \text{ module output with 2 operands} = (n + 4)_x + 1 \text{ sign bit} \]
Figure .33: Internal resistor-transistor logic circuit schematic of the multiplier
Figure .34: Block diagram of the recoding logic component circuits
From the above figures and details, it can be seen that 4 recoding logic components and 4
operands are necessary in order to recode 8 bits at a time. Each of the components generates a total
of n+8 bits plus one sign bit. The outputs of the recoding logic components are
characterized in the figure below:
Module    Output
1st       6 bit sign extension + 1 sign bit + (x+2) (exiting from the control gates)
2nd       4 bit sign extension + 1 sign bit + (x+2) (exiting from the control gates)
3rd       2 bit sign extension + 1 sign bit + (x+2) (exiting from the control gates)
4th       1 sign bit + (x+2) (exiting from the control gates) + 6 bits
Figure .35: Outputs of each recoding logic component
5.2 8-bit Multiplication Model
It has already been discussed how each recoding logic component produces a total of
17 bits including the sign bit. The outputs of the first three components are fed to the first carry save adder,
after which the 17-bit sum and carry vectors are obtained and stored so that they can serve as the
inputs for the next carry save adder. Simultaneously, the 4th recoding logic module generates its 17
bits and feeds them to the next carry save adder together with the output from the accumulator.
This is then fed to the carry look-ahead adder, and its output is subsequently fed to the accumulator,
which carries it to the third carry save adder. This entire process is depicted in Figure 36. Since
there are 8 bits being recoded at a time, the 8 most significant bits of the output would be in the
accumulator whereas the 8 least significant bits would be in the multiplier register at the end of
the 4 transition cycles.
Figure .36: 8-bit multiplication unit using Recoding by pair's algorithm
From the figure, it can be seen that the ALOAD signal controls the 8 least significant bits
that go into the multiplier register. The first three of these bits are fed to the decoder, which then
outputs the 4 control signals S1-S4. The recoding logic components generate the 17 bits once they
obtain the 8 bits that are loaded in the multiplier register; these then serve as the input of the adder
component in order to give the final output. In the figure above, S21 serves as an input of the
second carry save adder along with S22, S23 and S24. In order to better understand the workings of
the multiplier, an example will be discussed.
X (8 bit multiplicand) = 11101010 (234), Y (9 bit multiplier) = 001101100 (108)
1st Transition Cycle: The 1st decoder takes in the y0-y2 bits, the 2nd decoder takes in the y2-y4
bits, the 3rd decoder takes in the y4-y6 bits and the 4th decoder takes in the y6-y8 bits, and the
control signals are generated accordingly.
1st Decoder: y0 = 0, y1 = 0 and y2 = 1, which produces the control signals S1 = 1, S2 = 1, S3
= 0 and S4 = 1. These control signals indicate that the error will be fixed by subtracting 4 times
the multiplicand from the partial product. The multiplicand is moved towards the left by two bit positions and
then complemented, which gives the output 11111110001010111. The sign extension bits are the
first 6 bits, which are a replication of the sign bit that resulted from the gating of the control signals S1
and S2 of the corresponding decoder by the AND gates.
2nd Decoder: y2 = 1, y3 = 1 and y4 = 0, which produces the control signals S1 = 1, S2 = 0, S3
= 0 and S4 = 1. These control signals indicate that the error will be fixed by adding 4 times the
multiplicand to the partial product. The multiplicand is moved towards the left by two bit positions and
passes through the complementing circuit unchanged (S2 = 0), which gives the output
00000111010100000. The sign extension bits are the last 2
bits and the first 4 bits, which are a replication of the sign bit.
3rd Decoder: y4 = 0, y5 = 1 and y6 = 1, which produces the control signals S1 = 1, S2 = 1, S3
= 1 and S4 = 0. These control signals indicate that the error will be fixed by subtracting 2 times
the multiplicand from the partial product. The multiplicand is moved towards the left by one bit position and
then complemented, which gives the output 11110001010111111. The sign extension bits are the
last 4 bits and the first 2 bits, which are a replication of the sign bit.
4th Decoder: y6 = 1, y7 = 0 and y8 = 0, which produces the control signals S1 = 1, S2 = 0, S3
= 1 and S4 = 0. These control signals indicate that the error will be fixed by adding 2 times the
multiplicand to the partial product. The multiplicand is moved towards the left by one bit position and
passes through the complementing circuit unchanged (S2 = 0), which gives the output
00111010100000000. The sign extension bits are the last 6
bits, which are a replication of the sign bit.
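The four 17-bit outputs of this transition cycle can be reproduced with a short behavioural sketch in Python. The model is an assumption derived from the outputs listed above: the multiplicand is shifted by one or two positions according to S3 and S4, placed two bit positions higher for each successive module, and every bit is then XORed with S2 and gated by S1. The function name module_output and the zero-based module index are illustrative.

```python
WIDTH = 17  # output width of each recoding logic module in the 8-bit model


def module_output(x, s1, s2, s3, s4, module):
    """Return the 17-bit output string of recoding logic module 0..3.

    Behavioural sketch under assumptions stated in the accompanying text;
    it is not the gate-level circuit of the project.
    """
    shifted = (x << 1) if s3 else (x << 2) if s4 else 0   # 2X or 4X
    value = shifted << (2 * module)          # each module sits 2 bits higher
    if s2:                                   # XOR with S2: ones' complement
        value ^= (1 << WIDTH) - 1
    if not s1:                               # control gates: AND with S1
        value = 0
    return format(value, "017b")


X = 0b11101010                               # multiplicand of the example (234)
signals = [(1, 1, 0, 1), (1, 0, 0, 1), (1, 1, 1, 0), (1, 0, 1, 0)]
for module, (s1, s2, s3, s4) in enumerate(signals):
    print(module + 1, module_output(X, s1, s2, s3, s4, module))
```

With the control signals of the four decoders, the sketch yields the same bit strings as those given for the 1st, 2nd, 3rd and 4th decoders above.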
The results from the first three recoding logic components are fed to the first carry save
adder, after which the sum and carry vectors are obtained. The sum and carry vectors are stored
so that they can serve as the inputs of the next carry save adder along with the output from the 4th
recoding logic module. The addition in the second carry save adder then generates a new
sum and carry vector and feeds them to the third carry save adder together with the output from
the accumulator. The result is then fed to the carry propagate adder to get the final output,
00110001010101111. The two least significant bits of the accumulator are moved to the right
into the multiplier register, where they take up the positions of the two most significant bits.
The two most significant bits of the accumulator are compared in order to
determine which bits move forward to the two-bit extension register, which marks the end
of the 1st transition cycle.
In the 2nd transition cycle, the control signals are generated by the 4 decoders after the three least
significant bits from the multiplier register are compared. The results from the decoders and the final
result of the 2nd transition cycle are described in the figure below:
Stage                          Output
1st Decoder                    00000001110101000
2nd Decoder                    11111100010101111
3rd Decoder                    00001110101000000
4th Decoder                    11000101011111111
Final Carry Propagate Adder    10011110101000100
Figure .37: Outputs for each level in the 2nd transition cycle
This final result is loaded onto the accumulator and moved to the right by two bit positions
towards the multiplier register. The same process is repeated for the 3rd and 4th transition cycles
and their results are as per the following figure:
Cycle                   Output
3rd Transition Cycle    01010110010001100
4th Transition Cycle    11100001011100111
Figure .38: Outputs for the 3rd and 4th transition cycles
The output that was loaded into the accumulator at the end of the 4th transition cycle was
00111000011100010 while the output in the multiplier register was 10111000. The final output is
determined by taking the 7 least significant bits from the accumulator and the 8 bits from the
multiplier register which puts the final multiplication result at 110001010111000.
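The assembly of the final result from the two registers can be checked with a few lines of Python. The register contents are copied from the text above; the function name assemble_product is an illustrative assumption.

```python
def assemble_product(accumulator, multiplier_register):
    """Join the 7 least significant accumulator bits with the 8 bits of Q."""
    return accumulator[-7:] + multiplier_register


accumulator = "00111000011100010"    # accumulator after the 4th transition cycle
multiplier_register = "10111000"     # multiplier register after the 4th cycle

product_bits = assemble_product(accumulator, multiplier_register)
print(product_bits, int(product_bits, 2), 234 * 108)
```

The assembled bit string equals the binary representation of 234 x 108, which confirms the result of the worked example.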
5.3 Final Multiplier Design
The final design follows the same method as the one discussed in section 5.2. It is depicted
in Figure 39, which shows the final design of the 32-bit high speed multiplier that uses the
recoding by pairs algorithm. 4 transition cycles are required for the 8 bits of input
in the example in 5.2, but in this final design the total number of cycles required is 16, as
32 bits are required.
Figure .39: High speed multiplier of 32 bits which uses a recoding by pairs algorithm (Final
Design)
Chapter 6: DETAILS OF IMPLEMENTATION
6.1 Introduction
This chapter describes the implementation of Booth’s algorithm on an FPGA, including the
circuit diagram and hardware components used to build the project. It guides the reader
step by step through building this project. The machine is powered by a 3.3 V, 500 mA power supply. The
FPGA board used for the project is the Arty Z7.
6.2 List of Components
Items                 Qty
Arty Z7               1
7-segment display     4
NPN transistor        16
Resistor-470 ohms     16
Wires                 40
Breadboard            1
Figure 40: Table of components
6.3 Description of hardware component
6.3.1 Digit seven segments
The 7-segment display used is a 4-digit 7-segment display. Pins 1, 2, 6 & 8 are the common anodes.
Pins 14, 16, 13, 3, 5, 11, 15 & 7 are the pins corresponding to the LEDs.
Figure 41: the common anode 7-segment
6.3.2 Transistors:
An NPN transistor is used in the project to connect each common anode of the 7-segment display to the
positive supply. It was chosen because an NPN transistor avoids some of the base-to-emitter
voltage drop, which is as little as 100 mV.
The transistor is used here as a switch to control the positive supply going to the display. The
base of the transistor is connected to the FPGA through a 470 ohm resistor, which is enough to operate
the transistor in saturation and cut-off modes.
6.3.3 ArtyZ7
The Arty Z7 is a development kit designed around the Zynq-7000 from Xilinx. It combines a
dual-core, 650 MHz ARM Cortex-A9 processor with Xilinx 7-series Field Programmable Gate Array
(FPGA) logic. This is the core of the project, where the multiplication algorithm is implemented.
Figure 42: Arty Z7
6.4 Schematic Diagram
Figure 43: schematic of the circuit
Figure 44: circuit before connecting to the ArtyZ7
6.5 Functional Description of the Project:
The Arty Z7 is a programmable SoC (system on chip) built around a dual-core ARM Cortex-A9
processor with a 650 MHz clock rate, which makes it a powerful processor. It also has
four buttons and two switches, which are used in this project to get user input. Switch one
selects between input one and input two, and switch 2 selects the sign of the input that is
selected by switch one. It sends a logic one (high signal) to activate the digit on the
seven-segment display. Switch 3 is assigned to change the value of the input and perform the
multiplication. Button 0 adds one to the first digit of the four-digit seven-segment display. Each
time it is pressed, it increments the value until it reaches 9 and then starts from zero again, by
sending a high signal to the emitter of the transistor that is connected to the digit. Button 1
controls the second digit and button 2 controls the third digit of the input that is selected by
switch one. Button 3 is assigned to perform the multiplication.
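The one-increment-per-press behaviour described above can be sketched in Python as follows. This is a behavioral model under stated assumptions, not the project's VHDL: the class names, method names and the single wrapping digit are introduced here for illustration, while the lead/follow register pair mirrors the edge-detection signals that appear in the appendix listing.

```python
class RisingEdgeDetector:
    """Two registers in series; the pulse is high for one clock after a 0 -> 1 change."""

    def __init__(self):
        self.lead = 0
        self.follow = 0

    def clock(self, level: int) -> int:
        # Both registers update on the same clock edge.
        self.follow = self.lead
        self.lead = level
        return self.lead & (1 - self.follow)


class DigitCounter:
    """One decimal digit that wraps from 9 back to 0."""

    def __init__(self):
        self.value = 0
        self.detector = RisingEdgeDetector()

    def clock(self, button_level: int) -> None:
        if self.detector.clock(button_level):
            self.value = (self.value + 1) % 10


digit = DigitCounter()
# Hold the button for five clock cycles, then release it for two.
for level in [1, 1, 1, 1, 1, 0, 0]:
    digit.clock(level)
print(digit.value)
```

Holding the button down for many clock cycles still produces a single pulse, so the digit advances once per press rather than once per clock.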
Figure 45: circuit after connecting to the ArtyZ7
6.6 Software Description:
6.6.1 Introduction
After the circuit is designed, it is modeled in VHDL. Since the circuit does not require the
use of memory and the output depends only on the present input, the combinational circuit
design process is used for the implementation of the circuit. This section describes the
implementation of the various components and their use.
The code uses the standard packages and libraries from IEEE. The clock
frequency is set to the default of 100 MHz. The code consists of several components, including
the signed multiplier, seven-segment displays, BCD display, hex to seven-segment converters and
signed to SLV converters. The following are the components in the code:
6.6.2 Signed Multiplier:
This section contains the implementation of Booth’s algorithm. The signals input_1 and
input_2 provide the two numbers for multiplication. The clock signal synchronizes the multiplier
with the other components in the circuit. The reset is always set to zero. When the start signal
is high, multiplication occurs; when the start signal is low, the multiplication is complete and
the result is produced on the product output.
Figure 46: Signed multiplier
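The add, subtract and shift sequence of Booth’s algorithm can be illustrated with a short Python model. It is a sketch only: the register names a_reg, s_reg and p_reg follow the naming in the appendix listing, but the function itself, its arguments and the use of Python integers as registers are assumptions made here for illustration. Like many simple Booth implementations, it does not handle the most negative multiplicand.

```python
def booth_multiply(input_1: int, input_2: int, input_size: int = 8) -> int:
    """Radix-2 Booth multiplication on (2 * input_size + 1)-bit registers."""
    width = 2 * input_size + 1
    mask = (1 << width) - 1
    in_mask = (1 << input_size) - 1

    # Multiplicand and its negation sit in the upper input_size bits.
    a_reg = ((input_1 & in_mask) << (input_size + 1)) & mask
    s_reg = (((-input_1) & in_mask) << (input_size + 1)) & mask
    # Multiplier sits above one extra low bit that starts at 0.
    p_reg = (input_2 & in_mask) << 1

    for _ in range(input_size):
        low = p_reg & 0b11
        if low == 0b01:
            p_reg = (p_reg + a_reg) & mask   # add the multiplicand
        elif low == 0b10:
            p_reg = (p_reg + s_reg) & mask   # subtract the multiplicand
        sign = p_reg >> (width - 1)
        p_reg = (p_reg >> 1) | (sign << (width - 1))  # arithmetic right shift

    result = p_reg >> 1                      # drop the extra low bit
    if result >> (2 * input_size - 1):       # reinterpret as a signed value
        result -= 1 << (2 * input_size)
    return result


print(booth_multiply(3, -4, 4))
print(booth_multiply(-999, 999, 12))
```

With input_size set to 12 the model covers the range of -999 to 999 that the inputs are limited to in the appendix code.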
6.6.3 Signed_to_SLV:
Following the multiplier is the signed to SLV component. It is used to generate the signal that
lets the segment converter display the sign of the number. There are three signed to SLV converters,
one each for A, B and the result. Each looks at its number and generates 1 if the number is smaller
than zero. The output goes to a multiplexer, which generates either ‘0111111’ or ‘1111111’. The
signals are stored in a register and passed on to the segment converter for display.
Figure 47: signed to std logic vector convertor
Figure 48: signed to SLV details circuit
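A minimal Python sketch of this sign handling is given below. The negative flag and the two segment patterns come from the description above; returning the magnitude of the number as the converted vector is an assumption made here (the text does not state what the SLV output carries), and the function names are hypothetical.

```python
def signed_to_slv(value: int, bit_depth: int = 12):
    """Return (bit string, neg flag) for a signed input.

    Assumption: the vector carries the magnitude, so that the sign can be
    shown on a separate display digit.
    """
    neg = 1 if value < 0 else 0
    magnitude = abs(value) & ((1 << bit_depth) - 1)
    return format(magnitude, "0{}b".format(bit_depth)), neg


def sign_pattern(neg: int) -> str:
    """Multiplexer output: one pattern when negative, the other otherwise."""
    return "0111111" if neg else "1111111"


slv, neg = signed_to_slv(-25)
print(slv, neg, sign_pattern(neg))
```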
6.6.4 BCD Display:
Next follows the BCD display component. This component is a binary to BCD converter, which
converts the 12-bit binary input from the signed to SLV converter to 4-bit BCD signals. It works by
shifting bits from one shift register to another, MSB first. There are three binary to BCD
converters in the code, one each for input A, input B and the result.
Figure 49: BCD display
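One common way to build such a shift-based converter is the shift-and-add-3 ("double dabble") method, sketched below in Python. Note the assumption: the text only states that bits are shifted MSB first, so the add-3 correction step and all names in the sketch are illustrative rather than taken from the project's code.

```python
def binary_to_bcd(value: int, n_bits: int = 12, digits: int = 4):
    """Convert an unsigned binary value to BCD digits, least significant first."""
    bcd = 0
    for i in range(n_bits - 1, -1, -1):          # MSB first
        # Before each shift, add 3 to every BCD digit that is 5 or more.
        for d in range(digits):
            if ((bcd >> (4 * d)) & 0xF) >= 5:
                bcd += 3 << (4 * d)
        # Shift the next binary bit into the BCD register.
        bcd = (bcd << 1) | ((value >> i) & 1)
    return [(bcd >> (4 * d)) & 0xF for d in range(digits)]


print(binary_to_bcd(999))    # digits of 999, ones place first
print(binary_to_bcd(4095))   # largest 12-bit input
```

Each returned digit fits in 4 bits, which matches the width of the BCD outputs described above.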
6.6.5 Hex_to_7_Seg:
The hex to seven-segment converter takes the hex input from the BCD display and converts it to
the seven-segment output, which is fed into the segment controller. There is a total of 13 hex to
seven-segment converters in the code. Each consists of predefined outputs for the set of inputs.
Figure 50: Hex to 7 segment component
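The idea of a predefined output for each input can be sketched as a lookup in Python. The bit order (g, f, e, d, c, b, a) and the active-low polarity are assumptions inferred from the sign patterns quoted in Section 6.6.3, where ‘0111111’ lights only the middle segment and ‘1111111’ blanks the digit; the table and function names are hypothetical and only decimal digits are covered.

```python
# Segments that are lit for each decimal digit (conventional digit shapes).
SEGMENTS_LIT = {
    0: "abcdef", 1: "bc", 2: "abdeg", 3: "abcdg", 4: "bcfg",
    5: "acdfg", 6: "acdefg", 7: "abc", 8: "abcdefg", 9: "abcdfg",
}


def encode_segments(lit: str) -> str:
    """Active-low pattern in the assumed bit order g f e d c b a."""
    return "".join("0" if seg in lit else "1" for seg in "gfedcba")


def hex_to_seven_seg(digit: int) -> str:
    return encode_segments(SEGMENTS_LIT[digit])


for n in range(10):
    print(n, hex_to_seven_seg(n))
```

Under these assumptions the same encoder reproduces the two sign patterns: a lit g segment alone gives 0111111 and no lit segments gives 1111111.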
6.6.6 Segment Controller:
This section of the code controls the output on the seven-segment display. The input signals
are the clock, the refresh rate and the digit inputs derived from the output of the multiplier; the
output connects directly to the pins of the seven-segment display. The segment refresh rate is set
to 50 kHz. The segment controller works by toggling between the different pins of the
seven-segment display to display the digits. This section consists of several processes. They are:
The sign_proc process handles the sign to be displayed for A, B and the result,
based upon the value stored in the sign register. On every rising edge of the clock, it reads the
value of the sign register and produces the corresponding output to display the sign.
The add_sub_proc process accounts for the digits which are displayed for A and B. On
every rising edge of the clock, if one_edge_start is 1 then it increments the LSB of the output
displays. Similarly, if ten_edge_start is 1, it controls the center digit, that is, the digit
at the tens place. If hundred_edge_start is high, it controls the MSB.
The counter_proc process is a counter that creates the desired multiplexing rate and shifts the
toggle bits.
The toggle_proc process toggles between the various seven-segment displays by selecting the
appropriate display output based on the toggle bit.
Figure 51: segment controller
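The interaction between the counter and the toggle bits can be sketched in Python as below. The 100 MHz clock and 50 kHz refresh rate are the generic defaults of the design; the interpretation that the toggle bit advances once per refresh period, the one-hot rotation across 16 digits, and all names are assumptions made for illustration.

```python
def scan_digits(clock_cycles: int,
                clock_frequency: int = 100_000_000,
                segment_refresh: int = 50_000,
                n_digits: int = 16):
    """Return (divisor, history of one-hot toggle values)."""
    divisor = clock_frequency // segment_refresh   # clock cycles per digit slot
    digit_mask = (1 << n_digits) - 1
    counter = 0
    toggle = 1                                     # one-hot: digit 0 selected
    history = [toggle]
    for _ in range(clock_cycles):
        counter += 1
        if counter == divisor:
            counter = 0
            # Rotate the single set bit to select the next digit.
            toggle = ((toggle << 1) | (toggle >> (n_digits - 1))) & digit_mask
            history.append(toggle)
    return divisor, history


divisor, history = scan_digits(2000 * 16)
print(divisor, [bin(t) for t in history[:4]])
```

With the default generics the counter divides the clock by 2000, and after 16 digit slots the selection returns to the first digit.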
Chapter 7: SUMMARY AND CONCLUSION
In this project, a high speed 32-bit multiplier is studied, designed, modeled,
simulated and implemented. The recoding by pairs algorithm is used to make the multiplier faster.
CSAs are used because the multiplication process involves the addition of multiple n-bit
numbers. A two-level CLA adder is modeled and designed to speed up the addition process.
The two-level CLA adder unit is used because it is approximately 50% faster than the
one-level CLA adder unit.
The major goal of this project is to design a multiplier that operates much faster than a
regular multiplier. This multiplier performs high speed 32-bit multiplication with moderate
complexity and high performance. Various schemes and methodologies are presented, however,
and the ideal selection depends on the individual needs and requirements of the
designer.
A few problems were encountered during the completion of the project, especially in
the first stages of simulation and synthesis. One by one, these problems were solved
until the expected results were obtained. Several root causes were identified along the way.
The problems and their corresponding solutions are described below.
First, conflicting multiplication results were obtained during simulation when
certain combinations of the multiplier triplets were being examined. For instance,
multiplier bits with the format 000 or 111 produced inaccurate results. It was observed that
control signal S2 out of decoders took the unknown value of x, which only matched what
the decoder function table (Figure 27) indicated. In fact, this was wrong. Control signal S2
turned out to be as important as control signal S1 for the above mentioned combinations of
multiplier triplets. To resolve this problem, the decoder function table was modified and
S2 was assigned a value of "0" for multiplier triplets like 000 and 111. The decoder was
then re-designed and tested for functionality.
Second, also during simulation, the last 32 of the 64 bits in the final multiplication result
were shown in complement form, that is, instead of 0's there were 1's and
vice versa. The error was caused by the right shift input to the accumulator, which was zero
at all times. Whenever 2's complement was performed on the multiplicand multiples, a one
was needed as the right shift input to the accumulator. To resolve this issue, an AND gate was
added to provide the correct shift input. The accumulator MSB (acc [40]) and the constant
"1" were used as inputs to the AND gate. With this addition, the functionality
of the multiplier was successfully verified.
Throughout the synthesis, many difficulties were overcome. The most
significant one was the definition of constraints. Each step was performed a number of
times until the right set of constraints was found. The excellent debugging capabilities
of both synthesis tools made it possible to identify the critical paths in the design, which,
when inspected, gave the author a better understanding of the way constraints are used by
the synthesis tools.
Ultimately, the experience and knowledge gained while working on the project have
been very valuable. Although the design sacrifices some uniformity and cost, the recoding
by pairs multiplier was designed, modeled and simulated successfully, thus achieving the
initial goals set for the project.
Appendix
1- VHDL
----------------------------------------------------------------------------------
-- Company:
-- Engineer: Rashed Alajmi
--
-- Create Date: 11/21/2018 11:35:06 AM
-- Design Name:
-- Module Name: top - Behavioral
-- Project Name:
-- Target Devices:
-- Tool Versions:
-- Description:
--
-- Dependencies:
--
-- Revision:
-- Revision 0.01 - File Created
-- Additional Comments:
--
-- ALU Top
-- Libraries
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.numeric_std.all;
use IEEE.STD_LOGIC_UNSIGNED.ALL;
use IEEE.std_logic_signed.all;
entity Booth_Top is
generic (
clock_frequency : integer := 100000000; -- Input clock rate in Hz (100 MHz default)
segment_refresh : integer := 50000); -- Refresh rate in Hz
port (
ja : out std_logic_vector(6 downto 0); -- seg out
dp : out std_logic;
digit_select : out std_logic_vector(15 downto 0);
one : in std_logic;
ten : in std_logic;
hundred : in std_logic;
start : in std_logic;
sel : in std_logic;
add_sub : in std_logic;
clk : in std_logic);
end Booth_Top;
architecture behavior of Booth_Top is
-------------------------------------------------------------------------------
-- COMPONENTS
-------------------------------------------------------------------------------
-- ALU
--component ALU_2
--generic (
-- bit_depth : integer := 8);
--port (
-- opcode : in std_logic_vector(2 downto 0);
-- A : in signed(2 * bit_depth - 1 downto 0);
-- B : in signed(2 * bit_depth - 1 downto 0);
-- execute : in std_logic;
-- result : out signed(2 * bit_depth - 1 downto 0));
--end component;
-- Signed Multiplier (Booths Algorithm)

component smult_1
generic (
input_size : integer := 8);
port (
product : out signed(2 * input_size - 1 downto 0);
data_ready : out std_logic;
input_1 : in signed(input_size - 1 downto 0);
input_2 : in signed(input_size - 1 downto 0);
start : in std_logic;
reset : in std_logic;
clk : in std_logic);
end component;
-- Seg Display
component Seg_Display_16
generic(
input_clk_freq : integer := 100000000; -- Input clock rate in Hz
refresh_rate : integer := 50000); -- Refresh rate in Hz
port(
-- 7 Segment Display Output
seg : out std_logic_vector(6 downto 0);
-- 7 Segment Display Decimal Point
dp : out std_logic;
-- Selects Digit
an : out std_logic_vector(15 downto 0);
-- Input segments 0 through 3
digit_1 : in std_logic_vector(6 downto 0);
-- Input decimal points
-- in_dp : in std_logic_vector(15 downto 0);
-- Input Clock
end component;
-- BCD Display
component binary_bcd
generic(
N: integer := 16);
port(
clk, reset: in std_logic;
binary_in: in std_logic_vector(N-1 downto 0);
bcd0, bcd1, bcd2, bcd3,
bcd4, bcd5, bcd6 : out std_logic_vector(3 downto 0));
end component;
-- Hex to 7 seg
component Hex_to_7_Seg
port (
seven_seg : out std_logic_vector(6 downto 0);
hex : in std_logic_vector(3 downto 0));
end component;
-- Signed to SLV Converter

component Signed_to_SLV
generic(
bit_depth : integer := 12);
port(
Signed_in : in signed(bit_depth - 1 downto 0);
SLV_out : out std_logic_vector(bit_depth - 1 downto 0);
neg : out std_logic);
end component;
-------------------------------------------------------------------------------
-- SIGNALS & CONSTANTS
-------------------------------------------------------------------------------
signal A_input, B_input : signed(11 downto 0) := (others => '0');
signal Result_out : signed(23 downto 0) := (others => '0');
signal A_slv, B_slv : std_logic_vector(11 downto 0) := (others => '0');
signal Result_slv : std_logic_vector(23 downto 0) := (others => '0');
signal A_sign, B_sign, Result_sign : std_logic_vector(6 downto 0) := "0000000";
signal dig_1 : std_logic_vector(6 downto 0) := "0000000";
-- dig_2 .. dig_16 are referenced by the port maps below; assumed to share the type of dig_1
signal dig_2, dig_3, dig_4, dig_5, dig_6, dig_7, dig_8, dig_9 : std_logic_vector(6 downto 0) := "0000000";
signal dig_10, dig_11, dig_12, dig_13, dig_14, dig_15, dig_16 : std_logic_vector(6 downto 0) := "0000000";
signal A_bcd0, A_bcd1, A_bcd2, A_bcd3, A_bcd4, A_bcd5, A_bcd6 : std_logic_vector(3 downto 0) := x"0";
signal B_bcd0, B_bcd1, B_bcd2, B_bcd3, B_bcd4, B_bcd5, B_bcd6 : std_logic_vector(3 downto 0) := x"0";
signal R_bcd0, R_bcd1, R_bcd2, R_bcd3, R_bcd4, R_bcd5, R_bcd6 : std_logic_vector(3 downto 0) := x"0";
signal one_lead, ten_lead, hundred_lead, start_lead : std_logic := '0';
signal one_follow, ten_follow, hundred_follow, start_follow : std_logic := '0';
signal one_edge_start, ten_edge_start, hundred_edge_start, start_start : std_logic := '0';
signal A_neg, B_neg, R_neg : std_logic := '0';
signal reset : std_logic := '0';
signal ready_data : std_logic;
-------------------------------------------------------------------------------
-- DESIGN
-------------------------------------------------------------------------------
begin
dig_1 <= Result_sign;

dig_10 <= A_sign;
dig_13 <= B_sign;
-- ALU
--ALU : ALU_2
--generic map(12)
--port map(opcode, A_input, B_input, start_start, Result_out);
-- Signed Multiplier
BOOTH: smult_1
generic map(12)
port map(Result_out, ready_data, A_input, B_input, start_start, reset, clk);
-- Binary BCD's
-- BCD Display A
A_BCD: binary_bcd
generic map(12)
port map(clk, reset, A_slv, A_bcd0, A_bcd1, A_bcd2, A_bcd3, A_bcd4, A_bcd5,
A_bcd6);
-- BCD Display B
B_BCD: binary_bcd
generic map(12)
port map(clk, reset, B_slv, B_bcd0, B_bcd1, B_bcd2, B_bcd3, B_bcd4, B_bcd5,
B_bcd6);
-- BCD Display Result
R_BCD: binary_bcd
generic map(24)
port map(clk, reset, Result_slv, R_bcd0, R_bcd1, R_bcd2, R_bcd3, R_bcd4, R_bcd5,
R_bcd6);
-- Hex to 7 Seg converters

--DIGIT_1: Hex_to_7_Seg
-- port map(dig_1, Result_sign);
-- A Converter
A_CONVERTER: Signed_to_SLV
generic map(12)
port map(A_input, A_slv, A_neg);
B_CONVERTER: Signed_to_SLV
generic map(12)
port map(B_input, B_slv, B_neg);
RESULT_CONVERTER: Signed_to_SLV
generic map(24)
port map(Result_out, Result_slv, R_neg);
DIGIT_2: Hex_to_7_Seg
port map(dig_2, R_bcd6);
port map(dig_9, A_bcd1);
-- port map(dig_10, A_sign);
-- port map(dig_13, B_sign);
port map(dig_14, B_bcd2);
-- 7 Segment Display Controller

SEG_CONTROLLER: Seg_Display_16
generic map(clock_frequency, segment_refresh)
port map(ja, dp, digit_select, dig_1, dig_2, dig_3, dig_4, dig_5, dig_6, dig_7, dig_8,
dig_9, dig_10, dig_11, dig_12, dig_13, dig_14, dig_15, dig_16, clk);
sign_proc : process(clk)
begin
if(rising_edge(clk)) then
if(A_neg = '1') then
A_sign <= "0111111";
else
A_sign <= "1111111";
end if;
if(B_neg = '1') then
B_sign <= "0111111";
else
B_sign <= "1111111";
end if;
if(R_neg = '1') then

Result_sign <= "0111111";
else
Result_sign <= "1111111";
end if;
end if;
end process sign_proc;
one_edge_start <= one_lead and (not one_follow);

ten_edge_start <= ten_lead and (not ten_follow);
hundred_edge_start <= hundred_lead and (not hundred_follow);
start_start <= start_lead and (not start_follow);
edge_detect_proc : process(clk)
begin
if(rising_edge(clk)) then
one_lead <= one;
one_follow <= one_lead;
ten_lead <= ten;
ten_follow <= ten_lead;
hundred_lead <= hundred;
hundred_follow <= hundred_lead;
start_lead <= start;
start_follow <= start_lead;
end if;
end process edge_detect_proc;
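add_sub_proc : process(clk)
begin
if(rising_edge(clk)) then
-- Ones digit
if(one_edge_start = '1') then
if(add_sub = '1') then
if(sel = '1' and A_input < 999) then
A_input <= A_input + 1;
elsif(sel = '0' and B_input < 999) then
B_input <= B_input + 1;
end if;
else
if(sel = '1' and A_input > -999) then
A_input <= A_input - 1;
elsif(sel = '0' and B_input > -999) then
B_input <= B_input - 1;
end if;
end if;
-- Tens digit (assumed by analogy with the ones digit: step of 10,
-- limits chosen so the value stays within -999 .. 999)
elsif(ten_edge_start = '1') then
if(add_sub = '1') then
if(sel = '1' and A_input < 990) then
A_input <= A_input + 10;
elsif(sel = '0' and B_input < 990) then
B_input <= B_input + 10;
end if;
else
if(sel = '1' and A_input > -990) then
A_input <= A_input - 10;
elsif(sel = '0' and B_input > -990) then
B_input <= B_input - 10;
end if;
end if;
-- Hundreds digit (assumed by analogy with the ones digit: step of 100,
-- limits chosen so the value stays within -999 .. 999)
elsif(hundred_edge_start = '1') then
if(add_sub = '1') then
if(sel = '1' and A_input < 900) then
A_input <= A_input + 100;
elsif(sel = '0' and B_input < 900) then
B_input <= B_input + 100;
end if;
else
if(sel = '1' and A_input > -900) then
A_input <= A_input - 100;
elsif(sel = '0' and B_input > -900) then
B_input <= B_input - 100;
end if;
end if;
end if;
end if;
end process add_sub_proc;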
end behavior;
2- Booth multiplier
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.numeric_std.all;
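entity smult_1 is
generic (
input_size : integer := 8);
port (
product : out signed(2 * input_size - 1 downto 0);
data_ready : out std_logic;
input_1 : in signed(input_size - 1 downto 0);
input_2 : in signed(input_size - 1 downto 0);
start : in std_logic;
reset : in std_logic;
clk : in std_logic);
end smult_1;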
----------------------------------------------------------------------------
--
-- BEHAVIOR
--
----------------------------------------------------------------------------
architecture behavior of smult_1 is
-- State Machine states
type state_type is(init, load_state, right_shift, done);
signal state, nxt_state : state_type;
-- Control signals
signal shift : std_logic;
signal add_A : std_logic;
signal add_S : std_logic;
signal load : std_logic;
-- Data Signals
constant maxcount : integer := input_size - 1;
signal A_reg : signed((2*input_size) downto 0) := (others => '0');
signal S_reg : signed((2*input_size) downto 0) := (others => '0');
signal P_reg : signed((2*input_size) downto 0) := (others => '0');
signal sum_S : signed((2*input_size) downto 0) := (others => '0');
signal sum_A : signed((2*input_size) downto 0) := (others => '0');
signal count : integer range 0 to maxcount + 1 := 0;
signal start_count_lead : std_logic := '0';

signal start_count_follow : std_logic := '0';
signal start_count : std_logic := '0';
begin
-----------------------------------------
-- STATE MACHINE
-- (Two Process)
--
-- This state machine is used to determine
-- what state smult_1 is in based on the
-- count value and the LSB's of the P
-- register
-----------------------------------------
state_proc: process(clk)
begin
if rising_edge(clk) then
if(reset = '1') then
state <= init;
else
state <= nxt_state;
end if;
end if;
end process state_proc;
state_machine: process(state, start, start_count, count, P_reg(1 downto 0))
begin
-- Initialize nxt_state and control signals
nxt_state <= state;
shift <= '0';
add_A <= '0';
add_S <= '0';
load <= '0';
data_ready <= '0';
case state is
-- Initialization State
when init =>
if(start_count = '1') then
nxt_state <= load_state;
else
nxt_state <= init;
end if;
-- Loading State
when load_state =>
load <= '1';
nxt_state <= right_shift;
-- Right Shift state (Multiplication is occurring)

when right_shift =>
shift <= '1';
if(count /= maxcount) then
nxt_state <= right_shift;
else
nxt_state <= done;
end if;
-- Read 2 LSB's of P_reg

if(P_reg(1 downto 0) = "01") then
add_A <= '1';
elsif(P_reg(1 downto 0) = "10") then
add_S <= '1';
end if;
-- Multiplication is complete (ready to receive new inputs)

when done =>
data_ready <= '1';
if(start = '0') then
nxt_state <= init;
else
nxt_state <= done;
end if;
-- All other states

when others =>
nxt_state <= init;
end case;
end process state_machine;
-----------------------------------------
-- EDGE DETECTION
--
-- This is used to detect a rising edge of
-- a signal
-----------------------------------------
start_count <= start_count_lead and (not start_count_follow);
start_count_proc: process(clk)
begin
if rising_edge(clk) then
if(reset = '1') then
start_count_lead <= '0';
start_count_follow <= '0';
else
start_count_lead <= start;
start_count_follow <= start_count_lead;
end if;
end if;
end process start_count_proc;
-----------------------------------------
-- COUNT PROCESS
--
-- This process is a counter that keeps
-- track of the number of cycles iterated
-- in the state machine
-----------------------------------------
count_proc: process(clk)
begin
if rising_edge(clk) then
if((start_count = '1') or (reset = '1')) then
count <= 0;
elsif(state = right_shift) then
count <= count + 1;
end if;
end if;
end process count_proc;
-----------------------------------------
-- MULTIPLIER PROCESS
--
-- This process is used to apply the
-- actual multiplication via shifts
-- and additions
-----------------------------------------
-- Determine the Sum of S_reg and A_reg
sum_S <= P_reg + S_reg;
sum_A <= P_reg + A_reg;
mult_proc: process(clk)
begin
if rising_edge(clk) then
if(reset = '1') then
P_reg <= (others => '0');
A_reg <= (others => '0');
S_reg <= (others => '0');
elsif(load = '1') then
-- A_reg: multiplicand in the upper bits
A_reg(2*input_size downto input_size + 1) <= input_1;
A_reg(input_size downto 0) <= (others => '0');
-- S_reg: two's complement of the multiplicand in the upper bits
S_reg(2*input_size downto input_size + 1) <= (not input_1) + 1;
S_reg(input_size downto 0) <= (others => '0');
-- P_reg: multiplier above one extra low bit
P_reg(2*input_size downto input_size + 1) <= (others => '0');
P_reg(input_size downto 1) <= input_2;
P_reg(0) <= '0';
elsif(add_A = '1') then
-- Add the multiplicand, then arithmetic shift right
P_reg <= sum_A(2*input_size) & sum_A(2*input_size downto 1);
elsif(add_S = '1') then
-- Subtract the multiplicand, then arithmetic shift right
P_reg <= sum_S(2*input_size) & sum_S(2*input_size downto 1);
elsif(shift = '1') then
-- Arithmetic shift right only
P_reg <= P_reg(2*input_size) & P_reg(2*input_size downto 1);
end if;
end if;
end process mult_proc;
-- Defining the output

product <= P_reg(2*input_size downto 1);
end behavior;
3- XDC for arty7

## This file is a general .xdc for the ARTY Rev. B
## To use it in a project:
## - uncomment the lines corresponding to used pins
## - rename the used ports (in each line, after get_ports) according to the top level signal names in the project
## Clock signal
set_property -dict { PACKAGE_PIN E3 IOSTANDARD LVCMOS33 } [get_ports { clk }]; #IO_L12P_T1_MRCC_35 Sch=gclk[100]
#create_clock -add -name sys_clk_pin -period 10.00 -waveform {0 5} [get_ports { CLK100MHZ }];
##Switches
set_property -dict { PACKAGE_PIN A8 IOSTANDARD LVCMOS33 } [get_ports { add_sub }]; #IO_L12N_T1_MRCC_16 Sch=sw[0]
#set_property -dict { PACKAGE_PIN C11 IOSTANDARD LVCMOS33 } [get_ports { add_sub}]; #IO_L13P_T2_MRCC_16 Sch=sw[1]
[get_ports { opcode[2] }]; #IO_L13N_T2_MRCC_16 Sch=sw[2]
set_property -dict { PACKAGE_PIN A10 IOSTANDARD LVCMOS33 } [get_ports { sel }]; #IO_L14P_T2_SRCC_16 Sch=sw[3]
##RGB LEDs
#set_property -dict { PACKAGE_PIN E1 IOSTANDARD LVCMOS33 } [get_ports { led0_b }]; #IO_L18N_T2_35 Sch=led0_b
#set_property -dict { PACKAGE_PIN F6 IOSTANDARD LVCMOS33 } [get_ports { led0_g }]; #IO_L19N_T3_VREF_35 Sch=led0_g
#set_property -dict { PACKAGE_PIN G6 IOSTANDARD LVCMOS33 } [get_ports { led0_r }]; #IO_L19P_T3_35 Sch=led0_r
[get_ports { led1_b }]; #IO_L20P_T3_35 Sch=led1_b
#set_property -dict { PACKAGE_PIN J4 IOSTANDARD LVCMOS33 } [get_ports { led1_g }]; #IO_L21P_T3_DQS_35 Sch=led1_g
[get_ports { led1_r }]; #IO_L20N_T3_35 Sch=led1_r
#set_property -dict { PACKAGE_PIN H4 IOSTANDARD LVCMOS33 } [get_ports { led2_b }]; #IO_L21N_T3_DQS_35 Sch=led2_b
{ led2_g }]; #IO_L22N_T3_35 Sch=led2_g
{ led2_r }]; #IO_L22P_T3_35 Sch=led2_r
#set_property -dict { PACKAGE_PIN K2 IOSTANDARD LVCMOS33 } [get_ports { led3_b }]; #IO_L23P_T3_35 Sch=led3_b
[get_ports { led3_g }]; #IO_L24P_T3_35 Sch=led3_g
[get_ports { led3_r }]; #IO_L23N_T3_35 Sch=led3_r
##LEDs

[get_ports { led[0] }]; #IO_L24N_T3_35 Sch=led[4]
{ led[1] }]; #IO_25_35 Sch=led[5]
#set_property -dict { PACKAGE_PIN T9 IOSTANDARD LVCMOS33 } [get_ports { led[2] }]; #IO_L24P_T3_A01_D17_14 Sch=led[6]
[get_ports { led[3] }]; #IO_L24N_T3_A00_D16_14 Sch=led[7]
##Buttons
set_property -dict { PACKAGE_PIN D9 IOSTANDARD LVCMOS33 } [get_ports { one }]; #IO_L6N_T0_VREF_16 Sch=btn[0]
set_property -dict { PACKAGE_PIN C9 IOSTANDARD LVCMOS33 } [get_ports { ten }]; #IO_L11P_T1_SRCC_16 Sch=btn[1]
set_property -dict { PACKAGE_PIN B9 IOSTANDARD LVCMOS33 } [get_ports { hundred }]; #IO_L11N_T1_SRCC_16 Sch=btn[2]
set_property -dict { PACKAGE_PIN B8 IOSTANDARD LVCMOS33 } [get_ports { start }]; #IO_L12P_T1_MRCC_16 Sch=btn[3]
##Pmod Header JA
set_property -dict { PACKAGE_PIN G13 IOSTANDARD LVCMOS33 } [get_ports { ja[0] }]; #IO_0_15 Sch=ja[1]
set_property -dict { PACKAGE_PIN B11 IOSTANDARD LVCMOS33 } [get_ports { ja[1] }]; #IO_L4P_T0_15 Sch=ja[2]
[get_ports { ja[2] }]; #IO_L4N_T0_15 Sch=ja[3]
set_property -dict { PACKAGE_PIN D12 IOSTANDARD LVCMOS33 } [get_ports { ja[3] }]; #IO_L6P_T0_15 Sch=ja[4]
set_property -dict { PACKAGE_PIN D13 IOSTANDARD LVCMOS33 } [get_ports { ja[4] }]; #IO_L6N_T0_VREF_15 Sch=ja[7]
set_property -dict { PACKAGE_PIN B18 IOSTANDARD LVCMOS33 } [get_ports { ja[5] }]; #IO_L10P_T1_AD11P_15 Sch=ja[8]
[get_ports { ja[6] }]; #IO_L10N_T1_AD11N_15 Sch=ja[9]
set_property -dict { PACKAGE_PIN K16 IOSTANDARD LVCMOS33 } [get_ports { dp }]; #IO_25_15 Sch=ja[10]
##Pmod Header JB

[get_ports { jb[0] }]; #IO_L11P_T1_SRCC_15 Sch=jb_p[1]
[get_ports { jb[1] }]; #IO_L11N_T1_SRCC_15 Sch=jb_n[1]
#set_property -dict { PACKAGE_PIN D15 IOSTANDARD LVCMOS33 } [get_ports { jb[2] }]; #IO_L12P_T1_MRCC_15 Sch=jb_p[2]
[get_ports { jb[3] }]; #IO_L12N_T1_MRCC_15 Sch=jb_n[2]
#set_property -dict { PACKAGE_PIN J17 IOSTANDARD LVCMOS33 } [get_ports { jb[4] }]; #IO_L23P_T3_FOE_B_15 Sch=jb_p[3]
[get_ports { jb[5] }]; #IO_L23N_T3_FWE_B_15 Sch=jb_n[3]
[get_ports { jb[6] }]; #IO_L24P_T3_RS1_15 Sch=jb_p[4]
[get_ports { jb[7] }]; #IO_L24N_T3_RS0_15 Sch=jb_n[4]
##Pmod Header JC
set_property -dict { PACKAGE_PIN U12 IOSTANDARD LVCMOS33 } [get_ports { digit_select[0] }]; #IO_L20P_T3_A08_D24_14 Sch=jc_p[1]
set_property -dict { PACKAGE_PIN V12 IOSTANDARD LVCMOS33 } [get_ports { digit_select[1] }]; #IO_L20N_T3_A07_D23_14 Sch=jc_n[1]
[get_ports { digit_select[2] }]; #IO_L21P_T3_DQS_14 Sch=jc_p[2]
[get_ports { digit_select[3] }]; #IO_L21N_T3_DQS_A06_D22_14 Sch=jc_n[2]
[get_ports { digit_select[4] }]; #IO_L22P_T3_A05_D21_14 Sch=jc_p[3]
set_property -dict { PACKAGE_PIN T13 IOSTANDARD LVCMOS33 } [get_ports { digit_select[6] }]; #IO_L23P_T3_A03_D19_14 Sch=jc_p[4]
##Pmod Header JD

{ digit_select[8] }]; #IO_L11N_T1_SRCC_35 Sch=jd[1]
{ digit_select[9] }]; #IO_L12N_T1_MRCC_35 Sch=jd[2]
set_property -dict { PACKAGE_PIN F4 IOSTANDARD LVCMOS33 } [get_ports { digit_select[10] }]; #IO_L13P_T2_MRCC_35 Sch=jd[3]
set_property -dict { PACKAGE_PIN F3 IOSTANDARD LVCMOS33 } [get_ports { digit_select[11] }]; #IO_L13N_T2_MRCC_35 Sch=jd[4]
set_property -dict { PACKAGE_PIN E2 IOSTANDARD LVCMOS33 } [get_ports { digit_select[12] }]; #IO_L14P_T2_SRCC_35 Sch=jd[7]
{ digit_select[13] }]; #IO_L14N_T2_SRCC_35 Sch=jd[8]
set_property -dict { PACKAGE_PIN H2 IOSTANDARD LVCMOS33 } [get_ports { digit_select[14] }]; #IO_L15P_T2_DQS_35 Sch=jd[9]
set_property -dict { PACKAGE_PIN G2 IOSTANDARD LVCMOS33 } [get_ports { digit_select[15] }]; #IO_L15N_T2_DQS_35 Sch=jd[10]
##USB-UART Interface

[get_ports { uart_rxd_out }]; #IO_L19N_T3_VREF_16 Sch=uart_rxd_out
#set_property -dict { PACKAGE_PIN A9 IOSTANDARD LVCMOS33 } [get_ports { uart_txd_in }]; #IO_L14N_T2_SRCC_16 Sch=uart_txd_in
##ChipKit Single Ended Analog Inputs

##NOTE: The ck_an_p pins can be used as single ended analog inputs with voltages from 0-3.3V (Chipkit Analog pins A0-A5).
## These signals should only be connected to the XADC core. When using these pins as digital I/O, use pins ck_io[14-19].

[get_ports { ck_an_n[0] }]; #IO_L1N_T0_AD4N_35 Sch=ck_an_n[0]
[get_ports { ck_an_p[0] }]; #IO_L1P_T0_AD4P_35 Sch=ck_an_p[0]
[get_ports { ck_an_n[1] }]; #IO_L3N_T0_DQS_AD5N_35 Sch=ck_an_n[1]
[get_ports { ck_an_p[1] }]; #IO_L3P_T0_DQS_AD5P_35 Sch=ck_an_p[1]
#set_property -dict { PACKAGE_PIN B4 IOSTANDARD LVCMOS33 }
[get_ports { ck_an_n[3] }]; #IO_L9N_T1_DQS_AD7N_35 Sch=ck_an_n[3]
[get_ports { ck_an_p[3] }]; #IO_L9P_T1_DQS_AD7P_35 Sch=ck_an_p[3]
##ChipKit Digital I/O Low
#set_property -dict { PACKAGE_PIN V15 IOSTANDARD LVCMOS33 } [get_ports { add_sub }]; #IO_L16P_T2_CSI_B_14 Sch=ck_io[0]
#set_property -dict { PACKAGE_PIN U16 IOSTANDARD LVCMOS33 } [get_ports { ck_io[1] }]; #IO_L18P_T2_A12_D28_14 Sch=ck_io[1]
#set_property -dict { PACKAGE_PIN P14 IOSTANDARD LVCMOS33 } [get_ports { ck_io[2] }]; #IO_L8N_T1_D12_14 Sch=ck_io[2]
#set_property -dict { PACKAGE_PIN R12 IOSTANDARD LVCMOS33 } [get_ports { ck_io[4] }]; #IO_L5P_T0_D06_14 Sch=ck_io[4]
[get_ports { ck_io[5] }]; #IO_L14P_T2_SRCC_14 Sch=ck_io[5]
[get_ports { ck_io[6] }]; #IO_L14N_T2_SRCC_14 Sch=ck_io[6]
[get_ports { ck_io[7] }]; #IO_L15N_T2_DQS_DOUT_CSO_B_14 Sch=ck_io[7]
#set_property -dict { PACKAGE_PIN N15 IOSTANDARD LVCMOS33 } [get_ports { ck_io[8] }]; #IO_L11P_T1_SRCC_14 Sch=ck_io[8]
#set_property -dict { PACKAGE_PIN M16 IOSTANDARD LVCMOS33 }
[get_ports { ck_io[10] }]; #IO_L18N_T2_A11_D27_14 Sch=ck_io[10]
[get_ports { ck_io[12] }]; #IO_L12N_T1_MRCC_14 Sch=ck_io[12]
[get_ports { ck_io[13] }]; #IO_L12P_T1_MRCC_14 Sch=ck_io[13]
##ChipKit Digital I/O On Outer Analog Header
##NOTE: These pins should be used when using the analog header signals A0-A5 as digital I/O (Chipkit digital pins 14-19)

[get_ports { ck_io[14] }]; #IO_0_35 Sch=ck_a[0]
[get_ports { ck_io[15] }]; #IO_L4P_T0_35 Sch=ck_a[1]
[get_ports { ck_io[16] }]; #IO_L4N_T0_35 Sch=ck_a[2]
[get_ports { ck_io[17] }]; #IO_L6P_T0_35 Sch=ck_a[3]
[get_ports { ck_io[18] }]; #IO_L6N_T0_VREF_35 Sch=ck_a[4]
[get_ports { ck_io[19] }]; #IO_L11P_T1_SRCC_35 Sch=ck_a[5]
##ChipKit Digital I/O On Inner Analog Header

##NOTE: These pins will need to be connected to the XADC core when used as differential analog inputs (Chipkit analog pins A6-A11)

[get_ports { ck_io[20] }]; #IO_L2P_T0_AD12P_35 Sch=ad_p[12]
[get_ports { ck_io[21] }]; #IO_L2N_T0_AD12N_35 Sch=ad_n[12]
##ChipKit Digital I/O High

[get_ports { ck_io[26] }]; #IO_L19N_T3_A09_D25_VREF_14 Sch=ck_io[26]
[get_ports { ck_io[28] }]; #IO_L6N_T0_D08_VREF_14 Sch=ck_io[28]
[get_ports { ck_io[29] }]; #IO_25_14 Sch=ck_io[29]
[get_ports { ck_io[30] }]; #IO_0_14 Sch=ck_io[30]
[get_ports { ck_io[32] }]; #IO_L13N_T2_MRCC_14 Sch=ck_io[32]
[get_ports { ck_io[33] }]; #IO_L13P_T2_MRCC_14 Sch=ck_io[33]
[get_ports { ck_io[34] }]; #IO_L15P_T2_DQS_RDWR_B_14 Sch=ck_io[34]
[get_ports { ck_io[35] }]; #IO_L11N_T1_SRCC_14 Sch=ck_io[35]
[get_ports { ck_io[40] }]; #IO_L9N_T1_DQS_D13_14 Sch=ck_io[40]
[get_ports { ck_io[41] }]; #IO_L9P_T1_DQS_14 Sch=ck_io[41]
## ChipKit SPI

[get_ports { ck_miso }]; #IO_L17N_T2_35 Sch=ck_miso
[get_ports { ck_mosi }]; #IO_L17P_T2_35 Sch=ck_mosi
[get_ports { ck_sck }]; #IO_L18P_T2_35 Sch=ck_sck
[get_ports { ck_ss }]; #IO_L16N_T2_35 Sch=ck_ss
## ChipKit I2C
#set_property -dict { PACKAGE_PIN L18 IOSTANDARD LVCMOS33 } [get_ports { ck_scl }]; #IO_L4P_T0_D04_14 Sch=ck_scl
[get_ports { ck_sda }]; #IO_L4N_T0_D05_14 Sch=ck_sda
[get_ports { scl_pup }]; #IO_L9N_T1_DQS_AD3N_15 Sch=scl_pup
[get_ports { sda_pup }]; #IO_L9P_T1_DQS_AD3P_15 Sch=sda_pup
##Misc. ChipKit signals
[get_ports { ck_ioa }]; #IO_L10N_T1_D15_14 Sch=ck_ioa
[get_ports { ck_rst }]; #IO_L16P_T2_35 Sch=ck_rst
##SMSC Ethernet PHY

[get_ports { eth_col }]; #IO_L16N_T2_A27_15 Sch=eth_col
[get_ports { eth_crs }]; #IO_L15N_T2_DQS_ADV_B_15 Sch=eth_crs
[get_ports { eth_mdc }]; #IO_L14N_T2_SRCC_15 Sch=eth_mdc
[get_ports { eth_mdio }]; #IO_L17P_T2_A26_15 Sch=eth_mdio
[get_ports { eth_ref_clk }]; #IO_L22P_T3_A17_15 Sch=eth_ref_clk
[get_ports { eth_rstn }]; #IO_L20P_T3_A20_15 Sch=eth_rstn
[get_ports { eth_rx_clk }]; #IO_L14P_T2_SRCC_15 Sch=eth_rx_clk
[get_ports { eth_rx_dv }]; #IO_L13N_T2_MRCC_15 Sch=eth_rx_dv
[get_ports { eth_rxd[0] }]; #IO_L21N_T3_DQS_A18_15 Sch=eth_rxd[0]
[get_ports { eth_rxd[1] }]; #IO_L16P_T2_A28_15 Sch=eth_rxd[1]
[get_ports { eth_rxd[2] }]; #IO_L21P_T3_DQS_15 Sch=eth_rxd[2]
[get_ports { eth_rxd[3] }]; #IO_L18N_T2_A23_15 Sch=eth_rxd[3]
[get_ports { eth_rxerr }]; #IO_L20N_T3_A19_15 Sch=eth_rxerr
[get_ports { eth_tx_clk }]; #IO_L13P_T2_MRCC_15 Sch=eth_tx_clk
[get_ports { eth_tx_en }]; #IO_L19N_T3_A21_VREF_15 Sch=eth_tx_en
[get_ports { eth_txd[0] }]; #IO_L15P_T2_DQS_15 Sch=eth_txd[0]
[get_ports { eth_txd[1] }]; #IO_L19P_T3_A22_15 Sch=eth_txd[1]
[get_ports { eth_txd[2] }]; #IO_L17N_T2_A25_15 Sch=eth_txd[2]
[get_ports { eth_txd[3] }]; #IO_L18P_T2_A24_15 Sch=eth_txd[3]
##Quad SPI Flash

[get_ports { qspi_cs }]; #IO_L6P_T0_FCS_B_14 Sch=qspi_cs
[get_ports { qspi_dq[0] }]; #IO_L1P_T0_D00_MOSI_14 Sch=qspi_dq[0]
[get_ports { qspi_dq[1] }]; #IO_L1N_T0_D01_DIN_14 Sch=qspi_dq[1]
[get_ports { qspi_dq[2] }]; #IO_L2P_T0_D02_14 Sch=qspi_dq[2]
[get_ports { qspi_dq[3] }]; #IO_L2N_T0_D03_14 Sch=qspi_dq[3]
##Power Measurements

[get_ports { vsnsvu_n }]; #IO_L7N_T1_AD2N_15 Sch=ad_n[2]
[get_ports { vsnsvu_p }]; #IO_L7P_T1_AD2P_15 Sch=ad_p[2]
[get_ports { vsns5v0_n }]; #IO_L3N_T0_DQS_AD1N_15 Sch=ad_n[1]
[get_ports { vsns5v0_p }]; #IO_L3P_T0_DQS_AD1P_15 Sch=ad_p[1]
[get_ports { isns5v0_n }]; #IO_L5N_T0_AD9N_15 Sch=ad_n[9]
[get_ports { isns5v0_p }]; #IO_L5P_T0_AD9P_15 Sch=ad_p[9]
[get_ports { isns0v95_n }]; #IO_L8N_T1_AD10N_15 Sch=ad_n[10]
[get_ports { isns0v95_p }]; #IO_L8P_T1_AD10P_15 Sch=ad_p[10]

Implementing booth algorithm on FPGA

Rashed Hamad Alajmi

Master of Science in Electrical Engineering

Multiplication is of paramount importance in various fields ranging from finance to