Fpga Notes April23
Introduction to FPGA
Introduction: DSP and FPGAs 1
• In the last 20 years the majority of DSP applications have been enabled by DSP processors.
• But the most recent technology platform for high speed DSP applications is the Field Programmable Gate Array (FPGA).
This course is all about the why and how of DSP with FPGAs!
R. Stewart, Dept EEE, University of Strathclyde, 2010
Notes:
DSP is all about multiplies and accumulates/adds (MACs). As we progress through the course, we will see that
most algorithms that are used for different applications employ digital filters, adaptive filters, Fourier transforms
and so on. These algorithms all require multiplies and adds (note that a divide or square root is quite a rare thing
in DSP).
Hence a DSP algorithm or problem is often specified in terms of its MAC requirements. In particular, when comparing two algorithms, if they both perform the same job but one needs fewer MACs than the other, then clearly the "cheaper" one would be the best choice. However this implies some assumptions. One is that the required MACs are the same - but surely a multiply is a multiply! Well, yes: in the traditional DSP processor based situation we are likely to be using, say, a 16 bit device which will process 16 bit inputs, using 16 bit digital filter coefficients etc. With FPGAs this constraint is removed - we can use as many, or as few, bits as are required. Therefore we can choose to optimise and schedule DSP algorithms in a completely different way.
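To make the MAC idea concrete, here is a minimal Python sketch of a direct-form FIR filter expressed purely as multiply-accumulate operations (the function name and coefficient values are ours, chosen for illustration, not from the notes):

```python
# Hypothetical sketch: a direct-form FIR filter is nothing more than a
# sequence of multiply-accumulate (MAC) operations.

def fir_mac(samples, coeffs):
    """Convolve input samples with filter coefficients using explicit MACs."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * samples[n - k]   # one MAC per coefficient
        out.append(acc)
    return out

# A filter with N coefficients costs N MACs per output sample.
y = fir_mac([1, 0, 0, 0], [0.5, 0.25, 0.125])
```

An impulse input simply reads out the coefficients, which makes the MAC count per output sample easy to see.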
[Figure: example DSP circuit board - a DSP56307 DSP processor connected over a general purpose input/output bus to an ADC and DAC, with amplifiers/filters at the analogue interfaces.]
• Since around 1998 the evolution of FPGAs into the DSP market has been sustained by classic technology progress, such as the ever-present Moore's law.
Anyone who has purchased a new laptop knows the feeling. If you just wait, then in the next quarter you will get the new model with integrated WiFi or WiMax and a faster processor. Of course, wait another quarter and in 6 months it will be improved again - and the new faster, better, bigger machine is likely to be cheaper too! Such is technology.
DSP for FPGAs is just the same. If you wait another year it's likely the vendors will bring out prepackaged algorithms for precisely what you want to do. And they will be easier to work with - higher level design tools, design wizards and so on.
So if you are planning to design a QR adaptive equalizing beamformer for MIMO implementation of a software
radio for 802.16 - then if you wait, it will probably be a free download in a few years. But of course who can wait?
Therefore in this course, we discuss and review the fundamental strategies of designing DSP for FPGAs. Like
all technologies you still need to know how it works if you really want to use it.
• Of course the resource is finite and the connections available are finite.
• However, the high level concept: take the blocks, and build it.

[Figure: high level design flow - design, verify, then place and route, with clocks and input/output connections.]
Yes in both cases, but modern toolsets and design flows are such that it might be the same person.
There is lots to worry about. In terms of the DSP design: is the arithmetic correct (i.e. overflows, underflows, saturation etc.)? Do the latency or delays used allow the integrity of the algorithm to be maintained?
For the FPGA: can we clock at a high enough rate? Does the design place and route? What device do we need, and how efficient is the implementation? (Just like compilers, different vendors' design flows will give different results, some better than others.)
As vendors provide higher level components (like the DSP48 slice from Xilinx, which allows a complete FIR to be implemented), issues such as overflow, numerical integrity and so on are taken care of.
• The bottom line for DSP is multiplies and adds - and lots of them!
N bits + N bits = (N+1) bits: adding two N bit numbers produces a result of N+1 bits.
Within traditional DSP processors this wordlength growth is well known and catered for.
For example, the Motorola 56000 series is so called because it has a 56 bit accumulator, i.e. the largest result
of any “addition” operation can have 56 bits. For a typical DSP filtering type operation we may require to take,
say an array of 24 bit numbers and multiply by an array of another 24 bits numbers. The result of each multiply
will be a 48 bit number. If we then add two 48 bit numbers together, if they both just happen to be large positive
values then the result could be a 49 bit number. Now if we add many 48 bit numbers together (and they just all
happen to be large positive values), then the final result may have a word growth of quite a few bits. So one
must assume that Motorola had a good look at this, and realised it was fairly unlikely that the result of adding these 48 bit products together would ever be larger than 56 bits - so 56 bits was chosen. (Of course if you did have a problem that grew beyond 56 bits you would have to put special trapping into the code to catch this.)
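The wordlength growth argument can be sketched numerically. This is a simple worst-case model (function name ours, not Motorola's actual analysis):

```python
from math import ceil, log2

# Worst-case wordlength growth: multiplying two B-bit values gives a
# 2B-bit product, and accumulating K such products can grow the result
# by up to ceil(log2(K)) further bits.

def acc_bits(operand_bits, num_products):
    product_bits = 2 * operand_bits        # e.g. 24 x 24 -> 48 bits
    growth = ceil(log2(num_products))      # worst-case carry growth
    return product_bits + growth

# Accumulating up to 256 products of 24 bit operands needs 48 + 8 = 56
# bits, matching the 56000's 56 bit accumulator.
print(acc_bits(24, 256))
```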
[Figure: 4 bit ripple-carry adder - full adders (Σ) chained through carries C0-C3 with carry-in '0', computing A3A2A1A0 + B3B2B1B0 = S4S3S2S1S0 (MSB to LSB). Each full adder adds two bits plus one carry-in bit to produce a sum and a carry-out.]

Worked example:

    1011    (+11)
  + 1101    (+13)
   11000    (+24)

[Figure: array multiplier fragment - product bits p7...p0 formed from AND terms, z = a.b.]
• Early gate array design flow would be: design, simulate/verify, device production and test. Metal layers make simple connections, e.g. Z = AB + CD.

[Figure: two AND gates (inputs A, B and C, D) feeding an OR gate to produce Z = AB + CD.]

Early simulators and netlisters such as HILO (from GenRad) were used.
From GA to FPGA
However, simple gate arrays, although very generic, were used by many different users for similar systems - for example to implement two level logic functions, flip-flops and registers, and perhaps addition and subtraction functions.
For a GA, once the layer(s) of metal had been laid on a device - that's it! No changes, no updates, no fixes.
So then we move to field programmable gate arrays. Two key differences between these and gate arrays:
• They can be reprogrammed in the "field", i.e. the logic specified is changeable
• They are no longer composed of just NAND gates, but a carefully balanced selection of multi-input logic, flip-flops, multiplexers and memory.
Generic FPGA Architecture (Logic Fabric) 8
• Arrays of gates and higher level logic blocks might be referred to as the logic fabric...
[Figure: generic FPGA architecture - an array of logic elements surrounded by I/O blocks, with row and column interconnects. Each logic element contains a LUT, a select MUX, cascade/carry logic and flip-flops.]
Of course the actual contents of a logic element will vary from manufacturer to manufacturer and device to
device.
[Figure: FPGA floorplan - Block RAM and arithmetic block columns embedded within rows and columns of logic fabric, with input/output blocks around the perimeter.]
Block RAMs are also used extensively in DSP. Example uses are storing filter coefficients, encoding and decoding, and other tasks.
[Figure: Xilinx FPGA hierarchy - the FPGA contains CLBs, each CLB contains slices, and a switch matrix connects each CLB to other CLBs.]
Continuing with the Xilinx example, the combinational blocks are termed Lookup Tables (LUTs), and in most devices have 4 inputs (some of the more recent devices have 6-input LUTs). These LUTs can be utilised in four modes:
• As logic function generators
• As ROM
• As distributed RAM
• As shift registers
The sequential element paired with each LUT can act as:
• A flip-flop
• A latch
Over the next few slides, the functionality of LUTs and registers in the above modes will be described.
• When used to implement a logic function, the 4-bit input addresses the
LUT to find the correct output, Z, for that combination of A, B, C and D.
[Figure: 4-input lookup table with inputs A, B, C and D, and output Z.]
• Dual port RAM requires more resources than single port RAM.
• Additional Shift In and Shift Out ports are used, and the 4-bit address
is used to define the memory location which is asynchronously read.
• For example, if the LUT input is the word 1001, the output from the 10th
register is read, as depicted below.
[Figure: LUT configured as a 16-bit shift register (SRL16) - a clocked chain of registers 0-15; the 4 bit LUT input (e.g. 1001) selects which register drives the D OUT output.]
• The slice register at the output from the LUT can be used to add
another 1 clock cycle delay. Using the register also synchronises the
read operation.
Notes:
As with the other LUT configurations, larger Shift Registers can be constructed by combining several LUTs. For
example, a 64-bit shift register segment can be constructed by combining four 16-bit Shift Registers together,
as shown below. The cascadable ports allow further interconnections for larger Shift Registers.
[Figure: 64-bit shift register built from four SRL16s across two slices - the MSB of each SRL16 cascades through DI into the next, with cascadable in and out ports allowing further extension.]
• The sequential logic element which follows the lookup table can be configured as either:
• An edge-triggered flip-flop
• A level-sensitive latch
• The input to the register may be the output from the LUT, or
alternatively a direct input to the slice (i.e. the LUT is bypassed).
[Figure: LUT/register pair with a bypass input, allowing the register to be fed directly so the LUT is skipped.]

D(t)  Q(t+1)
  0      0
  1      1
When configured as a latch, the control inputs define when data on the D input is “captured” and stored within
the register. The Q output thereafter remains unchanged until new data is captured.
Flip flops and registers are discussed in the Digital Logic Review notes chapter.
[Figure: 8-bit resettable shift register - eight D-type flip-flops (D Q) with common CLOCK and RESET (R), spread across slices 1 to 4.]
Notes:
We can still design a resettable shift register with an SRL16, by using a slightly more sophisticated design.
Instead of making all elements resettable, we can implement the first element using a slice register, and the
subsequent ones using an SRL16. The reset signal is held high for 8 clock cycles, which allows the 0 input to
propagate through the shift register. Instead of using 4 slices, this design would require 2 slices at most.
• Number representations:
binary word formats for signed and unsigned integers, 2’s
complement, fixed point and floating point;
Addition - decimal, two’s complement integer binary, two’s complement fixed point, hardware structures for
addition, Xilinx-specific FPGA structures for addition.
Multiplication - decimal, 2s complement integer binary, two’s complement fixed point, hardware structures for
multiplication, Xilinx-specific FPGA structures for multiplication.
Division.
Square root.
Complex addition.
Complex multiplication.
• Sufficient resolution
For example, assume we have a DSP filtering application using 16 bit resolution arithmetic. We will show later
(see Slide 42) that the cost of a parallel multiplier (in terms of silicon area - speed product) can be approximated
as the number of full adder cells. Therefore for a 16 bit by 16 bit parallel multiply the cost is of the order of 16 x
16 = 256 "cells". The wordlength of 16 bits has been chosen (presumably) because the designer at some time demonstrated that 17 bits was too many, and 15 was not enough - or did they? Probably not! It's likely that we are using 16 bits because... well, that's what we usually use in DSP processors and we are creatures of habit! In the world of FPGA DSP arithmetic you can choose the resolution. Therefore, if it was demonstrated that in fact 9 bits was sufficient resolution, then the cost of a multiplier is 9 x 9 = 81 cells. This is approximately 30% of the cost of using 16 bit arithmetic.
Therefore it's important to get the wordlength right: too many bits wastes resources, and too few bits loses resolution. So how do we get it right? Well, you need to know your algorithms and DSP.
64    01000000
65    01000001
131   10000011
255   11111111

Note that the minimum value is 0, and the maximum value (255) is the sum of the powers of two from 2^0 to 2^7, where 8 is the number of bits in the binary word:
i.e. 2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^5 + 2^6 + 2^7 = 255 = 2^8 - 1
• The most negative and most positive numbers are represented by:

bit index:       n-1  n-2  n-3  ...  2  1  0
most negative:    1    0    0   ...  0  0  0
most positive:    0    1    1   ...  1  1  1
                 MSB                    LSB
As for the unsigned representation, the decimal number 82 is "01010010" in 2's complement signed format:

bit index:       7    6    5    4    3    2    1    0
bit weighting: -2^7  2^6  2^5  2^4  2^3  2^2  2^1  2^0

Meanwhile the decimal number -82 is "10101110" in 2's complement signed format, with the same bit weightings.

To negate a value, invert all bits and add 1:

            +82   0 1 0 1 0 0 1 0                -82   1 0 1 0 1 1 1 0
invert all bits   1 0 1 0 1 1 0 1    invert all bits   0 1 0 1 0 0 0 1
          add 1 + 0 0 0 0 0 0 0 1              add 1 + 0 0 0 0 0 0 0 1
            -82   1 0 1 0 1 1 1 0                +82   0 1 0 1 0 0 1 0
Note that when negating zero, the invert-and-add-1 procedure produces a ninth bit (00000000 inverts to 11111111, and adding 1 gives 100000000). However, if we simply ignore this ninth bit, the representation of "negative zero" becomes identical to the representation of positive zero.
Notice from the above that -128 can be represented but +128 cannot.
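A small Python sketch of these conversions (the helper names are ours); the decode step applies the negative MSB weighting of -2^(n-1) directly:

```python
# Minimal sketch of 8 bit 2's complement encode/decode, mirroring the
# "invert all bits and add 1" behaviour described in the notes.

def to_twos(value, bits=8):
    """Encode a signed integer as a 2's complement bit string."""
    return format(value & ((1 << bits) - 1), f'0{bits}b')

def from_twos(bitstr):
    """Decode: the MSB carries a negative weighting of -2^(n-1)."""
    n = len(bitstr)
    value = int(bitstr, 2)
    return value - (1 << n) if bitstr[0] == '1' else value

print(to_twos(82))            # 01010010
print(to_twos(-82))           # 10101110
print(from_twos('10000000'))  # -128
```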
• Bits on the left of the binary point are termed integer bits, and bits on
the right of the binary point are termed fractional bits.
• The format of a generic fixed point word, comprising n integer bits and b fractional bits, has the binary point between the two fields.
• The MSB has -ve weighting for 2’s complement (as for integer words).
Notes:
As examples, we consider the 2's complement word "11010110" with the binary point in two different places. Firstly, with the binary point to the left of the third bit from the right, i.e. 5 integer bits and 3 fractional bits:

bit index:        4    3    2    1    0    -1    -2    -3
bit weighting:  -2^4  2^3  2^2  2^1  2^0  2^-1  2^-2  2^-3
binary number:    1    1    0    1    0     1     1     0
decimal:        -16 + 8 + 2 + 0.5 + 0.25 = -5.25

...and secondly, with the binary point to the left of the fifth bit from the right, i.e. 3 integer bits and 5 fractional bits:

bit index:        2    1    0    -1    -2    -3    -4    -5
bit weighting:  -2^2  2^1  2^0  2^-1  2^-2  2^-3  2^-4  2^-5
binary number:    1    1    0     1     0     1     1     0
decimal:        -4 + 2 + 0.5 + 0.125 + 0.0625 = -1.3125

Note that these results are related by a factor of 2^2 = 4, i.e. 4 x -1.3125 = -5.25.
Fixed Point Range and Precision 22
[Figure: fixed point range and precision - number lines from -4 to +3 for words with 0, 1, 2 and 3 fractional bits, with quantisation intervals of 1, 0.5, 0.25 and 0.125 respectively; in general a word has n integer bits and b fractional bits, e.g. 3 integer bits and 5 fractional bits.]

• This looks much more accurate. The quantisation error is ±LSB/2 (where LSB = least significant bit)... so 0.015625 rather than 0.5!
Notes:
Quantisation is simply the DSP term for the process of representing infinite precision numbers with finite
precision numbers. In the decimal world, it is familiar to most to work with a given number of decimal places.
The real number π can be represented as 3.14159265... and so on. We can quantise or represent π to 4 decimal places as 3.1416. If we use "rounding" here, the error is |3.1416 - π| ≈ 0.0000073.
If we truncate (just chop off the digits below the 4th decimal place, giving 3.1415) then the error is larger: |3.1415 - π| ≈ 0.0000927.
Clearly rounding is most desirable to maintain the best possible accuracy. However it comes at a cost. Albeit the cost is relatively small, it is not "free".
When multiplying fractional numbers we will choose to work to a given number of places. For example, if we
work to two decimal places then the calculation:
Once we start performing billions of multiplies and adds in a DSP system it is not difficult to see that these small
errors can begin to stack up.
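The rounding versus truncation comparison can be sketched as follows. This illustrative helper (name ours) quantises to decimal places rather than binary bits, matching the π example:

```python
import math

# Quantising pi to 4 decimal places by rounding and by truncation,
# and comparing the resulting errors.

def quantise(x, places, mode='round'):
    scale = 10 ** places
    if mode == 'round':
        return round(x * scale) / scale
    return math.trunc(x * scale) / scale   # truncation: chop the digits

err_round = abs(quantise(math.pi, 4) - math.pi)           # ~0.0000073
err_trunc = abs(quantise(math.pi, 4, 'trunc') - math.pi)  # ~0.0000927
assert err_round < err_trunc
```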
[Figure: the bit pattern 010111 interpreted with three binary point positions. With weightings -8 4 2 1 0.5 0.25 the original number is 5.75; dividing by 4 (weightings -2 1 0.5 0.25 0.125 0.0625) gives 1.4375; multiplying by 2, i.e. shifting the point left by 1 place (weightings -16 8 4 2 1 0.5), gives 11.5.]

Reviewing the divide-by-4 and multiply-by-2 examples from the main slide... if we move the binary point, the weightings of the bits comprising the word, and hence the value it represents, change by a power-of-two factor.
• Fixed point word formats with 1 integer bit and a number of fractional
bits are often adopted.
Q-format notation is given in the form Qm.n, where m is the number of integer bits, and n is the number of
fractional bits. Notably this description excludes the MSB of the 2’s complement representation, which Q-format
considers a sign bit. Therefore the total number of bits in a Q-format number is 1 + m + n, whereas in 2’s
complement, the same word format would be described as having m+1 integer bits, and n fractional bits.
For example, a Q2.5 number has a sign bit, 2 other integer bits, and 5 fractional bits, and hence can be
represented as shown below. In 2’s complement, this would be described as having 3 integer bits and 5
fractional bits.
[Figure: Q2.5 word layout - 1 sign bit, m = 2 integer bits and n = 5 fractional bits in the Q-format description; 3 integer bits and 5 fractional bits in the 2's complement description.]
The Q0.15 format (often abbreviated to Q15) is used extensively in DSP as it covers the normalised range of numbers from -1 to +1 - 2^-15, and is equivalent to a 16 bit 2's complement representation with the binary point at position 15, i.e. 1 integer bit and 15 fractional bits.
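A minimal sketch of Q15 conversion, assuming rounding on encode and saturation at the range limits (helper names ours):

```python
# Q0.15: a 16 bit 2's complement word with 15 fractional bits,
# covering -1 to +1 - 2^-15.

def float_to_q15(x):
    """Scale by 2^15, round, and saturate to the representable range."""
    raw = int(round(x * (1 << 15)))
    return max(-(1 << 15), min((1 << 15) - 1, raw))

def q15_to_float(q):
    return q / (1 << 15)

print(float_to_q15(0.5))     # 16384
print(float_to_q15(1.0))     # saturates to 32767
print(q15_to_float(32767))   # 0.999969482421875, i.e. +1 - 2^-15
```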
Fractional Motivation - Normalisation 26
However, using normalised values, where all inputs are constrained to be in the range -1 to +1, it is easy to note that multiplying ANY two numbers in this range together will give a result also in the range -1 to +1.
Exactly the same idea of normalisation is applied to binary, and the binary point is implicitly used in most DSP systems.
If we normalise these numbers to between -1 and 1 (i.e. divide through by 128) then the binary range is:
Therefore we apply the same normalising ideas as for decimal for multiplication in binary.
Consider multiplying 36 x 97 = 3492, equivalent to 00100100 x 01100001 = 0000 1101 1010 0100.
In binary, normalising the values gives the calculation 0.0100100 x 0.1100001 = 0.00110110100100.
Note very clearly that in a DSP system the binary point is all in the eye of the designer. There is no physical connection or wire for the binary point. It just makes things significantly easier in keeping track of wordlength growth, and truncating just by dropping fractional bits. Of course if you prefer integers and would like to keep track of the scaling etc. you can do this... you will get the same answer and the cost is the same.
[Figure: 8 bit ADC sampling at rate fs - voltage input (around -2 to +2 volts) mapped to binary output, e.g. +127 → 01111111, +96 → 01100000, +64 → 01000000, +32 → 00100000, -64 → 11000000, -96 → 10100000, -128 → 10000000.]
Note that the ADC does not necessarily have a linear (straight line) characteristic. In telecommunications for
example a defined standard nonlinear quantiser characteristic is often used (A-law and μ-law). Speech signals,
for example, have a very wide dynamic range: Harsh “oh” and “b” type sounds have a large amplitude, whereas
softer sounds such as “sh” have small amplitudes. If a uniform quantisation scheme were used then although
the loud sounds would be represented adequately the quieter sounds may fall below the threshold of the LSB
and therefore be quantised to zero and the information lost. Therefore non-linear quantisers are used such that
the quantisation level at low input levels is much smaller than for higher level signals. A-law quantisers are often
implemented by using a nonlinear circuit followed by a uniform quantiser. Two schemes are widely in use: the
A-law in Europe, and the μ-law in the USA and Japan. Similarly, the DAC can have a non-linear characteristic.
• The ADC samples at the Nyquist rate, and the sampled data value is
the closest (discrete) ADC level to the actual value:
[Figure: a continuous signal s(t) is sampled by the ADC at rate fs (sample period ts), producing quantised values v̂(n) on discrete levels -4 to +4, plotted against sample index n.]

v̂(n) = Quantise{ s(n·ts) }, for n = 0, 1, 2, ...
[Figure: 5 bit ADC transfer characteristic - binary output codes from 10000 (-16) to 01111 (+15) in steps of 1 volt, with Vmin = -15 volts.]
In the above slide figure, for the second sample the true sample value is 1.589998..., however our ADC
quantises to a value of 2.
• If the smallest step size of a linear ADC is q volts, then the error of any
one sample is at worst q/2 volts.
[Figure: linear ADC characteristic with step size q volts - output codes from 10000 (-16) to 01111 (+15).]
Quantisation error is often modelled as an additive noise component, and indeed the quantisation process can be considered purely as the addition of this noise:

[Figure: the ADC modelled as y = x + nq, i.e. the output equals the input plus quantisation noise nq.]
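This additive-noise view can be checked with a quick sketch: a uniform rounding quantiser of step q applied to a sine wave, with every error sample bounded by q/2 (names and values ours):

```python
import math

# Quantisation error as additive noise: y = x + nq, with |nq| <= q/2
# for a rounding quantiser of step size q.

def quantise(x, q):
    return q * round(x / q)

q = 0.25
xs = [math.sin(2 * math.pi * n / 50) for n in range(50)]
errors = [quantise(x, q) - x for x in xs]   # the noise samples nq
assert all(abs(e) <= q / 2 for e in errors)
```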
[Figure: quantisation in action - the continuous input signal, the ADC characteristic (input -4 to +4 mapped to output -4 to +4), the quantised output, and the resulting quantisation error, all plotted in amplitude/volts against time/seconds.]
• The full adder circuit can be used in a chain to add multi-bit numbers.
The following example shows 4 bits:
[Figure: 4 bit ripple-carry adder - full adders (Σ) chained through carries C0-C3 with carry-in '0', computing A3A2A1A0 + B3B2B1B0 = S4S3S2S1S0 (MSB to LSB).]
• This chain can be extended to any number of bits. Note that the last
carry output forms an extra bit in the sum.
• If we do not allow for an extra bit in the sum, and a carry out of the last adder occurs, an "overflow" will result, i.e. the number will be incorrectly represented.
Notes:
The truth table for the full adder is:

A  B  CIN | S  COUT
0  0   0  | 0   0      (0+0+0 = 0)
0  0   1  | 1   0      (0+0+1 = 1)
0  1   0  | 1   0      (0+1+0 = 1)
0  1   1  | 0   1      (0+1+1 = 2)
1  0   0  | 1   0      (1+0+0 = 1)
1  0   1  | 0   1      (1+0+1 = 2)
1  1   0  | 0   1      (1+1+0 = 2)
1  1   1  | 1   1      (1+1+1 = 3)
The longest propagation delay path in the above full adder is “two gates”.
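The full adder and the ripple-carry chain can be sketched in Python (bit lists are LSB first; function names ours):

```python
# Full adder: sum and carry from two bits plus a carry-in, then a
# ripple-carry chain that links the carries together.

def full_adder(a, b, cin):
    s = a ^ b ^ cin                          # sum bit
    cout = (a & b) | (a & cin) | (b & cin)   # carry out
    return s, cout

def ripple_add(a_bits, b_bits):
    """Add two equal-length LSB-first bit lists; returns an LSB-first sum."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    out.append(carry)   # the final carry forms the extra sum bit
    return out

# 1011 (+11) + 1101 (+13) = 11000 (+24); written LSB first:
print(ripple_add([1, 1, 0, 1], [1, 0, 1, 1]))
```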
Subtraction 32
[Figure: 4 bit 2's complement subtractor - the B operand bits are inverted and the carry-in set to '1' (invert and add 1), the final carry C4 is discarded, giving the difference D3 D2 D1 D0.]
Sometimes we need a combined adder/subtractor with the ability to switch between modes.

[Figure: combined adder/subtractor - a MUX per bit selects either Bn or its inverse to feed the full adder chain, with control signal K also driving the carry-in.]

For A + B: K = 0. For A - B: K = 1.
• With 2’s complement overflow will occur when the result to be produced
lies outside the range of the number of bits.
• Therefore for an 8 bit example the range is -128 to +127 (or in binary
this is 100000002 to 011111112:
-65 10111111 100 01100100
+ -112 +10010000 + 37 +00100101
-177 101001111 137 10001001
With an 8 bit result we lose the 9th bit With an 8 bit result the result “wraps
and the result “wraps around” to a around” to a negative value:
positive value: 01001111 = 79 . 10001001 = – 119 .
For example
10110111 01100100
(-73) + 127 = 54 +01111111 100 + 64 = 164 +01000000
1 00110110 10100100
Discard final 9th bit carry
No overflow MSB bit indicate -ve result! Overflow
Adding +ve and -ve will never overflow!
Adding +ve and +ve if a -ve result then overflow
Adding -ve and -ve if a +ve result then overflow
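These wraparound examples, and the sign-based overflow rule, can be sketched as follows (helper name ours):

```python
# 8 bit 2's complement addition with wraparound, plus the overflow
# rule: overflow only when the operands share a sign and the result's
# sign differs.

def add8_wrap(a, b):
    result = (a + b) & 0xFF
    if result & 0x80:                 # MSB set -> negative value
        result -= 256
    overflow = (a < 0) == (b < 0) and (a < 0) != (result < 0)
    return result, overflow

print(add8_wrap(-65, -112))   # wraps to +79, overflow
print(add8_wrap(100, 37))     # wraps to -119, overflow
print(add8_wrap(-73, 127))    # 54, no overflow
```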
• When overflow is detected, the result is set to the closest possible value (i.e. for the 8 bit case, either -128 or +127).
• Therefore, for every addition that is explicitly done with an adder block in Xilinx System Generator, the user will get a checkbox choice to allow results to either (i) Wraparound or (ii) Saturate.
Generally, for later FPGAs such as the Virtex-4, using the DSP48 blocks gives adders with 48 bits of precision, so when working with, say, 16 bit values, growth beyond 48 bits is unlikely. Hence overflow has been "designed out". Of course, not all applications will use these devices; when using general slice logic and attempting to make adders as small as possible, care must be taken, and where appropriate for efficient design, saturation might be included.
Saturation is extremely useful for adaptive algorithms. For example, in the Least Mean Squares algorithm
(LMS), the filter weights w are updated according to the equation:
w(k) = w(k-1) + 2μe(k)x(k)

Without further concern over the meaning of this equation, we can see that the term 2μe(k)x(k) is added to the weights at time epoch k-1 to generate the new weights at time epoch k.
If the operations that form 2μe(k)x(k) were to overflow, there is a high chance that the sign of the term would flip and drive the weights in completely the wrong direction, leading to instability.
With saturation however, if the term 2μe(k)x(k) gets very big and would overflow, saturation will limit it to the maximum value representable, causing the weights to change in the right direction, and at the fastest speed possible in the current representation. The result is a huge increase in the stability of the algorithm.
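An illustrative sketch of the wraparound-versus-saturation effect on an update term, using pretend 8 bit arithmetic (the numbers and helper names are ours, not from the LMS derivation):

```python
# If a large positive update term overflows and wraps, its sign flips
# and the weight is driven the wrong way; saturation keeps the sign.

def wrap(v):
    return ((v + 128) % 256) - 128      # 8 bit wraparound behaviour

def saturate(v):
    return max(-128, min(127, v))       # clamp to the closest value

update = 150             # a large positive update term that overflows
print(wrap(update))      # wraps negative: weight moves the wrong way
print(saturate(update))  # limited to +127: still the right direction
```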
[Figure: full adder mapped onto slice logic - the LUT computes D = A xor B, the XORG gate produces Sout, and the carry multiplexer (MUXCY) produces Cout from Cin.]

G1 (A)  G2 (B)  D
  0       0     0
  0       1     1
  1       0     1
  1       1     0

Sout = Cin xor D, Cout = A·D' + Cin·D (a multiplex operation). The result is the full adder implementation.
• A (very) high level diagram of the main logic components on one slice
[Figure: one slice, upper and lower halves - each half contains a 4-input LUT (configurable as RAM or shift register), a MULTAND gate, carry logic, XORG and ORCY gates, several muxes, and a D-type flip-flop, with slice inputs and outputs.]
• One 4-input Look-Up-Table (LUT) (can be configured as a shift register or simply as RAM/memory)
• One OR gate
• Clock inputs
"Large" FPGAs will have many tens of thousands of slices (and other components!)
• To produce larger adders the Xilinx tools will simply cascade the carry
bits in adjacent (where possible!) slices.
[Figure: a 2 bit addition (full adders for A0/B0 and A1/B1, carry-in '0', sums S0 and S1) fits in 1 slice; a 4 bit addition (A3...A0 + B3...B0 = S4...S0) cascades the carry into an adjacent slice, using 2 slices.]
A B C D | Y
0 0 0 0 | 1
0 0 0 1 | 0
0 0 1 0 | 0
0 0 1 1 | 0
0 1 0 0 | 0
0 1 0 1 | 0
0 1 1 0 | 0
0 1 1 1 | 0
1 0 0 0 | 0
1 0 0 1 | 0
1 0 1 0 | 0
1 0 1 1 | 0
1 1 0 0 | 0
1 1 0 1 | 0
1 1 1 0 | 1
1 1 1 1 | 1

[Figure: a 4 input LUT with inputs A, B, C and D and output Y; the same LUT can also be configured as RAM or a shift register.]
To implement this function, simply store the values of Y in the slice LUT, and then address the LUT with values of ABCD to get the output.
Therefore ANY 4 variable Boolean function can be simply implemented with a four input LUT. (Of course, if the equation has only 3 variables then we can implement it too, by simply setting one input constant.)
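The LUT-as-stored-truth-table idea can be sketched directly: 16 stored bits addressed by the input word ABCD (the table contents match the truth table above; function name ours):

```python
# A 4-input LUT is just 16 stored bits addressed by ABCD.
# Here Y = 1 for inputs 0000, 1110 and 1111, as in the truth table.

LUT = [0] * 16
for address in (0b0000, 0b1110, 0b1111):
    LUT[address] = 1

def lut_out(a, b, c, d):
    return LUT[(a << 3) | (b << 2) | (c << 1) | d]

print(lut_out(0, 0, 0, 0))  # 1
print(lut_out(1, 1, 1, 0))  # 1
print(lut_out(0, 1, 0, 1))  # 0
```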
        11010110     A7...A0
     x 00101101     B7...B0
        11010110
       000000000
      1101011000
     11010110000
    000000000000
   1101011000000
  00000000000000
 000000000000000
0010010110011110     P15...P0
• So we can perform multiplication using just full adders and a little logic
for selection, in a layout which performs the shifting.
   214
 x  45
  1070
+ 8560
  9630
Note that we do 214 × 5 = 1070 and then add to it the result of 214 × 4 = 856 right-shifted by one column.
For each additional column in the second operand, we shift the multiplication of that column with the first
operand by another place.
      zzz
   x aaaa
     bbbb
 + cccc0
+ dddd00
+ eeee000   etc...
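The shift-and-add pattern above can be sketched for unsigned operands (function name ours): each set bit of the second operand contributes the first operand shifted into that bit's column.

```python
# Unsigned shift-and-add multiplication: one shifted partial product
# per set bit of the second operand.

def shift_add_multiply(a, b):
    product = 0
    column = 0
    while b:
        if b & 1:                     # this column's partial product
            product += a << column    # a, shifted into place
        b >>= 1
        column += 1
    return product

print(shift_add_multiply(214, 45))   # 9630, as in the decimal example
```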
• For one negative and one positive operand just remember to sign
extend the negative operand.
        11010110      -42
     x 00101101     x  45
1111111111010110
0000000000000000
1111111101011000
1111111010110000     (sign extends)
0000000000000000
1111101011000000
0000000000000000
0000000000000000
1111100010011110     -1890
We use the trick of negating (inverting the bits and adding 1) the last partial product and adding it, rather than subtracting.
Of course, if both operands are positive, just use the unsigned technique!
The difference between signed and unsigned multiplies results in different hardware being necessary. DSP processors typically have separate unsigned and signed multiply instructions.
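The sign-extension technique can be sketched as follows; this sketch (function name ours) assumes the second operand is non-negative, as in the slide's example, and wraps each partial product to the full result width:

```python
# Signed multiply by summing sign-extended partial products, keeping
# only the low 2*bits bits of the result. The second operand b is
# assumed non-negative, as in the -42 x 45 example.

def signed_multiply(a, b, bits=8):
    width = 2 * bits
    mask = (1 << width) - 1
    total = 0
    for k in range(bits):
        if (b >> k) & 1:
            total += (a << k) & mask   # partial product, wrapped to width
    total &= mask
    if total & (1 << (width - 1)):     # interpret the MSB as negative
        total -= 1 << width
    return total

print(signed_multiply(-42, 45))   # -1890, as in the slide
```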
      11010.110               26.750
   x 00101.101             x   5.625
      11.010110             0.133750
     000.000000             0.535000
    1101.011000            16.050000
   11010.110000           133.750000
  000000.000000           ----------
 1101011.000000           150.468750
00000000.000000
000000000.000000
0010010110.011110
• Distributed multipliers
• Constant multipliers
• Shift-and-add “multipliers”
• 18 x 18 bit multipliers
• Multiply, accumulate
The most basic multiplier is a 2-input version which is implemented using the logic fabric, i.e. the lookup tables within the slices of the device. This type is referred to as a distributed multiplier, because the implementation is distributed over the resources in several slices.
In the case of multiplication with a constant, which is commonly required in DSP, the knowledge of one
multiplicand can be exploited to create a cheaper hardware implementation than a conventional 2-input
multiplier. Two approaches that will be discussed in the coming pages are ROM-based constant multipliers, and
“shift-and-add” multipliers which sum the outputs from binary shift operations.
The FPGA companies are well aware that DSP engineers desire fast and efficient multipliers, and as a result,
they began incorporating embedded multipliers into their devices in the year 2000. Since then the sophistication
of these components has increased, and they have been extended to feature fast adders and in many cases
longer wordlengths, too. We can now think of them as embedded arithmetic slices, rather than simply
multipliers.
[Figure: distributed array multiplier - each stage ANDs one bit of b (b0...b3) with the a operand and adds it to the running sum with full adders (FA), carries and sums passing diagonally to form product bits p7...p0.]

Example:

    1101      13
  x 1011      11
    1101
   1101
  0000
 1101
10001111     143
• The AND gate connected to a and b performs the selection for each
bit. The diagonal structure of the multiplier implicitly inserts zeros in the
appropriate columns and shifts the a operands to the right.
• Note that this structure does not work for signed two’s complement!
The operation of multiplying 1's and 0's is the same as ANDing 1's and 0's:
A B Z
0 0 0
0 1 0
1 0 0
1 1 1
Hence the AND gate is the bit multiplier. The function of one partial product stage of the multiplier is as shown
below.
[Figure: one partial product stage of the multiplier - inputs x3 x2 x1 x0 and a3 a2 a1 a0 with select bit b0, built from full adders (FA), producing y4 y3 y2 y1 y0 = b0(a3 a2 a1 a0) + x3 x2 x1 x0.]
• This shows the top half of a slice, which implements one multiplier cell.
[Figure: the top half of a slice implementing one multiplier cell (shown for a Virtex-II Pro FPGA) - the LUT forms D from G1 (B), G2 (A) and G3 (S), with the MULTAND gate, carry mux (MUXCY) and XORG producing Sout and Cout from Cin.]
The dedicated MULTAND unit is required as the intermediate product G1·G2 cannot be obtained from within the LUT, but is required as an input to MUXCY. The two AND gates perform a one-bit multiply each, and the result is added by the XOR plus the external logic (MUXCY, XORG):

Sout = CIN xor D, COUT = AB·D' + CIN·D

This structure will perform one cell of the multiplier (see the next slide...).
Note that whereas the signal flow graph of the distributed multiplier shows signals propagating from the top and
right of the diagram to the bottom, the internal structure of the FPGA slice logic results in a different
configuration when implemented on a device.
• The two operands A and B are concatenated to form the address with
which to access the ROM. The value stored at that address is the
multiplication result, P:
[Figure: ROM-based multiplier - the two 4 bit operands A and B are concatenated into an 8 bit address A:B selecting one of 2^8 = 256 8-bit products. For example A = 1010 (decimal -6) and B = 0011 (decimal 3) form address 1010 0011, which stores the product 1110 1110 (decimal -18). The ROM runs from address 0000 0000 (data 0000 0000) to address 1111 1111 (data 0000 0001).]
For example, with 8 bit operands (a fairly reasonable size), 1 Mbit of storage is required - a large quantity. For bigger operands the storage requirement is huge: 16 bit operands require 128 Gbits, and hence a ROM-based multiplier is clearly not a realistic implementation choice!
Input Wordlength (N) | Output Wordlength (2N) | No. of ROM entries (2^2N) | Total ROM Storage (2N x 2^2N)
 4 |  8 | 2^8  = 256               | 2 Kbits
 6 | 12 | 2^12 = 4,096             | 48 Kbits
 8 | 16 | 2^16 = 65,536            | 1 Mbit
10 | 20 | 2^20 = 1,048,576         | 20 Mbits
12 | 24 | 2^24 = 16,777,216        | 384 Mbits
14 | 28 | 2^28 = 268,435,456       | 7 Gbits
16 | 32 | 2^32 = 4,294,967,296     | 128 Gbits
18 | 36 | 2^36 = 68,719,476,736    | 2.25 Tbits
20 | 40 | 2^40 = 1,099,511,627,776 | 40 Tbits
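The table entries follow directly from N; a quick sketch (function name ours) reproduces a few rows:

```python
# ROM-based multiplier storage: N-bit operands need 2^(2N) entries
# of 2N bits each.

def rom_bits(n):
    entries = 2 ** (2 * n)
    return entries * 2 * n    # total storage in bits

assert rom_bits(4) == 2 * 1024           # 2 Kbits
assert rom_bits(8) == 2 ** 20            # 1 Mbit
assert rom_bits(16) == 128 * 2 ** 30     # 128 Gbits
```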
• Consider a ROM multiplier with 8-bit inputs: 65,536 8-bit locations are
required to store all possible outputs... so 1Mbit of storage is needed!
[Figure: ROM-based multiplier with 8 bit operands - A and B are concatenated into a 16 bit address A:B selecting one of 2^16 = 65,536 16-bit products. For example A = 0110 1011 (decimal 107) addresses location 27,459, which stores 0001 1100 0000 0001 (decimal 7,169); the final location 1111 1111 1111 1111 stores 0000 0000 0000 0001.]
• The storage required for output words may also be reduced, if the maximum result does not require the full numerical range of:

-2^(2N-1) ≤ result ≤ 2^(2N-1) - 1
• The maximum product and output wordlength can be calculated for the
particular constant value, and the multiplier optimised accordingly...
[Figure: an 8-bit signed input A has maximum absolute value 128 (for A = -128); the corresponding worst-case product with the constant determines the wordlength of the 16-bit signed result P.]
In System Generator, the designer can specify the implementation style via the Constant Multiplier dialog box,
along with the constant value, the output wordlength, and other parameters.
[Figure: binary example of two concurrent constant multiplications, x9 and x24, built from shifts and adds.]
Taking the above simple example of two concurrent multiplications, one x9 and the other x24, it is clear that the shift left by three places can be shared, as x8 is common to both operations (9 = 8 + 1 and 24 = 16 + 8).
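A minimal sketch of this sharing (illustrative Python; the function name is invented): both products are formed from shifts and adds, with the x8 term computed once.

```python
def mul9_mul24(x):
    """Compute 9*x and 24*x from shifts and adds, sharing the common x8 term."""
    x8 = x << 3                       # shift left by three places (x8), shared
    return x8 + x, (x8 << 1) + x8     # 9x = 8x + x ; 24x = 16x + 8x

assert mul9_mul24(5) == (45, 120)
```

In hardware, sharing the x8 term means one shifter (in fact just wiring) feeds both adder trees, so only three adders are needed in total rather than four.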
• The Xilinx Virtex-II and Virtex-II Pro series were the first to provide "on-chip" multipliers, in the early 2000s.
• These are implemented in hardware on the FPGA ASIC, not in the user FPGA "slice-logic-area". They are therefore permanently available, and they use no slices. They also consume less power than a slice-based equivalent and can be clocked at the maximum rate of the device.
[Figure: dedicated 18x18-bit multiplier block with operand inputs A and B and product output P.]
• A and B are 18-bit input operands, and P is the 36-bit product, i.e.
P = A × B.
• Depending upon the actual FPGA, between 12 and more than 2,000 (Virtex-6 top of range) of these dedicated multipliers are available.
Notes:
Looking at a device floorplan, you can clearly see the embedded multipliers, which are located next to Block
RAMs on the FPGA in order to support high speed data fetching/writing and computation.
Information on dedicated multipliers taken from “Virtex-II Pro Platform FPGAs: Introduction and Overview”,
DS083-1 (v2.4) January 20, 2003. http://www.xilinx.com.
The wordlengths of the embedded multipliers are fixed at 18 x 18 bits, and it makes sense to use them as fully
as possible.
It is relatively easy to see that a 4 x 4 bit multiply will greatly underuse the capabilities of the multiplier, and this
particular multiply operation might be better mapped to a distributed implementation, which would leave the
embedded multiplier free for use somewhere else. Of course, these decisions are made in the context of some
larger design with its own particular needs for the various resources available on the FPGA being targeted.
Perhaps less obviously, mapping a multiplication to embedded multipliers where the input operands are slightly
longer than 18 bits is also inefficient. This may result in, for example, the following implementation for a
requested 19 x 19 bits multiplier, where 4 embedded multipliers are used instead of the expected 1!
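The decomposition behind that 4-multiplier mapping can be sketched behaviourally for unsigned operands (an illustrative model, not the tool's actual mapping): split each 19-bit operand into its top bit and low 18 bits, form the four partial products, and recombine with shifts.

```python
def mul19(a, b):
    """Unsigned 19 x 19 multiply from four partial products, mirroring the
    mapping onto 18x18 embedded multipliers (18x18, 1x18, 18x1, 1x1)."""
    a1, a0 = a >> 18, a & 0x3FFFF   # top bit, low 18 bits
    b1, b0 = b >> 18, b & 0x3FFFF
    return (a1 * b1 << 36) + ((a1 * b0 + a0 * b1) << 18) + a0 * b0

assert mul19(2**19 - 1, 2**19 - 1) == (2**19 - 1) ** 2
```

Each of the four terms corresponds to one embedded multiplier, which is why a single extra bit of operand wordlength quadruples the multiplier usage.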
[Figure: a requested 19 x 19 bit multiplier decomposed into four embedded multipliers: an 18 x 18, a 1 x 18, an 18 x 1 and a 1 x 1, whose partial products are combined to form the 38-bit result.]
[Figure: Virtex-4 DSP48 slice: an 18 x 18 bit multiplier (36-bit product) feeding a 48-bit adder/subtractor, with 48-bit cascade paths between slices.]
• Like the embedded multipliers, these are low power and fast.
• The ability to cascade slices together also means that whole filters can
be constructed without having to use any slices.
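The cascade idea can be modelled in software. The sketch below (Python used purely as a behavioural illustration; the function name is invented) is a transposed-form FIR in which each tap corresponds to one DSP slice: the broadcast input sample is multiplied by the tap coefficient and added to the registered partial sum cascaded in from the neighbouring slice, with no general slice logic involved.

```python
def fir_cascade(x, h):
    """Behavioural model of an FIR filter built from cascaded multiply-add
    slices (transposed form): slice i computes sample*coeff plus the
    registered cascade input from slice i-1."""
    M = len(h)
    p = [0] * M                        # one partial-sum register per slice
    y = []
    for s in x:
        for i in range(M - 1, 0, -1):  # update from the output end so each
            p[i] = s * h[M - 1 - i] + p[i - 1]  # slice sees last cycle's cascade
        p[0] = s * h[M - 1]
        y.append(p[M - 1])             # filter output from the final slice
    return y

# impulse response reproduces the coefficients
assert fir_cascade([1, 0, 0, 0], [1, 2, 3]) == [1, 2, 3, 0]
```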
Notes:
The next series of FPGAs (the Virtex-5) enhanced the capabilities of the DSP slice with the DSP48E.
The major improvements of this slice are logic capabilities within the adder/subtractor unit, and an extended
wordlength of one input to 25 bits. The maximum clock frequency also increased in line with the speed of the
device.
[Figure: Virtex-5 DSP48E slice: a 25 x 18 bit multiplier (43-bit product) feeding a 48-bit ALU, with 48-bit cascade paths.]
[Figure: Virtex-6 DSP48E1 slice: a 25-bit pre-adder feeding the 25 x 18 bit multiplier (43-bit product) and the 48-bit ALU.]
[Figure: division array producing the quotient Q = B/A, one quotient bit (q2, q1, q0, ...) per row of cells.]
• Note that each cell can perform either addition or subtraction, as shown in an earlier slide ⇒ either Sin + Bin or Sin - Bin can be selected.
Notes:
A direct method of computing division exists. This "paper and pencil" method may look familiar, as it is often taught in school. A binary example is given below. Note that each stage computes an addition or subtraction of the divisor A. The quotient is made up of the carry bits from each addition/subtraction. If the quotient bit is a 0, the next computation is an addition, and if it is a 1, the divisor is subtracted. It is not difficult to map this example onto the structure shown on the slide.
            01011    R0 = B
  -A        10011
q4 = 0      11110    R1     (carry = 0)
            11100    2.R1
  +A        01101
q3 = 1      01001    R2     (carry = 1)
            10010    2.R2
  -A        10011
q2 = 1      00101    R3     (carry = 1)
            01010    2.R3
  -A        10011
q1 = 0      11101    R4     (carry = 0)
            11010    2.R4
  +A        01101
q0 = 1      00111    R5     (carry = 1)
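The worked example can be reproduced with a short behavioural model (illustrative Python; the function name is invented, and the carry-out of each row is modelled here by whether the new remainder is non-negative):

```python
def nonrestoring_divide(b, a, nbits=5):
    """Behavioural model of the non-restoring division array: the first row
    subtracts the divisor; each later row doubles the remainder, then adds
    or subtracts A depending on the previous quotient bit."""
    r = b                        # R0 = B
    q_bits = []
    subtract = True              # the first operation is always a subtraction
    for _ in range(nbits):
        r = r - a if subtract else r + a
        q = 1 if r >= 0 else 0   # row carry-out = quotient bit
        q_bits.append(q)
        subtract = (q == 1)      # q = 1 -> subtract next row, q = 0 -> add
        r <<= 1                  # form 2.R for the next row
    return q_bits

# B = 01011, A = 01101 gives quotient bits q4..q0 = 0 1 1 0 1, as above
assert nonrestoring_divide(0b01011, 0b01101) == [0, 1, 1, 0, 1]
```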
• It is unlikely that the quotient can be passed on to the next stage until
all the bits are computed - hence slowing down the system!
• Note that we must wait for N full adder delays before the next row can
begin its calculations.
Another problem for division is the fact that it takes N full adder delays before the next row can start. In the
examples below, the order in which the cells can start has been shown. So for the multiplier, the first cell on the
second row is the 3rd cell to start working. However, for the divider, the first cell on the second row is only the
5th cell to start working because it has to wait for the 4 cells on the first row to finish.
[Figure: order in which cells can start computing, for the 4 x 4 multiplier array and the division array. In the multiplier array the first cell of the second row is the 3rd cell to start; in the division array it is only the 5th, since it must wait for all 4 cells of the first row to finish. Each cell is built around a full adder (FA).]
Pipelining The Division Array
[Figure: pipelined version of the division array; with registers inserted between the rows of full-adder (FA) cells, the quotient bits q4, q3, q2, q1, q0 of Q = B/A emerge from successive pipelined rows.]
cos θ = x / sqrt(x^2 + y^2)   and   sin θ = y / sqrt(x^2 + y^2)