Fpga Notes April23
Introduction to FPGA
Introduction: DSP and FPGAs 1
• In the last 20 years the majority of DSP applications have been enabled by DSP processors.
• But the most recent technology platform for high speed DSP applications is the Field Programmable Gate Array (FPGA).
This course is all about the why and how of DSP with FPGAs!
R. Stewart, Dept EEE, University of Strathclyde, 2010
Notes:
DSP is all about multiplies and accumulates/adds (MACs). As we progress through the course, we will see that
most algorithms that are used for different applications employ digital filters, adaptive filters, Fourier transforms
and so on. These algorithms all require multiplies and adds (note that a divide or square root is quite a rare thing
in DSP).
Hence a DSP algorithm or problem is often specified in terms of its MAC requirements. In particular, when comparing two algorithms, if they both perform the same job but one needs fewer MACs than the other, then clearly the "cheaper" one would be the best choice. However this implies some assumptions. One is that the required MACs are the same - but surely a multiply is a multiply! Well, yes: in the traditional DSP processor based situation we are likely to be using, say, a 16 bit device which will process 16 bit inputs, using 16 bit digital filter coefficients etc. With FPGAs this constraint is removed - we can use as many, or as few, bits as are required. Therefore we can choose to optimise and schedule DSP algorithms in a completely different way.
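To make the MAC idea concrete, here is a minimal Python sketch of a direct-form FIR filter expressed purely as multiply-accumulate operations (the function name and coefficient values are ours, chosen for illustration, not from the notes):

```python
# Hypothetical sketch: a direct-form FIR filter is nothing more than a
# sequence of multiply-accumulate (MAC) operations.

def fir_mac(samples, coeffs):
    """Convolve input samples with filter coefficients using explicit MACs."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * samples[n - k]   # one MAC per coefficient
        out.append(acc)
    return out

# A filter with N coefficients costs N MACs per output sample.
y = fir_mac([1, 0, 0, 0], [0.5, 0.25, 0.125])
```

An impulse input simply reads out the coefficients, which makes the MAC count per output sample easy to see.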
[Figure: example DSP circuit board - a DSP56307 DSP processor connected over a general purpose input/output bus to an ADC and DAC, with amplifiers/filters at the analogue interfaces.]
• Since around 1998 the evolution of FPGAs into the DSP market has been sustained by classic technology progress, such as the ever-present Moore's law.
Anyone who has purchased a new laptop knows the feeling. If you just wait, then in the next quarter you will get the new model with integrated WiFi or WiMax and a faster processor. Of course, wait another quarter and in 6 months it will be improved again - and the new faster, better, bigger machine is likely to be cheaper too! Such is technology.
DSP for FPGAs is just the same. If you wait another year it's likely the vendors will bring out prepackaged algorithms for precisely what you want to do. And they will be easier to work with - higher level design tools, design wizards and so on.
So if you are planning to design a QR adaptive equalizing beamformer for MIMO implementation of a software
radio for 802.16 - then if you wait, it will probably be a free download in a few years. But of course who can wait?
Therefore in this course, we discuss and review the fundamental strategies of designing DSP for FPGAs. Like
all technologies you still need to know how it works if you really want to use it.
• Of course the resource is finite and the connections available are finite.
• However, the high level concept: take the blocks, and build it.

[Figure: high level design flow - design, verify, then place and route, with clocks and input/output connections.]
Yes in both cases, but modern toolsets and design flows are such that it might be the same person.
There is lots to worry about. In terms of the DSP design: is the arithmetic correct (i.e. overflows, underflows, saturation etc.)? Do the latency or delays used allow the integrity of the algorithm to be maintained?
For the FPGA: can we clock at a high enough rate? Does the design place and route? What device do we need, and how efficient is the implementation? (Just like compilers, different vendors' design flows will give different results, some better than others.)
As vendors provide higher level components (like the DSP48 slice from Xilinx, which allows a complete FIR to be implemented), issues such as overflow, numerical integrity and so on are taken care of.
• The bottom line for DSP is multiplies and adds - and lots of them!
N bits + N bits = (N+1) bits: adding two N bit numbers produces a result of N+1 bits.
Within traditional DSP processors this wordlength growth is well known and catered for.
For example, the Motorola 56000 series is so called because it has a 56 bit accumulator, i.e. the largest result
of any “addition” operation can have 56 bits. For a typical DSP filtering type operation we may require to take,
say an array of 24 bit numbers and multiply by an array of another 24 bits numbers. The result of each multiply
will be a 48 bit number. If we then add two 48 bit numbers together, if they both just happen to be large positive
values then the result could be a 49 bit number. Now if we add many 48 bit numbers together (and they just all
happen to be large positive values), then the final result may have a word growth of quite a few bits. So one
must assume that Motorola had a good look at this, and realised it was fairly unlikely that the result of adding these 48 bit products together would ever be larger than 56 bits - so 56 bits was chosen. (Of course if you did have a problem that grew beyond 56 bits you would have to put special trapping into the code to catch this.)
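The wordlength growth argument can be sketched numerically. This is a simple worst-case model (function name ours, not Motorola's actual analysis):

```python
from math import ceil, log2

# Worst-case wordlength growth: multiplying two B-bit values gives a
# 2B-bit product, and accumulating K such products can grow the result
# by up to ceil(log2(K)) further bits.

def acc_bits(operand_bits, num_products):
    product_bits = 2 * operand_bits        # e.g. 24 x 24 -> 48 bits
    growth = ceil(log2(num_products))      # worst-case carry growth
    return product_bits + growth

# Accumulating up to 256 products of 24 bit operands needs 48 + 8 = 56
# bits, matching the 56000's 56 bit accumulator.
print(acc_bits(24, 256))
```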
[Figure: 4 bit ripple-carry adder - full adders (Σ) chained through carries C0-C3 with carry-in '0', computing A3A2A1A0 + B3B2B1B0 = S4S3S2S1S0 (MSB to LSB). Each full adder adds two bits plus one carry-in bit to produce a sum and a carry-out.]

Worked example:

    1011    (+11)
  + 1101    (+13)
   11000    (+24)

[Figure: array multiplier fragment - product bits p7...p0 formed from AND terms, z = a.b.]
• Early gate array design flow would be: design, simulate/verify, device production and test. Metal layers make simple connections, e.g. Z = AB + CD.

[Figure: two AND gates (inputs A, B and C, D) feeding an OR gate to produce Z = AB + CD.]

Early simulators and netlisters such as HILO (from GenRad) were used.
From GA to FPGA
However, simple gate arrays, although very generic, were used by many different users for similar systems - for example to implement two level logic functions, flip-flops and registers, and perhaps addition and subtraction functions.
For a GA, once the layer(s) of metal had been laid on a device - that's it! No changes, no updates, no fixes.
So then we move to field programmable gate arrays. Two key differences between these and gate arrays:
• They can be reprogrammed in the "field", i.e. the logic specified is changeable
• They are no longer composed of just NAND gates, but a carefully balanced selection of multi-input logic, flip-flops, multiplexers and memory.
Generic FPGA Architecture (Logic Fabric) 8
• Arrays of gates and higher level logic blocks might be referred to as the logic fabric...
[Figure: generic FPGA architecture - an array of logic elements surrounded by I/O blocks, with row and column interconnects. Each logic element contains a LUT, a select MUX, cascade/carry logic and flip-flops.]
Of course the actual contents of a logic element will vary from manufacturer to manufacturer and device to
device.
[Figure: FPGA floorplan - Block RAM and arithmetic block columns embedded within rows and columns of logic fabric, with input/output blocks around the perimeter.]
Block RAMs are also used extensively in DSP. Example uses are storing filter coefficients, encoding and decoding, and other tasks.
[Figure: Xilinx FPGA hierarchy - the FPGA contains CLBs, each CLB contains slices, and a switch matrix connects each CLB to other CLBs.]
Continuing with the Xilinx example, the combinational blocks are termed Lookup Tables (LUTs), and in most devices have 4 inputs (some of the more recent devices have 6-input LUTs). These LUTs can be utilised in four modes:
• As logic function generators
• As ROM
• As distributed RAM
• As shift registers
The sequential element paired with each LUT can act as:
• A flip-flop
• A latch
Over the next few slides, the functionality of LUTs and registers in the above modes will be described.
• When used to implement a logic function, the 4-bit input addresses the
LUT to find the correct output, Z, for that combination of A, B, C and D.
[Figure: 4-input lookup table with inputs A, B, C and D, and output Z.]
• Dual port RAM requires more resources than single port RAM.
• Additional Shift In and Shift Out ports are used, and the 4-bit address
is used to define the memory location which is asynchronously read.
• For example, if the LUT input is the word 1001, the output from the 10th
register is read, as depicted below.
[Figure: LUT configured as a 16-bit shift register (SRL16) - a clocked chain of registers 0-15; the 4 bit LUT input (e.g. 1001) selects which register drives the D OUT output.]
• The slice register at the output from the LUT can be used to add
another 1 clock cycle delay. Using the register also synchronises the
read operation.
Notes:
As with the other LUT configurations, larger Shift Registers can be constructed by combining several LUTs. For
example, a 64-bit shift register segment can be constructed by combining four 16-bit Shift Registers together,
as shown below. The cascadable ports allow further interconnections for larger Shift Registers.
[Figure: 64-bit shift register built from four SRL16s across two slices - the MSB of each SRL16 cascades through DI into the next, with cascadable in and out ports allowing further extension.]
• The sequential logic element which follows the lookup table can be configured as either:
• An edge-triggered flip-flop
• A level-sensitive latch
• The input to the register may be the output from the LUT, or
alternatively a direct input to the slice (i.e. the LUT is bypassed).
[Figure: LUT/register pair with a bypass input, allowing the register to be fed directly so the LUT is skipped.]

D(t)  Q(t+1)
  0      0
  1      1
When configured as a latch, the control inputs define when data on the D input is “captured” and stored within
the register. The Q output thereafter remains unchanged until new data is captured.
Flip flops and registers are discussed in the Digital Logic Review notes chapter.
[Figure: 8-bit resettable shift register - eight D-type flip-flops (D Q) with common CLOCK and RESET (R), spread across slices 1 to 4.]
Notes:
We can still design a resettable shift register with an SRL16, by using a slightly more sophisticated design.
Instead of making all elements resettable, we can implement the first element using a slice register, and the
subsequent ones using an SRL16. The reset signal is held high for 8 clock cycles, which allows the 0 input to
propagate through the shift register. Instead of using 4 slices, this design would require 2 slices at most.
• Number representations:
binary word formats for signed and unsigned integers, 2’s
complement, fixed point and floating point;
Addition - decimal, two’s complement integer binary, two’s complement fixed point, hardware structures for
addition, Xilinx-specific FPGA structures for addition.
Multiplication - decimal, 2s complement integer binary, two’s complement fixed point, hardware structures for
multiplication, Xilinx-specific FPGA structures for multiplication.
Division.
Square root.
Complex addition.
Complex multiplication.
• Sufficient resolution
For example, assume we have a DSP filtering application using 16 bit resolution arithmetic. We will show later
(see Slide 42) that the cost of a parallel multiplier (in terms of silicon area - speed product) can be approximated
as the number of full adder cells. Therefore for a 16 bit by 16 bit parallel multiply the cost is of the order of 16 x
16 = 256 "cells". The wordlength of 16 bits has been chosen (presumably) because the designer at some time demonstrated that 17 bits was too many, and 15 was not enough - or did they? Probably not! It's likely that we are using 16 bits because... well, that's what we usually use in DSP processors and we are creatures of habit! In the world of FPGA DSP arithmetic you can choose the resolution. Therefore, if it was demonstrated that in fact 9 bits was sufficient resolution, then the cost of a multiplier is 9 x 9 = 81 cells. This is approximately 30% of the cost of using 16 bit arithmetic.
Therefore it's important to get the wordlength right: too many bits wastes resources, and too few bits loses resolution. So how do we get it right? Well, you need to know your algorithms and DSP.
64    01000000
65    01000001
131   10000011
255   11111111

Note that the minimum value is 0, and the maximum value (255) is the sum of the powers of two from 2^0 to 2^7, where 8 is the number of bits in the binary word:
i.e. 2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^5 + 2^6 + 2^7 = 255 = 2^8 - 1
• The most negative and most positive numbers are represented by:

bit index:       n-1  n-2  n-3  ...  2  1  0
most negative:    1    0    0   ...  0  0  0
most positive:    0    1    1   ...  1  1  1
                 MSB                    LSB
As for the unsigned representation, the decimal number 82 is "01010010" in 2's complement signed format:

bit index:       7    6    5    4    3    2    1    0
bit weighting: -2^7  2^6  2^5  2^4  2^3  2^2  2^1  2^0

Meanwhile the decimal number -82 is "10101110" in 2's complement signed format, with the same bit weightings.

To negate a value, invert all bits and add 1:

            +82   0 1 0 1 0 0 1 0                -82   1 0 1 0 1 1 1 0
invert all bits   1 0 1 0 1 1 0 1    invert all bits   0 1 0 1 0 0 0 1
          add 1 + 0 0 0 0 0 0 0 1              add 1 + 0 0 0 0 0 0 0 1
            -82   1 0 1 0 1 1 1 0                +82   0 1 0 1 0 0 1 0
Note that when negating zero, the invert-and-add-1 procedure produces a ninth bit (00000000 inverts to 11111111, and adding 1 gives 100000000). However, if we simply ignore this ninth bit, the representation of "negative zero" becomes identical to the representation of positive zero.
Notice from the above that -128 can be represented but +128 cannot.
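A small Python sketch of these conversions (the helper names are ours); the decode step applies the negative MSB weighting of -2^(n-1) directly:

```python
# Minimal sketch of 8 bit 2's complement encode/decode, mirroring the
# "invert all bits and add 1" behaviour described in the notes.

def to_twos(value, bits=8):
    """Encode a signed integer as a 2's complement bit string."""
    return format(value & ((1 << bits) - 1), f'0{bits}b')

def from_twos(bitstr):
    """Decode: the MSB carries a negative weighting of -2^(n-1)."""
    n = len(bitstr)
    value = int(bitstr, 2)
    return value - (1 << n) if bitstr[0] == '1' else value

print(to_twos(82))            # 01010010
print(to_twos(-82))           # 10101110
print(from_twos('10000000'))  # -128
```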
• Bits on the left of the binary point are termed integer bits, and bits on
the right of the binary point are termed fractional bits.
• The format of a generic fixed point word, comprising n integer bits and b fractional bits, has the binary point between the two fields.
• The MSB has -ve weighting for 2’s complement (as for integer words).
Notes:
As examples, we consider the 2's complement word "11010110" with the binary point in two different places. Firstly, with the binary point to the left of the third bit from the right, i.e. 5 integer bits and 3 fractional bits:

bit index:        4    3    2    1    0    -1    -2    -3
bit weighting:  -2^4  2^3  2^2  2^1  2^0  2^-1  2^-2  2^-3
binary number:    1    1    0    1    0     1     1     0
decimal:        -16 + 8 + 2 + 0.5 + 0.25 = -5.25

...and secondly, with the binary point to the left of the fifth bit from the right, i.e. 3 integer bits and 5 fractional bits:

bit index:        2    1    0    -1    -2    -3    -4    -5
bit weighting:  -2^2  2^1  2^0  2^-1  2^-2  2^-3  2^-4  2^-5
binary number:    1    1    0     1     0     1     1     0
decimal:        -4 + 2 + 0.5 + 0.125 + 0.0625 = -1.3125

Note that these results are related by a factor of 2^2 = 4, i.e. 4 x -1.3125 = -5.25.
Fixed Point Range and Precision 22
[Figure: fixed point range and precision - number lines from -4 to +3 for words with 0, 1, 2 and 3 fractional bits, with quantisation intervals of 1, 0.5, 0.25 and 0.125 respectively; in general a word has n integer bits and b fractional bits, e.g. 3 integer bits and 5 fractional bits.]

• This looks much more accurate. The quantisation error is ±LSB/2 (where LSB = least significant bit)... so 0.015625 rather than 0.5!
Notes:
Quantisation is simply the DSP term for the process of representing infinite precision numbers with finite
precision numbers. In the decimal world, it is familiar to most to work with a given number of decimal places.
The real number π can be represented as 3.14159265... and so on. We can quantise or represent π to 4 decimal places as 3.1416. If we use "rounding" here, the error is |3.1416 - π| ≈ 0.0000073.
If we truncate (just chop off the digits below the 4th decimal place, giving 3.1415) then the error is larger: |3.1415 - π| ≈ 0.0000927.
Clearly rounding is most desirable to maintain the best possible accuracy. However it comes at a cost. Albeit the cost is relatively small, it is not "free".
When multiplying fractional numbers we will choose to work to a given number of places. For example, if we
work to two decimal places then the calculation:
Once we start performing billions of multiplies and adds in a DSP system it is not difficult to see that these small
errors can begin to stack up.
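The rounding versus truncation comparison can be sketched as follows. This illustrative helper (name ours) quantises to decimal places rather than binary bits, matching the π example:

```python
import math

# Quantising pi to 4 decimal places by rounding and by truncation,
# and comparing the resulting errors.

def quantise(x, places, mode='round'):
    scale = 10 ** places
    if mode == 'round':
        return round(x * scale) / scale
    return math.trunc(x * scale) / scale   # truncation: chop the digits

err_round = abs(quantise(math.pi, 4) - math.pi)           # ~0.0000073
err_trunc = abs(quantise(math.pi, 4, 'trunc') - math.pi)  # ~0.0000927
assert err_round < err_trunc
```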
[Figure: the bit pattern 010111 interpreted with three binary point positions. With weightings -8 4 2 1 0.5 0.25 the original number is 5.75; dividing by 4 (weightings -2 1 0.5 0.25 0.125 0.0625) gives 1.4375; multiplying by 2, i.e. shifting the point left by 1 place (weightings -16 8 4 2 1 0.5), gives 11.5.]

Reviewing the divide-by-4 and multiply-by-2 examples from the main slide... if we move the binary point, the weightings of the bits comprising the word, and hence the value it represents, change by a power-of-two factor.
• Fixed point word formats with 1 integer bit and a number of fractional
bits are often adopted.
Q-format notation is given in the form Qm.n, where m is the number of integer bits, and n is the number of
fractional bits. Notably this description excludes the MSB of the 2’s complement representation, which Q-format
considers a sign bit. Therefore the total number of bits in a Q-format number is 1 + m + n, whereas in 2’s
complement, the same word format would be described as having m+1 integer bits, and n fractional bits.
For example, a Q2.5 number has a sign bit, 2 other integer bits, and 5 fractional bits, and hence can be
represented as shown below. In 2’s complement, this would be described as having 3 integer bits and 5
fractional bits.
[Figure: Q2.5 word layout - 1 sign bit, m = 2 integer bits and n = 5 fractional bits in the Q-format description; 3 integer bits and 5 fractional bits in the 2's complement description.]
The Q0.15 format (often abbreviated to Q15) is used extensively in DSP as it covers the normalised range of numbers from -1 to +1 - 2^-15, and is equivalent to a 16 bit 2's complement representation with the binary point at position 15, i.e. 1 integer bit and 15 fractional bits.
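A minimal sketch of Q15 conversion, assuming rounding on encode and saturation at the range limits (helper names ours):

```python
# Q0.15: a 16 bit 2's complement word with 15 fractional bits,
# covering -1 to +1 - 2^-15.

def float_to_q15(x):
    """Scale by 2^15, round, and saturate to the representable range."""
    raw = int(round(x * (1 << 15)))
    return max(-(1 << 15), min((1 << 15) - 1, raw))

def q15_to_float(q):
    return q / (1 << 15)

print(float_to_q15(0.5))     # 16384
print(float_to_q15(1.0))     # saturates to 32767
print(q15_to_float(32767))   # 0.999969482421875, i.e. +1 - 2^-15
```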
Fractional Motivation - Normalisation 26
However, using normalised values, where all inputs are constrained to be in the range -1 to +1, it is easy to note that multiplying ANY two numbers in this range together will give a result also in the range -1 to +1.
Exactly the same idea of normalisation is applied to binary, and the binary point is implicitly used in most DSP systems.
If we normalise these numbers to between -1 and 1 (i.e. divide through by 128) then the binary range is:
Therefore we apply the same normalising ideas as for decimal for multiplication in binary.
Consider multiplying 36 x 97 = 3492, equivalent to 00100100 x 01100001 = 0000 1101 1010 0100.
In binary, normalising the values gives the calculation 0.0100100 x 0.1100001 = 0.00110110100100.
Note very clearly that in a DSP system the binary point is all in the eye of the designer. There is no physical connection or wire for the binary point. It just makes things significantly easier in keeping track of wordlength growth, and truncating just by dropping fractional bits. Of course if you prefer integers and would like to keep track of the scaling etc. you can do this... you will get the same answer and the cost is the same.
[Figure: 8 bit ADC sampling at rate fs - voltage input (around -2 to +2 volts) mapped to binary output, e.g. +127 → 01111111, +96 → 01100000, +64 → 01000000, +32 → 00100000, -64 → 11000000, -96 → 10100000, -128 → 10000000.]
Note that the ADC does not necessarily have a linear (straight line) characteristic. In telecommunications for
example a defined standard nonlinear quantiser characteristic is often used (A-law and μ-law). Speech signals,
for example, have a very wide dynamic range: Harsh “oh” and “b” type sounds have a large amplitude, whereas
softer sounds such as “sh” have small amplitudes. If a uniform quantisation scheme were used then although
the loud sounds would be represented adequately the quieter sounds may fall below the threshold of the LSB
and therefore be quantised to zero and the information lost. Therefore non-linear quantisers are used such that
the quantisation level at low input levels is much smaller than for higher level signals. A-law quantisers are often
implemented by using a nonlinear circuit followed by a uniform quantiser. Two schemes are widely in use: the
A-law in Europe, and the μ-law in the USA and Japan. Similarly, the DAC can have a non-linear characteristic.
• The ADC samples at the Nyquist rate, and the sampled data value is
the closest (discrete) ADC level to the actual value:
[Figure: a continuous signal s(t) is sampled by the ADC at rate fs (sample period ts), producing quantised values v̂(n) on discrete levels -4 to +4, plotted against sample index n.]

v̂(n) = Quantise{ s(n·ts) }, for n = 0, 1, 2, ...
[Figure: 5 bit ADC transfer characteristic - binary output codes from 10000 (-16) to 01111 (+15) in steps of 1 volt, with Vmin = -15 volts.]
In the above slide figure, for the second sample the true sample value is 1.589998..., however our ADC
quantises to a value of 2.
• If the smallest step size of a linear ADC is q volts, then the error of any
one sample is at worst q/2 volts.
[Figure: linear ADC characteristic with step size q volts - output codes from 10000 (-16) to 01111 (+15).]
Quantisation error is often modelled as an additive noise component, and indeed the quantisation process can be considered purely as the addition of this noise:

[Figure: the ADC modelled as y = x + nq, i.e. the output equals the input plus quantisation noise nq.]
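This additive-noise view can be checked with a quick sketch: a uniform rounding quantiser of step q applied to a sine wave, with every error sample bounded by q/2 (names and values ours):

```python
import math

# Quantisation error as additive noise: y = x + nq, with |nq| <= q/2
# for a rounding quantiser of step size q.

def quantise(x, q):
    return q * round(x / q)

q = 0.25
xs = [math.sin(2 * math.pi * n / 50) for n in range(50)]
errors = [quantise(x, q) - x for x in xs]   # the noise samples nq
assert all(abs(e) <= q / 2 for e in errors)
```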
[Figure: quantisation in action - the continuous input signal, the ADC characteristic (input -4 to +4 mapped to output -4 to +4), the quantised output, and the resulting quantisation error, all plotted in amplitude/volts against time/seconds.]
• The full adder circuit can be used in a chain to add multi-bit numbers.
The following example shows 4 bits:
[Figure: 4 bit ripple-carry adder - full adders (Σ) chained through carries C0-C3 with carry-in '0', computing A3A2A1A0 + B3B2B1B0 = S4S3S2S1S0 (MSB to LSB).]
• This chain can be extended to any number of bits. Note that the last
carry output forms an extra bit in the sum.
• If we do not allow for an extra bit in the sum, and a carry out of the last adder occurs, an "overflow" will result, i.e. the number will be incorrectly represented.
Notes:
The truth table for the full adder is:

A  B  CIN | S  COUT
0  0   0  | 0   0      (0+0+0 = 0)
0  0   1  | 1   0      (0+0+1 = 1)
0  1   0  | 1   0      (0+1+0 = 1)
0  1   1  | 0   1      (0+1+1 = 2)
1  0   0  | 1   0      (1+0+0 = 1)
1  0   1  | 0   1      (1+0+1 = 2)
1  1   0  | 0   1      (1+1+0 = 2)
1  1   1  | 1   1      (1+1+1 = 3)
The longest propagation delay path in the above full adder is “two gates”.
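The full adder and the ripple-carry chain can be sketched in Python (bit lists are LSB first; function names ours):

```python
# Full adder: sum and carry from two bits plus a carry-in, then a
# ripple-carry chain that links the carries together.

def full_adder(a, b, cin):
    s = a ^ b ^ cin                          # sum bit
    cout = (a & b) | (a & cin) | (b & cin)   # carry out
    return s, cout

def ripple_add(a_bits, b_bits):
    """Add two equal-length LSB-first bit lists; returns an LSB-first sum."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    out.append(carry)   # the final carry forms the extra sum bit
    return out

# 1011 (+11) + 1101 (+13) = 11000 (+24); written LSB first:
print(ripple_add([1, 1, 0, 1], [1, 0, 1, 1]))
```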
Subtraction 32
[Figure: 4 bit 2's complement subtractor - the B operand bits are inverted and the carry-in set to '1' (invert and add 1), the final carry C4 is discarded, giving the difference D3 D2 D1 D0.]
Sometimes we need a combined adder/subtractor with the ability to switch between modes.

[Figure: combined adder/subtractor - a MUX per bit selects either Bn or its inverse to feed the full adder chain, with control signal K also driving the carry-in.]

For A + B: K = 0. For A - B: K = 1.
• With 2’s complement overflow will occur when the result to be produced
lies outside the range of the number of bits.
• Therefore for an 8 bit example the range is -128 to +127 (or in binary
this is 100000002 to 011111112:
-65 10111111 100 01100100
+ -112 +10010000 + 37 +00100101
-177 101001111 137 10001001
With an 8 bit result we lose the 9th bit With an 8 bit result the result “wraps
and the result “wraps around” to a around” to a negative value:
positive value: 01001111 = 79 . 10001001 = – 119 .
For example
10110111 01100100
(-73) + 127 = 54 +01111111 100 + 64 = 164 +01000000
1 00110110 10100100
Discard final 9th bit carry
No overflow MSB bit indicate -ve result! Overflow
Adding +ve and -ve will never overflow!
Adding +ve and +ve if a -ve result then overflow
Adding -ve and -ve if a +ve result then overflow
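These wraparound examples, and the sign-based overflow rule, can be sketched as follows (helper name ours):

```python
# 8 bit 2's complement addition with wraparound, plus the overflow
# rule: overflow only when the operands share a sign and the result's
# sign differs.

def add8_wrap(a, b):
    result = (a + b) & 0xFF
    if result & 0x80:                 # MSB set -> negative value
        result -= 256
    overflow = (a < 0) == (b < 0) and (a < 0) != (result < 0)
    return result, overflow

print(add8_wrap(-65, -112))   # wraps to +79, overflow
print(add8_wrap(100, 37))     # wraps to -119, overflow
print(add8_wrap(-73, 127))    # 54, no overflow
```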
• When overflow is detected, the result is set to the closest possible value (i.e. for the 8 bit case, either -128 or +127).
• Therefore, for every addition that is explicitly done with an adder block in Xilinx System Generator, the user will get a checkbox choice to allow results to either (i) Wraparound or (ii) Saturate.
Generally, for later FPGAs such as the Virtex-4, using the DSP48 blocks gives adders with 48 bits of precision, so when working with, say, 16 bit values, growth beyond 48 bits is unlikely. Hence overflow has been "designed out". Of course, not all applications will use these devices; when using general slice logic and attempting to make adders as small as possible, care must be taken, and where appropriate for efficient design, saturation might be included.
Saturation is extremely useful for adaptive algorithms. For example, in the Least Mean Squares algorithm
(LMS), the filter weights w are updated according to the equation:
w(k) = w(k-1) + 2μe(k)x(k)

Without further concern over the meaning of this equation, we can see that the term 2μe(k)x(k) is added to the weights at time epoch k-1 to generate the new weights at time epoch k.
If the operations that form 2μe(k)x(k) were to overflow, there is a high chance that the sign of the term would flip and drive the weights in completely the wrong direction, leading to instability.
With saturation however, if the term 2μe(k)x(k) gets very big and would overflow, saturation will limit it to the maximum value representable, causing the weights to change in the right direction, and at the fastest speed possible in the current representation. The result is a huge increase in the stability of the algorithm.
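An illustrative sketch of the wraparound-versus-saturation effect on an update term, using pretend 8 bit arithmetic (the numbers and helper names are ours, not from the LMS derivation):

```python
# If a large positive update term overflows and wraps, its sign flips
# and the weight is driven the wrong way; saturation keeps the sign.

def wrap(v):
    return ((v + 128) % 256) - 128      # 8 bit wraparound behaviour

def saturate(v):
    return max(-128, min(127, v))       # clamp to the closest value

update = 150             # a large positive update term that overflows
print(wrap(update))      # wraps negative: weight moves the wrong way
print(saturate(update))  # limited to +127: still the right direction
```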
[Figure: full adder mapped onto slice logic - the LUT computes D = A xor B, the XORG gate produces Sout, and the carry multiplexer (MUXCY) produces Cout from Cin.]

G1 (A)  G2 (B)  D
  0       0     0
  0       1     1
  1       0     1
  1       1     0

Sout = Cin xor D, Cout = A·D' + Cin·D (a multiplex operation). The result is the full adder implementation.
• A (very) high level diagram of the main logic components on one slice
[Figure: one slice, upper and lower halves - each half contains a 4-input LUT (configurable as RAM or shift register), a MULTAND gate, carry logic, XORG and ORCY gates, several muxes, and a D-type flip-flop, with slice inputs and outputs.]
• One 4-input Look-Up-Table (LUT) (can be configured as a shift register or simply as RAM/memory)
• One OR gate
• Clock inputs
"Large" FPGAs will have many tens of thousands of slices (and other components!)
• To produce larger adders the Xilinx tools will simply cascade the carry
bits in adjacent (where possible!) slices.
[Figure: a 2 bit addition (full adders for A0/B0 and A1/B1, carry-in '0', sums S0 and S1) fits in 1 slice; a 4 bit addition (A3...A0 + B3...B0 = S4...S0) cascades the carry into an adjacent slice, using 2 slices.]
A B C D | Y
0 0 0 0 | 1
0 0 0 1 | 0
0 0 1 0 | 0
0 0 1 1 | 0
0 1 0 0 | 0
0 1 0 1 | 0
0 1 1 0 | 0
0 1 1 1 | 0
1 0 0 0 | 0
1 0 0 1 | 0
1 0 1 0 | 0
1 0 1 1 | 0
1 1 0 0 | 0
1 1 0 1 | 0
1 1 1 0 | 1
1 1 1 1 | 1

[Figure: a 4 input LUT with inputs A, B, C and D and output Y; the same LUT can also be configured as RAM or a shift register.]
To implement this function, simply store the values of Y in the slice LUT, and then address the LUT with values of ABCD to get the output.
Therefore ANY 4 variable Boolean function can be simply implemented with a four input LUT. (Of course, if the equation has only 3 variables then we can implement it too, by simply setting one input constant.)
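The LUT-as-stored-truth-table idea can be sketched directly: 16 stored bits addressed by the input word ABCD (the table contents match the truth table above; function name ours):

```python
# A 4-input LUT is just 16 stored bits addressed by ABCD.
# Here Y = 1 for inputs 0000, 1110 and 1111, as in the truth table.

LUT = [0] * 16
for address in (0b0000, 0b1110, 0b1111):
    LUT[address] = 1

def lut_out(a, b, c, d):
    return LUT[(a << 3) | (b << 2) | (c << 1) | d]

print(lut_out(0, 0, 0, 0))  # 1
print(lut_out(1, 1, 1, 0))  # 1
print(lut_out(0, 1, 0, 1))  # 0
```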
        11010110     A7...A0
     x 00101101     B7...B0
        11010110
       000000000
      1101011000
     11010110000
    000000000000
   1101011000000
  00000000000000
 000000000000000
0010010110011110     P15...P0
• So we can perform multiplication using just full adders and a little logic
for selection, in a layout which performs the shifting.
   214
 x  45
  1070
+ 8560
  9630
Note that we do 214 × 5 = 1070 and then add to it the result of 214 × 4 = 856 right-shifted by one column.
For each additional column in the second operand, we shift the multiplication of that column with the first
operand by another place.
      zzz
   x aaaa
     bbbb
 + cccc0
+ dddd00
+ eeee000   etc...
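The shift-and-add pattern above can be sketched for unsigned operands (function name ours): each set bit of the second operand contributes the first operand shifted into that bit's column.

```python
# Unsigned shift-and-add multiplication: one shifted partial product
# per set bit of the second operand.

def shift_add_multiply(a, b):
    product = 0
    column = 0
    while b:
        if b & 1:                     # this column's partial product
            product += a << column    # a, shifted into place
        b >>= 1
        column += 1
    return product

print(shift_add_multiply(214, 45))   # 9630, as in the decimal example
```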
• For one negative and one positive operand just remember to sign
extend the negative operand.
        11010110      -42
     x 00101101     x  45
1111111111010110
0000000000000000
1111111101011000
1111111010110000     (sign extends)
0000000000000000
1111101011000000
0000000000000000
0000000000000000
1111100010011110     -1890
We use the trick of negating (inverting the bits and adding 1) the last partial product and adding it, rather than subtracting.
Of course, if both operands are positive, just use the unsigned technique!
The difference between signed and unsigned multiplies results in different hardware being necessary. DSP processors typically have separate unsigned and signed multiply instructions.
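The sign-extension technique can be sketched as follows; this sketch (function name ours) assumes the second operand is non-negative, as in the slide's example, and wraps each partial product to the full result width:

```python
# Signed multiply by summing sign-extended partial products, keeping
# only the low 2*bits bits of the result. The second operand b is
# assumed non-negative, as in the -42 x 45 example.

def signed_multiply(a, b, bits=8):
    width = 2 * bits
    mask = (1 << width) - 1
    total = 0
    for k in range(bits):
        if (b >> k) & 1:
            total += (a << k) & mask   # partial product, wrapped to width
    total &= mask
    if total & (1 << (width - 1)):     # interpret the MSB as negative
        total -= 1 << width
    return total

print(signed_multiply(-42, 45))   # -1890, as in the slide
```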
      11010.110               26.750
   x 00101.101             x   5.625
      11.010110             0.133750
     000.000000             0.535000
    1101.011000            16.050000
   11010.110000           133.750000
  000000.000000           ----------
 1101011.000000           150.468750
00000000.000000
000000000.000000
0010010110.011110
• Distributed multipliers
• Constant multipliers
• Shift-and-add “multipliers”
• 18 x 18 bit multipliers
• Multiply, accumulate
The most basic multiplier is a 2-input version which is implemented using the logic fabric, i.e. the lookup tables within the slices of the device. This type is referred to as a distributed multiplier, because the implementation is distributed over the resources in several slices.
In the case of multiplication with a constant, which is commonly required in DSP, the knowledge of one
multiplicand can be exploited to create a cheaper hardware implementation than a conventional 2-input
multiplier. Two approaches that will be discussed in the coming pages are ROM-based constant multipliers, and
“shift-and-add” multipliers which sum the outputs from binary shift operations.
The FPGA companies are well aware that DSP engineers desire fast and efficient multipliers, and as a result,
they began incorporating embedded multipliers into their devices in the year 2000. Since then the sophistication
of these components has increased, and they have been extended to feature fast adders and in many cases
longer wordlengths, too. We can now think of them as embedded arithmetic slices, rather than simply
multipliers.
[Figure: distributed array multiplier - each stage ANDs one bit of b (b0...b3) with the a operand and adds it to the running sum with full adders (FA), carries and sums passing diagonally to form product bits p7...p0.]

Example:

    1101      13
  x 1011      11
    1101
   1101
  0000
 1101
10001111     143
• The AND gate connected to a and b performs the selection for each
bit. The diagonal structure of the multiplier implicitly inserts zeros in the
appropriate columns and shifts the a operands to the right.
• Note that this structure does not work for signed two’s complement!
The operation of multiplying 1's and 0's is the same as ANDing 1's and 0's:
A B Z
0 0 0
0 1 0
1 0 0
1 1 1
Hence the AND gate is the bit multiplier. The function of one partial product stage of the multiplier is as shown
below.
[Figure: one partial product stage of the multiplier - inputs x3 x2 x1 x0 and a3 a2 a1 a0 with select bit b0, built from full adders (FA), producing y4 y3 y2 y1 y0 = b0(a3 a2 a1 a0) + x3 x2 x1 x0.]
• This shows the top half of a slice, which implements one multiplier cell.
[Figure: the top half of a slice implementing one multiplier cell (shown for a Virtex-II Pro FPGA) - the LUT forms D from G1 (B), G2 (A) and G3 (S), with the MULTAND gate, carry mux (MUXCY) and XORG producing Sout and Cout from Cin.]
The dedicated MULTAND unit is required as the intermediate product G1·G2 cannot be obtained from within the LUT, but is required as an input to MUXCY. The two AND gates perform a one-bit multiply each, and the result is added by the XOR plus the external logic (MUXCY, XORG):

Sout = CIN xor D, COUT = AB·D' + CIN·D

This structure will perform one cell of the multiplier (see the next slide...).
Note that whereas the signal flow graph of the distributed multiplier shows signals propagating from the top and
right of the diagram to the bottom, the internal structure of the FPGA slice logic results in a different
configuration when implemented on a device.
• The two operands A and B are concatenated to form the address with
which to access the ROM. The value stored at that address is the
multiplication result, P:
[Figure: ROM-based multiplier - the two 4 bit operands A and B are concatenated into an 8 bit address A:B selecting one of 2^8 = 256 8-bit products. For example A = 1010 (decimal -6) and B = 0011 (decimal 3) form address 1010 0011, which stores the product 1110 1110 (decimal -18). The ROM runs from address 0000 0000 (data 0000 0000) to address 1111 1111 (data 0000 0001).]
For example, with 8 bit operands (a fairly reasonable size), 1 Mbit of storage is required - a large quantity. For bigger operands the storage requirement is huge: 16 bit operands require 128 Gbits, and hence a ROM-based multiplier is clearly not a realistic implementation choice!
Input Wordlength (N) | Output Wordlength (2N) | No. of ROM entries (2^2N) | Total ROM Storage (2N x 2^2N)
 4 |  8 | 2^8  = 256               | 2 Kbits
 6 | 12 | 2^12 = 4,096             | 48 Kbits
 8 | 16 | 2^16 = 65,536            | 1 Mbit
10 | 20 | 2^20 = 1,048,576         | 20 Mbits
12 | 24 | 2^24 = 16,777,216        | 384 Mbits
14 | 28 | 2^28 = 268,435,456       | 7 Gbits
16 | 32 | 2^32 = 4,294,967,296     | 128 Gbits
18 | 36 | 2^36 = 68,719,476,736    | 2.25 Tbits
20 | 40 | 2^40 = 1,099,511,627,776 | 40 Tbits
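The table entries follow directly from N; a quick sketch (function name ours) reproduces a few rows:

```python
# ROM-based multiplier storage: N-bit operands need 2^(2N) entries
# of 2N bits each.

def rom_bits(n):
    entries = 2 ** (2 * n)
    return entries * 2 * n    # total storage in bits

assert rom_bits(4) == 2 * 1024           # 2 Kbits
assert rom_bits(8) == 2 ** 20            # 1 Mbit
assert rom_bits(16) == 128 * 2 ** 30     # 128 Gbits
```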
• Consider a ROM multiplier with 8-bit inputs: 65,536 8-bit locations are
required to store all possible outputs... so 1Mbit of storage is needed!
[Figure: ROM-based multiplier with 8 bit operands - A and B are concatenated into a 16 bit address A:B selecting one of 2^16 = 65,536 16-bit products. For example A = 0110 1011 (decimal 107) addresses location 27,459, which stores 0001 1100 0000 0001 (decimal 7,169); the final location 1111 1111 1111 1111 stores 0000 0000 0000 0001.]
• The storage required for output words may also be reduced, if the maximum result does not require the full numerical range of:

-2^(2N-1) ≤ result ≤ 2^(2N-1) - 1
• The maximum product and output wordlength can be calculated for the
particular constant value, and the multiplier optimised accordingly...
[Figure: an 8-bit signed input A has maximum absolute value 128 (for A = -128); the corresponding worst-case product with the constant determines the wordlength of the 16-bit signed result P.]
In System Generator, the designer can specify the implementation style via the Constant Multiplier dialog box,
along with the constant value, the output wordlength, and other parameters.
[Figure: binary example of two concurrent constant multiplications, x9 and x24, built from shifts and adds.]
Taking the above simple example of two concurrent multiplications, one x9 and the other x24, it is clear that the shift left by three places can be shared, as x8 is common to both operations (9 = 8 + 1 and 24 = 16 + 8).
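A minimal sketch of this sharing (illustrative Python; the function name is invented): both products are formed from shifts and adds, with the x8 term computed once.

```python
def mul9_mul24(x):
    """Compute 9*x and 24*x from shifts and adds, sharing the common x8 term."""
    x8 = x << 3                       # shift left by three places (x8), shared
    return x8 + x, (x8 << 1) + x8     # 9x = 8x + x ; 24x = 16x + 8x

assert mul9_mul24(5) == (45, 120)
```

In hardware, sharing the x8 term means one shifter (in fact just wiring) feeds both adder trees, so only three adders are needed in total rather than four.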
• The Xilinx Virtex-II and Virtex-II Pro series were the first to provide "on-chip" multipliers, in the early 2000s.
• These are implemented in hardware on the FPGA ASIC, not in the user FPGA "slice-logic-area". They are therefore permanently available, and they use no slices. They also consume less power than a slice-based equivalent and can be clocked at the maximum rate of the device.
[Figure: dedicated 18x18-bit multiplier block with operand inputs A and B and product output P.]
• A and B are 18-bit input operands, and P is the 36-bit product, i.e.
P = A × B.
• Depending upon the actual FPGA, between 12 and more than 2,000 (Virtex-6 top of range) of these dedicated multipliers are available.
Notes:
Looking at a device floorplan, you can clearly see the embedded multipliers, which are located next to Block
RAMs on the FPGA in order to support high speed data fetching/writing and computation.
Information on dedicated multipliers taken from “Virtex-II Pro Platform FPGAs: Introduction and Overview”,
DS083-1 (v2.4) January 20, 2003. http://www.xilinx.com.
The wordlengths of the embedded multipliers are fixed at 18 x 18 bits, and it makes sense to use them as fully
as possible.
It is relatively easy to see that a 4 x 4 bit multiply will greatly underuse the capabilities of the multiplier, and this
particular multiply operation might be better mapped to a distributed implementation, which would leave the
embedded multiplier free for use somewhere else. Of course, these decisions are made in the context of some
larger design with its own particular needs for the various resources available on the FPGA being targeted.
Perhaps less obviously, mapping a multiplication to embedded multipliers where the input operands are slightly
longer than 18 bits is also inefficient. This may result in, for example, the following implementation for a
requested 19 x 19 bits multiplier, where 4 embedded multipliers are used instead of the expected 1!
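The decomposition behind that 4-multiplier mapping can be sketched behaviourally for unsigned operands (an illustrative model, not the tool's actual mapping): split each 19-bit operand into its top bit and low 18 bits, form the four partial products, and recombine with shifts.

```python
def mul19(a, b):
    """Unsigned 19 x 19 multiply from four partial products, mirroring the
    mapping onto 18x18 embedded multipliers (18x18, 1x18, 18x1, 1x1)."""
    a1, a0 = a >> 18, a & 0x3FFFF   # top bit, low 18 bits
    b1, b0 = b >> 18, b & 0x3FFFF
    return (a1 * b1 << 36) + ((a1 * b0 + a0 * b1) << 18) + a0 * b0

assert mul19(2**19 - 1, 2**19 - 1) == (2**19 - 1) ** 2
```

Each of the four terms corresponds to one embedded multiplier, which is why a single extra bit of operand wordlength quadruples the multiplier usage.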
[Figure: a requested 19 x 19 bit multiplier decomposed into four embedded multipliers: an 18 x 18, a 1 x 18, an 18 x 1 and a 1 x 1, whose partial products are combined to form the 38-bit result.]
[Figure: Virtex-4 DSP48 slice: an 18 x 18 bit multiplier (36-bit product) feeding a 48-bit adder/subtractor, with 48-bit cascade paths between slices.]
• Like the embedded multipliers, these are low power and fast.
• The ability to cascade slices together also means that whole filters can
be constructed without having to use any slices.
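The cascade idea can be modelled in software. The sketch below (Python used purely as a behavioural illustration; the function name is invented) is a transposed-form FIR in which each tap corresponds to one DSP slice: the broadcast input sample is multiplied by the tap coefficient and added to the registered partial sum cascaded in from the neighbouring slice, with no general slice logic involved.

```python
def fir_cascade(x, h):
    """Behavioural model of an FIR filter built from cascaded multiply-add
    slices (transposed form): slice i computes sample*coeff plus the
    registered cascade input from slice i-1."""
    M = len(h)
    p = [0] * M                        # one partial-sum register per slice
    y = []
    for s in x:
        for i in range(M - 1, 0, -1):  # update from the output end so each
            p[i] = s * h[M - 1 - i] + p[i - 1]  # slice sees last cycle's cascade
        p[0] = s * h[M - 1]
        y.append(p[M - 1])             # filter output from the final slice
    return y

# impulse response reproduces the coefficients
assert fir_cascade([1, 0, 0, 0], [1, 2, 3]) == [1, 2, 3, 0]
```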
Notes:
The next series of FPGAs (the Virtex-5) enhanced the capabilities of the DSP slice with the DSP48E.
The major improvements of this slice are logic capabilities within the adder/subtractor unit, and an extended
wordlength of one input to 25 bits. The maximum clock frequency also increased in line with the speed of the
device.
[Figure: Virtex-5 DSP48E slice: a 25 x 18 bit multiplier (43-bit product) feeding a 48-bit ALU, with 48-bit cascade paths.]
[Figure: Virtex-6 DSP48E1 slice: a 25-bit pre-adder feeding the 25 x 18 bit multiplier (43-bit product) and the 48-bit ALU.]
[Figure: division array producing the quotient Q = B/A, one quotient bit (q2, q1, q0, ...) per row of cells.]
• Note that each cell can perform either addition or subtraction, as shown in an earlier slide ⇒ either Sin + Bin or Sin - Bin can be selected.
Notes:
A direct method of computing division exists. This "paper and pencil" method may look familiar, as it is often taught in school. A binary example is given below. Note that each stage computes an addition or subtraction of the divisor A. The quotient is made up of the carry bits from each addition/subtraction. If the quotient bit is a 0, the next computation is an addition, and if it is a 1, the divisor is subtracted. It is not difficult to map this example onto the structure shown on the slide.
            01011    R0 = B
  -A        10011
q4 = 0      11110    R1     (carry = 0)
            11100    2.R1
  +A        01101
q3 = 1      01001    R2     (carry = 1)
            10010    2.R2
  -A        10011
q2 = 1      00101    R3     (carry = 1)
            01010    2.R3
  -A        10011
q1 = 0      11101    R4     (carry = 0)
            11010    2.R4
  +A        01101
q0 = 1      00111    R5     (carry = 1)
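The worked example can be reproduced with a short behavioural model (illustrative Python; the function name is invented, and the carry-out of each row is modelled here by whether the new remainder is non-negative):

```python
def nonrestoring_divide(b, a, nbits=5):
    """Behavioural model of the non-restoring division array: the first row
    subtracts the divisor; each later row doubles the remainder, then adds
    or subtracts A depending on the previous quotient bit."""
    r = b                        # R0 = B
    q_bits = []
    subtract = True              # the first operation is always a subtraction
    for _ in range(nbits):
        r = r - a if subtract else r + a
        q = 1 if r >= 0 else 0   # row carry-out = quotient bit
        q_bits.append(q)
        subtract = (q == 1)      # q = 1 -> subtract next row, q = 0 -> add
        r <<= 1                  # form 2.R for the next row
    return q_bits

# B = 01011, A = 01101 gives quotient bits q4..q0 = 0 1 1 0 1, as above
assert nonrestoring_divide(0b01011, 0b01101) == [0, 1, 1, 0, 1]
```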
• It is unlikely that the quotient can be passed on to the next stage until
all the bits are computed - hence slowing down the system!
• Note that we must wait for N full adder delays before the next row can
begin its calculations.
Another problem for division is the fact that it takes N full adder delays before the next row can start. In the
examples below, the order in which the cells can start has been shown. So for the multiplier, the first cell on the
second row is the 3rd cell to start working. However, for the divider, the first cell on the second row is only the
5th cell to start working because it has to wait for the 4 cells on the first row to finish.
[Figure: order in which cells can start computing, for the 4 x 4 multiplier array and the division array. In the multiplier array the first cell of the second row is the 3rd cell to start; in the division array it is only the 5th, since it must wait for all 4 cells of the first row to finish. Each cell is built around a full adder (FA).]
Pipelining The Division Array
[Figure: pipelined version of the division array; with registers inserted between the rows of full-adder (FA) cells, the quotient bits q4, q3, q2, q1, q0 of Q = B/A emerge from successive pipelined rows.]
cos θ = x / sqrt(x^2 + y^2)   and   sin θ = y / sqrt(x^2 + y^2)