Retiming Signal Flow Graphs (SFG)
• Cut set retiming strategies of time scaling and delay transfer will be
introduced.
• The “old” idea of systolic arrays and their relevance to current and
future FPGA DSP design strategies will be presented.
As FPGAs come from a number of manufacturers, and then various families from the different manufacturers,
there is the simple question - which device to use for the application? The choice of manufacturer has many
avenues of consideration, some practical, some based on tradition, and some based on engineering. The actual
device chosen will depend on many things such as cost, capability, ease of programming, and whether the
design is a single prototype or demonstrator, or whether the system must go to manufacture, in which case
design cost and power issues are of most importance.
Once the choice has been made the next requirement is “programming” the device. There are again many
decision choices here. Which DSP algorithm to choose, what toolset, what optimisation strategies and so on.
One bottom line, however, is to start off with the right type of DSP design. Undoubtedly different synthesis tools will
give different results, some resulting in smaller designs and faster clock rates than other tools - just like
compilers - different compilers produce different code, some better code than others. (This is of course just a
reflection on life - all car washes wash cars; some do it better than others, but regardless of the specific final
quality as long as it washes cars with some competence all have their place in the car wash business.)
So, to the point of this presentation: get the DSP design in an amenable form before heading off to synthesise the
design for an FPGA. This will require understanding aspects of timing, pipelining, parallelism, device sharing
etc. No changes to the actual algorithms, just changes to the way the algorithms are prepared for
implementation.
Signal Flow Graph Critical Path 10.2
[Figure: 5-weight parallel FIR filter SFG: input x(k), weights w0...w4, output y(k)]
Therefore there is granularity associated with the latency of the multiplier. Do we just lump together the latency
of the multiplier as one time delay, or do we break it down into smaller components? This is of course the
decision of the designer and depends on what you are actually trying to achieve and what problem you may (or
may not) be trying to solve.
In this section we will be identifying critical paths to understand issues related to maximum clocking rates. By
managing and minimising this critical path we will be in a position to increase the clocking rate of our
implementations.
To be more correct in the above figure we should probably include the input register for x ( k ) and the output
register for y ( k ) (which would usually be present) and show that all registers are clocked:
[Figure: the same FIR SFG with an input register on x(k) and an output register on y(k), all registers driven by a common clock]
• If τ_add is the propagation delay of the adder and τ_mult is the propagation
delay of the multiplier, the longest path for data “ripple” can be recognised
as running from x(k) through one multiplier and the four-adder chain to y(k), giving:

f_clk = 1 / ( τ_mult + 4τ_add )
Therefore if, for argument's sake, τ_mult = 1ns and τ_add = 0.1ns , then the maximum clocking frequency to avoid
race hazards is:

f_clk = 1 / ( 1 + 4 × 0.1 ) × 10^9 ≈ 714MHz
If we considered a longer filter of, say, 10 weights, then the maximum clocking frequency reduces:

f_clk = 1 / ( 1 + 9 × 0.1 ) × 10^9 ≈ 526MHz

Therefore, clearly, the longer the (canonical) parallel FIR filter, the slower the clocking rate.
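The critical path arithmetic above is easily captured in a small helper (a sketch in Python; the delay figures are the example values from the text, not measured device data):

```python
def fir_max_clock_hz(n_weights, tau_mult_s, tau_add_s):
    """Maximum clock rate of a canonical parallel FIR: the critical path
    ripples through one multiplier and (n_weights - 1) adders."""
    return 1.0 / (tau_mult_s + (n_weights - 1) * tau_add_s)

# Example figures from the text: tau_mult = 1 ns, tau_add = 0.1 ns
f5 = fir_max_clock_hz(5, 1e-9, 0.1e-9)    # approx. 714 MHz
f10 = fir_max_clock_hz(10, 1e-9, 0.1e-9)  # approx. 526 MHz
```

Note how the achievable clock rate falls as the adder chain lengthens.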
Cut Set Retiming 10.4
• Cut-set comes from formal graph theoretic techniques that can be used
to retime the SFG to a more amenable form.
[Figure: an SFG with input x(k) and output y(k), partitioned by a cut-set]
August 2005, For Academic Use Only, All Rights Reserved
Notes:
This might just be one of the most powerful and valuable FPGA design strategies you will come across for DSP
parallel systems design. It is very simple to understand and simple to use...
The complexity and application of cut sets and other related techniques extends well beyond the simple signal
flow graphs that are used in signal processing. However, in this course we will present enough information to
ensure that this very powerful retiming technique can be applied in a whole variety of applications.
Cut set retiming was formalised by S.Y. Kung in his 1987 textbook VLSI Array Processors published by
Prentice-Hall. In a DSP context the use of cut set was first widely presented in the seminal paper:
S.Y. Kung. On supercomputing with systolic/wavefront array processors. Proceedings of the IEEE, Vol 72, No. 7, July
1984.
It is interesting to note that although the above book was published many years before FPGA based DSP, the
issues relating to retiming, “systolic” and “wavefront” arrays as discussed in the book are very relevant to
modern FPGA-DSP. Worth a look if you can source a copy.
Systolic arrays were arrays of simple computing elements (cells) which communicated with neighbouring cells
via simple data transfer. Communication was synchronized, with the name systolic being derived from the term
systole related to the regular beating of the heart. Systolic arrays were essentially parallel processors.
An alternative form of parallel array was the wavefront array, where neighbouring cells communicate via a
handshake; such arrays were therefore termed asynchronous.
Cut Set Retiming Rule 1 10.5
• Cut set Rule 1: We can advance and/or delay the various edges
depending on their inbound or outbound directions.
[Figure: a cut-set separating SFG-1 and SFG-2, with z^-1 delays on the edges and an outbound edge crossing the cut-set]
One thing that will become clear in the next few slides is that some retimings are not physically possible. That
is to say, the SFG result can be expressed, but the actual physical implementation would require
non-causal (i.e. time advance) components.
Note that inside the SFG1 and SFG2 circuits is some arbitrary DSP computation:
[Figure: example contents of SFG-1 and SFG-2: an arbitrary network of z^-1 delays, multipliers and adders with inputs x(k), r(k) and output y(k)]
(Note the above SFG is not meant to be any specific circuit - just an arbitrary DSP linear system implementation
with multiplies and adds. However, if we did choose to study this circuit, we might notice that the output y(k)
is some linear combination of current and past values of x(k) and r(k), and past values of y(k). Hence it is
some sort of linear recursive system on y(k) with two inputs - but that is not really important; the focus is on the
retiming at the boundary of SFG1 and SFG2.)
Cut Set Delay Transfer 10.6
[Figure: delay transfer at a cut-set between SFG-1 and SFG-2: delaying the inbound edges and advancing the outbound edge; a z^+1 advance cancels with a z^-1 delay (delay cancellation) while two z^-1 delays combine into z^-2 (delay doubling)]
Nothing inside the circuits has changed, i.e. delays are only added or removed at the cut-sets.
Clearly the time advance z^+1 applied to the outbound data path has cancelled with the time delay z^-1,
i.e. z^+1 z^-1 = 1,
and we have delay doubling, z^-1 z^-1 = z^-2. Care must be taken when performing cut set retiming, as if any data
transfer path ends up with a time advance then the system becomes non-causal - a time advance, z^+1, is a
look ahead one step in time, and such things are impossible!
Cut Set Delay Transfer “Failure” 10.7
[Figure: a “failed” cut-set retiming: after delay transfer, a z^+1 advance remains on an edge between SFG-1 and SFG-2]
This is a non-causal implementation and therefore cannot be practically built. The ability to predict the future
(as required by the z^+1) is just not with us yet!
So..... a bad choice of cut set retiming. In fact, a choice of cut set that is IMPOSSIBLE to actually implement.
Cut Set Rule 2: Time Scaling 10.8
[Figure: a 4-weight FIR SFG (weights w0...w3, output y(k)) before and after delay scaling by α = 2: every z^-1 becomes z^-2]
Prior to scaling, the output of the filter was simply the weighted sum of past inputs, where the filter weights and
input vector were specified by

w = [ w0, w1, w2, w3 ]^T and x_k = [ x(k), x(k-1), x(k-2), x(k-3) ]^T , i.e.

y(k) = w^T x_k = w0 x(k) + w1 x(k-1) + w2 x(k-2) + w3 x(k-3)
After scaling we have simply doubled (scaled by 2) all of the delays in the circuit:
[Figure: the same 4-weight FIR with all delays doubled to z^-2]
Slowing Down the Input Rate 10.9
• The delay scaling on the FIR SFG essentially increases the length of
the filter from N = 4 to N = 8 , but where every second weight is 0
(zero):
[Figure: the equivalent 8-weight FIR with weights w0, 0, w1, 0, w2, 0, w3, 0]
w = [ w0, 0, w1, 0, w2, 0, w3, 0 ]^T
Therefore, for a given input sequence of data, to use the delay scaled SFG to produce the same output would
require the data vector to be:
x_k = [ x(k), 0, x(k-1), 0, x(k-2), 0, x(k-3), 0 ]^T
(Note the last “0” weight is not necessary and could be removed)
Therefore in order to produce the same sequence of output numbers for a given input sequence x ( k ) , this input
sequence requires to be upsampled by 2, or every second input is set to zero.
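This equivalence can be checked numerically with a short sketch (Python; the weights and input values are hypothetical examples, not from the slides). Feeding the upsampled-by-2 input through the delay-scaled filter reproduces the original outputs at the even output indices, with zeroes in between:

```python
def fir(weights, x):
    """Direct-form FIR: y[k] = sum_n weights[n] * x[k - n], zero initial state."""
    return [sum(w * (x[k - n] if k - n >= 0 else 0)
                for n, w in enumerate(weights)) for k in range(len(x))]

def upsample2(seq):
    """Insert a zero after every sample (the input rate change that delay
    scaling by alpha = 2 demands)."""
    return [v for s in seq for v in (s, 0)]

w = [1, 3, 7, 5]                  # hypothetical 4-weight filter
w_scaled = [1, 0, 3, 0, 7, 0, 5]  # delays doubled: a zero between each weight
x = [4, 7, 1, 6, 2]               # hypothetical input samples

y = fir(w, x)                     # original filter, original rate
y2 = fir(w_scaled, upsample2(x))  # delay-scaled filter, upsampled input
# y2 holds y at the even indices and zeroes at the odd indices
```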
Time Scaling 10.10
• Without scaling, data can be input at the full rate in the usual style of an
FIR filter, i.e. not scaled down by α :
[Figure: the unscaled FIR (input x(k), delays z^-1, output y(k)) and the delay-scaled FIR, whose output q(k) contains interleaved zeroes]
Y(z) = w0 X(z) + w1 X(z) z^-1 + w2 X(z) z^-2 + w3 X(z) z^-3
     = ( w0 + w1 z^-1 + w2 z^-2 + w3 z^-3 ) X(z)

therefore Y(z)/X(z) = w0 + w1 z^-1 + w2 z^-2 + w3 z^-3
Now if we apply delay scaling such that we make the substitution z^-1 → z^-2 (and equivalently z → z^2) then:

Q(z)/P(z) = Y(z^2)/X(z^2) = w0 + w1 z^-2 + w2 z^-4 + w3 z^-6
If we assumed the input sequence was, for example the set of samples: x ( k ) = [ x 0, x 1, x 2, x 3, …, x N ] , then
the z-transform is:
X ( z ) = x0 + x1 z –1 + x2 z –2 + x3 z –3 + … + xN z –N
Therefore the delay scaling inserts zeroes in the input sequence. Equivalently zeroes are inserted in the output
sequence, q ( k ) .
Delay Scaling Loss of Efficiency 10.11
• For the previous slide's scaling by 2, the FIR only has a computational
efficiency of 50%: as every 2nd input is a zero, on every 2nd
clock cycle only zeroes are input to the multipliers:
[Figure: snapshot at sample k+1: the tap registers hold 0, 6, 0, 1, 0, 7, 0, so the weights w0...w3 perform 0 useful multiply/adds]
Note that if, for example, the delay scaling had been by a factor of α = 3 , then the array would only be 1/3
or 33% efficient. So for the input sequence [4, 0, 0, 7, 0, 0, 1, 0, 0, 6]:
[Figure: snapshots at samples k, k+1 and k+2 for α = 3: only one cycle in three presents non-zero data to the weights w0...w3 (4 useful multiply/adds at sample k, then 0 at samples k+1 and k+2)]
Increasing the Efficiency: SFG Sharing 10.12
[Figure: two interleaved data streams z1(k) and z2(k) sharing one delay-scaled FIR SFG, producing the two outputs v1(k) and v2(k)]
This technique is therefore of interest if you have different data sets that must be processed by the same filter,
and speed of operation is such that the slowdown caused by “sharing” or multiplexing SFGs is tolerable. Of
course this is an ideal structure for use with multichannel applications (i.e. various data sources to be processed
by the same filter characteristic).
Multiplex the input and output..... 10.13
[Figure: multichannel operation: the channels are interleaved (Int) at rate fs into the shared filter and de-interleaved (DeInt) at the output]
In general to do N channel interleaving requires that all delays are scaled by N, and the input signals are
upsampled by N.
Therefore, if an FIR SFG can be clocked at, say, a sampling rate of fs = 100MHz , then one channel of
data can be processed at a 100MHz input sample rate. If this array was delay scaled by α = 4 then the FIR SFG
could now process 4 independent channels at a maximum data rate of 100/4 = 25 MHz.
If the FIR filter had 20 weights, then for both the single channel at 100MHz, or the 4 channels at 25MHz, the
total number of MACs (multiply-accumulates) per second is 2000 million MAC/sec.
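The two-channel case can be sketched as follows (Python; the filter weights and channel data are hypothetical examples). Interleaving the channels into a delay-scaled (α = 2) filter and de-interleaving the output matches filtering each channel separately:

```python
def fir(weights, x):
    """Direct-form FIR with zero initial state."""
    return [sum(w * (x[k - n] if k - n >= 0 else 0)
                for n, w in enumerate(weights)) for k in range(len(x))]

def interleave(a, b):
    """Time-multiplex two equal-length channels: a0, b0, a1, b1, ..."""
    return [v for pair in zip(a, b) for v in pair]

w = [2, 5, 3]                # hypothetical filter shared by both channels
w_scaled = [2, 0, 5, 0, 3]   # the same filter after delay scaling by N = 2
a = [1, 4, 2, 8]             # channel 1 samples (hypothetical)
b = [3, 0, 7, 5]             # channel 2 samples (hypothetical)

y = fir(w_scaled, interleave(a, b))
# de-interleaving y recovers each channel filtered independently
```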
Input to Output Latency 10.14
• Following a cut set retiming by time scaling and delay transfer it may be
that the delay from input to output is changed.
[Figure: the cut-set retimed SFG of Slide 10.6, with the transferred z^-1 delays shown]
• From Slide 10.6 we chose to time delay all inbound edges and hence
the cut set procedure adds one delay. (y(k) was o/p before retiming)
Notes:
Comparing this to the original SFG, the I/O test line shows a delay from x(k) to y(k) as a result of the cut set.
Therefore in this case the output on the right hand side should be correctly labelled z(k) = y(k-1), as without the
cut set the output was y(k).
[Figure: the original SFG, input x(k) and output y(k), with the cut-set between SFG-1 and SFG-2 marked]
In many DSP problems, adding a few delays to the overall input to output transfer path is not usually a problem.
However, there are problems where this WILL be a concern, and therefore we should always keep track of
precisely what is happening.
Retiming a Simple FIR SFG 10.15
• We can redraw (no retiming yet!) the FIR SFG in the following form
[Figure: the FIR SFG redrawn: x(k) feeds the z^-1 chain, weights w0...w4, with the adder line running in the reverse direction to y(k)]
....as the data transfer paths are considered delayless, we have simply
reversed the direction of the adder line.
[Figure: two successive cut-sets (Cut set 1, Cut set 2) across the redrawn SFG, each adding a z^-1 on the adder line towards y(k)]
For our second cut-set we can again retime by time advancing all inbound edges and time delaying all outbound edges:
[Figure: the SFG after the second cut-set retiming]
....and so on.
The Transpose FIR SFG 10.16
[Figure: repeating the cut-set retiming produces the transpose FIR: the weight order reverses to w4...w0 and all the delays move onto the adder line]
Of course, given that most FIR filters are in fact symmetric, this re-ordering is of no consequence, as a 5
weight symmetric filter will have w0 = w4 and w1 = w3 .
Transpose FIR Additional Latency? 10.17
• Note that there is no change to the delay or latency from input to output:
[Figure: I/O-test-lines drawn on the standard and transpose FIR SFGs from input to output y(k)]
Hence the time advance and time delay at the edges would “balance” and give the same result as above, i.e.
no change to the latency, i.e. z^-4 z^4 = 1 .
Clocking Rate of the Transpose FIR 10.18
• Recall from Slide 10.2 that the critical path from input to output of a parallel
implementation of a 5 weight canonical FIR was τ_mult + 4τ_add .
• However, for the transpose FIR this is now reduced to τ_mult + τ_add .
Therefore the maximum data sample clocking speed of an N weight transpose system is:

f_clk = 1 / ( τ_mult + τ_add )
Although the transpose FIR circumvents the problem of the ripple through the adder chain, the broadcast line
at the top may be undesirable, particularly if the filter is very long and the broadcast line (or wires or bus) is very
long.
Taking the same example figures from the notes of Slide 10.3, τ_mult = 1ns and τ_add = 0.1ns , the
maximum clocking frequency to avoid race hazards of the transpose FIR is:

f_clk = 1 / ( 1 + 0.1 ) × 10^9 ≈ 909MHz

Compare this to the 714MHz for the standard FIR.
For a 10 weight filter, the maximum clocking frequency to avoid race hazards of the transpose FIR is again:

f_clk = 1 / ( 1 + 0.1 ) × 10^9 ≈ 909MHz

Compare the 909MHz to the 526MHz for the 10 weight standard FIR result from the notes of Slide 10.3.
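The transpose structure can be modelled sample by sample to confirm it computes exactly the same outputs as the direct form (a Python sketch with hypothetical weights and data; the list s models the z^-1 registers on the adder line):

```python
def fir_direct(w, x):
    """Reference direct-form FIR, zero initial state."""
    return [sum(w[n] * (x[k - n] if k - n >= 0 else 0)
                for n in range(len(w))) for k in range(len(x))]

def fir_transpose(w, x):
    """Transpose-form FIR: each product feeds a registered partial-sum chain,
    so the combinational path is one multiplier plus one adder."""
    s = [0] * (len(w) - 1)            # the z^-1 registers on the adder line
    out = []
    for xk in x:
        y = w[0] * xk + s[0]          # output: one product + one registered sum
        for i in range(len(s) - 1):   # shift partial sums one stage onward
            s[i] = w[i + 1] * xk + s[i + 1]
        s[-1] = w[-1] * xk
        out.append(y)
    return out
```

Because each output is formed from one product plus one registered partial sum, the register-to-register path stays at τ_mult + τ_add regardless of filter length.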
Additional Cost of the Transpose FIR 10.19
• From a first view there is no extra cost to implement the transpose FIR
compared to the standard FIR.
[Figure: the 5-weight standard FIR and transpose FIR SFGs drawn side by side, both with input x(k) and output y(k)]
Therefore we might conclude that we get something for nothing here! By carefully retiming the circuit we get the
benefit of an implementation that can be clocked faster due to the smaller critical path delay.
But we all know that nothing is free in this world! There is always a hidden cost, or hidden overhead!
In the next slide we will consider the actual cost of the components when implemented in 8 bit arithmetic.
Additional Cost of the Transpose FIR 10.20
• Assume that the filter weights and the data are both 8 bit resolution.
[Figure: word growth in the 8 bit standard and transpose FIRs: 8 bit data and weights, 16 bit products, and an adder line growing from 16 through 17 to 18 bits]
In fact the number of delay registers for the transpose FIR is slightly more than double that for the standard FIR,
given that the wordlength on the adder delay line continues to grow. There are also more interconnections required
(connecting a 16 bit register requires twice as many interconnects as an 8 bit register).
For a much longer filter, note that the wordlength grows above 16 and hence the adder line data registers
require to be larger than just 16 bits as the wordlengths grow.
[Figure: a longer transpose FIR adder line with register wordlengths growing 17, 17, 17, 18, 18, 18, 18, 19]
At some level this will be a disadvantage of the transpose FIR over the standard FIR. Often, it is suggested that
in FPGAs delay/flip-flops are virtually free given they are abundantly available. However at some point this will
catch up with the design.
So it's worth repeating: there is never really anything for free in this world. And that goes for the timing advantage
of the transpose FIR filter.
Another Retiming of an FIR SFG 10.21
[Figure: cut-set retiming the 5-weight FIR (I/O-test-line marked) yields an SFG with z^-2 delays between the taps on the data line]
• Note that the I/O-test-line shows the retimed SFG output is delayed by
4 samples.
Notes:
For this retimed SFG, note that the I/O-test-line from input to output would mean that this retiming inserts delay
of 4 samples (i.e. z – 4 ), hence we correctly refer to the output as y ( k – 4 ) if this is the same “y” as in the standard
FIR.
Therefore, the critical path has again been reduced to τ_mult + τ_add , giving a much higher clocking rate
than the standard FIR.
However, a disadvantage is that we require significantly more registers - 3 times as many as the standard FIR
or the transpose FIR.

y(k) = w0 x(k) + w1 x(k-1) + w2 x(k-2) + w3 x(k-3) + w4 x(k-4)
z(k) = y(k-4) = w0 x(k-4) + w1 x(k-5) + w2 x(k-6) + w3 x(k-7) + w4 x(k-8)
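A register-level model of this retimed SFG (a Python sketch; the weights and data are hypothetical examples) confirms the behaviour: the structure produces the direct-form output delayed by N - 1 = 4 samples.

```python
def fir_direct(w, x):
    """Reference direct-form FIR, zero initial state."""
    return [sum(w[n] * (x[k - n] if k - n >= 0 else 0)
                for n in range(len(w))) for k in range(len(x))]

def fir_systolic(w, x):
    """Register-level model of the retimed FIR: two data registers (z^-2)
    between taps and one register (z^-1) between adders."""
    n = len(w)
    d1, d2 = [0] * n, [0] * n      # the paired data registers per tap
    s = [0] * n                    # the registered adder line
    out = []
    for xk in x:
        xin = [xk] + d2[:n - 1]    # tap i sees the previous tap's 2nd register
        s = [(s[i - 1] if i else 0) + w[i] * xin[i] for i in range(n)]
        for i in range(n):
            d1[i], d2[i] = xin[i], d1[i]   # clock edge: data moves one register
        out.append(s[-1])
    return out

w = [1, 3, 7, 3, 1]                  # hypothetical 5-weight filter
x = [2, 5, 1, 4, 6, 0, 3, 8, 9, 7]   # hypothetical input samples
z = fir_systolic(w, x)
# the first N-1 = 4 outputs are zero start-up values, then y(k-4) emerges
```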
“Arbitrary” Cut Sets 10.22
• If you have the requirement for delays at specific places within an SFG
then cut set theory is likely to be a useful tool:
[Figure: an “arbitrary” cut-set placing z^-1 delays at chosen edges inside the SFG, with the I/O-test-line marked]
• An I/O-test-line from input to output indicates that the output is now delayed
by one sample in this retimed SFG.
Notes:
A cut must separate an SFG into two parts. Therefore we could, from this definition, set up a closed cut-set of
the following type:
[Figure: a closed cut-set drawn around part of the SFG, with an inbound edge crossing it]
An I/O-test-line from input to output indicates that the output has not been delayed when compared to the original
SFG.
Systolic Arrays 10.23
• Systolic arrays were widely researched and designed in the 1980s and
early 1990s.
• However, technology was such that very few were actually built.
• A general definition of a systolic array is (Kung, VLSI Array Processors, 1987, Prentice Hall):
• Synchrony: The data are rhythmically computed (timed by a global clock) and
passed through the network;
• Modularity and regularity: The array consists of modular processing units with
homogenous interconnections; the array may be extended indefinitely;
• Pipelinability: The array is pipelinable - i.e. time delay scaling by α will allow α
data sets to be processed simultaneously by the array.
So how many of these arrays were actually built? Not many - in fact hardly any at all! The research was primarily
based on mapping algorithms, general procedures for design and so on. The simple reason they were not built
was that the technology was not available. Wafer scale integration was attempted by a few labs, however the
results were not stunning. So systolic arrays and their simple massive parallelism sat on the shelf waiting for
the technology.
And it arrived - FPGAs. Of course FPGAs have been with us since the 80s also, however it is only in the last few
years that fast multipliers and other arithmetic capability have been available on larger FPGA devices. Although
FPGAs are not specifically designed for systolic array implementation, using the principles of cut set, retiming, delay
scaling etc. to produce an SFG with the attributes of a systolic array is indeed a desirable thing to do. The
attributes of a systolic design allow short critical paths to be designed, and lead to very regular arrays.
FIR Filter Systolic Array 10.24
• From the retimed FIR SFG of Slide 10.21 we can redraw this as:
x(k)
z-1 z-1 z-1 z-1 z-1 z-1 z-1 z-1
• We now have a “systolic array”. Our cell is a multiplier and adder and
all cells are connected through clocked registers:
z-1 z-1
As presented earlier we could perform delay scaling on this systolic SFG and use it to share multichannels of
data.
The advantage of the systolic SFG over the transpose FIR is that there is no broadcast line. The cost of the
systolic FIR is slightly higher than the transpose FIR, given the cost of the extra delays/registers.
The systolic FIR can be extended indefinitely and the critical path delay is still that of one cell. Hence the
maximum clocking rate does not reduce just because of a longer length filter. Of course, eventually one will
reach a length where clock skew problems may occur (because the wires transmitting the clock have a time
skew due to their long (relatively speaking!) length on the chip). It was this problem of clock skew that led to the
companion idea to systolic arrays called wavefront arrays. In this case all cells communicate by an
asynchronous handshake. Probably not a strategy that will be adopted for FPGAs, but if of interest you can read
more in the book by S.Y. Kung, VLSI Array Processors, Prentice Hall, 1987.
More retiming on FIR Systolic Array 10.25
• With some more cut sets we can further reduce the critical path by
introducing a delay at the multiplier output before the adder:
[Figure: cut-sets drawn through the systolic array, introducing a register m(k) between each multiplier and adder]
The critical path of the array has therefore been further reduced by the path time of the adder. The critical path could
now be identified as τ_mult and hence the maximum clocking rate is:

f_max = 1 / τ_mult

Compare this to the maximum clocking rate of the systolic array of Slide 10.21, f_max = 1 / ( τ_mult + τ_add ) .
This increase is of course minimal, but is nonetheless an increase should your design require it.
FIR Filter with Adder Tree 10.26
• Consider the FIR using an adder tree to produce the sum of products
• The critical path is τ mult + 3τ add . Compare to the standard FIR which
has a critical path of τ mult + 7τ add for an 8 weight filter.
Generally, with a linear SFG we can appropriately move adders, delays and multipliers according to some simple
rules.
For example, we can illustrate simple linear principles with the SFGs below:
[Figure: two equivalent SFG realisations of y(n) = ax(n) + bz(n), moving the multipliers across the adder]
For SFGs, if we assume interconnects are delayless then we can move/slide elements around and not change
the functionality of the SFG.
The strategy of manipulating SFGs instead of equations when doing DSP related mathematics is sometimes called
algorithmic engineering. J.G. McWhirter published a number of papers in this area.
Pipelining the Adder Tree 10.27
• We can pipeline the adder tree by using cut set retiming to yield:
We could further reduce the latency to τ_mult by applying a further cut set below the multipliers to yield:
Symmetric FIR Filters 10.28
[Figure: a 6 weight symmetric FIR with impulse response 1, 3, 7, 7, 3, 1]
The desirable property of linear phase is particularly important in applications where the phase of a signal
carries important information. To illustrate the linear phase response, consider inputting a cosine wave of
frequency f , sampled at f_s samples per second (i.e. cos 2πfk/f_s ) to a symmetric impulse response FIR filter
with an even number of weights N (i.e. w_n = w_(N-1-n) for n = 0, 1, …, N/2 - 1 ). For notational convenience let
ω = 2πf/f_s :
[Figure: the symmetric FIR implemented with pre-adders: paired samples such as x(k-4) and x(k-5) are added before multiplication by the half-set of weights 1, 3, 7]
• For an N weight filter, the number of multipliers is now N/2 (for even N ,
and N/2 + 1 for odd N).
It is often stated that the transpose filter cannot be implemented in a symmetric form. This can be easily
appreciated by attempting to cut set the above SFG.
[Figure: cut sets applied to the symmetric FIR, input and outputs marked]
Therefore a retimed SFG allows the adder line to be pipelined; the data line is then a mix of a double delay
line and a broadcast line after the mid-weight.
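The halved multiplier count can be illustrated with a sketch (Python; the input data are hypothetical, and the half-weights 1, 3, 7 follow the slide's 1 3 7 7 3 1 example): paired taps are pre-added before a single multiplication.

```python
def fir_direct(w, x):
    """Reference direct-form FIR, zero initial state."""
    return [sum(w[n] * (x[k - n] if k - n >= 0 else 0)
                for n in range(len(w))) for k in range(len(x))]

def fir_symmetric(w_half, x):
    """Even-length symmetric FIR (w[n] == w[N-1-n]): pre-add the paired
    samples, then multiply once, halving the multiplier count to N/2."""
    N = 2 * len(w_half)
    def tap(k):
        return x[k] if 0 <= k < len(x) else 0
    return [sum(wn * (tap(k - n) + tap(k - (N - 1 - n)))
                for n, wn in enumerate(w_half)) for k in range(len(x))]

w_half = [1, 3, 7]              # half of the 1 3 7 7 3 1 response from the slide
x = [2, 5, 1, 4, 6, 0, 3, 8]    # hypothetical input samples
```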
• Consider the standard pipelining of the following FIR SFG which (for
example purposes) uses 4 bit data and weights:
[Figure: pipelined FIR with 4 bit data x(k) and 4 bit weights]

f_clk = 1 / ( τ_mult + τ_add )
A simple type of 4 bit (+ve) integer multiplier has a latency of 7 cell “delays” (longest path):
[Figure: a 4 bit array multiplier built from full adder (FA) cells, inputs a3 a2 a1 a0 and b3 b2 b1 b0, product bits p7...p0, with the longest path marked through 7 cells]

Example: 1101 × 1011 (13 × 11):

     1101
  ×  1011
  ---------
     1101
    1101
   0000
  1101
  ---------
 10001111  = 143

i.e. a3a2a1a0 × b3b2b1b0 = p7p6p5p4p3p2p1p0
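The partial-product arithmetic of the array multiplier can be mimicked in a few lines (a behavioural Python sketch, not a model of the FA cell array): one shifted copy of the multiplicand is accumulated for each set bit of the multiplier.

```python
def shift_add_multiply(a, b, nbits=4):
    """Unsigned shift-and-add multiplication: accumulate one shifted copy of
    the multiplicand a for every set bit of the multiplier b."""
    p = 0
    for i in range(nbits):
        if (b >> i) & 1:
            p += a << i          # partial product a * 2^i
    return p

# The worked example from the notes: 1101 x 1011 = 10001111 (13 x 11 = 143)
result = shift_add_multiply(0b1101, 0b1011)
```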
Pipelining the FIR Multipliers 10.31
• The above cut sets could of course cut right through the multiplier:
[Figure: cut sets passing through the internals of each 4 bit multiplier]

f_clk = 1 / ( τ_mult / N + τ_add )   (where N = 4 bits for our example)
Yielding the pipelined version, where we are now adding delays at the bit level, i.e. D-type flip-flops.
The addition of the delays means that the latency path from register to register is now 1/4 of what it was previously.
IIR Filters 10.32
• The latency from input to the output of a simple 5 weight IIR is:
[Figure: 5 weight all-pole IIR SFG with input x(k), output y(k), delays z^-1 and feedback coefficients b1...b5]
• If we tried to cut-set the above then because of the two data transfer
paths in opposite directions, we would end up with time-advance
requirements, which are, of course, non-causal.
(Note that we are using an all-pole IIR, i.e. no feedforward coefficients, to simplify the figures.)
[Figure: the IIR SFG redrawn, delays z^-1 and feedback coefficients b1...b5]
• Then flip the input x ( k ) and input-adder to the right hand side:
[Figure: the flipped SFG: output y(k) on the left, input x(k) and coefficients b1...b5 on the right]
• Try any variety of cut sets on the above SFG; no matter what you try,
it is impossible to “systolise” it in the same style as done for the FIR in
Slide 10.24.
[Figure: a single feedback loop with multiplier b1, delay z^-1, input x(k) and output y(k)]
No matter what cut-sets are introduced it is impossible to get a delay after the multiplier.
So here is a converse way of posing this problem. Assume that you are “given” the IP blocks shown below,
where (quite reasonably) the result of the add or multiply is clocked out into a register:
[Figure: IP blocks: an adder and a multiplier, each with a registered (z^-1) output]
Now using these IP blocks it is impossible (!) to build the precise integrator SFG shown above. See the notes
of Slide 10.35 for how this could be managed by delay scaling (which of course leads to a slowing down of
the input data rate).
“Systolisation” of IIR Filter SFG 10.35
However, as with many SFG architectures, we are keen to ensure that pipelining can take place and that we
minimise ripple through adder lines.
The above SFG was delay scaled by two and therefore the array can process two channels or two independent
data streams.
If we revisit the example from the notes of Slide 10.34, we now use the IP blocks given to us; however, the only
option is to slow down the IIR filter by a factor of 2, by delay scaling:
[Figure: steps: delay scale by 2 → separate the two delays and cut set → delay transfer → use the available IP blocks]
Pipelining the IIR Multipliers? 10.36
[Figure: attempting to pipeline the IIR multipliers b1...b5: applying a cut set (×4) leaves a non-causal z^+4 advance in the feedback loop]
Hence, unlike for the FIR filter, there is no possible speedup by pipelining the IIR filter multipliers. It is simply
impossible!
Adaptive Filter SFG 10.37
[Figure: adaptive FIR filter SFGs: input x(k), desired signal d(k), output y(k) and error e(k), drawn with both standard and transpose FIR stages]
• Although this filter may look OK (and may even “adapt” to some
degree!) it is NOT a standard adaptive system as the output y ( k ) is no
longer a function of the most recently updated weights.
This observation that the transpose FIR cannot be used to exactly implement a standard adaptive LMS type
algorithm has quite often been “missed” by chip designers. There were two notable chips released in the 1980s
(one being the A100 from INMOS - still sourced and used in many military systems) which were parallel FIR
filters, but which both used the transpose FIR and failed to recognize the problems this would incur for adaptive
FIR systems.
So if designing adaptive FIRs from first principles for FPGAs......be careful to be sure you know what your
architecture actually does!
LMS Hardware Implementation 10.39
In full, the weight update for the 4 weight example is:

[ w0 ]     [ w0 ]               [ x(k)   ]
[ w1 ]  =  [ w1 ]     + 2µe(k)  [ x(k-1) ]
[ w2 ]     [ w2 ]               [ x(k-2) ]
[ w3 ]k    [ w3 ]k-1            [ x(k-3) ]k

or in vector form:

w(k) = w(k-1) + 2µe(k)x(k)

where

w(k) = [ w0, w1, w2, w3 ]k^T and x(k) = [ x(k), x(k-1), x(k-2), x(k-3) ]^T
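One iteration of this update can be sketched as follows (Python; the function name and example values are illustrative). Note that the error e(k) is formed from the filter output first, and every weight is then updated with that same e(k) before the next sample arrives:

```python
def lms_step(w, x_vec, d, mu):
    """One serial-LMS iteration: filter, form the error, then update every
    weight with that SAME e(k) before the next sample arrives."""
    y = sum(wi * xi for wi, xi in zip(w, x_vec))   # filter output w^T x
    e = d - y                                      # error against desired d(k)
    w_new = [wi + 2 * mu * e * xi for wi, xi in zip(w, x_vec)]
    return w_new, y, e
```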
Serial LMS (SLMS) Architecture 10.40
[Figure: serial LMS (SLMS) architecture: FIR stage with z^-1 delays, error e(k) formed from d(k), and a standard 2µ weight update section fed by x(k)]
This structure is identical to the one in Slide 10.39; we have simply changed the direction of some data paths.
The following slides will show why there is a need for pipelining the SLMS structure and how to achieve it.
LMS: Hardware Implementation Issues 10.41
• For example, the SLMS structure shown before contains an FIR filter
stage. As seen before, a transpose implementation of the FIR filter is
better suited to FPGA implementation.
• All these modifications may modify the algorithm and therefore can
have a serious impact on its performance.
[Figure: the canonical FIR (weights w0, w1, w2) and its retimed transpose form (weights reversed) side by side]
This structure constitutes a retimed version of the canonical FIR structure. Both implementations present
exactly the same behaviour. However, in an adaptive filter setup we cannot retime only the FIR stage without
considering the rest of the structure. Doing so results in an architecture which does not implement the LMS
algorithm. A different algorithm is now in place, and even though it may work and meet the designer’s
requirements great care has to be taken when considering the rules (stability, convergence speed, etc.) which
apply to the SLMS. These rules may not hold for the new algorithm.
SLMS: Retiming I 10.42
[Figure: SLMS with a cut-set applied: the canonical 2µ weight update acquires z^+1 advances on the edges crossing the cut]
A decision is then made on the use of advance and/or delay of the various edges depending on their inbound
or outbound directions.
[Figure: the resulting SFG: a weight update section containing z^+1 elements, i.e. a non-causal weight update that cannot be built]
A different option is to use the original weights update structure of the SLMS algorithm but with a transpose FIR
setup. As mentioned earlier, this will modify the algorithm behaviour (it is not the SLMS anymore) but will ease
implementation. This possibility is considered in subsequent slides.
Non-Canonical LMS (NCLMS) 10.44
• The use of a transpose FIR structure with the weight update of the
SLMS algorithm results in the NCLMS, which is a different algorithm
with different behaviour and performance.
[Figure: NCLMS: transpose filtering stage with weights w0(k-1)...w3(k-1), output y(k), error e(k) formed from d(k), and the standard 2µ weight update]
We can note the actual weight update is the same as for the standard LMS:
[ w0 ]     [ w0 ]               [ x(k)   ]
[ w1 ]  =  [ w1 ]     + 2µe(k)  [ x(k-1) ]  , i.e. w(k) = w(k-1) + 2µe(k)x(k)
[ w2 ]     [ w2 ]               [ x(k-2) ]
[ w3 ]k    [ w3 ]k-1            [ x(k-3) ]k
Observe below the different adaption speed for two adaptive filters with the same parameters. One implements
the SLMS while the other runs under the NCLMS.
[Figure: SystemView simulation: MSE (dB) against time in seconds (0 to 4 s) for two adaptive filters with the same parameters; the NCLMS settles around -100 to -150 dB while the SLMS continues down towards -250 to -300 dB]
Non-Canonical LMS (NCLMS) 10.45
• We can try to convert the NCLMS to include a standard FIR and then
note the different update equations that result for the weights.
[Figure: the NCLMS redrawn with a standard FIR stage: the weight update section is now non-canonical, with extra z^-1 delays compared to the SLMS update]
[Figure: block diagram: x(k) into FILTER (weights w(k)) producing y(k); a WEIGHTS UPDATE block driven by e(k), d(k) and 2µ]
• The SLMS clock is limited by the time it takes to calculate the filter output
and to update the weights.
Assume it takes T_F seconds to calculate the filter output, and T_WU seconds to perform all the filter weight
updates. In total, the next sample x(k+1) will not be processed until at least T_F + T_WU seconds have passed
since x(k) was received. Clearly, that means there is a lower bound of T_F + T_WU seconds on the sample period
in order to be able to use this structure, or in terms of sampling frequency:

f_s ≤ 1 / ( T_F + T_WU )
The speed of execution of the system is limited by the structure used, which does not allow more than one
sample to be processed at the same time. A solution to this is provided by pipelining, which allows more than
one sample to be processed by modifying the original structure. Some pipelining techniques do not modify the
input/output behaviour of the original structure, i.e. they have exactly the same functionality. However, other
pipelining techniques do modify the initial input/output behaviour and the system designer has to be careful
when using these structures and verify their validity for the considered application.
Conclusions 10.47
• In this section we have reviewed the simple signal flow graph (SFG)
tool for representation of linear filtering type algorithms. In particular: