Retiming Signal Flow Graphs (SFG)
• Cut set retiming strategies of time scaling and delay transfer will be
introduced.
• The “old” idea of systolic arrays and their relevance to current and
future FPGA DSP design strategies will be presented.
As FPGAs come from a number of manufacturers, and then various families from the different manufacturers,
there is the simple question - which device to use for the application? The choice of manufacturer has many
avenues of consideration, some practical, some based on tradition, and some based on engineering. The actual
device chosen will depend on many things such as cost, capability, ease of programming, and whether the
design is a single prototype or demonstrator, or whether the system must go to manufacture, in which case
design cost and power issues are of most importance.
Once the choice has been made the next requirement is “programming” the device. There are again many
decision choices here. Which DSP algorithm to choose, what toolset, what optimisation strategies and so on.
One bottom line, however, is to start off with the right type of DSP design. Undoubtedly different synthesis tools will
give different results, some resulting in smaller designs and faster clock rates than other tools - just like
compilers - different compilers produce different code, some better code than others. (This is of course just a
reflection on life - all car washes wash cars; some do it better than others, but regardless of the specific final
quality as long as it washes cars with some competence all have their place in the car wash business.)
So, to the point of this presentation: get the DSP design in an amenable form before heading off to synthesise the
design for an FPGA. This will require understanding aspects of timing, pipelining, parallelism, device sharing
etc. No changes to the actual algorithms, just changes to the way the algorithms are prepared for
implementation.
Signal Flow Graph Critical Path 10.2
[Figure: 5-weight parallel FIR filter SFG: input x(k), weights w0...w4, output y(k)]
Therefore there is granularity associated with the latency of the multiplier. Do we just lump together the latency
of the multiplier as one time delay, or do we break it down into smaller components? This is of course the
decision of the designer and depends on what you are actually trying to achieve and what problem you may (or
may not) be trying to solve.
In this section we will be identifying critical paths to understand issues related to maximum clocking rates. By
managing and minimising this critical path we will be in a position to increase the clocking rate of our
implementations.
To be more correct in the above figure we should probably include the input register for x ( k ) and the output
register for y ( k ) (which would usually be present) and show that all registers are clocked:
[Figure: the same FIR SFG with an input register on x(k) and an output register on y(k), all registers driven by a common clock]
• If τ_add is the propagation delay of the adder and τ_mult is the propagation
delay of the multiplier, the longest path for data “ripple” can be recognised
as running from x(k) through one multiplier and the four-adder chain to y(k), giving:

f_clk = 1 / ( τ_mult + 4τ_add )
Therefore if, for argument's sake, τ_mult = 1ns and τ_add = 0.1ns , then the maximum clocking frequency to avoid
race hazards is:

f_clk = 1 / ( 1 + 4 × 0.1 ) × 10^9 ≈ 714MHz
If we considered a longer filter of, say, 10 weights, then the maximum clocking frequency reduces:

f_clk = 1 / ( 1 + 9 × 0.1 ) × 10^9 ≈ 526MHz

Therefore, clearly, the longer the (canonical) parallel FIR filter, the slower the clocking rate.
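The critical path arithmetic above is easily captured in a small helper (a sketch in Python; the delay figures are the example values from the text, not measured device data):

```python
def fir_max_clock_hz(n_weights, tau_mult_s, tau_add_s):
    """Maximum clock rate of a canonical parallel FIR: the critical path
    ripples through one multiplier and (n_weights - 1) adders."""
    return 1.0 / (tau_mult_s + (n_weights - 1) * tau_add_s)

# Example figures from the text: tau_mult = 1 ns, tau_add = 0.1 ns
f5 = fir_max_clock_hz(5, 1e-9, 0.1e-9)    # approx. 714 MHz
f10 = fir_max_clock_hz(10, 1e-9, 0.1e-9)  # approx. 526 MHz
```

Note how the achievable clock rate falls as the adder chain lengthens.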
Cut Set Retiming 10.4
• Cut-set comes from formal graph theoretic techniques that can be used
to retime the SFG to a more amenable form.
[Figure: an SFG with input x(k) and output y(k), partitioned by a cut-set]
August 2005, For Academic Use Only, All Rights Reserved
Notes:
This might just be one of the most powerful and valuable FPGA design strategies you will come across for DSP
parallel systems design. It is very simple to understand and simple to use...
The complexity and application of cut sets and other related techniques extends well beyond the simple signal
flow graphs that are used in signal processing. However, in this course we will present enough information to
ensure that this very powerful retiming technique can be applied in a whole variety of applications.
Cut set retiming was formalised by S.Y. Kung in his 1987 textbook VLSI Array Processors published by
Prentice-Hall. In a DSP context the use of cut set was first widely presented in the seminal paper:
S.Y. Kung. On supercomputing with systolic/wavefront array processors. Proceedings of the IEEE, Vol 72, No. 7, July
1984.
It is interesting to note that although the above book was published many years before FPGA based DSP, the
issues relating to retiming, “systolic” and “wavefront” arrays as discussed in the book are very relevant to
modern FPGA-DSP. Worth a look if you can source a copy.
Systolic arrays were arrays of simple computing elements (cells) which communicated with neighbouring cells
via simple data transfer. Communication was synchronized, with the name systolic being derived from the term
systole related to the regular beating of the heart. Systolic arrays were essentially parallel processors.
An alternative form of parallel array was the wavefront array, where neighbouring cells communicate via a
handshake; such arrays were therefore termed asynchronous.
Cut Set Retiming Rule 1 10.5
• Cut set Rule 1: We can advance and/or delay the various edges
depending on their inbound or outbound directions.
[Figure: a cut-set separating SFG-1 and SFG-2, with z^-1 delays on the edges and an outbound edge crossing the cut-set]
One thing that will become clear in the next few slides is that some retimings are not physically possible. That
is to say, the SFG result can be expressed, but the actual physical implementation would require
non-causal (i.e. time advance) components.
Note that inside the SFG1 and SFG2 circuits is some arbitrary DSP computation:
[Figure: example contents of SFG-1 and SFG-2: an arbitrary network of z^-1 delays, multipliers and adders with inputs x(k), r(k) and output y(k)]
(Note the above SFG is not meant to be any specific circuit - just an arbitrary DSP linear system implementation
with multiplies and adds. However, if we did choose to study this circuit, we might notice that the output y(k)
is some linear combination of current and past values of x(k) and r(k), and past values of y(k). Hence it is
some sort of linear recursive system on y(k) with two inputs - but that is not really important; the focus is on the
retiming at the boundary of SFG1 and SFG2.)
Cut Set Delay Transfer 10.6
[Figure: delay transfer at a cut-set between SFG-1 and SFG-2: delaying the inbound edges and advancing the outbound edge; a z^+1 advance cancels with a z^-1 delay (delay cancellation) while two z^-1 delays combine into z^-2 (delay doubling)]
Nothing inside the circuits has changed, i.e. delays are only added or removed at the cut-sets.
Clearly the time advance z^+1 applied to the outbound data path has cancelled with the time delay z^-1,
i.e. z^+1 z^-1 = 1,
and we have delay doubling, z^-1 z^-1 = z^-2. Care must be taken when performing cut set retiming, as if any data
transfer path ends up with a time advance then the system becomes non-causal - a time advance, z^+1, is a
look ahead one step in time, and such things are impossible!
Cut Set Delay Transfer “Failure” 10.7
[Figure: a “failed” cut-set retiming: after delay transfer, a z^+1 advance remains on an edge between SFG-1 and SFG-2]
This is a non-causal implementation and therefore cannot be practically built. The ability to predict the future
(as required by the z^+1) is just not with us yet!
So..... a bad choice of cut set retiming. In fact, a choice of cut set that is IMPOSSIBLE to actually implement.
Cut Set Rule 2: Time Scaling 10.8
[Figure: a 4-weight FIR SFG (weights w0...w3, output y(k)) before and after delay scaling by α = 2: every z^-1 becomes z^-2]
Prior to scaling, the output of the filter was simply the weighted sum of past inputs, where the filter weights and
input vector were specified by

w = [ w0, w1, w2, w3 ]^T and x_k = [ x(k), x(k-1), x(k-2), x(k-3) ]^T , i.e.

y(k) = w^T x_k = w0 x(k) + w1 x(k-1) + w2 x(k-2) + w3 x(k-3)
After scaling we have simply doubled (scaled by 2) all of the delays in the circuit:
[Figure: the same 4-weight FIR with all delays doubled to z^-2]
Slowing Down the Input Rate 10.9
• The delay scaling on the FIR SFG essentially increases the length of
the filter from N = 4 to N = 8 , but where every second weight is 0
(zero):
[Figure: the equivalent 8-weight FIR with weights w0, 0, w1, 0, w2, 0, w3, 0]
w = [ w0, 0, w1, 0, w2, 0, w3, 0 ]^T
Therefore, for a given input sequence of data, to use the delay scaled SFG to produce the same output would
require the data vector to be:
x_k = [ x(k), 0, x(k-1), 0, x(k-2), 0, x(k-3), 0 ]^T
(Note the last “0” weight is not necessary and could be removed)
Therefore in order to produce the same sequence of output numbers for a given input sequence x ( k ) , this input
sequence requires to be upsampled by 2, or every second input is set to zero.
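This equivalence can be checked numerically with a short sketch (Python; the weights and input values are hypothetical examples, not from the slides). Feeding the upsampled-by-2 input through the delay-scaled filter reproduces the original outputs at the even output indices, with zeroes in between:

```python
def fir(weights, x):
    """Direct-form FIR: y[k] = sum_n weights[n] * x[k - n], zero initial state."""
    return [sum(w * (x[k - n] if k - n >= 0 else 0)
                for n, w in enumerate(weights)) for k in range(len(x))]

def upsample2(seq):
    """Insert a zero after every sample (the input rate change that delay
    scaling by alpha = 2 demands)."""
    return [v for s in seq for v in (s, 0)]

w = [1, 3, 7, 5]                  # hypothetical 4-weight filter
w_scaled = [1, 0, 3, 0, 7, 0, 5]  # delays doubled: a zero between each weight
x = [4, 7, 1, 6, 2]               # hypothetical input samples

y = fir(w, x)                     # original filter, original rate
y2 = fir(w_scaled, upsample2(x))  # delay-scaled filter, upsampled input
# y2 holds y at the even indices and zeroes at the odd indices
```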
Time Scaling 10.10
• Without scaling, data can be input at the full rate in the usual style of an
FIR filter, i.e. not scaled down by α :
[Figure: the unscaled FIR (input x(k), delays z^-1, output y(k)) and the delay-scaled FIR, whose output q(k) contains interleaved zeroes]
Y(z) = w0 X(z) + w1 X(z) z^-1 + w2 X(z) z^-2 + w3 X(z) z^-3
     = ( w0 + w1 z^-1 + w2 z^-2 + w3 z^-3 ) X(z)

therefore Y(z)/X(z) = w0 + w1 z^-1 + w2 z^-2 + w3 z^-3
Now if we apply delay scaling such that we make the substitution z^-1 → z^-2 (and equivalently z → z^2) then:

Q(z)/P(z) = Y(z^2)/X(z^2) = w0 + w1 z^-2 + w2 z^-4 + w3 z^-6
If we assumed the input sequence was, for example the set of samples: x ( k ) = [ x 0, x 1, x 2, x 3, …, x N ] , then
the z-transform is:
X ( z ) = x0 + x1 z –1 + x2 z –2 + x3 z –3 + … + xN z –N
Therefore the delay scaling inserts zeroes in the input sequence. Equivalently zeroes are inserted in the output
sequence, q ( k ) .
Delay Scaling Loss of Efficiency 10.11
• For the previous slide's scaling by 2, the FIR only has a computational
efficiency of 50%: as every 2nd input is a zero, on every 2nd
clock cycle only zeroes are input to the multipliers:
[Figure: snapshot at sample k+1: the tap registers hold 0, 6, 0, 1, 0, 7, 0, so the weights w0...w3 perform 0 useful multiply/adds]
Note that if, for example, the delay scaling had been by a factor of α = 3 , then the array would only be 1/3
or 33% efficient. So for the input sequence [4, 0, 0, 7, 0, 0, 1, 0, 0, 6]:
[Figure: snapshots at samples k, k+1 and k+2 for α = 3: only one cycle in three presents non-zero data to the weights w0...w3 (4 useful multiply/adds at sample k, then 0 at samples k+1 and k+2)]
Increasing the Efficiency: SFG Sharing 10.12
[Figure: two interleaved data streams z1(k) and z2(k) sharing one delay-scaled FIR SFG, producing the two outputs v1(k) and v2(k)]
This technique is therefore of interest if you have different data sets that must be processed by the same filter,
and speed of operation is such that the slowdown caused by “sharing” or multiplexing SFGs is tolerable. Of
course this is an ideal structure for use with multichannel applications (i.e. various data sources to be processed
by the same filter characteristic).
Multiplex the input and output..... 10.13
[Figure: multichannel operation: the channels are interleaved (Int) at rate fs into the shared filter and de-interleaved (DeInt) at the output]
In general to do N channel interleaving requires that all delays are scaled by N, and the input signals are
upsampled by N.
Therefore, if an FIR SFG can be clocked at, say, a sampling rate of fs = 100MHz , then one channel of
data can be processed at a 100MHz input sample rate. If this array was delay scaled by α = 4 then the FIR SFG
could now process 4 independent channels at a maximum data rate of 100/4 = 25 MHz.
If the FIR filter had 20 weights, then for both the single channel at 100MHz, or the 4 channels at 25MHz, the
total number of MACs (multiply-accumulates) per second is 2000 million MAC/sec.
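The two-channel case can be sketched as follows (Python; the filter weights and channel data are hypothetical examples). Interleaving the channels into a delay-scaled (α = 2) filter and de-interleaving the output matches filtering each channel separately:

```python
def fir(weights, x):
    """Direct-form FIR with zero initial state."""
    return [sum(w * (x[k - n] if k - n >= 0 else 0)
                for n, w in enumerate(weights)) for k in range(len(x))]

def interleave(a, b):
    """Time-multiplex two equal-length channels: a0, b0, a1, b1, ..."""
    return [v for pair in zip(a, b) for v in pair]

w = [2, 5, 3]                # hypothetical filter shared by both channels
w_scaled = [2, 0, 5, 0, 3]   # the same filter after delay scaling by N = 2
a = [1, 4, 2, 8]             # channel 1 samples (hypothetical)
b = [3, 0, 7, 5]             # channel 2 samples (hypothetical)

y = fir(w_scaled, interleave(a, b))
# de-interleaving y recovers each channel filtered independently
```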
Input to Output Latency 10.14
• Following a cut set retiming by time scaling and delay transfer it may be
that the delay from input to output is changed.
[Figure: the cut-set retimed SFG of Slide 10.6, with the transferred z^-1 delays shown]
• From Slide 10.6 we chose to time delay all inbound edges and hence
the cut set procedure adds one delay. (y(k) was o/p before retiming)
Notes:
Comparing this to the original SFG, the I/O test line shows a delay from x(k) to y(k) as a result of the cut set.
Therefore in this case the output on the right hand side should be correctly labelled z(k) = y(k-1), as without the
cut set the output was y(k).
[Figure: the original SFG, input x(k) and output y(k), with the cut-set between SFG-1 and SFG-2 marked]
In many DSP problems, adding a few delays to the overall input to output transfer path is not usually a problem.
However, there are problems where this WILL be a concern, and therefore we should always keep track of
precisely what is happening.
Retiming a Simple FIR SFG 10.15
• We can redraw (no retiming yet!) the FIR SFG in the following form
[Figure: the FIR SFG redrawn: x(k) feeds the z^-1 chain, weights w0...w4, with the adder line running in the reverse direction to y(k)]
....as the data transfer paths are considered delayless, we have simply
reversed the direction of the adder line.
[Figure: two successive cut-sets (Cut set 1, Cut set 2) across the redrawn SFG, each adding a z^-1 on the adder line towards y(k)]
For our second cut-set we can again retime by time advancing all inbound edges and time delaying all outbound edges:
[Figure: the SFG after the second cut-set retiming]
....and so on.
The Transpose FIR SFG 10.16
[Figure: repeating the cut-set retiming produces the transpose FIR: the weight order reverses to w4...w0 and all the delays move onto the adder line]
Of course, given that most FIR filters are in fact symmetric, this re-ordering is of no consequence, as a 5
weight symmetric filter will have w0 = w4 and w1 = w3 .
Transpose FIR Additional Latency? 10.17
• Note that there is no change to the delay or latency from input to output:
[Figure: I/O-test-lines drawn on the standard and transpose FIR SFGs from input to output y(k)]
Hence the time advance and time delay at the edges would “balance” and give the same result as above, i.e.
no change to the latency, i.e. z^-4 z^4 = 1 .
Clocking Rate of the Transpose FIR 10.18
• Recall from Slide 10.2 that the critical path from input to output of a parallel
implementation of a 5 weight canonical FIR was τ_mult + 4τ_add .
• However, for the transpose FIR this is now reduced to τ_mult + τ_add .
Therefore the maximum data sample clocking speed of an N weight transpose system is:

f_clk = 1 / ( τ_mult + τ_add )
Although the transpose FIR circumvents the problem of the ripple through the adder chain, the broadcast line
at the top may be undesirable, particularly if the filter is very long and the broadcast line (or wires or bus) is very
long.
Taking the same example figures from the notes of Slide 10.3, τ_mult = 1ns and τ_add = 0.1ns , the
maximum clocking frequency to avoid race hazards of the transpose FIR is:

f_clk = 1 / ( 1 + 0.1 ) × 10^9 ≈ 909MHz

Compare this to the 714MHz for the standard FIR.
For a 10 weight filter, the maximum clocking frequency to avoid race hazards of the transpose FIR is again:

f_clk = 1 / ( 1 + 0.1 ) × 10^9 ≈ 909MHz

Compare the 909MHz to the 526MHz for the 10 weight standard FIR result from the notes of Slide 10.3.
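The transpose structure can be modelled sample by sample to confirm it computes exactly the same outputs as the direct form (a Python sketch with hypothetical weights and data; the list s models the z^-1 registers on the adder line):

```python
def fir_direct(w, x):
    """Reference direct-form FIR, zero initial state."""
    return [sum(w[n] * (x[k - n] if k - n >= 0 else 0)
                for n in range(len(w))) for k in range(len(x))]

def fir_transpose(w, x):
    """Transpose-form FIR: each product feeds a registered partial-sum chain,
    so the combinational path is one multiplier plus one adder."""
    s = [0] * (len(w) - 1)            # the z^-1 registers on the adder line
    out = []
    for xk in x:
        y = w[0] * xk + s[0]          # output: one product + one registered sum
        for i in range(len(s) - 1):   # shift partial sums one stage onward
            s[i] = w[i + 1] * xk + s[i + 1]
        s[-1] = w[-1] * xk
        out.append(y)
    return out
```

Because each output is formed from one product plus one registered partial sum, the register-to-register path stays at τ_mult + τ_add regardless of filter length.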
Additional Cost of the Transpose FIR 10.19
• From a first view there is no extra cost to implement the transpose FIR
compared to the standard FIR.
[Figure: the 5-weight standard FIR and transpose FIR SFGs drawn side by side, both with input x(k) and output y(k)]
Therefore we might conclude that we get something for nothing here! By carefully retiming the circuit we get the
benefit of an implementation that can be clocked faster due to the smaller critical path delay.
But we all know that nothing is free in this world! There is always a hidden cost, or hidden overhead!
In the next slide we will consider the actual cost of the components when implemented in 8 bit arithmetic.
Additional Cost of the Transpose FIR 10.20
• Assume that the filter weights and the data are both 8 bit resolution.
[Figure: word growth in the 8 bit standard and transpose FIRs: 8 bit data and weights, 16 bit products, and an adder line growing from 16 through 17 to 18 bits]
In fact the number of delay registers for the transpose FIR is slightly more than double that for the standard FIR,
given that the wordlength on the adder delay line continues to grow. There are also more interconnections required
(connecting a 16 bit register requires twice as many interconnects as an 8 bit register).
For a much longer filter, note that the wordlength grows above 16 and hence the adder line data registers
require to be larger than just 16 bits as the wordlengths grow.
[Figure: a longer transpose FIR adder line with register wordlengths growing 17, 17, 17, 18, 18, 18, 18, 19]
At some level this will be a disadvantage of the transpose FIR over the standard FIR. Often, it is suggested that
in FPGAs delay/flip-flops are virtually free given they are abundantly available. However at some point this will
catch up with the design.
So it's worth repeating: there is never really anything for free in this world. And that goes for the timing advantage
of the transpose FIR filter.
Another Retiming of an FIR SFG 10.21
[Figure: cut-set retiming the 5-weight FIR (I/O-test-line marked) yields an SFG with z^-2 delays between the taps on the data line]
• Note that the I/O-test-line shows the retimed SFG output is delayed by
4 samples.
Notes:
For this retimed SFG, note that the I/O-test-line from input to output would mean that this retiming inserts delay
of 4 samples (i.e. z – 4 ), hence we correctly refer to the output as y ( k – 4 ) if this is the same “y” as in the standard
FIR.
Therefore, the critical path has again been reduced to τ_mult + τ_add , giving a much higher clocking rate
than the standard FIR.
However, a disadvantage is that we require significantly more registers - 3 times as many as the standard FIR
or the transpose FIR.

y(k) = w0 x(k) + w1 x(k-1) + w2 x(k-2) + w3 x(k-3) + w4 x(k-4)
z(k) = y(k-4) = w0 x(k-4) + w1 x(k-5) + w2 x(k-6) + w3 x(k-7) + w4 x(k-8)
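A register-level model of this retimed SFG (a Python sketch; the weights and data are hypothetical examples) confirms the behaviour: the structure produces the direct-form output delayed by N - 1 = 4 samples.

```python
def fir_direct(w, x):
    """Reference direct-form FIR, zero initial state."""
    return [sum(w[n] * (x[k - n] if k - n >= 0 else 0)
                for n in range(len(w))) for k in range(len(x))]

def fir_systolic(w, x):
    """Register-level model of the retimed FIR: two data registers (z^-2)
    between taps and one register (z^-1) between adders."""
    n = len(w)
    d1, d2 = [0] * n, [0] * n      # the paired data registers per tap
    s = [0] * n                    # the registered adder line
    out = []
    for xk in x:
        xin = [xk] + d2[:n - 1]    # tap i sees the previous tap's 2nd register
        s = [(s[i - 1] if i else 0) + w[i] * xin[i] for i in range(n)]
        for i in range(n):
            d1[i], d2[i] = xin[i], d1[i]   # clock edge: data moves one register
        out.append(s[-1])
    return out

w = [1, 3, 7, 3, 1]                  # hypothetical 5-weight filter
x = [2, 5, 1, 4, 6, 0, 3, 8, 9, 7]   # hypothetical input samples
z = fir_systolic(w, x)
# the first N-1 = 4 outputs are zero start-up values, then y(k-4) emerges
```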
“Arbitrary” Cut Sets 10.22
• If you have the requirement for delays at specific places within an SFG
then cut set theory is likely to be a useful tool:
[Figure: an “arbitrary” cut-set placing z^-1 delays at chosen edges inside the SFG, with the I/O-test-line marked]
• An I/O-test-line from input to output indicates that the output is now delayed
by one sample in this retimed SFG.
Notes:
A cut must separate an SFG into two parts. Therefore we could, from this definition, set up a closed cut-set of
the following type:
[Figure: a closed cut-set drawn around part of the SFG, with an inbound edge crossing it]
An I/O-test-line from input to output indicates that the output has not been delayed when compared to the original
SFG.
Systolic Arrays 10.23
• Systolic arrays were widely researched and designed in the 1980s and
early 1990s.
• However, technology was such that very few were actually built.
• A general definition of a systolic array is (Kung, VLSI Array Processors, 1987, Prentice Hall):
• Synchrony: The data are rhythmically computed (timed by a global clock) and
passed through the network;
• Modularity and regularity: The array consists of modular processing units with
homogenous interconnections; the array may be extended indefinitely;
• Pipelinability: The array is pipelinable - i.e. time delay scaling by α will allow α
data sets to be processed simultaneously by the array.
So how many of these arrays were actually built? Not many - in fact hardly any at all! The research was primarily
based on mapping algorithms, general procedures for design and so on. The simple reason they were not built
was that the technology was not available. Wafer scale integration was attempted by a few labs, however the
results were not stunning. So systolic arrays and their simple massive parallelism sat on the shelf waiting for
the technology.
And it arrived - FPGAs. Of course FPGAs have been with us since the 80s also, however it is only in the last few
years that fast multipliers and other arithmetic capability have been available on larger FPGA devices. Although
FPGAs are not specifically designed for systolic array implementation, using the principles of cut set, retiming, delay
scaling etc. to produce an SFG with the attributes of a systolic array is indeed a desirable thing to do. The
attributes of a systolic design allow short critical paths to be designed, and lead to very regular arrays.
FIR Filter Systolic Array 10.24
• From the retimed FIR SFG of Slide 10.21 we can redraw this as:
x(k)
z-1 z-1 z-1 z-1 z-1 z-1 z-1 z-1
• We now have a “systolic array”. Our cell is a multiplier and adder and
all cells are connected through clocked registers:
z-1 z-1
As presented earlier we could perform delay scaling on this systolic SFG and use it to share multichannels of
data.
The advantage of the systolic SFG over the transpose FIR is that there is no broadcast line. The cost of the
systolic FIR is slightly higher than the transpose FIR, given the cost of the extra delays/registers.
The systolic FIR can be extended indefinitely and the critical path delay is still that of one cell. Hence the
maximum clocking rate does not reduce just because of a longer length filter. Of course, eventually one will
reach a length where clock skew problems may occur (because the wires transmitting the clock have a time
skew due to their long (relatively speaking!) length on the chip). It was this problem of clock skew that led to the
companion idea to systolic arrays called wavefront arrays. In this case all cells communicate by an
asynchronous handshake. Probably not a strategy that will be adopted for FPGAs, but if of interest you can read
more in the book by S.Y. Kung, VLSI Array Processors, Prentice Hall, 1987.
More retiming on FIR Systolic Array 10.25
• With some more cut sets we can further reduce the critical path by
introducing a delay at the multiplier output before the adder:
[Figure: cut-sets drawn through the systolic array, introducing a register m(k) between each multiplier and adder]
The critical path of the array has therefore been further reduced by the path time of the adder. The critical path could
now be identified as τ_mult and hence the maximum clocking rate is:

f_max = 1 / τ_mult

Compare this to the maximum clocking rate of the systolic array of Slide 10.21, f_max = 1 / ( τ_mult + τ_add ) .
This increase is of course minimal, but is nonetheless an increase should your design require it.
FIR Filter with Adder Tree 10.26
• Consider the FIR using an adder tree to produce the sum of products
• The critical path is τ mult + 3τ add . Compare to the standard FIR which
has a critical path of τ mult + 7τ add for an 8 weight filter.
Generally, with a linear SFG we can appropriately move adders, delays and multipliers according to some simple
rules.
For example, we can illustrate simple linear principles with the SFGs below:
[Figure: two equivalent SFG realisations of y(n) = ax(n) + bz(n), moving the multipliers across the adder]
For SFGs, if we assume interconnects are delayless then we can move/slide elements around and not change
the functionality of the SFG.
The strategy of manipulating SFGs instead of equations when doing DSP related mathematics is sometimes called
algorithmic engineering. J.G. McWhirter published a number of papers in this area.
Pipelining the Adder Tree 10.27
• We can pipeline the adder tree by using cut set retiming to yield:
We could further reduce the latency to τ_mult by applying a further cut set below the multipliers to yield:
Symmetric FIR Filters 10.28
[Figure: a 6 weight symmetric FIR with impulse response 1, 3, 7, 7, 3, 1]
The desirable property of linear phase is particularly important in applications where the phase of a signal
carries important information. To illustrate the linear phase response, consider inputting a cosine wave of
frequency f , sampled at f_s samples per second (i.e. cos 2πfk/f_s ) to a symmetric impulse response FIR filter
with an even number of weights N (i.e. w_n = w_(N-1-n) for n = 0, 1, …, N/2 - 1 ). For notational convenience let
ω = 2πf/f_s :
[Figure: the symmetric FIR implemented with pre-adders: paired samples such as x(k-4) and x(k-5) are added before multiplication by the half-set of weights 1, 3, 7]
• For an N weight filter, the number of multipliers is now N/2 (for even N ,
and N/2 + 1 for odd N).
It is often stated that the transpose filter cannot be implemented in a symmetric form. This can be easily
appreciated by attempting to cut set the above SFG.
[Figure: cut sets applied to the symmetric FIR, input and outputs marked]
Therefore a retimed SFG allows the adder line to be pipelined; the data line is then a mix of a double delay
line and a broadcast line after the mid-weight.
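The halved multiplier count can be illustrated with a sketch (Python; the input data are hypothetical, and the half-weights 1, 3, 7 follow the slide's 1 3 7 7 3 1 example): paired taps are pre-added before a single multiplication.

```python
def fir_direct(w, x):
    """Reference direct-form FIR, zero initial state."""
    return [sum(w[n] * (x[k - n] if k - n >= 0 else 0)
                for n in range(len(w))) for k in range(len(x))]

def fir_symmetric(w_half, x):
    """Even-length symmetric FIR (w[n] == w[N-1-n]): pre-add the paired
    samples, then multiply once, halving the multiplier count to N/2."""
    N = 2 * len(w_half)
    def tap(k):
        return x[k] if 0 <= k < len(x) else 0
    return [sum(wn * (tap(k - n) + tap(k - (N - 1 - n)))
                for n, wn in enumerate(w_half)) for k in range(len(x))]

w_half = [1, 3, 7]              # half of the 1 3 7 7 3 1 response from the slide
x = [2, 5, 1, 4, 6, 0, 3, 8]    # hypothetical input samples
```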
• Consider the standard pipelining of the following FIR SFG which (for
example purposes) uses 4 bit data and weights:
[Figure: pipelined FIR with 4 bit data x(k) and 4 bit weights]

f_clk = 1 / ( τ_mult + τ_add )
A simple type of 4 bit (+ve) integer multiplier has a latency of 7 cell “delays” (longest path):
[Figure: a 4 bit array multiplier built from full adder (FA) cells, inputs a3 a2 a1 a0 and b3 b2 b1 b0, product bits p7...p0, with the longest path marked through 7 cells]

Example: 1101 × 1011 (13 × 11):

     1101
  ×  1011
  ---------
     1101
    1101
   0000
  1101
  ---------
 10001111  = 143

i.e. a3a2a1a0 × b3b2b1b0 = p7p6p5p4p3p2p1p0
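The partial-product arithmetic of the array multiplier can be mimicked in a few lines (a behavioural Python sketch, not a model of the FA cell array): one shifted copy of the multiplicand is accumulated for each set bit of the multiplier.

```python
def shift_add_multiply(a, b, nbits=4):
    """Unsigned shift-and-add multiplication: accumulate one shifted copy of
    the multiplicand a for every set bit of the multiplier b."""
    p = 0
    for i in range(nbits):
        if (b >> i) & 1:
            p += a << i          # partial product a * 2^i
    return p

# The worked example from the notes: 1101 x 1011 = 10001111 (13 x 11 = 143)
result = shift_add_multiply(0b1101, 0b1011)
```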
Pipelining the FIR Multipliers 10.31
• The above cut sets could of course cut right through the multiplier:
[Figure: cut sets passing through the internals of each 4 bit multiplier]

f_clk = 1 / ( τ_mult / N + τ_add )   (where N = 4 bits for our example)
Yielding the pipelined version, where we are now adding delays at the bit level, i.e. D-type flip-flops.
The addition of the delays means that the latency path from register to register is now 1/4 of what it was previously.
IIR Filters 10.32
• The latency from input to the output of a simple 5 weight IIR is:
[Figure: 5 weight all-pole IIR SFG with input x(k), output y(k), delays z^-1 and feedback coefficients b1...b5]
• If we tried to cut-set the above then because of the two data transfer
paths in opposite directions, we would end up with time-advance
requirements, which are, of course, non-causal.
(Note that we are using an all-pole IIR, i.e. no feedforward coefficients, to simplify the figures.)
[Figure: the IIR SFG redrawn, delays z^-1 and feedback coefficients b1...b5]
• Then flip the input x ( k ) and input-adder to the right hand side:
[Figure: the flipped SFG: output y(k) on the left, input x(k) and coefficients b1...b5 on the right]
• Try any variety of cut sets on the above SFG; no matter what you try,
it is impossible to “systolise” it in the same style as done for the FIR in
Slide 10.24.
[Figure: a single feedback loop with multiplier b1, delay z^-1, input x(k) and output y(k)]
No matter what cut-sets are introduced it is impossible to get a delay after the multiplier.
So here is a converse way of posing this problem. Assume that you are “given” the IP blocks shown below,
where (quite reasonably) the result of the add or multiply is clocked out into a register:
[Figure: IP blocks: an adder and a multiplier, each with a registered (z^-1) output]
Now using these IP blocks it is impossible (!) to build the precise integrator SFG shown above. See the notes
of Slide 10.35 for how this could be managed by delay scaling (which of course leads to a slowing down of
the input data rate).
“Systolisation” of IIR Filter SFG 10.35
However, as with many SFG architectures, we are keen to ensure that pipelining can take place and that we
minimise ripple through adder lines.
The above SFG was delay scaled by two and therefore the array can process two channels or two independent
data streams.
If we revisit the example from the notes of Slide 10.34, we now use the IP blocks given to us; however, the only
option is to slow down the IIR filter by a factor of 2, by delay scaling:
[Figure: steps: delay scale by 2 → separate the two delays and cut set → delay transfer → use the available IP blocks]
Pipelining the IIR Multipliers? 10.36
[Figure: attempting to pipeline the IIR multipliers b1...b5: applying a cut set (×4) leaves a non-causal z^+4 advance in the feedback loop]
Hence, unlike for the FIR filter, there is no possible speedup by pipelining the IIR filter multipliers. It is simply
impossible!
Adaptive Filter SFG 10.37
[Figure: adaptive FIR filter SFGs: input x(k), desired signal d(k), output y(k) and error e(k), drawn with both standard and transpose FIR stages]
• Although this filter may look OK (and may even “adapt” to some
degree!) it is NOT a standard adaptive system as the output y ( k ) is no
longer a function of the most recently updated weights.
This observation that the transpose FIR cannot be used to exactly implement a standard adaptive LMS type
algorithm has quite often been “missed” by chip designers. There were two notable chips released in the 1980s
(one being the A100 from INMOS - still sourced and used in many military systems) which were parallel FIR
filters, but which both used the transpose FIR and failed to recognize the problems this would incur for adaptive
FIR systems.
So if designing adaptive FIRs from first principles for FPGAs......be careful to be sure you know what your
architecture actually does!
LMS Hardware Implementation 10.39
In full, the weight update for the 4 weight example is:

[ w0 ]     [ w0 ]               [ x(k)   ]
[ w1 ]  =  [ w1 ]     + 2µe(k)  [ x(k-1) ]
[ w2 ]     [ w2 ]               [ x(k-2) ]
[ w3 ]k    [ w3 ]k-1            [ x(k-3) ]k

or in vector form:

w(k) = w(k-1) + 2µe(k)x(k)

where

w(k) = [ w0, w1, w2, w3 ]k^T and x(k) = [ x(k), x(k-1), x(k-2), x(k-3) ]^T
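One iteration of this update can be sketched as follows (Python; the function name and example values are illustrative). Note that the error e(k) is formed from the filter output first, and every weight is then updated with that same e(k) before the next sample arrives:

```python
def lms_step(w, x_vec, d, mu):
    """One serial-LMS iteration: filter, form the error, then update every
    weight with that SAME e(k) before the next sample arrives."""
    y = sum(wi * xi for wi, xi in zip(w, x_vec))   # filter output w^T x
    e = d - y                                      # error against desired d(k)
    w_new = [wi + 2 * mu * e * xi for wi, xi in zip(w, x_vec)]
    return w_new, y, e
```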
Serial LMS (SLMS) Architecture 10.40
[Figure: serial LMS (SLMS) architecture: FIR stage with z^-1 delays, error e(k) formed from d(k), and a standard 2µ weight update section fed by x(k)]
This structure is identical to the one in Slide 10.39; we have simply changed the direction of some data paths.
The following slides will show why there is a need for pipelining the SLMS structure and how to achieve it.
LMS: Hardware Implementation Issues 10.41
• For example, the SLMS structure shown before contains an FIR filter
stage. As seen before, a transpose implementation of the FIR filter is
better suited to FPGA implementation.
• All these modifications may modify the algorithm and therefore can
have a serious impact on its performance.
[Figure: the canonical FIR (weights w0, w1, w2) and its retimed transpose form (weights reversed) side by side]
This structure constitutes a retimed version of the canonical FIR structure. Both implementations present
exactly the same behaviour. However, in an adaptive filter setup we cannot retime only the FIR stage without
considering the rest of the structure. Doing so results in an architecture which does not implement the LMS
algorithm. A different algorithm is now in place, and even though it may work and meet the designer’s
requirements great care has to be taken when considering the rules (stability, convergence speed, etc.) which
apply to the SLMS. These rules may not hold for the new algorithm.
SLMS: Retiming I 10.42
[Figure: SLMS with a cut-set applied: the canonical 2µ weight update acquires z^+1 advances on the edges crossing the cut]
A decision is then made on the use of advance and/or delay of the various edges depending on their inbound
or outbound directions.
[Figure: the resulting SFG: a weight update section containing z^+1 elements, i.e. a non-causal weight update that cannot be built]
A different option is to use the original weights update structure of the SLMS algorithm but with a transpose FIR
setup. As mentioned earlier, this will modify the algorithm behaviour (it is not the SLMS anymore) but will ease
implementation. This possibility is considered in subsequent slides.
Non-Canonical LMS (NCLMS) 10.44
• The use of a transpose FIR structure with the weight update of the
SLMS algorithm results in the NCLMS, which is a different algorithm
with different behaviour and performance.
[Figure: NCLMS: transpose filtering stage with weights w0(k-1)...w3(k-1), output y(k), error e(k) formed from d(k), and the standard 2µ weight update]
We can note the actual weight update is the same as for the standard LMS:
[ w0 ]     [ w0 ]               [ x(k)   ]
[ w1 ]  =  [ w1 ]     + 2µe(k)  [ x(k-1) ]  , i.e. w(k) = w(k-1) + 2µe(k)x(k)
[ w2 ]     [ w2 ]               [ x(k-2) ]
[ w3 ]k    [ w3 ]k-1            [ x(k-3) ]k
Observe below the different adaption speed for two adaptive filters with the same parameters. One implements
the SLMS while the other runs under the NCLMS.
[Figure: SystemView simulation: MSE (dB) against time in seconds (0 to 4 s) for two adaptive filters with the same parameters; the NCLMS settles around -100 to -150 dB while the SLMS continues down towards -250 to -300 dB]
Non-Canonical LMS (NCLMS) 10.45
• We can try to convert the NCLMS to include a standard FIR and then
note the different update equations that result for the weights.
[Figure: the NCLMS redrawn with a standard FIR stage: the weight update section is now non-canonical, with extra z^-1 delays compared to the SLMS update]
[Figure: block diagram: x(k) into FILTER (weights w(k)) producing y(k); a WEIGHTS UPDATE block driven by e(k), d(k) and 2µ]
• The SLMS clock is limited by the time it takes to calculate the filter output
and to update the weights.
Assume it takes T_F seconds to calculate the filter output, and T_WU seconds to perform all the filter weight
updates. In total, the next sample x(k+1) will not be processed until at least T_F + T_WU seconds have passed
since x(k) was received. Clearly, that means there is a lower bound of T_F + T_WU seconds on the sample period
in order to be able to use this structure, or in terms of sampling frequency:

f_s ≤ 1 / ( T_F + T_WU )
The speed of execution of the system is limited by the structure used, which does not allow more than one
sample to be processed at the same time. A solution to this is provided by pipelining, which allows more than
one sample to be processed by modifying the original structure. Some pipelining techniques do not modify the
input/output behaviour of the original structure, i.e. they have exactly the same functionality. However, other
pipelining techniques do modify the initial input/output behaviour and the system designer has to be careful
when using these structures and verify their validity for the considered application.
Conclusions 10.47
• In this section we have reviewed the simple signal flow graph (SFG)
tool for representation of linear filtering type algorithms. In particular: