Skript HWSW Codesign

Contents

1 Introduction
...
7 On-Chip Communication
7.1 Bit-level communication
7.2 Perfect Shuffle
7.3 De Bruijn Networks & Shuffle-Exchange
8 Measures
8.1 Basics
8.2 Measure of Complexity
8.3 Wordlength Analysis
8.4 M-Step Analysis
8.5 ATE Measure
...
10 Hazards
10.1 Control Hazards
10.2 Data Hazards
10.3 Structural Hazards
11 Vector Processing
11.1 Vector & Data Flow Processors
11.2 SIMD: Single Instruction Multiple Data
11.2.1 Example FIR & Zurich Zip
11.2.2 Transposed FIR, partial
11.2.3 Generalization
Introduction
2.1 Basic Principles

For the purpose of pipelining a system we can make use of a graphical representation (flow graph). Flow graphs basically consist of two types of elements: operators (i.e. computational or logic functions) and delay elements (i.e. registers, flip-flops, memories, etc.), as figure 1 shows. An operator is time-invariant if its transfer function does not depend on time. In other words, if the inputs to the operator are time-shifted, then the outputs of the operator stay the same and are time-shifted by the same amount as the inputs. For example, an operator with inputs a_k, b_k and output y_k realizing the function

y_k = a_k o b_k    (1)

Registers can either be moved from the outputs of an operator to the inputs...
2.2
Cut-Sets
Assume:
- a time-clocked signal-flow graph (SFG)
- time-invariant node operators
- discrete time delays (usually a delay is given as a multiple of a clock cycle)

Def: A cut of a graph is a partitioning of the graph nodes into two disjoint subsets.
Def: The cut-set of the cut is the set of graph edges whose endpoints belong to different graph partitions.
Def: Cut-set rule:
1. Construct a closed cut ...
2. Move discrete delays from branches entering the cut-set and vice versa by inserting delays (registers) on the one side and corresponding negative delays (speed-ups) on the other side.
A more general example is in Fig. 5. Here the cut-set is first constructed around the node. Two realizations of retiming schemes are depicted there.
2.3 Critical Path

Def: The Critical Path (CP) is the longest delay/latency path among all paths in the SFG, determining the maximum system clock frequency.
The CP poses a limitation on the execution time of the SFG. In the case of discrete-time (clocked) systems, the CP determines the maximum clock frequency of the system, as the delay of the CP is inversely proportional to the clock frequency. Thus, in many cases, it may be of interest to speed up the system. In general, a speedup can be achieved by spending additional (computational) resources or by increasing the clock frequency. The first approach represents spatial parallelism while the second one is associated with temporal parallelism (pipelining). Note that in the rest of this work the term parallel processing is associated with spatial parallelism while the term pipelining is associated with temporal parallelism. In the following sections, pipelining techniques are addressed first.
Pipelining reduces the CP by inserting additional delay elements (registers) along the CP. However, the transfer function of the system after pipelining shall be preserved. Basically, two methods can be used for pipelining:
Retiming - a technique for structural relocation of delay elements within the SFG such that the SFG function is preserved.
Loop transformation - the graph structure is changed without affecting the transfer function, resulting in an increased number of delay elements and operators. Newly created delay elements can be moved into the CP by retiming.
In the example in Fig. 6, all nodes have the same delay T. Thus, the CP passes 4 nodes, resulting in a total delay of 4T. In order to reduce the CP, pipelining using the cut-set rule is applied along the CP to shorten the CP by 1/2. Two cut-sets have been used for this purpose. Figure 7 shows that these two cut-sets can even be merged into a single cut-set (dashed green cut-set). By inserting two additional cut-sets we can pipeline the circuit even further until full pipelining is achieved. Full pipelining is associated with a graph structure in which any two nodes are isolated by a delay element. In Fig. 7, three cut-sets are constructed.
Figure 7: Graph retiming using cut-set pipelining. Three cut-sets have been constructed in order to fully pipeline the graph.
A more complex example is given in figure 10. The length of the critical path in this example is 5 operations. Thus, 4 pipelining stages should be sufficient for fully pipelining the circuit. The critical path is cut once between every two operations, inserting registers on the input edges of the cut-set. The output edges of the cut-set, where the speed-ups are inserted, are also the output edges of the circuit. This allows an easy removal of the speed-ups later on. The latency of the fully pipelined circuit is 4 clock cycles: y_{k-4}.
Figure 10: Pipelining along the critical path
2.4 Pure Pipelining

The example of Fig. 6 is shown again in Fig. 11. Only registers are placed at crossings of cuts and edges, resulting in the same structure as that in Fig. 7.
Figure 11: The principle of pure pipelining.
The idea behind this is that the cuts cross input edges or output edges only. If this is the case, we can insert a register on every cut edge and close the cut via all inputs or outputs of the circuit, respectively, where the speed-ups are inserted. The speed-ups at the circuit inputs and outputs can simply be dropped after pipelining, and only the internally inserted registers remain.
Figure 12 gives a more complex example for pure pipelining with a critical
path of length 5.
2.5

y_k = sum_{i=0}^{L-1} a_i * x_{k-i}    (2)
After pipelining, the CP contains only a single adder or constant multiplier, but the latency of the circuit has increased to 3 clock cycles.
Figure 13: Example for application of pure pipelining on FIR filter
This is still not the end of the story. The question is: can we do the pipelining in a more efficient way, i.e. spending fewer (than the 11 used) registers? This is possible by exploiting the associativity property of the (addition) operator. This is shown in figure 14. By slightly rearranging the FIR structure according to the associativity rule from (((a_0*x_k + a_1*x_{k-1}) + a_2*x_{k-2}) + a_3*x_{k-3}) to (a_0*x_k + (a_1*x_{k-1} + (a_2*x_{k-2} + a_3*x_{k-3}))), we are now able to define cut-sets that make use of (i.e. reuse) the already existing FIR shift registers for the pipelining.
Figure 14: Exploiting associativity of operator in FIR filter pipelining.
For the pipelining of the rearranged FIR structure we thus only need 4 (L = 4) additional registers (instead of 2L before) for a total of 7 registers. After redrawing the graph a little bit we get the well-known transposed form of the FIR filter, shown in figure 15, which has a latency of only 1 clock cycle.
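That the direct and transposed FIR forms compute the same output sequence is easy to verify numerically. A minimal sketch with assumed coefficients (a_0..a_3 below are hypothetical, not taken from the figures):

```python
# Direct-form vs. transposed-form FIR, L = 4 taps.
a = [3, -1, 2, 5]                # hypothetical coefficients a0..a3

def fir_direct(xs):
    taps = [0, 0, 0, 0]          # shift register holding x_k .. x_{k-3}
    out = []
    for x in xs:
        taps = [x] + taps[:3]
        out.append(sum(ai * xi for ai, xi in zip(a, taps)))
    return out

def fir_transposed(xs):
    regs = [0, 0, 0]             # registers between the output adders
    out = []
    for x in xs:
        out.append(a[0] * x + regs[0])
        # clock edge: each register accumulates the next partial sum
        regs = [a[1] * x + regs[1], a[2] * x + regs[2], a[3] * x]
    return out

xs = [1, 4, -2, 0, 7, 3]
assert fir_direct(xs) == fir_transposed(xs)
assert fir_direct([1, 0, 0, 0]) == a   # impulse response = coefficients
```

In the transposed form every adder is separated from the next by a register, which is exactly the fully pipelined structure discussed above.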
If we don't shift the registers from the upper branch to the lower branch (in between the adders), the circuit is not fully pipelined, since we still have 3 adders in the critical path. This form of the FIR is called the direct form and is shown in figure 16.
Figure 16: Direct form FIR filter.
Thus, we created pipelined logic by reusing the delay registers of the shift register.
Finally, take a look at figure 17, where we compare the critical paths of both FIR filter versions. We clearly find that the transposed form FIR filter is better suited for hardware design. However, as we will see later on, it is actually a bad idea to use it for a software realization.
Figure 17: Comparison of the critical path of the direct and transposed form FIR filter
2.6

Before pipelining, the critical path contains 4 operators and 1 register:

f_max = 1/T_c.p. = 1/(4*T_op + T_reg),    D = 4*T_op + T_reg

After pipelining, the system looks like the bottom of figure 18 and the critical path is now

f_max = 1/T_c.p. = 1/(T_op + T_reg),    D = 4*(T_op + T_reg)
3.1 Basic Loops

For any pure feedforward (non-recursive) system, which contains no feedback loops, we can use the cut-set rule for fully pipelining the circuit. This is always possible and is illustrated for an example on the left side of figure 19. On the right side of the figure we consider the same network but with the direction of a single edge changed. Thus, we have introduced a feedback loop into the network. A purely combinational feedback loop is not implementable, because it has an infinite (critical) path length, i.e. the signal would iterate in this cycle infinitely. For this reason, we must also insert a register as a boundary for the loop.
Figure 19: Comparison of pipelining of a system without and with a loop.
We can see from the right side of figure 19 that the critical path consists of 4 operations and contains the feedback loop. By applying 2 cut-sets in the first step, the critical path can be reduced to a length of 3 operations, but the loop is still completely contained. In the second and third step we try to pipeline the loop by introducing additional registers. For this purpose, cut-sets are chosen that contain only a single node of the loop. As we can see, the application of the cut-set rule now only moves the registers within the loop, but is not able to introduce new registers. Thus, the critical path cannot be shortened any more. In the end, we arrive at the same register arrangement that we started with. A more generic view of this problem is shown in figure 20.
No matter around which node (or set of nodes) we draw the cut-set, the number
of registers (i.e. the overall delay) always stays the same. Nothing can be done
to insert new registers!
It follows: in any closed circuit, the directed sum of delays in a feedback loop is a constant, i.e. sum of Regs = const.!
Figure 21 illustrates that the overall delay didn't change when counting the speed-up as a negative delay.
Figure 21: Pipelining doesn't affect the overall delay in a feedback loop
If we consider arbitrary loops (i.e. also feedforward loops), we find that the rule
is still applicable for this more general case, i.e. the number of registers is also
constant for arbitrary loops. This is shown in figure 22 where a feedforward
loop is pipelined via cut-set rule. It can be seen that the overall cycle delay is
still the same after pipelining (registers on edges that are directed against the
loop direction have to be counted as negative delays). However, this behaviour ...
3.2

y_k = sum_{i=0}^{k} x_i = x_k + y_{k-1}

A more general form, given by the following equation, yields a 1st-order IIR filter, shown in figure 24:

y_k = x_k + a * y_{k-1}
The critical path of this IIR filter consists of 1 adder and 1 multiplier and cannot be shortened by pipelining via cut-sets, as we have seen before. What can we do? The formula for y_k is based on y_{k-1}, but we know what y_{k-1} is by applying the recursion formula. Thus, let's unroll the loop once ...
y_k = x_k + a * y_{k-1}

Substituting y_{k-1} = x_{k-1} + a * y_{k-2} yields

y_k = x_k + a * x_{k-1} + a^2 * y_{k-2}
We find that this network is identical to the example from figure 19, but with 1 additional register in the feedback loop. Still, we will not be able to fully pipeline this system by application of the cut-set rule, since we would need 3 delays in that loop. By making use of the associativity of the + operator, we come to the system shown in figure 26.
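The 2-step look-ahead derived above can be checked numerically: the transformed recursion must produce exactly the same output sequence as the original one. A minimal sketch with assumed input data (a and xs are made up; dyadic values keep the floating-point comparison exact):

```python
# Original IIR: y_k = x_k + a*y_{k-1}
# 2-step look-ahead: y_k = x_k + a*x_{k-1} + a^2*y_{k-2}
a = 0.5
xs = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0]

def iir_1step(xs):
    y, out = 0.0, []
    for x in xs:
        y = x + a * y
        out.append(y)
    return out

def iir_2step(xs):
    y1 = y2 = 0.0             # y_{k-1}, y_{k-2} (zero initial state)
    x1 = 0.0                  # x_{k-1}
    out = []
    for x in xs:
        y = x + a * x1 + a * a * y2
        out.append(y)
        y2, y1, x1 = y1, y, x
    return out

assert iir_1step(xs) == iir_2step(xs)
```

The transformed loop trades one extra multiplier and adder in the feedforward part for a second register in the feedback loop, which is what makes the pipelining possible.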
The system consists of a 2-step linear look-ahead (feedforward) part and a 2-step feedback part. The feedforward part can easily be pipelined via the cut-set rule. The feedback part now contains 2 registers, but still only 2 operations (1 addition and 1 multiplication), in contrast to figure 25. Thus, the overall delay of the loop is sufficient for full pipelining. We only have to rearrange the position of the registers a little bit. For this purpose the cut-set rule can be applied again. As we have already seen before, the cut-set rule cannot change the number of delays in the loop, but it is able to rearrange the register positions within loops.
Doing the loop transformation once again, we come to a solution with a 3-step look-ahead and a 3-step feedback part:

y_k = x_k + a * x_{k-1} + a^2 * y_{k-2}

Substituting y_{k-2} = x_{k-2} + a * y_{k-3} yields

y_k = x_k + a * x_{k-1} + a^2 * x_{k-2} + a^3 * y_{k-3}
The 3-step solution now has 3 registers in the feedback loop. As we have seen before, the feedforward part can easily be replaced by a transposed form 3-step look-ahead, as shown in figure 28. Thus, the shift register can be reused for pipelining the system. The additional register in the feedback loop can, for example, be used for internal pipelining of the constant multiplier (we now have 2 instead of 1 clock cycle for the computation of the product).
Figure 28: IIR filter with transposed form 3-step look-ahead
3.3 Logarithmic Look-Ahead

We now go 1 step further, taking the example from section 3.2 and considering the 4-step case:

y_k = x_k + a * x_{k-1} + a^2 * x_{k-2} + a^3 * x_{k-3} + a^4 * y_{k-4}

Figure 29: IIR scheme after loop transformation with 4-step transposed form FIR linear look-ahead
The feedforward part of y_{k-2} is identical to that of y_k except that all x are additionally delayed by 2 clock cycles. Thus, using the notation (.)_{-2} for "delay all entries by 2", we come to:

y_{k-2} = (x_k + a * x_{k-1})_{-2} + a^2 * y_{k-4}

and by substituting

y_k = (x_k + a * x_{k-1}) + a^2 * (x_k + a * x_{k-1})_{-2} + a^4 * y_{k-4}

We recognize from the equation that the computational result of the feedforward term (x_k + a * x_{k-1}) is reused, which directly leads us to the logarithmic form of the 4-step look-ahead:
Figure 30: SFG of IIR with 4-step logarithmic look-ahead
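The reuse of the feedforward term can again be checked against the original recursion. A minimal sketch (a and xs are assumed example values; u_k below names the shared term x_k + a*x_{k-1}):

```python
# 4-step logarithmic look-ahead:
#   y_k = u_k + a^2 * u_{k-2} + a^4 * y_{k-4},  u_k = x_k + a*x_{k-1}
# compared against the original IIR y_k = x_k + a*y_{k-1}.
a = 0.5
xs = [1.0, -2.0, 4.0, 0.5, 3.0, 1.0, -1.0, 2.0]

def iir(xs):
    y, out = 0.0, []
    for x in xs:
        y = x + a * y
        out.append(y)
    return out

def iir_log4(xs):
    hist_x = [0.0] * 4            # x_{k-1} .. x_{k-4}
    hist_y = [0.0] * 4            # y_{k-1} .. y_{k-4}
    out = []
    for x in xs:
        u = x + a * hist_x[0]                 # u_k (computed once)
        u2 = hist_x[1] + a * hist_x[2]        # u_{k-2} (delayed copy)
        y = u + a**2 * u2 + a**4 * hist_y[3]
        out.append(y)
        hist_x = [x] + hist_x[:3]
        hist_y = [y] + hist_y[:3]
    return out

assert iir(xs) == iir_log4(xs)
```

The logarithmic form needs only 2 multipliers (a^2 and a^4) plus the shared u_k stage instead of the 4 multipliers of the linear look-ahead.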
3.4 Z-Transform

As we have seen in section 3.2, the loop transformation can sometimes be cumbersome. The z-transform provides a means to simplify this procedure. Let's pick up the small IIR example that we used in the previous sections again:
y_k = x_k + a * y_{k-1}   =>   Y(z) = X(z) + a * z^-1 * Y(z)   =>   Y(z)/X(z) = 1 / (1 - a * z^-1)

Multiplying numerator and denominator by (1 + a * z^-1) gives

Y(z)/X(z) = (1 + a * z^-1) / (1 - a^2 * z^-2)    (2-step recursion)

Finally, we increased the order of the recursion part by 1 and thus introduced an additional delay in the feedback loop. This is nothing else but the 2-step loop transformation for the linear look-ahead that was shown in section 3.2. The results are directly related: 1 + a * z^-1 corresponds to x_k + a * x_{k-1} (look-ahead), and 1 - a^2 * z^-2 corresponds to a^2 * y_{k-2} (feedback). We can even simply extend the 2-step linear look-ahead to a 4-step logarithmic look-ahead by multiplication with the neutral 1: (1 + a^2 * z^-2) / (1 + a^2 * z^-2):
Y(z) = [(1 + a * z^-1) * (1 + a^2 * z^-2)] / (1 - a^4 * z^-4) * X(z)                  (logarithmic look-ahead)
     = (1 + a * z^-1 + a^2 * z^-2 + a^3 * z^-3) / (1 - a^4 * z^-4) * X(z)             (linear look-ahead)
     = [(1 + a * z^-1) * (1 + a^2 * z^-2) * (1 + a^4 * z^-4)] / (1 - a^8 * z^-8) * X(z)  (logarithmic look-ahead)
The z-transform also easily allows us to consider mixed forms of linear and logarithmic look-ahead, e.g., for M = 6:

Y(z) = [(1 + a * z^-1 + a^2 * z^-2) * (1 + a^3 * z^-3)] / (1 - a^6 * z^-6) * X(z)

where the first factor is the linear part and the second factor the logarithmic part.
3.5 Vectors
Can all transformations be applied again? What operations did we actually use for the loop transformation? What problems could occur?
1. Matrix multiplication is not commutative: a_k * a_{k-1} != a_{k-1} * a_k. That's OK, we didn't use this!
2. Matrix/vector multiplication is associative: (a_k * a_{k-1}) * a_{k-2} = a_k * (a_{k-1} * a_{k-2})
3. Addition of matrices/vectors is associative: (a_k + a_{k-1}) + a_{k-2} = a_k + (a_{k-1} + a_{k-2})
4. We find a zero element (for multiplication) / neutral element for addition: the all-zero matrix.
Result: We can conclude that exactly the same results that we found for scalar values also hold here!
3.6
4
4.1
Take a look at figure 34. There, a small pure feed-forward example is shown that contains no feedback loops but multiplications with time-variant coefficients (a_k, b_k). We see that we can apply the cut-set rule here as usual. Only take care that if we draw the cut-set around the coefficient multipliers, this also affects the time index of the coefficients, as depicted in the figure.
Figure 34: Small pure feed-forward system example w/o loops
After this short introductory example, let's now come to time-variant feedback systems. Therefore, we modify the IIR example a little bit, making the factor a time-variant, i.e. a -> a_k. For pipelining this loop we can simply apply the loop transformation as before (a 2-step recursion should be sufficient, since we have only 2 operations in the loop). This is shown in the following equation:

y_k = x_k + a_k * y_{k-1}    (3)
If we only apply the associativity rule for the + operator (illustrated by red brackets), we come to the circuit shown in figure 35 (left). We now find 2 delays in the loop, but the number of operators has also increased to 3 (2 mult. + 1 add). Consequently, 2 registers are still not sufficient to fully pipeline the loop. On the other hand, if we also make use of the associativity of the * operator (illustrated by green brackets) to precompute the product of the a_k's outside of the loop, we come to the circuit in figure 35 (right). In this case only 2 operations are inside the loop. Since we also have 2 pipeline registers available (2-step recursion), the loop can easily be fully pipelined by applying the cut-set rule. This small example demonstrates that the loop transformation for time-variant feedback loops is more tricky and depends on the order in which the operations are executed and the mathematical rules that can be applied to the operators.
Figure 35: SFG of system (3)
For the time-variant IIR example we made use of the following rules:
- associative law: +, *
- distributive law: * over +
4.2 Dot Operator

We define the Dot operator on tuples as

(a, b) . (c, d) = (a*c, b + a*d)

Checking associativity:

[(a, b) . (c, d)] . (e, f) = (a*c, b + a*d) . (e, f) = (a*c*e, b + a*d + a*c*f)
(a, b) . [(c, d) . (e, f)] = (a, b) . (c*e, d + c*f) = (a*c*e, b + a*(d + c*f)) = (a*c*e, b + a*d + a*c*f)

It follows: the Dot operator is associative if + is associative, * is associative, and * is distributive over +.
This means, no matter in which order we apply the . operator, the result is the same! How to use this: we apply the definition of the Dot operator to the time-variant recursion loop and show how easily we can now derive a 2-step loop recursion:

(0, y_k) = (a_k, x_k) . (0, y_{k-1}) = (a_k, x_k) . [(a_{k-1}, x_{k-1}) . (0, y_{k-2})] = [(a_k, x_k) . (a_{k-1}, x_{k-1})] . (0, y_{k-2})
After the first loop transformation we can simply compute the Dot product of (a_k, x_k) and the delayed (a_{k-1}, x_{k-1}) in the feedforward part by exploiting the associativity rule for the Dot operator. Afterwards, (0, y_{k-2}) has to be Dot-multiplied in the feedback part. We can make the recursion equation for this time-variant rule clearer by introducing the following short-form notation for the tuples: X_k = (a_k, x_k), Y_k = (0, y_k). Using these substitutions, the recursion equation with the Dot operator becomes very simple:

Y_k = X_k . Y_{k-1}
Y_k = [X_k . X_{k-1} . ... . X_{k-L+1}] . Y_{k-L}

The Dot operator is not commutative, so the computation must be done from left to right.
The equation also shows the generalization for an L-step recursion. In this case we only have to compute the Dot product of [X_k, ..., X_{k-L+1}] in the feedforward part and Dot-multiply the result with Y_{k-L} in the feedback part. Be aware that the Dot operation must be applied in the correct order (from left to right), since it is not commutative! E.g. X_k . X_{k-1} is OK, while the result of X_{k-1} . X_k would be wrong. Some examples:

Y_k = X_k . Y_{k-1}
    = (X_k . X_{k-1}) . Y_{k-2}                      (M = 2)
    = (X_k . X_{k-1} . X_{k-2} . X_{k-3}) . Y_{k-4}  (M = 4, linear look-ahead)
Figure 36 shows the simple and regular structure that results by applying the
Dot operator onto the example of a 4-step linear look-ahead. The orange arrows
show the order of the Dot operator inputs which is essential due to the missing
commutativity of the operator.
Figure 36: SFG of IIR with time variant loop and 4-step look-ahead using Dot
operator
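The tuple definition and the look-ahead identity are easy to exercise in code. A minimal sketch (the coefficient and input sequences a_k, x_k below are made-up example data):

```python
# Dot operator for the time-variant recursion y_k = x_k + a_k*y_{k-1},
# encoding X_k = (a_k, x_k):  (a, b) . (c, d) = (a*c, b + a*d).
# Associative, but NOT commutative: evaluate products left to right.
def dot(p, q):
    (a, b), (c, d) = p, q
    return (a * c, b + a * d)

ak = [2, 3, -1, 4, 2, 1, 3, -2]   # hypothetical coefficients
xk = [1, 0, 5, -2, 3, 1, 0, 4]    # hypothetical inputs

# Reference: direct evaluation of the recursion (zero initial state).
y_ref, y = [], 0
for a, x in zip(ak, xk):
    y = x + a * y
    y_ref.append(y)

# 4-step look-ahead: Y_k = (X_k . X_{k-1} . X_{k-2} . X_{k-3}) . Y_{k-4}
X = [(a, x) for a, x in zip(ak, xk)]
for k in (3, 7):
    P = dot(dot(dot(X[k], X[k - 1]), X[k - 2]), X[k - 3])
    Yk = dot(P, (0, y_ref[k - 4] if k >= 4 else 0))
    assert Yk[1] == y_ref[k]

# Order matters: X_k . X_{k-1} != X_{k-1} . X_k in general.
assert dot(X[1], X[0]) != dot(X[0], X[1])
```

The feedforward product P can be evaluated in a balanced tree (the logarithmic form of figure 37), as long as the left-to-right order of the operands is kept.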
Figure 37: SFG of IIR with time variant loop and 8-step log look-ahead using
Dot operator
(x_k + a_k * x_{k-1}) + a_k * a_{k-1} * (x_{k-2} + a_{k-2} * x_{k-3}) + [a_k * a_{k-1} * a_{k-2} * a_{k-3}] * y_{k-4}
We can conclude that by using the . operator, any feedback recursion of the form

y_k = b_k * x_k + a_k * y_{k-1}

can, with the substitution x~_k = b_k * x_k, be written as

y_k = x~_k + a_k * y_{k-1},    Y_k = (0, y_k)

and handled in the same way as y_k = x_k + y_{k-1}, the simplest binary operator recursion. Please note that a time-variant coefficient (b_k) in the feedforward part is no problem, since we can simply substitute b_k * x_k by x~_k and thus apply the . operator as usual. But: the commutative law does not apply! Finally, notice that the . operator can also be generalized to vectors & matrices.
5 Generalizations

5.1
We can further generalize the applicability of the loop transformation and the Dot operator with respect to the underlying algebraic operation. Up to now we restricted our investigations to addition and multiplication. The question is: can we apply the rules also to different operators? What mathematical rules have we used so far?

(S, +) is a semigroup (i.e. + is associative and S contains a neutral element):
(a + b) + c = a + (b + c)
0 + a = a
b + 0 = b
(S, *) is a semigroup
* is distributive over +:
a * (b + c) = a*b + a*c
Please note that (S, +) as well as (S, *) are also commutative semigroups. However, this is not true for (S, *) with respect to matrices! Consequently, we are now able to generalize our previous knowledge from this section to arbitrary operators (+) and (x) (don't confuse these with addition + and multiplication *! The symbols (+) and (x) serve as templates for arbitrary operators), if:
- (S, (+)) and (S, (x)) are semigroups
- (x) is distributive over (+)

y_k = x_k (+) (a_k (x) y_{k-1})                                         (1-step recursion)
    = x_k (+) (a_k (x) [x_{k-1} (+) (a_{k-1} (x) y_{k-2})])
    = x_k (+) (a_k (x) x_{k-1}) (+) (a_k (x) a_{k-1} (x) y_{k-2})       (distr. law, assoc. of (+) and (x))
In the following you find some examples of operator pairs that fulfill the conditions above, so that all rules concerning loop transformation and the Dot operator can also be applied for these operations:

(+)   (x)
max   min
min   max
XOR   AND
Consider a Viterbi trellis with 2 states in figure 38 (left). The path metrics alpha are therein derived from the branch metrics gamma as follows:

alpha_{0,k+1} = max(gamma_{00,k} + alpha_{0,k}, gamma_{01,k} + alpha_{1,k})
alpha_{1,k+1} = max(gamma_{10,k} + alpha_{0,k}, gamma_{11,k} + alpha_{1,k})

or as a combined max-plus matrix equation:

[alpha_0]         [gamma_00  gamma_01]     [alpha_0]
[alpha_1]_{k+1} = [gamma_10  gamma_11] (x) [alpha_1]_k
If we do this computation iteratively we can realize this by an integrator hardware that is also depicted in figure 38 (right).
Figure 38: Viterbi Trellis with 2 states as an example for the Max-Plus-Algebra
application
5.2

In the previous section, we found that we are not restricted to the operators + and *, as long as certain conditions are fulfilled. Let us first summarize these conditions once again:
- (S, max) is a semigroup
- (S, +) is a semigroup
- + is distributive over max (distributive law -> semiring)

We already know from previous examples that * (multiplication) is distributive over + (addition). We also know from the previous section that + is distributive over max, and all 3 operations are associative:

a (x) (b (+) c) = a + max(b, c) = max(a + b, a + c) = (a (x) b) (+) (a (x) c)
Consequently, we can exchange the operators for any circuit that we have already pipelined before. Figure 39 (top, left) shows the small time-variant IIR example from section 4.1. If the * and + operations are now replaced by + and max, respectively, we get an Add-Compare-Select unit (figure 39; top, right), which is a typical hardware element that is, e.g., used for Viterbi decoding. Since we already know the solution for the pipelined recursion loop from figure 35, and since we know that the necessary mathematical rules also hold for the max/+ algebra, we are able to directly present the result here: figure 39 (bottom).
Figure 39: Loop transform + Pipelining of Add-Compare-Select circuit
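The operator exchange can be verified directly: the same 2-step unrolling that worked for (+, *) must work for (max, +). A minimal sketch with made-up branch metrics (xs and aks below are hypothetical):

```python
# Add-Compare-Select recursion in max-plus algebra:
#   y_k = max(x_k, a_k + y_{k-1})
# 2-step unrolling (valid because + distributes over max):
#   y_k = max(x_k, a_k + x_{k-1}, a_k + a_{k-1} + y_{k-2})
xs = [3, -1, 4, 0, 2, 5]     # hypothetical inputs
aks = [1, 2, -3, 1, 0, 2]    # hypothetical coefficients

def acs_1step(xs, aks, y0=0):
    y, out = y0, []
    for x, a in zip(xs, aks):
        y = max(x, a + y)
        out.append(y)
    return out

def acs_2step(xs, aks, y0=0):
    y1, y2 = y0, y0          # y_{k-1}, y_{k-2}
    x1 = a1 = None           # x_{k-1}, a_{k-1}
    out = []
    for x, a in zip(xs, aks):
        if x1 is None:
            y = max(x, a + y1)   # first step: no history yet
        else:
            y = max(x, a + x1, a + a1 + y2)
        out.append(y)
        y2, y1, x1, a1 = y1, y, x, a
    return out

assert acs_1step(xs, aks) == acs_2step(xs, aks)
```

The 2-step form has two registers in the feedback loop and only the max/+ pair inside it, mirroring the pipelined solution of figure 35.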
If we now consider all 3 operators (+, * and max) together, the only remaining question is: is * distributive over max?

a * max(b, c) = max(a*b, a*c)

This is true as long as we limit ourselves to non-negative numbers, e.g., F = R>=0! We thus have 3 operators which form a distributively sorted set of operations:

(+)1 = *,  (+)2 = +,  (+)3 = max
Let us take the following example with 3 operations and a recursion loop for pipelining:

y_k = c_k (+)3 [b_k (+)2 (a_k (+)1 y_{k-1})]

Now, we want to do a 2-step loop transformation for the case with 3 operators by unrolling the equation as usual and applying the distributive laws afterwards:

y_k = c_k (+)3 {b_k (+)2 (a_k (+)1 [c_{k-1} (+)3 (b_{k-1} (+)2 (a_{k-1} (+)1 y_{k-2}))])}
    = c_k (+)3 {b_k (+)2 [(a_k (+)1 c_{k-1}) (+)3 ((a_k (+)1 b_{k-1}) (+)2 ((a_k (+)1 a_{k-1}) (+)1 y_{k-2}))]}
    = c_k (+)3 [b_k (+)2 (a_k (+)1 c_{k-1})] (+)3 {b_k (+)2 (a_k (+)1 b_{k-1}) (+)2 [(a_k (+)1 a_{k-1}) (+)1 y_{k-2}]}

The first (+)3 term and the coefficient combinations (a_k (+)1 b_{k-1}) and (a_k (+)1 a_{k-1}) are the look-ahead parts that can be precomputed in the feedforward section.
For simplicity (when dealing with time-variant loops) we should prefer the Dot operator for the loop transformation. The . operator for 3 operators in the given distributive order can be defined as follows:
(a1, a2, a3) . (b1, b2, b3) = ( a1 (+)1 b1 ,  a2 (+)2 (a1 (+)1 b2) ,  a3 (+)3 (a2 (+)2 (a1 (+)1 b3)) )    (4)

To verify that this operator is associative, abbreviate the components of a Dot product with capital letters, (A1, A2, A3) = (a1, a2, a3) . (b1, b2, b3) and (B1, B2, B3) = (b1, b2, b3) . (c1, c2, c3). Evaluating the two bracketings:

[(a1, a2, a3) . (b1, b2, b3)] . (c1, c2, c3) = ( A1 (+)1 c1 ,  A2 (+)2 (A1 (+)1 c2) ,  A3 (+)3 (A2 (+)2 (A1 (+)1 c3)) )

(a1, a2, a3) . [(b1, b2, b3) . (c1, c2, c3)] = ( a1 (+)1 B1 ,  a2 (+)2 (a1 (+)1 B2) ,  a3 (+)3 (a2 (+)2 (a1 (+)1 B3)) )

Expanding both sides with the distributive laws shows that they agree term by term:

( a1 (+)1 b1 (+)1 c1 ,
  a2 (+)2 (a1 (+)1 b2) (+)2 (a1 (+)1 b1 (+)1 c2) ,
  a3 (+)3 (a2 (+)2 (a1 (+)1 b3)) (+)3 [ (a2 (+)2 (a1 (+)1 b2)) (+)2 (a1 (+)1 b1 (+)1 c3) ] )
By comparing the results, we see that the operator defined in eq. (4) is indeed associative. For 4 or more operators we would apply an iterative proof. Finally, we examine some examples of operator tuples and their distributive order. A summary is given by table 2.
Table 2: Some examples for tuples of (more than 2) operators and their distributive order

S     (+)1   (+)2   (+)3   (+)4
R+    *      +      max    min
R+    *      +      max
R     +      max    min
Please note that the distributive order for (*, max) only holds for positive numbers, i.e., max(a*b, a*c) = a * max(b, c) is only true iff a, b, c >= 0! A special case are the operators max and min, which are even self-distributive, i.e.,

max(a, max(b, c)) = max(max(a, b), max(a, c))

Consequently, max is also distributive over min, as well as min being distributive over max. This particular property makes the curious set (max, min, max) a valid operator tuple.
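These lattice identities are easy to check exhaustively over a small sample set. A minimal sketch (the sample values are arbitrary):

```python
# Distributivity and self-distributivity of max/min over sample triples.
from itertools import product

for a, b, c in product([-2, 0, 1, 5], repeat=3):
    # max distributes over min, and min over max (distributive lattice):
    assert max(a, min(b, c)) == min(max(a, b), max(a, c))
    assert min(a, max(b, c)) == max(min(a, b), min(a, c))
    # self-distributivity of max:
    assert max(a, max(b, c)) == max(max(a, b), max(a, c))
```

Because the operands are totally ordered, these identities hold for all reals, not just the samples tested here.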
5.3 Multi-Span Algebra

The operator tuple can also be lifted to matrices, where
(+)1 acts like *,
(+)2 acts like +,
(+)3 acts like a weaker +.
For 2x2 matrices, (+)1 is applied like a matrix multiplication (with (+)2 taking the role of the sum over the products), while (+)2 and (+)3 are applied element-wise:

[a11 a12]      [b11 b12]     [ (a11 (+)1 b11) (+)2 (a12 (+)1 b21)   (a11 (+)1 b12) (+)2 (a12 (+)1 b22) ]
[a21 a22] (+)1 [b21 b22]  =  [ (a21 (+)1 b11) (+)2 (a22 (+)1 b21)   (a21 (+)1 b12) (+)2 (a22 (+)1 b22) ]

[a11 a12]      [b11 b12]     [ a11 (+)2 b11   a12 (+)2 b12 ]
[a21 a22] (+)2 [b21 b22]  =  [ a21 (+)2 b21   a22 (+)2 b22 ]

[a11 a12]      [b11 b12]     [ a11 (+)3 b11   a12 (+)3 b12 ]
[a21 a22] (+)3 [b21 b22]  =  [ a21 (+)3 b21   a22 (+)3 b22 ]
6 Block Processing

6.1 Basic Idea

The basic idea of block processing is very simple, as figure 40 shows. Take an arbitrary serial input stream and chunk it into blocks of length L:

..., x_{k-3}, x_{k-2} | x_{k-1}, x_k | x_{k+1}, x_{k+2}, ...    (blocks of L = 2)

Then perform a parallel processing on every chunk and serialize the parallel outputs afterwards. However, in general we have to distinguish two different types that can both be considered as block processing.
Figure 40: Basic Idea of Block Processing
For the second example, assume a high-speed serial link (e.g. 60 GHz @ 10 Gb/s). The sample rate is far too high for today's systems to process sequentially. Therefore, we need parallel processing. This can easily be done, as depicted in figure 42, if the processing of L subsequent samples can be done independently. In this case we can use a demultiplexer to do a serial-to-parallel conversion first, then perform the parallel processing on each of the parallel streams individually and multiplex the results back into a common serial stream. The advantage of this approach is that each of the parallel streams can now be processed at a lower rate (factor 1/L).
Figure 42: Example for block processing via parallel execution
6.2 Pipeline Interleaving

Let's take the following simple recursion formula (integrator) as an example system to demonstrate the idea of pipeline interleaving:

y_k = x_k + y_{k-1}

Normally, we would apply the loop transformation to introduce new registers in the feedback loop. But what happens if we just insert the registers in the loop without transforming the look-ahead part, i.e. take an arbitrary signal processing hardware and replace every delay element by a shift register of L delays?

y_k = x_k + y_{k-2}
Does the behavior of the realized function change - and how? Let us take a look
at the small example in figure 44 to clarify these questions.
On the left side of the figure we see the serial input stream x_k. The right side shows the corresponding output sequence y_k. We recognize that the sum of the x_k with even k and the sum of the x_k with odd k are computed in an interleaved manner: x_0, x_1, x_2 + x_0, x_3 + x_1, ... Actually, this is not the desired behavior. But what if we exploit this behavior by feeding the integrator with two completely independent data sets that we pass to the integrator input in an interleaved manner? The resulting behavior is demonstrated in figure 45.
Figure 45: Example of pipeline interleaved integrator behavior
As the figure shows, this works fine for the two interleaved sets. After the 6 (i.e. 2*3) considered clock cycles, we observe the correct sums for both data sets at the output: y_{a,2} and y_{b,2}.
Let's consider this example more generally for the case L = 3, as shown in figure 46. Therein, our 1-step recursion loop

y_k = x_k + y_{k-1}

becomes

y_k = x_k + y_{k-3}

by applying the pipeline interleaving approach.

Figure 46: Integrator example for L=3

y_k     = x_k     + y_{k-3}
y_{k+1} = x_{k+1} + y_{k-2}
y_{k+2} = x_{k+2} + y_{k-1}
However, as mentioned before, the pipeline interleaving approach will only work if we pass 3 independent data sets to the input of this circuit in an alternating manner. In this case the three equations change as follows:

y_{1,k} = x_{1,k} + y_{1,k-1}
y_{2,k} = x_{2,k} + y_{2,k-1}
y_{3,k} = x_{3,k} + y_{3,k-1}

Now we recognize our initial equation again, but this time we can process three data sets on the same hardware simultaneously, achieving a speed-up of 3x (best case) by introducing additional pipeline registers.
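The interleaving behavior can be simulated directly: replacing the single feedback delay by 3 delays turns the integrator into 3 independent running-sum machines. A minimal sketch (the three data sets are arbitrary example values):

```python
# Pipeline-interleaved integrator, L = 3: with an L-fold feedback delay
# the hardware computes L independent running sums, provided the L data
# sets are fed in interleaved order.
from itertools import chain

def interleaved_integrator(stream, L=3):
    regs = [0] * L                 # the L-fold feedback delay
    out = []
    for x in stream:
        y = x + regs[-1]           # y_k = x_k + y_{k-L}
        out.append(y)
        regs = [y] + regs[:-1]
    return out

sets = [[1, 2, 3], [10, 20, 30], [5, 5, 5]]   # 3 independent data sets
stream = list(chain.from_iterable(zip(*sets)))  # interleave them
out = interleaved_integrator(stream)

# De-interleave: every 3rd output is the running sum of one data set.
for i, data in enumerate(sets):
    sums = [sum(data[:n + 1]) for n in range(len(data))]
    assert out[i::3] == sums
```

This is exactly the serialization/deserialization scheme of figure 48: interleave at the input, run the single hardware at the higher rate, and de-interleave at the output.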
In general, when adding delays in the feedback loop, we need to replace every
delay (i.e. also in the look-ahead part) by an L-fold delay to make the pipeline
interleaving concept work. This is shown in figure 47.
Figure 47: Every delay is L-fold replaced
Thus, the right circuit in the figure (after the 3-step pipeline interleaving) can be realized as 3 independent all-pass filters. For this purpose, the 3 independent data sets x_k, a_k and b_k must be passed to the filter in an interleaved manner.
Finally, a short remark: be aware that for applying the pipeline interleaving technique it is necessary to serialize the parallel inputs before processing at higher speed and to deserialize the outputs again, as shown in figure 48.
Figure 48: Serialization / Deserialization necessary to exploit pipeline interleaving technique
6.3
We also see in figure 49 that the clock frequency of the computation circuit is reduced by a factor of 1/2 for this 2-fold parallelization. This is the main advantage of the block processing concept. On the one hand this can be used to increase the throughput of the computation: by using the same clock frequency that would be necessary for the serial processing (1/T in the above example), the throughput would be increased by a factor of 2. On the other hand the reduced clock frequency can save power: power consumption grows significantly faster than clock frequency above a certain level (~200 MHz), but scales linearly with area increase!
Let us pick up the simple recursion loop example from the previous section for
block processing. The equations for splitting it into odd and even part, are
exactly the same, only the realization will be different:
y_k     = x_k + y_{k-1}     = x_k + x_{k-1} + y_{k-2}       →   y_{2k}   = x_{2k} + x_{2k-1} + y_{2k-2}
y_{k-1} = x_{k-1} + y_{k-2} = x_{k-1} + x_{k-2} + y_{k-3}   →   y_{2k-1} = x_{2k-1} + x_{2k-2} + y_{2k-3}
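The odd/even split can be checked with a small Python model (a sketch: zero initial state and an even input length are assumed), confirming that the block recursion reproduces the serial integrator:

```python
# 2-fold block processing of the integrator y_k = x_k + y_{k-1}:
# one block step computes one even-indexed and one odd-indexed output.
def integrator_serial(x):
    y, acc = [], 0
    for v in x:
        acc += v
        y.append(acc)
    return y

def integrator_block2(x):
    n = len(x)                     # even length assumed
    y = [0] * n
    for m in range(0, n, 2):
        # even branch: y_2k = x_2k + x_{2k-1} + y_{2k-2}
        y[m] = x[m] + (x[m - 1] if m >= 1 else 0) + (y[m - 2] if m >= 2 else 0)
        # odd branch:  y_{2k+1} = x_{2k+1} + x_2k + y_{2k-1}
        y[m + 1] = x[m + 1] + x[m] + (y[m - 1] if m >= 1 else 0)
    return y
```

Both branches only feed back values that are two time steps old, which is what makes the halved clock rate possible.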
Figure 50 shows the corresponding hardware realization for this integrator example with 2-fold parallel block processing.
Figure 50: 2-fold block processing for integrator example
It should be mentioned that the parallel inputs have already been demultiplexed
before. Thus, the first input only sees the even x_k's while the second input only
sees the odd x_k's (i.e. k is always a multiple of 2). Consequently, every single
delay on one of these inputs actually counts as a delay of 2: the single delay
on x_k produces x_{k-2}. For the same reason a single delay in the feedback loop is
sufficient to produce y_{2k-2}. We emphasize once again that the main advantage
of block processing is the clock speed reduction to 1/(L·T) (for an L-fold parallel
block processing).
Finally, we want to take a look at a slightly more complex example. By
unrolling the loop of the above example twice more, we come to a 4-step
recursion loop. We are able to split that loop into four independent loops that
can be computed in parallel with block processing. The corresponding four
equations with linear look-ahead are as follows:
y_k     = x_k + x_{k-1} + x_{k-2} + x_{k-3} + y_{k-4}
y_{k-1} = x_{k-1} + x_{k-2} + x_{k-3} + x_{k-4} + y_{k-5}
y_{k-2} = x_{k-2} + x_{k-3} + x_{k-4} + x_{k-5} + y_{k-6}
y_{k-3} = x_{k-3} + x_{k-4} + x_{k-5} + x_{k-6} + y_{k-7}
Figure 51: 4-fold block processing for integrator example with linear look-ahead
y_k     = (x_k + x_{k-1}) + (x_{k-2} + x_{k-3}) + y_{k-4}
y_{k-1} = (x_{k-1} + x_{k-2}) + (x_{k-3} + x_{k-4}) + y_{k-5}
y_{k-2} = (x_{k-2} + x_{k-3}) + (x_{k-4} + x_{k-5}) + y_{k-6}
y_{k-3} = (x_{k-3} + x_{k-4}) + (x_{k-5} + x_{k-6}) + y_{k-7}
Again, the red parts in the equation must be generated by introducing delays,
where each delay counts as four time steps due to the parallel processing approach. The realization of the 4-fold block processing example with logarithmic
look-ahead is illustrated in figure 52.
Figure 52: 4-fold block processing for integrator example with logarithmic lookahead
6.4

Starting from the 2-fold block processing equations of the integrator,

y_k     = x_k + x_{k-1} + y_{k-2}
y_{k-1} = x_{k-1} + x_{k-2} + y_{k-3}

we can simplify the second equation: x_{k-1} is directly available at the input, and y_{k-2} is already available from the feedback of the first branch. Hence:

y_k     = x_k + x_{k-1} + y_{k-2}
y_{k-1} = x_{k-1} + y_{k-2}
Now both parallel branches depend on the same feedback, namely y_{k-2}. Hence,
we can save some hardware resources, since the branch for the computation of
y_{k-1} can reuse the feedback (y_{k-2}) from the y_k branch. This optimization for
the L=2 case is illustrated in figure 53.
Figure 53: 2-fold block processing example with single feedback loop
Compared to figure 50 we can save 1 adder and the register in the second
feedback loop by reusing the output of the first feedback loop. For the more
complex 4-fold block processing example with linear look-ahead, the 4 equations
are as follows when using only a single feedback loop:
y_k     = x_k + x_{k-1} + x_{k-2} + x_{k-3} + y_{k-4}
y_{k-1} = x_{k-1} + x_{k-2} + x_{k-3} + y_{k-4}
y_{k-2} = x_{k-2} + x_{k-3} + y_{k-4}
y_{k-3} = x_{k-3} + y_{k-4}
Compared to the solution in figure 51, we could save 6 adders and 3 registers.
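These single-feedback equations can be checked against the serial integrator with a small Python model (a sketch: a single feedback register updated once per block, zero initial state assumed):

```python
# 4-fold block processing of the integrator with a single feedback loop:
# all four parallel branches use the same feedback value y_{k-4}.
def integrator_block4_single_loop(x):
    n = len(x)                 # length assumed to be a multiple of 4
    y = [0] * n
    fb = 0                     # the single feedback register (y_{k-4})
    for m in range(0, n, 4):
        y[m]     = x[m] + fb                                   # y_{k-3}
        y[m + 1] = x[m + 1] + x[m] + fb                        # y_{k-2}
        y[m + 2] = x[m + 2] + x[m + 1] + x[m] + fb             # y_{k-1}
        y[m + 3] = x[m + 3] + x[m + 2] + x[m + 1] + x[m] + fb  # y_k
        fb = y[m + 3]
    return y
```

Only the last branch's output is stored, which is why the three other feedback registers (and their adders) can be dropped.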
Finally, we also want to examine the example with logarithmic look-ahead. The
four equations change as follows:
y_k     = (x_k + x_{k-1}) + y_{k-2}
y_{k-1} = (x_{k-1} + x_{k-2}) + y_{k-3}
y_{k-2} = (x_{k-2} + x_{k-3}) + y_{k-4}
y_{k-3} = x_{k-3} + y_{k-4}
Herein, the intermediate result of the y_{k-2} branch (red) was reused for the
logarithmic look-ahead (green). The realization is depicted in figure 55.
Figure 55: 4-fold block processing example with single feedback loop and logarithmic look-ahead
In contrast to figure 52, we need only 8 instead of 12 adders and could save again
the 3 registers in the feedback loops. Although the version with a single loop
and logarithmic look-ahead/look-back is the most efficient form of our integrator
example with respect to hardware resources, it is not always the most preferred
solution. This is because the topology is quite irregular, if we compare it to
the parallel loop version, for instance. Finally, the decision of which of these
hardware realizations to be preferred, is a trade-off between hardware resources,
reusability and regularity of the communication network.
7 On-Chip Communication
7.1
Bit-level communication
As the word-level view onto the integrator circuit shows (figure 56, left), the
critical path is located within a loop. To speed it up, we have two options:
1. Unroll the loop to an L-step recursion via loop transformation, pipelining the adder by using the additional registers in the loop
2. Make the adder faster, i.e. use faster adder types (e.g. carry-look-ahead
or carry-save instead of a slow carry-ripple adder)
To pipeline the adder internally, it is necessary to consider the bit-level view of
the integrator (figure 56, right). In contrast to the word-level view, the bit-level
considers every single (bit) line of a bus. Thus, the wordlength W becomes
important here. As we already know, the critical path length and thus the
pipelining of a carry-ripple adder depends on the wordlength. Hence, pipelining
can only be done on bit-level.
Figure 56: Integrator circuit: word-level view (left) and bit-level view (right)
At first sight, it looks like we find the critical path still within a loop. However,
by redrawing the circuit a little bit we come to figure 58 (top). Now we recognize that the critical path actually is not a feedback loop. Figure 58 (bottom)
illustrates this even better. Thus, we can simply do pipelining by cut-set rule
(or pure pipelining as shown in figure 58 (bottom)) without the necessity of
loop-transformation.
By rearranging the inserted pipeline registers a little bit, the so-called skewing
triangle at the integrator inputs and the deskewing triangle at the integrator
outputs become visible. This is depicted in figure 59. The registers at the
inputs and outputs are necessary for aligning the timing of the single bit-lines,
so that all associated bits of a single word can be passed to the circuit in parallel
and arrive at the output in parallel at the same clock cycle.
Figure 59: (De-)skewing triangles in Integrator example
The (de-)skewing triangles are of special importance with respect to the chip
area, since the area increase is proportional to W². This actually does not come
from the routing, but from the skewing. For a fully pipelined N-bit integrator,
we need N full adders, N registers in the feedback loops and N−1 registers
for pipelining the carry chain; the skewing, however, consumes (N−1)·(N−1)
registers in total. Consequently, by reusing the skewing, we could save a
lot of area! The following example of a (transposed form) FIR adder chain
demonstrates how this works. In the upper part of figure 60 we see the adder
chain of an FIR filter with 4 taps (L=4) with a subsequent integrator. The
lower part of the figure shows, how we can do the pipelining on bit-level (W=3)
jointly for the whole adder chain. We can see that the computation of the least
significant bit (LSB) of the second adder can directly start after the LSB of the
first adder is available. We conclude that there is no need to deskew the outputs
of the first adder and skew the inputs at the second adder again. This would
only be a waste of registers and an unnecessary increase of the overall latency of
the circuit. There is actually only one skewing triangle necessary at the input
and one deskewing triangle at the output, as figure 60 (bottom) shows. If we
had only pipelined the carry-ripple adder on bit-level and just plugged the
adder chain together on word-level, we would not have been aware of this unnecessary
skewing overhead and thus would have wasted a lot of area.
Figure 60: FIR example for reuse of skewing triangles (L=4, W=3)
7.2
Perfect Shuffle
The Perfect Shuffle network can often be found in algorithms to realize a permutation of the input signals (e.g. Viterbi, FFT) or for routing purposes. Classical
interconnection structures such as crossbar switches can be used for this purpose
as well, but consume much more area when implemented in hardware.
Figure 61: Crossbar switch can realize any arbitrary permutation on the inputs
The Perfect Shuffle network is also able to realize arbitrary permutations when
arranged in a so-called shuffle exchange network (see section 7.3) and is thereby
able to save some area compared to the crossbar switch. This is actually the
method used for shuffling playing cards: the deck is split into equal halves which are
then pushed together in a certain way so as to make them perfectly interweave.
The cards of the two halves are arranged in an alternating manner after shuffling.
Figure 63 shows how this is applied to communication networks.
Figure 63: Perfect shuffle network with 8 elements (nodes) and 3 stages
The realization of the P_4^8 shuffle accordingly looks as depicted in figure 65.
Figure 65: Realization of the P_4^8 shuffle supported by memory
Figure 66 shows a different example with 9 nodes. In this case 3 splits have
been used (i.e. 3 different sets are considered for the shuffling).
Figure 66: Perfect shuffle network with 9 elements, 3 splits and 2 stages
In the P_3^9 case the original order is already restored after log_3 9 = 2 stages.
As a general rule for Perfect Shuffle networks we can summarize: the original
order is restored after log_b N shuffling stages!
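A small Python sketch of the P_b^N shuffle (lists stand in for the routed buses; the function name is illustrative) lets us verify this rule directly:

```python
# A P_b^N perfect shuffle as a list permutation: split the N elements into
# b equal piles, then interleave the piles element by element.
def perfect_shuffle(seq, b):
    N = len(seq)
    assert N % b == 0          # only an integer divisor b is admissible
    piles = [seq[p * N // b:(p + 1) * N // b] for p in range(b)]
    return [pile[j] for j in range(N // b) for pile in piles]
```

For N = b^k elements, applying the shuffle k = log_b N times restores the original order, e.g. three P_2^8 stages or two P_3^9 stages.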
Another example of a Perfect Shuffle network, with a P_4^16 permutation, is shown in
figure 67. This permutation network can e.g. be used for a so-called ideal block
interleaver, which is often found in coding theory for randomizing the order of
the data stream before transmitting it over the channel to increase immunity
to disturbance. Thereby, the data stream is written column-by-column into a
memory with 4×4 cells. Afterwards it is read from the memory row-by-row, thus
realizing a P_4^16 Perfect Shuffle.
The block interleaving is reversible, since after the second shuffling stage the
original order is restored: log_4 16 = 2. We can also express this via a factorization
rule for permutations: P_4^16 · P_4^16 = P_16^16 = P_0^16, whereas P_N^N = P_0^N represents
the identity (i.e. no permutation). In general we can decompose a permutation
with a split (a_1·a_2) into a network of two permutations, having splits a_1 and
a_2 but the same number of elements as the original permutation:

P_{a_1}^{b_1} · P_{a_2}^{b_1} = P_{(a_1·a_2) mod b_1}^{b_1}
Hence, we can confirm once again the initial conclusion for P_2^8:
P_2^8 · P_2^8 · P_2^8 = P_8^8 = P_0^8.
Observing a Skat deck with 32 playing cards and usual shuffling with 2 splits,
you need 5 stages to restore the original order:
P_2^32 · P_2^32 · P_2^32 · P_2^32 · P_2^32 = P_0^32.
Please note: for an admissible permutation, the number of splits b has to be an
integer divisor of the number of elements N.
E.g. for N = 12 the following would be valid permutations: P_2^12, P_3^12, P_4^12, P_6^12.
A last note concerning the routing of the wires for a Perfect Shuffle network: if
we assume W processors to be arranged as a linear vector, then we have W/2
channels to route in x and W channels in y direction.
7.3 De Bruijn Networks & Shuffle-Exchange
Shuffle exchange networks can be used for all kinds of dynamic data flow control,
e.g. routing/switching (cross-bar like) or sorting. As figure 68 shows, a single
stage of the network can be sub-divided into a shuffle stage, represented by a
Perfect Shuffle network, and an exchange stage that allows a pair-wise exchange
of the data between each two adjacent processors. As mentioned in the previous
section, the area is proportional to W² for the shuffle stage. For the exchange
stage the area is A ∝ W. Hence, the shuffle stage is dominant with respect to
routing area.
Figure 68: Shuffle exchange network for N=8
By combining the shuffle with the exchange stage, i.e. merging the first column
of nodes in figure 68 with the second one, we come to the de Bruijn graph
depicted in figure 70.
What graph representation do we get, if we shift the bits from left to right into
the register? The corresponding graph, the so-called reverse de Bruijn graph,
is depicted in figure 71. While the original de Bruijn graph realizes a shuffle exchange
for a P_2^8 permutation, the corresponding reverse de Bruijn graph is equivalent to
a P_4^8 shuffle with preceding exchange stage, i.e. the order of the shuffle and
exchange stages has been inverted.
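The shift-register view can be captured in a short Python sketch (nodes are numbered by the W-bit register contents, N = 2^W; the function names are illustrative):

```python
# Node connectivity of the de Bruijn graph and its reverse, derived from the
# shift-register interpretation of the node numbers.
def de_bruijn_successors(i, W):
    # shift a new bit b in from the right (left shift of the register)
    N = 1 << W
    return [((i << 1) | b) % N for b in (0, 1)]

def reverse_de_bruijn_successors(i, W):
    # shift a new bit b in from the left (right shift of the register)
    return [(i >> 1) | (b << (W - 1)) for b in (0, 1)]
```

Each edge of the reverse graph is exactly a forward edge traversed backwards, which is why both realize the same connectivity with inverted shuffle/exchange order.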
Though the two realizations look almost identical with respect to the routing
complexity, there is a big difference concerning the consumed routing area!
Every parallel wire consumes a certain area for its routing channel in x and y
direction. For the routing in x direction we clearly need one routing channel
per node, independent of the realization. For the routing in y direction, we find
the highest concentration of wires at the center symmetry line that is sketched
in the figures 70 and 71. In figure 70 we recognize that wires from 4 different
processing nodes cross the center line. Thus, 4 (= N/2) routing channels are
necessary for the de Bruijn realization. In the reverse de Bruijn graph (figure
71) we see that one wire from all 8 nodes crosses the center line. Thus, we need
N parallel routing channels in y direction in this case (which doubles the width
of the routing network). This has a significant impact on the consumed area,
since A ∝ N²!
Finally, let us combine multiple stages of a de Bruijn shuffle. If we hook up
log_a(N) de Bruijn graphs, each with a P_a^N shuffle, we come to the so-called
Ω-Network. As depicted in figure 72, log_a(N) stages are enough to reach
full coverage, i.e. to be able to connect every input with every possible output
(and vice versa), given that the exchange stages are configured accordingly.
Because of this property, the Ω-Network has special importance, e.g. for routing/switching purposes as a (more efficient) replacement for a crossbar switch.
Please note that de Bruijn and FFT graph can be converted into one another
by rearranging the order of the processing nodes.
8 Measures
Up to now, we discussed many different hardware implementation and optimization techniques. The question is: How can we evaluate and compare two
different implementations? For this purpose we introduce some important measures within this section which allow us to make a statement on the quality of
a hardware implementation.
8.1
Basics
One of the most important (and basic) measures for evaluating a hardware
implementation is the area (A). It can for example be measured in # transistors,
# NAND-gates, die size (e.g. in mm²) or # slices (FPGAs). We can further
sub-divide the area as follows:
Al : logic area (e.g. for adders, multipliers, etc.)
Ac : communication/routing area (for the wires that connect logic and
registers)
Ar : area of registers/buffers (for pipelining and storage)
Am : area of memory
To keep it simple, we will not consider the memory area here, since it very much
depends on the applied technology (e.g. available memory size, type, etc.) and
the according technology specific algorithm implementation.
A second very important basic measure is the latency of the clock interval (T)
or the clock rate (1/T), respectively. The clock rate is directly proportional to
the achievable processing rate at the input of the circuit and thus a very good
measure for the processing speed. The latency can be sub-divided in a similar
way:
Tl : latency of logic
Tc : latency of communication (i.e. wire delays)
Tr : latency of registers (i.e. setup time2 and hold time3 )
Tm : latency of memory (i.e. memory access time)
A third important basic measure is consumed energy:
El : energy consumption of logic
Ec : energy consumption for communication
Er : energy consumption for register storage
Em : energy consumption for memory storage
2 the time that the data input must be stable before a clock edge to allow the register to
sample the new data value
3 the time that is needed by the register to update the output after a clock edge
8.2
Measure of Complexity
Let us first define the efficiency (E) of an algorithm. Clearly, an algorithm gets
the more efficient the less area it consumes and thus:

E ∝ 1/A

Secondly, an algorithm becomes more efficient the higher the processing rate
(1/T) is:

E ∝ 1/T

Combining both, we get:

E ∝ 1/(A·T)

Now the complexity (C) can be seen as the inverse of the efficiency (the more
efficient an algorithm becomes, the less complex it is). Finally, this yields the
well-known AT measure for the complexity:

C ∝ A·T
By sub-dividing the area and latency according to section 8.1, we could refine
the equation by replacing A and T by the sum of the constituent parts. Does
this really work?
AT = (Σ A_i) · (Σ T_i) = (Σ A_l + Σ A_c + Σ A_r) · (Σ T_l + Σ T_c + Σ T_r)
This approach is o.k. with respect to the area. However, summing up all
delays in the circuit does not provide any information with respect to the
achievable clock interval. What we are actually looking for is the latency of the
critical path that was discussed many times before. This is directly related to
the achievable clock interval. Hence, the equation must be rewritten as follows:
(AT) = (Σ A_l + Σ A_c + Σ A_r) · T_CP

with T_CP being the latency of the critical path.
If the critical path is not known, the upper limit of T and hence of C can be
estimated as follows:
(AT) ≤ (Σ A_l + Σ A_c + Σ A_r) · (max_l T_l + max_c T_c + max_r T_r)
We can extend the equation for the complexity measure to also consider the
software side as follows:
AT = A_P · T_CP · N_cycles
N_cycles thereby represents the number of cycles to complete a task, which also
affects the complexity of an algorithm. If a task runs on the same processor at
the same clock speed and takes more cycles to complete than a different
task, it is clearly expected to be more complex. Consequently, this definition of
the AT measure needs to be analyzed with respect to the task to be completed.
Let us examine a small example to emphasize the influence of the software side,
if we talk about mixed hardware/software approaches. Consider 2 different
processing cores:
Core I:  A_P = 10 mm², 1/T = 2 GHz
Core II: A_P = 20 mm², 1/T = 4 GHz

By using the AT measure both seem to have the same efficiency (complexity):

(A_P · T_CP)_I  = 5 mm²/GHz
(A_P · T_CP)_II = 5 mm²/GHz
Furthermore, we define the instruction level parallelism (ILP), which is the
average number of instructions that the processor executes in parallel:
ILPI = 1.2
ILPII = 1.7
By considering this software point of view we clearly recognize that core II is
more efficient than core I. However, our basic AT measure does not consider
this and had to be extended for this purpose:
AT = A_P · T_CP · N_cycles · (1/ILP)

Now that we have introduced the complexity measure, we want to apply it to
some examples in the following.
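As a quick numeric check of the ILP-extended measure, using the two cores from above (units mm² and GHz as in the example; the function name is illustrative):

```python
# Effective complexity per task: area * clock period / instruction-level
# parallelism (N_cycles cancels out when comparing the same task).
def at_eff(area_mm2, clock_ghz, ilp):
    return area_mm2 * (1.0 / clock_ghz) / ilp

core_I  = at_eff(10, 2, 1.2)   # ~4.17 mm^2/GHz
core_II = at_eff(20, 4, 1.7)   # ~2.94 mm^2/GHz
```

Core II ends up with the clearly lower effective complexity, which the plain AT measure could not distinguish.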
8.3
Wordlength Analysis
Routing Example
Take a look at figure 74. It exemplarily demonstrates on-chip routing on
bit-level. We already discussed the importance of the bit-level view and the
difference to the word-level in section 7.1. In this section we are going to analyze
the routing complexity with respect to the wordlength W.
Figure 74: Example for on-chip routing (bit-level view)
From figure 74 it should be clear that by doubling the wordlength, we get twice
as many wires in x- as well as in y-direction. Hence, the routing area doubles in
the x- as well as in y-dimension. This results in a total area increase of factor
4. In general we have the following relation:
A_c ∝ W²

The communication delay, however, increases only with the square root of the
area:

T_c ∝ √A_c ∝ W

Using the AT complexity measure, we finally get:

C_c = A_c · T_c ∝ W³
We can draw the surprising conclusion that the routing complexity increases
with W³ for growing wordlength W. Or, the other way around:
the efficiency goes down by W³ with increasing wordlength W.
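The three relations can be condensed into a toy calculation (a sketch with all technology constants set to 1):

```python
# Routing complexity of a bit-level bus as a function of the wordlength W.
def routing_complexity(W):
    area = W ** 2            # A_c ~ W^2: W wires in both x and y direction
    delay = area ** 0.5      # T_c ~ sqrt(A_c) ~ W
    return area * delay      # C_c = A_c * T_c ~ W^3
```

Doubling the wordlength multiplies the routing complexity by 2³ = 8.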
De Bruijn Example
As a third example we take a look at the de Bruijn network again for analyzing
its complexity using the AT metric. Assume a linear vector of processing nodes,
as we did before in section 7.3. According to figure 75, it is easy to conclude that
the number of routing channels in x- as well as in y-direction depends linearly
on the wordlength W and the number of processing nodes N . Consequently, we
can assume the following complexity relation for the communication area:
A_c ∝ W² · N²
The logic area Al depends linearly on the number of nodes (N ) and linearly
or quadratically on the wordlength (W ). The wordlength dependency of the
complexity is determined by the algorithm that is realized by a single node in
the de Bruijn communication network. Algorithms such as addition, min/max
(e.g. used for Viterbi) have a linear logic area complexity, while multiplications
(e.g. used within FFT) exhibit a quadratic complexity. We can conclude:
A_l ∝ N·W or N·W². The area of registers is also linearly increasing with N and
also depends linearly on W, assuming the more efficient case w/o skewing (see
section 8.3): A_r ∝ N·W. The following table summarizes the findings for the
area:
Communication:  A_c ∝ W² · N²
Logic:          A_l ∝ N·W or N·W²
Registers:      A_r ∝ N·W
From the table we can conclude that the communication area seems to be the
most critical part with respect to the complexity of the de Bruijn network,
since it increases quadratically with W and with N as well. So what about the
communication latency? As mentioned previously its complexity is proportional
to the square root of the area complexity and thus:
T_c ∝ √A_c ∝ W · N
But what do we gain in terms of communication complexity? The interconnection still seems to be the same! By observing the recursive structure on bit-level
(figure 77, left), we can see that we have a feedback at each bit-level. The
individual bit-levels are independent of each other, however. We can slightly
rearrange the processor-bit nodes (figure 77, right) and now recognize that
we actually deal with W independent 1-bit de Bruijn networks.
How is the complexity affected by this insight? To clarify this, we once again
redraw the network structure from figure 77 (right) a little bit.
Figure 78: Complexity of recursive de Bruijn structure
T_c,1bit ∝ √A_c,1bit ∝ N

This is because the individual 1-bit de Bruijn networks are independent and we
assume pipelining on bit-level. This yields the final result for the AT complexity
measure:

AT ∝ W · N³   (bit-level recursive structure, instead of AT ∝ W³ · N³ on word-level)
FIR Example

Consider the convolutional sum of an FIR filter on word-level:

y_k = Σ_{i=0}^{L} a_i · x_{k-i}

Considering the bit-level, the equation turns into three nested sums over the
taps i and the bit positions n (coefficient bits) and m (data bits):

y_k = Σ_{i=0}^{L} Σ_{n=0}^{W} Σ_{m=0}^{W} a_{i,n} · x_{k-i,m} · 2^n · 2^m
Now the convolutional sum is computed individually on each bit. The complexity
of this bit-plane FIR filter is significantly decreased, as we have already seen
in the de Bruijn example before!
Integrator Example
In this section we examine the complexity of the well-known integrator circuit.
Viewed from above (figure 79, left), the routing complexity seems to be A_c · T_c
∝ W³, according to the result of the previous section. If we examine the adder
circuit a little closer using a convenient representation (figure 79, right), we
see that this is actually not true!
For increasing the wordlength by 1, a new adder stage has to be appended. This
new adder stage consists of a constant amount of routing area: A_c ∝ W. The
routing delay T_c depends on the maximum wire length, which is independent
of the number of adder stages: T_c ∝ 1. Hence, we only get A_c · T_c ∝ W, in
contrast to the top-level view on the left-hand side of figure 79. After fully
pipelining the adder (also depicted in figure 79, right) we observe the following
situation:
A_c ∝ W     T_c ∝ 1
A_l ∝ W     T_l ∝ 1
A_r ∝ W²    T_r ∝ 1

Without skewing, the register area reduces to A_r ∝ W and the complexity
becomes:

AT ∝ W · 1 = W

This once again demonstrates how much the hardware implementation is influenced by skewing. By omitting the skewing, we can significantly reduce the
complexity: C ∝ W² → C ∝ W!
8.4
M-Step Analysis
In the following, we will apply the AT complexity measure from section 8.2 to
analyze the efficiency of the two techniques that we introduced to speed up
recursions, namely pipeline interleaving and block processing.
Pipeline Interleaving
Consider the following 4-step recursion with a generic Dot operator (◦) and
logarithmic look-ahead, also depicted in figure 80:

y_k = (x_k ◦ x_{k-1}) ◦ (x_{k-2} ◦ x_{k-3}) ◦ y_{k-4}
Side note:
It should be mentioned that the order in which the ◦ operator is applied in the
look-ahead part is important here (in contrast to the + operator)! Hence, the
following implementation is not equivalent!
Normally, we compute (x_k ◦ x_{k-1}) ◦ (x_{k-2} ◦ x_{k-3}) in the logarithmic look-ahead. By changing the order according to the figure above, the realized computation becomes (x_k ◦ x_{k-2}) ◦ (x_{k-1} ◦ x_{k-3}). Because the ◦ operator is not
commutative, x_{k-1} and x_{k-2} cannot be exchanged and we run into trouble here!
Figure 80: 4-step recursion with generic Dot operator and logarithmic lookahead
Area, clock period and complexity of the 1-step implementation serve as the
normalized reference:

A_1 = 1,   T_1 = 1,   A_1 · T_1 = 1
For the M-step recursion we get the following relative area metrics:
Look-ahead:  A_l ∝ ld(M),   A_r ∝ M,   A_c ∝ 1
Feedback:    A_l ∝ 1,       A_r ∝ M,   A_c ∝ 1

For the feedback loop, the clock period ideally shrinks to:

T_CP = T_1 / M
However, in reality we are only able to shorten the critical path with respect to
the logic delay! The register delay (setup/hold time) as well as the wire delay
cannot be shortened via pipelining:

T_CP = T_CP,l / M + T_CP,r + T_CP,c = T_CP,l / M + T_r + T_c

Finally, these results yield the following AT measure:

AT ∝ (ld(M) + M) · (T_CP,l / M + T_r + T_c)

For large M, the register area (∝ M) multiplied with the constant register and
wire delay (T_r + T_c) dominates, and thus:
AT ∝ M

...which is not so nice anymore, because this means that the efficiency of the
algorithm now decreases linearly with every additional recursion step!
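The saturation effect is easy to see numerically; the sketch below assumes a unit logic delay and a constant register-plus-wire delay of 0.2 (arbitrary illustrative values):

```python
# AT measure of the M-step pipeline-interleaved recursion: only the logic
# part of the critical path shrinks with M, the register/wire delay does not.
import math

def at_measure(M, t_logic=1.0, t_regwire=0.2):
    area = math.log2(M) + M            # look-ahead logic + pipeline registers
    t_cp = t_logic / M + t_regwire     # register/wire delay does not shrink
    return area * t_cp
```

For large M, doubling M roughly doubles the AT measure, confirming AT ∝ M.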
Block Processing
Finally, let us take a quick glance at how block processing changes the complexity of an algorithm. With block processing we definitely have to deal with an
area increase: we get M parallel processing chains, each consisting of ld(M)
stages (with logarithmic look-ahead). Compared to the original 1-step recursion
function, we thus have a final area increase of:
A ∝ M · ld(M)
The achievable clock frequency on the other hand stays untouched by the block
processing:
T ∝ 1
Accordingly, the AT measure yields:
AT ∝ M · ld(M)
This means that the complexity is increased (or the efficiency is decreased, respectively) by block processing. So, what did we miss? Sure, we are now able to do multiple
computations in parallel. This so-called parallelism factor P is not considered
by the AT measure. That is why we have to extend the AT measure accordingly,
which finally yields the effective AT measure AT_eff:
AT_eff ∝ AT / P = M · ld(M) / M = ld(M)

with P = M for the case of block processing.
8.5
ATE Measure
We are now very familiar with the AT measure. What if we want to consider the
energy consumption in addition to the area and clock speed within our measure?
We could define the energy consumption as follows:

E = T · N · P

with E the consumed energy, T the clock period, N the number of cycles and
P the power.
Used for...

AT:  chip cost analysis
ATE: determine cost of a solution taking also energy into account

Other Measures

Besides the AT measure the following two complexity measures can occasionally
be found:

AT²: complexity measure for communication-dominated applications
A/T:
Now that we have become acquainted with the basic hardware/software co-design principles, we will take a look at more complex structures within this
section, namely processors.
9.1
Hardware Reuse
Example: FIR
We already know the equation for the convolutional sum of an FIR filter:
y_k = Σ_i a_i · x_{k-i}
The basic idea of our hardware reuse approach is that we just take a composite
of a multiplier, an adder and a register (framed in figure 81) and use it for
the computation of all filter stages. The resulting processor example for the
iterative FIR computation is shown in figure 82.
The coefficients a_i are read from a memory (a-mem), addressed with the
pointer *pa (-- because of the descending index). In parallel the input x_{k-i} is
read from a second memory (x-memory) with *px used for addressing it (++
because of the ascending index). Both values are now multiplied and accumulated,
i.e. added to the value of the previous stage. For this purpose an accumulator
register is used (which must be initialized with 0). The output is generated
after the L-th (in this example L = 3) stage and is finally written into a third
memory. This FIR processor realization needs 2 read operations per iteration.
The following pseudo code demonstrates how this hardware works:
Step 0:  reset acc;  i=3:  acc  = a_3 · x_{k-3}
Step 1:  i=2:  acc += a_2 · x_{k-2}
Step 2:  i=1:  acc += a_1 · x_{k-1}
Step 3:  i=0:  acc += a_0 · x_k ;  output y_k
Following this approach we were able to reuse the hardware of one FIR tap (the
multiplier/adder/register composite). The price for this reuse is additional
hardware:
memories
control
registers (acc) for intermediate results
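The pseudo code above maps directly to a short Python routine (a sketch: one MAC unit and one accumulator; `fir_direct` is an illustrative name):

```python
# Iterative direct-form FIR computation with a single multiply-accumulate
# unit, mirroring the step-by-step pseudo code (here 4 taps, i.e. L = 3).
def fir_direct(a, x, k):
    acc = 0                                # reset acc
    for i in reversed(range(len(a))):      # i = 3, 2, 1, 0
        acc += a[i] * x[k - i]             # acc += a_i * x_{k-i}
    return acc                             # output y_k
```

One loop iteration corresponds to one processor step with its 2 memory reads.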
9.1.2 Transposed Form FIR
In section 2.5 we already got to know a more efficient alternative of the FIR
filter implementation: the so-called transposed form FIR filter. A corresponding
example with 4 taps is shown in figure 83. Its advantages (with respect to
the hardware realization!) are: reduced latency and fewer registers needed for
pipelining (due to reusing the already available shift registers). So, what about
the processor, i.e. hardware reuse, approach?
Figure 83: Transposed Form FIR example for hardware reuse
Figure 84: Reusable hardware (processor) for transposed form FIR example
Similar to the direct form FIR version we need an a-mem and an x-mem with
according address pointers to read the coefficients and input data, respectively.
The subsequent multiplication and addition is also similar. However, the storage of the output/intermediate values gets somewhat more complicated for the
transposed form filter. Instead of a simple accumulator register we now need a
dual-ported memory for the intermediate Z-values, to be able to read Z_{k-1,i+1}
while writing Z_{k,i} in the same cycle. This also implies more complicated
address handling, since we have to read from address *pz but write to the read
address of the last iteration cycle (the input of stage i is the output of stage i+1). Finally, this FIR processor realization takes 3 read/write operations per iteration,
which is also a disadvantage compared to the direct form FIR. The according
pseudo code looks as follows:
Step 0:
Step 1:
Step 2:
Step 3:
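Since the concrete step bodies are not spelled out here, the following Python sketch only illustrates the principle of the transposed-form update with a Z-memory (sweeping i upwards models reading Z_{k-1,i+1} before Z_{k,i} is written; names are illustrative):

```python
# One iteration of the transposed-form FIR: per input sample x_k, every
# Z cell is refreshed as z[i] = a[i]*x_k + z_old[i+1]; y_k is the new z[0].
def transposed_fir_step(a, z, x_k):
    L = len(a)
    for i in range(L):
        # z[i+1] still holds the previous iteration's value when z[i] is written
        z[i] = a[i] * x_k + (z[i + 1] if i + 1 < L else 0)
    return z[0]                # y_k
```

Each loop iteration needs a coefficient read, a Z read and a Z write, matching the 3 memory operations per iteration mentioned above.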
The following table summarizes both FIR implementation approaches with respect to the realization as reusable software and demonstrates once again the
advantage of the direct form FIR for this purpose:
To summarize the results of the small FIR example: though the transposed
form FIR is well suited for the hardware implementation, it is not a good choice
for the sequential (i.e. software or processor) implementation. The direct form
FIR, on the other hand, is better suited for a sequential implementation. We can
conclude that the structure of a design must be carefully chosen depending on
the type of implementation (hardware vs. software) to be efficient! The
following table summarizes some general rules:
Hardware (HW):
distribute algorithmic registers within computational logic
use shift registers to ...

Software (SW):
keep algorithmic registers together
use registers if possible as shift registers to:
1. simplify memory addressing & control
2. process iterations along critical path with local data reuse (e.g. accumulator)

Shift registers are good in both cases.
9.2
Now that we have investigated the very basic and static example of an FIR
processor, we will design a more complex and flexible processor for digital signal
processing. First, we define a short wish list of the desired features of our DSP:
2 memory ports, e.g.
1 memory read port
1 memory read or write port
1 multiplier
1 arithmetic logic unit (ALU)
Let us now directly take a look at the block diagram of the Simple DSP's
data path as depicted in figure 85. In the following we will build up the whole
processor step by step.
Figure 85: The data path of our Simple DSP
Let us start with the two input registers ra and rb. They can receive a value
from the data memories (a-mem, b-mem) and pass their content to the computational part of the processor's data path. The computational part consists of
a multiplier with a subsequent ALU4 . The register content (of ra or rb) can
either directly be passed to the ALU, bypassing the multiplier, or is passed to
the multiplier first and subsequently propagated to the ALU. This is controlled
by two multiplexers. The output of the computational part is stored in one of
two accumulator registers (acc0 or acc1). The output of the accumulators can
be looped back to the ALU input as well as to the (dual-ported) data memory
(b-mem). In addition to the accumulators, the ALU also writes some flag registers (e.g. sign or overflow bit) that can be used for program control later on
(conditional jumps, comparisons, etc.). For addressing the data memory four
pointer registers are provided: p0, p1, p2 and p3. For each of the two data
ports (a and b) the corresponding address is multiplexed from one of the four
4 Arithmetic Logic Unit: a generic unit that is able to compute different arithmetic (e.g. ADD, SUB) and logic (e.g. AND, OR) operations, usually controlled by a mode input
pointer registers and passed to the according address input pa or pb, respectively. The output is passed to the according input register ra or rb (in case of
read). The input is passed from acc1 to the data memory of port b (in case of
write). To update the pointer address values the modifier registers m0, m1, m2
and m3 can be used. E.g. we could define m0 = 0, m1 = 1, m2 = -1, m3 = 2. In this case m1 could be used for address increment, m2 for address decrement, m3 for an address increment of 2 and m0 as a neutral address modifier.
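This pointer/modifier mechanism can be sketched in a few lines of Python (a minimal sketch: the register names p0..p3 and m0..m3 follow the text, while the post-update rule p += m and the class interface are assumptions):

```python
# Hypothetical sketch of the Simple DSP address generation: four pointer
# registers (p0..p3) are post-updated by one of four modifier registers
# (m0..m3). The modifier values match the example above (m2 = -1 decrements).

class AGU:
    def __init__(self):
        self.p = [0, 0, 0, 0]            # pointer registers p0..p3
        self.m = [0, 1, -1, 2]           # m0 neutral, m1 incr, m2 decr, m3 incr-by-2

    def address(self, p_idx, m_idx):
        """Return the current address of p<p_idx>, then post-modify it with m<m_idx>."""
        addr = self.p[p_idx]
        self.p[p_idx] += self.m[m_idx]   # post-update, *p++-style addressing
        return addr
```

Repeatedly calling address(0, 1) thus walks p0 forward by one address per access, while address(0, 2) walks it backward.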
9.2.1 Example: Direct Form FIR
In this section we will give a short example and show how we could use our Simple DSP to compute the direct form FIR (compare to section 9.1.1). We use the input register ra for the coefficients a_i; accordingly, rb is used for the input samples x_{k-i}. The accumulated result is stored in acc1; acc0 is not needed for this application. We use the pointer registers p0 and p1 as pa and px, respectively. p2 is also needed to store the output y_k. The algorithm for the direct form FIR is sketched by the following pseudo code.
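Functionally, the computation amounts to the following hedged Python sketch (the register and pointer names ra, rb, acc1, pa, px are taken from the text; the loop structure is an assumption, not the DSP's actual step sequence):

```python
# Hedged sketch: direct form FIR y_k = sum(a[i] * x[k-i]) expressed with the
# Simple DSP resources named in the text: ra/rb input registers, acc1
# accumulator. The comments show the intended register-level operations.

def fir_direct(a, x, k):
    acc1 = 0
    for i in range(len(a)):      # one MAC per loop step
        ra = a[i]                # ra = *pa++  (coefficient fetch, port a)
        rb = x[k - i]            # rb = *px    (sample fetch, port b)
        acc1 += ra * rb          # acc1 = acc1 + ra * rb  (mult + ALU)
    return acc1                  # *p2 = acc1  (write-back of y_k)
```

One call computes one output sample y_k; the pointer bookkeeping of the real AGU is hidden inside the Python indexing.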
Step -1 (init):
Step 0:
Step 1:
Step 2:
Step 3:
Step 4:
9.2.2 Instructions Needed
To be able to control the data flow within our DSP we need to define control
words, so called Function Instruction Words (FIW), for every functional unit of
the DSP. Our DSP basically needs control of 4 functional units with according
FIWs:
1. Address generation unit for ra-port
2. Address generation unit for rb-port
3. Data path unit (ALU/mult)
4. Program control unit
FIW for Address Generation Unit (ra-port):
We define the following requirements and restrictions for addressing the ra-port.
1. Read-only access from data memory into input register ra
5 an immediate value is directly contained within the FIW in contrast to register or memory
access where only the address is passed with the FIW
FIW for Address Generation Unit (rb-port):
The MSB is used to toggle between read and write mode. Another bit is used for switching between memory and register access. The remaining 4 lower bits are used similarly to the ra-port: 4-bit register address (in register mode) or 2-bit pointer address and 2-bit update mode (in memory mode), respectively.
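A possible decoding of this 6-bit FIW can be sketched as follows (the field layout follows the description above, but the concrete bit polarities, e.g. 1 = write, and the returned dictionary format are assumptions):

```python
# Sketch of decoding the 6-bit rb-port FIW: MSB = read/write toggle, next
# bit = memory/register switch, low 4 bits split into 2-bit pointer address
# plus 2-bit update mode in memory mode, or a 4-bit register address.

def decode_rb_fiw(fiw):
    assert 0 <= fiw < 64                      # 6-bit word
    write = (fiw >> 5) & 1                    # MSB: read(0) / write(1), polarity assumed
    mem   = (fiw >> 4) & 1                    # memory(1) or register(0) access
    if mem:
        return {"write": write, "mode": "mem",
                "pointer": (fiw >> 2) & 3,    # 2-bit pointer address (p0..p3)
                "update":  fiw & 3}           # 2-bit update mode (m0..m3)
    return {"write": write, "mode": "reg",
            "reg": fiw & 15}                  # 4-bit register address
```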
FIW for Data Path Unit (ALU/mult):
Assuming a 6-bit FIW, we come to the following structure for the data path
unit FIW (also compare it with figure 85).
Figure 88: Functional instruction word of data path unit
The first bit (MSB) is spent to control the destination of the ALU output (acc0
or acc1). The second bit is used to control the left multiplexer of the ALU
input (i.e. feedback from acc0 or ra/mult output). The third bit controls the
right multiplexer of the ALU input (i.e. feedback from acc1 or rb output). The
ALU mode (plus the multiplier bypass using the left multiplexer) is controlled
by the lower 3 bits. Possible modes are: +, -, AND, OR, XOR, Mult, MAC,
nop6 .
FIW for Program Control Unit:
The FIW for the program control unit has influence on:
program flow (e.g. jumps, call, return, ...)
conditional execution (e.g. if, else, ...)
repeat / loops
We will not go into the details here and simply assume 6 bits to be spent for
this FIW.
9.3 Instruction Set Architecture (ISA)
2. It controls the instruction decoding pipeline, i.e., the mapping of IWs onto HW and its control pipeline (VLIW, CISC, RISC, ...).
3. It provides an interface between hardware and software that defines the
assembler/compiler/.... input.
9.3.1 VLIW
The idea of the VLIW (Very Long Instruction Word) ISA is simple. We just concatenate the instruction words of every single functional unit (i.e. the FIWs) to form one long processor instruction. Thus, every functional unit is directly controlled in parallel via a single instruction. Figure 89 shows the VLIW for our example DSP, consisting of 24 bits.
6 nop: no operation, i.e. an instruction that does nothing
Figure 89: Very Long Instruction Word for our Simple DSP
For more complex processor architectures the VLIW would become much longer
(which is the main disadvantage of this ISA approach).
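The concatenation can be sketched as simple bit packing (the field order within the 24-bit word is an assumption; the text only fixes four 6-bit FIWs and the 24-bit total):

```python
# Minimal sketch: the 24-bit VLIW is the concatenation of four 6-bit FIWs
# (here assumed order: program control, AGU port a, AGU port b, data path).

def pack_vliw(fiw_pcu, fiw_agu_a, fiw_agu_b, fiw_dpu):
    for f in (fiw_pcu, fiw_agu_a, fiw_agu_b, fiw_dpu):
        assert 0 <= f < 64                     # each FIW is 6 bits wide
    return (fiw_pcu << 18) | (fiw_agu_a << 12) | (fiw_agu_b << 6) | fiw_dpu

def unpack_vliw(iw):
    """Split a 24-bit IW back into its four 6-bit FIWs."""
    return [(iw >> s) & 63 for s in (18, 12, 6, 0)]
```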
Processor Block Diagram
Based on the VLIW ISA we are now able to draw the big picture of the top-level
processor architecture for our Simple DSP. This is shown in figure 90. Our
processor architecture consists of a
control loop
and a data loop.
The control loop in turn consists of three functional units:
Program Control Unit (PCU)
Program Memory (PM)
Instruction Decoder (ID)
The data loop consists of three functional units as well:
Address Generation Unit (AGU)
Data Memory (DM)
Data Path Unit (DPU)
Let us first take a closer look at the control loop. The PCU manages the program flow, which is usually a straightforward sequential flow but may be influenced by executed instructions (e.g. program control, loops, conditional jumps, etc.).
The next instruction is fetched from the PM using the program counter (PC)
of the PCU for addressing it. The fetched instruction word (IW) is passed
to the ID that maps the IW onto a bunch of control signals for controlling
every functional unit of the data loop. For our VLIW ISA it just separates
the FIWs that are contained in the IW. The data loop starts with the AGU that is responsible for pointer management (using pointer registers) and pointer
updates for read/write address generation. When the correct read address is
pending, input data can be fetched from the DM and passed to the DPU. The
DPU is the actual number crunching unit (see section 9.2) that consists of the
ALU, multiplier and I/O registers. The DPU output may be written back to
the DM. For this purpose the according write address must be selected by the
AGU which in turn is controlled by the DPU itself. In addition the output of
the DPU computation may influence the program control flow. E.g. in case of
a conditional jump, the result of the comparison (sign flag) which is computed
by the DPU, controls whether the jump takes place or not. Hence, we have a
feedback to the PCU and ID.
9.3.2 RISC
The instruction now only consists of a single 6-bit FIW and a 2-bit opcode that is used for addressing the functional unit (we defined FIWs for 4 different FUs in the case of our Simple DSP). The RISC ISA clearly has the following disadvantage:
Slower: due to sequential instead of parallel FU control
On the other hand, we gain the following advantages:
less verbose than VLIW
more modular (decoupled memory & ALU, orthogonal instruction set)
The second one is the main advantage of the RISC concept that becomes very
important for super scalar architectures. In fact, RISC actually only makes
sense for super scalar architectures!
Super Scalar Architectures The idea of the super scalar architecture is the
following. The program control unit addresses and loads (i.e. fetches) multiple
successive instructions from the program memory (e.g. for the different functional units) that are thereupon executed in parallel. The parallel execution is
controlled by a dynamic instruction scheduler that analyzes and resolves dependencies in the program flow (and between input/output data) and schedules (i.e. issues) the instructions onto the available parallel FUs with the objective to keep their utilization as high as possible. If we use one RISC instruction for every
FU (in our Simple DSP case: 4), we get a very modular and fast architecture.
With this architecture we don't need to issue nop instructions to keep unused functional units idle (as we have to do with VLIW). Only functionally relevant instructions will be issued to the according FUs.
Remark: Please note that in general the number of fetched instructions and the number of parallel issued instructions need not be identical (N-fetch / M-issue). The number of fetched instructions mainly depends on the abilities/restrictions of the memory (throughput, word length, etc.) and the length
9.3.3 CISC
The Complex Instruction Set Computer (CISC) offers a different idea for reducing the word length of the (VLIW) processor instructions (and thus the code size and the number of memory accesses). Usually not every combination of the FIWs makes sense. For example we definitely don't need 2^24 instructions for our Simple DSP.
Rather, we often only need a small selection of VLIW instructions. This is the
idea that the CISC concept is based on. It defines a small opcode (e.g. 8 bits
for our Simple DSP) that is mapped onto the long VLIW (24 bits in our case)
using an internal ROM table which contains a list of the used instruction words.
I.e. the opcode is just used for addressing the actual VLIW instruction. This is
illustrated in figure 94.
Figure 94: Idea of CISC
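The ROM-table lookup can be sketched in a few lines (the table contents below are made-up placeholders, not actual instruction encodings of the Simple DSP):

```python
# Sketch of the CISC idea: a short opcode indexes an internal ROM table that
# holds only the 24-bit VLIW words actually used by the programs.

CISC_ROM = {
    0x00: 0x000000,   # nop on all functional units
    0x01: 0x0A1B2C,   # hypothetical "load ra, load rb" VLIW
    0x02: 0x03F7E1,   # hypothetical "MAC into acc1" VLIW
}

def expand(opcode):
    """Map a short CISC opcode to its full 24-bit VLIW via ROM lookup."""
    return CISC_ROM[opcode]
```

Code memory then stores only the short opcodes, while the decoder expands each of them to the full-width instruction at run-time.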
The main advantage of the CISC ISA is reduced code size and hence a more efficient usage of the code memory (and possibly fewer memory accesses). On the other hand, CISC also has some disadvantages, as mentioned before. In practice, the mapping of the functional instruction words onto the pipeline stages is not so easy and requires control by a so-called sequencer unit. A CISC ISA
9.4
9.4.1
The problem is that our DSP contains several feedback loops (DP→AGU, DP→ID, ID→PCU, etc.). As we know, the cut-set rule is actually not an appropriate means for pipelining recursive systems (see section 3). That's why we also need to insert speed-ups on those feedback signals in the opposite direction. Since speed-ups are not implementable, we run into trouble if we cannot remove them somehow. We notice that the feedback signals are actually only needed in the following situations:
in data loop: only during memory write
in control loop: only if program flow instructions (jump,...) have been
decoded
from DL→CL: only for conditional execution
In all other cases we don't need the feedback paths and our processor works fine if we just omit the speed-ups, i.e., consider the processor without feedbacks. In
section 10 we will discuss how we can deal with the problems that occur under
the situations mentioned above. At this point we want to take a closer look at the skewing triangle that has been formed due to pipelining between the data and control loop. We already know from section 7.1 that skewing triangles can often
be found in pipelined systems and how much they can influence the efficiency
of a circuit. However, with respect to processors the skewing triangle has also
a strong influence on the kind of programming.
At the output of the ID, the whole instruction word (i.e. the control signals
that are passed to the data loop) is still synchronized with the data. I.e. all
FIWs are still related to the same set of input data. This is called a data
stationary IW (refer to figure 96). After passing the skewing triangle, the FIWs
are spread over 3 time slots and synchronized with the time of processing:
FIW for AGU is passed after 1 cycle, for DM after 2 cycles, for DP after 3
cycles. The instruction word is now time stationary (see figure 96). As we will
see in the following, programming could be data or time stationary!
Figure 96: Data- & Time stationarity of instruction word
9.4.2 VLIW Pipelining
Let us start to examine the programming perspective for the VLIW ISA. In this
case the arrangement of the program memory can be considered as depicted in
figure 97. Each memory entry consists of one instruction word (IW) that in turn
is composed of (in our case) 4 FIWs. The 4 FIWs could also be considered to be
arranged in 4 parallel memories. If the FIWs are completely independent of each
other (→ orthogonal), i.e. if every single FIW contains all information necessary
to control a single functional unit, we are able to decode each FIW using a
separate instruction decoder (ID). This is illustrated in figure 97 (bottom).
Now, we can simply apply the cut-set rule, drawing local cut-sets around each
of the 4 IDs and thus move the registers from the ID output to the ID input, as
figure 98 shows.
Figure 98: Moving skewing triangle in front of IDs by applying cut-set rule
But if the independent FIWs are actually only stored in parallel memories, why
can we not just consider the delays directly for the arrangement of the FIWs in
the code? Assume that the program counter (PC), which is used for addressing
the memory, directly steps through the memory entry by entry in a sequential
order, fetching one IW after another. If we just move the lowest FIW of the IW
in figure 99 one entry to the left (orange box), it will be fetched one clock cycle
later and we can now discard the register at that memory output. Similarly,
we can move the second FIW by two entries to the left and the third FIW by
three and thus skip the whole skewing triangle7 . Finally, we relocated the IW
skewing directly into the program code that is now time stationary. Since skewed
7 the upper FIW is the program control FIW which is not passed to the data loop and thus does not pass the skewing triangle
VLIW code is very hard to read or write (especially if we also take branches
into account), we are reliant on good compilers that are able to automate the
program skewing task. The following small example demonstrates the difference
between data stationary code and equivalent time stationary (skewed) code.
cycle   data stationary             time stationary
n       acc1 = *(pa++) * *(pb)      ra = *(pa++), rb = *(pb)
n+1                                 acc1 = ra * rb
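The compiler-side skewing described in this section can be sketched as follows (a sketch under assumptions: only the three data-loop FIWs are skewed, in the order AGU, DM, DP, matching the footnote that the program control FIW does not pass the skewing triangle; the list representation is mine):

```python
# Sketch of code skewing: FIW number j of each logical instruction is placed
# j program entries later, turning data stationary code into time stationary
# code. 'None' marks slots that hold no useful FIW after skewing.

def skew(program, depth=3):
    """program: list of IWs, each a list of `depth` FIWs in (AGU, DM, DP) order."""
    n = len(program)
    out = [[None] * depth for _ in range(n + depth - 1)]
    for cycle, iw in enumerate(program):
        for j, fiw in enumerate(iw):
            out[cycle + j][j] = fiw       # FIW j takes effect j cycles later
    return out
```

The skewed program is depth-1 entries longer, which is exactly the skewing-triangle overhead relocated into the code.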
9.4.3 CISC Pipelining
As already mentioned in section 9.3.3, the idea of CISC is that only sensible
combinations of FIWs are stored in an instruction table that is addressed by
the (short) instruction word read from PM. The control loop for the CISC
processor architecture is once again depicted in the figure below. Because of
this, the FIWs cannot be assumed to be independent of each other anymore
(many FIW combinations are not allowed anymore). In other words, we use
a maximum correlation between the FIWs such that they are not orthogonal
anymore. For this reason, CISC cannot be made time stationary and always
has to be data stationary!
10 Hazards
In section 9.4.1 we assumed that we can simply discard the speed-ups in the
feedback paths when pipelining our DSP. In practice, this assumption leads
to an incorrect behavior of the DSP under certain conditions; this is called a hazard! In the following we want to separately investigate the effect of hazards
in the control loop (Control Hazards: section 10.1) and in the data loop (Data
Hazards: section 10.2) and discuss how we can deal with them. A third type of
hazards, the so called Structural Hazards, will be discussed afterwards (section
10.3).
10.1 Control Hazards
As figure 101 depicts, hazards in the control loop are introduced by removing
the (in our case: 2) speed-ups on the feedback signal from the instruction decoder (ID) to the program control unit (PCU). The feedback signal is used for
influencing the program flow depending on the currently decoded instruction
(e.g. for jumps or loops). After introducing the pipelining, it now takes two
clock cycles to decode the instruction. The decoded instruction should, however,
directly influence the program flow for the next instruction at the subsequent
clock cycle. Therefore, we let the signal travel back in time by two clock cycles
by using the speed-ups. Since this is unfortunately not a feasible realization, we need to skip the speed-ups, which leads to the problem that the feedback is not necessarily computed correctly anymore!
Figure 101: Origin of Hazards in Control Loop
Actually, this is no problem as long as the feedback isn't used by the decoded instruction, i.e. as long as the instruction doesn't influence the program flow.
E.g. linear code (c=a+b; e=c*d; f=c xor e; ...) or non-conditional instructions which have no control part (i.e. the standard program counter increment [pc++] is used).
But what if the PC jumps to a new address?
goto / jumps (→ PC controlled from ID or data path (DP))
call (→ new PC controlled from ID; current PC pushed to stack)
return (→ new PC popped from stack)
In this case the feedback signal needs to be used and carries the new PC value.
By removing the speed-ups, the new PC value now arrives too late at the PCU
(in the example of figure 101 these are 2 clock cycles). Figure 102 gives a more abstract perspective on this problem.
Figure 102: More Abstract View onto the Problem of Control Hazards
It can be clearly seen that the feedback signal b_k is not time-aligned with the input a_k anymore after introducing the pipelining and removing the speed-ups: [a_k, b_{k-2}]!
In the following we want to observe the effect of a hazard in case of a single
jump: How does it influence the program execution assuming the following
4-stage processor pipeline?
1. Prefetch (fetch setup): PC update/increment by PCU
2. Fetch: read next instruction from PM addressed by PC
3. Decode: decode instruction (ID)
4. Execute: execute instruction (AGU, DM, DP; pipelining of data loop
is not relevant here)
This is illustrated in figure 103. Therein, if we assume the green instruction to be a jump, it is prefetched at n-2, fetched at n-1 and decoded at pc cycle n. Only after the decoding has finished can the feedback loop inform the PCU about the jump so that the PC is updated accordingly. This is the case right before the red instruction is passed to the control loop pipeline. The problem is that the next two instructions (the blue one starting at pc = n-1 and the brown one starting at pc = n) are already in the pipeline of the control loop, ready to be executed. If these two wrong fetches were really executed, the program would very likely compute wrong results!
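A toy simulation makes this concrete (hedged: the single-list pipeline buffer, the ('jmp', target) encoding, and the flush-on-decode policy are modeling assumptions, not the processor's actual implementation):

```python
# Toy model of the control pipeline: an instruction needs 3 cycles from fetch
# to leaving the decode stage, so a jump is recognized only after two further
# instructions have already been fetched; those wrong fetches are flushed.

def run(program, max_cycles=20):
    pc, pipeline, executed = 0, [], []
    for _ in range(max_cycles):
        if pc >= len(program) and not pipeline:
            break
        pipeline.append(program[pc] if pc < len(program) else None)  # fetch
        pc += 1
        if len(pipeline) > 2:                 # instruction leaves decode stage
            instr = pipeline.pop(0)
            if instr is None:
                continue
            executed.append(instr)
            if isinstance(instr, tuple) and instr[0] == "jmp":
                pc = instr[1]                 # redirect the PC ...
                pipeline.clear()              # ... and flush the wrong fetches
    return executed
```

Running it on a program with a jump over two instructions shows that the two instructions behind the jump never reach the execute stage.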
Figure: linear code vs. pc jump code
What can be done to improve the performance? One possibility is offered by so-called jump tables. This is a kind of cache that stores the (decoded) target instruction after executing a jump. The next time the same jump is executed, the decoded target instruction can directly be fetched from the jump table. Hence, we can significantly speed up the execution of repeated jumps (→ loops), e.g. for our well-known convolutional sum y_k = Σ_{i=0}^{L-1} a_i x_{k-i}. Many DSPs, on the other hand, use HW loop counters with dedicated loop instructions so that the loop management is directly handled within the PCU. In this case there is no need for jump tables.
Delayed Jump (Fixing in ISA)
Another alternative to deal with control hazards is to rearrange the instruction sequence. We pick up the idea from section 10.1 to insert nop instructions after jumps in the code. However, instead of nops we want to use some functional (sensible) instructions of our program code this time to fill the hole. This approach requires that we have at least N (in our case: 2) non-pc-jump instructions^8 before our executed jump at cycle n (refer to figure 104). If this is the case, we can execute the (delayed) jump N cycles earlier and fill the hole with the sensible instructions from the program code: n-1, ..., n-N. We know for sure that these instructions can and definitely have to be executed! Consequently, the jump instruction is now readily decoded before the (originally subsequent) instruction n+1 is fed to the control loop pipeline. This means that the feedback signal from the instruction decoder is now available at the PCU right in time. The method of the delayed jump is depicted in figure 104. A small code example is given in table 4. Therein, the original code can be found in the left column and the modified code (that includes the delayed jump) in the right column.
8 i.e.
Remark: This solution may get quite complicated for conditional jumps, since the evaluation of the condition must be executed before the jump and thus has to be moved forward in the program code as well!
Pipeline Interleaving (Fixing Control at Processor Architecture)
A fourth (special) option to deal with the control hazards is supported by the
processor architecture itself. This only works, if we depart from single processor
and assume N (pipeline depth) processors instead. In this case, we could apply the concept of Pipeline Interleaving (→ section 6.2) for complete control hazard avoidance by construction [→ Sandbridge Technologies, IFX MUSIC]. Figure 105 shows how this works.
Figure 105: Use Pipeline Interleaving to Avoid Control Hazards
For this purpose we need (in our case) 3 PCUs/PCs, 3 PMs for the 3 programs and 3x the amount of registers and memories for the Pipeline Interleaving
of the data loop. Now, we can execute the instructions of the program code
of the 3 independent processors in an alternating manner. For the case of
the pipeline interleaved processor, jumps are no problem anymore, since the
decoding of any instruction is finished before the next instruction (of the same
processor) is fed to the pipeline! In the meantime, only instructions of the parallel (independent!) processors are inserted into the pipeline. The example in figure 106 shows what the data path looks like if we apply Pipeline Interleaving for processing 4 independent data sets, i.e., having a pipeline depth of 4. In this
case we need an independent output register file and address generation unit for
each data set.
Figure 106: Data path of pipeline interleaved processor for pipeline depth of 4
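The round-robin issue order of a pipeline-interleaved processor can be sketched as follows (a minimal sketch; equal-length programs and the (program id, instruction) tuple format are assumptions):

```python
# Sketch of pipeline interleaving: with a pipeline depth of N, the instructions
# of N independent programs are issued round-robin. By the time a program's
# next instruction enters the pipeline, its previous one is fully decoded, so
# control hazards are avoided by construction.

def interleave(programs):
    """programs: list of N instruction lists (assumed equal length)."""
    issue_order = []
    for step in range(len(programs[0])):
        for prog_id, prog in enumerate(programs):
            issue_order.append((prog_id, prog[step]))
    return issue_order
```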
10.2 Data Hazards
Instr. 2            Instr. 3
++pa                ra = *++pa
                    acc0 = ra + acc1

Instr. 1            Instr. 2
ra = *pa
acc0 += ra          rb = *pb
*pb = acc0          acc1 = rb
It should be mentioned here that the bypass multiplexers can consume a significant amount of the overall logic area for long (modern) processor pipelines.
Also note that the bypassing approach only works, if the internal pipeline of the
data path unit is not too long. Otherwise, the output of the DPU may already
be too late so that bypassing would not help.
10.3 Structural Hazards
The last type of hazards that we discuss in this section are the so-called structural hazards. They occur if the instructions scheduled for a single clock cycle require more HW (functional units, memory ports, etc.) than is available.
Some cases where structural hazards occur are:
1. HW not realized (i.e. # FUs is too small) → handling by exception code
2. HW not available (occupied) due to pipelining
(a) R/W port conflicts on data memory
(b) # memory ports exceeds HW limitations
Code example #1: data memory with 2 ports (→ our Simple DSP with port a (R only) and port b (R/W)), but 3 ports needed
Algorithm 4 Code example: data memory with 2 ports, but 3 ports needed
Cycle   port a/b
n       r/w
n+1     r/r
n+2     r/r
At cycle n we read an input from port a, do the computation (→ DP) and write back the result into port b after 2 clock cycles (i.e. in n+2) due to the pipelining. At the same clock cycle a new instruction is issued that requires a read access on both ports. Hence, we get a conflict (→ structural hazard) between the two read requests and the write request at cycle n+2.
Code example #2: read/write conflict
Algorithm 5 Code Example: Read/Write Conflict
Assembly view:
1: *pb++ = acc1 * acc0
2: acc0 = *pb + 0xAAAA
Pipeline view:
Cycle   Instr. 1              Instr. 2
#1      acc1 = acc1 * acc0
#2      *pb++ = acc1          rb = *pb, ra = 0xAAAA
#3                            acc0 = ra + rb
This second small code example demonstrates a similar (but more concrete) case. We see a read/write conflict at port b (*pb) in cycle #2: the writeback of instruction #1 collides with the read access of instruction #2, and simultaneous read/write is not supported on the same port! If we replace the immediate value 0xAAAA by a memory access *pa, we even get two simultaneous structural hazards, because we would then need 3 memory ports in cycle #2. However, our DSP architecture has only 2 memory ports available...
Solution: Solutions for the structural hazards are quite similar to those for the data hazards:
1. SW/tools (i.e. compiler): rearrange/delay instructions to avoid conflicts
2. Super-Scalar architecture: issue unit checks at runtime if FUs are available
3. HW: stall the pipeline until FUs are available
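Solution 3 (stalling in HW) can be sketched with a small port-reservation model (hedged: the port model with a 2-cycle-delayed writeback follows the Simple DSP of the text, but the reservation-table scheduler and its interface are my own assumptions):

```python
# Sketch of a stall unit: each instruction declares the ports it reads in its
# issue cycle and (optionally) a port it writes `delay` cycles later. Issue is
# delayed until no port is claimed twice in the same cycle.

def schedule(instrs, delay=2):
    """instrs: list of (reads, write) pairs, where `reads` is a set of port
    names and `write` is a port name or None. Returns the issue cycle of
    each instruction."""
    busy = {}                                   # cycle -> set of occupied ports
    issue, cycle = [], 0
    for reads, write in instrs:
        while (reads & busy.get(cycle, set()) or
               (write and write in busy.get(cycle + delay, set()))):
            cycle += 1                          # stall until all ports are free
        busy.setdefault(cycle, set()).update(reads)
        if write:
            busy.setdefault(cycle + delay, set()).add(write)
        issue.append(cycle)
        cycle += 1
    return issue
```

Replaying code example #1 (a read with delayed writeback to port b, followed by two dual-port reads) shows the third instruction being stalled by one cycle.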
11 Vector Processing
So far we discussed processing on scalar data. But can the block processing approach also be applied to processors? In this section different techniques for parallel processing on data sets (→ vectors) will be discussed.
11.1 Vector & Data Flow Processors
At first we want to take a quick glance at the parallel processing concept of classical vector processing machines. Recall the two techniques for pipelining a processor:
1. One processor instantiation with pipelined control and data path; bypasses etc. needed for hazard handling
2. N pipeline-interleaved instances → N-fold register/memory size (hazard free)
The idea of classical Vector Processing Machines is to exploit the repetitive
program code of loops, such as our well-known FIR sum:
y_k = Σ_{i=0}^{L-1} a_i x_{k-i}
Now, use parallel processing by an N-fold pipeline of the data path, while the program control code is static during the vector operation, as figure 109 shows. I.e. only the data path needs the massive pipelining to be able to perform the same operation on a row of multiple vector elements (at high speed), while the control path needs no expensive pipelining, since the program control word is static during the vector operation. On the software side the idea is to generate a high-level language (HLL) library of vector instructions, e.g., vectorAdd(&x, &y, L), to easily exploit the architectural features.
Figure 109: General idea of a vector processing machine
A more generic approach is offered by the so-called Data Flow Processors. One can think of this kind of processor as a pool of different FUs that can be connected in an arbitrary fashion at run-time. Thus, the data path can be configured according to the needs of the executed application. E.g. assume that we want to compute the following equation: y = Σ_i |x_i - a_i| · b_i. Figure 110 shows how the data path would be configured in this case to execute the computation within a single instruction.
The big advantage of data flow processors is that they maximize the ILP (instruction level parallelism), i.e., the number of algebraic operations executed per cycle. This makes sense especially if a computation is repeated multiple (L >> 1) times within a loop.
11.2
Now, let us make a bottom-up analysis of the different hierarchy levels of parallelism, starting with fully serial (bit-serial) processing and finally ending with the maximum level of parallelism (SIMD).
Bits: bit-serial. Figure 111 shows an example of a bit-serial adder that receives two input bits a_i and b_i and computes the sum bit s_i and the carry output c_out accordingly. Thereby, the carry bit c_out is fed back to the input c_in so that multiple bits of a parallel word can be computed in a sequential loop.
Figure 111: Example: bit-serial adder
↓
Words: bit-parallel: operate the same operator on a word-parallel ALU (e.g. multiple 1-bit adders connected in a row, i.e. carry-ripple adder)
Words: one operation on multiple input words (e.g. add, mult, OR, ...)
↓
ILP (instruction level parallelism): many ops on multiple inputs (e.g. MAC, sat(add), ...), superscalar issue
ILP: one complex ILP set of operations on scalar elements
↓
SIMD: one complex ILP set of ops on vectors
11.2.1
To give an example for SIMD vector processing we take the FIR equation as
basis:
y_k = Σ_{i=0}^{L-1} a_i x_{k-i}
We can decompose the sum into odd and even parts, which can be computed in parallel, i.e., independently of each other:

y_k = Σ_{even i} a_i x_{k-i} + Σ_{odd i} a_i x_{k-i}
    = Σ_{i=0}^{L/2-1} a_{2i} x_{k-2i} + Σ_{i=0}^{L/2-1} a_{2i+1} x_{k-2i-1}
Except for the final addition of the two sums, we use the same instruction (multiply a and x and accumulate the result) on two parallel sets of data → SIMD: Single Instruction Multiple Data! The resulting hardware structure is depicted in figure 112. The final addition of the two sums is therein realized by a multiplexer at the input of the second accumulator that is fed by the output of the first accumulator.
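The even/odd decomposition can be checked numerically against the plain direct form (a sketch assuming an even filter length L; the function names are generic):

```python
# Sketch of the even/odd SIMD decomposition: both partial sums use the same
# multiply-accumulate operation on two data sets and are added at the end.

def fir_simd(a, x, k):
    even = sum(a[2*i]     * x[k - 2*i]     for i in range(len(a) // 2))
    odd  = sum(a[2*i + 1] * x[k - 2*i - 1] for i in range(len(a) // 2))
    return even + odd     # final addition merging the two accumulators

def fir_ref(a, x, k):
    """Plain direct form FIR as the reference."""
    return sum(a[i] * x[k - i] for i in range(len(a)))
```

Both functions produce identical outputs, confirming that the two halves really are independent up to the final addition.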
Please note that the SIMD concept with two parallel ALUs differs from the classical VP machine implementation, where we have only a single ALU that is N-fold (in this case 2-fold) pipelined, as figure 113 shows.
The problem of the SIMD solution from figure 112 is that the necessary memory bandwidth has also been doubled compared to the non-parallel solution to feed all the inputs. We need to pass 4 inputs per cycle (a_{2i}, x_{k-2i}, a_{2i+1}, x_{k-2i-1}) instead of 2 inputs per cycle for the non-parallel solution (a_i, x_{k-i}). Figure 114 emphasizes this by a direct comparison of the bandwidth required for inputs/outputs of the SIMD processor and the pure hardware solution.
A more efficient approach is the following: let us compute the sum sequentially as before, but instead compute multiple outputs (y_k and y_{k-1} for N=2) in parallel:

y_k     = Σ_{i=0}^{L-1} a_i x_{k-i}
y_{k-1} = Σ_{i=0}^{L-1} a_i x_{k-i-1}
Hence, we recognize that only a single input changes between every two computations!
Figure 116: General idea of Zurich Zip
Finally, we gained N-fold processing power by applying SIMD with Zurich Zip, but we need a skewing triangle consideration here! As figure 117 shows, we get N-1 setup cycles and N-1 completion cycles for an N-fold parallelization to fill/flush the register chain. During the initialization and flushing phase, we cannot really do full parallel SIMD processing. Hence, the parallelization is only effective if the parallelization factor is much lower than the number of loop cycles (L), so that the overhead of the init/flush phase carries no weight: N << L. The effective parallelization factor, taking the overhead of the initialization and flushing phase into account, can be determined as follows:

P = N·L / (L + 2N - 2)
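A quick numeric check of this formula (plain Python, no assumptions beyond the formula itself) confirms that P approaches the ideal factor N for N << L and falls well below it when N is close to L:

```python
# Effective parallelization factor of zipped SIMD processing,
# P = N*L / (L + 2N - 2): the 2N-2 term is the fill/flush overhead.

def effective_p(n, l):
    return n * l / (l + 2 * n - 2)
```

For example, effective_p(2, 1000) is nearly 2, while effective_p(4, 4) is only 1.6.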
                  zipped processing
latency           L + N - 1
speedup P         P = N·L / (L + N - 1)
reads/cycle       2·L / (L + N - 1)
writes/cycle      P / L
The methods presented in the previous section are not the only options to realize
SIMD processing. A different approach for the SIMD FIR implementation is
provided by the so called partial transposed form (i.e. a mixture of the direct
form and the transposed form FIR filter). The structure is illustrated in figure
118. It is easy to see that the computation of this 6-tap filter can be realized
by 3 parallel data path units with internal ILP. All DPUs exhibit the same
structure, as depicted in figure 119, so that hardware reuse is enabled.
This small example should demonstrate that there is often a large number of
alternatives of how to handle data.
Finally, we can also generalize the concept of the partial transposed form FIR for different algorithms, since:
the operator doesn't matter
the operator doesn't even have to be distributive

y_k = ⊕2_{i=0}^{L-1} (a_i ⊕1 x_{k-i})   (with generic operators ⊕1, ⊕2)

For example, the partial transposed form is also possible for:

y_k = Σ_{i=0}^{L-1} |a_i - x_{k-i}|   or   y_k = Σ_{i=0}^{L-1} |a_i - x_{k-i}|²
Figure 119: Reusable DPU for Parallel Processing of Partial Transposed FIR
11.2.3 Generalization
Finally, we conclude the SIMD topic with a general consideration. Figure 120
shows a general view onto SIMD processing. Therein, two (or more general:
multiple) N-element vectors are passed to N parallel data processing paths (DP).
Figure 120: General View Onto SIMD Processing
The first problem is illustrated on the left-hand side of figure 121. A cyclic shifter can be used to rearrange the memory output. However, the shifter (→ cross-bar switch) usually consumes a lot of area. For many operations the cyclic shift is, however, not even necessary. E.g. in case of an FFT (right-hand side of figure 121), we can simply omit the cyclic shifter due to its periodicity property (cyclic operator)! But in this case we need to store a different set of twiddle factors for each possible vector offset (see figure 121).
Figure 121: Problem (1): Input Vector not Aligned in Memory
The second problem is addressed in figure 122. Actually, this is no real problem,
since we can just split the computation over multiple clock cycles. However, we waste some of the available computation power during the last cycle if the input vector length is not a multiple of N. In the illustrated case some of the
DPUs are idle (perform nop) during the second cycle.
Figure 122: Problem (2): Vector Length not Matching Memory Width
The overall summary and generalization of the SIMD concept is given in figure
123. Therein, a general shuffle network (e.g. as cyclic shifter to address the
memory alignment problem or as shuffle network to perform algorithms such as
FFT) has been considered. Shuffle operations on vectors with size > N have to
be realized through addressing!
Figure 123: Summary & Generalization of SIMD Concept
12
The right side of figure 125 shows the 3 main challenges of the scheduling problem, depicted as 3D mapping space:
1. MP challenge: mapping tasks on PEs
2. Memory allocation & coherence: Which data to store where to avoid long
data transfers between global and local (e.g. PE cache) memory? How
to keep the data coherent, i.e., how must the results be merged, if tasks
work on different memories?
3. On-chip interconnect (Network-on-Chip, NoC): How to schedule tasks to
make data/program transfers as fast as possible (short routes, uniform
load, etc.)?
12.1
To handle the scheduling problem, formal tools can be used to model the
problem and find a solution. Kahn Process Networks (KPN) are a very simple
graph based model that can be used to model software tasks (or processes) and
their interdependencies. Therein, tasks/processes are modeled by graph nodes,
called actors, as depicted in figure 126. The arcs between the nodes are used
to model the interdependencies between the tasks or the task inputs/outputs,
respectively. Feedback loops are also allowed. An actor can be executed (→
fired) as soon as all input conditions are satisfied. A valid output is created
on every output edge when an actor is fired (note that empty outputs are also
possible). Almost any software can be described by a KPN!
Figure 127 summarizes some problems that can be addressed and examined by
means of the KPN model.
Figure 127: Selection of design problems to be modeled and examined by KPN
12.2
A sub-class of the graphical KPN model is the so-called Synchronous Data Flow
(SDF). In contrast to the more general KPN, all inputs are assumed to switch
time-synchronously, i.e., we have discrete processing times and a discrete amount
of data that flows between the actors. I.e. every time an actor is fired:
- a given # of elements is consumed and
- a given # of elements is produced.
The number of consumed/produced elements is thereby time independent (i.e.
constant at run-time)! This model perfectly suits the synchronous (i.e.
clocked) digital systems that are commonly used today! Figure 128 shows a
very simple SDF with one input actor and two output actors, where both
output actors depend on the output of the input actor.
Moreover, SDF supports different (rational) sample rates. However, the
sample rates must be constant and known a priori. In this case, we can annotate
the amount of data that an actor consumes (at the inputs) and produces (at the
outputs) to model different sample rates. A small example is given in figure 129.
Therein, the actor is ready and can be fired if input 1 has at least 2 elements in
its queue and the queue of input 2 contains at least 1 element. The according
number of tokens is consumed when the actor is fired, while 3 elements are
produced at its output edge.
Figure 129: Small SDF examples with annotated # of elements consumed/produced
A second, more generic example can be found in figure 130. The actor consumes
a data amount of a from its input 1 and b elements from input 2 when
fired and produces c and d tokens at its outputs. The inputs of this actor are
provided with rates R_1 and R_2 tokens per clock cycle. What is the average
firing rate F of this actor?
According to input 1 we could fire at a rate of R_1/a per cycle.
According to input 2 we could fire at a rate of R_2/b per cycle.
Therefore:

    F = min(R_1/a, R_2/b)
Thereby, the firing rate of the actor determines the sample rates at its outputs:

    R_3 = F · c
    R_4 = F · d

To make the SDF from figure 130 a valid (i.e. executable) schedule we have to
ensure that no buffer overflow occurs at the inputs. This is true if the following
condition is fulfilled:

    R_1/a = R_2/b.
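As a minimal sketch (with hypothetical numbers for R_1, R_2, a and b), both the firing-rate formula and the overflow condition can be checked in a few lines:

```python
from fractions import Fraction

def firing_rate(rates, consumed):
    """F = min over inputs of R_i / c_i: the slowest-filling queue limits the actor."""
    return min(Fraction(R) / c for R, c in zip(rates, consumed))

def rate_consistent(rates, consumed):
    """No input buffer overflows iff all inputs allow the same firing rate."""
    ratios = [Fraction(R) / c for R, c in zip(rates, consumed)]
    return all(r == ratios[0] for r in ratios)

# Hypothetical numbers: R1=6, a=2 and R2=4, b=2 give F = min(3, 2) = 2,
# but the rates are inconsistent (3 != 2), so input 1 overflows eventually.
print(firing_rate([6, 4], [2, 2]))       # 2
print(rate_consistent([6, 4], [2, 2]))   # False
print(rate_consistent([6, 4], [3, 2]))   # True: 6/3 == 4/2 -> balanced
```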
Figure 131 gives a slightly more complex SDF example with 3 actors and data
amount annotations.
The question that now arises is: Does this work? / Is this SDF executable? We
cannot answer this question, since it depends on the concrete amount of data
consumed and produced, which has only been annotated in abstract form in
figure 131! Let us examine a concrete instance of this SDF example, shown in
figure 132. Therein, we see that the first actor produces 2 tokens on its output
edge to the second actor each time it is fired, while 1 token is produced on its
output edge to the third actor. Now, the second actor can be fired once, consuming
1 token from the first actor at its input and producing 1 token at its output.
Finally, the third actor can be fired once, consuming 1 token from the first actor
and 1 token from the second. Now, we have finished a single period and start from
the beginning. The problem, however, is that there is still 1 token left in the
input buffer of the second actor, since the first actor has produced 2 tokens on
this arc before! If we repeat this game continuously, we end up with a buffer
overflow (assuming a real system with limited buffer size), which in turn leads
to a memory deadlock, since the producing actor cannot be fired anymore in this
case. This means that the SDF is not executable due to a sample rate inconsistency!
Figure 132: Example for SDF with sample rate inconsistency
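The token bookkeeping of this example can be replayed in a short simulation (the labels A, B, C are illustrative names for the three actors of figure 132):

```python
def run_periods(periods):
    """Token count on each arc after repeating the schedule (A, B, C)
    for the given number of periods. Arcs: A->B, A->C, B->C."""
    ab = ac = bc = 0
    for _ in range(periods):
        ab += 2; ac += 1      # fire A: produces 2 on A->B, 1 on A->C
        ab -= 1; bc += 1      # fire B: consumes 1 from A->B, produces 1 on B->C
        ac -= 1; bc -= 1      # fire C: consumes 1 from A->C and 1 from B->C
    return ab, ac, bc

print(run_periods(1))   # (1, 0, 0)  -- one token stuck on A->B
print(run_periods(10))  # (10, 0, 0) -- unbounded growth -> buffer overflow
```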
From the sample rate of the actor that depends on multiple different inputs we
can now derive the condition for a balanced schedule w/o deadlocks. With the
input rate R_in, the firing rates of the three actors must balance along every
arc:

    F_1 = R_in / b
    F_2 = (c/e) · F_1
    F_3 = (d/f) · F_1 = (i/g) · F_2

The two expressions for F_3 must agree, which yields the balance condition

    d · e · g = c · f · i.
Since this method can be quite cumbersome for large networks, we need a more
convenient and generic method to deal with this. Luckily, this can be provided
by using a mathematical representation of the SDF graph, the so-called topology
matrix Γ. It is constructed by numbering each node and arc, as in Fig. 134, and
assigning a column to each node and a row to each arc. The matrix element
Γ(b, a) contains the number of tokens produced by node a on arc b each time it
is invoked. If node a consumes data from arc b, the number is negative, and
if it is not connected to arc b, then the number is zero.
Correspondingly, we can now define for the example in Fig. 134:

                 node 1  node 2  node 3
    Γ =  arc 1 (   c      -e       0   )
         arc 2 (   d       0      -f   )
         arc 3 (   0       i      -g   )
The condition that must be fulfilled for the SDF to be rate consistent is simple:

    rank(Γ) = s - 1,
Figure 134: SDF graph showing the node and arc numbering. The input and
output arcs are ignored for now.
where s is the number of actors. That means the matrix must be rank deficient,
i.e. it has to have fewer linearly independent rows than columns (→ underdetermined
linear system of equations). Check it for yourself for the example
from figure 132. You will find that all 3 rows are linearly independent in this case!
The rank deficiency condition is based on the following idea: Assume that our
3 actors are fired F = (F_1, F_2, F_3)^T times during a single period of a schedule.
Then the following equation yields the final buffer state B = (B_1, B_2, B_3)^T at
the end of the schedule period: Γ·F = B. We already know that B must be
0 at the end of a schedule period for the SDF to be rate consistent. Hence:
Γ·F = 0. We will always be able to find a solution for this system of
equations, namely F = 0. However, we require F_i > 0 to result in a reasonable
schedule. Consequently, to find an F ≠ 0 we need at least a second solution.
Hence, the system of equations must be under-determined and therefore it needs
to be rank deficient.
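A minimal rank check over exact rationals illustrates both cases. The matrix gamma_bad below corresponds to the inconsistent instance from figure 132 (2 and 1 tokens produced, 1 token consumed on each arc); gamma_ok is a hypothetical consistent variant in which the first actor produces only 1 token per firing:

```python
from fractions import Fraction

def rank(M):
    """Matrix rank via Gaussian elimination over exact rationals."""
    M = [[Fraction(x) for x in row] for row in M]
    r = 0
    for col in range(len(M[0])):
        piv = next((i for i in range(r, len(M)) if M[i][col] != 0), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        for i in range(len(M)):
            if i != r and M[i][col] != 0:
                f = M[i][col] / M[r][col]
                M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
    return r

# Inconsistent instance (figure 132): rank = s = 3 -> not schedulable.
gamma_bad = [[2, -1,  0],
             [1,  0, -1],
             [0,  1, -1]]
# Consistent variant: rank = s - 1 = 2, and the repetition vector
# F = (1, 1, 1) solves gamma_ok . F = 0.
gamma_ok  = [[1, -1,  0],
             [1,  0, -1],
             [0,  1, -1]]
print(rank(gamma_bad), rank(gamma_ok))  # 3 2
```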
We could generalize the SDF concept to stochastic queues with variable input
rates. In this case we would simply take the average rate as our constant data
rate in the SDF, e.g., a = E [a] and b = E [b]. This approach will work as long
as the variance of the stochastic input process is small compared to the queue
size, such that no overflow occurs.
12.3
Now that we have discussed the question of checking the SDF for feasibility (rate
consistency), we will address in this sub-section the question of how to actually
create a schedule from the SDF for a single- or homogeneous multi-processor
architecture. Let's start off with the small SDF example presented in figure
135 (left). It consists of 3 actors with dependencies in a feedback loop. A first
check confirms that there are no rate inconsistencies in this graph, so that we
can continue with it. Please note that we need buffer initialization in case of
feedback loops. Otherwise, we would get a deadlock right at the beginning. This
is because every actor is dependent on the output of another actor, such that
no actor can start firing! By the buffer initialization we avoid this deadlock and
allow certain actors to start. We also need to consider the actor processing
duration, which has been chosen to be equal to the node number of the actor for
simplicity (figure 135, right).
Figure 135: Left: Example SDF with loop for MP scheduling; Right: processing
delays
The two actors #1 and #3 can directly be fired at the beginning (due to the
buffer initialization). Actor #1 can even be fired twice right at the beginning.
After actor #1 has been fired twice, actor #2 can be fired, since it consumes
two tokens from the output of actor #1. Now, we are already done with the
construction of the precedence graph, since every actor was fired at least once!
From the precedence graph we can easily generate different single processor
scheduling alternatives, as depicted in figure 137.
Figure 137: Possible schedules for P=1 processors
All these options are reasonable schedules. The question is: Are some of these
schedules better than others? In fact, this is the case, since the two green
encircled alternatives make use of the data locality principle, as shown in figure
138. This means that the actors are directly executed in the order of their
dependencies. Remember: A data dependency from one actor to another means that
the dependent actor needs the output data of its predecessor as input. If the
dependent actor is directly executed after its predecessor on the same PE, it
can reuse the results in the local memory. If a different actor is executed on
the PE instead, the results must be moved to a global memory and transferred
back to the local memory before the dependent actor is executed → two unnecessary
data transfers!
Figure 138: Best scheduling options that exploit data locality
Let's continue with the scheduling example for two processors. We can again
easily find valid schedules from our precedence graph. Figure 139 presents two
alternatives. Although these are reasonable schedules, we recognize that they
are not entirely optimal, since we have to schedule nops in between and thus do
not fully utilize the two processors. From the efficiency perspective it looks as
follows: We know that E ∝ 1/(A·T). If we normalize A to the area of a single
processor, we get A = P = 2. If we further normalize the processing rate (i.e.
the length of a single schedule period) to that of a single-processor solution,
we get the speedup

    S = 1/T = (schedule period for P=1)/(schedule period for P=2) = 7/4.

Finally, we get E ∝ (7/4)·(1/2) = 7/8 < 1. This means that we are less efficient
compared to the single-processor solution!
To be able to fill the gaps in the schedules of figure 139, we have to alternate
the processors between adjacent periods. For this purpose we need to extend
the period of our precedence graph to J=2 (i.e. every actor is fired at least
twice). The resulting graph is presented in figure 140 (the brown indices provide
information on the corresponding schedule period).
Figure 140: Precedence graph (J=2) for SDF example
From the extended precedence graph we can now again easily derive a schedule
for P=2 processors. As figure 141 shows, the gaps could be removed from the
schedule.
Figure 141: Schedule for P=2 processors considering J=2 periods
Finally, we were able to find an optimum schedule for the P=2 case with an
efficiency comparable to that of the optimum single-processor solution:

    E = S · (1/P) = (14/7) · (1/2) = 1.
What about the case P=3? Figure 142 shows the corresponding schedule (J=1).
The efficiency is further decreased compared to P=2:

    E = S · (1/P) = (7/3) · (1/3) = 7/9 < 1.
From the precedence graph (J=2) in figure 140 we can estimate that it does
not make sense to consider J>2, since we are not able to initially schedule more
than the three actor firings (#1, #1, #3) in parallel and thus will not achieve a
higher degree of parallelism.
Figure 142: Schedule for P=3 processors
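The three efficiency results above can be reproduced with a small helper (the schedule periods are those of figures 139, 141 and 142, scaled to the same number of periods J):

```python
from fractions import Fraction

def efficiency(period_p1, period_p, P):
    """E = S / P with speedup S = (single-processor period) / (P-processor period).
    Both periods must cover the same number J of schedule iterations."""
    S = Fraction(period_p1, period_p)
    return S / P

print(efficiency(7, 4, 2))    # 7/8  (P=2, J=1: gaps in the schedule)
print(efficiency(14, 7, 2))   # 1    (P=2, J=2: gaps removed)
print(efficiency(7, 3, 3))    # 7/9  (P=3, J=1)
```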
Instead, let us slightly modify our SDF example by simply introducing an additional initialization buffer at the input of actor #3. Now, actor #3 can be
fired two times initially! This has quite an impact on our schedule, since it now
indeed makes sense to go for the J=3 precedence graph, as figure 143 (right)
shows! Now, we find three independent precedence trees in the graph and can
simply build an optimal schedule for P=3 processors out of it (E = S/P = 1).
Figure 143: Slightly modified SDF example (left) with precedence graph for
J=3 (right)
13
Many modern applications, such as mobile communication or multi-media, require
a lot of computation power. Therefore, contemporary hardware architectures tend
to integrate multiple processors, memories and hardware accelerators
on a single chip, so-called Multi-Processor Systems-on-Chip (MPSoC). That's
why we want to discuss the main challenges related to MPSoCs in the scope
of this section. Figure 144 summarizes the software/hardware mapping (i.e.
scheduling) problem that we already discussed in the previous section. Software
tasks need to be assigned to PEs/memories in space and time under certain
constraints (e.g. execution time/deadlines, power consumption, data locality,
etc.). Furthermore, the on-chip communication (NoC) must be considered: e.g.
inter-processor or processor-memory. We can distinguish static and dynamic
mapping approaches, as discussed in the following.
13.1
Programming Model
The scheduling problem has been addressed in the previous section using the
SDF graph model. However, this model can only be used to generate static
schedules that must be known at software compile time. Modern applications
often exhibit a very dynamic behavior that is dependent on input data or user
interaction. E.g. writing an SMS requires much less computational power from the
mobile phone than making a video conference. On the other hand, under bad
channel conditions the mobile phone cannot transmit at high data rates and thus
the computational effort is also decreased in this case. These examples should
make clear that static scheduling with SDF is often not sufficient for modern
applications, especially for mobile communication and multi-media. In this case
we need to deal with dynamic data flow (DDF), which is also a sub-class of the
general KPN. Let us first summarize the joint properties of SDF and DDF:
KPN ⊃ SDF
1. Actors 1 & 2 (actors work on independent memory areas) → one global memory o.k.
2. Actors 1 & 3 (actors work on overlapping memory areas) → sequential access
   to memory o.k.; 1 & 3 with write-back → modify problem to work on the same
   memory!
3. As in case 2, but the input/output array of actor 3 is chosen dynamically
   (→ image processing)
Figure 145: Image processing as example for dynamic scheduling
This small example should demonstrate a case where we are not able to apply
static scheduling, since the dependencies, the execution order and even the
memory mapping depend on which area of the image actually has to be processed by
the two actors, which in turn is dynamically selected at run-time. This
emphasizes the need for a dynamic (run-time) scheduler for:
- handling hazard conflicts in memory
- mapping tasks onto cores (time/space)
- reserving interconnects (NoC)
To allow dynamic scheduling at run-time we need a special programming model
that allows us to divide the program into sub tasks which can be issued in
arbitrary execution order (considering the data dependencies) by the run-time
scheduler onto appropriate PEs. For this purpose we have to describe the actors/tasks as shown in the following:
Algorithm 6 Programming model for MPSoC with run-time scheduler
call actor(in_0, in_1, ...; out_0, out_1, ...)
actor
...
end of actor
The memory addresses of the inputs and outputs can be used by the run-time
scheduler to identify data dependencies (i.e. overlapping memory areas). Based
on the dependency analysis the scheduler is able to find an admissible execution
order with respect to the targeted optimization strategy (speed, power, etc.).
The actual program code of the actor is stored in the global memory and transferred to the local memory of the selected PE when the actor is ready to be
executed. Figure 146 shows the MPSoC programming model from the PE point
of view. The processor receives the input data and control from a global memory, as well as the program code that matches the PE type. Certain actors may
be suited to be executed on different PE types. In this case different versions of
the same program code must be located in the global memory (the appropriate
version of the program code is selected by the scheduler). A local scratch pad
memory is used by the PE to store intermediate results. The actual outputs of
the computation are transferred back to a global memory (unless the data can
be reused by the subsequently scheduled actor → data locality).
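A dependency analysis based on overlapping memory areas might look as follows; the task names and address ranges are purely illustrative, not taken from the script:

```python
def overlaps(a, b):
    """Do two half-open address ranges (start, end) overlap?"""
    return a[0] < b[1] and b[0] < a[1]

def find_dependencies(tasks):
    """tasks: list of (name, inputs, outputs); each input/output is an address
    range in global memory. An earlier task's output overlapping a later task's
    input (read-after-write) or output (write-after-write) forces ordered
    execution -- a minimal sketch of what a run-time scheduler could check."""
    deps = []
    for i, (n1, _, outs1) in enumerate(tasks):
        for n2, ins2, outs2 in tasks[i + 1:]:
            if any(overlaps(o, r) for o in outs1 for r in ins2 + outs2):
                deps.append((n1, n2))
    return deps

tasks = [
    ("fir",  [(0, 64)],    [(64, 96)]),    # writes 64..96
    ("fft",  [(64, 96)],   [(96, 160)]),   # reads fir's output -> RAW dependency
    ("disp", [(200, 232)], [(232, 240)]),  # independent memory area
]
print(find_dependencies(tasks))  # [('fir', 'fft')]
```

Tasks without detected dependencies ("fir" and "disp" here) may be issued to different PEs in any order.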
Let us examine the programming model for our FIR example a little bit closer
to understand the consequences of the input/output procedure:
    y_k = Σ_{i=0}^{L-1} a_i · x_{k-i}
Assume that our actor computes the FIR in direct form for M steps in parallel,
as depicted in figure 147 for the case L=4 and M=3.
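A minimal sketch of such an M-step actor (plain Python, hypothetical coefficients and samples): per call it must receive the L-1 old input samples together with the M new ones, i.e. L+M-1 values of input state.

```python
def fir_m_step(a, x_window, M):
    """Compute M consecutive direct-form FIR outputs
        y[k+j] = sum_{i=0}^{L-1} a[i] * x[k+j-i],  j = 0..M-1.
    x_window must hold the L-1 old samples followed by the M new ones
    (L+M-1 values, oldest first) -- exactly the state the actor must
    fetch from global memory on every call."""
    L = len(a)
    assert len(x_window) == L + M - 1
    return [sum(a[i] * x_window[L - 1 + j - i] for i in range(L))
            for j in range(M)]

# L=4, M=3 as in the text: 6 input samples produce 3 outputs.
a = [1, 2, 3, 4]
x = [1, 0, 0, 1, 0, 0]   # 3 old samples + 3 new ones
print(fir_m_step(a, x, 3))  # -> [5, 2, 3]
```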
MPSoCs and differs from the assumption that we made before that our current
state is always locally available in the PE. Figure 148 briefly summarizes our
observations for the FIR filter.
Figure 148: M-step FIR example
13.2
Task Scheduling
One of the main challenges of MPSoCs is the task scheduling that is briefly
addressed in this section. As already mentioned before, we distinguish two
cases: static (1) and dynamic (2) scheduling.
1. Static Task Scheduling: This case is easier to handle, since we can compute
the schedule at compile time using the common SDF model that has
been presented in section 12. Figure 149 shows a simple example to create an
ASAP (as soon as possible) or ALAP (as late as possible) schedule given an
SDF graph with annotated execution times as input. Using the ASAP policy,
actor a can start immediately at t=0. Actor b can be fired at t=200 and is
scheduled to the still idle PE #2. Actor c can directly be fired at t=300 after a
has been completed and is executed on PE #1. Due to its long run-time of 800
cycles, actor b occupies PE #2 nearly up to the end of the schedule, so that the
subsequent actors must be executed sequentially on PE #1: d at t=400 and e
at t=700. Finally, after b has finished, the output actor f can be scheduled on
PE #2 at t=900. The schedule period is completed at t=1000. For the ALAP
schedule we simply apply the same procedure in reverse order, i.e. starting with
actor f → b → e → ....
Figure 149: Example for static MPSoC schedule
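An ASAP list scheduler of this kind can be sketched generically; the task graph below is hypothetical (not the one from figure 149), and ties between ready tasks are broken arbitrarily:

```python
import heapq

def asap_schedule(tasks, deps, P):
    """ASAP list schedule on P identical PEs.
    tasks: {name: duration}; deps: (predecessor, successor) pairs.
    Returns {name: (start_time, pe)}."""
    preds = {t: set() for t in tasks}
    succs = {t: [] for t in tasks}
    for a, b in deps:
        preds[b].add(a)
        succs[a].append(b)
    finish, sched, done = {}, {}, set()
    pe_free = [(0, pe) for pe in range(P)]   # (time the PE becomes free, PE id)
    heapq.heapify(pe_free)
    ready = [t for t in sorted(tasks) if not preds[t]]
    while ready:
        # fire the ready task whose input data is available earliest
        ready.sort(key=lambda t: max((finish[p] for p in preds[t]), default=0))
        t = ready.pop(0)
        data_ready = max((finish[p] for p in preds[t]), default=0)
        free_at, pe = heapq.heappop(pe_free)
        start = max(data_ready, free_at)
        sched[t] = (start, pe)
        finish[t] = start + tasks[t]
        heapq.heappush(pe_free, (finish[t], pe))
        done.add(t)
        for s in succs[t]:
            if preds[s] <= done and s not in ready:
                ready.append(s)
    return sched

tasks = {"a": 3, "b": 8, "c": 2, "d": 3, "e": 2, "f": 1}
deps = [("a", "c"), ("a", "d"), ("c", "e"), ("b", "f"), ("e", "f")]
sched = asap_schedule(tasks, deps, 2)
print(sched)
print(max(sched[t][0] + tasks[t] for t in tasks))  # makespan = 11 here
```

The ALAP variant would apply the same procedure on the reversed graph, as described in the text.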
2. Dynamic Task Scheduling (CoreManager): For the case of dynamic scheduling we need a run-time scheduler that we call Core Manager. It is responsible
for fetching a bunch of tasks (actors) into its local buffer. Now, the Core Manager has to distinguish two cases:
1. task is ready to be fired → there are no dependencies on other running
tasks → fire as soon as possible (i.e. when the next PE is available)
2. task is almost ready to be fired → "almost" means that the actor inputs
are still being calculated by other actors (currently running tasks or in the
buffer to be executed); the Core Manager must check the dependencies to find a
valid execution order
The Core Manager concept is depicted in figure 150. Therein, the microprocessor
(µP) controls the top-level program flow (i.e. executes the main
routine) and pushes task descriptions (for tasks to be scheduled onto PEs) into
the task queue of the Core Manager. The Core Manager analyzes the dependencies between scheduled tasks and forwards them in an admissible order to the
processing elements to be executed. For this purpose, the program code as well
as the input data is transferred from the global memory to the local memory of
the corresponding PE (usually by means of a DMA). After the PE has finished the
task execution, it informs the Core Manager that it is available for new tasks.
The result of the computation is transferred back to the global memory, or kept
in local memory to be used further by a subsequent task (→ data locality).
Figure 150: Core Manager concept for dynamic scheduling
Thereby, the Core Manager can pursue different optimization strategies; the issue
decision can be controlled for:
speed of completion
power consumption
locality of data/program
Let's give a small example to motivate this. Assume a PE where currently an
actor is running. The Core Manager now has to decide which of the actors that
are available in its buffer should be issued next to that PE. With respect to the
data locality principle, the following facts influence this decision:
Can input data be reused?
Can output data be reused?
Can program code be reused?
In consequence, memory transfers are minimized, ...
NoC load and memory R/W access is reduced and finally ...
energy consumption is minimized!
13.3
Network-on-Chip
Finally, let's take a look at the small example in figure 152. We see a NoC
arranged in a 2D-mesh topology of size 4x4. We recognize that this is a heterogeneous MPSoC, since different PE types (illustrated by circles and rectangles)
are connected to the router nodes (small bubbles). E.g. one of the nodes could
be used as central micro-processor to run the main code. Other PE nodes could
be used for central managers, such as the Core Manager or the NoC manager (a
central control instance for the link allocation in the network). Furthermore, we
find two successfully established routes in the NoC (green lines). A third connection request was not successful (red line), since the route is already blocked
by another crossing route.
Figure 152: 4x4 2D-Mesh example for Network-on-Chip
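Link allocation in such a mesh can be sketched with dimension-ordered (XY) routing, one common deterministic policy; the routing scheme and the coordinates below are assumptions for illustration, not taken from the figure:

```python
def xy_route(src, dst):
    """Dimension-ordered (XY) routing in a 2D mesh: move along x first,
    then along y. Returns the list of directed links the route occupies."""
    (x, y), (dx, dy) = src, dst
    path = []
    while x != dx:
        nx = x + (1 if dx > x else -1)
        path.append(((x, y), (nx, y))); x = nx
    while y != dy:
        ny = y + (1 if dy > y else -1)
        path.append(((x, y), (x, ny))); y = ny
    return path

def try_reserve(route, busy):
    """Circuit-switched link allocation: succeed only if no link is taken."""
    if any(link in busy for link in route):
        return False
    busy.update(route)
    return True

busy = set()
print(try_reserve(xy_route((0, 0), (2, 2)), busy))  # True  -- route established
print(try_reserve(xy_route((1, 0), (3, 0)), busy))  # False -- link (1,0)->(2,0) taken
print(try_reserve(xy_route((0, 1), (0, 3)), busy))  # True  -- disjoint links
```

A central NoC manager, as mentioned above, would perform exactly this kind of bookkeeping for all connection requests.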
This brief insight should give a small impression of the on-chip interconnection
challenge.
13.4
Heterogeneous MPSoC
This MPSoC type can consist of multiple different PEs (e.g. x86s &
ARMs) → refer to figure 151 in section 13.2.
One example with multiple DSPs and ARM processors is the Qualcomm
MSM, a cellular modem chip.
A typical application example that can exploit heterogeneous MPSoC is
the recent mobile communication standard LTE (Long Term Evolution).
In corresponding modems for this standard we typically find:
~10 ARMs (3 kinds)
~10 HW accelerators (e.g. filters, decoders, etc.)
~10 DSPs
~50 memories
13.5
Hierarchy
New challenges come up with the increasing size of MPSoCs. If we take a look
at state-of-the-art MPSoCs, we usually find a few dozen or maybe a hundred
processors. In the near future, we will already find multi-processor chips with
thousands of PEs, usually referred to as many-core systems-on-chip or FPPA
(field programmable processor array → processor arrays with programmable
interconnects, analogous to today's FPGAs [field programmable gate arrays]).
Assuming such a large number of PEs, conventional topologies will not be
sufficient anymore. Hierarchical network topologies with PEs pooled in clusters
and dynamically controllable cluster sizes might be a solution to handle such
large SoCs (→ figure 154).
Finally, we can also stack multiple MPSoCs together on a higher level of hierarchy. This could either be done on one plane (2D) or by stacking multiple
MPSoCs on top of each other (3D).
Figure 155: Hierarchy of multiple MPSoC
Design-Space Exploration
Another big challenge is the design of an efficient MPSoC for a specific problem.
This task is the subject of the so-called Design Space Exploration (DSE), which
systematically evaluates a large number of different architecture and mapping
alternatives with respect to their suitability for a given computation problem
(e.g. signal processing chain for a UMTS modem). Broadly speaking, we can
distinguish three different approaches to start off with a DSE:
1. greenfield approach: given a system problem → find a HW/SW solution
→ free choice of hardware architecture, interconnect, etc., software
partitioning and HW/SW mapping
Outlook 2020
Finally, we want to give a brief outlook concerning the development trends of
MPSoCs within this section. Today's chips consist of a few dozen or hundreds
of processors. Here are two examples:
1. nVidia: 500 PEs on a chip in 40 nm
2. Tomahawk chip: 14 cores in 130 nm
A strong growth of the degree of parallelism is predicted already for near-future
many-core chips:
- 2020: ~ 0.1M - 1M cores
- 2030: ~ 1B cores(!)
This means that the predicted number of cores in 2030 is comparable to
the number of transistors in today's chips!