Skript HWSW Codesign

Contents

1 Introduction
...
7 On-Chip Communication
7.1 Bit-level communication
7.2 Perfect Shuffle
7.3 De Bruijn Networks & Shuffle-Exchange
8 Measures
8.1 Basics
8.2 Measure of Complexity
8.3 Wordlength Analysis
8.4 M-Step Analysis
8.5 ATE Measure
...
10 Hazards
10.1 Control Hazards
10.2 Data Hazards
10.3 Structural Hazards
11 Vector Processing
11.1 Vector & Data Flow Processors
11.2 SIMD: Single Instruction Multiple Data
11.2.1 Example FIR & Zurich Zip
11.2.2 Transposed FIR, partial
11.2.3 Generalization
Introduction
2.1 Basic Principles

For the purpose of pipelining a system we can make use of a graphical representation (flow graph). Flow graphs basically consist of two types of elements: operators (i.e. computational or logic functions) and delay elements (i.e. registers, flip-flops, memories, etc.), as figure 1 shows. An operator is time-invariant if its transfer function does not depend on time. In other words, if the inputs to the operator are time-shifted, then the outputs of the operator stay the same and are time-shifted by the same amount as the inputs. For example, an operator with inputs a_k, b_k and output y_k realizing the function

y_k = a_k o b_k    (1)

Registers can either be moved from the outputs of an operator to the inputs...
2.2
Cut-Sets
Assume:
- a time-clocked signal-flow graph (SFG)
- time-invariant node operators
- discrete time delays (usually a delay is given as a multiple of a clock cycle)

Def: A cut of a graph is a partitioning of the graph nodes into two disjoint subsets.
Def: The cut-set of the cut is the set of graph edges whose endpoints belong to different graph partitions.
Def: Cut-set rule:
1. Construct a closed cut ...
2. Move discrete delays from branches entering the cut-set and vice versa by inserting delays (registers) on the one side and corresponding negative delays (speed-ups) on the other side.
A more general example is in Fig. 5. Here the cut-set is first constructed around the node. Two realizations of retiming schemes are depicted there.
2.3 Critical Path

Def: The Critical Path (CP) is the longest delay/latency path among all paths in the SFG, determining the maximum system clock frequency.
The CP poses a limitation on the execution time of the SFG. In the case of discrete-time (clocked) systems, the CP determines the maximum clock frequency of the system, as the delay of the CP is inversely proportional to the clock frequency. Thus, in many cases, it may be of interest to speed up the system. In general, a speedup can be achieved by spending additional (computational) resources or by increasing the clock frequency. The first approach represents spatial parallelism while the second one is associated with temporal parallelism (pipelining). Note that in the rest of this work the term parallel processing is associated with spatial parallelism while the term pipelining is associated with temporal parallelism. In the following sections, pipelining techniques are addressed first.
Pipelining reduces the CP by inserting additional delay elements (registers) along the CP. However, the transfer function of the system after pipelining shall be preserved. Basically, two methods can be used for pipelining:
Retiming - a technique for structural relocation of delay elements within the SFG such that the SFG function is preserved.
Loop transformation - the graph structure is changed without affecting the transfer function, resulting in an increased number of delay elements and operators. Newly created delay elements can be moved into the CP by retiming.
In the example in Fig. 6, all nodes have the same delay T. Thus, the CP passes 4 nodes, resulting in a total delay of 4T. In order to reduce the CP, pipelining using the cut-set rule is applied along the CP to shorten the CP by 1/2. Two cut-sets have been used for this purpose. Figure 7 shows that these two cut-sets can even be merged into a single cut-set (dashed green cut-set). By inserting two additional cut-sets we can pipeline the circuit even further until full pipelining is achieved. Full pipelining is associated with a graph structure in which any two nodes are isolated by a delay element. In Fig. 7, three cut-sets are constructed.
Figure 7: Graph retiming using cut-set pipelining. Three cut-sets have been constructed in order to fully pipeline the graph.
A more complex example is given in figure 10. The length of the critical path in this example is 5 operations. Thus, 4 pipelining stages should be sufficient for fully pipelining the circuit. The critical path is cut once between every two operations, inserting registers on the input edges of the cut-set. The output edges of the cut-set, where the speed-ups are inserted, are also the output edges of the circuit. This allows an easy removal of the speed-ups later on. The latency of the fully pipelined circuit is 4 clock cycles: y_{k-4}.
Figure 10: Pipelining along the critical path
2.4 Pure Pipelining

The example of Fig. 6 is shown again in Fig. 11. Only registers are placed at crossings of cuts and edges, resulting in the same structure as that in Fig. 7.
Figure 11: The principle of pure pipelining.
The idea behind this is that the cuts cross input edges or output edges only. If this is the case, we can insert a register on every cut edge and close the cut via all inputs or outputs of the circuit, respectively, where the speed-ups are inserted. The speed-ups at the circuit inputs and outputs can simply be dropped after pipelining, and only the internally inserted registers remain.
Figure 12 gives a more complex example for pure pipelining with a critical
path of length 5.
2.5

y_k = sum_{i=0}^{L-1} a_i * x_{k-i}    (2)
After pipelining, the CP contains only a single adder or constant multiplier, but the latency of the circuit has increased to 3 clock cycles.
Figure 13: Example for application of pure pipelining on FIR filter
This is still not the end of the story. The question is: can we do the pipelining in a more efficient way, i.e. spending fewer (than the 11 used) registers? This is possible by exploiting the associativity property of the (addition) operator. This is shown in figure 14. By slightly rearranging the FIR structure according to the associativity rule from (((a_0*x_k + a_1*x_{k-1}) + a_2*x_{k-2}) + a_3*x_{k-3}) to (a_0*x_k + (a_1*x_{k-1} + (a_2*x_{k-2} + a_3*x_{k-3}))), we are now able to define cut-sets that make use of (i.e. reuse) the already existing FIR shift registers for the pipelining.
Figure 14: Exploiting associativity of operator in FIR filter pipelining.
For the pipelining of the rearranged FIR structure we thus only need 4 (L = 4) additional registers (instead of 2L before) for a total of 7 registers. After redrawing the graph a little bit we get the well-known transposed form of the FIR filter, shown in figure 15, which has a latency of only 1 clock cycle.
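That the direct and transposed FIR forms compute the same output sequence is easy to verify numerically. A minimal sketch with assumed coefficients (a_0..a_3 below are hypothetical, not taken from the figures):

```python
# Direct-form vs. transposed-form FIR, L = 4 taps.
a = [3, -1, 2, 5]                # hypothetical coefficients a0..a3

def fir_direct(xs):
    taps = [0, 0, 0, 0]          # shift register holding x_k .. x_{k-3}
    out = []
    for x in xs:
        taps = [x] + taps[:3]
        out.append(sum(ai * xi for ai, xi in zip(a, taps)))
    return out

def fir_transposed(xs):
    regs = [0, 0, 0]             # registers between the output adders
    out = []
    for x in xs:
        out.append(a[0] * x + regs[0])
        # clock edge: each register accumulates the next partial sum
        regs = [a[1] * x + regs[1], a[2] * x + regs[2], a[3] * x]
    return out

xs = [1, 4, -2, 0, 7, 3]
assert fir_direct(xs) == fir_transposed(xs)
assert fir_direct([1, 0, 0, 0]) == a   # impulse response = coefficients
```

In the transposed form every adder is separated from the next by a register, which is exactly the fully pipelined structure discussed above.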
If we don't shift the registers from the upper branch to the lower branch (in between the adders), the circuit is not fully pipelined, since we still have 3 adders in the critical path. This form of the FIR is called the direct form and is shown in figure 16.
Figure 16: Direct form FIR filter.
Thus, we created pipelined logic by reusing the delay registers of the shift register.
Finally, take a look at figure 17, where we compare the critical paths of both FIR filter versions. We clearly find that the transposed form FIR filter is better suited for hardware design. However, as we will see later on, it is actually a bad idea to use it for a software realization.
Figure 17: Comparison of the critical path of the direct and transposed form FIR filter
2.6

Before pipelining, the critical path contains 4 operators and 1 register:

f_max = 1/T_c.p. = 1/(4*T_op + T_reg),    D = 4*T_op + T_reg

After pipelining, the system looks like the bottom of figure 18 and the critical path is now

f_max = 1/T_c.p. = 1/(T_op + T_reg),    D = 4*(T_op + T_reg)
3.1 Basic Loops

For any pure feedforward (non-recursive) system, which contains no feedback loops, we can use the cut-set rule for fully pipelining the circuit. This is always possible and is illustrated for an example on the left side of figure 19. On the right side of the figure we consider the same network but with the direction of a single edge changed. Thus, we have introduced a feedback loop into the network. A purely combinational feedback loop is not implementable, because it has an infinite (critical) path length, i.e. the signal would iterate in this cycle infinitely. For this reason, we must also insert a register as a boundary for the loop.
Figure 19: Comparison of pipelining of a system without and with a loop.
We can see from the right side of figure 19 that the critical path consists of 4 operations and contains the feedback loop. By applying 2 cut-sets in the first step, the critical path can be reduced to a length of 3 operations, but the loop is still completely contained. In the second and third step we try to pipeline the loop by introducing additional registers. For this purpose, cut-sets are chosen that contain only a single node of the loop. As we can see, the application of the cut-set rule now only moves the registers within the loop, but is not able to introduce new registers. Thus, the critical path cannot be shortened any more. In the end, we arrive at the same register arrangement that we started with. A more generic view of this problem is shown in figure 20.
No matter around which node (or set of nodes) we draw the cut-set, the number
of registers (i.e. the overall delay) always stays the same. Nothing can be done
to insert new registers!
It follows: in any closed circuit, the directed sum of delays in a feedback loop is a constant, i.e. sum of Regs = const.!
Figure 21 illustrates that the overall delay didn't change when counting the speed-up as a negative delay.
Figure 21: Pipelining doesn't affect the overall delay in a feedback loop
If we consider arbitrary loops (i.e. also feedforward loops), we find that the rule
is still applicable for this more general case, i.e. the number of registers is also
constant for arbitrary loops. This is shown in figure 22 where a feedforward
loop is pipelined via cut-set rule. It can be seen that the overall cycle delay is
still the same after pipelining (registers on edges that are directed against the
loop direction have to be counted as negative delays). However, this behaviour ...
3.2

y_k = sum_{i=0}^{k} x_i = x_k + y_{k-1}

A more general form, given by the following equation, yields a 1st-order IIR filter, shown in figure 24:

y_k = x_k + a * y_{k-1}
The critical path of this IIR filter consists of 1 adder and 1 multiplier and cannot be shortened by pipelining via cut-sets, as we have seen before. What can we do? The formula for y_k is based on y_{k-1}, but we know what y_{k-1} is by applying the recursion formula. Thus, let's unroll the loop once ...
y_k = x_k + a * y_{k-1}

Substituting y_{k-1} = x_{k-1} + a * y_{k-2} yields

y_k = x_k + a * x_{k-1} + a^2 * y_{k-2}
We find that this network is identical to the example from figure 19, but with 1 additional register in the feedback loop. Still, we will not be able to fully pipeline this system by application of the cut-set rule, since we would need 3 delays in that loop. By making use of the associativity of the + operator, we come to the system shown in figure 26.
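The 2-step look-ahead derived above can be checked numerically: the transformed recursion must produce exactly the same output sequence as the original one. A minimal sketch with assumed input data (a and xs are made up; dyadic values keep the floating-point comparison exact):

```python
# Original IIR: y_k = x_k + a*y_{k-1}
# 2-step look-ahead: y_k = x_k + a*x_{k-1} + a^2*y_{k-2}
a = 0.5
xs = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0]

def iir_1step(xs):
    y, out = 0.0, []
    for x in xs:
        y = x + a * y
        out.append(y)
    return out

def iir_2step(xs):
    y1 = y2 = 0.0             # y_{k-1}, y_{k-2} (zero initial state)
    x1 = 0.0                  # x_{k-1}
    out = []
    for x in xs:
        y = x + a * x1 + a * a * y2
        out.append(y)
        y2, y1, x1 = y1, y, x
    return out

assert iir_1step(xs) == iir_2step(xs)
```

The transformed loop trades one extra multiplier and adder in the feedforward part for a second register in the feedback loop, which is what makes the pipelining possible.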
The system consists of a 2-step linear look-ahead (feedforward) part and a 2-step feedback part. The feedforward part can easily be pipelined via the cut-set rule. The feedback part now contains 2 registers, but still only 2 operations (1 addition and 1 multiplication), in contrast to figure 25. Thus, the overall delay of the loop is sufficient for full pipelining. We only have to rearrange the position of the registers a little bit. For this purpose the cut-set rule can be applied again. As we have already seen before, the cut-set rule cannot change the number of delays in the loop, but it is able to rearrange the register positions within loops.
Doing the loop transformation once again, we come to a solution with a 3-step look-ahead and a 3-step feedback part:

y_k = x_k + a * x_{k-1} + a^2 * y_{k-2}

Substituting y_{k-2} = x_{k-2} + a * y_{k-3} yields

y_k = x_k + a * x_{k-1} + a^2 * x_{k-2} + a^3 * y_{k-3}
The 3-step solution now has 3 registers in the feedback loop. As we have seen before, the feedforward part can easily be replaced by a transposed form 3-step look-ahead, as shown in figure 28. Thus, the shift register can be reused for pipelining the system. The additional register in the feedback loop can, for example, be used for internal pipelining of the constant multiplier (we now have 2 instead of 1 clock cycle for the computation of the product).
Figure 28: IIR filter with transposed form 3-step look-ahead
3.3 Logarithmic Look-Ahead

We now go 1 step further, taking the example from section 3.2 and considering the 4-step case:

y_k = x_k + a * x_{k-1} + a^2 * x_{k-2} + a^3 * x_{k-3} + a^4 * y_{k-4}

Figure 29: IIR scheme after loop transformation with 4-step transposed form FIR linear look-ahead
The feedforward part of y_{k-2} is identical to that of y_k except that all x are additionally delayed by 2 clock cycles. Thus, using the notation (.)_{-2} for "delay all entries by 2", we come to:

y_{k-2} = (x_k + a * x_{k-1})_{-2} + a^2 * y_{k-4}

and by substituting

y_k = (x_k + a * x_{k-1}) + a^2 * (x_k + a * x_{k-1})_{-2} + a^4 * y_{k-4}

We recognize from the equation that the computational result of the feedforward term (x_k + a * x_{k-1}) is reused, which directly leads us to the logarithmic form of the 4-step look-ahead:
Figure 30: SFG of IIR with 4-step logarithmic look-ahead
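The reuse of the feedforward term can again be checked against the original recursion. A minimal sketch (a and xs are assumed example values; u_k below names the shared term x_k + a*x_{k-1}):

```python
# 4-step logarithmic look-ahead:
#   y_k = u_k + a^2 * u_{k-2} + a^4 * y_{k-4},  u_k = x_k + a*x_{k-1}
# compared against the original IIR y_k = x_k + a*y_{k-1}.
a = 0.5
xs = [1.0, -2.0, 4.0, 0.5, 3.0, 1.0, -1.0, 2.0]

def iir(xs):
    y, out = 0.0, []
    for x in xs:
        y = x + a * y
        out.append(y)
    return out

def iir_log4(xs):
    hist_x = [0.0] * 4            # x_{k-1} .. x_{k-4}
    hist_y = [0.0] * 4            # y_{k-1} .. y_{k-4}
    out = []
    for x in xs:
        u = x + a * hist_x[0]                 # u_k (computed once)
        u2 = hist_x[1] + a * hist_x[2]        # u_{k-2} (delayed copy)
        y = u + a**2 * u2 + a**4 * hist_y[3]
        out.append(y)
        hist_x = [x] + hist_x[:3]
        hist_y = [y] + hist_y[:3]
    return out

assert iir(xs) == iir_log4(xs)
```

The logarithmic form needs only 2 multipliers (a^2 and a^4) plus the shared u_k stage instead of the 4 multipliers of the linear look-ahead.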
3.4 Z-Transform

As we have seen in section 3.2, the loop transformation can sometimes be cumbersome. The z-transform provides a means to simplify this procedure. Let's pick up the small IIR example that we used in the previous sections again:
y_k = x_k + a * y_{k-1}   =>   Y(z) = X(z) + a * z^-1 * Y(z)   =>   Y(z)/X(z) = 1 / (1 - a * z^-1)

Multiplying numerator and denominator by (1 + a * z^-1) gives

Y(z)/X(z) = (1 + a * z^-1) / (1 - a^2 * z^-2)    (2-step recursion)

Finally, we increased the order of the recursion part by 1 and thus introduced an additional delay in the feedback loop. This is nothing else but the 2-step loop transformation for the linear look-ahead that was shown in section 3.2. The results are directly related: 1 + a * z^-1 corresponds to x_k + a * x_{k-1} (look-ahead), and 1 - a^2 * z^-2 corresponds to a^2 * y_{k-2} (feedback). We can even simply extend the 2-step linear look-ahead to a 4-step logarithmic look-ahead by multiplication with the neutral 1: (1 + a^2 * z^-2) / (1 + a^2 * z^-2):
Y(z) = [(1 + a * z^-1) * (1 + a^2 * z^-2)] / (1 - a^4 * z^-4) * X(z)                  (logarithmic look-ahead)
     = (1 + a * z^-1 + a^2 * z^-2 + a^3 * z^-3) / (1 - a^4 * z^-4) * X(z)             (linear look-ahead)
     = [(1 + a * z^-1) * (1 + a^2 * z^-2) * (1 + a^4 * z^-4)] / (1 - a^8 * z^-8) * X(z)  (logarithmic look-ahead)
The z-transform also easily allows us to consider mixed forms of linear and logarithmic look-ahead, e.g., for M = 6:

Y(z) = [(1 + a * z^-1 + a^2 * z^-2) * (1 + a^3 * z^-3)] / (1 - a^6 * z^-6) * X(z)

where the first factor is the linear part and the second factor the logarithmic part.
3.5 Vectors
Can all transformations be applied again? What operations did we actually use for the loop transformation? What problems could occur?
1. Matrix multiplication is not commutative: a_k * a_{k-1} != a_{k-1} * a_k. That's OK, we didn't use this!
2. Matrix/vector multiplication is associative: (a_k * a_{k-1}) * a_{k-2} = a_k * (a_{k-1} * a_{k-2})
3. Addition of matrices/vectors is associative: (a_k + a_{k-1}) + a_{k-2} = a_k + (a_{k-1} + a_{k-2})
4. We find a zero element (for multiplication) / neutral element for addition: the all-zero matrix.
Result: We can conclude that exactly the same results that we found for scalar values also hold here!
3.6
4
4.1
Take a look at figure 34. There, a small pure feed-forward example is shown that contains no feedback loops but multiplications with time-variant coefficients (a_k, b_k). We see that we can apply the cut-set rule here as usual. Only take care that if we draw the cut-set around the coefficient multipliers, this also affects the time index of the coefficients, as depicted in the figure.
Figure 34: Small pure feed-forward system example w/o loops
After this short introductory example, let's now come to time-variant feedback systems. Therefore, we modify the IIR example a little bit, making the factor a time-variant, i.e. a -> a_k. For pipelining this loop we can simply apply the loop transformation as before (a 2-step recursion should be sufficient, since we have only 2 operations in the loop). This is shown in the following equation:

y_k = x_k + a_k * y_{k-1}    (3)
If we only apply the associativity rule for the + operator (illustrated by red brackets), we come to the circuit shown in figure 35 (left). We now find 2 delays in the loop, but the number of operators has also increased to 3 (2 mult. + 1 add). Consequently, 2 registers are still not sufficient to fully pipeline the loop. On the other hand, if we also make use of the associativity of the * operator (illustrated by green brackets) to precompute the product of the a_k's outside of the loop, we come to the circuit in figure 35 (right). In this case only 2 operations are inside the loop. Since we also have 2 pipeline registers available (2-step recursion), the loop can easily be fully pipelined by applying the cut-set rule. This small example demonstrates that the loop transformation for time-variant feedback loops is more tricky and depends on the order in which the operations are executed and the mathematical rules that can be applied to the operators.
Figure 35: SFG of system (3)
For the time-variant IIR example we made use of the following rules:
- associative law: +, *
- distributive law: * over +
4.2 Dot Operator

We define the Dot operator on tuples as

(a, b) . (c, d) = (a*c, b + a*d)

Checking associativity:

[(a, b) . (c, d)] . (e, f) = (a*c, b + a*d) . (e, f) = (a*c*e, b + a*d + a*c*f)
(a, b) . [(c, d) . (e, f)] = (a, b) . (c*e, d + c*f) = (a*c*e, b + a*(d + c*f)) = (a*c*e, b + a*d + a*c*f)

It follows: the Dot operator is associative if + is associative, * is associative, and * is distributive over +.
This means, no matter in which order we apply the . operator, the result is the same! How to use this: we apply the definition of the Dot operator to the time-variant recursion loop and show how easily we can now derive a 2-step loop recursion:

(0, y_k) = (a_k, x_k) . (0, y_{k-1}) = (a_k, x_k) . [(a_{k-1}, x_{k-1}) . (0, y_{k-2})] = [(a_k, x_k) . (a_{k-1}, x_{k-1})] . (0, y_{k-2})
After the first loop transformation we can simply compute the Dot product of (a_k, x_k) and the delayed (a_{k-1}, x_{k-1}) in the feedforward part by exploiting the associativity rule for the Dot operator. Afterwards, (0, y_{k-2}) has to be Dot-multiplied in the feedback part. We can make the recursion equation for this time-variant rule clearer by introducing the following short-form notation for the tuples: X_k = (a_k, x_k), Y_k = (0, y_k). Using these substitutions, the recursion equation with the Dot operator becomes very simple:

Y_k = X_k . Y_{k-1}
Y_k = [X_k . X_{k-1} . ... . X_{k-L+1}] . Y_{k-L}

The Dot operator is not commutative, so the computation must be done from left to right.
The equation also shows the generalization for an L-step recursion. In this case we only have to compute the Dot product of [X_k, ..., X_{k-L+1}] in the feedforward part and Dot-multiply the result with Y_{k-L} in the feedback part. Be aware that the Dot operation must be applied in the correct order (from left to right), since it is not commutative! E.g. X_k . X_{k-1} is OK, while the result of X_{k-1} . X_k would be wrong. Some examples:

Y_k = X_k . Y_{k-1}
    = (X_k . X_{k-1}) . Y_{k-2}                      (M = 2)
    = (X_k . X_{k-1} . X_{k-2} . X_{k-3}) . Y_{k-4}  (M = 4, linear look-ahead)
Figure 36 shows the simple and regular structure that results by applying the
Dot operator onto the example of a 4-step linear look-ahead. The orange arrows
show the order of the Dot operator inputs which is essential due to the missing
commutativity of the operator.
Figure 36: SFG of IIR with time variant loop and 4-step look-ahead using Dot
operator
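The tuple definition and the look-ahead identity are easy to exercise in code. A minimal sketch (the coefficient and input sequences a_k, x_k below are made-up example data):

```python
# Dot operator for the time-variant recursion y_k = x_k + a_k*y_{k-1},
# encoding X_k = (a_k, x_k):  (a, b) . (c, d) = (a*c, b + a*d).
# Associative, but NOT commutative: evaluate products left to right.
def dot(p, q):
    (a, b), (c, d) = p, q
    return (a * c, b + a * d)

ak = [2, 3, -1, 4, 2, 1, 3, -2]   # hypothetical coefficients
xk = [1, 0, 5, -2, 3, 1, 0, 4]    # hypothetical inputs

# Reference: direct evaluation of the recursion (zero initial state).
y_ref, y = [], 0
for a, x in zip(ak, xk):
    y = x + a * y
    y_ref.append(y)

# 4-step look-ahead: Y_k = (X_k . X_{k-1} . X_{k-2} . X_{k-3}) . Y_{k-4}
X = [(a, x) for a, x in zip(ak, xk)]
for k in (3, 7):
    P = dot(dot(dot(X[k], X[k - 1]), X[k - 2]), X[k - 3])
    Yk = dot(P, (0, y_ref[k - 4] if k >= 4 else 0))
    assert Yk[1] == y_ref[k]

# Order matters: X_k . X_{k-1} != X_{k-1} . X_k in general.
assert dot(X[1], X[0]) != dot(X[0], X[1])
```

The feedforward product P can be evaluated in a balanced tree (the logarithmic form of figure 37), as long as the left-to-right order of the operands is kept.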
Figure 37: SFG of IIR with time variant loop and 8-step log look-ahead using
Dot operator
(x_k + a_k * x_{k-1}) + a_k * a_{k-1} * (x_{k-2} + a_{k-2} * x_{k-3}) + [a_k * a_{k-1} * a_{k-2} * a_{k-3}] * y_{k-4}
We can conclude that by using the . operator, any feedback recursion of the form

y_k = b_k * x_k + a_k * y_{k-1}

can, with the substitution x~_k = b_k * x_k, be written as

y_k = x~_k + a_k * y_{k-1},    Y_k = (0, y_k)

and handled in the same way as y_k = x_k + y_{k-1}, the simplest binary operator recursion. Please note that a time-variant coefficient (b_k) in the feedforward part is no problem, since we can simply substitute b_k * x_k by x~_k and thus apply the . operator as usual. But: the commutative law does not apply! Finally, notice that the . operator can also be generalized to vectors & matrices.
5 Generalizations

5.1
We can further generalize the applicability of the loop transformation and the Dot operator with respect to the underlying algebraic operation. Up to now we restricted our investigations to addition and multiplication. The question is: can we apply the rules also to different operators? What mathematical rules have we used so far?

(S, +) is a semigroup (i.e. + is associative and S contains a neutral element):
(a + b) + c = a + (b + c)
0 + a = a
b + 0 = b
(S, *) is a semigroup
* is distributive over +:
a * (b + c) = a*b + a*c
Please note that (S, +) as well as (S, *) are also commutative semigroups. However, this is not true for (S, *) with respect to matrices! Consequently, we are now able to generalize our previous knowledge from this section to arbitrary operators (+) and (x) (don't confuse these with addition + and multiplication *! The symbols (+) and (x) serve as templates for arbitrary operators), if:
- (S, (+)) and (S, (x)) are semigroups
- (x) is distributive over (+)

y_k = x_k (+) (a_k (x) y_{k-1})                                         (1-step recursion)
    = x_k (+) (a_k (x) [x_{k-1} (+) (a_{k-1} (x) y_{k-2})])
    = x_k (+) (a_k (x) x_{k-1}) (+) (a_k (x) a_{k-1} (x) y_{k-2})       (distr. law, assoc. of (+) and (x))
In the following you find some examples of operator pairs that fulfill the conditions above, so that all rules concerning loop transformation and the Dot operator can also be applied for these operations:

(+)   (x)
max   min
min   max
XOR   AND
Consider a Viterbi trellis with 2 states in figure 38 (left). The path metrics alpha are therein derived from the branch metrics gamma as follows:

alpha_{0,k+1} = max(gamma_{00,k} + alpha_{0,k}, gamma_{01,k} + alpha_{1,k})
alpha_{1,k+1} = max(gamma_{10,k} + alpha_{0,k}, gamma_{11,k} + alpha_{1,k})

or as a combined max-plus matrix equation:

[alpha_0]         [gamma_00  gamma_01]     [alpha_0]
[alpha_1]_{k+1} = [gamma_10  gamma_11] (x) [alpha_1]_k
If we do this computation iteratively we can realize this by an integrator hardware that is also depicted in figure 38 (right).
Figure 38: Viterbi Trellis with 2 states as an example for the Max-Plus-Algebra
application
5.2

In the previous section, we found that we are not restricted to the operators + and *, as long as certain conditions are fulfilled. Let us first summarize these conditions once again:
- (S, max) is a semigroup
- (S, +) is a semigroup
- + is distributive over max (distributive law -> semiring)

We already know from previous examples that * (multiplication) is distributive over + (addition). We also know from the previous section that + is distributive over max, and all 3 operations are associative:

a (x) (b (+) c) = a + max(b, c) = max(a + b, a + c) = (a (x) b) (+) (a (x) c)
Consequently, we can exchange the operators for any circuit that we have already pipelined before. Figure 39 (top, left) shows the small time-variant IIR example from section 4.1. If the * and + operations are now replaced by + and max, respectively, we get an Add-Compare-Select unit (figure 39; top, right), which is a typical hardware element that is, e.g., used for Viterbi decoding. Since we already know the solution for the pipelined recursion loop from figure 35, and since we know that the necessary mathematical rules also hold for the max/+ algebra, we are able to directly present the result here: figure 39 (bottom).
Figure 39: Loop transform + Pipelining of Add-Compare-Select circuit
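The operator exchange can be verified directly: the same 2-step unrolling that worked for (+, *) must work for (max, +). A minimal sketch with made-up branch metrics (xs and aks below are hypothetical):

```python
# Add-Compare-Select recursion in max-plus algebra:
#   y_k = max(x_k, a_k + y_{k-1})
# 2-step unrolling (valid because + distributes over max):
#   y_k = max(x_k, a_k + x_{k-1}, a_k + a_{k-1} + y_{k-2})
xs = [3, -1, 4, 0, 2, 5]     # hypothetical inputs
aks = [1, 2, -3, 1, 0, 2]    # hypothetical coefficients

def acs_1step(xs, aks, y0=0):
    y, out = y0, []
    for x, a in zip(xs, aks):
        y = max(x, a + y)
        out.append(y)
    return out

def acs_2step(xs, aks, y0=0):
    y1, y2 = y0, y0          # y_{k-1}, y_{k-2}
    x1 = a1 = None           # x_{k-1}, a_{k-1}
    out = []
    for x, a in zip(xs, aks):
        if x1 is None:
            y = max(x, a + y1)   # first step: no history yet
        else:
            y = max(x, a + x1, a + a1 + y2)
        out.append(y)
        y2, y1, x1, a1 = y1, y, x, a
    return out

assert acs_1step(xs, aks) == acs_2step(xs, aks)
```

The 2-step form has two registers in the feedback loop and only the max/+ pair inside it, mirroring the pipelined solution of figure 35.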
If we now consider all 3 operators (+, * and max) together, the only remaining question is: is * distributive over max?

a * max(b, c) = max(a*b, a*c)

This is true as long as we limit ourselves to non-negative numbers, e.g., F = R>=0! We thus have 3 operators which form a distributively sorted set of operations:

(+)1 = *,  (+)2 = +,  (+)3 = max
Let us take the following example with 3 operations and a recursion loop for pipelining:

y_k = c_k (+)3 [b_k (+)2 (a_k (+)1 y_{k-1})]

Now, we want to do a 2-step loop transformation for the case with 3 operators by unrolling the equation as usual and applying the distributive laws afterwards:

y_k = c_k (+)3 {b_k (+)2 (a_k (+)1 [c_{k-1} (+)3 (b_{k-1} (+)2 (a_{k-1} (+)1 y_{k-2}))])}
    = c_k (+)3 {b_k (+)2 [(a_k (+)1 c_{k-1}) (+)3 ((a_k (+)1 b_{k-1}) (+)2 ((a_k (+)1 a_{k-1}) (+)1 y_{k-2}))]}
    = c_k (+)3 [b_k (+)2 (a_k (+)1 c_{k-1})] (+)3 {b_k (+)2 (a_k (+)1 b_{k-1}) (+)2 [(a_k (+)1 a_{k-1}) (+)1 y_{k-2}]}

The first (+)3 term and the coefficient combinations (a_k (+)1 b_{k-1}) and (a_k (+)1 a_{k-1}) are the look-ahead parts that can be precomputed in the feedforward section.
For simplicity (when dealing with time-variant loops) we should prefer the Dot operator for the loop transformation. The . operator for 3 operators in the given distributive order can be defined as follows:
(a1, a2, a3) . (b1, b2, b3) = ( a1 (+)1 b1 ,  a2 (+)2 (a1 (+)1 b2) ,  a3 (+)3 (a2 (+)2 (a1 (+)1 b3)) )    (4)

To verify that this operator is associative, abbreviate the components of a Dot product with capital letters, (A1, A2, A3) = (a1, a2, a3) . (b1, b2, b3) and (B1, B2, B3) = (b1, b2, b3) . (c1, c2, c3). Evaluating the two bracketings:

[(a1, a2, a3) . (b1, b2, b3)] . (c1, c2, c3) = ( A1 (+)1 c1 ,  A2 (+)2 (A1 (+)1 c2) ,  A3 (+)3 (A2 (+)2 (A1 (+)1 c3)) )

(a1, a2, a3) . [(b1, b2, b3) . (c1, c2, c3)] = ( a1 (+)1 B1 ,  a2 (+)2 (a1 (+)1 B2) ,  a3 (+)3 (a2 (+)2 (a1 (+)1 B3)) )

Expanding both sides with the distributive laws shows that they agree term by term:

( a1 (+)1 b1 (+)1 c1 ,
  a2 (+)2 (a1 (+)1 b2) (+)2 (a1 (+)1 b1 (+)1 c2) ,
  a3 (+)3 (a2 (+)2 (a1 (+)1 b3)) (+)3 [ (a2 (+)2 (a1 (+)1 b2)) (+)2 (a1 (+)1 b1 (+)1 c3) ] )
By comparing the results, we see that the operator defined in eq. (4) is indeed associative. For 4 or more operators we would apply an iterative proof. Finally, we examine some examples of operator tuples and their distributive order. A summary is given by table 2.
Table 2: Some examples for tuples of (more than 2) operators and their distributive order

S     (+)1   (+)2   (+)3   (+)4
R+    *      +      max    min
R+    *      +      max
R     +      max    min
Please note that the distributive order for (*, max) only holds for positive numbers, i.e., max(a*b, a*c) = a * max(b, c) is only true iff a, b, c >= 0! A special case are the operators max and min, which are even self-distributive, i.e.,

max(a, max(b, c)) = max(max(a, b), max(a, c))

Consequently, max is also distributive over min, as well as min being distributive over max. This particular property makes the curious set (max, min, max) a valid operator tuple.
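These lattice identities are easy to check exhaustively over a small sample set. A minimal sketch (the sample values are arbitrary):

```python
# Distributivity and self-distributivity of max/min over sample triples.
from itertools import product

for a, b, c in product([-2, 0, 1, 5], repeat=3):
    # max distributes over min, and min over max (distributive lattice):
    assert max(a, min(b, c)) == min(max(a, b), max(a, c))
    assert min(a, max(b, c)) == max(min(a, b), min(a, c))
    # self-distributivity of max:
    assert max(a, max(b, c)) == max(max(a, b), max(a, c))
```

Because the operands are totally ordered, these identities hold for all reals, not just the samples tested here.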
5.3 Multi-Span Algebra

The operator tuple can also be lifted to matrices, where
(+)1 acts like *,
(+)2 acts like +,
(+)3 acts like a weaker +.
For 2x2 matrices, (+)1 is applied like a matrix multiplication (with (+)2 taking the role of the sum over the products), while (+)2 and (+)3 are applied element-wise:

[a11 a12]      [b11 b12]     [ (a11 (+)1 b11) (+)2 (a12 (+)1 b21)   (a11 (+)1 b12) (+)2 (a12 (+)1 b22) ]
[a21 a22] (+)1 [b21 b22]  =  [ (a21 (+)1 b11) (+)2 (a22 (+)1 b21)   (a21 (+)1 b12) (+)2 (a22 (+)1 b22) ]

[a11 a12]      [b11 b12]     [ a11 (+)2 b11   a12 (+)2 b12 ]
[a21 a22] (+)2 [b21 b22]  =  [ a21 (+)2 b21   a22 (+)2 b22 ]

[a11 a12]      [b11 b12]     [ a11 (+)3 b11   a12 (+)3 b12 ]
[a21 a22] (+)3 [b21 b22]  =  [ a21 (+)3 b21   a22 (+)3 b22 ]
6 Block Processing

6.1 Basic Idea

The basic idea of block processing is very simple, as figure 40 shows. Take an arbitrary serial input stream and chunk it into blocks of length L:

..., x_{k-3}, x_{k-2} | x_{k-1}, x_k | x_{k+1}, x_{k+2}, ...    (blocks of L = 2)

Then perform a parallel processing on every chunk and serialize the parallel outputs afterwards. However, in general we have to distinguish two different types that can both be considered as block processing.
Figure 40: Basic Idea of Block Processing
For the second example, assume a high-speed serial link (e.g. 60 GHz @ 10 Gb/s). The sample rate is far too high for today's systems to process sequentially. Therefore, we need parallel processing. This can easily be done, as depicted in figure 42, if the processing of L subsequent samples can be done independently. In this case we can use a demultiplexer to do a serial-to-parallel conversion first, then perform the parallel processing on each of the parallel streams individually and multiplex the results back into a common serial stream. The advantage of this approach is that each of the parallel streams can now be processed at a lower rate (factor 1/L).
Figure 42: Example for block processing via parallel execution
6.2 Pipeline Interleaving

Let's take the following simple recursion formula (integrator) as an example system to demonstrate the idea of pipeline interleaving:

y_k = x_k + y_{k-1}

Normally, we would apply the loop transformation to introduce new registers in the feedback loop. But what happens if we just insert the registers in the loop without transforming the look-ahead part, i.e. take an arbitrary signal processing hardware and replace every delay element by a shift register of L delays?

y_k = x_k + y_{k-2}
Does the behavior of the realized function change - and how? Let us take a look
at the small example in figure 44 to clarify these questions.
On the left side of the figure we see the serial input stream x_k. The right side shows the corresponding output sequence y_k. We recognize that the sum of the x_k with even k and the sum of the x_k with odd k are computed in an interleaved manner: x_0, x_1, x_2 + x_0, x_3 + x_1, ... Actually, this is not the desired behavior. But what if we exploit this behavior by feeding the integrator with two completely independent data sets that we pass to the integrator input in an interleaved manner? The resulting behavior is demonstrated in figure 45.
Figure 45: Example of pipeline interleaved integrator behavior
As the figure shows, this works fine for the two interleaved sets. After the 6 (i.e. 2*3) considered clock cycles, we observe the correct sums for both data sets at the output: y_{a,2} and y_{b,2}.
Let's consider this example more generally for the case L = 3, as shown in figure 46. Therein, our 1-step recursion loop

y_k = x_k + y_{k-1}

becomes

y_k = x_k + y_{k-3}

by applying the pipeline interleaving approach.

Figure 46: Integrator example for L=3

y_k     = x_k     + y_{k-3}
y_{k+1} = x_{k+1} + y_{k-2}
y_{k+2} = x_{k+2} + y_{k-1}
However, as mentioned before, the pipeline interleaving approach will only work if we pass 3 independent data sets to the input of this circuit in an alternating manner. In this case the three equations change as follows:

y_{1,k} = x_{1,k} + y_{1,k-1}
y_{2,k} = x_{2,k} + y_{2,k-1}
y_{3,k} = x_{3,k} + y_{3,k-1}

Now we recognize our initial equation again, but this time we can process three data sets on the same hardware simultaneously, achieving a speed-up of 3x (best case) by introducing additional pipeline registers.
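The interleaving behavior can be simulated directly: replacing the single feedback delay by 3 delays turns the integrator into 3 independent running-sum machines. A minimal sketch (the three data sets are arbitrary example values):

```python
# Pipeline-interleaved integrator, L = 3: with an L-fold feedback delay
# the hardware computes L independent running sums, provided the L data
# sets are fed in interleaved order.
from itertools import chain

def interleaved_integrator(stream, L=3):
    regs = [0] * L                 # the L-fold feedback delay
    out = []
    for x in stream:
        y = x + regs[-1]           # y_k = x_k + y_{k-L}
        out.append(y)
        regs = [y] + regs[:-1]
    return out

sets = [[1, 2, 3], [10, 20, 30], [5, 5, 5]]   # 3 independent data sets
stream = list(chain.from_iterable(zip(*sets)))  # interleave them
out = interleaved_integrator(stream)

# De-interleave: every 3rd output is the running sum of one data set.
for i, data in enumerate(sets):
    sums = [sum(data[:n + 1]) for n in range(len(data))]
    assert out[i::3] == sums
```

This is exactly the serialization/deserialization scheme of figure 48: interleave at the input, run the single hardware at the higher rate, and de-interleave at the output.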
In general, when adding delays in the feedback loop, we need to replace every
delay (i.e. also in the look-ahead part) by an L-fold delay to make the pipeline
interleaving concept work. This is shown in figure 47.
Figure 47: Every delay is L-fold replaced
Thus, the right circuit in the figure (after the 3-step pipeline interleaving) can be realized as 3 independent all-pass filters. For this purpose, the 3 independent data sets x_k, a_k and b_k must be passed to the filter in an interleaved manner.
Finally, a short remark: be aware that for applying the pipeline interleaving technique it is necessary to serialize the parallel inputs before processing at higher speed and to deserialize the outputs again, as shown in figure 48.
Figure 48: Serialization / Deserialization necessary to exploit pipeline interleaving technique
6.3
We also see in figure 49 that the clock frequency of the computation circuit is reduced by a factor of 1/2 for this 2-fold parallelization. This is the main advantage of the block processing concept. On the one hand this can be used to increase the throughput of the computation: by using the same clock frequency that would be necessary for the serial processing (1/T in the above example), the throughput would be increased by a factor of 2. On the other hand the reduced clock frequency can save power: power consumption grows significantly faster than clock frequency above a certain level (~200 MHz), but scales linearly with area increase!
Let us pick up the simple recursion loop example from the previous section for
block processing. The equations for splitting it into odd and even part, are
exactly the same, only the realization will be different:
y_k     = x_k + y_{k-1}     = x_k + x_{k-1} + y_{k-2}       →   y_{2k}   = x_{2k} + x_{2k-1} + y_{2k-2}
y_{k-1} = x_{k-1} + y_{k-2} = x_{k-1} + x_{k-2} + y_{k-3}   →   y_{2k-1} = x_{2k-1} + x_{2k-2} + y_{2k-3}
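The odd/even split can be checked with a small Python model (a sketch: zero initial state and an even input length are assumed), confirming that the block recursion reproduces the serial integrator:

```python
# 2-fold block processing of the integrator y_k = x_k + y_{k-1}:
# one block step computes one even-indexed and one odd-indexed output.
def integrator_serial(x):
    y, acc = [], 0
    for v in x:
        acc += v
        y.append(acc)
    return y

def integrator_block2(x):
    n = len(x)                     # even length assumed
    y = [0] * n
    for m in range(0, n, 2):
        # even branch: y_2k = x_2k + x_{2k-1} + y_{2k-2}
        y[m] = x[m] + (x[m - 1] if m >= 1 else 0) + (y[m - 2] if m >= 2 else 0)
        # odd branch:  y_{2k+1} = x_{2k+1} + x_2k + y_{2k-1}
        y[m + 1] = x[m + 1] + x[m] + (y[m - 1] if m >= 1 else 0)
    return y
```

Both branches only feed back values that are two time steps old, which is what makes the halved clock rate possible.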
Figure 50 shows the corresponding hardware realization for this integrator example with 2-fold parallel block processing.
Figure 50: 2-fold block processing for integrator example
It should be mentioned that the parallel inputs have already been demultiplexed
before. Thus, the first input only sees the even x_k's while the second input only
sees the odd x_k's (i.e. k is always a multiple of 2). Consequently, every single
delay on one of these inputs actually counts as a delay of 2: the single delay
on x_k produces x_{k-2}. For the same reason a single delay in the feedback loop is
sufficient to produce y_{2k-2}. We emphasize once again that the main advantage
of block processing is the clock speed reduction to 1/(L·T) (for an L-fold parallel
block processing).
Finally, we want to take a look at a slightly more complex example. By
unrolling the loop of the above example twice more, we come to a 4-step
recursion loop. We are able to split that loop into four independent loops that
can be computed in parallel with block processing. The corresponding four
equations with linear look-ahead are as follows:
y_k     = x_k + x_{k-1} + x_{k-2} + x_{k-3} + y_{k-4}
y_{k-1} = x_{k-1} + x_{k-2} + x_{k-3} + x_{k-4} + y_{k-5}
y_{k-2} = x_{k-2} + x_{k-3} + x_{k-4} + x_{k-5} + y_{k-6}
y_{k-3} = x_{k-3} + x_{k-4} + x_{k-5} + x_{k-6} + y_{k-7}
Figure 51: 4-fold block processing for integrator example with linear look-ahead
y_k     = (x_k + x_{k-1}) + (x_{k-2} + x_{k-3}) + y_{k-4}
y_{k-1} = (x_{k-1} + x_{k-2}) + (x_{k-3} + x_{k-4}) + y_{k-5}
y_{k-2} = (x_{k-2} + x_{k-3}) + (x_{k-4} + x_{k-5}) + y_{k-6}
y_{k-3} = (x_{k-3} + x_{k-4}) + (x_{k-5} + x_{k-6}) + y_{k-7}
Again, the red parts in the equation must be generated by introducing delays,
where each delay counts as four time steps due to the parallel processing approach. The realization of the 4-fold block processing example with logarithmic
look-ahead is illustrated in figure 52.
Figure 52: 4-fold block processing for integrator example with logarithmic lookahead
6.4

Starting from the 2-fold block processing equations of the integrator,

y_k     = x_k + x_{k-1} + y_{k-2}
y_{k-1} = x_{k-1} + x_{k-2} + y_{k-3}

we can simplify the second equation: x_{k-1} is directly available at the input, and y_{k-2} is already available from the feedback of the first branch. Hence:

y_k     = x_k + x_{k-1} + y_{k-2}
y_{k-1} = x_{k-1} + y_{k-2}
Now both parallel branches depend on the same feedback, namely y_{k-2}. Hence,
we can save some hardware resources, since the branch for the computation of
y_{k-1} can reuse the feedback (y_{k-2}) from the y_k branch. This optimization for
the L=2 case is illustrated in figure 53.
Figure 53: 2-fold block processing example with single feedback loop
Compared to figure 50 we can save 1 adder and the register in the second
feedback loop by reusing the output of the first feedback loop. For the more
complex 4-fold block processing example with linear look-ahead, the 4 equations
are as follows when using only a single feedback loop:
y_k     = x_k + x_{k-1} + x_{k-2} + x_{k-3} + y_{k-4}
y_{k-1} = x_{k-1} + x_{k-2} + x_{k-3} + y_{k-4}
y_{k-2} = x_{k-2} + x_{k-3} + y_{k-4}
y_{k-3} = x_{k-3} + y_{k-4}
Compared to the solution in figure 51, we could save 6 adders and 3 registers.
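These single-feedback equations can be checked against the serial integrator with a small Python model (a sketch: a single feedback register updated once per block, zero initial state assumed):

```python
# 4-fold block processing of the integrator with a single feedback loop:
# all four parallel branches use the same feedback value y_{k-4}.
def integrator_block4_single_loop(x):
    n = len(x)                 # length assumed to be a multiple of 4
    y = [0] * n
    fb = 0                     # the single feedback register (y_{k-4})
    for m in range(0, n, 4):
        y[m]     = x[m] + fb                                   # y_{k-3}
        y[m + 1] = x[m + 1] + x[m] + fb                        # y_{k-2}
        y[m + 2] = x[m + 2] + x[m + 1] + x[m] + fb             # y_{k-1}
        y[m + 3] = x[m + 3] + x[m + 2] + x[m + 1] + x[m] + fb  # y_k
        fb = y[m + 3]
    return y
```

Only the last branch's output is stored, which is why the three other feedback registers (and their adders) can be dropped.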
Finally, we also want to examine the example with logarithmic look-ahead. The
four equations change as follows:
y_k     = (x_k + x_{k-1}) + y_{k-2}
y_{k-1} = (x_{k-1} + x_{k-2}) + y_{k-3}
y_{k-2} = (x_{k-2} + x_{k-3}) + y_{k-4}
y_{k-3} = x_{k-3} + y_{k-4}
Herein, the intermediate result of the y_{k-2} branch (red) was reused for the
logarithmic look-ahead (green). The realization is depicted in figure 55.
Figure 55: 4-fold block processing example with single feedback loop and logarithmic look-ahead
In contrast to figure 52, we need only 8 instead of 12 adders and could save again
the 3 registers in the feedback loops. Although the version with a single loop
and logarithmic look-ahead/look-back is the most efficient form of our integrator
example with respect to hardware resources, it is not always the most preferred
solution. This is because the topology is quite irregular, if we compare it to
the parallel loop version, for instance. Finally, the decision of which of these
hardware realizations to be preferred, is a trade-off between hardware resources,
reusability and regularity of the communication network.
7 On-Chip Communication
7.1
Bit-level communication
As the word-level view onto the integrator circuit shows (figure 56, left), the
critical path is located within a loop. To speed it up, we have two options:
1. Unroll the loop to an L-step recursion via loop transformation, pipelining the adder by using the additional registers in the loop
2. Make the adder faster, i.e. use faster adder types (e.g. carry-look-ahead
or carry-save instead of a slow carry-ripple adder)
To pipeline the adder internally, it is necessary to consider the bit-level view of
the integrator (figure 56, right). In contrast to the word-level view, the bit-level
considers every single (bit) line of a bus. Thus, the wordlength W becomes
important here. As we already know, the critical path length and thus the
pipelining of a carry-ripple adder depends on the wordlength. Hence, pipelining
can only be done on bit-level.
Figure 56: Integrator circuit: word-level view (left) and bit-level view (right)
At first sight, it looks like we find the critical path still within a loop. However,
by redrawing the circuit a little bit we come to figure 58 (top). Now we recognize that the critical path actually is not a feedback loop. Figure 58 (bottom)
illustrates this even better. Thus, we can simply do pipelining by cut-set rule
(or pure pipelining as shown in figure 58 (bottom)) without the necessity of
loop-transformation.
By rearranging the inserted pipeline registers a little bit, the so-called skewing
triangle at the integrator inputs and the deskewing triangle at the integrator
outputs become visible. This is depicted in figure 59. The registers at the
inputs and outputs are necessary for aligning the timing of the single bit-lines,
so that all associated bits of a single word can be passed to the circuit in parallel
and arrive at the output in parallel at the same clock cycle.
Figure 59: (De-)skewing triangles in Integrator example
The (de-)skewing triangles are of special importance with respect to the chip
area, since the area increase is proportional to W². This actually does not come
from the routing, but from the skewing. For a fully pipelined N-bit integrator,
we need N full adders, N registers in the feedback loops and N−1 registers
for pipelining the carry chain; the skewing, however, consumes (N−1)·(N−1)
registers in total. Consequently, by reusing the skewing, we could save a
lot of area! The following example of a (transposed form) FIR adder chain
demonstrates how this works. In the upper part of figure 60 we see the adder
chain of an FIR filter with 4 taps (L=4) with a subsequent integrator. The
lower part of the figure shows, how we can do the pipelining on bit-level (W=3)
jointly for the whole adder chain. We can see that the computation of the least
significant bit (LSB) of the second adder can directly start after the LSB of the
first adder is available. We conclude that there is no need to deskew the outputs
of the first adder and skew the inputs at the second adder again. This would
only be a waste of registers and an unnecessary increase of the overall latency of
the circuit. There is actually only one skewing triangle necessary at the input
and one deskewing triangle at the output, as figure 60 (bottom) shows. If we
had only pipelined the carry-ripple adder on bit-level and just plugged the
adder chain together on word-level, we would not have been aware of this unnecessary
skewing overhead and thus would have wasted a lot of area.
Figure 60: FIR example for reuse of skewing triangles (L=4, W=3)
7.2
Perfect Shuffle
The Perfect Shuffle network can often be found in algorithms to realize a permutation of the input signals (e.g. Viterbi, FFT) or for routing purposes. Classical
interconnection structures such as crossbar switches can be used for this purpose
as well, but consume much more area when implemented in hardware.
Figure 61: Crossbar switch can realize any arbitrary permutation on the inputs
The Perfect Shuffle network is also able to realize arbitrary permutations when
arranged in a so-called shuffle exchange network (see section 7.3) and is thereby
able to save some area compared to the crossbar switch. This is actually the
method used for shuffling playing cards: the deck is split into equal halves which are
then pushed together in a certain way so as to make them perfectly interweave.
The cards of the two halves are arranged in an alternating manner after shuffling.
Figure 63 shows how this is applied to communication networks.
Figure 63: Perfect shuffle network with 8 elements (nodes) and 3 stages
The realization of the P_4^8 shuffle accordingly looks as depicted in figure 65.
Figure 65: Realization of the P_4^8 shuffle supported by memory
Figure 66 shows a different example with 9 nodes. In this case 3 splits have
been used (i.e. 3 different sets are considered for the shuffling).
Figure 66: Perfect shuffle network with 9 elements, 3 splits and 2 stages
In the P_3^9 case the original order is already restored after log_3 9 = 2 stages.
As a general rule for Perfect Shuffle networks we can summarize: the original
order is restored after log_b N shuffling stages!
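A small Python sketch of the P_b^N shuffle (lists stand in for the routed buses; the function name is illustrative) lets us verify this rule directly:

```python
# A P_b^N perfect shuffle as a list permutation: split the N elements into
# b equal piles, then interleave the piles element by element.
def perfect_shuffle(seq, b):
    N = len(seq)
    assert N % b == 0          # only an integer divisor b is admissible
    piles = [seq[p * N // b:(p + 1) * N // b] for p in range(b)]
    return [pile[j] for j in range(N // b) for pile in piles]
```

For N = b^k elements, applying the shuffle k = log_b N times restores the original order, e.g. three P_2^8 stages or two P_3^9 stages.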
Another example of a Perfect Shuffle network, with a P_4^16 permutation, is shown in
figure 67. This permutation network can e.g. be used for a so-called ideal block
interleaver, which is often found in coding theory for randomizing the order of
the data stream before transmitting it over the channel to increase immunity
to disturbance. Thereby, the data stream is written column-by-column into a
memory with 4×4 cells. Afterwards it is read from the memory row-by-row, thus
realizing a P_4^16 Perfect Shuffle.
The block interleaving is reversible, since after the second shuffling stage the
original order is restored: log_4 16 = 2. We can also express this via a factorization
rule for permutations: P_4^16 · P_4^16 = P_16^16 = P_0^16, whereas P_N^N = P_0^N represents
the identity (i.e. no permutation). In general we can decompose a permutation
with a split (a_1·a_2) into a network of two permutations, having splits a_1 and
a_2 but the same number of elements as the original permutation:

P_{a_1}^{b_1} · P_{a_2}^{b_1} = P_{(a_1·a_2) mod b_1}^{b_1}
Hence, we can confirm once again the initial conclusion for P_2^8:
P_2^8 · P_2^8 · P_2^8 = P_8^8 = P_0^8.
Observing a Skat deck with 32 playing cards and usual shuffling with 2 splits,
you need 5 stages to restore the original order:
P_2^32 · P_2^32 · P_2^32 · P_2^32 · P_2^32 = P_0^32.
Please note: for an admissible permutation, the number of splits b has to be an
integer divisor of the number of elements N.
E.g. for N = 12 the following would be valid permutations: P_2^12, P_3^12, P_4^12, P_6^12.
A last note concerning the routing of the wires for a Perfect Shuffle network: if
we assume W processors to be arranged as a linear vector, then we have W/2
channels to route in x and W channels in y direction.
7.3 De Bruijn Networks & Shuffle-Exchange
Shuffle exchange networks can be used for all kinds of dynamic data flow control,
e.g. routing/switching (cross-bar like) or sorting. As figure 68 shows, a single
stage of the network can be sub-divided into a shuffle stage, represented by a
Perfect Shuffle network, and an exchange stage that allows a pair-wise exchange
of the data between each two adjacent processors. As mentioned in the previous
section, the area is proportional to W² for the shuffle stage. For the exchange
stage the area is A ∝ W. Hence, the shuffle stage is dominant with respect to
routing area.
Figure 68: Shuffle exchange network for N=8
By combining the shuffle with the exchange stage, i.e. merging the first column
of nodes in figure 68 with the second one, we come to the de Bruijn graph
depicted in figure 70.
What graph representation do we get, if we shift the bits from left to right into
the register? The corresponding graph, the so-called reverse de Bruijn graph,
is depicted in figure 71. While the original de Bruijn graph realizes a shuffle exchange
for a P_2^8 permutation, the corresponding reverse de Bruijn graph is equivalent to
a P_4^8 shuffle with preceding exchange stage, i.e. the order of the shuffle and
exchange stages has been inverted.
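The shift-register view can be captured in a short Python sketch (nodes are numbered by the W-bit register contents, N = 2^W; the function names are illustrative):

```python
# Node connectivity of the de Bruijn graph and its reverse, derived from the
# shift-register interpretation of the node numbers.
def de_bruijn_successors(i, W):
    # shift a new bit b in from the right (left shift of the register)
    N = 1 << W
    return [((i << 1) | b) % N for b in (0, 1)]

def reverse_de_bruijn_successors(i, W):
    # shift a new bit b in from the left (right shift of the register)
    return [(i >> 1) | (b << (W - 1)) for b in (0, 1)]
```

Each edge of the reverse graph is exactly a forward edge traversed backwards, which is why both realize the same connectivity with inverted shuffle/exchange order.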
Though the two realizations look almost identical with respect to the routing
complexity, there is a big difference concerning the consumed routing area!
Every parallel wire consumes a certain area for its routing channel in x and y
direction. For the routing in x direction we clearly need one routing channel
per node, independent of the realization. For the routing in y direction, we find
the highest concentration of wires at the center symmetry line that is sketched
in the figures 70 and 71. In figure 70 we recognize that wires from 4 different
processing nodes cross the center line. Thus, 4 (= N/2) routing channels are
necessary for the de Bruijn realization. In the reverse de Bruijn graph (figure
71) we see that one wire from all 8 nodes crosses the center line. Thus, we need
N parallel routing channels in y direction in this case (which doubles the width
of the routing network). This has a significant impact on the consumed area,
since A ∝ N²!
Finally, let us combine multiple stages of a de Bruijn shuffle. If we hook up
log_a(N) de Bruijn graphs, each with a P_a^N shuffle, we come to the so-called
Ω-Network. As depicted in figure 72, log_a(N) stages are enough to reach
full coverage, i.e. to be able to connect every input with every possible output
(and vice versa), given that the exchange stages are configured accordingly.
Because of this property, the Ω-Network has special importance, e.g. for routing/switching purposes as a (more efficient) replacement for a crossbar switch.
Please note that de Bruijn and FFT graph can be converted into one another
by rearranging the order of the processing nodes.
8 Measures
Up to now, we discussed many different hardware implementation and optimization techniques. The question is: How can we evaluate and compare two
different implementations? For this purpose we introduce some important measures within this section which allow us to make a statement on the quality of
a hardware implementation.
8.1
Basics
One of the most important (and basic) measures for evaluating a hardware
implementation is the area (A). It can for example be measured in # transistors,
# NAND-gates, die size (e.g. in mm²) or # slices (FPGAs). We can further
sub-divide the area as follows:
Al : logic area (e.g. for adders, multipliers, etc.)
Ac : communication/routing area (for the wires that connect logic and
registers)
Ar : area of registers/buffers (for pipelining and storage)
Am : area of memory
To keep it simple, we will not consider the memory area here, since it very much
depends on the applied technology (e.g. available memory size, type, etc.) and
the according technology specific algorithm implementation.
A second very important basic measure is the latency of the clock interval (T)
or the clock rate (1/T), respectively. The clock rate is directly proportional to
the achievable processing rate at the input of the circuit and thus a very good
measure for the processing speed. The latency can be sub-divided in a similar
way:
Tl : latency of logic
Tc : latency of communication (i.e. wire delays)
Tr : latency of registers (i.e. setup time2 and hold time3 )
Tm : latency of memory (i.e. memory access time)
A third important basic measure is consumed energy:
El : energy consumption of logic
Ec : energy consumption for communication
Er : energy consumption for register storage
Em : energy consumption for memory storage
2 the time that the data input must be stable before a clock edge to allow the register to
sample the new data value
3 the time that is needed by the register to update the output after a clock edge
8.2
Measure of Complexity
Let us first define the efficiency (E) of an algorithm. Clearly, an algorithm gets
the more efficient the less area it consumes and thus:

E ∝ 1/A

Secondly, an algorithm becomes more efficient the higher the processing rate
(1/T) is:

E ∝ 1/T

Combining both, we get:

E ∝ 1/(A·T)

Now the complexity (C) can be seen as the inverse of the efficiency (the more
efficient an algorithm becomes, the less complex it is). Finally, this yields the
well-known AT measure for the complexity:

C ∝ A·T
By sub-dividing the area and latency according to section 8.1, we could refine
the equation by replacing A and T by the sum of the constituent parts. Does
this really work?
AT = (Σ A_i) · (Σ T_i) = (Σ A_l + Σ A_c + Σ A_r) · (Σ T_l + Σ T_c + Σ T_r)
This approach is o.k. with respect to the area. However, summing up all
delays in the circuit does not provide any information with respect to the
achievable clock interval. What we are actually looking for is the latency of the
critical path that was discussed many times before. This is directly related to
the achievable clock interval. Hence, the equation must be rewritten as follows:
(AT) = (Σ A_l + Σ A_c + Σ A_r) · T_CP

with T_CP being the latency of the critical path.
If the critical path is not known, the upper limit of T and hence of C can be
estimated as follows:
(AT) ≤ (Σ A_l + Σ A_c + Σ A_r) · (max_l T_l + max_c T_c + max_r T_r)
We can extend the equation for the complexity measure to also consider the
software side as follows:
AT = A_P · T_CP · N_cycles
N_cycles thereby represents the number of cycles to complete a task, which also
affects the complexity of an algorithm. If a task runs on the same processor at
the same clock speed and takes more cycles to complete than a different
task, it is clearly expected to be more complex. Consequently, this definition of
the AT measure needs to be analyzed with respect to the task to be completed.
Let us examine a small example to emphasize the influence of the software side,
if we talk about mixed hardware/software approaches. Consider 2 different
processing cores:
Core I:  A_P = 10 mm², 1/T = 2 GHz
Core II: A_P = 20 mm², 1/T = 4 GHz

By using the AT measure both seem to have the same efficiency (complexity):

(A_P · T_CP)_I  = 5 mm²/GHz
(A_P · T_CP)_II = 5 mm²/GHz
Furthermore, we define the instruction level parallelism (ILP), which is the
average number of instructions that the processor executes in parallel:
ILPI = 1.2
ILPII = 1.7
By considering this software point of view we clearly recognize that core II is
more efficient than core I. However, our basic AT measure does not consider
this and had to be extended for this purpose:
AT = A_P · T_CP · N_cycles · (1/ILP)

Now that we have introduced the complexity measure, we want to apply it to
some examples in the following.
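As a quick numeric check of the ILP-extended measure, using the two cores from above (units mm² and GHz as in the example; the function name is illustrative):

```python
# Effective complexity per task: area * clock period / instruction-level
# parallelism (N_cycles cancels out when comparing the same task).
def at_eff(area_mm2, clock_ghz, ilp):
    return area_mm2 * (1.0 / clock_ghz) / ilp

core_I  = at_eff(10, 2, 1.2)   # ~4.17 mm^2/GHz
core_II = at_eff(20, 4, 1.7)   # ~2.94 mm^2/GHz
```

Core II ends up with the clearly lower effective complexity, which the plain AT measure could not distinguish.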
8.3
Wordlength Analysis
Routing Example
Take a look at figure 74. It exemplarily demonstrates on-chip routing on
bit-level. We already discussed the importance of the bit-level view and the
difference to the word-level in section 7.1. In this section we are going to analyze
the routing complexity with respect to the wordlength W.
Figure 74: Example for on-chip routing (bit-level view)
From figure 74 it should be clear that by doubling the wordlength, we get twice
as many wires in x- as well as in y-direction. Hence, the routing area doubles in
the x- as well as in y-dimension. This results in a total area increase of factor
4. In general we have the following relation:
A_c ∝ W²

The communication delay, however, increases only with the square root of the
area:

T_c ∝ √A_c ∝ W

Using the AT complexity measure, we finally get:

C_c = A_c · T_c ∝ W³
We can draw the surprising conclusion that the routing complexity increases
with W³ for growing wordlength W. Or, the other way around:
the efficiency goes down by W³ with increasing wordlength W.
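The three relations can be condensed into a toy calculation (a sketch with all technology constants set to 1):

```python
# Routing complexity of a bit-level bus as a function of the wordlength W.
def routing_complexity(W):
    area = W ** 2            # A_c ~ W^2: W wires in both x and y direction
    delay = area ** 0.5      # T_c ~ sqrt(A_c) ~ W
    return area * delay      # C_c = A_c * T_c ~ W^3
```

Doubling the wordlength multiplies the routing complexity by 2³ = 8.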
De Bruijn Example
As a third example we take a look at the de Bruijn network again for analyzing
its complexity using the AT metric. Assume a linear vector of processing nodes,
as we did before in section 7.3. According to figure 75, it is easy to conclude that
the number of routing channels in x- as well as in y-direction depends linearly
on the wordlength W and the number of processing nodes N . Consequently, we
can assume the following complexity relation for the communication area:
A_c ∝ W² · N²
The logic area Al depends linearly on the number of nodes (N ) and linearly
or quadratically on the wordlength (W ). The wordlength dependency of the
complexity is determined by the algorithm that is realized by a single node in
the de Bruijn communication network. Algorithms such as addition, min/max
(e.g. used for Viterbi) have a linear logic area complexity, while multiplications
(e.g. used within FFT) exhibit a quadratic complexity. We can conclude:
A_l ∝ N·W or N·W². The area of registers is also linearly increasing with N and
also depends linearly on W, assuming the more efficient case w/o skewing (see
section 8.3): A_r ∝ N·W. The following table summarizes the findings for the
area:
Communication:  A_c ∝ W² · N²
Logic:          A_l ∝ N·W or N·W²
Registers:      A_r ∝ N·W
From the table we can conclude that the communication area seems to be the
most critical part with respect to the complexity of the de Bruijn network,
since it increases quadratically with W and with N as well. So what about the
communication latency? As mentioned previously its complexity is proportional
to the square root of the area complexity and thus:
T_c ∝ √A_c ∝ W · N
But what do we gain in terms of communication complexity? The interconnection still seems to be the same! By observing the recursive structure on bit-level
(figure 77, left), we can see that we have a feedback at each bit-level. The
individual bit-levels are independent of each other, however. We can slightly
rearrange the processor-bit nodes (figure 77, right) and now recognize that
we actually deal with W independent 1-bit de Bruijn networks.
How is the complexity affected by this insight? To clarify this, we once again
redraw the network structure from figure 77 (right) a little bit.
Figure 78: Complexity of recursive de Bruijn structure
T_c,1bit ∝ √A_c,1bit ∝ N

This is because the individual 1-bit de Bruijn networks are independent and we
assume pipelining on bit-level. This yields the final result for the AT complexity
measure:

AT ∝ W · N³   (bit-level recursive structure, instead of AT ∝ W³ · N³ on word-level)
FIR Example

Consider the convolutional sum of an FIR filter on word-level:

y_k = Σ_{i=0}^{L} a_i · x_{k-i}

Considering the bit-level, the equation turns into three nested sums over the
taps i and the bit positions n (coefficient bits) and m (data bits):

y_k = Σ_{i=0}^{L} Σ_{n=0}^{W} Σ_{m=0}^{W} a_{i,n} · x_{k-i,m} · 2^n · 2^m
Now the convolutional sum is computed individually on each bit. The complexity
of this bit-plane FIR filter is significantly decreased, as we have already seen
in the de Bruijn example before!
Integrator Example
In this section we examine the complexity of the well-known integrator circuit.
Viewed from above (figure 79, left), the routing complexity seems to be A_c · T_c
∝ W³, according to the result of the previous section. If we examine the adder
circuit a little closer using a convenient representation (figure 79, right), we
see that this is actually not true!
For increasing the wordlength by 1, a new adder stage has to be appended. This
new adder stage consists of a constant amount of routing area: A_c ∝ W. The
routing delay T_c depends on the maximum wire length, which is independent
of the number of adder stages: T_c ∝ 1. Hence, we only get A_c · T_c ∝ W, in
contrast to the top-level view on the left-hand side of figure 79. After fully
pipelining the adder (also depicted in figure 79, right) we observe the following
situation:
A_c ∝ W     T_c ∝ 1
A_l ∝ W     T_l ∝ 1
A_r ∝ W²    T_r ∝ 1

Without skewing, the register area reduces to A_r ∝ W and the complexity
becomes:

AT ∝ W · 1 = W

This once again demonstrates how much the hardware implementation is influenced by skewing. By omitting the skewing, we can significantly reduce the
complexity: C ∝ W² → C ∝ W!
8.4
M-Step Analysis
In the following, we will apply the AT complexity measure from section 8.2 to
analyze the efficiency of the two techniques that we introduced to speed up
recursions, namely pipeline interleaving and block processing.
Pipeline Interleaving
Consider the following 4-step recursion with a generic Dot operator (◦) and
logarithmic look-ahead, also depicted in figure 80:

y_k = (x_k ◦ x_{k-1}) ◦ (x_{k-2} ◦ x_{k-3}) ◦ y_{k-4}
Side note:
It should be mentioned that the order in which the ◦ operator is applied in the
look-ahead part is important here (in contrast to the + operator)! Hence, the
following implementation is not equivalent!
Normally, we compute (x_k ◦ x_{k-1}) ◦ (x_{k-2} ◦ x_{k-3}) in the logarithmic look-ahead. By changing the order according to the figure above, the realized computation becomes (x_k ◦ x_{k-2}) ◦ (x_{k-1} ◦ x_{k-3}). Because the ◦ operator is not
commutative, x_{k-1} and x_{k-2} cannot be exchanged and we run into trouble here!
Figure 80: 4-step recursion with generic Dot operator and logarithmic lookahead
Area, clock period and complexity of the 1-step implementation serve as the
normalized reference:

A_1 = 1,   T_1 = 1,   A_1 · T_1 = 1
For the M-step recursion we get the following relative area metrics:
Look-ahead:  A_l ∝ ld(M),   A_r ∝ M,   A_c ∝ 1
Feedback:    A_l ∝ 1,       A_r ∝ M,   A_c ∝ 1

For the feedback loop, the clock period ideally shrinks to:

T_CP = T_1 / M
However, in reality we are only able to shorten the critical path with respect to
the logic delay! The register delay (setup/hold time) as well as the wire delay
cannot be shortened via pipelining:

T_CP = T_CP,l / M + T_CP,r + T_CP,c = T_CP,l / M + T_r + T_c

Finally, these results yield the following AT measure:

AT ∝ (ld(M) + M) · (T_CP,l / M + T_r + T_c)

For large M, the register area (∝ M) multiplied with the constant register and
wire delay (T_r + T_c) dominates, and thus:
AT ∝ M

...which is not so nice anymore, because this means that the efficiency of the
algorithm now decreases linearly with every additional recursion step!
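The saturation effect is easy to see numerically; the sketch below assumes a unit logic delay and a constant register-plus-wire delay of 0.2 (arbitrary illustrative values):

```python
# AT measure of the M-step pipeline-interleaved recursion: only the logic
# part of the critical path shrinks with M, the register/wire delay does not.
import math

def at_measure(M, t_logic=1.0, t_regwire=0.2):
    area = math.log2(M) + M            # look-ahead logic + pipeline registers
    t_cp = t_logic / M + t_regwire     # register/wire delay does not shrink
    return area * t_cp
```

For large M, doubling M roughly doubles the AT measure, confirming AT ∝ M.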
Block Processing
Finally, let us take a quick glance at how block processing changes the complexity of an algorithm. With block processing we definitely have to deal with an
area increase: we get M parallel processing chains, each consisting of ld(M)
stages (with logarithmic look-ahead). Compared to the original 1-step recursion
function, we thus have a final area increase of:
A ∝ M · ld(M)
The achievable clock frequency on the other hand stays untouched by the block
processing:
T ∝ 1
Accordingly, the AT measure yields:
AT ∝ M · ld(M)
This means that the complexity is increased (or the efficiency is decreased, respectively) by block processing. So, what did we miss? Sure, we are now able to do multiple
computations in parallel. This so-called parallelism factor P is not considered
by the AT measure. That is why we have to extend the AT measure accordingly,
which finally yields the effective AT measure AT_eff:
AT_eff ∝ AT / P = M · ld(M) / M = ld(M)

with P = M for the case of block processing.
8.5
ATE Measure
We are now very familiar with the AT measure. What if we want to consider the
energy consumption in addition to the area and clock speed within our measure?
We could define the energy consumption as follows:

E = T · N · P

with E the consumed energy, T the clock period, N the number of cycles and
P the power.
Used for...

AT:  chip cost analysis
ATE: determine cost of a solution taking also energy into account

Other Measures

Besides the AT measure the following two complexity measures can occasionally
be found:

AT²: complexity measure for communication-dominated applications
A/T:
Now that we have become acquainted with the basic hardware/software co-design principles, we will take a look at more complex structures within this
section, namely processors.
9.1
Hardware Reuse
Example: FIR
We already know the equation for the convolutional sum of an FIR filter:
y_k = Σ_i a_i · x_{k-i}
The basic idea of our hardware reuse approach is that we just take a composite
of a multiplier, an adder and a register (framed in figure 81) and use it for
the computation of all filter stages. The resulting processor example for the
iterative FIR computation is shown in figure 82.
The coefficients a_i are read from a memory (a-mem), addressed with the
pointer *pa (-- because of the descending index). In parallel the input x_{k-i} is
read from a second memory (x-memory) with *px used for addressing it (++
because of the ascending index). Both values are now multiplied and accumulated,
i.e. added to the value of the previous stage. For this purpose an accumulator
register is used (which must be initialized with 0). The output is generated
after the L-th (in this example L = 3) stage and is finally written into a third
memory. This FIR processor realization needs 2 read operations per iteration.
The following pseudo code demonstrates how this hardware works:
Step 0:  reset acc;  i=3:  acc  = a_3 · x_{k-3}
Step 1:  i=2:  acc += a_2 · x_{k-2}
Step 2:  i=1:  acc += a_1 · x_{k-1}
Step 3:  i=0:  acc += a_0 · x_k ;  output y_k
Following this approach we were able to reuse the hardware of one FIR tap (the
multiplier/adder/register composite). The price for this reuse is additional
hardware:
memories
control
registers (acc) for intermediate results
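The pseudo code above maps directly to a short Python routine (a sketch: one MAC unit and one accumulator; `fir_direct` is an illustrative name):

```python
# Iterative direct-form FIR computation with a single multiply-accumulate
# unit, mirroring the step-by-step pseudo code (here 4 taps, i.e. L = 3).
def fir_direct(a, x, k):
    acc = 0                                # reset acc
    for i in reversed(range(len(a))):      # i = 3, 2, 1, 0
        acc += a[i] * x[k - i]             # acc += a_i * x_{k-i}
    return acc                             # output y_k
```

One loop iteration corresponds to one processor step with its 2 memory reads.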
9.1.2 Transposed Form FIR
In section 2.5 we already got to know a more efficient alternative of the FIR
filter implementation: the so-called transposed form FIR filter. A corresponding
example with 4 taps is shown in figure 83. Its advantages (with respect to
the hardware realization!) are: reduced latency and fewer registers needed for
pipelining (due to reusing the already available shift registers). So, what about
the processor, i.e. hardware reuse, approach?
Figure 83: Transposed Form FIR example for hardware reuse
Figure 84: Reusable hardware (processor) for transposed form FIR example
Similar to the direct form FIR version we need an a-mem and an x-mem with
according address pointers to read the coefficients and input data, respectively.
The subsequent multiplication and addition is also similar. However, the storage of the output/intermediate values gets somewhat more complicated for the
transposed form filter. Instead of a simple accumulator register we now need a
dual-ported memory for the intermediate Z-values, to be able to read Z_{k-1,i+1}
while writing Z_{k,i} in the same cycle. This also implies more complicated
address handling, since we have to read from address *pz but write to the read
address of the last iteration cycle (the input of stage i is the output of stage i+1). Finally, this FIR processor realization takes 3 read/write operations per iteration,
which is also a disadvantage compared to the direct form FIR. The according
pseudo code looks as follows:
Step 0:
Step 1:
Step 2:
Step 3:
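Since the concrete step bodies are not spelled out here, the following Python sketch only illustrates the principle of the transposed-form update with a Z-memory (sweeping i upwards models reading Z_{k-1,i+1} before Z_{k,i} is written; names are illustrative):

```python
# One iteration of the transposed-form FIR: per input sample x_k, every
# Z cell is refreshed as z[i] = a[i]*x_k + z_old[i+1]; y_k is the new z[0].
def transposed_fir_step(a, z, x_k):
    L = len(a)
    for i in range(L):
        # z[i+1] still holds the previous iteration's value when z[i] is written
        z[i] = a[i] * x_k + (z[i + 1] if i + 1 < L else 0)
    return z[0]                # y_k
```

Each loop iteration needs a coefficient read, a Z read and a Z write, matching the 3 memory operations per iteration mentioned above.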
The following table summarizes both FIR implementation approaches with respect to the realization as reusable software and demonstrates once again the
advantage of the direct form FIR for this purpose:
To summarize the results of the small FIR example: though the transposed
form FIR is well suited for the hardware implementation, it is not a good choice
for the sequential (i.e. software or processor) implementation. The direct form
FIR, on the other hand, is better suited for a sequential implementation. We can
conclude that the structure of a design must be carefully chosen depending on
the type of implementation (hardware vs. software) to be efficient! The
following table summarizes some general rules:
Hardware (HW):
distribute algorithmic registers within computational logic
use shift registers to ...

Software (SW):
keep algorithmic registers together
use registers if possible as shift registers to:
1. simplify memory addressing & control
2. process iterations along critical path with local data reuse (e.g. accumulator)

Shift registers are good in both cases.
9.2
Now that we have investigated the very basic and static example of an FIR
processor, we will design a more complex and flexible processor for digital signal
processing. First, we define a short wish list of the desired features of our DSP:
2 memory ports, e.g.
1 memory read port
1 memory read or write port
1 multiplier
1 arithmetic logic unit (ALU)
Let us now directly take a look at the block diagram of the Simple DSP's
data path as depicted in figure 85. In the following we will build up the whole
processor step by step.
Figure 85: The data path of our Simple DSP
Let us start with the two input registers ra and rb. They can receive a value
from the data memories (a-mem, b-mem) and pass their content to the computational part of the processor's data path. The computational part consists of
a multiplier with a subsequent ALU4 . The register content (of ra or rb) can
either directly be passed to the ALU, bypassing the multiplier, or is passed to
the multiplier first and subsequently propagated to the ALU. This is controlled
by two multiplexers. The output of the computational part is stored in one of
two accumulator registers (acc0 or acc1). The output of the accumulators can
be looped back to the ALU input as well as to the (dual-ported) data memory
(b-mem). In addition to the accumulators, the ALU also writes some flag registers (e.g. sign or overflow bit) that can be used for program control later on
(conditional jumps, comparisons, etc.). For addressing the data memory four
pointer registers are provided: p0, p1, p2 and p3. For each of the two data
ports (a and b) the corresponding address is multiplexed from one of the four
4 Arithmetic Logic Unit: a generic unit that is able to compute different arithmetic (e.g. ADD, SUB) and logic (e.g. AND, OR) operations, usually controlled by a mode input
pointer registers and passed to the according address input pa or pb, respectively. The output is passed to the according input register ra or rb (in case of
read). The input is passed from acc1 to the data memory of port b (in case of
write). To update the pointer address values the modifier registers m0, m1, m2
and m3 can be used. E.g. we could define m0 = 0, m1 = 1, m2 = -1, m3 = 2. In this case m1 could be used for address increment, m2 for address decrement, m3 for an address increment of 2 and m0 as a neutral address modifier.
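This pointer/modifier mechanism can be sketched in a few lines of Python (a minimal sketch: the register names p0..p3 and m0..m3 follow the text, while the post-update rule p += m and the class interface are assumptions):

```python
# Hypothetical sketch of the Simple DSP address generation: four pointer
# registers (p0..p3) are post-updated by one of four modifier registers
# (m0..m3). The modifier values match the example above (m2 = -1 decrements).

class AGU:
    def __init__(self):
        self.p = [0, 0, 0, 0]            # pointer registers p0..p3
        self.m = [0, 1, -1, 2]           # m0 neutral, m1 incr, m2 decr, m3 incr-by-2

    def address(self, p_idx, m_idx):
        """Return the current address of p<p_idx>, then post-modify it with m<m_idx>."""
        addr = self.p[p_idx]
        self.p[p_idx] += self.m[m_idx]   # post-update, *p++-style addressing
        return addr
```

Repeatedly calling address(0, 1) thus walks p0 forward by one address per access, while address(0, 2) walks it backward.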
9.2.1 Example: Direct Form FIR
In this section we will give a short example and show how we could use our Simple DSP to compute the direct form FIR (compare to section 9.1.1). We use the input register ra for the coefficients a_i; accordingly, rb is used for the input samples x_{k-i}. The accumulated result is stored in acc1; acc0 is not needed for this application. We use the pointer registers p0 and p1 as pa and px, respectively. p2 is also needed to store the output y_k. The algorithm for the direct form FIR is sketched by the following pseudo code.
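Functionally, the computation amounts to the following hedged Python sketch (the register and pointer names ra, rb, acc1, pa, px are taken from the text; the loop structure is an assumption, not the DSP's actual step sequence):

```python
# Hedged sketch: direct form FIR y_k = sum(a[i] * x[k-i]) expressed with the
# Simple DSP resources named in the text: ra/rb input registers, acc1
# accumulator. The comments show the intended register-level operations.

def fir_direct(a, x, k):
    acc1 = 0
    for i in range(len(a)):      # one MAC per loop step
        ra = a[i]                # ra = *pa++  (coefficient fetch, port a)
        rb = x[k - i]            # rb = *px    (sample fetch, port b)
        acc1 += ra * rb          # acc1 = acc1 + ra * rb  (mult + ALU)
    return acc1                  # *p2 = acc1  (write-back of y_k)
```

One call computes one output sample y_k; the pointer bookkeeping of the real AGU is hidden inside the Python indexing.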
Step -1 (init):
Step 0:
Step 1:
Step 2:
Step 3:
Step 4:
9.2.2 Instructions Needed
To be able to control the data flow within our DSP we need to define control
words, so called Function Instruction Words (FIW), for every functional unit of
the DSP. Our DSP basically needs control of 4 functional units with according
FIWs:
1. Address generation unit for ra-port
2. Address generation unit for rb-port
3. Data path unit (ALU/mult)
4. Program control unit
FIW for Address Generation Unit (ra-port):
We define the following requirements and restrictions for addressing the ra-port.
1. Read-only access from data memory into input register ra
5 an immediate value is directly contained within the FIW in contrast to register or memory
access where only the address is passed with the FIW
FIW for Address Generation Unit (rb-port):
The MSB is used to toggle between read and write mode. Another bit is used for switching between memory and register access. The remaining 4 lower bits are used similarly to the ra-port: 4-bit register address (in register mode) or 2-bit pointer address and 2-bit update mode (in memory mode), respectively.
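A possible decoding of this 6-bit FIW can be sketched as follows (the field layout follows the description above, but the concrete bit polarities, e.g. 1 = write, and the returned dictionary format are assumptions):

```python
# Sketch of decoding the 6-bit rb-port FIW: MSB = read/write toggle, next
# bit = memory/register switch, low 4 bits split into 2-bit pointer address
# plus 2-bit update mode in memory mode, or a 4-bit register address.

def decode_rb_fiw(fiw):
    assert 0 <= fiw < 64                      # 6-bit word
    write = (fiw >> 5) & 1                    # MSB: read(0) / write(1), polarity assumed
    mem   = (fiw >> 4) & 1                    # memory(1) or register(0) access
    if mem:
        return {"write": write, "mode": "mem",
                "pointer": (fiw >> 2) & 3,    # 2-bit pointer address (p0..p3)
                "update":  fiw & 3}           # 2-bit update mode (m0..m3)
    return {"write": write, "mode": "reg",
            "reg": fiw & 15}                  # 4-bit register address
```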
FIW for Data Path Unit (ALU/mult):
Assuming a 6-bit FIW, we come to the following structure for the data path
unit FIW (also compare it with figure 85).
Figure 88: Functional instruction word of data path unit
The first bit (MSB) is spent to control the destination of the ALU output (acc0
or acc1). The second bit is used to control the left multiplexer of the ALU
input (i.e. feedback from acc0 or ra/mult output). The third bit controls the
right multiplexer of the ALU input (i.e. feedback from acc1 or rb output). The
ALU mode (plus the multiplier bypass using the left multiplexer) is controlled
by the lower 3 bits. Possible modes are: +, -, AND, OR, XOR, Mult, MAC,
nop6 .
FIW for Program Control Unit:
The FIW for the program control unit has influence on:
program flow (e.g. jumps, call, return, ...)
conditional execution (e.g. if, else, ...)
repeat / loops
We will not go into the details here and simply assume 6 bits to be spent for
this FIW.
9.3 Instruction Set Architecture (ISA)
2. It controls the instruction decoding pipeline, i.e., the mapping of IWs onto HW and its control pipeline (VLIW, CISC, RISC, ...).
3. It provides an interface between hardware and software that defines the
assembler/compiler/.... input.
9.3.1 VLIW
The idea of the VLIW (Very Long Instruction Word) ISA is simple. We just concatenate the instruction words of every single functional unit (i.e. the FIWs) to form one long processor instruction. Thus, every functional unit is directly controlled in parallel via a single instruction. Figure 89 shows the VLIW for our example DSP, consisting of 24 bits.
6 nop: no operation, i.e. an instruction that does nothing
Figure 89: Very Long Instruction Word for our Simple DSP
For more complex processor architectures the VLIW would become much longer
(which is the main disadvantage of this ISA approach).
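The concatenation can be sketched as simple bit packing (the field order within the 24-bit word is an assumption; the text only fixes four 6-bit FIWs and the 24-bit total):

```python
# Minimal sketch: the 24-bit VLIW is the concatenation of four 6-bit FIWs
# (here assumed order: program control, AGU port a, AGU port b, data path).

def pack_vliw(fiw_pcu, fiw_agu_a, fiw_agu_b, fiw_dpu):
    for f in (fiw_pcu, fiw_agu_a, fiw_agu_b, fiw_dpu):
        assert 0 <= f < 64                     # each FIW is 6 bits wide
    return (fiw_pcu << 18) | (fiw_agu_a << 12) | (fiw_agu_b << 6) | fiw_dpu

def unpack_vliw(iw):
    """Split a 24-bit IW back into its four 6-bit FIWs."""
    return [(iw >> s) & 63 for s in (18, 12, 6, 0)]
```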
Processor Block Diagram
Based on the VLIW ISA we are now able to draw the big picture of the top-level
processor architecture for our Simple DSP. This is shown in figure 90. Our
processor architecture consists of a
control loop
and a data loop.
The control loop in turn consists of three functional units:
Program Control Unit (PCU)
Program Memory (PM)
Instruction Decoder (ID)
The data loop consists of three functional units as well:
Address Generation Unit (AGU)
Data Memory (DM)
Data Path Unit (DPU)
Let us first take a closer look at the control loop. The PCU manages the program flow, which is usually a straightforward sequential flow but may be influenced by executed instructions (e.g. program control, loops, conditional jumps, etc.).
The next instruction is fetched from the PM using the program counter (PC)
of the PCU for addressing it. The fetched instruction word (IW) is passed
to the ID that maps the IW onto a bunch of control signals for controlling
every functional unit of the data loop. For our VLIW ISA it just separates
the FIWs that are contained in the IW. The data loop starts with the AGU that is responsible for pointer management (using pointer registers) and pointer
updates for read/write address generation. When the correct read address is
pending, input data can be fetched from the DM and passed to the DPU. The
DPU is the actual number crunching unit (see section 9.2) that consists of the
ALU, multiplier and I/O registers. The DPU output may be written back to
the DM. For this purpose the according write address must be selected by the
AGU which in turn is controlled by the DPU itself. In addition the output of
the DPU computation may influence the program control flow. E.g. in case of
a conditional jump, the result of the comparison (sign flag) which is computed
by the DPU, controls whether the jump takes place or not. Hence, we have a
feedback to the PCU and ID.
9.3.2 RISC
The instruction now only consists of a single 6-bit FIW and a 2-bit opcode that is used for addressing the functional unit (we defined FIWs for 4 different FUs in the case of our Simple DSP). The RISC ISA clearly has the following disadvantage:
Slower: due to sequential instead of parallel FU control
On the other hand, we gain the following advantages:
less verbose than VLIW
more modular (decoupled memory & ALU, orthogonal instruction set)
The second one is the main advantage of the RISC concept that becomes very
important for super scalar architectures. In fact, RISC actually only makes
sense for super scalar architectures!
Super Scalar Architectures The idea of the super scalar architecture is the
following. The program control unit addresses and loads (i.e. fetches) multiple
successive instructions from the program memory (e.g. for the different functional units) that are thereupon executed in parallel. The parallel execution is
controlled by a dynamic instruction scheduler that analyzes and resolves dependencies in the program flow (and between input/output data) and schedules (i.e. issues) the instructions onto the available parallel FUs with the objective to keep their utilization as high as possible. If we use one RISC instruction for every
FU (in our Simple DSP case: 4), we get a very modular and fast architecture.
With this architecture we don't need to issue nop instructions to keep unused functional units idle (as we have to do with VLIW). Only functionally relevant instructions will be issued to the according FUs.
Remark: Please note that in general the number of fetched instructions and the number of parallel issued instructions need not be identical (N-fetch / M-issue). The number of fetched instructions mainly depends on the abilities/restrictions of the memory (throughput, word length, etc.) and the length
9.3.3 CISC
The Complex Instruction Set Computer (CISC) offers a different idea for reducing the word length of the (VLIW) processor instructions (and thus the code size and the number of memory accesses). Usually not every combination of the FIWs makes sense. For example we definitely don't need 2^24 instructions for our Simple DSP.
Rather, we often only need a small selection of VLIW instructions. This is the
idea that the CISC concept is based on. It defines a small opcode (e.g. 8 bits
for our Simple DSP) that is mapped onto the long VLIW (24 bits in our case)
using an internal ROM table which contains a list of the used instruction words.
I.e. the opcode is just used for addressing the actual VLIW instruction. This is
illustrated in figure 94.
Figure 94: Idea of CISC
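The ROM-table lookup can be sketched in a few lines (the table contents below are made-up placeholders, not actual instruction encodings of the Simple DSP):

```python
# Sketch of the CISC idea: a short opcode indexes an internal ROM table that
# holds only the 24-bit VLIW words actually used by the programs.

CISC_ROM = {
    0x00: 0x000000,   # nop on all functional units
    0x01: 0x0A1B2C,   # hypothetical "load ra, load rb" VLIW
    0x02: 0x03F7E1,   # hypothetical "MAC into acc1" VLIW
}

def expand(opcode):
    """Map a short CISC opcode to its full 24-bit VLIW via ROM lookup."""
    return CISC_ROM[opcode]
```

Code memory then stores only the short opcodes, while the decoder expands each of them to the full-width instruction at run-time.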
The main advantage of the CISC ISA is reduced code size and hence a more efficient usage of the code memory (and possibly fewer memory accesses). On the other hand, CISC also has some disadvantages, as mentioned before. In practice, the mapping of the functional instruction words onto the pipeline stages is not so easy and requires control by a so-called sequencer unit. A CISC ISA
9.4
9.4.1
The problem is that our DSP contains several feedback loops (DP→AGU, DP→ID, ID→PCU, etc.). As we know, the cut-set rule is actually not an appropriate means for pipelining recursive systems (see section 3). That's why we also need to insert speed-ups on those feedback signals in the opposite direction. Since speed-ups are not implementable, we run into trouble if we cannot remove them somehow. We notice that the feedback signals are actually only needed in the following situations:
in data loop: only during memory write
in control loop: only if program flow instructions (jump,...) have been
decoded
from DL→CL: only for conditional execution
In all other cases we don't need the feedback paths and our processor works fine if we just omit the speed-ups, i.e., consider the processor without feedbacks. In
section 10 we will discuss how we can deal with the problems that occur under
the situations mentioned above. At this point we want to take a closer look at the skewing triangle that has been formed due to pipelining between the data and control loop. We already know from section 7.1 that skewing triangles can often
be found in pipelined systems and how much they can influence the efficiency
of a circuit. However, with respect to processors the skewing triangle has also
a strong influence on the kind of programming.
At the output of the ID, the whole instruction word (i.e. the control signals
that are passed to the data loop) is still synchronized with the data. I.e. all
FIWs are still related to the same set of input data. This is called a data
stationary IW (refer to figure 96). After passing the skewing triangle, the FIWs
are spread over 3 time slots and synchronized with the time of processing:
FIW for AGU is passed after 1 cycle, for DM after 2 cycles, for DP after 3
cycles. The instruction word is now time stationary (see figure 96). As we will
see in the following, programming could be data or time stationary!
Figure 96: Data- & Time stationarity of instruction word
9.4.2 VLIW Pipelining
Let us start to examine the programming perspective for the VLIW ISA. In this
case the arrangement of the program memory can be considered as depicted in
figure 97. Each memory entry consists of one instruction word (IW) that in turn
is composed of (in our case) 4 FIWs. The 4 FIWs could also be considered to be
arranged in 4 parallel memories. If the FIWs are completely independent of each
other (→ orthogonal), i.e. if every single FIW contains all information necessary
to control a single functional unit, we are able to decode each FIW using a
separate instruction decoder (ID). This is illustrated in figure 97 (bottom).
Now, we can simply apply the cut-set rule, drawing local cut-sets around each
of the 4 IDs and thus move the registers from the ID output to the ID input, as
figure 98 shows.
Figure 98: Moving skewing triangle in front of IDs by applying cut-set rule
But if the independent FIWs are actually only stored in parallel memories, why
can we not just consider the delays directly for the arrangement of the FIWs in
the code? Assume that the program counter (PC), which is used for addressing
the memory, directly steps through the memory entry by entry in a sequential
order, fetching one IW after another. If we just move the lowest FIW of the IW
in figure 99 one entry to the left (orange box), it will be fetched one clock cycle
later and we can now discard the register at that memory output. Similarly,
we can move the second FIW by two entries to the left and the third FIW by
three and thus skip the whole skewing triangle7 . Finally, we relocated the IW
skewing directly into the program code that is now time stationary. Since skewed
7 the upper FIW is the program control FIW which is not passed to the data loop and thus does not pass the skewing triangle
VLIW code is very hard to read or write (especially if we also take branches
into account), we are reliant on good compilers that are able to automate the
program skewing task. The following small example demonstrates the difference
between data stationary code and equivalent time stationary (skewed) code.
cycle   data stationary             time stationary
n       acc1 = *(pa++) * *(pb)      ra = *(pa++), rb = *(pb)
n+1                                 acc1 = ra * rb
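The compiler-side skewing described in this section can be sketched as follows (a sketch under assumptions: only the three data-loop FIWs are skewed, in the order AGU, DM, DP, matching the footnote that the program control FIW does not pass the skewing triangle; the list representation is mine):

```python
# Sketch of code skewing: FIW number j of each logical instruction is placed
# j program entries later, turning data stationary code into time stationary
# code. 'None' marks slots that hold no useful FIW after skewing.

def skew(program, depth=3):
    """program: list of IWs, each a list of `depth` FIWs in (AGU, DM, DP) order."""
    n = len(program)
    out = [[None] * depth for _ in range(n + depth - 1)]
    for cycle, iw in enumerate(program):
        for j, fiw in enumerate(iw):
            out[cycle + j][j] = fiw       # FIW j takes effect j cycles later
    return out
```

The skewed program is depth-1 entries longer, which is exactly the skewing-triangle overhead relocated into the code.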
9.4.3 CISC Pipelining
As already mentioned in section 9.3.3, the idea of CISC is that only sensible
combinations of FIWs are stored in an instruction table that is addressed by
the (short) instruction word read from PM. The control loop for the CISC
processor architecture is once again depicted in the figure below. Because of
this, the FIWs cannot be assumed to be independent of each other anymore
(many FIW combinations are not allowed anymore). In other words, we use
a maximum correlation between the FIWs such that they are not orthogonal
anymore. For this reason, CISC cannot be made time stationary and always
has to be data stationary!
10 Hazards
In section 9.4.1 we assumed that we can simply discard the speed-ups in the
feedback paths when pipelining our DSP. In practice, this assumption leads
to an incorrect behavior of the DSP under certain conditions; this is called a hazard! In the following we want to separately investigate the effect of hazards
in the control loop (Control Hazards: section 10.1) and in the data loop (Data
Hazards: section 10.2) and discuss how we can deal with them. A third type of
hazards, the so called Structural Hazards, will be discussed afterwards (section
10.3).
10.1 Control Hazards
As figure 101 depicts, hazards in the control loop are introduced by removing
the (in our case: 2) speed-ups on the feedback signal from the instruction decoder (ID) to the program control unit (PCU). The feedback signal is used for
influencing the program flow depending on the currently decoded instruction
(e.g. for jumps or loops). After introducing the pipelining, it now takes two
clock cycles to decode the instruction. The decoded instruction should, however,
directly influence the program flow for the next instruction at the subsequent
clock cycle. Therefore, we let the signal travel back in time by two clock cycles
by using the speed-ups. Since this is unfortunately not a feasible realization, we need to skip the speed-ups, which leads to the problem that the feedback is not necessarily computed correctly anymore!
Figure 101: Origin of Hazards in Control Loop
Actually, this is no problem as long as the feedback isn't used by the decoded instruction, i.e. as long as the instruction doesn't influence the program flow.
E.g. linear code (c=a+b; e=c*d; f=c xor e; ...) or non-conditional instructions which have no control part (i.e. the standard program counter increment [pc++] is used).
But what if the PC jumps to a new address?
goto / jumps (→ PC controlled from ID or data path (DP))
call (→ new PC controlled from ID; current PC pushed to stack)
return (→ new PC popped from stack)
In this case the feedback signal needs to be used and carries the new PC value.
By removing the speed-ups, the new PC value now arrives too late at the PCU
(in the example of figure 101 these are 2 clock cycles). Figure 102 gives a more abstract perspective on this problem.
Figure 102: More Abstract View onto the Problem of Control Hazards
It can be clearly seen that the feedback signal b_k is not time-aligned with the input a_k anymore after introducing the pipelining and removing the speed-ups: [a_k, b_{k-2}]!
In the following we want to observe the effect of a hazard in case of a single
jump: How does it influence the program execution assuming the following
4-stage processor pipeline?
1. Prefetch (fetch setup): PC update/increment by PCU
2. Fetch: read next instruction from PM addressed by PC
3. Decode: decode instruction (ID)
4. Execute: execute instruction (AGU, DM, DP; pipelining of data loop
is not relevant here)
This is illustrated in figure 103. Therein, if we assume the green instruction to be a jump, it is prefetched at n-2, fetched at n-1 and decoded at pc cycle n. Only after the decoding has finished can the feedback loop inform the PCU about the jump so that the PC is updated accordingly. This is the case right before the red instruction is passed to the control loop pipeline. The problem is that the next two instructions (the blue one starting at pc = n-1 and the brown one starting at pc = n) are already in the pipeline of the control loop, ready to be executed. If these two wrong fetches were really executed, the program would very likely compute wrong results!
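A toy simulation makes this concrete (hedged: the single-list pipeline buffer, the ('jmp', target) encoding, and the flush-on-decode policy are modeling assumptions, not the processor's actual implementation):

```python
# Toy model of the control pipeline: an instruction needs 3 cycles from fetch
# to leaving the decode stage, so a jump is recognized only after two further
# instructions have already been fetched; those wrong fetches are flushed.

def run(program, max_cycles=20):
    pc, pipeline, executed = 0, [], []
    for _ in range(max_cycles):
        if pc >= len(program) and not pipeline:
            break
        pipeline.append(program[pc] if pc < len(program) else None)  # fetch
        pc += 1
        if len(pipeline) > 2:                 # instruction leaves decode stage
            instr = pipeline.pop(0)
            if instr is None:
                continue
            executed.append(instr)
            if isinstance(instr, tuple) and instr[0] == "jmp":
                pc = instr[1]                 # redirect the PC ...
                pipeline.clear()              # ... and flush the wrong fetches
    return executed
```

Running it on a program with a jump over two instructions shows that the two instructions behind the jump never reach the execute stage.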
Figure: linear code vs. pc jump code
What can be done to improve the performance? One possibility is offered by so-called jump tables. This is a kind of cache that stores the (decoded) target instruction after executing a jump. The next time the same jump is executed, the decoded target instruction can directly be fetched from the jump table. Hence, we can significantly speed up the execution of repeated jumps (→ loops), e.g. for our well-known convolutional sum y_k = Σ_{i=0}^{L-1} a_i x_{k-i}. Many DSPs, on the other hand, use HW loop counters with dedicated loop instructions so that the loop management is directly handled within the PCU. In this case there is no need for jump tables.
Delayed Jump (Fixing in ISA)
Another alternative to deal with control hazards is to rearrange the instruction sequence. We pick up the idea from section 10.1 to insert nop instructions after jumps in the code. However, instead of nops we want to use some functional (sensible) instructions of our program code this time to fill the hole. This approach requires that we have at least N (in our case: 2) non-pc-jump instructions^8 before our executed jump at cycle n (refer to figure 104). If this is the case, we can execute the (delayed) jump N cycles earlier and fill the hole with the sensible instructions from the program code: n-1, ..., n-N. We know for sure that these instructions can and definitely have to be executed! Consequently, the jump instruction is now readily decoded before the (originally subsequent) instruction n+1 is fed to the control loop pipeline. This means that the feedback signal from the instruction decoder is now available at the PCU right in time. The method of the delayed jump is depicted in figure 104. A small code example is given in table 4. Therein, the original code can be found in the left column and the modified code (that includes the delayed jump) in the right column.
8 i.e.
Remark: This solution may get quite complicated for conditional jumps, since the evaluation of the condition must be executed before the jump and thus has to be moved forward in the program code as well!
Pipeline Interleaving (Fixing Control at Processor Architecture)
A fourth (special) option to deal with the control hazards is supported by the
processor architecture itself. This only works, if we depart from single processor
and assume N (pipeline depth) processors instead. In this case, we could apply the concept of Pipeline Interleaving (→ section 6.2) for complete control hazard avoidance by construction [→ Sandbridge Technologies, IFX MUSIC]. Figure 105 shows how this works.
Figure 105: Use Pipeline Interleaving to Avoid Control Hazards
For this purpose we need (in our case) 3 PCUs/PCs, 3 PMs for the 3 programs and 3x the amount of registers and memories for the Pipeline Interleaving
of the data loop. Now, we can execute the instructions of the program code
of the 3 independent processors in an alternating manner. For the case of
the pipeline interleaved processor, jumps are no problem anymore, since the
decoding of any instruction is finished before the next instruction (of the same
processor) is fed to the pipeline! In the meantime, only instructions of the parallel (independent!) processors are inserted into the pipeline. The example in figure 106 shows what the data path looks like if we apply Pipeline Interleaving for processing 4 independent data sets, i.e., having a pipeline depth of 4. In this
case we need an independent output register file and address generation unit for
each data set.
Figure 106: Data path of pipeline interleaved processor for pipeline depth of 4
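The round-robin issue order of a pipeline-interleaved processor can be sketched as follows (a minimal sketch; equal-length programs and the (program id, instruction) tuple format are assumptions):

```python
# Sketch of pipeline interleaving: with a pipeline depth of N, the instructions
# of N independent programs are issued round-robin. By the time a program's
# next instruction enters the pipeline, its previous one is fully decoded, so
# control hazards are avoided by construction.

def interleave(programs):
    """programs: list of N instruction lists (assumed equal length)."""
    issue_order = []
    for step in range(len(programs[0])):
        for prog_id, prog in enumerate(programs):
            issue_order.append((prog_id, prog[step]))
    return issue_order
```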
10.2 Data Hazards
Instr. 2            Instr. 3
++pa                ra = *++pa
                    acc0 = ra + acc1

Instr. 1            Instr. 2
ra = *pa
acc0 += ra          rb = *pb
*pb = acc0          acc1 = rb
It should be mentioned here that the bypass multiplexers can consume a significant amount of the overall logic area for long (modern) processor pipelines.
Also note that the bypassing approach only works, if the internal pipeline of the
data path unit is not too long. Otherwise, the output of the DPU may already
be too late so that bypassing would not help.
10.3 Structural Hazards
The last type of hazards that we discuss in this section are the so-called structural hazards. They occur if the instructions scheduled for a single clock cycle require more HW (functional units, memory ports, etc.) than is available.
Some cases where structural hazards occur are:
1. HW not realized (i.e. # FUs is too small) → handling by exception code
2. HW not available (occupied) due to pipelining
(a) R/W port conflicts on data memory
(b) # memory ports exceeds HW limitations
Code example #1: data memory with 2 ports (→ our Simple DSP with port a (R only) and port b (R/W)), but 3 ports needed
Algorithm 4 Code example: data memory with 2 ports, but 3 ports needed
Cycle   port a/b
n       r/w
n+1     r/r
n+2     r/r
At cycle n we read an input from port a, do the computation (→ DP) and write back the result into port b after 2 clock cycles (i.e. in n+2) due to the pipelining. At the same clock cycle a new instruction is issued that requires a read access on both ports. Hence, we get a conflict (→ structural hazard) between the two read requests and the write request at cycle n+2.
Code example #2: read/write conflict
Algorithm 5 Code Example: Read/Write Conflict
Assembly view:
1: *pb++ = acc1 * acc0
2: acc0 = *pb + 0xAAAA
Pipeline view:
Cycle   Instr. 1              Instr. 2
#1      acc1 = acc1 * acc0
#2      *pb++ = acc1          rb = *pb, ra = 0xAAAA
#3                            acc0 = ra + rb
This second small code example demonstrates a similar (but more concrete) case. We see a read/write conflict at port b (*pb) in cycle #2: the writeback of instruction #1 collides with the read access of instruction #2, and simultaneous read/write is not supported on the same port! If we replace the immediate value 0xAAAA by a memory access *pa, we even get two simultaneous structural hazards, because we would then need 3 memory ports in cycle #2. However, our DSP architecture has only 2 memory ports available...
Solution: Solutions for the structural hazards are quite similar to those for the data hazards:
1. SW/tools (i.e. compiler): rearrange/delay instructions to avoid conflicts
2. Super-Scalar architecture: issue unit checks at runtime if FUs are available
3. HW: stall the pipeline until FUs are available
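Solution 3 (stalling in HW) can be sketched with a small port-reservation model (hedged: the port model with a 2-cycle-delayed writeback follows the Simple DSP of the text, but the reservation-table scheduler and its interface are my own assumptions):

```python
# Sketch of a stall unit: each instruction declares the ports it reads in its
# issue cycle and (optionally) a port it writes `delay` cycles later. Issue is
# delayed until no port is claimed twice in the same cycle.

def schedule(instrs, delay=2):
    """instrs: list of (reads, write) pairs, where `reads` is a set of port
    names and `write` is a port name or None. Returns the issue cycle of
    each instruction."""
    busy = {}                                   # cycle -> set of occupied ports
    issue, cycle = [], 0
    for reads, write in instrs:
        while (reads & busy.get(cycle, set()) or
               (write and write in busy.get(cycle + delay, set()))):
            cycle += 1                          # stall until all ports are free
        busy.setdefault(cycle, set()).update(reads)
        if write:
            busy.setdefault(cycle + delay, set()).add(write)
        issue.append(cycle)
        cycle += 1
    return issue
```

Replaying code example #1 (a read with delayed writeback to port b, followed by two dual-port reads) shows the third instruction being stalled by one cycle.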
11 Vector Processing
So far we discussed processing on scalar data. But can the block processing approach also be applied to processors? In this section different techniques for parallel processing on data sets (→ vectors) will be discussed.
11.1 Vector & Data Flow Processors
At first we want to take a quick glance at the parallel processing concept of classical vector processing machines. Recall the two techniques for pipelining a processor:
1. One processor instantiation with pipelined control and data path; bypasses etc. needed for hazard handling
2. N pipeline-interleaved instances → N-fold register/memory size (hazard free)
The idea of classical Vector Processing Machines is to exploit the repetitive
program code of loops, such as our well-known FIR sum:
y_k = Σ_{i=0}^{L-1} a_i x_{k-i}
Now, use parallel processing by an N-fold pipeline of the data path, while the program control code is static during the vector operation, as figure 109 shows. I.e. only the data path needs the massive pipelining to be able to perform the same operation on a row of multiple vector elements (at high speed), while the control path needs no expensive pipelining, since the program control word is static during the vector operation. On the software side the idea is to generate a high-level language (HLL) library of vector instructions, e.g., vectorAdd(&x, &y, L), to easily exploit the architectural features.
Figure 109: General idea of a vector processing machine
A more generic approach is offered by the so-called Data Flow Processors. One can think of this kind of processor as a pool of different FUs that can be connected in an arbitrary fashion at run-time. Thus, the data path can be configured according to the needs of the executed application. E.g. assume that we want to compute the following equation: y = Σ_i |x_i - a_i| · b_i. Figure 110 shows how the data path would be configured in this case to execute the computation within a single instruction.
The big advantage of data flow processors is that they maximize the ILP (instruction level parallelism), i.e., the number of algebraic operations executed per cycle. This makes sense especially if a computation is repeated multiple (L >> 1) times within a loop.
11.2
Now, let us make a bottom-up analysis of the different hierarchy levels of parallelism, starting with fully serial (bit-serial) processing and finally ending with the maximum level of parallelism (SIMD).
Bits: bit-serial. Figure 111 shows an example of a bit-serial adder that receives two input bits a_i and b_i and computes the sum bit s_i and the carry output c_out accordingly. Thereby, the carry bit c_out is fed back to the input c_in so that multiple bits of a parallel word can be computed in a sequential loop.
Figure 111: Example: bit-serial adder
↓
Words: bit-parallel: operate the same operator on a word-parallel ALU (e.g. multiple 1-bit adders connected in a row, i.e. carry-ripple adder)
Words: one operation on multiple input words (e.g. add, mult, OR, ...)
↓
ILP (instruction level parallelism): many ops on multiple inputs (e.g. MAC, sat(add), ...), superscalar issue
ILP: one complex ILP set of operations on scalar elements
↓
SIMD: one complex ILP set of ops on vectors
11.2.1
To give an example for SIMD vector processing we take the FIR equation as
basis:
y_k = Σ_{i=0}^{L-1} a_i x_{k-i}
We can decompose the sum into odd and even parts, which can be computed in parallel, i.e., independently of each other:

y_k = Σ_{even i} a_i x_{k-i} + Σ_{odd i} a_i x_{k-i}
    = Σ_{i=0}^{L/2-1} a_{2i} x_{k-2i} + Σ_{i=0}^{L/2-1} a_{2i+1} x_{k-2i-1}
Except for the final addition of the two sums, we use the same instruction (multiply a and x and accumulate the result) on two parallel sets of data → SIMD: Single Instruction Multiple Data! The resulting hardware structure is depicted in figure 112. The final addition of the two sums is therein realized by a multiplexer at the input of the second accumulator that is fed by the output of the first accumulator.
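The even/odd decomposition can be checked numerically against the plain direct form (a sketch assuming an even filter length L; the function names are generic):

```python
# Sketch of the even/odd SIMD decomposition: both partial sums use the same
# multiply-accumulate operation on two data sets and are added at the end.

def fir_simd(a, x, k):
    even = sum(a[2*i]     * x[k - 2*i]     for i in range(len(a) // 2))
    odd  = sum(a[2*i + 1] * x[k - 2*i - 1] for i in range(len(a) // 2))
    return even + odd     # final addition merging the two accumulators

def fir_ref(a, x, k):
    """Plain direct form FIR as the reference."""
    return sum(a[i] * x[k - i] for i in range(len(a)))
```

Both functions produce identical outputs, confirming that the two halves really are independent up to the final addition.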
Please note that the SIMD concept with two parallel ALUs differs from the classical VP machine implementation, where we have only a single ALU that is N-fold (in this case 2-fold) pipelined, as figure 113 shows.
The problem of the SIMD solution from figure 112 is that the necessary memory bandwidth has also been doubled compared to the non-parallel solution to feed all the inputs. We need to pass 4 inputs per cycle (a_{2i}, x_{k-2i}, a_{2i+1}, x_{k-2i-1}) instead of 2 inputs per cycle for the non-parallel solution (a_i, x_{k-i}). Figure 114 emphasizes this by a direct comparison of the bandwidth required for inputs/outputs of the SIMD processor and the pure hardware solution.
A more efficient approach is the following: let us compute the sum sequentially as before, but instead compute multiple outputs (y_k and y_{k-1} for N=2) in parallel:

y_k     = Σ_{i=0}^{L-1} a_i x_{k-i}
y_{k-1} = Σ_{i=0}^{L-1} a_i x_{k-i-1}
Hence, we recognize that only a single input changes between every two computations!
Figure 116: General idea of Zurich Zip
Finally, we gained N-fold processing power by applying SIMD with Zurich Zip, but we need a skewing triangle consideration here! As figure 117 shows, we get N-1 setup cycles and N-1 completion cycles for an N-fold parallelization to fill/flush the register chain. During the initialization and flushing phase, we cannot really do full parallel SIMD processing. Hence, the parallelization is only effective if the parallelization factor is much lower than the number of loop cycles (L), so that the overhead of the init/flush phase carries no weight: N << L. The effective parallelization factor, taking the overhead of the initialization and flushing phase into account, can be determined as follows:

P = N·L / (L + 2N - 2)
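A quick numeric check of this formula (plain Python, no assumptions beyond the formula itself) confirms that P approaches the ideal factor N for N << L and falls well below it when N is close to L:

```python
# Effective parallelization factor of zipped SIMD processing,
# P = N*L / (L + 2N - 2): the 2N-2 term is the fill/flush overhead.

def effective_p(n, l):
    return n * l / (l + 2 * n - 2)
```

For example, effective_p(2, 1000) is nearly 2, while effective_p(4, 4) is only 1.6.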
                  zipped processing
latency           L + N - 1
speedup P         P = N·L / (L + N - 1)
reads/cycle       2·L / (L + N - 1)
writes/cycle      P / L
The methods presented in the previous section are not the only options to realize
SIMD processing. A different approach for the SIMD FIR implementation is
provided by the so called partial transposed form (i.e. a mixture of the direct
form and the transposed form FIR filter). The structure is illustrated in figure
118. It is easy to see that the computation of this 6-tap filter can be realized
by 3 parallel data path units with internal ILP. All DPUs exhibit the same
structure, as depicted in figure 119, so that hardware reuse is enabled.
This small example should demonstrate that there is often a large number of
alternatives of how to handle data.
Finally, we can also generalize the concept of the partial transposed form FIR for different algorithms, since:
the operator doesn't matter
the operator doesn't even have to be distributive

y_k = ⊕2_{i=0}^{L-1} (a_i ⊕1 x_{k-i})   (with generic operators ⊕1, ⊕2)

For example, the partial transposed form is also possible for:

y_k = Σ_{i=0}^{L-1} |a_i - x_{k-i}|   or   y_k = Σ_{i=0}^{L-1} |a_i - x_{k-i}|²
Figure 119: Reusable DPU for Parallel Processing of Partial Transposed FIR
11.2.3 Generalization
Finally, we conclude the SIMD topic with a general consideration. Figure 120
shows a general view onto SIMD processing. Therein, two (or more general:
multiple) N-element vectors are passed to N parallel data processing paths (DP).
Figure 120: General View Onto SIMD Processing
The first problem is illustrated on the left-hand side of figure 121. A cyclic shifter can be used to rearrange the memory output. However, the shifter (→ cross-bar switch) usually consumes a lot of area. For many operations the cyclic shift is, however, not even necessary. E.g. in case of an FFT (right-hand side of figure 121), we can simply omit the cyclic shifter due to its periodicity property (cyclic operator)! But in this case we need to store a different set of twiddle factors for each possible vector offset (see figure 121).
Figure 121: Problem (1): Input Vector not Aligned in Memory
The second problem is addressed in figure 122. Actually, this is no real problem,
since we can just split the computation over multiple clock cycles. However, we waste some of the available computation power during the last cycle if the input vector length is not a multiple of N. In the illustrated case some of the
DPUs are idle (perform nop) during the second cycle.
Figure 122: Problem (2): Vector Length not Matching Memory Width
The overall summary and generalization of the SIMD concept is given in figure
123. Therein, a general shuffle network (e.g. as cyclic shifter to address the
memory alignment problem or as shuffle network to perform algorithms such as
FFT) has been considered. Shuffle operations on vectors with size > N have to
be realized through addressing!
Figure 123: Summary & Generalization of SIMD Concept
12
The right side of figure 125 shows the 3 main challenges of the scheduling problem, depicted as 3D mapping space:
1. MP challenge: mapping tasks on PEs
2. Memory allocation & coherence: Which data to store where to avoid long
data transfers between global and local (e.g. PE cache) memory? How
to keep the data coherent, i.e., how must the results be merged, if tasks
work on different memories?
3. On-chip interconnect (Network-on-Chip, NoC): How to schedule tasks to
make data/program transfers as fast as possible (short routes, uniform
load, etc.)?
12.1
To handle the scheduling problem, formal tools can be used to model the
problem and find a solution. Kahn Process Networks (KPN) are a very simple
graph based model that can be used to model software tasks (or processes) and
their interdependencies. Therein, tasks/processes are modeled by graph nodes,
called actors, as depicted in figure 126. The arcs between the nodes are used
to model the interdependencies between the tasks or the task inputs/outputs,
respectively. Feedback loops are also allowed. An actor can be executed (→
fired) as soon as all input conditions are satisfied. A valid output is created
on every output edge when an actor is fired (note that empty outputs are also
possible). Almost any software can be described by a KPN!
Figure 127 summarizes some problems that can be addressed and examined by
means of the KPN model.
Figure 127: Selection of design problems to be modeled and examined by KPN
12.2
A sub-class of the graphical KPN model is the so-called Synchronous Data Flow
(SDF). In contrast to the more general KPN, all inputs are assumed to switch
time-synchronously, i.e., we have discrete processing times and a discrete amount
of data that flows between the actors. I.e. every time an actor is fired:
- a given # of elements is consumed and
- a given # of elements is produced.
The number of consumed/produced elements is thereby time independent (i.e.
constant at run-time)! This model perfectly suits the synchronous (i.e.
clocked) digital systems that are commonly used today! Figure 128 shows a
very simple SDF with one input actor and two output actors, where both
output actors depend on the output of the input actor.
Moreover, SDF supports different (rational) sample rates. However, the
sample rates must be constant and known a priori. In this case, we can annotate
the amount of data that an actor consumes (at the inputs) and produces (at the
outputs) to model different sample rates. A small example is given in figure 129.
Therein, the actor is ready and can be fired if input 1 has at least 2 elements in
its queue and the queue of input 2 contains at least 1 element. The according
number of tokens is consumed when the actor is fired, while 3 elements are
produced at its output edge.
Figure 129: Small SDF examples with annotated # of elements consumed/produced
A second, more generic example can be found in figure 130. The actor consumes
a data amount of a from its input 1 and b elements from input 2 when
fired and produces c and d tokens at its outputs. The inputs of this actor are
provided with rates R_1 and R_2 tokens per clock cycle. What is the average
firing rate F of this actor?
According to input 1 we could fire at a rate of R_1/a per cycle.
According to input 2 we could fire at a rate of R_2/b per cycle.
Therefore:

    F = min(R_1/a, R_2/b)
Thereby, the firing rate of the actor determines the sample rates at its outputs:

    R_3 = F · c
    R_4 = F · d

To make the SDF from figure 130 a valid (i.e. executable) schedule we have to
ensure that no buffer overflow occurs at the inputs. This is true if the following
condition is fulfilled:

    R_1/a = R_2/b.
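As a minimal sketch (with hypothetical numbers for R_1, R_2, a and b), both the firing-rate formula and the overflow condition can be checked in a few lines:

```python
from fractions import Fraction

def firing_rate(rates, consumed):
    """F = min over inputs of R_i / c_i: the slowest-filling queue limits the actor."""
    return min(Fraction(R) / c for R, c in zip(rates, consumed))

def rate_consistent(rates, consumed):
    """No input buffer overflows iff all inputs allow the same firing rate."""
    ratios = [Fraction(R) / c for R, c in zip(rates, consumed)]
    return all(r == ratios[0] for r in ratios)

# Hypothetical numbers: R1=6, a=2 and R2=4, b=2 give F = min(3, 2) = 2,
# but the rates are inconsistent (3 != 2), so input 1 overflows eventually.
print(firing_rate([6, 4], [2, 2]))       # 2
print(rate_consistent([6, 4], [2, 2]))   # False
print(rate_consistent([6, 4], [3, 2]))   # True: 6/3 == 4/2 -> balanced
```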
Figure 131 gives a slightly more complex SDF example with 3 actors and data
amount annotations.
The question that now arises is: Does this work? / Is this SDF executable? We
cannot answer this question, since it depends on the concrete amount of data
consumed and produced, which has only been annotated in abstract form in
figure 131! Let us examine a concrete instance of this SDF example, shown in
figure 132. Therein, we see that the first actor produces 2 tokens on its output
edge to the second actor each time it is fired, while 1 token is produced on its
output edge to the third actor. Now, the second actor can be fired once, consuming
1 token from the first actor at its input and producing 1 token at its output.
Finally, the third actor can be fired once, consuming 1 token from the first actor
and 1 token from the second. Now, we have finished a single period and start from
the beginning. The problem, however, is that there is still 1 token left in the
input buffer of the second actor, since the first actor has produced 2 tokens on
this arc before! If we repeat this game continuously, we end up with a buffer
overflow (assuming a real system with limited buffer size), which in turn leads
to a memory deadlock, since the producing actor cannot be fired anymore in this
case. This means that the SDF is not executable due to a sample rate inconsistency!
Figure 132: Example for SDF with sample rate inconsistency
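The token bookkeeping of this example can be replayed in a short simulation (the labels A, B, C are illustrative names for the three actors of figure 132):

```python
def run_periods(periods):
    """Token count on each arc after repeating the schedule (A, B, C)
    for the given number of periods. Arcs: A->B, A->C, B->C."""
    ab = ac = bc = 0
    for _ in range(periods):
        ab += 2; ac += 1      # fire A: produces 2 on A->B, 1 on A->C
        ab -= 1; bc += 1      # fire B: consumes 1 from A->B, produces 1 on B->C
        ac -= 1; bc -= 1      # fire C: consumes 1 from A->C and 1 from B->C
    return ab, ac, bc

print(run_periods(1))   # (1, 0, 0)  -- one token stuck on A->B
print(run_periods(10))  # (10, 0, 0) -- unbounded growth -> buffer overflow
```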
From the sample rate of the actor that depends on multiple different inputs we
can now derive the condition for a balanced schedule w/o deadlocks. With the
input rate R_in, the firing rates of the three actors must balance along every
arc:

    F_1 = R_in / b
    F_2 = (c/e) · F_1
    F_3 = (d/f) · F_1 = (i/g) · F_2

The two expressions for F_3 must agree, which yields the balance condition

    d · e · g = c · f · i.
Since this method can be quite cumbersome for large networks, we need a more
convenient and generic method to deal with this. Luckily, this can be provided
by using a mathematical representation of the SDF graph, the so-called topology
matrix Γ. It is constructed by numbering each node and arc, as in Fig. 134, and
assigning a column to each node and a row to each arc. The matrix element
Γ(b, a) contains the number of tokens produced by node a on arc b each time it
is invoked. If node a consumes data from arc b, the number is negative, and
if it is not connected to arc b, then the number is zero.
Correspondingly, we can now define for the example in Fig. 134:

                 node 1  node 2  node 3
    Γ =  arc 1 (   c      -e       0   )
         arc 2 (   d       0      -f   )
         arc 3 (   0       i      -g   )
The condition that must be fulfilled for the SDF to be rate consistent is simple:

    rank(Γ) = s - 1,
Figure 134: SDF graph showing the node and arc numbering. The input and
output arcs are ignored for now.
where s is the number of actors. That means the matrix must be rank deficient,
i.e. it has to have fewer linearly independent rows than columns (→ underdetermined
linear system of equations). Check it for yourself for the example
from figure 132. You will find that all 3 rows are linearly independent in this case!
The rank deficiency condition is based on the following idea: Assume that our
3 actors are fired F = (F_1, F_2, F_3)^T times during a single period of a schedule.
Then the following equation yields the final buffer state B = (B_1, B_2, B_3)^T at
the end of the schedule period: Γ·F = B. We already know that B must be
0 at the end of a schedule period for the SDF to be rate consistent. Hence:
Γ·F = 0. We will always be able to find a solution for this system of
equations, namely F = 0. However, we require F_i > 0 to result in a reasonable
schedule. Consequently, to find an F ≠ 0 we need at least a second solution.
Hence, the system of equations must be under-determined and therefore it needs
to be rank deficient.
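A minimal rank check over exact rationals illustrates both cases. The matrix gamma_bad below corresponds to the inconsistent instance from figure 132 (2 and 1 tokens produced, 1 token consumed on each arc); gamma_ok is a hypothetical consistent variant in which the first actor produces only 1 token per firing:

```python
from fractions import Fraction

def rank(M):
    """Matrix rank via Gaussian elimination over exact rationals."""
    M = [[Fraction(x) for x in row] for row in M]
    r = 0
    for col in range(len(M[0])):
        piv = next((i for i in range(r, len(M)) if M[i][col] != 0), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        for i in range(len(M)):
            if i != r and M[i][col] != 0:
                f = M[i][col] / M[r][col]
                M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
    return r

# Inconsistent instance (figure 132): rank = s = 3 -> not schedulable.
gamma_bad = [[2, -1,  0],
             [1,  0, -1],
             [0,  1, -1]]
# Consistent variant: rank = s - 1 = 2, and the repetition vector
# F = (1, 1, 1) solves gamma_ok . F = 0.
gamma_ok  = [[1, -1,  0],
             [1,  0, -1],
             [0,  1, -1]]
print(rank(gamma_bad), rank(gamma_ok))  # 3 2
```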
We could generalize the SDF concept to stochastic queues with variable input
rates. In this case we would simply take the average rate as our constant data
rate in the SDF, e.g., a = E [a] and b = E [b]. This approach will work as long
as the variance of the stochastic input process is small compared to the queue
size, such that no overflow occurs.
12.3
Now that we have discussed the question of checking the SDF for feasibility (rate
consistency), we will address in this sub-section the question of how to actually
create a schedule from the SDF for a single- or homogeneous multi-processor
architecture. Let's start off with the small SDF example presented in figure
135 (left). It consists of 3 actors with dependencies in a feedback loop. A first
check confirms that there are no rate inconsistencies in this graph, so that we
can continue with it. Please note that we need buffer initialization in case of
feedback loops. Otherwise, we would get a deadlock right at the beginning. This
is because every actor is dependent on the output of another actor, such that
no actor can start firing! By the buffer initialization we avoid this deadlock and
allow certain actors to start. We also need to consider the actor processing
duration, which has been chosen to be equal to the node number of the actor for
simplicity (figure 135, right).
Figure 135: Left: Example SDF with loop for MP scheduling; Right: processing
delays
The two actors #1 and #3 can directly be fired at the beginning (due to the
buffer initialization). Actor #1 can even be fired twice right at the beginning.
After actor #1 has been fired twice, actor #2 can be fired, since it consumes
two tokens from the output of actor #1. Now, we are already done with the
construction of the precedence graph, since every actor was fired at least once!
From the precedence graph we can easily generate different single processor
scheduling alternatives, as depicted in figure 137.
Figure 137: Possible schedules for P=1 processors
All these options are reasonable schedules. The question is: Are some of these
schedules better than others? In fact, this is the case, since the two green
encircled alternatives make use of the data locality principle, as shown in figure
138. This means that the actors are directly executed in the order of their
dependencies. Remember: A data dependency from one actor to another means that
the dependent actor needs the output data of its predecessor as input. If the
dependent actor is directly executed after its predecessor on the same PE, it
can reuse the results in the local memory. If a different actor is executed on
the PE instead, the results must be moved to a global memory and transferred
back to the local memory before the dependent actor is executed → two unnecessary
data transfers!
Figure 138: Best scheduling options that exploit data locality
Let's continue with the scheduling example for two processors. We can again
easily find valid schedules from our precedence graph. Figure 139 presents two
alternatives. Although these are reasonable schedules, we recognize that they
are not entirely optimal, since we have to schedule nops in between and thus do
not fully utilize the two processors. From the efficiency perspective it looks as
follows: We know that E ∝ 1/(A·T). If we normalize A to the area of a single
processor, we get A = P = 2. If we further normalize the processing rate (i.e.
the length of a single schedule period) to that of a single-processor solution,
we get the speedup

    S = 1/T = (schedule period for P=1)/(schedule period for P=2) = 7/4.

Finally, we get E ∝ (7/4)·(1/2) = 7/8 < 1. This means that we are less efficient
compared to the single-processor solution!
To be able to fill the gaps in the schedules of figure 139, we have to alternate
the processors between adjacent periods. For this purpose we need to extend
the period of our precedence graph to J=2 (i.e. every actor is fired at least
twice). The resulting graph is presented in figure 140 (the brown indices provide
information on the corresponding schedule period).
Figure 140: Precedence graph (J=2) for SDF example
From the extended precedence graph we can now again easily derive a schedule
for P=2 processors. As figure 141 shows, the gaps could be removed from the
schedule.
Figure 141: Schedule for P=2 processors considering J=2 periods
Finally, we were able to find an optimum schedule for the P=2 case with an
efficiency comparable to that of the optimum single-processor solution:

    E = S · (1/P) = (14/7) · (1/2) = 1.
What about the case P=3? Figure 142 shows the corresponding schedule (J=1).
The efficiency is further decreased compared to P=2:

    E = S · (1/P) = (7/3) · (1/3) = 7/9 < 1.
From the precedence graph (J=2) in figure 140 we can estimate that it does
not make sense to consider J>2, since we are not able to initially schedule more
than the three actor firings (#1, #1, #3) in parallel and thus will not achieve a
higher degree of parallelism.
Figure 142: Schedule for P=3 processors
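The three efficiency results above can be reproduced with a small helper (the schedule periods are those of figures 139, 141 and 142, scaled to the same number of periods J):

```python
from fractions import Fraction

def efficiency(period_p1, period_p, P):
    """E = S / P with speedup S = (single-processor period) / (P-processor period).
    Both periods must cover the same number J of schedule iterations."""
    S = Fraction(period_p1, period_p)
    return S / P

print(efficiency(7, 4, 2))    # 7/8  (P=2, J=1: gaps in the schedule)
print(efficiency(14, 7, 2))   # 1    (P=2, J=2: gaps removed)
print(efficiency(7, 3, 3))    # 7/9  (P=3, J=1)
```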
Instead, let us slightly modify our SDF example by simply introducing an additional initialization buffer at the input of actor #3. Now, actor #3 can be
fired two times initially! This has quite an impact on our schedule, since it now
indeed makes sense to go for the J=3 precedence graph, as figure 143 (right)
shows! Now, we find three independent precedence trees in the graph and can
simply build an optimal schedule for P=3 processors out of it (E = S/P = 1).
Figure 143: Slightly modified SDF example (left) with precedence graph for
J=3 (right)
13
Many modern applications, such as mobile communication or multi-media, require
a lot of computation power. Therefore, contemporary hardware architectures tend
to integrate multiple processors, memories and hardware accelerators
on a single chip, so-called Multi-Processor Systems-on-Chip (MPSoC). That's
why we want to discuss the main challenges related to MPSoCs in the scope
of this section. Figure 144 summarizes the software/hardware mapping (i.e.
scheduling) problem that we already discussed in the previous section. Software
tasks need to be assigned to PEs/memories in space and time under certain
constraints (e.g. execution time/deadlines, power consumption, data locality,
etc.). Furthermore, the on-chip communication (NoC) must be considered: e.g.
inter-processor or processor-memory. We can distinguish static and dynamic
mapping approaches, as discussed in the following.
13.1
Programming Model
The scheduling problem has been addressed in the previous section using the
SDF graph model. However, this model can only be used to generate static
schedules that must be known at software compile time. Modern applications
often exhibit a very dynamic behavior that is dependent on input data or user
interaction. E.g. writing an SMS requires much less computational power from the
mobile phone than making a video conference. On the other hand, under bad
channel conditions the mobile phone cannot transmit at high data rates and thus
the computational effort is also decreased in this case. These examples should
make clear that static scheduling with SDF is often not sufficient for modern
applications, especially for mobile communication and multi-media. In this case
we need to deal with dynamic data flow (DDF), which is also a sub-class of the
general KPN. Let us first summarize the joint properties of SDF and DDF:
KPN ⊃ SDF
1. Actors 1 & 2 (actors work on independent memory areas) → one global memory o.k.
2. Actors 1 & 3 (actors work on overlapping memory areas) → sequential access
   to memory o.k.; 1 & 3 with write-back → modify problem to work on the same
   memory!
3. As in case 2, but the input/output array of actor 3 is chosen dynamically
   (→ image processing)
Figure 145: Image processing as example for dynamic scheduling
This small example should demonstrate a case where we are not able to apply
static scheduling, since the dependencies, the execution order and even the
memory mapping depend on which area of the image actually has to be processed by
the two actors, which in turn is dynamically selected at run-time. This
emphasizes the need for a dynamic (run-time) scheduler for:
- handling hazard conflicts in memory
- mapping tasks onto cores (time/space)
- reserving interconnects (NoC)
To allow dynamic scheduling at run-time we need a special programming model
that allows us to divide the program into sub tasks which can be issued in
arbitrary execution order (considering the data dependencies) by the run-time
scheduler onto appropriate PEs. For this purpose we have to describe the actors/tasks as shown in the following:
Algorithm 6 Programming model for MPSoC with run-time scheduler
call actor(in_0, in_1, ...; out_0, out_1, ...)
actor
...
end of actor
The memory addresses of the inputs and outputs can be used by the run-time
scheduler to identify data dependencies (i.e. overlapping memory areas). Based
on the dependency analysis the scheduler is able to find an admissible execution
order with respect to the targeted optimization strategy (speed, power, etc.).
The actual program code of the actor is stored in the global memory and transferred to the local memory of the selected PE when the actor is ready to be
executed. Figure 146 shows the MPSoC programming model from the PE point
of view. The processor receives the input data and control from a global memory, as well as the program code that matches the PE type. Certain actors may
be suited to be executed on different PE types. In this case different versions of
the same program code must be located in the global memory (the appropriate
version of the program code is selected by the scheduler). A local scratch pad
memory is used by the PE to store intermediate results. The actual outputs of
the computation are transferred back to a global memory (unless the data can
be reused by the subsequently scheduled actor → data locality).
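A dependency analysis based on overlapping memory areas might look as follows; the task names and address ranges are purely illustrative, not taken from the script:

```python
def overlaps(a, b):
    """Do two half-open address ranges (start, end) overlap?"""
    return a[0] < b[1] and b[0] < a[1]

def find_dependencies(tasks):
    """tasks: list of (name, inputs, outputs); each input/output is an address
    range in global memory. An earlier task's output overlapping a later task's
    input (read-after-write) or output (write-after-write) forces ordered
    execution -- a minimal sketch of what a run-time scheduler could check."""
    deps = []
    for i, (n1, _, outs1) in enumerate(tasks):
        for n2, ins2, outs2 in tasks[i + 1:]:
            if any(overlaps(o, r) for o in outs1 for r in ins2 + outs2):
                deps.append((n1, n2))
    return deps

tasks = [
    ("fir",  [(0, 64)],    [(64, 96)]),    # writes 64..96
    ("fft",  [(64, 96)],   [(96, 160)]),   # reads fir's output -> RAW dependency
    ("disp", [(200, 232)], [(232, 240)]),  # independent memory area
]
print(find_dependencies(tasks))  # [('fir', 'fft')]
```

Tasks without detected dependencies ("fir" and "disp" here) may be issued to different PEs in any order.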
Let us examine the programming model for our FIR example a little bit closer
to understand the consequences of the input/output procedure:
    y_k = Σ_{i=0}^{L-1} a_i · x_{k-i}
Assume that our actor computes the FIR in direct form for M steps in parallel,
as depicted in figure 147 for the case L=4 and M=3.
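A minimal sketch of such an M-step actor (plain Python, hypothetical coefficients and samples): per call it must receive the L-1 old input samples together with the M new ones, i.e. L+M-1 values of input state.

```python
def fir_m_step(a, x_window, M):
    """Compute M consecutive direct-form FIR outputs
        y[k+j] = sum_{i=0}^{L-1} a[i] * x[k+j-i],  j = 0..M-1.
    x_window must hold the L-1 old samples followed by the M new ones
    (L+M-1 values, oldest first) -- exactly the state the actor must
    fetch from global memory on every call."""
    L = len(a)
    assert len(x_window) == L + M - 1
    return [sum(a[i] * x_window[L - 1 + j - i] for i in range(L))
            for j in range(M)]

# L=4, M=3 as in the text: 6 input samples produce 3 outputs.
a = [1, 2, 3, 4]
x = [1, 0, 0, 1, 0, 0]   # 3 old samples + 3 new ones
print(fir_m_step(a, x, 3))  # -> [5, 2, 3]
```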
MPSoCs and differs from the assumption that we made before that our current
state is always locally available in the PE. Figure 148 briefly summarizes our
observations for the FIR filter.
Figure 148: M-step FIR example
13.2
Task Scheduling
One of the main challenges of MPSoCs is the task scheduling that is briefly
addressed in this section. As already mentioned before, we distinguish two
cases: static (1) and dynamic (2) scheduling.
1. Static Task Scheduling: This case is easier to handle, since we can compute
the schedule at compile time using the common SDF model that has
been presented in section 12. Figure 149 shows a simple example to create an
ASAP (as soon as possible) or ALAP (as late as possible) schedule given an
SDF graph with annotated execution times as input. Using the ASAP policy,
actor a can start immediately at t=0. Actor b can be fired at t=200 and is
scheduled to the still idle PE #2. Actor c can directly be fired at t=300 after a
has been completed and is executed on PE #1. Due to its long run-time of 800
cycles, actor b occupies PE #2 nearly up to the end of the schedule, so that the
subsequent actors must be executed sequentially on PE #1: d at t=400 and e
at t=700. Finally, after b has finished, the output actor f can be scheduled on
PE #2 at t=900. The schedule period is completed at t=1000. For the ALAP
schedule we simply apply the same procedure in reverse order, i.e. starting with
actor f → b → e → ....
Figure 149: Example for static MPSoC schedule
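An ASAP list scheduler of this kind can be sketched generically; the task graph below is hypothetical (not the one from figure 149), and ties between ready tasks are broken arbitrarily:

```python
import heapq

def asap_schedule(tasks, deps, P):
    """ASAP list schedule on P identical PEs.
    tasks: {name: duration}; deps: (predecessor, successor) pairs.
    Returns {name: (start_time, pe)}."""
    preds = {t: set() for t in tasks}
    succs = {t: [] for t in tasks}
    for a, b in deps:
        preds[b].add(a)
        succs[a].append(b)
    finish, sched, done = {}, {}, set()
    pe_free = [(0, pe) for pe in range(P)]   # (time the PE becomes free, PE id)
    heapq.heapify(pe_free)
    ready = [t for t in sorted(tasks) if not preds[t]]
    while ready:
        # fire the ready task whose input data is available earliest
        ready.sort(key=lambda t: max((finish[p] for p in preds[t]), default=0))
        t = ready.pop(0)
        data_ready = max((finish[p] for p in preds[t]), default=0)
        free_at, pe = heapq.heappop(pe_free)
        start = max(data_ready, free_at)
        sched[t] = (start, pe)
        finish[t] = start + tasks[t]
        heapq.heappush(pe_free, (finish[t], pe))
        done.add(t)
        for s in succs[t]:
            if preds[s] <= done and s not in ready:
                ready.append(s)
    return sched

tasks = {"a": 3, "b": 8, "c": 2, "d": 3, "e": 2, "f": 1}
deps = [("a", "c"), ("a", "d"), ("c", "e"), ("b", "f"), ("e", "f")]
sched = asap_schedule(tasks, deps, 2)
print(sched)
print(max(sched[t][0] + tasks[t] for t in tasks))  # makespan = 11 here
```

The ALAP variant would apply the same procedure on the reversed graph, as described in the text.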
2. Dynamic Task Scheduling (CoreManager): For the case of dynamic scheduling we need a run-time scheduler that we call Core Manager. It is responsible
for fetching a bunch of tasks (actors) into its local buffer. Now, the Core Manager has to distinguish two cases:
1. task is ready to be fired → there are no dependencies on other running
tasks → fire as soon as possible (i.e. when the next PE is available)
2. task is almost ready to be fired → "almost" means that the actor inputs
are still being calculated by other actors (currently running tasks or in the
buffer to be executed); the Core Manager must check the dependencies to find a
valid execution order
The Core Manager concept is depicted in figure 150. Therein, the microprocessor
(µP) controls the top-level program flow (i.e. executes the main
routine) and pushes task descriptions (for tasks to be scheduled onto PEs) into
the task queue of the Core Manager. The Core Manager analyzes the dependencies between scheduled tasks and forwards them in an admissible order to the
processing elements to be executed. For this purpose, the program code as well
as the input data is transferred from the global memory to the local memory of
the corresponding PE (usually by means of a DMA). After the PE has finished the
task execution, it informs the Core Manager that it is available for new tasks.
The result of the computation is transferred back to the global memory, or kept
in local memory to be used further by a subsequent task (→ data locality).
Figure 150: Core Manager concept for dynamic scheduling
Thereby, the Core Manager can pursue different optimization strategies; the issue
decision can be controlled for:
speed of completion
power consumption
locality of data/program
Let's give a small example to motivate this. Assume a PE where currently an
actor is running. The Core Manager now has to decide which of the actors that
are available in its buffer should be issued next to that PE. With respect to the
data locality principle, the following facts influence this decision:
Can input data be reused?
Can output data be reused?
Can program code be reused?
In consequence, memory transfers are minimized, ...
NoC load and memory R/W access is reduced and finally ...
energy consumption is minimized!
13.3
Network-on-Chip
Finally, let's take a look at the small example in figure 152. We see a NoC
arranged in a 2D-mesh topology of size 4x4. We recognize that this is a heterogeneous MPSoC, since different PE types (illustrated by circles and rectangles)
are connected to the router nodes (small bubbles). E.g. one of the nodes could
be used as central micro-processor to run the main code. Other PE nodes could
be used for central managers, such as the Core Manager or the NoC manager (a
central control instance for the link allocation in the network). Furthermore, we
find two successfully established routes in the NoC (green lines). A third connection request was not successful (red line), since the route is already blocked
by another crossing route.
Figure 152: 4x4 2D-Mesh example for Network-on-Chip
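Link allocation in such a mesh can be sketched with dimension-ordered (XY) routing, one common deterministic policy; the routing scheme and the coordinates below are assumptions for illustration, not taken from the figure:

```python
def xy_route(src, dst):
    """Dimension-ordered (XY) routing in a 2D mesh: move along x first,
    then along y. Returns the list of directed links the route occupies."""
    (x, y), (dx, dy) = src, dst
    path = []
    while x != dx:
        nx = x + (1 if dx > x else -1)
        path.append(((x, y), (nx, y))); x = nx
    while y != dy:
        ny = y + (1 if dy > y else -1)
        path.append(((x, y), (x, ny))); y = ny
    return path

def try_reserve(route, busy):
    """Circuit-switched link allocation: succeed only if no link is taken."""
    if any(link in busy for link in route):
        return False
    busy.update(route)
    return True

busy = set()
print(try_reserve(xy_route((0, 0), (2, 2)), busy))  # True  -- route established
print(try_reserve(xy_route((1, 0), (3, 0)), busy))  # False -- link (1,0)->(2,0) taken
print(try_reserve(xy_route((0, 1), (0, 3)), busy))  # True  -- disjoint links
```

A central NoC manager, as mentioned above, would perform exactly this kind of bookkeeping for all connection requests.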
This brief insight should give a small impression of the on-chip interconnection
challenge.
13.4
Heterogeneous MPSoC
This MPSoC type can consist of multiple different PEs (e.g. x86s &
ARMs) → refer to figure 151 in section 13.2.
One example with multiple DSPs and ARM processors is the Qualcomm
MSM, a cellular modem chip.
A typical application example that can exploit heterogeneous MPSoC is
the recent mobile communication standard LTE (Long Term Evolution).
In corresponding modems for this standard we typically find:
~10 ARMs (3 kinds)
~10 HW accelerators (e.g. filters, decoders, etc.)
~10 DSPs
~50 memories
13.5
Hierarchy
New challenges come up with the increasing size of MPSoCs. If we take a look
at state-of-the-art MPSoCs, we usually find a few dozen or maybe a hundred
processors. In the near future, we will already find multi-processor chips with
thousands of PEs, usually referred to as many-core systems-on-chip or FPPA
(field programmable processor array → processor arrays with programmable
interconnects, analogous to today's FPGAs [field programmable gate arrays]).
Assuming such a large number of PEs, conventional topologies will not be
sufficient anymore. Hierarchical network topologies with PEs pooled in clusters
and dynamically controllable cluster sizes might be a solution to handle such
large SoCs (→ figure 154).
Finally, we can also stack multiple MPSoCs together on a higher level of hierarchy. This could either be done on one plane (2D) or by stacking multiple
MPSoCs on top of each other (3D).
Figure 155: Hierarchy of multiple MPSoC
Design-Space Exploration
Another big challenge is the design of an efficient MPSoC for a specific problem.
This task is the subject of the so-called Design Space Exploration (DSE), which
systematically evaluates a large number of different architecture and mapping
alternatives with respect to their suitability for a given computation problem
(e.g. signal processing chain for a UMTS modem). Broadly speaking, we can
distinguish three different approaches to start off with a DSE:
1. greenfield approach: given a system problem → find a HW/SW solution
→ free choice of hardware architecture, interconnect, etc., software
partitioning and HW/SW mapping
Outlook 2020
Finally, we want to give a brief outlook concerning the development trends of
MPSoCs within this section. Today's chips consist of a few dozen or hundreds
of processors. Here are two examples:
1. nVidia: 500 PEs on a chip in 40 nm
2. Tomahawk chip: 14 cores in 130 nm
A strong growth of the degree of parallelism is predicted already for near-future
many-core chips:
- 2020: ~ 0.1M - 1M cores
- 2030: ~ 1B cores(!)
This means that the predicted number of cores in 2030 is comparable to
the number of transistors in today's chips!