Embedded System Design


1

Thesis proposal: Throughput and Latency aware mapping of NoC
Aug 17, 2006

Vu-Duc Ngo
System VLSI Lab
System Integration Technology Institute
2
Contents
Related works
Part I: Latency aware mapping of NoC architectures
Part II: Throughput aware mapping of NoC architectures
Part III: Energy consumption of NoC architectures
Part IV: Experiment results
Case study: H.264 video decoder
Architectures used: 2-D Mesh, Fat-Tree
Case study: Video Object Plane Decoder
Architectures used: 2-D Mesh, Fat-Tree, custom topologies
Future works
Publication list
References
Appendices for detailed future works
Appendix I: G/G/1 Queuing: Theoretical Approach
Appendix II: Double Plane VC Wormhole Router
Appendix III: NoC Emulation



3
Related works
Energy-aware mapping schemes:
Proposed by:
The research groups of De Micheli (Stanford Univ.) and R. Marculescu (CMU)
Addressed the issue of minimizing the power consumption of NoC architectures
Did not address the currently hot issue of QoS, such as:
Throughput guarantees
Latency guarantees
Did not consider:
The dropping of packets inside the network, which is inherent to packet-based switching networks
The power consumption was simulated with a homogeneous bit-energy model
4
Proposed mapping scheme: Latency and Throughput aware mapping
Raising the issue:
- The QoS issue, currently a hot topic in NoC design, was addressed by J. Nurmi at the SoC05 conference.
- It was also strongly mentioned by A. Ivanov and De Micheli (IEEE Design and Test, Aug 2005) as an important design criterion for future NoCs.
- It will be the main theme of the 1st IEEE NoC Symposium 2007.

Our work:
- Find a mapping scheme that:
1. Minimizes the architecture's latency
2. Maximizes the architecture's throughput
3. Calculates the corresponding size and power consumption
5
Part I: Latency aware mapping of NoC
6
Latency Optimal Mapping: Introduction
Latency:
The IPs and the NoC architecture are heterogeneous.

The question is: which switching core should each IP core be mounted onto in order to minimize the network latency?

Issues of mapping IPs onto NoC architectures:
For each mapping scheme:
The routing table of the applied routing algorithm changes, because the assignment of IPs onto the pre-selected NoC architecture changes.
The queuing latency changes according to the content of the routing table.
7
Latency Optimal Mapping: Introduction (Contd)
Solution:
Assume data transactions have a Poisson distribution (general distributions will be studied in future work)
Use the M/M/1 queuing model to analyze the latency
We utilize a spanning-tree search algorithm to:

Automatically map the desired IPs onto the NoC architecture

Guarantee that the network latency is minimized

Reduce the search complexity of finding the optimum allocation scheme

8
Latency Optimal Mapping: M/M/1 queuing model
Let the arrival rate of packets at a node be $\lambda$, the number of packets at the node be $N$, and the node latency be $T$.
Little's theorem relates the latency and the number of packets:
$$N = \lambda T$$
[Figure: a single network node, modeled as packets arriving at a buffer that feeds the switching core (server) attached to an IP core.]
9
Latency Optimal Mapping: M/M/1 queuing model (Contd)
M/M/1 (Contd):
Since (Little's theorem)
$$N = \frac{\rho}{1-\rho} = \frac{\lambda}{\mu - \lambda},$$
the time one packet spends in a node is
$$T = \frac{N}{\lambda} = \frac{1}{\mu - \lambda},$$
the time one packet spends in the buffer (with $1/\mu$ the mean processing time) is
$$W = T - \frac{1}{\mu} = \frac{\lambda}{\mu(\mu - \lambda)},$$
and the number of packets in the buffer is
$$N_Q = \lambda W = \frac{\lambda^2}{\mu(\mu - \lambda)}.$$
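To make these relations concrete, here is a minimal Python sketch (the rates are assumed, not taken from the slides) that evaluates $T$, $W$, and $N_Q$ for a single node:

```python
def mm1_metrics(lam, mu):
    """M/M/1 node metrics: latency T, buffer wait W, buffer occupancy N_Q."""
    assert lam < mu, "queue is unstable unless arrival rate < service rate"
    T = 1.0 / (mu - lam)       # time a packet spends in the node
    W = T - 1.0 / mu           # time a packet spends waiting in the buffer
    N_Q = lam * W              # mean number of packets in the buffer
    return T, W, N_Q

# Example with assumed rates: 0.6 packets/cycle arriving, 1.0 packets/cycle served.
print(mm1_metrics(0.6, 1.0))   # -> (2.5, 1.5, 0.9)
```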
10
Latency Optimal Mapping: Queuing latency in complex network
Network topology:
For each $i$-th stream: $N_i = \lambda_i T_i$.
Since the streams are i.i.d. and merged Poisson traffic keeps the Markovian distribution property, the aggregate arrival rate at the $j$-th node is
$$\lambda^j = \sum_{i \in \Theta_j} \lambda_i^j,$$
where $\Theta_j$ is the set of incoming streams toward node $j$. With $N^j = \lambda^j T^j$ (Little's theorem), the queuing latency of the $j$-th node is
$$T^j = \frac{1}{\mu^j - \sum_{i \in \Theta_j} \lambda_i^j}.$$
11
Latency Optimal Mapping: Queuing latency in complex network (Contd)
Thus, the latency of the $k$-th route $R_k$ is
$$T_{Queue}^{R_k} = \sum_j \sigma_{kj} \frac{1}{\mu^j - \sum_{i \in \Theta_j} \lambda_i^j},$$
where
$$\sigma_{kj} = \begin{cases} 1, & \text{if node } j \in R_k \\ 0, & \text{if node } j \notin R_k \end{cases}$$
The network latency in terms of queuing latency is given by
$$\sum_{k=1}^{m} \sum_j \sigma_{kj} \left[ \frac{1}{\mu^j - \sum_{i \in \Theta_j} \lambda_i^j} \right],$$
where $m$ is the number of routes in the routing table.
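A minimal sketch of this route-latency sum, assuming routes are given as lists of node indices with per-node aggregate arrival and service rates (all names and numbers below are hypothetical):

```python
def route_queue_latency(route, mu, lam_in):
    """Sum of per-node M/M/1 latencies 1/(mu_j - sum_i lambda_i^j) along one route."""
    return sum(1.0 / (mu[j] - lam_in[j]) for j in route)

def network_queue_latency(routing_table, mu, lam_in):
    """Network latency: route-latency sum over all m routes in the table."""
    return sum(route_queue_latency(r, mu, lam_in) for r in routing_table)

mu     = {0: 2.0, 1: 2.0, 2: 2.5}   # service rate per node (assumed)
lam_in = {0: 0.8, 1: 1.2, 2: 0.5}   # aggregate arrival rate per node (assumed)
print(network_queue_latency([[0, 1], [1, 2]], mu, lam_in))
```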
12
Wire latency
If we take into account the differences among the wires, then for a wire modeled as a distributed RLC line driving a load $C_{load}$, the wire delay is
$$T_w = \int_0^{l} \int_0^{x} \frac{L_0}{W(x)} \left( C_0 W(y) + C_f \right) dy\, dx.$$
Furthermore, we can also express the line capacitance in terms of the wire width:
$$C_{line}(W) = C_0 W(x) + C_f,$$
where:
$L_0$: wire inductance per square
$C_0$: wire capacitance per unit area
$C_f$: fringing capacitance per unit length
M. A. El-Moursy and E. G. Friedman, "Design Methodologies For On-Chip Inductive Interconnection", Chapter 4, Interconnect-Centric Design For Advanced SoC and NoC, Kluwer Academic Publishers, 2004.
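The wire-delay integral above can be checked numerically; the sketch below assumes a uniform wire width $W(x) = W$ (a simplification not made on the slide) and made-up per-unit parameters:

```python
def wire_delay(l, W, L0, C0, Cf, steps=200):
    """Numerically evaluate T_w = int_0^l int_0^x (L0/W(x)) (C0*W(y) + Cf) dy dx
    for a constant wire width W; units must be chosen consistently."""
    dx = l / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * dx
        # For constant W the inner integral over y in [0, x] is closed-form:
        inner = (C0 * W + Cf) * x
        total += (L0 / W) * inner * dx
    return total

# Illustrative (made-up) parameters: L0, C0, Cf per unit, wire length l, width W.
print(wire_delay(l=1e-3, W=1e-6, L0=1e-7, C0=1e-4, Cf=1e-10))
```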
13
Wire latency (cont)
The route latency in terms of wire latency is calculated by
$$T_{Wire}^{R_i} = \sum_j \sigma_{ij} \int_0^{l_j} \int_0^{x} \frac{L_0}{W(x)} \left( C_0 W(y) + C_f \right) dy\, dx,$$
where $T_{Wire}^{R_i}$ is the latency of the $i$-th route and
$$\sigma_{ij} = \begin{cases} 1, & \text{if node } j \in R_i \\ 0, & \text{if node } j \notin R_i \end{cases}$$
The network latency in terms of wire latency is presented by
$$\sum_{i=1}^{m} \sum_j \sigma_{ij} \left[ \int_0^{l_j} \int_0^{x} \frac{L_0}{W(x)} \left( C_0 W(y) + C_f \right) dy\, dx \right],$$
where $m$ is the number of routes in the routing table.
14
Network latency
Considering that the shortest-path routing algorithm is applied:
With a given application, there is a certain routing table.
The average latency is
$$T_{Aver,Lat} = \frac{1}{N} \sum_{k=1}^{m} \sum_j \sigma_{kj} \left\{ \frac{1}{\mu^j - \sum_{i \in \Theta_j} \lambda_i^j} + \int_0^{l_j} \int_0^{x} \frac{L_0}{W(x)} \left( C_0 W(y) + C_f \right) dy\, dx \right\},$$
where
$$\sigma_{kj} = \begin{cases} 1, & \text{if node } j \in \text{route } k \\ 0, & \text{if node } j \notin \text{route } k \end{cases}$$
and $m$ is the number of routes in the routing table.
15
Latency Optimal Mapping: Problem statement
Since the $\lambda_i$ can change according to the status of the network:
Routing (predetermined connections of IPs)
Congestion
Arrival rates being accumulated
However, the $\mu_j$ are unchanged, due to the predetermined design of the switching nodes.

Therefore, an optimum mapping should be figured out to minimize the system latency for:
A certain practical application
Ex. H.264 video decoder, VOPD
Find an optimum mapping
16
Latency Optimal Mapping: Graph definitions
Graph characterizations of the IP cores: IIG (IPs Implementation Graph)
$G(V, A)$: directed graph
Vertex $v_i \in V$: IP cores
$A(v_i) = \lambda_i$: arrival rate
Graph characterizations of the NoC architecture: SAG (Switching Architecture Graph)
$G(U, P)$: directed graph
Vertex $u_i \in U$: node of the NoC topology
$1/P(u_i)$: mean processing time of $u_i$
[Figure: an example IIG with vertices V1-V10, and an example 4x4 mesh SAG with vertices U1-U16.]
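One minimal way to encode the two graphs, with assumed example rates (all identifiers and numbers are illustrative only):

```python
# IIG: IP cores with their arrival rates lambda_i (assumed values) and out-edges.
iig = {
    "V1": {"lambda": 0.4, "out": ["V2", "V3"]},   # IP core V1 sends to V2 and V3
    "V2": {"lambda": 0.7, "out": ["V3"]},
    "V3": {"lambda": 0.2, "out": []},
}

# SAG: NoC nodes with service rate mu_i (1/mu_i = mean processing time) and links.
sag = {
    "U1": {"mu": 2.0, "links": ["U2"]},
    "U2": {"mu": 2.0, "links": ["U1", "U3"]},
    "U3": {"mu": 2.5, "links": ["U2"]},
}
```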
17
Latency Optimal Mapping: Mathematical formula
Mapping with min-latency criteria.
Definition of mapping:
$$map: G(V, A) \rightarrow G(U, P), \quad map(v_i) = u_j$$
$$s.t.\ \forall v_i \in V, \exists u_j \in U; \quad \forall v_i \neq v_j \in V,\ map(v_i) \neq map(v_j)$$
This means that each IP needs to be mapped to exactly one node of the NoC topology, and no node can host more than one IP core.
Min-latency criteria, with the cost function being the average latency
$$T_{Aver,Lat} = \frac{1}{N} \sum_{k=1}^{m} \sum_j \sigma_{kj} \left\{ \frac{1}{\mu^j - \sum_{i \in \Theta_j} \lambda_i^j} + \int_0^{l_j} \int_0^{x} \frac{L_0}{W(x)} \left( C_0 W(y) + C_f \right) dy\, dx \right\}:$$
Find a $map: G(V, A) \rightarrow G(U, P)$ to
$$\min_{map} \left\{ \sum_{k=1}^{m} \sum_j \sigma_{kj} \left[ \frac{1}{\mu^j - \sum_{i \in \Theta_j} \lambda_i^j} + \int_0^{l_j} \int_0^{x} \frac{L_0}{W(x)} \left( C_0 W(y) + C_f \right) dy\, dx \right] \right\}$$
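For intuition, the sketch below brute-forces the injective mapping space that the slides prune with a spanning-tree / Branch and Bound search; `average_latency` stands in for the cost function above and is an assumed callable:

```python
from itertools import permutations

def best_mapping(ip_cores, noc_nodes, average_latency):
    """Try every injective map of IPs onto NoC nodes and keep the one that
    minimizes the average-latency cost. This is factorial in the IP count;
    Branch and Bound prunes most of this space in practice."""
    best, best_cost = None, float("inf")
    for placed in permutations(noc_nodes, len(ip_cores)):
        mapping = dict(zip(ip_cores, placed))   # each IP on exactly one node
        cost = average_latency(mapping)
        if cost < best_cost:
            best, best_cost = mapping, cost
    return best, best_cost
```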
18
Latency Optimal Mapping: Mapping example
Mapping example:
Solution: use the spanning-tree search.
[Figure: example mapping onto the NoC architecture graph (SAG).]
19
Example of an On-Chip Multiprocessor Network (OCMN)
[Figures: Mesh architectures and Fat-Tree architectures.]
20
Simulation results: H.264 video decoder on 2-D Mesh
Latency-optimal mapping:
DB | LENT | MC
VOM | MVMVD | Processor | IPRED
IS | REC | FR_MEM | ITIQ
Minimum latency: $L_{min} = 325\ \mu s$
Random mapping:
DB | MC | FR_MEM | ITIQ
LENT | VOM | Processor | REC
MVMVD | IPRED | IS
Random latency: $L_{Random} = 416\ \mu s$
Architecture throughput and energy:
[Figures: throughput comparison (optimal vs. random) — aggregate throughput over time (x 0.0005 second), with the optimal-latency mapping above the random mapping; and energy consumption (J) of Opt_Latency_map vs. Random_map.]
21
Part II: Throughput aware mapping of NoC
architectures
22
Throughput aware mapping
Since the wormhole router handles data flow at the flit level:
The probability that $m$ flits exist in a buffer (using the M/M/1 queuing model) is
$$p_m^j = \left( \frac{\sum_{l=1}^{p_i} \lambda_{il}^j}{\mu_j} \right)^m p_0^j,$$
where $\lambda_{il}^j$, $l = 1..p_i$, are the arrival rates of the data flows to the $i$-th port of the $j$-th router, and $p_i$ is the number of individual data flows entering port $i$.
Block probability, with buffer size $k$:
$$P_{block,i}^j = \left( \frac{\sum_{l=1}^{p_i} \lambda_{il}^j}{\mu_j} \right)^{k+1}$$
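A small sketch of this blocking probability under the stated approximation (the flow rates, service rate, and buffer size below are assumed):

```python
def block_probability(port_lams, mu, k):
    """P_block for one input port: (sum of flow rates / service rate)^(k+1)."""
    rho = sum(port_lams) / mu
    return rho ** (k + 1)

print(block_probability([0.2, 0.3], mu=1.0, k=4))  # two assumed flows, buffer of 4
```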
23
Throughput aware mapping (Contd)
The throughput contributed by the $i$-th port is
$$\lambda_i^j \left( 1 - P_{block,i}^j \right) = \left( \sum_{l=1}^{p_i} \lambda_{il}^j \right) \left( 1 - \left( \frac{\sum_{l=1}^{p_i} \lambda_{il}^j}{\mu_j} \right)^{k+1} \right).$$
Therefore, the throughput contributed by the $j$-th router (with $p_j$ ports) is
$$T_j = \sum_{i=1}^{p_j} \left( \sum_{l=1}^{p_i} \lambda_{il}^j \right) \left( 1 - \left( \frac{\sum_{l=1}^{p_i} \lambda_{il}^j}{\mu_j} \right)^{k+1} \right).$$
Network throughput, where $N$ is the number of routers:
$$T_{Net} = \sum_{j=1}^{N} T_j = \sum_{j=1}^{N} \sum_{i=1}^{p_j} \left[ \left( \sum_{l=1}^{p_i} \lambda_{il}^j \right) \left( 1 - \left( \frac{\sum_{l=1}^{p_i} \lambda_{il}^j}{\mu_j} \right)^{k+1} \right) \right]$$
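Putting the pieces together, a sketch of the network-throughput cost; the per-port flow rates and per-router service rates below are assumed inputs:

```python
def router_throughput(ports, mu, k):
    """Sum over ports of (offered load) * (1 - blocking probability)."""
    total = 0.0
    for port_lams in ports:                 # one list of flow rates per port
        lam = sum(port_lams)
        total += lam * (1.0 - (lam / mu) ** (k + 1))
    return total

def network_throughput(routers, k):
    """T_Net: sum of router throughputs over all N routers."""
    return sum(router_throughput(ports, mu, k) for (ports, mu) in routers)

# Two assumed routers: (per-port flow-rate lists, service rate mu).
routers = [([[0.2, 0.3], [0.1]], 1.0), ([[0.4]], 1.2)]
print(network_throughput(routers, k=4))
```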
24
Throughput aware mapping (Contd)
Since
$$T_{Net} = \sum_{j=1}^{N} \sum_{i=1}^{p_j} \left[ \left( \sum_{l=1}^{p_i} \lambda_{il}^j \right) \left( 1 - \left( \frac{\sum_{l=1}^{p_i} \lambda_{il}^j}{\mu_j} \right)^{k+1} \right) \right]$$
is a function of the allocation scheme of IPs onto the architecture, the optimal mapping of a given application onto the architecture must be worked out to maximize the network throughput.
25
Throughput aware Mapping: Mathematical formula
Mapping with max-throughput criteria.
Definition of mapping:
$$map: G(V, A) \rightarrow G(U, P), \quad map(v_i) = u_j$$
$$s.t.\ \forall v_i \in V, \exists u_j \in U; \quad \forall v_i \neq v_j \in V,\ map(v_i) \neq map(v_j)$$
This means that each IP needs to be mapped to exactly one node of the NoC topology, and no node can host more than one IP core.
Max-throughput criteria, with the network throughput as the cost function:
Find a $map: G(V, A) \rightarrow G(U, P)$ to
$$\max_{map} \left\{ \sum_{j=1}^{N} \sum_{i=1}^{p_j} \left[ \left( \sum_{l=1}^{p_i} \lambda_{il}^j \right) \left( 1 - \left( \frac{\sum_{l=1}^{p_i} \lambda_{il}^j}{\mu_j} \right)^{k+1} \right) \right] \right\}$$
26
Part III: Energy consumption of NoC architectures
27
Energy and area: calculation and parameters
0.1 µm CMOS technology
Vdd = 1.2 V
Router configurations: 3x3, 4x4, 5x5, 7x7
Energy of a CMOS circuit:
$$E = \frac{1}{2} \alpha C V_{dd}^2$$
Since
$$P = f_{clk} E,$$
the router energy model is
$$E = E_{xbar} + E_{arb} + E_{bufrd} + E_{bufwrt},$$
and the router power model is
$$P = f_{clk} \left( E_{xbar} + E_{arb} + E_{bufrd} + E_{bufwrt} \right) = \frac{1}{2} \alpha f_{clk} V_{dd}^2 \left( C_{xbar} + C_{arb} + C_{bufrd} + C_{bufwrt} \right)$$
[Figure: p×p wormhole router — p input ports, one buffer per input, an arbiter, a crossbar switch, and p output ports.]
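A direct transcription of the router power model as a sketch, with made-up capacitance and activity values:

```python
def router_power(f_clk, v_dd, alpha, c_xbar, c_arb, c_bufrd, c_bufwrt):
    """P = 1/2 * alpha * f_clk * Vdd^2 * (C_xbar + C_arb + C_bufrd + C_bufwrt)."""
    return 0.5 * alpha * f_clk * v_dd**2 * (c_xbar + c_arb + c_bufrd + c_bufwrt)

# Assumed values: 1 GHz clock, Vdd = 1.2 V, activity 0.5, capacitances in farads.
print(router_power(1e9, 1.2, 0.5, 2e-12, 0.5e-12, 1e-12, 1e-12))  # watts
```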
28
Energy and area: calculation and parameters (Contd)
For an N×N 2-D Mesh homogeneous architecture:
$$P_{Net} = f_{clk} H_{avg} \left( E_{xbar} + E_{arb} + E_{bufrd} + E_{bufwrt} \right) = \frac{1}{2} \alpha f_{clk} V_{dd}^2 H_{avg} \left( C_{xbar} + C_{arb} + C_{bufrd} + C_{bufwrt} \right)$$
Since the average hop count is
$$H_{avg} = \frac{2N}{3},$$
then
$$P_{Net} = \frac{1}{2} \alpha f_{clk} V_{dd}^2 \frac{2N}{3} \left( C_{xbar} + C_{arb} + C_{bufrd} + C_{bufwrt} \right)$$
29
Power and area: calculation and parameters (Contd)
For a heterogeneous architecture, the power consumed by a flit traversing route $R_i$ is
$$P_{flit}^i = f_{clk} \sum_{j \in R_i} \sigma_{ij} \left( E_{xbar}^j + E_{arb}^j + E_{bufrd}^j + E_{bufwrt}^j \right),$$
where
$$\sigma_{ij} = \begin{cases} 1, & \text{if router } j \in R_i \\ 0, & \text{otherwise} \end{cases}$$
Hence:
$$P_{Net} = \sum_i P_{flit}^i = f_{clk} \sum_i \sum_{j \in R_i} \sigma_{ij} \left( E_{xbar}^j + E_{arb}^j + E_{bufrd}^j + E_{bufwrt}^j \right)$$
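A sketch of the heterogeneous model, summing precombined per-router energies along each route (the energy values below are assumed):

```python
def network_power(f_clk, routes, router_energy):
    """P_Net = f_clk * sum over routes R_i and routers j in R_i of the router's
    combined energy E_xbar^j + E_arb^j + E_bufrd^j + E_bufwrt^j."""
    return f_clk * sum(router_energy[j] for route in routes for j in route)

router_energy = {0: 3.0e-12, 1: 2.5e-12, 2: 4.0e-12}   # J per flit, assumed
print(network_power(1e9, [[0, 1], [1, 2]], router_energy))
```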
30
Energy estimation
Bit energy estimation, and energy estimation based on throughput.
Estimation flow: information bits from a random-mapping simulation are traced; the Orion power model, with a scaling factor for the given CMOS technology, gives the bit energy consumption; combining it with the router and interconnection power models yields the system energy estimation.
[Figure: total throughput (Mbps) over time (x 0.0005 second) for the SFQ, DRR, RED, and DropTail (FIFO) queuing schemes under a random-mapping example.]
H. S. Hwang et al., "Orion: A Power Performance Simulator for Interconnection Networks," IEEE Micro, Nov. 2002.
31
Energy estimation (Contd)
Optimum system power estimation: the bit energy estimation is applied to both the random-mapping and optimal-mapping simulations to obtain the optimum system energy estimation.
[Figures: total throughput (Mbps) over time (x 0.0005 second) for SFQ, DRR, RED, and DropTail (FIFO) under the optimum mapping; and average throughput (Mbps) over time for non-optimal mappings at 11, 15, 20, and 30 Mbps versus a suboptimal mapping at 11 Mbps.]
32
Part IV: Experiment results
1. Throughput aware mapping: 2D Mesh,
Fat-Tree architectures for H.264 design
2. Throughput aware mapping: irregular
architectures for VOPD design
33
H.264 on regular topologies
34
H.264 video decoder's data transaction table
Assume data transactions follow a Poisson distribution
Throughput aware mapping for:
2-D Mesh, Fat-Tree
35
Wormhole router architecture
p×p wormhole router
p input/output ports
Single switching plane
One input buffer for each input
Ex. the 2-D Mesh uses a 5×5 router
[Figure: p×p wormhole router — p input ports, one buffer per input, an arbiter, a crossbar switch, and p output ports.]
36
Simulation parameters
For all topologies:
Routing scheme: Shortest Path
Queuing scheme: DropTail
Buffer size: 4 packets
Packet size: 64 bytes
Flit size: 128 bits
37
Throughput comparison
Throughput of five topologies
38
Topology comparison
Topology size
Topology energy
39
VOPD on 5 topologies
40
VOPD's data transaction table
Assume data transactions follow a Poisson distribution
Throughput aware mapping for:
2-D Mesh, Fat-Tree and 3 custom topologies
41
First custom topology (3 Xbar)
First custom topology:
(a) NAM output of the first topology; (b) VOPD on the first topology.
[Figure: a 6×6 wormhole router linked to two 5×5 wormhole routers (1st and 2nd), hosting the VOPD cores VMEM, UPSAM, VRecst, PAD, IQua, AC/DC, IDCT, Invers, ARM, Str_MEM, Run_Lenght, VarLen.]
42
Second custom topology (4 Xbar)
(a) NAM output of the second topology; (b) the H.264 decoder on the second topology.
[Figure: three 5×5 wormhole routers (1st, 2nd, 3rd) and one 3×3 wormhole router, hosting MC, DB, DMA, FR_MEM, ITIQ, LENT, VOM, REC, MVMVD, PROC, IPRED, IS.]
43
Third custom topology (5 Xbar)
(a) NAM output of the third topology; (b) VOPD on the third topology.
[Figure: a 6×6 wormhole router, a 5×5 wormhole router, and three 3×3 wormhole routers (1st, 2nd, 3rd), hosting VMEM, UPSAM, VRecst, PAD, IQua, AC/DC, IDCT, Invers, ARM, Str_MEM, Run_Lenght, VarLen.]
44
Wormhole router architecture
p×p wormhole router
p input/output ports
Single switching plane
One input buffer for each input
Ex. the 2-D Mesh uses a 5×5 router
[Figure: p×p wormhole router — p input ports, one buffer per input, an arbiter, a crossbar switch, and p output ports.]
45
Simulation parameters
For all topologies:
Routing scheme: Shortest Path
Queuing scheme: DropTail
Buffer size: 4 packets
Packet size: 64 bytes
Flit size: 128 bits
46
Throughput comparison
Throughput of five topologies
47
Result Discussion: In terms of throughput
Best topology: Fat-Tree

Worst topology: 2-D Mesh
It has the lowest aggregate throughput, along with the high hardware overhead of unused switches

The Fat-Tree offers almost the same throughput as the first custom topology, but at a big hardware overhead
48
Power and area of Router
49
Wire and energy dissipation
Wire dimension vs. capacitance, 0.10 µm technology:
L_drawn/Tech: 0.10 µm — Capacitance (fF/µm): 335
Energy:
$$E_{wire} = \frac{1}{2} C_{wire} V_{dd}^2$$
Wire dimension vs. chip edge: 0.10 µm technology.
R. Ho, et al., "The future of wires," Proceedings of the IEEE, pp. 490-504, April 2001.
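The wire-energy relation as a sketch, using the per-µm capacitance from the table above (the wire length and supply voltage are assumed):

```python
def wire_energy(cap_per_um, length_um, v_dd):
    """E_wire = 1/2 * C_wire * Vdd^2, with C_wire = cap_per_um * length_um."""
    return 0.5 * cap_per_um * length_um * v_dd**2

# 335 fF/um from the slide's table (converted to F/um), a 1000 um wire, Vdd = 1.2 V.
print(wire_energy(335e-15, 1000, 1.2))  # joules
```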
50
Topology comparison
[Figures: size comparison and energy consumption comparison.]
51
Topology comparison (Contd)
Conclusion:
The 1st topology consumes the smallest power
Its wire energy is not significant, because of the simplicity of its interconnections

The Fat-Tree consumes the biggest power and has the biggest size
Its wire energy dissipation is significant, due to its complex interconnections
52
Custom Topologies comparison (Random map vs. optimal map)
[Figures: comparison in terms of throughput and in terms of energy consumption.]
53
Custom Topologies comparison (Random map vs. optimal map) (Contd)
Discussion:
Optimal maps offer not only better throughput but also lower energy consumption.
No ARQ scheme was implemented.
If ARQ were used: the same or even higher throughput, but more energy consumed for retransmitting dropped packets. (Future work)

54
Conclusions
Heterogeneous NoC architectures are considered for design based on the latency criterion
The latency of a heterogeneous NoC architecture, in terms of router and wire latency, is fully formulated
A Branch and Bound algorithm is adopted to automatically map the IPs onto the NoC architectures under the optimal-latency metric
Experiments on various sizes of Mesh and Fat-Tree architectures for the OCMN application and H.264 were carried out
The latency of the optimal mappings is significantly reduced


55
Conclusions (Contd)
Heterogeneous NoC architectures are considered for design based on the maximum-throughput criterion
The throughput of a heterogeneous NoC architecture is formulated
A Branch and Bound algorithm is adopted to automatically map the IPs onto the NoC architectures to obtain maximal throughput
Experiments on various sizes of Mesh, Fat-Tree, and Tree-Based architectures for VOPD and H.264 were carried out
The heterogeneous bit power model was applied to accurately obtain the energy consumption and area of the architectures


56
Future works
Modeling the architecture with the general-distribution queuing model G/G/1 (Appendix I)
Realization of a multi-layer router for NoC design (Appendix II)
Performance comparison with the current router model
Power consumption comparison with the current router model
Variations of the number of switching planes and the number of virtual channels will be considered
NoC emulation (Appendix III)
Global optimization over the two criteria of latency and throughput
ARQ implementation with power measurement for the throughput aware mapping scheme
57
Publication list
International Journals
1. Vu-Duc Ngo, Huy-Nam Nguyen, Hae-Wook Choi , "Analyzing the Performance of Me
sh and Fat-Tree topologies for Network on Chip design ", LNCS (Springer-Verlag), Vol
. 3824 / 2005, pp 300-310, Dec 2005.
2. Vu-Duc Ngo, Huy-Nam Nguyen, Hae-Wook Choi, "Designing On-Chip Network based on optimal latency criteria", LNCS (Springer-Verlag), Vol. 3820/2005, pp. 287-298, Dec 2005.
3. Huy-Nam Nguyen, Vu-Duc Ngo, Hae-Wook Choi, Realization of Video Object Plane
Decoder on On-Chip-Network Architecture, LNCS (Springer-Verlag), Vol. 3820 /
2005, pp. 256-264 , Dec 2005.
4. Vu-Duc Ngo, Hae-Wook Choi and Sin-Chong Park, "An Expurgated Union Bound for
Space-Time Code Systems", LNCS (Springer-Verlag), Vol. 3124 / 2004, pp 156-162,
July 2004
5. Vu-Duc Ngo, Huy-Nam Nguyen, Hae-Wook Choi, "The Optimum Network on Chip Architectures for Video Decoder Applications Design" (to be submitted to ETRI Journal).
6. Vu-Duc Ngo, Huy-Nam Nguyen, Hae-Wook Choi, "Throughput Aware Mapping for NoC design" (to be submitted to IEE Electronics Letters).

58
Publication list (Contd)
International Conferences
1. Vu-Duc Ngo, Hae-Wook Choi, "On Chip Network: Topology design and evaluation using NS2," ICACT 2005. The 7
th

IEEE International Conference on, Volume 2, Page(s):1292 - 1295, 21-23 Feb. 2005, Korea.
2. Vu-Duc Ngo, Hae-Wook Choi, "Designing Network on Chip based on Fat-Tree topology," 12th IEEE International Conference on Telecommunications (ICT2005), May 2005, Cape Town, South Africa (in proc).
3. Vu-Duc Ngo, Hae-Wook Choi, On-Chip Network latency analysis and optimization using Branch and Bound
algorithm , ITC-CSCC2005, Jeju, Korea, July 2005 (in proc).
4. Huy-Nam Nguyen, Vu-Duc Ngo, Hae-Wook Choi, Realization of Video Object Plane decoder on Mesh On Chip
Network Architecture, IASTED International Conference on Circuit, Signal and System (CSS2005), Oct 2005,
California, US (in proc).
5. Huy-Nam Nguyen, Vu-Duc Ngo, Hae-Wook Choi, Implementation of H.264 Decoder on On-Chip-Network
Architecture, International SoC design Conference (ISOCC2005), Oct 2005, Seoul, Korea (in proc).
6. Vu-Duc Ngo, Hae-Wook Choi, An optimum mapping of IPs for On-Chip Network design based on the minimum
latency constraint ", IEEE Tencon2005, Nov 2005. Melbourne, Australia (in proc).
7. Vu-Duc Ngo, Huy-Nam Nguyen, Hae-Wook Choi, "The Optimized Tree-based Network on Chip Topologies for H.264 Decoder Design," IEEE ICCES06, Nov 2006, Cairo, Egypt. (accepted)
8. Huy-Nam Nguyen, Vu-Duc Ngo, Hae-Wook Choi, "Assessing Routing Behavior on On-Chip-Network," in proc. IEEE ICCSC 06, July 2006, Bucharest, Romania.
9. Vu-Duc Ngo and Sin-Chong Park, "Tightening Union Bound by Applying Verdu Theorem for LDPC," PIMRC'2003 ,
The 14th IEEE International Symposium on Personal, Indoor and Mobile Radio Communications Vol. 1, pp 420-423,
Beijing, China Sept. 2003.
10. Vu-Duc Ngo, Sin-Chong Park, Hae-Wook Choi, "Expurgated Tangential Bound of Low Density Parity Check," in proc.
Fourth IEEE Communication Systems Network and Digital Signal Processing (CSNDSP), Newcastle, UK 20-22 July
2004.
11. Vu-Duc Ngo, Sin-Chong Park, Hae-Wook Choi, "Expurgated Sphere Bound of LDPC," PIMRC'2004, The 15th IEEE
International Symposium on Personal, Indoor and Mobile Radio Communications, Volume 4, Page(s):2591 - 2595 ,
Barcelona, Spain, 5-8 Sept. 2004

59
References
1. L. Benini and G. DeMicheli,"Networks On Chips: A new SoC paradigm", IEEE computer, Jan, 2002.
2. A. Agarwal, "Limit on interconnection network performance", IEEE Transactions on Parallel and Distributed Systems, Volume 2, Issue 4, Oct. 1991, pp. 398-412.
3. T. Ye, L. Benini and G. De Micheli," Packetization and Routing Analysis of On-Chip MultiProcessor
Networks", JSA Journal of System Architecture, Vol. 50, February 2004, pp. 81-104.
4. M. A. El-Moursy and E. G. Friedman, "Design Methodologies For On-Chip Inductive Interconnection", Chapter 4, Interconnect-Centric Design For Advanced SoC and NoC, Kluwer Academic Publishers, 2004.
5. R. Ho, et al, "The future of wires," Proceedings of the IEEE, pp. 490 - 504, April 2001.
6. J. Hu, R. Marculescu, "Exploiting the Routing Flexibility for Energy Performance Aware Mapping of
Regular NoC Architectures", in Proc. Design, Automation and Test in Europe Conf, March 2003.
7. J. Hu, R. Marculescu, "Energy-Aware Communication and Task Scheduling for Network-on-Chip
Architectures under Real-Time Constraints, in Proc. Design, Au- tomation and Test in Europe Conf,
Feb. 2004.
8. S. Murali and G. De Micheli, "Bandwidth-Constrained Mapping of Cores onto NoC Architectures", DATE, International Conference on Design and Test Europe, 2004, pp. 896-901.
9. T. Tao Ye, L. Benini, G. De Micheli, "Packetization and Routing for On-Chip Communication Networks,
" Journal of System Architecture, special issue on Networks- on-Chip.
10. T. H. Cormen, et al, "Introduction to algorithms," Second Edition, The MIT press, 2001.
11. D. Bertozzi, L. Benini and G. De Micheli, "Network on Chip Design for Gigascale Systems on Chips", in R. Zurawski, Editor, Industrial Technology Handbook, CRC Press, 2004, pp. 95.1-95.18.

60
References (Contd)
12. L. Benini and G. De Micheli, "Networks on Chip: A new Paradigm for component based MPSoC Design," in A. Jerraja and W. Wolf, Editors, "Multiprocessor Systems on Chips", Morgan Kaufmann, 2004, pp. 49-80.
13. D. Bertsekas and R. Gallager, "Data Networks," Chapter 5., Second Edition, Prentice-Hall, Inc.,
1992.
14. A. Jalabert, S. Murali, L. Benini, G. De Micheli, "xpipesCompiler: A Tool for instantiating application
specific Networks on Chip", Proc. DATE 2004.
15. M. Dall'Osso, G. Biccari, L. Giovannini, D.Bertozzi, L. Benini, "Xpipes: a latency insensitive paramete
rized network-on-chip architecture for multiprocessor SoCs", 21st International Conference on
Computer Design, Oct. 2003, pp. 536 - 539.
16. C. E. Leiserson,"Fat Trees: Universal networks for hardware efficient supercomputing," IEEE
Transactions on Computer, C-34, pp. 892-90, 1 Oct 1985.
17. H. S. Hwang et al.,"Orion: A Power Performance Simulator for Interconnection Networks," IEEE
Micro, Nov. 2002.
18. W. J. Dally and B. Towles,"Route Packets, Not Wires: On Chip Interconnection Networks," DAC, pp.
684-689, 2001.
19. N. Eisley and L.-S. Peh," High-level power analysis for on-chip networks," In Proceedings of the 7th
International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES),
September 2004.
20.J. Nurmi, "Network-on-Chip: A New Paradigm for System-on-Chip Design," Proceedings of
International Symposium on System-on-Chip, Nov. 2005.
21. L. Kleinrock, Queuing Systems, Volume 1: Theory, Wiley, New York, 1975.




61
Appendix I: Latency of G/G/1 queuing model
The latency of a general queuing model is
$$T = \frac{1}{\mu} + W,$$
where $1/\mu$ is the mean processing time and $W$ is the waiting time in the buffer.
For G/G/1 queuing (applied for generally distributed data transactions), from [21] the waiting time is
$$W \approx \frac{\lambda \left( \sigma_{X_1}^2 + \sigma_{X_2}^2 \right)}{2 \left( 1 - \lambda/\mu \right)}$$
[21] L. Kleinrock, Queuing Systems, Volume 1: Theory, Wiley, New York, 1975.
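A sketch of this G/G/1 waiting-time approximation; the arrival-process and service-process variances below are assumed inputs:

```python
def gg1_latency(lam, mu, var_arrival, var_service):
    """T = 1/mu + W, with W ~ lam * (sigma_X1^2 + sigma_X2^2) / (2 * (1 - lam/mu))."""
    W = lam * (var_arrival + var_service) / (2.0 * (1.0 - lam / mu))
    return 1.0 / mu + W

# Assumed rates and variances for one node.
print(gg1_latency(0.6, 1.0, var_arrival=1.8, var_service=1.0))
```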
62
Appendix I: Latency of G/G/1 queuing model (Contd)
Where $\lambda$ is the mean value of the arrival rate and
$$\sigma_{X_1}^2 = Var[X_1] = E[X_1^2] - (E[X_1])^2, \qquad \sigma_{X_2}^2 = Var[X_2] = E[X_2^2] - (E[X_2])^2.$$
Therefore, for a single node, the latency is given by
$$T_i = \frac{1}{\mu_i} + W_i = \frac{1}{\mu_i} + \frac{\lambda_i \left( \sigma_{i,X_1}^2 + \sigma_{i,X_2}^2 \right)}{2 \left( 1 - \lambda_i / \mu_i \right)}.$$
For the case of a complex node, at which $C_i$ independent data streams are associated (so that $\lambda_i = \sum_{j=1}^{C_i} \lambda_{ji}$), the node's latency is formulated by
$$T_i = \frac{1}{\mu_i} + \frac{\left( \sum_{j=1}^{C_i} \lambda_{ji} \right) \left( \sigma_{i,X_1}^2 + \sigma_{i,X_2}^2 \right)}{2 \left( 1 - \sum_{j=1}^{C_i} \lambda_{ji} / \mu_i \right)}$$

63
Appendix I: Latency of G/G/1 queuing model (Contd)
Because, for $C_i$ independent data streams $X_1^i, X_2^i, \ldots, X_{C_i}^i$:
$$E\left[ X_1^i + X_2^i + \cdots + X_{C_i}^i \right] = \sum_{j=1}^{C_i} \lambda_{ji}, \qquad Var\left[ X_1^i + X_2^i + \cdots + X_{C_i}^i \right] = \sum_{j=1}^{C_i} \sigma_{ji,X}^2.$$
For a certain $k$-th route, the latency is
$$T_{R_k} = \sum_i \sigma_{ik} T_i = \sum_i \sigma_{ik} \left( \frac{1}{\mu_i} + \frac{\left( \sum_{j=1}^{C_i} \lambda_{ji} \right) \left( \sigma_{i,X_1}^2 + \sigma_{i,X_2}^2 \right)}{2 \left( 1 - \sum_{j=1}^{C_i} \lambda_{ji} / \mu_i \right)} \right),$$
where
$$\sigma_{ik} = \begin{cases} 1, & \text{if node } i \in \text{route } k \\ 0, & \text{if node } i \notin \text{route } k \end{cases}$$
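Since the means and variances of independent streams simply add, merging the flows entering a node can be sketched as:

```python
def merge_streams(streams):
    """For independent streams, the merged mean rate and variance are just the
    sums of the per-stream means and variances."""
    mean = sum(m for (m, v) in streams)
    var = sum(v for (m, v) in streams)
    return mean, var

# Three assumed streams as (mean rate, variance) pairs.
print(merge_streams([(0.2, 0.05), (0.3, 0.10), (0.1, 0.02)]))
```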
64
Appendix I: Latency of G/G/1 queuing model (Contd)
If there are $m$ routes in total, the network latency is
$$T_{Net} = \sum_{k=1}^{m} T_{R_k} = \sum_{k=1}^{m} \sum_i \sigma_{ik} \left( \frac{1}{\mu_i} + \frac{\left( \sum_{j=1}^{C_i} \lambda_{ji} \right) \left( \sigma_{i,X_1}^2 + \sigma_{i,X_2}^2 \right)}{2 \left( 1 - \sum_{j=1}^{C_i} \lambda_{ji} / \mu_i \right)} \right) = f(map),$$
a function of the allocation scheme of IPs onto the architecture.
The optimization issue turns out to be: finding the optimal mapping of IPs onto routers such that
$$T_{Opt} = \min_{map} T_{Net} = \min_{map} \left\{ \sum_{k=1}^{m} \sum_i \sigma_{ik} \left( \frac{1}{\mu_i} + \frac{\left( \sum_{j=1}^{C_i} \lambda_{ji} \right) \left( \sigma_{i,X_1}^2 + \sigma_{i,X_2}^2 \right)}{2 \left( 1 - \sum_{j=1}^{C_i} \lambda_{ji} / \mu_i \right)} \right) \right\}$$
65
Appendix I: Latency of G/G/1 queuing model (Contd)
The conditions that we assume in order to calculate the cost function, as well as to apply Branch and Bound, are:

The mean processing times $1/\mu_i$ are constant

The mean and variance of each IP's data rate are known

The cost function then simplifies to
$$T_{Opt} = \min_{map} \left\{ \sum_{k=1}^{m} \sum_i \sigma_{ik} \left( \frac{1}{\mu_i} + \frac{\left( \sum_{j=1}^{C_i} \lambda_{ji} \right) \sigma_{i,X}^2}{2 \left( 1 - \sum_{j=1}^{C_i} \lambda_{ji} / \mu_i \right)} \right) \right\}$$
66
Appendix II: Multi-layer Router
Conventional virtual-channel router vs. multiple-switching-layer router:
[Figure: a conventional VC router — routing computation unit, VC allocator, switch allocator, input unit, crossbar switch, output unit — beside the multi-layer version, in which the single crossbar is replicated into Crossbar Switch 1 and Crossbar Switch 2.]
67
Appendix II: Multi-layer Router (Contd)
Performance analysis:
The total waiting time is
$$W = W_1 + W_2,$$
where $n$ is the number of virtual circuits and $m$ is the number of switching planes.
[Equations: the closed-form expressions for $W_1$ (a function of the utilization $\rho$ and $m$) and $W_2$ (a rational function of $m$ and $n$), together with the case split defining the saturation point in terms of $n$, $m$, and $\sqrt{n^2 + m^2}$.]
68
Appendix III: NoC Emulation
69
NoC Emulation: Behavior simulation framework
[Flow: a topology selector (Mesh, Fat-Tree, Torus, Octagon) and a data transaction table (Poisson or general distribution) feed the Optimizer (Branch and Bound, driven by the latency and throughput metrics) and the routing table (shortest path); the behavioral simulation produces traced data, which the bit energy model turns into energy and area analysis, while the performance analyzer reports the latency and throughput metrics.]
70
NoC Emulation: RTL simulation framework
[Flow: the router (double-plane VC wormhole), network interface (BE and GT for QoS), routing table (shortest path), and the optimal architecture form the total design at RTL level; a data generator supplies random data (uniform, Poisson) or real data (general distribution); RTL simulation produces traced data for the performance analyzer (latency and throughput metrics); synthesis on Virtex-4 gives critical path and area analysis, and Design Compiler performs power analysis and optimization.]
71
NoC Emulation: Board implementation framework
Emulation architecture with stochastic data generators:
[Figure: the emulation platform connects a switch and its routing table to several network interfaces (NI), each paired with a data generator, a data receiver, and an OCP interface; a scheduler schedules all data generators, and a controller switches the data's distribution and the data generators' mode; a PowerPC, reached through an OPB-to-IB bridge, stores the data of the data receivers in MEM for post-processing on the host PC; the design is written in C code and Verilog and synthesized onto the platform.]
72
NoC Emulation: Board implementation framework (Contd)
Emulation architecture with stochastic data generators (Contd):
The association of a certain data generator and receiver with a given NI is predetermined by the Optimizer.

The combination of NI, data generator, data receiver, OCP interface, and routing table, as well as the switch architecture, is synthesized and targeted on the FPGA.

The PowerPC reads data from MEM to the host PC for post-processing in the performance analysis.


73
NoC Emulation: Board implementation framework (Contd)
Emulation architecture with real data generators:
[Figure: as in the stochastic setup, but each NI is paired with a Tx.MEM/Rx.MEM memory pair behind its OCP interface; the scheduler, controller, PowerPC, OPB-to-IB bridge, and host PC are connected as before, with C code and Verilog synthesis targeting the platform.]
74
NoC Emulation: Board implementation framework (Contd)
A given combination of Tx.MEM and Rx.MEM plays the role of a soft IP with its own data transactions.
The given combination of Tx.MEM and Rx.MEM is associated with a certain NI by the Optimizer and controlled by the Controller.
The NI acts as the packetizer (it supports BE and GT).
The transmitted data is read out of Tx.MEM according to the given data-transaction timing diagram, scheduled by the Scheduler.
75
NoC Emulation: Emulation Board
Virtex-II Pro based processor board

Virtex-II Pro XC2VP100
Total slices: 44,096
Primitive design element
Double-plane VC wormhole router: 4,000 slices (9%)
Benchmark: e.g. the H.264 decoder
12 IPs, 16 routers
Design partition
6 IPs and 8 routers per FPGA
