A Systematic Network-on-Chip Traffic Modeling and Generation Methodology

Zhe Wang, Weichen Liu, Jiang Xu, Xiaowen Wu, Zhehui Wang, Bin Li, Ravi Iyer, Ramesh Illikkal
The Hong Kong University of Science and Technology; Intel Labs, Hillsboro, OR, USA.

This project is supported by Intel and GRF620911.
978-1-4799-5230-4/14/$31.00 © 2014 IEEE

Abstract: Network-on-chip (NoC) based multiprocessor systems-on-chip (MPSoCs) are becoming a promising architecture to meet modern applications' ever-increasing demands for computing capability under a limited power budget. NoC traffic patterns are essential tools for NoC performance assessment and architecture design exploration. In this paper, we present a systematic NoC traffic modeling and generation methodology and a set of realistic NoC traffic patterns called MCSL, which are generated through the methodology. The proposed methodology faithfully captures both the communication behaviors of real applications in NoCs and the temporal dependencies among them. It also optimizes application memory requirements, mapping and scheduling to maximize overall system performance and utilization before extracting traffic patterns through cycle-level simulations. Extensive experiments are conducted to verify the effectiveness of the methodology and to evaluate the performance of the generated traffic patterns. The results show that the MCSL traffic patterns can be used to study NoC characteristics more accurately than traditional random traffic patterns.

I. INTRODUCTION

Network-on-Chip (NoC) [1] is becoming a popular communication architecture for Multiprocessor Systems-on-Chip (MPSoCs) to meet the ever-increasing computation and communication requirements of emerging applications. Considerable effort from both academia and industry has been invested in NoC designs. Traffic models, or traffic patterns, play an important role in both NoC architecture design exploration and performance evaluation. Random traffic was widely used in early work to approximate the communication characteristics of real applications. However, researchers have pointed out that traditional random traffic modeling approaches can hardly reflect the features of real application traffic faithfully [2]. Traffic modeling should therefore be based on realistic applications to guarantee faithfulness and accuracy.

In this paper, we present a systematic traffic modeling and generation methodology based on realistic multiprocessor applications, and a traffic pattern suite for efficient NoC-based MPSoC evaluations. The proposed methodology uses the COSMIC benchmarks [3], and in particular the TCG models they provide, to capture the communication characteristics of real applications accurately and efficiently. It optimizes the memory allocation and the task mapping and scheduling for application execution to maximize overall system performance and resource utilization. Two types of traffic patterns are generated through our methodology: recorded traffic patterns (RTP) and statistical traffic patterns (STP). The former provides detailed communication traces for comprehensive NoC studies, while the latter helps to accelerate NoC explorations at the cost of accuracy. We publicly release them online as the MCSL (Multi-Constraint System-Level) NoC traffic patterns [4]. A set of experiments is conducted to analyze the generated traffic patterns and to evaluate their performance as well as their power efficiency. The results show that the MCSL traffic patterns can be used to study NoC characteristics more accurately than traditional random traffic patterns.

The rest of the paper is organized as follows. Section II describes the traffic modeling and generation methodology. Section III presents the experimental results on the analysis and evaluation of the realistic traffic patterns. Section IV concludes the work.

II. TRAFFIC MODELING AND GENERATION METHODOLOGY

An overview of the traffic modeling and generation methodology is shown in Fig. 1. The generation process starts with the COSMIC application models. Since multiprocessor applications are often performance-sensitive, it is necessary to perform optimizations and performance evaluation for traffic generation. As shown in Fig. 1, the traffic patterns are generated through four steps: memory space allocation, mapping and scheduling, performance evaluation, and traffic generation. Two types of traffic patterns are generated, called RTP and STP. The generation steps interact closely with each other. They are essential to the methodology because the decisions made in them substantially affect the final traffic pattern. We specifically optimized the memory space allocation and the task mapping and scheduling for different architecture models, so that the optimized decisions take full advantage of the parallel hardware resources and improve overall system performance and resource utilization. The methodology is designed to be flexible, scalable and extensible in the choice of these algorithms.

A. The COSMIC benchmark suite

The COSMIC benchmark suite provides faithful models of realistic applications to facilitate multiprocessor system studies. We take advantage of the TCG models provided in COSMIC to perform accurate and efficient traffic modeling and generation for real applications. This allows us to bypass operating systems (OS) and compilers and to explore novel MPSoC architectures quickly.

The applications are modeled by a weighted directed acyclic graph model, called the task communication graph (TCG). The TCG is defined as a tuple $G_t = (V, E)$, where $V$ is the set of
weighted vertices denoting the computational tasks, and $E$ is the set of weighted directed edges denoting the communication channels among tasks. Each task $v_i \in V$ has a worst-case execution time $t_i$, measured in clock cycles. Each edge $e_i = (v_{i,s}, v_{i,d}, w_i) \in E$ has a source task $v_{i,s}$, a destination task $v_{i,d}$ and the amount of data $w_i$ sent from $v_{i,s}$ to $v_{i,d}$, measured in words (32 bits in this work). More details about the COSMIC benchmark suite can be found in [3].

[Figure 1: flowchart. Inputs: COSMIC application model, architecture model. Steps: memory space allocation; task mapping and scheduling; performance evaluation; traffic generation. Outputs: recorded traffic pattern, statistical traffic pattern.]
Fig. 1. An overview of the traffic generation methodology.

B. Architecture model

The architecture model captures the hardware resources in MPSoCs, including processing blocks (PBs) and NoCs. In this work, we target regular NoC topologies, such as mesh, torus and fat tree, and homogeneous MPSoCs. Detailed NoC configurations are listed in Table I. Formally, we define an architecture model as a graph $G_p = (P, N)$, where $P$ is a set of PBs and $N$ is an NoC architecture. The traffic generation methodology takes the architecture model as an input for traffic generation. The proposed methodology can also be extended to support more NoC topologies and configurations.

C. Memory space allocation

We assume that a virtual private memory is assigned to each communication edge to store the data generated by the source task. When the application is executed repeatedly in an iterative fashion, insufficient memory space allocated to an edge can limit the parallelism of the application and degrade its performance. It is therefore important to determine the memory requirement of each edge so that performance is maximized while the total memory used is minimized.

We apply genetic algorithms to find the minimum memory space that has no negative impact on application performance. There are two objectives to be optimized: maximizing application throughput, with higher priority, and minimizing total memory size, with lower priority. We use genetic algorithms to explore possible memory size allocations, evaluate them by calculating the theoretical application throughput under the memory constraint, and repeat these two steps until a satisfactory result is obtained.

The algorithm is given in Alg. 1. It runs iteratively on a set of randomly generated candidates (the set is called the population, and each candidate is called an individual) until a predefined threshold on the number of generations is reached or a satisfactory result is obtained. An individual is a possible memory size assignment, that is, a vector of memory sizes for all the edges in the application model. In each generation (Lines 3-9), mutation and crossover are applied to the individuals to generate new candidates, and selection picks the better individuals in the population by calculating the fitness values of the candidates. We use uniform order-based crossover and scramble sub-list mutation on the individuals to generate offspring [5]. For selection, we define the fitness value as a combination of throughput and total memory size, in which throughput has the higher optimization priority. We keep the population size unchanged by selecting a constant number of candidates in each generation. The algorithm has two constant parameters: pop_size, the number of individuals in the population, and gen_num, the number of generations for which the algorithm runs.

Algorithm 1 The genetic algorithm for memory space allocation
Require: application model G_t(V, E)
 1: define pop_size, gen_num, best_sol
 2: pop_parents = Initialization(pop_size)
 3: while gen_num is not reached do
 4:   pop_offspring = pop_parents
 5:   Crossover(pop_offspring)
 6:   Mutation(pop_offspring)
 7:   pop_parents = Selection(pop_offspring, pop_size)
 8:   best_sol = UpdateBestSolution(pop_parents)
 9: end while
10: return the best memory space allocation best_sol

D. Task mapping and scheduling

The traffic generation methodology uses a centralized scheduling strategy to manage the resources of the entire chip and to coordinate the processing blocks (PBs). In this way, the scheduling and control decisions are globally optimized. Formally, given a TCG application model $G_t(V, E)$ and an architecture model $G_p(P, N)$, the application mapping and scheduling problem is to find a mapping $M: V \to P$ of each task in $V$ to a PB in $P$, as well as a static-order schedule $S: V \to \mathbb{N}$ for the set of tasks assigned to the same PB, in which each task is assigned a unique number indicating its execution order, such that the performance is optimized.

We develop a load-balanced mapping and static-order scheduling approach. The basic idea is to distribute processing and network transmission workloads evenly to achieve high utilization of the hardware resources. The mapping strategy is to assign tasks to PBs one by one in the topological order defined by the dependency relations in the graph, and the schedule on each PB is determined by the sequence in which tasks were assigned to that PB during the mapping. The
objective is to minimize the application execution time per iteration, with network communication overhead taken into consideration. The tasks are evaluated and assigned one by one in the order defined by the dependency relationships in the TCG model. For each task $v$, the weight of assigning it to a processor $p$ is calculated by the following cost function:

$$w(v, p) = c_1\, t(v, p) + c_2\, q\, n(v, p), \tag{1}$$

in which $t(v, p)$ is the time required for task $v$ to finish execution on $p$, defined as the time for executing the tasks previously assigned to $p$ plus the execution time of $v$ on $p$, and $n(v, p)$ is the total amount of network transmission, defined as the number of packets generated by $v$ and sent to other PBs. $q$ is an architecture-specific scaling factor that balances the two terms, which can be measured in different units, and $c_1$, $c_2$ are constant factors that can be manually adjusted to trade off the weight of the two parts in the overall cost.

E. Generation of traffic patterns

The RTP contains detailed and accurate traces of task executions and communications. It is used for precise and comprehensive NoC studies. RTPs are generated by cycle-level simulations of real applications on different NoC-based MPSoCs, using the memory space allocation and the mapping and scheduling results. An RTP contains the more accurate computation and communication traces, in which all task execution and packet generation events are recorded. The RTPs are reusable on NoCs with different configurations but the same topology. Since the exact packet delays among PBs depend on the specific NoC configuration, the RTPs keep the packet dependencies instead of exact timings. When the traffic patterns are applied to a different NoC configuration, all the temporal relations can be reconstructed correctly.

The STP gives a concise representation of the traffic by mathematical modeling. It can be used to support long simulation runs, and is useful for system-level statistical evaluation and analysis. With the results obtained above, the STPs are synthesized from the statistical behaviors of task executions, packet generations and transmissions in different execution instances. Three key components are described by statistical formulations in the STPs: the task execution times, the communication data volumes, and the time intervals at which the communication data are assembled into different packets. The generated patterns are useful for evaluating NoCs with similar topologies, and other NoC metrics can be flexibly recorded in the STP by the proposed generation methodology.

III. PERFORMANCE EVALUATION AND ANALYSIS

Extensive performance evaluations are conducted to verify the effectiveness of the proposed traffic generation methodology and the generated realistic traffic patterns. We generate traffic patterns for 8 real applications included in the COSMIC benchmark suite: FFT-1024_complex, Fpppp, TURBO, MolecularDynamics, Robot, RS-32_28_8_dec, RS-32_28_8_enc and Sparse. Different NoC configurations are studied in this work; their specifics are listed in Table I. We first compare the real traffic with uniform traffic to demonstrate that the former reflects the actual communication behaviors more accurately. We then evaluate the performance of each real application on MPSoCs with different NoC architectures. Finally, we show the energy efficiency of the FFT executed on various NoC-based MPSoCs. Due to space limitations, only the experimental results for systems with network sizes of 16, 32 and 64 are shown.

TABLE I
THE NOC CONFIGURATIONS USED FOR TRAFFIC GENERATION

Topology   Size (number of processors)
Mesh       2x2, 2x4, 3x3, 4x4, 5x5, 4x8, 6x6, 7x7, 8x8, 9x9, 10x10, 11x11, 8x16, 12x12, 13x13, 14x14, 15x15, 16x16
Torus      2x2, 2x4, 3x3, 4x4, 5x5, 4x8, 6x6, 7x7, 8x8, 9x9, 10x10, 11x11, 8x16, 12x12, 13x13, 14x14, 15x15, 16x16
Fat tree   4, 8, 16, 32, 64, 128, 256

[Figure 2: plots not recoverable from extraction. Legend: STP, RTP, UNI.]
Fig. 2. Average performance results for traffic patterns on mesh/torus/fat tree-based 16-processor, 32-processor and 64-processor MPSoCs.

Fig. 2 shows the normalized average network performance results (network throughput and packet delay) of the traffic patterns on different NoC configurations. The results of STP on 16-processor systems are normalized to 1, and the uniform traffic patterns (UNI) are generated by setting their injection rates to those of the corresponding RTPs. From the figure, we observe that the performance of STP and RTP is very close, but the difference in packet delay between the uniform traffic and the corresponding RTPs (STPs) can be as large as 99.2%. This significant difference reinforces the conclusion that uniform traffic patterns can hardly reflect the characteristics of real application traffic. A possible explanation is that uniform traffic does not consider the execution dependencies of real applications, while the proposed methodology analyzes them explicitly and incorporates them in the generated traffic patterns.

Due to space limitations, we only show the RTP performance of each application on different MPSoC architectures. Fig. 3 shows the normalized performance result of each application on MPSoCs of different sizes with the three NoC topologies. The performance result of each application on the 16-processor fat tree-based MPSoC is normalized to 1. The figure shows that, as more processing resources are provided, the network throughput generally increases and the packet delay and execution time per iteration decrease. Some applications,

such as FFT, MolecularDynamics and Sparse, take better advantage of the parallel resources than the other applications do. For example, Fig. 3(b) shows that the average packet delay of FFT on the 64-processor mesh system is 51.6% less than that on the 16-processor mesh system. Its execution time also improves significantly as the number of processors grows exponentially (Fig. 3(c)): the execution time on the 64-processor mesh system is only 48% of that on the 16-processor mesh system. Regarding the different NoC topologies, the performance results are quite irregular. No single NoC topology outperforms the others in all situations.

[Figure 3: plots not recoverable from extraction. Panels: (a) Network throughput, (b) Packet delay, (c) Execution time per iteration. Axes: network size, topology (Fattree, Mesh, Torus), normalized metric. Legend: FFT, Fpppp, MolecularDynamics, TURBO, Robot, RS dec, RS enc, Sparse.]
Fig. 3. Normalized performance result for each application on mesh/torus/fat tree-based 16-processor, 32-processor and 64-processor MPSoCs.

Due to space limitations, only the energy consumption results of FFT are shown, in Fig. 4. The power model used in the experiments is derived from [6]. In this figure, the result of the 16-processor fat tree-based MPSoC is normalized to 1. The energy consumption is divided into three parts: the communication energy (Comm.), the processor working energy (Proc. busy) and the processor idle energy (Proc. idle), which results from static leakage. As the number of processors grows, the energy consumed by communication and by idle processors increases noticeably. For example, the communication energy and leakage energy on the 64-processor fat tree-based MPSoC account for 33% and 15% of the total energy consumed, respectively. For some applications, continuing to increase the amount of processing resources brings little performance improvement because of the application's intrinsic parallelism limit, but incurs significant leakage overhead.

[Figure 4: plot not recoverable from extraction. Energy components: Comm., Proc. busy, Proc. idle.]
Fig. 4. Normalized energy consumption of FFT run on mesh/torus/fat tree-based MPSoCs with different sizes.

IV. CONCLUSION

In this paper, a systematic NoC traffic modeling and generation methodology is presented, and a set of real application traffic patterns is generated and publicly released as the MCSL NoC traffic patterns. The proposed methodology captures both the communication behaviors in NoCs and the temporal dependencies among them. It uses the formal computation models provided in the COSMIC benchmark suite to capture both the communication and the computation requirements of real applications. Memory space allocation and application mapping and scheduling are optimized to better reflect the communication behaviors in practical MPSoC designs. We generate two types of traffic patterns, RTP and STP. The former offers detailed communication traces for comprehensive NoC studies, while the latter helps to accelerate NoC explorations at the cost of accuracy. The generated MCSL traffic patterns can be easily incorporated into existing NoC simulators to substantially improve NoC simulation accuracy. Experiments are conducted to evaluate the performance of the MCSL traffic patterns and to verify the effectiveness of the proposed methodology. The experimental results show that the MCSL traffic patterns reflect the characteristics of real applications more accurately than the corresponding uniform traffic patterns.

REFERENCES

[1] J. Xu, W. Wolf, J. Henkel, and S. Chakradhar, "A methodology for design, modeling, and analysis of networks-on-chip," in IEEE International Symposium on Circuits and Systems (ISCAS 2005), May 2005, vol. 2, pp. 1778-1781.
[2] P. Gratz and S. W. Keckler, "Realistic Workload Characterization and Analysis for Networks-on-Chip Design," in The 4th Workshop on Chip Multiprocessor Memory Systems and Interconnects (CMP-MSI), 2010.
[3] Z. Wang, W. Liu, J. Xu, B. Li, R. Iyer, R. Illikkal, X. Wu, W. H. Mow, and W. Ye, "A Case Study on the Communication and Computation Behaviors of Real Applications in NoC-based MPSoCs," in IEEE Computer Society Annual Symp. VLSI, 2014.
[4] MCSL NoC traffic patterns: http://www.ece.ust.hk/~eexu/
[5] L. D. Davis and M. Mitchell, Handbook of Genetic Algorithms, Van Nostrand Reinhold, 1991.
[6] Y. Wang et al., "Power gating aware task scheduling in MPSoC," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 19, no. 10, pp. 1801-1812, Oct. 2011.
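As an illustration of the genetic loop of Algorithm 1 (Section II-C), the following Python sketch may help. It is not the authors' implementation. The crossover, mutation and selection operators are simplified stand-ins for the uniform order-based crossover and scramble sub-list mutation cited from [5], and `make_toy_fitness`, `required_words`, `max_words` and `seed` are hypothetical names introduced only for this example. The only element taken directly from the paper is the lexicographic fitness: throughput first, total memory size second.

```python
import random


def make_toy_fitness(required_words):
    """Hypothetical throughput model (an assumption, not the paper's):
    an edge buffer smaller than its requirement throttles the application."""
    def fitness(ind):
        throughput = min(min(1.0, size / need)
                         for size, need in zip(ind, required_words))
        # Tuples compare lexicographically: throughput has priority,
        # then a smaller total memory size is preferred.
        return (throughput, -sum(ind))
    return fitness


def genetic_allocation(num_edges, fitness, pop_size=20, gen_num=50,
                       max_words=64, seed=0):
    """Return the best memory-size vector found (one entry per TCG edge)."""
    rng = random.Random(seed)
    # Line 2: random initial population of memory-size vectors.
    pop_parents = [[rng.randint(1, max_words) for _ in range(num_edges)]
                   for _ in range(pop_size)]
    best_sol = max(pop_parents, key=fitness)
    for _ in range(gen_num):                                 # Lines 3-9
        pop_offspring = [list(ind) for ind in pop_parents]   # Line 4
        # Line 5 (assumed operator): uniform crossover on adjacent pairs.
        for a, b in zip(pop_offspring[::2], pop_offspring[1::2]):
            for i in range(num_edges):
                if rng.random() < 0.5:
                    a[i], b[i] = b[i], a[i]
        # Line 6 (assumed operator): resample one entry with probability 0.3.
        for ind in pop_offspring:
            if rng.random() < 0.3:
                ind[rng.randrange(num_edges)] = rng.randint(1, max_words)
        # Line 7 (assumed policy): keep the fittest of parents and offspring.
        pop_parents = sorted(pop_parents + pop_offspring,
                             key=fitness, reverse=True)[:pop_size]
        # Line 8: remember the best allocation seen so far.
        best_sol = max(best_sol, pop_parents[0], key=fitness)
    return list(best_sol)


fitness = make_toy_fitness([8, 16, 4])
best = genetic_allocation(num_edges=3, fitness=fitness)
print(best, fitness(best))
```

Returning the fitness as a tuple is one simple way to encode the two-level priority described in Section II-C. A weighted sum would need a carefully chosen weight to guarantee that throughput always dominates.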
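The load-balanced mapping of Section II-D and the cost function in Eq. (1) can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' code. The function names are hypothetical. PBs are homogeneous, so a task's execution time does not depend on the PB. Because the successors of a task are not yet placed when it is mapped, n(v, p) is approximated here by the words received from predecessors that sit on a different PB, which is a simplification of the paper's definition (packets generated by v and sent to other PBs).

```python
from collections import defaultdict, deque


def topological_order(tasks, edges):
    """Kahn's algorithm over the TCG; edges are (src, dst, words) triples."""
    indegree = {v: 0 for v in tasks}
    successors = defaultdict(list)
    for src, dst, _ in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    ready = deque(v for v in tasks if indegree[v] == 0)
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for nxt in successors[v]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(tasks):
        raise ValueError("the TCG must be acyclic")
    return order


def map_tasks(exec_time, edges, num_pbs, c1=1.0, c2=1.0, q=1.0):
    """Greedy mapping: each task goes to the PB with the lowest w(v, p)."""
    preds = defaultdict(list)
    for src, dst, words in edges:
        preds[dst].append((src, words))
    busy = [0] * num_pbs            # time already committed on each PB
    mapping = {}                    # M: task -> PB index
    schedule = defaultdict(list)    # static order of tasks on each PB

    for v in topological_order(list(exec_time), edges):
        def cost(p):
            t = busy[p] + exec_time[v]                  # t(v, p)
            # Assumed approximation of n(v, p): inbound words crossing PBs.
            n = sum(w for src, w in preds[v] if mapping[src] != p)
            return c1 * t + c2 * q * n                  # Eq. (1)
        best = min(range(num_pbs), key=cost)
        mapping[v] = best
        busy[best] += exec_time[v]
        schedule[best].append(v)
    return mapping, dict(schedule)


exec_time = {"a": 10, "b": 10, "c": 5}
edges = [("a", "c", 100), ("b", "c", 1)]
mapping, schedule = map_tasks(exec_time, edges, num_pbs=2)
print(mapping, schedule)
```

In this three-task example, the sketch places a and c on PB 0 and b on PB 1: keeping c next to a avoids moving the 100-word transfer across the network, which outweighs the slightly longer queue on PB 0.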