Performance Evaluation of Scheduling Algorithms in Network On Chip

Xiaojie Hao 1, Huaxi Gu 1, Baojian Shu 2, Daibing Zeng 2, Yonghui Li 1
1. State Key Lab of ISN, Xidian University, Xi’an, China 710071
2. ZTE Corporation, Shenzhen, China 510070
Email: hxgu@xidian.edu.cn
Abstract: More and more attention is focused on the scheduling algorithm when designing routers for Network-on-Chip (NoC). Various scheduling algorithms have been proposed for Internet routers, where scheduling is a well-known technique. Based on the requirements of NoC applications, we analyze and simulate several scheduling algorithms, namely PIM, RRM and iSLIP. The results show that the iSLIP algorithm performs better than the others.

Key Words: Network-on-Chip, Scheduling algorithm, Arbitration, Router

I INTRODUCTION

Historically, SoC (System-on-Chip) designs have usually used a bus-based interconnect architecture. However, as technology scales toward deep sub-micron, this architecture cannot economically scale to a large number of Intellectual Property (IP) cores [1]. The physical interconnection on a single chip becomes a significant factor that limits delay and throughput performance. NoC, as a new method for on-chip communication, aims to solve the problems that SoC faces [2] [3]. Many concepts and methodologies in NoC design derive from the field of macro computer networks. However, NoC applications have many different design requirements.

The router, as a key component of an on-chip network, plays an important role in network performance. It determines how packets are forwarded between different IP cores. As shown in Figure 1, the virtual-channel router consists of four primary functional units: the routing computation unit, the virtual-channel allocation unit, the switch allocation unit and the crossbar. A generic on-chip router has four pipeline stages: Routing Computation (RC), Virtual Channel Allocation (VA), Switch Allocation (SA) and Switch Traversal (ST) [4]. For a head flit (the first flit of a packet) passing through the virtual-channel router, the RC unit first detects whether the flit is a head flit, then sends its destination address to the computing logic, which generates the next output port. Once the output port has been determined, the head flit requests an output virtual channel (VC) from the VA unit. The VA unit arbitrates among all flits that simultaneously request the same output VC. Upon successful allocation of a VC, the head flit proceeds to the SA stage, where it arbitrates for the input and output ports. Finally, the flit is read from the buffer and proceeds to the ST stage, where it is passed to the next router or IP core. At each intermediate router, body and tail flits do not need to go through the RC and VA stages, but SA is still necessary and is done on an individual flit basis [5].

Fig.1 Virtual channel router for 2D-mesh NoC

The scheduling algorithm is used to resolve contention in VA and SA during NoC communication. Many classical scheduling algorithms and contention avoidance schemes have been proposed for traditional Internet routers [6]. However, NoC imposes some different requirements. For example, iSLIP was proposed for router architectures based on VOQ (virtual output queue). If VOQ were employed in NoC, the buffer and energy cost would be too high to be practical. Hence, in this paper we compare the traditional Internet-router scheduling algorithms under the requirements of NoC.

The rest of the paper is organized as follows. Section two introduces three different scheduling algorithms used in routers and analyzes the requirements for designing scheduling algorithms in NoC. The evaluation methodology is introduced in section three. In section four, we analyze the performance of these algorithms. Finally, conclusions are drawn in section five.
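
The pipeline behaviour described in this section (head flits pass through RC, VA, SA and ST, while body and tail flits only need SA and ST) can be summarized in a minimal Python sketch. The names `FlitType` and `pipeline_stages` are hypothetical and used only for illustration; they are not taken from the paper or from any simulator.

```python
from enum import Enum


class FlitType(Enum):
    HEAD = "head"
    BODY = "body"
    TAIL = "tail"


def pipeline_stages(flit_type):
    """Return the router pipeline stages a flit of this type goes through.

    Illustrative helper (hypothetical name): stage names follow the text.
    """
    if flit_type is FlitType.HEAD:
        # The head flit computes the route and obtains a virtual channel.
        return ["RC", "VA", "SA", "ST"]
    # Body and tail flits skip RC and VA, but still arbitrate for the
    # switch on an individual flit basis.
    return ["SA", "ST"]


if __name__ == "__main__":
    for flit in FlitType:
        print(flit.value, pipeline_stages(flit))
```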
II BASIC OF THE SCHEDULING ALGORITHM

A. Arbiter architecture

The arbiter is a key element for implementing a scheduling algorithm in NoC [7]. There are different types of arbiter architectures, as listed in Table I.

TABLE I. VARIOUS ARBITRATION ARCHITECTURES

Category                Fairness   Output selection
FCFS arbiter            FIFO       The earliest request is served first.
Random arbiter          Weak       A priority pointer is generated randomly, and the request that gets the priority is served first.
Fixed priority arbiter  Weak       Requests are served in a fixed order.
Round-robin arbiter     Strong     Requests are arbitrated in a round-robin way: the request that was just served has the lowest priority in the next arbitration cycle.
Matrix arbiter          Strong     The least recently served request has the highest priority.
Combination arbiter     Tradeoff   Fixed or variable output order according to the traffic mode.

As shown in Table I, different arbiters have different fairness. Fairness is a key property of an arbiter, and there are three definitions of arbiter fairness [8]. With the fixed priority arbiter and the random arbiter, some requests may wait a long time to be served. However, every request is served eventually, so these arbiters have weak fairness. With the FCFS (first come, first served) arbiter [9], the request that arrives first obtains the service priority, so its fairness resembles a FIFO queue. Round-robin and matrix arbiters vary the priority of the requests to avoid starvation. They have strong fairness because several requests competing for the same port are served equally. The combination arbiter is a tradeoff whose fairness lies between weak fairness and strong fairness.

As for arbitration delay, an arbiter should take as little time as possible to accomplish a given arbitration task. The arbiter selects one winner per port (input or output) among the requests. The arbitration architecture should be simple in order to reduce the time spent on flit or packet arbitration [10]. As Table I shows, the round-robin arbiter, matrix arbiter and combination arbiter have better time performance than the others.

B. Scheduling algorithms

Based on the results of the arbiters, the scheduling algorithm gives control signals to the crossbar, which connects the paths between the input ports and the output ports. Generally, scheduling in a virtual-channel router needs two stages. In the first stage, VA is carried out according to the flow control information. In the second stage, SA is carried out to resolve contention. In this paper, we present three scheduling algorithms: PIM [11], RRM and iSLIP [12]. We describe these schemes in detail and consider some of their performance characteristics.

Generally, a scheduling algorithm attempts to converge quickly on a conflict-free match through three steps: Request, Grant and Accept. During step 1, each unmatched input port sends a request to every output for which it has a pending request. During step 2, if an unmatched output receives any requests, it grants one of them; which input request obtains the grant signal depends on the arbitration scheme the output port has adopted. During step 3, if an input receives grant signals, it accepts one of them according to the arbiter it uses.

The main difference between the algorithms is the arbiter they employ. PIM uses the random arbiter as its basic arbitration scheme. During step 2, an output grants one request selected uniformly at random among all the requests it received. During step 3, if an input has received grant signals, it accepts one selected at random among those granted to it. iSLIP uses the round-robin arbiter. Considering the matching performance, we also implement iSLIP with two iterations. Compared with iSLIP, RRM differs in that the grant pointer gi is updated as soon as the output issues a grant signal in step 2, regardless of whether the accept step succeeds.

In NoC, the number of VCs associated with one port is variable, which is more complex than VOQ. Hence, traditional scheduling algorithms are not directly suitable. XY routing is popularly used in NoC. If a VOQ-based architecture is used, buffer utilization is poor because of the prohibited turns. In such a design, traditional scheduling algorithms will not work if two VCs in the same input port have requests for the same output port.

III EVALUATION METHODOLOGY

To evaluate the various scheduling algorithms, we use OPNET, one of the most powerful network simulation software packages [13]. The simulations are carried out on an 8×8 2D-mesh network because of the popularity of this topology in many systems. In a 2D-mesh NoC, the nodes are connected with their neighbors by bi-directional channels. The scheduling algorithms used in the simulations are PIM, RRM and iSLIP. Packets have a fixed size and are generated independently according to a Poisson process. Various traffic patterns are used, including uniform traffic, transpose traffic and hot spot traffic. In the uniform traffic pattern [14], packets are sent with the same probability to all the other nodes. In transpose traffic [15], packets generated by node (x, y) are destined for node (m-1-x, n-1-y), where m and n are the numbers of nodes in each dimension of an m×n 2D-mesh NoC. In the hot spot traffic pattern [16], additional traffic is received by the hot spot region of the NoC. Four virtual channels are used for each input port in the router. The performance of the scheduling algorithm is measured in terms of ETE (End to End) delay and throughput. The ETE delay is the average time from packet generation to the arrival of the packet at the destination node. The throughput is the rate at which flits are accepted when the NoC works in a steady state.
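
As an illustration of the traffic model in this section, the following Python sketch generates destinations for the uniform and transpose patterns and arrival times for a Poisson source. It is only a sketch under stated assumptions: the function names, the seed and the rate value in the usage lines are hypothetical, and it is not the OPNET model used for the simulations.

```python
import random


def uniform_destination(src, m, n, rng):
    """Uniform traffic: every node other than src is equally likely."""
    while True:
        dst = (rng.randrange(m), rng.randrange(n))
        if dst != src:
            return dst


def transpose_destination(src, m, n):
    """Transpose traffic: node (x, y) sends to (m-1-x, n-1-y)."""
    x, y = src
    return (m - 1 - x, n - 1 - y)


def poisson_arrival_times(rate, horizon, rng):
    """Arrival times of a Poisson process with the given rate.

    Inter-arrival gaps are drawn from an exponential distribution;
    generation stops once the time horizon is reached.
    """
    times = []
    t = rng.expovariate(rate)
    while t < horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times


if __name__ == "__main__":
    rng = random.Random(1)  # assumed seed, for repeatability only
    print(transpose_destination((2, 5), 8, 8))
    print(uniform_destination((2, 5), 8, 8, rng))
    # Assumed example rate in packets/cycle/IP over 1000 cycles.
    print(len(poisson_arrival_times(0.03, 1000.0, rng)))
```

Rejection sampling is used for the uniform pattern so that the source node itself is excluded while all other nodes keep the same probability.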
IV SIMULATION RESULTS

In this section, we present simulation-based performance results for the scheduling algorithms in an NoC router. The results below are for 8×8 networks.

As Fig.2 shows, under the uniform traffic pattern, for traffic loads below 0.033, where little congestion is present, the three algorithms yield almost identical latency and throughput. When the traffic load increases to 0.038 and beyond, the performance gap starts to widen. PIM is the first to saturate and has the highest latency. RRM achieves network performance similar to that of iSLIP. The reason is that RRM adopts arbitration schemes similar to those of iSLIP, and uniform traffic may lead to a more even distribution of traffic. With a single iteration, synchronization of the grant pointers limits RRM's performance, so it is not as good as iSLIP. In each iteration, the iSLIP algorithm attempts to add connections missed by earlier iterations. Hence, iSLIP with iteration (iSLIP-iteration) can increase the size of the match. As Fig.2 shows, iSLIP-iteration turns out to be the best of the three algorithms, with the lowest latency and the highest throughput.

[Figure: (a) ETE Delay, ETE Delay (cycles) versus Offered Load (packets/cycle/IP); (b) Throughput, Throughput (Gbit/cycle/IP) versus Offered Load (packets/cycle/IP). Curves: RRM, iSLIP, iSLIP-iteration, PIM.]
Fig.2 Performance of three scheduling algorithms under uniform traffic

We also simulate the three scheduling algorithms under hotspot traffic. As shown in Fig.3, the PIM algorithm has disadvantages in balancing the network load. PIM is again the first to saturate and yields the highest ETE latency. iSLIP clearly outperforms the other scheduling algorithms under hotspot traffic and achieves the lowest latency over the full range of traffic loads. The reason is that the desynchronization of iSLIP alleviates contention among the arbiters. The PIM algorithm can converge to a maximal match.

[Figure: (a) ETE Delay, ETE Delay (cycles) versus Offered Load (packets/cycle/IP); (b) Throughput versus Offered Load (packets/cycle/IP). Curves: RRM, iSLIP, iSLIP-iteration, PIM.]
Fig.3 Performance of three scheduling algorithms under hotspot traffic

Finally, we compare the three scheduling algorithms under the transpose traffic pattern; Fig.4 shows the results. It can be seen from the figure that the network end-to-end delay increases immediately when the PIM scheduling algorithm is used. The reason is that PIM adopts the random arbiter, which results in a low allocation success rate when contention appears. The random and independent selection by all the arbiters also leads to different service rates at each output. PIM with only one iteration performs worst among the three scheduling algorithms. If multiple iterations are employed, the time to converge may affect the design of the NoC router. Hence, an algorithm that achieves better results is preferred.
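
The pointer synchronization that the analysis above attributes to RRM can be illustrated with a minimal Python sketch of one Request-Grant-Accept iteration using round-robin arbiters. This is a sketch under assumptions, not the simulated router: it models a plain N×N switch in which every input always has requests for every output, ignores virtual channels, and follows the commonly described pointer rules (RRM moves the grant pointer whenever a grant is issued, iSLIP only when the grant is accepted). The names `round_robin_pick`, `match_once` and `run` are hypothetical.

```python
def round_robin_pick(candidates, pointer, n):
    """Return the first index at or after `pointer` (cyclically) that is
    in `candidates`, or None if there is no candidate."""
    for offset in range(n):
        idx = (pointer + offset) % n
        if idx in candidates:
            return idx
    return None


def match_once(requests, grant_ptr, accept_ptr, policy):
    """One Request-Grant-Accept iteration; pointers are updated in place.

    requests[i] is the set of outputs requested by input i.
    policy is "rrm" or "islip".
    Returns a dict mapping each matched input to its output.
    """
    n = len(requests)
    grants = {}  # input -> set of outputs that granted it
    for out in range(n):
        requesters = {i for i in range(n) if out in requests[i]}
        winner = round_robin_pick(requesters, grant_ptr[out], n)
        if winner is None:
            continue
        grants.setdefault(winner, set()).add(out)
        if policy == "rrm":
            # RRM: pointer moves even if the grant is later rejected.
            grant_ptr[out] = (winner + 1) % n
    match = {}
    for inp, outs in grants.items():
        chosen = round_robin_pick(outs, accept_ptr[inp], n)
        match[inp] = chosen
        accept_ptr[inp] = (chosen + 1) % n
        if policy == "islip":
            # iSLIP: grant pointer moves only for an accepted grant.
            grant_ptr[chosen] = (inp + 1) % n
    return match


def run(policy, cycles, n=2):
    """Match sizes over several cycles with all inputs requesting all outputs."""
    requests = [set(range(n)) for _ in range(n)]
    grant_ptr = [0] * n
    accept_ptr = [0] * n
    return [len(match_once(requests, grant_ptr, accept_ptr, policy))
            for _ in range(cycles)]


if __name__ == "__main__":
    print("rrm  ", run("rrm", 3))
    print("islip", run("islip", 3))
```

For this fully loaded 2×2 case, the RRM grant pointers advance in lockstep, so both outputs keep granting the same input and only one connection is made per cycle. With the iSLIP rule the pointers separate after the first cycle and both outputs are matched from the second cycle onward.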
[Figure: (a) ETE Delay, ETE Delay (cycles) versus Offered Load (packets/cycle/IP); (b) Throughput versus Offered Load (packets/cycle/IP). Curves: RRM, iSLIP, iSLIP-iteration, PIM.]
Fig.4 Performance of three scheduling algorithms under transpose traffic

V CONCLUSIONS

Various scheduling algorithms, such as PIM, RRM and iSLIP, have been proposed in the literature. In this paper, we compare their advantages and disadvantages when used in routers for NoC. The comparison results show that the iSLIP scheduling algorithm achieves slightly better performance than the other two. Nevertheless, when used in NoC, the traditional scheduling algorithms should be adapted to the requirements of NoC. In our future work, we will design an efficient scheduling algorithm that considers the buffer state in the downstream router.

ACKNOWLEDGMENT

The authors would like to thank the reviewers for the suggestions and comments that helped improve the paper. This work was supported by the National Science Foundation of China under Grant No.60803038, No.61070046 and No.60725415, the special fund from State Key Lab (No.ISN110401), the Fundamental Research Funds for the Central Universities under Grant No.K50510010010, the 111 Project under Grant No.B08038 and the ZTE University cooperation project.

REFERENCES
[1] C. Grecu, A. Ivanov, and R. Saleh, et al. NoC Interconnect Yield Improvement Using Crosspoint Redundancy. 21st IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems. 2006.
[2] G. De Micheli, C. Seiculescu, and S. Murali, et al. Network on Chip: From research to products. Design Automation Conference (DAC). 2010. p. 300-305.
[3] S. Q. Wang, H. X. Gu, and Z. M. Zhu. Fat tree of Mesh (FoM): A New Optical Network on Chip Architecture. Journal of Xidian University, 2011. 38(6): p. 8-16.
[4] K. S. Shim, M. H. Cho, and M. Kinsy, et al. Static virtual channel allocation in oblivious routing. 3rd ACM/IEEE International Symposium on Networks-on-Chip. 2009.
[5] Y. Xu, B. Zhao, and Y. T. Zhang, et al. Simple virtual channel allocation for high throughput and high frequency on-chip routers. International Symposium on High Performance Computer Architecture (HPCA). 2010.
[6] S. Q. Zheng, and M. Yang. Algorithm-Hardware Codesign of Fast Parallel Round-Robin Arbiters. IEEE Transactions on Parallel and Distributed Systems, 2007. 18(1): p. 84-95.
[7] J. M. Jou, and Y. L. Lee. An Optimal Round-Robin Arbiter Design for NoC. J. Inf. Sci. Eng., 2010. 26(6): p. 2047-2058.
[8] W. J. Dally and B. Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann, 2004.
[9] Y. Liu, X. G. Guan, and Y. Yang, et al. An asynchronous low latency ordered arbiter for network on chips. International Conference on Natural Computation (ICNC). 2010.
[10] Y. L. Lee, J. M. Jou, and Y. Y. Chen. A high-speed and decentralized arbiter design for NoC. IEEE/ACS International Conference on Computer Systems and Applications. 2009.
[11] A. Kumar, P. Kundu, and A. P. Singh, et al. A 4.6Tbits/s 3.6GHz single-cycle NoC router with a novel switch allocator in 65nm CMOS. 25th International Conference on Computer Design (ICCD). 2007.
[12] X. P. Gao, Z. Zhang, and X. Long. Round Robin Arbiters for Virtual Channel Router. IMACS Multiconference on Computational Engineering in Systems Applications. 2006.
[13] N. Wu, F. Ge, and Q. Wang. Simulation and performance analysis of network on chip architectures using OPNET. International Conference on ASIC, ASICON '07. 2007.
[14] K. K. Paliwal, M. S. Gaur, and V. Janyani, et al. Performance Analysis of Guaranteed Throughput and Best Effort Traffic in Network-on-Chip under Different Traffic Scenario. International Conference on Future Networks. 2009.
[15] M. Radetzki, and A. Kohler. An intelligent deflection router for networks-on-chip. Seventh Workshop on Intelligent solutions in Embedded Systems. 2009.
[16] L. W. Wang. A Virtual Channel Calculation Algorithm for Application Specific On-chip Networks. Third International Conference on Intelligent Networks and Intelligent Systems (ICINIS). 2010.
