Performance Evaluation of Scheduling Algorithms in Network On Chip
Xiaojie Hao¹, Huaxi Gu¹, Baojian Shu², Daibing Zeng², Yonghui Li¹
1. State Key Lab of ISN, Xidian University, Xi’an, China 710071
2. ZTE Corporation, Shenzhen, China 510070
Email: hxgu@xidian.edu.cn
Abstract: More and more attention is focused on the scheduling algorithm when designing routers for Network-on-Chip (NoC). Various scheduling algorithms, a well-known technique, have been proposed for Internet routers. Based on the requirements of NoC applications, we analyze and simulate several scheduling algorithms, such as PIM, RRM and iSLIP. The results show that the iSLIP algorithm performs better than the others.
Key Words: Network-on-Chip, Scheduling algorithm, Arbitration, Router
I INTRODUCTION
Historically, SoCs (Systems-on-Chip) have usually used a bus-based interconnect architecture. However, as technology scales toward deep sub-micron, this architecture cannot economically scale to a large number of Intellectual Property (IP) cores [1]. The physical interconnection on a single chip becomes a significant factor that limits delay and throughput performance. NoC, as a new approach to on-chip communication, aims to solve the problems that SoCs face [2] [3]. Many concepts and methodologies in NoC design derive from the field of macro-scale computer networks. However, NoC applications have many different design requirements.
The router, as a key component of an on-chip network, plays an important role in network performance. It determines how packets are forwarded between different IP cores. As shown in Fig.1, the virtual-channel router consists of four primary functional units: a routing computation unit, a virtual-channel allocation unit, a switch allocation unit and a crossbar. A generic on-chip router has four pipeline stages: Routing Computation (RC), Virtual Channel Allocation (VA), Switch Allocation (SA) and Switch Traversal (ST) [4]. When a head flit (the first flit of a packet) passes through the virtual-channel router, the RC unit first detects whether the flit is a head flit, and then sends the destination address of the head flit to the computing logic, which generates the next output port. Once the output port has been determined, the head flit requests an output virtual channel from the VA unit. The VA unit performs arbitration among all flits that simultaneously request the same output VC. Upon successful allocation of a VC, the head flit proceeds to the SA stage, where it arbitrates for the input and output ports. Finally, the flit is read from the buffer and proceeds to the ST stage, where it is passed to the next router or IP core. At each intermediate router, body and tail flits do not need to go through the RC and VA stages, but SA is still necessary and is performed on an individual flit basis [5].
Fig.1 Virtual channel router for 2D-mesh NoC
The scheduling algorithm is used to resolve contention in VA and SA during NoC communication. Many classical scheduling algorithms for contention avoidance have been proposed for traditional Internet routers [6]. However, NoC has different requirements. For example, iSLIP was proposed for VOQ (virtual output queue) based router architectures. In NoC, employing VOQ would incur large buffer and energy costs, which is not acceptable. Hence, in this paper we compare the traditional Internet router scheduling algorithms against the requirements of NoC.
The rest of the paper is organized as follows: Section II introduces three scheduling algorithms used in routers and analyzes the requirements for designing a scheduling algorithm in NoC. The evaluation methodology is introduced in Section III. In Section IV, we analyze the performance of these algorithms. Finally, conclusions are drawn in Section V.
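The per-flit stage sequence described in Section I can be sketched as follows. This is an illustrative model only. The names `FlitType`, `stages_for`, `packet_flits` and `stage_counts` are hypothetical, and the code assumes that a single-flit packet behaves like a lone head flit. The sketch shows why RC and VA work scales with the number of packets, whereas SA and ST work scales with the number of flits.

```python
from enum import Enum


class FlitType(Enum):
    HEAD = "head"
    BODY = "body"
    TAIL = "tail"


# The four pipeline stages named in the text, in order.
FULL_PIPELINE = ("RC", "VA", "SA", "ST")


def stages_for(flit: FlitType) -> tuple:
    """Stages a flit goes through at one intermediate router."""
    if flit is FlitType.HEAD:
        # Head flit: compute the route, win an output VC, then a switch slot.
        return FULL_PIPELINE
    # Body/tail flits reuse the head flit's route and VC, so they skip RC and
    # VA, but each flit must still win switch allocation on its own.
    return ("SA", "ST")


def packet_flits(num_flits: int) -> list:
    """Flit types of one packet (assumption: a 1-flit packet is a lone head flit)."""
    if num_flits < 1:
        raise ValueError("a packet has at least one flit")
    if num_flits == 1:
        return [FlitType.HEAD]
    return [FlitType.HEAD] + [FlitType.BODY] * (num_flits - 2) + [FlitType.TAIL]


def stage_counts(num_flits: int) -> dict:
    """How many times each stage is exercised for one packet at one router."""
    counts = {stage: 0 for stage in FULL_PIPELINE}
    for flit in packet_flits(num_flits):
        for stage in stages_for(flit):
            counts[stage] += 1
    return counts


# A 4-flit packet: RC and VA once (head only), SA and ST once per flit.
print(stage_counts(4))  # {'RC': 1, 'VA': 1, 'SA': 4, 'ST': 4}
```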
II BASICS OF THE SCHEDULING ALGORITHM
A. Arbiter architecture
The arbiter is a key element in implementing a scheduling algorithm in NoC [7]. There are different types of arbiter architectures, as listed in Table I.

TABLE I. VARIOUS ARBITRATION ARCHITECTURES
Category | Fairness | Output selection
FCFS arbiter | FIFO | The earliest-arriving request is served first.
Random arbiter | Weak | A priority pointer is generated randomly, and the request that gets the priority is served first.
Fixed priority arbiter | Weak | Requests are served in a fixed order.
Round-robin arbiter | Strong | Requests are arbitrated in a round-robin way: the request that was just served has the lowest priority in the next arbitration cycle.
Matrix arbiter | Strong | The least recently served request has the highest priority.
Combination arbiter | Tradeoff | Fixed or variable output order, according to the traffic mode.

As shown in Table I, different arbiters have different fairness. Fairness is a key property of an arbiter, and arbiters can be classified under three definitions of fairness [8]: weak, strong and FIFO. With the fixed priority arbiter and the random arbiter, some requests may wait a long time to be served. However, these arbiters do serve each request eventually, so they have weak fairness. With an FCFS (first come, first served) arbiter [9], the request that arrives first obtains priority, so its fairness is that of a FIFO queue. Round-robin and matrix arbiters give the requests different priorities over time to avoid starvation. They have strong fairness because, when several requests compete for the same port, they are served equally often. The combination arbiter is a tradeoff whose fairness lies between weak and strong fairness.
In terms of arbitration delay, an arbiter should complete the same arbitration work in less time. Each arbiter selects one winner per resource (an input or an output) among the requests. The arbitration architecture should be simple to reduce the time of flit or packet arbitration [10]. Among the arbiters in Table I, the round-robin, matrix and combination arbiters have better time performance than the others.

B. Scheduling algorithms
Based on the results of the arbiters, the scheduling algorithm generates the control signals for the crossbar that connects the input ports to the output ports. Scheduling in a virtual channel router is generally implemented in two stages. In the first stage, VA is carried out according to the flow control information. In the second stage, SA is carried out to resolve contention for the switch. In this paper, we present three scheduling algorithms: PIM [11], RRM and iSLIP [12]. We describe these schemes in detail and consider some of their performance characteristics.
Generally, a scheduling algorithm attempts to converge quickly on a conflict-free match in three steps: Request, Grant and Accept. In step 1, each unmatched input port sends a request to every output for which it has a request. In step 2, if an unmatched output receives any requests, it grants one of them; which input receives the grant depends on the arbitration scheme adopted by the output port. In step 3, if an input receives grants, it accepts one of them according to the arbiter it uses.
The algorithms differ mainly in the arbiters they employ. PIM uses a random arbiter as its basic arbitration scheme: in step 2, each output selects the input to grant randomly and uniformly among all its requests, and in step 3, an input that has received grants accepts one of them at random. iSLIP uses round-robin arbiters. Considering matching performance, we also implement iSLIP with two iterations. RRM differs from iSLIP in that the grant pointer gi of an output is updated as soon as the output issues a grant in step 2, regardless of whether the grant is accepted in step 3; iSLIP updates gi only when the grant is accepted.
In NoC, the number of VCs associated with one port is variable, which is more complex than VOQ. Hence, traditional scheduling algorithms are not directly suitable. The XY routing algorithm is popular in NoC. If a VOQ-based architecture is used with XY routing, buffer utilization is low, because the queues for prohibited turns are never used. In such a design, traditional scheduling algorithms will not work if two VCs in the same input port request the same output port.

III EVALUATION METHODOLOGY
To evaluate the scheduling algorithms, we use OPNET, a widely used network simulation tool [13]. The simulations are carried out on an 8×8 2D mesh network because of the popularity of this topology in many systems. In a 2D-mesh NoC, each node is connected to its neighbors by bidirectional channels. The scheduling algorithms used in the simulations are PIM, RRM and iSLIP. Packets have a fixed size and are generated independently following a Poisson process. Various traffic patterns are used, including uniform, transpose and hot spot traffic. In the uniform traffic pattern [14], packets are sent with the same probability to all the other nodes. In transpose traffic [15], the destination of packets generated by node (x, y) is (m-1-x, n-1-y), where m and n are the numbers of nodes in each dimension of an m×n 2D-mesh NoC. In the hot spot traffic pattern [16], the hot spot region of the NoC receives additional traffic. Four virtual channels are used for each input port of the router. The […] terms of ETE (End to End) delay and throughput. The ETE delay means the average time from packet generation […]
[Figure: Throughput (Gbit/cycle/IP) vs. Offered Load (packets/cycle/IP); curves include RRM and PIM]
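To make the weak and strong fairness entries of Table I (Section II) concrete, the following sketch contrasts a fixed priority arbiter with a matrix arbiter. It is a minimal model, not the arbiters used in the paper. The class names and `grant_counts` are hypothetical, and the initial priority order 0 > 1 > 2 is an assumption. The matrix arbiter keeps a bit w[i][j] meaning "i beats j". After each grant it clears the winner's row and sets its column, so the least recently served requester always has the highest priority.

```python
class FixedPriorityArbiter:
    """Always grants the lowest-indexed requester (weak fairness)."""

    def arbitrate(self, requests):
        return min(requests) if requests else None


class MatrixArbiter:
    """Least recently served requester wins (strong fairness)."""

    def __init__(self, n):
        self.n = n
        # w[i][j] is True when requester i currently has priority over j.
        # Assumed initial order: 0 > 1 > ... > n-1.
        self.w = [[i < j for j in range(n)] for i in range(n)]

    def arbitrate(self, requests):
        reqs = set(requests)
        for i in sorted(reqs):
            # i wins if no other active requester has priority over it.
            if not any(self.w[j][i] for j in reqs if j != i):
                # Demote the winner to lowest priority: clear row, set column.
                for j in range(self.n):
                    if j != i:
                        self.w[i][j] = False
                        self.w[j][i] = True
                return i
        return None


def grant_counts(arbiter, requests, cycles):
    """Count grants per requester when the same requests repeat every cycle."""
    counts = {}
    for _ in range(cycles):
        winner = arbiter.arbitrate(requests)
        counts[winner] = counts.get(winner, 0) + 1
    return counts


# Three requesters asserting requests every cycle for six cycles.
print(grant_counts(FixedPriorityArbiter(), [0, 1, 2], 6))  # {0: 6}
print(grant_counts(MatrixArbiter(3), [0, 1, 2], 6))        # {0: 2, 1: 2, 2: 2}
```

As long as requester 0 keeps requesting, the fixed priority arbiter makes requesters 1 and 2 wait, which matches the "may wait for a long time" behavior in the text. The matrix arbiter serves all three requesters equally often.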
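Section II argues that a VOQ-based router under XY routing wastes buffers because of prohibited turns. The following sketch counts how many per-output queues could ever be used. It rests on several assumptions that the paper does not state: a 5-port mesh router (Local, N, S, E, W), one VOQ per (input, output) pair, no U-turns, and no Local-to-Local traffic. The names `xy_legal_outputs`, `PORTS` and `OPPOSITE` are hypothetical. An input port is named after the neighbor the flit came from, and an output port after the neighbor it leaves toward.

```python
# Ports of a 2D-mesh router (assumed 5-port design).
PORTS = ("L", "N", "S", "E", "W")
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}


def xy_legal_outputs(in_port):
    """Outputs reachable from in_port under XY routing (assumed rules)."""
    if in_port == "L":
        return {"N", "S", "E", "W"}  # injection: any network direction
    if in_port in ("E", "W"):
        # Still travelling along X: continue, turn into Y, or eject locally.
        return {OPPOSITE[in_port], "N", "S", "L"}
    # Already travelling along Y: XY routing forbids turning back into X.
    return {OPPOSITE[in_port], "L"}


# One VOQ per (input, output) pair of a 5-port router.
total = len(PORTS) * len(PORTS)
usable = sum(len(xy_legal_outputs(p)) for p in PORTS)
print(usable, total)  # 16 25
```

Under these assumptions, 9 of the 25 queues could never hold a flit, which illustrates the low buffer utilization that the text attributes to prohibited turns.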
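The traffic patterns and Poisson packet generation described in Section III can be sketched as destination functions and an arrival generator. `transpose_dest` follows the paper's definition (x, y) → (m-1-x, n-1-y) exactly. The hot spot model is a hypothetical stand-in, because the paper does not say how the additional hot traffic is produced: here, with probability `hot_fraction`, a packet targets a randomly chosen hot spot node and otherwise falls back to uniform traffic. All function names are illustrative.

```python
import random


def uniform_dest(src, m, n, rng):
    """Uniform traffic: every other node is an equally likely destination."""
    while True:
        dst = (rng.randrange(m), rng.randrange(n))
        if dst != src:
            return dst


def transpose_dest(src, m, n):
    """Transpose traffic as defined in the text: (x, y) -> (m-1-x, n-1-y)."""
    x, y = src
    return (m - 1 - x, n - 1 - y)


def hotspot_dest(src, m, n, rng, hotspots, hot_fraction):
    """Hypothetical hot spot model: with probability hot_fraction the packet
    targets a random hot spot node, otherwise it falls back to uniform."""
    candidates = [h for h in hotspots if h != src]
    if candidates and rng.random() < hot_fraction:
        return rng.choice(candidates)
    return uniform_dest(src, m, n, rng)


def poisson_arrivals(rate, horizon, rng):
    """Generation times in [0, horizon) of a Poisson process whose rate is in
    packets/cycle/IP (exponential inter-arrival times)."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rate)
        if t >= horizon:
            return times
        times.append(t)


rng = random.Random(42)
print(transpose_dest((0, 0), 8, 8), transpose_dest((2, 5), 8, 8))  # (7, 7) (5, 2)
arrivals = poisson_arrivals(0.02, 10_000, rng)
print(len(arrivals))  # random; its mean is rate * horizon = 200
```

On the 8×8 mesh used in the paper, no node maps to itself under this transpose definition, because m-1-x = x would require x = 3.5.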
IV SIMULATION RESULTS
[…] RRM's performance, so it is not as good as iSLIP. In each […] throughput.
Fig.3 Performance of three scheduling algorithms under hotspot traffic [panel (b) Throughput: Throughput (Gbit/cycle/IP) vs. Offered Load (packets/cycle/IP)]
[…] algorithms in terms of the hotspot traffic. iSLIP also achieves the lowest latency among the three algorithms over the full range of traffic. The reason is that the desynchronization of iSLIP's grant pointers alleviates contention among the arbiters. Algorithm PIM can converge to a maximal match, given enough iterations.
Finally, we compared the three scheduling algorithms under the transpose traffic pattern; Fig.4 shows the results. It can be seen from the figures that the network End-to-End delay increases immediately when the PIM scheduling algorithm is used. The reason is that PIM adopts a random arbiter, which results in a low allocation success rate when contention appears: because all the arbiters select randomly and independently, the outputs are served at different rates. PIM with only one iteration performs worst among the three scheduling algorithms. If multiple iterations are employed, the time to converge may complicate the design of the NoC router. Hence, an algorithm that achieves a better result without many iterations is preferred.
[Figure: (a) ETE Delay: ETE Delay (cycles) vs. Offered Load (packets/cycle/IP); curves: RRM, islip, islip-iteration, PIM]
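The Request-Grant-Accept matching of Section II and the pointer behavior discussed in Section IV can be sketched as a single function with three policies. This is a simplified model under stated assumptions. It uses a VOQ-style request matrix with one request per (input, output) pair, so it ignores the NoC case of several VCs in one input port. The names `schedule`, `Pointers`, `rr_pick` and `saturated_matches` are hypothetical. The pointer rules follow the published descriptions of iSLIP and RRM. iSLIP moves g_j only when its grant is accepted, and only in the first iteration. RRM moves g_j on every grant. The 2×2 saturated example shows RRM's pointers staying synchronized, while iSLIP's pointers desynchronize after the first cell time.

```python
import random
from dataclasses import dataclass


@dataclass
class Pointers:
    grant: list   # g_j: round-robin grant pointer of output j
    accept: list  # a_i: round-robin accept pointer of input i


def rr_pick(candidates, pointer, n):
    """Round-robin arbiter: first candidate at or after `pointer` (mod n)."""
    for k in range(n):
        idx = (pointer + k) % n
        if idx in candidates:
            return idx
    return None


def schedule(requests, n, policy, ptr, rng, iterations=1):
    """One cell time of Request-Grant-Accept matching.

    requests[i] is the set of outputs input i wants; returns {input: output}.
    """
    if policy not in ("pim", "rrm", "islip"):
        raise ValueError("policy must be 'pim', 'rrm' or 'islip'")
    match, taken = {}, set()
    for it in range(iterations):
        # Step 1 (Request): unmatched inputs request every unmatched output.
        req_at = {j: set() for j in range(n)}
        for i in range(n):
            if i not in match:
                for j in requests[i]:
                    if j not in taken:
                        req_at[j].add(i)
        # Step 2 (Grant): each requested output grants exactly one input.
        grants_at = {i: set() for i in range(n)}
        for j, reqs in req_at.items():
            if not reqs:
                continue
            if policy == "pim":
                i = rng.choice(sorted(reqs))  # uniform random grant
            else:
                i = rr_pick(reqs, ptr.grant[j], n)
            grants_at[i].add(j)
            if policy == "rrm":
                # RRM: move g_j past the granted input even if not accepted.
                ptr.grant[j] = (i + 1) % n
        # Step 3 (Accept): each input accepts one of its grants.
        progress = False
        for i, grants in grants_at.items():
            if not grants:
                continue
            if policy == "pim":
                j = rng.choice(sorted(grants))  # uniform random accept
            else:
                j = rr_pick(grants, ptr.accept[i], n)
            match[i] = j
            taken.add(j)
            progress = True
            if policy == "rrm" or (policy == "islip" and it == 0):
                ptr.accept[i] = (j + 1) % n
            if policy == "islip" and it == 0:
                # iSLIP: g_j moves only when its grant is accepted.
                ptr.grant[j] = (i + 1) % n
        if not progress:
            break
    return match


def saturated_matches(policy, slots, iterations=1, seed=0):
    """Total matches over `slots` cell times of a 2x2 switch in which both
    inputs always request both outputs (hypothetical worst case)."""
    n = 2
    ptr = Pointers([0] * n, [0] * n)
    rng = random.Random(seed)
    return sum(len(schedule([{0, 1}, {0, 1}], n, policy, ptr, rng, iterations))
               for _ in range(slots))


print(saturated_matches("rrm", 4))                  # 4: pointers stay in lockstep
print(saturated_matches("islip", 4))                # 7: pointers desynchronize
print(saturated_matches("islip", 4, iterations=2))  # 8
```

The two-iteration iSLIP call corresponds to the "islip-iteration" curves in the figures. The second iteration fills the output that the first iteration left idle, without moving any pointers.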
[Figure: (a) ETE Delay (cycles) and Throughput (Gbit/cycle/IP) vs. Offered Load (packets/cycle/IP); curves: RRM, islip, islip-iteration, PIM]
REFERENCES
[1] C. Grecu, A. Ivanov, R. Saleh, et al. NoC Interconnect Yield Improvement Using Crosspoint Redundancy. 21st IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems. 2006.
[2] G. De Micheli, C. Seiculescu, S. Murali, et al. Network on Chip: From Research to Products. Design Automation Conference (DAC). 2010, pp. 300-305.
[3] S. Q. Wang, H. X. Gu, and Z. M. Zhu. Fat Tree of Mesh (FoM): A New Optical Network on Chip Architecture. Journal of Xidian University. 2011, 38(6): pp. 8-16.
[4] K. S. Shim, M. H. Cho, M. Kinsy, et al. Static Virtual Channel Allocation in Oblivious Routing. 3rd ACM/IEEE International Symposium on Networks-on-Chip. 2009.
[5] Y. Xu, B. Zhao, Y. T. Zhang, et al. Simple Virtual Channel Allocation for High Throughput and High Frequency On-Chip Routers. International Symposium on High Performance Computer Architecture (HPCA). 2010.
[6] S. Q. Zheng and M. Yang. Algorithm-Hardware Codesign of Fast Parallel Round-Robin Arbiters. IEEE Transactions on Parallel and …