Using Packet Information for Efficient Communication in NoCs
Chip multiprocessor (CMP) designs support the parallel execution of more than one application. As these applications co-exist on a chip during their execution, performance limitations arise wherever parallelism gets limited; this is studied by Amdahl's law [1]. In a CMP there is a lot of inherent sharing: independent processes share resources such as cache blocks, the network connecting the distributed cache banks with the cores, and the memory system. These shared resources hinder the performance of parallel applications. Sharing data at the cache level also produces a lot of coherence traffic. We show in Sections III and VI that when the number of cores in a CMP becomes large, the performance of a parallel application becomes influenced by the efficiency of the shared resources. We provide key optimizations at the network-on-chip (NoC) level to gain better performance over existing techniques. We start by listing the parts of an NoC packet that we use in this paper for optimizing traffic flow.

• Source, Data: multicast optimization via a dynamic common path.
• Destination: packet concatenation for saving route-computation (RC) time and virtual channel (VC) space.
• Address¹: arresting a read request and replying if the router VC already has the response data.

Our methods read one or more of the items above from the packets in a router's buffers to achieve better network performance.

¹ This is a special case of data sharing.

Adaptive techniques learn and choose the best path for each packet from an available list of paths. DyAD routing [5] switches between static and dynamic route computation based on the occupancy levels of the input queues; it thus marks the low and high injection phases of the network, but it does not help much when it has to decide among an available set of adaptive routes. RCA [6] uses both crossbar demand and VC occupancy levels in its congestion metric to choose a link from a set of adaptive routes, and it communicates the congestion metric with the other routers in every region. It also shows good performance gains with some multi-threaded benchmarks.

DAR [7] is an advancement over RCA where each router estimates and corrects its routes dynamically. Each router maintains some book-keeping to track the recent paths taken by the flits in its buffers. This router design is complex, but it scales better with network size than RCA-QUAD, the best performing RCA technique.

BARP [8] uses scout flits to warn sources of a congested link. It has no turn restrictions and therefore better path redundancy; it avoids deadlocks by bifurcating the network into XY+ and XY− networks, which makes the network model much simpler when it comes to route computation. The only overhead of this technique is the scout flit traversals, and it suffers a setback in performance when network congestion blocks the scouts from reaching the sources in time for them to take decisions. HPRA [9] uses dedicated nodes in every region of a CMP for predicting the traffic flow. It uses Artificial Neural Networks (ANNs) in each region to compute "hotspots" and warn the nodes in the region. The effectiveness of this technique thus depends on the versatility of the offline learning for the ANNs.
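The selection step shared by the adaptive schemes above (DyAD, RCA, DAR) can be sketched as scoring each candidate output port with a congestion metric and taking the least congested one. The sketch below is illustrative only: the weighted mix of VC occupancy and crossbar demand is loosely modelled on RCA's metric as described here, and the weights and port names are assumptions, not values from any of the cited papers.

```python
def pick_output(candidates, vc_occupancy, xbar_demand, w_vc=0.5, w_xb=0.5):
    """Pick the least-congested output port from a set of allowed routes.

    candidates   : ports permitted by the adaptive routing function
    vc_occupancy : port -> fraction of VC buffer slots in use (0..1)
    xbar_demand  : port -> fraction of crossbar requests targeting it (0..1)
    The weights are illustrative; RCA aggregates similar local and
    regional congestion terms before choosing a link.
    """
    def congestion(port):
        return w_vc * vc_occupancy[port] + w_xb * xbar_demand[port]
    return min(candidates, key=congestion)
```

For example, with a heavily occupied east port and a lightly loaded north port, the function steers the packet north.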
GCA [10] optimizes the communication of congestion values by packing them into the empty bit fields of data packets. Unlike the previous techniques, this technique does

Fig. 1. (a) Number of Sharers (b) Mean Sharers Per Multicast.
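The piggybacking idea GCA relies on can be sketched as stashing a small congestion value into otherwise-unused header bits, so no extra packets are needed to spread congestion information. The field position and width below are invented for illustration and are not GCA's actual packet format.

```python
CONG_SHIFT = 48                 # assumed position of an empty header field
CONG_MASK = 0xFF << CONG_SHIFT  # assumed 8-bit congestion field

def pack_congestion(header, level):
    """Store an 8-bit congestion level in an unused header bit field."""
    assert 0 <= level < 256
    return (header & ~CONG_MASK) | (level << CONG_SHIFT)

def unpack_congestion(header):
    """Recover the congestion level a neighbouring router piggybacked."""
    return (header >> CONG_SHIFT) & 0xFF
```

Because the congestion bits occupy a field the packet was not using, the existing header contents are untouched when a value is packed in.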
cache population pattern of the benchmarks using a perfect network. This perfect network delivers any request to its destination in the next cycle. We observe that most of the PARSEC workloads have low injection rates, so they pose no real challenge for the interconnection network.

For the 512 node simulation setup, we found most of the multithreaded benchmarks running a maximum of only 64 threads, so we used combinations of different benchmarks (multiprograms) to run on all cores. For the 512 node system experiment, we have 256 cores and 256 L2 banks. We run 32 threads of each of Barnes, Cholesky, Radiosity, Radix and Ocean, and two 32-threaded Fmm instances, making the total 7 × 32 + 32 = 256. We use this configuration for all the algorithms.

B. Network Simulation

Booksim 2.0 [29] has built-in C++ models for a two-stage pipelined router, virtual channels, switch allocators and other routing components. As mentioned in Section V-A, we send network injections from Multi2Sim 4.0.1 to a Booksim 2.0 instance, so the simulators interact through a common cycle-accurate clock. With this setup they can send and receive requests, responses and acknowledgments in real time. As a full system in operation, the traffic does not handle message deadlines; it is essentially a network talking to the architecture, and when a response from the architecture gets delayed, the network does nothing about it. Booksim 2.0 processes the injections and responds to Multi2Sim 4.0.1 using the common clock. This communication is made possible by linking the clock increments to a TCP socket. The socket synchronizes both simulators, and network injections are sent through this channel, so cycle accuracy is maintained. The injections and responses to and from the network are marshalled and unmarshalled as fixed-length strings, so there is no data loss in this model. We also built a new network topology to support z-direction jumps, with a specific router model for supporting the jumps. These two components are integrated to form the base network model for our simulations.

C. Hardware Modelling

We model a comparator in Verilog for the source+tag comparison that matches read requests against the responses present in the VC buffers. This comparator is 38 bits long and is implemented in two stages: the first stage XORs the two numbers, and the second stage checks their equality by OR-ing all 38 result bits and testing for zero. We also model a similar 6-bit destination-id comparator for packet concatenation. We develop Verilog modules for a crossbar, a basic routing algorithm, virtual channels, input-output ports and the comparators along each input channel's buffers. The buffers are built as a matrix of T flip-flops of the required dimensions. We then synthesized the design with a Synopsys synthesis tool at 65nm technology. We also encode the necessary parameters of the router design into Orion 2.0 [34] and integrate that model into our simulator stack. We did this by using a MySQL table as a buffer between the booksim and orion executables. Booksim dumps per-cycle, per-router VC injections into the table; a daemon picks up each dumped injection value, executes orion on it, and collects the energy statistics orion reports back into the table. This way, we are able to get accurate cycle-level energy consumption details for every router in the network. The results from this modelling are discussed in Section VI.

We do not develop hardware-level designs for all the techniques compared in this paper. We compare the area, power and energy consumption of our 2D designs with the base router models in 2D only. We do not introduce any change to the 3D router design at all; it can be implemented using TSVs or using a full-fledged DimDe-like router.

VI. RESULTS

We compare our optimizations on the base router with two contemporary techniques, namely VCTM and Hamiltonian routing, for a 4×4×4 network as explained in Table I. The results obtained are discussed below.

Fig. 3. (a) Latency (b) System Speedup (c) Network Throughput.

A. Real Traffic from a Full System Simulator

The results from full system simulations can be seen in Fig. 3. The full-system performance is obtained from the committed instructions per cycle for each simulation run. The system speedup graph in Fig. 3b shows the variations due to the proposed optimizations over the base odd-even routing. We also compare the performance speedup with the state of the art VCTM and Hamiltonian methods. Though the GeoMean shows step-by-step speedup improvement over the base, there are certain individual benchmarks where it does not report such performance gains. Barnes posts most of its gains from dynamic multicasting; the other optimizations on top of dynamic multicasting bring down the performance it achieves. This is because Barnes has a high-communication phase prevalent for most of its execution. By doing critical word first, we make the destination node act on the received data before the body and tail flits retire. Since this is a heavy injecting phase, the destination node injects a new packet quicker than it would without critical word first. Dynamic multicasting gains up to 9% of IPC because the average hop length of the multicast packets is 2.1 for Barnes.

Also, from Fig. 1, we can see that the maximum number of sharers for Barnes is 10, and it occurs for 4.7% of the multicasts. The other 95.3% of the multicasts are around the mean, with fewer than four sharers. VC as cache undoes some of the damage done by critical word first, improving the performance gain by 1%. Though Barnes cannot recover from critical word first to add to the performance gain of dynamic multicast, it improves the network performance by reducing the number of flits transferred by the network. This is shown in the flits-transferred bar for Barnes in Fig. 3c: it saves up to 1200 flits in the network and gains 1% on top of the performance of critical word first. The last optimization, Packet Concatenation on top of the rest, improves the number of flits transferred but does not gain much performance when compared to VC as cache. This is because, though it increases the number of requests submitted to the L2 banks, it does not imply a higher multicast percentage. Though the requests travel toward the same destination, they are actually for different blocks. Thus the L2 bank becomes the bottleneck and stops this optimization from gaining anything at all.

If we examine Cholesky, we see a similar behavior: though there is an overall gain over the base with all the optimizations, VCTM still outperforms our optimization. In Fmm, Critical Word First adds to the overall speedup. Since Fmm has a relatively low number of multicasts (Figs. 1a and 1b), the network can bear more injections as the critical words reach faster and get processed. This can be seen in the rise in the number of flits transferred in Fig. 3c. VC as cache adds further by making the responses reach quicker. But Packet Concatenation lowers the performance. This is because short bursts of multicasts make the network congested for short periods of time; by the time the network starts having free VCs, various unicast packets have already been concatenated. Since a once-concatenated super-packet never splits again, the packets' delivery efficiency becomes limited by the number of concatenations.

But this is not always the case. For example, in both Radix and Swaptions, Packet Concatenation alone improves IPC by 60% and 20%, respectively. Radix gains because of its dense multicasts: as we can see from Fig. 1a, the number of sharers peaks at 28, so we could potentially gain performance from any of our optimizations. There is not much to gain with VC as cache here because the multicast data gets no additional requests later. Swaptions loses performance when using dynamic multicast. As Fig. 1a shows, Swaptions has few sharers. Also, the workload is sensitive to the order of requests: since dynamic multicast forks at each common point, we delay some injections, which stalls some threads and causes the loss in performance. Though the network is able to transfer more flits in the execution at a lower average latency, the workload's sensitivity to the data becomes crucial. It starts recovering from the damage with Critical Word First. When there is a stall in injecting into a router, VC as cache becomes a natural option for the workload to show huge gains: IPC speedup jumps by 12% and it also beats VCTM. Packet Concatenation further reduces injection stalls and gains up to 20% in terms of IPC speedup.

Hamiltonian and VCTM also gain with VC as cache on average. We could also apply any of the aforementioned optimizations on top of these multicast techniques, but we omit the analysis of their performance impact for brevity.

B. Chip Area, Clock Delay and Control Lines

As mentioned in Section V-C, all the components of the router are modelled in Verilog. We find no area overhead from adding one extra RC unit; adding two extra RC units gives a slight increase in area of 0.013%. The area overhead for fitting in the two comparators for the VC as Cache and Packet Concatenation techniques was found to be 1% of the total area of the router. We need one extra line in the link to distinguish multicast and unicast packets. We do not impose additional delays on the router pipeline. The comparators take 0.063ns for one full comparison. Since our clock delay is 4ns, we could fit 6.33 comparisons per cycle, but we limit our comparisons to five to provide some slack.

Fig. 4. (a) Barnes Energy Across Time (b) Ocean Energy Across Time (c) Overall Per Flit Energy Footprint.

C. Energy Analysis

As mentioned in Section V, we use Orion 2.0 interacting with our full system simulator to obtain fine-grained dynamic energy at every cycle. The crossbars and the pipeline units consume less than 1% of the total energy: the clock consumes 0.3%, while the pipeline stages consume around 0.02% of the total router energy. The remaining 99.5% of the total energy is controlled by the buffers. Since we use the idle buffers in the baseline model to our advantage by adding more flits to them, our model utilizes the buffers more. Thus we show the average per-flit energy consumption at each router in Fig. 4c. As we can see from the flits transferred by the network in Fig. 3c, our techniques tend to utilize the network better by transferring a larger number of flits most of the time. Thus the dynamic energy consumption of our technique is slightly greater than the baseline router on average: our Base+CVDP consumes 5.4nJ per cycle on average while the base design consumes 4.75nJ. But for a fair analysis that correctly points out the advantage of our optimizations, we look at the energy a router spends on a flit. A multicast packet is still treated as a single packet with five flits only.

Fig. 4c shows that our optimizations reduce the energy spent by a router on a flit in most cases. This reflects our philosophy behind using the idle buffers. But in cases like Cholesky, Ocean or Radix, our overall optimization pushes up the energy per flit: Packet Concatenation is a bad idea when congestion alleviates quicker than the average packet latency. To understand this data, we take two cases, Barnes and Ocean. In Barnes (refer to Fig. 4a), we gain up to 22.85% with respect to the energy spent on a flit by the base technique, so the optimizations we apply benefit the system as a whole. Another thing to note is the erratic behavior of the energy line for Dynamic Multicast. Since Barnes is a tree-access algorithm, the threads accessing a particular block of data tend to be close to each other in the beginning. In those places, our forking takes place almost at the end of the multicast tree. This is very efficient and shows up to 17.34% energy savings. But as the algorithm progresses, the threads accessing the same block tend to be located at opposite corners of the mesh, making the packet forking occur prematurely. The multicast transfers then start congesting the network and the other optimizations start kicking in. Unlike multicast, the other optimizations act only when there is congestion, so we see a smooth gain in energy with CD, CVD and CVDP.

On the other hand, Fig. 4b shows Ocean behaving in the exact opposite way. This makes Packet Concatenation lose to the base by about 1.8%. This again boils down to the traffic behavior: the energy gain in multicast traffic is almost similar to that of Barnes, but the congestion alleviates so rapidly that doing Packet Concatenation leaves the energy usage 1.79% higher than the base. Also, VC as cache consumes higher energy than plain multicasts in many places, mainly because of the futile matches made by the router. Though requests and responses meet often at the intermediate routers, the tags requested are different most of the time, so the comparisons cause overheads that degrade energy per flit when compared to dynamic multicast. But we do post a better geometric mean than the base technique: we save 1.66% of the energy spent per flit by a router with respect to the base. Our energy savings peak at 17.7% when running Radiosity.

D. Scaling to 512 Nodes

As we discuss sharing, we also evaluated our model in a 512 node setup; the results are shown in Figs. 5 and 6. The speedups against the base in the 64-node setup by CVDP, Hamiltonian and VCTM are 18.4%, 10.01% and 10.6%, respectively. In 512 nodes, the speedups are 7%, 5.8% and 5%, respectively. This shows the scalability of our technique when compared to Hamiltonian and VCTM. Also, CVD gains 9% in this 512 node setup while it gained 11.6% in the 64 node setup. This steady behavior comes from the fact that concatenated packets travel longer distances as the network size increases; thus there is a degradation in performance from CVD to CVDP by 2%. VCTM+V gives 5.5%, a 10% improvement coming from VC as cache. Hamiltonian+V, on the other hand, gains only 4%. This loss of 31.08% is because the responses take a longer path to move along the Hamiltonian route.
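The per-flit accounting behind the energy analysis above can be summarized numerically. The per-cycle energies (5.4nJ for Base+CVDP, 4.75nJ for the base) come from Section VI-C; the cycle and flit counts below are made-up round numbers purely to illustrate how a technique can burn more total energy yet score better per flit by moving proportionally more flits.

```python
def per_flit_energy(energy_per_cycle_nj, cycles, flits):
    """Average energy (nJ) a router spends per flit over a run:
    total dynamic energy divided by the number of flits moved."""
    return energy_per_cycle_nj * cycles / flits

# Per-cycle energies from Section VI-C; cycle and flit counts are
# hypothetical, chosen only to mirror the shape of the argument.
base = per_flit_energy(4.75, cycles=10_000, flits=12_000)   # ~3.96 nJ/flit
cvdp = per_flit_energy(5.40, cycles=10_000, flits=16_000)   # ~3.38 nJ/flit
assert cvdp < base  # more total energy, but cheaper per flit
```

This is why the comparison in Fig. 4c is made per flit rather than per cycle.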
Fig. 5. Latency (Cycles) for the 512-node configurations.

Fig. 6. Speedup for the 512-node configurations.

VII. CONCLUSION

We proposed a scalable way of handling multicasts and demonstrated that it outperforms the best techniques available today in two different network sizes using real multithreaded benchmarks. We analysed the other possible optimizations to support this implementation and discussed how to incorporate it with the existing techniques. Thus, we can derive much better performance from the whole system by utilizing the knowledge of the data transmitted in the NoC of any CMP.

REFERENCES

[6] P. Gratz et al., "Regional congestion awareness for load balance in networks-on-chip," in HPCA, 2008.
[7] R. S. Ramanujam et al., "Destination-based adaptive routing on 2D mesh networks," in ANCS, 2010, p. 19.
[8] P. Lotfi-Kamran et al., "BARP: a dynamic routing protocol for balanced distribution of traffic in NoCs," in DATE, 2008.
[9] E. Kakoulli et al., "HPRA: A pro-active hotspot-preventive high-performance routing algorithm for networks-on-chips," in ICCD, 2012.
[10] M. Ramakrishna et al., "GCA: Global congestion awareness for load balance in networks-on-chip," in NOCS, 2013, pp. 1–8.
[11] A. Topol et al., "Three-dimensional integrated circuits," IBM Journal of Research and Development, pp. 491–506, 2006.
[12] T. D. Richardson et al., "A hybrid SoC interconnect with dynamic TDMA-based transaction-less buses and on-chip networks," in VLSI Design, 2006, pp. 657–664.
[13] D. Park et al., "MIRA: A multi-layered on-chip interconnect router architecture," in ISCA, 2008, pp. 251–261.
[14] J. Kim et al., "A novel dimensionally-decomposed router for on-chip communication in 3D architectures," in ISCA, 2007, pp. 138–149.
[15] M. Salas et al., "The roce-bush router: a case for routing-centric dimensional decomposition for low-latency 3D NoC routers."
[16] S. Akbari et al., "AFRA: A low cost high performance reliable routing ..."
... "... chip hardware multicast support," in ISCA, 2008, pp. 229–240.
[20] S. Moosavi et al., "An efficient implementation of Hamiltonian path based multicast routing for 3D interconnection networks," in 21st Iranian Conference on Electrical Engineering (ICEE), 2013, pp. 1–6.
[21] F. A. Samman et al., "Adaptive and deadlock-free tree-based multicast routing for networks-on-chip," IEEE Trans. VLSI Syst., pp. 1067–1080, 2010.