
Using Packet Information for Efficient Communication in NoCs

Prasanna Venkatesh Rengasamy
PACE Lab, IIT Madras, Chennai, India 600036
Email: pras@cse.iitm.ac.in

Madhu Mutyam
PACE Lab, IIT Madras, Chennai, India 600036
Email: madhu@cse.iitm.ac.in

Abstract—Multithreaded workloads like SPLASH2 and PARSEC generate heavy network traffic. This traffic consists of different packets injected by various nodes at various points of time. A packet consists of three essential components, viz. source, destination and data. In this work, we study the opportunity to increase the efficiency of the underlying network. We come up with novel methods to share the various components of the packets present in a router at any time. We provide an analysis of the performance gains over contemporary optimization techniques. We conduct experiments on a 64 node setup as well as a 512 node setup and record the energy consumption, IPC gains, network latency and throughput. We show that our technique outperforms the contemporary Hamiltonian routing by 8.43% and VCTM routing by 7.7% on an average, in terms of IPC speedup.

I. INTRODUCTION

Chip multiprocessor (CMP) designs support parallel execution of more than one application. As these applications try to co-exist in a chip during their execution time, performance limitation occurs from places where parallelism gets limited. This is studied by Amdahl's law [1]. In a CMP, there is a lot of inherent sharing. For example, independent processes share resources like shared cache blocks, the network connecting distributed cache banks with the cores, and the memory system. These shared resources hinder the performance of parallel applications. Also, sharing of data at the cache level produces a lot of coherence traffic. We can see in Sections III and VI that when the number of cores in a CMP becomes large, the performance of a parallel application becomes influenced by the efficiency of shared resources. We provide key optimizations at the network-on-chip (NoC) level to gain better performance over existing techniques. We start by listing the different parts of an NoC packet that we will use in this paper for optimizing traffic flow.

• Source, Data: Multicast optimization via a dynamic common path.

• Destination: Packet concatenation for saving route-computation (RC) time and virtual channel (VC) space.

• Address¹: Arresting a read request and replying if the router VC already has the response data.

Our methods read one or more of the items above from the packets in a router's buffers to achieve better network performance.

¹This is a special case of data sharing.

II. BACKGROUND

The past works in the literature are examined in three categories of NoCs in this section. NoCs [2] manage communication between various components in a chip. Congestion control algorithms aim at providing better bandwidth in heavy-traffic scenarios. Most of such works treat a packet as a black box. Our work aims to provide better bandwidth by understanding what a packet transports. So, combining both congestion control and packet-peeping would result in better network performance. Some initial works on congestion control include Odd-Even routing [3] and West-first routing [4]. These techniques get the framework ready for exploring optimized congestion control methods in NoCs. Both these methods give the idea of choosing one out of many available productive paths for a packet.

Adaptive techniques learn and choose the best path for each packet from the available list of paths. DyAD routing [5] switches between static and dynamic route computation methods based on the occupation levels of input queues. Thus, this method marks the low and high injection phases for a network. But, it does not do much good when it has to decide between the available set of adaptive routes. RCA [6] uses both crossbar demand and VC occupation levels in its congestion metric to choose a link from a set of adaptive routes. It then communicates the congestion metric with other routers in every region. It has good performance gains with some multi-threaded benchmarks too.

DAR [7] is an advancement over RCA where each router estimates and corrects its routes dynamically. Each router maintains some book-keeping to track the recent paths taken by flits in its buffers. This router design is complex and it scales better with network size when compared to RCA-QUAD, the best performing RCA technique.

BARP [8] uses scout flits to warn sources of a congested link. It has no turn restrictions and so has better path redundancy. It avoids deadlocks by bifurcating the network into XY+ and XY- networks. Thus this network model is much simpler when it comes to route computation. The only overhead for this technique is the scout flit traversals. It suffers a setback in performance when network congestion blocks the scouts from reaching the sources in time to take decisions. HPRA [9] uses dedicated nodes in every region of a CMP for predicting the traffic flow. It uses Artificial Neural Networks (ANNs) in each region to compute "hotspots" and warn the nodes in the region. Thus the effectiveness of this technique depends on the versatility of the offline learning for the ANNs.
GCA [10] optimizes the communication of congestion values by packing them into the empty bit fields in the data packets. Unlike the previous techniques, this technique does not need extra communication. It stores the communicated congestion values in every router using minimal storage bits. This is the latest technique for congestion control and it outperforms RCA.

[Fig. 1: Motivation: Sharing in 32 Threaded Applications. (a) Mean and Max Sharers Per Multicast; (b) Existence of Max Sharers.]

A. Congestion Control in Three Dimensional NoCs

Path redundancy improves with improvement in router degree. Thus 3D NoCs introduced better path redundancy in networks, from the advent of Through Silicon Vias [11] and a TDMA bus in the Z-direction [12] to full fledged routers like the MIRA router [13], the DimDe router [14] and Roce-Bush [15]. These network implementations enabled a variety of routing and optimization techniques to achieve better performance. AFRA [16] achieves fault tolerant, deadlock free routing in 3D NoCs. Every router stores the nearest Z-link and a backup Z-link for fault tolerance. It follows dimension order routing when the destination is in the same XY plane as the current router. Elevator First [17] routes the traffic through a deadlock free algorithm on a partially connected Z network in a 3D NoC. This is similar to AFRA, but it achieves Z-traversals by masking the header with a new header that goes to the nearest node linking the destination XY plane. There are recent techniques to reduce congestion by handling multicasts efficiently too.

B. Multicast Routing

In any typical CMP, multiple private caches are attached to a (group of) processor core(s). They have the possibility of holding a copy of the same block of data at any given time. These copies get validated for correctness by communicating updates through the interconnection network. Such update messages are often seen as a critical factor for performance. As the number of sharers increases for a given data block, every update to the shared data may cause a multicast to the sharers in the network. Also, there are directory based protocols that generalize these multicasts as broadcasts for simplicity. The network takes these multicasts/broadcasts as multiple single packet injections of the same data. So, techniques like EVC-T [18], VCTM [19] and Hamiltonian [20] provide a static path construction technique. The aim of this path construction is to make a single injection into the network. This single packet moves along a common path viable for all the multicast destinations. When the packet reaches a point where it is not possible anymore to hop to a common router, the packet forks. This forking again is optimized to happen during link traversals. Thus, the above techniques prevent multiple copies from flooding the network. This whole process of a single packet propagating to all destinations creates a tree-like path. This tree gets fixed for every group of multicast destinations. Thus this deterministic nature provides less path diversity in case of congestion. All these trees are maintained in small CAM tables at every router. These techniques save route computations by having static lookup tables for destination sets. Adaptive routing for multicast tree management has been proposed in XHiNoC [21]. It also uses a lookup table like VCTM to store computed routes. These techniques optimize route computations using lookup tables. But to fill the tables, a control flit is sent apriori for each multicast packet destination for route discovery. Thus for bigger networks, as well as multicasts with large destination sets (refer Fig. 1a), it becomes expensive to maintain bigger tables and to send scout flits.

C. Cache Coherence at NoC

In the previous section, we studied techniques that treated multicasts differently from unicasts. To know whether a request is unicast or multicast, they peek into the packet for its designated bit. We could also peek into the packet data further to gain better knowledge of the system behavior. Thus there are techniques that use cache coherence information to optimize network performance. INSO [22] provides snoop ordering from the NoC level to extend snoop-based protocols to topologies other than the bus. This reduces broadcasts and so utilizes the available bandwidth efficiently for transmitting useful information. Helia [23] provides better power management by exploiting the NoC-cache combination to reduce tag-mismatches. There are further areas available for exploration at the data sharing level, since multicast optimization motivates us to exploit sharing. Thus in Sections III and IV, we will see novel usage of packet data to improve performance.

III. MOTIVATION

A. General Congestion Control

We study and come up with the following arguments to emphasize the reasons behind the proposed optimizations. Communication in SPLASH2 [24] and PARSEC [25] has been analysed in depth in [26]. This analysis forms the basis for establishing the motivation of our optimizations. The following observations are from the analysis.

1) Multithreaded applications communicate shared data.
2) The authors analyze only the producer-consumer type of sharing as shared communication in the paper.
3) Shared writes constitute up to 55% of the total writes and shared reads constitute 9% of the total reads.
4) Temporal locality among these producer-consumer patterns shows that communications occur in heavy bursts during synchronization phases, but are light and random otherwise.

Also, we study the sharers for any packet in its lifetime in the network as configured in Section V and get the pattern shown in Fig. 1. Fig. 1a shows the mean number of sharers in multicast messages sent by the high injection benchmarks.
The mean ranges from 2.2 for Barnes to 3.4 for Radix. From point number 3 in the above observations, we can see that up to 9% of the total reads are shared reads. Thus if we are using techniques like EVC-T, VCTM or Hamiltonian, static route assignment for specific multicast destination sets may under-utilize the network. In addition, Fig. 1a indicates that there can be up to 28 sharers in a 32 threaded application (87.5%). Thus, sending scout flits for route-discovery to assign destination sets with output ports, as VCTM and XHiNoC do, will also become undesirable. The only way for the network would be to pick the least congested port for every destination in a multicast at each router. This means that there may be up to 0.875 × NumThreads route computations at every router. This will make the route computation a bottleneck.

VCTM and Hamiltonian techniques route a multicast packet along a fixed path. We show that adaptive routes for multicast packets perform better than them. But, doing so may create a bottleneck at the RC phase of the pipeline. So, to reduce the delay of computing routes for all multicast destinations, we propose redundant RC units. Fig. 1a shows that the number of sharers can reach up to 87.5% [28 out of 32] of the number of threads. We study the percentage of occurrence of such multicasts in the execution phase and populate the results in Fig. 1b. It shows that such maximums occur only for 3.2% on an average across the benchmarks under consideration. To decide on the number of RC units to use, we use the mean sharers per multicast packet. This value is 2.3, and so it is enough to have two RC units in place of one. This removes most of the stalls caused by computing a route for each destination at every hop.

[Fig. 2: Modifications to Router Microarchitecture. (a) VC as Cache; (b) Packet Concatenation.]

This approach of dynamic RC for every destination in multicasts may still starve other packets in the router. Our aim is to reuse the RC output of a flit for other flits going to the same destination. This technique is a variation of Packet Chaining [27], where the switch allocator (SA) memoizes the output for faster allocation. Here, we propose a destination lookup circuitry that concatenates packets with the same destination. These concatenated packets traverse as a single "Super-Packet" till the destination. Thus we save some RC time on unicast packets while allowing greater flexibility on the multicast packets' paths. Such packet concatenations are usually done for the request packets. They are just one flit long, and a VC with more than one buffer is allocated for storing the request. The reasoning behind the coincidental arrival can be better explained by point number 4 in the above study.

We see that a producer-consumer pattern occurs during synchronization phases. At these phases the cores make a lot of requests to the same block of data. These requests arrive at the same L2-bank. This causes the L2-bank to fetch the block and multicast it back to the requesting nodes. These phases provide three different optimization opportunities for us.

• There are different single-flit request packets traversing towards the same destination. So, if there is a bottleneck due to an occupied RC phase as described above, packet concatenation can help.

• Assume that a request for a block is on its way to reach the L2 bank at time t, and the L2 bank has sent the same block of data to some other node at time t − ∆t. Now, if the request and response happen to meet in any router on their ways, we can arrest the request and send the reply back. We should note that the request still has to reach the home L2 bank for coherence. But L2 need not respond with a reply unless there is an update from its previous response.

• Synchronization phases with high injection into the network tend to wane out and occur in bursts of time. So, there is little to gain on top of a request-limited network (phases with few injections). In these phases, we send the requested word along the header of the five-flit response packet. This speeds up the response from the receiving core so that further requests get injected quicker. This improves our opportunity to get better network utilization.

IV. APPROACH

Though we propose three optimization techniques, we do not modify much at the base router microarchitecture level. First, we classify a packet as a unicast or multicast packet using a separate control bit in the header. A packet may be a request to read or write a cache block, a cache block itself or a coherence message. For simplicity, we consider them to be a single flit packet or a five flit packet at the router level. The components of a packet at the network layer are header, body and tail.

A. Addressing: Bit field and Subnet mapping

For a single flit packet, the single flit itself will serve as all the components. But, a five flit packet has 64 bytes of data and another 16 bytes of header. As mentioned in GCA, these 16 bytes in the header are used for communicating the source node-id, the destination node-id(s), and the memory address in transmission, along with 80 bits of optional control information. We use these 80 bits and the 4 bits for the destination-id to denote the multicast destination address. Thus this scheme supports up to 84 nodes in a network.

We also use the same bit field based destination encoding to denote a group of nodes in a network with size more than 84. In this paper, we consider one bit to denote one 2 × 2 × 2 cube for a 512 node network. We can optimize the addressing further to associate the bit fields with various groups of nodes, like the groups discovered in VCTM and XHiNoC. This is left as a future extension to this work.
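As an illustration, the following C++ fragment sketches the bit-field destination encoding described above. The 84-bit mask width follows the GCA-style header layout in the text; the 8 × 8 × 8 coordinate-to-cube mapping for the 512 node case is our own assumption.

```cpp
#include <bitset>

// Sketch of the bit-field destination encoding (Section IV-A). The 80
// optional control bits plus the 4 destination-id bits of the GCA-style
// header form an 84-bit multicast mask. The 8x8x8 coordinate layout for
// the 512-node case is an assumption for illustration.
constexpr int kMaskBits = 84;
using DestMask = std::bitset<kMaskBits>;

// Networks with up to 84 nodes: one bit per destination node.
void addDestination(DestMask& mask, unsigned nodeId) {
    mask.set(nodeId);  // bit i set => node i is a multicast destination
}

// 512-node (8x8x8) network: one bit per 2x2x2 cube of nodes (64 cubes).
unsigned cubeIndex(unsigned x, unsigned y, unsigned z) {
    return (x / 2) + 4 * (y / 2) + 16 * (z / 2);
}

void addDestination512(DestMask& mask, unsigned x, unsigned y, unsigned z) {
    mask.set(cubeIndex(x, y, z));  // one set bit targets all 8 cube nodes
}
```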
B. Critical Word First

In this optimization, for all unicast packets, we put the requested word (8 bytes) in the header flit itself. We send the word to the requesting core once the header flit arrives at the destination, to speed up the execution process. This is a simple rearrangement of the block data, and we assume it happens at the cache-controller before injection.
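A minimal sketch of this rearrangement, assuming a 64-byte block of eight 8-byte words; the controller-side types are illustrative, not the paper's implementation:

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Sketch of the critical-word-first rearrangement (Section IV-B): the
// cache controller rotates the 64-byte block so the requested 8-byte word
// leads and can travel in the header flit. Types are illustrative.
using Word  = uint64_t;             // 8-byte word
using Block = std::array<Word, 8>;  // 64-byte cache block

Block criticalWordFirst(const Block& block, unsigned requestedIdx) {
    Block out = block;
    // After the rotation, out[0] is the requested word; the receiving core
    // can act on it as soon as the header arrives, before body/tail retire.
    std::rotate(out.begin(),
                out.begin() + (requestedIdx % out.size()), out.end());
    return out;
}
```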
C. VC As Cache

For this technique we have a tag+source comparator at each input, as shown in Fig. 2a. It checks the tag and destination of an incoming read request against those of every multicast response staying in the VCs. If there is a match, it marks a special bit on the read request, along with the current timestamp, to ask the receiver not to send any reply back. This does not prevent us from forwarding the request to the destination, because the home L2 bank still needs to know the current sharers. It also sets the requestor's bit to one in the multicast address of the response, to make the request get a quicker response. This also enhances the network utilization, as discussed in Section VI.

This step occurs while such a request is waiting in its VC for the RC, SA and link traversal (LT) stages of the pipeline. Thus it does not consume any critical path delay. A read request can also miss this optimization if there are a lot of read requests in the VCs. We simulated checking only up to 5 responses per request. This is to meet the clock cycle demands, as discussed in Section VI-B.
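The check can be pictured with the following sketch; the structures and field names are illustrative assumptions, with only the 5-comparison limit and the no-reply/sharer-bit behavior taken from the text above.

```cpp
#include <cstdint>
#include <vector>

// Sketch of the VC-as-cache check (Section IV-C): an incoming read request
// is compared against multicast responses buffered in the input VCs. On a
// match the request is marked so that the home L2 bank still learns the
// sharer but suppresses a duplicate reply. Structures are illustrative.
struct ReadRequest {
    uint64_t tag;        // block address tag
    unsigned srcNode;    // requesting node id (assumed < 64 here)
    bool     noReply;    // "do not send a reply" bit set on a match
};

struct BufferedResponse {
    uint64_t tag;        // tag of the multicast response in the VC
    uint64_t destMask;   // multicast destination bit field
};

void vcAsCacheCheck(ReadRequest& req,
                    std::vector<BufferedResponse>& responses) {
    unsigned compared = 0;
    for (auto& resp : responses) {
        if (++compared > 5) break;     // at most 5 comparisons per cycle
        if (resp.tag == req.tag) {
            req.noReply = true;                    // receiver skips reply
            resp.destMask |= 1ULL << req.srcNode;  // requestor becomes dest
            break;  // request is still forwarded for coherence bookkeeping
        }
    }
}
```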
D. Dynamic Multicast Tree

A multicast packet contains the destinations encoded in the header as bit fields. To build a multicast tree, we compute a minimal odd-even route [3] for each destination and decide whether to fork a packet or not. In a 3D NoC, a minimal route computation at any router will give at most three productive output ports. So, finding the common output port for all destinations is a simple task of mapping each destination to an output port. But, it takes longer to compute routes for each destination. We propose two solutions to handle this bottleneck; a sketch of the per-destination fork decision follows the list.

1) Add redundant RC units. Though this appears to be an expensive solution, the odd-even algorithm has a lot of slack time and is simple in comparison to the implementation of the allocators. We are able to accommodate up to three RC units without incurring any visible cycle delay.

2) The next solution can be used in combination with redundant RC units too. This technique is "Packet Concatenation" and is explained below.
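The fork decision can be sketched as grouping destinations by productive output port. Note that routeMinimal() below is a plain dimension-order placeholder; the paper's odd-even RC [3] adds turn restrictions on top of this idea.

```cpp
#include <map>
#include <vector>

// Sketch of the fork decision (Section IV-D): each destination of a
// multicast is mapped to a productive output port; destinations sharing a
// port share one packet copy, and the packet forks where the port sets
// diverge. routeMinimal() is a simplification, not true odd-even routing.
struct Coord { int x, y, z; };
enum class Port { XPos, XNeg, YPos, YNeg, ZPos, ZNeg };

// Assumes dst differs from cur (a destination never equals this router).
Port routeMinimal(const Coord& cur, const Coord& dst) {
    if (dst.x != cur.x) return dst.x > cur.x ? Port::XPos : Port::XNeg;
    if (dst.y != cur.y) return dst.y > cur.y ? Port::YPos : Port::YNeg;
    return dst.z > cur.z ? Port::ZPos : Port::ZNeg;
}

// Group destinations by output port; more than one group means a fork.
std::map<Port, std::vector<Coord>>
forkSets(const Coord& cur, const std::vector<Coord>& dests) {
    std::map<Port, std::vector<Coord>> groups;
    for (const Coord& d : dests)
        groups[routeMinimal(cur, d)].push_back(d);
    return groups;  // each group gets one copy with a reduced dest mask
}
```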
E. Packet Concatenation

We start concatenating packets when we have a request packet to inject into a router and there are no free VCs in it. In that case we compare the destination of the request to inject with those of the requests in the VCs. If there is a match, and if the VC has a free buffer, we concatenate the packets. For example, consider the following scenario. The RC units of a cache bank are busy with a multicast-reply packet and there are no free VCs available at an input channel. To handle the cache miss, the cache bank has to access the appropriate memory controller with a read request. Let us assume that one of the VCs already holds a memory read request to the memory controller. If there are more such read requests, we can concatenate them to form a longer chain of flits in the same VC. As the size of a request is one flit, this case is possible.

This implementation continues as an extension to the VC as cache method we saw earlier. In the previous optimization, the controller at the input level compares the source address bits of request packets. We add a similar controller to match destinations [refer Fig. 2b]. Similar to the VC as cache comparator, we limit comparisons to 5 for maintaining the cycle time limit.

Thus, by utilizing the otherwise idle buffers, we also reduce the number of route computations required by unicast packets going to the same destination.
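A minimal sketch of the injection-side check follows; the types are illustrative assumptions, with the 5-buffer VC depth and the 5-comparison limit taken from the configuration and the text above.

```cpp
#include <cstdint>
#include <deque>
#include <vector>

// Sketch of injection-side packet concatenation (Section IV-E): when no
// free VC exists, a one-flit request joins a VC already holding a request
// to the same destination, provided that VC has a free buffer slot. The
// chain then shares one route computation as a "Super-Packet".
struct RequestFlit { unsigned dest; uint64_t addr; };

struct VirtualChannel {
    static constexpr unsigned kBuffers = 5;   // buffers per VC (TABLE I)
    std::deque<RequestFlit> flits;
    bool hasFreeSlot() const { return flits.size() < kBuffers; }
};

// Returns true if the request was concatenated into an existing VC.
bool tryConcatenate(const RequestFlit& req,
                    std::vector<VirtualChannel>& vcs) {
    unsigned compared = 0;
    for (auto& vc : vcs) {
        if (++compared > 5) break;  // same 5-comparison cycle-time limit
        if (!vc.flits.empty() && vc.flits.front().dest == req.dest &&
            vc.hasFreeSlot()) {
            vc.flits.push_back(req);  // rides behind the head flit's route
            return true;
        }
    }
    return false;  // caller falls back to normal VC allocation or stalls
}
```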
V. EXPERIMENTAL SETUP

The simulation setup consists of the following phases:

A. Benchmark Execution

We use the Multi2Sim 4.0.1 [28] simulator for benchmark execution. We use Booksim 2.0 [29] for our NoC level simulations. We modify Multi2Sim to let an instance of Booksim manage its network modelling. Booksim services the network requests and intimates Multi2Sim once the service is complete. We also modify the FastForward setup in Multi2Sim to warm up the cache during the FastForward phase. This is to start sharing of the same data by different cores. We then executed the multithreaded benchmarks from the SPLASH2 and PARSEC workloads using this setup. We configure the simulators with the parameters given in TABLE I.

1) Architecture for Executing Benchmark: We take a 64-node system-on-chip (SoC) consisting of 32 cores with private L1 caches and 32 L2 cache banks. Then, we arrange it on a 3D NoC as a chess-board-like structure [30]. That paper discusses in detail the thermal benefits of such an arrangement. We use this to demonstrate our technique. We fix this arrangement for all the techniques compared. We could also adopt any other topology or NoC configuration as per the need and implement our optimizations on top of them without any modifications.

As we target multicast traffic, we use L1 - L2 - Memory as our cache hierarchy. The advantage we get is that the density of traffic that flows through this network is higher when compared to more levels of cache. This hierarchy is still practical, as we have contemporary examples of it in the SPARC processor [31] and the Intel Xeon Phi coprocessor. Vantage [32] validates cache partitioning with a banked L2 cache. McPAT [33] calls this banked L2 cache shared among CPU cores "future high throughput computing".

TABLE I: Configuration used in our simulations.

Architecture simulation:
- Number of Cores: 32/256 (we use 256 cores to demonstrate scaling)
- Core Properties: 2.5GHz clock speed
- Cache Coherence Protocol: MOESI
- Cache Replacement Policy: LRU
- Number of L1 Caches: 32/256 (one per core, not shared)
- L1 Properties: 2-way set associative, 128 sets, 64B block size, 2 cycle latency, 2 ports
- Number of L2 Banks: 32/256 (shared by all L1 caches, split address space)
- L2 Properties: 16-way set associative, 1024 sets, 64B block size, 20 cycle latency, 4 ports
- Number of Memory Banks: 8 (shared by all L2 caches, split address space, at the 8 corners of the 3D NoC)

Network simulation:
- Number of VC Classes: 1 (handles all traffic)
- Flits per Cache Request: 1
- Flits per Cache Block Transfer: 5
- Flits per Coherence Data: 1
- Clock Frequency: 2.5 GHz
- Number of Buffers per VC: 5
- Number of RC Units: 2
- Flit Size: 128+6 bits (5 extra bits as specified by [10] and 1 for denoting unicast and multicast)
- Link Width: 128+6 bits
- Pipeline Width: 2

Simulation Abbreviations: D = Dynamic Multicast Tree, C = Critical Word First, V = VC as Cache, P = Packet Concatenation.

We used the cycle accurate full system simulator to run all the SPLASH2 and PARSEC workloads for 100K cycles after an initial warm-up phase for removing cold cache misses. This warm-up phase is decided empirically after observing the cache population pattern of the benchmarks using a perfect network. This perfect network delivers any request to its destination in the next cycle. We observe that most of the PARSEC workloads have a low injection rate. So, they pose no real challenge for the interconnection network.

For the 512 node simulation setup, we found most of the multithreaded benchmarks running a maximum of only 64 threads. So, we used combinations of different benchmarks (multiprograms) to run on all cores. For the 512 node system experiment, we have 256 cores and 256 L2 banks. We run 32 threads of each of the Barnes, Cholesky, Radiosity, Radix and Ocean benchmarks. We also run two 32 threaded Fmm instances, making the total 7 × 32 + 32 = 256. We use this configuration for all the algorithms.

B. Network Simulation

Booksim 2.0 [29] has built-in C++ models for a two stage pipelined router, virtual channels, switch allocators and other routing components. As mentioned in Section V-A, we send network injections from Multi2Sim 4.0.1 to a Booksim 2.0 instance. So, the simulators interact with a common cycle accurate clock. With this setup they can send and receive requests/responses and acknowledgments in real time. As a full system in operation, the traffic does not carry message deadlines. It is essentially a network talking to the architecture. When the response from the architecture gets delayed, the network does nothing about it. Booksim 2.0 processes the injections and responds back to Multi2Sim 4.0.1 using a common clock. This communication is possible by linking the clock increments to a TCP socket. This socket synchronizes both simulators, and network injections are sent through this channel. Thus cycle accuracy is maintained. The injections and responses to/from the network are marshalled and unmarshalled as fixed length strings. Thus, there is no data loss with this model. We also built a new network topology to support z-direction jumps, with a specific router model for supporting the jumps. These two components are integrated to form a base network model for our simulations.

C. Hardware Modelling

We model a comparator in Verilog, both for the source+tag comparison of read requests and for comparing against the responses present in the VC buffers for a match. This comparator is 38 bits long and is implemented in two stages. The first stage XORs the two numbers and the second stage checks for equality by OR-ing all the 38 result bits and testing for zero. We also model a similar 6 bit destination id comparator for packet concatenation. We develop Verilog modules for a crossbar, a basic routing algorithm, virtual channels, input-output ports and the comparators along each input channel's buffers. The buffers are made using a matrix of T flip-flops of the required dimensions. We then used the Synopsys synthesis tool with 65nm technology. We also encode the necessary parameters for the router design into Orion 2.0 [34] and integrate the model into our simulator stack. We did this by using a MySQL table as a buffer between the Booksim and Orion executables. Booksim dumps per cycle, per router VC injections into the table. A daemon checks and executes Orion for each of the dumped injection values and collects stats on the various energy parameters reported by Orion back into the table. This way, we are able to get accurate cycle level energy consumption details for every router in the network. The results from this modelling are discussed in Section VI.
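The two-stage comparator can be modelled bit-level in C++ as below; the 38-bit width follows the text, while packing the tag and source id into a single 64-bit word is an assumption of this sketch.

```cpp
#include <cstdint>

// Bit-level model of the two-stage comparator from Section V-C: stage one
// XORs the two 38-bit operands, stage two OR-reduces the result and tests
// for zero. Packing tag + source id into one 64-bit word is assumed here.
constexpr uint64_t kMask38 = (1ULL << 38) - 1;

bool comparator38(uint64_t a, uint64_t b) {
    uint64_t diff = (a ^ b) & kMask38;  // stage 1: bitwise XOR
    return diff == 0;                   // stage 2: OR-reduction equals zero
}
```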
We do not develop hardware level designs for all the techniques compared in this paper. We compare our design's area, power and energy consumption with the base router models in 2D only. We do not introduce any change to the 3D router design at all. It can be implemented using TSVs or using a full fledged DimDe-like router too.

VI. RESULTS

We compare our optimizations on the base router with two contemporary techniques, namely VCTM and Hamiltonian routing, for a 4×4×4 network as explained in TABLE I. The results obtained are discussed below.

A. Real Traffic from a Full System Simulator

The results from full system simulations can be seen in Fig. 3. The full-system performance is obtained from the committed instructions per cycle for each simulation run. The system speedup graph populated in Fig. 3b shows variations due to the proposed optimizations over the base odd-even routing. We also compare the performance speedup with the state of the art
[Fig. 3: Real Traffic Results. (a) Latency; (b) System Speedup; (c) Network Throughput. Curves compare Base, Base+D, Base+CD, Base+CVD, Base+CVDP, VCTM, VCTM+V, Hamiltonian and Hamiltonian+V. Please refer to Simulation Abbreviations in TABLE I.]

VCTM and Hamiltonian methods. Though the GeoMean shows step by step speedup improvement from the base, there are certain individual benchmarks where it does not report such performance gains. Barnes posts most of its gains from dynamic multicasting. The other optimizations applied on top of dynamic multicasting bring down the performance achieved by dynamic multicasting alone. This is because Barnes has a high-communication phase prevalent for most of its execution time. Thus, by doing critical word first, we tend to make the destination node act on the received data before the body and tail flits retire. Since it is a heavy injecting phase, this causes the destination node to inject a new packet quicker than in the case without critical word first. Dynamic multicasting gains up to 9% of IPC because the average hop length of the multicast packets is 2.1 for Barnes.

Also, from Fig. 1, we can see that the maximum number of sharers for Barnes is 10 and it occurs for 4.7% of the multicasts. The other 95.3% of the multicasts are around the mean, with fewer than four sharers. VC as cache offsets the damage done by critical word first, improving the performance gain by 1%. Though Barnes cannot recover from critical word first to add to the performance gain of dynamic multicast, it improves the network performance by reducing the number of flits transferred by the network. This is shown in the flits transferred bar for Barnes in Fig. 3c. It saves up to 1200 flits in the network and gains 1% on top of the performance by critical word first. The last optimization we have is Packet Concatenation on top of the rest. Though it improves the number of flits transferred, it does not gain much performance when compared to VC as cache. This is because, though it increases the number of requests submitted to L2 banks, it does not imply a higher multicast percentage. Though the requests are traveling toward the same destination, they are actually for different blocks. Thus the L2 bank becomes the bottleneck and stops this optimization from gaining anything at all.

If we examine Cholesky, we see a similar behavior. Though there is an overall gain over the base with all the optimizations, VCTM still outperforms our optimization. In Fmm, Critical Word First adds to the overall speedup. Since Fmm has a relatively low number of multicasts (Figs. 1a and 1b), the network could bear more injections as the critical words reach faster and get processed. This can be seen in the rise in the number of flits transferred in Fig. 3c. VC as cache adds further gains by making the responses reach quicker. But Packet Concatenation lowers the performance. This is because the short bursts of multicasts make the network congested for short periods of time. But, by the time the network starts having free VCs, various unicast packets have already been concatenated. Since we consider a once-concatenated super-packet to never split again, we tend to limit the packets' efficiency by the number of concatenations.

But, this is not always the case. For example, in both Radix and Swaptions, Packet Concatenation alone improves IPC by 60% and 20%, respectively. Radix gains because of its dense multicasts. As we can see from Fig. 1a, the number of sharers peaks at 28. So, it could potentially mean that we could gain performance from any of our optimizations. There is not much to gain with VC as cache here because the multicast data get no additional requests later. Swaptions loses performance when using dynamic broadcast. As Fig. 1a explains, Swaptions has few sharers. Also, the workload is sensitive to the order of requests. Since with dynamic multicast we tend to fork at each common point, we delay some injections. This stalls some threads and causes the loss in performance. Though the network is able to transfer more flits in the execution at a lower average latency, the workload's sensitivity to the data becomes crucial. Thus, it starts recovering from the damage using Critical Word First. When there is a stall in injecting into a router, VC as cache becomes a natural option for the workload to show huge gains: IPC speedup jumps by 12% and it also beats VCTM. Packet Concatenation further reduces injection stalls and gains up to 20% in terms of IPC speedup.

Hamiltonian and VCTM also gain with VC as cache on an average. We could also apply any of the aforementioned optimizations on top of these multicast techniques. But the analysis of their performance impact on these techniques is not done, for brevity.

B. Chip Area, Clock Delay and Control Lines

As mentioned in Section V-C, all the components of the router are modelled in Verilog. We do not find any area overhead from adding one extra RC unit. But by adding two extra RC units we get a slight increase in the area of 0.013%.
[Fig. 4: Energy Analysis. (a) Barnes Energy Across Time (% energy improvement w.r.t. Base over clock cycles); (b) Ocean Energy Across Time (% energy improvement w.r.t. Base over clock cycles); (c) Overall Per Flit Per Router Energy (nJ). Please refer to Simulation Abbreviations in TABLE I.]

The area overhead for fitting in the two comparators for the VC as Cache and Packet Concatenation techniques was found to be 1% of the total area of the router. We need one extra line in the link to indicate multicast and unicast packets. We do not impose additional delays on the router pipeline. The comparators take 0.063ns for one full comparison. Since our clock period is 0.4ns, we could fit 6.33 comparisons per cycle. But we limit our comparisons to five to provide some slack.
C. Energy Analysis

As mentioned in Section V, we use Orion 2.0 to interact with our full system simulator to gain fine grained dynamic energy at every cycle. The crossbars and the pipeline units consume less than 1% of the total energy. The clock consumes 0.3%, while the pipeline stages consume around 0.02% of the total router energy. The remaining 99.5% of the total energy is controlled by the buffers. Since we use the idle buffers in the baseline model to our advantage by adding more flits to them, our model utilizes the buffers more. Thus we show the average per-flit energy consumption at each router in Fig. 4c. As we can see from the flits transferred by the network in Fig. 3c, our techniques tend to utilize the network better by transferring a larger number of flits most of the time. Thus the dynamic energy consumption of our technique is slightly greater than the baseline router's on an average. Our Base+CVDP consumes 5.4nJ on an average per cycle while the base design consumes 4.75nJ. But for a fair analysis that points out the advantage of our optimizations correctly, we take a look at the energy spent on a flit by a router. A multicast packet is still treated as a single packet with five flits only.

Fig. 4c shows that our optimizations reduce the energy spent by a router on a flit in most cases. This reflects our philosophy behind using the idle buffers. But in cases like Cholesky, Ocean or Radix, our overall optimization shoots up the energy per flit at a router. It shows that Packet Concatenation is a bad idea when congestion seems to alleviate quicker than the average packet latency. To understand this data, we take two cases: Barnes and Ocean. In Barnes (refer to Fig. 4a), we gain up to 22.85% with respect to the energy spent on a flit by the base technique. Thus the optimizations we do benefit the system on the whole. Another thing to see is the erratic behavior of the energy line for Dynamic Multicast. Since Barnes is a tree access algorithm, the threads accessing a particular block of data tend to be close to each other in the beginning. In those places, our forking takes place almost near the end of the multicast tree. This is very efficient and shows up to 17.34% energy savings. But as the algorithm progresses, the threads accessing the same block tend to be located at opposite corners of the mesh, making the packet forking occur prematurely. Thus the multicast transfers start congesting the network and the other optimizations start kicking in. Unlike multicast, the other optimizations occur only when there is congestion. So, we see a smooth gain in energy with CD, CVD and CVDP.

On the other hand, Fig. 4b shows Ocean to behave in the exact opposite way. This makes Packet Concatenation lose to the base by about 1.8%. This again boils down to the traffic behavior. The energy gain in multicast traffic is almost similar to that of Barnes. But the congestion alleviation happens so rapidly that doing Packet Concatenation drops the energy usage to 1.79% higher than the base. Also, VC as cache consumes higher energy than plain multicasts in many places. This is mainly because of the futile matches made by the router. Though there is a good request-response meeting at the intermediate routers, the tags requested are different most of the time. So, the comparisons cause overheads that degrade energy per flit when compared to dynamic multicast. But we do post a better geometric mean than the base technique. We save 1.66% of the energy spent per flit by a router with respect to that of the base. Our energy savings peak at 17.7% when running Radiosity.

D. Scaling to 512 Nodes

As we discuss sharing, we also evaluated our model in a 512 node setup and the results are shown in Figs. 5 and 6. The speedups over the base in the 64-node setup for CVDP, Hamiltonian and VCTM are 18.4%, 10.01% and 10.6%, respectively. In 512 nodes, the speedups are 7%, 5.8% and 5%, respectively. This shows the scalability of our technique when compared to Hamiltonian and VCTM. Also, CVD gains 9% in this 512 node setup while it gained 11.6% in the 64 node setup. This steady behavior is from the fact that concatenated packets travel longer distances with increase in network size. Thus there is a degradation in performance from CVD to CVDP by 2%. VCTM+V gives 5.5% and that is a 10% improvement coming from VC as cache. Hamiltonian+V, on the other hand, gains only 4%. This loss of 31.08% is because the responses take a longer path to move along the Hamiltonian route.
VII. CONCLUSION

We proposed a scalable way of handling multicasts and demonstrated it to outperform the best techniques available today in two different network sizes using real multithreaded benchmarks. We analysed the other possible optimizations to support this implementation and we discussed the way to incorporate it with the existing techniques. Thus, we could derive much better performance of the whole system by utilizing the knowledge of the data transmitted in an NoC of any CMP. It is also cautioned that Packet Concatenation can be beneficial only when there is no alternative path to take. Though this would occur often in the case of heavy workloads like the ones in the experiments we conducted, it may not be the case all the time. In such cases, path diversity will be better than getting stuck behind a queue of packets.

[Fig. 5: Scaling to 512 Nodes: Latency (cycles) for Base, Base+D, Base+CD, Base+CVD, Base+CVDP, VCTM, VCTM+V, Hamiltonian and Hamiltonian+V.]

[Fig. 6: Scaling to 512 Nodes: System Speedup (IPC) for Base, Base+D, Base+CD, Base+CVD, Base+CVDP, VCTM, VCTM+V, Hamiltonian and Hamiltonian+V.]

ACKNOWLEDGMENT

The authors would like to thank the VIRGO team, IIT Madras for their kind support. This work is funded by DST, Govt. of India.

REFERENCES

[1] G. M. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," 1967, pp. 483–485.
[2] W. J. Dally et al., "Route packets, not wires: on-chip interconnection networks," in DAC, 2001.
[3] G.-M. Chiu, "The odd-even turn model for adaptive routing," IEEE Trans. Parallel Distrib. Syst., 2000.
[4] C. J. Glass et al., "The turn model for adaptive routing," in ISCA, 1992.
[5] J. Hu et al., "DyAD: Smart routing for networks-on-chip," in DAC, 2004.
[6] P. Gratz et al., "Regional congestion awareness for load balance in networks-on-chip," in HPCA, 2008.
[7] R. S. Ramanujam et al., "Destination-based adaptive routing on 2D mesh networks," in ANCS, 2010, p. 19.
[8] P. Lotfi-Kamran et al., "BARP: a dynamic routing protocol for balanced distribution of traffic in NoCs," in DATE, 2008.
[9] E. Kakoulli et al., "HPRA: A pro-active hotspot-preventive high-performance routing algorithm for networks-on-chips," in ICCD, 2012.
[10] M. Ramakrishna et al., "GCA: Global congestion awareness for load balance in networks-on-chip," in NOCS, 2013, pp. 1–8.
[11] A. Topol et al., "Three-dimensional integrated circuits," IBM Journal of Research and Development, pp. 491–506, 2006.
[12] T. D. Richardson et al., "A hybrid SoC interconnect with dynamic TDMA-based transaction-less buses and on-chip networks," in VLSI Design, 2006, pp. 657–664.
[13] D. Park et al., "MIRA: A multi-layered on-chip interconnect router architecture," in ISCA, 2008, pp. 251–261.
[14] J. Kim et al., "A novel dimensionally-decomposed router for on-chip communication in 3D architectures," in ISCA, 2007, pp. 138–149.
[15] M. Salas et al., "The Roce-Bush router: a case for routing-centric dimensional decomposition for low-latency 3D NoC routers," in CODES+ISSS, 2012, pp. 171–180.
[16] S. Akbari et al., "AFRA: A low cost high performance reliable routing for 3D mesh NoCs," in DATE, 2012.
[17] F. Dubois et al., "Elevator-first: a deadlock-free distributed routing algorithm for vertically partially connected 3D-NoCs," IEEE Transactions on Computers, 2011.
[18] C. Chen et al., "Express virtual channels with taps (EVC-T): A flow control technique for network-on-chip (NoC) in manycore systems," in Hot Interconnects, 2011, pp. 1–10.
[19] N. D. E. Jerger et al., "Virtual circuit tree multicasting: A case for on-chip hardware multicast support," in ISCA, 2008, pp. 229–240.
[20] S. Moosavi et al., "An efficient implementation of Hamiltonian path based multicast routing for 3D interconnection networks," in 21st Iranian Conference on Electrical Engineering (ICEE), 2013, pp. 1–6.
[21] F. A. Samman et al., "Adaptive and deadlock-free tree-based multicast routing for networks-on-chip," IEEE Trans. VLSI Syst., pp. 1067–1080, 2010.
[22] N. Agarwal et al., "In-network snoop ordering (INSO): Snoopy coherence on unordered interconnects," in HPCA, 2009, pp. 67–78.
[23] A. Shafiee et al., "Helia: Heterogeneous interconnect for low resolution cache access in snoop-based chip multiprocessors," in ICCD, 2010, pp. 84–91.
[24] S. C. Woo et al., "The SPLASH-2 programs: Characterization and methodological considerations," in ISCA, 1995.
[25] C. Bienia et al., "PARSEC 2.0: A new benchmark suite for chip-multiprocessors," in Proceedings of the 5th Annual Workshop on Modeling, Benchmarking and Simulation, 2009.
[26] N. Barrow-Williams et al., "A communication characterisation of SPLASH-2 and PARSEC," in IISWC, 2009, pp. 86–97.
[27] G. Michelogiannakis et al., "Packet chaining: efficient single-cycle allocation for on-chip networks," in MICRO, 2011, pp. 83–94.
[28] R. Ubal et al., "Multi2Sim: a simulation framework for CPU-GPU computing," in PACT, 2012, pp. 335–344.
[29] Booksim interconnection network simulator, https://nocs.stanford.edu/cgi-bin/trac.cgi/wiki/resources/booksim.
[30] F. Li et al., "Design and management of 3D chip multiprocessors using network-in-memory," in ISCA, 2006, pp. 130–141.
[31] J. L. Shin et al., "A 40nm 16-core 128-thread CMT SPARC SoC processor," in ISSCC, 2010, pp. 98–99.
[32] D. Sanchez et al., "Vantage: scalable and efficient fine-grain cache partitioning," in ISCA, 2011, pp. 57–68.
[33] S. Li et al., "McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures," in MICRO, 2009, pp. 469–480.
[34] A. B. Kahng et al., "Orion 2.0: A power-area simulator for interconnection networks," IEEE Trans. VLSI Syst., pp. 191–196, 2012.
