
Low-Latency Virtual-Channel Routers for On-Chip Networks

Robert Mullins, Andrew West, Simon Moore
Presented by Sailesh Kumar

Outline

Motivation
  Why Network-on-Chip (NoC)
Comparison to packet networks
  Similarities, differences, design constraints
Topology and routing/switching techniques for NoC
  Mesh, fat-tree, honeycomb
  Greedy, deflection, wormhole, virtual channels
Then start with the paper: "Design of a Low-Latency Virtual-Channel Router"

Why NoC

The billion-transistor era has arrived
  Several such SoCs are in the pipeline; interconnection is critical
A generic interconnection architecture ensures
  Reduced design time
  IP reuse
  A predictable backend (versus ad-hoc wiring)
Bus-based interconnects were sufficient until now, but no longer
  A shared bus is slow (it arbitrates between several requesters)
  More components increase loading => speed drops further
  Ad-hoc routing of wires results in backend complications, lower performance and higher power consumption

Why NoC (cont)


Recently Dally proposed an idea
  Route packets, not wires, as in data networks
  Point-to-point communication
Point-to-point links are faster
Create a chip-wide network (like a regular IP WAN)
  A router at every node
  Links connecting all routers
  Messages encapsulated in packets, which are routed
Challenges
  Topologies, routing protocol
  Network and router design with a small footprint and low latency

Some more motivations


The need to put repeaters into long wires allows us to add the switching needed to implement a network at little additional cost
Makes efficient use of critical global wiring resources by sharing them across different senders and receivers
Simplifies the overall design: design a single router and copy-paste it in both dimensions

A typical NoC node

Layered design of reconfigurable micronetworks
  Exploits methods and tools used for general networks
Micronetworks are based on the ISO/OSI model
The NoC architecture consists of Physical, Data Link, and Network layers
  Network layer: multi-hop route setup, packet addressing, etc.
  Data link layer: contention issues, reliability issues, grouping of physical-layer bits (e.g. into flits)
  End-to-end reliable transport is implemented in the cores

A NoC topology
Cores communicate with each other using the NoC
The NoC consists of Routers (R) and Network Interfaces (NI)
An NI is linked to a router by non-pipelined wires
One or more cores are connected to each NI

Other NoC topologies

Fat tree
Mesh

Routing protocols
We will only consider the mesh topology
The objective is to find a path from a source to a destination
Greedy algorithms (deterministic)
  Choose the shortest path, e.g. X-Y routing (see the sketch below)
Adaptive routing
  If there is congestion, choose an alternative path
  Deflection routing
Is adaptive better than greedy? => NOT REALLY (when only local information is used)
  Adaptive routing can also result in livelock
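
A minimal sketch of X-Y (dimension-order) routing, the deterministic greedy scheme above, assuming a mesh addressed by (x, y) coordinates; the port names and the helper function are illustrative, not taken from the paper:

    # X-Y routing: correct the X coordinate first, then Y, then deliver locally.
    def xy_route(cur_x, cur_y, dst_x, dst_y):
        if dst_x != cur_x:
            return "EAST" if dst_x > cur_x else "WEST"
        if dst_y != cur_y:
            return "NORTH" if dst_y > cur_y else "SOUTH"
        return "LOCAL"  # arrived: eject to the attached core

Because the turn from the Y dimension back to X is never taken, X-Y routing is also deadlock-free on a mesh.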

Switching techniques
Circuit switching: a control message is sent from source to destination and a path is reserved; communication starts; the path is released when communication is complete
Store-and-forward (packet switching): each switch waits for the full packet to arrive before sending it to the next switch
Cut-through and wormhole routing: the switch examines the header, decides where to send the message, and then starts forwarding it immediately
  In wormhole routing, when the head of a message is blocked, the message stays strung out over the network, potentially blocking other messages (only the piece of the packet sent between switches needs to be buffered)
  Cut-through routing lets the tail continue when the head is blocked, storing the whole message in an intermediate switch (needs a buffer large enough to hold the largest packet)
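
For the wormhole case above, a packet travels as a sequence of flits; a minimal packetization sketch, with field names that are illustrative rather than taken from the paper:

    # A packet becomes a head flit (carrying the route), body flits, and a tail
    # flit that releases the channel behind it.
    from dataclasses import dataclass

    @dataclass
    class Flit:
        kind: str             # "HEAD", "BODY" or "TAIL"
        dst: tuple            # (x, y) destination, used only by the head flit
        payload: object = None

    def packetize(dst, words):
        flits = [Flit("HEAD", dst)]
        flits += [Flit("BODY", dst, w) for w in words]
        flits.append(Flit("TAIL", dst))
        return flits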

Wormhole Routing: a good fit for NoC

Wormhole routing is good for NoC
  Low latency
  Lower buffering requirements
Suffers from deadlock

[example from Li and McKinley, IEEE Computer v26n2, 1993]

Adding Virtual Channels


With virtual channels, deadlock can be avoided
Move the message and its reply on different channels => there will never be a loop on a single channel
The packet switches from the "lo" to the "hi" channel

Designing Virtual Channel Routers

Design constraints in NoC
  Minimize latency
  Minimize buffering
  Minimal footprint
On-chip, we can exploit a far greater number of pins and wires
  May use wide data and flow-control wires
Objective: design routers with minimal latency
  This will also result in smaller buffers
This paper presents the design of a low-latency router
  Cycle time of 12 FO4
  Single-cycle routing/switching


A Virtual Channel Router


Designing Virtual Channel Routers

Every VC of every input port has buffers to hold arriving flits
  Arriving flits are placed into the buffers of the corresponding VC
Routing logic assigns the set of outgoing VCs on which a flit may leave
The VC allocator arbitrates between competing input VCs and allocates output VCs
The switch allocator matches successful input ports (with allocated VCs) to output ports
Flits at input VCs that receive grants are passed to the output VCs

Routing Logic
Three possibilities
  Return a single VC
  Return a set of VCs on a single port
  Return any set of VCs
Look-ahead routing (see the sketch below)
  Routing is performed at the previous router
  Good for X-Y deterministic (non-adaptive) routing
  An SGI routing chip first implemented it
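
A minimal sketch of look-ahead routing for the X-Y case, assuming each node knows its downstream neighbour's coordinates; the helper and port names are illustrative, not from the paper:

    # The current router evaluates the X-Y rule for the *next* node, so the
    # downstream router receives its output-port choice with the head flit and
    # routing logic drops off that router's critical path.
    def lookahead_xy(next_x, next_y, dst_x, dst_y):
        if dst_x != next_x:
            return "EAST" if dst_x > next_x else "WEST"
        if dst_y != next_y:
            return "NORTH" if dst_y > next_y else "SOUTH"
        return "LOCAL"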

VC Allocation
The complexity of VC allocation depends on the routing range (see the sketch below)
Routing returns a single VC
  Needs a PxV-input arbiter for every outgoing VC
Routing returns multiple VCs at a single port
  An additional V:1 arbiter at every input VC reduces the potential outgoing VCs to one
Routing returns any set of VCs
  Needs two cascaded PxV-input arbiters
We consider the multiple-VCs-at-a-single-port case
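
A minimal behavioural sketch of one such arbiter: each free output VC grants one of the P*V input VCs requesting it. Round-robin priority is used here for brevity; the paper's arbiter implementation differs in detail, and the names are illustrative.

    # requests: list of P*V booleans (True if that input VC wants this output VC).
    # last_grant: index granted most recently, giving round-robin fairness.
    def allocate_output_vc(requests, last_grant=-1):
        n = len(requests)
        for offset in range(1, n + 1):
            i = (last_grant + offset) % n
            if requests[i]:
                return i       # winning input VC
        return None            # no requests this cycle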

VC Allocation Logic
At every outgoing VC, the following logic is needed

Switch Allocation
Individual flits at the input VCs arbitrate for access to crossbar ports
Arbitration can be performed in two stages (see the sketch below)
First stage
  A VC among the V possible VCs at every input port is selected
  V:1 arbiter at every input port
Second stage
  The winning VC at every input port is matched to an output port
  P:1 arbiter at every output port
This scheme doesn't guarantee a maximal/maximum/good matching
  But it is simple to implement
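
A minimal sketch of this two-stage (separable) allocation, using simple fixed-priority choices for brevity; as the slide notes, the result is a legal but not necessarily maximal matching. The data layout is illustrative, not from the paper.

    def switch_allocate(requests):
        # requests[p][v] is the desired output port of the flit at input port p,
        # VC v, or None if that VC has nothing ready to send.
        P = len(requests)
        # Stage 1: a V:1 arbiter per input port picks one candidate VC.
        stage1 = []
        for p in range(P):
            for v, out in enumerate(requests[p]):
                if out is not None:
                    stage1.append((p, v, out))
                    break
        # Stage 2: a P:1 arbiter per output port picks one input port.
        matching = {}                      # output port -> (input port, VC)
        for p, v, out in stage1:
            matching.setdefault(out, (p, v))
        return matching

For example, switch_allocate([[2, None], [2, 0]]) grants output 2 to input 0 only, even though input 1's second VC could have used output 0: the separable scheme trades matching quality for simplicity.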


Issues
VC allocation and switch allocation are serialized
So a flit either takes 2 clocks to get through, or the clock speed must be low
Solution: speculative switch allocation

Speculative Switch Allocation


Dally proposed speculative switch allocation
  Perform switch and VC allocation in parallel
  Assume that a VC participating in switch allocation will get an output VC
  If not, the cycle is wasted
An even better idea is to perform speculative and non-speculative switch allocation in parallel (see the sketch below)
  Non-speculative allocation has higher priority
  Note that non-speculative allocation is done for input VCs that have already been allocated an output VC
Under light load, speculation mostly works => one-cycle delay
Under heavy load, non-speculative allocation mostly works => still one-cycle delay
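
A minimal sketch of combining the two allocators, assuming each produces a per-output-port grant; the names are illustrative, not from the paper.

    def combine_grants(nonspec_grant, spec_grant):
        # Each argument maps output port -> winning input VC. A non-speculative
        # winner (its output VC is already allocated) always overrides a
        # speculative one; an unopposed speculative winner is kept.
        final = dict(spec_grant)
        final.update(nonspec_grant)
        return final

A surviving speculative grant must still be squashed if its VC allocation fails in the same cycle, which is the wasted cycle mentioned above.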

Further Enhancement

Is it possible to have zero-cycle VC/switch allocation?
YES, most of the time; that's what this paper is about!

Idea 1: Free Virtual Channel Queue


Keep a queue of free VCs at every outgoing port (see the sketch below)
  Also a bit mask with one set bit
Thus the first stage of VC allocation, where an output VC is selected, is removed

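
A minimal behavioural sketch of the free-VC queue, assuming one queue per output port; the class and method names are illustrative, not from the paper.

    from collections import deque

    class FreeVCQueue:
        # The head of the queue is the output VC handed to the next winner,
        # so choosing an output VC no longer needs an arbitration stage.
        def __init__(self, num_vcs):
            self.free = deque(range(num_vcs))

        def allocate(self):
            return self.free.popleft() if self.free else None

        def release(self, vc):
            # Called when the tail flit leaves and the VC becomes free again.
            self.free.append(vc)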

Idea 2: Pre-computing arbitration decisions


Suppose you somehow knew the arbitration results before the flits actually arrive and fight for the VCs and the switch
  Every arbitration decision: VC allocation, switch allocation, etc.
Then the router can be made to run in zero cycles
  Arriving flits route/switch in the same clock in which they arrive
Also, the clock speed may be quite good
  The data path and control path are no longer in series
That is what Idea 2 is.

Some preliminaries before going into detail

Tree arbiters
  Implement large arbiters using a tree of small arbiters
Matrix arbiters (see the sketch below)
  A fair and fast arbiter implementation

VC allocation using a Tree Arbiter


A Matrix Arbiter

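
A minimal behavioural sketch of a matrix arbiter: a priority bit prio[i][j] means requester i currently beats requester j, and the winner is demoted below everyone else, giving least-recently-served fairness. This is the textbook structure, not the paper's exact circuit.

    class MatrixArbiter:
        def __init__(self, n):
            self.n = n
            # Start from a total order: lower index wins at reset.
            self.prio = [[i < j for j in range(n)] for i in range(n)]

        def arbitrate(self, req):
            for i in range(self.n):
                if req[i] and all(not (req[j] and self.prio[j][i])
                                  for j in range(self.n) if j != i):
                    # Grant i, then make it the lowest priority.
                    for j in range(self.n):
                        if j != i:
                            self.prio[i][j] = False
                            self.prio[j][i] = True
                    return i
            return None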

Pre-computing arbitration decisions

An alternative arbiter design
  Grants are the product of latched enables and the requests
  Grant enables are generated one cycle prior and latched
Grants are then generated in the same clock in which a request arrives
  If at least one request remains
However, when no request remains, it is difficult to generate the grant enables ???

Generating grant enables


Safe environment
  Only one request may arrive in a cycle
  Thus it is safe to assert all grant enables
  Thus a grant can still be generated in the same cycle
Unsafe environment
  Multiple requests may arrive in the same cycle
  Can still assert all grant enables
  But the cycle must be aborted when multiple requests arrive together
All first-stage V:1 arbiters operate in a safe environment
However, the P:1 arbiters do not

Generating grant enables


Even in unsafe environments, assert all grant enables (see the sketch below)
  May need to abort when multiple requests arrive
  Note that after an abort, a correct arbitration is ensured in the next cycle
Why will it work?
  In a lightly loaded network, multiple requests for the same VC/port rarely arrive => few aborts
  In a heavily loaded network, flits remain buffered and non-speculative arbitration (higher priority) happens most of the time => few aborts again
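
A minimal behavioural sketch of the pre-computed arbiter with abort, assuming a fixed-priority choice when pre-computing (the real design uses the router's own arbiters); the names are illustrative, not from the paper.

    class PrecomputedArbiter:
        def __init__(self, n):
            self.n = n
            self.enable = [True] * n       # idle: optimistically enable everyone

        def cycle(self, req):
            # Grants are simply requests ANDed with last cycle's latched enables.
            grants = [req[i] and self.enable[i] for i in range(self.n)]
            abort = sum(grants) > 1        # collision: optimistic enables were wrong
            if abort:
                grants = [False] * self.n  # squash the crossbar traversal
            # Pre-compute next cycle's enables from the requests still waiting.
            waiting = [i for i in range(self.n) if req[i] and not grants[i]]
            if waiting:
                winner = waiting[0]        # fixed priority for brevity
                self.enable = [i == winner for i in range(self.n)]
            else:
                self.enable = [True] * self.n
            return grants, abort

After an abort, the waiting requests are arbitrated properly for the next cycle, so a correct grant is guaranteed then, matching the slide above.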

I will skip the design details now


They are confusing and complex
We will jump to the critical-path analysis

Analysis of critical path

The control logic generates VC/switch grants from the pre-computed grant enables
Crossbar traversal is aborted once invalid grants are detected
In case of an abort, the correct control signals are ensured in the next cycle

Final design
The control path's critical delay is 12 FO4
  Until now, the best design had a 20 FO4 delay
They sampled a NoC-based ASIC using this idea last week
  It runs at several-GHz speeds
Note that the fast cycle time is possible by
  Running VC allocation and switch allocation in parallel
  Using speculation; otherwise the delay would be higher (one more cycle)

Simulation results


If (doubts) Then Ask; Else Thank you; Goto Discussion; End if;

