Verilog Modeling and Simulation of A Communication Coprocessor For Multicomputers
A Shyamprakash
Cadence Design Systems (India) Pvt. Ltd.,
SDF # A-1/B-8, Noida Export Processing Zone,
PO NEPZ, NOIDA, UP 201305, INDIA
Email: ashyam@cadence.com

C P Ravikumar
Department of Electrical Engineering,
Indian Institute of Technology,
Hauz Khas, New Delhi 110016, INDIA
Email: rkumar@ee.iitd.ernet.in
1.0 Introduction

Massively parallel processing systems which use over 4000 processors are being conceived for achieving the teraflops performance needed in addressing grand challenge problems of computing. Such machines are built as distributed memory multiprocessors, and data sharing among processors must take place through message passing. Efficient inter-processor communication is a necessity in massively parallel computers. These processors are

[Figure 1. Wormhole routing: header, data, and tail flits shown against time]
0-8186-7082-7/95 $04.00 © 1995 IEEE
Authorized licensed use limited to: National University of Singapore. Downloaded on October 31,2022 at 11:59:36 UTC from IEEE Xplore. Restrictions apply.
physical resources. Virtual channels are implemented by allocating separate buffers for each of them. Virtual channels also allow one to introduce deadlock free routing [3].

1.2 Fault-tolerant routing

In massively parallel computers, the occurrence of node and/or link faults is of high probability. Thus a routing algorithm which is oblivious to network conditions such as faults is of little use. Depending on the network topology, there will exist more than one routing path between two nodes of a multicomputer. A fault-tolerant routing algorithm is capable of successfully forwarding the message to the destination node in the presence of one or more faults, as long as a routing path exists between source and destination.

1.3 Motivation for hardware routers

1.5 Hypercube interconnection

The hypercube interconnection network [6] has been employed in a number of commercially successful parallel computers such as the Intel iPSC/2, Intel iPSC/860, NCUBE-2, the Connection Machine and so on. The advantages of hypercube interconnection are its modularity, small degree, small communication diameter and fault-tolerance. An n-dimensional hypercube consists of $2^n$ nodes, labelled using n-bit strings. Two nodes i and j in an n-D hypercube are connected if and only if their bit addresses differ in exactly one bit position. Thus the degree of each node in an n-D hypercube is n, and the total number of links is $n \cdot 2^{n-1}$. The node symmetry of the network gives it node fault-tolerance. More recently, hypercubes and meshes have been generalized into k-ary n-cubes by Dally and Seitz [3].
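The hypercube adjacency rule stated above (two nodes are linked exactly when their addresses differ in one bit) can be checked with a short Python sketch. The function names `neighbors` and `link_count` are illustrative choices, not names from the paper.

```python
def neighbors(node: int, n: int) -> list[int]:
    """Nodes adjacent to `node` in an n-D hypercube: flip one address bit at a time."""
    return [node ^ (1 << bit) for bit in range(n)]


def link_count(n: int) -> int:
    """Count undirected links by enumerating every adjacent pair once (u < v)."""
    return sum(1 for u in range(2 ** n) for v in neighbors(u, n) if u < v)


n = 3
print(neighbors(0b000, n))           # nodes adjacent to node 000
print(link_count(n), n * 2 ** (n - 1))  # enumerated count vs. the closed form
```

For the 3-D hypercube used later in the paper, each node has three neighbours, which matches the three physical channels provided by the coprocessor.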
using Verilog HDL is described in Section 3. Finally in Section 4 the conclusions and future work are presented.

2.0 Architecture of the communication coprocessor

Architecture of the proposed coprocessor is described here. A careful examination of the routing process reveals that there are four important subtasks involved:

* Receiving flits from a neighboring node.
* Transmitting flits to a neighboring node.
* Deciding the channel through which a flit must be forwarded (routing).
* Host-processor interface, which involves assembling of flits into messages and disassembling a message into flits.

We have organized the coprocessor into four major blocks, one corresponding to each of the subtasks mentioned above. The above decomposition of the routing process into subtasks is also useful in modeling coprocessor behavior using Verilog. The block diagram of the coprocessor is shown in Figure 2. We now describe the function and the design of each individual block of the coprocessor.

ment the coprocessor chip, it will be difficult to support a large number of buffers on chip. In the present design, the packet size is chosen to be 40 bytes, which is sufficient for a parallel processing system to transfer data at a high rate.

The flit size should be as small as possible so that the time taken for its transmission is reduced; at the same time, other overheads such as the virtual channel number and the flit-type should not dominate the size of the data field. In the design, flit size is taken as 20 bits, inclusive of 2 bits for virtual channel number and 2 bits for flit type specification. Thus, each flit carries 2 bytes of information.

Currently, the number of physical channels provided is 3, since this will enable us to implement a variety of interconnection networks, such as 3-D hypercube, cube connected cycles, 4-star, 3-D mesh, and so on.

Simulation results [2,3] have shown that as the number of virtual channels is increased, the network throughput saturates. It was observed in [2] that a number of virtual channels in the range from 3 to 5 gives a better throughput to cost ratio. In the coprocessor, 3 virtual channels are provided per physical channel.

If the number of lines per physical link is increased, the transmission rate increases; but at the same time, the implementation cost goes up. On the other hand if the number of lines is reduced, it will increase the transmission time. In our design, two lines are provided per physical channel since it makes design of internal modules easy. Only one line is used for handshake. Separate wires are used for each direction, for both data transmission as well as handshake.
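The 20-bit flit format described above (2 bits of virtual channel number, 2 bits of flit type, 16 bits of information) can be illustrated with a small Python sketch. Several details are assumptions made only for this example and are not stated in the paper: the numeric codes for header, data and tail flits, the ordering of fields within the word, the use of the header flit's 16-bit field to carry the destination address, and the marking of the last payload flit as the tail.

```python
HEADER, DATA, TAIL = 0, 1, 2  # assumed 2-bit flit-type codes (hypothetical)


def pack_flit(vc: int, flit_type: int, info: int) -> int:
    """Pack a 20-bit flit: [vc:2][type:2][info:16] (field order is an assumption)."""
    if not (0 <= vc < 4 and 0 <= flit_type < 4 and 0 <= info < 1 << 16):
        raise ValueError("field out of range")
    return (vc << 18) | (flit_type << 16) | info


def unpack_flit(flit: int) -> tuple[int, int, int]:
    """Return (vc, flit_type, info) from a 20-bit flit."""
    return (flit >> 18) & 0x3, (flit >> 16) & 0x3, flit & 0xFFFF


def packet_to_flits(vc: int, dest: int, payload: bytes) -> list[int]:
    """Split a packet into a header flit plus 2-byte payload flits, the last one a tail."""
    if not payload or len(payload) % 2:
        raise ValueError("payload must be a non-empty even number of bytes")
    words = [int.from_bytes(payload[i:i + 2], "big") for i in range(0, len(payload), 2)]
    flits = [pack_flit(vc, HEADER, dest)]
    for k, word in enumerate(words):
        kind = TAIL if k == len(words) - 1 else DATA
        flits.append(pack_flit(vc, kind, word))
    return flits


def flit_bits(flit: int) -> list[int]:
    """Bit-serial view of a flit, most significant bit first (start bits not included)."""
    return [(flit >> i) & 1 for i in reversed(range(20))]


flits = packet_to_flits(vc=1, dest=5, payload=bytes(range(40)))
print(len(flits), unpack_flit(flits[0]), unpack_flit(flits[-1]))
```

Under these assumptions a 40-byte packet becomes one header flit followed by 20 payload flits, and 16 of every 20 transmitted bits carry information.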
clock frequencies are same for all nodes. When a flit is to be transmitted, first start bits are sent, followed by the virtual channel number, the flit type (header, data or tail), and the information bits (see Figure 3). The handshake signalling consists of a start bit followed by the virtual channel number in bit-serial fashion.

[Figure 3, panel (c): Handshake]

The routing table at node i stores, for each possible destination j, d alternate paths to reach j from i. Thus the size of the routing table is

$\mathit{Size} = (N-1) \cdot d \cdot [\log(d) + \log(v)]$ bits
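To get a feel for the routing table size, the expression can be evaluated numerically. The sketch below reads N as the number of nodes, d as the number of alternate paths stored per destination, and v as the number of virtual channels per physical channel. These readings, the base-2 logarithms, and the rounding of each logarithm up to a whole number of bits are assumptions made for this example.

```python
import math


def routing_table_bits(num_nodes: int, d: int, v: int) -> int:
    """Bits needed for (N - 1) destinations, d entries each,
    with every entry naming one of d paths and one of v virtual channels."""
    entry_bits = math.ceil(math.log2(d)) + math.ceil(math.log2(v))
    return (num_nodes - 1) * d * entry_bits


# Hypothetical parameters matching a 3-D hypercube with 3 virtual channels.
print(routing_table_bits(num_nodes=8, d=3, v=3))
```

With these example values each entry needs 4 bits, so the whole table is well under 100 bits per node, which is consistent with keeping the table in a small on-chip RAM.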
rent implementation, we have avoided livelocks by dropping the packet when the message reaches a blind alley.

2.4 Receiver block

Each physical link is provided with separate receiver blocks. A receiver block has N_v flit-buffers, where N_v is the number of virtual channels supported per physical channel. Each of these flit-buffers is capable of storing one flit. The receiver block continuously monitors the input physical channel for incoming start bits. When it senses the arrival of a new flit, that flit is stored in the corresponding flit-buffer. It concurrently monitors the status of its flit-buffers; if any one of these flit-buffers fb has been emptied (by the transmitter block, as will be seen later) the receiver must send a handshake signal to indicate it is ready to receive the next flit of this virtual channel fb. Each flit-buffer generates a buffer-full signal and flit-type signals, which are used by the router and transmitters. This receiver block is also capable of dropping the flits if the router finds no path to forward the packet.

2.5 Transmitter block

There is one transmitter block for each physical link. It multiplexes N_v virtual channels onto a single physical channel in a round robin fashion. The mapping from the virtual channels to the flit buffers is done by the router described next. The transmitter uses this mapping to associate a particular flit-buffer with each of its virtual channels. If a flit is present in the flit-buffer corresponding to a virtual channel vc, then the flit is picked up for transmission. After the flit has been transmitted, another flit can be transmitted through the virtual channel vc only after receiving a handshake signal, signifying that the neighboring node is ready to receive the next flit through vc.

The transmitter block maintains status information of each of its virtual channels. The information associated with a virtual channel vc includes

* whether vc is currently assigned to any packet.
* the flit-buffer address to which vc is presently assigned.
* whether vc is in the blocked state.

For a virtual channel vc the transmitter begins transmission if the following conditions are met.

* vc is assigned to a packet.
* vc is not in the blocked state, and
* there is a flit available in the corresponding buffer fb, for transmission through vc.

The transmission includes sending the start bits, the virtual channel number vc and the data in the corresponding flit-buffer fb. If the flit to be transmitted is a tail flit, the status information of vc is changed so that it is not assigned to any packet, after the transmission of the flit. For any flit, after its transmission the status of the virtual channel vc is set to blocked state, and will be reset only when it receives a handshake corresponding to vc.

2.6 Router

There is only one router, for routing purpose, in a coprocessor. This module assigns the output channel through which a sequence of flits belonging to a packet should be routed. This decision is made when a header flit arrives in any of the flit-buffers. The mapping of flit-buffers to virtual channels is implemented by the routing function, which is specific to the interconnection network being implemented. In our design, the mapping function is implemented through a RAM-based lookup table, allowing a significant degree of flexibility. The routing table is initialized by the host processor whenever the chip is reset. It uses three separate lookup tables for three alternative routes for a packet.

The router module uses a set of flags which are set when a header flit arrives in a flit buffer. These flags are scanned in a round-robin fashion. If a flag f is found to be set, it gets the destination address from the corresponding flit-buffer fb. Using this address it gets all possible routes from the routing tables. Then for each of these output channels, oc, it checks the fault vector to see if oc is faulty. If it finds that oc is not faulty, it checks if the output channel is already assigned to any other packet. If it is not assigned the router assigns fb to oc, otherwise it waits till oc is freed. If the router finds that all the output channels except the one through which it came are faulty, it informs the receiver block that the packet can not be forwarded and all the flits in that packet will be dropped by the receiver.

2.7 Processor interface block

There is one processor interface block for each node. The functions carried out by this block are the following.

* Upon reset, accept data from the host-processor and program the routing table.
* Accept a packet from the host-processor and disassemble the packet into header, data, and tail flits. The flits are stored in a flit-buffer one by one.
* Assemble the flits destined to the current node into packets and forward them to the host-processor.

There are two packet-buffers available, one to store the packet to be transmitted as flits, and another to buffer all the flits of a packet that are arriving. It divides the packet to be transmitted into flits and stores them in a flit-buffer
which is treated similar to flit-buffers in receivers. This block interacts with the host processor and generates required control signals. This interface also diverts the initial data written on to the routing tables and fault-vector.

3.0 Behavior modeling using Verilog HDL

Modeling of the communication coprocessor described in the previous sections was carried out using Verilog Hardware Description Language [11]. Verilog HDL provides a great amount of flexibility to model the behavior of a system. A module is the basic unit in Verilog. It represents some logical entity that is usually implemented by a piece of hardware. Using the various data types and procedural statements available, a hardware circuit can be modeled accurately. To control scheduling of execution, different timing structures are provided. The language has the capability to apply inputs and display messages, values and waveforms of various signals whenever required.

In addition to the modules that are described below, a number of supporting modules were required to complete the behavior. They are handshake signal transmitters and receivers and a number of basic gates.

The modeling of each of the functional blocks described above is described below. Verilog-like pseudo code is used to describe them instead of actual code for reasons of brevity. The task-like functions that are referred to below may take more than one clock cycle to finish execution.

3.1 Receiver module

Three receiver blocks are required by a coprocessor to handle three physical links. The receiver block is modeled as a finite state machine, which takes inputs from the physical link and flit-buffers, and gives control signals to flit-buffers to store data and to set the header flit flag in the router, if there is a header flit. The pseudo code for this module is given below.
module Txmitter (Reset, BfrData, status, phyO, clock, ..);
  // declarations
  // Reset

  always @(posedge clock)
  begin
    case (state)
      0 : foreach (virtual channel)
          begin
            if (can_be_txmitted)
              state = 1;
          end
      1 : send_start_bit;
          send_virtual_channel_number;
          state = 2;
      2 : transmit_flit;
          state = 3;
      3 : Update signals;
          Update status;
          state = 0;
    endcase
  end
endmodule

3.3 Router

In addition to the FSM for the routing function, it acts as the central repository for status information associated with all virtual channels. It takes inputs from all the modules, including the routing table and fault-vector, and generates outputs in the form of status signals. The pseudocode for this module is given below.

module Router (Reset, BfrFull, status_control, clock,
               Destination_addr, status, Table_addr,
               OutPut_Paths, fault_vector, ..);
  // declarations
  // Reset

  always @(status_control)
    // update_status;

  always @(BfrFull)
    // Set Flags

  always @(posedge clock)
  begin
    case (state)
      0 : foreach (Flag)
            if (Flag) state = 1;
      1 : get_destination address;
          state = 2;
      2 : for (each_alternative_paths)
          begin
            if (channel_not_faulty) state = 3;
          end
          // no usable path found: drop the packet
          set_drop_packet;
          state = 0;
      3 : if (channel_not_assigned) state = 4;
          else state = 0;
      4 : assign status;
          state = 0;
    endcase
  end
endmodule

3.4 Processor interface

The processor interface is divided into two modules, proc_rx and proc_tx. Module proc_rx receives the packet from the host-processor, divides it into flits and stores the flits in a flit-buffer. Module proc_tx assembles the flits that are arriving and delivers them as a packet to the host-processor. Pseudocodes for these two modules are given below.

module Proc_rx (ChipSel, Write, BfrEmpty, ..);
  // declarations
  // Reset

  always @(posedge clock)
  begin
    case (state)
      0 : if (ChipSel && Write) state = 1;
      1 : receive_packet;
          state = 2;
      2 : store_flit_type;
          state = 3;
      3 : store_flit;
          state = 4;
      4 : wait_for_flit_transmission;
          state = 5;
      5 : if (packet_empty) state = 0;
          else state = 2;
    endcase
  end
endmodule

module Proc_tx (ChipSel, Read, PktReady, ...)
  // declarations
  // Reset

  always @(posedge clock)
  begin
    case (state)
      0 : if (Packet_can_be_assembled) state = 1;
      1 : assemble_flit_into_packet;
          state = 2;
      2 : if (packet_complete) pktReady = 1; state = 3;
          else state = 1;
      3 : transmit_packet_to_hostprocessor;
          state = 4;
      4 : Update_status;
          state = 0;
    endcase
  end
endmodule

3.5 Testing of the communication coprocessor

In order to test the router by simulation a top level module NETWORK is formed. This instantiates many nodes, each of which is a combination of a coprocessor and a pseudo processor which controls the coprocessor. These nodes are interconnected to form the required topology. The pseudo processor will initialize the coprocessor after reset, and will write the required fault-vector and packets for transmission. The overall hierarchical structure used for simulation is shown in Figure 4.

[Figure 4. Hierarchical structure used for simulation; top-level module NETWORK]

the respective nodes. These files are read by host modules to initialize their respective routing tables and fault vectors. The disk file corresponding to node i also contains the message to be transmitted by node i.

Our intention has been to use the Verilog model of the coprocessor to verify the functional correctness of the wormhole routing algorithm. The model has also helped us in logically organizing the various functions of the coprocessor into separate building blocks. The simulation of the 3-D hypercube network itself has given us a high degree of confidence in our coprocessor design. The network simulation has enabled us to understand and debug the fault-tolerant routing algorithm.

Each node records all important events onto a log file with timing information. These events can be of the following types.

* Split a packet into flits
* Select routing path
* Transmission of a flit along with line of transmission
* Reception of flit and handshake
* Complete reception of a packet
* Discarding a packet if no path is found.

By analyzing the log file, we are able to trace the history of each packet. In other words we obtain snap shots of the network through the analysis of the log file and are able to locate implementational bugs. We found the $display
tion of the physical resources such as communication links. The communication coprocessor has been designed as a general purpose chip, and is capable of implementing any static fault-tolerant routing algorithm by providing a programmable lookup table and a fault vector. By using separate receivers and transmitters for each physical channel, this chip will be able to give high throughput. Even though for testing the design, 3-D hypercube topology has been used, the overall design is not restricted to any particular topology. It is provided with a standard interface so that it can be used along with any of the commercial processors.

In our future work, we intend to synthesize this design and map it onto FPGA. By using layout generating tools we plan to obtain the mask level layout of the circuit. Also we want to make the fault-tolerant algorithm more robust by ensuring packet delivery as long as a path exists between the source and destination. We are also studying the possibility of handling other network conditions such as congestion.

W J Dally et al. The Message-Driven Processor. IEEE MICRO, April 1992, pp 23-39.

P Easwar. Adaptive Deadlock-free Routing Algorithms for Star Graphs. M-Tech Thesis, Department of Mathematics, Indian Institute of Technology New Delhi, India, 1991.

A Kuchlous. VLSI Implementation of a Fault-tolerant Algorithm for Star Graphs. B-Tech Thesis, Department of Electrical Engineering, Indian Institute of Technology New Delhi, India, 1993.

P R Miller, C R Jesshope and J T Yantchev. The Mad-Postman Network Chip. Proc Transputing 1991, Vol 2, IOS, pp 517-536.

L M Ni and P K McKinley. A Survey of Wormhole Routing Techniques in Direct Networks. IEEE COMPUTER, February 1993, pp 62-76.