
Graph neural network hardware acceleration in Pytorch with streaming PYNQ overlays


Jose Nunez-Yanez
Department of Electrical Engineering
University of Linköping
Linköping, Sweden
jose.nunez-yanez@liu.se

Abstract—Graph neural networks (GNNs) show high learning accuracy when applied to non-Euclidean data in which data elements do not fit into a regular structure. They combine sparse and dense data characteristics and this, in turn, results in a combination of compute- and bandwidth-intensive requirements that is challenging to meet with general-purpose hardware. In this paper we investigate a dataflow of dataflows (DoD) hardware architecture, built using high-level synthesis, that optimizes data access and processing element utilization. The architecture is highly configurable, with both the number of hardware threads available for the aggregation and combination phases and the number of compute units per thread defined at compile time. The fine-grained dataflow in the compute units streams words with a bit-width that depends on the network precision, while the coarse-grained dataflow that links the aggregation and combination stages streams partially computed matrix tiles. The accelerator is mapped to the programmable logic of a Zynq Ultrascale device whose processing system runs Pytorch extended with PYNQ overlays. Preliminary results on the citeseer citation network show a performance gain of 20x with multi-threaded hardware configurations compared with the multi-threaded CPU implementation available in Pytorch.

Index Terms—neural network, FPGA, sparse, pruning, matrix multiplication acceleration, TensorFlow

This research was partially funded by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.

I. INTRODUCTION

GNNs perform tasks such as graph classification, node classification, link prediction or graph clustering and have a large number of applications in areas such as anomaly detection, bioinformatics, cybersecurity or natural language processing. GNN processing uses both dense and sparse data representations and the resulting irregular computing and data access means that both inference and training of GNNs are complex. Popular machine learning frameworks like Tensorflow and Pytorch support graph neural network development and, in this work, we focus on Pytorch and how its python interface can be integrated with accelerator overlays developed with Xilinx PYNQ for graph neural network processing. PYNQ is a Xilinx Python framework that runs on Ubuntu and provides a highly-productive development platform for Xilinx devices such as the Zynq family. In this paper, we present preliminary results on creating a dataflow architecture for graph neural networks using high-level synthesis and its integration into PYNQ and Pytorch. The architecture is based on our previous work on the Tensorflow Lite accelerator FADES [1] and it has been named gFADES (graph FADES). This initial work focuses on a popular type of GNN called the graph convolutional network (GCN) and the main contributions are as follows:
• We present gFADES as a graph neural network accelerator that uses a dataflow of dataflows (DoD) approach to compute the output features of layer l + 1 in a GCN: H^(l+1) = σ(D̃^(-1/2) Ã D̃^(-1/2) H^(l) W^(l)), where W^(l) indicates the trainable weight matrix of layer l, H^(l) the input feature matrix of layer l and Ã the normalized adjacency matrix (a minimal Pytorch sketch of this rule is shown at the end of this section). Each row of the input feature matrix H^(0) contains the attributes or features of a node of the input graph. Each row of the output feature matrix H^(1) is the embedding of that node in a lower-dimensional space.
• We demonstrate how gFADES performance can be scaled to adapt to the system bandwidth and compute availability with multiple hardware threads and multiple compute units.
• We explore new HLS features that enable the creation of high-throughput and efficient dataflow of dataflows (DoD) architectures.
• We present preliminary performance results and the integration flow as a high-performance Pytorch accelerator suitable for edge compute devices.

This paper is organized as follows: Section II reviews related work. Section III describes the proposed DoD architecture for high-performance dense and sparse tensor operations, including the hardware multi-threaded extensions. Section IV performs a preliminary performance evaluation while Section V discusses the Pytorch integration in a 2-layer GCN example network. Finally, Section VI concludes this paper and proposes future work.
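For reference, the propagation rule above can be written in a few lines of dense Pytorch. The sketch below only illustrates the mathematics, with σ taken as ReLU and a dense adjacency matrix for readability; it is not the accelerated CSR streaming path described later.

import torch

def gcn_layer(A, H, W):
    # A: raw adjacency matrix (dense here only for clarity), H: node features, W: layer weights
    A_tilde = A + torch.eye(A.shape[0])          # add self-loops
    d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)    # D_tilde^(-1/2) from the node degrees
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]  # normalized adjacency
    return torch.relu(A_hat @ H @ W)             # sigma taken as ReLU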
II. RELATED WORK AND MOTIVATION

Over the last few years the interest in graph neural network applications has increased significantly and the topic of hardware acceleration has started to receive widespread attention. These accelerators typically consider that the aggregation phase consists of a sparse x dense matrix operation with a sparse adjacency matrix while the combination phase is a dense x dense operation with a dense feature matrix [2]. In [3] the authors indicate that in many applications the input features contain significant levels of sparsity and propose a sparse block strategy for these cases. The input feature matrix is encoded in CSR (Compressed Sparse Row) format with coarse-grained blocks of zeros that can be bypassed. The proposed systolic architecture allows the design to adapt the computing performance to the available input/output bandwidth. The design targets an ASIC technology and uses internal memory to preload the weight matrix in the systolic array, which is not possible in embedded FPGA devices due to the limited amount of internal BRAM memory. The hardware also loads rows of the feature matrix selectively when they correspond to non-zero elements in the adjacency. This implies that, to obtain high hardware utilization once a value is detected as a non-zero, the loading of multiple feature values must be performed in one clock cycle to avoid stalling the compute engine. Instead, this work uses streaming techniques to stream sparse and dense feature values, processing multiple tensor columns in parallel. We have observed that although the adjacency matrix is very sparse, each row uses different elements of the feature column and eventually all adjacency data is involved in the computation. The recoding in [3] of input features into blocks of non-zeros will lose fine-grained detail in the input features. Our hardware treats the sparsity of the feature matrix in the same way as the adjacency matrix so no blocking is needed. It can switch to dense processing in a single clock cycle when processing the dense hidden feature layers of the graph neural network. Other recent work for GNN processing includes GCNAX [4], which computes aggregation and combination in two separate phases to take advantage of the sparseness of the adjacency matrix and the possible sparseness of the input features of the first layer of the GCN. The authors in [4] buffer the intermediate dense matrix resulting from the combination phase and pass it to the aggregation engine. The design employs a 16 MAC array and also targets an ASIC technology, with performance and energy parameters estimated using a synthesised design. In GraphACT [5], a hybrid CPU-FPGA platform targeting large-scale Xilinx Alveo cards equipped with HBM memory is presented, focusing on the acceleration of large graph training. The hardware is based on a systolic array and the training algorithm is optimized to fit the constraints of the hardware. The paper does not provide details on logic complexity or design methodology, focusing on graph theory to improve hardware mapping. The paper reports a DSP utilization of 5632 cores and the design is shown to outperform an NVIDIA Tesla GPU by 10% to 30% for different datasets. Also using large-scale Alveo cards and targeting large-scale graph acceleration, the hardware in [6] proposes a pre-processing stage that performs graph sparsification to reduce the number of edge connections and node reordering to increase data locality. This reduces the required size of on-chip memory. The research treats the feature matrix as a dense matrix for all the layers in the network and introduces a two-mode strategy to change the order of the combination and aggregation stages: (1) (AH)W or (2) A(HW). Their analysis shows that (2) has lower computation when the next layer feature vector is shorter than the current layer one. Our hardware always uses mode (2) because both A and H can be sparse and therefore both the aggregation and combination stages have the opportunity to work in sparse mode. Also, in the 2-layer GCN evaluated we have observed so far that the feature vector size decreases after the first layer.

Compared with previous work, we focus on resource-constrained devices operating at the edge, such as the Zynq and Zynq Ultrascale families, that lack high-bandwidth memory features. We also aim at integrating the accelerator as part of the Pytorch framework so it can be used as a drop-in replacement for the sparse/dense computation libraries currently available in Pytorch. The design focuses on streaming data with independent dataflow stages to optimize the limited memory bandwidth available and keep the compute engines busy. We consider fine-grained sparse adjacency matrices and sparse/dense feature matrices.

III. DATAFLOW DESCRIPTION

The dataflow combines hardware engines for the aggregation and combination stages that correspond to adjacency and feature matrix processing respectively. Each of these engines can instantiate a variable number of hardware threads and compute units depending on the required level of performance and available bandwidth. The dataflow of a single thread is shown in Figure 1 and it has been fully described using the Xilinx Vitis HLS (High-Level Synthesis) toolset. In the HLS description a dataflow of dataflows (DoD) is created with a new HLS 2022 feature called stream of blocks that enables the selective read lock and write lock of the PIPO (Ping-Pong) buffers that join the combination and aggregation stages and ensures that both engines can run in parallel with both single and multi-threaded hardware configurations. The combination and aggregation engines use fine-grained intra-dataflow stages that are triggered by data words with bit-widths that depend on the selected data type (e.g. 16-bit half, 16-bit fixed, 32-bit float etc.). In Figure 1 we can see how multiple FIFOs, whose number depends on the number of compute units in each stage, join the different processing stages of the fine-grained dataflow. The coarse-grained inter-dataflow between both engines has a data granularity that consists of tiles with dimensions that depend on the number of compute units and the number of hardware threads. These tiles ensure high throughput by keeping all stages in the dataflow active. In Figure 1 we can see how a PIPO joins the different processing stages of the coarse-grained dataflow. Multiple PIPOs are needed in multi-threaded configurations as seen in Figure 2. The weight matrix is always considered to be dense while the adjacency matrix is always sparse in CSR format. The feature matrix is available in sparse mode for the first layer and in dense mode for the hidden layers.
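As a concrete illustration of the CSR streams consumed by the read stages (row pointer, column indices and non-zero values), the small example below prints the three arrays for a 4x4 adjacency matrix; scipy is used only for illustration and is not part of the accelerator flow.

import numpy as np
from scipy.sparse import csr_matrix

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 1, 0, 0]])
A_csr = csr_matrix(A)
print(A_csr.indptr)   # row pointer (rowptr): [0 1 3 4 5]
print(A_csr.indices)  # column indices:       [1 0 2 3 1]
print(A_csr.data)     # non-zero values:      [1 1 1 1 1]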
A. Combination engine

The combination engine computes FEA * W, where FEA represents the feature matrix and W the weight matrix, and generates a dense matrix output B in chunks for the aggregation engine, which computes ADJ * B, where ADJ represents the adjacency matrix. This engine consists of two main stages that read sparse or dense feature data and compute the dot product. Sparse mode in the combination engine is generally used in the first graph layer, where the feature matrix is large and significantly sparse. In the second or subsequent layers the feature matrix is the output from the previous layer and dense. Notice that performing a conversion from the dense output matrix into a sparse CSR matrix at run-time would represent a significant software overhead for no significant gain. To avoid these issues the combination engine is set to run in dense mode and the feature values port is used to read the dense input feature matrix while the rowptr and column index feature ports remain in idle mode. The read stage is activated after loading a W matrix tile with a column count that equals the number of compute units and a row count that equals the number of rows present in the W matrix. It then starts streaming elements of the FEA matrix in CSR (Compressed Sparse Row) format with column index and non-zero values in sparse mode. This means that in sparse mode the accelerator performs a variable number of reads of feature values and column indices per matrix. This does not represent a performance-limiting factor in the dataflow architecture because the READ stage is independent of the other stages, and it will run at full speed over all the data elements present in the FEA matrix as long as there are no output FIFO overflows.

The compute stage is the main computing loop that instantiates the bulk of the DSP blocks. It aims at activating all compute units (CUs) in parallel in each clock cycle. This is the case for all W tiles with a number of columns equal to the number of CUs. Typically, the last tile contains a number of columns lower than the number of compute units, and in this scenario some of the CUs do not write their output FIFOs. This enables support for arbitrary matrix shapes that are not a multiple of the tile size.

B. Aggregation engine

The aggregation engine is always used in sparse mode since the same adjacency values are used for the first and second graph layers. The aggregation engine contains read, compute, scale and write stages. The read stage is very similar to the read stage of the combination engine but it lacks the logic needed to compute in dense mode. It streams values and column indices while it internally computes the number of non-zeros present in each row using the rowptr data. All this information is streamed into FIFOs that are used by the compute stage. The compute stage has as inputs the PIPOs that are used by the combination engine to write its results and the FIFOs with the adjacency matrix data. The scaling stage is needed for low bit-widths as presented in our previous work targeting Tensorflow Lite [1] and convolutional neural networks. In this initial analysis for graph neural networks the data types are 16-bit floating-point and do not require scaling, so this stage is not enabled. The write stage reads the FIFOs from each compute engine and writes the results to main memory.

C. Multi-threaded extensions

To exploit the additional bandwidth and compute performance available in the Zynq Ultrascale device, the number of working hardware threads is configurable at compile time. Each hardware thread is assigned a number of rows of the adjacency and feature matrices while having access to the same weight data. Each hardware thread has independent ports connected to the multiple high-performance AXI ports available in the Zynq device.

Figure 2 shows examples of multi-threaded configurations where multiple data ports and hardware threads are used to stream and process the CSR matrices column index, row pointer and values. As seen in previous work, feature processing tends to be more compute intensive than adjacency processing, so in many applications a configuration with a higher number of threads for the combination engine compared with the aggregation engine is beneficial, as shown in Figure 2c. The number of PIPO buffers connecting the aggregation and combination stages is determined by the number of threads. The figure shows how each combination thread writes the same output to a number of PIPOs equal to the number of aggregation threads. Then, each aggregation thread reads from a number of PIPOs equal to the number of combination engines. Each of these PIPOs contains different data and they are needed to complete all the rows needed for the adjacency tensor processing. This organization ensures that all the compute units can write and read data in parallel without dataflow stalls and overcomes the limited number of read/write ports available in the BRAMs that build the PIPOs. Notice that each thread contains a number of CUs that is always a multiple of 2 to efficiently utilize the double read/write ports available in Xilinx BRAMs. Table I shows the memory and logic complexity for the different configurations for a weight matrix size of up to 20480 rows. The target device is the Zynq Ultrascale+ XCZU28DR available in the RFSoC 2x2 board with a programmable logic clock frequency of 250 MHz. Notice that the maximum row count of the weight matrix is in this case limited to 20480, but this is also limited by the hardware need to allocate physically contiguous memory in main memory to hold the matrix data. A way to overcome this size limitation is to introduce a new level of software tiling for very large arrays, using the processor to invoke the accelerator multiple times with an approach as indicated in the listing below:

Listing 1: Float compute kernel example

for (int c = 0; c < C; c++)
  for (int m = 0; m < M; m++)
    for (int n = 0; n < N; n++)
      for (int k = 0; k < K; k++)
        D(m, c) += A(m, n) * H(n, k) * W(k, c);

where D(x,y), A(x,y), H(x,y) and W(x,y) represent tiles of compatible dimensions of the original matrices.
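On the host side this software tiling could be expressed in Python around the run_kernel() call introduced in Section V. The sketch below is only an assumption about how such a loop might look, not part of the gFADES code base; run_accelerator_on_tile() is a hypothetical helper emulated here in numpy so the example runs standalone.

import numpy as np

def run_accelerator_on_tile(a_tile, h_tile, w_tile):
    # Stand-in for configuring the IP registers with the tile buffers and calling
    # run_kernel(); emulated in numpy so the sketch is self-contained.
    return a_tile @ h_tile @ w_tile

def tiled_gcn(A, H, W, T):
    # Block the triple product D = A * H * W into T x T tiles (dimensions assumed
    # divisible by T), accumulating each partial tile product as in Listing 1.
    M, N = A.shape
    K, C = W.shape
    D = np.zeros((M, C), dtype=A.dtype)
    for c in range(0, C, T):
        for m in range(0, M, T):
            for n in range(0, N, T):
                for k in range(0, K, T):
                    D[m:m+T, c:c+T] += run_accelerator_on_tile(
                        A[m:m+T, n:n+T], H[n:n+T, k:k+T], W[k:k+T, c:c+T])
    return D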
Fig. 1: Single-threaded dataflow of dataflows description (READ SPARSE FEATURE, READ WEIGHT BLOCK and FEATURE COMPUTE stages on the combination side and READ SPARSE ADJACENCY, ADJACENCY COMPUTE, SCALE and WRITE stages on the aggregation side, joined by dataflow FIFOs and the C_buffer PIPO).

Fig. 2: Examples of hardware multi-threaded configurations: (a) 1 feature engine and 1 adjacency engine, (b) 2 feature engines and 1 adjacency engine, (c) 2 feature engines and 2 adjacency engines, (d) 4 feature engines and 4 adjacency engines. Each configuration streams the CSR adjacency (adj_ci, adj_rp, adj_v) and feature (fea_ci, fea_rp, fea_v) data together with the weight data w, and the engines exchange tiles through the SB(x,y) PIPO buffers.
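The SB(x, y) labels in Figure 2 suggest one PIPO buffer per (combination thread, aggregation thread) pair, which matches the description above: each combination thread writes to as many PIPOs as there are aggregation threads, and each aggregation thread reads from as many PIPOs as there are combination threads. The indexing below is an interpretation of the figure, not code from the design.

def pipo_buffers(comb_threads, agg_threads):
    # One PIPO SB(c, a) for every combination thread c and aggregation thread a.
    return [(c, a) for c in range(comb_threads) for a in range(agg_threads)]

print(len(pipo_buffers(2, 2)))   # 4 PIPOs for the configuration of Figure 2c
print(len(pipo_buffers(4, 4)))   # 16 PIPOs for the configuration of Figure 2d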
Configuration      LUTs (K)   FFs (K)   BRAM 18Ks   DSP48Es
(1t1t2c, half)     21.7       29.7      58          24
(2t2t4c, half)     38.0       50.1      345         76
(2t2t8c, half)     46.4       58.2      537         140
(1t1t16c, half)    35.7       44.15     645         136
(2t1c16c, half)    49.43      57.62     999         200
(4t2t8c, half)     65.71      77.56     1068        204

TABLE I: Configuration complexity comparison

IV. PRELIMINARY PERFORMANCE EVALUATION

To evaluate the performance of the design we initially select the citeseer dataset. The CiteSeer dataset consists of 3327 nodes, where each node represents a scientific publication and the edges represent citation relationships. Each node has a predefined feature vector with 3703 values that indicate the absence or presence of a corresponding dictionary word in the publication. In total there are 12431 non-zeros in the adjacency matrix indicating links between nodes and 105165 non-zeros in the feature matrix indicating the presence of words in the nodes. The dataset is designed for node classification tasks and the objective is to predict the category of unknown publications. Table II summarizes the statistics of the dataset and the density of the adjacency and feature matrices. In this preliminary performance evaluation we focus on a single GCN layer with 21 hidden features that represent the features generated by the layer for each node and also the output of the node that will be propagated to the next layer.

Dataset    Nodes   Dens. adj   Inp. features   Dens. fea   Hidden features
Citeseer   3327    0.11%       3703            0.85%       21

TABLE II: Citeseer dataset statistics
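The density figures in Table II follow directly from the non-zero counts given in the text; a quick check:

adj_density = 12431 / (3327 * 3327)     # non-zeros / (nodes * nodes)
fea_density = 105165 / (3327 * 3703)    # non-zeros / (nodes * input features)
print(f"{adj_density:.2%} {fea_density:.2%}")   # 0.11% 0.85%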
The statistics show that both the adjacency and feature matrices are significantly sparse and this can be effectively exploited by the gFADES accelerator. Figure 3 shows the performance obtained on this dataset for the hardware configurations presented in Table I. The reference implementation running on the CPU uses Pytorch as originally presented in [7] and the code is illustrated below.

Listing 2: GCN Pytorch layer in CPU

support = torch.spmm(input, self.weight)
output_cpu = torch.spmm(adj, support)

We can see that in this initial analysis all the hardware configurations show promising performance improvements, but additional verification is required with other datasets. Also, additional work is required to test the performance of the accelerator against other devices such as embedded GPUs.

Fig. 3: Performance evaluation on citeseer dataset

V. PYTORCH INTEGRATION

The gFADES accelerator is implemented as a PYNQ overlay integrated in the Pytorch machine learning framework. We use a PYNQ 2.7 image that runs Ubuntu 22.04 and install Pytorch 1.9 on the RFSoC 2x2 board from the original sources. PYNQ enables full control of the accelerator from a python environment. PYNQ uses numpy arrays as the data buffers for the accelerator. Numpy arrays of contiguous memory are allocated in the python script to store the input and output data for the accelerator. The accelerator is then configured with the addresses of these buffers. Any additional IP control registers are also written, such as those indicating dense or sparse mode and the matrix sizes. Finally, a run_kernel script starts the IP block by setting the AP_START bit to 1 and checks when the accelerator completes by reading the AP_DONE bit. Pytorch uses torch tensors to store its data, which in addition to the values store additional information such as requires_grad, used to compute derivatives automatically during the backward pass. These torch tensors are similar to numpy arrays and conversion between both data types is possible reusing the same memory without explicit data copying. This simplifies and optimizes the integration of PYNQ and Pytorch. For example:

Listing 3: Torch tensor and accelerator PYNQ numpy array integration

from pynq import allocate
import numpy as np
import torch

# allocate a numpy array suitable for accelerator calls
B_buffer = allocate(shape=(P_w, M_fea), dtype=np.float16)
# obtain a Torch tensor that shares the same memory
torch_B_buffer = torch.from_numpy(B_buffer)
# configure the IP register with the numpy pointer address
my_ip.register_map.B_offset_1 = B_buffer.physical_address
# other register configuration and memory allocation omitted for clarity
# run the hardware kernel
run_kernel()
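The zero-copy behaviour of torch.from_numpy can be verified with a plain numpy array; the PYNQ buffer allocated above is a numpy-compatible array and is handled the same way.

import numpy as np
import torch

a = np.zeros((2, 3), dtype=np.float16)
t = torch.from_numpy(a)   # t shares the memory of a, no copy is made
a[0, 0] = 1.0             # a write through the numpy array...
print(t[0, 0])            # ...is visible in the torch tensor: tensor(1., dtype=torch.float16)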
The obtained torch_B_buffer tensor and B_buffer numpy array can then be used in the rest of the algorithm. We perform an initial test of the accelerator with a GCN consisting of two layers, with a first layer with 32 hidden units and a second layer with 16 hidden units feeding a final fully connected layer that outputs 7 possible classes. The hardware configuration consists of 1 aggregation thread, 1 combination thread and 16 compute units. The following listing shows an excerpt from the integration of the accelerator in the layers.py script from [7]. When the accelerator is active (i.e. acc=1) we initially set the hardware registers to point to the layer parameters. Then, we activate the accelerator with run_kernel() and obtain the output from D_buffer. The first layer runs the hardware in full sparse mode and both the feature and adjacency matrices are processed in sparse mode. On the other hand, the second layer has as input the output feature matrix generated by the first layer. This second feature matrix is now dense and the accelerator uses a dense mode, meaning that the adjacency matrix is still sparse but the feature matrix is dense.

Listing 4: Pytorch acceleration with gFADES

if (acc == 1):
    print("Running gFADES")
    self.my_ip.register_map.M_fea = self.in_features
    self.my_ip.register_map.P_w = self.out_features
    self.my_ip.register_map.gemm_mode = dense
    self.my_ip.register_map.D_1_offset_1 = D_buffer.physical_address
    self.my_ip.register_map.values_fea_1_offset_1 = values_fea_buffer.physical_address
    self.my_ip.register_map.B_offset_1 = B_buffer.physical_address
    self.run_kernel()
    output_acc = D_buffer
    output_acc = torch.from_numpy(output_acc)
    output_acc = torch.tensor(output_acc, dtype=torch.float32)
    output = output_acc
else:
    print("Running CPU")
    support = torch.mm(input, self.weight)
    output_cpu = torch.spmm(adj, support)
    output = output_cpu

The CPU uses all 4 CPU cores available to compute but we observe a performance improvement in hardware for both layers. The pure sparse mode reduces processing from 41.58 ms to 2.34 ms while the hybrid sparse/dense mode reduces processing from 6.63 ms to 1.63 ms (roughly 18x and 4x speedups, respectively).

VI. CONCLUSIONS

In this paper we have presented the gFADES dataflow hardware architecture for graph convolutional networks. The gFADES architecture is highly configurable in terms of logic and bandwidth requirements and is suitable for edge FPGA devices. Preliminary performance and functional results following the integration into the Pytorch machine learning framework show good acceleration both in pure sparse and in hybrid sparse-dense mode. Future work includes testing additional data sets, hardware configurations and quantization strategies for low bit-width data types. Also, we intend to streamline the integration process with Pytorch/Tensorflow and investigate how gFADES can be used in the backward pass to accelerate training in addition to the forward pass. Finally, investigating how gFADES can be scaled up to non-Zynq devices equipped with HBM (high bandwidth memory) is also a worthwhile avenue of research.

REFERENCES

[1] J. Nunez-Yanez, "Fused architecture for dense and sparse matrix processing in TensorFlow Lite," IEEE Micro, vol. 42, no. 6, pp. 55–66, 2022.
[2] R. Garg, E. Qin, F. Muñoz-Martínez, R. Guirado, A. Jain, S. Abadal, J. L. Abellán, M. E. Acacio, E. Alarcón, S. Rajamanickam, and T. Krishna, "Understanding the design-space of sparse/dense multiphase GNN dataflows on spatial accelerators," 2021.
[3] C. Peltekis, D. Filippas, C. Nicopoulos, and G. Dimitrakopoulos, "FusedGCN: A systolic three-matrix multiplication architecture for graph convolutional networks," in 2022 IEEE 33rd International Conference on Application-specific Systems, Architectures and Processors (ASAP), pp. 93–97, 2022.
[4] J. Li, A. Louri, A. Karanth, and R. Bunescu, "GCNAX: A flexible and energy-efficient accelerator for graph convolutional neural networks," in 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 775–788, 2021.
[5] H. Zeng and V. Prasanna, "GraphACT: Accelerating GCN training on CPU-FPGA heterogeneous platforms," in Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA '20, New York, NY, USA, pp. 255–265, Association for Computing Machinery, 2020.
[6] B. Zhang, H. Zeng, and V. Prasanna, "Hardware acceleration of large scale GCN inference," in 2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP), pp. 61–68, 2020.
[7] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," 2016.
