
SMAPPIC: Scalable Multi-FPGA Architecture Prototype Platform in the Cloud

Grigory Chirkov, Princeton University, Princeton, USA (gchirkov@princeton.edu)
David Wentzlaff, Princeton University, Princeton, USA (wentzlaf@princeton.edu)

ABSTRACT

Traditionally, architecture prototypes are built on top of FPGA infrastructure, with two associated problems. First, very large FPGAs are prohibitively expensive for most people and institutions. Second, the burden of FPGA development adds to an already uneasy life of researchers, especially those who focus on software. Large designs that do not fit into a single FPGA exacerbate these issues even more. This work presents SMAPPIC – the first open-source prototype platform for shared memory multi-die architectures on cloud FPGAs. SMAPPIC leverages the OpenPiton/BYOC infrastructure and AWS F1 instances to make FPGA-based prototypes of System-on-Chips, processor cores, accelerators, cache subsystems, etc., cheap, scalable, and straightforward. SMAPPIC enables many use cases that are not possible or significantly more complicated in existing software and FPGA tools. This work has the potential to accelerate the rate of innovation in computer engineering fields in the near future.

CCS CONCEPTS

• Hardware → Simulation and emulation; Reconfigurable logic and FPGAs; • Computer systems organization → Multicore architectures; Heterogeneous (hybrid) systems.

KEYWORDS

Modeling, FPGA, cloud, multi-die, interconnect, multicore, heterogeneity

ACM Reference Format:
Grigory Chirkov and David Wentzlaff. 2023. SMAPPIC: Scalable Multi-FPGA Architecture Prototype Platform in the Cloud. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS '23), March 25–29, 2023, Vancouver, BC, Canada. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3575693.3575753

This work is licensed under a Creative Commons Attribution 4.0 International License.
ASPLOS '23, March 25–29, 2023, Vancouver, BC, Canada
© 2023 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-9916-6/23/03.
https://doi.org/10.1145/3575693.3575753

1 INTRODUCTION

The approaching demise of Moore's Law [25] puts the task of optimizing computation on the shoulders of system-level researchers: computer architects, compiler engineers, operating system and runtime developers, etc. Correspondingly, the number of papers in the top venues in these fields has increased dramatically over the last 20 years [1–3]. One can imagine a future where all the advances in computation come purely from clever new techniques and features designed by these researchers. However, most of such work demands working on futuristic, currently nonexistent hardware. This highlights the importance of a cheap, fast, scalable, and easy-to-use prototyping infrastructure.

Traditionally, architecture prototypes are built on top of Field-Programmable Gate Array (FPGA) infrastructure. This approach has two significant associated problems. First, FPGA development is a complicated task that requires deep knowledge of not only Register Transfer Level (RTL) but also the details and physical constraints of the chosen FPGA platform. Many researchers (especially those who focus more on software aspects) would rather spend their time and effort on something closer to their interests. Second, FPGAs needed for large systems' prototypes must be bought by researchers for a significant price (starting from $5000, or $20000 for a set of four) [62]. Therefore, independent researchers cannot afford to use FPGAs at all, and large entities cannot afford to scale out FPGA infrastructure, i.e., evaluate multiple systems in parallel or use multiple FPGAs for a single prototype.

Moreover, the end of Moore's Law gradually slows down transistor density scaling and forces designers to implement increasingly larger designs each year [33]. Correspondingly, prototypes are also becoming larger and no longer fit into a single FPGA. Partitioning prototypes across multiple FPGAs further exacerbates both price and complexity issues. Emulation systems like Synopsys Zebu [55] solve the complexity problem but only at the cost of even more expensive hardware.

The current COVID-19 pandemic makes FPGAs even less accessible. Working with an FPGA often requires physical access to the hardware, which can be unavailable due to pandemic restrictions like lockdowns and Work-From-Home orders. Moreover, due to the pandemic-induced chip shortage, purchasing FPGAs is challenging even with the necessary funds. For example, lead times for large FPGAs from Xilinx now exceed half a year [62].

However, FPGAs have debuted as offerings from multiple public cloud providers in the past five years, including Amazon Web Services (AWS) [19, 32, 43]. This development presents a unique opportunity for researchers to embrace FPGA prototyping or migrate their flow from existing on-premise solutions to cloud FPGAs, enabling them to run thousands of instances in parallel or to distribute large designs across multiple FPGAs for a modest price. Such a prototype also presents the opportunity to use large datasets available only in the cloud or have an FPGA prototype interacting with cloud-only services like complex Function-as-a-Service (FaaS) pipelines.


Computer architecture, operating systems, and compilers classes can also use cloud FPGA platforms' on-demand scale-out nature for educational purposes beyond what a single academic institution can purchase.

This paper presents Scalable Multi-FPGA Architecture Prototype Platform in the Cloud (SMAPPIC) – the first open-source¹ prototype platform for shared memory multi-die architectures on cloud FPGAs. SMAPPIC is designed with three main goals: ease of use, cost-effectiveness, and scalability. They all are achieved through a modular, hierarchical, and parametrizable structure based on a cloud FPGA backend.

SMAPPIC is a ready-to-use solution that makes architectural FPGA prototyping easy and accessible to researchers from various computer science and computer engineering fields. Researchers do not need to be experts in FPGA development to implement their prototypes, and the modular design of SMAPPIC allows them to make RTL modifications with non-invasive changes. Moreover, many software studies can be performed by tuning built-in platform parameters without the need to change RTL code altogether. SMAPPIC's cloud FPGA backend makes it accessible and cost-effective, allowing users to pay for infrastructure per hour.

[Figure 1: SMAPPIC generates optimized prototypes for various targets. Configurations are denoted in AxBxC format, where A is the number of FPGAs, B is the number of nodes per FPGA, and C is the number of tiles per node. (a) Classic 1x1x12 configuration for single-node targets. (b) Cost-efficient 1x4x2 configuration for small targets. (c) 4x1x12 configuration for multi-node targets with large nodes. (d) 4x4x2 configuration for multi-node targets with small nodes.]

Each prototype in SMAPPIC contains one or more separate nodes, where each node represents a single chip or die of the target system. Nodes can be distributed across multiple target FPGAs; they can be independent and represent separate systems or be merged with newly designed inter-node interconnect into one large system. SMAPPIC configurations are described in AxBxC notation, where A is the number of FPGAs in the prototype, B is the number of nodes per FPGA, and C is the number of tiles per node (a small sketch after the list below illustrates the notation). Fig. 1 shows the example SMAPPIC hierarchies:

(1) Fig. 1a shows a classic 1x1x12 configuration with a single prototype that implements one node in one FPGA.
(2) Fig. 1b shows a cost-efficient 1x4x2 configuration. Such configuration allows modeling multiple systems with different parameters in one FPGA to optimize utilization and cost-efficiency.
(3) Fig. 1c shows a 4x1x12 configuration. It represents a cache-coherent multi-node system with 4 large 12-core nodes.
(4) Fig. 1d shows a 4x4x2 configuration. It represents a cache-coherent multi-node system with 16 small 2-core nodes.
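To make the AxBxC notation concrete, the following minimal C sketch (illustrative only; the struct and helper names are ours, not part of the SMAPPIC code base) enumerates the four configurations from Fig. 1 and computes the total tile count of each.

```c
#include <stdio.h>

/* Illustrative description of an AxBxC SMAPPIC configuration. */
struct smappic_config {
    int fpgas;          /* A: FPGAs in the prototype      */
    int nodes_per_fpga; /* B: nodes (chips/dies) per FPGA */
    int tiles_per_node; /* C: tiles (cores) per node      */
};

static int total_tiles(struct smappic_config c) {
    return c.fpgas * c.nodes_per_fpga * c.tiles_per_node;
}

int main(void) {
    /* The four example hierarchies from Fig. 1. */
    struct smappic_config cfgs[] = {
        {1, 1, 12}, /* (a) single 12-tile node in one FPGA        */
        {1, 4, 2},  /* (b) four independent 2-tile nodes          */
        {4, 1, 12}, /* (c) four FPGAs, one large node each: 48    */
        {4, 4, 2},  /* (d) sixteen small nodes across four FPGAs  */
    };
    for (int i = 0; i < 4; i++)
        printf("%dx%dx%d -> %d tiles\n", cfgs[i].fpgas,
               cfgs[i].nodes_per_fpga, cfgs[i].tiles_per_node,
               total_tiles(cfgs[i]));
    return 0;
}
```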
We chose the BYOC/OpenPiton [10, 11] framework as SMAPPIC's node foundation and AWS EC2 F1 instances as a target FPGA infrastructure. Together, they provide all the necessary features needed to achieve our design goals:

(1) The BYOC framework provides a modular and configurable RTL model of a heterogeneous computing platform with Network-On-Chip (NoC) interconnect and scalable coherent cache subsystem inside each node. Various research groups have already integrated Ariane [65], OpenSPARC T1 [37], ao486 [39], PicoRV32 [60], AnyCore [17], BlackParrot [41] cores as well as NVDLA [34] and MIAOW GPU [8] accelerators into BYOC. These computational cores can be used out of the box without any modifications. Moreover, the Transaction Response Interface (TRI) allows for fast and easy integration of other new computation cores.

¹Source code: https://parallel.princeton.edu


(2) AWS EC2 F1 instances provide infrastructure for cheap, scalable, on-demand FPGA prototyping. Four independent DRAM interfaces in F1 make it possible to pack up to four nodes into one FPGA. At the same time, the F1 instance's ability to perform direct FPGA-to-FPGA communication through the PCIe interface makes SMAPPIC's coherent inter-node interconnect possible.

There are many unique use cases that are impossible or significantly harder without SMAPPIC. We discuss some of them in this paper, including large-scale multi-node architecture modeling, accelerator evaluation and verification, hardware/software co-development, in situ studies of a custom architecture interaction with a cloud infrastructure, cost-efficient architecture modeling, remote work, education, and more.

Our contributions include:
• The first open-source prototype platform for shared memory multi-node architectures on cloud FPGAs
• The first open-source RISC-V multi-node system with unified memory
• The first open-source 48-core 64-bit RISC-V system with coherent memory
• The first demonstration of full-stack Linux running on a 48-core RISC-V system in NUMA mode
• The first comparison of architecture modeling methods in the cloud with respect to cost-efficiency
• Demonstration of custom prototype interacting with public cloud services

2 BACKGROUND

2.1 AWS EC2 F1 Infrastructure

AWS debuted EC2 F1 instances as an option in its public cloud in November 2016 as a way to offer its customers customizable acceleration capabilities. AWS currently offers three F1 instances: f1.2xlarge, f1.4xlarge, and f1.16xlarge; their parameters and prices are listed in Table 1. Overall, users can get up to 8 Xilinx VU9P [64] FPGAs inside one instance. Each hour of a single FPGA usage costs $1.65.

Table 1: Available AWS EC2 F1 instances [44].

Instance        f1.2xl     f1.4xl     f1.16xl
#vCPUs          8          16         64
Host Memory     122 GB     244 GB     976 GB
Storage         470 GB     940 GB     3760 GB
#FPGAs          1          2          8
FPGA Memory     64 GB      128 GB     512 GB
Pricing         $1.65/hr   $3.30/hr   $13.20/hr
Hardware price  ≈ $8000    ≈ $16000   ≈ $64000

[Figure 2: F1 instances organization. F1 FPGA contains two partitions: Custom Logic (CL) and Hard Shell (HS). Users can fully customize CL; HS is fixed and provided by AWS. All communication with the host happens through HS, where requests and responses are converted from the host-FPGA PCIe connection to AXI4/AXI-Lite interfaces and back.]

Fig. 2 shows a logical representation of F1 instances [42]. FPGAs in F1 are divided into two parts: 1) a Hard Shell (HS) provided by AWS that must be included in the final FPGA image, and 2) Custom Logic (CL), where developers have complete control over instantiated logic. Developers can access four DDR4 memory controllers and three AXI-Lite interfaces for logic configuration and management per FPGA. Data movement to/from the instance is performed through two AXI4 interfaces: one inbound and one outbound. The HS converts these AXI4 signals to/from PCIe Gen3 x16 connections. Depending on the target address, the outbound AXI4 request is routed to one of the FPGAs connected to the host or to the host itself.

The described setup is easy to recreate locally, given sufficient funds. Similar boards can be bought directly from Xilinx [62] or other vendors [12]. HS uses Xilinx XDMA IP under the hood, which is provided with each copy of Xilinx Vivado tools [63]. However, we estimate that the upfront cost of a similar system with a single FPGA (including server, FPGA, and FPGA memory) is about $8000, which can be too large of an investment for many researchers, especially if sporadically used.

2.2 BYOC and OpenPiton

BYOC [10] is a hardware framework for heterogeneous-ISA research. It supports the connection of cores with different ISAs and microarchitectures as well as accelerator integration. The high-level design of BYOC is shown in Fig. 3.

[Figure 3: BYOC principal diagram (reproduced from [10]). BYOC has a modular open-source design, configurable cache subsystem, and a wide range of integrated cores and ISAs.]

BYOC is built on top of the OpenPiton [9] framework and Piton processor [31] and provides a modern, tiled manycore design with a cache coherence protocol and a Network-on-Chip (NoC) interconnect scalable to up to 500 million cores. BYOC's cache subsystem is configurable and allows choosing different cache sizes and associativities. Users can select one of the ten provided core models. Notably, the framework already integrates Ariane [65], AnyCore [17], BlackParrot [41], OpenSparcT1 [37], PicoRV32 [60], and ao486 [39] cores. If this choice is not enough for the users, they can choose to connect their design using the Transaction Response Interface (TRI).

TRI is the gateway between a core and the central part of BYOC: its memory subsystem (see Fig. 3). This interface and BYOC Private Cache (BPC) were designed to isolate cores from details of the coherence protocol in the underlying cache subsystem, thereby simplifying the integration of new compute units (cores) into BYOC.


BPC can work as either a private L1 or L2 cache. BPCs are interconnected with each other and slices of distributed shared last-level cache (LLC) through three NoCs. The LLC also uses the NoCs to interact with the memory controller and peripherals, all located inside the chipset.

The BYOC framework has an open-source codebase and is written in Verilog and SystemVerilog. This modular and configurable design makes it an excellent platform for experimenting and developing new hardware/software features; SMAPPIC leverages it to provide the same features in its nodes. At the same time, BYOC provides only limited FPGA prototyping capabilities without multi-node support.

3 ARCHITECTURE

A SMAPPIC prototype consists of one or several nodes; each node represents a single chip or die of the target system. Nodes can be independent of each other and implement separate instances of the prototyped design, or they can be connected with SMAPPIC's specially designed inter-node interconnect to build a prototype of a single large shared memory multi-node system. We measured the inter-FPGA round-trip latency on the PCIe bus in an F1 instance to be about 1250 nanoseconds. At a typical FPGA frequency of about 100 MHz, this latency equals 125 cycles, which is the same order of magnitude as the inter-socket latency in a typical multi-socket system. Therefore, a multi-node system can be modeled either by putting all nodes into one FPGA or by distributing nodes across multiple FPGAs.

The nodes in our implementation are separate BYOC instances. However, the inter-node interconnect does not rely on any fundamental properties of BYOC, and BYOC can be replaced with any node model with multi-node capabilities. We chose BYOC because it strengthens SMAPPIC's advantages: modularity, configurability, and ease of use.

3.1 Inter-node Interconnect

SMAPPIC's inter-node interconnect is designed to make large-scale multi-node prototypes possible. It binds together nodes located in the same FPGA and in different FPGAs. To achieve this, the interconnect encapsulates traffic between nodes into AXI4 write requests. This solution allows connecting nodes on the same FPGA using the AXI4 crossbar and nodes on different FPGAs using the AXI4-PCIe transducer provided in Hard Shell. The encapsulation does not change the traffic and does not significantly rely on packet structure, and BYOC can be replaced with any other node model with inter-node capabilities fairly easily.

[Figure 4: SMAPPIC in 4x1x12 configuration. Inter-node NoC packets are first routed towards Tile 0, then pushed out in the northbound direction into the inter-node bridge, where they are encapsulated into AXI4 requests and tunneled in the PCIe bus to other nodes in other FPGAs. In the inter-node bridge, the aw channel carries the transfer info (destination node ID, source node ID, valid bits for flits), the w channel carries three NoC flits, the ar and r channels request and return flow-control credits, and the b channel carries acknowledgements.]

Fig. 4 shows the high-level design of SMAPPIC in a 4x1x12 configuration. Each node is located in its own FPGA and communicates with other nodes through inbound and outbound AXI4 interfaces. We will go through an example of what happens during a BPC cache miss to show the functionality of SMAPPIC. Blue numbers and lines in Fig. 4 schematically show the sequence of events and NoC communication.


• Stage 1: A core on node 0 in tile 3 does a memory read that causes L1 and BPC cache misses. BPC issues a read request to the cache line's home LLC slice in this case. The homing mechanism in BPC decides where the cache line's home is. The original BYOC design supports multi-chip configuration through a special hardware/software mechanism called Coherence Domain Restriction [22]. Unlike BYOC, SMAPPIC's homing mechanism was changed to distribute cache lines across all nodes in the system and work out of the box without software support. The read is local if the cache line's home slice is located in the same node. However, in our example, the core is in node 0, and the home slice is in node 3, making this request inter-node.
• Stage 2: NoC routers are programmed to route inter-node packets into tile 0, then in the northbound direction, where they leave the node and enter the inter-node bridge specifically designed to encode NoC packets into outgoing AXI4 requests and vice versa.
• Stage 3: A NoC packet is encapsulated into AXI4 write requests. The request's address encodes the destination node ID, source node ID, and valid bits for the flits encoded in the write data (a sketch of this encoding appears after this list). To guarantee the absence of deadlocks in the system, the NoCs must have credit-based flow controls. To return credits, the sending side periodically issues an AXI4 read request to the receiving side and gets the number of returned credits in response.
• Stages 4-5: The request is converted into a PCIe transfer inside HS and sent to the other FPGA. This PCIe traffic goes directly from FPGA to FPGA and does not use the host CPU.
• Stages 6-7: The response from the home slice is sent back to node 0 in the form of another PCIe transfer.
• Stage 8: The PCIe transfer is converted into an AXI4 request inside HS.
• Stage 9: The AXI4 request is decoded inside the inter-node bridge. The valid bits are taken from the request address, and NoC flits are taken from the request data. The request's address also contains the source node ID, which is used for credit accounting. If the inter-node bridge gets an AXI4 read request, it returns the credits in response.
• Stage 10: The response is routed back to the original core in tile 3 on node 0.
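The exact bit layout of the encapsulating AXI4 address is not specified in the paper; the C sketch below shows one plausible encoding/decoding of the kind the inter-node bridge performs (field widths and positions are our illustrative assumptions, not SMAPPIC's actual format).

```c
#include <stdint.h>

/* Hypothetical layout of the AXI4 write address that tunnels one
 * inter-node NoC transfer: destination node ID, source node ID, and a
 * valid mask saying which of the (up to three) flits carried in the
 * write data are meaningful. Field positions are illustrative only.   */
#define DST_SHIFT   40
#define SRC_SHIFT   32
#define VALID_SHIFT 0
#define NODE_MASK   0xFFull
#define VALID_MASK  0x7ull

uint64_t encode_bridge_addr(unsigned dst_node, unsigned src_node,
                            unsigned valid_flits) {
    return ((uint64_t)(dst_node & NODE_MASK) << DST_SHIFT) |
           ((uint64_t)(src_node & NODE_MASK) << SRC_SHIFT) |
           ((uint64_t)(valid_flits & VALID_MASK) << VALID_SHIFT);
}

void decode_bridge_addr(uint64_t addr, unsigned *dst_node,
                        unsigned *src_node, unsigned *valid_flits) {
    *dst_node    = (unsigned)((addr >> DST_SHIFT) & NODE_MASK);
    /* The source node ID is what the receiver uses for credit accounting. */
    *src_node    = (unsigned)((addr >> SRC_SHIFT) & NODE_MASK);
    *valid_flits = (unsigned)((addr >> VALID_SHIFT) & VALID_MASK);
}
```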
3.2 Memory Interfaces

The F1 infrastructure provides developers access to up to four DRAM interfaces. Developers are expected to use AXI4 interfaces for read and write requests, and BYOC's original memory controller is incompatible with this protocol.

[Figure 5: NoC-AXI4 bridge transduces requests from BYOC NoC protocol to AXI4 protocol. The bridge buffers requests for non-blocking operation and aligns requests to a 64-byte boundary to conform to AXI4 protocol requirements.]

To manipulate the memory interfaces, we designed a NoC-AXI4 memory controller that transduces requests from the native NoC protocol to the AXI4 interface and sends responses back via the NoC to the requesting tile. Its high-level design is shown in Fig. 5. Incoming requests from the NoC are first deserialized and then forwarded to the management module. There, requests are buffered to enable non-blocking functionality and higher throughput. Requests are then steered into the corresponding engine: memory reads into the read engine and memory writes into the write engine. The engines assign an AXI4-ID to each request and save the Miss Status Handling Register (MSHR), the ID-MSHR mapping, the origin of the request in the system, and other miscellaneous information. The request's address and data (in the case of a write) are then aligned to a 64-byte boundary to fulfill the AXI4 protocol requirements. Upon response arrival, ID-MSHR mapping is used to restore the initial request's MSHR. If it is a read response and the original request is smaller than 64 bytes, the necessary portion of bytes is selected. The response and data (in case of a read) are then pushed into the NoC serializer, after which they leave the memory controller.
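As a rough illustration of the alignment step described above, the sketch below (plain C, behavioral only; the real logic is RTL inside the read/write engines) rounds a request down to a 64-byte boundary, issues a full-line access, and extracts the originally requested bytes from the returned line.

```c
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64u

/* Behavioral stand-in for one AXI4 read of a full 64-byte line; in the
 * real design this is the read engine's job. Here it just returns zeros. */
static void axi4_read_line(uint64_t line_addr, uint8_t line[LINE_BYTES]) {
    (void)line_addr;
    memset(line, 0, LINE_BYTES);
}

/* Serve a sub-line NoC read of 'len' bytes at 'addr' (assumed not to
 * straddle a line): align the address down to a 64-byte boundary, fetch
 * the whole line, then copy out only the bytes the requester asked for,
 * as the management module does when the response arrives.             */
void serve_noc_read(uint64_t addr, uint32_t len, uint8_t *out) {
    uint64_t line_addr = addr & ~(uint64_t)(LINE_BYTES - 1); /* 64 B align */
    uint32_t offset    = (uint32_t)(addr - line_addr);
    uint8_t  line[LINE_BYTES];

    axi4_read_line(line_addr, line);
    memcpy(out, line + offset, len); /* byte selection for sub-line reads */
}
```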
3.3 RISC-V Interrupt Controller

Another essential part of SMAPPIC is its RISC-V interrupt controller. We want to make the transition to our platform as seamless for researchers as possible. With the growing popularity of RISC-V in the community, it was important for us to have this functionality available out of the box.

In the RISC-V specification [5], cores are notified about pending interrupts by asserting a wire going straight from the interrupt controller into the core. This specification lacks scalability because a) in the case of a manycore system, a wire goes across the whole node, uses significant routing resources, and affects the timing of the design, and b) in the case of a multi-node system, it is difficult or impossible to route a wire across the node boundary.

[Figure 6: Interrupt packetizer and depacketizer enable sending interrupts across node boundaries.]

To deal with this problem, we designed an interrupt packetizer and depacketizer. Fig. 6 shows their functionality. The interrupt packetizer scans the interrupt controller's outputs, and when they change, it notifies the appropriate core about the change by sending a NoC packet. On the core side, the interrupt depacketizer sniffs the network traffic, and when an interrupt packet arrives, the interrupt wires are asserted or deasserted depending on the packet contents.
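The packetizer/depacketizer pair is implemented in RTL; the short C model below is only a behavioral illustration of the mechanism (packet fields and function names are ours): edge-detect the interrupt controller outputs, ship each change as a small packet, and re-drive plain interrupt wires at the receiving core.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_CORES 48

/* Minimal interrupt-change packet: which core, which line, new level. */
struct irq_packet {
    uint16_t core_id;
    uint8_t  irq_line;   /* e.g., timer, software, external */
    bool     level;
};

/* Stub standing in for injecting a packet into the NoC. */
static void noc_send(struct irq_packet pkt) {
    printf("irq packet: core %d line %d level %d\n",
           pkt.core_id, pkt.irq_line, (int)pkt.level);
}

/* Packetizer side: compare the interrupt controller's current outputs
 * with the last observed values and emit one packet per changed wire.  */
void packetize_irqs(const bool cur[NUM_CORES], bool last[NUM_CORES],
                    uint8_t irq_line) {
    for (uint16_t c = 0; c < NUM_CORES; c++) {
        if (cur[c] != last[c]) {
            struct irq_packet pkt = { c, irq_line, cur[c] };
            noc_send(pkt);
            last[c] = cur[c];
        }
    }
}

/* Depacketizer side: on packet arrival, drive the target core's wire. */
void depacketize_irq(struct irq_packet pkt, bool irq_wires[NUM_CORES]) {
    irq_wires[pkt.core_id] = pkt.level; /* assert or deassert */
}
```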


Table 2: Prototyped System Parameters

Instruction set                RISC-V 64-bit
Operating system               Linux v5.12
Frequency                      100 MHz
Core                           Ariane
Core pipeline                  In-order, 6 stages
Branch history table entries   128
ITLB entries                   16
DTLB entries                   16
L1D cache                      8 KB, 4 ways
L1I cache                      16 KB, 4 ways
BPC cache                      8 KB, 4 ways
LLC cache slice                64 KB, 4 ways
DRAM latency                   80 cycles
Inter-node round-trip latency  125 cycles

[Figure 7: Inter-core latencies in multi-node prototype in cycles, shown as a heatmap of sender core index vs. receiver core index. There are four clearly visible NUMA domains in the system. Round-trip latency inside one domain is about 100 cycles, and across domains – 250 cycles, 2.5 times higher.]

3.4 I/O Interfaces

3.4.1 UART and Serial Interfaces. UART is an essential interface for SMAPPIC: it is used for all the external interactions like loading tests into memory and console I/O operations. However, the F1 infrastructure does not provide a UART interface to the developer.

To tackle this issue, we encapsulate UART into the AXI-Lite protocol and use one of the available AXI-Lite interfaces in F1 to further tunnel data from the FPGA to the host. The UART to AXI-Lite conversion is done with a Xilinx UART16550 IP [61] included with Xilinx Vivado. Further inside the host system, we developed a new program that creates a virtual serial device and tunnels data from the PCIe driver into this virtual serial device.

We instantiate two UART devices inside each BYOC instance: one with the standard baud rate of 115200 bits/second for console interactions and instance management and one with an "overclocked" transmission rate of around 1 Mbit/second for data transmission. We used the overclocked device to connect the prototype to the Internet using the standard modem connection utility pppd.
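The host-side "virtual serial device" program is not listed in the paper; the sketch below shows the general shape such a bridge can take, assuming the XDMA driver exposes a character device for the tunneled UART stream (the /dev path and the byte-stream framing are our assumptions, not SMAPPIC's actual protocol).

```c
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Bridge a hypothetical XDMA character device (carrying the tunneled
 * UART stream) to a pseudo-terminal so that ordinary tools can treat
 * the prototype's console as a regular serial port.                   */
int main(void) {
    int fpga = open("/dev/xdma0_user_uart", O_RDWR); /* assumed device path */
    int pty  = posix_openpt(O_RDWR | O_NOCTTY);
    if (fpga < 0 || pty < 0) { perror("open"); return 1; }
    grantpt(pty);
    unlockpt(pty);
    printf("virtual serial device: %s\n", ptsname(pty));

    struct pollfd fds[2] = { { fpga, POLLIN, 0 }, { pty, POLLIN, 0 } };
    char buf[256];
    for (;;) {
        if (poll(fds, 2, -1) < 0) break;
        for (int i = 0; i < 2; i++) {
            if (fds[i].revents & POLLIN) {
                ssize_t n = read(fds[i].fd, buf, sizeof(buf));
                if (n <= 0) return 0;
                /* forward bytes to the other endpoint */
                if (write(fds[i ^ 1].fd, buf, (size_t)n) != n) return 1;
            }
        }
    }
    return 0;
}
```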
3.4.2 Virtual SD Card. Like many datacenter FPGAs, the F1 FPGA does not have an SD card slot. However, BYOC requires an SD card controller to be able to provide a filesystem and run Linux. To solve this problem, we introduce the notion of a "Virtual device" into SMAPPIC. We call a device virtual if it does not physically exist in the system; all requests to this device are instead forwarded into the SMAPPIC system's main memory. Virtual devices provide only the functionality of the original device and do not model performance properly.

Since F1 lacks an SD card slot, we make the device virtual and map it into the top half of the FPGA's DRAM; the remaining bottom half is used for the prototype's main memory. For SD card initialization, we wrote a specialized Linux driver that is run on the host system. It performs the writes to the addresses inside the FPGA's PCIe address space, and these writes eventually appear on the FPGA's inbound AXI4 bus. This way, we can inject NoC flits inside the system. In particular, we can inject NoC packets that target the prototype's memory controller and perform writes in the SD card region of memory.
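The initialization driver is only described at a high level; the fragment below sketches the same idea in user-space C instead of a kernel driver (the /dev path, the DRAM split, and the copy granularity are all illustrative assumptions): the SD card image is simply written through the FPGA's PCIe address space into the top half of the device DRAM.

```c
#define _XOPEN_SOURCE 600
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define FPGA_DRAM_BYTES  (16ull << 30)           /* assumed DRAM size per node */
#define SD_REGION_OFFSET (FPGA_DRAM_BYTES / 2)   /* virtual SD card: top half  */

/* Copy an SD card image into the virtual SD card region through the
 * XDMA host-to-card channel. Writes issued here surface on the FPGA's
 * inbound AXI4 bus and eventually reach the prototype's memory
 * controller, which stores them in the SD card region of DRAM.        */
static int load_sd_image(const char *image_path) {
    int img = open(image_path, O_RDONLY);
    int h2c = open("/dev/xdma0_h2c_0", O_WRONLY);  /* assumed device node */
    if (img < 0 || h2c < 0) { perror("open"); return -1; }

    static char buf[1 << 16];                      /* 64 KB chunks */
    off_t dst = (off_t)SD_REGION_OFFSET;
    ssize_t n;
    while ((n = read(img, buf, sizeof(buf))) > 0) {
        if (pwrite(h2c, buf, (size_t)n, dst) != n) { perror("pwrite"); return -1; }
        dst += n;
    }
    close(img);
    close(h2c);
    return 0;
}

int main(int argc, char **argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s <sd-image>\n", argv[0]); return 1; }
    return load_sd_image(argv[1]) ? 1 : 0;
}
```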
3.5 Modeling Off-node Interfaces

The node's interaction with the outside world cannot be mapped into FPGA gates. This includes inter-node communication, memory operations, input/output (I/O) operations, network operations, etc. To combat this modeling challenge, we include a traffic shaper with configurable bandwidth and latency in the inter-node bridge and memory controller. This solution provides performance models of interconnect and memory interfaces in addition to the functional models described above. I/O operations are modeled only functionally in SMAPPIC.
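The paper does not give the traffic shaper's internals; the toy C model below illustrates the general mechanism a configurable latency/bandwidth shaper implements (parameter names and the cycle-based accounting are our assumptions): every packet leaves no earlier than its arrival plus the configured latency, and no faster than the configured bytes-per-cycle budget allows.

```c
#include <stdint.h>

/* Toy latency/bandwidth shaper model (behavioral, not the RTL). */
struct shaper {
    uint64_t latency_cycles;    /* fixed added latency             */
    uint64_t bytes_per_cycle;   /* bandwidth budget                */
    uint64_t next_free_cycle;   /* earliest cycle the link is free */
};

/* Return the cycle at which a packet of 'bytes' arriving at
 * 'arrival_cycle' is allowed to leave the shaper.               */
uint64_t shaper_release_cycle(struct shaper *s, uint64_t arrival_cycle,
                              uint64_t bytes) {
    /* Bandwidth: packet occupies the link for ceil(bytes / rate) cycles. */
    uint64_t service = (bytes + s->bytes_per_cycle - 1) / s->bytes_per_cycle;
    uint64_t start   = arrival_cycle > s->next_free_cycle
                         ? arrival_cycle : s->next_free_cycle;
    s->next_free_cycle = start + service;
    /* Latency: add the configured fixed delay on top of serialization. */
    return start + service + s->latency_cycles;
}
```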
4 SMAPPIC USE CASES

This section describes our vision of how SMAPPIC can be employed for various problems.

4.1 Large-Scale Multi-node Architecture Modeling

In our first example, we build an FPGA prototype of a 64-bit cache-coherent RISC-V multi-node system with a total of 48 cores. Using this prototype, we run a full-stack Linux operating system and evaluate its implementation of NUMA mode in the RISC-V architecture.


[Figure 8: The difference in performance between NUMA-aware and non-NUMA-aware Linux kernel running the multi-threaded integer sorting benchmark (runtime in seconds vs. thread count). NUMA mode reduces runtimes by 1.6-2.8 times. The effect is especially strong with a high thread count because of increased inter-node communication in non-NUMA mode.]

[Figure 9: Thread allocation effect on integer sorting benchmark runtime (runtime in seconds vs. number of active chiplets). The number of threads is fixed to 12; they are distributed between either 1, 2, 3, or 4 active nodes. In NUMA mode, more active nodes mean increased memory access latency and slightly higher runtime. With NUMA mode disabled, more active nodes mean a higher chance of thread and data collocation on the same node and slightly better performance.]

We use the integrated Ariane core in this example. SMAPPIC makes synthesizing the FPGA prototype straightforward: the user simply specifies the preferred core type, the number of tiles per node, the number of nodes per FPGA, and the number of FPGAs. As an example, we create a prototype with a 4x1x12 configuration: four FPGAs, one node per FPGA, and 12 cores per node for a total of 48 Ariane cores. Table 2 shows the parameters of the prototyped system. The FPGA image generation takes about 2 hours on a desktop machine equipped with an Intel Core i9-9900K CPU and requires about 32 GB of memory. Final postprocessing is done by AWS in their datacenter and takes another 2 hours. Loading the bitstream into the FPGA takes about 10 seconds.

As a part of this case study, we run the full-stack Linux kernel in the generated prototype. The Linux kernel works out of the box in SMAPPIC. Non-uniform memory access (NUMA) mode has been available for RISC-V platforms since release 5.12, and since our setup has NUMA properties, we enabled it in the kernel configuration. The software reads NUMA parameters from the device tree during the boot process.

The first metric of interest for us is the communication latency between cores. Fig. 7 shows the heatmap of round-trip latencies between different cores in the system in cycles. The locations of the four NUMA nodes are clearly visible in this chart: cores on the same node have a round-trip latency of about 100 cycles, and round-trip latency for cores on different nodes is about 250 cycles, 2.5 times higher than intra-node. This number is roughly the same for multi-socket Intel server platforms [21], which validates our idea of modeling multi-node systems with multi-FPGA setups. The inter-node link latency can be adjusted to represent systems with a slower interconnect, e.g., Ampere Altra [4].

We then evaluate the real-world performance of our prototype with an integer sorting benchmark from the NAS parallel benchmark (NPB) suite [7]. The benchmark uses a parallel bucket sorting algorithm to sort a 134-million-long array of integers (NPB's class C input size). Fig. 8 shows the performance scaling with the number of cores with Linux NUMA mode on versus off. NUMA mode reduces runtime by 1.6-2.8 times depending on the number of threads. The difference is especially large with a high thread count; memory allocation becomes even more critical in this situation because of the large volume of inter-node communication and associated congestion. We also compared the scaling characteristics of the system with a similar multi-socket server setup from Intel. We saw a similar dynamic going from low thread count to full system usage.

Fig. 9 shows the same benchmark run in the same system, but this time we fix the number of threads to 12 and pin them to either 1, 2, 3, or 4 nodes using the taskset utility to study how much suboptimal thread allocation hurts performance. Distributing threads among more nodes in NUMA mode leads to higher average memory access latency and slightly higher runtime. Interestingly, with NUMA mode disabled, the situation is reversed, and the performance slightly increases when cores are spaced further apart. We think that this happens because, with more nodes involved, there is a higher chance that at least some of the threads and application data are collocated on the same node.

The studies shown in this section are not possible on any simulator or emulator currently available to researchers. Only the SMAPPIC prototype can model benchmark runs in a large NUMA system running full-stack Linux at speeds around 100 MHz.

4.2 Accelerator Verification and Evaluation

SMAPPIC is a valuable framework for accelerator development. It includes tools for easy accelerator integration into the node's architecture. Users can employ the provided coherence tools to enable fine-grained interaction between the accelerator and the rest of the system. Superior speeds of FPGA emulation allow evaluating and verifying the accelerator with much larger input datasets than possible with purely software tools.


[Figure 10: Performance evaluation of the GNG accelerator in SMAPPIC. The speedup is calculated relative to the software implementation. Four execution modes are compared: software (SW), 1 number per fetch (1), 2 numbers per fetch (2), and 4 numbers per fetch (4). The hardware implementation works much faster, especially with combined fetches.]

[Figure 11: Performance evaluation of MAPLE engine in SMAPPIC on the SPMV, SPMM, SDHP, and BFS benchmarks. Speedup is calculated relative to single-thread execution. Three execution modes are shown: single-thread, single-thread with MAPLE engine, and 2-thread.]

As an example, we built a prototype of a CPU with an integrated Gaussian Noise Generator (GNG) accelerator [28] from the OpenCores project [36]. Gaussian noise is widely used in communication channel testing, molecular dynamics simulation, and financial modeling [27]. The accelerator generates noise using the random numbers from the Tausworthe Generator [59]. It took around 1.5 hours of a graduate student's work to do the initial integration and debugging. We use a SMAPPIC 1x1x2 configuration for simplicity, but users can quickly scale this prototype by only changing command line options passed to build scripts. Tile 0 contains the Ariane core, and tile 1 contains the GNG.

To fetch a new number from the GNG, Ariane issues a non-cacheable load to the accelerator's memory address. We develop and study two integration schemes: base and optimized. In the base scheme, each non-cacheable load returns a single 16-bit number. In the optimized scheme, the load returns two or four 16-bit numbers packed into one 32-bit or 64-bit integer respectively. The optimization replaces multiple fetch operations with only one and reduces the number of needed loads.
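On the software side, fetching numbers from the GNG boils down to uncached MMIO loads; the sketch below contrasts the base and optimized schemes in C (the accelerator's address and register layout are hypothetical, and the mapping is assumed to be set up as uncacheable so every load actually reaches the device).

```c
#include <stdint.h>

/* Hypothetical MMIO base of the GNG accelerator in tile 1. */
#define GNG_BASE 0xA0000000ul

/* Base scheme: one non-cacheable load returns a single 16-bit sample. */
static inline int16_t gng_fetch_one(void) {
    volatile uint16_t *reg = (volatile uint16_t *)GNG_BASE;
    return (int16_t)*reg;
}

/* Optimized scheme: one 64-bit load returns four packed 16-bit samples,
 * replacing four separate fetches with a single one.                  */
static inline void gng_fetch_four(int16_t out[4]) {
    volatile uint64_t *reg = (volatile uint64_t *)GNG_BASE;
    uint64_t packed = *reg;
    for (int i = 0; i < 4; i++)
        out[i] = (int16_t)(packed >> (16 * i));
}
```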
After the prototype is built, users can easily evaluate and verify the accelerator with SMAPPIC's built-in machinery. For illustrative purposes, we compare the accelerator's performance with the software implementation executed in Ariane. We execute two benchmarks on top of Linux. Benchmark A ("Noise generator") generates 64 MB of noise to compare software and hardware GNG implementations. Benchmark B ("Noise applier") converts generated noise into 8-bit integers and applies it to a 32 MB long sequence to compare two implementations in a real-life scenario.

Fig. 10 shows performance results. The accelerator significantly decreases benchmarks' execution times, especially when it combines multiple number fetches into one request. The performance difference is smaller in benchmark B because less of the execution time is accelerated here. The whole process of accelerator evaluation took about one graduate student's workday from start to end, including accelerator integration, debugging, optimization, synthesizing FPGA images, developing accelerator software, and running benchmarks.

4.3 Hardware/Software Co-development

Many works on programming languages and operating systems propose ideas based on slight architectural changes to existing hardware [22, 48, 66]. Researchers from these fields want to be able to run full-stack operating systems and be able to interact with the hardware in real time. Due to its scalability, speed, cost, configurability, and full-stack SMP Linux support, SMAPPIC is a perfect tool for such work.

To demonstrate this, we reevaluate the MAPLE [38] engine in SMAPPIC and compare it with parallel 2-thread software execution. MAPLE is an accelerator for Decoupled Access Execute programs [49]. The Execute part runs in the general-purpose core, while the Access part is offloaded to MAPLE. MAPLE is programmed before the execution starts to asynchronously fetch the data from memory and supply it to the Execute core right when needed. This approach grants substantial performance gains in applications with irregular memory access patterns through fine-grained software/hardware interaction.

We use SMAPPIC's 1x1x6 configuration with Ariane cores in tiles 0, 1, 4, 5, and MAPLE engines in tiles 2, 3. The benchmarks and datasets are the same as in the original work. MAPLE's code and integration with OpenPiton are open-sourced, so the engine works in SMAPPIC out of the box. We would like to note that integrating even such delicate mechanisms in SMAPPIC is easy and takes only about a hundred lines of Verilog code [38].

Fig. 11 shows MAPLE's performance results. The evaluation shows that MAPLE is more efficient than the second thread in latency-bound applications. Unsurprisingly, SMAPPIC produces charts identical to the ones described in the original MAPLE work because the modeled hardware is very similar. However, this case study is interesting in another aspect.

The test execution would often hang the whole system during our initial testing.


After a couple of hours of debugging, we noticed that the problem disappeared when each program's thread was pinned to some core. Eventually, we found a piece of Verilog code in MAPLE's design that memorizes the core ID at the execution start and uses this information later for some operations. This design is problematic for running programs on top of the operating system because of the thread migration.
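The workaround we used while debugging, pinning every thread of the program to a fixed core, can be expressed in a few lines; the sketch below is a generic Linux example using pthread_setaffinity_np (it is our illustration, not code from the MAPLE test suite).

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to one core so the core ID observed by the
 * hardware never changes for the lifetime of the thread.             */
static int pin_self_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *worker(void *arg) {
    int core = *(int *)arg;
    if (pin_self_to_core(core) != 0)
        fprintf(stderr, "failed to pin to core %d\n", core);
    /* ... Execute-side work that interacts with MAPLE would go here ... */
    return NULL;
}

int main(void) {
    pthread_t threads[2];
    int cores[2] = {0, 1};          /* e.g., Ariane cores in tiles 0 and 1 */
    for (int i = 0; i < 2; i++)
        pthread_create(&threads[i], NULL, worker, &cores[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(threads[i], NULL);
    return 0;
}
```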
Interestingly, MAPLE's designers had not seen this problem before we contacted them. Even though they used an FPGA for testing and executed tests on top of Linux as well, their FPGA was small and could fit only two Ariane cores inside. This case provides yet another reason why SMAPPIC helps so much with design verification. The ability to execute tests in large prototypes on top of the operating system at high frequencies is currently provided in only our open-source tool.

4.4 In Situ Studies of Custom Architecture Interaction with Cloud Infrastructure

While there are many ways to model new architectures, modeling their interactions with existing cloud infrastructure can be challenging. Researchers that study custom cloud architectures would greatly benefit from the ability to integrate the new hardware directly into a datacenter setting. SMAPPIC makes this task a lot easier because the prototype is located in the cloud, runs full-stack SMP Linux, and supports network connections via its serial interfaces. This aspect allows users to transform their custom architectures into first-class citizens inside the AWS infrastructure.

To prove this, we build a cloud pipeline that includes a SMAPPIC 1x1x4 prototype and multiple AWS services. The prototype executes the Nginx web server and a PHP backend script, as well as interacts with AWS Lambda and S3 services. Fig. 12 shows the built pipeline. The incoming HTTP request goes through the following series of events:

[Figure 12: SMAPPIC in an experimental cloud pipeline. SMAPPIC's cloud nature allows for in situ cloud studies of custom architecture's interaction with other datacenter services.]

(1) The HTTP request enters the AWS Lambda function. The function acts as a gateway from the Internet to the private network. It redirects requests to the Nginx web server running on the SMAPPIC prototype.
(2) The web server guides the request through the Common Gateway Interface (CGI) into the PHP script executed in the prototype alongside the web server.
(3) The script fetches the data from the AWS S3 service.
(4) The script attaches the current time to the data and returns it as a response to the web server.
(5) The web server returns the response back to the Lambda function.
(6) The function returns the response back to the original client.

This workflow presents a unique opportunity for researchers to integrate their custom design into a massive warehouse-scale datacenter and test its behavior and performance while interacting with complex cloud services. It enables the researchers to run RISC-V in the cloud and on the Internet with AWS's vast APIs and observe the execution details directly inside the machine.

Table 3: Requirements for host machines and cheapest suitable AWS EC2 instances [44].

             Sniper    gem5      Verilator   SMAPPIC & FireSim
#vCPUs       2         1         1           1
Memory       8 GB      64 GB     8 GB        8 GB
FPGAs        0         0         0           1
Instance     t3.m      r5.2xl    t3.m        f1.2xl
Price/hr     $0.04     $0.45     $0.04       $1.65

4.5 Cost-Efficient Architecture Modeling

In Sec. 4.1 we demonstrate that SMAPPIC is a valuable tool for large-scale architecture modeling. However, SMAPPIC also shows superior cost-efficiency among modeling tools in the cloud. We compare the cost of modeling a similar 64-bit RISC-V system with different tools to demonstrate this. We use Sniper [14] as an example of a parallel simulator, gem5 [30] as an example of a cycle-level simulator, and Verilator [51] as an example of an RTL simulator. We also compare our work against another cloud FPGA tool, FireSim [24]. We use two FireSim configurations: 1) without network simulation and with one quad-core RocketChip instance, and 2) a supernode configuration with network simulation and with four single-core RocketChip instances on one FPGA. We want to compare with the most cost-efficient FireSim configuration as measured by running benchmarks.

We execute benchmarks on a real RISC-V chip, the SiFive Freedom U740 SoC [46], part of the HiFive Unmatched [47] development platform, to get baseline results when running benchmarks in real RISC-V silicon.

We use a SMAPPIC prototype in a 1x4x2 configuration without an inter-node interconnect. This configuration allows us to model four independent prototypes inside one FPGA and significantly improves the cost-efficiency of SMAPPIC. The studied system parameters are listed in Table 2.


[Figure 13: Modeling costs in dollars for the SPECint 2017 benchmarks. Results for gem5 are not shown because its costs are 4-5 orders of magnitude higher than in all other benchmarks. SMAPPIC delivers the best cost efficiency in a cloud setting out of all methods.]

[Figure 14: Cost of FPGA modeling in the cloud vs. locally on-premises, as a function of modeling time in days. The cloud setup is more cost-efficient for up to 200 days of continuous experiments.]

All other tools were configured to model systems as similar to ours as possible. It is worth noting that all of the tools below are configured to use RISC-V workloads, except Sniper. Sniper has had initial support for RISC-V since version 7.3 [50], but it could not run RISC-V benchmarks out of the box or after we attempted to find and fix the problem. Therefore, we run Sniper with the X86-64 version of the benchmarks. We do not expect this to affect our results significantly.

Our target benchmark suite is SPECint 2017 with the "test" input size for simplicity. We could not run the benchmark perlbench on Sniper because it does not support forks inside executing workloads. For each of the studied tools, we find the smallest and cheapest suitable instance in AWS EC2 offerings and then use its price per hour to calculate the cost of benchmark execution. Our minimal instance selections are described in Table 3.

The main limiting factor of software tools is the amount of memory the host instance has: Sniper needs at least 8 GB for proper functioning, and most of the gem5 runs could fit into a 64 GB allocation. However, two of the benchmarks required even more memory. For example, the gem5 run with the mcf benchmark completes only on a host with 350 GB of memory.

Modeling costs are shown in Fig. 13. As we expected, SMAPPIC shows the best cost-efficiency of all the studied approaches. The main reason is its ability to allocate multiple prototype instances in a single FPGA and the high frequency provided by direct RTL-to-gate mapping. gem5 shows the worst results, with 4-5 orders of magnitude higher cost than other tools, because of its large memory requirements and long execution time. For this reason, we did not include gem5 results in the chart.

Compared to a single-node FireSim configuration, SMAPPIC shows about four times better cost-efficiency. Single-node FireSim and SMAPPIC have similar frequencies. However, SMAPPIC models four separate instances in parallel on one FPGA, whereas FireSim models only one. Similar to SMAPPIC, supernode FireSim puts four systems into one FPGA but still shows worse results because of its lower frequency.

For the sake of completeness, we also compare Verilator with SMAPPIC. In our small "Hello World" example, the Verilator simulation takes 65 seconds, and SMAPPIC finishes in 4 ms. Based on the data from Table 3, we can derive that SMAPPIC is about 1600 times more cost-efficient. Such a significant advantage makes a big difference in design verification.
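One way to reproduce the roughly 1600x figure from the numbers in Table 3 is sketched below; it assumes, as in the 1x4x2 setup above, that the f1.2xlarge FPGA-hour is effectively shared by the four independent prototypes packed into one FPGA (this accounting is our reconstruction, not spelled out in the text).

```c
#include <stdio.h>

int main(void) {
    /* "Hello World" runtimes and hourly prices from the text and Table 3. */
    double verilator_s   = 65.0;     /* Verilator simulation time, seconds */
    double smappic_s     = 0.004;    /* SMAPPIC time, seconds              */
    double t3m_per_hr    = 0.04;     /* Verilator host: t3.m               */
    double f1_per_hr     = 1.65;     /* SMAPPIC host: f1.2xl               */
    double protos_per_f1 = 4.0;      /* 1x4x2: four prototypes per FPGA    */

    double verilator_cost = verilator_s / 3600.0 * t3m_per_hr;
    double smappic_cost   = smappic_s   / 3600.0 * (f1_per_hr / protos_per_f1);

    printf("cost ratio ~= %.0f\n", verilator_cost / smappic_cost); /* ~1576 */
    return 0;
}
```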
As mentioned in Section 2.1, researchers can avoid using cloud infrastructure for FPGA prototypes and purchase similar hardware for on-premise usage instead. Our estimates of the price of similar servers with similar FPGA boards are shown in Table 1.

Based on these estimates and F1 instance pricing, we can compare the costs of using cloud and on-premises FPGA prototypes. Results are shown in Fig. 14.

Given the same prototype configuration running in both cases, it takes more than 200 days of continuous modeling to justify buying a private FPGA setup and using it on-premises. We consider this scenario possible only for huge research groups where such hardware will be shared between many users. However, even in such cases, the burden of proper hardware management and the lack of ability to scale modeling make the cloud option more preferable.
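The 200-day figure can be sanity-checked with back-of-the-envelope arithmetic from Table 1: at $1.65 per FPGA-hour, spending the roughly $8000 upfront cost of a comparable single-FPGA server takes about 4850 hours of continuous use (this simple model ignores secondary costs such as power, maintenance, and the host instance itself).

```c
#include <stdio.h>

int main(void) {
    double onprem_hw_cost = 8000.0;  /* estimated single-FPGA server, Table 1 */
    double cloud_per_hour = 1.65;    /* one F1 FPGA-hour                      */

    double breakeven_hours = onprem_hw_cost / cloud_per_hour;
    double breakeven_days  = breakeven_hours / 24.0;

    printf("break-even: %.0f hours (~%.0f days)\n",
           breakeven_hours, breakeven_days);  /* ~4848 hours, ~202 days */
    return 0;
}
```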
4.6 Remote Work

Using a cloud setup might be the only option during unexpected workflow disruptions like the current COVID-19 pandemic. In the last year and a half, together with our collaborators from other institutions, we finished two large chip tapeouts. These chips are based on heterogeneous architectures with multiple accelerators and complex interactions through a coherent cache subsystem. Such chip parameters require extensive design testing with real, large workloads and driver-controlled accelerators running under a full-stack operating system.

In such conditions, SMAPPIC was the only realistic option to perform design verification. RTL simulators do not have high enough performance to make the modeling of complete real system (with OS) workloads possible. At the same time, physical FPGA access was not available due to the Work-From-Home order.

After we finished the initial integration of accelerators, the prototype and all of its functionality worked out of the box. The cloud nature of SMAPPIC allowed us to launch multiple instances and work in parallel while being physically distributed in many different time zones.


We found two critical bugs inside the accelerators themselves, one bug inside the RISC-V interrupt controller, and one problem with the interfacing of an accelerator and the TRI interface.

4.7 Using SMAPPIC for Education

The value of using FPGAs in teaching architecture classes has been noted by multiple educators [35, 52]. However, before the era of cloud FPGAs, educational institutions could afford only limited use of FPGAs in classes. SMAPPIC is a potent tool for overcoming this problem. With a proper management tool, educators can use SMAPPIC to launch hundreds of instances on demand and pay only for the time students actually use FPGAs for their assignments or projects. SMAPPIC is one of the tools used for final projects by students taking the computer architecture class at our institution this year.

4.8 Platform Limitations

Even though SMAPPIC allows for many unique use cases that are not achievable with other tools, there are a couple of factors to consider before choosing it as a modeling platform.

First, while SMAPPIC provides a parametrized model of a typical tiled multicore architecture with NoC and directory coherence, there are only a couple of provided fixed core models. For example, all case studies in this section used the Ariane core: an in-order core with a single-issue pipeline, which may not be representative of large-core out-of-order designs. Users have two options on how to approach this limitation.

(1) Provide the RTL design of their own custom core, use the functionality provided by the BYOC framework and TRI interface, and integrate it into the SMAPPIC model. In this case, the modeling results will be more accurate and represent the real system. However, this option requires some effort from the user.
(2) Use one of the core models provided by SMAPPIC out of the box. This option requires less effort, but the modeling results will not represent the final target system exactly if its core differs from the core chosen in SMAPPIC. Nevertheless, we think this option is valuable for researchers who are more focused on obtaining qualitative results or care more about the interactions in the system as a whole. For example, this option is useful for studying the interaction between the cores and the integrated accelerators.

Another factor to consider when choosing SMAPPIC for research is the physical constraints of the FPGA infrastructure:

(1) There is a finite number of gates per FPGA; this restricts the amount of logic that can be allocated per prototype and its frequency. For example, F1 FPGAs can fit at most 12 Ariane tiles at 75 MHz frequency. Table 4 shows various possible SMAPPIC setups. All configurations use core parameters from Table 2.
(2) Each F1 FPGA contains only four memory slots, and each node in SMAPPIC has a separate memory controller. This requirement puts a cap of at most four nodes per FPGA in SMAPPIC.
(3) Only four FPGAs in the F1 instance are connected with low-latency PCIe links. Therefore, one SMAPPIC prototype can include at most four FPGAs.
(4) The PCIe connection has a round-trip latency of 1250 ns. This latency puts a lower limit on the modeled inter-node link latency. For example, at 100 MHz, the round-trip latency is equal to 125 cycles.

Table 4: SMAPPIC configurations with frequencies and LUT utilizations. Configurations are shown in BxC format, where B is the number of nodes per FPGA, and C is the number of tiles per node. Tile parameters are listed in Table 2.

Configuration   Frequency   LUT Utilization
1x12            75 MHz      97%
1x10            100 MHz     83%
2x4             100 MHz     73%
2x5             75 MHz      88%
4x2             100 MHz     87%

Most of these limits are not fundamental and are only dictated by the AWS infrastructure: the number of tiles per FPGA is limited by the FPGA size, the number of nodes per FPGA is limited by the number of DRAM slots per FPGA, and the number of FPGAs in a prototype is limited by the number of FPGAs in one F1 instance. The limits can be raised in the future if AWS updates its cloud FPGA infrastructure or other cloud providers introduce a larger selection of FPGAs.

We also want to note that SMAPPIC is not a total substitute for classic software simulators. Researchers often value flexibility over execution time, cost, or accuracy. FPGA emulation can be a poor fit in such situations because each system parameter change leads to complete prototype regeneration, which can take hours. However, the use cases discussed above demonstrate situations where SMAPPIC is the users' only real option due to its scale, speed, accuracy, and cost-efficiency.

5 RELATED WORK

5.1 FPGAs for Architecture Modeling

There are a few commercial tools for FPGA-accelerated design modeling, such as Cadence Palladium [13], Mentor Veloce [45], and Synopsys Zebu [55]. These systems are proprietary, have incredibly high prices (millions of dollars), and are not accessible to most researchers. SMAPPIC is open-source and uses a publicly available cloud infrastructure instead.

ProtoFlex [18] and FAST [15] use hardware to accelerate only the performance layer of the model. Some researchers propose using an FPGA to model only subparts of the whole system, e.g., the memory subsystem [53] or NoC [26, 29]. In contrast, SMAPPIC models each node entirely inside the FPGA, making performance-accuracy tradeoffs more visible.

RAMP Gold [57], DIABLO [56], and HASim [40] put both performance and functional models into the FPGA. However, these frameworks use various techniques to pack more target logic inside a single FPGA at the cost of lower frequency. SMAPPIC uses direct gate-to-RTL mapping and keeps frequency as high as possible instead. SMAPPIC models large systems by partitioning across multiple FPGAs. Such a scaling strategy has good cost-efficiency in a cloud setting.

5.2 Multi-node System Modeling
PriME [23] supports the modeling of independent chips without any connectivity. MGPUSim [54] was designed specifically for multi-die modeling but only in the GPU domain. Sniper [14] supports modeling multi-die systems with shared memory but at much lower speeds than FPGAs. Unlike all these tools, SMAPPIC supports fast modeling of heterogeneous shared-memory multi-node systems.
Enzian [20] and QuickIA [16] are similar to SMAPPIC in that they provide the foundation for building heterogeneous multi-chip prototypes. However, both frameworks use a CPU + FPGA design where only the logic inside the FPGA can be modified. In contrast, SMAPPIC is fully customizable and can be changed through configuration options or by modifying the source code.
BYOC [10] and OpenPiton [9, 11] provide separate tools for multi-chip modeling and FPGA prototyping. However, they do not allow building FPGA prototypes of multi-chip architectures. Moreover, they support multi-chip capabilities only through a special hardware/software mechanism called Coherence Domain Restriction [22]. In contrast, SMAPPIC changes the original BYOC framework to provide multi-node capabilities out of the box without additional software modifications.

5.3 Multi-FPGA Architecture Modeling
FireSim [24] and DIABLO [56] support the modeling of multi-chip systems distributed over multiple FPGAs. However, they do so only without a unified memory, with the different chips connected over Ethernet links. Unlike these tools, SMAPPIC provides a coherent inter-node interconnect that tunnels the existing inter-node interconnect through the FPGA's PCIe connection. This mechanism allows the modeling of multi-node systems with unified memory.
The Twinstar platform [6] can model the full 16-core Blue Gene/Q compute chip but uses a local, sophisticated, expensive, and proprietary setup consisting of 60 FPGAs and can reach speeds of only 4 MHz. Moreover, the logic partitioning between the FPGAs in Twinstar was done manually by engineers. SMAPPIC divides the design between FPGAs based on the user's input and leverages the cloud infrastructure to free researchers from these complications.

5.4 Architecture Modeling in the Cloud and Its Cost Optimization
The FAME work [58] estimates the cost of software architecture simulation in the cloud. It concludes that running FPGA simulations on-premises is more cost-effective than running software simulations in the cloud. Our work makes much broader estimates and adds price calculations for cloud FPGA simulations and different types of software simulations.
The FireSim paper [24] compares the cost of cloud and on-premises FPGA simulation and concludes that large-scale simulations are financially sustainable only when using cloud infrastructure. Unlike SMAPPIC, FireSim is not compared with software simulators in terms of pricing.

5.5 Architecture Modeling in Cloud FPGAs
The main previous work on utilizing cloud FPGAs for architecture modeling is FireSim [24]. We consider SMAPPIC and FireSim to be two significantly different frameworks because they have different goals that influence their designs.
FireSim is designed from the ground up to be a datacenter simulator, and it places a strong emphasis on simulating the network between separate nodes. This factor puts an additional restriction on the simulation speed and on the size of individual computational nodes. Unlike FireSim, SMAPPIC is designed to make the modeling of a single system scalable. This goal allows SMAPPIC to scale prototypes up to large, 48-core, 4-node systems and to be cost-effective when modeling small systems simultaneously. This result is achievable because we do not have the additional requirement of simulating IP network interactions.

6 CONCLUSION
This paper presents SMAPPIC – the first open-source prototype platform for shared memory multi-die architectures on cloud FPGAs. SMAPPIC achieves ease of use, cost-effectiveness, and multi-node scalability through a modular, hierarchical, and parametrizable structure based on a cloud FPGA backend. By using BYOC's infrastructure and AWS EC2 F1 instances, SMAPPIC provides a configurable underlying design that allows researchers to quickly and easily start using cloud-enabled FPGA prototypes.
SMAPPIC makes many unique use cases possible or significantly more accessible, in particular: large-scale multi-node architecture modeling, accelerator evaluation and verification, hardware/software co-development, in situ studies of custom architecture interaction with a cloud infrastructure, cost-efficient modeling, remote work, and education. We use SMAPPIC to build the first open-source multi-node 48-core 64-bit Linux-capable RISC-V system with unified memory and to make the first comparison of the architecture modeling methods' costs in a cloud. SMAPPIC has the potential to enable even more exceptional studies in the future.

ACKNOWLEDGMENTS
We would like to thank Alexey Lavrov, Jonathan Balkind, and the whole OpenPiton/BYOC team for sharing their expertise on the two frameworks, Marcelo Orenes-Vera for his help in reimplementing MAPLE in SMAPPIC, Georgios Tziantzioulis for his aid with GNG accelerator integration, and Ang Li, Zujun Tan, Yebin Chon, Rohan Prabhakar, August Ning, and the anonymous reviewers for their valuable feedback. We also thank the AWS Cloud Credit for Research program. This material is based on research sponsored by the National Science Foundation under Grant No. CNS-1823222, Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) under agreements No. FA8650-18-2-7852 and FA8650-18-2-7862. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) or the U.S. Government. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.


REFERENCES
[1] ACM. 2020. ISCA '20: Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture. https://dl.acm.org/doi/proceedings/10.5555/3426241
[2] ACM. 2021. ASPLOS 2021: Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. https://dl.acm.org/doi/proceedings/10.1145/3445814
[3] ACM. 2021. PLDI 2021: Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation. https://dl.acm.org/doi/proceedings/10.1145/3453483
[4] Dr. Ian Cutress Andrei Frumusanu. 2021. AMD 3rd Gen EPYC Milan Review: A Peak vs Per Core Performance Balance. https://www.anandtech.com/show/16529/amd-epyc-milan-review/4
[5] John Hauser Andrew Waterman, Krste Asanović. 2021. RISC-V Specification. https://riscv.org/technical/specifications/
[6] Sameh Asaad, Ralph Bellofatto, Bernard Brezzo, Chuck Haymes, Mohit Kapur, Benjamin Parker, Thomas Roewer, Proshanta Saha, Todd Takken, and José Tierno. 2012. A Cycle-Accurate, Cycle-Reproducible Multi-FPGA System for Accelerating Multi-Core Processor Simulation. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (Monterey, California, USA) (FPGA '12). Association for Computing Machinery, New York, NY, USA, 153–162. https://doi.org/10.1145/2145694.2145720
[7] David H. Bailey. 2011. NAS Parallel Benchmarks. Springer US, Boston, MA, 1254–1259. https://doi.org/10.1007/978-0-387-09766-4_133
[8] Raghuraman Balasubramanian, Vinay Gangadhar, Ziliang Guo, Chen-Han Ho, Cherin Joseph, Jaikrishnan Menon, Mario Paulo Drumond, Robin Paul, Sharath Prasad, Pradip Valathol, and Karthikeyan Sankaralingam. 2015. MIAOW - An open source RTL implementation of a GPGPU. In 2015 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS XVIII). 1–3. https://doi.org/10.1109/CoolChips.2015.7158663
[9] Jonathan Balkind, Ting-Jung Chang, Paul J. Jackson, Georgios Tziantzioulis, Ang Li, Fei Gao, Alexey Lavrov, Grigory Chirkov, Jinzheng Tu, Mohammad Shahrad, and David Wentzlaff. 2020. OpenPiton at 5: A Nexus for Open and Agile Hardware Design. IEEE Micro 40, 4 (2020), 22–31. https://doi.org/10.1109/MM.2020.2997706
[10] Jonathan Balkind, Katie Lim, Michael Schaffner, Fei Gao, Grigory Chirkov, Ang Li, Alexey Lavrov, Tri M. Nguyen, Yaosheng Fu, Florian Zaruba, Kunal Gulati, Luca Benini, and David Wentzlaff. 2020. BYOC: A "Bring Your Own Core" Framework for Heterogeneous-ISA Research. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (Lausanne, Switzerland) (ASPLOS '20). Association for Computing Machinery, New York, NY, USA, 699–714. https://doi.org/10.1145/3373376.3378479
[11] Jonathan Balkind, Michael McKeown, Yaosheng Fu, Tri Nguyen, Yanqi Zhou, Alexey Lavrov, Mohammad Shahrad, Adi Fuchs, Samuel Payne, Xiaohua Liang, Matthew Matl, and David Wentzlaff. 2016. OpenPiton: An Open Source Manycore Research Framework. SIGARCH Comput. Archit. News 44, 2 (mar 2016), 217–232. https://doi.org/10.1145/2980024.2872414
[12] Bittware. [n. d.]. BittWare XUP-P3R. https://www.bittware.com/fpga/xup-p3r/
[13] Cadence. [n. d.]. High performance hardware and software verification and debug of complex SoCs and Systems. https://www.cadence.com/en_US/home/tools/system-design-and-verification/emulation-and-prototyping/palladium.html
[14] Trevor E. Carlson, Wim Heirman, and Lieven Eeckhout. 2011. Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation. In SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. 1–12. https://doi.org/10.1145/2063384.2063454
[15] Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil A. Patil, William Reinhart, Darrel Eric Johnson, Jebediah Keefe, and Hari Angepat. 2007. FPGA-Accelerated Simulation Technologies (FAST): Fast, Full-System, Cycle-Accurate Simulators. In 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007). 249–261. https://doi.org/10.1109/MICRO.2007.36
[16] Nagabhushan Chitlur, Ganapati Srinivasa, Scott Hahn, P K Gupta, Dheeraj Reddy, David Koufaty, Paul Brett, Abirami Prabhakaran, Li Zhao, Nelson Ijih, Suchit Subhaschandra, Sabina Grover, Xiaowei Jiang, and Ravi Iyer. 2012. QuickIA: Exploring heterogeneous architectures on real prototypes. In IEEE International Symposium on High-Performance Comp Architecture. 1–8. https://doi.org/10.1109/HPCA.2012.6169046
[17] Rangeen Basu Roy Chowdhury, Anil K. Kannepalli, and Eric Rotenberg. 2016. AnyCore-1: A comprehensively adaptive 4-way superscalar processor. In 2016 IEEE Hot Chips 28 Symposium (HCS). 1–1. https://doi.org/10.1109/HOTCHIPS.2016.7936237
[18] Eric S. Chung, Michael K. Papamichael, Eriko Nurvitadhi, James C. Hoe, Ken Mai, and Babak Falsafi. 2009. ProtoFlex: Towards Scalable, Full-System Multiprocessor Simulations Using FPGAs. ACM Trans. Reconfigurable Technol. Syst. 2, 2, Article 15 (jun 2009), 32 pages. https://doi.org/10.1145/1534916.1534925
[19] Alibaba Cloud. [n. d.]. Alibaba Cloud Computing Services. https://www.alibabacloud.com/product/computing?spm=a3c0i.239195.6791778070.157.7af912bdlh7EC5#J_9413186770
[20] David Cock, Abishek Ramdas, Daniel Schwyn, Michael Giardino, Adam Turowski, Zhenhao He, Nora Hossle, Dario Korolija, Melissa Licciardello, Kristina Martsenko, Reto Achermann, Gustavo Alonso, and Timothy Roscoe. 2022. Enzian: An Open, General, CPU/FPGA Platform for Systems Software Research. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (Lausanne, Switzerland) (ASPLOS 2022). Association for Computing Machinery, New York, NY, USA, 434–451. https://doi.org/10.1145/3503222.3507742
[21] Andrei Frumusanu. 2021. Intel 3rd Gen Xeon Scalable (Ice Lake SP) Review: Generationally Big, Competitively Small. https://www.anandtech.com/show/16594/intel-3rd-gen-xeon-scalable-review/4
[22] Yaosheng Fu, Tri M. Nguyen, and David Wentzlaff. 2015. Coherence domain restriction on large scale systems. In 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 686–698. https://doi.org/10.1145/2830772.2830832
[23] Yaosheng Fu and David Wentzlaff. 2014. PriME: A parallel and distributed simulator for thousand-core chips. In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 116–125. https://doi.org/10.1109/ISPASS.2014.6844467
[24] Sagar Karandikar, Howard Mao, Donggyu Kim, David Biancolin, Alon Amid, Dayeol Lee, Nathan Pemberton, Emmanuel Amaro, Colin Schmidt, Aditya Chopra, Qijing Huang, Kyle Kovacs, Borivoje Nikolic, Randy Katz, Jonathan Bachrach, and Krste Asanović. 2018. FireSim: FPGA-accelerated Cycle-exact Scale-out System Simulation in the Public Cloud. In Proceedings of the 45th Annual International Symposium on Computer Architecture (Los Angeles, California) (ISCA '18). IEEE Press, Piscataway, NJ, USA, 29–42. https://doi.org/10.1109/ISCA.2018.00014
[25] Hassan Khan, David Hounshell, and Erica Fuchs. 2018. Science and research policy at the end of Moore's law. Nature Electronics 1 (01 2018). https://doi.org/10.1038/s41928-017-0005-9
[26] Kouadri-Mostefaoui, Abdellah-Medjadji, Benaoumeur Senouci, and Frederic Petrot. 2008. Large Scale On-Chip Networks: An Accurate Multi-FPGA Emulation Platform. In 2008 11th EUROMICRO Conference on Digital System Design Architectures, Methods and Tools. 3–9. https://doi.org/10.1109/DSD.2008.130
[27] D.-U. Lee, J.D. Villasenor, W. Luk, and P.H.W. Leong. 2006. A hardware Gaussian noise generator using the Box-Muller method and its error analysis. IEEE Trans. Comput. 55, 6 (2006), 659–671. https://doi.org/10.1109/TC.2006.81
[28] Guangxi Liu. [n. d.]. OpenCores Gaussian Noise Generator. https://opencores.org/projects/gng
[29] Yangfan Liu, Peng Liu, Yingtao Jiang, Mei Yang, Kejun Wu, Weidong Wang, and Qingdong Yao. 2010. Building a multi-FPGA-based emulation framework to support networks-on-chip design and verification. International Journal of Electronics - INT J ELECTRON 97 (10 2010), 1241–1262. https://doi.org/10.1080/00207217.2010.512017
[30] Jason Lowe-Power, Abdul Mutaal Ahmad, Ayaz Akram, Mohammad Alian, Rico Amslinger, Matteo Andreozzi, Adrià Armejach, Nils Asmussen, Brad Beckmann, Srikant Bharadwaj, Gabe Black, Gedare Bloom, Bobby R. Bruce, Daniel Rodrigues Carvalho, Jeronimo Castrillon, Lizhong Chen, Nicolas Derumigny, Stephan Diestelhorst, Wendy Elsasser, Carlos Escuin, Marjan Fariborz, Amin Farmahini-Farahani, Pouya Fotouhi, Ryan Gambord, Jayneel Gandhi, Dibakar Gope, Thomas Grass, Anthony Gutierrez, Bagus Hanindhito, Andreas Hansson, Swapnil Haria, Austin Harris, Timothy Hayes, Adrian Herrera, Matthew Horsnell, Syed Ali Raza Jafri, Radhika Jagtap, Hanhwi Jang, Reiley Jeyapaul, Timothy M. Jones, Matthias Jung, Subash Kannoth, Hamidreza Khaleghzadeh, Yuetsu Kodama, Tushar Krishna, Tommaso Marinelli, Christian Menard, Andrea Mondelli, Miquel Moreto, Tiago Mück, Omar Naji, Krishnendra Nathella, Hoa Nguyen, Nikos Nikoleris, Lena E. Olson, Marc Orr, Binh Pham, Pablo Prieto, Trivikram Reddy, Alec Roelke, Mahyar Samani, Andreas Sandberg, Javier Setoain, Boris Shingarov, Matthew D. Sinclair, Tuan Ta, Rahul Thakur, Giacomo Travaglini, Michael Upton, Nilay Vaish, Ilias Vougioukas, William Wang, Zhengrong Wang, Norbert Wehn, Christian Weis, David A. Wood, Hongil Yoon, and Éder F. Zulian. 2020. The gem5 Simulator: Version 20.0+. https://doi.org/10.48550/ARXIV.2007.03152
[31] Michael McKeown, Alexey Lavrov, Mohammad Shahrad, Paul J. Jackson, Yaosheng Fu, Jonathan Balkind, Tri M. Nguyen, Katie Lim, Yanqi Zhou, and David Wentzlaff. 2018. Power and Energy Characterization of an Open Source 25-Core Manycore Processor. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). 762–775. https://doi.org/10.1109/HPCA.2018.00070
[32] Microsoft. [n. d.]. NP-series. https://docs.microsoft.com/en-us/azure/virtual-machines/np-series
[33] Samuel Naffziger, Noah Beck, Thomas Burd, Kevin Lepak, Gabriel H. Loh, Mahesh Subramony, and Sean White. 2021. Pioneering Chiplet Technology and Design for the AMD EPYC™ and Ryzen™ Processor Families: Industrial Product. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). 57–70. https://doi.org/10.1109/ISCA52012.2021.00014
[34] NVIDIA. [n. d.]. NVIDIA Deep Learning Accelerator. http://nvdla.org/
[35] Joaquín Olivares, José Manuel Palomares, José Manuel Soto, and Juan Carlos Gámez. 2010. Teaching microprocessors design using FPGAs. In IEEE EDUCON 2010 Conference. 1189–1193. https://doi.org/10.1109/EDUCON.2010.5492390
[36] OpenCores. 2022. OpenCores project. https://opencores.org/
[37] Oracle. [n. d.]. OpenSPARC T1. https://www.oracle.com/servers/technologies/opensparc-t1-page.html
[38] Marcelo Orenes-Vera, Aninda Manocha, Jonathan Balkind, Fei Gao, Juan L. Aragón, David Wentzlaff, and Margaret Martonosi. 2022. Tiny but Mighty: Designing and Realizing Scalable Latency Tolerance for Manycore SoCs. In Proceedings of the 49th Annual International Symposium on Computer Architecture (New York, New York) (ISCA '22). Association for Computing Machinery, New York, NY, USA, 817–830. https://doi.org/10.1145/3470496.3527400
[39] Aleksander Osman. [n. d.]. ao486. https://github.com/alfikpl/ao486
[40] Michael Pellauer, Michael Adler, Michel Kinsy, Angshuman Parashar, and Joel Emer. 2011. HAsim: FPGA-based high-detail multicore simulation using time-division multiplexing. In 2011 IEEE 17th International Symposium on High Performance Computer Architecture. 406–417. https://doi.org/10.1109/HPCA.2011.5749747
[41] Daniel Petrisko, Farzam Gilani, Mark Wyse, Dai Cheol Jung, Scott Davidson, Paul Gao, Chun Zhao, Zahra Azad, Sadullah Canakci, Bandhav Veluri, Tavio Guarino, Ajay Joshi, Mark Oskin, and Michael Bedford Taylor. 2020. BlackParrot: An Agile Open-Source RISC-V Multicore for Accelerator SoCs. IEEE Micro 40, 4 (2020), 93–102. https://doi.org/10.1109/MM.2020.2996145
[42] Amazon Web Services. 2021. AWS Shell Interface Specification. https://github.com/aws/aws-fpga/blob/master/hdk/docs/AWS_Shell_Interface_Specification.md
[43] Amazon Web Services. 2022. Amazon EC2 F1 Instances. https://aws.amazon.com/ec2/instance-types/f1/
[44] Amazon Web Services. 2022. Amazon EC2 On-Demand Pricing. https://aws.amazon.com/ec2/pricing/on-demand/
[45] Siemens. [n. d.]. Veloce HW-Assisted Verification System. https://eda.sw.siemens.com/en-US/ic/veloce/
[46] SiFive. 2021. SiFive FU740-C000 Manual. https://sifive.cdn.prismic.io/sifive/de1491e5-077c-461d-9605-e8a0ce57337d_fu740-c000-manual-v1p3.pdf
[47] SiFive. 2022. HiFive Unmatched. https://www.sifive.com/boards/hifive-unmatched
[48] Dimitrios Skarlatos, Apostolos Kokolis, Tianyin Xu, and Josep Torrellas. 2020. Elastic Cuckoo Page Tables: Rethinking Virtual Memory Translation for Parallelism. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (Lausanne, Switzerland) (ASPLOS '20). Association for Computing Machinery, New York, NY, USA, 1093–1108. https://doi.org/10.1145/3373376.3378493
[49] James E. Smith. 1982. Decoupled Access/Execute Computer Architectures. SIGARCH Comput. Archit. News 10, 3 (apr 1982), 112–119. https://doi.org/10.1145/1067649.801719
[50] Sniper. 2021. Sniper Releases. http://snipersim.org/w/Releases
[51] Wilson Snyder. 2013. Verilator: Open simulation-growing up. DVClub Bristol (2013).
[52] Andrew Strelzoff. 2007. Teaching computer architecture with fpga soft processors. In ASEE Southeast Section Conference.
[53] Bharat Sukhwani, Thomas Roewer, Charles L. Haymes, Kyu-Hyoun Kim, Adam J. McPadden, Daniel M. Dreps, Dean Sanner, Jan Van Lunteren, and Sameh Asaad. 2017. ConTutto – A Novel FPGA-based Prototyping Platform Enabling Innovation in the Memory Subsystem of a Server Class Processor. In 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 15–26. https://doi.org/10.1145/3123939.3124535
[54] Yifan Sun, Trinayan Baruah, Saiful A. Mojumder, Shi Dong, Xiang Gong, Shane Treadway, Yuhui Bao, Spencer Hance, Carter McCardwell, Vincent Zhao, Harrison Barclay, Amir Kavyan Ziabari, Zhongliang Chen, Rafael Ubal, José L. Abellán, John Kim, Ajay Joshi, and David Kaeli. 2019. MGPUSim: Enabling Multi-GPU Performance Modeling and Optimization. In Proceedings of the 46th International Symposium on Computer Architecture (Phoenix, Arizona) (ISCA '19). Association for Computing Machinery, New York, NY, USA, 197–209. https://doi.org/10.1145/3307650.3322230
[55] Synopsys. [n. d.]. ZeBu Server 4. https://www.synopsys.com/verification/emulation/zebu-server.html
[56] Zhangxi Tan, Zhenghao Qian, Xi Chen, Krste Asanovic, and David Patterson. 2015. DIABLO: A Warehouse-Scale Computer Network Simulator Using FPGAs. SIGPLAN Not. 50, 4 (mar 2015), 207–221. https://doi.org/10.1145/2775054.2694362
[57] Zhangxi Tan, Andrew Waterman, Rimas Avizienis, Yunsup Lee, Henry Cook, David Patterson, and Krste Asanović. 2010. RAMP gold: an FPGA-based architecture simulator for multiprocessors. In Proceedings of the 47th Design Automation Conference. 463–468. https://doi.org/10.1145/1837274.1837390
[58] Zhangxi Tan, Andrew Waterman, Henry Cook, Sarah Bird, Krste Asanović, and David Patterson. 2010. A Case for FAME: FPGA Architecture Model Execution. In Proceedings of the 37th Annual International Symposium on Computer Architecture (Saint-Malo, France) (ISCA '10). Association for Computing Machinery, New York, NY, USA, 290–301. https://doi.org/10.1145/1815961.1815999
[59] Robert Tausworthe. 1965. Random Numbers Generated by Linear Recurrence Modulo Two. Mathematics of Computation - Math. Comput. 19 (05 1965), 201–201. https://doi.org/10.2307/2003345
[60] Clifford Wolf. [n. d.]. PicoRV32. https://github.com/cliffordwolf/picorv32
[61] Xilinx. 2016. Xilinx AXI UART 16550 v2.0. https://www.xilinx.com/support/documentation/ip_documentation/axi_uart16550/v2_0/pg143-axi-uart16550.pdf
[62] Xilinx. 2022. Alveo U250 Data Center Accelerator Card. https://www.xilinx.com/products/boards-and-kits/alveo/u250.html
[63] Xilinx. 2022. DMA/Bridge Subsystem for Express v4.1. https://www.xilinx.com/content/dam/xilinx/support/documentation/ip_documentation/xdma/v4_1/pg195-pcie-dma.pdf
[64] Xilinx. 2022. Xilinx Virtex UltraScale+. https://www.xilinx.com/products/silicon-devices/fpga/virtex-ultrascale-plus.html#productAdvantages
[65] Florian Zaruba and Luca Benini. 2019. The Cost of Application-Class Processing: Energy and Performance Analysis of a Linux-Ready 1.7-GHz 64-Bit RISC-V Core in 22-nm FDSOI Technology. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 27, 11 (Nov 2019), 2629–2640. https://doi.org/10.1109/tvlsi.2019.2926114
[66] Hansen Zhang, Soumyadeep Ghosh, Jordan Fix, Sotiris Apostolakis, Stephen R. Beard, Nayana P. Nagendra, Taewook Oh, and David I. August. 2019. Architectural Support for Containment-Based Security. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (Providence, RI, USA) (ASPLOS '19). Association for Computing Machinery, New York, NY, USA, 361–377. https://doi.org/10.1145/3297858.3304020

Received 2022-07-07; accepted 2022-09-22
