
Table of Contents

1- Introduction
2- Conceptual description of the network
3- Present-day implementation of the network
4- Network performance
5- Programming Data Vortex-enabled systems
6- Recent & ongoing developments
7- Conclusion

The Data Vortex Network: An overview of the architecture, implementations, and performance

1- Introduction
Among the greatest challenges in distributed and high-performance computing are congestion,
latency, and communication time variability between different parts of the network. The unique
Data Vortex network is designed from the ground up to solve these problems. It can communicate
large amounts of data with minimal congestion and consistent latencies, even in situations when
the communication pattern is chaotic and data is fragmented in small packets. The Data Vortex
topology is the network solution for processor-to-processor communication within a system and
core-to-core communication within a chip.
Interactic Holdings, LLC (DBA Data Vortex Technologies) has already developed several test
beds of this novel architecture, some of which are actively used to investigate scientific problems.
These include the design of protocols for remote memory access (Pacific Northwest National
Laboratory), distributed computing (Providentia Worldwide), fundamental physics research
(University of Ulm), molecular simulation (Indiana University Bloomington), and quantum
computing simulation (Data Vortex Technologies).
The performance of present-day Data Vortex-enabled systems has validated the mathematical
scalability of the network. Our ongoing hardware and software developments are the first steps
toward implementing the Data Vortex network into next generation technologies across multiple,
intersecting industries. These include enterprise infrastructure, graph analytics, and chip design:

• Cloud and enterprise infrastructure: Enterprise specialists, such as our partners at
Providentia Worldwide, have identified the Data Vortex network as a network of choice
for latency-sensitive data movement. The predictable latency for Data Vortex-enabled
systems allows for unparalleled scaling for consensus algorithms, elections, and data,
virtual machine, and container migrations. Distributed topologies like blockchain mining
networks, compute and rendering clusters, and a variety of other applications in distributed
networks can depend on guaranteed delivery with narrow sliding windows for more
effective governance and predictable scaling as node count increases. In 2018, Providentia
Worldwide demonstrated the efficacy of the DV network topology underneath the popular
enterprise RabbitMQ message broker, achieving world-class results across a variety of
message sizes. This work opens the DV network to distributed computing workloads and to
protocols underlying large-scale, latency-sensitive networks like cloud infrastructure.
Cloud users can run applications with tremendous performance improvements and solve
problems that are prohibitive with traditional interconnects. The right cloud service
provider could use the Data Vortex network for a serious competitive advantage.

• Graph analytics: As performance on present-day DV systems has demonstrated (see
Section 4), the Data Vortex network excels at small-message, unstructured communications,
particularly as the number of connected devices becomes quite large. Communication-
intensive applications span a wide range of problems, such as discovering previously
undetected trends within a large online consumer dataset or finding commonalities within
a social network.

• Processor and chip design: The Data Vortex Network-on-Chip lays the groundwork for
the next generation of hardware. The network can now connect thousands of cores across
multiple chips using core-to-core transfer logic, eliminating traditional communication
bottlenecks. Elimination of the PCIe bus could position the right vendor to design and
produce a differentiating and impressive processor.

The purpose of this white paper is to provide a concise, fact-based description of the network that
can be used to make managerial decisions regarding the applicability of the DV network to
distributed computing, next-generation chip design, cloud infrastructure, high performance
computing, underlying AI neural nets, and fundamental research.
In Section 2 we present a conceptual picture of the DV network. In Section 3 we show how the
network is implemented in practice with present-day hardware, taking into account the constraints
of real-world electronics, to validate the mathematical proof. In Section 4 we summarize the most
important benchmarks that differentiate the Data Vortex from other networks. In Section 5 we
explain how DV-enabled systems are programmed. Finally, in Section 6 we summarize recent
work on the DV network-on-chip and multi-level systems.



2- Conceptual description of the Data Vortex network

The Data Vortex network1 is composed of a collection of packet-carrying Data Vortex switches
and Data Vortex network interface devices. The Data Vortex switch consists of a collection of
richly connected rings. The rings and the connections between the rings are built using parallel
data busses. In a radix R switch with R = 2^N, the rings are arranged in N+1 levels. A packet on the
entry level, level N, the outermost level, can travel to any of the output ports. When a packet
travels from level N to level (N-1), the most significant bit of the binary address of the output is
fixed so that a packet on level N-1 can reach only half of the output ports; a packet on level N-2
can reach only one fourth of the output ports and so forth. This process continues so that when a
packet reaches level 0, the target output is determined.

Figure 1: Sketch of a Data Vortex switch.

In this most elementary Data Vortex switch example, there are five levels in the radix 16 switch.
In the input level (level 4), there is a single ring containing exactly 112 nodes (112 = 7 x 16). On
level 3 there are two rings each containing 56 nodes. This continues so that on level zero (the
output level) there are 16 rings each containing 7 nodes.

Consider a node N at an angle A on the bottom ring at level L. Denote the height H of N as a binary
integer in an L-bit field. In two ticks, a packet travels from node N to node M, the next node at the
same level L, which is where a packet moves if it cannot descend to the next level.

1 First patented in US Patent US5996020A, “Multiple Level, Minimum Logic Network”, by C. S. Reed (1999); later
patents include US Patent US20140341077A1, “Parallel Data Switch”, by C. S. Reed & D. Murphy (2011). Patents
held by Interactic Holdings, LLC.



The Height J of M is calculated as follows:
U = the binary integer produced by reversing the bits of H.
V = the rightmost L bits of U+1.
J = the binary integer produced by reversing the bits of V.
As an example, the permutation of the eight heights on the bottom ring of level three permutes (0,
1, 2, 3, 4, 5, 6, 7) to (4, 5, 6, 7, 2, 3, 1, 0).
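
This rule is easy to check in code. The following C sketch (an illustration for this paper only, not part of any Data Vortex design file) computes the height J of the next node M from the height H of a node on a level-L ring, and prints the level-three permutation quoted above:

    #include <stdio.h>

    /* Reverse the lowest 'bits' bits of x (illustrative helper). */
    static unsigned reverse_bits(unsigned x, unsigned bits)
    {
        unsigned r = 0;
        for (unsigned i = 0; i < bits; i++)
            r |= ((x >> i) & 1u) << (bits - 1u - i);
        return r;
    }

    /* Height J of the next same-level node reached from height H on a
       level-L ring, following the rule J = reverse(reverse(H) + 1). */
    static unsigned next_height(unsigned H, unsigned L)
    {
        unsigned U = reverse_bits(H, L);          /* reverse the bits of H     */
        unsigned V = (U + 1u) & ((1u << L) - 1u); /* rightmost L bits of U + 1 */
        return reverse_bits(V, L);                /* reverse back to obtain J  */
    }

    int main(void)
    {
        /* Level-3 ring: prints the permutation (4, 5, 6, 7, 2, 3, 1, 0). */
        for (unsigned H = 0; H < 8; H++)
            printf("%u -> %u\n", H, next_height(H, 3));
        return 0;
    }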

Let’s walk through the colored sections of Figure 1. An important feature of all the Data Vortex
networks is that when two nodes are positioned to send data to a third node, one of the sending
nodes has priority over the other sending node to send the data. This priority is based on position.
The priority is enforced by control information riding in the red lines. This control information is
sent by the node with the higher priority to the node with the lower priority. When a level x node
A and a level x+1 node B are each positioned to send a data packet to arrive at level x node C at a
time T, A has priority over B to send to C. When A sends a packet to C on a green ring, it sends
a blocking control signal to B on a red control line to enforce this priority. The node B can send a
packet PKT to node C only if 1) C is on a path to the target output port of PKT, as indicated
by the header of PKT; and 2) B is not blocked from sending PKT to C. If B does not send
PKT to C, then B sends PKT to the next node D on the ring containing B (so that PKT stays on level
x+1). It takes one tick for the first bit of a packet to travel on a blue line from node B to node C.
It takes two ticks for the first bit of a packet to travel on a green line from node B to node D (when
a packet stays on the same level). This extra tick can be the result of passing through a one-tick
delay in B or the result of a one-tick delay element (not pictured) on the green line from node B to
node D. The purpose of the added tick is to ensure that a blocking control bit on a red line arrives
before it is needed by the logic of node D. If C is not on a path to a target output of PKT,
then the permutation in the green rings guarantees that D can send PKT to a node that is on a path
to the target output of PKT.

The switch illustrated in Figure 1 is the simplest example of a Data Vortex switch. Each node B
on a level x of the switch has two output lines, one to a node C on level x-1 and one to a node D
on level x. In a more richly connected Data Vortex switch, the “double down” architecture, each
node has multiple data lines connected to lower level nodes. A topology for such a switch is
obtained by pairing up nodes on a given level, so that a node B on level x that is positioned to send
data to node C on level x-1 is paired with a node B’ on level x that is positioned to send data to the
node C’ on level x-1. Data lines are added so that B can send data to both C and C’ and B’ is also
positioned to send data to both C and C’. Moreover, there is logic that is connected to B, C, B’
and C’. The logic works as follows. If B is able to send to C or B’ is able to send to C’, then such
transfer or transfers take place. If neither of these transfers is possible, but B is able to send to C’
or B’ is able to send to C, then such transfer or transfers take place. This “double down” switch
has been implemented and studied in detail. All Data Vortex performance numbers are based on
the “double down” architecture.
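
The pairing rule can be summarized with a small C sketch (illustrative only; the flag and function names are ours, and the real switch implements this logic in hardware). The inputs indicate whether each sender holds a packet that may legally route to each receiver, i.e., the receiver lies on a path to the packet’s target output and is not blocked:

    #include <stdbool.h>
    #include <stdio.h>

    /* Possible transfers between the paired senders (B, B') on level x and
       the paired receivers (C, C') on level x-1. */
    struct dd_transfers {
        bool b_to_c, bp_to_cp;   /* "straight" transfers */
        bool b_to_cp, bp_to_c;   /* "crossed" transfers  */
    };

    /* Which transfers fire this tick: prefer the straight transfers; only if
       neither straight transfer is possible are the crossed ones attempted. */
    static struct dd_transfers double_down_arbitrate(struct dd_transfers in)
    {
        struct dd_transfers out = {false, false, false, false};
        if (in.b_to_c || in.bp_to_cp) {
            out.b_to_c   = in.b_to_c;
            out.bp_to_cp = in.bp_to_cp;
        } else {
            out.b_to_cp = in.b_to_cp;
            out.bp_to_c = in.bp_to_c;
        }
        return out;
    }

    int main(void)
    {
        struct dd_transfers in  = {false, false, true, true};  /* only crossed possible */
        struct dd_transfers out = double_down_arbitrate(in);
        printf("B->C': %d, B'->C: %d\n", out.b_to_cp, out.bp_to_c);  /* both fire */
        return 0;
    }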



3- Present-day practical implementation of the network

In a hardware implementation of the Data Vortex network, there are a number of constraints that
demand a modification of Figure 1. FPGAs are used, which have a limited clock rate, limited internal
logic performance, and a limited number of SerDes. Moreover, the layout inside the chip is mostly
two-dimensional, while the DV network is intrinsically three-dimensional. We solve these problems by
• Implementing multiple, parallel, independent DV networks to increase bandwidth
• Mapping the DV network into a two-dimensional circuit.
Since it is costlier to efficiently distribute a DV network among many FPGAs, we put an entire
radix 64 DV switch into a single FPGA. To increase compute node-to-compute node bandwidth,
we interface each compute node with 16 parallel DV networks, as shown in Figure 3. Each
Vortex Interface Controller (VIC) on the compute node chooses a DV network at random, thus
balancing the traffic load.

Figure 2: Sketch of a Data Vortex switch wiring

A 2D representation of the circuit in Verilog was created to wire a 3D network in a 2D chip. In
order to conceptually verify that this was efficient, we designed the 2D layout by hand. Figure 2
shows how the DV network was mapped into a two-dimensional circuit.
Figure 2 depicts a two-dimensional wiring diagram of a radix four switch of the type depicted in
Figure 1. The nodes depicted in the diagram are composed of a logic portion on the left followed
by a one tick delay element. Wires exiting the delay element to the right stay on a ring on the same
level. Wires exiting a logic element in a downward direction go down and to the left so that these



wires are exactly like the blue wires of Figure 1. The wiring is designed to be easy to fabricate as
there are just three modules (red, blue, and green) that must be implemented.
Figure 3 is a high-level view of a computer utilizing Data Vortex networks and commodity
processors. Data travels from the network switches of the type illustrated in Figure 1 and Figure
2 to the Vortex Interface Controller (VIC) chips on the red lines using packets with 64-bit headers
and 64-bit payloads. The packet header contains an operation code, a target VIC identification
number and a target SRAM address on the receiving VIC. Packets can be sent in groups and in
that case the header also contains an identifying group number. A receiving VIC contains counters
referred to as group counters that are initially set to the number of packets in the group. In one
important application, the packets in a group are targeted to DRAM on a target server. As packets
in a group arrive at the target VIC, the group counter is decremented by one count. When the
group counter reaches zero, a PCIe packet is formed and that packet is sent from the VIC to the
processor. Based on the group identification number, the data is sent from the VIC's SRAM to the
server's DRAM. It is an important feature of the architecture that the packets in a group can be
sent from a number of remote processors. In this way, the computer has a very general and efficient
fine-grained gather mechanism.2
A sending server is able to scatter 128-bit packets across the system. Using one method, a sending
processor loads remote VIC SRAM addresses into the SRAM of its VIC. Then the server sends
payloads to its VIC, and the VIC packet former makes packets using the header information from
its SRAM and the payloads from the server. In a second method, the payloads are stored in VIC
SRAM and headers are sent from the server. In either method, the packets are sent to the VIC output
switch. The VIC output switch randomly scatters the packets across the independent switches
using the blue lines in the network. In this way, the computer has a very general and efficient fine-
grained scatter mechanism.
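
The following C sketch is a minimal software model of this scatter/gather mechanism, written only to make the data flow concrete. All type, field, and function names are ours; the real VIC implements this logic in hardware, and the actual packet field widths and register layout are not reproduced here.

    #include <stdint.h>
    #include <stdio.h>

    /* Software model of the 128-bit packet described above: a 64-bit header
       (operation code, target VIC, target SRAM address, group number) and a
       64-bit payload.  The field widths are illustrative, not the real format. */
    typedef struct {
        uint16_t opcode;
        uint16_t target_vic;
        uint16_t sram_addr;
        uint16_t group_id;
        uint64_t payload;
    } dv_packet_t;

    /* Model of a receiving VIC: per-group counters preloaded with the group
       size, plus an SRAM staging buffer (sizes are arbitrary for this sketch). */
    typedef struct {
        int      group_counter[16];
        uint64_t sram[4096];
    } vic_model_t;

    /* Scatter side (first method in the text): the sending processor has
       already loaded remote headers into its VIC's SRAM; the packet former
       pairs each payload from the server with the next stored header. */
    static dv_packet_t vic_form_packet(const dv_packet_t *stored_headers,
                                       int index, uint64_t payload)
    {
        dv_packet_t p = stored_headers[index];  /* opcode, target VIC, address */
        p.payload = payload;
        return p;                               /* handed to the VIC output switch */
    }

    /* Gather side: each arriving packet of a group is staged in VIC SRAM and
       the group counter is decremented; when it reaches zero, the hardware
       forms a PCIe packet and moves the staged data to the server's DRAM. */
    static void vic_receive(vic_model_t *vic, const dv_packet_t *p)
    {
        vic->sram[p->sram_addr % 4096] = p->payload;
        if (--vic->group_counter[p->group_id % 16] == 0)
            printf("group %u complete: flush staged payloads to server DRAM\n",
                   (unsigned)p->group_id);
    }

    int main(void)
    {
        /* Example: one group of four packets scattered by a sender and
           gathered by the receiving VIC model. */
        dv_packet_t headers[4] = {
            { .opcode = 1, .target_vic = 7, .sram_addr = 0, .group_id = 0 },
            { .opcode = 1, .target_vic = 7, .sram_addr = 1, .group_id = 0 },
            { .opcode = 1, .target_vic = 7, .sram_addr = 2, .group_id = 0 },
            { .opcode = 1, .target_vic = 7, .sram_addr = 3, .group_id = 0 },
        };
        vic_model_t vic = { .group_counter = { [0] = 4 } };

        for (int i = 0; i < 4; i++) {
            dv_packet_t p = vic_form_packet(headers, i, 100 + i);
            vic_receive(&vic, &p);
        }
        return 0;
    }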

Figure 3: Overview of a current generation Data Vortex computer, comprising commodity servers,
Vortex Interface Controllers (VICs), and Data Vortex Switches. Systems built on this architecture
are presently housed at the US Departments of Energy and Defense.

2 US Patent US20140341077A1, “Parallel Data Switch”, by C. S. Reed & D. Murphy (2011).



4- Network performance
In the following two situations especially, the Data Vortex network has an advantage over
other systems (such as those using InfiniBand or other competing topologies):

• When transferring large numbers of small packets in non-structured, chaotic traffic. This
happens in situations where aggregation is very costly or impossible. As shown below in
the Random-Ring test, DV has a large advantage when aggregation is less than 1024
words of 8 bytes each. An extreme example of this situation is the Random-Access
(Giga Updates Per Second – “GUPS”) test, summarized below. The DV network
performs 3 times faster than IB on three-dimensional FFTs for the same reason.

• When there is congestion in the network. This happens when the number of nodes is very
large (much larger than 64 nodes). The Data Vortex was designed from the ground up to
be scalable, eliminating congestion issues even in multi-level systems. An example is the
proposed 512-node Data Vortex computer, which has been predicted mathematically to
perform as well on the Fast Fourier Transform as the world-record-holding Cray Jaguar
computer, which has 100 times as many nodes.3

Some proven performance advantages on existing Data Vortex-enabled systems are included
below:

4-a Random Ring test


In this test, P nodes send messages of size S around a ring of P nodes using one thread. Every time
a message is sent, the nodes in the ring are shuffled at random, as shown in Figure 4. After a given
amount of data (approx. 1 GB) has been transmitted in this fashion, S is increased, and the transfer
rate is plotted as a function of message size.
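
A minimal C sketch of the shuffling step is shown below (an illustration, not the benchmark source). Before each exchange the ordering of the P nodes is re-randomized, and every node sends to its successor in the shuffled ring:

    #include <stdlib.h>
    #include <stdio.h>

    /* Fisher-Yates shuffle of the node ordering; ring[k] is the node sitting
       at position k of the re-randomized ring for this message exchange. */
    static void shuffle_ring(int *ring, int P)
    {
        for (int k = P - 1; k > 0; k--) {
            int j = rand() % (k + 1);
            int t = ring[k]; ring[k] = ring[j]; ring[j] = t;
        }
    }

    int main(void)
    {
        enum { P = 8 };                 /* number of compute nodes (example)   */
        int ring[P], pos[P];
        for (int k = 0; k < P; k++) ring[k] = k;

        shuffle_ring(ring, P);          /* new random ring before each message */
        for (int k = 0; k < P; k++) pos[ring[k]] = k;

        for (int i = 0; i < P; i++) {   /* node i sends to its ring successor  */
            int dest = ring[(pos[i] + 1) % P];
            printf("node %d -> node %d\n", i, dest);
        }
        return 0;
    }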

Figure 4: Each circle represents a compute node. Every time a message of size S is sent by all
nodes, the ring of compute nodes is shuffled at random.

3 The Cray Jaguar, presently configured as Titan, is housed at Oak Ridge National Laboratory. We use data from
Jaguar rather than Titan as those are the FFT numbers published in the HPC Challenge
(http://icl.cs.utk.edu/hpcc/hpcc_results.cgi?orderby=MFF&sortorder=DESC&display=combo).




Figure 5: Bandwidth comparison between DV and IB with the Random Ring test with one thread and
32 nodes. DV has a clear advantage for aggregation sizes below 1024 words (8KB).

Figure 5 shows a comparison between the Data Vortex network (DV) and MPI over InfiniBand (FDR
IB). When the number of words is smaller than 1024, DV has a clear edge that grows as the buffer
size decreases. The main loop of the code is summarized in Figure 9.
There are two kinds of aggregation to consider when using any network: a) aggregation over the
PCIe connection between the CPU and the network card, and b) aggregation in the inter-node network.
The Random Ring example does no PCIe aggregation when the packet size is one. In the next
test we show that when aggregation is limited to as little as 1024 words, much greater gains are
obtained with DV when sending single packets at random through the network.

4-b Random-Access benchmark (GUPS)


Present Data Vortex computers excel at the Random-Access benchmark, achieving more than 100x
acceleration versus competing architectures.
The Random-Access benchmark is difficult for most large computer systems because random
updates must be made to random memory addresses distributed across many nodes. If a system is
allowed to aggregate updates before transmitting them to destination nodes, then network transfers
of large streams of data will lead to an efficient Random-Access implementation on a wide variety
of systems. However, the Random-Access benchmark specifically prohibits aggregating more
than 1024 updates, thus requiring the implementation to initiate many small network transfers.
This leads to poor Random-Access performance on most systems. DV performs small network
transfers much better than other systems. In addition, transferring data to N different nodes can
be done as a single network transfer on DV, whereas it requires N network transfers on other
systems. Through better performance on small network transfers and the ability to use a single
network transfer to transmit to multiple nodes, DV is able to achieve 100x performance per core
compared to competing architectures.
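
For reference, the core of a RandomAccess-style update loop looks roughly like the single-node C sketch below (illustrative; the benchmark's official random stream and verification step are omitted). In the distributed benchmark each update targets table memory owned by a random node, and at most 1024 updates may be buffered before they are sent, which is what forces the many small transfers discussed above:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Single-node sketch of the RandomAccess (GUPS) kernel: XOR a pseudo-
       random value into a pseudo-random table location.  Distributed
       implementations must ship each update to the node that owns T[idx]. */
    int main(void)
    {
        const size_t table_size = 1u << 20;         /* 1 Mi entries (example) */
        const size_t n_updates  = 4 * table_size;
        uint64_t *T = malloc(table_size * sizeof *T);
        if (!T) return 1;

        for (size_t i = 0; i < table_size; i++) T[i] = i;

        uint64_t r = 1;
        for (size_t i = 0; i < n_updates; i++) {
            /* simple LCG stand-in for the benchmark's random stream */
            r = r * 6364136223846793005ULL + 1442695040888963407ULL;
            T[r & (table_size - 1)] ^= r;           /* the "update"           */
        }
        printf("T[0] = %llu\n", (unsigned long long)T[0]);
        free(T);
        return 0;
    }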

Figure 6: DV excels at the Random-Access benchmark, exceeding 100X acceleration. Data Vortex
data is from runs on a DV206, a 64-node system.

4-c A graph-based application: Breadth First Search (BFS)

Many graph problems need to communicate small packets at random through the network. Here
we show the DV implementation of Breadth First Search. The algorithm is applied to a directed,
strongly biased stochastic Kronecker graph with an average degree of 16. The data below shows
that DV easily outperforms other systems. Additional graph applications are presently being
explored on Data Vortex-enabled systems.
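
The communication pattern behind this result is the standard level-synchronous frontier expansion, sketched below in C for a single node (our illustration, not the Data Vortex implementation). In the distributed version, every neighbor probe that crosses a node boundary becomes one small packet through the network, which is where the DV advantage appears:

    #include <stdlib.h>
    #include <stdio.h>

    /* Level-synchronous BFS over a graph in compressed sparse row (CSR) form.
       row[v] .. row[v+1]-1 indexes the neighbors of vertex v in col[].       */
    static void bfs(int n, const int *row, const int *col, int root, int *level)
    {
        int *frontier = malloc(n * sizeof *frontier);
        int *next     = malloc(n * sizeof *next);
        for (int v = 0; v < n; v++) level[v] = -1;

        int fsize = 1, depth = 0;
        frontier[0] = root;
        level[root] = 0;

        while (fsize > 0) {
            int nsize = 0;
            for (int k = 0; k < fsize; k++) {
                int v = frontier[k];
                for (int e = row[v]; e < row[v + 1]; e++) {
                    int w = col[e];                /* in the distributed code, */
                    if (level[w] < 0) {            /* this probe is one small  */
                        level[w] = depth + 1;      /* packet to w's owner node */
                        next[nsize++] = w;
                    }
                }
            }
            int *tmp = frontier; frontier = next; next = tmp;
            fsize = nsize;
            depth++;
        }
        free(frontier); free(next);
    }

    int main(void)
    {
        /* 4-vertex example: edges 0-1, 0-2, 2-3 (stored in both directions) */
        int row[] = {0, 2, 3, 5, 6};
        int col[] = {1, 2, 0, 0, 3, 2};
        int level[4];
        bfs(4, row, col, 0, level);
        for (int v = 0; v < 4; v++) printf("level[%d] = %d\n", v, level[v]);
        return 0;
    }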



Figure 7: BFS test, comparison of DV versus competing architectures.

5- Programming Data Vortex-enabled systems


In order to program DV-enabled systems, our engineers developed two different application
programming interfaces (APIs) that enable the user to write programs in the C language. The first
is a low-level API (DV-API), in which the developer must handle DMA allocation, VIC memory,
switch transfers, counters, etc. The second is a higher-level API (DV Send-Receive) that is much
easier to use and is somewhat similar to MPI (see the DV programming course [4] for details). With
both libraries the user can employ POSIX threads (pthreads) or OpenMP to take maximum advantage
of modern processors. Based on our experience with users, it takes just a few days to learn either
API, provided the programmer has previous parallel programming experience and some knowledge
of pthreads.

4 Jay Rockstroh, “Data Vortex series 200 programming course”, tutorial slides (2018),
https://www.datavortex.com/DV-programming-course



Figure 8: This is how a programmer sees a Data Vortex system. With the low-level DV-API, the programmer is responsible for
managing DV VIC memory, counters and transfers to the switch.

The development of newer APIs for DV is an active area of research currently being undertaken
by Pacific Northwest National Laboratory and our engineers. In principle, it is also possible to
write an MPI implementation for Data Vortex, but one has not yet been developed.
Now let us consider the higher-level API, Send-Receive. The best way to describe it is to show a
simple example, the main loop of the Random-Ring test:

Figure 9: The main loop of Random-Ring: MPI on the left, DV Send-Receive API on the right. The arrows show the approximate equivalences.

The left side shows MPI code, and the right side shows DV Send-Receive code. The arrows show
the equivalence between the two. It is evident that the translation is quite straightforward. The API
contains a few more functions, which are described in detail in the Data Vortex programming
course.4
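
To give a feel for the structure of that main loop, the MPI side looks roughly like the sketch below (illustrative only; the reshuffling of the ring ordering is omitted, and the corresponding DV Send-Receive calls are documented in the programming course rather than reproduced here):

    #include <mpi.h>
    #include <stdlib.h>
    #include <stdio.h>

    /* Skeleton of a Random-Ring main loop on the MPI side.  Each iteration
       sends a buffer of 'words' 8-byte words to the right neighbor of the
       ring and receives from the left neighbor; the DV Send-Receive column
       of Figure 9 replaces the MPI calls with the corresponding DV calls. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, P;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &P);

        const int max_words = 8192;
        double *sendbuf = malloc(max_words * sizeof *sendbuf);
        double *recvbuf = malloc(max_words * sizeof *recvbuf);
        for (int i = 0; i < max_words; i++) sendbuf[i] = rank;

        for (int words = 1; words <= max_words; words *= 2) {
            /* In the benchmark the ring ordering is reshuffled before every
               exchange; plain rank order is used here for brevity. */
            int right = (rank + 1) % P;
            int left  = (rank - 1 + P) % P;
            double t0 = MPI_Wtime();
            MPI_Sendrecv(sendbuf, words, MPI_DOUBLE, right, 0,
                         recvbuf, words, MPI_DOUBLE, left,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            double t1 = MPI_Wtime();
            if (rank == 0)
                printf("%5d words  %.3e s\n", words, t1 - t0);
        }
        free(sendbuf); free(recvbuf);
        MPI_Finalize();
        return 0;
    }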

6- Recent and ongoing developments


Current hardware developments are focused on exploiting fine-grained communication, end-to-
end, with a goal of eliminating congestion caused by the PCIe bus.

6-a Multiple level Data Vortex


When the number of compute nodes grows large, it is necessary to add multiple levels of DV
Switches. Due to the predictable, flat response of the DV Switch, the impact of adding levels
is negligible. This is not the case with most other networks. To demonstrate this, application
performance was compared on the same system attached first to a single-level DV Switch
network and then to a 2-level DV Switch network. The system utilized was a DV2LTP, a 16-
processor/VIC system configured with Intel Xeon E5-2637 v2 3.50 GHz processors.
The DV networks were configured to ensure all ports of all DV Switches were completely filled,
to exhibit the performance of a fully loaded network. The DV network single-level and 2-level
configurations for the 16-VIC system are depicted below.

                          BFS       Sparse Matrix Vector Mult   RandomAccess   Global Barrier Sync   GFFT
                          (MTEPs)   (Million Msg/sec/VIC)       (MUPs/VIC)     (microsec)            (GFLOPs)
1 Lvl Switch Network      1330      543                         234            140                   2.3
2 Lvl Switch Network      1320      542.8                       240            140                   2.6

Figure 10: 1 Level Switch Network vs. 2 Level Switch Network performance on 16 VICs



6-b Data Vortex Network-on-chip (DV NoC)
As the clock speed and the number of cores on current CPUs increase, data cannot be fed to the
cores sufficiently fast, creating a significant bottleneck. DV can be used to solve this problem by
connecting multiple cores inside a CPU chip, both to each other and to memory and IO. We have
demonstrated this concept with the Random-Access test on an Intel Stratix V FPGA containing a
whole DV network and multiple custom-designed GUPS cores: we achieved over 1 giga updates
per second. The Data Vortex hardware team has crafted a roadmap for Stratix 10 FPGA
development and beyond. The ultimate goal is to replicate the performance and strength of an
entire Data Vortex system on a single chip.

7- Conclusion
Today’s network topologies continue to be challenged by congestion, latency, and variable
message size. In contrast, our technology can transfer large amounts of data with minimal
congestion and consistent latencies, even when the communication pattern is chaotic and data is
fragmented into small packets. Recent developments have demonstrated that the Data
Vortex not only scales linearly for processor-to-processor communication within a system but is
an equally elegant solution for core-to-core communication within a chip.

Figure 11: Proposed circuit of a multi core CPU with Data Vortex connecting memory, cores and IO.



“The Data Vortex Network: An overview of the architecture, implementations, & performance”
Authors: Santiago Betelu, PhD & Coke S. Reed, PhD
Additional Contributors: Arno Kolster & Ryan Quick (Providentia Worldwide), Mike Ives (Plexus),
Reed Devany, & Jay Rockstroh (Data Vortex Technologies)

No part of this document covered by copyright may be reproduced in any form or by any means — graphic,
electronic, or mechanical, including photocopying, recording, taping, or storage in an electronic retrieval
system — without prior written permission of the copyright owner.
© Copyright 2018 Interactic Holdings, LLC (Data Vortex Technologies, DBA). All rights reserved.


