Data Vortex Network Sept 2018
Contents
1 – Introduction
2 – Conceptual description of the network
3 – Present-day implementation of the network
4 – Network performance
5 – Programming Data Vortex-enabled systems
6 – Recent & ongoing developments
7 – Conclusion
1- Introduction
Among the greatest challenges in distributed and high-performance computing are congestion,
latency, and communication time variability between different parts of the network. The unique
Data Vortex network is designed from the ground up to solve these problems. It can communicate
large amounts of data with minimal congestion and consistent latencies, even in situations where
the communication pattern is chaotic and the data is fragmented into small packets. The Data Vortex
topology is the network solution for processor-to-processor communication within a system and
core-to-core communication within a chip.
Interactic Holdings, LLC (DBA Data Vortex Technologies) has already developed several test
beds of this novel architecture, some of which are actively used to investigate scientific problems.
These include design of protocols for remote memory access (Pacific Northwest National
Laboratory), distributed computing (Providentia Worldwide), fundamental physics research
(University of Ulm), molecular simulation (Indiana University Bloomington), and quantum
computing simulation (Data Vortex Technologies).
The performance of present-day Data Vortex-enabled systems has validated the mathematical
scalability of the network. Our ongoing hardware and software developments are the first steps
toward implementing the Data Vortex network into next generation technologies across multiple,
intersecting industries. These include enterprise infrastructure, graph analytics, and chip design:
• Processor and chip design: The Data Vortex Network-on-Chip lays the groundwork for
the next generation of hardware. The network can now connect thousands of cores across
multiple chips using core-to-core transfer logic, eliminating traditional communication
bottlenecks. Eliminating the PCIe bus in this way could position the right vendor to design and
produce a clearly differentiated processor.
The purpose of this white paper is to provide a concise, fact-based description of the network that
can be used to make managerial decisions regarding the applicability of the DV network to
distributed computing, next-generation chip design, cloud infrastructure, high performance
computing, underlying AI neural nets, and fundamental research.
In Section 2 we present a conceptual picture of the DV network. In Section 3 we show how the
network is implemented in practice with present day hardware to validate the mathematical proof,
taking into account the constraints of real-world electronics. In Section 4 we summarize the most
important benchmarks that differentiate the Data Vortex from other networks. In Section 5 we
explain how the programming of DV-enabled systems is done. Finally, in Section 6 we summarize
recent work on the DV network-on-chip and multi-level systems.
2- Conceptual description of the network
The Data Vortex network¹ is composed of a collection of packet-carrying Data Vortex switches
and Data Vortex network interface devices. The Data Vortex switch consists of a collection of
richly connected rings. The rings and the connections between the rings are built using parallel
data busses. In a radix-R switch with R = 2^N, the rings are arranged in N+1 levels. A packet on the
entry level, level N, the outermost level, can travel to any of the output ports. When a packet
travels from level N to level N-1, the most significant bit of the binary address of the output is
fixed, so that a packet on level N-1 can reach only half of the output ports; a packet on level N-2
can reach only one fourth of the output ports, and so forth. This process continues until, when a
packet reaches level 0, the target output is fully determined.
Consider the most elementary example, a radix-16 switch, which has five levels. The input level
(level 4) is a single ring containing exactly 112 nodes (112 = 7 × 16). On level 3 there are two
rings, each containing 56 nodes. This continues so that on level zero (the output level) there are
16 rings, each containing 7 nodes.
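The level structure above can be summarized in a short sketch (plain Python, written for this paper; `level_layout` and its argument names are our own, not part of any DV software):

```python
# Ring layout of a radix-R Data Vortex switch, R = 2**N, following the
# radix-16 example in the text. Each descent fixes one output-address bit.

def level_layout(N, nodes_per_output_ring=7):
    """Map level l (N = input, 0 = output) to
    (rings on the level, nodes per ring, outputs reachable by a packet)."""
    return {
        l: (2 ** (N - l),                    # rings double on the way down
            nodes_per_output_ring * 2 ** l,  # rings shrink by half each level
            2 ** l)                          # one address bit fixed per level
        for l in range(N + 1)
    }

layout = level_layout(4)  # radix-16 switch
print(layout[4])          # (1, 112, 16): one input ring of 112 nodes
print(layout[0])          # (16, 7, 1): sixteen output rings of 7 nodes
```

Note that the total node count per level (rings × nodes per ring) is a constant 112, which is what lets packets circulate on a level without loss of capacity.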
Consider a node N at an angle A on the bottom ring at level L, and denote the height H of N as a
binary integer in an L-bit field. In two ticks, a packet travels from node N to node M, the next
node on the same level L, which is where a packet moves when it cannot descend to the next level.
¹ First patented in US Patent US5996020A, “Multiple Level, Minimum Logic Network”, by C. S. Reed (1999); later
patents include US Patent US20140341077A1, “Parallel Data Switch”, by C. S. Reed & D. Murphy (2011). Patents
held by Interactic Holdings, LLC.
Let’s walk through the colored sections of Figure 1. An important feature of all the Data Vortex
networks is that when two nodes are positioned to send data to a third node, one of the sending
nodes has priority over the other sending node to send the data. This priority is based on position.
The priority is enforced by control information riding in the red lines. This control information is
sent by the node with the higher priority to the node with the lower priority. When a level x node
A and a level x+1 node B are each positioned to send a data packet to arrive at level x node C at a
time T, A has priority over B to send to C. When A sends a packet to C on a green ring, it sends
a blocking control signal to B on a red control line to enforce this priority. The node B can send a
packet PKT to node C only in case 1) C is on a path to the target output port of PKT as indicated
by the header of PKT; and 2) B is not blocked from sending PKT to C. If B does not send
PKT to C, then B sends PKT to the next node D on its own ring (so that PKT stays on level
x+1). It takes one tick for the first bit of a packet to travel on a blue line from node B to node C,
and two ticks for the first bit of a packet to travel on a green line from node B to node D (when
a packet stays on the same level). The extra tick can be the result of passing through a one-tick
delay in B or of a one-tick delay element (not pictured) on the green line from node B to
node D. The added tick ensures that a blocking control bit on a red line arrives before it is
needed by the logic of node D. If C is not on a path to the target output of PKT, then the
permutation in the green rings guarantees that D can send PKT to a node that is on such a path.
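The per-node decision just described can be condensed into a small function. This is a sketch of the logic as stated in the text, with illustrative names (`route_step`, `on_path`); it is not the actual gate-level implementation:

```python
# One routing decision for a node on level x+1 that could send packet PKT
# down to node C on level x. 'blocked' models the red control line asserted
# by the higher-priority node A; the address check models "C is on a path
# to the target output of PKT".

def route_step(pkt_target_bits, c_fixed_bits, blocked):
    """Return 'down' if PKT may descend through C, else 'stay'
    (PKT moves to the next node D on the same ring)."""
    on_path = pkt_target_bits.startswith(c_fixed_bits)
    if on_path and not blocked:
        return "down"   # one more bit of the output address becomes fixed
    return "stay"       # PKT circulates on its level and retries at D

# A packet headed for output 0b1010 may drop through a node whose path
# has fixed the prefix '10', unless a blocking signal arrived this tick.
print(route_step("1010", "10", blocked=False))  # down
print(route_step("1010", "10", blocked=True))   # stay
print(route_step("1010", "01", blocked=False))  # stay: C not on the path
```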
The switch illustrated in Figure 1 is the simplest example of a Data Vortex switch. Each node B
on a level x of the switch has two output lines, one to a node C on level x-1 and one to a node D
on level x. In a more richly connected Data Vortex switch, the “double down” architecture, each
node has multiple data lines connected to lower level nodes. A topology for such a switch is
obtained by pairing up nodes on a given level, so that a node B on level x that is positioned to send
data to node C on level x-1 is paired with a node B’ on level x that is positioned to send data to the
node C’ on level x-1. Data lines are added so that B can send data to both C and C’ and B’ is also
positioned to send data to both C and C’. Moreover, there is logic that is connected to B, C, B’
and C’. The logic works as follows. If B is able to send to C or B’ is able to send to C’, then such
transfer or transfers take place. If neither of these transfers is possible, but B is able to send to C’
or B’ is able to send to C, then such transfer or transfers take place. This “double down” switch
has been implemented and studied in detail. All Data Vortex performance numbers are based on
the “double down” architecture.
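The paired-transfer rule can be sketched the same way (again with illustrative names such as `double_down`, not real hardware logic):

```python
# "Double down" transfer selection for one tick: paired senders B, B' on
# level x and paired receivers C, C' on level x-1. can_send[(b, c)] is True
# when b holds a packet whose path permits a drop to c and b is unblocked.

def double_down(can_send):
    """Return the list of (sender, receiver) transfers taken this tick."""
    # Straight transfers are tried first, and both may proceed at once:
    straight = [p for p in (('B', 'C'), ('Bp', 'Cp')) if can_send[p]]
    if straight:
        return straight
    # Only if neither straight transfer is possible, try the crossed pair:
    return [p for p in (('B', 'Cp'), ('Bp', 'C')) if can_send[p]]

# Both straight transfers possible -> both happen in the same tick.
print(double_down({('B', 'C'): True, ('Bp', 'Cp'): True,
                   ('B', 'Cp'): False, ('Bp', 'C'): False}))
```

The extra crossed data lines are what give the "double down" switch its richer connectivity: a packet blocked on its straight drop still has a second chance to descend in the same tick.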
3- Present-day implementation of the network
In a hardware implementation of the Data Vortex network, there are a number of constraints that
demand a modification of Figure 1. FPGAs are used, which have a limited clock rate, limited
internal logic performance, and a limited number of SerDes lanes. Moreover, the layout inside the
chip is mostly two-dimensional, while the DV network is intrinsically three-dimensional. We solve
these problems by:
• Implementing multiple, parallel, independent DV networks to increase bandwidth
• Mapping the DV network into a two-dimensional circuit.
Since distributing a DV network efficiently among many FPGAs is costly, we put an entire
radix-64 DV switch into a single FPGA. To increase compute node-to-compute node bandwidth,
we interface each compute node with 16 parallel DV networks as demonstrated in Figure 3. Each
Vortex Interface Controller (VIC) on the compute node chooses a DV network at random, thus
balancing the traffic load.
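The load-balancing effect of that random choice can be illustrated with a short sketch (uniform random selection per transfer, as stated above; `pick_network` is our own name, not the VIC firmware's):

```python
# Each VIC picks one of the 16 parallel DV networks uniformly at random
# per transfer. Over many transfers the per-network load evens out.

import random
from collections import Counter

NUM_NETWORKS = 16

def pick_network(rng=random):
    return rng.randrange(NUM_NETWORKS)

loads = Counter(pick_network() for _ in range(160_000))
# Expected load per network is 10,000; the observed spread is a few percent.
print(min(loads.values()), max(loads.values()))
```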
² US Patent US20140341077A1, “Parallel Data Switch”, by C. S. Reed & D. Murphy (2011).
4- Network performance
The Data Vortex shows its greatest advantages in two situations:
• When transferring large numbers of small packets in non-structured, chaotic traffic. This
happens in situations where aggregation is very costly or impossible. As shown below in
the Random-Ring test, DV has a large advantage when aggregation is less than 1024
words of 8 bytes each. An extreme example of this situation is the Random-Access
(Giga Updates Per Second, “GUPS”) test, summarized below. The DV network
performs three times faster than IB on three-dimensional FFTs for the same reason.
• When there is congestion in the network. This happens when the number of nodes is very
large (much larger than 64 nodes). The Data Vortex was designed from the ground up to
be scalable, eliminating congestion issues even in multi-level systems. An example is the
proposed 512-node Data Vortex computer that has been predicted mathematically to
perform as well on the Fast Fourier Transform as the world-record-holding Cray Jaguar
computer, which has 100 times as many nodes.³
Some proven performance advantages on existing Data Vortex-enabled systems are included
below:
Figure 4: Each circle represents a compute node. Every time a message of size S is sent by all
nodes, the ring of compute nodes is shuffled at random.
³ The Cray Jaguar, presently configured as Titan, is housed at Oak Ridge National Laboratory. We use data from
Jaguar rather than Titan as those are the FFT numbers published in the HPC Challenge
(http://icl.cs.utk.edu/hpcc/hpcc_results.cgi?orderby=MFF&sortorder=DESC&display=combo).
Figure 5: Bandwidth comparison between DV and IB with the Random Ring test with one thread and
32 nodes. DV has a clear advantage for aggregation sizes below 1024 words (8KB).
Figure 5 shows a comparison between the Data Vortex network (DV) and MPI-InfiniBand (FDR
IB). When the number of words is smaller than 1024, DV has a clear edge that becomes larger as
the buffer size decreases. The main loop of the code is summarized in Figure 9.
There are two kinds of aggregation one must consider when using any network: a) aggregation of
the PCIe connection between CPU and network card, and b) aggregation of the inter-node network.
The Random-Ring example does not perform PCIe aggregation when the packet size is one. In the next
test we show that when aggregation of as little as 1024 words is used, much greater gains are obtained
with DV when sending single packets at random through the network.
Figure 6: DV excels at the Random-Access benchmark, exceeding 100X acceleration. Data Vortex
data is from runs on a DV206, 64 node system.
Many graph problems need to communicate small packets at random through the network. Here
we show the DV implementation of Breadth First Search. The algorithm is applied to a directed,
strongly biased stochastic Kronecker graph with an average degree of 16. The data below shows
that DV easily outperforms other systems. Additional graph applications are presently being
explored on Data Vortex-enabled systems.
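As a reference point, the level-synchronous BFS skeleton looks as follows. This is a generic single-node sketch in plain Python, not the DV implementation; it is included to show why each discovered edge corresponds to one small, randomly addressed message:

```python
# Level-synchronous breadth-first search on an adjacency-list graph. In a
# distributed run, every edge inspection below becomes one small message
# to the rank owning the neighbour vertex -- the traffic pattern at which
# the text says the DV network excels.

from collections import deque

def bfs_levels(adj, root):
    """adj: {vertex: iterable of neighbours}. Return {vertex: BFS level}."""
    level = {root: 0}
    frontier = deque([root])
    while frontier:
        v = frontier.popleft()
        for w in adj.get(v, ()):        # one "message" per edge
            if w not in level:          # first visit fixes the level
                level[w] = level[v] + 1
                frontier.append(w)
    return level

g = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(bfs_levels(g, 0))  # {0: 0, 1: 1, 2: 1, 3: 2}
```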
⁴ Jay Rockstroh, “Data Vortex series 200 programming course”, tutorial slides (2018),
https://www.datavortex.com/DV-programming-course
5- Programming Data Vortex-enabled systems
The development of newer APIs for DV is an active field of research currently being undertaken
by Pacific Northwest National Laboratory and our engineers. In principle, it is also possible to
write an MPI API for Data Vortex, but one is not currently implemented.
Now let us consider the higher-level API, Send-Receive. The best way to describe it is to show a
simple example, the main loop of the Random-Ring test:
Figure 9: The main loop of Random-Ring: MPI on the left, DV Send-Receive API on the right. The arrows show the approximate equivalences.
The left side shows MPI code, and the right side shows DV Send-Receive code. The arrows show
the equivalence between the two. It is evident that the translation is quite straightforward. The API
contains a few more functions, which are described in detail in the Data Vortex programming
course.4
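The partner-selection step of that loop can be sketched without any network at all (a pure-Python stand-in; `random_ring_partners` is our own name, and a real run would pair it with the MPI or DV send/receive calls of Figure 9):

```python
# Random-Ring partner selection: every iteration, the ring of ranks is
# reshuffled with a shared seed (so all ranks agree), and each rank sends
# one buffer to its new successor and receives from its new predecessor.

import random

def random_ring_partners(nranks, seed):
    """Return (send_to, recv_from) partner lists for one shuffled ring."""
    order = list(range(nranks))
    random.Random(seed).shuffle(order)   # identical shuffle on every rank
    pos = {r: i for i, r in enumerate(order)}
    send_to = [order[(pos[r] + 1) % nranks] for r in range(nranks)]
    recv_from = [order[(pos[r] - 1) % nranks] for r in range(nranks)]
    return send_to, recv_from

s, r = random_ring_partners(8, seed=42)
# Each rank sends to exactly one rank and receives from exactly one rank.
print(sorted(s) == list(range(8)) and sorted(r) == list(range(8)))  # True
```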
6- Recent & ongoing developments

                      | BFS (MTEPs) | Sparse Matrix Vector Mult (MUPs/VIC) | RandomAccess (Million Msg/sec/VIC) | Global Barrier Sync (microsec) | GFFT (GFLOPs)
1 Lvl Switch Network  | 1330        | 543                                  | 234                                | 140                            | 2.3
2 Lvl Switch Network  | 1320        | 542.8                                | 240                                | 140                            | 2.6

Figure 10: 1 Level Switch Network vs. 2 Level Switch Network performance on 16 VICs
7- Conclusion
Today’s network topologies continue to be challenged by congestion, latency, and variable
message size. In contrast, our technology can transfer large amounts of data with minimal
congestion and consistent latencies, even in situations where the communication pattern is chaotic
and data is fragmented in small packets. Recent developments have demonstrated that the Data
Vortex not only scales linearly for processor-to-processor communication within a system but is
an equally elegant solution for core-to-core communication within a chip.
Figure 11: Proposed circuit of a multi core CPU with Data Vortex connecting memory, cores and IO.
No part of this document covered by copyright may be reproduced in any form or by any means — graphic,
electronic, or mechanical, including photocopying, recording, taping, or storage in an electronic retrieval
system — without prior written permission of the copyright owner.
© Copyright 2018 Interactic Holdings, LLC (Data Vortex Technologies, DBA). All rights reserved.