DPDK - Inter-Container Communications
Gabriele Ara¹, Tommaso Cucinotta¹, Luca Abeni¹ and Carlo Vitucci²
¹Scuola Superiore Sant'Anna, Pisa, Italy
²Ericsson, Stockholm, Sweden
Abstract: This work presents a framework for evaluating the performance of various virtual switching solutions that are widely adopted on Linux to provide virtual network connectivity to containers in high-performance scenarios, such as Network Function Virtualization (NFV). We present results from the use of this framework for the quantitative comparison of the performance of software-based and hardware-accelerated virtual switches on a real platform with respect to a number of key metrics, namely network throughput, latency and scalability.
Ara, G., Cucinotta, T., Abeni, L. and Vitucci, C.
Comparative Evaluation of Kernel Bypass Mechanisms for High-performance Inter-container Communications.
DOI: 10.5220/0009321200440055
In Proceedings of the 10th International Conference on Cloud Computing and Services Science (CLOSER 2020), pages 44-55
ISBN: 978-989-758-424-4
Copyright © 2020 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved.
can be used to efficiently exchange packets locally on a single machine or among multiple hosts accessing Network Interface Controller (NIC) adapters directly from user-space, bypassing the OS kernel. We present results from a number of experiments comparing the performance of these virtual switches (either software-based or hardware-accelerated) when subject to synthetic workloads that resemble the behavior of real interacting VNF components.

2 BACKGROUND

There are a number of different options when multiple applications, each encapsulated in its own OS container or other virtualized environment, need to communicate through network primitives. Usually, they involve some form of network virtualization to provide each application a set of gateways to exchange data over a virtual network infrastructure.

For the purposes of this work, we will focus on the following: (i) kernel-based networking, (ii) software-based user-space networking, (iii) hardware-accelerated user-space networking.

In the following, we briefly summarize the main characteristics of each of these techniques when adopted in NFV scenarios to interconnect OS containers within a private cloud infrastructure. Given the demanding requirements of NFV applications in terms of performance, both with respect to throughput and latency, we will focus on the performance that can be attained when adopting each solution on general-purpose computing machines running a Linux OS.

2.1 Containers Networking through the Linux Kernel

A first way to interconnect OS containers is to place virtual Ethernet ports as gateways in each container and to connect them all to a virtual switch embedded within the Linux kernel, called "linux-bridge". Through this switch, VNFs can communicate on the same host with other containerized VNFs or with other hosts via forwarding through actual Ethernet ports present on the machine, as shown in Figure 1a.

The virtual ports assigned to each container are implemented in the Linux kernel as well and they emulate the behavior of actual Ethernet ports. They can be accessed via blocking or nonblocking system calls, for example using the traditional POSIX Socket API, exchanging packets via send() and recv() (or their more general forms sendmsg() and recvmsg()); as a result, at least two system calls are required to exchange each UDP datagram over the virtual network, so networking overheads grow proportionally with the number of packets exchanged.

These overheads can be partially amortized by resorting to batch APIs that exchange multiple packets using a single system call (either sendmmsg() or recvmmsg()), grouping packets in bursts and amortizing the cost of each system call over the packets in each burst. However, packets still need to traverse the whole in-kernel networking stack, going through additional data copies, to be sent or received.

This last problem can be tackled by partially bypassing the kernel networking stack, using raw sockets instead of regular UDP sockets and implementing data encapsulation in user-space. This is often done in combination with zero-copy APIs and memory-mapped I/O to transfer data quickly between the application and the virtual Ethernet port, greatly reducing the time needed to send a packet (Rizzo, 2012b).

Despite the various techniques available to reduce the impact of the kernel on network performance, their use is often not sufficient to match the demanding requirements of typical NFV applications (Ara et al., 2019). In typical scenarios, the only solution to these problems is to resort to techniques that completely bypass the kernel when exchanging packets, both locally and between multiple hosts.

2.2 Inter-container Communications with Kernel Bypass

Significant performance improvements for inter-container networking can be achieved by avoiding system calls, context switches and unneeded data copies as much as possible. Various I/O frameworks undertake such an approach, resorting to kernel bypass techniques to exchange batches of raw Ethernet frames among applications without a single system call. For example, frameworks based on the virtio standard (Russell et al., 2015) use paravirtualized network interfaces that rely on shared memory to achieve high-performance communication among containers located on the same host, a situation which is fairly common in NFV scenarios. While virtio network devices are typically implemented for hypervisors (e.g. QEMU, KVM), recently a complete user-space implementation of the virtio specification called vhost-user can be used for OS containers to completely bypass the kernel. However, virtio cannot be used to access the physical network without a user-space implementation of a virtual switch. This means that it cannot be used alone to achieve both dynamic and flexible communications among independently deployed containers.
Figure 1: Different approaches to inter-container networking. (a) Kernel-based solution. (b) Using DPDK with vhost-user. (c) Using SR-IOV support.
2.2.1 Data Plane Development Kit (DPDK)

Data Plane Development Kit (DPDK) (https://www.dpdk.org/) is an open source framework for fast packet processing implemented entirely in user-space and characterized by high portability across multiple platforms. Initially developed by Intel, it now provides a high-level programming abstraction that applications can use to gain access to low-level resources in user-space without depending on specific hardware devices.

In addition, DPDK supports virtio-based networking via its implementation of vhost-user interfaces. Applications can hence exchange data efficiently using DPDK to communicate locally using vhost-user ports and remotely via DPDK user-space implementations of actual Ethernet device drivers: the framework simply initializes the specified ports accordingly, in complete transparency from the application point of view. For this reason, DPDK has become extremely popular over the past few years when it is necessary to implement fast data plane packet processing.

2.3 High-performance Switching among Containers

After having described virtio in Section 2.2, we can now explain how virtio, in combination with other tools, can be used to realize a virtual networking infrastructure in user-space. There are essentially two ways to achieve this goal: 1) by assigning each container a virtio port, using vhost-user to bypass the kernel, and then connecting each port to a virtual switch application running on the host (Figure 1b); 2) by leveraging special capabilities of certain NIC devices that allow concurrent access from multiple applications and that can be accessed in user-space by using DPDK drivers (Figure 1c). The virtual switch instance, either software or hardware, is then connected to the physical network via the actual NIC interface present on the host.

Many software implementations of L2/L3 switches are based on DPDK or other kernel bypass frameworks, each implementing its own packet processing logic responsible for packet forwarding. For this reason, performance may differ greatly from one implementation to another. In addition, these switches consume a non-negligible amount of processing power on the host they are deployed onto to achieve very high network performance. On the other hand, special NIC devices that support the Single-Root I/O Virtualization (SR-IOV) specification allow traffic offloading to the hardware switch implementation that they embed, which applications can access concurrently without interfering with each other.

Below, we briefly describe the most common software virtual switches in the NFV industrial practice, and the characteristics of SR-IOV compliant devices.

DPDK Basic Forwarding Sample Application (https://doc.dpdk.org/guides/sample_app_ug/skeleton.html) is a sample application provided by DPDK that can be used to connect DPDK-compatible ports, either virtual or physical, in pairs: this means that each application using a given port can only exchange packets with a corresponding port chosen during system initialization. For this reason, this software does not perform any packet processing operation, hence it cannot be used in real use-case scenarios.

Open vSwitch (OVS) (https://www.openvswitch.org) is an open source virtual switch for general-purpose usage with enhanced flexibility thanks to its compatibility with the OpenFlow protocol (Pfaff et al., 2015). Recently, OVS has been updated to support DPDK and virtio-based ports, which considerably accelerated packet forwarding operations by performing them in user-space rather than within a kernel module (Intel, 2015).
FD.io Vector Packet Processing (VPP) (https://fd.io/) is an extensible framework for virtual switching released by the Linux Foundation Fast Data Project (FD.io). Since it is developed on top of DPDK, it can run on various architectures and it can be deployed in VMs, containers or bare metal environments. It uses Cisco Vector Packet Processing (VPP), which processes packets in batches, improving performance thanks to better exploitation of instruction and data cache locality (Barach et al., 2018).

Snabb (https://github.com/snabbco/snabb) is a packet processing framework that can be used to provide networking functionality in user-space. It allows for programming arbitrary packet processing flows (Paolino et al., 2015) by connecting functional blocks in a Directed Acyclic Graph (DAG). While not being based on DPDK, it has its own implementation of virtio and some NIC drivers in user-space, which can be included in the DAG.

Single-Root I/O Virtualization (SR-IOV) (Dong et al., 2012) is a specification that allows a single NIC device to appear as multiple PCIe devices, called Virtual Functions (VFs), which can be independently assigned to VMs or containers and move data through dedicated buffers within the device. VMs and containers can directly access dedicated VFs and leverage the L2 hardware switch embedded in the NIC for either local or remote communications (Figure 1c). Using DPDK APIs, applications within containers can access the dedicated VFs bypassing the Linux kernel, removing the need for any software switch running on the host; however, a DPDK daemon is needed on the host to manage the VFs.

3 RELATED WORK

The proliferation of different technologies to exchange packets among containers has created the need for new tools to evaluate the performance of virtual switching solutions with respect to throughput, latency and scalability. Various works in the research literature addressed the problem of network performance optimization for VMs and containers.

A recent work compared various kernel bypass frameworks like DPDK and Remote Direct Memory Access (RDMA) (http://www.rdmaconsortium.org) against traditional sockets, focusing on the round-trip latency measured between two directly connected hosts (Géhberger et al., 2018). This work showed that both DPDK and RDMA outperform POSIX UDP sockets, as only with kernel bypass mechanisms is it possible to achieve single-digit microsecond latency, with the only drawback that applications must continuously poll the physical devices for incoming packets, leading to high CPU utilization.

Another work (Lettieri et al., 2017) compared, both qualitatively and quantitatively, common high-performance networking setups. It measured CPU utilization and throughput achievable using SR-IOV, Snabb, OVS or Netmap (Rizzo, 2012b), which is another networking framework for high-performance I/O. Their focus was on VM to VM communications, either on a single host or between multiple hosts, and they concluded that in their setups Netmap is capable of reaching up to 27 Mpps (when running on a 4 GHz CPU), outperforming SR-IOV due to the limited hardware switch bandwidth.

Another similar comparison of various kernel bypass technologies (Gallenmüller et al., 2015) evaluates the throughput performance of three networking frameworks: DPDK, Netmap and PF_RING (https://www.ntop.org/products/packet-capture/pf_ring/). They concluded that for each of these solutions there are two major bottlenecks that can potentially limit networking performance: CPU capacity and NIC maximum transfer rate. Among them, the latter is the dominating bottleneck when per-packet processing cost is kept low, while the former has a bigger impact as soon as this cost increases and the CPU becomes fully loaded. Their evaluations also showed that DPDK is able to achieve the highest throughput in terms of packets per second, independently of the burst size; on the contrary, Netmap can reach its highest throughput only for a burst size greater than 128 packets per burst, and even then it cannot reach the same performance as DPDK or PF_RING.

The scalability of various virtual switching solutions with respect to the number of VMs deployed on the same host has been addressed in another recent work (Pitaev et al., 2018), in which VPP and OVS have been compared against SR-IOV. From their evaluations, they concluded that SR-IOV can sustain a greater number of VMs than software-based solutions, since the total throughput of the system scales almost linearly with the number of VMs. On the contrary, both OVS and VPP are only able to scale the total throughput up to a certain plateau, which is influenced by the amount of allocated CPU resources: the more VMs are added, the more the system performance degrades if no more resources are allocated for each software virtual switch.

Finally, a preliminary work (Ara et al., 2019) was presented by the authors of this paper, performing a basic comparison of various virtual switching techniques for inter-container communications. That evaluation however was limited to only a single transmission flow between a sender and a receiver application deployed on the same machine, for which SR-IOV was the most suitable among the tested solutions. This work extends that preliminary study to a much broader number of test cases and working conditions, on either one or multiple machines, and presents a new set of tools that can be used to conveniently repeat the experiments in other scenarios.

Compared to the above mentioned works, in this paper we present for the first time an open-source framework that can be used to evaluate the performance of various widely adopted switching solutions among Linux containers in the common NFV industrial practice. This framework can be used to carry out experiments from multiple points of view, depending on the investigation focus, while varying testing parameters and the system configuration. The proposed framework eases the task of evaluating the performance of a system under some desired working conditions. With this framework, we evaluated system performance in a variety of application scenarios, by deploying sender/receiver and client/server applications on one or multiple hosts. This way we were able to draw some general conclusions about the characteristics of multiple networking solutions when varying the evaluation point of view, either throughput, latency or scalability of the system. Being open-source, the framework can be conveniently extended by researchers or practitioners, should they need to write further customized testing applications.

4 PROPOSED FRAMEWORK

In this section we present the framework we realized to evaluate and compare the performance of different virtual networking solutions, focusing on Linux containers. The framework we developed can be easily installed and configured on one or multiple general-purpose servers to instantiate a number of OS containers and deploy in each of them a selected application that generates or consumes synthetic network traffic. These applications, also developed for this framework, serve the dual purpose of generating/consuming network traffic and collecting statistics to evaluate system performance in the given configuration.

In particular, this framework can be used to configure a number of tests, each running for a specific amount of time with a given system configuration; after each test is done, the framework collects performance results from each running application and provides the desired statistics to the user. In addition, multiple tests can be performed consecutively by specifying multiple values for any test parameter.

The software is freely available on GitHub, under a GPLv3 license, at: https://github.com/gabrieleara/nfv-testperf. The architecture of the framework is depicted in Figure 2, which highlights the various tools it includes. First, there are a number of software tools and Bash scripts that are used to install system dependencies, configure and customize the installation, run performance tests and collect their results. Among the dependencies installed by the framework there are the DPDK framework (which includes also the Basic Forwarding Sample Application), and the other software-based virtual switches that will be used during testing to connect applications in LXC containers: OVS (compiled with DPDK support), VPP, and Snabb. Each virtual switch is configured to act as a simple learning L2 switch, with the only exception represented by the DPDK Basic Forwarding Sample Application, which does not have this functionality. In addition, VPP and OVS can be connected to physical Ethernet ports to perform tests for inter-machine communications.

Figure 2: Main elements of the proposed framework.

Finally, a set of benchmarking applications is included in the framework, serving the dual purpose of generating/consuming network traffic and collecting statistics that can be used to evaluate the actual system performance in the provided configuration. These can measure the performance from various viewpoints:

Throughput. As many VNFs deal with huge amounts of packets per second, it is important to evaluate the maximum performance of each networking solution with respect to either the number of packets or the amount of data per second that it is able to effectively process and deliver to destination. To do this, the sender/receiver application pair is provided to generate/consume unidirectional traffic (from the sender to the receiver) according to the given parameters.

Latency. Many VNFs operate with networking protocols designed for hardware implementations of certain network functions; for this reason, these protocols expect very low round-trip latency between interacting components, on the order of single-digit microseconds. In addition, in NFV infrastructures it is crucial to keep the latency of individual interactions as low as possible to reduce the end-to-end latency between components across long service chains. For this purpose, the client/server application pair is used to generate bidirectional traffic to evaluate the average round-trip latency for each packet when multiple packets are sent and received in bursts through a certain virtual switch. To do so, the server application will send back each packet it receives to its corresponding client.

Scalability. Evaluations from this point of view are orthogonal with respect to the two previous dimensions, in particular with respect to throughput: since full utilization of a system is achieved only when multiple VNFs are deployed on each host present within the infrastructure, it is extremely important to evaluate how this affects performance in the mentioned dimensions. For this purpose there are no dedicated applications: multiple instances of each designated application can be deployed concurrently to evaluate how that affects global system performance.

nario is depicted in Figure 1. In any case, deployed applications use polling to exchange network traffic over the selected ports. For tests involving multiple hosts, only OVS or VPP can be used among software-based virtual switches to interconnect the benchmarking applications; otherwise, it is possible to assign to each container a dedicated VF and leverage the embedded hardware switch in the SR-IOV network card to forward traffic from one host to another.

The proposed framework can be easily extended to include more virtual switching solutions among containerized applications or to develop different testing applications that generate/consume other types of synthetic workload. From this perspective, the inclusion of other virtio-based virtual switches is straightforward, and it does not require any modification of the existing test applications. In contrast, other networking frameworks that rely on custom port types/programming paradigms (e.g., Netmap) may require the extension of the API abstraction layer or the inclusion of other custom test applications. Further details about the framework's extensibility can be found at https://github.com/gabrieleara/nfv-testperf/wiki/Extending.
Table 1: List of parameters used to run performance tests with the framework.

    Parameter       Symbol  Values
    Test Dimension  D       throughput or latency
    Hosts Used      L       local or remote (i.e. single or multi-host)
    Containers Set  S       NvsN (where N is the number of pairs)
    Virtual Switch  V       linux-bridge, basicfwd, ovs, snabb, sriov, vpp
    Packet Size     P       Expressed in bytes
    Sending Rate    R       Expressed in pps
    Burst Size      B       Expressed in number of packets per burst

Table 2: Maximum throughput achieved for various socket-based solutions: (D = throughput, L = local, S = 1vs1, V = linux-bridge, P = 64, R = 1M, B = 64).

    Technique                               Max Throughput (kpps)
    UDP sockets using send/recv             338
    UDP sockets using sendmmsg/recvmmsg     409
    Raw sockets using send/recv             360
    Raw sockets using sendmmsg/recvmmsg     440
5.1 Testing Parameters

The framework can be configured to vary certain parameters before running each test. Each configuration is represented by a set of applications (each running within a container), deployed either on one or both hosts and grouped in pairs (i.e. sender/receiver or client/server), and a set of parameters that specify the synthetic traffic to be generated. Given the number of evaluated test cases, for each different perspective we will show only relevant results.

To identify each test, we use the notation shown in Table 1: each test is uniquely identified using a tuple in the following form:

(D, L, S, V, P, R, B)

When multiple tests are referred to, the notation will omit those parameters that are free to vary within a predefined set of values. For example, to show some tests performed while varying the sending rate, the R parameter will not be included in the tuple.

Each test uses only a fixed set of parameters and runs for 1 minute; once the test is finished, an average value of the desired metric (either throughput or latency) is obtained after discarding values related to initial warm-up and shutdown phases, thus considering only values related to steady state conditions of the system. The scenarios that we considered in our experiments are summarized in Figure 3.

5.2 Kernel-based Networking

First we will show that the performance achieved without using kernel bypass techniques is much worse than that achieved using techniques like vhost-user in similar set-ups. Table 2 reports the maximum throughput achieved using socket APIs and linux-bridge to interconnect a pair of sender and receiver applications on a single host (Figure 3a). In this scenario, it is not possible to reach even 1 Mpps using kernel-based techniques, while using DPDK any virtual switch can easily support at least 2 Mpps in similar set-ups. That is why all the results that follow will consider kernel bypass technologies only.

5.3 Throughput Evaluations

Moving on to techniques that use DPDK, we first evaluated throughput performance between two applications in a single pair, deployed both on the same host or on multiple hosts, varying the desired sending rate, packet and burst sizes. In all our experiments, we noticed that varying the burst size from 32 to 256 packets per burst did not affect throughput performance, thus we will always refer to the case of 32 packets per burst in further reasoning.

5.3.1 Single Host

We first deployed a single pair of sender/receiver applications on a single host (Figure 3a), varying the sending rate from 1 to 20 Mpps and the packet size from 64 to 1500 bytes:

(D = throughput, L = local, S = 1vs1)

In all our tests, each virtual switch achieves the desired throughput up to a certain maximum value, after which the virtual switch has saturated its capabilities, as shown in Figure 4a. This maximum throughput depends on the size of the packets exchanged through the virtual switch, thus from now on we will consider only the maximum achievable throughput for each virtual switch while varying the size of each packet.

Figures 4b and 4c show the maximum receiving rates achieved in all our tests that employed only two containers on a single host, expressed in Mpps and in Gbps respectively. In each plot, the achieved rate (Y axis) is expressed as a function of the size of each packet (X axis), for which a logarithmic scale has been used.

From both figures it is clear that maximum performance is attained by offloading network traffic
Figure 3: Different testing scenarios used for our evaluations. In particular, (a), (b), and (c) refer to single-host scenarios, while (d), (e), and (f) to scenarios that consider multiple hosts: (a) Single Pair Local Throughput Tests, (b) Multiple Pairs Local Throughput Tests, (c) Local Latency Tests, (d) Single Pair Remote Throughput Tests, (e) Multiple Pairs Remote Throughput Tests, (f) Remote Latency Tests.
to the SR-IOV device, exploiting its embedded hardware switch. The second best was the Basic Forwarding Sample Application, which was expected, since it does not implement any actual switching logic. The very small performance gap between VPP and the latter solution is also a clear indicator that the batch packet processing features included in VPP can distribute packet processing overhead effectively among incoming packets. Finally, OVS and Snabb follow. Comparing these performance figures with other evaluations in the literature (Ara et al., 2019), we were able to conclude that the major bottleneck for Snabb is its internal L2 switching component.

In addition, notice that while the maximum throughput in terms of Mpps is achieved with the smallest of the packet sizes (64 bytes), the maximum throughput in terms of Gbps is actually achieved for the biggest packet size (1500 bytes). In fact, throughput in terms of Gbps is characterized by a logarithmic growth related to the increase of the packet size, as shown in Figure 4c.

From these plots we derived that for software switches the efficiency of each implementation impacts performance only for smaller packet sizes, while for bigger packets the major limitation becomes the ability of the software to actually move packets from one CPU core to another, which is equivalent for any software implementation. Given also the slightly superior performance achieved by SR-IOV, especially for smaller packet sizes, we also concluded that its hardware switch is more efficient at moving large numbers of packets between CPU cores than the software implementations that we tested.

5.3.2 Multiple Hosts

We repeated these evaluations deploying the receiver application on a separate host (Figure 3d), using the only virtual switches able to forward traffic between multiple hosts⁸:

(D = throughput, L = remote, S = 1vs1, V ∈ {ovs, sriov, vpp})

Figure 4d shows the maximum receiving rates achieved for a burst size of 32 packets. In this scenario, results depend on the size of the exchanged packets: for smaller packet sizes the dominating bottleneck is still the CPU for OVS and VPP, while for bigger packets the Ethernet standard limits the total throughput achievable by any virtual switch to only 10 Gbps. From these results we concluded that when the expected traffic is characterized by relatively small packet sizes (up to 256 bytes), deploying a component on a directly connected host does not negatively impact system performance when OVS or VPP are used. In addition, we noticed that in this scenario there is no clear best virtual switch: while for smaller packet sizes SR-IOV is more efficient, both OVS and VPP perform better for bigger ones.

5.4 Throughput Scalability

To test how throughput performance scales when the number of applications on the system increases, we repeated all our evaluations deploying multiple application pairs on the same host (Figure 3b):

(D = throughput, L = local, S ∈ {1vs1, 2vs2, 4vs4})

Figures 4e and 4f show the relationship between the maximum total throughput achievable by the system and the number of pairs deployed simultaneously, for packets of 64 and 1500 bytes respectively.

⁸ The Basic Forwarding Sample Application does not implement any switching logic, while Snabb was not compatible with our selected SR-IOV Ethernet controller.
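The Mpps and Gbps ceilings discussed above can be made concrete with a back-of-the-envelope calculation. As a hedged sketch (not taken from the paper), assuming the quoted packet sizes are full Ethernet frame sizes, and adding the standard 20 bytes of per-frame wire overhead (7 B preamble + 1 B start-of-frame delimiter + 12 B inter-frame gap), the theoretical 10 Gigabit Ethernet limits per frame size are:

```python
# Theoretical 10 Gigabit Ethernet limits per frame size.
# Assumption: the sizes below are full Ethernet frame sizes; each frame
# also occupies 20 extra bytes on the wire (preamble + SFD + inter-frame gap).
LINK_BPS = 10e9
WIRE_OVERHEAD = 20  # bytes on the wire that are not part of the frame

def line_rate_mpps(frame_size):
    """Maximum frames per second at line rate, in millions."""
    return LINK_BPS / ((frame_size + WIRE_OVERHEAD) * 8) / 1e6

def frame_throughput_gbps(frame_size):
    """Throughput counting frame bytes only, in Gbps."""
    return line_rate_mpps(frame_size) * 1e6 * frame_size * 8 / 1e9

for size in (64, 128, 256, 512, 1024, 1500):
    print(f"{size:5d} B: {line_rate_mpps(size):6.2f} Mpps, "
          f"{frame_throughput_gbps(size):5.2f} Gbps")
```

Under these assumptions, 64-byte frames peak near 14.88 Mpps but only about 7.6 Gbps, while 1500-byte frames approach the 10 Gbps link limit (about 9.87 Gbps) at well under 1 Mpps, consistent with the growth shown in Figure 4c.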
[Figure omitted: throughput plots. (a) Actual receiving rate [Mpps] vs. tentative sending rate [Mpps] (D = throughput, L = local, S = 1vs1, B = 32, P = 64); (b) max throughput [Mpps] vs. packet size [bytes] (D = throughput, L = local, S = 1vs1, B = 32); (c) max throughput [Gbps] vs. packet size [bytes] (D = throughput, L = local, S = 1vs1, B = 32); (d) max throughput [Gbps] vs. packet size [bytes] (D = throughput, L = remote, S = 1vs1, B = 32); (e)–(f) max total throughput [Gbps] vs. scenario S (D = throughput, L = local, B = 32, with P = 64 and P = 1500 respectively); (g)–(h) max total throughput [Gbps] vs. scenario S (D = throughput, L = remote, B = 32, with P = 64 and P = 1500 respectively).]
Figure 4: Throughput performance obtained varying the system configuration and the virtual switch used to connect sender/receiver applications deployed in LXC containers.
While the maximum total throughput does not change for relatively small packet sizes when increasing the number of senders, for bigger packet sizes SR-IOV can sustain 4 senders with a per-pair performance drop of only about 18%, achieving almost linear performance improvements as the number of participants increases. On the contrary, virtio-based switches can only distribute the same amount of resources over a bigger number of network flows. From these results we concluded that while for SR-IOV the major limitation is mostly represented by the number of packets exchanged through the network, for virtio-based switches the major bottleneck is the capability of the CPU to move data from one application to another, which depends on the overall amount of bytes exchanged.
[Figure omitted: mean round-trip latency [µs] vs. packet size [bytes] plots. (a)–(c): D = latency, L = local, S = 1vs1, R = 1000, with B = 4, 32, and 128 respectively; (d)–(f): D = latency, L = remote, S = 1vs1, R = 1000, with B = 4, 32, and 128 respectively.]
Figure 5: Average round-trip latency obtained varying the system configuration and the virtual switch used to connect client/server applications deployed in LXC containers. Plots on the left refer to tests performed on a single host, while plots on the right involve two separate hosts.
Repeating the scalability evaluations on multiple hosts (L = remote), we were able to deploy up to 8 application pairs (S = 8vs8) transmitting data from one host to the other (Figure 3e). Results of these evaluations, shown in Figures 4g and 4h, indicate that the outcome strongly depends on the size of the packets exchanged: for bigger packets the major bottleneck is the limited throughput of the Ethernet standard; for smaller ones, the inability of each virtual switch to scale adequately with the number of packet flows impacts system performance even more negatively.

5.5 Latency Performance

We finally evaluated the round-trip latency between two applications when adopting each virtual switch, to estimate the per-packet processing overhead introduced by each solution. In particular, we focused on the ability of each solution to distribute packet processing costs over multiple packets when increasing the burst size. For this purpose, a single pair of client/server applications was deployed on one or two hosts to perform each evaluation. Each test is configured to exchange a very low number of packets per second, so that there is no interference between the processing of one burst of packets and the following one.

First we considered a single pair of client/server applications deployed on a single host (Figure 3c). In these tests we varied the burst size from 4 to 128 packets and the packet size from 64 to 1500 bytes:

(D = latency, L = local, S = 1vs1, R = 1000)
In all our tests, Snabb was not able to achieve latency comparable with the other solutions; for example, while all other virtual switches could maintain their average latency under 40 µs for B = 4, Snabb's delay was always greater than 60 µs, even for very small packet sizes. Since this overall behavior is repeated in all our tests, regardless of which parameters are applied, Snabb will not be discussed further.

Figures 5a to 5c show that only virtio-based solutions were able to achieve single-digit microsecond round-trip latency on a single host, while SR-IOV has a higher latency for small packet and burst sizes. Increasing the burst size over 32 packets, these roles are reversed, with SR-IOV becoming the most lightweight solution, although it was never able to achieve single-digit microsecond latency. From this we inferred that SR-IOV performance is less influenced by variations of the burst size than the other options available, and thus it is more suitable when traffic on a single host is grouped into bigger bursts.

We then repeated the same evaluations by deploying the server application on a separate host (L = remote, Figure 3f). In this new scenario, SR-IOV always outperforms OVS and VPP, as shown in Figures 5d to 5f. This can be easily explained: when using a virtio-based switch to access the physical network, two levels of forwarding are introduced with respect to the case using only SR-IOV, since the two software switch instances, each running in its respective host, introduce additional computations compared to the forwarding performed in hardware by the SR-IOV device.

6 CONCLUSIONS AND FUTURE WORK

In this work, we presented a novel framework aimed at evaluating and comparing the performance of various virtual networking solutions based on kernel bypass. We focused on the interconnection of VNFs deployed in Linux containers on one or multiple hosts.

We performed a number of performance evaluations on virtual switches commonly used in the NFV industrial practice. Test results show that SR-IOV has superior performance, both in terms of throughput and scalability, when traffic is limited to a single host. However, in scenarios that involve inter-host communications, each solution is mostly constrained by the limits imposed by the Ethernet standard. Finally, from a latency perspective we showed that, both for local and remote communications, SR-IOV can attain smaller round-trip latency than its competitors when bigger burst sizes are adopted.

In the future, we plan to support additional virtual networking solutions, like NetVM (Hwang et al., 2015), Netmap (Rizzo, 2012a), and VALE (Rizzo and Lettieri, 2012). We also plan to extend the functionality of the proposed framework, for example by allowing the number of CPU cores allocated to the virtual switch to be customized during the runs. Finally, we would like to compare various versions of the considered virtual switching solutions, as we noticed some variability in the performance figures after upgrading some of the components during the development of the framework.

REFERENCES

Ara, G., Abeni, L., Cucinotta, T., and Vitucci, C. (2019). On the use of kernel bypass mechanisms for high-performance inter-container communications. In High Performance Computing, pages 1–12. Springer International Publishing.

Barach, D., Linguaglossa, L., Marion, D., Pfister, P., Pontarelli, S., and Rossi, D. (2018). High-speed software data plane via vectorized packet processing. IEEE Communications Magazine, 56(12):97–103.

Barik, R. K., Lenka, R. K., Rao, K. R., and Ghose, D. (2016). Performance analysis of virtual machines and containers in cloud computing. In 2016 International Conference on Computing, Communication and Automation (ICCCA). IEEE.

Dong, Y., Yang, X., Li, J., Liao, G., Tian, K., and Guan, H. (2012). High performance network virtualization with SR-IOV. Journal of Parallel and Distributed Computing, 72(11):1471–1480.

Felter, W., Ferreira, A., Rajamony, R., and Rubio, J. (2015). An updated performance comparison of virtual machines and Linux containers. In 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE.

Gallenmüller, S., Emmerich, P., Wohlfart, F., Raumer, D., and Carle, G. (2015). Comparison of frameworks for high-performance packet IO. In 2015 ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS). IEEE.

Géhberger, D., Balla, D., Maliosz, M., and Simon, C. (2018). Performance evaluation of low latency communication alternatives in a containerized cloud environment. In 2018 IEEE 11th International Conference on Cloud Computing (CLOUD). IEEE.

Hwang, J., Ramakrishnan, K. K., and Wood, T. (2015). NetVM: High performance and flexible networking using virtualization on commodity platforms. IEEE Transactions on Network and Service Management, 12(1):34–47.

Intel (2015). Open vSwitch enables SDN and NFV transformation. White Paper, Intel.