Design of RapidIO User-level Communication Interface Based on Socket in Real-time Applications

Ying-hui JI, Chao KONG, Hui-zhi CAI


Institute of Acoustics
Chinese Academy of Sciences
Beijing, China
{jyh, kc, chz}@mail.ioa.ac.cn

Abstract—Interconnect fabric technologies such as RapidIO, InfiniBand and PCIe have evolved to 10 Gbps. However, user applications still cannot fully benefit from such high-speed technology because of the high processing overhead and redundant data copies of user-level protocols. It remains difficult to design and implement flexible and efficient communication software, especially in real-time applications. This paper introduces a high-performance RapidIO user-level communication interface, called RULCI. RULCI provides the standard socket API to end users and also supports user-defined interfaces. According to the communication characteristics and the transfer data size per message, it realizes two modes of communication: one based on remote direct memory access, and the other based on message passing. RULCI is especially suitable for real-time systems because of its ease of use, message orientation, short transfer delays and large-message support. The experimental results show that RULCI delivers the promising communication performance of RapidIO to end users.

Keywords—RapidIO communication interface; socket API; message passing; RDMA; real-time signal processing system

I. INTRODUCTION AND MOTIVATION
The capability of digital signal processing technology has been developing rapidly in recent years. At the same time, the complexity and computational cost of signal processing algorithms have grown exponentially. Parallel signal processing platforms make it possible to meet the demand of running such complex algorithms. However, the interconnect architecture of the processing nodes and their communication capability become essential problems. Several high-performance interconnect specifications are available, such as RapidIO and InfiniBand. Traditional sockets over host-based TCP/IP have not been able to keep pace with exponentially increasing network speeds. Sockets Direct Protocol (SDP) is an InfiniBand Architecture byte-stream transport protocol defined by the InfiniBand Trade Association [1]. Many researchers have implemented the SDP protocol over InfiniBand and gained performance boosts [2]. However, there are few suitable upper-layer protocols that can be used over a RapidIO fabric interconnection.

The RapidIO architecture introduces a high-bandwidth, low-latency interconnect specification. As far as we know, the only existing user-level communication protocol that supports RapidIO is a network simulator, which simulates a RapidIO node as a common Ethernet network card so that end users can communicate over the RapidIO fabric using standard TCP/IP protocols. This method attains a maximum throughput of only 120 MB/sec on the Linux operating system, which wastes much of the available performance [3].

The rest of this paper is organized as follows. In Section II, we introduce the basic concepts of RapidIO, the socket API and our application circumstances. An overview of the RapidIO user-level communication interface (RULCI) is given in Section III. Section IV describes the key RULCI technologies. Section V shows the optimization method and performance results. We conclude this paper in Section VI.

II. BACKGROUND

A. RapidIO Specification

RapidIO [4] is a packet-switched interconnect intended primarily as an intra-system interface for chip-to-chip, chip-to-board and board-to-board communications at 10 Gbits-per-second performance levels. The RapidIO protocol is a layered protocol comprised of three layers: logical, transport and physical. Its logical layer supports at least three different transaction types: simple I/O operations, message passing and global shared memory. DMA, high bandwidth and low latency are the key contributions of the RapidIO specification to real-time systems. The message passing type provides applications with a traditional message passing interface with mailbox-style delivery, which supports 26 message priorities and segments messages of up to 4 kilobytes into packets. The I/O type allows data transfer without occupying CPU time, which is called remote direct memory access (RDMA) [5]. Furthermore, RapidIO uses the LVDS technique to minimize power usage at a high clock rate. Therefore, it is especially appropriate for embedded systems.

B. Supporting the Socket API over RapidIO

The original BSD/POSIX socket API, which has the largest installed base in embedded operating systems, only enables synchronous operation. It follows the file abstraction of the UNIX operating system and can encapsulate different protocol families over different networks. Socket APIs have evolved over the years to gain higher performance.



The authors use the original socket API in RULCI to simplify its usage. RULCI also realizes asynchronous communication through pipelining. The authors rewrote almost all of the internal socket functions and defined a new socket protocol family, named AF_RULCI. A user application just needs to set the socket family to AF_RULCI when creating a RULCI socket descriptor; after that, its usage is much like traditional socket API communication. RULCI supports both the SOCK_STREAM and SOCK_DGRAM protocol types. The authors optimize RapidIO transaction performance in four aspects: (1) exploiting the underlying hardware capabilities, (2) reducing the internal transaction overhead, (3) eliminating unnecessary data copies and checksums, and (4) pipelining transactions. A minimal usage sketch is given below.
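As an illustration of the intended usage, the following sketch creates a RULCI datagram socket. The socket() call itself is standard; the numeric value of AF_RULCI and the layout of the RapidIO endpoint address are illustrative assumptions, since the paper does not specify them.

    /* Hedged sketch: creating a RULCI socket.  AF_RULCI's numeric value
     * and the address structure are illustrative assumptions. */
    #include <sys/socket.h>

    #define AF_RULCI 27               /* hypothetical protocol family number */

    struct sockaddr_rulci {           /* hypothetical RapidIO endpoint address */
        unsigned short family;        /* must be AF_RULCI */
        unsigned short dest_id;       /* RapidIO destination device ID */
        unsigned short mbox;          /* mailbox / channel number */
    };

    int open_rulci_socket(void)
    {
        /* SOCK_DGRAM selects the unreliable datagram service;
         * SOCK_STREAM would select the reliable, congestion-controlled one. */
        return socket(AF_RULCI, SOCK_DGRAM, 0);
    }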

C. Transfer Data Characteristics

The characteristics of data transfer in real-time signal processing fields can be summarized as real-time, complex, high-frequency, large-message traffic. For example, in a phased-array radar system, we must transfer more than 16 kilobytes of data within 200 microseconds to fulfill the real-time processing requirement, i.e., a sustained throughput of at least 16 KB / 200 us, roughly 80 MB/sec per transfer. We therefore design two transaction modes: RDMA mode is suitable for sending large messages, and message passing mode is used for small messages.
III. OVERVIEW

A. Communication Architecture

As depicted in Fig. 1, the communication architecture of our design is comprised of three stages, divided into six parts. Each part can interact directly only with its neighbors. RULCI is located between the RapidIO device driver and the socket API. As can be seen, RULCI does not operate the RapidIO hardware directly; it only communicates with the RapidIO device driver. Meanwhile, a user application can access RULCI services only through the socket API or other APIs. Based on this design, the authors hope RULCI can provide multimedia communication support in the future. To some extent, RULCI is much like a communication middleware [6].

Figure 1. Communication architecture
B. RULCI Internal

In the current implementation, RULCI is composed of three layers, as shown in Fig. 2. The bottom layer provides a registration service to the hardware device driver: a device driver must tell RULCI its initialization function, send function, receive function and congestion method before using the RULCI service to communicate. The top layer is the socket emulation layer, where RULCI registers the new socket address family with the operating system. The middle layer is the core of RULCI: the congestion control mechanism, retransmission method, address translation, message dispatch, and the send/receive memory management are all realized in this layer. Both a message mode adapter and an RDMA mode adapter are also provided there, in order to accommodate the two different transfer mechanisms. A sketch of the driver registration interface is given below.

Figure 2. RULCI internal
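The registration interface is not spelled out in the paper; a struct-of-callbacks is one plausible shape for it. All names in the following sketch are hypothetical.

    /* Hedged sketch: bottom-layer driver registration.  The paper only
     * states that a driver registers its init, send, receive and
     * congestion functions with RULCI; everything here is illustrative. */
    struct rulci_drv_ops {
        int (*init)(void);
        int (*send)(const void *pkt, unsigned len, unsigned short dest_id);
        int (*recv)(void *pkt, unsigned maxlen, unsigned short *src_id);
        int (*congested)(void);   /* nonzero while the bearer is congested */
    };

    /* A RapidIO device driver would call this once before any traffic. */
    int rulci_register_driver(const struct rulci_drv_ops *ops);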
IV. KEY TECHNOLOGIES

In this section we introduce four key technologies of RULCI: the realization of message passing mode communication, the realization of RDMA mode communication, the realization of the congestion mechanism, and the memory management strategies of RULCI.

A. Message Passing Mode

Message passing is the most commonly used method in communication protocols: the sender sends out a packet to the receiving processor node without caring where the data will be put at the receiver side, and the receiver learns that a packet has arrived by polling or by interrupt. In the receiving processor node, received packets are put sequentially into a message queue. As described in Fig. 3, RULCI realizes this communication mode by the following steps. Firstly, the receive process enters kernel mode by calling the recvfrom() function and tests whether any desired packet has arrived. If so, the packet is defragmented when needed, copied to user memory, and control returns to the user application; otherwise, the receive process waits for a message by calling the semTake() function. In the send process, the send information is passed to the kernel by calling the sendto() function, and the message is fragmented into packets according to the message transfer unit (MTU), which is 4 kilobytes in RapidIO message passing mode. Each packet is then copied to the kernel buffer, the necessary header info is added, and the packets are appended to the RapidIO device driver send queue; finally, the send process returns. The RapidIO device driver is responsible for transmission: the RapidIO message send controller keeps popping packet descriptors and sending packets over the RapidIO hardware until the send queue is empty. The RapidIO message receive interrupt is triggered when a packet arrives at the receiving node. In the interrupt service routine, the incoming packet is dispatched and the corresponding receive process is awakened by calling the semGive() function. The awakened receive process defragments the incoming packets and copies them to the proper user buffer space. After the receive process has received an entire message, it returns to the user application. A usage sketch from the application's side is given after Fig. 3.

Figure 3. Message passing mode realization
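From the application's point of view, message passing mode is driven through the ordinary datagram calls. A minimal, hedged sender/receiver pair might look as follows; it reuses the hypothetical AF_RULCI and sockaddr_rulci definitions sketched in Section II.

    /* Hedged sketch: message passing mode from user space.  The kernel
     * behavior in the comments follows the steps described above. */
    #include <sys/socket.h>
    #include <string.h>

    void sender_side(int s, struct sockaddr_rulci *peer)
    {
        char msg[1024];
        memset(msg, 0xAB, sizeof msg);
        /* sendto() enters the kernel, fragments to the 4 KB MTU if needed,
         * copies into kernel buffers and queues packets on the driver. */
        sendto(s, msg, sizeof msg, 0,
               (struct sockaddr *)peer, sizeof *peer);
    }

    void receiver_side(int s)
    {
        char buf[1024];
        /* recvfrom() blocks (via semTake) until the receive interrupt has
         * dispatched a matching packet, then copies it to user memory. */
        recvfrom(s, buf, sizeof buf, 0, NULL, NULL);
    }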

In this mode, messages must be frequently fragmented and defragmented whenever the user wants to transfer a message much bigger than the MTU. Because these are time-consuming operations, this mode is not suitable for transferring a large message in one call. However, message passing mode has excellent communication latency when the message size is not much larger than the MTU, which is critical in real-time communication. In signal processing fields, though, messages are sometimes much larger than the MTU; for this circumstance, RDMA mode is provided.
B. RDMA Mode

The realization of RDMA mode has three basic differences from message passing mode. Firstly, the sender must know the receiver's receive address space before sending the message, and no interrupt is raised at the receiver when a message arrives; the sender must therefore send a notifying message to tell the receiver's kernel that the message has been delivered. Secondly, RDMA mode has no MTU limit: it can send a message of any size at a time, so a message never needs to be fragmented and defragmented within one transfer. The last difference is that RDMA mode manages the kernel receive buffer independently in each receiving process. The authors designed this mode of communication to fulfill the requirement of transferring large messages. The detailed realization of this mode is shown in Fig. 4. Before communication starts, the sender and receiver must handshake, and the receiver must tell the sender to which address space the message should be sent. The receive process enters kernel mode by calling the recv() function and then waits for the incoming message. In the send process, the message is copied to kernel memory in a single pass and the necessary header info is added when the process enters kernel mode by calling the send() function. After that, the message descriptor is handed to the global DMA send thread, and the send process returns to the user, which can immediately start a new transaction. The global DMA send thread writes the message descriptors to the RapidIO DMA channel controller sequentially. An interrupt is generated when the DMA channel has sent a message out, and in the interrupt service routine a notifying message is sent to the receiver. At the receiving node, on receipt of the notifying message, RULCI determines which process the news belongs to and fetches the message info from the pre-agreed address space. The message is then dispatched to the sleeping process and the process is awakened. At last, the receive process copies the message to user memory and returns to the user. A sketch of the send-side descriptor path is given below.

Figure 4. RDMA mode realization
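The paper does not show the send thread itself; the following is a minimal sketch of what a global DMA send thread of this kind could look like. Every identifier, including desc_queue_pop() and dma_write_descriptor(), is hypothetical.

    /* Hedged sketch: the global DMA send thread.  The descriptor layout
     * and helper functions are illustrative assumptions. */
    struct rdma_desc {
        void          *kbuf;      /* kernel copy of the user message  */
        unsigned long  len;       /* whole-message length, no MTU cap */
        unsigned short dest_id;   /* RapidIO destination device ID    */
        unsigned long  raddr;     /* receiver address from handshake  */
    };

    extern struct rdma_desc *desc_queue_pop(void);   /* blocks when empty */
    extern void dma_write_descriptor(unsigned short dest_id,
                                     unsigned long raddr,
                                     void *kbuf, unsigned long len);

    void dma_send_thread(void)
    {
        for (;;) {
            struct rdma_desc *d = desc_queue_pop();
            /* Program the RapidIO DMA channel; completion raises an
             * interrupt, whose handler sends the notifying message. */
            dma_write_descriptor(d->dest_id, d->raddr, d->kbuf, d->len);
        }
    }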
C. Congestion Control Method

Bearer congestion control exists for every socket type. It means that the RapidIO device message send or receive queue is full, or the DMA channel is busy, so the bearer is unable to accept any more packets. A second kind, link congestion control, additionally exists when the socket type is SOCK_STREAM. The authors handle these two kinds of congestion control as follows.

1) Bearer Congestion Control: The bearer congestion function may be activated if the local bearer medium becomes overloaded. This callback function keeps track of the current status of the bearer and stops accepting any packet send calls on this processor node until the bearer is no longer congested. During this interval the user application can still perform send system calls, and packets accumulate in the RULCI kernel send queues, but all actual transmission is stopped.

2) Link Congestion Control: When the socket type is SOCK_STREAM, communication is reliable: messages must be transferred reliably, so a link congestion control mechanism and message retransmission are necessary. RULCI implements this kind of congestion control with several atomic counters. For each pair of communicating endpoints there is a send counter, a response counter and a receive counter, all initialized to zero when the link between the pair of sockets is established. At the sender, the send counter and the response counter are both incremented after a packet has been sent. At the receiver, the receive counter is incremented after an incoming packet is received.
At the same time, the receiver sends a receive-response message back to the sender whenever it has received a certain number of packets. The sender, on receiving the response message, decreases its response counter by that number. Link congestion occurs if the response counter exceeds a certain threshold, and the user application cannot send any more messages while link congestion is in effect. Retransmission is triggered when the send counter at the sender and the receive counter at the receiver remain unequal for some fixed period. The counter logic is sketched below.
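As an illustration of this scheme, here is a hedged sketch of the per-link counters. The batch-size and threshold constants are invented for the example; the paper does not give their values.

    /* Hedged sketch: link congestion control counters for one
     * SOCK_STREAM link.  Constants are illustrative, not the paper's. */
    #include <stdatomic.h>
    #include <stdbool.h>

    #define RESP_BATCH     8    /* receiver acks every 8 packets (assumed)  */
    #define CONGEST_LIMIT 64    /* unacked packets before blocking (assumed) */

    struct rulci_link {
        atomic_uint send_cnt;   /* sender: total packets sent            */
        atomic_uint resp_cnt;   /* sender: packets not yet acknowledged  */
        atomic_uint recv_cnt;   /* receiver: total packets received      */
    };

    /* Sender side: called after each packet is sent; returns false
     * once the link is congested and sends must be refused. */
    static bool on_packet_sent(struct rulci_link *l)
    {
        atomic_fetch_add(&l->send_cnt, 1);
        atomic_fetch_add(&l->resp_cnt, 1);
        return atomic_load(&l->resp_cnt) <= CONGEST_LIMIT;
    }

    /* Sender side: called when a receive-response message arrives. */
    static void on_response(struct rulci_link *l)
    {
        atomic_fetch_sub(&l->resp_cnt, RESP_BATCH);
    }

    /* Receiver side: called per incoming packet; returns true when a
     * receive-response message should be sent back to the sender. */
    static bool on_packet_received(struct rulci_link *l)
    {
        return atomic_fetch_add(&l->recv_cnt, 1) % RESP_BATCH == RESP_BATCH - 1;
    }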
D. Reasonable Send/Receive Memory Management

It is obvious that memory allocation and deallocation are time-consuming operations, especially in embedded systems with limited total memory. RULCI manages the send and receive memory as follows. Firstly, RULCI allocates enough kernel memory when the RULCI kernel module is loaded, and adds it to a dedicated memory pool. Secondly, when a RULCI send or receive kernel thread needs memory, it takes it from this pool. Thirdly, memory is released back to the pool as soon as the owning thread no longer uses it. Finally, a requesting thread is blocked when the pool is about to be exhausted. A sketch of such a pool is given below.
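A free-list pool of fixed-size blocks is one straightforward realization of this strategy. The sketch below is illustrative: it uses POSIX primitives rather than whatever kernel primitives RULCI actually uses, and the block size and count are assumptions.

    /* Hedged sketch: a fixed-block memory pool with blocking acquire. */
    #include <pthread.h>
    #include <stdlib.h>

    #define POOL_BLOCKS 256
    #define BLOCK_SIZE 4096            /* matches the 4 KB messaging MTU */

    struct pool {
        void           *free_list[POOL_BLOCKS];
        int             nfree;
        pthread_mutex_t lock;
        pthread_cond_t  nonempty;
    };

    /* Called once at module load: pre-allocate every block. */
    void pool_init(struct pool *p)
    {
        pthread_mutex_init(&p->lock, NULL);
        pthread_cond_init(&p->nonempty, NULL);
        for (p->nfree = 0; p->nfree < POOL_BLOCKS; p->nfree++)
            p->free_list[p->nfree] = malloc(BLOCK_SIZE);
    }

    /* Blocks the calling thread when the pool is exhausted. */
    void *pool_get(struct pool *p)
    {
        pthread_mutex_lock(&p->lock);
        while (p->nfree == 0)
            pthread_cond_wait(&p->nonempty, &p->lock);
        void *blk = p->free_list[--p->nfree];
        pthread_mutex_unlock(&p->lock);
        return blk;
    }

    /* Returns a block to the pool and wakes one waiting thread. */
    void pool_put(struct pool *p, void *blk)
    {
        pthread_mutex_lock(&p->lock);
        p->free_list[p->nfree++] = blk;
        pthread_cond_signal(&p->nonempty);
        pthread_mutex_unlock(&p->lock);
    }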

V. PERFORMANCE OPTIMIZATION AND EXPERIMENTAL RESULTS

Memory copy and data checksum are both time-consuming operations [7]. Fortunately, RapidIO provides reliable data transactions, so RULCI does not need to checksum data. However, RULCI still needs one memory copy at both the sender and the receiver. The copy time is reduced compared with traditional TCP/IP protocols, but it remains the performance bottleneck. There are at least two solutions: zero-copy and faster copying [8]. As noted in [1], zero-copy is broadly used in SDP over InfiniBand interconnections. However, zero-copy incurs the overhead of locking the application buffers in physical memory, registering them with the kernel, and additional communication overhead associated with buffer mapping [2]. Zero-copy yields a performance boost when the message size is large enough, but it does not perform satisfactorily for small messages. Thanks to AltiVec(TM) technology, the authors chose the second method to optimize RULCI performance. AltiVec technology in G4 processors provides a degree of data parallelism through separate SIMD execution units; more information about AltiVec can be found in [9]. By using an AltiVec copy instead of a normal copy, RULCI communication performance can be greatly improved. The RULCI performance test results are described in Table I; a sketch of an AltiVec-style copy loop is shown below.
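The paper does not list the copy routine itself; the following is a minimal sketch of a SIMD copy loop using standard AltiVec intrinsics, purely to illustrate the idea.

    /* Hedged sketch: SIMD copy with AltiVec intrinsics (compile with
     * -maltivec).  Assumes 16-byte-aligned src/dst and len % 16 == 0;
     * a production version must handle unaligned heads and tails. */
    #include <altivec.h>
    #include <stddef.h>

    void altivec_copy(unsigned char *dst, const unsigned char *src, size_t len)
    {
        for (size_t i = 0; i < len; i += 16) {
            /* vec_ld/vec_st move 16 bytes per instruction through the
             * vector unit instead of the scalar integer pipeline. */
            vector unsigned char v = vec_ld(i, src);
            vec_st(v, i, dst);
        }
    }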
TABLE I. RULCI PERFORMANCE RESULTS IN VXWORKS OS

    packet size   mode      Throughput (Mbytes/sec)   Latency (us)
    (bytes)                  normal      altivec      normal   altivec
    64            message        8            9          15         9
    256           message       35           50          17        10
    1k            message       90          140          21        13
    4k            message      125          283          25        46
    4k            RDMA           3            8          83        62
    16k           RDMA          12           29         119       107
    64k           RDMA          48          114         655       281
    256k          RDMA         190          442           -         -
    1024k         RDMA         191          454           -         -

It can be seen in Table I that message passing mode has outstanding communication latency when the message size is less than 4 kilobytes, but this mode can only attain 125 Mbytes/sec maximum throughput without AltiVec optimization. Conversely, RDMA mode can attain 190 Mbytes/sec maximum throughput. Both modes obtain more than twice the maximum throughput with AltiVec optimization, while reducing communication latency significantly.

VI. CONCLUSIONS AND FUTURE WORKS

RULCI is now implemented on the Linux and VxWorks operating systems. By providing two transfer modes, RULCI delivers outstanding performance for both small and large messages, and users can choose the mode that suits the requirements of their application. In future work we will also try to support PCIe as an additional bearer.

However, our optimization has limitations when no AltiVec unit is available. Meanwhile, the CPU utilization of our design is still somewhat high because a software copy still exists. In the future, we will try to reduce the CPU utilization by other methods, such as hardware copy and zero-copy.

REFERENCES

[1] D. Goldenberg, M. Kagan, R. Ravid, and M. S. Tsirkin, "Transparently achieving superior socket performance using zero copy socket direct protocol over 20Gb/s InfiniBand links," in RAIT Workshop, Cluster 2005.
[2] P. Balaji, S. Narravula, K. Vaidyanathan, S. Krishnamoorthy, J. Wu, and D. K. Panda, "Sockets Direct Protocol over InfiniBand in clusters: is it beneficial?," in ISPASS, Austin, Texas, 2004.
[3] Liang Ji, "Design and implement the RapidIO based high performance communication interface," postgraduate thesis, Shanghai University, 2008.
[4] RapidIO Trade Association, RapidIO Specification 1.3, www.rapidio.org/specs/current, 2001.
[5] J. Hilland, P. Culley, J. Pinkerton, and R. Recio, RDMA Protocol Verbs Specification, RDMA Consortium, 2003.
[6] A. Kanevsky, A. Skjellum, and J. Watts, "Standardization of a communication middleware for high-performance real-time systems," in Proceedings of the Real-Time Systems Symposium, 1997.
[7] D. D. Clark, J. Romkey, and H. Salwen, "An analysis of TCP processing overhead," in Proceedings of the 13th Conference on Local Computer Networks, 1988, pp. 284-291.
[8] T. von Eicken, A. Basu, V. Buch, and W. Vogels, "U-Net: a user-level network interface for parallel and distributed computing," in Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, 1995, pp. 40-53.
[9] http://www.freescale.com/webapp/sps/site/overview.jsp?code=DRPPCALTVC
