Design of RapidIO User-Level Communication Interface Based On Socket in Real-Time Applications
Abstract—Interconnect fabric technologies such as RapidIO, InfiniBand and PCIe have evolved to 10 Gbps. However, user applications still cannot fully benefit from such high-speed technology because user-level protocols have high processing overhead and make redundant data copies. It remains difficult to design and implement flexible and efficient communication software, especially for real-time applications. This paper introduces a high-performance RapidIO user-level communication interface called RULCI. RULCI provides the standard socket API to end users and also supports user-defined interfaces. Depending on the communication characteristics and the amount of data transferred per message, it offers two communication modes: one based on remote direct memory access, and the other based on message passing. RULCI is especially suitable for real-time systems because it is easy to use, message-oriented, has short transfer delays and supports large messages. The experimental results show that RULCI lets end users exploit the promising communication performance of RapidIO.

Keywords-RapidIO communication interface; socket API; message passing; RDMA; real-time signal processing system

I. INTRODUCTION AND MOTIVATION

The capability of digital signal processing technology has developed rapidly in recent years. At the same time, the complexity and computational cost of signal processing algorithms have grown exponentially. Parallel signal processing platforms make it possible to meet the demands of running such complex algorithms. However, the interconnect architecture of the processing nodes and their communication ability have become essential problems. Several high-performance interconnect specifications are available, such as RapidIO and InfiniBand. Traditional sockets over host-based TCP/IP have not been able to keep pace with the exponentially increasing demand for network speed. The Sockets Direct Protocol (SDP) is an InfiniBand Architecture byte-stream transport protocol defined by the InfiniBand Trade Association [1]. Many researchers have implemented SDP over InfiniBand and obtained performance boosts [2]. However, few suitable upper-layer protocols exist for RapidIO fabric interconnects.

The RapidIO architecture introduces a high-bandwidth, low-latency interconnect specification. As far as we know, the only user-level communication protocol that supports RapidIO is a network simulator. The network simulator presents a RapidIO node as an ordinary Ethernet network card, so end users can communicate over the RapidIO fabric with standard TCP/IP protocols. This method achieves a maximum throughput of only 120 MB/sec under the Linux operating system, which is a serious waste of performance [3].

The rest of this paper is organized as follows. In Section II, we introduce the basic concepts of RapidIO, the socket API and our application circumstances. An overview of the RapidIO user-level communication interface (RULCI) is given in Section III. Section IV describes the key RULCI technologies. Section V presents the optimization method and performance results. We conclude the paper in Section VI.

II. BACKGROUND

A. RapidIO Specification

RapidIO [4] is a packet-switched interconnect intended primarily as an intra-system interface for chip-to-chip, chip-to-board and board-to-board communications at 10 Gbit/s performance levels. The RapidIO protocol is organized into three layers: logical, transport and physical. Its logical layer supports at least three different transaction types: simple I/O operations, message passing, and globally shared memory. DMA, high bandwidth and low latency are the key contributions of the RapidIO specification to real-time systems. The message passing type provides applications with a traditional message passing interface with mailbox-style delivery, which supports 26 message priorities and segments messages of up to 4 kilobytes into packets. The I/O type allows data transfer without occupying CPU time, which is called remote direct memory access (RDMA) [5]. Furthermore, RapidIO uses LVDS (low-voltage differential signaling) to minimize power usage at high clock rates. It is therefore especially appropriate for embedded systems.

B. Supporting the Socket API over RapidIO

The original BSD/POSIX socket API, which has the largest installed base in embedded operating systems, only enables synchronous operation. It follows the file abstraction of the UNIX operating system and can encapsulate different protocol families over different networks. Socket API variants have evolved over the years to gain higher performance. The authors use the
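Memory allocation and reallocation are clearly time-consuming operations, especially in

TABLE I. RULCI PERFORMANCE TEST RESULTS (EXCERPT)
Message size   Mode   Throughput, no AltiVec (MB/s)   Throughput, AltiVec (MB/s)
256k           RDMA   190                             442                          -   -
1024k          RDMA   191                             454                          -   -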
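Section V attributes the throughput gains in Table I to replacing the normal memory copy at the sender and receiver with an AltiVec copy, but the authors' copy routine is not shown. The C function below is a minimal sketch of the idea, not the RULCI implementation: it moves data in 16-byte chunks, which matches the 128-bit width of an AltiVec register. As a portable stand-in for the AltiVec intrinsics vec_ld and vec_st, it assumes a GCC or Clang compiler and uses their vector_size extension. The names vector_copy and vec16 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 16-byte vector type (GCC/Clang vector extension). 16 bytes matches the
   width of an AltiVec register; this is a portable stand-in, not AltiVec code. */
typedef unsigned char vec16 __attribute__((vector_size(16)));

/* Copy n bytes from src to dst in 16-byte vector chunks.
   Like memcpy, the two buffers must not overlap. */
static void vector_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Precondition: non-overlapping buffers (checked in debug builds). */
    assert((uintptr_t)d + n <= (uintptr_t)s || (uintptr_t)s + n <= (uintptr_t)d);

    /* Main loop: one 16-byte load and one 16-byte store per iteration.
       The fixed-size memcpy calls avoid alignment and aliasing problems;
       with optimization enabled, compilers typically emit a single vector
       load or store for each of them. */
    while (n >= sizeof(vec16)) {
        vec16 v;
        memcpy(&v, s, sizeof v);
        memcpy(d, &v, sizeof v);
        d += sizeof v;
        s += sizeof v;
        n -= sizeof v;
    }

    /* Scalar tail: copy the remaining 0 to 15 bytes one at a time. */
    while (n-- > 0)
        *d++ = *s++;
}
```

In a design like RULCI, vector_copy(dst, src, n) would replace the memcpy call for the single copy into or out of a non-overlapping send or receive buffer. Whether it beats the platform's own memcpy depends on the compiler and C library, so any gain should be measured on the target rather than assumed.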
V. PERFORMANCE OPTIMIZATION AND EXPERIMENTAL RESULTS

Memory copy and data checksum operations are both time-consuming [7]. Fortunately, RapidIO provides reliable data transactions, so RULCI does not need to checksum data. However, RULCI needs one memory copy at both the sender and the receiver. The copy time is lower than with traditional TCP/IP protocols, but copying is still the performance bottleneck. There are at least two solutions: zero-copy, and improving copy speed [8]. As described in [1], zero-copy is widely used by SDP over InfiniBand interconnects. However, zero-copy incurs the overhead of locking application buffers in physical memory and registering them with the kernel, plus additional communication overhead associated with buffer mapping [2]. Zero-copy can boost performance when messages are large enough, but it does not perform satisfactorily for small messages. Thanks to AltiVec™ technology, the authors chose the second method to optimize RULCI performance. AltiVec in G4 processors provides data parallelism through separate SIMD execution units. More information about AltiVec can be found in [9]. By using an AltiVec copy instead of a normal copy, RULCI communication performance is greatly improved. The RULCI performance test results are given in Table I.

It can be seen in Table I that the message passing mode has outstandingly low communication latency when the message size is less than 4 kilobytes. However, this mode achieves a maximum throughput of only 125 Mbytes/sec without AltiVec optimization. Conversely, the RDMA mode achieves a maximum throughput of 190 Mbytes/sec. With AltiVec optimization, both modes more than double their maximum throughput while also reducing communication latency significantly.

However, our optimization also has limitations when AltiVec is not available. In addition, the CPU utilization of our design is still somewhat high because a software copy still exists. In the future, we will try to reduce CPU utilization by other methods, such as hardware copy and zero-copy.

REFERENCES

[1] D. Goldenberg, M. Kagan, R. Ravid, and M. S. Tsirkin, "Transparently achieving superior socket performance using zero copy socket direct protocol over 20Gb/s InfiniBand links," in RAIT Workshop, Cluster 2005.
[2] P. Balaji, S. Narravula, K. Vaidyanathan, S. Krishnamoorthy, J. Wu, and D. K. Panda, "Sockets direct protocol over InfiniBand in clusters: is it beneficial?," in ISPASS, Austin, Texas, 2004.
[3] Liang Ji, "Design and implement the RapidIO based high performance communication interface," Postgraduate Thesis, Shanghai University, 2008.
[4] RapidIO Trade Association, RapidIO Specification 1.3, www.rapidio.org/specs/current, 2001.
[5] J. Hilland, P. Culley, J. Pinkerton, and R. Recio, RDMA Protocol Verbs Specification, RDMA Consortium, 2003.
[6] A. Kanevsky, A. Skjellum, and J. Watts, "Standardization of a communication middleware for high-performance real-time systems," Proceedings of the Real-Time Systems Symposium, 1997.
[7] D. D. Clark, J. Romkey, and H. Salwen, "An analysis of TCP processing overhead," Proceedings of the 13th Conference on Local Computer Networks, 1988, pp. 284-291.
[8] T. von Eicken, A. Basu, V. Buch, and W. Vogels, "U-Net: a user-level network interface for parallel and distributed computing," Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, United States, 1995, pp. 40-53.
[9] http://www.freescale.com/webapp/sps/site/overview.jsp?code=DRPPCALTVC