Performance Evaluation of Hardware Unit for Fast IP Packet Header Parsing

Danijela Efnusheva

Faculty of Electrical Engineering and Information Technologies, Computer Science and Engineering Department, Skopje, Republic of North Macedonia
danijela@feit.ukim.edu.mk

Abstract. Modern multi-gigabit computer networks face an enormous increase in network traffic and constant growth in the number of users, servers, and connections, as well as demands for new applications, services, and protocols. Given that networking devices remain the communication bottleneck in such networks, the design of fast network processing hardware is an attractive field of research. In general, most hardware devices that provide network processing spend a significant share of their processor cycles accessing IP packet header fields by means of general-purpose processing. Therefore, this paper proposes a dedicated IP packet header parsing unit that allows direct, single-cycle access to IP packet header fields of different sizes, with the aim of providing faster network packet processing. The proposed unit is applied to a general-purpose MIPS processor and to a memory-centric network processor core, and their network processing performances are compared and evaluated. It is shown that the proposed IP header parsing unit speeds up IP packet header parsing when applied to both processor cores, leading to multi-gigabit network processing throughput.

Keywords: Header parser · IP packet processing · Memory-centric computing · Multi-gigabit networks · Network processor

1 Introduction

The rapid expansion of the Internet has resulted in an increased number of users, network devices, connections, and novel applications, services, and protocols in modern multi-gigabit computer networks [1]. As technology has advanced, network connection links have gained higher capacities (especially with the development of fiber-optic communications) [2], and consequently networking devices have experienced many difficulties in coping with the increased network traffic and in satisfying, in a timely manner, the newly imposed requirements of high throughput, high speed, and low delay [3].
Network processors (NPs) have become the most popular solution to this bottleneck problem when constructing high-speed gigabit networks. Therefore, they

are included in various network equipment devices, such as routers, switches, firewalls, and intrusion detection systems (IDS). In general, NPs are defined as programmable chip devices that are specifically tailored to provide network packet processing at multi-gigabit speeds [2–4]. They are usually implemented as application-specific instruction set processors (ASIPs), with a customized instruction set based on RISC, CISC, VLIW, or some other instruction set architecture [3]. Over the last few years many vendors (Intel, Agere, IBM, etc.) have developed their own NPs, which has resulted in many NP architectures on the market [3]. Although there is no standard NP architecture, most NP designs generally include many processing engines (PEs), dedicated hardware accelerators (coprocessors or functional units), adjusted memory architectures, interconnection mechanisms, hardware parallelization techniques (e.g., pipelining), and software support [4,5]. NP architecture design remains an active field of research, with the NPU market expected to achieve strong growth in the near future. What is more, many new ideas, such as the NetFPGA architecture [6] or software routers [7], are constantly emerging.
The most popular NPs in use today include one or many homogeneous or heterogeneous processing cores that operate in parallel. For example, Intel's IXP2800 processor [8] consists of 16 identical multi-threaded general-purpose RISC processors organized as a pool of parallel homogeneous processing cores that can be easily programmed, with great flexibility towards ever-changing services and protocols. Furthermore, EZchip has introduced the first network processor with 100 cache-coherent programmable ARM cores [9], which is by far the largest 64-bit ARM processor announced so far. Along with the general-purpose ARM cores, this chip also includes a mesh core interconnect architecture that provides high bandwidth, low latency, and high linear scalability.
The discussed NPs confirm that most network processing is basically performed by general-purpose RISC-based processing cores (a cheaper but slower solution) combined with custom-tailored hardware units (a more expensive but also more energy-efficient and faster solution) for executing complex tasks such as traffic management, fast table lookup, etc. If network packet processing on a general-purpose processing core is analyzed, it can easily be concluded that a significant part of the processor cycles is spent on accessing packet header fields, especially when those fields are not word-aligned. In such cases, bit-wise logical and arithmetic operations are needed in order to extract a field's value from the packet header (i.e., to parse the header) before it can be processed further.
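As a simple illustration of this overhead, the following C sketch shows the kind of shift-and-mask sequence a general-purpose core typically executes to pull the non-word-aligned fields out of the first 32-bit word of an IPv4 header. The field layout follows RFC 791; the function and variable names are illustrative only and are not taken from the paper.

#include <stdint.h>

/* Extract the non-word-aligned fields of the first IPv4 header word.
 * The word is assumed to be held with the Version field in the most
 * significant bits, as defined in RFC 791. */
void parse_first_ipv4_word(uint32_t w,
                           uint8_t *version, uint8_t *ihl,
                           uint8_t *tos, uint16_t *total_len)
{
    *version   = (w >> 28) & 0xF;     /* bits 31..28: Version         */
    *ihl       = (w >> 24) & 0xF;     /* bits 27..24: Header Length   */
    *tos       = (w >> 16) & 0xFF;    /* bits 23..16: Type of Service */
    *total_len =  w        & 0xFFFF;  /* bits 15..0:  Total Length    */
}

Every field except Total Length costs at least a shift and a mask, and on a load/store machine each extracted value also occupies a separate register; it is precisely this per-field work that the proposed parsing unit removes.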
Since network processing usually begins by copying the packet header into a memory buffer that is available for further processing by the processor, this paper proposes a specialized IP header parsing unit that performs field extraction operations directly on the memory buffer output, before forwarding the IP header to the processor. In this way, the bit-wise logical and arithmetic operations for extracting IP header fields that are not word-aligned are avoided, and the packet header fields are sent directly to the processor's ALU in order to be further evaluated and inspected by the processor.
The proposed IP header parsing unit is applied to a general-purpose MIPS processor [10] and to a memory-centric processor that operates with on-chip memory [11], and the IP header parsing speed-up achieved by both processors is then compared, discussed, and evaluated.
The rest of this paper is organized as follows: Sect. 2 gives an overview of different approaches for improving network packet processing speed. Section 3 describes the proposed IP header parsing unit and provides details about its design and operation. Section 4 presents and evaluates the simulation results for IP header parsing speed attained when the proposed parser is applied to general-purpose processor architectures (i.e., the MIPS and the RISC-based memory-centric processor). Section 5 concludes the paper, outlining the performance gain that is achieved with the proposed IP header parsing unit.

2 Approaches for Improving Network Processing Speed


NP development started in the late 1990s, when network devices were insufficient to handle complex network processing requirements [2]. Generally, NPs are used to perform fast packet forwarding in the data plane, while slow packet processing (control, traffic shaping, traffic classes, routing protocols) is usually handled in software by the control plane. NP operation begins with the receipt of an input stream of data packets from the physical interface. After that, the IP headers of the received packets are inspected and their content is analyzed, parsed, and modified [12]: some IP header fields are validated, the checksum is calculated, the Time to Live (TTL) is decremented, etc. During packet processing, specialized hardware units may be used to perform tasks such as packet classification, lookup and pattern matching, forwarding, queue management, and traffic control [3]. For example, the forwarding engine selects the next hop in the network to which the packet is to be sent, while the routing processor deals with routing table updates. Once the IP address lookup in the forwarding table is finished, the output port is selected and the packet is ready to be forwarded. Finally, the packet scheduler decides which packets should be sent out to the switching fabric and also deals with flow management.
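The steps listed above can be summarized as a per-packet fast path. The following C sketch is a deliberately simplified model of that loop (header validation, checksum check, TTL decrement, next-hop lookup, forwarding); the lookup_next_hop(), forward(), and checksum_ok() helpers are placeholders for whatever lookup engine and switching fabric interface a concrete NP provides, not functions described in the paper.

#include <stdint.h>
#include <stdbool.h>

struct ipv4_hdr {            /* minimal IPv4 header view (no options) */
    uint8_t  ver_ihl;
    uint8_t  tos;
    uint16_t total_len;
    uint16_t id;
    uint16_t flags_frag;
    uint8_t  ttl;
    uint8_t  proto;
    uint16_t checksum;
    uint32_t src;
    uint32_t dst;
};

/* Placeholder hooks: a real NP maps these to its lookup engine and fabric. */
extern int  lookup_next_hop(uint32_t dst_addr);
extern void forward(struct ipv4_hdr *h, int out_port);
extern bool checksum_ok(const struct ipv4_hdr *h);

void fast_path(struct ipv4_hdr *h)
{
    if ((h->ver_ihl >> 4) != 4)       /* validate the Version field          */
        return;                        /* drop non-IPv4 packets               */
    if (!checksum_ok(h) || h->ttl <= 1)
        return;                        /* drop corrupted or expired packets   */

    h->ttl--;                          /* decrement TTL; the checksum must be */
    h->checksum = 0;                   /* recomputed before the packet leaves */

    int out_port = lookup_next_hop(h->dst);
    forward(h, out_port);              /* hand over to the switching fabric   */
}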
According to the research presented in [13], packet processing can be accelerated if some of the most time-consuming network processing operations are simplified and appropriate choices of routing protocol functionalities are made. So far, many different approaches have been proposed, including a label concept used to accelerate lookup operations, dedicated hardware units intended to perform complex and slow operations (e.g., header checksum calculation), and several algorithms for faster routing table lookup. For example, [14] proposes a route lookup mechanism that, when implemented in a pipelined fashion in hardware, can achieve one route lookup per memory access, while [15] presents a lookup scheme called Tree Bitmap that performs fast hardware/software IP lookups. Moreover, the authors of the research given in [13] suggest an approach that avoids slow routing table lookup operations by using the IP protocol source routing option. Another similar approach that also makes use of source routing is given in [16].

In general, network processing software is getting closer to the network processing hardware, as in [17], where part of the packet processing is offloaded to application-specific coprocessors, which are used and controlled by the software. In this way, the hardware handles the larger part of the packet processing, while leaving the more complex and specific network traffic analyses to the general-purpose processor. As a result, a flexible network processing system with high throughput can be built. Some researchers also try to unify the view of the various network hardware systems, as well as their network offloading coprocessors, by developing a common abstraction layer for network software development [18].
When it comes to packet parsing, many proposed approaches make extensive use of FPGA technology, as it is very suitable for implementing pipelined architectures and is thus ideal for achieving high-speed network stream processing [19]. Accordingly, reconfigurable FPGA boards can be used to design flexible multi-processing systems that adjust themselves to the current packet traffic protocols and characteristics. Such an approach is given in [20], where the authors propose the use of PP, a simple high-level language for describing packet parsing algorithms in an implementation-independent manner. Similarly, in the research given in [21], a special descriptive language, PX, is used to describe the kind of network processing that is needed in a system, and a special tool then generates the whole multi-processor system as an RTL description. Afterwards, this system may be mapped to an FPGA platform and may be dynamically reconfigured.

3 Design of Hardware Unit for IP Header Parsing


The basic idea behind designing a dedicated hardware unit for IP packet header parsing is to provide direct, single-cycle access to each field of the IP packet header. When accompanying a general-purpose processor, such an IP header parsing unit should allow the same access time for an IP packet header field as for any random memory word, even when the field is not word-aligned. It is expected that this approach would have a large impact on IP packet processing speed and would thus increase the data throughput of the corresponding NP device.
In order to achieve single-cycle access, the proposed IP packet header parsing unit uses part of the memory address space to directly address the various IP packet header fields. This technique is known as memory aliasing, and it allows each IP header field to be accessed through a separate memory address value (field address). When a field address is sent to the IP header parsing unit, it is used to select the corresponding word from the memory (headers buffer) in which the given field is placed. Afterwards, the word is processed and, depending on the field address, the value of the field is extracted. This process may include operations such as shifting the word and/or modifying its bits. A schematic of the IP header parsing unit, used for performing read access to a single IP header field, is given in Fig. 1.

Fig. 1. Read access to an IPv4/IPv6 header field with the IP header parsing hardware unit

The IP header parsing unit is designed under the assumption that the IPv4 or IPv6 packet headers are placed in a fixed area (headers buffer) of the memory before they are processed. The descriptions of the IPv4 and IPv6 packet headers supported by the IP header parsing unit state the type of the IP header and its location in memory in the first line, while each following line contains the definition of a single field. For each IP header field, its name and its size in bits are specified, and the fields are defined in the order in which they appear in the IP header.
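To make this format concrete, a hypothetical encoding of such a description is sketched below in C; the paper does not prescribe a syntax, so the starting address and the representation as a table are assumptions, while the field names and bit widths follow the standard IPv4 header layout.

/* Hypothetical encoding of an IPv4 header description: the header type and
 * its starting address in the headers buffer, followed by one entry per
 * field (name and size in bits), in the order the fields appear on the wire. */
struct field_desc { const char *name; unsigned bits; };

static const unsigned ipv4_header_base = 0x0000;   /* illustrative address */
static const struct field_desc ipv4_fields[] = {
    {"Version", 4}, {"HeaderLength", 4}, {"TypeOfService", 8}, {"TotalLength", 16},
    {"Identifier", 16}, {"Flags", 3}, {"FragmentOffset", 13},
    {"TTL", 8}, {"Protocol", 8}, {"HeaderChecksum", 16},
    {"SourceAddress", 32}, {"DestinationAddress", 32},
};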
The IP packet header starting address specified in the IP header description is used to set the base register value inside the field/data memory address generator of the IP header parsing unit. This address generator module also receives a field address, which is translated into a field offset by a lookup table (LUT). The field offset is a word-aligned offset relative to the IP packet header starting address, and it thus points to the location in the headers buffer where the given IP packet header field is placed. This means that if the length of a field is smaller than the memory word length, the closest word-aligned offset is placed in the LUT for that field. For example, the field offset for the IPv4 fields placed in the first word of an IPv4 header (Version, Header Length, Type of Service, and Total Length) is 0000h, while for the fields in the second word of an IPv4 header (Identifier, Flags, and Fragment Offset) it is 0001h, and so on.
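A software model of this LUT is a simple array indexed by field number, holding the word-aligned offset of each field. The indexing scheme below is an assumption made for illustration; the paper only fixes the per-word offsets (0000h for the first word, 0001h for the second, and so on).

/* Word-aligned offsets (in 32-bit words) of the standard IPv4 fields relative
 * to the header starting address, in the order the fields appear on the wire.
 * The real LUT also holds entries for the helper half-word fields used for
 * checksum calculation, which are omitted here. */
static const unsigned ipv4_field_offset[] = {
    0, 0, 0, 0,   /* Version, Header Length, ToS, Total Length  -> word 0 */
    1, 1, 1,      /* Identifier, Flags, Fragment Offset         -> word 1 */
    2, 2, 2,      /* TTL, Protocol, Header Checksum             -> word 2 */
    3,            /* Source IP address                          -> word 3 */
    4,            /* Destination IP address                     -> word 4 */
};

/* Field/data memory address generator: base register plus LUT output. */
static inline unsigned word_address(unsigned header_base, unsigned field_index)
{
    return header_base + ipv4_field_offset[field_index];
}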
According to Fig. 1, the selected field offset from the LUT is added to the IP header starting address, and the address of the memory word that holds the required IP header field is generated. This address is applied to the memory (headers buffer), and the word that is read out is then forwarded to the field/data selector. This selector module consists of separate field logic (FL) blocks purposed to extract the values of the various IP header fields (FieldLogic1...N). In fact, each field is extracted by a separate FL block that is activated by the output enable (OE) signal connected to a decoder output. The decoder is driven by the field address, which causes only one of the FL blocks to be selected at a given moment. The selected FL block then performs bit-wise and/or shifting operations in order to extract and zero-extend the appropriate IP header field. In the case where the IP header field is word-aligned, its FL block is empty and the word is forwarded directly from the memory to the output of the field/data selector module.
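The behaviour of a single FL block can be modelled in C as a fixed shift-and-mask followed by zero-extension. The two hypothetical FL blocks below handle the IPv4 Fragment Offset and Flags fields of the second header word; the constants are derived from the standard IPv4 layout rather than taken from the paper.

#include <stdint.h>

/* Model of one field logic (FL) block: extract the IPv4 Fragment Offset
 * (bits 12..0 of the second header word) and zero-extend it to a full word. */
static inline uint32_t fl_fragment_offset(uint32_t header_word)
{
    return header_word & 0x1FFF;          /* no shift needed, just mask   */
}

/* Model of an FL block that does need a shift: the 3-bit Flags field
 * occupies bits 15..13 of the same word. */
static inline uint32_t fl_flags(uint32_t header_word)
{
    return (header_word >> 13) & 0x7;     /* shift, mask, zero-extend     */
}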
Fig. 2. Write access to an IPv4/IPv6 header field with the IP header parsing hardware unit

The IP header parsing unit presented in Fig. 1 shows the hardware used to read out a single IP header field from the headers buffer. The same concept is used for writing directly to an IP header field in the headers buffer, as shown in Fig. 2. From Figs. 1 and 2 it can be noticed that both modules use the same field/data memory address generator logic to generate the address of the memory word that holds the required IP packet header field.
The only difference between the two modules shown in Figs. 1 and 2 lies in the field/data selector logic, since during writing the FL blocks of the parsing unit receive two inputs: the word-aligned IP header data that was read from the memory, and the IP header field that is to be written to the memory. In order to provide write access to the required field, the decoder driven by the field address activates only one of the FL blocks. This FL block places the input IP header field at the appropriate position within the input word-aligned IP header data. After that, the whole word, including the newly written IP header field, is stored in the headers buffer at the generated address.
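In software terms, the write path is a read-modify-write of the word that holds the field. The sketch below models a hypothetical write-path FL block for the IPv4 TTL field (bits 31..24 of the third header word under the standard layout); the mask and shift amounts come from RFC 791, not from the paper.

#include <stdint.h>

/* Model of a write-path FL block: merge a new TTL value into the word-aligned
 * header data read from the headers buffer; the whole word is then written back. */
static inline uint32_t fl_write_ttl(uint32_t header_word, uint8_t new_ttl)
{
    header_word &= ~(0xFFu << 24);             /* clear the old TTL bits        */
    header_word |= (uint32_t)new_ttl << 24;    /* place the new field value     */
    return header_word;                        /* stored back at the same address */
}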
The IP header parsing unit is flexible by design, provided that the packet header formats to be supported are well defined. The proposed parsing unit currently operates with IP headers, but it can be extended to support other packet header formats. In addition to this flexibility, the presented hardware approach of direct access to IP header fields also brings much faster packet processing in comparison with the bare general-purpose processing used by nearly all network processors. A more detailed analysis of this is given in the next section.

4 Estimation of IP Header Parsing Speed Improvement


In order to quantify the improvements that can be achieved with the proposed IP header parser, a comparison is made between the MIPS and memory-centric RISC-based processors with and without the IP header parsing unit. A detailed description of the MIPS processor is given in [10], while the memory-centric RISC-based processor (here called MIMOPS) is presented in [11]. The extended versions of the MIPS and MIMOPS processors include an IP header parsing unit added next to the on-chip cache and data memory, respectively. The following analysis compares the IP header parsing speed (for IPv4 and IPv6) achieved by both processors when they operate without and with the IP header parsing unit.
Figure 3 shows MIPS and MIMOPS assembly programs that parse an IPv4 header, without and with the IP header parsing unit. The given programs extract the IPv4 fields: version, header length, type of service, total length, identifier, flags, fragment offset, TTL, protocol, header checksum, and source and destination IP addresses, as well as some fields that are not standard IPv4 fields (e.g., first word first half, second word second half, etc.) but are used to simplify the IPv4 header checksum calculation. The field extraction basically involves bit-wise logical and shifting operations. The MIPS processor supports right and left arithmetic or logical shifts, specified as separate instructions, where the shift amount is given as a 5-bit constant. The MIMOPS processor, on the other hand, provides shifting of the second, flexible source operand and then execution of an ALU operation, all in a single instruction. Besides these shifting dissimilarities, the main difference between the two processors is that MIPS has to load the IP header from the cache memory into its GPRs in order to process it, while MIMOPS can operate directly on the IP header placed in the headers buffer of its on-chip data memory.
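The half-word helper fields mentioned above are useful because the IPv4 header checksum is defined (RFC 1071) as the one's complement of the one's complement sum of the header taken as 16-bit words. The following C sketch shows that standard computation over an array of already-extracted 16-bit halves, which is the form in which the parsing programs prepare the data; the function name is illustrative.

#include <stdint.h>
#include <stddef.h>

/* One's complement checksum over the 16-bit halves of an IPv4 header
 * (RFC 1071). The Header Checksum half itself is assumed to be zeroed
 * before calling this when a fresh checksum is being computed. */
uint16_t ipv4_checksum(const uint16_t *halves, size_t count)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += halves[i];
    while (sum >> 16)                       /* fold the carries back in    */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;                  /* one's complement of the sum */
}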
Fig. 3. Assembly programs that perform IPv4 header parsing in MIPS and MIMOPS processors, without and with IP header parsing hardware unit

The first program shown in Fig. 3 implements IPv4 header field access on the pure MIPS processor. This program uses the r0 register as a zero-value register and the r1 register as a pointer to the header words. Additionally, register r2 holds the header words that are read from memory, from which every IPv4 header field is extracted into registers r3–r21, respectively. In the case of the Version field, only a logical AND operation is needed to set all bits to zero except the last 4, which hold the field's value. The Header Length field, on the other hand, needs a shift first; after that, an AND instruction is used to select the last 4 bits of the shifted word, which hold the field's value. All the other fields are likewise retrieved by shifting and logical operations. The second program is equivalent to the first, except that it refers to a MIPS processor that operates with the IP header parsing unit. This program addresses the fields directly, using mnemonics starting with the letter 'h' followed by the number of the field, as specified in the header description. Accordingly, only one instruction is needed to read a header field into a register.
The third and fourth programs shown in Fig. 3 implement IPv4 header field access on the pure MIMOPS processor and on the MIMOPS processor that includes the IP header parsing unit, respectively. It can be noticed that the third program has many similarities with the first one, while the fourth program is similar to the second one. Although the third program (intended for the MIMOPS processor) operates directly on the IP header words placed in the on-chip memory, it still has to extract the fields that are not word-aligned. The instructions that perform field extraction can address up to three operands, where a 3-bit immediate value signifies which of the operands use base addressing. The extracted fields are placed at consecutive memory locations in the Fields array, so they can afterwards be directly accessed and processed by the ALU. On the other hand, the fourth program (intended for the MIMOPS processor that includes the IP header parsing unit) only has to set the base register to point to the starting address of the IP header in order to gain direct access to the IP header fields (specified with mnemonics, as in the second program). Accordingly, an instruction that decrements the TTL (h9) field can simply be given as SUB h9, h9, 1, allowing the ALU to process the extracted TTL field immediately. This simplification can significantly speed up the complete network processing of an IP packet.
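Expressed in C-like terms, the difference between the two programming models is roughly the following. The header_buf pointer, the HDR_BASE alias region, and the word index used for the TTL alias are illustrative assumptions only, since the actual field numbering and addresses are defined by the header description loaded into the unit.

#include <stdint.h>

/* Without the parsing unit: load the word, extract, modify, merge, store. */
static void decrement_ttl_plain(uint32_t *header_buf)
{
    uint32_t w   = header_buf[2];              /* word holding TTL/Protocol/Checksum */
    uint32_t ttl = (w >> 24) & 0xFF;           /* shift and mask to extract TTL      */
    ttl--;
    w = (w & 0x00FFFFFFu) | (ttl << 24);       /* merge the new TTL back in          */
    header_buf[2] = w;
}

/* With the parsing unit: every field alias behaves like an ordinary memory
 * word, so the same operation becomes a single read-modify-write of the alias
 * (corresponding to the single SUB h9, h9, 1 instruction in the paper). */
#define HDR_BASE 0x8000u                        /* illustrative alias region base    */
static volatile uint32_t *const ttl_alias = (volatile uint32_t *)(HDR_BASE + 9 * 4);

static void decrement_ttl_aliased(void)
{
    *ttl_alias = *ttl_alias - 1;                /* parsing unit does the extraction  */
}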
Figure 4 shows MIPS and MIMOPS assembly programs that parse an IPv6 packet header, without and with the IP header parsing unit. The given programs parse an IPv6 packet header by extracting its fields: version, traffic class, flow label, payload length, next header, hop limit, source IP address, and destination IP address. These programs are very similar to the ones for IPv4 header parsing and also consist of many bit-wise logical and shifting operations.
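As with IPv4, the fields that require the most bit manipulation are the ones sharing the first 32-bit word. The C illustration below follows the standard IPv6 layout (RFC 8200) rather than the exact instruction sequence of Fig. 4, and its function name is illustrative.

#include <stdint.h>

/* Extract the sub-word fields of the first IPv6 header word:
 * Version (4 bits), Traffic Class (8 bits), Flow Label (20 bits). */
static void parse_first_ipv6_word(uint32_t w, uint8_t *version,
                                  uint8_t *traffic_class, uint32_t *flow_label)
{
    *version       = (w >> 28) & 0xF;
    *traffic_class = (w >> 20) & 0xFF;
    *flow_label    =  w        & 0xFFFFF;
}

The payload length, next header, and hop limit fields share the second header word and need similar treatment, while the two 128-bit addresses are word-aligned.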
Fig. 4. Assembly programs that perform IPv6 header parsing in MIPS and MIMOPS processors, without and with IP header parsing hardware unit

The comparative analysis between the MIPS and MIMOPS processors operating without and with the IP header parsing unit is shown in Fig. 5. This analysis verifies that the proposed parsing unit improves the IPv4/IPv6 header parsing speed, providing an impressive speed-up for the MIMOPS processor.
The results of the comparative analysis are given in Fig. 5, where Fig. 5a/b shows the execution time of an IPv4/IPv6 header parsing program, while Fig. 5c/d illustrates the IPv4/IPv6 parsing speed improvement achieved by the use of the IP header parsing unit in the MIPS and MIMOPS processors, respectively. Referring to these results, it can be noticed that the MIPS processor that implements the IP header parsing unit provides a parsing speed improvement of 37.5/51.7% in comparison with the pure MIPS processor, while the MIMOPS processor that implements the IP header parsing unit provides a parsing speed improvement of 95.6/93.7% in comparison with the pure MIMOPS processor, for IPv4/IPv6 header parsing, respectively. The overall improvement of the MIMOPS processor that includes the IP header parsing unit, given in Fig. 5e/f, shows that this processor achieves 96.8/96.5%, 95/92.8%, and 95.6/93.7% better IPv4/IPv6 header parsing speed results in comparison with the MIPS, the MIPS with IP header parser, and the MIMOPS processors, respectively. This analysis verifies that the
