
Eighth IEEE International Symposium on Cluster Computing and the Grid

TCP Connection Scheduler in Single IP Address Cluster

Hajime Fujita¹, Hiroya Matsuba², and Yutaka Ishikawa¹,²

¹ Graduate School of Information Science and Technology, The University of Tokyo
² Information Technology Center, The University of Tokyo

{hfujita@is.s, matsuba@cc, ishikawa@is.s}.u-tokyo.ac.jp

Abstract

A broadcast-based single IP cluster aims at being both scalable and available. However, existing systems can only employ static traffic assignment based on incoming packets. In this paper we propose FTCS, a new TCP connection dispatching mechanism that enables a single IP cluster to use more flexible load balancing algorithms. In this mechanism, one of the cluster nodes acts as a master node. A centralized connection scheduler runs on the master node in order to dispatch TCP connections to the nodes of the cluster. Since connections are scheduled by a single scheduler, the master node is able to employ arbitrary scheduling algorithms. Once a TCP connection is established on a node, succeeding communication is handled without involving the master node. When the master node fails, one of the other nodes takes over the role of the master node. Therefore the master node does not become a single point of failure. Benchmark results using the SPECweb2005 Support benchmark show that a four-node Linux cluster using FTCS balances workloads well and successfully handles 13% more requests than the existing method, on average.

1 Introduction

Today, as the Internet is an essential part of the infrastructure for society, server computers are one of the most important components of Internet services. The increasing population of Internet users, network bandwidth, and the processing capabilities of commodity-level computers require more performance from servers. Furthermore, as servers are key components of services, they are required to be available every day of the week, or even every day of the year.

Constructing a server from a computer cluster [4] is one solution to these requirements. It is desirable for a cluster server to make only one IP address visible to clients, since many Internet applications assume that a server is constructed from a single computer, hence a single IP address. Hiding multiple cluster nodes behind one address is also preferable because the cluster can dispatch a request to an arbitrary node while clients use only one address. This makes it possible to implement load balancing, hot node addition, or hot node removal. There are already two approaches for constructing single IP clusters: the centralized dispatcher approach, such as that of Linux Virtual Server[15, 10], TCP Router[6], and SAPS[8], and the broadcast-based approach, such as that of Windows NLB[1], ONE-IP[5], Clone Cluster[14], and Hive Server[13].

In centralized dispatcher-based clusters, the dispatcher has the IP address of the cluster and bridges the cluster and its clients. This type balances workloads well but has a single point of failure at the dispatcher. In broadcast-based clusters, all nodes have the same IP address so that all nodes may communicate with clients directly. In order to guarantee that only one node communicates with a client at a time, a static load balancing mechanism is employed. This type does not have a single point of failure, but its load balancing capability is limited.

In this paper, we propose a new method, FTCS (Flexible TCP Connection Scheduling). FTCS employs a master node, which dispatches new TCP connections to one of the nodes. Once a connection is established, the chosen node communicates with the client without involving the master. FTCS enables broadcast-based single IP clusters to employ dynamic and flexible connection scheduling, while preserving robustness against node failures.

This work has been partially supported by the CREST project of JST (Japan Science and Technology Agency).

We have implemented FTCS in the Linux kernel. Most of the features have been implemented as a kernel module by using the Linux network packet filtering mechanism.

We have evaluated FTCS by using the SPECweb2005 Support benchmark test. The results show that FTCS balances the disk I/O load well among four nodes and handles 13% more requests than the existing static connection assignment method.

2 Background

Quite a few single IP cluster systems have been proposed. We present a brief view of the existing systems and divide them into two groups, the centralized dispatcher type and the broadcast-based type.

2.1 Centralized dispatcher type

Centralized dispatcher type single IP clusters generally have one dispatcher and multiple back-end nodes (Figure 1); examples of this type include Linux Virtual Server[15, 10], TCP Router[6], and SAPS[8]. The dispatcher has an IP address which is visible to clients and receives all incoming packets from clients. When the dispatcher receives a packet, it chooses one of the back-end nodes based on a load balancing algorithm, then forwards the packet to it. Since connection dispatching is controlled by the single dispatcher, it is able to assign a connection to an arbitrary node chosen by the algorithm.

Figure 1. An example of centralized dispatcher-type clusters.

For these systems to work correctly, it is necessary that all packets belonging to the same connection are forwarded to the same back-end node. The dispatcher has a connection tracking table in order to ensure this.

One of the drawbacks of this type is that the dispatcher is a single point of failure for the cluster, since back-end nodes are not able to receive any packets from clients unless they are forwarded by the dispatcher. Since the dispatcher holds the connection tracking table, it is difficult for another node to take over the role of the dispatcher without losing connections between clients and back-end nodes. Moreover, even if the connection tracking table can be reconfigured, every communication between the server and a client, including existing connections, hangs during the dispatcher failover, since no back-end node can communicate with its client without the dispatcher.

2.2 Broadcast-based type

A broadcast-based single IP cluster (Figure 2), like Windows NLB[1], ONE-IP[5], Clone Cluster[14], and Hive Server[13], does not have a special node like the dispatcher in the previous type. All nodes in the cluster have the same IP address, which can be accessed by clients. Each incoming packet is broadcast to all nodes; one of the nodes accepts it and responds to the client, while the others ignore and discard the packet. This type does not have a single point of failure, since all nodes are able to receive incoming packets without depending on another node.

Figure 2. An example of broadcast-based clusters.

It is a fundamental issue to implement a mechanism that allows only one node to respond to each request. If no node responds, the client thinks that the packet is lost or, in the worst case, that the server is dead. On the other hand, if two or more nodes respond, the client is confused by the server's seemingly irregular behavior.

Since it is too costly to decide inside the cluster which node should respond upon every packet arrival, existing systems use a static hash function to filter incoming packets. For example, whether node i (0 ≤ i ≤ N − 1) should respond to an incoming packet is determined by the following formula:

    k ≡ i (mod N)

where N is the number of cluster nodes and k is a hash key constructed from the packet. For the hash key, an IP address, a port number, or their combination is often used.
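To make this concrete, the fragment below sketches such a static filter in C. The choice of the source port as the hash key, and all of the names used here, are illustrative assumptions rather than details of any particular system.

    /* Illustrative static filter: node i accepts an incoming connection
     * request only when its hash key maps to i, i.e. k mod N == i. */
    #include <stdbool.h>
    #include <stdint.h>

    struct packet_info {
        uint32_t src_ip;
        uint16_t src_port;               /* used as the hash key k below */
    };

    /* Returns true if this node (index my_index out of num_nodes) should
     * respond to the packet; every other node silently discards it. */
    static bool should_respond(const struct packet_info *p,
                               unsigned int my_index,
                               unsigned int num_nodes)
    {
        unsigned int k = p->src_port;            /* hash key k */
        return (k % num_nodes) == my_index;      /* k ≡ i (mod N) */
    }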

These hash functions also aim at balancing network traffic among the cluster nodes. The basic idea is that if the sources of the hash key (e.g. IP addresses or port numbers) are uniformly distributed, incoming traffic should be equally balanced among the cluster nodes.

However, even if the connection requests are uniformly distributed, the workload required to process each connection request is not guaranteed to be equal. For example, at a Web server the total workload per connection, such as running scripts or reading files, varies among connections. Existing broadcast-based single IP clusters can do nothing about this kind of unbalanced load, since the workload of each node is completely determined by its clients. Furthermore, it is difficult to implement many load balancing algorithms, such as round robin or least loaded, in this type of cluster.

3 Design

In this section we design FTCS (Flexible TCP Connection Scheduler), the proposed method. Our design goal is to provide a scalable single IP cluster server without a single point of failure. By scalability, we mean that the loads of the cluster nodes are balanced without any management bottleneck.

FTCS is a method designed to construct a single IP cluster based on the broadcast-based approach, but it has a centralized connection scheduler to improve the load balancing of TCP server applications. FTCS introduces two types of cluster nodes, master and slave. The master node is a connection scheduler, which assigns a new TCP connection to one of the cluster nodes. All other nodes are slave nodes. The only difference between the master and the slaves is how they handle a new TCP connection request. Once a node (either the master or a slave) accepts a connection request, that node handles the connection, and the other nodes discard packets for that connection independently. This method is focused on TCP communications and relies on its state control mechanism, so stateless protocols like UDP should be handled by the existing static assignment methods[1, 5, 14, 13].

An overview of the system is shown in Figure 3. Each node has at least two network interfaces. One is used for communications with clients, and the other is for intra-cluster communications. Since FTCS is based on a broadcast-based single IP cluster, all incoming packets from clients must be broadcast to all the nodes. Although most Ethernet devices today are connected by switches, there are several ways to deliver packets to all nodes. This issue is discussed later, in Section 4.

Figure 3. An overview of an FTCS cluster.

 1: procedure ReceiveTCPSegment(s: Segment)
 2:   q: quadruple ← (srcIP, srcPort, dstIP, dstPort)
 3:   tcb: TCB ← findTCB(q)
 4:   if SYN is set and ACK is not set then
 5:     if tcb found and its state is TIME-WAIT then
 6:       remove tcb from the Connection Table
 7:     end if
 8:     if node-state is MASTER then
 9:       n: Node
10:       if q ∈ ForwardingHistoryTable then
11:         n ← history-table(q)
12:       else
13:         n ← schedule()
14:         update-history-table(q, n)
15:       end if
16:       if n is myself then
17:         accept s
18:       else
19:         forward s to n
20:       end if
21:     else
22:       if s is from the master then
23:         accept s
24:       else
25:         discard s
26:       end if
27:     end if
28:   else
29:     if tcb found then
30:       accept s
31:     else
32:       discard s
33:     end if
34:   end if
35: end procedure

Figure 4. Algorithm for handling incoming TCP segments.

3.1 Incoming packet handling

The algorithm used for handling incoming TCP segments is shown in Figure 4.

When a client begins a new TCP connection, it first sends a TCP segment with the SYN bit set in its TCP header. We call this the "SYN segment." The master node determines which node should handle this connection when it receives this segment (Figure 4, L.8). A connection can be accepted by either one of the slave nodes or the master node itself. This decision is made by a scheduling policy, which can be replaced dynamically at run time (Figure 4, L.13). Since the connection scheduling is done by a single master node, it can employ many of the scheduling policies that are used in dispatcher-based systems. If one of the slave nodes is chosen, the master node forwards the SYN segment to it (Figure 4, L.19). This scheduling decision is recorded in a table in case of retransmission, which is described later (Figure 4, L.14). A slave node discards any SYN segments received directly from clients. It accepts SYN segments only when they are forwarded from the master node. This ensures that only one node responds to each new connection request (Figure 4, L.22). After a node accepts a SYN segment, it returns a segment with the SYN and ACK bits set. Then the client sends an ACK segment back to the server. This is the standard TCP three-way handshake procedure [11].

When the server transmits a SYN+ACK segment, the TCP protocol stack also creates a record called a Transmission Control Block (TCB) to track the connection. The TCB includes information about the two endpoints of the connection, namely the quadruple of remote IP address, remote port number, local IP address, and local port number. The TCB is stored in the TCP connection table. When a node receives a TCP segment without its SYN bit set, it consults its connection table using the quadruple to determine whether the segment should be accepted (Figure 4, L.3). The segment is dropped if the node does not have a corresponding TCB (Figure 4, L.32). This mechanism enables each node to determine independently whether it should accept a segment. In other words, once a TCP connection is assigned, a slave node is able to communicate with its client without the master.

Unlike standard TCP implementations, FTCS does not return an RST segment when it receives an irrelevant TCP segment, since another node might accept it. Even if none of the nodes has a relevant TCB, no node returns an RST. Though this behavior violates the original TCP specification[11], it is not fatal to most applications, because these kinds of RSTs may also be masked by firewalls even in regular environments.
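The non-SYN path of Figure 4 (lines 29-33) amounts to a lookup keyed on this quadruple. The sketch below illustrates it in C; the table layout and all names are invented for the example (the actual implementation uses the kernel's own structures).

    /* Sketch of the non-SYN path of Figure 4 (lines 29-33): accept a segment
     * only if a matching TCB exists; otherwise drop it silently. */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    struct quadruple {                 /* connection identifier */
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
    };

    struct tcb { struct quadruple q; int in_use; };

    #define TABLE_SIZE 1024
    static struct tcb connection_table[TABLE_SIZE];   /* this node's TCBs */

    static struct tcb *find_tcb(const struct quadruple *q)
    {
        for (int i = 0; i < TABLE_SIZE; i++)
            if (connection_table[i].in_use &&
                memcmp(&connection_table[i].q, q, sizeof(*q)) == 0)
                return &connection_table[i];
        return NULL;
    }

    enum verdict { ACCEPT, DROP };

    /* Unlike standard TCP, an unmatched segment is dropped without an RST,
     * because another node of the cluster may own the connection. */
    static enum verdict handle_non_syn_segment(const struct quadruple *q)
    {
        return find_tcb(q) != NULL ? ACCEPT : DROP;
    }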

3.2 Avoiding duplicated connections

One of the most important issues for FTCS in implementing a single IP view is how to ensure that no more than one node responds to one connection at the same time. Failure to ensure this causes many undesirable phenomena, such as packet flooding, unintentional reset of a connection, and even the destruction of transferred data.

An example of packet flooding happens when two or more nodes respond to the same connection between a client and the cluster. Such irregular behavior may occur in two cases. Before describing the cases, the packet flooding behavior is illustrated using the example in Figure 5. Suppose that the client and node 1 have an established TCP connection. Node 2 has also been keeping the same connection information, but with a different state. Node 2 is, of course, not recognized by the client.

Figure 5. Infinite ping-pong of ACKs.

A pair of numbers like (1000, 2000) in Figure 5 denotes (SND.NXT, RCV.NXT). SND.NXT is the sequence number to be used for the next transmitted segment. RCV.NXT is the sequence number which the node expects to receive next. Therefore, (1000, 2000) for the client means that sequence number 1000 will be used next by the client, and that the peer of the connection (node 1) will send a segment with sequence number 2000 next time. To keep the situation simple, let all nodes have the same window size of 100. What happens in this case is as follows:

1. The client first sends segment 1 with 100 bytes of data, which is received by both node 1 and node 2.

2. Node 1 accepts this segment and returns an ACK (segment 3).

3. However, for node 2, the sequence number recorded in segment 1 is out of its receive window (in this case, 500 to 599). TCP requires the return of an ACK when a node receives an out-of-window segment[11]. So node 2 returns an ACK (segment 2). This segment carries a sequence number taken from SND.NXT of node 2, which is 3000.

4. The client receives segment 2, which also overruns its receive window. The client's receive window is from 2000 to 2099 at this time.

5. Therefore the client returns an ACK (segment 4). This segment reaches not only node 2 but also node 1. Node 1 just ignores this segment as a duplicate. For node 2, this is an out-of-window segment again.

6. Steps 3 to 5 repeat forever (segments 5, 6, ...).

This ping-pong of ACK segments consumes a large amount of CPU time and network capacity. While establishing a new connection, there are two cases that can cause this kind of irregular state.

3.2.1 Retransmission of SYN segments

The first case occurs when a packet is lost at a certain timing. When the cluster accepts a SYN segment from a client, one of the server nodes (say node 1) returns a SYN+ACK segment to the client. If this segment is lost, the client retransmits the initial SYN segment to the server. Without any history information, the master node may forward this retransmitted SYN segment to another node, say node 2, because the master does not know that it is a retransmitted one. As a result, both nodes 1 and 2 have their own TCB in the SYN-RECEIVED state at the same time. To prevent this, retransmitted SYN segments must be processed at the node that accepted the first SYN segment. Therefore the master node should remember the destination of each SYN segment so that it can forward retransmitted segments to the appropriate destination (Figure 4, L.10).
at the cluster (Figure4, L.6). server.

3.2.2 Connections in the TIME-WAIT state

The second case occurs when an existing connection is being closed. Suppose that node 1 initiates the connection closing procedure. It sends a FIN segment to the client and eventually enters the TIME-WAIT state. A problematic situation occurs when the client immediately reuses the same port number to reconnect to the server, and the scheduling policy chooses a node other than node 1 (say node 2). In this case, node 2 establishes a new TCP connection with the client, while node 1 still thinks that it has a connection in the TIME-WAIT state. Once this happens, the situation described in Figure 5 occurs. To prevent this, the TIME-WAIT socket must be removed when a new SYN segment arrives at the cluster (Figure 4, L.6).

3.3 Node failure

As FTCS is based on broadcast-based systems, the influence of a node failure is limited to a narrow area. There are two types of node failures: slave node crashes and master node crashes.

In the case of a slave node crash, the TCP connections which belong to that node are lost. However, this does not affect the connections of other nodes. The master node should watch its slaves; once one of them appears to have crashed, the master stops assigning new connections to it.

In the case of a master node crash, existing connections on the slave nodes are not affected, since slaves receive incoming packets without the master node. The difference between a master node crash and a slave node crash is that the cluster is not able to accept a new connection after the master node crashes. To deal with this case, the slaves should monitor the master. If one of the slaves notices that the master is dead, it begins an election to choose a new master node. Since all nodes receive the same incoming packets at the same time, a slave node can easily become the master node.

All state information stored in the master node should be designed with a node crash in mind. The master node should not hold states which are indispensable for the communications of other nodes. However, it may hold states whose loss does not affect communications. For example, a scheduling module can hold unreproducible states such as the next node number in the round robin scheduling module or the connection tracking table in the least connection scheduling module. Losing such scheduling data may make load balancing worse, but it does not affect communications between clients and the server.

When a slave node becomes the new master, the SYN forwarding history table described in Section 3.2.1 is also lost. The new master should therefore collect information about half-open connections, which are in the SYN-RECEIVED state, from the surviving slaves to reproduce the forwarding table. Another option is to request all surviving slave nodes to delete all connections in the SYN-RECEIVED state.
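Cluster management, including this election, is left as future work in the paper, so the rule below is purely a hypothetical illustration. It assumes that every node knows the static list of node IDs and tracks the last heartbeat heard from each peer on the intra-cluster network; the live node with the smallest ID then takes over as master.

    /* Hypothetical election rule (not part of FTCS as described): the
     * lowest-numbered node that is still sending heartbeats becomes master. */
    #include <time.h>

    #define MAX_NODES     16
    #define HEARTBEAT_TTL 5     /* seconds without a heartbeat => presumed dead */

    static time_t last_heartbeat[MAX_NODES];   /* updated on every heartbeat */

    static int elect_master(int num_nodes)
    {
        time_t now = time(NULL);
        for (int id = 0; id < num_nodes; id++)
            if (now - last_heartbeat[id] <= HEARTBEAT_TTL)
                return id;         /* first (lowest-ID) live node wins */
        return -1;                 /* no live node found */
    }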


4 Implementation

We have implemented FTCS in the Linux kernel, version 2.6.20. Currently our implementation covers the TCP dispatching mechanism. Cluster management features such as node failure detection and recovery are left as future work. Since it is implemented inside the kernel, there is no need to modify user-mode applications to utilize FTCS. FTCS also does not modify the TCP protocol, so any existing TCP implementation is able to communicate with an FTCS server.

4.1 Packet manipulation

Most of the features of the FTCS kernel have been implemented as a network packet filter module, as Linux Virtual Server does (Figure 6). Linux provides a mechanism named netfilter to insert packet filters dynamically. Currently we apply a very small patch to the kernel code to improve the performance a little.

Figure 6. Packet filtering inside the kernel.

There are two types of filtering, one for SYN segments and one for other TCP segments. For SYN segments, filtering and forwarding are done between the IP layer and the TCP layer. When the master node forwards a SYN segment, it uses IP protocol type 253 to indicate that the packet has been forwarded by the master. When a slave node receives such a packet, it simply passes the payload of the packet to the TCP protocol stack.

For TCP segments other than SYN segments, the filtering is done between the Ethernet layer and the IP layer in order to drop irrelevant packets as early as possible. We have applied a small patch to the kernel here because we want to insert our packet filtering code before the IP checksum is calculated, which is impossible with the existing packet filtering mechanism. If a packet header is corrupted and the filter accepts the packet, it will eventually be discarded because of its bad checksum. We therefore skip calculating checksums in the filtering functions in order to reduce CPU consumption.
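For reference, the skeleton below shows how a packet filter module is typically registered with netfilter in the Linux 2.6.20-era API (the hook signature changed in later kernels). The chosen hook point, the names, and the empty verdict logic are our own illustrative choices, not code taken from FTCS.

    /* Skeleton of a netfilter packet-filter module (Linux 2.6.20-era API). */
    #include <linux/init.h>
    #include <linux/module.h>
    #include <linux/netfilter.h>
    #include <linux/netfilter_ipv4.h>

    static unsigned int example_hook(unsigned int hooknum,
                                     struct sk_buff **pskb,
                                     const struct net_device *in,
                                     const struct net_device *out,
                                     int (*okfn)(struct sk_buff *))
    {
        /* Decide here whether this node owns the segment: return NF_ACCEPT
         * to pass it up the stack, or NF_DROP to discard it silently. */
        return NF_ACCEPT;
    }

    static struct nf_hook_ops example_ops = {
        .hook     = example_hook,
        .pf       = PF_INET,
        .hooknum  = NF_IP_PRE_ROUTING,   /* assumed hook point */
        .priority = NF_IP_PRI_FIRST,
    };

    static int __init example_init(void)  { return nf_register_hook(&example_ops); }
    static void __exit example_exit(void) { nf_unregister_hook(&example_ops); }

    module_init(example_init);
    module_exit(example_exit);
    MODULE_LICENSE("GPL");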
4.2 Broadcasting Ethernet frames

Ethernet was originally designed to employ broadcast-style communication. However, almost all Ethernet devices today are connected by switches, so broadcasting and multicasting are rarely used. To implement broadcast-based clusters, packets arriving at the cluster must be broadcast to all nodes. Many methods to deliver the packets, even when nodes are connected via switches, have been proposed in the literature[1, 5, 14, 13]. We choose the unicast-based method described in [14].

We set the same unicast MAC address on the Ethernet card of each node so that every node receives incoming Ethernet frames without using promiscuous mode. We also modify the FTCS kernel to mask the source MAC address in outgoing Ethernet frames in order to prevent the switch from learning the MAC address, as described in [1]. With this masking, all incoming Ethernet frames are always broadcast by the switch, because it never learns which port the destination device is connected to.

4.3 Least connection scheduler

We have implemented a least connection scheduler as an example of a connection scheduler. With this scheduler, the master node has a connection tracking table that holds the number of connections per node. For each new connection request, the master searches the table for the node with the minimum number of connections, then increments the count for the chosen one. The count is decreased when the master learns that the connection no longer exists.

The master node can monitor all incoming traffic to the cluster, but it cannot see any outgoing packets from the slave nodes. This makes it difficult for the master node to know the exact number of connections of each node. For example, when the master sees a FIN or an RST segment from a client, the master learns that the corresponding connection is going to close. However, if a slave node abnormally finishes a connection by sending an RST segment, the master never knows it. Therefore we added an explicit notification of connection closing. When a slave node transmits a FIN or an RST segment, it notifies the master node that the connection is closing. This mechanism has been implemented as an outbound packet filter, also as a kernel module.
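The scheduling decision itself is simple; the sketch below illustrates a least-connection pick over such a per-node table. The data layout and names are assumptions made for illustration only.

    /* Sketch of the least-connection decision: assign a new connection to the
     * live node that currently has the fewest tracked connections. */
    #define MAX_NODES 16

    struct node_stat {
        int alive;             /* 0 if the node is considered crashed */
        int connections;       /* connections currently assigned to the node */
    };

    static struct node_stat nodes[MAX_NODES];

    static int pick_least_connected(int num_nodes)
    {
        int best = -1;
        for (int id = 0; id < num_nodes; id++) {
            if (!nodes[id].alive)
                continue;
            if (best < 0 || nodes[id].connections < nodes[best].connections)
                best = id;
        }
        if (best >= 0)
            nodes[best].connections++;    /* count the new assignment */
        return best;                      /* -1 if no node is available */
    }

    /* Called when a FIN/RST from a client, or an explicit close notification
     * from a slave, tells the master that a connection is gone. */
    static void connection_closed(int id)
    {
        if (nodes[id].connections > 0)
            nodes[id].connections--;
    }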

5 Evaluation

In this section we evaluate FTCS by using the SPECweb2005 benchmark[12] as a sample of real-world applications.

The SPECweb2005 benchmark consists of three tests: Banking, e-Commerce, and Support. We use only the Support benchmark, because the other two benchmarks hold session information on the server side and are not suitable for running on multiple nodes with TCP-level load balancing [8].

The Support benchmark simulates a support file distribution site for many products. Many users concurrently search for and download support files. The requested materials are divided into two groups, dynamic pages and large download files. Dynamic pages are generated by PHP scripts and contain several small images. After retrieving a page, a client requests these images from the server. Large download files represent support materials like device drivers or firmware updates, whose sizes vary from about 100Kbytes to 36Mbytes.

5.1 Common configurations

Specifications of the hardware and software used in our experiments are shown in Table 1, and the testing environment is shown in Figure 7. Four server nodes and ten client nodes are used. Server nodes and clients are connected with a 1Gbps Ethernet link. We defined a dedicated VLAN for the servers' client-side interfaces so that broadcasting frames does not affect other, irrelevant Ethernet ports. Because of this, the server and the clients are placed in different IP subnets. Routing between these two subnets is done in the Catalyst switch, shown as Switch B in Figure 7. BeSim is a back-end simulator defined by the SPECweb2005 benchmark to simulate a back-end database server. The same hardware as that of a server node is used for BeSim.

Table 1. Computers and software used in experiments.

Server / BeSim
  CPU:              AMD Opteron 1124 (2.2GHz × 2)
  Memory:           2GB
  HDD:              HITACHI HDS7225SCSUN250G (250GB SATA)
  Ethernet Device:  Intel Pro/1000 Server (for communications with clients);
                    Broadcom NetXtreme BCM5715, on-board (for intra-cluster communications)
  Kernel:           Linux 2.6.20 (x86_64)
  Web Server:       Apache 2.2.3
  PHP:              PHP 5.1.6

Client
  CPU:              Intel Xeon 2.80GHz × 2
  Memory:           1GB
  Ethernet Device:  Intel Pro/1000 Server (on-board)
  Kernel:           Linux 2.6.18 (i386)
  Java VM:          Sun Java 1.5.0_12

Switches
  Switch A:         Unknown GbE switch
  Switch B:         Cisco Catalyst 3750 24-T-S
  Switch C:         Alaxala AX3630S-48T2XW

Figure 7. The testing environment for SPECweb.

The Apache HTTP server[2] is used as the web server. It is configured to employ the prefork multiprocessing module. This is required when we use a PHP module. Under this configuration, one Apache process serves one connection. The maximum number of processes, that is, the maximum number of concurrent connections, is set to 768 due to the size of the main memory. The HTTP keep-alive mechanism is enabled and its timeout is set to one second.

In order to reduce CPU consumption, the Intel Pro/1000 Ethernet cards on the server nodes are configured for server use. That is, two parameters, InterruptThrottleRate and RxDescriptors, are set to 4000 and 768, respectively. This limits the maximum interrupt rate to 4000 times per second.

5.2 Server clustering methods

The following server clustering methods are tested and compared.

FTCS: The first configuration is a cluster using the proposed method. The master node uses least connection scheduling. This configuration is later referred to as FTCS.

PortHash: The second configuration is a broadcast-based cluster implemented by us in order to simulate existing systems like Windows NLB[1]. In this configuration, each node independently determines whether it should accept a SYN segment based on a hash key, which is calculated using the source port number of the segment. This configuration is used to compare the existing static connection assignment with FTCS, and is referred to as PortHash.

LVS-DR: The third configuration is a dispatcher-based cluster using Linux Virtual Server (LVS). This feature is already implemented in the standard Linux 2.6 kernels. The LVS-DR (Direct Routing) configuration is used. In this configuration, incoming packets to the cluster are first received by the dispatcher and then forwarded to a back-end server, but outgoing packets from back-end servers are sent directly to the clients. Usually LVS is configured with a dedicated computer for the dispatcher. However, in this benchmark the dispatcher also acts as a back-end server in order to use the same computing resources as those of the FTCS configuration. As in FTCS, least connection scheduling is chosen as the connection scheduling algorithm. The TCP connection timer is set to 30 seconds. This configuration is referred to as LVS-DR.
Table 2. Results of SPECweb benchmark tests (Simultaneous Sessions = 2300).

Clustering  File      Total             QoS               QoS               QoS              Error
Method      Type      Requests          Good              Tolerable         Fail
FTCS        Scripts   363,614           328,142           361,449           2,165
            Download  26,459            22,940            23,349            3,110            0
            Total     390,073           351,082           384,797           5,275
PortHash    Scripts   321,165 (0.883)¹  289,686 (0.883)   311,472 (0.862)   9,692 (4.476)
            Download  23,324 (0.882)    17,200 (0.750)    17,431 (0.747)    5,892 (1.895)    5.33
            Total     344,488 (0.883)   306,887 (0.874)   328,904 (0.855)   15,585 (2.954)
LVS-DR      Scripts   361,383 (0.994)   335,347 (1.022)   360,245 (0.997)   1,138 (0.526)
            Download  26,256 (0.992)    21,615 (0.942)    21,918 (0.939)    4,337 (1.395)    0
            Total     387,639 (0.994)   356,962 (1.017)   382,163 (0.993)   5,476 (1.038)

¹ Values in parentheses are ratios against the corresponding values of the FTCS result.

5.3 Results

We ran each benchmark iteration three times for each clustering method. Table 2 shows the average values over the three iterations. Simultaneous Sessions represents the total number of client users. Total Requests is the total number of successfully completed requests. QoS is defined as follows. For each request for pages generated by dynamic scripts, Good means a complete response, including an HTML page and all images, within three seconds, and Tolerable means a response within five seconds. Otherwise the service is rated as Fail. For large downloads, Good and Tolerable stand for download rates of more than 99Kbytes/s and 95Kbytes/s, respectively. In Table 2, the values for QoS Tolerable include the requests that are also counted in QoS Good. Error is the number of unsuccessful requests due to heavy load.

Note that these results are not compliant with the SPECweb2005 regulations, because we rebooted all nodes before running each iteration, while the regulations require three iterations to be run at once. Thus these results are not comparable to other officially published SPECweb2005 scores. However, they are sufficient to reveal the differences among the server clustering methods.

According to the Total Requests column of Table 2, FTCS handles about 13% more requests than the PortHash method. Furthermore, the PortHash configuration often fails to complete the benchmark test because of the heavy load. We had to try seven iterations in total to obtain three successful results with the PortHash configuration, while FTCS and LVS-DR did not fail. The PortHash method also recorded 16 errors in the third successful run due to the heavy load. LVS-DR handled almost the same number of requests as FTCS did.

Figures 8, 9, and 10 show the system activities in more detail. These data are extracted from the third iteration for each clustering method. The graphs include the warming-up and cooling-down phases; the 1800-second benchmark phase begins at 480 seconds and ends at 2280 seconds.

Figure 8. Connection request rate with the PortHash method. (Request rate in requests/s versus time in seconds, one curve per node, Node 0-3.)

Figure 8 shows the rate of incoming connection requests per node with the PortHash method. As shown in the figure, incoming connections are well balanced among the four nodes even when a static hash function is used. However, Figure 9 shows that the PortHash method failed to balance disk I/O loads, and both Node 0 and Node 2 were overloaded, while FTCS balanced the I/O loads well. This can be explained as follows. While the connection arrival rates are nearly equal among the four nodes, the workload per connection varies. For example, the size of the download files differs among connections. Therefore, once one node is overloaded, its number of unprocessed requests grows. This suggests that a dynamic feedback mechanism against workload increases is needed for load balancing. Balancing disk I/O is important for this benchmark because a huge number of files are read and transferred during the run, and disk I/O becomes one of the bottlenecks of the system.

Figure 10 shows the number of established TCP connections. With the PortHash method, the number of connections of Node 0 eventually saturates at around 700.
Figure 9. Total number of processes waiting for disk I/O. (One panel each for FTCS, PortHash, and LVS-DR; number of processes versus time in seconds, one curve per node.)

Figure 10. Total number of established TCP connections. (One panel each for FTCS, PortHash, and LVS-DR; number of established connections versus time in seconds, one curve per node.)

This is because the number of Apache processes is limited to 768. Since disk I/Os are invoked by HTTP requests, waiting for a disk input means that one TCP connection is kept established and idle. Therefore it is reasonable to use the number of established connections as an indicator of the amount of load.

Figures 9 and 10 also show that LVS-DR did not balance workloads as well as FTCS did. This is because the dispatcher in the LVS-DR configuration is sometimes not able to know that a back-end node has closed a connection, since it does not see outgoing packets from back-end nodes. In particular, with the HTTP keep-alive feature enabled, the server side closes the connection after the keep-alive timeout. By this time, the HTTP server process has already closed the socket and is ready for the next connection. However, the dispatcher thinks that there is still an established connection. This makes load balancing worse and eventually causes some nodes to become overloaded.

6 Related Work

Hive server[13] is a system which aims at improving load balancing in broadcast-based single IP clusters. It uses a table to determine whether a node should accept an incoming packet. When an IP packet arrives at a node, the node calculates a hash key from a combination of the IP address and the port number. The table tells whether the node should accept the packet based on the hash key. Therefore, this system is a kind of weighted static hash function approach. The table is shared among all nodes and updated periodically so that idle nodes accept more packets than busy nodes. However, this approach requires more time than ours to respond to an imbalance of workload.

Saru[7] supports a configuration called Active-Active. This configuration introduces multiple dispatchers into a Linux Virtual Server-based cluster. In this configuration, all dispatchers are active at the same time and receive all incoming packets to the cluster by using broadcast or multicast, as broadcast-based clusters do. Each dispatcher employs a hash function to determine which packets to accept. This mechanism reduces the impact of a dispatcher failure. The dispatchers exchange multicast UDP packets carrying the list of current connections in order to keep the existing connections alive (Connection Synchronization). However, this mechanism consumes CPU resources and network bandwidth. Moreover, it does not guarantee that all connections are synchronized, since UDP packets may be dropped.

SAPS[8] is a mechanism for dispatcher-based clusters that offloads the TCP protocol handling to the dispatcher. It is mainly focused on network performance and uses Myrinet[9] for the intra-cluster network. Because congestion control and flow control are done at the network card, back-end nodes do not encounter packet losses due to congestion inside the cluster.

However, because the TCP protocol stack runs entirely on the single dispatcher, named the I/O Server, it is difficult to keep the cluster running when the I/O Server fails.

Round Robin DNS[3] (RR-DNS) ties multiple IP addresses to one domain name. Upon request, the DNS server chooses one of the IP addresses in a round robin manner. This kind of DNS-based mechanism can be used together with single IP clusters like FTCS. RR-DNS distributes requests among geographically distributed sites, whereas single IP approaches are able to deal with load imbalance and node failures at a finer granularity.

7 Conclusion and future work

In this paper we have proposed and implemented FTCS, a mechanism for enabling flexible load balancing of TCP applications in broadcast-based single IP clusters. FTCS introduces a master node as a centralized connection scheduler to improve load balancing. Once a TCP connection is established, the master node is no longer involved in the communication. This makes it possible for the other nodes to keep their connections when the master node fails.

We ran the SPECweb2005 Support benchmark test against a four-node cluster. The results show that FTCS with a least-connection scheduling algorithm equally balances disk I/O loads among the four nodes and handles about 13% more requests than the existing static connection assignment method, on average. The results also show that FTCS performs as well as Linux Virtual Server, the existing dispatcher-type cluster implementation. The increase in CPU utilization caused by receiving and discarding irrelevant packets was less than 3% during the SPECweb2005 benchmark test, and less than 5% while receiving TCP burst packets at 1Gbps. This indicates that the cost of receiving all incoming packets is negligible for most TCP server applications, and thus the broadcast-based approach is a reasonable way to build single IP clusters.

As future work, we will implement cluster management features such as node failure detection, master node failover, and hot node addition/removal. We will also evaluate the downtime under node failures and compare it with that of a highly available cluster based on Linux Virtual Server.

References

[1] Network Load Balancing Technical Overview. http://www.microsoft.com/technet/prodtechnol/windows2000serv/deploy/confeat/nlbovw.mspx.
[2] The Apache HTTP Server. http://httpd.apache.org/.
[3] T. Brisco. DNS Support for Load Balancing. RFC 1794 (Informational), Apr. 1995.
[4] V. Cardellini, E. Casalicchio, M. Colajanni, and P. S. Yu. The state of the art in locally distributed web-server systems. ACM Computing Surveys, 34(2):263–311, 2002.
[5] O. P. Damani, P. E. Chung, Y. Huang, C. Kintala, and Y.-M. Wang. ONE-IP: techniques for hosting a service on a cluster of machines. In Selected papers from the sixth international conference on World Wide Web, pages 1019–1027, Essex, UK, 1997. Elsevier Science Publishers, Ltd.
[6] D. M. Dias, W. Kish, R. Mukherjee, and R. Tewari. A scalable and highly available web server. In COMPCON '96: Proceedings of the 41st IEEE International Computer Conference, pages 85–92, Washington, DC, USA, 1996. IEEE Computer Society.
[7] S. Horman. Active-Active Servers and Connection Synchronisation for LVS. linux.conf.au 2004 (LCA 2004), Jan. 2004.
[8] H. Matsuba and Y. Ishikawa. Single IP address cluster for internet servers. In Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007.
[9] Myrinet. http://www.myri.com.
[10] P. O'Rourke and M. Keefe. Performance Evaluation of Linux Virtual Server. In LISA 2001: 15th Systems Administration Conference, 2001.
[11] J. Postel. Transmission Control Protocol. RFC 793, Sept. 1981.
[12] SPECweb2005. http://www.spec.org/web2005/.
[13] T. Takigahira. Hive server: high reliable cluster web server based on request multicasting. In Proceedings of the Third International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT'02), pages 289–294, 2002.
[14] S. Vaidya and K. J. Christensen. A single system image server cluster using duplicated MAC and IP addresses. In Proceedings of the 26th Annual IEEE Conference on Local Computer Networks, pages 206–214, 2001.
[15] W. Zhang. Linux Virtual Servers for Scalable Network Services. Linux Symposium, 2000.

