Figure 3. An overview of an FTCS cluster.

...workload of each node is completely determined by its clients. Furthermore, it is difficult to implement many load balancing algorithms, such as round robin or least loaded, in this type of cluster.

3 Design

In this section we design FTCS (Flexible TCP Connection Scheduler), the proposed method. Our design goal is to provide a scalable single IP cluster server without a single point of failure. By scalability, we mean that the load on each cluster node is balanced without any management bottleneck.

FTCS is a method designed to construct a single IP cluster based on a broadcast-based approach, but with a centralized connection scheduler to improve load balancing of TCP server applications. FTCS introduces two types of cluster nodes, master and slave. The master node is a connection scheduler, which assigns a new TCP connection to
one of the cluster nodes. All other nodes are slave nodes.
The only difference between the master and the slaves is
how they handle a new TCP connection request. Once a
node (either the master or a slave) accepts a connection
request, the node handles the connection, but other nodes
discard packets for that connection independently. This
method is focused on TCP communications and relies on TCP's state control mechanism, so stateless protocols like UDP should be handled by the existing static assignment methods [1, 5, 13, 14].

An overview of the system is shown in Figure 3. Each node has at least two network interfaces: one is used for communications with clients, and the other for intra-cluster communications. Since FTCS is based on a broadcast-based single IP cluster, all incoming packets from clients must be broadcast to all nodes. Although most Ethernet devices today are connected through switches, there are several ways to deliver packets to all nodes. This issue is discussed later in Section 4.
3.1 Incoming packet handling

1: procedure ReceiveTCPSegment(s: Segment)
2:   q: quadruple ← (srcIP, srcPort, dstIP, dstPort)
3:   tcb: TCB ← findTCB(q)
4:   if SYN is set and ACK is not set then
5:     if tcb found and its state is TIME-WAIT then
6:       remove tcb from the Connection Table
7:     end if
8:     if node-state is MASTER then
9:       n: Node
10:      if q ∈ ForwardingHistoryTable then
11:        n ← history-table(q)
12:      else
13:        n ← schedule()
14:        update-history-table(q, n)
15:      end if
16:      if n is myself then
17:        accept s
18:      else
19:        forward s to n
20:      end if
21:    else
22:      if s is from the master then
23:        accept s
24:      else
25:        discard s
26:      end if
27:    end if
28:  else
29:    if tcb found then
30:      accept s
31:    else
32:      discard s
33:    end if
34:  end if
35: end procedure

Figure 4. Algorithm for handling incoming TCP segments.
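For concreteness, the following is a minimal userspace C sketch of the decision logic in Figure 4. The helper names, linear-scan tables, and round-robin policy are illustrative assumptions, not the actual FTCS kernel code; accept, forward, and discard actions are reduced to print statements.

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint32_t src_ip, dst_ip; uint16_t src_port, dst_port; } quad_t;
typedef struct { bool syn, ack, from_master; quad_t q; } segment_t;
typedef enum { ST_ESTABLISHED, ST_TIME_WAIT } tcb_state_t;
typedef struct { quad_t q; tcb_state_t state; bool used; } tcb_t;

enum { NODES = 4, TABLE = 256 };
static bool is_master = true;                 /* node-state */
static int  self_id   = 0;

static tcb_t conn_table[TABLE];               /* Connection Table */
static struct { quad_t q; int node; bool used; } history[TABLE];

static bool quad_eq(quad_t a, quad_t b) { return memcmp(&a, &b, sizeof a) == 0; }

static tcb_t *find_tcb(quad_t q) {
    for (int i = 0; i < TABLE; i++)
        if (conn_table[i].used && quad_eq(conn_table[i].q, q)) return &conn_table[i];
    return NULL;
}

static int lookup_history(quad_t q) {
    for (int i = 0; i < TABLE; i++)
        if (history[i].used && quad_eq(history[i].q, q)) return history[i].node;
    return -1;                                /* not found */
}

static void update_history(quad_t q, int n) {
    for (int i = 0; i < TABLE; i++)          /* if full, the entry is dropped */
        if (!history[i].used) {
            history[i].q = q; history[i].node = n; history[i].used = true;
            return;
        }
}

static int schedule(void) {                   /* round robin as an example policy */
    static int next = 0;
    int n = next; next = (next + 1) % NODES; return n;
}

static void receive_tcp_segment(segment_t *s) {
    tcb_t *tcb = find_tcb(s->q);
    if (s->syn && !s->ack) {                  /* new connection request */
        if (tcb && tcb->state == ST_TIME_WAIT)
            tcb->used = false;                /* remove TCB (Figure 4, L.6) */
        if (is_master) {
            int n = lookup_history(s->q);     /* Figure 4, L.10 */
            if (n < 0) { n = schedule(); update_history(s->q, n); }
            if (n == self_id) printf("accept SYN locally\n");
            else              printf("forward SYN to node %d\n", n);
        } else if (s->from_master) {
            printf("accept SYN forwarded by master\n");
        } else {
            printf("discard SYN (not from master)\n");
        }
    } else if (tcb) {
        printf("accept segment of existing connection\n");
    } else {
        printf("discard segment (no TCB)\n");
    }
}

int main(void) {
    segment_t syn = { .syn = true, .ack = false,
                      .q = { 0x0a000001, 0x0a000064, 40000, 80 } };
    receive_tcp_segment(&syn);                /* scheduled via round robin */
    receive_tcp_segment(&syn);                /* retransmission: history table
                                                 routes it to the same node */
    return 0;
}

Calling the function twice with the same SYN shows the role of the history table: without the lookup, the round-robin policy would send the retransmitted SYN to a different node.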
3. ... node 2 returns ACK (segment 2). This segment has a sequence number from SND.NXT of node 2, which is 3000.

4. The client receives segment 2, which also overruns its receive window. The client's receive window is from 2000 to 2099 at this time.

5. Therefore the client returns ACK (segment 4). This segment reaches not only node 2 but also node 1. Node 1 simply ignores this segment as a duplicate. For node 2, this is an out-of-window segment again.

6. Steps 3 to 5 repeat forever (segments 5, 6, ...).

This ping-pong of ACK segments consumes a large amount of CPU time and network capacity. While establishing a new connection, there are two cases that can cause this kind of irregular state.
3.2.1 Retransmission of SYN segments

The first case occurs when a packet is lost at a certain timing. When a server node accepts a SYN segment from a client, one of the server nodes (say node 1) returns a SYN+ACK segment to the client. If this segment is lost, the client retransmits the initial SYN segment to the server. Without any history information, the master node may forward this retransmitted SYN segment to another node, say node 2, because the master does not know that it is a retransmission. As a result, nodes 1 and 2 both hold their own TCB in the SYN-RECEIVED state at the same time. To prevent this, retransmitted SYN segments must be processed by the same node that accepted the first SYN segment. Therefore the master node should remember the destination of each SYN segment so that it can forward retransmitted segments to the appropriate destination (Figure 4, L.10).
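The excerpt does not describe the forwarding history table's internals. Purely as an assumption, it could be keyed by a hash of the 4-tuple, with a timestamp per entry so that stale half-open entries can eventually be reclaimed; the sketch below illustrates that idea (FNV-1a is an arbitrary choice of hash).

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>
#include <time.h>

typedef struct { uint32_t src_ip, dst_ip; uint16_t src_port, dst_port; } quad_t;

/* Hash a 4-tuple into a bucket index; any uniform hash would do. */
static unsigned quad_hash(const quad_t *q, unsigned buckets)
{
    const unsigned char *p = (const unsigned char *)q;
    uint32_t h = 2166136261u;                  /* FNV-1a */
    for (size_t i = 0; i < sizeof *q; i++) { h ^= p[i]; h *= 16777619u; }
    return h % buckets;
}

/* Assumed entry layout: a timestamp would let a master reclaim entries
 * older than, say, the client's SYN retransmission window. */
typedef struct { quad_t q; int node; time_t created; } history_entry_t;

int main(void)
{
    quad_t q = { 0x0a000001, 0x0a000064, 40000, 80 };
    printf("bucket = %u\n", quad_hash(&q, 256));
    return 0;
}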
3.2.2 Connections in the TIME-WAIT state

The second case occurs when an existing connection is about to be closed. Suppose that node 1 initiates the connection closing procedure. It sends a FIN segment to the client and eventually enters the TIME-WAIT state. A problematic situation occurs when the client immediately reuses the same port number to reconnect to the server, and the scheduling policy chooses a node other than node 1 (say node 2). In this case, node 2 establishes a new TCP connection with the client, while node 1 still thinks that it has a connection in the TIME-WAIT state. Once this happens, the situation described in Figure 5 occurs. To prevent this, the TIME-WAIT socket must be removed when a new SYN segment arrives at the cluster (Figure 4, L.6).
3.3 Node failure

As FTCS is based on a broadcast-based system, the influence of a node failure is limited to a narrow area. There are two types of node failures: a slave node crash and a master node crash.

In the case of a slave node crash, the TCP connections which belong to that node are lost. However, it does not affect other nodes' connections. The master node should watch its slaves; once one of them appears to have crashed, the master stops assigning new connections to it.

In the case of a master node crash, existing connections on the slave nodes are not affected, since slaves receive incoming packets without the master node's help. The difference between a master node crash and a slave node crash is that the cluster is not able to accept new connections after the master node crashes. To deal with this case, the slaves should monitor the master. If one of the slaves notices that the master is dead, it begins an election to choose a new master node. Since all nodes receive the same incoming packets at the same time, a slave node can easily become the master node.

All state information stored in the master node should be designed with a node crash in mind. The master node should not hold state which is indispensable for the communications of other nodes. However, it may hold state whose loss does not disturb communications. For example, a scheduling module can hold unreproducible state such as the next node number in the round robin scheduling module or the connection tracking table in the least connection scheduling module. Losing such scheduling data may make load balancing worse, but does not affect communications between clients and the server.

When a slave node becomes the new master, the SYN forwarding history table described in Section 3.2.1 is also lost. The new master should therefore collect information about half-open connections, which are in the SYN-RECEIVED state, from the surviving slaves to reproduce the forwarding table. Another option is to request all other surviving slave nodes to delete all connections in the SYN-RECEIVED state.
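The distinction between state that must survive a master crash and state that may safely be lost can be made concrete with a small sketch. The round-robin index below is safe to lose: after an election, a new master simply restarts it at zero, at worst degrading balance briefly. The election rule shown (lowest-numbered live node wins) is only an assumed placeholder; the paper does not prescribe a particular election algorithm.

#include <stdio.h>
#include <stdbool.h>

enum { NODES = 4 };

/* Scheduler state that is safe to lose: if the master crashes, the new
 * master restarts next_node at 0; load balancing may briefly degrade,
 * but client communication is unaffected. */
static int next_node = 0;

static int schedule_round_robin(const bool alive[NODES])
{
    for (int i = 0; i < NODES; i++) {
        int n = (next_node + i) % NODES;
        if (alive[n]) { next_node = (n + 1) % NODES; return n; }
    }
    return -1;                                /* no live node */
}

/* Assumed election rule (not from the paper): the lowest-numbered
 * surviving node becomes the new master. */
static int elect_master(const bool alive[NODES])
{
    for (int n = 0; n < NODES; n++)
        if (alive[n]) return n;
    return -1;
}

int main(void)
{
    bool alive[NODES] = { true, true, true, true };
    printf("scheduled: %d\n", schedule_round_robin(alive));
    alive[0] = false;                         /* master (node 0) crashes */
    printf("new master: %d\n", elect_master(alive));
    return 0;
}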
4 Implementation

We have implemented FTCS in the Linux kernel version 2.6.20. Currently our implementation covers the TCP dispatching mechanism; cluster management features such as node failure detection and recovery are left as future work. Since FTCS is implemented inside the kernel, there is no need to modify user mode applications to utilize it. FTCS also does not modify the TCP protocol, so any existing TCP implementation is able to communicate with an FTCS server.
We set the same unicast MAC address on each node's client-facing Ethernet card so that every node receives incoming Ethernet frames without using promiscuous mode. We also modify the kernel to mask the source MAC address of outgoing Ethernet frames in order to prevent the switch from learning that address, as described in [1]. With this masking, all incoming Ethernet frames are always broadcast by the switch, because it never learns which port the destination device is connected to.
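The masking itself amounts to rewriting the source-address field of the Ethernet header before a frame leaves the node. The userspace C sketch below only illustrates the frame layout and the rewrite; the real change is inside the kernel's transmit path, and the masked address value (all zeros) is an assumption for illustration.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Classic Ethernet II header layout. */
struct eth_hdr {
    uint8_t  dst[6];
    uint8_t  src[6];
    uint16_t ethertype;
};

/* Overwrite the source MAC so the switch never associates the shared
 * cluster MAC with a single port. */
static void mask_source_mac(struct eth_hdr *h)
{
    memset(h->src, 0, sizeof h->src);
}

int main(void)
{
    struct eth_hdr h = {
        .dst = { 0x00,0x16,0x3e,0x00,0x00,0x01 },  /* shared cluster MAC */
        .src = { 0x00,0x16,0x3e,0x00,0x00,0x01 },
        .ethertype = 0x0800,                       /* IPv4 */
    };
    mask_source_mac(&h);
    printf("src after masking: %02x:%02x:...\n", h.src[0], h.src[1]);
    return 0;
}

Because this rewrite happens on every outgoing frame, the switch's learning table never maps the shared MAC to a port, so it floods frames addressed to that MAC out of all ports.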
Table 1. Computers and software used in experiments.

Server / BeSim:
  CPU:              AMD Opteron 1124 (2.2 GHz × 2)
  Memory:           2 GB
  HDD:              HITACHI HDS7225SCSUN250G (250 GB SATA)
  Ethernet devices: Intel Pro/1000 Server (for communications with clients);
                    Broadcom NetXtreme BCM5715, on-board (for intra-cluster communications)
  Kernel:           Linux 2.6.20 (x86_64)
  Web server:       Apache 2.2.3
  PHP:              PHP 5.1.6

Client:
  CPU:              Intel Xeon 2.80 GHz × 2
  Memory:           1 GB
  Ethernet device:  Intel Pro/1000 Server (on-board)
  Kernel:           Linux 2.6.18 (i386)
  Java VM:          Sun Java 1.5.0_12

Switches:
  Switch A:         Unknown GbE switch
  Switch B:         Cisco Catalyst 3750 24-T-S
  Switch C:         Alaxala AX3630S-48T2XW

Figure 7. The testing environment for SPECweb.

...size of the main memory. The HTTP keep-alive mechanism is enabled and its timeout is set to one second.

In order to reduce CPU consumption, the Intel Pro/1000 Ethernet cards on the server nodes are configured for server use: two driver parameters, InterruptThrottleRate and RxDescriptors, are set to 4000 and 768, respectively. This limits the maximum interrupt rate to 4000 interrupts per second.
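Assuming the stock e1000 driver, these are module parameters, so a configuration along the following lines would apply them at module load time; the file location varies by distribution, and the comma-separated values (one per adapter) follow the driver's multi-port convention.

# /etc/modprobe.conf (location varies by distribution); one value per adapter
options e1000 InterruptThrottleRate=4000,4000 RxDescriptors=768,768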
5.2 Server clustering methods
Table 2. Results of SPECweb benchmark tests (Simultaneous Sessions = 2300).

Clustering  File      Total             QoS              QoS              QoS             Error
Method      Type      Requests          Good             Tolerable        Fail
FTCS        Scripts   363,614           328,142          361,449          2,165
            Download  26,459            22,940           23,349           3,110           0
            Total     390,073           351,082          384,797          5,275
PortHash    Scripts   321,165 (0.883)*  289,686 (0.883)  311,472 (0.862)  9,692 (4.476)
            Download  23,324 (0.882)    17,200 (0.750)   17,431 (0.747)   5,892 (1.895)   5.33
            Total     344,488 (0.883)   306,887 (0.874)  328,904 (0.855)  15,585 (2.954)
LVS-DR      Scripts   361,383 (0.994)   335,347 (1.022)  360,245 (0.997)  1,138 (0.526)
            Download  26,256 (0.992)    21,615 (0.942)   21,918 (0.939)   4,337 (1.395)   0
            Total     387,639 (0.994)   356,962 (1.017)  382,163 (0.993)  5,476 (1.038)

* Values in parentheses are ratios against the corresponding FTCS results.
[Figure: number of Apache processes per node (Node 0, Node 1, ...) over the benchmark run, one panel each for FTCS, PortHash, and LVS-DR.]
...because the number of Apache processes is limited to 768. Since disk I/Os are invoked by HTTP requests, waiting for a disk read means that one TCP connection is kept established and idle. Therefore it is reasonable to use the number of established connections as an indicator of load.
Figs. 9 and 10 also show that LVS-DR did not balance workloads as well as FTCS did. This is because the dispatcher in the LVS-DR configuration sometimes cannot tell that a back-end node has closed a connection, since it does not see outgoing packets from back-end nodes. In particular, with the HTTP keep-alive feature enabled, the server side closes the connection after the keep-alive timeout. By this time, the HTTP server process has already closed the socket and is ready for the next connection; the dispatcher, however, thinks that there is still an established connection. This makes load balancing worse and eventually causes some nodes to become overloaded.
6 Related Work to be synchronized since UDP packets may be dropped.
Hive server[13] is a system which aims at improving load balancing in broadcast-based single IP clusters. It uses a table to determine whether a node should accept an incoming packet. When an IP packet arrives at a node, the node calculates a hash key from a combination of the IP address and the port number, and the table tells whether the node should accept the packet based on that hash key. Therefore, this system is a kind of weighted static hash function approach. The table is shared among all nodes and updated periodically so that idle nodes accept more packets than busy nodes. However, this approach takes longer than ours to respond to an imbalance of workload.
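As a point of comparison with FTCS's dynamic scheduling, the static hash based acceptance test that Hive server and the PortHash configuration rely on can be sketched in a few lines of C; the hash function and table contents here are illustrative assumptions.

#include <stdio.h>
#include <stdint.h>

enum { BUCKETS = 16 };

/* Static bucket-to-node map shared by all nodes; weighting a node more
 * heavily means giving it more buckets. The contents are made up. */
static const int bucket_owner[BUCKETS] = {
    0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 0, 1, 2
};

/* Each node applies the same hash to each incoming packet and accepts
 * it only if it owns the resulting bucket. */
static int accepts(int self, uint32_t src_ip, uint16_t src_port)
{
    uint32_t h = (src_ip ^ src_port) % BUCKETS;
    return bucket_owner[h] == self;
}

int main(void)
{
    uint32_t client = 0x0a000001;             /* 10.0.0.1 */
    for (int node = 0; node < 4; node++)
        printf("node %d %s\n", node,
               accepts(node, client, 40000) ? "accepts" : "discards");
    return 0;
}

Exactly one node accepts any given packet, but rebalancing requires distributing a new table to every node, which is why such schemes react slowly to load imbalance.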
Saru[7] supports a configuration called Active-Active, which introduces multiple dispatchers into a Linux Virtual Server-based cluster. In this configuration, all dispatchers are active at the same time and receive all packets incoming to the cluster by using broadcast or multicast, as broadcast-based clusters do. Each dispatcher employs a hash function to determine which packets to accept. This mechanism reduces the impact of a dispatcher failure. Dispatchers exchange multicast UDP packets carrying the list of current connections in order to keep existing connections alive (Connection Synchronization). However, this mechanism consumes CPU resources and network bandwidth. Moreover, it does not guarantee that all connections are synchronized, since UDP packets may be dropped.

SAPS[8] is a mechanism for dispatcher-based clusters that offloads TCP protocol handling to the dispatcher. It is mainly focused on network performance and uses Myrinet[9] for the intra-cluster network. Because congestion control and flow control are done at the network card, back-end nodes do not suffer packet losses due to congestion inside the cluster. However, because the TCP protocol stack
completely runs on the single dispatcher, named the I/O Server, it is difficult to keep the cluster running when the I/O Server fails.

Round Robin DNS[3] (RR-DNS) ties multiple IP addresses to one domain name. Upon request, the DNS server chooses one of the IP addresses in a round robin manner. This kind of DNS-based mechanism can be used together with single IP clusters like FTCS: RR-DNS distributes requests among geographically distributed sites, whereas single IP approaches are able to deal with load imbalance and node failures at a finer granularity.
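For illustration, RR-DNS requires nothing more than several address records under one name; a BIND-style zone fragment (hypothetical name and addresses) looks like this:

; one name, several A records; the server rotates the answer order per query
www.example.com.  300  IN  A  192.0.2.10
www.example.com.  300  IN  A  192.0.2.11
www.example.com.  300  IN  A  192.0.2.12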
7 Conclusion and future work

In this paper we have proposed and implemented FTCS, a mechanism for enabling flexible load balancing of TCP applications in broadcast-based single IP clusters. FTCS introduces a master node as a centralized connection scheduler to improve load balancing. Once a TCP connection is established, the master node is no longer involved in the communication. This makes it possible for the other nodes to keep their connections when the master node fails.

We ran the SPECweb2005 Support benchmark against a four-node cluster. Results from the benchmark tests show that FTCS with a least-connection scheduling algorithm balances disk I/O loads equally among the four nodes and handles about 13% more requests, on average, than the existing static connection assignment method. The results also show that FTCS performs as well as Linux Virtual Server, the existing dispatcher-type cluster implementation. The increase in CPU utilization caused by receiving and discarding irrelevant packets was less than 3% during the SPECweb2005 benchmark test, and less than 5% while receiving TCP bursts at 1 Gbps. This indicates that the cost of receiving all incoming packets is negligible for most TCP server applications, and thus the broadcast-based approach is a reasonable way to build single IP clusters.

As future work, we will implement cluster management features such as node failure detection, master node failover, and hot node addition/removal. We will also evaluate the downtime under node failures and compare it with that of a highly available cluster based on Linux Virtual Server.
References

[4] V. Cardellini, E. Casalicchio, M. Colajanni, and P. S. Yu. The state of the art in locally distributed web-server systems. ACM Computing Surveys, 34(2):263–311, 2002.
[5] O. P. Damani, P. E. Chung, Y. Huang, C. Kintala, and Y.-M. Wang. ONE-IP: techniques for hosting a service on a cluster of machines. In Selected papers from the sixth international conference on World Wide Web, pages 1019–1027, Essex, UK, 1997. Elsevier Science Publishers, Ltd.
[6] D. M. Dias, W. Kish, R. Mukherjee, and R. Tewari. A scalable and highly available web server. In COMPCON '96: Proceedings of the 41st IEEE International Computer Conference, pages 85–92, Washington, DC, USA, 1996. IEEE Computer Society.
[7] S. Horman. Active-Active Servers and Connection Synchronisation for LVS. linux.conf.au 2004 (LCA 2004), Jan. 2004.
[8] H. Matsuba and Y. Ishikawa. Single IP address cluster for internet servers. In Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007.
[9] Myrinet. http://www.myri.com.
[10] P. O'Rourke and M. Keefe. Performance evaluation of Linux Virtual Server. In LISA 2001: 15th Systems Administration Conference, 2001.
[11] J. Postel. Transmission Control Protocol. RFC 793, Sept. 1981.
[12] SPECweb2005. http://www.spec.org/web2005/.
[13] T. Takigahira. Hive server: high reliable cluster web server based on request multicasting. In Proceedings of the Third International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT'02), pages 289–294, 2002.
[14] S. Vaidya and K. J. Christensen. A single system image server cluster using duplicated MAC and IP addresses. In Proceedings of the 26th Annual IEEE Conference on Local Computer Networks, pages 206–214, 2001.
[15] W. Zhang. Linux Virtual Servers for scalable network services. Linux Symposium, 2000.