Self-Healing Mechanism On Switch-Controller Connections in SDN
by
Takuma Watanabe
Supervisor
Submitted to
Department of Communications and Computer Engineering
in fulfillment of the requirements for the degree of
Master of Engineering
at the
TOKYO INSTITUTE OF TECHNOLOGY
February 2015
Summary
Modern networking infrastructure is becoming increasingly complex in the face of rapidly growing demands from modern applications, and must meet major challenges including manageability, flexibility, and extensibility. Software-Defined Networking (SDN) is an emerging paradigm of computer networking that meets these demands. On the other hand, besides its desired advantages, SDN has a considerable disadvantage in reliability due to its centralized architecture. Much research has been performed to overcome this reliability problem. However, none of it can protect networks, especially their control logic, under large-scale, unexpected link failures. Networking infrastructures are now a fundamental basis of modern society, so they must remain reliable even under severe failures such as those caused by disasters.
We have found that centralized protection and restoration mechanisms, in which only controllers take action to recover the control logic after link failures, are incapable of recovering from large-scale link failures, so we decided to use a distributed mechanism in which every switch is able to maintain its own control logic. We propose ResilientFlow, a self-healing mechanism in which switches can manage their control channels by their own means. We introduce a module, the Control Channel Maintenance Module (CCMM), that enables a switch to detect a control channel failure and restore the channel via an alternative path. Inside each switch, a CCMM 1) monitors the link status of the switch with heartbeat packets, 2) exchanges network topology maps with the switch's controller(s) and with neighboring CCMMs, and 3) sets up flow entries in the switch to establish a path from the switch to the controller(s).
In this paper, we design and implement ResilientFlow. For our implementation, we use the OSPF daemon in Quagga to monitor link status and to exchange network topology maps, and we utilize the Internal Ports of Open vSwitch so that the OSPF daemon can work correctly. We then implement the CCMM's flow entry installer in Python. We use the Linux kernel's multiple routing table functionality so that the switch's routing converges to the SDN manner, in which forwarding follows only the flow table, not the conventional TCP/IP mechanisms.
To prove our concept and show how ResilientFlow recovers control channels, we conducted a series of experimental evaluations in two different scenarios: one in which a single specified link fails in a dedicated topology, and one in which multiple random links fail in a real-world topology. We show that ResilientFlow recovers a control channel within 300 ms after a single link failure. We also show that ResilientFlow can restore control channels after multiple, severe link failures, taking time on the order of seconds.
We also extended the CCMM to address the domain-splitting problem, in which a switch has no available path to the controllers, by enabling the CCMM to act as an emergency alternative controller. Our experiments show the applicability of the CCMM to the domain-splitting problem, with a restoration time of approximately 1 to 2 seconds.
As future work, we suggest further applications of ResilientFlow to the SDN bootstrapping problem and discuss a better controller selection method in a split-domain environment.
Acknowledgement
Contents
Summary i
Acknowledgement iii
Table of Contents iv
List of Figures vi
1 Introduction 1
3 ResilientFlow 11
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1) Monitoring Link Statuses . . . . . . . . . . . . . . . . . . . . . . . . 13
2) Exchanging Network Topology Maps . . . . . . . . . . . . . . . . . . 13
3) Installing Flow Entries for Control Channel . . . . . . . . . . . . . . . 13
4 Implementation 15
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.2 Monitoring Link Status and Exchanging Network Topology Maps . . . . 16
4.3 Installing Flow Entry for Control Channel . . . . . . . . . . . . . . . . . 17
5 Experimental Evaluation 19
5.1 Experimental Environment . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.1.1 Emulated Network with Mininet . . . . . . . . . . . . . . . . . . 19
5.1.2 Switch Nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5.1.3 Controller Nodes . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.2 Evaluation Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.2.1 Single Specified Link Failure Scenario . . . . . . . . . . . . . 22
5.2.2 Random Multiple Links Failure Scenario . . . . . . . . . . . . . 27
7 Discussion 36
7.1 Controller–CCMM Coordination Problem . . . . . . . . . . . . . . . . . 36
7.2 SDN Domain-Splitting Problem . . . . . . . . . . . . . . . . . . . . . . 36
Controller Election . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Controller–Application Re-coordination . . . . . . . . . . . . . . . . . . 37
7.3 SDN Bootstrapping Problem . . . . . . . . . . . . . . . . . . . . . . . . 37
8 Conclusion 38
Bibliography 40
Publications 44
English Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Japanese Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
List of Figures
List of Tables
Chapter 1
Introduction
Today's networking infrastructure is becoming increasingly complex in the face of rapidly growing demands from modern applications, and must meet major challenges including manageability, flexibility, and extensibility. Despite their wide adoption today, conventional networks are known to be hard to manage due to their lack of manageability [6]. Moreover, modern diversified network services and applications not only generate large amounts of data traffic [12] but also have multifarious resource demands, including bandwidth, latency, and other kinds of QoS requirements. Enforcing high-level policies throughout the network is required in today's datacenter and inter-datacenter networks [8], which leads to a strong need for efficient support of flexible network management, i.e. the flexibility of the network [9]. Additionally, the conventional networking architecture lacks the extensibility to adapt to new applications and their demands, due to its vertically integrated, hardware-implemented architectural design [13].
Software-Defined Networking (SDN) is an emerging paradigm of computer networking that addresses these problems [1][2][3][10]. The key idea of SDN is to separate the network's control logic (control plane) from its packet forwarding logic (data plane), which are vertically integrated in the same switch box in the conventional architecture, and to converge the control logic into a few centralized instances called controllers [13]. In SDN, switches only perform packet forwarding, while controllers take care of all remaining functions, including maintaining the switches' forwarding tables. Through the controllers, administrators and approved applications take control of the network and can manipulate it flexibly. Because SDN controllers are basically implemented as software, we can easily extend the networking infrastructure to meet additional demands from new applications without upgrading physical switches. Also, only one or a few controllers are assumed to be placed in each administrative domain of the network. This architecture significantly reduces the work of managing networks [1].
While SDN has advantages in manageability, flexibility, and extensibility, it also has a notable disadvantage in reliability. Having split and centralized the control logic, SDN may lose its control capability when the controller fails or the links between switches and controllers fail.
Today's networking infrastructure needs to be manageable and extensible, as noted above, which SDN can realize; it also needs to be reliable, as it has become crucial to modern society. Networking infrastructure must remain reliable while being exposed to disasters, many of which are reported to cause large-scale link failures affecting unexpected parts of the network. For example, a hurricane can cause persistent link failures [20], and an earthquake can cause even worse failures, such as international fibre disconnections and large-scale re-routing [18][19].
Much research has thus been done to improve the reliability of SDN [5], which we will introduce in the next chapter. Fonseca et al. [15] proposed approaches that deploy multiple controllers as backups in a single administrative domain of the network. Sharma et al. [17][16] proposed protection methods that use prepared backup links to keep controllers and switches connected. These preparation methods only work well when some expected node or link fails.
The SDN failure recovery mechanisms introduced so far have been bound to the core idea of SDN, the centralized paradigm, and therefore have limited recovery ability; i.e., the existing approaches cannot deal with unexpected, large-scale link failures. SDN removed the switches' capability to self-manage their forwarding information base, including the entries for their own control channels. This design is believed to have been introduced for the simplicity, and thus cost efficiency, of switches in the very early days of SDN [13], and this simplicity is exactly what reduces the reliability of the control channel, as described above. As SDN becomes more widely used and application demands grow and become more complex, its control logic must be reliable, and this reliability rests on both the reliability of the controller and that of the control channel. The reliability of the control channel is hard to maintain when switches have no self-management capability, so we believe a switch has to be able to manage its own control channel by its own means. SDN's promise is that controllers can manage their switches (and thus the entire network) [10]; this does not mean that only controllers may manage the switches. As a fundamental basis, we must keep switches and controllers connected even under unexpected, large-scale link failures, to keep the network alive.
To overcome these limitations and make SDN tolerant of the large-scale, unexpected link failures caused by disasters, we introduce a distributed paradigm to protect and restore the control logic in SDN; that is, to make SDN resilient to link failures. We propose ResilientFlow, a self-healing mechanism that utilizes distributed link failure detection and restoration. Our proposal is to enable switches to manage their own logical connections to the controllers, called control channels. We introduce a module called the Control Channel Maintenance Module (CCMM) and deploy it to every switch and controller. Each CCMM monitors the link status between neighboring switches and controllers and exchanges neighboring information with neighboring CCMMs, so that the switch can detect disconnection of its control channel. Upon detecting a link failure, the CCMM modifies the forwarding table of the switch to keep the control channel alive. Note that the CCMM only maintains the forwarding table entries that establish the control channel, not any entries for data forwarding, so that centralized control remains at the controller.
The rest of this paper is organized as follows. Chapter 2 describes the technical details of SDN and its reliability, and introduces preceding research on improving the reliability of SDN. Chapter 3 describes the design of ResilientFlow and its CCMM, while Chapter 4 shows the implementation of the CCMM based on that design. Chapter 5 presents experimental results from our implementation in two evaluation scenarios: a performance evaluation in a small-topology scenario and a resiliency demonstration in a real-world-topology scenario. Chapter 6 shows a further application of the CCMM, with a specific extension to the domain-splitting environment. Chapter 7 discusses technical issues of SDN and the CCMM and suggests advanced applications of the CCMM, and Chapter 8 concludes the paper.
Chapter 2
hard task due to their lack of manageability [6]. In conventional networks, applying specific high-level network policies, which is essential for modern complex networks, requires network operators to use low-level command interfaces on each network device separately. As a result of this distributed design, a manageability improvement is highly required in the conventional architecture.
Also, no standardized automatic configuration framework exists in the conventional networking architecture, which means that flexible management and configuration of networks in a dynamic environment is highly challenging. Today's uses of networking, especially in data center networks, lack efficient support for flexible network management, and thus the flexibility of the network is required [9]. Modern diversified network services and applications not only generate large amounts of data traffic [12] but also have multifarious resource demands, including bandwidth, latency, and other kinds of QoS requirements, and accommodating these applications needs efficient support for flexible network management in such dynamic networking environments. Enforcing high-level policies throughout the network is required in today's datacenter and inter-datacenter networks [8]. For example, handling delay-sensitive interactive user traffic and delay-insensitive background traffic separately improves Quality of Service and Quality of Experience while making better use of the network's capacity, which we cannot do in the current networking architecture [4].
Additionally, in the conventional architecture, network control functionality, including distributed protocols and algorithms, is implemented in the hardware of switching equipment, due to the vertically integrated model of the data plane and control plane. This architectural design makes it hard to introduce, extend, and innovate new protocols, designs, and architectures, as doing so may require the existing networking hardware to be replaced entirely; the architecture thus lacks extensibility, and enabling network extensibility is desired [13].
As a solution to these problems of modern networking infrastructure, SDN has been proposed and is growing widely. Later in this section, we describe the design of SDN in detail.
[Figure: Architecture comparison — in the conventional architecture, each switch integrates a control plane and a data plane; in SDN, the control plane is converged into a controller that manages the switches' data planes over control channels, with applications on top]
plane, which is responsible for forwarding data packets on the basis of its forwarding table, called a flow table. In SDN, switches never manage their own flow tables by themselves. In its early days, SDN was intended to be used without replacing any hardware devices [13]; for this reason, SDN removed the self-control capability from switches.
Controllers, on top of the switches, handle the rest of the functionality required for the networking infrastructure to work, including managing their switches' flow tables. The component that keeps the networking infrastructure working is called the control plane. The switch interface through which a controller manages its switches is called the Southbound API. Controller nodes are usually implemented as software, so that they can easily be replaced to meet upcoming demands from new applications. This realizes the extensibility of networks. Also, in SDN, a few controller instances are assumed to be placed in a single administrative network; these controllers entirely manage their own networks. This architectural design significantly reduces the management tasks of network operators and thus brings better manageability.
Applications, on top of the controllers, are given an abstract view and programmability of the networking infrastructure. The controller interface through which applications take control of their given network is called the Northbound API. Through this controller–application interface, SDN realizes flexible management of networks.
[Figure: SDN switch–controller connections — (1) in-band, via another switch, and (2) out-of-band, via a dedicated link]
via a dedicated link (used only for control channels), in which case the channel is called an out-of-band connection, or via another switch and a path to that switch, in which case the channel is called an in-band connection. In other words, switches can be configured to connect to controllers via links that are also used for data forwarding.
[Figure: SDN switch design — a datapath with a flow table and physical ports, connected to the controllers]
Figure 2.4: Different aspects and different parts of the reliability in SDN
Sharma et al. [16] utilized this fast failover mechanism to achieve data plane failure
recovery. They proposed and evaluated two different recovery methods: restoration and
protection.
controller to the switch in order. The authors focused on fast failure recovery at carrier-grade speed, i.e. within 50 ms. However, the centralized controller takes action to restore the control channel by using port status messages from the other switches to detect the failed link and by installing flow entries from the controller. Therefore, the controller may fail to calculate a recovery path correctly under large-scale link failures. Indeed, the authors only conducted single-link failure experiments.
The research described above focuses on the reliability of the controller and of the switch–controller connection under a few expected link disconnections, and especially on fast failover mechanisms. Today's networks are exposed to disasters in which severe link disconnections occur. Under such conditions, the current failover mechanisms cannot be expected to work correctly, so a strong recovery mechanism is required against large-scale, unexpected link failures.
Chapter 3
ResilientFlow
We introduced the SDN architecture and discussed its reliability in Chapter 2, where we surveyed control channel restoration proposals that handle only single link failures. To make SDN stably reliable by enabling it to recover from multiple and unexpected link failures, which a centralized recovery mechanism cannot handle, we decided to use a distributed mechanism in which every switch is able to self-heal its own control channel, so that switches keep working under multiple and unexpected link failures. We introduce ResilientFlow, a failure recovery mechanism that utilizes distributed link failure detection and restoration to recover from control channel failure under multiple and unexpected link failures. In this section, we present an overview and the detailed design of ResilientFlow.
3.1 Overview
ResilientFlow is a control channel failure recovery mechanism that protects and restores the control logic in SDN. To enable SDN to recover from control channel failures autonomously, switches have to 1) detect a control channel failure and 2) restore the control channel by calculating and establishing an alternative path. Our main idea is to enable all switches to maintain their control channels by themselves. More specifically, we introduce a module, the Control Channel Maintenance Module (CCMM), that enables a switch to detect a control channel failure and restore the channel via an alternative path, so that switches can maintain their control channels by their own means, as shown in Fig. 3.1. A CCMM 1) monitors the link status of the switch with heartbeat packets, 2) exchanges network topology maps with the switch's controller(s) and with neighboring CCMMs, and 3) sets up flow entries in the switch to establish a path from the switch to the controller(s), as shown in Fig. 3.3. The design of an SDN switch with a CCMM is shown in Fig. 3.2, annotated with numbered labels indicating these three functionalities.
Note that a CCMM only modifies those of its switch's flow entries that are mandatory for the control channel to be established and for the CCMM to work correctly, and not the rest of the
Figure 3.1: Overview of ResilientFlow
[Figure 3.2: Design of an SDN switch with the CCMM — the control channel endpoint and the Control Channel Maintenance Module, which 1) monitors link status via heartbeats on the physical ports, 2) exchanges network topology maps with controllers and neighboring CCMMs, and 3) installs flow entries into the flow table]
flow entries, including those for data forwarding. The entire capability and flexibility of network management remains held by the domain-administrative controllers. This is why a CCMM is a module, not a controller.
3.2 Design
We introduced the three functionalities of the CCMM in Sect. 3.1. In this section, we describe the details of these functionalities, shown in Fig. 3.3.
[Figure 3.3: CCMM functionality — 1) heartbeat-based link monitoring between neighboring switches, 2) exchange of topology maps revealing a failed link, and 3) re-establishing the control channel to the controller via a new path]
illustrates re-establishing a new control channel via a new alternative path. When a switch detects a control channel failure, the switch's CCMM calculates and determines a path from the switch to the controllers on the basis of the collected network topology maps. After calculating the path, the CCMM sets up flow entries in the switches in accordance with the determined path, as shown in Fig. 3.2 (3).
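As a concrete illustration of the path calculation in step 3), the restoration amounts to a shortest-path search over the collected topology map, followed by per-hop flow installation. The following is a sketch under our own assumptions (function and node names are hypothetical, not the thesis's actual code):

```python
import heapq

def shortest_path(topology, src, dst):
    """Dijkstra over a collected topology map, given as an adjacency
    dict of link costs, to pick an alternative control channel path."""
    dist = {src: 0}
    prev = {}
    queue = [(0, src)]
    while queue:
        d, node = heapq.heappop(queue)
        if node == dst:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, cost in topology.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                prev[neighbor] = node
                heapq.heappush(queue, (nd, neighbor))
    if dst not in dist:
        return None  # domain split: no path to the controller remains
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]
```

Each hop along the returned path then receives the flow entries needed to forward control channel traffic toward the controller.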
Chapter 4
Implementation
In the previous chapter, we presented the design of ResilientFlow and its CCMM. In this chapter, we describe our implementation of the CCMM-enabled switch and how each of its components realizes the functions of the CCMM described in Chapter 3.
4.1 Overview
We decided to use Linux as the basis of our switch nodes and Open vSwitch as our OpenFlow switch software. As described in Section 3.2, the CCMM has three functionalities: 1) monitoring link statuses, 2) exchanging network topology maps, and 3) installing flow entries for the control channel. Given that the behaviour of a link-state routing protocol matches the former two functionalities, monitoring link statuses and exchanging network topology maps, we decided to use the OSPF protocol and an OSPF routing daemon to implement them. For the third functionality, installing flow entries, we built our flow installer in Python. The whole structure of the switch node implementation is shown in Fig. 4.1. On the Linux node that represents a switch node, we run Open vSwitch. On top of Open vSwitch, the OSPF daemon monitors the link and physical port statuses and exchanges network topology maps with neighboring CCMMs' OSPF daemons. The OSPF daemon generates and dumps routing entries to the Linux kernel's routing table in accordance with its collected network topology maps. The flow entry installer watches for routing table change notifications; upon a notification, it generates and installs flow entries in accordance with the Linux kernel's routing table. Finally, Open vSwitch establishes a control channel in accordance with its own flow entries, not with the routing table of the Linux kernel.
[Figure 4.1: Switch implementation — on a Linux node, the OSPF daemon (Quagga's ospfd) monitors link status and exchanges network topology maps through the Internal Ports of Open vSwitch and generates the Linux kernel's routing table; the flow entry installer retrieves the routing table and installs flow entries into Open vSwitch, which has internal and external ports]
Figure 4.2: OSPF daemon to use with OpenFlow switch: Problem description
Figure 4.3: OSPF daemon to use with OpenFlow switch: Solution using Internal Ports
interface on the Linux kernel, as shown on the left-hand side of Fig. 4.3. We then install flow entries that pass OSPF packets through between each physical port and its corresponding internal port, as shown on the right-hand side of Fig. 4.3. The OSPF daemon is then assigned to the internal ports that appear in the Linux kernel's network interface list and uses them to exchange hello packets and network topology maps.
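A minimal sketch of how such passthrough entries could be generated (the port numbers are placeholders, and this is an illustration rather than the actual installer): OSPF is IP protocol 89, which is what the `ip,nw_proto=89` match selects in ovs-ofctl flow syntax.

```python
def passthrough_flows(pairs):
    """Generate ovs-ofctl flow entries that shuttle OSPF traffic
    (IP protocol 89) between each physical port and its paired
    internal port, in both directions."""
    flows = []
    for phys, internal in pairs:
        # physical port -> internal port (incoming OSPF packets)
        flows.append(f"in_port={phys},ip,nw_proto=89,actions=output:{internal}")
        # internal port -> physical port (outgoing OSPF packets)
        flows.append(f"in_port={internal},ip,nw_proto=89,actions=output:{phys}")
    return flows
```

Each generated string would then be installed with a command such as `ovs-ofctl add-flow <bridge> <flow>`.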
suite.
We implemented the CCMM's flow entry installer in Python. The CCMM uses the ip-monitor command as a trigger to watch for network topology map changes and then retrieves the Linux kernel's routing table. The CCMM then calculates flow entries in accordance with the routing entries and installs them into the switch.
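As an illustration of the trigger side of this pipeline, the installer might parse `ip monitor route` output roughly as follows (the line format here is an assumption based on typical iproute2 output, and the helper is hypothetical, not the thesis's code):

```python
import re

# Matches lines like:
#   "10.0.0.0/24 via 10.0.1.2 dev int1 proto zebra metric 20"
#   "Deleted 10.0.0.0/24 via 10.0.1.2 dev int1"
ROUTE_RE = re.compile(
    r"^(?P<deleted>Deleted\s+)?(?P<prefix>\S+)"
    r"(?:\s+via\s+(?P<via>\S+))?\s+dev\s+(?P<dev>\S+)"
)

def parse_route_event(line):
    """Parse one `ip monitor route` line into a route-change event,
    or return None for lines we do not recognize."""
    m = ROUTE_RE.match(line.strip())
    if not m:
        return None
    return {
        "op": "del" if m.group("deleted") else "add",
        "prefix": m.group("prefix"),
        "nexthop": m.group("via"),
        "dev": m.group("dev"),
    }
```

On each parsed event, the installer would re-read the dedicated routing table and translate its entries into flow entries for the switch.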
To let the OSPF daemon hand routing information to the flow entry installer, we utilize the Linux kernel's multiple routing tables and set the OSPF daemon to dump its calculated routes into a dedicated routing table other than the main one. This change avoids two unwanted side effects. The OSPF daemon generates routing entries on the basis of its (internal) ports and their addresses, forcing applications to use an internal port as the source port and its address as the source address. This causes two problems. First, it breaks the TCP/IP 5-tuple, forcibly disconnecting the control channel when the source interface of the control channel connection changes as a result of a link failure. Instead, we want Open vSwitch to use a dedicated address as the source address when establishing the control channel, so that we can measure our CCMM's restoration performance exactly. Second, it would make the switch behave like a hybrid switch, i.e. a switch that has the capabilities of both an SDN switch and a conventional switch (IP router). As we intend to propose an SDN failure recovery mechanism, our implementation should be a normal SDN switch, which follows its flow table for packet forwarding. We want the switch's control channel to simply and only follow its flow table, as in the normal operation of an OpenFlow switch. As the Linux main routing table is used for routing ordinary traffic, including Open vSwitch's, we keep the main table simple: it routes all ordinary traffic into the Open vSwitch bridge, so that all routing converges onto OpenFlow's mechanisms.
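For illustration, the dedicated table could be configured in zebra roughly as follows (the table id 100 is an arbitrary unused number chosen for this sketch, not necessarily the value used in our implementation):

```
! zebra.conf — dump OSPF-computed routes into a dedicated kernel
! routing table instead of the main table
table 100
```

The flow entry installer then reads the OSPF routes with `ip route show table 100`, while the kernel's main routing table continues to route all ordinary traffic into the Open vSwitch bridge.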
Chapter 5
Experimental Evaluation
Chapters 3 and 4 detailed the design and the corresponding implementation of ResilientFlow. To prove our concept and show how ResilientFlow recovers control channels, we run our implementation in emulated network environments and perform a series of experiments. We also evaluate the recovery performance of ResilientFlow. In this section, we describe our evaluation environment, the scenarios, and our two sets of evaluation results.
[Figure: Experimental evaluation with Mininet — a host Linux node runs Mininet, which creates the network topology (a controller and switch nodes running Open vSwitch, connected by tc links) and manages the nodes]
develop a set of special configurations and scripts for Mininet that enables individual networking functions and runs an individual Open vSwitch daemon on each node. The original Mininet does not separate each switch node's networking functions; all switches share the same network interface list and routing table. For this reason, the original Mininet cannot run an individual Open vSwitch on each node without our special configuration. Second, we develop functions that add internal ports and their links to a node's instance in Mininet and that bring both a physical port and its corresponding internal port up or down simultaneously. As the OSPF daemon can only monitor its assigned ports, it cannot detect physical port status changes and thus relies only on hello packet loss, which is detected after the OSPF dead interval (one second when Fast Hello is enabled). With our modification, the OSPF daemon can detect port status changes and is thus able to broadcast LSAs immediately.
Table 5.1: Implementation and evaluation environment
OS
Linux distribution Ubuntu 14.04.1 Trusty Tahr
Linux kernel 3.13.0
Hardware specification
CPU Intel(R) Core(TM) i7-3770K @ 3.5 GHz
RAM 16GB
Software version
Mininet 2.1.0
Open vSwitch 2.0.2
Ryu 3.13
Quagga 0.99.22.4
Python 2.7.6
Nping 0.6.40
included in the Nmap software suite [32]. We set nping to generate UDP packets at a
specified constant rate.
recovery performance in this scenario.
We take the link disconnection time as the epoch time of an experiment, so when we choose multiple links to disconnect, we disconnect all the specified links at once and use the time at which we finish disconnecting all the specified links as the link disconnection time.
While we keep track of the Open vSwitch daemon log so that we can detect control channel disconnection and reconnection due to timeout, we expected (and later found to be true) that control channel restoration is so fast that we cannot use the underlying TCP disconnection and reconnection times detected by Open vSwitch to measure control channel restoration performance; we therefore decided to use packet-in messages for this measurement. We generate packet-in messages at a fast, constant rate on the switches and log the time at which these messages are received at the controller. We take the time between the link disconnection and the resumption of packet-in messages from a switch as that switch's control channel restoration time, or link restoration time. We also take the latest link restoration time among all restored switches as the network restoration time.
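This measurement rule can be sketched as follows (a hypothetical helper for clarity, not our actual evaluation scripts):

```python
def restoration_times(fail_time, packet_in_log):
    """Per-switch control channel restoration time: the gap between the
    link disconnection epoch and the first packet-in received from that
    switch afterwards. packet_in_log is an iterable of
    (timestamp, switch_id) pairs logged at the controller."""
    first_after = {}
    for ts, switch in sorted(packet_in_log):
        if ts > fail_time and switch not in first_after:
            first_after[switch] = ts - fail_time
    return first_after

def network_restoration_time(fail_time, packet_in_log):
    """The latest link restoration time among all restored switches."""
    per_switch = restoration_times(fail_time, packet_in_log)
    return max(per_switch.values()) if per_switch else None
```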
In both scenarios, we follow the same steps. First, we set up the switches, a controller, and a network connecting them in accordance with a given topology map. We then establish control channels from the specified switches to the controller and start sending packet-in messages from the specified switches to the controller. After this initial setup finishes, we disconnect the determined links. Finally, we keep the controller, the switches, and the packet-in message generators running until the desired switches' control channels are restored.
In the following sections, we describe the scenarios and their results in detail.
[Figure: Dedicated topology for the single specified link failure scenario — target switch 1, switches 2 and 3, and controller C; all links are 1000 Mb/s with 5 ms delay, with OSPF costs of 10 and 15]
is an in-band path, 1–2–C. In this case, we disconnect link 1–2. After the link is disconnected, the switch is reconfigured to connect via a different in-band path using a different source interface on the switch side, 1–3–C, as shown in Fig. 5.3 (b). In the third case, in-to-in-middle, the switch is initially connected via the in-band path 1–2–C, and after we disconnect link 2–C, the switch is connected via a different in-band path through the same source interface, 1–2–3–C, as shown in the upper and lower parts of Fig. 5.3 (c). In this case, only the middle links along the path change from the switch's point of view. In all three cases, we ran the experiment 20 times.
Figures 5.4, 5.5, and 5.6 show the evaluation results. Each figure shows, in both a plot and a table, the measured duration from the link disconnection to each event in the corresponding evaluation case. Each figure also includes the 95% confidence intervals of the measurements, and the sample points of each event are plotted beneath each result point.
The results show that link restoration takes 250 to 300 ms. Between the link disconnection and the routing table modification, only the OSPF daemons are at work; the other component, the flow entry installer, simply waits for the routing table change notification. The results suggest that collecting the network topology maps and calculating the routing tables take up most of the restoration time. This also suggests that we may improve network restoration by tuning the OSPF configuration (e.g., more frequent Hello and LSA exchanges). Note that in this scenario, the control channel remained connected throughout the experiment.
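As an example of such tuning, the failure detection delay is governed by the OSPF Hello and Dead timers. The following is a hypothetical Quagga ospfd.conf fragment, shown only as a sketch of the idea; the interface name and values are illustrative:

```
! Hypothetical Quagga ospfd.conf fragment: shorten the Hello and Dead
! intervals from their 10 s / 40 s defaults so link failures are
! detected, and LSAs flooded, sooner.
interface eth1
 ip ospf hello-interval 1
 ip ospf dead-interval 4
```

More aggressive timers detect failures faster at the cost of higher control traffic and a greater risk of false failure detection on congested links.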
We also see that the in-to-in-middle case and the in-to-in case take longer than the
out-to-in case.

[Figure 5.3: Three failure cases in single specified link failure experiments — (a) out-to-in, (b) in-to-in, (c) in-to-in-middle; each panel shows the control channel path before (upper) and after (lower) the link disconnection, over controller C and switches 1, 2, and 3.]

As described in the previous sections, the CCMM in each switch asynchronously collects the network topology map, calculates the path, and installs the flow
entries. This is likely because the in-to-in and in-to-in-middle cases use a longer path than the out-to-in case after the disconnection. A switch can restore connectivity to the controller only after all the switches along its path to the controller have been configured. This explains the time between the installation of the flow entries and the resumption of packet-in messages.
Event                                  Time from link down [ms]
Packet-in stopped                      0.0 ± 0.0
Network topology map generated         211.5 ± 0.3
Installing of flow entries started     214.5 ± 0.4
Installing of flow entries finished    222.8 ± 0.5
Packet-in restarted                    246.3 ± 7.0

Figure 5.4: Link restoration performance in single specified link failure experiments: out-to-in case
[Figure 5.5: Link restoration performance in single specified link failure experiments: in-to-in case.]
Event                                  Time from link down [ms]
Packet-in stopped                      0.0 ± 0.0
Network topology map generated         217.9 ± 8.7
Installing of flow entries started     221.0 ± 8.8
Installing of flow entries finished    233.8 ± 8.7
Packet-in restarted                    292.7 ± 9.3

Figure 5.6: Link restoration performance in single specified link failure experiments: in-to-in-middle case
5.2.2 Random Multiple Link Failure Scenario
In this scenario, we use a large-scale, real-world topology from the Topology Zoo [30], [31], mainly to study and demonstrate our proposal's feasibility. We also measure the network failure recovery time in this scenario.
We use the topology of the BT North America network, which consists of 36 nodes and 76 links, shown in Fig. 5.7. To initialize our network, we set the link speed to 1000 Mb/s, with link delays based on the distances calculated from the nodes' latitudes and longitudes given in the Topology Zoo topology file and the speed of light in optical fiber. We then choose the node with the largest degree and the lowest node number in the original topology file, which is node 28 in Fig. 5.7. This node is configured as the controller, and all the other nodes are configured as switches.
In this scenario, we randomly choose the links to be disconnected in each experiment. The number of links to be disconnected is given as a ratio of the total number of links, which we call the link disconnection rate. We then determine the nodes that will still have alternative paths to the controller after all the chosen links have been disconnected; we call these nodes the controller-reachable switches. We run packet generators on all the controller-reachable switches, which generate packet-in messages at a rate of 10 packets/s.
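The link selection and reachability computation above can be sketched as follows. This is an illustrative reimplementation of ours, not our experiment harness; all names are hypothetical:

```python
import random
from collections import deque

def disconnect_random_links(links, rate, seed=None):
    """Return the links that survive when a fraction `rate` (0..1)
    of them, chosen at random, is disconnected."""
    rng = random.Random(seed)
    failed = set(rng.sample(range(len(links)), int(len(links) * rate)))
    return [link for i, link in enumerate(links) if i not in failed]

def controller_reachable(nodes, links, controller):
    """Controller-reachable switches: nodes with a path to the
    controller over the remaining links (breadth-first search)."""
    adj = {n: [] for n in nodes}
    for a, b in links:
        adj[a].append(b)
        adj[b].append(a)
    seen, queue = {controller}, deque([controller])
    while queue:
        for nxt in adj[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {controller}
```

Only the switches returned by `controller_reachable` can ever restore their control channels, which is why they bound the restoration results below.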
We perform 20 experiment runs for each link disconnection rate: 10, 20, 40, 60, and 80 percent. We measure both the link restoration time of every switch and the network restoration time, as well as the number of controller-reachable switches.
Figure 5.8 shows the network restoration times at each link disconnection rate, with 95% confidence intervals; the sample points are plotted beneath each result point. The network restoration time increases as the link disconnection rate rises to 40 percent and decreases beyond that. This is due to two opposing factors: the number of controller-reachable switches and the number of changed links along the control channel paths. Figure 5.9 shows the number of controller-reachable switches against the link disconnection rate. As the link disconnection rate increases, the number of links changed by the disconnection increases, but the number of controller-reachable switches decreases. Compared to the former scenario, the larger topology increases the OSPF convergence time and thus the restoration time. In addition, the control channels of some switches are disconnected by TCP timeouts, making the restoration time even longer.
We also show the restoration progress of one experiment from this series, with a 40 percent link failure rate, in Fig. 5.10. The network topology after all the chosen links have been disconnected is shown in Fig. 5.11, with solid lines indicating connected links and dashed lines indicating disconnected links. Figure 5.10 shows the cumulative number of restored switches; the horizontal dashed line at the top indicates the number of controller-reachable switches, which is the upper limit of the number of restored switches. In this figure, we can see that some switches appear
[Figure 5.7: The BT North America network topology (36 nodes, 76 links); node 28 is the controller.]
to wait for other switches to be configured: for a switch to connect to the controller, all the switches along its path to the controller must first be configured.
Another example, with an 80 percent link failure rate, shows a notable result. Figure 5.12 shows the network topology in this case, in which the network is split into multiple disconnected domains. The CCMM is capable of modifying flow entries, which is a partial and limited, but fundamental, capability of an SDN controller. This means that, with further extensions that turn the CCMM into a basic mini controller, we can use the CCMMs as emergency alternative controllers in these disconnected domains. To address this domain-splitting problem, we extend the CCMM, implement the extension, and perform further experiments in Chapter 6.
[Figure 5.8: Network restoration time [s] against link failure rate [%].]
[Figure 5.9: Number of controller-reachable switches against link failure rate [%].]
[Figure 5.10: Restoration progress (cumulative number of restored switches over time) in an experiment with a 40 percent link failure rate.]
Figure 5.11: An example topology for large-scale link failure after the links have been disconnected
Figure 5.12: An example topology with 80% link failure after the links have been disconnected
Chapter 6
[Figure: The SDN domain-splitting problem — before the links are disconnected, all switches (S) reach the controller (C); after the links are disconnected, the network splits into a reachable domain and unreachable domains.]
For Open vSwitch, we can use OVSDB [34], Open vSwitch's switch configuration protocol. There is also an alternative protocol, OF-Config [35], standardized by the Open Networking Foundation, which we may use with switches other than Open vSwitch. As our implementation uses Open vSwitch, we decided to use OVSDB for the implementation of the extension.
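To illustrate the kind of reconfiguration this enables, the `ovs-vsctl set-controller` command writes a new controller target into the OVSDB database, which Open vSwitch then acts on. The following sketch (our own wrapper; bridge name and address are illustrative) shows how an extended CCMM could repoint a switch at an alternative controller:

```python
import subprocess

def set_controller_cmd(bridge, controller_ip, port=6633):
    """Build the ovs-vsctl invocation that points `bridge`
    at a new OpenFlow controller target."""
    return ["ovs-vsctl", "set-controller", bridge,
            f"tcp:{controller_ip}:{port}"]

def repoint_controller(bridge, controller_ip, port=6633):
    """Apply the new controller target via OVSDB.
    Requires a running Open vSwitch with root privileges."""
    subprocess.check_call(set_controller_cmd(bridge, controller_ip, port))

# e.g. repoint_controller("br0", "10.0.0.28")
```

A switch other than Open vSwitch would perform the equivalent reconfiguration through OF-Config instead.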
[Figure: CCMM-based recovery from the SDN domain-splitting problem — before the links are disconnected, the CCMMs (M) of switches 1–3 connect to the controller (C); after the disconnection, a CCMM in the unreachable domain takes over as an emergency alternative controller.]
Chapter 7
Discussion
Controller Election
We have noted that, since centralization is the core idea of SDN, we keep a centralized control mechanism even in a split-domain environment, so we have to select one controller among the controllers in a domain as the split domain's alternative controller. In this selection, called master controller election, how to choose the alternative controller can be an important issue. In our proof-of-concept demonstration, we use a simple selection algorithm that chooses the reachable node with the lowest ID as the controller. As a more sophisticated solution to this alternative controller election problem, Heller et al. [36] describe this controller selection problem and suggest that it is best to choose the node with the minimum average delay.
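The two policies can be sketched as follows. This is an illustration of ours, not our implementation; the function names and the pairwise delay-table format are hypothetical:

```python
def lowest_id_election(reachable_nodes):
    """Simple rule used in our proof of concept: elect the
    reachable node with the lowest ID as alternative controller."""
    return min(reachable_nodes)

def min_average_delay_election(reachable_nodes, delay):
    """Rule along the lines suggested by Heller et al.: elect the
    node minimizing the average delay to all other reachable nodes.
    `delay[(a, b)]` is the delay between nodes a and b."""
    def avg_delay(candidate):
        others = [n for n in reachable_nodes if n != candidate]
        return sum(delay[(candidate, n)] for n in others) / len(others)
    return min(reachable_nodes, key=avg_delay)
```

The lowest-ID rule needs no measurement and always converges to the same node in every switch, while the delay-based rule trades extra measurement traffic for lower control latency.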
Controller–Application Re-coordination
In a split-domain environment, controlling applications that work with the controller may lose connectivity to the controller due to link failures. In such cases, the controlling applications should re-establish their connections to a controller, which may be the original controller or a new alternative one. For the controlling applications to keep working with controllers and controlling the network, they need to learn of the new alternative controller, which calls for some controller advertisement mechanism. Such a mechanism is out of the scope of this research and is left as future work.
Chapter 8
Conclusion
narios: a scenario in which a single specified link fails in a dedicated topology, and a scenario in which multiple random links fail in a real-world topology. We showed that ResilientFlow recovers the control channel within 300 ms after a single link failure. We also showed that ResilientFlow can restore control channels after multiple, severe link failures, taking a time on the order of seconds. We further extended the CCMM to address the domain-splitting problem, in which a switch has no available path to the controllers, by making the CCMM an emergency alternative controller. We performed experiments and showed the applicability of the CCMM to the domain-splitting problem, with restoration times of approximately 1 to 2 seconds. As future work, we suggested further applications of ResilientFlow to the SDN bootstrapping problem and discussed a better controller selection method in a split-domain environment.
Bibliography
[3] N. Feamster, J. Rexford and E. Zegura, “The Road to SDN,” ACM Queue, vol. 11,
no. 12, p. 20, Dec. 2013.
[4] I. Akyildiz, A. Lee, P. Wang, M. Luo and W. Chou, “A Roadmap for Traffic Engi-
neering in SDN-OpenFlow Networks,” Computer Networks, Elsevier, vol. 71, pp. 1–
30, Oct. 2014.
[6] T. Benson, A. Akella, and D. Maltz, “Unraveling the complexity of network manage-
ment,” Proc. 6th USENIX Symp. Networked Syst. Design Implement., pp. 335-348.,
2009.
[7] B.M. Leiner, V.G. Cerf, D.D. Clark, R.E. Kahn, L. Kleinrock, D.C. Lynch,
J. Postel, L.G. Roberts and S. Wolff, “Brief History of the Inter-
net,” http://www.internetsociety.org/internet/what-internet/
history-internet/brief-history-internet, last accessed at 5 Feb. 2015.
[9] M.F. Bari, R. Boutaba, R. Esteves, L.Z. Granville, M. Podlesny, M.G. Rabbani,
Q. Zhang and M.F. Zhani, “Data Center Network Virtualization: A Survey,” IEEE
Communications Surveys & Tutorials, vol. 15, no. 2, pp. 909–928, Second Quar-
ter, 2013.
[12] Cisco VNI Forecast, “Cisco Visual Networking Index: Forecast and
Methodology, 2013–2018,” Cisco Public Information, http://www.
cisco.com/c/en/us/solutions/collateral/service-provider/
ip-ngn-ip-next-generation-network/white_paper_c11-481360.pdf,
Jun. 2014.
[19] K. Cho, C. Pelsser, R. Bush, and Y. Won, “The Japan earthquake: The Impact on
Traffic and Routing Observed by a Local ISP,” Proc. ACM Special Workshop on
Internet and Disasters (SWID 2011) in Conjunction with International Conference
on emerging Networking EXperiments and Technologies (CoNEXT 2011), pp. 2:1–
2:8, Dec. 2011.
[26] D. Katz and D. Ward, “Bidirectional Forwarding Detection (BFD),” IETF RFC 5880
(Proposed Standard), Jun. 2010.
[27] Cisco Systems, “OSPF Support for Fast Hello Packets,”
http://www.cisco.com/c/en/us/td/docs/ios-xml/ios/iproute_ospf/
configuration/xe-3s/iro-xe-3s-book/iro-fast-hello.pdf, last ac-
cessed at 28 Oct. 2014.
[29] Mininet Team, “Mininet: An Instant Virtual Network on Your Laptop (or Other PC)
- Mininet,” http://mininet.org/, last accessed at 21 Oct. 2014.
[31] The University of Adelaide, “The Internet Topology Zoo,” http://www.
topology-zoo.org/, last accessed at 21 Oct. 2014.
[32] Nmap.org, “Nmap — Free Security Scanner For Network Exploration & Security
Audits,” http://nmap.org/, last accessed at 12 Nov. 2014.
[34] B. Pfaff and B. Davie, “The Open vSwitch Database Management Protocol,” IETF
RFC 7047 (Informational), Dec. 2013.
[35] Open Networking Foundation, “OF-Config 1.2 — OpenFlow Management and Con-
figuration Protocol”, https://www.opennetworking.org/images/stories/
downloads/sdn-resources/onf-specifications/openflow-config/
of-config-1.2.pdf, Jun. 2014.
Publications
English Publications
[A1] T. Watanabe, T. Omizo, T. Akiyama and K. Iida, “ResilientFlow: Deployments of
Distributed Control Channel Maintenance Modules to Recover SDN from Unex-
pected Failures,” to be presented at IEEE International Conference on the Design
of Reliable Communication Networks (DRCN 2015), Mar. 2015 (Accepted).
Japanese Publications
[B1] T. Watanabe, T. Omizo, T. Akiyama and K. Iida, “Self-Healing Mechanism on
Switch-Controller Connections in SDN,” to be presented at IEICE General Confer-
ence 2015, Proceedings of the 2015 IEICE General Conference, BS-2-2, Mar. 2015
(Submitted).
[B2] T. Watanabe, T. Omizo, T. Akiyama and K. Iida, “Design and Evaluation of Self-
Healing Mechanism on Switch-Controller Connections in SDN,” to be presented
at IEICE Technical Committee on Internet Architecture, IEICE Technical Report,
Mar. 2015 (Submitted).