
WHITE PAPER

Measuring Network
Convergence Time

www.ixiacom.com 915-1864-01 Rev. C, January 2014


Table of Contents
Introduction to Convergence
What are Network Outages Caused By?
Layer 1 – Physical Layer Outages
Layer 2 - Data Link Layer Outages
Layer 3 - Network Layer Outages
Next Generation Protocols
Measuring Convergence Time
Traditional Methods
More Complete Characterization
Ixia’s TrueView Convergence
Critical Event Triggers
Conclusion

Introduction to Convergence
Convergence addresses the manner in which networks recover from problems and
network changes. Modern networks anticipate problems by providing alternate, redundant
or standby paths. Failover is the process in which networks automatically detect service
interruptions and adjustments and switch over to an alternate path. Convergence is the
event that happens within the transport network when the re-routed information flow
merges back to a point in the error-free path. Failback, conversely, is the process of
restoring a network back to its original state when the service interruption is fixed.

Figure 1 is a small part of a larger network in which a Client computer has requested
information from a Server. That information is normally forwarded through routers R1,
R2, the Primary Link, and R3. Now, imagine that the Primary Link is lost, possibly through
a physical cut, a failure of R3, a network overload or some other cause. Router R2 will
first notice this loss of connectivity; because it has no other connection to the Client, it
will reflect the break back to router R1. R1 will then look for an alternate path to the Client,
finding one through R4, R5, the Backup Link, and R3. Traffic will then flow over the lower
path. The path is said to converge at R3 and the convergence time is the measurement of
the interval between the first service interruption and the resumption of full data flow at
R3.

Figure 1: Failure Recovery (network diagram showing the Server, the Client, routers R1 to R5, the break on the Primary Link and the Backup Link)

Technically, network route convergence is said to be complete when all affected routes
have switched from the primary path to the secondary path.

Prior to the widespread usage of voice and video, convergence times in the hundred
millisecond range were common and entirely acceptable. Interruption “hiccups” of this
scale are virtually unnoticed in data transfers; TCP and other techniques would take care
of retransmission of any lost packets.

Voice over IP (VoIP) and video applications, however, are now transported over the same
network infrastructure as web, e-mail, and enterprise applications. Each application has
its own requirements for latency, jitter and bandwidth, as shown in Figure 2.

Figure 2: Protocol Bandwidth Requirements

• IPTV: real time, high bandwidth, latency sensitive, high QoE expectation
• Voice: real time, low bandwidth, latency sensitive, high QoE expectation
• High-speed Internet: not real time, variable bandwidth, not latency sensitive, no QoE expectation
• Mobility and Mobile Services: real time, moderate bandwidth, latency sensitive, moderate QoE expectation
• Business: other services + security, high SLA requirements
• Peer-to-peer: not real time, very high bandwidth, not latency sensitive, no QoE expectation
• Gaming: real time, variable bandwidth, latency sensitive, high QoE expectation

In today’s multiplay networks, it’s essential that we consider service interruption time
with respect to traffic type. For example, an interruption of a half-second would be
unnoticed in a web page download or peer-to-peer transfer, annoying in a video download
and unacceptable in a VoIP call. It is for this reason that convergence has received more
attention of late. Edge and core networks are moving from average convergence times
of 100-150 milliseconds to 50 milliseconds, a standard that has been used in SONET
networks for decades. In fact, many network and consumer products, such as set-top
boxes, build this level of buffering into their operation so that these short interruptions go
entirely unnoticed.

Today’s networks carry critical data, requiring continuous availability and a high degree
of reliability. Toward that end, the routers and switches used in these high-reliability
networks implement failover using extensions to classical routing protocols as well as
protocols designed specifically for failover operation.

At layer 2, switching protocols such as STP, RSTP, MSTP and LDP/RSVP-TE provide
mechanisms to redirect traffic if there is a link failure or network change. Layer 3 routing
protocols such as RIP, OSPF, ISIS and BGP include the ability to re-route IP traffic if a
link or network fails. These classical techniques, however, can take seconds to complete
depending on the size and complexity of the network that they’re handling.

Next-generation networks require much faster recovery times to satisfy their high-
availability requirements. A number of protocol extensions and new protocols have been
used to achieve fast failover times. These include:

• Graceful and hitless restart – a router sends a message to a neighbor indicating that it
is restarting its routing process, asking it to continue to forward packets while it does.
• Virtual router redundancy protocol (VRRP) – defines and advertises a “virtual” router
as a gateway, which can be serviced by two or more routers.
• MPLS fast re-route – a local restoration network resiliency mechanism. Each LSP
is protected by a backup path. This mechanism meets the requirements of real-time
applications with recovery times comparable to those of SONET rings, at less than 50
milliseconds.
• Bi-directional forwarding detection – a simple, high-speed HELLO protocol that
provides low-overhead, short-duration (as low as one millisecond) detection of
failures in the path.
• Link OAM/CFM – provides fault detection and isolation on Ethernet links and services.
CFM can achieve service outage detection in as little as 10 milliseconds.
• Protocol timer manipulation – networks use relatively slow HELLO mechanisms,
usually in routing protocols, to detect failures when there is no hardware signaling to
help out. Many of these timers can be adjusted to decrease response time.

Service providers guarantee their services to their enterprise customers in the form of
service level agreements (SLAs) that specify levels of reliability, often 99.999%. As blue-sky
as this sounds, five nines of reliability still allows just over 5 minutes of outage in a year.
This challenging network requirement has led service providers to implement features
that minimize downtime and speed up convergence.

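
The five-nines figure can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name and the 365-day year are assumptions made for the example, not part of any SLA definition.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored (assumption)


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly outage budget, in minutes, for a given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% availability -> {allowed_downtime_minutes(pct):.2f} minutes per year")
```

At 99.999% the budget works out to roughly 5.26 minutes per year, which is why every second spent converging counts against the SLA.
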
Convergence testing is no easy matter from both functional and performance viewpoints.
Many routers and other pieces of networking gear are affected in failover and recovery
scenarios. Each router from each vendor must be separately tested for standards
compliance and functionality, of course, to ensure correct operation and to measure its
inherent convergence time.

The nature of large-scale routed networks requires that subsystems be tested as a whole
in order to measure the true convergence time across the subsystem and to ensure multi-
vendor device compatibility. Finally, end-to-end testing across an entire network must
occur to ensure that the aggregated convergence times satisfy the information consumers’
quality of experience (QoE) requirements.

It is not only routing protocols that are affected by failover conditions. Routers also need to
forward large amounts of traffic, all the while enforcing quality of service (QoS) and other
policies. Information servers and load balancing equipment must deal with the shock of
dropped packets and connections. Convergence testing, therefore, must take place in an
environment of network traffic that realistically models subscriber load.

What are Network Outages Caused By?
There are many possible causes for loss of network connectivity, from the obvious power
failure or line cut to failures caused by device misconfiguration, or software failures and
upgrades.

The following discussion looks into failures caused by or seen at different network stack
levels. It is important to remember that any failure may be seen and acted upon by multiple
layer 1, 2 and 3 agents.

Layer 1 – Physical Layer Outages


Examples of failures that cause outages at the physical layer are:

• Power outages – even a brief power interruption can cause a failover event.
• Line cuts – transient faults can be observed as line cuts.
• Device failure – possibly caused by faulty power supply, bad memory, CPU card
failure or interface card failure.

SONET networks include built-in protection for such outages, but Ethernet networks have
no such inherent capabilities. Although there are many viable options for physical network
connectivity, there is clear momentum toward Ethernet as the choice for next-generation
networks. Whether copper or optical fiber Ethernet links are used, the management
interface to the physical layer device (called a PHY) provides only minimal visibility into
link failures. As far as the network interface is concerned, the link is either up or down.
Higher level protocols, such as link OAM, are required to effectively monitor link condition.

Layer 2 - Data Link Layer Outages
Switches are the most common layer 2 devices. Problems that cause failures at layer 2
can be classified as follows:
• Capacity – MAC address capacities can be reached.
• Environment – overheating can cause devices to misbehave.
• Hardware/software failures – moves, adds and changes from IT network operation
staff members can induce hardware or software failures if not properly planned and
tested.
• Events – authentication issues (for example, with 802.1x), interoperability problems
or exposed misconfigurations.
These failures are seen in a variety of ways, including:

• Flooding or dropping traffic


• Impaired traffic
• Loss of connection
• High latency and slow performance
• Intermittent dropped traffic, causing degraded performance
• Limited network connectivity

At the data link layer, most of the protocols used provide no mechanism for connectivity
problem detection. For example, the ARP protocol is used to map a layer 3 IP address to
a host’s MAC address, but if ARP fails there is no recovery mechanism.

There are several protocols that address failures at layer 2, including spanning tree, link
OAM, service OAM, MPLS/RSVP-TE and BFD. The Ethernet spanning tree protocols, such
as STP, RSTP and MSTP, are used to provide redundancy in switched networks. They
require careful configuration by network administrators to obtain peak performance and
still do not converge rapidly.

A number of new protocols are now being standardized that will provide 50 millisecond
convergence or better. These are described in the section on Next Generation Protocols.

Layer 3 - Network Layer Outages


Routers are the most common layer 3 devices, although many other devices include
routing functionality. Problems that cause failures at layer 3 fall in the following
categories:
• Capacity – exceeding ARP or IP forwarding table size.
• Environment – temperature problems causing CPU overheating or power issues that
cause black or brown outs.
• Hardware/software failures – moves, adds and changes from IT network operation
staff members can induce hardware or software failures.
• Events – a network failure, such as a link down, can expose other problems, such as
mis-configured backup or filtering/re-distribution routing issues.

These failures are seen in a variety of ways, including:

• Routing issues, causing intermittent connections or loss of connectivity to affected
networks
• Adjacency drops causing loss of connectivity
• Route flapping causing degraded services

At the network layer, Internet protocol (IP) is the dominant technology. IP depends on
layers 1 and 2 being “up”. IP itself is connectionless [1]; that is, there is no concept of an
end-to-end connection. Each router has a routing/forwarding table that tells it where
to send packets that it receives, based on the destination IP address in the packet. The routing/
forwarding information is provided by one or more routing protocols, such as RIP, OSPF,
ISIS and BGP.

When there are network issues, the routing protocol operating on the router closest to
the problem will notice the issue and communicate the change to other routers. This
will cause traffic to be re-routed to alternate paths, if they are available. Noticing that a
problem exists can take time. For example, if OSPF stops running on a router, it could take
its neighbor 4 “HELLO” [2] times, typically 40 seconds, to realize that the neighbor is down.
Multi-second outage recovery times were acceptable for data-only networks, but are far
too long by today’s standards. While timers can be adjusted, they still do not achieve the
sub-100-millisecond time required.

[1] The TCP protocol, built on IP, implements a connection-oriented environment.
[2] HELLOs are a typical technique by which a router sends a message to see whether its neighbors are
still available, expecting a return HELLO.
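
The 40-second figure follows directly from the HELLO timing. As a minimal sketch, assuming a 10-second HELLO interval (the value implied by 40 seconds over 4 HELLO times) and a hypothetical tuned interval of 1 second, the detection time can be estimated as follows. The function name is invented for this example.

```python
def hello_detection_time_s(hello_interval_s: float, missed_hellos: int) -> float:
    """Approximate time for a neighbor to be declared down after HELLOs stop."""
    if hello_interval_s <= 0 or missed_hellos < 1:
        raise ValueError("interval must be positive and at least one HELLO must be missed")
    return hello_interval_s * missed_hellos


# Interval implied by the text: 4 missed HELLOs at 10 seconds each.
print(hello_detection_time_s(10, 4))  # -> 40

# Hypothetical tuned timer: 4 missed HELLOs at 1 second each.
print(hello_detection_time_s(1, 4))   # -> 4
```

Even the tuned case remains in the range of seconds, which is consistent with the point that timer adjustment alone does not reach sub-100-millisecond detection.
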

Even where outages are noticed quickly, depending on where the actual outage occurred
and how many hops there are in the network, it may take time for that change to
propagate. For that period of time traffic can be forwarded to what is known as a “black
hole”. When a router attempts to deliver a packet for which it has no entry in its routing/
forwarding table, the packet is dropped.
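
The drop behavior can be illustrated with a toy forwarding table. This is a simplified sketch using Python's standard ipaddress module, not a model of any particular router; the prefixes and next-hop names are hypothetical.

```python
import ipaddress
from typing import Dict, Optional


def next_hop(table: Dict[str, str], destination: str) -> Optional[str]:
    """Longest-prefix match. None means there is no matching entry, so the packet is dropped."""
    address = ipaddress.ip_address(destination)
    best = None
    for prefix, hop in table.items():
        network = ipaddress.ip_network(prefix)
        if address in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, hop)
    return best[1] if best else None


# Hypothetical table before the failure.
table = {"10.1.0.0/16": "R2", "10.2.0.0/16": "R4"}
print(next_hop(table, "10.1.5.9"))  # -> R2

# The route is withdrawn before a replacement has been installed.
del table["10.1.0.0/16"]
print(next_hop(table, "10.1.5.9"))  # -> None (packet dropped)
```

Until the replacement route arrives, every packet for the withdrawn prefix takes the second path through the function and is lost.
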

Next Generation Protocols


Over the last few years, new protocols that dramatically improve the time to detect and
recover from failures at layers 1, 2 and 3 have been standardized and implemented.
Service providers, who need to build and maintain highly available networks, are working
to test and deploy these protocols. With a bank account of only 5 minutes of outage a year,
all but the worst network problems must be invisible to customers.

Among the protocols now available for fast problem detection and recovery are:

• At layer 2:
  - Link OAM
  - Service OAM
  - RSVP-TE fast re-route
• At layer 3:
  - OSPF fast hellos
  - Bi-directional forwarding detection (BFD)
  - Virtual router redundancy protocol (VRRP)

These protocols are designed to detect failures, but usually need to work with another
protocol, typically a routing protocol, to recover from the issue. In the OSPF example
raised in the previous section, using BFD in conjunction with OSPF can reduce the
time to detect a failure to milliseconds.

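
Because detection and recovery are separate steps, the outage seen by traffic is roughly their sum. The sketch below assumes that BFD detection time is the packet interval multiplied by a detection multiplier. The 10-millisecond interval, the multiplier of 3 and the 15-millisecond recovery time are hypothetical values chosen for illustration, and the class and function names are invented for this example.

```python
from dataclasses import dataclass


def bfd_detection_ms(interval_ms: float, detect_multiplier: int) -> float:
    """Detection time when a failure is declared after consecutive missed packets."""
    return interval_ms * detect_multiplier


@dataclass(frozen=True)
class FailoverBudget:
    detection_ms: float  # time for BFD (or routing protocol HELLOs) to notice the failure
    recovery_ms: float   # time for the routing protocol to move traffic to the backup path

    @property
    def total_ms(self) -> float:
        return self.detection_ms + self.recovery_ms

    def meets(self, target_ms: float) -> bool:
        return self.total_ms <= target_ms


with_bfd = FailoverBudget(bfd_detection_ms(10, 3), recovery_ms=15)
hello_only = FailoverBudget(detection_ms=40_000, recovery_ms=15)

print(with_bfd.total_ms, with_bfd.meets(50))      # -> 45 True
print(hello_only.total_ms, hello_only.meets(50))  # -> 40015 False
```
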
Measuring Convergence Time
Convergence has a direct influence on users’ perception of quality. Service disruptions
are quickly noticed, especially when repeated. Consumers in particular have considerable
freedom in the choice of service providers; they can and will switch providers at the drop
of a packet. Measurement of convergence time from the moment of a service disruption to
full service restoration is a key performance indicator for service providers.

Multi-second convergence times can be measured with a stop watch. Convergence times
in the hundreds of milliseconds can be measured with a number of easily implemented
techniques. New protocols, however, require some careful attention to detail.

Traditional Methods
The most common classical method for determining convergence is illustrated in Figure 4.

Figure 4: Traditional Convergence Time Measurement (Test Port 1 sends traffic into the System Under Test, Test Port 2 is attached to the Primary Link where the break occurs, and Test Port 3 is attached to the Backup Link)

The system under test (SUT) consists of one or more routers located under the “cloud”.
They are assumed to be configured so that they use the Backup Link when the Primary
Link is not available. Three test ports are used to exercise the SUT. During the test, data is
transmitted at a constant rate from Test Port 1 and the packets received at Test Port 2 and Test
Port 3 are counted. Under error-free operation, traffic travels from Test Port 1 to Test Port
2. A line break is simulated at Test Port 2. Within the SUT, the break is noticed and traffic
is re-routed to Test Port 3. The number of packets lost during the transition is a measure
of the convergence time.

For example, if the fixed transmission rate from Test Port 1 is 1,000 frames per second and
2,500 packets were lost, then the convergence time could be calculated at 2.5 seconds.
This type of test is easy to program; it only needs to be run for a period longer than the
expected convergence time. This measurement is simplistic in nature, characterizing
the idealized traffic rates shown in Figure 5. This would only be the case in the simplest of
networks, where there is only one route that needs to be withdrawn and re-advertised.
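
The loss-based calculation above can be written out as a short sketch. The function name and the sent and received counts are assumptions for illustration; only the 1,000 frames per second rate and the 2,500 lost packets come from the example in the text.

```python
def convergence_time_from_loss(frames_sent: int, frames_received: int, tx_rate_fps: float) -> float:
    """Estimate convergence time in seconds as lost frames divided by the transmit rate."""
    if tx_rate_fps <= 0:
        raise ValueError("tx_rate_fps must be positive")
    frames_lost = frames_sent - frames_received
    return frames_lost / tx_rate_fps


# Hypothetical 10-second run at 1,000 frames per second.
sent_port1 = 10_000
received_port2 = 3_000   # received before the break
received_port3 = 4_500   # received after re-routing

print(convergence_time_from_loss(sent_port1, received_port2 + received_port3, 1_000))  # -> 2.5
```

The result is only as good as the assumption that the transmit rate is constant and that all loss is due to the failover.
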

Figure 5: Simple Convergence Characterization (receive rate versus time, with the failure, the switchover and the convergence time marked)

More Complete Characterization
Where many routes have to be moved from Test Port 2 to Test Port 3, the actual
switchover will be gradual, as shown in Figure 6.

Figure 6: More Realistic Convergence Characterization (receive rate versus time, with the failure, the switchover and the convergence time marked)

As each route is switched over, traffic for that route starts appearing on the Backup
Link. It is not until the last route has been switched that convergence can be completed.
The time at which some traffic first flows can be significant. In a larger network, this might
correspond to a route that carries most of the network’s traffic. By the way, it should
be evident that the test traffic sent from Test Port 1 must use all of the address ranges
supported by the SUT.

In order to characterize the gradual nature of convergence, some means of sampling is
required. Two techniques are in common use:

• High-speed sampling – the receive rate on Test Port 2 is measured as rapidly as
possible until it matches the transmit rate. The rate of this measurement is limited by
the speed of the test equipment, which is normally computer controlled. Its accuracy
is determined by the operation of the test application – typically in the range of 5 to 10
milliseconds.
• Capture buffer – the data received on Test Port 2 is captured in a buffer. The time-
stamped data is post-processed to reveal the changing receive rate. This technique
can reveal great detail, but is limited by the size of the capture buffer. In practice, this
often results in the same accuracy as high-speed sampling.
In addition to limited resolution and accuracy, neither of these techniques accurately
correlates with the event that causes the failover.
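
The post-processing step of the capture buffer technique can be sketched as follows. This is a generic illustration, not the method of any specific test tool. It assumes that arrival timestamps are available as integer microseconds, and the function name and the 5-millisecond bin width are chosen for the example.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def receive_rate_series(timestamps_us: Sequence[int], bin_width_us: int) -> List[Tuple[int, float]]:
    """Bin packet arrival times and return (bin start in microseconds, packets per second)."""
    if bin_width_us <= 0:
        raise ValueError("bin_width_us must be positive")
    if not timestamps_us:
        return []
    counts = Counter(t // bin_width_us for t in timestamps_us)
    first_bin, last_bin = min(counts), max(counts)
    return [
        (b * bin_width_us, counts[b] * 1_000_000 / bin_width_us)
        for b in range(first_bin, last_bin + 1)
    ]


# Hypothetical capture: 1,000 packets per second, a 20 ms gap, then traffic resumes.
arrivals = list(range(0, 5_000, 1_000)) + list(range(25_000, 30_000, 1_000))
for start_us, pps in receive_rate_series(arrivals, 5_000):
    print(start_us, pps)
```

The resolution of the resulting curve is bounded by the bin width, and the curve still says nothing about when the triggering event occurred.
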

Ixia’s TrueView Convergence
A more advanced technique is needed that correlates the beginning of an event, such
as link down or neighbor failure, with the moment that the convergence resulted in an
acceptable level of service. Let’s illustrate this requirement with a more complex example,
shown in Figure 7.

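Figure 7: Complex Convergence Test Case (Test Port 1 feeds the ingress PE Router; an upper row of P Routers leads to the egress PE Router and the Primary Link at Test Port 2, where the break occurs and from which withdrawals propagate back; a lower row of P Routers leads to a PE Router and the Backup Link at Test Port 3)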

A failure at Test Port 2 is noticed at the egress provider edge router (PE Router). That
router removes its routing table entries for the emulated network behind Test Port 2 and
sends a set of withdrawal messages to its neighbor router. Each router in turn sends
withdrawal messages to its neighbors until they reach the ingress PE Router. That
router switches traffic to the alternate, lower path.

In order to accurately measure the convergence time from the initial event to the
resumption of useful traffic, Ixia has developed a patent-pending technology called
TrueView Convergence. Incorporated into Ixia’s flagship network infrastructure test
application, IxNetwork, TrueView provides the most comprehensive convergence test
capability in the industry.

In order to understand how TrueView works, it’s important to look at the processing that
the ingress PE Router performs when it receives withdrawal messages from its neighbor.
As shown in Figure 8, the ingress PE Router receives a sequence of Route Withdrawal
messages from its neighbor. As each withdrawal message is processed, a new route
advertisement is sent to its P Router neighbor in the lower row and traffic for that route is
immediately forwarded. This demonstrates that failover is not a singular event, but rather a
gradual process.

Figure 8: Route Convergence Process (route withdrawals arriving at the ingress PE Router and the corresponding route advertisements it sends)

TrueView’s operation is described in Figure 9.

Figure 9: TrueView Operation (rate versus time for the Rx flows on Test Port 2 and Test Port 3, showing tEvent, BTT, ATT, the Rx Threshold, and the CP/DP, DP, ramp-down and ramp-up convergence times)

The figure shows the receive flow rate at Test Port 2 and Test Port 3 during a convergence
time measurement. The unique TrueView measurement is the CP/DP-convergence time.
It expresses the time that the SUT took to fully converge, starting with the event that
started the convergence (tEvent – link down in this case) and ending when a specified rate
of traffic (Rx Threshold) was received over the Backup Link at Test Port 3. This, and the
other key convergence time metrics, are expressed in Table 1.

Table 1: TrueView Measurements

Label | Description
CP/DP convergence time | The total convergence time, from the event that caused the switchover (at tEvent) until an acceptable amount of traffic was received at time ATT, when the rate crosses above the Rx Threshold.
DP convergence time | The convergence time measured from the data plane perspective only. It measures the time from BTT, when the rate on Test Port 2 crosses below the Rx Threshold, until an acceptable amount of traffic was received at time ATT, when the rate crosses above the Rx Threshold.
Ramp-down convergence time | The time required for the SUT to stop forwarding traffic, measured from a data plane perspective only. It measures the time from BTT, when the rate on Test Port 2 crosses below the Rx Threshold, until the rate drops to zero.
Ramp-up convergence time | The time required for the SUT to resume forwarding traffic, measured from a data plane perspective only. It measures the time from the start of traffic until the rate on Test Port 3 crosses above the Rx Threshold.

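
The definitions in Table 1 can be applied to sampled receive rates with a short sketch. This is not Ixia's implementation. It assumes rate samples taken at known times on each port, treats "start of traffic" as the first non-zero sample on Test Port 3, and uses hypothetical names and sample values throughout.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (time in ms, receive rate in packets per second)


def first_time(samples: Sequence[Sample], condition: Callable[[float], bool]) -> Optional[float]:
    """Return the time of the first sample whose rate satisfies the condition."""
    for time_ms, rate in samples:
        if condition(rate):
            return time_ms
    return None


def convergence_metrics(t_event_ms: float, port2: Sequence[Sample],
                        port3: Sequence[Sample], rx_threshold: float) -> Dict[str, float]:
    btt = first_time(port2, lambda r: r < rx_threshold)   # Test Port 2 falls below threshold
    att = first_time(port3, lambda r: r > rx_threshold)   # Test Port 3 rises above threshold
    started = first_time(port3, lambda r: r > 0)          # first traffic on Test Port 3 (assumption)
    if btt is None or att is None or started is None:
        raise ValueError("samples do not contain a complete failover")
    drained = first_time([s for s in port2 if s[0] >= btt], lambda r: r == 0)
    if drained is None:
        raise ValueError("Test Port 2 rate never reached zero")
    return {
        "cp_dp_ms": att - t_event_ms,
        "dp_ms": att - btt,
        "ramp_down_ms": drained - btt,
        "ramp_up_ms": att - started,
    }


# Hypothetical samples every 10 ms; link-down event at t = 12 ms; threshold 950 pps.
port2 = [(0, 1000), (10, 1000), (20, 600), (30, 200), (40, 0), (50, 0), (60, 0)]
port3 = [(0, 0), (10, 0), (20, 0), (30, 0), (40, 300), (50, 700), (60, 1000)]
print(convergence_metrics(12, port2, port3, rx_threshold=950))
```

With these sample values the CP/DP time is 48 ms and the DP time is 40 ms; the difference is the time between the event itself and its first visible effect on the data plane.
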
TrueView uses fast monitoring of receive rates, coordinated with the timestamps of the
event, the last packet received on Test Port 2 and the first packet received on Test Port 3,
to make highly accurate measurements. All TrueView measurements are accurate to within 1
millisecond.

As discussed, routers need to handle large numbers of simultaneous withdrawals and
advertisements during a convergence event. Routers use sophisticated algorithms to
handle these operations. For example, if a router’s algorithm is designed to favor /8
networks, then it would be important to measure convergence for just those routes.
TrueView is designed to provide those measurements.

In addition to providing convergence time measurements for the aggregated traffic
received on Test Ports 2 and 3, TrueView provides the same information for individual
routes or arbitrary groups of routes. TrueView can be used to ensure that preferred route
ranges receive preferential treatment that results in lowered convergence times.
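
Once per-route convergence times are available, grouping them is straightforward. The sketch below is a generic illustration that groups hypothetical per-route results by prefix length using Python's standard ipaddress module; it does not represent the IxNetwork data model, and the function name and values are invented for the example.

```python
import ipaddress
from collections import defaultdict
from typing import Dict, List


def worst_case_by_prefix_length(per_route_ms: Dict[str, float]) -> Dict[int, float]:
    """Return the slowest convergence time, in ms, for each prefix length."""
    groups: Dict[int, List[float]] = defaultdict(list)
    for prefix, convergence_ms in per_route_ms.items():
        groups[ipaddress.ip_network(prefix).prefixlen].append(convergence_ms)
    return {length: max(times) for length, times in sorted(groups.items())}


# Hypothetical per-route convergence times in milliseconds.
results = {
    "10.0.0.0/8": 12.0,
    "11.0.0.0/8": 15.0,
    "192.168.1.0/24": 48.0,
    "192.168.2.0/24": 41.0,
}
print(worst_case_by_prefix_length(results))  # -> {8: 15.0, 24: 48.0}

# Aggregate convergence completes only when the last route has switched.
print(max(results.values()))  # -> 48.0
```

A result like this would suggest that the /8 routes are being favored, while the aggregate figure is set by the slowest /24 route.
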

Critical Event Triggers
Much of the power of TrueView results from its ability to accurately correlate its
measurements with the data or control plane event that started the convergence. TrueView
uses a large set of events that are triggered in IxNetwork. These are detailed in Table 2.

Table 2: Convergence Triggering Events

Protocol | Events
Data Plane | Link down, link up
BGP, BGP+ | Enable/disable IPv4/6, MPLS or VPN route range
IGMP | Enable or disable group range
ISISv4, ISISv6 | Enable or disable IPv4/6 L3 route range
Link-OAM | Enable or disable a link-OAM event: remote loopback, critical event, link fault, dying gasp
MLD | Enable or disable group range
MSTP | Parameter changes in the STP bridge ID or multiple spanning tree instance (MSTI), including: MSTP-CIST regional root (priority, MAC address, root cost); MSTP-CIST external root (priority, MAC address, root cost); STP interfaces (interface cost); MSTI field changes (priority, MSTI port priority, MAC address, root cost)
OSPFv2, OSPFv3 | Enable or disable IPv4/6 route range
PIM-SM/SSM-v4, SSM-v6 | Enable or disable group range ((*, G) and (S, G))
RPVST+ | CST parameter changes: root priority, VLAN port priority, root MAC address
RSTP | Changes in the root cost or root ID priority, system ID or MAC address

Conclusion
Convergence time is one of the key performance indicators in modern high-availability
networks that carry multiplay traffic. Classical techniques for measuring convergence time
in all-data networks are no longer adequate for measuring the quick, sub-50-millisecond,
convergence times in modern networks.

Ixia has developed a new TrueView technology to provide millisecond-resolution
measurements, correlating network and routing events with critical thresholds in traffic
switchover. TrueView not only measures aggregated convergence time, but also provides
per-route measurements that make it possible to verify the functionality and measure the
performance of routing protocols.


Ixia Worldwide Headquarters
26601 Agoura Rd.
Calabasas, CA 91302
(Toll Free North America) 1.877.367.4942
(Outside North America) +1.818.871.1800
(Fax) 818.871.1805
www.ixiacom.com

Ixia European Headquarters
Ixia Technologies Europe Ltd
Clarion House, Norreys Drive
Maidenhead SL6 4FL
United Kingdom
Sales +44 1628 408750
(Fax) +44 1628 639916

Ixia Asia Pacific Headquarters
21 Serangoon North Avenue 5
#04-01
Singapore 554864
Sales +65.6332.0125
Fax +65.6332.0127
