
The Internet Outage Survival Guide:
Unpacking common outage types and
how they impact your business
Table of Contents
Introduction: Business Runs on the Internet

Internet Routing Outages
BGP Route Hijack
BGP Route Leak
BGP Route Flap

Reachability Outages
DDoS Attack
DNS Hijacks & Cache Poisoning

Internet & Cloud Network Outages
ISP Infrastructure
Cloud Provider Network
Natural Disaster
Operator Error
Internet Sovereignty

Move from Chaos to Clarity

Conclusion

Business Runs on the Internet
When Internet outages happen, they can be extremely disruptive to your business.
By preventing users from reaching your applications and services, outages can cause
major revenue and reputation damage. While application delivery is dependent on
many Internet Service Providers (ISPs), it also increasingly relies on a large and
complex ecosystem of Internet-facing services—such as CDN, DNS, DDoS mitigation
and public cloud. These services work together to provide exceptional digital
experiences to users, and even brief disruptions can have a significant impact.

At the same time, enterprises are increasingly relying on Internet transport to connect
their sites and reach business-critical applications and services. Gone are the days in
which applications are solely hosted in private data centers and office locations are
connected primarily by MPLS circuits. The Internet is replacing or supplementing
services like MPLS as enterprises embrace SD-WAN technologies. As a result, the
Internet is now effectively the enterprise backbone, which as a “best-effort” transport
can have significant yet unforeseen consequences for businesses.

For many enterprises, however, the Internet is a “black box,” and when disruptive
events occur, IT and digital operations teams are often unable to identify the source
or respond effectively. While the interdependent, fragile nature of the Internet
means that outages are inevitable, having visibility into these outages can significantly
reduce the time to escalate and resolve these incidents, as well as enable you to
better communicate with your customers.

In this eBook, we will discuss the common causes of large-scale outages along with
real-world examples that we have analyzed using the ThousandEyes platform. The
key learnings uncovered in this eBook should serve as the basis for proactive outage
resilience and readiness plans that today’s enterprises require.

The Internet Outage Survival Guide 1


Internet Routing Outages / BGP Route Hijack

What’s a BGP Route Hijack?

The Border Gateway Protocol (BGP) is a critical component of Internet transit. When a user
wants to access a website or service, their traffic flows over the Internet from “Point A” to
“Point B” through a chain of service providers and third-party vendors that route the traffic
to its intended destination. At its core, this order of operations is built around a
foundational trust between the entities involved in this transit.

In a BGP route hijack scenario, a malicious actor takes advantage of this trust-based system
by diverting traffic from an intended destination to an illegitimate one. BGP hijacking is
performed by configuring an autonomous system (AS) edge router to announce ownership of
prefixes that have not been assigned to it. By broadcasting false prefix announcements, the
malicious actor may compromise the Routing Information Base (RIB) of its peers (other ASes),
and the false routes could then be propagated across the Internet. If the malicious
announcement is more specific than the legitimate one or claims to offer a shorter path, the
traffic may be directed to the hijacker.

That’s why identifying BGP route hijacking as soon as possible is critical for the security of
your network.

EXAMPLE: BGP Hijack Cripples Amazon’s Route 53 DNS Service

How does one steal crypto coins? By hacking DNS and BGP, the two cornerstone protocols
governing the Internet. On April 24, 2018, the popular crypto wallet app, MyEtherWallet, was
attacked by means of a BGP hijack on Amazon’s Route 53 DNS service. This example is yet
another reminder of how the trusting nature of BGP routing can hit you in the wallet.

The Cause
The cause was a BGP hijack in which attackers took control of a small ISP, eNet (also known
as XLHost), which was connected to the Equinix fabric. This gave the attackers access to a
large number of ISPs, two of which propagated their spoofed prefixes across the Internet. As a
result, traffic meant for Amazon’s DNS servers flowed into the XLHost network, and sitting in
the XLHost data center was a fake Route 53 DNS server that selectively answered queries for
MyEtherWallet.com, blackholing all other requests. The DNS responses provided by this server
directed users to a website masquerading as MyEtherWallet, enabling the hijackers to capture
the credentials of some users and gain access to their accounts.

The Impact
Amazon Route 53 was able to detect and resolve this issue within a couple of hours and
restore its DNS service well before any major cascading impacts occurred. However, some
users of MyEtherWallet were not so lucky. Reports indicate that over $150,000 in Ethereum
was stolen as part of this attack. Even though MyEtherWallet was the target of the attack, the
collateral victims were many customers of Amazon’s Route 53 DNS service, like Instagram and
CNN. These companies’ sites were not reachable during the attack because many users were
not able to resolve their domains through DNS.
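
The point that a more-specific announcement can attract traffic follows from longest-prefix matching: when several routes cover a destination, the most specific one is preferred. The following is a minimal sketch of that selection rule, not a router implementation. The function name, the prefixes (taken from documentation address space) and the next-hop labels are all hypothetical.

```python
import ipaddress


def select_route(destination, routes):
    """Return the (prefix, next_hop) pair with the longest matching prefix, or None."""
    addr = ipaddress.ip_address(destination)
    best = None
    for prefix, next_hop in routes:
        network = ipaddress.ip_network(prefix)
        # Keep the candidate only if it covers the address and is more specific.
        if addr in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, next_hop)
    return None if best is None else (str(best[0]), best[1])


# Hypothetical routing table built from documentation address space.
routes = [
    ("203.0.113.0/24", "legitimate-origin"),  # legitimate announcement
    ("203.0.113.0/25", "hijacker"),           # more-specific, illegitimate announcement
]

print(select_route("203.0.113.10", routes))   # covered by both; the /25 wins
print(select_route("203.0.113.200", routes))  # only the /24 covers this address
```

In this sketch, only the half of the address space covered by the more-specific /25 is diverted, which is one reason a hijack can affect some users and not others.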

Key Learnings
BGP-related incidents have been on the rise.
Methods of trust like Route Origin
Authorizations (ROAs) have been around
for a while, but they have failed to catch on
universally. As a result, the Internet is still
vulnerable to these incidents.

If you offer a digital service online, you need
to monitor your prefixes to ensure that they
are reachable and not being illegitimately
advertised (whether maliciously or
inadvertently). You should also monitor
your authoritative DNS nameservers for
reachability and to ensure they are
providing correct IP mappings.
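
As an illustration of how a ROA can be used to judge an announcement, the sketch below classifies a route as valid, invalid or not-found, loosely following the route origin validation model standardized in RFC 6811. It is a simplified assumption-laden example, not a production validator: the `Roa` type, the function name, the prefixes and the AS numbers (from the documentation range) are hypothetical.

```python
import ipaddress
from typing import NamedTuple


class Roa(NamedTuple):
    prefix: str      # prefix the holder has authorized
    max_length: int  # longest prefix length that may be announced
    origin_as: int   # AS number authorized to originate the prefix


def validate_origin(announced_prefix, origin_as, roas):
    """Classify an announcement as 'valid', 'invalid' or 'not-found'."""
    route = ipaddress.ip_network(announced_prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # A ROA is relevant only if it covers the announced prefix.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        if roa.origin_as == origin_as and route.prefixlen <= roa.max_length:
            return "valid"
    # Covered by some ROA but matched by none: treat as invalid.
    return "invalid" if covered else "not-found"


# Hypothetical ROA: AS 64496 may originate 203.0.113.0/24 and nothing more specific.
roas = [Roa("203.0.113.0/24", 24, 64496)]

print(validate_origin("203.0.113.0/24", 64496, roas))
print(validate_origin("203.0.113.0/25", 64511, roas))
```

A monitoring job could run a check like this against the announcements it observes for your prefixes and alert on any result other than valid.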

Internet Routing Outages / BGP Route Leak

What is a BGP Route Leak?

The same trust-based foundation that makes BGP vulnerable to hijacking also exposes it to
route changes that redirect traffic through an unintended path. During a BGP route leak, a
third party may accidentally share a route with peers, causing traffic to be directed towards
an unintended entity. Unlike hijacks, however, BGP route leaks are often benign from the
perspective of service disruption except where the route change steers traffic to an ISP or
a destination that will blackhole traffic.

While a majority of BGP route leaks are the result of accidental misconfigurations, a leak can
also be caused intentionally for the purposes of eavesdropping or traffic analysis. Having
comprehensive BGP-layer visibility can help network operators identify the upstream ISPs that
most likely propagated the bad routes advertised during a BGP route leak. Route leaks can be
identified by utilizing network monitoring tools that enable alerting on BGP route changes.

EXAMPLE 1: BGP Route Leak Results in GCP Outage

On November 12, 2018, a BGP route leak at an ISP based in Nigeria leaked traffic into China
Telecom, resulting in an outage on Google Cloud Platform (GCP) that affected access to
G Suite, Google Search, and Google Analytics. This incident underscores one of the
fundamental weaknesses in the fabric of the Internet: BGP was designed to be a chain of trust
and does not account for the complex commercial and geopolitical relationships that exist
between ISPs and nations.

The Cause
A BGP route leak at MainOne, an ISP based in Nigeria, caused traffic to be re-routed,
slamming into the Great Firewall and terminating at China Telecom’s edge router.

The Impact
The outage affected access to several Google services,
including G Suite, Google Search as well as Google
Analytics. Google traffic was funneled into the hands
of ISPs in countries with a long history of Internet
surveillance. MainOne took 74 minutes to either notice
or be notified of the issue and fix it, and it took about
three-quarters of an hour more for services to come
back up.

EXAMPLE 2: BGP Route Leak Impacts Cloudflare’s CDN

On June 24, 2019, for nearly two hours, a significant BGP routing error impacted users trying
to access services fronted by CDN provider Cloudflare, including gaming platforms Discord and
Nintendo Life. This incident is yet another example of how incredibly easy it is to
dramatically alter the service delivery landscape on the Internet.

The Cause
A significant BGP route leak affected a variety of prefixes from multiple providers. DQE,
a transit provider, appears to have been the original source of the route leak, which was
propagated through Allegheny Technologies, a customer of both DQE and Verizon.
Unfortunately, Verizon further propagated the route leak, magnifying the impact.

The Impact
Sites served through the Cloudflare CDN were impacted for nearly two hours. This major
Internet disruption affected about 15% of Cloudflare’s global traffic and impacted services
like Discord, Facebook and Reddit. The route leak also affected access to some AWS services.

Key Learnings
BGP route leaks are not uncommon on the Internet. When you rely on the Internet, an
ecosystem that is deeply interconnected and vulnerable, you need to understand how
it works and expect that a glitch in one service provider can have cascading effects on
another. Route leaks from smaller networks are often propagated by large providers,
even though there are common filtering techniques available to reduce the impact of
these events.

The unfortunate reality is that business risks associated with BGP route leaks and other
Internet flaws are greater given the modern enterprise and service delivery landscape. While
the ISP community recognizes the scope of BGP routing issues, and solutions such as ROA
and IRR filtering exist, none of them are silver bullets, and implementing them incorrectly
puts the reachability of your services at risk. Enterprises need to continuously monitor their
BGP routes and detect incidents quickly in order to mitigate any service impacts on their
business.
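
One way to picture continuous route monitoring is a check that compares each observed AS path for your prefix against the origin and upstream providers you expect. The sketch below is only an illustration under stated assumptions: the function name, the status labels and the AS numbers (documentation range) are hypothetical, and AS paths are assumed to be listed from the observer toward the origin.

```python
from itertools import groupby


def classify_path(as_path, origin_as, expected_upstreams):
    """Classify an observed AS path as 'ok', 'unexpected-origin' or 'unexpected-upstream'."""
    # Collapse AS-path prepending (the same AS repeated consecutively).
    path = [asn for asn, _ in groupby(as_path)]
    if not path or path[-1] != origin_as:
        return "unexpected-origin"
    # The AS adjacent to the origin should be one of the known upstream providers.
    if len(path) >= 2 and path[-2] not in expected_upstreams:
        return "unexpected-upstream"
    return "ok"


ORIGIN = 64496       # hypothetical AS that legitimately originates the prefix
UPSTREAMS = {64497}  # hypothetical transit providers for that AS

observed = [
    [64500, 64497, 64496, 64496],  # normal path with prepending
    [64500, 64510, 64496],         # reaches the origin through an unexpected AS
    [64500, 64497, 64511],         # originated by a different AS
]
for path in observed:
    print(path, classify_path(path, ORIGIN, UPSTREAMS))
```

An unexpected upstream is what a leak through a peer or customer network tends to look like from outside, while an unexpected origin is closer to the hijack case described earlier.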

Internet Routing Outages / BGP Route Flap

What is a BGP Route Flap?

Route flapping occurs when routes alternate or are advertised and then withdrawn in rapid
sequence, often resulting from equipment or configuration errors. Flapping often causes
packet loss and results in performance degradation for traffic traversing the affected
networks.

EXAMPLE: Apple Services Impacted on Fourth of July

On July 4, 2019, starting just before 9 am PDT, ThousandEyes tests detected that users
connecting to http://apple.com and Apple services, such as Apple Pay, began experiencing
significant packet loss, which would have prevented many of them from successfully
connecting to those services.

The Cause
The packet loss appears to have been caused by a BGP route flap issue, where a routing
announcement is made and withdrawn in quick succession, often repeatedly.

The Impact
The incident impacting Apple Pay and http://apple.com took place over more than 90 minutes,
resolving around 10:30 am. However, it appears that additional services continued to
experience issues for some time after. While Apple services are certainly important for many
Internet users, the fact that the incident occurred early on a holiday seems to have
prevented the incident from sparking more than a few user complaints.
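
Because a flap is defined by announcements and withdrawals in quick succession, it can be detected by counting state changes per prefix inside a sliding time window. The sketch below assumes a hypothetical event feed of (timestamp, prefix, kind) tuples sorted by time. The window length and threshold are illustrative defaults, not recommended values.

```python
from collections import deque


def detect_flaps(events, window_seconds=300, threshold=4):
    """Return the prefixes whose state changed at least `threshold` times within a window.

    `events` is an iterable of (timestamp_seconds, prefix, kind) tuples sorted by
    timestamp, where kind is "announce" or "withdraw".
    """
    last_state = {}
    changes = {}
    flapping = set()
    for timestamp, prefix, kind in events:
        if last_state.get(prefix) == kind:
            continue  # repeated update with no state change
        last_state[prefix] = kind
        recent = changes.setdefault(prefix, deque())
        recent.append(timestamp)
        # Drop state changes that fell out of the sliding window.
        while timestamp - recent[0] > window_seconds:
            recent.popleft()
        if len(recent) >= threshold:
            flapping.add(prefix)
    return flapping


# Hypothetical feed: one prefix flaps, the other is announced once and stays up.
events = [
    (0, "203.0.113.0/24", "announce"),
    (0, "198.51.100.0/24", "announce"),
    (10, "203.0.113.0/24", "withdraw"),
    (20, "203.0.113.0/24", "announce"),
    (30, "203.0.113.0/24", "withdraw"),
]
print(detect_flaps(events))
```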

Key Learnings
The lesson from this incident is that
sometimes even significant outages may go
unnoticed (or conversely create significant
uproar) simply based on their timing and
context. In this case, the outage coincided
with the Fourth of July holiday in the US, and
it’s likely that fewer people were trying to
access these services at that time. However,
outages can happen at any time, so it is
critical to have visibility into your Internet
routing to triage situations like this and
resolve the issues quickly to mitigate
further impacts.

Reachability Outages / DDoS Attack

What is a DDoS attack?

A Distributed Denial-of-Service (DDoS) attack is a deliberate attempt to take a service
offline or deny legitimate users access to a service by overwhelming it with a large number of
requests simultaneously. While DDoS attacks can overwhelm their target’s web infrastructure,
they can also create congestion within service provider networks that can lead to packet
loss. DDoS attacks occur for a variety of reasons, including hacktivism, commercial
competition and geopolitical goals.

EXAMPLE 1: DDoS Attack Takes Aim at Wikipedia

On September 6, 2019, access to Wikipedia sites from around the world was disrupted for
close to nine hours. As a result, users across many regions were unable to establish an
Internet connection for ongoing communication with Wikipedia servers. This outage reminds
us that DDoS attacks are a sad fact of life in doing digital business, and taking proactive
steps to be prepared makes an awful lot of sense.

The Cause
A massive and sustained DDoS attack took aim at Wikipedia sites.

The Impact
During the course of the incident, ThousandEyes saw a
significant drop in HTTP server availability from around
the world, as well as a dramatic increase in HTTP response
times. ThousandEyes also measured packet loss of up to
60% from our global vantage points, a condition that
would have further prevented access to Wikipedia sites.

EXAMPLE 2: Major DDoS Attacks Test GitHub’s Mitigation Process

On February 28, 2018, GitHub was a victim of two powerful DDoS
attacks that impacted its global user base of 20M. This was one of
the largest recorded DDoS attacks, with attack traffic peaking at
1.3 Tbps. As DDoS attacks become more frequent and ever more
powerful, this event serves as a reminder that effective DDoS
mitigation requires quick action and visibility to measure
its effectiveness.

The Cause
Two very powerful DDoS attacks pounded on GitHub sites, testing
the limits of its well-executed mitigation process.

The Impact
The first attack was, at the time, the most powerful DDoS attack
recorded, with 1.3 Tbps of attack traffic. However, within 24 hours,
GitHub was struck with another DDoS attack, which appeared to be
more severe in its impact. On the second day, GitHub’s availability
dropped by 61%, compared to a 26% drop the day before.

Key Learnings
While DDoS events are an unfortunate reality
of operating on the Internet, organizations
should have visibility into the scope, impact
and behavior of these events and be able
to validate that DDoS mitigation steps
are effective.

While the GitHub attack, in particular, had
minimal service interruption and showcased
a well-executed mitigation process, not all
DDoS attacks are created equal. You should
get a view of how your mitigation service is
working and how your user experience is
holding up under attack.
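
Validating mitigation comes down to comparing the same measurements before and after it is activated. The sketch below computes two of the metrics cited in the examples above, availability and packet loss, from hypothetical probe results. The function names, the sample data and the five-point improvement threshold are assumptions for illustration only.

```python
def availability(results):
    """Percentage of probes that received a successful response."""
    if not results:
        raise ValueError("no probe results")
    return 100.0 * sum(results) / len(results)


def packet_loss(sent, received):
    """Percentage of packets that were sent but never received."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100.0 * (sent - received) / sent


def mitigation_helped(before, after, min_gain=5.0):
    """True if availability improved by at least `min_gain` percentage points."""
    return availability(after) - availability(before) >= min_gain


# Hypothetical probe outcomes (True = successful HTTP response).
during_attack = [True, False, False, True]
after_mitigation = [True, True, True, False]

print(availability(during_attack), availability(after_mitigation))
print(packet_loss(sent=50, received=20))
print(mitigation_helped(during_attack, after_mitigation))
```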

Reachability Outages / DNS Hijacking & Cache Poisoning

What is a DNS Hijack?

DNS records, and web traffic destined for those domains, are typically compromised in one of
two ways: hijacking or cache poisoning. In both cases, redirected web traffic can be used to
execute denial of service attacks, install malware and phish for passwords.

Hijacking involves compromising the DNS server or registrar itself, typically through a
phishing attack or a compromised password. The hijacker gains administrative access to the
DNS account in order to change the records directly. Typically, a hijacker will change the
name server (NS) record to point future DNS queries to a name server under their control. A
hijacker may also directly change address records themselves.

What is Cache Poisoning?
Cache poisoning occurs on DNS resolvers distributed throughout networks that comprise the
Internet. An attacker inserts a forged DNS record into a DNS resolver, using a variety of
tactics that typically involve racing the legitimate server to provide a seemingly valid
response or brute-forcing less secure DNS configurations. The poisoned records, again
typically NS records, direct future DNS queries to the attacker’s name server, which then
serves up an authoritative record of the attacker’s choice.

EXAMPLE: Traffic to New York Times Blocked by the Great Firewall

China has long been known for its online content filtering and censorship, which deploys a
number of sophisticated techniques to control what digital content its citizens have access
to. DNS hijacking and cache poisoning are just two of the many tools the country uses. One
website, in particular, where we can clearly see the detrimental effects of DNS hijacking is
the New York Times.

The Cause
DNS Server tests from ThousandEyes agents in China to all name servers serving up the
A record for nytimes.com show that while US agents are able to look up the correct IP
address for the name server and reach the name server in Newark with no issue, the
China agents fail miserably, returning the DNS lookups for services like Facebook and
Dropbox, which are known to be blocked in China.

The Impact
Trace tests terminate within Chinese ISPs that practice IP blocking, which entails
blackholing traffic destined for blacklisted IP addresses like the ones we see below.
Any users within the Great Firewall attempting to access the New York Times would
be unable to do so.

Key Learnings
Whether your traffic is in China or anywhere
else in the world, DNS hijacking can be
incredibly disruptive to your business. It’s
important to remember that application
delivery completely relies on the availability
of accurate DNS records. To keep abreast of
any changes that affect important records,
use DNS Server and Trace tests to
continuously monitor the state of your DNS
records, including their availability, accuracy
and resolution time.
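
The accuracy part of that monitoring can be reduced to comparing the answers each vantage point receives with the addresses you expect to serve. The sketch below shows only that comparison step, with the lookups themselves left out. The function name, the status labels, the vantage-point names and the addresses (documentation ranges) are hypothetical.

```python
def check_dns_answers(expected_ips, answers_by_vantage):
    """Map each vantage point to 'ok', 'mismatch' or 'no-answer'."""
    expected = set(expected_ips)
    report = {}
    for vantage, answers in answers_by_vantage.items():
        if not answers:
            report[vantage] = "no-answer"  # resolution failed or timed out
        elif set(answers) <= expected:
            report[vantage] = "ok"         # every returned address is expected
        else:
            report[vantage] = "mismatch"   # at least one unexpected address
    return report


# Hypothetical A-record answers collected from three vantage points.
expected = ["203.0.113.10", "203.0.113.11"]
answers = {
    "vantage-a": ["203.0.113.10"],
    "vantage-b": ["198.51.100.77"],  # unexpected address, worth investigating
    "vantage-c": [],
}
print(check_dns_answers(expected, answers))
```

A mismatch seen from only some vantage points, as in the example above, suggests a problem localized to particular resolvers or networks. A mismatch seen everywhere points closer to the authoritative records themselves.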

Internet & Cloud Service Outage / ISP Infrastructure

What is an ISP infrastructure outage?

Internet Service Providers (ISPs) provide transport of Internet traffic on behalf of
individuals, companies and other ISPs. A common misconception is that the ISP a customer
contracts with is the same one that handles their service end-to-end, but that’s not how the
Internet works. The Internet is made up of thousands of autonomous networks that depend on
one another to deliver traffic from point to point across the globe. An infrastructure outage
caused by a faulty router, fiber cut or control plane failure can impact your ability to
connect to sites and services, even if you don’t have a direct relationship with an affected
ISP.

EXAMPLE: Fiber Cut Takes Down Comcast Services

On June 29, 2018, a major network outage at Comcast left millions of customers, from Seattle
to New York, without connectivity to critical sites and services for upwards of three hours.
The effects of this outage were felt not only by Comcast’s own subscriber base (including
those on the Xfinity media platform) but also by users connected through ISPs peered with
Comcast. Here are the impacts we saw around this outage and what you can learn from it.

The Cause
A fiber cut was isolated as the root cause of the outage, and very shortly after the
announcement, network service appeared to be restored.

The Impact
For nearly three hours, any Internet user whose traffic transited through the Comcast
backbone would have been impacted by this outage. This includes those trying to connect
to Comcast’s Xfinity media platform.

Key Learnings
The Internet comprises a complex chain of
interconnections that can have a ripple effect
on the customers of other ISP networks when
something goes awry. For businesses
especially, this constant vulnerability poses
a significant risk. Enterprises need to have
visibility into Internet connectivity and
performance in order to know which
networks their traffic is touching. This is
especially critical for those organizations
deploying SD-WAN technology as they are
more dependent on Internet connectivity.

Internet & Cloud Service Outage / Cloud Provider Network

What is a cloud provider outage?

Moving critical applications and services to the cloud brings unprecedented power to IT
teams who no longer have to worry about building and maintaining infrastructure. At the same
time, cloud computing introduces a solid dose of unpredictability due to the sheer complexity
of the Internet and cloud connectivity.

A cloud outage can result from a loss of power at a data center, for example, or even an
issue within the provider’s own network. While cloud outages occur from time to time, most
vendors have redundancy measures in place across availability zones to mitigate the impact
of outages on customers.

EXAMPLE: Control Plane Failure Causes GCP Outage

On June 2, 2019, services hosted in some of the US regions of Google Cloud Platform
experienced an outage for more than four hours. This affected access to popular
services like G Suite, YouTube, and Google Compute Engine. While Google has issued
an official incident report on the matter, ThousandEyes vantage points across the global
Internet give a unique perspective on issues such as these.

The Cause
Networking issues related to high levels of network congestion in the eastern United
States affected multiple services in Google Cloud, G Suite, and YouTube, causing users
to experience intermittent errors and slow performance.

The Impact
The outage lasted for more than four hours and affected access to various services
including YouTube, G Suite, and Google Compute Engine. For 3.5 hours, there was 100%
packet loss for global monitoring locations attempting to connect to a service hosted in GCP
us-west2-a. Similar losses were seen for sites hosted in several portions of GCP US
East, including us-east4-c.

Key Learnings
It’s reasonable to expect that IT infrastructure
and services will sometimes have outages,
even in the cloud. Ensure your cloud
architecture has sufficient resilience
measures, whether on a multi-region basis
or even a multi-cloud basis, to protect
against recurring outages.

Internet & Cloud Service Outage / Natural Disaster

What is a Natural Disaster outage?

Despite its ethereal-sounding moniker, the “cloud” is solidly based on terra firma. Cloud
providers maintain a network of data centers across regions and availability zones in order
to operate. These data centers are serviced by business-class telecom and utility providers.
However, they are not immune to power outages caused by large-scale weather events, such
as hurricanes or floods. Even strong winds can bring down utility lines, and with them, your
cloud providers’ services.

EXAMPLE: Weather-related Power Outage at AWS

On March 2, 2018, a severe outage occurred across Amazon AWS’ US-East-1 region, located
in Ashburn, VA. This outage impacted Amazon’s very own Alexa along with multiple apps and
services hosted within the IaaS provider, including Slack, Twilio and Atlassian JIRA.

The Cause
What started as a power outage impacting a small set of services quickly cascaded into
a major issue, even impacting customers who had subscribed to Amazon’s critical service
offering, AWS Direct Connect.

The Impact
The outage primarily affected customers relying on AWS Direct Connect, a service that offers
dedicated connectivity between the AWS cloud and enterprise networks. Although the
infrastructure recovered very quickly from what was a weather-related power outage,
prolonged and cascading impacts were felt by many software applications and services
running on AWS.

Key Learnings
The cloud is a complex interconnected
system, dependent on other services. When
outages caused by natural disasters happen,
they can be harder to recover from and
cause prolonged effects. It is critical to
consider geographical redundancy as a key
part of your fault tolerance strategy by
making sure workloads are not concentrated
in one geographic region, which may be
vulnerable to the same shared risk. Lastly,
monitor connectivity to cloud infrastructure
and services so you can correctly identify the
scope and root cause of service outages.
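
A quick way to test whether workloads are concentrated in one geographic region is to compute each region's share of the total and flag anything above a chosen limit. The sketch below is a minimal illustration. The inventory format, the region names and the 50% limit are assumptions, not a recommended policy.

```python
from collections import Counter


def region_shares(workloads):
    """Return each region's share of workloads as a fraction between 0 and 1."""
    counts = Counter(region for _, region in workloads)
    total = sum(counts.values())
    return {region: count / total for region, count in counts.items()}


def over_concentrated(workloads, max_share=0.5):
    """List the regions that host more than `max_share` of all workloads."""
    return sorted(
        region for region, share in region_shares(workloads).items() if share > max_share
    )


# Hypothetical inventory of (workload, region) pairs.
inventory = [
    ("checkout", "us-east"),
    ("search", "us-east"),
    ("auth", "us-east"),
    ("reports", "eu-west"),
]
print(region_shares(inventory))
print(over_concentrated(inventory))
```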

Internet & Cloud Service Outage / Operator Error

What is operator error?

The Internet is an incredibly complex and interconnected ecosystem, but sometimes human
error can result in major outages on networks, applications or both. An internal mistake,
like inadvertently taking servers offline, can manifest as symptoms such as packet loss or
reduced availability for end users attempting to access services hosted on those servers.
These instances serve as a reminder that no matter how much automation is in place, there is
always the chance for operator error.

EXAMPLE: Internal Networking Issue Causes AWS S3 Outage

On February 28, 2017, Amazon Web Services in the eastern United States (US-East-1)
experienced a complete outage for nearly three hours. AWS S3 allows many services
(like Quora, Coursera, and Medium) to store and retrieve files from anywhere on the
web, and it also facilitates other AWS services, many of which had limited to no
functionality during the outage. Here’s what we saw during this outage and the
surprising root cause.

The Cause
Amazon employees mistakenly took more servers offline than intended, which required
various S3 subsystems to be restarted, and S3 was unable to service requests during
this time.

The Impact
During the outage, a large number of services that depend on AWS, including Quora,
Coursera, Docker, Medium and Down Detector, were impacted over the course of
roughly three hours. In addition, AWS services also had limited to no functionality.

Key Learnings
Being aware of dependencies within your
own service and other business-critical
applications is central to identifying issues
early on. These dependencies might be
external (an ISP) or deep in your internal
network (a backend data storage solution).
Either way, it is critical to think through how
important applications are delivered and
fortify your infrastructure with redundancies.
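
Thinking through dependencies can be made concrete by recording them as a graph and asking which services are affected, directly or indirectly, when one component fails. The sketch below walks such a graph breadth-first. The service names and the dependency map are hypothetical.

```python
from collections import deque


def impacted_services(depends_on, failed):
    """Return every service that depends, directly or transitively, on `failed`."""
    # Invert the map: for each dependency, which services rely on it?
    dependents = {}
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, set()).add(service)

    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted


# Hypothetical dependency map: service -> things it depends on.
depends_on = {
    "web-app": ["api", "cdn"],
    "api": ["object-storage", "database"],
    "status-page": ["object-storage"],
    "cdn": [],
}
print(sorted(impacted_services(depends_on, "object-storage")))
```

In this example a storage failure reaches the web application only through the API, which is the kind of indirect dependency that is easy to miss until an outage exposes it.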

Internet & Cloud Service Outage / Internet Sovereignty

What is Internet Sovereignty?

While the Internet is not under the governance of a central authority, certain countries have
been known to place restrictions on Internet traffic originating from or destined to servers
within their borders in order to control the content their populations have access to. In
China, the Great Firewall is an advanced content filtering mechanism that deploys a variety of
techniques to deny access to sites the government deems inappropriate. And, in recent years,
Russia has experimented with an Internet “kill switch” that would enable the country to
disconnect itself from the global Internet. However, when nations interfere with the Internet,
we often see unintended consequences far beyond their own borders.

EXAMPLE 1: Route Leak Takes Down WhatsApp for Many Users

On June 6, 2019, a number of users around the globe attempting to access WhatsApp were
unable to reach the service. ThousandEyes determined the root cause of this packet loss was a
massive route leak that steered traffic to China Telecom, a service provider that does not
forward any Facebook-related traffic.

The Cause
The incident was triggered when a Swiss colocation company called
Safe Host announced to the Internet that the best way to reach
WhatsApp was through its network. When Safe Host advertised
these routes, they were accepted by China Telecom and further
propagated through other ISPs such as Cogent. Users whose traffic
was routed to Cogent, and ultimately handed off to China Telecom,
would have been completely unable to reach the service.

The Impact
During two instances that lasted over 1.5 hours combined, 100%
packet loss prevented reachability of WhatsApp services for a
limited number of users around the globe.

EXAMPLE 2: Outage at China Telecom Has Global Impacts

On May 13, 2019, China Telecom experienced a significant outage
that lasted for nearly five hours, with aftereffects continuing for
several hours afterward. It revealed some important foundational realities
about China and its impact on the global Internet that many folks
aren’t aware of.

The Cause
Substantial packet loss across China Telecom’s backbone continued
over many hours, primarily impacting network infrastructure in
mainland China, but also affecting China Telecom’s network in
Singapore and multiple points in the U.S., including Los Angeles.

The Impact
Over the course of the prolonged outage, any traffic routed through
affected infrastructure was dropped, which meant that some Internet
users in and outside of China would have experienced service
disruptions connecting to various websites and applications. Users
in China attempting to reach sites hosted external to China would
have been impacted, along with users outside of China trying to
connect to sites hosted within China. Though not exclusively
impacting western sites and services, many major U.S. brands, such
as Apple, Amazon, Microsoft, Slack, Workday, SAP, and others
were impacted over the course of the outage window.

Key Learnings
Most people think about the Great Firewall as
a monolithically administered set of rules that
keeps China-based users hermetically sealed
from the rest of the globe. However, China is
fairly well connected to external sites and
services—at least those that serve
commercial interests. The scope of
infrastructure controlled and managed by
China Telecom extends far beyond China’s
geographic borders. Enterprises whose
traffic transits over any ISP that is known for
restricting Internet traffic should carefully
monitor their routing for sudden
unexpected behavior.

Move from Chaos to Clarity
For many enterprises, the Internet is a “black box,” and when
disruptive events occur, IT and digital operations teams are
often unable to identify the source or respond effectively.
While the interdependent, fragile nature of the Internet means
that outages are inevitable, visibility into these outages can
significantly reduce the time to escalate and resolve these
incidents, as well as enable you to better communicate with
your customers.

In order to troubleshoot outages effectively, you need
multi-layered visibility. You need to be able to see into your
own network to understand if the problem is somewhere
inside your infrastructure. You also need to be able to
understand the hop-by-hop routing paths your traffic is taking
across the Internet. And you need the greater context around
these instances to know if you’re part of a much larger outage.
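
To illustrate how hop-by-hop path data helps separate an internal problem from an external one, the sketch below finds the first hop from which packet loss persists all the way to the destination. Loss that appears at one hop and then clears further along is ignored here, on the assumption that it reflects that hop deprioritizing probe replies and not real forwarding loss. The data layout, network labels, addresses and threshold are all hypothetical.

```python
def locate_loss(hops, loss_threshold=5.0):
    """Return the first hop from which loss persists through the end of the path.

    `hops` is a list of (address, network, loss_percent) tuples in path order.
    Returns None if the final hop shows no loss above the threshold.
    """
    start = None
    for index, (_, _, loss) in enumerate(hops):
        if loss > loss_threshold:
            if start is None:
                start = index  # possible beginning of sustained loss
        else:
            start = None       # loss did not persist, so reset
    return None if start is None else hops[start]


# Hypothetical path from an enterprise site to a destination.
path = [
    ("192.0.2.1", "enterprise", 0.0),
    ("198.51.100.1", "isp-a", 20.0),    # isolated loss that clears at the next hop
    ("198.51.100.9", "isp-a", 0.0),
    ("203.0.113.1", "isp-b", 60.0),     # loss begins here and continues
    ("203.0.113.50", "destination", 60.0),
]
print(locate_loss(path))
```

Here the result points at a network outside the enterprise, which is the kind of evidence that helps decide whether to escalate to a provider or look inward.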

Internet Insights™ provides this context, leveraging collective
intelligence to create a picture of Internet health. Internet
Insights measures billions of service paths each day to identify
and isolate outages to specific service providers and locations
and presents them via a NOC-style dashboard, as well as
through timeline and topographical incident views. This added
layer of context allows you to rapidly identify, escalate and
remediate business-relevant Internet outages, as well as
communicate more effectively with customers and providers.

Conclusion
Delivering an excellent end-user experience in the digital domain requires a seamless
orchestration between multiple third parties and your own systems—all of which must
transit across the Internet and, in some cases, private networks. Yet, the Internet is as
unpredictable as it is critical. It's a "best-effort" collection of networks connecting a myriad
of providers (public cloud, DNS, BGP, CDN, DDoS mitigation, SaaS and security gateways),
and at its very core, the Internet is vulnerable to outages and exploitation that can affect
users’ experiences.

To say that your business hinges on the Internet is no understatement. Yet, outages take
down critical business services every day. By ensuring that you have visibility into all the
dependencies that matter to your organization, you can create effective outage recovery
plans and measure resiliency when those plans are called into play. The guidelines
outlined in this eBook should serve as a foundation for these outage preparations.

If you’re ready to take a proactive approach to mitigating outage-related risks,
let's talk about how ThousandEyes can help.

201 Mission Street, Suite 1700
San Francisco, CA 94105
(415) 231-5674

www.thousandeyes.com

© 1992–2020 Cisco Systems, Inc. All rights reserved.
