ThousandEyes Internet Outages Survival Guide Ebook
Survival Guide:
Unpacking common outage types and
how they impact your business
Table of Contents
Introduction: Business Runs on the Internet
Reachability Outages
DDoS Attack
DNS Hijacks & Cache Poisoning
Conclusion
Business Runs on the Internet
When Internet outages happen, they can be extremely disruptive to your business.
By preventing users from reaching your applications and services, outages can cause
major revenue and reputation damage. While application delivery is dependent on
many Internet Service Providers (ISPs), it also increasingly relies on a large and
complex ecosystem of Internet-facing services—such as CDN, DNS, DDoS mitigation
and public cloud. These services work together to provide exceptional digital
experiences to users and even brief disruptions can have a significant impact.
At the same time, enterprises are increasingly relying on Internet transport to connect
their sites and reach business-critical applications and services. Gone are the days in
which applications are solely hosted in private data centers and office locations are
connected primarily by MPLS circuits. The Internet is replacing or supplementing
services like MPLS as enterprises embrace SD-WAN technologies. As a result, the
Internet is now effectively the enterprise backbone, which as a “best-effort” transport
can have significant yet unforeseen consequences for businesses.
For many enterprises, however, the Internet is a “black box,” and when disruptive
events occur, IT and digital operations teams are often unable to identify the source
or respond effectively. While the interdependent, fragile nature of the Internet
means that outages are inevitable, having visibility into these outages can significantly
reduce the time to escalate and resolve these incidents, as well as enable you to
better communicate with your customers.
In this eBook, we will discuss the common causes of large-scale outages along with
real-world examples that we have analyzed using the ThousandEyes platform. The
key learnings uncovered in this eBook should serve as the basis for proactive outage
resilience and readiness plans that today’s enterprises require.
Key Learnings
BGP-related incidents have been on the rise.
Methods of trust like Route Origin
Authorizations (ROAs) have been around
for a while, but they have failed to catch on
universally. As a result, the Internet is still
vulnerable to these incidents.
The Impact
The outage affected access to several Google services,
including G Suite, Google Search as well as Google
Analytics. Google traffic was funneled into the hands
of ISPs in countries with a long history of Internet
surveillance. MainOne took 74 minutes to either notice
or be notified of the issue and fix it, and it took about
three-quarters of an hour more for services to come
back up.
On June 24, 2019, for nearly two hours a significant BGP routing error impacted users trying to access services fronted
by CDN provider Cloudflare, including gaming platforms Discord and Nintendo Life. This incident is yet another
example of how incredibly easy it is to dramatically alter the service delivery landscape on the Internet.
Key Learnings
BGP route leaks are not uncommon on the Internet. When you rely on the Internet, an
ecosystem that is deeply interconnected and vulnerable, you need to understand how
it works and expect that a glitch in one service provider can have cascading effects on
another. Route leaks from smaller networks are often propagated by large providers,
even though there are common filtering techniques available to reduce the impact of
these events.
The unfortunate reality is that business risks associated with BGP route leaks and other
Internet flaws are greater given the modern enterprise and service delivery landscape. While
the ISP community recognizes the scope of BGP routing issues, and solutions such as ROA
and IRR filtering exist, none of them are silver bullets and incorrectly implementing them risks
reachability of your services. Enterprises need to continuously monitor their BGP routes and
detect incidents quickly in order to mitigate any service impacts on their business.
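To illustrate what ROA-based filtering checks, here is a minimal Python sketch of route origin validation. It follows the general valid / invalid / not-found classification described in RFC 6811, but it is a simplified illustration rather than a production validator. The Roa type, the validate_origin function, and every prefix and AS number below are assumptions made for the example (documentation-reserved values), not data from any incident discussed here.

```python
import ipaddress
from typing import NamedTuple, Sequence


class Roa(NamedTuple):
    """A simplified Route Origin Authorization entry (hypothetical type)."""
    prefix: str      # prefix the holder has authorized
    max_length: int  # longest prefix length that may be announced
    asn: int         # AS number authorized to originate the prefix


def validate_origin(route_prefix: str, origin_asn: int, roas: Sequence[Roa]) -> str:
    """Classify an announcement as 'valid', 'invalid', or 'not-found'."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # Only compare prefixes of the same IP version.
        if route.version != roa_net.version:
            continue
        # A ROA covers the route if the route sits inside the ROA prefix.
        if not route.subnet_of(roa_net):
            continue
        covered = True
        if roa.asn == origin_asn and route.prefixlen <= roa.max_length:
            return "valid"
    # Covered by some ROA but no ROA matched: likely a leak or hijack.
    return "invalid" if covered else "not-found"


# Made-up ROA set using documentation prefixes and AS numbers.
EXAMPLE_ROAS = [Roa("203.0.113.0/24", 24, 64500)]

for prefix, asn in [
    ("203.0.113.0/24", 64500),  # authorized origin and length
    ("203.0.113.0/24", 64501),  # wrong origin AS
    ("203.0.113.0/25", 64500),  # more specific than max_length allows
    ("192.0.2.0/24", 64500),    # no covering ROA at all
]:
    print(prefix, asn, validate_origin(prefix, asn, EXAMPLE_ROAS))
```

The "not-found" result matters in practice: because ROAs have not been adopted universally, many legitimate routes have no covering ROA, so an operator cannot simply reject everything that is not "valid".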
The Cause
The packet loss appears to have been caused by a BGP route flap issue, where a routing
announcement is made and withdrawn in quick succession, often repeatedly.
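To make the definition of a route flap concrete, the following Python sketch flags a prefix as flapping when it switches between announced and withdrawn several times within a short window. The event format, the 60-second window, the threshold of four state changes, and the function name are all illustrative assumptions. They are not values taken from this incident or from any router's flap-damping implementation.

```python
from collections import defaultdict, deque


def find_flapping_prefixes(events, window_seconds=60, min_transitions=4):
    """Return the set of prefixes that change state repeatedly in a short time.

    events: iterable of (timestamp_seconds, prefix, kind) tuples sorted by
    timestamp, where kind is "announce" or "withdraw" (hypothetical format).
    """
    last_kind = {}                # prefix -> most recent state seen
    recent = defaultdict(deque)   # prefix -> timestamps of recent state changes
    flapping = set()
    for ts, prefix, kind in events:
        if last_kind.get(prefix) == kind:
            continue  # a repeated announce or withdraw is not a state change
        last_kind[prefix] = kind
        changes = recent[prefix]
        changes.append(ts)
        # Drop state changes that fall outside the sliding window.
        while changes and ts - changes[0] > window_seconds:
            changes.popleft()
        if len(changes) >= min_transitions:
            flapping.add(prefix)
    return flapping


# Made-up update stream using documentation prefixes.
EXAMPLE_EVENTS = [
    (0, "203.0.113.0/24", "announce"),
    (0, "198.51.100.0/24", "announce"),
    (5, "203.0.113.0/24", "withdraw"),
    (10, "203.0.113.0/24", "announce"),
    (15, "203.0.113.0/24", "withdraw"),
    (300, "198.51.100.0/24", "withdraw"),
]

print(find_flapping_prefixes(EXAMPLE_EVENTS))
```

In this sketch the first prefix changes state four times in 15 seconds and is flagged, while the second is announced once and withdrawn five minutes later, which looks like an ordinary routing change rather than a flap.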
The Impact
The incident impacting Apple Pay and http://apple.com took place over more than 90 minutes,
resolving around 10:30 am. However, it appears that additional services continued to
experience issues for some time after. While Apple services are certainly important for many
Internet users, the fact that the incident occurred early on a holiday seems to have
prevented the incident from sparking more than a few user complaints.
Key Learnings
The lesson from this incident is that
sometimes even significant outages may go
unnoticed (or conversely create significant
uproar) simply based on their timing and
context. In this case, the outage coincided
with the Fourth of July holiday in the US, and
it’s likely that fewer people were trying to
access these services at that time. However,
outages can happen at any time, so it is
critical to have visibility into your Internet
routing to triage situations like this and
resolve the issues quickly to mitigate
further impacts.
The Impact
During the course of the incident, ThousandEyes saw a
significant drop in HTTP server availability from around
the world, as well as a dramatic increase in HTTP response
times. ThousandEyes also measured packet loss of up to
60% from our global vantage points, a condition that
would have further prevented access to Wikipedia sites.
The Cause
Two very powerful DDoS attacks pounded on GitHub sites, testing
the limits of its well-executed mitigation process.
The Impact
The first attack was, at the time, the most powerful DDoS attack
recorded, with 1.3 Tbps of attack traffic. However, within 24 hours,
GitHub was struck with another DDoS attack, which appeared to be
more severe in its impact. On the second day, GitHub’s availability
dropped by 61%, compared to a 26% drop the day before.
Key Learnings
While DDoS events are an unfortunate reality
of operating on the Internet, organizations
should have visibility into the scope, impact
and behavior of these events and be able
to validate that DDoS mitigation steps
are effective.
Key Learnings
Whether your traffic is in China or anywhere
else in the world, DNS hijacking can be
incredibly disruptive to your business. It’s
important to remember that application
delivery completely relies on the availability
of accurate DNS records. To keep abreast of
any changes that affect important records,
use DNS Server and Trace tests to
continuously monitor the state of your DNS
records, including their availability, accuracy
and resolution time.
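As a rough illustration of the three properties named above (availability, accuracy and resolution time), the Python sketch below resolves a hostname and compares the answers with a set of expected addresses. It uses the standard library's socket.getaddrinfo through the operating system's resolver and is not the ThousandEyes DNS Server or Trace test. The function names, the hostname, and the addresses are placeholders chosen for the example.

```python
import socket
import time


def system_resolve(hostname):
    """Resolve a hostname with the OS resolver; return a set of IP strings."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return {info[4][0] for info in infos}


def check_dns_record(hostname, expected_addresses, resolve=system_resolve):
    """Report availability, accuracy, and resolution time for one hostname."""
    start = time.monotonic()
    try:
        answers = set(resolve(hostname))
        available = True
    except OSError:  # socket.gaierror is a subclass of OSError
        answers = set()
        available = False
    elapsed_ms = (time.monotonic() - start) * 1000.0
    # Any address outside the expected set may indicate a hijacked
    # or poisoned record (or simply an unannounced legitimate change).
    unexpected = answers - set(expected_addresses)
    return {
        "available": available,
        "accurate": bool(answers) and not unexpected,
        "unexpected": sorted(unexpected),
        "resolution_ms": elapsed_ms,
    }


# Demonstration with a stub resolver so the example runs without network
# access; omit the resolve argument to query the real system resolver.
def stub_hijacked_resolver(hostname):
    return {"203.0.113.66"}  # made-up address standing in for a bad answer


report = check_dns_record("www.example.com", {"192.0.2.10"},
                          resolve=stub_hijacked_resolver)
print(report["available"], report["accurate"], report["unexpected"])
```

A single check from one machine only shows what one resolver returns. Cache poisoning and regional hijacks can affect some resolvers and not others, which is why running the same check continuously from many locations is more informative.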
Key Learnings
The Internet comprises a complex chain of
interconnections that can have a ripple effect
on the customers of other ISP networks when
something goes awry. For businesses
especially, this constant vulnerability poses
a significant risk. Enterprises need to have
visibility into Internet connectivity and
performance in order to know which
networks their traffic is touching. This is
especially critical for those organizations
deploying SD-WAN technology as they are
more dependent on Internet connectivity.
A cloud outage can result from a loss of power at a data center, for example,
or even an issue within its own network. While cloud outages occur from time
to time, most vendors have redundancy measures in place across availability
zones to mitigate the impact of outages on customers.
The Cause
Networking issues related to high levels of network congestion in the eastern United
States affected multiple services in Google Cloud, G Suite, and YouTube, causing users
to experience intermittent errors and slow performance.
The Impact
The outage lasted for more than four hours and affected access to various services
including YouTube, G Suite, and Google Compute Engine. For 3.5 hours, global
monitoring locations attempting to connect to a service hosted in GCP us-west2-a
saw 100% packet loss. Similar losses were seen for sites hosted in several portions
of GCP US East, including us-east4-c.
Key Learnings
It’s reasonable to expect that IT infrastructure
and services will sometimes have outages,
even in the cloud. Ensure your cloud
architecture has sufficient resilience
measures, whether on a multi-region basis
or even a multi-cloud basis, to protect
against recurring outages.
The Impact
The outage primarily affected customers relying on AWS Direct Connect, a service that offers
dedicated connectivity between the AWS cloud and enterprise networks. Although the
infrastructure recovered very quickly from what was a weather-related power outage,
prolonged and cascading impacts were felt by many software applications and services
running on AWS.
Key Learnings
The cloud is a complex interconnected
system, dependent on other services. When
outages caused by natural disasters happen,
they can be harder to recover from and
cause prolonged effects. It is critical to
consider geographical redundancy as a key
part of your fault tolerance strategy by
making sure workloads are not concentrated
in one geographic region, which may be
vulnerable to the same shared risk. Lastly,
monitor connectivity to cloud infrastructure
and services so you can correctly identify the
scope and root cause of service outages.
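One simple way to check for this kind of concentration is to compute the share of workloads that sit in each region and flag any region above a chosen limit. The Python sketch below assumes a hypothetical inventory that maps workload names to region labels, and the 50% limit is an arbitrary example value rather than a recommendation.

```python
from collections import Counter


def concentrated_regions(workload_regions, max_share=0.5):
    """Return {region: share} for regions holding more than max_share of workloads.

    workload_regions: dict mapping workload name -> region label
    (a hypothetical inventory format).
    """
    counts = Counter(workload_regions.values())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        region: count / total
        for region, count in counts.items()
        if count / total > max_share
    }


# Made-up inventory: three of four workloads share one region.
EXAMPLE_INVENTORY = {
    "web": "region-a",
    "api": "region-a",
    "database": "region-a",
    "batch": "region-b",
}

print(concentrated_regions(EXAMPLE_INVENTORY))
```

Counting workloads equally is a simplification. In a real review you would probably weight each workload by its business importance, since one critical database in a single region can matter more than many stateless services.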
The Cause
Amazon employees mistakenly took more servers offline than intended, which required
various S3 subsystems to be restarted, and S3 was unable to service requests during
this time.
The Impact
During the outage, a large number of services that depend on AWS, including Quora,
Coursera, Docker, Medium and Down Detector, were impacted over the course of
roughly three hours. In addition, AWS services also had limited to no functionality.
Key Learnings
Being aware of dependencies within your
own service and other business-critical
applications is central to identifying issues
early on. These dependencies might be
external (an ISP) or deep in your internal
network (a backend data storage solution).
Either way, it is critical to think through how
important applications are delivered and
fortify your infrastructure with redundancies.
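Thinking through dependencies can be made systematic by recording them as a graph and asking which services are affected when one component fails. The Python sketch below does this with a breadth-first walk over reverse dependencies. The service names and the dependency map are invented for illustration and do not describe any real architecture.

```python
from collections import deque


def impacted_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`.

    depends_on: dict mapping service -> list of its direct dependencies
    (a hypothetical format for a dependency inventory).
    """
    # Invert the map: dependency -> services that rely on it directly.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in impacted:  # also guards against cycles
                impacted.add(service)
                queue.append(service)
    return impacted


# Made-up dependency map with one external and one internal dependency.
EXAMPLE_DEPENDENCIES = {
    "checkout": ["payments-api", "object-storage"],
    "payments-api": ["isp-a"],
    "status-page": ["object-storage"],
    "object-storage": [],
}

print(sorted(impacted_by("object-storage", EXAMPLE_DEPENDENCIES)))
print(sorted(impacted_by("isp-a", EXAMPLE_DEPENDENCIES)))
```

Note that in this example the status page depends on the same storage backend as the checkout service, so a storage outage would take down the page meant to report it. Hidden shared dependencies of this kind are what such a walk is intended to surface.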
The Impact
During two instances that lasted over 1.5 hours combined, 100%
packet loss prevented reachability of WhatsApp services for a
limited number of users around the globe.
The Cause
Substantial packet loss across China Telecom’s backbone continued
over many hours, primarily impacting network infrastructure in
mainland China, but also affecting China Telecom’s network in
Singapore and multiple points in the U.S., including Los Angeles.
The Impact
Over the course of the prolonged outage, any traffic routed through
affected infrastructure was dropped, which meant that some Internet
users in and outside of China would have experienced service
disruptions connecting to various websites and applications. Users
in China attempting to reach sites hosted external to China would
have been impacted, along with users outside of China trying to
connect to sites hosted within China. Though not exclusively
impacting western sites and services, many major U.S. brands, such
as Apple, Amazon, Microsoft, Slack, Workday, SAP, and others
were impacted over the course of the outage window.
Key Learnings
Most people think about the Great Firewall as
a monolithically administered set of rules that
keeps China-based users hermetically sealed
from the rest of the globe. However, China is
fairly well connected to external sites and
services—at least those that serve
commercial interests. The scope of
infrastructure controlled and managed by
China Telecom extends far beyond China’s
geographic borders. Enterprises whose
traffic transits over any ISP that is known for
restricting Internet traffic should carefully
monitor their routing for sudden
unexpected behavior.
To say that your business hinges on the Internet is no overstatement. Yet, outages take
down critical business services every day. By ensuring that you have visibility into all the
dependencies that matter to your organization, you can create effective outage recovery
plans and measure resiliency when those plans are called into play. The guidelines
outlined in this eBook should serve as a foundation for these outage preparations.
www.thousandeyes.com