Women in Tech Anthology
Many of the designations used by manufacturers and sellers to distinguish their products are
claimed as trademarks. Where those designations appear in the book, and Manning
Publications was aware of a trademark claim, the designations have been printed in initial caps
or all caps.
Recognizing the importance of preserving what has been written, it is Manning’s policy to have
the books we publish printed on acid-free paper, and we exert our best efforts to that end.
Recognizing also our responsibility to conserve the resources of our planet, Manning books
are printed on paper that is at least 15 percent recycled and processed without the use of
elemental chlorine.
ISBN: 9781617299292
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 - EBM - 24 23 22 21 20 19
Personal Essays
Women are steadily chipping away at the large and long-lived gender gap in
technology, standing shoulder to shoulder with their male counterparts, and garner-
ing much-deserved attention for their tech talents, intelligence, innovative spirit, and
sheer determination. It’s heartening to see the many initiatives cropping up, such as
Girls Who STEAM, Girls Who Code, TechGirlz, and she++, all striving to encourage
young girls and women to pursue careers in technology. These and other female-
focused initiatives on the rise promise that the tech industry’s gender gap will con-
tinue to close further and faster, and that women will be more likely to follow their
hearts—and their talents—into the tech fields that ignite their passions.
We asked a small handful of the many women technologists we know and highly
respect to offer their insights on being a woman in technology. The next four
entries in this mini ebook are their gracious and thoughtful contributions, filled
with spirited anecdotes, valuable lessons, and even some sage advice for their
younger selves. We hope their words inspire you, bolster you, and remind you that
you’re not alone in the gender-related challenges you face. They want you to know
that all the successes achieved by them—indeed, by all the bright and persevering
women unapologetically smashing that glass ceiling every day—are your successes
too, because each stride that you, as women, collectively make in this field further
dispels the myth that technology careers are not for you. These women we’re
proudly featuring here also invite you to blaze ahead and create a new story about
technology, with women in starring roles. As the highly lauded NASA rocket scien-
tist Dr. Camille Wardrop Alleyne is credited with saying, “Step out of the box. Chart
your own course. Leave your mark on our world.”
Jamie Riedesel is a Staff Engineer at Dropbox with over twenty years of experi-
ence in IT, working in government, education, legacy companies, and startups.
She’s a member of the League of Professional System Administrators (LOPSA),
a nonprofit global corporation for advancing the practice of system administra-
tion. She is also an elected moderator for ServerFault, a Q&A site focused on
professional systems administration and desktop support professionals.
career aspirations. The list of successful mothers who master the juggle has become
long, spotlighting women including Susan Wojcicki, mother of five and CEO of YouTube; Sheryl Sandberg, mother of two and COO of Facebook; Gwynne Shotwell, mother of two and President and COO of SpaceX; and Fei-Fei Li, mother of two, professor of Computer Science at Stanford University and a member of Twitter’s board of
directors. These women and others like them are inspiring the next generation of
female developers, software engineers, programmers, and marketers.
I once met a woman in my office bathroom and asked if she was in sales because
I’d never seen her around. She was an engineer! I’d just presumed she was not tech-
nical. I immediately apologized, and she understood, but the damage was done.
As a woman engineer, I know the common occurrence of having to defend my
technical skills and chops, and having some men presume I don’t know what
I’m talking about or that I don’t understand code.
I used to attend a lot of college hackathons where I was surrounded by people
who fit all the stereotypes I held of engineers: playing video games on their breaks,
drinking Soylent, staying up all night coding, wearing sweatpants or maybe jeans
and a tech company tee shirt, lots of hoodies, and collecting swag from companies
at those hackathons and conferences. Even though I loved (and still love) shopping,
I started to live for company tee shirts, stickers, and water bottles. I thought wearing
and using those things made me belong.
When I joined my first tech company in 2016, I tried to be one of the guys. And
by “guys” I mean male engineers. I love Disney, Broadway, Taylor Swift, and some
fairly feminine things, but I tried to hide it all. I didn’t laugh loudly. I wore tech
company tee shirts and hoodies and jeans and though I felt like I belonged, like the
engineers took me seriously, it didn’t make me feel good about myself.
It’s important to note that it wasn’t the company that made me feel like I had to
fit in like that, and it wasn’t the engineers—it was me. It was all internalized.
I’m very grateful and lucky my manager at the time was an Asian woman who hap-
pened to love cats, pink, and cute things too. She had the respect of engineers while
wearing dresses, polka-dots, and makeup, so I started to do those things as well, and
my happiness and confidence grew. My only hope is that other underrepresented
people in tech can find themselves, be themselves, and pass it on to others.
—Lizzie Siegle
Developer Evangelist [she, her]
This is from 2020, which feels like half a lifetime away. I’ve seen some things—many
I can’t share for what I hope are obvious causality reasons, but there are a few
things to share that will make life easier and not destroy the timeline. I know where
you are right now, and the problems you are facing (mostly... time is like that).
There are three big bombshells you need to hear:
Remember the bragging guy at a party a couple years ago, who talked about
his income goals? You can do that by age 30 (you're almost there already).
And double it again by 40. Another doubling by 50 is not nearly as hard as
you think.
You pass way better than you think you do. Once you get your facial hair
zapped, you will be surprised. We won some genetic lotteries you won’t find
out about until you try. Also, your hair has amazing curls in it once you let it
come out; the haircare routine you were taught at home is custom-built to
prevent curls from escaping. Find a new one.
You will eventually be a woman in tech, and you are very wrong about the
headwinds you will face. That’s some sexist baloney you’ve internalized; stop,
think, and get your butt to WisCon sometime. The rest of your life is way hap-
pier, which gives you the emotional resilience to push your way into the tent
once people stop granting you male privilege.
There is a word you need to be on the lookout for: genderqueer. It’s out there right
now in small places, but it really comes into its own in a few years. No, you’re not
quite trans enough for the standard trans narrative; but you’re sure as heck gender-
queer enough for the queers. The gender in your head does not have to match the
gender you present to the world, and that is power.
Your current office, with all the women around you, is amazing and I’m so glad we
had that as our first post-college job. Unless you deliberately seek out such environ-
ments, this is the only time in your career this happens. You are learning communica-
tion skills right now that will be incredibly valuable when you’re surrounded by
nothing but men. Half of getting our Operations-style job done is hearts-n-minds
work, so knowing how people communicate is a critical skill. Continue to watch how
people relate to each other—diverse environments like where you are now are defi-
nitely not common.
Surviving all-guy offices will take some getting used to. This is made far easier if
you have a community outside of your work that you fit into. Part of the disconnect
you will feel in offices like that is the lack of people who are like you, so you have to
find them outside.
Believe it or not, you will eventually have an all-coding job. I know we went into
Ops specifically to avoid programming every day. However, as the years roll on, and
more abstractions are placed on top of existing abstractions, which are further
abstracted—eventually we get to a place where you will write code to define the infra-
structure you are managing (it seems impossible, but know that hardware is merely
the bottom layer of the abstraction-layer-cake). You will write more code to define how
the operating systems are configured. More code to define how software is installed
into your infrastructure. The future is code, so it’s a good idea to pay some attention
to how software is built. I write code every day, but I’m still doing SysAdmin work.
Your superpower will continue to be how we synthesize information to come up
with connections, theories for how those connections are present, and the ability to
describe those connections to others. This is how we pull rabbits out of hats (we keep
doing that, by the way), but that skill is incredibly useful when diagnosing problems
across a large, distributed system. Yes, distributed systems are in your future, and
they’re as fun to work on as you hope.
Work on writing documentation: both the runbook 1-2-3-4 kind, and the abstract
‘why this was put together this way’ varieties. Work on storytelling, because describing
complex technical problems is easier on the listener if there is a narrative through-
line. Writing shouldn’t be your main job (resist it if people urge you that way), but
being good at writing will make you stand out in the job you do have.
Finally, the hard stuff. You won’t like this.
Beware of technical cul-de-sacs. Once a technology is only used in niches, it will
take you five times as long to change jobs to get away from a toxic job. It hurts to leave
a community behind, but you need to think about your mental well-being. The writing
is already on the wall for one technology you’re working on—don’t stay too long.
There will be others.
Beware of jobs with no advancement potential. I lost seven years in one where
there was no way to promote; the one opportunity I had was due to a retirement, and
was into management (which I didn’t want, and I’m so very glad I didn’t try). Ask
about career ladders during your interviews, and how they invest in employees.
You will spend a depressing amount of time between now and 2020 in recessions.
The business cycle sucks sometimes, and it means you’ll likely end up sticking with a
job you don’t like simply because there aren’t any other jobs to be had. Make friends
online, network. It will help you to make the leap, when leaping is needed.
There is a reason you don’t really feel a strong connection to the technical com-
munities you’re in right now: they’re not like you. They don’t communicate like you.
Communities with people like you (and who you will become) are out there, and feel
far more true. Seek them out. You’re prone to magnifying risks; know that about your-
self, and take the risk to figure out if you’re wrong. Life is better that way.
—Jamie Riedesel
Anthology of Chapters
by Women in Technology
How we work, play, and experience the world changes and evolves as technology does. And just as technology has been shaping our world throughout the ages,
women have been helping to shape technology.
Take, for example, tech pioneer Grace Hopper’s early work in programming lan-
guages which spawned COBOL, a language that went on to become the backbone of
the financial industry and remains a top choice for business solutions. Or Hedy
Lamarr’s spread-spectrum encryption technology, initially designed for use in military
applications, but which later became the foundation for modern Wi-Fi. Consider net-
work maven Marian Croak’s work in Internet Protocol, which led to the network tech-
nology that expands our reach across the globe and brings us all together in a new and
better way than ever before. And how about taking our (gaming) hats off to visionary
Roberta Williams whose groundbreaking and prolific “King’s Quest” adventure game
series helped carve a path for the digital games that take us out of our daily grind and
into amazing immersive experiences.
Imagine how different the world would look without these innovations. Then,
imagine the exciting technological advances still on the horizon from women includ-
ing Sadaf Monajemi, whose med-tech company, See-Mode Technologies, saves money,
time, and lives by helping medical professionals predict strokes in high-risk patients;
Gina Trapani, who heads up projects at ExpertLabs, a think tank that aims to gain fed-
eral support for world-changing technological initiatives; and Lisa Seacat DeLuca,
who’s been granted over 400 patents, and has been named one of MIT’s 35 Innovators
Under 35, one of Fast Company’s 100 Most Creative People in Business, and #1 on
LinkedIn’s list of Top Voices in Technology, along with earning many other distin-
guished accolades.
Each one of these amazing women represents countless others everywhere who are
making their marks in technology with their bold dedication, inventive spirits, and
impressive technological talent. We salute you all, and we can’t wait to see what you do
next!
About Python
Read this chapter if you want to know how Python compares to other languages
and its place in the grand scheme of things. Skip ahead—go straight to chapter 3—
if you want to start learning Python right away. The information in this chapter is a
valid part of this book—but it’s certainly not necessary for programming with
Python.
To get an idea of how Python’s expressiveness can simplify code, consider swap-
ping the values of two variables, var1 and var2. In a language like Java, this requires
three lines of code and an extra variable:
int temp = var1;
var1 = var2;
var2 = temp;
The variable temp is needed to save the value of var1 when var2 is put into it, and
then that saved value is put into var2. The process isn’t terribly complex, but reading
those three lines and understanding that a swap has taken place takes a certain
amount of overhead, even for experienced coders.
By contrast, Python lets you make the same swap in one line and in a way that
makes it obvious that a swap of values has occurred:
var2, var1 = var1, var2
Of course, this is a very simple example, but you find the same advantages throughout
the language.
# Python version.
def pairwise_sum(list1, list2):
    result = []
    for i in range(len(list1)):
        result.append(list1[i] + list2[i])
    return result
Both pieces of code do the same thing, but the Python code wins in terms of readabil-
ity. (There are other ways to do this in Perl, of course, some of which are much more
concise—but in my opinion harder to read—than the one shown.)
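A still more idiomatic Python version uses the built-in zip function to pair up corresponding elements, which removes the index bookkeeping entirely (this variant is my own sketch, not a listing from the original text):

```python
def pairwise_sum(list1, list2):
    # zip pairs corresponding elements; the comprehension builds the result
    return [a + b for a, b in zip(list1, list2)]

print(pairwise_sum([1, 2, 3], [10, 20, 30]))  # [11, 22, 33]
```

As with the swap example, the intent is visible at a glance: the reader never has to verify that an index variable stays in bounds.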
There’s no need to install libraries to handle network connections and HTTP; it’s
already in Python, right out of the box.
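As a small illustration of that out-of-the-box support, the standard library’s urllib.parse module can dissect a URL with no third-party installs (a minimal sketch; the URL is just an example):

```python
from urllib.parse import urlparse

# Break a URL into its components using only the standard library
url = urlparse("https://example.com/docs/tutorial?highlight=python")
print(url.scheme)  # https
print(url.netloc)  # example.com
print(url.path)    # /docs/tutorial
```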
Python has a lot going for it: expressiveness, readability, rich included libraries, and
cross-platform capabilities. Also, it’s open source. What’s the catch?
Some languages, such as C and Perl, have even larger collections of libraries available, in some cases offering a
solution where Python has none or a choice of several options where Python might
have only one. These situations tend to be fairly specialized, however, and Python is
easy to extend, either in Python itself or through existing libraries in C and other lan-
guages. For almost all common computing problems, Python’s library support is
excellent.
The fact that Python associates types with objects and not with variables means that
the interpreter doesn’t help you catch variable type mismatches. If you intend a vari-
able count to hold an integer, Python won’t complain if you assign the string "two" to
it. Traditional coders count this as a disadvantage, because you lose an additional free
check on your code. But errors like this usually aren’t hard to find and fix, and
Python’s testing features make avoiding type errors manageable. Most Python programmers feel that the flexibility of dynamic typing more than outweighs the cost.
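To see what this trade-off looks like in practice, here is a minimal sketch: the interpreter accepts the rebinding without complaint, and the mistake surfaces only when the value is actually used:

```python
def increment(count):
    # Assumes count is a number; nothing enforces that when it's assigned
    return count + 1

print(increment(41))        # 42
try:
    increment("two")        # the type mismatch surfaces only at runtime
except TypeError as err:
    print("runtime error:", err)
```

A simple unit test exercising increment with real inputs would have caught the bad call long before production, which is why the chapter points to testing as the practical safety net.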
Although you can write concurrent processes by using Python, if you need concurrency out of the box, Python may not be for you.
You may be thinking, “Why change details like this if it’s going to break old code?”
Because this kind of change is a big step for any language, the core developers of
Python thought about this issue carefully. Although the changes in Python 3 break
compatibility with older code, those changes are fairly small and for the better; they
make the language more consistent, more readable, and less ambiguous. Python 3
isn’t a dramatic rewrite of the language; it’s a well-thought-out evolution. The core
developers also took care to provide a strategy and tools to safely and efficiently
migrate old code to Python 3, which will be discussed in a later chapter, and the six and future libraries are also available to make the transition easier.
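One concrete migration aid is the standard __future__ module, which lets Python 2 code opt in to Python 3 behavior ahead of time (a small sketch; under Python 3 these imports are accepted but have no effect):

```python
# Opt in to Python 3 semantics from Python 2 code
from __future__ import print_function, division

# With the import, / is true division under both interpreters
print(7 / 2)  # 3.5
```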
Why learn Python 3? Because it’s the best Python so far. Also, as projects switch to
take advantage of its improvements, it will be the dominant Python version for years to
come. The porting of libraries to Python 3 has been steady since its introduction, and
by now many of the most popular libraries support Python 3. In fact, according to the
Python Readiness page (http://py3readiness.org), 319 of the 360 most popular librar-
ies have already been ported to Python 3. If you need a library that hasn’t been con-
verted yet, or if you’re working on an established code base in Python 2, by all means
stick with Python 2.x. But if you’re starting to learn Python or starting a project, go
with Python 3; it’s not only better, but also the future.
Summary
Python is a modern, high-level language with dynamic typing and simple, con-
sistent syntax and semantics.
Python is multiplatform, highly modular, and suited for both rapid develop-
ment and large-scale programming.
It’s reasonably fast and can be easily extended with C or C++ modules for
higher speeds.
Python has built-in advanced features such as persistent object storage,
advanced hash tables, expandable class syntax, and universal comparison func-
tions.
Python includes a wide range of libraries such as numeric processing, image
manipulation, user interfaces, and web scripting.
It’s supported by a dynamic Python community.
It’s not Amazon’s fault. On Sunday, September 20, 2015, Amazon Web Services
(AWS) experienced a significant outage. With an increasing number of companies
running mission-critical workloads on AWS—even their core customer-facing ser-
vices—an AWS outage can result in far-reaching subsequent system outages. In this
instance, Netflix, Airbnb, Nest, IMDb, and more all experienced downtime,
impacting their customers and ultimately their business’s bottom lines. The core
outage lasted about five hours (or more, depending on how you count), resulting
in even longer outages for the affected AWS customers before their systems recov-
ered from the failure.
If you’re Nest, you’re paying AWS because you want to focus on creating value
for your customers, not on infrastructure concerns. As part of the deal, AWS is
responsible for keeping its systems up, and enabling you to keep yours functioning
as well. If AWS experiences downtime, it’d be easy to blame Amazon for your result-
ing outage.
But you’d be wrong. Amazon isn’t to blame for your outage.
Wait! Don’t toss this book to the side. Please hear me out. My assertion gets
right to the heart of the matter and explains the goals of this book.
First, let me clear up one thing. I’m not suggesting that Amazon and other cloud
providers have no responsibility for keeping their systems functioning well; they
obviously do. And if a provider doesn’t meet certain service levels, its customers
can and will find alternatives. Service providers generally offer service-level agreements (SLAs). Amazon, for example, provides a 99.95% uptime guarantee for most of
its services.
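It is worth translating that percentage into wall-clock time; a 99.95% monthly uptime guarantee still permits around 20 minutes of downtime. A quick back-of-the-envelope calculation:

```python
# Allowed downtime per 30-day month under a 99.95% uptime SLA
minutes_per_month = 30 * 24 * 60               # 43,200 minutes
allowed_downtime = minutes_per_month * (1 - 0.9995)
print(round(allowed_downtime, 1))              # 21.6 minutes
```

In other words, even a provider that fully honors its SLA leaves room for outages that customers must be prepared to absorb.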
What I’m asserting is that the applications you’re running on a particular infra-
structure can be more stable than the infrastructure itself. How’s that possible? That,
my friends, is exactly what this book will teach you.
Let’s, for a moment, turn back to the AWS outage of September 20. Netflix, one of
the many companies affected by the outage, is the top internet site in the United
States, when measured by the amount of internet bandwidth consumed (36%). But
even though a Netflix outage affects a lot of people, the company had this to say about
the AWS event:
Netflix did experience a brief availability blip in the affected Region, but we sidestepped
any significant impact because Chaos Kong exercises prepare us for incidents like this. By
running experiments on a regular basis that simulate a Regional outage, we were able to
identify any systemic weaknesses early and fix them. When US-EAST-1 became
unavailable, our system was already strong enough to handle a traffic failover.1
Netflix was able to quickly recover from the AWS outage, being fully functional only
minutes after the incident began. Netflix, still running on AWS, was fully functional
even while the AWS outage continued.
No single piece of hardware can be guaranteed to be up 100% of the time, and, as has been the practice for some time, we put redundant systems in place. AWS does
exactly this and makes those redundancy abstractions available to its users.
In particular, AWS offers services in numerous regions; for example, at the time of
writing, its Elastic Compute Cloud platform (EC2) is running and available in Ireland,
Frankfurt, London, Paris, Stockholm, Tokyo, Seoul, Singapore, Mumbai, Sydney, Bei-
jing, Ningxia, Sao Paulo, Canada, and in four locations in the United States (Virginia,
California, Oregon, and Ohio). And within each region, the service is further parti-
tioned into numerous availability zones (AZs) that are configured to isolate the
resources of one AZ from another. This isolation limits the effects of a failure in one
AZ rippling through to services in another AZ.
Figure 1.1 depicts three regions, each of which contains four availability zones.
Applications run within availability zones and—here’s the important part—may run in
more than one AZ and in more than one region. Recall that a moment ago I made the
assertion that redundancy is one of the keys to uptime.
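The routing logic behind such redundancy can be sketched in a few lines. The region names below are real AWS regions, but the preference order and health map are hypothetical, standing in for whatever health-checking a real deployment would use:

```python
# Preference-ordered regions where the app is deployed (hypothetical setup)
REGIONS = ["us-east-1", "us-west-1", "us-west-2"]

def pick_region(healthy):
    """Return the most-preferred region whose health check passes."""
    for region in REGIONS:
        if healthy.get(region, False):
            return region
    raise RuntimeError("no healthy region available")

# A us-east-1 outage fails traffic over to the next healthy region
print(pick_region({"us-east-1": False, "us-west-1": True, "us-west-2": True}))
```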
In figure 1.2, let’s place logos within this diagram to hypothetically represent run-
ning applications. (I have no explicit knowledge of how Netflix, IMDb, or Nest have
deployed their applications; this is purely hypothetical, but illustrative nevertheless.)
1. See “Chaos Engineering Upgraded” at the Netflix Technology blog (http://mng.bz/P8rn) for more information on Chaos Kong.
Figure 1.1 AWS partitions the services it offers into regions and availability zones. Regions map to
geographic areas, and AZs provide further redundancy and isolation within a single region.
Figure 1.2 Applications deployed onto AWS may be deployed into a single AZ (IMDb), or in multiple
AZs (Nest) but only a single region, or in multiple AZs and multiple regions (Netflix). This provides
different resiliency profiles.
Figure 1.3 depicts a single-region outage, like the AWS outage of September 2015. In
that instance, only us-east-1 went dark.
In this simple graphic, you can immediately see how Netflix might have weathered
the outage far better than other companies; it already had its applications running in
other AWS regions and was able to easily direct all traffic over to the healthy instances.
And though it appears that the failover to the other regions wasn’t automatic, Netflix
Figure 1.3 If applications are properly architected and deployed, digital solutions can survive even
a broad outage, such as of an entire region.
had anticipated (even practiced!) a possible outage such as this and had architected
its software and designed its operational practices to compensate.2
2. See “AWS Outage: How Netflix Weathered the Storm by Preparing for the Worst” by Nick Heath (http://mng.bz/J8RV) for more details on the company’s recovery.
These organizations have altered their software architectures and their engineering
practices to make designing for failure an integral part of the way they build, deliver,
and manage software.
3. See “Amazon.com Goes Down, Loses $66,240 Per Minute” by Kelly Clay at the Forbes website for more details (http://mng.bz/wEgP).
is often different from what you can anticipate. The best way to get answers to important
questions such as these is to release an early version of a feature and get feedback. Using
that feedback, you can then make adjustments or even change course entirely. Fre-
quent software releases shorten feedback loops and reduce risk.
The monolithic software systems that have dominated the last several decades
can’t be released often enough. Too many closely interrelated subsystems, built and
tested by independent teams, needed to be tested as a whole before an often-fragile
packaging process could be applied. If a defect was found late in the integration-
testing phase, the long and laborious process would begin anew. New software
architectures are essential to achieving the required agility in releasing software
to production.
4. See Kate Dreyer’s April 13, 2015 blog at the Comscore site (http://mng.bz/7eKv) for a summary of the report.
5. You can read more about these findings by Zion Market Research at the GlobeNewswire site (http://mng.bz/mm6a).
1.1.5 Data-driven
Considering several of the requirements that I’ve presented up to this point drives
you to think about data in a more holistic way. Volumes of data are increasing,
sources are becoming more widely distributed, and software delivery cycles are being
shortened. In combination, these three factors render the large, centralized, shared
database unusable.
A jet engine with hundreds of sensors, for example, is often disconnected from
data centers housing such databases, and bandwidth limitations won’t allow all the
data to be transmitted to the data center during the short windows when connectivity
is established. Furthermore, shared databases require a great deal of process and
coordination across a multitude of applications to rationalize the various data models
and interaction scenarios; this is a major impediment to shortened release cycles.
Instead of the single, shared database, these application requirements call for a
network of smaller, localized databases, and software that manages data relationships
across that federation of data management systems. These new approaches drive the
need for software development and management agility all the way through to the
data tier.
Finally, all of the newly available data is of little value if it goes unused. Today’s
applications must increasingly use data to provide greater value to the customer
through smarter applications. For example, mapping applications use GPS data from
connected cars and mobile devices, along with roadway and terrain data to provide
real-time traffic reports and routing guidance. The applications of the past decades
that implemented painstakingly designed algorithms carefully tuned for anticipated
usage scenarios are being replaced with applications that are constantly being revised
or may even be self-adjusting their internal algorithms and configurations.
6. Gartner forecasts that 8.4 billion connected things will be in use worldwide in 2017; see the Gartner report at www.gartner.com/newsroom/id/3598917.
Figure 1.4 User requirements for software drive development toward cloud-native
architectural and management tenets.
adapt to a changing infrastructure and to fluctuating request volumes. Taking that col-
lection of attributes as a whole, let’s carry this analysis to its conclusion; this is
depicted in figure 1.5:
Software that’s constructed as a set of independent components, redundantly
deployed, implies distribution. If your redundant copies were all deployed close
to one another, you’d be at greater risk of local failures having far-reaching con-
sequences. To make efficient use of the infrastructure resources you have, when
you deploy additional instances of an app to serve increasing request volumes,
you must be able to place them across a wide swath of your available infrastruc-
ture—even, perhaps, that from cloud services such as AWS, Google Cloud Plat-
form (GCP), and Microsoft Azure. As a result, you deploy your software
modules in a highly distributed manner.
Adaptable software is by definition “able to adjust to new conditions,” and the
conditions I refer to here are those of the infrastructure and the set of interre-
lated software modules. They’re intrinsically tied together: as the infrastructure
changes, the software changes, and vice versa. Frequent releases mean frequent
change, and adapting to fluctuating request volumes through scaling opera-
tions represents a constant adjustment. It’s clear that your software and the
environment it runs in are constantly changing.
Many more granular details go into the making of cloud-native software (the specifics
fill the pages of this volume). But, ultimately, they all come back to these core charac-
teristics: highly distributed and constantly changing. This will be your mantra as you
progress through the material, and I’ll repeatedly draw you back to extreme distribu-
tion and constant change.
Figure 1.5 Architectural and management tenets lead to the core characteristics of
cloud-native software: it’s highly distributed and must operate in a constantly changing
environment even as the software is constantly evolving.
7. Hear Adrian talk about this and other examples of complicated things at http://mng.bz/5NzO.
The app will call upon other components to provide services it needs to fulfill its requirements.
Figure 1.6 Familiar elements of a basic software architecture
Taking these simple concepts, let’s now build up a topology that represents the cloud-
native software you’ll build; see figure 1.7. You have a distributed set of modules, most
of which have multiple instances deployed. You can see that most of the apps are also
acting as services, and further, that some services are explicitly stateful. Arrows depict
where one component depends on another.
Figure 1.7 Cloud-native software takes familiar concepts and adds extreme distribution, with
redundancy everywhere, and constant change.
This diagram illustrates a few interesting points. First, notice that the pieces (the
boxes and the database, or storage, icons) are always annotated with two designations:
apps and services for the boxes, and services and state for the storage icons. I’ve come
to think of the simple concepts shown in figure 1.7 as roles that various components
in your software solution take on.
You’ll note that any entity that has an arrow going to it, indicating that the compo-
nent is depended upon by another, is a service. That’s right—almost everything is a
service. Even the app that’s the root of the topology has an arrow to it from the soft-
ware consumer. Apps, of course, are where you’re writing your code. And I particu-
larly like the combination of service and state annotations, making clear that you have
some services that are devoid of state (the stateless services you’ve surely heard about,
annotated here with “app”), whereas others are all about managing state.
And this brings me to defining the three parts of cloud-native software, depicted in
figure 1.8:
The cloud-native app —Again, this is where you’ll write code; it’s the business
logic for your software. Implementing the right patterns here allows those
apps to act as good citizens in the composition that makes up your software; a
single app is rarely a complete digital solution. An app is at one or the other
end of an arrow (or both) and therefore must implement certain behaviors to
make it participate in that relationship. It must also be constructed in a man-
ner that allows for cloud-native operational practices such as scaling and
upgrades to be performed.
Cloud-native data —This is where state lives in your cloud-native software. Even
this simple picture shows a marked deviation from the architectures of the past,
which often used a centralized database to store state for a large portion of the
software. For example, you might have stored user profiles, account details,
reviews, order history, payment information, and more, all in the same data-
base. Cloud-native software breaks the code into many smaller modules (the
apps), and the database is similarly decomposed and distributed.
Cloud-native interactions —Cloud-native software is then a composition of cloud-
native apps and cloud-native data, and the way those entities interact with one
another ultimately determines the functioning and the qualities of the digital
solution. Because of the extreme distribution and constant change that charac-
terizes our systems, these interactions have in many cases significantly evolved
from those of previous software architectures, and some interaction patterns
are entirely new.
Notice that although at the start I talked about services, in the end they aren’t one of
the three entities in this mental model. In large part, this is because pretty much
everything is a service, both apps and data. But more so, I suggest that the interactions between services are even more interesting than a service alone. Services pervade the entire cloud-native software model.
[Figure: app and data entities composed via interactions; interactions may be request/response, push-centric, or pull-centric.]
Figure 1.8 Key entities in the model for cloud-native software: apps, data, and interactions
With this model established, let’s come back to the modern software requirements
covered in section 1.1 and consider their implications on the apps, data, and interac-
tions of your cloud-native software.
CLOUD-NATIVE APPS
Concerns about cloud-native apps include the following:
Their capacity is scaled up or down by adding or removing instances. We refer
to this as scale-out/in, and it’s far different from the scale-up models used in
prior architectures. When deployed correctly, having multiple instances of an
app also offers levels of resilience in an unstable environment.
As soon as you have multiple instances of an app, and even when only a single
instance is being disrupted in some way, keeping state out of the apps allows you
to perform recovery actions most easily. You can simply create a new instance of
an app and connect it back to any stateful services it depends on.
Configuration of the cloud-native app poses unique challenges when many
instances are deployed and the environments in which they’re running are con-
stantly changing. If you have 100 instances of an app, for example, gone are the
days when you could drop a new config into a known filesystem location and
restart the app. Add to that the fact that these instances could be moving all over your distributed topology, and applying such old-school practices would be sheer madness.
The dynamic nature of cloud-based environments necessitates changes to the
way you manage the application lifecycle (not the software delivery lifecycle, but
rather the startup and shutdown of the actual app). You must reexamine how
you start, configure, reconfigure, and shut down apps in this new context.
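As a concrete illustration of the configuration concern above, here is a minimal sketch of environment-sourced configuration, in the spirit of twelve-factor apps: instead of dropping a file into a known filesystem location, each instance reads its settings from its environment at startup. All of the variable names and defaults below are invented for illustration; they are not from this chapter:

```python
import os

def load_config():
    """Read configuration from the environment instead of a file on disk.

    When instances come and go and move between hosts, injecting config
    through environment variables (or a config service) replaces dropping
    a new config file into a known filesystem location and restarting.
    The variable names and defaults here are illustrative only.
    """
    return {
        "db_url": os.environ.get("APP_DB_URL", "postgres://localhost/dev"),
        "feature_x": os.environ.get("APP_FEATURE_X", "off") == "on",
        "instance_id": os.environ.get("APP_INSTANCE_ID", "local-0"),
    }
```

Because every instance reads the same environment contract, reconfiguring 100 instances becomes a matter of changing what the platform injects, not of touching 100 filesystems.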
CLOUD-NATIVE DATA
Okay, so your apps are stateless. But handling state is an equally important part of a
software solution, and the need to solve your data-handling problems also exists in an
environment of extreme distribution and constant change. Because you have data
that needs to persist through these fluctuations, handling data in a cloud setting poses
unique challenges. The concerns for cloud-native data include the following:
You need to break apart the data monolith. In the last several decades, organi-
zations invested a great deal of time, energy, and technology into managing
large, consolidated data models. The reasoning was that concepts that were rel-
evant in many domains, and hence implemented in many software systems,
were best treated centrally as a single entity. For example, in a hospital, the con-
cept of a patient was relevant in many settings, including clinical/care, billing,
experience surveys, and more, and developers would create a single model, and
often a single database, for handling patient information. This approach
doesn’t work in the context of modern software; it’s slow to evolve and brittle,
and ultimately robs the seemingly loosely coupled app fabric of its agility and
robustness. You need to create a distributed data fabric, as you created a distrib-
uted app fabric.
The distributed data fabric is made up of independent, fit-for-purpose data-
bases (supporting polyglot persistence), as well as some that may be acting only
as materialized views of data, where the source of truth lies elsewhere. Caching
is a key pattern and technology in cloud-native software.
When you have entities that exist in multiple databases, such as the “patient” I
mentioned previously, you have to address how to keep the information that’s
common across the different instances in sync.
Ultimately, treating state as an outcome of a series of events forms the core of
the distributed data fabric. Event-sourcing patterns capture state-change events,
and the unified log collects these state-change events and makes them available
to members of this data distribution.
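The last bullet, treating state as an outcome of a series of events, can be sketched minimally in Python: state is never stored directly, but is derived by folding the state-change events from the log. The event shapes and field names below are invented for illustration, not a standard:

```python
def apply_event(state, event):
    """Fold a single state-change event into the current state.

    The event shapes here are illustrative only; a real system would
    use a schema shared by the producers and consumers of the log.
    """
    kind = event["type"]
    if kind == "name_updated":
        return {**state, "name": event["name"]}
    if kind == "phone_updated":
        return {**state, "phone": event["phone"]}
    return state  # ignore event types we don't know about

def current_state(event_log):
    """Replay the unified log to materialize the current view of an entity."""
    state = {}
    for event in event_log:
        state = apply_event(state, event)
    return state

log = [
    {"type": "name_updated", "name": "Ada"},
    {"type": "phone_updated", "phone": "555-0100"},
    {"type": "phone_updated", "phone": "555-0199"},
]
```

Any member of the data distribution can build its own materialized view by replaying the same log, which is how independent, fit-for-purpose databases stay consistent with the source of truth.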
CLOUD-NATIVE INTERACTIONS
And finally, when you draw all the pieces together, a new set of concerns surfaces for the cloud-native interactions:
Accessing an app when it has multiple instances requires some type of routing
system. Synchronous request/response, as well as asynchronous event-driven
patterns, must be addressed.
In a highly distributed, constantly changing environment, you must account for
access attempts that fail. Automatic retries are an essential pattern in cloud-native
software, yet their use can wreak havoc on a system if not governed properly. Cir-
cuit breakers are essential when automated retries are in place.
Because cloud-native software is a composite, a single user request is served
through invocation of a multitude of related services. Properly managing
Because the instances of any of these apps (the User Profile or Auth apps) can change for any number of reasons, a protocol must exist for continuously updating the router with new IP addresses.
The User Profile app then makes a downstream request to the User API service (G) to obtain the current user’s profile data, including phone number. The User API service, in turn, makes a request to the Users stateful service.
After the user has updated their phone number and clicked the Submit button, the User Profile app sends the new data to an event log (H).
Eventually, one of the instances of the User API service will pick up and process this change event (I) and send a write request to the Users database.
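The router’s job in this walkthrough, spreading requests across whatever app instances are currently registered, can be sketched as a toy round-robin table. The addresses, class, and method names below are invented for illustration:

```python
import itertools

class Router:
    """A toy round-robin router over app instances.

    Instances register (and re-register) their addresses as they come
    and go; requests are spread across whatever is currently registered.
    This is a teaching sketch, not a production load balancer.
    """

    def __init__(self):
        self.instances = []
        self._cycle = None

    def register(self, address):
        """Record an instance's address, rebuilding the rotation."""
        if address not in self.instances:
            self.instances.append(address)
        self._cycle = itertools.cycle(self.instances)

    def route(self):
        """Return the next instance to receive a request."""
        if not self.instances:
            raise RuntimeError("no instances registered")
        return next(self._cycle)
```

The continuous-update protocol described above corresponds to instances calling `register` whenever they start or move.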
[Figure: multiple instances of the User Profile app, Auth app, User API service, and Auth API service sit behind routers, with an event log and the Users database; callouts B–I mark routes to app instances, auth tokens, and updated instances registered with the router.]
Figure 1.9 The online banking software is a composition of apps and data services. Many types of interaction protocols are in play.
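The automatic retries and circuit breakers listed among the interaction concerns can be sketched together in a few lines. This is a teaching sketch under simplified assumptions (consecutive-failure counting and a single reset timeout); the class and parameter names are ours, not from this chapter:

```python
import time

class CircuitBreaker:
    """A minimal circuit breaker.

    After `max_failures` consecutive failures, stop calling the downstream
    service ("open" the circuit) until `reset_after` seconds have passed,
    then allow a single trial call ("half-open"). This governs automatic
    retries so they can't hammer a struggling service.
    """

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: not calling downstream")
            # Half-open: the timeout elapsed, so permit one trial call.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

A retry loop would wrap `breaker.call(...)`; once the circuit opens, retries fail fast instead of piling load onto the downstream service.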
[Figure: a bank teller app and user profile app backed by an auth subsystem and a Users database, alongside separate customer and branch promotions apps backed by a Customers database.]
Figure 1.10 What appears to a user as a single experience with Wizard’s Bank is realized by independently developed and managed software assets.
[Figure: the same two systems as in figure 1.10, now connected by a “Distributed data coordination” element between the Users and Customers databases. Each of the two independent pieces of software deals with customer information: on the left, the customer is referred to as a user, and on the right, as a customer. Cloud-native data is highly distributed, and cloud-native software must address the way data is synchronized across separate systems.]
Figure 1.11 A decomposed and loosely coupled data fabric requires techniques for cohesive data management.
In figure 1.11, I’ve added one more concept to our model—something I’ve labeled
“Distributed data coordination.” The depiction here doesn’t imply any implementa-
tion specifics. I’m not suggesting a normalized data model, hub-and-spoke master
data management techniques, or any other solution. For the time being, please accept
this as a problem statement; I promise we’ll study solutions soon.
That’s a lot! Figures 1.9, 1.10, and 1.11 are busy, and I don’t expect you to under-
stand in any great detail all that’s going on. What I do hope you take from this comes
back to the key theme for cloud-native software:
The software solution comprises quite a distribution of a great many components.
Protocols exist to specifically deal with the change that’s inflicted upon the system.
We’ll get into all the details, and more, throughout the following chapters.
If cloud-native is about how, does that mean you can implement cloud-native solutions
on premises? You bet! Most of the enterprises I work with on their cloud-native journey first do so in their own data centers. This means that their on-premises computing
infrastructure needs to support cloud-native software and practices. I talk about this
infrastructure in chapter 3.
As great as it is (and I hope that by the time you finish this book, you’ll think so
too), cloud-native isn’t for everything.
Footnote 8: Although I use the term microservice to refer to the cloud-native architecture, I don’t feel that the term encompasses the other two equally important entities of cloud-native software: data and interactions.
[Figure: the banking software’s bank teller UI calls a cloud-native Account API, which in turn depends on a mainframe application holding the account balances.]
Figure 1.12 Dispensing funds without access
to the source of record is ill-advised.
But now let’s apply a few cloud-native patterns to parts of this system. For example, if
you deploy many instances of each microservice across numerous availability zones, a
network partition in one zone still allows access to mainframe data through service
instances deployed in other zones (figure 1.13).
It’s also worth noting that when you do have legacy code that you wish to refactor,
it needn’t be done in one fell swoop. Netflix, for example, refactored its entire
[Figure: the bank teller UI reaches Account API app instances deployed across two availability zones (AZ1 and AZ2), both fronting the mainframe application and its account balances. Deploying multiple instances of an app across different failure boundaries allows cloud-native patterns to provide benefit in a hybrid (cloud-native and non-cloud-native) software architecture.]
Figure 1.13 Applying some cloud-native patterns, such as redundancy and properly distributed deployments, brings benefit even in software that isn’t fully cloud-native.
Summary
Cloud-native applications can remain stable, even when the infrastructure
they’re running on is constantly changing or even experiencing difficulties.
The key requirements for modern applications call for enabling rapid iteration
and frequent releases, zero downtime, and a massive increase in the volume
and variety of the devices connected to it.
A model for the cloud-native application has three key entities:
– The cloud-native app
– Cloud-native data
– Cloud-native interactions
Cloud is about where software runs; cloud-native is about how it runs.
Cloud-nativeness isn’t all or nothing. Some of the software running in your
organization may follow many cloud-native architectural patterns, other soft-
ware will live on with its older architecture, and still others will be hybrids (a
combination of new and old approaches).
Introducing
Quantum Computing
Quantum computing has been an increasingly popular research field and source of
hype over the last few years. There seem to be news articles daily discussing new
breakthroughs and developments in quantum computing research, promising that
we can solve any number of different problems faster and with lower energy costs.
By using quantum physics to perform computation in new and wonderful ways, quantum computers can make an impact across society, and it’s an exciting time to get involved and learn how to program quantum computers and apply quantum resources to solve problems that matter.
In all of the buzz about the advantages quantum computing offers, however, it is
easy to lose sight of the real scope of the advantages. We have some interesting historical precedent for what can happen when promises about a technology outpace
reality. In the 1970s, machine learning and artificial intelligence suffered from dra-
matically reduced funding, as the hype and excitement around AI outstripped its
results; this would later be called the “AI winter.” Internet companies faced a similar danger in the aftermath of the dot-com bust.
One way forward is by critically understanding what the promise offered by quan-
tum computing is, how quantum computers work, what they can do, and what is not
in scope for quantum computing. In this chapter, we’ll do just that, so that you can get
hands-on experience writing your own quantum programs in the rest of the book.
All that aside, though, it’s just really cool to learn about an entirely new computing
model! To develop that understanding, as you read this book you’ll learn how quan-
tum computers work by programming simulations that you can run on your laptop
today. These simulations will show many of the essential elements of what we expect
real commercial quantum programming to be like, while useful commercial hardware
is coming online.
Those of us working in tech or related fields increasingly must make, or provide input into, these decisions. We have a responsibility to understand what quantum computing is, and perhaps more importantly still, what it is not. That way, we will be best prepared to step up and contribute to these new efforts and decisions.
All that aside, another reason that quantum computing is such a fascinating topic
is that it is both similar to and very different from classical computing. Understanding
both the similarities and differences between classical and quantum computing helps
us to understand what is fundamental about computing in general. Both classical and
quantum computation arise from different descriptions of physical laws such that
understanding computation can help us understand the universe itself in a new way.
What’s absolutely critical, though, is that there is no one right or even best reason
to be interested in quantum computing. Whether you’re reading this because you
want to have a nice nine-to-five job programming quantum computers or because you
want to develop the skills you need to contribute to quantum computing research,
you’ll learn something interesting to you along the way.
Computer
IMPORTANT A computer is a device that takes data as input and performs some sort of operation on that data.
There are many examples of what we might call a computer; see Figure 1.1 for a few.
When we say the word “computer” in conversation, though, we tend to mean
something more specific. Often, we think of a computer as an electronic device like the
one we are currently writing this book on (or that you might be using to read this
book!). For any resource up to this point that we have made a computer out of, we can
model it with classical physics—that is, in terms of Newton’s laws of motion, Newto-
nian gravity, and electromagnetism.
Following this perspective, we will refer to computers that are described using clas-
sical physics as classical computers. This will help us tell apart the kinds of computers
we’re used to (e.g. laptops, phones, bread machines, houses, cars, and pacemakers)
from the computers that we’re learning about in this book.
Specifically, in this book, we’ll be learning about quantum computers. Given the way we have formulated the definition of a classical computer, if we just replace the term classical physics with quantum physics, we have a suitable definition of what a quantum computer is!
Quantum Computer
IMPORTANT A quantum computer is a device that takes data as input and performs some sort of operation on that data, in a way that requires the use of quantum physics to describe the process.
Figure 1.1 Several examples of different kinds of computers, including the UNIVAC mainframe operated by Rear
Admiral Hopper, a room of “human computers” working to solve flight calculations, a mechanical calculator, and
a LEGO-based Turing machine. Each computer can be described by the same mathematical model as computers
like cell phones, laptops, and servers. Photo of “human computers” by NASA. Photo of LEGO Turing machine by
Projet Rubens, used under CC BY 3.0. (https://creativecommons.org/licenses/by/3.0/).
The distinction between classical and quantum computers is precisely that between classi-
cal and quantum physics. We will get into this more later in the book, but the primary dif-
ference is one of scale: our everyday experience is largely with objects that are large
enough and hot enough that even though quantum effects still exist, they don’t do much
on average. Quantum physics still remains true, even at the scale of everyday objects like
coffee mugs, bags of flour, and baseball bats, but we can do a very good job of describing
how these objects interact using physical laws like Newton’s laws of motion.
That said, all of the computation that is implemented using relativistic effects can
also be described using purely classical models of computing such as Turing
machines. By contrast, quantum computation cannot be described as faster classical
computation, but requires a different mathematical model. There has not yet been a
proposal for a “gravitic computer” that uses relativity to realize a different model of
computation, so we’re safe to set relativity aside in this book.
Quantum computing is the art of using small and well-isolated devices to usefully
transform our data in ways that cannot be described in terms of classical physics alone.
We will see in the next chapter, for instance, that we can generate random numbers
on a quantum device by using the idea of rotating between different states. One way to
build quantum devices is to use small classical computers such as digital signal proces-
sors (DSPs) to control properties of exotic materials.
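As a tiny preview of that idea, here is a classical simulation of a single quantum random bit: prepare a qubit, rotate it into an equal superposition, then “measure” by sampling an outcome according to the squared amplitude (the Born rule). The function and its framing are ours, anticipating the simulators built later in the book:

```python
import math
import random

def qrng_bit():
    """Simulate one quantum random bit.

    After rotating |0> into an equal superposition, each of the two
    basis states has amplitude 1/sqrt(2). Measurement returns an outcome
    with probability equal to the squared amplitude, so we sample the
    same distribution classically. This is a sketch of the idea, not a
    full state-vector simulator.
    """
    amp0 = 1 / math.sqrt(2)   # amplitude of |0> after the rotation
    p0 = amp0 ** 2            # Born rule: probability = |amplitude|^2
    return 0 if random.random() < p0 else 1
```

Running `qrng_bit()` many times yields 0s and 1s in roughly equal proportion, which is exactly what a hardware quantum random number generator delivers.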
NOTE For most quantum devices, we need to keep them very cold and very well isolated, since quantum devices can be very susceptible to noise.
Figure 1.2 An example of how a quantum device might interact with a classical computer through the use of a digital signal processor (DSP). The DSP sends low-power signals into the quantum device, and amplifies very low-power signals coming back from the device.
Figure 1.3 Ways we wish we could use quantum computers. Comic used with permission from xkcd.com.
It is important to understand both the potential and the limitations of quantum com-
puters, especially given the hype surrounding quantum computation. Many of the
misunderstandings underlying this hype stem from extrapolating analogies beyond
where they make any sense—all analogies have their limits, and quantum computing
is no different in that regard.
TIP If you’ve ever seen descriptions of new results in quantum computing that read like “we can teleport cats that are in two places at once using the power of infinitely many parallel universes all working together to cure cancer,” then you’ve seen the danger of extrapolating too far from where analogies are useful.
Indeed, any analogy or metaphor used to explain quantum concepts will be wrong if
you dig deep enough. Simulating how a quantum program acts in practice can be a
great way to help test and refine the understanding provided by analogies. Nonethe-
less, we will still leverage analogies in this book, as they can be quite helpful in provid-
ing intuition for how quantum computation works.
One especially common point of confusion regarding quantum computing is the
way in which users will leverage quantum computers. We as a society now have a particu-
lar understanding of what a device called a computer does. A computer is something that
you can use to run web applications, write documents, and run simulations, to name a few common uses. In fact, classical computers are present in every aspect of our lives,
making it easy to take computers for granted. We don’t always even notice what is and
isn’t a computer. Cory Doctorow made this observation by noting that “your car is a
computer you sit inside of” (https://www.youtube.com/watch?v=iaf3Sl2r3jE).
Quantum computers, however, are likely to be much more special-purpose. Just as
not all computation runs on graphical processing units (GPUs) or field-programma-
ble gate arrays (FPGAs), we expect quantum computers to be somewhat pointless for
some tasks.
Classical computing will still be around and will be the main way we communicate and
interact with each other as well as with our quantum hardware. Even to get classical computing resources to interface with quantum devices, in most cases we will also need a digital signal processor, as shown in Figure 1.2.
Moreover, quantum physics describes things at very small scales (both size and
energy) that are well-isolated from their surroundings. This puts some hard limitations on what environments we can run a quantum computer in. One possible solution is to keep our quantum devices in cryogenic fridges, often near absolute zero (0 K, or -459.67 °F / -273.15 °C). While this is not a problem at all to achieve in a data center, maintaining a
dilution refrigerator isn’t really something that makes sense on a desktop, much less in a
laptop or a cell phone. For this reason, it’s very likely that quantum computers will, at
least for quite a while after they first become commercially available, be used through
the cloud.
Using quantum computers as a cloud service resembles other advances in special-
ized computing hardware. By centralizing exotic computing resources in data centers,
it’s possible to explore computing models that are difficult for all but the largest users to
deploy on-premises. Just as high-speed and high-availability Internet connections have
made cloud computing accessible for large numbers of users, you will be able to use
quantum computers from the comfort of your favorite WiFi-blanketed beach, coffee
shop, or even from a train as you watch majestic mountain ranges off in the distance.
Exotic cloud computing resources:
Specialized gaming hardware (PlayStation Now, Xbox One).
Extremely low-latency high-performance computing (e.g. Infiniband) clusters
for scientific problems.
NOTE The concept of “parallel universes” is a great example of an analogy that can help make quantum concepts understandable, but that can lead to nonsense when taken to its extreme. It can sometimes be helpful to think of the different parts of a quantum computation as being in different universes that can’t affect each other, but this description makes it harder to think about some of the effects we will learn about in this book, such as interference. When taken too far, the “parallel universes” analogy also lends itself to thinking of quantum computing in ways that are closer to a particularly pulpy and fun episode of a sci-fi show like Star Trek than to reality.
What this fails to communicate, however, is that it isn’t always obvious how to use
quantum effects to extract useful answers from a quantum device, even if it appears to
contain the desired output. For instance, one way to factor an integer classically is to
list each potential factor, and to check if it’s actually a factor or not.
Factoring N classically:
Let i = 2.
Check if the remainder of N/i is zero.
– If so, return that i factors N.
– If not, increment i and loop.
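These steps translate directly into a short Python sketch. We’ve added a stopping condition at the square root of N, a detail the steps above elide, so the function also terminates (returning None) when N is prime:

```python
def classical_factor(n):
    """Trial division: try each candidate factor i = 2, 3, ... in turn.

    The steps in the text loop until a factor is found; here we also stop
    once i * i exceeds n, since any composite n has a factor no larger
    than its square root. Returns the smallest nontrivial factor of n,
    or None when n is prime (or less than 4).
    """
    i = 2
    while i * i <= n:
        if n % i == 0:
            return i  # i factors n
        i += 1
    return None  # no factor found: n is prime
```

Each candidate is checked independently, which is what makes the parallelization discussed next so straightforward.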
We can speed this classical algorithm up by using a large number of different classical
computers, one for each potential factor that we want to try. That is, this problem can
be easily parallelized. A quantum computer can try each potential factor within the
same device, but as it turns out, this isn’t yet enough to factor integers faster than the
classical approach above. If you run this on a quantum computer, the output will be
one of the potential factors chosen at random. The actual correct factors will occur
with probability about 1 / √N, which is no better than the classical algorithm above.
As we’ll see in Chapter 11, though, we can use other quantum effects to factor integers with a quantum computer faster than the best-known classical factoring algorithms. Much of the heavy lifting done by Shor’s algorithm is to make sure that the probability of measuring a correct factor at the end is
much larger than measuring an incorrect factor. Cancelling out incorrect answers in
this way is where much of the art of quantum programming comes in; it’s not easy, or even possible, for all problems we might want to solve.
To understand what quantum computers can and can’t do, and how to do cool things with them despite these challenges, it helps to get concrete. Let’s consider what a quantum program even is, so that we can start writing our own.
Program
A program is a sequence of instructions that can be interpreted by a classical com-
puter to perform a desired task.
We can write classical programs to break down a wide variety of different tasks for
interpretation by all sorts of different computers.
Figure 1.4 Examples of classical programs. Tax forms, map directions, and recipes are all examples in
which a sequence of instructions is interpreted by a classical computer such as a person. Each of these
may look very different but use a list of steps to communicate a procedure.
Let’s take a look at what a simple “hello, world” program might look like in Python:
>>> def hello():
... print("Hello, world!")
...
>>> hello()
Hello, world!
At its most basic, this program can be thought of as a sequence of instructions given to
the Python interpreter, which then executes each instruction in turn to accomplish
some effect—in this case, printing a message to the screen.
We can make this way of thinking more formal by using the dis module provided
with Python to disassemble hello() into a sequence of instructions:
>>> import dis
>>> dis.dis(hello)
2 0 LOAD_GLOBAL 0 (print)
2 LOAD_CONST 1 ('Hello, world!')
4 CALL_FUNCTION 1
6 POP_TOP
8 LOAD_CONST 0 (None)
10 RETURN_VALUE
NOTE You may get different output on your system, depending on what version of Python you’re using.
Each line consists of a single instruction that is passed to the Python virtual machine;
for instance, the LOAD_GLOBAL instruction is used to look up the definition of
the print function. The print function is then called by the CALL_FUNCTION instruc-
tion. The Python code that we wrote above was compiled by the interpreter to produce
this sequence of instructions. In turn, the Python virtual machine executes our pro-
gram by calling instructions provided by the operating system and the CPU.
Figure 1.5 An example of how a classical computing task is repeatedly described and interpreted. Programmers start by writing code in a language like Python, and those instructions get translated to lower- and lower-level descriptions until they can easily be run on the CPU. The CPU then causes a physical change in the display hardware that the programmer can observe.
At each level, we have a description of some task that is then interpreted by some other
program or piece of hardware to accomplish some goal. This constant interplay
between description and interpretation motivates calling Python, C, and other such
programming tools languages, emphasizing that programming is ultimately an act of
communication.
In the example of using Python to print “Hello, world!”, we are effectively communicating with Guido van Rossum, the founding designer of the Python language.
Guido then effectively communicates on our behalf with the designers of the operat-
ing system that we are using. These designers in turn communicate on our behalf with
Intel, AMD, ARM, or whomever has designed the CPU that we are using, and so forth.
As with any other use of language to communicate, our choice of programming lan-
guage affects how we think and reason about programming. When we choose a pro-
gramming language, the different features of that language and the syntax used to
express those features mean that some ideas are more easily expressed than others.
Figure 1.6 Writing a quantum program with the Quantum Development Kit and Visual Studio Code. We
will get to the content of this program in Chapter 5, but what you can see at a high level is that it looks
quite similar to other software projects you may have worked on before.
The instructions available to classical and quantum programs differ according to this
difference in tasks. For instance, a classical program may describe a task such as load-
ing some cat pictures from the Internet in terms of instructions to a networking stack,
and eventually in terms of assembly instructions such as mov (move). By contrast,
quantum languages like Q# allow programmers to express quantum tasks in terms of
instructions like M (measure).
When run using quantum hardware, these programs may instruct a digital signal
processor such as that shown in Figure 1.7 to send microwaves, radio waves, or lasers
into a quantum device, and to amplify signals coming out of the device.
If we are to achieve a different end, however, it makes sense for us to use a language that reflects what we wish to communicate!
Figure 1.7 An example of how a quantum device might interact with a classical computer through the use of a digital signal processor (DSP). The DSP sends low-power signals into the quantum device, and amplifies very low-power signals coming back from the device.
We have many different classical programming languages for just this reason, as it doesn’t make sense to use only one of C, Python, JavaScript, Haskell, Bash, T-SQL, or any of a whole multitude of other languages. Each language focuses on a subset of tasks that arise within classical programming, allowing us to choose a language that lets us express how we would like to communicate that task to the next level of interpreters.
Quantum programming is thus distinct from classical programming almost
entirely in terms of what tasks are given special precedence and attention. On the
other hand, quantum programs are still interpreted by classical hardware such as digi-
tal signal processors, so a quantum programmer writes quantum programs using clas-
sical computers and development environments.
Throughout the rest of this book, we will see many examples of the kinds of tasks
that a quantum program is faced with solving or at least addressing, and what kinds of
classical tools we can use to make quantum programming easier. We will build up the
concepts you need to write quantum programs chapter by chapter; you can see a road
map of how these concepts build up in Figure 1.8.
Summary
Quantum computing is important because quantum computers potentially
allow us to solve problems that are too difficult to solve with conventional com-
puters.
Quantum computers can provide advantages over classical computers for some
kinds of problems, such as factoring large numbers.
Quantum computers are devices that use quantum physics to process data.
Programs are sequences of instructions that can be interpreted by a classical
computer to perform tasks.
Quantum programs are programs that perform computation by sending
instructions to quantum devices.
Figure 1.8 This book builds up the concepts you need to write quantum programs. In Part 1 you will start with lower-level descriptions of the simulators and the intrinsic operations (think of them as a hardware API) by building your own simulator in Python. Part 2 will take you into the Q# language and the quantum development techniques that will help you develop your own applications. Part 3 will show you some known applications for quantum computing and the challenges and opportunities we have with this technology moving forward.
Introduction
Having picked up this book, you might be wondering what the algorithms and data
structures for massive datasets are, and what makes them different from “normal”
algorithms you might have encountered thus far. Does the title of this book imply
that the classical algorithms (e.g., binary search, merge sort, quicksort, fundamen-
tal graph algorithms) as well as canonical data structures (e.g., arrays, matrices,
hash tables, binary search trees, heaps) were built exclusively for small datasets?
And if so, why the hell did no one tell you that?
The answer to this question is not that short and simple (but if it had to be short
and simple, it would be “Yes”). The notion of what constitutes a massive dataset is
relative and it depends on many factors, but the fact of the matter is that most bread-
and-butter algorithms and data structures that we know about and work with on a daily
basis have been developed with an implicit assumption that all data fits in the main
memory, or random-access memory (RAM) of one’s computer. So once you read all your
data into RAM and store it into the data structure, it is relatively fast and easy to access
any element of it; at that point, the ultimate goal from the efficiency point of view
becomes squeezing the most work out of the fewest CPU cycles. This is what
Big-O analysis (O(.)) teaches us; it commonly expresses the worst-case
number of basic operations the algorithm has to perform in order to solve a problem.
These unit operations can be comparisons, arithmetic, bit operations, memory cell
read/write/copy, or anything that directly translates into a small number of CPU cycles.
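To make the unit-operation idea concrete, here is a small sketch in which the operation being counted is a comparison (our own illustration, not code from this book):

```python
def linear_search(items, target):
    """Return (index, comparisons). Linear search performs O(n)
    comparisons in the worst case: the comparison is the unit
    operation that a Big-O analysis counts here."""
    comparisons = 0
    for i, x in enumerate(items):
        comparisons += 1          # one unit operation
        if x == target:
            return i, comparisons
    return -1, comparisons
```

Searching a list of n items performs at most n comparisons, so the worst case is O(n).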
However, if you are a data scientist today, a developer or a back-end engineer work-
ing for a company that regularly collects data from its users, whether it be a retail web-
site, a bank, a social network, or a smart-bed app collecting sensor data, storing all data
into the main memory of your computer probably sounds like a beautiful dream. And
you don’t have to work for Facebook or Google to deal with gigabytes (GB), terabytes
(TB) or even petabytes (PB) of data almost on a daily basis. According to some projec-
tions, from 2020 onward, the amount of data generated will be at least equal to every
person on Earth generating close to 2 megabytes (MB) per second!10 Companies with a
more sophisticated infrastructure and more resources can afford to delay thinking
about scalability issues by spending more money on the infrastructure (e.g., by buying
more RAM), but, as we will see, even those companies, or should we say, especially those
companies, choose to fill that extra RAM with clever and space-efficient data structures.
The first paper11 to introduce external-memory algorithms—a class of algorithms
that today govern the design of large databases, and whose goal is to minimize the
total number of memory transfers during the execution of the program—appeared
back in 1988, where, as the motivation, the authors cite the example of large banks
having to sort 2 million checks daily, about 800MB worth of checks to be sorted over-
night before the next business day. With working memories of that time being about
2-4MB in size, this indeed was a massive dataset. Figuring out how to sort checks
efficiently when only about 4MB worth of checks can fit in the working memory (and
thus be sorted) at one time, and how to swap pieces of data in and out so as to
minimize the number of trips the data makes from disk into main memory, was a
relevant problem back then, and it has only become more relevant since. In past
decades, data has grown tremendously but perhaps more importantly, it has grown at
a much faster rate than the average size of RAM memory.
One of the central consequences of the rapid growth of data, and the main idea
motivating algorithms in this book, is that most applications today are data-intensive.
10
Domo.com, “Data Never sleeps,”[Online]. Available: https://www.domo.com/solution/data-never-sleeps-6.
[Accessed 19th January, 2020].
11
A. Aggarwal and J. S. Vitter, “The input/output complexity of sorting and related problems,” Commun.
ACM, vol. 31, no. 9, pp. 1116-1127, 1988.
Data-intensive (in contrast to CPU-intensive) means that the bottleneck of the applica-
tion comes from transferring data back and forth and accessing data, rather than
doing computation on that data (in Section 1.4 of this chapter, there are more details
as to why data access is much slower than computation). What this means in practice
is that it takes more time to get the data to the point where we can solve the problem
on it than to actually solve the problem; thus improving how we manage the size of
the data or its access patterns is one of the most effective ways to speed up the
application.
In addition, the infrastructure of modern-day systems has become very complex.
With thousands of computers exchanging data over the network, databases and
caches are distributed, and many users simultaneously add and query large amounts
of content. Data itself has become complex, multidimensional, and dynamic. The
applications, in order to be effective, need to respond to changes very quickly. In
streaming applications12, data effectively flies by without ever being stored, and the
application needs to capture the relevant features of the data with the degree of accu-
racy rendering it relevant and useful, without the ability to scan it again. This new con-
text calls for a new generation of algorithms and data structures, a new application
builder’s toolbox that is optimized to address many challenges of massive-data systems.
The intention of this book is to teach you exactly that: the fundamental algorithmic
techniques and data structures for developing scalable applications.
1.1 An example
To illustrate the main themes of this book, consider the following example: you are work-
ing for a media company on a project related to news article comments. You are given a
large repository of comments with the following associated basic metadata information:
{
  comment-id: 2833908
  article-id: 779284
  user-id: johngreen19
  text: this recipe needs more butter
  views: 14375
  likes: 43
}
We are looking at approximately 100 million news articles and roughly 3 billion user
comments. Assuming storing one comment takes 200 bytes, we need about 600GB to
store the comment data. Your goal is to serve your readers better, and in order to do
that, you would like to classify the articles according to keywords that recur in the
comments below the articles. You are given a list of relevant keywords for each topic
(e.g., ‘sports’, ‘politics’, etc.), and for initial analysis, the goal is only to count how
often the given keywords occur in comments related to a particular article, but
before all that, you would like to eliminate the duplicate comments that occurred
during multiple instances of crawling.
12
B. Ellis, Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data, Wiley Publishing, 2014.
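As a point of reference for the space-efficient alternatives discussed below, a naive in-RAM solution to the deduplication and keyword-counting tasks might look like the following sketch. The field names follow the metadata example above; the function itself is our own illustration:

```python
def count_keywords(comments, keywords):
    """Naive exact solution: a set for deduplication and a hash table
    for counts. Both grow linearly with the data, which is exactly
    what becomes infeasible at billions of comments."""
    seen = set()                      # one entry per distinct comment-id
    counts = {}                       # (article-id, keyword) -> frequency
    for c in comments:
        if c["comment-id"] in seen:   # duplicate from repeated crawling
            continue
        seen.add(c["comment-id"])
        for word in c["text"].split():
            if word in keywords:
                key = (c["article-id"], word)
                counts[key] = counts.get(key, 0) + 1
    return counts
```

At 3 billion comments, the `seen` set and `counts` table alone would dwarf most machines' RAM, which is what motivates the sketches below.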
We can build similar structures for analyzing the popularity of particular comments,
users, etc. Therefore, we might need close to a hundred or hundreds of gigabytes just
to build basic structures that do not even include most of the metadata information
that we often need for a more insightful analysis (see Figure 1.2).
Figure 1.2 Hash tables use an amount of space linear in the size of the data, asymptotically the minimum required to store the data correctly; but at large dataset sizes, hash tables cannot fit into the main memory.
These data structures use much less space to process a dataset of n items than the lin-
ear space O(n) that a hash table or a red-black tree would need; think 1 byte per item
or sometimes much less than that.
We can solve our news article comment examples with succinct data structures.
A Bloom filter (Chapter 3) will use 8x less space than the (comment-id -> frequency)
hash table and can help us answer membership queries with about a 2% false positive
rate. In this introductory chapter, we will avoid doing the math to explain how we
arrived at these numbers; for now it suffices to say that the reason the Bloom filter
and some other data structures we will see can get away with substantially less space
than hash tables or red-black trees is that they do not actually store the items them-
selves. They compute certain codes (hashes) that end up representing the original
items (but also some other items, hence the false positives), and original items are
much larger than the codes. Another data structure, Count-Min sketch (Chapter 4)
will use about 24x less space than the (comment-id -> frequency) hash table to esti-
mate the frequency of each comment-id, exhibiting a small overestimate in the fre-
quency in over 99% of the cases. We can use the same data structure to replace the
(article-id -> keyword_frequency) hash tables and use about 3MB per keyword
table, costing about 20x less than the original scheme. Lastly, a data structure, HyperLo-
gLog (Chapter 5), can estimate the cardinality of the set with only 12KB, exhibiting an
error of less than 1%. If we further relax the requirements on accuracy for each of these
data structures, we can get away with even less space. Because the original dataset still
resides on disk, while the data structures are small enough to serve requests efficiently
from RAM, there is also a way to control for an occasional error.
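As a first taste of Chapter 3, here is a minimal Bloom filter sketch. The class, the hash construction, and the default parameters m and k are our own illustration; the chapter shows how to size them for a target false-positive rate:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: stores k hashed bit positions per item
    instead of the item itself, so membership answers can be wrong
    only in the false-positive direction."""

    def __init__(self, m=8000, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, item):
        # Derive k positions by salting one cryptographic hash.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means definitely absent; True means probably present.
        return all(self.bits[pos] for pos in self._positions(item))
```

The whole filter here is m bits of storage regardless of how large the stored items are, which is where the space savings over a hash table come from.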
Another choice we have when dealing with large data is to proclaim the set unnec-
essarily large and only work with a random sample that can comfortably fit into RAM.
So, for example, you might want to calculate the average number of views of a com-
ment by computing the average of the views variable over the random sample you
drew. If we migrate our example of calculating an average number of views to the
streaming context, we could efficiently draw a random sample from the data stream of
comments as it arrives using the Bernoulli sampling algorithm (Chapter 6). To illus-
trate, if you have ever plucked flower petals in the love-fortune game “(s)he loves me,
(s)he loves me not’’ in a random manner, you could say that you probably ended up
with “Bernoulli-sampled” petals in your hand—this sampling scheme offers itself con-
veniently to the one-pass-over-data context.
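A Bernoulli sample is simple to draw in one pass; here is a sketch (the function name is ours, and Chapter 6 treats the scheme properly):

```python
import random

def bernoulli_sample(stream, p, rng=None):
    """One pass over the stream: keep each item independently with
    probability p, never storing the items that were not chosen.
    The expected sample size is p times the stream length."""
    rng = rng or random.Random()
    return [item for item in stream if rng.random() < p]
```

Because each keep/skip decision is made as the item flies by, the scheme works even when the stream is never stored.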
Answering more granular questions about the comments data, such as below which
value of the views attribute 90% of all comments fall, will also trade accuracy for
space. We can maintain a type of dynamic histogram (Chapter 7) of the complete
views data within a limited, realistic fast-memory space. This sketch, or summary, of
the data can then be used to answer queries about any quantile of your complete
data with some error.
Last but definitely not least, we often deal with large data by storing it in a data-
base or as a file on disk or some other persistent storage. Storing data on remote
storage and processing it efficiently follows a whole new set of rules compared to
traditional algorithms (Chapter 8), even when it comes to fundamental problems such as
searching or sorting. The choice of a database, for example, becomes important and
it will depend on the particular workload that we expect. Will we often be adding new
comment data and rarely posing queries, or will we rarely add new data and mostly
query the static dataset, or, as it often happens, will we need to do both at a very rapid
rate? These are all questions that are paramount when deciding on the type of data-
base engine we might use to store the data.
Very few people actually implement their own storage engines, but to knowledge-
ably choose between different alternatives, we need to understand what data struc-
tures power them underneath. Many massive-data applications today struggle to
provide high query and insertion speeds, and the tradeoffs are best understood by
studying the data structures that run under the hood of MySQL, TokuDB, LevelDB
and other storage engines, such as B-trees, Bε-trees, and LSM-trees, where each is opti-
mized for a particular purpose (Chapters 9 and 10). Similarly, it is important to under-
stand the basic algorithmic tricks when working with files on disk, and this is best
done by learning how to solve fundamental algorithmic problems in this context like
the first example in our chapter of sorting checks in external memory (Chapter 11).
13
S. S. Skiena, The Algorithm Design Manual, Second Edition, Springer Publishing Company, Incorporated, 2008.
rithms by Cormen, Leiserson, Rivest and Stein14, Algorithms by Robert Sedgewick and
Kevin Wayne15, or for a more introductory and friendly take on the subject, Grokking
Algorithms by Aditya Bhargava16. The algorithms and data structures for massive data-
sets are slowly but surely making their way into the mainstream textbooks, but the
world is moving fast and our hope is that this book can provide a compendium of the
state-of-the-art algorithms and data structures that can help a data scientist or a devel-
oper handling large datasets at work. The book is intended to offer a good balance of
theoretical intuition, practical use cases and (pseudo)code snippets. This book
assumes that a reader has some fundamental knowledge of algorithms and data struc-
tures, so if you have not studied the basic algorithms and data structures, you should
first cover that material before embarking on this subject.
Many books on massive data focus on a particular technology, system, or infrastruc-
ture. This book does not focus on a specific technology, nor does it assume
familiarity with any particular technology. Instead, it covers underlying algorithms and
data structures that play a major role in making these systems scalable. Often the
books that do cover algorithmic aspects of massive data focus on machine learning.
However, an important aspect of handling large data has often been neglected in the
literature: the part that has less to do with inferring meaning from data and more to
do with handling the size of the data and processing it efficiently, whatever the data
is. This book aims to fill that gap.
There are some books that address specialized aspects of massive datasets (e.g., Probabi-
listic Data Structures and Algorithms for Big Data Applications17, Real-Time Analytics: Tech-
niques to Analyze and Visualize Streaming Data18, Disk-Based Algorithms for Big Data19, and
Mining of Massive Datasets20). With this book, we intend to present these different themes
in one place, often citing the cutting-edge research and technical papers on relevant sub-
jects. Lastly, our hope is that this book will teach more advanced algorithmic material in a
down-to-earth manner, providing mathematical intuition instead of the technical proofs
that characterize most resources on this subject. Illustrations play an important role in com-
municating some of the more advanced technical concepts and we hope you enjoy them.
Now that we’ve gotten the introductory remarks out of the way, let’s discuss the
central issue that motivates topics from this book.
14
T. H. Cormen, C. E. Leiserson, R. L. Rivest and C. Stein, Introduction to algorithms, Third Edition, The MIT
Press, 2009.
15
R. Sedgewick and K. Wayne, Algorithms, Fourth Edition, Addison-Wesley Professional, 2011.
16
A. Bhargava, Grokking Algorithms: An Illustrated Guide for Programmers and Other Curious People, Man-
ning Publications Co., 2016.
17
G. Andrii, Probabilistic Data Structures and Algorithms for Big Data Applications, Books on Demand, 2019.
18
B. Ellis, Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data, Wiley Publishing, 2014.
19
C. G. Healey, Disk-Based Algorithms for Big Data, CRC Press, Inc., 2016.
20
A. Rajaraman and J. D. Ullman, Mining of Massive Datasets, Cambridge University Press, 2011.
computers face in processing large amounts of data actually stem from hardware and
general computer architecture. Now, this book is not about hardware, but in order to
design efficient algorithms for massive data, it is important to understand some physi-
cal constraints that are making data transfer such a big challenge. Some of the main
issues include: 1) the large asymmetry between the CPU and the memory speed, 2)
different levels of memory and the tradeoffs between the speed and size for each level,
and 3) the issue of latency vs. bandwidth. In this section, we will discuss these issues, as
they are at the root of solving performance bottlenecks of data-intensive applications.
What this gap points to intuitively is that doing computation is much faster than
accessing data. So if we are still stuck with the traditional mindset of measuring the
performance of algorithms using the number of computations (and assuming mem-
ory accesses take the same amount of time as the CPU computation), then our analy-
ses will not jibe well with reality.
21
J. L. Hennessy and D. A. Patterson, Computer Architecture, Fifth Edition: A Quantitative Approach, Morgan
Kaufmann Publishers Inc., 2011.
The hard disk is the only remaining mechanical part of a computer, and it works a lot
like a record player. Placing the mechanical needle on the right track is the expensive
part of accessing data on disk. Once the needle is on the right track, the data transfer
can be very fast, depending on how fast the disk spins.
22
C. Terman, “MIT OpenCourseWare, Massachusetts Institute of Technology,” Spring 2017. [Online]. Avail-
able: https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-004-computation-struc-
tures-spring-2017/index.htm. [Accessed 20th January 2019].
A similar phenomenon, where “latency lags bandwidth” holds for other types of
memory23. Generally, the bandwidth in various systems, ranging from microproces-
sors, main memory, hard disk, and network, has tremendously improved over the past
few decades, but latency hasn’t as much, even though the latency might often be the
more important measurement for most scenarios—a common pattern of user behav-
ior is many small random accesses as opposed to one large sequential one.
Because of this expensive initial call, the data transfer between different levels of
memory is done in chunks of multiple items, to offset the cost of the call. The chunk
sizes are proportionate to the sizes of the memory levels: cache lines are between
8 bytes and 64 bytes, and disk blocks can be up to 1MB24.
Due to the concept known as spatial locality, whereby we expect a program to access
memory locations that are in the vicinity of each other close in time, transferring
whole blocks in this way pre-fetches the data we will likely need in the future.
You might be asking yourself, how are all these facts relevant for the design of data-effi-
cient algorithms and data structures? The first important takeaway is that, although
technology improves constantly (for instance, SSDs are a relatively new development
and they do not share many of the issues of hard disks), some of the issues, such as the
tradeoff between the speed and the size of memories are not going away any time soon.
Part of the reason for this is purely physical: to store a lot of data, we need a lot of
23
D. A. Patterson, “Latency Lags Bandwidth,” Commun. ACM, vol. 47, no. 10, pp. 71-75, 2004.
24
J. L. Hennessy and D. A. Patterson, Computer Architecture, Fifth Edition: A Quantitative Approach, Morgan
Kaufmann Publishers Inc., 2011.
space, and the speed of light sets the physical limit to how fast data can travel from one
part of the computer to the other or from one part of the network to the other. To
extend this to a network of computers, we will cite25 an example: for two computers
that are 300 meters apart, the physical lower bound on a data exchange is 1 microsecond.
Hence, we need to design algorithms with this awareness. Designing succinct data
structures (or taking data samples) that can fit into fast levels of memory helps because
we avoid expensive disk seeks. So, one of the important algorithmic tricks with large
data is that saving space saves time. Yet, in many applications we still need to work with
data on disk. Here, designing algorithms with optimized patterns of disk access and
caching mechanisms that enable the smallest number of memory transfers is import-
ant, and this is further linked to how we lay out and organize data on a disk (say in a
relational database). Disk-based algorithms prefer smooth scanning across the disk
to random hopping; this way we get to make use of the good bandwidth and avoid the
poor latency, so one meaningful direction is transforming an algorithm that hops into
one that glides over the data. Throughout this book, we will see how classical algorithms can be
transformed and new ones can be designed having space-related concerns in mind.
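A small illustration of gliding rather than hopping: reading a file sequentially in large blocks, so that one expensive seek is followed by long runs that use the disk's good bandwidth. The function and block size are our own sketch, not code from this book:

```python
def scan_blocks(path, block_size=1 << 20):
    """Stream a file sequentially in 1MB blocks: place the needle
    once, then glide, instead of seeking for every record."""
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            yield block
```

Chapter 11 develops the real version of this idea for external-memory sorting.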
Lastly, it is important to keep in mind that many aspects of making systems work in
production have little to do with designing a clever algorithm. Modern systems have
many performance metrics other than scalability, such as security, availability, and
maintainability. So, real production systems need an efficient data structure and an
algorithm running under the hood, but with a lot of bells and whistles on top to make
all the other stuff work for their customers (see Figure 1.6). However, with ever-
increasing amounts of data, designing efficient data structures and algorithms has
become more important than ever before, and we hope that in the coming pages you
will learn how to do exactly that.
25
D. A. Patterson, “Latency Lags Bandwidth,” Commun. ACM, vol. 47, no. 10, pp. 71-75, 2004.
Summary
Applications today generate and process large amounts of data at a rapid rate.
Traditional data structures, such as basic hash tables and key-value dictionaries,
can grow too big to fit in RAM, which can lead to an application choking due to the
I/O bottleneck.
To process large datasets efficiently, we can design space-efficient hash-based
sketches, do real-time analytics with the help of random sampling and approxi-
mate statistics, or deal with data on disk and other remote storage more effi-
ciently.
This book serves as a natural continuation to the basic algorithms and data
structures book/course because it teaches how to transform the fundamental
algorithms and data structures into algorithms and data structures that scale
well to large datasets.
The key reason why large data is a major issue for today’s computers and sys-
tems is that CPU (and multiprocessor) speeds improve at a much faster rate
than memory speeds; added to this are the tradeoff between speed and size for
the different types of memory in the computer, and the latency vs. bandwidth
phenomenon. These trends are not likely to change significantly soon, so the
algorithms and data structures that address the I/O cost and issues of space are
only going to increase in importance over time.
In data-intensive applications, optimizing for space means optimizing for time.
Introduction
Telemetry is the feedback you get from your production systems that tells you
what’s going on in there, all to improve your ability to make decisions about your
production systems. For NASA the production system might be a rover on Mars,
but most of the rest of us have our production systems right here on Earth (and
sometimes in orbit around Earth). Whether it’s the amount of power left in a
rover’s batteries or the number of Docker containers live in Production right now,
it’s all telemetry. Modern computing systems, especially those operating at scale,
live and breathe telemetry; it’s how we can manage systems that large at all. Using
telemetry is ubiquitous in our industry.
If you’ve ever looked at a graph describing site-hits over time, you’ve used
telemetry.
If you’ve ever written a logging statement in code and later looked up that
statement in a log-searching tool like Kibana or Loggly, you’ve used telemetry.
If you’ve ever configured the Apache web server to send logs to a relational
database, you’ve used telemetry.
If you’ve ever written a Jenkinsfile to send continuous integration test results to
another system that could display it better, you’ve used telemetry.
If you’ve ever configured GitHub to send webhooks for repository events,
you’ve used telemetry.
Software Telemetry is about the systems that bring you telemetry and display it in a way
that will help you make decisions. Telemetry comes from all kinds of things, from the
Power Distribution Units your servers (or your cloud provider’s servers) are plugged
into, to your actual running code at the very top of the technical pyramid. Taking that
telemetry from whatever emitted it and transforming it so your telemetry can be dis-
played usefully is the job of the telemetry system. Software Telemetry is all about that sys-
tem and how to make it durable.
Telemetry is a broad topic, and one that is rapidly changing. Between 2010 and
2020 our industry saw the emergence of three new styles of telemetry systems. Who
knows what we will see between 2020 and 2030? This book will teach you the funda-
mentals of how any telemetry system operates, including ones we haven’t seen yet,
which will prepare you for modernizing your current telemetry systems and adapting
to new styles of telemetry. Any time you teach information-passing and translation,
which is what telemetry systems do, you unavoidably have to cover how people pass
information; this book will teach you both the technical details of maintaining and
upgrading telemetry systems as well as the conversations you need to have with your
coworkers while you revise and refine your telemetry systems.
Any telemetry system has a similar architecture; figure 1.1 is one we will see often
as we move through parts 1 and 2 of this book.
Telemetry is data that production systems emit to provide feedback about what is
happening inside. Telemetry systems are the systems that handle, transform, store, and
present telemetry data. This book is all about the systems, so let’s take a look at the five
major telemetry styles in use today:
Centralized Logging: The first style of telemetry system, which emerged in the
early 1980s. This style takes text-based logging output from production systems
and centralizes it to ease searching. Of note, this is the only
technique widely supported by hardware.
Metrics: This grew out of the monitoring systems used by operations teams, and
was renamed “metrics” when software engineers adopted the technique. This
system emerged in the early 2010s, and focuses on numbers rather than text to
describe what is happening. This allowed much longer timeframes to be kept
online and searchable when compared with centralized logging.
Observability: This style grew out of frustration over the limitations of central-
ized logging and takes a more systematic approach to tracking events in a pro-
duction system. This emerged in the mid 2010s.
Figure 1.1 Architecture common to all telemetry systems, though some stages are often combined in
smaller architectures. The Emitting Stage receives telemetry from your production systems and
delivers it to the Shipping Stage. The Shipping Stage processes and ultimately stores telemetry. The
Presentation Stage is where people search and work with telemetry. The Emitting and Shipping stages
can apply context-related markup to telemetry, where the Shipping and Presentation stages can further
enrich telemetry by pulling out details encoded within.
Figure 1.2 A centralized logging system using Logstash, Elasticsearch, and Kibana as major
components. Telemetry is emitted from both production code and Cisco Hardware. This telemetry is
then received by Shipping Stage components; centralized in Logstash, and stored in Elasticsearch.
Kibana uses Elasticsearch storage to provide a single interface for people to search all logs.
Centralized logging supports not just software telemetry, but hardware telemetry as
well! The Syslog Server box in the figure represents the modern version of a system
that was first written around 1980 as a dedicated logging system for the venerable
sendmail program from the Berkeley Software Distribution 3BSD. By the year 2000
Syslog was in near universal use across Unix and Unix-like operating systems. A stan-
dardization effort in 2001 resulted in a series of RFCs that defined the Syslog format
for both transmission protocol and data format. Making Syslog a standard gave
hardware makers one option for emitting telemetry that wasn’t likely to change over
the decade lifespan of most hardware. The other option is SNMP, the Simple Network
Management Protocol, which is covered in Chapter 2.
I bring up Syslog because the concepts it brought to the table influenced much of
how we think about logging from software. If you’ve ever heard the phrase:

Turn on debug logging.

you’re using a concept introduced by Syslog. The concept of log-levels originated in
Syslog, which defined eight levels, seen here in table 1.1.
Where you see Syslog’s biggest influence is in the keywords in table 1.1. Not every
software logger builds all eight levels, but nearly all have some concept of debug, info,
warning, and error.
ID  Severity       Keyword
0   Emergency      emerg
1   Alert          alert
2   Critical       crit
3   Error          err
4   Warning        warning
5   Notice         notice
6   Informational  info
7   Debug          debug
If you’ve written software, chances are good you’ve used these levels at some time in your
career. This idea of log levels also introduces the idea that all logging has some con-
text along with it; a log-level gives context of how severe the event is, while the text of
the event describes what happened. Logging from software can include quite a lot
more context than simply priority and a message; Section 6.1 describes this markup
process in much more detail.
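To make the log-level concept concrete, here is a minimal sketch using Python's standard logging module; the logger name and messages are invented for illustration:

```python
import logging

# The configured level acts as a severity threshold, mirroring
# Syslog's debug < info < warning < error ordering.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("billing")

log.debug("cache miss for account lookup")  # suppressed: below INFO
log.info("transaction started")             # emitted
log.warning("retrying payment gateway")     # emitted
log.error("transaction failed")             # emitted
```

"Turning on debug logging" then simply means lowering the threshold to logging.DEBUG.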
The middle stage of centralized logging, represented in the figure as the Logstash
server, takes telemetry in the emitted format (Syslog for the Cisco hardware, whatever
the log-file format is for the production code) and reformats it into the format
needed by Elasticsearch. Elasticsearch needs a hash, so Logstash is taking the Syslog
format and rewriting it into Elasticsearch’s format before storing it in Elasticsearch.
This reformatting process is covered in Chapter 4.
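As a rough sketch of that reshaping (not Logstash's actual pipeline or field schema), here is how a classic BSD-style Syslog line might be parsed into the kind of key/value document Elasticsearch stores; the field names and sample line are invented:

```python
import json
import re

# Parse a classic "<PRI>timestamp host app: message" Syslog line.
# Field names here are illustrative, not Logstash's real schema.
SYSLOG_RE = re.compile(
    r"<(?P<pri>\d+)>(?P<timestamp>\w{3} +\d+ [\d:]{8}) "
    r"(?P<host>\S+) (?P<app>[^:]+): (?P<message>.*)"
)

def syslog_to_document(line: str) -> dict:
    m = SYSLOG_RE.match(line)
    if m is None:
        raise ValueError("unparseable syslog line")
    doc = m.groupdict()
    pri = int(doc.pop("pri"))
    doc["severity"] = pri % 8   # low three bits encode severity
    doc["facility"] = pri // 8  # remaining bits encode facility
    return doc

line = "<190>Mar 19 18:02:11 edge-sw01 ifmgr: port Gi0/3 link down"
print(json.dumps(syslog_to_document(line)))
```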
The end of the pipeline, represented in the figure as the Kibana server, uses Elas-
ticsearch as a database for queries. Section 5.2 goes into greater detail about what con-
stitutes a good system for this job for centralized logging. Kibana here is used by
people to access telemetry and assist with analysis.
Metrics systems encode the same information as a number plus some additional fields to provide
context. In this example case, a function-name is added for context, and a timer is used for
the number:
metrics.timer("Dangerous_Function_runtime", timer.to_seconds)
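Under the hood, a statement like the one above typically becomes a tiny wire-format message. Here is a sketch using the StatsD plain-text protocol, where a timer is sent as name:value|ms over UDP; the host, port, and surrounding code are illustrative:

```python
import socket
import time

def emit_timer(name: str, seconds: float,
               host: str = "127.0.0.1", port: int = 8125) -> bytes:
    # StatsD timers are milliseconds, formatted as "name:value|ms".
    payload = f"{name}:{seconds * 1000:.0f}|ms".encode()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, (host, port))  # fire-and-forget UDP datagram
    sock.close()
    return payload

start = time.monotonic()
_ = sum(range(1000))  # stand-in for the dangerous function's work
emit_timer("Dangerous_Function_runtime", time.monotonic() - start)
```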
Figure 1.3 A metrics system where the production software emits metrics from code into a StatsD
server, and the operating system has a monitoring package called collectd that collects system metrics
that reports directly to the graphite API of a Prometheus storage server. StatsD submits summarized
metrics into Prometheus by way of the graphite API. A Grafana server acts as the interface point for all
users of this metrics system, including both Operations and Software Engineering teams.
This example shows a metrics system being used for both software metrics and system
metrics. The system metrics are gathered by a monitoring tool called collectd, which
has the ability to push metrics into a Graphite API. Prometheus is a database custom-
built for storing data-over-time, or time-series data. Such time-series databases are the
foundation of many metrics systems, though other database styles can certainly be
used successfully. Grafana is an open source dashboarding system for metrics that is
widely used, and in this case is being used by both the Operations team running the
infrastructure and the Software Engineering team managing the production software.
Like centralized-logging telemetry, metrics telemetry is almost always marked up with
additional details to go with the number. In the statement before the figure, we add a single
field with Dangerous_Function_runtime as the value. Additional fields can certainly be added,
though doing so introduces complexity in the metrics database known as cardinality.
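To see why extra fields add cardinality, consider this small sketch (the tag names and values are invented): each distinct combination of tag values becomes a separate series the metrics database must index.

```python
from itertools import product

hosts = ["web-01", "web-02", "web-03"]
status_codes = ["200", "404", "500"]
endpoints = ["/login", "/search", "/checkout"]

# One metric name fans out into one series per tag combination.
series = {("http_request_duration", h, s, e)
          for h, s, e in product(hosts, status_codes, endpoints)}
print(len(series))  # 3 * 3 * 3 = 27 distinct series
```

Add one more tag with ten possible values and the count jumps tenfold, which is exactly the growth metrics databases are designed to keep modest.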
Cardinality is a big part of how metrics came to be its own discrete telemetry style.
Centralized logging, with all of the data it encodes, has the highest cardinality of all of
the five telemetry styles I talk about. It also takes up the most space by far. The combi-
nation of those two factors make centralized logging require the most complex data-
bases and the largest volume of data of any style. Due to budget constraints,
centralized logging systems can rarely keep data online and searchable for long.
Compare this to metrics, with its low cardinality and focus on easy-to-store numbers,
and you have a telemetry system that can keep years of telemetry online and searchable
for a fraction of the cost of centralized logging systems!
In the 2009 to 2012 era when metrics really began to be known as a telemetry style,
the cost savings versus centralized logging was one of the biggest drivers for adoption.
Centralized logging was still used, but being able to leverage a specialized telemetry
flow designed for the decision-type was a revolution—one that set up the next two
telemetry styles to come onto the scene.
Figure 1.4 An observability system using the honeycomb.io Software-as-a-Service platform for the
Shipping and Presentation stages. Production software emits observability telemetry into an ingestion
API operated by honeycomb.io. Honeycomb then stores this in their database, and presents the
processed telemetry in their dashboard. Use of a SaaS provider allows leveraging observability without
having to manage its complex infrastructure yourself.
allow (for security reasons). Doing it this way allows your production systems to gain
the benefit of observability without having to deploy and maintain the complex infra-
structure it requires. In my experience, SaaS companies dominate the Observability
marketplace for this very reason.
Figure 1.5 An example of the kind of display a distributed tracing system provides, following the flow
of execution similar to a stack-trace. Here we see a call to upload_document, but also all of the other
processes that upload_document called during its execution. When tracing a fault in a pdf_to_png
process, you will be presented with the full context of events leading up to that specific execution.
Tracing also came onto the scene in the late 2010s, so it is undergoing rapid development.
The OpenTracing project (https://opentracing.io) is an effort from major players in the US
tech industry to provide standards for communication and data-format for tracing. Programming
languages that are well-established and mostly overlooked by the Big Names You Know US tech
industry (languages like PHP, COBOL, and Perl) often lack a Software Development Kit for
tracing. Frustration by software
engineers is a prime driver of innovation, so I expect these underserved languages will
get the support they need before too long.
In spite of its newness, there are real-world examples of tracing to look at today.
Figure 1.6 gives us one real-world example.
Figure 1.6 An example of a distributed tracing system circa 2020. Production code is running an
OpenTracing SDK, which sends events to a system running the Jaeger open source tracing system.
The Jaeger collector then stores the event in a database. The Jaeger frontend provides a place to
search and display traces from production systems.
Because these requirements are so common, and because tracking and later correlating them is
so complex, a completely separate telemetry style has arisen: the Security Information Event
Management system, or SIEM. Due to the complexity of the task, SIEMs are almost always
paid-for software; there are very few, if any, open source projects that do this work.
As a telemetry system operator, you will spend most of your time connecting sources of
telemetry to a system that knows how to interpret the data. Figure 1.7 gives us one possible
architecture for integrating a SIEM into a larger telemetry system.
Figure 1.7 One possible SIEM system. Since SIEM systems are often derived from centralized
logging systems, here we see that the centralized logging flow and SIEM flow have an identical
source. When telemetry enters the Logstash server, it produces two feeds; one feed goes into
Elasticsearch for centralized logging, and a second feed is submitted to the Splunk SaaS API.
Splunk is acting as a SIEM in this case.
There are many different architectures; figure 1.7 is but one. Another is when Security
has installable agents running on host servers, which emit in a completely different
way than the centralized logging flows, making for a fully separate system. Both
approaches are viable.
Figure 1.8 The preferred telemetry styles for Operations, DevOps, and SRE
teams: centralized logging, because the infrastructure these teams manage
emits there by preference, and metrics, because that style is used for monitoring and
site-availability tracking. Use of the other three styles is quite possible, but the
majority usage is centralized logging and metrics.
Security incidents are special cases; when they happen, literally every source of telem-
etry is potentially useful during the investigation. If you are in a different team, be
ready to support incident responders with information about how to use and search
telemetry under your care. Security is everyone’s job.
Compliance with regulation and voluntary frameworks invariably requires keeping
certain kinds of telemetry around for years (often seven years, a number inherited
from the accounting industry). This long-term retention requirement is almost
unique among the telemetry styles here, with metrics being the only other style that
approaches SIEM systems for retention period.
Where Software Engineering teams are focusing on how their code is performing in
production, SRE teams are focused more on whether or not the code is meeting
promised performance and availability targets. This is a need related to what Software
Engineering desires, but the difference does matter. Software Engineering is very concerned
with failures and how they impact everything. SRE is concerned with overall, aggregated
performance.
The charge of teams of this type is to work with your customers (or users, or employ-
ees) and resolve problems. This team has the best information about how your produc-
tion system actually works for people, so if your Software Engineering and especially
SRE teams are not talking to them, something has gone horribly wrong in your organization.
This communication needs to go both ways, because when Customer Support
teams are skilled in using the telemetry systems used by Software Engineering, the
quality of problem reports improves significantly. In an organization where Customer
Support has no access to telemetry systems, problem reports come in sounding like:
Account 11213 had a failed transaction on March 19, 2028 at 18:02 UTC.
They say they’ve had this happen before, but can’t tell us when. They’re a
churn risk.
Compare this report to the kind of report your Customer Support teams can make if
they have access to query telemetry systems:
Account 11213 has had several failed transactions. The reported transaction
was ID e69aed5a-0dfc-47e2-abca-8c11374b626f, which has a failure in it when
I looked it up. That failure was found four more times with this account. I also
saw it happening for five other accounts and reached out to all five. Two have
gotten back to me and thanked us for proactively notifying them of the prob-
lem. It looks like accounts with billing-code Q are affected.
This second problem report is objectively far better because the work of isolating
where the actual problem may be hiding has mostly been done. You want to empower
your Customer Support teams. Figure 1.11 demonstrates the sort of telemetry systems
Customer Support makes the best use of.
Customer Support teams work with customers to figure out what went wrong,
which means they are most interested in events that happened recently. Telemetry sys-
tems that rely on aggregation (metrics) are not useful because the single interesting
event is not visible. Telemetry systems that rely on statistical sampling (observability)
can be somewhat useful, but the interesting error needs to be in the sample. This problem
can be worked around by ensuring you persist error-events outside of the statistical sample,
perhaps in a second errors database.
Figure 1.11 The telemetry systems Customer Support teams make the best use
of. Because Customer Support teams are most interested in specific failures,
telemetry styles that rely on aggregation (metrics) are not as useful. Note that in
cases where Customer Support is more of a Helpdesk for internal users, SIEM
access is often also granted and useful.
Explain how a new telemetry style works, provide a framework for how it would
operate in your existing production systems, and point out how that will
improve prioritization of work.
Personally, I spent 14 years in the public sector, seven of which were during recessions.
Organizations like these are at the mercy of an annual or biannual budgeting
process where a group of (rarely technical) elected officials will ultimately decide
whether or not you get your expensive new system. This is hard work, but it can be
done. Make the case, do it well, and plan far enough in advance (months if not years)
that you won’t be in a panic if the answer comes down to not this year.
Figure 1.12 The five telemetry styles charted for their preferred online availability periods.
SIEM systems have the longest retention due to external requirements. Observability and
distributed tracing achieve their retention through the use of statistical sampling. Metrics
achieves its duration through aggregations on the numbers stored inside. Centralized logging,
well, is just plain expensive so it gets the smallest online retention period.
A one-policy-applies-to-all approach simply will not work for a telemetry system. Your
retention policies need to be written in ways that accommodate the diverse needs of
your teams and telemetry systems in use. Chapter 17 is dedicated to this topic.
There is also diversity in the shape of your telemetry data itself.
Hardware emits in Syslog or SNMP, and if you can’t handle that then you’re
simply not going to get that telemetry.
Telemetry SaaS provider SDKs might not have support for emitting telemetry
through HTTP proxies, a required feature in many production environments.
Platform services like VMWare vCenter have their own telemetry handling systems.
Infrastructure providers like Amazon Web Services and Digital Ocean provide
telemetry in their own formats and in their own locations, leaving it up to you
to fetch and process it.
Operating System components (Windows, Linux, FreeBSD, AIX, HP-UX, z/OS,
etc.) emit in their own formats like Syslog or Windows Event Log. If you want
that telemetry, you will need to handle those formats.
The programming languages your production systems are written in (and their age) can
prevent you from even having access to SDKs for observability and distributed tracing.
Challenges like these certainly increase the complexity of your telemetry system, but
they’re not insurmountable. Chapters 3 and 4 cover methods of moving telemetry
(Chapter 3) and transforming formats (Chapter 4). If you happen to be using a language or
platform unloved by the hot-new-now tech industry, you’re likely already used to building
support for new things yourself; I’m sorry (she says, having run Tomcat apps on NetWare,
successfully).
For the rest of us, keeping privacy- and health-related data out of the telemetry
stream is a never-ending battle:
The biggest leak source is exception-logging with parameters. Parameters are
incredibly useful for debugging, but can include privacy- and health-related
data. By far this is the largest source of leaks I’ve seen in my own systems. This is
made worse by the fact that many logger modules don’t have redaction con-
cepts baked into them, and software engineers aren’t used to thinking of excep-
tions as needing in-code redaction before emission.
Unthinking inclusion of IP address and email addresses in logging statements.
Both of these are useful for fighting fraud and isolating which account a state-
ment is about. Unfortunately, IP addresses and email addresses are protected by
many privacy regulations. If you simply must include these details, consider
hashing them instead to provide correlation without providing the direct values.
Inclusion of any user-submitted data of any kind in logging statements. Users
will stuff all kinds of things they shouldn’t into fields. Unfortunately for you, many
privacy- and health-data regulations require you to detect and respond to leaks of
this type. If you are in a place subjected to that kind of law (ask your lawyers) and
have a bug-bounty program, expect to pay out bounties to bug-hunters who find
ways to display user-supplied input on unprotected dashboards. It’s best not to
emit user-submitted data into your telemetry stream in the first place.
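The hashing suggestion above can be sketched like this, using a keyed hash (HMAC) so telemetry carries a stable correlation token instead of the raw email address; the key, its source, and the truncation length are all illustrative:

```python
import hashlib
import hmac

# In real use the key comes from your secrets store and is rotated;
# it is hard-coded here only to keep the sketch self-contained.
PSEUDONYM_KEY = b"rotate-me-from-a-secrets-store"

def pseudonymize(value: str) -> str:
    digest = hmac.new(PSEUDONYM_KEY, value.strip().lower().encode(),
                      hashlib.sha256).hexdigest()
    return digest[:16]

# The same account always yields the same token, so events still
# correlate, but the raw address never enters the telemetry stream.
assert pseudonymize("User@Example.com") == pseudonymize("user@example.com")
```

Using a keyed hash rather than a bare SHA-256 matters: without the key, anyone holding the telemetry could confirm a guessed email address by hashing it themselves.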
As deeply annoying as it is, you simply must have policies and procedures in place to
retroactively remove mistakenly stored privacy- and health-related information. Legislation
making these types of data toxic hasn’t been around long enough for telemetry-handling
modules to include in-flight redaction as a standard feature alongside log-levels. Hopefully
this will change in the future. Until then, we have to know how to clean up toxic spills.
Chapter 16 covers this topic extensively.
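Until then, redaction can be bolted on in code. Here is a sketch using Python's logging.Filter; the email pattern is deliberately simplistic and illustrative, not a complete PII detector:

```python
import logging
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

class RedactingFilter(logging.Filter):
    """Scrub email addresses from log messages before emission."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = EMAIL_RE.sub("[REDACTED-EMAIL]", str(record.msg))
        return True  # keep the (now scrubbed) record

log = logging.getLogger("app")
log.addFilter(RedactingFilter())
```

A production version would also scrub record.args and cover more identifier types, but the shape is the same: intercept the record before it reaches any handler.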
Figure 1.13 A greatly simplified flow of the document discovery process as it relates to
telemetry data. Your lawyers will be fighting on your organization’s behalf to reduce telemetry
you have to give to the other side. You can help this process out by teaching your lawyers what
can and can’t be produced by your telemetry system.
Request to hold documents. Of the two demands, this is the most impactful to
you, the telemetry system operator. A request to hold documents means you have
to exempt certain telemetry from your aggregation, summarization, and deletion
policies. Because legal matters can take literally years to resolve, in bad cases you
can end up having to store many multiples of your usual telemetry volumes.
Not every organization needs to prepare for lawsuits to such a degree that they have
well-tested procedures for producing and holding telemetry. However, certain indus-
tries are prone to lawsuits (finance, drug manufacture, patent law). Also, certain kinds
of lawsuits, such as those suing for leaking toxic data (see previous section) and insider-
sabotage, are far more likely to dive into telemetry data. You should have at least a
whiteboard plan for what to do when facing a court order. Chapter 18 covers this topic.
To help guide this, I will be using examples drawn from three different styles of tech-
nical organizations. Know that what I teach here is applicable to a growing startup, to
companies with a founding date in the 1700s, and to organizations where writing and
running software only supports the business and is not the business.
Summary
Telemetry is the feedback you get from your production systems.
Telemetry is how modern computing works at all, because telemetry is what tells
us what our production systems are up to.
Telemetry ultimately supports the decisions you have to make about your produc-
tion systems. If your telemetry systems are poor, you will make poor decisions.
Centralized logging was the first telemetry style to emerge, in the mid-1980s;
it brings all logging produced by your production systems to a central location.
Logging format standards like Syslog mean that hardware systems emit in stan-
dard formats, so you need to support those formats as well if you want telemetry
from hardware systems.
Syslog introduced the concept of log-levels (debug, info, warn, error) to the
industry.
Metrics emerged in the early 2010s and focuses on aggregatable numbers to
describe what is happening in your production systems.
Cardinality is the term for index complexity in databases. The more fields in a
table, the higher the cardinality. Centralized logging is a high-cardinality sys-
tem; metrics systems generally are low-cardinality.
Observability grew up in the mid-2010s as a result of frustration over the limita-
tions of centralized logging, and takes a systematic approach to tracking events
in your production systems.
Observability provides extensive context to events, which makes isolating prob-
lems much easier versus centralized logging.
Software-as-a-Service companies dominate the observability space due to the
complexity of observability systems.
Distributed tracing emerged in the late 2010s as a specialized form of observabil-
ity focused on tracing events across an execution-flow crossing system boundaries.
Distributed tracing provides the context of the entire execution-flow when
investigating an interesting event, which further improves your ability to isolate
where a problem truly started.
Security Information Event Management systems are specialist telemetry sys-
tems for Security and Compliance teams, and store information relating to the
security use-case.
SIEM systems store consistent information because regulation and voluntary
compliance frameworks largely ask for tracking of the same kinds of data, and
often require such data to be stored for many years.
Operations and DevOps teams use telemetry to track how their infrastructure
systems are operating, focusing on centralized logging and metrics styles.
Security and Compliance teams focus on both centralized logging and SIEM
systems because SIEM systems share a lot of history with centralized logging,
and centralized logging is useful during audits for compliance with regulation
and external compliance frameworks.
Software Engineering teams use every telemetry system except SIEM systems in
an effort to understand how their code is behaving in production.
Site Reliability Engineering teams also use every telemetry system except SIEM
in their mission to ensure the organization's software is available.
Customer Support teams make use of centralized logging, observability, and dis-
tributed tracing styles to better isolate problems reported by customers, and to
improve the quality of bug reports sent to engineering.
Business Intelligence teams are often not part of the technical organization but
are responsible for building systems for business telemetry. BI people are valu-
able resources when deploying a new telemetry style, due to their familiarity
with statistical methods.
A chronic threat to telemetry systems is under-investment, which can stem from
a misunderstanding of the value telemetry systems bring to the organization
and a disconnect between decision-makers and those feeling the pain of a bad
telemetry system.
Different teams need different things from telemetry, and different telemetry
styles benefit from different retention periods. Your telemetry system needs to
accommodate these differences in order to be a good telemetry system.
Hardware, SaaS providers, infrastructure providers, third party software, and
operating systems all emit telemetry in relatively fixed formats. Your telemetry
system needs to handle these formats if you want their telemetry.
Personally identifiable information (PII) and personal health information (PHI)
require special handling, and most telemetry systems aren't built for that usage.
Do what you can to keep PII and PHI out of your telemetry systems.
The largest source of toxic information spills is exception-logs that include
parameters; do what you can to redact these before they enter the telemetry
pipeline.
Telemetry systems are subject to court orders the same as production systems
are, so you may be called upon to produce telemetry by your lawyers for a legal
matter.
A court order to hold data means the affected data is no longer subject to your
retention policy, which can be quite expensive if the legal matter drags on for years.