Women in Tech Anthology


Save 50% on these books and videos – eBook, pBook, and MEAP.
Enter mewita50 in the Promotional Code box when you checkout. Only at manning.com.

Software Telemetry
by Jamie Riedesel
ISBN 9781617298141 | 475 pages / $47.99

The Quick Python Book, Third Edition
by Naomi Ceder
ISBN 9781617294037 | 472 pages / $31.99

Blockchain in Action
by Bina Ramamurthy
ISBN 9781617296338 | 352 pages / $35.99

AWS Machine Learning in Motion
by Kesha Williams
Course duration: 3h 42m | $39.99

Java SE 11 Programmer I Certification Guide
by Mala Gupta
ISBN 9781617297465 | 725 pages / $39.99

Build a Career in Data Science
by Emily Robinson and Jacqueline Nolis
ISBN 9781617296246 | 354 pages / $31.99

Cloud Native Patterns
by Cornelia Davis
ISBN 9781617294297 | 400 pages / $39.99

Spring Microservices in Action, Second Edition
by John Carnell, Illary Huaylupo Sánchez
ISBN 9781617296956 | 453 pages / $39.99

Get Programming
by Ana Bell
ISBN 9781617293788 | 456 pages / $27.99

Getting Started with Natural Language Processing
by Ekaterina Kochmar
ISBN 9781617296765 | 325 pages / $31.99

Get Programming with Python in Motion
by Ana Bell
Course duration: 6h 40m | $27.99

Algorithms and Data Structures for Massive Datasets
by Dzejla Medjedovic, Emin Tahirovic, and Ines Dedovic
ISBN 9781617298035 | 325 pages / $47.99

Isomorphic Web Applications
by Elyse Kolker Gordon
ISBN 9781617294396 | 320 pages / $31.99

Get Programming with Scala
by Daniela Sfregola
ISBN 9781617295270 | 475 pages / $39.99

JavaScript on Things
by Lyza Danger Gardner
ISBN 9781617293863 | 448 pages / $31.99

OCA Java SE 8 Programmer I Certification Guide
by Mala Gupta
ISBN 9781617293252 | 704 pages / $47.99

Hello Scratch!
by Sadie Ford and Melissa Ford
ISBN 9781617294259 | 384 pages / $27.99

Learn dbatools in a Month of Lunches
by Chrissy LeMaire
ISBN 9781617296703 | 400 pages / $39.99

Anyone Can Create an App
by Wendy L. Wise
ISBN 9781617292651 | 336 pages / $23.99

Hello App Inventor!
by Paula Beer and Carl Simmons
ISBN 9781617291432 | 360 pages / $31.99

Parallel and High Performance Computing
by Robert Robey and Yuliana Zamora
ISBN 9781617296468 | 600 pages / $55.99

Learn Quantum Computing with Python and Q#
by Sarah C. Kaiser
ISBN 9781617296130 | 360 pages / $39.99

Practical Data Science with R, Second Edition
by Nina Zumel
ISBN 9781617295874 | 568 pages / $39.99

Women-in-Tech Anthology

Copyright 2020 Manning Publications


To pre-order or learn more about these books go to www.manning.com



For online information and ordering of these and other Manning books, please visit
www.manning.com. The publisher offers discounts on these books when ordered in quantity.

For more information, please contact

Special Sales Department


Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: Candace Gillhoolley, corp-sales@manning.com

©2020 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by means electronic, mechanical, photocopying, or otherwise, without prior written
permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are
claimed as trademarks. Where those designations appear in the book, and Manning
Publications was aware of a trademark claim, the designations have been printed in initial caps
or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have
the books we publish printed on acid-free paper, and we exert our best efforts to that end.
Recognizing also our responsibility to conserve the resources of our planet, Manning books
are printed on paper that is at least 15 percent recycled and processed without the use of
elemental chlorine.

Manning Publications Co.


20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

Project Manager: Candace Gillhoolley


Cover designer: Leslie Haimes

ISBN: 9781617299292
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 - EBM - 24 23 22 21 20 19



contents
PART I PERSONAL ESSAYS
Balancing Family & Career in the Tech Industry – Can you have it all?
by Anne Michels
Staying True to Yourself as a Woman in Tech
by Lizzie Siegle
A letter to my past self...
by Naomi Ceder
Hello, younger me,
by Jamie Riedesel

PART II ANTHOLOGY OF CHAPTERS BY WOMEN IN TECHNOLOGY
About Python
from The Quick Python Book, Third Edition by Naomi Ceder
You keep using that word: Defining "cloud-native"
from Cloud Native Patterns by Cornelia Davis
Introducing Quantum Computing
from Learn Quantum Computing with Python and Q# by Sarah Kaiser
Introduction
from Algorithms and Data Structures for Massive Datasets by Dzejla Medjedovic, Ph.D.
Introduction
from Software Telemetry by Jamie Riedesel



Part I

Personal Essays

Women are steadily chipping away at the large and long-lived gender gap in
technology, standing shoulder to shoulder with their male counterparts, and garner-
ing much-deserved attention for their tech talents, intelligence, innovative spirit, and
sheer determination. It’s heartening to see the many initiatives cropping up, such as
Girls Who STEAM, Girls Who Code, TechGirlz, and she++, all striving to encourage
young girls and women to pursue careers in technology. These and other female-
focused initiatives on the rise promise that the tech industry’s gender gap will con-
tinue to close further and faster, and that women will be more likely to follow their
hearts—and their talents—into the tech fields that ignite their passions.
We asked a small handful of the many women technologists we know and highly
respect to offer their insights on being a woman in technology. The next four
entries in this mini ebook are their gracious and thoughtful contributions, filled
with spirited anecdotes, valuable lessons, and even some sage advice for their
younger selves. We hope their words inspire you, bolster you, and remind you that
you’re not alone in the gender-related challenges you face. They want you to know
that all the successes achieved by them—indeed, by all the bright and persevering
women unapologetically smashing that glass ceiling every day—are your successes
too, because each stride that you, as women, collectively make in this field further
dispels the myth that technology careers are not for you. These women we’re
proudly featuring here also invite you to blaze ahead and create a new story about
technology, with women in starring roles. As the highly lauded NASA rocket scien-
tist Dr. Camille Wardrop Alleyne is credited with saying, “Step out of the box. Chart
your own course. Leave your mark on our world.”


About the authors:

Anne Michels is a Senior Product Marketing Leader at Microsoft and has
been developing and executing effective marketing strategies at scale for For-
tune 500 brands for over a decade. An award-winning, impassioned, and
engaging speaker, she masterfully translates product features to customer
value. She has an exceptional track record for attracting, retaining, and deve-
loping employees, with a focus on inclusion and diversity.

Lizzie Siegle is a software engineer, machine learning enthusiast, developer
evangelist at Twilio, and Twitch affiliate. She’s been featured on chapter1.io,
Seventeen Magazine, JS Jabber, Technical.ly Philly, and the Women of Silicon
Valley. She co-directed Spectra, the largest women’s hackathon, and has
served as ambassador for she++, an organization that strives to empower
underrepresented groups in technology. She frequently gives talks on topics
related to machine learning and artificial intelligence.

Naomi Ceder is a Pythonista, web development expert, and chair of the
Python Software Foundation’s board of directors. She advocates for open soft-
ware and content, and is a proponent of teaching programming, especially
Python, to middle and high school students. An accomplished speaker, she
captivates audiences around the world with her talks about the Python com-
munity as well as inclusion and diversity in technology.

Jamie Riedesel is a Staff Engineer at Dropbox with over twenty years of experi-
ence in IT, working in government, education, legacy companies, and startups.
She’s a member of the League of Professional System Administrators (LOPSA),
a nonprofit global corporation for advancing the practice of system administra-
tion. She is also an elected moderator for ServerFault, a Q&A site focused on
professional systems administration and desktop support professionals.



Balancing Family &
Career in the Tech Industry
– Can you have it all?
Many believe that the reason there aren't more women in tech is that companies
are not giving women, especially mothers, the flexibility they need to pursue their
career and raise a family at the same time.
While the question of how to balance motherhood and a full-time job is one
that every working woman who is thinking about starting a family is faced with, it is
especially relevant in the tech industry, known for its fast pace and demanding
nature with jobs that sometimes—or often—require employees to put in long
hours, join early morning meetings, and travel for business.
In such a work environment, having a fulfilling career and a rich family life can
feel like an impossible task. With only 24 hours in the day, how are you supposed to
do it all? How can you ensure you have enough time for family life when you work
full time? As a result, many women across industries choose to put their careers on
hold at some point. According to the Harvard Business Review, 43% of highly qual-
ified women with children leave their careers or take a career break.

The tech industry is leading the way with generous parental leave benefits

In an effort to change this and to attract and retain female talent, the tech industry
has become more family-friendly over the last couple of years. More and more tech
companies have started to offer generous (at least by U.S. standards) maternity
leaves, flexible working hours, and options for remote or part-time work. For exam-
ple, if you work for Microsoft or Amazon, you can welcome your child by taking up
to 20 weeks off, Google offers 22-24 weeks of leave, and Netflix allows employees to
take unlimited maternity leave during the first year after the birth of their child.
The tech industry, once known for its “bro culture”, is leading the way in show-
ing other industries how to empower women to have children and fulfill their
career aspirations. The list of successful mothers who master the juggle has become
long, spotlighting women including Susan Wojcicki, mother of five and CEO of You-
Tube; Sheryl Sandberg, mother of two and COO of Facebook; Gwynne Shotwell,
mother of two and President and COO of SpaceX; and Fei-Fei Li, mother of two, pro-
fessor of Computer Science at Stanford University and a member of Twitter’s board of
directors. These women and others like them are inspiring the next generation of
female developers, software engineers, programmers, and marketers.

The disastrous effects of COVID-19


While the tech industry at large is moving in the right direction, COVID-19 is having a
disastrous effect on the career of many women. With school and daycare closures,
families need to make adjustments. And in many cases, men are the primary bread-
winners, so it's women who bear the greater proportion of childcare and household
responsibilities and put their careers on hold.
I've talked to many women in the last few weeks whose families have been affected
by school and daycare closures. All were exhausted and struggling to navigate the
impossible task of managing full-time childcare while working a full-time job.
Research indicates that the trend of working from home has increased the workload
of women employees twofold to threefold. Not surprisingly, many women I talked to
are thinking about taking a break or even quitting their jobs for a while. And all were
concerned about what impact this would have on their career.
At the same time, COVID-19 is bringing to light how family-friendly the tech indus-
try has become. Many companies offer flexible work schedules or even additional
leave options to support parents who have been affected by school and childcare clo-
sures. Microsoft, for example, provides three months of paid “Pandemic Parental
Leave” to help employees navigate these new challenges.
As a mother who works in technology—and who has taken advantage of Micro-
soft’s pandemic leave—I really hope more companies will follow this example. We’ve
been on the right track to make the tech industry truly a place where mothers can
pursue a successful career. But we’ll all need to work together to make sure that
COVID-19 does not throw us back.
—Anne Michels



Staying True to Yourself
as a Woman in Tech

I once met a woman in my office bathroom and asked if she was in sales because
I’d never seen her around. She was an engineer! I’d just presumed she was not tech-
nical. I immediately apologized, and she understood, but the damage was done.
As a woman engineer, I know all too well the experience of having to defend my
technical skills and chops, and of having some men presume I don’t know what
I’m talking about or that I don’t understand code.
I used to attend a lot of college hackathons where I was surrounded by people
who fit all the stereotypes I held of engineers: playing video games on their breaks,
drinking Soylent, staying up all night coding, wearing sweatpants or maybe jeans
and a tech company tee shirt, lots of hoodies, and collecting swag from companies
at those hackathons and conferences. Even though I loved (and still love) shopping,
I started to live for company tee shirts, stickers, and water bottles. I thought wearing
and using those things made me belong.
When I joined my first tech company in 2016, I tried to be one of the guys. And
by “guys” I mean male engineers. I love Disney, Broadway, Taylor Swift, and some
fairly feminine things, but I tried to hide it all. I didn’t laugh loudly. I wore tech
company tee shirts and hoodies and jeans, and though I felt like I belonged, like the
engineers took me seriously, it didn’t make me feel good about myself.
It’s important to note that it wasn’t the company that made me feel like I had to
fit in like that, and it wasn’t the engineers—it was me. It was all internalized.
I'm very grateful and lucky my manager at the time was an Asian woman who hap-
pened to love cats, pink, and cute things too. She had the respect of engineers while
wearing dresses, polka-dots, and makeup, so I started to do those things as well, and
my happiness and confidence grew. My only hope is that other underrepresented
people in tech can find themselves, be themselves, and pass it on to others.
—Lizzie Siegle
Developer Evangelist [she, her]


A letter to my past self...

Dear Younger Me,


Hello from 25 years in your future. Writing from 2020, I’ve got some bad news
and some good news. Actually, there’s quite a bit of bad news, but some things are
best not to know in advance. But I can give you a hint—you’re going to be working
from home... a lot.
The good news is that you made it to 2020, and like a lot of other people around
the world you make a living writing code (even Python code, imagine that!), creat-
ing new ways of doing things in every facet of human activity.
Sadly there are a few things we’d hoped to have sorted out that still elude us—
things like equity in hiring and the end of racism, sexism, and homo- and transpho-
bia. And we still don’t have flying cars either, sorry. Which means that in addition to
dealing with surface traffic, women and marginalized folks have a lot to deal with.
You're going to hear a lot about “impostor syndrome”, and you’ll be stunned by
how much it applies to you. Take comfort that it’s not some personal failing; it’s just
your brain’s reaction to being told over and over again that you don’t belong. And
you can manage it by focusing on what really matters to you and by being as trans-
parent as you can be about what you know and who you are. After all, there’s no
way people can out you if you’re out already.
Finally, every bit of community work you do will be beyond worth it. Find the
communities (tech and social) that fit you, join them and contribute your time,
your effort, and yourself as generously as you can. The people in those communi-
ties will be your support and your edge when you need them.
—Naomi Ceder



Hello, younger me,

This is from 2020, which feels like half a lifetime away. I’ve seen some things—many
I can’t share for what I hope are obvious causality reasons, but there are a few
things to share that will make life easier and not destroy the timeline. I know where
you are right now, and the problems you are facing (mostly... time is like that).
There are three big bombshells you need to hear:
 Remember the bragging guy at a party a couple years ago, who talked about
his income goals? You can do that by age 30 (you're almost there already).
And double it again by 40. Another doubling by 50 is not nearly as hard as
you think.
 You pass way better than you think you do. Once you get your facial hair
zapped, you will be surprised. We won some genetic lotteries you won’t find
out about until you try. Also, your hair has amazing curls in it once you let it
come out; the haircare routine you were taught at home is custom-built to
prevent curls from escaping. Find a new one.
 You will eventually be a woman in tech, and you are very wrong about the
headwinds you will face. That’s some sexist baloney you’ve internalized; stop,
think, and get your butt to WisCon sometime. The rest of your life is way hap-
pier, which gives you the emotional resilience to push your way into the tent
once people stop granting you male privilege.
There is a word you need to be on the lookout for: genderqueer. It’s out there right
now in small places, but it really comes into its own in a few years. No, you’re not
quite trans enough for the standard trans narrative; but you’re sure as heck gender-
queer enough for the queers. The gender in your head does not have to match the
gender you present to the world, and that is power.


Your current office, with all the women around you, is amazing and I’m so glad we
had that as our first post-college job. Unless you deliberately seek out such environ-
ments, this is the only time in your career this happens. You are learning communica-
tion skills right now that will be incredibly valuable when you’re surrounded by
nothing but men. Half of getting our Operations-style job done is hearts-n-minds
work, so knowing how people communicate is a critical skill. Continue to watch how
people relate to each other—diverse environments like where you are now are defi-
nitely not common.
Surviving all-guy offices will take some getting used to. This is made far easier if
you have a community outside of your work that you fit into. Part of the disconnect
you will feel in offices like that is the lack of people who are like you, so you have to
find them outside.
Believe it or not, you will eventually have an all-coding job. I know we went into
Ops specifically to avoid programming every day. However, as the years roll on, and
more abstractions are placed on top of existing abstractions, which are further
abstracted—eventually we get to a place where you will write code to define the infra-
structure you are managing (it seems impossible, but know that hardware is merely
the bottom layer of the abstraction-layer-cake). You will write more code to define how
the operating systems are configured. More code to define how software is installed
into your infrastructure. The future is code, so it’s a good idea to pay some attention
to how software is built. I write code every day, but I’m still doing SysAdmin work.
Your superpower will continue to be how we synthesize information to come up
with connections, theories for how those connections are present, and the ability to
describe those connections to others. This is how we pull rabbits out of hats (we keep
doing that, by the way), but that skill is incredibly useful when diagnosing problems
across a large, distributed system. Yes, distributed systems are in your future, and
they’re as fun to work on as you hope.
Work on writing documentation: both the runbook 1-2-3-4 kind and the abstract
‘why this was put together this way’ kind. Work on storytelling, because describing
complex technical problems is easier on the listener if there is a narrative through-
line. Writing shouldn’t be your main job (resist it if people urge you that way), but
being good at writing will make you stand out in the job you do have.
Finally, the hard stuff. You won’t like this.
Beware of technical cul-de-sacs. Once a technology is only used in niches, it will
take you five times as long to change jobs to get away from a toxic one. It hurts to leave
a community behind, but you need to think about your mental well-being. The writing
is already on the wall for one technology you’re working on—don’t stay too long.
There will be others.
Beware of jobs with no advancement potential. I lost seven years in one where
there was no path to promotion; the one opportunity I had was due to a retirement, and
was into management (which I didn’t want, and I’m so very glad I didn’t try). Ask
about career ladders during your interviews, and how they invest in employees.


You will spend a depressing amount of time between now and 2020 in recessions.
The business cycle sucks sometimes, and it means you’ll likely end up sticking with a
job you don’t like simply because there aren’t any other jobs to be had. Make friends
online, network. It will help you to make the leap, when leaping is needed.
There is a reason you don’t really feel a strong connection to the technical com-
munities you’re in right now: they’re not like you. They don’t communicate like you.
Communities with people like you (and who you will become) are out there, and feel
far more true. Seek them out. You’re prone to magnifying risks; know that about your-
self, and take the risk to figure out if you’re wrong. Life is better that way.
—Jamie Riedesel



Part II

Anthology of Chapters
by Women in Technology

How we work, play, and experience the world changes and evolves as techno-
logy does. And just as technology has been shaping our world throughout the ages,
women have been helping to shape technology.
Take, for example, tech pioneer Grace Hopper’s early work in programming lan-
guages, which spawned COBOL, a language that went on to become the backbone of
the financial industry and remains a top choice for business solutions. Or Hedy
Lamarr’s spread-spectrum communication technology, initially designed for use in military
applications, but which later became the foundation for modern Wi-Fi. Consider net-
work maven Marian Croak’s work on Voice over Internet Protocol (VoIP), which led to the network tech-
nology that expands our reach across the globe and brings us all together in a new and
better way than ever before. And how about taking our (gaming) hats off to visionary
Roberta Williams whose groundbreaking and prolific “King’s Quest” adventure game
series helped carve a path for the digital games that take us out of our daily grind and
into amazing immersive experiences.
Imagine how different the world would look without these innovations. Then,
imagine the exciting technological advances still on the horizon from women includ-
ing Sadaf Monajemi, whose med-tech company, See-Mode Technologies, saves money,
time, and lives by helping medical professionals predict strokes in high-risk patients;
Gina Trapani, who heads up projects at ExpertLabs, a think tank that aims to gain fed-
eral support for world-changing technological initiatives; and Lisa Seacat DeLuca,
who’s been granted over 400 patents, and has been named one of MIT’s 35 Innovators
Under 35, one of Fast Company’s 100 Most Creative People in Business, and #1 on
LinkedIn’s list of Top Voices in Technology, along with earning many other distin-
guished accolades.


In this collection of chapters, we celebrate these bright, determined women and
others like them by featuring the first chapters of five Manning books written by
women technology experts whom we are proud to call Manning authors:

Naomi Ceder is a Pythonista, web development expert, and chair of the
Python Software Foundation’s board of directors. She advocates for open soft-
ware and content, and is a proponent of teaching programming, especially
Python, to middle and high school students. An accomplished speaker, she
captivates audiences around the world with her talks about the Python com-
munity as well as inclusion and diversity in technology.

Cornelia Davis is Chief Technology Officer at WeaveWorks. A teacher at
heart, she’s spent the last 25 years making good software and great software
developers. Of her work with Girls Who Code and GapJumpers, both organi-
zations that are helping to close the gender gap in the field of technology, she
says, “I am passionate about one simple thing: I just want everyone–women,
men, non-binary, people of color, caucasians, everyone–to have a chance to
have as much fun as I do! This is a great field that offers tremendous financial
stability and flexibility, and it’s just a total blast to boot.”

Sarah Kaiser completed her PhD in physics (quantum information) at the
University of Waterloo’s Institute for Quantum Computing, where her work
and actions earned her their Equity & Inclusivity Award. She’s spent much of
her career developing new quantum hardware in the lab, from satellites to
hacking quantum cryptography hardware. She loves finding new demos and
tools that help the quantum community grow.

Dzejla Medjedovic is Assistant Professor of Computer Science at the Interna-
tional University of Sarajevo. She earned her PhD in the Applied Algorithms
Lab of the computer science department at Stony Brook University, NY. She’s
worked on numerous projects in algorithms for massive data, taught algo-
rithms at various levels, and also spent some time at Microsoft.

Jamie Riedesel is a Staff Engineer at Dropbox with over twenty years of experi-
ence in IT, working in government, education, legacy companies, and startups.
She’s a member of the League of Professional System Administrators (LOPSA),
a nonprofit global corporation for advancing the practice of system administra-
tion. She is also an elected moderator for ServerFault, a Q&A site focused on
professional systems administration and desktop support professionals.

Each one of these amazing women represents countless others everywhere who are
making their marks in technology with their bold dedication, inventive spirits, and
impressive technological talent. We salute you all, and we can’t wait to see what you do
next!



Chapter 1 from The Quick
Python Book, Third Edition
by Naomi Ceder

About Python

This chapter covers


 Why use Python?
 What Python does well
 What Python doesn’t do as well
 Why learn Python 3?

Read this chapter if you want to know how Python compares to other languages
and its place in the grand scheme of things. Skip ahead—go straight to chapter 3—
if you want to start learning Python right away. The information in this chapter is a
valid part of this book—but it’s certainly not necessary for programming with
Python.

1.1 Why should I use Python?


Hundreds of programming languages are available today, from mature languages
like C and C++, to newer entries like Ruby, C#, and Lua, to enterprise juggernauts
like Java. Choosing a language to learn is difficult. Although no one language is the
right choice for every possible situation, I think that Python is a good choice for a
large number of programming problems, and it’s also a good choice if you’re learn-
ing to program. Hundreds of thousands of programmers around the world use
Python, and the number grows every year.
Python continues to attract new users for a variety of reasons. It’s a true cross-plat-
form language, running equally well on Windows, Linux/UNIX, and Macintosh plat-
forms, as well as others, ranging from supercomputers to cell phones. It can be used
to develop small applications and rapid prototypes, but it scales well to permit devel-
opment of large programs. It comes with a powerful and easy-to-use graphical user
interface (GUI) toolkit, web programming libraries, and more. And it’s free.

1.2 What Python does well


Python is a modern programming language developed by Guido van Rossum in the
1990s (and named after a famous comedic troupe). Although Python isn’t perfect for
every application, its strengths make it a good choice for many situations.

1.2.1 Python is easy to use


Programmers familiar with traditional languages will find it easy to learn Python. All
of the familiar constructs—loops, conditional statements, arrays, and so forth—are
included, but many are easier to use in Python. Here are a few of the reasons why:
 Types are associated with objects, not variables. A variable can be assigned a value of
any type, and a list can contain objects of many types. This also means that type
casting usually isn’t necessary and that your code isn’t locked into the strait-
jacket of predeclared types.
 Python typically operates at a much higher level of abstraction. This is partly the result
of the way the language is built and partly the result of an extensive standard
code library that comes with the Python distribution. A program to download a
web page can be written in two or three lines!
 Syntax rules are very simple. Although becoming an expert Pythonista takes time
and effort, even beginners can absorb enough Python syntax to write useful
code quickly.
Python is well suited for rapid application development. It isn’t unusual for coding an
application in Python to take one-fifth the time it would in C or Java and to take as lit-
tle as one-fifth the number of lines of the equivalent C program. This depends on the
particular application, of course; for a numerical algorithm performing mostly inte-
ger arithmetic in for loops, there would be much less of a productivity gain. For the
average application, the productivity gain can be significant.
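
To make the earlier claim concrete, that a program to download a web page can be
written in two or three lines, here is a minimal sketch using only the standard
library’s urllib module (example.com is simply a placeholder URL, and this sketch is
an illustration rather than code from this chapter):

from urllib.request import urlopen

# Fetch a page and show the first 100 bytes of the response body.
page = urlopen("https://example.com").read()
print(page[:100])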

1.2.2 Python is expressive


Python is a very expressive language. Expressive in this context means that a single line
of Python code can do more than a single line of code in most other languages. The
advantages of a more expressive language are obvious: The fewer lines of code you
have to write, the faster you can complete the project. The fewer lines of code there
are, the easier the program will be to maintain and debug.


To get an idea of how Python’s expressiveness can simplify code, consider swap-
ping the values of two variables, var1 and var2. In a language like Java, this requires
three lines of code and an extra variable:
int temp = var1;
var1 = var2;
var2 = temp;

The variable temp is needed to save the value of var1 when var2 is put into it, and
then that saved value is put into var2. The process isn’t terribly complex, but reading
those three lines and understanding that a swap has taken place takes a certain
amount of overhead, even for experienced coders.
By contrast, Python lets you make the same swap in one line and in a way that
makes it obvious that a swap of values has occurred:
var2, var1 = var1, var2

Of course, this is a very simple example, but you find the same advantages throughout
the language.

1.2.3 Python is readable


Another advantage of Python is that it’s easy to read. You might think that a program-
ming language needs to be read only by a computer, but humans have to read your
code as well: whoever debugs your code (quite possibly you), whoever maintains your
code (could be you again), and whoever might want to modify your code in the
future. In all of those situations, the easier the code is to read and understand, the
better it is.
The easier code is to understand, the easier it is to debug, maintain, and modify.
Python’s main advantage in this department is its use of indentation. Unlike most lan-
guages, Python insists that blocks of code be indented. Although this strikes some peo-
ple as odd, it has the benefit that your code is always formatted in a very easy-to-read
style.
Following are two short programs, one written in Perl and one in Python. Both
take two equal-size lists of numbers and return the pairwise sum of those lists. I think
the Python code is more readable than the Perl code; it’s visually cleaner and contains
fewer inscrutable symbols:
# Perl version.
sub pairwise_sum {
    my($arg1, $arg2) = @_;
    my @result;
    for(0 .. $#$arg1) {
        push(@result, $arg1->[$_] + $arg2->[$_]);
    }
    return(\@result);
}

# Python version.
def pairwise_sum(list1, list2):
    result = []
    for i in range(len(list1)):
        result.append(list1[i] + list2[i])
    return result

Both pieces of code do the same thing, but the Python code wins in terms of readabil-
ity. (There are other ways to do this in Perl, of course, some of which are much more
concise—but in my opinion harder to read—than the one shown.)

1.2.4 Python is complete—“batteries included”


Another advantage of Python is its “batteries included” philosophy when it comes to
libraries. The idea is that when you install Python, you should have everything you
need to do real work without the need to install additional libraries. This is why the
Python standard library comes with modules for handling email, web pages, data-
bases, operating-system calls, GUI development, and more.
For example, with Python, you can write a web server to share the files in a direc-
tory with just two lines of code:
import http.server
http.server.test(HandlerClass=http.server.SimpleHTTPRequestHandler)

There’s no need to install libraries to handle network connections and HTTP; it’s
already in Python, right out of the box.

1.2.5 Python is cross-platform


Python is also an excellent cross-platform language. Python runs on many platforms:
Windows, Mac, Linux, UNIX, and so on. Because it’s interpreted, the same code can
run on any platform that has a Python interpreter, and almost all current platforms
have one. There are even versions of Python that run on Java (Jython) and .NET
(IronPython), giving you even more possible platforms that run Python.

1.2.6 Python is free


Python is also free. Python was originally, and continues to be, developed under the
open source model, and it’s freely available. You can download and install practically
any version of Python and use it to develop software for commercial or personal appli-
cations, and you don’t need to pay a dime.
Although attitudes are changing, some people are still leery of free software
because of concerns about a lack of support, fearing that they lack the clout of paying
customers. But Python is used by many established companies as a key part of their
business; Google, Rackspace, Industrial Light & Magic, and Honeywell are just a few
examples. These companies and many others know Python for what it is: a very stable,
reliable, and well-supported product with an active and knowledgeable user commu-
nity. You’ll get an answer to even the most difficult Python question more quickly on
the Python internet newsgroup than you will on most tech-support phone lines, and
the Python answer will be free and correct.


Python and open source software


Not only is Python free of cost, but also, its source code is freely available, and you’re
free to modify, improve, and extend it if you want. Because the source code is freely
available, you have the ability to go in yourself and change it (or to hire someone to
go in and do so for you). You rarely have this option at any reasonable cost with pro-
prietary software.
If this is your first foray into the world of open source software, you should understand
that you’re not only free to use and modify Python, but also able (and encouraged) to
contribute to it and improve it. Depending on your circumstances, interests, and
skills, those contributions might be financial, as in a donation to the Python Software
Foundation (PSF), or they may involve participating in one of the special interest
groups (SIGs), testing and giving feedback on releases of the Python core or one of
the auxiliary modules, or contributing some of what you or your company develops
back to the community. The level of contribution (if any) is, of course, up to you; but
if you’re able to give back, definitely consider doing so. Something of significant value
is being created here, and you have an opportunity to add to it.

Python has a lot going for it: expressiveness, readability, rich included libraries, and
cross-platform capabilities. Also, it’s open source. What’s the catch?

1.3 What Python doesn’t do as well


Although Python has many advantages, no language can do everything, so Python
isn’t the perfect solution for all your needs. To decide whether Python is the right lan-
guage for your situation, you also need to consider the areas where Python doesn’t do
as well.

1.3.1 Python isn’t the fastest language


A possible drawback with Python is its speed of execution. It isn’t a fully compiled lan-
guage. Instead, it’s first compiled to an internal bytecode form, which is then exe-
cuted by a Python interpreter. There are some tasks, such as string parsing using
regular expressions, for which Python has efficient implementations and is as fast as,
or faster than, any C program you’re likely to write. Nevertheless, most of the time,
using Python results in slower programs than in a language like C. But you should
keep this in perspective. Modern computers have so much computing power that for
the vast majority of applications, the speed of the program isn’t as important as the
speed of development, and Python programs can typically be written much more
quickly. In addition, it’s easy to extend Python with modules written in C or C++,
which can be used to run the CPU-intensive portions of a program.
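
As a rough, hedged illustration (a toy measurement, not a benchmark, and not code
from this chapter), the standard library’s timeit module can compare a pure-Python
loop with the built-in sum, which CPython implements in C:

import timeit

# Summing 10,000 integers with an interpreted Python loop...
py_loop = """
total = 0
for i in range(10_000):
    total += i
"""
print("python loop :", timeit.timeit(py_loop, number=1_000))

# ...and with the C-implemented built-in, which is typically much faster.
print("built-in sum:", timeit.timeit("sum(range(10_000))", number=1_000))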

1.3.2 Python doesn’t have the most libraries


Although Python comes with an excellent collection of libraries, and many more are
available, Python doesn’t hold the lead in this department. Languages like C, Java,
and Perl have even larger collections of libraries available, in some cases offering a
solution where Python has none or a choice of several options where Python might
have only one. These situations tend to be fairly specialized, however, and Python is
easy to extend, either in Python itself or through existing libraries in C and other lan-
guages. For almost all common computing problems, Python’s library support is
excellent.

1.3.3 Python doesn’t check variable types at compile time


Unlike in some languages, Python’s variables don’t work like containers; instead,
they’re more like labels that reference various objects: integers, strings, class
instances, whatever. That means that although those objects themselves have types,
the variables referring to them aren’t bound to that particular type. It’s possible (if not
necessarily desirable) to use the variable x to refer to a string in one line and an inte-
ger in another:
>>> x = "2"
>>> x
'2' x is string "2"
>>> x = int(x)
>>> x
2 x is now integer 2

The fact that Python associates types with objects and not with variables means that
the interpreter doesn’t help you catch variable type mismatches. If you intend a vari-
able count to hold an integer, Python won’t complain if you assign the string "two" to
it. Traditional coders count this as a disadvantage, because you lose an additional free
check on your code. But errors like this usually aren’t hard to find and fix, and
Python’s testing features make avoiding type errors manageable. Most Python pro-
grammers feel that the flexibility of dynamic typing more than outweighs the cost.
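
A minimal sketch of that point (count is just an illustrative name, not an example
from this chapter): the mismatch surfaces only when the value is actually used.

count = 1          # intended to hold an integer
count = "two"      # Python doesn't complain; the name is simply rebound to a string
try:
    count + 1      # only here does the type mismatch show up, at run time
except TypeError as err:
    print(err)     # something like: can only concatenate str (not "int") to str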

1.3.4 Python doesn’t have much mobile support


In the past decade the numbers and types of mobile devices have exploded, and
smartphones, tablets, phablets, Chromebooks, and more are everywhere, running on
a variety of operating systems. Python isn’t a strong player in this space. While options
exist, running Python on mobile devices isn’t always easy, and using Python to write
and distribute commercial apps is problematic.

1.3.5 Python doesn’t use multiple processors well


Multiple-core processors are everywhere now, producing significant increases in per-
formance in many situations. However, the standard implementation of Python isn’t
designed to use multiple cores, due to a feature called the global interpreter lock
(GIL). For more information, look for videos of GIL-related talks and posts by David
Beazley, Larry Hastings, and others, or visit the GIL page on the Python wiki at
https://wiki.python.org/moin/GlobalInterpreterLock. While there are ways to run
concurrent processes by using Python, if you need concurrency out of the box, Python
may not be for you.
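
One such way, shown here only as a minimal sketch and under the assumption that the
work is CPU-bound, is the standard library’s multiprocessing module, which runs code
in separate processes, each with its own interpreter and its own GIL:

from multiprocessing import Pool

def square(n):
    return n * n

if __name__ == "__main__":                      # guard required when processes are spawned
    with Pool(processes=4) as pool:
        print(pool.map(square, range(10)))      # work is spread across four processes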

1.4 Why learn Python 3?


Python has been around for a number of years and has evolved over that time. The
first edition of this book was based on Python 1.5.2, and Python 2.x has been the dom-
inant version for several years. This book is based on Python 3.6 but has also been
tested on the alpha version of Python 3.7.
Python 3, originally whimsically dubbed Python 3000, is notable because it’s the
first version of Python in the history of the language to break backward compatibility.
What this means is that code written for earlier versions of Python probably won’t run
on Python 3 without some changes. In earlier versions of Python, for example, the
print statement didn’t require parentheses around its arguments:
print "hello"

In Python 3, print is a function and needs the parentheses:


print("hello")

You may be thinking, “Why change details like this if it’s going to break old code?”
Because this kind of change is a big step for any language, the core developers of
Python thought about this issue carefully. Although the changes in Python 3 break
compatibility with older code, those changes are fairly small and for the better; they
make the language more consistent, more readable, and less ambiguous. Python 3
isn’t a dramatic rewrite of the language; it’s a well-thought-out evolution. The core
developers also took care to provide a strategy and tools to safely and efficiently
migrate old code to Python 3, which will be discussed in a later chapter, and the Six
and Future libraries are also available to make the transition easier.
Why learn Python 3? Because it’s the best Python so far. Also, as projects switch to
take advantage of its improvements, it will be the dominant Python version for years to
come. The porting of libraries to Python 3 has been steady since its introduction, and
by now many of the most popular libraries support Python 3. In fact, according to the
Python Readiness page (http://py3readiness.org), 319 of the 360 most popular librar-
ies have already been ported to Python 3. If you need a library that hasn’t been con-
verted yet, or if you’re working on an established code base in Python 2, by all means
stick with Python 2.x. But if you’re starting to learn Python or starting a project, go
with Python 3; it’s not only better, but also the future.
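
If you do need to straddle both versions for a while, here is a minimal sketch of the
standard __future__ mechanism (the Six and Future libraries mentioned above go
further than this):

from __future__ import print_function, division   # no-ops on Python 3

print("hello")   # the function form works on Python 2.7 and Python 3 alike
print(7 / 2)     # true division (3.5) on both versions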

Summary
 Python is a modern, high-level language with dynamic typing and simple, con-
sistent syntax and semantics.
 Python is multiplatform, highly modular, and suited for both rapid develop-
ment and large-scale programming.


 It’s reasonably fast and can be easily extended with C or C++ modules for
higher speeds.
 Python has built-in advanced features such as persistent object storage,
advanced hash tables, expandable class syntax, and universal comparison func-
tions.
 Python includes a wide range of libraries such as numeric processing, image
manipulation, user interfaces, and web scripting.
 It’s supported by a dynamic Python community.



Chapter 1 from Cloud Native Patterns
by Cornelia Davis

You keep using that word: Defining “cloud-native”

It’s not Amazon’s fault. On Sunday, September 20, 2015, Amazon Web Services
(AWS) experienced a significant outage. With an increasing number of companies
running mission-critical workloads on AWS—even their core customer-facing ser-
vices—an AWS outage can result in far-reaching subsequent system outages. In this
instance, Netflix, Airbnb, Nest, IMDb, and more all experienced downtime,
impacting their customers and ultimately their business’s bottom lines. The core
outage lasted about five hours (or more, depending on how you count), resulting
in even longer outages for the affected AWS customers before their systems recov-
ered from the failure.
If you’re Nest, you’re paying AWS because you want to focus on creating value
for your customers, not on infrastructure concerns. As part of the deal, AWS is
responsible for keeping its systems up, and enabling you to keep yours functioning
as well. If AWS experiences downtime, it’d be easy to blame Amazon for your result-
ing outage.
But you’d be wrong. Amazon isn’t to blame for your outage.
Wait! Don’t toss this book to the side. Please hear me out. My assertion gets
right to the heart of the matter and explains the goals of this book.
First, let me clear up one thing. I’m not suggesting that Amazon and other cloud
providers have no responsibility for keeping their systems functioning well; they
obviously do. And if a provider doesn’t meet certain service levels, its customers
can and will find alternatives. Service providers generally provide service-level agree-
ments (SLAs). Amazon, for example, provides a 99.95% uptime guarantee for most of
its services.
What I’m asserting is that the applications you’re running on a particular infra-
structure can be more stable than the infrastructure itself. How’s that possible? That,
my friends, is exactly what this book will teach you.
Let’s, for a moment, turn back to the AWS outage of September 20. Netflix, one of
the many companies affected by the outage, is the top internet site in the United
States, when measured by the amount of internet bandwidth consumed (36%). But
even though a Netflix outage affects a lot of people, the company had this to say about
the AWS event:
Netflix did experience a brief availability blip in the affected Region, but we sidestepped
any significant impact because Chaos Kong exercises prepare us for incidents like this. By
running experiments on a regular basis that simulate a Regional outage, we were able to
identify any systemic weaknesses early and fix them. When US-EAST-1 became
unavailable, our system was already strong enough to handle a traffic failover.1
Netflix was able to quickly recover from the AWS outage, being fully functional only
minutes after the incident began. Netflix, still running on AWS, was fully functional
even while the AWS outage continued.

NOTE How was Netflix able to recover so quickly? Redundancy.

No single piece of hardware can be guaranteed to be up 100% of the time, and, as has
been the practice for some time, redundant systems are put in place. AWS does
exactly this and makes those redundancy abstractions available to its users.
In particular, AWS offers services in numerous regions; for example, at the time of
writing, its Elastic Compute Cloud platform (EC2) is running and available in Ireland,
Frankfurt, London, Paris, Stockholm, Tokyo, Seoul, Singapore, Mumbai, Sydney, Bei-
jing, Ningxia, Sao Paulo, Canada, and in four locations in the United States (Virginia,
California, Oregon, and Ohio). And within each region, the service is further parti-
tioned into numerous availability zones (AZs) that are configured to isolate the
resources of one AZ from another. This isolation limits the effects of a failure in one
AZ rippling through to services in another AZ.
Figure 1.1 depicts three regions, each of which contains four availability zones.
Applications run within availability zones and—here’s the important part—may run in
more than one AZ and in more than one region. Recall that a moment ago I made the
assertion that redundancy is one of the keys to uptime.
In figure 1.2, let’s place logos within this diagram to hypothetically represent run-
ning applications. (I have no explicit knowledge of how Netflix, IMDb, or Nest have
deployed their applications; this is purely hypothetical, but illustrative nevertheless.)

1. See “Chaos Engineering Upgraded” at the Netflix Technology blog (http://mng.bz/P8rn) for more information on Chaos Kong.


[Figure 1.1: AWS partitions the services it offers into regions and availability zones. Regions map to
geographic areas, and AZs provide further redundancy and isolation within a single region.]

[Figure 1.2: Applications deployed onto AWS may be deployed into a single AZ (IMDb), or in multiple
AZs (Nest) but only a single region, or in multiple AZs and multiple regions (Netflix). This provides
different resiliency profiles.]

Figure 1.3 depicts a single-region outage, like the AWS outage of September 2015. In
that instance, only us-east-1 went dark.
In this simple graphic, you can immediately see how Netflix might have weathered
the outage far better than other companies; it already had its applications running in
other AWS regions and was able to easily direct all traffic over to the healthy instances.
And though it appears that the failover to the other regions wasn’t automatic, Netflix

had anticipated (even practiced!) a possible outage such as this and had architected
its software and designed its operational practices to compensate.2

[Figure 1.3: If applications are properly architected and deployed, digital solutions can survive even
a broad outage, such as of an entire region.]

NOTE Cloud-native software is designed to anticipate failure and remain sta-
ble even when the infrastructure it’s running on is experiencing outages or is
otherwise changing.
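
As a tiny, hypothetical sketch of that idea in Python (the endpoint URLs below are
placeholders, not real AWS addresses, and this is not Netflix’s implementation): a
client that prefers a primary regional endpoint and falls back to a replica in another
region when the first is unreachable.

from urllib.request import urlopen

ENDPOINTS = [
    "https://service.us-east-1.example.com/data",   # primary region (placeholder URL)
    "https://service.us-west-2.example.com/data",   # failover region (placeholder URL)
]

def fetch_with_failover(urls=ENDPOINTS, timeout=2):
    # Try each regional endpoint in turn; an outage in one region
    # shouldn't take the client down as long as a replica is healthy.
    for url in urls:
        try:
            return urlopen(url, timeout=timeout).read()
        except OSError:       # connection errors and timeouts alike
            continue          # region unreachable; try the next one
    raise RuntimeError("all regions unavailable")
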
Application developers, as well as support and operations staff, must learn and apply
new patterns and practices to create and manage cloud-native software, and this book
teaches those things. You might be thinking that this isn’t new, that organizations, par-
ticularly in mission-critical businesses like finance, have been running active/active
systems for some time, and you’re right. But what’s new is the way in which this is
being achieved.
In the past, implementing these failover behaviors was generally a bespoke solu-
tion, bolted on to a deployment for a system that wasn’t initially designed to adapt to
underlying system failures. The knowledge needed to achieve the required SLAs was
often limited to a few “rock stars,” and extraordinary design, configuration, and test-
ing mechanisms were put in place in an attempt to have systems that reacted appropri-
ately to that failure.
The difference between this and what Netflix does today starts with a fundamental
difference in philosophy. With the former approaches, change or failure is treated as
an exception. By contrast, Netflix and many other large-scale internet-native compa-
nies, such as Google, Twitter, Facebook, and Uber, treat change or failure as the rule.

2. See “AWS Outage: How Netflix Weathered the Storm by Preparing for the Worst” by Nick Heath (http://mng.bz/J8RV) for more details on the company’s recovery.


These organizations have altered their software architectures and their engineering
practices to make designing for failure an integral part of the way they build, deliver,
and manage software.

NOTE Failure is the rule, not the exception.

1.1 Today’s application requirements


Digital experiences are no longer a sidecar to our lives. They play a major part in
many or most of the activities that we engage in on a daily basis. This ubiquity has
pushed the boundaries of what we expect from the software we use: we want applica-
tions to be always available, be perpetually upgraded with new whizbang features, and
provide personalized experiences. Fulfilling these expectations is something that
must be addressed right from the beginning of the idea-to-production lifecycle. You,
the developer, are one of the parties responsible for meeting those needs. Let’s have a
look at some key requirements.

1.1.1 Zero downtime


The AWS outage of September 20, 2015, demonstrates one of the key requirements of
the modern application: it must always be available. Gone are the days when even
short maintenance windows during which applications are unavailable are tolerated.
The world is always online. And although unplanned downtime has never been desir-
able, its impact has reached astounding levels. For example, in 2013 Forbes estimated
that Amazon lost almost $2 million during a 30-minute unplanned outage.3 Down-
time, planned or not, results in significant revenue loss and customer dissatisfaction.
But maintaining uptime isn’t a problem only for the operations team. Software
developers or architects are responsible for creating a system design with loosely cou-
pled components that can be deployed to allow redundancy to compensate for inevi-
table failures, and with air gaps that keep those failures from cascading through the
entire system. They must also design the software to allow planned events, such as
upgrades, to be done with no downtime.

1.1.2 Shortened feedback cycles


Also of critical importance is the ability to release code frequently. Driven by signifi-
cant competition and ever-increasing consumer expectations, application updates are
being made available to customers several times a month, numerous times a week, or
in some cases even several times a day. Exciting customers is unquestionably valuable,
but perhaps the biggest driver for these continuous releases is the reduction of risk.
From the moment that you have an idea for a feature, you’re taking on some level
of risk. Is the idea a good one? Will customers be able to use it? Can it be implemented
in a better-performing way? As much as you try to predict the possible outcomes, reality

3. See “Amazon.com Goes Down, Loses $66,240 Per Minute” by Kelly Clay at the Forbes website for more details (http://mng.bz/wEgP).


is often different from what you can anticipate. The best way to get answers to important
questions such as these is to release an early version of a feature and get feedback. Using
that feedback, you can then make adjustments or even change course entirely. Fre-
quent software releases shorten feedback loops and reduce risk.
The monolithic software systems that have dominated the last several decades can’t be released often enough. Too many closely interrelated subsystems, built and tested by independent teams, need to be tested as a whole before an often-fragile packaging process can be applied. If a defect is found late in the integration-testing phase, the long and laborious process begins anew. New software
architectures are essential to achieving the required agility in releasing software
to production.

1.1.3 Mobile and multidevice support


In April 2015, Comscore, a leading technology-trend measurement and analytics com-
pany, released a report indicating that for the first time, internet usage via mobile
devices eclipsed that of desktop computers.4 Today’s applications need to support at
least two mobile device platforms, iOS and Android, as well as the desktop (which still
claims a significant portion of the usage).
In addition, users increasingly expect their experience with an application to
seamlessly move from one device to another as they navigate through the day. For
example, users may be watching a movie on an Apple TV and then transition to view-
ing the program on a mobile device when they’re on the train to the airport. Further-
more, the usage patterns on a mobile device are significantly different from those of a
desktop. Banks, for example, must be able to satisfy frequently repeated application
refreshes from mobile device users who are awaiting their weekly payday deposit.
Designing applications the right way is essential to meeting these needs. Core services must be implemented in a manner that allows them to back all of the frontend devices
serving users, and the system must adapt to expanding and contracting demands.

1.1.4 Connected devices—also known as the Internet of Things


The internet is no longer only for connecting humans to systems that are housed in
and served from data centers. Today, billions of devices are connected to the internet,
allowing them to be monitored and even controlled by other connected entities. The
home-automation market alone, which represents a tiny portion of the connected
devices that make up the Internet of Things (IoT), is estimated to be a $53 billion
market by 2022.5
The connected home has sensors and remotely controlled devices such as motion
detectors, cameras, smart thermostats, and even lighting systems. And this is all

4. See Kate Dreyer’s April 13, 2015 blog at the Comscore site (http://mng.bz/7eKv) for a summary of the report.
5. You can read more about these findings by Zion Market Research at the GlobeNewswire site (http://mng.bz/mm6a).


extremely affordable; after a burst pipe during a –26-degree (Fahrenheit) weather spell a few years ago, I started with a modest system including an internet-connected
thermostat and some temperature sensors, and spent less than $300. Other connected
devices include automobiles, home appliances, farming equipment, jet engines, and
the supercomputer most of us carry around in our pockets (the smartphone).
Internet-connected devices change the nature of the software we build in two fun-
damental ways. First, the volume of data flowing over the internet is dramatically
increased. Billions of devices broadcast data many times a minute, or even many times
a second.6 Second, in order to capture and process these massive quantities of data,
the computing substrate must be significantly different from that of the past. It
becomes more highly distributed with computing resources placed at the “edge,”
closer to where the connected device lies. This difference in data volume and infra-
structure architecture necessitates new software designs and practices.

1.1.5 Data-driven
Considering several of the requirements that I’ve presented up to this point drives
you to think about data in a more holistic way. Volumes of data are increasing,
sources are becoming more widely distributed, and software delivery cycles are being
shortened. In combination, these three factors render the large, centralized, shared
database unusable.
A jet engine with hundreds of sensors, for example, is often disconnected from
data centers housing such databases, and bandwidth limitations won’t allow all the
data to be transmitted to the data center during the short windows when connectivity
is established. Furthermore, shared databases require a great deal of process and
coordination across a multitude of applications to rationalize the various data models
and interaction scenarios; this is a major impediment to shortened release cycles.
Instead of the single, shared database, these application requirements call for a
network of smaller, localized databases, and software that manages data relationships
across that federation of data management systems. These new approaches drive the
need for software development and management agility all the way through to the
data tier.
Finally, all of the newly available data is of little value if it goes unused. Today’s
applications must increasingly use data to provide greater value to the customer
through smarter applications. For example, mapping applications use GPS data from
connected cars and mobile devices, along with roadway and terrain data to provide
real-time traffic reports and routing guidance. The applications of the past decades
that implemented painstakingly designed algorithms carefully tuned for anticipated
usage scenarios are being replaced with applications that are constantly being revised or that may even self-adjust their internal algorithms and configurations.

6. Gartner forecasts that 8.4 billion connected things will be in use worldwide in 2017; see the Gartner report at www.gartner.com/newsroom/id/3598917.

These user requirements—constant availability, constant evolution with frequent releases, easy scalability, and intelligence—can’t be met with the software design and management systems of the past. But what characterizes the software that can meet these requirements?

1.2 Introducing cloud-native software


Your software needs to be up, 24/7. You need to be able to release frequently to give
your users the instant gratification they seek. The mobility and always-connected state
of your users drives a need for your software to be responsive to larger and more fluc-
tuating request volumes than ever before. And connected devices (“things”) form a
distributed data fabric of unprecedented size that requires new storage and process-
ing approaches. These needs, along with the availability of new platforms on which
you can run the software, have led directly to the emergence of a new architectural
style for software: cloud-native software.

1.2.1 Defining “cloud-native”


What characterizes cloud-native software? Let’s analyze the preceding requirements a bit
further and see where they lead. Figure 1.4 takes the first few steps, listing require-
ments across the top and showing causal relationships going downward. The following
list explains the details:
 Software that’s always up must be resilient to infrastructure failures and
changes, whether planned or unplanned. As the context within which it runs
experiences those inevitable changes, software must be able to adapt. When
properly constructed, deployed, and managed, a composition of independent pieces can limit the blast radius of any failures that do occur; this drives you to a
modular design. And because you know that no single entity can be guaranteed
to never fail, you include redundancy throughout the design.
 Your goal is to release frequently, and monolithic software doesn’t allow this;
too many interdependent pieces require time-consuming and complex coordi-
nation. In recent years, it’s been soundly proven that software made up of
smaller, more loosely coupled and independently deployable and releasable
components (often called microservices) enables a more agile release model.
 No longer are users limited to accessing digital solutions when they sit in front
of their computers. They demand access from the mobile devices they carry
with them 24/7. And nonhuman entities, such as sensors and device control-
lers, are similarly always connected. Both of these scenarios result in a tidal wave
of request and data volumes that can fluctuate wildly, and therefore require
software that scales dynamically and continues to function adequately.
Some of these attributes have architectural implications: the resultant software is com-
posed of redundantly deployed, independent components. Other attributes address
the management practices used to deliver the digital solutions: a deployment must


Figure 1.4 User requirements for software drive development toward cloud-native architectural and management tenets. (The figure maps “always up,” “release frequently,” “anywhere, any device,” and “Internet of Things” to resilience, agility, and large and fluctuating volumes of requests and data, and those in turn to redundancy, adaptability, modularity, and dynamic scalability.)

adapt to a changing infrastructure and to fluctuating request volumes. Taking that col-
lection of attributes as a whole, let’s carry this analysis to its conclusion; this is
depicted in figure 1.5:
 Software that’s constructed as a set of independent components, redundantly
deployed, implies distribution. If your redundant copies were all deployed close
to one another, you’d be at greater risk of local failures having far-reaching con-
sequences. To make efficient use of the infrastructure resources you have, when
you deploy additional instances of an app to serve increasing request volumes,
you must be able to place them across a wide swath of your available infrastruc-
ture—even, perhaps, that from cloud services such as AWS, Google Cloud Plat-
form (GCP), and Microsoft Azure. As a result, you deploy your software
modules in a highly distributed manner.
 Adaptable software is by definition “able to adjust to new conditions,” and the
conditions I refer to here are those of the infrastructure and the set of interre-
lated software modules. They’re intrinsically tied together: as the infrastructure
changes, the software changes, and vice versa. Frequent releases mean frequent
change, and adapting to fluctuating request volumes through scaling opera-
tions represents a constant adjustment. It’s clear that your software and the
environment it runs in are constantly changing.

DEFINITION Cloud-native software is highly distributed, must operate in a con-


stantly changing environment, and is itself constantly changing.

Many more granular details go into the making of cloud-native software (the specifics
fill the pages of this volume). But, ultimately, they all come back to these core charac-
teristics: highly distributed and constantly changing. This will be your mantra as you
progress through the material, and I’ll repeatedly draw you back to extreme distribu-
tion and constant change.


Figure 1.5 Architectural and management tenets lead to the core characteristics of cloud-native software: it’s highly distributed and must operate in a constantly changing environment even as the software is constantly evolving.

1.2.2 A mental model for cloud-native software


Adrian Cockcroft, who was chief architect at Netflix and is now VP of Cloud Architec-
ture Strategy at AWS, talks about the complexity of operating a car: as a driver, you
must control the car and navigate streets, all while making sure not to come into con-
tact with other drivers performing the same complex tasks.7 You’re able to do this only
because you’ve formed a model that allows you to understand the world and control
your instrument (in this case, a car) in an ever-changing environment.
Most of us use our feet to control the speed and our hands to set direction, collec-
tively determining our velocity. In an attempt to improve navigation, city planners put
thought into street layouts (God help us all in Paris). And tools such as signs and sig-
nals, coupled with traffic rules, give you a framework in which you can reason about
the journey you’re taking from start to finish.
Writing cloud-native software is also complex. In this section, I present a model to
help bring order to the myriad of concerns in writing cloud-native software. My hope is
that this framework facilitates your understanding of the key concepts and techniques
that will make you a proficient designer and developer of cloud-native software.
I’ll start simple, with core elements of cloud-native software that are surely familiar
to you, shown in figure 1.6.
An app implements key business logic. This is where you’ll be writing the bulk of the
code. This is where, for example, your code will take a customer order, verify that items
are available in a warehouse’s inventory, and send a notification to the billing department.
The app, of course, depends on other components that it calls to either obtain
information or take an action; I call these services. Some of the services store state—the
warehouse inventory, for example. Others may be apps that implement the business
logic for another part of your system—customer billing, for example.

7. Hear Adrian talk about this and other examples of complicated things at http://mng.bz/5NzO.


Figure 1.6 Familiar elements of a basic software architecture: the app is the code you write (the business logic for your software), it calls upon other components (services) to fulfill its requirements, and some of those services store and/or manage the software’s state.

Taking these simple concepts, let’s now build up a topology that represents the cloud-
native software you’ll build; see figure 1.7. You have a distributed set of modules, most
of which have multiple instances deployed. You can see that most of the apps are also
acting as services, and further, that some services are explicitly stateful. Arrows depict
where one component depends on another.

Figure 1.7 Cloud-native software takes familiar concepts and adds extreme distribution, with redundancy everywhere, and constant change. (In the figure, solid arrows indicate dependencies: the user depends on the root app, which in turn depends on three other services; virtually every component also acts as a service in some capacity; apps almost always have many instances deployed; and some apps depend on stateful services.)


This diagram illustrates a few interesting points. First, notice that the pieces (the
boxes and the database, or storage, icons) are always annotated with two designations:
apps and services for the boxes, and services and state for the storage icons. I’ve come
to think of the simple concepts shown in figure 1.7 as roles that various components
in your software solution take on.
You’ll note that any entity that has an arrow going to it, indicating that the compo-
nent is depended upon by another, is a service. That’s right—almost everything is a
service. Even the app that’s the root of the topology has an arrow to it from the soft-
ware consumer. Apps, of course, are where you’re writing your code. And I particu-
larly like the combination of service and state annotations, making clear that you have
some services that are devoid of state (the stateless services you’ve surely heard about,
annotated here with “app”), whereas others are all about managing state.
And this brings me to defining the three parts of cloud-native software, depicted in
figure 1.8:
 The cloud-native app —Again, this is where you’ll write code; it’s the business
logic for your software. Implementing the right patterns here allows those
apps to act as good citizens in the composition that makes up your software; a
single app is rarely a complete digital solution. An app is at one or the other
end of an arrow (or both) and therefore must implement certain behaviors to
make it participate in that relationship. It must also be constructed in a man-
ner that allows for cloud-native operational practices such as scaling and
upgrades to be performed.
 Cloud-native data —This is where state lives in your cloud-native software. Even
this simple picture shows a marked deviation from the architectures of the past,
which often used a centralized database to store state for a large portion of the
software. For example, you might have stored user profiles, account details,
reviews, order history, payment information, and more, all in the same data-
base. Cloud-native software breaks the code into many smaller modules (the
apps), and the database is similarly decomposed and distributed.
 Cloud-native interactions —Cloud-native software is then a composition of cloud-
native apps and cloud-native data, and the way those entities interact with one
another ultimately determines the functioning and the qualities of the digital
solution. Because of the extreme distribution and constant change that charac-
terizes our systems, these interactions have in many cases significantly evolved
from those of previous software architectures, and some interaction patterns
are entirely new.
Notice that although at the start I talked about services, in the end they aren’t one of
the three entities in this mental model. In large part, this is because pretty much
everything is a service, both apps and data. But more so, I suggest that the interactions
between services are even more interesting than a service alone. Services pervade the entire cloud-native software model.


Figure 1.8 Key entities in the model for cloud-native software: apps, data, and interactions. (Interactions may be request/response, push-centric, or pull-centric.)

With this model established, let’s come back to the modern software requirements
covered in section 1.1 and consider their implications on the apps, data, and interac-
tions of your cloud-native software.
CLOUD-NATIVE APPS
Concerns about cloud-native apps include the following:
 Their capacity is scaled up or down by adding or removing instances. We refer
to this as scale-out/in, and it’s far different from the scale-up models used in
prior architectures. When deployed correctly, having multiple instances of an
app also offers levels of resilience in an unstable environment.
 As soon as you have multiple instances of an app, and even when only a single
instance is being disrupted in some way, keeping state out of the apps allows you
to perform recovery actions most easily. You can simply create a new instance of
an app and connect it back to any stateful services it depends on.
 Configuration of the cloud-native app poses unique challenges when many instances are deployed and the environments in which they’re running are constantly changing. If you have 100 instances of an app, for example, gone are the days when you could drop a new config into a known filesystem location and restart the app. Add to that the fact that those instances could be moving all over your distributed topology, and applying such old-school practices becomes sheer madness (see the configuration sketch after this list).
 The dynamic nature of cloud-based environments necessitates changes to the
way you manage the application lifecycle (not the software delivery lifecycle, but
rather the startup and shutdown of the actual app). You must reexamine how
you start, configure, reconfigure, and shut down apps in this new context.
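To make the configuration concern from that list concrete, here’s a minimal sketch in Java of one common way to address it: resolving settings from environment variables supplied at launch instead of from a file at a known path. The variable names, defaults, and class name are illustrative assumptions, not a prescribed design.

// A minimal sketch of externalized configuration: each instance reads its
// settings from the environment it was launched with instead of from a file
// dropped at a known filesystem path. Names and defaults are illustrative.
public class AppConfig {
    final String databaseUrl;
    final int requestTimeoutMillis;

    private AppConfig(String databaseUrl, int requestTimeoutMillis) {
        this.databaseUrl = databaseUrl;
        this.requestTimeoutMillis = requestTimeoutMillis;
    }

    // Resolve each value from an environment variable, falling back to a default.
    static AppConfig fromEnvironment() {
        String url = getOrDefault("DATABASE_URL", "jdbc:postgresql://localhost:5432/app");
        int timeout = Integer.parseInt(getOrDefault("REQUEST_TIMEOUT_MILLIS", "2000"));
        return new AppConfig(url, timeout);
    }

    private static String getOrDefault(String name, String fallback) {
        String value = System.getenv(name);
        return (value == null || value.isEmpty()) ? fallback : value;
    }

    public static void main(String[] args) {
        AppConfig config = AppConfig.fromEnvironment();
        System.out.println("database url    = " + config.databaseUrl);
        System.out.println("request timeout = " + config.requestTimeoutMillis + " ms");
    }
}

Because every instance resolves its configuration this way at startup, one instance or one hundred can be launched, moved, or replaced without anyone editing files on individual machines.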


CLOUD-NATIVE DATA
Okay, so your apps are stateless. But handling state is an equally important part of a
software solution, and the need to solve your data-handling problems also exists in an
environment of extreme distribution and constant change. Because you have data
that needs to persist through these fluctuations, handling data in a cloud setting poses
unique challenges. The concerns for cloud-native data include the following:
 You need to break apart the data monolith. In the last several decades, organi-
zations invested a great deal of time, energy, and technology into managing
large, consolidated data models. The reasoning was that concepts that were rel-
evant in many domains, and hence implemented in many software systems,
were best treated centrally as a single entity. For example, in a hospital, the con-
cept of a patient was relevant in many settings, including clinical/care, billing,
experience surveys, and more, and developers would create a single model, and
often a single database, for handling patient information. This approach
doesn’t work in the context of modern software; it’s slow to evolve and brittle,
and ultimately robs the seemingly loosely coupled app fabric of its agility and
robustness. You need to create a distributed data fabric, as you created a distrib-
uted app fabric.
 The distributed data fabric is made up of independent, fit-for-purpose data-
bases (supporting polyglot persistence), as well as some that may be acting only
as materialized views of data, where the source of truth lies elsewhere. Caching
is a key pattern and technology in cloud-native software.
 When you have entities that exist in multiple databases, such as the “patient” I
mentioned previously, you have to address how to keep the information that’s
common across the different instances in sync.
 Ultimately, treating state as an outcome of a series of events forms the core of
the distributed data fabric. Event-sourcing patterns capture state-change events,
and the unified log collects those events and makes them available to members of this data distribution (a minimal sketch follows this list).
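As a rough illustration of that last point, the following in-memory sketch (in Java; the event type and the patient phone number example are illustrative assumptions, not a prescribed design) records each state change as an event in an append-only log and derives a current view by replaying the log.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A minimal sketch of event sourcing: state changes enter the system only as
// events appended to a log, and any "current" view is derived by replaying it.
public class PhoneNumberEventLog {
    // Uses a record for brevity (recent JDKs); an ordinary class works equally well.
    record PhoneNumberChanged(String patientId, String newPhoneNumber) {}

    private final List<PhoneNumberChanged> log = new ArrayList<>();

    void append(PhoneNumberChanged event) {
        log.add(event);
    }

    // A materialized view: replay every event to find the latest number per patient.
    Map<String, String> currentPhoneNumbers() {
        Map<String, String> view = new HashMap<>();
        for (PhoneNumberChanged event : log) {
            view.put(event.patientId(), event.newPhoneNumber());
        }
        return view;
    }

    public static void main(String[] args) {
        PhoneNumberEventLog events = new PhoneNumberEventLog();
        events.append(new PhoneNumberChanged("patient-42", "555-0100"));
        events.append(new PhoneNumberChanged("patient-42", "555-0199"));
        System.out.println(events.currentPhoneNumbers()); // {patient-42=555-0199}
    }
}

A unified log plays the same role at system scale: other services subscribe to the stream of state-change events and build whatever local, fit-for-purpose views they need.
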
CLOUD-NATIVE INTERACTIONS
And finally, when you draw all the pieces together, a new set of concerns surfaces for
the cloud-native interactions:
 Accessing an app when it has multiple instances requires some type of routing
system. Synchronous request/response, as well as asynchronous event-driven
patterns, must be addressed.
 In a highly distributed, constantly changing environment, you must account for
access attempts that fail. Automatic retries are an essential pattern in cloud-native
software, yet their use can wreak havoc on a system if not governed properly. Circuit breakers are essential when automated retries are in place (a combined sketch follows this list).
 Because cloud-native software is a composite, a single user request is served
through invocation of a multitude of related services. Properly managing


cloud-native software to ensure a good user experience is a task of managing a composition—each of the services and the interactions between them. Appli-
cation metrics and logging, things we’ve been producing for decades, must be
specialized for this new setting.
 One of the greatest advantages of a modular system is the ability to more easily
evolve parts of it independently. But because those independent pieces ulti-
mately come together into a greater whole, the protocols underlying the inter-
actions among them must be suitable for the cloud-native context; for example,
a routing system that supports parallel deploys.
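To make the retry and circuit-breaker interplay from that list concrete, here’s a deliberately simplified sketch in Java; it omits the half-open state and time-based reset a production breaker would have, and the class and parameter names are illustrative assumptions.

import java.util.function.Supplier;

// A minimal sketch of retries governed by a circuit breaker: after a configurable
// number of consecutive failures the breaker "opens" and further calls fail fast
// instead of piling more retries onto a struggling downstream service.
public class SimpleCircuitBreaker {
    private final int failureThreshold;
    private int consecutiveFailures = 0;

    SimpleCircuitBreaker(int failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    <T> T call(Supplier<T> request, int maxRetries) {
        if (consecutiveFailures >= failureThreshold) {
            throw new IllegalStateException("circuit open: failing fast");
        }
        RuntimeException lastError = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                T result = request.get();
                consecutiveFailures = 0;   // a success closes the breaker again
                return result;
            } catch (RuntimeException e) {
                lastError = e;
                consecutiveFailures++;
                if (consecutiveFailures >= failureThreshold) {
                    break;                  // stop retrying; the breaker is now open
                }
            }
        }
        throw lastError;
    }
}

Even this much keeps automatic retries from amplifying an outage: once the breaker opens, callers get an immediate error they can handle or degrade around, rather than adding load to a service that is already failing.
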
This book covers new and evolved patterns and practices to address these needs.
Let’s make all of this a bit more concrete by looking at a specific example. This will
give you a better sense of the concerns I’m only briefly mentioning here, and will give
you a good idea of where I’m headed with the content of this text.

1.2.3 Cloud-native software in action


Let’s start with a familiar scenario. You have an account with Wizard’s Bank. Part of
the time you engage with the bank by visiting the local branch (if you’re a millennial,
just pretend with me for a moment ;-)). You’re also a registered user of the bank’s
online banking application. After receiving only unsolicited calls on your home land-
line (again, pretend ;-)) for the better part of the last year or two, you’ve finally
decided to disconnect it. As a result, you need to update your phone number with
your bank (and many other institutions).
The online banking application allows you to edit your user profile, which includes
your primary and any backup phone numbers. After logging into the site, you navigate to
the Profile page, enter your new phone number, and click the Submit button. You receive
confirmation that your update has been saved, and your user experience ends there.
Let’s see what this could look like if that online banking application were archi-
tected in a cloud-native manner. Figure 1.9 depicts these key elements:
 Because you aren’t yet logged in, when you access the User Profile app, B it will
redirect you to the Authentication app. C Notice that each of these apps has
multiple instances deployed and that the user requests are sent to one of the
instances by a router.
 As a part of logging in, the Auth app will create and store a new auth token in a
stateful service. D
 The user is then redirected back to the User Profile app, with the new auth token.
This time, the router will send the user request to a different instance of the User
Profile app. E (Spoiler alert: sticky sessions are bad in cloud-native software!)
 The User Profile app will validate the auth token by making a call to an Auth
API service. F Again, there are multiple instances, and the request is sent to one
of them by the router. Recall that valid tokens are stored in the stateful Auth
Token service, which is accessible from not only the Auth app, but also any
instances of the Auth API service.


 Because the instances of any of these apps (the User Profile or Auth apps) can change for any number of reasons, a protocol must exist for continuously updating the router with new IP addresses (a minimal registry sketch follows this list).
 The User Profile app then makes a downstream request to the User API service G
to obtain the current user’s profile data, including phone number. The User API service, in turn, makes a request to the Users stateful service.
 After the user has updated their phone number and clicked the Submit button,
the User Profile app sends the new data to an event log H.
 Eventually, one of the instances of the User API service will pick up and process
this change event I and send a write request to the Users database.
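The registration protocol mentioned in that list can be sketched very simply; the Java below is an illustrative assumption of how such a registry might look, and real systems add health checks, leases, and a load-balancing policy on top of the lookup.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// A minimal sketch of service registration: each app instance announces its
// current address, and the router resolves a service name against the registry
// on every request instead of relying on fixed IP addresses.
public class ServiceRegistry {
    private final Map<String, List<String>> addressesByService = new ConcurrentHashMap<>();

    // Called by an instance when it starts (or restarts at a new address).
    void register(String serviceName, String address) {
        addressesByService
            .computeIfAbsent(serviceName, name -> new CopyOnWriteArrayList<>())
            .add(address);
    }

    // Called when an instance goes away (or fails its health checks).
    void deregister(String serviceName, String address) {
        List<String> addresses = addressesByService.get(serviceName);
        if (addresses != null) {
            addresses.remove(address);
        }
    }

    // The router asks for the live addresses and picks one (round-robin, random, ...).
    List<String> lookup(String serviceName) {
        return addressesByService.getOrDefault(serviceName, List.of());
    }

    public static void main(String[] args) {
        ServiceRegistry registry = new ServiceRegistry();
        registry.register("user-profile", "10.0.1.17:8080");
        registry.register("user-profile", "10.0.2.44:8080");
        System.out.println(registry.lookup("user-profile"));
    }
}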

A lot is going on here! Fear not—you don’t need to understand it all right now; the illustration simply provides an overview of the key concepts.

Figure 1.9 The online banking software is a composition of apps and data services. Many types of interaction protocols are in play. (The figure shows multiple instances of the User Profile and Auth apps and of the User API and Auth API services, routers that direct requests to app instances and are continuously updated as instances change, the Auth Tokens and Users stateful services, and the event log.)


Yes, this is already a lot, but I want to add even more.


I haven’t explicitly stated it, but when you’re back at the bank branch and the
teller verifies your current contact information, you’ll expect the teller to have your
updated phone number. But the online banking software and the teller’s software are
two different systems. This is by design; it serves agility, resilience, and many of the
other requirements that I’ve identified as important for modern digital systems. Fig-
ure 1.10 shows this product suite.
The structure of the bank teller software isn’t markedly different from that of the
online banking software; it’s a composition of cloud-native apps and data. But, as you
can imagine, each digital solution deals with and even stores user data, or shall I say,
customer data. In cloud-native software, you lean toward loose coupling, even when
you’re dealing with data. This is reflected with the Users stateful service in the online
banking software and the Customers stateful service in the bank teller’s software.
The question, then, is how to reconcile common data values across these disparate
stores. How will your new phone number be reflected in the bank teller software?

Bank teller

Digital product suite

Online banking software Bank teller software

User profile
app
app App App
app app
Customer

User Customer Promotions


service
app service
app service
app

Auth
subsystem

Branch
Users Customers
promotions

Two independent software solutions, each of


which consists of a collection of apps, data,
and interactions among them.

Figure 1.10 What appears to a user as a single experience with Wizard’s Bank is realized by
independently developed and managed software assets.


Figure 1.11 A decomposed and loosely coupled data fabric requires techniques for cohesive data management. (Each of the two independent pieces of software deals with customer information: on the left, the customer is referred to as a user; on the right, as a customer. Cloud-native data is highly distributed, and a “distributed data coordination” element must address the way data is synchronized across the separate systems.)

In figure 1.11, I’ve added one more concept to our model—something I’ve labeled
“Distributed data coordination.” The depiction here doesn’t imply any implementa-
tion specifics. I’m not suggesting a normalized data model, hub-and-spoke master
data management techniques, or any other solution. For the time being, please accept
this as a problem statement; I promise we’ll study solutions soon.
That’s a lot! Figures 1.9, 1.10, and 1.11 are busy, and I don’t expect you to under-
stand in any great detail all that’s going on. What I do hope you take from this comes
back to the key theme for cloud-native software:
 The software solution comprises quite a distribution of a great many components.
 Protocols exist to specifically deal with the change that’s inflicted upon the system.

We’ll get into all the details, and more, throughout the following chapters.


1.3 Cloud-native and world peace


I’ve been practicing in this industry long enough that I’ve seen several technological
evolutions promise to solve all problems. When object-oriented programming emerged
in the late 1980s, for example, some people acted as if this style of software would essen-
tially write itself. And although such bullish predictions wouldn’t come to pass, many of
the hyped technologies, without question, brought improvements to many elements of
software—ease of construction and management, robustness, and more.
Cloud-native software architectures, often referred to as microservices,8 are all the
rage today—but spoiler alert, they also won’t lead to world peace. And even if they did
come to dominate (and I believe they will), they don’t apply to everything. Let’s look
at this in more detail in a moment, but first, let’s talk about that word, cloud.

1.3.1 Cloud and cloud-native


The narrative around the term cloud can be confusing. When I hear a company owner
say, “We’re moving to the cloud,” they often mean they’re moving some or maybe even
all of their apps into someone else’s data center—such as AWS, Azure, or GCP. These
clouds offer the same set of primitives that are available in the on-premises data center
(machines, storage, and network), so such a “move to the cloud” could be done with lit-
tle change to the software and practices currently being used on premises.
But this approach won’t bring much improved software resilience, better
management practices, or more agility to the software delivery processes. In fact,
because the SLAs for the cloud services are almost always different from those offered
in on-prem data centers, degradation is likely in many respects. In short, moving to
the cloud doesn’t mean your software is cloud-native or will demonstrate the values of
cloud-native software.
As I reasoned through earlier in the chapter, new expectations from consumers
and new computing contexts—the very ones of the cloud—force a change in the way
software is constructed. When you embrace the new architectural patterns and opera-
tional practices, you produce digital solutions that work well in the cloud. You might
say that this software feels quite at home in the cloud. It’s a native of that land.

NOTE Cloud is about where we’re computing. Cloud-native is about how.

If cloud-native is about how, does that mean you can implement cloud-native solutions
on premises? You bet! Most of the enterprises I work with on their cloud-native jour-
ney first do so in their own data centers. This means that their on-premises computing
infrastructure needs to support cloud-native software and practices. I talk about this
infrastructure in chapter 3.
As great as it is (and I hope that by the time you finish this book, you’ll think so
too), cloud-native isn’t for everything.

8. Although I use the term microservice to refer to the cloud-native architecture, I don’t feel that the term encompasses the other two equally important entities of cloud-native software: data and interactions.


1.3.2 What isn’t cloud-native


I’m certain it doesn’t surprise you to hear that not all software should be cloud-native.
As you learn the patterns, you’ll see that some of the new approaches require effort
that otherwise might not be necessary. If a dependent service is always at a known loca-
tion that never changes, you won’t need to implement a service discovery protocol.
And some approaches create new problems, even as they bring significant value;
debugging program flow through a bunch of distributed components can be hard.
Three of the most common reasons for not going cloud-native in your software archi-
tecture are described next.
First, sometimes the software and computing infrastructure don’t call for cloud-
native. For example, if the software isn’t distributed and is rarely changing, you can
likely depend on a level of stability that you should never assume for modern web or
mobile applications running at scale. Likewise, code that’s embedded in an increasing number of physical devices, such as a washing machine, may not even have
the computing and storage resources to support the redundancy so key to these mod-
ern architectures. My Zojirushi rice cooker’s software that adjusts the cooking time
and temperature based on the conditions reported by on-board sensors needn’t have
parts of the application running in different processes. If some part of the software or
hardware fails, the worst that will happen is that I’ll need to order out when my home-
cooked meal is ruined.
Second, sometimes common characteristics of cloud-native software aren’t appro-
priate for the problem at hand. You’ll see, for example, that many of the new patterns
give you systems that are eventually consistent; in your distributed software, data
updated in one part of the system might not be immediately reflected in all parts of
the system. Eventually, everything will match, but it might take a few seconds or even
minutes for everything to become consistent. Sometimes this is okay; for example, it
isn’t a major problem if, because of a network blip, the movie recommendations
you’re served don’t immediately reflect the latest five-star rating another user sup-
plied. But sometimes it’s not okay: a banking system can’t allow a user to withdraw all
funds and close their bank account in one branch office, and then allow additional
withdrawals from an ATM because the two systems are momentarily disconnected.
Eventual consistency is at the core of many cloud-native patterns, meaning that when
strong consistency is required, those particular patterns can’t be used.
And, finally, sometimes you have existing software that isn’t cloud-native, and
there’s no immediate value in rewriting it. Most organizations that are more than a
couple of decades old have parts of their IT portfolio running on a mainframe, and
believe it or not, they may keep running that mainframe code for another couple of
decades. But it’s not just mainframe code. A lot of software is running on a myriad of
existing IT infrastructures that reflect design approaches that predate the cloud. You
should rewrite code only when there’s business value in doing so, and even when
there is, you’re likely to have to prioritize such efforts, updating various offerings in
your portfolio over several years.


1.3.3 Cloud-native plays nice


But it’s not all or nothing. Most of you are writing software in a setting filled with exist-
ing solutions. Even if you’re in the enviable position of producing a brand-new appli-
cation, it will likely need to interface with one of those existing systems, and as I just
pointed out, a good bit of the software already running is unlikely to be fully cloud-
native. The brilliant thing about cloud-native is that it’s ultimately a composition of
many distinct components, and if some of those components don’t embody the most
modern patterns, the fully cloud-native components can still interact with them.
Applying cloud-native patterns where you can, even while other parts of your soft-
ware employ older design approaches, can bring immediate value. In figure 1.12, for
example, you see that we have a few application components. A bank teller accesses
account information via a user interface, which then interfaces with an API that fronts
a mainframe application. With this simple deployment topology, if the network
between that Account API service and the mainframe application is disrupted, the
customer will be unable to receive their cash.

Figure 1.12 Dispensing funds without access to the source of record is ill-advised. (The Bank Teller UI and the Account API are cloud-native apps; the mainframe application holding the account balances is a non-cloud-native module that’s interfaced via the Account API.)

But now let’s apply a few cloud-native patterns to parts of this system. For example, if
you deploy many instances of each microservice across numerous availability zones, a
network partition in one zone still allows access to mainframe data through service
instances deployed in other zones (figure 1.13).
It’s also worth noting that when you do have legacy code that you wish to refactor,
it needn’t be done in one fell swoop. Netflix, for example, refactored its entire


Figure 1.13 Applying some cloud-native patterns, such as redundancy and properly distributed deployments, brings benefit even in software that isn’t fully cloud-native. (Deploying multiple instances of the Account API across different failure boundaries, availability zones AZ1 and AZ2, allows cloud-native patterns to provide benefit in a hybrid, cloud-native and non-cloud-native, software architecture that still fronts the mainframe application.)

customer-facing digital solution to a cloud-native architecture as a part of its move to the cloud. Eventually. The move took seven years, but Netflix began refactoring some
parts of its monolithic, client-server architecture in the process, with immediate
benefits.9 As with the preceding banking example, the lesson is that even during a
migration, a partially cloud-native solution is valuable.
Whether you’re building a net, new application that’s born and bred in and for the
cloud, where you apply all of the newfangled patterns, or you’re extracting and mak-
ing cloud-native portions of an existing monolith, you can expect to realize significant
value. Although we weren’t using the term cloud-native then, the industry began exper-
imenting with microservices-centric architectures in the early 2010s, and many of the
patterns have been refined over several years. This “new” trend is well enough understood that its adoption has become widespread. We’ve seen the value that
these approaches bring.
I believe that this architectural style will be the dominant one for a decade or two to
come. What distinguishes it from other fads with less staying power is that it came as a
result of a foundational shift in the computing substrate. The client-server models that
dominated the last 20 to 30 years first emerged when the computing infrastructure
moved from the mainframe to one where many smaller computers became available,
9. For more details, see “Completing the Netflix Cloud Migration” by Yury Izrailevsky (http://mng.bz/6j0e).


and we wrote software to take advantage of that computing environment. Cloud-native has similarly emerged in response to a new computing substrate—one offering software-defined compute, storage, and networking abstractions that are highly distributed and constantly changing.

Summary
 Cloud-native applications can remain stable, even when the infrastructure
they’re running on is constantly changing or even experiencing difficulties.
 The key requirements for modern applications call for rapid iteration and frequent releases, zero downtime, and support for a massive increase in the volume and variety of the devices connected to them.
 A model for the cloud-native application has three key entities:
– The cloud-native app
– Cloud-native data
– Cloud-native interactions
 Cloud is about where software runs; cloud-native is about how it runs.
 Cloud-nativeness isn’t all or nothing. Some of the software running in your
organization may follow many cloud-native architectural patterns, other soft-
ware will live on with its older architecture, and still others will be hybrids (a
combination of new and old approaches).



Chapter 1 from Learn Quantum
Computing with Python and Q#
by Sarah Kaiser

Introducing
Quantum Computing

This chapter covers


 Why people are excited about quantum computing
 What a quantum computer is
 What a quantum computer is and is not capable of
 How a quantum computer relates to classical programming

Quantum computing has been an increasingly popular research field and source of
hype over the last few years. There seem to be news articles daily discussing new
breakthroughs and developments in quantum computing research, promising that
we can solve any number of different problems faster and with lower energy costs.
By using quantum physics to perform computation in new and wonderful
ways, quantum computers can make an impact across society, making it an exciting
time to get involved and learn how to program quantum computers and apply
quantum resources to solve problems that matter.
In all of the buzz about the advantages quantum computing offers, however, it is
easy to lose sight of the real scope of the advantages. We have some interesting historical precedent for what can happen when promises about a technology outpace
reality. In the 1970s, machine learning and artificial intelligence suffered from dra-
matically reduced funding, as the hype and excitement around AI outstripped its
results; this would later be called the “AI winter.” Similarly, internet companies faced the same danger as they tried to recover from the dot-com bust.
One way forward is to understand critically what quantum computing promises, how quantum computers work, what they can do, and what is out of scope. In this chapter, we’ll do just that, so that you can get hands-on experience writing your own quantum programs in the rest of the book.
All that aside, though, it’s just really cool to learn about an entirely new computing
model! To develop that understanding, as you read this book you’ll learn how quan-
tum computers work by programming simulations that you can run on your laptop
today. These simulations will show many of the essential elements of what we expect
real commercial quantum programming to be like, while useful commercial hardware
is coming online.

1.1 Why does quantum computing matter?


Computing technology advances at a truly stunning pace. Three decades ago, the
80486 processor would allow users to execute 50 MIPS (million instructions per sec-
ond), but today, small computers like the Raspberry Pi can reach 5,000 MIPS, while
desktop processors can easily reach 50,000 to 300,000 MIPS. If you have an exception-
ally difficult computational problem you’d like to solve, a very reasonable strategy is to
simply wait for the next generation of processors to make your life easier, your videos
stream faster, and your games more colorful.
For many problems that we care about, however, we’re not so lucky. We might
hope that getting a CPU that’s twice as fast lets us solve problems twice as big, but just
as with so much in life, more is different. Suppose we want to sort a list of 10 million
numbers and find that it takes about 1 second. If we later want to sort a list of 1 billion
numbers in one second, we’ll need a CPU that’s 130 times faster, not just 100 times.
Some problems make this even worse: for some problems in graphics, going from 10
million to 1 billion points would take 13,000 times longer.
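A quick back-of-the-envelope check makes those numbers plausible, assuming the sort costs on the order of n log n operations and the harder graphics problem scales roughly as n^2 log n; both scalings are assumptions for illustration, not claims about a specific algorithm:

\[
\frac{10^{9}\,\log_{10} 10^{9}}{10^{7}\,\log_{10} 10^{7}} = 100 \times \frac{9}{7} \approx 130,
\qquad
\frac{(10^{9})^{2}\,\log_{10} 10^{9}}{(10^{7})^{2}\,\log_{10} 10^{7}} = 10^{4} \times \frac{9}{7} \approx 13{,}000.
\]
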
Problems as widely varied as routing traffic in a city and predicting chemical reac-
tions get more difficult much more quickly still. If quantum computing were simply a
computer that runs 1,000 times as fast, we would barely make a dent in the daunting
computational challenges that we want to solve. Thankfully, quantum computers are
much more interesting than that. In fact, we expect that quantum computers will
likely be much slower than classical computers, but that the resources required to solve
many problems scale differently, such that if we look at the right kinds of problems we
can break through “more is different.” At the same time, quantum computers aren’t a
magic bullet, in that some problems will remain hard. For example, while it is likely
that quantum computers can help us immensely with predicting chemical reactions, it
is possible that they won’t be of much help with other difficult problems.


Investigating exactly which problems we can obtain such an advantage in and developing quantum algorithms to do so has been a large focus of quantum comput-
ing research. Understanding where we can find advantages by using quantum com-
puters has recently become a significant focus for quantum software development in
industry as well. Software platforms such as the Quantum Development Kit make it
easier to develop software for solving problems on quantum computers, and in turn to
assess how easy different problems are to solve using quantum resources.
Up until this point, it has been very hard to assess quantum approaches in this way,
as doing so required extensive mathematical skill to write out quantum algorithms
and to understand all of the subtleties of quantum mechanics. Now, industry has
started developing software packages and even new languages and frameworks to help
connect developers to quantum computing. By leveraging Microsoft’s entire Quan-
tum Development Kit, we can abstract away most of the mathematical complexities of
quantum computing and help people actually understand and use quantum computers. The tools and techniques taught in this book allow developers to explore
and understand what writing programs for this new hardware platform will be like.
Put differently, quantum computing is not going away, so understanding the scale of the problems we can solve with quantum computers matters quite a lot indeed! There are already
many governments and CEOs who are convinced that quantum computing will be the
next big thing in computing. Some people care about quantum attacks on cryptogra-
phy, some want their own quantum computer next year. Independent of whether or not
a quantum “revolution” happens, quantum computing has and will factor heavily into
decisions about how to develop computing resources over the next several decades.
Decisions that are strongly impacted by quantum computing:
 What assumptions are reasonable in information security?
 What skills are useful in degree programs?
 How should we evaluate the market for computing solutions?

For those of us working in tech or related fields, we increasingly must make or provide
input into these decisions. We have a responsibility to understand what quantum com-
puting is, and perhaps more importantly still, what it is not. That way, we will be best
prepared to step up and contribute to these new efforts and decisions.
All that aside, another reason that quantum computing is such a fascinating topic
is that it is both similar to and very different from classical computing. Understanding
both the similarities and differences between classical and quantum computing helps
us to understand what is fundamental about computing in general. Both classical and
quantum computation arise from different descriptions of physical laws such that
understanding computation can help us understand the universe itself in a new way.
What’s absolutely critical, though, is that there is no one right or even best reason
to be interested in quantum computing. Whether you’re reading this because you
want to have a nice nine-to-five job programming quantum computers or because you
want to develop the skills you need to contribute to quantum computing research,
you’ll learn something interesting to you along the way.


1.2 What Can Quantum Computers Do?


As quantum programmers we would like to know:
If we have a particular problem, how do we know it makes sense to solve it with a
quantum computer?
We are still learning about the exact extent of what quantum computers are capa-
ble of, and thus we don’t have any concrete rules to answer this question yet. So far, we
have found some examples of problems where quantum computers offer significant
advantages over the best-known classical approaches. In each case, the quantum algo-
rithms that have been found to solve these problems exploit quantum effects to
achieve the advantages, sometimes referred to as a quantum advantage.
Some useful quantum algorithms
 Grover’s algorithm (Chapter 10): searches a list of N items in √N steps.
 Shor’s algorithm (Chapter 11): quickly factors large integers, such as those used
by cryptography to protect private data.
Though we’ll see several more in this book, both Grover’s and Shor’s are good exam-
ples of how quantum algorithms work: each uses quantum effects to separate correct
answers to computational problems from invalid solutions. One way to realize a quan-
tum advantage is to find ways of using quantum effects to separate correct and incor-
rect solutions to classical problems.
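To get a feel for the scale of that kind of advantage, consider an unstructured search over N items; this is only a rough comparison of step counts, ignoring constant factors and the very different cost of a quantum versus a classical step:

\[
N = 10^{6}: \qquad \text{classical search} \approx \frac{N}{2} = 500{,}000 \text{ checks on average},
\qquad \text{Grover's algorithm} \approx \sqrt{N} = 1{,}000 \text{ steps}.
\]
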

What are quantum advantages?


Grover’s and Shor’s algorithms illustrate two distinct kinds of quantum advantage.
Factoring integers might be easier classically than we suspect, as we haven’t been
able to prove that factoring is difficult. The evidence that we use to conjecture that
factoring is hard classically is largely derived from experience, in that many people
have tried very hard to factor integers quickly, but haven’t succeeded. On the other
hand, Grover’s algorithm is provably faster than any classical algorithm, but uses a
fundamentally different kind of input.
Other quantum advantages might derive from problems in which we must simulate
quantum dynamics. Quantum effects such as interference are quite useful in simu-
lating other quantum effects.
Finding a provable advantage for a practical problem is an active area of research in
quantum computing. That said, quantum computers can be a powerful resource for
solving problems even if we can’t necessarily prove that there will never be a better
classical algorithm. After all, Shor’s algorithm already challenges the assumptions
underlying large swaths of classical information security—a mathematical proof is in
that sense made necessary only by the fact that we haven’t yet built a quantum com-
puter in practice.

Quantum computers also offer significant benefits to simulating properties of quan-


tum systems, opening up applications to quantum chemistry and materials science.
For instance, quantum computers could make it much easier to learn about the
ground state energies of chemical systems. These ground state energies then provide

insight into reaction rates, electronic configurations, thermodynamic properties, and other properties of immense interest in chemistry.
Along the way to developing these applications, we have also seen significant
advantages in spin-off technologies such as quantum key distribution and quantum
metrology, as we will see in the next few chapters. In learning to control and under-
stand quantum devices for the purpose of computing, we also have learned valuable
techniques for imaging, parameter estimation, security, and more. While these are not
applications for quantum computing in a strict sense, they go a long way to showing
the values of thinking in terms of quantum computation.
Of course, new applications of quantum computers are much easier to discover
when we have a concrete understanding of how quantum algorithms work and how to
build new algorithms from basic principles. From that perspective, quantum program-
ming is a great resource to learn how to discover entirely new applications.

1.3 What is a Quantum Computer?


Let’s talk a bit about what actually makes up a quantum computer. To facilitate this, let’s
briefly talk about what the term “computer” means. In a very broad sense, a computer is
a device that takes data as input and does some sort of operations on that data.

IMPORTANT Computer: A computer is a device that takes data as input and does some sort of operations on that data.

There are many examples of what we have called a computer; see Figure 1.1 for a few.
When we say the word “computer” in conversation, though, we tend to mean
something more specific. Often, we think of a computer as an electronic device like the
one we are currently writing this book on (or that you might be using to read this
book!). For any resource up to this point that we have made a computer out of, we can
model it with classical physics—that is, in terms of Newton’s laws of motion, Newto-
nian gravity, and electromagnetism.
Following this perspective, we will refer to computers that are described using clas-
sical physics as classical computers. This will help us tell apart the kinds of computers
we’re used to (e.g. laptops, phones, bread machines, houses, cars, and pacemakers)
from the computers that we’re learning about in this book.
Specifically, in this book, we’ll be learning about quantum computers. By the way
we have formulated the definition for classical computers, if we just replace the
term classical physics with quantum physics we have a suitable definition for what a quan-
tum computer is!

Quantum Computer
IMPORTANT A quantum computer is a device that takes data as input and does some sort of
operations on that data, which requires the use of quantum physics to describe this
process.

Figure 1.1 Several examples of different kinds of computers, including the UNIVAC mainframe operated by Rear
Admiral Hopper, a room of “human computers” working to solve flight calculations, a mechanical calculator, and
a LEGO-based Turing machine. Each computer can be described by the same mathematical model as computers
like cell phones, laptops, and servers. Photo of “human computers” by NASA. Photo of LEGO Turing machine by
Projet Rubens, used under CC BY 3.0. (https://creativecommons.org/licenses/by/3.0/).

The distinction between classical and quantum computers is precisely that between classi-
cal and quantum physics. We will get into this more later in the book, but the primary dif-
ference is one of scale: our everyday experience is largely with objects that are large
enough and hot enough that even though quantum effects still exist, they don’t do much
on average. Quantum physics still remains true, even at the scale of everyday objects like
coffee mugs, bags of flour, and baseball bats, but we can do a very good job of describing
how these objects interact using physical laws like Newton’s laws of motion.

DEEP DIVE: What happened to relativity?


Quantum physics applies to objects that are very small and very cold or well-isolated.
Similarly, another branch of physics called relativity describes objects that are large
enough for gravity to play an important role, or that are moving very fast—near the
speed of light. We have already discussed classical physics, which could also be said
to describe things that are neither quantum mechanical nor relativistic. So far we have
primarily been comparing classical and quantum physics, raising the question: why
aren’t we concerned about relativity? Many computers use relativistic effects to per-
form computation; indeed, global positioning satellites depend critically on relativity.

That said, all of the computation that is implemented using relativistic effects can
also be described using purely classical models of computing such as Turing
machines. By contrast, quantum computation cannot be described as faster classical
computation, but requires a different mathematical model. There has not yet been a
proposal for a “gravitic computer” that uses relativity to realize a different model of
computation, so we’re safe to set relativity aside in this book.

Quantum computing is the art of using small and well-isolated devices to usefully
transform our data in ways that cannot be described in terms of classical physics alone.
We will see in the next chapter, for instance, that we can generate random numbers
on a quantum device by using the idea of rotating between different states. One way to
build quantum devices is to use small classical computers such as digital signal proces-
sors (DSPs) to control properties of exotic materials.
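
As a rough preview of that idea, here is a minimal Python sketch (an illustration only, not the simulator built later in the book): it models a single qubit as a two-element vector, rotates it into an equal superposition with the standard Hadamard operation, and samples a measurement to produce a random bit. The names and structure are assumptions made for this example.

import numpy as np

rng = np.random.default_rng()

# A qubit state is a 2-element complex vector; |0> is [1, 0].
ket0 = np.array([1, 0], dtype=complex)

# The Hadamard operation rotates |0> into an equal superposition of |0> and |1>.
H = np.array([[1, 1],
              [1, -1]], dtype=complex) / np.sqrt(2)

def qrng_bit():
    """Prepare |0>, rotate it with H, and simulate a measurement."""
    state = H @ ket0                       # equal superposition of |0> and |1>
    probabilities = np.abs(state) ** 2     # Born rule: probability = |amplitude|^2
    return int(rng.choice([0, 1], p=probabilities))

print([qrng_bit() for _ in range(10)])     # e.g. [0, 1, 1, 0, 0, 1, ...]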

Physics and quantum computing


NOTE The exotic materials used to build quantum computers have names that can sound intimi-
dating, like “superconductors” and “topological insulators.” We can take solace, though,
from how we learn to understand and use classical computers. Modern processors are built
using materials like semiconductors, but we can program classical computers without know-
ing what a semiconductor is. Similarly, the physics behind how we can use superconductors
and topological insulators to build quantum computers is a fascinating subject, but it’s not
required for us to learn how to program and use quantum devices.

Quantum operations are applied by sending in small amounts of microwave power
and amplifying very small signals coming out of the quantum device.
Other quantum devices may differ in the details of how they are controlled, but
what remains consistent is that all quantum devices are controlled from and read out
by classical computers and control electronics of some kind. After all, we are ulti-
mately interested in classical data, and so there must eventually be an interface with
the classical world.

NOTE For most quantum devices, we need to keep them very cold and very well isolated, since
quantum devices can be very susceptible to noise.

By applying quantum operations using embedded classical hardware, we can manipu-
late and transform quantum data. The power of quantum computing then comes
from carefully choosing which operations to apply in order to implement a useful
transformation that solves a problem of interest.

Figure 1.2 An example of how a quantum device might interact with a classical computer through the use of a
digital signal processor (DSP). The DSP sends low-power signals into the quantum device, and amplifies the very low-
power signals coming back from it.

1.3.1 How will we use quantum computers?

Figure 1.3 Ways we wish we could use quantum computers. Comic used with permission from xkcd.com.

It is important to understand both the potential and the limitations of quantum com-
puters, especially given the hype surrounding quantum computation. Many of the
misunderstandings underlying this hype stem from extrapolating analogies beyond
where they make any sense—all analogies have their limits, and quantum computing
is no different in that regard.

TIP If you’ve ever seen descriptions of new results in quantum computing that read like “we can
teleport cats that are in two places at once using the power of infinitely many parallel uni-
verses all working together to cure cancer,” then you’ve seen the danger of extrapolating too
far from where analogies are useful.

Indeed, any analogy or metaphor used to explain quantum concepts will be wrong if
you dig deep enough. Simulating how a quantum program acts in practice can be a
great way to help test and refine the understanding provided by analogies. Nonethe-
less, we will still leverage analogies in this book, as they can be quite helpful in provid-
ing intuition for how quantum computation works.
One especially common point of confusion regarding quantum computing is the
way in which users will leverage quantum computers. We as a society now have a particu-
lar understanding of what a device called a computer does. A computer is something that
you can use to run web applications, write documents, and run simulations, to name a
few common uses. In fact, classical computers are present in every aspect of our lives,
making it easy to take computers for granted. We don’t always even notice what is and
isn’t a computer. Cory Doctorow made this observation by noting that “your car is a
computer you sit inside of” (https://www.youtube.com/watch?v=iaf3Sl2r3jE).
Quantum computers, however, are likely to be much more special-purpose. Just as
not all computation runs on graphics processing units (GPUs) or field-programma-
ble gate arrays (FPGAs), we expect quantum computers to be somewhat pointless for
some tasks.

IMPORTANT Programming a quantum computer comes along with some restrictions, so classical
computers will be preferable in cases where there’s no particular quantum advan-
tage to be found.

Classical computing will still be around and will be the main way we communicate and
interact with each other, as well as with our quantum hardware. Even to get classical
computing resources to interface with quantum devices, in most cases we will also need
a digital-to-analogue signal processor, as shown in Figure 1.2.
Moreover, quantum physics describes things at very small scales (both size and
energy) that are well-isolated from their surroundings. This puts some hard limitations
on what environments we can run a quantum computer in. One possible solution is to
keep our quantum devices in cryogenic fridges, often near absolute zero (0 K, -459.67 °F,
or -273.15 °C). While this is not a problem at all to achieve in a data center, maintaining a
dilution refrigerator isn’t really something that makes sense on a desktop, much less in a
laptop or a cell phone. For this reason, it’s very likely that quantum computers will, at
least for quite a while after they first become commercially available, be used through
the cloud.
Using quantum computers as a cloud service resembles other advances in special-
ized computing hardware. By centralizing exotic computing resources in data centers,
it’s possible to explore computing models that are difficult for all but the largest users to
deploy on-premises. Just as high-speed and high-availability Internet connections have
made cloud computing accessible for large numbers of users, you will be able to use
quantum computers from the comfort of your favorite WiFi-blanketed beach, coffee
shop, or even from a train as you watch majestic mountain ranges off in the distance.
Exotic cloud computing resources:
 Specialized gaming hardware (PlayStation Now, Xbox One).
 Extremely low-latency high-performance computing (e.g. Infiniband) clusters
for scientific problems.
 Massive GPU clusters.
 Reprogrammable hardware (e.g. Catapult/Brainwave).
 Tensor processing unit (TPU) clusters.
 High-permanence high-latency archival storage (e.g. Amazon Glacier).

Going forward, cloud services like Azure Quantum (https://azure.com/quantum/)
will make the power of quantum computing available in much the same way.

1.3.2 What can’t quantum computers do?


Like other forms of specialized computing hardware, quantum computers won’t be
good at everything. For some problems, classical computers will simply be better
suited to the task. In developing applications for quantum devices, it’s helpful to note
what tasks or problems are out of scope for quantum computing.
The short version is that we don’t have any hard-and-fast rules to quickly decide
which tasks are best run on classical computers and which tasks can take
advantage of quantum computers. For example, the storage and bandwidth require-
ments for Big Data–style applications are very difficult to map onto quantum devices,
where you may only have a relatively small quantum system. Current quantum com-
puters can only record inputs of no more than a few dozen bits, a limitation that will
become more relevant as quantum devices are used for more demanding tasks.
Although we expect to eventually be able to build much larger quantum systems than
we can now, classical computers will likely always be preferable for problems which
require large amounts of input/output to solve.
Similarly, machine learning applications that depend heavily on random access to
large sets of classical inputs are conceptually difficult to solve with quantum comput-
ing. That said, there may be other machine learning applications that map much
more naturally onto quantum computation. Research efforts to find the best ways to
apply quantum resources to solve machine learning tasks are still ongoing. In general,
problems that have small input and output data sizes but large amounts of computa-
tion to get from input to output are good candidates for quantum computers.
In light of these challenges, it might be tempting to conclude that quantum com-
puters always excel at tasks which have small inputs and outputs but have very intense
computation between the two. Notions like “quantum parallelism” are popular in
media, and quantum computers are sometimes even described as using parallel uni-
verses to compute.

NOTE The concept of “parallel universes” is a great example of an analogy that can help make
quantum concepts understandable, but that can lead to nonsense when taken to its
extreme. It can sometimes be helpful to think of the different parts of a quantum computa-
tion as being in different universes that can’t affect each other, but this description makes
it harder to think about some of the effects we will learn about in this book, such as
interference. When taken too far, the “parallel universes” analogy also lends itself to thinking
of quantum computing in ways that are closer to a particularly pulpy and fun episode of a
sci-fi show like Star Trek than to reality.

What this fails to communicate, however, is that it isn’t always obvious how to use
quantum effects to extract useful answers from a quantum device, even if it appears to
contain the desired output. For instance, one way to factor an integer classically is to
list each potential factor, and to check if it’s actually a factor or not.
Factoring N classically:
 Let i = 2.
 Check if the remainder of N/i is zero.
– If so, return that i factors N.
– If not, increment i and loop.
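A minimal Python sketch of this procedure might look like the following. It is purely an illustration of the classical approach above, with a square-root cutoff added so that the loop also terminates when N is prime; it is not an optimized factoring routine.

def find_factor(N):
    """Trial division: return the smallest nontrivial factor of N, or None if N is prime."""
    i = 2
    while i * i <= N:      # any composite N has a factor no larger than sqrt(N)
        if N % i == 0:     # the remainder of N/i is zero, so i factors N
            return i
        i += 1             # otherwise, increment i and loop
    return None

print(find_factor(21))     # 3
print(find_factor(13))     # None, since 13 is prime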
We can speed this classical algorithm up by using a large number of different classical
computers, one for each potential factor that we want to try. That is, this problem can
be easily parallelized. A quantum computer can try each potential factor within the
same device, but as it turns out, this isn’t yet enough to factor integers faster than the
classical approach above. If you run this on a quantum computer, the output will be
one of the potential factors chosen at random. The actual correct factors will occur
with probability about 1 / √N, which is no better than the classical algorithm above.
As we’ll see in Chapter 11, though, we can use other quantum
effects to factor integers with a quantum computer faster than the best-
known classical factoring algorithms. Much of the heavy lifting done by Shor’s algo-
rithm is to make sure that the probability of measuring a correct factor at the end is
much larger than measuring an incorrect factor. Cancelling out incorrect answers in
this way is where much of the art of quantum programming comes in; it’s not easy or
even possible to do for all problems we might want to solve.
To understand more concretely what quantum computers can and can’t do, and
how to do cool things with quantum computers despite these challenges, it’s helpful
to take a more concrete approach. Thus, let’s consider what a quantum program
even is, so that we can start writing our own.

1.4 What is a Program?


Throughout this book, we will often find it useful to explain a quantum concept by
first re-examining the analogous classical concept. In particular, let’s take a step back
and examine what a classical program is.

Program
IMPORTANT A program is a sequence of instructions that can be interpreted by a classical com-
puter to perform a desired task.

Examples of classical programs
Tax forms
Map directions
Recipes
Python scripts

We can write classical programs to break down a wide variety of different tasks for
interpretation by all sorts of different computers.

Figure 1.4 Examples of classical programs. Tax forms, map directions, and recipes are all examples in
which a sequence of instructions is interpreted by a classical computer such as a person. Each of these
may look very different but use a list of steps to communicate a procedure.

Let’s take a look at what a simple “hello, world” program might look like in Python:
>>> def hello():
... print("Hello, world!")
...
>>> hello()
Hello, world!

At its most basic, this program can be thought of as a sequence of instructions given to
the Python interpreter, which then executes each instruction in turn to accomplish
some effect—in this case, printing a message to the screen.
We can make this way of thinking more formal by using the dis module provided
with Python to disassemble hello() into a sequence of instructions:
>>> import dis
>>> dis.dis(hello)
2 0 LOAD_GLOBAL 0 (print)
2 LOAD_CONST 1 ('Hello, world!')
4 CALL_FUNCTION 1
6 POP_TOP
8 LOAD_CONST 0 (None)
10 RETURN_VALUE

NOTE You may get different output on your system, depending on what version of Python you’re
using.

Each line consists of a single instruction that is passed to the Python virtual machine;
for instance, the LOAD_GLOBAL instruction is used to look up the definition of
the print function. The print function is then called by the CALL_FUNCTION instruc-
tion. The Python code that we wrote above was compiled by the interpreter to produce
this sequence of instructions. In turn, the Python virtual machine executes our pro-
gram by calling instructions provided by the operating system and the CPU.

Figure 1.5 An example of how a classical computing task is repeatedly described and interpreted.
Programmers start by writing code in a language like Python and those instructions get translated to
lower and lower level descriptions until it can easily be run on the CPU. The CPU then causes a physical
change in the display hardware that the programmer can observe.

At each level, we have a description of some task that is then interpreted by some other
program or piece of hardware to accomplish some goal. This constant interplay
between description and interpretation motivates calling Python, C, and other such
programming tools languages, emphasizing that programming is ultimately an act of
communication.
In the example of using Python to print “Hello, world!”, we are effectively commu-
nicating with Guido van Rossum, the founding designer of the Python language.
Guido then effectively communicates on our behalf with the designers of the operat-
ing system that we are using. These designers in turn communicate on our behalf with
Intel, AMD, ARM, or whomever has designed the CPU that we are using, and so forth.
As with any other use of language to communicate, our choice of programming lan-
guage affects how we think and reason about programming. When we choose a pro-
gramming language, the different features of that language and the syntax used to
express those features mean that some ideas are more easily expressed than others.

1.4.1 What is a Quantum Program?


Like classical programs, quantum programs consist of sequences of instructions that
are interpreted by classical computers to perform a particular task. The difference,
however, is that in a quantum program, the task we wish to accomplish involves con-
trolling a quantum system to perform a computation.

Figure 1.6 Writing a quantum program with the Quantum Development Kit and Visual Studio Code. We
will get to the content of this program in Chapter 5, but what you can see at a high level is that it looks
quite similar to other software projects you may have worked on before.

The instructions available to classical and quantum programs differ according to this
difference in tasks. For instance, a classical program may describe a task such as load-
ing some cat pictures from the Internet in terms of instructions to a networking stack,
and eventually in terms of assembly instructions such as mov (move). By contrast,
quantum languages like Q# allow programmers to express quantum tasks in terms of
instructions like M (measure).
When run using quantum hardware, these programs may instruct a digital signal
processor such as that shown in Figure 1.7 to send microwaves, radio waves, or lasers
into a quantum device, and to amplify signals coming out of the device.
If we are to achieve a different end, however, it makes sense for us to use a lan-
guage that reflects what we wish to communicate! We have many different classical

Figure 1.7 An example of how a quantum device might interact with a classical computer through the use of a
digital signal processor (DSP). The DSP sends low-power signals into the quantum device, and amplifies the very low-
power signals coming back from it.

programming languages for just this reason, as it doesn’t make sense to use only one
of C, Python, JavaScript, Haskell, Bash, T-SQL, or any of a whole multitude of other
languages. Each language focuses on a subset of tasks that arise within classical pro-
gramming, allowing us to choose a language that lets us express how we would like to
communicate that task to the next level of interpreters.
Quantum programming is thus distinct from classical programming almost
entirely in terms of what tasks are given special precedence and attention. On the
other hand, quantum programs are still interpreted by classical hardware such as digi-
tal signal processors, so a quantum programmer writes quantum programs using clas-
sical computers and development environments.
Throughout the rest of this book, we will see many examples of the kinds of tasks
that a quantum program is faced with solving or at least addressing, and what kinds of
classical tools we can use to make quantum programming easier. We will build up the
concepts you need to write quantum programs chapter by chapter; you can see a road
map of how these concepts build up in Figure 1.8.

Summary
 Quantum computing is important because quantum computers potentially
allow us to solve problems that are too difficult to solve with conventional com-
puters.
 Quantum computers can provide advantages over classical computers for some
kinds of problems, such as factoring large numbers.
 Quantum computers are devices that use quantum physics to process data.
 Programs are sequences of instructions that can be interpreted by a classical
computer to perform tasks.
 Quantum programs are programs that perform computation by sending
instructions to quantum devices.

Figure 1.8 This book will try to build up the concepts you need to write quantum programs. You will
start in Part 1 at the lower-level descriptions of the simulators and the intrinsic operations (think of
them as a hardware API) by building your own simulator in Python. Part 2 will take you into the Q#
language and quantum development techniques that will help you develop your own applications.
Part 3 will show you some known applications for quantum computing and the challenges and
opportunities we have with this technology moving forward.

Chapter 1 from Algorithms and Data
Structures for Massive Datasets
by Dzejla Medjedovic, Ph.D, Emin
Tahirovic, and Ines Dedovic

Introduction

This chapter covers:


 What this book is about and its structure
 What makes this book different than other books on
algorithms
 How massive datasets affect the algorithm and data
structure design
 How this book can help you design practical algorithms
at a workplace
 Fundamentals of computer and system architecture that
make massive data challenging for today’s systems

Having picked up this book, you might be wondering what the algorithms and data
structures for massive datasets are, and what makes them different from “normal”
algorithms you might have encountered thus far. Does the title of this book imply
that the classical algorithms (e.g., binary search, merge sort, quicksort, fundamen-
tal graph algorithms) as well as canonical data structures (e.g., arrays, matrices,
hash tables, binary search trees, heaps) were built exclusively for small datasets?
And if so, why the hell did no one tell you that?
The answer to this question is not that short and simple (but if it had to be short
and simple, it would be “Yes”). The notion of what constitutes a massive dataset is
relative and it depends on many factors, but the fact of the matter is that most bread-
and-butter algorithms and data structures that we know about and work with on a daily
basis have been developed with an implicit assumption that all data fits in the main
memory, or random-access memory (RAM) of one’s computer. So once you read all your
data into RAM and store it into the data structure, it is relatively fast and easy to access
any element of it, at which point, the ultimate goal from the efficiency point of view
becomes to crunch the most productivity into the fewest number of CPU cycles. This is
what the Big-Oh Analysis (O(.)) teaches us about; it commonly expresses the worst-case
number of basic operations the algorithm has to perform in order to solve a problem.
These unit operations can be comparisons, arithmetic, bit operations, memory cell
read/write/copy, or anything that directly translates into a small number of CPU cycles.
However, if you are a data scientist today, a developer or a back-end engineer work-
ing for a company that regularly collects data from its users, whether it be a retail web-
site, a bank, a social network, or a smart-bed app collecting sensor data, storing all data
into the main memory of your computer probably sounds like a beautiful dream. And
you don’t have to work for Facebook or Google to deal with gigabytes (GB), terabytes
(TB) or even petabytes (PB) of data almost on a daily basis. According to some projec-
tions, from 2020 onward, the amount of data generated will be at least equal to every
person on Earth generating close to 2 megabytes (MB) per second!10 Companies with a
more sophisticated infrastructure and more resources can afford to delay thinking
about scalability issues by spending more money on the infrastructure (e.g., by buying
more RAM), but, as we will see, even those companies, or should we say, especially those
companies, choose to fill that extra RAM with clever and space-efficient data structures.
The first paper11 to introduce external-memory algorithms—a class of algorithms
that today govern the design of large databases, and whose goal is to minimize the
total number of memory transfers during the execution of the program—appeared
back in 1988, where, as the motivation, the authors cite the example of large banks
having to sort 2 million checks daily, about 800MB worth of checks to be sorted over-
night before the next business day. With working memories of that time being around
2-4MB in size, this was indeed a massive dataset. Figuring out how to sort checks
efficiently when only about 4MB worth of checks fits into working memory (and can
thus be sorted) at a time, and how to swap pieces of data in and out so as to minimize
the number of trips the data makes between disk and main memory, was a relevant
problem back then, and it has only become more relevant since. In past
decades, data has grown tremendously but perhaps more importantly, it has grown at
a much faster rate than the average size of RAM memory.
One of the central consequences of the rapid growth of data, and the main idea
motivating algorithms in this book, is that most applications today are data-intensive.

10
Domo.com, “Data Never sleeps,”[Online]. Available: https://www.domo.com/solution/data-never-sleeps-6.
[Accessed 19th January, 2020].
11
A. Aggarwal and S. Vitter Jeffrey, “The input/output complexity of sorting and related problems,” J Commun.
ACM, vol. 31, no. 9, pp. 1116-1127, 1988.

Data-intensive (in contrast to CPU-intensive) means that the bottleneck of the applica-
tion comes from transferring data back and forth and accessing data, rather than
doing computation on that data (in Section 1.4 of this chapter, there are more details
as to why data access is much slower than the computation). But what this practically
means is it will require more time to get data to the point where it is available for us to
solve the problem on it than to actually solve the problem; thus improvements in
managing the size of the data or in its access patterns are some of the most
effective ways to speed up the application.
In addition, the infrastructure of modern-day systems has become very complex.
With thousands of computers exchanging data over the network, databases and
caches are distributed, and many users simultaneously add and query large amounts
of content. Data itself has become complex, multidimensional, and dynamic. The
applications, in order to be effective, need to respond to changes very quickly. In
streaming applications12, data effectively flies by without ever being stored, and the
application needs to capture the relevant features of the data with a degree of accu-
racy that renders it relevant and useful, without the ability to scan it again. This new con-
text calls for a new generation of algorithms and data structures, a new application
builder’s toolbox that is optimized to address many challenges of massive-data sys-
tems. The intention of this book is to teach you exactly that: the fundamental algo-
rithmic techniques and data structures for developing scalable applications.

1.1 An example
To illustrate the main themes of this book, consider the following example: you are work-
ing for a media company on a project related to news article comments. You are given a
large repository of comments with the following associated basic metadata information:
{
comment-id: 2833908
article-id: 779284
user-id: johngreen19
text: this recipe needs more butter
views: 14375
likes: 43
}

We are looking at approximately 100 million news articles and roughly 3 billion user
comments. Assuming storing one comment takes 200 bytes, we need about 600GB to
store the comment data. Your goal is to serve your readers better, and in order to do
that, you would like to classify the articles according to keywords that recur in the
comments below the articles. You are given a list of relevant keywords for each topic
(e.g., ‘sports’, ‘politics’, etc.), and for initial analysis, the goal is only to count how
often the given keywords occur in comments related to a particular article, but
before all that, you would like to eliminate the duplicate comments that occurred
during multiple instances of crawling.

12
B. Ellis, Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data, Wiley Publishing, 2014.

1.1.1 An example: how to solve it


One typical way to process the dataset to solve the tasks above is to create some type of a
key-value dictionary, for example set, map, or unordered_map in C++, HashMap in Java,
or dict in Python, etc. Key-value dictionaries are implemented either with a balanced
binary tree (such as a red-black tree in C++ map), or, for faster insert and retrieval opera-
tions, a hash table (such as dict in Python or unordered_map in C++). For simplicity, we
will assume for the rest of this example that we are working with hash tables.
Using comment-id as the key, and the number of occurrences of that comment-id
in the dataset as the value will help us eliminate duplicate comments. We will call this the
(comment-id -> frequency) hash table (see Figure 1.1). Then, for each keyword of
interest, we build a separate hash table that counts the number of occurrences of the
given keyword from all the comments grouped by article-id, so these hash tables
have the format (article-id -> keyword_frequency).
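
As a rough sketch of this plan in Python (the field names, helper function, and tiny sample data below are illustrative assumptions made for this example, not code from the chapter):

from collections import Counter, defaultdict

def build_tables(comments, keywords):
    comment_freq = Counter()                                   # (comment-id -> frequency)
    keyword_freq = {kw: defaultdict(int) for kw in keywords}   # per keyword: (article-id -> count)

    for c in comments:
        comment_freq[c["comment-id"]] += 1
        if comment_freq[c["comment-id"]] > 1:
            continue                                    # duplicate crawl of the same comment
        words = c["text"].lower().split()
        for kw in keywords:
            if kw in words:
                keyword_freq[kw][c["article-id"]] += words.count(kw)

    return comment_freq, keyword_freq

comments = [
    {"comment-id": 2833908, "article-id": 779284, "text": "more politics please"},
    {"comment-id": 2833908, "article-id": 779284, "text": "more politics please"},  # duplicate
]
freq, kw = build_tables(comments, ["politics", "sports"])
print(freq[2833908])             # 2, so this comment was crawled twice
print(kw["politics"][779284])    # 1, counted once per unique comment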
But to build the basic (comment-id -> frequency) hash table for 3 billion com-
ments, if we use 8 bytes to store each <comment-id, frequency> pair (4 bytes for com-
ment-id and 4 bytes for frequency), we might need up to 24GB for the hash table
data. Also, our keyword hash tables can become very large, containing dozens of mil-
lions of entries: one such hash table of 10 million entries, where each entry takes 8
bytes, will need around 80MB for data, and maintaining such hash tables for, say, top
1000 keywords, can cost up to 80GB only for data. We emphasize the “only for data”
part because a hash table, depending on how it is implemented, will need extra space,
either for the empty slots or for pointers.
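
These back-of-the-envelope numbers are easy to reproduce; the short calculation below simply restates the assumptions from the text (8 bytes per entry, 3 billion comments, 10 million entries per keyword table, 1000 keywords):

num_comments = 3_000_000_000
bytes_per_entry = 8                     # 4 bytes for comment-id + 4 bytes for frequency
print(num_comments * bytes_per_entry / 1e9)        # 24.0 -> about 24GB for the comment table

entries_per_keyword_table = 10_000_000
num_keywords = 1000
print(entries_per_keyword_table * bytes_per_entry / 1e6)                  # 80.0 -> about 80MB per keyword table
print(num_keywords * entries_per_keyword_table * bytes_per_entry / 1e9)   # 80.0 -> about 80GB for 1000 tables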

Figure 1.1 In this example, we build a


(comment-id, frequency) hash table to help
us eliminate duplicate comments. So for
example, the comment identified by
comment-id 36457 occurs 6 times in the
dataset. We also build “keyword” hash
tables, where, for each keyword of interest,
we count how many times the keyword is
mentioned in the comments of a particular
article. So for example, the word “science”
is mentioned 21 times in the comments of
the article identified by article-id 8999.
For a large dataset of 3 billion comments,
storing all these data structures can easily
lead to needing dozens to hundreds of
gigabytes of RAM.

We can build similar structures for analyzing the popularity of particular comments,
users, etc. Therefore, we might need close to a hundred or hundreds of gigabytes just
to build basic structures that do not even include most of the metadata information
that we often need for a more insightful analysis (see Figure 1.2).

Figure 1.2 Hash tables use only an amount of space linear in
the size of the data, asymptotically the minimum possible required
to store the data exactly, but with large dataset sizes, hash tables
cannot fit into the main memory.

1.1.2 An example: how to solve it, take two


With the daunting dataset sizes, there are a number of choices we are faced with.
It turns out that, if we settle for a small margin of error, we can build a data struc-
ture similar to a hash table in functionality, only more compact. There is a family of
succinct data structures, data structures that use less space than the theoretical lower limit
for storing the data exactly, and that can answer common questions such as those relating to:
 Membership—Does comment/user X exist?
 Frequency—How many times did user X comment? What is the most popular
keyword?
 Cardinality—How many truly distinct comments/users do we have?

These data structures use much less space to process a dataset of n items than the lin-
ear space (O(n)) that a hash table or a red-black tree would need; think 1 byte per item
or sometimes much less than that.
We can solve our news article comment examples with succinct data structures.
A Bloom filter (Chapter 3) will use 8x less space than the (comment-id -> frequency)
hash table and can help us answer membership queries with about a 2% false positive
rate. In this introductory chapter, we will try to avoid doing the math to explain how we
arrived at these numbers, but for now it suffices to say that the reason why a Bloom filter
and some of the other data structures that we will see can get away with substantially less space
than hash tables or red-black trees is that they do not actually store the items them-
selves. They compute certain codes (hashes) that end up representing the original
items (but also some other items, hence the false positives), and original items are
much larger than the codes. Another data structure, Count-Min sketch (Chapter 4)
will use about 24x less space than the (comment-id -> frequency) hash table to esti-
mate the frequency of each comment-id, exhibiting a small overestimate in the fre-
quency in over 99% of the cases. We can use the same data structure to replace the
(article-id -> keyword_frequency) hash tables and use about 3MB per keyword
table, costing about 20x less than the original scheme. Lastly, the HyperLogLog data
structure (Chapter 5) can estimate the cardinality of the set with only 12KB, exhibiting an
error of less than 1%. If we further relax the requirements on accuracy for each of these
data structures, we can get away with even less space. Because the original dataset still
resides on disk, while the data structures are small enough to serve requests efficiently
from RAM, there is also a way to control for an occasional error.
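
To make the flavor of these structures concrete, here is a toy Bloom filter in Python. It is a bare-bones illustration of the principle (hash each item to a few bit positions and never store the item itself), not the tuned implementations Chapter 3 discusses, and the sizes chosen below are arbitrary:

import hashlib

class BloomFilter:
    def __init__(self, num_bits=8_000_000, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)       # one bit per position, about 1MB here

    def _positions(self, item):
        # Derive several bit positions from the item; the item itself is never stored.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # False positives are possible; false negatives are not.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

bf = BloomFilter()
bf.add("comment-2833908")
print("comment-2833908" in bf)   # True
print("comment-9999999" in bf)   # almost certainly False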
Another choice we have when dealing with large data is to proclaim the set unnec-
essarily large and only work with a random sample that can comfortably fit into RAM.
So, for example, you might want to calculate the average number of views of a com-
ment, by computing the average of the views variable, based on the random sample
you drew. If we migrate our example of calculating an average number of views to the
streaming context, we could efficiently draw a random sample from the data stream of
comments as it arrives using the Bernoulli sampling algorithm (Chapter 6). To illus-
trate, if you have ever plucked flower petals in the love-fortune game “(s)he loves me,
(s)he loves me not” in a random manner, you could say that you probably ended up
with “Bernoulli-sampled” petals in your hand—this sampling scheme offers itself con-
veniently to the one-pass-over-data context.
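
A minimal sketch of Bernoulli sampling over a stream might look like the following (the simulated stream and field name are made up for illustration; Chapter 6 treats the method properly):

import random

def bernoulli_sample(stream, p):
    # Keep each arriving item independently with probability p; one pass, nothing else is stored.
    for item in stream:
        if random.random() < p:
            yield item

# Estimate the average number of views from a 1% sample of a simulated comment stream.
stream = ({"views": v % 1000} for v in range(1_000_000))
sample = [c["views"] for c in bernoulli_sample(stream, p=0.01)]
print(len(sample), sum(sample) / len(sample))   # roughly 10,000 items, average near 499.5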
Answering more granular questions about the comment data, such as the value of the
views attribute below which 90% of all comments fall according to their view count,
will also trade accuracy for space. We can maintain a type of a dynamic his-
togram (Chapter 7) of the complete viewed data within a limited, realistic fast-mem-
ory space. This sketch or a summary of the data can then be used to answer queries
about any quantiles of your complete data with some error.
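
Chapter 7 covers purpose-built quantile sketches; as a much cruder stand-in that still shows the space-for-accuracy trade, you can estimate quantiles from a fixed-size reservoir sample. The reservoir size and the simulated data below are illustrative assumptions:

import random

def reservoir_sample(stream, k):
    # Classic one-pass reservoir sampling: keep a uniform random sample of k items.
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            sample.append(item)
        else:
            j = random.randrange(n)      # replace an existing element with probability k/n
            if j < k:
                sample[j] = item
    return sample

gen = random.Random(7)
views_stream = (int(gen.expovariate(1 / 300)) for _ in range(2_000_000))
sample = sorted(reservoir_sample(views_stream, k=10_000))
print(sample[int(0.9 * len(sample))])    # approximate 90th percentile of views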
Last but definitely not least, we often deal with large data by storing it into a data-
base or as a file on disk or some other persistent storage. Storing data on a remote
storage and processing it efficiently presents a whole new set of rules compared to traditional
algorithms (Chapter 8), even when it comes to fundamental problems such as
searching or sorting. The choice of a database, for example, becomes important and
it will depend on the particular workload that we expect. Will we often be adding new
comment data and rarely posing queries, or will we rarely add new data and mostly
query the static dataset, or, as it often happens, will we need to do both at a very rapid
rate? These are all questions that are paramount when deciding on the type of data-
base engine we might use to store the data.
Very few people actually implement their own storage engines, but to knowledge-
ably choose between different alternatives, we need to understand what data struc-
tures power them underneath. Many massive-data applications today struggle to
provide high query and insertion speeds, and the tradeoffs are best understood by
studying the data structures that run under the hood of MySQL, TokuDB, LevelDB
and other storage engines, such as B-trees, Bε-trees, and LSM-trees, where each is opti-
mized for a particular purpose (Chapters 9 and 10). Similarly, it is important to under-
stand the basic algorithmic tricks when working with files on disk, and this is best
done by learning how to solve fundamental algorithmic problems in this context like
the first example in our chapter of sorting checks in external memory (Chapter 11).

1.2 The structure of this book


As the earlier section outlines, this book revolves around three main themes, divided
into three parts.
Part I (Chapters 2-5) deals with hash-based sketching data structures. This part
begins with the review of hash tables and specific hashing techniques developed for a
massive-data setting. Even though it is planned as a review chapter, we suggest you use
it as a refresher of your knowledge of hash tables, and also use the opportunity to learn
about modern hash techniques devised to deal with large datasets. The succinct data
structures presented in the rest of Part I are also hash-based, so Chapter 2 also serves as
a good preparation for Chapters 3-5. Hash-based succinct data structures that trade
accuracy for reductions in space have found numerous applications in databases, net-
working, or any context where space is at a premium. In this part, we will often con-
sider the tradeoffs between accuracy and space consumption for data structures that
can answer membership (Bloom filter, quotient filter, cuckoo filter), frequency
(Count-Min Sketch), cardinality (HyperLogLog) and other essential operations.
Part II (Chapters 6-7) serves as a continuation of Part I in that it also attempts to
reduce data size, but instead of considering all data and storing it imperfectly like in
Part I, the techniques in Part II consider a subset of the dataset, but then process the
original items, not their hashes. In this part, we first explain how to decide which sam-
pling techniques for probing your data are suited for a particular setting and how
large data sizes affect efficiency of these sampling mechanisms. We will start with clas-
sical techniques like Bernoulli sampling and reservoir sampling and move on to more
sophisticated methods like stratified sampling to counter the deficiencies of simpler
strategies in special settings. We will then use the created samples to calculate esti-
mates of the total sums or averages, etc. We will also introduce algorithms for calculat-
ing (ensemble of) ε-approximate quantiles and/or estimating the distribution of the
data within some succinct representation format.
Part III (Chapters 8-11) shifts to the scenarios when data resides on SSD/disk,
and introduces the external-memory model, the model that is well-suited for the
analysis of algorithms where the data transfer cost subsumes the CPU cost. We will
cover some of the fundamental problems such as searching, sorting, and designing
optimal data structures that power relational (and some of the NoSQL) databases
(B-trees, LSM-trees, Bε-trees). We will also discuss the tradeoffs between the write-
optimized and the read-optimized databases, as well as the issues of sequential/ran-
dom access and data layout on disk.

1.3 What makes this book different and whom it is for


There are a number of great books on classical algorithms and data structures, some
of which include: The Algorithm Design Manual by Steven S. Skiena13, Introduction to Algo-

13
S. S. Skiena, The Algorithm Design Manual, Second Edition, Springer Publishing Company, Incorporated, 2008.

rithms by Cormen, Leiserson, Rivest and Stein14, Algorithms by Robert Sedgewick and
Kevin Wayne15, or for a more introductory and friendly take on the subject, Grokking
Algorithms by Aditya Bhargava16. The algorithms and data structures for massive data-
sets are slowly but surely making their way into the mainstream textbooks, but the
world is moving fast and our hope is that this book can provide a compendium of the
state-of-the-art algorithms and data structures that can help a data scientist or a devel-
oper handling large datasets at work. The book is intended to offer a good balance of
theoretical intuition, practical use cases and (pseudo)code snippets. This book
assumes that a reader has some fundamental knowledge of algorithms and data struc-
tures, so if you have not studied the basic algorithms and data structures, you should
first cover that material before embarking on this subject.
Many books on massive data focus on a particular technology, system or infrastruc-
ture. This book does not focus on the specific technology, neither does it assume
familiarity with any particular technology. Instead, it covers underlying algorithms and
data structures that play a major role in making these systems scalable. Often the
books that do cover algorithmic aspects of massive data focus on machine learning.
However, an important aspect of handling large data that does not specifically deal
with inferring meaning from data, but rather has to do with handling the size of the
data and processing it efficiently, whatever the data is, has often been neglected in the
literature. This book aims to fill that gap.
There are some books that address specialized aspects of massive datasets (e.g., Probabi-
listic Data Structures and Algorithms for Big Data Applications17, Real-Time Analytics: Tech-
niques to Analyze and Visualize Streaming Data18, Disk-Based Algorithms for Big Data19, and
Mining of Massive Datasets20). With this book, we intend to present these different themes
in one place, often citing the cutting-edge research and technical papers on relevant sub-
jects. Lastly, our hope is that this book will teach more advanced algorithmic material in a
down-to-earth manner, providing mathematical intuition instead of the technical proofs
that characterize most resources on this subject. Illustrations play an important role in com-
municating some of the more advanced technical concepts and we hope you enjoy them.
Now that we’ve gotten the introductory remarks out of the way, let’s discuss the
central issue that motivates topics from this book.

1.4 Why is massive data so challenging for today’s systems?


There are many parameters in computers and distributed systems architecture that
can shape the performance of a given application. Some of the main challenges that

14
T. H. Cormen, C. E. Leiserson, R. L. Rivest and C. Stein, Introduction to algorithms, Third Edition, The MIT
Press, 2009.
15
R. Sedgewick and K. Wayne, Algorithms, Fourth Edition, Addison-Wesley Professional, 2011.
16
A. Bhargava, Grokking Algorithms: An Illustrated Guide for Programmers and Other Curious People, Man-
ning Publications Co., 2016.
17
G. Andrii, Probabilistic Data Structures and Algorithms for Big Data Applications, Books on Demand, 2019.
18
B. Ellis, Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data, Wiley Publishing, 2014.
19
C. G. Healey, Disk-Based Algorithms for Big Data, CRC Press, Inc., 2016.
20
A. Rajaraman and J. D. Ullman, Mining of Massive Datasets, Cambridge University Press, 2011.

computers face in processing large amounts of data actually stem from hardware and
general computer architecture. Now, this book is not about hardware, but in order to
design efficient algorithms for massive data, it is important to understand some physi-
cal constraints that are making data transfer such a big challenge. Some of the main
issues include: 1) the large asymmetry between the CPU and the memory speed, 2)
different levels of memory and the tradeoffs between the speed and size for each level,
and 3) the issue of latency vs. bandwidth. In this section, we will discuss these issues, as
they are at the root of solving performance bottlenecks of data-intensive applications.

1.4.1 The CPU-memory performance gap


The first important asymmetry that we will discuss is between the speeds of CPU oper-
ations and memory access operations in a computer, also known as the CPU-memory
performance gap21. Figure 1.3 shows, starting from 1980, the average gap between the
speeds of processor memory access and main memory access (DRAM memory),
expressed in the number of memory requests per second (the inverse of latency):

Figure 1.3 CPU-Memory Performance Gap


Graph, adapted from Hennessy & Patterson’s
well-known Computer Architecture textbook.
The graph shows the widening gap between
the speeds of memory accesses to CPU and
RAM main memory (the average number of
memory accesses per second over time).
The vertical axis is on the log scale.
Processors show improvement of about 1.5x
per year up to year 2005, while the
improvement of access to main memory has
been only about 1.1x per year. Processor
speed-up has somewhat flattened since
2005, but this is being alleviated by using
multiple cores and parallelism.

What this gap points to intuitively is that doing computation is much faster than
accessing data. So if we are still stuck with the traditional mindset of measuring the
performance of algorithms using the number of computations (and assuming mem-
ory accesses take the same amount of time as the CPU computation), then our analy-
ses will not jibe well with reality.

21
J. L. Hennessy and D. A. Patterson, Computer Architecture, Fifth Edition: A Quantitative Approach, Morgan
Kaufmann Publishers Inc., 2011.

1.4.2 Memory hierarchy


Aside from the CPU-memory gap, there is a whole hierarchy of different types of
memory built into a computer that have different characteristics. The overarching
tradeoff has been that the memory that is fast is also small (and expensive), and the
memory that is large is also slow (but cheap). As shown in Figure 1.4, starting from the
smallest and the fastest, the computer hierarchy usually contains the following levels:
registers, L1 cache, L2 cache, L3 cache, main memory, solid state drive (SSD) and/or
the hard disk (HDD). The last two are persistent (non-volatile) memories, meaning
the data is saved if we turn off the computer, and as such are suitable for storage.
In Figure 1.4, we can see the access times and capacities for each level of the mem-
ory in a sample architecture22. The numbers vary across architectures, and are more
useful when observed in terms of ratios between different access times rather than the
specific values. So, for example, pulling a piece of data from cache is roughly 1 million
times faster than doing so from the disk.

Figure 1.4 Different types of memories in the


computer. Starting from registers in the
bottom left corner, that are blindingly fast but
also very small, we move up (getting slower)
and right (getting larger) with Level 1 cache,
Level 2 cache, Level 3 cache, main memory, all
the way to SSD and/or HDD. Mixing up
different memories in the same computer
allows for the illusion of having both the speed
and the storage capacity by having each level
serve as a cache for the next larger one.

The hard disk is the only remaining mechanical part of a computer, and it works a lot
like a record player. Placing the mechanical needle on the right track is the expensive
part of accessing data on disk. Once the needle is on the right track, the data transfer
can be very fast, depending on how fast the disk spins.

22
C. Terman, “MIT OpenCourseWare, Massachusetts Institute of Technology,” Spring 2017. [Online]. Avail-
able: https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-004-computation-struc-
tures-spring-2017/index.htm. [Accessed 20th January 2019].

A similar phenomenon, where “latency lags bandwidth,” holds for other types of
memory23. Generally, the bandwidth in various systems, ranging from microproces-
sors, main memory, hard disk, and network, has tremendously improved over the past
few decades, but latency hasn’t as much, even though the latency might often be the
more important measurement for most scenarios—a common pattern of user behav-
ior is many small random accesses as opposed to one large sequential one.
Because of this expensive initial call, it is appropriate that the data transfer
between different levels of memory is done in chunks of multiple items, to offset the
cost of the call. The chunks are proportionate to the sizes of the memory levels: for
caches the blocks are between 8 and 64 bytes, while disk blocks can be up to 1MB24.
Due to the concept known as spatial locality, where we expect the program to access
memory locations that are in the vicinity of each other close together in time, transferring
blocks in this way pre-fetches the data we will likely need in the future.
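
As a rough illustration of why access patterns matter (the absolute numbers vary by machine, and Python's interpreter overhead masks much of the hardware effect), you can sum the same array once in sequential order and once in a shuffled order and compare the timings; the array size below is an arbitrary choice:

import random
import time

N = 2_000_000
data = list(range(N))
indices = list(range(N))
shuffled = indices[:]          # same indices, visited in a random order
random.shuffle(shuffled)

def scan(order):
    total = 0
    for i in order:            # sequential order touches neighbouring memory; shuffled order hops around
        total += data[i]
    return total

for name, order in [("sequential", indices), ("shuffled  ", shuffled)]:
    start = time.perf_counter()
    scan(order)
    print(name, time.perf_counter() - start, "seconds")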

1.4.3 What about distributed systems?


Most applications today run on multiple computers, and having data sent from one
computer to another adds yet another level of delay. Data transfer between computers
can be hundreds of milliseconds, or even seconds, long, depending on the system
load (e.g., the number of users accessing the same application), the number of hops
to the destination and other details of the architecture (see Figure 1.5):

Figure 1.5 Cloud-access times can be high due to


the network load and complex infrastructure.
Accessing the cloud can take hundreds of
milliseconds or even seconds. We can observe this as
another level of memory that is even larger and slower
than the hard disk. Improving the performance in cloud
applications can be additionally hard because times to
access or write data on a cloud are unpredictable.

You might be asking yourself, how are all these facts relevant for the design of data-effi-
cient algorithms and data structures? The first important takeaway is that, although
technology improves constantly (for instance, SSDs are a relatively new development
and they do not share many of the issues of hard disks), some of the issues, such as the
tradeoff between the speed and the size of memories are not going away any time soon.
Part of the reason for this is purely physical: to store a lot of data, we need a lot of

23
D. A. Patterson, “Latency Lags Bandwith,” Commun. ACM, vol. 47, no. 10, p. 71–75, 2004.
24
J. L. Hennessy and D. A. Patterson, Computer Architecture, Fifth Edition: A Quantitative Approach, Morgan
Kaufmann Publishers Inc., 2011.

space, and the speed of light sets the physical limit to how fast data can travel from one
part of the computer to the other or from one part of the network to the other. To
extend this to a network of computers, we will cite25 an example: for two computers
that are 300 meters apart, the physical limit on data exchange is 1 microsecond.
Hence, we need to design algorithms with this awareness. Designing succinct data
structures (or taking data samples) that can fit into fast levels of memory helps because
we avoid expensive disk seeks. So, one of the important algorithmic tricks with large
data is that saving space saves time. Yet, in many applications we still need to work with
data on disk. Here, designing algorithms with optimized patterns of disk access and
caching mechanisms that enable the smallest number of memory transfers is import-
ant, and this is further linked to how we lay out and organize data on a disk (say in a
relational database). Disk-based algorithms prefer smooth scanning of the disk to
random hopping—this way we get to make use of good bandwidth and avoid poor
latency, so one meaningful direction is transforming an algorithm that hops into one
that glides over data. Throughout this book, we will see how classical algorithms can be
transformed and new ones can be designed having space-related concerns in mind.
Lastly, it is important to keep in mind that many aspects of making systems work in
production have little to do with designing a clever algorithm. Modern systems have
many performance metrics other than scalability, such as: security, availability, main-
tainability, etc. So, real production systems need an efficient data structure and an
algorithm running under the hood, but with a lot of bells and whistles on top to make
all the other stuff work for their customers (see Figure 1.6). However, with ever-
increasing amounts of data, designing efficient data structures and algorithms has
become more important than ever before, and we hope that in the coming pages you
will learn how to do exactly that.

Figure 1.6 An efficient


data structure with bells
and whistles.

25
D. A. Patterson, “Latency Lags Bandwith,” Commun. ACM, vol. 47, no. 10, p. 71–75, 2004.


Summary
 Applications today generate and process large amounts of data at a rapid rate.
Traditional data structures, such as basic hash tables and key-value dictionaries,
can grow too big to fit in RAM, which can lead to an application choking due to
the I/O bottleneck.
 To process large datasets efficiently, we can design space-efficient hash-based
sketches, do real-time analytics with the help of random sampling and approxi-
mate statistics, or deal with data on disk and other remote storage more effi-
ciently.
 This book serves as a natural continuation to the basic algorithms and data
structures book/course because it teaches how to transform the fundamental
algorithms and data structures into algorithms and data structures that scale
well to large datasets.
 The key reason why large data is a major issue for today’s computers and sys-
tems is that CPU (and multiprocessor) speeds improve at a much faster rate than memory
speeds, combined with the tradeoff between the speed and the size of the different types
of memory in a computer and with the latency vs. bandwidth phenomenon. These trends are
not likely to change significantly soon, so algorithms and data structures that address
the cost of I/O and the use of space are only going to increase in importance over time.
 In data-intensive applications, optimizing for space means optimizing for time.



Chapter 1 from Software Telemetry
by Jamie Riedesel

Introduction

This chapter covers


 What telemetry systems are
 What telemetry means to different technical groups
 Challenges unique to telemetry systems

Telemetry is the feedback you get from your production systems that tells you
what’s going on in there, all to improve your ability to make decisions about your
production systems. For NASA the production system might be a rover on Mars,
but most of the rest of us have our production systems right here on Earth (and
sometimes in orbit around Earth). Whether it’s the amount of power left in a
rover’s batteries or the number of Docker containers live in Production right now,
it’s all telemetry. Modern computing systems, especially those operating at scale,
live and breathe telemetry; it’s how we can manage systems that large at all. Using
telemetry is ubiquitous in our industry.
 If you’ve ever looked at a graph describing site-hits over time, you’ve used
telemetry.
 If you’ve ever written a logging statement in code and later looked up that
statement in a log-searching tool like Kibana or Loggly, you’ve used telemetry.


 If you’ve ever configured the Apache web server to send logs to a relational
database, you’ve used telemetry.
 If you’ve ever written a Jenkinsfile to send continuous integration test results to
another system that could display it better, you’ve used telemetry.
 If you’ve ever configured GitHub to send webhooks for repository events,
you’ve used telemetry.
Software Telemetry is about the systems that bring you telemetry and display it in a way
that will help you make decisions. Telemetry comes from all kinds of things, from the
Power Distribution Units your servers (or your cloud provider’s servers) are plugged
into, to your actual running code at the very top of the technical pyramid. Taking that
telemetry from whatever emitted it and transforming it so your telemetry can be dis-
played usefully is the job of the telemetry system. Software Telemetry is all about that sys-
tem and how to make it durable.
Telemetry is a broad topic, and one that is rapidly changing. Between 2010 and
2020 our industry saw the emergence of three new styles of telemetry systems. Who
knows what we will see between 2020 and 2030? This book will teach you the funda-
mentals of how any telemetry system operates, including ones we haven’t seen yet,
which will prepare you for modernizing your current telemetry systems and adapting
to new styles of telemetry. Any time you teach information-passing and translation,
which is what telemetry systems do, you unavoidably have to cover how people pass
information; this book will teach you both the technical details of maintaining and
upgrading telemetry systems as well as the conversations you need to have with your
coworkers while you revise and refine your telemetry systems.
Any telemetry system has a similar architecture; figure 1.1 shows one we will see often
as we move through parts 1 and 2 of this book.
Telemetry is data that production systems emit to provide feedback about what is
happening inside. Telemetry systems are the systems that handle, transform, store, and
present telemetry data. This book is all about the systems, so let’s take a look at the five
major telemetry styles in use today:
 Centralized Logging: This was the first telemetry system created, which hap-
pened in the early 1980s. This style takes text-based logging output from pro-
duction systems and centralizes it to ease searching. Of note, this is the only
technique widely supported by hardware.
 Metrics: This grew out of the monitoring systems used by operations teams, and
was renamed “metrics” when software engineers adopted the technique. This
system emerged in the early 2010s, and focuses on numbers rather than text to
describe what is happening. This allowed much longer timeframes to be kept
online and searchable when compared with centralized logging.
 Observability: This style grew out of frustration over the limitations of central-
ized logging and takes a more systematic approach to tracking events in a pro-
duction system. This emerged in the mid 2010s.


Figure 1.1 Architecture common to all telemetry systems, though some stages are often combined in
smaller architectures. The Emitting Stage receives telemetry from your production systems and
delivers it to the Shipping Stage. The Shipping Stage processes and ultimately stores telemetry. The
Presentation Stage is where people search and work with telemetry. The Emitting and Shipping stages
can apply context-related markup to telemetry, while the Shipping and Presentation stages can further
enrich telemetry by pulling out details encoded within.

 Distributed Tracing: This is a refinement of observability, focusing directly on


tracking events across many components of a distributed system (large mono-
liths count for this, by the way). This style emerged in the late 2010s and is
undergoing rapid development.
 Security Information Event Management (SIEM): This is a specialized telemetry
system for use by Security and Compliance teams, and actually is a specializa-
tion of centralized logging and metrics. The technique was in use long before
the term was formalized in the mid 2000s.
These telemetry styles are used throughout this book, so you will see them mentioned
a lot. Section 1.1 gives you longer definitions and history for each of these telemetry
styles, and presents how each style conforms to the architecture in figure 1.1.
Because people matter as much as the telemetry data being handled by our telem-
etry systems, section 1.2 gives you a breakdown of the many teams inside a technical
organization as well as the telemetry systems each team prefers to use. These teams
are referenced frequently in the rest of this book.
Finally, telemetry systems face more disasters than production systems do. Section
1.3 briefly covers some of these disasters. Part 3 of this book has several chapters use-
ful for making your telemetry systems durable.


1.1 Defining the styles of telemetry


The list of telemetry styles in the intro provides a nice thumbnail of what each style of
telemetry does and will be a good quick-reference for you as you move through this
book. This section provides a far more detailed definition for each of these five telem-
etry styles and gives real-world examples for each of them.

1.1.1 Defining centralized logging


Centralized logging brings logging data generated by production systems to a central
place where people can query it. Figure 1.2 gives us an example of such a system today:

Figure 1.2 A centralized logging system using Logstash, Elasticsearch, and Kibana as major
components. Telemetry is emitted from both production code and Cisco Hardware. This telemetry is
then received by Shipping Stage components; centralized in Logstash, and stored in Elasticsearch.
Kibana uses Elasticsearch storage to provide a single interface for people to search all logs.

Centralized logging supports not just software telemetry, but hardware telemetry as
well! The Syslog Server box in the figure represents the modern version of a system
that was first written around 1980 as a dedicated logging system for the venerable
sendmail program from the Berkeley Software Distribution 3BSD. By the year 2000
Syslog was in near universal use across Unix and Unix-like operating systems. A stan-
dardization effort in 2001 resulted in a series of RFCs that defined the Syslog format
for both the transmission protocol and the data format. Making Syslog a standard gave
hardware makers an option for emitting telemetry that wasn't likely to change over
the decade-long lifespan of most hardware. The other option is SNMP, the Simple Network
Management Protocol, which is covered in Chapter 2.
I bring up Syslog because the concepts it brought to the table influenced much of
how we think about logging from software. If you’ve ever heard the phrase:
Turn on debug logging.
You’re using a concept introduced by Syslog. The concept of log-levels originated in
Syslog, which defined eight levels, seen here in table 1.1.
Where you see Syslog’s biggest influence is in the keywords in table 1.1. Not every
software logger builds all eight levels, but nearly all have some concept of debug, info,
warning, and error.


Table 1.1 Syslog standard log-levels

ID  Severity   Keyword
0   Emergency  emerg, panic
1   Alert      alert
2   Critical   crit
3   Error      err, error
4   Warning    warn, warning
5   Notice     notice
6   Info       info
7   Debug      debug

In code, these log levels show up as calls like the following:

logger.debug("Entering Dangerous::Function with #{args.count} params.")
logger.info("Dangerous::Function finished in #{timer.to_seconds} seconds.")
logger.warn("FIXME: Dangerous::Function was not passed a CSRF token.")
logger.error("Dangerous::Function failed with ArgumentError::InvalidType")

If you've written software, chances are good you've used these levels at some point in your
career. Log levels also introduce the idea that all logging has some context along with it;
a log-level gives the context of how severe the event is, while the text of the event
describes what happened. Logging from software can include quite a lot more context than
simply a priority and a message; Section 6.1 describes this markup process in much more
detail.
The middle stage of centralized logging, represented in the figure as the Logstash
server, takes telemetry in its emitted format (Syslog for the Cisco hardware, whatever
the log-file format is for the production code) and reformats it into the format
needed by Elasticsearch. Elasticsearch needs a hash, so Logstash rewrites the Syslog
format into Elasticsearch's document format before storing it. This reformatting
process is covered in Chapter 4.
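As a rough illustration of that reformatting step (this is a toy sketch, not Logstash's actual implementation; real pipelines handle many more formats and edge cases), here is how an RFC 3164-style Syslog line can be turned into the kind of hash that becomes an Elasticsearch document:

require "json"

# A minimal, illustrative parser for an RFC 3164-style Syslog line.
SYSLOG_LINE = /\A<(?<pri>\d+)>(?<timestamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?<host>\S+) (?<program>[^:\[]+)(?:\[(?<pid>\d+)\])?: (?<message>.*)\z/

def syslog_to_document(line)
  m = SYSLOG_LINE.match(line) or return nil
  pri = m[:pri].to_i
  {
    "@timestamp" => m[:timestamp],   # a real pipeline would normalize this to ISO 8601
    "host"       => m[:host],
    "program"    => m[:program],
    "pid"        => m[:pid]&.to_i,
    "severity"   => pri % 8,         # the low 3 bits of PRI encode the log level
    "facility"   => pri / 8,         # the remaining bits encode the facility
    "message"    => m[:message]
  }
end

line = "<34>Oct 11 22:14:15 edge-router-1 sshd[4123]: Failed password for invalid user admin"
puts JSON.pretty_generate(syslog_to_document(line))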
The end of the pipeline, represented in the figure as the Kibana server, uses Elas-
ticsearch as a database for queries. Section 5.2 goes into greater detail about what con-
stitutes a good system for this job for centralized logging. Kibana here is used by
people to access telemetry and assist with analysis.

1.1.2 Defining metrics


Metrics style telemetry is about using numbers (counters, timers, rates, and the like)
to get feedback about what’s going on in your production systems. Where a central-
ized logging system often uses plain language to suggest how long something took:
logger.info("Dangerous::Function finished in #{timer.to_seconds} seconds.")

Metrics systems encode the same information as a number plus some additional fields
to provide context. In this example, a function name is added for context, and a
timer provides the number:


metrics.timer("Dangerous_Function_runtime", timer.to_seconds)
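Under the hood, a call like the one above typically becomes a tiny plain-text datagram sent to a StatsD-style server. Here is a hedged sketch of that idea; the helper name and server address are made up, and real client libraries add batching, sampling, and error handling:

require "socket"

# Minimal StatsD-style timer emission; "|ms" marks the StatsD timing type.
# statsd.example.internal:8125 is a hypothetical address for your StatsD server.
def emit_timer(name, seconds, host: "statsd.example.internal", port: 8125)
  payload = "#{name}:#{(seconds * 1000).round}|ms"
  UDPSocket.new.send(payload, 0, host, port)
end

emit_timer("Dangerous_Function_runtime", 0.231)  # sends "Dangerous_Function_runtime:231|ms"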

Here is an example of a real world metrics pipeline in figure 1.3:

Figure 1.3 A metrics system where the production software emits metrics from code into a StatsD
server, and the operating system has a monitoring package called collectd that collects system metrics
that reports directly to the graphite API of a Prometheus storage server. StatsD submits summarized
metrics into Prometheus by way of the graphite API. A Grafana server acts as the interface point for all
users of this metrics system, including both Operations and Software Engineering teams.

This example shows a metrics system being used for both software metrics and system
metrics. The system metrics are gathered by a monitoring tool called collectd, which
has the ability to push metrics into a Graphite API. Prometheus is a database custom-
built for storing data-over-time, or time-series data. Such time-series databases are the
foundation of many metrics systems, though other database styles can certainly be
used successfully. Grafana is an open source dashboarding system for metrics that is
widely used, and in this case is being used by both the Operations team running the
infrastructure and the Software Engineering team managing the production software.
Like centralized logging style telemetry, metrics telemetry is almost always marked
up with additional details to go with the number. In the case of the statement before
the figure we are adding a single field with Dangerous_Function_runtime as the
value. Additional fields can certainly be added, though doing so introduces complex-
ity in the metrics database known as cardinality.

DEFINITION Cardinality is the term for index complexity, specifically the


number of unique combinations the fields in the index may produce. If there
are two fields A and B, where A has two possible values, and B has three possi-
ble values, the cardinality of that index is A * B, or 2 * 3 = 6. Cardinality has a
significant impact on search performance no matter what data storage system
is being used.
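To see how quickly added fields inflate an index, here is a tiny sketch; the field names and value counts are invented:

# Expected number of distinct values for each field attached to a metric.
fields = {
  "metric_name" => 200,
  "host"        => 50,
  "status_code" => 8
}

cardinality = fields.values.reduce(1) { |product, count| product * count }
puts cardinality   # => 80000 unique combinations the index may have to handle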

Cardinality is a big part of how metrics came to be its own discrete telemetry style.
Centralized logging, with all of the data it encodes, has the highest cardinality of all of
the five telemetry styles I talk about. It also takes up the most space by far. The
combination of those two factors makes centralized logging require the most complex
databases and the largest volume of data of any style. Due to budget constraints,


centralized logging systems rarely can keep data online and searchable for long at all.
Compare this to metrics, with its low cardinality and focus on easy-to-store numbers,
and you have a telemetry system that can keep years of telemetry online and searchable
for a fraction of the cost of centralized logging systems!
In the 2009 to 2012 era when metrics really began to be known as a telemetry style,
the cost savings versus centralized logging was one of the biggest drivers for adoption.
Centralized logging was still used, but being able to leverage a specialized telemetry
flow designed for the type of decision being made was a revolution, one that set up
the next two telemetry styles to come onto the scene.

1.1.3 Defining observability


Observability combines the numbers-based nature of metrics with the cardinality of
centralized logging—in the form of more fields stored with the numbers—to provide
a system that is much better equipped to track events as they happen in your pro-
duction systems. It probably seems weird to re-add one of the complexities that met-
rics systems got rid of, but keep in mind that observability emerged in the market
several years after metrics did and the state of databases had evolved just enough to
enable this style of telemetry. Observability is all about adding as many context-related
details, such as hostname, software version, library version, system uptime, and many
others (see section 6.1 for this kind of markup) to emitted metrics and logging state-
ments as possible.
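In code, that usually means emitting one wide event per unit of work, carrying every detail you can attach. The emit_event helper and the field values below are hypothetical; a real observability SDK would buffer, sample, and ship these events to an ingestion API for you:

require "socket"
require "securerandom"
require "json"

# One wide event per unit of work, carrying as much context as we can gather.
def emit_event(fields)
  puts JSON.generate(fields)   # stand-in for shipping to an ingestion endpoint
end

emit_event(
  "event"           => "Dangerous::Function",
  "duration_ms"     => 231,
  "hostname"        => Socket.gethostname,
  "service_version" => "4.12.1",        # hypothetical build number
  "ruby_version"    => RUBY_VERSION,
  "uptime_minutes"  => 68,              # invented value
  "request_id"      => SecureRandom.uuid,
  "billing_code"    => "Q"
)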
Having all of that extra detail allows someone investigating a problem to better iso-
late what the common features are. Maybe a certain service likes to crash out after
only 70 minutes of uptime, suggesting a memory leak could be present. Or maybe the
Operations team upgraded a core imaging library as part of monthly security-patching
and the systems with the new library are now taking 12% longer to complete their work.
These are the kinds of questions observability is designed to answer faster. Figure 1.4 describes a real-
world Observability system.
This is our first example demonstrating the use of a Software-as-a-Service (SaaS)
provider for telemetry services (see figure 1.4). Using honeycomb.io requires your
production software to be able to submit API calls to an outside system, something not
all infrastructures allow (for security reasons). Doing it this way allows your
production systems to gain the benefit of observability without having to deploy and
maintain the complex infrastructure it requires. In my experience, SaaS companies
dominate the Observability marketplace for this very reason.

Figure 1.4 An observability system using the honeycomb.io Software-as-a-Service platform for the
Shipping and Presentation stages. Production software emits observability telemetry into an ingestion
API operated by honeycomb.io. Honeycomb then stores this telemetry in their database and presents the
processed telemetry in their dashboard. Use of a SaaS provider allows leveraging observability without
having to manage its complex infrastructure yourself.

WHAT ABOUT APPLICATION PERFORMANCE MONITORING?


The term Application Performance Monitoring (APM) predates Observability by several
years. The first big idea behind APM was to apply the same monitoring discipline that
Operations teams were applying to their infrastructure but to the software systems run-
ning on top of it. However, the marketplace of ideas is a rough-and-tumble place, and
Observability grabbed a lot of attention in the middle of the 2010s. As a result, systems
sold as APM systems often offer many of the features of full Observability systems.
Truth be told, APM and Observability are near synonyms these days. Tools first sold
as APM tools have to continue to support the workloads they were initially sold for,
while adding on the features that make them Observability platforms as well.
The word ‘observability’ is fading somewhat due to distributed tracing getting a lot of
attention. However, while all distributed tracing systems are observability systems,
not all observability systems are distributed tracing systems. There are cases where
observability is a better fit. That said, what happens to these terms in the period
between 2020 and 2030 is anyone’s guess.

1.1.4 Defining distributed tracing


Distributed tracing is a specialized form of observability, layering on explicit
automation to attach entire execution flows (more on this in a bit) to the context
presented when you look at what happened. Observability gives you lots of context around
the moment an event happened, and gives you enough other hints to allow you to
determine how that same event behaves over time. Tracing leverages this extra con-
text to allow reconstructing the flow of execution similar to how a stack-dump shows
the stack of functions called, but does so over time. Figure 1.5 presents the kind of dis-
play of execution flow a distributed tracing system can provide.
Tracing is useful not just in microservices environments, but also:
 Large monolithic codebases that have many different teams contributing code.
The traces produced here don’t respect political boundaries, which reduces
barriers to troubleshooting.
 Micro-monolith environments with a few, large applications working together.
Such environments often have separate teams working on the larger applica-
tions, so again tracing across all applications breaks down silos between teams.
 Monoliths in the process of being chipped apart and only have a few additional
microservices so far. This telemetry style is often the best choice for providing
shared context between the separate systems.


Figure 1.5 An example of the kind of display a distributed tracing system provides, following the flow
of execution similar to a stack-trace. Here we see a call to upload_document, but also all of the other
processes that upload_document called during its execution. When tracing a fault in a pdf_to_png
process, you will be presented with the full context of events leading up to that specific execution.
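In code, producing that kind of display means wrapping each unit of work in a span that records its timing and shares a trace ID with every other span in the request. The toy tracer below is purely illustrative rather than the OpenTracing SDK, and the operation names other than upload_document and pdf_to_png are invented; real SDKs also propagate context across process boundaries and ship spans to a collector such as Jaeger:

require "securerandom"

TRACE_ID = SecureRandom.hex(8)   # every span in one request shares this ID

def span(name, parent: nil)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result  = yield
  elapsed = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round(1)
  puts format("trace=%s span=%-15s parent=%-15s %6.1fms", TRACE_ID, name, parent || "-", elapsed)
  result
end

span("upload_document") do
  span("virus_scan",  parent: "upload_document") { sleep 0.01 }
  span("pdf_to_png",  parent: "upload_document") { sleep 0.02 }  # the fault in figure 1.5 shows up here
  span("save_upload", parent: "upload_document") { sleep 0.01 }
end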

Tracing also came onto the scene in the late 2010s, so it is undergoing rapid development.
The OpenTracing project (https://opentracing.io) is an effort by major players in the US
tech industry to provide standards for communication and data formats for tracing.
Programming languages that are well established but mostly overlooked by the Big Names
You Know in the US tech industry (languages like PHP, COBOL, and Perl) often lack a
Software Development Kit for tracing. Frustration among software engineers is a prime
driver of innovation, so I expect these underserved languages will get the support they
need before too long.
In spite of its newness, there are real-world examples of tracing to look at today;
figure 1.6 shows one.

Figure 1.6 An example of a distributed tracing system circa 2020. Production code is running an
OpenTracing SDK, which sends events to a system running the Jaeger open source tracing system.
The Jaeger collector then stores the event in a database. The Jaeger frontend provides a place to
search and display traces from production systems.


1.1.5 Defining Security Information Event Management


Many companies and organizations operate with constraints imposed on them from
outside, such as mandatory stock market reporting, industry-specific regulation (in
banking, for example), and optional compliance frameworks like those defined by ISO
standards. These standards, regulations, and compliance frameworks have been with
us for decades and have grown up along with the technical industry. Most of these
external controls require a common set of monitoring techniques inside a technical
organization; here are only a few:
 Track all login, logout, and account lockout events.
 Track all use of administrative privileges.
 Track all access to sensitive files and records.
 Track compliance with password complexity and age requirements.

The ubiquity of these requirements, and the complexity of tracking and later correlating
them, has given rise to a completely separate telemetry style known as the Security
Information Event Management system, or SIEM. Due to the complexity of the task,
SIEMs are almost always paid-for software; there are very few, if any, open source
projects that do this work.
As a telemetry system operator, you will spend more of your time connecting sources of
telemetry to a system that already knows how to interpret the data. Figure 1.7 gives us one pos-
sible architecture for integrating a SIEM into a larger telemetry system.

Figure 1.7 One possible SIEM system. Since SIEM systems are often derived from centralized
logging systems, here we see that the centralized logging flow and SIEM flow have an identical
source. When telemetry enters the Logstash server, it produces two feeds; one feed goes into
Elasticsearch for centralized logging, and a second feed is submitted to the Splunk SaaS API.
Splunk is acting as a SIEM in this case.

There are many different architectures; figure 1.7 is but one. Another is when Security
has installable agents running on host servers, which emit in a completely different
way than the centralized logging flows, making for a fully separate system. Both
approaches are viable.


1.2 How telemetry is consumed by different teams


Because telemetry is used to support decision-making about production systems, you
unavoidably have to consider how the people making decisions are organized and use
telemetry. This section is about defining what the major teams are in a technical orga-
nization so I can use these definitions later in the book. You are likely in this list some-
where, and, if you’ve been in the industry a long time, maybe even have been in more
than one team. In my career, I’ve been in Customer Support, Operations, DevOps,
more recently SRE, and have done close work with Security teams.

NOTE I use the term organization instead of company to be inclusive of non-


corporate organizations that are doing technology, such as government enti-
ties and non-profits. The teams listed here are broad categories, and when
you see team names in capitals, such as Software Engineering, know that I am
referring to definitions in this chapter.

1.2.1 Telemetry usage by Operations, DevOps, and SRE teams


Operations, DevOps, and Site Reliability Engineering teams share a lot of background
for all that they cover somewhat different areas. Also, SRE will be mentioned again in
section 1.2.3 because they share a lot of background with Software Engineering teams.
Operations teams were the first of these three teams to emerge, which happened in
the 1970s. Today teams with Operations in the title are likely in long-standing organi-
zations that computerized in the 1960s and 70s, though sometimes these teams add
on the word infrastructure to make Infrastructure Operations, and sometimes Platform
Engineering. For this book I will be using Operations. Teams with this name are typically
in charge of keeping the machinery that runs the production code operating, whether
cloud or hardware, including the operating systems.
DevOps teams emerged in the 2000s to fight the silos that had grown up between
Operations and Software Engineering teams. These days DevOps teams often stand in
for Operations teams while also maintaining the systems that make sure code is meet-
ing minimum quality standards (continuous integration) and getting it into produc-
tion (continuous deployment). Using DevOps in a job title is certainly controversial;
DevOps is a philosophy, not a job title, but that doesn't stop it from being a common
practice anyway.
Site Reliability Engineering emerged in rough parallel to DevOps, but did so in
some of the biggest tech companies on the planet. The original meaning of SRE was
the team that was in charge of making sure your (web-based) software was available to
customers, in the way your customers needed you to be. The term means somewhat dif-
ferent things to each organization that has an SRE team, but they all still care about
your (usually web-based) software being available to customers.
Operations teams have been caring about uptime since the beginning. DevOps
cares about software quality as a way to defend uptime. SRE teams are explicitly
charged with availability. All of these converging needs mean these three teams have
common telemetry requirements. Figure 1.8 shows this.


Figure 1.8 The preferred telemetry styles for Operations, DevOps, and SRE
teams: centralized logging, because the infrastructure these teams manage
emits to it by preference, and metrics, because that is what is used for monitoring
and site-availability tracking. Use of the other three styles is quite possible, but
the majority of usage is centralized logging and metrics.

1.2.2 Telemetry usage by Security and Compliance teams


Security teams are the teams charged with defense of your overall organization from
outside threats. Compliance teams are the ones charged with ensuring your organiza-
tion is complying with legislated regulations and optional compliance frameworks,
such as the Service and Organization Controls (SOC 2) standard. Security and Com-
pliance teams are often the same team until organizational needs make separating
these concerns a good idea. Not every organization has a Security team, though many
who do not really should. Certain industries, like anything involving finance or health,
make these teams unavoidable.
Supporting both the Security and Compliance missions requires setting policies
for ensuring that (not an exhaustive list):
 A vulnerability management program is in place to ensure software used in pro-
duction and telemetry systems is kept up to date.
 Procedures exist for regular reviews of who has access to production and telem-
etry systems.
 Procedures exist for ensuring terminated employee access is swiftly revoked
from production and telemetry systems.
 Reporting is in place to identify failed logins to production and telemetry sys-
tems.
 Reporting is in place to track the use of administrative privileges for a period of
years.
 A password complexity and authentication policy is in place that provides suffi-
cient defenses against password-guessing and other credential theft threats.
Not only setting policies, but ensuring those policies are followed. This is where telemetry
comes into play, because it is telemetry that will allow external auditors to determine
whether or not any of these policies are effective. On top of ensuring compliance with
policies, Security teams also have the hard job of responding to secu-
rity incidents. Figure 1.9 shows us the relationship Security and Compliance teams
have with telemetry.


Figure 1.9 Security and Compliance teams’ relationship to telemetry systems.


Their primary system is the SIEM, but Centralized Logging provides much of the
proof that policies are being followed. During response to a security incident,
literally every telemetry system will be used to determine what happened.

Security incidents are special cases; when they happen, literally every source of telem-
etry is potentially useful during the investigation. If you are in a different team, be
ready to support incident responders with information about how to use and search
telemetry under your care. Security is everyone’s job.
Compliance with regulation and voluntary frameworks invariably requires keeping
certain kinds of telemetry around for years (often seven years, a number inherited
from the accounting industry). This long-term retention requirement is almost
unique among the telemetry styles here, with metrics being the only other style that
approaches SIEM systems for retention period.

1.2.3 Telemetry usage by Software Engineering and SRE teams


Software Engineering teams are the teams responsible for writing the software in your
production environment. I mention SRE again because their mission to ensure the
availability of your software extends to both the infrastructure the code runs on (section
1.2.1) as well as the actual code itself (this section). Some organizations split systems-ori-
ented SRE from software-oriented SRE as well. Software Engineering brought about the
metrics, observability, and distributed tracing styles of telemetry, all to better track
what their code was doing. Given how connected these teams are to telemetry, figure 1.10
should be no surprise.

Figure 1.10 Telemetry systems used by Software Engineering teams, which


are just about all of them. Each style provides a different view of how their code
is behaving in the production environment.

Where Software Engineering teams are focusing on how their code is performing in
production, SRE teams are focused more on whether or not the code is meeting
promised performance and availability targets. This is a need related to what Software
Engineering desires, but the difference does matter. Software Engineering is very
concerned with failures and how they impact everything. SRE is concerned with overall,
aggregated performance.

1.2.4 Telemetry usage by Customer Support teams


Customer Support teams have a variety of titles besides customer support:
 Technical Support
 Customer Success
 Customer Services
 API Support
 Account Management

The charge of teams of this type is to work with your customers (or users, or employ-
ees) and resolve problems. This team has the best information about how your produc-
tion system actually works for people, so if your Software Engineering and especially
SRE teams are not talking to them, something has gone horribly wrong in your organi-
zation. This communication needs to go both ways because when Customer Support
teams are skilled in using the telemetry systems used by Software Engineering, the
quality of problem reports improves significantly. In an organization where Customer
Support has no access to telemetry systems, problem reports come in sounding like:

Account 11213 had a failed transaction on March 19, 2028 at 18:02 UTC.
They say they’ve had this happen before, but can’t tell us when. They’re a
churn risk.

Compare this report to the kind of report your Customer Support teams can make if
they have access to query telemetry systems:

Account 11213 has had several failed transactions. The reported transaction
was ID e69aed5a-0dfc-47e2-abca-8c11374b626f, which has a failure in it when
I looked it up. That failure was found four more times with this account. I also
saw it happening for five other accounts and reached out to all five. Two have
gotten back to me and thanked us for proactively notifying them of the prob-
lem. It looks like accounts with billing-code Q are affected.

This second problem report is objectively far better because the work of isolating
where the actual problem may be hiding has mostly been done. You want to empower
your Customer Support teams. Figure 1.11 demonstrates the sort of telemetry systems
Customer Support makes the best use of.
Customer Support teams work with customers to figure out what went wrong,
which means they are most interested in events that happened recently. Telemetry sys-
tems that rely on aggregation (metrics) are not useful because the single interesting
event is not visible. Telemetry systems that rely on statistical sampling (observability)
can be somewhat useful, but the interesting error needs to be in the sample. This
problem can be gotten around by ensuring you persist error-events outside of the sta-
tistical sample, perhaps in a second errors database.
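Here is a hedged sketch of that approach; the storage helpers are stand-ins and the one-in-ten rate is invented. Ordinary events are sampled, but anything marked as an error is always persisted, so the specific failure a customer reports can still be found:

SAMPLE_RATE = 10   # keep roughly one ordinary event in ten

# Stand-ins for real storage backends.
def write_to_sampled_store(event)
  puts "sampled store: #{event}"
end

def write_to_errors_database(event)
  puts "errors db:     #{event}"
end

def record_event(event)
  if event[:error]
    # Errors bypass sampling entirely so the one interesting event is never lost.
    write_to_errors_database(event)
    write_to_sampled_store(event)
  elsif rand(SAMPLE_RATE).zero?
    # Record the rate so aggregate counts can be re-weighted later.
    write_to_sampled_store(event.merge(sample_rate: SAMPLE_RATE))
  end
end

record_event({ error: false, name: "checkout", duration_ms: 87 })
record_event({ error: true,  name: "checkout", account: 11213, failure: "card_declined" })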


Figure 1.11 The telemetry systems Customer Support teams make the best use
of. Because Customer Support teams are most interested in specific failures,
telemetry styles that rely on aggregation (metrics) are not as useful. Note that in
cases where Customer Support is more of a Helpdesk for internal users, SIEM
access is often also granted and useful.

1.2.5 Telemetry usage by Business Intelligence


Business Intelligence teams are sneaky. They work on the telemetry of the business rather
than the telemetry of the technical organization. Their versions of telemetry include
details like marketing conversion rates, rate of account upgrade/downgrade, signup
rate, feature usage, click rates in email marketing campaigns, and many more. While
BI teams often aren’t considered members of the technical organization, I mention
them here for two reasons:
1 People inside Business Intelligence teams often already have training in statisti-
cal methods, so they represent an internal resource for you when you start
applying statistical methods to your technical telemetry.
2 If you are building a Software-as-a-Service platform, they are likely to approach
you to engineer telemetry flows into their systems alongside the ones you
already have for your technical organization.
If your organization already has people skilled in handling and manipulating data,
you need to talk to them when you are building or upgrading data handling systems.
They will tell you when your plan for aggregating data won’t return valid results (get-
ting the MAX value of a series of data that was averaged will not give you the true MAX
value in the source data, for instance). It can feel strange to ask people in a radically
different department for help, but you will build better systems because of it.
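To make that concrete, here is a tiny sketch with invented numbers showing why the maximum of per-minute averages understates the true maximum in the raw data:

# Raw latencies, grouped per minute; minute 2 contains a genuine spike.
raw_latencies_ms = [
  [40, 45, 42, 41],
  [38, 44, 950, 47],
  [43, 39, 41, 40]
]

per_minute_averages = raw_latencies_ms.map { |minute| minute.sum / minute.length.to_f }

puts per_minute_averages.max      # => 269.75  (the spike has been diluted by averaging)
puts raw_latencies_ms.flatten.max # => 950     (the true MAX in the source data)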

1.3 Challenges facing telemetry systems


Telemetry systems face the same disasters that production systems do: fire, flood,
equipment failure, unexpected credit-card expiration, labor actions, bankruptcies,
pandemics, civil unrest, and many more. However, telemetry systems in particular are
vulnerable to disasters specific to telemetry, which is what this section introduces. The
major problems derive from three points:
 Telemetry systems aren’t revenue systems (see section 1.3.1).
 Different teams need different things out of the telemetry they use (see section
1.3.2).
 Telemetry is still data, even if it isn’t production data, and many people have an
interest in data (see sections 1.3.3 and 1.3.4).


1.3.1 Chronic under-investment harms decision-making


By far the biggest threat to your telemetry systems is insufficient investment of resources.
Telemetry systems are a decision support tool—they present the feedback you send out
of your production environment in a way that helps you figure out what to do next.
Maybe you focus on technical debt for the next couple of sprints because your availabil-
ity metrics are failing your Service Level Agreement targets. Or maybe handling 12,000
connections a second is when your load-balancer nodes start responding badly, and it’s
time to buy more for your cluster. Whatever you’re looking for, these are the systems
that will help you find it. Under-investment can be caused by many misperceptions:
 Telemetry systems aren’t revenue systems. If they don’t make money, they’re
overhead, so cut overhead to make profit; QED. This shows an incorrect under-
standing of the value telemetry systems provide to the overall organization.
 Don’t fix what ain’t broke. What you have now works fine, so why bother chang-
ing? This shows a disconnect between the people who would get the best value
out of well-designed telemetry systems and the people who permit the time and
money to be spent to get a well-designed telemetry system.
 Centralized logging is all we need. Centralized logging is a powerful tool, as
decades of computer management has shown. Centralized logging can also be
forced to do the jobs of a SIEM, metrics, observability, and even distributed trac-
ing, but it does so much more poorly than systems designed for the task would.
You might find yourself spending more time and money writing glue automation to
make this square peg fit a hexagonal hole than a brand new, specialized system
would cost to set up.
 Features! Features! Features! We don’t have time for that. We need to ship this
next set of features to enable sales! More of a pathology in growing SaaS compa-
nies, but one that also misunderstands the value telemetry systems bring to an
organization.
The broad trends boil down to not understanding what telemetry does and also a dis-
connect between those feeling the pain and those who would approve fixing the pain.
None of these problems are easy fixes for a single technician on a team. Depending
on your organizational culture, fixing them may not even be possible for the
managers who are feeling the pain because approving major updates like a telemetry
system install needs to happen so far up the chain of command that it doesn’t matter.
When facing these headwinds, you still have a chance to make a change. If the
organizational culture is otherwise good, under-investment is largely a problem of
ignorance. You can fix ignorance.
 Explain the kinds of decisions that improved telemetry systems will enable.
Managers get that sort of language.
 Explain how paying for a SaaS provider now will improve everyone’s ability to
make decisions faster than spending 24 months building your own systems would
(and the SaaS product will likely have way better features than DIY).


 Explain how a new telemetry style works, provide a framework for how it would
operate in your existing production systems, and point out how that will
improve prioritization of work.
Personally, I spent 14 years in the public sector, of which seven were spent in reces-
sions. Organizations like these are at the mercy of an annual or biannual budgeting
process where a group of (rarely technical) elected officials will ultimately decide
whether or not you get your expensive new system. This is hard work, but it can be
done. Make the case, do it well, and plan far enough in advance (months if not years)
that you won’t be in a panic if the answer comes down to not this year.

1.3.2 Diverse needs resist standardization


As you read through Section 1.2, covering the many different teams in a technical
organization and their telemetry needs, you might have noticed some differing goals along the
way. Here are some of the top-level differences:
 Customer Support teams need recent telemetry, within a few weeks, and they
need all of it (no aggregation or summarization) just in case what a customer is
talking about is in there.
 Security teams need their SIEM systems to keep seven or more years of telemetry.
 In order to be economic, Observability and Distributed Tracing systems need to
statistically sample their data.
 Centralized Logging systems are the most expensive system to operate (on an
expense-per-day-of-telemetry basis), so keeping years of telemetry online is pro-
hibitively expensive. Sometimes even a few weeks is too expensive for an organization.
Figure 1.12 provides a view into the diverse storage and retention needs of the five
telemetry styles talked about here.

Figure 1.12 The five telemetry styles charted for their preferred online availability periods.
SIEM systems have the longest retention due to external requirements. Observability and
distributed tracing achieve their retention through the use of statistical sampling. Metrics
achieves its duration through aggregations on the numbers stored inside. Centralized logging,
well, is just plain expensive so it gets the smallest online retention period.


A one-policy-applies-to-all approach simply will not work for a telemetry system. Your
retention policies need to be written in ways that accommodate the diverse needs of
your teams and telemetry systems in use. Chapter 17 is dedicated to this topic.
There is also diversity in the shape of your telemetry data itself.
 Hardware emits in Syslog or SNMP, and if you can’t handle that then you’re
simply not going to get that telemetry.
 Telemetry SaaS provider SDKs might not have support for emitting telemetry
through HTTP proxies, a required feature in many production environments.
 Platform services like VMWare vCenter have their own telemetry handling systems.
 Infrastructure providers like Amazon Web Services and Digital Ocean provide
telemetry in their own formats and in their own locations, leaving it up to you
to fetch and process it.
 Operating System components (Windows, Linux, FreeBSD, AIX, HP-UX, z/OS,
etc.) emit in their own formats like Syslog or Windows Event Log. If you want
that telemetry, you will need to handle those formats.
 What programming languages your production systems are written in (and
their age) can prevent you from even having access to SDKs for observability and dis-
tributed tracing.
Challenges like these certainly increase the complexity of your telemetry system, but
they’re not insurmountable. Chapters 3 and 4 cover methods of moving telemetry
(Chapter 3) and transforming formats (Chapter 4). If you happen to be using a lan-
guage or platform unloved by the hot-new-now tech industry, you’re likely already
used to building support for new things yourself; I’m sorry (she says, having run Tom-
cat apps on NetWare, successfully).

1.3.3 Information spills and cleaning them up


Information spills and their consequences are a direct result of increasing legislation
regarding privacy (personally identifiable information, or PII) and health-related
(personal health information, or PHI) information. Similar to how toxic waste regula-
tion largely didn’t exist until the last half of the 20th century, the first half of the 21st
century is seeing information starting to get classified as “toxic”. Telemetry systems
receive feedback from production systems, this is true; however, if production is han-
dling privacy- or health-related data then it is possible the telemetry stream will
include such toxic data as well.
You never want to see privacy- or health-related data in your telemetry systems. If
such data is expected to be there, then access to telemetry data will have to follow all of
the same rigorous (and tedious) access control and usage policies as accessing produc-
tion data does. Making access to your telemetry data harder reduces the overall utility
of your telemetry system in general. There are few organizations culturally and techni-
cally equipped to easily handle data of this type, and they’re all health-care companies.


For the rest of us, keeping privacy- and health-related data out of the telemetry
stream is a never-ending battle:
 The biggest leak source is exception-logging with parameters. Parameters are
incredibly useful for debugging, but can include privacy- and health-related
data. By far this is the largest source of leaks I’ve seen in my own systems. This is
made worse by the fact that many logger modules don’t have redaction con-
cepts baked into them, and software engineers aren’t used to thinking of excep-
tions as needing in-code redaction before emission.
 Unthinking inclusion of IP address and email addresses in logging statements.
Both of these are useful for fighting fraud and for isolating which account a state-
ment is about. Unfortunately, IP addresses and email addresses are protected by
many privacy regulations. If you simply must include these details, consider
hashing them instead to provide correlation without exposing the direct values
(a minimal sketch follows this list).
 Inclusion of any user-submitted data of any kind in logging statements. Users
will stuff all kinds of things they shouldn’t into fields. Unfortunately for you, many
privacy- and health-data regulations require you to detect and respond to leaks of
this type. If you are in a place subject to that kind of law (ask your lawyers) and
have a bug-bounty program, expect to pay out bounties to bug-hunters who find
ways to display user-supplied input on unprotected dashboards. It’s best not to
emit user-submitted data into your telemetry stream in the first place.
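Here is a minimal sketch of the hashing approach mentioned above. The pepper value and environment variable name are hypothetical and belong in your secret store; a keyed hash is used because a plain unsalted hash of an email address or IP is easy to reverse by brute force:

require "openssl"

PEPPER = ENV.fetch("TELEMETRY_HASH_PEPPER", "change-me")   # hypothetical secret

# The same input always yields the same token, so events can still be correlated
# by account, but the raw address never enters the telemetry stream.
def correlation_token(value)
  OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new("SHA256"), PEPPER, value.to_s.downcase.strip)[0, 16]
end

puts correlation_token("Person@Example.com")   # => a stable 16-character token, no email in the logs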
As deeply annoying as it is, you simply must have policies and procedures in place to
retroactively remove mistakenly stored privacy- and health-related information. Legis-
lation making these types of data toxic hasn't been around long enough for telemetry-
handling modules to include in-flight redaction as a standard feature alongside
log-levels. Hopefully this will change in the future. Until then, we have to know how to
clean up toxic spills. Chapter 16 covers this topic extensively.

1.3.4 Court-orders break your assumptions


Just about every country on the planet has judicial rules in place allowing parties to
a lawsuit to request, from the opposing party, business records relevant to the matter
at hand. Email is famously a business record, but telemetry data is as well.
This means that if your organization is party to a lawsuit, the opposing counsel (the
other side’s lawyers) can request telemetry data. What happens between the request
being made and you taking any action on it will be decided between your organization’s
lawyers, the opposing counsel, and the judge overseeing the case, as seen in figure 1.13.
Where court processes break telemetry system assumptions has to do with the two
most likely court orders and demands you can receive during a lawsuit:
 Request to produce documents. This is the flow described in figure 1.13, and will
require you to create an extract of telemetry. The format you need to create will be
negotiated between the teams of lawyers, so you are likely to be pulled in to con-
sult with your own lawyers on what the capabilities of your telemetry systems are.


Figure 1.13 A greatly simplified flow of the document discovery process as it relates to
telemetry data. Your lawyers will be fighting on your organization’s behalf to reduce telemetry
you have to give to the other side. You can help this process out by teaching your lawyers what
can and can’t be produced by your telemetry system.

 Request to hold documents. Of the two demands, this is the most impactful to
you, the telemetry system operator. A request to hold documents means you have
to exempt certain telemetry from your aggregation, summarization, and deletion
policies. Because legal matters can take literally years to resolve, in bad cases you
can end up having to store many multiples of your usual telemetry volumes.
Not every organization needs to prepare for lawsuits to such a degree that they have
well-tested procedures for producing and holding telemetry. However, certain indus-
tries are prone to lawsuits (finance, drug manufacture, patent law). Also, certain kinds
of lawsuits, such as those suing for leaking toxic data (see previous section) and insider-
sabotage, are far more likely to dive into telemetry data. You should have at least a
whiteboard plan for what to do when facing a court order. Chapter 18 covers this topic.

1.4 What you will learn


This book gives you a reference guide for operating telemetry systems for any team in
a technical organization. It focuses on optimizing the operation of systems involved in
handling and displaying telemetry, rather than optimizing your overall use of teleme-
try. To benefit from this book, you should have worked with telemetry systems in some
capacity, such as making searches in dashboards or writing logging statements in code.
You should also have manipulated and searched strings using code, and built queries
in graphical applications. You will learn:
 The architecture of telemetry systems and how your current telemetry systems
follow the architecture.
 How to optimize your telemetry-handling to reduce costs and increase the
online searchable period.
 How to ensure the integrity of your telemetry systems to support regulation and
compliance frameworks, as well as security investigations.
 Techniques to use to support court orders as part of legal processes.
 Procedures to implement for safely handling, and disposing of, regulated informa-
tion such as Personally Identifiable Information and Personal Health Information.


To help guide this, I will be using examples drawn from three different styles of tech-
nical organizations. Know that what I teach here is applicable to a growing startup, to
companies with a founding date in the 1700s, and to organizations where writing and
running software only supports the business and is not the business.

Summary
 Telemetry is the feedback you get from your production systems.
 Telemetry is how modern computing works at all, because telemetry is what tells
us what our production systems are up to.
 Telemetry ultimately supports the decisions you have to make about your produc-
tion systems. If your telemetry systems are poor, you will make poor decisions.
 Centralized logging is the first telemetry style to emerge, which happened in
the mid-1980s, and brings all logging produced by your production systems to a
central location.
 Logging format standards like Syslog mean that hardware systems emit in stan-
dard formats, so you need to support those formats as well if you want telemetry
from hardware systems.
 Syslog introduced the concept of log-levels (debug, info, warn, error) to the
industry.
 Metrics emerged in the early 2010s and focuses on aggregatable numbers to
describe what is happening in your production systems.
 Cardinality is the term for index complexity in databases. The more fields in a
table, the higher the cardinality. Centralized logging is a high-cardinality sys-
tem; metrics systems generally are low-cardinality.
 Observability grew up in the mid-2010s as a result of frustration over the limita-
tions of centralized logging, and takes a systematic approach to tracking events
in your production systems.
 Observability provides extensive context to events, which makes isolating prob-
lems much easier versus centralized logging.
 Software-as-a-Service companies dominate the observability space due to the
complexity of observability systems.
 Distributed tracing emerged in the late 2010s as a specialized form of observabil-
ity focused on tracing events across an execution-flow crossing system boundaries.
 Distributed tracing provides the context of the entire execution-flow when
investigating an interesting event, which further improves your ability to isolate
where a problem truly started.
 Security Information Event Management systems are specialist telemetry sys-
tems for Security and Compliance teams, and store information relating to the
security use-case.
 SIEM systems store consistent information because regulation and voluntary
compliance frameworks largely ask for tracking of the same kinds of data, and
often require such data to be stored for many years.


 Operations and DevOps teams use telemetry to track how their infrastructure
systems are operating, focusing on centralized logging and metrics styles.
 Security and Compliance teams focus on both centralized logging and SIEM
systems because SIEM systems share a lot of history with centralized logging,
and centralized logging is useful during audits for compliance with regulation
and external compliance frameworks.
 Software Engineering teams use every telemetry system except SIEM systems in
an effort to understand how their code is behaving in production.
 Site Reliability Engineering teams also use every telemetry system except SIEM
in their mission to ensure the organization's software is available.
 Customer Support teams make use of centralized logging, observability, and dis-
tributed tracing styles to better isolate problems reported by customers, and to
improve the quality of bug reports sent to engineering.
 Business Intelligence teams are often not part of the technical organization but
are responsible for building systems for business telemetry. BI people are valu-
able resources when deploying a new telemetry style, due to their familiarity
with statistical methods.
 A chronic threat to telemetry systems is under-investment, which can stem from
a misunderstanding of the value telemetry systems bring to the organization
and a disconnect between decision-makers and those feeling the pain of a bad
telemetry system.
 Different teams need different things from telemetry, and different telemetry
styles benefit from different retention periods. Your telemetry system needs to
accommodate these differences in order to be a good telemetry system.
 Hardware, SaaS providers, infrastructure providers, third party software, and
operating systems all emit telemetry in relatively fixed formats. Your telemetry
system needs to handle these formats if you want their telemetry.
 Privacy information (PII) and health information (PHI) require special han-
dling, and most telemetry systems aren't built for that usage. Do what you can to
keep PII and PHI out of your telemetry systems.
 The largest source of toxic information spills are exception-logs that include
parameters; do what you can to redact these before they enter the telemetry
pipeline.
 Telemetry systems are subject to court orders the same as production systems
are, so you may be called upon to produce telemetry by your lawyers for a legal
matter.
 A court order to hold data means the affected data is no longer subjected to your
retention policy, which can be quite expensive if the legal matter drags on for years.
