
30-SECOND
DATA SCIENCE

50 KEY CONCEPTS AND
CHALLENGES, EACH EXPLAINED
IN HALF A MINUTE

Editor
Liberty Vittert
Contributors
Maryam Ahmed
Vinny Davies
Sivan Gamliel
Rafael Irizarry
Robert Mastrodomenico
Stephanie McClellan
Regina Nuzzo
Rupa R. Patel
Aditya Ranganathan
Willy Shih
Stephen Stigler
Scott Tranter
Liberty Vittert
Katrina Westerhof

Illustrator
Steve Rawlings
First published in North America in 2020 by
Ivy Press
An imprint of The Quarto Group
The Old Brewery, 6 Blundell Street
London N7 9BH, United Kingdom
T (0)20 7700 6700
www.QuartoKnows.com

Copyright © 2020 Quarto Publishing plc


All rights reserved. No part of
this book may be reproduced or
transmitted in any form by any means,
electronic or mechanical, including
photocopying, recording or by any
information storage-and-retrieval
system, without written permission
from the copyright holder.
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.
ISBN: 978-0-7112-5966-9
eISBN: 978-0-7112-6195-2
This book was conceived,
designed and produced by
Ivy Press
58 West Street, Brighton BN1 2RA, UK
Publisher David Breuer
Editorial Director Tom Kitch
Art Director James Lawrence
Commissioning Editor Natalia Price-Cabrera
Project Editor Caroline Earle
Designer Ginny Zeal
Illustrator Steve Rawlings
Glossaries Maryam Ahmed
Printed in China
10 9 8 7 6 5 4 3 2 1
CONTENTS

6 Foreword
8 Introduction

12 Basics
14 GLOSSARY
16 Data Collection
18 How We Visualize Data
20 Learning From Data
22 Tools
24 Regression
26 Profile: Francis Galton
28 Clustering
30 Statistics & Modelling
32 Machine Learning
34 Neural Networks & Deep Learning

36 Uncertainty
38 GLOSSARY
40 Sampling
42 Correlation
44 Regression to the Mean
46 Confidence Intervals
48 Sampling Bias
50 Bias in Algorithms
52 Profile: George Box
54 Statistical Significance
56 Overfitting

58 Science
60 GLOSSARY
62 CERN & the Higgs Boson
64 Astrophysics
66 CRISPR & Data
68 The Million Genome Project
70 Profile: Gertrude Cox
72 Climate Change
74 Curing Cancer
76 Epidemiology

78 Society
80 GLOSSARY
82 Surveillance
84 Security
86 Privacy
88 Profile: Florence Nightingale
90 Vote Science
92 Health
94 IBM’s Watson & Google’s DeepMind

96 Business
98 GLOSSARY
100 Industry 4.0
102 Energy Supply & Distribution
104 Logistics
106 Profile: Herman Hollerith
108 Marketing
110 Financial Modelling
112 New Product Development

114 Pleasure
116 GLOSSARY
118 Shopping
120 Dating
122 Music
124 Profile: Ada Lovelace
126 Sports
128 Social Media
130 Gaming
132 Gambling

134 The Future
136 GLOSSARY
138 Personalized Medicine
140 Mental Health
142 Smart Homes
144 Profile: John W. Tukey
146 Trustworthiness Score
148 Artificial Intelligence (AI)
150 Regulation
152 Ethics

154 Resources
156 Notes on Contributors
158 Index
160 Acknowledgements
FOREWORD
Xiao-Li Meng

“If you want to solve all the problems in the world, major in computer
science.” When a speaker at an AI conference displayed this line, my
statistical ego got provoked immediately to almost six-sigma level.
Thankfully, the next line appeared in less than three seconds: “If you
want to solve all the problems created by computer science, enrol in a
graduate program in the Faculty of Arts and Sciences.”
Whoever humoured us with this clever pairing has a profound
understanding of the brilliant and bewildering time we live in. Advances
in computer science and technology created the digital age, which in turn
created data science. Nothing seems beyond reach when we can have so
much data to reveal the secrets of nature and its most advanced species.
But there is no free lunch – a universal law of data science (and of life).
Here are just a couple of paradise-paradox pairings of food for thought.
Personalized medicine surely sounds heavenly. But where on earth can they
find enough guinea pigs for me? Undoubtedly, we need to collect as much
data as possible about humans in order to advance the AI technologies.
But please only study other people – don’t you dare invade my privacy!
For those who can still enroll in a graduate program and afford at
least 31,536,000 seconds, do it as if there is no next 30 seconds. For
those who cannot, this book takes 50 x 30 seconds, give or take your
personal six sigma. Completing this book will not make you a 30-second
data scientist. But without understanding its content, your application
for the digital-age citizenship will be denied with 99% certainty. You can
take a chance, of course.

INTRODUCTION
Liberty Vittert

We have lived as humanists for a long time, using
our instinct, with our thoughts, opinions and experience driving our
decisions. However, now we are moving into the era of Dataism – letting
data drive our every decision. From climate change to the refugee crisis to
healthcare, data is a driving force, and not just in these all-encompassing
issues but also in our daily lives. Instead of going to a bookshop, Amazon
can tell you what you want to read. Likewise, dating apps will tell you
who you are compatible with, using endless amounts of collected data.
Humanism and Dataism are currently pushing back against each other.
Some people want to be purely data-driven, others don’t want to let go
of the human touch. Data science, as a discipline, brings both humanism
and Dataism together. It combines the vast databases, powerful statistical
tools that drive computing processes, and analysis together with the
common sense and quantitative reasoning we as humans have been
developing over thousands of years. Data science does not mean just
being data-driven or human-driven: it is the art of both together.
Before we begin detailing this book, let’s take a step back in time
to the seventeenth century and visit Blaise Pascal, a French mathematician and philosopher with
a crisis of faith. He decided to think about his future options with the
information he had available to him – the data, if you will:

If God doesn’t exist and I believe in Him then I might have a wasted
life with false belief, but nothing happens.
If God doesn’t exist and I don’t believe in Him then I didn’t waste
my life with false belief, but again, nothing happens.
If God does exist and I do believe in Him then I have a wonderful
eternity in Heaven.
But if God does exist and I don’t believe in Him then it’s eternal
hell-fire for me.

Pascal used the data he had to make a decision to optimize his future
happiness and mitigate potential risk. Really, that is what data science is:
taking past and current information in order to predict the likelihood of
future events, or, rather, the closest thing to a crystal ball that the world
has at its disposal. The only difference between us and Pascal is that we
live in a world with far more than four bits of data to analyse; we have
endless amounts.
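Pascal’s four cases are, in modern terms, a payoff table, and his reasoning is an expected-value calculation. Here is a toy sketch of it; the numeric payoffs and the probability are invented stand-ins, not anything Pascal wrote:

```python
# Pascal's four cases as a payoff table. The numbers are illustrative
# stand-ins: small finite costs, and a huge magnitude for eternity.
payoff = {
    ("believe", "no_god"): -1,           # wasted belief, 'nothing happens'
    ("not_believe", "no_god"): 0,        # nothing happens
    ("believe", "god"): 1_000_000,       # wonderful eternity
    ("not_believe", "god"): -1_000_000,  # eternal hell-fire
}

def expected_payoff(action, p_god):
    """Average outcome of an action, weighted by P(God exists)."""
    return (p_god * payoff[(action, "god")]
            + (1 - p_god) * payoff[(action, "no_god")])

# Even if God's existence is judged very unlikely, the huge stakes
# make belief the higher-expected-value choice:
for action in ("believe", "not_believe"):
    print(action, expected_payoff(action, p_god=0.001))
```

With these stand-in numbers, belief wins for all but the tiniest probabilities: it is the asymmetry of the payoffs, not the probability itself, that drives the decision.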
It is estimated that we produce over 2.5 exabytes of data per day.
A quick calculation shows that this is the same amount of information as
a stack of Harry Potter books reaching from the Earth to the Moon and
back, and then going around the circumference of the Earth 550 times.
And that is simply the amount of data produced per day!
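A back-of-envelope version of that calculation fits in a few lines. The book-file size and book thickness below are rough assumptions of mine, so this checks the order of magnitude rather than the exact 550 figure:

```python
# Rough sanity check of the '2.5 exabytes per day' illustration.
daily_bytes = 2.5e18        # 2.5 exabytes
bytes_per_book = 1.5e6      # assumed: one novel as plain text, ~1.5 MB
metres_per_book = 0.05      # assumed: one hardback is ~5 cm thick
earth_moon_m = 3.844e8      # mean Earth-Moon distance in metres

books = daily_bytes / bytes_per_book   # books' worth of data per day
stack_m = books * metres_per_book      # height of the resulting stack
print(f"{books:.1e} books, {stack_m / earth_moon_m:.0f}x the Earth-Moon distance")
```

Whatever the exact assumptions, the stack clears the Moon many times over every single day.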

How the book works


The first two chapters break down data science into its basic elements,
followed closely by its most important and also most under-discussed
facet – what it cannot tell us. The following five chapters venture into
how data science affects us in every way – science, society, business,
pleasure and our world’s future. Within each individual topic there is: a
3-Second Sample, an insightful glimpse into the topic; followed by the
more detailed 30-Second Data explanation; and finally the 3-Minute
Analysis, which provides the reader with an opportunity to take a deeper
dive into particular complexities and nuances of the discussion at hand.
This book has been carefully compiled by industry-specific experts
to help guide us through how data is changing every industry and every
facet of our lives in ways we haven’t even imagined, while clearly showing
the quantitative reasoning and ethical quandaries that come with the
dawn of any new era.

BASICS
GLOSSARY

algorithm Set of instructions or calculations designed for a computer to follow. Writing algorithms is called ‘coding’ or ‘computer programming’. The result of an algorithm could be anything from the sum of two numbers to the movement of a self-driving car.

automation Repetitive tasks or calculations carried out faster and more efficiently by computers than by humans.

Bayesian analysis Statistical method for estimating probabilities based on observed data and prior knowledge, making it possible to answer questions such as ‘What is the probability that a person will develop lung cancer if they are a smoker?’

binary Representation of information as a string of 1s and 0s – an easy format for computers to understand, and so is fundamental to modern computing.

bivariate analysis Restricted to one output or dependent variable.

causal inference Determining whether a change in one variable directly causes a change in another variable. For example, if increased coffee consumption was found to directly cause an improvement in exam scores, the two variables would have a causal relationship.

cluster A group of computers working in parallel to perform a task. It is often more efficient for complex computation tasks to be performed on a cluster rather than a single computer.

core (computer/machine) Central processing unit (CPU) of a computer, where instructions and calculations are executed; communicates with other parts of a computer. Many modern computers use multi-core processors, where a single chip contains multiple CPUs for improved performance.

data set A set of information stored in a structured and standardized format; might contain numbers, text, images or videos.

Enigma code Method of scrambling or encrypting messages employed by the German armed forces during the Second World War which was cracked by Alan Turing and his colleagues at Bletchley Park.

epidemiology The study of the incidence of health conditions and diseases, which populations are most vulnerable and how the associated risks can be managed.

interpretability Extent to which a human can understand and explain the predictions or decisions made by a mathematical model.

model/modelling Real world processes or problems in mathematical terms; can be simple or very complex, and are often used to make predictions or forecasts.

multivariate analysis Measures the effect of one or more inputs, or independent variables, on more than one output, or dependent variable. For example, a study that models the effect of coffee consumption on heart rate and blood pressure would be a multivariate analysis.

normal (Gaussian) distribution Bell-shaped curve describing the spread or distribution of data across different values. Data sets that are often normally distributed include exam scores, the heights of humans and blood pressure measurements. Normal distribution shows the probability of a random variable taking different values. Many statistical analyses assume the data is normally distributed.

statistical association The strength of relationship between two variables or measurements, e.g. there is an association between human age and height. One commonly used measure of association is Pearson’s correlation coefficient.

terabyte A measure of a computer or hard drive’s storage capacity, abbreviated to TB. One TB is equal to one trillion bytes.
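The ‘statistical association’ entry above names Pearson’s correlation coefficient; as a rough illustration, it can be computed directly from its definition. The age/height pairs below are invented for this sketch, not from the book:

```python
import math

# Pearson's r: covariance of the two measurements divided by the
# product of their spreads. r near +1 means a strong positive
# association, near -1 strong negative, near 0 little association.
ages = [5, 10, 15, 20, 25]
heights = [110, 140, 165, 175, 178]  # cm (invented)

n = len(ages)
mean_age = sum(ages) / n
mean_height = sum(heights) / n
cov = sum((a - mean_age) * (h - mean_height) for a, h in zip(ages, heights))
spread_age = math.sqrt(sum((a - mean_age) ** 2 for a in ages))
spread_height = math.sqrt(sum((h - mean_height) ** 2 for h in heights))

r = cov / (spread_age * spread_height)
print(round(r, 3))  # close to +1: age and height rise together here
```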
DATA COLLECTION
the 30-second data

Data science was born as a subject when modern computing advances allowed us to suddenly capture information in huge amounts. Previously, collecting and analysing data was limited to what could be done by hand. Modern advances now mean that information is collected in every part of our lives, from buying groceries to smart watches that track every movement. The vast amount now collected is set to revolutionize every aspect of our lives, and massive companies have emerged that collect data in almost unimaginable quantities. Facebook and Google, to name just a couple, collect so much information about each of us that they could probably work out things about us that even closest friends and family don’t know. Every time we click on a link on Google or like a post on Facebook, this data is collected and these companies gain extra knowledge about us. Combining this knowledge with what they know about other people with similar profiles to ourselves means that these companies can target us with advertising and predict things about us that we would never think possible, such as our political allegiances.

3-SECOND SAMPLE
Since the invention of modern computing, ‘big data’ has become a new currency, helping companies grow from conception to corporate giants within a decade.

3-MINUTE ANALYSIS
The amount of data that we now collect is so massive that the data itself has its own term – big data. The big data collected in the modern era is so huge that companies and researchers are in a constant race to keep up with the ever-increasing requirements of data storage, analysis and privacy. Facebook supposedly collects 500+ terabytes of data every day – it would take over 15,000 MacBook Pros per day to store it all.

RELATED TOPICS
See also
TOOLS, page 22
SURVEILLANCE, page 82
REGULATION, page 150

3-SECOND BIOGRAPHIES
GOTTFRIED LEIBNIZ
1646–1716
Helped develop the binary number system, the foundation of modern computing.

MARK ZUCKERBERG
1984–
Co-founded Facebook with his college room-mates in 2004, and is now CEO and chairman.

30-SECOND TEXT
Vinny Davies

Personal data has become the sought-after commodity of the technical age.
HOW WE VISUALIZE DATA
the 30-second data

Where does the quote ‘ninety per cent of politicians lie’ come from, and more importantly, is it true? In everyday life, summaries of data can be seen in many forms, from pie charts telling us the nation’s favourite chocolate bar, to news articles telling us the chance of getting cancer in our lifetime. All these summaries come from, or are based on, information that has been collected, but so often summaries seem to contradict each other. Why is this the case? Well, data isn’t usually simple and nor is summarizing it; I may summarize it one way, you another. But who is right? Therein lies the problem: it is possible to be manipulated by the data summaries we are shown. Even summaries that are true may not provide information that is a fair and accurate representation of the data which that summary represents. For instance, did you know that teenage pregnancies dramatically reduce when girls reach 20 years of age? Technically true, but realistically not a useful summary. The next time you see a summary, think about how it could have been manipulated, and then consider the results of the summary accordingly.

3-SECOND SAMPLE
Data is everywhere in everyday life, but most of us don’t work in data science; so how is that data seen and what beliefs are formed from it?

3-MINUTE ANALYSIS
Physically visualizing the massive amounts of complex data collected is a challenge in itself. Most modern data sets are almost impossible to visualize in any sensible way and therefore any visual summaries are usually a very simplified interpretation of the data. This also means that visual summaries can easily be misrepresented, and what is seen isn’t always as straightforward as it seems.

RELATED TOPICS
See also
LEARNING FROM DATA, page 20
CORRELATION, page 42
VOTE SCIENCE, page 90

3-SECOND BIOGRAPHIES
BENJAMIN DISRAELI
1804–81
Former British Prime Minister to whom the quote ‘there are three types of lies: lies, damned lies and statistics’ is often attributed.

STEPHAN SHAKESPEARE
1957–
Co-founder and CEO of opinion polls company YouGov, which collects and summarizes data related to world politics.

30-SECOND TEXT
Vinny Davies

In the realm of data science, seeing is not necessarily believing – it’s always wise to look beyond the summary presented.
LEARNING FROM DATA
the 30-second data

Collecting data is all very well, but once collected, can more be done with it than just summarizing it? Using models, attempts can be made to gain information from the data in a more complex and useful way than before. Models effectively allow data scientists to use one or more pieces of data to predict an outcome (another piece of data) in which they are interested. For instance, age and gender data could be used to predict whether someone will get arthritis in the next five years. Creating a model with age and gender from previous individuals (knowing whether they got arthritis or not) allows us to predict what could happen to a new individual. As well as simply trying to predict future data, data can also be used to try to establish the cause of a particular outcome. This process is called ‘causal inference’ and is often used to help understand disease, for example via analysing DNA. However, even though both examples mentioned are trying to predict cases of arthritis, the modelling problems they represent are subtly different and are likely to require vastly different modelling processes. Choosing the best model based on the data and aims associated with a particular project is one of the major skills all data scientists must have.

3-SECOND SAMPLE
Analysing and modelling data can highlight information not obvious in data summaries, revealing anything from social media trends to causes of cancer.

3-MINUTE ANALYSIS
Learning from data is not a modern phenomenon. In 1854, during an outbreak of cholera in London, Dr John Snow collected and used data to show the source of the disease. He recorded where cholera cases occurred and used the data to map them back to the Broad Street Pump. Residents then avoided the pump, helping to end the outbreak of the disease. The pump remains as a landmark in London to this day.

RELATED TOPICS
See also
DATA COLLECTION, page 16
STATISTICS & MODELLING, page 30
MACHINE LEARNING, page 32

3-SECOND BIOGRAPHIES
JOHN SNOW
1813–58
British physician considered the founding father of epidemiology who is known for tracing the source of a cholera outbreak in London in 1854.

ALAN TURING
1912–54
British mathematician who used data from messages to help crack the Enigma code in the Second World War.

30-SECOND TEXT
Vinny Davies

Once gathered, data can be put through modelling processes, which can enhance understanding.
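The arthritis example can be caricatured in a few lines: ‘train’ on past individuals, then predict for a new one. The records and the similar-age window here are invented for illustration; real models are far more sophisticated:

```python
# Past individuals: (age, developed arthritis within five years?)
history = [(45, False), (50, False), (55, True), (60, True),
           (65, True), (40, False), (70, True), (52, False)]

def predicted_risk(age, window=10):
    """Estimate P(arthritis) from past people of a similar age."""
    similar = [got for a, got in history if abs(a - age) <= window]
    return sum(similar) / len(similar)

print(predicted_risk(58))  # -> 0.6, from the five people aged 48-68
```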
TOOLS
the 30-second data

Dealing with the massive data sets that are collected and the complex processes needed to understand them requires specialist tools. Data scientists use a wide variety of tools to do this, often using multiple different tools depending on the specific problem. Most of these tools are used on a standard computer, but in the modern era of cloud computing, work is beginning to be done on large clusters of computers available via the internet. A lot of large tech companies offer this service, and these tools are often available to data scientists. In terms of the more standard options in a data scientist’s toolbox, they can generally be divided into tools for managing data and tools for analysing data. Often, data is simply stored in spreadsheets, but sometimes, when data gets larger and more complex, better solutions are required, normally SQL or Hadoop. There is a much larger variety of tools used for analysing data, as the methods used often come from different communities, for instance statistics, machine learning and AI, with each community tending to use different programming languages. The most common programming languages used to analyse data tend to be R, Python and MATLAB, although often data scientists will know multiple languages.

3-SECOND SAMPLE
Data is big, models are complex, so data scientists have to use all the computational tools at their disposal. But what are these tools?

3-MINUTE ANALYSIS
While not explicitly a tool in the same sense as Python, SQL, etc., parallel computing is an important part of modern data science. When you buy a computer, you will likely have bought either a dual or quad core machine, meaning that your computer is capable of processing two or four things simultaneously. Many data science processes are designed to use multiple cores in parallel (simultaneously), giving faster performance and increased processing capabilities.

RELATED TOPICS
See also
DATA COLLECTION, page 16
LEARNING FROM DATA, page 20
STATISTICS & MODELLING, page 30

3-SECOND BIOGRAPHIES
WES MCKINNEY
1985–
Python software developer who founded multiple companies associated with the development of Python.

HADLEY WICKHAM
fl. 2006–
Researcher and Chief Scientist at RStudio, known for the development of a number of key tools within the R programming language.

30-SECOND TEXT
Vinny Davies

Data scientists will choose a tool or programming language to suit the task at hand.
REGRESSION
the 30-second data

Regression is a method used to explain the relationship between two or more measurements of interest, for example height and weight. Based on previously collected data, regression can be used to explain how the value observed for one measurement is related to the value observed for another quantity of interest. Generally, regression allows for a simple relationship between the different types of measurements, such that as the value of one measurement changes, then we would expect the other measurement to change proportionally. Regression allows data scientists to do a couple of useful things. Firstly, it enables them to interpret data, potentially providing the chance to understand the cause of the relationship behind the measurements of interest. For instance, a relationship between data related to smoking and cancer could be identified, which would help to identify that smoking increases the risk of cancer. Secondly, it allows for predictions of future measurements based on observing just some of the measurements. If we know how much someone smokes, we can use regression to predict their chances of getting cancer in the future. This is based on the data we have seen before from other people, including how much they smoked and whether they went on to develop cancer.

3-SECOND SAMPLE
Regression predicts values based on the data collected and is one of the most important tasks in data science.

3-MINUTE ANALYSIS
Regression is not always as simple as predicting one measurement from another. Sometimes there are millions of pieces of related data that need to go into the regression model, for example DNA data, and sometimes the different pieces of data have complex relationships with each other. More complex regression methods allow for situations such as this, but often require much more complex mathematics. A big part of data science is choosing the best regression model for the data available.

RELATED TOPICS
See also
DATA COLLECTION, page 16
REGRESSION TO THE MEAN, page 44
OVERFITTING, page 56

3-SECOND BIOGRAPHIES
CARL FRIEDRICH GAUSS
1777–1855
German mathematician who discovered the normal (Gaussian) distribution in 1809, a critical part of most regression methods.

FRANK E. HARRELL
fl. 2003–
Professor of Biostatistics at Vanderbilt University, Nashville, and author of renowned textbook Regression Modelling Strategies.

30-SECOND TEXT
Vinny Davies

Regression helps data scientists understand relationships within collected data and make predictions about the future.
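The smoking example can be made concrete with ordinary least squares, the simplest regression method: fit a straight line through past observations, then read predictions off the line. All numbers below are invented for illustration:

```python
# Hypothetical past data: cigarettes per day vs relative cancer risk.
xs = [0, 5, 10, 20, 30, 40]
ys = [1.0, 2.1, 3.0, 5.2, 6.9, 9.1]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# Least-squares line: slope = cov(x, y) / var(x).
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

def predict(cigarettes_per_day):
    """Predicted relative risk for a new person."""
    return intercept + slope * cigarettes_per_day

print(round(predict(25), 2))  # -> 6.05
```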
16 February 1822
Born in Birmingham, England

1844
Receives BA degree, Cambridge

1850–2
Travels in southwest Africa (now Namibia)

1863
Creates the first modern weather maps and discovers the anti-cyclone phenomenon

1869
Publishes Hereditary Genius, a first study of the inheritance of talent

1883
Coins the term ‘eugenics’

1885
Discovers the phenomenon of regression to the mean, leading eventually to modern multivariate analysis

1888
Discovers and names correlation

1892
His book Fingerprints launches a new era in forensic science

17 January 1911
Dies in Surrey, England
FRANCIS GALTON

Francis Galton created the key to modern data analysis: the framework for the study of statistical association. Galton was born into a notable English family in 1822. Subsequently, however, the family would be best known for him and for his cousin Charles Darwin. Galton attended Cambridge, where he learned that formal mathematics was not for him, and while he then studied medicine, that profession, too, did not inspire him. When Galton was 22 years old his father died, leaving him sufficient wealth that he was able to live the rest of his life independent of need. For a few years he travelled, and for a nearly two-year period from 1851, went deep into southwest Africa, where he explored and met the people. At one point he helped negotiate a peace between two tribes.

In 1853 Galton married and settled down to a life in science. At first he wrote about travel, and he invented new forms of weather maps that incorporated glyphs showing details on wind, temperature and barometric readings. From these he discovered the anti-cyclone phenomenon, where a drop of barometric pressure reverses the cyclonic wind motion in the northern hemisphere. With the publication of his cousin Darwin’s book The Origin of Species in 1859, Galton’s main interest shifted to the study of heredity, anthropology and psychology. His most lasting inventions were the statistical methods he devised in those pursuits. Galton invented correlation and discovered the phenomenon of regression, and he may, with some justice, be credited with taking the first major steps to a real multivariate analysis. His ideas are basic to all proper studies of statistical prediction, and to twentieth-century Bayesian analysis as well. Galton coined the term ‘eugenics’ and he promoted certain parts of this, but also wrote against others that would lead much later to associating eugenics with genocidal practices in the mid-twentieth century. Galton opposed the practice of creating heritable peerages and he encouraged granting citizenship to talented immigrants and their descendants. Some of his studies of inheritance came close to but did not reach Mendelian genetics, but he did help create the methods that would lead to the explosive development of biology after Mendel’s work was rediscovered in 1901. Galton pioneered the use of fingerprints as a method of identification. He died childless in 1911, leaving his moderate fortune to endow a professorship and research at University College London.

Stephen Stigler
CLUSTERING
the 30-second data

Splitting data samples into relevant groups is an important task in data science. When the true categories for collected data are known, then standard regression techniques – often called ‘supervised learning’ – can be used to understand the relationship between data and associated categories. Sometimes, however, the true categories for collected data are unknown, in which case clustering techniques, or unsupervised learning, can be applied. In unsupervised learning, the aim is to group samples of data into related groups or clusters, usually based on the similarity between measurements. The meaning of these groups is then interpreted, or the groups are used to inform other decisions. A simple example of clustering would be to group animals into types based on characteristics. For instance, by knowing the number of legs/arms an animal has, a basic grouping can be created without knowing the specific type of animal. All the two-legged animals would likely be grouped together, and similarly animals with four and six legs. These groups could then easily be interpreted as birds, mammals and insects respectively, helping us learn more about our animals.

3-SECOND SAMPLE
Sometimes data scientists don’t have all the necessary data to carry out regression, but in many cases clustering can be used to extract structure from data.

3-MINUTE ANALYSIS
Netflix users aren’t divided into specific categories, but some users have similar film tastes. Based on the films that users have watched or not watched, users can be clustered into groups based on the similarity of their watched/unwatched movies. While trying to interpret the meaning of these groups is difficult, the information can be used to make film recommendations. For instance, a user could be recommended to watch Ironman if they hadn’t watched it but everyone in their cluster had.

RELATED TOPICS
See also
LEARNING FROM DATA, page 20
REGRESSION, page 24
STATISTICS & MODELLING, page 30

3-SECOND BIOGRAPHIES
TREVOR HASTIE
1953–
Professor at Stanford University and co-author of The Elements of Statistical Learning.

WILMOT REED HASTINGS JR
1960–
Chairman and CEO of Netflix, who co-founded the company in 1997 as a DVD postage service.

TONY JEBARA
1974–
Director of Machine Learning at Netflix and Professor at Columbia University, USA.

30-SECOND TEXT
Vinny Davies

Clustering enables the grouping of data and the understanding of any connections.
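The animal example is a one-dimensional clustering problem, and a minimal k-means sketch shows the group-by-similarity idea: points join their nearest centre, centres move to the middle of their group, repeat. The data, k and starting centres are illustrative choices, not from the book:

```python
legs = {"sparrow": 2, "ostrich": 2, "dog": 4,
        "cat": 4, "ant": 6, "beetle": 6}  # the measured feature only

def kmeans_1d(values, centres, steps=10):
    for _ in range(steps):
        # assignment step: each point joins its nearest centre
        groups = {c: [] for c in centres}
        for v in values:
            nearest = min(centres, key=lambda c: abs(v - c))
            groups[nearest].append(v)
        # update step: each centre moves to the mean of its group
        centres = [sum(g) / len(g) if g else c for c, g in groups.items()]
    return centres

centres = kmeans_1d(list(legs.values()), centres=[1.0, 3.0, 7.0])
print(sorted(centres))  # -> [2.0, 4.0, 6.0]: birds, mammals, insects
```

Only afterwards do we look at the groups and interpret them as birds, mammals and insects.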
STATISTICS & MODELLING
the 30-second data

When most people hear ‘statistics’ they think of a statistic, for example a percentage. While statistics are an important part of data science, more important is the discipline of statistics, with statistical modelling methods for regression and clustering being some of the most used techniques in data science. Statistical methods are known for providing relatively simple and easily interpretable techniques for analysing data. For this reason, statistical methods are usually the first place to start for most data scientists, although they are often marketed as machine learning, because it sounds cooler and the difference between them can be unclear. Statistical modelling methods often go well beyond simple regression and clustering techniques, but what they almost all have in common is interpretability. Statistical models are usually designed to clearly identify the relationships between different measurements, giving actionable results, which can guide policy in areas like medicine and society. This characteristic of statistical models is vital for helping to work out whether a relationship between measurements is due to a coincidence or resulting from an underlying causal relationship between the sources of the measurements.

3-SECOND SAMPLE
Statistics gives us many of the basic elements of data science, such as percentages, but much more interesting are the numerous methods provided by statistical modelling.

3-MINUTE ANALYSIS
Bayesian statistics takes prior information (data) that is already known, to help inform how to analyse the same type of future data. This information can come in many forms, including measurements from a related study or knowledge about the range of measurements that could occur. In Bayesian modelling, we can incorporate this information into our model in a structured and mathematical way, enabling more informed modelling choices, even with a small amount of data.

RELATED TOPICS
See also
REGRESSION, page 24
CLUSTERING, page 28
CORRELATION, page 42

3-SECOND BIOGRAPHIES
REVEREND THOMAS BAYES
1701–61
British statistician and church minister famous for formulating Bayes’ theorem, the key to Bayesian statistics.

SIR DAVID COX
1924–
Prominent British statistician and former president of the Royal Statistical Society.

30-SECOND TEXT
Vinny Davies

Statistics is the backbone of data science and statistical modelling methods are always the first place to start when looking at new data.
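The idea of combining prior knowledge with observed data is Bayes’ theorem at work. Here is a toy version of the glossary’s lung-cancer question, with invented probabilities purely for illustration:

```python
# Bayes' theorem: P(cancer | smoker) =
#   P(smoker | cancer) * P(cancer) / P(smoker)
p_cancer = 0.01               # prior: 1% of people develop lung cancer
p_smoker_given_cancer = 0.50  # half of those cases are smokers
p_smoker = 0.20               # 20% of the population smokes

p_cancer_given_smoker = p_smoker_given_cancer * p_cancer / p_smoker
print(p_cancer_given_smoker)  # about 0.025: the data updates the 1% prior
```

The observed data (smoking status) raises the prior estimate from 1% to about 2.5%; with different priors or likelihoods the update changes accordingly.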
MACHINE LEARNING
the 30-second data
The idea of machine learning is to teach computers to learn and improve over time in an automated manner, without the need for human assistance. Algorithms can be implemented into systems, where they can make decisions automatically, often speeding up the decision-making process and reducing the possibility of human error. Within the system, the machine learning algorithms use the data they receive to make predictions about the future, helping the system to operate and choose between different options. The algorithm then updates itself based on what it learned from the information it received, ensuring that it continues to make optimal decisions in the future. An everyday example of machine learning in action is Spotify. The music app has millions of users, and data all about the type of music those users like, based on the songs they have listened to. When a new user joins, Spotify knows very little about them and will recommend songs almost at random. But as the user listens to songs, the algorithm continually learns about the music preferences of that user and how they relate to the preferences of other users. The more songs the user listens to, the better the algorithm becomes, and the song recommendations improve for that user.

3-SECOND SAMPLE
Machine learning gives us the ability to learn from data without human intervention, allowing us to automate tasks and remove human decision making.

3-MINUTE ANALYSIS
Films like The Terminator can make machine learning seem scary. So how far are we from robots stealing our jobs and Skynet taking over the world? While machine learning may take over some small jobs, it is unlikely that a truly intelligent computer will be designed which would take all our jobs. Even if that happened, humans would still need to supervise the computer to ensure it didn't make any inhumane decisions (or create robot Arnold Schwarzeneggers!).

RELATED TOPICS
See also
NEURAL NETWORKS & DEEP LEARNING
page 34
PRIVACY
page 86
ARTIFICIAL INTELLIGENCE (AI)
page 148

3-SECOND BIOGRAPHIES
YANN LECUN
1960–
Professor at New York University and Chief AI Scientist at Facebook.

ANDREW NG
1976–
Stanford University Professor famous for his work in machine learning, as well as for founding the Google Brain project and online learning platform Coursera.

30-SECOND TEXT
Vinny Davies

The more data that is collected, the more a machine will learn, and the smarter it will become.

32 g Basics
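The learn-as-data-arrives loop described above can be sketched as a toy online update. This is a hypothetical illustration, not Spotify's actual algorithm: each time the user plays or skips a song in a genre, the estimated preference for that genre is nudged towards the observed feedback.

```python
def update_preference(current, feedback, learning_rate=0.1):
    """Nudge the preference estimate towards the latest feedback (1 = played, 0 = skipped)."""
    return current + learning_rate * (feedback - current)

pref = 0.0                     # a brand-new user: we know nothing about their taste in jazz
for _ in range(30):            # the user then plays 30 jazz songs in a row
    pref = update_preference(pref, 1)
print(round(pref, 3))          # the estimate rises towards 1 as evidence accumulates
```

The first recommendations are near-random (preference 0.0 tells us nothing); each listen shrinks the gap between estimate and behaviour, which is why recommendations improve with use.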
NEURAL NETWORKS & DEEP LEARNING
the 30-second data
Originally inspired by the human brain, neural networks are one of the most common machine learning methods. Like the brain, neural networks consist of a network of interconnected (artificial) neurons which allow the interpretation of images or other types of data. Neural networks are used to help in everyday life, from finding the faces in smartphone photos, to reading addresses on envelopes, ensuring they go to the correct location. Deep learning is a group of methods based around neural networks, but with a much larger number of layers of interconnecting artificial neurons. One of the uses of deep learning is analysing and responding to messages, either in the form of text (customer service chat bots for example) or speech (such as Alexa or Siri). However, the biggest use of deep learning is in image processing. Deep learning can be used to analyse the images captured by driverless cars, interpreting the results and advising the car to adjust its course as needed. It is also beginning to be applied in medicine, with its ability to analyse images such as MRIs or X-rays, making it a good way of identifying abnormalities, such as tumours.

3-SECOND SAMPLE
Many modern technologies rely on neural networks and deep learning, which have given us driverless cars and virtual assistants.

3-MINUTE ANALYSIS
Amazon has created a supermarket where you don't need to scan items. You just pick up items, put them in your bag and walk out. The supermarket works by videoing everyone as they shop and using deep learning to identify each item customers pick up, noting whether they put it in their bag or back on the shelf. When you walk out, the cost of your items is simply charged to your account.

RELATED TOPICS
See also
MACHINE LEARNING
page 32
IBM WATSON & GOOGLE'S DEEPMIND
page 94
ARTIFICIAL INTELLIGENCE (AI)
page 148

3-SECOND BIOGRAPHIES
FRANK ROSENBLATT
1928–71
American psychologist famous for developing the first method that resembles a modern-day neural network.

YOSHUA BENGIO
1964–
Canadian computer scientist famous for his work on neural networks and deep learning.

30-SECOND TEXT
Vinny Davies

While deep learning is a highly sophisticated process, its prevalence in the future will depend on the level of trust it can garner.

34 g Basics
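A minimal sketch of the layered, interconnected artificial neurons described above, assuming nothing beyond the Python standard library: a two-neuron hidden layer trained by gradient descent on the classic XOR problem, which no single-layer network can solve.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# XOR: output 1 only when exactly one input is 1
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

random.seed(0)
w1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]  # input -> hidden weights
b1 = [0.0, 0.0]
w2 = [random.uniform(-1, 1) for _ in range(2)]                      # hidden -> output weights
b2 = 0.0

def forward(x):
    h = [sigmoid(w1[j][0] * x[0] + w1[j][1] * x[1] + b1[j]) for j in range(2)]
    o = sigmoid(w2[0] * h[0] + w2[1] * h[1] + b2)
    return h, o

def total_loss():
    return sum((forward(x)[1] - y) ** 2 for x, y in data)

initial_loss = total_loss()
lr = 0.5
for _ in range(5000):
    for x, y in data:
        h, o = forward(x)
        d_o = 2 * (o - y) * o * (1 - o)                # gradient through the output sigmoid
        for j in range(2):
            d_h = d_o * w2[j] * h[j] * (1 - h[j])      # gradient through hidden neuron j
            w2[j] -= lr * d_o * h[j]
            w1[j][0] -= lr * d_h * x[0]
            w1[j][1] -= lr * d_h * x[1]
            b1[j] -= lr * d_h
        b2 -= lr * d_o
final_loss = total_loss()
print(round(initial_loss, 4), '->', round(final_loss, 4))
```

Deep learning stacks many more such layers, but the mechanism is the same: errors flow backwards through the network and each connection weight is adjusted a little at a time.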
UNCERTAINTY
GLOSSARY

algorithmic bias Algorithms learn how to make decisions by processing examples of humans performing the same task. If this data is taken from a prejudiced source, the model will learn to replicate those prejudices.

automated system Repetitive tasks or calculations carried out by computers, e.g. automated passport gates at airports, self-driving cars and speech-to-text software.

causation If a change in one variable directly causes a change in another variable, causation exists.

correlation Two variables are correlated if a change in one is associated with a change in the other.

cross-validation Fitting and testing a predictive model on different subsets of a data set. Cross-validation can be a method for fine-tuning a model's parameters, and can also provide better estimates of a model's performance.

data point A piece of information. A single data point may consist of several quantities or variables, provided they are all associated with a single observation from the real world.

Gallup poll A series of regular surveys, conducted by the company Gallup, to gauge public opinion on a range of political, economic and social issues.

natural variation Changes or fluctuations that occur in populations or the natural world over time, e.g. natural variations in a country's birth rate over time.

noise Random variations in data collected or measured from the real world. Minimizing or accounting for the effects of noise in data is a crucial step in many statistical analyses.

non-probability sampling Method of sampling from a population, where not all members of the population have an equal chance of selection.

non-response bias Introduced when people who are able or willing to respond to a survey differ significantly from people who do not or cannot respond.

null hypothesis The hypothesis that there is no significant difference between populations, meaning that any observed difference is due to error, noise or natural variation.

38 g Uncertainty
p-value The probability that the results observed in an experiment would occur if the null hypothesis was true.

predictive model A mathematical model which predicts the value of an output, given values of an input.

regularization A technique to discourage overfitting in models.

sample A subset of a population, selected for participation in a study, experiment or analysis.

sampling Selecting members of a population as participants in a study or analysis.

selection bias Introduced when samples for a study are selected in a way that does not result in a representative sample.

self-selection bias Introduced when participants assign themselves to a study, or a group within a study. This may lead to a sample that is biased and unrepresentative of the population.

statistically significant A result that is very unlikely to have occurred if the null hypothesis were true. For example, if a study was investigating whether students who drink coffee perform better in exams than students who don't, the null hypothesis would be 'there is no difference in exam performance between students who do and don't drink coffee.' If a study found significant differences in performance between coffee drinking and non-coffee drinking students, the null hypothesis could be rejected.

time series analysis The analysis of a signal or variable that changes over time. This can include identifying seasonal trends or patterns in the data, or forecasting future values of the variable.

training data Many machine learning models are fitted to training data, which consists of inputs and their corresponding outputs. The model 'learns' the relationship between the inputs and outputs, and is then able to predict the output value for a new, unseen input value.

univariate and multivariate time-dependent data Univariate time-dependent data consists of the values of a single variable over time, whereas multivariate time-dependent data consists of the values of more than one variable.

Glossary g 39
SAMPLING
the 30-second data
'Garbage in, garbage out': data scientists know that the quality of their data determines the quality of their results, so most of them have learned to pay careful attention to measurement collection. When analysts can work with an entire population's data – such as Netflix tracking the film-watching habits of its subscribers – drawing conclusions can be a straightforward matter of just crunching numbers. But that completeness is not always practical. In criminal healthcare fraud investigations, the 'full population' would be health claims records numbering in the trillions. Instead, lawyers might have data scientists strategically choose a subset of records from which to draw conclusions. Other times, as with political polling, all that is available is a sample. If the sample is a randomly chosen one, statistical theories exist to tell us how confident we should be in our generalizations from sample to population. Increasingly, data scientists are relying on what is known as 'non-probability sampling', where the sample is not chosen according to any randomization scheme. So using Twitter data to track the buzz of a candidate or brand will not give a random sample representing the entire population – but it still has worth.

3-SECOND SAMPLE
When the entire population of interest can't be measured or questioned, a sample is taken – but how that is done is as much an art as it is a science.

3-MINUTE ANALYSIS
In 1936, the US was in the Great Depression, and a conservative small-town mayor was challenging President Roosevelt for office. The most influential magazine of the time, Literary Digest, polling 2.4 million voters, predicted a challenger's landslide. Wrong: Roosevelt swept the nation. What happened? The sample was large but biased; the magazine polled its subscribers – car owners and telephone users – all wealthier than average. Within two years Literary Digest had folded, and a new science of statistical sampling was launched.

RELATED TOPICS
See also
DATA COLLECTION
page 16
SAMPLING BIAS
page 48
VOTE SCIENCE
page 90

3-SECOND BIOGRAPHIES
ANDERS NICOLAI KIÆR
1838–1919
First to propose that a representative sample be used rather than surveying every member of a population.

W. EDWARDS DEMING
1900–93
Wrote one of the first books on survey sampling, in 1950, which is still in print.

GEORGE HORACE GALLUP
1901–84
American pioneer of survey sampling techniques and inventor of the Gallup poll.

30-SECOND TEXT
Regina Nuzzo

Statisticians work to find out the accuracy of conclusions even from irregular samples.

40 g Uncertainty
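The Literary Digest story can be reproduced in miniature. In this hypothetical simulation, a modest random sample lands close to the population's true average, while a far larger 'convenience' sample drawn only from the wealthier members does not, no matter its size.

```python
import random

random.seed(0)
# 100,000 people, each with a wealth score from 0 to 999
population = [random.randrange(1000) for _ in range(100_000)]
true_mean = sum(population) / len(population)

# a modest but random sample of 500 people...
random_mean = sum(random.sample(population, 500)) / 500

# ...versus a huge biased sample of only the wealthy (score >= 700),
# like polling only car and telephone owners in 1936
wealthy = [p for p in population if p >= 700]
wealthy_mean = sum(wealthy) / len(wealthy)

print(round(true_mean), round(random_mean), round(wealthy_mean))
```

The biased sample contains tens of thousands of people and is still far off target: sample size cannot cure a non-random selection scheme.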
CORRELATION
the 30-second data
A correlation is a kind of dance – a 'co-relation' – between two features in a data set. A positive correlation means the dancers are moving more or less in the same direction together: when crude oil prices rise, for example, retail petrol prices also tend to rise. A negative correlation means the dancers are still in sync but are moving in opposite directions: longer website loading times are associated with lower customer purchase rates. Correlations can only capture linear relationships, where two features can be visualized on a graph together as a straight line. That means an analysis of business characteristics such as staff cheerfulness and customer satisfaction might return a 'zero correlation' result, hiding a more interesting story underneath: a curvilinear relationship, where customers dislike too little cheerfulness but also too much. Another problem is that correlation is not the same as causation. Sales of ice cream and drowning deaths are positively correlated, but of course that does not mean that banning the sale of ice cream will save lives. The causation culprit is often a third characteristic (daily temperature). It is up to the analyst to intelligently use all available information to figure out whether the apparent cause-and-effect is real.

3-SECOND SAMPLE
At the heart of modern data science lies a surprisingly simple concept: how much do two things move in sync with each other?

3-MINUTE ANALYSIS
In 2014, for a fun project before final exam week, Harvard law student Tyler Vigen purposely set out to find as many coincidental correlations as possible across multiple data sets. His website Spurious Correlations quickly went viral, allowing millions of visitors to view graphs showing the high correlation over time between oddball variable pairs, such as the number of people who have died by becoming tangled in their bedsheets and the per capita cheese consumption in the US.

RELATED TOPICS
See also
REGRESSION TO THE MEAN
page 44
OVERFITTING
page 56

3-SECOND BIOGRAPHIES
KARL PEARSON
1857–1936
English mathematician who developed Pearson's correlation coefficient, the most common way to measure correlation.

JUDEA PEARL
1936–
Israeli-American computer scientist and philosopher whose work has helped researchers distinguish correlation from causation.

30-SECOND TEXT
Regina Nuzzo

Graphs illustrating dynamic relationships can be a data scientist's most powerful tool.

42 g Uncertainty
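The 'zero correlation hiding a real relationship' point is easy to demonstrate. A sketch in plain Python, computing Pearson's r from its definition: a perfectly linear relationship scores 1, while a perfectly U-shaped one scores 0, even though the two variables are completely dependent.

```python
def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

x = [-3, -2, -1, 0, 1, 2, 3]
straight = [2 * v + 1 for v in x]    # a linear relationship
u_shaped = [v * v for v in x]        # a curvilinear relationship

print(pearson_r(x, straight))  # 1.0: perfectly captured by correlation
print(pearson_r(x, u_shaped))  # 0.0: a strong relationship, invisible to correlation
```

This is the staff-cheerfulness trap in miniature: plotting the data, not just computing r, is what reveals the curve.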
REGRESSION TO THE MEAN
the 30-second data
Can stats explain the strange phenomenon where top rookie athletes fall from glory and go on to a disappointing second season? The usual explanation is that stars hit this slump because they choke under pressure and attention from a stellar debut. But data whizzes know better – it is just a statistical affair called regression to the mean. And it's not unique to sports; you can find examples everywhere. Why do the most intelligent women tend to marry men less intelligent than themselves? Why was a company's surprisingly profitable quarter immediately followed by a downturn? Why do hospital emergency departments get slammed the moment someone remarks, 'Wow, it's quiet today'? It is probably not a cause-and-effect story (or superstitious jinx). Regression to the mean says that extreme events don't stay extreme forever; they tend back towards the average, just on their own. It is not that any true effect in the data disappears – to the contrary, native athletic talent persists, good fiscal management carries on – but the extreme luck that pushed an individual into the top tiers today is likely to fade out tomorrow. Data scientists know to be on guard for this effect, lest they be fooled into spotlighting trends that aren't real.

3-SECOND SAMPLE
'What goes up must come down' – it may seem obvious, but in stats this is easy to miss, and it can lead to some puzzling trends.

3-MINUTE ANALYSIS
Regression to the mean is especially important when analysing data that has been chosen based on a measurement that has exceeded some threshold – for example, patients whose last blood pressure measurement was considered dangerous, or patients with a sudden worsening of depression symptoms. In fact, about a quarter of patients with acute depression get better no matter what – with drugs, therapy, placebo or nothing at all – leading some researchers to question the usefulness of standard depression treatments.

RELATED TOPICS
See also
REGRESSION
page 24
CORRELATION
page 42

3-SECOND BIOGRAPHIES
FRANCIS GALTON
1822–1911
First coined the concept of regression to the mean in his study of genetics and height.

DANIEL KAHNEMAN
1934–
Nobel Laureate who suggested regression to the mean might explain why punishment seems to improve performance.

30-SECOND TEXT
Regina Nuzzo

Stats can help explain dramatic swings of fortune in sports, as well as in life.

44 g Uncertainty
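The rookie slump can be simulated in a few lines. In this illustrative sketch (invented numbers, not real sports data), each season's performance is stable talent plus one-off luck; the players who topped season one were both talented and lucky, so on average they fall back in season two, though not all the way to the mean, because the talent persists.

```python
import random

random.seed(42)
n = 10_000
talent = [random.gauss(50, 10) for _ in range(n)]       # stable ability
season1 = [t + random.gauss(0, 10) for t in talent]     # ability + this year's luck
season2 = [t + random.gauss(0, 10) for t in talent]     # same ability, fresh luck

# the 100 'rookies of the year' by season-one performance
top = sorted(range(n), key=lambda i: season1[i], reverse=True)[:100]
avg1 = sum(season1[i] for i in top) / len(top)
avg2 = sum(season2[i] for i in top) / len(top)
print(round(avg1, 1), '->', round(avg2, 1))
```

Nothing happened to the players between seasons; selecting on an extreme measurement is what guarantees the apparent decline.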
CONFIDENCE INTERVALS
the 30-second data
When you're lucky enough to get data on an entire population – all customer purchases from a web vendor last year, say – then getting the true average is easy: just crunch the numbers. But when all you get is a sample of the population – like satisfaction ratings from only 1,000 customers out of 1 million – knowing the true average value is much trickier. You can calculate the average satisfaction rating of your sample, but that's just a summary of these particular 1,000 customers. If you had taken another random 1,000 customers, you would get a different average. So how can we ever talk about the average satisfaction of all million people? That is where confidence intervals come to the rescue – one of the tools statisticians use in pursuit of their ultimate goal of drawing conclusions about the world based on limited information. Statisticians have worked out ingenious maths that takes information from one sample and uses it to come up with a whole range of plausible values for the average of the entire population. So instead of just saying the average satisfaction rating in one sample was 86 per cent, you can say, with some confidence, that the average satisfaction in the entire customer population is between 84 and 88 per cent – which is much more valuable information.

3-SECOND SAMPLE
Confidence intervals are almost magical in their ability to take a piece of limited information and extrapolate it to the entire population.

3-MINUTE ANALYSIS
Beware journalists reporting numbers without confidence intervals. For example, a 2017 Sunday Times article highlighted a reported drop of 56,000 employed people in the UK, saying 'it may signal the start of a significantly weaker trend'. Digging deeper into Office for National Statistics reports, however, reveals a confidence interval for the true change in number employed running from a 202,000 decline to a 90,000 increase. So employment may not have dropped at all – it might have actually improved!

RELATED TOPICS
See also
SAMPLING
page 40
STATISTICAL SIGNIFICANCE
page 54

3-SECOND BIOGRAPHY
JERZY NEYMAN
1894–1981
Polish mathematician and statistician who introduced confidence intervals in a paper published in 1937.

30-SECOND TEXT
Regina Nuzzo

Making conclusions about the big picture with confidence is where the field of statistics shines.

46 g Uncertainty
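For a sample mean, the 'ingenious maths' in its simplest large-sample form is just mean ± 1.96 standard errors. A sketch with made-up satisfaction data, assuming the usual normal approximation:

```python
import math
import random

random.seed(7)
# satisfaction ratings (per cent) for a random sample of 1,000 customers
sample = [min(100, max(0, random.gauss(86, 10))) for _ in range(1000)]

n = len(sample)
mean = sum(sample) / n
sd = math.sqrt(sum((s - mean) ** 2 for s in sample) / (n - 1))
se = sd / math.sqrt(n)                            # standard error of the mean
low, high = mean - 1.96 * se, mean + 1.96 * se    # 95% confidence interval
print(f'{mean:.1f} per cent, 95% CI ({low:.1f}, {high:.1f})')
```

A thousand ratings out of a million customers still pins the population average down to a band a little over one percentage point wide, which is the whole appeal.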
SAMPLING BIAS
the 30-second data
Data points are like gold nuggets, so data scientists eagerly scoop up whatever they can find. Smart analysts do something even more valuable: they stop, look around and ask what happened to all the nuggets that aren't lying around in plain sight. Are those left-out data different in any systematic way from the data that were easy to collect? Take, for example, a report's estimate that 10 per cent of men suffer from impotence – results that were based on a survey of patients at an andrology health clinic. This selection bias happens when the participants chosen differ in important ways from the ones not chosen (such as, here, their sexual health). Related to this is self-selection bias, where, for example, service satisfaction ratings can easily skew negatively if only the most irate customers take time to respond. Likewise, there is non-response bias; medical studies, for example, can be misleading if researchers ignore the fact that those participants most likely to drop out are also the ones who are the sickest. Sometimes it is possible to statistically correct for a bias problem, but recognizing the problem in the first place is often the hardest part.

3-SECOND SAMPLE
It's almost a paradox in data science: what's not in a data set can be even more important than what's in it.

3-MINUTE ANALYSIS
In the Second World War the American military gathered data on bullet holes from planes returned from European battles. Where were the highest bullet densities, they asked, so extra armour could be added to spots where planes are shot at the most? Statistician Abraham Wald turned the question on its head. These data only show where planes that managed to make it back home had been hit, he pointed out. Planes were getting shot at in other places, but these planes, hit in other spots, didn't survive. So the armour belonged, he said, where the bullet holes weren't.

RELATED TOPICS
See also
DATA COLLECTION
page 16
SAMPLING
page 40
OVERFITTING
page 56

3-SECOND BIOGRAPHIES
ABRAHAM WALD
1902–50
Hungarian mathematician whose work on Second World War aircraft damage illustrates the concept of survivorship bias.

CORINNA CORTES
1961–
Danish computer scientist and Head of Google Research, works on sample bias correction theory.

30-SECOND TEXT
Regina Nuzzo

A skilled data scientist will seek out gaps in the data collection process and analyse their potential impact.

48 g Uncertainty
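The irate-customer effect described above is easy to simulate. In this hypothetical sketch, unhappy customers are far more likely to leave a review, so the average of the reviews lands well below the true average satisfaction of all customers.

```python
import random

random.seed(1)
# true satisfaction of every customer, on a 0-10 scale
population = [random.gauss(7, 2) for _ in range(100_000)]
true_mean = sum(population) / len(population)

def leaves_review(score):
    """Hypothetical response behaviour: unhappy customers review far more often."""
    chance = 0.6 if score < 5 else 0.05
    return random.random() < chance

reviews = [s for s in population if leaves_review(s)]
review_mean = sum(reviews) / len(reviews)
print(round(true_mean, 2), 'vs', round(review_mean, 2))
```

The reviews are all genuine, yet their average badly misrepresents the population: it is the customers who stayed silent, not the data collected, that carry the missing story.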
BIAS IN ALGORITHMS
the 30-second data
Algorithms learn how to make decisions by processing examples of humans performing the same task. An algorithm for sentencing criminals might be trained on thousands of historic decisions made by judges, together with information about the offenders and their crimes. If this training data is taken from judges who give harsher sentences to people of colour, the model will learn to replicate those prejudices. In 2018, the Massachusetts Institute of Technology's (MIT) Media Lab showed that face recognition systems developed by Microsoft, IBM and China's Face++ were all significantly worse at detecting female faces, and performed poorly on images of darker-skinned women. With police forces in the UK and US testing automated facial recognition systems for crime prevention, low accuracies and false alarms could have far-reaching consequences for civil liberties. In 2018 Amazon scrapped an automated CV screening tool due to gender bias. The system was trained on data from previous successful candidates, who were mostly male, due to existing imbalances in the technology industry. This produced a tool that penalized applications containing phrases more likely to appear in women's résumés, such as 'women's football team'. The algorithm learned to equate men's CVs with success, and women's with failure.

3-SECOND SAMPLE
Can a computer be racist, sexist or homophobic? Human biases are often built into automated systems, with serious consequences for the most vulnerable groups in society.

3-MINUTE ANALYSIS
As many machine learning models are developed by private companies, their training data and source code are not open to scrutiny. This poses challenges for journalists investigating algorithmic bias. In 2016, an investigation by the news outlet ProPublica used Freedom of Information requests to reverse-engineer the COMPAS algorithm, used in the US to predict the likelihood of criminals reoffending. They uncovered racial discrimination, raising questions on regulation and transparency in AI.

RELATED TOPICS
See also
SAMPLING BIAS
page 48
ARTIFICIAL INTELLIGENCE (AI)
page 148
REGULATION
page 150

3-SECOND BIOGRAPHY
JOY BUOLAMWINI
fl. 2011–
Computer scientist and digital activist, based at the MIT Media Lab, and founder of the Algorithmic Justice League.

30-SECOND TEXT
Maryam Ahmed

The potential for bias might sound far-fetched, but algorithm bias poses a very real problem requiring creative solutions.

50 g Uncertainty
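Audits like the MIT Media Lab study boil down to a simple check: measure a model's performance separately for each demographic group rather than in aggregate. A minimal sketch with invented numbers (not the actual study data):

```python
from collections import defaultdict

def accuracy_by_group(records):
    """records: (group, prediction, truth) triples; returns per-group accuracy."""
    totals, correct = defaultdict(int), defaultdict(int)
    for group, predicted, actual in records:
        totals[group] += 1
        correct[group] += (predicted == actual)
    return {g: correct[g] / totals[g] for g in totals}

# hypothetical face-detection audit: every image contains a face (truth = 1),
# but the detector misses far more faces in group B
records = ([('group A', 1, 1)] * 95 + [('group A', 0, 1)] * 5 +
           [('group B', 1, 1)] * 65 + [('group B', 0, 1)] * 35)
print(accuracy_by_group(records))  # {'group A': 0.95, 'group B': 0.65}
```

A single overall accuracy of 80 per cent would hide this gap entirely, which is why disaggregated reporting is the first step in any bias audit.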
18 October 1919
Born in Kent, England

1953
Receives his PhD at University College London

1959
Marries Joan Fisher; he later gives her statistical advice as she writes her 1978 biography of her father, Ronald A. Fisher

1960
Moves to Madison, Wisconsin, to start a new Department of Statistics

1970
Publishes Time Series Analysis (with Gwilym Jenkins). In subsequent years he also develops forecasting methods, based upon difference equation methods, with other authors

1973
Publishes Bayesian Inference in Statistical Analysis (with George C. Tiao)

1978–9
Serves as President of the American Statistical Association, and of the Institute of Mathematical Statistics

1985
Elected Fellow of the Royal Society of London

28 March 2013
Dies in Madison, Wisconsin, USA
GEORGE BOX

George Box was born in England in 1919. He had studied chemistry before being called up for service during the Second World War, and he gained his first introduction to statistics when, while engaged in war work, he encountered problems with the interpretation of experimental data. Someone suggested he visit British statistician and geneticist Ronald A. Fisher, at the time working from home because his laboratory at Cambridge had been closed for the duration of the war. The visit opened Box's eyes to the world of 'data science' (a then unknown term), and after the war he went to University College London for graduate study. There, as later in life, he plotted his own course, concentrating on understanding the role of statistics in scientific and engineering investigation.

Box's early work was as a statistician at Imperial Chemical Industries, where he was involved with the design of experiments. In one early paper he introduced the word and concept of 'robustness' to statistics: the idea that the validity of some ('robust') statistical procedures could withstand even large departures from conditions thought to be key to their use. After a few years that included time in Princeton (where he met and married Joan Fisher, one of Ronald's daughters), Box moved in 1960 to start a new department of statistics at the University of Wisconsin in Madison, where he spent the rest of his life and did his most influential work.

Box was a great catalyst in collaborative scientific investigations. He ran a famous evening 'Beer seminar' weekly, where a scientist would briefly present a problem and the assembled group would produce innovative solutions, some with great lasting effect. With various co-authors he created new methods of time series analysis for univariate and multivariate time-dependent data, new ideas for the use of Bayesian methods and new approaches to experimental design, including 'evolutionary operation', an approach that permitted bringing experiments to the manufacturing floor and involving line workers in continuously improving processes without interrupting production. He was a great advocate for keeping the scientific question always at the forefront, and for the importance of good experimental design. He employed mathematical models, but famously is quoted as cautioning that 'all models are wrong, but some are useful'. He died in Madison in 2013.

Stephen Stigler

George Box g 53
STATISTICAL SIGNIFICANCE
the 30-second data
It is worth getting to know the p-value, because this tiny number boasts outsized importance when it comes to drawing conclusions from data. The tininess is literal: a p-value is a decimal number between 0 and 1. It is calculated when you have a question about the world but only limited data to answer it. Usually that question is something like, 'Is there something real happening here in the world, or are these results just a random fluke?' If you toss a coin 100 times and it comes up heads every time, you might suspect that the coin is double-headed, but there is still the possibility (however negligible) that the coin is fair. The p-value helps support your scepticism that this event didn't happen by accident. By tradition, results with a p-value smaller than 0.05 get labelled 'statistically significant' (in the case of the coin, getting all heads from five flips). It is this label that people often use for reassurance when making decisions. But there is nothing magical about the 0.05 threshold, and some experts are encouraging researchers to abandon statistical significance altogether and evaluate each p-value on its own sliding scale.

3-SECOND SAMPLE
Are those interesting patterns in a data set just a random fluke? A century-old stats tool can help answer that.

3-MINUTE ANALYSIS
P-values are easy to hack. In 2015, media around the world excitedly reported on a study showing that chocolate leads to weight loss. Then the author revealed the truth: he was a journalist, the data was random and his results just a false-positive fluke. He knew that in 5 per cent of studies p-values will be smaller than 0.05 just by chance. So he ran 18 separate analyses of random data – and then reported only the deliciously statistically significant one.

RELATED TOPICS
See also
STATISTICS & MODELLING
page 30
SAMPLING
page 40

3-SECOND BIOGRAPHIES
KARL PEARSON
1857–1936
British statistician who first formally introduced the p-value.

SIR RONALD FISHER
1890–1962
British statistician who popularized the p-value in his 1925 book for researchers.

30-SECOND TEXT
Regina Nuzzo

P-values help statisticians work out whether results are a random fluke – or not: the gold standard of statistical evidence has some major flaws.

54 g Uncertainty
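The coin example maps directly onto a binomial calculation. This sketch (standard maths, not from the book) computes the chance of seeing that many heads, or more, if the coin really is fair: for five flips of all heads the chance is 0.03125, just under the traditional 0.05 bar.

```python
from math import comb

def binom_p_value(k, n, p=0.5):
    """Probability of k or more successes in n trials with success probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(binom_p_value(5, 5))      # 0.03125: 'statistically significant' by tradition
print(binom_p_value(100, 100))  # astronomically small for 100 straight heads
```

Note what the number is and is not: it is the probability of the data assuming a fair coin, not the probability that the coin is fair, a distinction at the heart of many p-value misreadings.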
OVERFITTING
the 30-second data
Building a predictive model involves finding a function that describes the relationship between some input and an output. For example, a data scientist may want to predict a university student's final grade based on their attendance rate in lectures. They would do this by fitting a function to a 'training' set of thousands of data points, where each point represents a single student's attendance and grade. A good model will capture the underlying relationship between grades and attendance, and not the 'noise', or natural variation, in the data. In this simple example, a reliable model may be a linear relationship. When a new student's attendance is added, the model will use it to predict their final grade because it generalizes to the student population as a whole. An overfitted model will involve more parameters than necessary; instead of fitting a straight line to the university data set, an overenthusiastic data scientist might use a very complex model to perfectly fit a contorted, meandering curve to the training data. This will not generalize well, and will perform poorly when presented with data for a new student. Understanding that a complex model is not always better is a crucial part of responsible and thoughtful data science practice.

3-SECOND SAMPLE
Beware of complex models that fit the data perfectly. It is likely they are overfitted, and will predict poorly when presented with new data points.

3-MINUTE ANALYSIS
There are ways to avoid overfitting. Cross-validation gives an estimate of how well a model will work in practice, by training the model on a subset of the training data and testing its performance on the remaining subset. Regularization is a technique that penalizes a model for being too complex; in the university example, a line would be preferred over a curve.

RELATED TOPICS
See also
REGRESSION
page 24
STATISTICS & MODELLING
page 30
MACHINE LEARNING
page 32

30-SECOND TEXT
Maryam Ahmed

If a model's performance seems too good to be true, then it probably is!

56 g Uncertainty
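The straight-line-versus-contorted-curve contrast can be shown with a deliberately tiny, hypothetical data set. The interpolating polynomial hits every training point exactly (zero training error) yet predicts a new student worse than the humble straight line:

```python
# three students' (attendance %, final grade), generated from the
# hypothetical rule grade = 0.8 * attendance + 2, plus a little noise
train = [(0, 2.0), (50, 38.0), (100, 86.0)]

def fit_line(points):
    """Ordinary least-squares straight line through the points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    slope = (sum((x - mx) * (y - my) for x, y in points)
             / sum((x - mx) ** 2 for x, _ in points))
    return slope, my - slope * mx

def interpolate(points, x):
    """Lagrange polynomial: passes through every training point exactly."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

slope, intercept = fit_line(train)
x_new, y_true = 25, 22.0           # an unseen student, graded by the same underlying rule
line_err = abs(slope * x_new + intercept - y_true)
curve_err = abs(interpolate(train, x_new) - y_true)
print(line_err, '<', curve_err)    # the simpler model predicts the new student better
```

The curve's perfect training score comes from memorizing the noise; held-out error, exactly what cross-validation measures, is what exposes it.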
SCIENCE
GLOSSARY

anthropogenic Event or phenomenon that is caused by humans, for example climate change.

blind analysis Conducted where researchers cannot see the correct measurements or answers; aims to minimize bias.

causal relationship If a change in one variable directly causes a change in another variable, a causal relationship exists between them.

correlation Two variables are correlated if a change in one is associated with a change in the other. A positive correlation exists if one variable increases as the other increases, or if one variable decreases as the other decreases. A negative correlation exists if one variable increases as the other decreases.

data set A set of information stored in a structured and standardized format; might contain numbers, text, images or videos.

debugging Finding and correcting errors in computer code.

diagnostics Identification of problems, typically human diseases or health conditions; also refers to the identification of computer bugs.

DNA The genetic code that governs the development, characteristics and functioning of every living organism. DNA is usually found in the nucleus of a cell, and consists of two long chains of building blocks called 'nucleotides', arranged in a double helix shape. In most humans, an individual's genome, or genetic code, is unique. Recent advances in genetic engineering have enabled the insertion, deletion and modification of genetic material in DNA.

epidemiological evidence Correlation between exposure to a risk factor, such as smoking, and incidence of a disease, such as lung cancer.

60 g Science
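The positive/negative distinction in the correlation entry above can be checked in a few lines (the numbers are invented for illustration):

```python
import numpy as np

hours_studied = [1, 2, 3, 4, 5, 6]
exam_score = [52, 55, 61, 64, 70, 75]   # rises with study time
hours_gaming = [9, 8, 6, 5, 3, 2]       # falls as study time rises

# Pearson correlation coefficient: +1 is a perfect positive correlation,
# -1 a perfect negative one, 0 no linear association at all.
r_positive = np.corrcoef(hours_studied, exam_score)[0, 1]
r_negative = np.corrcoef(hours_studied, hours_gaming)[0, 1]
print(round(r_positive, 2), round(r_negative, 2))
```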
experimental design The process of designing robust studies and experiments, to ensure that any conclusions drawn from the results are reliable and statistically significant. This includes careful selection of experimental subjects to avoid sampling bias, deciding on a sample size, and choosing suitable methods for analysing results.

gene editing Process of editing the genome of a living organism by inserting, removing or modifying its DNA.

genome Genetic material, or chromosomes, present in a particular organism. The human genome consists of 23 pairs of chromosomes.

greenhouse gas A gas in the atmosphere which absorbs and radiates energy, contributing to the warming of Earth's surface. This causes the so-called 'greenhouse effect', which is necessary for supporting life on Earth. Human activity has led to an increase in greenhouse gases in the atmosphere, which have amplified the greenhouse effect and contributed to global warming. Greenhouse gases include water vapour, carbon dioxide and methane.

independent replication Validation of a study or experiment by independent researchers. This is done by repeating the procedure followed by the original researchers, to ensure the results can be replicated.

randomized trials Experimental design where participants or subjects are randomly allocated to treatment groups. For example, participants in a randomized drug trial could be randomly allocated to a group where they would either receive a placebo or a drug.

trendlines A way of visualizing the overall direction, or trend, of a variable over time. There are different methods for calculating trendlines, including a moving average, or a line of best fit calculated through linear regression.

Glossary g 61
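Both methods named in the trendlines entry above — a moving average and a line of best fit from linear regression — are short to compute (the series is invented for illustration):

```python
import numpy as np

# A noisy upward series, e.g. a quantity measured over 10 periods.
t = np.arange(10)
values = np.array([0.1, 0.3, 0.2, 0.5, 0.4, 0.6, 0.8, 0.7, 1.0, 0.9])

# Moving average: each point becomes the mean of a 3-period window.
window = np.ones(3) / 3
moving_avg = np.convolve(values, window, mode="valid")

# Line of best fit via linear regression (a degree-1 polynomial).
slope, intercept = np.polyfit(t, values, 1)

print(moving_avg.round(2))
print(round(slope, 3))  # a positive slope: the overall trend is upwards
```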
CERN & THE HIGGS BOSON

the 30-second data

In 1964, Peter Higgs, Francois Englert, Gerald Guralnik, C.R. Hagen and Tom Kibble proposed the Higgs Mechanism to explain how mass was created in the universe. But evidence of the mechanism lay in the (elusive) discovery of an essential particle, dubbed the 'Higgs boson', from which other fundamental particles derived their mass. By blasting particles into each other at incredibly high energies and then gathering data on the number of emergent particles as a function of particle energy, scientists hoped to identify spikes (in collisions at particular energy levels), which in turn would point to the creation of a particle, such as the Higgs boson, with that energy. Enter CERN, the world-famous European laboratory. Here, scientists built the Large Hadron Collider (LHC). Even in its infancy (2008), the LHC's capability was stunning: it was designed to accelerate protons to roughly 7,500 times their energy at rest. By 2011, CERN had collected enough data – over 500 trillion collision events – for analysis. Not long after, several independent groups caught an energy spike in the very region where the Higgs was predicted to lie. This discovery was soon acknowledged by the scientific community, and both Higgs and Englert won acclaim as joint recipients of the 2013 Nobel Prize for Physics.

3-SECOND SAMPLE
CERN, a laboratory in Switzerland, is synergy of multinational proportions: here, top scientists convene to inspect and decode the constituents of matter via particle colliders, i.e. how the universe works.

3-MINUTE ANALYSIS
The LHC at CERN is connected to four separate detectors into which highly accelerated particles can be slammed. For the Higgs boson experiments, two detectors, ATLAS and CMS, were used. The fact that the same results were observed on both detectors lent significant credibility to the Higgs discovery, once again emphasizing the importance of independent replication in data analysis.

RELATED TOPIC
See also
MACHINE LEARNING
page 32

3-SECOND BIOGRAPHIES
PETER HIGGS
1929–
First proposed the Higgs Mechanism.

FRANCOIS ENGLERT
1932–
Also proposed the Higgs Mechanism, independently of Higgs.

30-SECOND TEXT
Aditya Ranganathan

The reach of data science knows no bounds, being applied to explain the very workings of the universe.

62 g Science
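The "spike hunting" described above can be mimicked with a toy histogram. The counts below are invented, not real LHC data; the flagging rule is in the spirit of the five-sigma threshold particle physicists use:

```python
import math

# Toy histogram: collision events counted in each energy bin, alongside
# the count expected from known background processes alone.
observed = [102, 98, 105, 97, 210, 101, 99]   # bin 4 hides a "particle"
expected_background = [100, 100, 100, 100, 100, 100, 100]

def significance(obs, exp):
    # A counting experiment fluctuates by roughly sqrt(N), so the excess
    # in units of sigma is (observed - expected) / sqrt(expected).
    return (obs - exp) / math.sqrt(exp)

spikes = [
    i for i, (obs, exp) in enumerate(zip(observed, expected_background))
    if significance(obs, exp) > 5  # the physicists' five-sigma rule
]
print(spikes)
```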
ASTROPHYSICS

the 30-second data

Astrophysics has become a big user and supplier of data science expertise. Most cosmology experiments involve scanning large amounts of data to make measurements that can only be statistically derived. The data is also searched for rare events. These statistical insights, in turn, elucidate the past – and future – of our universe. One example of a rare cosmological event is the production of a supernova – a star that explodes during its demise. Supernovae were used in the discovery of the accelerating expansion of the universe, for which Saul Perlmutter, Brian Schmidt and Adam Riess won the 2011 Nobel Prize. The discovery hinged on automatically searching the sky for supernovae and collecting enough measurements of supernova brightness and redshift (a measure of how much photons have been stretched) in order to make statistically acceptable conclusions about trendlines. Supernovae have homogeneous brightness, and it is this brightness that indicates how far a supernova is from a telescope, and how long light takes to reach us from that supernova; if light from older supernovae stretched less than light from newer supernovae, the universe must be stretching more now than before, implying that, over time, the universe will continue to stretch ever more rapidly.

3-SECOND SAMPLE
Photons from stars billions of light years away strike Earth, furnishing telescopes with eons-old galactic images – masses of data awaiting analysis.

3-MINUTE ANALYSIS
A major problem in data analysis is the tendency to interpret results as confirmations of pre-existing beliefs, which leads to furious debugging when outcomes clash with expectations and to slackening of error-detection when the two correspond. To decontaminate the debugging, physicists developed blind analysis, wherein all analysis happens before the final experimental results are revealed to the researcher. Blind analysis has gained popularity in areas of physics and may be making a foray into other fields such as psychology.

RELATED TOPIC
See also
CERN & THE HIGGS BOSON
page 62

3-SECOND BIOGRAPHIES
EDWIN HUBBLE
1889–1953
American astronomer who discovered the original expansion of the universe.

SAUL PERLMUTTER
1959–
American astrophysicist and Director of the Berkeley Institute for Data Science who shared the 2011 Nobel Prize in Physics for the discovery of the accelerating expansion of the universe.

30-SECOND TEXT
Aditya Ranganathan

Data-driven measurements and experiments highlight the importance of data science to cosmology, and vice versa.

64 g Science
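The step from "homogeneous brightness" to distance rests on the inverse-square law: a standard candle four times fainter is twice as far away. A sketch with illustrative fluxes, not survey data:

```python
import math

# Two supernovae with the same intrinsic brightness (standard candles).
# Measured flux falls with distance d as flux = L / (4 * pi * d**2),
# so the relative distance follows from the flux ratio alone:
#   d_far / d_near = sqrt(flux_near / flux_far)
flux_near = 4.0e-12   # watts per square metre (illustrative)
flux_far = 1.0e-12

distance_ratio = math.sqrt(flux_near / flux_far)
print(distance_ratio)  # the fainter supernova is this many times further away
```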
CRISPR & DATA

the 30-second data

Scientists are harnessing the power of a gene-editing tool called CRISPR that has revolutionized labs around the world. The precision engineering tool allows scientists to chop and change DNA in a cell's genetic code and could one day correct mutations behind devastating diseases such as Huntington's, cystic fibrosis and some cancers. CRISPR works like a pair of molecular scissors and cuts DNA at target genes to allow scientists to make changes to the genome. This technique has been used by scientists in the lab to make embryos resistant to HIV and remove genes that cause sickle-cell disease. But these molecular scissors are not perfect. One misplaced cut could cause irreparable damage that is passed on through generations. To make CRISPR more accurate, scientists are leveraging huge data sets generated from mapping the human genome. Researchers have used CRISPR to edit tens of thousands of different pieces of DNA and analysed the resulting sequences. From the data, scientists are developing machine learning algorithms that predict the exact mutations CRISPR can introduce to a cell, helping scientists to reduce any miscuts to the code of life.

3-SECOND SAMPLE
Editing the human genome conjures images of science fiction, but it could be closer to reality thanks to the data science that is helping researchers to correct nature's mistakes.

3-MINUTE ANALYSIS
There is trepidation about where CRISPR technology will take science, and gene-editing of human embryos has raised ethical concerns – specifically around the possibility of introducing heritable alterations to the human genome. Some genome damage could go unnoticed and lead to unforeseen health issues, such as premature death or other genetic diseases. It is no surprise that CRISPR has sparked international debate on how it should be regulated.

RELATED TOPICS
See also
THE MILLION GENOME PROJECT
page 68

CURING CANCER
page 74

ETHICS
page 152

3-SECOND BIOGRAPHIES
FRANCISCO J.M. MOJICA
1963–
One of the first researchers to characterize CRISPR and coin the acronym.

JENNIFER ANNE DOUDNA
1964–
Along with Emmanuelle Charpentier, proposed CRISPR as a gene-editing tool.

FRANK STEPHENS
1982–
Advocate for Down syndrome; Special Olympics athlete.

30-SECOND TEXT
Stephanie McClellan

Big data sets are helping to refine CRISPR's accuracy, which is essential work, given the ethical concerns.

66 g Science
THE MILLION GENOME PROJECT

the 30-second data

The Million Genome Project (MGP), or All of Us, is the US National Institutes of Health initiative to unlock the genomes of 1 million Americans. The human genome has over 20,000 genes and is the DNA hereditary information that parents pass on to their child. MGP builds off the Human Genome Project (1990–2003), which created the world's first human DNA reference library, now used in medicine. Each person has a unique genome. Genes play a part in how we look (eye and hair colour) and act, as well as determine if we are predisposed to cancer or have genetic diseases. However, lifestyle and environment also affect our health. MGP focuses on observing people's differences in health, lifestyle, environment and DNA. All of Us, uniquely, captures diversity – of people's backgrounds, of environments from all regions across the country, and of the broad spectrum of healthiness and sickness. Survey, electronic health record, physical measurement and biosample data will be collected to make one of the largest health databases for worldwide use. MGP will help develop more precise tools to identify, treat and prevent a person's disease by factoring in how their health is affected by age, race, ethnicity, diet and environment – a concept known as precision medicine.

3-SECOND SAMPLE
The Million Genome Project will unlock the genome of 1 million Americans so that the data can be used to 'personalize' medicine.

3-MINUTE ANALYSIS
The Million Genome Project is a part of the US Government's Precision Medicine Initiative. Precision medicine, by using a person's genetic information, better tailors treatments, especially for diseases with a genetic component, such as Parkinson's disease. This healthcare approach enters into a new era of medicine where disease risk and treatment success can be more accurately predicted for each individual. Healthcare prevention can be better prioritized and more cost-effective in this dawn of precision medicine.

RELATED TOPIC
See also
PERSONALIZED MEDICINE
page 138

3-SECOND BIOGRAPHIES
FRANCIS CRICK & JAMES WATSON
1916–2004 & 1928–
Co-discovered the structure of DNA in 1953 and won the Nobel Prize (1962).

FRANCIS COLLINS
1950–
Led the Human Genome Project (1990–2003) and has discovered genes associated with many different diseases.

30-SECOND TEXT
Rupa R. Patel

Advances in technology and data science have made the gathering and analysis of such large data sets possible.

68 g Science
13 January 1900
Born in Dayton, Iowa, USA

1918
Graduates from high school

1929
Receives a Bachelor of Science degree from Iowa State College

1931
Receives the first Master of Science degree in Statistics from Iowa State College

1939
Appointed Assistant Professor of Statistics at Iowa State College

1940
Appointed Professor of Statistics at North Carolina State University (Raleigh)

1947
Founds Biometric Society

1949
Becomes first woman elected to the International Statistical Institute

1956
Elected as the President of the American Statistical Association

1959
Receives the O. Max Gardner Award from the University of North Carolina

1975
Invited to join the National Academy of Sciences

17 October 1978
Dies in Durham, North Carolina, USA
GERTRUDE COX

Methodism, crafts, psychology, maths – this is a rare blend in today's zeitgeist but one that Gertrude Cox pursued with zest before and during her college years at Iowa State. It is uncertain how this combination of interests correlated with Cox's pursuit of statistics thereafter, but her dissertation title hints at its influence. After submitting her thesis, 'A Statistical Investigation of a Teacher's Ability as Indicated by the Success of His Students in Subsequent Courses', Cox obtained the first ever Master's degree awarded by Iowa State. She then made her way to Berkeley to pursue research in Psychological Statistics, an endeavour that was abbreviated at the insistence of her former mentor, George Snedecor. Cox's former calculus professor as well as employer, Snedecor sought her help in organizing his statistical laboratory, and so Cox traced her way back from Berkeley.

As a teacher, Cox had a knack for connecting real-world research to course design. In 1939, she was appointed Assistant Professor of Statistics at Iowa State College and in 1940 began as the head of the new department of Experimental Statistics in the School of Agriculture and later ended up as the Director of the Institute of Statistics at North Carolina State. While there, Cox brought about several innovations such as bringing in applied statisticians to teach basic statistics ideas, including a database of experimentation results from different fields, and holding week-long conferences on specific topics. Consequently, she was the first woman elected to the International Statistical Institute in 1949. As an administrator, she was hugely successful in securing grants to expand the scope of offerings in her department. For example, in 1944 she received a grant from the General Education Board to fund the establishment of an Institute of Statistics. But Cox's repertoire of roles did not end here. As an entrepreneur, she pursued consulting assignments locally and abroad; she also helped run a florist shop. As a scholar and writer, she co-authored books on experimental design; she also founded the Biometric Society, serving as its first editor and later as its president.

Cox is remembered for her contributions to the fields of psychological statistics and experimental design. In her biographical memoir Richard Anderson, her friend and colleague, says, 'Both as a teacher and a consultant, Gertrude particularly emphasized randomization, replication and experimental controls as procedures essential to experimental design' – tools that remain as vibrant as her batik pieces.

Aditya Ranganathan

Gertrude Cox g 71
CLIMATE CHANGE

the 30-second data

Climate trend predictions ensue after compiling and processing volumes of data: average global temperatures over the years, for example. Average global temperature is a nuanced function of variables. Above-average build-ups of greenhouse gases in the atmosphere trap above-average amounts of heat, creating a barrier to prompt disposal. Other factors that slow down rates of heat emission include rising ocean levels, asphalt levels and decreasing ice. The result of this retardation is an upset of equilibrium – the desired state in which the rate of heat absorption equals the rate of heat emission, and average global temperature stays constant. Even though the disequilibrium is temporary, it is a period when heat lingers. And, when equilibrium returns, rather than catching up to the earlier temperature, we find ourselves in the midst of a new normal. There is a range of new 'normals' we could potentially reach: some mildly uncomfortable, some deadly. In order to understand which of these scenarios we might be heading towards, we must gather data vast enough to average out small fluctuations that might lead to incorrect predictions. The data that researchers are amassing includes global temperatures, sea-ice levels and so on – the conglomerate definitively indicating dangerous levels of greenhouse gas production.

3-SECOND SAMPLE
A firm prediction of the future of our planet rests upon the collection and analysis of massive amounts of data on global temperatures and greenhouse gas concentrations.

3-MINUTE ANALYSIS
Anthropogenic contributions, including expanding agricultural and industrial practices, correlate with an increase in global greenhouse gas concentrations and rising global temperatures, also known as global warming or climate change. The more data that is collected on anthropogenic contributions, the more conclusive the claim that it is human factors driving Earth's temperature change.

RELATED TOPIC
See also
CORRELATION
page 42

3-SECOND BIOGRAPHIES
JAMES HANSEN
1941–
NASA scientist and climate change advocate.

RICHARD A. MULLER
1944–
Climate sceptic converted to climate change advocate.

AL GORE
1948–
Published what was at the time a controversial documentary on the impacts of climate change called An Inconvenient Truth.

30-SECOND TEXT
Aditya Ranganathan

So far, the data collected has led 98 per cent of scientists to conclude that anthropogenic factors are to blame for climate change.

72 g Science
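The equilibrium described above — heat absorbed equalling heat emitted — is captured by the classic zero-dimensional energy-balance model. The solar and albedo values are standard textbook figures; the "effective emissivity" is a tuning parameter standing in for the greenhouse effect, chosen here for illustration:

```python
# Zero-dimensional energy-balance model of Earth.
SOLAR_CONSTANT = 1361.0    # incoming sunlight, W/m^2
ALBEDO = 0.3               # fraction of sunlight reflected straight back
STEFAN_BOLTZMANN = 5.67e-8
EMISSIVITY = 0.61          # effective emissivity; < 1 because of greenhouse gases

# Equilibrium: absorbed = emitted
#   S * (1 - albedo) / 4 = emissivity * sigma * T**4
absorbed = SOLAR_CONSTANT * (1 - ALBEDO) / 4
t_equilibrium = (absorbed / (EMISSIVITY * STEFAN_BOLTZMANN)) ** 0.25
print(round(t_equilibrium, 1))  # kelvin; close to Earth's observed ~288 K

# A stronger greenhouse effect (lower effective emissivity) means a warmer
# "new normal" once equilibrium returns.
t_warmer = (absorbed / (0.59 * STEFAN_BOLTZMANN)) ** 0.25
print(t_warmer > t_equilibrium)
```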
CURING CANCER

the 30-second data

While discoveries in basic science help explain the mechanisms of cancer, it is how these discoveries lead to targeted therapies and studies on patient outcomes that provides a deeper understanding of successful therapies and gets us closer to a cure. Data science allows us to test the value of intervention. Specifically, statistical thinking played a fundamental role in randomized trials, used for the first time by the US National Cancer Institute in 1954 to test treatments of patients with acute leukaemia. As long ago as 40 years, cancer research depended on many of the tasks that today define data science: study design, data analysis and database management. Today, molecular biology technologies produce thousands of measurements per patient, which permit the detection of mutations, structural chromosomal changes, aberrant gene expression, epigenetic changes and immune response in cancer cells. A primary aim is finding ways of using this information to improve diagnosis and develop tailored treatments. These new technologies produce large and complex data sets that require sophisticated statistical knowledge as well as computing skills to effectively work with the data and avoid being fooled by patterns arising by chance.

3-SECOND SAMPLE
Advances in data science will be instrumental to curing cancer: they help us to understand if and why cancer interventions are working.

3-MINUTE ANALYSIS
Many challenges need to be overcome to cure cancer, and data will play a role in all of these. For example, it can take 10–15 years for a new drug to go through clinical trials and cost in excess of £1 billion. Using data science to optimize these processes for both money and time, while keeping them safe, is not usually how data and curing cancer are synthesized, but has become an important aspect of this work.

RELATED TOPIC
See also
HEALTH
page 92

3-SECOND BIOGRAPHY
MARVIN ZELEN
1927–2014
Founding chair of what today is the Department of Data Sciences in the Dana-Farber Cancer Institute, who developed many of the statistical methods and data management approaches used in modern clinical cancer trials.

30-SECOND TEXT
Rafael Irizarry

Data science has become crucial in cancer research and will play a key role in future advances.

74 g Science
EPIDEMIOLOGY

the 30-second data

Epidemiology is the science of collecting data and calculating how diseases are distributed, patterned and caused among people. The science blends multiple disciplines (e.g. statistics, social sciences, biology and engineering) to create these calculations. The calculations are used to prevent and control both contagious and non-contagious diseases within populations. Epidemiology impacts public health and generates the evidence for the preventive procedures (e.g. vaccines) and non-preventive procedures (e.g. diabetes screening) used today, as well as those that will be adopted tomorrow, such as microbiome-based diagnostics. Epidemiological evidence drives the health policies and guidelines that governments put in place, such as child vaccinations, to protect their citizens' health. The field is known for solving epidemics, or outbreaks of infectious diseases. Dr John Snow first defined epidemiologic concepts when he traced a contaminated water source to a cluster of cholera cases in London in 1854. Similarly, a group of deaths in Western Africa in 2013 led to an investigation to determine how and why the Ebola virus was spreading so quickly. The investigation informed health prevention programmes in the region to contain the virus's spread.

3-SECOND SAMPLE
There is an Ebola outbreak in Africa; what happens next? Epidemiology is used to collect data and study the who, what, where, when and why of the disease.

3-MINUTE ANALYSIS
Epidemiologic research is used to improve health by examining the causal relationship between risk factors (e.g. age, smoking) and disease (e.g. cancer, diabetes). Methods use observations or experiments, combined with statistics, to identify bias and false cause–effect associations. Milestones in health prevention occurred in the 1950s, when large epidemiologic studies provided conclusive evidence that tobacco smoking increased the risk of death from lung cancer and heart attacks.

RELATED TOPICS
See also
STATISTICS & MODELLING
page 30

CORRELATION
page 42

HEALTH
page 92

3-SECOND BIOGRAPHIES
HIPPOCRATES
c. 460–370 BCE
First person to use the term 'epidemic' and observe how disease spread.

JOHN SNOW
1813–58
Successfully traced the source of the cholera outbreak in London in 1854, which went on to change urban water and sewage systems and public health worldwide; considered the father of epidemiology.

30-SECOND TEXT
Rupa R. Patel

Epidemiology enables calculations that are essential to our individual and collective well-being.

76 g Science
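The kind of calculation the section describes — linking a risk factor to a disease — often starts from a simple two-by-two table. With invented counts for a hypothetical smoking cohort:

```python
# Hypothetical cohort: disease counts among exposed and unexposed people.
exposed_cases, exposed_total = 90, 1000       # smokers who fell ill / all smokers
unexposed_cases, unexposed_total = 30, 1000   # non-smokers who fell ill / all

risk_exposed = exposed_cases / exposed_total
risk_unexposed = unexposed_cases / unexposed_total

# A risk ratio above 1 suggests the exposure is associated with more disease;
# on its own it shows correlation, not a causal relationship.
risk_ratio = risk_exposed / risk_unexposed
print(risk_ratio)
```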
SOCIETY

GLOSSARY
aggregate statistics Statistics calculated across a set or group of data points, e.g. weekly sales by item.

anonymization Removing any information from a data set that could be used to identify or locate individuals, including names and addresses. True anonymization is difficult to achieve, as many variables such as location may allow individuals to be identified.

AI (artificial intelligence) Often used interchangeably with 'machine learning'. The process of programming a computer to find patterns or anomalies in large data sets, or to find the mathematical relationship between some input variables and an output. AI algorithms have applications in a range of fields including healthcare, self-driving cars and image recognition.

biometric airport security The use of biometric information, such as facial measurements or fingerprints, in airport security.

Brexit The exit of the United Kingdom from the European Union.

census A regular, systematic survey of members of a population, usually conducted by a government. Data collected during a census may include household size and income, and may be used to plan housing, healthcare and social services.

continuous health data Collected at regular, short intervals from individuals, and could include heart rate, activity or blood pressure. Advances in wearable technologies such as activity monitors make continuous health monitoring feasible.

differential privacy Method for sharing summary statistics about a group of people, while protecting the anonymity of individuals in the group.

geospatial data Involves a geographic component, which could include latitude and longitude or a country code.

Go Two-player strategy game, where the aim is to capture the most territory. Google's DeepMind has developed several algorithms designed to compete against humans.

Jeopardy! Televised American game show. Contestants are given answers, and must provide the correct questions.

80 g Society
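A minimal sketch of the differential-privacy idea defined above, using the Laplace mechanism (one standard approach; the epsilon value and count are illustrative): noise calibrated to how much a single person can change a statistic is added before the statistic is released.

```python
import random

def noisy_count(true_count, epsilon, rng):
    """Release a count with Laplace noise of scale 1/epsilon.

    One individual can change a count by at most 1 (its "sensitivity"),
    so noise at this scale masks any single person's presence or absence.
    """
    scale = 1.0 / epsilon
    # A Laplace sample is the difference of two independent exponentials.
    return true_count + rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

rng = random.Random(42)
true_count = 130          # e.g. people in a group with some attribute
released = noisy_count(true_count, epsilon=0.5, rng=rng)
print(released)           # close to 130, but deniable for any individual
```

Smaller epsilon means more noise and stronger privacy; the released figure is still useful in aggregate because the noise averages out.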
machine learning Finding a mathematical relationship between input variables and an output. This 'learned' relationship can then be used to output predictions, forecasts or classifications given an input. For example, a machine learning model may be used to predict a patient's risk of developing diabetes given their weight. This would be done by fitting a function to a 'training' set of thousands of historic data points, where each point represents a single patient's weight and whether they developed diabetes. When a new, previously unseen patient's weight is run through the model, this 'learned' function will be used to predict whether they will develop diabetes. Modern computer hardware has enabled the development of powerful machine learning algorithms.

microtargeting Strategy used during political or advertising campaigns in which personalized messaging is delivered to different subsets of customers or voters based on information that has been mined or collected about their views, preferences or behaviours.

profile (voter) Information about an individual voter which may include age, address and party affiliation.

randomized experiments Experimental design where participants or subjects are randomly allocated to treatment groups. Participants in a randomized drug trial could be randomly allocated to a group where they would either receive a placebo or a drug.

sensitive information/data Reveals personal details, such as ethnicity, religious and political beliefs, sexual orientation, trade union membership or health-related data.

sniffers Software that intercepts and analyses the data being sent across a network, to or from a phone, computer or other electronic device.

Yellow Vests movement Protest movement originating in France, focused on issues such as rising fuel prices and the cost of living.

Glossary g 81
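The diabetes example in the machine learning entry above can be sketched end-to-end. Everything here is invented for illustration: the "historic" patients are synthetic, and the model is a deliberately minimal logistic regression fitted by gradient descent, not any particular clinical tool.

```python
import math
import random

rng = random.Random(0)

# Synthetic "historic" patients: weight in kg, and whether they developed
# diabetes. The underlying risk is made to rise with weight purely so the
# model has a relationship to learn.
weights = [rng.uniform(50, 130) for _ in range(500)]
labels = [1 if rng.random() < 1 / (1 + math.exp(-(w - 90) / 8)) else 0
          for w in weights]

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# "Learn" the relationship: fit p(diabetes) = sigmoid(a*x + b) by gradient
# descent, with the weight standardized to keep the optimization stable.
xs = [(w - 90) / 20 for w in weights]
a = b = 0.0
for _ in range(1000):
    grad_a = sum((sigmoid(a * x + b) - y) * x for x, y in zip(xs, labels))
    grad_b = sum(sigmoid(a * x + b) - y for x, y in zip(xs, labels))
    a -= 0.5 * grad_a / len(xs)
    b -= 0.5 * grad_b / len(xs)

def predict(weight_kg):
    """Predicted probability for a new, previously unseen patient."""
    return sigmoid(a * (weight_kg - 90) / 20 + b)

print(round(predict(60), 2), round(predict(120), 2))
```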
SURVEILLANCE

the 30-second data

Data surveillance is all around us, and it continues to grow more sophisticated and all-encompassing. From biometric airport security to grocery shopping, online activity and smartphone usage, we are constantly being surveilled, with our actions and choices being documented into spreadsheets. Geospatial surveillance data allows marketers to send you tailored ads based upon your physical, real-time location. Not only that, it can also use your past location behaviour to predict precisely what kind of ads to send you, sometimes without your permission or knowledge. While data surveillance is itself uninteresting, it's the actions taken from analysis of the data that can be both harmful and helpful. Using data surveillance, private and public entities are investigating methods of influencing or 'nudging' individuals to do the 'right' thing, and penalizing us for doing the 'wrong' thing. A health insurance company could raise or lower rates based upon the daily steps a fitness tracker records; a car insurance company could do the same based upon data from a smart car. Data surveillance is not only about the present and analysis of actions; it's also about predicting future action. Who will be a criminal, who will be a terrorist, or simply, what time of the day are you most likely to buy that pair of shoes you have been eyeing while online shopping?

3-SECOND SAMPLE
Eyewitness sketches and background checks might become an archaic relic with the amount of surveillance data we now have the capability of storing and analysing.

3-MINUTE ANALYSIS
While data surveillance can feel negative, there are incredible advances in preventing terrorism, cracking child pornography rings by following images being sourced from the internet, and even aiding the global refugee crisis. The Hive (a data initiative for USA for the UN Refugee Agency) used high-resolution satellite imagery to create a machine-learning algorithm for detecting tents in refugee camps – allowing for better camp planning and field operation.

RELATED TOPICS
See also
SECURITY
page 84

MARKETING
page 108

3-SECOND BIOGRAPHY
TIM BERNERS-LEE
1955–
Creator of the World Wide Web, coining the internet as the 'world's largest surveillance network'.

30-SECOND TEXT
Liberty Vittert

When put towards a good cause, such as crime prevention, certain types of surveillance can be well justified.

82 g Society
SECURITY

the 30-second data

Data is opening up new opportunities in intelligence processing, dissemination and analysis while improving the investigative capacities of security and intelligence organizations at global and community levels. From anomalies (behaviour that doesn't fit a usual pattern) to associations (relationships that the human eye couldn't detect) and links (social networks of connections, such as Al-Qaeda), intelligence organizations compile data from online activity, surveillance, social media and so on, to detect patterns, or lack thereof, in individual and group activity. Systems called 'sniffers' – designed to monitor a target user's internet traffic – have been transformed from simple surveillance systems into security systems designed to distinguish between communications that may be lawfully intercepted for security purposes and those that may not. Data can visualize how violence spreads like a virus among communities. The same data can also predict the most likely victims of violence and even, supposedly, the criminals. Police forces are using data to both target and forecast these individuals. For example, police in Chicago identified over 1,400 men to go on a 'heat list' generated by an algorithm that rank-orders potential victims and subjects with the greatest risk of violence.

3-SECOND SAMPLE
Big Data meets Big Brother in the untapped and untried world of data-driven security opportunities. From community policing to preventing terrorism, the possibilities are endless, and untested.

3-MINUTE ANALYSIS
In the case of Chicago (see 'data' text), a higher score means a greater risk of being a victim or perpetrator of violence. In 2016, on Mother's Day weekend, 80 per cent of the 51 people shot over two days had been correctly identified on the list. While proponents say that it allows police to prioritize youth violence by intervening in the lives of those most at risk, naysayers worry that, by not identifying what generates the risk score, racial bias and unethical data use might be in play.

RELATED TOPICS
See also
SURVEILLANCE
page 82

ETHICS
page 152

3-SECOND BIOGRAPHY
PATRICK W. KELLEY
fl. 1994
FBI Director of Integrity and Compliance, who migrated Carnivore to practice.

30-SECOND TEXT
Liberty Vittert

Carnivore was one of the first systems implemented by the FBI to monitor email and communications from a security perspective.

84 g Society
PRIVACY
the 30-second data

The adage ‘if you’re not paying for the product, you are the product’ remains true in the era of big data. Businesses and governments hold detailed information about our likes, health, finances and whereabouts, and can harness this to serve us personalized advertising. Controversies around targeted political campaigning on Facebook, including alleged data breaches during the 2016 US presidential election, have brought data privacy to the forefront of public debate. For example, medical records are held securely by healthcare providers, but health apps are not subject to the same privacy regulations as hospitals or doctors. A British Medical Journal study found that nearly four in five of these apps routinely share personal data with third parties. Users of menstrual cycle, fitness or mental health tracking apps may be unaware that sensitive information about their health and well-being is up for sale. One strategy for protecting privacy is the removal of identifying variables, such as full names or addresses, from large data sets. But can data ever truly be anonymized? In 2018 the New York Times reviewed a large anonymized phone location data set. Journalists were able to identify and contact two individuals from the data, demonstrating that true anonymization is difficult to achieve.

3-SECOND SAMPLE
Every day we generate thousands of data points describing our lifestyle and behaviour. Who should have access to this information, and how can they use it responsibly?

3-MINUTE ANALYSIS
Governments have taken steps to safeguard privacy. The UK’s Information Commissioner’s Office fined Facebook £500,000 for failing to protect user data. In the European Union, organizations must ask for consent when collecting personal data and delete it when asked. The US Census Bureau is introducing ‘differential privacy’ into the 2020 census, a method that prevents individuals being identified from aggregate statistics.

RELATED TOPICS
See also
SURVEILLANCE page 82
REGULATION page 150

3-SECOND BIOGRAPHY
MITCHELL BAKER
1959–
Founder of the Mozilla Foundation, launched in 2003, which works to protect individuals’ privacy while keeping the internet open and accessible.

30-SECOND TEXT
Maryam Ahmed

Non-governmental organizations advocate for and support projects relating to greater internet and data privacy.

86 g Society
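The ‘differential privacy’ idea mentioned above can be sketched in a few lines. This is a simplified illustration, not the Census Bureau’s actual implementation: calibrated random noise is added to an aggregate count, so that adding or removing any one person barely changes what gets published.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Draw a sample from Laplace(0, scale) via inverse-transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records: list, epsilon: float = 1.0) -> float:
    """Release a count with Laplace noise of scale 1/epsilon.

    Adding or removing one record changes the true count by at most 1
    (its 'sensitivity'), so noise of scale 1/epsilon gives
    epsilon-differential privacy for this single query.
    """
    return len(records) + laplace_noise(1.0 / epsilon)

# A noisy count is close to, but not exactly, the true count.
people = ["alice", "bob", "carol", "dave"]
noisy = private_count(people, epsilon=0.5)
```

Smaller values of `epsilon` mean more noise and stronger privacy; the aggregate statistic stays useful while any individual’s presence is masked.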
12 May 1820
Born in Italy, and is named after the city of her birth

1837
Experiences the first of several ‘calls from God’, which inform her desire to serve others

1844
Announces intention to pursue a career in nursing, prompting opposition from family

1851
Undertakes medical training in Düsseldorf, Germany

1853
Becomes Superintendent of the Institute for the Care of Sick Gentlewomen, London

1854
Travels to the Scutari Barracks in modern-day Turkey, with a group of 38 female nurses, and oversees the introduction of sanitary reforms

1857
Suffers from intermittent episodes of depression and ill health, which continue until her death

1858
Publishes the data-driven report ‘Mortality of the British Army’

1859
Elected to the Royal Statistical Society

13 August 1910
Dies in her sleep, in London
FLORENCE NIGHTINGALE

Florence Nightingale was a pioneer of modern nursing methods, medical statistics and data visualization. Born to a wealthy British family in 1820, she defied the cultural conventions of the time by choosing to pursue her calling as a nurse rather than simply marrying and raising a family.
After training in Germany, Nightingale, aged 33, rose to become the superintendent of a hospital in London. It was during the Crimean War, however, that she would make her mark on nursing and data science. In 1854, at the invitation of the Minister for War, Florence travelled to a military hospital in Scutari, Turkey, to supervise the introduction of female nurses. Under Nightingale’s watch, the death rate fell, partly due to her emphasis on basic sanitary practices such as handwashing. Here, her habit of performing ward rounds at night earned her the affectionate moniker ‘the Lady with the Lamp’.
Throughout her career, Nightingale took a data-driven approach to her nursing practice. She made extensive use of data visualizations and statistics to highlight public health issues. During her time in the Crimea, Nightingale collected data on mortality rates and causes of deaths. In 1858, she published ‘Mortality of the British Army’, a report demonstrating the differences between death rates in the military compared to the civilian population. She was an early adopter of the coxcomb, a form of pie chart where the length rather than the angle of each segment is proportional to the size of the data. In one of her best known visualizations, Nightingale used a coxcomb to illustrate that many more soldiers died of ‘preventable or mitigable’ diseases than of fatal wounds.
Nightingale was also concerned with the living conditions of British soldiers in India, and drove the establishment of a Royal Commission to investigate the issue. In 1873, she reported that the death rate for British soldiers in India had fallen from 69 to 18 per 1,000 following ten years of sanitary reform. In Britain, Nightingale lobbied government ministers for reforms including compulsory sanitation in private houses, improved drainage and stricter legislation.
Nightingale’s contributions to medical statistics were widely recognized. In 1859, she became the first female member of the Royal Statistical Society, and later became an honorary member of the American Statistical Association. She died in her sleep in 1910, aged 90.

Maryam Ahmed

Florence Nightingale g 89
VOTE SCIENCE
the 30-second data

Vote Science has been in practice since political outcomes began being decided by votes, dating back to sixth-century BCE Athens. Modern Vote Science evolved rapidly in the US in the 1950s, when campaigns, political parties and special interest groups started keeping large databases of eligible voters, which were later used to build individual voter profiles. Using machine learning and statistical analysis, campaign professionals began using these profiles to make calculated decisions on how to win an election or sway public opinion. Current best practices include maintaining databases of people with hundreds of attributes, from individuals’ credit scores to whether they vote early/in person, or even whether they are more likely to vote if reminded via phone, text or email. Using this data, campaigns and political parties work to predict voter behaviour, such as whether voters will turn out, when they will vote, how they will vote and – most recently – what will persuade them to change their opinion. Recent campaigns have adopted randomized field experiments to assess the effectiveness of mobilization and persuasion efforts. Vote Science now determines how a campaign chooses to spend its advertising funds as well as which particular messages are shown to specific, individual voters.

3-SECOND SAMPLE
Vote Science is the practice of using modern voter registration lists, consumer and social media data, and polling to influence public opinion and win elections.

3-MINUTE ANALYSIS
George Bush’s 2004 re-election was the first political campaign to use political microtargeting – machine-learning algorithms that classify, at an individual level, how voters might vote or whether they would vote at all. Barack Obama’s campaigns in 2008 and 2012 took Vote Science a step further by incorporating randomized field experiments. Elections in the UK, France and India began to use Vote Science techniques such as microtargeting and randomized field experiments in their campaigns after witnessing the success of the American model.

RELATED TOPICS
See also
LEARNING FROM DATA page 20
ETHICS page 152

3-SECOND BIOGRAPHIES
DONALD P. GREEN
1961–
Leader of Vote Science randomized experiments.

SASHA ISSENBERG
fl. 2002–
Chronicler of how data science has been used and evolved in campaigns in the last 20 years.

DAN WAGNER
fl. 2005–
Director of Analytics for ‘Obama for America’ in 2012; led efforts to expand Vote Science in campaigns to message testing and donor models.

30-SECOND TEXT
Scott Tranter

Modern-day election campaigns are driven by Vote Science, with a vast amount of campaign budget allocated to it.

90 g Society
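The voter-profile scoring described above can be sketched in miniature. This toy model uses invented attributes and hand-set weights, purely for illustration: it turns a few voter attributes into a turnout probability with a logistic function, the basic shape many campaign turnout models share.

```python
import math

def turnout_probability(voter: dict, weights: dict, bias: float) -> float:
    """Logistic model: P(vote) = 1 / (1 + exp(-(bias + sum of w_i * x_i))).

    In a real campaign, the weights would be fitted to historical
    turnout records rather than set by hand as they are here.
    """
    score = bias + sum(weights[k] * voter.get(k, 0.0) for k in weights)
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical attributes: voted in the last election, age in decades,
# and whether the campaign contacted the voter by phone.
weights = {"voted_last": 2.0, "age_decades": 0.3, "contacted": 0.5}

likely = turnout_probability(
    {"voted_last": 1, "age_decades": 6, "contacted": 1}, weights, bias=-2.0)
unlikely = turnout_probability(
    {"voted_last": 0, "age_decades": 2, "contacted": 0}, weights, bias=-2.0)
```

A campaign would rank its voter file by such scores and spend mobilization effort where a reminder is most likely to change the outcome.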
HEALTH
the 30-second data

Data science develops tools to analyse health information, to improve related services and outcomes. An estimated 30 per cent of the world’s electronically stored data comes from the healthcare field. A single patient can generate roughly 80 megabytes of data annually (the equivalent of 260 books’ worth of data). This health data can come from a variety of sources, including genetic testing, surveys, wearable devices, social media, clinical trials, medical imaging, clinic and pharmacy information, administrative claim databases and national registries. A common data source is electronic medical record (EMR) platforms, which collect, organize and analyse patient data. EMRs enable doctors and healthcare networks to communicate and coordinate care, thereby reducing inefficiencies and costs. EMR data is used to create decision tools for clinicians, which incorporate evidence-based recommendations for patient test results and prevention procedures. Healthcare data science combines the fields of predictive analytics, machine learning and information technology to transform unstructured information into knowledge used to change clinical and public health practice. Data science helps to save lives by predicting patient risk for diseases, personalizing patient treatments and enabling research to cure diseases.

3-SECOND SAMPLE
Data science transforms unstructured health information into knowledge that changes medical practice.

3-MINUTE ANALYSIS
Consumer-grade wearable devices coupled with smartphone technology offer innovative ways to capture continuous health data, improving patient outcomes. For example, heart monitors can be used to diagnose and/or predict abnormal and potentially life-threatening heart rhythms. The data can be assessed by varying time parameters (days to weeks versus months to years), to develop early-warning health scores. Similarly, hearing aids with motion sensors can detect the cause of a fall (slipping versus heart attack), so doctors can respond effectively.

RELATED TOPICS
See also
EPIDEMIOLOGY page 76
PERSONALIZED MEDICINE page 138
MENTAL HEALTH page 140

3-SECOND BIOGRAPHIES
FLORENCE NIGHTINGALE
1820–1910
Championed the use of healthcare statistics.

BILL & MELINDA GATES
1955– & 1964–
Launched in 2000, the Gates Foundation uses data to solve some of the world’s biggest health data science problems.

JAMES PARK & ERIC FRIEDMAN
fl. 2007
Founders of Fitbit who applied sensors and wireless tech to health and fitness.

30-SECOND TEXT
Rupa R. Patel

Using data to personalize healthcare helps to save lives.

92 g Society
IBM’S WATSON & GOOGLE’S DEEPMIND
the 30-second data

When IBM’s Watson computer defeated the reigning Jeopardy! champion on a nationally televised game show in 2011, it was a demonstration of how computer-based natural language processing and machine learning had advanced sufficiently to take on the complex wordplay, puns and ambiguity that many viewers might struggle with. Google’s DeepMind subsidiary did something similar – its AlphaGo program used machine learning and artificial intelligence to beat the world champion at Go, a very complicated strategy board game played with black and white stones; a feat no other computer had ever accomplished. Picking ambitious targets such as beating humans at well-known games serves several purposes. First, it gives data scientists clear goals and benchmarks to target, like ‘Win at Jeopardy!’. In IBM’s case, they even announced the goal beforehand, which put pressure on the development team to be creative and think outside the box, as who would want to be publicly humiliated by a mere human? Second, these sparring matches speak to the public about how far hardware and software are progressing. Go is much more challenging than chess, so if a computer can beat the world champion, we must be making a lot of progress!

3-SECOND SAMPLE
IBM’s Watson Jeopardy!-playing computer and Google’s DeepMind Go-playing program introduced the world to machine learning and artificial intelligence in ways that were easy to understand.

3-MINUTE ANALYSIS
Computer companies pursue targets such as playing Jeopardy! and Go because to excel at them they have to develop general-purpose capabilities that can be applied to other commercially important problems. The ability to answer a person’s question in his or her own language on a broad range of topics, or to train for complicated problems such as robot navigation, will help future computers to perform more sophisticated tasks for people, including their creators.

RELATED TOPICS
See also
MACHINE LEARNING page 32
NEURAL NETWORKS & DEEP LEARNING page 34
GAMING page 130

3-SECOND BIOGRAPHIES
THOMAS WATSON
1874–1956
Chairman and CEO of IBM; the Jeopardy!-playing computer is named after him.

DEEPMIND TECHNOLOGIES
2010–
Acquired by Alphabet (parent of Google) in 2014.

30-SECOND TEXT
Willy Shih

Computers beating humans at ever more complex games is a visible measure of the progress being made in data science.

94 g Society
BUSINESS
GLOSSARY

automated system Some repetitive tasks or calculations can be carried out faster, continuously and more efficiently by computers. Examples of automated systems include automated passport gates at airports, self-driving cars or speech-to-text software.

autonomous machines Able to complete a task without human input, such as a self-driving car.

big data Data set that meets some or all of the following criteria: volume, velocity, veracity and variety. It must consist of a large volume or amount of individual data points, generated at a high or regular velocity. It may consist of a variety of data types including text, numerical data or images, and it will ideally be accurate, or have veracity.

data analytics Obtaining, cleaning and analysing data to gain useful insights, answer research questions or inform decision making. Descriptive data analytics describes and draws conclusions from the available data; predictive analytics aims to generalize these findings to make predictions or forecasts about the future.

foot traffic analysis Often used in the retail sector to measure how many customers enter a shop, and their movements and behaviour while browsing.

geolocation data Describes the location of a person or object over time.

Go Two-player strategy game, where the aim is to capture the most territory. Google’s DeepMind has developed several algorithms designed to compete against humans.

Internet of things (IoT) Internet-connected or ‘smart’ devices, including activity monitors, home assistants and TVs, which provide improved functionality compared to their offline counterparts through the collection and analysis of data in real time. For example, a smart home assistant is able to communicate with and control items around the home such as smart light bulbs, central heating and security systems.

98 g Business
natural language-processing algorithms Techniques for analysing written or spoken language. This could include the contents of political speeches, vocal commands given to a smartphone or written customer feedback on an e-commerce website. Common natural language processing techniques include sentiment analysis, where text is labelled as positive or negative depending on its tone, and topic modelling, which aims to identify the overall theme or topic of a piece of text.

probability theory Branch of mathematics concerned with representing probabilities in mathematical terms. The field relies on a set of underlying assumptions, or axioms, including ‘the probability of an event is a non-negative, real number.’

prototype Working draft version of a piece of software or hardware, sometimes referred to as a minimum viable product, or MVP.

quantitative finance Uses probability and mathematical techniques to model financial problems.

quantum mechanics Branch of physics concerned with the behaviour of atomic and subatomic particles.

reinforcement learning Branch of machine learning, where algorithms learn to take actions which maximize a specified reward.

tabulating system A machine, developed in the 1800s, designed to store information in the form of hole-punched cards. Its first use was in the 1890s, to store data collected during the 1890 US census.

tracking cookies Pieces of information from a website, stored by a person’s web browser, which are shared or tracked across websites, to track a user’s online journey. They may be used by third party advertising providers, to serve personalized adverts based on a user’s browsing history.

Glossary g 99
INDUSTRY 4.0
the 30-second data

Industry 4.0 can be more easily understood as a ‘smart factory’, where internet-connected systems and machines communicate and cooperate with each other in real time to do the jobs that humans used to do. This relies on the Internet of things (IoT), the extension of internet connectivity into devices and everyday objects. While Industry 4.0 can have an ominous ring to it in certain circles, there is a vast array of incredible applications in our daily lives. From robots picking and packing items in a warehouse for delivery, to autonomous cranes and trucks on building sites, to using information collected from these machines to find and resolve irregularities in business systems – the possibilities are endless and, as of yet, unknown. Business is not the only winner in this industrial revolution. For example, homecare advances such as voice-control systems and alerts for falls or seizures provide assistance to elderly or disabled individuals. However, there are large barriers to the full implementation of Industry 4.0, integration being one of the biggest. There are no industry standards for connectivity and the systems themselves are fragmented between different industries and companies. Privacy concerns are overwhelming, with the amount of data collected (personal and otherwise) by these systems needing to be protected, as are decisions over ownership.

3-SECOND SAMPLE
‘Humankind will be extinct or jobless’ is the feared mantra with the fourth industrial revolution in manufacturing, where machines use data to make their own decisions.

3-MINUTE ANALYSIS
Millions of people are employed by the manufacturing industry and fears over job loss from the data revolution/Industry 4.0 are real and already evident. While this may be very worrisome, many are using it as an opportunity to push for the idea of a ‘universal basic income’. This is a periodic monetary compensation given to all citizens as a right, with the only requirement being legal residency. This stipend would be enough for basic bills and living, with the aim that individuals will be free to pursue any interest.

RELATED TOPIC
See also
ARTIFICIAL INTELLIGENCE (AI) page 148

3-SECOND BIOGRAPHY
HUGH EVERETT
1930–82
First proposed the Many Worlds interpretation of quantum mechanics and worked on operations research.

30-SECOND TEXT
Liberty Vittert

Connectivity and standardization across industries are a major obstacle to the widespread adoption of smart factories.

100 g Business
ENERGY SUPPLY & DISTRIBUTION
the 30-second data

Our energy supply is transitioning from fossil fuels and centralized infrastructure to a renewable, decentralized system, and data analytics eases the challenges of that transition. As the output of wind farms and solar photovoltaic plants is weather-dependent, high-resolution weather forecasting based on predictive analytics has wide applications in improving design and operation of these systems, from optimizing the layout of wind turbines in a field to automatically adjusting the angle of solar panels to maximize power generation despite changing conditions. As electricity is then transmitted to the end customer, analytics is critical to managing the growing complexity of the power grid due to ‘distributed energy resources’ – controllable devices such as backup generators, home batteries and smart thermostats, often owned by homeowners and businesses. These devices are excellent resources for balancing the grid, and grid operators can use analytics to determine which mix of devices to pull from at any time based on weather, historic energy demand, the performance and tolerances of each device, and grid conditions like voltage. For grid operators, analytics is also useful in planning infrastructure investments, allowing them to predict which parts of the network will be most strained decades into the future.

3-SECOND SAMPLE
Data science is key to managing the growth of renewable and distributed energy sources in the electric power system.

3-MINUTE ANALYSIS
Fossil fuels still make up a large part of global energy consumption, and oil and gas companies make liberal use of analytics as well – in characterizing untapped oil reservoirs below the earth’s surface, optimizing drill operations when drilling new wells, forecasting impending equipment failures, deciding which oil streams to blend together, and more.

RELATED TOPICS
See also
INDUSTRY 4.0 page 100

3-SECOND BIOGRAPHY
THOMAS EDISON
1847–1931
Architect of the world’s first power grid, which went live in New York City in 1882.

30-SECOND TEXT
Katrina Westerhof

Data like demographics and infrastructure condition can inform decisions about where to increase the capacity of the power grid.

102 g Business
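The device-selection problem described above can be sketched as a simple merit-order dispatch. This is a toy version with invented device names and costs; real grid operators also model forecasts, device tolerances and network constraints. The idea: sort available resources by cost and draw on the cheapest until demand is met.

```python
def dispatch(resources, demand_kw):
    """Greedily select resources in cost order until demand is covered.

    resources: list of (name, capacity_kw, cost_per_kwh) tuples.
    Returns a list of (name, kw_used) allocations.
    """
    plan = []
    remaining = demand_kw
    for name, capacity, _cost in sorted(resources, key=lambda r: r[2]):
        if remaining <= 0:
            break
        used = min(capacity, remaining)
        plan.append((name, used))
        remaining -= used
    if remaining > 0:
        raise ValueError("demand exceeds available capacity")
    return plan

# Hypothetical fleet of distributed energy resources.
fleet = [
    ("home batteries", 40, 0.08),
    ("smart thermostats", 25, 0.02),   # demand reduction is cheapest here
    ("backup generators", 100, 0.30),
]
plan = dispatch(fleet, demand_kw=70)
```

With 70 kW of demand, the sketch draws fully on the two cheapest resources and only 5 kW from the expensive generators.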
LOGISTICS
the 30-second data

Route optimization – born of both predictive and prescriptive data analytics – has unlocked enormous benefits for the previously low-tech logistics industry, reducing fuel consumption and improving reliability of service. When delivering packages to homes and businesses, logistics companies can now identify the most efficient routes for each driver, each day, across the entire fleet, taking into account delivery deadlines, traffic patterns and weather forecasts. Upstream, in freight shipping, shippers can apply similar techniques to optimize the route from an origin point to a distribution facility, choosing the right combination of sea, air, rail and road transport to get each shipment to its destination most efficiently and on time. In both cases, the tools exist today to make these optimizations dynamic, allowing carriers to reroute parcels in real time as conditions change and, for delivery routes, even recommending the ideal driving speed on each leg of the route to consistently hit green traffic lights. Beyond optimizing how an item gets to its destination, big data and analytics also provide insights into how to structure a global logistics network, such as where to build new hubs, distribution facilities and customer drop sites as transportation constraints and customer demand change.

3-SECOND SAMPLE
Getting an item from Point A to Point B is more efficient and reliable with optimized routing, enabled by data analytics.

3-MINUTE ANALYSIS
In the context of supply-chain management, the value of analytics for logistics is even greater. Predictive analytics will improve inventory management by considering the impacts of factors like geopolitics, weather and climate change, and consumer sentiment on product availability or demand. And integrating data across the supply chain unlocks new opportunities – for example, dynamically rerouting a shipment of ripe fruit to a nearer store or a store where fruit sells more quickly, thereby reducing food waste.

RELATED TOPICS
See also
INDUSTRY 4.0 page 100
SHOPPING page 118

3-SECOND BIOGRAPHY
JUAN PEREZ
1967–
Chief Engineering and Information Officer at UPS who led the implementation of the company’s ORION route optimization project.

30-SECOND TEXT
Katrina Westerhof

Dynamic route optimization enables shippers to be responsive to changing conditions in the supply chain.

104 g Business
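At the core of the route optimization described above is a shortest-path computation. A minimal sketch using Dijkstra’s algorithm follows; the road network and travel times are invented for illustration, and production routing engines layer live traffic, deadlines and weather onto these edge weights.

```python
import heapq

def shortest_route(graph, start, end):
    """Dijkstra's algorithm over a dict of {node: {neighbour: travel_time}}.

    Returns (total_time, path) for the quickest route from start to end.
    """
    queue = [(0, start, [start])]
    seen = set()
    while queue:
        time, node, path = heapq.heappop(queue)
        if node == end:
            return time, path
        if node in seen:
            continue
        seen.add(node)
        for neighbour, weight in graph.get(node, {}).items():
            if neighbour not in seen:
                heapq.heappush(queue, (time + weight, neighbour, path + [neighbour]))
    return float("inf"), []

# Hypothetical depot-to-customer network (minutes between stops).
roads = {
    "depot": {"A": 10, "B": 15},
    "A": {"customer": 25, "B": 3},
    "B": {"customer": 12},
}
best_time, best_path = shortest_route(roads, "depot", "customer")
```

Here the direct-looking route via B (27 minutes) loses to the detour through A then B (25 minutes), which is exactly the kind of non-obvious saving fleet-wide optimization compounds.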
29 February 1860
Born in Buffalo, New York, USA

1875
Enrols in City College of New York

1879
Receives undergraduate degree in Mining from Columbia University

1880
Serves as assistant to William Trowbridge, his professor who worked on the US Census

1889
Receives patent for punch-card tabulator (Patent No. 395,782)

1890
Gains PhD from Columbia University

1890
Receives the Elliot Cresson Medal

1890–1900
Contracted to supply his machines for the 1890 census count

1911
Begins (alongside several others) the Computing-Tabulating-Recording Company (CTR)

1918
Starts stepping back from day-to-day operations at CTR

1921
Retires

1924
CTR becomes IBM

17 November 1929
Dies in Washington, DC, USA
HERMAN HOLLERITH

On 2nd January, 1889, the US Patent Office awarded Patent No. 395,782 for an invention ‘which consists in recording separate statistical items pertaining to the individual by holes or combinations of holes punched in sheets…and then counting [said items]…by means of mechanical counters’. Little did the Patent Office suspect that No. 395,782 would revolutionize the tracking of global populations for the next two decades and inspire the very creation of modern computation. But who was the author of that patent? Enter Herman Hollerith – brilliant statistician, consultant to the US census office and lover of chicken salad.
Born in 1860, Hollerith was a precocious child, though his hatred for spelling often cast a cloud over his early education. After earning an undergraduate degree at Columbia University, he was invited to work with professor W.P. Trowbridge in 1880. Over the next ten years, Hollerith dabbled in technology, taught at Massachusetts Institute of Technology (MIT) and served a brief stint in a patent office. But there was more to come.
As the story goes, one evening in 1882, Hollerith was having dinner with his girlfriend and her father, Dr John Shaw Billings, when Billings – who was then employed by the US Census Bureau – proposed the idea of an automated system for counting census data. Hollerith’s inventive fire was stoked, and building such a system became his preoccupation at MIT. Eventually, a tabulating system emerged, one which used a series of insulating punch cards, such that each hole corresponded to a particular census category. Following the patent award in 1889, the US census adopted Hollerith’s machines in the 1890 census, saving years and millions of dollars. Other countries, including Canada, Norway, Austria and England, were quick to adopt Hollerith’s system. And the old ways, like the chad that falls out after punching, fell to the wayside.
Hollerith was not only a great statistician and inventor but also an entrepreneur. In 1896, he founded his Tabulating Machine Company to sell his machines. As competition emerged, Hollerith’s company merged with other companies to form the Computing-Tabulating-Recording Company, which he continued to serve in an advisory capacity. In 1924, the Computing-Tabulating-Recording Company became IBM, one of the largest computer companies in the world today.
From his inventions that aided governments and set American computing on a roaring path, to his statistical insights and mechanical know-how that solved large-scale data storage and management problems, Herman Hollerith has left his imprint.

Aditya Ranganathan

Herman Hollerith g 107


MARKETING
the 30-second data

The advent of marketing data science has allowed businesses to target their ideal customers more effectively. To some extent, this has had equalizing effects for up-and-coming retailers competing with large, established online retail platforms like Amazon. However, the established players still have an incredible competitive advantage in the form of user data. While smaller firms rely almost exclusively on tracking cookies and data brokers (who provide demographic and commercial information on individuals) to target a customer profile, large industry players such as Amazon, Google and Alibaba have immense amounts of data on every single one of their respective users. In a 2019 disclosure by Amazon, it was revealed that over 100 million customers pay a membership fee for Amazon Prime, its two-day shipping service bundled with myriad other services like streaming content and food delivery. The marketing advantage due to data availability alone is so pronounced in this case that antitrust legislation is being proposed to prevent companies from acting as both sellers and platforms for selling. However the future of online commerce plays out, it is clear that as society becomes ever more digital, data science will be an essential method in determining who we are and, more importantly, what we want to buy.

3-SECOND SAMPLE
As data science rose from its infancy to become its own field, separate in discipline from software engineering or statistics, digital marketing’s dominance rose with it.

3-MINUTE ANALYSIS
Some lament data science and marketing’s close relationships. Cloudera’s own Jeff Hammerbacher (formerly lead data scientist at Facebook) has been famously quoted: ‘The best minds of my generation are thinking about how to make people click ads.’ In its most nefarious form, data science is a key contributor to the social media addiction phenomenon, as practitioners work to keep you engaged with the platform by which they can serve you adverts for longer periods of time.

RELATED TOPICS
See also
PRIVACY page 86
SOCIAL MEDIA page 128

3-SECOND BIOGRAPHIES
SUSAN WOJCICKI
1968–
CEO of YouTube following Google’s acquisition, and dubbed the ‘most important person in advertising’.

JEFF HAMMERBACHER
1982–
Chief scientist and co-founder at Cloudera; formerly led the data science team at Facebook.

30-SECOND TEXT
Scott Tranter

Modern marketing methods are driven by vast amounts of data, used to create highly targetable consumer profiles.

108 g Business
FINANCIAL MODELLING
the 30-second data

Ever since Ed Thorp popularized the application of probability theory in financial markets, trying to beat the market has been a continuous, elusive conquest, and data science has been an integral tool in optimizing investment strategies. In the past few years, competition has been fierce for new alternative data sets, from which unique and proprietary insights can be extracted, hoping to give traders the edge in predicting the next price move. For example, aggregated credit card data can be used to estimate real-time revenues for companies ahead of earnings, which are otherwise released only quarterly, and bets can be placed using this granular source of information ahead of the official release. Foot traffic analysis can also help with estimating and trading around earnings – satellite imagery to count cars at supermarket car parks can indicate trends in volume of shoppers, and geolocation data derived from devices such as mobile phones can illuminate consumer behaviour. Natural language-processing algorithms enable machines to understand text, and can further help extract sentiment about the market when applied to news feeds, call transcripts regarding company earnings and analyst reports.

3-SECOND SAMPLE
With the quantitative revolution in finance, data scientists are en vogue, and potentially the key to the holy grail of figuring out the market’s next move.

3-MINUTE ANALYSIS
Market impact takes place when a participant seeks to buy or sell an asset on an exchange, and their own footprint pushes the price adversely against them. For example, when buying 10,000 shares of Apple, the price of the last share purchased may be higher than the price when the trade was initiated. Minimizing this effect can be modelled as a classic ‘reinforcement learning’ problem, where feedback from a series of actions is incorporated into subsequent trading decisions.

RELATED TOPICS
See also
MACHINE LEARNING page 32
IBM’S WATSON & GOOGLE’S DEEPMIND page 94

3-SECOND BIOGRAPHY
EDWARD O. THORP
1932–
US mathematics professor who pioneered the modern applications of probability theory in the financial markets.

30-SECOND TEXT
Sivan Gamliel

As availability of diverse data sources continues to grow, so grows the dominance of quantitative traders on Wall Street.

110 g Business
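The sentiment extraction mentioned above can be sketched with a simple word-count approach. This is a deliberately naive stand-in for modern NLP models, using a tiny made-up lexicon; real systems learn word weights from labelled data.

```python
# Tiny hand-built lexicon; production systems learn these weights from data.
LEXICON = {"record": 1, "growth": 1, "beats": 1,
           "miss": -1, "loss": -1, "lawsuit": -1}

def sentiment(text: str) -> int:
    """Sum lexicon scores of words in the text; >0 positive, <0 negative."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    return sum(LEXICON.get(w, 0) for w in words)

# Invented headlines, for illustration only.
headlines = [
    "Company posts record growth, beats estimates",
    "Earnings miss triggers loss, lawsuit filed",
]
scores = [sentiment(h) for h in headlines]
```

Aggregating such scores over thousands of headlines or call transcripts gives a crude but tradeable signal of whether news flow around a company is turning positive or negative.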
NEW PRODUCT DEVELOPMENT
the 30-second data

Developing a product means solving a problem or fulfilling a desire that a customer is willing to pay for. Often, developers start by looking at products that sold in the past; then, for inspiration and prototyping, the product is tested with surveys or focus groups, after which it is tweaked accordingly and put on the market. Collecting data helps product developers to refine features and pricing, but usually intuition plays a big role. The internet has turned this process on its head, because every move a customer makes while browsing products online, reading the news or even watching TV can be tracked. Netflix collects a consumer’s preferences on joining, along with ratings of films they have already seen. When using the service, it tracks when and what is watched, whether viewers return after a pause, whether they finish watching a programme and what they watch next. Before a new programme is developed, the developers already know – from data analytics on their millions of members – what each viewer likes and how many are likely to watch it. They then make programmes with actors and stories they know will be popular. This is just one example of how data analytics is changing product development.

3-SECOND SAMPLE
Internet-based data collection and analytics offers much more detailed data on when, what and how people use products, enabling a much deeper understanding of consumer needs.

3-MINUTE ANALYSIS
Big consumer products companies like Procter & Gamble use data to build computer models of new products such as disposable nappies even before their first physical prototype. When they go to market, they use a computerized ‘digital pulse’ to track blogs, tweets and online ratings and comments to see how their products are faring. This allows them to react quickly to things that happen in the marketplace, both good and bad.

RELATED TOPICS
See also
LEARNING FROM DATA page 20
MARKETING page 108
SHOPPING page 118

30-SECOND TEXT
Willy Shih

Owing to the vast amount of information developers have at their fingertips, they might know what you want even before you do.

112 g Business
PLEASURE
GLOSSARY

AI (artificial intelligence) Often used interchangeably with 'machine learning'. The process of programming a computer to find patterns or anomalies in large data sets, or to find the mathematical relationship between some input variables and an output. AI algorithms have applications in a range of fields including healthcare, self-driving cars and image recognition.

algorithm Set of instructions or calculations designed for a computer to follow. Writing algorithms is called 'coding' or 'computer programming'. The result of an algorithm could be anything from the sum of two numbers to the movement of a self-driving car.

analytical engine Mechanical computer, designed by Charles Babbage in the early 1800s, intended to carry out arithmetic and logical operations, taking instructions or inputs via hole-punched cards. The machine was not constructed during Babbage's lifetime, but a modified version was built by the London Science Museum in 1991.

cookies Pieces of information from a website, stored by a person's web browser, which may help the website remember information specific to that person, such as items added to an online shopping cart or login details.

correlated risk Multiple negative outcomes or losses, caused by a single event. For example, many homes are likely to be damaged and people injured as the result of a single hurricane.

data analytics Obtaining, cleaning and analysing data to gain useful insights, answer research questions or inform decision making.

digital age Time period beginning in the 1970s and stretching to the present day, characterized by rapid technological advances, including the introduction of the personal computer and the rise of the internet.

digital library Large repository or archive of data, sometimes available to access or download through the internet, for commercial or research purposes. Digital libraries may include images, text or numerical data.

esports Electronic sports, in which individuals or teams of players compete at video games in international tournaments for monetary prizes.

geolocated franchising model Teams of competitive video-game players, based in a specific city, can form a franchise to compete in international or national esports tournaments for a particular game.

live streaming The live broadcast of video or audio content, via the internet. Esports are usually watched through live streaming.

machine learning Finding a mathematical relationship between input variables and an output. This 'learned' relationship can then be used to output predictions, forecasts or classifications given an input.

metrics Quantitative measures of performance. For example, it is important to assess accuracy metrics for automated decision-making algorithms. Similarly, measures such as inflation or the FTSE 100 index could be seen as performance metrics for the economy.

model/modelling Representing real-world processes or problems in mathematical terms; models can be simple or very complex, and are often used to make predictions or forecasts.

STEM The fields of science, technology, engineering and mathematics.

swipe The act of swiping a finger across a smartphone screen, to interact with an app. Swiping is widely used in dating apps, where users often swipe right or left on a photograph of a potential romantic partner, to signal interest or disinterest.

wearable technology Electronic devices that can be worn on the body, including activity monitors and smart watches.
SHOPPING
the 30-second data

With the internet giving a home to a variety of retailers, the consumer can now buy almost anything from the comfort of their own home. The consequence of this is that retailers have been able to harvest extensive and accurate data relating to customers, which means they are better able to target shoppers based on their habits. An example of this can be seen on Amazon – the biggest online retailer in the world – with its ability to recommend items based on your previous purchases, ratings and wish lists. However, the ability to perform this type of activity is not only the realm of companies the size of Amazon. Services now exist offering artificial intelligence (AI) solutions that allow retailers of many sizes to be able to harness the power of these types of algorithms to drive business, which means that the next time an online retailer suggests a T-shirt to go with your jeans, it could be via AI. Data science isn't restricted to shopping suggestions: it also applies to how goods are purchased. Facial recognition technology combined with smart devices allows payments to be authenticated without the use of credit cards.

3-SECOND SAMPLE
Shopping online has changed shopping as we know it, but how is it possible that websites seem to know what we want before we do?

3-MINUTE ANALYSIS
Ever wondered how a website knows the shoes you were looking at the other day? Well, the answer is cookies. These are small pieces of data that come from a website and are stored in the web browser, allowing websites to remember various nuggets of information including past activity or items in a shopping cart, which explains why that pair of shoes just keeps coming back.

RELATED TOPICS
See also
DATA COLLECTION
page 16
LEARNING FROM DATA
page 20
ARTIFICIAL INTELLIGENCE (AI)
page 148

3-SECOND BIOGRAPHY
JEFF BEZOS
1964–
Tech entrepreneur who is the founder, CEO and president of Amazon.

30-SECOND TEXT
Robert Mastrodomenico

Online shopping and new payment methods combine to create a high-tech consumerism that's feeding valuable data to retailers.
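One simple way 'customers who bought this also bought…' suggestions can be produced is by counting which items co-occur in past purchase baskets. The sketch below is a hedged illustration — the baskets are made up, and Amazon's actual item-to-item collaborative filtering is far more sophisticated.

```python
from collections import Counter

# Toy purchase histories; item names are invented for illustration.
baskets = [
    {"jeans", "t-shirt", "belt"},
    {"jeans", "t-shirt"},
    {"jeans", "trainers"},
    {"dress", "belt"},
]

def also_bought(item, baskets):
    """Rank other items by how often they appear in baskets containing `item`."""
    counts = Counter()
    for basket in baskets:
        if item in basket:
            counts.update(basket - {item})  # count each co-purchased item once per basket
    return [i for i, _ in counts.most_common()]

print(also_bought("jeans", baskets))  # 't-shirt' ranks first (bought together twice)
```

In practice the counts would be normalized for item popularity and computed over millions of baskets, but the underlying idea — co-occurrence as a signal of relatedness — is the same.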
DATING
the 30-second data

Sign up to a dating site and you're presented with a number of questions that you have to complete, which will define you and find you your perfect match – how is this possible? The questions are weighted based on their importance, and using these as an input to the algorithms used allows a score to be calculated that shows your satisfaction with other potential matches. It's not all about you, though – the best match also takes into account how well your answers mesh with the potential matches. So the stats behind your match don't assume love is a one-way street, which seems sensible. Online dating also includes the generation of 'swipers' who use dating apps. Here, you are able to see potential matches based on fixed data such as location, age preference and so on. The application then shows you individuals and you register your thoughts on the individuals by swiping left or right. Who appears on your screen is not just based on your fixed preferences; instead, complex algorithms learn from how you and others have used the app and send you individuals who you would most likely respond to positively.

3-SECOND SAMPLE
Online dating has changed the game of finding love to the extent that finding 'the one' is more statistical than you may think.

3-MINUTE ANALYSIS
The 'swipe right' paradox: should dating app users just keep swiping to see everyone on the application? Given that the best selections will come first, every subsequent swipe should give a worse selection, and eventually you will see recycled selections, as someone you said no to is better than a very bad match, at least from a mathematical point of view.

RELATED TOPICS
See also
REGRESSION
page 24
CLUSTERING
page 28
MACHINE LEARNING
page 32

3-SECOND BIOGRAPHIES
DAVID SPIEGELHALTER
1953–
Statistician, Professor of the Public Understanding of Risk and author of Sex by Numbers.
HANNAH FRY
1984–
Mathematician, lecturer, writer and TV presenter who studies patterns of human behaviour in relation to dating and relationships.

30-SECOND TEXT
Robert Mastrodomenico

Love at first swipe? Data scientists are working to make this more likely by matching personal data.
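The weighted, two-way scoring described above can be sketched in a few lines. This is an assumption-laden illustration (the questions, answers and weights are invented): each person's 'satisfaction' is the weighted share of questions where the other's answer is acceptable to them, and the overall match combines both directions so love isn't treated as a one-way street.

```python
from math import sqrt

# Hypothetical questionnaire data: each preference entry is
# (my_answer, answers_I_accept_from_a_match, importance_weight).
def satisfaction(my_prefs, their_answers):
    """Weighted share of questions where the other person's answer
    falls within the set I find acceptable."""
    earned = sum(w for (_, ok, w), ans in zip(my_prefs, their_answers) if ans in ok)
    possible = sum(w for _, _, w in my_prefs)
    return earned / possible

alice = [("yes", {"yes"}, 10), ("no", {"no", "maybe"}, 1)]
bob = [("yes", {"yes", "no"}, 5), ("maybe", {"maybe"}, 5)]
alice_answers = ["yes", "no"]
bob_answers = ["yes", "maybe"]

a_to_b = satisfaction(alice, bob_answers)   # Alice is fully satisfied: 11/11 = 1.0
b_to_a = satisfaction(bob, alice_answers)   # Bob only half: 5/10 = 0.5
match = sqrt(a_to_b * b_to_a)               # geometric mean of both directions ≈ 0.71
```

The geometric mean penalizes lopsided pairings: a match where one side is delighted and the other indifferent scores lower than one where both are moderately satisfied.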
MUSIC
the 30-second data

The movement of music from physical libraries to digital libraries has changed the way we consume music. By having a digital music library, we have access to millions of songs by a variety of artists at the touch of a button. Given this volume of music, how are providers able to give us recommendations and custom playlists based upon our listening habits? Taking Spotify – one of the most popular music streaming services in the world – as an example, it harnesses the power of data by adopting a three-pronged approach to determining what you might like. The first approach comes up with suggestions by comparing your listening habits to those of other users. The second approach applies machine-learning techniques to textual data such as news articles, blogs or even the text data stored within the digital music files themselves to find music you may like. The third approach analyses the raw audio content of the songs to classify similarity. Combining the results of these approaches allows music streaming services to come up with custom playlists for each and every user on the platform, which can include a variety of genres and eras. Such streaming services are constantly evolving to harness new technologies.

3-SECOND SAMPLE
The digital age has opened up the world of music to us, but with so much choice, how can we find new music we like?

3-MINUTE ANALYSIS
If we consider two users: one may like songs a, b, c, d; the other a, b, c, e. Based on this, it could be suggested that the second user try song d and the first user song e, because they both like a, b and c. This is what is done on a much larger level for all music streaming service users.

RELATED TOPICS
See also
DATA COLLECTION
page 16
MACHINE LEARNING
page 32

3-SECOND BIOGRAPHIES
MARTIN LORENTZON & DANIEL EK
1969– & 1983–
Swedish entrepreneurs who co-founded the popular music streaming service Spotify.

30-SECOND TEXT
Robert Mastrodomenico

Music streaming services can harness your listening data to introduce you to new music you might not have heard otherwise.
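The two-user example above is the seed of neighbour-based collaborative filtering: find the user whose listening history overlaps most with yours, then recommend what they like and you haven't heard. A minimal sketch (user names and song labels are illustrative, and real services use far larger models):

```python
# Toy listening histories, mirroring the example: user1 likes a,b,c,d; user2 likes a,b,c,e.
likes = {
    "user1": {"a", "b", "c", "d"},
    "user2": {"a", "b", "c", "e"},
    "user3": {"x", "y"},
}

def jaccard(s, t):
    """Overlap between two listening histories (0 = disjoint, 1 = identical)."""
    return len(s & t) / len(s | t)

def recommend(user):
    """Suggest songs liked by the most similar other user but unheard by `user`."""
    others = [u for u in likes if u != user]
    nearest = max(others, key=lambda u: jaccard(likes[user], likes[u]))
    return sorted(likes[nearest] - likes[user])

print(recommend("user1"))  # ['e'] — borrowed from user2
print(recommend("user2"))  # ['d'] — borrowed from user1
```

At streaming-service scale the same idea is implemented with matrix factorization over millions of users rather than pairwise set overlap, but the intuition — 'people who like what you like can tell you what to try next' — carries over directly.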
10 December 1815
Born in London

1816
Parents separate; Lovelace is left in the care of her mother

1824
Her father, Lord Byron, passes away, aged 36

1829
Contracts the measles, which leaves her paralysed for a year

1833
Meets pioneering mathematician Charles Babbage for the first time

1835
Marries William King

1836
Becomes a mother for the first time

1838
Becomes Countess of Lovelace, following her husband's earlship

1840s
Attempts to develop a mathematical scheme for gambling and nearly ends up destitute

1842–3
Works on her translation (with notes) on the Analytical Engine

27 November 1852
Dies in London

ADA LOVELACE

Ada Lovelace is considered by many to be the founder of computer programming. A gifted mathematician – at a time when women were almost unheard of in STEM fields – she is said to have possessed a unique intuition into the workings of the analytical engine, the first digital computer. Her contributions to modern-day computer science are inescapable; her work either directly laid the foundations for or, at the very least, catalysed the ideas of many of the 'fathers' of modern computing, from Babbage to Turing.

Lovelace's mother, Lady Byron, ensured that her child grew up surrounded by mathematics tutors. Lady Byron was not a lover of mathematics, per se; rather, she saw maths and logic as a means to subdue her daughter's poetic disposition, a tendency that, she feared, Lovelace might have contracted from her father – none other than the poet Byron – despite never having had contact with him past infancy. Lovelace is said to have retained a fondness for her absent father, and her mathematical inventions were often imbued with a creativity that spoke of an artistic mind, not just a logical one.

Lovelace was introduced to the emerging world of computing machines at age 17, when she met Charles Babbage, who is generally credited as being the inventor of the first digital computer. She was fascinated by Babbage's work, and the two worked together over the next 20 years that comprised the remainder of Lovelace's short life. In the process, she helped craft the design and structure of the analytical engine, a complex and more powerful iteration of Babbage's difference engine that was supposed to, in Lovelace's words, 'weave... algebraic patterns, just as the Jacquard loom weaves flowers and leaves'.

While universally recognized as a central contributor to the analytical engine, Lovelace is better known for her translation of French engineer Luigi Menabrea's work on the analytical engine in 1843. Her translation was significantly longer than the original and contained many new ideas and improvements added in by Lovelace herself in the form of notes. This set of notes is supposed to have been a key inspiration for Alan Turing's first 'modern computer' nearly a century later. Lovelace's imprint on modern computing is indelible. Even more remarkable is that she did so much in a life that spanned only 36 years, all the while balancing her roles as aristocrat, mother and mathematician.

Aditya Ranganathan


SPORTS
the 30-second data

Fields, courts and pitches have always welcomed professional and amateur statisticians alike to measure team and player performance. Common baseball metrics like Runs Batted In (RBI) and Earned Run Average (ERA) have been reliably recorded since the nineteenth century. Recent advancements in technology, however, have helped to launch a data science explosion felt by both participants and spectators. The invention of wearable technology has allowed data scientists to track athletes and activities. In tennis, for example, many professionals have turned to using racquets with embedded sensors, which allow them to track speed, rotation and ball hit location in real time. Other advancements include the expanded use of cameras and radar devices. In most major sports, the universal use of sophisticated cameras has provided participants and fans access to insights that were previously unimaginable. Using the new metrics provided by high-accuracy camera technology in Major League Baseball, for instance, team quantitative analysts have demonstrated that performance improves when batters correct their 'launch angle'. As athletes continue to use data science to improve their games, there is no sign that the ongoing arms race in sports metrics will get thrown out any time soon.

3-SECOND SAMPLE
While Moneyball is the most famous example, it pales in comparison to the advancements facilitated by the data revolution in sports, affecting the experiences of players, managers, referees and fans.

3-MINUTE ANALYSIS
Commentary surrounding the increase in data science's influence in sports often presents two competing factions: the purists and the nerds. The film Moneyball presented a heterodox, quantitatively-inclined baseball manager upending a system designed by experienced scouts. Recently, some athletes have criticized statisticians for their lack of experience. However, the most successful data teams leverage expert insights to complement their analytics.

RELATED TOPICS
See also
LEARNING FROM DATA
page 20
STATISTICS & MODELLING
page 30

3-SECOND BIOGRAPHIES
BILL JAMES
1949–
Baseball writer and statistician who created a wide array of player evaluation formulas.
BILLY BEANE
1962–
Pioneered the use of non-traditional metrics for scouting undervalued baseball players.

30-SECOND TEXT
Scott Tranter

Sports data science works best when integrating athlete experience and scientists' numbers.
SOCIAL MEDIA
the 30-second data

In just a few years, companies like Facebook, Snapchat and Twitter went from small internet start-ups to multibillion-pound tech giants with priceless quantities of influence. Facebook is now reaching nearly 90 per cent of all mobile users in the US, while Twitter boasts 100 million daily active users, enough to become the fifteenth most populous country in the world. With such a massive user outreach, companies can leverage the enormous amounts of data generated on these platforms to discover new trends and insights about their audience, then apply this knowledge to make smarter, more reliable business decisions. By tracking users and implementing algorithms to learn their interests, social media companies have been able to deliver highly targeted adverts and generate billions of pounds in ad revenue every year. These same machine-learning algorithms can be used to tailor the content each user sees on their screen. From a timeline to suggested friends, social media companies play a prominent role in how users interact with their apps and, subsequently, the world around them. What once started as a way to update friends on one's status has evolved into a public forum, marketplace and news outlet all rolled into one.

3-SECOND SAMPLE
Since the beginning of the twenty-first century, social media has taken over how humans interact, access news and discover new trends, driven, in large part, by data science.

3-MINUTE ANALYSIS
TV shows such as Black Mirror provide us with a different view towards continuing advancements in social media. The episode 'Nosedive' depicts a world where 'social credit', deriving from a mixture of in-person and online interactions, dictates where a person can live, what they can buy, who they can talk to and more. China has begun implementing a Social Credit System to determine the trustworthiness of individuals and accept or deny individuals for functions such as receiving loans and travelling.

RELATED TOPICS
See also
PRIVACY
page 86
MARKETING
page 108
TRUSTWORTHINESS SCORE
page 146

3-SECOND BIOGRAPHIES
JACK DORSEY
1976–
Co-founder and CEO of Twitter.
MARK ZUCKERBERG
1984–
Co-founder and CEO of Facebook, and the youngest self-made billionaire, at 23.

30-SECOND TEXT
Scott Tranter

The rapid growth of social media has seen it infiltrate everyday life, with data capture capabilities on an unprecedented scale.
GAMING
the 30-second data

Competitive video gaming, known as esports, is an emerging global phenomenon in which professional players compete in packed stadiums for prize pools reaching millions of pounds. Unlike with traditional sporting events, esports fans engage more directly with content via online live streaming technology on platforms such as Twitch. Esports consumers largely consist of males in the 20 to 30 age range, a prime demographic that companies wish to target. By tracking the fan base's habits and interests using analytical tools and survey methods, companies have been able to tailor content based on the audience they wish to target. However, because of the esports audience's reduced television consumption and tendency to block internet adverts using browser-based ad-blocking technology, companies are looking into non-traditional methods to reach this demographic. For example, due to the digital nature of esports, brands have the ability to display their products directly in the video games, avoiding ad-blockers altogether. Additionally, professional esports players have a large influence on how their fans may view certain products. To take advantage of this, companies often partner with these influencers and utilize their popularity in order to reach target audiences for their products.

3-SECOND SAMPLE
Esports is engaging its young, digital-savvy fans through non-traditional online media, paving the way for data science into the recreational trends of young generations.

3-MINUTE ANALYSIS
Although esports thrived off the back of online live streaming technology, the industry has also begun broadcasting esports on television, with ads displaying during commercial breaks, akin to traditional sports. Esports companies are adopting the geolocated franchising model, which looks to take advantage of television advertising and sponsorship deals for its revenue. With this move, esports has an opportunity to expand its reach, opening up the door for mainstream popularity.

RELATED TOPICS
See also
LEARNING FROM DATA
page 20
SPORTS
page 126

3-SECOND BIOGRAPHIES
JUSTIN KAN
1983–
American internet entrepreneur who co-founded Twitch, formerly Justin.tv, the most popular streaming platform for esports content.
TYLER 'NINJA' BLEVINS
1991–
American Twitch streamer, internet personality and former professional gamer who helped bring mainstream attention to the world of esports.

30-SECOND TEXT
Scott Tranter

As the esports industry grows, top players may soon be able to sign endorsement deals in the ballpark of professional athletes.
GAMBLING
the 30-second data

In gambling, everything from the likelihood that the dealer will bust in blackjack to the placement of specific slot machines at key locations is driven by statistics. And, in the evolving world of data science, those with greater access to it can find themselves at a huge advantage over others. This ranges from the simple approach of an experienced poker player understanding the odds of turning his straight-draw into a winning hand – and the correlated risk of pursuing that potential hand – to the more advanced techniques casinos use to turn vast amounts of unstructured data into predictions on the best way to entice players to bet, and to bet more, on lower-odd payouts. Resources exist for both the house and for the player, and they extend well beyond card games and slot machines. Statistical models can impact the payout of sports events – oftentimes adjusting odds in real time and based on the direction that money is moving – in a way that can minimize the risk of the sportsbook (the part of casinos that manages sports betting). By the same token, some gamblers use or create statistical models to make educated decisions on outcomes that are data-driven rather than narrative-driven, giving them an edge on those following their instinct.

3-SECOND SAMPLE
Data science and gambling can blend together with devastating effect – and has made the adage 'the house always wins' even more true.

3-MINUTE ANALYSIS
There have been reports on the ways in which casinos are utilizing decades' worth of player data (tied back to individual players through their rewards cards), while plenty of 'expert' gamblers have written books designed to 'beat the house'. Those with designs on gambling based on luck are simply playing the wrong game – they should be playing the stats – while hoping that Lady Luck still shines on them.

RELATED TOPICS
See also
LEARNING FROM DATA
page 20
SURVEILLANCE
page 82
SPORTS
page 126

3-SECOND BIOGRAPHIES
RICHARD EPSTEIN
1927–
Game theorist who has served as an influential statistical consultant for casinos.
EDWARD O. THORP
1932–
Mathematician who pioneered successful models used on Wall Street and in casinos.

30-SECOND TEXT
Scott Tranter

Move over Lady Luck: professional gamblers now pit their data skills against those of the house.
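The straight-draw example above can be put in numbers. A hedged sketch, assuming a textbook open-ended straight draw on the turn (8 helpful cards among 46 unseen) and illustrative pot sizes — real decisions also weigh the correlated risk of hitting the draw and still losing:

```python
# Back-of-envelope odds for a poker draw, plus the expected value of calling a bet.
# The pot and bet sizes below are invented for illustration.

def hit_probability(outs: int, unseen: int) -> float:
    """Chance the next card dealt completes the draw."""
    return outs / unseen

def call_ev(p_win: float, pot: float, call: float) -> float:
    """Expected profit of calling: win the pot with probability p_win,
    lose the call amount otherwise."""
    return p_win * pot - (1 - p_win) * call

# Open-ended straight draw on the turn: 8 helpful cards among 46 unseen.
p = hit_probability(8, 46)                      # ≈ 0.174
print(round(call_ev(p, pot=100, call=20), 2))   # ≈ 0.87: marginally profitable call
```

The same expected-value arithmetic, scaled up and fed with behavioural data, underpins how sportsbooks set and adjust odds so that the book's expected profit stays positive.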
THE FUTURE
GLOSSARY

AI (artificial intelligence) Often used interchangeably with 'machine learning'. The process of programming a computer to find patterns or anomalies in large data sets, or to find the mathematical relationship between some input variables and an output.

box plot Data visualization which shows the distribution, or shape, of a data set. It includes important descriptive statistics including the median, lower and upper quartiles, and any outliers. This is sometimes referred to as a 'box and whisker plot'.

chat bot Computer program designed to interact or 'chat' with human users through text messaging; often used in customer service, to deal with customers' questions and problems more efficiently than a human could.

data ethics Concerned with how data can be ethically collected, stored, analysed and distributed; especially important when handling personal data.

data legality Legislation concerned with how data can be collected, stored and accessed, and by whom.

deepfake An image, video or audio clip that has been altered using AI to show a person doing or saying something they have not done. This may include superimposing a person's head on the body of another person in an image, or a speech track over a video of a person's mouth moving.

genomics The study of the structure and function of DNA.

longitudinal behavioural surveys Study that takes repeated measurements from the same subjects or participants over time.

machine learning Finding a mathematical relationship between input variables and an output. This 'learned' relationship can then be used to output predictions, forecasts or classifications given an input.

mine information Collecting data, typically from the internet, on a large scale. Data can be mined or 'scraped' directly from websites. There are important ethical and privacy considerations when mining personal data.

nanotechnologies Technologies that exist at nanometre scale, and often involve the manipulation of individual molecules. This includes nanomaterials such as carbon nanotubes, and 'buckyballs' or Buckminsterfullerene.

self-learning Type of machine learning, commonly used to find patterns or structure in data sets. Also known as 'unsupervised learning'.

smart Refers to internet-connected devices with real-time analysis or machine learning capabilities. Smart watches typically include physical activity monitors and internet connectivity, and smart TVs may include voice recognition.

social scoring System whereby citizens are 'scored' by government agencies according to their behaviour, which may include adherence to the law, financial stability, employment status or educational attainment. A person's social score may affect their ability to access social services or loans.

stem-and-leaf display A form of data visualization, similar to a histogram, used to show the distribution or shape of a data set.

time series analysis The analysis of a signal or variable that changes over time. This can include identifying seasonal trends or patterns in the data, or forecasting future values of the variable.

topology Branch of mathematics concerned with geometric objects and their properties when they are stretched, twisted or crumpled.
PERSONALIZED MEDICINE
the 30-second data

Humans have always been interested in themselves. So it's no surprise that they want to know what's happening in their bodies – at all times. Consumer demand for personalized health data has fuelled the success of smart watches, fitness trackers and other wearable devices which give real-time feedback. But what does the next generation of wearables look like? And what can the data tell us? With technology so deeply ingrained in our lives, it is easy to imagine a future with technologically advanced clothing, smart skin patches or ingestible nanotechnologies which detect or monitor disease. Instead of a one-off blood test, we could all be wearing a smart patch made of a series of micro-needle sensors that continually track chemical changes under the skin. Or flexible and stretchable sensors resembling tattoos, which could monitor lactate during a workout or sense changes in environmental chemicals and pollutants. And imagine the data – huge amounts of data. Future wearable technology will collect thousands of data points a minute, maybe even a second, which will need powerful algorithms, machine learning and AI to reduce the data into meaning. This will be essential to mine the information, to better understand disease, population-wide health trends and the vital signs to predict a medical emergency.

3-SECOND SAMPLE
Wearable technology could tap into huge amounts of human data, opening up the possibility of real-time healthcare, along with new ways to detect and prevent disease.

3-MINUTE ANALYSIS
As the line between consumer wearables and medical devices blurs, concerns are rising about the reliability and security of data. For example, current smartphone apps for melanoma skin cancer detection have a high failure rate. If people are deciding to change their lifestyle, or medical professionals are making treatment decisions, it is crucial that wearable devices go through any necessary clinical trials and are supported by strong scientific evidence.

RELATED TOPICS
See also
EPIDEMIOLOGY
page 76
ETHICS
page 152

3-SECOND BIOGRAPHIES
JOSEPH WANG
1948–
American researcher and director of the Center for Wearable Sensors at University of California, San Diego, who is pioneering wearable sensors to monitor disease.
JEFF WILLIAMS
1963–
Apple's chief operating officer, who oversees the Apple Watch and the company's health initiatives.

30-SECOND TEXT
Stephanie McClellan

To attain secure and reliable data, the personal healthcare industry needs to be properly regulated.
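One of the simplest ways an algorithm can 'reduce the data into meaning' is anomaly detection on a stream of vital signs: flag any reading that departs sharply from the wearer's own recent baseline. A minimal sketch, assuming a simulated heart-rate stream (the numbers and the 3-sigma threshold are illustrative, not clinical guidance):

```python
from statistics import mean, stdev

def anomalies(readings, window=5, threshold=3.0):
    """Return indices of readings that sit more than `threshold` standard
    deviations from the mean of the preceding `window` readings."""
    flagged = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(readings[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

heart_rate = [72, 74, 71, 73, 72, 75, 73, 140, 74, 72]  # simulated bpm stream
print(anomalies(heart_rate))  # the sudden spike to 140 bpm is flagged
```

A real system would combine many signals, learn each person's baseline over weeks rather than five readings, and require clinical validation before alerting anyone to a possible emergency.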
MENTAL HEALTH
the 30-second data

Mental health disorders affect over 970 million people worldwide, tend to be under-diagnosed, can have long-term effects and often carry social stigma. Mental health data involves longitudinal behavioural surveys, brain scans, administrative healthcare data and genomics research. Such data is difficult to obtain and of a sensitive nature. Data science facilitates access to this data, and its applications to mental health include virtual counselling, tele-psychiatry, effective use of social media and analysing mobile health device data. Data science interfaces with mobile applications to track patient mood with visual devices such as flowcharts, produce algorithms to generate targeted advice and connect people to a therapist in order to improve symptoms of depression and anxiety, for example. Machine learning mines unstructured text from social media and similar sources to detect symptoms of mental illness, offers diagnoses and creates algorithms to predict the risk of suicide. Furthermore, AI can combine brain scans and neural networks to deliver personalized psychiatric medicine. However, preserving confidentiality of personal information, maintaining data security and transparency with the use of data are key.

3-SECOND SAMPLE
Data science enables digital mental healthcare, to improve access and treatment outcomes.

3-MINUTE ANALYSIS
Mental healthcare has been made more accessible to overcome provider shortages and stigma via tele-psychiatry and virtual agents. A 3D chat bot named Ellie has the capability to show facial expressions and detect non-verbal cues. Patients' verbal input and facial expressions run through algorithms, which determine Ellie's visual and verbal responses. In a study, Ellie was more effective in diagnosing post-traumatic stress disorder among military personnel during routine health exams. The ability to build rapport and trust with patients drives Ellie's effectiveness.

RELATED TOPICS
See also
NEURAL NETWORKS & DEEP LEARNING
page 34
HEALTH
page 92
PERSONALIZED MEDICINE
page 138

3-SECOND BIOGRAPHIES
EMIL KRAEPELIN
1856–1926
Used data science concepts in his diagnosis methods for schizophrenia and bipolar disorder.
AARON BECK
1921–
Created the Beck Depression Inventory, used to diagnose depression.

30-SECOND TEXT
Rupa R. Patel

The application of data science to mental health carries important ethical considerations.
SMART HOMES

the 30-second data

Our homes are growing smarter: from turning the lights on to playing a song or even ordering a pizza, devices that have become part of the furniture of the modern home can now do it all. Actions that were fully manual are controlled via a variety of methods, be it by phone or smart speaker, so the central heating can be turned on at the touch of a button or the sound of a voice. The more smart devices are used in the home, the more data they emit and the more information that becomes available to collect. The volume of data lends itself well to applying machine learning and AI techniques. In the case of smart speakers, for example, once you've asked for something, it is sent over the internet to be processed by a voice-processing service, which does all the hard work. The more data the speaker receives, the better it becomes at processing a request. This means that a smart home should grow increasingly smarter. One worry is how they learn: major companies have had to apologize for employing people to listen in to conversations in private homes with the alleged aim to improve voice recognition – valid concerns over privacy loom.

3-SECOND SAMPLE
Machine learning and AI have now taken the average home and turned it into a smart home.

3-MINUTE ANALYSIS
Smart speakers now use self-learning technology, which means that they can learn from the errors they make. So if you ask for something that the speaker doesn't understand and then ask again in a different way, it can learn that the first request is related to the second, combine the two and learn from them both.

RELATED TOPICS
See also
TOOLS, page 22
MACHINE LEARNING, page 32
ARTIFICIAL INTELLIGENCE (AI), page 148

3-SECOND BIOGRAPHIES
WILLIAM C. DERSCH
fl. 1960s
Creator of IBM's Shoebox, the first known voice assistant.

ROHIT PRASAD & TONI REID
fl. 2013–
Creators of the Alexa assistant and AI specialists at Amazon.

30-SECOND TEXT
Robert Mastrodomenico

'Smart' stands for Self-Monitoring Analysis and Reporting Technology, and such devices link to the Internet of Things.
JOHN W. TUKEY

16 June 1915
Born in Bedford, Massachusetts, USA

1933
Enters school for the first time, attending Brown University

1936
Receives Bachelor's degree in Chemistry

1939
Receives PhD in Mathematics at Princeton University, and is appointed to the mathematics faculty

1945
Begins a 40-year association with AT&T Bell Laboratories

1960
Begins a two-decade association with NBC, working on election night forecasting and developing new methods for avoiding erroneous judgements. For several years Tukey can be seen walking around in the background of the NBC set

1965
Becomes the founding chairman of Princeton's new Department of Statistics

1973
Awarded the US National Medal of Science; advises every US President from Eisenhower to Ford

1985
Retires from Princeton and Bell Laboratories, and delivers valedictory address with the title 'Sunset Salvo'. Continues to advise many boards and governmental councils

26 July 2000
Dies in New Brunswick, New Jersey, USA

John W. Tukey was born in 1915, and he showed unusual brilliance from a very early age. He was schooled at home by his parents until he entered Brown University, Rhode Island. He graduated in three years with BA and MS degrees in Chemistry. From Brown he moved on to Princeton, where he started in Chemistry but soon switched to Mathematics, getting a PhD at age 24 in 1939 with a dissertation in Topology, and he then moved directly to a faculty position at the same university. He remained at Princeton until he retired in 1985, adding a part-time appointment at Bell Telephone Laboratories in 1945.

Over a long career, Tukey left his imprint in many fields. In mathematics: Tukey's formulation of the axiom of choice; in time series analysis: the Cooley–Tukey Fast Fourier Transform; in statistics: exploratory data analysis, the jackknife, the one-degree test for non-additivity, projection pursuit (an early form of machine learning), and Tukey's method of guaranteeing the accuracy of a set of simultaneous experimental comparisons. In data analysis alone he created an array of graphical displays that have since become standard, widely used to facilitate the discovery of unsuspected patterns in data and to learn about distributions of data, including the box plot and the stem-and-leaf display. He coined other new terms that also became standard terminology, including 'bit' (short for 'binary digit') for the smallest unit of information transferred, and even more terms that did not catch on (virtually no one remembers his 1947 term for a reciprocal second: a 'whiz').

It was through his teaching and consistent emphasis on the importance of exploratory data analysis as a basic component of scientific investigation that Tukey was established as a founder of what some today call data science, and some credit him with coining the term. From 1960 to 1980, he worked with the television network NBC as part of an election night forecasting team that used early partial results to 'call' the contested races of interest, working with several of his students, including at different times David Wallace and David Brillinger. In 1960 he prevented NBC from prematurely announcing the election of Richard Nixon. Tukey's favourite pastimes were square dancing, bird watching and reading several hundreds of science fiction novels. He died in 2000 in New Brunswick, New Jersey.

Stephen Stigler


TRUSTWORTHINESS SCORE

the 30-second data

The idea of being judged in society based on your behaviour is not new, but the data age is giving it all new possibilities and meaning. China is the first to take advantage of this on a population scale, by giving each of its citizens a social credit score, which can move up or down, based upon AI and data collection techniques of each individual's behaviour. The methodology of the score is a secret, but examples of infractions are bad driving (captured by China's extensive camera systems), buying too many video games and posting what is deemed 'fake news' online. The result? A low score will be punishable by actions such as travel bans, not allowing your child into certain schools, stopping you from getting the best jobs or staying at the best hotels, slowing of your internet and, at its most base level, a loss of social status (including publishing your score with your online dating profile). Other countries and companies are on the brink of using systems very similar to this. The UK, for example, has been accused of getting quite close with the usage of data from sources such as credit scores, phone usage and even rent payments to filter access to many social services and jobs.

3-SECOND SAMPLE
Yell at someone in traffic who cuts in front of you? Your social credit score might go down: a data-driven score based on individual behaviour (a black box of lifestyle traits).

3-MINUTE ANALYSIS
Companies are not immune to interest in a social score. Facebook reportedly assigns all of its users a trustworthiness score, with the intention to combat the spread of misinformation. In this case, the score itself and how it is computed are a complete secret. In 2016, the online profiles of over 70,000 OKCupid users (including usernames, political interest, drug usage and sexual exploits) were scraped and published by academic researchers.

RELATED TOPICS
See also
SURVEILLANCE, page 82
PRIVACY, page 86
ETHICS, page 152

3-SECOND BIOGRAPHY
EARL JUDSON ISAAC
1921–83
Mathematician who, along with Bill Fair, founded a standardized, impartial credit scoring system.

30-SECOND TEXT
Liberty Vittert

The amount of public and private data on individuals is endless, so where will our social scoring end?
ARTIFICIAL INTELLIGENCE (AI)

the 30-second data

Artificial Intelligence is less about sentient robots and more about computers that can find patterns and make predictions using data. You encounter AI every time your phone autocompletes your messages, whenever you use a voice recognition service or when your bank catches a fraudulent transaction on your credit card. On many social media platforms, facial recognition technology uses AI to establish whether an uploaded image contains a face, and to identify the subject by matching against a database of images. At airports, this same technology is increasingly used to identify travellers and perform passport checks. However, facial recognition algorithms are notorious for performing poorly on darker skinned faces, so a technology that saves time at passport control for one person could misidentify another traveller as a wanted criminal. As well as performing image recognition tasks, AI can be used to generate hyper-realistic images of faces or landscapes. This has benefitted the computer graphics and video game industries, but has also enabled the rise of the 'deepfake', an image or video in which a person is shown doing or saying something they haven't done. As deepfakes become ever more convincing, governments and news outlets must find ways to combat this new form of misinformation.

3-SECOND SAMPLE
It sounds like the stuff of science fiction, but AI is already being used to power apps and services in use every day.

3-MINUTE ANALYSIS
The full impact of AI on the labour market is unknown, and advances such as driverless cars may lead to unemployment or a need to re-skill. However, many automated systems will always require a degree of human oversight or interpretation, particularly in sensitive fields such as medicine or criminal justice.

RELATED TOPICS
See also
MACHINE LEARNING, page 32
BIAS IN ALGORITHMS, page 50
ETHICS, page 152

3-SECOND BIOGRAPHY
ALAN TURING
1912–54
British mathematician and computer scientist, widely considered the father of AI.

30-SECOND TEXT
Maryam Ahmed

Facial recognition is one of the most widespread uses of AI in everyday life.
REGULATION

the 30-second data

Advances in data science are raising new questions that have politicians and legislators scratching their heads. How can personal data be collected and stored ethically and securely? Who should check whether algorithms are biased? And whose fault is it when the irresponsible use of data causes real-life harm? In Europe, the General Data Protection Regulation (GDPR) gives citizens control over their data; organizations must ask for permission to collect personal data, in simple language rather than legal-speak, and delete it when asked. The European Commission recommends that AI should be transparent and unbiased with human supervision if needed, and in the US the Algorithmic Accountability Act will soon require that AI systems are ethical and non-discriminatory. In the wake of high-profile data privacy scandals at Facebook and DeepMind, some companies have made efforts to self-regulate, with varying success. Microsoft and IBM have publicly committed to building fair and accountable algorithms, with data privacy and security as key concerns. Meanwhile, Google has explicitly pledged not to develop algorithms for weapons systems, or technologies that contravene human rights, although its AI ethics board was dissolved just one week after being founded, due to board member controversies.

3-SECOND SAMPLE
Technology is evolving too fast for lawmakers to keep up with, so who should be responsible for regulating the use of data science and AI?

3-MINUTE ANALYSIS
NGOs can hold algorithms to account and effect change where governments fail. The Algorithmic Justice League in the US has exposed racial and gender biases in IBM and Microsoft's face recognition systems, and in the UK, Big Brother Watch has found low accuracy rates in face recognition algorithms used by some police forces. Algorithm Watch has investigated an opaque credit scoring algorithm in Germany, demonstrating that some demographic groups are unfairly penalized.

RELATED TOPICS
See also
BIAS IN ALGORITHMS, page 50
IBM'S WATSON & GOOGLE'S DEEPMIND, page 94
ARTIFICIAL INTELLIGENCE (AI), page 148

3-SECOND BIOGRAPHY
CORY BOOKER
1969–
US Senator (Democrat) and sponsor of the Algorithmic Accountability Act.

30-SECOND TEXT
Maryam Ahmed

The main challenge for data regulators is to keep up with the fast pace of technological developments.
ETHICS

the 30-second data

Data is the new oil, except it isn't the earth that is being excavated – it's you. Your data is being used as a commodity. It is being bought, sold and potentially stolen, from the Weather Channel selling your geo-located data, to political advertising companies matching voter registration records to Facebook profiles, companies sorting through CVs based upon algorithmic functions, or insurance companies selling your private medical information to third parties to contact you about 'further treatment'. With this new industry come overwhelming ethical concerns. Both the private and public sector, behind closed doors, must make very difficult ethical decisions about what they can and cannot do with your data. These questions arise not only in the pursuit of making money, but also in an effort to mitigate and solve serious problems. For example, should Apple, or other electronic/technology companies, give the government access to the data on the mobile phone of an alleged terrorist? Data ethics encompasses the collection, analysis and dissemination of data (including who owns it), as well as any associated algorithmic processes and AI. Professional statistical bodies and companies alike have tried to convene thought groups on the many quandaries surrounding data ethics.

3-SECOND SAMPLE
Data ethics seems antithetical: 21254, what can be unethical about that? But the saying, 'figures never lie, but liars always figure' isn't really that true.

3-MINUTE ANALYSIS
From the leaks by Edward Snowden to the extent that companies have an ethical or legal responsibility to inform users of how their data will be used, the conflation of data ethics versus data legality is a complicated issue. In the Cambridge Analytica scandal, while their business practice was questionable, thousands, if not tens of thousands, of other app developers were using the features that Facebook itself created, to do precisely the same thing that Cambridge Analytica was doing. What is ethical and which is the responsible party?

RELATED TOPICS
See also
PRIVACY, page 86
REGULATION, page 150

3-SECOND BIOGRAPHIES
RUFUS POLLOCK
1980–
Founder of the Open Knowledge Foundation, which promotes and shares open data and content.

EDWARD SNOWDEN
1983–
American whistle-blower of classified National Security Agency information.

30-SECOND TEXT
Liberty Vittert

Data ethics is an ever-changing discipline based upon current practices.
RESOURCES

BOOKS & ESSAYS

A First Course in Machine Learning
S. Rogers & M. Girolami
Chapman and Hall/CRC (2016)

An Accidental Statistician: The Life and Memories of George E.P. Box
G.E.P. Box
Wiley (2013)

Alan Turing: The Enigma
Andrew Hodges
Vintage (2014)

The Art of Statistics: How to Learn from Data
David Spiegelhalter
Pelican Books (2019)

The Book of Why: The New Science of Cause and Effect
Judea Pearl & Dana Mackenzie
Allen Lane (2018)

'Darwin, Galton and the Statistical Enlightenment'
S. Stigler
Jour. of the Royal Statist. Soc. (2010)

Data Science for Healthcare
Sergio Consoli, Diego Reforgiato Recupero, Milan Petković (Eds)
Springer (2019)

The Elements of Statistical Learning
J. Friedman, T. Hastie & R. Tibshirani
Springer (2009)

Get Out the Vote!
Donald P. Green & Alan S. Gerber
EDS Publications Ltd (2008)

Healthcare Data Analytics
Chandan K. Reddy, Charu C. Aggarwal (Eds)
Chapman & Hall/CRC (2015)

Invisible Women
Caroline Criado Perez
Chatto & Windus (2019)

'John Wilder Tukey 16 June 1915–26 July 2000'
P. McCullagh
Biographical Memoirs of Fellows of the Royal Society (2003)

Machine Learning: A Probabilistic Perspective
K.P. Murphy
MIT Press (2012)

The Mathematics of Love
Hannah Fry
Simon & Schuster (2015)

Memories of My Life
F. Galton
Methuen & Co. (1908)

Naked Statistics: Stripping the Dread from the Data
Charles Wheelan
W.W. Norton & Company (2014)

The Numerati
Stephen Baker
Mariner Books (2009)

Pattern Recognition and Machine Learning
C.M. Bishop
Springer (2006)

The Practice of Data Analysis: Essays in Honour of John W. Tukey
D. Brillinger (Ed)
Princeton Univ. Press (1997)

Statistics Done Wrong: The Woefully Complete Guide
Alex Reinhart
No Starch Press (2015)

The Victory Lab
Sasha Issenberg
Broadway Books (2013)

WEBSITES

Coursera
www.coursera.org/learn/machine-learning

Data Camp
www.datacamp.com/courses/introduction-to-data

The Gender Shades project
gendershades.org
Uncovered bias in facial recognition algorithms

ProPublica
www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
Investigated the COMPAS algorithm for risk-scoring prisoners

Simply Statistics
simplystatistics.org

Udemy
www.udemy.com/topic/data-science/
NOTES ON CONTRIBUTORS

EDITOR
Liberty Vittert is a Professor of the Practice of Data Science at the Olin Business School at the Washington University in St Louis. She is a regular contributor to many news organizations as well as having a weekly column "A Statistician's Guide to Life" on Fox Business. As a Royal Statistical Society Ambassador, BBC Expert Woman and an Elected Member of the International Statistical Institute, Liberty works to communicate statistics and data to the public. She is also an Associate Editor for the Harvard Data Science Review and is on the board of USA for UN Refugee Agency (UNHCR) as well as the HIVE, a UN Refugee Agency data initiative for refugees.

FOREWORD
Xiao-Li Meng is the Whipple V. N. Jones Professor of Statistics at Harvard University, and the Founding Editor-in-Chief of Harvard Data Science Review. He was named the best statistician under the age of 40 by COPSS (Committee of Presidents of Statistical Societies) in 2001, and served as the Chair of the Department of Statistics (2004–12) and the Dean of Graduate School of Arts and Sciences (2012–17).

CONTRIBUTORS
Maryam Ahmed is a data scientist and journalist at BBC News, with a PhD in Engineering from the University of Oxford. She has reported on issues such as targeted political advertising and the gender pay gap. Maryam is a strong advocate of transparency in the public sphere, and has spoken on this topic at venues including the Royal Society of Arts.

Vinny Davies completed his PhD in Statistics before becoming an academic researcher in Machine Learning. He has spent most of his career looking at applications of probabilistic models in Biology and Chemistry, including models for vaccine selection and the left ventricle of the heart.

Sivan Gamliel, Director, is a member of BlackRock Alternative Advisors, the firm's hedge fund solutions team, where she serves as Head of Quantitative Strategies. She received her Bachelor of Science degree in Physics from Massachusetts Institute of Technology, USA.

Rafael A. Irizarry is professor and chair of the Department of Data Sciences at Dana-Farber Cancer Institute and Professor of Biostatistics at Harvard T.H. Chan School of Public Health.

Robert Mastrodomenico PhD (University of Reading) is a data scientist and statistician. His research interests revolve around the modelling of sport events and computational techniques, with specific focus on the Python programming language.

Stephanie McClellan is a science writer based in London, and has an MSc in Science Communications from Imperial College London. She has been a freelance writer for institutions such as the European Space Agency, the BBC, CERN, and the United Nations Educational, Scientific and Cultural Organization (UNESCO). She has also worked in the national press office at Cancer Research for five years.

Regina Nuzzo has a PhD in Statistics from Stanford University and graduate training in Science Writing from University of California Santa Cruz. Her writings on probability, data and statistics have appeared in the Los Angeles Times, New York Times, Nature, Science News, Scientific American and New Scientist, among others.

Rupa Patel is a physician scientist and is the Founder and Director of the Washington University in St Louis Biomedical HIV Prevention programme. She is also a technical advisor for the World Health Organization. Dr Patel utilizes data science to improve implementation of evidence-based HIV prevention strategies in clinics, health departments and community organizations in the US, Africa and Asia.

Aditya Ranganathan is the chief evangelist for Sense & Sensibility and Science (S&S&S), a UC Berkeley Big Ideas course – founded by Saul Perlmutter – on critical thinking, group decision making and applied rationality. He also serves on the board of Public Editor, a citizen science approach to low-quality news and fake news. Aditya is pursuing his PhD at Harvard University, where he studies collective behaviour (with implications for group dynamics and education).

Willy C. Shih is the Robert & Jane Cizik Professor of Management Practice at the Harvard Business School. He worked in the computer and consumer electronics industries for 28 years, and has been at the school for 13 years.

Stephen M. Stigler is the Ernest DeWitt Burton Distinguished Service Professor of Statistics at the University of Chicago. Among his many published works is 'Stigler's Law of Eponymy' ('No scientific discovery is named after its original discoverer' in Trans. N. Y. Acad. Sci. 1980, 39: 147–158). His most recent book on the history of statistics is The Seven Pillars of Statistical Wisdom (2016).

Scott Tranter is the former Director of Data Science for Marco Rubio for President and founder of Øptimus, a data and technology company based in Washington, DC. Tranter has worked in both the political and commercial spaces where the science of using data to innovate how we do everything from elect our leaders to sell people cars has been evolving over the last several decades.

Katrina Westerhof helps companies develop and adopt emerging technologies, particularly in spaces that are being upended by analytics and the Internet of things. She has a diverse background in consulting, innovation, engineering and entrepreneurship across the energy, manufacturing and materials industries.

INDEX

A
ad-blockers 130
aggregation 80, 86, 110
Algorithmic Accountability Act 150
algorithms 32, 50, 66, 82, 84; and business 110; and future trends 138, 140, 148, 150, 152; and pleasure 118, 120, 128; and society 90
Alibaba 108
AlphaGo 94
Amazon 8, 34, 50, 108, 118, 142
American Statistical Association 89
Analytical Engine 124–5
anonymization 86
artificial intelligence (AI) 22; and future trends 138, 140, 142, 146, 148, 150, 152; and pleasure 118; and society 94; and uncertainty 50
astrophysics 64

B
Babbage, Charles 124–5
Baker, Mitchell 86
Bayes, Thomas 30
Beane, Billy 126
Beck, Aaron 140
Bengio, Yoshua 34
Berners-Lee, Tim 82
Bezos, Jeff 118
big data 16, 66, 84, 86, 98, 104
Biometric Society 70–1
Blevins, Tyler 130
Booker, Cory 150
Box, George 52–3
Buolamwini, Joy 50

C
cancer research 74
causal inference 20
census data 106–7
CERN 62
climate change 72
clustering 28, 30
Collins, Francis 68
confidence intervals 46
cookies 99, 108, 116, 118
correlation 42
Cortes, Corinna 48
cosmology 64
Cox, David 30
Cox, Gertrude 70–1
coxcomb charts 89
credit cards 110, 118, 148
Crick, Francis 68
CRISPR 66

D
Darwin, Charles 27
data analytics 27, 62, 64, 74; and business 102, 104, 112; and future trends 145; and society 92
data brokers 108
data collection 16, 18, 48, 72, 100, 112, 138, 146
data security 140, 150
data storage 16, 107
data visualization 18, 89
dataism 8
dating 120, 146
deep learning 34
deepfakes 148
DeepMind 94, 150
Deming, W. E. 40
demographics 102, 108, 130, 150
Dersch, William C. 142
Disraeli, Benjamin 18
DNA 20, 24, 60–1, 66, 68, 136
Dorsey, Jack 128
Doudna, Jennifer Anne 66

E
Edison, Thomas 102
Ek, Daniel 122
electronic medical records (EMR) 92
energy distribution/supply 102
Englert, François 62
epidemiology 76
Epstein, Richard 132
esports 130
ethics 152
eugenics 26–7
Everitt, Hugh 100

F
face recognition 32, 118, 148
Facebook 16, 128, 150, 152
fake news 146
financial modelling 110
Fisher, Ronald A. 53–4
forecasting 52, 81, 84, 102, 104, 144–5
Friedman, Eric 92
Fry, Hannah 120

G
Gallup, George Horace 40
Galton, Francis 26–7, 44
gambling 132
gaming 130, 146, 148
Gates, Bill 92
Gates, Melinda 92
Gauss, Carl Friedrich 24
General Data Protection Regulation (GDPR) 150
generalization 40, 56, 98
genetics 66, 68
geolocation 110, 152
Google 16, 32, 34, 48, 80, 94, 98, 108, 150
Gore, Al 72
Green, Donald P. 90
Guralnik, Gerald 62

H
Hagen, C.R. 62
Hammerbacher, Jeff 108
Hansen, James 72
Harrell, Frank E. 24
Hastie, Trevor 28
Hasting, Wilmot Reed Jr. 28
health 92
Higgs boson 62
Higgs, Peter 62
Hippocrates 76
Hollerith, Herman 106–7
homes 142
Hubble, Edwin 64
Human Genome Project 66, 68
humanists 8

I
IBM 94, 106–7, 150
Industry 4.0 100
International Statistics Institute 70–1
Internet of Things (IoT) 100, 142
Iowa State College 70–1
Isaac, Earl Judson 146
Issenberg, Sasha 90

J
James, Bill 126
Jebara, Tony 28

K
Kahneman, Daniel 44
Kan, Justin 130
Kelley, Patrick W. 84
Kiær, Anders Nicolai 40
Kibble, Tom 62
Kraepelin, Emil 140

L
Large Hadron Collider (LHC) 62
Lecun, Yann 32
Leibniz, Gottfried 16
logistics 104
Lorentzon, Martin 122
Lovelace, Ada 124–5

M
machine learning 22, 30, 32, 34; and future trends 138, 140, 142, 145; and pleasure 128; and science 66; and society 90, 92, 94; and uncertainty 50
McKinney, Wes 22
marketing 30, 82, 108
Massachusetts Institute of Technology (MIT) 50, 107
medicine 68, 138, 140
mental health 140
Microsoft 150
Million Genome Project (MGP) 68
modelling 30, 110, 132
Mojica, Francisco J.M. 66
Muller, Robert A. 72
music 122

N
Netflix 112
neural networks 34
Neyman, Jerzy 46
Ng, Andrew 32
Nightingale, Florence 88–9, 92
Nobel Prize 44, 62, 64, 68
nudge technologies 82

O
overfitting 56

P
p-value 39, 54
Park, James 92
Pascal, Blaise 8, 10
patents 106–7
Pearl, Judea 42
Pearson, Karl 42, 54
Perez, Juan 104
Perlmutter, Saul 64
personalization 68, 81, 86, 92, 99, 138, 140
Pollock, Rufus 152
Prasad, Rohit 142
precision medicine 68
prediction 10, 16, 20, 24, 27; and business 102, 104, 110; and future trends 138, 140, 148; and machine learning 32; and pleasure 132; and science 62, 66, 72; and society 82, 84, 90, 92; and uncertainty 40, 56
privacy 86, 100, 142, 150
probability 40, 110
product development 112
profiles 16, 81, 90, 108, 146, 152
programming 14, 22, 80, 94, 116, 125, 136
prototyping 112

R
randomized trials 61, 74, 81, 90
regression 24, 26–8, 30, 44
regulation 150
Reid, Toni 142
Reiss, Ada 64
robots 32, 94, 100, 148
robustness 53
Rosenblatt, Frank 34
route optimization 104
Royal Statistical Society 88–9

S
sampling 40, 48
Schmidt, Brian 64
security 84
Shakespeare, Stephan 18
shopping 118
smart devices 16, 32, 34, 48; and business 100, 102; and future trends 138, 140, 142; and pleasure 118, 128; and society 82, 92
Snapchat 128
sniffer systems 84
Snow, John 20, 76
Snowden, Edward 152
social media 128, 140, 148
Spiegelhalter, David 120
sports 126, 130
Spotify 32, 122
statistics 8, 18, 22, 27–8, 30; and business 107–8; and future trends 144–5, 152; and pleasure 120, 126, 132; and science 64, 70–1, 74, 76; and society 82, 86, 88–90; and uncertainty 40, 44, 46, 48, 52–4
Stephens, Frank 66
streaming 108, 117, 122, 130
supernovae 64
surveillance 82, 84
swipers 120

T
Thorp, Edward O. 110, 132
tools 22
tracking 16, 40, 86, 107–8, 112; and future trends 138, 140; and pleasure 126, 128, 130
transparency 50, 140, 150
trustworthiness scoring 146
Tukey, John W. 144–5
Turing, Alan 15, 20, 125, 148
Twitch 130
Twitter 40, 128

U
University College London 27, 52–3
University of Wisconsin 53

V
voice recognition 137, 142, 148
vote science 90

W
Wagner, Dan 90
Wald, Abraham 48
Wang, Joseph 138
Watson computer 94
Watson, James 68
Watson, Thomas 94
wearables 138
weather 26–7, 102, 104, 152
Wickham, Hadley 22
Williams, Jeff 138
Wojcicki, Susan 108

Y
YouGov 18

Z
Zelen, Marvin 74
Zuckerberg, Mark 16, 128
ACKNOWLEDGEMENTS

The publisher would like to thank the following for permission to reproduce copyright material on the following pages:

All images that appear in the montages are from Shutterstock, Inc. unless stated.

Alamy Stock Photo/Photo Researchers: 106

Getty Images/Donaldson Collection: 124; Alfred Eisenstaedt: 144

Library of Congress: 51, 91

NASA/CXC/RIKEN/T. Sato et al: 65

North Carolina State University/College of Agriculture and Life Sciences, Department of Communication Services Records (UA100.099), Special Collections Research Center at North Carolina State University Libraries: 70

Wellcome Collection/John Snow: 21; Wellcome Images: 26, 88

Wikimedia Commons/ARAKI Satoru: 47; Cdang: 25; CERN: 63; Chabacano: 57; Chrislb: 6, 35, 85; David McEddy: 52; Denis Rizzoli: 91; Emoscopes: 49; Fanny Schertzer: 127; Fred053: 127; Geek3: 65; Headbomb: 127; Justinc: 21; Karsten Adam: 127; Martin Grandjean: 2, 29; Martin Thoma: 6, 35; MLWatts: 49; National Weather Service: 103; Niyumard: 133; Paul Cuffe: 103; Petr Kadlec: 127; Sigbert: 25; Trevor J. Pemberton, Michael DeGiorgio and Noah A. Rosenberg: 69; Tubas: 127; Warren K. Leffler: 51; Yapparina: 133; Yomomo: 43; Yunyoungmok: 25; Zufzzi: 33

All reasonable efforts have been made to trace copyright holders and to obtain their permission for the use of copyright material. The publisher apologizes for any errors or omissions in the list above and will gratefully incorporate any corrections in future reprints if notified.
