
IAS/PARK CITY

MATHEMATICS SERIES
Volume 25

The Mathematics
of Data
Michael W. Mahoney
John C. Duchi
Anna C. Gilbert
Editors

American Mathematical Society


Institute for Advanced Study
Society for Industrial and Applied Mathematics
10.1090/pcms/025

Rafe Mazzeo, Series Editor
Michael W. Mahoney, John C. Duchi, and Anna C. Gilbert, Volume Editors.

The IAS/Park City Mathematics Institute runs mathematics education programs that bring
together high school mathematics teachers, researchers in mathematics and mathematics
education, undergraduate mathematics faculty, graduate students, and undergraduates to
participate in distinct but overlapping programs of research and education. This volume
contains the lecture notes from the Graduate Summer School program held in July 2016.
2010 Mathematics Subject Classification. Primary 15-02, 52-02, 60-02, 62-02, 65-02,
68-02, 90-02.

Library of Congress Cataloging-in-Publication Data


Names: Mahoney, Michael W., editor. | Duchi, John, editor. | Gilbert, Anna C. (Anna Catherine),
1972– editor. | Institute for Advanced Study (Princeton, N.J.) | Society for Industrial and Applied
Mathematics. | Park City Mathematics Institute.
Title: The mathematics of data / Michael W. Mahoney, John C. Duchi, Anna C. Gilbert, editors.
Description: Providence : American Mathematical Society, 2018. | Series: IAS/Park City mathe-
matics series ; Volume 25 | “Institute for Advanced Study.” | “Society for Industrial and Applied
Mathematics.” | Based on a series of lectures held July 2016, at the Park City Mathematics
Institute. | Includes bibliographical references.
Identifiers: LCCN 2018024239 | ISBN 9781470435752 (alk. paper)
Subjects: LCSH: Mathematics teachers–Training of–Congresses. | Mathematics–Study and
teaching–Congresses | Big data–Congresses. | AMS: Linear and multilinear algebra; matrix the-
ory – Research exposition (monographs, survey articles). msc | Convex and discrete geometry –
Research exposition (monographs, survey articles). msc | Probability theory and stochastic pro-
cesses – Research exposition (monographs, survey articles). msc | Statistics – Research exposition
(monographs, survey articles). msc | Numerical analysis – Research exposition (monographs, sur-
vey articles). msc | Computer science – Research exposition (monographs, survey articles). msc
| Operations research, mathematical programming – Research exposition (monographs, survey
articles). msc
Classification: LCC QA11.A1 M345 2018 | DDC 510–dc23
LC record available at https://lccn.loc.gov/2018024239

Copying and reprinting. Individual readers of this publication, and nonprofit libraries acting
for them, are permitted to make fair use of the material, such as to copy select pages for use
in teaching or research. Permission is granted to quote brief passages from this publication in
reviews, provided the customary acknowledgment of the source is given.
Republication, systematic copying, or multiple reproduction of any material in this publication
is permitted only under license from the American Mathematical Society. Requests for permission
to reuse portions of AMS publication content are handled by the Copyright Clearance Center. For
more information, please visit www.ams.org/publications/pubpermissions.
Send requests for translation rights and licensed reprints to reprint-permission@ams.org.

© 2018 by the American Mathematical Society. All rights reserved.
The American Mathematical Society retains all rights
except those granted to the United States Government.
Printed in the United States of America.

∞ The paper used in this book is acid-free and falls within the guidelines
established to ensure permanence and durability.
Visit the AMS home page at https://www.ams.org/
Contents

Preface vii
Introduction ix

Lectures on Randomized Numerical Linear Algebra


Petros Drineas and Michael W. Mahoney 1

Optimization Algorithms for Data Analysis


Stephen J. Wright 49

Introductory Lectures on Stochastic Optimization


John C. Duchi 99

Randomized Methods for Matrix Computations


Per-Gunnar Martinsson 187

Four Lectures on Probabilistic Methods for Data Science


Roman Vershynin 231

Homological Algebra and Data


Robert Ghrist 273

IAS/Park City Mathematics Series
Volume 25, Pages vii–viii
https://doi.org/10.1090/pcms/025/00827

Preface

The IAS/Park City Mathematics Institute (PCMI) was founded in 1991 as part
of the Regional Geometry Institute initiative of the National Science Foundation.
In mid-1993 the program found an institutional home at the Institute for Ad-
vanced Study (IAS) in Princeton, New Jersey.
The IAS/Park City Mathematics Institute encourages both research and educa-
tion in mathematics and fosters interaction between the two. The three-week sum-
mer institute offers programs for researchers and postdoctoral scholars, graduate
students, undergraduate students, high school students, undergraduate faculty,
K-12 teachers, and international teachers and education researchers. The Teacher
Leadership Program also includes weekend workshops and other activities dur-
ing the academic year.
One of PCMI’s main goals is to make all of the participants aware of the full
range of activities that occur in research, mathematics training and mathematics
education: the intention is to involve professional mathematicians in education
and to bring current concepts in mathematics to the attention of educators. To
that end, late afternoons during the summer institute are devoted to seminars and
discussions of common interest to all participants, meant to encourage interaction
among the various groups. Many deal with current issues in education; others
treat mathematical topics at a level which encourages broad participation.
Each year the Research Program and Graduate Summer School focus on a
different mathematical area, chosen to represent some major thread of current
mathematical interest. Activities in the Undergraduate Summer School and Un-
dergraduate Faculty Program are also linked to this topic, the better to encourage
interaction between participants at all levels. Lecture notes from the Graduate
Summer School are published each year in this series. The prior volumes are:
• Volume 1: Geometry and Quantum Field Theory (1991)
• Volume 2: Nonlinear Partial Differential Equations in Differential Geometry
(1992)
• Volume 3: Complex Algebraic Geometry (1993)
• Volume 4: Gauge Theory and the Topology of Four-Manifolds (1994)
• Volume 5: Hyperbolic Equations and Frequency Interactions (1995)
• Volume 6: Probability Theory and Applications (1996)
• Volume 7: Symplectic Geometry and Topology (1997)
• Volume 8: Representation Theory of Lie Groups (1998)
• Volume 9: Arithmetic Algebraic Geometry (1999)

• Volume 10: Computational Complexity Theory (2000)


• Volume 11: Quantum Field Theory, Supersymmetry, and Enumerative Geome-
try (2001)
• Volume 12: Automorphic Forms and their Applications (2002)
• Volume 13: Geometric Combinatorics (2004)
• Volume 14: Mathematical Biology (2005)
• Volume 15: Low Dimensional Topology (2006)
• Volume 16: Statistical Mechanics (2007)
• Volume 17: Analytic and Algebraic Geometry (2008)
• Volume 18: Arithmetic of L-functions (2009)
• Volume 19: Mathematics in Image Processing (2010)
• Volume 20: Moduli Spaces of Riemann Surfaces (2011)
• Volume 21: Geometric Group Theory (2012)
• Volume 22: Geometric Analysis (2013)
• Volume 23: Mathematics and Materials (2014)
• Volume 24: Geometry of Moduli Spaces and Representation Theory (2015)
The American Mathematical Society publishes material from the Undergradu-
ate Summer School in their Student Mathematical Library and from the Teacher
Leadership Program in the series IAS/PCMI—The Teacher Program.
After more than 25 years, PCMI retains its intellectual vitality and continues
to draw a remarkable group of participants each year from across the entire spec-
trum of mathematics, from Fields Medalists to elementary school teachers.
Rafe Mazzeo
PCMI Director
March 2017
IAS/Park City Mathematics Series
Volume 25, Pages ix–xii
https://doi.org/10.1090/pcms/025/00828

Introduction

Michael W. Mahoney, John C. Duchi, and Anna C. Gilbert

“The Mathematics of Data” was the topic for the 26th annual Park City Mathe-
matics Institute (PCMI) summer session, held in July 2016. To those more familiar
with very abstract areas of mathematics or more applied areas of data—the latter
going these days by names such as “big data” or “data science”—it may come as
a surprise that such an area even exists. A moment’s thought, however, should
dispel such a misconception. After all, data must be modeled, e.g., by a matrix or
a graph or a flat table, and if one performs similar operations on very different
types of data, then there is an expectation that there must be some sort of com-
mon mathematical structure, e.g., from linear algebra or graph theory or logic.
So too, ignorance or errors or noise in the data can be modeled, and it should be
plausible that how well operations perform on data depends not just on how well
the data are modeled but also on how well ignorance or noise or errors are modeled.
So too, the operations themselves can be modeled, e.g., to make statements such
as whether the operations answer a precise question, exactly or approximately, or
whether they will return a solution in a reasonable amount of time.
As such, “The Mathematics of Data” fits squarely in applied mathematics—
when that term is broadly, not narrowly, defined. Technically, it represents some
combination of what is traditionally the domain of linear algebra and probability
and optimization and other related areas. Moreover, while some of the work
in this area takes place in mathematics departments, much of the work in the
area takes place in computer science, statistics, and other related departments.
This was the challenge and opportunity we faced, both in designing the graduate
summer school portion of the PCMI summer session, as well as in designing this
volume. With respect to the latter, while the area is not sufficiently mature to say
the final word, we have tried to capture the major trends in the mathematics of
data sufficiently broadly and at a sufficiently introductory level that this volume
could be used as a teaching resource for students with backgrounds in any of the
wide range of areas related to the mathematics of data.
The first chapter, “Lectures on Randomized Numerical Linear Algebra,” pro-
vides an overview of linear algebra, probability, and ways in which they interact
fruitfully in many large-scale data applications. Matrices are a common way to
model data, e.g., an m × n matrix provides a natural way to describe m objects,
each of which is described by n features, and thus linear algebra, as well as more
sophisticated variants such as functional analysis and linear operator theory, is
central to the mathematics of data. An interesting twist is that, while work in nu-
merical linear algebra and scientific computing typically focuses on deterministic
algorithms that return answers to machine precision, randomness can be used
in novel algorithmic and statistical ways in matrix algorithms for data. While
randomness is often assumed to be a property of the data (e.g., think of noise
being modeled by random variables drawn from a Gaussian distribution), it can
also be a powerful algorithmic resource to speed up algorithms (e.g., think of
Monte Carlo and Markov Chain Monte Carlo methods), and many of the most
interesting and exciting developments in the mathematics of data explore this
algorithmic-statistical interface. This chapter, in particular, describes the use of
these methods for the development of improved algorithms for fundamental and
ubiquitous matrix problems such as matrix multiplication, least-squares approxi-
mation, and low-rank matrix approximation.
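To give a concrete flavor of this algorithmic use of randomness, here is a minimal Python/NumPy sketch, not taken from the chapter, of the sketch-and-solve approach to least-squares approximation: compress a tall problem with a small random matrix and solve the compressed problem. The problem sizes and the dense Gaussian sketch are illustrative assumptions; in practice one would use a structured or sparse sketch that can be applied quickly.

```python
import numpy as np

rng = np.random.default_rng(0)

# An overdetermined least-squares problem: 10000 equations, 50 unknowns.
n, d = 10000, 50
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Sketch-and-solve: compress with a 500 x n random matrix S and solve the
# much smaller problem min_x ||S A x - S b||.
m = 500
S = rng.standard_normal((m, n)) / np.sqrt(m)
x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)

obj = lambda x: np.linalg.norm(A @ x - b)
print("exact objective:   ", obj(x_exact))
print("sketched objective:", obj(x_sketch))  # close to optimal, with high probability
```

The compressed problem has only 500 rows, yet with high probability its solution nearly minimizes the original objective; quantifying such guarantees is the kind of analysis these methods require.
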
The second chapter, “Optimization Algorithms for Data Analysis,” goes one
step beyond basic linear algebra problems, which themselves are special cases
of optimization problems, to consider more general optimization problems. Op-
timization problems are ubiquitous throughout data science, and a wide class
of problems can be formulated as optimizing smooth functions, possibly with
simple constraints or structured nonsmooth regularizers. This chapter describes
some canonical problems in data analysis and their formulation as optimization
problems. It also describes iterative algorithms (i.e., those that generate a se-
quence of points) that, for convex objective functions, converge to the set of solu-
tions of such problems. Algorithms covered include first-order methods that de-
pend on gradients, so-called accelerated gradient methods, and Newton’s second-
order method that can guarantee convergence to points that approximately satisfy
second-order conditions for a local minimizer of a smooth nonconvex function.
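As a small, purely illustrative example of the simplest such method, the following Python/NumPy sketch runs gradient descent with the classical 1/L step size on a least-squares objective; the problem instance and iteration count are assumptions made for the demo rather than material from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Smooth convex objective f(x) = 0.5 * ||A x - b||^2 with gradient A^T (A x - b);
# the gradient is Lipschitz with constant L = ||A||_2^2.
A = rng.standard_normal((200, 50))
b = rng.standard_normal(200)
L = np.linalg.norm(A, 2) ** 2

x = np.zeros(50)
for _ in range(500):
    grad = A.T @ (A @ x - b)
    x = x - grad / L              # fixed step size 1/L

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
print("distance to the least-squares solution:", np.linalg.norm(x - x_star))
```
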
The third chapter, “Introductory Lectures on Stochastic Optimization,” cov-
ers the basic analytical tools and algorithms necessary for stochastic optimiza-
tion. Stochastic optimization problems are problems whose definition involves
randomness, e.g., minimizing the expectation of some function; and stochastic
optimization algorithms are algorithms that generate and use random variables
to find the solution of a (perhaps deterministic) problem. As with the use of ran-
domness in Randomized Numerical Linear Algebra, there is an interesting syn-
ergy between the two ways in which stochasticity appears. This chapter builds
the necessary convex analytic and other background, and it describes gradient
and subgradient first-order methods for the solution of these types of problems.
These methods tend to be simple methods that are slower to converge than more
advanced methods—such as Newton’s or other second-order methods—for deter-
ministic problems, but they have the advantage that they can be robust to noise
in the optimization problem itself. Also covered are mirror descent and adap-
tive methods, as well as methods for proving upper and lower bounds on such
stochastic algorithms.
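The following Python/NumPy sketch illustrates the basic stochastic gradient method on a simple least-squares risk, where each iteration sees a single random sample; the data model and the decreasing step-size schedule are assumptions chosen only for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimize f(x) = E[(a^T x - b)^2] / 2, observing one random pair (a, b) per step.
d = 20
x_true = rng.standard_normal(d)

def sample():
    a = rng.standard_normal(d)
    b = a @ x_true + 0.1 * rng.standard_normal()  # noisy linear measurement
    return a, b

x = np.zeros(d)
for t in range(1, 20001):
    a, b = sample()
    g = (a @ x - b) * a               # unbiased stochastic gradient at x
    x -= g / (d * np.sqrt(t))         # decreasing step size, scaled for stability
print("error after 20000 stochastic steps:", np.linalg.norm(x - x_true))
```
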

The fourth chapter, “Randomized Methods for Matrix Computations,” goes into
more detail on randomized methods for efficiently computing a low-rank
approximation to a given matrix. One often wants to decompose an m × n
matrix A, where m and n are both large, into two lower-rank, more rectangular
matrices E and F such that A ≈ EF. Examples include low-rank approximations
to the eigenvalue decomposition or the singular value decomposition. While
low-rank approximation problems of this type form a cornerstone of traditional
applied mathematics and scientific computing, they also arise in a broad range
of data science applications. Importantly, though, the questions one asks of these
matrix decompositions (e.g., whether one is interested in numerical precision or
statistical inference objectives) and even how one accesses these matrices (e.g.,
within the RAM model idealization or in a single-pass streaming setting where
the data can’t even be stored) are very different. Randomness can be useful
in many ways here. This chapter describes randomized algorithms that obtain
better worst-case running time, both in the RAM model and a streaming model,
how randomness can be used to obtain improved communication properties for
algorithms, and also several data-driven decompositions such as the Nyström
method, the Interpolative Decomposition, and the CUR decomposition.
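As an illustration in the spirit of these algorithms, here is a minimal Python/NumPy sketch of a randomized range finder for low-rank approximation: sketch the range of A with a Gaussian test matrix, orthonormalize, and compress. The oversampling parameter and the synthetic test matrix are assumptions made for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

def randomized_low_rank(A, k, oversample=10):
    """Return E (m x k) and F (k x n) with A ≈ E @ F, via a randomized range finder."""
    Omega = rng.standard_normal((A.shape[1], k + oversample))
    Q, _ = np.linalg.qr(A @ Omega)        # Q approximately spans the range of A
    B = Q.T @ A                           # small (k + oversample) x n matrix
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U[:, :k]) * s[:k], Vt[:k]

# Synthetic test matrix with rapidly decaying singular values.
m, n = 300, 200
U0, _ = np.linalg.qr(rng.standard_normal((m, n)))
V0, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = (U0 * 2.0 ** -np.arange(n)) @ V0.T
E, F = randomized_low_rank(A, k=10)
print("relative error of the rank-10 approximation:",
      np.linalg.norm(A - E @ F) / np.linalg.norm(A))
```
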
The fifth chapter, “Four Lectures on Probabilistic Methods for Data Science,”
describes modern methods of high dimensional probability and illustrates how
these methods can be used in data science. Methods of high-dimensional proba-
bility play a central role in applications in statistics, signal processing, theoretical
computer science, and related fields. For example, they can be used within a ran-
domized algorithm to obtain improved running time properties, and/or they can
be used as random models for data, in which case they are needed to obtain in-
ferential guarantees. Indeed, they are used (explicitly or implicitly) in all of the
previous chapters. This chapter presents a sample of particularly useful tools of
high-dimensional probability, focusing on the classical and matrix Bernstein’s in-
equality and the uniform matrix deviation inequality, and it illustrates these tools
with applications for dimension reduction, network analysis, covariance estima-
tion, matrix completion, and sparse signal recovery.
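As one small example of the dimension-reduction applications mentioned above, the following Python/NumPy sketch projects high-dimensional points through a scaled Gaussian random matrix and checks how well pairwise distances are preserved; the dimensions and the Gaussian choice are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 points in dimension 1000, projected down to dimension 100.
n, d, k = 50, 1000, 100
X = rng.standard_normal((n, d))
G = rng.standard_normal((d, k)) / np.sqrt(k)   # scaled so lengths are preserved on average
Y = X @ G

def pairwise_distances(Z):
    diff = Z[:, None, :] - Z[None, :, :]
    return np.linalg.norm(diff, axis=-1)

mask = ~np.eye(n, dtype=bool)
ratios = pairwise_distances(Y)[mask] / pairwise_distances(X)[mask]
print("distance distortion (min, max):", ratios.min(), ratios.max())
```
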
The sixth and final chapter, “Homological Algebra and Data,” provides an ex-
ample of how methods from more pure mathematics, in this case topology, might
be used fruitfully in data science and the mathematics of data, as outlined in the
previous chapters. Topology is—informally—the study of shape, and topological
data analysis provides a framework to analyze data in a manner that should be
insensitive to the particular metric chosen, e.g., to measure the similarity between
data points. It involves replacing a set of data points with a family of simplicial
complexes, and then using ideas from persistent homology to try to determine
the large scale structure of the set. This chapter approaches topological data
analysis from the perspective of homological algebra, where homology is an al-
gebraic compression scheme that excises all but the essential topological features
from a class of data structures. An important point is that linear algebra can
be enriched to cover not merely linear transformations—the 99.9% use case—but
also sequences of linear transformations that form complexes, thus opening the
possibility of further mathematical developments.
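To make the algebraic point concrete, here is a tiny Python/NumPy sketch, not from the chapter, that reads the Betti numbers of a hollow triangle off the rank of its boundary matrix; it computes ordinary homology over the rationals rather than the persistent homology the chapter develops.

```python
import numpy as np

# Simplicial complex: vertices {0, 1, 2}, edges {01, 02, 12}, no 2-simplices.
# Rows of d1 are vertices, columns are oriented edges.
d1 = np.array([[-1, -1,  0],
               [ 1,  0, -1],
               [ 0,  1,  1]], dtype=float)
n_vertices, n_edges = d1.shape
rank_d1 = np.linalg.matrix_rank(d1)
rank_d2 = 0                                # no 2-simplices, so the next boundary map is zero

betti_0 = n_vertices - rank_d1             # number of connected components
betti_1 = (n_edges - rank_d1) - rank_d2    # independent 1-dimensional cycles
print("betti_0 =", betti_0, "betti_1 =", betti_1)   # 1 and 1: connected, with one loop
```
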
Overall, the 2016 PCMI summer program included minicourses by Petros
Drineas, John Duchi, Cynthia Dwork and Kunal Talwar, Robert Ghrist, Piotr
Indyk, Mauro Maggioni, Gunnar Martinsson, Roman Vershynin, and Stephen
Wright. This volume consists of contributions, summarized above, by Petros
Drineas (with Michael Mahoney), Stephen Wright, John Duchi, Gunnar Martins-
son, Roman Vershynin, and Robert Ghrist. Each chapter in this volume was
written by a different author, and so each chapter has its own unique style, in-
cluding notational differences, but we have taken some effort to ensure that they
can fruitfully be read together.
Putting together such an effort—both the entire summer session as well as this
volume—is not a minor undertaking, but for us it was not difficult, due to the
large amount of support we received. We would first like to thank Richard Hain,
the former PCMI Program Director, who first invited us to organize the summer
school, as well as Rafe Mazzeo, the current PCMI Program Director, who pro-
vided seamless guidance throughout the entire process. In terms of running the
summer session, a special thank you goes out to the entire PCMI staff, and in par-
ticular to Beth Brainard and Dena Vigil as well as Bryna Kra and Michelle Wachs.
We received a lot of feedback from participants who enjoyed the event, and Beth
and Dena deserve much of the credit for making it run smoothly; and Bryna and
Michelle’s role with the graduate steering committee helped us throughout the
entire process. In terms of this volume, in addition to thanking the authors for
their efforts and (usually) getting back to us in a timely manner, we would like to
thank Ian Morrison, who is the PCMI Publisher. Putting together a volume such
as this can be a tedious task, but for us it was not, and this is in large part due to
Ian’s help and guidance.
PUBLISHED TITLES IN THIS SERIES

25 Michael W. Mahoney, John C. Duchi, and Anna C. Gilbert, Editors, The
Mathematics of Data, 2018
24 Roman Bezrukavnikov, Alexander Braverman, and Zhiwei Yun, Editors,
Geometry of Moduli Spaces and Representation Theory, 2017
23 Mark J. Bowick, David Kinderlehrer, Govind Menon, and Charles Radin,
Editors, Mathematics and Materials, 2017
22 Hubert L. Bray, Greg Galloway, Rafe Mazzeo, and Natasa Sesum, Editors,
Geometric Analysis, 2016
21 Mladen Bestvina, Michah Sageev, and Karen Vogtmann, Editors, Geometric
Group Theory, 2014
20 Benson Farb, Richard Hain, and Eduard Looijenga, Editors, Moduli Spaces of
Riemann Surfaces, 2013
19 Hongkai Zhao, Editor, Mathematics in Image Processing, 2013
18 Cristian Popescu, Karl Rubin, and Alice Silverberg, Editors, Arithmetic of
L-functions, 2011
17 Jeffery McNeal and Mircea Mustaţă, Editors, Analytic and Algebraic Geometry,
2010
16 Scott Sheffield and Thomas Spencer, Editors, Statistical Mechanics, 2009
15 Tomasz S. Mrowka and Peter S. Ozsváth, Editors, Low Dimensional Topology, 2009
14 Mark A. Lewis, Mark A. J. Chaplain, James P. Keener, and Philip K. Maini,
Editors, Mathematical Biology, 2009
13 Ezra Miller, Victor Reiner, and Bernd Sturmfels, Editors, Geometric
Combinatorics, 2007
12 Peter Sarnak and Freydoon Shahidi, Editors, Automorphic Forms and Applications,
2007
11 Daniel S. Freed, David R. Morrison, and Isadore Singer, Editors, Quantum Field
Theory, Supersymmetry, and Enumerative Geometry, 2006
10 Steven Rudich and Avi Wigderson, Editors, Computational Complexity Theory, 2004
9 Brian Conrad and Karl Rubin, Editors, Arithmetic Algebraic Geometry, 2001
8 Jeffrey Adams and David Vogan, Editors, Representation Theory of Lie Groups, 2000
7 Yakov Eliashberg and Lisa Traynor, Editors, Symplectic Geometry and Topology,
1999
6 Elton P. Hsu and S. R. S. Varadhan, Editors, Probability Theory and Applications,
1999
5 Luis Caffarelli and Weinan E, Editors, Hyperbolic Equations and Frequency
Interactions, 1999
4 Robert Friedman and John W. Morgan, Editors, Gauge Theory and the Topology of
Four-Manifolds, 1998
3 János Kollár, Editor, Complex Algebraic Geometry, 1997
2 Robert Hardt and Michael Wolf, Editors, Nonlinear partial differential equations in
differential geometry, 1996
1 Daniel S. Freed and Karen K. Uhlenbeck, Editors, Geometry and Quantum Field
Theory, 1995
Data science is a highly interdisciplinary field, incorporating ideas from applied math-
ematics, statistics, probability, and computer science, as well as many other areas. This
book gives an introduction to the mathematical methods that form the foundations of
machine learning and data science, presented by leading experts in computer science,
statistics, and applied mathematics. Although the chapters can be read independently, they
are designed to be read together as they lay out algorithmic, statistical, and numerical
approaches in diverse but complementary ways.
This book can be used both as a text for advanced undergraduate and beginning graduate
courses, and as a survey for researchers interested in understanding how applied math-
ematics broadly defined is being used in data science. It will appeal to anyone interested
in the interdisciplinary foundations of machine learning and data science.

PCMS/25
