Beyond the Worst-Case Analysis of Algorithms
There are no silver bullets in algorithm design, and no single algorithmic idea is
powerful and flexible enough to solve every computational problem. Nor are there
silver bullets in algorithm analysis, as the most enlightening method for analyzing
an algorithm often depends on the problem and the application. However, typical
algorithms courses rely almost entirely on a single analysis framework, that of worst-
case analysis, wherein an algorithm is assessed by its worst performance on any input
of a given size.
The purpose of this book is to popularize several alternatives to worst-case
analysis and their most notable algorithmic applications, from clustering to linear
programming to neural network training. Forty leading researchers have contributed
introductions to different facets of this field, emphasizing the most important models
and results, many of which can be taught in lectures to beginning graduate students
in theoretical computer science and machine learning.
Edited by
Tim Roughgarden
Columbia University, New York
University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India
79 Anson Road, #06–04/06, Singapore 079906
www.cambridge.org
Information on this title: www.cambridge.org/9781108494311
DOI: 10.1017/9781108637435
© Cambridge University Press 2021
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2021
Printed in the United Kingdom by TJ Books Limited, Padstow Cornwall
A catalogue record for this publication is available from the British Library.
ISBN 978-1-108-49431-1 Hardback
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet websites referred to in this publication
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
Contents
1 Introduction 1
Tim Roughgarden
1.1 The Worst-Case Analysis of Algorithms 1
1.2 Famous Failures and the Need for Alternatives 3
1.3 Example: Parameterized Bounds in Online Paging 8
1.4 Overview of the Book 12
1.5 Notes 20
2 Parameterized Algorithms 27
Fedor V. Fomin, Daniel Lokshtanov, Saket Saurabh, and Meirav Zehavi
2.1 Introduction 27
2.2 Randomization 31
2.3 Structural Parameterizations 34
2.4 Kernelization 35
2.5 Hardness and Optimality 39
2.6 Outlook: New Paradigms and Application Domains 42
2.7 The Big Picture 46
2.8 Notes 47
4 Resource Augmentation 72
Tim Roughgarden
4.1 Online Paging Revisited 72
4.2 Discussion 75
4.3 Selfish Routing 77
4.4 Speed Scaling in Scheduling 81
4.5 Loosely Competitive Algorithms 86
4.6 Notes 89
5 Perturbation Resilience 95
Konstantin Makarychev and Yury Makarychev
5.1 Introduction 95
5.2 Combinatorial Optimization Problems 98
5.3 Designing Certified Algorithms 101
5.4 Examples of Certified Algorithms 106
5.5 Perturbation-Resilient Clustering Problems 108
5.6 Algorithm for 2-Perturbation-Resilient Instances 111
5.7 (3 + ε)-Certified Local Search Algorithm for k-Medians 113
5.8 Notes 115
Preface
There are no silver bullets in algorithm design – no one algorithmic idea is powerful
and flexible enough to solve every computational problem of interest. The emphasis
of an undergraduate algorithms course is accordingly on the next-best thing: a small
number of general algorithm design paradigms (such as dynamic programming,
divide-and-conquer, and greedy algorithms), each applicable to a range of problems
that span multiple application domains.
Nor are there silver bullets in algorithm analysis, as the most enlightening method
for analyzing an algorithm often depends on the details of the problem and moti-
vating application. However, the focus of a typical algorithms course rests almost
entirely on a single analysis framework, that of worst-case analysis, wherein an
algorithm is assessed by its worst performance on any input of a given size. The
goal of this book is to redress the imbalance and popularize several alternatives to
worst-case analysis, developed largely in the theoretical computer science literature
over the past 20 years, and their most notable algorithmic applications. Forty leading
researchers have contributed introductions to different facets of this field, and
the introductory Chapter 1 includes a chapter-by-chapter summary of the book’s
contents.
This book’s roots lie in a graduate course that I developed and taught several
times at Stanford University.1 While the project has expanded in scope far beyond
what can be taught in a one-term (or even one-year) course, subsets of the book
can form the basis of a wide variety of graduate courses. Authors were requested to
avoid comprehensive surveys and focus instead on a small number of key models and
results that could be taught in lectures to second-year graduate students in theoretical
computer science and theoretical machine learning. Most of the chapters conclude
with open research directions as well as exercises suitable for classroom use. A free
electronic copy of this book is available from the URL https://www.cambridge.org/
9781108494311#resources (with the password ‘BWCA_CUP’).
Producing a collection of this size is impossible without the hard work of many
people. First and foremost, I thank the authors for their dedication and timeliness in
writing their own chapters and for providing feedback on preliminary drafts of other
chapters. I thank Avrim Blum, Moses Charikar, Lauren Cowles, Anupam Gupta,
1 Lecture notes and videos from this course, covering several of the topics in this book, are available from
my home page (www.timroughgarden.org).
Ankur Moitra, and Greg Valiant for their enthusiasm and excellent advice when this
project was in its embryonic stages. I am also grateful to all the Stanford students who
took my CS264 and CS369N courses, and especially to my teaching assistants Rishi
Gupta, Joshua Wang, and Qiqi Yan. The cover art is by Max Greenleaf Miller. The
editing of this book was supported in part by NSF award CCF-1813188 and ARO
award W911NF1910294.
Contributors
Maria-Florina Balcan
Carnegie Mellon University
Jérémy Barbay
University of Chile
Avrim Blum
Toyota Technological Institute at Chicago
Kai-Min Chung
Institute of Information Science, Academia Sinica
Daniel Dadush
Centrum Wiskunde & Informatica
Sanjoy Dasgupta
University of California at San Diego
Ilias Diakonikolas
University of Wisconsin-Madison
Uriel Feige
The Weizmann Institute
Fedor Fomin
University of Bergen
Vijay Ganesh
University of Waterloo
Rong Ge
Duke University
Anupam Gupta
Carnegie Mellon University
Nika Haghtalab
Cornell University
Moritz Hardt
University of California at Berkeley
Sophie Huiberts
Centrum Wiskunde & Informatica
Daniel Kane
University of California at San Diego
Anna R. Karlin
University of Washington at Seattle
Elias Koutsoupias
University of Oxford
Samory Kpotufe
Columbia University
Daniel Lokshtanov
University of California at Santa Barbara
Tengyu Ma
Stanford University
Konstantin Makarychev
Northwestern University
Yury Makarychev
Toyota Technological Institute at Chicago
Bodo Manthey
University of Twente
Michael Mitzenmacher
Harvard University
Ankur Moitra
Massachusetts Institute of Technology
Eric Price
The University of Texas at Austin
Heiko Röglin
University of Bonn
Tim Roughgarden
Columbia University
Saket Saurabh
Institute of Mathematical Sciences
C. Seshadhri
University of California at Santa Cruz
Sahil Singla
Princeton University
Inbal Talgam-Cohen
Technion–Israel Institute of Technology
Salil Vadhan
Harvard University
Gregory Valiant
Stanford University
Paul Valiant
Brown University
Moshe Vardi
Rice University
Sergei Vassilvitskii
Google, Inc.
Aravindan Vijayaraghavan
Northwestern University
Meirav Zehavi
Ben-Gurion University of the Negev
CHAPTER ONE
Introduction
Tim Roughgarden
1. Performance prediction. The first goal is to explain or predict the empirical perfor-
mance of algorithms. In some cases, the analyst acts as a natural scientist, taking
an observed phenomenon such as “the simplex method for linear programming is
fast” as ground truth, and seeking a transparent mathematical model that explains
it. In others, the analyst plays the role of an engineer, seeking a theory that
3 Worst-case analysis is also the dominant paradigm in complexity theory, where it has led to the develop-
ment of NP-completeness and many other fundamental concepts.
gives accurate advice about whether or not an algorithm will perform well in an
application of interest.
2. Identify optimal algorithms. The second goal is to rank different algorithms accord-
ing to their performance, and ideally to single out one algorithm as “optimal.” At
the very least, given two algorithms A and B for the same problem, a method for
algorithmic analysis should offer an opinion about which one is “better.”
3. Develop new algorithms. The third goal is to provide a well-defined framework in
which to brainstorm new algorithms. Once a measure of algorithm performance
has been declared, the Pavlovian response of most computer scientists is to
seek out new algorithms that improve on the state-of-the-art with respect to
this measure. The focusing effect catalyzed by such yardsticks should not be
underestimated.
When proving or interpreting results in algorithm design and analysis, it’s impor-
tant to be clear in one’s mind about which of these goals the work is trying to
achieve.
What’s the report card for worst-case analysis with respect to these three goals?
search on the vertices of the solution set boundary, and variants of it remain
in wide use to this day. The enduring appeal of the simplex method stems from
its consistently superb performance in practice. Its running time typically scales
modestly with the input size, and it routinely solves linear programs with millions of
decision variables and constraints. This robust empirical performance suggested that
the simplex method might well solve every linear program in a polynomial amount
of time.
Klee and Minty (1972) showed by example that there are contrived linear programs
that force the simplex method to run in time exponential in the number of decision
variables (for all of the common “pivot rules” for choosing the next vertex). This
illustrates the first potential pitfall of worst-case analysis: overly pessimistic perfor-
mance predictions that cannot be taken at face value. The running time of the simplex
method is polynomial for all practical purposes, despite the exponential prediction of
worst-case analysis.
To add insult to injury, the first worst-case polynomial-time algorithm for linear
programming, the ellipsoid method, is not competitive with the simplex method in
practice.4 Taken at face value, worst-case analysis recommends the ellipsoid method
over the empirically superior simplex method. One framework for narrowing the gap
between these theoretical predictions and empirical observations is smoothed analysis,
the subject of Part Four of this book; see Section 1.4.4 for an overview.
4 Interior-point methods, developed five years later, led to algorithms that both run in worst-case polynomial
time and are competitive with the simplex method in practice.
such distances (the k-means objective). Almost all natural optimization problems that
are defined over clusterings are NP-hard.5
In practice, clustering is not viewed as a particularly difficult problem. Lightweight
clustering algorithms, such as Lloyd’s algorithm for k-means and its variants, regu-
larly return the intuitively “correct” clusterings of real-world point sets. How can
we reconcile the worst-case intractability of clustering problems with the empirical
success of relatively simple algorithms?6
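To make this contrast concrete, here is a minimal sketch of Lloyd's algorithm for the k-means objective. This is illustrative code only (not from the book); the random seeding, the fixed iteration cap, and the exact-equality stopping rule are simplifying assumptions.

```python
import random

def lloyd_kmeans(points, k, iterations=100, seed=0):
    """Bare-bones Lloyd's algorithm: points is a list of equal-length tuples;
    returns the final list of k centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # naive seeding; k-means++ is a common variant
    dim = len(points[0])
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest current center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for j in range(k):
            if clusters[j]:
                new_centers.append(tuple(sum(p[d] for p in clusters[j]) / len(clusters[j])
                                         for d in range(dim)))
            else:
                new_centers.append(centers[j])  # leave an empty cluster's center alone
        if new_centers == centers:  # no center moved: converged
            break
        centers = new_centers
    return centers
```

On real-world point sets with a meaningful clustering, a loop of this kind typically stabilizes quickly and recovers the intuitive clusters, which is precisely the gap between worst-case theory and practice at issue here.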
One possible explanation is that clustering is hard only when it doesn’t matter.
For example, if the difficult instances of an NP-hard clustering problem look like
a bunch of random unstructured points, who cares? The common use case for a
clustering algorithm is for points that represent images, or documents, or proteins, or
some other objects where a “meaningful clustering” is likely to exist. Could instances
with a meaningful clustering be easier than worst-case instances? Part Three of this
book covers recent theoretical developments that support an affirmative answer; see
Section 1.4.2 for an overview.
5 Recall that a polynomial-time algorithm for an NP-hard problem would yield a polynomial-time algorithm
for every problem in NP – for every problem with efficiently verifiable solutions. Assuming the widely believed
P ≠ NP conjecture, every algorithm for an NP-hard problem either returns an incorrect answer for some inputs
or runs in super-polynomial time for some inputs (or both).
6 More generally, optimization problems are more likely to be NP-hard than polynomial-time solvable. In
many cases, even computing an approximately optimal solution is an NP-hard problem. Whenever an efficient
algorithm for such a problem performs better on real-world instances than (worst-case) complexity theory would
suggest, there’s an opportunity for a refined and more accurate theoretical analysis.
(e.g., whether or not an image contains a cat). Over the past decade, aided by massive
data sets and computational power, neural networks have achieved impressive levels
of performance across a range of prediction tasks. Their empirical success flies in
the face of conventional wisdom in multiple ways. First, there is a computational
mystery: Neural network training usually boils down to fitting parameters (weights
and biases) to minimize a nonconvex loss function, for example, to minimize the
number of classification errors the model makes on the training set. In the past such
problems were written off as computationally intractable, but first-order methods
(i.e., variants of gradient descent) often converge quickly to a local optimum or even
to a global optimum. Why?
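For concreteness, the following is a bare-bones sketch of such a first-order method, mini-batch stochastic gradient descent on an arbitrary differentiable loss. Everything here (the gradient oracle, the step size, the batch size) is a placeholder rather than anything specific to neural network training.

```python
import random

def sgd(grad, w, data, step_size=0.01, epochs=10, batch_size=32, seed=0):
    """Minimal stochastic gradient descent sketch.

    grad(w, batch) -> estimate of the gradient of the (possibly nonconvex)
    loss at parameter vector w, computed on a mini-batch of examples.
    """
    rng = random.Random(seed)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            g = grad(w, data[i:i + batch_size])
            # Take a small step against the estimated gradient.
            w = [wi - step_size * gi for wi, gi in zip(w, g)]
    return w
```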
Second, there is a statistical mystery: Modern neural networks are typically over-
parameterized, meaning that the number of parameters to fit is considerably larger
than the size of the training data set. Overparameterized models are vulnerable
to large generalization error (i.e., overfitting), since they can effectively memorize
the training data without learning anything that helps classify as-yet-unseen data
points. Nevertheless, state-of-the-art neural networks generalize shockingly well –
why? The answer likely hinges on special properties of both real-world data sets and
the optimization algorithms used for neural network training (principally stochastic
gradient descent). Part Five of this book covers the state-of-the-art explanations
of these and other mysteries in the empirical performance of machine learning
algorithms.
The beyond worst-case viewpoint can also contribute to machine learning by
“stress-testing” the existing theory and providing a road map for more robust
guarantees. While work in beyond worst-case analysis makes strong assumptions
relative to the norm in theoretical computer science, these assumptions are usually
weaker than the norm in statistical machine learning. Research in the latter field
often resembles average-case analysis, for example, when data points are modeled
as independent and identically distributed samples from some underlying structured
distribution. The semirandom models described in Parts Three and Four of this book
serve as role models for blending adversarial and average-case modeling to encourage
the design of algorithms with robustly good performance.
8 The notation ⌈x⌉ means the number x, rounded up to the nearest integer.
f(n)   1   2   3   3   4   4   4   5   ···
n      1   2   3   4   5   6   7   8   ···
Theorem 1.1 (Albers et al., 2005) With αf (k) defined as in (1.1) below:
(a) For every approximately concave function f , cache size k ≥ 2, and deterministic
cache replacement policy, there are arbitrarily long page request sequences
conforming to f for which the policy’s page fault rate is at least αf (k).
(b) For every approximately concave function f , cache size k ≥ 2, and page request
sequence that conforms to f , the page fault rate of the LRU policy is at most
αf (k) plus an additive term that goes to 0 with the sequence length.
(c) There exists a choice of an approximately concave function f , a cache size k ≥ 2,
and an arbitrarily long page request sequence that conforms to f , such that the
page fault rate of the FIFO policy is bounded away from αf (k).
Parts (a) and (b) prove the worst-case optimality of the LRU policy in a strong
and fine-grained sense, f -by-f and k-by-k. Part (c) differentiates LRU from FIFO, as
the latter is suboptimal for some (in fact, many) choices of f and k.
The guarantees in Theorem 1.1 are so good that they are meaningful even when
taken at face value – for strongly sublinear f ’s, αf (k) goes to 0 reasonably quickly
with k. The precise definition of αf (k) for k ≥ 2 is
$$\alpha_f(k) \;=\; \frac{k-1}{f^{-1}(k+1) - 2}, \tag{1.1}$$
where we abuse notation and interpret f^{-1}(y) as the smallest value of x such that
f(x) = y. That is, f^{-1}(y) denotes the smallest window length in which page requests
for y distinct pages might appear. As expected, for the function f(n) = n we have
αf (k) = 1 for all k. (With no restriction on the input sequence, an adversary can force
a 100% fault rate.) If f(n) = √n, however, then αf (k) scales with 1/√k. Thus with
a cache size of 10,000, the page fault rate is always at most 1%. If f(n) = 1 + ⌈log2 n⌉,
then αf (k) goes to 0 even faster with k, roughly as k/2^k.
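As a quick numerical sanity check on these calculations, the following sketch (illustrative code, not from the book) computes f^{-1}(k + 1) by linear search and evaluates αf (k) from (1.1) for the example functions f(n) = n and f(n) = 1 + ⌈log2 n⌉ discussed above.

```python
import math

def f_inverse(f, y):
    """Smallest window length x with f(x) >= y, found by linear search
    (assumes f is nondecreasing and unbounded)."""
    x = 1
    while f(x) < y:
        x += 1
    return x

def alpha(f, k):
    """The parameterized fault-rate bound alpha_f(k) from equation (1.1)."""
    return (k - 1) / (f_inverse(f, k + 1) - 2)

identity = lambda n: n                          # unrestricted request sequences
log_f = lambda n: 1 + math.ceil(math.log2(n))   # the logarithmic example from the text

print(alpha(identity, 10))       # 1.0: an adversary can force a 100% fault rate
for k in (2, 4, 8, 16):
    print(k, alpha(log_f, k))    # decays to 0 roughly like k / 2**k
```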
Part (a). To prove the lower bound in part (a), fix an approximately concave function
f and a cache size k ≥ 2. Fix a deterministic cache replacement policy A.
We construct a page sequence σ that uses only k + 1 distinct pages, so at any given
time step there is exactly one page missing from the algorithm’s cache. (Assume that
the algorithm begins with the first k pages in its cache.) The sequence comprises k − 1
blocks, where the jth block consists of m_{j+1} consecutive requests for the same page
p_j, where p_j is the unique page missing from the algorithm A's cache at the start of the
block. (Recall that m_y is the number of values of x such that f(x) = y.) This sequence
conforms to f (Exercise 1.3).
By the choice of the p_j's, A incurs a page fault on the first request of a block, and
not on any of the other (duplicate) requests of that block. Thus, algorithm A suffers
exactly k − 1 page faults.
The length of the page request sequence is m_2 + m_3 + · · · + m_k. Because m_1 = 1,
this sum equals (m_1 + m_2 + · · · + m_k) − 1 which, using the definition of the m_j's, equals
(f^{-1}(k + 1) − 1) − 1 = f^{-1}(k + 1) − 2. The algorithm's page fault rate on this sequence matches
the definition (1.1) of αf (k), as required. More generally, repeating the construction
over and over again produces arbitrarily long page request sequences for which the
algorithm has page fault rate αf (k).
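The construction just described can be phrased as a short adversary loop. The sketch below is an illustration under an assumed interface (the policy_missing_page callback, which reports which of the k + 1 pages the deterministic policy is currently missing); it is not code from the book.

```python
def block_lengths(f, k):
    """m[y] = number of window lengths x with f(x) = y, for y = 2, ..., k
    (assumes f is nondecreasing and unbounded, as in the text)."""
    m = {y: 0 for y in range(2, k + 1)}
    x = 1
    while True:
        y = f(x)
        if y > k:
            break
        if y >= 2:
            m[y] += 1
        x += 1
    return m

def adversarial_sequence(policy_missing_page, f, k):
    """Build the k - 1 blocks of the lower-bound construction: block j consists
    of m_{j+1} consecutive requests for the one page the policy is missing.

    policy_missing_page(requests_so_far) -> the unique page in {0, ..., k} absent
    from the policy's cache after serving those requests (assumed interface; the
    policy is assumed to start with pages 0, ..., k - 1 in its cache).
    """
    m = block_lengths(f, k)
    requests = []
    for j in range(1, k):                     # blocks j = 1, ..., k - 1
        p_j = policy_missing_page(requests)   # the page missing at the block's start
        requests.extend([p_j] * m[j + 1])     # m_{j+1} requests for p_j
    return requests
```

On one pass of this construction the policy faults exactly once per block, so its fault rate is (k − 1)/(f^{-1}(k + 1) − 2) = αf (k), as computed above.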
Part (b). To prove a matching upper bound for the LRU policy, fix an approximately
concave function f , a cache size k ≥ 2, and a sequence σ that conforms to f . Our
fault rate target αf (k) is a major clue to the proof (recall (1.1)): we should be looking
to partition the sequence σ into blocks of length at least f^{-1}(k + 1) − 2 such that each
block has at most k − 1 faults. So consider groups of k − 1 consecutive faults of the
LRU policy on σ . Each such group defines a block, beginning with the first fault of
the group, and ending with the page request that immediately precedes the beginning
of the next group of faults (see Figure 1.4).
Claim Consider a block other than the first or last. Consider the page requests
in this block, together with the requests immediately before and after this block.
These requests are for at least k + 1 distinct pages.
faults on q). Finally, suppose that one of these k faults was on the page p. Because p
was requested just before the first of these faults, the LRU algorithm, subsequent to
this request and prior to evicting p, must have received requests for k distinct pages
other than p. These requests, together with that for p, give the desired k + 1 distinct
page requests.9
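Part (b) can also be checked empirically by simulating the LRU policy directly. The following sketch (illustrative, with a placeholder request sequence; not the book's code) maintains the cache as a recency-ordered dictionary and reports the observed fault rate, which Theorem 1.1(b) bounds by roughly αf (k) on any sequence that conforms to f.

```python
from collections import OrderedDict

def lru_fault_rate(requests, k):
    """Simulate an LRU cache of size k and return the page fault rate."""
    cache = OrderedDict()                 # cached pages; most recently used last
    faults = 0
    for page in requests:
        if page in cache:
            cache.move_to_end(page)       # hit: refresh recency, no fault
        else:
            faults += 1
            if len(cache) >= k:
                cache.popitem(last=False) # evict the least recently used page
            cache[page] = True
    return faults / len(requests)

# Placeholder sequence; on a sequence conforming to a given f, the observed
# rate should be at most roughly alpha_f(k) plus a vanishing additive term.
requests = [0, 1, 2, 0, 1, 3, 0, 1, 2, 3] * 100
print(lru_fault_rate(requests, k=3))
```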
1.3.3 Discussion
Theorem 1.1 is an example of a “parameterized analysis” of an algorithm, where
the performance guarantee is expressed as a function of parameters of the input
other than its size. A parameter like αf (k) measures the “easiness” of an input, much
like matrix condition numbers in linear algebra. We will see many more examples of
parameterized analyses later in the book.
There are several reasons to aspire toward parameterized performance guarantees.
9 The first two arguments apply also to the FIFO policy, but the third does not. Suppose p was already in
the cache when it was requested just prior to the block. Under FIFO, this request does not “reset p’s clock”; if
it was originally brought into the cache long ago, FIFO might well evict p on the block’s very first fault.
10 For a familiar example, parameterizing the running time of graph algorithms by both the number of
vertices and the number of edges provides guidance about which algorithms should be used for sparse graphs
and which ones for dense graphs.
11 The parameter αf (k) showed up only in our analysis of the LRU policy; in other applications, the chosen
parameter also guides the design of algorithms for the problem.