Coarse-grained Speculative Parallelism

and Optimization
by
Kirk Kelsey

Submitted in Partial Fulfillment
of the
Requirements for the Degree
Doctor of Philosophy

Supervised by
Dr. Chen Ding
Department of Computer Science
Arts, Sciences and Engineering
Edmund A. Hajim School of Engineering and Applied Sciences
University of Rochester
Rochester, New York
2011


To Ellen:
Always Hopes,
Always Perseveres


Curriculum Vitae

The author was born in New Haven, Connecticut on March 3rd, 1979. He
attended Vanderbilt University from 1997 to 2003, and graduated with a Bachelor
of Science degree in 2001 followed by a Master of Science degree in 2003. He came
to the University of Rochester in the Fall of 2003 and began graduate studies
in Computer Science. He pursued his research in software speculative parallelism
under the direction of Professor Chen Ding and received a Master of Science degree
from the University of Rochester in 2005.


Acknowledgments
More than any other factor, I have to attribute so much to the unyielding
support of my wife, Ellen. This certainly extends well beyond the time spent
working towards a thesis, but so few pursuits offer the opportunity for a formal
acknowledgment. If I had the words, my thanks would dwarf this document. My
parents, also, deserve my heart-felt appreciation for many more years of support,
as well as for providing early models of scholarship.
I am deeply thankful to my adviser, Chen Ding, for guiding me through a
marathon process. Chen has been a constant through the many stages of graduate
education and study. Ultimately, he helped me develop a direction in research and
reminded me that we are measured not by the information we consume, but by the
knowledge we create. I owe a sincere debt to the members of my thesis committee
for their advice during the development of ideas that has led to this work, and for
the broader education they provided within the department.
My cohort of fellow aspiring researchers was an invaluable source of insight,
inspiration, humility and support. I'd like to thank other students in the compiler
and systems groups who have helped to show the way ahead of me, specifically
Yutao Zhong and Xipeng Shen, and kept me motivated, especially Mike Spear
and Chris Stewart. From a broader standpoint, I have appreciated time spent
with Ashwin Lall, Chris Stewart, Ben Van Durme and Matt Post immensely.
My friends outside of the department helped to take my mind off computer
science from time to time; Jason and Ana stand out specifically in that regard. Finally,

I'd like to thank the staff of the computer science department for their help in
innumerable ways. Jo Marie Carpenter, Marty Gunthner, Pat Mitchell and Eileen
Pullara keep a lot of things running around the department, and I'm happy to be
included among them.


Abstract

The computing industry has long relied on computation becoming faster through
steady exponential growth in the density of transistors on a chip. While the
growth in density has been maintained, factors such as thermal dissipation have
limited the increase in clock speeds. Contemporary computers are rapidly becoming parallel processing systems in which the notion of computer power comes from
multi-tasking rather than speed. A typical home consumer is now more likely
than not to get a parallel processor when purchasing a desktop or laptop.
While parallel processing provides an opportunity for continued growth in
mainstream computational power, it also requires that programs be built to use
multiple threads of execution. The process of writing parallel programs is acknowledged as requiring a significant level of skill beyond general programming,
relegating parallel programming to a small class of expert programmers. The difficulty of parallel programming is only compounded when attempting to modify
an existing program. Given that the vast majority of existing programs have not
been written to use parallelism, a significant amount of code could benefit from
an overhaul.
An alternative to explicitly encoding parallelism into a program is to use speculative parallelism of some form. Speculative parallelism removes the burden of
guaranteeing the independence of parallel threads of execution, which greatly simplifies the process of parallel program development. This is especially true when
retrofitting existing programs because the programmer is less likely to have a
complete understanding of the code base.
In many cases, it is the safety of the parallelism that is speculated on. There are also
cases in which it makes sense to parallelize tasks that are inherently speculative.
One may wish to speculate about the result of some computation, the safety of applying an optimization, or the best heuristics to use when searching for a solution.
This style of speculative parallelism is referred to as speculative optimization.
In this work I describe a speculative parallelism system based on POSIX processes and communication. The system comprises a set of run-time libraries and
compiler support for easily generating a speculatively parallel program. The implementation is designed to be general and portable, and the programming interface is designed to minimize the programmer effort needed to effectively parallelize
a program. There are two variants on the run-time system intended for different
forms of parallelism. Both of these general forms of speculative parallelism are
applicable to many different problems.

Table of Contents

Curriculum Vitae
Acknowledgments
Abstract
List of Tables
List of Figures
List of Algorithms
Foreword

1 Introduction
  1.1 Explicit Parallel Programming
  1.2 Speculative Execution
  1.3 Road Map

2 Background
  2.1 Thread Representation
    2.1.1 Data Sharing
    2.1.2 Message Passing
  2.2 Speculative Threads
    2.2.1 Ancillary Tasks
    2.2.2 Run-Ahead
  2.3 Fork and Join
    2.3.1 Futures
    2.3.2 Cilk
    2.3.3 Sequential Semantics
  2.4 Pipelining
    2.4.1 Decoupling
  2.5 Support Systems
    2.5.1 Operating System
    2.5.2 Compiler
    2.5.3 Race Detection
  2.6 Correctness Checking
    2.6.1 Heavyweight
    2.6.2 Hardware Techniques
    2.6.3 Monitoring

3 Process-Based Speculation
  3.1 Implementation
    3.1.1 Creation
    3.1.2 Monitoring
    3.1.3 Verification
    3.1.4 Abort
    3.1.5 Commit
  3.2 Advantages
  3.3 Disadvantages
  3.4 Special Considerations
    3.4.1 Input and Output
    3.4.2 Memory Allocation
    3.4.3 System Signals

4 Speculative Parallelism
  4.1 Design
    4.1.1 Lead and Spec Processes
    4.1.2 Understudy: Non-speculative Re-execution
    4.1.3 Expecting the Unexpected
  4.2 Programming Interface
    4.2.1 Region Markers
    4.2.2 Post-Wait
    4.2.3 Feedback
  4.3 Run-Time System
    4.3.1 Creation
    4.3.2 Monitoring
    4.3.3 Verification
    4.3.4 Commit
    4.3.5 Abort
  4.4 Types Of Speculative Parallelism
    4.4.1 Data-Parallel
    4.4.2 Task-Parallel
  4.5 Comparison to Other Approaches
    4.5.1 Explicit Parallelism
    4.5.2 Fine-Grained Techniques
  4.6 Evaluation
    4.6.1 Implementation and Experimental Setup
    4.6.2 Application Benchmarks

5 Speculative Optimization
  5.1 Design
    5.1.1 Fast and Normal Tracks
    5.1.2 Dual-track
  5.2 Programming Interface
  5.3 Run-time Support
    5.3.1 Creation
    5.3.2 Monitoring
    5.3.3 Verification
    5.3.4 Abort
    5.3.5 Commit
    5.3.6 Special Considerations
  5.4 Compiler Support
  5.5 Uses of Fast Track
    5.5.1 Unsafe Program Optimization
    5.5.2 Parallel Memory-Safety Checking
  5.6 Evaluation
    5.6.1 Analysis
    5.6.2 Experimental Results

6 Conclusion
  6.1 Contributions
  6.2 Future Directions
    6.2.1 Automation
    6.2.2 Composability
    6.2.3 Further Evaluation

A Code Listings
  A.1 BOP Code
  A.2 Fast Track Code
  A.3 Common Code

Bibliography

List of Tables

4.1 Speculation actions for unexpected behavior
4.2 Three types of data protection
4.3 Comparisons between strong and weak isolation
4.4 XLisp Private Variables
4.5 XLisp Checked Variables
4.6 Execution times for various speculation depths
4.7 The size of various protection groups in training runs
4.8 Execution times of bop GZip

List of Figures

4.1 Sequential and speculative execution of three ppr instances
4.2 Example of matching ppr markers
4.3 The states of the sequential and parallel execution
4.4 State diagram of bop
4.5 The effect of speculative processing on Parser
4.6 Solving 8 systems of linear equations with Intel MKL
5.1 State diagram of FastTrack processes
5.2 FastTrack resource allocation state diagram
5.3 Analytical results of the FastTrack system
5.4 The effect of FastTrack Mudflap on four SPEC 2006 benchmarks
5.5 FastTrack application to sorting routines
5.6 FastTrack on synthetic benchmarks

List of Algorithms

2.4.1 Listing of pipeline loop
2.4.2 Interleaved iterations of pipelined loop
4.2.1 Example use of bop to mark a possibly parallel region of code within a loop
4.2.2 Example use of bop including EndPPR marker
4.2.3 Example use of bop in a non-loop context
4.2.4 Example of a pipelined loop body
4.2.5 Example use of bop post/wait
4.3.1 Listing of ppr creation
4.3.2 Examples of shared, checked, and private data
4.3.3 Listing of bop termination routine
4.3.4 Listing of ppr commit in the speculative process
4.3.5 Listing of ppr commit in the understudy process
4.3.6 Listing of ppr commit in the main process
4.3.7 Listing of ppr commit finalization routine
5.2.1 Example listing of FastTrack loop optimization
5.2.2 Unsafe function optimization using fast track
5.3.1 Listing of FastTrack creation
5.3.2 Listing of FastTrack monitoring
5.3.3 Listing of FastTrack verification routine FT_CheckData
5.3.4 Listing of slow track commit routine
5.3.5 Listing of FastTrack exit point handler
5.6.1 Pseudo code of the synthetic search program
6.2.1 Example of FastTrack self-composition

Foreword

Chapters 4 and 5 of this dissertation are based on collaborative work. Chapter 4 of my dissertation was co-authored with Professor Chen Ding, and with fellow
students Xipeng Shen, Chris Tice, Ruke Huang, and Chengliang Zhang. I contributed the implementation of the computational system, and the experimental
analysis. It has been published in Proceedings of the ACM SIGPLAN Conference
on Programming Language Design and Implementation, 2007. An early prototype
of the run-time system was created by Xipeng Shen, which was rewritten for our
publication, and again for ongoing work. Ruke Huang contributed compiler support, and Chris Tice worked on the MKL benchmark. Chengliang Zhang helped
with system testing.
I am the primary author of Chapter 5, on which I collaborated with Professor Chen Ding and with fellow graduate student Tongxin Bai. This chapter has
been published in Proceedings of the International Symposium on Code Generation and Optimization, March 2009. My contribution is the implementation of
the computational system, construction of the experimental frameworks, and the
experimental analysis. Tongxin Bai contributed design ideas, and assisted with
testing.

1 Introduction

Since the introduction of the Intel 4004 microprocessor, the number of transistors
on commercial integrated circuits has doubled roughly every two years. This
trend was famously noted by Gordon Moore in 1965 and has continued to the
present [40]. During this period of time the growing number of transistors typically
corresponded with an increase in the clock rate, from 740 kHz for the 4004 chipset
to 3.8 GHz for Intel's Pentium 4 processor in 2004.
Since the release of the Pentium 4 processor, clock rates have actually decreased slightly. Currently, the highest clock rate available on an Intel microprocessor is 3.33 GHz. The primary reason for this stagnation and decline is the
problem of thermal dissipation. Each transistor on a chip uses some amount of
power in two forms: constant leakage and power consumed per state switch. Increasing the chip
clock rate directly increases the power consumption due to switching, but also requires a reduction in the size of components (to reduce signal propagation time).
This miniaturization increases the density of the transistors, which increases the
amount of power consumed in any given chip area. Increased power consumption
leads to increased heat generation. The two factors, increased switching and
concentration of components, compound on one another.
On the consumer front, we've reached the limits of air cooling a computer
sitting in a room-temperature environment. Air cooling can be extended by moving
processing into areas with colder ambient temperature, and liquid cooling techniques provide an alternative solution. Even with more sophisticated approaches
to ensure the integrity of a running processor, at some point a significant amount
of power must be used to cool the chip. In contemporary data centers it is common for the power demands of the cooling systems to surpass the power used to
actually perform computation. The continued growth in power consumption has
been recognized to be unsustainable both technologically and commercially, as
consumers recognize the ancillary costs of their processors.
With the upward climb in clock rates stalled, the choice has been to expand
processors horizontally. Computers are no longer made faster with increasing
clock rates, but instead are made more powerful with multiple processing cores.
We have reached the multicore era in which it is typical to find a multiprocessor
in consumer desktops, laptops, and even mobile devices.
Although computers are now parallel, the same cannot broadly be said of the
programs running on them. The majority of programs, both existing programs
and those being written today, are not designed to take advantage of parallel
processing. One reason for this is the relative scarcity of parallel computers in
the past, particularly in the home consumer arena. Another reason is that
programmers are trained to think about the problems they are solving in an
explicitly sequential way.
The result is a large body of programs that must be retrofitted to take advantage of parallel processing systems. There are a few significant reasons that
parallel programming is difficult, and many of these are only exacerbated when
attempting to modify an existing code base.

1.1 Explicit Parallel Programming

While the general public may recognize that programming requires a certain level
of expertise, parallel programming has largely been relegated to a select group of
programmers. Programmers are typically taught to think explicitly in series:
to write an imperative program as a series of steps that depend on one another.
This can make the transition to parallel programming difficult for programmers,
but more importantly it has led to a legacy of programs that are truly serial by
design.

Finding Parallelism
Identifying portions of a program that can safely run in parallel with one another
is perhaps the most difficult aspect of parallel programming. This task is often
made more difficult by attempts by programmers to optimize their code for the
sequential execution. Once the parallel regions have been identified, the programmer must ensure the correctness of each region interacting with all others. This
is most commonly done using locks, which must be correctly associated with the
same collection of data in every case where that data may be modified by multiple
threads. The problems involved in correctly writing a parallel program are exacerbated when attempting to update an existing program. Without a familiarity
with the code in question, the programmer is less likely to recognize side effects
of functions or identify poorly isolated data. Currently, no tool exists that can
automatically identify parallelism in an arbitrary program, and it is not possible
to do so in every case.

Ensuring Progress
One of the most well-known problems encountered in parallelism, whether designing a single program with multiple threads of execution, or scheduling multiple programs with shared resources in an operating system, is deadlock. Of the
four conditions necessary for deadlock identified in [12], three are easily
achieved using locks: mutual exclusion (only one thread can hold a lock), lack
of preemption (a thread cannot steal a lock), and hold and wait (a thread can
acquire locks one after another). The only condition that needs to be added by a
programmer is circular waiting, where a group of multiple threads each wait on a
lock held by another member of the group.
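To make the circular-waiting condition concrete, the following minimal C sketch (my own illustration, with hypothetical worker functions) shows two threads acquiring the same pair of locks in opposite orders; if each thread obtains its first lock before the other obtains its second, neither can ever proceed.

#include <pthread.h>

/* Illustrative sketch: opposite acquisition orders supply the circular-wait
   condition that completes the four requirements for deadlock. */
pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

void *worker_one(void *arg) {
    pthread_mutex_lock(&lock_a);    /* holds A ...         */
    pthread_mutex_lock(&lock_b);    /* ... and waits for B */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
    return NULL;
}

void *worker_two(void *arg) {
    pthread_mutex_lock(&lock_b);    /* holds B ...         */
    pthread_mutex_lock(&lock_a);    /* ... and waits for A */
    pthread_mutex_unlock(&lock_a);
    pthread_mutex_unlock(&lock_b);
    return NULL;
}
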
In addition to deadlock, a few other problems can arise that prevent a system
from making progress. Livelock is similar to deadlock in that threads do not make
progress collectively because of interference with one another. The difference is
that each thread remains active, typically trying to coordinate with another livelocked
thread. Livelock is a specific example of resource starvation, which describes a
situation in which a thread cannot make progress because it lacks access to some
resource. The distinction from deadlock is that progress is made by the thread
holding the resource. When the resource is released, it is allocated to some thread
other than the starving one.

Guaranteeing Correctness
In the context of parallel programming, correctness is defined to mean that the observable behavior of the program is maintained. If the program acts as a function,
mapping input to output, then the function must be preserved. In the context of
parallelizing a sequential program, the original serialization of observable points in
the execution implied by that program must be maintained, ruling out deadlock.
To guarantee correctness, the programmer must ensure that all accesses to shared
data are properly guarded. This requires identifying all shared data, identifying
all accesses to that data, and finally creating an association between data objects
and the primitives used to synchronize their access. Particularly in the case of
parallelizing an inherited code base, the programmer may have difficulty simply

identifying what data objects are shared. Assuming that using a single global lock
will not allow acceptable performance, the programmer will also be responsible for
determining which data need to be protected collectively because their common
state needs to be consistent.

Debugging
One of the more common problems in parallel programming is the occurrence of
a data race, which is the case of two threads accessing the same data without
synchronization between the accesses (at least one of which must be a write).
The result of a race (i.e., the value that is ultimately attributed to the data)
depends on the sequence and timing of events in both threads leading up to their
accesses. Because the scheduling of threads may depend on other processes in
the system at large, the error is effectively non-deterministic. Generally, we want
to reproduce the conditions under which a bug occurs to isolate it. Because the
problem may appear very intermittently, the conditions for the error are effectively
random. Running the program in a debugger can force a particular serialization,
which ensures a certain outcome of the race, potentially making the debug session
useless for finding the problem.
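A minimal sketch of such a race, written here with POSIX threads purely for illustration, is the classic unsynchronized counter: the final value depends entirely on how the two threads' read-modify-write sequences interleave.

#include <pthread.h>

/* Hypothetical example: both threads update the shared counter with no
   synchronization, so the final total is timing dependent. */
long counter = 0;

void *increment(void *arg) {
    for (int i = 0; i < 1000000; i++)
        counter++;                       /* racy read-modify-write */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;                            /* counter usually ends up below 2000000 */
}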

1.2 Speculative Execution

Speculative parallelism, running some portions of a program in parallel with
the rest on the assumption that they will be useful and correct, can extract useful
coarse-grained parallelism from existing programs in several ways. The speculative
execution systems outlined in Section 1.3 directly address the problems of explicit
parallel programming raised in Section 1.1.

Finding Parallelism The first-order problem of explicitly parallelizing code is
to identify which portions of a program can safely be executed simultaneously.
The other problems are largely the result of the solutions used once the parallel
regions of the program have been identified. Using a speculative system allows a
programmer to indicate parallel regions without the responsibility of preserving
run-time dependencies, which the system guarantees will not be violated.

Ensuring Progress Ensuring progress is trivial because there is no potential
for deadlock. The programmer does not introduce any serialization primitives
such as locks. This means that the speculative run-time system cannot introduce
a circular waiting condition. It may be tempting to qualify the previous statement
such as "where one did not already exist", but clearly there cannot have been a
circular dependency in a sequential program. Likewise, there is no potential for
introducing livelock or resource starvation.

Debugging The speculative execution system depends on the ability to discard the speculative portion of execution and follow only the sequential flow of
execution. The intent of this fallback is that the speculatively parallel program
maps directly back to the sequential execution. In this case, there is no need to
explicitly debug a speculatively parallel program because the user can debug the
sequential program with the same effect.

1.3 Road Map

In Chapter 2 I cover the extensive existing literature on speculative execution
related to both sequential programs and optimization. In Chapter 3 I describe
the fundamental aspects of an implementation for a process-based speculative
run-time system. Chapter 4 describes a run-time system intended for speculative
execution of program regions ahead of time. In Chapter 5 I describe a software-only speculation system that enables unsafe optimization of sequential code. I
conclude with a discussion of the limitations of the current speculative execution
system, and of future directions to address them, in Chapter 6.

2 Background

2.1 Thread Representation

2.1.1 Data Sharing

Because programs written in imperative languages progress by writing to and later
reading from data objects, eventually using their values to represent their result,
data sharing is a natural extension. By allowing multiple processes to share some
of the data they are modifying, the syntax of each process can remain largely the
same. On a surface level, the semantics of each process also remain largely the
same except that the value of a data object may change between being written
and later being read.
Rather than considering a single program and its state represented by data, in
the context of shared data we have to consider all processes and all of their data
as a single state. The problem that arises is guaranteeing that state is always
consistent. A classic example of such consistency is a shared buffer into which
multiple processes may add or remove data. In order to know where new data
should be inserted, or the position from which it should be read, a process must
update an indication of the size of the buffer. If the value is entered first, another
process may overwrite it before the position is updated. If the position is updated
first, then a reader may attempt to read the buffer and receive garbage.
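The hazard can be sketched in a few lines of C (the names here are illustrative, not from any particular system): the value store and the size update are separate steps, and nothing prevents another process from running between them.

/* Sketch of the shared-buffer hazard: without a lock, another process can
   interleave between the two steps and either overwrite the slot or read a
   slot whose value has not been written yet. */
#define CAPACITY 64

int buffer[CAPACITY];
int count = 0;                  /* shared count of valid entries */

void unsafe_put(int value) {
    buffer[count] = value;      /* step 1: store the value         */
    count = count + 1;          /* step 2: publish it via the size */
}
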
In order to guarantee a process always sees a consistent view of the global state,
there must be some mechanism to indicate that the data should not be accessed.
This is typically done by introducing a lock, which requires a hardware guarantee
that all processes see the flag consistently and cannot access it simultaneously.
Implementations typically rely on an atomic read-modify-write operation that
only sets the value of a data object if its current value matches what is expected.
Such systems are more efficient if multiple locks are used so that distinct parts
of the shared state can be modified simultaneously. One of the difficulties is
ensuring that the relationship between a lock and the data it is meant to protect
is well defined: that no access to the data is made without first acquiring the
lock. In this way, a portion of the shared state is used to protect the consistency
of the shared state.
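A minimal sketch of such a lock, written here with C11 atomics purely for illustration (real implementations also handle backoff, fairness, and memory-ordering details), uses exactly the read-modify-write operation described above: the lock word is set to one only if its current value matches the expected zero.

#include <stdatomic.h>

/* Minimal spin lock built on an atomic compare-and-swap. */
void spin_lock(atomic_int *lock) {
    int expected = 0;
    while (!atomic_compare_exchange_weak(lock, &expected, 1))
        expected = 0;           /* another thread holds the lock; retry */
}

void spin_unlock(atomic_int *lock) {
    atomic_store(lock, 0);
}
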
An alternative to locking regions of memory to provide protection is to create
the illusion that modifications are made atomically. This typically involves introducing additional redirection to encapsulated data that must be kept consistent.
By modifying a local (or private) copy of the data, one process can ensure that
no others will read inconsistent state. Once the modifications are complete, the
single point of redirection can be atomically updated to refer to the new (and no
longer private) version of the data.
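A small C sketch of this single point of redirection, with hypothetical names and ignoring the question of when the old version can be reclaimed, might look like the following: readers always follow the shared pointer, while a writer edits a private copy and publishes it with one atomic store.

#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative structure; initialization of `current` is assumed elsewhere. */
struct table { int values[16]; };

_Atomic(struct table *) current;

void update_entry(int index, int value) {
    struct table *copy = malloc(sizeof *copy);
    memcpy(copy, atomic_load(&current), sizeof *copy);  /* private copy   */
    copy->values[index] = value;                         /* local change   */
    atomic_store(&current, copy);                        /* atomic publish */
}
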
This sort of redirection can be expanded to be applied to general memory
access in transactional memory systems. These systems indicate that specific
regions of the program should appear to execute atomically. By tracking all of the reads
and writes that a process makes, it is possible to ensure that none of the memory
involved was changed by another transaction executing simultaneously.
Transactional memory was originally proposed as a hardware mechanism to
support non-blocking synchronization (by extending cache coherence protocols) [26]
and several software implementations are built for existing hardware. Transaction
semantics, which requires a serializable result, is less restrictive than speculative
parallelization, which requires observational equivalence, or the same result as the
original sequential execution. Because transactions have the potential to conflict
with one another, they do not guarantee parallelism. Inserting transactions does
affect the semantic meaning of a program because they alter the serial execution
it performs. The techniques I describe in Chapter 4 do not change the program
in this way, and are easier to use for a user or a tool to parallelize an unfamiliar
program.
At the implementation level, serializability checking requires the monitoring of
both data reads and writes, so it is more costly than the run-time dependence
checking. Current transactional memory systems monitor data accesses rather
than values for conflict detection. The additional flexibility is useful for supporting
parallel reduction, but it is not strictly necessary for parallelization, as it is for
concurrency problems such as on-line ticket booking.
In most thread-level speculation schemes (e.g., speculative DOALL) the entire
body of a loop constitutes a transaction. When we consider a parallelization
technique like decoupled software pipelining, described in Section 2.4.1, the loop
body is spread across multiple threads. In order to introduce speculation to such
pipelines, a multi-threaded transaction (mtx) is necessary, which previously depended
on hardware support. The work in [51] introduces a software-only multi-threaded
transaction system.
The software mtx gives the threads within a transaction the effect of a private memory space for their work. Each mtx representing a loop iteration is
divided into sub-transactions that are each executed on a separate thread. Each
sub-transaction forwards the uncommitted values it has computed to the next
through a shared queue during execution, and the final sub-transaction has the
responsibility of committing the transaction as a whole.

2.1.2 Message Passing

The other typical way to express synchronization between parallel processes is
through message passing. The most basic form of message passing is through
matched pairs of send and receive statements by which one process explicitly
passes data to another process that has expressly made an attempt to acquire
it. This point-to-point communication can be synchronous or asynchronous, and
may be relaxed to allow unmatched pairs of communication primitives. In comparison to data sharing as a synchronization mechanism, message passing has the benefit
of encouraging local data storage in systems with non-uniform memory access.
Attempting to model message passing as a global, shared state is non-trivial
because of the complexities resulting from delays as messages are passed between
processes. Message passing does not have a clear analog in imperative sequential programming, though it clearly translates to the client-server models typical
of networking. Such systems are often event based, where a process's state is
directly affected by signals given to it, rather than polling a data location or
explicitly receiving a message.

2.2 Speculative Threads

2.2.1 Ancillary Tasks

Past work has suggested specifically using speculative execution to treat some
portion of the program's work as a parallel task. Such tasks include the addition of
memory error and leak checking, performing user defined assertions, and profiling.
In [48] the authors suggest creating a shadow version of a program to address
these ancillary tasks specifically, although they do not address how the shadow
process might be generated.

By pruning computation that is not necessary to determine the control flow,
the shadow process creates some slack with which it can instead perform the additional work. It is not clear, however, whether typical programs contain enough unnecessary
computation that can be removed to compensate for the extra work.
It is not always possible for a shadow process to determine all of the values
necessary for control flow. These values could be dictated by interactive input or
system calls that the shadow cannot safely reproduce. As a result, the control
flow within the shadow process may depend on values communicated from the
primary process once it has computed them. Additionally, there is a trade off to
be made between having the shadow compute values and having the main process
communicate those values to the shadow.
A limitation of the shadow processing system is composability; there is no
good way to handle multi-threaded or multi-process tasks. There is a limit to the
availability of signal handlers. The shadow process is generated once and runs
in parallel to the entirety of the original process. Empirical results found the
shadow process typically finishing after the serial process. As a result, there is
no mechanism for the shadow process to provide feedback to the original process.
This is acceptable in some use cases, such as error checking, where a report can
be generated after program execution, but prevents shadow processing from being
applicable for accelerating processes in general.
The limitation of shadow processing is addressed in later work by periodically
spawning a new shadow process [41]. The objective in this case is specifically to
make execution profiling more tolerable, described as shadow profiling. A new
profiling process is created each time a timer interrupt is triggered (by default the SIGPROF timer is used, but this is customizable).
Program instrumentation is added using the Pin tool [36] by having the shadow
process switch to a just-in-time rewriting mode within the Pin runtime after it has
forked. Moseley et al. [41] additionally address some of the problems that arise
even when using copy-on-write protection. Writes to shared memory segments or


memory mapped files cannot be completely protected and trigger a fault in the
child profiling process. The shadow profiling process can choose to skip the troublesome instruction, or it can terminate and allow a new shadow profile process
to be created.
System calls are also a problem for speculative execution, and Moseley et al.
[41] attempt to categorize them into groups: benign, output, input, miscellaneous,
and unsafe. If a call is encountered that is not known explicitly to be safe, the
shadow simply aborts and allows a new profiler to begin.
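The periodic-fork idea can be sketched as follows; this is a simplified illustration of the general mechanism rather than the actual shadow-profiling implementation, and run_instrumented_sample is a hypothetical stand-in for the Pin-based instrumented mode.

#include <signal.h>
#include <sys/time.h>
#include <unistd.h>

extern void run_instrumented_sample(void);   /* hypothetical profiling mode */

/* On each SIGPROF tick, fork a shadow that profiles for a bounded amount of
   work and then exits; the parent keeps running at full speed. */
static void start_shadow(int sig) {
    if (fork() == 0) {                 /* child: copy-on-write snapshot */
        run_instrumented_sample();
        _exit(0);                      /* discard the shadow's state    */
    }
}

void install_shadow_timer(void) {
    struct itimerval every_second = { {1, 0}, {1, 0} };
    signal(SIGCHLD, SIG_IGN);          /* do not accumulate zombie shadows */
    signal(SIGPROF, start_shadow);
    setitimer(ITIMER_PROF, &every_second, NULL);
}
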
Because the profiling system is only intended to be an instrumentation sample,
the shadow profiling can safely afford to abort in many cases. It is also not necessary for the execution performed in the shadow profile to be deterministic. While
allowing the control flow in the speculative process to deviate from the original
program reduces the accuracy of the profile, it does not affect the correctness of
the program. This flexibility is not acceptable for general purpose speculative execution, and again precludes shadow processing from being used for accelerating
processes in the general case.
Newer work has moved beyond parallelizing memory checking to placing dynamic memory management in a parallel task, which is referred to as the memory
management thread (mmt) [58]. If the memory allocation and deallocation subsystem includes safety checks such as invalid frees then these checks can also be
placed in the memory management thread.
One of the difficulties in moving dynamic memory management into a separate
thread is ensuring that the memory management can be wrapped with minimal
modification to the application and memory management library. Another significant difficulty is the overhead of thread synchronization, which the authors note is
comparable to the cost of the memory management itself. These two problems are
addressed by allowing the mmt to run as a server and only requiring synchronization for memory allocation. When memory objects are released, the program can
essentially issue the deallocation asynchronously and continue without waiting for
memory management to complete.
The synchronous memory requests still have a communication delay in addition
to the period of time needed to actually service the request. This is alleviated by
having the mmt speculatively preallocate objects, which can be provided without
delay if the size is right. Delays are further reduced by batching deallocation
requests to the mmt, and symmetrically by providing the client with multiple
preallocated objects.
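The batching of deallocation requests can be sketched as below; mmt_submit_free_batch is a hypothetical stand-in for whatever hand-off mechanism the memory management thread actually exposes, and the point is simply that the synchronization cost is paid once per batch rather than once per free.

/* Sketch of batched, asynchronous deallocation. */
#define BATCH 64

extern void mmt_submit_free_batch(void **ptrs, int n);  /* assumed mmt entry point */

static void *pending[BATCH];
static int npending = 0;

void async_free(void *p) {
    pending[npending++] = p;                       /* no synchronization here */
    if (npending == BATCH) {
        mmt_submit_free_batch(pending, npending);  /* one hand-off per batch  */
        npending = 0;
    }
}
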
Although the mmt technique can extract some memory safety checks into a
separate thread, not all types of memory checks are isolated in the allocation or
deallocation routines. Checks such as array overflow must be performed in
the context of the memory access.
Some of these limitations are addressed in the approach taken in the Speck
(Speculative Parallel Check) system [45]. The Speck system is intended to decouple the execution of security checking from the execution of the program at large.
During program execution a set of instrumented system calls potentially creates an
additional instance of the application that includes the security checks. Like earlier
work, some of the overhead is removed by only entering the instrumented code
path periodically.
The primary focus of the Speck work is on security checks such as virus scanners and taint analysis, though it could be applied to simpler checking for safe
memory access. The limitation of the Speck system is its dependence on the use
of a modified Linux kernel designed to support efficient distributed file system
interaction, called Speculator [44]. This support is necessary to allow for unsafe
actions performed by an application to be rolled back if one of the security checks
were to fail. An additional feature of their operating system support is the ability
to ensure that certain system functionality operates identically in both processes,
and that signals are delivered at the same point in the execution of each.
Another recent approach to minimizing the overhead of memory safety checking with thread-level speculation did so by parallelizing an existing memory checking library [31]. Because of the tight synchronization needed by the accesses to the
data structures used by the library, adapting it for use with TLS requires detailed
analysis of the library itself and the manual insertion of source level pragmas to
denote parallel regions. The annotated code was then passed through a parallelizing compiler which extracts each parallel task. Ultimately, the authors assume
that some form of hardware support will guarantee the ordering of the tasks to
guarantee the sequential semantics of the original program. The system also relies
on the presence of a mechanism to explicitly synchronize access to the library's
data structures, which is not provided.

2.2.2 Run-Ahead

An approach related to the techniques used to off-load ancillary tasks to unused
processing units is to create a second thread of execution to precompute some
values for the primary process. Rather than performing additional work, thus
lagging behind the original program, these processes run ahead of the primary
process in order for it to execute more quickly.
There are a number of methods for producing a run-ahead process, relying on
various kinds of support. The Slipstream technique presented in [57] monitors
the original program to detect operations which are redundant at run time. The
operations that are found to be redundant can be speculatively elided from the
leading process when they are next encountered. As a result, the leading process
can run faster because many operations are removed entirely. The trailing process
is also able to execute more quickly because of side-effects from the first process:
memory prefetching and improved branch prediction. The end result is that the
two processes together (one of which is the original program) complete faster than
either would independently.
Because the leading process is not performing all of the operations of the
original program, its execution may deviate from the correct execution, which
is always computed by the trailing process. In order to recover from incorrect
speculation, and to generate the lead process, the Slipstream technique requires
a number of additional hardware components. The lead process must have a
program counter that is modified to skip past some instructions by recording
previous traces through the program execution. The addresses of memory locations
modified by the lead process are recorded to allow for recovery by updating those
values from the memory state of the trailing process (additionally, the register file is copied).
The suggested mechanism for determining which operations may be good candidates for speculative removal is based on a small data flow graph built in reverse as instructions are retired. Operations that write to memory (or registers)
are recorded as being the producer of the value stored there, and a bit denotes
the value as valid. A subsequent write with the same value is redundant, while
a different value updates the producer. A reading operation sets a bit indicating a location has been referenced, which allows an old producer operation to be
removed if the value was unused.
Another related idea used in hardware systems is to extract a fast version of
sequential code to run ahead while the original computation follows. It is used to
reduce memory load latency with run-ahead code generated in software [33], and
recently to reduce hardware design complexity [19].
A third, more recent idea is speculative optimization at fine granularity, which
does not yet make use of multiple processors [43]. All of these techniques require modifications to existing hardware. Similar special hardware support has
been used to parallelize program analysis such as basic block profiling, memory
checking [46], and data watch-points [66].


Loop-level software speculation was pioneered by the lazy privatizing doall
(LPD) test [52]. The LPD technique works in two separate phases: a marking
phase executes the loop and records access to shared arrays in a set of shadow
arrays. A later analysis phase then checks for dependence between any two iterations of the loop by analyzing the shadow values. Later techniques speculatively
make shared arrays private to avoid falsely detecting dependencies, and combine the marking and checking phases to guarantee progress [11, 14, 22]. Previous
systems also address issues of parallel reduction [22, 52] and different strategies
of loop scheduling [11]. A weaker type of software speculation is used for disk
prefetching, where only the data access of the speculation needs to be protected
(through compiler-inserted checks) [9].
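The shadow-array idea behind these loop-level tests can be sketched as follows; this is a simplified illustration, with the marking and checking folded together as in the later combined schemes, rather than the published LPD algorithm, and it ignores privatization and reductions.

/* Each shadow entry remembers the last iteration that read or wrote the
   element; an access from a different iteration than the last writer signals
   a cross-iteration dependence. */
#define N 1024

int write_iter[N];     /* last iteration to write the element, or -1 */
int read_iter[N];      /* last iteration to read the element, or -1  */
int conflict = 0;

void reset_shadow(void) {
    for (int i = 0; i < N; i++)
        write_iter[i] = read_iter[i] = -1;
}

void mark_read(int index, int iteration) {
    if (write_iter[index] >= 0 && write_iter[index] != iteration)
        conflict = 1;                    /* flow dependence across iterations */
    read_iter[index] = iteration;
}

void mark_write(int index, int iteration) {
    if ((write_iter[index] >= 0 && write_iter[index] != iteration) ||
        (read_iter[index] >= 0 && read_iter[index] != iteration))
        conflict = 1;                    /* output or anti dependence */
    write_iter[index] = iteration;
}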

2.3 Fork and Join

2.3.1 Futures

A future is a mechanism to indicate that the result of some computation will be
needed by the process, referred to as the continuation, at some point in the
future. Originally introduced in Multilisp [24], the contents of the future computation are independent of the invoking computation. A system implementing
futures is free to schedule the future computation at any point before the result of
the computation is needed. Support for futures exists in the Java programming
language through its concurrency utilities package. Unlike functional languages
like Multilisp, Java and other imperative languages make frequent modification of
shared state. In its implementation of futures, the Java run-time system does not
make any guarantees about the synchronization of the future with its continuation. The programmer is still responsible for ensuring that access to shared data
is protected.
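The essence of a future can be sketched in C with POSIX threads; this is an illustration of the concept only, not the Java mechanism discussed above, and it starts the computation eagerly rather than leaving the scheduling decision to the run-time system.

#include <pthread.h>
#include <stdlib.h>

/* Conceptual future: the computation begins on another thread immediately,
   and the continuation blocks only when it finally asks for the value. */
struct future {
    pthread_t thread;
    void *(*fn)(void *);
    void *arg;
};

static void *future_run(void *f) {
    struct future *fut = f;
    return fut->fn(fut->arg);
}

struct future *future_create(void *(*fn)(void *), void *arg) {
    struct future *fut = malloc(sizeof *fut);
    fut->fn = fn;
    fut->arg = arg;
    pthread_create(&fut->thread, NULL, future_run, fut);
    return fut;
}

void *future_get(struct future *fut) {
    void *result;
    pthread_join(fut->thread, &result);   /* wait only when the value is needed */
    free(fut);
    return result;
}
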
Work on a Java implementation of futures that are safe in terms of maintaining their sequential semantics has been done through modifications to the
run-time virtual machine [63]. In order to ensure the effects of a future are not
intermixed with data accesses of its continuation, each is run in a separate thread
with a local write buffer implemented by chaining multiple versions of an object
together. Reads to the object must traverse a list of versions to locate the correct one for the context of the thread. Each thread must also maintain a read
and write map of data accesses, which is used to detect read-write violations
between the threads. Despite the name, the future should conceptually complete
its data accesses before the continuation.
The implementation of safe futures depends heavily on the fact that Java is a
managed language in which objects have metadata and are accessed by reference,
simplifying the redirection needed to protect access. The additional work needed
to monitor data access is added to existing read and write barriers designed for
garbage collection, and the rollback routine is based on the virtual machine's
exception handling.
Recent work has sought to automatically insert synchronization for Java futures using compiler support [42]. This support determines statically when a
future first accesses a shared data object and inserts a special form of barrier
statement called allowed. The allowed statement is not released in a continuation until all of its futures have granted access with an explicitly matched grant
statement. A list of futures is built as they are spawned, and cleared after they
have granted access to the data. Because the insertion of the grant and allowed
operations is based on static analysis, it is more conservative than what could
be achieved with a system using run time analysis. The static analysis has the
advantage of significantly lower overhead during execution.

2.3.2 Cilk

One representation of fork-join style parallel programming is provided by the Cilk
programming language, which is an extension of the C programming language using an additional set of annotations [6, 18]. Cilk allows a programmer to expose
parallelism using a set of language-specific keywords, which can be removed to
return to a valid C program. Because the programmer is responsible for distinguishing sequential and parallel code, the two portions of a program can be
clearly delineated and the programmer is forced to consider the overheads in the
sequential portion. The code that occurs within the cilk procedures is considered
to be work while the code outside these procedures is referred to as the critical
path. This distinction is directly analogous to the relationship expressed earlier
as Amdahl's Law [2, 23].
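For reference, Amdahl's Law bounds the achievable speedup by the fraction of execution that remains sequential: if p is the parallelizable fraction of the running time and N the number of processors,

S(N) = \frac{1}{(1 - p) + p/N}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p}.
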
The original implementation of Cilk required that invocation points distinguish
between threads that produce a value and the continuation that consumes that
value. The first thread would be created using spawn, while the consumer (or
successor) had to be created with spawn_next. In order to pass the value to the
consumer, the send_argument keyword would be used to explicitly place the result,
potentially allowing the waiting thread to begin. The keyword thread was used in
the way that cilk is now, to represent code that contains Cilk-specific code. The
use of spawn_next and send_argument is now handled automatically by the runtime
when the sync keyword is used. This improvement removes a requirement that all
Cilk threads (functions with the cilk keyword) be non-blocking.
In addition to the source code annotations, the Cilk programming language
depends on a sophisticated run-time system. The cornerstone of the system is
a work-stealing scheduler that seeks to balance the load between the available
processing units. The scheduler moves Cilk tasks (threads) from the processor
where they were spawned to processors that are idle. All of the overheads of the
system (e.g., spawning and moving tasks) are placed on the critical path, which
is a design decision not shared by all systems.

2.3.3 Sequential Semantics

Although fork-join style semantics for parallelism makes explicit the point at which
parallel computation is needed, as mentioned in Section 2.3.1 there is no implicit
guarantee of atomicity or progress. A programmer is still responsible for guarding
shared data accesses to preserve object consistency and inserting synchronization
to prevent race conditions. Recent work using a run-time system called Grace
converts a program with fork-join parallel threads into a sequentially consistent
program [4].
Guaranteeing sequential consistency requires that the effects of operations appear in
a specific order. This sequence is defined by the semantics of the source program
code. By assuming that threads should be serialized in the order they are created,
the sequential semantics of a fork become the same as a simple function call.
By allowing the run-time system to ensure thread ordering and atomicity, locks can
be elided and the program viewed semantically as though it were serial.
The Grace system does this by converting each thread into a heavy-weight
process with isolated (copy-on-write) memory. Heap and global data that would
have originally been available to all threads are placed in a memory mapped
file and each process maintains a local mapping of the same data for privatized
writes. Using a versioning scheme for the memory, and logging accesses during
execution, the run-time system can determine whether the process's execution
is correct. Assuming correct execution, the process must wait until all logically
prior processes complete before committing its local writes to the global map.
Although the process corresponds to a thread in the original program, Grace
intends to detect violations of the sequential semantics to guard against improper
parallel implementations.
Somewhat earlier work suggested two ways in which sequential semantics could
be relaxed intuitively to remove common points of misspeculation [8]. They argue
that sequential semantics may be overly restrictive in many cases in which some
portions of execution do not need to be explicitly ordered, and a program may
have multiple valid outputs. The primary suggestion is that groups of functions
be annotated to indicate a commutative relationship if their internal state does
not need to be ordered but does need to be accessed atomically. Put another
way, these functions have side effects that are only visible to one another. This
kind of behavior is common for dynamic memory management, which maintains
metadata that is not accessed externally.
The programmer is still responsible for identifying all functions accessing the
same state. Although this is significantly easier than identifying all functions that
access shared state and subsequently grouping them, it does allow for failures the
speculation system would otherwise prevent. Additionally, it requires atomicity
guards within the functions, which the authors ignore. There is an additional
requirement that commutative functions operate outside the scope of speculation
itself. If a transactional memory system is being used, the functions must use
non-transactional memory. This complicates cases where some state is internal
to the commutative group, while other state is global, and also implies that these
functions must have an explicit inverse function because the rollback mechanism
of the speculation system will not protect them. This limits the applicability of
commutative annotations, or requires significantly more programmer effort than
initially suggested.

2.4 Pipelining

The general approach of parallelizing a loop, executing each instance of the
loop body separately, is not possible if there are dependencies carried from one
iteration of the loop to the next. There may still be cases in which such a loop
can be parallelized, if it can successfully be pipelined. By breaking the loop
into a sequence of individual stages, we may find that dependencies are carried
from one stage to the next within an iteration, and one iteration of a stage to the
same stage in the next iteration, but that no dependencies are carried from a late
stage to an early stage in the next iteration.
A pipelined loop is analogous to a manufacturing pipeline in which a product
is created in stages. Each individual widget mirrors an instance of the loop body:
just as only one item can be painted or packaged at once, subsequent instances
of the first stage of the loop cannot execute concurrently. Likewise, just as the
widget must be painted before it is packaged, stages of the loop must be executed
in order. In such cases, the iterations of the loop can be performed in parallel by
skewing successive instances by one stage.
Given the code in Algorithm 2.4.1, the first three skewed iterations would appear as
depicted in Algorithm 2.4.2. Note that the first stage to be executed on each processor
is stalled until the stage is completed on the previous processor (the correctness of
this pipeline relies on the memory coherence of the architecture). There is also
a stall between loop iterations because the number of stages does not align with
the number of processors.

2.4.1 Decoupling

In a traditional software pipeline, the thread on one processor executes an entire
instance of the loop body. Once the instruction carrying a dependency has been
executed (the dependent stage of the pipeline), the next iteration of the loop begins

Algorithm 2.4.1 Listing of pipeline loop.


int A[], B[], C[], D[];
for (int i = 0; i < N; ++i) {
    B[i] = f(A[i]);
    C[i] = g(B[i]);
    D[i] = h(C[i]);
}

Algorithm 2.4.2 Interleaved iterations of pipelined loop.

Processor 1        Processor 2        Processor 3        Processor 4
B[1] = f(A[1]);
C[1] = g(B[1]);    B[2] = f(A[2]);
D[1] = h(C[1]);    C[2] = g(B[2]);    B[3] = f(A[3]);
                   D[2] = h(C[2]);    C[3] = g(B[3]);    B[4] = f(A[4]);
B[5] = f(A[5]);                       D[3] = h(C[3]);    C[4] = g(B[4]);
C[5] = g(B[5]);    B[6] = f(A[6]);                       D[4] = h(C[4]);

on another processor. One seeks to align the loop structure and processor count
so that the first processor completes its loop iteration just as the last processor
completes the dependent stage of the loop. In this case there are no bubbles in
this pipeline and the processors can be maximally utilized.
The above scenario assumes there is no delay between completing the dependent stage on one processor and initiating it on another. In reality, there will likely
be communication latency between the processors, causing later iterations to
stall slightly. Multiple stalls will accumulate over time and propagate through
later iterations.
The reason this problem arises is that communication is flowing cyclically
through all of the processors. Decoupling breaks the communication cycle so the
dependency communication only flows in one direction [47]. In a decoupled software pipelined loop, after the dependent stage is executed on the first processor
the remainder of the loop is dispatched to another processor while the first processor begins the next dependent stage. The result is that any communication
delay applies equally in all cases; the second processor is effectively skewed by
that delay.
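As a minimal illustration of this structure (a sketch only, not the implementation from [47]), the decoupled pipeline can be expressed with two POSIX threads connected by a one-way queue: the first thread runs only the dependent stage and enqueues iteration numbers, while the second thread dequeues them and runs the remaining stages, so any communication delay appears as a fixed skew rather than an accumulating stall. The stage functions f, g, h and the queue layout here are invented for the example.

#include <pthread.h>

#define N 1024
static int A[N], B[N], C[N], D[N];

/* one-way channel from the dependent stage to the rest of the pipeline */
static int queue[N];
static int head = 0, tail = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  filled = PTHREAD_COND_INITIALIZER;

static int f(int x) { return x + 1; }   /* dependent stage (illustrative) */
static int g(int x) { return x * 2; }   /* later stages (illustrative)    */
static int h(int x) { return x - 3; }

static void *dependent_stage(void *arg) {
    for (int i = 0; i < N; ++i) {
        B[i] = f(A[i]);                 /* carries the loop dependency */
        pthread_mutex_lock(&lock);
        queue[tail++] = i;              /* communication flows one way */
        pthread_cond_signal(&filled);
        pthread_mutex_unlock(&lock);
    }
    return arg;
}

static void *remaining_stages(void *arg) {
    for (int done = 0; done < N; ++done) {
        pthread_mutex_lock(&lock);
        while (head == tail)            /* wait for the next iteration */
            pthread_cond_wait(&filled, &lock);
        int i = queue[head++];
        pthread_mutex_unlock(&lock);
        C[i] = g(B[i]);
        D[i] = h(C[i]);
    }
    return arg;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, dependent_stage, NULL);
    pthread_create(&t2, NULL, remaining_stages, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}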
The process of scheduling a decoupled software pipeline involves constructing
a dependence graph of the loop instructions. The instructions represented by a
strongly connected component (scc) in the graph must be scheduled collectively in
a thread (though a thread may compute multiple components). These components
limit possible parallelism in two ways: there can be no more pipeline stages than
there are sccs, and the size of the largest scc is a lower bound on the size of the largest
pipeline stage.
By introducing speculation into the decoupled software pipelined loop it is
possible to break some of the dependencies [60]. Breaking graph edges allows for
a reduction in the size of sccs and an increase in their number. The speculation temporarily removes dependencies that are highly predictable, schedules the
pipeline normally, then replaces edges that do not cross threads or flow normally
from early threads to later ones.
The implementation presented in [60] relies on compiler support for transformations and versioned memory to enable rollback of misspeculation. Each
loop iteration involves advancing the memory version and sending checkpoint information to a helper thread, as well as speculation status. The dependence
on additional hardware support can be overcome using software multi-threaded
transactions as described in Section 2.1.1.

2.5 Support Systems

In order for parallel programming and particularly speculative parallel programming to be possible, there are a number of ways the task of generating
the program must be supported. The initial problem is determining how the programmer should express the parallelism. The actual implementation of the parallel


constructs can be built for an existing language using a new library and programming interface, or may be built around a language explicitly designed for parallel
programming. In the latter case, the language compiler may be equipped with
additional analysis techniques to determine whether the parallel execution will be
valid. Below the programming language, the operating system must provide some
form of support. This OS support must at the very least include scheduling for
multiple tasks, but may also provide additional isolation or monitoring. At the
lowest level, the hardware must again provide multiple processing cores.

2.5.1 Operating System

Adding support for speculation at the operating system level provides a broad form
of support for applications. It is however generally limited to use by heavy-weight
processes, while light-weight thread implementations may need to multiplex what
the operating system supports.
One way for the operating system to enable parallel programming is by forcing
sequential semantics on the processes within the system, much like the run-time
system described in Section 2.3.3. This can be achieved by building a message-based
system in which processes only execute in response to receiving a message,
generating output to be passed to another process. Conceptually, only the oldest message in the system can be consumed, which serializes the computation by
forcing the causality to flow linearly through the virtual time of the system.
The Time Warp operating system (twos) takes this approach and extends it
by speculatively allowing multiple processes to execute simultaneously [27]. twos
is motivated by distributed systems in which synchronization between processes
is impeded by varying latencies between parts of the system. A process cannot
quickly determine whether it may receive a message in the future that should have
been handled before those currently waiting in its queue. For this reason, allowing


a process to proceed speculatively is also advantageous.


To allow for rolling back incorrect speculation, each process periodically generates a checkpoint, which is enqueued and assigned the process's current virtual
time. The virtual time value is incremented on (and assigned to) each sent message, while received messages update the virtual time. If the incoming message is
labeled with a time in the virtual past then the process has consumed a message
that should have been processed before it, indicating misspeculation and causing
rollback.
The dependency, or causality, between processes is managed using anti-messages
that annihilate their matching message. Whenever a twos process creates a message a matching anti-message is created as well. While the original message is
sent the anti-message is kept in the output buffer. After a process has rolled back
to a checkpoint, it will begin by consuming the oldest message (which just arrived
and caused the misspeculation). As the process proceeds it will often generate
many of the same outgoing messages, which will annihilate the matching message
waiting in the output buffer. If the anti-message is not matched it will be sent to
the original recipient of the message, where it will either cancel a pending message in that process's input queue or cause another rollback. Irrevocable messages
(i.e., output to the user) are held in buffers until all messages that were issued
before them in virtual time have been consumed. The same condition determines when
anti-messages can be expunged from output buffers.
twos has a number of limitations, resulting from its intended purpose as a
platform for simulation, that make it inappropriate for use as a general purpose
operating system. One complaint leveled by later work is that twos
processes must be deterministic. In reality, processes can use a random number
generator as long as it is reproducible. Truly non-deterministic execution is
generally not desirable. twos does not allow for the use of dynamic memory
allocation, which is quite limiting. The operating system is designed only to run


in single user mode and on a static set of processes, though as long as processes
are not communicating with one another the principles of twos remain valid.
The Speculator system introduces support for explicitly tracking speculative
processes by extending the Linux operating system kernel [44]. As with all speculation systems, Speculator implements detection of and recovery from misspeculation and guarantees that speculative processes do not perform irrevocable operations.
Because speculation is performed on heavy-weight processes, rollback of incorrect speculation is handled by terminating the process and restarting from a
checkpoint. The checkpointing routine is based on extensions to the standard
fork call. The process is duplicated, but the new child is not made available to
the scheduler and retains the same identifiers as the original process. Additionally,
any pending signals are recorded and file descriptors are saved. The memory of
the process is marked for copy-on-write just as when a normal fork call is made.
During execution of the speculative process, output operations are
buffered for playback when the speculation is determined to be correct. Inter-process communication is generally allowed, but the process receiving the communication is made to checkpoint and become speculative as well. The dependency between the two processes is tracked so misspeculation will cause a series
of rollbacks to occur. Outside of the kernel, the speculative state of a process is
indeterminate.

2.5.2 Compiler

Any language with support for parallel programming will need some form of compiler support, even if it simply interprets a trivial syntax extension as a call to a
run-time library. More powerful analysis by a compiler can allow some degree of
automatic parallelization. The Mitosis compiler implements a form of run-ahead


speculation like those described in Section 2.2.2 by automatically selecting and


reducing speculative regions [50].
The objective of the Mitosis compiler is to insert spawn points in a program and
to determine a corresponding point to which another thread should speculatively
jump. To enable the speculative thread, the region between the spawn and jump
points is reduced into a pre-computation slice (p-slice) that should evaluate the
state needed by the thread. The compiler estimates the length of the speculative
execution, and traces back through the control-flow graph from the point it should
complete. Any values that are found to live into the region between the jump point
and the assumed end point are required. By continuing past the jump point to
the spawn point, the instructions needed to compute those values are identified
and added to the p-slice.
The Mitosis compiler uses profile information to speculatively optimize the
p-slice in several ways. Branches that are infrequently taken, and write-read
dependencies that occur infrequently are elided. Additionally, control flow that
cannot reach the jump point is removed entirely. The profile information is also
used to select the most effective pairs of spawn and jump points based on the
length of the p-slice and speculative region, as well as the likelihood of the path
being taken and correctly speculated. The Mitosis system relies on the underlying
hardware to detect misspeculation and handle recovery.
While the Mitosis system is built on a research system (the Open Research
Compiler), contemporary work implemented similar compiler support in the
production gcc compiler [35]. Rather than generating pre-computation slices,
POSH relies on profile information to select tasks that are likely to speculate
correctly. Tasks are initially created for every loop body and subroutine (and the
continuations of both) and then expanded or pruned to meet size restrictions:
large enough to overcome the cost of creation and small enough to be manageable
(hardware TLS support is inherently limited to tracking a finite number of accesses).

Like the Mitosis system, POSH relies on hardware support for detection of violations of the sequential semantics of the program. In both cases, the assumption
is that threads are explicitly spawned. While POSH specifies that the architecture
provides a spawn instruction, Mitosis leaves the architecture details completely unspecified. In a departure from the fork/join notation, POSH assumes the spawned
task will explicitly commit, while the parent task does nothing to explicitly reclaim the child. If the parent attempts to read results from the child before it is
complete, misspeculation will occur.
Rather than inserting spawn and commit, a compiler could automatically generate the synchronization necessary to guarantee sequential ordering. Past work
has used data-flow analysis to insert wait and signal statements similar to the
grant and allow instructions introduced in Section 2.3.1 [64] to pipeline loop bodies. The precise semantics of the instructions only indicate that access to a particular variable is guarded (equivalent to introducing a lock) and ordered (version
numbered). It must be assumed that instructions to initiate and finalize tasks are
also generated.
Zhai et al. only consider loop bodies as candidates for parallelization. The
naive placement of the synchronization would place the request at the beginning
of the task (loop body) and the release at the end, encapsulating the entire loop
body in a single critical section. The region of code between the wait and signal represents the
critical section in which only the current task can access the variable, and like any
critical section it should be made as small as possible. To optimize the interleaving
of the tasks, the wait statement should be placed as late as possible while still
preceding all accesses to the variable. Likewise, the signal should be as early as
possible as long as no further writes follow it.
To further reduce the size of the critical section, instructions may be reordered
along with the synchronization instructions. By treating a signal instruction as
a read and following the dependence chain up through a control flow graph, the


entire sequence of instructions can be hoisted to a point at which dependency


cannot be determined (e.g., due to aliasing). Like the later Mitosis and POSH
systems, Zhai et al. consider profile information to achieve further optimization.
In cases where branches are highly predictable, synchronization may be hoisted out
of the hot path at the expense of misspeculation due to access in the uncommon
case.

2.5.3 Race Detection

Race detection is concerned with determining whether two tasks can be run in
parallel or need to be performed in series. One way this can be done is by monitoring threads during execution to maintain a representation of their relationship
in terms of being inherently serial or possibly parallel. During specific run-time
operations the representation can be queried to determine if a serial relationship
has been violated [17]. For example, when threads access shared data the order
of accesses must match the order of serial threads.
During execution a tree is maintained to represent threads. The leaves of
the tree represent threads, while the internal nodes indicate either a series or
parallel relationship. To determine the relationship between two threads, their
least common ancestor holds the appropriate marker. For a given execution tree, the
leaves are numbered with a depth first traversal, and given a second number by
traversing the parallel nodes in the opposite order. Given these values, two nodes
are in series if the values indicate the same order, while the nodes are executing
in parallel if the values are in opposite orders.
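A minimal sketch of the query just described: each leaf thread carries two labels, one from the depth-first numbering and one from the numbering that reverses the order of parallel nodes, and two threads are in series exactly when both labelings agree on their order. The struct and label names are invented for illustration and are not taken from [17] or [3].

#include <stdbool.h>
#include <stdio.h>

/* labels assigned to a leaf (thread) by the two traversals */
struct sp_label {
    int forward;  /* depth-first order                          */
    int mirror;   /* order with parallel children visited last  */
};

/* threads are in series iff both orderings agree; otherwise they are parallel */
static bool in_series(struct sp_label a, struct sp_label b) {
    return (a.forward < b.forward) == (a.mirror < b.mirror);
}

int main(void) {
    struct sp_label t1 = {1, 2}, t2 = {2, 1};  /* orders disagree: parallel  */
    struct sp_label t3 = {3, 3};               /* after both in either order */
    printf("t1,t2: %s\n", in_series(t1, t2) ? "series" : "parallel");
    printf("t1,t3: %s\n", in_series(t1, t3) ? "series" : "parallel");
    return 0;
}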
Early implementations required that the reverse ordering of nodes be maintained at run time, requiring computation on the order of the depth of the tree. The
approach in [3] allows for parallel maintenance of and queries to the series/parallel
information in linear time.


The process of data race detection can be made more efficient by reducing the
number of objects that need to be monitored at compile time. The eraser analysis tool achieves this using a number of deep analysis techniques [38]. Initially,
all accesses within a target Fortran program are assumed to require annotation
(including not just recording of access, but initialization and cleanup of metadata
to allow such recording). Using dependence analysis eraser prunes annotation
around statements without dependencies. With intra-procedural analysis, including alias, modification, and reference information as well as whether a procedure
is ever used in a parallel construct, annotation for a procedure's parameters may
be removed as well. After pruning as much annotation as possible, remaining
checks are handled using calls into an associated run-time library to track data
access during execution.

2.6 Correctness Checking

2.6.1 Heavyweight

Recently, three software systems have used multi-processors for parallelized program profiling and correctness checking. All use heavyweight processes, and all are based
on Pin, a dynamic binary rewriting tool [36]. SuperPin uses a signature-checking
scheme and strives to divide the complete instrumented execution into time slices
and execute them in parallel [62]. Although fully automatic, SuperPin is not
foolproof since in theory the slices may overlap or leave holes in their coverage.
The speculative execution system I describe in Chapter 5 is not designed for
fully automatic program analysis, although I describe a use case in which automatic analysis is enabled with some manual effort. The resulting system guarantees complete and unique coverage during parallel error checking using a
programming interface that allows selective checking. This is useful when check-


ing programs that contain unrecoverable operations on conventional operating


systems. Because the runtime operates at the program level it requires source
code and cannot instrument externally or dynamically linked libraries. On the
other hand, it benefits from full compiler optimization across original and instrumented code. This is especially useful for curbing the high cost of memory-safety
checking. For example it takes a minute for gcc to optimize the instrumented
code of 458.sjeng, and the optimized code runs over 20% faster in typical cases.
Unlike the earlier systems that automatically analyze the full execution, a system
that is designed specifically for speculative optimization can provide a programming interface for selecting program regions, the ability for a checking process to
roll back the computation from the last correct point, and a throttling mechanism
for minimizing useless speculation.

2.6.2 Hardware Techniques

Fast track is closely related to several ideas explored in hardware research. One
is thread-level speculative parallelization, which divides sequential computation
into parallel tasks while preserving their dependencies. The dependencies may be
preserved by stalling a parallel thread as in the Superthreaded architecture [59]
or by extracting dependent computations through code distillation [67] and compiler scheduling for reducing critical forwarding path [64]. These techniques aim
to only reorganize the original implementation rather than to support any type
of alternative implementation. Fast track is not fully automatic, but it is programmable and can be used by both automatic tools and manual solutions. The
run-time system checks correctness differently. The previous hardware techniques
check dependencies or live-in values, while fast track checks result values or some
user-defined criterion.
Hardware-based thread-level speculation is among the first to automatically
exploit loop-level and method-level parallelism in integer code. In most tech-


niques, the states of speculative threads are buffered and checked by monitoring
the data accesses in earlier threads either through special hardware additions to
a processor [54], bus snooping [10], or an extended cache coherence protocol [56].
Since speculative states are buffered in hardware, the size of threads is usually no
more than thousands of instructions. A recent study classifies existing loop-level
techniques as control, data, or value speculation and shows that the maximal
speedup is 12% on average for SPEC2Kint assuming no speculation overhead and
unlimited computing resources [28]. The limited potential at the loop level suggests that speculation needs to be applied at larger granularity to fully utilize
multi-processor machines.

2.6.3 Monitoring

Correct data monitoring is essential for speculative parallelism techniques, and is


one of the major sources of run-time overhead. For large programs using complex data,
per-access monitoring causes slowdowns often in integer multiples. The problem of
data monitoring is closely related to data breakpoints in the realm of debugging,
which must monitor program memory accesses and subsequently perform some
user-defined action. Also related is on-the-fly data race detection, which seeks
to detect inconsistencies in data accesses between threads of the same program.
Past work focused on reducing the overhead of data breakpoints (also known as watch
points, as opposed to control breakpoints) has approached the problem using compile-time
data-flow analysis to identify accesses which do not need to be explicitly monitored, and
by instrumenting the program with checks that use simplified address calculations [61].
Within a debugger there
are two approaches to handling a break point for a particular memory location:
watch the location itself, or check each operation that could modify the location. Typically, many instructions can be statically determined not to require

monitoring.
Another approach to reducing the overhead of debugging is to use sampling
over a large number of runs. One such technique introduces code instrumentation
to record a number of boolean predicates based on run-time program behavior [34].
The predicates represent possible control flow (e.g., was a branch taken), return
value from functions (whether it is positive, negative, or zero), and the relationship between variables in the same scope (whether one is greater than, less than, or equal to
the other). The total number of predicates is extremely large and so is the overhead of potentially recording all of them. This cost is limited by evaluating the
predicate instrumentation infrequently based on random choice at each instance.
By additionally recording whether each predicate was ever observed, it is possible
to evaluate the probability that a given predicate can be used to predict program
failure. Although the approach that Liblit et al. discuss allows for useful analysis
of crash reports from deployed programs, it is not a general solution to program
debugging due to the number of samples needed before a bug can be isolated. For
the same reason, such sampling techniques are not applicable to the monitoring
needed by speculative execution.


Process-Based Speculation

Process based speculation consists of a run-time system and a programming interface. The run-time system is built as a code library with which a programmer
might link their program. The programming interface defines how the programmer would invoke calls into the run-time library. In this chapter I describe the
implementation of the core run-time system. Descriptions of the programming
interface and details of the runtime for particular types of speculative parallelism
are addressed in Chapters 4 and 5.

3.1 Implementation

Speculative program execution requires mechanisms for:

- dividing a sequential execution into a series of possibly parallel tasks.
- spawning a speculative task in addition to the existing sequential execution.
- monitoring the execution of tasks.
- managing access to one or more speculative versions of data.
- determining whether the speculative execution is correct.
- terminating the speculative execution and reverting any effects it may have had.
In the remainder of this chapter I will describe how process-based speculation
achieves each of these goals.

3.1.1 Creation

Process based speculation addresses the problem of creating a speculative task


using the operating system's fork call. The fork call is advantageous in that
all POSIX operating systems support it, making its use highly portable. Prior
to calling fork, the speculative run-time system establishes the communication
systems needed to ensure correctness (described in Section 3.1.3). After the fork
call has been made, two paths of execution exist. Before returning from the runtime creation block, each process configures its own correctness routines.
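The creation step might look roughly like the following sketch, in which a pipe is set up before fork so the two processes can later exchange their access maps. The function and variable names (spawn_speculation, install_handlers, channel) are invented for the example and are not the run-time system's actual identifiers.

#include <unistd.h>
#include <stdio.h>

static int channel[2];    /* pipe used later to exchange verification data */

/* illustrative stand-in for the per-process setup described above */
static void install_handlers(int speculative) {
    (void)speculative;    /* e.g., register the SIGSEGV handler, protect pages */
}

/* returns 1 in the speculative (child) process, 0 in the lead process */
static int spawn_speculation(void) {
    if (pipe(channel) != 0) {
        perror("pipe");
        return 0;          /* fall back to sequential execution */
    }
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 0;
    }
    install_handlers(pid == 0);
    return pid == 0;
}

int main(void) {
    if (spawn_speculation())
        puts("speculative process: jump ahead to the next task");
    else
        puts("lead process: continue non-speculatively");
    return 0;
}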

3.1.2 Monitoring

In order to determine whether the execution of a speculative task is correct, the


speculative run-time system must ensure that the state of the speculative execution corresponds with the state of the non-speculative execution. Because each
task's execution is sequestered in its own process, the state is defined by the memory contents of each process.
Because the speculative task is forked from the sequential task we know that
the initial state of the two processes is identical. As a result, we need only track
the changes made in each process and compare that subset of memory. Because
the two tasks are executed with processes we can monitor those changes using two
operating system constructs: memory access protection and signal handlers.


The signal handler routine has three basic responsibilities: to ensure the violation is a result of the run-time system monitoring, to record the access for later
reference, and to remove the restriction.
The operating system detects memory access violations in the normal course
of operation in order to protect processes. Because a process might access regions
of memory in violation of the operating system's typical restrictions, the run-time
system must ensure such accesses are not allowed to pass. The runtime must differentiate between access to memory regions that it has restricted, and access the
program should never be permitted to make. The signal itself identifies whether
the access was made to a memory location that is not mapped (maperr) or to a
region of memory to which the process does not have access (accerr).
Once the location of the access has been deemed legitimate, the run-time
system must record the access for later reference. The speculative run-time system
uses an access bitmap to represent each block of memory. One bit for each page
equals one word for every 32 pages. With a page size of 4096 bytes the access map
uses one byte to record accesses on 131,072 bytes. Because much of the access
map will be zeros, and most of it will not be modified, the OS will typically be
able to map several of these pages to the same zero-filled data.
Once the access has been recorded the process must be allowed to continue
its operation. Additionally, there is no reason to record future access to the same
block. The run-time system can safely remove memory protection for the current
block.
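A sketch of these three responsibilities, assuming a SIGSEGV handler installed with sigaction and the page-granularity bitmap described above; the map layout and the names (acc_map, region_base, REGION_PAGES) are hypothetical and stand in for the run-time system's own data structures.

#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096UL
#define REGION_PAGES (1UL << 20)

static char *region_base;                      /* set when the monitored region is mapped */
static uint32_t acc_map[REGION_PAGES / 32];    /* one bit per page                        */

static void segv_handler(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)region_base;

    /* 1. make sure the fault comes from our own protection, not a genuine bug */
    if (info->si_code != SEGV_ACCERR ||
        addr < base || addr >= base + REGION_PAGES * PAGE_SIZE)
        abort();                                /* real violation: fail hard */

    size_t page = (addr - base) / PAGE_SIZE;

    /* 2. record the access in the bitmap for later comparison */
    acc_map[page / 32] |= 1u << (page % 32);

    /* 3. drop the protection so the access proceeds and the page faults no more */
    mprotect(region_base + page * PAGE_SIZE, PAGE_SIZE, PROT_READ | PROT_WRITE);
}

void install_monitoring(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
}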

3.1.3 Verification

Once the sequential process has advanced far enough the run-time system must
verify that the speculative execution is correct. Such verification requires an
analysis of the access maps for both processes, but without special consideration


each process would only have access to its own map. The run-time system can
facilitate the access map analysis in two ways. One option is to push a copy of
one of the maps using a POSIX pipe established during the spawning process as
indicated in Section 3.1.1. In practice it is only necessary to transfer the non-zero
portions of the map. The second option is to create the maps in a segment of
memory that has been explicitly shared between the two processes.
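A sketch of the first option, assuming the word-per-32-pages bitmap from the previous section and a pipe created when the speculative process was spawned: only non-zero words of the map are transferred, each tagged with its index. The record format and the names (map_record, MAP_WORDS) are invented for the example.

#include <stdint.h>
#include <unistd.h>

#define MAP_WORDS 32768

/* one record per non-zero bitmap word: which word it is, and its bits */
struct map_record {
    uint32_t index;
    uint32_t bits;
};

/* speculative side: push only the non-zero portions of its access map */
static void send_access_map(int fd, const uint32_t *map) {
    for (uint32_t i = 0; i < MAP_WORDS; ++i) {
        if (map[i] != 0) {
            struct map_record rec = { i, map[i] };
            if (write(fd, &rec, sizeof rec) < 0)
                break;
        }
    }
}

/* lead side: a page written here but accessed by the speculative process is a
 * conflict (assumes the sender closed its end so read eventually returns 0) */
static int maps_conflict(int fd, const uint32_t *my_writes) {
    struct map_record rec;
    while (read(fd, &rec, sizeof rec) == (ssize_t)sizeof rec) {
        if (my_writes[rec.index] & rec.bits)
            return 1;   /* same page touched by both: speculation fails */
    }
    return 0;
}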
The details of verification (notably the precise point at which it can be
performed and which types of accesses need to be validated) depend on the
type of speculation being performed. These details are discussed in Sections 4.3.3
and 5.3.3.

3.1.4 Abort

Speculative execution requires a mechanism for unrolling or aborting the speculative portion of a process when the speculation proves to be incorrect. In order to
abort speculative execution that has proven incorrect, process-based speculation
can simply kill the running process. Because the Linux kernel protects the memory space of running processes from access by other processes, it is not possible for
the speculative process to directly affect the non-speculative portion of execution.
As a result, once the speculative execution is killed the non-speculative process
continues as it would in the sequential case.

3.1.5 Commit

The approach for committing a speculative task amounts to terminating the non-speculative process and allowing execution to continue based on what was computed speculatively. In addition, the meta-data used to track memory accesses
must be updated to reflect the fact that the speculative process is no longer speculative.


3.2 Advantages

Using processes for speculative parallelism has several advantages over
thread-based approaches. Perhaps the most significant of these is portability.
By using POSIX constructs the speculative run-time system can be built for any
POSIX operating system. The system does not rely on any specific hardware
architecture or features. The run-time system and compiler support presented in
this work have been built and executed on Linux and Mac OS X.
The access monitoring used by thread-based approaches relies on instrumentation of data accesses. This instrumentation must be explicitly applied to both
program code and any libraries used during execution. The process-based system
does not require any attention to external libraries to perform correctly. This
flexibility also improves the portability of the run-time system because only the
annotated source code needs to be recompiled.
Process-based memory access monitoring also has the advantage of incurring a
constant cost for each location accessed, rather than a cost at every single access
as in a thread-based system. Additionally, because the monitoring is done at the
page level, this cost can be amortized for large tasks with multiple accesses to the
same page.
In addition to monitoring the locations of data accesses, the process-based
system compares the data values for conflicts. Using value-based checking guarantees that identical changes to the same data will not be reported as a conflict,
avoiding a problem known as false sharing. In order to support value-based checking, a
run-time system must maintain multiple copies of the data. While the process-based run-time system gains this for free through the operating system's virtual
memory system, thread-based systems need to introduce additional data copies.
Additionally, these multiple copies must be explicitly managed to differentiate
access and guarantee that rollback is possible.
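A sketch of value-based conflict detection at page granularity: the spec process's view of a page it read is compared, word by word, against the lead process's final contents, and only words whose values actually changed are treated as conflicting. The snapshot arrangement below is assumed purely for illustration; the real system obtains the two copies through the separate address spaces created at fork time.

#include <stdint.h>
#include <stddef.h>

#define PAGE_WORDS (4096 / sizeof(uint32_t))

/*
 * A page read by the spec process conflicts with the lead process only if the
 * lead changed a value the spec may have observed: words whose value in the
 * lead's final copy differs from the snapshot both processes started from.
 */
static int page_conflicts(const uint32_t *snapshot,   /* contents at fork time */
                          const uint32_t *lead_final) /* lead's copy at commit */
{
    for (size_t w = 0; w < PAGE_WORDS; ++w) {
        if (lead_final[w] != snapshot[w])
            return 1;   /* a value actually changed: report a conflict */
    }
    return 0;           /* identical values: no conflict, despite the writes */
}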


3.3 Disadvantages

The process-based protection has a high overhead. However, much of this overhead is inherently unavoidable for a software scheme to support unpredictable
computations. A major goal of this thesis is to show that general protection can
be made cost effective by three techniques. The first is programmable speculation.
Since the overhead depends on the size of (write) accessed data rather than the
length of the ppr region, it can be made negligible if the size of the parallel task
is large enough.
Second, most overheads (starting, checking, and committing) are off the critical path, so the non-speculative execution is almost as fast as the unmodified
sequential execution. Moreover, a race is run in every parallel region, where the
correct speculative result is used only when the speculation finishes faster than
the would-be sequential execution. The overhead of determining the winner of
this race is placed in the speculative execution, off the critical path.
Last, the run-time system uses value-based checking, which is more general
than dependence-based checking based on the Bernstein conditions [5]. Value-based checking permits parallel execution in the presence of true dependencies,
and it is one of the main differences between the process-based system and existing
thread-based systems (as discussed in Section 4.3.2).

3.4 Special Considerations

3.4.1 Input and Output

To ensure that the output of a program running with speculative parallelism


support is correct, we ensure output is produced only by a non-speculative process
or by a speculative process that is known to be correct and is serialized in the


correct order. Until a speculative process has confirmed that its initialization and
execution was correct (i.e., that all previous speculation was correct), it buffers all
terminal output and file writes. Given correct execution, any output the process
produces will be the same as what the sequential program would have generated.
Program output buffering is established by creating a temporary file in which
to buffer the output that would otherwise be sent to the standard output. Such
a file is created by the run-time system each time a new speculative process is
created. At link time, we use a linker option (the gnu linker supports a --wrap option) to replace calls to the known input
and output functions with wrappers included with the run-time library. These
wrappers send file output to the redirection temporary file (in the case of printf)
or abort the speculative process (in all other cases). Although it should be possible
to detect writes to the standard error output using fprintf, such support has not
been implemented.
The task of committing the redirected output is addressed by rewinding to
the beginning of the redirection temporary file, reading it in blocks, and writing
those blocks to the standard output. If the speculative process is aborted, the
temporary redirection file is closed and deleted.
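Under the --wrap mechanism mentioned above, the linker redirects every call to printf to a symbol named __wrap_printf, which can forward the output to the temporary file while the process is speculative. The following is a minimal sketch, linked with -Wl,--wrap=printf; the globals bop_redirect_file and bop_is_speculative are illustrative names, not the run-time library's actual interface.

#include <stdarg.h>
#include <stdio.h>

static FILE *bop_redirect_file;   /* opened by the run-time when a spec is created */
static int   bop_is_speculative;  /* set in the speculative child                  */

/* with -Wl,--wrap=printf, every call to printf() in the program lands here;
 * the original printf remains reachable to the linker as __real_printf */
int __wrap_printf(const char *fmt, ...) {
    va_list ap;
    va_start(ap, fmt);
    int n;
    if (bop_is_speculative)
        n = vfprintf(bop_redirect_file, fmt, ap);  /* buffer until commit        */
    else
        n = vprintf(fmt, ap);                      /* lead process: output now   */
    va_end(ap);
    return n;
}

/* at commit time, replay the buffered output to the real standard output */
static void commit_redirected_output(void) {
    char block[4096];
    size_t got;
    rewind(bop_redirect_file);
    while ((got = fread(block, 1, sizeof block, bop_redirect_file)) > 0)
        fwrite(block, 1, got, stdout);
}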

3.4.2 Memory Allocation

Dynamic memory allocation can potentially pose a problem for speculative execution because, unlike stack allocation, its implementation is library based and
the mechanism is not known in advance. The root of the problem for speculative execution is that the implementation may not return the same sequence of
memory locations when the same sequence of requests is made. Even in cases
where the speculative and non-speculative processes are performing exactly the same computations, the value of some of their pointers may differ because the dynamic
allocation returns a different location. Additionally, comparing the changes the


processes have made is complicated by the need to recognize that different areas
of memory should be treated as though they were the same.

3.4.3 System Signals

The speculative parallel run-time system uses operating system signals to indicate
or initiate state changes among the running processes. The total number of available signals is limited, and the user program that is being extended may be relying
on some of the same signals. Some of the signals we are using are slightly reinterpreted (for example, special action may be taken on termination) while others have
no default meaning.
The run-time system does not attempt to preserve any existing signal handlers
installed by the user program, but it could be extended to identify them. The
user-installed signal handler can be stored and invoked from within the runtime's
handler. While using signals would still provide a means to actively alert another
process, we would also need to differentiate signals initiated by the run-time system from those of the user program. This could be accomplished using a shared
flag, which the run-time system would consult before either dispatching the signal
to the original handler or processing it.
Ultimately, it is not possible to guarantee that the user program does not install
a new signal handler during execution, overwriting the run-time system's handler
functions. One solution would be to replace or wrap the handler installation
functions to ensure the run-time system's handlers are preserved, while any new
handlers are indirectly dispatched. Because the signals the run-time system is
using are intended for user programs, this change could be performed during
compilation.


Speculative Parallelism

Introduction
In this chapter I describe a type of process-based speculative execution referred
to as Behavior Oriented Parallelism (or bop). The bop system is designed to
introduce parallelism into sequential applications. Many sequential applications
are difficult to parallelize because of problems such as unpredictable data access,
input-dependent parallelism, and custom memory management. These difficulties
motivated the development of a system for behavior-oriented parallelization, which
allows a program to be parallelized based on partial information about program
behavior. Such partial information would be typical of a user reading just part of
the source code, or a profiling tool examining a small number of inputs.
The bop style of speculative parallelism allows for some portions of code to
be marked as potentially safe for parallel execution. I refer to these regions of
code as possibly parallel regions, abbreviated ppr. The goal of bop is to allow a
programmer or an analysis tool to provide hints about parallel execution without
needing to guarantee that the parallelism is safe in all cases.
In Section 4.2 I describe the programmatic way in which code is annotated for
bop. The burden on the programmer is intended to be minimal, and the interface


to be a natural extension of the existing program. In Section 4.3 I describe how


the run-time system manages the speculative execution. In Section 4.6 I show
an evaluation of performance gains using the bop system, which has improved
the whole-program performance by integer factors for a Lisp interpreter, a data
compressor, a language parser, and a scientific library on a multicore personal
computer.

4.1 Design

The bop system uses concurrent executions to hide the speculation overhead off
the critical path, which determines the worst-case performance where all speculation fails and the program runs sequentially.

4.1.1 Lead and Spec Processes

The execution starts as the lead process, which continues to execute the program
non-speculatively until the program exits. At a pre-specified speculation depth k,
up to k processes are used to execute the next k ppr instances. For a machine
with p available processors, the speculation depth is set to p - 1 to make full use of the CPU resources.
Figure 4.1 illustrates an example run-time setup of either the sequential execution or the speculative execution of three ppr instances. As shown in Part 4.1(b),
when the lead process reaches the start marker of P, it forks the first spec process
and continues to execute the ppr instance P . The first spec jumps to the end
marker of P and executes the next ppr instance Q. At the start marker of Q, it forks the second
spec process, which jumps ahead to execute the third ppr instance R.
At the end of P , the lead process becomes the understudy process, which reexecutes the next ppr instance non-speculatively. In addition, it starts a parallel


[Figure 4.1: Sequential and speculative execution of PPR instances P, Q, and R with their start and end markers. Part (a) shows the sequential execution; part (b) shows a successful parallel execution, with the lead process on the left and spec 1 and spec 2 on the right. Speculation starts by jumping from the start marker to the end marker and commits when reaching another end marker; in the depicted run, spec 2 finishes first and aborts the understudy, becoming the next lead (the parallel execution wins).]

branch (a process in our current implementation) to check the correctness of spec


1. If no conflict is detected, the checking branch commits with spec 1, and the
two are combined into a single process. More speculation processes are handled
recursively in a sequence. The k-th spec is checked and combined after the first
k - 1 spec processes commit. When multiple spec processes are used, the data
copying is delayed until the last commit. The changed data are copied only once
instead of multiple times in a rolling commit.
The speculation runs slower than the normal execution because of the costs resulting from initialization, checking, and commit. The costs may be much higher
in process-based systems than in thread-based systems. In the example in Figure 4.1(b), the startup and commit costs, shown as gray bars, are so high that
the parallel execution of spec 1 finishes slower than the sequential understudy.
However, by that time spec 2 has finished and is ready to commit. The second
commit finishes before the understudy finishes, so spec 2 aborts the understudy


and becomes the next lead process.


bop executes ppr instances in a pipeline and shares the basic property of
pipelining: if there is an infinite number of pprs, the average finish time is determined by the starting time, not the length of each speculation. In other words,
the parallel speed is limited only by the speed of the initialization and the size
of the sequential region outside ppr. The delays during and after speculation do
not affect the steady-state performance. This may be counterintuitive at first
because the commit time does not matter even though it is done sequentially. In
the example in Figure 4.1(b), spec 2 has similar high startup and commit costs
but they overlap with the costs of spec 1. In experiments with real programs, if
the improvement jumps after a small increase in the speculation depth, it usually
indicates a high speculation overhead.

4.1.2 Understudy: Non-speculative Re-execution

bop assumes that the probability, the size, and the overhead of parallelism are all
unpredictable. The understudy provides a safety net not only for correctness when
the speculation fails, but also for performance when speculation is slower than
the sequential execution. For performance, bop holds a two-way race between the
non-speculative understudy and the team of speculative processes.
The non-speculative side represents the worst-case performance along the
critical path. If all speculation fails, it sequentially executes the program. As I
will explain below, the overhead for the lead process consists only of the page-based
write monitoring for the first ppr instance. The understudy runs as the original
code without any monitoring. As a result, if the granularity of ppr instance is
large or when the speculation depth is high, the worst-case running time should
be almost identical to that of the unmodified sequential execution. On the other
hand, whenever the speculation finishes faster than the understudy, it means a


performance improvement over the would-be sequential execution.


The performance benefit of understudy comes at the cost of potentially redundant computation. However, the cost is at most one re-execution for each
speculatively executed ppr, regardless of the depth of the speculation.
Using the understudy, the worst-case parallel running time is equal to the
best-case sequential time. One may argue that this can be easily done by running
the sequential version side by side in a sequential-parallel race. The difference is
that the bop system is running a relay race for every group of ppr instances. At
the whole-program level it is sequential-parallel collaboration rather than competition because the winners of each relay are joined together to make the fastest
time. Every improvement in time counts when speculation runs faster, and no
penalty is incurred when it runs slower. In addition, the parallel run can benefit
from sharing read-only data in cache and memory, while multiple sequential runs
cannot. Finally, running two instances of a program is not always possible for
a utility program, since the communication with the outside world often cannot
be undone. In bop, unrecoverable I/O and system calls are placed outside the
parallel region.

4.1.3 Expecting the Unexpected

Figure 4.1 shows the expected behavior when an execution of pprs runs from
BeginPPR to EndPPR. In general, the execution may reach an exit (normal or
abnormal) or an unexpected ppr marker. Table 4.1 shows the actions taken by
the lead process, its understudy branch, and spec processes when encountering
an exit, error, or unexpected ppr markers.
The abort by spec in Table 4.1 is conservative. It is possible for speculation
to reach a program exit point during correct execution, so an alternative scheme
might delay the abort and salvage the work if it turns out to be correct. We favor


Table 4.1: Speculation actions for unexpected behavior


behavior       prog. exit or error     unexpected ppr markers
lead           exit                    continue
understudy     exit                    continue
spec           abort speculation       continue

the conservative design for performance. Although it may recompute useful work,
the checking and commit cost will never delay the critical path.
The speculation process may also allocate an excessive amount of memory
and attempt permanent changes through I/O and other OS or user interactions.
The latter cases are solved by aborting the speculation upon file reads, system
calls, and memory allocation exceeding a pre-defined threshold. The file output is
managed by buffering and is either written out or discarded at the commit point.
The current implementation supports stdout and stderr for the pragmatic purpose
of debugging and verifying the output. Additional engineering effort could add
support for regular file I/O.

Strong Isolation
I describe the bop implementation as having strong isolation because the intermediate results of the lead process are not made visible to speculation processes
until the lead process finishes the first ppr. Strong isolation comes naturally with
process-based protection. It is a basic difference between bop and thread-based
systems, where the updates of one thread are visible to other threads, which I
describe as weak isolation. I discuss the control aspect of the difference here and
complete the rest of the comparisons in Section 4.3.2 after describing the data protection.
Weak isolation allows opportunistic parallelism between two dependent threads,
if the source of the dependency happens to be executed before the sink. In the


bop system, such parallelism can be made explicit and deterministic using ppr
directives by placing dependent operations outside the ppr region. As an example, the code outside ppr in Figure 4.1 executes sequentially. At the loop level,
the most common dependency comes from the update of the loop index variable.
With ppr, the loop control can be easily excluded from the parallel region and
the pipelined parallelism is definite instead of opportunistic.
The second difference between strong and weak isolation is that strong isolation does not need synchronization during the parallel execution but weak isolation
needs to synchronize between the lead and the spec processes when communicating the updates between the two. Since the synchronization delays the non-speculative execution, it adds visible overheads to the thread-based systems when
speculation fails. bop does not suffer this overhead.
Although strong isolation delays data updates, it detects speculation failure
and success before the speculation ends. Like systems with weak isolation, strong
isolation detects conflicts as they happen because all access maps are visible to all
processes for reads (each process can only update its own map during the parallel
execution). After the first ppr, strong isolation can check for correctness before
the next speculation finishes by stopping the speculation, checking for conflicts,
and communicating data updates. As a design choice, bop does not abort speculation early because of the property of pipelined parallelism, explained at the
end of Section 4.1.1. The speculation processes may improve the program speed,
no matter how slowly they execute, when enough of them are working together.


4.2 Programming Interface

In addition to the ppr markers, the bop programming interface has two other important components. The second component allows the programmer to provide a list of global and
static variables that are privatizable within each parallel process. By specifying
where the variables are initialized, the system can treat their data as shared until
the initialization and as private thereafter. The third component, run-time feedback, is described in
Section 4.2.3.

4.2.1 Region Markers

The bop programming interface allows a programmer to indicate what portions of


code are candidates for parallelism. The primary component of the bop programming interface is the BeginPPR function that denotes the beginning of a parallel
region. The return value of BeginPPR is a Boolean value where truth corresponds
to execution of the speculative code path. Put in terms of the run-time system,
the speculative process receives a non-zero return value while the non-speculative
process receives a return value of zero.
A call to BeginPPR is typically wrapped in a conditional statement to control
the flow of execution through the two paths of execution. Listing 4.2.1 illustrates
an example use of a ppr to parallelize a loop. Each iteration of the loop computes
the value to fill one element of a table based on the corresponding index. If one
assumes that the function compute is free of side-effects, then each iteration of the
loop can be executed in parallel with the others. Using ppr guarantees correct
execution even when the assumption about compute's purity is not valid.
As counterpart to the BeginPPR marker used to indicate the start of a possible
parallel region, the bop interface provides an EndPPR marker to finalize the region.


Algorithm 4.2.1 Example use of bop to mark a possibly parallel region of code
within a loop.
for (int i = 0; i < N; ++i) {
  if (!BeginPPR(0)) {
    tab[i] = compute(i);
  }
  EndPPR(0);
}

These two functions both accept a single scalar value that identifies the region to
ensure the markers are properly matched, which allows for nesting. Using the
identifier, an incorrectly matched marker can be safely ignored on the assumption
that another marker matches it and is also ignored.
Algorithm 4.2.2 Example use of bop including EndPPR marker.
for (int i = 0; i < N; ++i) {
  if (!BeginPPR()) {
    tab[i] = compute(i);
  }
}
EndPPR();
aggregate(tab);

In the loop body example shown in Listing 4.2.1, there is little meaning to
the else branch of the BeginPPR conditional. One can view the second branch as
containing any execution until the next ppr marker of any kind. In straight-line
code it may be cleaner to explicitly enclose a block of code within an else
branch to place it in juxtaposition to the speculative path. The code in Listing 4.2.3 represents a case in which the else branch is explicitly used to demarcate
distinct paths of execution that may be processed in parallel. Note that there is
no reason that a simple pair of if/else must be used, and in the listing a nest
of conditions is used.


Algorithm 4.2.3 Example use of bop in a non-loop context.


if (!BeginPPR()) {
  f(tab1);
} else if (!BeginPPR()) {
  f(tab2);
} else {
  f(tab3);
}
EndPPR();

Explicitly Matched Markers


While multiple BeginPPR(p) invocations may exist in the code, an EndPPR(p) must
be unique for the same p, and the matching markers must be inserted into the
same function. The exact code sequence in C is as follows:

BeginPPR(p):  if (BeginPPR(p)) goto EndPPR_p;
EndPPR(p):    EndPPR(p); EndPPR_p: ;
In the presence of unpredictable control flow, there is no guarantee that a start
marker will be correctly followed by its end marker, or that the matching markers
are executed the same number of times. For example, a longjmp in the middle
of a parallel region may cause the execution to back out and re-enter.
The bop system constructs a sequence of zero or more non-overlapping ppr
instances at run time using a dynamic scope. At any point t, the next ppr instance
starts from the first BeginPPR start marker operation after t and then ends at the
first EndPPR end marker operation after the BeginPPR. For example, assume the
program has two ppr regions P and Q, which are marked by the pairs {B_P, E_P}
and {B_Q, E_Q} respectively. If the program executes from the start t0, invoking
the markers six times from t1 to t6 as in Figure 4.2(a), then the two dynamic ppr
instances are depicted in Figure 4.2(b). The two ppr instances range from t1 to t3 and from t4
to t6, and will be run in parallel. The other fragments of the execution will be
run sequentially, although the portion from t3 to t4 will be speculative.

[Figure 4.2: Example of matching ppr markers. Part (a) shows the sequential execution, with the marker operations B_P, B_P, E_P, B_Q, E_P, and E_Q invoked at times t1 through t6; part (b) shows the resulting parallel execution, with the dynamic instances PPR_P spanning t1 to t3 and PPR_Q spanning t4 to t6.]


Compared to the static and hierarchical scopes used by most parallel constructs, the dynamic scope lacks the structured parallelism to model complex
task graphs and data flows. While it is not a good fit for static parallelism, it is
a useful solution for the extreme case of dynamic parallelism in unfamiliar code.
A coarse-grain task often executes thousands of lines of code, communicates
through dynamic data structures, and has non-local control flows. Functions may
be called through indirect pointers, so parallel regions may be interleaved instead
of being disjoint. Some forms of non-local error handling or exceptions may be
frequent, for example when an interpreter encounters a syntax error. Some forms
are rare, such as the error checking and abnormal exit found in the commonly
used gzip program's compression code. Although no error has ever happened
in our experience, if one cannot prove the absence of errors in software of
this size, the dynamic scopes implemented by a ppr can be used to parallelize
the common cases while guarding against unpredictable or unknown entries and
exits.
Since the ppr markers can be inserted anywhere in a program and executed
in any order at run time, the system tolerates incorrect marking of parallelism,
which can easily happen when the region is marked by a profiling tool based on
a few inputs or given by a user unfamiliar with the code. The markers, like other
aspects of the interface, are programmable hints where the quality of the hints


affects the parallelism but not the correctness or the worst-case performance.

4.2.2 Post-Wait

The basic ppr structure allows for regions of code to be executed in parallel if
there are no dependencies carried from one to another. In many cases a loop body
may have carried dependencies, but be parallelizable if care is taken. Consider
a loop that is structured in stages so that some stages carry a dependency, but
the dependency is consumed by the same stage in the next iteration. In such a
scenario, the stages of the loop body can be viewed as stages of a pipeline.
Post-Wait is an extension of the basic ppr mechanism provided by the bop
system to allow for pipelining portions of the possibly parallel region. Using the
post-wait interface the speculative processes can be synchronized so that the writes
in the earlier process occur before the corresponding reads during run time.
Algorithm 4.2.4 Example of a pipelined loop body.

for (int i = 0; i < N; ++i) {
  B[i] = f(A[i]);
  C[i] = g(A[i]);
  D[i] = h(B[i], C[i]);
}

4.2.3 Feedback

The third component of the bop interface is run-time feedback to the user. When
speculation fails, the system generates output indicating the cause of the failure,
particularly the memory page on which conflicting accesses occurred. In
our current implementation, global variables are placed on separate memory pages
by the compiler. As a result, the system can output the exact name of the global
variable when it causes a conflict. A user can then examine the code and remove the conflict by marking the variable privatizable or moving the dependency out of the parallel region.

Algorithm 4.2.5 Example use of bop post/wait.

for (int i = 0; i < N; ++i) {
    if (!BeginPPR()) {
        B[i] = f(A[i]);
        BOP_post(B);
        C[i] = g(A[i]);
        BOP_wait(B);
        D[i] = h(B[i], C[i]);
    }
    EndPPR();
}
Three features of the API are especially useful for working with large, unfamiliar code. First, the user does not write a parallel program and never needs
parallel debugging. Second, the user parallelizes a program step by step as hidden dependencies are discovered and removed one by one. Finally, the user can
parallelize a program for a subset of inputs rather than all inputs. The program
can run in parallel even if it has latent dependencies.

4.3

Run-Time System

4.3.1

Creation

On the first instance of BeginPPR, the run-time system initializes the signal handlers and memory protection used by all of the subsequent processes. At the beginning of a possibly parallel region, the run time calls the system fork function, which creates a new operating system process to act as the speculative process. This new process is the child of the preexisting process, which is non-speculative. The original process returns immediately and continues execution in the non-speculative state.

4.3.2

Monitoring

The bop system guarantees that if the speculation succeeds, the same user-visible output is produced as in the sequential execution. bop partitions the address space of a running program into three disjoint groups: shared, checked, and private. More formally, D_all = D_shared + D_checked + D_private, and any two of D_shared, D_checked, and D_private do not overlap.
For the following discussion we consider two concurrent processes: the lead
process that executes the current ppr instance, and the spec process that executes
the next ppr instance and the code in between. The cases for k (k > 1) speculation
processes can be proved by induction since they commit in a sequence in the bop
system.

Three types of data protection


Page-based protection of shared data All program data are shared at the
BeginPPR marker by default, and are protected at the memory page granularity.
During execution, the system records all global variables and the range of dynamic memory allocation. At BeginPPR, the system turns off write permission
for the lead process and read/write permission for the spec processes. It installs
customized page-fault handlers that loosen the permission for read or write upon
the first read or write access. At the same time, the handler records which type
of access each process has to each page. At commit time, each spec process is
checked in increasing order based on creation. The k-th process fails if and only if a page is written by the lead process or the previous k-1 spec processes but read by spec k. If speculation succeeds, the modified pages are merged into a
single address space at the commit point.
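The bop implementation of this monitoring is not reproduced here; the following is a minimal, self-contained sketch of the underlying operating-system mechanism (mprotect plus a SIGSEGV handler that records the faulting page and then loosens its permission). The variables accessMap and protBase, and the single access map shared by read and write faults, are simplifications assumed for the sketch:

#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

extern unsigned char *accessMap;   /* one byte per protected page (assumed)   */
extern char *protBase;             /* start of the protected region (assumed) */

static void segv_handler(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    long pageSize = sysconf(_SC_PAGESIZE);
    char *page = (char *)((uintptr_t)info->si_addr & ~(uintptr_t)(pageSize - 1));
    accessMap[(page - protBase) / pageSize] = 1;        /* record the touched page */
    mprotect(page, pageSize, PROT_READ | PROT_WRITE);   /* loosen the permission   */
}

static void install_monitoring(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
}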

Algorithm 4.3.1 Listing of ppr creation.

int BOP_PrePPR(int id) {
    if (mySpecOrder == specDepth) return 0;
    switch (myStatus) {
    default: return 0;              // ignore nested PPRs (status = MAIN)
    case CTRL:                      // CTRL is the initial state
        memset(accMapPtr, 0, ACC_MAP_SIZE);
        myStatus = MAIN;
        mySpecOrder = 0;
        // signal handlers for monitoring
        SP_setupAction(BOP_SegvHandler, SIG_MEMORY_FAULT);
        // signals for sequential-parallel race arbitration
        SP_setupAction(BOP_RaceHandler, SIGUSR1);
        SP_setupAction(BOP_UndyTermHandler, SIGUSR2);
        // fall through
    case SPEC:
        pprID = id;                 // record identifier of this PPR
        int fid = fork();
        if (-1 == fid) return 0;    // fork failure
        if (fid > 0) {              // the MAIN or older SPEC
            specPid = fid;          // track the SPEC process ID
            if (myStatus == MAIN) BOP_setProtection(PROT_READ);
            return 0;
        }
        // the newer SPEC continues here
        specPid = 0;
        myStatus = SPEC;
        mySpecOrder++;
        setpgid(0, SP_gpid);
        SP_RedirectOutput();
        if (mySpecOrder == 1)       // set this up only once
            BOP_setProtection(PROT_NONE);
        return 1;
    }
}

Algorithm 4.3.2 Examples of shared, checked, and private data

shared = GetTable();
...
while (...) {
    ...
    BeginPPR(1)
    ...
    if (...)
        checked = checked + Search(shared, x)
    Insert(private, new Node(checked))
    ...
    if (!error) Reset(checked)
    ...
    EndPPR(1)
    ...
}

By using Unix processes for speculation, the bop system eliminates all anti-dependencies and output dependencies through the replication of the address
space, and detects true dependencies at run time. An example is the variable
shared in Algorithm 4.3.2, which may point to some large dictionary data structure.
Page-based protection allows concurrent executions as long as a later ppr does
not need the entries produced by a previous ppr. The overwrites by a later ppr
are fine even if the entries are used concurrently by a previous ppr.
The condition is significantly weaker than the Bernstein condition [5], which
requires that no two concurrent computations access the same data if at least
one of the two writes to it. The additional parallelism is possible because of
the replication of modified data, which removes anti-dependencies and output
dependencies. The write access by spec k never causes failure in previous spec
processes. As an additional optimization, the last spec process is only monitored
for data reads. In fact, when the system is limited to only one spec process, a
case termed co-processing, the lead process is monitored only for writes and the
spec only for reads.


Page-based protection has been widely used for supporting distributed shared
memory [29, 32] and many other purposes including race detection [49]. While
these systems enforce parallel consistency among concurrent computations, the
bop system checks for dependence violation when running a sequential program.
A common problem in page-level protection is false-positive alerts. We alleviate this problem by allocating global variables on separate memory pages. Writes
to different parts of a page may be detected by checking the difference at the end
of ppr, as in [29]. In addition, the shared data are never mixed with checked and
private data on the same page, although at run time newly allocated heap data
are private at first and then converted to shared data at EndPPR.

Value-based checking   Typical dependence checking is based on data access rather than data value. Although this type of checking is sufficient for correctness, it is not necessary. Consider the variable checked in Algorithm 4.3.2, which causes
true dependencies because both the current and next ppr instances may read and
modify it. On the other hand, the reset statement at the end may re-install the
old value that checked had at the beginning of the ppr. The parallel execution is
still correct at run time despite the true dependence violation. This case is called
a silent dependence [53].
There is often no guarantee that the value of a variable is reset by EndPPR. In
the above example, the reset depends on a flag, so the silence is conditional.
Even after a reset, the value could be modified by pointer indirection in the general
case. Finally, the reset operation may assign different values at different times.
Hence run-time checking is necessary.
For global variables, the size is statically known, so the bop system allocates checked variables in a contiguous region, makes a copy of their value at the
BeginPPR of the lead process, and checks their value at the EndPPR. For dynamic
data, the system needs to know the range of addresses and performs the same checking steps. Checked data can be determined through profiling analysis or identified by the user as described in more detail in Section 4.2.3. Since the values
are checked, incorrect hints would not compromise correctness. In addition, a
checked variable does not have to return to its initial value in every ppr instance.
Speculation still benefits if the value remains constant for just two consecutive
ppr instances.
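The snapshot-and-compare step itself is simple; a minimal sketch follows, assuming the checked variables have already been gathered into one contiguous region (the names checkedRegion, checkedSize, and checkedSnapshot are illustrative, not bop's actual internals):

#include <stdlib.h>
#include <string.h>

extern char  *checkedRegion;   /* contiguous region holding the checked variables (assumed) */
extern size_t checkedSize;

static char *checkedSnapshot;

/* called by the lead process at BeginPPR */
void snapshot_checked(void) {
    checkedSnapshot = malloc(checkedSize);
    memcpy(checkedSnapshot, checkedRegion, checkedSize);
}

/* called at EndPPR; returns nonzero if every checked variable holds its BeginPPR value */
int verify_checked(void) {
    return memcmp(checkedSnapshot, checkedRegion, checkedSize) == 0;
}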
Most silent dependencies come from implicit re-initialization of a variable.
Some examples are incrementing and decrementing a scope level when a compiler
compiles a function, setting and clearing traversed bits of the nodes in a graph
during a depth-first search, and filling then clearing the work-list in a scheduling
pass. Such variables that may take the same value at BeginPPR and EndPPR are
classified as checked data. In other words, the ppr execution may have no visible
effect on the checked data variable.
The shared data and checked data have a significant overlap, which is the set
of data that are either read only or untouched by the parallel processes. Data in
this set are classified as checked if their size is small; otherwise, they are shared.
A problem arises when different parts of a structure or array require different
protection schemes. Structure splitting, when possible, may alleviate the problem.
The correctness of checked data is not obvious because their intermediate
values may be used to compute other values that are not checked. I will present
a formal proof of the correctness to show how the three protection schemes work
together to cast a complete shield against concurrency errors.

Likely private data   The third class of objects is private data, which is initialized before being used and therefore causes no conflict. In Algorithm 4.3.2, if private
is always initialized before it is used, the access in the current ppr cannot affect
the result of the next ppr, so any true dependency caused by it can be ignored.
Private data come from three sources. The first is the program stack, which includes local variables that are either read-only in the ppr, or always initialized
before use. Intra-procedure dataflow analysis is capable of identifying such data
for most programs. When the two conditions of safety cannot be guaranteed by compiler analysis, for example due to unknown control flow or the escape of a local variable's address into the program heap, we redefine the local variable to be a
global variable and classify it as shared data. Recursive functions are not handled
specially, but could be managed either using a stack of pages or by disabling the
ppr.
The second source of private data is global variables and arrays that are always initialized before the use in the ppr. The standard technique to detect this
is inter-procedural kill analysis [1]. In general, a compiler may not always ascertain all cases of initialization. For global data whose access is statically known
in a program, the compiler automatically inserts calls after the initialization assignment or loop to classify the data as private at run time. Any access by the
speculation process before the initialization causes it to be treated as shared data.
For (non-aggregate) data that may be accessed by pointers, the system places it
on a single page and treats it as shared until the first access. Additionally, we
allow the user to specify the list of variables that are known to be written before
read in ppr. These variables are reinitialized to zero at the start of a ppr instance.
Since we cannot guarantee write-first access in all cases, we call this group likely
private data.
The third type of private data is newly allocated data in a ppr instance. Before
BeginPPR, the lead process reserves regions of memory for speculation processes.
Speculation would abort if it allocates more than the capacity of the region. The
main process does not allocate into the region, so at EndPPR, the newly allocated
data can be merged with the data from the speculation process. For programs that
use garbage collection, we encapsulate the heap region of spec processes, which
we will describe when discussing the test of a lisp interpreter. Another solution is to ignore garbage collection, which will cause speculation to fail if it is initiated during a ppr instance because of the many changes it makes to the shared data.
A variable is marked bop_private if its value is assigned before it is used within a ppr task. Because the first access is a write, the variable does not inherit a value from prior tasks. Verifying the suggestion requires capturing the first access to a variable, which can be costly if the variable is an array or a structure. For efficiency we use a compromise. We insert code at the start of the ppr to write a constant value in all variables that are marked bop_private. If the suggestion is correct, the additional write adds a small extra cost but does not change the program semantics. If the suggestion is wrong, the program may not execute correctly, but the sequential version has the same error, and the error can be identified using conventional debugging tools. Under this implementation, bop_private is a directive rather than a hint, unlike other bop primitives.
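A sketch of this compromise follows: rather than verifying the first access at run time, every variable marked bop_private is overwritten with a constant (zero) at the start of the ppr. The list representation below is illustrative only, not bop's actual data structure:

#include <string.h>

struct private_var { void *addr; size_t size; };

extern struct private_var privateList[];   /* variables marked bop_private (assumed) */
extern int privateCount;

/* called just after BeginPPR: clear every likely private variable, so a wrong
   hint produces the same error in the sequential version and can be debugged
   with conventional tools */
void reinit_private_vars(void) {
    for (int i = 0; i < privateCount; i++)
        memset(privateList[i].addr, 0, privateList[i].size);
}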
Overheads on the critical path   The three data protection schemes are summarized and compared in Table 4.2. Most of the overhead of speculation (the forking of speculation processes, the change of protection, data replication, read and write monitoring, the checking of access maps for conflicts, the merging of modified pages, and the competition between the understudy and the spec processes) is off the critical path. Therefore, the relation between the worst-case running time T_parallel^max and the running time T_seq of the unmodified sequential program is

    T_parallel^max = T_seq + c_1 (S_shared / S_page) + c_2 (S_modified_by_1st_ppr + S_checked)

The two terms after T_seq are the cost of data monitoring and copying on the critical path, as explained below.

For monitoring, at the start of a ppr, the lead process needs to set and reset the write protection and the access map for shared data before and after the first ppr instance. The number of pages is the size of shared data S_shared divided by the page size S_page, with a constant cost c_1 per page. During the instance, a write page fault is incurred for every page of shared data modified in the first ppr instance. The constant per-page cost is negligible compared to the cost of copying a modified page.
Two types of copying costs may appear on the critical path. The first is for
pages of shared data modified by the lead process in the first ppr instance and
(among those) pages modified again by the understudy. The second cost is taking
the snapshot of checked data. The cost in the above formula is the worst case,
though the copy-on-write mechanism in modern OS may completely hide both
costs.
Data copying may hurt locality across ppr boundaries, although the locality
within is preserved. The memory footprint of a speculative run is larger than the
sequential run as modified data are replicated. However, the read-only data are
shared by all processes in main memory and in shared cache, which is physically
indexed. As a result, the footprint may be much smaller than running k copies of
a program.

A Formal Proof of Correctness


It is sufficient to prove the correctness for a single instance of the parallel execution
between two ppr instances. An abstract model of an execution is defined by:
V_x : a set of variables. V_all represents all variables in memory.

S^t : the content of V at time t.

S_x^t : the state of V_x at time t.

r_x : an instruction. The instructions we consider are the markers of the two pprs P and Q: P^b, P^e, Q^b, and Q^e (corresponding to m_P^b, m_P^e, m_Q^b, and m_Q^e in Section 4.2.1). P and Q can be the same region.

<r_x, S_V^t> : a point in the execution, in terms of the current instruction and the state.

<r_1, S_all^t1> ==p==> <r_2, S_all^t2> : execution of a process p from one point to another.

Table 4.2: Three types of data protection

    Shared data
        protection: not written by lead and read by spec
        granularity: page/element
        support: compiler, profiler, run-time
        critical path overhead: 1 fault per modified page
    Checked data
        protection: value at BeginPPR is the same at EndPPR in lead; concurrent read/write allowed
        granularity: element
        support: compiler, profiler, run-time
        critical path overhead: copy-on-write
    Private data
        protection: no read before 1st write in spec; concurrent read/write allowed
        granularity: element
        support: compiler (run-time)
        critical path overhead: copy-on-write

Figure 4.3 shows the parallel execution and the states of the lead and the spec
processes at various times. If a parallel execution passes the three data protection
schemes, all program variables in our abstract model can be partitioned into the
following categories:
V_wf : variables whose first access by spec is a write (wf stands for write first).

V_excl_lead : variables accessed only by lead when executing the first ppr instance P.

V_excl_spec : variables accessed only by spec.

[Figure 4.3: The states of the sequential and parallel execution. (a) Sequential execution; (b) parallel execution by three processes: the main process (main), the speculation process (spec), and the understudy process (undy).]


V_chk : the remaining variables (chk stands for checked), that is, V_chk = V_all - V_wf - V_excl_lead - V_excl_spec.

Examining Table 4.2, we see that D_shared contains data that are either accessed by only one process (V_excl_lead and V_excl_spec), written before read in spec (V_wf), or read only in both processes or not accessed by either (V_chk). D_private contains data either in V_wf or V_chk. D_checked is a subset of V_chk. In addition, the following two conditions are met upon a successful speculation.

1. The lead process reaches the end of P at P^e, and the spec process, after leaving P^e, executes the two markers of Q, Q^b and then Q^e.

2. The state of V_chk is the same at the two ends of P (but it may change in the middle), that is, S_chk^init = S_chk^lead.

To analyze correctness, examine the states of the sequential execution, S^init at P^b and S^seq at Q^e of the sequential process seq, and the states of the parallel execution: S^init at P^b and S^lead at P^e of the lead process, and S^init at P^e and S^spec at Q^e of the spec process. These states are illustrated in Figure 4.3.
The concluding state of the parallel execution, S^parallel at Q^e, is a combination of S^lead and S^spec after the successful speculation. To be exact, the merging step copies the modified pages from the lead process to the spec process, so

    S^parallel = S^spec_(all - excl_lead) + S^lead_(excl_lead)
In the following proof, each operation r_t is defined by its inputs and outputs, which all occur after the last input. The inputs are the read set R(r_t). The outputs include the write set W(r_t) and the next instruction to execute, r_{t+1}. For clarification, an operation is an instance of a program instruction. For simplicity of presentation, the symbol r_x is overloaded to denote both the static instruction and its dynamic instances; to distinguish them in the text, the former is referred to as an instruction and the latter as an operation, so there may be only one instruction r_x but any number of operations r_x.

Theorem:
If the spec process reaches the end marker of Q, and the protection in Table 4.2 passes, the speculation is correct, because the sequential execution would also reach Q^e with a state S^seq = S^parallel, assuming that both the sequential and the parallel executions start with the same state, S^init at P^b.
Proof:
Consider the speculative execution, (P^e, S^init) ==spec==> (Q^e, S^spec), and the corresponding part of the sequential execution, (P^e, S^mid) ==seq==> (Q^e, S^seq). The correct sequential execution is denoted as p^e, r_1, r_2, ..., and the speculative execution as p^e, r'_1, r'_2, .... Proving the above theorem requires showing that every operation r'_t in the speculative execution and the corresponding operation r_t in the sequential execution must:

1. map to the same instruction as r_t,
2. read and write the same variables with the same values, and
3. move to the same next instruction r_{t+1}.

This is shown through contradiction.
Assume the two sequences are not identical, and let r'_t be the first operation that produces a different value than r_t, either by modifying a different variable, modifying the same variable with a different value, or moving next to a different instruction. Since r_t and r'_t are the same instruction, the difference in output must be due to a difference in the input.

Suppose r_t and r'_t read a variable v but see different values v and v'. Since the values cannot differ if the last writes do not exist, let r_v and r'_v be the previous write operations that produce v and v'. The operation r'_v can occur either in spec before r'_t or in the lead process as the last write to v. The contradiction depends on showing that neither of these two cases is possible.

First, if r'_v happens in spec, then it must produce the same output as r_v, per our assumption that r'_t is the first operation to deviate. Second, suppose r'_v is part of lead and produces a value not visible to spec. Consider the only way v can be accessed. Given that r'_v is the last write, v is read before being modified in spec, and so it does not belong to V_wf or V_excl_lead. Neither is it in V_excl_spec, since it is modified in the lead process. The only case left is for v to belong to V_chk. Since S_chk^lead = S_chk^init, after the last write the value of v is restored to the beginning state where spec starts, and consequently it cannot cause r'_t in spec to see a different value than r_t does in the sequential run. Therefore r_t and r'_t cannot have different inputs and produce different outputs, and the speculative and sequential executions must be identical.

Since spec reads and writes correct values, V_wf, V_excl_spec, and the accessed part of V_chk are correct. V_excl_lead is also correct because of the copying of its values at commit time. The remaining part of V_chk is not accessed by lead or spec and still holds the same value as S^init. It follows that the two states S^parallel and S^seq are identical, which means that S^parallel is correct.

The above proof is similar to that of the Fundamental Theorem of Dependence
(Sec. 2.2.3 in [1]). While the proof in the book deals with statement reordering,
the proof here deals with region reordering and value-based checking. It rules out two common concerns: first, that the intermediate values of checked data might lead to incorrect results in unchecked data; second, that speculation might not follow the correct control flow. In bop, the three checking schemes work together to ensure these strong guarantees.
Comparisons
Strong versus weak isolation, as discussed in Section 4.1.3, is a basic difference between
process-based bop and thread-based systems that include most hardware and
software speculation and transactional memory techniques. The previous section
discussed the control aspect, while the data protection and system implementation
are discussed below. The comparisons are summarized in Table 4.3.
Weak isolation needs concurrent access to both program data and system data,
as well as synchronization to eliminate race conditions between parallel threads
and between the program and the run-time system. The problem is complicated
if memory operations may be reordered by the compiler or by hardware, and
the hardware uses weak memory consistency, which does not guarantee correct
results without explicit synchronization. In fact, concurrent threads lack a well-defined memory model [7]. A recent loop-level speculation system avoids race
conditions and reduces the number of critical sections (to 1) by carefully ordering

the system code based on a sequential memory consistency model and adding memory directives to enforce the order under relaxed consistency models [11].

Table 4.3: Comparisons between strong and weak isolation

    during speculation                  strong         weak
    data updates visible to outside     no             yes
    overall overhead proportional to    data size      data use
    synchronization on critical path    none           needed
    hardware memory consistency         independent    dependent
    support value-based checking        yes            no
    type of pipelined parallelism       definite       opportunistic
    detect spec failure early           yes            yes
    can certify spec success early      yes            yes
In bop, parallel processes are logically separated. The correctness check is done
sequentially in rolling commits with a complete guarantee as stated in the correctness theorem above.
There is no synchronization overhead on the critical path, and the compiler and
hardware are free to reorder program operations as they do for a sequential program.
Weak isolation cannot efficiently support value-based checking. When data
updates are visible, the intermediate value of a checked variable can be seen by
a concurrent thread and the effect cannot be easily undone even if the variable resumes the initial value afterward. For locks, this leads to the ABA problem, where a thread may mistakenly hold a pointer whose value is the same, but the referenced data has changed. A specific solution to the ABA problem has been developed for the software transactional memory system DSTM [25]. In hardware,
a correct value prediction may cause a thread to read at the wrong time and violate
the sequential consistency, so value prediction requires careful extra tracking by
hardware [37]. No software speculation systems use value-based checking. With
strong isolation in bop, the intermediate values of checked variables have no effect
on other processes, so value-based checking is not only correct but also adds little
cost on the critical path.


Value-based checking is different from value-specific dynamic compilation (for


example in DyC [20]), which finds values that are constant for a region of the
code rather than values that are the same at specific points of an execution (and
can change arbitrarily between these points). It is different from a silent write,
which writes the same value as the previous write to the variable. The bop runtime software checking happens once per ppr for a global set of data, and the
correctness is independent of the memory consistency model of the hardware.
Most previous techniques monitor data at the granularity of array elements,
objects, and cache blocks; bop uses pages for heap data and padded variables
for global data. Paging support is more efficient for monitoring unknown data
structures but it takes more time to set up the permissions. It gives rise to
false sharing. The cost of page-based monitoring is proportional to the size of
accessed data (for the overhead on the critical path it is the size of modified data)
rather than the number of accesses as in thread-based systems, making page-based
protection especially suitable for coarse-grain parallelism.

4.3.3

Verification

In the case of speculative parallelism through pprs, verifying correct execution is


primarily handled by the run-time monitoring. Any conflict between the main
and speculative processes will be detected when it occurs and does not require
additional analysis after the pprs complete.
In addition to verifying the correctness of the in-flight ppr executions, it is important to handle cases where one of the processes attempts to terminate (whether
or not the termination is the result of correct execution). It is always safe for the
non-speculative main process to exit. The nature of the main process is such that a speculative process must be running as well (the identifier main only exists within the context of a ppr), and that speculative process must be terminated. It is worth noting that the speculative processes are performing useless computation, but no other useful ppr-related work could have been scheduled. Reaching a program exit point in the understudy process is equivalent to doing so in the main process, except that buffered output must be committed.
If a speculative process reaches a program exit point it cannot be permitted to
commit normally. The current bop system simply forces the speculative process
to abort, which allows the corresponding understudy to eventually reach the exit
point and complete. If the speculative process is the child of another speculative
process, that process is notified of the failure, which allows it to change directly
to control status and elide any further coordination with the terminal speculative
process. An alternative is for the speculative process to treat the exit as the
end marker of the current ppr. This would cause the speculative process to
synchronize with the main process once it reaches its own end marker, after which
the process will potentially commit and exit without delaying until the understudy
reaches the same point.

4.3.4

Commit

The bop commit routine is invoked when a process reaches an EndPPR marker.
The functionality is dependent on the state of the process; sequential and control
processes are ignored, while the other states are handled specifically. If the identifier parameter does not match the current ppr identifier, then the end marker
is ignored.
The commit routine for the speculative process involves synchronizing with the
non-speculative processes, as well as maintaining order among the other speculative processes. The actual tasks are provided in Listing 4.3.4 but can be summarized as follows: We first pass our token to the next waiting speculative process.
We then wait for the previous speculative process to indicate that it has completed

Algorithm 4.3.3 Listing of bop termination routine

void __attribute__((destructor)) BOP_End(void) {
    static short finished = 0;
    if (finished) return;
    finished = 1;

    switch (myStatus) {
    case SPEC:
        // Tell the parent to start early termination.
        if (mySpecOrder > 1)
            kill(getppid(), SIGUSR1);
        exit(EXIT_SUCCESS);
    case UNDY:
        // Commit any buffered output.
        SP_CommitOutput();
        // (fall through)
    case MAIN: case CTRL: case SEQ:
        BOP_pipeClose();
        // Kill all runtime processes (including self).
        kill(SP_gpid, SIGTERM);
        // Wait until the signal propagates.
        pause();
        exit(EXIT_SUCCESS);
        break;
    default:
        exit(EXIT_FAILURE);
    }
}


(assuming we are not the first). If this process is the first member of a group of
speculative processes then it must also wait for the previous group to have committed. Once the order among the speculative processes is confirmed the process
verifies the access maps are correct and copies the data changes it has made to
the next speculative process. Synchronization with the understudy is handled
by determining its process identifier, signaling the understudy, and waiting for
confirmation. Finally, the speculative process commits its output.
The commit routine for the understudy process is fairly simple. This is because
the understudy is considered to be on the critical path and much of the burden of
work has been placed elsewhere. Additionally, the understudy is not speculative.
As depicted in Listing 4.3.5, the understudy keeps a count of each EndPPR marker it
reaches. Because the speculative processes are placed into groups, the understudy
must complete all of the work of one group in order to succeed. The understudy
officially beats the speculative processes once it blocks the signal they would use to
declare completion. After this point the understudy can safely change its status to
control (which is not to be confused with being the lead process). The speculative
processes are killed, and output from the understudy committed.
The commit routine for the lead process (MAIN) is somewhat anomalous in
that it does not actually commit anything. The main process is responsible for
spawning the understudy process, and for synchronizing with the first speculative
process by passing its own data changes.

4.3.5

Abort

The abort routine basically just amounts to the speculative process exiting. Because the output has been buffered, and the operating system virtual memory
isolates any changes made, the process has no outside impact unless it is explicitly committed. The run-time system is structured so that if the speculative

Algorithm 4.3.4 Listing of ppr commit in the speculative process

void PostPPR_spec(void) {
    int token;
    size_t size = sizeof(token);
    // remove the restrictive protections from memory pages
    BOP_setProtection(PROT_READ | PROT_WRITE);
    // set the segfault handler back to the default
    signal(SIG_MEMORY_FAULT, SIG_DFL);
    if (myStatus == SPEC)
        // wait for "main done"
        SP_syncread(loHiPipes[mySpecOrder][0], &token, size);
    if (BOP_compareMaps()) exit(0);   // access conflict
    // If I am not the last spec task in the batch
    if (mySpecOrder < specDepth && !earlyTermination) {
        // never returns
        PostPPR_commit();
        return;
    }
    // copy all updates to the last SPEC task (mySpecOrder)
    for (int k = 0; k < specDepth; k++)
        SP_PullDataAccordingToMap(WRITEMAP(mySpecOrder),
                                  updatePipe[0], false);
    // clear the access map
    memset(accMapPtr, 0, (specDepth + 1) * BIT_MAP_SIZE);
    // reset early termination flag
    earlyTermination = false;
    // read the PID of the understudy
    SP_syncread(undyCreatedPipe[0], &token, size);
    kill(token, SIGUSR1);             // tell understudy of our progress
    // wait for acknowledgement from the understudy
    SP_syncread(undyConcedesPipe[0], &token, size);
    // spec wins
    myStatus = CTRL;
    SP_CommitOutput();
}

Algorithm 4.3.5 Listing of ppr commit in the understudy process

// BOP_PostPPR for the understudy
void PostPPR_undy(void) {
    ++undyWorkCount;
    // UNDY must catch SPECs
    if (undyWorkCount < specDepth) return;
    // ignore notices from the SPEC (the UNDY has won)
    sigprocmask(SIG_BLOCK, &sigMaskUsr1, NULL);
    myStatus = CTRL;
    undyWorkCount = 0;
    memset(accMapPtr, 0, (specDepth + 1) * BIT_MAP_SIZE);
    mySpecOrder = 0;
    // Indicate the success of the understudy.
    kill(SP_gpid, SIGUSR2);
    // Explicitly kill the first SPEC process.
    kill(specPid, SIGKILL);
    sigprocmask(SIG_UNBLOCK, &sigMaskUsr1, NULL);
    SP_CommitOutput();
}

Algorithm 4.3.6 Listing of ppr commit in the main process

void PostPPR_main(void) {
    if (earlyTermination) {
        // Speculation has failed.
        myStatus = CTRL;
        earlyTermination = false;
        return;
    }

    // Restart the next round.
    // open page protection for understudy
    BOP_setProtection(PROT_READ | PROT_WRITE);
    // start the understudy
    int fid = fork();
    switch (fid) {
    case -1: assert(0);
    case 0:   // the understudy
        myStatus = UNDY;
        setpgid(0, SP_gpid);
        mySpecOrder = -1;
        SP_RedirectOutput();
        // tell spec that undy is ready
        pid_t current_pid = getpid();
        write(undyCreatedPipe[1], &current_pid, sizeof(int));
        break;
    default:  // main continues
        PostPPR_commit();
        break;
    }
}

Algorithm 4.3.7 Listing of ppr commit finalization routine

void PostPPR_commit(void) {
    int token, size = sizeof(token);
    // send "main is done" to spec
    write(loHiPipes[mySpecOrder][1], &mySpecOrder, size);
    if (myStatus == SPEC)
        SP_syncread(loHiPipes[mySpecOrder - 1][0], &token, size);
    SP_PushDataAccordingToMap(WRITEMAP(mySpecOrder),
                              updatePipe[1]);
    // send "copy done"
    write(loHiPipes[mySpecOrder][1], &mySpecOrder, size);
    exit(0);
}

process aborts it means that either the understudy has finished the parallel region
first, or that there is an error indicated in the access maps. In either of these cases
the understudy process becomes the control process and continues running. If the
understudy process is aborting then it must be the case that the spec process has
succeeded. Because the understudy is useless at that point, it simply exits.

4.4

Types Of Speculative Parallelism

The bop system can be used to express parallelism in several ways. At the program level, parallelism can be broken into three categories: instruction level, data,
and task. The coarse-grained nature of process-based speculative parallelism
cannot take advantage of instruction level improvements, but it does address both
data and task parallelism.

[Figure 4.4: State diagram of bop. Edge labels represent begin and end ppr markers (B and E respectively). The states include START, CTRL, MAIN, SPEC i, SPEC i+1, UNDY, and the terminal states END1 through END4.]

4.4.1

Data-Parallel

Data parallelism is possible when the same operation can be performed on many data elements. This form of parallelism is often expressed in a loop, and the conversion from a sequential program will often focus there. It is not necessary that all instances of the parallel region perform exactly the same sequence of instructions, so control flow can change within the region. This is not the case in the simplest SIMD (single instruction, multiple data) style of parallelism. Other systems may offer an explicitly parallel loop, for example the DOALL construct available in Fortran or the parallel for directive in OpenMP, in which a loop is marked as parallel. The same effect is achieved with bop by making the loop body conditional on a BeginPPR marker and placing the EndPPR marker at the end of the loop body.
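As a concrete sketch of this pattern (the arrays in and out and the element-wise function process() are hypothetical; the ppr identifier follows the usage in Algorithm 4.3.2):

for (int i = 0; i < N; ++i) {
    if (!BeginPPR(1)) {
        /* the non-speculative process executes the body; a forked speculative
           process skips it and runs ahead to a later iteration */
        out[i] = process(in[i]);
    }
    EndPPR(1);
}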

4.4.2

Task-Parallel

Task parallelism exists when separate portions of the execution can be performed
independently. This can be implemented with the bop system by placing one
portion of otherwise straight-line code in a conditional block based on the return of
BeginPPR and finalized with an EndPPR marker. At some later point, an additional
EndPPR marker indicates that the speculative process needs the results of the
parallel task. At run time, the main process will execute the code within the ppr
block and spawn its understudy at its conclusion. The speculative process will skip
the conditional block, eventually synchronizing when it reaches the end marker. If
the understudy reaches the marker first, it will terminate the speculative process.
This arrangement is semantically similar to fork-join execution where the second end marker represents the join point. One can view the conditional block of
code in terms of a future that is explicitly consumed at the end marker. If the
code block were to be placed in a separate function, the syntax would even be

81

quite similar. This setup can be generalized to multiple parallel tasks by treating
each task as described above. Because only a newly created speculative process
receives a unique return value from BeginPPR, the understudy will double-check all of the tasks.
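A rough sketch of a single task under this arrangement follows. The task and helper names are hypothetical, and the placement of the first EndPPR inside the conditional block is an assumption made for illustration; only the overall shape of the marking is the point:

if (!BeginPPR(2)) {
    /* executed by the main process; its understudy takes over after the EndPPR below */
    summary = build_summary(data);   /* hypothetical independent task */
    EndPPR(2);
}
/* a newly forked speculative process skips the block and continues here,
   overlapping with the task above */
render_pages(data);                  /* hypothetical work that does not need summary */
EndPPR(2);                           /* join point: the task's result is needed from here on */
emit_report(summary);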
The series of ppr markers is necessary to guarantee that each task is not
dependent on the computation of earlier tasks. If the programmer knows that
the work a task is performing is ancillary to final results, then any data modified
within the task can be ignored by the bop run-time system.

4.5

Comparison to Other Approaches

4.5.1

Explicit Parallelism

In order to explicitly parallelize a program, it must be proved that the program will execute correctly in parallel in all cases. Perhaps the most significant advantage of using bop over an explicit technique is the guarantee of correct execution even if the region markers are incorrect. Using a ppr to guard a region of an execution is significantly easier than determining what data are modified within the region and protecting them appropriately.
In comparison to using locks when explicitly parallelizing a program, one does
not need to ensure the association between the protection (the lock) and the data
is correct. If this association is not correct then the lock fails to serve its purpose.
If one were to implement something like a ppr with locks, it would be necessary to
protect the body of the ppr with a lock and acquire the lock immediately before
attempting to access (either read or write) any of the data accessed within the
ppr.
Attempting to debug a parallel program, particularly in the face of race conditions, is complicated by the non-deterministic interleaving of the executions. A program running with the bop runtime will behave the same as if it were to be executed
sequentially, which largely obviates the need for debugging it. If errors in the
sequential program need to be diagnosed, the bop markers can be easily disabled
(become a non-operation) and the program run sequentially.
Even if locks are used correctly to synchronize parallel execution, these uses
cannot be composed into more general cases. The use of locks for parallel programming has a significant advantage over the bop system in efficiency. Locks introduce the least overhead of any synchronization technique, and can be used in fine-grained cases for which a ppr would not be appropriate.
Attempting to implement something analogous to pprs using a message passing representation would face many of the same problems as locking. Because
message passing generally requires an explicit receive statement, it must be placed
before the first potential access of any type to any of the data potentially modified
within the ppr. Additionally, the message would need to carry all data modified
in the ppr. Because the members of this set cannot generally be known until run
time, a conservative implementation would need to gather all data modified in the
ppr.

4.5.2

Fine-Grained Techniques

bop is not as efficient as thread-level techniques because of the overhead of general


protection and the problem of false sharing. Speculation also causes unnecessary
computations and by nature cannot handle general forms of I/O and other operations with unrecoverable side effects (inside a ppr). However, the main advantage
is ease of programming. bop can parallelize a program based on only partial information. It requires little or no manual changes to the sequential program and
no parallel programming or debugging. The overhead of the system can be hidden
when there is enough parallelism. bop uses unmodified, fully optimized sequential code while explicit threading and its compiler support are often restrained due to
concerns over the weak memory consistency on modern processors. With these
features, bop addresses scalability of a different sort: letting large, existing software benefit from parallel execution.
Any technique that does not use heavy-weight processes can be considered fine-grained. Such techniques are inherently unable to utilize operating system copy-on-write memory protection. Without hardware support, speculative parallelism
techniques must employ some other mechanism for the roll-back of speculative
writes.
In addition to lacking the operating system mechanism for protecting memory stores, fine-grained techniques face distinct challenges with regard to logging
memory loads. While the page level read/write access can be manipulated as
in the Fast Track system, this approach is non-viable. The time spent handling
the operating system level signal is far too high in proportion to the duration of
the parallel work. Additionally, the run-time system must do more work than
a system such as Fast Track to determine which thread performed the memory
access.
The more common approach is for the run-time system to instrument memory loads and stores to allow for logging (and subsequent roll-back or replay).
Excluding systems relying on hardware support, such instrumentation amounts
to expensive additional operations surrounding all memory accesses. These additional operations introduce overheads measured as multiples of the execution
time.

4.6

Evaluation

4.6.1

Implementation and Experimental Setup

Compiler support is implemented with a modified version of the GNU Compiler


Collection (gcc) 4.0.1 at the intermediate language level. After high-level program optimization passes but before machine code generation, the compiler converts global variables to use dynamic allocation for proper protection. We did not
implement the compiler analysis for local variables. Instead the system privatizes
all stack data. All global and heap data are protected. Each global variable is
allocated on separate page(s) to reduce false sharing.
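The conversion itself happens inside the modified compiler, but its effect can be pictured with a small, hypothetical sketch: a global that the original program declared directly is instead reached through a pointer whose storage is page-aligned, so that no two globals share a page and a page-level conflict report maps back to a single variable. The names below are illustrative, not the compiler's actual output:

#include <stdlib.h>
#include <unistd.h>

/* original declaration:  int counter;                                   */
/* after conversion the program accesses *counter_ptr instead (assumed)  */
static int *counter_ptr;

static void bop_alloc_globals(void) {
    long page = sysconf(_SC_PAGESIZE);
    /* at least one full page per variable */
    posix_memalign((void **)&counter_ptr, (size_t)page, (size_t)page);
}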
Also based on gcc 4.0.1 are an instrumentor and a behavior analyzer. The
instrumentor collects complete program traces with unique identifiers for instructions, data accesses, and memory and register variables, so the behavior analyzer
can track all data dependencies and identify pprs.
The bop runtime is implemented as a statically linked library. Shared memory
is used for storing snapshots, access maps, and for copying data at a commit.
Most communication is done by signals, and no locks are used. Two similar
systems have been implemented in the past within our research group using binary
instrumentors. These systems do not require program source but offer no easy way
of relocating global data, tracking register dependencies, or finding the cause of
conflicts at the source level.
In bop, the lead process may die long before the program ends, since each
successful speculation produces a new lead (see Figure 4.1 for an example). Now
each parallelized program starts with a timing process that forks the first lead
process and waits until the last process is over (when a lead process hits a program
exit). Instead of collecting user and system times for all processes, the wall-clock time of the timing process is used, which includes OS overheads in process scheduling. Experiments use multiple runs on an unloaded system with four dual-core Intel 3.40 GHz Xeon processors, with 16MB of shared L3 cache. Compilation is done with gcc 4.0.1 with the -O3 flag for all programs.

Table 4.4: XLisp Private Variables

    buf      : for copying string constants
    gsprefix : for generated name strings
    xlfsize  : for counting the string length in a print call
    xlsample : the vestige of a deleted feature called oscheck
    xltrace  : intermediate results for debugging

Table 4.5: XLisp Checked Variables

    xlstack   : current stack pointer, restored after an evaluation
    xlenv     : current environment, restored after an evaluation
    xlcontext : the setjump buffer for exception handling
    xlvalue   : would-be exception value
    xlplevel  : parenthesis nesting level, for command prompt

4.6.2

Application Benchmarks

XLisp Interpreter v1.6 by D. M. Betz


The XLisp code, which is available as part of the SPEC 1995 benchmark suite,
has 25 files and 7616 lines of C code. The main function has two control loops,
one for reading expressions from the keyboard and the other for batch processing
from a file. The body of the batch loop is marked by hand as a ppr. Through the
programming interface described in Section 4.2.3, 5 likely privatizable variables are
identifiable (listed in Table 4.6.2), along with 5 checked variables (Table 4.6.2) and
one reduction variable, gccalls, which counts the number of garbage collections.
We do not know much about the rest of the 87 global variables (including function
pointers) except that they are all monitored by bop.
The speculatively parallelized version of XLisp runs successfully until the
garbage collection routine is activated. Because of the extensive changes the


collector makes to the memory state, it always kills the speculation. To solve this
problem, the mark-sweep collector implementation is revised for bop as described
briefly here. The key idea is to insulate the effect of garbage collection so it can
be done concurrently, without causing unnecessary conflicts. Each ppr uses a
separate page-aligned memory region. At the beginning of a ppr instance (after
forking but before data protection) the garbage collector performs a marking pass
over the entire heap to record all reachable objects in a start list. New objects are
allocated inside the pre-allocated region during the execution of the ppr. When
the garbage collection is invoked, it marks only objects inside the region but traverses the start list as an additional set of root pointers. Likewise, only objects
within the region that are unmarked are freed. At the end of the ppr, the garbage
collector is run again, so only the pages with live objects are copied at the commit.
The code changes to implement this region-based garbage collection comprise the
introduction of three new global variables and 12 additional statements, most of
which are for collecting and traversing the start list and resetting the MARK flags
in its nodes.
The region-based mark-sweep has non-trivial costs at the beginning and end
of pprs. Within the ppr the collector may not be as efficient: it may fail to reclaim all garbage, because some nodes in the start list would have become unreachable in the sequential run. The extent of these costs depends on the input. In addition, the memory regions will accumulate long-lived data, which
leads to more unnecessary alerts from false sharing. The lisp evaluation may
trigger an exception leading to an early exit from within a ppr, so the content
of checked variables may not be restored even for parallel expressions. Therefore,
one cannot decide a priori whether the chance of parallelism and its likely benefit
would outweigh the overhead. However, these are the exact problems that bop is
designed to address with its streamlined critical path and the on-line sequential-parallel race.

Table 4.6: Execution times for various speculation depths

                   Serial    Speculative
                              1       3       7
    Times (s)      2.25      1.50    0.95    0.68
                   2.27      1.48    0.94    0.68
                   2.26      1.47    0.94    0.68
    Speedup        1.00      1.53    2.39    3.31

The NQueens input from the spec95 benchmark suite, which computes all positions of n queens on an n-by-n chess board in which no attacks are possible, is used as a test case of the bop-lisp interpreter. Four lines of the original five
expression lisp program are modified, resulting in 13 expressions, of which 9 are
parallelized in a ppr. When n is 9, the sequential run takes 2.36 seconds using the
base collector and 2.25 seconds using the region-based collector (which effectively
has a larger heap but still needs over 4028 garbage collections for nine 10K-node
regions). The results of testing three speculation depths are listed in Table 4.6.
The last row of Table 4.6 shows that the speedup, based on the minimum time from three runs, is a factor of 1.53 with 2 processors, 2.39 with 4 processors,
and 3.31 with 8 processors. The table does not list the additional cost of failed
speculations, which accounts for 0.02 seconds of the execution.

GZip v1.2.4 by J. Gailly


GZip takes one or more files as input and compresses them one by one using the
Lempel-Ziv coding algorithm (LZ77). This case is based on version 1.2.4, which
available from the spec 2000 benchmark suite. Much of the 8616-line C code
performs bit-level operations, some through in-line assembly. The kernel is based
on a well worn implementation originally written for 16-bit machines. During
testing the program is not instructed to act as a spec and behaves as a normal
compressor rather than a benchmark program (which artificially lengthens the

88

Table 4.7: The size of various protection groups in


Data Groups
GZip
object count 33
Shared Data
size (bytes)
210K
accesses
116M
object count 78
Checked Data
size (bytes)
2003
accesses
46M
object count 33
Private Data (likely) size (bytes)
119K
accesses
51M

training runs
Parser
35
70K
343M
117
5312
336M
16
6024
39M

input by replication).
Table 4.7 shows the results of the bop analyzer, which identifies 33 variables
and allocation sites as shared data, 78 checked variables (many of which are not used during compression), and 33 likely private variables. Behavior analysis detected flow dependencies between compressions because the original GZip failed to completely reinitialize parts of its internal data structure before starting compression on a new file. The values would have been zeroed if the file was the first
to be compressed, and in this test the code has been changed to reinitialize these
variables. Compression returns identical results in all test inputs.
The sequential GZip code compresses buffered blocks of data one at a time, and
stores the results until an output buffer is full. pprs are manually placed around
the buffer loop and the set of likely private variables are specified through the
program interface described in Section 4.2.3. In this configuration the program
returned correct results, but speculation continually failed because of conflicts
caused by two variables, unsigned short bi_buf and int bi_valid, as detected by the
run-time monitoring.
The two variables are used in only three short functions. After inspecting the
original source code it became clear that the compression produces bits rather
than bytes, and the two variables stored the partial byte of the last buffer. This

89

Speculative
3
7
8.46 8.56 7.29 7.71 5.38 5.49 4.80 4.47
8.50 8.51 7.32 7.47 4.16 5.71 4.49 3.10
8.53 8.48 5.70 7.02 5.33 5.56 2.88 4.88
8.51
7.09
5.27
4.10
1.00
1.20
1.61
2.08
Sequential

Times (s)
Average Time
Average Speedup

Table 4.8: Execution times of bop GZip


dependency was hidden below layers of code and among 104 global variables, but
the run-time analyzer enabled quick discovery of the hidden dependency. The
byte cannot simply be filled (as is done for the final byte) if the resulting file is
to be decompressed with the stock Gunzip. A single extra or error bit will render
the output file meaningless to the decompressor. The solution is to compress
individual data buffers in parallel and concatenate the compressed bits afterward.
The intra-file compression permits single-file compression to use multiple processors. The bop version of GZip is tested using a single 84MB file (the gcc 4.0.1 tar file). Table 4.8 shows the comparison between the running time of the unmodified sequential code and the bop version running at three speculation depths. Although the execution time is stable in sequential runs, it varies by as much as 67% in parallel runs, so Table 4.8 includes the results of six consecutive tests of each version, and the computed speedup is based on the
average time.
With 2, 4, and 8 processors, the parallel compression gains speedups of 1.20,
1.61, and 2.08. The 8-way GZip is twice as fast and it is slightly faster than data
decompression by Gunzip, whose time is between 4.40 and 4.73 seconds in 6 runs.
The critical path of bop GZip, when all speculation fails, runs slightly faster than the sequential version because of the effect of prefetching by the speculation. Intra-file speculation uses additional memory, mostly for spec to buffer the compressed
data for the input used. In addition, the program has 104 global variables, so the

space overhead for page allocation is at most 104 pages or half a megabyte for the sequential execution. The space cost of their run-time replication is already counted in the numbers above (130KB and 7.45MB).

[Figure 4.5: The effect of speculative processing on the Sleator-Temperley English parser v2.1. The plot shows wall-clock time (sec.) against the number of sentences in the possibly parallel region (10, 25, 50, and 100) for the sequential run, co-processing with 0% parallelism, and co-processing with 97% parallelism.]

Sleator-Temperley Link Parser v2.1


The parser has a dictionary of about 60000 word forms. It has coverage
of a wide variety of syntactic constructions, including many rare and
idiomatic ones. [. . . ] It is able to handle unknown vocabulary, and
make intelligent guesses from context about the syntactic categories
of unknown words.
(Spec2K web site)

                   Sequential    Speculative
                                  1        3        7
    Times (s)      11.35         10.06    7.03     5.34
                   11.37         10.06    7.01     5.35
                   11.34         10.07    7.04     5.34
    Speedup        1.00          1.13     1.62     2.12

It is not immediately clear from the documentation or from the 11,391 lines of its C code whether the Sleator-Temperley Link Parser can process sentences in parallel; in fact, not all of them can be. If a ppr instance parses a command sentence
which changes the parsing environment, e.g., turning on or off the echo mode, the
next ppr instance cannot be speculatively executed. This is a typical example of
dynamic parallelism.
The bop parallelism analyzer identifies the sentence-parsing loop. We manually strip-mine the loop to create a larger ppr. The data are then classified
automatically as shown in Table 4.7. During the training run, 16 variables are always written first by the speculation process, 117 variables always have the same value at the two ends of a ppr instance, and 35 variables are
shared.
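For concreteness, the strip-mined loop might look like the following sketch, where BeginPPR()/EndPPR() stand in for the bop region markers and parse_sentence() for the parser's per-sentence work; the names and the group size are illustrative only.

#define GROUP_SIZE 10                 /* sentences per ppr instance */

extern int  num_sentences;
extern void parse_sentence(int index);
extern void BeginPPR(void);           /* hypothetical stand-ins for the bop markers */
extern void EndPPR(void);

void parse_all(void)
{
    for (int i = 0; i < num_sentences; i += GROUP_SIZE) {
        BeginPPR();                   /* one larger ppr per group of sentences */
        int end = i + GROUP_SIZE;
        if (end > num_sentences)
            end = num_sentences;
        for (int j = i; j < end; j++)
            parse_sentence(j);        /* original loop body */
        EndPPR();
    }
}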
The test input for the parallel version of the parser uses 1022 sentences obtained by replicating the spec95 training input twice. When each ppr includes the parsing of 10 sentences, the sequential run takes 11.34 seconds, and the parallel runs show speedups of 1.13, 1.62 and 2.12 with a few failed speculations due to the dynamic parallelism.
The right-hand side of Figure 4.5 shows the performance on an input with 600 sentences. Strip-mine sizes ranging from 10 sentences to 100 sentences per group are tested, and the group size has mixed effects on program performance. For sequential and spec fail, the largest group size leads to the lowest overhead, 3.1% and 3.6% respectively. Speculative processing improves performance by 16%, 46%, 61%, and 33% for the four group sizes. The best performance occurs with the medium group size. When the group size is small, the relative overhead is high; when the group size is large, there are fewer ppr instances and they are more likely to be unevenly sized. Finally, the space overhead of speculation is 123KB, 100KB of which is checked data. This space overhead does not seem to change with the group size.

Comparison with Threaded Intel Math Kernel Library


The Intel Math Kernel Library 9.0 (mkl) provides highly optimized, processor-specific, and multi-threaded routines specifically for Intel processors. The library includes Linear Algebra Package (LAPACK) routines used for, among other things, solving systems of linear equations. In this experiment the performance of solving eight independent systems of equations using the dgesv routine is used for comparison. mkl exploits thread-level parallelism within, but not across, library calls. The number of threads used is defined by setting the OMP_NUM_THREADS environment variable. bop, on the other hand, can speculatively solve the systems of equations in parallel even when it uses an unparallelized library, so the value of OMP_NUM_THREADS is set to one for bop executions. Since the program data are protected, bop guarantees program correctness if speculation succeeds.
Math Kernel Library experiments were conducted on systems of equations with the number of equations ranging from 500 to 4500 in increments of 500. For each system of equations the mkl-only implementation is tested with the number of threads set to 1, 2, 4, and 8. For the bop and mkl implementation the level of speculation tested was correspondingly set to 0, 1, 3, and 7. Results for the single-threaded mkl run and zero-speculation bop run are not shown. As shown in Figure 4.6, bop-mkl depth 1 and omp-mkl thread 2 perform similarly, with the mkl-only implementation achieving at most an 18% increase in operations per second for 1000 equations. For bop-mkl depth 3 and bop-mkl depth 7, the run-time overhead of bop prevents the system from achieving speedups for systems with

Figure 4.6: Solving 8 systems of linear equations with Intel MKL: billion operations per second vs. the number of linear equations (500 to 4500), comparing bop-mkl at speculation depths 1, 3, and 7 with omp-mkl using 2, 4, and 8 threads.


1500 equations or fewer. However, above this point the coarse-grained parallelism provided by bop is able to outperform the fine-grained, thread-level parallelism
of the mkl library. Increases between 15% and 20% are seen for bop-mkl depth 7
compared to omp-mkl thread 8 and increases between 7% and 11% are seen for
bop-mkl depth 3 compared to omp-mkl thread 4.
The comparison with threaded mkl helps to develop an understanding of the overhead of the process-based bop system, in particular its relationship with
the size of parallel tasks and the speculation depth. The results demonstrate the
property explained in Section 4.1.1: the overhead becomes smaller if the granularity is large or if the speculation depth is high. For 1500 equations, 3 speculation
processes perform 10% slower than 4-thread MKL because of the overhead. However, for the same input size, the greater parallelism from 7 speculation processes

more than compensates for the overhead and produces an improvement of 16% over 8-thread mkl. Experiments pitting bop against another scientific library, the threaded automatically tuned linear algebra software (atlas), show similar results.


Speculative Optimization

Introduction
In this chapter I present a variation on process-based speculative execution called
Fast Track. The Fast Track system is based on the infrastructure for speculative
execution described in Chapter 3 but is applicable for a wholly different set of
uses from those in Chapter 4. Fast Track allows the use of unsafely optimized
code, while leaving the tasks of error checking and recovery to the underlying
implementation. The unsafe code can be implemented by a programmer or by a
compiler or other automated tool, and the program regions to be optimized can be
indicated manually or determined during execution by the run-time system. As
before, the system uses coarse-grain tasks to amortize the speculation overhead
and does not require special hardware support.
The shift in processor technology toward multicore, multi-processors opens
new opportunities for speculative optimization, where the unsafely optimized code
marches ahead speculatively while the original code follows behind to check for
errors and recover from mistakes. In the past, speculative program optimization
has been extensively studied both in software and hardware as an automatic
technique. The level of improvement, although substantial, is limited by the ability of both the static and run-time analyses. In fact, previous techniques primarily targeted individual loops and only considered transformations based on value and dependency information.
One may question the benefit of this setup: suppose the fast code gives correct
results, would we not still need to wait for the normal execution to finish to know
it is correct? The reason for the speed improvement is the overlapping of the
normal tracks. Without fast track, the next normal track cannot start until the
previous one fully finishes. With fast track, the next one starts once the fast code
for the previous normal track finishes. In other words, although the checking is as
slow as the original code, it is now done in parallel. If the fast code has an error or occasionally runs slower than the normal code, the program executes the normal code sequentially and is not delayed by a straying fast track.
In Section 5.2 I describe the programming interface for Fast Track. This interface can be used by an automated tool, or in a natural way by a human programmer with little effort. In Section 5.3 I describe the ways in which the Fast Track run-time system extends the basic runtime described in Section 3.1.

5.1 Design

5.1.1 Fast and Normal Tracks

The FastTrack system represents two alternative methods of execution for some
portion of a program. At run time both of the methods are executed in parallel.
One of the two is identified a priori to be the canonical method, while the other is
assumed to potentially be unsafe in some cases. The unsafe execution is expected
to complete more quickly and is referred to as the fast track while the correct
computation is called the normal track.

5.1.2 Dual-track

In addition to the fast and normal track notation, the FastTrack run-time system
allows for a pair of parallel executions that are considered to be indistinguishable.
In this usage, both of the executions are referred to as Dual Tracks. Here,
whichever of the dual tracks can complete first leads to the continuing sequential
execution. The track which finishes more slowly will then confirm the results of the
first. If the two tracks are known with certainty to compute the same information
(but at unpredictable rates) the verification can be disabled.

5.2 Programming Interface

The FastTrack programming interface allows a programmer to optimize code at the semantic level, to select competing algorithms at run time, or to insert on-line analysis modules such as locality profiling or memory-leak detection. Algorithms 5.2.1 and 5.2.2 show example uses of FastTrack to enable unsafely optimized loop and function execution. If the fast tracks are correct, they will constitute the critical path of the execution. The original loop iterations and function executions, which we refer to as normal tracks, will be carried out in parallel, on the side. The use of FastTrack allows multiprocessors to improve the speed of sequential tasks.
A fast-track region contains a beginning branch if (FastTrack()), the contents of the two tracks, and an ending statement EndDualTrack(). An execution of the code region is called a dual-track instance, in which the two tracks are the fast track and the normal track. A program execution consists of a sequence of dual-track instances along with any computations that occur before, between, or after these instances. Any region of code whose beginning dominates the end in control flow can be made a dual-track region. Nesting of regions is allowed by maintaining the type of the track. When an inner dual-track region is encountered, the outer fast track will take the inner fast track, while the outer normal track will take the


Algorithm 5.2.1 Unsafe loop optimization using fast track. Iterations of fast_fortuitous will execute sequentially. Iterations of safe_sequential will execute in parallel with one another, checking the correctness of the fast iterations.

while (...)
{
    ...
    if (FastTrack()) {
        fast_fortuitous();     // unsafely optimized
    } else {
        safe_sequential();     // safe code
    }
    EndDualTrack();
    ...
}

Algorithm 5.2.2 Unsafe function optimization using fast track. Routines fast_step_2 and step_2 can start as soon as fast_step_1 completes. They are likely to run in parallel with step_1.

...
if (FastTrack()) {
    fast_step_1();     // optimized
} else {
    step_1();          // safe code
}
...
if (FastTrack()) {
    fast_step_2();     // optimized
} else {
    step_2();          // safe code
}


inner normal track. Statements with side effects that would be visible across the processor boundary, such as system calls and file input and output, are prohibited inside a dual-track region. The amount of memory that a fast instance may allocate is bounded so that an incorrect fast instance will not stall the system through excessive consumption. Algorithm 5.2.1 in the previous section shows an example of a fast track that has been added to the body of a loop. The dual-track region can include just a portion of the loop body, multiple dual-track regions can be placed back-to-back in the same iteration, or a region can be used in straight-line code. Algorithm 5.2.2 shows the use of fast track on two procedure calls, with . . . standing in for any other statements in between. Multiple dual-track regions do not have to be arranged in a straight sequence. One might be used only within a conditional branch, while another could be in a loop.

5.3 Run-time Support

5.3.1 Creation

In addition to the general creation process described in Section 3.1.1, the FastTrack run-time variant must enable state comparison between the fast and normal tracks. Within the FT_BeginFastTrack run-time hook, prior to spawning a normal track, the system allocates a shared memory space for two access maps, and a shared data pipe. The use of these objects is described in Section 5.3.2.

5.3.2 Monitoring

During execution, memory pages are protected so that any write access will trigger
a segmentation fault. Both the fast and normal tracks use a signal handler to catch
the faults and record the access in a bit map.


Algorithm 5.3.1 Listing of FastTrack creation.

// Returns 1 when entering the fast track, 0 otherwise.
int FT_BeginFastTrack(void) {
    int sen_pipe[2];
    // If we are currently in a fast track, finish it.
    if (FT_active) FT_PostDualTrack();
    // The number of processors used is specDepth + 1.
    if (FT_maxSpec < 1) return 0;
    if (SEQ == myStatus) return 0;
    // Set up memory access handler to watch page modification.
    if (SP_setupAction(FT_SegvHandler, SIG_MEMORY_FAULT))
        return FT_errorOnBegin();
    // Set up communication channel for data updating.
    if (pipe(updatePipe)) return FT_errorOnBegin();
    // Set up new access maps for the fast and normal tracks.
    if (FT_setupMaps()) return FT_errorOnBegin();
    // Set up pipe for indicating track seniority.
    if (pipe(sen_pipe)) return FT_errorOnBegin();
    ++FT_order;  // Record serial number of the new normal track.
    // Enqueue order to readyQueue.
    if (FT_order > FT_maxSpec)
        write(readyQ->pipe[1], &FT_order, sizeof(FT_order));
    int PID = fork();
    if (-1 == setpgid(0, SP_gpid))
        perror("failed to set process group");
    switch (PID) {
    case -1:
        return FT_errorOnBegin();
    case 0:
        return FT_internalBeginNormal(sen_pipe);
    default:
        return FT_internalBeginFast(sen_pipe);
    }
}


In order to compare the memory modifications of the two tracks, the fast track must provide the normal track with a copy of any changes it has made. At the
end of each dual track region, the fast track evaluates its access map to determine
what pages have been modified. Each page flagged in the access map is pushed
over a shared pipe, and consumed by the normal track, which then compares the
data to its own memory page.
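The sending side is not listed here; the following sketch shows what it might look like. It reuses the names that appear in the listings (SP_checkMap, FT_accMap, updatePipe, PAGESIZE, PAGECOUNT), but FT_SendData() itself is a hypothetical name, and the extern declarations and fallback macro values exist only to make the sketch self-contained.

#include <stddef.h>
#include <unistd.h>

#ifndef PAGESIZE
#define PAGESIZE  4096           /* placeholder values for the sketch */
#define PAGECOUNT (1UL << 20)
#endif

extern int   SP_checkMap(void *page_addr, char *map);  /* assumed signature */
extern char *FT_accMap;
extern int   updatePipe[2];

/* Fast track: push every page flagged in the access map through the pipe;
 * FT_CheckData() (Algorithm 5.3.3) is the matching consumer. */
void FT_SendData(void)
{
    for (unsigned long page = 0; page < PAGECOUNT; page++) {
        if (SP_checkMap((void *)(page * PAGESIZE), FT_accMap)) {
            const char *addr = (const char *)(page * PAGESIZE);
            size_t sent = 0;
            /* A pipe write may be partial, so loop until the page is out. */
            while (sent < PAGESIZE) {
                ssize_t c = write(updatePipe[1], addr + sent, PAGESIZE - sent);
                if (c <= 0)
                    return;   /* broken pipe: the normal track will fail verification */
                sent += (size_t)c;
            }
        }
    }
}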
Algorithm 5.3.2 Listing of FastTrack monitoring.

static void FT_SegvHandler(int sig, siginfo_t *info,
                           ucontext_t *context)
{
    assert(SIG_MEMORY_FAULT == sig);
    assert(context);
    // Accesses to pages that are not mapped are true faults.
    if (info->si_code == SEGV_MAPERR)
        if (-1 == kill(SP_gpid, SIGALRM))
            perror("failed to kill the timer");
    if (!WRITEOPT(context)) return;
    // Record the page and remove the restriction.
    void *faultAdd = info->si_addr;
    SP_recordAccessToMap(faultAdd, FT_accMap);
    if (mprotect(PAGESTART(faultAdd), 1, PROT_WRITE | PROT_READ)) {
        perror("failed to change memory access permission.\n");
        abort();
    }
}

5.3.3 Verification

To guarantee that the speculative execution is correct, the memory state of the fast and normal tracks is compared at the end of the dual track region. If the
fast track reached the same state as the normal track, then the initial state of the


next normal track must be correct. Typically, the next normal track was started
well before its predecessor finished, and it will know only in hindsight that it was
correctly initialized.
The normal track is responsible for comparing the writes made by both itself
and the fast track. The memory state comparison is performed once the normal
track has finished the dual track region because this is the first point at which
verification is possible. The comparison first determines if the set of writes made
by the two tracks is identical, which is handled by a simple memcmp on the access
map of each of the two tracks. The process then compares the writes themselves
using the FT CheckData run-time call as in Listing 5.3.3. Verification will fail if
either the set or contents differ, or if the fast track has not yet completed the dual
track region.
Once verification has been completed successfully, the two processes are known to have made identical changes to the same memory locations. From that point forward, the execution of the two processes would be identical. Given this, one of the tracks is superfluous. Because the fast track is aborted if it does not reach the end of the dual track region first, we assume that it has continued past that point and completed other useful work. The normal track is thus useless (since it would be recomputing exactly what the fast track has already computed) and aborts.
It is worth noting that although multiple dual track regions (i.e., multiple pairs
of fast and normal tracks) may exist simultaneously, a single process will have at
most one fast access map and one normal access map. Because the normal track
is responsible for performing the verification routine, the fast track can abandon
the access map it had been using for a region once the region is complete. The
normal track will still have access to that map. Once the map has been analyzed,
the normal track will abort or transition to the fast state.


Algorithm 5.3.3 Listing of FastTrack verification routine FT_CheckData.

int FT_CheckData(void) {
    unsigned long page = 0;
    char buffer[PAGESIZE];
    int count, c;
    while (page < PAGECOUNT) {
        // Returns true if the bitmap is set for this page.
        if (SP_checkMap((void *)(page * PAGESIZE), FT_accMap)) {
            count = 0;
            // Read a full page into a local buffer.
            while (count < PAGESIZE) {
                c = read(updatePipe[0], buffer, PAGESIZE);
                if (-1 == c) {
                    perror("failed to read from pipe");
                } else {
                    if (0 == c) return UINT_MAX;
                    else count += c;
                }
            }
            // Compare our memory page to the buffered page.
            if (0 != memcmp(buffer, (void *)(page * PAGESIZE),
                            PAGESIZE))
            {
                return page + 1;   // non-0 indicates failure
            }
        }
        page++;
    }
    return 0;   // 0 indicates success
}

5.3.4 Abort

The FastTrack abort routine is handled almost entirely by the normal track. The normal track first waits to receive a notification that all of the preceding normal tracks have completed, at which point it commits any buffered output and performs the verification routine. If the fast track needs to be aborted for any of the reasons indicated in Section 5.3.3, the process executing the fast track is terminated. Because the normal track performs the verification, all cases in which the fast track is terminated pass through the same code path. The normal-track process explicitly signals the process running the fast track, which handles the signal by simply closing the communication pipes and exiting. The steps taken by the normal track after completing the dual track region are provided in Listing 5.3.5.
The normal track will continue executing until the next dual-track region is
encountered, or a program exit point is reached. Depending on the difference
in execution speed between the fast and normal track, the fast track may have
reached other dual track regions. In this case the abort of the fast track is followed
by the normal track sending a flag through the floodgates as an indication to
any waiting normal tracks that they should abort. Any normal tracks that have
already been released from the floodgate will run through their dual track region.
At the end of the region the process will synchronize by waiting to receive a flag
through the inheritance pipe indicating that it is the oldest running normal track.
In the case of an error in an earlier normal track, that synchronization flag will
indicate that the current process should also abort.

5.3.5 Commit

If the normal track verifies the correct execution of the dual track region, it cleans up and aborts. The fast track is free to continue execution, possibly entering more
FastTrack regions and creating further normal tracks.


Algorithm 5.3.4 Listing of slow track commit routine.

static void FT_slowTakesOver(void) {
    int token = -1;
    int ppid = getppid();            // (parent is the fast track)
    kill(ppid, SIGABRT);             // kill the fast track
    // Tell running slow tracks to abort on completion.
    SP_syncwrite(bequest, &token, sizeof(token));
    // Flush processes waiting at floodgates.
    FT_clearFloodgates();
    FT_slowCleanup();
    FT_init();                       // set up meta data
    myStatus = FAST;                 // become FAST track
}

void FT_PostSlow(void) {
    FT_becomeOldest();               // wait for inheritance token
    SP_CommitOutput();
    if (memcmp(FT_fastAccMap, FT_slowAccMap, ACC_MAP_SIZE)
        || FT_CheckData())           // true if data changes differ
    {
        FT_slowTakesOver();
        return;
    }
    // Pass the inheritance token.
    write(bequest, &FT_order, sizeof(FT_order));
    // Let a process leave the floodgate.
    FT_releaseNextSlow();
    FT_slowCleanup();
    exit(1);
}

Figure 5.1: State diagram of FastTrack processes.

5.3.6 Special Considerations

There are a number of corner cases of which the Fast Track system must take
account.

Seniority Control
Because the fast track may spawn multiple normal tracks, which may then run
concurrently, each normal track must know when all of its logical predecessors
have completed. Before a normal track terminates, it waits on a flag to be set by
its predecessor, and then signals its successor when complete. If there is an error
in speculation, the normal track uses the same mechanism to lazily terminate
normal tracks that are already running once they reach the end of their FastTrack
region.

Output buffering
To ensure that the output of a program running with FastTrack support is correct,
we ensure output is produced only by a normal track that is known to be correct
and is serialized in the correct order. Until a normal track has confirmed that
its initialization was correct (i.e., that all previous speculation was correct), it
buffers all terminal output and file writes. Once all previous normal tracks have
been committed the normal track is considered to be the oldest, and we can
be certain that its execution is correct. Given correct execution, any output the
process produces will be the same as what the sequential program would have
generated. The fast track never produces any output to the terminal nor does it
write to any regular file.
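A minimal sketch of the buffering idea is shown below. The helper names are hypothetical; the real run-time commits through SP_CommitOutput as shown in the listings, and a full implementation would also cover file writes rather than a single fixed-size buffer for the terminal.

#include <string.h>
#include <unistd.h>

#define OUT_BUF_MAX (1 << 20)
static char   out_buf[OUT_BUF_MAX];
static size_t out_len;

/* Used in place of a direct terminal write while the track is speculative. */
void FT_bufferWrite(const char *data, size_t n)
{
    if (out_len + n <= OUT_BUF_MAX) {
        memcpy(out_buf + out_len, data, n);
        out_len += n;
    }
    /* A real implementation would grow the buffer or abort the track here. */
}

/* Called once the track is known to be the oldest and therefore correct. */
void FT_flushOutput(void)
{
    if (write(STDOUT_FILENO, out_buf, out_len) < 0) {
        /* output is best effort in this sketch */
    }
    out_len = 0;
}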


Implicit End Markers


The FastTrack end point can be indicated explicitly using the FT_PostDualTrack programming interface, but it is also handled implicitly in several cases. This flexibility makes the job of the programmer easier by reducing the amount of additional code they must write. Implicitly determining the end of the dual track region also helps ensure correctness by catching cases where the user neglected to correctly mark the end of the region. It should be noted that explicitly marking the end of the region reduces the system overhead by pruning one of the system processes earlier.
There are two ways in which the end of a dual track region is determined implicitly. The first is when the beginning of a dual track region is encountered. Any process that is currently executing a dual track region (in any state) records a flag to indicate its active participation. The first step the run-time system takes when entering the FT_BeginFastTrack or FT_BeginDualTrack routines is to check this activity flag and, if it is set, invoke the FT_PostDualTrack routine. This activity flag is distinct from the identifying state of the process (i.e., FAST), which remains in effect.
The other implicit end marker point is a bit more subtle: we must capture
all program exit points in both the fast and normal tracks. This guarantees that
incorrect speculation does not direct a normal track to perform computation that
leads to a spurious exit from the program. In addition to ensuring correctness
in all cases, capturing all of the program exit points allows for correct program
termination to occur within the scope of a dual track region.
In the normal track we are ensuring that the same computation was performed
as in the fast track. In the fast track we must capture the program exit point and
wait for all normal tracks to finish. This may mean that they complete their dual
track region, or that the normal track has also reached the program exit. Because


the normal tracks are serialized, the fast track only needs to wait for the last
normal track it spawned to complete. This is achieved using the same mechanism
the normal tracks use to order themselves: the fast track waits on the inheritance
token. Note that the fast track is not necessarily waiting for the normal track to reach the same program exit point, but the state of the two will agree.
Whether or not we are within the scope of a dual track region, the correctness of
the fast track is not known until the verifying normal tracks complete. Although
we could terminate the fast track and allow the normal track to simply do its
work, the normal track may be predicated on the results of other normal tracks.
Keeping the state of the fast track allows the earlier normal tracks to validate.
The alternative would be to abort all but the oldest normal track, potentially
wasting work.

Processor Utilization
The objective of speculative execution is for execution to occur as quickly as possible. In order to make this happen, the run-time system should use the available processing cores as wisely as possible. In a naive approach the fast track would run until it exits the program, spawning normal tracks along the way. Each normal track would compute its own version of its dual track region and verify correct computation.
Although execution of the normal tracks (with the exception of the oldest) is speculative, based on the correctness of the fast track, they are taking advantage of otherwise unused resources. However, if we spawn too many normal tracks, they may begin to contend for hardware resources. Ultimately the normal tracks are performing the real computation, and delaying their execution would be wasteful. This is true either if we allow a more speculative process to be scheduled in favor of an older one, or if it merely interferes with it.


Algorithm 5.3.5 Listing of FastTrack exit point handler.

void __attribute__((destructor)) FT_exitHandler(void) {
    int token;

    if (FT_active) FT_PostDualTrack();
    switch (myStatus) {
    case FAST:
        close(readyQ->pipe[0]);
        close(readyQ->pipe[1]);
        // Wait for the last normal track.
        SP_syncread(inheritance, &token, sizeof(int));
        close(inheritance);
        kill(SP_gpid, SIGTERM);
        break;
    case SLOW:
        // Wait to be the oldest.
        if (FT_order > 1)
            SP_syncread(inheritance, &token, sizeof(token));
        SP_CommitOutput();          // commit output
        kill(SP_gpid, SIGTERM);     // terminate speculation
        break;
    default:
        break;
    }
}


The FastTrack system mitigates the interference between normal tracks by limiting the number of tracks active at any one time (described below). There is no action taken to encourage the operating system to schedule the normal tracks with respect to one another, but modifying the scheduling priority of the processes would be a simple way to further improve the efficiency of the system. The impact of such scheduling is open for further exploration.

Fast-track Throttling The fast track has thus far been described as speculatively running ahead of the normal tracks, constrained only by program termination or a terminal signal from one of the normal tracks. There are two reasons
why it is undesirable for the fast track to run arbitrarily far ahead. The first problem is the potential resource demand of the waiting normal tracks. The second
problem is that, should there be an error in the speculation detected in one of the
normal tracks, the processing done by the fast track is essentially wasted. The
FastTrack run-time system implements a throttling mechanism to keep the fast
track running far enough ahead to supply normal tracks and keep the processing
cores utilized, while minimizing potentially wasted resources.
The throttling strategy is to pause the fast track and give the processor to a normal track, as shown by the middle diagram in Figure 5.2. When the next normal track finishes, it re-activates the fast track. The word next is critical for two reasons. First, only one normal track should activate the fast track when it waits, effectively returning the processor after borrowing it. Second, the timing of the activation must be exact: if it is performed one track too early there will be too many processes; one track later and there would be under-utilization.
Consider a system with p processors running the fast track and p-1 normal tracks until the fast track becomes too fast and suspends execution, giving the processor to a waiting normal track. Suppose that three normal tracks finish in the order n1, n2, and n3, and the fast track suspends after n1 and before n2. The proper protocol is for n2 to activate the fast track so that before and after n2 we have p and only p processes running concurrently. Activation one track earlier or later would lead to more or fewer than p processes.
In order to ensure that suspension and activation of the fast track are timed correctly with respect to the completion of the normal tracks, FastTrack maintains some extra state. The value of waitlist_length indicates the number of normal-track processes waiting in the ready queue. A flag ft_waiting represents whether the fast track has been paused.
The fast track is considered to be too fast when waitlist_length exceeds p. In this case, the fast track activates the next waiting process in the ready queue, sets the ft_waiting flag, and then yields its processor. When a normal track finishes, it enters the critical section and determines which process to activate based on the flag: if ft_waiting is on, it clears ft_waiting and reactivates the fast track; otherwise, it activates the next normal track and updates the value of waitlist_length.
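A sketch of this bookkeeping is shown below, using the flag and counter names from the text; the locking and activation helpers (enter_critical, release_next_normal_track, and so on) are hypothetical stand-ins for the run-time's pipe-based mechanisms.

extern int p;                        /* number of processors */
static int waitlist_length = 0;      /* normal tracks waiting in the ready queue */
static int ft_waiting      = 0;      /* set while the fast track is paused */

extern void enter_critical(void);
extern void leave_critical(void);
extern void release_next_normal_track(void);
extern void reactivate_fast_track(void);
extern void wait_until_reactivated(void);

/* Fast track: called after it enqueues a newly spawned normal track. */
void FT_maybeThrottle(void)
{
    enter_critical();
    waitlist_length++;
    if (waitlist_length > p) {          /* the fast track has run too far ahead */
        release_next_normal_track();    /* hand the processor over ... */
        waitlist_length--;
        ft_waiting = 1;                 /* ... and remember that we paused */
        leave_critical();
        wait_until_reactivated();       /* block until a normal track finishes */
        return;
    }
    leave_critical();
}

/* Normal track: called when it finishes its dual-track region. */
void FT_onNormalTrackDone(void)
{
    enter_critical();
    if (ft_waiting) {
        ft_waiting = 0;
        reactivate_fast_track();        /* return the borrowed processor */
    } else if (waitlist_length > 0) {
        waitlist_length--;
        release_next_normal_track();    /* let the next waiting track run */
    }
    leave_critical();
}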
A problem arises when there are no normal tracks waiting to start, which can
happen when the fast track is too slow. If a normal track waits inside the critical
section to start its successor, then the fast track cannot enter to add a new track
to the queue. The bottom graph in Figure 5.2 shows this case, where one or more
normal track processes are waiting for fast track to fill the queue.

Resource Allocation Assuming we are executing on a system with N processors, and that the fast track is executing on one of the processors, the run-time system should allow at most N-1 normal processes to execute simultaneously. The exception is when the fast track has been throttled, allowing an Nth normal-track process. In addition to limiting the number of normal tracks, the FastTrack system should guarantee that the N-1 oldest (or least speculative) processes are allotted hardware resources. The FastTrack run-time system implements these constraints using a token-passing system such that only a process holding a token is released from the FT_BeginFastTrack run-time call.


Once a normal track process has been spawned and initialized, it waits to
receive a token by attempting to read from a pipe we refer to as the floodgate.
Although conceptually each normal track needs its own floodgate, we know that
the maximum number of normal tracks is bounded and a pool of floodgates can be
used (implemented as an array with circular access). The system inserts the set of tokens into the floodgates at initialization, and whenever resetting the floodgates due to mis-speculation. The whole set of floodgates is available to all processes.
In order to open the floodgates in the correct order, a normal track must identify to whom it should pass the token it currently holds. The track makes this
determination by reading the identifier from the ready queue. The fast track is responsible for enqueuing the normal tracks when they are created. Because there is
a single producer into the ready queue (the fast track) and it enqueues the normal
tracks in their sequential order, the normal tracks are guaranteed to be released in
the correct sequential order. This is true regardless of the order in which normal
tracks complete. It is worth noting that the normal tracks commit in sequential
order in any case, however the floodgate is opened before the synchronization is
performed.
Allowing the normal tracks to finish out of order allows the run-time system to
better absorb differences in the computational cost of various dual track regions.
The steady state of the run-time systems resource allocation control is shown by
the top diagram in Figure 5.2. The execution rate of fast track is the same as
the combined rate of N 1 normal tracks. When their speeds do not match, the
ready queue may become empty or may continue growing until the fast track is
throttled.
With activity control and fast-track throttling, the FastTrack run-time system
utilizes the available processors as efficiently as possible. Processing resources are
completely utilized unless there is a lack of parallelism and the fast track runs


too slowly. When there is enough parallelism, the fast track is constrained to
minimize potentially useless speculative computation.

Memory Usage The FastTrack run-time system relies on the operating system's implementation of copy-on-write, which lets processes share memory pages to which they do not write. In the worst case where every dual-track instance modifies every data page, the system needs d times the memory needed by the sequential run, where d is the fast-track depth. The memory overhead can be controlled by abandoning a fast-track instance if it modifies more pages than an empirical constant threshold h. This bounds the memory increase to be no more than d · h · M, where M is the virtual memory page size. The threshold h can be adjusted based on the available memory in the system. Memory usage is difficult to estimate since it depends on the demands of the operating system and other running processes. Earlier work has shown that on-line monitoring can effectively adapt memory usage by monitoring the page-fault indicators from Linux [21, 65]. Experimental test cases have never indicated that memory expansion will be a problem, so I do not consider memory resources further.
Running two instances of the same program would double the demand for off-chip memory bandwidth, which is a limiting factor for modern processors, especially chip multiprocessors. In the worst case, if a program is completely memory-bandwidth bound, no fast track can reduce the overall memory demand or improve program performance. However, experience with small and large applications on recent multicore machines, which are detailed later, is nothing but encouraging. In FastTrack, the processes originate from the same address space and share read-only data. Their similar access patterns help to prefetch useful data and keep it in cache. For the two large test applications used, multiple processes in FastTrack ran at almost the same speed as a single process. In contrast, running multiple separate instances of a program always degrades the per-process speed.

5.4 Compiler Support

The FastTrack system guarantees that it produces the same result as the sequential execution. By using Unix processes, FastTrack eliminates any interference
between parallel executions through the replication of the address space. During
execution, it records which data are changed by each of the normal and fast instances. When both instances finish, it checks whether the changes they made are
identical. Program data can be divided into three parts: global, stack, and heap
data. The stack data protection is guaranteed by the compiler, which identifies
the set of local variables that may be modified through inter-procedural MOD
analysis [30] and then inserts checking code accordingly. Imprecision in compiler
analysis may lead to extra variables being checked, but the conservative analysis
does not affect correctness. The global and heap data are protected by the operating system's paging support. At the beginning of a dual-track instance, the system turns off write permission to global and heap data for both tracks. It then installs custom page-fault handlers that record which page has been modified in an access map and re-enable write permission.
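The region-entry side of this protection might look like the following sketch. The address-range bookkeeping (protected_start, protected_len) and the function name are hypothetical; mprotect() and the fault handler correspond to the listings in Section 5.3.

#include <stddef.h>
#include <sys/mman.h>

extern char  *protected_start;   /* first page of the tracked global/heap data */
extern size_t protected_len;     /* page-aligned length of the tracked range */

/* Withdraw write permission at the start of a dual-track instance so that the
 * first write to each page faults into FT_SegvHandler (Algorithm 5.3.2), which
 * records the page in the access map and restores read/write access. */
int FT_protectData(void)
{
    return mprotect(protected_start, protected_len, PROT_READ);
}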

5.5 Uses of Fast Track

5.5.1 Unsafe Program Optimization

In general, the fast code can be any optimization inserted by either a compiler
or a programmer; for example memoization, unsafe compiler optimizations or
manual program tuning. The performance of the system is guaranteed against
slow or incorrect fast track implementations. The programmer can also specify two
alternative implementations and let the system dynamically select the faster one.
Below I discuss four types of optimizations that are good fits for fast track because

they may lead to great performance gains but their correctness and profitability are difficult to ensure.

Memoization For any procedure the past inputs and outputs may be recorded. Instead of re-executing the procedure in the future, the old result can be reused when given the same input. Studies dating back to at least 1968 [39] show dramatic performance benefits when using memoization, for example to speed up table look-up in transcoding programs. Memoization must be conservative about side effects and can provide only limited coverage for generic use in C/C++ programs [15]. With FastTrack, memoization does not have to be correct in all cases and therefore can be more aggressively used to optimize the common case.
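As an illustration (not code from the thesis), a one-entry memo table could serve as the fast track for a routine that looks deterministic; expensive() and its cache are hypothetical, and the FastTrack()/EndDualTrack() calls follow the interface of Section 5.2.

extern int  FastTrack(void);          /* interface from Section 5.2 */
extern void EndDualTrack(void);
extern int  expensive(int input);     /* the original, always-correct routine */

static int cached_in, cached_out, cache_valid;

int compute(int input)
{
    int result;
    if (FastTrack()) {
        /* Fast track: reuse the memoized answer when the input matches;
         * unsafe if expensive() has hidden side effects or state. */
        if (cache_valid && cached_in == input)
            result = cached_out;
        else
            result = expensive(input);
    } else {
        /* Normal track: always recompute, thereby checking the fast track. */
        result = expensive(input);
    }
    cached_in  = input;
    cached_out = result;
    cache_valid = 1;
    EndDualTrack();
    return result;
}

If the memoized answer ever diverges from the recomputed one, verification fails and the normal track's result prevails.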
Semantic optimization Often, different implementation options may exist at
multiple levels, from the basic data structures used such as a hash table, to the
choice of algorithms and their parameters. A given implementation is often more
general than necessary for a program, allowing for specialization. Current programming languages do not provide a general interface for a user to experiment
with an unsafely simplified algorithm or to dynamically select the best choice
among alternative solutions.
Manual program tuning A programmer can often identify performance problems in large software and make changes to improve the performance on test inputs. However, the most radical solutions are often the most difficult to verify in
terms of correctness, or to ensure good performance on other inputs. As a result,
many creative solutions go unused because an automatic compiler cannot possibly
achieve them.
Monitoring and safety checking It is often useful to instrument a program
to collect run-time statistics such as frequently executed instructions or accessed data, or to report memory leaks or out-of-bound memory accesses. In such cases,


the original uninstrumented code can serve as the fast track, and the instrumented
code can run in parallel to reduce the monitoring overhead.

5.5.2 Parallel Memory-Safety Checking

To test fast track on real-world applications, it has been applied to the parallelization of a memory-safety checking tool called Mudflap [16]. Mudflap is bundled
with the widely used GNU compiler collection (gcc), adding checks for array
range (overflow or underflow) and validity of pointer dereferences to any program
gcc compiles. Common library routines that perform string manipulation or direct memory access are also guarded. Checks are inserted at compile time and
require that a run-time library be linked into the program.
The Mudflap compilation has two passes: memory recording, which tracks all memory allocation by inserting __mf_register and __mf_unregister calls, and access checking, which monitors all memory access by inserting __mf_check calls and inlined operations. The recording cost is proportional to the frequency of data allocation and recollection, and the checking cost is proportional to the frequency of data access.
To fast track the Mudflap checking system we introduced a new compiler pass
that clones all functions in the program. The second Mudflap pass is instructed
to ignore the clones while instrumenting the program. The result is an executable with the original code fully checked while the clones record only data allocation and freeing. The instrumentation of the clones is necessary to maintain the same memory-allocation metadata as in the original code. We create a Fast Track version of a program by using the fully checked version of the program to verify the memory safety of the unchecked fast track.

5.6 Evaluation

5.6.1 Analysis

Throughout the remainder of this section I use the following notation to represent the basic parameters of the system:

- Dual-track computations are identified $r_i$.
- Interleaving computation regions are $u_j$.
- The program execution $E$ is a sequence $u_0 r_1 u_1 r_2 \ldots r_n u_n$.
- The running time of a region is denoted by $T(\cdot)$.
- The number of available processors is $p > 1$.
- Each $r_i$ has a success rate $\alpha$ ($0 \le \alpha \le 1$).
- A fast instance takes a fraction $x$ ($0 \le x \le 1$) of the time the normal instance takes.
- The dual-track execution has a time overhead $q_c$ ($q_c \ge 0$) per instance and is slowed down by a factor of $q_e$ ($q_e \ge 0$) because of the monitoring for modified pages.

Analytical Model

The original execution time is $T(E) = T(u_0) + \sum_{i=1}^{n} T(r_i u_i)$. Reordering the terms gives $T(E) = \sum_{i=1}^{n} T(r_i) + \sum_{i=0}^{n} T(u_i)$. Name the two components $E_r = r_1 r_2 \ldots r_n$ and $E_u = u_0 u_1 \ldots u_n$. The time $T(E_u)$ is not changed by fast-track execution because any $u_i$ takes the same amount of time regardless of whether it is executed with a normal or a fast instance.

Focusing on $T(E_r)$, and in particular the average time taken per $r_i$, $t_r = \frac{T(E_r)}{n}$, consider how this time changes as a result of FastTrack. Since we would like to derive a closed formula to examine the effect of the basic parameters, consider a regular case where the program is a loop with $n$ equal-length iterations. A part of the loop body is a FastTrack region. Let $T(r_i) = t_c$ be the (constant) original time for each instance of the region. The analysis can be extended to the general case where the length of each $r_i$ is arbitrary and $t_c$ is the average; the exact result would depend on assumptions about the distribution of $T(r_i)$. In the following, we assume $T(r_i) = t_c$ for all $i$.
With FastTrack, an instance may be executed by a normal track in time $t_s = (1+q_e)t_c + q_c$ or by a fast track in time $t_f$, where $q_c$ and $q_e$ are overheads. In the best case, all fast instances are correct ($\alpha = 1$) and the machine has unlimited resources ($p = \infty$). Each time the fast track finishes an instance, a normal track is started. Thus, the active normal tracks form a pipeline if considering only dual-track instances (the component $T(E_r)$ in $T(E)$). The first fast instance is verified after $t_s$. The remaining $n-1$ instances finish at a rate of $t_f = (1+q_e)x\,t_c + q_c$, where $x$ is the speedup by fast track and $q_c$ and $q_e$ are overheads.
Using the superscript to indicate the number of processors, the average time and the overall speedup are
$$ t_f^{\infty} = \frac{t_s + (n-1)\,t_f}{n} $$
$$ speedup^{\infty} = \frac{\textrm{original time}}{\textrm{fast-track time}} = \frac{n\,t_c + T(E_u)}{n\,t_f^{\infty} + T(E_u)} $$
In the steady state $\frac{t_c}{t_f}$ dual-track instances are run in parallel. For simplicity the equation does not show the fixed lower bound of fast-track performance. Since a fast instance is aborted if it turns out to be slower than the normal instance, the worst case is $t_f^{\infty} = t_s = (1+q_e)t_c + q_c$, and consequently $speedup = \frac{n\,t_c + T(E_u)}{n((1+q_e)t_c + q_c) + T(E_u)}$. While this is slower than the original speed ($speedup \le 1$), the worst-case time is bounded only by the overhead of the system and not by the quality of the fast-track implementation (the factor $x$).
As a normal instance for $r_i$ finishes it may find the fast instance incorrect, canceling the on-going parallel execution and restarting the system from $r_{i+1}$. This is equivalent to a pipeline flush. Each failure adds a cost of $t_s - t_f$, so the average time with a success rate $\alpha$ is $(1-\alpha)(t_s - t_f) + t_f^{\infty}$.

For the sake of illustration, assume no fast-track throttling when considering the limited number of processors. With $p$ processors, the system can have a fast-track execution depth $d$ of at most $d = \min(p-1, \frac{t_s}{t_f})$ dual-track instances running concurrently. Because $d$ is an average it may take a non-integral value. When $\alpha = 1$, $p-1$ dual-track instances take $t_s + (p-2)\,t_f$ time ($p \ge 2$). Therefore the average time (assuming $p-1 \mid n$) is
$$ t_f^p = \frac{t_s + (d-1)\,t_f}{d} $$
When $\alpha < 1$, the cost of restarting has the same effect as in the infinite-processor case. The average time and the overall speedup are
$$ t_f^p = (1-\alpha)(t_s - t_f) + \frac{t_s + (d-1)\,t_f}{d} $$
$$ speedup^p = \frac{n\,t_c + T(E_u)}{n\,t_f^p + T(E_u)} $$

Finally, consider FastTrack throttling. As $p-1$ dual-track instances execute and the last fast instance finishes, the system starts the next normal instance instead of waiting for the first normal instance to finish (and starting the next normal and fast instances together). Effectively it finishes $d + (t_s - d\,t_f)$ instances, hence the change to the denominator. Augmenting the previous formula we have
$$ t_f^p = (1-\alpha)(t_s - t_f) + \frac{t_s + (d-1)\,t_f}{d + t_s - d\,t_f} $$
After simplification, FastTrack throttling may seem to increase the per-instance time rather than decrease it, but it does decrease the time because $d \le \frac{t_s}{t_f}$. The overall speedup (bounded from below, with $n \ge 2$) is as follows, where all the basic factors are modeled:
$$ speedup^p = \max\!\left(\frac{n\,t_c + T(E_u)}{n\,t_s + q_c + T(E_u)},\ \frac{n\,t_c + T(E_u)}{n\,t_f^p + T(E_u)}\right) $$

Simulation Results
By translating the above formula into actual speedup numbers, the effect of the major parameters can be examined. Of interest are the speed of the fast track, the success rate, the overhead, and the portion of the program executed in dual-track regions. The four graphs in Figure 5.3 show their effect for different numbers of processors, ranging from 2 to 10 in steps of 1. The fast-track system has no effect on a single-processor system.
All four graphs include the following setup: the fast instance takes 10% of the time of the normal instance ($x = 0.1$), the success rate ($\alpha$) is 100%, the overhead ($q_c$ and $q_e$) adds 10% execution time, and the program spends 90% of the time in dual-track regions. The performance of this case is shown by the second highest curve in all but graph 5.3(a), in which it is shown by the highest curve. FastTrack improves the performance from a factor of 1.60 with 2 processors to a factor of 3.47 with 10 processors. The maximal possible speedup for this case is 3.47. When the speed of the fast instance is varied from 0% to 100% of the time of the normal instance, the speedup changes from 1.80 to 1.00 with 2 processors and from 4.78 to 1.09 with 10 processors, as shown by graph 5.3(a).
When the success rate is reduced from 100% to 0%, the speedup changes from 1.60 to 0.92 (8% slower because of the overhead) with 2 processors and from 3.47 to 0.92 with 10 processors, as shown by the graph in 5.3(a). Naturally the performance hits the worst case when the success rate is 0%.

When the overhead is reduced from 100% to 0% of the running time, the speedup increases from 1.27 to 1.67 with 2 processors and from 2.26 to 3.69 with 10 processors, as shown by graph 5.3(c). Note that with 100% overhead the fast instance still finishes in 20% of the time of the normal instance, although the checking needs to wait twice as long.
Finally, when the coverage of the fast-track execution increases from 10% to 100%, the speedup increases from 1.00 to 1.81 with 2 processors and from 1.08 to 4.78 with 10 processors, as shown by graph 5.3(d). If the analytical results are correct, it is not overly difficult to obtain a 30% improvement with 2 processors, although the maximal gain is limited by the time spent outside dual-track regions, the speed of the fast instance, and the overhead of fast track.
The poor scalability is not a surprise given that the program is inherently sequential to begin with. Two final observations from the simulation results are important. First, FastTrack throttling is clearly beneficial: without it there can be no improvement with 2 processors. It often improves the theoretical maximum speedup, although the increase is slight when the number of processors is large. Second, the model simplifies the effect of the FastTrack system in terms of four parameters, which we have not validated with experiments on a real system. On the other hand, if the four parameters are the main factors, they can be efficiently monitored at run time, and the analytical model may be used as part of the on-line control to adjust the depth of fast-track execution to the available resources.

5.6.2 Experimental Results

Implementation and Experimental Setup


Compiler support for FastTrack is implemented in gcc 4.0.1's intermediate language, GIMPLE (based on static single assignment [13]). The transformation is applied after high-level program optimization passes but before machine code generation. The main transformation is converting global variables to use dynamic
allocation, so the run-time support can track them and set appropriate access
protection. The modified compiler allocates a pointer for each global (and file
and function static) variable, inserts an initialization function in each file that
allocates heap memory for variables (and assigns initial values) defined in the file,
and redirects all accesses through the global pointer. The indirection causes only
marginal slowdown because most global-variable accesses have been removed or
converted to (virtual) register access by earlier optimization passes.
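The effect on a single global might look like the following before/after sketch; the variable and the generated names are hypothetical.

/* Before the pass:
 *
 *     int counter = 42;
 *     void tick(void) { counter++; }
 *
 * After the pass, the global lives on the heap behind a pointer so that the
 * run-time can page-protect and track it: */
#include <stdlib.h>

int *counter_ptr;                      /* pointer replaces the global */

void __ft_init_this_file(void)         /* per-file initializer inserted by the compiler */
{
    counter_ptr = malloc(sizeof(int));
    *counter_ptr = 42;                 /* original initial value */
}

void tick(void)
{
    (*counter_ptr)++;                  /* all accesses redirected through the pointer */
}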
Compiler analysis for data protection of local variables has not been implemented. Stack data are not checked, but global and heap variables are protected.
The run-time system is implemented as a statically linked library using shared
memory only for storing access maps. The design guarantees forward progress,
which means no deadlocks or starvation provided that the operating system does
not permanently stall any process.

Parallel Memory Safety Checking


A FastTrack version of Mudflap has been generated for the C-language benchmarks 401.bzip2, 456.hmmer, 429.mcf, and 458.sjeng from the spec 2006 suite [55].
These benchmarks represent computations in pattern matching, mathematical optimization, chess playing, and data compression. The number of program lines
ranges from a few thousand to over ten thousand. All four programs show significant improvement, up to a factor of 2.7 for 401.bzip2, 7.1 for 456.hmmer, and
2.2 for 429.mcf and 458.sjeng. The factors affecting the parallel performance are
the coverage of FastTrack and the relative speed of the fast track as discussed in
our analytical model. One factor not tested here is the overhead of correctness
checking and error recovery. The running times with and without Mudflap overhead, as given in the captions in Figure 5.4, show that memory-safety checking
delays the execution by factors of 5.4, 15.0, 8.6, and 67.4. By utilizing multiple


processors, FastTrack reduces the delay to factors of 2.0, 2.1, 3.7, and 28.8, which
are more tolerable for long-running programs.
The code change in 429.mcf includes replicating the call of price_out_impl in function global_opt in file mcf.c. Similar to the code in the FastTrack example in the introduction, the original call is placed in the normal track and the call to the clone, clone_price_out_impl, in the fast track. For 458.sjeng, the call of search in function search_root in file search.c is similarly changed to use clone_search in the fast track and search in the normal track. In both cases, merely four lines of code need to be modified.
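The shape of that change is sketched below; the argument lists and types of the mcf routines are not reproduced from the benchmark, so the void-pointer signatures and the wrapper function are placeholders.

extern int  FastTrack(void);
extern void EndDualTrack(void);
extern void price_out_impl(void *net);        /* original, fully checked routine (placeholder signature) */
extern void clone_price_out_impl(void *net);  /* uninstrumented clone (placeholder signature) */

void global_opt_fragment(void *net)
{
    if (FastTrack()) {
        clone_price_out_impl(net);   /* unchecked clone races ahead in the fast track */
    } else {
        price_out_impl(net);         /* checked original verifies it in the normal track */
    }
    EndDualTrack();
}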
Memory safety checking by Mudflap more than triples the running time of mcf. FastTrack improves the speed of checking by over 30%. The running time of the fast track is within half a second of a dual-track implementation, which shows that FastTrack runs with little overhead. The cost of safety checking for 458.sjeng is a factor of 200 slowdown: it takes 24 minutes to check the original execution of 7.3 seconds. FastTrack is able to reduce the checking time to 13 minutes, a factor of two reduction. A dual-track-style execution without verification runs faster, finishing in under 9 minutes without the overhead of checking every memory access.
Results of Sort and Search Tests
The following two tests are intended to measure the performance of FastTrack used to support unsafe optimizations, as executed with two Intel dual-core Xeon 3GHz processors. Compilation is done using the modified FastTrack version of gcc with the optimizations specified by the -O3 flag. The first test is a
simple sorting program that repeatedly sorts an array of 10,000 elements. In a
specified percentage of iterations the array contents are randomized. The array
sort is performed with either a short-circuited bubble sort, a quick sort, or by
running both in a FastTrack environment. The results of these tests are shown in Figure 5.5. The quick sort performs consistently and largely independently of the input array. One can see that the bubble sort quickly detects when the array is sorted, but performs poorly in cases in which the contents have been randomized. The FastTrack approach is able to out-perform either of the individual sorting algorithms. These results illustrate the utility of FastTrack in cases where both solutions are correct but it is not possible to know in advance which is actually faster. In cases where the array is always sorted or always unsorted, the overhead of using FastTrack will cause it to lose out. Although FastTrack is not a better solution compared to an explicitly parallel sorting approach, this example motivates the utility of automatically selecting the faster of multiple sequential approaches.
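The sort experiment uses the two algorithms as the two tracks of a region; a sketch of one iteration is shown below. bubble_sort and quick_sort are hypothetical helpers, and the entry point for symmetric dual tracks may differ from the FastTrack() form shown here (cf. FT_BeginDualTrack in Section 5.3.6).

extern int  FastTrack(void);
extern void EndDualTrack(void);
extern void bubble_sort(int *a, int n);   /* short-circuits when already sorted */
extern void quick_sort(int *a, int n);

void sort_iteration(int *a, int n)
{
    if (FastTrack()) {
        bubble_sort(a, n);   /* wins when the array is already sorted */
    } else {
        quick_sort(a, n);    /* wins when the contents were just randomized */
    }
    EndDualTrack();
}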
Algorithm 5.6.1 Pseudo code of the synthetic search program

for i = 1 to n do
    Vi ← random
end for
for 1 to T do
    if normal track then
        for i = 1 to n do
            Vi ← f(Vi)
        end for
        m ← max(v : v ∈ V)
    else {fast track}
        R ← S random samples from V
        for j = 1 to S do
            Rj ← f(Rj)
        end for
        m ← max(r : r ∈ R)
    end if
    randomly modify N1 elements
end for
print m
The second program is a simple search to test the effect of various parameters; the basic algorithm is given in Algorithm 5.6.1. The program repeatedly updates some elements of a vector and finds the largest result from certain computations. By changing the size of the vectors, the size of samples, and the frequency of updates, we can effect different success rates for the normal and the fast instances. Figure 5.6(a) shows the speedups over the base sequential execution, which takes 3.33 seconds on a 4-CPU machine. The variation between the times of three trials is always smaller than 1 millisecond.
The sampling-based fast instance runs in 2.3% of the time of the normal instance.
When all fast instances succeed, they improve the performance by a factor of 1.73
on 2 processors, 2.78 on 3 processors, and 3.87 on four processors. When the
frequency of updates is reduced the success rate drops. At 70%, the improvement
is a factor of 2.09 on 3 processors and changes only slightly when the fourth
processor is added. This drop is because the chance of four consecutive fast
instances succeeding is only 4%. When the success rate is further reduced to
30%, the chance for three consecutive successful fast tracks drops to 2.7%. The
speedup from 2 processors is 1.29 and no improvement is observed for more than 2
processors. In the worst case when all fast instances fail, we see that the overhead
of forking and monitoring the normal track adds 6% to the running time.
The results in Figure 5.6(b) show interesting trade-offs when the fast track
is tuned by changing the size of samples. On one hand, a larger sample size
means more work and slower speed for the fast track. On the other hand, a
larger sample size leads to a higher success rate, which allows more consecutive fast tracks to succeed and consequently more processors to be utilized. The success rate
is 70% when the sample size is 100, which is the same configuration as the row
marked 70% in Figure 5.6(a). The best speedup for 2 processors is found when
the sample size is 200 but adding more processors does not help as much (2.97
speedup) as when the sample size is 300, where 4 processors lead to a speedup
of 3.78. The second experiment shows the significant effect of tuning when using
unsafely optimized code. Experience has shown that the automatic support and
analytical model have made tuning much less labor intensive.

Figure 5.2: The three states of fast track: balanced steady state, fast-track throttling when it is too fast, and slow-track waiting when fast track is too slow. The system returns to the balanced steady state after fast-track throttling.

128

[Figure 5.3 plots omitted. Four panels show speedups versus the number of processors (2 to 10), titled "success rate: 100%, 90%, ..., 0%", "fast-track speed: 0%, 10%, ..., 100%", "fast-track overhead: 0%, 10%, ..., 100%", and "exe. in fast track: 100%, 90%, ..., 0%".]

Figure 5.3: Analytical results of the fast-track system where the speed of the fast track, the success rate, the overhead, and the portion of the program executed in dual-track regions vary. The order of the parameters in the title of each graph corresponds to the top-down order of the curves in the graph.

[Figure 5.4 plots omitted. Each panel shows speedup versus the number of processors for one benchmark (bzip2, hmmer, mcf, and sjeng).]

(a) The checking time of 401.bzip2 is reduced from 24.5 seconds to 9.0 seconds. The base running time, without memory safety checking, is 4.5 seconds.

(b) The checking time of 456.hmmer is reduced from 235 seconds to 33.2 seconds. The base running time is 15.6 seconds.

(c) Fast track reduces the checking time of 429.mcf from 56 seconds to 24.8 seconds. The base running time is 6.7 seconds.

(d) Fast track reduces the checking time of 458.sjeng from 33.7 seconds to 14.4 seconds. The base running time is 0.5 seconds.

Figure 5.4: The effect of FastTrack Mudflap on four SPEC 2006 benchmarks.

[Figure 5.5 plot omitted. Speedup over quick-sort (log scale, 0.001 to 1000) versus the percentage of iterations that modify (0, 5, 10, 25, 50, 75, 100), with one curve each for quick sort, fast-track, and bubble sort.]

Figure 5.5: Sorting time with quick sort, bubble sort, or the FastTrack of both

(a) Effect of fast-track success rates on the synthetic benchmark

    success          number of processors
    rate            1       2       3       4
    100%            1     1.73    2.78    3.87
    70%             1     1.47    2.09    2.15
    30%             1     1.29    1.29    1.29
    0%              1     0.94    0.94    0.94

(b) The speedup due to fast-track tuning of the synthetic benchmark

    sample           number of processors
    size            1       2       3       4
    100             1     1.48    2.09    2.15
    200             1     1.71    2.64    2.97
    300             1     1.70    2.71    3.78
    400             1     1.68    2.69    3.74

Figure 5.6: FastTrack on synthetic benchmarks


Conclusion

6.1 Contributions

I have presented two systems for implementing speculative parallelism in existing programs. For each, I have implemented a complete working system including compiler and run-time support. The first system, bop, provides a programmer with tools to introduce traditional types of parallelism in cases where program dependencies cannot be statically evaluated or guaranteed. I have shown the use of bop to effectively extract parallelism from utility programs.

I have also presented FastTrack, a system that supports unsafely optimized code and can also be used to off-load safety checking and other program analysis. The key features of the systems include a programmable interface, compiler support, and a concurrent run-time system that performs correctness checking, output buffering, activity control, and fast-track throttling. I have used the system to parallelize memory safety checking for sequential code, reducing the overhead by up to a factor of seven for four large applications running on a multicore personal computer. We have developed an analytical model that shows the effect of the major parameters, including the speed of the fast track, the success rate, the overhead, and the portion of the program executed in fast-track regions. We have used our system and model in speculatively optimizing a sorting program and a search program. Both analytical and empirical results suggest that fast track is effective at exploiting today's multi-processors for improving program speed and safety.

6.2 Future Directions

6.2.1 Automation

Automating the insertion of bop region markers requires identifying pprs automatically, which is similar to identifying parallelism, a major open problem. Because pprs are only hints at parallelism, it is not necessary for them to be correct. In addition to inserting the ppr markers automatically, the system could be simplified by allowing the EndPPR marker to be optional. The difficulty in doing so comes in handling the final instance of the ppr. Without an end marker, the speculative task will continue until it reaches a program exit point. The non-speculative task will execute the ppr and subsequently repeat the same execution as the speculative task. Such duplicated work is certainly wasteful, but it may be acceptable if there is no other useful work that could be offloaded to the additional processing unit.

In order to automate the use of the FastTrack system, markers can be inserted at various points throughout the code using compiler instrumentation. We can choose dynamically whether to initiate a new dual-track region based on the past success rate and the execution time since the start of the last region. A region can begin at an arbitrary point in execution, as long as the other track makes the same decision at that point. We can identify the point with a simple shared counter that each track increments every time it passes a marker. The fast track makes its increments atomically, and when it creates a new normal track it begins a new counter (leaving the old one for the previous normal track). As the normal tracks pass markers, they compare their counters to the fast track's to identify the marker at which verification needs to be performed. If the two processes did not follow the same execution path, then the state verification will fail.

A significant problem is ensuring that the fast path includes all of the markers the normal track has. This is directly related to where the markers are placed and how the two tracks are generated. In a case like the fast Mudflap implementation described in Chapter 5, the markers will be consistent as long as they are not placed in the Mudflap routines. In any case where code is similarly inserted to create the normal track, it will suffice to simply not insert markers with that code. In a case where the fast track is created by removing optimizations from existing code, we must ensure that markers are not removed, and that any function calls are not directly removed, because they might contain further markers.
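The shared-counter idea can be illustrated with a small, self-contained sketch. It is not the dissertation's implementation (the FT_AutoMarkPoint mechanism in Appendix A plays this role there); here the parent simply stands in for the fast track, the child for the normal track, and a counter in an anonymous shared mapping communicates the marker at which the two should be compared.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void) {
    /* Join point shared by both processes after fork(). */
    volatile unsigned *joinPoint = mmap(NULL, sizeof(*joinPoint),
                                        PROT_READ | PROT_WRITE,
                                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (joinPoint == MAP_FAILED) { perror("mmap"); return 1; }
    *joinPoint = 0;

    unsigned counter = 0;            /* each track's private marker count */
    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }

    if (pid != 0) {
        /* "Fast track": pass a few markers, then publish the join point. */
        for (int i = 0; i < 5; i++) counter++;
        *joinPoint = counter;
        waitpid(pid, NULL, 0);       /* the real system would verify state here */
    } else {
        /* "Normal track": count markers up to the published join point. */
        while (*joinPoint == 0)      /* wait for the fast track to publish */
            ;
        while (counter < *joinPoint) counter++;
        printf("normal track reached marker %u\n", counter);
        /* The real system would compare memory state at this marker. */
    }
    return 0;
}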

6.2.2 Composability

One of the major problems in parallel programming, particularly in explicit parallel programming with locks, is the composability of various operations. The intuition behind composability is that the combination of multiple components should not break correctness.

Lack of composability is a significant weakness of lock-based components and one of the strengths of transactional memory systems. Because the speculative parallelism run-time systems are intended to be a simple way to extend existing programs, the bop and FastTrack systems should seek to compose correctly. There are several general questions to ask about the composability of these systems: does each compose with itself, do the two compose with one another, and do they compose with existing parallel programming techniques?
Self-Composition The question of self-composition is whether the run-time system properly handles entering a speculative region when one is already active. Cases where disjoint regions of the program use speculation compose trivially. The bop run-time system does correctly compose with itself. The implementation is designed so that nested uses of pprs are not allowed, but they are detected and handled correctly. If a piece of code (for example a library) is built to use pprs, and that code is invoked from within another ppr, the inner regions will be ignored. Although this maintains semantic correctness, which is the primary concern, it may not be the most effective solution.

The FastTrack run-time system also maintains semantic correctness when it is composed with itself. When the FastTrack system encounters a nested fast-track region, the runtime will treat it like any other dual-track region. If the fast track is the first to reach the nested region, it will spawn a new normal track. Eventually the normal track will encounter the end of the original dual-track region, and speculation will fail. Although semantic correctness is maintained, performance will suffer because speculation over the entire outer region will always fail. This failure could potentially be avoided if fast-track regions were given identifiers. The run-time system would also need a mechanism to match the identifier the normal track encounters at the end of its region against the one the fast track is using. Additionally, the fast track would need to abandon the inner normal tracks and to reacquire the changes it made starting at the beginning of the outer region (which are otherwise simply left for the inner normal track to verify).

If the normal track reaches a nested region, then it will assume that the fast track has mis-speculated, or is otherwise delayed, and that it has simply completed executing the region first. As in any case where the normal track wins the race, it will terminate the fast track. The normal track will then assume the role of the fast track and spawn a new normal track to handle the inner region. From a performance standpoint this is not likely to be the most effective solution because only the smaller inner region will be fast tracked. Nevertheless, it is a better outcome than the case above, and it does maintain semantic correctness. In the case that both tracks encounter a nested dual-track region, the result is very much like the above case in which only the normal track encounters the inner region.
Algorithm 6.2.1 Example of FastTrack self-composition

void outer(void) {
    if (FT_BeginFastTrack()) {
        inner_fast();
    } else {
        inner_normal();
    }
}

void inner_fast(void) {
    if (FT_BeginFastTrack()) {
        ...
    } else {
        ...
    }
}
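Returning to the identifier idea above, the following toy, single-process fragment illustrates how matching region identifiers could expose the nesting problem. The names and the stack-based matching are hypothetical and not part of the current FastTrack interface, which would have to perform the check across the fast-track and normal-track processes.

#include <assert.h>
#include <stdio.h>

static int regionStack[16];
static int regionTop = -1;

static void beginRegion(int id) { regionStack[++regionTop] = id; }

/* Returns 1 if the id matches the innermost open region, 0 otherwise.
 * A mismatch corresponds to one track ending an outer region while the
 * other is still inside an inner one, i.e., failed speculation. */
static int endRegion(int id) {
    if (regionTop < 0 || regionStack[regionTop] != id) return 0;
    regionTop--;
    return 1;
}

int main(void) {
    beginRegion(1);                 /* outer dual-track region      */
    beginRegion(2);                 /* nested region                */
    assert(endRegion(1) == 0);      /* out-of-order end is detected */
    assert(endRegion(2) == 1);
    assert(endRegion(1) == 1);
    puts("region identifiers matched");
    return 0;
}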

6.2.3 Further Evaluation

One of the major issues in contemporary computing is power consumption. This is true for systems ranging from mobile devices drawing from a battery to high-performance computing systems and data centers built on massive numbers of processors. Because so much of the power drawn by a computer ends up as waste heat, even more power must be drawn to cool the system. As such, the power utilization of speculative parallelism must be considered. Although there will always be a demand for the ability to complete tasks more quickly, the power costs of doing so have to be put in balance.
One way to frame the energy consumption question is to consider the comparable energy necessary to gain the same speed increase using a uni-processor. Conversely, if the speculative parallelism system can allow a program to be completed as quickly on two or more processors running at a slower clock rate, an energy savings may be found.

As processor vendors produce systems with higher numbers of cores, they face the reality that many of these cores are often not utilized. The two largest chip makers have both introduced schemes that allow some of the cores on a multi-core chip to run at a higher clock rate while others are idle, which they refer to as turbo boost (Intel) and dynamic speed boost (AMD). Evaluating the energy consumption of speculatively parallelized programs on such systems would provide empirical data to address such a hypothesis.
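A rough way to see why the second framing can favor parallel execution is the textbook approximation for dynamic CMOS power (a standard relation, not an analysis performed in this dissertation): power grows with frequency and with the square of the supply voltage, so two cores at a reduced frequency and voltage can match the throughput of one faster core while drawing less total power:

\[ P_{\mathrm{dyn}} \approx C\,V^{2}f, \qquad E = P \cdot t. \]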


Code Listings

Included here are source code fragments not found earlier in this dissertation. Where relevant, a reference to the earlier source is included. System header files and standard pre-processor include guards have been omitted for brevity.

A.1 BOP Code
Listing A.1: Private Header

static int specDepth;        // Between 1 and MaxSpecDepth.
static int specPid;          // Process ID of SPEC.
static int undyWorkCount;    // Work done by UNDY.
static int pprID;            // User assigned region ID.
static int mySpecOrder = 0;  // Serial number.

// For data update.
static int updatePipe[2];

// For termination of the understudy.
static int undyCreatedPipe[2];
static int undyConcedesPipe[2];

// loHiPipes control information flow. The main process
// takes the 0th slot to send MainDone and MainCopyDone.
static int loHiPipes[MAX_SPEC_DEPTH * 2 + 1][2];

// Flag set if the next speculation fails.
static volatile char earlyTermination = false;

// Signal sets masking SIGUSR1, SIGUSR2, and both.
static sigset_t sigMaskUsr1, sigMaskUsr2, sigMaskUsr12;

static void BOP_AbortSpec(void);
static void BOP_AbortNextSpec(void);

Listing A.2: Access Map Handling


// Implementation depends on maps of single byte units.
char *useMap;
char *accMapPtr;

// Record access of the given type in my access map.
void BOP_recordAccess(void *page_address, AccessType access) {
    int mapId = mySpecOrder == 0 ? 0 : MYRESOURCE(mySpecOrder);
    char *map = NULL;

    switch (access) {
    case READ:
        map = accMapPtr + (mapId * 2 * BIT_MAP_SIZE);
        break;
    case WRITE:
        map = accMapPtr + (((mapId * 2) + 1) * BIT_MAP_SIZE);
        break;
    }

    SP_recordAccessToMap(page_address, map);
}

void BOP_setProtection(int prot) {
    unsigned long page;
    unsigned long last = 0, first = 0;

    // Look at each position in the map.
    for (page = 0; page < PAGECOUNT; page++) {
        if (SP_checkMap((void *)(page * PAGESIZE), useMap)) {
            if (page == last + 1) last++;
            else {
                if (last > 0) SP_protectPages(first, last, prot);
                last = first = page;
            }
        }
    }
    if (last > 0) SP_protectPages(first, last, prot);
}

// Returns zero if there are no conflicts in the maps.
static int BOP_compareMaps(void) {
    char *prevWrites;
    if (mySpecOrder == 1)
        prevWrites = WRITEMAP(0);
    else
        /* the union map */
        prevWrites = READMAP(mySpecOrder - 1);

    char *curWrites = WRITEMAP(mySpecOrder);
    char *curReads  = READMAP(mySpecOrder);

    for (unsigned i = 0; i < BIT_MAP_SIZE; i++) {
        if (prevWrites[i] & (curReads[i] | curWrites[i])) return 1;
        /* compute the union map in place */
        curReads[i] = prevWrites[i] | curWrites[i];
    }
    return 0;
}

Listing A.3: Signal Handlers


void BOP_RaceHandler(int signo, siginfo_t *info, ucontext_t *cntxt) {
    assert(SIGUSR1 == signo);
    assert(cntxt);

    // Committing SPEC seeing its own signal, SIGUSR1. No action.
    if (info->si_pid == getpid()) return;

    // SOS: the next process has a conflict. Start early termination
    // (set myself as the last process of the group).
    if (myStatus == SPEC) {
        BOP_AbortNextSpec();
        return;
    }

    if (myStatus != UNDY) return;

    // Sending a symbolic value.
    write(undyConcedesPipe[1], &mySpecOrder, sizeof(int));
    exit(0);
}
// SPEC or main gets a segmentation fault.
void BOP_SegvHandler(int num, siginfo_t *info, ucontext_t *cntxt) {
    void *faultAdd = info->si_addr;

    // This should only be the handler for SEGV signals, and we
    // only handle the case of permission violations.
    assert(num == SIG_MEMORY_FAULT);
    if (info->si_code != SEGV_ACCERR) {
        while (1) pause();
        assert(0);
    }

    // Check if the predecessor wrote to this page.
    // A more complete check is done after this and pred complete.
    unsigned mapID = MYRESOURCE(mySpecOrder - 1);
    char *mapPtr = accMapPtr + (mapID * 2 + 1) * BIT_MAP_SIZE;
    unsigned access = SP_checkMap(faultAdd, mapPtr);

    if (myStatus == SPEC && access) BOP_AbortSpec();

    if (WRITEOPT(cntxt)) {
        // A write access.
        BOP_AbortNextSpec();
        BOP_recordAccess(faultAdd, WRITE);
        if (mprotect(PAGESTART(faultAdd), 1, PROT_WRITE | PROT_READ))
            exit(errno);
    } else {
        // A read access.
        BOP_recordAccess(faultAdd, READ);
        if (mprotect(PAGESTART(faultAdd), 1, PROT_READ))
            exit(errno);
    }
}
void BOP_UndyTermHandler(int num, siginfo_t *info, ucontext_t *cntxt)
{
    assert(SIGUSR2 == num);
    assert(cntxt);
    if (info->si_pid == getpid()) return;

    /* Must be UNDY. */
    exit(0);
}
// See Listing 4.3.1 for BOP_PrePPR implementation.


static int BOP_pipeClose(void) {
    int i = 0, hasError = 0;

    for (i = 0; i <= MAX_SPEC_DEPTH * 2; i++) {
        hasError |= close(loHiPipes[i][0]);
        hasError |= close(loHiPipes[i][1]);
    }
    hasError |= close(undyCreatedPipe[0]);
    hasError |= close(undyCreatedPipe[1]);
    hasError |= close(undyConcedesPipe[0]);
    hasError |= close(undyConcedesPipe[1]);

    if (hasError) {
        perror("failed to close pipes");
        myStatus = SEQ;
        return 0;
    }
    else return 1;
}
// See Listing 4.3.3 for BOP_End implementation.
// See Listing 4.3.4 for PostPPR_commit implementation.
// See Listing 4.3.4 for PostPPR_main implementation.
// See Listing 4.3.4 for PostPPR_spec implementation.
// See Listing 4.3.4 for PostPPR_undy implementation.
static int BOP_Pipe_Init(void) {
    int i, hasError = 0;

    for (i = 0; i <= MAX_SPEC_DEPTH * 2; i++)
        hasError |= pipe(loHiPipes[i]);

    hasError |= pipe(undyCreatedPipe);
    hasError |= pipe(undyConcedesPipe);

    if (hasError) {
        perror("update pipe creation failed:");
        myStatus = SEQ;
        return 0;
    }
    else return 1;
}

static void BOP_timerAlarmExit(int signo) {
    (void) signo;
    kill(0, SIGKILL);
}

static void BOP_timerTermExit(int signo) {
    assert(SIGTERM == signo);
    signal(SIGTERM, SIG_IGN);
    kill(0, SIGTERM);
    exit(0);
}

// Allocates the shared data and installs the timer process.
void BOP_Init()
{
    static int init_done = 0;
    if (init_done) return;

    init_done = 1;
    char *curPnt = mmap(NULL, ALLOC_MAP_SIZE + ACC_MAP_SIZE,
                        PROT_READ | PROT_WRITE,
                        MAP_ANONYMOUS | MAP_SHARED, -1, 0);
    assert(curPnt);
    useMap = curPnt;
    accMapPtr = curPnt + ALLOC_MAP_SIZE;

    // Setup BOP process group.
    SP_gpid = getpid();
    setpgid(0, SP_gpid);

    // Prepare signal handlers.
    signal(SIGINT,  BOP_timerAlarmExit);
    signal(SIGQUIT, BOP_timerAlarmExit);
    signal(SIGUSR1, SIG_DFL);
    signal(SIGUSR2, SIG_DFL);

    // Premade for signal blocking and unblocking.
    sigemptyset(&sigMaskUsr1);
    sigaddset(&sigMaskUsr1, SIGUSR1);
    sigemptyset(&sigMaskUsr2);
    sigaddset(&sigMaskUsr2, SIGUSR2);
    sigemptyset(&sigMaskUsr12);
    sigaddset(&sigMaskUsr12, SIGUSR1);
    sigaddset(&sigMaskUsr12, SIGUSR2);

    // Prepare post/wait.
    BOP_Pipe_Init();

    // Create the timer process, which waits for the whole
    // speculative processing team to complete.
    int fid = fork();
    switch (fid) {
    case -1:
        myStatus = SEQ;
        return;
    case 0:  // The child is the new control.
        myStatus = CTRL;
        setpgid(0, SP_gpid);
        return;
    default:
        // Setup SIGALRM.
        signal(SIGALRM, BOP_timerAlarmExit);
        signal(SIGTERM, BOP_timerTermExit);
        while (1) pause();  // Timer waits for the program to end.
    }
}

void BOP_PostPPR(int id) {
    // Ignore a PPR ending if it doesn't match the PPR we started.
    if (id != pprID) return;

    pprID = -1;

    switch (myStatus) {
    case UNDY:
        return PostPPR_undy();

    case SPEC:
        return PostPPR_spec();

    case MAIN:
        return PostPPR_main();

    case SEQ:
    case CTRL:
        return;  // No action.

    default:
        assert(0);
    }
}

static void BOP_AbortSpec(void) {
    assert(myStatus == SPEC);

    // With no earlier SPEC, just let UNDY take over.
    if (mySpecOrder == 1) exit(0);

    // Initiate early termination in the parent.
    kill(getppid(), SIGUSR1);

    exit(0);
}

static void BOP_AbortNextSpec(void) {
    earlyTermination = true;
    // Kill any following SPEC.
    if (specPid != 0) kill(specPid, SIGKILL);
}

A.2 Fast Track Code

Listing A.4: Public Header File

int  FT_BeginFastTrack(void);
int  FT_BeginDualTrack(void);
void FT_PostDualTrack(void);
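As a usage sketch of this interface (assuming the semantics shown in Algorithm 6.2.1 and in the internal begin functions later in this appendix: FT_BeginFastTrack() returns nonzero in the fast-track process and zero in the normal track, and FT_PostDualTrack() ends the region in either), client code might look like the following, where do_fast_version() and do_checked_version() are placeholders for the unsafely optimized and the fully checked code.

/* Usage sketch only; the two helpers are placeholders. */
void run_region(void) {
    if (FT_BeginFastTrack()) {
        do_fast_version();      /* unsafely optimized path             */
    } else {
        do_checked_version();   /* fully checked normal path           */
    }
    FT_PostDualTrack();         /* verify, commit, or fail speculation */
}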

Listing A.5: Private Header File


static bool FT_active;  // True during a dual-track region.

// The maximum number of speculation processes allowed.
static unsigned FT_maxSpec = 2;

// Access maps used by each fast/normal pair:
static char *FT_fastAccMap;  // Fast track
static char *FT_slowAccMap;  // Slow track
static char *FT_accMap;      // Alias to local map

// Queue for waiting slow tracks.
typedef struct {
    int pipe[2];
    sem_t sem;
    volatile unsigned recent;   // Newest active track.
    volatile bool     waiting;  // True when the FT yields.
} readyQueue;
static readyQueue *readyQ;

// Communication channels:
// Channel for passing data updates after verification.
static int updatePipe[2];
// File descriptors for assigning seniority. Each slow track reads
// from the inheritance pipe and writes to the bequest.
static int inheritance, bequest;
// Slow tracks open a floodgate for another waiting slow track.
#define FLOODGATESIZE (2 * (MAX_SPEC_DEPTH + 1))
static int floodgates[FLOODGATESIZE][2];

// Unique identifier for speculative processes.
static unsigned FT_order = 0;

Listing A.6: Utility Functions


static void *FT_sharedMap(size_t length) {
    return mmap(NULL, length,
                PROT_READ | PROT_WRITE,
                MAP_ANONYMOUS | MAP_SHARED, -1, 0);
}

/// Transitions to error state and returns 0.
static char FT_errorOnBegin(void) {
    myStatus = SEQ;
    return 0;
}

static int FT_getDepthFromEnv(void) {
    char *cval;
    static const int def = 2;  // Default value.
    int depth = def;

    cval = getenv("BOP_SpecDepth");
    if (cval != NULL) depth = atoi(cval);
    // Must be in the range [0, MAX].
    if (depth < 0 || depth > MAX_SPEC_DEPTH) depth = def;

    return depth;
}

/// Returns zero on success.
static char FT_setupMaps(void) {
    // Allocate two access maps contiguously.
    char *accMap = FT_sharedMap(ACC_MAP_SIZE * 2);

    if (MAP_FAILED == accMap) return 1;

    FT_fastAccMap = accMap;
    FT_slowAccMap = accMap + ACC_MAP_SIZE;

    return 0;
}

Listing A.7: Floodgate Control


static inline int *FT_floodGateFor(int specOrder) {
    return floodgates[specOrder % FLOODGATESIZE];
}

/// Synchronously reads a single token from the floodgate
/// associated with the current process.
static inline int FT_readFloodGate(void) {
    int token;
    int *gate = FT_floodGateFor(FT_order);
    SP_sync_read(gate[0], &token, sizeof(token));
    return token;
}

/// Opens the floodgate for track.
static inline void FT_openFloodGate(int track, int token) {
    int *gate = FT_floodGateFor(track);
    write(gate[1], &token, sizeof(token));
}

/// Tell any processes waiting on a floodgate to give up.
static void FT_clearFloodgates(void) {
    fd_set readset;
    const int token = -1;
    struct timeval zerotime = {0, 0};
    int nfds = readyQ->pipe[0] + 1;
    int nextslow;

    FD_ZERO(&readset);
    FD_SET(readyQ->pipe[0], &readset);

    if (-1 == select(nfds, &readset, NULL, NULL, &zerotime))
        perror("selecting ready queue");

    while (FD_ISSET(readyQ->pipe[0], &readset)) {
        if (-1 == read(readyQ->pipe[0], &nextslow, sizeof(nextslow)))
            perror("read from ready queue");
        FT_openFloodGate(nextslow, token);
        if (-1 == select(nfds, &readset, NULL, NULL, &zerotime))
            perror("selecting ready queue");
    }
}
static void FT_releaseNextSlow(void) {
    while (0 != sem_wait(&(readyQ->sem)));
    if (readyQ->waiting) {
        // Restart the fast track.
        assert(FAST != myStatus);
        FT_slowCleanup();
        readyQ->waiting = false;
        sem_post(&(readyQ->sem));
        exit(1);
    } else {
        int slowtrack = 0;
        // Read from the ready queue until a value is returned.
        SP_sync_read(readyQ->pipe[0], &slowtrack, sizeof(slowtrack));
        if (slowtrack > 0) readyQ->recent = slowtrack;
        sem_post(&(readyQ->sem));
        // If we got a slow track from the ready queue, start it.
        if (slowtrack) FT_openFloodGate(slowtrack, FT_order);
    }
}

// If the fast track gets too far ahead (a lot of slow tracks are
// waiting) it will yield to let some slow tracks get work done.
static void FT_continueOrYield(void) {
    if (FT_order > readyQ->recent + FT_maxSpec) {
        // Continue only after yielding to a slow track.
        FT_releaseNextSlow();
        readyQ->waiting = true;
        while (readyQ->waiting) pause();
    }
}
static void FT_becomeOldest(void)
{
    int token;
    // Wait until we are the most senior slow instance.
    SP_sync_read(inheritance, &token, sizeof(token));
    if (token == -1) {
        // Upstream error. Propagate and abort.
        SP_sync_write(bequest, &token, sizeof(token));
        exit(1);
    }
    // Now the oldest slow track.
    close(inheritance);
}

Listing A.8: Finalization


static inline void FT_fastCleanup(void) {
    int i;
    for (i = 0; i < FLOODGATESIZE; i++) {
        close(floodgates[i][0]);
        close(floodgates[i][1]);
    }
}

// Closes everything the slow track normally has open.
static inline void FT_slowCleanup(void) {
    int i;
    close(updatePipe[0]);
    close(bequest);
    close(readyQ->pipe[0]);
    for (i = 0; i < FLOODGATESIZE; i++) {
        close(floodgates[i][0]);
        close(floodgates[i][1]);
    }
}


// See Listing 5.3.3 for FT_CheckData implementation.

#ifndef FT_AUTOMARKPOINT
#define FT_AUTOMARKPOINT 0
#endif

static void FT_StartAutoMarkPointTimer(void);
static void FT_InitAutoMarkPoint(void);

static void __attribute__((constructor)) FT_init(void) {
    int i;
    int sen_pipe[2];

    /* Shared floodgate pipes */
    for (i = 0; i <= MAX_SPEC_DEPTH * 2; i++)
        if (pipe(floodgates[i]) != 0) {
            perror("allocating floodgates");
            abort();
        }

    readyQ = FT_sharedMap(sizeof(readyQueue));
    readyQ->waiting = false;
    if (0 != pipe(readyQ->pipe)) {
        perror("allocating ready queue");
        abort();
    }

    if (-1 == sem_init(&(readyQ->sem), 1, 1)) {
        perror("unable to initialize semaphore");
        abort();
    }

    // Create the first seniority pipe.
    if (0 != pipe(sen_pipe)) {
        perror("unable to initialize seniority pipe");
        abort();
    }

    // Ensure the first slow track will know it is the oldest.
    write(sen_pipe[1], &FT_order, sizeof(FT_order));
    close(sen_pipe[1]);
    // Keep the read end open.
    inheritance = sen_pipe[0];

    FT_maxSpec = FT_getDepthFromEnv();
    FT_active = false;
    SP_RedirectOutput();

    if (FT_AUTOMARKPOINT) FT_InitAutoMarkPoint();
}
// Automatic branch point insertion.
static unsigned FT_AM_count = 0;
static bool     FT_AM_active = false;
static unsigned *FT_AM_joinPoint;

static void FT_itimerHandler(int signo) {
    assert(signo == SIGALRM);
    FT_AM_active = true;
}

static void FT_AllocateJoinPointer(void) {
    FT_AM_joinPoint = FT_sharedMap(sizeof(*FT_AM_joinPoint));
    *FT_AM_joinPoint = 0;
}

static void FT_StartAutoMarkPointTimer(void) {
    struct timeval interval = {0, 500000};
    struct itimerval timer = {interval, interval};

    if (SIG_ERR == signal(SIGALRM, FT_itimerHandler))
        perror("setting signal");
    if (0 > setitimer(ITIMER_REAL, &timer, NULL))
        perror("setting timer");
}

static void FT_InitAutoMarkPoint(void) {
    if (!FT_AUTOMARKPOINT) return;
    FT_StartAutoMarkPointTimer();
    FT_AllocateJoinPointer();
}

int FT_AutoMarkPoint(void) {
    if (!FT_AUTOMARKPOINT) return 0;
    FT_AM_count++;
    if (!FT_AM_active) return 0;

    if (SLOW == myStatus) {
        // If the slow track has already passed the join point, then it
        // is running ahead of the fast track (or the timer didn't fire
        // soon enough). Slow wins.
        if (FT_AM_count > *FT_AM_joinPoint) FT_slowTakesOver();
        // If we have reached the indicated join point, clean up.
        else if (FT_AM_count == *FT_AM_joinPoint) FT_PostDualTrack();
    } else if (FAST == myStatus || CTRL == myStatus) {
        // Reset the activation.
        FT_AM_active = false;
        // Indicate where the branch/join is.
        *FT_AM_joinPoint = FT_AM_count;
        munmap(FT_AM_joinPoint, sizeof(*FT_AM_joinPoint));
        // Set up a new join point record for the next slow track.
        FT_AllocateJoinPointer();

        return FT_BeginFastTrack();
    }
    return 0;
}
// See Listing 5.3.5 for FT_PostSlow and FT_slowTakesOver.
// See Listing 5.3.5 for FT_exitHandler implementation.

// The slow track kills fast with SIGABRT.
static void FT_sigAbortFast(int sig) {
    assert(SIGABRT == sig);
    FT_fastCleanup();
    exit(1);
}

// Handler for the fast track to recognize a child has aborted.
static void FT_sigChildAborted(int sig) {
    int pid;
    int flags = WNOHANG | WUNTRACED | WCONTINUED;
    assert(SIGCHLD == sig);
    // Clean up any and all dead children.
    while (0 < (pid = waitpid(-1, NULL, flags)));
}
/// Returns 1 on success (fast track started).
char FT_internalBeginFast(int seniority[2]) {
    myStatus = FAST;
    FT_active = true;
    FT_accMap = FT_fastAccMap;

    // We need to be able to abort if necessary.
    signal(SIGABRT, FT_sigAbortFast);
    // Keep track of what FAST's children do.
    signal(SIGCHLD, FT_sigChildAborted);

    // Seniority based ordering.
    close(seniority[1]);  // We don't need the read side.
    close(inheritance);

    // SLOW is responsible for the old write side.
    inheritance = seniority[0];
    // We have a new write pipe.

    FT_continueOrYield();
    return 1;
}
char FT_internalBeginNormal(int seniority[2]) {
    myStatus = SLOW;
    bequest = seniority[1];

    // Stop using handlers from past fast tracks.
    signal(SIGABRT, SIG_DFL);

    // specDepth control via waiting by the floodgate.
    if (FT_order > FT_maxSpec) {
        if (FT_readFloodGate() == -1) {
            // An error occurred earlier.
            FT_slowCleanup();
            exit(1);
        }
    }

    close(readyQ->pipe[1]);
    close(seniority[0]);

    FT_accMap = FT_slowAccMap;
    close(updatePipe[1]);

    if (SIG_ERR == signal(SIGABRT, SIG_DFL))
        perror("failed to clear abort handler");

    SP_RedirectOutput();
    FT_active = true;
    if (FT_AUTOMARKPOINT) FT_StartAutoMarkPointTimer();
    return 0;
}
// See Listing 5.3.2 for FT_SegvHandler implementation.
// See Listing 5.3.1 for FT_Begin implementation.

static int dualPid;  // The other dual (not fast/slow) track.

static inline void FT_PostDual(void) {
    // Just kill the other and move on.
    if (-1 == kill(dualPid, SIGABRT))
        perror("failed to abort parallel track");

    myStatus = CTRL;
    SP_CommitOutput();
}

int FT_BeginDualTrack(void)
{
    // Make sure we're currently running sequentially.
    if (myStatus != CTRL) return 0;
    // Don't bother if there can't be parallelism.
    if (FT_maxSpec < 1) return 0;

    int PID = fork();
    if (-1 == setpgid(0, SP_gpid)) {
        perror("failed to set process group");
        abort();
    }

    switch (PID) {
    case -1:
        myStatus = SEQ;
        PID = 0;
        break;
    case 0:
        myStatus = DUAL;
        dualPid = getppid();
        break;
    default:
        myStatus = DUAL;
        dualPid = PID;
        break;
    }
    SP_RedirectOutput();
    return PID;
}
static inline void FT_PostFast(void) {
    SP_PushDataAccordingToMap(FT_fastAccMap, updatePipe[1]);

    if (munmap(FT_fastAccMap, ACC_MAP_SIZE) == -1)
        perror("unmapping access map");
    if (munmap(FT_slowAccMap, ACC_MAP_SIZE) == -1)
        perror("unmapping access map");

    close(updatePipe[0]);
    close(updatePipe[1]);
}

void FT_PostDualTrack(void)
{
    switch (myStatus) {
    case SLOW:
        FT_PostSlow();
        break;
    case FAST:
        FT_PostFast();
        break;
    case DUAL:
        FT_PostDual();
        break;
    default:
        fprintf(stderr, "unexpected process state %d\n", myStatus);
        abort();
    }
    FT_active = false;
}

A.3 Common Code

Listing A.9: Common Header File

#define PAGESIZE 4096  // Memory page size.

#define PAGECOUNT (UINT_MAX / PAGESIZE)

// The size of any single memory bitmap in bytes.
#define BIT_MAP_SIZE ((PAGECOUNT) >> 3)

// ALLOC_MAP_SIZE defines the size of the allocation (use) map.
#define ALLOC_MAP_SIZE BIT_MAP_SIZE

#define MAX_SPEC_DEPTH 16

// The total size of the access maps.
// Dual maps: the map pair for specOrder 0 is reused for unions.
#define ACC_MAP_SIZE ((MAX_SPEC_DEPTH + 1) * 2 * BIT_MAP_SIZE)

// Write operations are type 2, and register 13 stores the type info.
#if defined(__MACH__)
#define SIG_MEMORY_FAULT SIGBUS
#define MAP_ANONYMOUS MAP_ANON
#define WRITEOPT(cntxt) ((cntxt)->uc_mcontext->__es.__err & 2)
#else
#define SIG_MEMORY_FAULT SIGSEGV
#define WRITEOPT(cntxt) ((cntxt)->uc_mcontext.gregs[13] & 2)
#endif

typedef enum {
    CTRL, MAIN,
    SPEC,  // a speculation process
    UNDY,  // the understudy
    SEQ,   // a sequential process
    FAST,  // a fast track
    SLOW,  // a slow track
    DUAL   // either of two equal options
} SP_Status;

volatile SP_Status myStatus;  // Current process's status.

int SP_gpid;  // The process group.

#define PAGESTART(x) ((void *)(((unsigned long)x / PAGESIZE) * PAGESIZE))

#define MYRESOURCE(a) ((a) == 0 ? 0 : (((a) - 1) % MAX_SPEC_DEPTH) + 1)
#define WRITEMAP(a) (accMapPtr + (MYRESOURCE(a) * 2 + 1) * BIT_MAP_SIZE)
#define READMAP(a)  (accMapPtr + MYRESOURCE(a) * 2 * BIT_MAP_SIZE)

Listing A.10: IO Capture Header


void SP_RedirectOutput();
void SP_CommitOutput();

Listing A.11: Utility Functions Header


typedef enum {
    READ,
    WRITE
} AccessType;

// Returns map's bit for address.
int SP_checkMap(void *page_address, char *map);

// Applies the protection prot to any memory pages that
// are marked as in use according to the useMap.
void SP_setProtection(int prot);

// Call read until it succeeds.
void SP_sync_read(int fd, void *buf, size_t count) {
    while (read(fd, buf, count) == -1);
}
// Call write until it succeeds.
void SP_sync_write(int fd, const void *buf, size_t count) {
    while (write(fd, buf, count) == -1);
}

Listing A.12: Utility Function Implementations


static inline void
SP_protectPages(unsigned long first, unsigned long last, int prot)
{
    void *page = (void *)(first * PAGESIZE);
    size_t len = ((last - first + 1) * PAGESIZE);
    // Try to set the protection all at once.
    if (0 == mprotect(page, len, prot)) return;
    perror("");
    for (unsigned long i = first; i <= last; i++) {
        if (mprotect((void *)(i * PAGESIZE), PAGESIZE, prot))
            perror("");
    }
}

// Sets a bit in map to indicate that page is accessed.
void SP_recordAccessToMap(void *page_address, char *map) {
    int byte, bit, page;
    page = ((unsigned long)page_address) / PAGESIZE;
    byte = page >> 3;  // byte = page / 8
    bit  = page & 7;   // bit  = page % 8

    map[byte] |= (1 << bit);
}

int SP_checkMap(void *page_address, char *map) {
    int page = ((unsigned long)page_address) / PAGESIZE;
    int byte = page >> 3;
    int bit  = page % 8;

    char mapvalue = map[byte];

    return (mapvalue >> bit) & 0x1;
}
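As a quick worked check of this indexing (a sketch using the PAGESIZE of 4096 defined in Listing A.9): an address such as 0x2A003 lies on page 42, which maps to byte 42 >> 3 = 5 and bit 42 & 7 = 2 of the map.

#include <assert.h>

int main(void) {
    unsigned long addr = 0x2A003UL, pagesize = 4096UL;
    unsigned long page = addr / pagesize;          /* 42 */
    assert(page == 42);
    assert((page >> 3) == 5 && (page & 7) == 2);   /* byte 5, bit 2 */
    return 0;
}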

// Writes num memory pages to pipe_id starting with the i-th page.
void SP_PushPageToPipe(unsigned long i, int pipe_id, unsigned num) {
    unsigned write_count = 0;
    while (write_count < (num * PAGESIZE)) {
        int result = write(pipe_id,
                           (void *)((i * PAGESIZE) + write_count),
                           (num * PAGESIZE) - write_count);
        if (result == -1) {
            perror("failed to write into pipe");
            abort();
        } else {
            write_count += result;
        }
    }
}

int SP_PushDataAccordingToMap(char *map, int pipe_id) {
    unsigned bchar, bit, i;
    int page_count = 0;
    for (bchar = 0; bchar < BIT_MAP_SIZE; bchar++) {
        if (map[bchar] == 0) continue;
        // All eight pages covered by this byte were accessed:
        // push them in a single call.
        if ((unsigned char)map[bchar] == 0xff) {
            SP_PushPageToPipe(bchar * 8, pipe_id, 8);
            page_count += 8;
            continue;
        }
        for (bit = 0; bit < 8; bit++) {
            if ((map[bchar] >> bit) & 0x1) {
                i = bchar * 8 + bit;
                SP_PushPageToPipe(i, pipe_id, 1);
                page_count++;
            }
        }
    }
    return page_count;
}

// Read a page from pipe and write it to the i-th page of memory.
static void
SP_CopyPageFromPipe(unsigned long i, int pipe, char protected) {
    unsigned read_count = 0;
    int increment;
    if (protected)
        mprotect((void *)(i * PAGESIZE), PAGESIZE, PROT_WRITE);
    while (read_count < PAGESIZE) {
        // Read the remaining portion of the page from the pipe.
        // The location to read to is page i offset by the amount
        // already read in.
        increment = read(pipe,
                         (void *)((i * PAGESIZE) + read_count),
                         PAGESIZE - read_count);
        if (-1 == increment) {
            perror("error code");
            exit(0);
        }
        read_count += increment;
    }
    if (protected)
        mprotect((void *)(i * PAGESIZE), PAGESIZE, PROT_NONE);
}

// If we are reading pages into protected space (protected == true),
// we'll need to first open the protection and then close it.
int SP_PullDataAccordingToMap(char *map, int pipe, char protected) {
    unsigned bchar, bit, i;
    int page_count = 0;
    for (bchar = 0; bchar < BIT_MAP_SIZE; ++bchar) {
        if (map[bchar] == 0) continue;
        for (bit = 0; bit < 8; ++bit) {
            if ((map[bchar] >> bit) & 0x1) {
                i = bchar * 8 + bit;
                SP_CopyPageFromPipe(i, pipe, protected);
                ++page_count;
            }  // if map[bchar]...
        }  // for bit
    }  // for map
    return page_count;
}
/// Returns 0 on success.
char SP_setupAction(void (*handler)(int, siginfo_t *, ucontext_t *),
                    int signal)
{
    struct sigaction action;

    sigfillset(&action.sa_mask);
    action.sa_flags = SA_SIGINFO;

    action.sa_sigaction = (void *)handler;
    if (-1 == sigaction(signal, &action, NULL)) {
        perror("failed to set fault handler");
        return 1;
    }

    return 0;
}


