and Optimization
by
Kirk Kelsey
Supervised by
Dr. Chen Ding
Department of Computer Science
Arts, Sciences and Engineering
Edmund A. Hajim School of Engineering and Applied Sciences
University of Rochester
Rochester, New York
2011
To Ellen:
Always Hopes,
Always Perseveres
Curriculum Vitae
The author was born in New Haven, Connecticut on March 3rd, 1979. He
attended Vanderbilt University from 1997 to 2003, and graduated with a Bachelor
of Science degree in 2001 followed by a Master of Science degree in 2003. He came
to the University of Rochester in the Fall of 2003 and began graduate studies
in Computer Science. He pursued his research in software speculative parallelism
under the direction of Professor Chen Ding and received a Master of Science degree
from the University of Rochester in 2005.
Acknowledgments
More than any other factor, I owe so much to the unyielding
support of my wife, Ellen. This certainly extends well beyond the time spent
working towards a thesis, but so few pursuits offer the opportunity for a formal
acknowledgment. If I had the words, my thanks would dwarf this document. My
parents, also, deserve my heartfelt appreciation for many more years of support,
as well as for providing early models of scholarship.
I am deeply thankful to my adviser, Chen Ding, for guiding me through a
marathon process. Chen has been a constant through the many stages of graduate
education and study. Ultimately, he helped me develop a direction in research and
reminded me that we are measured not by the information we consume, but by the
knowledge we create. I owe a sincere debt to the members of my thesis committee
for their advice during the development of ideas that has led to this work, and for
the broader education they provided within the department.
My cohort of fellow aspiring researchers were an invaluable source of insight,
inspiration, humility and support. I'd like to thank other students in the compiler
and systems groups who have helped to show the way ahead of me, specifically
Yutao Zhong and Xipeng Shen, and kept me motivated, especially Mike Spear
and Chris Stewart. From a broader standpoint, I have appreciated time spent
with Ashiwin Lall, Chris Stewart, Ben Van Durme and Matt Post immensely.
My friends outside of the department helped to take my mind off computer
science from time to time; Jason and Ana stand out specifically in that regard. Finally,
I'd like to thank the staff of the computer science department for their help in
innumerable ways. Jo Marie Carpenter, Marty Gunthner, Pat Mitchell and Eileen
Pullara keep a lot of things running around the department and I'm happy to be
included among them.
Abstract
The computing industry has long relied on computation becoming faster through
steady exponential growth in the density of transistors on a chip. While the
growth in density has been maintained, factors such as thermal dissipation have
limited the increase in clock speeds. Contemporary computers are rapidly becoming parallel processing systems in which the notion of computer power comes from
multi-tasking rather than speed. A typical home consumer is now more likely
than not to get a parallel processor when purchasing a desktop or laptop.
While parallel processing provides an opportunity for continued growth in
mainstream computational power, it also requires that programs be built to use
multiple threads of execution. The process of writing parallel programs is acknowledged as requiring a significant level of skill beyond general programming,
relegating parallel programming to a small class of expert programmers. The difficulty of parallel programming is only compounded when attempting to modify
an existing program. Given that the vast majority of existing programs have not
been written to use parallelism, a significant amount of code could benefit from
an overhaul.
An alternative to explicitly encoding parallelism into a program is to use speculative parallelism of some form. Speculative parallelism removes the burden of
guaranteeing the independence of parallel threads of execution, which greatly simplifies the process of parallel program development. This is especially true when
Table of Contents

Curriculum Vitae
Acknowledgments
Abstract
List of Tables
List of Figures
List of Algorithms
Foreword
1 Introduction
    1.1
    1.2 Speculative Execution
    1.3 Road Map
2 Background
    2.1 Thread Representation
        2.1.1 Data Sharing
        2.1.2 Message Passing
    2.2 Speculative Threads
        2.2.1 Ancillary Tasks
        2.2.2 Run-Ahead
    2.3
        2.3.1 Futures
        2.3.2 Cilk
        2.3.3 Sequential Semantics
    2.4 Pipelining
        2.4.1 Decoupling
    2.5 Support Systems
        2.5.1 Operating System
        2.5.2 Compiler
        2.5.3 Race Detection
    2.6 Correctness Checking
        2.6.1 Heavyweight
        2.6.2 Hardware Techniques
        2.6.3 Monitoring
3 Process-Based Speculation
    3.1 Implementation
        3.1.1 Creation
        3.1.2 Monitoring
        3.1.3 Verification
        3.1.4 Abort
        3.1.5 Commit
    3.2 Advantages
    3.3 Disadvantages
    3.4 Special Considerations
        3.4.1
        3.4.2 Memory Allocation
        3.4.3 System Signals
4 Speculative Parallelism
    4.1 Design
        4.1.1
        4.1.2
        4.1.3
    4.2 Programming Interface
        4.2.1 Region Markers
        4.2.2 Post-Wait
        4.2.3 Feedback
    4.3 Run-Time System
        4.3.1 Creation
        4.3.2 Monitoring
        4.3.3 Verification
        4.3.4 Commit
        4.3.5 Abort
    4.4
        4.4.1 Data-Parallel
        4.4.2 Task-Parallel
    4.5
        4.5.1 Explicit Parallelism
        4.5.2 Fine-Grained Techniques
    4.6 Evaluation
        4.6.1
        4.6.2 Application Benchmarks
5 Speculative Optimization
    5.1 Design
        5.1.1
        5.1.2 Dual-track
    5.2 Programming Interface
    5.3 Run-time Support
        5.3.1 Creation
        5.3.2 Monitoring
        5.3.3 Verification
        5.3.4 Abort
        5.3.5 Commit
        5.3.6
    5.4
    5.5
        5.5.1
        5.5.2
    5.6 Evaluation
        5.6.1 Analysis
        5.6.2
6 Conclusion
    6.1 Contributions
    6.2
        6.2.1 Automation
        6.2.2 Composability
        6.2.3
A Code Listings

List of Tables

4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8

List of Figures

4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 5.1, 5.2, 5.3, 5.4, 5.5, 5.6

List of Algorithms

2.4.1 Listing of pipeline loop
Foreword
Chapters 4 and 5 of this dissertation are based on collaborative work. Chapter 4 of my dissertation was co-authored with Professor Chen Ding, and with fellow
students Xipeng Shen, Chris Tice, Ruke Huang, and Chengliang Zhang. I contributed the implementation of the computational system, and the experimental
analysis. It has been published in Proceedings of the ACM SIGPLAN Conference
on Programming Language Design and Implementation, 2007. An early prototype
of the run-time system was created by Xipeng Shen, which was rewritten for our
publication, and again for ongoing work. Ruke Huang contributed compiler support, and Chris Tice worked on the MKL benchmark. Chengliang Zhang helped
with system testing.
I am the primary author of Chapter 5, on which I collaborated with Professor Chen Ding and with fellow graduate student Tongxin Bai. This chapter has
been published in Proceedings of the International Symposium on Code Generation and Optimization, March 2009. My contribution is the implementation of
the computational system, construction of the experimental frameworks, and the
experimental analysis. Tongxin Bai contributed design ideas, and assisted with
testing.
Introduction
Since the introduction of the Intel 4004 microprocessor, the number of transistors
on commercial integrated circuits has doubled roughly every two years. This
trend was famously noted by Gordon Moore in 1965 and has continued to the
present [40]. During this period of time the growing number of transistors typically
corresponded with an increase in the clock rate, from 740 kHz for the 4004 chipset
to 3.8 GHz for Intel's Pentium 4 processor in 2004.
Since the release of the Pentium 4 processor, clock rates have actually decreased slightly. Currently, the highest clock rate available on an Intel microprocessor is 3.33 GHz. The primary reason for this stagnation and decline is the
problem of thermal dissipation. Each transistor on a chip uses some amount of
power in two forms: constant leakage and per state switch. Increasing the chip
clock rate directly increases the power consumption due to switching, but also requires a reduction in the size of components (to reduce signal propagation time).
This miniaturization increases the density of the transistors, which increases the
amount of power consumed in any given chip area. Increased power consumption
leads to increased heat generation. The two factors, increased switching and the
concentration of components, compound one another.
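The compounding effect described here follows the standard first-order CMOS power model (a textbook relation, not drawn from this dissertation):

```latex
P \;\approx\; P_{\text{leak}} \;+\; \alpha\, C\, V^{2} f
```

Here \alpha is the switching activity factor, C the switched capacitance, V the supply voltage, and f the clock frequency. Raising f grows the dynamic term directly, while shrinking components to support a higher f packs more switching capacitance into each unit of area, concentrating the resulting heat.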
On the consumer front, we've reached the limits of air cooling a computer
1.1
While the general public may recognize that programming requires a certain level
of expertise, parallel programming has largely been relegated to a select group of
programmers. Programmers are typically taught to think explicitly in series:
to write an imperative program as a series of steps that depend on one another.
This can make the transition to parallel programming difficult for programmers,
but more importantly it has led to a legacy of programs that are truly serial by
design.
Finding Parallelism
Identifying portions of a program that can safely run in parallel with one another
is perhaps the most difficult aspect of parallel programming. This task is often
made more difficult by attempts by programmers to optimize their code for the
sequential execution. Once the parallel regions have been identified, the programmer must ensure the correctness of each region interacting with all others. This
is most commonly done using locks, which must be correctly associated with the
same collection of data in every case where that data may be modified by multiple
threads. The problems involved in correctly writing a parallel program are exacerbated when attempting to update an existing program. Without a familiarity
with the code in question, the programmer is less likely to recognize side effects
of functions or identify poorly isolated data. Currently, no tool exists that can
automatically identify parallelism in an arbitrary program, and it is not possible
to do so in every case.
Ensuring Progress
One of the most well known problems encountered in parallelism, whether designing a single program with multiple threads of execution, or scheduling multi-
Guaranteeing Correctness
In the context of parallel programming, correctness is defined to mean that the observable behavior of the program is maintained. If the program acts as a function,
mapping input to output, then the function must be preserved. In the context of
parallelizing a sequential program, the original serialization of observable points in
the execution implied by that program must be maintained, ruling out deadlock.
To guarantee correctness, the programmer must ensure that all accesses to shared
data are properly guarded. This requires identifying all shared data, identifying
all accesses to that data, and finally creating an association between data objects
and the primitives used to synchronize their access. Particularly in the case of
parallelizing an inherited code base, the programmer may have difficulty simply
identifying what data objects are shared. Assuming that using a single global lock
will not allow acceptable performance, the programmer will also be responsible for
determining which data need to be protected collectively because their common
state needs to be consistent.
Debugging
One of the more common problems in parallel programming is the occurrence of
a data race, which is the case of two threads accessing the same data without
synchronization between the accesses (at least one of which must be a write).
The result of a race (i.e., the value that is ultimately attributed to the data)
depends on the sequence and timing of events in both threads leading up to their
accesses. Because the scheduling of threads may depend on other processes in
the system at large, the error is effectively non-deterministic. Generally, we want
to reproduce the conditions under which a bug occurs to isolate it. Because the
problem may appear very intermittently, the conditions for the error are effectively
random. Running the program in a debugger can force a particular serialization,
which ensures a certain outcome of the race, potentially making the debug session
useless for finding the problem.
1.2
Speculative Execution
Debugging
The speculative execution system depends on the ability to discard the speculative portion of execution and follow only the sequential flow of
execution. The intent of this fallback is that the speculatively parallel program
maps directly back to the sequential execution. In this case, there is no need to
explicitly debug a speculatively parallel program because the user can debug the
sequential program with the same effect.
1.3
Road Map
execution of program regions ahead of time. In Chapter 5 I describe a software-only speculation system that enables unsafe optimization of sequential code. I
conclude with a discussion of the limitations of the current speculative execution
system, and of future directions to address, in Chapter 6.
Background
2.1
Thread Representation
2.1.1
Data Sharing
first then a reader may attempt to read the buffer and receive garbage.
In order to guarantee a process always sees a consistent view of the global state,
there must be some mechanism to indicate that the data should not be accessed.
This is typically done by introducing a lock, which requires a hardware guarantee
that all processes see the flag consistently and cannot access it simultaneously.
Implementations typically rely on an atomic read-modify-write operation that
only sets the value of a data object if its current value matches what is expected.
Such systems are more efficient if multiple locks are used so that distinct parts
of the shared state can be modified simultaneously. One of the difficulties is
ensuring that the relationship between a lock and the data it is meant to protect
is well defined: that no access to the data is made without first acquiring the
lock. In this way, a portion of the shared state is used to protect the consistency
of the shared state.
An alternative to locking regions of memory to provide protection is to create
the illusion that modifications are made atomically. This typically involves introducing additional redirection to encapsulated data that must be kept consistent.
By modifying a local (or private) copy of the data, one process can ensure that
no others will read inconsistent state. Once the modifications are complete, the
single point of redirection can be atomically updated to refer to the new (and no
longer private) version of the data.
This sort of redirection can be expanded to be applied to general memory
access in transactional memory systems. These systems indicate that specific
regions of the program should appear to execute atomically. By tracking all of the reads
and writes that a process makes, it is possible to ensure that none of the memory
involved was changed by another transaction executing simultaneously.
Transactional memory was originally proposed as a hardware mechanism to
support non-blocking synchronization (by extending cache coherence protocols) [26]
and several software implementations are built for existing hardware. Transaction
2.1.2
Message Passing
2.2
Speculative Threads
2.2.1
Ancillary Tasks
Past work has suggested specifically using speculative execution to treat some
portion of the programs work as a parallel task. Such tasks include the addition of
memory error and leak checking, performing user defined assertions, and profiling.
In [48] the authors suggest creating a shadow version of a program to address
these ancillary tasks specifically, although they do not address how the shadow
process might be generated.
tion for memory allocation. When memory objects are released, the program can
essentially issue the deallocation asynchronously and continue without waiting for
memory management to complete.
The synchronous memory requests still have a communication delay in addition
to the period of time needed to actually service the request. This is alleviated by
having the mmt speculatively preallocate objects, which can be provided without
delay if the size is right. Delays are further reduced by batching deallocation
requests to the mmt, and symmetrically by providing the client with multiple
preallocated objects.
Although the mmt technique can extract some memory safety checks into a
separate thread, not all types of memory checks are isolated in the allocation or
deallocation routines. Checks such as array overflow must be performed in
the context of the memory access.
Some of these limitations are addressed in the approach taken in the Speck
(Speculative Parallel Check) system [45]. The Speck system is intended to decouple the execution of security checking from the execution of the program at large.
During program execution, a set of instrumented system calls potentially creates an
additional instance of the application that includes the security checks. Like earlier
work, some of the overhead is removed by only entering the instrumented code
path periodically.
The primary focus of the Speck work is on security checks such as virus scanners and taint analysis, though it could be applied to simpler checking for safe
memory access. The limitation of the Speck system is its dependence on the use
of a modified Linux kernel designed to support efficient distributed file system
interaction, called Speculator [44]. This support is necessary to allow for unsafe
actions performed by an application to be rolled back if one of the security checks
were to fail. An additional feature of their operating system support is the ability
to ensure that certain system functionality operates identically in both processes,
and that signals are delivered at the same point in the execution of each.
Another recent approach to minimizing the overhead of memory safety checking with thread-level speculation did so by parallelizing an existing memory checking library [31]. Because of the tight synchronization needed by the accesses to the
data structures used by the library, adapting it for use with TLS requires detailed
analysis of the library itself and the manual insertion of source level pragmas to
denote parallel regions. The annotated code was then passed through a parallelizing compiler which extracts each parallel task. Ultimately, the authors assume
that some form of hardware support will guarantee the ordering of the tasks to
guarantee the sequential semantics of the original program. The system also relies
on the presence of a mechanism to explicitly synchronize access to the library's
data structures, which is not provided.
2.2.2
Run-Ahead
two processes together (one of which is the original program) complete faster than
either would independently.
Because the leading process is not performing all of the operations of the
original program, its execution may deviate from the correct execution, which
is always computed by the trailing process. In order to recover from incorrect
speculation, and to generate the lead process, the Slipstream technique requires
a number of additional hardware components. The lead process must have a
program counter that is modified to skip past some instructions by recording
previous traces through the program execution. The addresses of memory locations
modified by the lead process are recorded to allow for recovery by updating those
values from the memory state of the trailing process.
The suggested mechanism for determining which operations may be good candidates for speculative removal is based on a small data flow graph built in reverse as instructions are retired. Operations that write to memory (or registers)
are recorded as being the producer of the value stored there, and a bit denotes
the value as valid. A subsequent write with the same value is redundant, while
a different value updates the producer. A reading operation sets a bit indicating a location has been referenced, which allows an old producer operation to be
removed if the value was unused.
Another related idea used in hardware systems is to extract a fast version of
sequential code to run ahead while the original computation follows. It is used to
reduce memory load latency with run-ahead code generated in software [33], and
recently to reduce hardware design complexity [19].
A third, more recent idea is speculative optimization at fine granularity, which
does not yet make use of multiple processors [43]. All of these techniques require modifications to existing hardware. Similar special hardware support has
been used to parallelize program analysis such as basic block profiling, memory
2.3
2.3.1
Futures
is protected.
Work on a Java implementation of futures that are safe in terms of maintaining their sequential semantics has been done through modifications to the
run-time virtual machine [63]. In order to ensure the effects of a future are not
intermixed with data accesses of its continuation, each is run in a separate thread
with a local write buffer implemented by chaining multiple versions of an object
together. Reads to the object must traverse a list of versions to locate the correct one for the context of the thread. Each thread must also maintain a read
and write map of data accesses, which is used to detect read-write violations
between the threads. Despite the name, the future should conceptually complete
its data accesses before the continuation.
The implementation of safe futures depends heavily on the fact that Java is a
managed language in which objects have metadata and are accessed by reference,
simplifying the redirection needed to protect access. The additional work needed
to monitor data access is added to existing read and write barriers designed for
garbage collection, and the rollback routine is based on the virtual machine's
exception handling.
Recent work has sought to automatically insert synchronization for Java futures using compiler support [42]. This support determines statically when a
future first accesses a shared data object and inserts a special form of barrier
statement called allowed. The allowed statement is not released in a continuation until all of its futures have granted access with an explicitly matched grant
statement. A list of futures is built as they are spawned, and cleared after they
have granted access to the data. Because the insertion of the grant and allowed
operations is based on static analysis, it is more conservative than what could
be achieved with a system using run time analysis. The static analysis has the
advantage of significantly lower overhead during execution.
2.3.2
Cilk
system (e.g., spawning and moving tasks) are placed on the critical-path, which
is a design decision not shared by all systems.
2.3.3
Sequential Semantics
Although fork-join style semantics for parallelism makes explicit the point at which
parallel computation is needed, as mentioned in Section 2.3.1 there is no implicit
guarantee of atomicity or progress. A programmer is still responsible for guarding
shared data accesses to preserve object consistency and inserting synchronization
to prevent race conditions. Recent work using a run-time system called Grace
converts a program with fork-join parallel threads into a sequentially consistent
program [4].
Guaranteeing sequential consistency requires that the effects of operations appear in
a specific order. This sequence is defined by the semantics of the source program
code. By assuming that threads should be serialized in the order they are created,
the sequential semantics of a fork become the same as a simple function call.
By allowing the run-time system to ensure thread ordering and atomicity, locks can
be elided and the program can be viewed semantically as though it were serial.
The Grace system does this by converting each thread into a heavy-weight
process with isolated (copy-on-write) memory. Heap and global data that would
have originally been available to all threads are placed in a memory mapped
file and each process maintains a local mapping of the same data for privatized
writes. Using a versioning scheme for the memory, and logging accesses during
execution, the run-time system can determine whether the processes execution
is correct. Assuming correct execution, the process must wait until all logically
prior processes complete before committing its local writes to the global map.
Although the process corresponds to a thread in the original program, Grace
intends to detect violations of the sequential semantics to guard against improper
parallel implementations.
Somewhat earlier work suggested two ways in which sequential semantics could
be relaxed intuitively to remove common points of misspeculation [8]. They argue
that sequential semantics may be overly restrictive in many cases in which some
portions of execution do not need to be explicitly ordered, and a program may
have multiple valid outputs. The primary suggestion is that groups of functions
be annotated to indicate a commutative relationship if their internal state does
not need to be ordered but does need to be accessed atomically. Put another
way, these functions have side effects that are only visible to one another. This
kind of behavior is common for dynamic memory management, which maintains
metadata that is not accessed externally.
The programmer is still responsible for identifying all functions accessing the
same state. Although this is significantly easier than identifying all functions that
access shared state and subsequently grouping them, it does allow for failures the
speculation system would otherwise prevent. Additionally, it requires atomicity
guards within the functions, which the authors ignore. There is an additional
requirement that commutative functions operate outside the scope of speculation
itself. If a transactional memory system is being used, the functions must use
non-transactional memory. This complicates cases where some state is internal
to the commutative group, while other state is global and also implies that these
functions must have an explicit inverse function because the rollback mechanism
of the speculation system will not protect them. This limits the applicability of
commutative annotations, or requires significantly more programmer effort than
initially suggested.
2.4
Pipelining
2.4.1
Decoupling
The correctness of this pipeline relies on the memory coherence of the architecture.
Listing of pipelined loop:

    for (i = 0; i < N; ++i) {
        B[i] = f(A[i]);
        C[i] = g(B[i]);
        D[i] = h(C[i]);
    }
    Processor 3:  B[3] = f(A[3]);  C[3] = g(B[3]);  D[3] = h(C[3]);
    Processor 4:                   B[4] = f(A[4]);  C[4] = g(B[4]);  D[4] = h(C[4]);
on another processor. One seeks to align the loop structure and processor count
so that the first processor completes its loop iteration just as the last processor
completes the dependent stage of the loop. In this case there are no bubbles in
this pipeline and the processors can be maximally utilized.
The above scenario assumes there is no delay between completing the dependent stage on one processor and initiating it on another. In reality, there will likely
be communication latency between the processors, causing the later iterations to
stall slightly. Multiple stalls will accumulate over time and propagate through
later iterations.
The reason this problem arises is that communication is flowing cyclically
through all of the processors. Decoupling breaks the communication cycle so the
dependency communication only flows in one direction [47]. In a decoupled software pipelined loop, after the dependent stage is executed on the first processor
the remainder of the loop is dispatched to another processor while the first processor begins the next dependent stage. The result is that any communications
delay applies equally in all cases. The second processor is effectively skewed by
that delay.
The process of scheduling a decoupled software pipeline involves constructing
a dependence graph of the loop instructions. The instructions represented by a
strongly connected component (scc) in the graph must be scheduled collectively in
a thread (though a thread may compute multiple components). These components
limit possible parallelism in two ways: there can be no more pipeline stages than
there are sccs, and the size of the largest scc is the minimum size of the all
pipeline stages.
By introducing speculation into the decoupled software pipelined loop it is
possible to break some of the dependencies [60]. Breaking graph edges allows for
a reduction in the size of sccs and an increase in their number. The speculation temporarily removes dependencies that are highly predictable, schedules the
pipeline normally, then replaces edges that do not cross threads or flow normally
from early threads to later ones.
The implementation presented in [60] relies on compiler support for the transformations and on versioned memory to enable rollback of misspeculation. Each
loop iteration involves advancing the memory version and sending checkpoint
information and speculation status to a helper thread. The dependence
on additional hardware support can be overcome using software multi-threaded
transactions as described in Section 2.1.1.
2.5 Support Systems
In order for parallel programming, and particularly speculative parallel programming, to be practical, the task of generating the program must be supported in
a number of ways. The first problem is determining how the programmer should
express the parallelism. The actual implementation of the parallel
constructs can be built for an existing language using a new library and programming interface, or may be built around a language explicitly designed for parallel
programming. In the latter case, the language compiler may be equipped with
additional analysis techniques to determine whether the parallel execution will be
valid. Below the programming language, the operating system must provide some
form of support. This support must at the very least include scheduling for
multiple tasks, but may also provide additional isolation or monitoring. At the
lowest level, the hardware must provide multiple processing cores.
2.5.1 Operating System
Adding support for speculation at the operating system level provides a broad form
of support for applications. It is, however, generally limited to use by heavy-weight
processes, while light-weight thread implementations may need to multiplex what
the operating system supports.
One way for the operating system to enable parallel programming is by forcing
sequential semantics on the processes within the system, much like the run-time
system described in Section 2.3.3. One way to achieve this is by building a message-based system in which processes only execute in response to receiving a message,
generating output to be passed to another process. Conceptually, only the oldest
message in the system can be consumed, which serializes the computation by
forcing causality to flow linearly through the virtual time of the system.
The Time Warp operating system (twos) takes this approach and extends it
by speculatively allowing multiple processes to execute simultaneously [27]. twos
is motivated by distributed systems in which synchronization between processes
is impeded by varying latencies between parts of the system. A process cannot
quickly determine whether it may receive a message in the future that should have
been handled before those currently waiting in its queue. For this reason, allowing
in single user mode and on a static set of processes, though as long as processes
are not communicating with one another the principles of twos remain valid.
The Speculator system introduces support for explicitly tracking speculative
processes by extending the Linux operating system kernel [44]. As with all speculation systems, Speculator implements detection of and recovery from misspeculation, and guarantees that speculative processes do not perform irrevocable operations.
Because speculation is performed on heavy-weight processes, rollback of incorrect speculation is handled by terminating the process and restarting from a
checkpoint. The checkpointing routine is based on extensions to the standard
fork call. The process is duplicated, but the new child is not made available to
the scheduler and retains the same identifiers as the original process. Additionally,
any pending signals are recorded and file descriptors are saved. The memory of
the process is marked for copy-on-write just as when a normal fork call is made.
During execution of the speculative process, output operations are buffered for
playback when the speculation is determined to be correct. Inter-process communication is generally allowed, but the process receiving the communication is
made to checkpoint and become speculative as well. The dependency between
the two processes is tracked, so misspeculation will cause a series of rollbacks
to occur. Outside of the kernel, the speculative state of a process is
indeterminate.
2.5.2 Compiler
Any language with support for parallel programming will need some form of compiler support, even if it simply interprets a trivial syntax extension as a call to a
run-time library. More powerful analysis by a compiler can allow some degree of
automatic parallelization. The Mitosis compiler implements a form of run-ahead
POSH relies on profile information to select tasks that are likely to speculate
correctly. Tasks are initially created for every loop body and subroutine (and the
continuations of both) and then expanded or pruned to meet size restrictions:
large enough to overcome the cost of creation and small enough to be manageable.
Like the Mitosis system, POSH relies on hardware support for detection of violations of the sequential semantics of the program. In both cases, the assumption
is that threads are explicitly spawned. While POSH specifies that the architecture
provides a spawn instruction, Mitosis leaves the architecture details completely unspecified. In a departure from the fork/join notation, POSH assumes the spawned
task will explicitly commit, while the parent task does nothing to explicitly reclaim the child. If the parent attempts to read results from the child before it is
complete, misspeculation will occur.
Rather than inserting spawn and commit, a compiler could automatically generate the synchronization necessary to guarantee sequential ordering. Past work
has used data-flow analysis to insert wait and signal statements, similar to the
grant and allow instructions introduced in Section 2.3.1, to pipeline loop bodies [64]. The precise semantics of the instructions only indicate that access to a
particular variable is guarded (equivalent to introducing a lock) and ordered (version
numbered). It must be assumed that instructions to initiate and finalize tasks are
also generated.
Zhai et al. only consider loop bodies as candidates for parallelization. Naive
placement of the synchronization would place the request at the beginning
of the task (loop body) and the release at the end, encapsulating the entire loop
body in a single critical section. The region of code between the wait and signal
represents the critical section in which only the current task can access the
variable, and like any critical section it should be made as small as possible. To
optimize the interleaving of the tasks, the wait statement should be placed as
late as possible while still preceding all accesses to the variable. Likewise, the
signal should be placed as early as possible, as long as no further writes follow it.
To further reduce the size of the critical section, instructions may be reordered
along with the synchronization instructions. By treating a signal instruction as
a read and following the dependence chain up through a control flow graph, the
2.5.3 Race Detection
Race detection is concerned with determining whether two tasks can be run in
parallel or must be performed in series. One way this can be done is by monitoring
threads during execution to maintain a representation of their relationship as
either inherently serial or possibly parallel. During specific run-time
operations the representation can be queried to determine whether a serial relationship
has been violated [17]. For example, when threads access shared data, the order
of accesses must match the order of the serial threads.
During execution a tree is maintained to represent the threads. The leaves of
the tree represent threads, while the internal nodes indicate either a series or
a parallel relationship. To determine the relationship between two threads, their
least common ancestor holds the appropriate marker. For a given execution tree,
the leaves are numbered with a depth-first traversal, and given a second number
by traversing the children of parallel nodes in the opposite order. Given these
values, two nodes are in series if the values indicate the same order, while the
nodes are executing in parallel if the values are in opposite orders.
Early implementations required that the reverse ordering of nodes be maintained
at run time, requiring computation on the order of the depth of the tree. The
approach in [3] allows for parallel maintenance of, and queries to, the series/parallel
information in linear time.
The process of data race detection can be made more efficient by reducing, at
compile time, the number of objects that need to be monitored. The eraser analysis tool achieves this using a number of deep analysis techniques [38]. Initially,
all accesses within a target Fortran program are assumed to require annotation
(including not just recording of accesses, but initialization and cleanup of the
metadata that allows such recording). Using dependence analysis, eraser prunes
annotation around statements without dependencies. With intra-procedural analysis, including alias, modification, and reference information, as well as whether a
procedure is ever used in a parallel construct, annotation for a procedure's parameters may be removed as well. After pruning as much annotation as possible,
the remaining checks are handled using calls into an associated run-time library
that tracks data accesses during execution.
2.6 Correctness Checking

2.6.1 Heavyweight
Recently, three software systems have used multi-processors for parallelized program profiling and correctness checking. All use heavyweight processes, and all are
based on Pin, a dynamic binary rewriting tool [36]. SuperPin uses a signature-checking
scheme and strives to divide the complete instrumented execution into time slices
and execute them in parallel [62]. Although fully automatic, SuperPin is not
foolproof, since in theory the slices may overlap or leave holes in their coverage.
The speculative execution system I describe in Chapter 5 is not designed for
fully automatic program analysis, although I describe a use case in which automatic analysis is enabled with some manual effort. The resulting system guarantees complete and unique coverage during parallel error checking using a
programming interface that allows selective checking. This is useful when checking
2.6.2 Hardware Techniques
Fast track is closely related to several ideas explored in hardware research. One
is thread-level speculative parallelization, which divides sequential computation
into parallel tasks while preserving their dependencies. The dependencies may be
preserved by stalling a parallel thread, as in the Superthreaded architecture [59],
or by extracting dependent computations through code distillation [67] and compiler scheduling that reduces the critical forwarding path [64]. These techniques
aim only to reorganize the original implementation rather than to support any
type of alternative implementation. Fast track is not fully automatic, but it is
programmable and can be used by both automatic tools and manual solutions. Its
run-time system checks correctness differently: the previous hardware techniques
check dependencies or live-in values, while fast track checks result values or some
user-defined criterion.
Hardware-based thread-level speculation is among the first to automatically
exploit loop-level and method-level parallelism in integer code. In most techniques, the states of speculative threads are buffered and checked by monitoring
the data accesses in earlier threads either through special hardware additions to
a processor [54], bus snooping [10], or an extended cache coherence protocol [56].
Since speculative states are buffered in hardware, the size of threads is usually no
more than thousands of instructions. A recent study classifies existing loop-level
techniques as control, data, or value speculation and shows that the maximal
speedup is 12% on average for SPEC2Kint assuming no speculation overhead and
unlimited computing resources [28]. The limited potential at the loop level suggests that speculation needs to be applied at larger granularity to fully utilize
multi-processor machines.
2.6.3 Monitoring
Data breakpoints are also known as watch points, as opposed to control breakpoints.
monitoring.
Another approach to reducing the overhead of debugging is to use sampling
over a large number of runs. One such technique introduces code instrumentation
to record a number of boolean predicates based on run-time program behavior [34].
The predicates represent possible control flow (e.g., was a branch taken), the
return value of a function (whether it is positive, negative, or zero), and the relationship between variables in the same scope (whether one is greater than, less
than, or equal to the other). The total number of predicates is extremely large,
and so is the overhead of potentially recording all of them. This cost is limited
by evaluating the predicate instrumentation infrequently, based on a random
choice at each instance. By additionally recording whether each predicate was
ever observed, it is possible to evaluate the probability that a given predicate
predicts program failure. Although the approach that Liblit et al. discuss allows
for useful analysis of crash reports from deployed programs, it is not a general
solution to program debugging due to the number of samples needed before a
bug can be isolated. For the same reason, such sampling techniques are not
applicable to the monitoring needed by speculative execution.
Process-Based Speculation
Process based speculation consists of a run-time system and a programming interface. The run-time system is built as a code library with which a programmer
might link their program. The programming interface defines how the programmer would invoke calls into the run-time library. In this chapter I describe the
implementation of the core run-time system. Descriptions of the programming
interface and details of the runtime for particular types of speculative parallelism
are addressed in Chapters 4 and 5.
3.1 Implementation
terminating the speculative execution and reverting any effects it may have
had.
In the remainder of this chapter I will describe how process-based speculation
achieves each of these goals.
3.1.1 Creation

3.1.2 Monitoring
The signal handler routine has three basic responsibilities: to ensure the violation
is a result of the run-time system's monitoring, to record the access for later
reference, and to remove the restriction.
The operating system detects memory access violations in the normal course
of operation in order to protect processes. Because a process might access regions
of memory in violation of the operating system's typical restrictions, the run-time
system must ensure such accesses are not allowed to pass. The runtime must
differentiate between accesses to memory regions that it has restricted, and accesses
the program should never be permitted to make. The signal itself identifies whether
the access was made to a memory location that is not mapped (maperr) or to a
region of memory to which the process does not have access (accerr).
Once the location of the access has been deemed legitimate, the run-time
system must record the access for later reference. The speculative run-time system
uses an access bitmap to represent each block of memory. One bit for each page
equals one word for every 32 pages. With a page size of 4096 bytes, the access map
uses one word to record accesses on 131,072 bytes. Because much of the access
map will be zeros, and most of it will not be modified, the OS will typically be
able to map several of these pages to the same zero-filled data.
Once the access has been recorded the process must be allowed to continue
its operation. Additionally, there is no reason to record future access to the same
block. The run-time system can safely remove memory protection for the current
block.
3.1.3 Verification
Once the sequential process has advanced far enough the run-time system must
verify that the speculative execution is correct. Such verification requires an
analysis of the access maps for both processes, but without special consideration
each process would only have access to its own map. The run-time system can
facilitate the access map analysis in two ways. One option is to push a copy of
one of the maps through a POSIX pipe established during the spawning process,
as indicated in Section 3.1.1. In practice it is only necessary to transfer the non-zero
portions of the map. The second option is to create the maps in a segment of
memory that has been explicitly shared between the two processes.
The details of verification, notably the precise point at which it can be
performed and which types of accesses need to be validated, depend on the
type of speculation being performed. These details are discussed in Sections 4.3.3
and 5.3.3.
3.1.4 Abort
Speculative execution requires a mechanism for unrolling or aborting the speculative portion of a process when the speculation proves to be incorrect. To
abort speculative execution that has proven incorrect, process-based speculation
can simply kill the running process. Because the Linux kernel protects the memory
space of running processes from access by other processes, it is not possible for
the speculative process to directly affect the non-speculative portion of execution.
As a result, once the speculative execution is killed, the non-speculative process
continues as it would in the sequential case.
3.1.5 Commit
The approach for committing a speculative task amounts to terminating the
non-speculative process and allowing execution to continue based on what was
computed speculatively. In addition, the meta-data used to track memory accesses
must be updated to reflect the fact that the speculative process is no longer
speculative.
3.2 Advantages
Using processes for speculative parallelism has several advantages over thread-based approaches. Perhaps the most significant of these is portability. By using
POSIX constructs, the speculative run-time system can be built for any POSIX
operating system. The system does not rely on any specific hardware architecture
or features. The run-time system and compiler support presented in this work
have been built and executed on Linux and Mac OS X.
The access monitoring used by thread-based approaches relies on instrumentation
of data accesses. This instrumentation must be explicitly applied to both
program code and any libraries used during execution. The process-based system
does not require any attention to external libraries to perform correctly. This
flexibility also improves the portability of the run-time system, because only the
annotated source code needs to be recompiled.
Process-based memory access monitoring also has the advantage of incurring a
constant cost for each location accessed, rather than a cost at every single access
as in a thread-based system. Additionally, because the monitoring is done at the
page level, this cost can be amortized for large tasks with multiple accesses to the
same page.
In addition to monitoring the locations of data accesses, the process-based
system compares the data values for conflicts. Value-based checking guarantees
that identical changes to the same data will not be reported as a conflict,
a problem known as false sharing. In order to support value-based checking, a
run-time system must maintain multiple copies of the data. While the process-based run-time system gains this for free through the operating system's virtual
memory system, thread-based systems need to introduce additional data copies.
Additionally, these multiple copies must be explicitly managed to differentiate
accesses and guarantee that rollback is possible.
3.3 Disadvantages
The process-based protection has a high overhead. However, much of this overhead is inherently unavoidable for a software scheme that supports unpredictable
computations. A major goal of this thesis is to show that general protection can
be made cost effective by three techniques. The first is programmable speculation.
Since the overhead depends on the size of the (write) accessed data rather than
the length of the ppr region, it can be made negligible if the parallel task
is large enough.
Second, most overheads (starting, checking, and committing) are off the critical
path, so the non-speculative execution is almost as fast as the unmodified
sequential execution. Moreover, a race is run in every parallel region, where the
correct speculative result is used only when the speculation finishes faster than
the would-be sequential execution. The overhead of determining the winner of
this race is placed in the speculative execution, off the critical path.
Last, the run-time system uses value-based checking, which is more general
than dependence-based checking based on the Bernstein conditions [5]. Value-based checking permits parallel execution in the presence of true dependencies,
and it is one of the main differences between the process-based system and existing
thread-based systems (as discussed in Section 4.3.2).
3.4 Special Considerations

3.4.1 Input and Output
correct order. Until a speculative process has confirmed that its initialization and
execution were correct (i.e., that all previous speculation was correct), it buffers all
terminal output and file writes. Given correct execution, any output the process
produces will be the same as what the sequential program would have generated.
Program output buffering is established by creating a temporary file in which
to buffer the output that would otherwise be sent to the standard output. Such
a file is created by the run-time system each time a new speculative process is
created. At link time, we use a linker option to replace calls to the known input
and output functions with wrappers included with the run-time library. These
wrappers send file output to the redirection temporary file (in the case of printf)
or abort the speculative process (in all other cases). Although it should be possible
to detect writes to the standard error output using fprintf, such support has not
been implemented.
The task of committing the redirected output is addressed by rewinding to
the beginning of the redirection temporary file, reading it in blocks, and writing
those blocks to the standard output. If the speculative process is aborted, the
temporary redirection file is closed and deleted.
3.4.2 Memory Allocation
Dynamic memory allocation can potentially pose a problem for speculative execution because, unlike stack allocation, its implementation is library-based and
the mechanism is not known in advance. The root of the problem for speculative
execution is that the implementation may not return the same sequence of
memory locations when the same sequence of requests is made. Even in cases
where the speculative and non-speculative processes are performing exactly the
same computations, the values of some of their pointers may differ because the dynamic
3.4.3 System Signals
The speculative parallel run-time system uses operating system signals to indicate
or initiate state changes among the running processes. The total number of
available signals is limited, and the user program being extended may be relying
on some of the same signals. Some of the signals we are using are slightly reinterpreted (for example, special action may be taken on termination) while others
have no default meaning.
The run-time system does not attempt to preserve any existing signal handlers
installed by the user program, but it could be extended to identify them. The
user-installed signal handler could be stored and invoked from within the runtime's
handler. While using signals would still provide a means to actively alert another
process, we would also need to differentiate signals initiated by the run-time
system from those of the user program. This could be accomplished using a shared
flag, which the run-time system would consult before either dispatching the signal
to the original handler or processing it.
Ultimately, it is not possible to guarantee that the user program does not install
a new signal handler during execution, over-writing the run-time system's handler
functions. One solution would be to replace or wrap the handler installation
functions to ensure the run-time system's handlers are preserved, while any new
handlers are indirectly dispatched. Because the signals the run-time system is
using are intended for user programs, this change could be performed during
compilation.
Speculative Parallelism
Introduction
In this chapter I describe a type of process-based speculative execution referred
to as Behavior Oriented Parallelism (or bop). The bop system is designed to
introduce parallelism into sequential applications. Many sequential applications
are difficult to parallelize because of problems such as unpredictable data access,
input-dependent parallelism, and custom memory management. These difficulties
motivated the development of a system for behavior-oriented parallelization, which
allows a program to be parallelized based on partial information about program
behavior. Such partial information would be typical of a user reading just part of
the source code, or a profiling tool examining a small number of inputs.
The bop style of speculative parallelism allows for some portions of code to
be marked as potentially safe for parallel execution. I refer to these regions of
code as possibly parallel regions, abbreviated ppr. The goal of bop is to allow a
programmer or an analysis tool to provide hints about parallel execution without
needing to guarantee that the parallelism is safe in all cases.
In Section 4.2 I describe the programmatic way in which code is annotated for
bop. The burden on the programmer is intended to be minimal, and the interface
4.1 Design
The bop system uses concurrent executions to hide the speculation overhead off
the critical path; the critical path determines the worst-case performance, in
which all speculation fails and the program runs sequentially.
4.1.1
The execution starts as the lead process, which continues to execute the program
non-speculatively until the program exits. At a pre-specified speculation depth k,
up to k processes are used to execute the next k ppr instances. For a machine
with p available processors, the speculation depth is set to p - 1 to make full
use of the CPU resources.
Figure 4.1 illustrates an example run-time setup for either the sequential execution or the speculative execution of three ppr instances. As shown in Part 4.1(b),
when the lead process reaches the start marker of P, it forks the first spec process
and continues to execute the ppr instance P. The first spec jumps to the end
marker of P and executes the next ppr instance Q. At the start marker of Q, it
forks the second spec process, which jumps ahead to execute the third ppr instance R.
At the end of P, the lead process becomes the understudy process, which re-executes the next ppr instance non-speculatively. In addition, it starts a parallel
[Figure 4.1: (a) The sequential execution of three ppr instances P, Q, and R, with their start and end markers. (b) A successful parallel execution: speculation starts by jumping from the start marker to the end marker, and commits when reaching another end marker. The lead process forks spec 1 at the start marker of P and the understudy branch starts at the end marker of P; spec 2 starts at the start marker of Q. When spec 1 and spec 2 commit, spec 2 finishes R, becomes the next lead, and aborts the understudy (the parallel execution wins).]
4.1.2
bop assumes that the probability, the size, and the overhead of parallelism are all
unpredictable. The understudy provides a safety net not only for correctness when
the speculation fails, but also for performance when speculation is slower than
the sequential execution. For performance, bop holds a two-way race between the
non-speculative understudy and the team of speculative processes.
The non-speculative understudy represents the worst-case performance along the
critical path. If all speculation fails, it sequentially executes the program. As I
will explain below, the overhead for the lead process consists only of the page-based
write monitoring for the first ppr instance. The understudy runs as the original
code without any monitoring. As a result, if the granularity of a ppr instance is
large or when the speculation depth is high, the worst-case running time should
be almost identical to that of the unmodified sequential execution. On the other
hand, whenever the speculation finishes faster than the understudy, it means a
4.1.3
Figure 4.1 shows the expected behavior when an execution of pprs runs from
BeginPPR to EndPPR. In general, the execution may reach an exit (normal or
abnormal) or an unexpected ppr marker. Table 4.1 shows the actions taken by
the lead process, its understudy branch, and the spec processes when encountering
an exit, an error, or an unexpected ppr marker.
The abort by spec in Table 4.1 is conservative. It is possible for speculation
to reach a program exit point during correct execution, so an alternative scheme
might delay the abort and salvage the work if it turns out to be correct. We favor
the conservative design for performance. Although it may recompute useful work,
the checking and commit cost will never delay the critical path.
The speculation process may also allocate an excessive amount of memory
and attempt permanent changes through I/O and other OS or user interactions.
The latter cases are solved by aborting the speculation upon file reads, system
calls, and memory allocation exceeding a pre-defined threshold. The file output is
managed by buffering and is either written out or discarded at the commit point.
The current implementation supports stdout and stderr for the pragmatic purpose
of debugging and verifying the output. Additional engineering effort could add
support for regular file I/O.
Strong Isolation
I describe the bop implementation as having strong isolation because the intermediate results of the lead process are not made visible to speculation processes
until the lead process finishes the first ppr. Strong isolation comes naturally with
process-based protection. It is a basic difference between bop and thread-based
systems, where the updates of one thread are visible to other threads, which I
describe as weak isolation. I discuss the control aspect of the difference here and
complete the rest of the comparison in Section 4.3.2, after describing the data
protection.
Weak isolation allows opportunistic parallelism between two dependent threads,
if the source of the dependency happens to be executed before the sink. In the
bop system, such parallelism can be made explicit and deterministic using ppr
directives by placing dependent operations outside the ppr region. As an example, the code outside ppr in Figure 4.1 executes sequentially. At the loop level,
the most common dependency comes from the update of the loop index variable.
With ppr, the loop control can be easily excluded from the parallel region and
the pipelined parallelism is definite instead of opportunistic.
The second difference between strong and weak isolation is that strong isolation needs no synchronization during the parallel execution, whereas weak isolation
must synchronize the lead and spec processes when communicating updates between the two. Since this synchronization delays the non-speculative execution, it adds visible overhead to thread-based systems when
speculation fails. bop does not suffer this overhead.
Although strong isolation delays data updates, it detects speculation failure
and success before the speculation ends. Like systems with weak isolation, strong
isolation detects conflicts as they happen because all access maps are visible to all
processes for reads (each process can only update its own map during the parallel
execution). After the first ppr, strong isolation can check for correctness before
the next speculation finishes by stopping the speculation, checking for conflicts,
and communicating data updates. As a design choice, bop does not abort speculation early, because of the property of pipelined parallelism explained at the
end of Section 4.1.1: speculation processes may improve the program speed,
no matter how slowly they execute, when enough of them are working together.
4.2
Programming Interface
In addition to the ppr markers, the bop programming interface includes two
other important components. The second component is a user-supplied list of global and
static variables that are privatizable within each parallel process. By specifying
where the variables are initialized, the system can treat their data as shared until
the initialization and as private thereafter. The third component is described in
Section 4.2.3.
4.2.1
Region Markers
Algorithm 4.2.1 Example use of bop to mark a possibly parallel region of code
within a loop.
for (int i = 0; i < N; ++i) {
    if (!BeginPPR(0)) {
        tab[i] = compute(i);
    } EndPPR(0);
}
These two functions both accept a single scalar value that identifies the region to
ensure the markers are properly matched, which allows for nesting. Using the
identifier, an incorrectly matched marker can be safely ignored on the assumption
that another marker matches it and is also ignored.
Algorithm 4.2.2 Example use of bop including EndPPR marker.
for (int i = 0; i < N; ++i) {
    if (!BeginPPR()) {
        tab[i] = compute(i);
    }
}
EndPPR();
aggregate(tab);
In the loop-body example shown in Listing 4.2.1, there is little meaning to
the else branch of the BeginPPR conditional. One can view the second branch as
containing any execution until the next ppr marker of any kind. In straight-line
code it may be cleaner to explicitly enclose a block of code within an else
branch to place it in juxtaposition to the speculative path. The code in Listing 4.2.3 represents
a case in which the else branch is explicitly used to demarcate
distinct paths of execution that may be processed in parallel. Note that there is
no reason a simple if/else pair must be used; in the listing, a nest
of conditions is used.
53
t0
t1
t2
t3
BP BP EP
t0
t1
t2
t3
BP BP EP
t4
BQ
t5
t6
EP EQ
t0
t0
t1
t3
PPRP
t4
t5
BQ
t6
EP EQ 54
t4
t6
PPRQ
t1
t3
t4
t6
to t6 , and will be run in parallel. The other fragments of the execution will be
PPRP
PPRQ
55
affects the parallelism but not the correctness or the worst-case performance.
4.2.2
Post-Wait
The basic ppr structure allows for regions of code to be executed in parallel if
there are no dependencies carried from one to another. In many cases a loop body
may have carried dependencies, but be parallelizable if care is taken. Consider
a loop that is structured in stages so that some stages carry a dependency, but
the dependency is consumed by the same stage in the next iteration. In such a
scenario, the stages of the loop body can be viewed as stages of a pipeline.
Post-Wait is an extension of the basic ppr mechanism provided by the bop
system to allow for pipelining portions of the possibly parallel region. Using the
post-wait interface, the speculative processes can be synchronized so that the writes
in the earlier process occur before the corresponding reads at run time.
Algorithm 4.2.4 Example of a pipelined loop body.
for (int i = 0; i < N; ++i) {
    B[i] = f(A[i]);
    C[i] = g(A[i]);
    D[i] = h(B[i], C[i]);
}
4.2.3
Feedback
The third component of the bop interface is run-time feedback to the user. When
speculation fails, the system generates output indicating the cause of the failure,
particularly the memory page on which conflicting accesses occurred. In
our current implementation, global variables are placed on separate memory pages
by the compiler. As a result, the system can output the exact name of the global
variable when it causes a conflict. A user can then examine the code and remove
56
the conflict by marking the variable privatizable or moving the dependency out
of the parallel region.
Three features of the API are especially useful for working with large, unfamiliar code. First, the user does not write a parallel program and never needs
parallel debugging. Second, the user parallelizes a program step by step as hidden dependencies are discovered and removed one by one. Finally, the user can
parallelize a program for a subset of inputs rather than all inputs. The program
can run in parallel even if it has latent dependencies.
4.3
4.3.1
Run-Time System
Creation
On the first instance of BeginPPR, the run-time system initializes the signal handlers and memory protection used by all of the subsequent processes. The beginning
of a possibly parallel region is marked by a call to the system fork function. The
fork creates a new operating system process, which will act as the speculative process. This new process is considered to be the child of the preexisting
process, which is non-speculative. The original process returns immediately and
4.3.2
Monitoring
The bop system guarantees that if the speculation succeeds, the same user-visible output is produced as in the sequential execution. bop partitions the address space of a running program into three disjoint groups: shared, checked,
and private. More formally, D_all = D_shared + D_checked + D_private, and any two of
D_shared, D_checked, and D_private do not overlap.
For the following discussion we consider two concurrent processes: the lead
process, which executes the current ppr instance, and the spec process, which executes
the next ppr instance and the code in between. The cases for k (k > 1) speculation
processes can be proved by induction, since they commit in sequence in the bop
system.
case CTRL:
    // CTRL is the initial state
    memset(accMapPtr, 0, ACC_MAP_SIZE);
    myStatus = MAIN;
    mySpecOrder = 0;
    // fall through
case SPEC:
    pprID = id;
    int fid = fork();
    if (-1 == fid) return 0;  // fork failure
    if (fid > 0) {
        // the MAIN or older SPEC
        specPid = fid;  // track the SPEC process ID
        if (myStatus == MAIN) BOP_setProtection(PROT_READ);
        return 0;
    }
    // the newer SPEC continues here
    specPid = 0;
    myStatus = SPEC;
    mySpecOrder++;
    setpgid(0, SP_gpid);
    SP_RedirectOutput();
    if (mySpecOrder == 1)
        // set this up only once
        BOP_setProtection(PROT_NONE);
    return 1;
}
}
By using Unix processes for speculation, the bop system eliminates all anti-dependencies and output dependencies through the replication of the address
space, and detects true dependencies at run time. An example is the variable
shared in Figure 4.3.2, which may point to some large dictionary data structure.
Page-based protection allows concurrent executions as long as a later ppr does
not need the entries produced by a previous ppr. The overwrites by a later ppr
are fine even if the entries are used concurrently by a previous ppr.
The condition is significantly weaker than the Bernstein condition [5], which
requires that no two concurrent computations access the same data if at least
one of the two writes to it. The additional parallelism is possible because of
the replication of modified data, which removes anti-dependencies and output
dependencies. The write access by spec k never causes failure in previous spec
processes. As an additional optimization, the last spec process is only monitored
for data reads. In fact, when the system is limited to only one spec process, a
case termed co-processing, the lead process is monitored only for writes and the
spec only for reads.
Page-based protection has been widely used for supporting distributed shared
memory [29, 32] and many other purposes including race detection [49]. While
these systems enforce parallel consistency among concurrent computations, the
bop system checks for dependence violation when running a sequential program.
A common problem in page-level protection is false-positive alerts. We alleviate this problem by allocating global variables on separate memory pages. Writes
to different parts of a page may be detected by checking the difference at the end
of ppr, as in [29]. In addition, the shared data are never mixed with checked and
private data on the same page, although at run time newly allocated heap data
are private at first and then converted to shared data at EndPPR.
Likely private data The third class of objects is private data, which is initialized before being used and therefore causes no conflict. In Figure 4.3.2, if private
is always initialized before it is used, the access in the current ppr cannot affect
the result of the next ppr, so any true dependency caused by it can be ignored.
Private data come from three sources. The first is the program stack, which
includes local variables that are either read-only in the ppr, or always initialized
before use. Intra-procedure dataflow analysis is capable of identifying such data
for most programs. When the two conditions of safety cannot be guaranteed by
compiler analysis, for example due to unknown control flow or the escape of a local
variable's address into the program heap, we redefine the local variable to be a
global variable and classify it as shared data. Recursive functions are not handled
specially, but could be managed either using a stack of pages or by disabling the
ppr.
The second source of private data is global variables and arrays that are always initialized before use in the ppr. The standard technique to detect this
is inter-procedural kill analysis [1]. In general, a compiler may not always ascertain all cases of initialization. For global data whose access is statically known
in a program, the compiler automatically inserts calls after the initialization assignment or loop to classify the data as private at run time. Any access by the
speculation process before the initialization causes it to be treated as shared data.
For (non-aggregate) data that may be accessed by pointers, the system places it
on a single page and treats it as shared until the first access. Additionally, we
allow the user to specify the list of variables that are known to be written before
read in ppr. These variables are reinitialized to zero at the start of a ppr instance.
Since we cannot guarantee write-first access in all cases, we call this group likely
private data.
The third type of private data is newly allocated data in a ppr instance. Before
BeginPPR, the lead process reserves regions of memory for speculation processes.
Speculation would abort if it allocates more than the capacity of the region. The
main process does not allocate into the region, so at EndPPR, the newly allocated
data can be merged with the data from the speculation process. For programs that
use garbage collection, we encapsulate the heap region of spec processes, which
we will describe when discussing the test of a lisp interpreter. Another solution is
T_critical = T_seq + c1 * (S_shared / S_page) + c2 * (S_modified-by-1st-ppr + S_checked)
The two terms after T_seq are the cost from data monitoring and copying on the
critical path, as explained below.
For monitoring, at the start of a ppr, the lead process needs to set and reset
the write protection and the access map for shared data before and after the first
ppr instance. The number of pages is the size of shared data S_shared divided
by the page size S_page, with a constant cost c1 per page. During the instance, a
write page fault is incurred for every page of shared data modified in the first ppr
instance. The constant per page cost is negligible compared to the cost of copying
a modified page.
Two types of copying costs may appear on the critical path. The first is for
pages of shared data modified by the lead process in the first ppr instance and
(among those) pages modified again by the understudy. The second cost is taking
the snapshot of checked data. The cost in the above formula is the worst case,
though the copy-on-write mechanism in modern OS may completely hide both
costs.
Data copying may hurt locality across ppr boundaries, although the locality
within is preserved. The memory footprint of a speculative run is larger than the
sequential run as modified data are replicated. However, the read-only data are
shared by all processes in main memory and in shared cache, which is physically
indexed. As a result, the footprint may be much smaller than running k copies of
a program.
(r_1, S_all^t1) ⇒_p (r_2, S_all^t2) : the execution of a process p from one point to another.
Figure 4.3 shows the parallel execution and the states of the lead and the spec
processes at various times. If a parallel execution passes the three data protection
schemes, all program variables in our abstract model can be partitioned into the
following categories:
V_wf : variables whose first access by spec is a write (wf stands for write first).
V_excl-lead : variables accessed only by the lead process in instance P.
V_excl-spec : variables accessed only by the spec process.
[Figure 4.3 residue: the states of the main/lead process, the speculation process (spec), and the understudy process (undy), from (r_b, S^init) through the intermediate states (r_e, S^main), (r_e, S^spec), and (r_e, S^undy) to the committed state (r_e, S^seq).]
Examining Table 4.2, we see that D_shared contains data that are either accessed
by only one process (V_excl-lead and V_excl-spec), read only in both processes, or not
accessed by either (V_chk). D_private contains data either in V_wf or V_chk. D_checked is
a subset of V_chk. In addition, the following two conditions are met upon a successful
speculation.
1. The lead process reaches the end of P at P^e, and the spec process, after leaving
P^e, executes the two markers of Q, Q^b and then Q^e.
2. The state of V_chk is the same at the two ends of P (but it may change in the
middle), that is, S_chk^init = S_chk^lead.
S^parallel = S_all^spec − S_excl-lead^spec + S_excl-lead^lead
In the following proof, each operation r_t is defined by its inputs and outputs,
which all occur after the last input. The inputs are the read set R(r_t). The outputs include the write set W(r_t) and the next instruction to execute, r_{t+1}. For
clarification, an operation is an instance of a program instruction. For simplicity of presentation, the symbol r_x is overloaded as both the static instruction
and its dynamic instances. To distinguish the two in the text, the former is referred to as an
instruction and the latter as an operation, so there may be only one instruction
r_x but any number of operations r_x.
Theorem:
If the spec process reaches the end marker of Q, and the protection in Table 4.2
passes, the speculation is correct, because the sequential execution would also
reach Q^e with a state S^seq = S^parallel, assuming that both the sequential and the
parallel executions start with the same state, S^init at P^b.
Proof:
Consider the speculative execution, (P^e, S^init) ⇒_spec (Q^e, S^spec), for the part of the
lead. Neither is it in V_excl-spec, since it is
modified in the lead process. The only case left is for v to belong to V_chk. Since
S_chk^lead = S_chk^init, after the last write the value of v is restored to the beginning state
where spec starts, and consequently cannot cause r_t' in spec to see a different value
than r_t does in the sequential run. Therefore r_t and r_t' cannot have different inputs
and produce different outputs, and the speculative and sequential executions must
be identical.
For V_excl-lead and V_excl-spec, S^parallel holds
their values at commit time. The remaining part of V_chk is not accessed by lead
or spec and still holds the same value as S^init. It follows that the two states
S^parallel and S^seq are identical, which means that S^parallel is correct.
The above proof is similar to that of the Fundamental Theorem of Dependence
(Sec. 2.2.3 in [1]). While the proof in the book deals with statement reordering,
the proof here deals with region reordering and value-based checking. It rules
out two common concerns: first, that the intermediate values of checked data
might lead to incorrect results in unchecked data; second, that speculation might
follow incorrect control flow. In bop, the three checking
schemes work together to rule out both and ensure these strong guarantees.
Comparisons
Strong versus weak isolation, as discussed in Section 4.1.3, is a basic difference between
process-based bop and thread-based systems, which include most hardware and
software speculation and transactional memory techniques. The previous section
discussed the control aspect, while the data protection and system implementation
are discussed below. The comparisons are summarized in Table 4.3.
Weak isolation needs concurrent access to both program data and system data,
as well as synchronization to eliminate race conditions between parallel threads
and between the program and the run-time system. The problem is complicated
if memory operations may be reordered by the compiler or by hardware, and
the hardware uses weak memory consistency, which does not guarantee correct
results without explicit synchronization. In fact, concurrent threads lack a well-defined memory model [7]. A recent loop-level speculation system avoids race
conditions and reduces the number of critical sections (to 1) by carefully ordering
4.3.3
Verification
worth noting that these speculative processes are performing useless computation,
but there is no other useful ppr-related work that could have been scheduled.
Reaching a program exit point in the understudy process is equivalent to doing so
in the main process, except that buffered output must be committed.
If a speculative process reaches a program exit point it cannot be permitted to
commit normally. The current bop system simply forces the speculative process
to abort, which allows the corresponding understudy to eventually reach the exit
point and complete. If the speculative process is the child of another speculative
process, that process is notified of the failure, which allows it to change directly
to control status and elide any further coordination with the terminal speculative
process. An alternative is for the speculative process to treat the exit as the
end marker of the current ppr. This would cause the speculative process to
synchronize with the main process once it reaches its own end marker, after which
the process will potentially commit and exit without delaying until the understudy
reaches the same point.
4.3.4
Commit
The bop commit routine is invoked when a process reaches an EndPPR marker.
The functionality depends on the state of the process: sequential and control
processes are ignored, while the other states are handled specifically. If the identifier parameter does not match the current ppr identifier, then the end marker
is ignored.
The commit routine for the speculative process involves synchronizing with the
non-speculative processes, as well as maintaining order among the other speculative processes. The actual tasks are provided in Listing 4.3.4 but can be summarized as follows: We first pass our token to the next waiting speculative process.
We then wait for the previous speculative process to indicate that it has completed
switch (myStatus) {
case SPEC:
    // Tell the parent to start early termination.
    if (mySpecOrder > 1)
        kill(getppid(), SIGUSR1);
    exit(EXIT_SUCCESS);
case UNDY:
    // Commit any buffered output.
    SP_CommitOutput();
    // (fall through)
case MAIN: case CTRL: case SEQ:
    BOP_pipeClose();
    // Kill all runtime processes (including self).
    kill(SP_gpid, SIGTERM);
    // Wait until the signal propagates.
    pause();
    exit(EXIT_SUCCESS);
    break;
default:
    exit(EXIT_FAILURE);
}
}
(assuming we are not the first). If this process is the first member of a group of
speculative processes, then it must also wait for the previous group to have committed. Once the order among the speculative processes is confirmed, the process
verifies that the access maps are correct and copies the data changes it has made to
the next speculative process. Synchronization with the understudy is handled
by determining its process identifier, signaling the understudy, and waiting for
confirmation. Finally, the speculative process commits its output.
The commit routine for the understudy process is fairly simple. This is because
the understudy is considered to be on the critical path and much of the burden of
work has been placed elsewhere. Additionally, the understudy is not speculative.
As depicted in Listing 4.3.4, the understudy keeps a count of each EndPPR marker it
reaches. Because the speculative processes are placed into groups, the understudy
must complete all of the work of one group in order to succeed. The understudy
officially beats the speculative processes once it blocks the signal they would use to
declare completion. After this point the understudy can safely change its status to
control (which is not to be confused with being the lead process). The speculative
processes are killed, and output from the understudy committed.
The commit routine for the lead process (MAIN) is somewhat anomalous in
that it does not actually commit anything. The main process is responsible for
spawning the understudy process, and for synchronizing with the first speculative
process by passing its own data changes.
4.3.5
Abort
The abort routine amounts to little more than the speculative process exiting. Because the output has been buffered and the operating system's virtual memory
isolates any changes made, the process has no outside impact unless it is explicitly committed. The run-time system is structured so that if the speculative
process aborts it means that either the understudy has finished the parallel region
first, or that there is an error indicated in the access maps. In either of these cases
the understudy process becomes the control process and continues running. If the
understudy process is aborting, then it must be the case that the spec process has
succeeded. Because the understudy is useless at that point, it simply exits.
4.4
The bop system can be used to express parallelism in several ways. At the program level, parallelism can be broken into three categories: instruction level, data,
and task. The coarse-grained nature of process-based speculative parallelism
cannot take advantage of instruction-level improvements, but it does address both
data and task parallelism.
[Figure 4.4 graphic: the states START, CTRL, MAIN, SPEC i, SPEC i+1, UNDY, and END1 through END4, connected by edges labeled B and E.]
Figure 4.4: State diagram of bop. Edge labels represent begin and end ppr
markers (B and E respectively).
80
4.4.1
Data-Parallel
Data parallelism is possible when the same operation can be performed on many
data elements. This form of parallelism is often expressed in a loop, and the conversion from a sequential program will often focus there. It is not necessary that
all instances of the parallel region perform exactly the same sequence of instructions, so control flow can change within the region. This is not the case in the
simplest SIMD (single instruction, multiple data) style of parallelism. Other systems
may offer an explicitly parallel loop, for example the DOALL construct available in
Fortran, or the parallel for directive in OpenMP, in which a loop is marked
as parallel. The same effect is achieved with bop by making the loop body conditional on a BeginPPR marker and placing the EndPPR marker at the end of the
loop body.
4.4.2
Task-Parallel
Task parallelism exists when separate portions of the execution can be performed
independently. This can be implemented with the bop system by placing one
portion of otherwise straight-line code in a conditional block based on the return of
BeginPPR and finalized with an EndPPR marker. At some later point, an additional
EndPPR marker indicates that the speculative process needs the results of the
parallel task. At run time, the main process will execute the code within the ppr
block and spawn its understudy at its conclusion. The speculative process will skip
the conditional block, eventually synchronizing when it reaches the end marker. If
the understudy reaches the marker first, it will terminate the speculative process.
This arrangement is semantically similar to fork-join execution where the second end marker represents the join point. One can view the conditional block of
code in terms of a future that is explicitly consumed at the end marker. If the
code block were to be placed in a separate function, the syntax would even be
quite similar. This setup can be generalized to multiple parallel tasks by treating
each task as described above. Because only a newly created speculative process
receives a unique return value from BeginPPR, the understudy will double-check
all of the tasks.
The series of ppr markers is necessary to guarantee that each task is not
dependent on the computation of earlier tasks. If the programmer knows that
the work a task is performing is ancillary to final results, then any data modified
within the task can be ignored by the bop run-time system.
4.5
4.5.1
running with the bop runtime will behave the same as if it were to be executed
sequentially, which largely obviates the need for debugging it. If errors in the
sequential program need to be diagnosed, the bop markers can easily be disabled
(becoming no-ops) and the program run sequentially.
Even if locks are used correctly to synchronize parallel execution, these uses
cannot be composed into more general cases. The use of locks for parallel programming does have a significant advantage over the bop system in efficiency:
locks introduce the least overhead of any synchronization technique, and can be
used in fine-grained cases for which a ppr would not be appropriate.
Attempting to implement something analogous to pprs using a message passing representation would face many of the same problems as locking. Because
message passing generally requires an explicit receive statement, it must be placed
before the first potential access of any type to any of the data potentially modified
within the ppr. Additionally, the message would need to carry all data modified
in the ppr. Because the members of this set cannot generally be known until run
time, a conservative implementation would need to gather all data modified in the
ppr.
4.5.2
Fine-Grained Techniques
code while explicit threading and its compiler support are often restrained due to
concerns over the weak memory consistency of modern processors. With these
features, bop addresses scalability of a different sort: letting large, existing
software benefit from parallel execution.
Any technique that does not use heavy-weight processes can be considered fine-grained. Such techniques are inherently unable to utilize the operating system's copy-on-write memory protection. Without hardware support, speculative parallelism
techniques must employ some other mechanism for the roll-back of speculative
writes.
In addition to lacking the operating system mechanism for protecting memory stores, fine-grained techniques face distinct challenges with regard to logging
memory loads. While page-level read/write access can be manipulated as
in the Fast Track system, this approach is non-viable. The time spent handling
the operating-system-level signal is far too high in proportion to the duration of
the parallel work. Additionally, the run-time system must do more work than
a system such as Fast Track to determine which thread performed the memory
access.
The more common approach is for the run-time system to instrument memory loads and stores to allow for logging (and subsequent roll-back or replay).
Excluding systems relying on hardware support, such instrumentation amounts
to expensive additional operations surrounding all memory accesses. These additional operations introduce overheads measured as multiples of the execution
time.
4.6
Evaluation
4.6.1
Implementation and Experimental Setup
[Table residue: the interpreter variables buf, gsprefix, xlfsize, xlsample, xltrace, xlstack, xlenv, xlcontext, xlvalue, and xlplevel; their descriptions did not survive extraction.]
scheduling. Experiments use multiple runs on an unloaded system with four dual-core Intel 3.40 GHz Xeon processors with 16 MB of shared L3 cache. Compilation
is done with gcc 4.0.1 with the -O3 flag for all programs.
4.6.2
Application Benchmarks
collector makes to the memory state, it always kills the speculation. To solve this
problem, the mark-sweep collector implementation is revised for bop as described
briefly here. The key idea is to insulate the effect of garbage collection so it can
be done concurrently, without causing unnecessary conflicts. Each ppr uses a
separate page-aligned memory region. At the beginning of a ppr instance (after
forking but before data protection) the garbage collector performs a marking pass
over the entire heap to record all reachable objects in a start list. New objects are
allocated inside the pre-allocated region during the execution of the ppr. When
the garbage collection is invoked, it marks only objects inside the region but traverses the start list as an additional set of root pointers. Likewise, only objects
within the region that are unmarked are freed. At the end of the ppr, the garbage
collector is run again, so only the pages with live objects are copied at the commit.
The code changes to implement this region-based garbage collection comprise the
introduction of three new global variables and 12 additional statements, most of
which are for collecting and traversing the start list and resetting the MARK flags
in its nodes.
The region-based mark-sweep has non-trivial costs at the beginning and end
of pprs. Within the ppr the collector may not be as efficient, because it may fail
to reclaim all garbage: some nodes in the start list would have become
unreachable in the sequential run. The extent of these costs depends on the
input. In addition, the memory regions will accumulate long-lived data, which
leads to more unnecessary alerts from false sharing. The lisp evaluation may
trigger an exception leading to an early exit from within a ppr, so the content
of checked variables may not be restored even for parallel expressions. Therefore,
one cannot decide a priori whether the chance of parallelism and its likely benefit
would outweigh the overhead. However, these are the exact problems that bop is
designed to address with its streamlined critical path and the on-line sequential-parallel race.
                 Serial    Speculative depth
                           1       3       7
    Times (s)    2.25      1.50    0.95    0.68
                 2.27      1.48    0.94    0.68
                 2.26      1.47    0.94    0.68
    Speedup      1.00      1.53    2.39    3.31
The NQueens input from the spec95 benchmark suite, which computes all positions of n queens on an n × n chess board in which no attacks are possible,
is used as a test case of the bop-lisp interpreter. Four lines of the original five
expression lisp program are modified, resulting in 13 expressions, of which 9 are
parallelized in a ppr. When n is 9, the sequential run takes 2.36 seconds using the
base collector and 2.25 seconds using the region-based collector (which effectively
has a larger heap but still needs over 4028 garbage collections for nine 10K-node
regions). The results of testing three speculation depths are listed in Table 4.6.2.
The last row of Table 4.6.2 shows that the speedup, based on the minimum
time of three runs, is a factor of 1.53 with 2 processors, 2.39 with 4 processors,
and 3.31 with 8 processors. The table does not list the additional cost of failed
speculations, which accounts for 0.02 seconds of the execution.
[Table 4.7 (training runs, Parser): 35 shared variables (70K, 343M), 117 checked variables (5312, 336M), 16 likely private variables (6024, 39M); the test input is obtained from the training input by replication.]
Table 4.7 shows the results of the bop analyzer, which identifies 33 variables
and allocation sites as shared data, 78 checked variables (many of which are not used
during compression), and 33 likely private variables. Behavior analysis detected
flow dependencies between compressions because the original GZip failed to completely reinitialize parts of its internal data structure before starting compression
on another file. The values would have been zeroed if the file was the first
to be compressed, and in this test the code has been changed to reinitialize these
variables. Compression returns identical results in all test inputs.
The sequential GZip code compresses buffered blocks of data one at a time, and
stores the results until an output buffer is full. pprs are manually placed around
the buffer loop and the set of likely private variables is specified through the
program interface described in Section 4.2.3. In this configuration the program
returned correct results, but speculation continually failed because of conflicts
caused by two variables, unsigned short bi buf and int bi valid, as detected by the
run-time monitoring.
The two variables are used in only three short functions. After inspecting the
original source code it became clear that the compression produces bits rather
than bytes, and the two variables stored the partial byte of the last buffer. This
                   Sequential    Speculative depth
                                 1            3            7
    Times (s)      8.46  8.56    7.29  7.71   5.38  5.49   4.80  4.47
                   8.50  8.51    7.32  7.47   4.16  5.71   4.49  3.10
                   8.53  8.48    5.70  7.02   5.33  5.56   2.88  4.88
    Average time   8.51          7.09         5.27         4.10
    Average speedup 1.00         1.20         1.61         2.08
[Figure 4.5: The effect of speculative processing on Parser. Running time (seconds) versus the number of sentences in the possibly parallel region (10, 25, 50, 100), for sequential execution, co-processing with 0% parallelism, and co-processing with 97% parallelism.]
space overhead for page allocation is at most 104 pages or a half mega-byte for
the sequential execution. The space cost of their run-time replication is already
counted in the numbers above (130KB and 7.45MB).
                 Sequential    Speculative depth
                               1        3        7
    Times (s)    11.35         10.06    7.03     5.34
                 11.37         10.06    7.01     5.35
                 11.34         10.07    7.04     5.34
    Speedup      1.00          1.13     1.62     2.12
It is not immediately clear from the documentation or from the 11,391 lines of its C code whether the Sleator-Temperley Link Parser can handle sentences in parallel; in fact it cannot always do so. If a ppr instance parses a command sentence
which changes the parsing environment, e.g., turning on or off the echo mode, the
next ppr instance cannot be speculatively executed. This is a typical example of
dynamic parallelism.
The bop parallelism analyzer identifies the sentence-parsing loop. We manually strip-mine the loop to create a larger ppr. The data are then classified
automatically as shown in Table 4.7. During the training run, 16 variables are always written first by the speculation process, 117 variables always have the same value at the two ends of a ppr instance, and 35 variables are shared.
The test input for the parallel version of the parser uses 1022 sentences obtained by replicating the spec95 training input twice. When each ppr includes
the parsing of 10 sentences, the sequential run takes 11.34 seconds, and the parallel
runs show speedups of 1.13, 1.62, and 2.12 with a few failed speculations due to the
dynamic parallelism.
The right-hand side of Figure 4.5 shows the performance on an input with 600
sentences. Strip-mine sizes ranging from 10 to 100 sentences are tested, and the group size has mixed effects on program performance.
For the sequential and spec-fail configurations, the largest group size leads to the lowest overhead,
3.1% and 3.6% respectively. Speculative processing improves performance by 16%,
46%, 61%, and 33% for the four group sizes. The best performance occurs with the
medium group size. When the group size is small, the relative overhead is high;
when the group size is large, there are fewer ppr instances and they are more
likely to be unevenly sized. Finally, the space overhead of speculation is 123KB,
100KB of which is checked data. This space overhead does not seem to change
with the group size.
[Figure: running times of bop-mkl (by speculation depth) and omp-mkl (by thread count), for depth/thread counts 1-2, 3-4, and 7-8.]
more than compensates for the overhead and produces an improvement of 16%
over 8-thread mkl. Experiments pitting bop against another scientific library, the threaded Automatically Tuned Linear Algebra Software (atlas), show similar results.
Speculative Optimization
Introduction
In this chapter I present a variation on process-based speculative execution called
Fast Track. The Fast Track system is based on the infrastructure for speculative
execution described in Chapter 3 but is applicable for a wholly different set of
uses from those in Chapter 4. Fast Track allows the use of unsafely optimized
code, while leaving the tasks of error checking and recovery to the underlying
implementation. The unsafe code can be implemented by a programmer or by a
compiler or other automated tool, and the program regions to be optimized can be
indicated manually or determined during execution by the run-time system. As
before, the system uses coarse-grain tasks to amortize the speculation overhead
and does not require special hardware support.
The shift in processor technology toward multicore, multi-processors opens
new opportunities for speculative optimization, where the unsafely optimized code
marches ahead speculatively while the original code follows behind to check for
errors and recover from mistakes. In the past, speculative program optimization
has been extensively studied both in software and hardware as an automatic
technique. The level of improvement, although substantial, is limited by the
96
ability of both the static and run-time analyses. In fact, previous techniques
primarily targeted individual loops and only considered transformations based on
value and dependency information.
One may question the benefit of this setup: suppose the fast code gives correct
results, would we not still need to wait for the normal execution to finish to know
it is correct? The reason for the speed improvement is the overlapping of the
normal tracks. Without fast track, the next normal track cannot start until the
previous one fully finishes. With fast track, the next one starts once the fast code
for the previous normal track finishes. In other words, although the checking is as
slow as the original code, it is now done in parallel. If the fast code has an error
or occasionally runs slower than the normal code, the program executes the normal code sequentially and is not delayed by a strayed fast track.
In Section 5.2 I describe the programming interface for Fast Track. This interface can be used by an automated tool, or in a natural way by a human programmer
with little effort. In Section 5.3 I describe the ways in which the Fast Track
run-time system extends the basic runtime described in Section 3.1.
5.1 Design

5.1.1 Fast and Normal Tracks
The FastTrack system represents two alternative methods of execution for some
portion of a program. At run time both of the methods are executed in parallel.
One of the two is identified a priori to be the canonical method, while the other is
assumed to potentially be unsafe in some cases. The unsafe execution is expected
to complete more quickly and is referred to as the fast track while the correct
computation is called the normal track.
5.1.2 Dual-track
In addition to the fast and normal track notation, the FastTrack run-time system
allows for a pair of parallel executions that are considered to be indistinguishable.
In this usage, both of the executions are referred to as Dual Tracks. Here,
whichever of the dual tracks can complete first leads to the continuing sequential
execution. The track which finishes more slowly will then confirm the results of the
first. If the two tracks are known with certainty to compute the same information
(but at unpredictable rates) the verification can be disabled.
5.2 Programming Interface
[Figures: examples of the dual-track interface, pairing an unsafely optimized version of a code region with the corresponding safe code.]
inner normal track. Statements with side effects that would be visible across the
processor boundary, such as system calls and file input and output, are prohibited
inside a dual-track region. The amount of memory that a fast instance may allocate is bounded so that an incorrect fast instance will not stall the system through
excessive consumption. Figure 5.2.1 in the previous section shows an example of
a fast track that has been added to the body of a loop. The dual-track region can
include just a portion of the loop body, multiple dual-track regions can be placed
back-to-back in the same iteration, or a region can be used in straight-line code.
Figure 5.2 shows the use of fast track on two procedure calls, with . . . standing in
for any other statements in between. Multiple dual-track regions do not have to
be arranged in a straight sequence. One might be used only within a conditional
branch, while another could be in a loop.
5.3 Run-time Support

5.3.1 Creation
In addition to the general creation process described in Section 3.1.1 the FastTrack
run-time variant must enable state comparison between the fast and regular tracks.
Within the FT BeginFastTrack run-time hook, prior to spawning a normal track,
the system allocates a shared memory space for two access maps, and a shared
data pipe. The use of these objects is described in Section 5.3.2.
5.3.2 Monitoring
During execution, memory pages are protected so that any write access will trigger
a segmentation fault. Both the fast and normal tracks use a signal handler to catch
the faults and record the access in a bit map.
In order to compare the memory modifications of the two tracks, the fast track
must provide the normal track with a copy of any changes it has made. At the
end of each dual track region, the fast track evaluates its access map to determine
what pages have been modified. Each page flagged in the access map is pushed
over a shared pipe, and consumed by the normal track, which then compares the
data to its own memory page.
Algorithm 5.3.2 Listing of FastTrack monitoring.

    static void FT_SegvHandler(int sig, siginfo_t *info, ucontext_t *context)
    {
        assert(SIG_MEMORY_FAULT == sig);
        assert(context);
        // accesses to pages that are not mapped are true faults
        if (info->si_code == SEGV_MAPERR)
            if (-1 == kill(SP_gpid, SIGALRM))
                perror("failed to kill the timer");
        if (!WRITEOPT(context)) return;
        // record the page and remove the restriction
        void *faultAdd = info->si_addr;
        SP_recordAccessToMap(faultAdd, FT_accMap);
        if (mprotect(PAGESTART(faultAdd), 1, PROT_WRITE | PROT_READ)) {
            perror("failed to change memory access permission.\n");
            abort();
        }
    }
5.3.3 Verification
To guarantee that the speculative execution is correct, the memory states of the fast and normal tracks are compared at the end of the dual track region. If the
fast track reached the same state as the normal track, then the initial state of the
next normal track must be correct. Typically, the next normal track was started
well before its predecessor finished, and it will know only in hindsight that it was
correctly initialized.
The normal track is responsible for comparing the writes made by both itself
and the fast track. The memory state comparison is performed once the normal
track has finished the dual track region because this is the first point at which
verification is possible. The comparison first determines whether the sets of writes made by the two tracks are identical, which is handled by a simple memcmp on the access
map of each of the two tracks. The process then compares the writes themselves
using the FT CheckData run-time call as in Listing 5.3.3. Verification will fail if
either the set or contents differ, or if the fast track has not yet completed the dual
track region.
Once verification has been completed successfully, the two processes are known to have made identical changes to the same memory locations. From that point forward, the execution of the two processes would be identical. Given this, one of
the tracks is superfluous. Because the fast track is aborted if it does not reach
the end of the dual track region first, we assume that it has continued past that
point and completed other useful work. The normal track is thus useless (since
it would be recomputing exactly what the fast track has already computed) and
aborts.
It is worth noting that although multiple dual track regions (i.e., multiple pairs
of fast and normal tracks) may exist simultaneously, a single process will have at
most one fast access map and one normal access map. Because the normal track
is responsible for performing the verification routine, the fast track can abandon
the access map it had been using for a region once the region is complete. The
normal track will still have access to that map. Once the map has been analyzed,
the normal track will abort or transition to the fast state.
[Listing 5.3.3: the FT CheckData comparison routine; a return value of 0 indicates success.]
5.3.4 Abort
The FastTrack abort routine is handled almost entirely by the normal track. The
normal track first waits to receive a notification that all of the preceding normal
tracks have completed, at which point it commits any buffered output and performs the verification routine. If the fast track needs to be aborted for any of the
reasons indicated in Section 5.3.3, the process executing the fast track is terminated. Because the normal track performs the verification, all cases in which the fast track is terminated pass through the same code path. The normal-track process explicitly signals the process running the fast track, which handles the signal
by simply closing the communications pipes and exiting. The steps taken by the
normal track after completing the dual track region are provided in Listing 5.3.5.
The normal track will continue executing until the next dual-track region is
encountered, or a program exit point is reached. Depending on the difference
in execution speed between the fast and normal track, the fast track may have
reached other dual track regions. In this case the abort of the fast track is followed
by the normal track sending a flag through the floodgates as an indication to
any waiting normal tracks that they should abort. Any normal tracks that have
already been released from the floodgate will run through their dual track region.
At the end of the region the process will synchronize by waiting to receive a flag
through the inheritance pipe indicating that it is the oldest running normal track.
In the case of an error in an earlier normal track, that synchronization flag will
indicate that the current process should also abort.
5.3.5 Commit
If the normal track verifies the correct execution of the dual track region, it cleans
up and aborts. The fast track is free to continue execution, possibly entering more
FastTrack regions and creating further normal tracks.
[Figure: execution of the fast track (F) and normal tracks (N1, N2) across dual-track regions, with region begin (B), end (E), and commit (C) events and pass/fail verification outcomes.]
5.3.6 Special Considerations
There are a number of corner cases that the Fast Track system must take into account.
Seniority Control
Because the fast track may spawn multiple normal tracks, which may then run
concurrently, each normal track must know when all of its logical predecessors
have completed. Before a normal track terminates, it waits on a flag to be set by
its predecessor, and then signals its successor when complete. If there is an error
in speculation, the normal track uses the same mechanism to lazily terminate
normal tracks that are already running once they reach the end of their FastTrack
region.
Output buffering
To ensure that the output of a program running with FastTrack support is correct,
we ensure output is produced only by a normal track that is known to be correct
and is serialized in the correct order. Until a normal track has confirmed that
its initialization was correct (i.e., that all previous speculation was correct), it
buffers all terminal output and file writes. Once all previous normal tracks have
been committed the normal track is considered to be the oldest, and we can
be certain that its execution is correct. Given correct execution, any output the
process produces will be the same as what the sequential program would have
generated. The fast track never produces any output to the terminal nor does it
write to any regular file.
Since the normal tracks are serialized, the fast track only needs to wait for the last
normal track it spawned to complete. This is achieved using the same mechanism
the normal tracks use to order themselves: the fast track waits on the inheritance
token. Note that the fast track is not necessarily waiting for the normal track to
reach the same program exit point, but the states of the two will agree.
Whether or not we are within the scope of a dual track region, the correctness of
the fast track is not known until the verifying normal tracks complete. Although
we could terminate the fast track and allow the normal track to simply do its
work, the normal track may be predicated on the results of other normal tracks.
Keeping the state of the fast track allows the earlier normal tracks to validate.
The alternative would be to abort all but the oldest normal track, potentially
wasting work.
Processor Utilization
The objective of speculative execution is for execution to occur as quickly as
possible. In order to make this happen, the run-time system should use the
available processing cores as wisely as possible. In a naive approach the fast
track would run until it exits the program, spawning normal tracks along the way.
Each normal track would compute its own version of its dual track region and verify
correct computation.
Although execution of the normal tracks (with the exception of the oldest) is
speculative based on the correctness of the fast track, they are taking advantage
of otherwise unused resources. However, if we spawn too many normal tracks,
they may begin to contend for hardware resources. Ultimately the normal tracks are
performing the real computation, and delaying their execution would be wasteful.
This is true either if we allow a more speculative process to be scheduled in
favor of an older one, or if it merely interferes with it.
    __attribute__((destructor)) void FT_exitHandler(void) {
        if (FT_active) FT_PostDualTrack();
        switch (myStatus) {
        case FAST:
            close(readyQ->pipe[0]);
            close(readyQ->pipe[1]);
            // wait for the last normal track
            SP_sync_read(inheritance, &token, sizeof(int));
            close(inheritance);
            kill(SP_gpid, SIGTERM);
            break;
        case SLOW:
            // wait to be the oldest
            if (FT_order > 1)
                SP_sync_read(inheritance, &token, sizeof(token));
            // commit output
            SP_CommitOutput();
            // terminate speculation
            kill(SP_gpid, SIGTERM);
            break;
        default:
            break;
        }
    }
Fast-track Throttling The fast track has thus far been described as speculatively running ahead of the normal tracks, constrained only by program termination or a terminal signal from one of the normal tracks. There are two reasons
why it is undesirable for the fast track to run arbitrarily far ahead. The first problem is the potential resource demand of the waiting normal tracks. The second
problem is that, should there be an error in the speculation detected in one of the
normal tracks, the processing done by the fast track is essentially wasted. The
FastTrack run-time system implements a throttling mechanism to keep the fast
track running far enough ahead to supply normal tracks and keep the processing
cores utilized, while minimizing potentially wasted resources.
The throttling strategy is to pause the fast track and give the processor to
a normal track, as shown by the middle diagram in Figure 5.2. When the next
normal track finishes, it re-activates fast track. The word next is critical for
two reasons. First, only one normal track should activate the fast track when it waits, effectively returning the processor after borrowing it. Second, the time of the activation must be exact. If it is performed one track too early, there will be too many processes; one track too late, and there would be under-utilization.
Consider a system with p processors running the fast track and p − 1 normal tracks, until the fast track becomes too fast and suspends execution, giving the processor to a waiting normal track. Suppose that three normal tracks finish in the order n1, n2, and n3, and the fast track suspends after n1 and before n2. The proper protocol
is for n2 to activate the fast track so that before and after n2 we have p and only p processes running concurrently. Activation earlier or later than n2 would lead to fewer or more than p processes.
In order to ensure that suspension and activation of the fast track are timed correctly with respect to the completion of the normal tracks, FastTrack maintains
some extra state. The value of waitlist length indicates the number of normal-track
processes waiting in the ready queue. A flag ft waiting represents whether the fast
track has been paused.
The fast track is considered to be too fast when waitlist length exceeds p. In
this case, the fast track activates the next waiting process in the ready queue, sets
the ft waiting flag, and then yields its processor. When a normal track finishes,
it enters the critical section and determines which process to activate based on the
flag: if ft waiting is on, it clears ft waiting and reactivates the fast track; otherwise,
it activates the next normal track and updates the value of waitlist length.
A problem arises when there are no normal tracks waiting to start, which can
happen when the fast track is too slow. If a normal track waits inside the critical
section to start its successor, then the fast track cannot enter to add a new track
to the queue. The bottom graph in Figure 5.2 shows this case, where one or more
normal track processes are waiting for fast track to fill the queue.
Resource Allocation Assuming we are executing on a system with N processors, and that the fast track is executing on one of the processors, the run-time system should allow at most N − 1 normal processes to execute simultaneously. The exception is when the fast track has been throttled, allowing an Nth normal track process. In addition to limiting the number of normal tracks, the FastTrack system should guarantee that the N − 1 oldest (or least speculative) processes
are allotted hardware resources. The FastTrack run-time system implements these
constraints using a token passing system such that only a process holding a token
too slowly. When there is enough parallelism, the fast track is constrained to
minimize potentially useless speculative computation.
Memory Usage The FastTrack run-time system relies on the operating system implementation of copy-on-write, which lets processes share memory pages
to which they do not write. In the worst case where every dual-track instance
modifies every data page, the system needs d times the memory needed by the
sequential run, where d is the fast-track depth. The memory overhead can be
controlled by abandoning a fast-track instance if it modifies more pages than an empirical constant threshold h. This bounds the memory increase to no more than d × h × M, where M is the virtual memory page size. The threshold h can be
adjusted based on the available memory in the system. Memory usage is difficult
to estimate since it depends on the demands of the operating system and other
running processes. Earlier work has shown that on-line monitoring can effectively
adapt memory usage by monitoring the page-fault indicators from Linux [21, 65].
Experimental test cases have never indicated that memory expansion will be a
problem, so I do not consider memory resources further.
Running two instances of the same program would double demand for off-chip
memory bandwidth, which is a limiting factor for modern processors, especially
chip multiprocessors. In the worst case if a program is completely memory bandwidth bound, no fast track can reduce the overall memory demand or improve
program performance. However, experience with small and large applications on
recent multicore machines, which are detailed later, is nothing but encouraging.
In FastTrack, the processes originate from the same address space and share read-only data. Their similar access patterns help to prefetch useful data and keep
it in cache. For the two large test applications used, multiple processes in FastTrack ran at almost the same speed as a single process. In contrast, running multiple separate instances of a program always degrades the per-process speed.
5.4 Compiler Support
The FastTrack system guarantees that it produces the same result as the sequential execution. By using Unix processes, FastTrack eliminates any interference
between parallel executions through the replication of the address space. During
execution, it records which data are changed by each of the normal and fast instances. When both instances finish, it checks whether the changes they made are
identical. Program data can be divided into three parts: global, stack, and heap
data. The stack data protection is guaranteed by the compiler, which identifies
the set of local variables that may be modified through inter-procedural MOD
analysis [30] and then inserts checking code accordingly. Imprecision in compiler
analysis may lead to extra variables being checked, but the conservative analysis
does not affect correctness. The global and heap data are protected by the operating system's paging support. At the beginning of a dual-track instance, the
system turns off write permission to global and heap data for both tracks. It then
installs custom page-fault handlers that record which page has been modified in
an access map and re-enables write permission.
5.5
5.5.1
In general, the fast code can be any optimization inserted by either a compiler
or a programmer; for example memoization, unsafe compiler optimizations, or
manual program tuning. The performance of the system is guaranteed against
slow or incorrect fast track implementations. The programmer can also specify two
alternative implementations and let the system dynamically select the faster one.
Below I discuss four types of optimizations that are good fits for fast track because
they may lead to great performance gains but their correctness and profitability
are difficult to ensure.
Memoization For any procedure the past inputs and outputs may be recorded.
Instead of re-executing the procedure in the future, the old result can be reused
when given the same input. Studies dating back to at least 1968 [39] show dramatic
performance benefits when using memoization, for example to speed up table lookup in transcoding programs. Memoization must be conservative about side-effects
and can provide only limited coverage for generic use in C/C++ programs [15].
With FastTrack, memoization does not have to be correct in all cases and therefore
can be more aggressively used to optimize the common case.
Semantic optimization Often, different implementation options may exist at
multiple levels, from the basic data structures used such as a hash table, to the
choice of algorithms and their parameters. A given implementation is often more
general than necessary for a program, allowing for specialization. Current programming languages do not provide a general interface for a user to experiment
with an unsafely simplified algorithm or to dynamically select the best choice
among alternative solutions.
Manual program tuning A programmer can often identify performance problems in large software and make changes to improve the performance on test inputs. However, the most radical solutions are often the most difficult to verify in
terms of correctness, or to ensure good performance on other inputs. As a result,
many creative solutions go unused because an automatic compiler cannot possibly
achieve them.
Monitoring and safety checking It is often useful to instrument a program
to collect run-time statistics such as frequently executed instructions or accessed
5.5.2
To test fast track on real-world applications, it has been applied to the parallelization of a memory-safety checking tool called Mudflap [16]. Mudflap is bundled
with the widely used GNU compiler collection (gcc), adding checks for array
range (overflow or underflow) and the validity of pointer dereferences to any program
gcc compiles. Common library routines that perform string manipulation or direct memory access are also guarded. Checks are inserted at compile time and
require that a run-time library be linked into the program.
The Mudflap compilation has two passes: memory recording, which tracks all
memory allocation by inserting
__mf_register and
5.6 Evaluation

5.6.1 Analysis
Analytical Model

The original execution time is $T(E) = T(u_0) + \sum_{i=1}^{n} T(r_i u_i)$. Reordering the terms gives $T(E) = \sum_{i=1}^{n} T(r_i) + \sum_{i=0}^{n} T(u_i)$. Name the two components $E_r = r_1 r_2 \ldots r_n$ and $E_u = u_0 u_1 \ldots u_n$. The time $T(E_u)$ is not changed by fast-track execution because any $u_i$ takes the same amount of time regardless of whether it is executed with a normal or a fast instance.
Of interest is the average time per dual-track instance, $T(E_r)/n$, and how this time changes as a result of FastTrack. Since we would like to derive a closed formula to examine the effect of basic parameters, consider a regular case where the program is a loop with $n$ equal-length iterations. A part of the loop body is a FastTrack region. Let $T(r_i) = t_c$ be the (constant) original time for each instance of the region. The analysis can be extended to the general case where the length of each $r_i$ is arbitrary and $t_c$ is the average; the exact result would depend on assumptions about the distribution of $T(r_i)$. In the following, we assume $T(r_i) = t_c$ for all $i$.
With FastTrack, an instance may be executed by a normal track in time $t_s = (1 + q_e)t_c + q_c$ or by a fast track in time $t_f^p$, where $q_c$ and $q_e$ are overheads. In the best case, all fast instances are correct ($\alpha = 1$) and the machine has unlimited resources ($p = \infty$). Each time the fast track finishes an instance, a normal track is started. Thus, the active normal tracks form a pipeline if considering only dual-track instances (the component $T(E_r)$ in $T(E)$). The first fast instance is verified after $t_s$. The remaining $n - 1$ instances finish at a rate of $t_f^{\infty} = (1 + q_e)x t_c + q_c$, where $x$ is the ratio of the fast-track time to the normal time.
Using the superscript $\infty$ to indicate the number of processors, the average time and the overall speedup are

    $\bar{t}_f^{\infty} = \frac{t_s + (n-1)\,t_f^{\infty}}{n}$

    $speedup^{\infty} = \frac{\textrm{original time}}{\textrm{fast-track time}} = \frac{n t_c + T(E_u)}{n \bar{t}_f^{\infty} + T(E_u)}$

In the steady state the speedup within dual-track regions approaches $t_c / t_f^{\infty}$; the equation does not show the fixed lower bound of fast-track performance.
Since a fast instance is aborted if it turns out to be slower than the normal
instance, the worst-case is t
f = ts = (1 + qe )tc + qc , and consequently speedup =
ntc +T (Eu )
.
n((1+qe )tc +qc )+T (Eu )
120
the worst-case time is bounded only by the overhead of the system and not by the
quality of fast-track implementation (factor x).
As a normal instance for r_i finishes, it may find the fast instance incorrect,
canceling the on-going parallel execution and restarting the system from r_{i+1}.
This is equivalent to a pipeline flush. Each failure adds a cost of t_s − t_f^∞, so the
average time with a success rate g is (1 − g)(t_s − t_f^∞) + t̄_f^p, where

    t̄_f^p = (t_s + (d − 1) t_f^∞) / d

for throttling depth d. When g < 1, the cost of restarting has the same effect as
in the infinite-processor case. The average time and the overall speedup are

    t_f^p = (1 − g)(t_s − t_f^∞) + (t_s + (d − 1) t_f^∞) / d

    speedup^p = (n t_c + T(E_u)) / (n t_f^p + T(E_u))

After simplification, FastTrack throttling may seem to increase the per-instance
time rather than decrease it. But it does decrease the time because d ≤ t_s / t_f^∞.
The overall speedup (bounded from below, with n ≥ 2) is as follows, where all the
basic factors are modeled:

    speedup^p = max( (n t_c + T(E_u)) / (n t_s + q_c + T(E_u)),  (n t_c + T(E_u)) / (n t_f^p + T(E_u)) )
Simulation Results
By translating the above formula into actual speedup numbers, the effect of the major
parameters can be examined. Of interest are the speed of the fast track, the
success rate, the overhead, and the portion of the program executed in dual-track
regions. The four graphs in Figure 5.3 show their effect for numbers of
processors ranging from 2 to 10 in steps of 1. The fast-track system has no effect
on a single-processor system.

All four graphs include the following setup: the fast instance takes
10% of the time of the normal instance (x = 0.1), the success rate (g) is 100%, the
overhead (q_c and q_e) adds 10% execution time, and the program spends 90% of the
time in dual-track regions. The performance of this case is shown by the second
highest curve in all but graph 5.3(a), in which it is shown by the highest curve.
FastTrack improves the performance from a factor of 1.60 with 2 processors to a
factor of 3.47 with 10 processors. The maximal possible speedup for this case is
3.47. When we change the speed of the fast instance to vary from 0% to 100%
of the time of the normal instance, the speedup changes from 1.80 to 1.00 with 2
processors and from 4.78 to 1.09 with 10 processors, as shown by graph 5.3(a).
When the success rate is reduced from 100% to 0%, the speedup changes from
1.60 to 0.92 (8% slower because of the overhead) with 2 processors and from
3.47 to 0.92 with 10 processors, as shown by graph 5.3(b). Naturally the
performance hits the worst case when the success rate is 0%.
When the overhead is reduced from 100% to 0% of the running time, the
speedup increases from 1.27 to 1.67 with 2 processors and from 2.26 to 3.69 with
10 processors, as shown by graph 5.3(c). Note that with 100% overhead the
fast instance still finishes in 20% of the time of the normal instance, although the
checking needs to wait twice as long.
Finally, when the coverage of the fast-track execution increases from 10% to
100%, the speedup increases from 1.00 to 1.81 with 2 processors and from 1.08
to 4.78 with 10 processors, as shown by graph 5.3(d). If the analytical results are correct, it is
not overly difficult to obtain a 30% improvement with 2 processors, although the
maximal gain is limited by the time spent outside dual-track regions, the speed
of the fast instance, and the overhead of fast-track execution.
The poor scalability is not a surprise given that the program is inherently sequential
to begin with. Two final observations from the simulation results are important.
First, FastTrack throttling is clearly beneficial. Without it there can be no improvement with 2 processors. It often improves the theoretical maximum speedup,
although the increase is slight when the number of processors is large. Second, the
model simplifies the effect of the FastTrack system in terms of four parameters, which
we have not validated with experiments on a real system. On the other hand,
if the four parameters are the main factors, they can be efficiently monitored at
run time, and the analytical model may be used as part of an on-line control to
adjust the depth of fast-track execution with the available resources.
5.6.2 Experimental Results
processors, FastTrack reduces the delay to factors of 2.0, 2.1, 3.7, and 28.8, which
are more tolerable for long-running programs.
The code change in 429.mcf includes replicating the call of price_out_impl in
function global_opt in file mcf.c. Similar to the code in the FastTrack example in
the introduction, the original call is placed in the normal track and the call to the
clone, clone_price_out_impl, in the fast track. For 458.sjeng, the call of search in
function search_root in file search.c is similarly changed to use clone_search in the
fast track and search in the normal track. In both cases, merely four lines of code
need to be modified.
Memory safety checking by Mudflap more than triples the running time of
mcf. FastTrack improves the speed of checking by over 30%. The running time
of fast track is within half a second of a dual-track implementation, which shows that
FastTrack runs with little overhead. The cost of safety checking for 458.sjeng is
a factor of 200 slowdown: it takes 24 minutes to check the original execution of
7.3 seconds. FastTrack is able to reduce the checking time to 13 minutes, a factor
of two reduction. A dual-track style execution without verification runs faster,
finishing in under 9 minutes without the overhead of checking every memory
access.
Results of Sort and Search Tests

The following two tests are intended to measure the performance of FastTrack
in support of unsafe optimizations, as executed on two Intel dual-core
Xeon 3GHz processors. Compilation is done using the modified FastTrack version
of gcc with the optimizations specified by the -O3 flag. The first test is a
simple sorting program that repeatedly sorts an array of 10,000 elements. In a
specified percentage of iterations the array contents are randomized. The array
sort is performed with either a short-circuited bubble sort, a quick sort, or by
running both in a FastTrack environment. The results of these tests are shown in
Figure 5.5. The quick sort performs consistently and largely independently of the
input array. One can see that the bubble sort quickly detects when the array is
sorted, but performs poorly in cases in which the contents have been randomized.
The FastTrack approach is able to out-perform either of the individual sorting
algorithms. These results illustrate the utility of FastTrack in cases where both
solutions are correct but it is not possible to know in advance which is actually faster.
In cases where the array is always sorted or always unsorted, the overhead of using
FastTrack will cause it to lose out. Although FastTrack is not a better solution
compared to an explicitly parallel sorting approach, this example motivates the
utility of automatically selecting the faster of multiple sequential approaches.
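The short-circuit behavior that makes the bubble sort competitive on sorted input can be seen in a few lines: a pass that performs no swaps terminates the sort, so an already-sorted array costs a single linear pass while randomized input degrades to quadratic work. This is an illustrative sketch (the function is ours, not taken from the test program):

```c
#include <stdbool.h>
#include <stddef.h>

/* Bubble sort that exits as soon as a full pass performs no swap.
 * Returns the number of passes, which is 1 for sorted input. */
static size_t bubble_sort_short_circuit(int *a, size_t n) {
    size_t passes = 0;
    bool swapped = true;
    while (swapped) {
        swapped = false;
        passes++;
        for (size_t i = 1; i < n; i++) {
            if (a[i - 1] > a[i]) {      /* out of order: swap and keep going */
                int tmp = a[i - 1];
                a[i - 1] = a[i];
                a[i] = tmp;
                swapped = true;
            }
        }
    }
    return passes;
}
```

On sorted input the function returns after one pass; on a reversed array of n elements it needs n passes, which is the asymmetry FastTrack arbitrates at run time.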
Algorithm 5.6.1 Pseudo code of the synthetic search program

    for i = 1 to n do
        V_i ← random
    end for
    for 1 to T do
        if normal track then
            for i = 1 to n do
                V_i ← f(V_i)
            end for
            m ← max(v : v ∈ V)
        else {fast track}
            R ← S random samples from V
            for j = 1 to S do
                R_j ← f(R_j)
            end for
            m ← max(r : r ∈ R)
        end if
        randomly modify N_1 elements
    end for
    print m
The second program is a simple search to test the effect of various parameters,
for which the basic algorithm is given in Algorithm 5.6.1. The program repeatedly
updates some elements of a vector and finds the largest result from certain computations. By changing the size of the vectors, the size of samples, and the
frequency of updates, we can effect different success rates by the normal and the
fast instances. Figure 5.6(a) shows the speedups over the base sequential execution, which takes 3.33 seconds on a 4-CPU machine. The variation between times
of three trials is always smaller than 1 millisecond.
The sampling-based fast instance runs in 2.3% of the time of the normal instance.
When all fast instances succeed, they improve the performance by a factor of 1.73
on 2 processors, 2.78 on 3 processors, and 3.87 on 4 processors. When the
frequency of updates is reduced, the success rate drops. At 70%, the improvement
is a factor of 2.09 on 3 processors and changes only slightly when the fourth
processor is added. This drop is because the chance of four consecutive fast
instances succeeding is only 24%. When the success rate is further reduced to
30%, the chance of three consecutive successful fast tracks drops to 2.7%. The
speedup with 2 processors is 1.29 and no improvement is observed for more than 2
processors. In the worst case when all fast instances fail, we see that the overhead
of forking and monitoring the normal track adds 6% to the running time.
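The trade-off that drives the success rate can be seen in a toy version of the two instances: the fast path applies f to a random sample rather than the whole vector, so its maximum is a lower bound on the true maximum and matches it only with some probability. The function f and all names here are placeholders, not code from the benchmark:

```c
#include <stdlib.h>

/* f stands in for the per-element computation of the search program. */
static int f(int v) { return (v * v) % 1001; }

/* Normal instance: apply f to every element and return the maximum. */
static int full_max(const int *v, int n) {
    int m = f(v[0]);
    for (int i = 1; i < n; i++) {
        int y = f(v[i]);
        if (y > m) m = y;
    }
    return m;
}

/* Fast instance: apply f to s random samples only. The result is a
 * lower bound on full_max and equals it only with some probability,
 * which plays the role of the success rate. */
static int sample_max(const int *v, int n, int s) {
    int m = f(v[rand() % n]);
    for (int j = 1; j < s; j++) {
        int y = f(v[rand() % n]);
        if (y > m) m = y;
    }
    return m;
}
```

A larger sample size s raises the chance that sample_max finds the true maximum at the cost of more work per fast instance, which is exactly the tuning trade-off shown in Figure 5.6(b).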
The results in Figure 5.6(b) show interesting trade-offs when the fast track
is tuned by changing the size of samples. On one hand, a larger sample size
means more work and slower speed for the fast track. On the other hand, a
larger sample size leads to a higher success rate, which allows more consecutive
fast tracks to succeed and consequently more processors to be utilized. The success rate
is 70% when the sample size is 100, which is the same configuration as the row
marked 70% in Figure 5.6(a). The best speedup for 2 processors is found when
the sample size is 200, but adding more processors does not help as much (2.97
speedup) as when the sample size is 300, where 4 processors lead to a speedup
of 3.78. The second experiment shows the significant effect of tuning when using
unsafely optimized code. Experience has shown that the automatic support and
analytical model have made tuning much less labor intensive.
Figure 5.2: The three states of fast track: balanced steady state, fast-track throttling when it is too fast, and slow-track waiting when fast track is too slow. The system returns to the balanced steady state after fast-track throttling.
Figure 5.3: Analytical results of the fast-track system as the speed of the fast track, the success rate, the overhead, and the portion of the program executed in dual-track regions vary. The order of the parameters in the title of each graph corresponds to the top-down order of the curves in the graph.
Figure 5.4: The effect of FastTrack Mudflap on four SPEC 2006 benchmarks (bzip2, hmmer, mcf, sjeng). For 401.bzip2, the checking time is reduced from 24.5 seconds to 9.0 seconds; the base running time, without memory safety checking, is 4.5 seconds.
Figure 5.5: Sorting time with quick sort, bubble sort, or the FastTrack of both, as the percentage of iterations that modify the array varies (legend: quick, fast-track, bubble; time on a logarithmic scale).

Figure 5.6(a): Speedup by success rate and number of processors.

    success rate     1      2      3      4
    100%             1    1.73   2.78   3.87
    70%              1    1.47   2.09   2.15
    30%              1    1.29   1.29   1.29
    0%               1    0.94   0.94   0.94

Figure 5.6(b): Speedup by sample size and number of processors.

    sample size      1      2      3      4
    100              1    1.48   2.09   2.15
    200              1    1.71   2.64   2.97
    300              1    1.70   2.71   3.78
    400              1    1.68   2.69   3.74
Conclusion

6.1 Contributions
I have presented two systems for implementing speculative parallelism in existing programs. For each system I have implemented a complete working system
including compiler and run-time support. The first system, bop, provides a programmer with tools to introduce traditional types of parallelism in cases where
program dependencies cannot be statically evaluated or guaranteed. I have shown
the use of bop to effectively extract parallelism from utility programs.
I have also presented FastTrack, a system that supports unsafely optimized
code and can also be used to off-load safety checking and other program analysis. The key features of the systems include a programmable interface, compiler
support, and a concurrent run-time system that includes correctness checking, output buffering, activity control, and fast-track throttling. I have used the system
to parallelize memory safety checking for sequential code, reducing the overhead
by up to a factor of seven for four large size applications running on a multicore
personal computer. We have developed an analytical model that shows the effect
from major parameters including the speed of the fast track, the success rate, the
overhead, and the portion of the program executed in fast-track regions. We have
used our system and model in speculatively optimizing a sorting and a search
program. Both analytical and empirical results suggest that fast track is effective
at exploiting today's multi-processors for improving program speed and safety.
6.2 Future Directions

6.2.1 Automation
Automating the insertion of bop region markers requires identifying pprs automatically, which is similar to identifying parallelism, a major open problem.
Because pprs are only hints at parallelism, it is not necessary for them to be correct. In addition to inserting the ppr markers automatically, the system could be
simplified by allowing the EndPPR marker to be optional. The difficulty in doing
this comes in handling the final instance of the ppr. Without an end marker,
the speculative task will continue until it reaches a program exit point. The non-speculative task will execute the ppr, and subsequently repeat the same execution as
the speculative task. Such duplicated work is certainly wasteful, but may be acceptable if there is no other useful work that could be offloaded to the additional
processing unit.
In order to automate the use of the FastTrack system, markers can be inserted
at various points throughout the code using compiler instrumentation. We can
choose dynamically whether to initiate a new dual-track region based on the past
success rate and the execution time since the start of the last region. A region
can begin at an arbitrary point in execution, as long as the other track makes
the same decision at that point. We can identify the point with a simple shared
counter that each track increments every time it passes a marker. The fast track makes
its increments atomically, and when it creates a new normal track it begins a new
counter (leaving the old one for the previous normal track). As the normal tracks
pass marks, they compare their counter to the fast track's to identify the mark at
which verification needs to be performed. If the two processes did not follow the
same execution path, then the state verification will fail.
A significant problem is ensuring that the fast path includes all of the markers
the normal track has. This is directly related to where the markers are placed, and
how the two tracks are generated. In a case like the fast Mudflap implementation
described in Chapter 5, the markers will be consistent as long as they are not
placed in the Mudflap routines. In any case where code is similarly inserted to
create the normal track, it will suffice simply to not insert markers with that code.
In a case where the fast track is created by removing optimizations from existing
code, we must ensure that markers are not removed, and that any function calls
are not directly removed because they might contain further markers.
6.2.2 Composability

One of the major problems in parallel programming, particularly explicit parallel programming with locks, is the composability of various
operations. The intuition behind composability is that the combination of multiple components should not break correctness.
Lack of composability is a significant weakness of lock-based components, and
is one of the strengths of transactional memory systems. Because the speculative
parallelism run-time systems are intended to be a simple way to extend existing
programs, the bop and FastTrack system should seek to compose correctly. There
are several general questions to ask about the composability of these systems: does
each compose with itself, do they compose with one another, and do they compose
with existing parallel programming techniques?
Self-Composition The question of self-composition is whether the run-time
system properly handles entering a speculative region when one is already active.
Cases where disjoint regions of the program use speculation compose trivially. The
bop run-time system does correctly compose with itself. The implementation is
designed so that nested uses of pprs are not allowed, but are detected and handled
correctly. If a piece of code (for example a library) is built to use pprs, and that
is invoked from within another ppr, the inner regions will be ignored. Although
this maintains semantic correctness, which is the primary concern, it may
not be the most effective solution.
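One simple way to obtain the described behavior, where inner regions are detected but ignored, is a per-process nesting-depth counter around the region markers. This is our illustration of the policy, not the bop implementation:

```c
/* Depth of currently open PPRs in this process. Only the outermost
 * begin actually starts speculation; nested ones are ignored. */
static int ppr_depth = 0;
static int regions_speculated = 0;

static void begin_ppr(void) {
    if (ppr_depth++ == 0)
        regions_speculated++;   /* outermost region: start speculation */
    /* nested region: depth is tracked but the region is ignored */
}

static void end_ppr(void) {
    if (--ppr_depth == 0) {
        /* outermost region ends: commit or abort speculation here */
    }
}
```

A library that opens its own region from inside an already-active one simply raises and lowers the depth; only the outermost pair drives speculation, which matches the "inner regions will be ignored" behavior above.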
The FastTrack run-time system also maintains semantic correctness when it is
composed with itself. When the FastTrack system encounters a nested fast track
region, the runtime will treat it like any other dual track region. If the fast track is
the first to reach the nested region it will spawn a new normal track. Eventually
the normal track will encounter the end of the original dual track region, and
speculation will fail. Although semantic correctness is maintained, performance
will suffer because speculation over the entire outer region will always fail. This
failure could potentially be avoided if fast track regions were given identifiers. The
run-time system would also need a mechanism to match the identifier the normal
track encounters at the end of its region to the fast track. Additionally, the fast
track would need to abandon the inner normal tracks and to reacquire the changes
it made starting at the beginning of the outer region (which are otherwise simply
left for the inner normal track to verify).
If the normal track reaches a nested region then it will assume that the fast
track has mis-speculated, or is otherwise delayed, and that it has simply completed executing the region first. As in any case where the normal track wins the
race, it will terminate the fast track. The normal track will then assume the role
of fast track and spawn a new normal track to handle the inner region. From a
performance standpoint this is not likely to be the most effective solution because
only the smaller inner region will be fast tracked. Nevertheless, it is a better outcome than the case above, and it does maintain semantic correctness. In the case
that both tracks encounter a nested dual track region, the result is very much like
the above case in which only the normal track encounters the inner region.
Algorithm 6.2.1 Example of FastTrack self-composition

    void outer(void) {
        if (FT_BeginFastTrack()) {
            inner_fast();
        } else {
            inner_normal();
        }
    }

    void inner_fast(void) {
        if (FT_BeginFastTrack()) {
            ...
        } else {
            ...
        }
    }
6.2.3 Further Evaluation
Code Listings

Included here are source code fragments not found earlier in this dissertation.
Where relevant, a reference to the earlier source is included. System header file
inclusions and standard pre-processor include guards have been omitted for
brevity.
A.1 BOP Code

Listing A.1: Private Header
    static int specDepth;
    // Process ID of SPEC.
    static int undyWorkCount;
    static int pprID;
    static int mySpecOrder = 0;   // Serial number.
    static int undyConcedesPipe[2];

    static void BOP_AbortSpec(void);
    static void BOP_AbortNextSpec(void);

    switch (access) {
    case READ:
        map = accMapPtr + (mapId * 2 * BIT_MAP_SIZE);
        break;
    case WRITE:
        map = accMapPtr + (((mapId * 2) + 1) * BIT_MAP_SIZE);
        break;
    }

        SP_recordAccessToMap(page_address, map);
    }

    prevWrites = READMAP(mySpecOrder - 1);
    = READMAP(mySpecOrder);
    // No action.
    if (info->si_pid == getpid()) return;
    if (myStatus != UNDY) return;

    if (WRITEOPT(cntxt)) {
        // A write access.
        BOP_AbortNextSpec();
        BOP_recordAccess(faultAdd, WRITE);
        if (mprotect(PAGESTART(faultAdd), 1, PROT_WRITE | PROT_READ))
            exit(errno);
    } else {
        // A read access.
        BOP_recordAccess(faultAdd, READ);
        if (mprotect(PAGESTART(faultAdd), 1, PROT_READ))
            exit(errno);
    }
    }

    void BOP_UndyTermHandler(int num, siginfo_t *info, ucontext_t *cntxt)
    {
        assert(SIGUSR2 == num);
        assert(cntxt);
        if (info->si_pid == getpid()) return;
        /* Must be Undy */
        exit(0);
    }

    // See Listing 4.3.1 for BOP_PrePPR implementation.
    if (hasError) {
        perror("failed to close pipes");
        myStatus = SEQ;
        return 0;
    }
    else return 1;
    }

    // See Listing 4.3.3 for BOP_End implementation.
    // See Listing 4.3.4 for PostPPR_commit implementation.
    // See Listing 4.3.4 for PostPPR_main implementation.
    // See Listing 4.3.4 for PostPPR_spec implementation.
    // See Listing 4.3.4 for PostPPR_undy implementation.
    static int BOP_Pipe_Init(void) {
        int i, hasError = 0;
        hasError |= pipe(undyCreatedPipe);
        hasError |= pipe(undyConcedesPipe);

        if (hasError) {
            perror("update pipe creation failed: ");
            myStatus = SEQ;
            return 0;
        }
        else return 1;
    }
    static void BOP_timerTermExit(int signo) {
        assert(SIGTERM == signo);
        signal(SIGTERM, SIG_IGN);
        kill(0, SIGTERM);
        exit(0);
    }

    init_done = 1;
    char *curPnt = mmap(NULL, ALLOC_MAP_SIZE + ACC_MAP_SIZE,
                        PROT_READ | PROT_WRITE,
                        MAP_ANONYMOUS | MAP_SHARED, -1, 0);
    assert(curPnt);
    useMap = curPnt;
    accMapPtr = curPnt + ALLOC_MAP_SIZE;
    BOP_timerAlarmExit);
    signal(SIGQUIT, BOP_timerAlarmExit);
    signal(SIGUSR1, SIG_DFL);
    signal(SIGUSR2, SIG_DFL);
    // Prepare post/wait
    BOP_Pipe_Init();

    // Setup SIGALRM
    signal(SIGALRM, BOP_timerAlarmExit);
    signal(SIGTERM, BOP_timerTermExit);
    while (1) pause();
    }
    }
    void BOP_PostPPR(int id) {
        // Ignore a PPR ending if it doesn't match the PPR we started.
        if (id != pprID) return;
        pprID = -1;
        switch (myStatus) {
        case UNDY:
            return PostPPR_undy();
        case SPEC:
            return PostPPR_spec();
        case MAIN:
            return PostPPR_main();
        case SEQ:
        case CTRL:
            return;   // No action.
        default:
            assert(0);
        }
    }

    static void BOP_AbortSpec(void) {
        assert(myStatus == SPEC);
        exit(0);
    }

    static void BOP_AbortNextSpec(void) {
        earlyTermination = true;
        // Kill any following SPEC.
        if (specPid != 0) kill(specPid, SIGKILL);
    }
A.2 FastTrack Code

    int FT_BeginFastTrack(void);
    int FT_BeginDualTrack(void);
    void FT_PostDualTrack(void);

    // Fast track
    static char *FT_slowAccMap;
    // Slow track
    static char *FT_accMap;

        volatile bool waiting;
    } readyQueue;

    static readyQueue *readyQ;

    // Communication channels:
    // Channel for passing data updates after verification.
    static int updatePipe[2];
    // File descriptors for assigning seniority. Each slow track reads
    // from the inheritance pipe and writes to the bequest.
    static int inheritance, bequest;
    // Slow tracks open a floodgate of another waiting slow track.
    #define FLOODGATESIZE (2 * (MAX_SPEC_DEPTH + 1))
    static int floodgates[FLOODGATESIZE][2];
    static int FT_getDepthFromEnv(void) {
        char *cval;
        static const int def = 2;   // Default value
        int depth = def;
        cval = getenv("BOP_SpecDepth");
        if (cval != NULL) depth = atoi(cval);
        // Must be in the range [0, MAX]
        if (depth < 0 || depth > MAX_SPEC_DEPTH) depth = def;
        return depth;
    }

    FT_fastAccMap = accMap;
    FT_slowAccMap = accMap + ACC_MAP_SIZE;
    return 0;
    }
    static inline int FT_readFloodGate(void) {
        int token;
        int *gate = FT_floodGateFor(FT_order);
        SP_sync_read(gate[0], &token, sizeof(token));
        return token;
    }

    FD_ZERO(&readset);
    FD_SET(readyQ->pipe[0], &readset);
    }

    static void FT_releaseNextSlow(void) {
        while (0 != sem_wait(&(readyQ->sem)));
        if (readyQ->waiting) {
            // restart the fast
            assert(FAST != myStatus);
            FT_slowCleanup();
            readyQ->waiting = false;
            sem_post(&(readyQ->sem));
            exit(1);
        } else {
            int slowtrack = 0;
            // Read from ready queue until a value is returned.
            SP_sync_read(readyQ->pipe[0], &slowtrack, sizeof(slowtrack));
            if (slowtrack > 0) readyQ->recent = slowtrack;
            sem_post(&(readyQ->sem));
            // If we got a slowtrack from the ready queue, start it.
            if (slowtrack) FT_openFloodGate(slowtrack, FT_order);
        }
    }

    // If the fast track gets too far ahead (a lot of slow tracks are
    // waiting) it will yield to let some slow tracks get work done.
    static void FT_continueOrYield(void) {
        if (FT_order > readyQ->recent + FT_maxSpec) {
            // Continuing after yielding to slow track
            FT_releaseNextSlow();
            readyQ->waiting = true;
            while (readyQ->waiting) pause();
        }
    }
    static void FT_becomeOldest(void)
    {
        int token;
        // Wait until we are the most senior slow instance.
        SP_sync_read(inheritance, &token, sizeof(token));
        if (token == -1) {
            // Upstream error. Propagate and abort.
            SP_sync_write(bequest, &token, sizeof(token));
            exit(1);
        }
        // Now the oldest slow track.
        close(inheritance);
    }

    close(bequest);
    close(readyQ->pipe[0]);
    for (i = 0; i < FLOODGATESIZE; i++) {
        close(floodgates[i][0]);
        close(floodgates[i][1]);
    }
    }
    __attribute__((constructor)) void FT_init(void) {
        int i;
        int sen_pipe[2];
        readyQ = FT_sharedMap(sizeof(readyQueue));
        readyQ->waiting = false;
        if (0 != pipe(readyQ->pipe)) {
            perror("allocating ready queue");
            abort();
        }
        if (-1 == sem_init(&(readyQ->sem), 1, 1)) {
            perror("unable to initialize semaphore");
            abort();
        }
            seniority pipe");
            abort();
        }
        FT_maxSpec = FT_getDepthFromEnv();
        FT_active = false;
        SP_RedirectOutput();
        if (FT_AUTOMARKPOINT) FT_InitAutoMarkPoint();
    }
    // Automatic branch point insertion
    static unsigned FT_AM_count = 0;
    static bool FT_AM_active = false;
    static unsigned *FT_AM_joinPoint;

    static void FT_itimerHandler(int signo) {
        assert(signo == SIGALRM);
        FT_AM_active = true;
    }

    static void FT_AllocateJoinPointer(void) {
        FT_AM_joinPoint = FT_sharedMap(sizeof(*FT_AM_joinPoint));
        *FT_AM_joinPoint = 0;
    }
    int FT_AutoMarkPoint(void) {
        if (!FT_AUTOMARKPOINT) return 0;
        FT_AM_count++;
        if (!FT_AM_active) return 0;
        if (SLOW == myStatus) {
            // If the slow track has already passed the join point, then it
            // is running ahead of the fast track (or the timer didn't fire
            // soon enough). Slow wins.
            if (FT_AM_count > *FT_AM_joinPoint) FT_slowTakesOver();
            // If we have reached the indicated join point, clean up.
            else if (FT_AM_count == *FT_AM_joinPoint) FT_PostDualTrack();
        } else if (FAST == myStatus || CTRL == myStatus) {
            // reset the activation
            FT_AM_active = false;
            // indicate where the branch/join is
            *FT_AM_joinPoint = FT_AM_count;
            munmap(FT_AM_joinPoint, sizeof(*FT_AM_joinPoint));
            // Setup a new join-point record for the next slow track.
            FT_AllocateJoinPointer();
            return FT_BeginFastTrack();
        }
        return 0;
    }

    // See Listing 5.3.5 for FT_PostSlow and FT_slowTakesOver.
    // See Listing 5.3.5 for FT_exitHandler implementation.

    // The slow track kills fast with SIGABRT.
    static void FT_sigAbortFast(int sig) {
        assert(SIGABRT == sig);
        FT_fastCleanup();
        exit(1);
    }
    FT_continueOrYield();
    return 1;
    }

    char FT_internalBeginNormal(int seniority[2]) {
        myStatus = SLOW;
        bequest = seniority[1];
        close(readyQ->pipe[1]);
        close(seniority[0]);
        FT_accMap = FT_slowAccMap;
        close(updatePipe[1]);

        SP_RedirectOutput();
        FT_active = true;
        if (FT_AUTOMARKPOINT) FT_StartAutoMarkPointTimer();
        return 0;
    }
    // See Listing 5.3.2 for FT_SegvHandler implementation.
    // See Listing 5.3.1 for FT_Begin implementation.

    static int dualPid;

    static inline void FT_PostDual(void) {
        // Just kill the other and move on.
        if (-1 == kill(dualPid, SIGABRT))
            perror("failed to abort parallel track");
        myStatus = CTRL;
        SP_CommitOutput();
    }

    int FT_BeginDualTrack(void)
    {
        // Make sure we're currently running sequentially.
        if (myStatus != CTRL) return 0;
        // Don't bother if there can't be parallelism.
        if (FT_maxSpec < 1) return 0;
        int PID = fork();
        if (-1 == setpgid(0, SP_gpid)) {
            perror("failed to set process group");
            abort();
        }
        switch (PID) {
        case -1:
            myStatus = SEQ;
            PID = 0;
            break;
        case 0:
            myStatus = DUAL;
            dualPid = getppid();
            break;
        default:
            myStatus = DUAL;
            dualPid = PID;
            break;
        }
        SP_RedirectOutput();
        return PID;
    }
    static inline void FT_PostFast(void) {
        SP_PushDataAccordingToMap(FT_fastAccMap, updatePipe[1]);
        close(updatePipe[0]);
        close(updatePipe[1]);
    }

        FT_PostFast();
        break;
    case DUAL:
        FT_PostDual();
        break;
    default:
        fprintf(stderr, "unexpected process state %d", myStatus);
        abort();
    }
    FT_active = false;
    }
A.3 Common Code

Listing A.9: Common Header File

    // Write operations are type 2, and register 13 stores the type info.
    #if defined(__MACH__)
    #define SIG_MEMORY_FAULT SIGBUS
        es.err & 2)
    #else
    #define SIG_MEMORY_FAULT SIGSEGV
    #define WRITEOPT(cntxt) ((cntxt)->uc_mcontext.gregs[13] & 2)
    #endif

    typedef enum {
        CTRL, MAIN,
        SPEC,   // a speculation process
        UNDY,   // the understudy
        SEQ,    // a sequential process
        FAST,   // a fast track
        SLOW,   // a slow track
        DUAL
    } SP_Status;

    volatile SP_Status myStatus;
    int SP_gpid;
    // byte = page / 8
    bit = page & 7;   // bit = page % 8;
    map[byte] |= (1 << bit);
    }

    = page % 8;
    char mapvalue = map[byte];

    unsigned bchar, bit, i;
    int page_count = 0;
    for (bchar = 0; bchar < BIT_MAP_SIZE; bchar++) {
        if (map[bchar] == 0) continue;
        if (map[bchar] == 0xFF) {
            SP_PushPageToPipe(bchar * 8, pipe_id, 8);
            page_count += 8;
            continue;
        }
        for (bit = 0; bit < 8; bit++) {
            if ((map[bchar] >> bit) & 0x1) {
                i = bchar * 8 + bit;
                SP_PushPageToPipe(i, pipe_id, 1);
                page_count++;
            }
        }
    }
    return page_count;
    }
if (-1 == increment) {
    perror("error code");
    exit(0);
}
read_count += increment;
}
if (protected)
    mprotect((void *)(i * PAGESIZE), PAGESIZE, PROT_NONE);
}
sigfillset(&action.sa_mask);
action.sa_flags = SA_SIGINFO;
action.sa_sigaction = (void *)handler;
if (-1 == sigaction(signal, &action, NULL)) {
    perror("failed to set fault handler");
    return -1;
}
return 0;
}