15-150 Fall 2014

Lecture 18
Stephen Brookes

parallel programming

parallelism and functional style

cost semantics

Brent's Theorem and speed-ups

sequences: an abstract type with
efficient parallel operations

exploiting multiple processors

evaluating independent code simultaneously

low-level implementation

scheduling work onto processors

high-level planning

designing code abstractly

without baking in a schedule

our approach
design abstractly

specify behavioral correctness

specify asymptotic runtime (work, span)

reason abstractly

independently of schedule

cost semantics and evaluation

functional benefits
No side effects, so evaluation order

doesn't affect behavioral correctness

Can build abstract types that support

efficient parallel-friendly operations

Can use work and span to predict

how parallelizable our code is

Work and span are independent of

scheduling details

In practice, it's hard to achieve speed-up

Current language implementations
don't make it easy

Problems include:

scheduling overhead

locality of data (cache problems)

runtime sensitive to scheduling choices

why bother?
It's good to think abstractly first
and figure out details later

Focus on data dependencies

when you design your code

Our thesis: this approach to parallelism

will prevail...

(and 15-210 builds on these ideas...)

cost semantics
We've already introduced work and span

Work estimates the sequential running time

on a single processor

Span takes account of data dependency,

estimates the parallel running time
with unlimited processors

cost semantics
We showed how to calculate work and span

for recursive functions with recurrence relations

Now we introduce cost graphs,

another way to deal with work and span

Cost graphs also allow us to talk about schedules...

... and the potential for speed-up

cost graphs
A cost graph is a series-parallel graph

directed graph, with source and sink

nodes represent units of work (constant time)

edges represent data dependencies

branching indicates potential parallelism

cost graphs




[Figure: cost graphs, including a single node]





work and span

of a cost graph

The work is the number of nodes

The span is the length of the longest path
from source to sink

span(G) ≤ work(G)

work(G1 followed by G2) = work G1 + work G2 + c
sequential code: add the work

work(G1 in parallel with G2) = work G1 + work G2 + c
independent code: add the work

span(G1 followed by G2) = span G1 + span G2 + c
sequential code: add the span

span(G1 in parallel with G2) = max(span G1, span G2) + c
parallel code: max the span
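The composition rules for work and span can be sketched as a small SML datatype. This is only an illustration: the names graph, Unit, Then and Both are hypothetical, the constant c is fixed at 1, and a single node is assumed to cost 1 for both work and span.

```sml
(* A hypothetical model of series-parallel cost graphs. *)
datatype graph
  = Unit                    (* a single node *)
  | Then of graph * graph   (* sequential composition *)
  | Both of graph * graph   (* parallel composition *)

(* Assumed constant overhead for each composition. *)
val c = 1

(* Work: add the work of both parts, for either composition. *)
fun work Unit = 1
  | work (Then (g1, g2)) = work g1 + work g2 + c
  | work (Both (g1, g2)) = work g1 + work g2 + c

(* Span: add for sequential code, take the max for parallel code. *)
fun span Unit = 1
  | span (Then (g1, g2)) = span g1 + span g2 + c
  | span (Both (g1, g2)) = Int.max (span g1, span g2) + c

val g = Then (Unit, Both (Unit, Both (Unit, Unit)))
val w = work g   (* 7 *)
val s = span g   (* 5 *)
```

Only the Both case differs between the two functions, which is where span can fall below work.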

must be done


work = 11 (number of nodes)

span = 4 (longest path length)

using cost graphs

Every expression can be given a cost graph

Can calculate work and span using the graph

These are asymptotically the same as
the work and span derived from
recurrence relations

work and span provide

asymptotic estimates of

actual running time,

under certain assumptions

basic operations

take constant time
work: single processor

span: many processors

Work: number of nodes

Span: length of critical path

w = 11

uses 5 processors


assign units of work to processors

respecting data dependency

an optimal parallel schedule
(5 rounds, or 4 steps)

What if there are only 2 processors?

w = 11


a best schedule

for 2 processors
(6 rounds,

5 steps)

2 processors cannot do the work as fast as 5 (!)

Brent's Theorem
An expression with work w and span s
can be evaluated on a p-processor machine
in time O(max(w/p, s)).
Optimal schedule using p processors:

Do (up to) p units of work each round

Total work to do is w

Needs at least s steps
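As a rough illustration, the quantity inside the O(...) can be computed for the running example (w = 11, s = 4). The helper names ceilDiv and brentBound are hypothetical, and the result is only an estimate up to constant factors, not an exact round count.

```sml
(* Ceiling of w / p, for positive p. *)
fun ceilDiv (w, p) = (w + p - 1) div p

(* The quantity inside the O(...) of Brent's Theorem: max(w/p, s). *)
fun brentBound (w, s, p) = Int.max (ceilDiv (w, p), s)

val onTwo   = brentBound (11, 4, 2)   (* 6: limited by w/p *)
val onThree = brentBound (11, 4, 3)   (* 4: limited by the span *)
val onFive  = brentBound (11, 4, 5)   (* 4: extra processors do not help *)
```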

Richard Brent is an illustrious Australian mathematician and computer scientist.

He is known for Brent's Theorem, which shows that a parallel algorithm can
always be adapted to run on fewer processors with only the obvious time penalty:
a beautiful example of an "obvious but non-trivial" theorem.

Brent's Theorem
An expression with work w and span s
can be evaluated on a p-processor machine
in time O(max(w/p, s)).
Find me the smallest

p such that

w/p ≤ s
Using more than

this many processors

won't yield any speed-up


w = 11

min {p | w/p ≤ s} is 3

a best schedule for 3 processors
(5 rounds, 4 steps)

3 processors

can do the work as fast as 5(!)
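The smallest p with w/p ≤ s is the ceiling of w/s. A minimal sketch, with hypothetical names minProcs and searchProcs; the second function reads the definition directly as a search, assuming w and s are positive.

```sml
(* Smallest p such that w/p <= s, i.e. the ceiling of w / s. *)
fun minProcs (w, s) = (w + s - 1) div s

(* The same value found by trying p = 1, 2, 3, ... in turn. *)
fun searchProcs (w, s) =
  let
    fun try p = if w <= p * s then p else try (p + 1)
  in
    try 1
  end

val p = minProcs (11, 4)   (* 3, as in the example above *)
```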

Exploiting parallelism in ML

A signature for parallel collections

Cost analysis of implementations

Cost benefits of parallel algorithm design

signature SEQ =
sig
  type 'a seq
  exception Range
  val tabulate : (int -> 'a) -> int -> 'a seq
  val length : 'a seq -> int
  val nth : int -> 'a seq -> 'a
  val map : ('a -> 'b) -> 'a seq -> 'b seq
  val reduce : ('a * 'a -> 'a) -> 'a -> 'a seq -> 'a
  val mapreduce : ('a -> 'b) -> 'b -> ('b * 'b -> 'b) -> 'a seq -> 'b
end

Many ways to implement the signature

lists, balanced trees, arrays, ...

For each one, can give a cost analysis

There may be implementation trade-offs

arrays: item access is O(1)

trees: item access is O(log n)

Seq : SEQ
An abstract parameterized type of sequences

Think of a sequence as a parallel collection

With parallel-friendly operations

constant-time access to items

efficient map and reduce
We'll work today with an implementation

Seq : SEQ

based on vectors
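To make the signature concrete, here is one possible vector-backed structure. It is a sketch, not the course's Seq: the name VSeq is hypothetical, evaluation is purely sequential, and only the extensional behavior is modeled. The balanced splitting in mapreduce mirrors the shape a parallel implementation could exploit, and a singleton is returned without combining with z, which agrees with the specification only when z is an identity for g.

```sml
signature SEQ =
sig
  type 'a seq
  exception Range
  val tabulate : (int -> 'a) -> int -> 'a seq
  val length : 'a seq -> int
  val nth : int -> 'a seq -> 'a
  val map : ('a -> 'b) -> 'a seq -> 'b seq
  val reduce : ('a * 'a -> 'a) -> 'a -> 'a seq -> 'a
  val mapreduce : ('a -> 'b) -> 'b -> ('b * 'b -> 'b) -> 'a seq -> 'b
end

(* Hypothetical, sequential stand-in built on Basis vectors. *)
structure VSeq :> SEQ =
struct
  type 'a seq = 'a vector
  exception Range

  (* Vector.tabulate takes a pair (n, f); SEQ's tabulate is curried. *)
  fun tabulate f n = Vector.tabulate (n, f)

  val length = Vector.length

  (* Translate the Basis exception Subscript into Range. *)
  fun nth i s = Vector.sub (s, i) handle Subscript => raise Range

  val map = Vector.map

  (* Split the index range [lo, hi) in half and combine the halves with g. *)
  fun mapreduce f z g s =
    let
      fun go (lo, hi) =
        if hi - lo = 0 then z
        else if hi - lo = 1 then f (Vector.sub (s, lo))
        else
          let
            val mid = lo + (hi - lo) div 2
          in
            g (go (lo, mid), go (mid, hi))
          end
    in
      go (0, Vector.length s)
    end

  fun reduce g z s = mapreduce (fn x => x) z g s
end

val squares = VSeq.tabulate (fn x => x * x) 6
val third = VSeq.nth 3 squares                                        (* 9 *)
val total = VSeq.reduce (op +) 0 (VSeq.tabulate (fn i => i + 1) 8)    (* 36 *)
```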

sequence values
A value of type t seq

is a sequence of values of type t

We use math notation like

⟨v1, ..., vn⟩

⟨v0, ..., vn-1⟩

for sequence values

⟨1, 2, 4, 8⟩ is a value of type int seq

Two sequence values are (extensionally) equal
iff they have the same length
and their items are equal

⟨v1, ..., vn⟩ = ⟨u1, ..., um⟩

if and only if
n = m and for all i, vi = ui

For each operation in the signature SEQ

we specify the (extensional) behavior of the

operation implemented in Seq
and discuss its cost semantics

Other structures with the same signature

may implement the operations with
different work and span profiles

Learn to choose wisely!

tabulate f n = ⟨f 0, ..., f(n-1)⟩

If Gi is the cost graph for f(i),

the cost graph for tabulate f n is

[Figure: G0, ..., Gn-1 side by side]

If f is O(1), the work for tabulate f n is O(n)

If f is O(1), the span for tabulate f n is O(1)

tabulate (fn x:int => x) 6 = ⟨0, 1, 2, 3, 4, 5⟩

tabulate (fn x:int => x*x) 6 = ⟨0, 1, 4, 9, 16, 25⟩

tabulate (fn _ => raise Range) 0 = ⟨ ⟩
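These examples can be tried with the Basis Library's Vector.tabulate as a stand-in. Note the assumption: Vector.tabulate takes its arguments as a pair (n, f), unlike the curried tabulate f n of SEQ, and Fail is used in place of Range.

```sml
(* Stand-in for tabulate, using Basis vectors. *)
val ids     = Vector.tabulate (6, fn x => x)       (* 0, 1, 2, 3, 4, 5 *)
val squares = Vector.tabulate (6, fn x => x * x)   (* 0, 1, 4, 9, 16, 25 *)

(* With n = 0 the function is never applied, so nothing is raised. *)
val empty : int vector =
  Vector.tabulate (0, fn _ => raise Fail "never called")
```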

length ⟨v1, ..., vn⟩ = n

Work is O(1)

Span is O(1)

Cost graph is a single node

Contrast: List.length [v1, ..., vn] = n

work, span O(n)

nth i ⟨v0, ..., vn-1⟩ = vi if 0 ≤ i < n
= raise Range otherwise

Work is O(1)

Span is O(1)

Seq provides

constant-time access to items

map f ⟨v1, ..., vn⟩ = ⟨f v1, ..., f vn⟩
map f ⟨v1, ..., vn⟩ has cost graph

[Figure: G1, ..., Gn side by side]

where each Gi is the graph for f vi

If f is constant time, map f ⟨v1, ..., vn⟩ has

work O(n), span O(1)

(contrast with

reduce should be used to combine a sequence using
an associative function g with identity element z

g : t * t -> t is associative iff for all x1, x2, x3 : t

g(x1, g(x2, x3)) = g(g(x1, x2), x3)

z is an identity for g iff for all x:t, g(x,z) = x
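Both laws can be spot-checked on sample values. The helpers assocAt and identityAt are hypothetical names, and each call tests a single instance only; it proves nothing about all values of the type.

```sml
(* Does g associate at this one triple of values? *)
fun assocAt g (x1, x2, x3) = (g (x1, g (x2, x3)) = g (g (x1, x2), x3))

(* Is z a right identity for g at this one value? *)
fun identityAt g z x = (g (x, z) = x)

val plusAssoc  = assocAt (op +) (1, 2, 3)   (* true:  1+(2+3) = (1+2)+3 *)
val minusAssoc = assocAt (op -) (1, 2, 3)   (* false: 1-(2-3) = 2, (1-2)-3 = ~4 *)
val zeroIdent  = identityAt (op +) 0 5      (* true:  5 + 0 = 5 *)
```

Subtraction fails the associativity check, so it is not a suitable argument for reduce.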

We write

v1 g v2 g ... g vn g z
for the result of combining v1, ..., vn, z

When g is associative and z is an identity

reduce g z ⟨v1, ..., vn⟩ = v1 g v2 g ... g vn g z

If g is constant time,

reduce g z ⟨v1, ..., vn⟩

has work O(n) (needs to use g n times)

and span O(log n)

(Contrast with foldr, foldl on lists)

reduce (op +) 0 ⟨1, 2, 3, 4, 5, 6, 7, 8⟩

[Figure: cost graph, combining the pairs 1 2, 3 4, 5 6, 7 8]


reduce cost
reduce g z ⟨v1, ..., v2n⟩ =
g(reduce g z ⟨v1, ..., vn⟩, reduce g z ⟨vn+1, ..., v2n⟩)

[Figure: the cost graph G1, ..., 2n is built from G1, ..., n and Gn+1, ..., 2n side by side]

W(2n) = 2*W(n) + c
S(2n) = S(n) + c

W(n) is O(n)
S(n) is O(log2 n)
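One way to see these bounds is to unroll the two recurrences for n = 2^k. The base costs W(1) and S(1) are assumed here to be constants; they are not given on the slide.

```latex
\begin{align*}
W(2^k) &= 2\,W(2^{k-1}) + c
        = 2^k\,W(1) + c\,(2^{k-1} + \dots + 2 + 1)
        = 2^k\,W(1) + c\,(2^k - 1), \\
\text{so } W(n) &= n\,W(1) + c\,(n - 1) \in O(n). \\[4pt]
S(2^k) &= S(2^{k-1}) + c = S(1) + c\,k, \\
\text{so } S(n) &= S(1) + c \log_2 n \in O(\log n).
\end{align*}
```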

When g is associative and z is an identity,

mapreduce f z g ⟨v1, ..., vn⟩ = (f v1) g ... g (f vn) g z

When f, g are constant time,

mapreduce f z g ⟨v1, ..., vn⟩
has work O(n)
and span O(log n)

fun sum (s : int seq) : int = reduce (op +) 0 s
fun count (s : int seq seq) : int = sum (map sum s)
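To run sum and count without the course's Seq structure, Basis vectors can stand in for sequences. In this sketch the definitions of seq, reduce and map are assumptions: they are sequential, with reduce implemented by Vector.foldr, and they reproduce only the results, not the parallel cost.

```sml
(* Sequential stand-ins for the Seq operations, built on Basis vectors. *)
type 'a seq = 'a vector
fun reduce g z (s : 'a seq) = Vector.foldr g z s
val map = Vector.map

fun sum (s : int seq) : int = reduce (op +) 0 s
fun count (s : int seq seq) : int = sum (map sum s)

(* A 3-by-3 grid holding 0, 1, ..., 8 in row order. *)
val grid = Vector.tabulate (3, fn i => Vector.tabulate (3, fn j => i * 3 + j))
val total = count grid   (* 0 + 1 + ... + 8 = 36 *)
```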

Let s be a value of type int seq seq

consisting of n rows, each of length n

What are the work and span for

count s ?

Let s = ⟨s1, ..., sn⟩, si = ⟨xi1, ..., xin⟩, ti = sum si
For each i, sum si = reduce (op +) 0 ⟨xi1, ..., xin⟩

[Figure: cost graph of sum si, of depth log2 n]

work is O(n)

span is O(log n)

map sum s = ⟨sum s1, ..., sum sn⟩

[Figure: cost graph of map sum s, with the graphs for sum s1, ..., sum sn side by side]

span is O(log n)

Let ti = sum si
count s = sum ⟨t1, ..., tn⟩

[Figure: cost graph of sum (map sum s): the graphs for sum s1, ..., sum sn side by side (depth log2 n), followed by the graph for sum ⟨t1, ..., tn⟩ (depth log2 n)]

work is O(n^2)

span is O(log n)
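The bounds for count can be checked numerically with a small cost model. This is a sketch under stated assumptions: reduceCost and countCost are hypothetical names, only applications of + are counted (the combination with z and the overhead of map are ignored), and reduce is assumed to split its input into balanced halves.

```sml
(* (work, span) of a balanced reduce over n items, counting uses of g. *)
fun reduceCost n =
  if n <= 1 then (0, 0)
  else
    let
      val half = n div 2
      val (w1, s1) = reduceCost half
      val (w2, s2) = reduceCost (n - half)
    in
      (w1 + w2 + 1, Int.max (s1, s2) + 1)
    end

(* count on n rows of length n: n row sums in parallel, then one final sum. *)
fun countCost n =
  let
    val (w, s) = reduceCost n
  in
    (n * w + w, s + s)
  end

val (work8, span8)   = countCost 8    (* (63, 6) *)
val (work16, span16) = countCost 16   (* (255, 8) *)
```

Doubling n roughly quadruples the work but adds only 2 to the span, in line with O(n^2) work and O(log n) span.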

