inst.eecs.berkeley.edu/~cs61c
CS61C : Machine Structures
Lecture #29 Performance & Parallel Intro
2007-8-14
Scott Beamer, Instructor
“Paper” Battery Developed by Researchers at Rensselaer (www.bbc.co.uk)
“Last time…”
• Magnetic disks continue rapid advance: 60%/yr capacity, 40%/yr bandwidth, slow improvements in seek and rotation times, MB/$ improving ~100%/yr (see the compounding check after this list)
• Designs fit high-volume form factors
• PMR is a fundamentally new technology that breaks through the density barrier
• RAID
• Higher performance with more disk arms per $
• Adds option for small # of extra disks
• Can nest RAID levels
• Today RAID is a tens-of-billions-of-dollars industry; 80% of non-PC disks are sold in RAIDs; RAID started at Cal
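
To put those growth rates in perspective, a quick compounding check (a sketch using only the rates quoted above):

  capacity:  1.60^10 ≈ 110x per decade
  bandwidth: 1.40^10 ≈  29x per decade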
Peer Instruction

1. RAID 1 (mirror) and 5 (rotated parity) help with performance and availability
2. RAID 1 has higher cost than RAID 5
3. Small writes on RAID 5 are slower than on RAID 1

Answer choices (ABC): 0: FFF, 1: FFT, 2: FTF, 3: FTT, 4: TFF, 5: TFT, 6: TTF, 7: TTT
Peer Instruction Answer
1. All RAID levels (0-5) help with performance; only RAID 0 doesn’t help availability. TRUE
2. Surely! Must buy 2x disks rather than 1.25x (from diagram; in practice even less). TRUE
3. RAID 5 small writes take 2 reads + 2 writes vs. RAID 1’s 2 writes. Latency is worse; throughput (parallel writes) is better. TRUE
(All three TRUE ⇒ choice 7: TTT.)
Why Performance? Faster is better!
• Purchasing Perspective: given a collection of machines (or upgrade options), which has the
  best performance?
  least cost?
  best performance / cost?
Peer Instruction Answer

A. Rarely does a company selling a product give unbiased performance data. TRUE
   It is rare to find a company that gives metrics that do not favor its product.
B. The Sieve of Eratosthenes, Puzzle and Quicksort were early effective benchmarks. FALSE
   Early benchmarks? Yes. Effective? No. Too simple!
C. A program runs in 100 sec. on a machine, and mult accounts for 80 sec. of that. If we want to make the program run 6 times faster, we need to up the speed of mults by AT LEAST 6. FALSE
   6 times faster = 100/6 ≈ 16.7 sec. total, but the non-mult work alone takes 20 sec., so mults would have to take negative time. I.e., impossible!

Answer choices (ABC): 0: FFF, 1: FFT, 2: FTF, 3: FTT, 4: TFF, 5: TFT, 6: TTF, 7: TTT — correct: 4 (TFF)
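
The arithmetic behind C is an instance of Amdahl's Law; worked out with the question's numbers:

  new time = time_unaffected + time_affected / speedup_mult
           = 20 s + (80 s / speedup_mult)  >  20 s

  Best case (infinitely fast mults): 100 s / 20 s = 5x overall speedup,
  so the requested 6x (100 s / 6 ≈ 16.7 s) is impossible no matter how
  much the mults are sped up.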
“And in conclusion…”

CPU time = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
(a numeric sketch follows the summary below)
• Latency v. Throughput
• Performance doesn’t depend on any single factor: need instruction count, clocks per instruction (CPI), and clock rate to get valid estimates
• User Time: time user waits for program to execute:
depends heavily on how OS switches between tasks
• CPU Time: time spent executing a single program:
depends solely on design of processor (datapath,
pipelining effectiveness, caches, etc.)
• Benchmarks
• Attempt to predict performance; updated every few years
• Measure everything from simulation of desktop graphics programs to battery life
• Megahertz Myth
• MHz ≠ performance, it’s just one factor
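
A small numeric sketch of the CPU-time equation in C (the instruction count, CPI, and clock rate below are made-up illustrative values):

#include <stdio.h>

int main(void) {
    /* CPU time = Instructions/Program x Cycles/Instruction x Seconds/Cycle */
    double instructions = 1e9;  /* dynamic instruction count */
    double cpi          = 1.5;  /* average clocks per instruction */
    double clock_hz     = 2e9;  /* 2 GHz clock => 0.5 ns per cycle */

    double cpu_time = instructions * cpi * (1.0 / clock_hz);
    printf("CPU time = %.3f s\n", cpu_time);  /* prints 0.750 s */
    return 0;
}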
[Figure: single-processor performance, 1978-2006, plotted relative to the VAX-11/780 (log scale, 1 to 1000): growth of roughly 25%/year early on, then 52%/year.]
Let’s Put Many CPUs Together!
• Distributed computing (SW parallelism)
• Many separate computers (each with independent
CPU, RAM, HD, NIC) that communicate through a
network
• Grids (home computers across Internet) and Clusters
(all in one room)
• Can be “commodity” clusters, 100K+ nodes
• About being able to solve “big” problems, not
“small” problems faster
• Multiprocessing (HW parallelism)
• Multiple processors “all in one box” that often
communicate through shared memory
• Includes multicore (many new CPUs)
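
A minimal sketch of the shared-memory style of communication using POSIX threads (the thread count and workload are illustrative):

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

/* Shared memory: all threads read and write the same counter. */
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);   /* coordinate via shared memory */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);  /* prints 4000000 */
    return 0;
}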
MapReduce tutorial: code.google.com/edu/parallel/mapreduce-tutorial.html
MapReduce Code Example
map(String input_key, String input_value):
  // input_key : document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key : a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
• “Mapper” nodes are responsible for the map function
• “Reducer” nodes are responsible for the reduce function
• Data on a distributed file system (DFS)
Example (word count): the mappers emit the intermediate pairs ah:1, ah:1, er:1, ah:1, if:1, or:1, or:1, uh:1, or:1, ah:1, if:1; after grouping by key, reduce emits ah:4, er:1, if:2, or:3, uh:1.
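
A single-process C sketch of that word-count data flow (no real framework here: an in-memory table stands in for the shuffle and the DFS):

#include <stdio.h>
#include <string.h>

#define MAX_WORDS 1024

static char keys[MAX_WORDS][32];
static int  counts[MAX_WORDS];
static int  nkeys = 0;

/* EmitIntermediate(w, "1"): the shuffle is simulated by summing
   counts for equal keys as they arrive. */
static void emit_intermediate(const char *w) {
    for (int i = 0; i < nkeys; i++)
        if (strcmp(keys[i], w) == 0) { counts[i]++; return; }
    if (nkeys < MAX_WORDS) {
        strncpy(keys[nkeys], w, sizeof keys[0] - 1);
        counts[nkeys++] = 1;
    }
}

/* map: split the document into words, emit (word, 1) per word */
static void map_doc(char *contents) {
    for (char *w = strtok(contents, " \n"); w; w = strtok(NULL, " \n"))
        emit_intermediate(w);
}

int main(void) {
    char doc[] = "ah ah er ah if or or uh or ah if";
    map_doc(doc);
    /* reduce: one summed count per key */
    for (int i = 0; i < nkeys; i++)
        printf("%s:%d\n", keys[i], counts[i]);  /* ah:4 er:1 if:2 or:3 uh:1 */
    return 0;
}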
MapReduce Advantages/Disadvantages
• Now it’s easy to program for many CPUs
• Communication management effectively gone: I/O scheduling is done for us
• Fault tolerance and monitoring: machine failures, suddenly-slow machines, and other issues are handled
• Can be much easier to design and program!
• But… it further restricts solvable problems
• Might be hard to express some problems in a MapReduce
framework
• Data parallelism is key: need to be able to break up the problem by data chunks
• Google’s MapReduce is closed-source, but Hadoop is an open-source implementation!