01) Fundamentals of Quantitative Design and Analysis
Chapter 1
Fundamentals of Quantitative
Design and Analysis
RISC
Dynamic power
½ × Capacitive load × Voltage² × Frequency switched
Bose-Einstein formula:
Speedup of X relative to Y
Execution time_Y / Execution time_X
Execution time
Wall clock time: includes all system overheads
CPU time: only computation time
Benchmarks
Kernels (e.g. matrix multiply)
Toy programs (e.g. sorting)
Synthetic benchmarks (e.g. Dhrystone)
Benchmark suites (e.g. SPEC06fp, TPC-C)
Performance
"X is n times faster than Y" means
ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y) = n
Computer Performance
Typical performance metrics:
Response time
Throughput
Response Time (latency)
How long does it take for my job to run?
How long does it take to execute a job?
How long must I wait for the database query?
Throughput
How many jobs can the machine run at once?
What is the average execution rate?
How much work is getting done?
Computer Performance
If we upgrade a machine with a new
processor what do we increase?
If we add a new machine to the lab what
do we increase?
Benchmarks
Kernels: small, key pieces of real applications
Toy programs (e.g. sorting)
Synthetic benchmarks (e.g. Dhrystone)
Benchmark suites (e.g. SPEC06fp, TPC-C)
Benchmark Suites
Desktop Benchmark Suites
CPU-intensive benchmarks
Graphics-intensive benchmarks
CPU-intensive benchmarks
SPEC89, SPEC92, SPEC95, SPEC2000, and SPEC CPU2006 (12 integer
benchmarks (CINT2006) and 17 floating-point benchmarks (CFP2006))
SPEC benchmarks are real programs modified to be portable and
to minimize the effect of I/O on performance (highlighting the CPU)
Graphics-intensive benchmarks:
SPECviewperf for systems supporting the OpenGL graphics library,
SPECapc for applications with intensive use of graphics (dealing
with 3D images, CAD/CAM image library)
Figure 1.16 SPEC2006 programs and the evolution of the SPEC benchmarks over time, with integer programs above the line and floating-point programs
below the line. Of the 12 SPEC2006 integer programs, 9 are written in C, and the rest in C++. For the floating-point programs, the split is 6 in Fortran, 4 in C++, 3 in
C, and 4 in mixed C and Fortran. The figure shows all 70 of the programs in the 1989, 1992, 1995, 2000, and 2006 releases. The benchmark descriptions on the left
are for SPEC2006 only and do not apply to earlier versions. Programs in the same row from different generations of SPEC are generally not related; for example,
fpppp is not a CFD code like bwaves. Gcc is the senior citizen of the group. Only 3 integer programs and 3 floating-point programs survived three or more
generations. Note that all the floating-point programs are new for SPEC2006. Although a few are carried over from generation to generation, the version of the
program changes and either the input or the size of the benchmark is often changed to increase its running time and to avoid perturbation in measurement or
domination of the execution time by some factor other than CPU time.
Benchmark Suites
Server Benchmark Suites
Servers have multiple functions, so there are multiple types of benchmarks
Processor-throughput benchmarks
SPEC CPU2000 uses the SPEC CPU benchmarks to build a throughput benchmark
Run multiple copies of each SPEC CPU benchmark, converting CPU time into a rate
(SPECrate). SPECrate is a measure of request-level parallelism
SPEC offers high-performance computing benchmarks around OpenMP and MPI to
measure thread-level parallelism
SPEC offers both a file server benchmark (SPECSFS) and a Web
server benchmark (SPECWeb)
SPECSFS is a benchmark for measuring NFS (Network File
System) performance
It tests the performance of the I/O system (both disk and network I/O) as well as
the processor
SPECWeb is a Web server benchmark
SPECjbb measures server performance for Web applications written in Java
Benchmark Suites
Transaction-processing (TP) benchmarks: measure the
ability of a system to handle database transactions
(transactions per second)
TPC (Transaction Processing Council): TPC-A (1985), TPC-C
(complex query), TPC-H (ad-hoc decision support),
TPC-R (business decision support), TPC-W (web-oriented)
TPC-E: On-Line Transaction Processing (OLTP) workload
that simulates a brokerage firm's customer accounts
TPC Energy: adds energy metrics to all existing TPC
benchmarks
Benchmark Suites
Embedded Benchmarks
Electronic Design News Embedded Microprocessor Benchmark
Consortium (EEMBC, pronounced “embassy”)
A set of 41 kernels used to predict the performance of different
embedded applications: automotive/industrial, consumer,
networking, office automation, and telecommunications.
Reporting Results
Reproducibility
With everything listed, another experimenter should be able to duplicate the
results
Factors that affect benchmark performance results
Require a fairly complete description of the machine and compiler
flags …
System’s software configuration
OS performance and support
Different compiler optimization levels
Benchmark-specific flags
Source code modifications?
Not allowed in some suites (e.g., SPEC)
Comparing and Summarizing Results
Total execution time, averaged over the n programs:
\[ \frac{1}{n} \sum_{i=1}^{n} T_i \]
                     Computer A   Computer B   Computer C
P1 (secs)                     1           10           20
P2 (secs)                  1000          100           20
Total time (secs)          1001          110           40
Following results:
A is 10 times faster than B for P1
B is 10 times faster than A for P2
A is 20 times faster than C for P1
C is 50 times faster than A for P2 ……
Simple summary (by total execution time):
B is 9.1 times faster than A for P1 and P2 combined
C is 25 times faster than A for P1 and P2 combined …
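The comparisons above can be reproduced with a few lines of Python. This is only a sketch: the dictionary layout and the helper names total_time and times_faster are assumptions made for illustration, and the execution times are copied from the table.

```python
# Execution times in seconds, copied from the table above.
times = {
    "A": {"P1": 1, "P2": 1000},
    "B": {"P1": 10, "P2": 100},
    "C": {"P1": 20, "P2": 20},
}


def total_time(machine):
    """Sum of the execution times of all programs on one machine."""
    return sum(times[machine].values())


def times_faster(x, y, program=None):
    """How many times faster x is than y: ExTime(y) / ExTime(x).

    With program=None the total execution time is compared.
    """
    if program is None:
        return total_time(y) / total_time(x)
    return times[y][program] / times[x][program]


print(times_faster("A", "B", "P1"))        # per-program comparison
print(times_faster("B", "A", "P2"))
print(round(times_faster("B", "A"), 1))    # total-time comparison
print(round(times_faster("C", "A"), 1))
```

Which machine looks best depends on whether a single program or the total time is compared, which is why a summary figure has to state how it was formed.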
Quantitative Principles of Computer Design
Take advantage of parallelism: an important method to
improve performance
System level
To improve throughput on a typical server
benchmark (SPECWeb, TPC-C), multiple processors and disks can
be used
Individual processor
Parallelism among instructions is critical to achieving high
performance
Pipelining
Digital design
Set-associative caches use multiple banks of memory that are
searched in parallel to find the desired data item
Carry-lookahead adder
Quantitative Principles of Computer Design
Principle of Locality
Programs tend to reuse data and instructions they have used
recently
Rule of thumb: a program spends 90% of its execution time in only 10% of the
code.
Temporal Locality
Recently accessed data items are likely to be accessed in the
near future
Spatial Locality
Items whose addresses are near to one another tend to be
referenced close in time
Quantitative Principles of Computer Design
Make the Common Case Fast and Make Other Parts
Correct
Finding a good trade-off is the computer architect’s job
What is the right interface between hardware and software?
In a CPU where overflow is rare, optimize the normal “add”
case (typically more than 90% of cases) rather than giving priority to the
overflow case
Amdahl's Law
Amdahl’s law states that the performance improvement
to be gained from using some faster mode of execution is
limited by the fraction of time the faster mode can be
used
Amdahl’s law defines the speedup that can be gained by
using a particular feature
Speedup_enhanced: how much faster the task would run if the enhanced mode were used for the entire program
Amdahl’s Law
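\[
\text{Speedup}_{\text{overall}} = \frac{\text{ExTime}_{\text{old}}}{\text{ExTime}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}
\]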
Amdahl’s Law
Floating point instructions improved to run 2X; but
only 10% of actual instructions are FP
ExTime_new =
Speedup_overall =
Amdahl’s Law
Floating point instructions improved to run 2X; but
only 10% of actual instructions are FP
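Treating the 10% as the fraction of execution time spent in FP:
ExTime_new = ExTime_old × (0.9 + 0.1/2) = 0.95 × ExTime_old
Speedup_overall = 1 / 0.95 = 1.053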
Amdahl’s Law
A program spends 90% of its time computing and
the other 10% on disk accesses. If we use a 100X faster
CPU, how much speedup do we gain?
Amdahl’s Law
A program spends 90% of its time computing and
the other 10% on disk accesses. If we use a 100X faster
CPU, how much speedup do we gain?
Speedup_overall = 1 / (0.1 + 0.9/100) = 1 / 0.109 = 9.17
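As a quick check of the two worked examples above, Amdahl's law can be evaluated directly. The sketch below is illustrative only: the function name amdahl_speedup is an assumption, not something from the slides, and the first example treats "10% of instructions" as 10% of execution time, as the slide's answer does.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only part of the execution time is sped up.

    fraction_enhanced: fraction of the original execution time that can
                       use the enhancement (between 0 and 1).
    speedup_enhanced:  speedup of that fraction while the enhancement is used.
    """
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# FP example: 10% of the time runs 2X faster.
fp_speedup = amdahl_speedup(0.10, 2)
# Disk example: 90% of the time (computation) runs 100X faster.
cpu_speedup = amdahl_speedup(0.90, 100)

print(f"FP 2X faster:    {fp_speedup:.3f}")
print(f"CPU 100X faster: {cpu_speedup:.2f}")
```

In the second case the untouched 10% of disk time caps the overall speedup below 10, no matter how fast the CPU becomes.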
Metrics of Performance
Performance
Performance is determined by execution time
Do any of the other variables equal performance?
# of cycles to execute program?
# of instructions in program?
# of cycles per second?
average # of cycles per instruction?
average # of instructions per second?
CPU Performance Equation
\[
\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}
\]
CPU Performance Equation
CPU time depends equally on instruction count (IC), CPI, and clock cycle (CC) time.
A 10% improvement in any one of them leads to a 10%
improvement in CPU time.
It is difficult to change one parameter in complete
isolation from the others because the basic technologies
involved in changing each characteristic are
interdependent
Clock cycle time: hardware technology and organization
CPI: organization and instruction set architecture
Instruction count: instruction set architecture and compiler
technology
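The equal dependence on the three factors can be seen numerically. The following is a minimal sketch: the baseline values (10^9 instructions, CPI of 1.5, 1 ns clock cycle) are hypothetical and chosen only for illustration, and a "10% improvement" is read here as a 10% reduction of one factor with the other two held fixed.

```python
def cpu_time(instruction_count, cpi, clock_cycle_time):
    """CPU time in seconds = instructions x cycles/instruction x seconds/cycle."""
    return instruction_count * cpi * clock_cycle_time


# Hypothetical baseline machine and program (illustrative values only).
IC = 1_000_000_000   # instructions per program
CPI = 1.5            # average cycles per instruction
CYCLE = 1e-9         # seconds per clock cycle (1 GHz clock)

baseline = cpu_time(IC, CPI, CYCLE)

# Reduce one factor at a time by 10% and compare against the baseline.
ratios = {
    "instruction count": cpu_time(IC * 0.9, CPI, CYCLE) / baseline,
    "CPI": cpu_time(IC, CPI * 0.9, CYCLE) / baseline,
    "clock cycle time": cpu_time(IC, CPI, CYCLE * 0.9) / baseline,
}

print(f"baseline CPU time: {baseline:.2f} s")
for name, ratio in ratios.items():
    print(f"10% lower {name}: CPU time x {ratio:.2f}")
```

In practice the factors cannot be varied this independently, which is the point of the interdependence note above.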
CPU Performance Equation
\[
\text{CPI} = \frac{\sum_{i=1}^{n} \text{CPI}_i \times \text{IC}_i}{\text{Instruction Count}}
\]
CPU Performance Equation
\[
\text{CPI} = \sum_{i=1}^{n} \frac{\text{IC}_i}{\text{Instruction Count}} \times \text{CPI}_i
\]
Calculating CPI
Base Machine (Reg / Reg), typical mix
Op       Freq   Cycles   CPI(i)   (% Time)
ALU      50%    1        0.5      (33%)
Load     20%    2        0.4      (27%)
Store    10%    2        0.2      (13%)
Branch   20%    2        0.4      (27%)
Total                    1.5
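The table can be recomputed from the instruction mix alone. A minimal sketch, with the frequencies and cycle counts copied from the table; the variable names (mix, contribution, time_share) are assumptions for illustration.

```python
# Instruction mix from the table: op -> (frequency, cycles per instruction).
mix = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}

# Each class contributes frequency x cycles to the average CPI.
contribution = {op: freq * cycles for op, (freq, cycles) in mix.items()}
cpi = sum(contribution.values())

# Share of execution time spent in each class.
time_share = {op: c / cpi for op, c in contribution.items()}

for op in mix:
    print(f"{op:<7} CPI(i) = {contribution[op]:.1f}  ({time_share[op]:.0%} of time)")
print(f"Average CPI = {cpi:.1f}")
```

Note that ALU operations are half of all instructions but only about a third of the execution time, because the other classes take two cycles each.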
Calculating CPI
Suppose we have the following measurements:
Frequency of FP operation = 25 %
Average CPI of FP operations = 4.0
Average CPI of other instructions = 1.33
Frequency of FPSQR = 2%
CPI of FPSQR = 20
Assume that the two design alternatives are to decrease the CPI of
FPSQR to 2 or to decrease the average CPI of all FP operations to
2.5. Compare these two design alternatives using the processor
performance equation.
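One way to work this comparison is sketched below. It rests on assumptions the slide does not state outright: instruction count and clock cycle time are the same for both alternatives (so the CPI ratio is the speedup), and FPSQR is counted among the FP operations, so its cycles are already part of the FP average CPI of 4.0. Variable names are illustrative.

```python
# Measurements from the slide.
freq_fp, cpi_fp = 0.25, 4.0          # all FP operations
cpi_other = 1.33                     # all non-FP instructions
freq_fpsqr, cpi_fpsqr = 0.02, 20.0   # FPSQR alone

# Original average CPI from the instruction mix.
cpi_original = freq_fp * cpi_fp + (1 - freq_fp) * cpi_other

# Alternative 1: FPSQR drops from 20 to 2 cycles, saving 18 cycles
# on 2% of the instructions.
cpi_new_fpsqr = cpi_original - freq_fpsqr * (cpi_fpsqr - 2.0)

# Alternative 2: average CPI of all FP operations drops to 2.5.
cpi_new_fp = freq_fp * 2.5 + (1 - freq_fp) * cpi_other

# With IC and clock cycle time fixed, speedup is the ratio of CPIs.
speedup_fpsqr = cpi_original / cpi_new_fpsqr
speedup_fp = cpi_original / cpi_new_fp

print(f"original CPI:      {cpi_original:.4f}")
print(f"faster FPSQR only: CPI {cpi_new_fpsqr:.4f}, speedup {speedup_fpsqr:.2f}")
print(f"faster all FP:     CPI {cpi_new_fp:.4f}, speedup {speedup_fp:.2f}")
```

Under these assumptions, lowering the average CPI of all FP operations gives the slightly lower overall CPI, so it is marginally the better alternative.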
More on CPI
Is CPI a good metric to evaluate CPU performance?
No. A small CPI does not by itself mean a faster processor, since the clock
rate may vary and compilers may generate programs of
different lengths
More on CPI
Can CPI be less than 1?
Amdahl’s Law for Multiprocessor
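Speedup (S), where T is the execution time on one processor, α is the fraction of T that cannot be parallelized, and n is the number of processors:
S = T / [αT + (1 – α)T/n]
= 1 / [α + (1 – α)/n]
= n / [nα + (1 – α)]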
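The formula can be tabulated to see how quickly the speedup saturates. A minimal sketch, assuming α is the fraction of single-processor time that cannot be parallelized and n is the number of processors; the function name and the value α = 0.10 are hypothetical, chosen only for illustration.

```python
def multiprocessor_speedup(alpha, n):
    """Speedup on n processors when a fraction alpha of the work stays serial."""
    return 1.0 / (alpha + (1.0 - alpha) / n)


alpha = 0.10  # hypothetical serial fraction, for illustration only

for n in (1, 2, 8, 64, 1024):
    print(f"n = {n:>4}: speedup = {multiprocessor_speedup(alpha, n):.2f}")

# As n grows, (1 - alpha)/n tends to 0 and the speedup approaches 1/alpha.
print(f"upper bound 1/alpha = {1 / alpha:.1f}")
```

The serial fraction sets the ceiling: with 10% serial work, no number of processors gives a speedup of 10 or more.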