Fundamentals of Quantitative Design and Analysis: A Quantitative Approach, Fifth Edition
Chapter 1
Fundamentals of Quantitative
Design and Analysis
RISC
Dynamic power
$\frac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched}$
Bose-Einstein formula:
X is n times faster than Y:
$$\frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = n$$
Computer Performance
Typical performance metrics:
Response time
Throughput
Response Time (latency)
How long does it take for my job to run?
How long does it take to execute a job?
How long must I wait for the database query?
Throughput
How many jobs can the machine run at once?
What is the average execution rate?
How much work is getting done?
If we upgrade a machine with a new processor what do
we increase?
If we add a new machine to the lab what do we increase?
Benchmarks
Kernels: small, key pieces of real applications
Toy programs (e.g. sorting)
Synthetic benchmarks (e.g. Dhrystone)
Benchmark suites (e.g. SPEC06fp, TPC-C)
Benchmark Suites
Desktop Benchmark Suites
CPU-intensive benchmarks
Graphics-intensive benchmarks
CPU-intensive benchmarks:
SPEC89, SPEC92, SPEC95, SPEC2000, and SPEC CPU2006
(12 integer benchmarks (CINT2006) and 17 floating-point
benchmarks (CFP2006))
SPEC benchmarks are real programs modified to be portable and
to minimize the effect of I/O on performance (highlighting the CPU)
Graphics-intensive benchmarks:
SPECviewperf for systems supporting the OpenGL graphics library,
SPECapc for applications with intensive use of graphics (dealing
with 3D images, CAD/CAM image library)
Figure 1.16 SPEC2006 programs and the evolution of the SPEC benchmarks over time, with integer programs above the line and floating-point programs
below the line. Of the 12 SPEC2006 integer programs, 9 are written in C, and the rest in C++. For the floating-point programs, the split is 6 in Fortran, 4 in C++, 3 in
C, and 4 in mixed C and Fortran. The figure shows all 70 of the programs in the 1989, 1992, 1995, 2000, and 2006 releases. The benchmark descriptions on the left
are for SPEC2006 only and do not apply to earlier versions. Programs in the same row from different generations of SPEC are generally not related; for example,
fpppp is not a CFD code like bwaves. Gcc is the senior citizen of the group. Only 3 integer programs and 3 floating-point programs survived three or more
generations. Note that all the floating-point programs are new for SPEC2006. Although a few are carried over from generation to generation, the version of the
program changes and either the input or the size of the benchmark is often changed to increase its running time and to avoid perturbation in measurement or
domination of the execution time by some factor other than CPU time.
Server Benchmark Suites
Servers have multiple functions, so there are multiple types of benchmarks
Processor-throughput benchmarks
SPEC CPU2000 uses the SPEC CPU benchmarks to build a simple throughput benchmark
Run multiple copies of each SPEC CPU benchmark, converting the CPU time into a rate
(SPECrate). SPECrate is a measure of request-level parallelism
SPEC offers high-performance computing benchmarks built around OpenMP and MPI to
measure thread-level parallelism
SPEC offers both file server benchmark (SPECSFS) and Web
server benchmark (SPECWeb)
SPECSFS is a benchmark for measuring NFS (Network File
System) performance
Test the performance of the I/O system (both disk and network I/O) as well as
the processor
SPECWeb is a web server benchmark
SPECjbb measures server performance for Web applications written in Java
Transaction-processing (TP) benchmarks: measure the
ability of a system to handle database transactions
(transactions per second)
TPC (Transaction Processing Performance Council) benchmarks:
TPC-A (1985)
TPC-C (complex query)
TPC-H (ad-hoc decision support)
TPC-R (business decision support)
TPC-W (web-oriented)
TPC-E: On-Line Transaction Processing (OLTP) workload
that simulates a brokerage firm's customer accounts
TPC Energy: adds energy metrics to all existing TPC
benchmarks
Embedded Benchmarks
Electronic Design News Embedded Microprocessor Benchmark
Consortium (EEMBC, pronounced "embassy")
It is a set of 41 kernels used to predict the performance of different
embedded applications: automotive/industrial, consumer,
networking, office automation, and telecommunication.
Reporting Results
Reproducibility
Another experimenter, given everything listed, should be able to duplicate the
results
Factors that affect the performance results of benchmarks
Require a fairly complete description of the machine and compiler
flags …
System’s software configuration
OS performance and support
Different compiler optimization levels
Benchmark-specific flags
Source code modifications?
Not allowed in some suites, e.g., SPEC
Comparing and Summarizing Results
Total execution time, with arithmetic mean $\frac{1}{n}\sum_{i=1}^{n} T_i$
                     Computer A   Computer B   Computer C
P1 (secs)                     1           10           20
P2 (secs)                  1000          100           20
Total time (secs)          1001          110           40
Per-program results:
A is 10 times faster than B for P1
B is 10 times faster than A for P2
A is 20 times faster than C for P1
C is 50 times faster than A for P2 …
Simple summary using total time:
B is 9.1 times faster than A for P1 and P2 combined
C is 25 times faster than A for P1 and P2 combined …
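The comparisons above can be reproduced with a few lines of Python. This is only a sketch: the `times` dictionary and the helper names `total_time` and `times_faster` are illustrative choices, not part of the slides. "X is n times faster than Y" is taken to mean ExTime(Y) / ExTime(X).

```python
# Execution times in seconds, copied from the table above.
times = {
    "A": {"P1": 1, "P2": 1000},
    "B": {"P1": 10, "P2": 100},
    "C": {"P1": 20, "P2": 20},
}


def total_time(machine):
    """Sum of the execution times of all programs on one machine."""
    return sum(times[machine].values())


def times_faster(x, y, program=None):
    """Return n such that x is n times faster than y: ExTime(y) / ExTime(x).

    With program=None the total execution time over all programs is used.
    """
    if program is None:
        return total_time(y) / total_time(x)
    return times[y][program] / times[x][program]


print(times_faster("A", "B", "P1"))      # 10.0
print(times_faster("C", "A", "P2"))      # 50.0
print(round(times_faster("B", "A"), 1))  # 9.1
print(round(times_faster("C", "A"), 1))  # 25.0
```

Note how the per-program ratios point in opposite directions while the total-time summary is dominated by P2, the longer-running program.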
Quantitative Principles of Computer Design
Take advantage of parallelism: Important method to
improve performance
System level
To improve the throughput performance of a typical server
benchmark (SPECWeb, TPC-C), multiple processors and disks can
be used
Individual processor
Parallelism among instructions is critical to achieving high
performance
Pipelining
Digital Design
Set-associative caches use multiple banks of memory that are
searched in parallel to find the desired data item
Carry-lookahead adder
Principle of Locality
Programs tend to reuse data and instructions they have used
recently
Rule of thumb: a program spends 90% of its execution time in only 10% of the
code.
Temporal Locality
Recently accessed data items are likely to be accessed in the
near future
Spatial Locality
Items whose addresses are near to one another tend to be
referenced close in time
Make the Common Case Fast and Make Other Parts
Correct
Finding a good trade-off is the computer architect’s job
What is the right interface between hardware and software?
In a CPU where overflow events are rare, optimize the normal “add”
path (typically more than 90% of cases) rather than giving priority to
the overflow path
Amdahl's Law
Amdahl’s law states that the performance improvement
to be gained from using some faster mode of execution is
limited by the fraction of time the faster mode can be
used
Amdahl’s law defines the speedup that can be gained by
using a particular feature
$\text{Speedup}_{\text{enhanced}}$: the improvement gained by the faster mode while it is in use
Amdahl’s Law
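$$\text{Speedup}_{\text{overall}} = \frac{\text{ExTime}_{\text{old}}}{\text{ExTime}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$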
Example: floating-point instructions are improved to run 2X faster, but
only 10% of the actual instructions are FP
$$\text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times \left(0.9 + \frac{0.1}{2}\right) = 0.95 \times \text{ExTime}_{\text{old}}$$
$$\text{Speedup}_{\text{overall}} = \frac{1}{0.95} = 1.053$$
Example: a program spends 90% of its time on computing and the
other 10% on disk accesses. If we use a 100X faster
CPU, how much speedup do we gain?
$$\text{Speedup}_{\text{overall}} = \frac{1}{0.1 + \dfrac{0.9}{100}} = \frac{1}{0.109} = 9.17$$
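Both Amdahl's law examples can be checked with a small helper. This is a minimal sketch: the function name `amdahl_speedup` and its argument names are chosen here for illustration, and the enhanced fraction is assumed to be a fraction of execution time.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only part of the execution time is enhanced.

    fraction_enhanced: share of the original execution time that can use
                       the faster mode (between 0 and 1).
    speedup_enhanced:  how much faster that share runs in the faster mode.
    """
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


fp_case = amdahl_speedup(0.10, 2)     # FP runs 2X faster, 10% of the time
cpu_case = amdahl_speedup(0.90, 100)  # CPU runs 100X faster, 90% of the time

print(f"{fp_case:.3f}")   # 1.053
print(f"{cpu_case:.2f}")  # 9.17
```

Even with a 100X faster CPU, the untouched 10% of disk time caps the overall speedup below 10.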
Metrics of Performance
Performance
Performance is determined by execution time
Does any of the following variables, taken alone, equal performance?
# of cycles to execute program?
# of instructions in program?
# of cycles per second?
average # of cycles per instruction?
average # of instructions per second?
CPU Performance Equation
$$\text{CPU time} = \text{Instruction count (IC)} \times \text{CPI} \times \text{Clock cycle (CC) time}$$
CPU time depends equally on IC, CPI, and CC time.
A 10% improvement in any one of them leads to a 10%
improvement in CPU time.
It is difficult to change one parameter in complete
isolation from the others because the basic technologies
involved in changing each characteristic are
interdependent
Clock Cycle Time : Hardware technology and Organization
CPI : Organization and Instruction set architecture
Instruction count : Instruction set architecture and Compiler
technology
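The equal dependence on the three factors follows from CPU time being their product (CPU time = IC × CPI × clock cycle time). The sketch below illustrates this with a hypothetical baseline; the numbers (10^9 instructions, CPI of 1.5, 1 ns cycle) are made up for the example and do not come from the slides.

```python
def cpu_time(instruction_count, cpi, clock_cycle_time):
    """CPU time in seconds = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * clock_cycle_time


# Hypothetical baseline: 1e9 instructions, CPI of 1.5, 1 ns cycle (1 GHz).
base = cpu_time(1_000_000_000, 1.5, 1e-9)

# Reduce each factor by 10% in turn, holding the other two fixed.
fewer_instructions = cpu_time(900_000_000, 1.5, 1e-9)
lower_cpi = cpu_time(1_000_000_000, 1.35, 1e-9)
shorter_cycle = cpu_time(1_000_000_000, 1.5, 0.9e-9)

for improved in (fewer_instructions, lower_cpi, shorter_cycle):
    print(f"{improved / base:.2f}")  # 0.90 each time
```

In practice the three factors are rarely this independent: for instance, a change to the instruction set that lowers the instruction count may raise CPI.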
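With $\text{IC}_i$ instructions of class $i$, each taking $\text{CPI}_i$ cycles on average, the overall CPI is:
$$\text{CPI} = \frac{\sum_{i=1}^{n} \text{CPI}_i \times \text{IC}_i}{\text{Instruction Count}}$$
$$\text{CPI} = \sum_{i=1}^{n} \frac{\text{IC}_i}{\text{Instruction Count}} \times \text{CPI}_i$$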
Calculating CPI
Base Machine (Reg / Reg), typical mix
Op       Freq   Cycles   CPI(i)   (% Time)
ALU      50%    1        0.5      (33%)
Load     20%    2        0.4      (27%)
Store    10%    2        0.2      (13%)
Branch   20%    2        0.4      (27%)
Total                    1.5
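The table can be recomputed directly from the frequencies and cycle counts. A minimal sketch follows; the names `mix`, `average_cpi`, and `time_share` are illustrative.

```python
# (frequency, cycles) per instruction class, from the base machine table.
mix = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}


def average_cpi(mix):
    """Overall CPI: sum over classes of frequency x cycles."""
    return sum(freq * cycles for freq, cycles in mix.values())


def time_share(mix):
    """Fraction of execution time spent in each instruction class."""
    total = average_cpi(mix)
    return {op: freq * cycles / total for op, (freq, cycles) in mix.items()}


print(f"{average_cpi(mix):.1f}")  # 1.5
for op, share in time_share(mix).items():
    print(op, f"{share:.0%}")     # ALU 33%, Load 27%, Store 13%, Branch 27%
```

The "% Time" column is the per-class contribution divided by the overall CPI, which is why ALU operations account for half the instructions but only a third of the time.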
Calculating CPI
Suppose we have the following measurements:
Frequency of FP operation = 25 %
Average CPI of FP operations = 4.0
Average CPI of other instructions = 1.33
Frequency of FPSQR = 2%
CPI of FPSQR = 20
Assume that the two design alternatives are to decrease the CPI of
FPSQR to 2 or to decrease the average CPI of all FP operations to 2.5.
Compare these two design alternatives using the processor
performance equation.
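One way to work this comparison is sketched below. It assumes that the instruction count and clock rate are the same for both alternatives, so only CPI differs, and that FPSQR is counted among the FP operations. The variable names are illustrative. Because 1.33 is itself a rounded value, the original CPI comes out as 1.9975 rather than exactly 2.0.

```python
freq_fp, cpi_fp = 0.25, 4.0            # all FP operations
freq_other, cpi_other = 0.75, 1.33     # everything else
freq_fpsqr, cpi_fpsqr = 0.02, 20.0     # FPSQR (assumed to be part of the FP share)

# Original CPI from the weighted sum over the two instruction classes.
cpi_original = freq_fp * cpi_fp + freq_other * cpi_other

# Alternative 1: FPSQR CPI drops from 20 to 2; subtract the cycles saved.
cpi_new_fpsqr = cpi_original - freq_fpsqr * (cpi_fpsqr - 2.0)

# Alternative 2: average CPI of all FP operations drops from 4.0 to 2.5.
cpi_new_fp = freq_other * cpi_other + freq_fp * 2.5

print(f"{cpi_original:.4f}")   # 1.9975
print(f"{cpi_new_fpsqr:.4f}")  # 1.6375
print(f"{cpi_new_fp:.4f}")     # 1.6225

# With equal instruction count and clock rate, speedup is the CPI ratio.
print(f"{cpi_original / cpi_new_fp:.2f}")  # 1.23
```

Under these assumptions, improving all FP operations gives a slightly lower CPI than improving FPSQR alone, so it is marginally the better alternative.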
More on CPI
Is CPI a good metric to evaluate CPU performance?
No, a small CPI alone does not mean a faster processor, since the clock
rate may vary and compilers may generate code with
different instruction counts
Can CPI be less than 1?
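Amdahl’s Law for Multiprocessor
With $T$ the execution time on one processor, $\alpha$ the fraction of that time that cannot be parallelized, and $n$ the number of processors:
$$S = \frac{T}{\alpha T + (1 - \alpha)\,T/n} = \frac{1}{\alpha + (1 - \alpha)/n} = \frac{n}{n\alpha + (1 - \alpha)}$$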
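The multiprocessor form can be explored numerically. The sketch below assumes that α is the serial (non-parallelizable) fraction of the single-processor time and that n is the number of processors, which is how the formula reads. The function name `multiprocessor_speedup` and the value α = 0.1 are illustrative.

```python
def multiprocessor_speedup(alpha, n):
    """Speedup on n processors when a fraction alpha of the work is serial."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return n / (n * alpha + (1.0 - alpha))


# Hypothetical workload with a 10% serial fraction.
for n in (1, 2, 8, 64, 1024):
    print(n, round(multiprocessor_speedup(0.1, n), 2))
# 1 1.0
# 2 1.82
# 8 4.71
# 64 8.77
# 1024 9.91
```

As n grows, the speedup approaches 1/α (here 10) but never reaches it, so the serial fraction sets the ceiling regardless of processor count.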