CAQA5e ch1
Chapter 1
Fundamentals of Quantitative
Design and Analysis
RISC
◼ Dynamic power
◼ ½ × Capacitive load × Voltage² × Frequency switched
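A minimal sketch of the dynamic power formula, to show how strongly the squared voltage term matters. The baseline values of 1.0 and the 15% reduction in voltage and frequency are assumptions chosen only for illustration.

```python
def dynamic_power(capacitive_load, voltage, frequency):
    """Dynamic power = 1/2 x capacitive load x voltage^2 x frequency switched."""
    return 0.5 * capacitive_load * voltage ** 2 * frequency


# Hypothetical normalized baseline (not a measurement of any real chip).
base = dynamic_power(capacitive_load=1.0, voltage=1.0, frequency=1.0)

# Assumed scenario: voltage and frequency are both lowered by 15%.
scaled = dynamic_power(capacitive_load=1.0, voltage=0.85, frequency=0.85)

ratio = scaled / base  # equals 0.85 ** 3, because voltage enters squared
print(f"new power / old power = {ratio:.3f}")
```

Under these assumptions the new design draws roughly 61% of the original dynamic power.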
◼ "X is n times faster than Y" means:
$$n = \frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)}$$
Computer Performance
◼ Typical performance metrics:
◼ Response time
◼ Throughput
◼ Response Time (latency)
◼ How long does it take for my job to run?
◼ How long does it take to execute a job?
◼ How long must I wait for the database query?
◼ Throughput
◼ How many jobs can the machine run at once?
◼ What is the average execution rate?
◼ How much work is getting done?
Computer Performance
◼ If we upgrade a machine with a new processor what do
we increase?
◼ If we add a new machine to the lab what do we
increase?
◼ Benchmarks
◼ Kernels: small, key pieces of real applications
◼ Toy programs (e.g. sorting)
◼ Synthetic benchmarks (e.g. Dhrystone)
◼ Benchmark suites (e.g. SPEC06fp, TPC-C)
Benchmark Suites
◼ Desktop Benchmark Suites
◼ CPU-intensive benchmarks
◼ Graphics-intensive benchmarks
◼ CPU-intensive benchmarks
◼ SPEC89 → SPEC92 → SPEC95 → SPEC2000 → SPEC CPU
2006 → SPEC CPU 2017 (10 integer benchmarks (CINT2017) and
17 floating-point benchmarks (CFP2017))
◼ SPEC benchmarks are real programs modified to be portable and
to minimize the effect of I/O on performance (highlighting the CPU)
◼ Graphics-intensive benchmarks:
◼ SPECviewperf for systems supporting the OpenGL graphics library,
SPECapc for applications with intensive use of graphics (dealing
with 3D images, CAD/CAM image library)
Benchmark Suites
◼ Server Benchmark Suites
◼ Servers have multiple functions → multiple types of benchmarks
◼ Processor-throughput benchmarks
◼ SPEC CPU2000 uses the SPEC CPU benchmarks to build a throughput benchmark
◼ Runs multiple copies of each SPEC benchmark and converts the CPU time into a rate
(SPECrate). SPECrate is a measure of request-level parallelism
◼ SPEC offers high-performance computing benchmarks built around OpenMP and MPI to
measure thread-level parallelism
◼ SPEC offers both a file server benchmark (SPECSFS) and a Web
server benchmark (SPECWeb)
◼ SPECSFS is a benchmark for measuring NFS (Network File
System) performance
◼ Tests the performance of the I/O system (both disk and network I/O) as well
as the processor
◼ SPECWeb is a web server benchmark
◼ SPECjbb measures server performance for Web applications written in
Java
Benchmark Suites
◼ Transaction-processing (TP) benchmarks: measure the
ability of a system to handle database transactions
(transactions per second)
◼ TPC (Transaction Processing Performance Council): TPC-A (1985) → TPC-C
(complex query) → TPC-H (ad-hoc decision support) → TPC-R
(business decision support) → TPC-W (web-oriented)
◼ TPC-E: an On-Line Transaction Processing (OLTP) workload
that simulates a brokerage firm's customer accounts
◼ TPC Energy: adds energy metrics to all existing TPC
benchmarks
Benchmark Suites
◼ Embedded Benchmarks
◼ Electronic Design News Embedded Microprocessor Benchmark
Consortium (EEMBC, pronounced "embassy")
◼ A set of 41 kernels used to predict the performance of different
embedded applications: automotive/industrial, consumer,
networking, office automation, and telecommunication.
Reporting Results
◼ Reproducibility
◼ Another experimenter with everything listed should be able to duplicate the
results
◼ Factors affecting benchmark performance results
◼ Requires a fairly complete description of the machine and
compiler flags …
◼ System’s software configuration
◼ OS performance and support
◼ Different compiler optimization levels
◼ Benchmark-specific flags
◼ Source code modifications?
◼ Not allowed in some suites (e.g. SPEC)
Comparing and Summarizing Results
◼ Total execution time, or arithmetic mean (Σ Ti)/n

                     Computer A   Computer B   Computer C
P1 (secs)                     1           10           20
P2 (secs)                  1000          100           20
Total time (secs)          1001          110           40

Results for individual programs:
A is 10 times faster than B for P1
B is 10 times faster than A for P2
A is 20 times faster than C for P1
C is 50 times faster than A for P2 ……
Simple summary based on total time:
B is 9.1 times faster than A for P1 and P2
C is 25 times faster than A for P1 and P2 …
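The comparisons above can be reproduced with a short sketch. The execution times are taken from the table; the helper names `total_time` and `times_faster` are illustrative, not from the slides.

```python
# Execution times in seconds, copied from the table above.
times = {
    "A": {"P1": 1, "P2": 1000},
    "B": {"P1": 10, "P2": 100},
    "C": {"P1": 20, "P2": 20},
}


def total_time(machine):
    """Sum of execution times over all programs on one machine."""
    return sum(times[machine].values())


def times_faster(x, y, program=None):
    """n such that 'x is n times faster than y': ExTime(y) / ExTime(x).

    With program=None the total execution time is compared.
    """
    if program is None:
        return total_time(y) / total_time(x)
    return times[y][program] / times[x][program]


for machine in times:
    n = len(times[machine])
    print(machine, "total:", total_time(machine),
          "arithmetic mean:", total_time(machine) / n)

print(f"B vs A (total): {times_faster('B', 'A'):.1f}")
print(f"C vs A (total): {times_faster('C', 'A'):.1f}")
```

The per-program ratios and the total-time ratios point in different directions, which is why the choice of summary metric matters.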
Quantitative Principles of Computer Design
◼ Take advantage of parallelism: an important method to
improve performance
◼ System level
◼ To improve the throughput of a typical server
benchmark (SPECWeb, TPC-C), multiple processors and disks can
be used
◼ Individual processor
◼ Parallelism among instructions is critical to achieving high
performance
◼ Pipelining
◼ Digital design
◼ Set-associative caches use multiple banks of memory that are
searched in parallel to find the desired data item
◼ Carry-lookahead adder
Quantitative Principles of Computer Design
◼ Principle of Locality
◼ Programs tend to reuse data and instructions they have used
recently
◼ Rule of thumb: a program spends 90% of its execution time in only 10% of the
code.
◼ Temporal Locality
◼ Recently accessed data items are likely to be accessed in the
near future
◼ Spatial Locality
◼ Items whose addresses are near to one another tend to be
referenced close in time
Quantitative Principles of Computer Design
◼ Make the Common Case Fast and Make Other Parts
Correct
◼ Finding a good trade-off is the computer architect’s job
◼ What is the right interface between hardware and software?
◼ In a CPU where overflow is rare, optimize the normal “add”
path (typically more than 90% of cases) rather than giving priority to
the overflow path
Amdahl's Law
◼ Amdahl’s law states that the performance improvement
to be gained from using some faster mode of execution is
limited by the fraction of the time the faster mode can be
used
◼ Amdahl’s law defines the speedup that can be gained by
using a particular feature
◼ Speedup_enhanced: how much faster the task would run if the
enhanced mode were used for the entire program
Amdahl’s Law
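$$\text{Speedup}_{\text{overall}} = \frac{\text{ExTime}_{\text{old}}}{\text{ExTime}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$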
Amdahl’s Law
◼ Floating-point instructions are improved to run 2× faster, but
only 10% of the execution time is spent in FP instructions
$$\text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times \left(0.9 + \frac{0.1}{2}\right) = 0.95 \times \text{ExTime}_{\text{old}}$$
$$\text{Speedup}_{\text{overall}} = \frac{1}{0.95} = 1.053$$
Amdahl’s Law
◼ A program spends 90% of its time computing and the
other 10% on disk accesses. If we use a 100× faster
CPU, how much speedup do we gain?
$$\text{Speedup}_{\text{overall}} = \frac{1}{0.1 + \dfrac{0.9}{100}} = \frac{1}{0.109} = 9.17$$
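Both Amdahl's law examples can be checked with a small helper. This is a sketch; the function name `amdahl_speedup` is illustrative, and the fractions are read as fractions of the original execution time.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when fraction_enhanced of the ORIGINAL execution time
    is accelerated by a factor of speedup_enhanced."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)


# FP example: 10% of the time is sped up 2x.
fp_speedup = amdahl_speedup(0.10, 2)

# CPU/disk example: 90% of the time (computation) is sped up 100x.
cpu_speedup = amdahl_speedup(0.90, 100)

# Upper bound for the CPU/disk example: even an infinitely fast CPU
# cannot beat 1 / (1 - 0.9), because the disk time is untouched.
cpu_limit = 1.0 / (1.0 - 0.90)

print(f"FP example:  {fp_speedup:.3f}")
print(f"CPU example: {cpu_speedup:.2f} (limit {cpu_limit:.0f})")
```

The second case shows the law's main point: a 100× faster CPU yields less than a 10× overall gain, because the unenhanced 10% dominates the new execution time.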
Metrics of Performance
Performance
◼ Performance is determined by execution time
◼ Is any of the following variables equivalent to performance?
◼ # of cycles to execute program?
◼ # of instructions in program?
◼ # of cycles per second?
◼ average # of cycles per instruction?
◼ average # of instructions per second?
CPU Performance Equation
◼ CPU time = Instruction count (IC) × CPI × Clock cycle (CC) time
CPU Performance Equation
◼ CPU time depends equally on IC, CPI, and CC time.
◼ A 10% improvement in any one of them leads to a 10%
improvement in CPU time.
◼ It is difficult to change one parameter in complete
isolation from the others because the basic technologies
involved in changing each characteristic are
interdependent
◼ Clock cycle time: hardware technology and organization
◼ CPI: organization and instruction set architecture
◼ Instruction count: instruction set architecture and compiler
technology
CPU Performance Equation
$$\text{CPI} = \frac{\sum_{i=1}^{n} \text{CPI}_i \times \text{IC}_i}{\text{Instruction Count}}$$
CPU Performance Equation
$$\text{CPI} = \sum_{i=1}^{n} \frac{\text{IC}_i}{\text{Instruction Count}} \times \text{CPI}_i$$
Calculating CPI
Base Machine (Reg / Reg), Typical Mix

Op       Freq   Cycles   CPI(i)   (% Time)
ALU      50%    1        0.5      (33%)
Load     20%    2        0.4      (27%)
Store    10%    2        0.2      (13%)
Branch   20%    2        0.4      (27%)
Total                    1.5
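The table can be recomputed from the frequencies and cycle counts. In this sketch the mix is taken from the table, while the instruction count and clock cycle time passed to `cpu_time` are hypothetical values used only to illustrate the CPU performance equation.

```python
# (operation, frequency, cycles) from the base-machine table above.
mix = [
    ("ALU", 0.50, 1),
    ("Load", 0.20, 2),
    ("Store", 0.10, 2),
    ("Branch", 0.20, 2),
]

# Overall CPI is the frequency-weighted sum of per-class cycle counts.
cpi = sum(freq * cycles for _, freq, cycles in mix)

# Each class's share of execution time is its CPI contribution over the total.
share = {op: freq * cycles / cpi for op, freq, cycles in mix}

for op, freq, cycles in mix:
    print(f"{op:<7} CPI(i) = {freq * cycles:.1f}   time = {share[op]:.0%}")
print(f"Total CPI = {cpi:.1f}")


def cpu_time(instruction_count, cpi, clock_cycle_time):
    """CPU time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * clock_cycle_time


# Hypothetical program: 1e9 instructions on a 1 ns clock cycle (1 GHz).
print(f"CPU time = {cpu_time(1e9, cpi, 1e-9):.2f} s")
```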
Calculating CPI
▪ Suppose we have the following measurements:
▪ Frequency of FP operation = 25 %
▪ Average CPI of FP operations = 4.0
▪ Average CPI of other instructions = 1.33
▪ Frequency of FPSQR = 2%
▪ CPI of FPSQR = 20
Assume that the two design alternatives are to decrease the CPI of
FPSQR to 2 or to decrease the average CPI of all FP operations to 2.5.
Compare these two design alternatives using the processor performance
equation.
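One way to work this comparison is sketched below. It assumes that instruction count and clock cycle time are the same for both alternatives, so the CPU time ratio reduces to the CPI ratio; the variable names are illustrative.

```python
# Measurements from the problem statement.
freq_fp, cpi_fp = 0.25, 4.0        # all FP operations
cpi_other = 1.33                   # all non-FP instructions
freq_fpsqr, cpi_fpsqr = 0.02, 20.0

# Original CPI: frequency-weighted average of FP and non-FP instructions.
cpi_original = freq_fp * cpi_fp + (1 - freq_fp) * cpi_other

# Alternative 1: CPI of FPSQR drops from 20 to 2.
# Only the FPSQR share of the CPI changes, so subtract the cycles saved.
cpi_new_fpsqr = cpi_original - freq_fpsqr * (cpi_fpsqr - 2.0)

# Alternative 2: average CPI of all FP operations drops from 4.0 to 2.5.
cpi_new_fp = freq_fp * 2.5 + (1 - freq_fp) * cpi_other

# With IC and clock cycle time unchanged (assumption), speedup = CPI ratio.
speedup_fpsqr = cpi_original / cpi_new_fpsqr
speedup_fp = cpi_original / cpi_new_fp

print(f"original CPI      = {cpi_original:.4f}")
print(f"CPI, faster FPSQR = {cpi_new_fpsqr:.4f}  speedup {speedup_fpsqr:.2f}")
print(f"CPI, faster FP    = {cpi_new_fp:.4f}  speedup {speedup_fp:.2f}")
```

Under these assumptions, improving all FP operations gives the slightly lower CPI (about 1.62 versus 1.64), so it is marginally the better alternative.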
Assume that we make an enhancement to a computer that
improves some mode of execution by a factor of 10.
Enhanced mode is used 50% of the time measured as a
percentage of the execution time when the enhanced mode
is in use.
a) What is the speedup we have obtained from the fast mode?
b) What percentage of the original execution time has been
converted to fast mode?
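A possible way to work this exercise is sketched below. The key point is that the 50% figure is a fraction of the new (enhanced) execution time, so Amdahl's formula cannot be applied to it directly. The new execution time is normalized to 1.0 as an assumption of convenience.

```python
speedup_enhanced = 10
frac_new_enhanced = 0.5   # fraction of the NEW execution time spent in fast mode

new_time = 1.0            # normalized new execution time (assumption)

# Reconstruct the original time: the unenhanced part is unchanged, and the
# enhanced part would have taken speedup_enhanced times longer.
old_time = ((1 - frac_new_enhanced) * new_time
            + frac_new_enhanced * new_time * speedup_enhanced)

# (a) Overall speedup obtained from the fast mode.
overall_speedup = old_time / new_time

# (b) Fraction of the ORIGINAL execution time that was converted to fast mode.
frac_original_converted = (frac_new_enhanced * new_time * speedup_enhanced) / old_time

print(f"(a) overall speedup = {overall_speedup:.1f}")
print(f"(b) fraction of original time converted = {frac_original_converted:.1%}")
```

As a consistency check, plugging the fraction from (b) into Amdahl's formula with a speedup of 10 reproduces the overall speedup from (a).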
◼ A small CPI does not by itself mean a faster processor, since the clock
rate may differ and compilers may generate code of
different lengths
More on CPI
◼ Can CPI be less than 1?
Amdahl’s Law for Multiprocessor
◼ Let T be the total execution time of a program on a
uniprocessor workstation
◼ Suppose the program has been parallelized or partitioned for
parallel execution on a cluster of many processing nodes
◼ Assume that a fraction α of the code must be executed
sequentially, called the sequential bottleneck
◼ The remaining 1 – α of the code can be compiled for parallel execution
on n processors
◼ Total execution time of the program = αT + (1 – α)T/n
Amdahl’s Law for Multiprocessor
◼ Speedup (S)
= T / [αT + (1 – α)T/n]
= 1/[α + (1- α)/n]
= n/[nα + (1- α)]
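The multiprocessor form can be explored numerically with a short sketch. The value α = 0.25 and the processor counts are hypothetical, chosen only to show how the speedup saturates.

```python
def multiprocessor_speedup(alpha, n):
    """S = n / (n*alpha + (1 - alpha)), where alpha is the sequential
    fraction of the code and n is the number of processors."""
    return n / (n * alpha + (1 - alpha))


alpha = 0.25  # hypothetical sequential fraction

for n in (1, 4, 16, 256, 4096):
    print(f"n = {n:>4}: speedup = {multiprocessor_speedup(alpha, n):.3f}")

# As n grows, S approaches 1/alpha: the sequential bottleneck sets the limit.
print(f"upper bound 1/alpha = {1 / alpha:.1f}")
```

However many processors are added, the speedup in this sketch never reaches 1/α = 4.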
◼ Three enhancements with the following speedups are proposed for a new
architecture:
◼ Speedup1 = 30
◼ Speedup2 = 20
◼ Speedup3 = 10
◼ Only one enhancement is usable at a time.
◼ a. If enhancements 1 and 2 are each usable for 30% of the time, what
fraction of the time must enhancement 3 be used to achieve an overall
speedup of 10?
◼ b. Assume the distribution of enhancement usage is 30%, 30%, and 20% for
enhancements 1, 2, and 3, respectively. Assuming all three enhancements
are in use, for what fraction of the reduced execution time is no
enhancement in use?
◼ c. Assume for some benchmark, the fraction of use is 15% for each of
enhancements 1 and 2 and 70% for enhancement 3. We want to maximize
performance. If only one enhancement can be implemented, which should it
be? If two enhancements can be implemented, which should be chosen?
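A sketch of one way to approach the three parts, using the generalization of Amdahl's law to several non-overlapping enhancements. It assumes that all usage percentages refer to the original execution time, and the helper name `overall_speedup` is illustrative.

```python
from itertools import combinations

speedups = {1: 30, 2: 20, 3: 10}


def overall_speedup(fractions):
    """fractions maps enhancement id -> fraction of ORIGINAL time it covers.
    Enhancements do not overlap, since only one is usable at a time."""
    unenhanced = 1.0 - sum(fractions.values())
    new_time = unenhanced + sum(f / speedups[e] for e, f in fractions.items())
    return 1.0 / new_time


# (a) Solve 1/10 = (1 - 0.6 - f3) + 0.3/30 + 0.3/20 + f3/10 for f3.
target = 10
fixed = {1: 0.30, 2: 0.30}
const = 1.0 - sum(fixed.values()) + sum(f / speedups[e] for e, f in fixed.items())
f3 = (const - 1.0 / target) / (1.0 - 1.0 / speedups[3])
print(f"(a) enhancement 3 must cover {f3:.1%} of the original time")

# (b) Share of the reduced execution time in which no enhancement is in use.
usage_b = {1: 0.30, 2: 0.30, 3: 0.20}
new_time_b = 1.0 / overall_speedup(usage_b)
unenhanced_share = (1.0 - sum(usage_b.values())) / new_time_b
print(f"(b) unenhanced share of the new time = {unenhanced_share:.1%}")

# (c) Best single enhancement and best pair for the given benchmark.
usage_c = {1: 0.15, 2: 0.15, 3: 0.70}
best_one = max(usage_c, key=lambda e: overall_speedup({e: usage_c[e]}))
best_two = max(combinations(usage_c, 2),
               key=lambda pair: overall_speedup({e: usage_c[e] for e in pair}))
print(f"(c) best single: {best_one}, best pair: {best_two}")
```

Part (c) illustrates the common-case principle: the enhancement with the smallest speedup is still the best single choice, because it applies to 70% of the time.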