
Computer Architecture

A Quantitative Approach, Fifth Edition

Chapter 1
Fundamentals of Quantitative
Design and Analysis

Copyright © 2012, Elsevier Inc. All rights reserved. 1


Introduction
Computer Technology
◼ Performance improvements:
◼ Improvements in semiconductor technology
◼ Feature size, clock speed
◼ Improvements in computer architectures
◼ Enabled by HLL compilers, UNIX
◼ Led to RISC architectures

◼ Together have enabled:


◼ Lightweight computers
◼ Productivity-based managed/interpreted
programming languages

Copyright © 2012, Elsevier Inc. All rights reserved. 2


Introduction
Single Processor Performance
[Figure: growth in single-processor performance, with annotations marking the introduction of RISC architectures and the move to multiprocessors]

Copyright © 2012, Elsevier Inc. All rights reserved. 3


Introduction
Single Processor Performance

Copyright © 2012, Elsevier Inc. All rights reserved. 4


Introduction
Current Trends in Architecture
◼ Cannot continue to leverage Instruction-Level
parallelism (ILP)
◼ Single processor performance improvement ended in
2003

◼ New models for performance:


◼ Data-level parallelism (DLP)
◼ Thread-level parallelism (TLP)
◼ Request-level parallelism (RLP)

◼ These require explicit restructuring of the


application

Copyright © 2012, Elsevier Inc. All rights reserved. 5


Classes of Computers
Classes of Computers
◼ Personal Mobile Device (PMD)
◼ e.g. smart phones, tablet computers
◼ Emphasis on energy efficiency and real-time performance
◼ Desktop Computing
◼ Emphasis on price-performance
◼ Servers
◼ Emphasis on availability, scalability, throughput
◼ Clusters / Warehouse Scale Computers
◼ Used for “Software as a Service (SaaS)”
◼ Emphasis on availability and price-performance
◼ Sub-class: Supercomputers, emphasis: floating-point
performance and fast internal networks
◼ Embedded Computers
◼ Emphasis: price

Copyright © 2012, Elsevier Inc. All rights reserved. 6


Classes of Computers
Parallelism
◼ Classes of parallelism in applications:
◼ Data-Level Parallelism (DLP)
◼ Task-Level Parallelism (TLP)

◼ Classes of architectural parallelism:


◼ Instruction-Level Parallelism (ILP)
◼ Vector architectures/Graphic Processor Units (GPUs)
◼ Thread-Level Parallelism
◼ Request-Level Parallelism

Copyright © 2012, Elsevier Inc. All rights reserved. 7


Classes of Computers
Flynn’s Taxonomy
◼ Single instruction stream, single data stream (SISD)

◼ Single instruction stream, multiple data streams (SIMD)


◼ Vector architectures
◼ Multimedia extensions
◼ Graphics processor units

◼ Multiple instruction streams, single data stream (MISD)


◼ No commercial implementation

◼ Multiple instruction streams, multiple data streams


(MIMD)
◼ Tightly-coupled MIMD
◼ Loosely-coupled MIMD

Copyright © 2012, Elsevier Inc. All rights reserved. 8


Defining Computer Architecture
Defining Computer Architecture
◼ “Old” view of computer architecture:
◼ Instruction Set Architecture (ISA) design
◼ i.e. decisions regarding:
◼ registers, memory addressing, addressing modes,
instruction operands, available operations, control flow
instructions, instruction encoding

◼ “Real” computer architecture:


◼ Specific requirements of the target machine
◼ Design to maximize performance within constraints:
cost, power, and availability
◼ Includes ISA, microarchitecture, hardware

Copyright © 2012, Elsevier Inc. All rights reserved. 9


Figure 1.6 MIPS64 instruction set architecture formats. All instructions are 32 bits long. The R format is for integer
register-to-register operations, such as DADDU, DSUBU, and so on. The I format is for data transfers, branches, and
immediate instructions, such as LD, SD, BEQZ, and DADDIs. The J format is for jumps, the FR format for floating-point
operations, and the FI format for floating-point branches.

Copyright © 2011, Elsevier Inc. All rights Reserved. 10


Trends in Technology
Trends in Technology
◼ Integrated circuit technology
◼ Transistor density: 35%/year
◼ Die size: 10-20%/year
◼ Integration overall: 40-55%/year

◼ Semiconductor DRAM : 25-40%/year (slowing)

◼ Semiconductor Flash: 50-60%/year


◼ 15-20X cheaper/bit than DRAM

◼ Magnetic disk technology: 40%/year


◼ 15-25X cheaper/bit than Flash
◼ 300-500X cheaper/bit than DRAM
◼ Network technology
Copyright © 2012, Elsevier Inc. All rights reserved. 11
Trends in Technology
Bandwidth and Latency
◼ Bandwidth or throughput
◼ Total work done in a given time
◼ 32,000 - 40,000x improvement for processors
◼ 400-2400x improvement for memory and disks

◼ Latency or response time


◼ Time between start and completion of an event
◼ 50-90x improvement for processors
◼ 8-9x improvement for memory and disks

Copyright © 2012, Elsevier Inc. All rights reserved. 12


Trends in Technology
Bandwidth and Latency

Copyright © 2012, Elsevier Inc. All rights reserved. 13


Trends in Technology
Transistors and Wires
◼ Feature size
◼ Minimum size of transistor or wire in x or y
dimension
◼ 10 μm in 1971 to 0.016 μm in 2017
◼ Transistor performance scales linearly
◼ Wire delay does not improve with feature size!
◼ Density of transistors increases quadratically
with linear decrease in feature size

Copyright © 2012, Elsevier Inc. All rights reserved. 14


Trends in Power and Energy
Power and Energy
◼ Problem: Get power in, get power out

◼ Thermal Design Power (TDP)


◼ Characterizes sustained power consumption
◼ Used as target for power supply and cooling system
◼ Lower than peak power, higher than average power
consumption

◼ Clock rate can be reduced dynamically to limit


power consumption

◼ Energy per task is often a better measurement


Copyright © 2012, Elsevier Inc. All rights reserved. 15
Power and Energy
◼ Processor A has 20% higher average power
consumption than processor B
◼ Processor A executes the task in only 70% of the
time needed by processor B
◼ E(A) = 120 x 70 = 8400
◼ E(B) = 100 x 100 = 10000
◼ So A consumes about 16% less energy for the task, despite its higher power

Copyright © 2012, Elsevier Inc. All rights reserved. 16


Trends in Power and Energy
Dynamic Energy and Power
◼ Dynamic energy
◼ Full switching pulse 0 -> 1 -> 0 (or 1 -> 0 -> 1)
◼ Capacitive load x Voltage²
◼ Single transition 0 -> 1 or 1 -> 0
◼ ½ x Capacitive load x Voltage²

◼ Dynamic power
◼ ½ x Capacitive load x Voltage² x Frequency switched

◼ Reducing clock rate reduces power, not energy

Copyright © 2012, Elsevier Inc. All rights reserved. 17


Dynamic Energy and Power
◼ Some processors today are designed to have an
adjustable voltage, so a 15% reduction in voltage
may result in a 15% reduction in frequency.
What would be the impact on dynamic energy
and on dynamic power?
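
A quick sketch of the answer (my own working, assuming the capacitive load is unchanged and both voltage and frequency drop by 15%):

# Ratios of new to old dynamic energy and dynamic power
v_ratio = 0.85   # new voltage / old voltage
f_ratio = 0.85   # new frequency / old frequency

energy_ratio = v_ratio ** 2            # energy ~ Capacitive load x Voltage^2
power_ratio = v_ratio ** 2 * f_ratio   # power  ~ 1/2 x Capacitive load x Voltage^2 x Frequency

print(f"dynamic energy: {energy_ratio:.2f} of original")  # ~0.72
print(f"dynamic power:  {power_ratio:.2f} of original")   # ~0.61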

Copyright © 2012, Elsevier Inc. All rights reserved. 18


Trends in Power and Energy
Power
◼ Intel 80386 consumed ~ 2 W
◼ 4.0 GHz Intel Core i7-6700K consumes 95 W
◼ Heat must be dissipated from a 1.5 x 1.5 cm chip
◼ This is the limit of what can be cooled by air

Copyright © 2012, Elsevier Inc. All rights reserved. 19


Trends in Power and Energy
Reducing Power
◼ Techniques for reducing power:
◼ Do nothing well
◼ Dynamic Voltage-Frequency Scaling
◼ Low power state for DRAM, disks
◼ Overclocking, turning off cores

Copyright © 2012, Elsevier Inc. All rights reserved. 20


Figure 1.12 Energy savings for a server using an AMD Opteron microprocessor, 8 GB of DRAM, and one ATA disk.
At 1.8 GHz, the server can only handle up to two-thirds of the workload without causing service level violations, and, at 1.0 GHz, it can
only safely handle one-third of the workload. (Figure 5.11 in Barroso and Hölzle [2009].)

Copyright © 2011, Elsevier Inc. All rights Reserved. 21


Trends in Power and Energy
Static Power
◼ Static power consumption
◼ Current_static x Voltage
◼ Scales with number of transistors
◼ To reduce: power gating (turning off the
power supply to inactive modules to control
loss due to leakage)

Copyright © 2012, Elsevier Inc. All rights reserved. 22


The Shift in Computer Architecture Because of
Limits of Energy
◼ Dark silicon: much of a chip must remain unused
(“dark”) at any moment in time because of thermal
constraints
◼ This observation has led architects to reexamine the
fundamentals of processor design in the search for
better energy-cost performance
◼ The new design principle of minimizing energy per task,
combined with the relative energy and area costs, has
inspired a new direction for computer architecture:
domain-specific processors

Copyright © 2012, Elsevier Inc. All rights reserved. 23


The Shift in Computer Architecture Because of
Limits of Energy
◼ Domain-specific processors save energy by
◼ reducing wide floating-point operations and deploying special-
purpose memories to reduce accesses to DRAM
◼ The energy saving is used to provide 10 to 100 times more
(narrower) integer arithmetic units than a traditional
processor

Copyright © 2012, Elsevier Inc. All rights reserved. 24


The Shift in Computer Architecture Because of
Limits of Energy

Copyright © 2012, Elsevier Inc. All rights reserved. 25


Trends in Cost
Trends in Cost
◼ Cost driven down by learning curve
◼ Yield

◼ DRAM: price closely tracks cost

◼ Microprocessors: price depends on


volume
◼ 10% less for each doubling of volume

Copyright © 2012, Elsevier Inc. All rights reserved. 26


Trends in Cost
Integrated Circuit Cost
◼ Integrated circuit

◼ Bose-Einstein formula:
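
For reference, the textbook's cost and yield formulas referenced above are approximately:

Cost of integrated circuit = (Cost of die + Cost of testing die + Cost of packaging and final test) / Final test yield
Cost of die = Cost of wafer / (Dies per wafer x Die yield)
Dies per wafer = [π x (Wafer diameter / 2)²] / Die area - [π x Wafer diameter] / sqrt(2 x Die area)
Die yield (Bose-Einstein) = Wafer yield x 1 / (1 + Defects per unit area x Die area)^N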

Copyright © 2012, Elsevier Inc. All rights reserved. 27


Trends in Cost
Integrated Circuit Cost
◼ Defects per unit area =
◼ 0.012-0.016 defects per square cm for 28 nm and
◼ 0.016 – 0.047 for 16 nm

◼ N (process-complexity factor) is a measure of
manufacturing difficulty
◼ 7.5-9.5 (28 nm, 2017) and
◼ 10 – 14 for 16 nm

Copyright © 2012, Elsevier Inc. All rights reserved. 28


Copyright © 2012, Elsevier Inc. All rights reserved. 29
Copyright © 2011, Elsevier Inc. All rights Reserved. 30
Copyright © 2011, Elsevier Inc. All rights Reserved. 31
◼ Find the number of dies per 300 mm wafer for a die that
is 1.5 cm on a side and for a die that is 1.0 cm on a side
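
A sketch of this calculation in Python, using the textbook's dies-per-wafer formula (300 mm wafer = 30 cm diameter; edge dies and test sites are ignored, as in that formula):

import math

def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    # Gross dies on the wafer minus the loss along the circumference
    return (math.pi * (wafer_diameter_cm / 2) ** 2 / die_area_cm2
            - math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2))

print(round(dies_per_wafer(30, 1.5 * 1.5)))  # ~270 dies for the 1.5 cm die
print(round(dies_per_wafer(30, 1.0 * 1.0)))  # ~640 dies for the 1.0 cm die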

Copyright © 2012, Elsevier Inc. All rights reserved. 32


◼ Find the die yield for the dies that are 1.5 cm on a side
and 1.0 cm on a side, assuming a defect density of
0.031 per cm2 and N is 13.5
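
A sketch of the yield calculation with the slide's parameters, using the Bose-Einstein model (wafer yield assumed to be 100%):

def die_yield(defects_per_cm2, die_area_cm2, n):
    # Bose-Einstein yield model with process-complexity factor n
    return 1.0 / (1.0 + defects_per_cm2 * die_area_cm2) ** n

print(round(die_yield(0.031, 2.25, 13.5), 2))  # ~0.40 for the 1.5 cm die
print(round(die_yield(0.031, 1.00, 13.5), 2))  # ~0.66 for the 1.0 cm die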

Copyright © 2012, Elsevier Inc. All rights reserved. 33


Dependability
◼ Computer system dependability is the quality of delivered service such that
reliance can justifiably be placed on this service. The service delivered by a
system is its observed actual behavior as perceived by other system(s)
interacting with this system's users. Each module also has an ideal specified
behavior, where a service specification is an agreed description of the
expected behavior. A system failure occurs when the actual behavior deviates
from the specified behavior. The failure occurred because of an error, a defect
in that module. The cause of an error is a fault.
◼ When a fault occurs, it creates a latent error, which becomes effective when it
is activated; when the error actually affects the delivered service, a failure
occurs. The time between the occurrence of an error and the resulting failure
is the error latency. Thus, an error is the manifestation in the system of a fault,
and a failure is the manifestation on the service of an error

Copyright © 2012, Elsevier Inc. All rights reserved. 34


Dependability
Dependability
◼ System alternates between two states of service
with respect to an SLA:
1. Service accomplishment: service is delivered as
specified
2. Service interruption: the delivered service is different
from the SLA
◼ Transitions between these two states are
caused by failures or restorations

Copyright © 2012, Elsevier Inc. All rights reserved. 35


Dependability
Dependability
◼ Module reliability: a measure of continuous
service accomplishment (or, equivalently, the
time to failure) from a reference initial instant
◼ Mean time to failure (MTTF)
◼ Mean time to repair (MTTR)
◼ Mean time between failures (MTBF) = MTTF + MTTR
◼ Module availability: a measure of the service
accomplishment with respect to the alternation
between the two states of accomplishment and
interruption
◼ Availability = MTTF / MTBF = MTTF / (MTTF + MTTR)

Copyright © 2012, Elsevier Inc. All rights reserved. 36


Dependability
◼ Assume a disk subsystem with the following components
and MTTF:
◼ 10 disks, each rated at 1,000,000-hour MTTF
◼ 1 ATA controller, 500,000-hour MTTF
◼ 1 power supply, 200,000-hour MTTF
◼ 1 fan, 200,000-hour MTTF
◼ 1 ATA cable, 1,000,000-hour MTTF
Using the simplifying assumptions that the lifetimes are
exponentially distributed and that failures are independent,
compute the MTTF of the system as a whole.
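
A sketch of the computation: with independent, exponentially distributed lifetimes, the component failure rates simply add, and the system MTTF is the reciprocal of the total rate.

# 10 disks, ATA controller, power supply, fan, ATA cable (MTTFs in hours, from the slide)
mttf_hours = [1_000_000] * 10 + [500_000, 200_000, 200_000, 1_000_000]

failure_rate = sum(1 / m for m in mttf_hours)   # 23 failures per 1,000,000 hours
system_mttf = 1 / failure_rate

print(f"system failure rate: {failure_rate * 1e6:.0f} per million hours")  # 23
print(f"system MTTF: {system_mttf:,.0f} hours (about 5 years)")            # ~43,500 hours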

Copyright © 2012, Elsevier Inc. All rights reserved. 37


Dependability
◼ Disk subsystems often have redundant power supplies to
improve dependability. Using the preceding components
and MTTFs, calculate the reliability of redundant power
supplies. Assume that one power supply is sufficient to
run the disk subsystem and that we are adding one
redundant power supply.
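
A sketch under the textbook's usual simplifying assumption of a mean time to repair (MTTR) of 24 hours for a failed supply (the MTTR value is an assumption, not given on the slide):

psu_mttf = 200_000   # hours, from the slide
mttr = 24            # hours, assumed repair/replacement time

# One of the two supplies fails on average every psu_mttf / 2 hours; the pair only
# fails if the second supply dies during the repair window (probability mttr / psu_mttf).
pair_mttf = (psu_mttf / 2) / (mttr / psu_mttf)   # = psu_mttf**2 / (2 * mttr)

print(f"MTTF of redundant power-supply pair: {pair_mttf:.2e} hours")       # ~8.3e8 hours
print(f"improvement over a single supply:    {pair_mttf / psu_mttf:.0f}x") # ~4150x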

Copyright © 2012, Elsevier Inc. All rights reserved. 38


Performance
▪ "X is n times faster than Y" means

ExTime(Y) Performance(X)
--------- = --------------- =n
ExTime(X) Performance(Y)

▪ Performance is inversely proportional to execution


time

41
Computer Performance
◼ Typical performance metrics:
◼ Response time
◼ Throughput
◼ Response Time (latency)
◼ How long does it take for my job to run?
◼ How long does it take to execute a job?
◼ How long must I wait for the database query?
◼ Throughput
◼ How many jobs can the machine run at once?
◼ What is the average execution rate?
◼ How much work is getting done?

42
Computer Performance
◼ If we upgrade a machine with a new processor what do
we increase?
◼ If we add a new machine to the lab what do we
increase?

Copyright © 2012, Elsevier Inc. All rights reserved. 43


Execution Time
◼ Elapsed Time (wall-clock time, response time)
◼ counts everything (disk and memory accesses, I/O , etc.)
◼ a useful number, but often not good for comparison purposes
◼ CPU time
◼ doesn't count I/O or time spent running other programs
◼ can be broken up into system time, and user time
◼ Our focus: user CPU time
◼ time spent executing the lines of code that are "in" our program

Copyright © 2012, Elsevier Inc. All rights reserved. 44


Execution Time
◼ CPU time: the accumulated time during which CPU is
computing:
◼ user CPU time: CPU time spent on user program
◼ system CPU time: CPU time spent on OS
◼ An example from UNIX: 90.7u 12.9s 2:39 65%
◼ 90.7u: user CPU time (seconds)

◼ 12.9s: system CPU time (seconds)

◼ 2:39 (159 sec): elapsed time

◼ 65%: percentage of elapsed time spent as CPU time
((90.7 + 12.9) / 159 ≈ 65%)

Copyright © 2012, Elsevier Inc. All rights reserved. 45


Measuring Performance
◼ Use benchmarks: The best choice of benchmarks to
measure performance is real applications.

◼ Running programs that are much simpler than the real
application has led to performance pitfalls

◼ Benchmarks
◼ Kernels: small, key pieces of real applications
◼ Toy programs (e.g. sorting)
◼ Synthetic benchmarks (e.g. Dhrystone)
◼ Benchmark suites (e.g. SPEC06fp, TPC-C)

Copyright © 2012, Elsevier Inc. All rights reserved. 46


Benchmark Suites
◼ Definition
◼ A collection of kernels, benchmarks, and real programs
◼ Lessens the weakness of any one benchmark through the presence of
the others
◼ Goal: characterize the relative performance of two computers,
particularly for programs not in the suite that customers are likely to
run
◼ SPEC (Standard Performance Evaluation Corporation) is one of the
most successful efforts

47
Benchmark Suites
◼ Desktop Benchmark Suites
◼ CPU-intensive benchmarks
◼ Graphics-intensive benchmarks
◼ CPU-intensive benchmarks
◼ SPEC89 → SPEC92 → SPEC95 → SPEC2000 → SPEC CPU
2006 → SPEC CPU 2017 (10 integer benchmarks (CINT2017) and
17 floating-point benchmarks (CFP2017))
◼ SPEC benchmarks are real programs, modified for portability and
to minimize the effect of I/O on performance (highlighting the CPU)
◼ Graphics-intensive benchmarks:
◼ SPECviewperf for systems supporting the OpenGL graphics library,
SPECapc for applications with intensive use of graphics (dealing
with 3D images, CAD/CAM image library)

48
49
Benchmark Suites
◼ Server Benchmark Suites
◼ Servers have multiple functions → multiple types of benchmarks
◼ Processor-throughput benchmarks
◼ SPEC CPU2000 used the SPEC CPU benchmarks to build a throughput benchmark:
◼ Run multiple copies of the SPEC benchmarks, converting CPU time into a rate
(SPECrate). SPECrate is a measure of request-level parallelism
◼ SPEC also offers high-performance computing benchmarks built around OpenMP
and MPI to measure thread-level parallelism
◼ SPEC offers both a file server benchmark (SPECSFS) and a Web
server benchmark (SPECWeb)
◼ SPECSFS is a benchmark for measuring NFS (Network File
System) performance
◼ Tests the performance of the I/O system (both disk and network I/O) as well
as the processor
◼ SPECWeb is a web server benchmark
◼ SPECjbb measures server performance for Web applications written in
Java

50
Benchmark Suites
◼ Transaction-processing (TP) benchmarks: measure the
ability of a system to handle database transactions
(transactions per second)
◼ TPC (Transaction Processing Council): TPC-A (1985) → TPC-C
(complex query environment) → TPC-H (ad hoc decision support) → TPC-
R (business decision support) → TPC-W (web-oriented)
◼ TPC-E: an On-Line Transaction Processing (OLTP) workload
that simulates a brokerage firm's customer accounts
◼ TPC Energy: adds energy metrics to all existing TPC
benchmarks

51
Benchmark Suites
◼ Embedded Benchmarks
◼ Electronic Design News Embedded Microprocessor Benchmark
Consortium (EEMBC, pronounced “embassy”)
◼ A set of 41 kernels used to predict performance of different
embedded applications: automotive/industrial, consumer,
networking, office automation, and telecommunications

52
Reporting Results
◼ Reproducibility
◼ Another experimenter, given everything listed, could duplicate the
results
◼ Many factors affect the performance results of benchmarks
◼ Require a fairly complete description of the machine and
compiler flags …
◼ System’s software configuration
◼ OS performance and support
◼ Different compiler optimization levels
◼ Benchmark-specific flags
◼ Source code modifications?
◼ Not allowed (e.g., SPEC)

◼ Allowed but essentially impossible (e.g., TPC)

◼ Allowed and doable (e.g., NAS & EEMBC)

53
Comparing and Summarizing Results
◼ Arithmetic mean of execution time = (Σ Ti) / n

                   Computer A   Computer B   Computer C
P1 (secs)                   1           10           20
P2 (secs)                1000          100           20
Total time (secs)        1001          110           40

Pairwise results:
A is 10 times faster than B for P1
B is 10 times faster than A for P2
A is 20 times faster than C for P1
C is 50 times faster than A for P2 ……
Simple summary (using total execution time):
B is 9.1 times faster than A for P1 and P2
C is 25 times faster than A for P1 and P2 …

54
Quantitative Principles of Computer Design
◼ Take advantage of parallelism: an important method to
improve performance
◼ System level
◼ To improve the throughput of a typical server benchmark
(SPECWeb, TPC-C), multiple processors and disks can
be used
◼ Individual processor
◼ Parallelism among instructions is critical to achieving high
performance
◼ Pipelining
◼ Digital design
◼ A set-associative cache uses multiple banks of memory that are
searched in parallel to find the desired data item
◼ Carry-lookahead adder

55
Quantitative Principles of Computer Design
◼ Principle of Locality
◼ Programs tend to reuse data and instructions they have used
recently
◼ A program spends about 90% of its execution time in only 10% of
its code
◼ Temporal Locality
◼ Recently accessed data items are likely to be accessed in the
near future
◼ Spatial Locality
◼ Items whose addresses are near one another tend to be
referenced close together in time

56
Quantitative Principles of Computer Design
◼ Make the Common Case Fast and Make Other Parts
Correct
◼ Finding a good trade-off is the computer architect’s purpose
◼ What is the right interface between hardware and software?
◼ In a CPU where overflow is rare, optimize the normal “add”
path (used more than 90% of the time) rather than prioritizing
the overflow handling

57
Amdahl's Law
◼ Amdahl’s law states that the performance improvement
to be gained from using some faster mode of execution is
limited by the fraction of time the faster mode can be
used
◼ Amdahl’s law defines the speedup that can be gained by
using a particular feature

            ExTime w/o E     Performance w/ E
Speedup  =  ------------  =  -----------------
            ExTime w/ E      Performance w/o E

Copyright © 2012, Elsevier Inc. All rights reserved. 58


Speedup
◼ Depends on two factors:
◼ Fraction_enhanced
◼ The fraction of the computation time in the original computer that
can be converted to take advantage of the enhancement.
Fraction_enhanced is always less than 1.

◼ Speedup_enhanced
◼ The improvement gained by the enhanced execution mode; that is,
how much faster the task would run if the enhanced mode were
used for the entire program. Speedup_enhanced is always greater
than 1.

59
Amdahl’s Law

ExTime_new = ExTime_old x [ (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

                  ExTime_old                              1
Speedup_overall = ----------- = -----------------------------------------------------------
                  ExTime_new    (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced

60
Amdahl’s Law
◼ Floating point instructions improved to run 2X; but
only 10% of actual instructions are FP

ExTimenew =

Speedupoverall =

61
Amdahl’s Law
◼ Floating point instructions improved to run 2X; but
only 10% of actual instructions are FP

ExTime_new = ExTime_old x (0.9 + 0.1/2) = 0.95 x ExTime_old

Speedup_overall = 1 / 0.95 = 1.053

• What’s the overall performance gain if we could


improve the non-FP instructions by 2x?
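
A quick sketch of this follow-up question (my own working, using the same Amdahl's-law formulation as above):

# Improving the non-FP portion (90% of execution time) by 2x
fraction_enhanced = 0.9
speedup_enhanced = 2.0

extime_new = (1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
speedup_overall = 1 / extime_new

print(f"ExTime_new = {extime_new:.2f} x ExTime_old")  # 0.55
print(f"Speedup_overall = {speedup_overall:.2f}")     # ~1.82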

62
Amdahl’s Law
◼ A program spends 90% of its time on computation and
the other 10% on disk accesses. If we use a 100X faster
CPU, how much speedup do we gain?

63
Amdahl’s Law
◼ A program spends 90% of its time on computation and
the other 10% on disk accesses. If we use a 100X faster
CPU, how much speedup do we gain?

ExTime_new = ExTime_old x (0.9/100 + 0.1) = 0.109 x ExTime_old

Speedup_overall = 1 / 0.109 = 9.17

64
Metrics of Performance

Application:                Answers per month, Operations per second
Programming Language /
Compiler:                   (millions of) Instructions per second: MIPS
ISA:                        (millions of) (FP) operations per second: MFLOP/s
Datapath / Control:         Megabytes per second
Function Units /
Transistors, Wires, Pins:   Cycles per second (clock rate)

65
Performance
◼ Performance is determined by execution time
◼ Do any of the other variables equal performance?
◼ # of cycles to execute program?
◼ # of instructions in program?
◼ # of cycles per second?
◼ average # of cycles per instruction?
◼ average # of instructions per second?

◼ Common pitfall: thinking one of the variables is


indicative of performance when it really isn’t.

66
CPU Performance Equation

Copyright © 2012, Elsevier Inc. All rights reserved. 67


CPU Performance Equation

Copyright © 2012, Elsevier Inc. All rights reserved. 68


CPU Performance Equation
◼ CPU time = CPU clock cycles for a program x clock
cycle time

◼ CPU time = Instruction count x Cycles per instruction x Clock


cycle time

CPU time = Seconds / Program = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)

69
CPU Performance Equation
◼ CPU time depends equally on IC, CPI, and clock cycle time.
◼ A 10% improvement in any one of them leads to a 10%
improvement in CPU time.
◼ It is difficult to change one parameter in complete
isolation from the others, because the basic technologies
involved in changing each characteristic are
interdependent
◼ Clock cycle time: hardware technology and organization
◼ CPI: organization and instruction set architecture
◼ Instruction count: instruction set architecture and compiler
technology

70
CPU Performance Equation

Copyright © 2012, Elsevier Inc. All rights reserved. 71


CPU Performance Equation
CPU time = Clock Cycle Time x Σ (i = 1..n) CPI_i x IC_i

CPI = [ Σ (i = 1..n) CPI_i x IC_i ] / Instruction Count

72
CPU Performance Equation

CPI = [ Σ (i = 1..n) CPI_i x IC_i ] / Instruction Count

    = Σ (i = 1..n) (IC_i / Instruction Count) x CPI_i

73
Calculating CPI
Base Machine (Reg / Reg)

Op       Freq   Cycles   CPI(i)   (% Time)
ALU      50%    1        0.5      (33%)
Load     20%    2        0.4      (27%)
Store    10%    2        0.2      (13%)
Branch   20%    2        0.4      (27%)
                Total CPI: 1.5

Typical Mix
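
A small sketch showing how the CPI and % Time columns above are derived from the instruction mix:

# Instruction mix: class -> (frequency, cycles per instruction)
mix = {"ALU": (0.50, 1), "Load": (0.20, 2), "Store": (0.10, 2), "Branch": (0.20, 2)}

cpi = sum(freq * cycles for freq, cycles in mix.values())   # 1.5
for op, (freq, cycles) in mix.items():
    contribution = freq * cycles
    print(f"{op:6s} CPI contribution {contribution:.1f}  ({contribution / cpi:.0%} of time)")
print(f"Overall CPI = {cpi}")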

74
Calculating CPI
▪ Suppose we have the following measurements:
▪ Frequency of FP operations = 25%
▪ Average CPI of FP operations = 4.0
▪ Average CPI of other instructions = 1.33
▪ Frequency of FPSQR = 2%
▪ CPI of FPSQR = 20
Assume that the two design alternatives are to decrease the CPI of
FPSQR to 2 or to decrease the average CPI of all FP operations to 2.5.
Compare these two design alternatives using the processor performance
equation.
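
A sketch of the comparison (my own working; instruction count and clock rate are assumed unchanged, so CPI alone decides):

cpi_original = 0.25 * 4.0 + 0.75 * 1.33            # ~2.0

# Alternative 1: CPI of FPSQR (2% of instructions) drops from 20 to 2
cpi_alt1 = cpi_original - 0.02 * (20 - 2)           # ~1.64

# Alternative 2: average CPI of all FP operations drops from 4.0 to 2.5
cpi_alt2 = 0.25 * 2.5 + 0.75 * 1.33                 # ~1.62

print(cpi_original, cpi_alt1, cpi_alt2)
print("speedup of alternative 2:", round(cpi_original / cpi_alt2, 2))  # ~1.23, slightly better than alt 1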

75
Assume that we make an enhancement to a computer that
improves some mode of execution by a factor of 10.
Enhanced mode is used 50% of the time, measured as a
percentage of the execution time when the enhanced mode
is in use.
a) What is the speedup we have obtained from the fast mode?
b) What percentage of the original execution time has been
converted to fast mode?

Copyright © 2012, Elsevier Inc. All rights reserved. 76


More on CPI
◼ Is CPI a good metric to evaluate CPU performance?

◼ Yes: when the clock rate is fixed, CPI is a good metric for
comparing pipelined machines, since higher instruction-level
parallelism results in a smaller CPI

◼ No: a small CPI alone does not mean a faster processor, since the
clock rate may vary and the compiler may generate programs of
different lengths

77
More on CPI
◼ Can CPI be less than 1?

◼ Yes. Multiple-issue processors, such as superscalars, allow
multiple instructions to issue in a single clock cycle

78
Amdahl’s Law for Multiprocessor
◼ Let T be the total execution time of a program on a
uniprocessor workstation
◼ Suppose the program has been parallelized or partitioned for
parallel execution on a cluster of many processing nodes
◼ Assume that a fraction α of the code must be executed
sequentially, called the sequential bottleneck
◼ The remaining 1 – α of the code can be compiled for parallel
execution by n processors
◼ Total execution time of the program = αT + (1 – α)T/n

79
Amdahl’s Law for Multiprocessor
◼ Speedup (S)
= T / [αT + (1 – α)T/n]
= 1/[α + (1- α)/n]
= n/[nα + (1- α)]

◼ The maximum speedup of n is achieved only when the
sequential bottleneck is eliminated, i.e., when the code is
fully parallelizable with α = 0.
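
A minimal sketch of this formula, evaluated for a few illustrative values of α and n (the values below are examples, not from the slides):

def multiprocessor_speedup(alpha, n):
    # Amdahl's law for n processors with sequential fraction alpha
    return n / (n * alpha + (1 - alpha))

for alpha in (0.0, 0.05, 0.25):
    for n in (4, 16, 64):
        print(f"alpha={alpha:.2f}, n={n:3d}: speedup = {multiprocessor_speedup(alpha, n):.1f}")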

80
◼ Three enhancements with the following speedups are proposed for a new
architecture:
◼ Speedup1 = 30
◼ Speedup2 = 20
◼ Speedup3 = 10
◼ Only one enhancement is usable at a time.
◼ a. If enhancements 1 and 2 are each usable for 30% of the time, what
fraction of the time must enhancement 3 be used to achieve an overall
speedup of 10?
◼ b. Assume the distribution of enhancement usage is 30%, 30%, and 20% for
enhancements 1, 2, and 3, respectively. Assuming all three enhancements
are in use, for what fraction of the reduced execution time is no
enhancement in use?
◼ c. Assume for some benchmark, the fraction of use is 15% for each of
enhancements 1 and 2 and 70% for enhancement 3. We want to maximize
performance. If only one enhancement can be implemented, which should it
be? If two enhancements can be implemented, which should be chosen?
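
A sketch of how these parts can be set up with Amdahl's law for mutually exclusive enhancements (the working below is my own, treating the quoted percentages as fractions of the original execution time):

speedups = {1: 30, 2: 20, 3: 10}

def new_time(fractions):
    # fractions: enhancement -> fraction of ORIGINAL execution time it covers
    unenhanced = 1 - sum(fractions.values())
    return unenhanced + sum(f / speedups[e] for e, f in fractions.items())

# (a) overall speedup 10 with f1 = f2 = 0.3:
#     0.1 = (0.4 - f3) + 0.3/30 + 0.3/20 + f3/10  =>  f3 = 0.325 / 0.9
f3 = (0.425 - 0.1) / 0.9
print(f"(a) required fraction for enhancement 3: {f3:.3f}")          # ~0.36

# (b) usage 30%, 30%, 20%: share of the REDUCED time with no enhancement in use
t = new_time({1: 0.3, 2: 0.3, 3: 0.2})                               # 0.245 of original
print(f"(b) no-enhancement share of reduced time: {0.2 / t:.1%}")    # ~82%

# (c) usage 15%, 15%, 70%: best single enhancement and best pair
fr = {1: 0.15, 2: 0.15, 3: 0.70}
for choice in ([1], [2], [3], [1, 2], [1, 3], [2, 3]):
    t = new_time({e: fr[e] for e in choice})
    print(f"(c) enhancements {choice}: speedup = {1 / t:.2f}")        # best single: 3; best pair: 1 and 3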

Copyright © 2012, Elsevier Inc. All rights reserved. 81


◼ Assume that we make an enhancement to a computer that improves
some mode of execution by a factor of 10. Enhanced mode is used
50% of the time, measured as a percentage of the execution time
when the enhanced mode is in use. Recall that Amdahl’s Law
depends on the fraction of the original, unenhanced execution time
that could make use of enhanced mode. Thus we cannot directly use
this 50% measurement to compute speedup with Amdahl’s Law.
◼ a. What is the speedup we have obtained from fast mode?
◼ b. What percentage of the original execution time has been
converted to fast mode?
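
A sketch of the answer (my own working; the key step is converting the 50% of enhanced time back into a fraction of the original time):

# Work in units of the NEW (enhanced) execution time: half of it runs in fast mode.
new_fast = 0.5     # this half would have taken 10x longer without the enhancement
new_slow = 0.5

original_time = new_fast * 10 + new_slow               # 5.5 units

speedup = original_time / (new_fast + new_slow)        # (a) 5.5
fraction_converted = (new_fast * 10) / original_time   # (b) ~0.909

print(f"(a) overall speedup: {speedup}")
print(f"(b) fraction of original time converted to fast mode: {fraction_converted:.1%}")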

Copyright © 2012, Elsevier Inc. All rights reserved. 82
