
Computer Architecture

A Quantitative Approach, Fifth Edition

Chapter 1
Fundamentals of Quantitative
Design and Analysis

Copyright © 2012, Elsevier Inc. All rights reserved. 1


Introduction
Computer Technology
 Performance improvements:
 Improvements in semiconductor technology
 Feature size, clock speed
 Improvements in computer architectures
 Enabled by HLL compilers, UNIX
 Led to RISC architectures

 Together have enabled:


 Lightweight computers
 Productivity-based managed/interpreted
programming languages

Copyright © 2012, Elsevier Inc. All rights reserved. 2


Introduction
Single Processor Performance

(Figure: growth in single-processor performance over time; chart annotations: "RISC", "Move to multi-processor")

Copyright © 2012, Elsevier Inc. All rights reserved. 3


Introduction
Current Trends in Architecture
 Cannot continue to leverage Instruction-Level
parallelism (ILP)
 Single processor performance improvement ended in
2003

 New models for performance:


 Data-level parallelism (DLP)
 Thread-level parallelism (TLP)
 Request-level parallelism (RLP)

 These require explicit restructuring of the


application

Copyright © 2012, Elsevier Inc. All rights reserved. 4


Classes of Computers
Classes of Computers
 Personal Mobile Device (PMD)
 e.g. smart phones, tablet computers
 Emphasis on energy efficiency and real-time performance
 Desktop Computing
 Emphasis on price-performance
 Servers
 Emphasis on availability, scalability, throughput
 Clusters / Warehouse Scale Computers
 Used for “Software as a Service (SaaS)”
 Emphasis on availability and price-performance
 Sub-class: Supercomputers, emphasis: floating-point
performance and fast internal networks
 Embedded Computers
 Emphasis: price

Copyright © 2012, Elsevier Inc. All rights reserved. 5


Classes of Computers
Parallelism
 Classes of parallelism in applications:
 Data-Level Parallelism (DLP)
 Task-Level Parallelism (TLP)

 Classes of architectural parallelism:


 Instruction-Level Parallelism (ILP)
 Vector architectures/Graphic Processor Units (GPUs)
 Thread-Level Parallelism
 Request-Level Parallelism

Copyright © 2012, Elsevier Inc. All rights reserved. 6


Classes of Computers
Flynn’s Taxonomy
 Single instruction stream, single data stream (SISD)

 Single instruction stream, multiple data streams (SIMD)


 Vector architectures
 Multimedia extensions
 Graphics processor units

 Multiple instruction streams, single data stream (MISD)


 No commercial implementation

 Multiple instruction streams, multiple data streams


(MIMD)
 Tightly-coupled MIMD
 Loosely-coupled MIMD

Copyright © 2012, Elsevier Inc. All rights reserved. 7


Defining Computer Architecture
Defining Computer Architecture
 “Old” view of computer architecture:
 Instruction Set Architecture (ISA) design
 i.e. decisions regarding:
 registers, memory addressing, addressing modes,
instruction operands, available operations, control flow
instructions, instruction encoding

 “Real” computer architecture:


 Specific requirements of the target machine
 Design to maximize performance within constraints:
cost, power, and availability
 Includes ISA, microarchitecture, hardware

Copyright © 2012, Elsevier Inc. All rights reserved. 8


Figure 1.6 MIPS64 instruction set architecture formats. All instructions are 32 bits long. The R format is for integer
register-to-register operations, such as DADDU, DSUBU, and so on. The I format is for data transfers, branches, and
immediate instructions, such as LD, SD, BEQZ, and DADDIs. The J format is for jumps, the FR format for floating-point
operations, and the FI format for floating-point branches.

Copyright © 2011, Elsevier Inc. All rights Reserved. 9


Trends in Technology
Trends in Technology
 Integrated circuit technology
 Transistor density: 35%/year
 Die size: 10-20%/year
 Integration overall: 40-55%/year

 DRAM capacity: 25-40%/year (slowing)

 Flash capacity: 50-60%/year


 15-20X cheaper/bit than DRAM

 Magnetic disk technology: 40%/year


 15-25X cheaper/bit than Flash
 300-500X cheaper/bit than DRAM
 Network technology
Copyright © 2012, Elsevier Inc. All rights reserved. 10
Trends in Technology
Bandwidth and Latency
 Bandwidth or throughput
 Total work done in a given time
 10,000-25,000X improvement for processors
 300-1200X improvement for memory and disks

 Latency or response time


 Time between start and completion of an event
 30-80X improvement for processors
 6-8X improvement for memory and disks

Copyright © 2012, Elsevier Inc. All rights reserved. 11


Trends in Technology
Bandwidth and Latency

Log-log plot of bandwidth and latency milestones

Copyright © 2012, Elsevier Inc. All rights reserved. 12


Trends in Technology
Transistors and Wires
 Feature size
 Minimum size of transistor or wire in x or y
dimension
 10 microns in 1971 to 0.032 microns in 2011
 Transistor performance scales linearly
 Wire delay does not improve with feature size!
 Integration density scales quadratically

Copyright © 2012, Elsevier Inc. All rights reserved. 13


Trends in Power and Energy
Power and Energy
 Problem: Get power in, get power out

 Thermal Design Power (TDP)


 Characterizes sustained power consumption
 Used as target for power supply and cooling system
 Lower than peak power, higher than average power
consumption

 Clock rate can be reduced dynamically to limit


power consumption

 Energy per task is often a better measurement


Copyright © 2012, Elsevier Inc. All rights reserved. 14
 Example: processor A has 20% higher average power
consumption than processor B
 A executes the task in only 70% of the
time needed by B
 E(A) = 120 x 70 = 8,400
 E(B) = 100 x 100 = 10,000, so A consumes less
energy for the task (see the sketch below)
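A small Python sketch of this comparison (my own illustration, not from the slides), with processor B's power and time normalized to 100:

# Hypothetical normalized values: processor B draws power 100 for time 100.
power_b, time_b = 100.0, 100.0
power_a, time_a = 1.2 * power_b, 0.7 * time_b   # A: +20% power, 70% of the time

energy_a = power_a * time_a   # 120 * 70  = 8,400
energy_b = power_b * time_b   # 100 * 100 = 10,000

print(energy_a / energy_b)    # 0.84 -> A uses about 16% less energy for the task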

Copyright © 2012, Elsevier Inc. All rights reserved. 15


Trends in Power and Energy
Dynamic Energy and Power
 Dynamic energy
 Full switching pulse 0 -> 1 -> 0 (or 1 -> 0 -> 1)
 Capacitive load x Voltage^2
 Single transition 0 -> 1 or 1 -> 0
 ½ x Capacitive load x Voltage^2

 Dynamic power
 ½ x Capacitive load x Voltage^2 x Frequency switched

 Reducing clock rate reduces power, not energy
(see the sketch below)
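A minimal Python sketch of these proportionalities (an illustration with made-up capacitance, voltage, and frequency values, not from the slides):

def dynamic_energy(cap_load, voltage):
    # Energy of a single 0 -> 1 or 1 -> 0 transition: 1/2 x C x V^2
    return 0.5 * cap_load * voltage ** 2

def dynamic_power(cap_load, voltage, freq_switched):
    # Dynamic power: 1/2 x C x V^2 x f
    return 0.5 * cap_load * voltage ** 2 * freq_switched

# Halving the clock rate halves dynamic power, but the energy per
# transition (and hence per task) is unchanged.
print(dynamic_power(1e-9, 1.0, 3e9))    # 1.5 W at 3 GHz (assumed values)
print(dynamic_power(1e-9, 1.0, 1.5e9))  # 0.75 W at 1.5 GHz
print(dynamic_energy(1e-9, 1.0))        # 5e-10 J either way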

Copyright © 2012, Elsevier Inc. All rights reserved. 16


 Some processors today are designed to
have adjustable voltages, so a 15%
reduction in voltage may result in a 15%
reduction in frequency. What would be the
impact on dynamic energy and on dynamic
power? (a sketch of the arithmetic follows)
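A sketch of the arithmetic (my own working, using the dynamic energy and power relations from the previous slide), with voltage and frequency both scaled by 0.85:

voltage_scale, freq_scale = 0.85, 0.85

energy_ratio = voltage_scale ** 2               # E ~ C x V^2     -> ~0.72
power_ratio = voltage_scale ** 2 * freq_scale   # P ~ C x V^2 x f -> ~0.61

print(energy_ratio, power_ratio)  # dynamic energy drops ~28%, dynamic power ~39%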

Copyright © 2012, Elsevier Inc. All rights reserved. 17


Trends in Power and Energy
Power
 Intel 80386
consumed ~ 2 W
 3.3 GHz Intel
Core i7 consumes
130 W
 Heat must be
dissipated from
1.5 x 1.5 cm chip
 This is the limit of
what can be
cooled by air

Copyright © 2012, Elsevier Inc. All rights reserved. 18


Trends in Power and Energy
Reducing Power
 Techniques for reducing power:
 Do nothing well
 Dynamic Voltage-Frequency Scaling
 Low power state for DRAM, disks
 Overclocking, turning off cores

Copyright © 2012, Elsevier Inc. All rights reserved. 19


Figure 1.12 Energy savings for a server using an AMD Opteron microprocessor, 8 GB of DRAM, and one ATA disk.
At 1.8 GHz, the server can only handle up to two-thirds of the workload without causing service level violations, and, at 1.0 GHz, it can
only safely handle one-third of the workload. (Figure 5.11 in Barroso and Hölzle [2009].)

Copyright © 2011, Elsevier Inc. All rights Reserved. 20


Trends in Power and Energy
Static Power
 Static power consumption
 Power_static = Current_static x Voltage
 Scales with number of transistors
 To reduce: power gating

Copyright © 2012, Elsevier Inc. All rights reserved. 21


Trends in Cost
Trends in Cost
 Cost driven down by learning curve
 Yield

 DRAM: price closely tracks cost

 Microprocessors: price depends on


volume
 10% less for each doubling of volume

Copyright © 2012, Elsevier Inc. All rights reserved. 22


Trends in Cost
Integrated Circuit Cost
 Integrated circuit cost
 Cost of die = Cost of wafer / (Dies per wafer x Die yield)
 Dies per wafer ≈ (Wafer area / Die area) - (π x Wafer diameter / sqrt(2 x Die area))

 Bose-Einstein formula for die yield:
 Die yield = Wafer yield x 1 / (1 + Defects per unit area x Die area)^N

 Defects per unit area = 0.016-0.057 defects per square cm (2010)

 N = process-complexity factor = 11.5-15.5 (40 nm, 2010)

Copyright © 2012, Elsevier Inc. All rights reserved. 23


Figure 1.13 Photograph of an Intel Core i7 microprocessor die, which is evaluated in Chapters 2 through 5. The
dimensions are 18.9 mm by 13.6 mm (257 mm2) in a 45 nm process. (Courtesy Intel.)

Copyright © 2011, Elsevier Inc. All rights Reserved. 24


Figure 1.14 Floorplan of Core i7 die in Figure 1.13 on left with close-up of floorplan of second core on right.

Copyright © 2011, Elsevier Inc. All rights Reserved. 25


Figure 1.15 This 300 mm wafer contains 280 full Sandy Bridge dies, each 20.7 by 10.5 mm in a 32 nm process. (Sandy
Bridge is Intel’s successor to Nehalem used in the Core i7.) At 216 mm2, the formula for dies per wafer estimates 282. (Courtesy
Intel.)

Copyright © 2011, Elsevier Inc. All rights Reserved. 26


 Find the number of dies per 300 mm wafer
for a die that is 1.5 cm on a side and for a
die that is 1.0 cm on a side (a worked sketch follows)
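A Python sketch of this calculation, assuming the usual dies-per-wafer approximation (wafer area over die area, minus a correction for dies lost at the edge); treat it as my working rather than an official solution:

import math

def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss

print(dies_per_wafer(30, 1.5 * 1.5))  # ~270 dies for the 1.5 cm die
print(dies_per_wafer(30, 1.0 * 1.0))  # ~640 dies for the 1.0 cm die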

Copyright © 2012, Elsevier Inc. All rights reserved. 27


 Find the die yield for dies that are 1.5
cm on a side and 1.0 cm on a side,
assuming a defect density of 0.031 per
cm^2 and N = 13.5 (a worked sketch follows)
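A sketch of the die-yield arithmetic, assuming the Bose-Einstein yield model quoted earlier and a wafer yield of 100%:

def die_yield(defects_per_cm2, die_area_cm2, n, wafer_yield=1.0):
    # Bose-Einstein model: yield = wafer_yield / (1 + defect_density x area)^N
    return wafer_yield / (1 + defects_per_cm2 * die_area_cm2) ** n

print(die_yield(0.031, 2.25, 13.5))  # ~0.40 for the 1.5 cm die
print(die_yield(0.031, 1.00, 13.5))  # ~0.66 for the 1.0 cm die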

Copyright © 2012, Elsevier Inc. All rights reserved. 28


Dependability
Dependability
 Systems alternate between two states of service
with respect to an SLA (service level agreement):
1. Service accomplishment: service is delivered as
specified
2. Service interruption: the delivered service is different
from the SLA
 Transitions between these two states are
caused by failures (1 to 2) or restorations (2 to 1)

Copyright © 2012, Elsevier Inc. All rights reserved. 29


Dependability
Dependability
 Module reliability
 Mean time to failure (MTTF)
 Mean time to repair (MTTR)
 Mean time between failures (MTBF) = MTTF + MTTR
 Availability = MTTF / (MTTF + MTTR) = MTTF / MTBF
(see the sketch below)
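A tiny Python sketch of these definitions (the MTTF and MTTR numbers are made-up examples):

def availability(mttf_hours, mttr_hours):
    # Availability = MTTF / (MTTF + MTTR) = MTTF / MTBF
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical module: MTTF of 1,000,000 hours, MTTR of 24 hours.
print(availability(1_000_000, 24))  # ~0.999976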

Copyright © 2012, Elsevier Inc. All rights reserved. 30


Measuring Performance
Measuring Performance
 Typical performance metrics:
 Response time
 Throughput

 Speedup of X relative to Y
 Execution timeY / Execution timeX

 Execution time
 Wall clock time: includes all system overheads
 CPU time: only computation time

 Benchmarks
 Kernels (e.g. matrix multiply)
 Toy programs (e.g. sorting)
 Synthetic benchmarks (e.g. Dhrystone)
 Benchmark suites (e.g. SPEC06fp, TPC-C)

Copyright © 2012, Elsevier Inc. All rights reserved. 31


Measuring and Reporting Performance

 Benchmarks, Traces, Mixes


 Hardware: Cost, delay, area, power
estimation
 Simulation (many levels)
 ISA, RT, Gate, Circuit
 Queuing Theory, analytical model
 Rules of Thumb
 Fundamental “Laws”/Principles

32
Performance
"X is n times faster than Y" means

ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y) = n

• Performance is inversely proportional to execution time

33
Computer Performance
 Typical performance metrics:
 Response time
 Throughput
 Response Time (latency)
 How long does it take for my job to run?
 How long does it take to execute a job?
 How long must I wait for the database query?
 Throughput
 How many jobs can the machine run at once?
 What is the average execution rate?
 How much work is getting done?

34
Computer Performance
 If we upgrade a machine with a new
processor what do we increase?
 If we add a new machine to the lab what
do we increase?

Copyright © 2012, Elsevier Inc. All rights reserved. 35


Execution Time
 Elapsed Time (wall-clock time, response time)
 counts everything (disk and memory accesses, I/O , etc.)
 a useful number, but often not good for comparison purposes
 CPU time
 doesn't count I/O or time spent running other programs
 can be broken up into system time, and user time
 Our focus: user CPU time
 time spent executing the lines of code that are "in" our program

Copyright © 2012, Elsevier Inc. All rights reserved. 36


Execution Time
 CPU time: the accumulated time during which CPU is
computing:
 user CPU time: CPU time spent on user program
 system CPU time: CPU time spent on OS
 An example from UNIX: 90.7u 12.9s 2:39 65%
 90.7u: user CPU time (seconds)

 12.9s: system CPU time

 2:39(159 sec): elapsed time

 65%: percentage of elapsed time spent in the CPU, i.e. (90.7 + 12.9) / 159 ≈ 65%

Copyright © 2012, Elsevier Inc. All rights reserved. 37


Measuring Performance
 Use benchmarks: The best choice of benchmarks to
measure performance is real applications.

 Running programs that are much simpler than the real
application has led to performance pitfalls

 Benchmarks
 Kernels: which are small, key pieces of real applications.
 Toy programs (e.g. sorting)
 Synthetic benchmarks (e.g. Dhrystone)
 Benchmark suites (e.g. SPEC06fp, TPC-C)

Copyright © 2012, Elsevier Inc. All rights reserved. 38


Benchmark Suites
 Definition
 A collection of benchmark programs (kernels, toy programs, real applications)
 Lessens the weakness of any one benchmark by the presence of
the others
 Goal: characterize the relative performance of two computers,
particularly for programs not in the suite that customers are likely to
run
 SPEC (Standard Performance Evaluation Corporation) is one of the
most successful of these efforts

39
Benchmark Suites
 Desktop Benchmark Suites
 CPU-intensive benchmarks
 Graphics-intensive benchmarks
 CPU-intensive benchmarks
 SPEC89 → SPEC92 → SPEC95 → SPEC2000 → SPEC CPU2006
(12 integer benchmarks (CINT2006) and 17 floating-point
benchmarks (CFP2006))
 SPEC benchmarks are real programs modified for portability and
to minimize the effect of I/O on performance (highlighting the CPU)
 Graphics-intensive benchmarks:
 SPECviewperf for systems supporting the OpenGL graphics library,
SPECapc for applications with intensive use of graphics (dealing
with 3D images, CAD/CAM image library)

40
Figure 1.16 SPEC2006 programs and the evolution of the SPEC benchmarks over time, with integer programs above the line and floating-point programs
below the line. Of the 12 SPEC2006 integer programs, 9 are written in C, and the rest in C++. For the floating-point programs, the split is 6 in Fortran, 4 in C++, 3 in
C, and 4 in mixed C and Fortran. The figure shows all 70 of the programs in the 1989, 1992, 1995, 2000, and 2006 releases. The benchmark descriptions on the left
are for SPEC2006 only and do not apply to earlier versions. Programs in the same row from different generations of SPEC are generally not related; for example,
fpppp is not a CFD code like bwaves. Gcc is the senior citizen of the group. Only 3 integer programs and 3 floating-point programs survived three or more
generations. Note that all the floating-point programs are new for SPEC2006. Although a few are carried over from generation to generation, the version of the
program changes and either the input or the size of the benchmark is often changed to increase its running time and to avoid perturbation in measurement or
domination of the execution time by some factor other than CPU time. 41
Benchmark Suites
 Server Benchmark Suites
 Servers have multiple functions → multiple types of benchmarks
 Processor-throughput benchmarks
 Based on the SPEC CPU benchmarks (e.g. SPEC CPU2000/CPU2006)
 Run multiple copies of each SPEC benchmark, converting CPU time into a rate
(SPECrate); SPECrate is a measure of request-level parallelism
 SPEC also offers high-performance computing benchmarks around OpenMP and MPI to
measure thread-level parallelism
 SPEC offers both a file server benchmark (SPECSFS) and a Web
server benchmark (SPECWeb)
 SPECSFS is a benchmark for measuring NFS (Network File
System) performance
 It tests the performance of the I/O system (both disk and network I/O) as well as
the processor
 SPECWeb is a Web server benchmark
 SPECjbb measures server performance for Web applications written in Java

42
Benchmark Suites
 Transaction-processing (TP) benchmarks: measure the
ability of a system to handle database transactions
(transactions per second)
 TPC (Transaction Processing Council): TPC-A (1985) → TPC-C
(complex query OLTP) → TPC-H (ad hoc decision support) → TPC-R
(business decision support) → TPC-W (Web-oriented)
 TPC-E: an On-Line Transaction Processing (OLTP) workload
that simulates a brokerage firm's customer accounts
 TPC Energy: adds energy metrics to all existing TPC
benchmarks

43
Benchmark Suites
 Embedded Benchmarks
 Electronic Design News Embedded Microprocessor Benchmark
Consortium (EEMBC, pronounced "embassy")
 A set of 41 kernels used to predict the performance of different
embedded applications: automotive/industrial, consumer,
networking, office automation, and telecommunications

44
Reporting Results
 Reproducibility
 Another experiment with everything listed would duplicate the
results
 Factors that affect the performance results of benchmarks
 Require a fairly complete description of the machine and compiler
flags …
 System software configuration
 OS performance and support
 Compiler optimization levels
 Benchmark-specific flags
 Source code modifications?
 Not allowed (e.g. SPEC)
 Allowed but essentially impossible (e.g. TPC)
 Allowed and doable (e.g. NAS & EEMBC)

45
Comparing and Summarizing Results
 Total execution time (or its mean, (Σ Ti)/n)

                  Computer A   Computer B   Computer C
P1 (secs)                  1           10           20
P2 (secs)               1000          100           20
Total time (secs)       1001          110           40

Pairwise results:
A is 10 times faster than B for P1
B is 10 times faster than A for P2
A is 20 times faster than C for P1
C is 50 times faster than A for P2 ……
Simple summary (total time):
B is 9.1 times faster than A for P1 and P2
C is 25 times faster than A for P1 and P2 …

46
Quantitative Principles of Computer Design
 Take advantage of parallelism: Important method to
improve performance
 System level
 To improve the throughput performance of a typical server
benchmark (SPECWeb, TPC-C), multiple processors and disks can
be used
 Individual processor
 Parallelism among instructions is critical to achieving high
performance
 Pipelining
 Digital design
 A set-associative cache uses multiple banks of memory that are
searched in parallel to find the desired data item
 Carry-lookahead adders

47
Quantitative Principles of Computer Design
 Principle of Locality
 Programs tend to reuse data and instructions they have used
recently
 A program typically spends about 90% of its execution time in only 10% of its
code
 Temporal Locality
 Recently accessed data items are likely to be accessed in the
near future
 Spatial Locality
 Items whose addresses are near to one another tend to be
referenced close in time

48
Quantitative Principles of Computer Design
 Make the Common Case Fast and Make Other Parts
Correct
 Finding a good trade-off is the computer architect's purpose
 What is the right interface between hardware and software?
 In a CPU where overflow is rare, optimize the normal "add"
path (typically more than 90% of cases) rather than giving priority to the
overflow-handling path

49
Amdahl's Law
 Amdahl's law states that the performance improvement
to be gained from using some faster mode of execution is
limited by the fraction of time the faster mode can be
used
 Amdahl's law defines the speedup that can be gained by
using a particular feature

Speedup = ExTime w/o E / ExTime w/ E = Performance w/ E / Performance w/o E

Copyright © 2012, Elsevier Inc. All rights reserved. 50


Speedup
 Depends on two factors:
 Fraction_enhanced
 The fraction of the computation time in the original computer that
can be converted to take advantage of the enhancement.
Fraction_enhanced is always less than or equal to 1.

 Speedup_enhanced
 The improvement gained by the enhanced execution mode; that is,
how much faster the task would run if the enhanced mode were
used for the entire program. Speedup_enhanced is always greater than 1.

51
Amdahl’s Law

ExTime_new = ExTime_old x [ (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

Speedup_overall = ExTime_old / ExTime_new
                = 1 / [ (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

(A small code sketch of this formula follows.)
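A small Python sketch of this formula (my own helper function, not from the text):

def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    # Overall speedup when only a fraction of the task is accelerated.
    return 1.0 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

print(amdahl_speedup(0.1, 2))    # FP example on a later slide: ~1.053
print(amdahl_speedup(0.9, 100))  # CPU/disk example on a later slide: ~9.17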

52
Amdahl’s Law
 Floating point instructions improved to run 2X; but
only 10% of actual instructions are FP

ExTime_new =

Speedup_overall =

53
Amdahl’s Law
 Floating point instructions improved to run 2X; but
only 10% of actual instructions are FP

ExTime_new = ExTime_old x (0.9 + 0.1/2) = 0.95 x ExTime_old

Speedup_overall = 1 / 0.95 = 1.053

• What's the overall performance gain if we could
improve the non-FP instructions by 2x? (a sketch follows)
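A sketch of that follow-up question (my own working, keeping the 10%/90% split): the enhanced fraction is now the 90% of non-FP work.

ex_time_ratio = 0.1 + 0.9 / 2        # new time relative to old: 0.55
speedup_overall = 1 / ex_time_ratio  # ~1.82

print(ex_time_ratio, speedup_overall)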

54
Amdahl’s Law
 A program spends 90% of its time on computation and
the other 10% on disk accesses. If we use a 100X faster
CPU, how much speedup do we gain?

55
Amdahl’s Law
 A program spends 90% of its time on computation and
the other 10% on disk accesses. If we use a 100X faster
CPU, how much speedup do we gain?

ExTime_new = ExTime_old x (0.9/100 + 0.1) = 0.109 x ExTime_old

Speedup_overall = 1 / 0.109 = 9.17

56
Metrics of Performance

(Levels of the system stack and the performance metrics used at each level)

Application: answers per month, operations per second
Programming language / Compiler: (millions of) instructions per second (MIPS)
ISA: (millions of) floating-point operations per second (MFLOP/s)
Datapath / Control: megabytes per second
Function units / Transistors, wires, pins: cycles per second (clock rate)

57
Performance
 Performance is determined by execution time
 Are any of the other variables equivalent to performance?
 # of cycles to execute program?
 # of instructions in program?
 # of cycles per second?
 average # of cycles per instruction?
 average # of instructions per second?

 Common pitfall: thinking one of the variables is indicative


of performance when it really isn’t.

58
CPU Performance Equation

Copyright © 2012, Elsevier Inc. All rights reserved. 59


CPU Performance Equation

Copyright © 2012, Elsevier Inc. All rights reserved. 60


CPU Performance Equation
 CPU time = CPU clock cycles for a program x clock
cycle time

 CPU time = Instruction count x Cycles per instruction x Clock cycle time

CPU time = Seconds / Program
         = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)
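A minimal Python sketch of this equation (all numbers are made-up examples):

def cpu_time(instruction_count, cpi, clock_cycle_time_s):
    # CPU time = IC x CPI x clock cycle time
    return instruction_count * cpi * clock_cycle_time_s

# Hypothetical program: 1e9 instructions, CPI of 1.5, 1 GHz clock (1 ns cycle).
print(cpu_time(1e9, 1.5, 1e-9))  # 1.5 seconds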

61
CPU Performance Equation
 CPU time depends equally on IC, CPI, and clock cycle time.
 A 10% improvement in any one of them leads to a 10%
improvement in CPU time.
 It is difficult to change one parameter in complete
isolation from the others because the basic technologies
involved in changing each characteristic are
interdependent
 Clock Cycle Time : Hardware technology and Organization
 CPI : Organization and Instruction set architecture
 Instruction count : Instruction set architecture and Compiler
technology

62
CPU Performance Equation

Copyright © 2012, Elsevier Inc. All rights reserved. 63


CPU Performance Equation
CPU time = Clock cycle time x Σ (i = 1 to n) CPI_i x IC_i

CPI = [ Σ (i = 1 to n) CPI_i x IC_i ] / Instruction count
64
CPU Performance Equation

CPI = [ Σ (i = 1 to n) CPI_i x IC_i ] / Instruction count
    = Σ (i = 1 to n) (IC_i / Instruction count) x CPI_i

(A sketch of this weighted-CPI calculation follows.)
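A Python sketch of the weighted-CPI form, using the instruction mix shown on the next slide as an assumed example:

# (frequency, cycles) per instruction class -- the mix from the next slide.
mix = {"ALU": (0.50, 1), "Load": (0.20, 2), "Store": (0.10, 2), "Branch": (0.20, 2)}

cpi = sum(freq * cycles for freq, cycles in mix.values())
print(cpi)  # 1.5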

65
Calculating CPI
Base Machine (Reg / Reg), typical instruction mix

Op       Freq   Cycles   CPI(i)   (% Time)
ALU      50%    1        0.5      (33%)
Load     20%    2        0.4      (27%)
Store    10%    2        0.2      (13%)
Branch   20%    2        0.4      (27%)
                Total CPI = 1.5

66
Calculating CPI

Suppose we have the following measurements:
 Frequency of FP operation = 25 %
 Average CPI of FP operations = 4.0
 Average CPI of other instructions = 1.33
 Frequency of FPSQR = 2%
 CPI of FPSQR = 20
Assume that the two design alternatives are to decrease the CPI of
FPSQR to 2 or to decrease the average CPI of all FP operations to
2.5. Compare these two design alternatives using the processor
performance equation (a worked sketch follows).
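A sketch of the comparison (my own working with the processor performance equation; the baseline CPI comes from the frequencies given above):

cpi_original = 0.25 * 4.0 + 0.75 * 1.33         # ~2.0

# Alternative 1: reduce the CPI of FPSQR (2% of instructions) from 20 to 2.
cpi_new_fpsqr = cpi_original - 0.02 * (20 - 2)  # ~1.64

# Alternative 2: reduce the average CPI of all FP operations from 4.0 to 2.5.
cpi_new_fp = 0.25 * 2.5 + 0.75 * 1.33           # ~1.62

print(cpi_original / cpi_new_fpsqr)  # ~1.22x speedup
print(cpi_original / cpi_new_fp)     # ~1.23x speedup: slightly better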

67
More on CPI
 Is CPI a good metric to evaluate CPU performance?

 Yes: when the clock rate is fixed, CPI is a good metric for
comparing pipelined machines, since higher instruction-level
parallelism results in a smaller CPI

 No: a small CPI does not by itself mean a faster processor, since the clock
rate may vary and the compiler may generate programs of
different lengths

68
More on CPI

Can CPI be less than 1?

 Yes. Multiple-issue processors, such as superscalars, allow
multiple instructions to issue in a single clock cycle

69
Amdahl’s Law for Multiprocessor

 Let T be the total execution time of a program on a
uniprocessor workstation
 Suppose the program has been parallelized or partitioned for
parallel execution on a cluster of many processing nodes
 Assume that a fraction α of the code must be executed
sequentially, called the sequential bottleneck
 The remaining 1 – α of the code can be compiled for parallel execution
by n processors
 Total execution time of the program = αT + (1 – α)T/n

70
Amdahl’s Law for Multiprocessor
 Speedup (S)
= T / [αT + (1 – α)T/n]
= 1 / [α + (1 – α)/n]
= n / [nα + (1 – α)]

 The maximum speedup of n is achieved only if the
sequential bottleneck α is reduced to zero, i.e. the code is
fully parallelizable with α = 0 (see the sketch below).
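A small Python sketch of this speedup expression (illustrative values of α and n):

def multiprocessor_speedup(alpha, n):
    # S = 1 / (alpha + (1 - alpha) / n) = n / (n * alpha + (1 - alpha))
    return 1.0 / (alpha + (1 - alpha) / n)

for alpha in (0.0, 0.05, 0.25):
    print(alpha, multiprocessor_speedup(alpha, 16))
# alpha = 0 gives the ideal speedup of 16; even 5% sequential code cuts it to ~9.1.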

71
