
Computer Architecture (ENGI-5231)

Ehsan Atoofian
Electrical Engineering Department
Lakehead University
Chip Multiprocessor
 Instead of a single processor, several processors on the same die
 Power wall: each core has a simple architecture
 Memory wall: overlap computation and memory access (a chip
multiprocessor is less sensitive to memory latency than a single core)
 ILP wall: exploit thread-level parallelism

Figure: a single-processor chip vs. a chip multiprocessor with four
cores (P0, P1, P2, P3) on one die

2
Outline
Guidelines for Design & Analysis
1) Parallelism
2) Locality
3) Common case

Performance evaluation & benchmarks


Cost
Reliability & Availability

3
Guidelines for Design & Analysis
1) Parallelism
- In servers, multiple processors and disks improve
performance

- In processors, pipelining overlaps instruction execution,
reducing total execution time

4
Datapath of MIPS
Figure: the classic five-stage MIPS datapath (Instruction Fetch;
Instruction Decode / Register Fetch; Execute / Address Calculation;
Memory Access; Write Back), with the PC and next-PC logic, instruction
memory, register file, ALU, data memory, sign extension, and the
MUXes between stages.

5
Pipelining in MIPS
Figure: four instructions (add, mult, store, jump) flowing through the
five pipeline stages (Ifetch, Reg, ALU, DMem, Reg) over clock cycles
1 to 8; a new instruction enters the pipeline every cycle, so their
executions overlap.
6
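Not part of the slides: a minimal Python sketch of the cycle counts in
the figure, assuming an ideal five-stage pipeline with no stalls or
hazards.

def unpipelined_cycles(num_instructions: int, stages: int = 5) -> int:
    # Each instruction occupies the whole datapath for `stages` cycles.
    return num_instructions * stages

def pipelined_cycles(num_instructions: int, stages: int = 5) -> int:
    # The first instruction takes `stages` cycles to fill the pipeline;
    # after that, one instruction completes every cycle.
    return stages + (num_instructions - 1)

# The 4-instruction sequence from the figure (add, mult, store, jump):
print(unpipelined_cycles(4))  # 20 cycles without pipelining
print(pipelined_cycles(4))    # 8 cycles, matching cycles 1-8 in the figure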
Guidelines for Design & Analysis
1) Parallelism
- In servers, multiple processors and disks improve
performance

- In processors, pipelining overlaps instruction execution,
reducing total execution time

- At the component level, memory banks increase
parallelism

7
Guidelines for Design & Analysis
2) Locality: programs tend to reuse instructions and
data they have used recently
90% of a program's time is spent in 10% of its code

Caches work based on locality

Figure: Processor -> Cache -> Memory

8
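Not part of the slides: a minimal Python sketch of a toy LRU cache
showing how the reuse in a loop turns into a high hit rate; the access
pattern, block size, and cache size are made-up illustration values.

from collections import OrderedDict

def hit_rate(addresses, cache_blocks=4, block_size=4):
    cache = OrderedDict()               # LRU cache of block numbers
    hits = 0
    for addr in addresses:
        block = addr // block_size
        if block in cache:
            hits += 1
            cache.move_to_end(block)    # refresh LRU position
        else:
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(addresses)

# A loop sweeping a small array 10 times reuses the same few blocks
# (temporal locality) and walks them in order (spatial locality):
loop = [a for _ in range(10) for a in range(16)]
print(hit_rate(loop))  # 0.975: only the 4 cold misses on the first pass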
Guidelines for Design & Analysis
3) Common case: in a design trade-off, favour the
frequent case over the rare case
e.g., the instruction fetch and decode stages are common
to all instructions, but the multiplier is not; optimize
fetch and decode first

How much performance is gained by optimizing the
common case? Amdahl's Law

9
Amdahl’s Law
Amdahl's law gives the performance improvement of the
whole system achieved by enhancing only a fraction of the
system

Before: T_old = T_old1 + T_old2 (T_old2 is the part that can be enhanced)
After the partial enhancement: T_new1 = T_old1, T_new2 = T_old2 / Speedup_enhanced

Speedup_overall = T_old / T_new
                = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Note that Fraction_enhanced is the fraction of time before the enhancement.
10
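Not part of the slides: a minimal Python sketch of the formula above;
it also previews the two examples on the next slides.

def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    # fraction_enhanced: fraction of the ORIGINAL time that is enhanced
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

print(amdahl_speedup(0.4, 10))  # ~1.56 (next slide: 10X CPU, 40% computation)
print(amdahl_speedup(0.8, 2))   # ~1.67 (slide 12: 80% parallel, dual core)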


Example
A new CPU is 10X faster
60% of the time is spent waiting on I/O; 40% of the time is spent
on computation
Total speedup?

Speedup_overall = 1 / [(1 - 0.4) + 0.4 / 10] = 1 / 0.64 ≈ 1.56

11
Another Example
80% of a sequential program is parallelizable
Speedup on a dual core relative to a single core?

Speedup_overall = 1 / [(1 - 0.8) + 0.8 / 2] = 1 / 0.6 ≈ 1.67

12
Maximum Speedup
The maximum speedup achievable by improving
Fraction_enhanced (let Speedup_enhanced go to infinity):

Speedup_max = 1 / (1 - Fraction_enhanced)
13
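Not part of the slides: a small Python sketch showing the plateau for
Fraction_enhanced = 0.8, a value reused from the previous example.

def amdahl_speedup(f, s):
    return 1.0 / ((1.0 - f) + f / s)

# No matter how fast the enhanced part gets, speedup approaches 1 / (1 - f):
for s in (2, 10, 100, 1_000_000):
    print(s, round(amdahl_speedup(0.8, s), 3))
# 2 -> 1.667, 10 -> 3.571, 100 -> 4.808, 1000000 -> ~5.0 = 1 / (1 - 0.8)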
Outline
Guidelines for Design & Analysis
1) Parallelism
2) Locality
3) Common case

Performance evaluation & benchmarks


Cost
Reliability & Availability

14
Performance
What do we mean by: computer X is faster than
computer Y?

Execution time or throughput?

A desktop user is interested in execution time

A server administrator is interested in throughput

15
Performance
Assuming we are concerned with execution time

"X is n times faster than Y" means:

n = Execution time_Y / Execution time_X = Performance_X / Performance_Y
16
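Not part of the slides: a tiny Python sketch with made-up execution
times for the same program on X and Y.

time_x, time_y = 2.0, 6.0   # hypothetical seconds to run the same program
n = time_y / time_x         # X is n times faster than Y
print(n)                    # 3.0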
Execution Time

CPU time = Instruction Count × CPI_total × Clock cycle time

CPI_total = CPU Clock Cycles / # of total instructions
          = (∑ IC_i × CPI_i) / # of total instructions
          = ∑ frequency_i × CPI_i, where frequency_i = IC_i / # of total instructions
17
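Not part of the slides: a minimal Python sketch of the two equations
above; the instruction count, mix, and clock used below are made-up.

def cpu_time(instruction_count, cpi, clock_cycle_time):
    # Execution time = IC x CPI x clock cycle time
    return instruction_count * cpi * clock_cycle_time

def average_cpi(mix):
    # mix: list of (frequency_i, CPI_i) pairs whose frequencies sum to 1
    return sum(freq * cpi for freq, cpi in mix)

# A hypothetical 50/50 mix of 1-cycle and 2-cycle instructions:
print(average_cpi([(0.5, 1), (0.5, 2)]))  # 1.5
print(cpu_time(1_000_000, 1.5, 1e-9))     # 0.0015 s at a 1 GHz clock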
Example
 Assume a program with the following instruction mix:

Instruction   Frequency   CPI
ALU           45%         1
Branch        15%         3
Memory        40%         10

 What is the average CPI?

 In order to improve performance, we have the two following
options:
 1) Use a better ALU and reduce the CPI of ALU instructions to 0.3
 2) Use a better but more complex memory system, which reduces
the CPI of memory instructions to 9 but requires a 10% increase in
clock cycle time. Which option provides better performance?
18
Example
 CPI = 0.45×1 + 0.15×3 + 0.40×10 = 4.9

 1) CPI = 0.45×0.3 + 0.15×3 + 0.40×10 = 4.585
 Speedup = 4.9 / 4.585 ≈ 1.069

 2) CPI = 0.45×1 + 0.15×3 + 0.40×9 = 4.5
 Speedup = 4.9 / (4.5 × 1.1) ≈ 0.99 < 1: a slow-down

 Execution time is the REAL measure of computer performance!

 We should take into account the side effects of an optimization
technique
19
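Not part of the slides: a small Python check reproducing the numbers
above.

mix = {"alu": (0.45, 1.0), "branch": (0.15, 3.0), "mem": (0.40, 10.0)}

def avg_cpi(m):
    return sum(freq * cpi for freq, cpi in m.values())

base = avg_cpi(mix)                               # 4.9
opt1 = avg_cpi({**mix, "alu": (0.45, 0.3)})       # 4.585
opt2 = avg_cpi({**mix, "mem": (0.40, 9.0)})       # 4.5

# Speedup = old time / new time; IC is unchanged, so time ~ CPI x cycle time.
print(base / opt1)          # ~1.069: option 1 is a speedup
print(base / (opt2 * 1.1))  # ~0.99 (< 1): option 2 is a slow-down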
Execution Time

 # of instructions: depends on programmer, compiler, and ISA
 CISC vs. RISC?

 CPI (clock cycles per instruction): depends on the architecture of the
processor
 e.g., an architecture with a lower cache miss rate has a lower CPI

 Clock cycle time: depends on organization and VLSI technology
 Pipelining reduces the clock cycle time
 Smaller feature sizes make gates faster
20
Benchmarks
Which type of application should we use to measure performance?

A collection of benchmarks representative of real
applications is called a benchmark suite

21
SPEC Benchmark Suite
Standard Performance Evaluation Corporation (SPEC):
a popular benchmark suite for desktops
Focuses on processor performance
Initially introduced in 1989 (SPEC89)
Fifth generation released in 2006 (SPEC2006)
 12 integer benchmarks, 17 floating-point benchmarks
 A mix of C, C++, and Fortran programs

22
Outline
Guidelines for Design & Analysis
1) Parallelism
2) Locality
3) Common case

Performance evaluation & benchmarks


Cost
Reliability & Availability

23
Cost of Computers
Yield: percentage of manufactured parts that pass testing

Time: cost decreases over time due to the learning curve

Volume: increasing volume speeds movement down the learning curve
and amortizes design cost

Competition: reduces the gap between cost and selling
price

24
Wafer
Figure: a Pentium 4 wafer in 130nm technology

From Howe and Sodini, Microelectronics: An Integrated
Approach, Prentice Hall

25
Die
Pentium 4 Die

26
Cost of Chips
Cost of die = Cost of wafer / (Dies per wafer × Die yield)

Dies per wafer ≈ π × (Wafer diameter / 2)² / Die area
                − π × Wafer diameter / √(2 × Die area)
27
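Not part of the slides: a minimal Python sketch of these relations; the
wafer cost, wafer diameter, die area, and yield below are made-up
illustration values.

import math

def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    r = wafer_diameter_cm / 2
    whole = math.pi * r ** 2 / die_area_cm2                           # area ratio
    edge = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)  # edge loss
    return int(whole - edge)

def cost_of_die(wafer_cost, wafer_diameter_cm, die_area_cm2, die_yield):
    return wafer_cost / (dies_per_wafer(wafer_diameter_cm, die_area_cm2) * die_yield)

print(dies_per_wafer(30, 1.5))           # 416 dies on a 300 mm wafer
print(cost_of_die(5000, 30, 1.5, 0.6))   # ~$20 per die at 60% yield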
Die Yield
Empirical equation:

Die yield = Wafer yield × 1 / (1 + Defects per unit area × Die area)^N

Defects per unit area
 Random manufacturing defects
 0.016-0.057 defects per square cm (2010)

N = process-complexity factor = 11.5-15.5 (40 nm, 2010)

At the architectural level, only the die area is controllable; the other
parameters are dictated by the manufacturing process
28
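Not part of the slides: a minimal Python sketch of the empirical
equation above; 0.03 defects/cm² and N = 13.5 are mid-range values
from the slide, and the wafer yield is assumed to be 100%.

def die_yield(defects_per_cm2, die_area_cm2, n, wafer_yield=1.0):
    return wafer_yield / (1 + defects_per_cm2 * die_area_cm2) ** n

# Yield drops quickly as die area grows:
for area in (0.5, 1.0, 2.0, 4.0):
    print(area, round(die_yield(0.03, area, 13.5), 3))
# 0.5 -> 0.818, 1.0 -> 0.671, 2.0 -> 0.455, 4.0 -> 0.217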
Outline
Guidelines for Design & Analysis
1) Parallelism
2) Locality
3) Common case

Performance evaluation & benchmarks


Performance equation
Cost
Reliability & Availability

29
Reliability & Availability
As feature size shrinks, the failure rate increases

What is the chance that a system fails? Let's formulate
reliability and availability.

30
Reliability & Availability
Systems alternate between two states:
 1) Service accomplishment: service is delivered properly
 2) Service interruption: the system fails

Reliability: a measure of continuous service accomplishment

Timeline: T0 = start time, T1 = system fails, T2 = service is restored
Time to failure: T1 - T0
Time to repair: T2 - T1

31
Reliability & Availability
We are interested in averages:
Mean Time To Failure (MTTF)
 Failure rate: 1 / MTTF
Mean Time To Repair (MTTR)

Module availability: MTTF / (MTTF + MTTR)

32
Example
Calculate the MTTF for 10 disks (1M hour MTTF per disk),
1 disk controller (0.5M hour MTTF), and 1 power
supply (0.2M hour MTTF):

FailureRate = 10 × (1 / 1,000,000) + 1 / 500,000 + 1 / 200,000
            = (10 + 2 + 5) / 1,000,000
            = 17 / 1,000,000 failures per hour
            = 17,000 FIT (failures per billion hours)

MTTF = 1,000,000,000 / 17,000 ≈ 59,000 hours

33
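Not part of the slides: a minimal Python sketch reproducing the
calculation; it assumes independent failures (so failure rates simply
add), and the MTTR used for availability is a made-up value.

def system_mttf(mttf_hours):
    rate = sum(1.0 / m for m in mttf_hours)   # failures per hour
    return 1.0 / rate

components = [1_000_000] * 10 + [500_000, 200_000]  # 10 disks, controller, PSU
print(sum(1.0 / m for m in components) * 1e9)  # 17,000 FIT
print(system_mttf(components))                 # ~58,824 hours (slide rounds to 59,000)

mttr = 24.0  # assumed repair time in hours (not from the slide)
mttf = system_mttf(components)
print(mttf / (mttf + mttr))                    # availability ~0.9996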
Readings
Chapter 1

34
ISA as Interface
Figure: the ISA is the interface between the programmer's view (a
program as instructions: ADD = 01010, SUBTRACT = 01110, AND = 10011,
OR = 10001, COMPARE = 11010, ...) and the computer's view (CPU,
memory, and I/O).

 Lasts through generations (portability)
 Efficient implementation in hardware
35
Questions at the ISA Level
What types of operations should be supported?
Is it enough to have just load, store, and branch
instructions?

What data types should be supported?
Characters, integers, floating-point?

How many operands should be supported?
An add operation with 4 source operands would make
compiler writers happy, but what about hardware complexity?

36
General Format of Instructions

Opcode (size?) | Data (size?)

The size of the opcode field determines how many operations can be
encoded; the size of the data fields determines how many operands
can be specified

37
Classes of ISA
Based on internal storage:
1) Stack
2) Accumulator
3) Register-register
4) Register-memory

Figure: evaluating A = B + C on a stack machine; B and C are pushed
onto the stack and then replaced by their sum

Comparison: how many instructions does A = B + C take in each class?

38
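Not part of the slides: a minimal Python sketch of a toy stack machine
running the figure's A = B + C sequence; the instruction names (push,
add, pop) are illustrative, and the other three ISA classes would need
their own sketches.

def run(program, memory):
    stack = []
    for op, *args in program:
        if op == "push":            # push a memory operand onto the stack
            stack.append(memory[args[0]])
        elif op == "add":           # pop two operands, push their sum
            stack.append(stack.pop() + stack.pop())
        elif op == "pop":           # store the top of the stack to memory
            memory[args[0]] = stack.pop()
    return memory

mem = {"B": 2, "C": 3}
print(run([("push", "B"), ("push", "C"), ("add",), ("pop", "A")], mem))
# {'B': 2, 'C': 3, 'A': 5} -- four instructions for A = B + C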
