Remainder of CIS501: Parallelism
CIS 501
Computer Architecture
Unit 7: Superscalar
[Figure: system layer diagram: applications (App), system software, and hardware (Mem, CPU, I/O)]
Readings
• H+P
  • TBD
• Paper
  • Edmondson et al., "Superscalar Instruction Execution in the 21164 Alpha Microprocessor"

• Multiple-issue designs
  • Superscalar
  • VLIW and EPIC (Itanium)
Multiple-Issue Pipeline
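[Figure: scalar and 2-way superscalar pipeline datapaths (BP, I$, regfile, D$)]

Scalar pipeline   1  2  3  4  5  6  7  8  9  10 11 12
lw 0(r1)→r2       F  D  X  M  W
lw 4(r1)→r3          F  D  X  M  W
lw 8(r1)→r4             F  D  X  M  W
add r14,r15→r6             F  D  X  M  W
add r12,r13→r7                F  D  X  M  W
add r17,r16→r8                   F  D  X  M  W
lw 0(r18)→r9                        F  D  X  M  W

2-way superscalar 1  2  3  4  5  6  7  8  9  10 11 12
lw 0(r1)→r2       F  D  X  M  W
lw 4(r1)→r3       F  D  X  M  W
lw 8(r1)→r4          F  D  X  M  W
add r14,r15→r6       F  D  X  M  W
add r12,r13→r7          F  D  X  M  W
add r17,r16→r8          F  D  X  M  W
lw 0(r18)→r9               F  D  X  M  W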
Scalar pipeline   1  2  3  4  5  6  7  8  9  10 11 12
lw 0(r1)→r2       F  D  X  M  W
lw 4(r1)→r3          F  D  X  M  W
lw 8(r1)→r4             F  D  X  M  W
add r4,r5→r6               F  d* D  X  M  W
add r2,r3→r7                     F  D  X  M  W
add r7,r6→r8                        F  D  X  M  W
lw 0(r8)→r9                            F  D  X  M  W

2-way superscalar 1  2  3  4  5  6  7  8  9  10 11 12
lw 0(r1)→r2       F  D  X  M  W
lw 4(r1)→r3       F  D  X  M  W
lw 8(r1)→r4          F  D  X  M  W
add r4,r5→r6         F  d* d* D  X  M  W
add r2,r3→r7            F  d* D  X  M  W
add r7,r6→r8                  F  D  X  M  W
lw 0(r8)→r9                   F  d* D  X  M  W
+ Limited modifications
- Limited speedup
• Scalar pipeline
  • 1 + 0.2*0.75*2 = 1.3 → 1.3/1 = 1.3 → 30% slowdown
• 4-way superscalar
  • 0.25 + 0.2*0.75*2 = 0.55 → 0.55/0.25 = 2.2 → 120% slowdown
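The arithmetic above can be reproduced with a short script. This is a sketch under an assumed reading of the terms: 0.2 as the fraction of instructions that are branches, 0.75 as the fraction of those that are taken, and 2 as the penalty in cycles. Those labels did not survive on the slide, and the function and parameter names are illustrative.

```python
def branch_cpi(width, branch_frac=0.2, taken_frac=0.75, penalty=2):
    """Return (cpi, cpi / base) for an N-wide pipeline whose base CPI is 1/N.

    The parameter meanings are assumptions; only the numbers come from the slide.
    """
    base = 1.0 / width
    cpi = base + branch_frac * taken_frac * penalty
    return cpi, cpi / base


for width in (1, 4):
    cpi, ratio = branch_cpi(width)
    print(f"{width}-wide: CPI = {cpi:.2f}, {ratio:.1f}x base, "
          f"{(ratio - 1) * 100:.0f}% slowdown")
```

The stall term is the same at every width, so it takes up a growing share of the shrinking base CPI.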
CIS 501 (Martin/Roth): Superscalar
[Figure: superscalar pipeline with separate integer and floating-point paths (I-regfile, F-regfile, E*, E+, E/ units, BP, I$, D$)]
• 2 integer + 2 FP
• Similar to Alpha 21164
• Parallel decode
  • Need to check for conflicting instructions
    • Output of I1 is an input to I2
  • Other stalls, too (for example, load-use delay)
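A minimal sketch of the conflict check described above, assuming each instruction is represented as a (destination, sources) tuple; the representation and function names are illustrative, not from the slides. It flags any instruction in a fetch group that reads a register written by an earlier instruction in the same group, and counts the register comparisons needed, which grow quadratically with the group width.

```python
from itertools import combinations


def intra_group_conflicts(group):
    """Return index pairs (i, j), i < j, where insn j reads insn i's output."""
    conflicts = []
    for (i, (dest, _)), (j, (_, srcs)) in combinations(enumerate(group), 2):
        if dest is not None and dest in srcs:
            conflicts.append((i, j))
    return conflicts


def comparisons(width, srcs_per_insn=2):
    """Source-vs-destination comparisons needed to check one group."""
    return srcs_per_insn * width * (width - 1) // 2


# add r2,r3→r7 followed by add r7,r6→r8: the second reads r7.
group = [("r7", ("r2", "r3")), ("r8", ("r7", "r6"))]
print(intra_group_conflicts(group))         # [(0, 1)]
print([comparisons(n) for n in (2, 4, 8)])  # [2, 12, 56]
```

This covers only conflicts inside one group; load-use and other stalls against older instructions need their own checks.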
• Replicate decoders
• Memory unit
  • Single load per cycle (stall at decode) probably okay for dual issue
  • Alternative: add a read port to data cache
    - Larger area, latency, power, cost, complexity
• Fundamental challenge:
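Wide Decode
N² Dependence Cross-Check
• Stall logic for 1-wide pipeline with full bypassing
• Same logic for a 2-wide pipeline:
    X/M1.op==LOAD && (D/X1.rs1==X/M1.rd || D/X1.rs2==X/M1.rd) ||
    X/M1.op==LOAD && (D/X2.rs1==X/M1.rd || D/X2.rs2==X/M1.rd) ||
    X/M2.op==LOAD && (D/X1.rs1==X/M2.rd || D/X1.rs2==X/M2.rd) ||
    X/M2.op==LOAD && (D/X2.rs1==X/M2.rd || D/X2.rs2==X/M2.rd)
Wide Execute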
D$ Bandwidth: Banking
• Have already seen split I$/D$, but that gives you just one D$ port
• How to provide a second (maybe even a third) D$ port?
• Option#1: multi-porting
  + Most general solution, any two accesses per cycle
  - Expensive in terms of latency, area (cost), and power
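The heading names banking, but the slide text for that option did not survive. As a sketch under a common assumption (not confirmed by the source): the D$ is split into single-ported banks selected by low-order block-address bits, and two accesses can proceed in the same cycle only when they map to different banks. The bank count and block size below are made up for illustration.

```python
NUM_BANKS = 2     # assumed number of banks
BLOCK_BYTES = 32  # assumed cache block size


def bank_of(addr):
    """Bank index chosen by the low-order bits of the block address."""
    return (addr // BLOCK_BYTES) % NUM_BANKS


def can_dual_access(addr_a, addr_b):
    """True if the two accesses hit different banks and need no second port."""
    return bank_of(addr_a) != bank_of(addr_b)


print(can_dual_access(0, 32))  # True: adjacent blocks fall in different banks
print(can_dual_access(0, 64))  # False: both map to bank 0
```

Unlike multi-porting, this only serves pairs of accesses that happen to fall in different banks; a conflicting pair has to be serialized.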
Wide Bypass
• N² bypass network
• Implemented as bit-slicing
  • 64 1-bit bypass networks
  • Mitigates routing problem somewhat
Clustering
• Clustering: mitigates N² bypass
• N² bypass is by far the bigger problem (vs. N² stall logic)
  • 32- or 64-bit quantities (vs. 5-bit)
  • Multiple levels (MX, WX) of bypass (vs. 1 level of stall logic)
  • Must fit in one clock period with ALU (vs. not)
Trace Cache
[Figure: fetch datapaths: banked I$ (b0, b1) with BP, and trace cache (T$) with TP]

• I$
    Tag   Data (insns)
    0     addi,beq #4,ld,sub
    4     st,call #32,ld,add

                        1   2
    0: addi r1,4,r1     F   D
    1: beq r1,#4        F   D
    4: st r1,4(sp)      f*  F
    5: call #32         f*  F

• Trace cache
    Tag   Data (insns)
    0:T   addi,beq #4,st,call #32

                        1   2
    0: addi r1,4,r1     F   D
    1: beq r1,#4        F   D
    4: st r1,4(sp)      F   D
    5: call #32         F   D
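The difference between the two fetch schemes in this example can be sketched in a few lines. This is a toy model, not the hardware: it assumes a conventional I$ delivers at most one aligned 4-instruction line per cycle and stops at a taken branch, while a trace cache is keyed by the starting PC plus the branch direction (the `0:T` tag). All names below are illustrative.

```python
LINE = 4  # instructions per I$ line (assumed from the 4-insn rows above)

# Dynamic instruction stream from the example: beq at address 1 is taken to 4.
stream = [(0, "addi"), (1, "beq #4"), (4, "st"), (5, "call #32")]

# Trace cache contents: (start PC, branch directions) -> insns in dynamic order.
trace_cache = {(0, "T"): ["addi", "beq #4", "st", "call #32"]}


def icache_fetch_groups(stream):
    """Split the stream into per-cycle groups for a conventional I$.

    A group ends when the next insn is not sequential or lies in another line.
    """
    groups, i = [], 0
    while i < len(stream):
        line = stream[i][0] // LINE
        group = [stream[i]]
        i += 1
        while (i < len(stream)
               and stream[i][0] // LINE == line
               and stream[i][0] == group[-1][0] + 1):
            group.append(stream[i])
            i += 1
        groups.append([name for _, name in group])
    return groups


print(icache_fetch_groups(stream))  # two groups, so two fetch cycles
print(trace_cache.get((0, "T")))    # all four insns from a single lookup
```

The two I$ groups correspond to the `f*` stalls in the first diagram; the single trace cache hit corresponds to all four instructions fetching in cycle 1.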
Multiple-Issue Implementations
• Such as x86
• Or CISCy ugly instructions in some RISC ISAs
• Dynamically-scheduled superscalar
  • Pentium Pro/II/III (3-wide), Alpha 21264 (4-wide)
VLIW
History of VLIW
• Commercial attempts
EPIC
• Tainted VLIW
        486    Pentium  PentiumII  Pentium4  Itanium  ItaniumII  Core2
Year    1989   1993     1998       2001      2002     2004       2006
Width
• Multiple issue
• Problem spots
  • Fetch + branch prediction → trace cache?
  • N² bypass → clustering?
  • Register file → replication?
• Implementations
  • (Statically-scheduled) superscalar, VLIW/EPIC
Grid Processor
• Components
• Pipeline stages
  • Fetch block to grid
  • Read registers
  • Execute/memory
    • Cascade
  • Write registers
• Block atomic
  • No intermediate regs
  • Grid limits size/shape
[Figure: grid processor: NH, I$, regfile with four read ports, 16 ALUs, D$]

Aside: SAXPY
for (i=0;i<N;i++)
   Z[i]=A*X[i]+Y[i];

0: ldf X(r1),f1      // loop
1: mulf f0,f1,f2     // A in f0
2: ldf Y(r1),f3      // X,Y,Z are constant addresses
3: addf f2,f3,f4
4: stf f4,Z(r1)
5: addi r1,4,r1      // i in r1
6: blt r1,r2,0       // N*4 in r2
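To check that the seven-instruction loop computes the same thing as the C loop, here is a small Python walk-through of it. It is a sketch that models registers as plain variables and the arrays as lists of 4-byte elements; the function name and the returned instruction count are illustrative. Like the assembly, it tests the loop condition at the bottom, so it assumes N ≥ 1.

```python
def saxpy_asm(A, X, Y):
    """Step through the loop body one instruction at a time."""
    n = len(X)
    Z = [0.0] * n
    f = {0: A}         # f0 holds A
    r1, r2 = 0, 4 * n  # r1 = byte offset of element i, r2 = N*4
    executed = 0
    while True:
        i = r1 // 4
        f[1] = X[i]         # 0: ldf X(r1),f1
        f[2] = f[0] * f[1]  # 1: mulf f0,f1,f2
        f[3] = Y[i]         # 2: ldf Y(r1),f3
        f[4] = f[2] + f[3]  # 3: addf f2,f3,f4
        Z[i] = f[4]         # 4: stf f4,Z(r1)
        r1 += 4             # 5: addi r1,4,r1
        executed += 7
        if not r1 < r2:     # 6: blt r1,r2,0
            break
    return Z, executed


Z, executed = saxpy_asm(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(Z, executed)  # [12.0, 24.0, 36.0] 21
```

Each iteration executes all seven instructions, which is the instruction count the later utilization figure is based on.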
[Figures: the SAXPY block mapped onto the grid, repeated for successive steps beginning with "Read registers". Each shows NH, I$, the regfile supplying r2, f1, and r1, grid cells holding pass, ldf, mulf, addi, addf, stf, and ble, and the D$. Cell encodings in the first figure include read f1,0; pass 0,1; ldf X,-1; ldf Y,0; addf 0; stf Z.]
• Performance
  • 1 cycle fetch
  • 1 cycle read regs
  • 8 cycles execute
  • 1 cycle write regs
  • 11 cycles total
• Utilization
  • 7 / (11 * 16) = 4%
+ Simpler components
+ Faster clock?
- Code size
- Non-compatibility
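The performance and utilization figures above follow from simple arithmetic, reproduced here as a sketch. It reads the 7 as the number of instructions in the SAXPY loop body and the 16 as the number of ALUs in the grid; both readings are assumptions consistent with the rest of the deck, and the variable names are illustrative.

```python
stage_cycles = {"fetch": 1, "read regs": 1, "execute": 8, "write regs": 1}

total_cycles = sum(stage_cycles.values())  # cycles for one block
insns = 7   # instructions in the SAXPY loop body (assumed meaning of the 7)
alus = 16   # ALUs in the grid (assumed meaning of the 16)

# Fraction of ALU-cycle slots that do useful work for one block.
utilization = insns / (total_cycles * alus)

print(total_cycles, f"{utilization:.1%}")
```

Only 7 of the 176 available ALU-cycle slots are used, which is where the roughly 4% comes from.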
Next Up
• Extracting more ILP via
  • Static scheduling by compiler
  • Dynamic scheduling in hardware