4 Parallel Architectures
Computer Architecture
Dr. Haroon Mahmood
Does doubling the clock rate
double the performance?
Amdahl’s law
The performance improvement to be gained from using
some faster mode of execution is limited by the fraction of
the time the faster mode can be used.
Example
Trip from point A to point B in two parts: a fixed A-C leg taking 20 time units, and a C-B leg whose time improves from a baseline of 50 down to 0.3.

A --(20)--> C --(50 / 20 / 4 / 1.7 / 0.3)--> B

A-C Trip | C-B Trip | Total Time | C-B Speedup | Overall Speedup
   20    |   50     |    70      |    1        |    1
   20    |   20     |    40      |    2.5      |    1.75
   20    |    4     |    24      |   12.5      |    2.9
   20    |    1.7   |    21.7    |   29.4      |    3.2
   20    |    0.3   |    20.3    |  166.66     |    3.4
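The table can be reproduced with a short Python sketch (mine, not part of the lecture). However fast the C-B leg becomes, the overall speedup is capped by the fixed A-C leg at 70/20 = 3.5.

```python
# Reproduce the trip table: the A-C leg is fixed at 20 time units
# while the C-B leg improves from its baseline of 50.
fixed, baseline = 20, 50
for cb in [50, 20, 4, 1.7, 0.3]:
    total = fixed + cb
    cb_speedup = baseline / cb              # speedup of the C-B leg alone
    overall = (fixed + baseline) / total    # speedup of the whole trip
    print(f"C-B={cb:5}  total={total:5.1f}  "
          f"C-B speedup={cb_speedup:7.2f}  overall={overall:.2f}")
```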
Example
Answer
For a program that runs in 100 seconds, of which 80 seconds are spent in the part being enhanced by a factor of n:
Execution time after improvement = (100 - 80 seconds) + (80 seconds / n)
However large n grows, execution time cannot fall below 20 seconds, so the overall speedup is bounded by 100/20 = 5.
Amdahl’s law….
Speedup depends on two factors:
1. The fraction of the computation time in the original machine that can be converted to take advantage of the enhancement:
   Fraction enhanced ≤ 1
2. The improvement gained by the enhanced execution mode, i.e., how much faster the task would run if the enhanced mode were used throughout:
   Speedup enhanced ≥ 1
Amdahl’s law

Speedup = 1 / [ (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]
• Suppose that we want to enhance the processor used for Web serving. The new processor is 10 times faster on computation in the Web serving application than the original processor. Assuming that the original processor is busy with computation 40% of the time and is waiting for I/O 60% of the time, what is the overall speedup?

Speedup = 1 / (0.6 + (0.4 / 10)) = 1.5625, i.e., about a 56% speedup.
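The formula can be wrapped in a small helper (a sketch; the function name is mine) to check the Web-serving numbers:

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup per Amdahl's law."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# Web serving: 40% of the time is computation, made 10x faster.
print(amdahl_speedup(0.4, 10))   # 1.5625 -> about a 56% speedup
```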
Parallel Architectures
Data-Level Parallelism
Task-Level Parallelism
The method for exploiting parallelism
The key to higher performance in microprocessors is the ability to achieve a higher degree of parallelism (fine-grained, instruction-level parallelism).
Superscalar Processors
Superscalar Execution
How it can help
Issues:
Maintaining Sequential Semantics
Scheduling
Superscalar vs. Pipelining
Sequential Semantics - Review
Instructions appear as if they executed:
In the order they appear in the program
One after the other
Superscalar vs. Pipelining

loop: ld   r2, 10(r1)
      add  r3, r3, r2    ; sum += a[i--]
      sub  r1, r1, 1
      bne  r1, r0, loop

Pipelining (one instruction enters the pipeline per cycle):        time -->
  fetch decode ld
        fetch  decode add
               fetch  decode sub
                      fetch  decode bne

Superscalar (two instructions enter per cycle):
  fetch decode ld
  fetch decode add
        fetch  decode sub
        fetch  decode bne
Slide credit: A. Moshovos, ECE Toronto
Superscalar Performance
Performance Spectrum?
What if all instructions were dependent?
Speedup = 1 (no gain): superscalar buys us nothing
Superscalar vs. Superpipelining
Superpipelining:
Vaguely defined as deep pipelining, i.e., lots of stages
Superscalar issue complexity: limits super-pipelining
How do they compare?
2-way Superscalar vs. Twice the stages
Not much difference.
N-way Superscalar
Can issue up to N instructions per cycle
2-way, 3-way, …
Superscalar Execution
Parallel processing
Processing instructions in parallel requires three major tasks:
VLIW
[Figure: a VLIW datapath with multiple function units (FU) connected to a shared register file]
VLIW vs Superscalar
Both VLIW and superscalar processors employ pipelining and replication of function units to achieve higher performance.
Static Scheduling
VLIW vs Superscalar
Superscalar architectures, in contrast, use dynamic scheduling, which moves all of the ILP complexity into hardware.
Tradeoffs
The VLIW compiler is also machine-specific: it is an integral part of the VLIW system.
VLIW principles
1. The compiler analyzes the dependences among all instructions in the sequential code and tries to extract as much parallelism as possible.
2. It then packs independent primitive instructions together into long instruction words (VLIWs).
3. Finally, the only work left to the VLIW hardware is to fetch the VLIWs from the cache, decode them, and dispatch the independent primitive instructions to the corresponding function units for execution.
Generating VLIW instruction words
1. One VLIW instruction word contains at most 8 primitive instructions.
Conclusion
1. The highly parallel implementation is much simpler
and cheaper than its counterparts.
SuperScalar: In-order issue with in-order completion
Instruction issuing is stalled by resource conflicts and by procedural or data dependencies.
1. Up to two instructions may be fetched, issued, and written back at a time.
2. Fetching of the next two instructions waits until the decode buffer is cleared.
3. Functional units: * (2 clocks), / (2 clocks), +/- (1 clock).
4. A data dependency stalls instruction issuing until execution of the earlier instruction is completed.
5. On a RAW hazard, the later instruction may be issued only after the earlier instruction has written its result.
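The issue rules can be sketched as a toy scheduler (a simplification I wrote, not the lecture's exact machine model): two-wide, in-order issue, where a RAW hazard delays issue until the producer has written its result.

```python
# A toy two-wide, in-order issue model.  Each instruction is
# (dest_register, (source_registers), latency_in_clocks).
def inorder_issue(instrs, width=2):
    ready = {}                    # register -> cycle its value is usable
    cycle, slots = 1, 0           # current cycle, issue slots used this cycle
    issue_cycles = []
    for dest, srcs, lat in instrs:
        # RAW: cannot issue before every source has been written.
        earliest = max([cycle] + [ready.get(s, 1) for s in srcs])
        if earliest > cycle or slots == width:
            cycle, slots = max(earliest, cycle + 1), 0   # in-order stall
        issue_cycles.append(cycle)
        ready[dest] = cycle + lat  # result written 'lat' cycles after issue
        slots += 1
    return issue_cycles

# R3=R0*R1 (2 clocks), R4=R0+R2 (1 clock), R6=R1+R4 (RAW on R4)
print(inorder_issue([("R3", ("R0", "R1"), 2),
                     ("R4", ("R0", "R2"), 1),
                     ("R6", ("R1", "R4"), 1)]))   # [1, 1, 2]
```

The first two instructions dual-issue in cycle 1; the third must wait one cycle for R4.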
Superscalar execution
3 functional units: * (2 clocks), /(2 clocks), (+,-) 1 clock.
1. R3=R0*R1
2. R4=R0+R2
3. R5=R0/R1
4. R6=R1+R4
5. R7=R1*R2
6. R1=R0-R2
7. R3=R3*R1
8. R1=R4+R4
Superscalar execution (in-order)
3 functional units: * (2 clocks), /(2 clocks), (+,-) 1 clock.
1. R3=R0*R1
2. R4=R0+R2
3. R5=R0/R1
4. R6=R1+R4
5. R7=R1*R2
6. R1=R0-R2
7. R3=R3*R1
8. R1=R4+R4
Superscalar execution (with register renaming)
3 functional units: * (2 clocks), /(2 clocks), (+,-) 1 clock.
1. R3=R0*R1
2. R4=R0+R2
3. R5=R0/R1
4. R6=R1+R4
5. R7=R1*R2
6. R8=R0-R2
7. R3=R3*R8
8. R9=R4+R4
Superscalar execution (out of order)
3 functional units: * (2 clocks), /(2 clocks), (+,-) 1 clock.
1. R3=R0*R1
2. R4=R0+R2
3. R5=R0/R1
4. R6=R1+R4
5. R7=R1*R2
6. R8=R0-R2
7. R3=R3*R8
8. R9=R4+R4
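The out-of-order example above uses fresh registers (R8, R9) in place of the second writes to R1. That renaming can be done mechanically; here is a minimal register-renaming sketch I wrote to illustrate the idea (the P* physical register names are hypothetical):

```python
# Minimal register renaming: every new destination gets a fresh physical
# register, which eliminates WAR and WAW hazards.
def rename(instrs, n_arch=8):
    # Initial map: architectural R0..R7 -> physical P0..P7.
    mapping = {f"R{i}": f"P{i}" for i in range(n_arch)}
    next_free = n_arch
    out = []
    for dest, srcs in instrs:
        srcs = tuple(mapping[s] for s in srcs)  # read the latest mapping
        mapping[dest] = f"P{next_free}"         # fresh register for the dest
        next_free += 1
        out.append((mapping[dest], srcs))
    return out

# 6. R1=R0-R2   7. R3=R3*R1   8. R1=R4+R4   (WAW/WAR on R1 vanish)
print(rename([("R1", ("R0", "R2")),
              ("R3", ("R3", "R1")),
              ("R1", ("R4", "R4"))]))
```

After renaming, instructions 6 and 8 write different physical registers, so they can execute out of order.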
Loop Example
Source code:
for (i=1000; i>0; i--)
    x[i] = x[i] + s;
Loop Example
Latencies: LD -> any: 1 stall; FPALU -> any: 3 stalls; FPALU -> ST: 2 stalls; IntALU -> BR: 1 stall

Source code:
for (i=1000; i>0; i--)
    x[i] = x[i] + s;

Assembly code:
Loop: L.D    F0, 0(R1)    ; F0 = array element
      ADD.D  F4, F0, F2   ; add scalar
      S.D    F4, 0(R1)    ; store result
      DADDUI R1, R1, #-8  ; decrement address pointer
      BNE    R1, R2, Loop ; branch if R1 != R2
      NOP

10-cycle schedule:
Loop: L.D    F0, 0(R1)    ; F0 = array element
      stall
      ADD.D  F4, F0, F2   ; add scalar
      stall
      stall
      S.D    F4, 0(R1)    ; store result
      DADDUI R1, R1, #-8  ; decrement address pointer
      stall
      BNE    R1, R2, Loop ; branch if R1 != R2
      stall
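The 10-cycle count for the stalled schedule above can be checked arithmetically (a quick sketch of mine): one issue cycle per instruction plus the stalls imposed by the latency table.

```python
# One issue cycle per instruction plus its trailing stalls; the stall
# after BNE is the branch delay, covered by the NOP.
iteration = [("L.D",    1),   # L.D -> ADD.D    : 1 stall
             ("ADD.D",  2),   # FPALU -> S.D    : 2 stalls
             ("S.D",    0),
             ("DADDUI", 1),   # IntALU -> BNE   : 1 stall
             ("BNE",    1)]   # branch delay slot
cycles = sum(1 + stalls for _, stalls in iteration)
print(cycles)   # 10 cycles per loop iteration
```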
Smart Schedule
Latencies: LD -> any: 1 stall; FPALU -> any: 3 stalls; FPALU -> ST: 2 stalls; IntALU -> BR: 1 stall

Original (10 cycles):
Loop: L.D    F0, 0(R1)
      stall
      ADD.D  F4, F0, F2
      stall
      stall
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      stall
      BNE    R1, R2, Loop
      stall

Rescheduled (6 cycles):
Loop: L.D    F0, 0(R1)
      DADDUI R1, R1, #-8
      ADD.D  F4, F0, F2
      stall
      BNE    R1, R2, Loop
      S.D    F4, 8(R1)    ; fills the delay slot; offset adjusted for the early DADDUI
Problem 1
Latencies: LD -> any: 1 stall; FPMUL -> any: 5 stalls; FPMUL -> ST: 4 stalls; IntALU -> BR: 1 stall

Source code:
for (i=1000; i>0; i--)
    x[i] = y[i] * s;
Loop Unrolling
Latencies: LD -> any: 1 stall; FPALU -> any: 3 stalls; FPALU -> ST: 2 stalls; IntALU -> BR: 1 stall

Original loop:
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop

Unrolled four times:
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F6, -8(R1)
      ADD.D  F8, F6, F2
      S.D    F8, -8(R1)
      L.D    F10, -16(R1)
      ADD.D  F12, F10, F2
      S.D    F12, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F16, F14, F2
      S.D    F16, -24(R1)
      DADDUI R1, R1, #-32
      BNE    R1, R2, Loop
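The unrolled code computes exactly what the original loop does. A quick Python sketch (mine, not from the slides) checks the equivalence, including leftover iterations for trip counts that are not multiples of four (the slides' count of 1000 is, so no cleanup loop is needed there):

```python
# Original loop: add the scalar s to every element, iterating downward.
def add_scalar(x, s):
    for i in range(len(x) - 1, -1, -1):    # for (i=1000; i>0; i--)
        x[i] = x[i] + s
    return x

# Same computation unrolled four times, with a cleanup loop for leftovers.
def add_scalar_unrolled(x, s):
    i = len(x) - 1
    while i >= 3:                          # four elements per trip
        x[i] += s; x[i - 1] += s; x[i - 2] += s; x[i - 3] += s
        i -= 4
    while i >= 0:                          # leftovers (none if len % 4 == 0)
        x[i] += s
        i -= 1
    return x

print(add_scalar([1, 2, 3, 4], 10) == add_scalar_unrolled([1, 2, 3, 4], 10))
```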
Scheduled and Unrolled Loop
Loop Unrolling
Increases program size
Automating Loop Unrolling
Determine the dependences across iterations: in the
example, we knew that loads and stores in different
iterations did not conflict and could be re-ordered
Problem 2
Latencies: LD -> any: 1 stall; FPMUL -> any: 5 stalls; FPMUL -> ST: 4 stalls; IntALU -> BR: 1 stall

Source code:
for (i=1000; i>0; i--)
    x[i] = y[i] * s;