Lecture On Global Informatics and Electronics Ⅱ
Jubee Tada
Graduate School of Science and Engineering,
Yamagata University
Tel:0238-26-3576
E-mail:jubee@yz.yamagata-u.ac.jp
Methods for performance improvement
• Instruction pipelining
– Pipeline hazards
– Branch prediction
• Data-level parallelism
– SIMD
• Instruction-level parallelism
– VLIW
– Superscalar
• Thread-level parallelism
– Simultaneous Multi-Threading
– Multicore processors
Instruction pipelining
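[Figure: four tasks, each consisting of the steps A→B→C→D→E, executed one after another (top) and overlapped in a pipeline (bottom). The pipelined execution finishes earlier by the amount of time marked "fast processing is possible".]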
[Figure: program execution order versus time (200 to 1800 ps) for three instructions: lw $s1,100($0), lw $s2,200($0), lw $s3,300($0).
Without pipelining: each instruction passes through IF, ID, EX, MEM, WB, and a new instruction starts every 800 ps.
With pipelining: each instruction passes through IF, ID, EX, MEM, WB, and a new instruction starts every 200 ps.]
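The timing in the diagram can be checked with a short calculation. The following is a minimal Python sketch, assuming (as in the figure) 800 ps per instruction without pipelining and five stages of 200 ps each with pipelining. The function names are illustrative and not part of the lecture.

```python
def non_pipelined_time_ps(n_instructions, instruction_time_ps=800):
    # Each instruction must finish before the next one starts.
    return n_instructions * instruction_time_ps


def pipelined_time_ps(n_instructions, n_stages=5, stage_time_ps=200):
    # The first instruction needs n_stages cycles; every later
    # instruction completes one cycle after its predecessor.
    return (n_stages + n_instructions - 1) * stage_time_ps


for n in (3, 1_000_000):
    slow = non_pipelined_time_ps(n)
    fast = pipelined_time_ps(n)
    print(n, slow, fast, round(slow / fast, 2))
```

For the three lw instructions this gives 2400 ps versus 1400 ps. For long instruction sequences the ratio approaches 800/200 = 4.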
Structural hazard
• Hazards due to hardware limitations
– Example: there is only one memory, shared by instruction fetch and data access
• Avoidable with a Harvard architecture (separate instruction and data memories)
[Figure: pipeline diagram of four consecutive instructions, each passing through IF, ID, EX, MEM, WB; time axis from 200 to 1800 ps.]
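To see where a single memory causes a conflict, the sketch below tracks which stage each instruction occupies in each cycle and reports the cycles in which one instruction is in IF while another is in MEM. It is a simplified Python model, assuming that every instruction accesses memory in its MEM stage (for example, a sequence of lw instructions) and that one instruction is issued per cycle. All names are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_in_cycle(instr_index, cycle):
    # Instruction i enters IF in cycle i (0-based) and then
    # advances by one stage per cycle.
    pos = cycle - instr_index
    return STAGES[pos] if 0 <= pos < len(STAGES) else None


def memory_conflicts(n_instructions):
    # Cycles in which the single memory is needed by IF and MEM at once.
    conflicts = []
    total_cycles = n_instructions + len(STAGES) - 1
    for cycle in range(total_cycles):
        stages = [stage_in_cycle(i, cycle) for i in range(n_instructions)]
        if "IF" in stages and "MEM" in stages:
            conflicts.append(cycle)
    return conflicts


print(memory_conflicts(4))
```

With four instructions, the only conflict is in cycle 3 (0-based): the fourth instruction is fetched while the first one accesses data memory.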
Data hazard
[Figure: pipeline diagram of a data hazard on $s0. The instruction sub $t2,$s0,$t3 is stalled for two cycles before it passes through IF, ID, EX, MEM, WB; time axis from 200 to 1800 ps.]
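The number of stall cycles depends on how far the consuming instruction is behind the producing one. The following is a minimal Python sketch, assuming a five-stage pipeline without forwarding in which a register written in WB can be read in ID during the same cycle. The producer instruction add $s0,$t0,$t1 is a hypothetical example, since only the consumer sub survives in the figure, and the function names are illustrative.

```python
import re


def parse(line):
    # Returns (destination, sources) for a three-operand arithmetic
    # instruction such as "sub $t2,$s0,$t3".
    regs = re.findall(r"\$\w+", line)
    return regs[0], regs[1:]


def stalls_without_forwarding(program):
    # ASSUMPTION: a result is written in WB and can be read in ID in the
    # same cycle, so a consumer must enter the pipeline at least
    # 3 slots after its producer.
    if not program:
        return 0
    issue_slot = []  # issue_slot[i] = cycle in which instruction i enters IF
    for i, line in enumerate(program):
        _, srcs = parse(line)
        slot = issue_slot[-1] + 1 if issue_slot else 0
        for j in range(i):
            dest, _ = parse(program[j])
            if dest in srcs:
                slot = max(slot, issue_slot[j] + 3)
        issue_slot.append(slot)
    # Without hazards, instruction i would enter in cycle i.
    return issue_slot[-1] - (len(program) - 1)


print(stalls_without_forwarding(["add $s0,$t0,$t1", "sub $t2,$s0,$t3"]))
```

Under these assumptions the back-to-back pair needs two stall cycles, which matches the two stalls shown in the figure. Placing one independent instruction between the two reduces this to one.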
A problem of forwarding
[Figure: lw $s0,20($t1) passing through IF, ID, EX, MEM, WB.]
Example: reordering the code for
A=B+E;
C=B+F;
Original order:
lw $t1,0($t0)
lw $t2,4($t0)
add $t3,$t1,$t2
sw $t3,12($t0)
lw $t4,8($t0)
add $t5,$t1,$t4
sw $t5,16($t0)
Reordered:
lw $t1,0($t0)
lw $t2,4($t0)
lw $t4,8($t0)
add $t3,$t1,$t2
sw $t3,12($t0)
add $t5,$t1,$t4
sw $t5,16($t0)
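The benefit of the reordering can be checked by counting load-use stalls in both instruction sequences. The following is a minimal Python sketch, assuming a pipeline with forwarding in which an instruction that immediately follows a lw and reads the loaded register costs one stall cycle. The helper names are illustrative.

```python
import re


def parse(line):
    # Returns (opcode, destination register or None, source registers).
    op = line.split()[0]
    regs = re.findall(r"\$\w+", line)
    if op == "sw":
        return op, None, regs  # sw only reads registers
    return op, regs[0], regs[1:]


def load_use_stalls(program):
    # ASSUMPTION: with forwarding, only a use directly after a lw stalls.
    stalls = 0
    prev_op, prev_dest = None, None
    for line in program:
        op, dest, srcs = parse(line)
        if prev_op == "lw" and prev_dest in srcs:
            stalls += 1
        prev_op, prev_dest = op, dest
    return stalls


ORIGINAL = [
    "lw $t1,0($t0)", "lw $t2,4($t0)", "add $t3,$t1,$t2", "sw $t3,12($t0)",
    "lw $t4,8($t0)", "add $t5,$t1,$t4", "sw $t5,16($t0)",
]
REORDERED = [
    "lw $t1,0($t0)", "lw $t2,4($t0)", "lw $t4,8($t0)", "add $t3,$t1,$t2",
    "sw $t3,12($t0)", "add $t5,$t1,$t4", "sw $t5,16($t0)",
]

print(load_use_stalls(ORIGINAL), load_use_stalls(REORDERED))
```

Under this model the original order has two load-use stalls (each add directly follows the lw it depends on), and the reordered code has none.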
Control hazard
An example of branch prediction (1/2)
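abs:  addi $sp,$sp,-4      #allocate one word on the stack
      sw $s0,0($sp)        #save $s0
      slt $t0,$a0,$a1      #$t0=1 if $a0<$a1
      beq $t0,$zero,Else   #goto Else if $a0>=$a1
      sub $s0,$a1,$a0      #$s0=$a1-$a0
      j Exit
Else: sub $s0,$a0,$a1      #$s0=$a0-$a1
Exit: add $v0,$s0,$zero    #return value $v0=$s0
      lw $s0,0($sp)        #restore $s0
      addi $sp,$sp,4       #release the stack word
      jr $ra               #return to the caller
[Figure: pipeline timing diagram; time axis from 200 to 1800 ps.]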
An example of branch prediction (2/2)
[Figure: pipeline timing diagram; time axis from 200 to 1800 ps.]
Branch misprediction
• If branch prediction fails:
– Whether the prediction has failed is known only after the
branch instruction has been executed.
– By then, several subsequent instructions have already
started executing.
→ Their results must be discarded.
• Branch mispredictions have a significant impact
on processor performance.
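The size of this impact can be estimated with a simple model. The following is a minimal Python sketch, assuming a base CPI of 1 and a fixed number of discarded cycles per misprediction. The branch frequency, misprediction rates, and penalty are illustrative values, not figures from the lecture.

```python
def effective_cpi(branch_fraction, mispredict_rate, penalty_cycles, base_cpi=1.0):
    # Every mispredicted branch adds penalty_cycles of discarded work.
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# ASSUMED values: 20% of instructions are branches, 15-cycle penalty.
for rate in (0.01, 0.05, 0.20):
    print(rate, effective_cpi(0.20, rate, 15))
```

In this model the CPI grows linearly with the misprediction rate, so a deeper pipeline (larger penalty) makes accurate prediction more important.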
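[Figure: state-transition diagram of a branch predictor with four states, two labeled "Will be taken" and two labeled "Will be untaken". The transitions between states are labeled "taken" and "untaken" according to the actual branch outcome.]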
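A four-state predictor of this kind can be sketched in a few lines of Python. The exact transitions of the diagram are not recoverable from the text, so the sketch below assumes the common 2-bit saturating-counter variant: states 0 and 1 predict untaken, states 2 and 3 predict taken, and each actual outcome moves the state one step. The class and function names are illustrative.

```python
class TwoBitPredictor:
    def __init__(self, state=3):
        # ASSUMPTION: 0, 1 = "will be untaken"; 2, 3 = "will be taken".
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Saturating counter: move one step toward the actual outcome.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, predictor):
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A loop branch: taken 9 times, then untaken once; the loop runs twice.
outcomes = ([True] * 9 + [False]) * 2
print(count_mispredictions(outcomes, TwoBitPredictor()))
```

With this pattern the predictor misses only at each loop exit, because a single untaken outcome does not flip the prediction for the next run of the loop.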
Various branch prediction methods
David A. Patterson, John L. Hennessy, "Computer Organization and Design, Fifth Edition: The Hardware/Software Interface"
Advances in semiconductor manufacturing technology
• Moore's law (density doubles every 18-24 months)
– Density: the number of elements such as transistors per unit area
– Density increases because elements can be made smaller.
• The same circuit can be made smaller.
→ Transistors become faster, so the clock frequency can be raised.
• More transistors can be mounted on one chip.
→ The additional transistors can be used to improve performance.
• Pollack's law
– Processor performance is proportional to the square root of
complexity.
→ If the density is doubled, the performance improves only by a
factor of approximately 1.4 (√2 ≈ 1.41).
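Pollack's law can be written as a one-line calculation. The following is a minimal Python sketch, assuming that doubling the density doubles the complexity available to a single processor. The function name is illustrative.

```python
import math


def pollack_performance(complexity_ratio):
    # Performance grows with the square root of complexity.
    return math.sqrt(complexity_ratio)


for ratio in (2, 4, 8):
    print(ratio, round(pollack_performance(ratio), 2))
```

Quadrupling the complexity therefore only doubles the performance of a single processor, which is one motivation for spending additional transistors on parallelism instead.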
For improving performance
Increasing clock frequency
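l.d $f0,a($sp)        #load scalar a
lv $v1,0($s0)         #load vector x
mulvs.d $v2,$v1,$f0   #vector-scalar multiply
lv $v3,0($s1)         #load vector y
addv.d $v4,$v2,$v3    #add y to the product
sv $v4,0($s1)         #store the result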
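The vector code above computes Y = a × X + Y (the DAXPY kernel). The following is a minimal Python sketch of the same computation, first as a scalar loop and then in a form that mirrors the vector instructions one for one. The vector length of 64 and the sample data are assumptions, and plain Python lists stand in for vector registers.

```python
def daxpy_scalar(a, x, y):
    # One multiply-add per loop iteration, as on a scalar processor.
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y


def daxpy_vector_style(a, x, y):
    v1 = list(x)                           # lv      $v1,0($s0)
    v2 = [a * e for e in v1]               # mulvs.d $v2,$v1,$f0
    v3 = list(y)                           # lv      $v3,0($s1)
    v4 = [p + q for p, q in zip(v2, v3)]   # addv.d  $v4,$v2,$v3
    y[:] = v4                              # sv      $v4,0($s1)
    return y


n = 64  # ASSUMED vector length
x = [float(i) for i in range(n)]
print(daxpy_scalar(2.0, x, [1.0] * n) == daxpy_vector_style(2.0, x, [1.0] * n))
```

The scalar version needs a loop with one index update and one branch per element, whereas the vector version expresses the whole computation in six instructions.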
Increasing IPC
VLIW
[Figure: two-issue VLIW pipeline. Each issue packet pairs one arithmetic/branch instruction with one load/store instruction, and both pass through IF, ID, EX, MEM, WB. Four consecutive pairs are shown.]
Example: Scheduling in VLIW (1/2)
Example: Scheduling in VLIW (2/2)
An example of loop unrolling: unrolling 4 iterations
Loop: lw $t0,0($s1)         #load A[i] into $t0
      addi $s1,$s1,-16      #decrease i by 4 elements (16 bytes)
      lw $t1,12($s1)        #load A[i-1] into $t1
      lw $t2,8($s1)         #load A[i-2] into $t2
      lw $t3,4($s1)         #load A[i-3] into $t3
      add $t0,$t0,$s2       #$t0=A[i]+$s2
      add $t1,$t1,$s2       #$t1=A[i-1]+$s2
      add $t2,$t2,$s2       #$t2=A[i-2]+$s2
      add $t3,$t3,$s2       #$t3=A[i-3]+$s2
      sw $t0,16($s1)        #A[i]=$t0
      sw $t1,12($s1)        #A[i-1]=$t1
      sw $t2,8($s1)         #A[i-2]=$t2
      sw $t3,4($s1)         #A[i-3]=$t3
      bne $s1,$zero,Loop    #goto Loop if $s1≠0
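The same transformation can be written in a high-level language. The following is a minimal Python sketch, assuming that the array length is a multiple of 4 (as the assembly code implicitly does). The function names are illustrative. Python itself gains little from unrolling; the sketch only shows the structure of the transformation.

```python
def add_scalar_rolled(a, s):
    # One element, one index update, and one loop test per iteration.
    i = len(a) - 1
    while i >= 0:
        a[i] += s
        i -= 1
    return a


def add_scalar_unrolled4(a, s):
    # ASSUMPTION: len(a) is a multiple of 4, so no cleanup loop is needed.
    assert len(a) % 4 == 0
    i = len(a) - 1
    while i >= 0:
        a[i] += s
        a[i - 1] += s
        a[i - 2] += s
        a[i - 3] += s
        i -= 4  # one index update and one loop test per 4 elements
    return a


print(add_scalar_unrolled4(list(range(8)), 10) == add_scalar_rolled(list(range(8)), 10))
```

Unrolling reduces the loop overhead and, more importantly for VLIW, exposes four independent load/add/store chains that a scheduler can interleave.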
An example of loop unrolling: scheduling
[Figure: out-of-order execution with multiple functional units: integer ALU, integer ALU, …, floating-point unit, load/store unit.]
The pipeline configuration of the Core i7 processor
SMT (Simultaneous Multi-Threading)
Multi-Core processors
[Figure: a processor built from a reservation unit (reservation stations), functional units, and a commit unit, and a multi-core configuration of four processors.]
Exploiting instruction/thread-level parallelism
[Figure: use of issue slots over time; axes: issue slots and time.]
Improving the performance of a processor
• Exploiting parallelism
– Data-level: SIMD instructions
– Instruction-level: superscalar, VLIW
– Thread-level: SMT, multi-core
• However, if a program contains no parallelism,
performance will not improve.
→ Programming techniques that increase
parallelism are required.
Limits of performance improvement through parallelism
• Amdahl's law
– The limit of performance improvement obtained by
parallelization is determined by the proportion of parts that
cannot be executed in parallel.
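– Example: if 90 out of 100 units of work can be executed in parallel and the remaining 10 cannot be parallelized, the speed cannot be increased by 10 times or more.
[Figure: bars showing the part that can be executed in parallel (90, 45, 9) and the part that cannot be parallelized (10 in every case).]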
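The limit can be computed directly from Amdahl's law. The following is a minimal Python sketch, assuming that 90% of the execution time can be parallelized perfectly across n processors and 10% cannot. The function name and the processor counts are illustrative.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    # The serial part is unchanged; the parallel part is divided by n.
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


for n in (1, 2, 10, 100, 10_000):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

With 2 processors the execution time falls from 100 to 45 + 10 = 55, and with 10 processors to 9 + 10 = 19. However many processors are added, the speedup approaches but never reaches 1 / 0.1 = 10.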