
Lecture on Global Informatics and Electronics Ⅱ

Jubee Tada
Graduate School of Science and Engineering, Yamagata University
Tel: 0238-26-3576
E-mail: jubee@yz.yamagata-u.ac.jp
Methods for performance improvement

• Instruction pipelining
– Pipeline hazards
– Branch prediction
• Data-level parallelism
– SIMD
• Instruction-level parallelism
– VLIW
– Superscalar
• Thread-level parallelism
– Simultaneous Multi-Threading
– Multicore processors
Instruction pipelining

[Figure: several processes, each consisting of stages A→B→C→D→E. Executed sequentially, they run one after another; pipelined, the stages of successive processes overlap, so several processes complete in the same amount of time.]

• Divide one process into multiple stages
• Each stage works in parallel
→ Multiple processes can be performed simultaneously, up to the number of stages.
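The timing argument above can be sketched in a few lines of Python (a toy model, not tied to any particular processor): an ideal n-stage pipeline finishes k processes in n + (k − 1) cycles instead of n × k.

```python
# Ideal pipeline timing: one new process enters per cycle (no hazards).
def pipelined_cycles(n_stages, n_processes):
    return n_stages + (n_processes - 1)

# Sequential execution: each process occupies all stages before the next starts.
def unpipelined_cycles(n_stages, n_processes):
    return n_stages * n_processes

print(pipelined_cycles(5, 4))    # 8 cycles
print(unpipelined_cycles(5, 4))  # 20 cycles
```

For large k the ratio approaches the number of stages, which is the claim on the next slides.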
An example of instruction pipelining
Without pipelining (one instruction every 800 ps):

  lw $s1,100($0)   IF ID EX MEM WB
  lw $s2,200($0)                   IF ID EX MEM WB   (800 ps later)
  lw $s3,300($0)                                     IF ...   (800 ps later)

With pipelining (a new instruction every 200 ps; each stage takes 200 ps):

  lw $s1,100($0)   IF ID EX MEM WB
  lw $s2,200($0)      IF ID EX MEM WB
  lw $s3,300($0)         IF ID EX MEM WB
Execution time in instruction pipelining

• Clock cycle time
– the longest stage time: 200 ps
• When executing one instruction
– 200 ps × the number of stages (5) = 1000 ps
– This is longer than the single clock cycle method (800 ps).
• When executing three instructions
– 2400 ps → 1400 ps: 1.7 times faster
• When executing 1,000 instructions
– 800,000 ps → 200,800 ps: approximately 4 times faster
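The slide's figures can be reproduced with a small script, using the 200 ps cycle time and the 800 ps single-cycle instruction time given above:

```python
# Pipelined: the first instruction takes stages*cycle, each later one adds one cycle.
def pipelined_time_ps(n_instr, cycle_ps=200, stages=5):
    return stages * cycle_ps + (n_instr - 1) * cycle_ps

# Single-cycle design: every instruction takes the full 800 ps.
def single_cycle_time_ps(n_instr, instr_ps=800):
    return n_instr * instr_ps

print(pipelined_time_ps(1))      # 1000 ps (slower than 800 ps single-cycle)
print(pipelined_time_ps(3))      # 1400 ps vs. 2400 ps
print(pipelined_time_ps(1000))   # 200800 ps vs. 800000 ps
print(round(single_cycle_time_ps(1000) / pipelined_time_ps(1000), 2))  # 3.98
```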
Notes on instruction pipelining

• Instruction pipelining improves throughput.
– It does not reduce the time required to process one instruction.
– Overall processing time becomes shorter when there are multiple instructions.
– If the work is divided evenly across stages and the number of instructions is large enough, an n-stage pipeline approaches a speedup of n.
• The slowest stage determines the clock cycle time.
– If the stage times cannot be divided equally, the full factor-of-n improvement cannot be achieved.
Pipeline hazards

• In instruction pipelining, situations occur where the next instruction cannot be started in the next clock cycle.
→ Pipeline hazards
– Structural hazard
– Data hazard
– Control hazard
Structural hazard
• Hazards due to hardware limitations
– Example: if there is only one memory
• Avoidable with a Harvard architecture (separate instruction and data memories)

[Figure: four instructions in flight; the IF stage of a later instruction and the MEM stage of an earlier one contend for the single memory in the same cycle.]
Data hazard

• When an instruction needs to wait for another instruction to complete,
→ the pipeline must be stalled.
• Example: add $s0,$t0,$t1
           sub $t2,$s0,$t3
– The subtract instruction cannot start until the add instruction completes.
• Solution: forwarding
– Data is forwarded inside the processor.
– It is also called bypassing.
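As a sketch, the hazard in the add/sub pair above can be detected mechanically by checking whether the first instruction's destination register appears among the second's source registers (the tuple encoding is an illustrative convention, not a real decoder):

```python
# An instruction is modeled as (destination register, list of source registers).
def raw_hazard(producer, consumer):
    dest, _ = producer
    _, sources = consumer
    return dest in sources          # read-after-write dependence

add_instr = ("$s0", ["$t0", "$t1"])   # add $s0,$t0,$t1
sub_instr = ("$t2", ["$s0", "$t3"])   # sub $t2,$s0,$t3
print(raw_hazard(add_instr, sub_instr))  # True: $s0 is written, then read
```

This is exactly the check a forwarding unit performs in hardware to decide when to bypass a result.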
Forwarding
Without forwarding:

  add $s0,$t0,$t1   IF ID EX MEM WB
  sub $t2,$s0,$t3   stall stall IF ID EX MEM WB   ($s0 must be written back first)

With forwarding:

  add $s0,$t0,$t1   IF ID EX MEM WB
  sub $t2,$s0,$t3      IF ID EX MEM WB   (the EX result of add is fed directly to the EX stage of sub)
A problem of forwarding

• Not all data hazards can be eliminated.
– For load instructions: the load-use data hazard

  lw  $s0,20($t1)   IF ID EX MEM WB
  sub $t2,$s0,$t3      stall IF ID EX MEM WB

• A pipeline stall occurs: the loaded value is not available until after the MEM stage, so even forwarding cannot remove one stall cycle.
Avoiding hazards by changing the order of instructions

A=B+E;
C=B+F;

Original order (two load-use stalls):
  lw   $t1,0($t0)
  lw   $t2,4($t0)
  add  $t3,$t1,$t2
  sw   $t3,12($t0)
  lw   $t4,8($t0)
  add  $t5,$t1,$t4
  sw   $t5,16($t0)

Reordered (no stalls):
  lw   $t1,0($t0)
  lw   $t2,4($t0)
  lw   $t4,8($t0)
  add  $t3,$t1,$t2
  sw   $t3,12($t0)
  add  $t5,$t1,$t4
  sw   $t5,16($t0)
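The effect of the reordering can be checked with a toy stall counter. It assumes forwarding removes every hazard except the load-use case, where one stall remains whenever a load's result is used by the very next instruction:

```python
# program: list of (opcode, destination, source registers). None = no destination.
def load_use_stalls(program):
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1                    # load immediately followed by a use
    return stalls

original = [
    ("lw",  "$t1", ["$t0"]),
    ("lw",  "$t2", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]),  # uses $t2 right after its load -> stall
    ("sw",  None,  ["$t3", "$t0"]),
    ("lw",  "$t4", ["$t0"]),
    ("add", "$t5", ["$t1", "$t4"]),  # uses $t4 right after its load -> stall
    ("sw",  None,  ["$t5", "$t0"]),
]
reordered = [original[0], original[1], original[4],
             original[2], original[3], original[5], original[6]]
print(load_use_stalls(original))   # 2
print(load_use_stalls(reordered))  # 0
```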
Control hazard

• It is also called a branch hazard.
• A branch instruction determines whether the branch is taken or untaken depending on a condition.
→ The pipeline must be stalled until the branch destination is determined.

Solution: branch prediction
• There are various methods.
– The simplest method: always predict the branch untaken
→ The next sequential instruction is executed.
An example of branch prediction (1/2)
abs: addi $sp,$sp,-4
sw $s0,0($sp)
slt $t0,$a0,$a1
beq $t0,$zero,Else
sub $s0,$a1,$a0
j Exit
Else: sub $s0,$a0,$a1
Exit: add $v0,$s0,$zero
lw $s0,0($sp)
addi $sp,$sp,4
jr $ra

Without branch prediction:

  slt $t0,$a0,$a1        IF ID EX MEM WB
  beq $t0,$zero,Else        IF ID EX MEM WB     (200 ps after slt)
  sub $s0,$a1,$a0                    IF ID EX MEM WB   (600 ps after beq: the pipeline waits until the branch resolves)
An example of branch prediction (2/2)
Branch prediction succeeds:

  slt $t0,$a0,$a1        IF ID EX MEM WB
  beq $t0,$zero,Else        IF ID EX MEM WB
  sub $s0,$a1,$a0              IF ID EX MEM WB   (only 200 ps after beq)

Branch prediction fails (misprediction):

  slt $t0,$a0,$a1        IF ID EX MEM WB
  beq $t0,$zero,Else        IF ID EX MEM WB
  Else: sub $s0,$a0,$a1              IF ID EX MEM WB   (600 ps after beq: the wrongly fetched instructions are discarded)
Branch misprediction
• If branch prediction fails:
– Only after the branch instruction executes is it known whether the prediction failed.
– Several instructions have already started executing.
→ Those results must be discarded.
• Branch mispredictions have a significant impact on processor performance.
• It is necessary to increase the accuracy of branch prediction (the probability that the prediction is correct).
Dynamic branch prediction

• A method for improving branch prediction accuracy
• Branch outcomes are predicted dynamically.
– Dynamic: at runtime
– Static: at compile time
• Branch prediction buffer
– It is also called a branch history table.
– It records with 1 bit whether a branch was recently taken or untaken.
→ If the bit is set, the branch is predicted taken; if it is clear, the branch is predicted untaken.
A problem on the 1-bit branch prediction

• Consider a loop execution.
• What happens to the branch prediction accuracy if a branch is taken nine times in a row and then not taken once?
– First branch → hit (or miss, depending on the initial state)
– 2nd to 9th → hit
– 10th → miss
– Return to the top of the loop → miss
• 8 out of 10 predictions succeed.
→ Branch prediction accuracy: 80%
Because a single misprediction immediately flips the prediction, mispredictions occur back to back.
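A short simulation confirms the 80% figure. This is a toy model: the predictor's single bit simply stores the last actual outcome of the branch.

```python
# 1-bit predictor on a repeating pattern of branch outcomes (True = taken).
def one_bit_accuracy(pattern, rounds=100):
    state = pattern[-1]          # warm start, as if a previous round ran
    hits = total = 0
    for _ in range(rounds):
        for actual in pattern:
            hits += (state == actual)   # predict the last outcome
            total += 1
            state = actual              # remember this outcome
    return hits / total

pattern = [True] * 9 + [False]   # taken 9 times, then not taken
print(one_bit_accuracy(pattern))  # 0.8
```

Each loop exit costs two misses: one at the exit itself and one at the re-entry, exactly as the slide describes.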
2-bit branch prediction

• The buffer is increased from 1 bit to 2 bits.
– The prediction changes only after two consecutive mispredictions.

[State diagram: four states — "strongly taken" ⇄ "weakly taken" ⇄ "weakly untaken" ⇄ "strongly untaken". A taken branch moves the state toward "taken", an untaken branch toward "untaken"; the two "taken" states predict taken, the two "untaken" states predict untaken.]
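Running the same 9-taken/1-untaken loop pattern through a 2-bit saturating counter (a common encoding: states 0–3, predict taken when the counter is 2 or more) shows the improvement — the back-to-back misses disappear:

```python
# 2-bit saturating counter: two consecutive mispredictions are needed to flip.
def two_bit_accuracy(pattern, rounds=100):
    counter = 3                  # start in "strongly taken"
    hits = total = 0
    for _ in range(rounds):
        for actual in pattern:
            predicted = counter >= 2
            hits += (predicted == actual)
            total += 1
            # move toward taken (up) or untaken (down), saturating at 0 and 3
            counter = min(counter + 1, 3) if actual else max(counter - 1, 0)
    return hits / total

pattern = [True] * 9 + [False]
print(two_bit_accuracy(pattern))  # 0.9: only the loop-exit branch misses
```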
Various branch prediction methods

• BTB: branch target buffer
– Even if a branch is predicted taken, the branch destination address would otherwise have to be calculated each time.
→ The BTB stores the destination address of each branch instruction.
– The address of the destination instruction is obtained without waiting for the result of the address calculation.
→ No need to stall the pipeline.
• Correlating prediction methods
– They consider not only the local history of a branch but also global branch behavior.
• Tournament branch prediction methods
– For each branch, the best-performing of multiple predictors is adopted.
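A BTB can be sketched as a small lookup table from branch instruction address (PC) to its last-seen target, which is what lets the fetch stage redirect immediately. The addresses below are made-up values for illustration:

```python
# Toy branch target buffer: PC of a branch -> last observed target address.
btb = {}

def lookup(pc):
    return btb.get(pc)          # None = no entry: fetch falls through to pc + 4

def update(pc, target):
    btb[pc] = target            # record the target when the branch resolves

update(0x0040_0010, 0x0040_0080)   # a branch at 0x400010 jumped to 0x400080
print(hex(lookup(0x0040_0010)))    # 0x400080, available at fetch time
print(lookup(0x0040_0020))         # None: this PC has no entry yet
```

A real BTB is a fixed-size, set-associative memory rather than an unbounded dictionary, but the lookup/update behavior is the same.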
Performance improvement of a processor

[Figure: growth of processor performance over time. Source: David A. Patterson and John L. Hennessy, "Computer Organization and Design, Fifth Edition: The Hardware/Software Interface".]
Advances in semiconductor manufacturing technology
• Moore's law (density doubles every 18-24 months)
– Density: the number of elements such as transistors per unit area
– It holds because elements can be made smaller.
• The same circuit can be made smaller.
→ Transistors become faster: the clock frequency improves.
• More transistors can be mounted on one chip.
→ The additional transistors can be used to improve performance.
• Pollack's law
– Processor performance is roughly proportional to the square root of its complexity.
→ If the density is doubled, the performance improves by a factor of about 1.4 (√2).
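The ~1.4× figure for Pollack's law follows directly from the square root:

```python
import math

# Pollack's law: performance grows roughly with the square root of the
# transistor budget (complexity) spent on a single core.
def pollack_speedup(density_ratio):
    return math.sqrt(density_ratio)

print(round(pollack_speedup(2), 2))  # 1.41: doubling density buys only ~40%
print(round(pollack_speedup(4), 2))  # 2.0:  quadrupling buys only 2x
```

This diminishing return on single-core complexity is one motivation for spending transistors on multiple cores instead.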
For improving performance

CPU time = (Instruction count × CPI) / Clock frequency

• How can performance be improved?
– Increase the clock frequency
– Decrease the instruction count
– Decrease the CPI (Cycles Per Instruction)
→ i.e., increase the IPC (Instructions Per Cycle)
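Plugging illustrative numbers into the equation above (these values are examples, not measurements of any real processor):

```python
# CPU time = (instruction count x CPI) / clock frequency
def cpu_time(instruction_count, cpi, clock_hz):
    return instruction_count * cpi / clock_hz

# 1 million instructions, CPI of 1.25, 2 GHz clock
t = cpu_time(1_000_000, cpi=1.25, clock_hz=2_000_000_000)
print(t)  # 0.000625 seconds
```

Halving the CPI or the instruction count, or doubling the clock, each halves the CPU time — which is why the three levers listed above are the targets of all the techniques that follow.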
Increasing clock frequency

• The clock frequency can be raised by miniaturizing transistors.
• How can the clock frequency be raised architecturally?
→ Subdivide the instruction pipeline into more stages
– Also called superpipelining
– Up to 31 stages on the Intel Pentium 4 processor
• Note: increasing the clock frequency does not always improve performance.
– Penalties due to pipeline stalls increase.
→ CPI increases
Clock frequency and power consumption

• Increasing the clock frequency → increased power consumption

Power consumption of an integrated circuit
  = Dynamic power consumption + Static power consumption

Dynamic power consumption ∝ Capacitance × Voltage² × Clock frequency

• How can performance be improved without increasing the clock frequency?
→ By using parallelism
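A relative calculation with the proportionality above shows why frequency scaling is costly; the scaling factors are illustrative, not measured values. Raising the clock often also requires raising the supply voltage, and the V² term compounds the cost:

```python
# Dynamic power ~ C x V^2 x f (relative units).
def dynamic_power(capacitance, voltage, frequency):
    return capacitance * voltage**2 * frequency

base = dynamic_power(1.0, 1.0, 1.0)
faster = dynamic_power(1.0, 1.1, 1.3)   # +30% clock, assuming +10% voltage
print(round(faster / base, 3))          # 1.573: ~57% more power for +30% speed
```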
Reduction of instruction count

• To reduce the instruction count, one instruction must perform more complex processing.
– However, complex instructions have a negative impact on the clock frequency.
• Programs often repeat the same operation on multiple pieces of data.
→ Process multiple pieces of data with one instruction
• SIMD instructions
– Single Instruction, Multiple Data
– Examples: MMX, SSE, AVX
• Operations are performed on multiple data elements at the same time.
– The data parallelism in the program is exploited.
Effects of SIMD instructions
• Y[i] = a × X[i] + Y[i]

Scalar MIPS loop:
      l.d    $f0,a($sp)
      addiu  r4,$s0,512
loop: l.d    $f2,0($s0)
      mul.d  $f2,$f2,$f0
      l.d    $f4,0($s1)
      add.d  $f4,$f4,$f2
      s.d    $f4,0($s1)
      addiu  $s0,$s0,8
      addiu  $s1,$s1,8
      subu   $t0,r4,$s0
      bne    $t0,$zero,loop

Vector (SIMD) version:
      l.d     $f0,a($sp)
      lv      $v1,0($s0)
      mulvs.d $v2,$v1,$f0
      lv      $v3,0($s1)
      addv.d  $v4,$v2,$v3
      sv      $v4,0($s1)
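The kernel's data parallelism can be illustrated in plain Python: the "SIMD" version below applies the operation one 4-element chunk at a time, mimicking vector lanes. Real SIMD hardware would process each chunk with a single instruction; here pure Python merely stands in for that.

```python
# a*X + Y written element-by-element.
def daxpy_scalar(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

# The same kernel in chunks of `lanes` elements: each chunk models one
# vector register operated on by one SIMD instruction.
def daxpy_simd(a, x, y, lanes=4):
    out = []
    for i in range(0, len(x), lanes):
        xv, yv = x[i:i + lanes], y[i:i + lanes]        # load vector registers
        out.extend(a * xi + yi for xi, yi in zip(xv, yv))  # one vector op
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.0] * 8
print(daxpy_scalar(2.0, x, y) == daxpy_simd(2.0, x, y))  # True
```

With 4-wide lanes, 8 elements need only 2 vector operations instead of 8 scalar ones — the instruction-count reduction the slide describes.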
Increasing IPC

• How can CPI be decreased, i.e., IPC increased?
→ Increase the number of instructions that can be executed in one cycle
• ILP: Instruction-Level Parallelism
– Parallelism that exists between instructions
→ i.e., whether those instructions can be executed simultaneously
• Multiple instructions can be executed simultaneously by exploiting instruction-level parallelism.
→ IPC (Instructions Per Cycle) is increased.
Exploiting instruction-level parallelism

• Prepare multiple hardware units to execute instructions.
– Each unit can execute instructions independently.
– As many instructions can execute at the same time as there are units.
• Instructions that can be executed simultaneously must be extracted from the program.
– Static extraction: VLIW
– Dynamic extraction: superscalar
VLIW

• The compiler extracts instructions that can be executed simultaneously.
• It combines the extracted instructions into one instruction.
→ The instruction length becomes longer
(Very Long Instruction Word)
• Features
☺ No hardware is required to dynamically extract instruction-level parallelism.
☹ It requires instruction set architecture changes.
→ Programs need to be recompiled.
An example of instruction pipelining with VLIW

• In this example, an arithmetic/branch instruction and a load/store instruction can be executed simultaneously.

[Figure: two-issue pipeline. In every cycle, one arithmetic/branch instruction and one load/store instruction enter IF together and proceed through IF ID EX MEM WB side by side.]
Example: Scheduling in VLIW (1/2)

• How is the loop below scheduled on the pipeline above?

Loop: lw $t0,0($s1) #load A[i] to $t0


add $t0,$t0,$s2 #$t0=A[i]+$s2
sw $t0,0($s1) #A[i]=$t0
addi $s1, $s1, -4 #i=i-1
bne $s1,$zero, Loop #goto Loop if $s1≠0

Example: Scheduling in VLIW (2/2)

      Arithmetic/Branch slot     Load/Store slot     Clock cycle
Loop: nop (no operation)         lw   $t0,0($s1)     1
      addi $s1,$s1,-4            nop                 2
      add  $t0,$t0,$s2           nop                 3
      bne  $s1,$zero,Loop        sw   $t0,4($s1)     4

• It executes 5 instructions in 4 cycles.
– IPC = 1.25, CPI = 0.8
– Not enough compared to the maximum IPC of 2.0.
Loop Unrolling

• Prepare multiple copies of the loop body and schedule the instructions of several iterations at once.
→ This increases the available instruction-level parallelism.
(Loop unrolling)
• Using the same register across the copies causes conflicts.
→ The registers used are given different names.
(Register renaming)
An example of loop unrolling: unrolling 4 loops
Loop: lw   $t0,0($s1)      #load A[i] to $t0
      addi $s1,$s1,-16     #decrement the pointer by 4 elements
      lw   $t1,12($s1)     #load A[i-1] to $t1
      lw   $t2,8($s1)      #load A[i-2] to $t2
      lw   $t3,4($s1)      #load A[i-3] to $t3
      add  $t0,$t0,$s2     #$t0=A[i]+$s2
      add  $t1,$t1,$s2     #$t1=A[i-1]+$s2
      add  $t2,$t2,$s2     #$t2=A[i-2]+$s2
      add  $t3,$t3,$s2     #$t3=A[i-3]+$s2
      sw   $t0,16($s1)     #A[i]=$t0
      sw   $t1,12($s1)     #A[i-1]=$t1
      sw   $t2,8($s1)      #A[i-2]=$t2
      sw   $t3,4($s1)      #A[i-3]=$t3
      bne  $s1,$zero,Loop  #goto Loop if $s1≠0
An example of loop unrolling: scheduling

• 14 instructions are executed in 8 cycles.
– IPC = 1.75, CPI ≈ 0.57

      Arithmetic/Branch slot     Load/Store slot     Clock cycle
Loop: addi $s1,$s1,-16           lw   $t0,0($s1)     1
      nop                        lw   $t1,12($s1)    2
      add  $t0,$t0,$s2           lw   $t2,8($s1)     3
      add  $t1,$t1,$s2           lw   $t3,4($s1)     4
      add  $t2,$t2,$s2           sw   $t0,16($s1)    5
      add  $t3,$t3,$s2           sw   $t1,12($s1)    6
      nop                        sw   $t2,8($s1)     7
      bne  $s1,$zero,Loop        sw   $t3,4($s1)     8
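Both IPC figures from the scheduling slides can be recomputed from the bundle layouts; each tuple below is one cycle, holding the arithmetic/branch slot and the load/store slot (only opcodes matter for the count):

```python
# IPC = real operations (non-nop slots) / number of bundles (cycles).
def ipc(bundles):
    ops = sum(1 for bundle in bundles for slot in bundle if slot != "nop")
    return ops / len(bundles)

plain = [("nop", "lw"), ("addi", "nop"), ("add", "nop"), ("bne", "sw")]
unrolled = [("addi", "lw"), ("nop", "lw"), ("add", "lw"), ("add", "lw"),
            ("add", "sw"), ("add", "sw"), ("nop", "sw"), ("bne", "sw")]
print(ipc(plain))     # 1.25 (5 instructions in 4 cycles)
print(ipc(unrolled))  # 1.75 (14 instructions in 8 cycles)
```

Unrolling fills more of the issue slots, but the two nops show that even this schedule does not reach the pipeline's maximum IPC of 2.0.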
Superscalar

• Instructions that can be executed simultaneously are extracted dynamically, in hardware.
☺ Recompiling is not required.
☹ Hardware is required to extract the ILP.
• Processing flow of a superscalar processor
– Fetched instructions are held in a reservation station prepared for each functional unit.
– An instruction is executed as soon as its functional unit becomes available.
– The results are stored in the reorder buffer in the commit unit.
– The results are then stored in registers or memory, in program order, as soon as possible.
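A toy model of the dynamic-scheduling idea (not a faithful reservation-station simulator): each cycle, any not-yet-executed instruction whose source registers are ready runs, and its result becomes usable in the following cycle, so independent instructions overtake dependent ones. Register names and initial state are illustrative:

```python
# program: list of (name, destination register, source registers).
def schedule(program):
    ready = {"$s1", "$s2"}              # registers assumed ready at the start
    done, finish_cycle, cycle = set(), {}, 0
    while len(done) < len(program):
        cycle += 1
        # issue everything whose operands are ready (unlimited units, for simplicity)
        issued = [(n, d) for n, d, srcs in program
                  if n not in done and all(s in ready for s in srcs)]
        if not issued:                  # unsatisfiable dependency: stop
            break
        for name, dest in issued:
            done.add(name)
            finish_cycle[name] = cycle
        ready |= {d for _, d in issued if d}   # results usable next cycle

    return finish_cycle

prog = [("lw1",  "$t0", ["$s1"]),
        ("add1", "$t1", ["$t0", "$s2"]),   # depends on lw1: waits a cycle
        ("lw2",  "$t2", ["$s1"])]          # independent: runs in cycle 1
print(schedule(prog))  # {'lw1': 1, 'lw2': 1, 'add1': 2}
```

The independent lw2 finishes before add1, ahead of its program-order position — the out-of-order execution the diagram on the next slide depicts; a real commit unit would still retire the results in order.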
The structure of a superscalar processor

[Figure: the instruction fetch/decode unit issues instructions in order to a row of reservation stations, one per functional unit (integer ALUs, floating-point unit, load/store unit). The functional units execute out of order, and the commit unit commits the results in order.]
The pipeline configuration of Core i7 processor

[Figure: the pipeline of the Core i7 processor. Source: David A. Patterson and John L. Hennessy, "Computer Organization and Design: The Hardware/Software Interface, Fifth Edition".]
Exploiting thread-level parallelism

• There are limits to instruction-level parallelism.
– IPC is about 2 on average at most.
– This is due to dependencies between instructions.
• Thread-level parallelism has therefore attracted attention.
– Thread:
• A sequence of instructions that has no dependencies on other threads.
• A program consists of one or more threads.
• Because there are no dependencies between them, threads can be executed at the same time.
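Independent threads can be sketched with the Python standard library; the tasks below are arbitrary placeholders that share no data, so they can safely run at the same time:

```python
from concurrent.futures import ThreadPoolExecutor

# An independent unit of work: no shared state with the other tasks.
def task(n):
    return sum(range(n))

# Four worker threads, one per task; results come back in submission order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(task, [10, 100, 1000, 10000]))
print(results)  # [45, 4950, 499500, 49995000]
```

Whether these threads actually run simultaneously depends on the hardware beneath them — which is exactly what SMT and multi-core processors, described next, provide.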
SMT(Simultaneous Multi Threading)

• SMT applies the hardware of a superscalar processor.
– Many functional units sit unused even though they are provided.
→ Instructions of another thread are executed on the free functional units.
• Problems
– Hardware is required to manage multiple threads.
– Performance per thread decreases due to conflicts between threads.
Multi-Core processors

• Multiple processors on one chip
• Different threads can be executed simultaneously, one on each processor.
– SMT can also be adopted in each processor.

[Figure: each processor contains its own instruction fetch/decode unit, reservation stations, functional units, and commit unit; four such processors share an LLC (Last Level Cache) on one chip.]
Exploiting instruction/thread-level parallelism

[Figure: issue slots (horizontal) versus time (vertical) for a normal processor, a superscalar, SMT, a multi-core, and multi-core + SMT. Threads 1-8 fill progressively more of the issue slots as more forms of parallelism are exploited.]
Improving the performance of a processor

• Exploiting parallelism
– Data level: SIMD instructions
– Instruction level: superscalar, VLIW
– Thread level: SMT, multi-core
• However, if the parallelism does not exist, performance will not improve.
→ Programming techniques that increase parallelism are required.
Limits of performance improvement through parallelism

• Amdahl's law
– The limit of the performance improvement obtained by parallelization is determined by the proportion of the program that cannot be executed in parallel.
– If 10% of the program cannot be parallelized, the speedup can never exceed 10 times.

[Figure: execution time with a serial part of 10 and a parallelizable part of 90. 1 processor: 90 + 10 = 100; 2 processors: 45 + 10 = 55; 10 processors: 9 + 10 = 19; 1000 processors: ≈ 10.]
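Amdahl's law in code, reproducing the 10%-serial example from the figure:

```python
# Amdahl's law: with serial fraction s, the speedup on p processors is
# 1 / (s + (1 - s) / p); as p grows it approaches 1 / s.
def amdahl_speedup(serial_fraction, processors):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)

for p in (1, 2, 10, 1000):
    print(p, round(amdahl_speedup(0.1, p), 2))
# 1 -> 1.0, 2 -> 1.82, 10 -> 5.26, 1000 -> 9.91
```

Even with 1000 processors the speedup stays below 10, because the 10% serial part always takes its full time.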
