
Lecture on Global Informatics and Electronics Ⅱ

Jubee Tada
Graduate School of Science and Engineering, Yamagata University
Tel: 0238-26-3576
E-mail: jubee@yz.yamagata-u.ac.jp
Methods for performance improvement

• Instruction pipelining
– Pipeline hazards
– Branch prediction
• Data-level parallelism
– SIMD
• Instruction-level parallelism
– VLIW
– Superscalar
• Thread-level parallelism
– Simultaneous Multi-Threading
– Multicore processors
Instruction pipelining

[Figure: several processes, each consisting of stages A→B→C→D→E. Executed sequentially, they run one after another; pipelined, the stages of successive processes overlap, so several processes complete in the same amount of time.]

• Divide one process into multiple stages
• Each stage works in parallel
→ Multiple processes can be performed simultaneously, up to the number of stages.
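The timing argument above can be sketched in a few lines of Python (a toy model, not tied to any particular processor): an ideal n-stage pipeline finishes k processes in n + (k − 1) cycles instead of n × k.

```python
# Ideal pipeline timing: one new process enters per cycle (no hazards).
def pipelined_cycles(n_stages, n_processes):
    return n_stages + (n_processes - 1)

# Sequential execution: each process occupies all stages before the next starts.
def unpipelined_cycles(n_stages, n_processes):
    return n_stages * n_processes

print(pipelined_cycles(5, 4))    # 8 cycles
print(unpipelined_cycles(5, 4))  # 20 cycles
```

For large k the ratio approaches the number of stages, which is the claim on the next slides.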
An example of instruction pipelining
Without pipelining (one instruction every 800 ps):

  lw $s1,100($0)   IF ID EX MEM WB
  lw $s2,200($0)                   IF ID EX MEM WB   (800 ps later)
  lw $s3,300($0)                                     IF ...   (800 ps later)

With pipelining (a new instruction every 200 ps; each stage takes 200 ps):

  lw $s1,100($0)   IF ID EX MEM WB
  lw $s2,200($0)      IF ID EX MEM WB
  lw $s3,300($0)         IF ID EX MEM WB
Execution time in instruction pipelining

• Clock cycle time
– the longest stage time: 200 ps
• When executing one instruction
– 200 ps × the number of stages (5) = 1000 ps
– This is longer than the single clock cycle method (800 ps).
• When executing three instructions
– 2400 ps → 1400 ps: 1.7 times faster
• When executing 1,000 instructions
– 800,000 ps → 200,800 ps: approximately 4 times faster
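The slide's figures can be reproduced with a small script, using the 200 ps cycle time and the 800 ps single-cycle instruction time given above:

```python
# Pipelined: the first instruction takes stages*cycle, each later one adds one cycle.
def pipelined_time_ps(n_instr, cycle_ps=200, stages=5):
    return stages * cycle_ps + (n_instr - 1) * cycle_ps

# Single-cycle design: every instruction takes the full 800 ps.
def single_cycle_time_ps(n_instr, instr_ps=800):
    return n_instr * instr_ps

print(pipelined_time_ps(1))      # 1000 ps (slower than 800 ps single-cycle)
print(pipelined_time_ps(3))      # 1400 ps vs. 2400 ps
print(pipelined_time_ps(1000))   # 200800 ps vs. 800000 ps
print(round(single_cycle_time_ps(1000) / pipelined_time_ps(1000), 2))  # 3.98
```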
Notes on instruction pipelining

• Instruction pipelining improves throughput.
– It does not reduce the time required to process one instruction.
– Overall processing time becomes shorter when there are multiple instructions.
– If the work is divided evenly across stages and the number of instructions is large enough, an n-stage pipeline approaches a speedup of n.
• The slowest stage determines the clock cycle time.
– If the stage times cannot be divided equally, the full factor-of-n improvement cannot be achieved.
Pipeline hazards

• In instruction pipelining, situations occur where the next instruction cannot be started in the next clock cycle.
→ Pipeline hazards
– Structural hazard
– Data hazard
– Control hazard
Structural hazard
• Hazards due to hardware limitations
– Example: if there is only one memory
• Avoidable with a Harvard architecture (separate instruction and data memories)

[Figure: four instructions in flight; the IF stage of a later instruction and the MEM stage of an earlier one contend for the single memory in the same cycle.]
Data hazard

• When an instruction needs to wait for another instruction to complete,
→ the pipeline must be stalled.
• Example: add $s0,$t0,$t1
           sub $t2,$s0,$t3
– The subtract instruction cannot start until the add instruction completes.
• Solution: forwarding
– Data is forwarded inside the processor.
– It is also called bypassing.
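As a sketch, the hazard in the add/sub pair above can be detected mechanically by checking whether the first instruction's destination register appears among the second's source registers (the tuple encoding is an illustrative convention, not a real decoder):

```python
# An instruction is modeled as (destination register, list of source registers).
def raw_hazard(producer, consumer):
    dest, _ = producer
    _, sources = consumer
    return dest in sources          # read-after-write dependence

add_instr = ("$s0", ["$t0", "$t1"])   # add $s0,$t0,$t1
sub_instr = ("$t2", ["$s0", "$t3"])   # sub $t2,$s0,$t3
print(raw_hazard(add_instr, sub_instr))  # True: $s0 is written, then read
```

This is exactly the check a forwarding unit performs in hardware to decide when to bypass a result.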
Forwarding
Without forwarding:

  add $s0,$t0,$t1   IF ID EX MEM WB
  sub $t2,$s0,$t3   stall stall IF ID EX MEM WB   ($s0 must be written back first)

With forwarding:

  add $s0,$t0,$t1   IF ID EX MEM WB
  sub $t2,$s0,$t3      IF ID EX MEM WB   (the EX result of add is fed directly to the EX stage of sub)
A problem of forwarding

• Not all data hazards can be eliminated.
– For load instructions: the load-use data hazard

  lw  $s0,20($t1)   IF ID EX MEM WB
  sub $t2,$s0,$t3      stall IF ID EX MEM WB

• A pipeline stall occurs: the loaded value is not available until after the MEM stage, so even forwarding cannot remove one stall cycle.
Avoiding hazards by changing the order of instructions

A=B+E;
C=B+F;

Original order (two load-use stalls):
  lw   $t1,0($t0)
  lw   $t2,4($t0)
  add  $t3,$t1,$t2
  sw   $t3,12($t0)
  lw   $t4,8($t0)
  add  $t5,$t1,$t4
  sw   $t5,16($t0)

Reordered (no stalls):
  lw   $t1,0($t0)
  lw   $t2,4($t0)
  lw   $t4,8($t0)
  add  $t3,$t1,$t2
  sw   $t3,12($t0)
  add  $t5,$t1,$t4
  sw   $t5,16($t0)
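The effect of the reordering can be checked with a toy stall counter. It assumes forwarding removes every hazard except the load-use case, where one stall remains whenever a load's result is used by the very next instruction:

```python
# program: list of (opcode, destination, source registers). None = no destination.
def load_use_stalls(program):
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1                    # load immediately followed by a use
    return stalls

original = [
    ("lw",  "$t1", ["$t0"]),
    ("lw",  "$t2", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]),  # uses $t2 right after its load -> stall
    ("sw",  None,  ["$t3", "$t0"]),
    ("lw",  "$t4", ["$t0"]),
    ("add", "$t5", ["$t1", "$t4"]),  # uses $t4 right after its load -> stall
    ("sw",  None,  ["$t5", "$t0"]),
]
reordered = [original[0], original[1], original[4],
             original[2], original[3], original[5], original[6]]
print(load_use_stalls(original))   # 2
print(load_use_stalls(reordered))  # 0
```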
Control hazard

• It is also called a branch hazard.
• A branch instruction determines whether the branch is taken or untaken depending on a condition.
→ The pipeline must be stalled until the branch destination is determined.

Solution: branch prediction
• There are various methods.
– The simplest method: always predict the branch untaken
→ The next sequential instruction is executed.
An example of branch prediction (1/2)
abs: addi $sp,$sp,-4
sw $s0,0($sp)
slt $t0,$a0,$a1
beq $t0,$zero,Else
sub $s0,$a1,$a0
j Exit
Else: sub $s0,$a0,$a1
Exit: add $v0,$s0,$zero
lw $s0,0($sp)
addi $sp,$sp,4
jr $ra

Without branch prediction:

  slt $t0,$a0,$a1        IF ID EX MEM WB
  beq $t0,$zero,Else        IF ID EX MEM WB     (200 ps after slt)
  sub $s0,$a1,$a0                    IF ID EX MEM WB   (600 ps after beq: the pipeline waits until the branch resolves)
An example of branch prediction (2/2)
Branch prediction succeeds:

  slt $t0,$a0,$a1        IF ID EX MEM WB
  beq $t0,$zero,Else        IF ID EX MEM WB
  sub $s0,$a1,$a0              IF ID EX MEM WB   (only 200 ps after beq)

Branch prediction fails (misprediction):

  slt $t0,$a0,$a1        IF ID EX MEM WB
  beq $t0,$zero,Else        IF ID EX MEM WB
  Else: sub $s0,$a0,$a1              IF ID EX MEM WB   (600 ps after beq: the wrongly fetched instructions are discarded)
Branch misprediction
• If branch prediction fails:
– Only after the branch instruction executes is it known whether the prediction failed.
– Several instructions have already started executing.
→ Those results must be discarded.
• Branch mispredictions have a significant impact on processor performance.
• It is necessary to increase the accuracy of branch prediction (the probability that the prediction is correct).
Dynamic branch prediction

• A method for improving branch prediction accuracy
• Branch outcomes are predicted dynamically.
– Dynamic: at runtime
– Static: at compile time
• Branch prediction buffer
– It is also called a branch history table.
– It records with 1 bit whether a branch was recently taken or untaken.
→ If the bit is set, the branch is predicted taken; if it is clear, the branch is predicted untaken.
A problem on the 1-bit branch prediction

• Consider a loop execution.
• What happens to the branch prediction accuracy if a branch is taken nine times in a row and then not taken once?
– First branch → hit (or miss, depending on the initial state)
– 2nd to 9th → hit
– 10th → miss
– Return to the top of the loop → miss
• 8 out of 10 predictions succeed.
→ Branch prediction accuracy: 80%
Because a single misprediction immediately flips the prediction, mispredictions occur back to back.
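A short simulation confirms the 80% figure. This is a toy model: the predictor's single bit simply stores the last actual outcome of the branch.

```python
# 1-bit predictor on a repeating pattern of branch outcomes (True = taken).
def one_bit_accuracy(pattern, rounds=100):
    state = pattern[-1]          # warm start, as if a previous round ran
    hits = total = 0
    for _ in range(rounds):
        for actual in pattern:
            hits += (state == actual)   # predict the last outcome
            total += 1
            state = actual              # remember this outcome
    return hits / total

pattern = [True] * 9 + [False]   # taken 9 times, then not taken
print(one_bit_accuracy(pattern))  # 0.8
```

Each loop exit costs two misses: one at the exit itself and one at the re-entry, exactly as the slide describes.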
2-bit branch prediction

• The buffer is increased from 1 bit to 2 bits.
– The prediction changes only after two consecutive mispredictions.

[State diagram: four states — "strongly taken" ⇄ "weakly taken" ⇄ "weakly untaken" ⇄ "strongly untaken". A taken branch moves the state toward "taken", an untaken branch toward "untaken"; the two "taken" states predict taken, the two "untaken" states predict untaken.]
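Running the same 9-taken/1-untaken loop pattern through a 2-bit saturating counter (a common encoding: states 0–3, predict taken when the counter is 2 or more) shows the improvement — the back-to-back misses disappear:

```python
# 2-bit saturating counter: two consecutive mispredictions are needed to flip.
def two_bit_accuracy(pattern, rounds=100):
    counter = 3                  # start in "strongly taken"
    hits = total = 0
    for _ in range(rounds):
        for actual in pattern:
            predicted = counter >= 2
            hits += (predicted == actual)
            total += 1
            # move toward taken (up) or untaken (down), saturating at 0 and 3
            counter = min(counter + 1, 3) if actual else max(counter - 1, 0)
    return hits / total

pattern = [True] * 9 + [False]
print(two_bit_accuracy(pattern))  # 0.9: only the loop-exit branch misses
```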
Various branch prediction methods

• BTB: branch target buffer
– Even if a branch is predicted taken, the branch destination address would otherwise have to be calculated each time.
→ The BTB stores the destination address of each branch instruction.
– The address of the destination instruction is obtained without waiting for the result of the address calculation.
→ No need to stall the pipeline.
• Correlating prediction methods
– They consider not only the local history of a branch but also global branch behavior.
• Tournament branch prediction methods
– For each branch, the best-performing of multiple predictors is adopted.
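A BTB can be sketched as a small lookup table from branch instruction address (PC) to its last-seen target, which is what lets the fetch stage redirect immediately. The addresses below are made-up values for illustration:

```python
# Toy branch target buffer: PC of a branch -> last observed target address.
btb = {}

def lookup(pc):
    return btb.get(pc)          # None = no entry: fetch falls through to pc + 4

def update(pc, target):
    btb[pc] = target            # record the target when the branch resolves

update(0x0040_0010, 0x0040_0080)   # a branch at 0x400010 jumped to 0x400080
print(hex(lookup(0x0040_0010)))    # 0x400080, available at fetch time
print(lookup(0x0040_0020))         # None: this PC has no entry yet
```

A real BTB is a fixed-size, set-associative memory rather than an unbounded dictionary, but the lookup/update behavior is the same.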
Performance improvement of a processor

[Figure: growth of processor performance over time. Source: David A. Patterson and John L. Hennessy, "Computer Organization and Design, Fifth Edition: The Hardware/Software Interface".]
Advances in semiconductor manufacturing technology
• Moore's law (density doubles every 18-24 months)
– Density: the number of elements such as transistors per unit area
– It holds because elements can be made smaller.
• The same circuit can be made smaller.
→ Transistors become faster: the clock frequency improves.
• More transistors can be mounted on one chip.
→ The additional transistors can be used to improve performance.
• Pollack's law
– Processor performance is roughly proportional to the square root of its complexity.
→ If the density is doubled, the performance improves by a factor of about 1.4 (√2).
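The ~1.4× figure for Pollack's law follows directly from the square root:

```python
import math

# Pollack's law: performance grows roughly with the square root of the
# transistor budget (complexity) spent on a single core.
def pollack_speedup(density_ratio):
    return math.sqrt(density_ratio)

print(round(pollack_speedup(2), 2))  # 1.41: doubling density buys only ~40%
print(round(pollack_speedup(4), 2))  # 2.0:  quadrupling buys only 2x
```

This diminishing return on single-core complexity is one motivation for spending transistors on multiple cores instead.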
For improving performance

CPU time = (Instruction count × CPI) / Clock frequency

• How can performance be improved?
– Increase the clock frequency
– Decrease the instruction count
– Decrease the CPI (Cycles Per Instruction)
→ i.e., increase the IPC (Instructions Per Cycle)
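Plugging illustrative numbers into the equation above (these values are examples, not measurements of any real processor):

```python
# CPU time = (instruction count x CPI) / clock frequency
def cpu_time(instruction_count, cpi, clock_hz):
    return instruction_count * cpi / clock_hz

# 1 million instructions, CPI of 1.25, 2 GHz clock
t = cpu_time(1_000_000, cpi=1.25, clock_hz=2_000_000_000)
print(t)  # 0.000625 seconds
```

Halving the CPI or the instruction count, or doubling the clock, each halves the CPU time — which is why the three levers listed above are the targets of all the techniques that follow.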
Increasing clock frequency

• The clock frequency can be raised by miniaturizing transistors.
• How can the clock frequency be raised architecturally?
→ Subdivide the instruction pipeline into more stages
– Also called superpipelining
– Up to 31 stages on the Intel Pentium 4 processor
• Note: increasing the clock frequency does not always improve performance.
– Penalties due to pipeline stalls increase.
→ CPI increases
Clock frequency and power consumption

• Increasing the clock frequency → increased power consumption

Power consumption of an integrated circuit
  = Dynamic power consumption + Static power consumption

Dynamic power consumption ∝ Capacitance × Voltage² × Clock frequency

• How can performance be improved without increasing the clock frequency?
→ By using parallelism
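A relative calculation with the proportionality above shows why frequency scaling is costly; the scaling factors are illustrative, not measured values. Raising the clock often also requires raising the supply voltage, and the V² term compounds the cost:

```python
# Dynamic power ~ C x V^2 x f (relative units).
def dynamic_power(capacitance, voltage, frequency):
    return capacitance * voltage**2 * frequency

base = dynamic_power(1.0, 1.0, 1.0)
faster = dynamic_power(1.0, 1.1, 1.3)   # +30% clock, assuming +10% voltage
print(round(faster / base, 3))          # 1.573: ~57% more power for +30% speed
```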
Reduction of instruction count

• To reduce the instruction count, one instruction must perform more complex processing.
– However, complex instructions have a negative impact on the clock frequency.
• Programs often repeat the same operation on multiple pieces of data.
→ Process multiple pieces of data with one instruction
• SIMD instructions
– Single Instruction, Multiple Data
– Examples: MMX, SSE, AVX
• Operations are performed on multiple data elements at the same time.
– The data parallelism in the program is exploited.
Effects of SIMD instructions
• Y[i] = a × X[i] + Y[i]

Scalar MIPS loop:
      l.d    $f0,a($sp)
      addiu  r4,$s0,512
loop: l.d    $f2,0($s0)
      mul.d  $f2,$f2,$f0
      l.d    $f4,0($s1)
      add.d  $f4,$f4,$f2
      s.d    $f4,0($s1)
      addiu  $s0,$s0,8
      addiu  $s1,$s1,8
      subu   $t0,r4,$s0
      bne    $t0,$zero,loop

Vector (SIMD) version:
      l.d     $f0,a($sp)
      lv      $v1,0($s0)
      mulvs.d $v2,$v1,$f0
      lv      $v3,0($s1)
      addv.d  $v4,$v2,$v3
      sv      $v4,0($s1)
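The kernel's data parallelism can be illustrated in plain Python: the "SIMD" version below applies the operation one 4-element chunk at a time, mimicking vector lanes. Real SIMD hardware would process each chunk with a single instruction; here pure Python merely stands in for that.

```python
# a*X + Y written element-by-element.
def daxpy_scalar(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

# The same kernel in chunks of `lanes` elements: each chunk models one
# vector register operated on by one SIMD instruction.
def daxpy_simd(a, x, y, lanes=4):
    out = []
    for i in range(0, len(x), lanes):
        xv, yv = x[i:i + lanes], y[i:i + lanes]        # load vector registers
        out.extend(a * xi + yi for xi, yi in zip(xv, yv))  # one vector op
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.0] * 8
print(daxpy_scalar(2.0, x, y) == daxpy_simd(2.0, x, y))  # True
```

With 4-wide lanes, 8 elements need only 2 vector operations instead of 8 scalar ones — the instruction-count reduction the slide describes.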
Increasing IPC

• How can CPI be decreased, i.e., IPC increased?
→ Increase the number of instructions that can be executed in one cycle
• ILP: Instruction-Level Parallelism
– Parallelism that exists between instructions
→ i.e., whether those instructions can be executed simultaneously
• Multiple instructions can be executed simultaneously by exploiting instruction-level parallelism.
→ IPC (Instructions Per Cycle) is increased.
Exploiting instruction-level parallelism

• Prepare multiple hardware units to execute instructions.
– Each unit can execute instructions independently.
– As many instructions can execute at the same time as there are units.
• Instructions that can be executed simultaneously must be extracted from the program.
– Static extraction: VLIW
– Dynamic extraction: superscalar
VLIW

• The compiler extracts instructions that can be executed simultaneously.
• It combines the extracted instructions into one instruction.
→ The instruction length becomes longer
(Very Long Instruction Word)
• Features
☺ No hardware is required to dynamically extract instruction-level parallelism.
☹ It requires instruction set architecture changes.
→ Programs need to be recompiled.
An example of instruction pipelining with VLIW

• In this example, an arithmetic/branch instruction and a load/store instruction can be executed simultaneously.

[Figure: two-issue pipeline. In every cycle, one arithmetic/branch instruction and one load/store instruction enter IF together and proceed through IF ID EX MEM WB side by side.]
Example: Scheduling in VLIW (1/2)

• How is the loop below scheduled on the pipeline above?

Loop: lw $t0,0($s1) #load A[i] to $t0


add $t0,$t0,$s2 #$t0=A[i]+$s2
sw $t0,0($s1) #A[i]=$t0
addi $s1, $s1, -4 #i=i-1
bne $s1,$zero, Loop #goto Loop if $s1≠0

Example: Scheduling in VLIW (2/2)

      Arithmetic/Branch slot     Load/Store slot     Clock cycle
Loop: nop (no operation)         lw   $t0,0($s1)     1
      addi $s1,$s1,-4            nop                 2
      add  $t0,$t0,$s2           nop                 3
      bne  $s1,$zero,Loop        sw   $t0,4($s1)     4

• It executes 5 instructions in 4 cycles.
– IPC = 1.25, CPI = 0.8
– Not enough compared to the maximum IPC of 2.0.
Loop Unrolling

• Prepare multiple copies of the loop body and schedule the instructions of several iterations at once.
→ This increases the available instruction-level parallelism.
(Loop unrolling)
• Using the same register across the copies causes conflicts.
→ The registers used are given different names.
(Register renaming)
An example of loop unrolling: unrolling 4 loops
Loop: lw   $t0,0($s1)      #load A[i] to $t0
      addi $s1,$s1,-16     #decrement the pointer by 4 elements
      lw   $t1,12($s1)     #load A[i-1] to $t1
      lw   $t2,8($s1)      #load A[i-2] to $t2
      lw   $t3,4($s1)      #load A[i-3] to $t3
      add  $t0,$t0,$s2     #$t0=A[i]+$s2
      add  $t1,$t1,$s2     #$t1=A[i-1]+$s2
      add  $t2,$t2,$s2     #$t2=A[i-2]+$s2
      add  $t3,$t3,$s2     #$t3=A[i-3]+$s2
      sw   $t0,16($s1)     #A[i]=$t0
      sw   $t1,12($s1)     #A[i-1]=$t1
      sw   $t2,8($s1)      #A[i-2]=$t2
      sw   $t3,4($s1)      #A[i-3]=$t3
      bne  $s1,$zero,Loop  #goto Loop if $s1≠0
An example of loop unrolling: scheduling

• 14 instructions are executed in 8 cycles.
– IPC = 1.75, CPI ≈ 0.57

      Arithmetic/Branch slot     Load/Store slot     Clock cycle
Loop: addi $s1,$s1,-16           lw   $t0,0($s1)     1
      nop                        lw   $t1,12($s1)    2
      add  $t0,$t0,$s2           lw   $t2,8($s1)     3
      add  $t1,$t1,$s2           lw   $t3,4($s1)     4
      add  $t2,$t2,$s2           sw   $t0,16($s1)    5
      add  $t3,$t3,$s2           sw   $t1,12($s1)    6
      nop                        sw   $t2,8($s1)     7
      bne  $s1,$zero,Loop        sw   $t3,4($s1)     8
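Both IPC figures from the scheduling slides can be recomputed from the bundle layouts; each tuple below is one cycle, holding the arithmetic/branch slot and the load/store slot (only opcodes matter for the count):

```python
# IPC = real operations (non-nop slots) / number of bundles (cycles).
def ipc(bundles):
    ops = sum(1 for bundle in bundles for slot in bundle if slot != "nop")
    return ops / len(bundles)

plain = [("nop", "lw"), ("addi", "nop"), ("add", "nop"), ("bne", "sw")]
unrolled = [("addi", "lw"), ("nop", "lw"), ("add", "lw"), ("add", "lw"),
            ("add", "sw"), ("add", "sw"), ("nop", "sw"), ("bne", "sw")]
print(ipc(plain))     # 1.25 (5 instructions in 4 cycles)
print(ipc(unrolled))  # 1.75 (14 instructions in 8 cycles)
```

Unrolling fills more of the issue slots, but the two nops show that even this schedule does not reach the pipeline's maximum IPC of 2.0.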
Superscalar

• Instructions that can be executed simultaneously are extracted dynamically, in hardware.
☺ Recompiling is not required.
☹ Hardware is required to extract the ILP.
• Processing flow of a superscalar processor
– Fetched instructions are held in a reservation station prepared for each functional unit.
– An instruction is executed as soon as its functional unit becomes available.
– The results are stored in the reorder buffer in the commit unit.
– The results are then stored in registers or memory, in program order, as soon as possible.
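A toy model of the dynamic-scheduling idea (not a faithful reservation-station simulator): each cycle, any not-yet-executed instruction whose source registers are ready runs, and its result becomes usable in the following cycle, so independent instructions overtake dependent ones. Register names and initial state are illustrative:

```python
# program: list of (name, destination register, source registers).
def schedule(program):
    ready = {"$s1", "$s2"}              # registers assumed ready at the start
    done, finish_cycle, cycle = set(), {}, 0
    while len(done) < len(program):
        cycle += 1
        # issue everything whose operands are ready (unlimited units, for simplicity)
        issued = [(n, d) for n, d, srcs in program
                  if n not in done and all(s in ready for s in srcs)]
        if not issued:                  # unsatisfiable dependency: stop
            break
        for name, dest in issued:
            done.add(name)
            finish_cycle[name] = cycle
        ready |= {d for _, d in issued if d}   # results usable next cycle

    return finish_cycle

prog = [("lw1",  "$t0", ["$s1"]),
        ("add1", "$t1", ["$t0", "$s2"]),   # depends on lw1: waits a cycle
        ("lw2",  "$t2", ["$s1"])]          # independent: runs in cycle 1
print(schedule(prog))  # {'lw1': 1, 'lw2': 1, 'add1': 2}
```

The independent lw2 finishes before add1, ahead of its program-order position — the out-of-order execution the diagram on the next slide depicts; a real commit unit would still retire the results in order.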
The structure of a superscalar processor

[Figure: the instruction fetch/decode unit issues instructions in order to a row of reservation stations, one per functional unit (integer ALUs, floating-point unit, load/store unit). The functional units execute out of order, and the commit unit commits the results in order.]
The pipeline configuration of Core i7 processor

[Figure: the pipeline of the Core i7 processor. Source: David A. Patterson and John L. Hennessy, "Computer Organization and Design: The Hardware/Software Interface, Fifth Edition".]
Exploiting thread-level parallelism

• There are limits to instruction-level parallelism.
– IPC is about 2 on average at most.
– This is due to dependencies between instructions.
• Thread-level parallelism has therefore attracted attention.
– Thread:
• A sequence of instructions that has no dependencies on other threads.
• A program consists of one or more threads.
• Because there are no dependencies between them, threads can be executed at the same time.
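Independent threads can be sketched with the Python standard library; the tasks below are arbitrary placeholders that share no data, so they can safely run at the same time:

```python
from concurrent.futures import ThreadPoolExecutor

# An independent unit of work: no shared state with the other tasks.
def task(n):
    return sum(range(n))

# Four worker threads, one per task; results come back in submission order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(task, [10, 100, 1000, 10000]))
print(results)  # [45, 4950, 499500, 49995000]
```

Whether these threads actually run simultaneously depends on the hardware beneath them — which is exactly what SMT and multi-core processors, described next, provide.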
SMT(Simultaneous Multi Threading)

• SMT applies the hardware of a superscalar processor.
– Many functional units sit unused even though they are provided.
→ Instructions of another thread are executed on the free functional units.
• Problems
– Hardware is required to manage multiple threads.
– Performance per thread decreases due to conflicts between threads.
Multi-Core processors

• Multiple processors on one chip
• Different threads can be executed simultaneously, one on each processor.
– SMT can also be adopted in each processor.

[Figure: each processor contains its own instruction fetch/decode unit, reservation stations, functional units, and commit unit; four such processors share an LLC (Last Level Cache) on one chip.]
Exploiting instruction/thread-level parallelism

[Figure: issue slots (horizontal) versus time (vertical) for a normal processor, a superscalar, SMT, a multi-core, and multi-core + SMT. Threads 1-8 fill progressively more of the issue slots as more forms of parallelism are exploited.]
Improving the performance of a processor

• Exploiting parallelism
– Data level: SIMD instructions
– Instruction level: superscalar, VLIW
– Thread level: SMT, multi-core
• However, if the parallelism does not exist, performance will not improve.
→ Programming techniques that increase parallelism are required.
Limits of performance improvement through parallelism

• Amdahl's law
– The limit of the performance improvement obtained by parallelization is determined by the proportion of the program that cannot be executed in parallel.
– If 10% of the program cannot be parallelized, the speedup can never exceed 10 times.

[Figure: execution time with a serial part of 10 and a parallelizable part of 90. 1 processor: 90 + 10 = 100; 2 processors: 45 + 10 = 55; 10 processors: 9 + 10 = 19; 1000 processors: ≈ 10.]
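Amdahl's law in code, reproducing the 10%-serial example from the figure:

```python
# Amdahl's law: with serial fraction s, the speedup on p processors is
# 1 / (s + (1 - s) / p); as p grows it approaches 1 / s.
def amdahl_speedup(serial_fraction, processors):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)

for p in (1, 2, 10, 1000):
    print(p, round(amdahl_speedup(0.1, p), 2))
# 1 -> 1.0, 2 -> 1.82, 10 -> 5.26, 1000 -> 9.91
```

Even with 1000 processors the speedup stays below 10, because the 10% serial part always takes its full time.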
