15 - 370w18 - Pipeline Last

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 53

15.

 Basic  Processor  Design  –  


Pipelining  With  Control  Hazards  

EECS 370 – Introduction to Computer Organization – Winter 2018

Mark Brehob, Reetu Das and Harry Davis

EECS Department
University of Michigan in Ann Arbor, USA

© Davis, Das, and Brehob, 2018


The material in this presentation cannot be
copied in any form without our written permission
Announcements

q  Homework 4 due today 3/8


q  Midterm re-grade request due by 3/12
•  Need detailed description of re-grade request, sign form to acknowledge
everything will be re-graded

The University of Michigan 2


Review: Control hazards

beq 1 1 10
add 3 4 5

time

beq fetch decode execute memory writeback

add fetch decode execute

EECS 370: Introduction to The University of Michigan 3


Computer Organization
Review: Approaches to handling control hazards
q  Avoid
•  Make sure there are no hazards in the code.

q  Detect and stall


•  Delay fetch until branch resolved.

q  Speculate and Squash-if-Wrong aka Branch Prediction


•  Go ahead and fetch more instructions in case it is correct, but stop them if
they shouldn’t have been executed.

EECS 370: Introduction to The University of Michigan 4


Computer Organization
Review: Branch Prediction
•  Predict the next fetch address (to be used in the next cycle)

•  Requires three things to be predicted at fetch stage:


•  Whether the fetched instruction is a branch
•  Branch direction (if conditional)
•  Branch target address (if direction is taken)

•  Observation: Target address remains the same for a conditional


direct branch across dynamic instances
•  Store the target address from previous instance and access it with the PC
•  Called Branch Target Buffer (BTB) or Branch Target Address Cache

EECS 370: Introduction to The University of Michigan 5


Computer Organization
Review: Fetch Stage with Branch Prediction
Direction predictor (2-bit counters)

taken?

PC + inst size Next Fetch


Address
Program
hit?
Counter

Address of the
current branch

target address

Cache of Target Addresses (BTB: Branch Target Buffer)


EECS 370: Introduction to The University of Michigan
Computer Organization
Review: State Machine for 2-bit Saturating Counter
actually   actually  
taken   pred   !taken   pred   “weakly  
taken   taken   taken”  
11   actually   10  
“strongly   taken  
taken”  
actually   actually  
taken   !taken  
“strongly  
“weakly   actually   !taken”  
pred   pred  
!taken”   !taken   !taken   !taken  
01   00   actually  
actually   !taken  
taken  
EECS 370: Introduction to The University of Michigan
Computer Organization
7
Branch prediction

q  Predict not taken: ~50% accurate.


q  Predict backward taken: ~65% accurate.
q  Predict same as last time: ~80% accurate.

q  Pentium: ~85% accurate.


q  Pentium Pro: ~92% accurate.
q  Best paper designs: ~96% accurate.

EECS 370: Introduction to The University of Michigan 8


Computer Organization
Can We Do Better?

Last-time and 2BC predictors exploit “last-time” predictability

Realization 1: A branch’s outcome can be correlated with


other branches’ outcomes
Global branch correlation

Realization 2: A branch’s outcome can be correlated with past


outcomes of the same branch (other than the outcome of
the branch “last-time” it was executed)
Local branch correlation

EECS 370: Introduction to The University of Michigan


Computer Organization 9
5 Global Branch Prediction
In the local branch prediction scheme, the only patterns considered are t
Global
branch.Branch
AnotherCorrelation
scheme proposed by Yeh and Patt[YP92] is able to take
recent branches to make a prediction. One implementation of such an
Recently
in Figureexecuted branch
6. A single shift outcomes in records
register GR, the execution path is
the direction taken by
correlated branches.
conditional with the outcome
Since theof the next
branch branch
history is global to all branch
called global branch prediction in this paper.
Global branch prediction is able to take advantage of two types
the direction take by the current branch may depend strongly on othe
If Consider the example
first branch below:
not taken, second also not taken
if (x<1)
if (x>1)

If firstUsing
branch taken,
global second
history definitely
prediction, we not takento base the predictio
are able
on the direction taken by the first if. If (x<1), we know that the sec
EECS 370: Introduction to The University of Michigan
Computer Organization 10
6
Two Level Global Branch Prediction
First level: Global branch history register (N bits)
The direction of last N branches
Second level: Table of saturating counters for each history entry
The direction the branch took the last time the same history was seen
Pattern History Table (PHT)
00 …. 00
1 1 ….. 1 0 00 …. 01
2 3
previous one 00 …. 10

GHR
(global index
history 0 1
register) 11 …. 11

EECS 370: Introduction to The University of Michigan


Computer Organization 11
Can We Do Better?

Last-time and 2BC predictors exploit “last-time” predictability

Realization 1: A branch’s outcome can be correlated with


other branches’ outcomes
Global branch correlation

Realization 2: A branch’s outcome can be correlated with past


outcomes of the same branch (other than the outcome of
the branch “last-time” it was executed)
Local branch correlation

EECS 370: Introduction to The University of Michigan


Computer Organization 12
Local Branch Correlation

McFarling, “Combining Branch Predictors,” DEC WRL TR 1993.

EECS 370: Introduction to The University of Michigan


Computer Organization 13
Two Level Local Branch Prediction
First level: A set of local history registers (N bits each)
Select the history register based on the PC of the branch
Second level: Table of saturating counters for each history entry
The direction the branch took the last time the same history was seen
Pattern History Table (PHT)
Local history registers
00 …. 00
1 1 ….. 1 0 00 …. 01
2 3
00 …. 10

index
0 1
11 …. 11

EECS 370: Introduction to The University of Michigan


Computer Organization 14
What Do Architects Do For Fun?
Alpha 21264 Tournament Predictor

•  Minimum branch penalty: 7 cycles


•  Typical branch penalty: 11+ cycles
•  48K bits of target addresses stored in I-cache
•  Predictor tables are reset on a context switch
EECS 370: Introduction to The University of Michigan
Computer Organization
What can go wrong?

q  Data hazards: since register reads occur in stage 2 and register writes
occur in stage 5 it is possible to read the wrong value if is about to be
written.
q  Control hazards: A branch instruction may change the PC, but not until
stage 4. What do we fetch before that?
q  Exceptions: How do you handle exceptions in a pipelined processor with 5
instructions in flight?

EECS 370: Introduction to The University of Michigan 16


Computer Organization
Exceptions
q  Exception: when something unexpected happens during
program execution.
•  Example: divide by zero.
q  More complex for pipelined implementations.
•  Multiple instructions executing at the same time.
•  Simple case: ALU overflow.
-  “flush the pipeline after the exception”.
-  “handle the exception”.
n  Identify address of instruction causing exception (PC+1 in ID/
EX pipeline register).
n  JALR to exception handler.

EECS 370: Introduction to The University of Michigan 17


Computer Organization
Early exceptions

q  What about an early (fetch) exception?


•  Maybe a mis-speculated fetch.
-  Branch is wrong, fetching “down the wrong path”.
•  Solution.
-  Delay the handling of an exception until it is known to
be a “real” problem.

EECS 370: Introduction to The University of Michigan 18


Computer Organization
Late exceptions

q  What about a late (WB) exception?


•  When does a sw modify state?
-  In the memory access stage which means it may have
completed the write before the exception (to the
instruction that logically precedes it) occurs.
•  Solution:
-  Delay the memory write until writeback.
-  What happens if the sw is followed by a lw?

EECS 370: Introduction to The University of Michigan 19


Computer Organization
Multiple exceptions
q  What about simultaneous exceptions?
•  div 10 0 5
•  badop 4 4 4
q  They will generate a divide by 0 and an invalid opcode at the
same cycle.
q  Solution: assign priority
•  Generally deepest pipeline exception is handled.
•  Later exceptions can usually be ignored.

EECS 370: Introduction to The University of Michigan 20


Computer Organization
Building  with  Pipelines  
•  CPI  for  pipelining:  
•  1  (ideal  case  -­‐  no  stalls)  
•  >  1  (reality,  depends  on  program)  
•  What  if  we  want  to  improve  performance  more?  
•  Want  CPI  as  low  as  possible  –  lower  than  1  
•  Use  Parallelism  
•  InstrucIon  Level  Parallelism  (ILP)  –  Within  task  
•  Thread  Level  Parallelism  (TLP)  –  Having  many  tasks  
•  Data  Level  Parallelism  (DLP)  –  Many  tasks  with  
same  instrucIons  
EECS 370: Introduction to The University of Michigan 21  
Computer Organization
Crea=ng  more  pipelines  

q  InstrucIon  Level  Parallelism  –  Superscalar  Pipeline  


•  Have  two  or  more  pipelines  in  same  processor  
•  pipelines  need  to  work  in  tandem  to  improve  single  program  
performance  
q  Thread  Level  Parallelism  –  MulI-­‐core  
•  Have  two  or  more  processors  (Independent  Pipelines)  
•  Need  more  programs  or  a  parallel  program  
•  does  not  improve  single  program  performance  
q  Data  Level  Parallelism  –  Single  Inst.  MulIple  Data  (SIMD)  
•  Have  two  or  more  execuIon  pipelines  (ID-­‐>WB)  
•  Share  the  same  fetch  and  control  pipeline  to  save  power  (IF+cont.)  
•  Similar  to  GPU’s  

EECS 370: Introduction to The University of Michigan


22  
 
Computer Organization
ILP  Techniques:  Superscalar  
Write  
Fetch   Decode   Execute   Memory  
back  
A  
L   M  
U   U  
X  
PC  

Instr.   Register   Data  


Memory   File   Memory   Write  
Execute  
back  
A  
L   M  
U   U  
X  

EECS 370: Introduction to The University of Michigan


Computer Organization
23  
Other  Techniques  for  ILP:  Out  of  Order  Execu=on  
q  EliminaIng  stall  condiIons  decreases  CPI  
q  Reorder  instrucIons  to  avoid  stalls  
q  Example  (5-­‐stage  LC2K  pipeline):  
 
add          1  2  3   add      1  2  3  
lw              3  2  16   lw          3  2  16  
nand      2  6  7   add      4  5  1  
add          4  5  1   nand  2  6  7  

EECS 370: Introduction to The University of Michigan


Computer Organization
Why  Use  Out  of  Order  Execu=on?  

q  Some  instrucIons  take  a  long  Ime  to  execute  


•  FloaIng  point  operaIons  
•  Some  loads  and  stores  (more  when  we  talk  about  
memory  hierarchy)  
q  OpIons:  
•  Increase  cycle  Ime  
•  Increase  number  of  pipeline  stages  
•  Execute  other  instrucIons  while  you  wait  

EECS 370: Introduction to The University of Michigan


Computer Organization
TLP  Techniques:  Mul=processors  
Fetch   Decode   Execute   Memory   Write  
back  
A  
PC   Register   L   M  
U  
File   U  
X  

Instr.   Data  
Memory   Memory  

Register   A  
L   M  
PC   File   U   U  
X  

EECS 370: Introduction to The University of Michigan


Computer Organization
Other  Techniques  for  TLP:  Mul=-­‐Threading  
Fetch   Decode   Execute   Memory   Write  
back  
A  
PC  
PC  
Instr.   Register  
Register   L   Data   M  
Memory   File   U   Memory   U  
File   X  

q  Virtual  MulIprocessor  (MulI-­‐Threading  or  HyperThreading)  


•  Duplicate  the  state  (PC,  Registers)  but  Ime  share  hardware  
•  User/OperaIng  system  see  2  cores,  but  only  one  execuIon  
•  Used  to  hide  long  latencies  (i.e.  memory  access  to  disk)  

EECS 370: Introduction to The University of Michigan


Computer Organization
DLP  Techniques:  Single  Instr.  Mul=ple  Data  (SIMD)  
Fetch   Decode   Execute   Memory   WB  

Register    
PC  

Instr.   File    
Mem    

Register    
File    
 
Data  
Memory  
Register    
File    
 

Register    
File    
 

EECS 370: Introduction to The University of Michigan 28  


Computer Organization
Building  a  GPU  
  q  Add  special  funcIonal  
RF    
units  in  EX  
PC  

 
 
RF    
 
 
RF     q  Combine  Techniques  
 
  •  SIMD  +  MP  +  MT=  SIMT  
Inst.  Mem  

RF  

Data  Mem  
 
 

 
q  MT  used  to  hide  
RF     memory  latencies  
PC  

 
 
RF     q  SIMD  used  to  decrease  
 
  power  of  fetch/control  
RF    
 
 
q  MP  used  to  improve  
RF     throughput  
 
EECS 370: Introduction to The University of Michigan 29  
Computer Organization
LC2k  Pipeline  Summary  
Fetch   Decode   Execute   Memory   WB  
PC  read   Need  register  values   Branches  resolved   Register  values  produced  
q  Data  hazards  
q  Hazard  exists  if  producer-­‐consumer  of  a  register  within  a  2  instrucIon  window  
q Note  for  project,  the  window  is  3  instrucIons  
q  Detect  and  stall  –  insert  enough  noops  to  separate    producer  and  consumer  by  >  
2  instrucIons  (>  3  instrucIons  for  project)  
q  Detect  and  forward  
q Handles  all  cases  except  LW-­‐USE,  need  1  noop  here  
q  Control  hazards  
q  Detect  and  stall  –  needs  3  noops  inserted  aler  each  branch  
q  Predict  and  squash  
q Zero  noops  if  predict  correctly  
q 3  if  predict  incorrectly  

EECS 370: Introduction to The University of Michigan 30  


Computer Organization
Review:  basic  performance  equa=on  

q  ExecuIon  Ime  (Time/Program)  =  


•  #  of  instr  (I/P)  ×  CPI  (C/I)  ×  cycle  Ime  (T/C)  
 
q  MulI-­‐cycle  decreases  cycle  Ime,  but  increases  CPI.  
 
q  Pipelining  decreases  CPI  
•  Approaches  1.0  if  no  stalls  (hazards  that  are  fixed  by  
stalling).  

EECS 370: Introduction to The University of Michigan 31  


Computer Organization
Calcula=ng  performance  with  no  stalls  
   
How  many  cycles  does  this    
code  take  to  execute?  

add        1    2    3  
nor  1    4    5  
add  4    6    7  
What  value  is  wri]en  to  the  
ALU  result  field  of  the  Mem/WB  
pipeline  register  at  the  end  of  cycle  5.  

EECS 370: Introduction to The University of Michigan 32  


Computer Organization
Calcula=ng  performance  with  no  stalls  
   
How  many  cycles  does  this    
code  take  to  execute?  
No  stalls  –  Final  WB  @  cycle  7  
add        1    2    3  
nor  1    4    5  
add  4    6    7  
What  value  is  wri]en  to  the  
ALU  result  field  of  the  Mem/WB  
pipeline  register  at  the  end  of  cycle  5.  
nor  result  

EECS 370: Introduction to The University of Michigan 33  


Computer Organization
Calcula=ng  performance  with  data  hazards    
(detect  and  stall)  
 
  How  many  data  hazards  are  there  
in  this  code?  
add        1    2    3  
nor  3    4    5  
add  3    5    6   How  many  stall  cycles  if  we  use    
detect  and  stall  to  handle  the    
hazards?  
Time:   1   2   3   4   5   6   7   8   9   10   11  
add  1  2  3   IF   ID   EX   ME   WB  

nand  3  4  5  
add  3  5  6  

EECS 370: Introduction to The University of Michigan 34  


Computer Organization
Calcula=ng  performance  with  data  hazards    
(detect  and  stall)  
 
  How  many  data  hazards  are  there  
in  this  code?  
add        1    2    3  
3  data  hazards  
nor  3    4    5  
add  3    5    6   How  many  stall  cycles  if  we  use    
detect  and  stall  to  handle  the    
hazards?  
Time:   1   2   3   4   5   6   7   8   9   10   11  
add  1  2  3   IF   ID   EX   ME   WB  

 Stall  :    4  cycles     nor  3  4  5   IF   ID*   ID*   ID   EX   ME   WB  

Total  :  11  cycles   add  3  5  6   IF*   IF*   IF   ID*   ID*   ID   EX   ME   WB  

EECS 370: Introduction to The University of Michigan 35  


Computer Organization
Calcula=ng  performance  with  data  hazards    
(detect  and  forward)  
 
 
add        1    2    3   Where  do  the  values  for  the  second  add    
nor  3    4    5   instruc=on  come  from?  
add  3    5    6  
lw      3    6    7  
add  6    6    1  
How  many  stall  cycles  on  the  
LC2K  pipelined  datapath  with  
data  forwarding  from  lecture?  

EECS 370: Introduction to The University of Michigan 36  


Computer Organization
Calcula=ng  performance  with  data  hazards    
(detect  and  forward)  
 
 
add        1    2    3   Where  do  the  values  for  the  second  add    
nor  3    4    5   instruc=on  come  from?  
add  3    5    6   From  Mem/WB  and  EX/Mem  
lw      3    6    7  
add  6    6    1  
How  many  stall  cycles  on  the  
LC2K  pipelined  datapath  with  
data  forwarding  from  lecture?  

1  stall  for    lw  ➜  add  

EECS 370: Introduction to The University of Michigan 37  


Computer Organization
Calcula=ng  performance  with  control  hazards  
(speculate  and  squash)  

q  How  many  cycles  are  saved  if  you  perform  speculate  and  squash  for  
the  following  code  (assume  that  branches  are  predicted  to  be  not  
taken)?   add        1    2    3  
  beq      1    5    1  
  nor  6    4    1  
  add  3    4    5  
q  Assume  the  branch  is  taken:  How  many  cycles  to  execute  this  code?  
 
q  Assume  the  branch  is  not  taken:  How  many  cycles  execute  this  code?  

EECS 370: Introduction to The University of Michigan 38  


Computer Organization
Calcula=ng  performance  with  control  hazards  
(speculate  and  squash)  

q  How  many  cycles  are  saved  if  you  perform  speculate  and  squash  for  
the  following  code  (assume  that  branches  are  predicted  to  be  not  
taken)?   add        1    2    3  
  beq      1    5    1  
  nor  6    4    1  
  add  3    4    5  
q  Assume  the  branch  is  taken:  How  many  cycles  to  execute  this  code?  
   3  instr  +  3  stalls  +  4  to  empty  pipe  =  10  cycles  
q  Assume  the  branch  is  not  taken:  How  many  cycles  execute  this  code?  
4  instr  +  4  to  empty  pipe  =  8  cycles  
EECS 370: Introduction to The University of Michigan 39  
Computer Organization
Calcula=ng  performance  with  control  hazards  
Assume  the  first  branch  is  taken  50%   add        1    2    3  
of  the  Ime  and  the  loop  iterates  100   beq      1    5    1  
Imes,  and  forwarding  for  all  data   lw  6    4    1  
hazards.   add  3    4    5  
1.  How  many  cycles  does  the  code   beq            5    7    -­‐5  
take  assuming  detect  and  stall  for   halt  
control  hazards?  
Assume halt is resolved
in WB stage

EECS 370: Introduction to The University of Michigan 40  


Computer Organization
Calcula=ng  performance  with  control  hazards  
Assume  the  first  branch  is  taken  50%   add        1    2    3  
of  the  Ime  and  the  loop  iterates  100   beq      1    5    1  
Imes,  and  forwarding  for  all  data   lw  6    4    1  
hazards.   add  3    4    5  
1.  How  many  cycles  does  the  code   beq            5    7    -­‐5  
take  assuming  detect  and  stall  for   halt  
control  hazards?  

 #  InstrucIons  =  100*(0.5*5  +  0.5*4)  +  1  =  451  


Time  =  451    +  load  stalls  +  branch  stalls  +  empty  pipe  
Time  =  451  +  100*0.5*1  +    
Time  =  451  +  100*0.5*1  +  (100*3  +  100*3)  +  4  
Time  =  1105  
EECS 370: Introduction to The University of Michigan 41  
Computer Organization
Calcula=ng  performance  with  control  hazards  
Assume  the  first  branch  is  taken   add        1    2    3  
50%  of  the  Ime  and  the  loop   beq      1    5    1  
iterates  100  Imes,  and  forwarding   lw  6    4    1  
for  all  data  hazards.   add  3    4    5  
2.  How  many  cycles  does  the  code   beq            5    7    -­‐5  
take  assuming  speculate  and   halt  
squash  where  all  branches  are  
predicted  not  taken?  

EECS 370: Introduction to The University of Michigan 42  


Computer Organization
Calcula=ng  performance  with  control  hazards  
Assume  the  first  branch  is  taken  50%   add        1    2    3  
of  the  Ime  and  the  loop  iterates  100   beq      1    5    1  
Imes,  and  forwarding  for  all  data   lw  6    4    1  
hazards.   add  3    4    5  
2.  How  many  cycles  does  the  code   beq            5    7    -­‐5  
take  assuming  speculate  and   halt  
squash  where  all  branches  are  
predicted  not  taken?  
 #  InstrucIons  =  100*(0.5*5  +  0.5*4)  +  1  =  451  
Time  =  451  +  load  stalls  +  branch  stalls  +  empty  pipe  
Time  =  451  +  100*0.5*1  +  (100*0.5*3  +  99*3)  +  4  
Time  =  952  

EECS 370: Introduction to The University of Michigan 43  


Computer Organization
Calcula=ng  performance  with  control  hazards  
Assume  the  first  branch  is  taken   add        1    2    3  
50%  of  the  Ime  and  the  loop   beq      1    5    1  
iterates  100  Imes,  and  forwarding   lw  6    4    1  
for  all  data  hazards.   add  3    4    5  
3.  How  many  cycles  does  the  code   beq            5    7    -­‐5  
take  assuming  speculate  and   halt  
squash  where  backward  
branches  are  predicted  taken    
and  forward  branches  not  
taken?  

EECS 370: Introduction to The University of Michigan 44  


Computer Organization
Calcula=ng  performance  with  control  hazards  
Assume  the  first  branch  is  taken  50%   add        1    2    3  
of  the  Ime  and  the  loop  iterates  100   beq      1    5    1  
Imes,  and  forwarding  for  all  data   lw  6    4    1  
hazards.   add  3    4    5  
3.  How  many  cycles  does  the  code   beq            5    7    -­‐5  
take  assuming  speculate  and   halt  
squash  where  backward  
branches  are  predicted  taken    
and  forward  branches  not  taken,  
andBTB  has  all  entrys  to  start?  
 #  InstrucIons  =  100*(0.5*5  +  0.5*4)  +  1  =  451  
Time  =  451    +  load  stalls  +  branch  stalls  +  empty  pipe  
Time  =  451  +  100*0.5*1  +  (100*0.5*3  +  1*3)  +  4  
Time  =  658  
EECS 370: Introduction to The University of Michigan 45  
Computer Organization
Calcula=ng  performance  with  control  hazards  

Assume  the  first  branch  has  the  pasern  TTTN   add        1    2    3  


that  repeats,  and  the  loop  is  iterated  100  Imes   beq      1    5    1  
4.  How  many  cycles  does  the  code  take  if  a  2-­‐ lw  6    4    1  
bit  counter  BTB  is  used  to  predict  each   add  3    4    5  
branch,  how  many  cycles  does  the  code   beq            5    7    -­‐5  
take?  Assume  iniIal  state  of  branch   halt  
predictor  counter  is    “10”  (WT)  

EECS 370: Introduction to The University of Michigan 46  


Computer Organization
Calcula=ng  performance  with  control  hazards  

Assume  the  first  branch  has  the  pasern  TTTN   add        1    2    3  


that  repeats,  and  the  loop  is  iterated  100  Imes   beq      1    5    1  
4.  How  many  cycles  does  the  code  take  if  a  2-­‐ lw  6    4    1  
bit  counter  BTB  is  used  to  predict  each   add  3    4    5  
branch,  how  many  cycles  does  the  code   beq            5    7    -­‐5  
take?  Assume  iniIal  state  of  branch   halt  
predictor  counter  is    “10”  (WT)  
√ √ √ X   √ √ √ X   √ √ √ X   beq  2  is  correct  99  Imes,  
beq  1   T  T  T  N   T  T  T  N   T  T  T  N   •  •  •    then  incorrect  last  iter  
 #  InstrucIons  =  100*(0.25*5  +  0.75*4)  +  1  =  426  
Time  =  426    +  load  stalls  +  branch  stalls  +  empty  pipe  
Time  =  426  +  100*0.25*1  +  100*0.25*3  +  1*3  +  4  
Time  =    533  
EECS 370: Introduction to The University of Michigan 47  
Computer Organization
Classic  performance  problem  
q  Program  with  following  instrucIon  breakdown:  
 lw    10%  
 sw    15%  
 beq    25%  
 R-­‐type  50%  
q  Speculate  “always  not-­‐taken”  and  squash.    80%  of  branches  not-­‐taken  
q  Full  forwarding  to  execute  stage.    20%  of  loads  stall  for  1  cycle  
q  What  is  the  CPI  of  the  program?  
q  What  is  the  total  execuIon  Ime  if  cycle  Ime  is  100MHz?  
 

EECS 370: Introduction to The University of Michigan 48  


Computer Organization
Classic  performance  problem  
q  Program  with  following  instrucIon  breakdown:  
 lw    10%  
 sw    15%  
 beq    25%  
 R-­‐type  50%  
q  Speculate  “always  not-­‐taken”  and  squash.    80%  of  branches  not-­‐taken  
q  Full  forwarding  to  execute  stage.    20%  of  loads  stall  for  1  cycle  
q  What  is  the  CPI  of  the  program?  
q  What  is  the  total  execuIon  Ime  if  cycle  Ime  is  100MHz?  
 
CPI  =  1  +  0.10  (loads)  *  0.20  (load  use  stall)*1      
                           +  0.25  (branch)  *  0.20  (miss  rate)*3  
CPI  =  1  +  0.02  +  0.15  =  1.17  
Time  =    1.17  *  10ns  =  11.7ns  

EECS 370: Introduction to The University of Michigan 49  


Computer Organization
Classic  performance  problem  (cont.)  

q  Assume  branches  are  resolved  at  Execute?  


•  What  is  the  CPI?  
•  What  happens  to  cycle  Ime?  
•  What  is  the  total  execuIon  Ime?  
 

q  What  if  branches  are  resolved  at  Decode?  

EECS 370: Introduction to The University of Michigan 50  


Computer Organization
Classic  performance  problem  (cont.)  

q  Assume  branches  are  resolved  at  Execute?  


•  What  is  the  CPI?  
•  What  happens  to  cycle  Ime?  
•  What  is  the  total  execuIon  Ime?  
 
CPI  =  1  +  0.10  (loads)  *0.20  (load  use  stall)*1  
                             +  0.25  (branch)  *  0.20  (miss  rate)*2  
CPI  =  1  +  0.02  +  0.1  =  1.12  

q  What  if  branches  are  resolved  at  Decode?  


CPI  =  1  +  0.10  (loads)  *0.20  (load  use  stall)*1    
                             +  0.25  (branch)  *  0.20  (miss  rate)*1  
CPI  =  1  +  0.02  +  0.05  =  1.07  

EECS 370: Introduction to The University of Michigan 51  


Computer Organization
Performance  with  deeper  pipelines  

q  Assume  the  setup  of  the  previous  problem.  


q  What  if  we  have  a  10  stage  pipeline?  
•  InstrucIons  are  fetched  at  stage  1.  
•  Register  file  is  read  at  stage  3.  
•  ExecuIon  begins  at  stage  5.  
•  Branches  are  resolved  at  stage  7.  
•  Memory  access  is  complete  in  stage  9.  
q  What’s  the  CPI  of  the  program?  
q  If  the  clock  rate  was  doubled  by  doubling  the  pipeline  
depth,  is  performance  also  doubled?  

EECS 370: Introduction to The University of Michigan


Computer Organization
Performance  with  deeper  pipelines  

q  Assume  the  setup  of  the  previous  problem.  


q  What  if  we  have  a  10  stage  pipeline?  
•  InstrucIons  are  fetched  at  stage  1.  
•  Register  file  is  read  at  stage  3.  
•  ExecuIon  begins  at  stage  5.  
•  Branches  are  resolved  at  stage  7.  
•  Memory  access  is  complete  in  stage  9.  
q  What’s  the  CPI  of  the  program?  
q  If  the  clock  rate  was  doubled  by  doubling  the  pipeline  
depth,  is  performance  also  doubled?  
CPI  =  1  +  0.10  (loads)  *0.20  (load  use  stall)*4  +  0.25  (branch)  *  0.20  (N  stalls)*6  
CPI  =  1  +  0.08  +  0.30  =  1.38  
Time  =  1.38  *  5ns  =  6.9  ns  
EECS 370: Introduction to The University of Michigan
Computer Organization

You might also like