
ECPE13

Computer Architecture and


Organization
Dr. P. Maheswaran
Assistant Professor, Dept. of ECE
NIT Trichy

mahes@nitt.edu
Syllabus
Unit 3: Basic Processing Unit


Fundamental concepts: ALU, Control unit, Multiple bus organization

Hardwired control, Micro programmed control,

Pipelining, Data hazards, Instruction hazards,

Influence on instruction sets, Data path and control considerations, Performance
considerations.
Fundamental concepts:

A computing task consists of a series of operations specified by a sequence of machine-language instructions.

The processor fetches one instruction at a time and performs the operation specified.

The processor uses the program counter (PC) to keep track of the address of the next instruction.

After fetching an instruction, the contents of the PC are updated to point to the next instruction in sequence.

A branch instruction may cause a different value to be loaded into the PC.

When an instruction is fetched, it is placed in the instruction register, IR.

It is interpreted, or decoded, by the processor’s control circuitry.

To execute an instruction, the processor has to perform the following steps:
Fundamental concepts:

The instruction fetch phase: Fetching an instruction and loading it into the IR.

The instruction execution phase: Performing the operation specified in the instruction.

The operation specified by an instruction can be carried out by performing one or more of the following actions:


The hardware components needed to perform these actions are shown.

The register file is a memory unit whose storage locations are organized to form the processor’s
general-purpose registers.

During execution, the contents of the registers named in an instruction that performs an
arithmetic or logic operation are sent to the ALU.
Fundamental concepts: Data Processing Hardware

A computation operates on data stored in registers.

These data are processed by combinational circuits such as adders, and the results are placed into a register.

A clock signal is used to control the timing of data transfers.

The registers are built from edge-triggered flip-flops; data are loaded into them at the rising edge of the clock.

The clock period must be long enough to allow the combinational circuit to produce the correct result.

The combinational logic circuit can be broken down into several simpler steps.

Each step is performed by a subcircuit of the original circuit.

These subcircuits are cascaded into a multistage structure.

If n stages are used, the operation will be completed in n clock cycles.

Because these combinational subcircuits are smaller, they can complete their
operations in less time, and hence a shorter clock period can be used.
A key advantage of the multi-stage structure is that it is suitable for pipelined
operation. This structure is useful for implementing processors that
have a RISC-style instruction set.
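The timing trade-off can be sketched in Python. The stage delays below are invented for illustration; the point is only that the clock period must cover the slowest stage, while an n-stage operation completes after n cycles.

```python
# Splitting one combinational circuit into n cascaded stages:
# the clock period must be long enough for the slowest stage,
# and an operation entering the structure completes after n cycles.

def clock_period(stage_delays):
    """The clock must accommodate the slowest stage."""
    return max(stage_delays)

def latency(stage_delays):
    """An n-stage operation completes in n clock cycles."""
    return len(stage_delays) * clock_period(stage_delays)

# Illustrative delays (in ns): one big circuit vs. three subcircuits.
single_stage = [9.0]
three_stages = [3.0, 3.0, 3.0]

# The multistage version allows a much shorter clock period,
# which is what makes pipelined (overlapped) operation attractive.
assert clock_period(three_stages) < clock_period(single_stage)
```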
Fundamental concepts: Instruction Execution
Load Instructions:

The Index addressing mode is used to load a word from memory location X + [R7] into register R5.

Some of these actions can be performed at the same time.

Assume that the processor has five hardware stages (commonly used arrangement in RISC-style processors).

Execution of each instruction is divided into five steps:

Each step is carried out by one hardware stage.
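A toy Python trace of the five steps for a Load with Index addressing (register and memory contents are invented; in real hardware the five stages operate on different instructions in parallel):

```python
# A behavioral trace of the five-step execution of "Load R5, X(R7)":
# 1. Fetch, 2. Decode + read source register, 3. Compute X + [R7],
# 4. Memory read, 5. Write the result into R5.

def execute_load(regs, memory, X, rs, rd):
    # Step 1: fetch (represented only symbolically here).
    # Step 2: decode and read the source register.
    base = regs[rs]
    # Step 3: the ALU computes the effective address.
    effective_address = X + base
    # Step 4: access memory.
    data = memory[effective_address]
    # Step 5: load the data into the destination register.
    regs[rd] = data
    return regs

regs = {"R5": 0, "R7": 100}   # hypothetical register contents
memory = {140: 42}            # hypothetical memory contents
execute_load(regs, memory, X=40, rs="R7", rd="R5")
assert regs["R5"] == 42
```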
Fundamental concepts: Instruction Execution
Arithmetic and Logic Instructions:

These differ from the Load instruction in two ways:

There are either two source registers, or a source register and an immediate source operand.

No access to memory operands is required.

The Add instruction does not require access to an operand in the memory.

It is advantageous to use the same multistage processing hardware for many instructions.

Arrange for all instructions to be executed in the same number of steps.

The Add instruction should be extended to five steps, patterned along the steps of the Load instruction.

Since no access to memory operands is required by the Add instruction, a step in which no action takes place is inserted between
steps 3 and 4.

For immediate addressing mode, steps 2 and 3 are modified.
Fundamental concepts: Instruction Execution
Store Instructions:

The five-step sequence is also suitable for Store instructions.

The final step of loading the result into a destination register is not required.

The hardware stage responsible for this step takes no action.


The five-step sequence of actions given in figure is suitable for all instructions in a RISC-style instruction set.

RISC-style instructions are one word long and only Load and Store instructions access operands in the memory.

Instructions that perform computations use data:

Stored in general-purpose registers or given as immediate data in the instruction.

The five-step sequence is suitable for all Load and Store instructions.

The addressing modes that can be used in these instructions are special cases of the Index mode.
Fundamental concepts: Instruction Execution
Store Instructions:

RISC-style processors provide one general-purpose register, usually register R0, that contains the value zero.

When R0 is used as the index register, the effective address of the operand is the immediate value X.

This is the Absolute addressing mode.

If the offset X is set to zero, the effective address is the contents of the index register, Ri.

This is the Indirect addressing mode.

Only the Index mode needs to be implemented, resulting in a significant simplification of the processor hardware.

Selecting R0 as the index register or setting X to zero is left to the assembler or the compiler.

This is consistent with the RISC philosophy of aiming for:

Simple and fast hardware.

At the expense of higher compiler complexity and longer compilation time.

Programs are compiled much less frequently than they are executed.

A net gain in the time needed to perform various tasks on a computer.
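The two special cases of the Index mode can be expressed directly. A minimal sketch, with invented register contents:

```python
# Effective address in the Index mode: EA = X + [Ri].
# With R0 hardwired to zero, EA = X (the Absolute mode);
# with the offset X set to zero, EA = [Ri] (the Indirect mode).

def effective_address(regs, X, index_reg):
    return X + regs[index_reg]

regs = {"R0": 0, "R4": 2000}   # R0 always contains zero

# Absolute mode: use R0 as the index register.
assert effective_address(regs, X=500, index_reg="R0") == 500
# Indirect mode: set the offset X to zero.
assert effective_address(regs, X=0, index_reg="R4") == 2000
```

Selecting R0 or setting X to zero is exactly the job the text assigns to the assembler or compiler.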
Fundamental concepts: Hardware Components
Register File:

The processor hardware is organized in five stages. Each stage performs the actions needed in one of the steps.

Registers are implemented in the form of a register file, a small and fast memory block.

An array of storage elements, with access circuitry that enables data to be read from or written into any register.

The access circuitry is designed to enable two registers to be read at the same time, making their contents available
at two separate outputs, A and B.

The register file has two address inputs that select the two registers to be read.

These inputs are connected to the fields in the IR.

A data input, C, and a corresponding address input:

To select the register into which data are to be written.

Address input is connected to the IR.
Fundamental concepts: Hardware Components
Register File:

The inputs and outputs of any memory unit are often called input and output ports.

A memory unit that has two output ports is said to be dual-ported.

Possibility 1: A single set of registers with duplicate data paths and access circuitry.

Enables two registers to be read at the same time.

Possibility 2: Use two memory blocks, each containing one copy of the register file.

Whenever data are written into a register, they are written into both copies of that register.

The two files have identical contents.

When an instruction requires data from two registers:

One register is accessed in each file.

The two register files together function as a single dual-ported register file.
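Possibility 2 can be sketched as a behavioral model (not a hardware description; the register count and values are illustrative):

```python
# Two identical memory blocks, each holding a copy of the register
# file.  Writes go to both copies; each read port is served by its
# own copy, so two registers can be read in the same cycle.

class DualPortedRegisterFile:
    def __init__(self, n_regs=32):
        self.copy_a = [0] * n_regs
        self.copy_b = [0] * n_regs

    def write(self, addr_c, data_c):
        # Data written through port C go into BOTH copies,
        # keeping their contents identical.
        self.copy_a[addr_c] = data_c
        self.copy_b[addr_c] = data_c

    def read(self, addr_a, addr_b):
        # Port A reads from one copy, port B from the other.
        return self.copy_a[addr_a], self.copy_b[addr_b]

rf = DualPortedRegisterFile()
rf.write(5, 123)
rf.write(6, 456)
assert rf.read(5, 6) == (123, 456)   # two registers read at once
```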
Fundamental concepts: Hardware Components
Arithmetic and Logic Unit:

It performs arithmetic operations such as addition and subtraction, and logic operations such as AND, OR, and XOR.

The register file and the ALU may be connected as shown in figure.

During an arithmetic or logic operation:

The contents of the two registers specified in the instruction are read from the register file and put in A and B.

Output A is connected directly to the first input of the ALU, InA.

B is connected to a multiplexer, MuxB.

Multiplexer selects either output B of the register file or the immediate value in the IR.

The output of the ALU is connected to data input C of the register file, so that the result can be written into the destination register.
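A behavioral sketch of this arrangement (operation names and register contents are invented):

```python
# The ALU's second input comes through MuxB, which selects either
# output B of the register file or the immediate value from the IR.

def alu(op, in_a, in_b):
    ops = {"ADD": in_a + in_b, "SUB": in_a - in_b,
           "AND": in_a & in_b, "OR": in_a | in_b}
    return ops[op]

def execute(regs, op, rs1, rs2=None, immediate=None):
    a = regs[rs1]                                          # output A -> InA
    b = immediate if immediate is not None else regs[rs2]  # MuxB -> InB
    return alu(op, a, b)

regs = {"R2": 10, "R3": 3}   # invented register contents
assert execute(regs, "ADD", "R2", rs2="R3") == 13       # register operand
assert execute(regs, "ADD", "R2", immediate=50) == 60   # immediate operand
```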
Fundamental concepts: Hardware Components
Datapath:

Instruction processing consists of two phases: The fetch phase and the execution phase.

Divide the processor hardware into two sections – one for fetching, other for execution.

The fetch section is responsible for fetching and decoding instructions, and for generating the control signals for the
execution section.

The execution section reads the data operands specified in an instruction, performs the
required computations, and stores the results.

Multistage organization of hardware is shown in figure.

The action in each stage is completed in one clock cycle.
Step 1: Fetch the instruction and place it in the IR.
Step 2: Decode the instruction and read its source registers.
Step 3: The IR is used to generate control signals for all subsequent steps; the IR holds the instruction until
execution is completed.
Step 4: Memory access.
Step 5: Move the result to the destination register.
Fundamental concepts: Hardware Components
Datapath:

It is necessary to insert registers between stages.

Inter-stage registers hold results from previous stage, input to next stage
during next clock cycle.
1. Data read from the register file are placed in registers RA and RB.
2. RA provides the data to input InA of the ALU.
3. Multiplexer MuxB forwards either the contents of RB or the immediate value
in the IR to the ALU’s second input, InB.
4. The ALU constitutes stage 3; its result is saved in RZ.
5. For computational instructions such as Add, no action takes place in stage 4.
6. MuxY selects register RZ to transfer the result of the computation to RY.
7. The data in RY are transferred to the register file in stage 5 and loaded into the
destination register.
The register file is in both stages 2 (source registers) and 5 (destination
registers).
Fundamental concepts: Hardware Components
Datapath:

For Load and Store instructions, the effective address of the memory operand is
computed by the ALU in stage 3 and loaded into register RZ.

The effective address is sent to the memory in stage 4.
Load instruction: data read from memory are placed in RY by MuxY, then transferred to the
register file in the next clock cycle.
Store instruction: data from the register file are placed in RB in stage 2. Since the memory
access occurs in stage 4, an inter-stage register RM is needed.
Data flow: RB => RM => Memory; no action takes place in stage 5.
Fundamental concepts: Hardware Components
Instruction Fetch Section:

The memory access address comes from the PC when fetching an instruction.

It comes from RZ when accessing instruction operands.

MuxMA selects one of these two sources to be sent to the processor-memory interface.

Instruction address generation block updates PC after each instruction fetch.

Fetched instruction is loaded in IR, stays there till execution is complete.

The control circuitry examines IR to generate the control signals for all the processor’s hardware.

The contents of the IR are also used by the block labeled Immediate.

A 16-bit immediate value is extended to 32 bits.

The extended value is used as operand, or to compute the effective address of an operand.

For arithmetic instructions: the immediate value is sign-extended.

For logic instructions: it is padded with zeros.

The Immediate block generates the extended value and forwards it to MuxB.

Used by ALU.

The Immediate block also generates the extended value used to compute the target address of branch instructions.
Fundamental concepts: Hardware Components
Instruction Fetch Section:

Adder increments the PC by 4 during straight-line execution.

It computes a new value to be loaded into the PC when executing branch and subroutine call instructions.

Adder input 1: connected to PC.

Adder input 2: connected to MuxINC, selects either the constant 4 or the branch offset to be added to the PC.

The branch offset is given in the immediate field of the IR, sign-extended to 32 bits by the Immediate block.

The adder output is routed to the PC via MuxPC.

MuxPC selects between the adder and the output of register RA.

The latter is needed when executing subroutine linkage instructions.

Register PC-Temp holds the contents of the PC temporarily:

During the process of saving the subroutine or interrupt return address.
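A sketch of the PC-update selection. The select-signal polarities here are an assumption chosen to match the hardwired-control example later in the notes, where INC_select = 0 and PC_select = 1 give PC ← PC + 4:

```python
# PC update paths: MuxINC picks the constant 4 or a branch offset;
# MuxPC picks the adder output or register RA (subroutine return).

def next_pc(pc, ra, inc_select, pc_select, branch_offset=0):
    # MuxINC: INC_select = 0 selects the constant 4,
    #         INC_select = 1 selects the sign-extended branch offset.
    increment = 4 if inc_select == 0 else branch_offset
    adder_out = pc + increment
    # MuxPC: PC_select = 1 routes the adder output to the PC,
    #        PC_select = 0 routes register RA (assumed polarity).
    return adder_out if pc_select == 1 else ra

pc = 1000
assert next_pc(pc, ra=0, inc_select=0, pc_select=1) == 1004       # straight-line
assert next_pc(pc, ra=0, inc_select=1, pc_select=1,
               branch_offset=40) == 1040                          # branch
assert next_pc(pc, ra=2000, inc_select=0, pc_select=0) == 2000    # return via RA
```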
Control signals:
 The operation of the processor’s hardware components is governed by control signals.
 These signals determine which multiplexer input is selected, what operation is
performed by the ALU, and so on.
 Since data are transferred from one stage to the next in every clock cycle, the inter-stage
registers are always enabled.

RA, RB, RZ, RY, RM, and PC-Temp are always enabled.
 The PC, the IR, and the register file are not changed in every clock cycle.
 New data are loaded into these registers only when called for by a particular processing step.

They are enabled only at those times.
 The ALU is used only in stage 3.

Selection in MuxB matters only in stage 3.

The MuxB selection is maintained throughout, to keep the circuit simple.

Same goes for MuxY.
 MuxMA changes its selection during different execution steps.

Selects the PC during step 1.

In step 4 of Load and Store instructions, it selects register RZ.
Control signals:
 The register file has three 5-bit address inputs.

Access 32 general-purpose registers.
 Address A – IR bits 31–27; Address B – IR bits 26–22.
 Address C selects the destination register for port C data.

MuxC selects the source of that address.

C_select=0 for immediate operand, =1 for register operand.

The third input of MuxC is the address of the link register used in
subroutine linkage instructions.
 New data are loaded into the selected register only when RF_write=1.
 B_select = 0, MuxB selects RB.
 Two bits are needed to control MuxC and MuxY because each has three inputs.
 ALU_op – k bit control code.

Determines operation by ALU.

Up to 2^k distinct operations – Add, AND, etc.
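A small sketch of a k-bit ALU_op encoding; the particular codes and operations are invented for illustration:

```python
# A k-bit ALU_op control code can encode up to 2**k distinct
# operations.  Here k = 2, allowing at most four operations.

ALU_OPS = {
    0b00: lambda a, b: a + b,   # Add
    0b01: lambda a, b: a - b,   # Subtract
    0b10: lambda a, b: a & b,   # AND
    0b11: lambda a, b: a | b,   # OR
}

k = 2
assert len(ALU_OPS) <= 2 ** k
assert ALU_OPS[0b00](6, 7) == 13
assert ALU_OPS[0b10](0b1100, 0b1010) == 0b1000
```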
Control signals:
 A comparator performs the comparison specified.

Generates condition signals that indicate the result of the comparison.

These signals are examined by the control circuitry during the execution of conditional branch
instructions to determine whether the branch condition is true or false.
 MEM_read – memory read operation
 MEM_write - memory write operation
 MFC = 1 when operation is complete.
 IR_enable=1, load new instruction to IR.
 During fetch step:

IR_enable =1 only after MFC=1.
Control signals:
 INC_select - selects the value to be added to the PC.
 PC_select – selects either the updated address or the contents of register RA.
 PC_enable = 1, value from MuxPC loaded into PC.
Hardwired Control:
 The processor generates the control signals for fetch/execute to take place in the correct sequence and at the right time.
 Two approaches for control signal generation:

Hardwired control, Microprogrammed control.
 Each step of instruction execution requires one clock cycle.

A step counter may be used to keep track of the progress of execution.
 The setting of the control signals depends on:

Contents of the step counter.

Contents of the instruction register.

The result of a computation or a comparison operation.

External input signals, such as interrupt requests.
 The instruction decoder interprets the OP-code and addressing mode information in the IR.

The corresponding INSi output is set to 1.
 One of the outputs T1 to T5 of the step counter is set to 1.

To indicate which of the five steps of the fetch/execute cycle is involved.

A mod-5 counter, as all executions complete in 5 steps.
Hardwired Control:
 The control signal generator is a combinational circuit.

Produces control signals based on its inputs.
 The required settings of the control signals are determined by INS1 to INSm.
Example:
Stage 1:
1. New instruction fetched from memory.
2. T1 = 1 is set.
3. In this clock period, MA_select=1 (to select the PC as the source of the memory address, Fig.5.19).
4. MEM_read=1 (to initiate a memory Read operation, Fig.5.19).
5. If MFC=1, then set IR_enable=1, data received from the memory are loaded into the IR.
6. The PC is incremented by 4.

By setting INC_select=0, and PC_select=1 (Fig.5.20).

PC_enable is set to 1, so the new value is loaded into the PC at the positive clock edge marking the end of step T1.
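The step-T1 settings listed above can be collected into a sketch of the control signal generator. This is a behavioral model, only the T1 case is shown, and gating IR_enable and PC_enable on MFC is a simplification of the edge-triggered timing:

```python
# Control signal generation for step T1 (instruction fetch),
# following the settings named in the example above.

def control_signals(step, mfc):
    signals = {}
    if step == "T1":
        signals["MA_select"] = 1    # PC is the memory-address source
        signals["MEM_read"] = 1     # initiate a memory read
        signals["INC_select"] = 0   # add the constant 4 ...
        signals["PC_select"] = 1    # ... and route the adder to the PC
        # IR and PC are loaded only once the memory signals completion.
        signals["IR_enable"] = 1 if mfc else 0
        signals["PC_enable"] = 1 if mfc else 0
    return signals

waiting = control_signals("T1", mfc=False)
done = control_signals("T1", mfc=True)
assert waiting["IR_enable"] == 0 and done["IR_enable"] == 1
assert done["MEM_read"] == 1 and done["PC_select"] == 1
```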
CISC-Style Processors:
 A RISC-style instruction set is conducive to a multi-stage implementation of the processor.

The hardware is simple and well suited to pipelined operation.

The control signals are easy to generate.

Load and Store instructions only access data in the memory.

One word in length.
 CISC-style instruction sets are more complex:

They allow much greater flexibility in accessing instruction operands.

They can access operands directly from memory.

Instructions may span several words to specify operand addresses and the actions to be performed.
 CISC-style instructions require a different organization of the processor hardware.
 Main differences between Fig.5.22 and Fig.5.8:

The Interconnect block: Provides interconnection among other blocks.

No particular structure or pattern of data flow as in Fig.5.8.

Provides paths to transfer data between any two components.

Fig.5.8 uses inter-stage registers such as RZ and RY, not needed in Fig.5.22.

Registers are needed to hold intermediate results.

Temporary registers block in the figure is used.

Two temporary registers, Temp1 and Temp2.
CISC-Style Processors:
 The Interconnect block is traditionally implemented using bus interconnection.
 Bus driver: A logic gate that sends a signal over a bus line.
 All devices connected to the bus have the ability to send data.

Have to ensure only one of them is driving the bus at any given time.

For this, the bus driver is implemented with tri-state gate.
 A tri-state gate has a control input that turns it ON or OFF.

When ON, the gate places a logic signal of 0 or 1 on the bus, based on input value.

When OFF, the gate is electrically disconnected from the bus.
 A flip-flop that forms one bit of a data register connected to bus is shown in Fig.5.23.
 Rin = 1, the multiplexer selects the data on the bus line to be loaded into the flip-flop.
 Rin = 0, the flip-flop maintains its present value.
 The flip-flop output is connected to the bus line through a tri-state gate.

Tri-state gate turned ON when Rout=1.

When Rout=0, other devices can drive the bus.
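A behavioral sketch of one bus line with tri-state drivers; the contention check models the condition the hardware must avoid:

```python
# A bus line driven by tri-state gates: each connected device has an
# Rout control, and the line carries a value only when exactly one
# device is driving it.

def bus_value(drivers):
    """drivers: list of (Rout, output_bit) pairs, one per device."""
    active = [bit for rout, bit in drivers if rout == 1]
    if len(active) > 1:
        raise RuntimeError("bus contention: more than one driver ON")
    return active[0] if active else None   # None models a floating bus

# Only the first device drives the bus; the others are
# electrically disconnected (Rout = 0).
assert bus_value([(1, 1), (0, 0), (0, 1)]) == 1
assert bus_value([(0, 1), (0, 0)]) is None   # nobody driving
```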
CISC-Style Processors: An Interconnect using Buses
 Interconnect block in Fig.5.22 may be implemented as in Fig.5.24 (three bus
implementation).
 All registers are assumed to be edge-triggered.

When a register is enabled, data are loaded into it on the active edge of the clock at the
end of the clock period.
 Addresses for the three ports of the register file are provided by the Control block.

Connections not shown to keep the figure simple.
 The IR is connected to bus B through the Immediate block.

The circuit that extends an immediate operand in the IR to 32 bits.

Not shown in figure.
 ADD R5, R6 performs R5 ← [R5] + [R6].

It is performed in three steps. Step 1 takes more than one clock cycle; the others take only one.
CISC-Style Processors: An Interconnect using Buses
 AND X(R7), R9:

AND operation on the contents of register R9 and memory location X + [R7].

Stores the result back in the same memory location.
 Assume that index offset X is a 32-bit value given as the second word of the instruction.
 To execute this instruction, it is necessary to access the memory four times.
1. The OP-code word is fetched.
2. When the instruction decoding circuit recognizes the Index addressing mode, the index
offset X is fetched.
3. The memory operand is fetched and the AND operation is performed.
4. The result is stored back into the memory.
The number of execution steps in the ADD and AND instructions varies.

There is no uniform sequence of actions that can be followed for all instructions.

Unlike RISC-style instructions, the number of execution steps in CISC-style instructions varies.
Microprogrammed Control:
 The control signals of the components in Figs. 5.22 and 5.24 can be generated using the hardwired approach.
 Control signals are generated for each execution step based on the instruction in the IR.

In hardwired control, these signals are generated by circuits that interpret:

The contents of the IR.

The timing signals derived from a step counter.
 Instead of employing circuits, it is possible to use a “software” approach.

The desired setting of the control signals in each step is determined by a program stored in a special memory.

The control program is called a microprogram to distinguish it from the program being executed by the processor.
 The microprogram is stored on the processor chip in:

The microprogram memory or the control store.
 Suppose n control signals are needed.

Let each control signal be represented by a bit in an n-bit word.

Often referred to as a control word or a microinstruction.
 Each bit in that word specifies the setting of the corresponding signal for a particular step in the execution flow.
Microprogrammed Control:
 One control word is stored in the microprogram memory for each step in the execution sequence of an instruction.
 Example: the action of reading an instruction or a data operand from the memory

Uses the MEM_read and WMFC signals.

These signals are asserted by setting the corresponding bits in the control word to 1 for steps 1, 3, and 5 in Fig. 5.26.
 When a microinstruction is read from the control store, each control signal takes on the value of its corresponding bit.
 The sequence of microinstructions corresponding to a given machine instruction constitutes the microroutine that implements that
instruction.
 Steps 1 and 2 in Figs. 5.25 and 5.26 are common to all instructions ==> can be a microroutine.
 Fig. 5.27 consists of:

A microinstruction address generator - generates the address.

To be used for reading microinstructions from the control store.

The address generator uses a microprogram counter, μPC.

Keep track of control store addresses when reading microinstructions from successive locations.
Microprogrammed Control:
 Step 2 in Figs. 5.25 and 5.26:

The microinstruction address generator decodes the instruction in the IR.

Obtain the starting address of the corresponding microroutine and loads that address into the μPC.
 This is the address that will be used in the following clock cycle to read the control word corresponding to step 3.
 As execution proceeds, the μPC is incremented to read microinstructions from successive locations in the control store.
 End bit in the microinstruction is used to mark the last microinstruction in a given microroutine.
 End = 1 in step 3 of Fig. 5.25 and in step 7 of Fig. 5.26.

The address generator returns to the microinstruction corresponding to step 1.

Causes a new machine instruction to be fetched.
 Microprogrammed control can be viewed as having a control processor within the main processor.
 The function of microinstructions is to direct the actions of the main processor’s hardware components.

Indicating which control signals need to be active during each execution step.
 Microprogrammed control is simple to implement and provides considerable flexibility.
 But, it is slower than hardwired control.
 The flexibility it provides is not needed in RISC-style processors.
 Since the cost of logic circuitry is no longer a significant factor, hardwired control has become the preferred choice.
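The μPC/End-bit sequencing described above can be sketched as follows; the control store addresses and word contents are invented:

```python
# A microinstruction sequencer: the control store holds one control
# word per execution step; the uPC steps through a microroutine
# until a word with End = 1 returns control to the common fetch
# microroutine (assumed here to start at address 0).

FETCH_START = 0

def run_microroutine(control_store, start):
    """Yield control words from `start` until one has End = 1."""
    upc = start
    while True:
        word = control_store[upc]
        yield word
        if word["End"] == 1:
            upc = FETCH_START   # fetch the next machine instruction
            break
        upc += 1                # successive control store locations

# A hypothetical two-step microroutine stored at addresses 10-11.
control_store = {
    10: {"ALU_op": "ADD", "End": 0},
    11: {"RF_write": 1, "End": 1},   # last microinstruction
}
steps = list(run_microroutine(control_store, start=10))
assert len(steps) == 2 and steps[-1]["End"] == 1
```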
Pipelining: Basic Concept
 Ways to improve performance of processor:

Use faster circuit technology to implement the processor and the main memory.

Arrange the hardware so that more than one operation can be performed at the same time.

The number of operations performed per second is increased.

The time needed to perform any one operation is not changed.
 Pipelining is a way of organizing concurrent activity in a computer system.

In manufacturing plants, pipelining is commonly known as an assembly-line operation.

First station: Prepares the automobile chassis, by group 1, on automobile 1.

Second station: Adds the body, by group 2, on automobile 2.

Third station: Installs the engine, by group 3, on automobile 3.
 A similar overlapping pattern of execution is possible for the instructions in a program.
Pipeline organization:
 Fig. 6.2 indicates how the five-stage organization can be pipelined.
 The program counter (PC) is used to fetch a new instruction in stage 1 of pipeline.
 Execution proceeds through successive stages as other instructions are fetched.
 At any given time, each stage of the pipeline is processing a different instruction.

Information such as register addresses, immediate data, and the operations to be
performed are carried through the pipeline.

This information is held in interstage buffers.

RA, RB, RM, RY, and RZ in Fig. 5.8, IR and PC-Temp in Figs. 5.9, 5.10.
 Interstage buffer B1 feeds the Decode stage with a newly-fetched instruction.
 B2 feeds the Compute stage with:

The two operands read from the register file.

The source/destination register identifiers.

The immediate value derived from the instruction.

The incremented PC value used as the return address for a subroutine call.

The settings of control signals determined by the instruction decoder.

The settings for control signals move through the pipeline to determine the ALU
operation, the memory operation, and a possible write into the register file.
Pipeline organization:
 B3 holds the result of the ALU operation which may be:

Data to be written into the register file.

An address that feeds the Memory stage.

For a write access to memory, B3 holds the data to be written.

The incremented PC value passed from the previous stage in case it is needed as
the return address for a subroutine-call instruction.
 B4 feeds the Write stage with a value to be written into the register file.

The ALU result from the Compute stage.

The result of the Memory access stage.

The incremented PC value that is used as the return address for a subroutine-call
instruction.
Pipeline issues:
 Fig. 6.1 shows the ideal overlap of three instructions.
 There are times when it is not possible to have a new instruction enter the pipeline in every cycle.

 Consider the case of two instructions: Ij and Ij+1

 The destination register for instruction Ij is a source register for instruction Ij+1.
 The result of instruction Ij is not written into the register file until cycle 5.
 It is needed earlier in cycle 3 when the source operand is read for instruction Ij+1.

 If the steps in Fig. 6.1 are used, the result of instruction Ij+1 would be incorrect.

The arithmetic operation would be performed using the old value of the register in question.
 It is necessary to wait until the new value is written into the register by instruction Ij.
Pipeline issues:
 Instruction Ij+1 cannot read its operand until cycle 6.

It must be stalled in the Decode stage for three cycles.
 While Ij+1 is stalled, Ij+2 and all subsequent instructions are delayed.
 New instructions cannot enter the pipeline, and the total execution time is increased.
 Any condition that causes the pipeline to stall is called a hazard.
 Above example is data hazard.
 Other hazards arise from:

Memory delays.

Branch instructions.

Resource limitations.
Data dependencies:
 Two instructions in Fig. 6.3.
 There is a data dependency between these two instructions:
 R2 carries data from I1 to I2.
 The Subtract instruction is stalled for three cycles.

R2 is read only in cycle 6.
 The control circuit should recognize the data dependency.

Compare source register identifier from interstage buffer B1 of Subtract instruction with the destination register identifier of
the Add instruction that is held in interstage buffer B2.

The Subtract instruction must be held in interstage buffer B1 during cycles 3 to 5

Meanwhile, the Add instruction proceeds through the remaining pipeline stages.
 In cycles 3 to 5, control signals can be set in interstage buffer B2 for an implicit NOP (No-operation) instruction.

NOP does not modify the memory or the register file.
 Each NOP creates one clock cycle of idle time, called a bubble, as it passes through the Compute, Memory, and Write stages to the end of the pipeline.
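The comparison described above can be sketched as follows; the register names follow the Add/Subtract example, and the buffer contents are simplified to bare register identifiers:

```python
# Hazard detection: compare the source register identifiers of the
# instruction in interstage buffer B1 with the destination register
# identifier of the instruction in buffer B2; on a match, the B1
# instruction must be stalled and NOP bubbles injected.

def needs_stall(b1_sources, b2_destination):
    return b2_destination is not None and b2_destination in b1_sources

# I1: Add      R2, ...   (destination R2, now in B2)
# I2: Subtract ..., R2   (source R2, now in B1)
assert needs_stall(b1_sources={"R2"}, b2_destination="R2") is True
# No dependency: no stall is needed.
assert needs_stall(b1_sources={"R4"}, b2_destination="R2") is False
```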
Data dependencies: Operand forwarding
 Pipeline stalls due to data dependencies can be alleviated through operand forwarding.
 The desired value for Subtract instruction is actually available at the end of cycle 3.

When the ALU completes the operation for the Add instruction.

The result is loaded in RZ (part of interstage buffer B3) of Fig. 5.8.
 The hardware can forward the value from register RZ to ALU input in cycle 4.

Fig. 6.4 shows pipelined execution with forwarding.

The ALU result from cycle 3 is used as an input to the ALU in cycle 4 (indicated by arrow).
 Fig. 6.5 shows the modification needed in the datapath for forwarding.

MuxA is inserted before input InA of the ALU.

MuxB is expanded with another input.

The multiplexers select: A value read from the register file in the normal manner, or the value available in register RZ.
 Forwarding can also be extended to a result in register RY in Fig. 5.8.

The Subtract instruction is in the Compute stage.

An Or instruction is in the Memory stage (no memory operation).

The Add instruction is in the Write stage, with its result in RY.

Forwarding RY to the ALU avoids stalling the pipeline.

MuxA and MuxB need another input.
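A sketch of the forwarding selection (the values are invented; giving RZ priority over RY reflects that RZ holds the newer result):

```python
# Forwarding: the multiplexer before an ALU input selects a value
# read from the register file, the value in RZ (result of the
# previous instruction), or the value in RY (the one before that).

def forward_operand(reg_value, src, rz_dest, rz_value, ry_dest, ry_value):
    if src == rz_dest:    # newest result takes precedence
        return rz_value
    if src == ry_dest:
        return ry_value
    return reg_value      # normal read from the register file

# R2 was just computed by the Add instruction and sits in RZ;
# the stale register-file copy (6) must NOT be used.
assert forward_operand(reg_value=6, src="R2",
                       rz_dest="R2", rz_value=32,
                       ry_dest=None, ry_value=None) == 32
```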
Data dependencies: Handling data dependencies in software
 The task of detecting data dependencies and dealing with them is given to the compiler.

 The compiler identifies a data dependency between instructions Ij and Ij+1.

It inserts three explicit NOP (No-operation) instructions between them.

The NOPs introduce the necessary delay to enable instruction Ij+1 to read the new value from the register file after it is
written.

The three NOP instructions have the same effect on execution time as the stall.
 Requiring the compiler to identify dependencies and insert NOP instructions simplifies the hardware implementation of the
pipeline.

The code size increases.

The execution time is not reduced as in operand forwarding.
 The compiler can attempt to optimize the code:

Improve performance and reduce the code size.

Move useful instructions into the NOP slots.
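A sketch of this software approach (the instruction strings are illustrative):

```python
# The compiler inserts three explicit NOPs between dependent
# instructions Ij and Ij+1, so that Ij+1 reads its operand only
# after Ij has written the register file.

def insert_nops(program, dependent_pairs, n_nops=3):
    out = []
    for idx, instr in enumerate(program):
        out.append(instr)
        if (idx, idx + 1) in dependent_pairs:
            out.extend(["NOP"] * n_nops)
    return out

program = ["Add R2, R3, #100", "Subtract R9, R2, #30"]
padded = insert_nops(program, dependent_pairs={(0, 1)})
assert padded == ["Add R2, R3, #100", "NOP", "NOP", "NOP",
                  "Subtract R9, R2, #30"]
```

Code-size growth is visible directly: the padded program is three instructions longer, which is why compilers prefer to fill these slots with useful instructions instead.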
Memory Delays:
 Delays from memory accesses are another cause of pipeline stalls.
 A Load instruction may require more than one clock cycle to obtain its operand from memory.

May occur because the requested instruction or data are not found in the cache, resulting in a cache miss.

Fig. 6.7 shows delay due to memory access.

A memory access may take ten or more cycles; for simplicity, only three cycles are shown.
 A cache miss causes all subsequent instructions to be delayed.
Memory Delays:
 An additional type of memory-related stall occurs when there is a data dependency involving a Load instruction.

Assume the data for the Load instruction are found in the cache, so only one cycle is needed to access the operand.

R2 is destination of Load instruction, source of Subtract instruction.

Operand forwarding cannot be done as in Fig. 6.4.

The data read from memory (the cache, in this case) are not available until they are loaded into register RY at the beginning
of cycle 5.

The Subtract instruction must be stalled as in Fig. 6.8 for one cycle to delay ALU operation.
 The compiler can eliminate this one-cycle stall.

Reordering instructions to insert a useful instruction between Load instruction and instruction that depends on the data read
from the memory.

The inserted instruction fills the bubble.

If a useful instruction cannot be found by the compiler, then the hardware introduces the one-cycle stall automatically.
Branch Delays: Unconditional branches
 Branch instructions can alter the sequence of execution.

They must first be executed to determine whether and where to branch.
 Pipeline begins with unconditional branch instruction Ij.

 Ij+1 and Ij+2, are stored in successive memory addresses.

 The target of the branch is instruction Ik.

 From Fig. 5.15, Ik is fetched in cycle 4, after the program counter has been updated with the target address.

 Ij+1 and Ij+2 are fetched in cycles 2 and 3 before fetching Ik. They must be discarded.
 The resulting two-cycle delay constitutes a branch penalty.
 Branching instructions represent about 20 percent of the dynamic instruction count.
 With a two-cycle branch penalty, the execution time can increase by as much as 40 percent (0.2 branches per instruction × 2 cycles = 0.4 extra cycles per instruction).
Branch Delays: Unconditional branches
 Computing the branch target address earlier in the pipeline reduces branch penalty.

Determine the target address and update the program counter in the Decode stage.
 Ik can be fetched one clock cycle earlier, reducing the branch penalty to one cycle.
 Only one instruction, Ij+1, is fetched incorrectly.
Branch Delays: Conditional branches
 The result of the comparison in the third step determines whether the branch is taken.
 To limit branch penalty, test branch condition as early as possible.
 The comparator that tests the branch condition can be moved to the Decode stage.

The comparator uses the values from outputs A and B of the register file directly.
 Moving branch decision to Decode stage gives:

A common branch penalty of only one cycle for all branch instructions.
Branch Delays: Branch Delay Slot
 Assume:

The branch target address and the branch decision are determined in the Decode stage.
 At the same time, instruction Ij+1 is fetched.

 If the branch condition is true, Ij+1 is discarded, incurring a one-cycle branch penalty. If the condition is false, Ij+1 is executed.
 In both cases, the instruction immediately following the branch instruction is always fetched.
 The location that follows a branch instruction is called the branch delay slot.

Arrange for the pipeline to always execute this instruction, whether or not the branch is taken.
 The original Ij+1 cannot occupy the branch delay slot, as it may be discarded depending on the branch condition.
 The compiler finds a suitable instruction for the delay slot, one that needs to be executed even when the branch is taken.

Move one of the instructions preceding the branch instruction to the delay slot (if there is no data dependency).
 If no useful instruction can be moved, or a data dependency prevents it:

A NOP is inserted.

One-cycle branch penalty, whether or not the branch is taken.
 The branch takes place at the end of the Add instruction that fills the delay slot.

This technique is called delayed branching.
Branch prediction:
 Making the branch decision in cycle 2 of branch instruction reduces the branch penalty.

The instruction after the branch instruction is fetched in cycle 2 and may be discarded (if no branch delay slot is considered).

The decision to fetch this instruction is actually made in cycle 1 (the PC is incremented when the branch instruction itself is fetched).
 To reduce the branch penalty further, the processor predicts the outcome of the branch instruction.

To determine which instruction should be fetched in cycle 2.
 Methods for branch prediction:

Static branch prediction

Dynamic branch prediction
Branch prediction: Static prediction
 Assume that the branch will not be taken, fetch the next instruction in sequential address order.

If the prediction is correct, the fetched instruction is completed, there is no penalty.

If the prediction is incorrect, the fetched instruction is discarded, the correct branch target instruction is fetched.

Misprediction incurs the full branch penalty.

The same choice (assume not-taken) is used every time a conditional branch is encountered.
 If branch outcomes were random:

Assume 50% of times it is taken.

Assuming that branches will not be taken results in a prediction accuracy of 50%.
 A backward branch at the end of a loop is taken most of the time.

For this, better accuracy can be achieved by predicting that the branch is likely to be taken.

Instructions are fetched using the branch target address as soon as it is known.
 For a forward branch at the beginning of a loop, the not-taken prediction leads to good prediction accuracy.
 The processor can determine the static prediction of taken or not-taken by checking the sign of the branch offset.
 Branch instruction encoding: include one bit to indicate whether the branch should be predicted as taken or not taken.

This bit can be specified by compiler.
Branch prediction: Dynamic prediction
 Use actual branch behavior to influence the prediction.
 The processor hardware assesses the likelihood of a given branch being taken by:

Keeping track of branch decisions every time that a branch instruction is executed.
 A dynamic prediction algorithm can use the result of the most recent execution of a branch instruction.

The next time the instruction is executed, the branch decision is likely to be the same as the last time.

The algorithm is described by the two-state machine in Fig. 6.12a.

The algorithm starts in state LNT (likely not taken).

When the branch instruction is executed:

If the branch is taken, it moves to state LT (likely taken); otherwise, it remains in LNT.

The next time the same instruction is encountered, the branch is predicted as:

Taken, if the state is LT.

Not taken, if the state is LNT.

This algorithm requires a single bit to represent the history of execution for a branch instruction.
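The one-bit scheme can be sketched as a tiny Python model (illustrative; the class and variable names are mine, not the text's):

```python
# Two-state (one-bit) predictor of Fig. 6.12a.
# States: "LNT" (predict not taken) and "LT" (predict taken).
class OneBitPredictor:
    def __init__(self):
        self.state = "LNT"          # algorithm starts in LNT

    def predict(self):
        return self.state == "LT"   # True means "predict taken"

    def update(self, taken):
        # One bit of history: remember only the most recent outcome.
        self.state = "LT" if taken else "LNT"

p = OneBitPredictor()
# A 4-iteration loop branch: taken, taken, taken, not taken.
outcomes = [True, True, True, False]
correct = 0
for taken in outcomes:
    correct += (p.predict() == taken)
    p.update(taken)
print(correct)  # 2 of 4 correct: the first and last passes mispredict
```

Running this over the loop branch makes the weakness concrete: with a single bit of history, both the first and the last pass of every loop execution are mispredicted.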
Branch prediction: Dynamic prediction
 Once a loop is entered:

Decision for the branch instruction that controls looping will always be the same except for the last pass through the loop.

Each prediction for the branch instruction will be correct except in the last pass.
 Once the loop is exited:

The prediction in the last pass will be incorrect.

The branch history state machine will be changed to the opposite state.

The next time this same loop is entered (assuming the loop has more than one pass), the state machine will lead to a wrong prediction in the first pass.

Repeated execution of the same loop results in mispredictions in the first pass and the last pass.
 Better prediction accuracy can be achieved by keeping more information about execution history (Fig. 6.12b).

The algorithm is initially in LNT.

If branch is taken, state changes to ST, otherwise to SNT.

The branch is predicted as taken if the state is either ST or LT.
 Suppose the branch instruction is at the end of the loop and the initial state is LNT:

1st pass: The prediction (not taken) will be wrong, state changes to ST.

Intermediate passes: The prediction will be correct.

Last pass: the branch is not taken, the prediction is wrong, and the state changes to LT.

Second time, 1st pass: the branch is predicted taken, a correct decision if the number of iterations is greater than 1.

Repeated execution of the same loop now results in only one misprediction in the last pass.
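The four-state scheme can be sketched the same way (an illustrative model; transitions not spelled out on these slides follow Fig. 6.12b, and the function name is mine):

```python
# Four-state predictor of Fig. 6.12b.
# From LNT or LT, a taken branch goes to ST and a not-taken branch to SNT;
# ST degrades to LT on a not-taken branch; SNT strengthens to LNT on taken.
TAKEN_NEXT = {"SNT": "LNT", "LNT": "ST", "LT": "ST", "ST": "ST"}
NOT_TAKEN_NEXT = {"SNT": "SNT", "LNT": "SNT", "LT": "SNT", "ST": "LT"}

def run(outcomes, state="LNT"):
    """Return (number of correct predictions, final state)."""
    correct = 0
    for taken in outcomes:
        predict_taken = state in ("ST", "LT")   # predict taken in ST or LT
        correct += (predict_taken == taken)
        state = TAKEN_NEXT[state] if taken else NOT_TAKEN_NEXT[state]
    return correct, state

loop = [True, True, True, False]   # 4-pass loop branch
c1, s = run(loop)                  # first execution: starts in LNT
c2, _ = run(loop, state=s)         # loop entered a second time, from LT
print(c1, c2)  # 2 then 3: reruns mispredict only the last pass
```

The first execution still mispredicts the first and last passes, but every later execution of the same loop mispredicts only the last pass, matching the text.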
Resource limitations:
 Pipelining enables overlapped execution of instructions.

The pipeline stalls when there are insufficient hardware resources.
 Two instructions need to access the same resource in the same clock cycle:

One instruction must be stalled.

This can be prevented by providing additional hardware.
 Stalls can occur in a computer that has:

A single cache that allows only one access per cycle.
 If both the Fetch and Memory stages of the pipeline are connected to the cache:

It is not possible for activity in both stages to proceed simultaneously.
 Fetch stage accesses the cache in every cycle.

This activity must be stalled for one cycle when there is:

A Load or Store instruction in the Memory stage also needing to access the cache.
 Separate caches for instructions and data:

Allow the Fetch and Memory stages to proceed simultaneously without stalling.
Performance evaluation:
 A non-pipelined processor:

T = (N × S) / R, where T is the execution time, N the dynamic instruction count, S the average number of clock cycles per instruction, and R the clock rate.
 Instruction throughput: The number of instructions executed per second.
 Non-pipelined execution: the throughput is Pnp = R/S.
 A five-stage processor uses five cycles to execute every instruction.

If there are no cache misses, S = 5.
 Pipelining improves instruction throughput even though an individual instruction is still executed in the same number of cycles.
 Five-stage pipeline:

Each instruction is executed in five cycles. A new instruction can ideally enter the pipeline every cycle.

In the absence of stalls, S = 1. The ideal throughput with pipelining is Pp = R.
 An n-stage pipeline has the potential to increase throughput n times.

It would appear that the higher the value of n, the larger the performance gain.

 Any time a pipeline is stalled or instructions are discarded, the instruction throughput is reduced below its ideal value.

Stalls due to data dependencies, penalties due to branches, cache misses.
Performance evaluation: Effects of stalls and penalties
 The five-stage pipeline:

Memory-access operations in the Fetch and Memory stages.

ALU operations in the Compute stage.
 The operations with the longest delay dictate the cycle time, and hence the clock rate R.

A processor with on-chip caches: memory-access operations take little time on a cache hit.

Assume delay through ALU is 2 ns. Then, R = 500 MHz.

The ideal pipelined instruction throughput is Pp = 500 MIPS.
 A processor with operand forwarding in hardware:

Assume there are no stalls due to cache misses.

One cycle penalty occurs when Load instruction is followed by another dependent instruction.

Stalls due to such Load instructions increase S by δstall.

Assume that Load instructions constitute 25 percent of the dynamic instruction count.

Assume that 40 percent of these Load instructions are followed by a dependent instruction.

With a one-cycle stall, the increase of S from its ideal value is δstall = 0.25 × 0.40 × 1 = 0.1.

Throughput is Pp = R/(1 + δstall) = 500/1.1 ≈ 454.5 MIPS.
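A quick numeric check of this example (values taken from the slides above):

```python
# R = 500 MHz; Loads are 25% of instructions; 40% of those Loads are
# followed by a dependent instruction; each such pair costs one stall cycle.
R = 500.0                       # clock rate in MHz
delta_stall = 0.25 * 0.40 * 1   # extra cycles per instruction from load-use stalls
S = 1 + delta_stall             # average cycles per instruction
P = R / S                       # throughput in MIPS
print(delta_stall, round(P, 1)) # 0.1 and about 454.5 MIPS
```

So a seemingly modest one-cycle stall on a quarter of the Loads already costs about 9 percent of the ideal 500 MIPS throughput.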
Performance evaluation: Effects of stalls and penalties
The penalties due to mispredicting branches during program execution:
 The branch penalty is one cycle when:

The branch decision and the branch target address are determined in the Decode stage.
 Assume branch instructions constitute 20 percent of the dynamic instruction count.

The average prediction accuracy for branch instructions is 90 percent, so 10 percent of all branches incur a one-cycle penalty.

 Increase in the average number of cycles per instruction due to branch penalties: δbranch_penalty = 0.20 × 0.10 × 1 = 0.02.
 Load stalls and branch misprediction penalties are independent.
 δstall + δbranch_penalty determines the increase in S, T, and reduction in Pp.
 The frequency of cache misses determines the performance degradation.
 A cache miss causes pm cycles of stall.

Fraction of fetched instructions that incur a cache miss: mi

Fraction of Load or Store instructions: d
 Fraction of these Load or Store instructions that incur a cache miss: md

 The ideal value S = 1 increases by δmiss = (mi + d × md) × pm.
 When all factors are combined, S increases from 1 to 1 + δstall + δbranch_penalty + δmiss.
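Combining the three contributions (the load-stall and branch numbers come from the slides; the cache-miss statistics mi, d, md, pm are assumed values for illustration only):

```python
# S = 1 + delta_stall + delta_branch + delta_miss
delta_stall = 0.25 * 0.40           # load-use stalls (from the text)
delta_branch = 0.20 * 0.10 * 1      # 20% branches, 10% mispredicted, 1-cycle penalty
mi, d, md, pm = 0.05, 0.30, 0.10, 10  # assumed cache-miss statistics
delta_miss = (mi + d * md) * pm     # fetch misses plus data-access misses
S = 1 + delta_stall + delta_branch + delta_miss
print(round(S, 2))
```

With these assumed miss rates, the cache-miss term dominates the other two, which is why reducing miss frequency (or miss penalty pm) matters far more than shaving branch penalties.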
Performance evaluation: Effects of stalls and penalties
What is a good value for n?
 n-stage pipeline may increase instruction throughput by a factor of n.

This suggests using a large number of stages to improve performance.
 As the number of pipeline stages increases, more instructions are executed concurrently.
 This may create more potential dependencies between instructions, leading to pipeline stalls.
 A longer pipeline moves the branch decision to a later stage:

The branch penalty may be larger than one cycle.
 The gain in throughput from increasing the value of n begins to diminish.
 Another important factor: the ALU delay.

The cycle time of the processor clock is chosen such that one ALU operation can be completed in one cycle.

Reductions in the clock cycle time are possible if a pipelined ALU is used.
