
Problems and solutions

What is Microarchitecture?
Microarchitecture refers to the detailed hardware design and implementation of the instruction set architecture (ISA) of a processor. A
general-purpose processor is usually designed to process arithmetic, logical, data transfer, memory read/write, conditional and branch
instructions. Moreover, the ISA defines several addressing modes for different types of instructions. Digital circuits and systems such as
logic gates, flip-flops, registers, multiplexers, demultiplexers, decoders, encoders, adders and multipliers are used to build the
microarchitecture that implements every instruction of the instruction set electronically inside the processor. The design complexity (space
and time) increases with the number of instructions and addressing modes supported by the ISA. To optimize the design, processing of an
instruction is split into several sub-tasks, chosen so that every instruction of the instruction set can be processed using a subset of them.
The microarchitecture is therefore designed in modular form, with a separate module for each sub-task. In synchronous control, a central
clock (timing signal) is used to initiate, sequence and synchronize the sub-tasks as required by the instruction and its addressing mode; for
some instructions, a module may be used several times, in different clock cycles. In asynchronous control, completion of one sub-task
initiates the next.

Microarchitecture deals with


• Design of Registers, ALU, Control Unit
• How to interconnect these units and control the flow of information among them
• How to select a particular operation in the ALU
• How to select particular registers (read/write)
• How to access memory (read/write)
• How to implement conditional instructions

What is Register file?


Registers are high-speed storage locations inside the CPU. They hold the instruction being decoded, operands for ALU operations, results or
partial results needed by following instructions, and memory addresses (or partial addresses) used to fetch instructions or read/write data.
The register file refers to the set of general-purpose registers that hold ALU operands and results or partial results. Although the register
file contains many registers, only a few are selected for reading at a time, using multiplexers. Similarly, only a few registers are selected,
using demultiplexers, to store a result generated by the ALU or data read from memory.
The number of registers that must be read/written at a time depends on the instruction set architecture.
A register file that allows two registers to be read and one register to be written at a time is called a 2R + 1W register file.
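The 2R + 1W organization can be modeled behaviorally to see what each port does. The Python sketch below is an illustration, not a hardware description: the class name `RegisterFile2R1W` is hypothetical, the two read selects stand in for the two multiplexers, and the write select stands in for the demultiplexer.

```python
class RegisterFile2R1W:
    """Behavioral sketch of a 2R + 1W register file (hypothetical model)."""

    def __init__(self, num_regs=32, width=32):
        self.mask = (1 << width) - 1
        self.regs = [0] * num_regs

    def read(self, sel_a, sel_b):
        # Two read ports: each acts like an N-to-1 multiplexer driven by a select field.
        return self.regs[sel_a], self.regs[sel_b]

    def write(self, sel_w, value, write_enable=True):
        # One write port: a 1-to-N demultiplexer routes the value to a single register.
        if write_enable:
            self.regs[sel_w] = value & self.mask


rf = RegisterFile2R1W()
rf.write(5, 7)
rf.write(9, 12)
a, b = rf.read(5, 9)       # both operands read in the same cycle
rf.write(17, a + b)        # one result written back per cycle
print(rf.read(17, 5))      # (19, 7)
```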
What are the steps usually followed to process instructions?
Processing of an instruction usually involves:
• Instruction fetch
• Decode
• Register read
• ALU operation
• Memory operation (optional)
• Write back
In some processors, decode and register read are overlapped, meaning these two sub-tasks are performed in the same clock cycle (possibly
in different half-cycles). In that case, the steps are:
• Instruction fetch
• Decode and register read
• ALU operation
• Memory operation (optional)
• Write back

Assume that a CPU has 32 registers, each of 32 bits. Show the design of the register architecture so that the CPU can read any two
registers and write any one register in a single clock cycle (2R + 1W).
Two multiplexers, each 32-to-1 line, and one demultiplexer (1-to-32 lines).
Machine code (register addressing mode):
Opcode | Operand-1   | Operand-2   | Result field
       | min. 5 bits | min. 5 bits | min. 5 bits
(5 bits are needed to select one of 32 registers, since 2^5 = 32.)

Assume that a CPU has 16 registers, each of 32 bits. Show the design of the register architecture so that the CPU can read any two
registers and write any one register in a single clock cycle (2R + 1W).

Two multiplexers, each 16-to-1 line, and one demultiplexer (1-to-16 lines).

Machine code (register addressing mode):
Opcode | Operand-1   | Operand-2   | Result field
       | min. 4 bits | min. 4 bits | min. 4 bits
(4 bits are needed to select one of 16 registers, since 2^4 = 16.)
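The minimum field widths above follow from the number of registers: a field of b bits can select one of 2^b registers, so N registers need ceil(log2 N) bits. A small Python check (the helper name `select_bits` is hypothetical):

```python
import math


def select_bits(num_registers):
    """Minimum width of a field that names one of num_registers registers."""
    return max(1, math.ceil(math.log2(num_registers)))


for n in (32, 16, 8):
    bits = select_bits(n)
    # Each read port is an n-to-1 multiplexer; the write port is a 1-to-n demultiplexer.
    print(f"{n} registers: {bits}-bit fields, two {n}-to-1 MUXes, one 1-to-{n} DEMUX")
```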
Design a 2R + 1W register file. ALU instructions use register addressing mode and three operand fields, as shown below:
Opcode | Operand-1 | Operand-2 | Result field

Assume that the CPU has 24 general-purpose registers, denoted R0 – R23. Note that R0 – R7 are reserved for operand-1, R8 – R15 for
operand-2 and R16 – R23 for the result field.
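Because each field can name only eight registers, one reasonable design (an interpretation, not given in the exercise) uses 3-bit fields, two 8-to-1 multiplexers for the read ports (one over R0 – R7, one over R8 – R15) and a 1-to-8 demultiplexer over R16 – R23 for the write port. A Python sketch of the field-to-register mapping under that assumption (all names are hypothetical):

```python
# Hypothetical partition for this exercise: each field names one group of 8 registers.
FIELD_BASE = {"operand1": 0, "operand2": 8, "result": 16}
FIELD_BITS = 3  # 2**3 = 8 registers per group


def register_number(field, code):
    """Map a 3-bit field value to the actual register number (R0 - R23)."""
    assert 0 <= code < (1 << FIELD_BITS), "field value out of range"
    return FIELD_BASE[field] + code


def field_code(field, reg):
    """Inverse mapping: register number -> 3-bit field value."""
    code = reg - FIELD_BASE[field]
    assert 0 <= code < (1 << FIELD_BITS), f"R{reg} is not allowed in {field}"
    return code


# Operands R3 and R12, result R20 -> field values 3, 4, 4
print(field_code("operand1", 3), field_code("operand2", 12), field_code("result", 20))
```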

Assume that a CPU has 8 registers, each of 32 bits. Show the design of the register architecture so that the CPU can read any two
registers and write any one register in a single clock cycle (2R + 1W).
A CPU has 24 registers, each of 16 bits, and supports only register-based instructions. Registers are specified as follows:
Opcode | 1st operand field | 2nd operand field | Result field
       | R0 – R7           | R8 – R15          | R16 – R23
Show the design of the register architecture (hint: first decide how many registers the CPU may need to read from and write to).
A CPU has 24 registers, each of 16 bits, and supports only register-based instructions. Registers are specified as follows:
Opcode | 1st operand field | 2nd operand field | Result field
       | R0 – R7           | R8 – R15          | R16 – R23

Show the design of the register architecture (hint: first decide how many registers the CPU may need to read from and write to). Show
the connection of the ALU with the register file.

Show the microarchitecture for ALU instructions using register addressing mode.
Design the datapath for ALU instructions using register addressing mode.

Assume that ALU instructions use three operand fields, as shown below:
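The register-addressing datapath can be traced step by step: decode the three fields, read two registers, run the ALU, write the result back. The sketch below assumes a hypothetical 32-bit format (opcode in bits 31-26, operand-1 in 25-21, operand-2 in 20-16, result in 15-11) and made-up opcode numbers; neither comes from the exercise.

```python
MASK32 = 0xFFFFFFFF

# Hypothetical opcode numbering, for illustration only.
ALU_OPS = {
    0: lambda a, b: a + b,  # ADD
    1: lambda a, b: a - b,  # SUB
    2: lambda a, b: a & b,  # AND
    3: lambda a, b: a | b,  # OR
}


def encode(opcode, op1, op2, result):
    """Assemble the assumed format: opcode | op1 | op2 | result."""
    return (opcode << 26) | (op1 << 21) | (op2 << 16) | (result << 11)


def execute_register_mode(instr, regs):
    # Decode: slice the fields out of the machine code.
    opcode = (instr >> 26) & 0x3F
    op1 = (instr >> 21) & 0x1F
    op2 = (instr >> 16) & 0x1F
    rd = (instr >> 11) & 0x1F
    # Register read: the two fields drive the two read-port multiplexers.
    a, b = regs[op1], regs[op2]
    # ALU operation: the opcode selects the function.
    y = ALU_OPS[opcode](a, b) & MASK32
    # Write back: the result field drives the write-port demultiplexer.
    regs[rd] = y
    return y


regs = [0] * 32
regs[1], regs[2] = 10, 32
execute_register_mode(encode(0, 1, 2, 3), regs)  # ADD: R3 = R1 + R2
print(regs[3])  # 42
```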


Show the microarchitecture for ALU instructions using immediate addressing mode. The data field is fixed (next to the opcode).
Design the datapath for ALU instructions using immediate addressing mode. The data field is fixed (next to the opcode).
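With a fixed data field, the immediate bits can be wired straight from the instruction to the ALU, and a 2-to-1 multiplexer chooses between this immediate and a register value for the second ALU input. A behavioral Python sketch, assuming a 16-bit sign-extended immediate (the field width, signedness and helper names are assumptions, not taken from the exercise):

```python
MASK32 = 0xFFFFFFFF
IMM_BITS = 16  # assumed width of the fixed data (immediate) field


def sign_extend(value, bits=IMM_BITS):
    """Interpret a `bits`-wide two's-complement field as a signed integer."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def alu_input_b(use_immediate, reg_value, imm_field):
    # 2-to-1 multiplexer: the immediate from the instruction, or a register operand.
    return sign_extend(imm_field) if use_immediate else reg_value


def alu_add(a, b):
    return (a + b) & MASK32


print(alu_add(10, alu_input_b(True, 99, 7)))      # 17  (immediate mode)
print(alu_add(10, alu_input_b(False, 99, 7)))     # 109 (register mode)
print(alu_add(10, alu_input_b(True, 0, 0xFFFF)))  # 9   (immediate -1)
```

With a flexible data field, the same idea would likely need one more multiplexer in front of this one, to pick which instruction field carries the data.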
Show the microarchitecture for ALU instructions using immediate addressing mode. The data field is flexible, but the result field is fixed.
Design the datapath for ALU instructions using immediate addressing mode. The data field is flexible, but the result field is fixed.
The instruction may have any of the following formats:
Show the microarchitecture for the LOAD instruction using relative register-indirect memory addressing mode.
In a LOAD instruction, the CPU reads data from memory and saves it to the register indicated in the operand field (destination). In
relative register-indirect memory addressing mode, the memory address is calculated by adding the offset (a number) given in the
operand field to the content of a register (base).
Show the microarchitecture for the STORE instruction using relative register-indirect memory addressing mode.
In a STORE instruction, the CPU saves the content of the register indicated as the source to memory. In relative register-indirect memory
addressing mode, the memory address is calculated by adding the offset (a number) given in the operand field to the content of a register (base).
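Both LOAD and STORE in relative register-indirect mode first use the ALU (as an adder) to form the address base + offset; LOAD then reads memory and writes a register, while STORE reads a register and writes memory, with no register write back. A Python sketch, with memory modeled as a dictionary and hypothetical helper names:

```python
MASK32 = 0xFFFFFFFF


def effective_address(regs, base, offset):
    # Relative register-indirect: address = content of base register + offset (ALU add).
    return (regs[base] + offset) & MASK32


def load(regs, memory, dest, base, offset):
    # LOAD dest, base(offset): memory read, then register write back.
    regs[dest] = memory[effective_address(regs, base, offset)]


def store(regs, memory, src, base, offset):
    # STORE src, base(offset): register read, then memory write (no write back).
    memory[effective_address(regs, base, offset)] = regs[src]


regs = [0] * 8
memory = {}
regs[2] = 1000                   # base register
regs[3] = 55                     # value to store
store(regs, memory, 3, 2, 64)    # memory[1064] = R3
load(regs, memory, 1, 2, 64)     # R1 = memory[1064]
print(regs[1], memory)           # 55 {1064: 55}
```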
Show the microarchitecture for ALU instructions using direct memory addressing mode.
Design the datapath for ALU instructions using direct memory addressing mode.
Opcode | R1 (operand/data-1) | M1 (memory address; operand/data-2) | R3 (result field)

The CPU reads data from the memory address indicated by M1 and adds it to the contents of R1. The result is saved in R3.
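For this CISC-style instruction, the memory read happens before the ALU operation, because operand/data-2 is the memory content itself rather than an address to be computed. A minimal Python sketch (names hypothetical; memory modeled as a dictionary):

```python
MASK32 = 0xFFFFFFFF


def add_direct(regs, memory, r1, m1, r3):
    # Memory access first: M1 is taken directly from the instruction as the address.
    data2 = memory[m1]
    # ALU operation: operand/data-1 from R1, operand/data-2 from memory; result to R3.
    regs[r3] = (regs[r1] + data2) & MASK32


regs = [0] * 8
memory = {200: 30}
regs[1] = 12
add_direct(regs, memory, r1=1, m1=200, r3=3)
print(regs[3])  # 42
```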
Show the microarchitecture for ALU instructions that support both register addressing mode and direct memory addressing mode.
Design the datapath for ALU instructions that support both register addressing mode and direct memory addressing mode.
In register addressing mode:
Opcode | R1 (operand-1) | R2 (operand-2) | R3 (result field)

The CPU adds the contents of R1 and R2, and the result is saved in R3.


In direct memory addressing mode:
Opcode | R1 (operand/data-1) | M1 (memory address; operand/data-2) | R3 (result field)

The CPU reads data from the memory address indicated by M1 and adds it to the contents of R1. The result is saved in R3.
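Supporting both modes mainly adds a 2-to-1 multiplexer on the second ALU input: it selects either the register-file output (register mode) or the memory output (direct mode). The sketch assumes a hypothetical `mode` signal, which in hardware would be derived from the opcode:

```python
MASK32 = 0xFFFFFFFF


def alu_add_two_modes(regs, memory, mode, r1, field2, r3):
    """Hypothetical mode flag: 'reg' -> field2 names R2, 'mem' -> field2 is M1."""
    a = regs[r1]                 # operand-1 always comes from a register
    if mode == "reg":
        b = regs[field2]         # register addressing mode
    elif mode == "mem":
        b = memory[field2]       # direct memory addressing mode
    else:
        raise ValueError(f"unknown mode {mode!r}")
    regs[r3] = (a + b) & MASK32  # ALU add, then write back to R3
    return regs[r3]


regs = [0] * 8
regs[1], regs[2] = 5, 7
memory = {300: 100}
print(alu_add_two_modes(regs, memory, "reg", 1, 2, 3))    # 12
print(alu_add_two_modes(regs, memory, "mem", 1, 300, 3))  # 105
```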
What are branch instructions? Explain how a branch instruction works with an example and suitable diagrams.

Branch instructions are used to change the sequence of instruction execution. To solve problems or to implement different types of
algorithms, programmers use branch instructions to change the order in which the instructions of a program are executed.

A program contains 10 instructions, each of one byte, loaded into a byte-addressable RAM starting at memory address 100.

If there is no branch instruction in the program, the CPU fetches and processes the instructions in the order in which they appear in the
program and are loaded in memory. That is, the CPU fetches and processes Inst-1, followed by Inst-2, Inst-3, Inst-4 and so on, until the
end of the list.

Memory address | Contents
100 | Inst-1
101 | Inst-2
102 | Inst-3
103 | Inst-4
104 | Inst-5
105 | Inst-6
106 | Inst-7
107 | Inst-8
108 | Inst-9
109 | Inst-10
Branch instructions are divided into two categories: (i) unconditional branch and (ii) conditional branch instructions.

The general format of an unconditional branch instruction is as follows:

Opcode | Target address (memory address from which to read the next instruction)

Example: JUMP 106

A program contains 10 instructions, each of one byte, loaded into a byte-addressable RAM starting at memory address 100.

Inst-3 is an unconditional branch instruction:

JUMP 106

When the program is run, the CPU fetches and executes Inst-1, followed by Inst-2 and Inst-3. Inst-3 is an unconditional jump instruction:
it instructs the CPU to jump to memory address 106 (the target address) to read the next instruction. So after processing Inst-3, the CPU
changes the sequence of instructions: it does not process Inst-4, Inst-5 and Inst-6; instead, it jumps to Inst-7. The CPU reads Inst-7 and
processes it, and then continues to process the following instructions in sequence from memory address 106 until another branch
instruction appears.

Memory address | Contents
100 | Inst-1; simple ALU inst
101 | Inst-2; simple ALU inst
102 | Inst-3; unconditional branch inst: JUMP 106
103 | Inst-4
104 | Inst-5
105 | Inst-6
106 | Inst-7 (next instruction)
107 | Inst-8
108 | Inst-9
109 | Inst-10
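The sequential fetch order and the effect of an unconditional JUMP on it can be traced with a tiny PC simulator. This is a sketch under the example's assumptions (1-byte instructions, program at address 100); the encoding of instructions as Python tuples is hypothetical:

```python
def trace(program, start, inst_size=1, limit=20):
    """Return the addresses fetched, in order.

    program maps address -> ("ALU",) or ("JUMP", target_address).
    """
    pc, order = start, []
    while pc in program and len(order) < limit:
        inst = program[pc]
        order.append(pc)
        pc = pc + inst_size      # PC is incremented after every fetch
        if inst[0] == "JUMP":
            pc = inst[1]         # unconditional branch: PC <- target address
    return order


prog = {addr: ("ALU",) for addr in range(100, 110)}
print(trace(prog, 100))          # 100, 101, ..., 109 in sequence
prog[102] = ("JUMP", 106)        # Inst-3 becomes JUMP 106
print(trace(prog, 100))          # [100, 101, 102, 106, 107, 108, 109]
```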

Datapath design of the fetch unit for unconditional JUMP instructions:

[Figure: fetch-unit datapath for JUMP (branch target address)]


Conditional Branch instructions

The general format of a conditional branch instruction is as follows:

Opcode | Reg-1 | Reg-2 | Relative branch target address
(offset: added to the current PC to calculate the address of the next instruction)

Example: BEQ R1, R2, 4

A program contains 10 instructions, each of one byte, loaded into a byte-addressable RAM starting at memory address 100.

Inst-3 is a conditional branch instruction:

BEQ R1, R2, 4

When the program is run, the CPU fetches and executes Inst-1, followed by Inst-2 and Inst-3. Inst-3 is a conditional branch instruction.
Note that when Inst-3 is decoded, the PC is incremented by one (PC = PC + 1 = 103) to point to Inst-4, while Inst-3 is still being
processed. Since Inst-3 is a conditional branch instruction, the CPU evaluates the condition indicated by the opcode. In this example, the
CPU checks whether the content of register R1 is equal to the content of R2. To check the condition, the CPU reads the contents of R1
and R2, and the ALU performs a subtraction. If the result of the subtraction is zero, the condition evaluates TRUE; if the result is
non-zero, the condition evaluates FALSE.

In any conditional branch instruction, the condition is evaluated first (TRUE/FALSE). If the condition evaluates TRUE, the CPU changes
the sequence of instructions. If it evaluates FALSE, the CPU continues in sequence, meaning Inst-4 is fetched/processed next, followed
by Inst-5, Inst-6 and so on.

If the condition evaluates TRUE, the CPU reads the next instruction from a new location in RAM, whose address is called the branch
target address. As mentioned earlier, when Inst-3 was decoded, the program counter was incremented by one to point to the next
instruction in RAM. When the condition evaluates TRUE, the relative branch target address (the offset given in the instruction) is added
to this incremented PC to calculate the address of the next instruction:

Next PC = PC + 1 + Offset (here PC is the address of the branch instruction, 102)

Here, next PC = 102 + 1 + 4 = 107

Now the CPU jumps to memory address 107 (the target address) to read the next instruction. So after processing Inst-3, the CPU changes
the sequence of instructions: it does not process Inst-4, Inst-5, Inst-6 and Inst-7; instead, it jumps to Inst-8. The CPU reads Inst-8 and
processes it, and then continues to process the following instructions in sequence from memory address 107 until another branch
instruction appears.

Memory address | Contents
100 | Inst-1; simple ALU inst
101 | Inst-2; simple ALU inst
102 | Inst-3; conditional branch inst: BEQ R1, R2, 4 (condition evaluated TRUE)
103 | Inst-4
104 | Inst-5
105 | Inst-6
106 | Inst-7
107 | Inst-8
108 | Inst-9
109 | Inst-10

[Figure: datapath, condition evaluated TRUE]
If the condition evaluates FALSE, the CPU follows the current sequence: the current content of the PC (103) points to the next
instruction. So the CPU fetches and processes Inst-4, followed by Inst-5, Inst-6 and so on.

Memory address | Contents
100 | Inst-1; simple ALU inst
101 | Inst-2; simple ALU inst
102 | Inst-3; conditional branch inst: BEQ R1, R2, 4 (condition evaluated FALSE)
103 | Inst-4
104 | Inst-5
105 | Inst-6
106 | Inst-7
107 | Inst-8
108 | Inst-9
109 | Inst-10

[Figure: datapath, condition evaluated FALSE]
Branch instructions
o Unconditional branch
o Conditional branch
   • Branch if the condition is TRUE
   • If the condition is FALSE, continue in sequence
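The branch decision described above reduces to a few lines: increment the PC, subtract to get the zero flag, and add the offset only when the condition is TRUE. A Python sketch (the function name is hypothetical) that works for both 1-byte and 4-byte instructions:

```python
def beq_next_pc(pc, r1, r2, offset, inst_size):
    """Next PC after BEQ located at address pc (PC-relative offset)."""
    pc_next = pc + inst_size       # incremented when the branch is decoded
    zero = (r1 - r2) == 0          # ALU subtraction sets the zero flag
    return pc_next + offset if zero else pc_next


print(beq_next_pc(102, 5, 5, 4, inst_size=1))  # 107: condition TRUE
print(beq_next_pc(102, 5, 6, 4, inst_size=1))  # 103: condition FALSE
```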

Another example:

Assume a program contains 100 instructions, each of 32 bits (4 bytes), loaded into a byte-addressable RAM starting at memory address
100. Each instruction (machine code) occupies 4 addressable locations (shown below), and the program counter (PC) is incremented by 4
to point to the next instruction once an instruction is fetched.
When the program is run, the CPU fetches and executes Inst-1, followed by Inst-2, which is a conditional branch instruction:

BEQ R1, R2, 96

Note that when Inst-2 is decoded, the PC is incremented by 4 (PC = PC + 4 = 104 + 4 = 108) to point to Inst-3, while Inst-2 is still being
processed. Since Inst-2 is a conditional branch instruction, the CPU evaluates the condition indicated by the opcode. In this example, the
CPU checks whether the content of register R1 is equal to the content of R2. To check the condition, the CPU reads the contents of R1
and R2, and the ALU performs a subtraction. If the result of the subtraction is zero, the condition evaluates TRUE; if the result is
non-zero, the condition evaluates FALSE.

In any conditional branch instruction, the condition is evaluated first (TRUE/FALSE). If the condition evaluates TRUE, the CPU changes
the sequence of instructions. If it evaluates FALSE, the CPU continues in sequence, meaning Inst-3 is fetched/processed next, followed
by Inst-4, Inst-5 and so on.

If the condition evaluates TRUE, the CPU reads the next instruction from a new location in RAM, whose address is called the branch
target address. As mentioned earlier, when Inst-2 was decoded, the program counter was incremented by 4 to point to the next
instruction in RAM. When the condition evaluates TRUE, the relative branch target address (the offset given in the instruction) is added
to this incremented PC to calculate the address of the next instruction:

Next PC = PC + 4 + Offset (here PC is the address of the branch instruction, 104)

Here, next PC = 104 + 4 + 96 = 204

Now the CPU jumps to memory address 204 (the target address) to read the next instruction. So after processing Inst-2, the CPU changes
the sequence of instructions: it does not process Inst-3, Inst-4, ..., Inst-26; instead, it jumps to Inst-27, which starts at address
204 = 100 + 4 × (27 - 1). The CPU reads Inst-27 and processes it, and then continues to process the following instructions in sequence
from memory address 204 until another branch instruction appears.

Memory address | Contents                                       | Type of inst
100            | Inst-1, lower 8 bits                           | ALU
101            | Inst-1, next 8 bits                            |
102            | Inst-1, next 8 bits                            |
103            | Inst-1, next 8 bits                            |
104            | Inst-2 (BEQ R1, R2, 96)                        | Branch inst (conditional); PC = PC + 4 = 108
105-107        | Inst-2, remaining bytes                        |
108            | Inst-3 (next if the branch condition is FALSE) |
109            | ...                                            |
...            | ...                                            |
204            | Inst-27 (next if the branch condition is TRUE) | PC = PC + 4 + 96 = 204
205-207        | Inst-27, remaining bytes                       |
208            | Inst-28                                        |

Each instruction is 32 bits (4 bytes) and the RAM is byte-addressable.


Once an instruction is fetched, the PC is incremented by 4 to point to the next instruction: PC = PC + 4.
BEQ R1, R2, 96 ; branch if R1 = R2
The CPU checks whether R1 = R2.
If the branch condition evaluates TRUE, the CPU jumps to memory address PC + 4 + 96 = 104 + 4 + 96 = 204 to read the next instruction
(Inst-27).

If R1 is not equal to R2, the branch condition evaluates FALSE, and the CPU continues with the next instruction in the program, i.e. it
reads the next instruction from PC + 4 = 108 (Inst-3).
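Since Inst-n starts at address 100 + 4 × (n - 1), the instruction at a given address can be computed directly, which is a handy check when reading branch targets. A Python sketch (helper names hypothetical):

```python
def instruction_number(address, start=100, inst_size=4):
    """1-based index of the instruction whose first byte is at `address`."""
    offset = address - start
    assert offset >= 0 and offset % inst_size == 0, "not an instruction boundary"
    return offset // inst_size + 1


def instruction_address(n, start=100, inst_size=4):
    """Address of the first byte of Inst-n."""
    return start + (n - 1) * inst_size


target = 104 + 4 + 96
print(target, instruction_number(target))  # 204 27
```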

Datapath design
[Figure: datapath design, condition evaluated TRUE]

[Figure: datapath design, condition evaluated FALSE]

If the condition evaluates FALSE, the CPU follows the current sequence: the current content of the PC (108) points to the next
instruction. So the CPU fetches and processes Inst-3, followed by Inst-4, Inst-5 and so on.
Design the datapath for the branch instruction given below:

BEQ R1, R2, Offset
Note that instructions are assumed to be 32 bits each and a byte-addressable memory is used, so the PC is incremented by 4 to point to
the next instruction once an instruction is fetched.

After executing the conditional branch instruction BEQ R1, R2, OFFSET, the CPU reads the next instruction from PC + 4 + Offset if
R1 = R2; otherwise (R1 ≠ R2), it reads the next instruction from PC + 4. Here, PC is the address of the conditional branch instruction
BEQ R1, R2, OFFSET in RAM. PC + 4 points to the next instruction in the program, which is stored immediately after the branch
instruction in RAM, whereas PC + 4 + Offset points to an instruction that does not lie in sequence after the branch instruction.
[Figure: datapath, condition TRUE]
[Figure: datapath, condition FALSE]
Show the datapath design for the following conditional branch instructions:
BEQ R1, R2, 128 ; Branch if R1 = R2
BNQ R1, R2, 128 ; Branch if R1 is NOT equal to R2
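In the datapath, the two candidate next-PC values are computed in parallel by two adders, and a multiplexer picks one using the ALU's zero flag; BNQ simply inverts the selection. A behavioral Python sketch with 4-byte instructions (names hypothetical, 32-bit wraparound assumed):

```python
MASK32 = 0xFFFFFFFF


def branch_datapath(pc, r1, r2, offset, opcode):
    """Fetch-side datapath for BEQ/BNQ, 4-byte instructions."""
    pc_plus_4 = (pc + 4) & MASK32             # adder 1: sequential address
    target = (pc_plus_4 + offset) & MASK32    # adder 2: branch target address
    zero = ((r1 - r2) & MASK32) == 0          # ALU subtraction -> zero flag
    if opcode == "BEQ":
        take = zero
    elif opcode == "BNQ":
        take = not zero
    else:
        raise ValueError(f"unknown opcode {opcode}")
    # 2-to-1 multiplexer in front of the PC, selected by the branch decision.
    return target if take else pc_plus_4


print(branch_datapath(1000, 7, 7, 128, "BEQ"))  # 1132
print(branch_datapath(1000, 7, 7, 128, "BNQ"))  # 1004
```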
Show the datapath design for the instruction fetch unit.
Show the datapath design for the decode and register read unit.
What is single cycle implementation of a CPU?

What is multi-cycle implementation of a CPU?

Explain the design of a multi-cycle implementation of a CPU.


Show the multi-cycle implementation of the following instruction for CISC processors:
ADD R1, M1, R2 ; here R2 is the result field

Instruction    | Cycle 1 | Cycle 2                  | Cycle 3       | Cycle 4 | Cycle 5      | Comments
ADD R1, M1, R2 | Fetch   | Decode and register read | Memory access | ALU     | Result store | 5 clock cycles required


What is the average CPI of a multi-cycle implementation?
In a multi-cycle implementation, processing of an instruction is split into sub-tasks, and each sub-task is usually completed in a single
clock cycle (synchronous control).
Simple instructions require few sub-tasks and hence few clock cycles, whereas complex instructions require more sub-tasks and more
clock cycles.
The average CPI therefore depends on the program, since a program usually contains different types of instructions in different
proportions.
It is usually greater than 1, typically about 3 to 4.
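Average CPI is a weighted mean: each instruction class contributes (fraction of instructions) × (clock cycles for that class). The mix and cycle counts below are hypothetical, chosen only to show the calculation:

```python
def average_cpi(mix):
    """mix: {instruction class: (fraction of instructions, clock cycles)}."""
    total = sum(frac for frac, _ in mix.values())
    assert abs(total - 1.0) < 1e-9, "fractions must sum to 1"
    return sum(frac * cycles for frac, cycles in mix.values())


# Hypothetical program mix and per-class cycle counts, for illustration only.
mix = {
    "ALU":    (0.50, 4),
    "load":   (0.20, 5),
    "store":  (0.10, 4),
    "branch": (0.20, 3),
}
print(round(average_cpi(mix), 2))  # 4.0
```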

Discuss the addressing modes of RISC processors. Give examples.


Discuss the addressing modes of CISC processors. Give examples.
Show the multi-cycle implementation of the following instructions for RISC processors:
ADD R1, R3, R2 ; here R2 is the result field
LOAD R1, R2(64)
STORE R3, R2(64)
Compare the number of clock cycles required by the above instructions.
Show the multi-cycle implementation of the following instructions for CISC processors:
ADD R1, R3, R2 ; here R2 is the result field
ADD R1, M1, R2
ADD R1, [R2], R3
ADD R1, [R2+R3], R4
Compare the number of clock cycles required by the above instructions.
Compare the memory models of RISC and CISC processors.
Compare the stages and sequence of tasks of RISC and CISC processors. If different, explain why.
How to control the multi-cycle implementation?
Example of multi-cycle implementation (central clock: synchronous):
Processing of instructions is split into the following tasks:
Fetch | Decode and register read | ALU  | Memory access | Result store
20 µs | 8 µs                     | 5 µs | 20 µs         | 8 µs

We need a clock to control the flow of tasks, and the clock period is set to 20 µs (consistent with the slowest task).
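With the clock set by the slowest task, every step an instruction uses costs a full 20 µs, even when the task itself needs only 5 or 8 µs. A Python sketch of this arithmetic (the step list for a register-register ALU instruction is an assumption for illustration):

```python
stages_us = {
    "fetch": 20,
    "decode_and_register_read": 8,
    "alu": 5,
    "memory_access": 20,
    "result_store": 8,
}
clock_us = max(stages_us.values())  # the clock must fit the slowest task
print(clock_us)                     # 20


def instruction_time_us(steps):
    # Each step an instruction uses costs one full clock cycle.
    return len(steps) * clock_us


# Assumed: a register-register ALU instruction skips memory access.
print(instruction_time_us(["fetch", "decode_and_register_read", "alu", "result_store"]))  # 80
print(instruction_time_us(list(stages_us)))  # 100, although the tasks add up to 61 us
```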

Compare single-cycle and multi-cycle implementation.


Comment on the overall advantages of RISC
The Overall RISC Advantage
Today, the Intel x86 is arguably the only chip that retains a CISC architecture. This is primarily due to advancements in other areas of
computer technology. The price of RAM has decreased dramatically: in 1977, 1 MB of DRAM cost about $5,000, whereas by 1994 the
same amount of memory cost only $6 (when adjusted for inflation). Compiler technology has also become more sophisticated, so the
RISC approach, with its greater use of RAM and emphasis on software, has become ideal.
What is pipelining?
What is the average CPI of a pipelined implementation?

Give an example of a complex instruction for CISC and show its equivalent in RISC.

VAX 11/780
Instruction:

A single instruction can read data from memory, add it to the content of a register, store the result back in memory and increment the
memory pointer:
(R2) = (R2) + R1, R2 = R2 + 1
The complex instruction can be written as a group of simple instructions (as we have seen in the MIPS ISA):
R4 = (R2)      # load instruction, I-type
R4 = R4 + R1   # add instruction, R-type
(R2) = R4      # store instruction, I-type
R2 = R2 + 1    # add-immediate instruction, I-type
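The equivalence can be checked by running both versions on the same state. In this Python sketch, memory and registers are dictionaries (a modeling assumption) and R4 serves as the temporary register, as in the notes:

```python
def vax_style(mem, regs):
    # One complex instruction: (R2) = (R2) + R1, then R2 = R2 + 1.
    mem[regs["R2"]] = mem[regs["R2"]] + regs["R1"]
    regs["R2"] += 1


def risc_style(mem, regs):
    regs["R4"] = mem[regs["R2"]]          # load
    regs["R4"] = regs["R4"] + regs["R1"]  # add
    mem[regs["R2"]] = regs["R4"]          # store
    regs["R2"] = regs["R2"] + 1           # add immediate


mem_a, regs_a = {50: 10}, {"R1": 5, "R2": 50, "R4": 0}
mem_b, regs_b = {50: 10}, {"R1": 5, "R2": 50, "R4": 0}
vax_style(mem_a, regs_a)
risc_style(mem_b, regs_b)
print(mem_a == mem_b, regs_a["R2"] == regs_b["R2"])  # True True
```

The only visible difference is that the RISC sequence also leaves the sum in the temporary register R4.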

Compare RISC and CISC

Characteristics                     | VAX 11/780 | Intel 486 | MIPS R4000
Number of instructions              | 303        | 235       | 94
Addressing modes                    | 22         | 11        | 1
Instruction size (bytes)            | 2-57       | 1-12      | 4
Number of general-purpose registers | 16         | 8         | 32

Intel complex instruction

MOVSB
A single instruction reads one byte of data from one memory location and copies it to another memory location. In addition, two
registers are automatically incremented or decremented to point to the source and destination memory locations.
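MOVSB's behavior can be sketched as follows. This assumes the standard x86 convention (source address in SI, destination in DI, direction flag DF selecting increment or decrement) and ignores segment registers; memory is modeled as a dictionary:

```python
def movsb(mem, regs, df=0):
    """Sketch of x86 MOVSB on a flat memory (segment registers ignored)."""
    mem[regs["DI"]] = mem[regs["SI"]]  # copy one byte: [SI] -> [DI]
    step = -1 if df else 1             # DF = 0: increment, DF = 1: decrement
    regs["SI"] += step
    regs["DI"] += step


mem = {10: 0x41, 11: 0x42, 12: 0x43}
regs = {"SI": 10, "DI": 20}
for _ in range(3):
    movsb(mem, regs)
print([mem[a] for a in (20, 21, 22)], regs)  # [65, 66, 67] {'SI': 13, 'DI': 23}
```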
What is a pipelined architecture? Briefly explain.
Which processor (RISC/CISC) is suitable to implement in pipelined form? Explain why.
How do you design a pipelined processor? Explain the procedure with suitable diagrams.
What types of programs/instructions are suitable to run on a pipelined processor? Why?
Show the processing of the first 8 instructions of a program on a pipelined processor, assuming the instructions are independent.
Comment on the case where some instructions depend on the results of preceding instructions (what would the runtime and speedup be?).
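EXAMPLE FOR A RISC PROCESSOR WITH 5-STAGE PIPELINING

FETCH (IF) | DECODE AND REGISTER READ (ID) | ALU (EX) | MEMORY ACCESS (MA) | WRITE RESULT (WR)

INSTRUCTIONS | CLOCK CYCLES
             | 1  2  3  4  5  6  7  8  9  10 11 12
1            | IF ID EX MA WR
2            |    IF ID EX MA WR
3            |       IF ID EX MA WR
4            |          IF ID EX MA WR
5            |             IF ID EX MA WR
6            |                IF ID EX MA WR
7            |                   IF ID EX MA WR
8            |                      IF ID EX MA WR

Ideal case: runtime = 5 + (8 - 1) = 12 clock cycles
Speedup = runtime on a non-pipelined processor / runtime on a 5-stage pipelined processor = (8 × 5)/12 ≈ 3.33
If some instructions depend on the results of preceding instructions, the pipeline will stall for some clock cycles, so the runtime will be
higher and the speedup will be lower.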
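The ideal runtime and speedup follow a general pattern: with k stages and n independent instructions, the pipeline needs k + (n - 1) cycles, against n × k cycles without pipelining (assuming the same cycle length). A Python sketch that also prints the timing diagram (names hypothetical):

```python
STAGES = ["IF", "ID", "EX", "MA", "WR"]


def pipeline_runtime(n, k=len(STAGES)):
    # Ideal pipeline: k cycles to fill, then one instruction completes per cycle.
    return k + n - 1


def speedup(n, k=len(STAGES)):
    # Non-pipelined: each instruction takes k cycles of the same length.
    return (n * k) / pipeline_runtime(n, k)


def diagram(n):
    rows = []
    for i in range(n):
        rows.append(f"{i + 1:<4}| " + "   " * i + " ".join(STAGES))
    return "\n".join(rows)


print(diagram(8))
print(pipeline_runtime(8), round(speedup(8), 2))    # 12 3.33
print(pipeline_runtime(10), round(speedup(10), 2))  # 14 3.57
```

As n grows, the speedup approaches k (here 5); stalls caused by dependent instructions add cycles to the runtime and pull the speedup down.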
Show the processing of the first 10 instructions of a program on a pipelined processor, assuming the instructions are independent.
Comment on the case where some instructions depend on the results of preceding instructions (what would the runtime and speedup be?).
What do you understand by stages in a pipelined architecture?
Calculate the average CPI and speedup of a pipelined architecture.
What does a pipelined architecture improve: (i) the processing time of an instruction, (ii) the throughput of the system (number of
instructions/sec), or (iii) both? Explain with an example.
Compare the runtime of a program on RISC and CISC in general.

The Performance Equation


The following equation is commonly used for expressing a computer's performance ability:

Runtime = I × average CPI × clock period, where I is the number of instructions executed

The CISC approach attempts to minimize the number of instructions per program, sacrificing the number of cycles per
instruction. RISC does the opposite, reducing the cycles per instruction at the cost of the number of instructions per
program.
A program is compiled on a RISC and a CISC machine. The RISC compiler generates 50% more instructions than the CISC compiler. The
average CPI of the CISC processor is approximately 4. Which machine will run faster, and what is its speedup over the other machine?
Assume that the CPU clock is the same for both machines.

Assume that the total instruction count on CISC = 100, so the instruction count on RISC = 150.


Clock period for both RISC and CISC = C µs
Average CPI for RISC = 1

CPU runtime (CISC) = 100 × 4 × C = 400C


CPU runtime (RISC) = 150 × 1 × C = 150C
RISC runs faster.

Speedup = CPU runtime (CISC) / CPU runtime (RISC) = 400C / 150C ≈ 2.67
RISC runs about 2.67 times faster than CISC.

A program is compiled on a RISC and a CISC machine. The RISC compiler generates 75% more instructions than the CISC compiler. The
average CPI of the CISC processor is approximately 3. Which machine will run faster, and what is its speedup over the other machine?
Assume that the CPU clock is the same for both machines.
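The performance equation can be evaluated directly for both exercises. This Python sketch assumes, as the worked example does, a RISC CPI of 1 and equal clock periods; the helper names are hypothetical:

```python
def runtime(instruction_count, cpi, clock_period):
    # Runtime = I x average CPI x clock period
    return instruction_count * cpi * clock_period


def compare(cisc_count, risc_extra, cisc_cpi, risc_cpi=1.0, clock=1.0):
    """Return (CISC runtime, RISC runtime, speedup of RISC over CISC)."""
    risc_count = cisc_count * (1 + risc_extra)
    t_cisc = runtime(cisc_count, cisc_cpi, clock)
    t_risc = runtime(risc_count, risc_cpi, clock)
    return t_cisc, t_risc, t_cisc / t_risc


print(compare(100, 0.50, 4))  # worked example: 400C vs 150C, speedup about 2.67
print(compare(100, 0.75, 3))  # second exercise, assuming RISC CPI = 1
```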

Assume that the individual stages of the datapath have the following latencies (required time/delay):
IF     | ID     | EX     | MEM    | WB
250 ps | 350 ps | 150 ps | 300 ps | 200 ps

Also, assume that the instructions executed by the processor are broken down as follows:
alu | beq | lw  | sw
45% | 20% | 20% | 15%

What is the clock cycle time in a pipelined and a non-pipelined processor?

In a pipelined implementation, a central clock signal controls the flow of tasks, and the clock period is set by the slowest stage (ID,
350 ps), so the clock cycle = 350 ps.
In a non-pipelined multi-cycle implementation, the same clock cycle (350 ps) is used.

Non-pipelined single-cycle implementation: clock cycle = 250 + 350 + 150 + 300 + 200 = 1250 ps.
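Both clock periods come from the stage latencies: the pipelined (and multi-cycle) clock is the maximum, the single-cycle clock is the sum. In Python:

```python
latency_ps = {"IF": 250, "ID": 350, "EX": 150, "MEM": 300, "WB": 200}

pipelined_clock = max(latency_ps.values())     # slowest stage sets the clock
single_cycle_clock = sum(latency_ps.values())  # whole datapath in one cycle
print(pipelined_clock, single_cycle_clock)     # 350 1250
```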

If we can split one stage of the pipelined datapath into two new stages, each with half the latency of the original stage, which stage
would you split, and what is the new clock cycle time of the processor?
Assume that the individual stages of the datapath have the following latencies (required time/delay):
IF     | ID     | EX     | MEM    | WB
250 ps | 350 ps | 150 ps | 300 ps | 200 ps

Split the slowest stage, ID, into two stages, ID-1 and ID-2:
IF     | ID-1   | ID-2   | EX     | MEM    | WB
250 ps | 175 ps | 175 ps | 150 ps | 300 ps | 200 ps

The new clock cycle time = 300 ps (now set by MEM, the slowest remaining stage).
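Splitting any stage other than the slowest one leaves the 350 ps ID stage in place, so only splitting ID helps. A Python sketch of that choice (helper name hypothetical):

```python
latency_ps = {"IF": 250, "ID": 350, "EX": 150, "MEM": 300, "WB": 200}


def split_slowest(latencies):
    """Split the slowest stage into two halves; return its name and the new stages."""
    slowest = max(latencies, key=latencies.get)
    new = {}
    for name, t in latencies.items():
        if name == slowest:
            new[name + "-1"] = t / 2
            new[name + "-2"] = t / 2
        else:
            new[name] = t
    return slowest, new


stage, new_latency = split_slowest(latency_ps)
print(stage, max(new_latency.values()))  # ID 300
```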

Assuming there are no stalls or hazards, what is the utilization of the data memory?
Assuming there are no stalls or hazards, what is the utilization of the write-register port of the “Registers” unit?
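With no stalls, one instruction passes through each stage per cycle, so a unit's utilization equals the fraction of instructions that use it. A Python sketch, assuming that only lw and sw access data memory and that only ALU instructions and lw write a register:

```python
mix = {"alu": 0.45, "beq": 0.20, "lw": 0.20, "sw": 0.15}

# Assumed resource usage per instruction class (classic 5-stage RISC pipeline).
uses_data_memory = {"lw", "sw"}  # loads read data memory, stores write it
writes_register = {"alu", "lw"}  # beq and sw do not write a register

data_memory_util = sum(mix[c] for c in uses_data_memory)
write_port_util = sum(mix[c] for c in writes_register)
print(f"{data_memory_util:.0%} {write_port_util:.0%}")  # 35% 65%
```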
Instead of a single-cycle organization, we can use a multi-cycle organization, where each instruction takes multiple cycles but one
instruction finishes before another is fetched. In this organization, an instruction goes only through the stages it actually needs (e.g., ST
takes only 4 cycles because it does not need the WB stage). Compare clock cycle times and execution times with single-cycle,
multi-cycle and pipelined organizations.
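A sketch of the comparison in Python. The per-class stage lists are an assumption in the spirit of the ST example (a branch is assumed to need only IF, ID and EX), and the pipelined figure is the ideal one with no stalls:

```python
latency_ps = {"IF": 250, "ID": 350, "EX": 150, "MEM": 300, "WB": 200}
mix = {"alu": 0.45, "beq": 0.20, "lw": 0.20, "sw": 0.15}
# Assumed stages each class needs in the multi-cycle organization.
stages_needed = {
    "alu": ["IF", "ID", "EX", "WB"],
    "beq": ["IF", "ID", "EX"],
    "lw":  ["IF", "ID", "EX", "MEM", "WB"],
    "sw":  ["IF", "ID", "EX", "MEM"],
}

single_clock = sum(latency_ps.values())  # 1250 ps, CPI = 1
multi_clock = max(latency_ps.values())   # 350 ps
multi_cpi = sum(mix[c] * len(stages_needed[c]) for c in mix)
pipe_clock = multi_clock                 # 350 ps, ideal CPI = 1

time_per_instr = {
    "single-cycle": single_clock * 1,
    "multi-cycle": multi_clock * multi_cpi,
    "pipelined (ideal)": pipe_clock * 1,
}
for org, t in time_per_instr.items():
    print(f"{org}: {t:.0f} ps per instruction")
```

Under these assumptions the multi-cycle organization (CPI 4.0, about 1400 ps per instruction) is slightly slower than single-cycle here, because the 350 ps clock wastes time in the faster stages.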
Consider the following sequence of instructions:

or r1,r2,r3
or r2,r1,r4
or r1,r1,r2

Also, assume the following cycle times for each of the options related to forwarding:
Without forwarding: 250 ps
With full forwarding: 300 ps
With ALU-ALU forwarding only: 290 ps

Indicate dependences and their type.
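A simple pairwise check finds the dependences in this sequence. The Python sketch below compares every earlier instruction with every later one; it does not account for intervening writes, which is fine for three instructions but not in general:

```python
# Each instruction: (destination, [sources])
program = [
    ("r1", ["r2", "r3"]),  # I1: or r1,r2,r3
    ("r2", ["r1", "r4"]),  # I2: or r2,r1,r4
    ("r1", ["r1", "r2"]),  # I3: or r1,r1,r2
]


def dependences(prog):
    found = []
    for j, (dst_j, srcs_j) in enumerate(prog):
        for i in range(j):
            dst_i, srcs_i = prog[i]
            if dst_i in srcs_j:
                found.append(("RAW", i + 1, j + 1, dst_i))  # true dependence
            if dst_j in srcs_i:
                found.append(("WAR", i + 1, j + 1, dst_j))  # anti-dependence
            if dst_i == dst_j:
                found.append(("WAW", i + 1, j + 1, dst_j))  # output dependence
    return found


for d in dependences(program):
    print(d)  # (type, earlier instruction, later instruction, register)
```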


Assume there is no forwarding in this pipelined processor. Indicate hazards and add nop instructions to eliminate them.
Assume there is full forwarding. Indicate hazards and add NOP instructions to eliminate them.
What is the total execution time of this instruction sequence without forwarding and with full forwarding? What is the speedup
achieved by adding full forwarding to a pipeline that had no forwarding?
Add nop instructions to this code to eliminate hazards if there is ALU-ALU forwarding only (no forwarding from the MEM to the
EX stage).
What is the total execution time of this instruction sequence with only ALU-ALU forwarding? What is the speedup over a no-
forwarding pipeline?
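Once the nop counts are known, execution time is (stages - 1 + instructions + nops) × cycle time. The nop counts in this Python sketch are assumptions for a classic 5-stage pipeline whose register file writes in the first half-cycle and reads in the second; they are worth checking against your own hazard analysis:

```python
def exec_time_ps(n_instructions, n_nops, cycle_ps, stages=5):
    # Ideal fill: (stages - 1) cycles, then one instruction (or nop) completes per cycle.
    cycles = (stages - 1) + n_instructions + n_nops
    return cycles * cycle_ps


# Assumed nop counts (see lead-in):
no_fwd = exec_time_ps(3, 4, 250)    # 2 nops before I2, 2 nops before I3
full_fwd = exec_time_ps(3, 0, 300)  # all hazards covered by forwarding
alu_alu = exec_time_ps(3, 2, 290)   # 2 nops before I3 (no MEM-to-EX forwarding)
print(no_fwd, full_fwd, alu_alu)                              # 2750 2100 2610
print(round(no_fwd / full_fwd, 2), round(no_fwd / alu_alu, 2))  # 1.31 1.05
```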
