Test 1
What is Microarchitecture?
Microarchitecture refers to the detailed hardware design and implementation of the instruction set architecture (ISA) of a processor. A general-purpose processor is usually designed to process arithmetic, logical, data transfer, memory read/write, conditional and branch instructions. Moreover, the ISA introduces several addressing modes for different types of instructions. Digital circuits and systems such as logic gates, flip-flops, registers, multiplexers, demultiplexers, decoders, encoders, adders and multipliers are used to design the microarchitecture, which implements every instruction of the instruction set electronically inside the processor. The design complexity (space and time) increases with the number of instructions and addressing modes supported by the ISA.
To optimize the design of the microarchitecture, the processing of an instruction is split into several sub-tasks. The sub-tasks are chosen so that every instruction in the instruction set can be processed using some or most of them. The microarchitecture is therefore designed in modular form, with a separate module for each sub-task. In synchronous control, a central clock (timing signal) is used to initiate, sequence and synchronize the tasks as required by the instruction and its addressing mode; for some instructions, a module can be used several times, but in different clock cycles. In asynchronous control, the completion of one task initiates the next task.
Assume that a CPU has 32 registers, each 32 bits wide. Show the design of a register architecture that lets the CPU read any two registers and write to any one register in a single clock cycle (2R + 1W).
Two 32-to-1 multiplexers (one per read port) and one 1-to-32 demultiplexer (for the write port).
Machine code (register addressing mode):
Opcode | Operand-1      | Operand-2      | Result field
       | minimum 5 bits | minimum 5 bits | minimum 5 bits
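The 2R + 1W register file above can be sketched behaviourally. The following Python model is a minimal sketch, not a hardware description: the class name `RegisterFile` and its methods are hypothetical, each read select stands in for one 32-to-1 multiplexer, and the write select stands in for the 1-to-32 demultiplexer, gated by an assumed write-enable signal.

```python
class RegisterFile:
    """Hypothetical behavioural model of a 32 x 32-bit register file (2 reads + 1 write per cycle)."""

    NUM_REGS = 32
    MASK = 0xFFFFFFFF  # keep every value within 32 bits

    def __init__(self):
        self.regs = [0] * self.NUM_REGS

    def read(self, sel1, sel2):
        # Two read ports: each 5-bit select plays the role of one 32-to-1 multiplexer.
        for sel in (sel1, sel2):
            if not 0 <= sel < self.NUM_REGS:
                raise ValueError(f"register number out of range: {sel}")
        return self.regs[sel1], self.regs[sel2]

    def write(self, dest, value, write_enable=True):
        # One write port: the 1-to-32 demultiplexer routes the value to exactly one register.
        if not write_enable:
            return
        if not 0 <= dest < self.NUM_REGS:
            raise ValueError(f"register number out of range: {dest}")
        self.regs[dest] = value & self.MASK

    def clock_cycle(self, sel1, sel2, dest, alu_op):
        """Read two registers, apply alu_op, and write the result back in one cycle."""
        a, b = self.read(sel1, sel2)
        self.write(dest, alu_op(a, b))
        return self.regs[dest]


# Usage: R3 = R1 + R2
rf = RegisterFile()
rf.write(1, 7)
rf.write(2, 5)
rf.clock_cycle(1, 2, 3, lambda a, b: a + b)  # R3 becomes 12
```

In real hardware the reads and the write happen within the same clock cycle; the model simply performs them in that order.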
Assume that a CPU has 16 registers, each 32 bits wide. Show the design of a register architecture that lets the CPU read any two registers and write to any one register in a single clock cycle (2R + 1W).
Assume that the CPU has 24 general-purpose registers, denoted R0 – R23, where R0 – R7 are reserved for operand-1, R8 – R15 for operand-2 and R16 – R23 for the result field.
Assume that a CPU has 8 registers, each 32 bits wide. Show the design of a register architecture that lets the CPU read any two registers and write to any one register in a single clock cycle (2R + 1W).
A CPU has 24 registers, each 16 bits wide, and supports only register-based instructions. Registers are assigned to the instruction fields as follows:
Opcode | 1st operand field | 2nd operand field | Result field
       | R0 – R7           | R8 – R15          | R16 – R23
Show the design of the register architecture (hint: first decide how many registers the CPU may need to read from and write to).
A CPU has 24 registers, each 16 bits wide, and supports only register-based instructions. Registers are assigned to the instruction fields as follows:
Opcode | 1st operand field | 2nd operand field | Result field
       | R0 – R7           | R8 – R15          | R16 – R23
Show the design of the register architecture (hint: first decide how many registers the CPU may need to read from and write to). Show the connection of the ALU with the register file.
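The sizing reasoning behind these register-file questions can be captured in a few lines. This is a minimal sketch assuming the usual rule that a field selecting one of n registers needs ceil(log2 n) bits and that each read port is an n-to-1 multiplexer; the function names are hypothetical. For the partitioned 24-register CPU, each field only ever selects among 8 registers.

```python
import math


def select_bits(num_choices):
    """Minimum field width (bits) to select one of num_choices registers."""
    return max(1, math.ceil(math.log2(num_choices)))


def register_file_parts(regs_per_read_port, regs_per_write_port, read_ports=2):
    """Field widths and MUX/DEMUX sizes for a (read_ports)R + 1W register file."""
    return {
        "read_field_bits": select_bits(regs_per_read_port),
        "write_field_bits": select_bits(regs_per_write_port),
        "read_muxes": f"{read_ports} x {regs_per_read_port}-to-1 MUX",
        "write_demux": f"1 x 1-to-{regs_per_write_port} DEMUX",
    }


# Unpartitioned register files: every field can name any register.
for n in (32, 16, 8):
    print(n, register_file_parts(n, n))

# Partitioned 24-register CPU: R0-R7, R8-R15 and R16-R23 each serve one field,
# so each read port and the write port only choose among 8 registers.
print(24, register_file_parts(8, 8))
```

Partitioning therefore shrinks each field to 3 bits and each multiplexer to 8-to-1, instead of the 5 bits and 24-to-1 (rounded up to 32-to-1) that an unpartitioned 24-register file would need.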
Show the microarchitecture for ALU instructions using register addressing mode.
Design the datapath for ALU instructions using register addressing mode.
Show the microarchitecture for ALU instructions that support both register addressing mode and direct memory addressing mode.
Design the datapath for ALU instructions that support both register addressing mode and direct memory addressing mode.
In register addressing mode, the machine code is:
Opcode | R1 (operand-1) | R2 (operand-2) | R3 (result field)
In direct memory addressing mode, the CPU reads data from the memory address indicated by M1, adds it to the contents of R1 and saves the result to R3.
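To make the two addressing modes concrete, the sketch below models an ADD whose second operand comes either from the register file or from data memory. It is a hypothetical Python model (the function `execute_add`, the `mode` strings and the address 0x40 used for M1 are illustrative); the `if` on `mode` stands in for the multiplexer that selects the ALU's second input.

```python
def execute_add(mode, regs, memory, dest, src1, src2):
    """Hypothetical ADD: R[dest] = R[src1] + (R[src2] or Mem[src2]).

    mode "register": src2 is a register number (e.g. R3 = R1 + R2).
    mode "direct":   src2 is a memory address M1 (e.g. R3 = R1 + Mem[M1]).
    """
    a = regs[src1]                 # operand-1 always comes from the register file
    if mode == "register":
        b = regs[src2]             # MUX input 0: second register read port
    elif mode == "direct":
        b = memory[src2]           # MUX input 1: data-memory read at address M1
    else:
        raise ValueError(f"unsupported addressing mode: {mode}")
    regs[dest] = a + b             # ALU output written back to the result register
    return regs[dest]


regs = [0] * 8
regs[1], regs[2] = 10, 3
memory = {0x40: 5}                               # M1 = 0x40 (hypothetical address)
execute_add("register", regs, memory, 3, 1, 2)   # R3 = 10 + 3
execute_add("direct", regs, memory, 3, 1, 0x40)  # R3 = 10 + 5
```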
What are branch instructions? Explain how a branch instruction works, with an example and suitable diagrams.
Branch instructions are used to change the sequence of instruction execution. To solve problems or implement different types of algorithms, programmers use branch instructions to change the order in which the instructions of a program are executed.
A program contains 10 instructions, each one byte long, loaded into a byte-addressable RAM starting at memory address 100.
If there is no branch instruction in the program, the CPU fetches and processes the instructions in the sequence in which they appear in the program and are loaded in memory: Inst-1, followed by Inst-2, Inst-3, Inst-4 and so on, until the end of the list.
A program contains 10 instructions, each one byte long, loaded into a byte-addressable RAM starting at memory address 100. Inst-3, at address 102, is:
JUMP 106
When the program runs, the CPU fetches and executes Inst-1, followed by Inst-2 and Inst-3. Inst-3 is an unconditional jump instruction that instructs the CPU to jump to memory address 106 (the target address) to read the next instruction. So, after processing Inst-3, the CPU changes the sequence of instructions: it does not process Inst-4, Inst-5 or Inst-6, but jumps to Inst-7 (address 106), reads it and processes it. The CPU then continues to process the following instructions in sequence from memory address 106 until it encounters another branch instruction.
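The effect of the jump on the fetch sequence can be traced with a few lines of code. This is a minimal sketch: the program is modelled as a hypothetical dictionary from address to instruction, and a tuple `("JUMP", target)` stands for an unconditional jump; every other entry simply falls through to the next address.

```python
def trace_fetches(program, start=100, inst_size=1, max_steps=100):
    """Addresses fetched, in order, for a toy program in byte-addressable memory.

    program maps address -> instruction; a tuple ("JUMP", target) is an
    unconditional jump, anything else simply falls through to the next address.
    """
    end = start + len(program) * inst_size
    pc, fetched = start, []
    while start <= pc < end and len(fetched) < max_steps:
        fetched.append(pc)
        inst = program[pc]
        pc += inst_size                  # PC already points to the next instruction
        if isinstance(inst, tuple) and inst[0] == "JUMP":
            pc = inst[1]                 # the jump overwrites PC with the target address
    return fetched


# 10 one-byte instructions at addresses 100-109; Inst-3 (address 102) is JUMP 106.
jump_program = {100 + i: f"Inst-{i + 1}" for i in range(10)}
jump_program[102] = ("JUMP", 106)
print(trace_fetches(jump_program))  # [100, 101, 102, 106, 107, 108, 109]
```

Removing the jump entry gives the purely sequential trace 100 to 109 described for a program without branches.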
A program contains 10 instructions, each one byte long, loaded into a byte-addressable RAM starting at memory address 100.
When the program runs, the CPU fetches and executes Inst-1, followed by Inst-2 and Inst-3. Here Inst-3 is a conditional branch instruction. Note that when Inst-3 is decoded, the PC is incremented by one (PC = PC + 1 = 103) to point to Inst-4, while Inst-3 is still being processed. Since Inst-3 is a conditional branch instruction, the CPU evaluates the condition indicated in the opcode. In this example, the CPU checks whether the content of register R1 is equal to the content of R2. To check the condition, the CPU reads the contents of R1 and R2 and performs a subtraction in the ALU. If the result of the subtraction is zero, the condition evaluates to TRUE; if the result is non-zero, the condition evaluates to FALSE.
In any conditional branch instruction, the condition is evaluated first (TRUE/FALSE). If the condition evaluates to TRUE, the CPU changes the sequence of instructions. If it evaluates to FALSE, the CPU continues in sequence, that is, Inst-4 is fetched and processed next, followed by Inst-5, Inst-6 and so on.
If the condition evaluates to TRUE, the CPU reads the next instruction from a new location in RAM, whose address is called the branch target address. As mentioned earlier, when Inst-3 was decoded, the program counter was already incremented by one to point to the next instruction in RAM. When the condition evaluates to TRUE, the relative branch target address (the offset value given in the instruction) is added to this incremented PC to calculate the address of the next instruction. With PC denoting the address of the branch instruction (102):
PC = PC + 1 + Offset
The CPU now jumps to memory address 107 (the target address; relative to the incremented PC of 103, this corresponds to an offset of 4) to read the next instruction. So, after processing Inst-3, the CPU changes the sequence of instructions: it does not process Inst-4 to Inst-7, but jumps to Inst-8 (address 107), reads it and processes it. The CPU then continues to process the following instructions in sequence from memory address 107 until it encounters another branch instruction.
Datapath
If the condition evaluates to FALSE, the CPU follows the current sequence: the current content of PC (103) points to the next instruction, so the CPU fetches and processes Inst-4, followed by Inst-5, Inst-6 and so on.
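The TRUE/FALSE behaviour of this one-byte example reduces to a small next-PC function. This is a minimal sketch: `next_pc_branch_if_equal` is a hypothetical name, the equality test is modelled as a subtraction whose zero result makes the condition TRUE, and the offset value 4 is inferred from the target address 107 (107 - 103), since the instruction encoding itself is not shown in the text.

```python
def next_pc_branch_if_equal(branch_addr, r1, r2, offset, inst_size=1):
    """Address of the next instruction after 'branch if R1 == R2' at branch_addr."""
    incremented_pc = branch_addr + inst_size   # PC = PC + 1 while the branch is decoded
    condition_true = (r1 - r2) == 0            # ALU subtraction; zero result -> TRUE
    if condition_true:
        return incremented_pc + offset         # branch target address
    return incremented_pc                      # fall through to the next instruction


print(next_pc_branch_if_equal(102, 5, 5, 4))   # TRUE:  107 (Inst-8)
print(next_pc_branch_if_equal(102, 5, 6, 4))   # FALSE: 103 (Inst-4)
```

The same function covers 4-byte instructions by passing `inst_size=4`.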
Datapath
Branch Instructions
o Unconditional branch
o Conditional branch
    o Branch IF the condition is TRUE
    o If the condition is FALSE, continue in sequence
Another example:
Assume a program contains 100 instructions, each 32 bits (4 bytes) long, loaded into a byte-addressable RAM starting at memory address 100. Each instruction (machine code) occupies 4 addressable locations (shown below), and the program counter (PC) is incremented by 4 to point to the next instruction once an instruction is fetched.
When the program runs, the CPU fetches and executes Inst-1, followed by Inst-2. Here Inst-2 is a conditional branch instruction. Note that when Inst-2 is decoded, the PC is incremented by 4 (PC = PC + 4 = 104 + 4 = 108) to point to Inst-3, while Inst-2 is still being processed. Since Inst-2 is a conditional branch instruction, the CPU evaluates the condition indicated in the opcode. In this example, the CPU checks whether the content of register R1 is equal to the content of R2. To check the condition, the CPU reads the contents of R1 and R2 and performs a subtraction in the ALU. If the result of the subtraction is zero, the condition evaluates to TRUE; if the result is non-zero, the condition evaluates to FALSE.
In any conditional branch instruction, the condition is evaluated first (TRUE/FALSE). If the condition evaluates to TRUE, the CPU changes the sequence of instructions. If it evaluates to FALSE, the CPU continues in sequence, that is, Inst-3 is fetched and processed next, followed by Inst-4, Inst-5 and so on.
If the condition evaluates to TRUE, the CPU reads the next instruction from a new location in RAM, whose address is called the branch target address. As mentioned earlier, when Inst-2 was decoded, the program counter was already incremented by 4 to point to the next instruction in RAM. When the condition evaluates to TRUE, the relative branch target address (the offset value given in the instruction) is added to this incremented PC to calculate the address of the next instruction. With PC denoting the address of the branch instruction (104):
PC = PC + 4 + Offset
The CPU now jumps to memory address 204 (the target address: 104 + 4 + 96) to read the next instruction. So, after processing Inst-2, the CPU changes the sequence of instructions: it does not process Inst-3, Inst-4 and the following instructions up to Inst-26, but jumps to Inst-27 (address 204 = 100 + 26 x 4), reads it and processes it. The CPU then continues to process the following instructions in sequence from memory address 204 until it encounters another branch instruction.
Memory address   Contents                                         Type of instruction
100              Inst-1 (lower 8 bits)                            ALU
101              Inst-1 (next 8 bits)
102              Inst-1 (next 8 bits)
103              Inst-1 (next 8 bits)
104              Inst-2: BEQ R1, R2, 96                           Branch (conditional); PC = PC + 4 = 108
105 – 107        Inst-2 (remaining bytes)
108              Inst-3 (next if the branch condition is FALSE)
109              ...
...              ...
204              Inst-27 (next if the branch condition is TRUE)   PC = PC + 4 + 96 = 204
205 – 207        Inst-27 (remaining bytes)
208              Inst-28
Datapath design
Datapath design: Condition evaluated TRUE
Datapath design: Condition evaluated FALSE
If the condition evaluates to FALSE, the CPU follows the current sequence: the current content of PC (108) points to the next instruction, so the CPU fetches and processes Inst-3, followed by Inst-4, Inst-5 and so on.
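For 4-byte instructions, the mapping between instruction numbers and byte addresses determines which instruction the branch lands on. The sketch below (hypothetical helper names; program assumed to start at address 100 as in the example) computes both outcomes of BEQ R1, R2, 96 at address 104 and converts each next PC back to an instruction number, which places the taken-branch target at Inst-27.

```python
BASE = 100        # first instruction is loaded at byte address 100
INST_BYTES = 4    # 32-bit instructions occupy 4 byte-addressable locations


def address_of(n):
    """Byte address of Inst-n."""
    return BASE + (n - 1) * INST_BYTES


def instruction_at(addr):
    """Instruction number stored at a word-aligned byte address."""
    return (addr - BASE) // INST_BYTES + 1


def beq_next_pc(pc, r1, r2, offset):
    """PC + 4 + Offset if R1 == R2, otherwise PC + 4 (offset in bytes, as in the text)."""
    pc_plus_4 = pc + INST_BYTES
    return pc_plus_4 + offset if r1 == r2 else pc_plus_4


branch_pc = address_of(2)                      # Inst-2 (BEQ R1, R2, 96) is at 104
taken = beq_next_pc(branch_pc, 7, 7, 96)       # condition TRUE
not_taken = beq_next_pc(branch_pc, 7, 8, 96)   # condition FALSE
print(taken, instruction_at(taken))            # 204 27
print(not_taken, instruction_at(not_taken))    # 108 3
```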
Design the datapath for the branch instruction given below:
BEQ | R1 | R2 | Offset
Note that instructions are assumed to be 32 bits each and a byte-addressable memory is used, so the PC is incremented by 4 to point to the next instruction once an instruction is fetched.
After executing the conditional branch instruction BEQ R1, R2, OFFSET, the CPU reads the next instruction from PC + 4 + Offset if R1 = R2, and from PC + 4 if R1 ≠ R2. Here PC is the address of the conditional branch instruction BEQ R1, R2, OFFSET in RAM. PC + 4 points to the next instruction in the program, which is stored right after the branch instruction in RAM, whereas PC + 4 + Offset points to a different instruction that does not follow the branch instruction in sequence.
Datapath: if the condition is TRUE
Datapath: if the condition is FALSE
Show the datapath design for the following conditional branch instructions:
BEQ R1, R2, 128 ; Branch if R1 = R2
BNQ R1, R2, 128 ; Branch if R1 is NOT equal to R2
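The branch datapath (two adders, the ALU's Zero output, an AND gate and a multiplexer in front of the PC) can be sketched behaviourally for both BEQ and BNQ. This is an illustrative Python model under stated assumptions: `branch` is a hypothetical control signal from the decoder, the offset is treated as a byte offset as in the text (MIPS itself shifts a word offset left by 2), and BNQ is modelled by inverting the Zero condition.

```python
MASK32 = 0xFFFFFFFF


def alu_zero(a, b):
    """ALU computes a - b; Zero = 1 when the 32-bit result is 0 (i.e. a == b)."""
    return ((a - b) & MASK32) == 0


def branch_next_pc(pc, r1_value, r2_value, offset, branch=True, branch_on_equal=True):
    """Next PC for BEQ (branch_on_equal=True) or BNQ (branch_on_equal=False)."""
    pc_plus_4 = (pc + 4) & MASK32               # adder 1: sequential address
    target = (pc_plus_4 + offset) & MASK32      # adder 2: PC + 4 + Offset
    zero = alu_zero(r1_value, r2_value)
    condition = zero if branch_on_equal else not zero
    pc_src = branch and condition               # AND gate drives the PC-source MUX
    return target if pc_src else pc_plus_4      # 2-to-1 MUX selects the next PC


print(branch_next_pc(104, 9, 9, 128))                         # BEQ taken: 236
print(branch_next_pc(104, 9, 8, 128))                         # BEQ not taken: 108
print(branch_next_pc(104, 9, 8, 128, branch_on_equal=False))  # BNQ taken: 236
```

The only difference between the two instructions is whether the AND gate sees Zero or its inverse; the adders and the PC multiplexer are shared.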
Show the datapath design for the instruction fetch unit.
Show the datapath design for the decode and register read unit.
What is a single-cycle implementation of a CPU?
…
What is the average CPI of a multi-cycle implementation?
In a multi-cycle implementation, the processing of an instruction is split into sub-tasks, and each sub-task is usually completed in a single clock cycle (synchronous control).
Simple instructions may require few sub-tasks and therefore few clock cycles, whereas complex instructions require more sub-tasks and more clock cycles.
The average CPI therefore depends on the program: a program usually contains different types of instructions in different proportions, and the average CPI is the average of the CPIs of these instruction types, weighted by their proportions.
It is usually greater than 1, typically around 3 to 4.
A clock is used to control the flow of tasks; in this example the clock period is set to 20 microseconds, consistent with the slowest task.
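The dependence of the average CPI on the program's instruction mix is just a weighted average. In the sketch below the mix and the cycle counts per class are hypothetical (4 cycles for ALU and store, 5 for load and 3 for branch, as in a typical five-stage multi-cycle design); the function names are illustrative, and the 20-microsecond clock is the one mentioned above.

```python
def average_cpi(mix):
    """mix maps instruction class -> (fraction of executed instructions, cycles per instruction)."""
    total = sum(fraction for fraction, _ in mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must add up to 1")
    return sum(fraction * cycles for fraction, cycles in mix.values())


def cpu_time(instruction_count, cpi, clock_period):
    """CPU time = instruction count x average CPI x clock period."""
    return instruction_count * cpi * clock_period


# Hypothetical program: 50% ALU, 20% loads, 10% stores, 20% branches.
mix = {"alu": (0.5, 4), "load": (0.2, 5), "store": (0.1, 4), "branch": (0.2, 3)}
cpi = average_cpi(mix)
print(round(cpi, 2))                      # 4.0
print(cpu_time(1_000_000, cpi, 20e-6))    # seconds, with a 20-microsecond clock
```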
Give an example of a complex CISC instruction and show its equivalent in RISC.
VAX 11/780
Instruction:
A single instruction can read data from memory, add it to the content of a register, store the result back in memory and increment the memory pointer:
(R2) = (R2) + R1, R2 = R2 + 1
The complex instruction can be written as a group of simple instructions (as in the MIPS ISA):
R4 = (R2)      # load instruction, I-type
R4 = R4 + R1   # add instruction, R-type
(R2) = R4      # store instruction, I-type
R2 = R2 + 1    # add immediate instruction, I-type
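A quick way to check that the RISC sequence really is equivalent to the single complex instruction is to run both on the same starting state. The sketch below is a hypothetical Python model (registers and memory as dictionaries, pointer increment of 1 as written above); R4 is the scratch register used only by the RISC version.

```python
def cisc_add_autoincrement(regs, mem):
    """One complex instruction: (R2) = (R2) + R1, then R2 = R2 + 1."""
    mem[regs["R2"]] = mem[regs["R2"]] + regs["R1"]
    regs["R2"] += 1


def risc_equivalent(regs, mem):
    """The same work as four simple instructions."""
    regs["R4"] = mem[regs["R2"]]            # load
    regs["R4"] = regs["R4"] + regs["R1"]    # add
    mem[regs["R2"]] = regs["R4"]            # store
    regs["R2"] = regs["R2"] + 1             # add immediate


def run(step):
    """Apply one version to a fresh, identical starting state."""
    regs = {"R1": 5, "R2": 10, "R4": 0}
    mem = {10: 7, 11: 0}
    step(regs, mem)
    return regs, mem


cisc_regs, cisc_mem = run(cisc_add_autoincrement)
risc_regs, risc_mem = run(risc_equivalent)
print(cisc_mem == risc_mem, cisc_regs["R2"] == risc_regs["R2"])  # True True
```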
MOVSB
A single instruction reads one byte of data from one memory location and copies it to another memory location. Moreover, the two registers that point to the source and destination memory locations are either autoincremented or autodecremented.
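The behaviour of such a string-move instruction can be sketched as follows. This is a hypothetical Python model, not the exact semantics of any particular processor: the pointer names `SRC` and `DST` and the `step` parameter (+1 for autoincrement, -1 for autodecrement) are illustrative.

```python
def movsb(mem, regs, step=1):
    """Copy one byte from mem[SRC] to mem[DST], then adjust both pointers by step.

    step=+1 models autoincrement and step=-1 autodecrement; SRC and DST are
    placeholder register names for the source and destination pointers.
    """
    mem[regs["DST"]] = mem[regs["SRC"]] & 0xFF
    regs["SRC"] += step
    regs["DST"] += step


mem = {200: 0x41, 201: 0x42, 202: 0x43, 300: 0, 301: 0, 302: 0}
regs = {"SRC": 200, "DST": 300}
for _ in range(3):          # repeating the instruction copies a block of bytes
    movsb(mem, regs)
print([mem[a] for a in (300, 301, 302)], regs)
```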
What is pipelining architecture? Briefly explain.
Which processor (RISC or CISC) is better suited to a pipelined implementation? Explain why.
How is a pipelined processor designed? Explain the procedure with suitable diagrams.
What types of programs/instructions are suitable to run on a pipelined processor? Why?
Show the processing of the first 8 instructions of a program on a pipelined processor, assuming the instructions are independent. Comment on what happens if some instructions depend on the results of preceding instructions (what would the runtime and speedup be?).
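EXAMPLE FOR A RISC PROCESSOR WITH A 5-STAGE PIPELINE
INSTRUCTION   CLOCK CYCLE
              1    2    3    4    5    6    7    8    9    10   11   12
1             IF   ID   EX   MEM  WB
2                  IF   ID   EX   MEM  WB
3                       IF   ID   EX   MEM  WB
4                            IF   ID   EX   MEM  WB
5                                 IF   ID   EX   MEM  WB
6                                      IF   ID   EX   MEM  WB
7                                           IF   ID   EX   MEM  WB
8                                                IF   ID   EX   MEM  WB
Ideal case (independent instructions): runtime = 5 + 8 - 1 = 12 clock cycles
Speedup = runtime on a non-pipelined processor / runtime on a 5-stage pipelined processor = (8 x 5) / 12 ≈ 3.33
If some instructions depend on the results of preceding instructions, the pipeline will stall for some clock cycles, so the runtime will be higher and the speedup lower.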
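The ideal-case numbers follow from a simple rule: with k stages and n independent instructions, the first instruction needs k cycles and each later one completes one cycle after the previous, giving k + n - 1 cycles. The Python sketch below (function names are hypothetical) prints the diagram together with the runtime and speedup, assuming no stalls and the same clock for the non-pipelined processor; it also applies the rule to the 10-instruction question that follows.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_runtime(n, k=len(STAGES)):
    """Clock cycles for n independent instructions on an ideal k-stage pipeline."""
    return k + n - 1


def pipeline_speedup(n, k=len(STAGES)):
    """Non-pipelined cycles (n x k, same clock) divided by the pipelined runtime."""
    return (n * k) / pipeline_runtime(n, k)


def pipeline_diagram(n, stages=STAGES):
    """One text row per instruction; instruction i enters IF in clock cycle i."""
    rows = []
    for i in range(1, n + 1):
        cells = ["   "] * (i - 1) + [f"{s:<3}" for s in stages]
        rows.append(f"{i:>2}  " + " ".join(cells))
    return rows


print("\n".join(pipeline_diagram(8)))
print(pipeline_runtime(8), round(pipeline_speedup(8), 2))    # 12 3.33
print(pipeline_runtime(10), round(pipeline_speedup(10), 2))  # 14 3.57
```

As n grows, the speedup approaches k = 5, because the k - 1 fill cycles are amortised over more instructions.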
Show the processing of the first 10 instructions of a program on a pipelined processor, assuming the instructions are independent. Comment on what happens if some instructions depend on the results of preceding instructions (what would the runtime and speedup be?).
What do you understand by stages in a pipelined architecture?
Calculate the average CPI and speedup of a pipelined architecture.
What does a pipelined architecture improve: (i) the processing time of an instruction, (ii) the throughput of the system (number of instructions per second), or (iii) both? Explain with an example.
Compare the runtime of a program on RISC and CISC in general
The CISC approach attempts to minimize the number of instructions per program, sacrificing the number of cycles per
instruction. RISC does the opposite, reducing the cycles per instruction at the cost of the number of instructions per
program.
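A program is compiled on a RISC and a CISC machine. The RISC compiler generates 50% more instructions than the CISC compiler. The average CPI of the CISC processor is approximately 4. Which machine will run faster, and what is its speedup over the other machine? Assume that the CPU clock is the same for both machines.
Let the CISC program contain 100 instructions, so the RISC program contains 150; assume the (pipelined) RISC processor has an average CPI of about 1, and let C be the common clock cycle time:
Speedup = CPU runtime (CISC) / CPU runtime (RISC) = (100 x 4 x C) / (150 x 1 x C) = 400 C / 150 C = 2.67
RISC runs 2.67 times faster than CISC.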
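The same calculation can be parameterised so that it also covers the next question. This sketch assumes, as the 400 C / 150 C figure above implies, an average RISC CPI of 1 in both cases; the function names are hypothetical, and the clock cycle time cancels because it is the same on both machines.

```python
def cpu_time(instruction_count, cpi, cycle_time=1.0):
    """CPU runtime = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * cycle_time


def risc_speedup(extra_fraction, cisc_cpi, risc_cpi=1.0, cisc_count=100):
    """Runtime(CISC) / Runtime(RISC) when RISC needs (1 + extra_fraction) x the instructions.

    The clock is the same for both machines, so the cycle time cancels out.
    """
    risc_count = cisc_count * (1 + extra_fraction)
    return cpu_time(cisc_count, cisc_cpi) / cpu_time(risc_count, risc_cpi)


print(round(risc_speedup(0.50, 4), 2))  # 2.67: 400 / 150
print(round(risc_speedup(0.75, 3), 2))  # 1.71: 300 / 175, assuming RISC CPI = 1 here too
```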
A program is compiled on a RISC and a CISC machine. The RISC compiler generates 75% more instructions than the CISC compiler. The average CPI of the CISC processor is approximately 3. Which machine will run faster, and what is its speedup over the other machine? Assume that the CPU clock is the same for both machines.
Assume that the individual stages of the datapath have the following latencies (delays):
IF       ID       EX       MEM      WB
250 ps   350 ps   150 ps   300 ps   200 ps
Also, assume that the instructions executed by the processor are broken down as follows:
alu beq lw sw
In a pipelined implementation, a central clock signal controls the flow of tasks, and the clock cycle is set to the latency of the slowest stage: ID, at 350 ps.
In a non-pipelined multi-cycle implementation, the same clock cycle (350 ps) is used.
If we can split one stage of the pipelined datapath into two new stages, each with half the latency of the original stage, which stage would you split, and what is the new clock cycle time of the processor?
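The clock-cycle reasoning can be checked numerically with the latencies from the table. In this minimal sketch (hypothetical function names) the single-cycle clock is the sum of all stage latencies, the pipelined and multi-cycle clock is the slowest stage, and splitting only helps if it targets that slowest stage, because any other split leaves the 350 ps ID stage as the bottleneck.

```python
LATENCY_PS = {"IF": 250, "ID": 350, "EX": 150, "MEM": 300, "WB": 200}


def single_cycle_clock(latency):
    """Single-cycle: one long cycle must cover every stage in series."""
    return sum(latency.values())


def pipelined_clock(latency):
    """Pipelined (and multi-cycle): the clock must fit the slowest stage."""
    return max(latency.values())


def split_slowest_stage(latency):
    """Split the slowest stage into two stages with half its latency each."""
    slowest = max(latency, key=latency.get)
    new_latency = {s: t for s, t in latency.items() if s != slowest}
    new_latency[slowest + "1"] = latency[slowest] / 2
    new_latency[slowest + "2"] = latency[slowest] / 2
    return slowest, new_latency


print(single_cycle_clock(LATENCY_PS))       # 1250
print(pipelined_clock(LATENCY_PS))          # 350
stage, new_latency = split_slowest_stage(LATENCY_PS)
print(stage, pipelined_clock(new_latency))  # ID 300
```

After the split, MEM (300 ps) becomes the new bottleneck, which is why the clock improves only to 300 ps rather than 175 ps.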
Assume that the individual stages of the datapath have the following latencies (delays):
IF       ID       EX       MEM      WB
250 ps   350 ps   150 ps   300 ps   200 ps
Assuming there are no stalls or hazards, what is the utilization of the data memory?
Assuming there are no stalls or hazards, what is the utilization of the write-register port of the “Registers” unit?
Instead of a single-cycle organization, we can use a multi-cycle organization, where each instruction takes multiple cycles but one instruction finishes before the next is fetched. In this organization, an instruction only goes through the stages it actually needs (e.g., ST takes only 4 cycles because it does not need the WB stage). Compare the clock cycle times and execution times of the single-cycle, multi-cycle and pipelined organizations.
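A minimal sketch of the comparison follows. It assumes the stage sets below (ALU skips MEM, BEQ skips MEM and WB, SW skips WB as stated above, LW uses all five) and a hypothetical equal instruction mix, since the actual percentages for alu, beq, lw and sw are not given in these notes. The pipelined figure is the ideal steady-state time per instruction; an individual instruction still has a latency of 5 x 350 ps.

```python
LATENCY_PS = {"IF": 250, "ID": 350, "EX": 150, "MEM": 300, "WB": 200}

# Stages each instruction class actually needs in the multi-cycle organization (assumed).
STAGES_NEEDED = {
    "alu": ["IF", "ID", "EX", "WB"],
    "beq": ["IF", "ID", "EX"],
    "lw": ["IF", "ID", "EX", "MEM", "WB"],
    "sw": ["IF", "ID", "EX", "MEM"],
}


def average_time_per_instruction(mix, latency=LATENCY_PS):
    """Average time per instruction (ps) for the three organizations, ignoring stalls."""
    single_cycle = sum(latency.values())    # every instruction takes one long cycle
    clock = max(latency.values())           # multi-cycle and pipelined clock
    multi_cycle = sum(frac * len(STAGES_NEEDED[c]) * clock for c, frac in mix.items())
    pipelined = clock                       # ideal: one instruction completes per cycle
    return {"single-cycle": single_cycle, "multi-cycle": multi_cycle, "pipelined": pipelined}


# Hypothetical mix (the actual percentages are not given in these notes).
mix = {"alu": 0.25, "beq": 0.25, "lw": 0.25, "sw": 0.25}
print(average_time_per_instruction(mix))
# {'single-cycle': 1250, 'multi-cycle': 1400.0, 'pipelined': 350}
```

Multiplying each figure by the instruction count gives the execution time; with this particular mix the multi-cycle organization is slower than single-cycle, because its 350 ps clock wastes time in the faster stages.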
Consider the following sequence of instructions:
or r1,r2,r3
or r2,r1,r4
or r1,r1,r2
Also, assume the following clock cycle times for each of the forwarding options:
Without forwarding: 250 ps
With full forwarding: 300 ps
With ALU-ALU forwarding only: 290 ps
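The RAW dependences in this sequence (the second `or` reads r1 written by the first; the third reads r1 from the first and r2 from the second) can be turned into cycle counts with a short simulation. This is a simplified sketch under stated assumptions: a classic 5-stage pipeline, a register file written in the first half of a cycle and read in the second, and, for full forwarding, no stalls at all because the sequence contains no loads. The function names are hypothetical, and the ALU-ALU-only option is deliberately not modelled, since its stalls depend on exactly which forwarding paths exist.

```python
def parse(inst):
    """'or r1,r2,r3' -> ('or', 'r1', ['r2', 'r3']): destination first, then sources."""
    op, operands = inst.split(maxsplit=1)
    regs = [r.strip() for r in operands.split(",")]
    return op, regs[0], regs[1:]


def cycles_without_forwarding(program):
    """Total cycles on a 5-stage pipeline (IF ID EX MEM WB) with no forwarding.

    Assumes the register file is written in the first half of a cycle and read in
    the second half, so a dependent instruction's ID may share a cycle with the
    producer's WB. A stalled instruction also delays every instruction behind it.
    """
    id_cycle, last_writer = [], {}
    for i, inst in enumerate(program):
        _, dest, srcs = parse(inst)
        earliest = id_cycle[-1] + 1 if id_cycle else 2       # first ID is in cycle 2
        for src in srcs:
            if src in last_writer:                            # RAW dependence
                producer_wb = id_cycle[last_writer[src]] + 3  # WB is 3 cycles after ID
                earliest = max(earliest, producer_wb)
        id_cycle.append(earliest)
        last_writer[dest] = i
    return id_cycle[-1] + 3                                   # WB cycle of the last instruction


def cycles_with_full_forwarding(program, stages=5):
    """ALU-only code: full forwarding removes every stall, so only pipeline fill remains."""
    return stages + len(program) - 1


PROGRAM = ["or r1,r2,r3", "or r2,r1,r4", "or r1,r1,r2"]
no_fwd = cycles_without_forwarding(PROGRAM)
full_fwd = cycles_with_full_forwarding(PROGRAM)
print(no_fwd, no_fwd * 250)      # 11 2750
print(full_fwd, full_fwd * 300)  # 7 2100
```

Under these assumptions, full forwarding wins despite its longer clock: 4 stall cycles are removed, which outweighs the extra 50 ps per cycle.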