Latency Solution Pipeline Reservation Table
[Reservation table: stages S1, S2 and S3; the positions of the X marks and the text of sub-questions (a) to (e) are not recoverable.]
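CASE 1: latency 3
Present state        1011010
Collision vector     1011010
PS with 3 shifts   + 0001011
Next state           1011011

Present state        1011011
Collision vector     1011010
PS with 3 shifts   + 0001011
Next state           1011011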
CASE 2: latency 6
Present state        1011010
Collision vector     1011010
PS with 6 shifts   + 0000001
Next state           1011011

Present state        1011011
Collision vector     1011010
PS with 8 shifts   + 0000000
Next state           1011010

Present state        1011011
Collision vector     1011010
PS with 6 shifts   + 0000001
Next state           1011011

Present state        1011010
Collision vector     1011010
PS with 8 shifts   + 0000000
Next state           1011010
CASE 3: latency 1
Present state        1011010
Collision vector     1011010
PS with 1 shift    + 0101101
Next state           1111111

CASE 4: latency 8
Present state        1111111
Collision vector     1011010
PS with 8 shifts   + 0000000
Next state           1011010
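Each transition in the cases above is a right shift of the present state followed by a bitwise OR with the collision vector. A minimal Python sketch of that rule follows; the function name `next_state` and the string representation of states are illustrative choices, not part of the original solution.

```python
def next_state(state: str, collision_vector: str, latency: int) -> str:
    """Return the state reached by initiating a new task `latency` cycles later.

    States are written C_m ... C_1, so the bit for latency k is state[m - k].
    """
    m = len(collision_vector)
    if latency <= m and state[m - latency] == "1":
        raise ValueError(f"latency {latency} is forbidden in state {state}")
    shifted = int(state, 2) >> latency            # "PS with k shifts"
    merged = shifted | int(collision_vector, 2)   # OR with the collision vector
    return format(merged, f"0{m}b")


CV = "1011010"
print(next_state(CV, CV, 3))          # 1011011
print(next_state(CV, CV, 6))          # 1011011
print(next_state(CV, CV, 1))          # 1111111
print(next_state("1111111", CV, 8))   # 1011010
```

Any latency larger than the vector length shifts the state to all zeros, which is why every "8+" edge returns to the initial state.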
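State diagram (transitions read from the cases above; * marks a greedy edge):
1011010: 1* -> 1111111, 3 -> 1011011, 6 -> 1011011, 8+ -> 1011010
1011011: 3* -> 1011011, 6 -> 1011011, 8+ -> 1011010
1111111: 8+ -> 1011010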
Latency cycles: (1, 8) (1, 8, 8) (1, 8, 3, 8) (1, 8, 8, 3, 8) (1, 8, 6, 8) (1, 8, 8, 6, 8) (8) (3)
(6) (1, 8, 8, 6, 6, 8)
Simple cycles: (3) (6) (8) (1, 8) (3, 8) (6, 8)
Greedy cycles: (3) (1, 8)
Optimal latency cycle: (3)
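The greedy cycles can be cross-checked mechanically by following the smallest permissible latency out of every reachable state until a state repeats. The sketch below assumes the collision vector 1011010 from the cases above; the helper names (`permissible`, `step`, `greedy_cycle`, `canonical`) are hypothetical, and latency m + 1 stands in for "8+".

```python
CV = "1011010"


def permissible(state: str) -> list:
    """Permissible latencies of a state; m + 1 represents every latency > m."""
    m = len(state)
    return [k for k in range(1, m + 1) if state[m - k] == "0"] + [m + 1]


def step(state: str, cv: str, k: int) -> str:
    """Shift the state right by k and OR it with the collision vector."""
    return format((int(state, 2) >> k) | int(cv, 2), f"0{len(cv)}b")


def reachable(cv: str) -> set:
    """All states reachable from the initial collision vector."""
    todo, found = [cv], {cv}
    while todo:
        s = todo.pop()
        for k in permissible(s):
            t = step(s, cv, k)
            if t not in found:
                found.add(t)
                todo.append(t)
    return found


def greedy_cycle(start: str, cv: str) -> tuple:
    """Follow minimum-latency edges from `start` until a state repeats."""
    path, seen, state = [], {}, start
    while state not in seen:
        seen[state] = len(path)
        k = permissible(state)[0]      # greedy choice
        path.append(k)
        state = step(state, cv, k)
    return tuple(path[seen[state]:])


def canonical(cycle: tuple) -> tuple:
    """Rotate a cycle so that equivalent cycles compare equal."""
    return min(cycle[i:] + cycle[:i] for i in range(len(cycle)))


greedy = sorted({canonical(greedy_cycle(s, CV)) for s in reachable(CV)})
print(greedy)   # [(1, 8), (3,)]
```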
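MAL:
Lower bound = 3 (maximum number of marks in any row of the reservation table)
Upper bound = 4+1 = 5 (number of 1s in the collision vector, plus 1)
Average greedy cycle latency = (1+8) / 2 = 4.5
MAL ≤ 4.5
MAL = 3, achieved by the greedy cycle (3)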
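The bounds can be checked numerically from the cycle lists given above. This is a small sketch, assuming the simple and greedy cycles as listed in this solution; the lower bound of 3 is copied from the text because the reservation table itself is not available here.

```python
from fractions import Fraction

CV = "1011010"
simple_cycles = [(3,), (6,), (8,), (1, 8), (3, 8), (6, 8)]   # as listed above
greedy_cycles = [(3,), (1, 8)]                               # as listed above
lower_bound = 3                                              # taken from the text


def average(cycle: tuple) -> Fraction:
    """Average latency of one cycle."""
    return Fraction(sum(cycle), len(cycle))


upper_bound = CV.count("1") + 1                 # number of 1s in the vector, plus 1
mal = min(average(c) for c in simple_cycles)    # best average over simple cycles
best_greedy = min(average(c) for c in greedy_cycles)

print(average((1, 8)))                  # 9/2
print(upper_bound, mal, best_greedy)    # 5 3 3
```

The minimum over the simple cycles coincides with the best greedy cycle, and it sits between the two bounds.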
Given:
Clock period = 20 ns
Throughput of the pipeline = N / (n × clock period) = 3 / (8 × 20 × 10^-9 s) = 18.75 MIPS.
[Figure: prefetch buffers feeding the instruction pipeline. Labels: Fetch cache, Seq buffer 2, Target buffer 1, Target buffer 2, Instruction pipeline, Instructions from branched locations.]
After the branch condition is checked, the appropriate instructions are taken from one
of the two buffers, and the instructions in the other buffer are discarded.
The two buffers in each pair alternate to prevent a collision between instructions
flowing into and out of the pipeline.
Loop buffers:
These buffers hold the sequential instructions contained in a small loop. The loop
buffers are maintained by the fetch stage of the pipeline. Pre-fetched instructions in the
loop body are executed repeatedly until all iterations complete execution.
The loop buffer operates in two steps.
a. It contains instructions sequentially ahead of the current instruction. This saves the
instruction fetch time from memory.
b. It recognizes when the target of a branch falls within the loop boundary.
Multiple functional units:
The architecture in the figure is a pipelined scalar architecture. In this architecture,
reservation stations (RS) are used with each functional unit in order to resolve data
dependences and resource dependences among successive instructions entering the
pipeline. Operands can wait in the reservation stations until their data dependences
have been resolved.
Each reservation station is uniquely identified by a tag, which is monitored by a
tag unit.
The tag unit keeps checking the tags from all currently used registers or
reservation stations.
This register tagging technique allows the hardware to resolve conflicts between
source and destination registers assigned for multiple instructions.
Besides resolving conflicts, the reservation stations also serve as buffers to
interface the pipelined functional units with the decode and issue units.
The multiple functional units are supposed to operate in parallel once the
dependences are resolved.
[Figure: pipelined scalar architecture with multiple functional units. Labels: Instruction fetch unit, Tag unit, Load registers, Reservation stations (RS), Functional units (FU), Memory.]
PART-2
Answer any Two full questions.
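[Reservation table: stages S1 to S5, with columns up to 6; the positions of the X marks and the text of sub-questions (a) to (d) are not recoverable.]
(10 Marks)
Sol: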
Forbidden latencies: 3, 4 and 5
Permissible latencies: 1, 2 and 6
Collision vector: C5C4C3C2C1 = 11100
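The collision vector follows directly from the forbidden latencies: bit C_k is 1 exactly when latency k is forbidden. A minimal sketch, with hypothetical helper names:

```python
def collision_vector(forbidden: set) -> str:
    """Build C_m ... C_1, where m is the largest forbidden latency."""
    m = max(forbidden)
    return "".join("1" if k in forbidden else "0" for k in range(m, 0, -1))


def permissible_latencies(cv: str) -> list:
    """Latencies 1..m whose bit is 0; every latency above m is also permissible."""
    m = len(cv)
    return [k for k in range(1, m + 1) if cv[m - k] == "0"]


print(collision_vector({3, 4, 5}))       # 11100
print(permissible_latencies("11100"))    # [1, 2]
```

Latency 6 does not appear in the second result because it exceeds the vector length; such latencies are always permissible, which matches the list of permissible latencies 1, 2 and 6 above.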
CASE 1: latency 1
Present state        11100
Collision vector     11100
PS with 1 shift    + 01110
Next state           11110

Present state        11110
Collision vector     11100
PS with 1 shift    + 01111
Next state           11111

Present state        11111
Collision vector     11100
PS with 6 shifts   + 00000
Next state           11100

Present state        11110
Collision vector     11100
PS with 6 shifts   + 00000
Next state           11100
CASE 2: latency 2
Present state        11100
Collision vector     11100
PS with 2 shifts   + 00111
Next state           11111

Latency 2 cannot be repeated from state 11111: its C2 bit is 1, so the only
permissible latency there is 6 or more, which returns to 11100.
CASE 3: latency 6
Present state        11100
Collision vector     11100
PS with 6 shifts   + 00000
Next state           11100
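State diagram (transitions read from the cases above):
11100: 1 -> 11110, 2 -> 11111, 6+ -> 11100
11110: 1 -> 11111, 6+ -> 11100
11111: 6+ -> 11100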
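Latency cycles: (6), (1, 6), (2, 6), (1, 1, 6)
Simple cycles: (6), (1, 6), (2, 6), (1, 1, 6)
A constant cycle (2) is not possible: initiations at latency 2 put every other task
4 cycles apart, and 4 is a forbidden latency.
Greedy cycle: (1, 1, 6)
Optimal latency cycle: (1, 1, 6)
MAL:
Lower bound = 2
Upper bound = 3+1 = 4
Average greedy cycle latency = (1+1+6) / 3 = 2.67
MAL = 2.67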
4) Consider the following pipelined processor with four stages. This pipeline has a total
evaluation time of six clock cycles. All successor stages must be used after each clock
cycle.
[Figure: four-stage pipeline with Input, stages S1, S2, S3, S4 and Output; the connections between stages are not recoverable.]
(a) Specify the reservation table for this pipeline with six columns and four rows.
(b) List the set of forbidden latencies between task initiations.
(c) Draw the state diagram which shows all possible latency cycles.
(d) List all greedy cycles from the state diagram.
(e) What is the value of the minimal average latency (MAL)?
(10 Marks)
Sol:
Reservation table:
[Reservation table: stages S1 to S4 against columns 1 to 6; the positions of the X marks are not recoverable.]
CASE 1: latency 1
Present state        1010
Collision vector     1010
PS with 1 shift    + 0101
Next state           1111

Latency 1 cannot be repeated from state 1111: every bit is 1, so the next
initiation must wait 5 or more cycles.
CASE 2: latency 3
Present state        1010
Collision vector     1010
PS with 3 shifts   + 0001
Next state           1011

Present state        1111
Collision vector     1010
PS with 5 shifts   + 0000
Next state           1010

Present state        1011
Collision vector     1010
PS with 3 shifts   + 0001
Next state           1011

Present state        1011
Collision vector     1010
PS with 5 shifts   + 0000
Next state           1010
CASE 3: latency 5
Present state        1010
Collision vector     1010
PS with 5 shifts   + 0000
Next state           1010
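State diagram (transitions read from the cases above; * marks a greedy edge):
1010: 1* -> 1111, 3 -> 1011, 5+ -> 1010
1111: 5+ -> 1010
1011: 3* -> 1011, 5+ -> 1010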
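The whole state diagram can be generated from the collision vector alone by a breadth-first search over shift-and-OR transitions. The sketch below assumes the collision vector 1010 used in the cases above; `state_diagram` is a hypothetical helper, and latency m + 1 stands in for "5+".

```python
from collections import deque


def state_diagram(cv: str) -> dict:
    """Map each reachable state to {latency: next_state}."""
    m = len(cv)
    edges = {}
    queue = deque([cv])
    while queue:
        state = queue.popleft()
        if state in edges:
            continue                       # already expanded
        edges[state] = {}
        for k in range(1, m + 2):          # m + 1 stands for every latency > m
            if k <= m and state[m - k] == "1":
                continue                   # forbidden latency in this state
            nxt = format((int(state, 2) >> k) | int(cv, 2), f"0{m}b")
            edges[state][k] = nxt
            queue.append(nxt)
    return edges


for state, out in state_diagram("1010").items():
    print(state, out)
# 1010 {1: '1111', 3: '1011', 5: '1010'}
# 1111 {5: '1010'}
# 1011 {3: '1011', 5: '1010'}
```

The search finds three states, and the only edge leaving 1111 is the long latency back to the initial state.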
5) Design an arithmetic pipeline unit for fixed-point multiplication of 8-bit integers using
CSA and CPA.
(10 Marks)
Sol:
An arithmetic pipeline unit for fixed-point multiplication of 8-bit integers using CSA and
CPA:
[Figure: pipeline schematic; not recoverable.]
PART-3
Answer any Two full questions.
Implementing the dot-product operation with internal data forwarding between a multiply
unit and an add unit.
Advantages:
Without internal data forwarding, the three instructions must be executed
sequentially in a looping structure.
With data forwarding, the output of the multiplier is fed directly into the input
register R4 of the adder, and the output of the multiplier is also routed to register
R3, as shown in Fig.
Therefore, internal data forwarding between the two functional units reduces the
total execution time through the pipelined processor.
7) Design a binary multiply pipeline unit for two 4-bit operands. Use the minimum number
of CSAs and CPAs. Show all interconnections and bus widths in the schematic
diagram. Calculate the output of each CSA and CPA.
(5+5 = 10 Marks)
Sol:
A binary multiply unit for two 4-bit operands:
CSA2:
    0101101
    0111100
    1111000
S = 1101001
C = 1111000
CPA:
     1101001
   + 1111000
S = 11100001
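The CSA2 and CPA values above can be reproduced with a short carry-save adder sketch. The operands are an assumption: 1111 × 1111 is used because it reproduces the listed figures, and the CSA1 stage is inferred from the CSA2 inputs. The names `csa`, `pp`, `s1`, `c1`, `s2` and `c2` are illustrative.

```python
def csa(x: int, y: int, z: int) -> tuple:
    """Carry-save adder: bitwise sum word and carry word shifted left by one."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


a = b = 0b1111   # assumed operands (15 x 15), not stated in the text

# Four shifted partial products: a * b_i * 2^i
pp = [(a << i) if (b >> i) & 1 else 0 for i in range(4)]

s1, c1 = csa(pp[0], pp[1], pp[2])   # CSA1 combines the first three partial products
s2, c2 = csa(s1, c1, pp[3])         # CSA2 adds the fourth partial product
product = s2 + c2                   # the CPA performs the final carry-propagate add

print(format(s1, "07b"), format(c1, "07b"))   # 0101101 0111100
print(format(s2, "07b"), format(c2, "07b"))   # 1101001 1111000
print(format(product, "08b"))                 # 11100001
```

Two CSAs and one CPA are enough for four partial products, because each CSA reduces three words to two.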
8) Describe dynamic instruction scheduling achieved in Tomasulo's register-tagging
scheme built in the IBM 360/91 processor.
(10 Marks)
Sol:
Dynamic instruction scheduling achieved in Tomasulo's register-tagging scheme
built in the IBM 360/91 processor:
This hardware dependence-resolution scheme was implemented with the multiple
floating-point units of the IBM 360/91 processor. In the Model 91, three RSs are
used in the floating-point adder and two pairs in the floating-point multiplier.
The scheme resolves resource conflicts as well as data dependences, using register
tagging to allocate or deallocate the source and destination registers.
An issued instruction whose operands are not available is forwarded to an RS
associated with the functional unit it will use.
It waits until its data dependences have been resolved and its operands become
available.
The dependence is resolved by monitoring the result bus.
When all operands for an instruction are available, it is dispatched to the functional
unit for execution.
All working registers are tagged.
If a source register is busy when an instruction reaches the issue stage, the tag for
the source register is forwarded to an RS.
When the register becomes available, the tag can signal the availability.
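The tagging idea described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not the IBM 360/91 design: the register names (F0, F2, F4), the station tags (RS_mul, RS_add) and the classes are all hypothetical, and a busy register simply holds the tag of the station that will produce its value.

```python
from dataclasses import dataclass


@dataclass
class ReservationStation:
    tag: str
    op: str
    operands: list   # each entry is a value (float) or a tag (str) still awaited

    def ready(self) -> bool:
        """An instruction may be dispatched once no operand is a pending tag."""
        return all(not isinstance(x, str) for x in self.operands)

    def snoop(self, tag: str, value: float) -> None:
        """Replace a waiting tag with the value seen on the result bus."""
        self.operands = [value if x == tag else x for x in self.operands]


def broadcast(tag: str, value: float, stations: list, registers: dict) -> None:
    """Result bus: every station and every tagged register captures the value."""
    for rs in stations:
        rs.snoop(tag, value)
    for name, content in registers.items():
        if content == tag:
            registers[name] = value


# F4 is busy: it holds the tag of the multiply that will produce it.
registers = {"F0": 2.0, "F2": 3.0, "F4": "RS_mul"}

mul = ReservationStation("RS_mul", "mul", [registers["F0"], registers["F2"]])
add = ReservationStation("RS_add", "add", [registers["F4"], registers["F0"]])

assert mul.ready() and not add.ready()    # the add waits on tag RS_mul

product = mul.operands[0] * mul.operands[1]
broadcast(mul.tag, product, [add], registers)

total = add.operands[0] + add.operands[1]
print(registers["F4"], add.ready(), total)   # 6.0 True 8.0
```

The add instruction never reads F4 directly; it receives the forwarded tag and picks up the product when it appears on the result bus.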