
Computer Architecture CT 2 Paper Solution

Q1. The ULTRA SPARC IV Plus is a 4-way superscalar, 14-stage pipeline processor. Find the speedup. Obtain the relation used. Comment on the specific requirement for cache. [1+2+1]

A1.
Calculating the speedup - [1 mark]
Given m = 4, k = 14.
S(m, 1) = S(4, 1) = T(1, 1) / T(4, 1) = (N + 14 − 1) / (N/4 + 14 − 1) = 4(N + 13) / (N + 52)
As N → ∞, S(4, 1) → 4.
Obtaining the relation used to calculate the speedup - [2 marks]
For a k-stage linear pipeline,
No. of stages in the pipeline = k
No. of tasks to be processed = N
No. of cycles used to fill up the pipeline = k,
(also, this is the time required to complete execution of the first task), therefore,
No. of remaining tasks = N-1, and
No. of cycles needed to complete N-1 tasks = N-1
Therefore, the time required by the scalar base machine,
T (1, 1) = Tk = k + (N-1) clock periods
Similarly, the time required by an m-issue superscalar machine,
T (m, 1) = k + (N – m)/m
= N/m + (k – 1)
where,
k = time required to execute the first ‘m’ instructions through the m pipelines simultaneously
(N – m)/m = time required to execute the remaining (N – m) instructions, ‘m’ per cycle, through
the ‘m’ pipelines
Speedup of the superscalar machine over the base machine,
S(m, 1) = T(1, 1) / T(m, 1) = (N + k − 1) / (N/m + k − 1) = m(N + k − 1) / (N + m(k − 1))
As N → ∞, S(m, 1) → m.
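
To make the limit concrete, here is a minimal Python sketch (not part of the original solution) that evaluates the relation above for the given m = 4, k = 14 and growing N:

def speedup(m, k, n):
    t_scalar = k + n - 1       # T(1,1): k cycles to fill the pipe, then one task per cycle
    t_super = n / m + (k - 1)  # T(m,1): m instructions issued per cycle
    return t_scalar / t_super

k, m = 14, 4                   # 14-stage pipeline, 4-way issue
for n in (100, 1000, 1000000):
    print(n, round(speedup(m, k, n), 3))   # 2.974, 3.852, 4.0

As N grows, S(4, 1) approaches the issue width m = 4, the ideal speedup.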

Specific requirement for cache - [1 mark]


• L1 Cache (per pipeline): 64 KB 4-way data,
  32 KB 4-way instruction,
  2 KB write, 2 KB prefetch
• L2 Cache: 16 MB external (exclusive access to 8 MB per pipeline)
Q2. A 64 KB cache uses a 12-bit index. It is proposed to increase the number of index bits to 16. Suggest the better of the two options, with justification. Support your answer with arguments for and against. [4]

A2.
Calculating the block size - [0.25 marks]
Given, Size of Cache = 64 KB
Index Bit = 12
Size of index = f(cache size, block size, set associativity)
2^index = cache size / (block size × set associativity)
2^12 = 64 KB / (block size × 1), (set associativity = 1 for direct mapping)
Block size = 16 Bytes
Similarly, for a 16-bit index,
2^16 = 64 KB / (block size × 1)
Block size = 1 Byte
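
A small Python sketch of this relation (the function name is illustrative) confirms both block sizes:

def block_size(cache_bytes, index_bits, assoc=1):
    # solve 2**index_bits = cache_bytes / (block_size * assoc) for block_size
    return cache_bytes // ((1 << index_bits) * assoc)

CACHE = 64 * 1024                  # 64 KB, direct-mapped (assoc = 1)
print(block_size(CACHE, 12))       # 16 -> 16-byte blocks
print(block_size(CACHE, 16))       # 1  -> 1-byte blocks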

In favor of 12-bit index - [3 marks for any 4 points]


1. Comparing the results above, increasing the number of index bits decreases the block size. So the better choice is the 12-bit index.
2. When the block size is decreased, the miss rate increases: because of spatial locality, a larger block raises the hit ratio.
3. Block size is an important criterion in the tradeoff with cost, comparator size, and various other issues.
For an instruction cache, increasing the block size proportionally decreases the cache miss rate.
For a data cache this proportionality does not hold; the miss rate is lower in a data cache than in an instruction cache.
4. Fewer tag comparisons are needed with a 12-bit index than with a 16-bit index.
5. The implementation cost with a 12-bit index is lower.
6. It is less complex to implement.
7. The overhead of searching for a word is lower.
8. It is better for sequential access of memory.
Against 12-bit index - [0.75 marks]
9. As the graph depicts, when the block size increases significantly, the decrease in miss rate is smaller than the increase in miss penalty, since the whole block must be transferred to the cache, which increases the delay in providing the required data. So decreasing the index bits further can create a problem.

10. Special Case Problem:


Let’s say we want to access words 0, 4, 8, 12, where 1 block = 4 words.
Then every access is a miss, so these four accesses cause 4 cache misses. And since the block is large, the miss penalty is also higher.
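
A minimal Python sketch of this special case; the 4-frame direct-mapped cache is an assumption for the sketch and does not change the miss count:

WORDS_PER_BLOCK = 4
NUM_FRAMES = 4                           # assumed cache size: 4 frames

frames = [None] * NUM_FRAMES             # block tag stored per frame
misses = 0
for word in (0, 4, 8, 12):
    block = word // WORDS_PER_BLOCK      # words 0,4,8,12 -> blocks 0,1,2,3
    frame = block % NUM_FRAMES
    if frames[frame] != block:           # tag mismatch -> miss
        misses += 1
        frames[frame] = block            # whole block fetched (big penalty)
print(misses)                            # 4 -> every access misses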
Q3. VLIW is not a mainstream computer. Give two reasons in support of this statement. What are the salient features of this machine? [1+2]

A3.
Reasons for “VLIW is not a mainstream computer” - [1 mark for any 2 points]
• Due to lack of compatibility with conventional hardware and software.
• Due to the dependence on Trace-scheduling compiling and code compaction.
• VLIW instructions are longer than those of mainstream computers, since each specifies several
independent operations.
• VLIW uses random parallelism among scalar operations instead of regular synchronous
parallelism.
Salient features of a VLIW machine - [2 marks for any 4 points & diagram]
• Instruction word length: hundreds of bits.
• Use of (i) multifunctional units concurrently, and
(ii) Common Large Register File shared by all functional Units.
• In VLIW run-time scheduling and synchronization are eliminated, because Instruction
parallelism and data movement in VLIW architecture are completely specified at compile
time.
• VLIW processor is an extreme of a superscalar processor in which all independent or
unrelated operations are already synchronously compacted.
• The instruction parallelism embedded in the compacted code may require a different
latency to be executed by different FUs even though the instructions are issued at the
same time.
• It uses random parallelism among scalar operations instead of regular synchronous
parallelism (as used in vectorized supercomputer or as in an SIMD computer).

[Figure: VLIW processor architecture and its long instruction format]
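
As an illustration only, a long instruction word can be pictured as one wide record carrying an independent operation per functional unit. The slot names and operations below are assumptions for the sketch, not the encoding of any real VLIW machine:

from dataclasses import dataclass

@dataclass
class VLIWWord:
    alu_op: str   # integer ALU slot
    fpu_op: str   # floating-point slot
    mem_op: str   # load/store slot
    br_op: str    # branch slot

# All four operations were proven independent at compile time, so the
# hardware issues them in the same cycle with no run-time scheduling.
word = VLIWWord("add r1, r2, r3", "fmul f1, f2, f3", "ld r4, 0(r5)", "nop")
print(word)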
Q4. State various conditions to be satisfied for code compaction. Why is the trace scheduling
done in VLIW? [2]

A4.
Conditions for code compaction - [1 mark]
Code compaction is an attempt to tune the code automatically in order to make it occupy less
space while maintaining all of its original functionality. Thus, the conditions are -
• It must reduce the space used.
• It must retain the original functionality (i.e. the dataflow of the program must not
change and the exception behavior must be preserved).
Trace scheduling done in VLIW - [1 mark]
Many compilers for first-generation ILP processors used a three phase method to generate code.
The phases were,
• Generate a sequential program. Analyze each basic block in the sequential program for
independent operations.
• Schedule independent operations within the same block in parallel if sufficient hardware
resources are available.
• Move operations between blocks when possible.
This three phase approach fails to exploit much of the ILP available in the program for two
reasons,
• Often, operations in a basic block are dependent on each other, so sufficient ILP may not be available within a basic block.
• Arbitrary choices made while scheduling basic blocks make it difficult to move
operations between blocks.
Trace scheduling is a form of global code motion. In trace scheduling, a commonly executed
sequence of blocks is gathered into a trace, and the whole trace is scheduled together. It works
by converting a loop to a long straight-line code sequence using loop unrolling and static branch
prediction. This process separates out "unlikely" code and adds handlers for exits from the trace.
The goal is to have the most common case executed as a sequential set of instructions without
branches.
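
A minimal Python sketch of the trace-selection step, with hypothetical block names and static branch probabilities (real compilers also insert compensation code at the off-trace exits):

successors = {                # block -> [(successor, probability), ...]
    "B0": [("B1", 0.9), ("B4", 0.1)],
    "B1": [("B2", 0.8), ("B5", 0.2)],
    "B2": [("B3", 1.0)],
}

def pick_trace(entry):
    # grow the trace by always following the most probable successor
    trace, block = [entry], entry
    while block in successors:
        block = max(successors[block], key=lambda s: s[1])[0]
        trace.append(block)
    return trace

print(pick_trace("B0"))       # ['B0', 'B1', 'B2', 'B3'] -- scheduled together;
# the unlikely exits (B4, B5) stay off the trace with handler code.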
Q5. Explain set-associative memory mapping. [3]

A5.
Explanation – [0.5 marks]
It is a compromise between the two extreme cache designs based on direct mapping and full
associativity. In a k-way associative search, the tag needs to be compared only with the k tags
within the identified set. Since k is small in practice, the k-way associative search is much
more economical than full associativity.
In general, a block Bj can be mapped into any one of the available frames in a set Si, Bj ∈ Si, if
j (modulo v) = i. A matched tag identifies the current block residing in the frame.
Design Trade-offs – [0.25 marks]
• The set size (associativity) k and the number of sets v are inversely related: m = v × k.
• For a fixed cache size there exists a tradeoff between the set size and the number of sets.
• Set-associative cache is used in most high-performance computer systems.
Advantages – [0.25 marks]
1. The block replacement algorithm needs to consider only a few blocks in the same set.
The replacement policy can be implemented more economically, with limited choices, as
compared with a fully associative cache.
2. The k-way associative search is easier to implement.
3. Many design tradeoffs can be considered (m = v × k) to yield a higher hit ratio in the cache.
Figure with explanation – [1 mark]
Example with figure – [1 mark]

Ex. 2-way mapping (set-associative search)

[Figure: 2-way set-associative cache; main-memory blocks B0–B15 mapped to 4 cache sets]
Given n = 16 main-memory blocks, m = 8 cache block frames, v = 4 sets (so k = m/v = 2 frames per set), and a 2-bit word field (w = 2). Block Bj maps to set Si where i = j mod 4. To find s and d:
2^d = v, so d = 2 set bits;
2^s = n, so s = 4 block-address bits, leaving s − d = 2 tag bits.
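
A minimal Python sketch of this k-way lookup with the parameters above (the FIFO replacement policy is an assumption chosen for brevity):

V_SETS, K_WAYS = 4, 2
sets = [[] for _ in range(V_SETS)]       # each set holds up to K_WAYS block tags

def access(block):
    # return True on hit; on miss, load the block (FIFO replacement)
    s = block % V_SETS                   # set index: j mod v
    if block in sets[s]:                 # compare only the k tags in set s
        return True
    if len(sets[s]) == K_WAYS:
        sets[s].pop(0)                   # evict within the same set only
    sets[s].append(block)
    return False

for b in (0, 4, 0, 4, 8, 0):             # blocks 0, 4, 8 all map to set 0
    print(b, access(b))
# Blocks 0 and 4 both fit in set 0 (k = 2), so their repeats hit;
# block 8 then evicts block 0 (FIFO), so the final access to 0 misses.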
Q6. Give formal statement of the task dispatching problem. [1]

A6.
Formal statement – [1 mark]
Given,
I. a vector task system V
II. a vector computer, with
III. m identical pipelines, and
IV. a deadline D,
Does there exist a parallel schedule f for V with Finish Time ω such that ω ≤ D?
It is a feasibility problem.
Desirable algorithm: Heuristic scheduling algorithm.
Q7. The sum of the first k components of a vector (k = 0 to n − 1), when the number (N) of processors and the length (n) of the vector are both 8, is obtained in three steps. Will the number of steps increase if both N and n are raised to 16? Justify your answer. [3]

A7.
Formulation – [0.25 marks]
We have to find the sum, S (k) of the first k components in vector A for each k varying from 0 to
15.
S(k) = Σ_{i=0}^{k} A_i, for k = 0, 1, 2, ..., 15

i.e., S (0) = A0;


S (1) = A1 + S (0);
S (2) = A2 + A1 + A0 = A2 + S (1);
or, S (k) = S (k-1) + Ak;
It is a recursive sum.
Relation used – [1 mark]
S(k) depends on S(k−1); therefore the sums cannot all be obtained independently in parallel. A0 to A15 can be
accessed simultaneously from PEM0 to PEM15. The intermediate sums calculated in one step
are routed to PEs from other PEs, while the inactive PEs are masked off.
Therefore for n=16;
Store A0 in PE0, A1 in PE1, …, A15 in PE15.
The operation is,
[R0 ← A0, R1 ← A1, …., R15 ← A15]
To get S (1) = A1 + A0 (=S (0)), Route A0 to A1;
<Ri+1> ← <Ri> for i = 0 to 14.
Similarly to obtain other intermediate solutions,
<Ri+2> ← <Ri> for i = 0 to 13.
<Ri+4> ← <Ri> for i = 0 to 11.
<Ri+8> ← <Ri> for i = 0 to 7
Thus,
<Ri+1> ← Ai + Ai+1 (intermediate sum)
R1 ← A0 + A1 = S (1)
R2 ← A1 + A2 = S (2) – S (0)
R3 ← A2 + A3 = S (3) – S (1)
.
R15 ← A14 + A15 = S (15) – S (13)
In the same way it will be done for other steps.
Calculation of number of steps diagrammatically – [1.5 marks]

[Figure: routing and masking at each of the four steps, N = 16]
• Step 1 (route distance 1): PE15 not involved in routing; PE0 inactive in addition; mask-off: PE0 and PE15.
• Step 2 (route distance 2): PE14 and PE15 not involved in routing; PE0 and PE1 inactive in addition; mask-off: PE0, PE1, PE14 and PE15.
• Step 3 (route distance 4): PE12 to PE15 not involved in routing; PE0 to PE3 inactive in addition; mask-off: PE0 to PE3 and PE12 to PE15.
• Step 4 (route distance 8): PE8 to PE15 not involved in routing; PE0 to PE7 inactive in addition; mask-off: PE0 to PE7 during addition and PE8 to PE15 during routing.
So we can see the number of steps required is 4.
Calculation of number of steps using the formula – [0.25 marks]
For N processors (N a power of two), the number of steps equals r = log2(N).
So,
r = log2(16)
r = 4
This is also the result we obtained diagrammatically: the number of steps increases from 3 (for N = 8) to 4.
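
As a check, here is a minimal Python sketch of the recursive-doubling procedure for N = 16 (the data values are illustrative); it verifies both the prefix sums and the log2(N) step count:

N = 16
A = list(range(1, N + 1))              # sample data A0..A15
R = A[:]                               # R[i] lives in PEi

steps, dist = 0, 1
while dist < N:
    prev = R[:]                        # register values before this routing step
    for i in range(N):
        if i >= dist:                  # PEs i < dist are masked off
            R[i] = prev[i] + prev[i - dist]
    dist *= 2                          # route distances 1, 2, 4, 8
    steps += 1

assert R == [sum(A[:k + 1]) for k in range(N)]   # R[k] = S(k)
print(steps)                           # 4 = log2(16); for N = 8 the loop takes 3 steps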
