CSE 530 Homework #1 Due September 26 Anthony Dotterer: C C C T C T C C T T
Answer:
P_T = 256
c_a = (1/T) * Σ_{i=1}^{T} c_i
c_a / c_m = (Σ_{i=1}^{T} c_i t_i) / (c_m * Σ_{i=1}^{T} t_i) = (Σ_{i=1}^{T} c_i) / (c_m * T) = (1/T) * Σ_{i=1}^{T} (c_i / c_m)
• Discuss the advantage(s) and the disadvantage(s) of the von Neumann concept.
Answer:
The von Neumann concept is a computer design model that uses a single storage model to
hold both instructions and data.
Advantages:
• Reprogramming was made easier
• Programs are allowed to modify themselves
• Programs can write Programs
• General flexibility
Disadvantages:
• Malfunctioning programs can damage other programs or the operating system
• von Neumann bottleneck: the CPU must wait for data to transfer to and from memory
• Problems 1.1 and 1.2 page 60 (Hennessy and Patterson, 2nd edition).
Answer:
1.1) a)
S = 1 / ((1 - P_ve) + P_ve / S_e), where S_e = 20

[Figure: "Speedup over Percent of Vectorization" - speedup plotted against percent vectorization from 0% to 100%, rising from 1 to 20]
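The curve in the figure can be reproduced numerically. This is a small sketch of Amdahl's law for vectorization with S_e = 20 as in part a); the helper name `speedup` is mine, not from the text:

```python
# Amdahl's law for vectorization (Problem 1.1a): overall speedup as a
# function of the vectorized fraction, with vector speedup S_e = 20.
def speedup(p_ve, s_e=20):
    return 1.0 / ((1.0 - p_ve) + p_ve / s_e)

# A few points of the "Speedup over Percent of Vectorization" curve:
for pct in (0, 50, 90, 100):
    print(f"{pct:3d}% vectorized -> speedup {speedup(pct / 100):.3f}")
```

At 0% vectorization the speedup is 1, and it climbs to the full S_e = 20 only at 100%.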
b)
Solving S = 1 / ((1 - P_ve) + P_ve / S_e) for P_ve gives P_ve = S_e (S - 1) / (S (S_e - 1))
If S_e = 20 and S = 2, then P_ve = 0.5263
c)
If S_max = 20, then half the maximum speedup is S = 10, and P_ve = 0.9474
d)
For the hardware improvement, P_ve = 0.7 and S_e = 40, therefore S = 3.1496
So the compiler must achieve S = 3.1496 with S_e = 20, making P_ve = 0.71842
Therefore the compiler group must increase the percentage of vectorization by only 0.01842, so the compiler group should be given the investment.
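The comparison in part d) can be checked with a short sketch; the function names `speedup` and `required_pve` are illustrative, not from the text:

```python
# Amdahl's-law helpers for Problem 1.1d.
def speedup(p_ve, s_e):
    """Overall speedup when a fraction p_ve is sped up by s_e."""
    return 1.0 / ((1.0 - p_ve) + p_ve / s_e)

def required_pve(s, s_e):
    """Vectorized fraction needed to reach overall speedup s."""
    return s_e * (s - 1) / (s * (s_e - 1))

# Hardware option: S_e = 40 at P_ve = 0.7 ...
s_hw = speedup(0.7, 40)               # ~3.1496
# ... so the compiler option (S_e = 20) must match the same S:
pve_needed = required_pve(s_hw, 20)   # ~0.71842
print(s_hw, pve_needed, pve_needed - 0.7)
```

The last value printed is the extra vectorized fraction the compiler group must deliver, about 0.01842.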
1.2) a)
If S_e = 10 and P_e = 0.5, then S = 1 / ((1 - P_e) + P_e / S_e) = 1.8182
b)
If E_new = E_old ((1 - P_e) + P_e / S_e), S_e = 10, and P_e = 0.5, then E_new = 0.55 E_old
So the enhanced execution time is 55% of the original execution time.
• A 40-MHz processor was used to execute a benchmark program with the following
instruction mix and clock cycle counts:
Determine the effective CPI, MIPS rate, and execution time for this program.
Answer:
If CPI = (Σ_{i=1}^{N} CPI_i * I_i) / I_c and I_c = 45000 + 32000 + 15000 + 8000 = 100000, then
CPI = (45000*1 + 32000*2 + 15000*2 + 8000*2) / 100000 = 1.55

If the clock rate is 40000000 Hz and E = (Σ_{i=1}^{N} CPI_i * I_i) / f is the execution time, then
E = (45000*1 + 32000*2 + 15000*2 + 8000*2) / 40000000 = 0.003875 secs = 3.875 msecs

If MIPS = I_c / (E * 10^6), I_c = 100000, and E = 0.003875, then
MIPS = 100000 / (0.003875 * 10^6) = 25.80645
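These CPI, execution-time, and MIPS figures can be verified with a short script; the instruction counts and per-class cycle counts are the ones used in the answer above:

```python
# Effective CPI, execution time, and MIPS for the benchmark mix on a
# 40-MHz processor (45000@1, 32000@2, 15000@2, 8000@2 cycles).
counts = [45000, 32000, 15000, 8000]
cpis = [1, 2, 2, 2]
f = 40_000_000  # clock rate, Hz

i_c = sum(counts)                                  # 100000 instructions
cycles = sum(c * n for c, n in zip(cpis, counts))  # 155000 cycles
cpi = cycles / i_c                                 # 1.55
e = cycles / f                                     # 0.003875 s
mips = i_c / (e * 1e6)                             # ~25.806
print(cpi, e, mips)
```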
Answer:
If MIPS = I_c / (E * 10^6), E = (Σ_{i=1}^{N} CPI_i * I_i) / f, and CPI = (Σ_{i=1}^{N} CPI_i * I_i) / I_c, then
MIPS = f / (CPI * 10^6) and CPI = f / (MIPS * 10^6)
So if f = 15000000 and MIPS = 10, then CPI = 1.5
b) Suppose the processor is being upgraded with a 30-MHz clock. However, the
speed of the memory subsystem remains unchanged, and consequently two
clock cycles are needed per memory access. If 30% of the instructions require
one memory access and another 5% require two memory accesses per
instruction, what is the performance of the upgraded processor with a
compatible instruction set and equal instruction counts in the given program
mix?
Answer:
Answer:
1.14) a)
If MFLOPS_p = F_p / (E_p * 10^6) and F_p = 100000000, the following table shows the MFLOPS for each program and computer in Figure 1.11:

     Computer A   Computer B   Computer C
P1   100          10           5
P2   0.1          1            5
b)
The arithmetic mean is A_MFLOPS = (1/N) * Σ_{i=1}^{N} MFLOPS_i, so A_MFLOPS_A = 50.05, A_MFLOPS_B = 5.5, and A_MFLOPS_C = 5
The geometric mean is G_MFLOPS = (Π_{i=1}^{N} MFLOPS_i)^(1/N), so G_MFLOPS_A = 3.16228, G_MFLOPS_B = 3.16228, and G_MFLOPS_C = 5
The harmonic mean is H_MFLOPS = N / (Σ_{i=1}^{N} 1/MFLOPS_i), so H_MFLOPS_A = 0.1998, H_MFLOPS_B = 1.8182, and H_MFLOPS_C = 5
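The three means can be checked with a short sketch; the per-computer MFLOPS lists are the values from the table in part a), and the helper names are mine:

```python
import math

# Arithmetic, geometric, and harmonic means of the MFLOPS ratings.
def arith(xs):
    return sum(xs) / len(xs)

def geo(xs):
    return math.prod(xs) ** (1 / len(xs))

def harm(xs):
    return len(xs) / sum(1 / x for x in xs)

a, b, c = [100, 0.1], [10, 1], [5, 5]  # computers A, B, C
print(arith(a), geo(a), harm(a))
print(arith(b), geo(b), harm(b))
print(arith(c), geo(c), harm(c))
```

Note how the harmonic mean punishes computer A's very slow P2 rating, while the arithmetic mean rewards its very fast P1 rating.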
c) Not sure.
• Define the term “delayed branch”, its application, and its shortcomings (if any).
Answer:
Delayed branch is a technique for reducing the effects of control dependencies by delaying the
point at which a branch operation affects the program counter. This allows one or more
instructions following the branch operation to execute whether or not the branch is taken.
Advantage:
• Allows pipelined CPUs to reduce the clock cycles wasted on pipeline flushes during a
branch or jump operation
Disadvantage:
• If the compiler cannot find instructions to schedule after the branch due to dependencies,
it must insert no-op instructions, which increases the size of the program
Answer:
If S = 1 / ((1 - P1 - P2 - P3) + P1/S1 + P2/S2 + P3/S3), P1 = 0.3, S1 = 30, P2 = 0.3, S2 = 20, S3 = 10, and S = 10, then
1/10 = (1 - 0.3 - 0.3 - P3) + 0.3/30 + 0.3/20 + P3/10 = 0.425 - (9/10) * P3
so (9/10) * P3 = 0.325, and P3 = 0.36111
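Since the equation above is linear in P3, it can be solved directly; a small sketch using the values from the answer:

```python
# Solve Amdahl's law with three enhancements for P3, given
# P1 = P2 = 0.3, S1 = 30, S2 = 20, S3 = 10, and overall S = 10.
# 1/S = (1 - P1 - P2 - P3) + P1/S1 + P2/S2 + P3/S3 is linear in P3.
p1 = p2 = 0.3
s1, s2, s3, s = 30, 20, 10, 10

const = (1 - p1 - p2) + p1 / s1 + p2 / s2  # terms without P3: 0.425
coeff = 1 / s3 - 1                         # coefficient of P3: -0.9
p3 = (1 / s - const) / coeff
print(round(p3, 5))
```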
2) Assume the distribution of enhancement usage is 30%, 30%, and 20% for
enhancements 1, 2, and 3, respectively. Assuming all three enhancements are in use, for what
fraction of the reduced execution time is no enhancement in use?
Answer:
Answer:
Answer:
If S = 1 / (f + (1 - f)/p), f = 0.1, and p = lim_{P→∞} P, where f is the fraction of operations
performed sequentially and p is the speedup gained from parallelism, which goes to
infinity with unlimited processors, then
S = 1 / (0.1 + (1 - 0.1) / lim_{P→∞} P) = 1 / (0.1 + 0) = 10
Therefore, no matter how much parallelism is available, the maximum speedup gained is 10.
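The saturation at 1/f can be seen numerically; a small sketch with f = 0.1 and growing parallel speedup p:

```python
# With sequential fraction f, S = 1 / (f + (1 - f) / p) approaches
# but never reaches 1/f as the parallel speedup p grows.
def speedup(f, p):
    return 1.0 / (f + (1.0 - f) / p)

for p in (10, 100, 10_000, 1_000_000_000):
    print(p, speedup(0.1, p))  # climbs toward 10
```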
Answer:
Not sure
Answer:
Issues:
• Memory Access: any access to memory can take longer than one instruction
• Branching: program branches flush instructions in the pipeline and cause execution to take
longer than one instruction
• Loop fusion allows two or more loops that are executed the same number of times and
that use the same indices to be combined into one loop:
a) Within the scope of a RISC processor, why does it (Loop fusion) improve
performance (detail explanation)?
Answer:
In the scope of a RISC processor, loop fusion can improve performance by reducing the
number of loop-control instructions (index increments, compares, and branches) that must
execute. With fewer of these overhead instructions, the processor runs the program faster.
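The transformation described above can be sketched as follows; the arrays and loop bodies are illustrative, not from the text:

```python
# Loop fusion sketch: two loops over the same index range are combined
# into one, halving the loop-control (increment/compare/branch) work.
n = 4
a = [0] * n
b = [0] * n

# Before fusion: two separate loops, two sets of loop-control overhead.
for i in range(n):
    a[i] = i * 2
for i in range(n):
    b[i] = a[i] + 1

# After fusion: one loop computes both results per iteration.
a2 = [0] * n
b2 = [0] * n
for i in range(n):
    a2[i] = i * 2
    b2[i] = a2[i] + 1

print(a == a2 and b == b2)  # the fused version computes the same values
```

Fusion is legal here because both loops run the same number of times and the second loop's read of `a[i]` happens after the fused iteration writes it.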
b) Within the scope of a vector processor, why does it (Loop fusion) improve
performance (beyond what has been discussed in Part a) (detail explanation)?
Answer:
In the scope of a vector processor, loop fusion improves performance by allowing data-dependent
loops to pipeline. Not sure?
c) Within the scope of a superscalar processor, why does it (Loop fusion) improve
performance (beyond what has been discussed in Part a) (detail explanation)?
Answer:
A superscalar processor can issue independent instructions from the fused loop body in parallel.
• Interleave memory
A. Define interleaved memory (be as clear as possible);
Answer:
Interleaved memory is a memory organization that divides memory into a number of
independently accessible banks, so that a sequence of accesses can be spread across the banks.
B. Within the scope of interleaved memory, define mapping of the logical addresses to
the physical addresses. Distinguish them from each other.
Answer:
In interleaved memory, the memory is divided into N banks, where logical address i resides
in memory bank i mod N, at physical location i/N (ignoring the remainder) within that bank.
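One common convention, low-order interleaving, maps address i to bank i mod N at offset i // N; a small sketch of that mapping (names are mine):

```python
# Low-order interleaved address mapping: with N banks, address i goes
# to bank (i mod N) at offset (i // N) within that bank, so
# consecutive addresses land in different banks.
N = 4

def map_address(i, n_banks=N):
    return i % n_banks, i // n_banks  # (bank, offset)

for i in range(8):
    print(i, map_address(i))
```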
C. What is the main difference between an interleaved memory and a parallel memory?
Answer:
D. Consider a memory hierarchy using one of the three organizations for main memory as
shown below. Assume that the cache block size is 16 words, the width of organization b is
four words, and the number of memory modules in organization c is four. If the main memory
latency for a new access is 10 cycles and the transfer time is 1 cycle, what is the cache miss
penalty for each of these organizations?
Answer:
For the interleaved organization, four words are accumulated every 11 cycles after an initial
access of 14 cycles, so the miss penalty for a 16-word block is 14 + 3 * 11 = 47 cycles.
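The 47-cycle figure can be reproduced under the stated timing assumptions; the narrow and wide figures below are my own extrapolation of the same accounting, not from the text:

```python
# Cache-miss penalty sketch: 16-word block, 10-cycle access latency,
# 1-cycle transfer, 4-word width for (b), 4 banks for (c).
block_words, width, banks, latency, transfer = 16, 4, 4, 10, 1

narrow = block_words * (latency + transfer)           # (a): 16 * 11
wide = (block_words // width) * (latency + transfer)  # (b): 4 * 11
# (c): first group of 4 words after 10 + 4 = 14 cycles, then the bank
# latencies overlap, so each later group of 4 arrives 11 cycles later.
interleaved = (latency + banks * transfer) \
    + (block_words // banks - 1) * (latency + transfer)
print(narrow, wide, interleaved)
```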
E. Suppose a processor with a 16-word block size has an effective miss rate per
instruction of 0.5%. Assume that the CPI without cache misses is 1.2. Using the memory
organizations in part D, how much faster is this processor when using the wide memory than
when using the narrow or interleaved memory?
Answer:
[Figure: the three main-memory organizations: (a) CPU, cache, and bus to a one-word-wide memory; (b) CPU with a multiplexer, cache, and bus to a wide memory; (c) CPU, cache, and bus to interleaved memory banks]
Answer: