CSE 530
Homework #1 Due September 26
Anthony Dotterer
• For a linearly connected array of processors (i.e., an ILLIAC IV type architecture),
calculate the hardware utilization when performing a maximum search operation.
Assume n data elements and an organization composed of n processing elements, where n is
a power of 2. Assume a one-dimensional communication pattern among the processors,
as defined by the architecture.
Answer:
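PT = 256
$$c_a = \frac{\sum_{i=1}^{T} c_i}{T}$$
$$\frac{c_a}{c_m} = \frac{\sum_{i=1}^{T} c_i \,\Delta t_i}{c_m \sum_{i=1}^{T} \Delta t_i} = \frac{\sum_{i=1}^{T} c_i}{c_m T} = \frac{1}{T} \sum_{i=1}^{T} \frac{c_i}{c_m}$$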
• Discuss the advantage(s) and the disadvantages of the von Neumann concept.
Answer:
The von Neumann concept is a computer design model that uses a single memory to hold
both instructions and data.
Advantages:
• Reprogramming is easier
• Programs are allowed to modify themselves
• Programs can write programs
• General flexibility
Disadvantages:
• Malfunctioning programs can damage other programs or the operating system
• von Neumann bottleneck: the CPU must wait for data to transfer to and from memory
• Problems 1.1 and 1.2 page 60 (Hennessy and Patterson, 2nd edition).
Answer:
1.1) a)
$$S = \frac{1}{(1 - P_{ve}) + \frac{P_{ve}}{S_e}}, \quad \text{where } S_e = 20$$
[Figure: Speedup over Percent of Vectorization. Vertical axis: Speedup (0 to 25); horizontal axis: percent of vectorization (0% to 100%).]
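b)
$$S = \frac{1}{(1 - P_{ve}) + \frac{P_{ve}}{S_e}}, \quad (1 - P_{ve}) + \frac{P_{ve}}{S_e} = \frac{1}{S}, \quad 1 - \frac{P_{ve}(S_e - 1)}{S_e} = \frac{1}{S}, \quad P_{ve} = \frac{S_e (S - 1)}{S (S_e - 1)}$$
If $S_e = 20$ and $S = 2$, then $P_{ve} = 0.5263$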
c)
If $S_{max} = 20$ then $S = 10$ and $P_{ve} = 0.9474$
d)
For the hardware improvement, $P_{ve} = 0.7$ and $S_e = 40$, therefore $S = 3.1496$
To match it, the compiler must reach $S = 3.1496$ with $S_e = 20$, which requires $P_{ve} = 0.71842$
Therefore the compiler group must increase the percentage of vectorization by only 0.01842
(from 0.7 to 0.71842). Based on this, the compiler group should be given the investment.
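The arithmetic in parts b to d can be checked with a short script. This is only a sketch: the function names `amdahl_speedup` and `required_fraction` are made up for illustration, and the formulas are the ones written out in parts a and b.

```python
def amdahl_speedup(p_ve, s_e):
    """Overall speedup when a fraction p_ve of the work is sped up by s_e."""
    return 1.0 / ((1.0 - p_ve) + p_ve / s_e)


def required_fraction(s, s_e):
    """Fraction p_ve needed to reach overall speedup s (part b solved for p_ve)."""
    return s_e * (s - 1.0) / (s * (s_e - 1.0))


p_b = required_fraction(2, 20)        # part b
p_c = required_fraction(10, 20)       # part c
s_hw = amdahl_speedup(0.7, 40)        # part d, hardware option
p_d = required_fraction(s_hw, 20)     # part d, compiler option matching s_hw

print(round(p_b, 4), round(p_c, 4), round(s_hw, 4), round(p_d, 5))
```

The difference `p_d - 0.7` is the extra vectorization the compiler group would need to deliver.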
1.2) a)
If $S_e = 10$, $P_e = 0.5$, and $S = \frac{1}{(1 - P_e) + \frac{P_e}{S_e}}$, then $S = 1.8182$
b)
If $E_{new} = E_{old} \left( (1 - P_e) + \frac{P_e}{S_e} \right)$, $S_e = 10$, and $P_e = 0.5$, then $E_{new} = 0.55\,E_{old}$
So 55% of the original execution time has been converted to fast mode
• A 40-MHz processor was used to execute a benchmark program with the following
instruction mix and clock cycle counts:
Instruction type      Instruction count   Clock cycle count
Integer arithmetic    45000               1
Data transfer         32000               2
Floating point        15000               2
Control transfer      8000                2
Determine the effective CPI, MIPS rate, and execution time for this program.
Answer:
If $CPI = \frac{\sum_{i=1}^{N} CPI_i I_i}{I_c}$ and $I_c = 45000 + 32000 + 15000 + 8000 = 100000$, then
$$CPI = \frac{(45000 \cdot 1) + (32000 \cdot 2) + (15000 \cdot 2) + (8000 \cdot 2)}{100000} = 1.55$$
If $\tau = \frac{1}{40000000}$, $E$ is the execution time, and $E = \tau \sum_{i=1}^{N} CPI_i I_i$, then
$$E = \frac{(45000 \cdot 1) + (32000 \cdot 2) + (15000 \cdot 2) + (8000 \cdot 2)}{40000000} = 0.003875\ \text{s} = 3.875\ \text{ms}$$
If $MIPS = \frac{I_c}{E \cdot 10^6}$, $I_c = 100000$, and $E = 0.003875$, then
$$MIPS = \frac{100000}{0.003875 \cdot 10^6} = 25.80645$$
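The same three quantities can be recomputed from the instruction mix. The sketch below uses the counts and cycle counts from the problem table; the variable names are chosen here for illustration.

```python
# (instruction count, clock cycles per instruction) from the problem table
mix = {
    "integer arithmetic": (45000, 1),
    "data transfer": (32000, 2),
    "floating point": (15000, 2),
    "control transfer": (8000, 2),
}
clock_hz = 40e6  # 40-MHz processor

total_instructions = sum(count for count, _ in mix.values())
total_cycles = sum(count * cycles for count, cycles in mix.values())

cpi = total_cycles / total_instructions          # effective CPI
exec_time_s = total_cycles / clock_hz            # execution time in seconds
mips = total_instructions / (exec_time_s * 1e6)  # MIPS rate

print(cpi, exec_time_s, round(mips, 5))
```

Note that the MIPS rate is also just the clock rate in MHz divided by the CPI (40 / 1.55).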
• A workstation uses a 15-MHz processor with a claimed 10-MIPS rating to execute a
given program mix. Assume a one-cycle delay for each memory access:
a) What is the effective CPI of this computer?
Answer:
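If $MIPS = \frac{I_c}{E \cdot 10^6}$, $E = \tau \sum_{i=1}^{N} CPI_i I_i$, and $CPI = \frac{\sum_{i=1}^{N} CPI_i I_i}{I_c}$, then
$$MIPS = \frac{1}{\tau \cdot CPI \cdot 10^6} \quad \text{and} \quad CPI = \frac{1}{\tau \cdot MIPS \cdot 10^6}$$
So if $\tau = \frac{1}{15000000}$ and $MIPS = 10$, then $CPI = 1.5$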
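As a quick numerical check of the relation between clock rate, MIPS rating, and CPI, here is a minimal sketch; `effective_cpi` is a name introduced only for this example.

```python
def effective_cpi(clock_hz, mips):
    """CPI = 1 / (tau * MIPS * 10^6), with tau = 1 / clock_hz."""
    tau = 1.0 / clock_hz
    return 1.0 / (tau * mips * 1e6)


cpi_a = effective_cpi(15e6, 10)
print(round(cpi_a, 4))
```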
b) Suppose the processor is being upgraded with a 30-MHz clock. However, the
speed of the memory subsystem remains unchanged, and consequently two
clock cycles are needed per memory access. If 30% of the instructions require
one memory access and another 5% require two memory accesses per
instruction, what is the performance of the upgraded processor with a
compatible instruction set and equal instruction counts in the given program
mix?
Answer:
If $CPI = \frac{\sum_{i=1}^{N} CPI_i I_i}{I_c}$ then ... Not sure...
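One possible way to approach part b, offered only as a sketch and not as the intended solution: assume the CPI of 1.5 from part a already includes one cycle per memory access, and that after the upgrade every memory access costs one additional cycle. Both are assumptions about how to read the problem.

```python
old_cpi = 1.5                  # result of part a
new_clock_mhz = 30             # upgraded clock
extra_cycles_per_access = 1    # assumption: 2 cycles per access now, 1 before

# 30% of instructions make one memory access, 5% make two
extra_cpi = (0.30 * 1 + 0.05 * 2) * extra_cycles_per_access
new_cpi = old_cpi + extra_cpi
new_mips = new_clock_mhz / new_cpi   # MIPS = clock rate in MHz / CPI

print(round(new_cpi, 2), round(new_mips, 2))
```

Under these assumptions the upgraded machine would rate higher than the claimed 10 MIPS, but by less than the factor of 2 suggested by the clock rate alone.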
• Problem 1.14 page 66 (Hennessy and Patterson, 2nd edition).
Answer:
1.14) a)
If $MFLOPS_p = \frac{F_p}{E_p \cdot 10^6}$ and $F_p = 100000000$, the following table shows the MFLOPS
for each program and computer in figure 1.11:
Programs   Computer A   Computer B   Computer C
P1         100          10           5
P2         0.1          1            5
b)
Arithmetic mean is $\overline{MFLOPS} = \frac{\sum_{i}^{N} MFLOPS_i}{N}$, so $\overline{MFLOPS}_A = 50.05$, $\overline{MFLOPS}_B = 5.5$, and $\overline{MFLOPS}_C = 5$
Geometric mean is $G_{MFLOPS} = \left( \prod_{i}^{N} MFLOPS_i \right)^{\frac{1}{N}}$, so $G_{MFLOPS,A} = 3.16228$, $G_{MFLOPS,B} = 3.16228$, and $G_{MFLOPS,C} = 5$
Harmonic mean is $H_{MFLOPS} = \frac{N}{\sum_{i}^{N} \frac{1}{MFLOPS_i}}$, so $H_{MFLOPS,A} = 0.1998$, $H_{MFLOPS,B} = 1.8182$, and $H_{MFLOPS,C} = 5$
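The three means can be reproduced with Python's standard `statistics` module. This is a small sketch using the MFLOPS values from the table in part a; the dictionary layout is just one convenient choice.

```python
from statistics import mean, geometric_mean, harmonic_mean

# MFLOPS for programs P1 and P2 on each computer (table from part a)
mflops = {
    "A": [100, 0.1],
    "B": [10, 1],
    "C": [5, 5],
}

results = {
    name: (mean(rates), geometric_mean(rates), harmonic_mean(rates))
    for name, rates in mflops.items()
}

for name, (arith, geo, harm) in results.items():
    print(name, round(arith, 2), round(geo, 5), round(harm, 4))
```

The ranking of the computers changes with the choice of mean: A leads on the arithmetic mean, while C leads on both the geometric and the harmonic mean.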
c) Not sure.
• Define the term “delayed branch”, its application, and its shortcomings (if any).
Answer:
A delayed branch is a technique for reducing the effects of control dependencies by delaying
the point where a branch operation affects the program counter. This allows one or more
instructions following the branch operation to execute whether or not the branch
is taken.
Advantage:
• Allows pipelined CPUs to reduce the clock cycles wasted on pipeline flushing
during a branch or a jump operation
Disadvantage:
• If the compiler cannot place useful instructions after the branch because of dependencies,
it must insert no-op instructions, which increases the size of the program
• Three enhancements with the following speedups are proposed:

Speedup1 = 30
Speedup2 = 20
Speedup3 = 10
Only one enhancement is usable at a time:
1) If enhancements 1 and 2 are each usable for 30% of the time, what fraction of the time
must enhancement 3 be used to achieve an overall speedup of 10?
Answer:
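If $S = \frac{1}{(1 - P_1 - P_2 - P_3) + \frac{P_1}{S_1} + \frac{P_2}{S_2} + \frac{P_3}{S_3}}$, $P_1 = 0.3$, $S_1 = 30$, $P_2 = 0.3$, $S_2 = 20$, $S_3 = 10$, and $S = 10$, then
$$10 = \frac{1}{(1 - 0.3 - 0.3 - P_3) + \frac{0.3}{30} + \frac{0.3}{20} + \frac{P_3}{10}}, \quad \frac{1}{10} = 0.425 - \frac{9 P_3}{10}, \quad \frac{9 P_3}{10} = 0.325, \quad \text{and } P_3 = 0.36111$$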
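The value of P3 can be double-checked by solving the speedup equation numerically and substituting the result back in. This is a sketch; `overall_speedup` and `solve_p3` are illustrative names, not part of the assignment.

```python
def overall_speedup(fractions, speedups):
    """Speedup when each enhancement i is used for fraction P_i with speedup S_i."""
    unenhanced = 1.0 - sum(fractions)
    return 1.0 / (unenhanced + sum(p / s for p, s in zip(fractions, speedups)))


def solve_p3(target, p1, p2, s1, s2, s3):
    """Solve 1/S = (1 - P1 - P2 - P3) + P1/S1 + P2/S2 + P3/S3 for P3."""
    known = (1.0 - p1 - p2) + p1 / s1 + p2 / s2   # terms without P3 (0.425 here)
    return (known - 1.0 / target) / (1.0 - 1.0 / s3)


p3 = solve_p3(10, 0.3, 0.3, 30, 20, 10)
check = overall_speedup([0.3, 0.3, p3], [30, 20, 10])
print(round(p3, 5), round(check, 6))
```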
10
2) Assume the distribution of enhancement usage is 30%, 30%, and 20% for
enhancements 1, 2, and 3, respectively. Assuming all three enhancements are in use, for what
fraction of the reduced execution time is no enhancement in use?
Answer:
If $E_{new} = E_{old} \left( (1 - P_1 - P_2 - P_3) + \frac{P_1}{S_1} + \frac{P_2}{S_2} + \frac{P_3}{S_3} \right)$ and $P_3 = 0.2$, then $E_{new} = 0.245\,E_{old}$
So $P_{old} = 1 - P_1 - P_2 - P_3 = 0.2$ represents the usage of no enhancements, and $E_{ne} = 0.2\,E_{old}$
Therefore $F_{ne} = \frac{E_{ne}}{E_{new}} = \frac{0.2\,E_{old}}{0.245\,E_{old}} = 0.8163$, where $E_{ne}$ is the execution time with no
enhancements and $F_{ne}$ is the fraction of the reduced execution time where no enhancements
are in use
3) Assume for some benchmark, the fraction of use is 15% for each of the enhancements
1 and 2 and 70% for enhancement 3. We want to maximize performance. If only one
enhancement can be implemented, which should be chosen?
Answer:
If P1=0.15 , then S=1.16959

If P2=0.15 , then S=1.16618
If P3=0.7 , then S=2.7027
Therefore enhancement 3 should be implemented
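The comparison can be written as a few lines of code, evaluating the single-enhancement speedup formula once per candidate. This is a sketch with illustrative names; the fractions and speedups are the ones given in the problem.

```python
def amdahl(p, s):
    """Overall speedup when fraction p of the time is sped up by factor s."""
    return 1.0 / ((1.0 - p) + p / s)


# enhancement number -> (fraction of use, speedup of the enhancement)
candidates = {1: (0.15, 30), 2: (0.15, 20), 3: (0.70, 10)}

speedups = {k: amdahl(p, s) for k, (p, s) in candidates.items()}
best = max(speedups, key=speedups.get)

for k, s in speedups.items():
    print(k, round(s, 5))
print("best:", best)
```

Enhancement 3 has the smallest speedup factor but by far the largest fraction of use, which is what dominates here.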
• True or false: if 10% of the operations in a program must be performed sequentially,
then the maximum speedup gained is 10, no matter how much parallelism is available
(prove your answer).
Answer:
If $S = \frac{1}{f + \frac{1 - f}{p}}$, $f = 0.1$, and $p = \lim_{P \to \infty} P$, where $f$ is the fraction of operations
performed sequentially and $p$ is the speedup gained from parallelism, which goes to
infinity with unlimited processors, then
$$S = \frac{1}{0.1 + \frac{1 - 0.1}{\lim_{P \to \infty} P}} = \frac{1}{0.1 + 0} = 10$$
Therefore it is true that no matter how much parallelism is available, the maximum speedup
gained is 10
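The limit can also be seen numerically: as the parallel speedup p grows, S approaches 10 from below and never reaches it. A minimal sketch, with `speedup` as an illustrative name:

```python
def speedup(f, p):
    """Amdahl's law: f is the sequential fraction, p the parallel speedup."""
    return 1.0 / (f + (1.0 - f) / p)


values = {p: speedup(0.1, p) for p in (1, 10, 100, 1000, 10**6)}

for p, s in values.items():
    print(p, round(s, 4))
```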
• True or false: in general, linear speedup is needed to make parallel systems
(multiprocessors) cost effective (justify your answer).
Answer:
Not sure
• CPU time (T) is defined as:
T = Ic * CPI * τ
Ic stands for the instruction count,
CPI stands for average clock cycles per instruction, and
τ stands for the clock cycle time.
A RISC computer, ideally, should be able to execute one instruction per clock cycle. Within
the scope of a RISC architecture, name and discuss (briefly) distinct issues that do not allow
ideal performance.
Answer:
Issues:
• Memory access: an access to memory can take longer than one clock cycle
• Branching: program branches flush instructions from the pipeline, so a branch takes
longer than one clock cycle
• Loop fusion allows two or more loops that are executed the same number of times and
that use the same indices to be combined into one loop:
a) Within the scope of a RISC processor, why does it (Loop fusion) improve
performance (detail explanation)?
Answer:
In the scope of a RISC processor, Loop fusion can improve performance by decreasing the
need for extraneous loop control instructions. In the absence of extraneous loop control
instructions, the processor can run a program faster.
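As an illustration of the transformation itself, the sketch below shows two loops over the same index range and their fused equivalent. Python is used only to show the shape of the code; the saving described above applies to the compiled loop, where each iteration pays for an increment, a compare, and a branch. The function names and the example computation are made up for this illustration.

```python
def unfused(a, b):
    """Two separate loops over the same index range."""
    n = len(a)
    sums, prods = [0] * n, [0] * n
    for i in range(n):          # loop control is executed n times here
        sums[i] = a[i] + b[i]
    for i in range(n):          # and n times again here
        prods[i] = a[i] * b[i]
    return sums, prods


def fused(a, b):
    """The same work in one loop, so loop control is executed only n times."""
    n = len(a)
    sums, prods = [0] * n, [0] * n
    for i in range(n):
        sums[i] = a[i] + b[i]
        prods[i] = a[i] * b[i]
    return sums, prods


print(fused([1, 2, 3], [4, 5, 6]))
```

Both versions compute the same result; the fused version also reads `a[i]` and `b[i]` once per iteration while they are still close at hand.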
b) Within the scope of a vector processor, why does it (Loop fusion) improve
performance (beyond what has been discussed in Part a) (detail explanation)?
Answer:
In the scope of a vector processor, loop fusion improves performance by allowing
data-dependent loops to pipeline. Not sure.
c) Within the scope of a superscalar processor, why does it (Loop fusion) improve
performance (beyond what has been discussed in Part a) (detail explanation)?
Answer:
A superscalar processor can execute the independent instructions of the fused loop body in parallel.
• Interleaved memory
A. Define interleaved memory (be as clear as possible);
Answer:
Interleaved memory is a way of dividing memory into a number of memory banks so that
consecutive addresses fall in different banks.
B. Within the scope of interleaved memory, define mapping of the logical addresses to
the physical addresses. Distinguish them from each other.
Answer:
In interleaved memory, the memory is divided into N banks, where logical address i
resides in memory bank i mod N and is physically addressed within that bank
by i/N (ignoring the remainder).
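A small sketch of the mapping, assuming the usual low-order interleaving in which the bank number is i mod N and the address inside the bank is i / N with the remainder dropped. The function name `map_address` is made up for this example.

```python
def map_address(i, n_banks):
    """Return (bank, offset) for logical address i under low-order interleaving."""
    bank = i % n_banks       # which bank holds the word
    offset = i // n_banks    # physical address inside that bank
    return bank, offset


mapping = [map_address(i, 4) for i in range(8)]
print(mapping)
```

Consecutive logical addresses land in consecutive banks, which is what lets several banks work on a contiguous block at the same time.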
C. What is the main difference between an interleaved memory and a parallel memory?
Answer:
Interleaved memory requires 2 to N memory banks to look up multiple contiguous virtual
memory locations, whereas parallel memory requires only 1 memory bank.
D. Consider a memory hierarchy using one of the three organizations for main memory as
shown below. Assume that the cache block size is 16 words, the width of organization b is
four words, and the number of memory modules in organization c is four. If the main memory
latency for a new access is 10 cycles and the transfer time is 1 cycle, what is the cache miss
penalty for each of these organizations?
Answer:
A) If $M_a = 10$ cycles per word, $M_t = 1$ cycle per word, $M_c = 16$ words, and
$M = M_c \cdot (M_a + M_t)$, then $M = 176$ cycles
B) If $M_a = 10$ cycles per 4 words, $M_t = 1$ cycle per 4 words, $M_c = 16$ words, and
$M = M_c \cdot (M_a + M_t)$, then $M = 16\ \text{words} \cdot \frac{10\ \text{cycles} + 1\ \text{cycle}}{4\ \text{words}} = 44$ cycles
C) So the memory access would be the following:
Cycles  Bank 1         Bank 2         Bank 3         Bank 4
1       Access word
2       Cycle 2        Access word
3       Cycle 3        Cycle 2        Access word
4       Cycle 4        Cycle 3        Cycle 2        Access word
...
10      Cycle 10       Cycle 9        Cycle 8        Cycle 7
11      Transfer word  Cycle 10       Cycle 9        Cycle 8
12      Access word    Transfer word  Cycle 10       Cycle 9
13      Cycle 2        Access word    Transfer word  Cycle 10
14      Cycle 3        Cycle 2        Access word    Transfer word
15      Cycle 4        Cycle 3        Cycle 2        Access word
So the first 4 words arrive after 14 cycles and each further group of 4 words arrives 11 cycles
after the previous one; therefore 14 + 3 * 11 = 47 cycles are spent.
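The three miss penalties can be recomputed with a short script. The interleaved case below follows the timing model of the table above (bank k starts one cycle after bank k-1, and each bank begins its next access right after its transfer); that model is an assumption of this answer, and other textbook treatments of interleaving can give a different figure. All names are illustrative.

```python
BLOCK_WORDS = 16
LATENCY = 10    # cycles for a new access
TRANSFER = 1    # cycles to transfer one unit


def narrow_penalty():
    """One-word-wide memory: every word pays latency plus transfer."""
    return BLOCK_WORDS * (LATENCY + TRANSFER)


def wide_penalty(width=4):
    """Wide memory: latency plus transfer is paid once per group of `width` words."""
    return (BLOCK_WORDS // width) * (LATENCY + TRANSFER)


def interleaved_penalty(banks=4):
    """Interleaved memory under the staggered timing model of the table."""
    last_transfer = 0
    for word in range(BLOCK_WORDS):
        bank, rnd = word % banks, word // banks
        start = (bank + 1) + rnd * (LATENCY + TRANSFER)  # cycle the access begins
        transfer_cycle = start + LATENCY                 # cycle the word is transferred
        last_transfer = max(last_transfer, transfer_cycle)
    return last_transfer


print(narrow_penalty(), wide_penalty(), interleaved_penalty())
```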
E. Suppose a processor with a 16-word block size has an effective miss rate per
instruction of 0.5%. Assume that the CPI without cache miss is 1.2. Using the memory
organizations in part D, how much faster is this processor when using the wide memory than
when using narrow or interleaved memory?
Answer:
[Figure for part D: three main memory organizations. a. One-word-wide memory organization (CPU, cache, bus, memory); b. Wide memory organization (CPU, multiplexer, cache, bus, memory); c. Interleaved memory organization (CPU, cache, bus, Mem0 to Mem3).]
