Computer Performance

Sachin Gajjar
Dhaval Shah
Reading Material
• Chapter 4, Computer Performance
• Book: Computer Architecture: From Microprocessors to Supercomputers by Behrooz Parhami, Oxford University Press
Computer Architecture
[Figure: layers of a computer system, top to bottom: Application Program; Operating Systems; Compilers/Assemblers; x86 Machine Primitives; Logic Gates & Memory; Transistors & Devices. Grouping labels: Applications, Computer Architecture, Hardware Technology.]
• Layers to manage complexity
• Layers are everywhere!
Elements of Modern Computer System
• Computing Problems
  • Numerical, alphanumerical, logical reasoning
  • Science, business, machine learning, embedded applications
• Algorithms and Data Structures
• Operating System
  • Allocation/deallocation of resources
• System Software
  • Compiler, Assembler
Defining Computer Performance
• As users, we expect a higher-performing computer to run our application programs faster.
• A computer that reacts immediately to the user is a universally accepted indicator of performance.
• Since longer execution time implies lower performance, we might write:
• Performance = 1/Execution time
• For the end user, this execution time is the total response time or turnaround time, which includes latency due to scheduling decisions, work interruptions, I/O queuing delays, and so on.
• This is wall-clock time (measured by looking at a wall clock at the start and termination of a task).
• To filter out the effects of such highly variable and hard-to-quantify factors, CPU execution time is used to define user-perceived performance:
• Performance = 1/CPU execution time
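Processor Performance
\[ \text{Processor Performance} = \frac{1}{\text{CPU Execution Time}} \]
\[ \text{CPU Execution Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} \]
• Instructions/Program: code size, instruction count
• Cycles/Instruction: machine cycles required for execution of each instruction (cycles per instruction, CPI)
• Seconds/Cycle: time consumed for each machine cycle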
Elaborated CPU Execution Time Formula
CPU time = Instructions × (Cycles Per Instruction) × (Seconds Per Cycle)
= Instructions × Average CPI / (Clock Rate)
(i.e., Clock Rate = Cycles Per Second = frequency)
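The formula above can be tried out with a few lines of Python. This is only a minimal sketch: the function names and the sample workload (2 million executed instructions, average CPI 2.5, 500 MHz clock) are made up for illustration and do not come from the slides.

```python
def cpu_time_seconds(instruction_count: float, average_cpi: float,
                     clock_rate_hz: float) -> float:
    """CPU time = Instructions x Average CPI / Clock rate."""
    return instruction_count * average_cpi / clock_rate_hz


def performance(cpu_time: float) -> float:
    """Performance = 1 / CPU execution time."""
    return 1.0 / cpu_time


# Hypothetical workload (illustrative numbers, not from the slides)
t = cpu_time_seconds(2_000_000, 2.5, 500e6)
print(f"CPU time: {t * 1e3:.1f} ms")  # prints "CPU time: 10.0 ms"
print(f"Performance: {performance(t):.0f} runs per second")
```

Halving the CPI or doubling the clock rate halves the CPU time, which is the point of writing the formula as a product of three factors.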

Instructions:
• Number of instructions actually executed (dynamic instruction count), not the number of instructions in the program (static code size)
• Dynamic instruction count > static code size (loops, repeated calls, procedures)
• Determined by
  • algorithm (e.g., sorting and searching),
  • programmer,
  • compiler (which performs code optimisations),
  • Instruction Set Architecture (which instructions are available); e.g., the ARM instruction ADD R1, R2, R3, LSL #2 shifts and adds in one instruction, and a single DJNZ reg, addr does the work of a DEC, CMP, JMP sequence
Average CPI
• Cycles Per Instruction
• Calculated from the dynamic instruction mix and knowledge of how many clock cycles are needed to execute the various instructions (or instruction classes)
• Its inverse is Instructions Per Cycle (IPC); with hardware support (pipelining), the number of instructions per cycle can be increased
• As IPC increases, CPI decreases and hence execution time decreases, which in turn increases performance
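The calculation described above can be written compactly as a weighted sum. The symbols here are introduced only for illustration: w_i is the fraction of executed instructions that belong to class i, and CPI_i is the number of cycles an instruction of that class needs.

```latex
\[
\text{Average CPI} = \sum_i w_i \,\text{CPI}_i ,
\qquad \sum_i w_i = 1 ,
\qquad \text{IPC} = \frac{1}{\text{Average CPI}}
\]
```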
Clock rate:
• Clock frequency given to the processor
• 1 GHz = 10^9 cycles/s (cycle time, i.e., clock period, = 10^-9 s = 1 ns)
• 200 MHz = 200 × 10^6 cycles/s (cycle time = 5 ns)
• Determined by technology, circuit design
• P ∝ f × Vdd^2 (power P, clock frequency f, supply voltage Vdd)
• A high f decreases execution time, increases performance
• As f increases, P increases, which sets an upper limit on the value of f that can be used
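The cycle times quoted above, and the power proportionality, can be checked with a short sketch. The function names are illustrative, and the scaling scenario at the end (20% higher clock, 10% higher supply voltage) is a made-up example rather than data from the slides.

```python
def cycle_time_ns(clock_rate_hz: float) -> float:
    """Clock period in nanoseconds: the inverse of the clock rate."""
    return 1e9 / clock_rate_hz


def relative_power(f_ratio: float, vdd_ratio: float = 1.0) -> float:
    """Power relative to a baseline, using P proportional to f * Vdd^2."""
    return f_ratio * vdd_ratio ** 2


print(cycle_time_ns(1e9))    # 1 GHz   -> 1.0 ns
print(cycle_time_ns(200e6))  # 200 MHz -> 5.0 ns

# Hypothetical scaling: 20% faster clock that also needs 10% more Vdd
print(f"{relative_power(1.2, 1.1):.3f}x the baseline power")
```

Only ratios are computed here, because the proportionality says nothing about the constant that turns f × Vdd^2 into watts.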
Amdahl's Law (Parallel Processors)
• Points out some limitations of parallel processing.
• Programs contain certain computations that are inherently sequential and thus cannot be sped up through parallel processing.
• f = fraction of the program that can be parallelized, e.g., to run in vector computation mode (adding arrays in parallel)
• 1 - f = fraction of the program that runs sequentially
• With the original running time T of the program normalized to 1, the overall speedup S with N processors is
\[ S = \frac{1}{(1 - f) + f/N} \]
(where 1 is the original running time of the program and (1 - f) + (f/N) is the program's improved execution time with N processors)
• As N becomes very large, the second term (f/N) approaches zero, and the total execution time is dominated by the sequential part (1 - f)
• This is referred to as the sequential bottleneck
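A small Python sketch makes the limiting behaviour visible. The function names are illustrative, and the parallel fraction f = 0.9 is an assumed value chosen only to show the trend.

```python
def amdahl_speedup(f: float, n: float) -> float:
    """Speedup with n processors when a fraction f of the program is parallelizable."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - f) + f / n)


def speedup_limit(f: float) -> float:
    """Upper bound on the speedup as n grows without limit: 1 / (1 - f)."""
    return float("inf") if f == 1.0 else 1.0 / (1.0 - f)


f = 0.9  # assumed parallel fraction, for illustration only
for n in (1, 2, 10, 100, 1000):
    print(f"N = {n:4d}: speedup = {amdahl_speedup(f, n):.2f}")
print(f"limit for large N: {speedup_limit(f):.2f}")
```

Even with 90% of the work parallelizable, the speedup can never exceed 10, because the sequential 10% is untouched by adding processors.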
Sequential Bottleneck
• Sequential bottleneck: the time spent in sequential execution (scalar computation) limits how much overall performance improvement can be achieved by exploiting parallelism.
• As N (the machine parallelism) increases, the overall speedup becomes more and more sensitive to, and is eventually dictated by, the sequential part of the program.
Example 1
A processor spends 30% of its time on floating-point (flp) addition, 25% on flp multiplication, and 10% on flp division. Evaluate the following enhancements, each costing the same to implement:
a. Redesign the flp adder to make it twice as fast.
b. Redesign the flp multiplier to make it three times as fast.
c. Redesign the flp divider to make it 10 times as fast.
Solution
a. Adder redesign speedup = 1 / [0.7 + 0.3 / 2] = 1.18 [f = 0.3, N = 2]
b. Multiplier redesign speedup = 1 / [0.75 + 0.25 / 3] = 1.20
c. Divider redesign speedup = 1 / [0.9 + 0.1 / 10] = 1.10
Even a tenfold speedup of the divider gives the smallest gain, so it is not worth the effort.
What if both the adder and the multiplier are redesigned?
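The three speedups, and the closing question, can be checked with a short sketch. It applies the same formula with more than one enhanced part, under the assumption that the quoted time fractions do not overlap; the function and variable names are illustrative.

```python
def speedup(enhancements):
    """Overall speedup for a list of (fraction_of_time, speedup_factor) pairs.

    Assumes the fractions refer to non-overlapping parts of the run time.
    """
    unaffected = 1.0 - sum(fraction for fraction, _ in enhancements)
    improved = sum(fraction / factor for fraction, factor in enhancements)
    return 1.0 / (unaffected + improved)


adder = (0.30, 2)
multiplier = (0.25, 3)
divider = (0.10, 10)

print(f"adder only:         {speedup([adder]):.2f}")              # 1.18
print(f"multiplier only:    {speedup([multiplier]):.2f}")         # 1.20
print(f"divider only:       {speedup([divider]):.2f}")            # 1.10
print(f"adder + multiplier: {speedup([adder, multiplier]):.2f}")  # 1.46
```

With both the adder and the multiplier redesigned, the run time becomes 0.45 + 0.3/2 + 0.25/3 ≈ 0.683 of the original, a speedup of about 1.46.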

Benchmarking
• Benchmarks are real or synthetic programs that are selected or designed
for comparative evaluation of machine performance.
• A benchmark suite is a collection of such programs intended to represent
an entire class of applications
• Benchmarks facilitate comparison across different platforms and
computer classes.
• They make it possible for computer vendors and independent firms to
evaluate many machines upon their entry into the market and to publish
the benchmarking results for the benefit of users.
• In this way, the user may not need to perform any benchmarking at all.
• E.g., the Standard Performance Evaluation Corporation (SPEC)
[Table: SPEC CPU2000 benchmark suite characteristics]
Performance Estimation
• Peak Performance
  • The absolute highest level of performance that can be obtained from the system
  • Expressed in instructions per second (IPS), with MIPS (million instructions per second) and GIPS (giga instructions per second) preferred to keep the numbers small
  • For applications that involve floating-point calculations, floating-point operations per second (FLOPS) is used as the unit, again with megaflops (MFLOPS) and gigaflops (GFLOPS) preferred
\[ \text{Peak Performance} = \frac{1}{\text{CPU Execution Time}} \]
CPU execution time = Instructions × Average CPI / (Clock rate)
Peak performance = Clock rate / Average CPI, evaluated as if all instructions were of the same (fastest) class
Comparing two machines
• When comparing two machines M1 and M2, the notion of relative performance comes into play.
(Performance of M1)/(Performance of M2)
= Speedup of M1 over M2
= (Execution time of M2)/(Execution time of M1)
Example 3
Consider two implementations M1 (600 MHz) and M2 (500 MHz) of
an instruction set containing three classes of instructions:
Class   CPI for M1   CPI for M2   Comments
F       5.0          4.0          Floating-point
I       2.0          3.8          Integer arithmetic
N       2.4          2.0          Non-arithmetic
a. What are the peak performances of M1 and M2 in MIPS?
b. If 50% of instructions executed are class-N, with the rest divided
equally among F and I, which machine is faster? By what factor?
Solution
a. Peak MIPS for M1 = 600 / 2.0 = 300; for M2 = 500 / 2.0 = 250 (each peak uses the machine's lowest-CPI class)
b. Average CPI for M1 = 5.0 / 4 + 2.0 / 4 + 2.4 / 2 = 2.95;
for M2 = 4.0 / 4 + 3.8 / 4 + 2.0 / 2 = 2.95 [50% = 1/2, 25% = 1/4]
The average CPI is the same (2.95) on both machines, so M1, with its higher clock rate, is faster.
⇒ M1 is faster by a factor of 1.2 (the ratio of clock rates, 600/500)
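Parts a and b can be reproduced with a few lines of Python. The dictionaries simply restate the table and the 25/25/50 mix from the problem statement; the function names are illustrative.

```python
CLOCK_MHZ = {"M1": 600, "M2": 500}
CPI = {
    "M1": {"F": 5.0, "I": 2.0, "N": 2.4},
    "M2": {"F": 4.0, "I": 3.8, "N": 2.0},
}
MIX = {"F": 0.25, "I": 0.25, "N": 0.50}  # part b: half class N, rest split evenly


def peak_mips(machine: str) -> float:
    """Peak rating: every instruction is assumed to be of the lowest-CPI class."""
    return CLOCK_MHZ[machine] / min(CPI[machine].values())


def average_cpi(machine: str, mix: dict) -> float:
    """Mix-weighted average CPI."""
    return sum(mix[cls] * CPI[machine][cls] for cls in mix)


def mips(machine: str, mix: dict) -> float:
    """MIPS rating for a given instruction mix (clock in MHz / average CPI)."""
    return CLOCK_MHZ[machine] / average_cpi(machine, mix)


for m in ("M1", "M2"):
    print(f"{m}: peak {peak_mips(m):.0f} MIPS, "
          f"average CPI {average_cpi(m, MIX):.2f}, {mips(m, MIX):.0f} MIPS")
print(f"M1 is faster by a factor of {mips('M1', MIX) / mips('M2', MIX):.2f}")
```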
c. The designers of M1 plan to redesign the machine for better performance. With the assumptions of part b, which of the following has the greatest performance impact, and why?
1. Using a faster floating-point unit with double the speed (class F, CPI = 2.5)
2. Adding a second integer ALU to reduce the integer CPI to 1.20
3. Using faster logic that allows a clock rate of 750 MHz with the same CPIs
Solution
1. Average CPI = 2.5/4 + 2.0/4 + 2.4/2 = 2.325; MIPS = 600/2.325 = 258
2. Average CPI = 5.0/4 + 1.2/4 + 2.4/2 = 2.75; MIPS = 600/2.75 = 218
3. MIPS = 750/2.95 = 254
Compared with the original 600/2.95 = 203 MIPS, option 1 (MIPS = 258) has the greatest impact.
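The comparison of the three options can be sketched in Python as follows. The CPI values and the mix restate the problem; the option labels and function name are illustrative.

```python
BASE_CPI = {"F": 5.0, "I": 2.0, "N": 2.4}  # M1's CPIs from the table
MIX = {"F": 0.25, "I": 0.25, "N": 0.50}    # mix from part b


def mips(clock_mhz: float, cpi_by_class: dict) -> float:
    """MIPS rating = clock rate in MHz / mix-weighted average CPI."""
    average_cpi = sum(MIX[cls] * cpi_by_class[cls] for cls in MIX)
    return clock_mhz / average_cpi


options = {
    "baseline M1": mips(600, BASE_CPI),
    "1: faster flp unit": mips(600, {**BASE_CPI, "F": 2.5}),
    "2: second integer ALU": mips(600, {**BASE_CPI, "I": 1.2}),
    "3: 750 MHz clock": mips(750, BASE_CPI),
}

for name, rating in options.items():
    print(f"{name:22s} {rating:6.1f} MIPS")
best = max(options, key=options.get)
print("greatest impact:", best)
```

Option 1 wins because class F has the largest CPI and a quarter of the mix, so halving it removes the most cycles per instruction.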

d. The given CPIs include the effect of instruction cache misses at an average rate of 5%. Each cache miss imposes a 10-cycle penalty (i.e., adds 10 to the effective CPI of the instruction causing the miss, or 0.5 cycle per instruction on average). A fourth redesign option is to use a larger instruction cache that would reduce the miss rate from 5% to 3%. How does this compare to the three options in part c?
Solution: With a larger cache, all CPIs are reduced by 0.2 cycle (0.02, the reduction in miss rate, × the 10-cycle miss penalty).
Average CPI: 4.8/4 + 1.8/4 + 2.2/2 = 2.75; MIPS = 600/2.75 = 218
This option is comparable to option 2 of part c.
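The cache adjustment can be written as a small helper. This is a sketch under the assumption stated in the problem, namely that every instruction class sees the same miss rate and the same penalty; the function names are illustrative.

```python
BASE_CPI = {"F": 5.0, "I": 2.0, "N": 2.4}  # M1's CPIs, including 5% miss effect
MIX = {"F": 0.25, "I": 0.25, "N": 0.50}


def cpi_with_new_miss_rate(cpi_by_class, old_miss_rate, new_miss_rate,
                           miss_penalty_cycles):
    """Remove the cycles saved per instruction when the miss rate drops."""
    saved = (old_miss_rate - new_miss_rate) * miss_penalty_cycles
    return {cls: cpi - saved for cls, cpi in cpi_by_class.items()}


def average_cpi(cpi_by_class):
    """Mix-weighted average CPI."""
    return sum(MIX[cls] * cpi_by_class[cls] for cls in MIX)


new_cpi = cpi_with_new_miss_rate(BASE_CPI, 0.05, 0.03, 10)
print({cls: round(cpi, 2) for cls, cpi in new_cpi.items()})
print(f"average CPI {average_cpi(new_cpi):.2f}, "
      f"{600 / average_cpi(new_cpi):.0f} MIPS")
```

Because the same 0.2 cycle is subtracted from every class, the average CPI also drops by exactly 0.2 regardless of the mix.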
e. Characterize application programs that would run faster on M1 than on M2 (in terms of the instruction mix of the application).
Hint: Let x, y and 1 - x - y be the fractions of instructions belonging to classes F, I and N, respectively.
Solution:
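Average CPI for M1 = 5.0x + 2.0y + 2.4(1 - x - y) = 2.6x - 0.4y + 2.4
Average CPI for M2 = 4.0x + 3.8y + 2.0(1 - x - y) = 2x + 1.8y + 2
So we are looking for the condition under which M1's MIPS rating exceeds M2's:
600/(2.6x - 0.4y + 2.4) > 500/(2x + 1.8y + 2)
This simplifies to x < 12.8y, i.e., x/y < 12.8: M1 is faster unless class-F instructions outnumber class-I instructions by a factor of 12.8 or more.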
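For readers who want the intermediate algebra, the inequality can be worked through as follows. Both denominators are average CPIs and therefore positive, so cross-multiplying (after dividing both sides by 100) preserves the direction of the inequality.

```latex
\begin{align*}
\frac{600}{2.6x - 0.4y + 2.4} &> \frac{500}{2x + 1.8y + 2} \\
6\,(2x + 1.8y + 2) &> 5\,(2.6x - 0.4y + 2.4) \\
12x + 10.8y + 12 &> 13x - 2y + 12 \\
12.8\,y &> x
\end{align*}
```

At x = y = 0 (only class-N instructions) the two sides are equal: both machines then deliver 250 MIPS.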
Example
Consider two implementations M1 (600 MHz) and M2 (500 MHz)
of an instruction set containing three classes of instructions:
Class   CPI for M1   CPI for M2   Comments
F       5.8          5.4          Floating-point
I       2.8          3.8          Integer arithmetic
N       2.4          2.8          Non-arithmetic
a. What are the peak performances of M1 and M2 in MIPS?
b. If 50% of instructions executed are class-N, with the rest divided
equally among F and I, which machine is faster? By what factor?
Effect of Instruction Mix on Performance
Consider two applications DC (Data Compression) and RS (Reactor
Simulation) and two machines M1 and M2:
Class         Data Comp.   Reactor Sim.   M1's CPI   M2's CPI
A: Ld/Str     25%          32%            4.0        3.8
B: Integer    32%          17%            1.5        2.5
C: Sh/Logic   16%          2%             1.2        1.2
D: Float      0%           34%            6.0        2.6
E: Branch     19%          9%             2.5        2.2
F: Other      8%           6%             2.0        2.3
a. Find the effective CPI for the two applications on both machines.
Solution
a. CPI of DC on M1: 0.25 × 4.0 + 0.32 × 1.5 + 0.16 × 1.2 + 0 × 6.0 + 0.19 × 2.5 + 0.08 × 2.0 = 2.31
DC on M2: 2.54; RS on M1: 3.94; RS on M2: 2.89
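All four effective CPIs follow from the same weighted sum, which a short sketch can compute directly from the table. The dictionary layout and function name are illustrative choices.

```python
# Instruction mix per application (fractions of executed instructions)
MIX = {
    "DC": {"A": 0.25, "B": 0.32, "C": 0.16, "D": 0.00, "E": 0.19, "F": 0.08},
    "RS": {"A": 0.32, "B": 0.17, "C": 0.02, "D": 0.34, "E": 0.09, "F": 0.06},
}
# Cycles per instruction for each class on each machine
CPI = {
    "M1": {"A": 4.0, "B": 1.5, "C": 1.2, "D": 6.0, "E": 2.5, "F": 2.0},
    "M2": {"A": 3.8, "B": 2.5, "C": 1.2, "D": 2.6, "E": 2.2, "F": 2.3},
}


def effective_cpi(app: str, machine: str) -> float:
    """Mix-weighted CPI of one application on one machine."""
    return sum(share * CPI[machine][cls] for cls, share in MIX[app].items())


for app in MIX:
    for machine in CPI:
        print(f"{app} on {machine}: {effective_cpi(app, machine):.3f}")
```

In CPI terms alone, data compression does better on M1 and the reactor simulation does better on M2, mainly because of M1's high floating-point CPI. Clock rates are not given in this example, so this says nothing yet about actual run times.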
Thank You
