Performance Forecasting: Finding Bottlenecks Before They Happen


Ali Saidi†◊, Nathan Binkert‡, Steve Reinhardt§†, Trevor Mudge†

† – University of Michigan
‡ – HP Labs
§ – AMD Research
◊ – ARM R&D

June 23, 2009


Performance Analysis Challenge

  Today's workloads & systems are complex
    Many layers of HW (disk, network) and SW (app, OS)
  How do we evaluate systems in the design stage?
  Where are the bottlenecks?
  Conventional tools are inadequate

[Figure: end-to-end path between two machines (Application → Protocol → Driver → NIC → Network → NIC → Driver → Protocol → Application), spanning kernel and hardware layers on Machine 1 and Machine 2]
Solution: Global Critical Path

  Directly identifies true bottlenecks
  Accounts for overlapped latencies
  Used successfully in the past in isolated domains
    Fields et al. – out-of-order CPU
    Barford and Crovella – TCP
    Yang and Miller – MPI

[Figure: same end-to-end path across Machine 1 and Machine 2 as on the previous slide]
Building a Global Critical Path

  Requires a global event dependence graph

Challenge:
typically requires detailed knowledge across many domains!

Solution:
automatically extract the dependence graph from interacting state machines
End Result

  Our simulation technique directly identifies (in MINUTES):
    The current bottleneck
    How much improvement is possible until the next bottleneck
    What the next bottleneck will be

  Conventional simulation approach (HOURS/DAYS):
    Hypothesize a bottleneck
    Prototype a solution
    Simulate the solution
    Test the hypothesis; repeat if incorrect

Constructing a Dependence Graph

[Figure: App → Protocol → Driver → NIC → Network → NIC → Driver → Protocol → App, spanning Machine 1 and Machine 2]

  Systematically map state machines into a global dependence graph
    Most HW is already specified as state machines
    Extract implicit state machines from SW
Explicit State Machine Conversion

  State machine edges become dependence graph nodes
  State machine nodes become dependence graph edges

  dependence edge weight = time spent in state
Explicit State Machine Conversion

[Figure: four-state machine with transitions A→B at t1, B→D at t2, D→B at t3, B→C at t4]

  Resulting dependence graph (nodes are transitions; edge weights are time spent in each state):

  Start → A→B → B→D → D→B → B→C
      t1-t0    t2-t1    t3-t2    t4-t3
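The conversion above can be sketched in a few lines of Python (illustrative names, not the paper's tool): each observed transition becomes a graph node, and the edge from the previous transition carries the time spent in the intervening state.

```python
# Illustrative sketch (hypothetical helper, not the paper's tool):
# turn a timestamped trace of state-machine transitions into a
# dependence-graph chain.  Nodes are transitions; each edge weight is
# the time spent in the state between two transitions.

def trace_to_chain(trace, t0=0):
    """trace: list of (transition_label, timestamp), in time order."""
    edges = []
    prev_label, prev_time = "Start", t0
    for label, t in trace:
        # weight = time spent in the state entered by the previous transition
        edges.append((prev_label, label, t - prev_time))
        prev_label, prev_time = label, t
    return edges

# The slide's example: A→B at t1, B→D at t2, D→B at t3, B→C at t4
trace = [("A->B", 3), ("B->D", 7), ("D->B", 9), ("B->C", 14)]
print(trace_to_chain(trace))
# [('Start', 'A->B', 3), ('A->B', 'B->D', 4), ('B->D', 'D->B', 2), ('D->B', 'B->C', 5)]
```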
What about software?

[Figure: three C-style functions, where F() calls H() at t1, H() calls L() at t2, control returns to F() at t3, and F() calls S() at t4]

  Function calls and returns form an implicit state machine:

  Start → F→H → H→L → L→F → F→S
      t1-t0    t2-t1    t3-t2    t4-t3
State Machine Interactions

  Link up pieces of the dependence graph through these interactions
  Queues are the interaction points
    Without them, back pressure can't be modeled
  Abstract entities
    Annotated in models and code
    Developed iteratively
    Analysis can pinpoint problems
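A minimal sketch of how a queue annotation could record these interaction edges (the class and names are hypothetical, not the paper's instrumented models): a data edge links each dequeue to the enqueue that produced the item, and when the queue is full a back-pressure edge links the producer's enqueue to the consumer's most recent dequeue.

```python
# Hypothetical sketch of a queue annotated as an interaction point
# between two state machines.  It records the dependence edges that
# link the producer's and consumer's dependence-graph pieces.
from collections import deque

class AnnotatedQueue:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()
        self.last_dequeue = None   # node id of the most recent dequeue event
        self.edges = []            # recorded dependence edges (src, dst)

    def enqueue(self, node, item):
        if len(self.items) >= self.capacity and self.last_dequeue:
            # Back pressure: a full queue makes the producer's enqueue
            # depend on the consumer's last dequeue.  (A real model
            # would block the producer; this sketch only records it.)
            self.edges.append((self.last_dequeue, node))
        self.items.append((node, item))

    def dequeue(self, node):
        src, item = self.items.popleft()
        # Data dependence: a dequeue depends on the enqueue that
        # produced the item it removes.
        self.edges.append((src, node))
        self.last_dequeue = node
        return item
```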
Interaction Example

[Figure: simplified IP stack state machine (Get Packet → Append hdrs → Find Dest → Enqueue in TXQ) interacting with a simplified TX NIC driver state machine (Wait on TXQ → Dequeue from TXQ → Process Packet → Enqueue in TX ring)]
Interaction Example

[Figure: the two state machines above unrolled into a weighted dependence graph laid out against time (current time: 13); the TXQ links the IP stack's Enqueue transition to the driver's Wait TXQ and Dequeue transitions]
Finding Global Critical Path

  Use standard graph analysis techniques
    Locate the longest path through the graph

[Figure: the dependence graph from the previous slide with the longest path highlighted]
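The "standard graph analysis" step amounts to a longest-path search in a weighted DAG, which can be sketched as follows (illustrative code and graph, not the paper's implementation):

```python
# Illustrative sketch: the critical path of a weighted DAG, found by
# longest-path relaxation in topological order (Kahn-style traversal),
# the standard technique the slide refers to.
from collections import defaultdict

def critical_path(edges):
    """edges: list of (src, dst, weight); returns (length, node path)."""
    succ, indeg, nodes = defaultdict(list), defaultdict(int), set()
    for u, v, w in edges:
        succ[u].append((v, w))
        indeg[v] += 1
        nodes.update((u, v))
    order = [n for n in nodes if indeg[n] == 0]   # source nodes
    dist = {n: 0 for n in order}
    pred, i = {}, 0
    while i < len(order):
        u = order[i]; i += 1
        for v, w in succ[u]:
            if dist[u] + w > dist.get(v, float("-inf")):
                dist[v] = dist[u] + w             # relax: keep the longest
                pred[v] = u
            indeg[v] -= 1
            if indeg[v] == 0:
                order.append(v)                   # all predecessors done
    end = max(dist, key=dist.get)                 # latest-finishing node
    path = [end]
    while path[-1] in pred:
        path.append(pred[path[-1]])
    return dist[end], path[::-1]

# Toy graph: the direct S→B edge dominates the shorter parallel branch.
edges = [("S", "A", 4), ("A", "B", 5), ("B", "C", 2),
         ("S", "B", 13), ("C", "E", 1)]
print(critical_path(edges))  # (16, ['S', 'B', 'C', 'E'])
```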
Critical States & Predicting Speedup

  Aggregate states on the critical path
    The most frequent state is the bottleneck
  Dependence graph contains all transitions and interactions
    Not just the ones that compose the critical path or where waiting occurred
  Predicting speedup:
    Modify weights on the critical path
    Re-analyze the data to see how the critical path changes
    Next critical path length → potential speedup
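A toy sketch of the aggregation and re-weighting steps (the state names and weights below are made up for illustration; the real analysis re-runs the longest-path search, since a different path may become critical):

```python
# Hedged sketch of bottleneck identification and speedup prediction:
# aggregate per-state time along the critical path, shrink the dominant
# state's weights, and compare path lengths.
from collections import Counter

def bottleneck(path):
    """path: list of (state, weight) along the critical path."""
    totals = Counter()
    for state, w in path:
        totals[state] += w
    return totals.most_common(1)[0]   # (state, aggregate time)

def predicted_speedup(path, state, factor):
    """Scale `state`'s weights by `factor`; return old/new path length.
    A real tool would re-analyze the whole graph afterwards."""
    old = sum(w for _, w in path)
    new = sum(w * factor if s == state else w for s, w in path)
    return old / new

path = [("copy", 4), ("dma", 2), ("copy", 6), ("irq", 3)]
print(bottleneck(path))                       # ('copy', 10)
print(predicted_speedup(path, "copy", 0.5))   # 1.5
```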
Resource Dependence Loops

  The critical path can sometimes be improved without reducing the latency of any task
  In resource-constrained environments the critical path can be shortened by providing more resources
Resource Dependence Loops

  Analysis automatically finds candidates
  Addition of buffering changes the critical path

Iteration 1:
  A0-A1  A1-A2  A2-A3  A3-A0
  B0-B1  B1-B2  B2-B3  B3-B0
  C0-C1  C1-C2  C2-C0

Iteration 2:
  A0-A1  A1-A2  A2-A3  A3-A0
  B0-B1  B1-B2  B2-B3  B3-B0
  C0-C1  C1-C2  C2-C0

(transitions laid out against time)
Workloads

  Linux 2.6.18
  SinkGen – streaming benchmark from CERN
    Analyzed the transmit side
  Lighttpd – high-performance web server
    Uses non-blocking I/O to manage connections
    Used by large websites
  Metric is bandwidth achieved
TCP Transmit

  Start with default M5 system parameters
  1. Capture bottleneck data from that system
  2. Locate the current bottleneck
  3. Predict performance when the bottleneck is removed
  4. Repeat steps 2 and 3 for successive bottlenecks
  5. Verify that the locations and predictions are correct
TCP Streaming Benchmark

  How did we do?
    Run experiments making the suggested changes above

[Bar chart: Predicted vs. Actual vs. Initial bandwidth for Configs 2–4, y-axis 0–4000]
TCP Streaming Benchmark

  Predict performance again, this time starting with configuration 2
    Config 3: 1% error
    Config 4: 15% error

[Bar chart: Predicted vs. Actual vs. Initial bandwidth for Configs 3 and 4, y-axis 0–4000]
TCP Streaming Benchmark

  Predict performance one last time, starting with configuration 3
    3% error

[Bar chart: Predicted vs. Actual vs. Initial bandwidth for Config 4, y-axis 0–3500]
Experiments and Errors

  Additional experiments are in the paper
    Multi-core speedup of the web server
  Describe why errors occurred
    Compare the modified dependence graph to the observed graph from a new simulation
Conclusion

  Architects are increasingly looking at system-level issues for performance
  Apply critical path analysis to complete systems composed of concurrent components
    Spans multiple layers of HW & SW
    Automated extraction of dependence graphs
  Identify end-to-end bottlenecks in networked systems
    Critical tasks
    Resource dependence loops
    Performance of hypothetical systems
  Minutes, not hours
Questions?
