Performance Forecasting: Finding Bottlenecks Before They Happen


Ali Saidi†◊, Nathan Binkert‡, Steve Reinhardt§†, Trevor Mudge†

† – University of Michigan
‡ – HP Labs
§ – AMD Research
◊ – ARM R&D

June 23, 2009


Performance Analysis Challenge

  Today's workloads & systems are complex
    Many layers of HW (disk, network) and SW (app, OS)
  How do we evaluate systems in the design stage?
  Where are the bottlenecks?
  Conventional tools are inadequate

[Figure: end-to-end path between two machines (Application → Protocol → Driver → NIC → Network → NIC → Driver → Protocol → Application), spanning kernel and hardware layers on Machine 1 and Machine 2]
Solution: Global Critical Path

  Directly identifies true bottlenecks
  Accounts for overlapped latencies
  Used successfully in the past in isolated domains
    Fields et al. – out-of-order CPU
    Barford and Crovella – TCP
    Yang and Miller – MPI

[Figure: same end-to-end path across Machine 1 and Machine 2 as on the previous slide]
Building a Global Critical Path

  Requires a global event dependence graph

Challenge:
typically requires detailed knowledge across many domains!

Solution:
automatically extract the dependence graph from interacting state machines
End Result

  Our simulation technique directly identifies (in MINUTES):
    The current bottleneck
    How much improvement is possible until the next bottleneck
    What the next bottleneck will be

  Conventional simulation approach (HOURS/DAYS):
    Hypothesize a bottleneck
    Prototype a solution
    Simulate the solution
    Test the hypothesis; repeat if incorrect

Constructing a Dependence Graph

[Figure: App → Protocol → Driver → NIC → Network → NIC → Driver → Protocol → App, spanning Machine 1 and Machine 2]

  Systematically map state machines into a global dependence graph
    Most HW is already specified as state machines
    Extract implicit state machines from SW
Explicit State Machine Conversion

  State machine edges become dependence graph nodes
  State machine nodes become dependence graph edges

  dependence edge weight = time spent in state
Explicit State Machine Conversion

[Figure: four-state machine with transitions A→B at t1, B→D at t2, D→B at t3, B→C at t4]

  Resulting dependence graph (nodes are transitions; edge weights are time spent in each state):

  Start → A→B → B→D → D→B → B→C
      t1-t0    t2-t1    t3-t2    t4-t3
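The conversion above can be sketched in a few lines of Python (illustrative names, not the paper's tool): each observed transition becomes a graph node, and the edge from the previous transition carries the time spent in the intervening state.

```python
# Illustrative sketch (hypothetical helper, not the paper's tool):
# turn a timestamped trace of state-machine transitions into a
# dependence-graph chain.  Nodes are transitions; each edge weight is
# the time spent in the state between two transitions.

def trace_to_chain(trace, t0=0):
    """trace: list of (transition_label, timestamp), in time order."""
    edges = []
    prev_label, prev_time = "Start", t0
    for label, t in trace:
        # weight = time spent in the state entered by the previous transition
        edges.append((prev_label, label, t - prev_time))
        prev_label, prev_time = label, t
    return edges

# The slide's example: A→B at t1, B→D at t2, D→B at t3, B→C at t4
trace = [("A->B", 3), ("B->D", 7), ("D->B", 9), ("B->C", 14)]
print(trace_to_chain(trace))
# [('Start', 'A->B', 3), ('A->B', 'B->D', 4), ('B->D', 'D->B', 2), ('D->B', 'B->C', 5)]
```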
What about software?

[Figure: three C-style functions, where F() calls H() at t1, H() calls L() at t2, control returns to F() at t3, and F() calls S() at t4]

  Function calls and returns form an implicit state machine:

  Start → F→H → H→L → L→F → F→S
      t1-t0    t2-t1    t3-t2    t4-t3
State Machine Interactions

  Link up pieces of the dependence graph through these interactions
  Queues are the interaction points
    Without them, back pressure can't be modeled
  Abstract entities
    Annotated in models and code
    Developed iteratively
    Analysis can pinpoint problems
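A minimal sketch of how a queue annotation could record these interaction edges (the class and names are hypothetical, not the paper's instrumented models): a data edge links each dequeue to the enqueue that produced the item, and when the queue is full a back-pressure edge links the producer's enqueue to the consumer's most recent dequeue.

```python
# Hypothetical sketch of a queue annotated as an interaction point
# between two state machines.  It records the dependence edges that
# link the producer's and consumer's dependence-graph pieces.
from collections import deque

class AnnotatedQueue:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()
        self.last_dequeue = None   # node id of the most recent dequeue event
        self.edges = []            # recorded dependence edges (src, dst)

    def enqueue(self, node, item):
        if len(self.items) >= self.capacity and self.last_dequeue:
            # Back pressure: a full queue makes the producer's enqueue
            # depend on the consumer's last dequeue.  (A real model
            # would block the producer; this sketch only records it.)
            self.edges.append((self.last_dequeue, node))
        self.items.append((node, item))

    def dequeue(self, node):
        src, item = self.items.popleft()
        # Data dependence: a dequeue depends on the enqueue that
        # produced the item it removes.
        self.edges.append((src, node))
        self.last_dequeue = node
        return item
```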
Interaction Example

[Figure: simplified IP stack state machine (Get Packet → Append hdrs → Find Dest → Enqueue in TXQ) interacting with a simplified TX NIC driver state machine (Wait on TXQ → Dequeue from TXQ → Process Packet → Enqueue in TX ring)]
Interaction Example

[Figure: the two state machines above unrolled into a weighted dependence graph laid out against time (current time: 13); the TXQ links the IP stack's Enqueue transition to the driver's Wait TXQ and Dequeue transitions]
Finding Global Critical Path

  Use standard graph analysis techniques
    Locate the longest path through the graph

[Figure: the dependence graph from the previous slide with the longest path highlighted]
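The "standard graph analysis" step amounts to a longest-path search in a weighted DAG, which can be sketched as follows (illustrative code and graph, not the paper's implementation):

```python
# Illustrative sketch: the critical path of a weighted DAG, found by
# longest-path relaxation in topological order (Kahn-style traversal),
# the standard technique the slide refers to.
from collections import defaultdict

def critical_path(edges):
    """edges: list of (src, dst, weight); returns (length, node path)."""
    succ, indeg, nodes = defaultdict(list), defaultdict(int), set()
    for u, v, w in edges:
        succ[u].append((v, w))
        indeg[v] += 1
        nodes.update((u, v))
    order = [n for n in nodes if indeg[n] == 0]   # source nodes
    dist = {n: 0 for n in order}
    pred, i = {}, 0
    while i < len(order):
        u = order[i]; i += 1
        for v, w in succ[u]:
            if dist[u] + w > dist.get(v, float("-inf")):
                dist[v] = dist[u] + w             # relax: keep the longest
                pred[v] = u
            indeg[v] -= 1
            if indeg[v] == 0:
                order.append(v)                   # all predecessors done
    end = max(dist, key=dist.get)                 # latest-finishing node
    path = [end]
    while path[-1] in pred:
        path.append(pred[path[-1]])
    return dist[end], path[::-1]

# Toy graph: the direct S→B edge dominates the shorter parallel branch.
edges = [("S", "A", 4), ("A", "B", 5), ("B", "C", 2),
         ("S", "B", 13), ("C", "E", 1)]
print(critical_path(edges))  # (16, ['S', 'B', 'C', 'E'])
```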
Critical States & Predicting Speedup

  Aggregate states on the critical path
    The most frequent state is the bottleneck
  Dependence graph contains all transitions and interactions
    Not just the ones that compose the critical path or where waiting occurred
  Predicting speedup:
    Modify weights on the critical path
    Re-analyze the data to see how the critical path changes
    Next critical path length → potential speedup
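A toy sketch of the aggregation and re-weighting steps (the state names and weights below are made up for illustration; the real analysis re-runs the longest-path search, since a different path may become critical):

```python
# Hedged sketch of bottleneck identification and speedup prediction:
# aggregate per-state time along the critical path, shrink the dominant
# state's weights, and compare path lengths.
from collections import Counter

def bottleneck(path):
    """path: list of (state, weight) along the critical path."""
    totals = Counter()
    for state, w in path:
        totals[state] += w
    return totals.most_common(1)[0]   # (state, aggregate time)

def predicted_speedup(path, state, factor):
    """Scale `state`'s weights by `factor`; return old/new path length.
    A real tool would re-analyze the whole graph afterwards."""
    old = sum(w for _, w in path)
    new = sum(w * factor if s == state else w for s, w in path)
    return old / new

path = [("copy", 4), ("dma", 2), ("copy", 6), ("irq", 3)]
print(bottleneck(path))                       # ('copy', 10)
print(predicted_speedup(path, "copy", 0.5))   # 1.5
```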
Resource Dependence Loops

  The critical path can sometimes be improved without reducing the latency of any task
  In resource-constrained environments the critical path can be shortened by providing more resources
Resource Dependence Loops

  Analysis automatically finds candidates
  Addition of buffering changes the critical path

Iteration 1:
  A0-A1  A1-A2  A2-A3  A3-A0
  B0-B1  B1-B2  B2-B3  B3-B0
  C0-C1  C1-C2  C2-C0

Iteration 2:
  A0-A1  A1-A2  A2-A3  A3-A0
  B0-B1  B1-B2  B2-B3  B3-B0
  C0-C1  C1-C2  C2-C0

(transitions laid out against time)
Workloads

  Linux 2.6.18
  SinkGen – streaming benchmark from CERN
    Analyzed the transmit side
  Lighttpd – high-performance web server
    Uses non-blocking I/O to manage connections
    Used by large websites
  Metric is bandwidth achieved
TCP Transmit

  Start with default M5 system parameters
  1. Capture bottleneck data from that system
  2. Locate the current bottleneck
  3. Predict performance when the bottleneck is removed
  4. Repeat steps 2 and 3 for successive bottlenecks
  5. Verify that the locations and predictions are correct
TCP Streaming Benchmark

  How did we do?
    Run experiments making the suggested changes above

[Bar chart: Predicted vs. Actual vs. Initial bandwidth for Configs 2–4, y-axis 0–4000]
TCP Streaming Benchmark

  Predict performance again, this time starting with configuration 2
    Config 3: 1% error
    Config 4: 15% error

[Bar chart: Predicted vs. Actual vs. Initial bandwidth for Configs 3 and 4, y-axis 0–4000]
TCP Streaming Benchmark

  Predict performance one last time, starting with configuration 3
    3% error

[Bar chart: Predicted vs. Actual vs. Initial bandwidth for Config 4, y-axis 0–3500]
Experiments and Errors

  Additional experiments are in the paper
    Multi-core speedup of the web server
  Describe why errors occurred
    Compare the modified dependence graph to the observed graph from a new simulation
Conclusion

  Architects are increasingly looking at system-level issues for performance
  Apply critical path analysis to complete systems composed of concurrent components
    Spans multiple layers of HW & SW
    Automated extraction of dependence graphs
  Identify end-to-end bottlenecks in networked systems
    Critical tasks
    Resource dependence loops
    Performance of hypothetical systems
  Minutes, not hours
Questions?
