Reliability and State Machines in an Advanced Network Testbed

Mac Newbold
School of Computing
University of Utah
MS Thesis Defense
April 5, 2004
Advisor: Prof. Jay Lepreau
Distributed Systems
• Distributed Systems are complex
– Many components
– Distributed across multiple systems
• Component failures are relatively common
– But should not cause system breakdown
• “A distributed system is one in which the
failure of a computer you didn’t even know
existed can render your own computer
unusable.” – Leslie Lamport, quoted in
CACM, June 1992
Our Context: Emulab
• Emulab is an advanced network testbed
• Complex time- and space-shared system
• System dynamically reconfigures nodes and
network links to create “experiments”
• Key architectural feature: Central Database
– System uses DB for storage, communication
• Complex system with many different scripts
and programs on clients and servers
Emulab Background
• First prototype in April 2000 (10 nodes)
• In production since Oct. 2000 (40 nodes)
• Early versions weren’t perfect
– Reliability problems
– Experiments of limited size
– Inefficient use of resources
• Problem is becoming harder
– 200 nodes, 400 remote, 2000 virtual nodes

Four Key Challenges
Emulab requirements:
• Reliability
• Scalability
• Performance and Efficiency
• Generality and Portability

1. Reliability
• Complex systems are hard to make reliable
• Many sources of unreliability:
– Hardware – commodity PCs as nodes
– Software – misconfiguration, bugs, etc.
– Humans – can interrupt at any time
• More complexity and more parts mean
higher chance that something is broken at
any given time

2. Scalability
• Almost everything a testbed provides is
harder to provide at a larger scale
• Larger scale requires more resources
• If throughput doesn’t increase, things slow
down at larger scale
• Increased load adversely affects reliability
• Practical scalability limited by reliability,
performance and efficiency
3. Performance and Efficiency
• One direct requirement on performance:
– Emulab is used in an interactive style
• System tasks must complete in “a few minutes”
• Indirect requirements:
– Scalability requirement places high demands
on many system components
– Maximize efficient resource utilization
• As many users/experiments as possible in the
shortest time with the fewest resources

4. Generality and Portability
• Workload Generality
– Wide variety of research, teaching,
development, and testing activities
– Good support for minimally and non-
instrumented client OS’s and devices
• Model generality
– Evolving system
– New types of network devices
• Portable and non-intrusive client software
Summary of Challenges
• Three challenges are closely related
– Fourth is a constraint more than a challenge
• Reliability is key
• Failure rates directly impact scalability,
performance, and efficiency
• Generality and portability requirements
constrain any solution used to address the other three

Thesis Statement

Enhancing Emulab with a flexible framework
based on state machines provides better
monitoring and control, and improves the
reliability, scalability, performance,
efficiency, and generality of the system.

• Introduction, Background, and Challenges
• Thesis Statement
• State Machines in Emulab
– Interactions, Control Models, stated
– Node Boot State Machines
• Results
• Related and Future Work
• Conclusion
State Machines
• Also called Finite State Machines (FSMs),
Finite State Automata (FSAs), or
Statecharts in UML
• Well-known model
• Simple
• Explicit model
• Rich and flexible
• Easy to understand and visualize
Example State Machine
• Three main parts:
• States
• Transitions
– Directional
• Events
– Associated with
transitions – labels
• Stored in database
– diagrams generated
automatically from DB
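The three parts listed above can be sketched as a small transition table. This is only an illustration, not Emulab's actual schema: the state names follow the node boot machine described in this talk, while the event names and the `step` and `run` helpers are invented for the example.

```python
# Hypothetical transition table: (current_state, event) -> next_state.
# State names come from the node boot machine in these slides;
# the event labels are assumptions made for illustration.
TRANSITIONS = {
    ("SHUTDOWN", "dhcp"): "BOOTING",
    ("BOOTING", "config_start"): "TBSETUP",
    ("TBSETUP", "config_done"): "ISUP",
    ("ISUP", "reboot"): "SHUTDOWN",
}


def step(state, event):
    """Follow one directional transition, or raise ValueError if none exists."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"invalid transition from {state} on {event!r}") from None


def run(state, events):
    """Apply a sequence of events and return the final state."""
    for event in events:
        state = step(state, event)
    return state
```

Because the machine is plain data, the same table could be stored in a database and used to generate a diagram.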

State Machines in Emulab
• Each state machine has a “type”
– Currently three: node boot, node allocation,
and experiment status
• Multiple machines allowed within a type
– An entity is in only one state, in one machine of a type, at a time
• States can have “timeouts” with actions
– Timer starts when state is entered
• State “triggers” – Entry Actions

Direct Interaction
• Within a type, can take a transition from a
state in one machine (or “mode”) to a state
in another machine of that type
– Known as “mode transition” in Emulab
– Similar to hierarchical state machines
• Highlights similarities/symmetries
– Most machines are variations of another
• Improved code reuse
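A minimal sketch of a mode transition, assuming two machines of the node boot type. The mode names NORMAL and RELOAD, the states RELOADING and RELOADDONE, and the `Node` and `move` helpers are assumptions for illustration, not taken from Emulab's code.

```python
from dataclasses import dataclass

# Ordinary transitions inside each machine ("mode") of one type.
# Mode and state names here are illustrative assumptions.
IN_MODE = {
    "NORMAL": {("SHUTDOWN", "BOOTING"), ("BOOTING", "TBSETUP"),
               ("TBSETUP", "ISUP"), ("ISUP", "SHUTDOWN")},
    "RELOAD": {("SHUTDOWN", "RELOADING"), ("RELOADING", "RELOADDONE")},
}

# Mode transitions: from a (mode, state) pair to a (mode, state) pair
# in another machine of the same type.
MODE_TRANSITIONS = {
    (("NORMAL", "SHUTDOWN"), ("RELOAD", "SHUTDOWN")),
    (("RELOAD", "RELOADDONE"), ("NORMAL", "SHUTDOWN")),
}


@dataclass
class Node:
    mode: str   # which machine of the type the node is currently in
    state: str  # its single current state within that machine


def move(node, new_state, new_mode=None):
    """Take an in-mode transition, or a mode transition if new_mode differs."""
    new_mode = new_mode or node.mode
    if new_mode == node.mode:
        ok = (node.state, new_state) in IN_MODE[node.mode]
    else:
        ok = ((node.mode, node.state), (new_mode, new_state)) in MODE_TRANSITIONS
    if not ok:
        raise ValueError(
            f"invalid: {node.mode}/{node.state} -> {new_mode}/{new_state}")
    node.mode, node.state = new_mode, new_state
```

The node holds exactly one (mode, state) pair, so it is never in two machines of the same type at once.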

Direct Interaction Example

Models of State Machine Control
• Centralized monitoring & control: stated
– State changes submitted, checked for
correctness, applicable actions performed
– Daemon tracks timeouts
– Used for Node Boot state machines
• Distributed management of state machine
– No central service enforcing correctness
– No dependency on central service
– Timeouts harder to implement
The stated State Daemon
• Listens for events continuously
• State transitions cause database updates
• Invalid transitions cause notifications
• Timeouts, timeout actions, triggers
configurable in DB for each state
• Caching – only writer of node boot states
• Modular design – dispatch events to proper
action handlers
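The responsibilities above can be sketched as a small class. This is a hedged illustration rather than the real daemon: the class and method names are hypothetical, an in-memory dict stands in for the database, and recording the new state even after an invalid transition is an assumption of this sketch. The timeout values are the ones quoted elsewhere in this talk (SHUTDOWN 2 min, BOOTING 3 min, TBSETUP 10 min), expressed in seconds.

```python
VALID = {
    "SHUTDOWN": {"BOOTING"},
    "BOOTING": {"TBSETUP"},
    "TBSETUP": {"ISUP"},
    "ISUP": {"SHUTDOWN"},
}
TIMEOUTS = {"SHUTDOWN": 120, "BOOTING": 180, "TBSETUP": 600}  # seconds


class StateDaemon:
    """Toy stand-in for a central state daemon (names are hypothetical)."""

    def __init__(self):
        self.db = {}             # node -> (state, time entered); stands in for the DB
        self.triggers = {}       # state -> list of entry actions
        self.notifications = []  # invalid transitions that were reported

    def add_node(self, node, state, now):
        self.db[node] = (state, now)

    def handle_event(self, node, new_state, now):
        old_state, _ = self.db[node]
        if new_state not in VALID.get(old_state, set()):
            # Invalid transitions cause a notification.
            self.notifications.append((node, old_state, new_state))
        # Assumption of this sketch: the new state is recorded either way.
        self.db[node] = (new_state, now)   # the timer restarts on state entry
        for action in self.triggers.get(new_state, []):
            action(node)                   # dispatch to entry-action handlers

    def expired(self, now):
        """Nodes that have stayed in a state longer than its timeout."""
        return [node for node, (state, entered) in self.db.items()
                if state in TIMEOUTS and now - entered > TIMEOUTS[state]]
```

A real daemon would run `handle_event` from its event loop and call something like `expired` periodically to fire the timeout actions.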
Node Boot State Machines
• Nodes in Emulab self-configure
• Monitored via state machines – stated
• “Normal” node boot machine
• Variations – “Minimal”
• Reloading node disks

Node Self-Configuration
• Nodes send state events during booting to
allow progress to be monitored
• “Global knowledge” inside state daemon
– Better decisions about recovery steps
• Finer granularity gives more information for
recovery, allows for shorter timeouts
• Each OS image is associated with a “mode”
(state machine) that describes its behavior

“Normal” Node Boot Machine
• Start in SHUTDOWN
• DHCP, start OS booting
• When Emulab-specific
configuration begins,
enter TBSETUP
• ISUP when finished
• In case of failure, can
retry from SHUTDOWN
Variations of Node Boot
• Example: MINIMAL
• For OS images with
little or no special
Emulab support
• ISUP generated by
stated if necessary
– Immediate or ping
• SilentReboot allowed
in this mode
Reloading Node Disks
• Mode transition
into RELOAD mode,
then transition into
the mode for the
OS image

Reloading and Mode Transitions

Experiment Status State Machine
• Uses distributed model
– Stored in database, but not
strictly enforced
• Documents life-cycle
• Restricts user interruption
– Reduces a source of errors
• Can queue, activate,
modify, restart, swap, or
terminate an expt.
Node Allocation State Machine
• Distributed control model
• Diagram documents the
way the states are used by
the program, but not
currently enforced
• Either reloads nodes with a
custom image, or reboots
them as members of the
Results: Context
• Emulab in production 1 year before state
machines were added
• In production 3 years since first stated
– 650 users, 150 projects, 75 institutions
• 19 papers, top venues
– Over 155,000 nodes allocated in nearly 10,000
experiment instances
– 13 classes at 10 universities
• Emulab SW on 6 more testbeds, 4 planned
• Anecdotal: (others in thesis)
– Reliability/Performance: Preventing race conditions
– Generality: Graceful handling of custom OS
– Generality: New node types
• Experiment:
– Reliability/Scalability: Improved timeout mechanisms
Preventing Race Conditions
• Expt. ends, nodes move to holding expt.,
get reloaded, then freed while they boot
• Problem: Getting allocated while booting
• Node appears unresponsive, gets forcefully
power cycled, corrupts FS on disk
• Solution: don’t free immediately
• Add a trigger on the node’s next ISUP, which
frees the node once it has finished booting
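The fix can be sketched as a one-shot trigger. The `NodePool` class and its method names are hypothetical; the only idea taken from the slide is that the node is freed on its next ISUP rather than immediately.

```python
class NodePool:
    """Toy free-node pool; class and method names are assumptions."""

    def __init__(self):
        self.free = set()
        self.on_next_isup = {}  # node -> one-shot callbacks for its next ISUP

    def _free_node(self, node):
        self.free.add(node)

    def release_after_reload(self, node):
        # Do not free the node now: it may still be booting and would
        # look unresponsive to whoever allocated it next.
        self.on_next_isup.setdefault(node, []).append(self._free_node)

    def on_state(self, node, state):
        if state == "ISUP":
            # pop() removes the callbacks, so each runs for one ISUP only.
            for callback in self.on_next_isup.pop(node, []):
                callback(node)
```

With this in place, a node cannot be allocated while it is still booting, so it is never power cycled mid-boot for appearing unresponsive.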
Generality: Graceful Handling
of Custom OS Images
• Users create custom OS images
• Emulab client software is optional
• Problem: Nodes don’t send state events
• Solution: “Minimal” state machine
• SHUTDOWN: maybe on server, optional
• BOOTING: server side, trigger checks ISUP
• ISUP: either node sends, or generated
when pingable, or generated immediately
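The three ways of reaching ISUP can be sketched as a decision made by the BOOTING trigger. The feature names `"isup"` and `"ping"` and the function itself are assumptions for illustration; the talk does not say how this is encoded.

```python
def booting_trigger(os_features, ping):
    """Return the state to record after the BOOTING trigger has run.

    os_features: set of capabilities of the OS image (hypothetical names).
    ping: zero-argument callable returning True if the node answers a ping.
    """
    if "isup" in os_features:
        # The node reports ISUP itself, so keep waiting for its event.
        return "BOOTING"
    if "ping" in os_features:
        # Generate ISUP on the node's behalf once it is pingable.
        return "ISUP" if ping() else "BOOTING"
    # No support at all: generate ISUP immediately.
    return "ISUP"
```

A daemon would re-run the ping check until it succeeds or the BOOTING timeout fires; that loop is omitted here.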
Generality: New Node Types
• Emulab is always growing and changing
• State machine model and our framework
are flexible enough to allow graceful evolution
• We’ve added 5 new node types
– IXPs, wide-area, PlanetLab, vnodes, sim-nodes
• Mostly used existing machines
• 2 new machines, slight variations
• 1 change to stated to add a new trigger
Improved Timeout Mechanisms
• Before: reboot node, wait for it to boot
– Static, 7 minute timeout
• Pragmatic – minimizes false positives/negatives
• Avg. 4 min., but max. error-free boot is 15 min.
• 11 minute delay is too long
• Improved: state machine monitoring
– Fine-grained, context-sensitive timeouts
– Faster error detection
– Better monitoring and control
Improved Timeout Mechanisms
• Experiment: Measure expt. swap-in time,
with and without the improvements
– Synthetic but plausible scenario
• One node, loads an RPM (8 min. install)
• Node reboots, timeout during RPM install
– Reboots again, timeout again, mark node dead
– Try twice per swap-in, 3 swap-in attempts
– Total failure in 45 min., 3 nodes “dead”

Improved Timeout Mechanisms
• With state machines:
– Timeouts: SHUTDOWN 2 min, BOOTING 3 min,
TBSETUP 10 min
• Node reboots, enters BOOTING
• 1 minute: Enters TBSETUP
• 9 minutes: Enters ISUP, expt. ready
• Succeeds, with no dead nodes or retries
• Cut time from 45 min. to 9 min. (80%)
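The arithmetic behind this scenario can be checked with a short sketch. The per-phase durations (1 minute in BOOTING, 8 minutes of RPM install in TBSETUP) and the timeout values are from the slides; the function names are hypothetical. The static model counts only time spent waiting on timeouts, so it gives 42 minutes rather than the 45 reported, which presumably includes overhead not modelled here.

```python
STATIC_TIMEOUT = 7                                    # minutes
PHASES = [("BOOTING", 1), ("TBSETUP", 8)]             # minutes spent in each state
STATE_TIMEOUTS = {"SHUTDOWN": 2, "BOOTING": 3, "TBSETUP": 10}


def static_wait(tries_per_swapin=2, swapin_attempts=3):
    """Return (minutes waited, success) with one fixed timeout per reboot."""
    boot_time = sum(minutes for _, minutes in PHASES)
    if boot_time <= STATIC_TIMEOUT:
        return boot_time, True
    # Every try times out, so every retry burns a full timeout.
    return STATIC_TIMEOUT * tries_per_swapin * swapin_attempts, False


def per_state_wait():
    """Return (minutes waited, success) with a separate timeout per state."""
    elapsed = 0
    for state, minutes in PHASES:
        if minutes > STATE_TIMEOUTS[state]:
            return elapsed + STATE_TIMEOUTS[state], False
        elapsed += minutes
    return elapsed, True


def reduction(before, after):
    """Percentage reduction from before to after."""
    return 100 * (before - after) / before
```

Here `static_wait()` gives `(42, False)`, `per_state_wait()` gives `(9, True)`, and `reduction(45, 9)` gives `80.0`, matching the 80% figure on the slide.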
Limitations and Issues
• stated is critical infrastructure
– Another single point of failure
– More system complexity, new bugs,
complicated debugging
– Potential for scaling problems (none seen yet)
• Simple heuristics for error detection
– Send mail for invalid transitions

Summary of Results
• Explicit model requires careful thought
– Improves design and implementation
• Visualization makes it easier to understand
• Faster and more accurate error detection
• Better reliability helps scalability/efficiency
– Bigger expts. possible, less overhead per expt.
• Flexibility for evolution, workload generality

Related Work
• “Standard” Finite State Automata – basics
– Timed Automata – have global clock
• Message Sequence Charts (MSCs)
– “Scenarios” – hierarchy, like modes/machines
• UML Statecharts
– States have entry actions – “triggers”
– Hierarchical states – similar to modes
– Can model Emulab’s timeouts

Future Work
• Further developing distributed control
– Add monitoring, timeouts, triggers
• Better heuristics for error detection
– Only flag clustered or related errors
• Implement more ideas from other systems
– UML’s exit actions, guarded transitions, etc.
• Move code into database – i.e. triggers
– Easier to modify, framework code vs. machine

Demonstrating Improvement
• Currently: programs have their own retry
and timeout mechanisms for node reboots
– No knowledge of progress, just completion
– Can cause failures by forcing a reboot, which
can damage file systems on node disk
• “New Way”: stated handles timeout and
retry during rebooting
– Implemented, not installed
– Knows if progress is being made
– Programs simply wait for ISUP or failure event
Demonstrating Improvement (cont’d)
• These failures directly hurt reliability
– Node failure can cause experiment setup to fail
• Significant impact on scaling, performance,
and efficiency
• Maximum experiment size is limited by
node failure rate
– Failures make things take longer
– A slower system means less efficient use of
resources
Demonstrating Improvement (cont’d)
• Compare current vs. new: failure rate, time
to completion, etc.
• Test data; one of:
– Historical experiments
– Artificially high load
– Fault injection, e.g. reboots
• Why new way should help:
– Better knowledge for intelligent recovery
• Know when to wait longer and when to retry
– Shorter timeouts allow for early error detection
Future Work:
Modeling Indirect Interaction
• Occur between machines of different types
– Due to external relationships between the
entities tracked by each type of machine
• Examples:
– Same entity may be tracked in two different
types of machine
• Nodes are in Boot and Allocation machines
– Other relationship between entities
• Nodes may be “owned” by or allocated to an
experiment – links Expt. Status and Node machines
