
White Paper

Architectural Issues in Carrier Class Operating Systems

Jeff Doyle
JUNOS Product Management

Juniper Networks, Inc.


1194 North Mathilda Avenue
Sunnyvale, California 94089
USA
408.745.2000
1.888 JUNIPER
www.juniper.net

Part Number: 200209-001 Dec 2006



Table of Contents
Executive Summary
Introduction
Router Operating System Objectives
  Objectives for Any Router OS
    Open Standards Support
    Flexibility
    Manageability
    Basic Security
    Service and Support
    Basic Reliability
  What Makes a Router OS Carrier Class?
    Stability
    Advanced Security
    Scalability
    Precision
    High Availability
    Consistency
    Predictability
    Carrier Class Reliability
JUNOS Architecture
  Modularity
  Managing Modular Architectures
  Intelligent Modular Design
  Intelligent Modular Design: The JUNOS Routing Module
  Intelligent Modular Design: The Periodic Packet Management Daemon
  The JUNOS Kernel
Engineering Discipline
  JUNOS Release Schedule
  JUNOS Single Train Release Model
  New Product Introduction
Conclusions

 Copyright ©2006, Juniper Networks, Inc



Executive Summary
Juniper Networks’ original market was large service providers, carriers, and other high
performance networks requiring the utmost levels of dependability while providing a rich set of
features. We recognized that no router operating systems existed at that time to answer these
requirements. Only recently have other vendors begun offering router operating systems that are
being positioned as “carrier class.”
This paper examines the characteristics that define a carrier class router operating system, and
the architectural and engineering practices that are required to support these characteristics.
From its inception Juniper Networks has maintained that a modular software architecture is
fundamental to any carrier class operating system. Although at least one of our competitors
has long disagreed with that assertion, they have recently released a modular operating
system of their own, claiming that their architecture is superior to JUNOS because it is “more
modular.” We contend that this is a naïve understanding of the benefits of modularity, arising
from inexperience in building and managing such software. Modularity is an engineering tool
for creating a reliable operating system, and it is as important to understand the limitations of
modularity as it is to understand its usefulness.
Even the best-designed operating system, however, cannot continue to deliver carrier
class quality unless it is supported by disciplined engineering practices. Complexity must be
controlled through unwavering adherence to strict development processes and release standards;
otherwise an operating system quickly becomes unpredictable, unreliable, and unmanageable.

Introduction
For almost two decades “IP” has been synonymous with “Internet services.” When you
thought of an IP network you thought of web browsing, e-mail, data access and transfer, and
IM. During those early years network operators gained experience and confidence in IP as a
foundation communications protocol, and now we are in the beginning years of building much
more demanding and critical services over IP networks. Voice, video, an array of business and
entertainment services, military and emergency response communications, industrial sensors
and controls, mobile and wireless services – these are just some of the capabilities that are now
being deployed over IP networks.
The driver for the move to consolidation of multiple services over IP is economics: It is far
cheaper to build and operate a single infrastructure that can support many services, and it
is attractive to customers to receive multiple services from a single provider over a single
connection. One of the more prominent examples of this move is BT’s 21CN project. Once the
incumbent telephone monopoly in the United Kingdom, BT is abandoning its circuit-switched
voice infrastructure and consolidating both old and new services onto a high-performance IP
backbone – and in the process transforming itself from a traditional PSTN into a cutting-edge,
forward-thinking communications company. While BT itself calls the move radical, one of the
expected benefits should make sense to the most conservative of executives: An annual savings,
when the transition is complete, of £1 billion ($1.86 billion US).
“Digitised voice, data and video can now be combined, changed, merged and manipulated
on a single digital platform,” says BT’s Paul Reynolds. “And if it is the ability to merge multiple
information formats on a single platform that is driving the desire for convergence at a device
level, the availability of carrier class IP networks, multi-service networks and software-driven
switching, are fuelling the agenda for fundamental change in our industry.”


Given its corporate history, BT certainly understands exactly what is meant by carrier class IP
networks: It is the PSTNs and telcos, with over a century of developing and operating circuit-
switched networks, that have set the modern expectations for communications service quality.
Best-effort packet delivery is perfectly acceptable for early Internet-oriented applications such
as file transfers and e-mail, but is wholly unacceptable for quality-sensitive applications such as
voice and video. Convergence of such services onto a common IP infrastructure can succeed
only if the service quality meets or exceeds that of legacy networks.
The heart of all IP networks is the packet processors – routers – and the heart of all routers is
the operating system. A carrier class IP network, then, must begin with a carrier class router
operating system.

Router Operating System Objectives


The two basic functions of any router are route processing and packet switching. These functions
are accomplished, respectively, by two logical entities: the control plane and the forwarding
plane. The router’s operating system is the software that creates these two logical entities – the
routing protocols and the various databases that the routing protocols use to build the forwarding
information, for example, are components of the control plane. The OS also
manages the physical components of the router, and is the means by which you access the
router – both directly, such as the CLI, and indirectly, such as SNMP. It also includes peripheral
protocols for managing and operating the router, such as FTP or TFTP, NTP, Telnet, and SSH.

Objectives for Any Router OS


Before discussing the characteristics that make a router OS carrier class, consider the more basic
set of features that you should expect of any router OS, from the largest high-performance core
router to the smallest home router. These features are:
• Support for open standards
• Feature flexibility
• Manageability
• Basic security
• Service and support
• Basic reliability

Open Standards Support


Any router uses a number of protocols – and for a high-end router it is a long list indeed – to
perform its duties. These protocols can be specified by open standards bodies like the IETF, IEEE,
and ITU-T, or they can be proprietary to the manufacturer of the router’s OS. Open standards are
important for three reasons.
First, they give you some assurance that your router will interoperate with other routers
supporting the same standards, regardless of the manufacturer. Proprietary protocols obligate
you to use the same vendor for all routers between which the protocol must operate, sharply
reducing your design options and your ability to negotiate pricing among multiple vendors.
Second, your network operators are far more likely to be intimately familiar with open standards
because the specifications are publicly available. This understanding is essential when your
network experiences problems or failures; therefore open standards support contributes directly
to network reliability.


Third, perhaps counter-intuitively, open standards are more secure. It is certainly true that
malicious parties study open protocols for security vulnerabilities; but it is equally true that
open protocols are subject to a scope of peer review not possible for a single vendor. Therefore
security risks are more likely to be identified and corrected in open standards before the
protocols are ever implemented. A vulnerability in proprietary code is more likely to go
unnoticed until it is exploited.

Flexibility
Most networks change. New routing protocols are introduced as the network grows, new
features are enabled to support added network missions, new interfaces are installed to satisfy
growing bandwidth or redundancy requirements. The router OS must present you with a rich
menu of protocol and configuration options to support not only your initial design choices but
the changes you are sure to make as your network grows. Additionally, the OS must have the
capability of being easily upgraded to accommodate both improved code and newly-added
features from the vendor.

Manageability
Just as the OS should support a variety of protocols to adapt to different design philosophies and
network growth, the OS should provide a variety of means by which to manage the router. At a
minimum, the OS should provide:
• An intuitive command line interface (CLI) with extensive error checking capabilities and
help options
• A web-based configuration tool
• Simple Network Management Protocol (SNMP)
• The applicable open standards management information bases (MIBs)
The OS should also support both direct access to the router (in conjunction with the physical
router architecture) through a console connection and a modem connection; and remote access
to the router both through a dedicated management network connection and in-band access via
protocols such as Telnet and SSH.
Further management flexibility can be achieved by offering an application programming
interface (API) using an open standard such as Extensible Markup Language (XML). Such an
interface allows the router to be managed and configured using third-party management
platforms.

Basic Security
There are many aspects to security for an IP router, but there are certain security features that
you should expect every router, from the smallest to the largest, to support. A feature that should
be present on any router of any size is password-secured access. On any but the smallest home
routers this access authentication should be supplemented by the ability to define permissions
for different users – that is, the ability to specify what actions a given user or user group is
authorized to perform on the router – and the ability to monitor and record what actions
each user takes while accessing the router. These three security functions should be further
strengthened through the capability of being supported by independent servers: For example,
RADIUS or TACACS for authentication and authorization, and an independent file server for
accounting.
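
The three functions described above (authentication, authorization, and accounting) can be sketched in a few lines. This is only an illustration under stated assumptions, not a description of any router's implementation: the user names, roles, permitted actions, and the fixed salt are all hypothetical, and a production system would delegate these checks to external servers as the text recommends.

```python
import hashlib
import hmac

SALT = b"example-salt"  # hypothetical; a real system would use a random per-user salt


def hash_password(password: str) -> bytes:
    """Derive a password hash with PBKDF2 so plaintext is never stored."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), SALT, 100_000)


# Hypothetical user database and permission classes.
USERS = {
    "alice": {"hash": hash_password("correct horse"), "role": "operator"},
    "bob": {"hash": hash_password("battery staple"), "role": "admin"},
}
PERMISSIONS = {
    "operator": {"show"},
    "admin": {"show", "configure"},
}
ACCOUNTING_LOG = []  # stands in for an independent accounting server


def run_command(user: str, password: str, action: str) -> bool:
    """Authenticate the user, authorize the action, and record the outcome."""
    record = USERS.get(user)
    # Authentication: constant-time comparison of the derived hashes.
    if record is None or not hmac.compare_digest(record["hash"], hash_password(password)):
        ACCOUNTING_LOG.append((user, action, "auth-failed"))
        return False
    # Authorization: is this action permitted for the user's role?
    allowed = action in PERMISSIONS[record["role"]]
    # Accounting: every attempt is recorded, whether permitted or denied.
    ACCOUNTING_LOG.append((user, action, "permitted" if allowed else "denied"))
    return allowed


print(run_command("alice", "correct horse", "show"))
print(run_command("alice", "correct horse", "configure"))
```

Keeping the three checks separate mirrors the way each can be handed to an independent server without changing the others.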
A remote access protocol such as Secure Shell (SSH) should be available as an alternative to less
secure access protocols such as Telnet.
All routing protocols should have the capability of authenticating all peers. This is highly
recommended practice within your own network, and essential when peering with untrusted
neighbors – meaning all routers in networks not under your direct control.


Finally, a router should not have any potentially vulnerable protocols, such as Telnet or small
servers such as Finger, enabled by default. You should be required to explicitly enable all services
and protocols you desire to run on the router, and never be required to disable services you do
not want. Said another way, a router powered up “out of the box” should do almost nothing until
you tell it what to do. This gives you a reasonable assurance that no exploitable vulnerabilities
will go overlooked.

Service and Support


The value of strong technical support becomes most apparent when things go wrong. At such
times getting your network back to normal must be done as quickly as possible, which means
support staff must be both knowledgeable and responsive. At the same time, technical support
must be proactive; for operating systems that means implementing engineering processes
that minimize bugs and interoperability problems before customers ever see the software, and
implementing ongoing programs that can identify and correct problems in production code
before the problems become apparent as a wider network concern.

Basic Reliability
Reliability matters even for home routers. Interruption of services is at the least irritating, and
can drive customers away from vendors of routers that are perceived to be undependable.
Reliability increases in importance with the criticality of the services the router is transporting.
But just what is reliability? At a basic level we understand reliability to be the ability of a system
to function as expected for a given amount of time. Ideally we would like to add “without
failures” to the definition; however, as the complexity of a system increases the potential for
failure increases. Therefore a reliable system is one in which failures are minimized as much as
reasonably possible, but also one which can recover from a failure quickly and efficiently when
the unexpected does occur.
Given this definition, all of the features discussed so far can be seen to contribute to reliability:
Open standards support, flexibility, manageability, security, and strong technical support.
Moving to the discussion of the characteristics of a carrier class OS, all of those defining
characteristics can also be listed as the contributors to carrier class reliability.

What Makes a Router OS Carrier Class?


A carrier class router operating system must have all of the features described in the previous
section, but the quality of those features must far exceed what we have described so far. There
are also additional qualities to be found in a carrier class OS. Although you might find one or a
few of these additional features in other operating systems, carrier class requires the presence of
all of them. The unique features distinguishing a router OS as carrier class are:
• Stability
• Advanced security
• Scalability
• Precision
• High availability
• Consistency
• Predictability
• Carrier class reliability



Stability
Stability is the capability of a router to deliver invariable performance under variable network
circumstances. Every network has its “ups and downs”: erratic traffic loads and topological
change. If a network operator’s business is dependent on guaranteed service levels – as carrier
class networks are – no router in the network can suffer a performance degradation while
coping with variable network behavior.
Every router performs two very fundamental functions: Packet forwarding and route processing.
Packet forwarding is of course the action of reading the destination address (and possibly other
information) in the header of an incoming packet, making a decision about where that packet
should go, and then switching the packet to the correct outgoing interface. Route processing
is the means by which the router comes to know how to make the correct packet forwarding
decisions: Routers exchange information about the network infrastructure among themselves
and then determine the best path to all known destinations based on some agreed-upon set of
rules2.
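
To make the forwarding decision concrete, here is a minimal sketch of a destination lookup. The paper does not spell out the lookup rule; the sketch assumes longest-prefix match, the standard rule for IP forwarding. The prefixes and interface names are hypothetical, and the linear scan is for readability only, since real forwarding planes use specialized data structures and hardware.

```python
import ipaddress

# Hypothetical forwarding table: destination prefix -> outgoing interface.
FIB = {
    ipaddress.ip_network("0.0.0.0/0"): "ge-0/0/0",    # default route
    ipaddress.ip_network("10.0.0.0/8"): "ge-0/0/1",
    ipaddress.ip_network("10.1.0.0/16"): "ge-0/0/2",
}


def lookup(destination: str, fib=FIB) -> str:
    """Return the interface for the most specific prefix containing destination."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in fib if addr in net]
    if not matches:
        raise LookupError(f"no route to {destination}")
    # Longest-prefix match: the entry with the greatest prefix length wins.
    best = max(matches, key=lambda net: net.prefixlen)
    return fib[best]


print(lookup("10.1.2.3"))   # matches /0, /8 and /16; the /16 is chosen
print(lookup("10.9.9.9"))   # matches /0 and /8; the /8 is chosen
print(lookup("192.0.2.1"))  # only the default route matches
```

Route processing is what populates a table like `FIB`; packet forwarding is what consults it for every packet.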
A router must perform these two basic functions at the same time, and that has implications
for stability. If the traffic load through the router becomes very heavy, most resources might be
used in performing packet forwarding, causing delays in route processing and resulting in slow
reactions to changes in the network topology. On the other hand, a significant change in the
network topology might cause a flood of new information to the router; most resources might be
used in performing route processing, slowing the router’s packet forwarding.
The key to the problem as described is internal resources. If most resources are consumed by one
of the two basic functions, the other function suffers, and the router is destabilized. The answer
to the problem is to perform these functions in separate physical entities, each with its own
resources, as shown in Figure 1. In such an architecture the packet forwarding (forwarding plane)
and the route processing (control plane) do not draw processing cycles away from each other.

Figure 1 (diagram labels: Control Plane: Route Process, Kernel, RIB, FIB, Management Process, Security; Forwarding Plane: FIB, Layer 2 Processing, Interfaces)
The physical architecture depicted in Figure 1 has positive implications for the router’s operating
system. Because a part of the OS is the routing protocols, the OS resides in the control plane.
So, the forwarding plane can perform to full capacity without affecting the ability of the OS to
control the entire physical system. It also means that the OS is protected from unintentional or
malicious influences of the network, as discussed in the next section.

2 Static routes are a notable exception to this description; the route processing mainly occurs inside a human brain and the results of the best path determination are manually entered into the router. But while static routes are commonly a part of carrier class network configurations, they are never a primary source of route information in such networks.


Advanced Security
A carrier class router OS must go far beyond the basic security features discussed earlier in this
paper. At this advanced level, features and tools must be provided which address two missions:
• Strong protection of the router itself
• Protection of the network in general
As mentioned at the end of the previous section, the physical architecture of Figure 1 is a key
contributor to protection of the router. Attacks against the router will almost always come from
the network (rather than an out-of-band connection), and are directed against one of the routing
protocols or the OS itself. That means that attacks must enter at the packet forwarding entity
and then make their way up to the route processing entity. The link between these two entities,
then, serves as a “choke point” at which malicious packets can be identified and stopped, as
shown in Figure 2.

Figure 2 (diagram labels: Control Plane, Attack Packets, Forwarding Plane)
Powerful firewalling capabilities must be available for detailed identification and passing of only
specifically permitted packets to the control plane, blocking all others. Rate limiting capabilities
must also be available so that essential packets permitted through the firewall filters, such as
ICMP, cannot be exploited for flooding attacks.
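
The combination of filtering and rate limiting at the choke point can be sketched as follows. This is an illustrative model, not any vendor's filter syntax or implementation: the permitted protocol set, the token-bucket parameters, and the function names are all assumptions, and timestamps are passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Classic token bucket: `rate` tokens per second, at most `burst` stored."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate
        self.burst = burst
        self.tokens = burst  # start with a full bucket
        self.last = 0.0      # timestamp of the previous packet

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical policy: only these protocols may reach the control plane.
PERMITTED = {"bgp", "ospf", "ssh", "icmp"}


def accept_for_control_plane(protocol: str, now: float, icmp_limit: TokenBucket) -> bool:
    """Pass only explicitly permitted packets, and rate limit ICMP."""
    if protocol not in PERMITTED:
        return False  # everything not specifically permitted is blocked
    if protocol == "icmp":
        return icmp_limit.allow(now)
    return True


bucket = TokenBucket(rate=10.0, burst=5.0)
flood = [accept_for_control_plane("icmp", 0.0, bucket) for _ in range(20)]
print(flood.count(True), "of", len(flood), "ICMP packets accepted during the flood")
```

ICMP stays usable for legitimate diagnostics, but a flood exhausts the bucket and the excess is dropped before it can consume control plane resources.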
The tools for protecting the router – fine-grained packet filtering and rate limiting – should
also protect the network itself by being applicable to the interfaces of the forwarding plane.
It is important, however, that applying such control functions on production interfaces
does not negatively impact the performance of the router.
A carrier class router OS should also offer tools that help the network operator take action
against malicious traffic entering the network. For example, if a distributed denial of service
(DDoS) attack is in progress against a node in the network or is transiting the network toward
its target, the OS should have capabilities that aid the operator in tracing the attack traffic to its
entry points, where specific filters or rate limiters can be enabled to stop or alleviate the attack.

Scalability
The feature flexibility discussed earlier certainly contributes to scalability. At the carrier class
level, that flexibility must not come at the expense of performance. That is, the network operator must be able
to confidently enable a multitude of features on a given router without reducing the router’s
basic packet processing and forwarding rates. For example, in support of a multiservice network
a router might be running OSPF or IS-IS, Multiprotocol BGP, intricate routing policies, MPLS
and its associated signaling and traffic engineering protocols, related layer 2 and layer 3 VPNs,
IP multicast protocols, highly granular packet identification both for security and for traffic
classification, and advanced queuing – all while forwarding packets at or near line rate.

Scalability also means that new features can be added to the OS quickly, efficiently, and safely.
Finally, scalability means that the same OS can be used on multiple hardware platforms
and with any interface type; upgrading hardware or adding interfaces must not require the
replacement of the existing OS code with a different hardware-, interface-, or feature-specific
version of the OS.

Precision
Route calculation errors, even transient ones, can cause inaccurate packet forwarding,
forwarding loops, and black holes. In a network carrying sensitive traffic such as voice and
entertainment-quality video such errors, no matter how temporary, are unacceptable. Therefore
the route calculations of a carrier class OS must be correct every time. Without precision,
stability, scalability, and security are impossible.

High Availability
With the convergence of high-quality, high-demand services such as voice and video onto IP
infrastructures, network outages of any kind are no longer acceptable. Even relatively small
packet losses can have a negative effect on users’ perception of the service delivery; a major
node, link, or interface failure can have serious effects for the provider. A carrier class router
operating system must therefore itself be resilient to failure, and must provide the network
operator with tools that minimize network failures whenever possible and that minimize the
effects of failures that do occur.
It must be noted that most unplanned network outages are due not to hardware or software
failures, but to configuration mistakes. “The possibility of failures would be much reduced,”
writes Jeffrey Nudler, a Senior Analyst with Enterprise Management Associates, “if you consider
that changing device configuration causes 60% of downtime due to human error.”3 A carrier
class router operating system must take this human factor into account and help the operator
avoid making configuration mistakes.
Planned outages – taking a node offline for routine maintenance or upgrades – are somewhat
less service impacting than unplanned outages, because they are predictable. Nevertheless,
modern service level guarantees and “five nines” network standards preclude the traditional
practice of offline router operations. Carrier class router operating systems must enable in-
service router changes and upgrades.
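
A short calculation shows why “five nines” leaves no room for offline maintenance windows. The sketch below simply converts an availability percentage into permitted downtime per year; the function name and the use of a 365.25-day year are choices made for this illustration.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for label, pct in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
    print(f"{label} ({pct}%): {allowed_downtime_minutes(pct):.2f} minutes per year")
```

At 99.999% the budget is roughly five and a quarter minutes per year, for planned and unplanned outages combined, so taking a router out of service for an upgrade would consume it almost immediately.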

Consistency
Multiservice networks require complex configurations, which in turn can present enormous
operational challenges. Considering, as the previous section emphasizes, that human error is the
major cause of network outages, unnecessary operational complexity must be avoided whenever
possible. If different versions of an operating system are required for different platforms,
different interfaces, or different features, the difficulties of network management and hence the
chances of operational mistakes are significantly increased.
The ability to run the same OS software image on all routers helps control operational
complexity. Such consistency requires several factors:
• No platform-specific versions of the OS
• No interface-specific versions of the OS
• No feature-specific versions of the OS

3 http://www.networkworld.com/news/2005/101005-ietf.html


A consistent OS contributes to network stability and availability not only from an operational
aspect, but also from a software maintenance aspect. If the OS vendor is managing only a single
release at a time, adding enhancements and new features is greatly simplified; because changes
impact only a single code set, the changes can be more thoroughly tested. This translates directly
into more reliable software for the customer. Similarly, the customer’s regression testing before
an upgrade to a newer OS release is much more trustworthy if there are no multiple versions or
feature packages to test, reducing the chance of overlooked incompatibilities or unexpected code
conflicts during implementation.

Predictability
Delivery of high quality services requires a predictable transport network. There are two aspects
to predictability that are influenced by the router OS:
• Predictable network behavior
• Predictable software management by the OS vendor

The factors contributing to consistency discussed in the previous section – an OS that is not
platform or interface specific, and no separate feature packages – also help make the network
predictable by reducing the chances for unexpected events during OS changes. These factors
also help preserve predictability because the addition of features, interfaces, or platforms to the
network is far less likely to entail a change of OS software.
Network predictability is also helped by OS resilience. Although engineering practices that
minimize software bugs are crucial, occasional bugs are an inescapable fact in any complex
software code. Therefore an OS architecture that can isolate and limit the negative effects of
bugs, preventing them from causing systemwide failures, supports network predictability.
Another aspect of predictability is the manner in which the vendor manages the OS. Tightly
controlled development milestones, well-defined engineering quality principles, and a strict
adherence to a regular release schedule all enable confident planning for the network operator.

Carrier Class Reliability


Carrier class reliability is defined by all the qualities that go into basic reliability, as described
earlier in this paper, plus all of the carrier class qualities described in this section: stability,
advanced security, scalability, precision, high availability, consistency, and predictability. The
reduction of any one of these qualities diminishes the overall ability of the operating system to
fulfill the requirements of modern carrier class networks.



JUNOS Architecture
Juniper Networks’ first focus market was carriers and large-scale service providers. We
recognized that there were no carrier class router operating systems in existence, and we
designed JUNOS to fill that void. The architectural choices we made then proved to be the right
choices; while continually building upon that foundation operating system architecture, we have
never found, nor do we foresee, a need to consider a new operating system. In fact it is only
recently that some competitors have begun attempting to offer an operating system similar to
what we first offered a decade ago.
This section examines the key architectural features of the JUNOS software, and how these
features enable JUNOS to meet the requirements of a carrier class OS.

Modularity
The most essential architectural characteristic of JUNOS is its modularity. Rather than a single,
highly complex code base, the JUNOS software consists of a set of individual components, each
running in its own protected memory space, communicating with each other through well-
defined interfaces, and all controlled by the JUNOS kernel (Figure 3). The separate modules,
called daemons4, are key to both stability and scalability.

Figure 3 (diagram labels: Protocols (RPD), Interface Mgmt, Chassis Mgmt, PPMD (Hellos), SNMP, Operating System)

Modularity is essential to stability because of the functional separation of software components.
A malfunction or bug in one module might cause the module to fail, while the rest of the
system continues functioning; a monolithic operating system, on the other hand, has no such
compartmentalization and a similar malfunction or bug is likely to cause a full system crash.
Similarly, because each module operates in its own protected memory space and cannot
“scribble” on another module’s memory space, the modules cannot disrupt each other.
Stability is also supported by the ability to replace an individual module. So if a problem is
identified in a given module, that module can be changed; without modularity the entire
operating system would have to be changed, meaning the router must be taken out of service, to
perform a similar code patch.
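
The isolation argument can be demonstrated with ordinary operating system processes. The sketch below is not how JUNOS is implemented; it only shows the general principle that a fault in one process, running in its own address space, does not take down its neighbors. The daemon names and their one-line bodies are hypothetical, and real daemons would be long-running rather than exiting immediately.

```python
import subprocess
import sys

# Hypothetical daemons: each is a tiny Python program run in its own OS process.
DAEMONS = {
    "routing": "print('routing ok')",
    "snmp": "raise RuntimeError('simulated bug')",  # this module has a fault
    "chassis": "print('chassis ok')",
}


def run_once(code: str) -> int:
    """Run one daemon body in a separate process and return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True
    )
    return result.returncode


def supervise(daemons: dict) -> dict:
    """Start every daemon; a crash in one does not prevent the others from running."""
    return {
        name: "ok" if run_once(code) == 0 else "crashed"
        for name, code in daemons.items()
    }


status = supervise(DAEMONS)
print(status)
```

Only the faulty module reports a crash, and because it is a separate unit it could be replaced or restarted on its own. A monolithic design, in which all three bodies share one process, would have stopped at the first unhandled error.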

4 Daemon is a Unix term and reflects the FreeBSD origins of the JUNOS kernel.


The concept of modular scaling is certainly not new; one of the innovations Vint Cerf and Bob
Kahn introduced in TCP/IP was the idea of a “layered” protocol stack, allowing the change of one
layer without affecting the other layers.
The modular JUNOS architecture supports scalability because new modules can be added as
needed, and existing modules can be updated, without requiring a complete overhaul of the
entire OS code. This principle has been proven over and over; through the life of JUNOS several
dozen new modules have been added to the original OS as new features and capabilities
have been introduced. Yet for years after the advent of JUNOS other routers continued to run
monolithic operating systems with inherent instability and scaling limits.

Managing Modular Architectures


There are also engineering advantages to the JUNOS architecture that contribute to stability and
scalability. A reasonably small team of engineers manages the software comprising each module,
and the same team of engineers is responsible for the same module release after release.
Therefore the code is much better understood than it would be if it were a more integral part
of a monolithic code base or if there were separate release teams. As a result, the team
understands well how any addition or change to the module code will affect the rest of that
code. Because the module communicates with other modules through a defined interface, its
interactions with other modules are tightly controlled.
Because dedicated engineering teams manage the modules, communication within the team and
between teams can be carefully controlled. Dedicated teams also inspire a strong sense of
ownership, which helps reduce bugs in the code. And there are no separate “bug fix” teams; when
bugs do arise, the team responsible for writing the code is responsible for correcting the code.
So the engineering advantages of the modular JUNOS architecture result in faster code
development, testing, and debugging. The end benefit to the customer is sound, reliable
operating system software.

Intelligent Modular Design


It might seem that if modularity is good, the more modular the OS the better. But this is not the
case. While a component module must be small enough to be beneficially managed, it must also
be large enough to contain major interdependencies. If a module is made too small, artificial
barriers will be created between dependent functions, and the interprocess communication
between those functions adds complexity to the overall system.
A fundamental advantage of grouping functions into individual modules or processes, as already
discussed, is that the processes can be stopped, replaced, or can fail independently without
crashing the entire system. When deciding whether a function should be a part of an existing
module or should be in its own module, a determination must be made about what it means for
this function to stop or fail independently: Will other functions be affected? If so, the functions
are interdependent and should probably be grouped together in the same module. This concept
is illustrated in Figure 4.


Figure 4 (diagram): functions A through O and their functional interactions, grouped into three modules (Module 1, Module 2, Module 3). Interdependencies are well contained and interprocess communication is light.

Another consideration involves shared functions. If there is a common function that serves
several other functions, all of those functions should probably be grouped together in the same
module. Otherwise, as Figure 5 shows, either heavy interprocess communication must be
accepted in order for the separated functions to work together, or the shared function must be
duplicated in each module.

Figure 5 (diagram): functions A through O and their functional interactions, split across six modules (Module 1 through Module 6). Interdependencies are poorly contained and interprocess communication is heavy.
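The trade-off between the two groupings can be sketched by counting how many functional interactions cross a module boundary, since each of those would need interprocess communication. The functions, interactions, and groupings below are hypothetical and are not taken from the figures; the sketch only shows the kind of comparison involved.

```python
def cross_module_interactions(interactions, module_of):
    """Count interactions whose two functions live in different modules.

    Each such interaction would have to use interprocess communication.
    """
    return sum(1 for a, b in interactions if module_of[a] != module_of[b])


# Hypothetical functions and interactions, for illustration only.
INTERACTIONS = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D"),
                ("E", "F"), ("F", "G"), ("D", "E")]

# Coarse grouping: tightly coupled functions share a module.
COARSE = {"A": 1, "B": 1, "C": 1, "D": 1, "E": 2, "F": 2, "G": 2}

# Overly fine grouping: the same functions split across more modules.
FINE = {"A": 1, "B": 1, "C": 2, "D": 2, "E": 3, "F": 4, "G": 4}

if __name__ == "__main__":
    print("coarse:", cross_module_interactions(INTERACTIONS, COARSE))
    print("fine:", cross_module_interactions(INTERACTIONS, FINE))
```

With these made-up inputs the coarse grouping leaves one interaction crossing a boundary while the fine grouping leaves four, which mirrors the contrast between light and heavy interprocess communication.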

Intelligent Modular Design: The JUNOS Routing Module


A particularly clear example of intelligent modular design can be found in the JUNOS routing
module, called the Routing Protocol Daemon (RPD). The RPD contains all of the routing
protocols, such as OSPF, BGP, IS-IS, and RIP. Others have proposed that this is an
“old” architecture, and that containing each routing protocol in its own module (a BGP module,
an OSPF module, and so on) is better. There are two arguments to be made in favor of separate
protocol modules:
• A single protocol can fail or be stopped independently, without affecting the other
protocols.
• A single protocol can be upgraded to gain new features without the necessity of
upgrading the entire OS.


Both of these arguments are attractive and make sense on the surface. They are, after all, two
of the fundamental reasons a modular OS is superior to a monolithic OS. And in fact, Juniper
Networks has on more than one occasion considered replacing the RPD with individual protocol
modules. In each case Juniper engineers concluded that such a change was a move in the wrong
direction, and that more problems would be created than would be solved.
The first argument, that protocols can fail or be stopped independently without affecting other
routing protocols, is flawed because it assumes that each protocol is completely independent.
Such is not the case. Even at a superficial level all of the protocols running on a router tend to
have dependencies on each other. If OSPF fails, for example, it can affect IBGP, RSVP-TE, LDP,
and the RPF checks used both for security and for IP multicast. If BGP is stopped, it affects not
only inter-AS routing but possibly L2 and L3 MPLS VPNs, and IP multicast. Dig a little deeper and
you find that all IP routing protocols share a dependence on several other common protocols
and functions such as ICMP and ARP. Go even deeper and you find that the protocols must
cooperate in such basic functions as choosing a best route and maintaining the routing database.
If the routing protocols are in separate modules, heavy interprocess communication is required,
burdening the overall system, and sharing such basic functions as ARP and routing database
maintenance becomes a complex problem.
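The way a single protocol failure spreads can be sketched as a walk over a dependency map. The map below encodes only the direct effects named in the paragraph above (some of which the text qualifies as possible rather than certain); the data structure and function names are illustrative assumptions, not part of JUNOS.

```python
from collections import deque

# "If the key fails or is stopped, the listed functions are affected."
# Entries follow the examples in the text; the VPN and multicast effects
# of BGP are described there as possible, not certain.
AFFECTS = {
    "OSPF": ["IBGP", "RSVP-TE", "LDP", "RPF checks"],
    "BGP": ["inter-AS routing", "L2 MPLS VPNs", "L3 MPLS VPNs", "IP multicast"],
    "RPF checks": ["IP multicast"],
}


def affected_by(failed, affects):
    """Return every function reachable from a failed one (breadth-first)."""
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in affects.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


if __name__ == "__main__":
    print(sorted(affected_by("OSPF", AFFECTS)))
```

Even this small map shows an OSPF failure reaching IP multicast indirectly through the RPF checks, which is why treating each protocol as independent is misleading.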
Maintaining all routing protocols in a single module, the RPD, contains the many
interdependencies among individual protocols. Interprocess communication is not overtaxed,
shared functions are controlled, and the overall system is simpler, which translates into a more
reliable routing platform.
The second argument, that modularizing individual protocols allows the customer to upgrade
only the protocol he wishes in order to acquire new features, is particularly appealing. For
example, the BGP module could be upgraded to a version that supports a desirable new feature
without the necessity of upgrading the entire OS. This provides the appearance of an In-
Service Software Upgrade (ISSU), because one section of code can be replaced without taking
the entire system out of service. Modularizing at the protocol level would seem to make sense
when offering this approach, so that individual protocols can be updated as non-disruptively as
possible.
But given the interdependencies among protocols already discussed, replacing a single protocol
such as BGP is hardly as non-disruptive to routing operations as it might appear on the surface.
Far more important, the practice of selectively replacing protocol modules – or any OS module,
for that matter – comes at a steep price in lost consistency and predictability. To illustrate the
problem, take a hypothetical router OS that has five currently available releases: Release A
through Release E. Release B is newer (and therefore has newer features) than A, C is newer than
B, and so on. In each release, there is an OSPF module, an IS-IS module, a BGP module, and a
RIP module. You are allowed to pick and choose among the protocol modules to attain exactly the
features you want: Perhaps OSPF from Release B, RIP from Release A, and BGP from Release D.
To make this menu of combinations available to you, the software vendor must maintain and
understand the interactions of each routing protocol module from each release with all of the
other routing protocol modules from each release. Given the four protocols across five releases,
the total number of possible release-specific protocol combinations is approximately 4^5, or 1,024 [5]. Whenever
the vendor adds a new feature to one of the protocols he must perform regression testing
not just for that release, but for all of the 1000+ possible protocol combinations. And if you
experience problems with a newly upgraded protocol module, the vendor’s technical support
personnel must understand the interoperability implications of all 1000+ combinations.

[5] The actual number of combinations is slightly less (16 less, in this example), because a given protocol from one release would never be combined with the same protocol from another release.


This example considers just RIP, OSPF, IS-IS, and BGP modules. Add to that an MPLS module
and an IP multicast module in each of the five releases. The possible protocol combinations
now become approximately 6^5, or 7,776. And this assumes that the MPLS module is not further
divided into separate RSVP-TE, LDP, L2 VPN, L3 VPN, and VPLS modules, and that IP multicast is
not further divided into its constituent protocols. Take the practice beyond just routing protocol
modules and include all of the OS modules, and the number of possible package combinations
across several releases soars exponentially into the hundreds of thousands.
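The growth of the test matrix can be checked with a short Python sketch. It assumes one simple counting model that is not spelled out in the paper: each protocol module is drawn independently from any one of the available releases, so the count is the number of releases raised to the number of protocols. The release labels and function names are illustrative. The totals this model produces follow from that assumption and need not match the approximations quoted above; the exponential growth with each added module is the same.

```python
from itertools import product


def count_combinations(protocols, releases):
    """Number of ways to pick one release for each protocol module."""
    return len(releases) ** len(protocols)


def enumerate_combinations(protocols, releases):
    """List every mix-and-match build as a protocol-to-release mapping."""
    return [dict(zip(protocols, choice))
            for choice in product(releases, repeat=len(protocols))]


RELEASES = ["A", "B", "C", "D", "E"]
FOUR = ["OSPF", "IS-IS", "BGP", "RIP"]
SIX = FOUR + ["MPLS", "IP multicast"]

if __name__ == "__main__":
    print("four protocols:", count_combinations(FOUR, RELEASES))
    print("six protocols:", count_combinations(SIX, RELEASES))
    # One of the enumerated builds, e.g. every module taken from release A.
    print(enumerate_combinations(FOUR, RELEASES)[0])
```

Each combination is a distinct build that would need regression testing and support, so adding two modules multiplies the burden by the square of the number of releases under this model.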
The liabilities of this approach are clear: A vendor might gain positive short-term customer
response by allowing “mix-and-match” modules from different releases, but the code will
quickly become unmanageable. The end result is an inconsistent, unpredictable, and ultimately
unreliable operating system.
The JUNOS RPD therefore remains a single module containing all routing protocols. And while
the RPD can be replaced as a module, Juniper Networks supports doing so only for installing
code patches and bug fixes when necessary; new features are acquired by upgrading the entire
OS. This practice is key to a well understood, closely controlled, highly reliable operating system.

Intelligent Modular Design: The Periodic Packet Management Daemon


Although good engineering practice dictates keeping all routing protocols in a single module,
there is another view of modularization of the routing functions. To understand where
modularization is beneficial in the routing process it is necessary to think about basic routing
functions. On the one hand, a routing process is responsible for performing route calculations
using the information presented to it. Precision and stability require that this calculation be
allowed to run uninterrupted until it is finished. If the calculation is interrupted, there is a risk
of incorrect or incomplete route information finding its way into the routing database, possibly
resulting in incorrect forwarding, routing loops, or packet black holes.
On the other hand, there are elements of a routing process that must be serviced as soon as
possible. Hellos, adjacency maintenance messages, and route updates have timers – often tightly
set timers – that require quick processing and response. Reacting slowly to these functions
could cause timeouts that in turn can result in unnecessary message retransmissions at best and
closed adjacencies at worst. Stability, precision, and predictability can all be negatively affected.
There is a potential conflict in these two basic functions. A route calculation is a run-to-
completion task – in computer science terms, it requires cooperative multitasking. Performing
adjacency maintenance and update tasks is a real-time, or preemptive multitasking, function.
When a routing protocol implementation must share a processor, should it allow interruptions
of its run-to-completion tasks whenever a real-time task needs the processor, at the risk of
temporarily corrupted route data? Or should it require real-time demands to wait until
run-to-completion tasks are finished, at the risk of broken adjacencies and network instabilities?
The answer, of course, is that neither situation is acceptable. Herein, then, is a justification for
a separation of the software comprising the real-time and run-to-completion elements of the
routing process. JUNOS implements the RPD, with all of its constituent routing protocols, as a
run-to-completion module. The real-time elements of the routing protocols are separated into a
module called the Periodic Packet Management Daemon (PPMD). The distinct processing needs
of each module are then served, and a scheduler manages the demands of both modules on the
shared Routing Engine processor. The result is a highly responsive, accurate, and stable routing
platform.
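The conflict between the two kinds of work can be illustrated with a toy timing model in Python. It is a sketch under stated assumptions, not a description of how RPD, PPMD, or the JUNOS scheduler are implemented: time is in milliseconds, the hello interval, dead interval, and calculation window are hypothetical values, and a neighbor is assumed to drop the adjacency if the gap between two received hellos exceeds the dead interval.

```python
def hello_send_times(interval, horizon, calc_start, calc_end, separated):
    """Return the times at which periodic hellos are actually sent.

    When hellos share a run-to-completion process with the route
    calculation (separated=False), any hello that falls due during the
    calculation waits until the calculation finishes.
    """
    sent = []
    for due in range(0, horizon + 1, interval):
        blocked = (not separated) and calc_start <= due < calc_end
        sent.append(calc_end if blocked else due)
    return sent


def adjacency_survives(sent, dead_interval):
    """True if no gap between consecutive hellos exceeds the dead interval."""
    return all(later - earlier <= dead_interval
               for earlier, later in zip(sent, sent[1:]))


if __name__ == "__main__":
    # Hypothetical timers: 1 s hellos, 3 s dead interval, 5 s calculation.
    for separated in (False, True):
        times = hello_send_times(1000, 8000, 500, 5500, separated)
        print("separated" if separated else "shared",
              adjacency_survives(times, 3000))
```

With these assumed numbers the shared process leaves a 5.5-second gap and loses the adjacency, while the separated hello sender keeps every gap at one second and the route calculation still runs uninterrupted.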


The JUNOS Kernel


The heart of JUNOS, the JUNOS kernel, began as a FreeBSD kernel. FreeBSD is renowned for
running on servers with exceptionally long uptimes, indicating both its level of reliability and
its infrequent need for updating. Because FreeBSD is open source software, Juniper Networks
engineers were free to retain what mattered, discard what didn’t, and custom-build the parts that
make the kernel JUNOS rather than FreeBSD.
Recently one of Juniper Networks’ competitors has begun offering a new operating system
built on the proprietary QNX Neutrino microkernel, and that vendor has made much of the
supposed superiority of microkernels over kernels such as JUNOS’. To understand the issue,
it helps to briefly describe the reasoning behind microkernels. A simplistic comparison of a
monolithic kernel to a microkernel is illustrated in Figure 6. Only essential system services
remain in the microkernel (hence the prefix micro); functions such as the host stack, device
drivers, and file system have become external processes running in user mode, communicating
with the microkernel via system calls. By doing this, these “externalized” functions can restart or
fail independently without causing a complete kernel failure.

Figure 6 (diagram): a monolithic kernel compared with a microkernel. In the monolithic kernel, processes use system calls to reach a kernel containing the scheduler, paging, host stack, device drivers, virtual memory, file system, and so on, which sits above the hardware. In the microkernel, the host stack, device drivers, and file system run as external processes that use IPC and system calls to reach a microkernel containing only the scheduler, paging, and similar essential services, which sits above the hardware.

This argument in favor of microkernels is of course the same argument in favor of modularity
in the overall OS architecture. But the principles for intelligent modular design discussed in this
paper also apply here. The system is so heavily dependent on the host stack and file system
that a failure of one of these services is likely to have a severely negative impact on the entire
system whether they are in the kernel or external processes. And in reality, device drivers can
be stopped and started even within the kernel. In practice, adding artificial barriers between
these services increases interprocess communication; the attempt
to simplify the kernel adds complexity to the overall system.
There is nothing new in the arguments currently being made in favor of microkernels; in fact
they come from a 20-year-old academic debate. One of the more enlightening versions of this
debate took place in 1992 between Andy Tanenbaum, proponent of the microkernel-based
Minix operating system, and Linus Torvalds, creator of the kernel-based Linux, on the Usenet
newsgroup comp.os.minix [6]. Tanenbaum made the same arguments then as the arguments
now being used to promote microkernels as the latest innovation in router operating system
architecture. “Among people who design operating systems,” Tanenbaum wrote, “the debate is
essentially over. Microkernels have won.” [7]

[6] A simple Google search provides the complete text of the debate.
[7] Andy Tanenbaum, “LINUX is Obsolete,” comp.os.minix, January 1992.


Yet reality has shown otherwise. While microkernels have proven popular in embedded systems
such as automotive computers and industrial controls (QNX is famously used in the Space
Shuttle’s robotic arm), they have found little acceptance in more complex operating systems.
“Microkernels are mostly discredited now,” writes Miles Nordin in Linux Journal, “because they
have performance problems, and the benefits originally promised are a fantasy.” [8] This view is
supported in the widely respected textbook on operating system design, Operating System
Concepts: “Unfortunately, microkernels can suffer from performance decreases due to increased
system function overhead.” [9]
Juniper Networks maintains no strong position on the arguments for and against microkernels.
Rather we chose FreeBSD as the “genetic forerunner” of the JUNOS kernel because of its
openness, in keeping with our strong belief in open standards. Its open source development
model has made FreeBSD among the most peer-reviewed software in the world; the reliability of
JUNOS is thereby rooted in the reliability of FreeBSD.

Engineering Discipline
The consistent message of the previous few sections has been that modularity is essential to a
carrier class OS architecture, but naïve approaches to designing modules can cause as many
problems as they solve, or more. This paper has called thoughtful, experience-based modularity
intelligent modular design.
There is a deeper message throughout this paper: A router operating system that can meet
carrier class demands is only possible when it is managed by a highly experienced, highly
disciplined engineering team following strict engineering processes. J. M. Juran, the guru of
modern business and industrial quality practices, says that you can determine the quality of the
product by assessing the quality of the processes used to develop it.
Any carrier class router OS is necessarily a highly complex system. Reliability can only
be maintained in such a system when the processes for improving the code and feature
enhancements are tightly controlled.
The principles of engineering discipline and strict processes were implemented at Juniper
Networks from the very beginning, by the engineers joining the young company. Many had
experienced first hand what happens when the rules governing product development are loose,
and when the developers do not have control of the code: The software becomes unmanageable,
and changes bring unpredictable conflicts that often become apparent only when the customer
attempts to implement the software.
Our quality development practices have evolved and matured with the company, but Juniper
Networks has never deviated from the standards implemented in its first years. In fact our
acquisition of TL9000 certification only required documenting the processes already in place, not
implementing new processes.

[8] Miles Nordin, “Obsolete Microkernel Dooms Mac OS X to Lag Linux in Performance,” Linux Journal, May 2002.
[9] Abraham Silberschatz, Peter Baer Galvin, and Greg Gagne, Operating System Concepts, Seventh Edition, John Wiley & Sons, 2005, page 62.


JUNOS Release Schedule


There are four major releases of JUNOS each year, one per quarter, always in the same months:
• February
• May
• August
• November
There are also, typically, five working releases at any given time:
• Three maintenance releases
• One release in beta
• One release under development
The release schedule provides a high degree of predictability for customers planning upgrades
and new feature implementation. Because of this, the release schedule always has highest
priority. Several dozen new features are included in each release, so it is important that the
customers planning for these new features not be delayed by development problems with one
feature. If a new feature project becomes delayed, the feature is moved to the next release; the
release is never delayed while waiting for a specific feature development to catch up.
Well-defined development milestones are essential to this process, so that expected development
delays can be identified early on. Any rescheduling of a feature to a later release, then, normally
occurs early enough that customers expecting the feature are given plenty of lead time to adjust
their plans accordingly.
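The rule that the schedule takes priority over any single feature can be sketched as a simple assignment: each feature goes into the first scheduled release it is ready for, and the release dates never move. The release months come from the list above; the feature names and readiness months are hypothetical, and the function is an illustration rather than a description of Juniper Networks' planning tools.

```python
# Release months from the schedule above: February, May, August, November.
RELEASE_MONTHS = [2, 5, 8, 11]

# Hypothetical features and the month in which each is ready to ship.
FEATURE_READY = {"feature-x": 1, "feature-y": 3, "feature-z": 9}


def plan_releases(release_months, feature_ready):
    """Assign each feature to the first release on or after its ready month.

    Release months are fixed; a late feature slips to the next release
    instead of delaying the one it was planned for.
    """
    plan = {month: [] for month in release_months}
    for feature, ready in sorted(feature_ready.items()):
        for month in sorted(release_months):
            if ready <= month:
                plan[month].append(feature)
                break
    return plan


if __name__ == "__main__":
    for month, features in plan_releases(RELEASE_MONTHS, FEATURE_READY).items():
        print(month, features)
```

In this example a feature that misses February simply appears in May, and the August release still ships even though it carries none of the three features.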
Major infrastructure projects and unusually complex new features are introduced in phases
over multiple releases. A good example of a phased project is Non-Stop Routing (NSR). Early
components of NSR were added to JUNOS code as early as release 7.6; these first components
were “invisible” to the customer, but allowed Juniper Networks system test personnel to ensure
correct integration before moving on to the next phase components. The first “customer visible”
NSR components (OSPF and IS-IS support) were released in JUNOS 8.1, and the NSR project
will be fully complete at JUNOS 9.0. Releasing such projects in phases supports reliability by
allowing incremental regression testing of components as they are added.
The JUNOS release schedule is also essential to helping adhere to the single train model.

JUNOS Single Train Release Model


The JUNOS single-train model means that for each JUNOS release, there is only one image; that
one image runs on all T, M, and J Series routers. The same code that runs on the largest T Series
router also runs on the smallest J Series router. And all features supported at a given release are
supported in the one image. There are no separate feature packages to add when you want to
add a feature; you only have to enable the feature you want.
There are a number of development ethics that are adhered to in order to maintain this single
train model:
• No feature development is performed in maintenance (working) releases. New features
are added to new releases.
• No back-porting of features is allowed. That is, when a new feature is developed in a new
release, the feature cannot be added to an older release.
• No “customer specials.” All features requested by all customers are developed and
released in the mainline code.


Stating what we will not do seems somewhat inflexible. It is. Loosening any of these rules means
veering off the focused path delineated by our strict quality processes, and in the end our
customers would suffer. Adhering to the rules means that at all times our developers are working
with only a single code base at any release; the result is well-understood code, with new features
and changes carefully tested for correct integration. For the customer, this means superior
reliability. It also eliminates any need for the customer to cautiously select from a complex menu
of platform-specific, interface-specific, and feature-specific packages and then perform careful
regression testing to ensure that the selected code interoperates as expected with previously
implemented versions of the code and all installed hardware.
The single train model also benefits our customers in the following ways:
• The same development teams manage the same software modules release after release,
ensuring that the code, and any changes made to it, is intimately understood.
• The same team responsible for writing the code is responsible for finding and correcting
bugs in the code. As a result, bugs are remedied far faster than would be possible if we
used separate “bug fix” teams.
• The dedicated engineering team concept inspires a sense of ownership for the code,
sharply reducing the chances of bugs in the software in the first place.
Again, these principles translate directly into reliability for our customers.

New Product Introduction


In addition to engineering rules and procedures, there must be a set of phases that guide a given
product throughout its lifetime from first inception to end-of-life. At Juniper Networks this is the
New Product Introduction (NPI) model. The NPI model defines seven phases, and is applied to all
engineering projects. Well-defined milestones must be met for any project to progress from one
phase to the next. Figure 7 shows the specific NPI model for JUNOS releases and features.

Figure 7 (diagram): the NPI model for JUNOS releases and features. Phase 0: Initial Feature Content and SW Resource Estimate. Phase 1: Product Definition, Commitment, and Approval. Phase 2: Design, Development, and Unit Test. Phase 3: System and Alpha Test. Phase 4: Beta Test. Phase 5: FRS to Production. Phase 6: End of Life.

Certainly every company that produces a product has some similar model for defining the
product’s lifecycle; but without strict engineering discipline, the models mean little. Juniper
Networks’ NPI model is defined to provide value to the customer by enabling us to foresee
resource requirements well in advance of the point where they might affect timely delivery to
our customers.


Enforcement of the development milestones is accomplished by mechanized process controls.
Communication within and between development teams is also highly mechanized, beginning
with enhancement requests from field sales personnel, representing our customers, and
continuing all the way to the end of product life.
Just as module teams and the single train release model are essential for understanding the code,
the NPI model is essential for projecting resource requirements and understanding exactly where
the code is in its lifecycle.

Conclusions
JUNOS is the repository of our accumulated networking knowledge. It does not distinguish
between core and edge, service provider and enterprise. The power, discipline, and consistency
of our engineering practices ensure the continuing advancement of JUNOS as the single operating
system architecture for all future Juniper Networks platforms.
JUNOS was designed from the beginning to meet the demands of carrier class networks,
and we have continually improved upon it while never deviating from our core engineering
principles. Our decade of experience with JUNOS’ modular architecture brings a level of mature
understanding of managing such architectures that cannot be matched by other vendors just
now attempting to offer similar router operating systems.
JUNOS has always been the premier operating system in high-performance, high-demand
networks. As more and more sensitive services are added to existing networks, the unmatched
reliability of JUNOS becomes more important to serious service providers than ever before.

Copyright 2006, Juniper Networks, Inc. All rights reserved. Juniper Networks and the Juniper Networks logo are registered trademarks of Juniper Networks, Inc. in
the United States and other countries. All other trademarks, service marks, registered trademarks, or registered service marks in this document are the property of
Juniper Networks or their respective owners. All specifications are subject to change without notice. Juniper Networks assumes no responsibility for any inaccuracies
in this document or for any obligation to update information in this document. Juniper Networks reserves the right to change, modify, transfer, or otherwise revise this
publication without notice.
