
RELIABLE NETWORK MASS STORAGE

– A two-node High Availability cluster providing NFS fail-over and mirrored storage

A master’s thesis by Jonas Johansson

Mail: sanoj@sanoj418.com
URL: http://www.sanoj418.com/master-thesis/

Ericsson Utvecklings AB
Department of Microelectronics and Information Technology at the Royal Institute of Technology

ABSTRACT

Ericsson Utvecklings AB is developing a server platform that agrees with the common
telecommunication requirements. New wireless communication technologies are
influencing the servers in the service and control networks. Reliable storage is
desirable because it promises a more versatile usage of the generic server platform.

Current generations of the server platform lack support for non-volatile storage, and
this project has investigated the possibility of designing a mass storage prototype
based on open source software components and conventional hardware. The result
was a two-node cluster with fail-over capabilities that provides an NFS file system with
increased availability. The prototypes are somewhat limited, but several proposals that
would enhance the solution are discussed.

1 INTRODUCTION
1.1 BACKGROUND
1.2 PURPOSE
1.3 DISPOSITION
1.4 NOTATION USED IN THIS PAPER

2 THE SERVER PLATFORM
2.1 THE NETWORK SERVER PLATFORM
2.1.1 The Generic Ericsson Magazine
2.1.2 Switch Boards
2.1.3 Processor Boards
2.2 TELORB
2.2.1 Communication
2.2.2 TelORB Processes
2.3 STORAGE REQUIREMENTS

3 MAGNETIC DISK DEVICE BASICS
3.1 DISK DEVICE TERMINOLOGY
3.1.1 Basic Hardware Components
3.1.2 Data Layout
3.1.3 Data Encoding
3.1.4 Form Factor and Areal Density
3.2 SERVICE TIMES
3.3 DISK DEVICE RELIABILITY
3.3.1 Common reasons causing disk device failure
3.3.2 Disk device reliability measurements
3.3.3 Self Monitoring and Reporting Technology
3.4 SINGLE DISK STORAGE

4 STORAGE DEVICE INTERFACES
4.1 ADVANCED TECHNOLOGY ATTACHMENT
4.2 SMALL COMPUTERS SYSTEMS INTERFACE
4.3 ISCSI
4.4 SERIAL STORAGE ARCHITECTURE
4.5 FIBRE CHANNEL

5 REDUNDANT DISK ARRAYS
5.1 DISK ARRAY BASICS
5.1.1 Striping
5.1.2 Disk Array Reliability
5.1.3 Redundancy
5.1.4 RAID Array Reliability
5.2 RAID LEVELS
5.2.1 Level 0 – Striped and Non-Redundant Disks
5.2.2 Level 1 – Mirrored Disks
5.2.3 Level 0 and Level 1 Combinations – Striped and Mirrored or vice versa
5.2.4 Level 2 – Hamming Code for Error Correction
5.2.5 Level 3 – Bit-Interleaved Parity
5.2.6 Level 4 – Block-Interleaved Parity
5.2.7 Level 5 – Block-Interleaved Distributed Parity
5.2.8 Level 6 – P+Q Redundancy
5.2.9 RAID Level Comparison
5.3 RAID CONTROLLERS

6 FILE SYSTEMS
6.1 BASICS ABOUT LOCAL FILE SYSTEMS
6.1.1 Format and Partition
6.1.2 Data Blocks
6.1.3 Inodes
6.1.4 Device Drivers
6.1.5 Buffers and Synchronisation
6.1.6 An Example
6.1.7 Journaling and Logging
6.2 LINUX VIRTUAL FILE SYSTEM
6.3 DISTRIBUTED FILE SYSTEMS
6.3.1 Network File System
6.3.2 Andrew File System
6.4 THE FILE SYSTEM AND THE USER
6.4.1 User Perspective of the File System
6.4.2 Filesystem Hierarchy Standard

7 SYSTEM AVAILABILITY
7.1 THE TERM AVAILABILITY
7.2 TECHNIQUES TO INCREASE SYSTEM AVAILABILITY

8 PROTOTYPE DESIGN PROPOSAL: ACTIVE – STANDBY NFS CLUSTER
8.1 PROPOSAL BACKGROUND
8.1.1 Simple NFS Server Configuration
8.1.2 Adding a Redundant NFS Server
8.1.3 Adding Shared Storage
8.1.4 Identified Components
8.2 VIRTUAL SHARED STORAGE
8.2.1 Network Block Device and Linux Software RAID Mirroring
8.2.2 Distributed Replicated Block Device
8.3 INTEGRATION OF THE COMPONENTS
8.3.1 Linux NFS Server
8.3.2 Linux-HA Heartbeat
8.3.3 The two-node High Availability Cluster

9 IMPLEMENTATION
9.1 REDBOX
9.1.1 Hardware Configuration
9.1.2 Operating system
9.1.3 NFS
9.1.4 Distributed Replicated Block Device
9.1.5 Network Block Device and Software RAID Mirroring
9.1.6 Heartbeat
9.1.7 Integrating the Software Components into a Complete System
9.2 BLACKBOX
9.2.1 Hardware Configuration
9.2.2 Software configuration

10 BENCHMARKING AND TEST RESULTS
10.1 BENCHMARKING TOOLS
10.1.1 BogoMips
10.1.2 Netperf
10.1.3 IOzone
10.1.4 Bonnie
10.1.5 DRBD Performance
10.2 REDBOX BENCHMARK
10.3 BLACKBOX BENCHMARK
10.4 REDBOX VERSUS BLACKBOX
10.5 FAULT INJECTION

11 CONCLUSIONS
11.1 GENERAL
11.2 THE PROTOTYPES
11.3 DATACOM VERSUS TELECOM
11.4 LINUX AND THE OPEN SOURCE COMMUNITY
11.5 TSP

12 FUTURE WORK
12.1 POSSIBLE PROTOTYPE IMPROVEMENTS
12.2 BRIEF OVERVIEW OF OTHER POSSIBLE SOLUTIONS

13 ACKNOWLEDGEMENTS

14 ABBREVIATIONS

15 REFERENCES
15.1 INTERNET
15.2 PRINTED

APPENDIX

A HARDWARE CONFIGURATION

B SOFTWARE CONFIGURATION

1 INTRODUCTION

All building blocks in a modern telecom infrastructure (e.g. servers, switches, media
gateways and base station controllers) are designed to be extremely fault tolerant and
highly available. Unplanned system downtime is unacceptable; every second is
crucial, but it is also vital that planned downtime is kept to a minimum. In modern
telecom networks, far from all traffic consists of regular phone calls. Mobile units, for
instance PDAs, are often used to connect to the Internet via cellular telephones, and the
trend is towards an increasing data/voice traffic ratio, that is, more data traffic and less
actual talking. New wireless communication technologies with enhanced performance,
e.g. GPRS and UMTS, speed up the adoption of cellular telephones and PDAs. The
servers in the service and control networks are of course influenced by this change.
They must be able to adopt new services and therefore it must be easy to configure and
scale the servers.

This project has investigated the possibility of designing a solution based on open
source software components and conventional hardware. The result was a two-node
cluster with fail-over capabilities that provides increased availability.

1.1 BACKGROUND

Ericsson Utvecklings AB is the headquarters for Core Network Development, which is a
virtual organisation hosted by several Ericsson companies and partners around the
world. Ericsson UAB provides Ericsson and its customers with cutting edge
telecommunications platform products, services and support. This project is
performed at Ericsson UAB in Älvsjö, Sweden, which is developing a server platform
informally referred to as The Server Platform (TSP). TSP agrees with the common
telecommunication requirements: high availability, high performance, cost efficiency
and scalability. The first three generations of TSP lack any implementation and
integration of mass storage solutions. Currently the fourth generation is under
development, and telecom operators, the primary buyers, have now clearly shown
interest in the possibility of storing large amounts of information in non-volatile
memory (e.g. a hard disk device). Many new services and their applications rely on
the possibility of storing large amounts of information. An example of an application that
needs reliable storage is AAA, an application that provides authentication,
authorisation and accounting services. A Home Location Register (HLR) is another
example of a TSP implementation affected by new storage demands. An HLR can
briefly be described as a high performance database management system (DBMS) that
stores client specific information such as billing and last geographical location.

Today the server platform stores operating system files, and files used when booting
the processor cluster, on single hard disk devices attached to dedicated load nodes.
The processes running in the processor cluster do not use these disks other than for
booting. Application data generated by a process during execution is stored in a
database, which is distributed over the cluster members' volatile memory. To exclude
the possibility of a single point of failure, another processor always stores a copy of a
process' information. This simple principle of storing the same information at two
different places is an example of the redundancy principle, an efficient solution to
provide a higher level of availability. Volatile memory has two obvious drawbacks: it
is expensive when storing large amounts of information, and the information is of
course lost when the power is turned off.

1.2 PURPOSE

The purpose of this thesis was to investigate the possibility of designing a high
availability mass storage configuration suitable for a telecom server's requirements,
and whether it was feasible to implement a prototype with conventional hardware and
open source software. Questions that pervade the thesis:

- How is a storage configuration suitable for a telecommunication server platform
designed to exclude any single point of failure?

- How is a storage configuration prototype implemented using only standard
components and open source software?

- How does the performance compare to commercial solutions?

1.3 DISPOSITION

The thesis starts with a brief description of the rather complex target system – the
TSP – and carries on with the basics about the fundamental hardware components
used in storage configurations, principally because a deeper knowledge of the most
basic storage components increases the understanding of the more high-level
system design issues, i.e. why single disk storage is inappropriate in a
vital system. Techniques and theories providing increased reliability and availability at
some level, such as RAID, clustering and distributed file systems, are discussed.
General system design issues such as single point of failure and redundancy are also
in the scope of the thesis. A system's availability is basically a compromise between
hardware availability, software availability and human error. High availability is a hot
topic; several open source projects are designing and implementing new ideas
concerning high availability for Linux. These theories are of course useful when
designing a high availability mass storage solution suitable for a telecommunication
server platform. The next section presents possible designs based on standard
components and open source software, followed by an evaluation, the final conclusions
and suggestions on how the prototypes may be improved.

1.4 NOTATION USED IN THIS PAPER

Something that needs extra attention when presented for the first time is written in
italics. For instance:

A widely used file system is EXT2FS, which is considered to be the standard Linux
file system today.

There are a few examples of applications used during this project that can be typed in
at a computer terminal, and these are written in the fixed-width Courier font style:

When creating an EXT2 file system it is desirable to use the mke2fs application.

A UNIX or Linux prompt is illustrated with a dollar sign:

$ ls | grep foo

References are marked as:

[Reference]

2 THE SERVER PLATFORM

The Server Platform is developed to fulfil increased demands for openness,
robustness and flexibility. TSP is currently available in three generations, each with a
set of different configurations. This is a description of ANA 901 02/1 (figure 1), a
configuration of TSP 3.0 that utilises the fault control software system TelORB for
traffic handling and database transactions. The ANA 901 02/1 configuration and its
hardware and software components are used for testing and evaluation of the mass
storage prototype during this project. In this document TSP refers to this configuration,
and components refer to ANA 901 02/1 components unless otherwise stated.

Figure 1 – A fully equipped TSP base cabinet.

In parallel with this thesis project the fourth generation of TSP is under development.
Major differences between it and its precursors are the migration from Solaris UNIX to
Linux and a homogeneous use of Intel Pentium based processor boards.

The material in this section is more thoroughly described in the ANA 901 02/1 System
Description [Ericsson01] and the TelORB System Introduction [Ericsson02].

2.1 THE NETWORK SERVER PLATFORM

The TSP is derived from a broad variety of hardware components and its generic
hardware platform is called the Network Server Platform (NSP). This section covers the
most essential components, emphasising those used during this master's thesis project.

2.1.1 The Generic Ericsson Magazine

TSP is adapted to the Generic Ericsson Magazine (GEM) hardware infrastructure that
is based on the BYB 501 equipment practice. It is a generic platform sub rack that is
used in several different Ericsson products. All open hardware components are
standard Compact PCI components¹. An Ericsson made carrier board is used to
integrate the standard cPCI components to the GEM back plane practice. This
provides Ericsson with the possibility to influence essential characteristics such as
EMC.

¹ CompactPCI or cPCI is a very high performance industrial bus based on the standard PCI electrical
specification. Compact PCI boards are inserted from the front of the chassis, and I/O can break out either to the
front or through the rear. More information can be found at http://www.picmg.org/, the PCI Industrial Computers
Manufacturer's Group.

2.1.2 Switch Boards

A sub rack is equipped with two redundant Switch Control Boards (SCB) providing a
complete electrical and mechanical infrastructure for up to 24 circuit boards with
15-mm spacing. The SCB includes a level 1 Ethernet switch with 26 100Base-T
backplane connections, one 1000Base-T connection and two 100Base-TX connections
in the front. Its primary usage is to provide Ethernet communication between a sub
rack's processor boards.

The Gigabit Ethernet Switch Board (GESB) is a state-of-the-art Gigabit Ethernet
switch made by Ericsson. It provides many features but its obvious usage is to enable
communication between different sub racks. When the SCBs' Gigabit interfaces are
connected to the GESB it is possible to communicate not only between processors in a
specific sub rack but also between all processors mounted in connected sub racks.

2.1.3 Processor Boards

In the third generation of TSP the Support Processor (SP) is a Sparc based processor
board from Force Computers² adapted to GEM with a special carrier board. In TSP 4
the goal is to replace the Sparc based processor board with a processor board based
on the Intel Pentium processor. There are typically four SPs in a base cabinet,
working in a load sharing mode, which are dedicated for operation and maintenance
(O&M) as well as other support functions. They are used to start up the cluster, i.e.
deliver the appropriate boot images and operating system files as well as
applications. The SPs include all I/O units used in a TSP node, i.e. SCSI hard disks,
DVD players and tape drives, and that is why the SP internally sometimes is referred
to as I/O processor. Another important feature of the SPs is that they act as
gateways between the internal and external networks; all communication to the
TSP is managed with the SPs' external Ethernet interfaces or RS-232 serial
communication ports. The first three generations of SPs run Solaris UNIX³ but
generation 4 is aiming towards Linux.

² http://www.force.de/
³ http://www.sun.com/solaris/

The applications are executed on a cluster of Traffic Processors (TP) running the
TelORB operating system. A TP in TSP 3 is really an MXP64GX Pentium III processor
board from Teknor, which could be described as a PC integrated on a single board.

Each board is by default equipped with two 100Base-TX interfaces in the backplane,
but it is possible to have at least two additional interfaces in the front. A TP-IP board
is a TP mounted with a dual port Ethernet PMC⁴ module. The TPs can also be
configured to support Signalling System number 7 (SS7), an out-of-band signalling
architecture used in telecom networks providing service and control capabilities.
Similar to TP-IP, a standard TP mounts a PMC module providing the SS7
functionality. This configuration is referred to as Signalling Terminal (ST). The PMC
enables versatile usage of the TP since it provides means to add functionality as it is
needed.

In TSP 4.0 the TPs can run both Linux and TelORB. The two operating systems can
coexist, but a single TP can of course not run both at the same time.

⁴ PCI Mezzanine Connector modules are small PCI cards mountable to the PMC interface standard.

2.2 TELORB

TelORB is a distributed operating system with real-time characteristics suitable for
controlling, e.g., telecom applications. It can be loaded onto a group of processors
and the group will behave like one system (figure 2).

Applications run on top of TelORB. Different parts of an application can run on
different processors and communication among these parts is possible in a
transparent manner. A TelORB system is described as a truly scalable system,
meaning that the capacity is a linear function of the number of processors. If there is
a need to increase the processing capacity you can add processors in run-time
without disturbing ongoing activities.

Figure 2 – Overview of the hardware and software used in a TelORB system.


TelORB also includes a database called DBN. It is distributed over the cluster of
TelORB processors and it is stored in their primary memory. Since it is distributed,
every process can access any database instance no matter which processor it is
running on. The distributed DBN database has unique characteristics and is a key
component in a working TelORB application.

In addition to TelORB’s real-time and linearity characteristics the system can also be
described as [Ericsson02]:

- Open, since the external protocols of the system are standardised protocols, e.g.
IIOP and TCP/IP. IIOP, which is part of CORBA and implemented via the ORB, is also
used for managing the TelORB system. The processors used are commercially
available and applications can be programmed in C++ and Java.

- Fault Tolerant, since a TelORB application is extremely fault tolerant, achieved
primarily by duplication of functionality and persistent data on at least two
processors. There exists no single point of failure, meaning that one component
fault alone will not reduce the system's availability.

- High Performance, since the true linear scalability makes TelORB applications
capable of handling large amounts of data.

- Object Oriented, since a TelORB system supports common object oriented
programming languages such as C++ and Java. TelORB objects are specified using
IDL, which is supported by the CORBA standard.

2.2.1 Communication

TSP uses well-specified and open protocol stacks for communication both internally
and externally (figure 3). A TelORB zone is internally built around two switched
Ethernet networks physically mounted in the GEM back plane. Thus all processors
have direct contact with all other TelORB processors in the same zone.

Figure 3 – Overview of the provided communication possibilities.



Communication between TelORB processors utilises the Inter Process
Communication protocol (IPC), which is a protocol layered directly on top of the MAC
layer. For internal communication between I/O processors and TelORB processors,
common IP protocol stacks such as UDP/IP and TCP/IP are used. The internal
processor communication is never exposed to the outside world.

For operation, maintenance and supervision purposes, the support processors are
accessible externally using IP protocol stacks such as TCP/IP.

To provide geographical network redundancy, two TelORB zones are able to
cooperate and serve as redundant copies. Updates of the two TelORB databases are
transferred between the zones using TCP/IP connections directly between TelORB
processors in different zones. Virtual IP (VIP) is a function used as an interface towards
external IP networks. TSP also supports SS7, an out-of-band signalling architecture
used in telecom networks providing service and control capabilities, and RS-232,
which is standard serial communication.

2.2.2 TelORB Processes

Everything executing on a TelORB processor executes in processes. A process is a
separate execution environment, and in TelORB processes execute in parallel with
other processes. Every process is an instance of a special process type defined in
Delos, which is TelORB's internal specification language. A process cannot affect
another process except through special communication mechanisms called Dialogues,
which are software entities, also defined in Delos, handling communication between
processes. Dialogues are based on IPC, which takes care of packaging, sending,
receiving and unpacking.

Processes can be static or dynamic. Static processes are started when the system
starts and they are always running; if a static process is destroyed, it is automatically
restarted by the operating system. A dynamic process is created from other process
instances but never automatically restarted. Dynamic processes are started on
request and terminated when their task is completed.

A static process could for example supervise a group of subscribers. Whenever a
subscriber calls, the supervising process starts a dialogue for creation of a dynamic
process to handle that particular call. When the call is finished the process closes the
dialogue and the dynamic process terminates. The dynamic process is not necessarily
executed on the same processor as its static parent process.

Dynamic processes are used for increasing the robustness of a TelORB system. If a
software fault occurs in a call, only the corresponding distributed dynamic process will
be shut down; all other calls are unaffected and the static parent process can
immediately start a new dynamic process to handle the ongoing call. A static process
is automatically restarted by the operating system if a software fault occurs.

Every process type is associated to a distribution unit type (DUT), which in turn is
associated to a processor pool. The DUT is specified in Delos while the processor
pool is specified in a special configuration file. A processor pool can be associated
with several DUTs. When a system starts, TelORB could for example start an
instance of a static process at a processor. If that particular processor crashes,
TelORB will automatically restart that process at another processor associated to the
same DUT.

2.3 STORAGE REQUIREMENTS

Currently TSP does not support non-volatile storage other than the block devices
used by the support processors and the small Flash disks used by the traffic processors
for the initial booting. It is desirable to support the use of attached reliable storage
since it promises a more versatile usage of the TSP platform. What exactly the storage
is going to be used for is really a matter for the customers.

The storage must of course be reliable and provide high availability, that is, never
ever go down. It is desirable to have a solution that is easy to maintain and that is
scalable. The solution must at least support continuous reading and writing at a rate
of 5 to 10 MBytes/s.

3 MAGNETIC DISK DEVICE BASICS

The main purpose of a magnetic hard disk, or hard drive, is to serve as long-term
inexpensive storage for information. Compared with other types of memory, e.g.
DRAM, magnetic disks are considered slow but generally a rather reliable storage
medium. Hard disks are considered non-volatile storage, meaning that data remains
even when turning off the power. A disk drive supports random access, whereas a
tape device is referred to as a sequential access technology.

3.1 DISK DEVICE TERMINOLOGY

A head-disk assembly (HDA) is the set of platters, actuator, arms and heads protected
by an airtight casing to ensure that no outside air contaminates the platters. A hard
disk device is an HDA and all associated electronics.

3.1.1 Basic Hardware Components

A hard disk consists of a set of platters (figure 4) coated with a magnetic medium
designed to store information in the form of magnetic patterns. Usually both surfaces
of the platters are coated and thus both surfaces are able to store information. The
platters are mounted by cutting a hole in the centre and stacking them onto a spindle.
The platters rotate with a constant angular velocity, driven by a spindle motor
connected to the spindle. Modern hard disks' rotational velocity is usually 5,400, 7,200
or 10,000 RPM, but there are examples of state-of-the-art SCSI disks with speeds as
high as 15,000 RPM.

Figure 4 – An overview of the most basic hard disk device terminology.

A set of arms with magnetic read/write heads is moved radially across the platters’
surfaces by an actuator. The head is an electromagnet that produces switchable
magnetic fields to read or record bit streams on a platter’s track. The heads are very
close to the spinning platters but they never touch the surfaces. In almost every disk
drive the actuator moves the heads collectively, but only one head can read or write
at a time. When the heads are correctly positioned radially, the correct surface is
chosen.

3.1.2 Data Layout

Information, i.e. bit streams of zeroes and ones can be read from or recorded to the
platters’ surfaces. Information is stored in concentric tracks. Each track is further
broken down into small arcs called sectors, each of which typically holds 512 bytes of
information. Read and write operations physically affect complete sectors since a disk
is unable to address bits within a sector. A cylinder is the vertical set of tracks at the
same radius.

Early disk devices had the same amount of sectors on all tracks and thus present an
inhomogeneous data density across the platters' surface. By placing more sectors on
tracks at the outside of the platter and fewer sectors at the inside edge of the platter,
a constant data bit density is maintained across the platter's surface (figure 5). This
technique is called zone bit recording (ZBR).

Figure 5 – An illustration of a platter divided into three different zones: 1, 2 and 3.

Typically, drives have at least 60 sectors on the outside tracks and usually fewer than
40 on the inside tracks. This changes the disk's raw data rate. The data rate is higher
on the outside than on the inside. Most ZBR drives have at least 3 zones but some may
have 30 and even more zones. All of the tracks within a zone have the same number of
sectors per track.
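As a rough worked example of the effect (the figures are illustrative, combining the
typical sector counts mentioned above with a 7,200 RPM spindle), a platter rotating at
7,200 RPM passes under the head 120 times per second, so:

Outer track: 120 × 60 sectors × 512 bytes ≈ 3.7 MBytes/s raw data rate
Inner track: 120 × 40 sectors × 512 bytes ≈ 2.5 MBytes/s raw data rate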

3.1.3 Data Encoding

As mentioned above the bit streams are encoded as series of magnetic fields
produced by the electromagnetic heads. The magnetic fields are not used for
absolute measurements, i.e. north polarity represents a zero and south polarity
represents a one. The technique used is based on flux reversals. When a head moves
over a reversal, e.g. a transition from a field with one polarity to an adjacent field of
the opposite polarity, a small voltage spike is produced that is much easier to detect
than the magnetic field's actual polarity (figure 6).

Figure 6 – Data is encoded and recorded to the magnetic coating as magnetic fields.

Suppose a series of more than 500 zeroes is being recorded in the magnetic coating
of a platter; then it is almost impossible to keep track of all the bits since it is just one
long unipolar magnetic field. To avoid problems associated with long sequences of
zeroes or ones the encoded information contains clocking synchronisation.

3.1.4 Form Factor and Areal Density

The platter’s size is the primary determinant of the disk device’s overall physical
dimensions, also generally called the drive's form factor. All platters in a disk are of
the same size and it is usually the same for all drives of a given form factor, though
not always. The most widely used disk device today is the 3.5-inch form factor disk
and it is used in a wide range of configurations from ordinary PCs to powerful storage
servers.

Traditionally, bigger platters meant more storage. Manufacturers extended the
platters as close to the width of the physical drive package as possible to maximise
the amount of storage in one drive. Despite this, the trend is towards smaller
platters and the primary reason is performance. The areal density of a disk drive is
the number of bits that can be stored per square inch or centimetre. As areal density
is increased, the number of tracks per areal unit, referred to as track density, and the
number of bits per inch stored on each track, usually called linear density or recording
density, also increase. As data is more closely packed together on the tracks, the
data can be written and read far more quickly. The areal density is increasing so fast
that the loss of storage due to smaller platters is negligible.

3.2 SERVICE TIMES

Disk performance specifications for hard disks are generally based upon how the hard
disk performs while reading. The hard disk spends more time reading than writing, and
the service times for reading are also lower than for writing. Disk device performance is
a function of service times, which can be divided into three components: seek time,
rotational latency and data transfer time [Chen93].

Seek time is the amount of time required for the actuator to move a head to the
correct radial position, i.e. the correct track. The heads' movement is a mechanical
process and thus the seek time is a function of the time needed to initially accelerate
the heads and the number of tracks traversed. Seek time is the most discussed
measurement of hard disk performance, but since the number of traversed tracks
varies it is presented with three different values:

- Average: The average seek time is the time required to seek from a random track
to another random track. Usually in the range of 8 – 10 ms but some of the latest
SCSI drives are as low as 4 ms.

- Full Stroke: The amount of time required to traverse all tracks, starting from the
innermost track to the outermost track. It is in the range of 20 ms, and this time
combined with the average seek time is close to the actual seek time for a full
hard disk.

- Track-to-Track: This measurement is the amount of time that is required to seek
between adjacent tracks, approximately 1 ms.

The heads' movement and the platters' spin are not synchronised. That is, the desired
sector can be anywhere on the track and therefore the head, when positioned over
the desired track, must wait for the desired sector to rotate under it. The amount of
time spent waiting is called the rotational latency. The waiting depends on the platter's
rate of rotation and how far the sector is from the head. The rotational latency correlates
with the disk spin; hence faster disk spins result in less rotational latency. Generally
the average rotational latency, calculated for half a rotation, is provided from:

AverageRotationalLatency(x) = 0.5 / x

Equation 1 – Average rotational latency is calculated for half a rotation. x is the platter's rate of rotation in
rounds per minute.

Some manufacturers also provide a worst case scenario, meaning that the sector just
passed the head and a full rotation is needed before the sector can be read or written.
The worst case latency is twice the amount of the average rotational latency.
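As a quick illustration of Equation 1 (the spindle speeds are common examples, not
vendor figures):

AverageRotationalLatency(7,200 RPM) = 0.5 / 7,200 min ≈ 4.2 ms (worst case ≈ 8.3 ms)
AverageRotationalLatency(15,000 RPM) = 0.5 / 15,000 min ≈ 2.0 ms (worst case ≈ 4.0 ms)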

Data transfer time is the amount of time required to deliver the requested data. It
correlates with the disk's bandwidth, which is a combination of the areal density of the
disk device medium and the rate at which data can be transferred from the platters'
surface.

Command overhead time, actually "the disk's reaction time", refers to the time that
elapses between when a command is sent and when it is executed. It is usually just
added to the much greater seek time since it is only about 0.5 ms.

Settle time is the time needed for the heads to stabilise after the actuator has moved
them. The heads must be stable enough to be able to read or write information. The
settle time is usually in the range of 0.1 ms and therefore it is often negligible.
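Putting the components together (using the typical figures quoted above, which are
only indicative), the time to reach a random sector is approximately:

SeekTime + AverageRotationalLatency + CommandOverhead ≈ 9 ms + 4.2 ms + 0.5 ms ≈ 14 ms

after which the data transfer time for the requested amount of data is added and the
settle time (≈ 0.1 ms) is usually neglected.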

3.3 DISK DEVICE RELIABILITY

Hard disk devices are manufactured under rigorous safety measures. New devices
are seldom delivered with hardware faults, but if a disk vendor accidentally delivers
drives with chemicals inside the hard disk assembly that should not be there, they
normally fail almost instantly. A failure is defined as a detectable physical change to
the hardware, a fault is an event triggered by a non-normal operation and an error is
the consequence of a fault.

3.3.1 Common reasons causing disk device failure

There are three major reasons hard disk drives fail:



- Heat is lethal to disk devices and it rapidly decreases the overall lifetime. In
reliable systems hard disk devices are often equipped with a special disk device
cooling fan. If the cooling halts and a disk gets overheated it could cause serious
hardware faults in that particular device. "Bad sectors" is one common result,
which means that some of the platter sectors are corrupt and unusable. High
heat could in a worst case scenario cause the disk's heads to get "glued" to the
platters' surfaces and cause the spindle trouble when it tries to spin the platters.

- Mishandling is of course one major reason that a hard disk fails. The small
electromechanical components are not designed for drops or earthquake-like
shocks.

- Electronics failure is common due to the heating/cooling cycles that cause breaks
in the printed circuit board or breaks in the wires inside the chips. Electronics
failures caused by these cycles are usually sudden and without warning. Ignoring
ESD safety measures (e.g. proper grounding) when handling a drive could cause
the electronics to fail due to electrostatic discharges.

3.3.2 Disk device reliability measurements

Disk device vendors present several measurements regarding disks' reliability.
Most of these measurements tend to be hard to interpret and sometimes they are
misleading, but if interpreted correctly they are helpful when comparing different disk
devices. Two important reliability measurements are:

- Mean Time Between Failures (MTBF) is the most commonly used measurement
for hard disk device reliability. MTBF is the average amount of time that will pass
between two random failures on a drive. It is usually measured in hours and
modern disk devices today are usually in the range of 300,000 to 1,200,000
hours. A common misinterpretation is that a disk device with an MTBF value of
300,000 hours (approximately 34 years) will work for as many years without
failing. This is of course not the case. It is not practical for a disk device vendor to
test a unit for 34 years, so an aggregated analysis of a large number of devices
is used instead. The MTBF should be used in conjunction with the service life
specification. Assume that a disk device has an MTBF value of X hours and a
service life of Y years. The device is then supposed to work for Y years. During
that period of time a large population of disks will, on average, accumulate X hours
of run time between failures (a small worked example is given after this list).

- Service life is the most correct measurement to use if the disk device itself is used
in a system with high reliability demands. The service life is the amount of time
before the disk device enters a period where the disk's probability of failing over
time increases.
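As a rough worked example (the numbers are illustrative, not taken from a particular
vendor), consider a population of 1,000 disks, each with an MTBF of 300,000 hours,
operated around the clock within their service life. The population accumulates
1,000 × 8,760 = 8,760,000 disk-hours per year, so roughly 8,760,000 / 300,000 ≈ 29
failures per year are expected, i.e. an annual failure rate of about 3 %, even though no
individual disk is expected to survive anywhere near 34 years of continuous use.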

3.3.3 Self Monitoring and Reporting Technology

The Self Monitoring and Reporting Technology (SMART) was first developed at
Compaq and it tries to detect possible hard disk failures before they occur. It evolved
from a technology developed by IBM called Predictive Failure Analysis. The
manufacturers analyse mechanical and electronic characteristics of failed drives to
determine relationships between predictable failures and trends in various
characteristics of the drive that suggest the possibility of slow degradation of the
drive. The exact characteristics monitored depend on the particular manufacturer.

3.4 SINGLE DISK STORAGE

Conventional hard disks are considered to be rather cheap hardware components.
For a personal computer used in the office for writing documents or at home for
playing computer games they are also considered to be sufficiently fast and reliable.
Most hard disk devices outlive the rest of the computer components in an ordinary
personal computer and when changing computer you also tend to change the disk. If
the disk in your personal computer unfortunately crashes the only one affected is
sadly you. In a real-time and business critical network with servers handling
databases with millions of clients, every single minute of down time could cost a
fortune and affect millions of people, i.e. clients paying for a working service. Single
disk storage solutions are impossible to use; critical systems must survive at least a
single disk failure.

A single disk storage system can support multiple user sessions when the disk I/O
bandwidth is greater than the per-session bandwidth requirement, by multiplexing the
disk I/O bandwidth among the users. This is achieved by retrieving data for a user
session at the disk transfer rate, buffering it, and delivering it to the user at the
desired rate. Despite new and improved I/O performance, a single disk device is
often not enough to serve as storage for a high-end server system. Vital servers and
systems need more I/O bandwidth than a sole disk is able to provide.

4 STORAGE DEVICE INTERFACES

The interface is the channel where the data transmission between a hard disk device
and a computer system takes place. It is really a set of hardware and software rules
that manage the data transmission. Physically the interface exists in many different
configurations, but traditionally most are implemented in hardware with compatible
chips on the motherboard and the disk device, linked together with a cable. ATA and
SCSI are two interfaces often utilised today, but there are emerging standards
improving the performance and connectivity as well as reliability.

4.1 ADVANCED TECHNOLOGY ATTACHMENT

The Advanced Technology Attachment (ATA) interface is mostly used in conventional
PCs. ATA is considered a low-bandwidth interface and is relatively cheap compared
with other existing interfaces. It originates from the standard bus interface first seen
on the original IBM AT computer [IBM01].

ATA is sometimes referred to as Integrated Device Electronics (IDE) and Ultra DMA,
but the real ANSI standard designation is ATA. Despite the official ATA
standardisation many vendors have invented their own names, but these are to be
considered marketing hype.

The first method ATA used for transferring data over the interface was a technique
called programmed I/O (PIO). The system CPU and support hardware execute the
instructions that transfer the data to and from the drive controller. It works well for
lower speed devices such as floppy drives, but high-speed data transfers tend to take
over all CPU cycles and simply make the system too slow. With Direct Memory
Access (DMA) the actual data transfer does not involve the CPU. The disk and some
additional hardware communicate directly with the memory. DMA is a generic term
used for a peripheral's ability to communicate directly with the memory. The transfer
speed increases because of decreased overhead, and the CPU workload significantly
decreases because it is not involved in the actual data transfer.
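As a practical illustration (Linux-specific and not part of the ATA specification; the
device name /dev/hda is only an example), the hdparm utility can be used to check
whether DMA is enabled for an ATA disk and to get a rough estimate of its read
throughput:

$ hdparm -d /dev/hda     # show whether DMA (using_dma) is currently enabled
$ hdparm -d1 /dev/hda    # enable DMA, if the driver and chipset support it
$ hdparm -tT /dev/hda    # time buffered disk reads and cached reads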

Though not standardised, Ultra-ATA is accepted by the industry as the standard that
boosts ATA-2’s performance with double transition clocking and includes CRC error
detection to maintain data integrity.

                          ATA      ATA-2    Ultra-ATA/33   Ultra-ATA/66   Ultra-ATA/100

Max. data bus transfer
speed [MBytes/s]          8.3      16.6     33             66             100

Max. data bus width       16-bit   16-bit   16-bit         16-bit         16-bit

Max. device support       2        2        2              2              2

Table 1 – An overview of different ATA specifications. Note that data rates are maximum rates only
available for short data transfers.

The standard ATA cable consists of 40 wires, each with a corresponding pin, and its
official maximum length is ~ 0.45 metres. It was originally designed for transfer
speeds less than 5 MBytes/s. When the Ultra-DMA interface was introduced the
original cable caused problems. The solution was a cable with 80 wires. It is still pin-
compatible with the original 40-pin ATA interface, since all of the additional 40 wires
are connected to ground and used as shielding.

4.2 SMALL COMPUTERS SYSTEMS INTERFACE

The Small Computers Systems Interface (SCSI) is the second most used PC interface
today, accepted as an ANSI standard in 1986. It is preferred over ATA in high-end
servers. While ATA is primarily a disk interface, it may be more correct to consider
SCSI a system-level bus, given that each SCSI device has an intelligent controller.
SCSI components are generally more expensive than ATA but they are considered
faster and they load the CPU less [IBM01].

While the performance of modern ATA transfers correlates with the speed of the
DMA, SCSI data throughput is influenced by two factors:

- Bus width is really how many bits are transferred in parallel on the bus. The
SCSI term Wide refers to a wider data bus, typically 16 bits.

- Bus speed refers to the speed of the bus. Fast, Ultra and Ultra2 are typical SCSI
terms referring to specific data rates.

Besides the two SCSI characteristics controlling the data throughput, there is
another important characteristic, signalling. There are several standards but the most
common are Single Ended (SE), High Voltage Differential (HVD) and Low Voltage
Differential (LVD). SE is the signalling used in the original standard; it is simple and
cheap but has some flaws. HVD tried to solve the problems associated with SE by
using two wires for each signal, but it is expensive, consumes lots of power and was
never really used. When LVD was defined in the Parallel Interface Standard 2, it
became feasible to increase bus speed and cable length. LVD is today the best choice
for most configurations and it is the exclusive signalling method for all SCSI modes
faster than Ultra2 (if HVD is not used).

All SCSI devices must have a unique id, typically set using jumpers, which is used for
identifying and prioritising the SCSI devices. The SCSI configuration also requires
proper bus termination and since there are almost as many signalling standards as
SCSI standards there are several types of terminators.
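As an illustration (Linux-specific, not part of the SCSI standard), the SCSI devices that
a Linux system has detected, together with their host, channel, id and lun numbers,
can be listed from the proc file system:

$ cat /proc/scsi/scsi    # list attached SCSI devices with host, channel, id and lun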

                          SCSI     Fast     Ultra     Wide      Ultra2    Wide      Ultra3
                                   SCSI     SCSI      Ultra     SCSI      Ultra2    SCSI
                                                      SCSI                SCSI

Max. data bus transfer
speed [MBytes/s]          5        10       20        40        40        80        160

Max. data bus width       8-bit    8-bit    8-bit     16-bit    8-bit     16-bit    16-bit

Max. cable length [m]     6        3        1.5 - 3   1.5 - 3   12        12        12

Max. device support       8        8        8 - 4     8 - 4     8         16        16

Table 2 – A comparison of different SCSI generations. Note that data rates are maximum rates only available
for short data transfers.

SCSI is often described as intelligent compared to ATA. There are several reasons for
this:

- Command Queuing and Re-Ordering allows for multiple concurrent requests to
devices on the SCSI bus, while ATA only allows one request at a time.

- Negotiation and Domain Validation is a feature that automatically interrogates
each SCSI device for its supported bus speed. If the supported speed causes
errors during a validation test, the speed is lowered to increase the data bus
reliability.

- Quick Arbitration and Select allows a SCSI device to quickly access the bus after
another device has finished sending data. A built-in regulation prevents high
priority devices from dominating the bus.

- Packetisation is an effort to improve SCSI bus performance by reducing
overhead.

Most SCSI implementations also support CRC and bus parity to increase data
integrity.

4.3 ISCSI

iSCSI, which is also known as Net SCSI, provides the SCSI generic layer with a
reliable network transport. It is a mapping of SCSI commands, data and status over
TCP/IP networks and enables universal access to storage devices and storage area
networks. TCP ensures data reliability and manages congestion, and IP networks
provide security, scalability, interoperability and cost efficiency. It is described in an
Internet draft and its standardisation is managed by the IP Storage Working Group of
the IETF⁵. It is still under development but there are a few working implementations,
e.g. the Linux-iSCSI Project⁶.

Figure 7 – A layered model of iSCSI.

⁵ IP Storage Working Group home page: http://www.ece.cmu.edu/~ips/
⁶ The Linux-iSCSI Project: http://linux-iscsi.sourceforge.net/

4.4 SERIAL STORAGE ARCHITECTURE

Serial Storage Architecture (SSA) is an advanced serial interface that provides higher
data throughput and scalability compared to conventional SCSI [Shim97]. Its intended
use, according to IBM, is high-end server systems that need cost-effective and high-
performance SCSI alternatives.

Serial Storage Architecture nodes (e.g. devices, subsystems and local host
processors) are able to aggregate several links' bandwidth. Common configurations
use one, two or four pairs of links. A pair consists of one in-link and one out-link. Each
link supports 20 MBytes/s bandwidth, thus the aggregated link bandwidth is 40, 80 or
even 160 MBytes/s depending on the number of pairs utilised by the SSA
configuration [IBM01]. SSA supports several flexible interconnection topologies,
which include string, loop and switched architectures. If the medium interconnecting
the nodes is copper, the maximum distance between two nodes is 25 metres, but if
fibre optics is used, the distance is extendable up to 10 km. An SSA loop enables
simultaneous communication between multiple nodes, which results in higher
throughput. SSA supports up to 128 devices and a fairness algorithm, which is
intended to provide fair bandwidth sharing among the devices connected to a loop.
Hot swapping, auto configuration of new devices and support for multiple
communication paths are features making the systems utilising SSA configurations
more available.

4.5 FIBRE CHANNEL

Fibre Channel (FC) is a rather new open industry-standard interface, but it has
attained a strong position in Storage Area Networks (SAN). FC provides the ability to
connect nodes in several flexible topologies and makes the storage local to all
servers in the SAN. FC supports topologies such as fabrics (analogous to switched
Ethernet networks) and arbitrated loops (FC-AL). FC promises high performance,
reliability and scalability.

FC-AL is really a circle where all devices share the same transmission medium. A
single loop provides a bandwidth of 100 MBytes/s but most FC nodes are able to
utilise at least two loops. The use of two loops not only enhances the data transfer but
also serves as a redundant communication path, which increases the reliability. An
FC-AL can have up to 126 addressable nodes, but even at a low node count the shared
medium might become a bottleneck.


A fabric switch is able to interconnect several loops. A device in a "public loop" gets a
unique address and it is allowed to access any other device on the same public loop.
The switched fabric address domain contains 16 M addresses and provides services
such as multicast and broadcast.

FC cabling is either copper wiring or optical fibre. It is designed to provide high
reliability and it supports redundant media as well as hot swap.

Fibre Channel supports several protocols and thanks to its multi-layered architecture,
it easily adopts new protocols. SCSI-FCP is a serial SCSI protocol using frame
transfers instead of block transfers.

5 REDUNDANT DISK ARRAYS

Disk based storage is the most popular choice for building storage configurations.
This is primarily due to the relatively low price/performance ratio for disk systems in
comparison with other forms of storage such as magnetic tape drives and solid state
memory devices (e.g. Flash disks). A disk array is basically a set of disks configured
to act as one virtual disk. The primary reason to implement a disk array is to
overcome the drawbacks of single disk storage: reliability and performance.

The array is often transparent to the system using it, which means that the system
does not need to know anything about the array's architecture. It just uses it as a
regular block device. Disk array systems are often, but not always, encapsulated
from the public environment and treated as one disk communicating via common I/O
interfaces such as SCSI and ATA.

There are three basic characteristics when evaluating disk arrays: performance,
reliability and cost [Chen93]. In every configuration there must be at least one
compromise, or else the result is a disk array with modest availability and performance
at an average cost, i.e. the same characteristics as a conventional hard disk device.

Figure 8 – The relation between the disk arrays' basic characteristics.

While RAID organisations (except striping, see section 5.2) protect and increase the
data reliability, the storage systems using the array are often unreliable. Many storage
system vendors tend to exaggerate RAID’s significance in storage systems’
availability. A system is not highly available just because the data stored in the
system is managed by some RAID organisation. What if the RAID controller fails or
the power is lost?

5.1 DISK ARRAY BASICS

5.1.1 Striping

Data striping is used to enhance I/O performance and was historically the best
solution to the problem described as "The Pending I/O Crisis". The problem, in short, is
that an I/O system's performance is limited by the performance of its networks and
magnetic disks. The performance of CPUs and memories is improving extremely fast,
much faster than the I/O units' performance. So far the modest gain in storage device
performance has been compensated for with striped disk arrays.

Striping means that a stripe of data is divided into a number of strips, which are
written to consecutive disks. Since several disks' I/O is aggregated, the performance
of the array is greatly improved compared to a single disk. Since there is no need to
calculate and store any redundant information, all I/O and storage capacity is
dedicated to user data. Hence striping is really fast and relatively cheap, but it does
not provide increased reliability; if anything, the other way around.

5.1.2 Disk Array Reliability

Adding more disks to a disk array to enhance the I/O performance significantly
increases the probability of data loss, and therefore the array's data reliability⁷ is
decreased [Schulze89]. If the disks used in an array have a mean time to failure of
MTTFdisk and the failures are assumed to be independent and to occur at a constant
rate, the corresponding value for the array is:

MTTFarray = MTTFdisk / Number_of_Disks_in_the_Array

Equation 2 – Mean Time to Failure for a disk array without any redundancy.

⁷ In this section the data reliability is equal to the data availability.

Conventional disks' service lifetime is approximately 5 years or 60 months. In a disk
array configuration with 10 disks the array's service lifetime is drastically decreased to
just 6 months according to the above relationship.

To increase disk arrays' reliability they must be made redundant, i.e. some of the
disks' capacity and bandwidth must be dedicated to storing redundant data. In case of
a failure the lost data can be reconstructed using the redundant information; an array
using this technique is called a Redundant Array of Independent Disks (RAID). The
RAID approach does not intend to increase each individual component's reliability; its
purpose is to make the array itself more tolerant to failures.

5.1.3 Redundancy

There are several different redundancy approaches to counteract the decreased
reliability caused by an increasing number of disks. The most common usable
implementations are:

- Mirroring or shadowing is the traditional approach and the simplest to implement,
but also the least storage effective, regarding MB/$. When data is written to the
array it is written to two separate disks, hence twice the amount of disk space is
needed. If one of the two disks fails the other one is used alone, if supported by
the controller or the software. Some implementations only secure the data and do
not provide any increased availability.

- Codes are parity information calculated from the data stored on the disks using
special algorithms. They are often used for both error detection and correction
despite the fact that error detection is a feature already supported by most
conventional disks, e.g. SMART. Codes are rather unusual due to calculation
overhead, complexity and because they do not significantly decrease the number
of dedicated disks needed to store the redundant information compared to
mirroring.

- Parity is a redundancy code capable of correcting any single, self-identifying
failure. Parity is calculated using bitwise exclusive-OR⁸: Parity = Disk 1 ⊕ Disk 2
⊕ Disk 3. If Disk 1 fails, exclusive-OR's nature makes it possible to regenerate it
from the available information: Disk 1 = Parity ⊕ Disk 2 ⊕ Disk 3. A small worked
example is given after this list.

⁸ Exclusive-OR, XOR, symbolised by ⊕, is defined as: 0 ⊕ 0 = 0, 1 ⊕ 0 = 1 and 1 ⊕ 1 = 0. Example of
bitwise XOR: 110 ⊕ 100 = 010.
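As a worked example of the parity principle (the bit patterns are arbitrary examples),
assume three data disks storing the 4-bit strips 1010, 0110 and 1100:

Parity = 1010 ⊕ 0110 ⊕ 1100 = 0000

If the disk holding 1010 fails, its contents are regenerated from the surviving strips and
the parity:

Disk 1 = Parity ⊕ Disk 2 ⊕ Disk 3 = 0000 ⊕ 0110 ⊕ 1100 = 1010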

If one disk fails the RAID array is still functional but running in a degraded mode;
depending on which RAID level is used the performance may be decreased. When
the failed disk is replaced, regeneration starts, which is a process that rebuilds the
replaced disk to the state prior to the failure. During regeneration the RAID array is
non-redundant, i.e. if another critical disk fails the whole RAID array also fails. Under
special circumstances mirrored RAIDs can survive multiple disk failures, and RAID
Level 6 is designed to sustain two simultaneous disk failures. How long the
regeneration takes depends on the complexity of the calculations needed to derive
the lost information.

5.1.4 RAID Array Reliability

If a RAID array is broken into nG reliability groups, each with G data disks and 1 disk
with redundant information, the RAID array's reliability could be described as:

MTTFRAID = (MTTFdisk)^2 / (nG · G · (G + 1) · MTTRdisk)

Equation 3 – The equation provides a somewhat optimistic value of the Mean Time to Failure for a RAID
array, since it does not pay attention to any other hardware.

The equation assumes that the disk failure rate is constant and that MTTRdisk is the
individual disks' mean time to repair. A low MTTR is obtained if a spare disk is
used and the system using the RAID is configured to automatically fence the failed
disk and begin regeneration. The above expression ignores all other hardware and
tends to exaggerate the RAID's MTTF value. For instance, a single RAID Level 5
group with 9 data disks and 1 redundancy disk (each disk with MTTF = 60 months =
43830 hours and MTTR = 2 hours) would according to the above expression have an
MTTFRAID of 10672605 hours = 14610 months ≈ 1218 years!
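
To make the two reliability expressions concrete, the short Python sketch below reproduces the figures quoted in the text: the 10-disk non-redundant array from Equation 2 and the single RAID Level 5 group from Equation 3. The function names are chosen for the example only.

```python
# Worked examples of Equation 2 and Equation 3.

def mttf_array(mttf_disk: float, disks: int) -> float:
    """Equation 2: MTTF of a non-redundant array of identical, independent disks."""
    return mttf_disk / disks

def mttf_raid(mttf_disk: float, mttr_disk: float, groups: int, data_disks: int) -> float:
    """Equation 3: MTTF of a RAID with nG groups of G data disks plus 1 redundant disk each."""
    return mttf_disk ** 2 / (groups * data_disks * (data_disks + 1) * mttr_disk)

# Equation 2: 10 disks with a 60-month service lifetime give 6 months for the array.
print(mttf_array(60, 10), "months")

# Equation 3: one group, 9 data disks + 1 parity disk, MTTF = 43830 h, MTTR = 2 h.
print(round(mttf_raid(43830, 2, groups=1, data_disks=9)), "hours")  # ~10 672 605 hours
```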

5.2 RAID LEVELS

In the beginning of RAID's evolution the researchers at the University of California at
Berkeley defined five different RAID organisations, or levels as they are called, level 1
to 5 [Patterson88]. Since then, RAID Levels 0 and 6 have generally been accepted,
although strictly speaking RAID Level 0 is not a redundant array. The levels' numbers
are not to be used as some kind of performance metric; they are just names for
particular RAID organisations.

In the following RAID organisation figures all white cylinders represent user data and
grey cylinders are used to store redundant data. All organisations are arranged to
provide four disks of user storage and each stack of cylinders represents a single
disk. The letters A, B, C... represent the order in which the strips are distributed
when written to the array.

5.2.1 Level 0 – Striped and Non-Redundant Disks

RAID Level 0 is a non-redundant disk array with the lowest cost and the best read
performance of any RAID organisation [Chen93]. Data striping is used to enhance I/O
performance and, since there is no need to calculate and store any redundant
information, all I/O and storage capacity is dedicated to user data (figure 9). Due to
the lack of redundancy, a single disk failure results in lost data, and RAID Level 0 is
therefore often regarded as a "non-true" RAID.

A B C D
E F G H
I J K L
M N O P

Figure 9 – A RAID Level 0 organisation is non-redundant, i.e. any single disk failure results in data-loss

Advantages:
- I/O performance is greatly improved by data striping
- No parity calculation overhead is involved
- Simple design and thus easy to implement
- All storage capacity is dedicated to user data

Disadvantages:
- Non-redundant, a single disk failure results in data-loss

Use this organisation when performance, price and capacity are more important than
reliability.
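
The round-robin distribution of strips that striping performs (the letters A, B, C... in figure 9) can be expressed with simple modular arithmetic. The Python sketch below is a minimal illustration of one common mapping, assuming four disks and strips identified by a running sequence number; it is not tied to any particular RAID implementation.

```python
# Round-robin mapping of strips onto a non-redundant (RAID Level 0) array.

def locate_strip(strip_number: int, number_of_disks: int) -> tuple:
    """Return (disk index, strip offset within that disk) for a given strip."""
    return strip_number % number_of_disks, strip_number // number_of_disks

# Reproduce the layout of figure 9: strips A..P distributed over four disks.
for i, name in enumerate("ABCDEFGHIJKLMNOP"):
    disk, offset = locate_strip(i, 4)
    print(f"strip {name}: disk {disk}, offset {offset}")
```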

5.2.2 Level 1 – Mirrored Disks

A RAID Level 1 organisation, usually referred to as mirroring or shadowing, uses twice
as many disks as a striped and non-redundant RAID organisation to improve data
reliability (figure 10). The RAID Level 1 organisation's write performance is slower than
for a single hard disk device due to disk synchronisation latency, but its reads are
faster since the information can be retrieved from the disk that at the moment presents
the shortest service time, i.e. seek time plus rotational latency. If a disk fails its copy is
used instead, and when the failed disk is replaced it is automatically regenerated and
the RAID array is again redundant.

A A E E I I M M
B B F F J J N N
C C G G K K O O
D D H H L L P P
Figure 10 – A RAID Level 1 organisation

Advantages:
- Extremely high reliability, under certain circumstances RAID Level 1 can sustain
multiple simultaneous drive failures
- Simplest RAID organisation

Disadvantages:
- Really expensive, since twice the number of user storage disks is needed

Use this organisation when reliability is top priority.

5.2.3 Level 0 and Level 1 Combinations – Striped and Mirrored or vice versa

The two most basic disk array techniques (striping and mirroring) are combined to
enhance their respective strengths, high I/O performance and high reliability. There
are two possible combinations: RAID Level 1+0 (sometimes called RAID Level 10)
and RAID Level 0+1.

RAID 1+0 is implemented as a striped array whose segments are mirrored arrays
(figure 11) and RAID 0+1 is implemented as a mirrored array whose segments are
striped arrays (figure 12). Both combinations increase performance as well as
reliability; RAID 1+0 has the same fault tolerance as RAID Level 1 while a RAID 0+1
organisation has the same fault tolerance as RAID Level 5. A drive failure in a RAID
0+1 will cause the whole array to actually degrade to the same level of reliability as a
RAID Level 0 array.

A A B B C C D D
E E F F G G H H
I I J J K K L L
M M N N O O P P

Figure 11 – A RAID Level 1+0 organisation is a striped array whose segments are mirrored arrays.

A B C D A B C D
E F G H E F G H
I J K L I J K L
M N O P M N O P

Figure 12 – A RAID Level 0+1 organisation is a mirrored array whose segments are striped arrays.

Advantages:
- High I/O performance is achieved by striping; reads especially, but also writes, are
considerably faster compared to single disk storage
- High reliability, under certain circumstances RAID Level 1+0 can sustain multiple
simultaneous drive failures
- Low overhead, no parity calculations needed

Disadvantages:
- Expensive, these organisations require twice as many disks as needed for user data
- Limited scalability

5.2.4 Level 2 – Hamming Code for Error Correction

Hamming code is an error detection and correction code technology originally used by
computer designers to increase DRAM memory reliability. RAID Level 2 utilises the
Hamming code to calculate redundant information for the user data stored on the data
disks. The information stored on the dedicated error code disks is used for error
detection, error correction and redundancy (figure 13). The number of disks used for
storing redundant information is proportional to the log2 of the total number of disks in
the system. Hence the storage efficiency increases as the number of disks increases
and compared to mirroring it is more storage efficient.

A B C D Hamming Hamming Hamming

E F G H Hamming Hamming Hamming

I J K L Hamming Hamming Hamming

M N O P Hamming Hamming Hamming

Figure 13 - A RAID Level 2 organisation uses Hamming code to calculate parity.

Advantages:
- The array sustains a disk failure
- Fewer disks are needed to support redundancy compared to mirroring

Disadvantages:
- Overhead, the Hamming code is complex compared with for instance parity
- Commercial implementations are rare

5.2.5 Level 3 – Bit-Interleaved Parity

On writes, a RAID Level 3 calculates a parity code and writes the information to an
extra disk – a dedicated parity disk (figure 14). During reads the parity information is
read and checked. RAID arrays utilising parity, i.e. RAID Levels 3 to 6, are much
cheaper than the other organisations discussed. They use fewer hard disks to provide
redundancy and utilise conventional disk controllers' features to detect disk errors.

RAID Level 3 is a bit-interleaved organisation and it is primarily used in systems that
require high data bandwidth but not as high I/O request rates. In a bit-interleaved
array, the read and write requests access all disks, the data disks as well as the parity
disk. Hence the array is only capable of serving one request at a time. Since a
write request accesses all disks, the information needed to calculate the parity is
already known and re-reads are thus unnecessary. When the parity has been
calculated it is written to the dedicated parity disk, a write that is limited by that
single disk's I/O performance.

A B C D A-D parity

E F G H E-H parity

I J K L I-L parity

M N O P M-P parity

Figure 14 - A RAID Level 3 organisation uses bit-interleaved parity. Though the organisation is similar to
RAID Level 4, it is important to distinguish between a bit-oriented and a block-oriented disk array.

Advantages:
- High data bandwidth
- Cheap, when parity is used for redundancy fewer disks are needed compared to
mirroring
- Easy to implement compared to higher RAID Levels since a dedicated parity disk
is used

Disadvantages:
- Low I/O request rate, and if the average amount of data requested is low the disks
spend most of their time seeking

Used with applications requiring very high throughput and where the average amount
of data requested is large, i.e. high bandwidth but low request rates, e.g. video
production and multimedia streaming.

5.2.6 Level 4 – Block-Interleaved Parity

A block-interleaved parity disk array is organised similarly to a bit-interleaved parity
array, but instead of interleaving the data bit-wise it is interleaved in blocks of a
predetermined size. The size of the blocks is called the striping unit. If the size of the
data to read is less than a stripe unit only one disk is accessed, hence multiple read
requests can be serviced in parallel if they map to different disks. When information is
recorded it may not affect all disks, and since all data in a stripe (a group of
corresponding strips, e.g. strips A, B, C and D in figure 15) is needed to calculate
parity, some strips may be missing. This parity calculation problem is solved by
reading the missing strips and then calculating the parity, a rather performance-
decreasing operation. Because the parity disk is accessed on all write requests it can
easily become a bottleneck and thus decrease the overall array performance,
especially when the write load is high.

A B C D A-D parity

E F G H E-H parity

I J K L I-L parity

M N O P M-P parity

Figure 15 - A RAID Level 4 organisation

Advantages:
- High read performance, especially for many small reads requesting information
less than a stripe unit
- Cheap, since only one disk is dedicated to store redundant information

Disadvantages:
- Low write performance, especially for many small writes
- The parity disk can easily become a bottleneck since all write requests access
the parity disk

This RAID organisation is seldom used because of the parity disk bottleneck.

5.2.7 Level 5 – Block-Interleaved Distributed Parity

The block-interleaved distributed parity disk array organisation distributes the parity
information over all of the disks and hence the parity disk bottleneck is eliminated
(figure 16). Another consequence of parity distribution is that the user data is
distributed over all disks and therefore all disks are able to participate in servicing
read operations. The performance also depends on how the parity is distributed over
the disks. A common distribution, often considered to be the best, is called the left-
symmetric parity distribution.

A B C D A-D parity

F G H E-H parity E
K L I-L parity I J
P M-P parity M N O
Q-T parity Q R S T

Figure 16 - A RAID Level 5 organisation with left-symmetric parity distribution that is considered to be the best
parity distribution scheme available.
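
The left-symmetric placement in figure 16 follows a simple rule: the parity rotates one disk to the left for every stripe, and the data blocks continue on the disk after the parity disk, wrapping around. The Python sketch below reproduces the layout of figure 16; it only illustrates the placement rule and is not a model of a real controller.

```python
# Left-symmetric parity placement for a RAID Level 5 array (reproduces figure 16).

def left_symmetric_stripe(stripe: int, data: list, disks: int) -> list:
    """Return the contents of one stripe, one entry per disk."""
    parity_disk = (disks - 1 - stripe) % disks
    row = [""] * disks
    row[parity_disk] = f"{data[0]}-{data[-1]} parity"
    for i, block in enumerate(data):
        row[(parity_disk + 1 + i) % disks] = block   # data wraps around after the parity
    return row

letters = "ABCDEFGHIJKLMNOPQRST"
for stripe in range(5):
    data = list(letters[stripe * 4:(stripe + 1) * 4])
    print(left_symmetric_stripe(stripe, data, disks=5))
```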

Advantages:
- The best small read, large read and large write performance of any RAID
organisation

Disadvantages:
- Rather low performance for small writes
- Complex controller

RAID Level 5 is considered to be the most versatile RAID organisation and it is used
for a number of different applications: file servers, web servers and databases.

5.2.8 Level 6 – P+Q Redundancy

RAID Level 6 is basically an extension of RAID Level 5 which allows for additional
fault tolerance by using a second independent distributed parity scheme or two-
dimensional parity as it is called (figure 17). RAID Level 6 provides for an extremely
high data fault tolerance and three concurrent disk failures are required before any
data is lost. Every write request requires two parity calculations and parity updates.
Therefore the write performance is extremely low.

A B C D A-D parity A-D parity

F G H E-H parity E-H parity E


K L I-L parity I-L parity I J
P M-P parity M-P parity M N O
Q-T parity Q-T parity Q R S T
Figure 17 - A RAID Level 6 using a two-dimensional parity, which allows multiple disk failures.

Advantages:
- An extremely reliable RAID organisation

Disadvantages:
- Very low write performance
- Controller overhead to compute parity addresses is extremely high
- Generally complex

RAID Level 6 is considered to be one of the most reliable RAID organisations
available and it is primarily used for mission-critical applications.

5.2.9 RAID Level Comparison

                  RAID 0   RAID 1   RAID 2       RAID 3   RAID 4   RAID 5   RAID 6       RAID 10

Redundant disks   0        n        ∝ log2 k     1        1        1        2            n
needed for n                        (k = total
user disks                          number of
                                    disks)

Redundancy        None     Mirror   ECC          Parity   Parity   Parity   Dual parity  Mirror

Complexity        Medium   Low      High         Medium   Medium   High     Very high    Medium

Reliability       Low      High     High         High     High     High     Very high    High

Table 3 – Comparison of different RAID levels.


It is difficult to compare different RAID Levels and state which level is the best,
since they all have special characteristics suitable for different applications. RAID
Level 5 is the most versatile organisation, while RAID Level 6 is the array providing
the highest reliability since it sustains two concurrent disk device failures.

5.3 RAID CONTROLLERS

The unit that takes care of data distribution, parity calculations and regeneration is
called a RAID controller. They are available as hardware and software solutions but
both are based on software. The big difference between hardware and software
controllers is where the code is executed. A hardware RAID controller is superior to
software RAID in virtually every way, except cost.

Hardware RAID controllers are essentially small computers dedicated to control the
disk array. They are usually grouped as:
- Controller card or bus-based RAID: The conventional hardware RAID controller,
which is installed into the server’s PCI slot. The array drives are usually
connected to it via ATA or SCSI interface. Software running on the server is used
to operate and maintain the disk array.
- External RAID: The controller is as the name implies completely removed from
the system using it. Usually it is installed in a separate box together with the disk
array. It is connected to the server using SCSI or Fibre Channel. Ethernet or
RS232 are common interfaces for operation and maintenance.

An alternative to dedicated hardware RAID controllers is to let the host system
provide the RAID functionality, that is, take care of I/O commands, parity calculations
and distribution algorithms. Software RAID controllers are cheap compared to
hardware controllers but they require high-end systems to work properly.

6 FILE SYSTEMS

The hard disk device’s platters are the medium where the information actually is
stored; zeroes and ones are encoded as magnetic fields. A file system provides a
logical structure of how information is stored and routines to control the access to the
information recorded on a block device. File systems are in most cases hardware
independent and different operating systems are often able to use more than one file
system. The emphasis in this section is on different file systems supported by the
Linux operating system.

6.1 BASICS ABOUT LOCAL FILE SYSTEMS

Most Linux file systems make use of the same concepts as the UNIX file systems;
files are represented by inodes (see section 6.1.3) and directories are basically tables
with entries for each file in that particular directory. Files on a block device are
accessed with a set of I/O commands, which are defined in the device drivers. Some
specialised applications do not use a file system to access physical disks or
partitions; they use raw access. A database like Oracle uses such low-level access
from the application itself, not managed by the kernel.

Though the concepts are general, this subsection emphasises the Second Extended
File System (EXT2), a Linux file system currently installed with virtually all Linux
distributions.

6.1.1 Format and Partition

Before a blank disk device is usable for the first time it must be low-level formatted.
The process outlines the positions of the tracks and sectors on the hard disk and
writes the control structures that define where the tracks and sectors are. Low-level
formatting is not needed for modern disks after they have left the vendor, though older
disks may occasionally need it because their platters are more affected by heat.

Before a disk is usable by the operating system it must be partitioned, which means
dividing a single hard disk into one or more logical drives. Disks must be divided into
partitions even if only a single partition is used. A partition is treated as an independent
disk but it is really a set of contiguous sectors on the physical disk device. Typically a
disk device under Linux is divided into several partitions, each capable of holding its own kind of
file system. A partition table is an index that maps partitions to the physical location
on the hard disk. There is an upper size limit to how large certain partitions can be
depending on file system and hardware. EXT2 supports approximately 4 TB.

After a disk has been low-level formatted and partitioned it contains sectors and
logical drives. Still, it is unusable to most operating systems (unless raw access is
used) because they need a structure in which they can store files. High-level
formatting is the process of writing the file system specific structures. While a low-
level format totally cleans a disk device, a high-level format only removes the paths to
the information stored on the disk.

6.1.2 Data Blocks

The smallest manageable units on a disk device are the sectors. Most file systems,
including EXT2, are not using individual sectors to store information. Instead they are
using the concept of data blocks to store the data held in files. A data block could be
described as a continuous group of sectors on the disk. The data blocks’ sizes are
specified during the file system’s creation and they are all of the same length within a
file system; i.e. they contain the same amount of sectors. Data blocks are sometimes
referred to as Clusters and Allocation Units.

Every file's size is rounded up to an integral number of data blocks. If the block size is
1024 bytes, a 1025-byte file requires two data blocks of 1024 bytes each; the file
system thus wastes 1023 bytes. On average half a data block is wasted per file. It is
possible to derive an algorithm able to optimise the data block usage, but almost
every modern operating system accepts some inefficient disk usage in order to reduce
the processor's workload.
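
A one-line calculation makes the rounding effect concrete. The sketch below assumes the 1024-byte block size used in the example above; the function names are illustrative only.

```python
import math

def blocks_needed(file_size: int, block_size: int = 1024) -> int:
    """Number of data blocks a file occupies when sizes are rounded up to whole blocks."""
    return math.ceil(file_size / block_size)

def wasted_bytes(file_size: int, block_size: int = 1024) -> int:
    """Allocated but unused space at the end of the file's last data block."""
    return blocks_needed(file_size, block_size) * block_size - file_size

print(blocks_needed(1025), "blocks,", wasted_bytes(1025), "bytes wasted")  # 2 blocks, 1023 bytes
```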

6.1.3 Inodes

Every file in EXT2 is represented by a unique structure called an inode. Inodes are
the basic building blocks of virtually every UNIX-like file system. An inode specifies
which data blocks the file occupies, as well as its access rights, modification dates
and file type (figure 18). Each inode has a single unique number that is stored in
special inode tables.

Directories in EXT2 are actually files themselves, described by inodes containing
pointers to all inodes in that particular directory.

Device files in EXT2 (for the first ATA drive in Red Hat Linux it is typically /dev/hda)
are not “real files”, they are device handles that provide applications with access to
Linux devices.

Figure 18 – An EXT2FS inode and data blocks. (The figure shows the inode's general information together
with its direct block pointers and its indirect, double indirect and triple indirect block pointers, all of which
ultimately reference the file's data blocks.)
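
The indirection scheme shown in figure 18 determines how large a single file can become for a given block size. As a rough sketch, assuming the classic EXT2 layout of 12 direct pointers and 4-byte block numbers (other constraints, such as 32-bit size fields, may impose a lower limit in practice), the addressable file size can be estimated as follows.

```python
# Approximate upper bound on an EXT2 file size imposed by the inode's block pointers.

def max_addressable_bytes(block_size: int, direct_pointers: int = 12) -> int:
    pointers_per_block = block_size // 4           # block numbers are assumed to be 4 bytes each
    addressable_blocks = (direct_pointers
                          + pointers_per_block         # single indirect
                          + pointers_per_block ** 2    # double indirect
                          + pointers_per_block ** 3)   # triple indirect
    return addressable_blocks * block_size

for size in (1024, 2048, 4096):
    print(f"{size}-byte blocks: ~{max_addressable_bytes(size) / 2**30:.1f} GiB addressable")
```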

The EXT2 file system divides the partition, the logical volume it occupies, into a
series of blocks. The data blocks themselves are aggregated into manageable groups
called block groups. The block groups contain information about used inodes and
those that are unallocated (figure 19). Every block group contains a redundant copy of
itself and it is used as a backup in case of file system corruption.

Super Block | Group Descriptor | Block Bitmap | Inode Bitmap | Inode Table | Data Blocks

Figure 19 – An EXT2 file system Block Group

Block group number 0 contains the EXT2 file system's super block. The super block
holds basic information about the file system and provides the file system manager
with the basic data needed for handling and maintaining the file system. The EXT2
super block's magic number is 0xEF53 and that number identifies the partition as an
EXT2 file system. The Linux kernel also uses the super block to indicate the file
system's current status (a small inspection sketch follows the list below):

- “Not clean” when mounted read/write. If a reboot occurs when the file system is
dirty a file system check is forced the next time Linux boot.

- “Clean” when mounted read only, unmounted or when successfully checked.

- “Erroneous” when a file system checker finds file system inconsistencies.
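
A short read-only sketch shows how the magic number and the state can be inspected, assuming the conventional EXT2 on-disk layout in which the super block starts 1024 bytes into the partition and holds a 16-bit magic field at offset 56 followed by a 16-bit state field at offset 58. The device path is an example only and reading it requires sufficient permissions.

```python
import struct

DEVICE = "/dev/hda1"   # example path only; any EXT2 partition or image file works

with open(DEVICE, "rb") as dev:
    dev.seek(1024)                 # the super block starts 1024 bytes into the partition
    superblock = dev.read(1024)

magic, state = struct.unpack_from("<HH", superblock, 56)
print(f"magic: 0x{magic:04X}")     # 0xEF53 identifies an EXT2 file system
print("state:", "clean" if state & 1 else "not clean/erroneous")
```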

6.1.4 Device Drivers

From a file system’s point of view a block device, e.g. a hard disk device, is just a
series of blocks that can be written and read. Where the actual blocks are stored on
the physical media does not concern the file system, it is a task for the device drivers.

A major part of the Linux kernel consists of device drivers, which control the interaction
between the operating system and the hardware devices that they are associated
with. Linux file systems (and most other file systems) do not know anything about the
underlying physical structure of the disk; they make use of a general block device
interface when writing blocks to disk. The device driver takes care of the device
specifics and maps file system block requests to meaningful device information, that
is, information concerning cylinders, heads and sectors.

The block device drivers hide the differences between the physical block device types
(for instance ATA and SCSI) and, so far as each file system is concerned, the
physical devices are just linear collections of blocks of data. The block sizes may vary
between devices but this is also hidden from the users of the system. An EXT2 file
system appears the same to the application, independent of the device used to hold
it.

6.1.5 Buffers and Synchronisation

The buffer cache contains data buffers that are used by the block device drivers. The
primary function of a cache is to act as a buffer between a relatively fast device and a
relatively slow one. These buffers are of fixed sizes and contain blocks of information
that have either been read from a block device or are being written to it. It is used to
increase performance since it is possible to "pre-fetch" information that is likely to be
requested in the near future, for example the sector or sectors immediately after the
one just requested. Hard disks also have a hardware cache but it is primarily used to
hold the results of recent reads from the disk.

When a file system is mounted, that is, attached to the operating system's file system
tree structure, it is possible to specify whether or not to use synchronisation. In most
cases it is disabled by default. Synchronisation makes it possible to bypass the write
buffer cache. Briefly, this means that when a write request is acknowledged, the data
really has been written to the physical media and not only to the buffer. In some cases
this is vital, because a power loss empties all volatile memory. The buffers are then
wiped out and the information is lost despite having been acknowledged as successfully
written. It is a matter of increased performance versus increased data integrity.
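
The trade-off can also be made explicit in application code: after a write has reached the buffer cache, the application can force it onto the physical media before treating it as durable. A minimal Python sketch, with an example path:

```python
import os

PATH = "/tmp/important.dat"   # example path only

# Write through the buffer cache, then explicitly synchronise to the physical media.
with open(PATH, "wb") as f:
    f.write(b"acknowledged only after it really reached the disk\n")
    f.flush()                 # push Python's userspace buffer to the kernel buffer cache
    os.fsync(f.fileno())      # ask the kernel to write the cached blocks to the device
```

Mounting the file system with the synchronisation option described above gives a similar guarantee for every write, at the corresponding performance cost.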

6.1.6 An Example

How, then, is a text file stored on a block device such as the hard disk devices
discussed in the previous sections?

Figure 20 – A simplified overview of how a text file is stored on a block device such as a hard disk device or a
CD-ROM. (The figure shows the chain from individual bits, to sectors, to a disk divided into partitions, to block
groups, to data blocks and finally to the file itself.)

On the left-hand side (figure 20) there is a series of zeroes and ones, bits. A group of
bytes, typically 512, is in this example referred to as a hard disk sector. The sectors are
the manageable units of a disk device, which consists of numerous sectors of a
specific size. A disk device is often divided into an arbitrary number of partitions,
which in turn are divided into a number of block groups. Each block group consists of a
number of data blocks whose size is constant and specified during file system
creation. In this example the text file is contained in three file system data blocks.

6.1.7 Journaling and Logging

Non-journaled file systems, for instance EXT2, rely on file system utilities when
restarted dirty. These file system checkers (typically fsck) examine all meta-data at
restart to detect and repair any integrity problems. For large file systems this is a time-
consuming process. A logical write operation in a non-journaled file system may need
several device I/Os before it is accomplished.

Journaling file systems, e.g. JFS, use fundamental database techniques; all file
system operations are atomic transactions and all operations affecting meta-data are
also logged. Thus the recovery in the event of a system failure is just to apply the log
records for the corresponding transactions. The recovery time associated with
journalised file systems is hence much shorter than for traditional file systems, but
during normal operation a journalised file system may be less efficient since operations
are logged.

6.2 LINUX VIRTUAL FILE SYSTEM

The Linux kernel provides an abstract file system layer, which presents the processes
with a uniform way of accessing file systems, independent of their real layout. It is
called the Virtual File System (VFS) and acts as an interaction layer between the file
system calls and the specific file systems (figure 21). VFS must at all times manage
the mounted file systems because it is the only access path. To do that it maintains
data structures describing the virtual file system and the real file systems.

Figure 21 – An overview of the Linux Virtual File System and how it connects with user space processes, file
systems, drivers and hardware. (An application in user space goes through the system call interface into the
VFS, with its inode and directory caches; the VFS dispatches to specific file systems such as DOS, EXT2 and
MINIX, which use the common buffer cache and the ATA and SCSI drivers to reach the physical disks.)

File systems are either built into the kernel or provided as loadable kernel modules,
and they are responsible for the interaction between the common buffer cache, which
is used by all Linux file systems, and the device drivers.

In addition to the buffer cache, VFS also provides inode and directory caches.
Frequently used VFS inodes (similar to the EXT2 inodes) are cached in the inode
cache, which makes access to them faster. The directory cache stores a mapping
between full directory names and their inode numbers, but not the inodes for the
directories themselves. To keep the caches up to date and valid they use the Least
Recently Used (LRU) principle.
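
The Least Recently Used policy is easy to express on its own. The sketch below is a generic LRU cache in Python, meant only to illustrate the eviction principle the VFS caches rely on; it is not a model of the kernel's actual implementation.

```python
from collections import OrderedDict

class LRUCache:
    """Fixed-size cache that discards the least recently used entry when full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)          # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        if key in self._entries:
            self._entries.move_to_end(key)
        self._entries[key] = value
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)   # evict the least recently used entry

# Example: a tiny "inode cache" keyed by inode number.
cache = LRUCache(capacity=2)
cache.put(11, "inode 11")
cache.put(12, "inode 12")
cache.get(11)                                   # touch inode 11
cache.put(13, "inode 13")                       # evicts inode 12, the least recently used
print(cache.get(12))                            # None
```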

6.3 DISTRIBUTED FILE SYSTEMS

Local file systems such as EXT2, NTFS and JFS are only accessible by the systems
where they are installed. There are several approaches that export a local file system
so that it is accessible from other hosts as well. Distributed file systems as they are
called, allow sharing of files and/or completely shared storage areas.

6.3.1 Network File System

The Network File System designed by Sun Microsystems in the mid-80s allows
transparent file sharing among multiple clients and it is today the de facto standard in
heterogeneous computer environments. NFS assumes a file system that is
hierarchical and it is centralised; several hosts connect to one file server, which
manages all access to the real file system. NFS works well in small and medium size
installations, preferably local area networks. AFS, described below, is more suitable
when used in wide area networks and installations where scalability is important.

Most NFS implementations are based on the Remote Procedure Call (RPC). The
combination of host address, program number and procedure number specifies one
remote procedure. NFS is one example of such a program. The eXternal Data
Representation (XDR) standard is used to specify the NFS protocol, but it also
provides a common way of representing the data types sent over the network.

The NFS protocol was intended to be stateless; a server should not need to maintain
any protocol state information about any of its clients in order to function correctly. In
the event of a failure there is a prominent advantage with stateless servers: the
clients only need to retry a request until the server responds, and they do not need to
know why the server is down. If a stateful server goes down, the client must detect the
server failure and rebuild the stateful information or mark the operation as failed. The
idea with a nearly stateless server is the possibility to write very simple servers. It is
the NFS clients that need the intelligence.

The protocol itself should not introduce any additional state, but there are some
stateful operations available, implemented as separate services: file and record
locking, and remote execution.
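
The retry behaviour that statelessness permits can be sketched as a simple client-side loop. The code below is purely illustrative: remote_call is a placeholder for an arbitrary idempotent request and is not a real NFS or RPC API.

```python
import time

def remote_call():
    """Placeholder for an idempotent remote request such as an NFS read; raises on failure."""
    raise ConnectionError("server not responding")

def call_with_retry(request, attempts=5, delay=1.0):
    """Retry a stateless request until the server answers or we give up."""
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except ConnectionError:
            # The client does not need to know why the server is down; it simply retries.
            time.sleep(delay)
    raise TimeoutError(f"server did not respond after {attempts} attempts")

# call_with_retry(remote_call)   # with the placeholder above this would raise TimeoutError
```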

6.3.2 Andrew File System

The Andrew File System (AFS) was developed at Carnegie Mellon University to
provide a scalable file system suitable for critical distributed computing environments.
Transarc, an IBM company, is the current owner of AFS and has also released an
open source version of AFS.

AFS is suitable for wide area network installations as well as smaller local area
network installations [IBM02]. AFS is based on secure RPC and provides Kerberos
authentication to enhance security. Compared with NFS's centralised client/server
architecture AFS is somewhat different. AFS provides a common global namespace;
files are addressed unambiguously from all clients and the path does not incorporate
any mount points as for NFS.

Another significant difference is that AFS allows more than one server in one group,
or cell as it is called. AFS joins together the file systems of multiple file servers and
exports them as one file system. The clients therefore do not need to know on which
server the files are stored, which makes access to files as easy as on a local file
system.

Important files, e.g. application binaries, may be replicated to other servers. If one of
the servers goes down the client automatically accesses the file from another server
without any interruption. This feature significantly increases the availability of a critical
system. The use of several file servers also increases the efficiency, since the work is
distributed over several file servers instead of, as for NFS, letting one server manage
all requests.

6.4 THE FILE SYSTEM AND THE USER

6.4.1 User Perspective of the File System

From a user’s point of view it is totally irrelevant how the information is recorded on to
the disk and how the information is stored. From a user perspective, modern file
systems are based on three assumptions [Nielsen96]:

- Information is partitioned into coherent and disjunct units, each of which is treated
as a separate object or file.

- Information objects are classified according to a single hierarchy, the subdirectory
structure.

- Each information object is given a semi-unique file name, which users use to
access information inside the object.

The fact that information is normally stored as non-contiguous sectors of the hard disk
is hidden to the end users. The information is usually presented to the user as files,
the most common abstract level of digital information. That it is possible to read and
to write information and that it is stored in a safe and unambiguous manner is much
more important than knowing exactly on which sectors the information is stored.

6.4.2 Filesystem Hierarchy Standard

The Filesystem Hierarchy Standard (FHS), available from http://www.pathname.com/fhs/,
is a collaborative document that defines a set of guidelines and requirements for names
and locations of many files and directories under UNIX-like operating systems. Its
intended use is to support interoperability between applications and to present a
uniform file hierarchy to the user, independent of distribution. Many independent
software suppliers and operating system developers provide FHS compliant systems
and applications, which simplifies installation and configuration since the files'
directories are known.

7 SYSTEM AVAILABILITY

7.1 THE TERM AVAILABILITY

A system's availability is measured as the percentage of time the system is
available and provides its services correctly. The rest of the time is assumed to be
unplanned downtime, i.e. time when the system is unavailable. The availability
measure uses a logarithmic scale based on nines; a system with three nines of
availability is thus available 99.9% of the time. Each additional nine is associated with
more extreme requirements and especially increased costs. High Availability (HA)
refers to systems that are close to continuously available, meaning no downtime –
an expression often associated with telecom equipment that promises up to five nines
of availability.

The concept of availability incorporates both reliability and repairability, which are
measurable as MTTF and MTTR. Availability comprises at least four components
(a small calculation sketch follows the list):
- hardware availability
- software availability
- human error
- catastrophe
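
The relation between the number of nines, the yearly downtime and the MTTF/MTTR figures used earlier in this report can be written down directly. The sketch below uses the common steady-state approximation A = MTTF / (MTTF + MTTR); the numbers are examples only.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time to failure and mean time to repair."""
    return mttf_hours / (mttf_hours + mttr_hours)

def downtime_per_year(avail: float) -> float:
    """Unplanned downtime in hours per year for a given availability."""
    return (1 - avail) * 365 * 24

print(f"{downtime_per_year(0.999):.1f} h/year at three nines")    # ~8.8 hours
print(f"{downtime_per_year(0.99999):.2f} h/year at five nines")   # ~0.09 hours (~5 minutes)

# Example: MTTF = 43830 h and MTTR = 2 h give roughly four nines.
print(f"{availability(43830, 2):.6f}")
```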

The determining factor for most systems' availability is human error. More intuitive and
more automated user interfaces may prevent most unnecessary errors associated
with configuration and installation. Today's hardware generally provides good
availability. If the intention is to build an HA service, one should consider utilising one
or more of the existing technologies that improve a system's availability.

7.2 TECHNIQUES TO INCREASE SYSTEM AVAILABILITY

There are many different approaches to increasing a system's availability. Some of the
following explanations make use of example systems, but the techniques are of
course applicable to other systems than those described.

A system is as weak as its weakest point, and a system does not only involve the
components actually presenting some functionality. If, for instance, a system is
considered to be highly available but is powered from a single power supply, its
maximum availability is identical to that of the power supply it is attached to; if
the power supply fails the whole system fails. Single points of failure (SPOF) must be
avoided. Just as redundant information increases a RAID array's reliability, a system's
availability is increased by adding an extra redundant component. Adding an
additional power supply thus increases the whole system's availability, but the system
must of course be designed to use the redundant component.

If a system is vital and it is of utmost importance that it never goes down,
geographic redundancy is a final, extreme precaution. Adding redundancy to a system
increases its availability, but what if a catastrophe such as an earthquake destroys the
whole system including its redundant components? As the qualifier "geographic"
implies, geographic redundancy not only involves redundant components, it also
implies that the redundant components are not placed in the same geographic location.

When a system recognises a component as failed and a redundant component is
present, it must be possible to transfer the service from the failed to the working
component. This mechanism is called fail-over and it is often implemented in
active-standby server configurations; one server is active and presents some service,
but when it goes down the standby node is ready to take over and restart the service.
Fail-over often introduces a small time delay, and it is close to impossible to eliminate
this delay since doing so increases the risk of introducing other problems.

Heartbeats are used to monitor a system's health. A heartbeat monitor continuously
asks the node or the subsystem if it is working properly. If the answer is negative, or if
the question is unanswered after a defined number of tries, an action is triggered. It
may involve fail-over, resource fencing and other actions that provide means to
maintain the complete system's functionality. Heartbeats are available both as
hardware and software solutions, but the hardware implementations are desirable if
fast response is required.
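
A software heartbeat monitor of the kind described can be reduced to a few lines: ask periodically, count missed answers and trigger an action once a threshold is reached. The sketch below is conceptual; is_peer_alive and start_failover are placeholders, not the API of any real cluster manager.

```python
import time

MISSED_LIMIT = 3          # declare the peer dead after this many unanswered heartbeats
INTERVAL_SECONDS = 2.0

def is_peer_alive() -> bool:
    """Placeholder for a real health check, e.g. a ping over a dedicated heartbeat link."""
    return True

def start_failover():
    """Placeholder for the fail-over action: fence the peer, take over its IP and services."""
    print("peer declared dead - starting fail-over")

def monitor():
    missed = 0
    while True:
        missed = 0 if is_peer_alive() else missed + 1
        if missed >= MISSED_LIMIT:
            start_failover()
            return
        time.sleep(INTERVAL_SECONDS)

# monitor()   # with the placeholder health check this loop would simply run forever
```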

In virtually all systems it is desirable to isolate malfunctioning components. If a
component is active but not working properly, it might introduce new problems to a
system and thus damage it. Resource fencing is an approach that isolates
components identified as malfunctioning from the rest of the system to prevent them
from disturbing or harming other components. In clusters it eliminates the possibility
that a "half-dead" node presents its arbitrarily working resources; in a two-node
cluster utilising fail-over, for instance, this is very important. Suppose that the
heartbeats between the two systems are late and that the standby node after a while
declares the other node as dead. When it turns out that it was just network congestion,
both nodes already believe that they are the active one – a situation known as split
brain. If resource fencing is used it is possible to control the other node's power
supply and turn it off during a fail-over.

High availability can also be provided without any specialised hardware. Clustering is
a method where a number of collaborating computers, or nodes, provide a distributed
service and/or serve as backup for each other. Clustering requires specialised
software, so-called cluster managers, to work.

Checkpointing is another software concept, which provides clusters with the possibility
to store information about individual nodes’ processes that are vital to the system. If a
cluster member fails the checkpoint information is used to quickly allow another node
to take over the failed node’s processes and restart them. Thus the time delay
normally associated with a normal fail-over is reduced.

Suppose that a storage system is equipped with a RAID array to provide a higher
level of data reliability. If a disk in the RAID array fails it is desirable to repair or
replace the failed device without taking down the storage system. Hot swap is a
technology that increases system availability by providing the possibility to replace a
component, for instance a disk device, while the system is running. This of course
requires that the system sustains at least one failed component.

A system's availability can be increased even further if it is equipped with hot
standby. It is similar to hot swap but requires no manual intervention. If, for instance, a
disk in a RAID configuration supporting hot standby fails, the failed disk is
automatically regenerated and replaced by the hot standby. That is, an additional
component is installed but it is not used until one of the active components fails. Hot
standby sharply decreases the MTTR value for that particular component and this
also affects the system's overall MTTR.

A hardware watchdog is really a timer that is periodically reset by a system when it is
working properly. If the timer is not reset for a period of time it reaches a threshold,
and the watchdog assumes that there are problems with the system. It automatically
inactivates the system and either restarts the object or forces it to stay off-line.
Watchdogs are also available both as hardware and software, but only the hardware
solutions really provide any increased availability. Assume that a system is utilising a
software watchdog and the watchdog's environment hangs: the software watchdog is
then totally useless, whereas a hardware watchdog is shielded from any software
influence.

8 PROTOTYPE DESIGN PROPOSAL: ACTIVE — STANDBY NFS CLUSTER

Until now, this thesis has discussed basic storage components used in most modern
storage systems. The purpose of this project is to design and implement a prototype
that fulfils the telecom requirements regarding availability and performance. The
forthcoming sections discuss one possible solution and an evaluation of the solution.

I decided to enhance the availability of Network File System servers. It was desirable
to use a well-known standard such as NFS, not least because TelORB currently
supports this file system. The main drawback with centralised storage solutions is the
problems associated with availability. If a single NFS server goes down due to
hardware or software failures, all connected clients are unable to use the data stored
on the server's disks for some period of time. This is not acceptable in any high
availability system or application. Therefore I have tried to create a two-node high
availability cluster where an active node provides an NFS service and another node is
standing by, ready to take over if the active node fails. If the active node accidentally
goes down, the standby acquires the former active node's IP address and restarts its
services transparently to the clients using them.

There are systems with load sharing capabilities on top of their high availability
features, but this project focuses on HA only. Load-shared storage solutions involve
mechanisms to have two or more nodes with a homogeneous file system image, and
this is not feasible within this project's limited time.

8.1 PROPOSAL BACKGROUND

This subsection briefly explains some weaknesses associated with a single NFS
server and how it is possible to overcome them and implement a solution with
increased availability.

8.1.1 Simple NFS Server Configuration

In a simple NFS configuration there is a central server providing the file service, i.e.
exporting a file system, and a number of clients using the service (figure 22), that is,
mounting the exported file system. Putting it all together is a rather simple process;
most Linux distributions come with both NFS server and client support, and a simple
system needs only minor configuration when starting from scratch.

If the server accidentally crashes or the network goes down, the clients lose contact
with the server providing the file service and the stored information becomes
inaccessible. Hard disk mirroring and other RAID organisations are common methods
that prevent data loss in case of hard disk failure. But mirroring inside a single
machine does not increase server availability if a component other than a hard disk
fails. The server itself and the network are single points of failure (SPOF) and it is
therefore not enough to increase the server's "internal" reliability.

To exclude the server SPOF one might have a redundant server standing by, ready
to take over if the service provider goes down. This proposal makes use of this idea,
generally known as fail-over. In short, fail-over means moving services from a failing
server to another redundant server standing by.

Figure 22 – A simple NFS configuration where the server and the network are single points of failure. (The
figure shows one NFS server and a number of NFS clients connected via a single switched Ethernet.)

To exclude the access path SPOF, that is, the single Ethernet network in figure 22, it
is possible to add a redundant network analogous to the NFS server redundancy. This
prototype proposal only focuses on NFS server redundancy because the Linux NFS
implementation used does not support redundant networks. TelORB’s NFS
implementation supports redundant networks, but porting this to Linux is out of this
project's scope. Henceforth the figures show dual networks even though this is not
implemented.

8.1.2 Adding a Redundant NFS Server

If another NFS server is added to provide server redundancy, the configuration gets
more complicated (figure 23). In addition to the redundant hardware, additional
software is required if the two servers are going to work together. In this proposal
"working together" does not include any kind of load sharing.

Figure 23 – Adding a redundant server and a redundant network eliminates the server and the network single
points of failure. (The figure shows two NFS servers, 1 and 2, and a number of NFS clients connected via
switched Ethernet.)

If it is possible for server number 2 to monitor server 1's status in real time, it is
possible to restart the services provided by server 1 at server 2 if the primary fails. For
static information services, for instance a web server or a seldom-updated database,
this is basically enough to increase the availability. But for a file service, or any other
service where the information is constantly changing, the situation is somewhat more
complex. If the NFS clients write information to the active server (they usually do so)
and it goes down, it must be possible for the standby server, number 2 in the
illustration, to access the same information just written by the clients. The system,
consisting of the two servers, must have a homogeneous image of the file system. In
this report a homogeneous file system means that the in-cluster nodes see exactly the
same file system, but not that it is mounted simultaneously.

8.1.3 Adding Shared Storage

Many reliable storage solutions use redundant servers with access to some sort of
shared storage (figure 24). The obvious advantage with this approach is that there is
only one physical place where the information is stored, thus only one file system
image. It should be a straightforward process to implement this in Linux if it is
acceptable that only one server has access to the file system at a time. There are,
however, solutions available where two servers are simultaneously accessing shared
storage, typically solutions that use FC.

Figure 24 – Two shared storage possibilities: Fibre Channel Arbitrated Loop and shared SCSI. (The figure
shows two active/standby server pairs on switched Ethernet, one sharing its storage over FC-AL and the other
over shared SCSI.)

Shared storage often involves special hardware, such as shared SCSI or Fibre
Channel. Shared SCSI is really regular SCSI used by two host adapters instead of
just one, and it does not provide any hardware interface redundancy. Thus it is not
suitable for systems that require extremely high availability. Fibre Channel Arbitrated
Loop (FC-AL) is a configuration of Fibre Channel providing high throughput but also
redundant access paths as well as hot-swap capabilities. Unfortunately FC equipment
is much more expensive than conventional hardware such as SCSI (for example, the
Cheetah 73LP, a high-end FC disk from Seagate with 36.7 GB, 10k RPM and 4.7 ms
average seek time, cost $540.00 as of 2001-12-18). Both SCSI and FC-AL can of
course use RAID controllers to build disk configurations with increased reliability.

If shared storage is used, the standby node is able to mount the shared storage in
case of a fail-over and provide the clients with exactly the same file system.
Assuming that there are mechanisms for monitoring processes and servers' status
and for restarting services, the technique is rather simple. However, neither of the
hardware solutions was appropriate: shared SCSI does not provide redundancy and
FC-AL is too expensive. The proposal is therefore to try a software approach instead
and create a so-called Virtual Shared Storage.

8.1.4 Identified Components

To build a two-node NFS fail-over cluster I identified the following components:
- hardware platform
- network
- cluster service, to provide monitoring, fail-over and IP address binding
- shared storage, to present the file managers with a homogenous file system
image
- file server application

8.2 VIRTUAL SHARED STORAGE

It would be desirable to have a disk area that two servers have direct access to,
because a single file system image simplifies the solution. It is possible to create a
virtual shared storage similar to the high-end shared storage described above, using
just standard Ethernet networks and additional Linux software. These solutions could
be thought of as a cheap model of the above or as a stand-alone software solution
where geographic redundancy comes for free. Virtual shared storage is considered
shared storage with the limitation that only one file system manager can mount the file
system read/write at a time. In this fail-over proposal that is an acceptable limitation,
since only one server needs full access to the file system. The other waits until it is in
the primary state to mount the disk space, and by that time it is alone.

I have tried two different solutions, both of which are specialised Linux software
components. Both solutions basically provide a mirroring service, and it is important to
emphasise that they should not be compared with any high-performance hardware
solutions under the same conditions.

8.2.1 Network Block Device and Linux Software RAID Mirroring

The vital component in this virtual shared storage solution is the enhanced network
block device driver (NBD), which is a device driver that makes a remote resource look
like a local device in Linux. Typically it is mounted into the file system using /dev/nda.
The driver simulates a block device, such as a hard disk device or a hard disk
partition, but access to the physical device is carried across an IP-network and is
hidden from user processes.

Figure 25 – A Network Block Device configuration where Node 1 transparently accesses a block device
mounted as a local device that is physically located at Node 2. The dotted cylinder represents the network
block device mounted locally at Node 1 and the grey-shaded cylinder is the actual device used at Node 2. A
device is either a physical device or a partition.

On the server side (Node 2 in figure 25), a server daemon accepts requests from the
client daemon at Node 1. At the server, the only extra process running is the server
daemon listening at a pre-configured port. The client must, in addition to running the
client daemon, also insert a Linux kernel module before any NBD is mountable. When
the module is loaded, both parent processes are started and the block device is
mounted using, for instance, the native Linux EXT2 file system, it can be used as a
conventional block device. In figure 25 the grey-shaded cylinder represents the real
partition or disk device, which is mounted at Node 1 as the dotted cylinder.

The enhanced network block device driver uses TCP as data transfer protocol.
Compared to UDP the TCP protocol is a much more reliable protocol due to its
consistency and recovery mechanisms. The developers have accepted the extra
overhead associated with TCP because it significantly simplifies the NBD
implementation.

Many Linux distributions are by default installed with kernels supporting Linux
software RAID. If not, recent kernels can be upgraded with a matching RAID patch.
The current RAID software for Linux supports RAID levels 0, 1, 4 and 5. It also
supports a non-standardised RAID level, linear mode, which aggregates one or more
disks to act as one large physical device. The use of spare disks for hot standby is
also supported in the current release. The 2.2.12 kernel and later are able to mount
any type of RAID as root and use the software RAID device for booting. There is also
a software package called raidtools that includes the tools needed to set up and
maintain a Linux software RAID device.

Assume that the client in the network block device example divides its local block
device into two partitions, one for operating system files and one partition identical to
the one mounted from the NBD server. It is of utmost importance that the partitions
are identically defined with regard to data block size, partition size and all other block
device specific parameters. Linux software RAID organisations require identical
partitions or devices to work, but there are no limitations on where they are
geographically or physically installed. It is therefore possible to use a geographically
local partition or device together with a network block device that is locally mounted
but physically installed at another node in a software RAID configuration, assuming
they are identical.

Figure 26 – A local partition and a network block device used as a mirrored pair (RAID Level 1) in a Linux
software RAID configuration.

If the client creates a partition with parameters identical to those of the network block
device, these two partitions can be used as a RAID Level 1 configuration, i.e. disk
mirroring (figure 26). Hidden from the application using the apparently local RAID
device, all file operations are carried out locally but also mirrored to a block device
somewhere on the network.

This configuration not only increases the data reliability, it also increases the
availability of the data. If either of the two nodes whose disk devices are used in the
mirroring crashes, the data is still available from the disk in the surviving node. This
can be thought of as LAN mirroring or virtual shared storage, since everything written
on the active side is transparently written to a redundant copy, which is usable in case
of fail-over.
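
Conceptually, the RAID Level 1 layer does nothing more than write every block twice: once to the local device and once to the network block device that represents the remote disk. The heavily simplified Python sketch below illustrates that idea; the two paths are stand-ins chosen for the example and correspond to the local partition and the locally mounted NBD (such as /dev/nda).

```python
import os

LOCAL_DEVICE = "/tmp/local_mirror.img"     # stands in for the local partition
REMOTE_DEVICE = "/tmp/network_mirror.img"  # stands in for the NBD device (e.g. /dev/nda)

def mirrored_write(offset: int, block: bytes):
    """Write the same block to both halves of the mirror and wait until both are on media."""
    for path in (LOCAL_DEVICE, REMOTE_DEVICE):
        fd = os.open(path, os.O_WRONLY | os.O_CREAT)
        try:
            os.pwrite(fd, block, offset)
            os.fsync(fd)   # do not acknowledge until the block really reached the device
        finally:
            os.close(fd)

mirrored_write(0, b"\x00" * 512)   # mirror one 512-byte block at offset 0
```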

8.2.2 Distributed Replicated Block Device

Distributed Replicated Block Device (DRBD) is an open source kernel module for
Linux. It makes it possible to build a two-node HA cluster with distributed mirrors.
DRBD provides a virtual shared disk to form a highly available storage cluster; it is
similar to the RAID mirror solution but includes some extra features and is distributed
as a single software package.

The Linux virtual file system passes data blocks to a block device via file system and
device driver specific layers (figure 27). DRBD acts as a middle layer that
transparently forwards everything written to the local file system to a mirror connected
to the same network.

Figure 27 – An overview of how DRBD acts as a middle layer and forwards file system operations to a
redundant disk. (On both the active and the standby node the stack runs from the VFS through the file system
and buffer cache down to DRBD, which writes to the local disk driver and forwards the blocks over TCP/IP via
the network interfaces to the peer's DRBD layer and disk.)

Three different protocols are available, each with different characteristics suitable for
a number of applications:

- Protocol A: Signals the completion of a write request as soon as the block is
written to the local disk and sent out to the network. This protocol is best suited
for long distance mirroring. It has the lowest performance penalty of the three
protocols but it is also the least reliable DRBD protocol.

- Protocol B: A write request is considered completed as soon as the block is


written to the local disk and when the standby system has acknowledged the
reception of the block.

- Protocol C: Treats a write request as completed as soon as the block is written to


the local disk and when an acknowledgement is received from the standby system
assuring that the block is written to local disk. The most reliable protocol of the
three discussed here. It guarantees the transaction semantics in all failure cases.

For some file systems, e.g. the journaling file system JFS, it is vital that the blocks
are recorded to the media in a pre-determined order. DRBD ensures that the data
blocks are written in exactly the same order on both the primary and the secondary
disks. It is vital that the nodes in the cluster all have the same up-to-date data, and
nodes that do not have up-to-date data must be updated as soon as possible. A small
amount of information referred to as meta-data is stored in non-volatile memory at
each node. The meta-data, consisting of an inconsistent flag and a generation counter,
is used to decide which node has the most up-to-date information. The generation
counter is really a tuple of four counters:

<human-intervention-count, connected-count, arbitrary-count, primary-indicator>

During normal operation, data blocks are mirrored as they get written in real-time. If a
node rejoins a cluster after some down time, the cluster nodes are in need of
synchronisation. The meta-data is used during a cluster node’s restart in order to
identify the node with the most up-to-date data. When the most up-to-date node is
identified the nodes are synchronised using one of the two mechanisms supported by
DRBD:

- Full synchronisation: the common way to synchronise two nodes is to copy each
block from the up-to-date node to the node in need of an update. This is not
efficient in terms of performance.

- Quick synchronisation: if a node leaves the cluster for a short time, a memory
bitmap that records all block modifications is used to update specifically the
blocks modified during the node’s absence. DRBD’s requirement for “a short time”
is that the active node is not restarted during this time.

Synchronisation is designed to run in parallel with the data block mirroring and other
services, so that it does not affect the node’s normal operation. The synchronisation
can therefore only use a limited amount of the total network bandwidth.

8.3 INTEGRATION OF THE COMPONENTS

8.3.1 Linux NFS Server

NFS on Linux was made possible by a collaborative effort of many people and
currently version 3 is considered the standard installation. NFS version 4 is under
development as a protocol (http://www.nfsv4.org) and includes many features from
other file system competitors such as the Andrew File System and Coda File System.
The advantage of NFS today is that it is mature, standard and robustly supported
across a variety of platforms.

All Linux kernels version 2.4 and later have full NFS version 3 functionality. For
kernels version 2.2.14 and above there are patches available that provide NFS
version 3 and reliable file locking. Linux NFS is backward compatible and thus the
version 3 software supports version 2 implementations as well.

8.3.2 Linux-HA Heartbeat

A high availability cluster is a group of computers which work together in such a way
that a failure of any single node in the cluster will not cause the service to become
unavailable [Robertson00]. Heartbeat is open source software that provides the
possibility to monitor another system’s health by periodically sending “heartbeats” to it.
If the response is delayed or never received it is possible to define actions that
hopefully increase the complete system’s availability. Heartbeat is highly configurable
and it is possible to develop custom scripts suitable for specific purposes.

8.3.3 The two-node High Availability Cluster

The basic idea for this prototype proposal is to make the Linux NFS server highly
available. I chose NFS because it is today’s de facto standard distributed file system
and under Linux, NFS version 3 is considered a mature and stable implementation.
Whether NFS is the best choice or not is not really an issue in this section. The purpose
of the network mass storage project is to build a prototype and it is of course easier to
implement a prototype with well-known and already working components.

Figure 28 – Overview of the two-node HA cluster design proposal: an active and a standby NFS node
share a virtual cluster IP on a switched Ethernet network, exchange heartbeats over dedicated paths and
keep LAN-mirrored disk arrays.

The obvious problem with this kind of centralised NFS file server is of course that its
availability is exactly the same as that of the server it is running on. Any single point of
failure in a system is unacceptable, even if it is a complete server. Using Linux-HA
Heartbeat makes it possible to have an identical standby NFS server ready to take
over the active node’s operation in case of failure (figure 28). With a hardware RAID
the data’s reliability is increased locally, but if the active node goes down the standby
node must be able to go online from any state, since we do not know when a fail-over
will be needed. Using DRBD the active node’s disk is transparently mirrored in real
time to the standby’s disk. DRBD also takes care of synchronisation if either of the two
nodes is restarted.

9 IMPLEMENTATION

Two prototypes were built: Redbox11 and Blackbox12. Each consists of two Linux
servers working as one unit – a two-node high availability cluster. This section
emphasises practical issues concerning the assembly process and involves many
technical details.

Redbox was the first of the two prototypes built. When it was working “properly”,
some of the hardware and software used in Redbox was reused to build the Blackbox
prototype. The prominent difference between Redbox and Blackbox is the hardware
configuration: Redbox is built from TSP components with some minor tweaks, while
Blackbox is assembled from conventional PC components.

9.1 REDBOX

Because Redbox was the first prototype built, the process of putting it all together
involved many time-consuming mistakes that could be avoided in the Blackbox
prototype. This section about Redbox is therefore more detailed than the next section
concerning Blackbox.

Figure 29 – Overview of the physical components and the communication possibilities in the Redbox
prototype: Red0 (192.168.0.10, cluster address 10.0.0.10) and Red1 (192.168.0.20, cluster address
10.0.0.20) in a cPCI magazine, Pinkbox (192.168.0.40) in a GEM magazine with an SCB switchboard, a
serial connection and a monitor, keyboard and mouse on Red0, all connected via a 3Com Superstack 3900
switch (192.168.0.1); Redbox as a whole is reached at 192.168.0.111.

9.1.1 Hardware Configuration

In the Ericsson hardware prototype lab I configured three Teknor MXP64GX cPCI
processor boards to run Red Hat Linux: Red0, Red1 and Pinkbox (figure 29). Two
processor boards were running in a cPCI magazine with a split backplane and they are
therefore treated as separate nodes with no connection to each other other than the
Ethernet networks. The 3.5” hard disks are fed from an external power supply and
attached to the processor boards with a self-made cable (the pin layout is available in
Appendix A). The third node was installed in a TSP cabinet with GEM and Teknor
specific adapter boards. A more detailed description of the hardware components is
available in Appendix A.

11 Redbox – the first prototype needed a name and since it is running Red Hat Linux as OS, the name had to
include Red.

12 Blackbox – the second prototype also needed a name; since Black Hat Linux is unknown to me but the cluster
node cabinets are black, the name was as obvious as for Redbox.
I used a 3Com Superstack 390013 Ethernet switch, an SCB Ethernet switchboard and
standard Ethernet twisted-pair cabling to connect Red0, Red1 and Pinkbox in a
switched private network with network number 192.168.0.0. Red0 and Red1 were,
apart from the switched network, also connected directly to each other with a second,
crossed Ethernet cable. This interface was used for internal cluster communication
between Red0 and Red1, which constitute the Redbox two-node high availability
cluster. The internal cluster network with network number 10.0.0.0 is of course not
connected to any non-cluster members. The third communication possibility is a
simple null-modem cable14 used by the cluster’s high availability software as a
redundant heartbeat path if the primary path, i.e. the cluster’s internal network, fails.

9.1.2 Operating system

Ericsson UAB is currently transferring parts of the TSP’s functionality to run under
Linux as a complement to Solaris UNIX. Red Hat Linux15 is the distribution used for
development, testing and evaluation at Ericsson UAB today and that is the main
reason why I chose to use Red Hat as the operating system during this master thesis
project. Red Hat is a mature and well-known distribution, some even say the most
widespread of them all, but there are a lot of other distributions to choose from;
SuSE, Mandrake, Debian and Slackware are just some examples.

I downloaded the latest Red Hat Linux distribution from Sunet’s ftp archive, at the
time release 7.1, which is also known as “Seawolf”. The Linux kernel distributed along
with this Red Hat release was version 2.4.2. The Linux kernel, the core component of
the Linux operating system, undergoes constant updates and the latest stable kernels
and patches are always published at the Linux Kernel Archives homepage16.

The installation process was straightforward compared to earlier distributions of
Linux. Before adding any extra functionality, i.e. cluster and virtual shared storage
software, I configured the three nodes and tested the networks and the serial
connection carefully. When all communication paths were up and running I

13 http://www.3com.com

14 A null-modem is a cable connected to a computer’s serial interface and a cheap and simple solution to get two
computers to “talk” with each other.

15 http://www.redhat.com

16 http://www.kernel.org

configured an NFS version 3 server at both cluster nodes. When I was successfully
able to mount the two exported file systems from the third client, Pinkbox, I decided
to move on with the additional software components.

9.1.3 NFS

Red Hat Linux 7.1 supports both NFS version 2 and version 3. On the server side an
administrator typically defines what partitions or directories to export and to whom.
The file /etc/exports is the access control list for the file systems that may be
exported to the NFS clients and it is used by the NFS file server daemon and the NFS
mount daemon (rpc.mountd). Security is of less importance in this proposal since
the prototype is attached to a private network and all clients attached to this particular
network are granted access. Access rights are otherwise defined in /etc/hosts.allow
and /etc/hosts.deny, but I left them blank.
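
An /etc/exports entry of the following kind grants the whole private network read and
write access; the exported directory and the client mount point below are only
illustrative, not the paths used in the prototype:

/export   192.168.0.0/255.255.255.0(rw,no_root_squash)

A client such as Pinkbox would then mount the export with:

mount -t nfs 192.168.0.111:/export /mnt/redbox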

9.1.4 Distributed Replicated Block Device

The first extra non-standard Red Hat component I installed was the distributed
replicated block device. At http://www.complang.tuwien.ac.at/reisner/drbd/ I
downloaded the latest release of the DRBD software, at the time release 0.6.1-pre5.
As the name intimates it is a pre-release, but since I am using the 2.4.2 Linux kernel
the previous 0.5.x releases will not work; they require a 2.2.x kernel. Since DRBD is a
kernel module, the kernel source code must be installed; otherwise it is impossible to
compile the DRBD source code.

At a glance the software seems really well documented, but as the first problems
arise the only way of solving them is more time for testing and tweaking. A good
approach to eliminating some basic problems is to join the DRBD developers’ mailing
list17. Philipp Reisner, the original author of DRBD, is a frequent visitor and
answers any relevant question almost immediately. I have been in touch with him
regarding a bug that arises when DRBD is used together with some NFS export
specific parameters.

17 Currently hosted at http://www.geocrawler.com/lists/3/SourceForge/3756/0/

Two identical partitions are needed to get DRBD up and running and therefore I
started by creating two small partitions, accessed via /dev/hda10, at both cluster
nodes, each approximately 100 MB. I began with small partitions because
resynchronisation is time-consuming for large disk spaces. When I finally tested the
DRBD software binaries after compilation and installation it worked, but it was really
slow. I started configuring the system, typically via /etc/drbd.conf on Red Hat Linux,
and some minor tweaking greatly improved the performance, especially the
synchronisation process. Typical parameters used in the configuration file are
resource name, nodes, ports, file system check operations, synchronisation bandwidth
utilisation and kernel panic options. See the appendix for the configuration files used
in the Redbox implementation.
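
To give an idea of what such a configuration looks like, the sketch below shows the
kind of per-resource stanza involved. The keyword spellings vary between DRBD
releases, so every parameter name here should be read as illustrative rather than as
the exact 0.6 syntax:

# /etc/drbd.conf – indicative sketch only; parameter names are illustrative
resource drbd0 {
  protocol = B                  # one of the A/B/C protocols described earlier
  fsck-cmd = fsck -p -y         # file system check operation

  on red0 {
    device  = /dev/nb0          # the DRBD block device seen by the file system
    disk    = /dev/hda10        # the underlying local partition
    address = 10.0.0.10         # cluster-internal replication network
    port    = 7788
  }

  on red1 {
    device  = /dev/nb0
    disk    = /dev/hda10
    address = 10.0.0.20
    port    = 7788
  }
}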

Because the cluster nodes use a dedicated 100 Mbit/s network (network 10.0.0.0) for
disk replication it was possible to utilise the full resynchronisation bandwidth; the
Heartbeat signalling bandwidth usage is negligible. The DRBD software is the limiting
factor; today the maximum resynchronisation throughput is approximately 7
MBytes/s.

DRBD is distributed with several scripts for various purposes. The most usable script
is a service initialisation script that can be executed either with Red Hat’s service tool
or as a standalone script. Another useful script is used for benchmarking. The script
tests the individual systems’ respective hard disk devices as well as the bulk data
transfer for each of the three protocols between the DRBD nodes. Since DRBD utilises
TCP for data transfer I found it interesting to benchmark the bulk TCP transfer over
the 100 Mbit/s Ethernet network. Hewlett-Packard has developed an interesting
application for this purpose. Netperf, as the software is called, was originally targeted
for the UNIX world but is now distributed for Linux as well, free of charge. More
information about the results can be found in the benchmark section.

9.1.5 Network Block Device and Software RAID Mirroring

In parallel with the DRBD testing I also tried to configure the software RAID mirror.
Since I found the DRBD software much more interesting I just tested this solution
briefly but found some interesting limitations when trying to integrate it with NFS. I
began with a local RAID configuration making use of two local partitions. There were
no problems setting it up and it seemed to work fine.

The next step was to test the network block device; I used the same partition used by
DRBD and successfully mounted Red1’s partition at Red0. When the two components
needed for the distributed RAID mirror worked properly I tried to integrate them into a
unit transparent to the processes using it. All configuration files used are published in
Appendix B.

The RAID regeneration process is poorly documented and I am not really sure how
it works. Red1 had some problems with the system hardware clock and I think this
affected the regeneration, because some files stored at the mirror disappeared by
mistake. I tried to solve this by integrating rdate, which is a client application that
uses TCP to retrieve the current time from another machine using the protocol described
in RFC 868. Despite the use of time synchronisation I am not sure that the problem is
truly eliminated and hence no conclusions are made.
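
Retrieving and setting the time with rdate is a one-liner; the time server address used
here is just a placeholder:

rdate -s 192.168.0.1    # set the local clock from an RFC 868 time server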

I believe that it is desirable to use DRBD rather than an NBD RAID mirror configuration,
since it is made solely for this explicit purpose and requires less manual intervention.

9.1.6 Heartbeat

Heartbeat is the high availability software for Linux I intended to use as cluster
manager. It is open source and thus free to download from http://linux-ha.org.
The problems began when I started to configure Red0 and Red1. For instance there is
a file /etc/ha.d/authkeys whose mode must be set to 600, which corresponds to
root read/write only. This is easily done by using the chmod command, but if it is not
set the software refuses to start, which is a bit confusing.
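
For reference, setting the mode and a minimal authkeys file look roughly as follows;
the CRC method is the simplest choice on a trusted private network and the exact
contents depend on the installation:

chmod 600 /etc/ha.d/authkeys

# /etc/ha.d/authkeys – minimal example using the CRC authentication method
auth 1
1 crc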

When the Heartbeat software starts it creates a virtual interface, typically eth0:0,
which is bound to the real interface eth0. If the active node is shut down this
interface is rebound, within a specified time interval, to the standby node’s
corresponding interface. It works well and it seems stable, but before I started to
integrate it with any virtual shared storage I tried to fail over a service presenting
static information, e.g. a web server. I installed Apache, a web server for Linux, and
successfully moved the service from Red0 to Red1 transparently to Pinkbox, even
though Red0 was rebooted while Pinkbox was concurrently downloading files from it.
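
The corresponding Heartbeat configuration lives in /etc/ha.d/ha.cf and
/etc/ha.d/haresources. The sketch below is only indicative: the node names and the
virtual address are taken from this setup, while the directive values and the httpd
resource are example choices:

# /etc/ha.d/ha.cf – indicative example
keepalive 2              # seconds between heartbeats
deadtime 10              # seconds before a silent node is declared dead
serial /dev/ttyS0        # redundant heartbeat path over the null-modem cable
udp eth1                 # heartbeat over the internal cluster network
node red0
node red1

# /etc/ha.d/haresources – red0 is the preferred active node; the IP address
# becomes the virtual cluster address and httpd is started on the active node
red0 192.168.0.111 httpd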

9.1.7 Integrating the Software Components into a Complete System

When every component worked satisfactorily I started integrating them one by one to
finally build a complete system. I started with Heartbeat and an NFS configuration
exporting a local device, /dev/hda10. The NFS processes were successfully
restarted at the standby node when the active node was shut down, but the Pinkbox
client was unable to access the exported partition after the fail-over. When trying to
access a file or just display a directory listing the following message was returned by
the server: “Stale NFS handle”. According to the NFS specifications this is returned
because the file referred to by that file handle no longer exists, or access to it has been
revoked [RFC1094] [RFC1813]. Thus a fail-over using two nodes’ separate disks is of
course impossible at the protocol level since the file systems are separate; this is not
really a problem, just an observation.

Another problem with the NFS implementation arose when I tried to add an NBD RAID
mirror to serve as a virtual shared storage. I manually failed over the NFS service and
remounted the local device previously used by the NBD server, but NFS still
complained about “Stale NFS handle” despite the files being there and access being
granted by the server. Reading the NFS specifications once more revealed another
NFS feature: exports and mounts are hardware dependent. In Linux every device has
a pair of numbers; in short they refer to what type of driver to use when accessing a
specific device. These numbers are called the major and minor numbers, and a
directory listing of the device directory /dev/ produces the following output (the
output is edited to fit into the report):

brw-rw----   1 root  disk   3,  0  Mar 24  2001  hda
brw-rw----   1 root  disk   3,  1  Mar 24  2001  hda1
brw-rw----   1 root  disk   3,  2  Mar 24  2001  hda2
brw-rw----   1 root  disk   9,  0  Mar 24  2001  md0
brw-rw----   1 root  disk   9,  1  Mar 24  2001  md1
brw-rw----   1 root  disk  43,  0  Mar 24  2001  nb0
brw-rw----   1 root  disk  43,  1  Mar 24  2001  nb1
brw-r--r--   1 root  root  43,  0  Oct 16  2001  nda

The first number in each pair is the major number, the second is the minor number and
the last column is the device name.

hda is the first ATA disk and hda1 is the first partition on the first ATA disk; its major
number is 3 and the 1 refers to which partition it is. When using software RAID, a
device called mdx is created. As seen for the device md0, its major number is 9 and its
minor number is 0; thus its major number differs from hda’s major number. This
difference causes a “Stale NFS handle” error and thus I found it hard to use an NBD
RAID in an NFS fail-over configuration. nbx is a DRBD device and nda is an NBD
device. Since an NFS fail-over configuration using DRBD as virtual shared storage
accesses the physical device via nbx at both nodes, the problem associated with
different major numbers is eliminated.

I discarded the NBD RAID solution in favour of DRBD and tried to use Heartbeat and
two scripts distributed with the packages, datadisk and filesystem, to finally fail
over an NFS service. When I shut down the active node the services were restarted
and the DRBD device was remounted. Despite compatible major numbers, a
homogeneous file system and identical access rights, the only message I got was
“Stale NFS handle”. After some research I found out that when the NFS server is
started, a daemon called the NFS state daemon (rpc.statd) is also started. It
maintains NFS state specific information, such as mounted file systems and by whom
they are mounted, typically in /var/lib/nfs/. On the active node, with DRBD running
and mounted at /shared/, I created the following tree structure as seen from the
root:

/
  shared/
    export/
    nfs/

Since DRBD is running, the same file modifications are of course also carried out on
the standby node’s disk. I moved the NFS state information, by default stored in
/var/lib/nfs/, to the /shared/nfs/ folder and created a symbolic link
from its original location to the copy on the shared storage. Thus it is possible for the
standby node to access exactly the same information about the current NFS server
state as the active node in case of a fail-over. The /shared/export/ folder is the
actual file system exported by the NFS server. Because it resides on a DRBD device,
every file operation is automatically mirrored to the standby node’s disk. I also found it
necessary to append the following line to the existing file system table, typically
/etc/fstab, because the file system should not be mounted automatically and it
should only be accessible from the currently active node:

/dev/nb0 /shared ext2 noauto 0 0
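
In terms of commands, the state relocation described above amounts to something like
the following, run on the active node while the DRBD device is mounted; the exact
invocation is illustrative:

mount /dev/nb0 /shared              # mount the DRBD device on the active node
mv /var/lib/nfs /shared/nfs         # move the NFS state information onto the mirrored storage
ln -s /shared/nfs /var/lib/nfs      # leave a symbolic link at the original location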



With these minor tweaks the fail-over worked and a restarted node automatically
resynchronises its DRBD-managed disks; all actions are carried out in the background
and are thus hidden from clients using the exported file system. The only thing a client
process notices is a short delay of approximately 5 to 10 seconds, which is the time it
takes for the standby node to declare the active node dead and to restart the necessary
processes.

9.2 BLACKBOX

The Redbox testing was somewhat limited since all nodes were controlled from Red0,
whose processor board was a prototype with external VGA, keyboard and mouse
connectors. If Red0 is rebooted, all possibility of monitoring a fail-over is lost.

Figure 30 – Overview of the Blackbox prototype: Black0 and Black1 form Blackbox (30.0.0.111) and are
interconnected by a Gigabit Ethernet replication network (eth2, 20.0.0.0), a dedicated heartbeat network
(eth1, 10.0.0.0) and a serial null-modem link; Red0 (30.0.0.10) and Pinkbox (30.0.0.40), together with the
SCB switchboard, monitor, keyboard and mouse, reach the cluster via the 3Com Superstack 3900
(30.0.0.1).

9.2.1 Hardware Configuration

Blackbox is compiled from conventional computer components available in most
computer stores, primarily due to cost efficiency and simplicity. Compared with the
cPCI components used in Redbox this prototype is considered cheap, but it has great
theoretical performance possibilities. The only limitation besides cost was of course
that Linux must support the hardware components.

The problems began with delivery delays; the last components arrived less than two
weeks before the project’s end-date, and this influenced the time plan negatively.

Despite thorough research into Linux hardware support18, the hardware caused
compatibility problems when installing the operating system. Many of the drivers
provided were not working properly and required tweaking and re-compiling.
Unfortunately Black1’s main board’s AGP port was not working properly and, together
with a failed SCSI disk, this made me spend many hours seeking the problems’ origin.
Apart from hardware compatibility problems and delivery delays the hardware was
extremely unstable. I tried to cool down the systems with four extra fans each, but this
only raised the systems to a modest level of stability.

I used the same network components as for Redbox but utilised Blackbox’s additional
Gigabit Ethernet interfaces for disk replication: the 20.0.0.0 network. One of the two
100 Mbit/s Ethernet networks, network 10.0.0.0, was used exclusively for Heartbeat
signalling, but as for Redbox I also used a serial null-modem to provide heartbeat
redundancy. The second 100 Mbit/s network is connected to an Ericsson UAB internal
LAN with access to the Internet.

Originally I also intended to use a GESB – SCB coupling to aggregate ten 100 Mbit/s
Ethernet links to utilise the server’s 1000 Mbit/s interface. Unfortunately the delays
associated with the Blackbox hardware and poor access to TSP equipment forced me
to skip this configuration. I only used an SCB and the Superstack.

9.2.2 Software configuration

Most software and configuration files from Redbox were re-used in Blackbox; the only
updated software was DRBD. Blackbox utilises DRBD release 0.6.1-pre7. Some minor
modifications to the configuration files were of course needed to reflect the change in
hardware and communication paths.

If an NFS export is configured to use sync and no_wdelay, which are two
/etc/exports specific parameters, the NFS server’s request-response is extremely
slow. sync is used to synchronise file system operations and no_wdelay is used to
force file operations to be carried out immediately and prevent them from being
buffered. A network interface monitor19 revealed strange behaviour: when a client
performs a file operation, e.g. a file copy, the communication between the servers is
extremely low for about 5 – 10 seconds; then, for a short period of time, a fast burst of
data is sent to the server and the operation finishes. I contacted the developer to solve
the problem and his suggestion was to use DRBD protocol B instead, or to recompile
the software with a little fix he sent me. I suggest that the simplest solution is to skip
the parameters. No file system corruption has yet been detected despite several tests,
and it was important for the delayed project to move on.
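
The problematic options are set per entry in /etc/exports; an entry of roughly the
following form combines the two (the exported path and network are illustrative):

/export   192.168.0.0/255.255.255.0(rw,sync,no_wdelay)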

18 There are many databases with information about supported Linux hardware: http://www.linhardware.com,
http://www.linuxhardware.net and http://hardware.redhat.com.

19 IPTraf – http://cebu.mozcom.com/riker/iptraf/

10 BENCHMARKING AND TEST RESULTS

Benchmarking means measuring the speed with which a computer system will execute
a computing task [Balsa97]. It is difficult to specify and create valid benchmarking
tools, and most measurements are often abused and seldom used correctly. Deeper
knowledge of benchmarking is somewhat peripheral to this thesis since it is a really
wide and difficult area that requires lots of time.

From the thesis project’s point of view the most interesting measurement is how fast
clients can read and write a mounted NFS partition. But since I have implemented
two prototypes, Redbox and Blackbox, with significantly different hardware
configurations I also found it interesting to benchmark more specific parts of the
system. It is difficult to find bottlenecks but I have tried to measure the most important
components influencing the overall system performance. Each test is of course
possible to divide further into even smaller benchmark tests, but since the critical
factor is time and benchmarking is somewhat time-consuming I decided that the
general tests are enough.

10.1 BENCHMARKING TOOLS

When the Redbox implementation was working and running at an acceptable level of
stability I began looking for performance measurement tools. Since the prototype’s
intended use is to export a network file system I decided to measure components
involved in the process of reading and writing a file: block device, local file system,
network, DRBD and network file system. I had some difficulty finding appropriate
software tools for Linux, especially tools for hardware-component-specific
performance benchmarking, and that is why I dropped the block device performance
measurements. All software tools I used and describe here are open source software
or free of charge.

The Standard Performance Evaluation Corporation focuses on a standardised set of
relevant benchmarks and metrics for performance evaluation of modern computer
systems [SPEC]. The tests are unfortunately not free of charge but they are really
interesting.

An interesting Linux NFS Client performance project [LinuxNFS] is performed at the
University of Michigan and they have composed a set of benchmarking procedures
that is useful when evaluating NFS under Linux.

10.1.1 BogoMips

Linus Torvalds20 invented the BogoMips concept, but its intended use is not to serve
as a benchmarking tool. The Linux kernel uses a timing loop that must be calibrated
to the system’s processor speed at boot time. Hence, the kernel measures how fast a
certain kind of loop runs on a computer each time the system boots. This
measurement is the BogoMips value and a system’s current BogoMips rating is stored
in the processor’s state information, typically /proc/cpuinfo.

20 The original author of Linux.

BogoMips is a compound of “Bogo” and “MIPS”, which should be interpreted as
“bogus” and “Millions of Instructions Per Second”. BogoMips is related to the
processor’s speed and is sometimes the only portable way of getting some indication
of different processors’ speeds, but it is totally unscientific. It is not a valid computer
speed measurement and it should never be used for benchmark ratings. Despite
these facts there are lots of benchmark ratings derived from BogoMips
measurements. Somebody humorously defined BogoMips as “the number of million
times per second a processor can do absolutely nothing”. Though not a scientific
statement, it illustrates the BogoMips concept’s loose correlation with reality and that
is why I mention it.

10.1.2 Netperf

Netperf is a benchmark for measuring network performance. It was developed by
Hewlett-Packard and was originally targeted for UNIX but is now distributed for Linux
as well. Documentation and source are available from http://www.netperf.org/.

Netperf’s primary focus is on bulk data transfers, referred to as “streams”, and on
request/response performance using either TCP or UDP and BSD sockets21 [HP95].
When the network performance is measured between two hosts, that is how fast one
host can send data to another and/or how fast the other host can receive it, one host
acts as server and the other acts as client. The server can be started manually as a
separate process or using inetd22. The Netperf distribution also provides several
scripts used to measure TCP and UDP stream performance. The default Netperf test
is the TCP stream test and it typically creates the following output:

$ ./netperf
TCP STREAM TEST
Recv    Send    Send
Socket  Socket  Message  Elapsed
Size    Size    Size     Time     Throughput
bytes   bytes   bytes    secs.    Kbytes/sec

4096    4096    4096     10.00    8452.23

21 The BSD socket is a method for accomplishing inter-process communication, which is used to allow one
process to speak to another. More information at http://www-users.cs.umn.edu/~bentlema/unix/.

22 The Internet Daemon; current Linux distributions are distributed with the improved xinetd instead.

When I performed the benchmark tests I made use of the provided scripts. These
scripts test the network performance for different socket and packet sizes over a fixed
test time. I benchmarked both systems in both directions, e.g. I measured the Redbox
internal performance both from Red0 to Red1 and from Red1 to Red0, so that each
host acted as both server and client. If the results correlate, I assume that I have
eliminated the possibility that one host runs “faster” as server or as client than the
other, a possibility that of course affects the total result.

10.1.3 IOzone

IOzone is a free file system benchmark tool available for many different computer
systems [IOzone01]. Its source and documentation is available from
http://www.iozone.org/. IOzone is able to test file system I/O with a broad set of
different file operations: read, write, re-read, re-write, read backwards, read strided,
fread, fwrite, random read, pread, mmap, aio_read, aio_write. It is also possible to
use IOzone for NFS-specific benchmarking; typical tests are read and re-read latency
tests for operations of different sizes.

The benchmark tests utilised IOzone’s fully automatic mode to test all file operations
for record sizes from 4 kBytes to 16 MBytes and file sizes from 64 kBytes to 512 MBytes.
The typical command line I used looks like:

$ ./iozone -a -z -b result.xls -U /IOtest

The -a and -z options force IOzone to test all possible record sizes and file sizes in
automatic mode. -b is used to specify the Excel file and I have also used the mount
point option, -U, which mounts and remounts the specified mount point between
tests. This guarantees that the buffer cache is flushed. To use this option the mount
point must exist and it must be specified in the file system table, a file that contains
descriptive information about the various file systems, typically /etc/fstab.

10.1.4 Bonnie

Bonnie is also a file system benchmark, but the use of Bonnie was troublesome.
According to the brief Bonnie user manual it is important to use a file at least twice the
size of the RAM. Since the systems used are equipped with 1 GB of RAM I naturally
used a 2 GB file. That particular file size caused file system errors and forced the
systems to halt. File sizes less than twice the amount of RAM result in invalid values,
and hence it was impossible to conduct any benchmarking with the Bonnie software.

10.1.5 DRBD Performance

DRBD replication is the process where data is copied from the active server to the
standby server. Since a write on the active side is only acknowledged once the
information is also written to the standby node, this is an important component of the
overall system performance. DRBD replication was measured using the performance
script distributed along with the DRBD source code. The results were also confirmed
with a network utilisation monitor that shows the momentary amount of data passing
the network interface.

DRBD resynchronisation was measured using the Linux command time, which
calculates the elapsed time while the operation is actually performed. The test is
simple and may not be totally accurate, but I believe it is an acceptable approximation
of the performance. I tested resynchronisation on several partitions of different sizes.

10.2 REDBOX BENCHMARK

I started testing different network paths: Red0 to Red1 and Pinkbox to Redbox. There
was no remarkable difference depending on the path and the link utilisation was about
8 – 10.8 MBytes/s, which I believe is a good result. The result depends on socket
sizes as well as message sizes used by Netperf during the benchmark.

According to the DRBD performance script the maximum replication speed between
the two nodes is about 9.5 – 10.5 MBytes/s. These values are close to the maximum
of network bandwidth.

The author of DRBD claims that the maximum resynchronisation speed is
approximately 7 MBytes/s. It is today the limiting factor and my measurements
correspond to this value, despite the link’s higher bandwidth. Resynchronisation is a
process where both reading and writing are involved. This is highly prioritised by the
author and resynchronisation is going to be improved in newer versions of DRBD.

I tried to use IOzone and Bonnie when testing the NFS file system read/write
performance but it was troublesome. Therefore I wrote my own scripts, which rapidly
write numerous files of various sizes to a mounted NFS file system. In the range
of 1000 files were written and during this time I monitored the network utilisation. The
total time for the file writes and the amount of data were also used to approximate the
write performance over a longer time. With the results I estimated the write performance
to be approximately 6 – 7 MBytes/s.
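
A minimal sketch of that kind of write test is shown below; it is not the script actually
used in the measurements, just an illustration of the approach, and the file count, file
size and mount point are arbitrary:

#!/bin/sh
# Write 1000 files of 1 MB each to a mounted NFS file system and time the run.
MOUNTPOINT=/mnt/redbox        # hypothetical NFS mount point
time (
  i=0
  while [ $i -lt 1000 ]; do
    dd if=/dev/zero of=$MOUNTPOINT/testfile.$i bs=1024k count=1 2>/dev/null
    i=`expr $i + 1`
  done
)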

10.3 BLACKBOX BENCHMARK

As for Redbox I tested the Ethernet links, between Black0 and Black1 and from Red0
to Blackbox. The Blackbox internal link is Gigabit Ethernet and the result was about
70 – 95 MBytes/s. Since Red0 is using a 100 Mbit/s interface it makes no difference
that Blackbox is utilising a Gigabit interface; that result corresponds with the Redbox
results.

Since I used Gigabit Ethernet links with much greater bandwidth I expected a higher
DRBD replication speed. Sadly, DRBD does not yet take full advantage of Gigabit
links; only a small increase of about 1 MByte/s was noticed in the tests.

As for the DRBD replication results, the resynchronisation is limited by the software
rather than the hardware and no major improvements compared to Redbox were
noticed.

Using the same approach as for the Redbox NFS write performance I noticed a small
increase in speed, but since this method is considered unscientific, no conclusions
are drawn from this increase.

10.4 REDBOX VERSUS BLACKBOX

Despite the fact that Blackbox is superior in virtually all aspects compared to Redbox,
the results did not differ much except for the Gigabit links. Currently the DRBD
software is the limiting factor, but hopefully this will change as newer versions are
released.

In my final week I had the opportunity to very briefly familiarise myself with a
commercial NAS from EMC2; its fundamental design is similar to Redbox and Blackbox
but it utilises Fibre Channel as the internal storage interface and it also presents the
clients with a Gigabit interface.

I performed the same write tests mentioned above; mounting the file system from
Red0, I reached a maximum of 8 MBytes/s. Mounting the NFS file system from
Black0 and utilising the Gigabit interfaces, the corresponding value was 14.5
MBytes/s. I emphasise that the tests themselves are not verified and it is uncertain
whether these results are of significance.

10.5 FAULT INJECTION

When the prototypes were up and running I tested different fault scenarios. Both
prototypes sustain a complete in-cluster node failure, that is, removing the power from
one of the nodes. Since both prototypes are equipped with redundant heartbeat paths
they also survive any single heartbeat path failure, both Ethernet and null-modem.

If the Ethernet cable connecting the prototypes to the clients was disconnected, all
file system activities halted until it was reconnected. When the connection utilised by
the DRBD replication was removed, the mirroring also halted. After the cable was
reconnected, the software started communicating with the lost DRBD device and
initiated resynchronisation.

11 CONCLUSIONS

Personally, this project has been really successful in terms of new knowledge and
experience, but I believe that the project’s scope was somewhat wide; building a
reliable network mass storage incorporates many hardware and software components.
The most troublesome part was to find an appropriate solution that was feasible within
the project’s time frame.

Currently the prototypes sustain a complete in-cluster node failure or failures that
cause kernel panics, e.g. a hardware fault. Kernel panics restart the failing node and
initiate a fail-over at the standby node. Concurrent client reads and writes are possible
during a fail-over with the limitation of the time delay associated with the fail-over.

11.1 GENERAL

There are many solutions regarding reliable storage, but it was a bit troublesome to
find an open source solution that applies to telecom equipment. There are many
commercial solutions available, but most of them rely on datacom techniques that are
quite different from telecom requirements. In short, the task was a bit tougher
than I thought from the beginning. There are many theories but fewer actual
implementations, and my primary task was to sift through all the information.

The choice of NFS as the file system was simply because TSP already supports
NFS and because it is a well-known standard. In retrospect it might have been
desirable to test another, more experimental file system, but at the time my focus was
on making the system reliable and not on testing a new file system.

Because the software used is entirely written by the open source community it is hard
to really tell anything about its quality. DRBD proved to be the limiting factor,
resynchronisation being really slow, but if it is enhanced it is going to be a powerful
piece of software.

11.2 THE PROTOTYPES

Due to the rather tight time frame and the requirement of a working prototype, the
solution was rather limited. Several approaches are however discussed in the next
section that may drastically enhance the prototype proposal’s reliability and
performance.

An advantage with Redbox compared to any commercial solution is that it is
mountable in GEM since it is made solely of standardised Ericsson components. The
only modification needed is carriers for the disk arrays. Since it is attached to the TSP
via the GESB switchboards no modifications are needed to the TSP hardware
platform.

Simplicity pervades the proposal and both prototypes are cost effective compared
to other solutions. They both use standard components; common Ethernet is used for
communication and no specialised hardware is required except for the RAID
controllers used in Blackbox. The only real drawback is that twice the amount of disks
is needed, an issue that is solved by introducing shared storage.

11.3 DATACOM VERSUS TELECOM

There are fundamental differences between standard datacom equipment and
telecom equipment, both quantitative and qualitative: there are many cheap datacom
products while telecom products are rather few but expensive and exclusive. It seems
that availability and performance are the two most important characteristics for the
telecom industry, even ahead of cost. A telecom product must work under any
circumstance, but many datacom products are not designed for redundant usage,
even “reliable” NAS solutions. This telecom and datacom contradiction affected the
project since the solution is based entirely on standard datacom components. When
building the prototypes there was an obvious difference in hardware stability and
reliability; the cPCI components used in Redbox were much more stable than the
conventional PC hardware used when building Blackbox.

11.4 LINUX AND THE OPEN SOURCE COMMUNITY

Linux is really a hot topic in datacom today with its free networking applications and
servers. Lately, the open source community has become more accepted in other
industrial areas as well, but the question is whether Linux and the rest of the open
source community in general are mature enough to meet the demands of the telecom
industry today.

I believe that it is possible to combine open source components to build an HA system,
but whether it offers an availability that is sufficient to meet the telecom requirements
is uncertain. Perhaps it could, if the software components are improved and the
hardware platform is based on state-of-the-art components, but the software
modifications introduce new problems for commercial solutions. According to the GNU
General Public License (GNU GPL), a software license intended to guarantee the
freedom to share and change free software [GPL], all software under the GNU GPL
remains free even if it has been modified. That is, all software based on free software
must be free. I hardly believe that a company will happily spend loads of hours
improving free software only to release it free again, for anyone to use, e.g. a
competitor.

Constant updates and patches make it hard to have a stable system, at least stable
enough to be called an HA system. Linux high availability is advancing, but I consider
it optimistic to believe that it is possible to use open source software “as is”, without
any modifications. There are commercial Linux solutions that guarantee a certain
level of availability, but these are often modified Linux solutions and not clean open
source.

11.5 TSP

Does the prototype fulfil the TSP requirements for reliable network storage? As the
prototypes stand at the moment the answer must be no. Currently the prototypes are
somewhat limited in terms of scalability, maintenance and reliability. Still, it seems
that the basic idea of fail-over and virtual shared storage is working. It is a reliable
approach that is utilised in various solutions. Since it is made up of independent
components it is possible to exchange the components individually to enhance
system characteristics, e.g. change the virtual shared storage to shared storage based
on FC.

The main disadvantage with the solution is that it is basically a centralised solution.
Despite its distributed file system images and clustering features the information is
really stored in one place, at the two-node cluster. A distributed solution where all
processor boards contribute would be a more TelORB-like solution, but it is also much
more complex.

During the second half of the project, when the prototype implementation was long
since finished, I found some really interesting solutions regarding distributed and
fault-tolerant storage. These are briefly discussed in the next section.

12 FUTURE WORK

This section discusses improvements regarding the prototypes, which are assumed to
increase their performance as well as their availability. The suggestions are all
existing solutions but they are neither integrated nor evaluated.

12.1 POSSIBLE PROTOTYPE IMPROVEMENTS

Only one Ethernet network is used to access the network file system and this SPOF
must be eliminated. Ericsson has a solution, which is really a modified NFS
implementation, that provides the possibility to use redundant networks. The clients
make use of the currently available network with the lowest traffic load. It may also
be desirable to remove any features that make the NFS fail-over complex, such as
state information and security. Since it is used in an internal, protected network this
should not cause any problems.

Make use of a specialised shared storage solution such as FC to eliminate the
bottleneck associated with the virtual shared storage. This would not only increase the
performance but also the reliability and the scalability, since FC is specified to support
hot swap and redundant media.

Integrate some sort of monitoring software that continuously monitors networks and
local processes. There is an open source solution called Mon23 that makes it possible
to define actions that trigger on certain failures. Mon makes it possible to restart
local processes, redirect traffic if a network fails and kill nodes that are considered
active but are behaving strangely.

Resource fencing is another possible improvement. It eliminates the possibility that
both nodes believe that they are active – the split-brain syndrome. If a fail-over is
started it automatically initiates a process where the failed node’s power is cut off. This
forces the node to reboot and promises a more reliable fail-over, at least with respect
to split-brain.

The Logical Volume Manager (LVM)24 provides on-line storage management of
disks and disk subsystems by grouping arbitrary disks into volume groups. The total
capacity of volume groups can be allocated to logical volumes, which are accessed
as regular block devices. These block devices are resizable while on-line, so if more
storage capacity is needed it is just a matter of adding an extra disk and binding it to
the correct volume group without interrupting the ongoing processes. Logical volumes
hence decrease downtime and enhance maintainability as well as scalability.
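
As an indication of what this looks like in practice, the basic LVM operations are along
these lines; the device names and sizes are examples only:

pvcreate /dev/hdc1                  # prepare a disk partition as a physical volume
vgcreate vg0 /dev/hdc1              # create a volume group from it
lvcreate -L 500M -n lv0 vg0         # allocate a 500 MB logical volume
vgextend vg0 /dev/hdd1              # later: add another disk to the group
lvextend -L +500M /dev/vg0/lv0      # and grow the logical volume on-line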

23 Mon is a Service Monitoring Daemon available at http://www.kernel.org/software/mon/.

24 LVM is a storage management application for Linux. Further information is available from
http://www.sistina.com/lvm/.

The Heartbeat software supports the use of the software Watchdog25, which is a
daemon that checks whether the system is still working properly. If programs in user
space are no longer executed it will reboot the system. It is not as reliable as a hardware
watchdog because a total system hang also affects the Watchdog software, which
is then unable to force the system to reboot. A hardware watchdog is desirable and it
exists as standard PCI cards as well as PMC modules.

12.2 BRIEF OVERVIEW OF OTHER POSSIBLE SOLUTIONS

During the project I found many interesting solutions regarding storage that promise
increased performance, scalability, security and reliability. I mention two projects
here:

Network Attached Secure Disks (NASD) is a project at Carnegie Mellon University
supported by the storage industry leaders. The objective, in short, is to move primitives
such as data transfer, data layout and quality of service down to the storage device
itself, while a manager is responsible for policy decisions such as namespace, access
control, multi-access atomicity and client caches [Gibson99]. More information is to
be found at http://www.pdl.cs.cmu.edu/.

The Global File System (GFS) is a file system in which cluster nodes physically share
storage devices connected via a network [Soltis96]. This shared storage solution tries
to exploit the sophistication of new device technology. GFS distributes the file system
responsibilities across the nodes and storage across the devices. Consistency is
established by using a locking mechanism maintained by the storage device
controllers. More information at: http://www.globalfilesystem.org/.

25 Watchdog is available from http://www.debian.org/.

13 ACKNOWLEDGEMENTS

I would like to thank Johan Olsson who gave me the opportunity to perform my
master thesis at Ericsson UAB, my supervisor Kjell-Erik Dynesen who guided me
throughout the project and my examiner Mats Brorson.

I would also like to thank all personnel at Ericsson Utvecklings AB and especially
everyone at KY/DR who have been very helpful during my master thesis project.
Besides all the help, you have all made it a pleasant time at Ericsson with lots of floor
ball and talk about Mr Béarnaise; I am especially proud of my bronze medal in Go-Cart
racing.

14 ABBREVIATIONS

AFS       Andrew File System
ANSI      American National Standards Institute
ATA       Advanced Technology Attachment
CORBA     Common Object Request Broker Architecture
cPCI      Compact Peripheral Component Interconnect
DAS       Direct Attached Storage
DRBD      Distributed Replicated Block Device
ECC       Error Correction Code
EMC       Electromagnetic Compatibility
ESD       Electrostatic Discharge
FC        Fibre Channel
FC-AL     Fibre Channel Arbitrated Loop
GPRS      General Packet Radio Service
HDA       Head-Disk Assembly
I/O       Input/Output
IDL       Interface Definition Language
IP        Internet Protocol
IPC       Inter-Process Communication
kBytes    Kilobytes
MB        Megabyte
MBytes/s  Megabytes per Second
Mbit/s    Megabits per Second
MTTF      Mean Time To Failure
MTBF      Mean Time Between Failures
NAS       Network Attached Storage
NBD       Network Block Device
NFS       Network File System
PC        Personal Computer
PCI       Peripheral Component Interconnect
RAID      Redundant Array of Independent Disks
RFC       Request for Comments
RPC       Remote Procedure Call
RPM       Revolutions per Minute
SAN       Storage Area Network
SCSI      Small Computer System Interface
SPOF      Single Point of Failure
SS7       Signalling System number 7
TCP       Transmission Control Protocol
TSP       The Server Platform
UDP       User Datagram Protocol
UMTS      Universal Mobile Telecommunications System
VIP       Virtual IP
ZBR       Zone Bit Recording

15 REFERENCES

15.1 INTERNET

[Balsa97] André D. Balsa, 1997, “Linux Benchmarking HOWTO”, http://www.linuxdoc.org/, 2001-11-01

[Barr01] Tavis Barr, Nicolai Langfeldt, Seth Vidal, 2001, “Linux NFS-HOWTO”, http://www.linuxdoc.org/,
2001-10-10

[Dorst01] Win van Dorst, 2001, “BogoMips mini-Howto”, http://www.linuxdoc.org/, 2001-11-13

[GPL] GNU General Public License, http://www.gnu.org/licenses/, 2002-02-19

[IBM01] http://www.storage.ibm.com/, 2001-10-15

[IBM02] “The AFS File System In Distributed Computing Environments”, http://www.transarc.ibm.com/,
2002-02-11

[IOzone01] “IOzone File System Benchmark”, http://www.iozone.org/, 2001-11-23

[LinuxNFS] University of Michigan, “Linux NFS Client performance”,
http://www.citi.umich.edu/projects/nfs-perf/, 2001-11-13

[Nielsen96] Dr. Jacob Nielsen, 1996, “The Death of File Systems”, http://www.useit.com/, 2001-10-02

[SPEC] Standard Performance Evaluation Corporation, http://www.spec.org/, 2002-02-25

15.2 PRINTED

[Brown97:1] Aaron Baeten Brown, 1997, “A Decompositional Approach to Computer System Performance
Evaluation”, Center for Research in Computing Technology, Harvard University

[Brown97:2] Aaron B. Brown, Margo I. Seltzer, 1997, “Operating System Benchmarking in the Wake of
Lmbench: A Case Study of the Performance of NetBSD on the Intel x86 Architecture”, Proceedings of the
1997 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems

[Chen93] Peter Chen, Edward K. Lee, Garth A. Gibson, Randy H. Katz, David A. Patterson, 1993, “RAID:
High-Performance, Reliable Secondary Storage”, ACM Computing Surveys

[Du96] D. Du, J. Hsieh, T. Chang, Y. Wang, S. Shim, 1996, “Performance Study of Serial Storage
Architecture (SSA) and Fibre Channel - Arbitrated Loop (FC-AL)”, Technical Report, Computer Science
Department, University of Minnesota

[Ericsson01] Ericsson Internal Information, 2001, “ANA 901 02/1 System Description”

[Ericsson02] Ericsson Internal Information, 2001, “TelORB System Introduction”

[Gibson99] Garth A. Gibson, David F. Nagle, William Courtright II, Nat Lanza, Paul Mazaitis, Marc Unangst,
Jim Zelenka, 1999, “NASD Scalable Storage Systems”, Carnegie Mellon University, Pittsburgh

[HP95] Information Networks Division, Hewlett-Packard Company, 1995, “Netperf: A Network Performance
Benchmark, Revision 2.0”

[Katz89] Randy H. Katz, Garth A. Gibson, David A. Patterson, 1989, “Disk System Architectures for High
Performance Computing”, University of California at Berkeley

[Nagle99] David F. Nagle, Gregory R. Ganger, Jeff Butler, Garth Goodson, Chris Sabol, 1999, “Network
Support for Network-Attached Storage”, Carnegie Mellon University, Pittsburgh

[O’Keefe98] Matthew T. O’Keefe, 1998, “Shared File Systems and Fibre Channel”, Proceedings of the Sixth
NASA Goddard Space Flight Conference on Mass Storage Systems and Technologies

[Patterson99] David A. Patterson, 1999, “Anatomy of I/O Devices: Magnetic Disks”, Lecture Material,
University of California at Berkeley

[Patterson89] David A. Patterson, Peter Chen, Garth Gibson, Randy H. Katz, 1989, “Introduction to
Redundant Arrays of Inexpensive Disks (RAID)”, Proceedings Spring COMPCON Conference, San Francisco

[Patterson88] David A. Patterson, Garth Gibson, Randy H. Katz, 1988, “A Case for Redundant Arrays of
Inexpensive Disks (RAID)”, University of California at Berkeley

[RFC1094] Sun Microsystems Inc, 1989, “NFS: Network File System Protocol Specification”, Request for
Comments: 1094, IETF

[RFC1813] B. Callaghan, B. Pawlowski, P. Staubach, 1995, “NFS Version 3 Protocol Specification”, Request
for Comments: 1813, IETF

[Reisner01] Philipp Reisner, 2001, “DRBD”, Proceedings of UNIX en High Availability, Netherlands UNIX
User Group

[Robertson00] Alan Robertson, 2000, “Linux-HA Heartbeat System Design”, Proceedings of the 4th Annual
Linux Showcase & Conference, Atlanta

[Schulze89] Martin Schulze, Garth Gibson, Randy Katz, David Patterson, 1989, “How Reliable is RAID?”,
Proceedings Spring COMPCON Conference, San Francisco

[Shim97] Sangyup Shim, Taisheng Chang, Yuewei Wang, Jenwei Hsieh, David H.C. Du, 1997, “Supporting
Continuous Media: Is Serial Storage Architecture (SSA) Better Than SCSI?”, Proceedings of the 1997
International Conference on Multimedia Computing and Systems (ICMCS ’97)

[Soltis96] Steven R. Soltis, Thomas M. Ruwart, Matthew T. O’Keefe, 1996, “The Global File System”,
Proceedings of the Fifth NASA Goddard Space Flight Conference on Mass Storage Systems and
Technologies
