

The Definitive Guide To Building Highly Scalable Enterprise File Serving Solutions

Chris Wolf


Introduction to Realtimepublishers
by Sean Daily, Series Editor

The book you are about to enjoy represents an entirely new modality of publishing and a major first in the industry. The founding concept behind Realtimepublishers.com is the idea of providing readers with high-quality books about today's most critical technology topics, at no cost to the reader. Although this feat may sound difficult to achieve, it is made possible through the vision and generosity of a corporate sponsor who agrees to bear the book's production expenses and host the book on its Web site for the benefit of its Web site visitors. It should be pointed out that the free nature of these publications does not in any way diminish their quality. Without reservation, I can tell you that the book that you're now reading is the equivalent of any similar printed book you might find at your local bookstore, with the notable exception that it won't cost you $30 to $80. The Realtimepublishers publishing model also provides other significant benefits. For example, the electronic nature of this book makes activities such as chapter updates and additions or the release of a new edition possible in a far shorter timeframe than is the case with conventional printed books. Because we publish our titles in real time, that is, as chapters are written or revised by the author, you benefit from receiving the information immediately rather than having to wait months or years to receive a complete product. Finally, I'd like to note that our books are by no means paid advertisements for the sponsor. Realtimepublishers is an independent publishing company and maintains, by written agreement with the sponsor, 100 percent editorial control over the content of our titles. It is my opinion that this system of content delivery not only is of immeasurable value to readers but also will hold a significant place in the future of publishing. As the founder of Realtimepublishers, my raison d'être is to create dream team projects, that is, to locate and work only with the industry's leading authors and sponsors, and publish books that help readers do their everyday jobs. To that end, I encourage and welcome your feedback on this or any other book in the Realtimepublishers.com series. If you would like to submit a comment, question, or suggestion, please send an email to feedback@realtimepublishers.com, leave feedback on our Web site at http://www.realtimepublishers.com, or call us at 800-509-0532 ext. 110. Thanks for reading, and enjoy!

Sean Daily
Founder & Series Editor
Realtimepublishers.com, Inc.

Table of Contents

Introduction to Realtimepublishers

Chapter 1: Moving Beyond Current File Serving Philosophies
   State of the World
   Performance Challenges
   Management Challenges
   Availability Challenges
   Growth of Managed Data
   Today's File Serving Landscape
   Standalone Servers
   DFS
   NAS Appliances
   Failover Clusters
   Cluster Architecture
   Shared Data Clusters
   Current Storage Architectures
   SCSI
   SATA
   FC and SANs
   Switches and Hubs
   Router
   FCIP and iFCP
   iSCSI
   Clustered File Serving Gaining Momentum
   High Availability
   Consolidation Advantages
   Drive Toward Standardization
   Summary

Chapter 2: Taming Storage Growth: A Modern Perspective
   Current Storage Problems
   Availability
   Growth
   Management
   Expanding Backup Windows
   Existing Storage Solutions
   SAN
   NAS Filers
   DFS
   Virtualization
   Storage Virtualization
   Server Virtualization
   Virtual Machines
   Shared Data Clusters
   Comparing Virtual Machines and Shared Data Clusters
   Examining Unappliance vs. Appliance Solutions
   Proprietary vs. Open Solutions
   Volume Economics
   Integration with Existing Infrastructure and Investments
   The Scalability Dilemma
   Backup Challenges
   Taming Server and Storage Growth: The Non-Proprietary Approach
   Storage Consolidation via SAN
   Server Consolidation via Clustering
   Planning for Growth While Maintaining Freedom
   Summary

Chapter 3: Data Path Optimization for Enterprise File Serving
   The Big Picture of File Access
   Availability and Accessibility
   Redundant Storage
   RAID Levels
   RAID 0
   RAID 1
   RAID 5
   RAID 0+1
   RAID 1+0
   RAID 5+0
   Hardware vs. Software RAID
   Hardware RAID
   Software RAID
   Redundant SAN Fabrics
   Elements of the Redundant SAN
   Managing the Redundant SAN
   Redundant LANs
   Redundant Power
   Redundant Servers
   Shared Data Clusters
   Failover Clusters
   Proprietary Redundant Servers
   Eliminating Bottlenecks
   Architectural Bottlenecks
   Single NAS Head
   Single File Server
   Load Balancing
   Managing the Resilient Data Path
   Summary

Chapter 4: Building High-Performance, Scalable, and Resilient Windows File Serving Solutions
   Managing High-Performance and Availability Across a Windows Infrastructure
   VDS
   VSS
   Shadow Copies for Shared Folders
   Shadow Copies for Shared Folders Basics
   Enabling Shadow Copies for Shared Folders Support
   Recovering Previous Versions of a File
   Enhanced Storage and File Serving Support
   Multipath I/O Support
   STORport Driver Support
   iSCSI Support
   Improved Offline Files Support
   The Microsoft Approach to High-Availability File Serving
   MSCS
   DFS
   AD Integration
   Commercial File Serving Solutions
   PolyServe NAS Cluster
   Symantec Cluster
   Current Trends in Windows File Serving
   Benefits of Consolidation
   Benefits of Shared Storage
   Deploying Enterprise-Class Windows File-Serving Solutions
   Pre-Deployment Considerations
   Validating Server and Storage Requirements
   Summary

Chapter 5: Building High-Performance, Scalable, and Resilient Linux File-Serving Solutions
   Challenges Facing the Linux File-Serving Landscape
   Performance
   Scalability
   Availability
   Integration
   Existing Linux File-Serving Solutions
   Standalone
   NAS
   DFS
   Clustered
   Failover Clustering
   Load-Balanced Clustering
   LVS Architecture
   LVS via NAT
   LVS via IP Tunneling
   LVS via Direct Routing
   Commercial File-Serving Solutions
   PolyServe NAS Cluster
   VERITAS Cluster
   Red Hat Cluster Suite and Global File System
   Deploying Performance-Based Scalable Linux File-Serving Solutions
   Pre-Deployment Considerations
   Server Sizing
   Storage Sizing
   Managing Enterprise-Class Linux File Serving
   NFS
   What Is New in NFS v4?
   NFS Setup Checklist
   Samba
   What Is Coming in Samba 4.0?
   Samba Deployment
   Current Trends in Linux File Serving
   Migration from UNIX to Linux
   Benefits of Consolidation
   Storage Consolidation
   Summary

Chapter 6: Managing High-Performance, Scalable, and Resilient Data Across the Enterprise
   Challenges Facing Heterogeneous Networks
   Inhibited Agility
   Complexity
   Integration Concerns
   IT Risk and Compliance Considerations
   Integrating Windows and Linux File-Serving Solutions
   CIFS and NFS Integration
   Managing ACLs
   Integration with Existing Services
   Backup and Recovery
   Disaster Planning Essentials
   Development
   Disaster Planning Roles
   Traditional Backup Methodologies
   Snapshots
   Server-Free Backups
   Server-Less Backups
   Archiving and Migration
   Successful Backup Architectures
   D2T
   D2D
   D2D2T
   Benefits of Shared-Data Approaches
   Comparison: Consolidated vs. Distributed Backup Architectures
   Distributed Approach
   Consolidated Approach
   Data Recovery
   The Advantages of Freedom
   Benefits of Avoiding Proprietary Solutions
   Uncapped Scalability and Performance
   Architecture Flexibility
   Freedom of Choice
   Summary

Copyright Statement

© 2006 Realtimepublishers.com, Inc. All rights reserved. This site contains materials that have been created, developed, or commissioned by, and published with the permission of, Realtimepublishers.com, Inc. (the "Materials") and this site and any such Materials are protected by international copyright and trademark laws. THE MATERIALS ARE PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. The Materials are subject to change without notice and do not represent a commitment on the part of Realtimepublishers.com, Inc. or its web site sponsors. In no event shall Realtimepublishers.com, Inc. or its web site sponsors be held liable for technical or editorial errors or omissions contained in the Materials, including without limitation, for any direct, indirect, incidental, special, exemplary or consequential damages whatsoever resulting from the use of any information contained in the Materials. The Materials (including but not limited to the text, images, audio, and/or video) may not be copied, reproduced, republished, uploaded, posted, transmitted, or distributed in any way, in whole or in part, except that one copy may be downloaded for your personal, noncommercial use on a single computer. In connection with such use, you may not modify or obscure any copyright or other proprietary notice. The Materials may contain trademarks, service marks and logos that are the property of third parties. You are not permitted to use these trademarks, service marks or logos without prior written consent of such third parties. Realtimepublishers.com and the Realtimepublishers logo are registered in the US Patent & Trademark Office. All other product or service names are the property of their respective owners. If you have any questions about these terms, or if you would like information about licensing materials from Realtimepublishers.com, please contact us via e-mail at info@realtimepublishers.com.

[Editor's Note: This eBook was downloaded from Realtime Nexus, The Digital Library. All leading technology guides from Realtimepublishers can be found at http://nexus.realtimepublishers.com.]

Chapter 1: Moving Beyond Current File Serving Philosophies


The challenges that face file serving have evolved over the past few years, and the methods used to meet those challenges have advanced as well. Today, many organizations view data availability as critical, allowing for very small windows of system downtime. Compounding the problems of maintaining data availability is the sheer volume of data that many organizations must manage. The industry has moved from needing gigabytes of storage a few years ago to eclipsing the terabyte or even petabyte range of managed storage. This chapter will begin an exploration of how to build highly scalable enterprise file serving solutions by looking at the current state of the world of file serving. Along the way, you will see the many disk, server, performance, and availability choices at your disposal. After exploring the countless available options, the chapter will examine how Information Technology (IT) as a whole is modernizing its approach to file serving and data management. This chapter will provide the foundation on which to build the rest of the guide.

State of the World


Today, file serving can be deployed in many shapes and sizes. Architecturally, there are several methods for designing and deploying file serving solutions. Many organizations don't employ just one idea or methodology but are often faced with managing a collection of disparate technologies.

Performance Challenges

Performance problems often follow the pattern of a pendulum: they swing from one extreme to the other. On many networks, several servers are not working up to capacity, with physical resources under-utilized, while other servers are over-utilized, with users continually complaining about slow performance. The resources needed to solve the problems of high-volume file serving are often present, but their distribution doesn't allow all file servers to cohesively meet demand.

Management Challenges

In addition to performance challenges, managing a high volume of servers is a difficult task. With each independent file server on your network, you are faced with the need to maintain system hardware, software updates, and antivirus software in addition to a host of other management tasks. To deal with the increased management requirements that often result from network sprawl, many organizations are looking to achieve the following:

- Consolidate for the purpose of managing and maintaining fewer servers
- Consolidate and manage storage centrally
- Scale on demand
- Centrally manage a collection of servers as a single computing resource
- Reduce software costs, such as operating system (OS) and application licensing costs

Besides these management challenges, file serving continues to be challenged by availability problems.

Availability Challenges

In 2004, the Gartner Group determined that the average cost of downtime worldwide was $42,000 per hour. It also found that the average network experiences 175 hours of downtime each year. Based on Gartner's figures, it should not take your organization long to recognize the importance of data availability. Even if an organization is far below the average and is down for only 100 hours in a year, that time would equate to potentially $4,200,000 in lost revenue. Although the cost of downtime may be obvious and is certainly backed by some significant statistics from the Gartner Group, there are still countless organizations that simply treat downtime as if it's an expected part of life in IT. In addition, many organizations believe that the cost of downtime is eliminated once systems are backed up. When a company's data is unavailable, its reputation may be damaged and customer confidence weakened as a result of the downtime. This is especially true with e-commerce. Potential customers will likely not return to an unavailable Web site and will look to other options to buy their needed solution. In many cases, if an organization's data availability is unreliable, potential customers will believe that the organization itself is also unreliable.

Although downtime for individual systems is inevitable, data does not have to be unavailable during that period. System patches and hardware and software upgrades are a fact of life for all networks, but the sole purpose of the network is to provide access to data. If one system must go down for maintenance, why must the data be unavailable? With clustered file serving, server maintenance or even failure will not significantly interrupt data access.

Growth of Managed Data

Over the past decade, storage growth has repeatedly exceeded the projections of most network planners. Storage has continued to grow at an exponential rate, while each company's reliance on electronic data has increased as well. The result has been a need to manage an abundance of storage while providing fast access and high availability.

Today's File Serving Landscape

Years ago, file serving was pretty simple. Today, file serving is much more complex, and there are many approaches from which to choose. Today's approaches to file serving include:

- Standalone servers
- Distributed file systems (DFSs)
- Network Attached Storage (NAS) appliances
- Failover clusters
- Shared data clusters

This section will look at the current role of each of these architectures as well as their advantages and disadvantages.

Standalone Servers

Standalone servers represent the origin of file serving, and they still maintain a very large presence in the file serving landscape today. Figure 1.1 shows a typical standalone file server implementation.

Figure 1.1: Standalone file server implementation.

Notice in the figure that storage scalability is addressed by attaching an external disk array to the server. Although initial deployment and management of this type of architecture is usually simple, management generally becomes more difficult as the network scales. The file server implementation that this figure shows is generally referred to as a data island because access to the data is through a single path: the Local Area Network (LAN). Whether access is required by clients or by backup and restore operations, the data must be accessed over the LAN. For backup operations, this requirement might mean that backup and restore data is throttled by the speed of the LAN. A 100Mbps LAN, for example, would provide a maximum throughput of 12.5MBps (100Mbps divided by 8 bits per byte).
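As a rough illustration of that arithmetic, the following Python sketch converts a nominal LAN line rate into usable throughput and estimates the resulting backup window. The 80 percent efficiency factor and the 500GB data set in the example are assumed figures for illustration, not values from this guide.

```python
def lan_throughput_mbytes_per_sec(line_rate_mbps, efficiency=1.0):
    """Convert a nominal line rate in megabits per second to megabytes per second.

    efficiency accounts for protocol overhead and contention; 1.0 is the
    theoretical maximum (100 Mbps / 8 bits = 12.5 MBps, as in the text).
    """
    return (line_rate_mbps / 8.0) * efficiency


def backup_window_hours(data_gb, line_rate_mbps, efficiency=0.8):
    """Estimate how many hours a full backup takes when all traffic crosses the LAN."""
    throughput = lan_throughput_mbytes_per_sec(line_rate_mbps, efficiency)
    return (data_gb * 1024) / throughput / 3600


if __name__ == "__main__":
    # Theoretical maximum for a 100 Mbps LAN, matching the 12.5 MBps figure above.
    print(lan_throughput_mbytes_per_sec(100))          # 12.5
    # Hypothetical example: a 500 GB file server backed up over that same LAN
    # at an assumed 80 percent efficiency takes roughly 14 hours.
    print(round(backup_window_hours(500, 100), 1))     # 14.2
```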

Many organizations have combated the storage management shortcomings of standalone file servers by implementing either a dedicated LAN or a storage area network (SAN) for backup and recovery operations. Although this approach might solve immediate storage needs, it does little for scalability and availability. With a single file server acting as the lone access point to the data, any of several individual failures on that server can result in the complete loss of data access. For example, any of the following failures would result in data unavailability:

- Hardware failure, such as CPU, RAM, or motherboard
- Network failure
- Power failure
- Disk failure
- Malware

Aside from any element of system hardware representing a possible single point of failure, having just one or even two access points to data can result in performance bottlenecks.
Chapter 3 will look at ways to combat the availability and performance bottleneck issues associated with standalone file servers.

How do most organizations overcome file serving performance issues as their networks grow? Most simply add file servers. If one server is becoming overtaxed, an organization will order another server and move some of the shares from the overburdened server to the new one. This approach to growth is simple and has certainly been tested over time. However, adding servers also means that administrators have more systems to manage. This load will ultimately include additional work in hardware, software, and patch management. In addition, administrators will be faced with the task of updating login scripts to direct clients to the new servers. Thus, in addition to the cost of the new servers, there will ultimately be increased software and administrative costs associated with each addition. Although adding servers to the network is an inevitable part of growth, other technologies can help address the scalability issues that surround file serving today. The next few sections will look at alternative methods that can either substitute for or complement the addition of file servers to the LAN.


DFS

The use of a DFS to manage file serving has been a growing trend in recent years. In short, a DFS enables the logical organization of file shares and presents them to users and applications as a single view. Thus, an organization's 200 file shares scattered across 12 servers may logically appear as if they're attached to a single server. Figure 1.2 illustrates the core concept of a DFS.

Figure 1.2: A simple DFS implementation.

With DFS, users can access network shares via a DFS root server. On the DFS root server, administrators can configure a logical folder hierarchy, then map each folder to a share located on another server on the network. Each physical location that is mapped in the DFS hierarchy is referred to as a DFS link. The link will contain the Universal Naming Convention (UNC) path to the actual location of the shared folder. When a user accesses a shared folder on the DFS server, the user will be transparently linked to another physical server on the network.
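Conceptually, a DFS namespace is just a lookup from logical folders under the DFS root to the UNC targets that actually hold the data. The following Python sketch models that idea only; it does not use the Windows DFS API, and the namespace, server, and share names are hypothetical.

```python
# Hypothetical DFS namespace: logical folders under the DFS root mapped to the
# UNC targets (DFS links) where the shares physically live.
DFS_ROOT = r"\\corp\files"
DFS_LINKS = {
    "sales":       [r"\\fs01\sales"],
    "engineering": [r"\\fs02\eng"],
    # A link with two targets represents replicated copies on two servers.
    "public":      [r"\\fs01\public", r"\\fs03\public"],
}


def resolve(logical_path, preferred_server=None):
    """Resolve a path under the DFS root to a physical UNC path.

    If the link has multiple targets (replicas), prefer one on preferred_server,
    loosely mimicking how site-aware referrals pick a nearby replica.
    """
    folder, _, remainder = logical_path.strip("\\").partition("\\")
    targets = DFS_LINKS[folder]
    target = next((t for t in targets if preferred_server and preferred_server in t),
                  targets[0])
    return target + ("\\" + remainder if remainder else "")


print(resolve(r"sales\Q3\forecast.xlsx"))                 # \\fs01\sales\Q3\forecast.xlsx
print(resolve(r"public\tools", preferred_server="fs03"))  # \\fs03\public\tools
```

The point of the sketch is that clients only ever see paths under the DFS root; where a given folder physically lives is a detail the namespace hides.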

To illustrate this concept, compare DFS with traditional file serving: rather than presenting users with a separate mapped network drive for each share, administrators can simply map a single drive letter to the DFS root. Having a logical access layer in front of physical network resources offers several advantages:

- Administrators can change the physical location of shared data to support data consolidation or relocation without interrupting user access
- Replicas can be created for folders at the DFS root, allowing files to be replicated between multiple file servers
- With domain-based DFS, the DFS root can exist on multiple domain controllers, thus adding fault tolerance to the DFS root itself
- Windows DFS is closely intertwined with Active Directory (AD), enabling users to be directed automatically to shares that exist in their local site when multiple replicas of the same shared folders exist

For more information about DFS, refer to Microsoft TechNet at http://www.microsoft.com/technet and search using the keyword DFS.

By creating replicas of DFS links, administrators can add a level of fault tolerance to the file serving infrastructure. Also, because DFS integrates with AD sites, users accessing a link that contains multiple replicas will be directed to the replica location that exists in their computer's local site. The DFS root itself can also be replicated using domain-based DFS, making the root fault tolerant as well. DFS solves a few of the scalability issues with file serving. With DFS in place, file servers can be added without any impact on users and drive mappings. Availability can be increased by creating replica links for critical shares. If the replica links traverse two or more sites, an organization will also have simple disaster protection in place.
DFS should not be considered a replacement for normal backups. Although DFS can transparently maintain multiple copies of files across two sites, it does not prevent file corruption, erroneous data entry, or accidental or intentional deletion. Thus, you should still back up your file server data to removable media and store it at an offsite facility.

Although DFS can solve some of the data access and availability concerns of standalone file servers, it does not help combat the server sprawl that administrators must contend with as additional servers are added to the network. Each server will still need to be maintained as a separate entity. DFS hides the complexity of the network infrastructure from end users and applications, but administrators aren't so fortunate. As the network grows, administrators will be faced with managing and maintaining each server on the LAN.


NAS Appliances

NAS appliances began gaining momentum in the late 1990s as a method to consolidate and simplify file serving. NAS appliances quickly gained popularity because they can be deployed quickly (often within minutes) and, with support for terabytes of storage, several file servers can often be consolidated into a single NAS appliance. Figure 1.3 shows a typical NAS deployment.

Figure 1.3: A simple NAS deployment.

NAS devices are labeled appliances because an administrator can literally buy a NAS and plug it in. However, NAS devices have restricted software choices. By restricting the software that can be installed, if any, NAS vendors are able to guarantee the reliability of their systems. As most NAS appliances have the sole purpose of being file servers, there isn't much need to install applications. Major vendors in the NAS space include Network Appliance, EMC, and Microsoft, which offers the Windows Storage Server 2003 OS. Network Appliance and EMC provide both hardware and their own proprietary NAS OS with each appliance. Microsoft does not ship NAS appliances. Instead, it provides a NAS OS to vendors such as Dell and Hewlett-Packard, who ship NAS appliances with the Windows Storage Server 2003 OS. Because they are built for file serving, nearly all NAS appliances (including those from Network Appliance, EMC, and Microsoft) support the two most common network file sharing protocols: Common Internet File System (CIFS) and Network File System (NFS). Also, most NAS appliances include built-in redundant hardware as well as data management utilities. The popularity of NAS has been attributed primarily to its quick deployment and the relative simplicity of administration; nearly all NAS appliances come with a simple-to-use Web-based administration tool.

As with other file serving approaches, NAS has a few drawbacks. Most NAS appliances come with proprietary hardware and a proprietary OS. This shortcoming limits the flexibility of the device in the long run. For example, an older and slower NAS appliance cannot later be used as a database server. Also, the nature of proprietary solutions requires the purchaser to return to the same NAS vendor to purchase hardware upgrades. Another challenge that has recently plagued NAS is sprawl. For many network administrators who bought into the NAS philosophy of file serving, adding capacity means adding another NAS. In time, many organizations have accumulated several NAS appliances that are all independently managed.

Failover Clusters

Another approach to file serving involves the use of clusters. The simple definition of a cluster is two or more physical computers collectively hosting one or more applications. A major advantage of clusters is the ability for an application to move from one node to another in the cluster; this process is known as failover. A shared storage device between all nodes in the cluster is needed so that an application sees a consistent view of its data regardless of the physical node that is hosting it. These capabilities are why most people associate clustering with availability. The two primary architectures available for file serving clusters are failover clusters and shared data clusters. The difference between these architectures lies in how the cluster's shared storage is accessed. With failover clustering, one node in the cluster exclusively owns a portion of the shared storage resource. If an application in the cluster needs to fail over to another node, the failover node will need to mount the storage before bringing the application online. Figure 1.4 illustrates a failover cluster.

Figure 1.4: Failover cluster with SCSI-attached shared storage.

Notice that a heartbeat connection is also shown in the illustration. The heartbeat represents a dedicated network over which the cluster nodes can monitor each other. In this way, a node can determine whether another node is offline. If no dedicated heartbeat network is present, the cluster nodes will monitor each other over the LAN.
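As a rough sketch of the heartbeat idea, the following Python fragment shows the kind of logic a node might use to decide that its partner has failed: record the time of each heartbeat received and declare the peer offline after several missed intervals. This is an illustrative model only, not the protocol of any particular cluster product, and the interval and threshold values are assumptions.

```python
import time

HEARTBEAT_INTERVAL = 1.0   # seconds between heartbeats (assumed value)
MISSED_THRESHOLD = 3       # heartbeats missed before declaring the peer offline


class HeartbeatMonitor:
    """Tracks the last heartbeat seen from a peer node on the heartbeat network."""

    def __init__(self, peer_name):
        self.peer_name = peer_name
        self.last_seen = time.monotonic()

    def record_heartbeat(self):
        # Called whenever a heartbeat packet arrives from the peer.
        self.last_seen = time.monotonic()

    def peer_is_offline(self):
        # The peer is presumed failed once several consecutive heartbeats are missed.
        return time.monotonic() - self.last_seen > HEARTBEAT_INTERVAL * MISSED_THRESHOLD


if __name__ == "__main__":
    monitor = HeartbeatMonitor("node2")
    monitor.record_heartbeat()
    print(monitor.peer_is_offline())   # False immediately after a heartbeat
    time.sleep(HEARTBEAT_INTERVAL * MISSED_THRESHOLD + 0.5)
    print(monitor.peer_is_offline())   # True: node2 would now be treated as failed
```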

In a simple failover cluster, one node hosts an application, such as a file server, inside a virtual server. The virtual server acts as an addressable host on the network and has a unique host name and IP address. The second node, the passive node, monitors the first node for failure. If the first node becomes non-responsive, the second node will assume control of the virtual server. Many popular OS vendors offer failover clustering support with their OSs. For example, Microsoft Windows Server 2003 (WS2K3) Enterprise Edition and Red Hat Enterprise Advanced Server 4.0 with the add-on Cluster Suite both support as many as 8-node failover clusters. The open source High-Availability Linux Project offers support for failover clusters of 8 nodes or more. There are plenty of failover clustering solutions available on the market today. However, vendors are also starting to embrace shared data clusters, which offer the same level of fault tolerance as failover clusters but provide several additional benefits as well.

Cluster Architecture

Clusters are typically described as either N-to-1 or N-Plus-1. In an N-to-1 architecture, one node in the cluster is designated as the passive node, leaving it available to handle failover if an active node in the cluster fails. Figure 1.5 shows a 3-node N-to-1 cluster.

Figure 1.5: A 3-node N-to-1 cluster.

Notice that Node1 is active for the virtual server FS-Sales and Node2 is active for the virtual server FS-Acct. If either active node fails, Node3 will assume its role. In this architecture, Node3 is always designated as the passive node, meaning that when the primary active node returns online following a failure, the service will fail back to the primary node. Although this approach offers simplicity, automatic fail back means that the failed service will be offline twice: once during the initial failover and again during the fail back.
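To make the cost of automatic fail back concrete, the following Python sketch walks a virtual server through a node failure and recovery in an N-to-1 arrangement and counts each move of the service. It is a conceptual model only; it borrows the node and virtual server names from Figure 1.5 purely for illustration.

```python
class NToOneCluster:
    """Minimal model of an N-to-1 cluster: each service fails over to the
    designated passive node and automatically fails back when its primary returns."""

    def __init__(self, passive_node):
        self.passive_node = passive_node
        self.owner = {}          # virtual server -> node currently hosting it
        self.primary = {}        # virtual server -> its designated primary node
        self.interruptions = 0   # every move of a service briefly interrupts it

    def add_service(self, virtual_server, primary_node):
        self.primary[virtual_server] = primary_node
        self.owner[virtual_server] = primary_node

    def node_failed(self, node):
        for vs, owner in self.owner.items():
            if owner == node:
                self.owner[vs] = self.passive_node
                self.interruptions += 1

    def node_recovered(self, node):
        # Automatic fail back: services return to their primary node when it is healthy.
        for vs, primary in self.primary.items():
            if primary == node and self.owner[vs] != node:
                self.owner[vs] = node
                self.interruptions += 1


cluster = NToOneCluster(passive_node="Node3")
cluster.add_service("FS-Sales", "Node1")
cluster.node_failed("Node1")       # FS-Sales moves to Node3
cluster.node_recovered("Node1")    # FS-Sales automatically moves back to Node1
print(cluster.interruptions)       # 2: the service was briefly offline twice
```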

N-Plus-1 clustering offers a different approach. With N-Plus-1, a standby (passive) node can assume control of a primary node's services when that active node fails. However, when the failed node returns to service, it then assumes the role of the passive node. Thus, over time, the active node for each service managed by the cluster may be completely different from when the cluster was originally set up. Because no automatic fail back occurs with this approach, overall availability is better.

Shared Data Clusters

Shared data clustering can also provide the benefits of high performance and load balancing. Shared data clusters differ from failover clusters in how they work with shared storage. In a shared data cluster, each node in the cluster simultaneously mounts the shared storage resources. This approach provides far superior performance over failover clusters because no mount delays are encountered when an application tries to fail over to another physical node in the cluster. With shared data clusters, multiple nodes in the cluster can access the shared data concurrently; with failover clusters, only one node can access a shared storage resource at a time. Figure 1.6 shows a shared data cluster. Notice that one of the key differences with the shared data cluster is that a SAN is used to interconnect the shared storage resources.

Figure 1.6: Shared data cluster with SAN-attached storage.

The elements of the SAN cloud are discussed later in this chapter.


Shared data clusters have steadily grown in popularity as a result of their ability to address many of the problems facing today's file serving environments. In particular, shared data clusters offer the following benefits:

- More effective utilization of hardware resources
- Simple scalability to accommodate growth
- High availability

Depending on whom you ask, industry analysts have found that average server CPU consumption runs from 8 percent to 30 percent. Most organizations have several servers that exhibit similar performance statistics. For example, consider an organization that has two servers that average 10 percent CPU utilization. Consolidating the servers to a single system will not only allow hardware to be more effectively utilized but also reduce the total number of managed systems on the network.
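As a back-of-the-envelope illustration of that consolidation math, the following Python sketch estimates how many consolidated nodes a set of lightly loaded servers would require, given their average CPU utilization and a target utilization ceiling. The 70 percent ceiling is an assumed planning figure, not a recommendation from this guide.

```python
import math

def consolidated_nodes_needed(avg_utilizations, target_ceiling=0.70):
    """Estimate how many consolidated nodes are needed for a set of servers.

    avg_utilizations: average CPU utilization of each existing server (0.0-1.0).
    target_ceiling: how heavily the consolidated node may be loaded (assumed 70%).
    """
    combined_load = sum(avg_utilizations)
    return max(1, math.ceil(combined_load / target_ceiling))


# Two servers averaging 10 percent CPU each fit comfortably on a single system.
print(consolidated_nodes_needed([0.10, 0.10]))    # 1
# Ten servers averaging 20 percent each would still need three consolidated nodes.
print(consolidated_nodes_needed([0.20] * 10))     # 3
```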
Several organizations have turned to virtual machines as a means to further consolidate server resources. Companies such as VMware and Microsoft provide excellent virtualization tools in this arena. Although virtualization might make sense in many circumstances, a virtual machine is still a managed system and requires patches and security updates like any other system on the network. Virtual machines provide an excellent benefit in consolidation, especially when consolidating legacy OSs running needed proprietary database applications, but they are not always the best fit for file serving. Consolidating to virtual servers running on top of clusters not only allows you to maximize your hardware investment but also reduces the number of managed systems on your network.

Like traditional failover clustering, shared data or cluster file system architectures involve the use of virtual servers that are not bound to a single physical server. Virtual servers that exist in the cluster can move to another host if their original host becomes unavailable. Where cluster file systems differ is in their fundamentally unique approach to clustering. In traditional clustering, each virtual server has its own data that is not shared with any other virtual server. In shared data cluster computing, multiple virtual servers can export the same data. To summarize the key components of a shared data cluster, consider the following common characteristics:

- Modular: Several dense servers are grouped to support mission-critical file serving and application needs.
- Adaptive: Physical resources in the cluster can be dynamically allocated to meet performance requirements.
- High availability: Virtual servers can fail over to available physical resources if a failure occurs.
- Shared data: Servers in the cluster concurrently access shared data via a SAN. Concurrent access provides for near instantaneous failover.
- Platform independence: Hardware of each node in the cluster does not need to be identical or even from the same vendor.
- Management layer: Intelligence exists that oversees and ensures cohesion of physical and logical elements in the cluster.


Modular

A modular cluster supports the logical grouping of physical resources to match the demand and number of virtual servers that are needed. Because both physical server resources and storage resources can be grouped, management is relatively simple. On the outside, shared data clusters can look intimidating; for this architecture to succeed, it's important that resource management be simple. Modularization provides this simplicity.
Adaptive

Shared data clusters can take advantage of both high-performance clustering and failover clustering. To meet the needs of applications, additional servers can be redeployed to virtual server groups to accommodate demand. Additional virtual servers and applications can usually be added with minimal to no investment.
High Availability

To support high availability, virtual servers in the cluster can fail over to other nodes. If data access via one physical server in the cluster is interrupted, another physical server can take control of a virtual server in the cluster. Also, shared data clusters provide a unique data sharing architecture that typically allows failovers to complete within seconds. With file servers running as virtual servers hosted by a shared data cluster, data access does not need to be unavailable for several hours due to scheduled or unscheduled downtime. Instead, if a node in the cluster needs to go offline (or is taken offline by system failure), the application hosted by the node can simply be moved to another node in the cluster. With failover generally taking seconds to complete, user access is minimally disrupted.
Not all clustering products support application failover during upgrades. Some products require that all servers be taken down simultaneously during an upgrade. Administrators should consult their cluster product vendor prior to performing any cluster maintenance to verify that clustered applications will remain available during any system upgrades.

Shared Data

Many traditional failover cluster architectures employ a shared-nothing architecture. With shared-nothing clustering, one or more servers share storage, but in reality only one server can use a shared physical disk at a time. The argument for this approach has long been that concurrent I/O operations from multiple sources could corrupt the shared hard disk, so it is best that the disk be mounted on only one physical server at a time. Ultimately, this means that in traditional architectures, software running on the servers in the cluster simply will not run properly if multiple physical servers concurrently access the same disk space. This architecture also results in slow failovers in the event of a failure, because one node must release the storage resource and the failover node must then mount it. With shared data clusters, each node in the cluster mounts the shared storage on the SAN; thus, during a failover, no delay is incurred for mounting storage resources. To ensure data integrity, the cluster's management layer uses a distributed lock manager (DLM). The DLM allows multiple servers to read and write to the same files simultaneously. The DLM also provides for cache coherence across the cluster. True cache coherence is what allows multiple servers to work on the same application data at the same time. This feature is what allows shared data clustering to offer both high performance and high availability.
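The DLM can be pictured as a service that grants shared locks to concurrent readers and an exclusive lock to a writer. The following Python sketch is a heavily simplified, whole-file illustration of that idea, not the lock manager of any real cluster file system; actual DLMs use finer-grained locks (for example, byte ranges) and cache coherence so that multiple nodes can even write to the same file concurrently, and they must also recover locks when a node fails.

```python
class SimpleLockManager:
    """Toy illustration of DLM-style locking: many readers or one writer per file."""

    def __init__(self):
        self.locks = {}   # path -> {"mode": "shared" or "exclusive", "holders": set of nodes}

    def acquire(self, node, path, mode):
        entry = self.locks.get(path)
        if entry is None:
            self.locks[path] = {"mode": mode, "holders": {node}}
            return True
        if mode == "shared" and entry["mode"] == "shared":
            entry["holders"].add(node)          # concurrent readers are allowed
            return True
        return False                            # a writer conflicts with any other holder

    def release(self, node, path):
        entry = self.locks.get(path)
        if entry:
            entry["holders"].discard(node)
            if not entry["holders"]:
                del self.locks[path]


dlm = SimpleLockManager()
print(dlm.acquire("node1", "/shared/report.dat", "shared"))     # True
print(dlm.acquire("node2", "/shared/report.dat", "shared"))     # True: both nodes read
print(dlm.acquire("node3", "/shared/report.dat", "exclusive"))  # False: writer must wait
dlm.release("node1", "/shared/report.dat")
dlm.release("node2", "/shared/report.dat")
print(dlm.acquire("node3", "/shared/report.dat", "exclusive"))  # True once readers release
```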

Platform Independence

Because cluster computing is platform independent, organizations can use their preferred hardware to assemble the cluster's infrastructure. Platform independence makes it much easier for organizations to get started with cluster computing, and as servers in the cluster age, those servers can potentially be reused for other purposes within the organization.
Management Layer

The role of the management layer within cluster computing is not only to modularize physical resources such as servers and storage but also to provide failover and dynamic allocation of additional resources to meet performance demands. As shared data clusters are a new and different approach to clustering, there are currently few choices available that can provide the complex management functionality of a cluster computing-driven server infrastructure. The lone vendor that can fully deliver shared data clusters today is PolyServe; other storage vendors, such as Network Appliance and EMC, offer consolidation and availability solutions, but each of those solutions is hardware centric.
Built to Scale

Another aspect of shared data clustering that has led to its popularity is its simple growth model. As load increases, nodes can simply be added to the cluster. Although many failover cluster architectures have trouble scaling, shared data clusters that can run on both Windows and Linux OSs support scaling to 16 nodes or beyond. This type of flexibility eliminates much of the guesswork of growth and capacity planning. With shared data clusters supporting a high maximum node count, administrators can add nodes as needed rather than purchase based on capacity that may be planned 18 months out.
The Cost Factor

Shared data clusters offer several advantages, but those advantages come with a price. Shared data clusters typically share a common storage source, and the shared storage is usually interconnected to the cluster nodes via a Fibre Channel SAN. Although shared storage contributes to some of the benefits mentioned earlier (and several more discussed in Chapter 6), it comes at a higher cost than traditional direct attached storage (DAS). However, although the price can cause initial sticker shock, the surprise often quickly passes when the cost of the shared storage infrastructure is weighed against the cost of downtime and the need to scale performance, particularly with data hosted on industry-standard Intel-based servers. To understand the savings, look past the cost of DAS on a single server. With shared storage, after the initial infrastructure investment, there is little difference in the cost of actual storage. And when industry-standard Intel servers are compared with proprietary UNIX servers or NAS appliances, the cost savings of shared data clustering are often estimated at 8 to 10 times relative to the proprietary equipment. Thus, the shared data approach provides not only better utilization of storage resources, better availability, and better performance, but also substantial cost savings. In terms of complexity, storage architectures are less intimidating once the available technologies have been explored. The following sections highlight these technologies.


Current Storage Architectures


Today, there are several ways to deploy storage on a LAN. Among the most popular choices are:
- SCSI
- Serial ATA (SATA)
- Fibre Channel (FC)
- Internet SCSI (iSCSI)

This section will take a brief look at each of these technologies as they relate to building a better file serving infrastructure.

SCSI

SCSI has long been the core storage architecture for high-performance file serving. Although this disk architecture has lost significant ground to FC, most organizations still employ several SCSI storage devices on their networks. The first generation of SCSI offered throughput as fast as 5MBps; today, Ultra320 SCSI can push data at rates as fast as 320MBps. The width of the SCSI bus ultimately determines the number of devices that can be connected to it. For example, narrow SCSI has an 8-bit bus, which allows it to support as many as 8 devices, including the SCSI host bus adapter (HBA). Wide SCSI has a 16-bit bus, which allows for support of as many as 16 devices. By using logical unit numbers (LUNs), SCSI buses can address more devices than these limits suggest. SCSI IDs are used to identify each device on the bus. By default, each SCSI HBA uses an ID of 7. For narrow SCSI, IDs of 0 to 7 are valid, whereas 0 to 15 are valid IDs for wide SCSI. Table 1.1 shows the different SCSI bus types available today.
Bus Type          Bus Width (Bits)   Bandwidth (MBps)   Maximum Cable Length (m)
                                                        SE      LVD     HVD
SCSI-1            8                  5                  6       -       25
SCSI-2            8                  5                  3       -       25
Wide SCSI         16                 10                 3       -       25
Fast SCSI         8                  10                 3       -       25
Fast Wide SCSI    16                 20                 3       -       25
Ultra SCSI        8                  20                 1.5     -       25
Ultra SCSI-2      16                 40                 3       -       25
Ultra2 SCSI       16                 80                 -       12      25
Ultra160 SCSI     16                 160                -       12      -
Ultra320 SCSI     16                 320                -       12      -

Table 1.1: SCSI bus type comparison.


Note that the table lists cable lengths only for the signaling types (SE, LVD, HVD) supported by a particular SCSI bus type. LVD cable lengths are not listed until Ultra2 SCSI, which was the first SCSI standard to support the LVD bus type.

For more information about SCSI, visit Gary Field's SCSI Info Central at http://www.scsifaq.org.
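As a quick worked example of the addressing rules described before Table 1.1, the short Python sketch below shows how bus width and LUNs determine how many devices a bus can address. It is only an arithmetic illustration; the LUN count used is an arbitrary assumed value.

def scsi_addressable_devices(bus_width_bits, luns_per_target=1):
    # A narrow (8-bit) bus provides 8 SCSI IDs, a wide (16-bit) bus 16 IDs;
    # one ID is consumed by the HBA itself (ID 7 by default).
    target_ids = bus_width_bits
    usable_targets = target_ids - 1
    return usable_targets * luns_per_target

print(scsi_addressable_devices(8))                        # narrow bus: 7 devices
print(scsi_addressable_devices(16))                       # wide bus: 15 devices
print(scsi_addressable_devices(16, luns_per_target=8))    # LUNs multiply what the bus can address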

SCSI runs into major scalability problems with shared storage architectures. In nearly all failover cluster implementations, shared storage connected via SCSI supports a maximum of 2 nodes. This scalability limitation has led many organizations to move away from failover clustering. Although failover clusters can run on SANs, the products of many vendors still behave as if they're SCSI attached, thus diminishing their attractiveness. For greater scalability, many organizations are moving toward shared data clusters that interconnect shared storage to cluster nodes via an FC SAN. Although FC provides the data transport in the SAN, FC disk arrays attached to the SAN may contain internal FC, SCSI, or SATA disks. With the ability to offer scalability and support for all major disk storage architectures, it's easy to see why FC has become the leading storage interconnect in the industry.

SATA

SATA drives have become increasingly popular due to their lower cost (compared with SCSI) and comparable speeds. The first SATA standard provided for 150MBps data transfer rates. SCSI vendors quickly answered with faster standards of their own, and SATA, in turn, moved to 300MBps with its SATA II standard. At 300MBps, SATA II is still slightly slower than Ultra320 SCSI but is now a viable, cost-effective option in high-performance file serving. Many storage vendors have also jumped on the SATA bandwagon, with vendors such as Hitachi and Sun Microsystems offering SATA disk arrays. The rise of SATA has been pushed by several storage vendors that have built SATA storage devices that can be interconnected to FC SANs.
For more information about SATA, refer to the SATA International Organization homepage at http://www.sata-io.org.


FC and SANs

Today, FC is the predominant architecture for interconnecting shared storage devices. The high adoption rate of FC has been fueled by its several advantages over SCSI:
- Speed: 4Gbps FC media offer data transfer rates as fast as 512MBps
- FC SANs support as many as 16 million devices
- FC supports cable lengths as long as 10KM

One of FC's greatest benefits is that this architecture allows for interconnecting storage devices via a dedicated SAN. SANs provide the following benefits:
- Storage resources can be pooled and shared by all servers
- Backup performance will likely increase dramatically
- Scalability issues can be more easily managed
- Shared data clusters can scale as high as 16 nodes or beyond, depending on the clustering application

Each server connected to the SAN can potentially access any storage resource on the SAN. SANs maximize the use of storage resources by making it easier to allocate unused capacity to other servers. This setup has significantly aided data backups. A server no longer has to send its data over the LAN to reach a tape library for backup, for example; instead, the server can directly access the library via the SAN. Backup vendors such as Symantec (formerly VERITAS) and CommVault have architectures that support sharing of backup targets in a SAN, so servers no longer face network bottlenecks while backing up their data. The term LAN-free is often used to describe this backup approach. Other backup methods such as server-free and server-less are also available by using enterprise-class backup products and interconnecting storage resources via a SAN.
Chapter 6 will provide examples of all SAN-based backup configurations, including LAN-free, serverfree, and server-less as well as several examples of how organizations are consolidating storage resources by connecting their servers to SANs.

Disk arrays as well as backup devices can be shared on a SAN. In the past, many in IT addressed storage by guessing how much storage a server would need when it was initially requisitioned, and if the server needed more disk resources, more would be ordered at a later date. For servers for which the estimate was too high, disk resources would go unused. The ability to collectively pool physical disks in a SAN enables the allocation of disk space to servers as needed. The bottom line with SANs is that their implementation is a natural part of the progression toward consolidation. Figure 1.7 shows a basic SAN.


Figure 1.7: A SAN that consists of a switch, router, disk array, and tape library.

Notice that three servers are sharing a disk array and tape library. The switch and router are used to interconnect the storage devices on the SAN. FC SAN hardware devices share the same names as the devices you already know from LANs. The primary devices that drive a SAN include:
- Switches and hubs
- Routers (also known as bridges)

Switches and Hubs

Switches and hubs are used to interconnect devices on the SAN. Their role on the SAN is similar to that of a switch or hub on a LAN. Hubs are older FC devices that support a topology known as FC-Arbitrated Loop (FC-AL), which is the SAN equivalent of a token ring network. Switches dominate today's SAN landscape and work similarly to Ethernet switches. SANs connected via a switch are said to be part of a Switched Fabric topology. With a switch, dedicated point-to-point connections are made between devices on the SAN, allowing the devices to use the full bandwidth of the SAN. With FC-AL hubs, bandwidth is shared and only one device can send data at a time. Among the popular switch vendors today are Brocade, McData, and Cisco Systems. Another very popular device on the SAN is the router.

Router

Routers are devices that are used to connect an FC SAN to a SCSI device. The job of the device is to route between a SCSI bus and an FC bus. The router is a very important consideration when planning to implement a SAN, as it allows an organization to connect existing SCSI storage devices (disk arrays and libraries) to the SAN. This connection prevents the loss of the initial SCSI storage investment. The two most popular router vendors today are ADIC and Crossroads.
For more information about FC and SANs, refer to these excellent online resources: the Storage Networking Industry Association at http://www.snia.org, the Fibre Channel Industry Association at http://www.fibrechannel.org, and Legato Systems' SAN Academy at http://www.sanacademy.com.


FCIP and iFCP

The cheapest transmission medium is the Internet, which requires IP. With this in mind, wouldn't it be useful to bridge the SANs at two sites together through the Internet? For this to happen, you need a device capable of performing the FC-to-IP translation, and some FC switches have integrated FCIP ports that allow you to do so. However, FCIP doesn't provide any means to directly interface with an FC device; instead, it's a method of bridging two FC SANs over an IP network. Internet FC Protocol (iFCP) is much more robust than FCIP. Like FCIP, iFCP can be used to bridge FC switches over an IP network. However, this protocol also provides the ability to network native IP storage devices and FC devices together on the same IP-based storage network. With the rise of gigabit Ethernet networks, consider iFCP as a way to provide full integration between an FC and an IP network. Another rising protocol that provides the same level of hardware integration over gigabit Ethernet is iSCSI.

iSCSI

iSCSI works very similarly to iFCP, except that instead of encapsulating Fibre Channel Protocol (FCP) data in IP packets, SCSI data is encapsulated. Because it is designed to run over Ethernet, iSCSI lets an organization leverage existing Ethernet devices on a storage network. For example, consider an organization that purchases new gigabit Ethernet switches for an iSCSI SAN. As technology improves and the organization decides to upgrade to faster gigabit switches, the older switches can be used to connect hosts on the LAN. FC switches don't offer this level of flexibility. The iSCSI architecture involves a host configured as an iSCSI target. The iSCSI target can be a server with locally connected storage or a storage device that natively supports iSCSI. Clients that access the storage over the network using the iSCSI protocol are known as initiators. Initiators need to have iSCSI client software installed in order to access the iSCSI target. Figure 1.8 shows a typical iSCSI environment with two initiator hosts and one iSCSI target.

Figure 1.8: A small iSCSI SAN.

As iSCSI is a newer and still maturing protocol, there are not as many storage devices that support iSCSI as there are that support FC. As more devices become available, expect competition to cause the price of both iSCSI and FC SANs to drop even further.
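To illustrate the initiator/target relationship described above, the following conceptual Python sketch models a target exporting LUNs and two initiators logging in by their iSCSI qualified names (IQNs). The IQNs and LUN sizes are hypothetical, and the sketch deliberately ignores real protocol details such as sessions, PDUs, and CHAP authentication.

class IscsiTarget:
    """Conceptual model of an iSCSI target exporting LUNs over TCP/IP."""
    def __init__(self, iqn, luns):
        self.iqn = iqn            # e.g., iqn.2005-06.com.example:filer1 (hypothetical)
        self.luns = luns          # LUN number -> size in GB
        self.sessions = []

    def login(self, initiator_iqn):
        # A real target would authenticate the initiator before exposing LUNs.
        self.sessions.append(initiator_iqn)
        return list(self.luns.keys())

class IscsiInitiator:
    """Conceptual model of a host running iSCSI initiator software."""
    def __init__(self, iqn):
        self.iqn = iqn

    def discover_and_login(self, target):
        luns = target.login(self.iqn)
        print(f"{self.iqn} sees LUNs {luns} on {target.iqn}")

target = IscsiTarget("iqn.2005-06.com.example:filer1", {0: 500, 1: 250})
for host in ("iqn.2005-06.com.example:web01", "iqn.2005-06.com.example:db01"):
    IscsiInitiator(host).discover_and_login(target)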


Clustered File Serving Gaining Momentum


To get past data's dependence on an individual system, clustered file serving has emerged as the primary means for maintaining data availability. In short, clustering allows a virtual server to run on top of any physical server participating in the cluster. Virtual servers have the same characteristics as physical file servers: a name, an IP address, and the ability to provide access to data. However, they differ in that they are not dependent on a single piece of hardware to remain online. Instead, if a virtual server's host hardware fails, the virtual server can simply move to another host. The result is that the virtual server is offline for only a few seconds while moving to another physical host, compared with the several minutes or hours of unavailability that follow a standalone server failure.

High Availability

Keeping data available means keeping everything in the data path available. This goal is most often secured through redundancy. Storage itself can achieve redundancy through Redundant Array of Inexpensive Disks (RAID). Redundant switches can be added to the data path on both the LAN and the SAN, protecting against a switch failure. Finally, physical servers themselves can be made redundant through clustering. Figure 1.9 illustrates an example of a highly available file serving architecture.

Figure 1.9: An example of a high-availability clustering architecture.

Chapter 3 spends considerable time exploring how to add redundancy to the complete data path.
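As a rough sketch of the virtual server failover described above, the loop below moves a virtual server (its name and IP address) to a surviving node when heartbeats from its current host stop. The node names, heartbeat timeout, and helper function are all hypothetical; real cluster software also handles quorum, fencing, and resource dependencies.

import time

HEARTBEAT_TIMEOUT = 10     # seconds without a heartbeat before failover (assumed value)

virtual_server = {"name": "FILESRV1", "ip": "10.0.0.50", "host": "node1"}
last_heartbeat = {"node1": time.time(), "node2": time.time()}

def move_virtual_server(vs, new_host):
    # In a real cluster this re-registers the name/IP on the new host and, with
    # shared data clusters, simply activates storage that is already mounted there.
    print(f"Failing {vs['name']} ({vs['ip']}) over from {vs['host']} to {new_host}")
    vs["host"] = new_host

def check_cluster():
    now = time.time()
    for node, seen in last_heartbeat.items():
        if node == virtual_server["host"] and now - seen > HEARTBEAT_TIMEOUT:
            survivors = [n for n in last_heartbeat if n != node]
            if survivors:
                move_virtual_server(virtual_server, survivors[0])

# Simulate node1 going silent, then run one monitoring pass.
last_heartbeat["node1"] -= 60
check_cluster()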


Consolidation Advantages

Newer cars have a lot more parts inside. Although the additional parts may equate to more features, such as power windows, they also mean that there are more parts that can break. On a network that employs 200 servers, each part on each server represents a potential failure. Reducing the number of servers on the network ultimately reduces the number of potential failures. PolyServe recently studied the benefits of consolidating file servers to a clustered file system running on standard hardware and found the following:
- Procurement costs are reduced by as much as 70 percent
- Physical and logical file server use and storage consumption are reduced by as much as 80 percent
- Operational costs are reduced by at least 50 percent
- File server downtime is reduced by almost 100 percent

Thus, consolidating to clustered file system (CFS)-based file serving easily equates to quantifiable savings. An administrator who wants to lower the number of system management headaches needs a way to quantify proposals for new technologies in order to get them approved. If data unavailability is reduced from 175 hours per year to 1 hour per year, for example, an organization may see a production savings of more than 4 million dollars, according to the Gartner survey cited earlier.

Drive Toward Standardization

Movement toward standardized hardware on Intel-based platforms has steadily gained ground over the past decade. Moving away from proprietary hardware solutions gives organizations true independence with their hardware investments. As mission-critical servers are upgraded, the original server systems can be used for other roles within the organization, such as serving a less critical database application. Having standard, non-proprietary hardware also offers complete flexibility with OS and application choices. A Windows box could easily become a Linux box or vice versa as the need arises. As needs on the network change, systems can be moved to where they're most needed. With proprietary solutions, this level of flexibility is typically not possible. The push toward standard platforms has gone past the major OS vendors and extended to application and service vendors. Running servers on standardized hardware ultimately means far more applications are available to select from. The bottom line with the movement toward standardization is that administrators and end users benefit the most. Organizations have better, less expensive products and much more to choose from when making purchasing decisions. The competition that has been steadily expanding in the non-proprietary market will only continue to benefit the industry, with innovation fueling further competition.


Summary
With the increased need for performance and availability of files, shared data clusters have steadily emerged as the architecture of choice to meet many organizations' file serving needs. Shared data clusters offer superior scalability and a significantly lower cost than proprietary point solutions such as the offerings of many NAS vendors. With this type of momentum, it appears that shared data clusters will continue to experience rapid growth in the years to come. Deploying a shared data cluster architecture as part of a consolidated and highly available server infrastructure can provide a resilient and flexible architecture that scales as an organization grows. The next chapter digs deeper into the problems plaguing modern architectures and looks further into how these problems are being solved. The rest of the guide will explore specific examples of how to optimize the data path for performance and availability and provide examples of increasing the performance, availability, and scalability of both Windows and Linux file serving solutions.


Chapter 2: Taming Storage Growth – A Modern Perspective


Chapter 1 introduced the vast array of storage and file serving solutions available today. It is important to understand how each file serving and storage technology works, and it is equally crucial to know which technology is right for your organization's needs. This chapter provides a detailed examination of the current storage and growth problems facing IT and explores what server, storage, and application vendors are doing to address these problems.

Current Storage Problems


Most organizations face several problems with their storage infrastructures, most notably:
- Availability
- Growth
- Management
- Backup window expansion

This section will look at the root causes of each of these problems, setting the foundation for a later discussion of how vendors are using technology to address these issues.

Availability

Availability refers to data being obtainable when a user or application needs it. Because need is a relative term, the definition of availability can vary from one organization to the next. For a small medical office, availability probably means access to resources from 8:00 AM to 6:00 PM, Monday through Friday. For an ecommerce Web site, availability means 24 x 7 access to data. For most organizations, availability likely falls somewhere between these two examples. Regardless of an organization's definition of availability, the performance of IT staff is often measured by the availability of data. Several problems can derail data availability:
- System hardware failure
- System software failure
- Power failure
- Network failure
- Disk failure


Fortunately, many of the single points of failure that prevent data availability can be overcome with redundancy. An organization can overcome power failure through the use of UPSs and backup generators. Reliance on additional switches, routers, and redundant links can help prevent network failure; RAID can help overcome data availability problems that result from disk failure.
Chapter 3 will focus on how to overcome these issues so as to ensure data availability.

System hardware or software failure is often more difficult to rebound from. To overcome this challenge, several solutions are now available, including NAS and shared data clusters. These technologies will be examined later in this chapter.

Growth

Another problem facing administrators today is growth. As storage requirements and client demand increase, how do you accommodate the increase? To put growth into perspective, according to IDC's Worldwide Disk Storage Systems Tracker, disk factory revenue grew 6.7 percent year over year, as reported in the first quarter of 2005. The report also noted that the 6.7 percent figure represented 8 consecutive quarters of growth. The increase in revenue has occurred despite the fact that the cost per gigabyte of disk storage continues to fall. For example, the 2005 Worldwide Disk Storage Systems Tracker also reported that capacity continues to grow at an exponential rate, with 2005 year-on-year capacity growing 58.6 percent. Market trends have shown that nearly all administrators face growth. One of the major problems is how to effectively manage it. Is the ideal solution to continue to add capacity, or is a better solution to rebuild the network infrastructure so that it can effectively scale to meet the needs of future growth? Many administrators are deciding that now is the time to look at new ways of managing data, as earlier architectures are not well-suited to meet the continued year-on-year demands of growth.

Management

With growth comes additional headaches, the first of which is management. As online storage increases, what is the best way to effectively manage the increase? If each server on the LAN is using local DAS storage, this situation creates several potential bottlenecks, data islands, and independently managed systems. If client demand is also increasing, is the best option to add servers to the LAN to deal with the heavier load? Ultimately, the problem that is hurting data management today is that many administrators are trying to use traditional architectures to deal with modern problems.

Expanding Backup Windows

As the amount of data grows, so do the backup windows for many organizations. With traditional servers with DAS storage and LAN-based backups, it has become almost impossible to back up servers over a LAN within the time of a backup window. This challenge has resulted in many organizations altering their backup schedules and doing fewer full backups in order to have backups complete before business starts each morning. To deal with the issue of expanding backup windows, many IT shops have considered or have already deployed solutions such as NAS, SAN, DFS, and virtualization. The following sections explore the part that each of these technologies plays within the network.

Existing Storage Solutions


There are many vendors in the market that offer products to solve today's storage problems. Although these solutions can ease the management burden of an administrator, a one-size-fits-all approach doesn't guarantee success. It's the responsibility of the organization and the IT staff to understand each of the available storage and consolidation solutions, then select the solution that best fits the company's mission.

SAN

For many organizations, a SAN is often the answer for consolidating, pooling, and centrally managing storage resources. The SAN can provide a single and possibly redundant network for access to all storage devices. Depending on the supporting products purchased by an organization, a SAN might also provide the ability to run backups independent of the LAN. The advantage of LAN-free backups is almost always increased throughput, thus making it easier to complete backups on time. As SANs have become an industry standard for consolidating storage resources, hundreds of application vendors now offer products that help support storage management and availability in a SAN. In addition, as SANs are assembled using industry-standard parts and protocols such as FCP and iSCSI, an administrator can design a SAN using off-the-shelf parts.

NAS Filers

NAS filers have been at the heart of the file server consolidation boom. Organizations that face the challenge of scaling out file servers can simply purchase a single NAS filer with more than a terabyte of available storage. One of the greatest selling points of NAS filers has been that they're plug-and-play in nature, allowing them to be deployed within minutes. At the same time, however, most NAS solutions are vendor-centric, meaning that they don't always easily integrate with other network management products. NAS vendors such as EMC and NetApp offer support for a common protocol known as Network Data Management Protocol (NDMP), which allows third-party backup products to back up data on EMC and NetApp appliances. The benefit of NDMP is that it is intended to be vendor neutral, meaning that if a backup product supports NDMP, it can back up any NDMP-enabled NAS appliance. Microsoft NAS appliances that run the Windows Storage Server OS, however, do not support NDMP. Backing up a Windows Storage Server NAS will require the installation of backup agent software on the NAS itself. Traditional NAS vendor offerings do not allow administrators to install backup software on their filers. The reason for this restriction is to guarantee the availability of the NAS; however, it significantly ties the hands of administrators when they're looking for flexibility.
Later, this chapter will spend more time contrasting the role of NAS in server and storage consolidation with that of other products.


DFS

DFS has been seen as an easy way to combat server sprawl, at least from a user or application perspective. As servers are added to a network to accommodate growth and demand, the new servers can be referenced under a single domain-based DFS root. This feature allows the addition of the new servers to be transparent to users. DFS can equally support server consolidation. For organizations that are consolidating and removing servers from the LAN, DFS can add a layer of transparency to the consolidation process. If user workstations and applications are set up to access file shares via the DFS root, administrators are free to move and relocate shares in the background and simply update the links that exist at the DFS root once the migration is complete. Thus, the way users access file systems will be the same both before and after the migration.

Virtualization

Virtualization technologies have recently jumped to the forefront of organizations' efforts to consolidate and simplify data access and management. This section will look at how storage virtualization has aided the storage consolidation efforts of SANs and how server virtualization has enabled companies to reduce the number of physical servers on their LANs by as much as 75 percent.

Storage Virtualization

As the number of managed storage resources on a LAN grows, so do the time and cost of managing those resources. Implementing a SAN provides an excellent first step toward consolidation and easing an administrator's storage management burden; however, the SAN alone may not be enough. This is where storage virtualization comes into the picture. There are plenty of ways to define storage virtualization, but to keep it simple, consider storage virtualization to be the logical abstraction of physical storage resources. In other words, a logical access layer is placed in front of physical storage resources.
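The DFS root described earlier is itself a simple example of such a logical access layer: clients always ask for a path under the root, and administrators repoint the links as servers are consolidated. A minimal Python sketch of that idea follows; the server and share names are hypothetical.

# Conceptual DFS root: logical namespace paths -> current physical share targets.
dfs_root = {
    r"\\company.com\files\reports":  r"\\oldserver1\reports",
    r"\\company.com\files\projects": r"\\oldserver2\projects",
}

def resolve(logical_path):
    # Clients only ever see the logical path; the root hands back the target.
    return dfs_root[logical_path]

print(resolve(r"\\company.com\files\reports"))

# After consolidating oldserver1 onto a cluster, only the link changes;
# users keep using the same logical path.
dfs_root[r"\\company.com\files\reports"] = r"\\cluster1\reports"
print(resolve(r"\\company.com\files\reports"))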
Storage virtualization is often a confusing topic because several storage vendors have their own definition of the term. Competing vendors, most of which claim to have invented storage virtualization, may offer differing definitions. A common voice for storage virtualization can be found at the Storage Networking Industry Association (http://www.snia.org).

Figure 2.1 provides a simple illustration of storage virtualization. The primary point of storage virtualization is to logically present physical storage resources to servers. This setup often results in better storage utilization and simplified management. As Figure 2.1 illustrates, storage virtualization starts by adding a data access layer between systems and their storage devices.


Figure 2.1: Virtualization access layer for physical storage resources.

The actual virtualization layer can be composed of several different technologies. Among the virtualization technologies that may exist between servers and storage are:
- In-band virtualization
- Out-of-band virtualization
- Hierarchical Storage Management (HSM)
- Policy-based storage virtualization

The next four sections explore how each of these virtualization architectures aids in storage consolidation.
In-Band Virtualization

With storage virtualization, the term in-band implies that the virtualization device lies in the data path. The purpose of the device is to control the SAN storage resources seen by each server attached to the SAN. This level of virtualization goes far beyond traditional SAN segmentation practices such as zoning by allowing an administrator to allocate storage resources at the partition level, instead of at the volume level. Figure 2.2 shows an example of in-band virtualization.


Figure 2.2: In-band storage virtualization.

Notice that there is a virtualization appliance in the data path. The role of the appliance is to logically present storage resources to the physical servers connected to the SAN. Also, as it resides directly in the data path, the appliance provides more control over physical separation of SAN resources. For simplicity, the virtualization appliance is shown as an independent device, but that doesn't have to be the case. Although established NAS vendors such as Network Appliance and EMC are now offering virtualization appliances that fall in line with their general NAS philosophy, fabric switch vendors such as Cisco Systems and Brocade are integrating virtualization intelligence into their fabric switches. Thus, the virtualization appliance does not have to be a standalone box and instead can seamlessly integrate into a SAN fabric.
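A minimal way to picture the work an in-band device does is as a block-address translation performed on every I/O that passes through it, as in the Python sketch below. The extent size, volume, and array names are hypothetical; real products layer caching, striping, and replication on top of this basic idea.

# Logical volume -> ordered list of (physical array, LUN, starting block) extents.
EXTENT_BLOCKS = 1_000_000      # assumed extent size
volume_map = {
    "vol_finance": [("array_a", 3, 0), ("array_b", 1, 0)],
}

def translate(volume, logical_block):
    # Every read and write passes through the appliance, which decides which
    # physical array and LUN actually holds the requested block.
    extent_index, offset = divmod(logical_block, EXTENT_BLOCKS)
    array, lun, start = volume_map[volume][extent_index]
    return array, lun, start + offset

print(translate("vol_finance", 42))           # resolves to array_a, LUN 3
print(translate("vol_finance", 1_000_042))    # resolves to array_b, LUN 1

Because this translation happens on the data path itself, it is also the source of the latency concern discussed shortly.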
Oftentimes, multi-function SAN devices initially appear to have a higher cost than their single-function counterparts. However, every device introduced to the data path in a SAN can become a single point of failure, a shortcoming that is often overcome by adding redundant components. Thus, for fault tolerance, an organization will need two of each single-function device. If the devices across the SAN can't be cohesively managed, separate management utilities will be required for each. Comparing this option with solutions such as running VERITAS Storage Foundation on top of a Cisco MDS 9000 switched fabric will reveal a significantly lower cost of ownership.

To get the most out of in-band virtualization, many organizations deploy a software storage virtualization controller such as IBM SAN Volume Controller or VERITAS Storage Foundation. With the virtualization component residing in fabric switches as opposed to living on standalone appliances, an organization will have fewer potential single points of failure in the SAN.


Many organizations have been wary about deploying in-band virtualization because of the overhead of the virtualization appliance. Because it sits inside the data path and must make logical decisions as data passes through it, the virtualization appliance introduces at least marginal latency to the SAN. Vendors such as Cisco have worked to overcome this issue by adding a data cache to the appliance or switch. Although adding a cache can improve latency, the overhead will still likely be noticeable in performance-intensive deployments.
Out-of-Band Virtualization

Out-of-band storage virtualization differs from in-band virtualization in the location of the virtualization device or the software controlling the virtualization. With out-of-band virtualization, the virtualization device resides outside of the data path (see Figure 2.3). Thus, data and control travel over two separate paths. (With in-band virtualization, both data and control signals use the same path.)

Figure 2.3: Out-of-band storage virtualization.

With control separated from the data path, out-of-band virtualization deployments don't share the same latency problems as in-band virtualization. Also, because out-of-band deployments don't reside directly in the data path, they can be deployed without major changes to the SAN topology. Out-of-band solutions can be hardware or software based. For example, a DFS root server can provide out-of-band virtualization. DFS clients locate data on the DFS root server and are then redirected to the location of a DFS link. Data transfer occurs directly between the server hosting the DFS link and the client accessing the data. This design gives out-of-band deployments less data path overhead than in-band virtualization.
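The split between control path and data path can be sketched as a two-step exchange: the client asks the out-of-band service where the data lives, then performs the I/O directly against that location. The service contents and server names below are hypothetical.

# Out-of-band control path: a metadata service that only answers "where is it?"
location_service = {"/projects/plan.doc": ("fileserver2", "/vol1/plan.doc")}

def read_file(path):
    # Step 1 (control path): ask the out-of-band service for the data's location.
    server, physical_path = location_service[path]
    # Step 2 (data path): transfer data directly from that server; the
    # virtualization device never touches the payload, so it adds no I/O latency.
    print(f"Reading {physical_path} directly from {server}")
    return f"<contents of {physical_path}>"

read_file("/projects/plan.doc")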


Another advantage of out-of-band virtualization is that it's not vendor or storage centric. For example, IBM and Cisco sell a bundled in-band virtualization solution that requires specific hardware from Cisco and software from IBM. Although both vendors' solutions are effective, some administrators don't like feeling that an investment in technology will equal a marriage to a particular vendor. However, purchasing a fabric switch such as the Cisco MDS 9000 series that offers in-band virtualization, as opposed to a dedicated in-band appliance, will still offer some degree of flexibility. If an organization decides to move to an out-of-band solution later, the company will still be able to use the switch on the SAN. There are several options available for pooling and sharing disk resources on a SAN. Although in-band and out-of-band virtualization differ in their approach to storage virtualization, both options offer the ability to make the most out of a storage investment. Consolidating storage resources on a SAN is often the first step in moving toward a more scalable storage growth model. Adding virtualization to complement the shared SAN storage will provide greater control of shared resources and likely allow even more savings in terms of storage utilization and management from the SAN investment.
HSM

HSM is a management concept that has been around for several years and is finally starting to gain traction as a method for controlling storage growth. Think of HSM as an automated archival tool. As files exceed a predetermined last-access age, they are moved to slower, less expensive media such as tape. When the HSM tool archives a file, it leaves behind a stub file, which is usually a few kilobytes in size and contains a pointer to the actual physical location of the file's data. The use of stub files is significant because it provides a layer of transparency to users and applications accessing the file. If a file has been migrated off of a file server, leaving a stub file allows users to access the file as they usually do without being aware of the file's new location. The only noticeable difference for users will be the time it takes for the file to be retrieved by the HSM application.
Some HSM tools have moved away from the use of stub files and work at the directory level instead, thus allowing the contents of an entire folder to be archived. NuView's StorageX performs HSM with this approach. StorageX is able to leverage the existing features of DFS and NFS to archive folders, while adding a layer of transparency for the end user.

Figure 2.4 shows a simple HSM deployment. In this example, files are migrated from a disk array on the SAN to a tape library. The migration job would be facilitated by a file server attached to the SAN that is running an HSM application.


Figure 2.4: Migrating files older than 6 months to tape.

HSM tools typically set file migration criteria based on:
- Last access time
- Minimum size
- Type

When a migration job is run, files that meet the migration criteria are moved to a storage device such as tape, and a stub file is left in each file's place. The advantage of HSM for storage consolidation is that it controls the growth of online storage resources. By incorporating near-line storage into a storage infrastructure, an organization can continue to meet the needs of online data demands while minimizing the load on online storage devices as well as the size of backups. To further understand HSM, consider the example of a law office. Many legal organizations maintain electronic copies of contracts; however, it may be months or even years before a contract document needs to be viewed. In circumstances such as this, HSM is ideal. HSM allows all contract documents to remain available while controlling the amount of online disk consumption and needed full backup space.
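The stub-file mechanism can be sketched as follows: files whose last access time exceeds the policy age are moved to near-line media and replaced by a small pointer file. The paths, cutoff age, and archive location in this Python sketch are hypothetical, and a real HSM product would also intercept opens on the stub to trigger an automatic recall.

import os, shutil, time

MAX_AGE_DAYS = 180                     # assumed "last accessed" cutoff
ARCHIVE_ROOT = "/mnt/nearline"         # assumed tape/near-line staging area

def migrate_old_files(directory):
    os.makedirs(ARCHIVE_ROOT, exist_ok=True)
    cutoff = time.time() - MAX_AGE_DAYS * 86400
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.stat(path).st_atime < cutoff:
            target = os.path.join(ARCHIVE_ROOT, name)
            shutil.move(path, target)                  # move the data to near-line storage
            with open(path, "w") as stub:              # leave a tiny stub behind
                stub.write(f"HSM-STUB -> {target}\n")  # pointer to the file's real location
            print(f"Migrated {path} -> {target}")

migrate_old_files("/shares/contracts")   # hypothetical directory, as in the law office example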


Policy-Based Storage Virtualization

Several backup software vendors, including VERITAS and CommVault, currently offer policy-based storage virtualization. This approach to virtualization has simplified how backup and restore operations are run. By using a logical container known as a storage policy, backup administrators no longer need to know the physical location of data on backup media. Instead, the physical location of data is managed by the policy. Consider a storage policy to be a logical placeholder that defines the following:
- Backup target device (library, disk array, and so on)
- Backup medium (tape, disk)
- Backup data retention days

When a server is defined to be backed up using enterprise backup software, an administrator does not need to select a backup target. This selection is automated through the use of the storage policy. Although backups are simplified through the use of storage policies, restores are where an organization will see the most benefit. When an administrator attempts to restore data, he or she does not need to know which tapes the necessary backups were stored on. Instead, the administrator simply selects the system and the files to restore. If any exported tapes are needed for the restore job, the administrator will be prompted. If the tapes are already available in the library, the restore will simply run and complete. The advantage of this approach is that administrators don't need to scan through a series of reports to find a particular backup medium in order to run a restore. Instead, they can simply select a server, a file, and a date from which to restore. This level of storage virtualization is especially useful in large and enterprise-class environments with terabytes to petabytes of managed storage. As storage continues to grow, it is becoming increasingly difficult to manage. Adding a storage policy management layer to data access alleviates many of these problems.
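A storage policy can be thought of as a small record that the backup software consults instead of the administrator. The field names and values in this Python sketch are illustrative only, not any particular vendor's schema.

from dataclasses import dataclass

@dataclass
class StoragePolicy:
    name: str
    target_device: str      # library or disk array that receives the backup
    medium: str             # "tape" or "disk"
    retention_days: int

policies = {"FileServers": StoragePolicy("FileServers", "TapeLibrary01", "tape", 30)}
server_policy = {"filesrv1": "FileServers"}     # servers are bound to a policy, not to a device

def backup_target_for(server):
    # The administrator picks a policy once; the policy then supplies the target,
    # medium, and retention for every subsequent backup and restore of that server.
    return policies[server_policy[server]]

print(backup_target_for("filesrv1"))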
For more information about storage virtualization, download a copy of SNIA's Shared Storage Model: A Framework for Describing Storage Architectures at http://www.snia.org/tech_activities/shared_storage_model. This document describes the SNIA standards for a layered storage architecture.

Storage virtualization is often seen as a first step in consolidating network resources. With storage resources centrally pooled and managed, the constraints of running tens to hundreds (or even thousands) of independent data islands on the LAN are eliminated. By consolidating to a SAN, the storage investment more closely tracks actual storage needs, and organizations will have an easier time backing up servers within an allocated backup window. With storage firmly under control, the next logical step in consolidating network resources is server virtualization.


Server Virtualization

Server virtualization involves freeing servers on the network from their normal hardware dependencies. The result of server virtualization is often additional space in the server room. This benefit stems from the fact that several servers on the network are most likely not using nearly all of their physical resources. For example, an organization might have one file server that averages 10 percent CPU utilization per day. Suppose that peak utilization hits 30 percent. Ultimately, the majority of that server's CPU resources are doing nothing. One way to solve the problem of under-utilized server hardware resources is to run multiple logical servers on one physical box. Today, there are two fundamental approaches to achieving this:
- Virtual machines
- Shared data clusters

The next two sections show how each of these approaches allows for a reduction in the amount of physical resources on the LAN.

Virtual Machines

The use of virtual machines enables the running of multiple independent logical servers on a single physical server. With virtual machines, the hardware needed for each virtual machine to run is emulated by the virtual machine-hosting application. The use of virtual machines offers several advantages:
- Allows for a reduction in the number of physical systems on the network
- Provides hardware independence: a virtual machine can be migrated from one physical host to another without significant driver updates
- Enables an organization to run legacy application servers on newer, more reliable hardware

Simplified Recovery

When restoring a Windows server backup onto a system containing different hardware, it can be difficult to get the restored backup to boot. This difficulty is usually attributed to the characteristics of the Windows System State. When a System State backup is run on a system, the system's registry, device drivers, boot files, and all other system files are collectively backed up. When the System State is restored, all of these files come back as well. This behavior can cause problems if an administrator is restoring to different hardware, which is usually the case when a restore is needed to recover from a complete system failure or a disaster. Once the restore operation writes all of the old registry, boot, and driver settings to the new system, odds are that the system will blue screen the first time it boots. Sometimes recovery from these problems takes only a few minutes; for some administrators, however, the process may take several hours. Because virtual machines virtualize system hardware, restoring a virtual machine to another virtual machine running on a separate host system should be relatively problem free, as the hardware seen by the OS on each system will be nearly identical. Thus, portability is another benefit provided by virtual machines in production environments.
Major Drawbacks

Because of these advantages, some administrators rushed to fully virtualize their complete production environments, only to later return some virtualized servers to physical systems. The reason lay in the major drawbacks to running virtual machines in production:
- Additional latency
- No reduction in the number of managed systems

Virtual machine host applications emulate system resources for their hosted OSs, so an additional access layer exists between virtual machines and the resources they use. Some of the latency encountered by virtual machines can be reduced by having them connect directly to physical disks instead of using virtual disk files. However, with CPU-intensive applications, the latency is still noticeable. The other hidden drawback to consolidation through the use of virtual machines is that it does not reduce the number of managed systems on the network; instead, it can increase that number. For example, if an administrator plans to consolidate 24 servers to virtual machines running on three hosts, there will be 27 servers to manage: the 24 original systems as well as each of the three virtual machine host servers. Therefore, although virtual machines allow fewer physical resources in the server room, there will still be the same number or even more servers that require software, OS, and security updates.

Shared Data Clusters

Shared data clusters have emerged as a way to escape the boundaries of server consolidation through virtual machines. Unlike virtual machines, shared data clusters can reduce both the number of physical systems and the number of managed OSs on the network. A major difference between shared data clusters and virtual machine applications is their approach to consolidation. Shared data clusters are application centric, meaning that the clustered applications drive the access to resources. Each application, whether a supported database application or file server, can directly address physical resources. Also, because the approach is application centric, the consolidation ratio doesn't need to be 1-to-1. Thus, 24 production file servers could perhaps run on four cluster nodes. As the virtualized servers exist as part of the clustering application, they are not true managed systems. Instead, there would be only four true servers to update and maintain. This option offers the benefit of physical server consolidation at a highly reduced cost of ownership.


Shared data clusters also offer the following advantages:
- Failover support: If a cluster node fails, an application can move to another node in the cluster, thus maintaining data availability after a system failure
- Shared data support: Shared data clustering provides the ability for data sharing between hosted file serving and database applications
- Load-balancing support: Client connections can be distributed among multiple cluster nodes; with traditional failover clustering, each virtual server entity is hosted by a single system
- Simplified backup: With shared data support, backups can be driven through a single cluster node, thus simplifying backups and restores as well as reducing the number of required licenses for backup software

Comparing Virtual Machines and Shared Data Clusters

To make sense of these two approaches, Table 2.1 lists the major differences between server consolidation via virtual machines and consolidation by deploying shared data clusters.

Administrative Task: Reduce the number of managed systems
  Virtual Machines: Each virtual machine must still be independently updated and patched; the number of managed systems will likely increase as virtual machine hosts are added
  Shared Data Clusters: The number of managed systems is significantly reduced, with the cluster nodes representing the total number of managed systems

Administrative Task: Consolidate legacy file servers to a single physical system
  Virtual Machines: Supported
  Shared Data Clusters: Supported

Administrative Task: Failover
  Virtual Machines: Yes, via installed OS or third-party application
  Shared Data Clusters: Yes

Administrative Task: Virtualization software overhead
  Virtual Machines: Up to 25 percent
  Shared Data Clusters: None

Administrative Task: Single point of failure
  Virtual Machines: Potentially each virtual machine
  Shared Data Clusters: No

Administrative Task: Ability to share data
  Virtual Machines: No
  Shared Data Clusters: Yes

Administrative Task: Backup and recovery
  Virtual Machines: Each virtual machine must be backed up independently
  Shared Data Clusters: Shared storage resources in the SAN can be backed up through a single node attached to the SAN

Administrative Task: Support for legacy applications
  Virtual Machines: Yes
  Shared Data Clusters: No

Table 2.1: Virtual machine vs. shared data clusters.


As this table illustrates, server consolidation via shared data clusters offers several advantages over consolidation with virtual machines. With shared data clusters, you wind up with far fewer managed systems, no additional CPU overhead, the ability to share data between applications, and no single point of failure. However, application support is limited to what is offered by the shared data cluster vendor. Virtual machines can host nearly any x86 OS and thus are well-suited for consolidating legacy application servers, such as older NetWare, Windows NT, or even DOS servers that have had to remain in production in order to support a single application. When looking to use virtual machines or shared data clusters to support server consolidation, an organization might not need to choose between one and the other. Instead, an organization should look to use each approach where it's best suited: virtual machines for consolidating legacy application servers, and shared data clusters for supporting file and database server consolidation for mission-critical applications.
In consolidating to a shared data cluster, organizations not only see the benefit of fewer managed systems but also realize the benefits of high availability and improved performance.

Examining Unappliance vs. Appliance Solutions


At the heart of server consolidation are two fundamentally different points of view: appliance and unappliance. Each differs in its approach to both the hardware and software used in the consolidation effort:
- Appliance: A vendor-proprietary hardware and/or software solution that provides for server and/or storage consolidation
- Unappliance: A vendor-neutral x86-based hardware and software solution

These approaches to consolidation are significantly different, with both short-term and long-term consequences. This section will look at the specific differentiators between unappliance and appliance solutions:
- Proprietary vs. open solutions
- Volume economics
- Integration with existing infrastructure and investments
- Scalability
- Backup challenges

The section will begin with a look at the key differences between proprietary and open solutions.


Proprietary vs. Open Solutions

When retooling a network, there are convincing arguments for both proprietary and open solutions. Proprietary solutions are usually packaged by a single vendor or group of vendors and are often deployed with a set of well-defined guidelines or by a team of engineers who work for the company offering the solution. A long-term benefit of a proprietary solution is that an organization ends up with a tested system that has predictable performance and results. A drawback to this approach, however, lies in cost; the bundled solution often costs more than comparable open solutions. Another difference with proprietary solutions is that they may be managed by a proprietary OS. For example, many NAS filers run a proprietary OS, so management of the filer after it is deployed will require user training or at least a few calls to the Help desk. Aside from the initial cost, buying a proprietary solution could lead to additional costs in the long run. Upgrading a vendor-specific NAS, for example, will likely require the purchase of hardware through the same vendor. Software updates will also need to come through the vendor. Finally, if the network outgrows the proprietary solution, an organization might find that it must start all over again with either another proprietary solution or an open-standard solution.
The most obvious difference between open solutions and proprietary solutions is usually cost, but the differences extend much further. Open solutions today are based on industry-standard Intel platforms and can run any x86-class OS, such as Windows, Linux, or NetWare. For hardware support, there are several vendors selling the same products, thus lowering the overall cost. Also, with open-standard hardware able to support a variety of applications and OSs, servers that are replaced can be moved to other roles within the organization. A drawback to open systems is often support. Proprietary solution vendors will frequently argue that they provide end-to-end support for their entire solution. In many cases, the several vendors involved in an open-architecture network may point fingers at one another when a problem occurs. For example, a storage application vendor may say a problem is the result of a defective SAN switch. The SAN vendor may respond that the problem is with a driver on the application server, or that the application is untested and thus not supported.
Although finger pointing often occurs in the deployment and troubleshooting of hardware and software on open architectures, it is often the result of a lack of knowledge among the parties involved. Consider Ethernet as an example. Today, just about anyone can assist with Ethernet network troubleshooting, because this open standard has been around for several years. The same can be said about SANs. As SANs have matured, the number of skilled IT professionals that understand SANs has grown too. This growth leads to more effective troubleshooting, less finger pointing, and often less fear of migrating to a SAN.

Thus, although open systems don't always offer the same peace of mind as proprietary solutions, their price is often enough to sway IT decision makers in their direction.


Volume Economics

In IT circles, proprietary is almost always equated with expense. This association is perhaps the simplest argument for going with a non-proprietary solution. With non-proprietary hardware, an organization can choose its preferred servers and storage infrastructure. Also, the use of industry-standard equipment allows the free use of any existing management applications on the consolidated server solution. In March of 2004, PolyServe studied the price and performance differences between a proprietary NAS filer and a shared data cluster. In the study, the company found that going with a shared data cluster over NAS resulted in 83 percent savings. To arrive at the savings, PolyServe priced the hardware and software necessary to build a 2-node PolyServe shared data cluster. The cost of 12.6TB of storage in a SAN and two industry-standard servers, each with two CPUs and 2GB of RAM and running Windows Server 2003 (WS2K3) and PolyServe Matrix Server, was $79,242. In comparison, two NAS filers with 12.6TB of storage, the CIFS file serving option, and the cluster failover option cost $476,000. In terms of cost per terabyte, the proprietary NAS appliance-based solution cost $38,000 per terabyte; the unappliance-based solution cost $6,300 per terabyte.

Integration with Existing Infrastructure and Investments

Integration is another key differentiator between appliance and unappliance philosophies. Proprietary appliances are often limited to management tools provided by the appliance vendor. Installation of management software on an appliance is often taboo. Many NAS appliances can connect to and integrate with SANs to some degree, but the level of interoperability is not that of an open, unappliance-based system.

The Scalability Dilemma

With an initial investment in a NAS appliance, an organization's needs will likely be satisfied for the next 12 to 18 months. Many proprietary NAS appliances offer some scalability in terms of storage growth by allowing for the attachment of additional external SCSI arrays or connectivity to fibre channel storage via a SAN. Scaling to meet performance demand is much more difficult for proprietary NAS. Unlike shared data clusters on industry-standard architecture, proprietary NAS solutions may offer failover but do not offer load balancing of file serving data. Because two or more NAS appliances cannot simultaneously share the same data in a SAN, they cannot offer true load balancing for access to a common data store. Instead, as client demand grows, the NAS head often becomes a bottleneck for data access. To address scalability, organizations often have to deploy multiple NAS heads and divide data equally among them. Tools such as DFS can make the addition of the NAS heads transparent to end users. Unappliance-based shared data clusters do not run into the same scalability issues as proprietary NAS appliances. With open hardware and architecture, additional nodes can be added to the cluster and attached to the SAN as client load increases. Furthermore, with the ability to load balance client access across multiple nodes simultaneously against a common data store, shared data clusters can seamlessly scale to meet client demand as well.
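Returning to the volume-economics figures above, the arithmetic works out as follows. The dollar amounts and capacity come from the PolyServe study cited in the text; the rounding in the comments is approximate.

capacity_tb = 12.6
cluster_cost = 79_242        # 2-node shared data cluster plus 12.6TB of SAN storage
nas_cost = 476_000           # two proprietary NAS filers with 12.6TB, CIFS, and failover options

savings = (nas_cost - cluster_cost) / nas_cost
print(f"Savings: {savings:.0%}")                                     # 83%
print(f"NAS cost per TB: ${nas_cost / capacity_tb:,.0f}")            # ~$37,778 (about $38,000)
print(f"Cluster cost per TB: ${cluster_cost / capacity_tb:,.0f}")    # ~$6,289 (about $6,300)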


Backup Challenges

Many proprietary NAS appliances have not been able to work well with current open backup practices. Instead, each vendor typically offers its own method of advanced backup functionality. For example, both EMC and NetApp provide proprietary snapshot solutions for performing block-level data backups. To perform a traditional backup, backup products that support NDMP can issue NDMP backup commands to the NAS appliance. Figure 2.5 illustrates an example of an NDMP-enabled backup.

Figure 2.5: NDMP-enabled backup.

As additional NAS appliances are added, they too can be independently backed up via NDMP. To keep up with newer industry-standard backup methods such as server-free and server-less backups (discussed in Chapter 6), NAS appliance vendors have worked to develop their own methods of backing up their appliances without consuming CPU cycles on the NAS head. Not all NAS appliances can directly manage the robot arms of a tape library, so a media server must often be involved in the backup process to load a tape for the NAS appliance. Once the tape is loaded, the NAS can then back up its data. Figure 2.6 shows how backups can be configured on a shared data cluster. Notice in this scenario that one of the cluster nodes is handling the role of the backup server. With two other nodes in the cluster actively serving client requests, the dedicated failover node is free to run backup and restore jobs behind the scenes.


Figure 2.6: Unappliance-based shared data cluster backup.

The key to making this all work lies in the fact that all nodes in the shared data cluster can access the shared storage simultaneously. This functionality allows the passive node to access shared data for the purpose of backup and restore. Also, as the passive node is doing the backup work, the two active nodes do not incur any CPU overhead while the backup is running.
Notice that the backup data path appears to reach the failover node before heading to the library. The behavior of the backup data will be ultimately determined by the features of the backup software and SAN hardware. For example, if the backup software and SAN hardware support SCSI-3 X-Copy, backup data will be able to go directly from the storage array to the tape library without having to touch a server.

Another advantage of backing up data in an unappliance shared data cluster is the reduced cost of backup licenses. Because any node in the cluster could potentially back up or restore the shared data, only one node needs to have backup software installed on it, and consequently only one backup license is required. As a cluster continues to scale, the savings on backup licensing become even more apparent.

Taming Server and Storage Growth: The Non-Proprietary Approach


With most of the industry leaning toward managing growth and scalability through non-proprietary hardware, the remainder of this chapter will focus on building a scalable file serving infrastructure on non-proprietary hardware. Taming server and storage growth requires consolidation with an eye on scalability and management of both file servers and storage resources.

Storage Consolidation via SAN

For several years, SANs have been a logical choice to support storage consolidation. Because SANs share data using industry-standard protocols such as FCP and iSCSI, products from several hardware vendors are available to build a SAN. Because the products adhere to industry standards, they allow the flexibility to mix and match products from different vendors. One of the primary goals of consolidating via a SAN is to remove the numerous independent data islands on the network. The SAN can potentially give any server with access to the SAN the ability to reach shared resources on the SAN. This feature can provide additional flexibility for both data management and backups. In addition to sharing disks, an administrator can share storage resources such as tape libraries, making it easier to back up data within the constraints of a backup window. With SAN-based LAN-free backups, backup data does not have to traverse a LAN in order to reach its backup target. To be able to perform LAN-free backups on the SAN, the administrator will need to ensure that the backup vendor supports the planned SAN deployment. The use of SAN routers enables the continued use of SCSI-based storage resources on the SAN. This may mean that to get started with a SAN, an organization will need to purchase only the following:
• 1 to 2 HBAs for each server; two HBAs are required for multipath support, which provides fault tolerance along the complete SAN data path
• 1 to 2 fabric switches; two switches are required for redundant fabrics
• 1 router; the quantity of routers will vary depending on the number of SCSI resources being moved to the SAN

The number of routers required for the initial SAN deployment may need to be increased to address concerns such as a bottleneck forming at the router's fibre channel port. Chapter 3 covers these issues in detail.

With several servers potentially accessing the same data on the SAN, care will need to be taken to ensure that corruption does not occur. The choice of virtualization engine or configuration of zoning or LUN masking can protect shared resources from corruption.

Server Consolidation via Clustering

Server consolidation via clustering offers the genuine benefit of reducing the total number of managed systems on the network. In addition, consolidating to a shared data cluster provides greater flexibility with backups, the ability to load balance client requests, and failover support. Shared data clusters offer the benefit of scaling to meet changes in both client load and storage. Because they run on industry-standard hardware, scaling is an inexpensive proposition. Furthermore, consolidating to shared data clusters offers significant savings in cost of ownership and software licensing. Consolidating a large number of servers to a shared data cluster results in the need to maintain fewer OS, backup, and antivirus licenses. With fewer managed systems on the network, administrators will have fewer hardware and software resources to maintain on a daily basis. With substantial cost savings over proprietary NAS appliance solutions, shared data clustering has emerged as an easy fit in many organizations.

Planning for Growth While Maintaining Freedom


When consolidating with growth in mind, there are several best practices to keep in mind. Consider the following guidelines:
• Design for scalability
• Design for availability
• Use mature products
• Avoid proprietary hardware solutions

One of the most over-used words in the IT vocabulary is scalability; it is also one of the most important. Scalability means that all elements of the network infrastructure should support company growth, both expected and unexpected. For example, if current requirements warrant the purchase of an 8-port fibre channel switch, consider purchasing a 16-port switch. Ensure that the planned SAN switches offer expansion ports so that the option remains to scale out further in the event that growth surpasses projections. Scalability can also be greatly aided by the use of centralized management software. Many of the solutions mentioned earlier in this chapter can help with centrally managing storage resources and backups across the enterprise. If performance is an issue, another major scalability concern is technologies that support load balancing of client access. Such technologies alleviate the single-server bottlenecks that result from unexpected client growth.

In designing for availability, redundancy is always crucial. Redundancy often starts with shared storage on the SAN via a RAID implementation. However, adding redundancy to the physical disks is most valuable when the disks sit behind a fully redundant data path. Ensuring true redundancy means not only protecting disks but also:
• Using cluster architecture to prevent data access loss due to a system failure
• Using redundant switches in the SAN fabric and redundant HBAs
• Using multipath-compliant HBA drivers on servers to ensure that they can realize the benefits of redundancy in the SAN
• Planning for redundant power sources to prevent data loss or corruption from a power failure
• Adding redundancy to the LAN to prevent a switch failure from interrupting data access

Using mature products refers to products that have established reputations in IT. Regardless of how great the sales pitch, bringing an unknown product into the network is always a risk. Products with the backing of OS vendors such as Microsoft are more likely to do what they promise, and their vendors are more likely to be in business at both the beginning and the end of the project. With server and storage consolidation, it is often tempting to purchase a proprietary solution from a single hardware vendor. With a SAN, there will be little problem integrating solutions from vendors such as Brocade, McData, and Cisco Systems. However, several vendors offer end-to-end solutions. If the proposed solution involves proprietary hardware, it is unlikely that the solution will work as well with the rest of the systems on the network. For example, management and reporting cannot be done with the tools currently in use and instead will need to be handled through an add-on utility from the product vendor or through custom scripts. Most vendors have no problem offering professional services and can write a script to handle most tasks. However, each time support is required for the script, an organization might end up spending additional dollars on more professional services. As an organization begins to piece together a planned network, it must pay close attention to the product support matrices listed on each vendor's Web site. Doing so ensures that the proposed pieces have been tested and will work well together. For planned products not on a vendor's support matrix, a good practice is to negotiate a pilot period to ensure the product works with the new hardware and software.

Summary
Chapter 1 painted a picture of the current file serving landscape. This chapter looked deeper into that picture and examined the available server and storage consolidation alternatives. Building on this framework of the technologies that may be involved in consolidation, the next chapter will look at how to piece these technologies together while emphasizing how to maintain high availability and high performance.

Chapter 3: Data Path Optimization for Enterprise File Serving


With enterprise file serving, much of the attention concerning availability and high performance is focused on the servers themselves. However, the clients that access data on file servers face many other obstacles and potential bottlenecks along the way. For a client's request to reach a file, it must traverse the network and reach the server; ultimately, the server must request the file from its attached storage. This chapter will view the entire landscape of the file serving picture. Although attention will be paid to the servers themselves, you will also see the common problems and pitfalls involved with getting data from storage to a client.

The Big Picture of File Access


Optimizing and protecting file access involves more than just setting up fault-tolerant disks on each file server. Figure 3.1 illustrates a simple example of the objects involved in a client accessing a file from a server.

Figure 3.1: Objects in the data path.

For the client to open the file residing on the SAN, the client's request will need to traverse Switch1, Router1, and Switch2 to reach the file server. For the file server to answer the request, the server will need to pull the file from the disk array through the fabric switch on the SAN. In this example, each device between the client computer and the file represents both a potential bottleneck and a single point of failure. A single point of failure is any single device whose failure would prevent data access.

Availability and Accessibility


Many administrators tout the fact that some of their servers have been available 99.999 percent of the time during the past year. To make better sense of uptime percentages, Table 3.1 quantifies uptime on an annual basis.
Uptime Percentage    Total Annual Downtime
99%                  3.65 days
99.9%                8.75 hours
99.99%               52.5 minutes
99.999%              5.25 minutes

Table 3.1: Quantifying downtime by uptime percentage.

With 99 percent uptime, you have 1 percent downtime. One percent of 365 days yields 3.65 days. If you divide this number by 52 (the number of weeks in a year), you average 0.07 days of downtime a week. This equates to 1.68 hours (0.07 days × 24 hours), or 1 hour and 41 minutes, of downtime per week. If you advance to the five nines (99.999 percent) of availability, a server can be offline no longer than 5.25 minutes a year, which equates to only 6 seconds of allowable weekly downtime!

Keeping your file servers available is always important. The amount of uptime that is required often varies by organization. For example, if no one will be accessing a file server on a Sunday, it might not be a big deal if it is down for 8 hours. For other shops that require 24×7 access, almost any downtime is detrimental. When measuring uptime, most organizations simply report on server availability; in other words, if the server is online, it is available. Although this method sounds logical, it is often not completely accurate. If the switch that interconnects clients to the server fails, the server is not accessible. It might be online and available, but if it is not accessible, it might as well be down. Thus, uptime percentages can be misleading. Having a server online is only valuable when the data path associated with the server is online as well. To truly deploy a highly available file server, the data path must be highly available as well. For high-performance file serving, the same logic holds true: to meet the performance expectations of the server, the data path must be able to support getting data to and from the server at a rate that, at a minimum, meets client demands. The next sections in this chapter provide examples of how to ensure availability and performance in the following data path elements:
• Redundant storage
• SANs
• LAN switches and routers
• Servers
• Power
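The downtime arithmetic above is easy to reproduce. The following Python sketch converts an uptime percentage into annual and weekly downtime and matches the numbers in Table 3.1:

```python
def downtime(uptime_percent):
    """Return annual downtime in minutes and weekly downtime in seconds."""
    fraction_down = 1 - uptime_percent / 100
    annual_minutes = fraction_down * 365 * 24 * 60
    weekly_seconds = annual_minutes * 60 / 52
    return annual_minutes, weekly_seconds

for pct in (99, 99.9, 99.99, 99.999):
    annual_minutes, weekly_seconds = downtime(pct)
    print(f"{pct}% uptime: {annual_minutes:,.1f} minutes/year, "
          f"{weekly_seconds:,.0f} seconds/week")
```

For 99 percent uptime this prints roughly 5,256 minutes per year (3.65 days) and about 6,065 seconds (1 hour and 41 minutes) per week; for five nines it prints 5.3 minutes per year and about 6 seconds per week.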

Let's start by looking at how to add redundancy and performance to storage resources.

Redundant Storage
Redundant storage can offer two elements that are crucial to high-performance and high-availability file serving. In configuring redundant storage, or a Redundant Array of Independent Disks (RAID), you configure two or more physical disks to collectively act as a single logical disk. This combination can result in both better performance and fault tolerance. In this section, you will see the most common types of RAID configurations as well as their relation to improving enterprise file serving.

RAID Levels

RAID levels are described by number, such as 0, 1, or 5. The most common RAID implementations today are:
• RAID 0
• RAID 1
• RAID 5
• RAID 0+1
• RAID 1+0
• RAID 5+0

Let's start with a look at RAID 0.

RAID 0

RAID 0 is not considered to be true RAID because it does not offer redundancy. For this reason, RAID 0 is often combined with other RAID levels in order to achieve fault tolerance. Although not fault tolerant, RAID 0 does offer the fastest performance of all RAID levels. RAID 0 achieves this level of performance by striping data across two or more physical disks. Striping means that data is written to multiple disks simultaneously. All the disks in what is known as the stripe set are seen by the operating system (OS) as a single physical disk. RAID 0 disk striping is depicted in Figure 3.2.

Figure 3.2: RAID 0 operation.

To understand how RAID 0 works, consider the following example. Suppose you wanted to store the word "get" in a RAID 0 array containing three disks. Now picture each disk as a cup. Because "get" has three letters, a single letter would be stored in each cup. With the letters evenly spaced out, you could theoretically drop all the letters into the three cups simultaneously. This scenario illustrates the advantage of RAID 0: it is fast. The problem, however, is that if one of the cups (disks) is lost or damaged, all data is lost. Because of its lack of fault tolerance, RAID 0 is not considered a good fit for file serving. Its raw performance makes it ideal for high-speed caching but risky for storing critical files. As you will see later in this chapter, RAID 0 can be combined with other RAID levels to achieve fault tolerance. This setup can give you the best of both worlds: speed and resiliency. The remaining RAID levels discussed in this section offer fault tolerance at the expense of some of the performance found in RAID 0.

RAID 1

RAID 1 is the first fault-tolerant RAID level, and is available in two forms:
• Disk mirroring
• Disk duplexing

With both disk mirroring and disk duplexing, two or more physical disks provide redundancy by having one or more disks mirror another. Data written or deleted from one disk in the mirror set is automatically written or deleted on all other disks in the set. With this approach, fault tolerance is ensured by having redundant copies of the same data on several disks. The failure of a single disk will not cause any data loss. The disk mirroring and disk duplexing implementations of RAID 1 differ in how the physical drives are connected. With disk mirroring, the physical disks in the mirror set are connected to the same disk controller. With disk duplexing, the physical disks in the mirror set are connected using at least two disk controllers. Disk duplexing is the more fault tolerant RAID 1 implementation because it eliminates a disk controller as a single point of failure. When a file is saved to a two-disk RAID 1 array, it is written to both disks simultaneously. Thus, the actual write operation does not complete until the file is finished being written to the slowest disk. The result of this architecture is that the actual performance of the RAID 1 array will be equal to the speed of the slowest disk. RAID 1 is ideal when you are looking for an easy means to ensure data redundancy. RAID 1 automatically creates a mirror of disks, so you are getting a continuous online backup of data. This setup allows for little to no data loss in the event of a disk failure.
RAID should not be considered a substitute for backing up data. Although fault-tolerant RAID protects against the failure of a single disk, it does not protect against data corruption or disasters. Because of this shortcoming, regular backups to media stored offsite should still be performed.

The one disadvantage to RAID 1 is that you have to purchase at least twice the amount of disk space for the data you want to store, depending on the number of disks in the RAID 1 mirror. If you are planning to configure two disks to mirror each other, remember that one disk will work exclusively as a backup. For example, a 100GB RAID 1 volume consisting of two physical disks would need a total of 200GB of storage (two disks × 100GB).

RAID 5

RAID 5 operates similarly to RAID 0 by striping data across multiple disks. However, it differs in the following ways:
• RAID 5 uses parity to achieve fault tolerance
• RAID 5 requires three or more physical disks (RAID 0 requires only two disks)

With each write, RAID 5 writes a parity bit to one disk in the array. This functionality allows a RAID 5 array to lose a single disk and still operate. However, if a second disk in the array fails, all data is lost; that loss would result in having to rebuild the array and restore data from backup. In terms of performance, RAID 5 is slower than RAID 0 but outperforms RAID 1. Because RAID 5 uses parity to provide fault tolerance, you must consider the storage of the parity data when sizing a RAID 5 array. Parity writes effectively take up one disk in the array. Thus, with a three-disk array, two disks would store actual data and the third disk would store the parity bits. If you built a RAID 5 array with three 100GB disks, you would have 200GB of available storage, enabling you to store actual data on 67 percent of your purchased storage. If you add disks to the array, the efficiency of the array improves. For example, with four disks in the array, three disks would store data and one disk would store parity bits, giving you 75 percent utilization of your disks. RAID 5 has been very popular for enterprise file serving because it offers better speed than RAID 1 and is more efficient. Although it is slower than RAID 0, the fact that it provides fault tolerance makes it desirable.

RAID 0+1

RAID 0+1 arrays provide the performance of RAID 0 as well as the fault tolerance of RAID 1. This configuration is commonly known as a mirrored stripe. With RAID 0+1, data is first striped to a RAID 0 array and then mirrored to a redundant RAID 0 array. Figure 3.3 shows this process.

Figure 3.3: RAID 0+1 operation.

RAID 0+1 is configured by first creating two RAID 0 arrays and then creating a mirror from the two arrays. This approach improves performance, but the inclusion of RAID 1 means that your storage investment will need to be double your storage requirement. Assuming that the configuration Figure 3.3 shows uses 100GB disks, each RAID 0 array would be able to store 300GB of data (100GB × three disks). Because the second RAID 0 array is used for redundancy, it cannot store new data. This setup results in being able to store 300GB of data on 600GB of purchased disk storage. An advantage of RAID 0+1 is that it offers the performance of RAID 0 while providing fault tolerance. You can lose a single disk in the array and not lose any data; however, you can only lose one disk without experiencing data loss. If you're looking for better fault tolerance, RAID 1+0 is the better choice.

RAID 1+0

RAID 1+0 (also known as RAID 10) combines RAID 1 and RAID 0 to create a striped set of mirrored volumes. To configure this type of RAID array, you first create mirrored pairs of disks and then stripe them together. Figure 3.4 shows an example of this implementation.

Figure 3.4: RAID 1+0 operation.

Note that the RAID 1+0 configuration is exactly the opposite of RAID 0+1. With RAID 1+0, the RAID 1 arrays are configured first; each mirror is then striped together to form a RAID 0 array. A major advantage of RAID 1+0 over RAID 0+1 is that RAID 1+0 is more fault tolerant. If there are a total of six disks in the array, you could lose up to three disks without losing any data. The number of disks that can fail is determined by where the failures occur: with RAID 1+0, as long as one physical disk in each mirror set in the stripe remains online, the array will remain online. For example, the array that Figure 3.4 shows could lose disks two, four, and six and remain operational. As long as one disk in each RAID 1 mirror remains available, the array will remain available as well. RAID 1+0 is similar to RAID 0+1 in terms of storage efficiency. If each disk in the array shown in Figure 3.4 is 100GB in size, you would have 600GB of total purchased storage but only 300GB of writable storage (due to the RAID 1 mirroring). If you're looking for better storage efficiency at the expense of a little speed, RAID 5+0 might be a better option.

RAID 5+0

RAID 5+0 is configured by combining RAID 5 and RAID 0. The array is configured by striping data across RAID 5 volumes. This setup is more efficient than RAID 1+0 because only a fraction of your storage investment is lost to redundancy rather than half of it. Figure 3.5 shows this RAID type.

Figure 3.5: RAID 5+0 operation.

Compared with RAID 5, RAID 5+0 provides faster read and write access. However, a drawback of RAID 5+0 is that if a drive fails, disk I/O to the array slows significantly. RAID 5+0 is more fault tolerant than RAID 5 because it can withstand the loss of a disk in each RAID 5 subarray. In the array that Figure 3.5 shows, both disks one and five could fail and the array would still remain online; if disks four and five failed, the array would go down. RAID 5+0 sizing is similar to sizing a RAID 5 array, except that a disk in each RAID 5 subarray is used for parity. If the array in the figure had 100GB disks, you would have 600GB of storage space in the array. Of that 600GB, you could write data to 400GB because you give up one 100GB disk in each of the subarrays to parity. As with RAID 5, adding disks to each subarray would provide better storage efficiency.
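Before moving on to the hardware and software implementation choices, the sizing rules described for each RAID level can be summarized in a few lines of arithmetic. The Python sketch below models only usable capacity for arrays of identical disks (a mirror keeps one copy's worth of space, and each RAID 5 set gives up one disk to parity); it is a simplified illustration, not a sizing tool for any particular controller.

```python
def usable_capacity(level, disk_gb, disks, subarrays=1):
    """Usable GB for an array of identical disks.

    level: "0", "1", "5", "0+1", "1+0", or "5+0"
    disks: disks per array (or per sub-array for the nested levels)
    subarrays: number of sub-arrays for the nested levels
    """
    total = disk_gb * disks * subarrays
    if level == "0":
        return total                              # striping only, no redundancy
    if level == "1":
        return disk_gb                            # all members mirror one disk's worth of data
    if level in ("0+1", "1+0"):
        return total // 2                         # two-way mirroring: half the disks hold copies
    if level == "5":
        return disk_gb * (disks - 1)              # one disk's worth lost to parity
    if level == "5+0":
        return disk_gb * (disks - 1) * subarrays  # one parity disk per RAID 5 sub-array
    raise ValueError(f"unknown RAID level: {level}")

# Examples from the text, all with 100GB disks:
print(usable_capacity("1", 100, 2))                  # 100 of 200GB purchased
print(usable_capacity("5", 100, 3))                  # 200 of 300GB (67 percent usable)
print(usable_capacity("0+1", 100, 3, subarrays=2))   # 300 of 600GB
print(usable_capacity("1+0", 100, 2, subarrays=3))   # 300 of 600GB
print(usable_capacity("5+0", 100, 3, subarrays=2))   # 400 of 600GB
```

The printed values match the 200GB, 300GB, and 400GB usable-capacity figures worked through in the preceding RAID discussions.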

Hardware vs. Software RAID

As you can see, there are several methods for improving storage performance and fault tolerance for your file servers. When looking to configure each of these RAID levels, you have two general choices: hardware RAID and software RAID. Hardware RAID volumes are set up using a hardware RAID controller card; this setup requires the disks in the array to be connected to the controller card. With software RAID, disks are managed through either the OS or a third-party application. Let's look at each RAID implementation in more detail.

Hardware RAID

Hardware RAID controllers are available to support all of the most common disk storage buses, including SCSI, Fibre Channel (FC), and Serial ATA (SATA). Hardware RAID is advantageous in that it is transparent to the OS; the OS sees only the disk presented to it by the RAID controller card. Hardware RAID also significantly outperforms software RAID because no CPU cycles are needed to manage the RAID array; that management is performed by the RAID controller card. Another advantage of hardware RAID is that it is supported by most clustering implementations, whereas most software RAID configurations are not supported by cluster products. To be sure that your configuration is compatible, verify that the hardware RAID controller and disk products have been certified by your clustering product vendor.
Hardware alone is not the only compatibility issue: be sure to verify that your installed hardware is using firmware and drivers that have been certified by your clustering product vendor.

Most of the major RAID controller vendors post technical manuals for their controllers on their Web sites. This accessibility makes it easy to configure RAID controllers in your file serving storage infrastructure.
SCSI RAID

Most of the hardware RAID implementations in production today are in the form of SCSI RAID. Although FC and SATA have gained significant ground in recent years, SCSI has still maintained a large portion of market share. Among the most popular SCSI RAID controller vendors are Adaptec, QLogic, and LSI Logic. Each of these vendors offers products with long-standing reputations and excellent support.
FC RAID

With FC RAID, disk arrays can be configured as RAID arrays and attached directly to a SAN. As SANs have continued to become an integral part of enterprise file serving, FC RAID has risen in popularity. For example, Adaptec's SANbloc 2Gb RAID solution can connect FC RAID to a cluster via a SAN and can scale to 112 drives with as much as 16.4TB of storage. Other vendors that offer FC RAID solutions include Hewlett-Packard, Quantum, Dot Hill, and XioTech.

SATA RAID

SATA has been growing steadily in recent years as a cost-effective alternative to SCSI. SATA offers data transfer rates as fast as 450MBps, depending on the SATA RAID controller and disk vendor. Besides the lower cost, SATA also differs from SCSI in that each disk has a dedicated serial connection to the SATA controller card. This connection allows each disk to utilize the full bandwidth of its serial bus; with SCSI, all disks on the bus share the bandwidth of the bus. Unlike SCSI, SATA disks are not chained together, so the number of disks in the array is restricted to the physical limitations of the controller card. For example, the Broadcom RAIDCore 4852 card supports eight ports and all of the popular RAID levels, including RAID 1+0 and RAID 5+0. This controller provided RAID 0 writes at 450MBps and RAID 5 writes at 280MBps during vendor tests. Many vendors are also developing technologies that allow you to connect SATA disk arrays to an FC SAN. For example, the HP StorageWorks Modular Smart Array (MSA) controller shelf can connect as many as 96 SATA disks to an FC SAN. This capability allows you to add disk storage to support your file servers on the SAN at a significant cost savings over SCSI.

Software RAID

Software RAID is advantageous in that you do not need a RAID controller card in order to configure it. However, with many organizations deploying clustering technology to support the demands of file serving, software RAID has not been an option for shared storage resources in the cluster. There are some exceptions to this rule; for example, Symantec (formerly VERITAS) Volume Manager can set up software RAID that is compatible with some clusters, such as Microsoft server clusters. However, most organizations that spend the money to deploy a cluster in the first place don't try to save a few bucks by cutting corners with RAID. Having an OS control disks via software RAID can also result in significant CPU overhead; this CPU loading often makes software RAID impractical on high-volume, enterprise-class file servers. However, some organizations that connect shared cluster storage via hardware RAID use software RAID to provide redundancy for the OS itself. Having the OS files mirrored across a RAID 1 array can prevent a disk failure from taking a server down. You can also do so with a hardware RAID controller, but if you're at the end of your budget, software RAID can be a workable alternative for protecting the OS. As with hardware RAID, you must still use multiple physical disks to configure the RAID array, so breaking a disk into partitions to build a software RAID array is not an option. Windows OSs natively support software RAID, which can be configured using the Disk Management utility, part of the Computer Management Microsoft Management Console (MMC). Using Disk Management, you can configure RAID 0, 1, and 5 arrays on Windows Server OSs; on Windows client OSs, you can only configure RAID 0. With Linux OSs, software RAID 0, 1, and 5 can be configured using the Disk Druid tool during a GUI installation of the OS. If the OS is already installed, you can use the Raidtools package to configure and manage software RAID. Although increasing the performance and availability of disks through RAID is an important part of enterprise file serving, there are more elements of the data path that must be considered as well.

Redundant SAN Fabrics


Redundant disks are not of much value if there is only a single path to the disks through a SAN. With this in mind, if you have protected your disk storage through RAID, you should also strongly consider adding redundant data paths between the servers attached to the SAN and the storage resources.

Elements of the Redundant SAN

A fully redundant SAN has no single point of failure. An example of a redundant SAN fabric is shown in Figure 3.6.

Figure 3.6: Redundant SAN fabric.

In this example, three servers that are part of a shared data cluster all share common storage in a SAN. Fault tolerance begins with redundant FC HBAs in the servers. This setup eliminates an HBA as a single point of failure. Each HBA connects to a separate FC switch. This way, all three servers can withstand the failure of a switch or switch port in the SAN. Finally, the disk array and library in the SAN are also connected to each switch.

Although redundancy adds to the cost, many organizations with SANs have come to see redundancy as a necessity. For example, a large hospital recently deployed a non-redundant SAN to connect its file servers to disk resources. The goal was to get better use of existing storage resources while streamlining backups. However, the administrators' opinion of SANs faded when a switch failed and, as a result, took down seven servers. With an FC switch serving as the access point to SAN storage, the switch's failure can have devastating consequences. With a fully redundant SAN fabric, a switch failure will not equate to the failure of several servers; instead, it will be a minor hiccup. However, getting all of this to work is not as simple as just connecting the devices. Each host OS must be aware of the duplicate paths through the SAN to each storage resource. For this setup to work, you will need to install the multipath drivers for the SAN HBA, and you will need to ensure that the purchased SAN HBAs support multipath.

Managing the Redundant SAN

All of the major SAN switch vendors offer tools to help you manage their products. For example, Brocade's Fabric Manager allows you to manage as many as 200 switches in a SAN. With this product, you can make changes to all switches simultaneously, to individual switches, or to small groups, and you can configure alerting features to notify you if a failure occurs. Other storage vendors have also jumped into the SAN management ring by offering products that collectively manage a variety of SAN hardware devices; Symantec's CommandCentral is an example of software that can manage a diverse collection of storage resources across an enterprise. Several products can assist you in spotting failures on a SAN. How you deal with failures may depend on your IT budget. Some organizations maintain spare parts on hand to quickly resolve failures. This practice could mean keeping a spare HBA, switch, and FC hard disks so that, if a failure occurs, you can quickly replace the failed component and then order its replacement.
Most SAN products have built-in backup utilities. To quickly replace a failed switch and update its configuration, you should perform frequent backups of your SAN switches. Many organizations perform configuration backups before and after each configuration change to a SAN switch. This practice ensures that you will always have the most recent configuration available if you need to replace a failed switch.

Redundant LANs
At this point, you have seen how to improve performance and fault tolerance in the data path from a server to storage. Another crucial element of the data path is the path from the clients to the servers; this path often encompasses the LAN. A simple example of adding fault-tolerant LAN connections to servers is shown in Figure 3.7.

Figure 3.7: Redundant server LAN connections.

The idea of redundant LAN connections is relatively straightforward. As with redundant SAN connections, the redundant LAN illustrated uses a meshed fabric to connect each node to two switches, making each node resilient to NIC, cable, or switch failure. With this approach, a teamed NIC driver should be installed on each server; this driver allows the two NICs to be collectively seen as a single NIC and to share a virtual IP address. Aside from meshing server connections to switches, some organizations mesh connections between switches, providing additional resiliency. Figure 3.8 shows this architecture.

Figure 3.8: Redundant switched LAN connections.

Meshing core and access-layer switches can provide fault tolerance for the network backbone, but it also requires additional management. To prevent network loops, Spanning Tree Protocol (STP) will need to be configured on the switches. Loops occur when multiple paths exist between hosts on a LAN. With multiple open paths, it is possible for frames leaving one host to loop between the switches offering the redundant paths without ever reaching the destination host. Loops can not only disrupt communication between hosts but also flood the LAN with traffic. When configured, STP dynamically builds a logical tree that spans the switch fabric. In building the tree, STP discovers all paths through the LAN. Once the tree is established, STP ensures that only one path exists between two hosts on the LAN; ports that would provide a second path are forced into a standby or blocked state. If an active port goes down, the redundant port is brought back online. This setup allows for fault tolerance while preventing network loops from disrupting communication.

One other element of the LAN that can benefit from redundancy is the router. As most hosts on a network route through a single default gateway, the failure of a router can shut down LAN communications. This risk can be overcome by using routers that support Hot Standby Router Protocol (HSRP) or Virtual Router Redundancy Protocol (VRRP). Both HSRP and VRRP allow you to configure multiple routers to share a virtual IP address, providing failover between routers. If one router fails, a second router can automatically assume the routing duties of the first. Although HSRP and VRRP offer similar functionality, they differ in that HSRP is Cisco-proprietary while VRRP is an open standard. Regardless of the protocol used, both HSRP and VRRP allow you to eliminate a router as a single point of failure. Of course, eliminating the router as a single point of failure comes at the cost of having to purchase and power additional routers for redundancy.
For more information about HSRP, read the Cisco internetworking case study Using HSRP for Fault-Tolerant IP Routing. This document is available at http://www.cisco.com/univercd/cc/td/doc/cisintwk/ics/cs009.htm.

Redundant Power
With single points of failure eliminated from the LAN, you can focus on power. Power loss, sags, or surges can wreak havoc on the availability of your file servers. To eliminate these potential problems, both Uninterruptible Power Supplies (UPSs) and backup generators can be deployed. Redundant power is no secret in IT circles and has been used for quite some time; if you're managing enterprise-class file servers, odds are that you already have redundant power in place. In protecting against power failure, the UPS can sustain servers, storage, and network devices for a short period of time. During this period, all devices can be powered down gracefully so as not to corrupt any stored files. In organizations in which availability is crucial, the role of the UPS is usually to keep servers online long enough for backup generators to start. The backup generator can sustain the critical elements of the network for hours or even days, depending on the number of systems on the network as well as the amount of fuel available to power the generator.

Redundant Servers
Thus far, we have looked at how to add redundancy to power, the LAN, the SAN, and disks. The only aspect of the information system that has been ignored to this point is the servers themselves. If you have gone this far to protect your file serving infrastructure, you don't want a motherboard failure, for example, to disrupt data access. Adding redundancy to servers can be accomplished in a few different ways:
• Deploy shared data clusters
• Deploy failover clusters
• Deploy proprietary servers that are fully redundant

Let's start with a look at shared data clusters.

Shared Data Clusters

Shared data clusters were fully described in Chapters 1 and 2. They provide full failover support for file servers, allowing a virtual file server entity to move from one physical host to another if a failure occurs. With this ability, all hardware and software elements of a physical server are eliminated as single points of failure. In addition to failover support, shared data clusters offer load balancing by allowing multiple nodes in the shared data cluster to simultaneously access files in the shared storage on the SAN. This functionality prevents the performance bottlenecks that are common in other redundant server and clustering solutions. Finally, with shared data clusters running on industry-standard x86-class hardware, organizations do not have to fear deploying a proprietary solution when deciding to go with a shared data cluster.

Failover Clusters

Failover clusters, like shared data clusters, offer failover support. If one server's hardware fails, a virtual file server running on that server can simply move to another node in the cluster. All major OS vendors, including Microsoft, Red Hat, and SUSE, offer clustering support with some of their OS products, making failover clusters convenient for administrators already familiar with a certain OS. If performance were not an issue, failover clusters would be an ideal fault-tolerant file serving solution. However, failover clusters lack the ability to effectively load balance data access between hosts. Instead, failover clusters use a shared-nothing architecture that allows only one node in the cluster to access a file resource at a time. This setup prevents failover clusters from load balancing access to a virtual file server; access to the virtual file server must be provided by one server at a time.

Proprietary Redundant Servers

One final alternative for eliminating servers as a single point of failure is to deploy fully redundant server solutions. These solutions can range in price from tens to hundreds of thousands of dollars. On the low end, companies such as Stratus Technologies offer servers with fully redundant power, motherboards, CPUs, and storage. At the high end, companies such as Network Appliance and EMC offer fault-tolerant NAS appliances. Although both EMC and Network Appliance hold a significant portion of the file serving market, their popularity has been heavily challenged in recent years by companies such as PolyServe that offer fully redundant, high-performance file serving solutions that run on industry-standard hardware.

Eliminating Bottlenecks
In addition to adding availability to the data path by eliminating single points of failure, performance bottlenecks should be a key concern. As with failure points, each element in the data path can represent a potential bottleneck (see Figure 3.9).

Figure 3.9: Potential data path bottlenecks.

Figure 3.9 points out eight potential bottlenecks in a data path:
1. Client access switch: 10Mbps uplink
2. Router: a single router connects all clients to the server network segment
3. Server access switch: 100Mbps uplink
4. Server NIC: 100Mbps NIC
5. Server internal hardware: CPU, RAM, motherboard, and so on
6. Server FC HBA: 1Gbps
7. Fabric switch: 1Gbps
8. Disk array: Just a Bunch of Disks (JBOD)

Connecting clients to the server LAN through a single "router on a stick" via 10Mbps switches can quickly slow file access performance. The term router on a stick refers to a single router that services several LAN segments that are multinetted together. To communicate with each logical network that resides on the multinet, the router has multiple IP addresses on the network interface that faces the clients. If a 10Mbps switch connects to the router, all clients are already sharing a single 10Mbps pipe to reach server resources. These bottlenecks could be reduced or eliminated by upgrading the client access switch to 100Mbps with at least one 1Gbps port to uplink to the router. If several network segments are bottlenecked at the router, you could consider replacing the router with a Layer-3 switch; if needed, a switch with 1Gbps ports could be used. If the server NIC is the bottleneck, it could also be upgraded to 1Gbps or teamed with a second NIC to improve throughput. At the server, several elements can hurt performance: too little RAM, too slow a CPU, or slow hard disks. The expensive answer to server resource bottlenecks is to replace or upgrade hardware. A more scalable solution in the file serving arena is to configure the file servers in a shared data cluster, which offers the benefits of load balancing and fault tolerance and often results in server consolidation. If you're looking to document a server bottleneck, Windows and Linux offer tools to help pinpoint the problem. On Windows OSs, System Monitor can be used to collect system performance statistics in real time; on Linux, tools such as the GNOME System Monitor or vmstat allow you to query system performance. Table 3.2 lists common thresholds that indicate a resource bottleneck.
Resource         Threshold                                             Required Action
Memory           Committed bytes > physical RAM; Pages/sec > 20        Add or upgrade RAM
Physical Disk    Disk queue length > 2; % Disk time > 90%              Upgrade to a faster disk, deploy RAID
Processor        % Processor time > 80%; Processor queue length > 2    Upgrade CPU or add an additional CPU
Network          Remains near 100% utilization                         Upgrade to a faster NIC or team NICs

Table 3.2: Common performance thresholds.
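To automate this kind of check, the counters in Table 3.2 can be compared against their thresholds programmatically. The following Python sketch is illustrative only; the counter names in the sample dictionary are made up, and in practice the values would come from System Monitor logs, vmstat output, or a management product such as those discussed later in this chapter.

```python
# Thresholds and suggested actions follow Table 3.2.
RULES = [
    ("pages_per_sec",          lambda v: v > 20,  "Add or upgrade RAM"),
    ("disk_queue_length",      lambda v: v > 2,   "Upgrade to a faster disk or deploy RAID"),
    ("disk_time_percent",      lambda v: v > 90,  "Upgrade to a faster disk or deploy RAID"),
    ("processor_time_percent", lambda v: v > 80,  "Upgrade the CPU or add an additional CPU"),
    ("processor_queue_length", lambda v: v > 2,   "Upgrade the CPU or add an additional CPU"),
    ("network_utilization",    lambda v: v >= 99, "Upgrade to a faster NIC or team NICs"),
]

def check_counters(sample):
    """Print the Table 3.2 action for every counter that exceeds its threshold."""
    for counter, exceeded, action in RULES:
        if counter in sample and exceeded(sample[counter]):
            print(f"{counter} = {sample[counter]}: {action}")

# Hypothetical sample collected from a busy file server
check_counters({"pages_per_sec": 35, "disk_queue_length": 4, "processor_time_percent": 55})
```

In this hypothetical sample, only the memory and disk counters exceed their thresholds, so the sketch flags those two resources and leaves the processor alone.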

The SAN also introduces a possible bottleneck site. Arbitrated loop SANs behave like Token Ring LANs and thus provide shared bandwidth for all resources attached to the SAN; 10 servers attached to an arbitrated loop SAN would have to share its bandwidth. If SAN performance is slow and you are using an arbitrated loop topology, the most effective way to improve performance is to upgrade to a switched fabric SAN. The same can be said for a 1Gbps SAN: if your SAN switches and HBAs support a maximum throughput of 1Gbps, upgrading to a newer 2Gbps or 4Gbps SAN fabric will greatly improve performance. Finally, the disks themselves can also represent a bottleneck. Upgrading to faster disks is one option; another is to reconfigure the RAID level. For example, moving from RAID 1 to RAID 5 could improve performance, and another alternative is to go to RAID 1+0.

Architectural Bottlenecks

Sometimes it is not the individual pieces that represent the bottleneck; if the file serving infrastructure is architected poorly, the bottleneck might be the architecture itself. Two typical examples of architectural bottlenecks are single NAS heads and single file servers.

Single NAS Head

As NAS grew in popularity, a big selling point was that you could consolidate all your file serving resources into a single NAS. With terabytes of available storage, this option seemed like a good idea to many at the time. However, the single NAS head presents severe scalability and performance limitations. The NAS itself represents a single path for LAN clients to access files, and even with teamed NICs, performance scaling is limited. When stuck with a NAS bottleneck, many organizations find themselves adding NAS heads. At first, servers are consolidated, but performance demands must then be met by adding NAS boxes. Adding NAS boxes to handle file serving, however, will likely bring additional management overhead: you will likely need to revise user drive mappings as well as restructure backups to accommodate the additional server. If scaling continues, you will need to add yet another NAS, further compounding the problem.

Single File Server

The single file server presents the same problems as the single NAS head. The lone exception is that the single file server is not a proprietary solution. However, having a single access point is still an issue. You can add NICs and certainly max out CPU and RAM resources but could still face network throughput bottlenecks in enterprise-class environments. The answer to the performance bottleneck dilemma of both single NAS heads and single file servers can be found in load balancing.

Load Balancing
Traditional load balancing involves balancing a client load between multiple servers. In the traditional load-balancing architecture, each server that participates in a load-balanced cluster maintains its own local storage. Without shared storage, the traditional load-balanced cluster is not a solution for file serving scalability issues; it is best suited to read-only, user-intensive workloads such as front-end Web serving or FTP download servers. For file serving, the only true way to provide load-balanced read/write access to file system data is through shared data clusters. In a shared data cluster, multiple servers can present a single logical file server application to clients. This setup allows the client load to be distributed across multiple physical servers. This approach can eliminate many of the traditional file serving bottlenecks, including:
• Single network access point
• CPU
• RAM
• Motherboard
• HBA

With a file serving load distributed across four server nodes, for example, you have four times the server resources to handle client demand. This flexibility can allow you to get more out of your hardware investment and likely extend the life of your servers. In many shops, some servers are over-utilized while others are underutilized. Consolidating file servers to a single shared data cluster allows you to make even use of all file server resources in your organization.

Managing the Resilient Data Path


Building out a resilient data path requires knowledge that crosses several technical boundaries. It is easy for a storage or server administrator to fail to take LAN performance issues into account; likewise, it is easy for a switch and router administrator to disregard SAN issues. To assist administrators in managing performance and fault-tolerance issues, many vendors are developing products that locate and report on server performance problems. One such product is the HP Performance Management Pack (see Figure 3.10).

Figure 3.10: Discovering a system memory problem with the HP Performance Management Pack.

Tools such as this have grown in popularity because they not only alert you to performance problems but also recommend how to solve them. The Brocade and Symantec tools mentioned earlier in this chapter are ideal for monitoring and managing SAN resources, and other SAN hardware vendors, such as QLogic, offer similar solutions. It is an easy trap for administrators to invest countless dollars in hardware and then decide to forgo management software in order to save money. The time saved by having software monitor and report on problems within your data path will undoubtedly be worth the cost of the management software.

Many vendors like to tout the fact that they provide an end-to-end solution. However, end-to-end solutions often advertise that they solve all problems along a data path but rarely deliver. When designing a high-performance, fault-tolerant data path, be sure to ask yourself, or any vendors involved in the project, the following questions:
• Is the network path to my file servers fault tolerant?
• Is the SAN path to the file resources fault tolerant?
• Can the planned technologies effectively load balance file access requests?
• How can I monitor and report on LAN bottlenecks and failures?
• How can I monitor and report on SAN bottlenecks and failures?
• How can I monitor and report on server bottlenecks and failures?

With acceptable answers to these questions, you should be ready to enjoy a resilient and fault-tolerant file serving infrastructure.

Summary
Far too often with file serving, administrators focus solely on performance issues from the servers back to storage and all but ignore the remainder of the data path. Hopefully, this chapter has made you aware of all the aspects of getting a file from a server to a client, and back. With the right architecture in front of and behind your file servers, they should be able to grow and respond to client performance demands as the needs of your organization evolve. Basing your file serving infrastructure on a shared data cluster architecture is the only true way to add fault tolerance and load balancing to file server resources. As you have seen, all the resources that you need to build a fault-tolerant and resilient file serving data path exist today. Knowing what is available, as well as how to use these resources, should allow you to build a reliable file serving infrastructure within your organization. The next chapter will look at building high-performance file serving solutions in both Windows and Linux environments. Chapter 4 will take you through the process of building a high-performance Windows file server, and Chapter 5 will explore the process of building a highly available and high-performance Linux file serving solution.

Chapter 4: Building High-Performance, Scalable, and Resilient Windows File Serving Solutions
The last chapter took more of an external look at file serving. In approaching file serving from a data path perspective, you can see which elements of the network must be made highly available as well as meet performance and scalability demands. This chapter takes a deep look at the role of the OS itself; in particular, it examines how to manage high-performance and highly available Windows-based file servers. As technology has continued to improve, Microsoft and countless vendors have developed new tools and methods for managing file systems. In many cases, the available file serving tools and OS enhancements are complementary rather than competitive. As you navigate through this chapter, you will first see what Microsoft has done on its own to improve file serving; you will then see the role that some of the major file serving vendors play. Although awareness of Windows file serving technologies is important, it is equally significant that you understand how these technologies can, or cannot, coexist with your existing or planned file serving infrastructure. Let's begin with a look at WS2K3's file serving enhancements.

Managing High-Performance and Availability Across a Windows Infrastructure


With each new generation of Windows server OSs, Microsoft has continually added more file serving features to the OS. Among the new file serving-related features of WS2K3 are:
• Virtual Disk Service (VDS)
• Volume Shadow Copy Service (VSS)
• Shadow Copies for Shared Folders
• Enhanced storage and file serving support

This section will look at the numerous technical considerations for deploying WS2K3-based file servers. Along the way, you will see the new file serving and storage features available in WS2K3 as well as the factors that must be considered when integrating third-party applications, such as antivirus software, with Windows-based file servers.

VDS

VDS provides a method to standardize the way application vendors access disk storage resources connected to WS2K3 hosts. In short, VDS is an application programming interface (API) that allows third-party storage application vendors to connect to all attached disk storage resources through a single API. Figure 4.1 shows the VDS architecture.

Figure 4.1: The VDS architecture.

Basically, VDS sits between applications and storage resources. Storage applications that support VDS can send instructions to VDS, which in turn passes the instructions to the storage resource. This architecture prevents storage application vendors from having to write their own drivers or write code that provides instructions to each specific type of hardware that the vendor supports; instead, a single set of instructions can be passed to VDS, and VDS will take care of the rest. Providing a common storage interface is nothing new to Microsoft. The company tried providing a common interface in Windows 2000 (Win2K) for removable media resources such as tape libraries. This service was known as Removable Storage Manager (RSM), and its intention was similar to that of VDS in WS2K3; the primary difference is that VDS provides access to disk resources rather than removable media. Many storage vendors found RSM to be moderately reliable at best and thus wound up writing their own drivers for removable storage resources, communicating through RSM only for storage resources that were otherwise unsupported. By narrowing its scope, VDS has proven to be much more reliable in production than RSM. Notice that the objects underneath VDS in Figure 4.1 are listed as providers. The term provider is used by Microsoft to describe code written by disk vendors to interface with VDS. For VDS to send the correct instructions to a disk resource, it must communicate with the disk's provider. Only disk storage vendors that have written VDS providers are supported by VDS.
To take advantage of the flexibility provided by VDS, before purchasing disk storage resources, be sure to ask the storage vendor if their products offer a VDS provider. For storage applications, be sure to verify that the software application is written to support VDS.

VSS

VSS is a WS2K3 service that is often confused with the Shadow Copies for Shared Folders feature. Although Shadow Copies for Shared Folders is a part of VSS, it is only one piece of it. VSS is a service that can be used by backup and storage management applications to effectively back up files that are normally open during backup. For example, to back up a third-party database pre-VSS, you had two choices:
• Stop the database before the backup runs and restart it after the backup completes
• Purchase a specialized backup agent that provides for online backups of the database

If your backup product supports VSS, you can back up the database as part of a normal file system backup without taking the database offline or purchasing a backup agent. Many enterprise shops already have backup agents for major database applications such as Oracle or SQL Server, so for these applications, the news of VSS is not that significant. However, for smaller third-party databases that do not have available backup agents, VSS provides a viable alternative to cold backups. VSS also has merit in file serving applications. For example, if a user has a Word file open at the time of backup, the file would probably be skipped during the backup. Pre-VSS, backing up open files such as these required a third-party product such as St. Bernard's Open File Manager. If your backup product supports VSS, you can now back up open document files without needing a third-party open file manager product. VSS works by creating a point-in-time snapshot of an open file. Prior to the snapshot being made, VSS first freezes write operations to the open file. All writes to an open file are stored in a temporary cache while VSS backs up the open file; once the backup of the file is finished, any suspended writes to the file are committed. This process allows the backup software to achieve a consistent backup of an open file. Maintaining the consistency of the file through the backup process is imperative because changes written to a file while it is being backed up could result in file corruption. To take advantage of VSS, your file server needs to run WS2K3, and you need to ensure that VSS is supported by your backup product vendor. All the major backup vendors, such as Symantec (formerly VERITAS) and CommVault, support VSS, so it is likely a supported feature of your backup product. With Windows Backup, VSS support is enabled by default. If you want to disable VSS backups, which would cause open files to be skipped during the backup, you can do so from the Windows Backup GUI. When a backup is initiated from the Windows Backup GUI, you first select the data to back up, then click Start Backup. Once you click this button, you are presented with the Backup Job Information dialog box. If you click Advanced, you are presented with the Advanced Backup Options dialog box (see Figure 4.2). From this dialog box, you can disable VSS by selecting the Disable volume shadow copy check box.


Figure 4.2: Windows Backup advanced backup options.

As a general practice, VSS should be enabled unless particular files defined in the backup will be backed up by an application-specific agent. Each backup vendor may have its own guidance for working with VSS, so your backup product's documentation is the best source of direction on how to work with VSS.
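You can also confirm directly on the server that the VSS components your backup product depends on are registered and healthy. The vssadmin utility (used again later in this chapter) reports both the registered writers and the installed providers; the exact output varies by system and by the applications installed:

C:\>vssadmin list writers
C:\>vssadmin list providers

The writers list shows each VSS-aware application on the server along with its state, and the providers list shows the software or hardware providers available to service snapshot requests.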

If you are executing backups using the ntbackup.exe command-line tool, the switch
/SNAP:{on | off}
determines whether VSS is used.
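For example, the following command line sketches a backup job that explicitly requests a shadow copy of open files; the source path, job name, and backup file name are illustrative only:

ntbackup backup D:\Data /J "Nightly data backup" /F "E:\Backups\data.bkf" /SNAP:on

Running the same command with /SNAP:off would skip open files, mirroring the Disable volume shadow copy check box in the GUI.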

VSS can provide greater reliability of file server backups and allow you to back up open files that are normally skipped. This service also provides a feature that can offload some of the day-to-day file recovery work to the end user: Shadow Copies for Shared Folders.

Shadow Copies for Shared Folders
Shadow Copies for Shared Folders is the best-known aspect of VSS. During the initial release of WS2K3, the Shadow Copies for Shared Folders feature was touted as one example of how WS2K3 could lower TCO. Simply put, enabling Shadow Copies for Shared Folders causes a server to run periodic snapshots of a volume on a file server. When the shadow copy is executed, point-in-time snapshots of each file on the shadow copy-enabled volume are created. If a user accidentally deletes a file or wants to work with an earlier version of a file, the user can simply restore the earlier version of the file; thus, this feature helps avert extra Help desk calls. Although the Shadow Copies for Shared Folders feature is powerful, it will require you to train the end user on how to recover files. This task can be difficult; many administrators still do not know how to correctly recover shadow copied files, so expecting end users to be able to recover their own files without any training is a bit of a stretch.


Shadow Copies for Shared Folders Basics
Many organizations run backups of their file servers only at night, which means that users needing an earlier version of a file must revert to the previous day's file. With Shadow Copies for Shared Folders configured, you could run two to four snapshots during the day, for example, which would provide more options when returning a file to an earlier state. Shadow Copies for Shared Folders is not a replacement for backup but rather another tool that can assist the productivity of users. In every organization, there are bound to be some users who are too embarrassed to call the Help desk and request a restore when they accidentally delete a file or want to revert to an earlier version of the file. In these situations, the users often recreate earlier work in an effort to save themselves what they perceive as embarrassment. By empowering users to recover their own files, they are less likely to spend time recreating earlier work.
The backup APIs provided by Microsoft to the backup vendors do not provide the functionality to back up any previous versions of a file secured by a shadow copy snapshot. Instead, only the most recently saved version of a file will be copied when a Windows file server is backed up.

On WS2K3-based file servers, Shadow Copies for Shared Folders is enabled at the volume level. Selectively enabling Shadow Copies for Shared Folders at the individual folder level is not supported. This shortcoming is worth noting because it may play into your decision making when deploying a new file server. Also, shadow copy recovery is only possible via shared folders. Thus, you cannot view shadow copy snapshots of files locally using Windows Explorer.

When Shadow Copies for Shared Folders is enabled on a volume, a best practice is to reserve another free volume on the file server for storage of shadow copy snapshots. This setup allows for dedicated and controllable disk space for the shadow copy snapshots. Each time a snapshot is executed, the snapshot will update a base block-level image of the volume. The image updates are incremental in nature, so only file changes are captured in the snapshot. This method allows the OS to save numerous snapshots of a volume on another volume of equal size. For example, a 100GB volume with 60GB of stored data could have its shadow copy snapshots stored on another 100GB volume. On the second 100GB disk, you may easily be able to fit ten previous volume snapshots. If the Shadow Copies for Shared Folders default settings were used, snapshots would run twice daily (7:00AM and 12:00PM). With ten saved snapshots, users would be able to roll back to a version of a file as much as 5 days old. If a user needed to go further back in time, the user could request that an earlier version of the file (for example, from 2 weeks ago) be restored from backup.

Enabling Shadow Copies for Shared Folders Support
Enabling Shadow Copies for Shared Folders support is a relatively simple process that can be completed in Windows Explorer. Prior to enabling Shadow Copies for Shared Folders for a volume, you should install an additional volume on the server that can be used exclusively for storage of shadow copy snapshots.
If enabled with the default settings, shadow copy snapshots will be stored on the same volume on which Shadow Copies for Shared Folders is enabled. This setup is not recommended for any performance-intensive file serving environments.


To enable Shadow Copies for Shared Folders:
1. In Windows Explorer, right-click the volume on which you want to enable Shadow Copies for Shared Folders support, and select Properties.
2. In the drive properties dialog box, select the Shadow Copies tab.
3. On the Shadow Copies tab, ensure that the correct volume is highlighted, then click Settings.
4. In the Settings dialog box (see Figure 4.3), select the volume on which to store the shadow copy snapshots from the Located on this volume drop-down menu.

Figure 4.3: Shadow copy volume settings dialog box.

5. Select the No Limit radio button in the Maximum Size portion of the window to use the entire volume for shadow copy backups, or click the Use Limit radio button and specify the limit for shadow copy backups in megabytes.
6. With the snapshot volume defined, set the shadow copy backup schedule by clicking Schedule. By default, shadow copy snapshots will run at 7:00AM and 12:00PM Monday through Friday.
7. As Figure 4.4 shows, in the Schedule dialog box, you can either edit the default times and days of the week for the shadow copy snapshot schedule or add new snapshot times to the schedule. Keep in mind that the more snapshots you schedule, the more storage space is consumed, and thus fewer days of history will ultimately be able to be maintained.

8. Once the snapshot schedule is set, click OK.
9. In the Settings dialog box, click OK.
10. With Shadow Copies for Shared Folders now enabled for the volume, you can create the first shadow copy snapshot of the volume by clicking Create Now. Note that this step is optional.
11. Repeat the earlier steps to enable Shadow Copies for Shared Folders support for additional volumes.
12. After you are finished enabling Shadow Copies for Shared Folders support on the desired volume(s), click OK to close the volume properties dialog box.

Figure 4.4: Shadow Copies for Shared Folders default schedule.

For more flexibility with shadow copy snapshots, you might want to execute snapshots using the vssadmin.exe command-line utility. As this tool can be integrated into a script, you'll have more freedom in controlling when shadow copy snapshots are performed. The general syntax for using vssadmin.exe to create a shadow copy snapshot is:
vssadmin create shadow /for=<volume name>


The volume name parameter must be in the form of


\\?\volume{GUID}\

The GUID is the globally unique identifier for the volume. You can determine the GUID of each volume on the server by accessing the command prompt and running vssadmin list volumes. Listing 4.1 shows an example of using vssadmin to display volume information.
C:\>vssadmin list volumes
vssadmin 1.1 - Volume Shadow Copy Service administrative command-line tool
(C) Copyright 2001 Microsoft Corp.

Volume path: C:\
    Volume name: \\?\Volume{a0bfd740-6772-11d9-8604-806e6f6e6963}\
Volume path: E:\
    Volume name: \\?\Volume{9063d3b7-0f32-11da-bf02-505054503030}\
Volume path: F:\
    Volume name: \\?\Volume{9063d3b8-0f32-11da-bf02-505054503030}\

C:\>
Listing 4.1: Querying volume GUID information using vssadmin list volumes.

Once you have the volume information, you can then use vssadmin create shadow to create a snapshot. For example, to create a snapshot of the E drive and associated GUID shown in Listing 4.1, you would run
vssadmin create shadow /For=\\?\Volume{9063d3b7-0f32-11da-bf02-505054503030}\
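In addition to creating snapshots, vssadmin can manage the shadow copy storage association itself, providing a command-line alternative to the Settings dialog box shown earlier. The following sketch directs snapshots of the C volume to the E volume with a 10GB cap and then lists the current storage associations; the volumes and size limit are illustrative only:

C:\>vssadmin add shadowstorage /for=C: /on=E: /maxsize=10GB
C:\>vssadmin list shadowstorage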

By default, Shadow Copies for Shared Folders runs as a scheduled task on the server on which it was enabled; this setting can cause problems for clustered file servers. In a cluster, if a volume configured to support Shadow Copies for Shared Folders fails over to another server, by default, the task to run the shadow copy snapshot will not fail over as well. If you are running Microsoft Cluster Service (MSCS) clusters, you can accommodate periodic snapshots by adding a Volume Shadow Copy Service Task resource to each file server cluster group. This resource will provide failover support for scheduled shadow copy snapshots. For third-party cluster applications, such as the PolyServe Matrix Server, you could run the
vssadmin create shadow

command in a script to provide failover support for scheduled snapshots.
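A minimal sketch of such a script is shown below; it simply snapshots the E volume from Listing 4.1 and could be scheduled (for example, with schtasks.exe) on every node that can own that volume so that snapshots continue after a failover. The file name and volume GUID are illustrative only:

@echo off
rem snapshot-e.cmd -- create a shadow copy snapshot of the E volume
vssadmin create shadow /For=\\?\Volume{9063d3b7-0f32-11da-bf02-505054503030}\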


Recovering Previous Versions of a File
Both Win2K and Windows XP client OSs can recover shadow copies. However, neither OS provides Shadow Copies for Shared Folders support out of the box. As Shadow Copies for Shared Folders is a feature initially introduced in WS2K3, only WS2K3 natively supports Shadow Copies for Shared Folders file recovery. To support Shadow Copies for Shared Folders recovery on the current Windows client OSs, you can install one of the following:
• Previous Versions Client software (only supported on Windows XP)
• Shadow Copy Client software (supported on Windows XP and Win2K with SP3 or later)

The Previous Versions Client software (Twcli32.msi) is copied to WS2K3 systems during setup and is located in the %windir%\system32\clients\twclient folder on the server. Most organizations prefer to deploy the Shadow Copy Client software (ShadowCopyClient.msi) because it supports both Windows XP and Win2K. This software is available at http://www.microsoft.com/windowsserver2003/downloads/shadowcopyclient.mspx. Both shadow copy client programs are packaged as .msi files, so they can be deployed throughout your enterprise via Group Policy. Once the Shadow Copy Client is installed on the user workstations, users will be able to recover previous versions of their files on their own.

As Shadow Copies for Shared Folders support relies on the Common Internet File System (CIFS) protocol, users must access files using CIFS in order to recover them. CIFS access to shared folders is accomplished by using a Universal Naming Convention (UNC) path. For example, to access the shared folder named Public on the server named Eagle, the UNC path would be \\Eagle\Public. Most organizations provide users with mapped network drives for accessing and saving files over the network, so users with existing mapped drives that connect to network shares via UNC can recover previous versions of a file by simply navigating to their mapped drive.

Suppose that you accidentally delete a file located on your mapped Z drive. To recover the file, you would need to follow these steps:
1. Click Start, My Computer.
2. In the My Computer window, double-click the Z drive.
3. In the File and Folder pane located on the left side of the window, click the View Previous Versions link. If you don't see the View Previous Versions link, right-click any open space in the right pane of the window, and select Properties; then select the Previous Versions tab in the folder properties dialog box.
4. As Figure 4.5 shows, you will then see a list of the previous versions of the folder that were included in a VSS snapshot. If you remember the last time that you had the file, you can double-click the snapshot that occurred right before you deleted the file. If you do not know, you will need to start with the most recent folder and work backwards until you find the file you need.
5. The point-in-time snapshot of the folder will now open in a new window. Once the deleted file is located, double-click it to open it. The file will open as read only.
6. To permanently save the file, select the Save As option from the document's File menu, then select the location in which to save the file.
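To tie the deployment and access pieces of this section together, the following commands sketch a silent installation of the Shadow Copy Client package and a persistent mapping of the Z drive used in the preceding steps to the \\Eagle\Public share from the earlier example. The package location (a share named Deploy on the Eagle server) is an assumption made for illustration:

msiexec /i \\Eagle\Deploy\ShadowCopyClient.msi /qn
net use Z: \\Eagle\Public /persistent:yes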

Figure 4.5: Viewing previous file versions.

At this point, the file has been successfully recovered.


You could have also copied the file and pasted it to its original location instead of opening the file and then saving it to the original location.

Although user training is an inevitable aspect of Shadow Copies for Shared Folders support, using this feature can eliminate a significant number of Help desk calls for requests to restore accidentally deleted files. Also, as snapshots are incremental in nature and complete in a relatively short period of time, running shadow copy snapshots during business hours will provide more alternatives (than the previous night's backup) for users needing to revert to an earlier file version or recover a deleted file.


Enhanced Storage and File Serving Support
Several storage features are also included in WS2K3 that help to better support file serving. According to Microsoft, several architectural improvements in WS2K3 result in CIFS performance improvements of 250 percent over systems running Windows NT 4.0, and NFS improvements made to Microsoft Services for UNIX with WS2K3 resulted in performance as much as 1500 times faster than with NT. There are also several additions to WS2K3 that result in improved quality of life for both users and administrators:
• Improved multipath I/O support
• STORport driver model support
• iSCSI support
• Improved offline files support

Multipath I/O Support
The importance of multipath I/O to the reliability of a SAN was stressed in Chapter 3. With WS2K3, Microsoft also realized this importance and worked with the major SAN HBA vendors to help them certify multipath drivers for WS2K3. With multipath drivers, WS2K3-based file servers can access storage resources in a SAN over a resilient data path between the servers and storage. Thus, if one fibre channel link goes down, having the multipath driver installed will ensure that access to the resources seamlessly fails over to the next available data path.

STORport Driver Support
With first-generation SANs, miniport drivers were used by HBAs to interface with SAN resources. The primary drawback to miniport drivers is that they were designed for SCSI and ATA storage devices. Thus, an OS connecting to storage devices in a fibre channel SAN using an HBA with miniport drivers could not take advantage of any of the new features that separated fibre channel from SCSI. Instead, fibre channel devices for the most part were treated as SCSI devices. With STORport driver support, the OS can take full advantage of fibre channel storage devices and is not constrained by the limits of SCSI.

iSCSI Support
Although iSCSI storage networks still significantly lag behind fibre channel in terms of industry acceptance and market share, this technology has been growing steadily in popularity. iSCSI was first supported with the WS2K3 release; however, Microsoft requires that WS2K3 SP1 be installed on WS2K3-based failover clusters for full 8-node iSCSI cluster support.
For more information about iSCSI support for Windows, see the Microsoft Storage Technologies iSCSI page at http://www.microsoft.com/windowsserver2003/technologies/storage/iscsi/default.mspx.


Improved Offline Files Support
The use of offline files can be beneficial to both users and file servers. With offline files, you can configure a user's system to store a local copy of select files that reside on a file server. For mobile users, the benefit is obvious: the user will have access to the same files the user sees when he or she is in the office. When the user returns, his or her locally stored offline files will synchronize with the copies of the files that reside on the file server.

However, offline files are not just beneficial for mobile users. For example, suppose a local user edits numerous documents and presentations on a file server each day. While the user is editing the documents, they remain open on the file server, and each save is written directly to the file server. Eventually, this standard approach to file serving could limit the file server's scalability. With offline files, you could configure offline file caching for permanent local users in addition to mobile users. This way, when a local user is editing a document, the user is doing so locally on his or her workstation. This practice would offload a substantial amount of work from the file server.

By default, Windows users can set their own offline files settings. This ability allows offline files (on the client side) to run transparent to the file server. To optimize performance of offline files, you can edit the Offline Settings of any shared folder on the file server. To do so:
1. Open Windows Explorer, and locate the shared folder that you want to modify.
2. Right-click the shared folder, and select Properties.
3. In the Properties dialog box, select the Sharing tab, then click Offline Settings.
4. For optimal file server performance, select the All files and programs that users open from the share will be automatically available offline radio button.
5. Make sure the Optimized for performance check box is selected (see Figure 4.6), and click OK.
6. Click OK to close the folder properties dialog box.

Figure 4.6: Optimizing offline files for best server performance.


With the Optimized for performance check box selected, both files and programs will be locally cached on each workstation. With program files, clients will only need to download updates or changes to the files stored on the server; otherwise, the client system will run the program using its locally cached copy. With offline files enabled on a shared folder, server performance can be substantially improved.

Keep in mind, however, that offline files is not a one-size-fits-all solution. Your choice of shared folders to configure for offline support should be driven by the needs of the network as a whole. For example, offline files are not well suited for organizations in which users are not consistently logging on from the same workstation. In these situations, waiting for offline files to synchronize on each different workstation from which a user accesses the network would be extremely frustrating. For shared folders that are accessed by users who always log on from the same computer, optimizing those folders for performance using offline files can be beneficial, provided that users have adequate disk space on their workstations to support the offline files caching.
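The same share-level caching behavior can also be set from the command line with the net share /CACHE switch, which maps to the Offline Settings options in the GUI. The following is a sketch that assumes a share named Public backed by D:\Public; the Programs value corresponds to the performance-optimized setting described above, while Manual, Documents, and None select the other Offline Settings choices:

net share Public=D:\Public /CACHE:Programs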
By default, cached offline files are stored on the client's system drive. This location can be changed by running the WS2K3 resource kit Cachemov.exe tool. For example, to locate offline files on the client's E drive, you would run

cachemov unattend e:\

At the client level, users have the ability to select which folders they want to make available offline:
1. Right-click the shared folder or mapped network drive, and select Make Available Offline.
2. When the Offline Files Wizard opens, click Next.
3. Click Finish.

After you click Finish, the wizard will then synchronize the Windows XP workstation with the file server. This process creates a local cache of the files stored on the file server. You can change the way Windows XP works with offline files by following these steps:
1. Open Windows Explorer, select the Tools menu, and select Folder Options.
2. In the Folder Options dialog box, select the Offline Files tab. Aside from enabling or disabling offline files support, you can also set the amount of disk space that is reserved for offline file caching. By default, 10 percent of the drive is reserved. Figure 4.7 shows the available offline files options.
3. Once you have set the appropriate options, click OK.


Figure 4.7: Windows XP client offline files options.

User-based offline files settings can be configured using a Group Policy Object (GPO). Offline files settings can be found under User Configuration, Administrative Templates, Network, Offline Files or Computer Configuration, Administrative Templates, Network, Offline Files in a GPO. By default, only the folder that a user selects to be available offline is cached (and not its subfolders). This behavior can be changed so that subfolders are included by enabling the Subfolders always available offline setting under Computer Configuration, Administrative Templates, Network, Offline Files.


Another issue that often gets in the way of successful offline files deployments is that whenever a client workstation connects to a network, offline files will try to synchronize. As many mobile users are often connecting to WiFi hotspots, they will probably find this behavior annoying. It can be changed by performing the following steps on the client:
1. Click Start, All Programs, Accessories, Synchronize.
2. In the Items to Synchronize dialog box, click Setup.
3. Limit offline file synchronization by selecting the appropriate network connection in the When I am using this network connection drop-down menu in the Synchronization Settings dialog box (see Figure 4.8). You can also force Windows to prompt users before synchronizing by selecting the Ask me before synchronizing the items check box.
4. Once finished setting the synchronization settings, click OK.
5. Click Close to close the Items to Synchronize dialog box.

Figure 4.8: Configuring offline file synchronization settings.

As you can see, using offline files can provide more flexibility in dealing with both server performance scalability and mobile clients. Next, we'll look at how Microsoft is structuring its existing technologies to support enterprise file serving.


The Microsoft Approach to High-Availability File Serving


Microsoft's approach to high-availability file serving addresses both availability and performance concerns by offering the following technologies with the company's OSs:
• Server clusters (also known as failover clusters)
• Distributed File System (DFS)
• Active Directory (AD)

Let's look at how Microsoft applies these technologies to both high-availability and high-performance file serving.

MSCS
Microsoft offers server clustering support with the WS2K3 Enterprise and Datacenter OSs. Server clustering has been available with Microsoft's enterprise-class server OSs since NT. Microsoft Cluster Service (MSCS) supports clusters as large as 8 nodes and utilizes a shared nothing cluster architecture. In using a shared nothing as opposed to a shared disk architecture, only one node in the cluster can access a physical disk at a time. This limitation results in the cluster supporting failover but not load balancing. Microsoft's Network Load Balancing cannot run in conjunction with Microsoft's server cluster service and is primarily geared to users and servers accessing read-only data. The reason is that with the Microsoft load-balancing cluster model, each node in the cluster maintains its own separately managed storage resources. This setup makes the load-balanced cluster impractical for typical file-serving applications.

With failover support, Microsoft server clusters provide fault tolerance. If one node in the cluster fails, another node can host the virtual server that originally ran on the failed node. Although failover support is essential for mission-critical applications, the limitations of the shared nothing architecture hamper the scalability of Windows clusters as file servers. As only one physical node in the cluster can access a shared disk at a time, that single cluster node can, in time, become an I/O bottleneck as user demand increases. To solve this problem, Microsoft offers two solutions: split the file-serving duties across two clusters or deploy DFS. Many organizations move to a cluster-based file serving architecture in an effort to better support consolidation, availability, and performance; if a cluster is divided in two to support performance demands, you're starting a backward trend in which you're adding managed systems to the network. The second solution offered by Microsoft is to run DFS on top of MSCS.


DFS
DFS has increasingly grown in popularity as a complementary file serving technology because it offers the following advantages:
• Maintains consistent replicated file data between hosts
• Provides transparent access to network resources for users and applications
• Provides load balancing for file access across replica links

Figure 4.9 illustrates an architectural example of using DFS to add performance load balancing to Microsoft server clusters.

Figure 4.9: Running DFS on top of two MSCS clusters.

In theory, you could set up DFS to replicate data between two server clusters. As DFS load balances requests across replica links in a round-robin fashion, you would have a mechanism to offer load-balanced, fault-tolerant data access. However, many in the field have found File Replication Service (FRS), the replication engine behind DFS replication, to be unreliable at best. If you are serious about using DFS as a means to balance access across multiple file server clusters, strongly consider investing in a third-party product to guarantee reliable replication of the data. For example, NuView's StorageX provides this capability and would allow you to manage a reliable DFS architecture across your enterprise.


AD Integration
AD integration of Windows applications and services started with Win2K and has become even stronger in WS2K3. The AD database is extensible, so any vendor can add AD schema extensions that allow its products to integrate with AD. Microsoft has been pushing AD integration of its products for several years, and many organizations have started to buy into this philosophy.

On the file-serving side, for example, you could publish all your shares in AD using the Active Directory Users and Computers Microsoft Management Console (MMC) snap-in. To provide a granular view of shares, you could publish shares into specific organizational units (OUs) based on the users or departments that will need to access them. By performing a simple directory browse through My Network Places, users or administrators can browse to an OU in the directory to view all the published shares. A good practice is to publish shares into AD at the OU level and create desktop shortcuts for users to their respective OUs. This way, when they look inside their OU folder, they see only the shares that you have published for them. This setup prevents users from randomly browsing the network for resources and can add a level of access transparency similar to DFS. If a shared folder's location moves, you will just have to update the published object's UNC path in AD. The movement of the share will be transparent to the user.

Although Microsoft may have a home court advantage with its file servers, other commercial product vendors are offering competing high-performance and high-availability file-serving solutions.

Commercial File Serving Solutions


Today, countless products exist for providing file-serving solutions in the Windows space. Among the sea of available products, two major vendors stand out: PolyServe and Symantec (formerly VERITAS). This section will look at each of these product offerings as alternatives to existing Microsoft solutions.

PolyServe NAS Cluster
PolyServe NAS Clusters, like MSCS clusters, offer failover support for file-serving applications. However, PolyServe's approach to clustering is significantly different in that it offers true shared data clustering. Unlike MSCS, PolyServe's Matrix Server platform does not employ a shared nothing architecture, meaning that several servers can access files on the cluster's shared storage simultaneously. This setup provides for true load-balancing support in addition to the failover support found with MSCS. Unlike traditional NAS products, PolyServe's NAS cluster is designed to run on a standard OS, such as WS2K3, and on industry-standard hardware. With this approach, no significant hardware investments are needed to deploy this technology. Remember that even with 8-node scalability, MSCS clusters provide only a single access head for a single virtual file server. Thus, the virtual file server cannot take full advantage of the processing power of all 8 nodes with MSCS and is instead relegated to using the horsepower of a single node.


With the PolyServe architecture, multiple nodes can host virtual server resources simultaneously, so you can have true load-balancing support as well as a much easier scalability model. Earlier, it was mentioned that MSCS could scale by splitting a cluster into two clusters and possibly configuring DFS to run on top of them. This solution requires that you effectively double your hardware and OS investment in order to meet the scalability need. With the PolyServe architecture, a performance problem can be managed by simply adding one more server to the cluster or by reallocating some processing to an underutilized node that is already in the cluster.

Symantec Cluster
Before the growth in popularity of the PolyServe Matrix Server clustering architecture, VERITAS (now Symantec) stood as the leading cluster service provider in the market. Similar to Microsoft's approach to clustering, Symantec's product enables only a single node in the cluster to access a file. Thus, this architecture cannot provide true load balancing. However, VERITAS Cluster Server for Windows does offer an intelligent agent that can dynamically move a virtual server in the cluster to an underutilized node. VERITAS clusters can also scale to 32 nodes, which allows for significantly more growth within a single cluster compared with MSCS.

Current Trends in Windows File Serving


Today, Windows-based file serving has been dominated by two major trends: server consolidation and storage consolidation. This section will explore the impact of these two trends on the Windows file-serving landscape.

Benefits of Consolidation
Server sprawl, which was a common trait of the late 1990s and early 2000s, led to the server consolidation movement. The major problem with server sprawl was that it led to increased TCO for nearly all IT entities. More servers equated to more management, more software licensing, and additional hardware and power costs. As the availability of high-performance system hardware improved, consolidation became an easy sell. For example, if an organization can consolidate from 250 servers to 80 servers without sacrificing performance, the TCO savings would easily reach several hundred thousand dollars a year.

For network administrators, the argument for consolidation is about more than just money. Having fewer servers to maintain can mean fewer trips into the office at 11:00 PM on a Saturday night. With fewer systems to manage, administrators have fewer systems to repair and maintain, and the organization as a whole has significantly fewer software licenses and service contracts to maintain. In essence, consolidation really means utilizing your server resources to their full potential. Instead of running 12 servers with an average CPU utilization of 6 percent, why not consolidate to a single server and take full advantage of that server's CPU resources?
Keep in mind that consolidating to virtual machines running on a commercial VM host product such as VMWare ESX Server or Microsoft Virtual Server could lead to fewer physical systems but likely will not reduce your number of managed systems or related software licensing. File server consolidation is best achieved via clustering, which can reduce the number of both physical systems and managed systems.


Benefits of Shared Storage
Consolidating storage resources is another major file serving trend. With the growth of managed data, maintaining data availability has become an increasingly challenging problem. By consolidating storage resources to a SAN, storage can be allocated as it is needed, which will give you a greater return on your storage investment by not wasting resources through over-allocation. In addition, consolidating to a SAN opens more backup and data management possibilities.
Chapter 6 will discuss storage scenarios in more detail.

Deploying Enterprise-Class Windows File-Serving Solutions


As we've explored, there are many solutions at your fingertips. To help determine the right solution for your environment, let's look at some general guidelines for deploying Windows file-serving solutions.

Pre-Deployment Considerations
The tendency of IT administrators is often to deploy first and customize later. For those who practice this approach, planned customizations take months or even years to complete. With the file server deployed and operational, justifying spending additional time on the project may be difficult, especially with the myriad tasks already on the IT staff's list. To deploy a file server right the first time, planning has to be an important part of the process. One major part of the planning process is deciding which technologies should be used to complement the file server. Table 4.1 lists the most common file serving problems as well as the available technologies that can alleviate or manage the potential problems.
• Improve availability of file versions: deploy and configure Shadow Copies for Shared Folders
• Limit user usage of file server resources: deploy and configure disk quotas
• Provide failover support: deploy and configure a server cluster
• Provide failover and load-balancing support: deploy a PolyServe Matrix shared data cluster
• Provide offline access to data: deploy and configure Offline Files
• Provide antivirus protection: deploy an antivirus solution that is compatible with any installed file serving applications as well as your backup product
• Prevent unauthorized access: determine the necessary permissions for each user or group that has access to the server

Table 4.1: Solutions for the most common file serving deployment problems.
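As a sketch of the disk quota item in Table 4.1, per-user NTFS quotas can be managed from the command line with the fsutil utility in addition to the Quota tab of a volume's properties. The volume, byte values, and account name below are illustrative only; the threshold and limit are specified in bytes (here roughly 800MB and 1GB):

fsutil quota enforce E:
fsutil quota modify E: 838860800 1073741824 CORP\jsmith
fsutil quota query E: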

Unsupported antivirus products can significantly degrade file server performance by triggering a scan of each file during a file server backup. Your backup product vendor should be able to tell you which antivirus products have been tested and thus are supported.


Validating Server and Storage Requirements
At a minimum, WS2K3 requires a 550MHz CPU and 256MB of RAM. As you already know, this setup won't go very far on an enterprise-class file server. For CPU scalability, the installed Windows OS will determine the number of CPUs that are supported. The maximum number of CPUs supported by each WS2K3 edition is:
• WS2K3 Standard Edition: 4 CPUs
• WS2K3 Enterprise Edition: 8 CPUs
• WS2K3 Datacenter Edition: 32 CPUs

In file-serving applications, Microsoft research has shown that going from one to two processors will improve performance anywhere from 1.4 to 1.6 times, depending on the original client load. Going from 1 to 8 processors will result in an improvement between 2.4 and 3.2 times. When sizing RAM, Microsoft has estimated that 1GB of RAM can effectively handle as many as 100,000 simultaneous open file handles. In general, the OS will use as much as 500MB of RAM for its own tasks; thus, a system with 2GB of RAM will have roughly 1.5GB of addressable RAM for open files. The more open file content that can be stored in RAM, the less the file server has to rely on disk paging to serve up files to users, which results in improved performance. As you can see, the amount of RAM in the file server can play a huge role in the number of simultaneous open files that can be supported.

Chapter 3 described the many available methods for providing fault-tolerant storage access; this section will focus on sizing. WS2K3 by itself will consume about 1.5GB of disk storage. Microsoft recommends that for file-serving deployments, you allocate 1.5 times the amount of physical RAM to the paging file, so a file server with 2GB of RAM should have a 3GB paging file. Performance of the page file can also be improved by locating it on a separate disk, which clears an I/O channel for just paging operations. Aside from the OS storage requirements, you will also need to budget for program files and log files. Each application vendor should be able to provide appropriate sizing guidelines for its log files.

Finally, you will also need to budget for the file data itself. This task is often predictable because you should have on hand information about the current file server capacity as well as some historical data showing capacity over the past 12 to 18 months. This data should allow you to predict requirements 18 to 24 months out. For new deployments, it's always best to plan for future capacity. If you are unsure of how much data to budget for, growth can be estimated by looking at reports from previous backups. For example, examining the size of a server's monthly full backup over the past 12 months should provide a reasonable baseline for expected storage growth.
Another major aspect of file server deployment involves backup planning. This topic is covered in Chapter 6.


Summary
This chapter presented several technologies that aid in deploying reliable Windows file-serving solutions. Tools such as Shadow Copies for Shared Folders can give you more flexibility with data availability by allowing users to recover their own files and by providing the ability to perform snapshots of open files during business hours. With the abundance of new tools comes greater complexity; that complexity can be reduced by consolidating file-serving applications to shared data clusters. Although other solutions exist, only shared data clusters can provide the combined benefits of server consolidation, load balancing, and failover support. This chapter was fully devoted to Windows file serving, which represents only a part of the file-serving landscape. The next chapter will look at the issues and technologies surrounding Linux file serving in the enterprise.


Chapter 5: Building High-Performance, Scalable, and Resilient Linux File-Serving Solutions


The last chapter took a close look at the world of Windows file serving. This chapter will take a similar approach with Linux file serving. Although many of the challenges facing file serving today remain consistent between the Windows and Linux operating systems (OSs), the approaches to solving the challenges presented by enterprise file serving on these OSs certainly differ. In this chapter, you will see the world of file serving from a Linux perspective. Along the way, you'll get a close look at several of the challenges facing file serving on Linux platforms. We will also examine the current alternatives (both commercial and open source) for solving the performance, scalability, and availability file-serving issues. Another major aspect of Linux file serving is the ability to integrate Linux file servers into Windows-based networks. With a Windows Active Directory (AD)-dominated client base, building Linux file-serving solutions that can seamlessly integrate into an AD infrastructure is deemed critical by many organizations. To that end, this chapter will also introduce several of the technologies that promote file serving across the heterogeneous enterprise.
Chapter 6 will provide detailed procedures and examples of Linux and Windows integration concepts, such as simplifying authentication using winbind single sign-on and mapping user home folders between both Windows and Linux desktops.

Before we turn to examining the technologies that are being used to solve today's Linux file-serving problems, let's first take a look at the current file-serving landscape.

Challenges Facing the Linux File-Serving Landscape


Today's Linux-based file servers face similar challenges to their Windows-based counterparts. Among these challenges are:
• Performance
• Scalability
• Availability
• Integration

Let's start with a look at performance.


Performance
As an organization grows, so do its demands on file serving. To accommodate growth, several elements of the file server may need to be evaluated:
• CPU utilization
• Memory usage
• Disk performance
• Network bottlenecks

Any of these issues can seriously degrade system performance. Problems such as CPU or memory usage may be overcome with a simple upgrade. The same may hold true for disk performance; upgrading to U320 SCSI or SATA 2.0 hard disks could be a relatively inexpensive solution, depending on the server capacity. Network bottlenecks are often the result of having a single network access point for a file server. This setup is often the case with traditional standalone file servers as well as Network Attached Storage (NAS) appliances. In these instances, often one of the easiest ways to streamline performance management is to consolidate to a shared data cluster. Shared data clustering not only gives you the ability to balance client traffic across several servers but also can provide an alternative to decommissioning servers that have reached their maximum CPU or memory limit and thus cannot be upgraded further. Later in this chapter, additional time is spent analyzing the benefits of shared data clustering as the baseline for Linux file serving in comparison with the traditional approaches.
For more information about performance tuning and data path optimization, turn back to Chapter 3.

Scalability
In time, scalability issues often result in many of the performance problems that were noted in the last section. As your organization's data requirements increase, how does your file server respond? In some organizations, scalability problems are not easy to predict. In some instances, server and storage resources are over-allocated due to anticipating too much growth for one division within an organization. On the flip side, if other resources grow beyond your existing forecasts, some servers may quickly reach capacity. Running at capacity could result in several problems, such as hitting a performance bottleneck or running out of available disk space. To be fully prepared for the pains of scalability, it is important for your file-serving infrastructure to be just as dynamic as the flow of the business processes within your organization.


Availability
With data access being critical to countless business processes, availability of data is also a significant consideration among today's Linux file servers. If a server crashes due to a hardware or software failure, or even from human error, how does the network respond? If the answer is that the administrators are running around scrambling for parts or troubleshooting software, it means that a particular IT shop is not taking advantage of the many high-availability technologies that are currently available. If a file server is crucial to your organization's day-to-day operations, its data should be resilient to any server-based hardware or software failure.

Integration
Integration is another significant concern among those managing Linux file servers. If your organization is running a Windows domain, ensuring that your Linux file servers and domain controllers can seamlessly work together is also very important to the success of your file-serving infrastructure. Managing permissions and authentication between the two OS platforms in many cases presents challenges for administrators. However, with knowledge of the right tools and integration techniques, the two OSs can play together.
Linux-Windows integration is not a topic to be taken lightly; therefore, most of Chapter 6 deals with how to effectively mesh the two environments together.

Another major integration concern with Linux file-server management is that of configuring multiple file servers to coexist in a SAN. Although all major Linux distributions offer fibre channel support, most have limited support in terms of distributed file locking across shared data in a SAN. Another weakness that exists in some of today's Linux file-serving solutions is a lack of reliable multipath support in the SAN. To take advantage of a redundant SAN, predictable multipath support on fibre channel HBAs attached to Linux servers is crucial. When faced with these problems, many Linux shops have turned toward tested and certified solutions offered by third-party hardware and software product vendors. The previous sections have hit on the major problems that exist in the Linux file-serving landscape; let's look at the methods many organizations are currently using to provide file services to their networks.


Existing Linux File-Serving Solutions


Today, there are predominantly four architectures for offering Linux-based file serving:
• Standalone
• NAS
• DFS
• Clustered

This section will take a look at each of these four approaches.

Standalone
The standalone approach to file serving has stood the test of time and is still well suited for many small businesses. With this approach, a single server provides shared data access to users and applications. This approach is suitable for small organizations that do not live and die by the availability of their file services. If availability is critical, one of the next three architectures would be a better bet.

NAS
NAS has been a very popular architecture for Linux file serving in recent years. As many NAS appliances are easy to deploy, include built-in redundant components, and can offer several terabytes of storage, they have been viewed as an easy choice for many organizations. As the last chapter mentioned, one of the problems faced by NAS appliances, however, is growth. If an organization outgrows one NAS, it will need to buy another one. NAS appliances run on proprietary hardware, so a NAS cannot be redeployed for other uses if it no longer serves a file-serving need. Another drawback to NAS is sprawl. If an organization deals with file data growth by continuing to add NAS appliances to the LAN, management costs for a network that could grow to host several NAS devices would inevitably go up. One other problem with NAS appliances relates to performance: it is difficult for NAS appliances to be as resilient to high network traffic as other architectures such as shared data clusters. One final drawback to NAS-based file serving as seen by many organizations is the high cost of a NAS appliance. As nearly all NAS vendors sell products that run on proprietary hardware, cost is another factor that sways organizations toward other Linux-based file serving technologies.


DFS
Like the Windows DFS options discussed in Chapter 4, Linux file servers can also participate in a DFS hierarchy. Linux file servers running DFS via Samba 3.0 can accept connections from any DFS-aware Windows clients, such as Windows 98, Windows 2000 (Win2K), or Windows XP. With DFS support on Samba, there are two ways to integrate Linux file serving into a DFS hierarchy:
• Create links on a Microsoft DFS root that map to Samba Common Internet File System (CIFS) file shares on a Linux file server
• Configure the Linux file server as the DFS root

Most AD shops that run DFS configure Windows DFS controllers as DFS roots and simply create DFS links to any CIFS file shares on Linux Samba servers. This approach allows organizations to take advantage of some of the Windows DFS features that have not yet made it into Samba, such as DFS root replicas and AD site-awareness. If your preference is to run your entire file-serving infrastructure on Linux, you may opt to configure the DFS root on a Linux box, then point each DFS link to other Linux file servers. DFS is unique in file-serving architectures in that it does not have to represent an absolute choice. Instead, DFS can complement other file-serving approaches such as standalone, NAS, or clustered. The ability to deliver transparent access to file shares could free up administrators to migrate file shares to other servers without having to impact users. Instead, all that would need to be updated would be the DFS link that exists at the DFS root so that it references the new shared folder location.
Basic Samba setup is covered later in this chapter.

Clustered
Another major approach to Linux file serving is to implement a clustered file server. For Linux file serving, two open source cluster solutions currently exist:
• Linux-HA failover clustering
• LVS load-balanced clustering

These solutions are described in the next two sections.

Failover Clustering
Open source failover clustering on Linux is provided by the High-Availability Linux Project (http://www.linux-ha.org). Linux-HA clusters can be configured on nearly any Linux distribution. The key to the operation of Linux-HA clusters is Heartbeat. Heartbeat is the monitoring service that allows one node to monitor the state of another and assume control of the cluster's virtual IP address if the primary node fails. Heartbeat also provides the ability to automate the startup of services on the standby node.


In a traditional Heartbeat scenario, two virtually identical servers are configured, with one acting as the primary server and the second as the standby server. Both servers are kept in sync by replication from the primary server to the standby server, and the standby server will routinely send a heartbeat signal to the primary server, which, if it is up and running, will respond. If the primary server fails and the heartbeat signal goes unanswered, the standby server will assume the role of the primary server. Many Linux vendors have jumped on the Heartbeat bandwagon. One such vendor is SUSE, which includes the Heartbeat setup packages on its SUSE Linux Enterprise Server setup CD. For distributions such as Red Hat Enterprise Linux, you can download Heartbeat and all necessary dependent packages from http://www.ultramonkey.org. Ultra Monkey provides the software and documentation for running Linux-HA on Red Hat distributions. Figure 5.1 shows a simple Linux-HA failover cluster.

Figure 5.1: A 2-node Linux-HA failover cluster.

Note that in Figure 5.1, each node maintains its own copy of local storage. For file serving, this setup can prove to be very challenging. In order for each cluster node to present a consistent view of file system data, the local storage on each node will need to be continually synchronized. To maintain consistency across the local storage in the cluster, many organizations turn to rsync. With rsync, you can configure incremental replication to run between the nodes in the failover cluster. Doing so will ensure that the second node in the cluster (RS2) will have up-to-date data in the event of a failover of the first node (RS1). Of course, this functionality comes with a few significant drawbacks. For the sake of supporting failover, you would need to double your storage investment; for clusters consisting of more than two nodes, this investment would be proportionally higher. As you can imagine, this presents significant problems when facing storage growth. Also, if the servers are configured to replicate every 15 minutes, for example, then the standby server may come online with stale data. To achieve true high-availability failover, it's best to implement a shared storage environment so that when the standby server is called into action, it has access to the file system at the point where the primary server left off, without any replication delay.
For more information about configuring incremental file replication using rsync, visit http://rsync.samba.org.


Load-Balanced Clustering
Most Linux load-balanced clusters are based on the Linux Virtual Server (LVS) Project. Compared with the Microsoft network load-balanced cluster architecture, LVS uses a fundamentally different approach. With LVS, one or two servers outside of the cluster are used to distribute client traffic among cluster members. Thus, to build a 3-node LVS cluster, you'll need at least four servers. Figure 5.2 illustrates this configuration.

Figure 5.2: A 3-node load-balanced cluster.

In Figure 5.2, the server labeled Load Balancer accepts incoming client requests and directs them to an internal Real Server (RS). Each RS is a cluster node. With the load balancer directing client traffic, the RS nodes in the cluster can be located anywhere that has TCP/IP connectivity to the load balancer. Thus, each RS does not have to be on the same LAN segment. As the load balancer is the director for all client requests, having one server as the load balancer does have one fundamental flaw: a lack of fault tolerance. If the load balancer fails, the entire cluster is brought down. To avoid this problem, most LVS cluster implementations use two systems as load balancers. One system serves as the active load balancer, and the second system is passive, only coming online in the event that the active system fails. Figure 5.3 shows a fault-tolerant LVS cluster.


Figure 5.3: A 3-node fault-tolerant load-balanced cluster.

As with the failover cluster, the LVS load-balanced cluster by default allows each real server to maintain independent local storage. This setup again means that to maintain consistency across the cluster, a replication tool such as rsync will need to be employed. Now that you have seen the basic operation of an LVS cluster, you may be wondering whether the load balancer acts as a bottleneck for client access. The answer lies in the LVS architecture that is applied to the cluster.

LVS Architecture
LVS is generally configured in one of three ways:
• LVS via Network Address Translation (NAT)
• LVS via IP tunneling
• LVS via direct routing

In the next three sections, we'll look at each of these unique configurations.


LVS via NAT With the LVS via NAT architecture, the load balancer server is dual-homed and NATs all traffic to the real servers on an internal LAN. Figure 5.2 and 5.3 show this configuration. With NAT, each load balancer server directs client traffic into the internal LAN and to a real server. When the real server replies, the reply goes back through the load balancer system before returning to the requesting client. This approach can present both a performance bottleneck as well as scalability limits. Most LVS cluster implementations cannot scale beyond 10 to 20 nodes and still see any gains in performance. LVS via IP Tunneling Several advantages exist with the LVS via IP tunneling, most notably scalability. Unlike configuring LVS via NAT, the IP tunneling approach causes the load balancer server to direct client requests to the real servers via a Virtual Private Network (VPN) tunnel. Replies from the real servers will use a different network. This approach does not have the scalability limitations of LVS via NAT. With use of VPN tunneling, this cluster can easily be distributed among multiple sites and connected via the Internet. However, this approach is usually best suited for load balancing between FTP servers and is rarely applied as a high-performance file-serving solution. LVS via Direct Routing The LVS via direct routing approach is similar to LVS via NAT, except that reply traffic will not flow back through the load balancer; instead, replies will be sent directly from the real servers to the requesting client. As with LVS via NAT, real servers connect to the load balancer via the LAN. Replies from the real servers would return to the client over a different LAN segment that is routable to the requesting client. Unlike the LVS via IP tunneling approach, this method is more sensible for LAN-based file serving. However, it is still far from the best solution for enterprise file serving. The currently available commercial solutions are far superior to their open source counterparts.
Although open source clustering technologies have emerged as methods for increasing the availability and performance of file servers, many organizations are wary of open source technologies due to a lack of support. If a failure occurs, help may be days (instead of minutes) away.


Commercial File-Serving Solutions


There are several commercial file-serving solutions in the Linux space, including:
• PolyServe NAS Cluster
• VERITAS (now part of Symantec) Cluster Server
• Red Hat Cluster Suite and Global File System (GFS)

The next three sections look at each of these enterprise file-serving solutions in closer detail.

PolyServe NAS Cluster

PolyServe NAS Cluster provides all the benefits of NAS (consolidation, ease of management, high availability) as well as the advantages of both Linux-HA and LVS clustering. PolyServe NAS Clusters offer failover support for file-serving applications and offer true shared data clustering. In a PolyServe Matrix Server cluster, each node in the cluster shares a common storage pool in a SAN. Thus, with all cluster shares residing in a common location, there is no need to replicate file-server data between nodes. In comparison with the 3-node Linux-HA cluster shown earlier in Figure 5.1, migrating to a PolyServe NAS Cluster platform allows you to immediately triple the amount of storage available to the cluster. Assuming that a Linux-HA failover cluster had 500GB of local storage attached to each node, the cluster would have 1500GB of total storage, with only 500GB that is truly writable. The reason is that the shared cluster storage on each node must mirror the storage of the other nodes in the cluster. If the same storage resources were applied to a PolyServe NAS Cluster, all 1500GB of storage would be writable. Figure 5.4 provides a comparison between a PolyServe Matrix Server cluster and a Linux-HA cluster.


Figure 5.4: PolyServe NAS Cluster vs. Linux-HA cluster.

The fact that multiple nodes in a PolyServe NAS Cluster can simultaneously access shared files provides for high-performance load balancing as well as failover support. Thus, with this architecture, you can get the benefits of open source clustering products as well as a maximum return on your storage investment.


Aside from PolyServe's better approach to clustering, it also has advantages over traditional NAS vendors such as Network Appliance and EMC. Unlike NetApp and EMC appliances, PolyServe's NAS Cluster can be deployed on industry-standard Intel or AMD platforms running Linux. And unlike a traditional NAS appliance, the answer to a performance bottleneck is not to purchase a separate NAS; instead, you can simply add another node to the cluster.
For more information about the PolyServe NAS Cluster solution, download the white paper UNIX to Linux Migration at http://www.polyserve.com.

VERITAS Cluster

Similar to the Windows clustering solution described in Chapter 4, VERITAS offers a comparable clustering solution for Linux. Although this product offers failover support, it does not provide the load-balancing support or shared data capability found in PolyServe NAS Clusters. VERITAS does make up for its lack of load balancing by offering other features, such as an intelligent agent that can dynamically move a virtual server in the cluster to an underutilized node; PolyServe offers a similar capability through the movement of virtual IP addresses. VERITAS clusters can scale to 32 nodes, giving you plenty of room for growth. If performance and availability are primary concerns, however, the VERITAS cluster solution has trouble delivering in performance-demanding environments. This shortcoming is essentially due to the fact that VERITAS Linux clusters provide only failover support and do not allow multiple nodes in the cluster simultaneous access to the same file.

Red Hat Cluster Suite and Global File System

Red Hat offers its own commercial clustering product, which is the company's adaptation of the Linux-HA project. Unlike Linux-HA, which is available for free via download and with SUSE Linux, Red Hat's Cluster Suite must be purchased as a separate add-on to the Red Hat Enterprise Linux Advanced Server OS. The Red Hat Cluster Suite provides support for as many as 8-node failover clusters. It supports shared storage via SCSI or Fibre Channel, a management UI to simplify configuration, and a shared cluster quorum. In a significant departure from many traditional Linux server-management practices, Red Hat only supports management of its Cluster Suite using its Cluster Manager GUI tool. If you want to change cluster configuration files via a text editor, you're on your own! The Red Hat Cluster Suite also supports Global File System (GFS), which provides better integration with storage networks. GFS supports simultaneous reads and writes to a single shared file system in a SAN. This feature allows clusters configured with the Red Hat Cluster Suite to offer both failover and load-balancing support, similar to the PolyServe NAS Cluster.


Deploying Performance-Based Scalable Linux File-Serving Solutions


Now that you are aware of the available alternatives, let's take a look at some considerations for deploying Linux file-serving solutions.

Pre-Deployment Considerations

The tendency of IT administrators is often to deploy first and customize later. For those who practice this approach, planned customizations can take months or even years to complete. After all, with the file server deployed and operational, justifying additional time spent on it may be difficult, especially if you're like many IT folks and have countless other tasks on your list. To deploy a file server right the first time, planning has to be an important part of the process. One major part of the planning process is deciding which technologies should be used to complement the file server. Table 5.1 lists the most common file-serving problems as well as the available technologies that can alleviate or manage them.
File-Serving Problem: Limit user usage of file-server resources
Solution: Deploy and configure disk quotas

File-Serving Problem: Provide failover support
Solution: Deploy and configure a Linux-HA cluster

File-Serving Problem: Provide failover and load-balancing support
Solution: Deploy a third-party product

File-Serving Problem: Provide antivirus protection
Solution: Deploy an antivirus solution that is compatible with any installed file-serving applications as well as your backup product

File-Serving Problem: Prevent unauthorized access
Solution: Determine the necessary permissions for each user or group that has access to the server

Table 5.1: Solutions for the most common file-serving deployment problems.
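As a concrete illustration of the first row in Table 5.1, the following is a minimal sketch of enabling per-user disk quotas on a Linux file server. The user name, limits, and mount point are hypothetical, and the file system must be mounted with the usrquota option before the commands will work:

# /etc/fstab entry for the data volume, with user quotas enabled
/dev/sdb1  /srv/data  ext3  defaults,usrquota  0 2

# build the quota database, then turn quotas on
quotacheck -cu /srv/data
quotaon /srv/data

# give user jsmith a 4GB soft / 5GB hard block limit (1KB blocks), no inode limit
setquota -u jsmith 4000000 5000000 0 0 /srv/data

# report current usage and limits for the volume
repquota /srv/data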

With some of the general requirements under your belt, let's look at the process of sizing both server and storage requirements.

Server Sizing

One of the most difficult aspects of deploying any server is determining the server's hardware requirements. This task can be difficult, and the result is often an educated guess based on past experience. To help administrators build servers that fit their needs, many hardware vendors offer online sizing tools. To improve the reliability of the results, sizing tools are typically organized by server purpose (such as file server) and OS. One such tool is the IBM eServer Workload Estimator, which is available at http://www912.ibm.com/wle/EstimatorServlet. Figure 5.5 shows this tool.


Figure 5.5: The IBM eServer Workload Estimator tool.

In the example in Figure 5.5, the Workload Estimator is being used to size a Samba server running on SUSE Linux Enterprise Server 9. Note that the tool allows you to size server hardware requirements based on factors such as concurrent user sessions, average user throughput, and average storage allocation per user. Once you provide the estimator with the necessary information (or accept the default settings), the tool recommends server hardware that will meet your performance requirements. For SUSE Linux Enterprise Server 9 Samba servers, IBM offered the general guidelines that Table 5.2 shows.
Environment: Large (400 concurrent users)
Recommended CPU: 1.9GHz 4-core
Recommended RAM: 4GB
Server Platform: P5 550 Express

Environment: Medium (200 concurrent users)
Recommended CPU: 1.9GHz 2-core
Recommended RAM: 2GB
Server Platform: P5 520 Express

Environment: Small (85 concurrent users)
Recommended CPU: 1.65GHz 1-core
Recommended RAM: 2GB
Server Platform: P5 505 Express

Table 5.2: IBM Linux file-server sizing recommendations.


If your preferred server vendor does not offer an online tool to assist in Linux file-server sizing, you can probably pass along your requirements to your local vendor representative. The local rep should be able to use an internal tool or consult an engineer to arrive at the proper server sizing requirements for your environment. As each server application and server uses system resources slightly differently, there is no one-size-fits-all tool for server resource sizing.

Storage Sizing

Storage sizing starts with allocating adequate internal disks for the OS, applications, log files, and the paging file. For file-server deployments, a best practice is to allocate 1.5 times the amount of physical RAM to the paging file. Thus, a file server with 4GB of RAM should have a 6GB paging file. For optimal performance, the paging file should be stored on a separate disk, which dedicates an I/O channel to paging operations. For log-file storage sizing, consult the application vendors for each application running on the server. Once you have determined the storage requirements for the OS, paging file, and applications, you can then move on to the storage requirements for the data itself. This value is often predictable because you should have on hand information about the current file-server capacity as well as historical data showing capacity over the past 12 to 18 months. For file-server data sizing, a good practice is to requisition ample storage to meet the expected data growth for the next 18 to 24 months. When unsure about past storage growth, backup logs can usually provide the information you need: simply review the statistics for the monthly full backups over the past year. Projecting that growth rate forward lets you gauge the storage you will need over the next 1.5 to 2 years. Once you have a handle on how much storage you need, you can work with your preferred storage vendor to decide the type and size of disk drives that you'll need to purchase. As with server sizing, most storage vendors offer sizing tools that can assist you in determining the storage devices that will meet your disk storage requirements. One such tool that can help in identifying the hardware components that could support your storage requirements is the HP StorageWorks Sizing tool, which is available at http://h30144.www3.hp.com/. With this tool, you can enter your planned capacity and RAID level, and the tool will generate information about the hard disks to use to meet your requirements as well as the overall storage efficiency of your planned storage system. Being able to view efficiency is very helpful when comparing different RAID levels. Figure 5.6 shows a portion of the HP StorageWorks Sizing tool output.
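Before turning to the vendor tools and Figure 5.6, here is a minimal back-of-the-envelope sketch of the growth projection just described. The backup sizes are hypothetical placeholders taken from a year of full-backup logs; only the oldest and newest figures are needed for a simple straight-line estimate:

# full-backup sizes in GB: 12 months ago and the most recent month (hypothetical)
oldest=510
newest=680

awk -v start=$oldest -v end=$newest 'BEGIN {
    growth = (end - start) / start        # fractional growth over the past year
    projected = end * (1 + growth * 2)    # straight-line projection 24 months out
    printf "12-month growth: %.0f%%\n", growth * 100
    printf "Capacity to plan for (24 months): %.0f GB\n", projected
}'

A straight-line estimate like this ignores one-time events such as migrations or archive purges, so sanity-check the result against what the business expects.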


Figure 5.6: Comparing RAID level efficiency using the HP StorageWorks Sizing Tool.

In the example that Figure 5.6 shows, a 1TB RAID 5 configuration was compared with a 1TB RAID 10 configuration. The tool shows that the RAID 10 configuration would be 49 percent efficient, while the RAID 5 configuration would be 74 percent efficient. The tool also shows the disk size to be used as well as the total amount of storage to be purchased. For example, the 1TB RAID 5 would incorporate a total of twelve 146GB hard disks, for a total raw capacity of 1752GB; the usable capacity would be 1293GB. Once you have your Linux file-serving hardware sized, you are ready to deploy and manage the essential Linux file-serving services.

Managing Enterprise-Class Linux File Serving


Regardless of whether you have a standalone, NAS, or clustered file server, the protocols that enable file sharing on Linux file servers are the same. This section will look at the roles of the following protocols and services as they pertain to Linux file serving:
• Network File System (NFS)
• Samba

Let's start with a look at NFS.

NFS

NFS has long been the de facto file-sharing protocol on UNIX and Linux servers. NFS has stood the test of time because it provides a simple and efficient means for sharing data between systems on a network. NFS has continued to evolve and improve with age, as the recent improvements introduced in NFS v4 prove.


What Is New in NFS v4?

NFS v4 is currently supported on both the SUSE Linux 10 and Red Hat Enterprise Linux 4 distributions. NFS v4 offers several new features that significantly improve both the performance and the security of NFS. The following list highlights some of the most significant improvements brought about by NFS v4:
• Improved security: Supports Kerberos v5 and Simple Public Key Mechanism 3 (SPKM3)
• Better ACL management: Supports named attributes; user and group information is stored as strings instead of numeric values
• Better firewall compatibility: The disparate NFS protocols (ACL, mount, NFS, NLM, and stat) are now combined into a single protocol specification
• File delegation: NFS clients can now modify files stored locally in their own cache without having to send requests back to the NFS server; this feature provides a significant network performance improvement
• Lease-based file locking: NFS v4 clients lock files based on a share reservation; if an NFS v4 client loses contact with a server, once its lease on a locked file expires, that file is free to be accessed by other users
• File migration and replication: File migration and replication are now supported via NFS

With a general overview of NFS under your belt, let's examine the steps for getting this service up and running.

NFS Setup Checklist

Setting up NFS is a relatively straightforward process. Let's start by looking at the general steps for configuring and enabling NFS on a Linux file server:
• Define the folders to publish as shares in the /etc/exports file.
• Set local permissions for each shared folder as necessary.
• Define the hosts and logical networks that are allowed access to the NFS service by editing the /etc/hosts.allow and /etc/hosts.deny files.
• Start the NFS service.
• Mount a shared folder from an NFS client.
Here's an example of /etc/exports configured to share a folder named /public:

/public/ *(ro,root_squash,sync)

Once the shares are defined in the /etc/exports file, you then need to ensure that the proper local permissions are set for each exported folder. This step is necessary to prevent unauthorized access, modification, or deletion of shared files. Network access can be restricted on a host-by-host or network-by-network basis by editing /etc/hosts.allow and /etc/hosts.deny. When a connection is attempted to a Linux file server, the connecting host's IP address is first evaluated against the /etc/hosts.allow file. If no match exists, the /etc/hosts.deny file is checked; if a match exists there, the host is denied access. By default, if no match exists in either file, the host is allowed access. If you want to deny all traffic from any hosts or services not explicitly listed in the /etc/hosts.allow file, add the following line to the /etc/hosts.deny file:
ALL:ALL

Although denying all traffic not explicitly granted access is the most secure method of locking down a file server, you will need to remember this setting whenever you set up additional network services or applications on the file server. If the new service or application is not allowed in /etc/hosts.allow, clients will not be able to connect to it. Once you have created the catch-all deny rule in /etc/hosts.deny, you then need to edit /etc/hosts.allow to grant access to the appropriate hosts or network segments. The following example shows how to configure /etc/hosts.allow to grant NFS access to hosts on the 172.16.1.0/24 subnet:
lockd: 172.16.1.
rquotad: 172.16.1.
mountd: 172.16.1.
statd: 172.16.1.

At this point, you can start the NFS service and you are on your way. Linux distributions are continually improving their GUI management tools, and such is particularly the case with SUSE Linux 9. NFS can be fully configured within minutes by using SUSE Linux's YaST, as Figure 5.7 shows.
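Before looking at the YaST screen in Figure 5.7, here is a quick command-line illustration of the final two checklist steps. The host name and paths are hypothetical, and init-script names vary slightly by distribution; the commands below assume a Red Hat-style server:

# on the server: start the NFS service and publish the shares in /etc/exports
service nfs start
exportfs -ra

# verify what the server is exporting
showmount -e fileserver01

# on an NFS client: mount the exported folder
mkdir -p /mnt/public
mount -t nfs fileserver01:/public /mnt/public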


Figure 5.7: Configuring NFS using YaST.

Now that you have seen how to complete the initial setup of NFS, let's take a quick look at Samba.
For more information about NFS, point your Web browser to http://nfs.sourceforge.net.

Samba

Samba provides the functionality for Linux file servers to host shared folders that are accessible via the CIFS protocol, which is the default file-sharing protocol for all Windows-based OSs. Both Red Hat Enterprise Linux 4 and SUSE Linux 9 run Samba 3. With Samba 3, major improvements were made that allow for reliable authentication between Windows Active Directory (AD) domain controllers and Samba servers. Although the reliability improvements are significant, Samba's feature set is closer to that of a Windows NT Primary Domain Controller (PDC) than that of a Win2K or Windows Server 2003 (WS2K3) AD domain controller. This limitation is expected to change in Samba 4.


What Is Coming in Samba 4.0?

The upcoming release of Samba 4.0 is being hailed as Samba's first true challenge to AD. Among the planned features for Samba 4.0 are:
• Support for AD logon and administration protocols
• An internal Lightweight Directory Access Protocol (LDAP) server
• An internal Kerberos server
• A flexible (extensible) database architecture
• Full NTFS semantics
• Much better scalability

Many administrators have embraced Samba 3 for its ability to provide highly available CIFS file serving. With so many planned enhancements in Samba 4, its pending arrival has garnered significant buzz in the industry.

Samba Deployment

Samba deployment is similar in approach to NFS deployment, with the exception that additional attention needs to be paid to Windows authentication, considering that Samba file servers are most often used to provide access to Windows client systems. As with NFS, Samba can be configured using YaST on SUSE Linux or with the Samba Server Configuration tool (see Figure 5.8) on Red Hat Linux.

Figure 5.8: Red Hat Linux Samba server configuration.


The Samba server configuration is stored in the /etc/samba/smb.conf file. The following example shows the smb.conf settings that match the /public share definition shown in the Samba Server Configuration tool earlier:
[public]
comment = Company Docs
path = /public
writeable = yes

This configuration creates a writable share named public. In addition to defining the shares and the level of share access, you need to set permissions for the shared files and folders. In the next chapter, you will see how to set permissions on a Linux Samba server for Windows user accounts residing in an AD domain. Because so many Samba issues in the enterprise are directly related to Windows, the bulk of the information on fully deploying Samba is provided in Chapter 6.
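After editing smb.conf, it can be useful to validate the file and confirm that the share is visible. A minimal sketch follows; the server name, account, and init-script name are hypothetical and assume a Red Hat-style system:

# check smb.conf for syntax errors and print the resulting share definitions
testparm

# restart Samba so the new share takes effect
service smb restart

# from any machine with the Samba client tools, list the server's shares
smbclient -L fileserver01 -U jsmith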
For additional information about Samba setup and configuration, read the Official Samba HOWTO at http://www.samba.org.

Current Trends in Linux File Serving


Linux file servers have gained from three major trends in the IT industry:
• Migration from UNIX to Linux
• Server consolidation
• Storage consolidation

This section will look at the impact of these three trends on the Linux file-serving landscape.

Migration from UNIX to Linux

When Linux first burst onto the IT scene, many thought that it would be a serious challenger to Windows. Although whether Linux will ever overtake Windows remains to be decided, UNIX OSs have suffered substantially at the hands of Linux. To most, moving from UNIX to Linux is a no-brainer. Many of the enterprise applications that run on UNIX, such as Oracle, also run on Linux. Because Linux OSs run on industry-standard Intel-based hardware platforms, Linux servers are far less expensive than UNIX servers that run on proprietary hardware. Proprietary platforms also tend to cost more to maintain; the added cost comes not only from the proprietary hardware in the server but also from the higher salaries commanded by administrators with the specialized skills needed to maintain a UNIX server. With Linux file-serving solutions able to offer performance comparable to UNIX servers at a fraction of the price, migrating legacy UNIX boxes to Linux is a logical step.


Benefits of Consolidation

Another logical step that many have taken in file-server management is consolidation. Both proprietary UNIX servers and NAS appliances have been major contributors to server sprawl. For organizations that have anywhere from two to five NAS appliances, management overhead grows as the network expands. As with UNIX migration, the detriments of server sprawl are easy to spot and have led a flood of organizations to consolidate dozens of UNIX servers and NAS appliances onto Linux clusters. The bottom line with consolidation is that it can result in significant yearly savings. Take, for example, a consolidation project that reduces 60 servers to two 15-node PolyServe Matrix clusters. In this case, the total cost of ownership (TCO) savings could easily reach several hundred thousand dollars a year. Fewer servers can also mean fewer software updates. With less to maintain, IT shops can stretch their budgets further. Consolidation is not about getting smaller for the sake of getting smaller; it is about getting the most out of your existing hardware investments. Having several servers at 30 percent CPU utilization, for example, means that you have several servers whose CPUs are doing nothing 70 percent of the time. If your organization has paid for the hardware, it should get the most out of it. Because consolidation is about reducing hardware costs and system management, it is important to keep in mind that file-server consolidation is best suited to shared data clustering. Clustering provides the ability to configure failover support and load-balanced data access for critical file servers. Other approaches to consolidation, such as those that consolidate file servers to virtual machines, only reduce the amount of managed hardware on the network. They do not reduce the number of managed systems on the network and thus will not help with reducing software licensing costs. Thus, although there are several ways to go about file-server consolidation, consolidating to a shared data cluster that offers load balancing, failover, and streamlined management from a single console has been deemed the most logical methodology by many organizations in the IT community.

Storage Consolidation

Most organizations also consolidate their storage resources while consolidating file-server resources. Storage consolidation offers several benefits:
• More efficient utilization of purchased storage resources
• Simpler storage scalability
• The ability to back up and protect data using methods that are not available to traditional DAS storage
• The ability to share data between redundant servers instead of having each server maintain its own local copy of data files


When combined with server consolidation to a shared data cluster, sharing disk resources between servers in a SAN also allows for true load balancing of data access to storage. With consolidated storage, when a need arises for additional disk resources, the disks can simply be added to the SAN and then mapped to the server that needs them. This method is more efficient for managing storage than the traditional process of marrying a disk array to a single server. If you allow it, the complexity of networks will only continue to grow over time. Warding off network complexity requires you to be proactive. The instinct when growth occurs is to buy more parts, but more parts only add complexity and, in turn, more management cost. Streamlining your network with consolidated server and storage resources will ultimately lead to better TCO. When combined with shared data clustering, consolidation will also result in vastly improved reliability and performance.

Summary
In this chapter, you were presented with the state of the Linux file-serving world as well as best practices for optimizing Linux file serving in production. The final chapter will look at the management issues surrounding heterogeneous networks. In particular, you'll see how to configure winbind authentication on your Linux file servers to support user authentication to a Linux server via AD. For environments that are running both Windows and Linux desktops, you will also see how to set up user home folders to be shared across both Windows and Linux workstations. After tackling the most challenging Windows-Linux integration issues, the chapter will then examine modern backup methodologies that are used to maintain data availability and disaster protection for both Windows and Linux file servers.


Chapter 6: Managing High-Performance, Scalable, and Resilient Data Across the Enterprise
The two previous chapters took an in-depth look at both Windows and Linux file-serving solutions and how these two operating systems (OSs) present individual challenges and advantages in the areas of performance, scalability, availability, and integration. As vendors have worked to manage and mitigate these challenges, a variety of new technologies has been developed to meet data-management needs. As each new incarnation is adopted, heterogeneous networks have developed over time, presenting challenges in the areas of integration, backup and recovery, and the freedom to manage your storage solution the way you see fit. This chapter will examine the broader enterprise picture and leverage what previous chapters have discussed to develop a clearer understanding of the high-level responsibilities (and granular realities) of managing data across the enterprise. This chapter will assume a holistic vantage point to examine the challenges faced in the enterprise today and touch on key points to consider when defining the strategy behind building highly scalable enterprise file-serving solutions. Let's begin by examining a few of the challenges faced in heterogeneous networks and how those challenges come into play in the enterprise, and then progress into the areas of backup and recovery.

Challenges Facing Heterogeneous Networks


For reasons apparent to IT managers and administrators, it is appropriate to refer to large-scale implementations of networked computer systems as an enterprise, a word that not only means an undertaking but more specifically an undertaking of great scope, complication, and risk. The scope of an enterprise may vary, but generally, all enterprise environments are complicated landscapes comprised of one or more heterogeneous networks that come together to be defined as one enterprise Wide Area Network (WAN). As IT managers work to align solutions with their individual IT missions and goals, meet compliance requirements, increase their return on investment, and drive cost efficiency, decisions are made that very rarely permit each solution to align with a single OS or product offering, and what is decided upon gets added to the all-encompassing umbrella referred to as the enterprise. Heterogeneous networks comprised of many OSs and protocols present inherent challenges such as:
• Inhibited agility
• Complexity
• Integration concerns
• IT risk and compliance considerations

This section will focus on the numerous management concerns presented when managing a heterogeneous network environment.


Inhibited Agility

Although you will often find strength in diversity, diversity also has a price. Protocols that are not inherently compatible with one another complicate the environment, creating additional management overhead and inhibiting the capability of the enterprise to remain agile. Agility within the enterprise is the enterprise's ability to quickly reconfigure IT resources to meet changing business demands. An enterprise should strive to be flexible in its ability to quickly respond to changing business requirements and meet the growing needs of the organization. Remaining agile in an environment populated by years of accumulated, differing storage solutions is challenging, to say the least. Many data centers are faced with the realization that although "more" often equates to "better" in the business world, the same is not the case in the data center. Striving to meet shorter times to market for emerging business initiatives is often hindered by the capacity of both the physical and virtual enterprise IT infrastructure components required to meet the growing demands of the organization.

Complexity

Relying on multiple OSs, hardware platforms, and software platforms inhibits the ability of administrators to centrally manage the entire environment. As the complexity of the enterprise environment increases, so do the costs of managing it. In addition to the core competencies required of the engineering, architecture, and support staff, these costs include implementing and maintaining the systems and technologies required to support, maintain, and, when disaster strikes, recover the environment. Whenever possible, steps should be taken to simplify the enterprise architecture to minimize these costs and maximize the effectiveness of ongoing support efforts.

Integration Concerns

Enterprise IT managers are continually driven to seek harmony in their environments. Compatible systems reduce the total cost of ownership of the environment by allowing the enterprise to standardize and simplify. As new systems, protocols, and applications are developed by competing vendors, often with little internal desire to remain compatible with the competition, the challenge of integrating enterprise systems escalates and becomes burdensome. Focusing on integration will help you reduce cost by reducing complexity and will simplify management efforts by reducing or eliminating incompatible systems.

IT Risk and Compliance Considerations

Each OS, firmware revision, and supporting software application presents unique security and compliance challenges. These range from the broad consideration of service packs and hotfixes to granular OS and application configuration. The amount of time associated with managing and mitigating risks and maintaining compliance across the enterprise will inherently impact server, system, or network availability. As the enterprise strives to remain secure and compliant, the impact to availability is often felt. These risks need to be analyzed for their potential to impact business processes and goals.


Integrating Windows and Linux File-Serving Solutions


Windows and Linux file-serving solutions represent different, yet often complementary, approaches to file serving. At their very core, these two OSs are dramatically different and distant in design; as such, the solutions developed for each OS have been accordingly disparate. Central to integrating these two platforms are the challenges presented by:
• Common Internet File System (CIFS) and Network File System (NFS) integration
• Managing Access Control Lists (ACLs)
• Integration with existing services

This section will focus on each of these challenges and how you can leverage the information covered in the previous chapters to effectively integrate a cross-platform file-serving solution.

CIFS and NFS Integration

Chapter 4 presented Microsoft's Shadow Copy and its reliance on CIFS as the entry point for users to recover files. Chapter 5 presented NFS, which has risen to become the unmatched standard file-sharing protocol on UNIX and Linux servers. In addition to the out-of-the-box packaged distributions of UNIX and Linux, many NAS devices, the vast majority of which are based on Linux, use NFS as well. Enterprise IT managers looking to maximize their return on investment by utilizing the best Windows- and Linux-based technologies are now faced with a bit of a dilemma: incompatibility. The gap that exists between these two protocols can be bridged, but doing so presents performance concerns. Software solutions exist that allow UNIX- and Linux-based servers to provide remote file access to PCs without requiring NFS. The most widely used of these are Samba, Hummingbird NFS Maestro, and Microsoft Windows Services for UNIX (SFU). Samba is a server-side installation, while NFS Maestro and SFU are NFS redirectors that are installed on client workstations. When considering the use of a redirector, it is important to keep in mind the scalability of the solution and the architecture's dependency upon it. Installing NFS Maestro on a few dozen workstations that require access to a UNIX system for a special purpose, such as a small accounting department that needs to access an NFS share to run a particular report, may be acceptable in the short term. Success, however, often equates to scale, so pay heed to the overall strategy. What is communicated to you today as a special-purpose, small-scale implementation could very well develop into a situation that is much larger in scale than anyone initially foresaw. Both Windows and Linux support the other's native file-sharing protocol: Linux through the use of Samba for SMB/CIFS services, and Windows through SFU for NFS. On a Linux server, one might encounter the native NFS stack running in combination with Samba to serve files to both kinds of clients; on Windows, the native CIFS stack can be used with SFU to provide the same services.
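As a small illustration of this dual-protocol reality, the same Linux export can be reached from a Linux client over NFS and over CIFS, while Windows clients see it as an ordinary network share. The host and share names below are hypothetical, and the cifs mount type assumes a reasonably current Linux kernel with the Samba/CIFS client utilities installed:

# Linux client mounting the share over NFS
mount -t nfs fileserver01:/public /mnt/public-nfs

# Linux client mounting the same data over CIFS, as a Windows client would see it
mount -t cifs //fileserver01/public /mnt/public-cifs -o username=jsmith

# Windows clients simply map the Samba share as a network drive
# (from a Windows command prompt): net use P: \\fileserver01\public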


Installing and managing a centralized server to bridge the gap is often a better option, not only to reduce complexity but also to help prevent application sprawl by eliminating the need for redirector software to be installed on multiple systems across the enterprise. The use of Linux as a mainstream file-serving solution is becoming a reality, and many IT managers are no longer afforded the luxury of sticking with a single protocol for all their file-serving needs. This reality adds to the complexity of the environment, as the NFS and CIFS protocols may both be required. NFS vs. CIFS is not an all-or-nothing equation; both can exist simultaneously to further enhance the capabilities of an existing infrastructure, and as technologies are continually developed to further enable enterprise management, the inclusion and integration of both the CIFS and NFS file-sharing protocols can dramatically simplify the inclusion of future systems into the multi-protocol file-sharing environment.

Managing ACLs

Security management is becoming an increasingly complex task. Adapting to industry and regulatory compliance needs puts the enterprise IT manager in a virtually continuous reactive state. As new compliance directives are developed and subsequently adopted by your organization, information security policies, standards, guidelines, and procedures are implemented to meet these needs. Legislative compliance standards such as the Health Insurance Portability and Accountability Act (HIPAA) of 1996 provide clear directives on what needs to be protected from disclosure as well as severe penalties for organizations that fail to meet those directives. The only piece this legislation doesn't supply is the means to implement the changes to meet the requirements; that decision is left to you. Often the most challenging, and certainly the most granular, task an IT administrator may face is managing ACLs. This task can be further compounded when file-serving resources are spread across multiple platforms with dissimilar ACL architectures. Different ACLs supporting separate application systems or users make it more difficult to meet legislative compliance requirements because they lack the means to be centrally managed effectively. Because the ACL security model in NTFS is more robust than and fundamentally different from the file-security model used in Linux, no one-to-one mapping can be made between them. The fundamental problem occurs when a Windows client (which expects an NTFS ACL) accesses a Linux file, or a Linux client (which expects Linux file permissions) accesses a Windows file. In these cases, the file server must sometimes authorize the request using a user identity that has been mapped from one system to the other, or, in some cases, even a set of permissions that has been synthesized for one system based on the actual permissions for the file in the other system. This setup creates its own set of security concerns that also need to be managed. There is some good news in that NFS version 4 standardizes the use and interpretation of ACLs across POSIX and Windows environments. This standardization will make centralized management of ACLs between these two systems easier. NFS v4 also supports named attributes; user and group information is stored in the form of strings, not as numeric values, and ACLs, user names, group names, and named attributes are stored with UTF-8 encoding.
For more information on NFSv4 refer to the NFSv4 home page at http://www.nfsv4.org/.
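On the Linux side, POSIX ACLs (managed with getfacl and setfacl) provide finer-grained permissions than the basic owner/group/other model and are typically what Samba maps Windows ACL requests onto. The following is a minimal sketch; the file system must be mounted with ACL support, and the path, user, and group names are hypothetical:

# grant an individual user read/write access to a shared directory tree
setfacl -R -m u:jsmith:rwX /srv/share/reports

# give the finance group read/write on files created under the directory in the future
setfacl -d -m g:finance:rwX /srv/share/reports

# review the resulting ACL
getfacl /srv/share/reports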


Integration with Existing Services

Transparency is an important concept of enterprise architecture that aids security efforts by keeping authorized users relatively unaware of the inner workings of the infrastructure while making systems and resources easier to locate and use. In few places is transparency more vital to productivity than in the file-serving arena. The viability of a file-serving solution within an enterprise environment is, to a great extent, dependent upon that solution's capability to remain transparent to the end user. File services should appear to users to be a conventional, centralized file system. The number and location of servers and storage devices should be made invisible.

Backup and Recovery


The previous chapters spent time discussing the use of file servers as a centralized repository to simplify the process of backing up and recovering systems. When designing and implementing file-serving solutions, it is important that due care be given to ensure that these solutions are themselves backed up. The following list highlights a few simple rules to remember when designing a backup strategy; a brief scripted sketch follows the list:
• A backup should be easy to do.
• A backup should be automated and rely on as little human interaction as possible.
• Backups should be made regularly.
• There should be at least two copies of the data, stored on resilient media and kept at different locations.
• A backup should rely on standard, well-established formats.
• A backup should not use compression. Uncompressed data is easier to recover if the backup media is damaged or corrupted.
• A backup should be able to run without interrupting normal work.
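The sketch below illustrates the first three rules with nothing more than rsync and cron; the host names and paths are hypothetical, and a real deployment would add logging, alerting, and a second off-site copy:

# /etc/cron.d/file-server-backup
# copy the shared data to a backup host every night at 01:30,
# preserving permissions and ownership and removing deleted files
30 1 * * *  root  rsync -a --delete /srv/data/ backuphost:/backups/fileserver01/data/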

A backup is simply one process in an overarching area of responsibility for an organization. Whether the focus is on business continuity planning (BCP), disaster recovery, or compliance, backup and recovery planning is crucial to an organization's ability to recover. To effectively plan and implement a backup and recovery plan in support of business continuity or disaster recovery, several processes need to be examined for inclusion:
• Risk analysis: Risk is a subjective term, and although storage administrators tend to treat all data as if it were mission critical, it is important to understand how much risk a line of business is willing to accept in terms of data loss.
• Scheduling: It is a generally accepted rule that backups should not affect production. Work with your business partners to understand their business requirements and minimize the impact of backup operations on their environment.


• Review of logs: Backups generate logs that need to be reviewed regularly to ensure that the backups are performing properly.
• Testing: Backups are often taken for granted, until, of course, you need to recover a system and discover that a backup malfunctioned during a real disaster recovery operation. Test backups often to ensure their quality and viability for recovery.
• Retention: All data is not created equal. Some can be discarded virtually immediately; other data must be archived and maintained for years to meet legal, regulatory, or compliance obligations. The amount of time data must be maintained is an important metric in determining the proper archiving solution.

Disaster Planning Essentials

An area that is all too often overlooked or given too little emphasis is disaster recovery planning. Systems that comprise an enterprise file-serving environment should be protected, ideally in a secure data center with sufficient resources to operate autonomously should a natural disaster or other emergency interrupt key services and utilities such as telecommunications and power. For disaster planning to be effective, it must be put into the proper context. Disaster planning is often presented in a manner that is, for all intents and purposes, inaccurate. Enterprise managers often approach disaster recovery planning with a perspective that doesn't clearly define the benefits. Managers also often fail to communicate the need for disaster recovery planning effectively to the lines of business that will depend upon it. Because the planning doesn't relate directly back to an immediate or pressing need, it is set aside. When presented in context, the need for disaster recovery planning is clear. Although you hope to never need to use this parachute, there is peace of mind in knowing it is there to protect you. The upfront cost and effort required for disaster recovery planning may be uncomfortable, but when you clearly illustrate the risk and real potential for disaster, the response from the business should be one of appreciation rather than one of remorse. Once you understand why disaster recovery planning is essential and how best to communicate it, you can move forward to examine the stages of disaster recovery planning and how to begin.

Development

The first stage of disaster recovery planning requires the development of a plan that documents the procedures for responding to an emergency, providing extended backup operations during the interruption, and managing recovery and reclamation of data and processes afterward (should an organization experience a loss of data access or processing capability). A disaster recovery plan is an enterprise document that should outline the roles and processes of senior management as well as IT management and other critical personnel in key areas such as security, facilities engineering, and finance.


Disaster Planning Roles

From the vantage point of the IT manager, there are several key roles to be fulfilled that should be documented in the disaster recovery plan:
• Identifying and prioritizing mission-critical applications
• Recovering and reconstructing all critical data, systems, and supporting infrastructure
• Continuously reassessing the recovery site's stability

Identifying and prioritizing mission-critical applications is only one of several steps that require a close and in-depth understanding of business needs. Throughout, this chapter will refer to the importance of maintaining open lines of communication with storage users to better understand their business and, subsequently, their current and future storage requirements.

Identify and Prioritize

Identifying and prioritizing mission-critical applications is a step that should be taken as part of disaster recovery planning and then periodically reevaluated to align with changing business needs. It is important that clear lines of communication be established early on between the various lines of business supported by the enterprise infrastructure and IT management so that IT managers can make informed decisions to protect their line-of-business partners. Reevaluating this step periodically is important, not only to keep those communication lines open but also to provide ongoing visibility for the disaster recovery plan.
Recovery and Reconstruction

The speed, efficiency, and effectiveness of the recovery and reconstruction efforts should be the focus of a sound disaster recovery plan. Many times, organizations spend a great deal of effort and planning to ensure that the backup of systems themselves does not impact production and pay little attention to the recovery time associated with the process.
Reassess

When operating from a recovery site during a disaster, an organization is in its most vulnerable state. Few organizations outside of government and the financial services industry maintain multiple recovery sites that can be utilized during a disaster. IT managers must act vigilantly to monitor and protect the stability of the recovery site. As part of an organization's disaster recovery planning efforts, a recovery team may be defined with the mandate to implement the recovery procedures once a disaster is declared. The recovery team's primary duty is to get critical business functions operating at the alternative or backup site.


Traditional Backup Methodologies Backup methodologies vary with the size, scope, and criticality of the data they are intended to protect. In the most basic traditional model, critical data from a key system is copied to storage media such as a tape, or file server, separate from the client to provide protection in the event of client failure. This process depends heavily on the system resources of the client to perform either the tape backup or copy operation, which naturally affects the performance of the client (by way of CPU utilization, network utilization, and so on). Tape storage can become quite costly when adopted as a standard methodology as the use of individual tape drives sprawls throughout the enterprise. In addition to the cost of personnels time to handle the physical tapes and maintenance cost associated with the tape drive hardware and replacement of tapes as they become unviable, there is the cost of storage and shipping to consider when transporting the media off-site for safekeeping. In the past, it has been generally accepted that a backup should be performed at least once every 24 hours. This number is arbitrary and should be reconciled against the actual business continuity and disaster recovery requirements of the system you intend to recover. Snapshots Although some backup solutions simply copy data directly to a tape or another disk, some solutions use another process that utilizes snapshots. A snapshot is a relative (or delta) copy of a data set. It is differentiated from a mirror in that there are links between the original (or source) and the copy (or mirror). In the snapshot process, the backup software makes a copy of the pointers to the data, which indicate its location, then relies on data movers to pick up the pointers and transfer the data. Snapshot volumes are point-in-time copies of primary storage volumes. By creating snapshot volumes, the primary volumes continue to be available for production operations, while the snapshot volumes are used for offline operations such as backup, reporting, and testing. This setup results in improved backup operations, data reporting, application testing, and many other day-to-day operations. Server-Free Backups The biggest advantage to server-free backup is the reduction of workload on the target server. A server-free backup, as the name implies, frees target servers CPU, memory, and I/O consumption during the backup process by decreasing the servers involvement in the backup process. Essentially, the data being backed up will move from the target servers disk to a data mover (see Figure 6.1). In a server-free architecture, the data mover is another server that is dedicated to providing the actual transportation of the data. A data mover can also be a device, as well see in the next section, such as a Small Computer System Interface (SCSI) drive or router that reads the data from a network drive and writes it to the backup device. The data mover also manages the flow of data between the network drive and backup device to ensure that no information is lost.


Figure 6.1: In a server-free design, a dedicated server acts as the data mover to free system resources on the target servers.

Server-Less Backups

Like server-free backups, server-less backups offer the advantages of efficiency, scalability, fault tolerance, and cost reduction, but they are defined by a complete lack of dependence on a dedicated server to fulfill the role of a data mover. In a server-less environment, either the storage device or its supporting infrastructure is used to fulfill the role of the data mover. Technologies such as the SCSI-3 Extended Copy (XCOPY) command can be used to read and write data directly between a disk array and a secondary device and can take advantage of existing modules in backup applications to coordinate the backup process. This method of backup reduces total cost of ownership and operational costs by eliminating any need for additional servers, and it increases backup performance by removing the intermediary server from the backup process (see Figure 6.2).


Figure 6.2: A server-less architecture relies upon the storage solution, or the components of its supporting infrastructure, to transfer the data.

Some of the major advantages of using server-less backup include:
• Increased server efficiency
• Increased scalability
• Better fault tolerance
• Lower overall hardware costs

Utilizing a server-less architecture provides savings in the form of server elimination, as there is no need for a dedicated server to perform the role of a data mover. In addition, this setup saves on network utilization by permitting the data to be transferred once, directly from the server to the storage device, eliminating the extra hop through an intermediary server.


Archiving and Migration

There are times when data needs to be held onto for the long haul. Aside from traditional backup requirements, many organizations find that certain data may be required, for legal or compliance reasons, to be maintained for months or even years. Archiving refers to the processes supporting these needs and the storage requirements necessary to meet legal obligations. When designing a file-serving storage solution, architects need to understand the archiving requirements and how they pertain to the data being stored so that the solution can be designed to meet those requirements. A tiered-storage approach provides insight into the value of data over time by classifying data early and progressively reclassifying the data as its value changes in response to differing motivators. Take, for example, financial data whose value is immensely important at the time of and immediately following a transaction. Loss of a large-scale financial database, such as those used by credit card processors, during peak transaction hours could be catastrophic to the business. Once the transactions have cleared and been reconciled, the data remains important, and the processor may need the data surrounding the transaction immediately on hand for the next 30 days to facilitate refunds or other internal business processes. As time passes, the data becomes less critical but still important, as information about the transaction is used in tax reporting and, depending upon the industry and purpose of the transaction, regulatory compliance may require the data be kept for several years. Understanding the need and scope of the long-term storage requirement will help drive storage decisions and the underlying financial motivations. As an understanding of the business need for the data matures, this understanding drives transformation. Enterprise storage architects, armed with an understanding of business needs, are compelled to consider storage architectures that meet those needs, especially in the areas of performance, scalability, and resiliency, which leads many to consider migration to a consolidated storage architecture. Unfortunately, storage consolidation isn't as simple as buying a large, enterprise-class storage array and migrating applications to the new platform. Migration takes planning time if it's going to be successful. Essentially, there are four key areas that need to be assessed when planning for storage migration:
• Assess the current environment: During this phase, you'll need to identify your current storage capacity and gather metrics surrounding its utilization. You'll also need to reestablish your understanding of the business being supported by the storage environment and strive to understand the future requirements the business may soon face.
• Understand the current costs: This phase is important in gathering cost justification for a storage consolidation effort. Work to gather the current storage hardware and software costs and understand how those costs impact the bottom line. Remember to focus attention on the cost of supporting the environment in time, personnel, support contracts, and licensing agreements.


• Assess storage management capabilities: It is important to have a clear picture of the current management so that you can make informed decisions in the same context. In addition to administration, one must also consider the ability to monitor the environment for performance, availability, and security.
• Understand the future business and legal requirements: Although you might not have a crystal ball readily available to predict the future requirements that will be placed on your storage infrastructure, you can leverage the experience of your business colleagues, who can often provide a wealth of information about what the future may hold. New compliance issues and regulations that your partners may be working to meet may have a significant impact on storage.

Once these steps have been completed, a clearer picture will have developed that will serve to guide you through the migration process. To aid in understanding a few of the architectures available, let's first examine what has worked in the past and what is now being adopted for use in enterprise backup architecture.

Successful Backup Architectures

There are many approaches to backup and recovery to consider when developing an enterprise file-serving solution. Each will be dependent on the size, scope, business continuity, and disaster recovery needs of the file-serving solution you intend to develop. The three architectures most commonly found within an enterprise environment are:
• Disk-to-Tape (D2T)
• Disk-to-Disk (D2D)
• Disk-to-Disk-to-Tape (D2D2T)

D2T

D2T has been, and to a great extent still is, the most common method used to create backups. D2T as a standalone backup solution still has many inroads within the enterprise, but its use as a standalone enterprise standard for backup and recovery is diminishing. New storage solutions and architectures such as NAS and SAN, as discussed in Chapter 2, have opened avenues to consolidation. For small-scale systems or systems isolated from the storage infrastructure, D2T is viable as an independent solution, and the size-to-cost ratio of tape storage makes it one of the most affordable solutions from a media standpoint. Within an enterprise file-storage solution, D2T is more commonly found as a component of the more robust and resilient D2D2T architecture.


D2D

In D2D, a computer hard disk is backed up to another hard disk rather than to removable media. As a backup methodology, D2D enables both greater performance and higher capacity relative to tape or other removable-media alternatives, which directly translates to a shorter time to recovery. D2D can refer to a dedicated backup architecture in which one disk serves solely as the backup device, and it can also refer to contingency backup solutions in which one system is routinely backed up to a second, identical system for recovery purposes.
D2D is often confused with Virtual Tape Library (VTL) technology. A VTL is a data storage technology that employs the use of emulation that causes hard disks to behave virtually as if they were tape drives. However, D2D differs in that it enables multiple backup and recovery operations to simultaneously access the disk directly by using a true file system.

D2D has further advantages over D2T in that, in midsized to large-scale implementations, D2D can lower the total cost of ownership of the backup and recovery solution due to increased automation of the process and lower hardware costs.

D2D2T

In D2D2T, data is initially copied to backup storage on a disk storage system and then periodically copied again to a tape or other removable-media storage system. Traditionally, many businesses have backed up directly to relatively inexpensive tape systems. Many high-performance application systems, such as financial databases, however, have a production assurance or business continuity need to have their data immediately ready to be restored from secondary disk if and when the data on the primary disk becomes inaccessible. As individual storage requirements have begun to be defined in terms of business criticality, rather than in terms of storage devices, organizations have adopted the concept of storage virtualization. In a storage virtualization system, IT managers can define an organization's need for storage in terms of storage policies rather than physical devices. For example, if a financial database has a business requirement stating that no more than 15 minutes' worth of data may be lost as the result of a technology failure, D2D2T makes a great deal of storage virtualization sense. Figure 6.3 demonstrates how D2D2T can be used as a backup architecture for such a production database.


Figure 6.3: D2D2T and storage virtualization.

To meet a 15-minute goal, the supporting infrastructure requires a server dedicated to that contingency role. Every 15 minutes, data is backed up to the contingency server; then, at regular intervals, the contingency server itself is backed up to tape, which eliminates the demand for a full backup process on the production server. Other critical data, such as recent email, may also benefit from such a system. Email is considered by many to be mission-critical data in the short term, though dependency on this information fades over time. D2D backup enables email backups to be readily available for recovery in the short term and then, as the organization's policy dictates, eventually moved to tape for archival purposes. D2D2T has a further advantage: once the data has been moved to a secondary device, whether a dedicated server (as Figure 6.3 shows) or a disk dedicated to serving as backup media, that data can be examined by other applications. For example, if one of the businesses supported by the architecture requires reporting that is non-real-time and nonintrusive, the backup data on the secondary disk provides a way to access that data without impacting production systems. Such methods should never be used if the data itself is to be modified in any way, but for simple reporting, using the backup copy is a good way to reduce the load on production systems.
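As a rough illustration of the 15-minute policy described above, the following sketch shows the two stages of a D2D2T schedule: a frequent disk-to-disk copy to the contingency server and a nightly sweep of that contingency copy to tape. The commands, paths, and tape device name are hypothetical examples only; a real environment would drive this schedule from its backup product rather than a hand-rolled loop.

```python
import subprocess
import time
from datetime import datetime

# Hypothetical commands standing in for the two stages of a D2D2T policy.
D2D_JOB = ["rsync", "-a", "--delete", "/data/finance_db/", "/contingency/finance_db/"]
D2T_JOB = ["tar", "-cf", "/dev/st0", "/contingency/finance_db"]  # archive staging copy to tape

D2D_INTERVAL = 15 * 60   # copy to the contingency server every 15 minutes
TAPE_HOUR = 1            # sweep the contingency copy to tape once a day at 01:00

def run_d2d2t_scheduler():
    """Minimal scheduler: frequent disk-to-disk copies, nightly disk-to-tape sweep."""
    last_tape_day = None
    while True:
        subprocess.run(D2D_JOB, check=True)       # stage 1: D2D, bounds data loss to ~15 minutes

        now = datetime.now()
        if now.hour == TAPE_HOUR and now.date() != last_tape_day:
            subprocess.run(D2T_JOB, check=True)   # stage 2: D2T from the contingency copy
            last_tape_day = now.date()

        time.sleep(D2D_INTERVAL)

if __name__ == "__main__":
    run_d2d2t_scheduler()
```

The key design point the sketch reflects is that the production server is only ever touched by the lightweight 15-minute copy; the slower tape operation reads from the contingency copy, keeping the production system out of the tape window entirely.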


Benefits of Share-Data Approaches
Share-data approaches to storage architecture reduce the risk of data loss, increase productivity and collaboration, and reduce backup and recovery effort and expense, which is why they have become so popular. Over the years, as the ability to share data or centralize data storage has evolved, enterprise architects have battled to maintain availability, scalability, and manageability across their enterprises. However, the need for expansion and growth, compounded by a sprawling file-serving architecture, has crippled many storage solution initiatives. The benefits of share-data approaches are:
• Reduction of complexity
• Increased performance
• Increased scalability
• High availability
• Consolidation of storage
• Simplification of management

In many data centers, storage is an afterthought, considered a byproduct of an application installation. Because of this mindset, storage is often dedicated to a server or collection of servers specific to the application systems they support. In a share-data approach, all servers can see all the data and storage is consolidated; this architecture aggregates I/O performance and enables enterprise storage architects to greatly simplify storage management. By taking a share-data approach to storage management, the enterprise can consolidate existing storage and scale the environment as a whole without directly impacting any one application. This scalability makes a share-data approach highly flexible, and because the data is shared throughout the storage infrastructure, the design is inherently fault tolerant, allowing failover with virtually no application disruption. Share-data approaches centralize, simplify, and holistically contain the storage infrastructure in a manageable solution.


Comparison: Consolidated vs. Distributed Backup Architectures
Consolidation and simplification will be the focus of countless IT projects this year as organizations strive to reduce their total cost of ownership and leverage new, higher-performing platforms that deliver increased storage density. Depending upon the project and the storage application involved, consolidation can make a great deal of financial sense. The results of such a project are felt directly on the bottom line through reduced hardware and software licensing costs, ongoing maintenance fees, and power and network requirements. When comparing consolidated and distributed backup architectures, several key points of contrast come to light. The first and broadest in scope is that there are many different hardware and software architectures and options to choose from, each supporting different protocols and bringing its own architecture to the overall storage strategy. The following list highlights the key points to examine:
• Availability
• Scalability
• Interoperability
• Data protection

Distributed Approach
In a distributed environment, file servers and supporting infrastructure abound, and although traditional methodologies for providing high data availability have driven the enterprise storage infrastructure in this direction, such environments are more costly to maintain. High availability has for years been the siren song of the distributed approach. In a distributed environment, there are more copies of the data on hand to meet the availability requirements of the business, but high availability is not exclusive to distributed environments. Today, consolidated data storage can meet the demands of high availability as readily as many distributed approaches, and in an architecture that lends itself more easily to centralized management and scalability. Scaling a distributed architecture to meet the growing needs of the enterprise brings significant costs: new software, servers, drive arrays, and supporting infrastructure must be brought online to meet growing needs, and the cost quickly adds up. Over time, as new vendors present new storage options, the sprawl of devices within your storage environment becomes increasingly difficult to manage, and storage administrators struggle to maintain interoperability of storage products within the enterprise as a byproduct of years of distributing their environments. Planning and providing for common data protection is also more difficult in the distributed approach. Consider, for example, the process of moving large backup jobs in a distributed environment. With a dozen servers each backed up independently to separate, distributed storage devices, the operational and disaster recovery of these systems can become quite complex. If experience has taught the enterprise storage community anything, it's that recovery needs to be as simple and swift a process as is technically and humanly possible.


Consolidated Approach
Server and storage simplification and consolidation are yielding great benefits for enterprise IT infrastructure. Technologies are continually being developed to increase the return on investment for costly mission-critical equipment. Servers, infrastructure, and storage devices aren't just costly to purchase, they're also expensive to maintain and support; thus, leveraging these new advances can produce real financial savings. Maintaining high availability in a consolidated environment is no longer the exercise in futility it once may have seemed. Many still criticize the consolidated approach to storage architecture as putting too many eggs in one basket, but the fact of the matter is that today's consolidated storage networks can be just as highly available and resilient as their distributed counterparts. Redundancy methods and technologies, such as redundant host bus adapters (HBAs) to protect against cable failure, multi-pathing software, and resilient connectivity paths, have been developed to facilitate more highly available, robust, and resilient storage solution architectures. A consolidated architecture also excels in the scalability arena. Existing consolidated solutions can be scaled up much more easily than a distributed approach, which often requires not only the expense of additional equipment but also of the supporting architecture. Providing for common data protection is likewise significantly simplified in a consolidated storage architecture. In its simplest form, a consolidated storage solution can offer a virtual one-to-one backup architecture. To illustrate the savings, consider an environment of 60 geographically dispersed application servers, each with its own storage requirements and supporting storage devices. To maintain common data protection throughout this environment, the backup and recovery effort would need to be centered on those storage devices and their supporting infrastructure; whether this results in individual tape drives for the servers or some other removable media, the cost to maintain the backups can become burdensome. Next, consider the same 60 servers, but instead of individual tape drives, they are all linked to a common SAN. The backup effort has now been reduced by a factor of 60 because the data has been consolidated to a single, albeit virtual, location. Figure 6.4 illustrates a potential layout for a dedicated SAN contingency environment.


Figure 6.4: An example dedicated SAN contingency environment.

Data Recovery
The only reason to make a backup copy of any data is to be able to restore that data after it has been lost or damaged. Data recovery is the process of recovering data from primary storage media when it cannot be accessed normally, either because of physical damage to the storage device or logical damage to the file system that prevents it from being mounted by the host OS. Physical damage can occur for a multitude of reasons, ranging from malfunction of the device's inner workings, such as a drive's magnetic read/write head making contact with one of the platters, to disasters that impact the equipment directly, as in the case of a fire or flood. Most organizations lack the facilities, tools, and experience to recover physically damaged media in-house and must rely on external data recovery centers that specialize in such work. This is a costly undertaking: not only is the data unavailable, burdening the lines of business that depend upon it, but the recovery process itself can be extremely expensive. The impacts and implications of physical damage are strong motivators to design and implement recovery solutions that reduce or completely eliminate dependency upon any one physical device.


Logical damage is by far the most common data recovery focus; fortunately, despite the ease with which logical damage can be inflicted, it is to a great extent offset by the relative ease (in comparison with physical damage) with which the damage can be repaired. Logical damage is often the byproduct of a sudden loss of power to a file storage device that prevents the file system structures from being completely written to the storage medium; however, problems with hardware, drivers, supporting infrastructure, and system crashes can have the same effect. The result is a file system left in an inconsistent state. This situation can cause a variety of problems, such as strange behavior (for example, infinite directory recursion or drives reporting negative amounts of free space), system crashes, or an actual loss of data. Various programs exist to correct these inconsistencies, and most OSs come with at least a rudimentary repair tool for their native file systems. Linux, for instance, comes with the fsck utility, and Microsoft Windows provides chkdsk. Third-party utilities are also available, and some can produce superior results by recovering data even when the disk cannot be recognized by the OS's repair utility.
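As a simple illustration of the native repair tools just mentioned, the sketch below shows how an administrator might script a read-only consistency check. The device path and drive letter are hypothetical examples; in practice, fsck should never be run against a mounted file system, and repair mode should only be used after the cause of the inconsistency is understood.

```python
import platform
import subprocess

def check_file_system(target: str) -> int:
    """Run the platform's native file-system checker in report-only mode.

    `target` is a hypothetical example: a block device such as /dev/sdb1 on Linux
    or a drive letter such as D: on Windows.
    """
    if platform.system() == "Windows":
        # chkdsk without /f only reports problems; adding /f would attempt repairs.
        cmd = ["chkdsk", target]
    else:
        # fsck -n answers "no" to all prompts, so it reports without modifying the disk.
        cmd = ["fsck", "-n", target]

    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout)
    return result.returncode  # a non-zero code generally indicates errors were found

if __name__ == "__main__":
    check_file_system("/dev/sdb1")  # example only; unmount the volume before checking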

The Advantages of Freedom


This chapter has discussed many pitfalls to data storage integrity and availability. There is an overwhelming abundance of storage solution vendors, each with its own architecture and agenda to fulfill. Although they may be working to align their interests with the needs of enterprise consumers, to date there is no single, universally accepted solution that meets all the needs of all consumers. The question then becomes which solutions most closely align with the needs of your enterprise and how you can leverage those solutions to meet those needs. Central to this concern is the ability of today's solutions to thrive in future environments. Freedom from proprietary solutions and vendor-defined standards is critical to maintaining the flexibility and scalability of a storage architecture.

Benefits of Avoiding Proprietary Solutions
The drive for innovation is intoxicating. Vendors continually work to develop solutions and market them on a foundation of their own technology to further their stake in the enterprise market; foundation being the operative word. When building an enterprise storage architecture, enterprise architects should approach the task with the wisdom of the man who builds his house upon rock rather than sand. Building an enterprise foundation that includes proprietary solutions is akin to building a house upon sand, which shifts over time and provides little stability against the elements; in storage architecture, those elements are the battering of change and the way that change impacts the solution's ability to remain scalable, highly available, and resilient. To avoid these pitfalls, build a storage solution that embraces standardization and openness.


Proprietary solutions come with a price in that they can be difficult to manage, and these difficulties often increase in direct correlation to the size of the implementation. Although a small organization can often withstand the year-to-year changes in proprietary solutions and standards, larger organizations have a much harder time weathering the storms. Proprietary solutions are, by definition, difficult, if not impossible, to integrate with other solutions. So although a proprietary solution touted as highly scalable today may seem to meet the immediate needs of the business, a wise storage architect will weigh other factors, not the least of which is the solution's ability to integrate with other key architectures.

Uncapped Scalability and Performance
If an enterprise file storage system is to be measured by anything, it is scalability and performance. For a solution to be viable, it must perform well enough to meet the needs of the business; to stay viable, it must scale to meet the growing needs of the organization. In a distributed environment, scalability has historically been hindered by the dedication of servers to specific storage roles. A consolidated share-data approach is free of those constraints: new storage capacity can be brought online without directly impacting any of the applications the environment supports. New servers and storage can be added as needed with no service disruption, enabling the storage environment to grow without directly impacting the business it is designed to support. Some third-party solution vendors, such as PolyServe, Symantec (which acquired VERITAS), and IBM, offer solutions that enable uncapped scalability and performance. As your enterprise storage requirements grow, the architecture of such solutions facilitates the growth of the storage environment by providing flexibility in architecture and freedom of choice.

Architecture Flexibility
Some third-party solution architectures enable storage growth and the ability to scale on demand, which means that your storage architecture can remain agile and flexible enough to meet the growing needs of your business. By providing a centralized, consolidated, yet highly available and flexible storage architecture, these solutions enable the enterprise to respond to changing business requirements, meeting demands for capacity and performance that might once have been cost prohibitive.


Freedom of Choice
The storage solutions offered by third parties such as PolyServe are not constrained by hardware platform or OS to the degree other solutions may be and enable multiple low-cost Linux- or Windows-based servers to function as a single, easy-to-use, highly available system. For example, PolyServe's Matrix Server includes a fully symmetric cluster file system that enables scalable data sharing, high-availability services that increase system uptime and utilization, and cluster and storage management capabilities for managing servers and storage as one solution, independent of hardware platform or supported application system. All of this equates to freedom of choice, and because storage administrators are not dependent on any one particular hardware or software platform, they are free to reuse existing infrastructure or to align their storage infrastructure more closely with enterprise IT hardware and software roadmaps, easing administration, management, and compliance efforts.

Summary
Throughout this chapter, you have seen how storage solutions are often the critical pivot point for management and business concerns. The ability of an enterprise storage environment to remain agile, reduce complexity, integrate seamlessly, and reduce risk has a direct financial impact on business operations. Disaster recovery planning has been underscored as a central theme of storage management and one that deserves due care throughout the life of your enterprise environment. Remaining flexible and not allowing your environment to become constrained by proprietary solutions will provide the freedom you require to stay agile and manage the environment as a whole. As you have seen throughout the course of this guide, there are many disk, server, performance, and availability choices at your disposal, each with its own benefits and limitations. As you continue to work to bring harmony to your enterprise storage environment, battle current file-serving storage growth problems, provide for data path optimization, and set out to build high-performance, scalable, and resilient Windows and Linux file-serving solutions, remember to keep an eye on the big picture. Storage solutions are about more than just providing storage; they're about enabling the business to succeed by providing quick and easy access to data whenever and wherever it is required, in a manner that is cost effective to implement, manage, and maintain.

Download Additional eBooks from Realtime Nexus!


Realtime Nexus, The Digital Library, provides world-class expert resources that IT professionals depend on to learn about the newest technologies. If you found this eBook to be informative, we encourage you to download more of our industry-leading technology eBooks and video guides at Realtime Nexus. Please visit http://nexus.realtimepublishers.com.

