HACMP/XD for GLVM
March 2006
WHITE PAPER
Contents

Overview
Understanding Performance: Three Areas
Running the Performance Optimization Cycle
Tuning for Optimizing Performance
Case Study: Massive Two Shoes
Summary
References
hafeedbk@us.ibm.com
ABSTRACT

The newest addition to the IBM HACMP/XD family, HACMP/XD for Geographic Logical Volume Manager (GLVM), is easier to configure and manage than the existing IP-based disaster recovery solutions. To ensure that the mirroring of critical data across two remote sites meets performance requirements at a reasonable cost, system administrators must plan for that performance. This paper discusses the characteristics of a data-intensive application that requires data mirroring and explains what must be done to plan, provision, and tune for optimal performance in a cluster spanning two remote sites.
Overview
The mirroring of data between two remote sites has become significantly easier with the arrival of the new HACMP/XD for Geographic Logical Volume Manager (GLVM) solution. As an IP-based solution, GLVM sites can span an almost unlimited distance. In GLVM, the AIX 5L LVM (Logical Volume Manager) itself is responsible for mirroring the data; thus several complex and time-consuming manual tasks have been eliminated. The configuration and management of GLVM is therefore simpler, and subject to fewer geographical limitations, than other remote mirroring solutions. But, as with any remote mirroring solution, it is still important to design and tune for optimal performance.

The enemy of remote data mirroring is delay: you want to reduce the time it takes for changes to your data to be written to the disks at the remote site. Optimal performance for a mirroring solution is one where both the delay and the cost of the solution are minimized. If you overestimate the amount of bandwidth needed for the network carrying the disk I/O operations, you could end up with a needlessly expensive mirroring solution. Conversely, an under-performing mirroring solution can create a performance bottleneck for your application, lowering productivity, putting your Service Level Agreements at risk, or otherwise reducing your potential revenue.

This paper helps you understand how HACMP/XD for GLVM works and where to anticipate performance bottlenecks. You will then be able to accurately estimate the network bandwidth required and target areas for tuning, minimizing both delay and cost. A typical HACMP/XD for GLVM implementation is displayed below:
Figure 1: A typical HACMP/XD for GLVM implementation. The node at Waltham and the node at Burlington are connected by an XD_data network over a TCP/IP WAN; each node has an XD_data service IP label and XD_data standby IP labels, and each site's disk array (hdisk1 at Waltham, hdisk2 at Burlington) holds a mirror copy of the data.
In this typical mirroring configuration we have a two-node cluster with each node at a different site. Waltham is the primary node where the application runs, and Burlington is the node at the secondary site, which uses GLVM to mirror the data used by the application. In developing this white paper we tested the above scenario within one geographic facility; the exact configuration is listed in Appendix A. The tool diskio2 was used to generate disk I/O for testing. It provides many options
to read or write data of any size or frequency, and to time the results of the I/O operations. Although this white paper refers to HACMP/XD for GLVM V5.3, GLVM is also available for HACMP/XD V5.2, and all the information provided here applies to V5.2 as well.

Note: Throughout this paper (except where otherwise noted) the following abbreviations are used:

Kb = kilobits (1,024 bits)
Mb = megabits (1,048,576 bits)
KB = kilobytes (1,024 bytes)
MB = megabytes (1,048,576 bytes)
Measuring Application I/O Performance

Empirically study the disk I/O performance for typical and peak periods of time.
The formal requirement specifications for your application may directly or indirectly indicate the amount of I/O activity that your application must be able to support. An empirical study of the disk I/O is not only good for estimating the bandwidth needed for carrying the disk I/O, but also for creating a baseline for evaluating disk performance in the future, should disk performance become an issue in your mirroring solution. To ensure the bandwidth is adequate for the peak periods, choose a sample period that covers a representative period of peak activity. You should also evaluate the steady-state or typical performance; this is useful if, for example, your application has short-lived and/or only occasional spikes. If you have an application already running, you can run the iostat command for the disks being examined and look at the value of Kb_wrtn (note: iostat uses Kb to mean kilobytes). Figure 2 gives an example of running iostat on a disk used at the remote node for mirroring the application data, giving a summary of activity every 60 seconds. The value of Kbps can be used to indicate disk throughput; this will later be used to calculate the network bandwidth required for the mirroring solution.
    waltham$ iostat -d hdisk1 60

    System configuration: lcpu=2 drives=17 paths=1 vdisks=0

    Disks:     % tm_act     Kbps      tps     Kb_read    Kb_wrtn
    hdisk1       98.0      2134.2    533.6        0       128320
    hdisk1       98.0      2133.7    533.3        0       128020
    hdisk1       98.0      2126.1    531.6        0       127564
    hdisk1       98.3      2115.1    528.6        0       126904
    hdisk1       97.8      2145.9    536.5        0       128752
    hdisk1       98.3      2213.5    553.5        0       132808
    hdisk1       98.0      2216.7    554.1        0       132992
    hdisk1       81.2      1840.2    460.0        0       110412
    hdisk1        0.0         0.0      0.0        0            0

Figure 2: Running iostat on the disk used for mirroring at the remote node
Alternatively, if you are using a file system (/gmfs1 in this example) on top of logical volumes, you can use the filemon command. Figure 3 shows an example of filemon output illustrating the disk activity (note: filemon also uses Kb to mean kilobytes).
    waltham$ filemon -o /tmp/lv_filemon_$(date +%Y%m%d%T) -O lv; sleep 60; trcstop
    Enter the "trcstop" command to complete filemon processing
    [filemon: Reporting started]
    [filemon: Reporting completed]
    [filemon: 60.964 secs in measured interval]
    waltham$ head -15 /tmp/lv_filemon_2005090615:35:31
    Tue Sep 6 15:49:42 2005
    System: AIX wolverine Node: 5 Machine: 00088C9A4C00
    Cpu utilization: 44.2%
    Most Active Logical Volumes
    ------------------------------------------------------------------
      util   #rblk   #wblk    Kb/s   volume          description
    ------------------------------------------------------------------
      0.99       0  842496  6918.4   /dev/lvGMVG1    /gmfs1
      0.00       0     936     7.7   /dev/hd3        /tmp
      0.00      40     176     1.8   /dev/hd2        /usr
      0.00       0     128     1.1   /dev/hd4        /
      0.00       0      64     0.5   /dev/hd8        jfslog

Figure 3: filemon report of the most active logical volumes
In Figure 3, the number of write blocks (#wblk) can be used to determine disk throughput. Since each block is 512 bytes (half a kilobyte), divide the #wblk amount by two to convert to kilobytes, and then divide by the measurement interval (in seconds) to get the disk throughput.
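As a quick check, here is that arithmetic applied to the /dev/lvGMVG1 line in Figure 3; the small difference from the 6918.4 Kb/s that filemon itself reports comes from filemon's own internal timing:

    # 842496 write blocks of 512 bytes over the 60.964-second interval
    echo "842496 60.964" | awk '{ kb = $1 / 2; printf "%.1f KB/sec\n", kb / $2 }'
    # prints: 6909.8 KB/sec, or roughly 6.7 MB/sec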
The increase in write throughput shows that performance is better with large blocks of data. Similarly, with a JFS2 file system, the performance of I/O operations improves as the size of the block being written grows, although sizes greater than 100KB perform equivalently to raw volume I/O:
    Number of Writes   Size per Write (bytes)   Average Throughput (MB/sec)
         1,000,000                  10,000                47.04641
           100,000                 100,000                91.798174
           100,000                 102,400               105.295252
            50,000                 200,000                 1.746223
            10,000               1,000,000                 1.340575
            10,000               1,024,000                 1.767009

Table 2: I/O Throughput on a GLVM Volume with a JFS2 File System
Both tables demonstrate that fewer, larger writes perform better than smaller, more frequent writes.
    Write Size   File System   Average Throughput (MB/sec)
                 No (raw)           1.4941
                 JFS2               6.8938
    1 MB         No (raw)           1.455486
    1 MB         JFS2               1.340575

Table 3: File System Type vs. Average Throughput at the Remote Site
When the size of the blocks of data was greater than the 100KB boundary, the performance and throughput of a raw logical volume versus a JFS2 file system was nearly identical. If you have a choice between a file system and a raw logical volume, and the sizes of the writes are small (less than 100KB), consider using a file system like JFS2.
Network Performance
There are three factors that can delay the transmission of data changes: inadequate bandwidth, excessive latency, and network saturation.
Network Bandwidth
If there is one facet of system performance that stands out above the rest, it is the size of the network pipe, or network bandwidth, between the sites. The network bandwidth must be sufficient to carry the I/O traffic for mirroring. Since HACMP/XD for GLVM does not support multilinking (using multiple data networks), we recommend that you accurately estimate the bandwidth required between the sites before procuring the network. If you fail to make a realistic estimate during the planning stage of your geographically dispersed cluster, it may be costly to upgrade or to install a replacement network with greater capacity.
In our testing, the peak network throughput was comparable to 75% of the theoretical maximum (the recommended ceiling for avoiding network saturation), but the average throughput was less (see Table 4). It is worth noting that the average performance is governed more by the application writing data than by network conditions.
    Network Speed   Theoretical Maximum    75% of Maximum   Average Throughput   Peak Throughput
    (Mb/sec)        Throughput (MB/sec)    (MB/sec)         (MB/sec)             (MB/sec)
          10               1.220             0.91552             0.7742               0.7888
         100              12.207             9.1552              6.8938               9.6872
        1000             122.07             91.552              67.454               89.472

Table 4: The Network Throughput under Different Network Speeds
The corresponding time to write data to disk was also affected: writes slowed down when a 10 Mbps data network was used to carry too much data, as Table 5 illustrates:
    Number of Operations   Size (bytes)   Time (MB/Sec)
             1,000,000           10,000       57.85429
               100,000          100,000       52.83522
               100,000          102,400      108.7942

Table 5: The Effect of Over-saturating a 10 Mbps Network
It is worth noting several things observed during testing. The choice of a file system doesn't matter; even a raw logical volume exhibits this poor performance for write sizes greater than 100KB. It is also possible to write the same volume of information using more, smaller blocks of data; however, this can lead to unpredictable or disastrous performance if there are unexpected peaks in the amount of data being written.
To summarize, in order to have a robust geographic mirroring solution for your applications, plan and provision the correct network bandwidth, taking into account peaks and growth in network activity.
    burlington$ entstat ent0
    -------------------------------------------------------------
    ETHERNET STATISTICS (ent0):
    Device Type: 10/100/1000 Base-TX PCI-X Adapter (14106902)
    Hardware Address: 00:11:25:08:18:43
    Elapsed Time: 0 days 0 hours 0 minutes 2 seconds

    Transmit Statistics:                 Receive Statistics:
    --------------------                 -------------------
    Packets: 184                         Packets: 2804
    Bytes: 12686                         Bytes: 4215556
    Interrupts: 0                        Interrupts: 411
    Transmit Errors: 0                   Receive Errors: 0
    Packets Dropped: 0                   Packets Dropped: 0
                                         Bad Packets: 0
    [further output omitted]
If you are trying to estimate throughput for your adapters, it is helpful to first run entstat -r on the adapter used on the data network; this resets the counters. For example, to measure network throughput, sample the amount of traffic over a period of 60 seconds on the adapter on the data network:

    entstat -r ent0; sleep 60; entstat ent0

The number of bytes under the Receive Statistics column is the amount of traffic received per minute; divide this number by 60 to get a per-second rate. This is your network throughput, which you can compare against what you should be getting for your network.
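If you sample regularly, a small script can do the arithmetic for you. This is a minimal sketch that assumes the side-by-side entstat layout shown above, where the receive figure is the last field on the Bytes: line; adjust the parsing if your adapter driver prints a different layout:

    #!/bin/ksh
    # Sample receive throughput on the data network adapter for 60 seconds.
    ADAPTER=${1:-ent0}

    entstat -r $ADAPTER > /dev/null      # reset the statistics counters
    sleep 60
    # In the side-by-side layout, the last field on the "Bytes:" line
    # belongs to the Receive Statistics column.
    entstat $ADAPTER | awk '/^Bytes:/ { rx = $NF }
        END { printf "received %.1f KB/sec\n", rx / 60 / 1024 }'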
Network Latency
The network latency, or delay, can be a surprise factor when estimating the efficiency of your cluster's mirroring performance. In general, each network technology imposes a delay on each packet sent from one network device to another. Each network hop, such as a router
or gateway on the way from the source to the destination, introduces its own delay. The total network latency is the accumulation of the delays from each network device. Therefore, to ensure the best mirroring performance for your application, you should provision a network with guaranteed bounds on its latency.
    ----burlington PING Statistics----
    7 packets transmitted, 7 packets received, 0% packet loss
    round-trip min/avg/max = 80/280/397 ms
    $

Figure 6: ping statistics for the path between the two sites
The two things to note about the results of the ping command in Figure 6 are that the average round-trip time is 280ms (0.28 seconds) and that no packets were lost. This shows that although the network is reliable (no packets lost), there was about a quarter of a second of delay between sending a request and receiving a response. Between systems on the same local network you would see round-trip times near 0ms; between remote systems you will tend to see larger numbers. Clearly you want as little delay as possible, to ensure the data changes are mirrored at the remote site as close to real time as possible.

traceroute is a network debugging tool that indicates where on the network delays may occur. Since network traffic is routed through routers and other network devices between the source and destination nodes, each step or hop along the way can introduce delay. traceroute lets you see the delay at each hop on the way to the destination, as shown in Figure 7.
    waltham$ traceroute burlington
    trying to get source for burlington
    source should be 10.10.10.8
    traceroute to burlington (192.168.3.5) from 10.10.10.8 (10.10.10.8), 30 hops max
    outgoing MTU = 1500
     1  10.10.1.1 (10.10.1.1)      13 ms  16 ms  16 ms
     2  10.10.0.1 (10.10.0.1)       1 ms   1 ms   1 ms
     3  * * *
     4  192.168.1.1 (192.168.1.1) 100 ms  91 ms  75 ms
     5  burlington (192.168.3.5)   87 ms  79 ms  77 ms
    $

Figure 7: traceroute output showing the delay at each hop
What is notable about the results of the traceroute command in Figure 7 is that a significant delay of around 100ms appears between network hops 3 (unknown IP address) and 4 (192.168.1.1). This accounts for the majority of the delay, but because hop 3 is not responding to ICMP requests (hence the * * *), it is difficult to determine which of the two hops is introducing it. Your network administrators or ISP should be able to help you determine where the network delays occur and address them.
Network Saturation
Ethernet media has a theoretical limit to the amount of data it can carry. When the number of packets sent over the network reaches this maximum, the network is saturated. In special cases where two machines communicate on a closed network with synchronized I/O, the throughput can approach this limit. But Ethernet traffic is rarely synchronized this way, and rarely are there only two machines on the wire. As a result, network collisions occur as throughput approaches saturation, and a large number of collisions degrades the performance of the network. It is important to build a certain amount of headroom into your throughput calculations, usually 25%, to avoid network saturation.
    burlington$ entstat ent0
    -------------------------------------------------------------
    ETHERNET STATISTICS (ent0):
    Device Type: IBM 10/100 Mbps Ethernet PCI Adapter (23100020)
    Hardware Address: 00:04:ac:5e:4d:3e
    Elapsed Time: 90 days 2 hours 52 minutes 4 seconds

    Transmit Statistics:                    Receive Statistics:
    --------------------                    -------------------
    Packets: 360716570                      Packets: 361676332
    Bytes: 128358396889                     Bytes: 104279932996
    [output omitted]
    Max Collision Errors: 327
    Late Collision Errors: 0
    [output omitted]
    No Resource Errors: 0
    Receive Collision Errors: 0
netstat -v produces similar output for all devices. If the collision rate (collisions divided by transmitted packets) is greater than 0.10 (10%), the network has become saturated and must be either reorganized or partitioned. In this example, Max Collision Errors divided by the number of transmit packets is far less than 10%, which means the network is not exhibiting the effects of saturation.
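The same kind of parsing can compute the collision rate directly. Again a minimal sketch against the layout shown above, where the transmit packet count is the first numeric field on the Packets: line:

    #!/bin/ksh
    # Report the collision rate (collisions / transmit packets) for an adapter.
    ADAPTER=${1:-ent0}

    entstat $ADAPTER | awk '
        /^Packets:/ && tx == 0   { tx = $2 }      # transmit packets column
        /Max Collision Errors:/  { coll = $NF }
        END {
            rate = (tx > 0) ? coll / tx : 0
            printf "collision rate: %.6f (%s)\n", rate,
                   (rate > 0.10) ? "saturated" : "below the 10% threshold"
        }'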
Figure: The performance optimization cycle (measure performance, then tune performance, and repeat).
In order to provide these figures, you will have to either estimate your bandwidth based on your application's requirements, or measure your disk activity using iostat or filemon as described in Measuring Application I/O Performance. For example, suppose a database application at its peak writes out 4MB per second; the amount of network bandwidth necessary to carry that traffic is:
    Bandwidth = (4 MB/sec x 8 bits per byte) / 0.75 = 42.67 Mb/sec

where the division by 0.75 builds in the 25% of headroom recommended earlier to avoid network saturation.
This speed falls within the threshold of a T3 line, which can handle 44.736 Mbps. Keep in mind that with HACMP/XD for GLVM V5.3 you can set up a mutual takeover cluster configuration, where applications run at both sites and mirroring therefore occurs in both directions at the same time. In this case the network bandwidth has to be sufficient to accommodate both streams of data writes: estimate the disk I/O for both applications and add the estimates together before calculating the overall bandwidth for the network. Once you have your estimates, you can discuss with your network provider which network solution meets your requirements for bandwidth and cost. There will usually be several choices, depending on cost and on expansion or growth capabilities.
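Because this calculation is repeated for each application and each direction of mirroring, it can be handy as a small helper. This is a sketch of the formula above (the factor of 8 converts megabytes to megabits; the 0.75 leaves the 25% saturation headroom):

    #!/bin/ksh
    # Usage: estimate_bw.sh <MB-per-sec> [<MB-per-sec> ...]
    # Pass one write rate per mirroring direction; a mutual takeover
    # cluster passes two, and the streams are summed.
    echo "$@" | awk '{
        total = 0
        for (i = 1; i <= NF; i++) total += $i
        printf "provision at least %.2f Mb/sec\n", (total * 8) / 0.75
    }'

Running estimate_bw.sh 4 prints 42.67 Mb/sec, matching the T3 example above.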
Is the network adequately carrying the disk I/O operations? Network throughput can deteriorate due to unexpected changes in service, routing changes, or saturation from a higher than anticipated traffic load. By using the network commands netstat, entstat, ping, and traceroute mentioned earlier, you should be able to locate the network bottlenecks.

Are the remote disks performing adequately? If the remote disks are slow, perhaps due to contention with other applications, then tuning the disk operations may improve performance. By using the iostat or filemon utilities on the remote node, you can determine whether there are bottlenecks.
    ifconfig en0 rfc1323 1

or

    chdev -l en0 -a rfc1323=1
If you use the chdev command, the value is set permanently in the ODM, but the change does not take effect until after a reboot. If you use the ifconfig command, the change takes effect without rebooting, but lasts only for the current session; to make it persist across reboots, add the ifconfig command shown above to the /etc/rc.net file. The network option tcp_nodelay, normally considered along with rfc1323, is explicitly set by GLVM itself, so setting that option globally or on the interface will have no effect on mirroring performance.
Depending on the I/O operations performed by your application, it may be possible to tune the sequential and random write-behind options for the file system you are using, in order to commit pending I/O to disk faster. Like other performance tuning options, this has the potential to degrade your I/O performance, so it should be considered carefully. The File System Performance Tuning topic in the Performance Management Guide, found in the pSeries and AIX Information Center, discusses the write-behind options.
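As an illustration only (the specific tunables and values below are starting points to experiment with, not recommendations from this paper), the JFS2 write-behind behavior on AIX 5L is adjusted with the ioo command:

    # Display the current JFS2 write-behind settings.
    ioo -a | egrep 'j2_nPagesPerWriteBehindCluster|j2_maxRandomWrite'

    # Sequential write-behind: pages accumulated per file cluster before
    # writes are scheduled (larger values batch more sequential I/O).
    ioo -o j2_nPagesPerWriteBehindCluster=64

    # Random write-behind: start flushing once a file has this many
    # modified pages in memory (0 disables random write-behind).
    ioo -o j2_maxRandomWrite=4

Test after each change, and revert any setting that does not improve your application's measured throughput.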
Case Study: Massive Two Shoes

Following the steps in Running the Performance Optimization Cycle, the data administrator for MTS measures the amount of disk activity during an average period of sales using iostat, and receives the following information:
    burlington$ iostat hdisk10 60

    Disks:     % tm_act     Kbps     tps    Kb_read   Kb_wrtn
    hdisk2        9.8      1985.9    2.0        0      119172
This shows that the application writes on average 1.985MB per second. Since this historically doubles at peak, the network must carry at most 3.97MB per second of disk activity. Using the formula in Determining Mirroring Performance Requirements, the data administrator calculates 21.06 Mbps on average, with peaks up to 42.13 Mbps. The data administrator takes these numbers to the network administrator, who performs the next steps. The network administrator talks with a few network providers and decides that a fractional T3 meets their needs, since it can carry peaks up to 44.736 Mbps. The network provider installs the network and helps to establish
communication between the two sites. He runs a traceroute and finds an unexpected delay in network performance, as shown in Figure 11.
    burlington$ traceroute waltham
    trying to get source for waltham
    source should be 10.70.28.1
    traceroute to waltham (10.70.28.2) from 10.70.28.1 (10.70.28.1), 30 hops max
    outgoing MTU = 1500
     1  burl-fw (10.70.28.1)     2 ms    1 ms    0 ms
     2  waltham-fw (10.60.1.1)   * * *
     3  waltham (10.60.1.1)    247 ms  149 ms  183 ms
    $

Figure 11: traceroute output showing the unexpected delay
After much wrangling, the network provider fixes the issues and the delay drops below 10ms. The data administrator configures HACMP/XD for GLVM to mirror the volumes used for application data storage, and lowers the syncd interval to 30 seconds.
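On AIX 5L, the syncd flush interval is the numeric argument the daemon is started with in /sbin/rc.boot, so the administrator's change amounts to editing that value. A sketch; verify the exact line on your own system before editing:

    # Confirm how syncd is currently started (the default interval is 60).
    grep syncd /sbin/rc.boot
    #   nohup /usr/sbin/syncd 60 > /dev/null 2>&1 &

    # Edit the line so the interval reads 30 instead of 60; cached writes
    # are then flushed to disk twice as often. The new value takes effect
    # at the next reboot.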
Measurements show that the average disk I/O meets the needs of the application (Figure 12), but the data administrator wonders whether additional performance can be squeezed out of the solution to anticipate the peak periods.

    waltham$ iostat hdisk12 60

    Disks:     % tm_act     Kbps      tps    Kb_read   Kb_wrtn
    hdisk12       9.8      2018.97    2.0        0      121138

Figure 12: iostat measurement after configuring GLVM mirroring
The first target is application performance. Since the application is a legacy database developed by a third party, there was no money budgeted for changing the application's I/O characteristics. The next step is to address network performance. Since the network provider has already minimized the network delay, the remaining step is to change the TCP tuning settings to see whether they have an impact on performance. The following parameters were set:

    tcp_sendspace = 640K
    tcp_recvspace = 640K
    rfc1323 = 1
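One way to apply these settings is through the interface-specific network options on the data network adapter; a sketch, assuming the adapter is en0 and that ISNO is enabled (use_isno=1, the AIX 5L default):

    # Apply the case study's TCP settings to the data network interface;
    # takes effect immediately, for the current session only.
    no -a | grep use_isno          # verify interface-specific options are enabled
    ifconfig en0 tcp_sendspace 655360 tcp_recvspace 655360 rfc1323 1

    # To persist across reboots, set the same attributes in the ODM:
    chdev -l en0 -a tcp_sendspace=655360 -a tcp_recvspace=655360 -a rfc1323=1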
After making these changes to the network settings, performance was tested again and found to show little difference, as seen in Figure 13. Any differences can be attributed to variation in the load from the application. The absence of any substantial difference is a result of the network having been provisioned correctly from the beginning.
    waltham$ iostat hdisk12 60

    Disks:     % tm_act     Kbps      tps    Kb_read   Kb_wrtn
    hdisk12       9.8      2007.08    2.0        0      120425

Figure 13: iostat measurement after the network tuning changes
The data administrator can rest assured that this solution is optimal.
Summary
The key to optimal mirroring performance is planning: understanding what volume of disk activity will be generated by the applications, and procuring the appropriate network bandwidth to carry that volume. In order to achieve maximum mirroring performance with HACMP/XD for GLVM V5.3, always adequately plan and provision the network architecture carrying the disk I/O traffic between the two sites. To achieve optimal mirroring performance, use these recommended approaches:

Estimate the disk activity performed by the applications at one site, or at both sites if mirroring runs in both directions.
Calculate the network bandwidth required for the mirroring.
If there is a choice between using a raw logical volume or a JFS2 file system, use JFS2 if the sizes of the writes are less than 100KB.
Generate a peak load and capture a baseline for performance.
Tune using the techniques provided, test whether they were effective, and retune until the performance goals have been met.
With these performance guidelines, you can build an optimal mirroring solution using HACMP/XD for GLVM.
References
IBM publications:

HACMP/XD for Geographic LVM: Planning and Administration Guide, August 2005, SA23-1338-02
High Availability Geographic Cluster for AIX: Planning and Administration Guide, Version 2 Release 4, September 2003, SC23-1886-04

IBM white papers:

Geomirror Performance for HAGEO and GeoRM: A Commentary, September 10, 2001
Planning Considerations for Geographically Dispersed Clusters using IBM HACMP/XD: HAGEO Technology, June 2004

IBM resource centers:

pSeries and AIX Information Center, http://publib.boulder.ibm.com/infocenter/pseries/index.jsp
Wayne Wylupski
Wayne Wylupski is a Senior Engineer at Availant, Inc. He has over 20 years in the IT field as a software engineer, network engineer, and systems designer. He has helped to develop HACMP for the past three years while at Availant.

Chris Cox
Chris Cox is a Principal Quality Assurance Engineer at Availant, Inc. She has 20 years of experience in the IT field, initially in system administration and customer support, and then as a quality assurance engineer. Chris has worked for the past seven years assuring the quality of HACMP.
Appendix A
Description of the hardware and software used in the test case:

Two nodes running AIX 5L V5.3, one designated as the local node and the other as the remote node.

One private TCP/IP network connecting the local and remote nodes, supporting 10/100/1000 Mbps.

Ten disks connected to the local node and five disks connected to the remote node. These disks are configured with five local volume groups and five GLVM volume groups; each volume group has five logical volumes with two mirror copies.

The utility diskio2, used to generate the disk I/O for testing, can be found in the appendix of Geomirror Performance for HAGEO and GeoRM: A Commentary, available from the IBM Web site at www.ibm.com/servers/eserver/pseries/software/whitepapers/gmdperf.html.
© IBM Corporation 2006

IBM Corporation
Marketing Communications, Systems Group
Route 100
Somers, New York 10589

Produced in the United States of America, March 2006. All Rights Reserved.

This document was developed for products and/or services offered in the United States. IBM may not offer the products, features, or services discussed in this document in other countries. The information may be subject to change without notice. Consult your local IBM business contact for information on the products, features, and services available in your area. All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only.

IBM, the IBM logo, the e-business logo, AIX 5L, and HACMP are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. A full list of U.S. trademarks owned by IBM may be found at http://www.ibm.com/legal/copytrade.shtml. Other company, product, and service names may be trademarks or service marks of others.

Information concerning non-IBM products was obtained from the suppliers of those products or other public sources. Questions on the capabilities of non-IBM products should be addressed to the suppliers.

The IBM home page on the Internet can be found at http://www.ibm.com. The System p home page on the Internet can be found at http://www.ibm.com/systems/p.