Use of this material to deliver training without prior written permission from HP is prohibited.
Module 10 Troubleshooting
Network and SAN issues ............................................................................ 10-3
Membership partitions .............................................................................. 10-20
Other troubleshooting issues .......................................................................10-25
Lab activity.............................................................................................. 10-53
Course Introduction
Module 0
HP Education Services
Course Introduction
Course objectives
At the end of this course, you will have:
A thorough understanding of the HP PolyServe Linux product
Enough hands-on experience to enable the successful
Course Introduction
Course prerequisites
If you are taking this course, you should have:
A rudimentary knowledge of clustering technologies
A working knowledge of SAN technologies
An in-depth knowledge of Linux OS administration
A working knowledge of Apache, Oracle, NFS
Course Introduction
Agenda
Day One
Module 1 Product Overview
Module 2 Installation and Configuration
Module 3 Storage Configuration
Agenda
The time spent on each module will vary depending on the audience.
Course Introduction
Agenda
Day Two
Module 4 File Systems and Quotas
Module 5 Matrix Administration
Module 6 Matrix Server Clusterware
Module 7 File Serving Utility
Agenda
The time spent on each module will vary depending on the audience.
Course Introduction
Agenda
Day Three
Module 7 File Serving Utility (cont'd)
Module 8 Database Utility
Module 9 Logs and Services
Module 10 Troubleshooting
Agenda
The time spent on each module will vary depending on the audience.
Course Introduction
Classroom facilities
Fire exits
Phones
Restrooms
Smoking
Breaks
Lunch
Class time
Begin class promptly
Use time wisely
Use the student guide as a reference manual
No cell phones or laptops please
Course Introduction
Introductions
Name
Work location
Years with company
Experience
Certifications
Expectations for this course
MxS Overview
Module 1
MxS Overview
Objectives
At the end of this module the student should be able to:
Describe HP PolyServe Software and its components used in the
Scalable NAS Products
Describe the type of environments in which the HP PolyServe
Software can be deployed
Understand hardware requirements for servers running HP
PolyServe Software
Explain HP PolyServe clustering
MxS Overview
Cluster-wide administration
Management Console and command line interface enable
configuration and management of the entire matrix remotely or from
any server in the matrix
MxS Overview
Fully symmetric
Diagram: high-availability services in a fully symmetric configuration; MPIO with dual HBAs, dual switches, and dual fabrics, each FC switch connecting to a RAID array.
MxS Overview
Cluster overview
Diagram: the structure of a 4-node cluster; Servers 1-4 attached to the public network(s) over Ethernet/IP (100/1000/10000) and to shared storage over Fibre Channel/iSCSI.
MxS Overview
Diagram: scale-out NAS; multiple NAS heads serving many clients (clientx, clienty, clientz, ...).
HP PolyServe features (1 of 4)
Storage Features
Shared File System
General purpose cluster file system
Fully symmetric, concurrent access by multiple servers in the
matrix
Any server can lock a file
Any server can update metadata
HP PolyServe features (2 of 4)
HP PolyServe features (3 of 4)
Service/Application monitors
Monitors a network service or application such as HTTP, NFS, or Oracle, and will cause the vhost to fail over in the event of a failure
Device monitors
Designed to watch a part of a server, such as a shared file system, and fail over or influence the location of a virtual host
File-based Replication
Cluster-aware replication for disaster recovery purposes
HP PolyServe features (4 of 4)
Primary Solutions
Web Serving
Deploy a highly available scale-out web site using Industry
Standard Servers and Storage
File Serving
Deploy a highly available scale-out NAS utility using Industry
Standard Servers and Storage
MxS Overview
HP PolyServe processes
Diagram: HP PolyServe software components, spanning user mode and kernel mode: ClusterPulse, PANPulse, GroupCom, SANPulse, DLM, SCL, LCL, pswebsrv, mxinit, mxlogd, mxlog, and the Management Console, together with the PSFS file system, the psd driver, SDMP, and the HBA driver.
MxS Overview
OS:
Storage: SAN storage devices. The Clustered File System is designed to be scalable, highly recoverable, and highly available.
Cluster: the Management Console and command line interface manage the entire cluster either remotely or from any server in the cluster.
MxS Overview
HP StorageWorks ExDS9100
System specifications (table): System, Power*, Weight, Performance Capability, Thermal.
MxS Overview
HP ExDS Software
Linux OS
Red Hat Enterprise Linux v4 Update 4
HP PolyServe HA Services
Full n:1, n-m monitoring and failover for HTTP
and NFS v3
MxS Overview
128 vhosts
128 server and/or device monitors
4 HBA Ports per server
MxS Overview
Check the online Compatibility Matrix for an up to date list of supported hardware and
software:
MxS Overview
Local Storage
105MB disk space for the MxS software
100MB disk space for log and runtime files
Memory resources on each server in the matrix are consumed to manage the state that
preserves the coherency of shared file systems.
For this reason, all servers should have approximately the same memory footprint. A 2-to-1
ratio typically is not problematic, but a larger difference can increase paging and affect lock
management to the extent that overall matrix performance is impacted.
Failover Considerations
It is important to understand your failover requirements such that any server that is a
candidate for failover has the required hardware to support that application.
MxS Overview
Cluster design
Time Synchronization
To ensure that file times are consistent across the matrix, it is important that all servers operate with synchronized time-of-day clocks. An NTP server is one commonly used mechanism for synchronizing system clocks.
Security
When configuring security on PSFS file systems, you need to ensure the UIDs and GIDs are the same on every node.
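A quick sanity check for both points, as a sketch (node names and the user to check are placeholders; ntpq and ssh access to each node are assumed):

# Compare NTP status and UID/GID mappings across the matrix
for node in node1 node2 node3 node4; do
    ssh $node 'hostname; ntpq -p | head -3; id oracle'
done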
MxS Overview
Consider using the /etc/hosts file for name resolution of nodes in the
matrix rather than an external DNS server
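A hypothetical layout (hostnames and addresses are placeholders): each node's /etc/hosts lists every cluster member on both the public and the private administrative networks:

# /etc/hosts on every node in the matrix (example addresses only)
10.12.10.101    node1.example.com    node1
10.12.10.102    node2.example.com    node2
192.168.10.101  node1-admin
192.168.10.102  node2-admin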
Administrative Network
MxS Overview
This is how users get access to the applications running in the cluster. It
is a best practice to keep the public and private networks separate
MxS Overview
# ethtool eth0
Supported link modes:   10baseT/Half 10baseT/Full
                        100baseT/Half 100baseT/Full
                        1000baseT/Half 1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes:  10baseT/Half 10baseT/Full
                        100baseT/Half 100baseT/Full
                        1000baseT/Half 1000baseT/Full
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: g
Wake-on: d
Current message level: 0x000000ff (255)
Link detected: yes
MxS Overview
Storage
Generally speaking, HP PolyServe works with all storage
The main issue is to understand any multipath requirements of the array and the deployment
Switch-attached storage is recommended (presents WWNN:LUN#)
In some cases, SCSI code page information is used to derive a unique ID
MxS Overview
Fabric based fencing is preferable and approximately 90% of our end users use it.
MxS Overview
Pros:
Cons:
Requires out-of-band SNMP access to the FC switches
MxS Overview
Pros:
Does not require FC switches or access to FC switches
Cons:
Cannot take a crash dump on a node when using NFF
Requires operator input to set up
MxS Overview
Ready to Install
Once the servers are set up correctly with:
A supported operating system
A good network configuration
A good storage configuration
Suitable HBA drivers, if the drivers delivered with the PolyServe software are not being used
Appropriate MPIO software
then we are ready to install the cluster software.
Installation and Configuration
Module 2
Objectives
At the end of this module, you should be able to:
Install HP PolyServe for Linux
Configure an HP PolyServe cluster
Prerequisites
If the HP-provided kernel patches are required, then a kernel rebuild may need to be performed, but a binary kernel precompiled with the patches is available from HP
lmsensors, netsnmp,
These are open-source kernel patches that HP required to complete QA of the product
If the File Serving solution pack is being used, then a kernel rebuild or the HP binary kernel is required for the NFS changes
of the networks
Ensure broadcast and multicast are enabled on the network interface and
switches
Unicast can be supported via a configuration parameter
NIC tuning (Linux configuration option | default value | recommended value):
ethtool -k eth0 | on | off
ethtool -g eth0 (number of receive (Rx) and transmit (Tx) hardware buffers/FIFOs) | varies by NIC hardware | balanced maximums, favoring receive
ethtool -a eth0 | on | off
ifconfig eth0 (NIC statistics) | 1000 | 30000 (1GbE), 30000 (10GbE)
ethtool -a eth0
ethtool -A eth0 tx off rx off
NIC Bonding
balance-rr or 0 - Round-robin policy: Transmit packets in sequential
order from the first available slave through the last. This mode provides
load balancing and fault tolerance.
active-backup or 1 - Active-backup policy: Only one slave in the
bond is active. A different slave becomes active if, and only if, the
active slave fails. The bond's MAC address is externally visible on only
one port (network adapter) to avoid confusing the switch.
balance-xor or 2 - XOR policy: Transmit based on the selected
transmit hash policy. The default policy is a simple [(source MAC
address XOR'd with destination MAC address) modulo slave count].
Alternate transmit policies may be selected via the xmit_hash_policy
option.
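As an illustration of the active-backup mode above, a minimal RHEL-style interface configuration might look like this (device names, addresses, and the use of BONDING_OPTS are assumptions; SLES uses its own ifcfg syntax):

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=192.168.10.101
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
BONDING_OPTS="mode=active-backup miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-eth0 (repeat for eth1)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none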
/etc/modprobe.conf.local:
alias bond0 bonding
options bonding miimon=500 mode=802.3ad lacp_rate=0 use_carrier=1 xmit_hash_policy=layer3+4
Reference documents
/usr/src/linux/Documentation/networking/bonding.txt/
Novell search for Setting Up Bonding on SLES 9
HBA Drivers
HBA drivers from QLogic and Emulex are supported; HP does not develop its own HBA drivers
Driver options are set in modprobe.conf
Note: Partition tables are NOT required on LUNs being used for file systems, but they ARE required for the membership partition LUNs
ql2xlbType=0 (not used when ql2xfailover is set to 0, and not required as we are not using the failover driver in v3.7)
ql2xmaxqdepth=32
qlport_down_retry=30
ql2xloginretrycount=30
ql2xexcludemodel=0x0
ql2xautorestore=0xA6
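Assembled into the driver configuration, the parameters above would typically be set on a single options line in /etc/modprobe.conf (a sketch; verify the exact parameter set against the release notes for your driver version):

options qla2xxx ql2xmaxqdepth=32 qlport_down_retry=30 ql2xloginretrycount=30 ql2xexcludemodel=0x0 ql2xautorestore=0xA6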
# cat /etc/hba.conf
qla2xxx /opt/polyserve/lib/hba_files/qla2xxx-8.02.11/libqlsdm.so
Partition disks before enabling DM-MPIO, or a reboot may be required to recognize new partitions.
Refer to the HP Scalable NAS documentation.
Refer to the OS distribution documentation:
http://www.redhat.com/docs/manuals/enterprise/RHEL-5manual/en-US/RHEL510/DM_Multipath/index.html
http://www.novell.com/documentation/sles10/stor_evms/index.html?page=/documentation/sles10/stor_evms/data/multipathing.html
The cluster software expects that if the device mapper multipath is to be used, that it is
installed and configured before starting the cluster. Our install and upgrade documents talk
about how to do this. Also, there is documentation from the OS distribution that tells how to
set up device mapper multipath.
Our documentation only mentions using the HP device mapper enablement package to get
the required multipath.conf file and setting for HBA parameters. The correct multipath.conf
settings for other vendor's storage would have to come from the vendor.
MP disk partitions should be created before starting device mapper multipath. Device mapper
and partitions don't work particularly well with each other. If a disk has partitions already on
it, that is fine. If you want to put partitions on a disk that is already controlled by dm-mpio,
then it is more trouble. The distribution documentation says to reboot after creating partitions.
This is because re-reading the partition data when the device is controlled by the device
mapper doesn't work. This is why, to make things easier, partitions should be added to MP
disks before enabling device mapper multipath.
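A minimal sketch of that ordering (the device name is a placeholder):

# Partition the LUN while it is still a plain sd device
fdisk /dev/sdd          # create the partition(s) interactively, then write the table

# Only then enable device mapper multipath
chkconfig multipathd on
service multipathd start
multipath -ll           # verify the mpath device and its partition mappings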
Download the HPDM Multipath Enablement Kit for HP StorageWorks Disk Arrays
v4.2.0 available
at http://www.hp.com/go/devicemapper.
#cd /tmp/HPDMmultipath
#tar -xvzf HPDMmultipath-4.2.0.tar.gz
#cd HPDMmultipath-4.2.0
Verify that the directory contains README.txt, COPYING, INSTALL, bin, conf,
SRPMS, and docs directories
*********************************************************************
Installation completed successfully!
[root@poly2 HPDMmultipath-4.2.0]#
Path grouping policies: Paths are coalesced based on the following path-grouping policies:
Priority-based path grouping
Provides priority to group paths based on Asymmetric Logical Unit Access (ALUA) state
Provides a static load balancing policy by assigning higher priority to the preferred path
Multibus
All paths are grouped under a single path group
Group by serial
Paths are grouped together based on controller serial number
Failover only
Provides failover without load balancing by grouping the paths into individual path groups
I/O load balancing policies: Provides a weighted Round Robin load balancing policy within a path group
Path monitoring: Periodically monitors each path for status and enables faster failover and failback
multipath.conf entry - EVA
For EVA:
device {
        vendor                  "HP|COMPAQ"
        product                 "HSV1[01]1 \(C\)COMPAQ|HSV[2][01]0|HSV300"
        path_grouping_policy    group_by_prio
        getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
        path_checker            tur
        path_selector           "round-robin 0"
        prio_callout            "/sbin/mpath_prio_alua /dev/%n"
        rr_weight               uniform
        failback                immediate
        hardware_handler        "0"
        no_path_retry           12
        rr_min_io               100
}
multipath.conf entry - XP
For XP:
device {
        vendor                  "HP"
        product                 "OPEN-.*"
        path_grouping_policy    multibus
        getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
        path_selector           "round-robin 0"
        rr_weight               uniform
        path_checker            tur
        hardware_handler        "0"
        failback                immediate
        no_path_retry           12
        rr_min_io               1000
}
Partitions should be created when the device is not part of a device mapper multipath device.
If you run fdisk on a dm- device or one of its component paths, you will see this when you try
to write the partition table out:
WARNING: Re-reading the partition table failed with error 22: Invalid argument.
The kernel still uses the old table.
The new table will be used at the next reboot.
Syncing disks.
The versions of the multipath tools are different between SLES10 and RHEL5. The kpartx
program functions a little differently between the two. Kpartx is used by udev to create linear
devices mapped to the multipath device to act as the partition devices. The RHEL5 version
does not generate a map for the extended partition. The SLES10 version of kpartx does.
multipathd logs messages to /var/log/messages. You can see when paths fail over or fail back.
sandiskinfo shows which device mapper devices, and which sd devices, it is managing for each psd.
dmsetup is a device mapper utility for managing mapped devices. It can also be used to list mapped devices, remove mapped devices, and display information about mapped devices.
"dmsetup ls --tree" will show a tree of multipath devices and their component sd devices.
"dmsetup remove_all" will remove all mapped devices (if not in use); with force it will remove in-use mapped devices as well.
The multipath command is used to create mapped multipath devices. It can also be used to list the current devices. Use "multipath -ll"; it produces something like "dmsetup ls --tree" but with more information.
"multipathd -k" will start a copy of multipathd in interactive mode so you can display multipathd state. Use Ctrl-D to exit.
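A quick illustration of the commands described above (output shapes vary by environment):

multipath -ll        # list multipath devices with their component sd paths
dmsetup ls --tree    # show mapped devices and their components as a tree
multipathd -k        # interactive multipathd shell; Ctrl-D to exit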
Example device listing (table): each psd device (psd1 through psd7) and its partitions (psd1p1, psd2p1, psd3p1, ...) is shown with an MPIO state of "disabled", a failover timeout (10000 or none), and its targets.
Server
This is run automatically when the product is installed and every time the product is started. It can also be run interactively at any time.
1. rpm -i pmxs-X.Y.Z-<xxxx>.<arch>.rpm
2. rpm -i pmxs-<os>-support.X.Y.Z-<xxx>.<arch>.rpm
3. rpm -i kernel-HPPS-<kernel_version>.<xxxx>.<arch>.rpm
   or rebuild the kernel with the HP-provided patches
4. rpm -i mxconsole-X.Y.Z-<xxxx>.<arch>.rpm
Diagram: server rear view showing the iLO port (connected to the private intra-cluster network), the PCI-X/PCI-E slots (100 MHz and 133 MHz PCI-X, x4 PCI-E), SCSI Port 1, the UID light, the HBA ports, and embedded NIC Port 2 on the public data (NFS) network.
3. Assign the same user/password to each node for fencing (you can use the admin user if desired). Set the network parameters (IP, mask, etc.) for private network fencing.
Display Setup
When booting for the first time, you will need to configure the
system for the monitor being used. Be sure to pick a configuration
that allows for 16K colors (8K color depth will make using the
GUIs difficult) and at least 1024x768 resolution.
Configure Networking
Enter the hostname or IP address of the server, then type admin for both user and password and click Configure
Click Here to Launch
Enter a username and password. On v3.7 this will be a valid Linux username/password, as the internal admin user is no longer supported.
tab:
Fencing overview
When problems occur on a server and the server ceases to respond, the server is fenced to prevent uncoordinated access to the shared storage
Uses the management hardware on the server to reboot or power down the node
Check the compatibility guide for supported interfaces
Make sure you know the IP address/username/password of each of the management ports on the cluster nodes
(2 of 2)
Membership partitions (1 of 2)
Used to control access to the SAN and to store the device and server naming databases, including the global device names that Matrix Server assigns to the SAN disks placed under its control
Membership partitions should exist on their own LUNs to prevent any I/O contention
Membership partitions need to be at least 1GB
MPs need to have a Linux partition table installed on them
We recommend configuring 3 membership partitions for redundancy
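A minimal sketch of preparing such a LUN (the device name is a placeholder):

# Put a Linux partition table with one primary partition on the 1GB membership LUN
fdisk /dev/sdd       # n, p, 1, accept defaults, w
partprobe /dev/sdd   # re-read the partition table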
Membership partitions
LUNs
A LUN that has a membership partition on it cannot be deported while the matrix is running; Matrix Server needs to be stopped to modify the partition table.
LUNs should already be partitioned and unformatted, with no drive letter attached.
partitions
Membership partitions
(2 of 2)
1. To create a membership partition, open the Add Membership Partition window, which lists all of the disks or LUNs that it can access.
Snapshot Configuration
Click Add to configure
the array based
snapshots
Snapshot Configuration
1. To configure array-based snapshots:
Click Apply to
save the configuration
Cluster-wide configuration
Click Add Server to add more nodes to the cluster
Highlight the server and then click Export To
Highlight the server and then click Start Service
If using NFF, highlight the server and click Test Fencing
Open the Management Console
To Connect to Cluster, enter either the hostname or IP address of one of the servers in the cluster
Enter a username/password to connect to the cluster
HP PolyServe installation and configuration is complete and you have a running cluster
It is important to set up a dedicated administrative network that will be used for cluster interconnect traffic
Discourage Admin Traffic on all networks except those to be used for admin traffic
You can also Exclude Admin Traffic from networks; the primary network, however, can never be excluded
SizingActions file
HP Scalable NAS includes a script called SizingActions that configures certain operating system parameters to improve system performance, particularly in a file serving environment. The changes improve network throughput and make better use of system memory. On HP Scalable NAS Clustered Gateway servers and HP 4400 Scalable NAS File Services systems, additional changes are made to tune the operating system for the hardware provided with those systems.
The SizingActions script is run when HP Scalable NAS starts up. The script does not determine whether the system parameters it adjusts have been modified from their default values by a user on the system. This can be an issue if, for example, you are running an application that requires system parameters such as wmem_max or rmem_max to be modified, typically in the /etc/sysctl.conf file.
To disable SizingActions in the cluster:
1. Go to the directory containing the SizingActions script:
# cd /etc/opt/hpcfs
2. Run the following command:
# chmod 444 SizingActions
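Assuming the script is skipped simply because it is no longer executable, re-enabling it later would be the reverse (a sketch, not a documented procedure):

# chmod 755 SizingActions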
Other procedures
Hosting Preference
a failed node.
Storage Configuration
Module 3
Storage Configuration
Objectives
At the end of this module you should be able to:
Describe Node fencing and its configuration
Describe Membership Partitions and Matrix Data Store
Describe cluster-wide Device Naming
Describe the components of the Clustered Volume Manager
Create Matrix volumes
Split-brain resolution (1 of 2)
SANlocks protect file system integrity; these are stored in the membership partitions
If a problem causes a cluster to split into two or more network partitions, the SANlocks ensure that only one of the resulting network partitions has access to the SAN
This avoids the classic split-brain scenario where two sets of servers assume they own the data
A server becomes the SAN administrator when it acquires the SANlocks
Split-brain resolution (1 of 2)
All I/O is suspended and servers update their server registry
entries in the membership partition, then an election
mechanism begins for the SAN administrator role.
Split-brain resolution (2 of 2)
All servers throw away any locks they don't own, and all servers calculate the new home nodes for the locks they own
Another server replays the file system journals to complete any transactions extant at the time of the crash
Takes only a few seconds to complete
rebooted
Split-brain resolution (2 of 2)
Typical reasons for nodes getting fenced are:
Network isolation at the node level
Network communication issues between nodes
Lack of udp traffic between the nodes
Lack of multicast communication between the nodes
Network addresses are changed dynamically with MxS
running
Nodes are so heavily loaded they are not communicating
correctly
Network traffic is so high that messages are taking too long
to get between the nodes
Storage Configuration
wmtest can be used to test the capability of being able to fence the server, BUT it takes command line arguments. The real fencing information is stored on the servers.
If wmtest works but fencing doesn't, check the fencing configuration stored on each of the nodes and in the membership partitions:
/etc/opt/hpcfs/myFENCEidentity
Storage Configuration
Troubleshooting
What happens when the node cannot be fenced?
Successful fencing is critical to data integrity, to prevent uncoordinated access to the storage. The cluster will wait forever (there is no timeout), and eventually it will appear as if I/O in the cluster has hung, until verification is received that the node has been fenced, such as by rebooting the node.
Failures include the SAN administrator being unable to reach the switch interface, or the wrong IP address having been specified for the server management port.
If you are certain the node is down, enter the following command:
mx server markdown <server>
Use with caution, as data corruption could result if the node still has access to the storage.
This informs Matrix Server that the node has been fenced and to ignore it.
Storage Configuration
Diagram: Nodes 1-4 and a quorum disk, with the vote counts held by each side of a network partition.
Storage Configuration
Diagram: Nodes 1-4 and the membership partition(s) (server registry); the side with access to the majority of membership partitions acquires the lock, with the lowest IP address used in the election.
Storage Configuration
Membership partitions
PolyServe uses a set of membership partitions to control access to the
SAN and to store the device naming database, which includes the global
device names for SAN disks imported into the matrix.
PolyServe.
Storage Configuration
A server is added to the host registry when it mounts its first shared file system, and remains in the registry until it unmounts its last file system or leaves the matrix
Storage Configuration
mpdump -v output
Current Product MP Version: 2
Membership Partition Version: 2
Membership Partitions:
20:00:00:80:e5:11:ed:75::0/1 (ONLINE)
20:00:00:80:e5:11:ed:75::1/1 (ONLINE)
20:00:00:80:e5:11:ed:75::2/1 (ONLINE)
Membership Partition Device Database (Version 1):
UID:20:00:00:80:e5:11:ed:75::0
Label:psd1
(state=0x1/mask=00000000)
UID:20:00:00:80:e5:11:ed:75::1
Label:psd2
(state=0x1/mask=00000000)
UID:20:00:00:80:e5:11:ed:75::2
Label:psd3
(state=0x1/mask=00000000)
UID:20:00:00:80:e5:11:ed:75::7
Label:psd7
(state=0x1/mask=00000000)
UID:20:00:00:80:e5:11:ed:75::8
Label:psd8
(state=0x1/mask=00000000)
Membership Partition Volume Database (Version 2):
VOL:psv2
(stripesize=4096K)
Set 0: SUBDEV: 20:00:00:80:e5:11:ed:75::7/0
size=10240000K
SUBDEV: 20:00:00:80:e5:11:ed:75::8/0
size=10240000K
Membership Partition Host Registry (Version 3):
Host ID: 10.12.10.110 fencestatus=0 fencetype=0
Fence ID:21:00:00:e0:8b:1c:d2:50::qlogicswitch1 state=0
Host ID: 10.12.10.109 fencestatus=0 fencetype=0
Fence ID:21:00:00:e0:8b:1c:90:4f::qlogicswitch1 state=0
Host ID: 10.12.10.103 fencestatus=0 fencetype=0
Fence ID:21:00:00:e0:8b:1c:9b:4f::qlogicswitch1 state=0
Host ID: 10.12.10.104 fencestatus=0 fencetype=0
Fence ID:21:00:00:e0:8b:1c:10:41::qlogicswitch1 state=0
Storage Configuration
Extended availability
A node can access the MP datastore even if it loses access to all the MPs, as long as
at least one node in the matrix has access to the MPs (the request is automatically
redirected to a node that can access the MPs).
If no node can access the MPs, a local read-only early-access datastore can be
read instead. A nodes early-access datastore is a consistent copy of the MP
datastore, but may not contain all updates.
Storage Configuration
SCL
uses the cluster-ID to verify that MPs belong to the cluster
RBAC
roles are stored in /rbac
mxlog
reads the cluster-ID and cluster description for inclusion in log messages
SNMP traps
enabled flag, event filter, and trap targets are stored in /notifiers
Storage Configuration
A cluster that loses (or has corrupted) that partition must be taken out of service to effect repair/replacement
No ADM election can take place
Storage Configuration
Diagram: consistent device naming across the cluster; Nodes 1-3 all see LUN 1 as /dev/psd1 even though the local Linux names differ (/dev/sda, /dev/sdd), and access to the raw /dev/sdx devices is prevented.
The Import and Deport buttons on the UI are used to import and deport disks, not volumes
Storage Configuration
Deporting removes a disk from the matrix; a disk cannot be deported if it contains mounted file systems or a membership partition
Storage Configuration
Management Console:
The Import Disks and Deport Disks windows display the UID,
vendor, model, and capacity for imported or unimported disks
These windows also display the FC switch used to access each
disk
sandiskinfo (VERY useful tool)
# sandiskinfo -ial
# sandiskinfo -ual
Storage Configuration
(Membership Disk)
Uid:      6-6005-08B4-0008-00C9-0004-8000-0131-0000
Vendor:   HP HSV300
Capacity: 1024.00M   (PSMP/Active)

(Membership Disk)
Uid:      6-6005-08B4-0008-00C9-0004-8000-0135-0000
Vendor:   HP HSV300
Capacity: 1024.00M   (PSMP/Active)

(Membership Disk)
Uid:      6-6005-08B4-0008-00C9-0004-8000-0139-0000
Vendor:   HP HSV300
Capacity: 1024.00M   (PSMP/Active)

Disk: /dev/psd/psd4
Uid:      6-6005-08B4-0008-00C9-0004-8000-0149-0000
Vendor:   HP HSV300
Capacity: 51200.00M  (SUBDEV/psv1)

Disk: /dev/psd/psd5
Uid:      6-6005-08B4-0008-00C9-0004-8000-014D-0000
Vendor:   HP HSV300
Capacity: 51200.00M  (SUBDEV/psv2)
Storage Configuration
CVM overview
Matrix Server V3 introduced a Cluster Volume Manager that allows dynamic volumes to be built from one or more imported subdevices (disks, disk partitions, or LUNs)
Storage Configuration
Basic volumes
Basic volumes (psd)
Consist of a single disk or LUN that has been imported into the
matrix
A PSFS file system is then created directly on the disk, partition,
or LUN
If the underlying lun can be expanded in the array then the file
system can be extended to use the additional space,
This may be an offline operation
Basic volumes
Volumes are used to store PSFS file systems.
Storage Configuration
Dynamic volumes
Dynamic volumes (psv)
Created by the Matrix Server Volume Manager
Can include one or more subdevices, such as disks, disk
partitions, or LUNs that have been imported into the matrix
A single PSFS file system can be placed on each dynamic
volume
Can add more subdevices to a dynamic volume as necessary
and extend the file system on the volume online
Storage Configuration
Concatenated volume
Each subdevice is completely filled before using the next one
Diagram: a concatenated volume whose subdevices (300 GB full, 100 GB full, 200 GB) are filled in order
Storage Configuration
Diagram (don't do this!): data striped across multiple LUNs of mixed sizes (several 100 GB LUNs and a 300 GB LUN) results in inconsistent performance
Grow striped volumes by adding uniform stripe sets for consistent performance
Storage Configuration
Striped volume
Diagram: striping across subdevices of unequal size (150 GB, 100 GB, 200 GB) leaves wasted space on the larger subdevices
Specify the size
Click OK or Apply (Apply keeps the window open)
A psv name is assigned to the dynamic volume
Storage Configuration
Unstriped: the volume is concatenated
Suboptimal: the volume has been extended and includes more than one stripe set; the volume may or may not have been extended with a stripe set
The volume properties can then be displayed from the Management Console
Stripe state will be one of the following:
Optimal: the volume has only one stripeset that includes all subdevices
Unstriped: the volume is concatenated
Suboptimal: the volume has been extended and includes more than one stripeset; the first stripeset will be filled before writes to the next stripeset begin.
Storage Configuration
The Extend Volume option allows you to add subdevices to an existing dynamic volume.
When you extend the volume on which a file system is mounted, you can optionally
increase the size of the file system to fill the size of the volume.
Note: The subdevices used for a striped dynamic volume are called a stripeset. When a
striped dynamic volume is extended, the new subdevices form another stripeset. If you want
the entire dynamic volume to be in the same stripeset, you will need to recreate the volume.
Storage Configuration
Implement striping on a concatenated volume, or place all subdevices in the same stripe set if a striped dynamic volume has been extended
Volume Manager first
Storage Configuration
If a PSFS file system was created directly on an imported disk partition or LUN (a basic volume), you can convert the basic volume to a dynamic volume
Storage Configuration
When a dynamic volume is destroyed, the file system on that volume is also destroyed.
Before destroying a dynamic volume, be sure that the file system is no longer needed or
has been copied or backed up to another location.
The file system must be unmounted when you perform this operation.
On the Destroy Dynamic Volume Windows, select the volume to destroy then click Ok or
Apply
Storage Configuration
underlying disks
Will not destroy the file
system or underlying data
Storage Configuration
Unimportable Volumes
Unimported volumes may be IMPORTABLE, or:
TRUNCATED: one or more subdevices are smaller than the size recorded in the subdevice signature
MISSING: one or more subdevices are missing
Storage Configuration
Subdevice re-use
The GUI will prompt before reusing a subdevice that
File Systems and Quotas
Module 4
Objectives
After completing this module you should be able to:
Configure file systems
Mount file systems
Perform file system checks
Configure user and group quotas
Create snapshots
Describe and Configure Replication
Explain Multi-Path IO software
Select Storage Filesystem New or click the Add Filesystem icon on the toolbar
The Label field identifies the file system
The storage extents (partitions) that are not currently in use are displayed in the
Available Extents area
Note: the Create a Filesystem window identifies disk partitions by their MxS names,
such as psd1p2. To match these names to their local Linux names, open the Get
Local Disk Info window
From the command line
mx fs create [--size <KB>] <filesystem> <storageExtent>
mkpsfs <device> [<size-in-blocks>]
mkfs -t psfs ...
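For example, a sketch with placeholder names (the storage extent, mount point, and mount syntax shown are assumptions to be checked against your release):

# Create a PSFS file system on an imported storage extent, then mount it
mx fs create data1 psd2p1
mount -t psfs /dev/psd/psd2p1 /mnt/data1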
group quotas
Ordered/Unordered
Defines the order in which
metadata and data blocks are
written to the f/s
Cluster-coherent locks
Use with caution, affects only
fcntl() locks. If this is disabled
then record locking must be
implemented a different way
to ensure data integrity
In the past with the PolyServe file system, every file used at least one metadata block and one file system block
On a file system built with 8K blocks, a 1KB file would consume 16KB of the file system: an 8K metadata block and an 8K data block.
Features: SMALL_FILES
1. The old psfs.ko should fail to mount a new file system with this feature bit on. The kernel log will have a message like this: "The on-disk revision level of filesystem psd4 is newer than that supported by this version of PSFS. You may need to upgrade to a newer version of PSFS in order to access this filesystem."
2. Once enabled on an FS, there is no way to revert: the intention is to always have the feature on. The new 3.7.0 psfs.ko driver must always understand file data stored using the new model.
|e/cn|
key
| |need|
--------------------------------------------------------------------------------------------------------------------------------------------------| 0|0 8358 0x0 SD(0), len 48, loc 4048 free space 65535, fsck need 0, format new|
(SD), mode -rwxrwxrwx, size 12, nlink 1, mtime 02/26/2009 10:28:55 blocks 8, uid 0, gid 0
---------------------------------------------------------------------------------------------------------------------------------------------------| 1|0 8358 0x1 IND(1), len 4, loc 4044 free space 0, fsck need 0, format new|
1 pointer
[ 8359]
===================================================================
|e/cn|
key
| |need|
---------------------------------------------------------------------------------------------------------------------------------------------------| 0|0 8358 0x0 SD(0), len 48, loc 4048 free space 65535, fsck need 0, format new|
(SD), mode -rwxrwxrwx, size 12, nlink 1, mtime 02/27/2009 10:00:20 blocks 8, uid 0, gid 0
----------------------------------------------------------------------------------------------------------------------------------------------------| 1|0 8358 0x1 DRCT(2), len 3976, loc 72 free space 65535, fsck need 0, format new|
"hello world\n"
These are highlights from results that Fitsum and Rob have sent to me; contact them for details and the official latest Guinness testing numbers.
The best results (38X better) were obtained running iozone's fileop with 4 million files, 4K blocksize, quota enabled, RHEL 5.2, single node (RAM 16K)
Chart: Series1 vs. Series2 plotted against thousands of files (x-axis 1 to 33), y-axis 0 to 500.
Chart: Series1 vs. Series2 plotted against thousands of files (x-axis 1 to 33), y-axis 0 to 1400.
Psfscheck.exe
functionality.
Psfsformat.exe
Psfsinfo.exe
Psfsdebug.exe
The database option is optimized for file systems that will house database data files by supporting direct I/O
Disables file system buffering for I/Os (buffer cache bypass)
Persistent mounts
Used to ensure that file systems are mounted automatically whenever the cluster software starts
(2 of 2)
<device>
The SMALL_FILES feature was explained earlier
(1 of 2)
(atime) automatically.
applications
(2 of 2)
/etc/var/polyserve/mxinit.conf file:
psfs_atime_enabled = 1
mxinit.conf file:
being evicted
Quota overview (1 of 3)
The PSFS file system supports both hard and soft file system quotas
Soft limits provide a way to warn users (or groups) when they approach their hard limits
Hard quotas
When a file owner reaches the hard limit, the file system will not allow the owner to
create files or to increase the size of an existing file
Any attempts to allocate more space will fail.
The file owner will need to remove files or reduce their size until the disk usage falls
below the hard limit.
Soft quotas
A soft limit is typically set below the hard limit and triggers the warning.
If you want to use soft limits, you will need to configure a warning mechanism such as
the Linux warnquota utility.
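One way to wire up that warning mechanism, as a sketch (warnquota is part of the standard Linux quota tools and reads /etc/warnquota.conf; running it from cron.daily is an assumption, not a product requirement):

#!/bin/sh
# /etc/cron.daily/warnquota.sh - e-mail every user who is over a soft quota limit
/usr/sbin/warnquota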
Quota overview (2 of 3)
Hard quotas
When a file owner reaches the hard limit, the file system will not
allow the owner to create files or to increase the size of an existing
file
Any attempts to allocate more space will fail.
The file owner will need to remove files or reduce their size until the
disk usage falls below the hard limit.
Soft quotas
A soft limit is typically set below the hard limit and triggers the
warning.
If you want to use soft limits, you will need to configure a warning
mechanism such as the Linux warnquota utility.
Quota overview (3 of 3)
When you create a PSFS file system, you will need to specify whether quotas are enabled on the file system.
Quota configuration
Check or uncheck Enable quotas
as appropriate.
Quota editor
The Management Console includes a quota editor that you can use to view quota
information and to set or change the hard and soft limits for specific users or groups on a
file system.
You can start the editor from the Quotas tab on the File System Properties window
Right-click on a File system, select Properties, and then click on Manage Quotas
You can also use the menus: Storage Manage Quotas at the top of Management
Console.
Quota searches
You can use the search feature on the left side of the quota editor to locate quota
information for specific users or groups.
If you are searching by name, the quota information must be in a database (such as a
password file or LDAP database) that can be accessed from the server where the file
system is mounted.
The search locates the name in the database and matches it with the ID, which is the
value stored on the file system
Adding quotas (1 of 2)
Adding quotas (2 of 2)
When the Add Quota dialog appears, select the appropriate file system and set the
quota limits.
Any existing quota limits on the file system will be overwritten.
If a user or group no longer requires quotas on a file system, you can remove the quotas for that user or group.
Select the user (or group) on the Quotas dialog and then click the Delete icon on the toolbar.
Note: The quotas cannot be removed if the user or group has files on the file system.
Note: You do not need to install this RPM to use quotas with PSFS file systems
Array-Based Snapshots
storage arrays, the latest version of the HP StorageWorks Scripting System Utility
(SSSU) must be installed on all servers in the cluster. Also, the latest version of
CommandView EVA software must be installed on your Management Appliance.
Be sure that your versions of SSSU and CommandView EVA are consistent. The
SSSU utility must be renamed, or linked, to /usr/sbin/sssu, and must be
executable by all users. To locate this software, contact your HP representative.
SANtricity Storage Manager client software must be installed on all servers in the
cluster. Also, the latest version of firmware must be installed on your storage array
controllers. To locate this software and firmware, contact your Engenio
representative.
(MSA2000) storage arrays, the latest version of firmware must be installed on the
array controllers and the SSH Command Line Interface (CLI) service must be
enabled on the array controllers. Also, note that a MSA2000 snapshot license is
required. Only the file systems located on Master Volumes (not Standard Volumes)
are snapshot capable.
HP XP storage arrays
To take hardware snapshots on HP StorageWorks XP24000, XP20000,
Creating a snapshot (1 of 2)
To create a snapshot:
Select the file system on the Management Console
Right-click and select Create Snapshot.
The file system must be mounted.
Creating a snapshot (2 of 2)
Mounted snapshots
Mounted snapshots appear on the Management Console beneath the entry for the file system because they still need the original file system; they are really just deltas from the original.
Mounted snapclones
A snapclone appears as its own filesystem mount in the Management Console
Delete a snapshot
Storage arrays typically limit the number of snapshots that can be taken, so you may need to delete an existing snapshot.
Also, if you want to destroy a file system, you will first need to delete its snapshots.
To delete a snapshot:
Select the snapshot on the Management Console
Right-click and select Delete.
To delete a snapshot from the command line, type the following:
IS NOT
Database backup
Aimed at file-oriented replication
Designed to get replication into a customer's site with low pain
Realtime
rplwatch writes summary info about changes to logs on the admin file system
The sentinel node uses rsync to detect changes to the file system and transfer them to the destination cluster
Sentinel
takes information about changes from log file
reads data and compares with target
sends to replication partner.
Diagram: replication data flow; change logs are written to the adminfs, and file data is read from the source PSFS file system and sent to the destination PSFS file system.
Set up the ssh configuration for replication on both the source and destination clusters; replication uses ssh to communicate with the destination.
You can create a custom key pair and then install that key pair on the source and destination clusters. Use one of these methods:
Run rplkeys -c
Creates the custom key pair, adds it to the mxds datastore, and publishes it for replication use.
The rplkeys -c and -i commands attempt to publish the key on the destination cluster
Best practices
If possible, structure directories to reduce the number of watch points
Make sure you need to replicate the data that's flowing; don't just default to replicating everything
Dedicate a node to being the sentinel under normal operation
Network bandwidth matters, so use a dedicated network for replication
Troubleshooting
Watch CPU load and memory utilization on the sentinel node; if it gets too high, you may get problems.
Replication Commands
rplmonitor
Starts and stops the replication system.
Typically run by mxinit, but useful for starting/stopping replication without a node reboot.
rplstatus
Displays the replication state (configured, running, stopped, etc.) and the current sentinel node (also scripting extensions)
rplcontrol
Controls the replication state (start, stop, pausetransport, etc.)
rplkeys
Sets up the replication SSH keys
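Pulling the commands together (flags as documented in this module; anything beyond them should be treated as an assumption):

rplstatus               # show replication state and the current sentinel node
rplcontrol -s stop      # stop replication
rplcontrol -s restart   # restart replication
rplkeys -c              # create and publish a custom SSH key pair for replication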
Replication Troubleshooting
rpl_create_hr
Converts rplwatch binary log files to text
rplctldump
Converts the history file to readable text; useful for seeing exactly what's been replicated
rpl_versions.sh
Verifies that the OS and HP Scalable NAS versions on the source and destination are the same
Replication Logs
The replication log files created by rplwatch are located in the /_adminfs file system under
the directory /replication/logs. The following types of logs are created:
Path logs. These log files contain the inodes and absolute paths of the directories included
in the replica set. Replication uses this information internally to find the absolute paths for
modified files in the directories.
Change logs. These log files contain a history of the watched files and changes to those
files during a single replication interval. There is a change log for each node. At the end of
a replication interval, these logs are merged into one change log that specifies all of the files
that are to be transferred to the destination cluster. (If a file already exists on the destination
cluster, only the changes in the file will be transferred.)
Delete logs. These log files contain a history of the files that were deleted during a single
interval. There is a delete log for each node. At the end of a replication interval, these logs
are merged into one delete log that specifies all of the files that are to be deleted from the
destination cluster.
The log files are in a binary format and cannot be viewed with a text editor. Use the
rpl_create_hr command as follows to convert the files into a readable format.
steps:
# rplcontrol -s stop
Use a file transfer method (for example, scp, ftp, tar, cpio) to copy the files back
# rplcontrol -s restart
Note: Replication does not provide tools to analyze or report the
other direction. There is not an advantage over scp if there is total data loss on
the source. There is an advantage if the data loss is large but not total, as some
files will not need to be replicated.
The command can be run as part of a monitor script using different arguments:
#rplstatus status
This will return zero (success) if replication is running on the node; this should succeed on every node
#rplstatus sentinel
This will return zero (success) if the sentinel is up and running; this should only succeed on the sentinel
The next slide shows these commands running as a custom device monitor being used to monitor replication in the cluster
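A minimal sketch of such a probe script (how a real custom device monitor invokes it is product-specific; this simply exercises the per-node check described above):

#!/bin/sh
# Succeed only if replication is running on this node
rplstatus status || exit 1
exit 0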
Multipath I/O
Used to eliminate single points of failure
Requires the LUN to be presented with the same WWNN:LUN# at the same time on all storage controllers
It is expected that the required third-party MPIO software will be installed and configured
A matrix can have multiple FC switches, multiple FC ports per server, and multi-ported SAN disks
Requires third-party array-specific software to support MPIO, regardless of the active/passive or active/active nature of the array
Multipath I/O
HP PolyServe for Linux has a built-in MPIO driver, mxmpio, but it is not the preferred option; device mapper multipath is used instead
Partition disks before enabling DM-MPIO, or a reboot may be required to recognize new partitions.
Refer to the HP Scalable NAS documentation.
Refer to the OS distribution documentation:
http://www.redhat.com/docs/manuals/enterprise/RHEL-5manual/en-US/RHEL510/DM_Multipath/index.html
http://www.novell.com/documentation/sles10/stor_evms/index.html?page=/documentation/sles10/stor_evms/data/multipathing.html
The cluster software expects that if the device mapper multipath is to be used, that it is installed
and configured before starting the cluster. Our install and upgrade documents talk about how to
do this. Also, there is documentation from the OS distribution that tells how to set up device
mapper multipath.
Our documentation only mentions using the HP device mapper enablement package to get the
required multipath.conf file and setting for HBA parameters. The correct multipath.conf settings for
other vendor's storage would have to come from the vendor.
MP disk partitions should be created before starting device mapper multipath. Device mapper
and partitions don't work particularly well with each other. If a disk has partitions already on it,
that is fine. If you want to put partitions on a disk that is already controlled by dm-mpio, then it is
more trouble. The distribution documentation says to reboot after creating partitions. This is
because re-reading the partition data when the device is controlled by the device mapper doesn't
work. This is why, to make things easier, partitions should be added to MP disks before
enabling device mapper multipath.
Download the HPDM Multipath Enablement Kit for HP StorageWorks Disk Arrays v4.2.0, available at http://www.hp.com/go/devicemapper (for example into /tmp/HPDMmultipath).
#cd /tmp/HPDMmultipath
#tar -xvzf HPDMmultipath-4.2.0.tar.gz
#cd HPDMmultipath-4.2.0
Verify that the directory contains README.txt, COPYING, INSTALL, bin, conf, SRPMS, and docs directories.
#./INSTALL
4 - 86
hf836s c.00
4 - 86
********************************************************************************
Installation completed successfully!
[root@poly2 HPDMmultipath-4.2.0]#
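Pulling the steps together, the install sequence looks roughly like this (the archive file name is illustrative; use the name of the kit you actually downloaded):
# cd /tmp/HPDMmultipath
# tar xzf HPDMmultipath-4.2.0.tar.gz    # illustrative archive name
# cd HPDMmultipath-4.2.0
# ls
README.txt  COPYING  INSTALL  bin  conf  docs  SRPMS
# ./INSTALL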
4 - 87
hf836s c.00
4 - 87
Path grouping policies: Paths are coalesced based on the following path-grouping policies:
Multibus: All paths are grouped under a single path group
Group by serial: Paths are grouped together based on the controller serial number
Failover only: Provides failover without load balancing by grouping the paths into individual path groups
I/O load balancing policies: Provides a weighted round-robin load balancing policy within a path group
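As an illustration only (this snippet is not from the HP enablement kit), a default path-grouping policy can also be set in the defaults section of /etc/multipath.conf; device-specific entries such as the ones that follow override it:
defaults {
    path_grouping_policy    failover
    rr_min_io               100
}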
hf836s c.00
4 - 88
"HP|COMPAQ"
product
"HSV1[01]1 \(C\)COMPAQ|HSV[2][01]0|HSV300"
path_grouping_policy
group_by_prio
getuid_callout
"/sbin/scsi_id -g -u -s /block/%n"
path_checker
tur
path_selector
"round-robin 0"
prio_callout
"/sbin/mpath_prio_alua /dev/%n"
rr_weight
uniform
failback
immediate
hardware_handler
no_path_retry
rr_min_io
"0"
12
100
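After editing /etc/multipath.conf, the maps can be rebuilt and inspected with the standard device-mapper multipath tools (a minimal sketch; output and device names vary by array):
# service multipathd restart    # pick up the new multipath.conf
# multipath -v2                 # rebuild the multipath maps
# multipath -ll                 # list each map, its path groups, and the policy in use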
4 - 89
hf836s c.00
4 - 89
multipath.conf entry - XP
For XP:
device {
    vendor                  "HP"
    product                 "OPEN-.*"
    path_grouping_policy    multibus
    getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
    path_selector           "round-robin 0"
    rr_weight               uniform
    path_checker            tur
    hardware_handler        "0"
    failback                immediate
    no_path_retry           12
    rr_min_io               1000
}
4 - 90
hf836s c.00
4 - 90
hf836s c.00
4 - 91
4 - 92
hf836s c.00
4 - 92
# resizepsfs /dev/psd/psd6p4
(to use all available space)
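The usual flow is to grow the underlying LUN or dynamic volume first, then run resizepsfs as shown above; a quick check afterwards (the mount point is illustrative):
# resizepsfs /dev/psd/psd6p4    # grow the PSFS file system into the new space
# df -h /mnt/fs1                # confirm the new size at the mount point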
hf836s c.00
4 - 93
braces:
4 - 94
hf836s c.00
4 - 94
(1 of 2)
4 - 95
hf836s c.00
4 - 95
(2 of 2)
4 - 96
hf836s c.00
4 - 96
Before that server can reboot, the port will need to be manually unfenced.
The preferred fencing method when booting from SAN is to use Flexible Server Fencing.
It is possible to use a switch that is not managed by Matrix
hf836s c.00
4 - 97
(1 of 2)
4 - 98
hf836s c.00
4 - 98
(2 of 2)
4 - 99
hf836s c.00
4 - 99
When this lower-level OLI support is in place, inserting a new disk will
4 - 100
hf836s c.00
4 - 100
The replacement switch must be the same model as the original switch and must have the same number of ports.
The FC connectors must be reinserted in the same location on the new switch.
PolyServe MxS must be stopped on any servers that are connected only to the switch to be replaced.
If these conditions are not met, you will not be able to perform an online switch replacement.
4 - 101
hf836s c.00
4 - 101
hf836s c.00
4 - 102
4 - 103
hf836s c.00
2009 Hewlett-Packard Development Company, L.P.   Version 7.4   HP Restricted
4 - 103
hf836s c.00
4 - 104
Cluster administration
Module 5
hf836s c.00
hf836s c.00
5-1
Cluster administration
Objectives
At the end of this module, students will be able to:
Understand the security architecture of the HP PolyServe Software
Authentication
Role Based Access Control
Understand the various management interface points of the matrix:
5-2
hf836s c.00
The CLI
SNMP
Notifiers
Performance Dashboard
2009 Hewlett-Packard Development Company, L.P.
5-2
hf836s c.00
Cluster administration
5-3
Cluster administration
5-4
hf836s c.00
5-4
Cluster administration
5-5
Servers Tab
This gives a server view of the matrix, including the network interfaces
Virtual Hosts Tab
Shows all vhosts in the matrix; you can drill down into members and services
Applications Tab
Shows the application monitors configured in the matrix
Filesystems Tab
Shows all psfs file systems in the matrix
Notifiers Tab
Shows all notifiers configured in the matrix
hf836s c.00
5-5
Cluster administration
Applications tab
Application monitors
Filesystems tab
All PSFS file systems in the matrix
5-6
hf836s c.00
5-6
Cluster administration
When you begin the drag operation, the cursor will change to a circle with a bar through it, meaning that the current mouse location does not allow drops.
5-7
hf836s c.00
5-7
Cluster administration
5-8
hf836s c.00
5-8
Cluster administration
cluster software versions running on each node, especially during rolling upgrades
5-9
hf836s c.00
5-9
dialog a consolidated view of the relationship between LUNs, Volumes, and File Systems
5 - 10
hf836s c.00
5 - 10
Cluster administration
Cluster administration
.matrixrc file
The mx utility can be used both interactively and in scripts.
5 - 11
hf836s c.00
5 - 11
Cluster administration
mx syntax
mx [mx_options] class command [command_options]
--help Displays a command summary.
The mx_options affect an entire mx command session. The
options are as follows:
--matrix <matrix> Specifies the matrix that you want to connect with; <matrix> can be any node in the matrix.
--config <file> Specifies the configuration file to be consulted for server, user, and password information. The file must have the same format as .matrixrc.
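For example (the node name is illustrative; server and fs are two of the command classes listed on the following pages):
mx --matrix node1 server status
mx --config /root/lab.matrixrc fs status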
5 - 12
hf836s c.00
5 - 12
Cluster administration
in the mx command.
5 - 13
hf836s c.00
5 - 13
Cluster administration
5 - 14
hf836s c.00
mx server add|update|delete|enable|disable|status|dump|read_license
mx vhost add|update|move|delete|status|enable|disable|dump
mx service add|update|delete|status|enable|disable|clear|dump
mx device add|update|delete|enable|disable|status|clear|dump
mx netif enable|disable|admin|noadmin|status|add|update|delete
mx notifier add|update|delete|enable|disable|status|test|dump
mx matrix destroy|dump|status|log
mx sleep
mx disk import|deport|status|dump
mx fs create|resize|destroy|recreate|showcreateopt|mount|unmount|status|dump|getdriveletters|assignpath|unassign|queryassignments
mx alert status
mx snapshot create|destroy|showcreateopt
mx dynvol
mx server markdown
mx file <filename>
2009 Hewlett-Packard Development Company, L.P.
5 - 14
hf836s c.00
5 - 15
Cluster administration
Cluster administration
Authentication
Feature overview
Web browser and mxconsole sessions authenticate as OS users and no longer use the built-in admin user / mxpasswd concept of users.
Pswebsrv supports only HTTPS on port 6771.
An MxS 3.5 mxconsole will NOT be able to connect to MxS 3.7.
An MxS 3.7 mxconsole can be used to connect to MxS 3.5.
Password Authentication
.matrixrc, mx/mxconsole --user and --password, via the GUI user dialog or when prompted.
The Apache module mod_authz_pmxs uses mxauthpw, which authenticates with PAM or, if PAM is not installed, against shadow passwords.
MxS does NOT configure PAM; it merely uses the existing OS PAM configuration.
Best practices
Follow standard security best practices, e.g. don't log on as root.
Saved passwords are encrypted in .matrixrc as usual, but make sure this file is protected.
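A minimal sketch of supplying credentials explicitly instead of relying on a saved .matrixrc (user name, password, and node name are illustrative):
mx --matrix node1 --user clusteradmin --password 'S3cret!' server status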
5 - 16
hf836s c.00
5 - 16
Cluster administration
Authentication
Caveats, limitations, known bugs
Troubleshooting
apache_access.log
401: needs to authenticate and/or handshaking
200: authenticated successfully
apache_error.log
[Tue Feb 24 14:41:47 2009] [error] [client 10.10.211.6] AuthExtern pmxs_auth [/opt/polyserve/lib/apache/bin/mxauthpw]: Failed (10) for user foo
[Tue Feb 24 14:41:47 2009] [error] [client 10.10.211.6] user foo: authentication failure for "/cgi-bin/pmxs/pulselet": Password Mismatch
For unknown users or a typo in the user name, the error is still reported as Password Mismatch.
Unfortunately, a UTF-8 user name prints as ???????? in apache_error.log, so there is no I18N support in the error messages.
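When chasing these entries, a simple grep is usually enough (run it against the apache_error.log and apache_access.log paths from your installation; bare file names are used below for brevity):
grep 'authentication failure' apache_error.log
grep -c ' 401 ' apache_access.log    # count requests that still needed to authenticate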
5 - 17
hf836s c.00
5 - 17
hf836s c.00
5 - 18
Cluster administration
Cluster administration
5 - 19
hf836s c.00
5 - 19
Cluster administration
Authenticated user access token contains user ID and associated group IDs
Used to determine which roles are associated with the
user
Will determine which rights are allowed/denied.
mxconsole configuration changes are audited (see mcs tool or GUI View Events)
hf836s c.00
5 - 20
Cluster administration
Without SAN access, roles in MxDS may become stale through the
early-access/off-line cache
perform the RBAC authorization check and have their actions audit
logged
5 - 21
hf836s c.00
5 - 21
hf836s c.00
5 - 22
Cluster administration
Cluster administration
3.7
Administrator Account
admin
Read-Only Guest
Any non-Administrators
member in a deny-all role
Password Management
Custom Authorization
N/A
RBAC
Single Sign-on
N/A
N/A
.matrixrc
Optional
Optional
Encryption
Some CGIs
Password Challenge
MD5
none
5 - 23
hf836s c.00
5 - 23
Use OS utilities
Cluster administration
hf836s c.00
5 - 24
Add
5 - 25
hf836s c.00
5 - 25
Cluster administration
Role Properties
Name
5 - 26
hf836s c.00
5 - 26
Cluster administration
Role Members
Assign Accounts to a Role
hf836s c.00
5 - 27
Cluster administration
My Rights
View the rights of the currently
logged in user
5 - 28
hf836s c.00
5 - 28
Cluster administration
Cluster administration
Export/Import
The import and export features can be used if you will be
5 - 29
hf836s c.00
5 - 29
hf836s c.00
5 - 30
Cluster administration
Cluster administration
Logging
v3.7 introduces a new message catalog
Messages have unique IDs
Design considerations
5 - 31
hf836s c.00
5 - 31
Cluster administration
5 - 32
hf836s c.00
5 - 32
5 - 33
hf836s c.00
5 - 33
Cluster administration
Cluster administration
Logging
Best practices
Troubleshooting
General
Don't view the matrix log from a node that has MxS stopped
Events are copied to /var/log/messages immediately
If you need to see events during (early) startup or (late) shutdown, look in
/var/log/messages.
Matrix.log file still exists on Linux
Some event details omitted
Events not written there until mxlogd starts
Matrix log can be dumped or viewed locally with mcs select command.
Use mcs select non-customer to include messages hidden from MxS UI
Use cgi to dump in XML
Notifiers
Test scripts and email filters on specific message IDs with mcs log -t -i <event-id>.
These go to notifiers, but not to the matrix log.
Sends a single event from a single node, unlike the test GUI button which sends one event
from every running node (to test all the paths)
Cluster admin can omit -t, and inject events indistinguishable from real ones
5 - 34
hf836s c.00
5 - 34
Cluster administration
5 - 35
hf836s c.00
5 - 35
Cluster administration
Logging
Caveats, limitations, known bugs
The message catalog contains all messages used in any feature on any platform
Messages may appear in the filter dialog that can't actually occur in a specific installation.
5 - 36
hf836s c.00
5 - 36
5 - 37
hf836s c.00
5 - 37
Cluster administration
Cluster administration
Alerts
Overview
Alerts represent a persistent noteworthy condition in the system
Alerts persist until:
The condition they represent ends
The object the alert refers to (or that object's container) is deleted (or disabled)
Alerts cannot be cleared manually
Consider an email notifier instead
Alerts all have unique IDs, defined in the message catalog
Events are logged at the beginning and end of every alert
The beginning event for alert n has ID n, and the end event has ID n+1
There is one authority in the 3.7.0 cluster for the current state of all alerts
A UI gets the current state of all alerts when it connects
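The same alert state is available from the command line through the mx alert class shown earlier (a minimal example):
mx alert status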
5 - 38
hf836s c.00
5 - 38
Alert Viewer
5 - 39
hf836s c.00
5 - 39
Cluster administration
Cluster administration
Alerts
Caveats, limitations, known bugs
3.5.x to 3.7.0 rolling upgrade issues
The alert authority doesn't exist until all nodes are running 3.7.0
3.7.0 nodes will only know about their own alerts (and the sanpulse alerts, which take a different path) until the rolling upgrade is complete
Since alert events are logged by the alert authority, this doesn't happen until the rolling upgrade is complete and a 3.7.0 alert authority exists
3.7.0 alerts will not be displayed on UIs attached to 3.5.x nodes, so the underlying problems (those that don't cause failovers) may go undetected during a prolonged rolling upgrade
If the customer does a prolonged rolling upgrade anyway, you can see the active alerts on a 3.5.0 node by connecting a UI directly to it
Sanpulse alerts from 3.7.0 will be displayed on 3.5.x consoles, but won't be readable there
3.5.x doesn't understand the 3.7.0 alert encoding and displays them as the raw base64-encoded alert structure (gibberish to users)
Any 3.7.0 node will display them correctly
5 - 40
hf836s c.00
"Your MPs are too small for 3.7.0" is a very common 3.7.0 alert to see during a rolling upgrade
2009 Hewlett-Packard Development Company, L.P.
5 - 40
hf836s c.00
5 - 41
Cluster administration
Cluster administration
Event Notifications
New event notification capabilities have been introduced
5 - 42
hf836s c.00
5 - 42
5 - 43
hf836s c.00
5 - 43
Cluster administration
the event notifier services (SNMP, email, script) and specify the matrix
events that will trigger them.
5 - 44
hf836s c.00
5 - 44
Cluster administration
5 - 45
hf836s c.00
5 - 45
Cluster administration
5 - 46
hf836s c.00
5 - 46
Cluster administration
5 - 47
hf836s c.00
5 - 47
Cluster administration
5 - 48
hf836s c.00
5 - 48
Cluster administration
5 - 49
hf836s c.00
5 - 49
Cluster administration
Cluster administration
#!/bin/bash
# Basic notifier script that sends email to root on the localhost.
# This will be the node that is the Group Leader at the time.
# (The subject string below is illustrative; the event details arrive
# on STDIN as XML and are passed through to the mail body.)
exec mail -s "MxS event notification" root
5 - 50
hf836s c.00
5 - 50
Cluster administration
Settings tab. This service runs a script when an event configured for the service occurs. You can specify only one script. Event data is sent as XML to STDIN and in MxS-event-* environment variables.
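A hypothetical script along those lines, just to show the plumbing (the log path is illustrative, and the exact MxS-event-* variable names are defined by the service):
#!/bin/bash
# Append each event to a local log: first any MxS-event-* environment
# variables, then the XML event document read from STDIN.
{
  date
  env | grep '^MxS-event' || true
  cat
} >> /var/tmp/mxs-events.log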
5 - 51
hf836s c.00
5 - 51
hf836s c.00
5 - 52
Cluster administration
SNMP Sub-Agent
Feature overview
Provides read-only access to cluster state/status information
Servers
serverTable, netifTable
Vhosts/SMs
vhostTable, vhstatTable,
DMs
devmonTable, dmstatTable,
File systems
filesystemTable, fsmountTable
Alerts
alertTable
svcmonTable, smstatTable
dmserverTable, dmvhostTable
5 - 53
hf836s c.00
5 - 53
Cluster administration
Cluster administration
SNMP Sub-Agent
Usage
To enable, add the following line to the /etc/snmp/snmpd.conf file:
dlmod mxsnmp /opt/hpcfs/lib/snmp/mxsnmp.so
Then restart the snmpd service:
Stopping snmpd:                                            [  OK  ]
Starting snmpd:                                            [  OK  ]
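Once snmpd is running with the sub-agent loaded, the tables can be browsed with the standard net-snmp tools; a minimal sketch (the community string and node name are illustrative, and the exact OID subtree comes from the shipped MIB):
snmpwalk -v 2c -c public node1 enterprises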
5 - 54
hf836s c.00
5 - 54
SNMP Sub-Agent
Example /etc/snmp/snmpd.conf file
5 - 55
hf836s c.00
5 - 55
Cluster administration
SNMP Sub-Agent
Example MIB browser output
5 - 56
hf836s c.00
5 - 56
Cluster administration
Cluster administration
SNMP Sub-Agent
Troubleshooting
First, try to isolate the problem with the log files:
/var/log/messages on the target server
/var/opt/polyserve/mxsnmp.log on the target server
If the problem appears to be with a MIB leaf object or a MIB table object:
hf836s c.00
5 - 57
hf836s c.00
5 - 58
Cluster administration
Cluster administration
Performance Dashboard
v3.7 introduces a new and improved performance monitoring capability
5 - 59
hf836s c.00
5 - 59
Cluster administration
Performance Dashboard
Create the administrative file system
The Performance Dashboard uses an administrative file system that
must be created on the shared storage. This file system is also used by
the replication feature and was described earlier in the presentation
5 - 60
hf836s c.00
5 - 60
Cluster administration
Performance Dashboard
Using the Performance Dashboard
The Performance Dashboard is a web-based application. To start the
dashboard, either click the Dashboard icon on the Management
Console toolbar or open a browser and enter the following URL,
where <node> is the server name or IP address of the node that will
run the dashboard.
https://<node>:6771/perfmon
The dashboard opens in the browser with a full view of the cluster to
which the connected node belongs (the Cluster Report). You will need
to authenticate to the dashboard by entering the fully qualified NTLM
(DOMAIN\User) or UPN (user@FQDN) credentials.
5 - 61
hf836s c.00
5 - 61
Performance Dashboard
5 - 62
hf836s c.00
5 - 62
Cluster administration
Performance Dashboard
5 - 63
hf836s c.00
5 - 63
Cluster administration
hf836s c.00
5 - 64
Cluster administration
hf836s c.00
5 - 65
Cluster administration
hf836s c.00
5 - 66
Cluster administration