
1. Home . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1 Welcome to MapR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.1 Quick Start - MapR Virtual Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.1.1 Installing the MapR Virtual Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.1.2 A Tour of the MapR Virtual Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.1.3 Working with Snapshots, Mirrors, and Schedules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.1.1.4 Getting Started with Hive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.1.1.5 Getting Started with Pig . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.1.1.6 Getting Started with HBase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.1.2 Quick Start - Single Node . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.1.2.1 Single Node - RHEL or CentOS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.1.2.2 Single Node - Ubuntu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.1.3 Quick Start - Small Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.1.3.1 M3 - RHEL or CentOS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.1.3.2 M3 - Ubuntu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
1.1.3.3 M5 - RHEL or CentOS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
1.1.3.4 M5 - Ubuntu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.1.4 Our Partners . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
1.1.4.1 Datameer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
1.1.4.2 Karmasphere . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
1.2 Installation Guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
1.2.1 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
1.2.1.1 PAM Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
1.2.1.2 Setting Up Disks for MapR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
1.2.2 Planning the Deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
1.2.2.1 Cluster Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
1.2.2.2 Setting Up NFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
1.2.3 Installing MapR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
1.2.4 Cluster Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
1.2.5 Integration with Other Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
1.2.5.1 Flume . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
1.2.5.2 HBase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
1.2.5.2.1 HBase Best Practices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
1.2.5.3 Hive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
1.2.5.4 Mahout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
1.2.5.5 MultiTool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
1.2.5.6 Pig . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
1.2.5.7 Using Whirr to Install on Amazon EC2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
1.2.6 Setting Up the Client . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
1.2.7 Uninstalling MapR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
1.3 User Guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
1.3.1 Volumes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
1.3.1.1 Mirrors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
1.3.1.2 Schedules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
1.3.1.3 Snapshots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
1.3.2 MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
1.3.2.1 Compiling Pipes Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
1.3.2.2 Secured TaskTracker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
1.3.2.3 Standalone Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
1.3.2.4 Tuning MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
1.3.2.4.1 ExpressLane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
1.3.3 Working with Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
1.3.3.1 Copying Data from Apache Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
1.3.3.2 Data Protection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
1.3.3.3 Provisioning Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
1.3.3.3.1 Provisioning for Capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
1.3.3.3.2 Provisioning for Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
1.3.4 Managing the Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
1.3.4.1 Balancers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
1.3.4.2 CLDB Failover . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
1.3.4.3 Cluster Upgrade . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
1.3.4.3.1 Manual Upgrade . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
1.3.4.3.2 Rolling Upgrade . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
1.3.4.4 Dial Home . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
1.3.4.5 Disks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
1.3.4.6 Monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
1.3.4.6.1 Alarms and Notifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
1.3.4.6.2 Ganglia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
1.3.4.6.3 Monitoring Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
1.3.4.6.4 Nagios Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
1.3.4.6.5 Service Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
1.3.4.7 Nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
1.3.4.7.1 Adding Nodes to a Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
1.3.4.7.2 Adding Roles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
1.3.4.7.3 Node Topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
1.3.4.7.4 Removing Roles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
1.3.4.8 Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
1.3.4.9 Shutting Down and Restarting a Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
1.3.5 Users and Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
1.3.5.1 Managing Permissions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
1.3.5.2 Managing Quotas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
1.3.6 Troubleshooting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
1.3.6.1 Disaster Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
1.3.6.2 Out of Memory Troubleshooting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
1.3.6.3 Troubleshooting Alarms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
1.3.7 Direct Access NFS™ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
1.3.8 Best Practices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
1.3.9 Programming Guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
1.4 Reference Guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
1.4.1 Release Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
1.4.1.1 Version 1.2 Release Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
1.4.1.1.1 Hadoop Compatibility in Version 1.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
1.4.1.2 Version 1.1.3 Release Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
1.4.1.3 Version 1.1.2 Release Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
1.4.1.4 Version 1.1.1 Release Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
1.4.1.5 Version 1.1 Release Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
1.4.1.5.1 Hadoop Compatibility in Version 1.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
1.4.1.6 Version 1.0 Release Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
1.4.1.6.1 Hadoop Compatibility in Version 1.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
1.4.1.7 Beta Release Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
1.4.1.8 Alpha Release Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
1.4.2 MapR Control System Reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
1.4.2.1 Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
1.4.2.2 MapR-FS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
1.4.2.3 NFS HA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
1.4.2.4 Alarms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
1.4.2.5 System Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
1.4.2.6 Other Views . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
1.4.2.6.1 CLDB View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
1.4.2.6.2 HBase View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
1.4.2.6.3 JobTracker View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
1.4.2.6.4 MapR Launchpad . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
1.4.2.6.5 Nagios View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
1.4.2.6.6 Terminal View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
1.4.3 API Reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
1.4.3.1 acl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
1.4.3.1.1 acl edit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
1.4.3.1.2 acl set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
1.4.3.1.3 acl show . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
1.4.3.2 alarm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
1.4.3.2.1 alarm clear . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
1.4.3.2.2 alarm clearall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
1.4.3.2.3 alarm config load . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
1.4.3.2.4 alarm config save . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
1.4.3.2.5 alarm list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
1.4.3.2.6 alarm names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
1.4.3.2.7 alarm raise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
1.4.3.3 config . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
1.4.3.3.1 config load . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
1.4.3.3.2 config save . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
1.4.3.4 dashboard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
1.4.3.4.1 dashboard info . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
1.4.3.5 dialhome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
1.4.3.5.1 dialhome ackdial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
1.4.3.5.2 dialhome disable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
1.4.3.5.3 dialhome enable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
1.4.3.5.4 dialhome lastdialed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
1.4.3.5.5 dialhome metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
1.4.3.5.6 dialhome status . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
1.4.3.6 disk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
1.4.3.6.1 disk add . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
1.4.3.6.2 disk list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
1.4.3.6.3 disk listall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
1.4.3.6.4 disk remove . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
1.4.3.7 entity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
1.4.3.7.1 entity exists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
1.4.3.7.2 entity info . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
1.4.3.7.3 entity list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
1.4.3.7.4 entity modify . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
1.4.3.8 license . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
1.4.3.8.1 license add . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
1.4.3.8.2 license addcrl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
1.4.3.8.3 license apps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
1.4.3.8.4 license list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
1.4.3.8.5 license listcrl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
1.4.3.8.6 license remove . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
1.4.3.8.7 license showid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
1.4.3.9 nagios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
1.4.3.9.1 nagios generate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
1.4.3.10 nfsmgmt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
1.4.3.10.1 nfsmgmt refreshexports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
1.4.3.11 node . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
1.4.3.11.1 node heatmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
1.4.3.11.2 node list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
1.4.3.11.3 node move . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
1.4.3.11.4 node path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
1.4.3.11.5 node remove . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
1.4.3.11.6 node services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
1.4.3.11.7 node topo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
1.4.3.12 schedule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
1.4.3.12.1 schedule create . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
1.4.3.12.2 schedule list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
1.4.3.12.3 schedule modify . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
1.4.3.12.4 schedule remove . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
1.4.3.13 service list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
1.4.3.14 setloglevel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
1.4.3.14.1 setloglevel cldb . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
1.4.3.14.2 setloglevel fileserver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
1.4.3.14.3 setloglevel hbmaster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
1.4.3.14.4 setloglevel hbregionserver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
1.4.3.14.5 setloglevel jobtracker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
1.4.3.14.6 setloglevel nfs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
1.4.3.14.7 setloglevel tasktracker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
1.4.3.15 trace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
1.4.3.15.1 trace dump . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
1.4.3.15.2 trace info . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
1.4.3.15.3 trace print . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
1.4.3.15.4 trace reset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
1.4.3.15.5 trace resize . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
1.4.3.15.6 trace setlevel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
1.4.3.15.7 trace setmode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
1.4.3.16 urls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
1.4.3.17 virtualip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
1.4.3.17.1 virtualip add . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
1.4.3.17.2 virtualip edit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
1.4.3.17.3 virtualip list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
1.4.3.17.4 virtualip remove . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
1.4.3.18 volume . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
1.4.3.18.1 volume create . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
1.4.3.18.2 volume dump create . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
1.4.3.18.3 volume dump restore . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
1.4.3.18.4 volume fixmountpath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
1.4.3.18.5 volume info . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
1.4.3.18.6 volume link create . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
1.4.3.18.7 volume link remove . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
1.4.3.18.8 volume list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
1.4.3.18.9 volume mirror push . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
1.4.3.18.10 volume mirror start . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
1.4.3.18.11 volume mirror stop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
1.4.3.18.12 volume modify . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
1.4.3.18.13 volume mount . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
1.4.3.18.14 volume move . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
1.4.3.18.15 volume remove . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
1.4.3.18.16 volume rename . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
1.4.3.18.17 volume snapshot create . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
1.4.3.18.18 volume snapshot list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
1.4.3.18.19 volume snapshot preserve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
1.4.3.18.20 volume snapshot remove . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
1.4.3.18.21 volume unmount . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
1.4.4 Scripts and Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
1.4.4.1 configure.sh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
1.4.4.2 disksetup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
1.4.4.3 fsck . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
1.4.4.4 Hadoop MFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
1.4.4.5 mapr-support-collect.sh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
1.4.4.6 rollingupgrade.sh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
1.4.5 Configuration Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
1.4.5.1 core-site.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
1.4.5.2 disktab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
1.4.5.3 hadoop-metrics.properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
1.4.5.4 mapr-clusters.conf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
1.4.5.5 mapred-default.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
1.4.5.6 mapred-site.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
1.4.5.7 mfs.conf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
1.4.5.8 taskcontroller.cfg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
1.4.5.9 warden.conf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
1.4.6 Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
Home

Start Here
Welcome to MapR! If you are not sure how to get started, here are a few places to find the information you are looking for:

New! Quick Start - MapR Virtual Machine - A single-node cluster that's ready to roll, right out of the box!
Quick Start - Single Node - Set up a single-node Hadoop cluster and try some of the features that set MapR Hadoop apart
Quick Start - Small Cluster - Set up a Hadoop cluster with a small to moderate number of nodes
Installation Guide - Learn how to set up a production cluster, large or small
User Guide - Read more about what you can do with a MapR cluster

Welcome to MapR
MapR Distribution for Apache Hadoop is the easiest, most dependable, and fastest Hadoop distribution on the planet. It is the only Hadoop
distribution that allows direct data input and output via MapR Direct Access NFS™ with realtime analytics, and the first to provide true High
Availability (HA) at all levels. MapR introduces logical volumes to Hadoop. A volume is a way to group data and apply policy across an entire data
set. MapR provides hardware status and control with the MapR Control System, a comprehensive UI including a Heatmap™ that displays the
health of the entire cluster at a glance. Read on to learn how the unique features of MapR provide the highest-performance, lowest-cost
Hadoop available.

To get started right away, read the Quick Start guides:

New Quick Start - MapR Virtual Machine
Quick Start - Single Node
Quick Start - Small Cluster

To learn more about MapR, read on!

Ease of Use
With MapR, it is easy to run Hadoop jobs reliably, while isolating resources between different departments or jobs, applying data and performance
policies, and tracking resource usage and job performance:

1. Create a volume and set policy. The MapR Control System makes it simple to set up a volume and assign granular control to users or
groups. Use replication, mirroring, and snapshots for data protection, isolation, or performance.
2. Provision resources. You can limit the size of data on a volume, or place the volume on specific racks or nodes for performance or
protection.
3. Run the Hadoop job normally. Realtime Hadoop™ analytics let you track resource usage and job performance, while Direct Access
NFS™ gives you easy data input and direct access to the results.

MapR lets you control data access and placement, so that multiple concurrent Hadoop jobs can safely share the cluster.
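
The same workflow can also be scripted from a cluster node with the maprcli command-line tool. The following is a minimal sketch, not a complete recipe: the volume name, mount path, topology, and job jar are examples, and the available options depend on your MapR version (see volume create in the API Reference).

    # 1. Create a volume with a mount path and restrict it to a rack topology (example values)
    maprcli volume create -name projectA -path /projects/projectA -topology /data/rack1

    # 2. Run a Hadoop job against the volume exactly as you would on any Hadoop cluster
    hadoop jar my-job.jar com.example.MyJob /projects/projectA/in /projects/projectA/out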

With MapR, you can mount the cluster on any server or client and have your applications write data and log files directly into the cluster, instead
of the batch processing model of the past. You do not have to wait for a file to be closed before reading it; you can tail a file as it is being written.
Direct Access NFS™ even makes it possible to use standard shell scripts to work with Hadoop data directly.
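
For example, on a Linux client you might mount the cluster and follow a log file while an application is still writing it. The node name, cluster name, and paths below are placeholders; substitute the values for your own cluster.

    # Mount the cluster via NFS (node name and mount point are examples)
    sudo mkdir -p /mapr
    sudo mount -o nolock mapr-nfs-node:/mapr /mapr

    # Write data directly into the cluster with ordinary shell tools...
    cp application.log /mapr/my.cluster.com/logs/

    # ...and read a file while it is still being written
    tail -f /mapr/my.cluster.com/logs/application.log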

Provisioning resources is simple. You can easily create a volume for a project or department in a few clicks. MapR integrates with NIS and LDAP,
making it easy to manage users and groups. The MapR Control System makes it a breeze to assign user or group quotas, to limit how much data
a user or group can write; or volume quotas, to limit the size of a volume. You can assign topology to a volume, to limit it to a specific rack or set
of nodes. Setting recovery time objective (RTO) and recovery point objective (RPO) for a data set is a simple matter of scheduling snapshots and
mirrors on a volume through the MapR Control System. You can set read and write permissions on volumes directly via NFS or using hadoop fs
commands, and volumes provide administrative delegation through ACLs; for example, through the MapR Control System you can control who
can mount, unmount, snapshot, or mirror a volume.
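
As a rough illustration, here is how a couple of those policies might look from the command line. The commands are a sketch only; the volume and user names are invented, and the exact parameters are documented under volume modify and acl edit in the API Reference.

    # Limit how large the volume can grow (volume quota)
    maprcli volume modify -name projectA -quota 500G

    # Delegate administration of the volume by granting a user full control in its ACL
    maprcli acl edit -type volume -name projectA -user jsmith:fc

    # Volume contents remain visible through the usual file tools
    hadoop fs -ls /projects/projectA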

At its heart, MapR is a Hadoop distribution. You can run Hadoop jobs the way you always have.

MapR has partnered with Datameer, which provides a self-service Business Intelligence platform that runs best on the MapR Distribution for
Apache Hadoop. Your download of MapR includes a 30-day trial version of Datameer Analytics Solution (DAS), which provides spreadsheet-style
analytics, ETL and data visualization capabilities.

For more information:

Read about Provisioning Applications
Learn about Direct Access NFS™
Check out the Datameer Trial

Dependability
With clusters growing to thousands of nodes, hardware failures are inevitable even with the most reliable machines in place. MapR Distribution for
Apache Hadoop has been designed from the ground up to tolerate hardware failure seamlessly.

MapR is the first Hadoop distribution to provide true HA and failover at all levels, including a MapR Distributed HA NameNode™. If a disk or node
in the cluster fails, MapR automatically restarts any affected processes on another node without requiring administrative intervention. The HA
JobTracker ensures that any tasks interrupted by a node or disk failure are re-started on another TaskTracker node. In the event of any failure,
the job's completed task state is preserved and no tasks are lost. For additional data reliability, every bit of data on the wire is compressed and
CRC-checked.

With volumes, you can control access to data, set replication factor, and place specific data sets on specific racks or nodes for performance or
data protection. Volumes control data access to specific users or groups with Linux-style permissions that integrate with existing LDAP and NIS
directories. Volumes can be size-limited with volume quotas to prevent data overruns from using excessive storage capacity. One of the most
powerful aspects of the volume concept is the ways in which a volume provides data protection:

To enable point-in-time recovery and easy backups, volumes have manual and policy-based snapshot capability.
For true business continuity, you can manually or automatically mirror volumes and synchronize them between clusters or datacenters to
enable easy disaster recovery.
You can set volume read/write permission and delegate administrative functions, to control access to data.

Volumes can be exported with MapR Direct Access NFS™ with HA, allowing data to be read and written directly to Hadoop without the need for
temporary storage or log collection. You can load-balance across NFS nodes; clients connecting to different nodes see the same view of the
cluster.
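
The snapshot and mirror operations behind these protections are also available from the command line. This is a sketch with example names; see volume snapshot create and volume mirror start in the API Reference for the authoritative syntax.

    # Take a point-in-time snapshot of a volume
    maprcli volume snapshot create -volume projectA -snapshotname projectA-nightly

    # Start synchronizing a mirror volume from its source
    maprcli volume mirror start -name projectA-mirror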

The MapR Control System provides powerful hardware insight down to the node level, as well as complete control of users, volumes, quotas,
mirroring, and snapshots. Filterable alarms and notifications provide immediate warnings about hardware failures or other conditions that require
attention, allowing a cluster administrator to detect and resolve problems quickly.

For more information:

Take a look at the Heatmap™
Learn about Volumes, Snapshots, and Mirroring
Explore Data Protection scenarios

Performance
MapR Distribution for Apache Hadoop achieves up to three times the performance of any other Hadoop distribution, and can reduce your
equipment costs by half.

MapR Direct Shuffle uses the Distributed NameNode™ to improve Reduce phase performance drastically. Unlike Hadoop distributions that use
the local filesystem for shuffle and HTTP to transport shuffle data, MapR shuffle data is readable directly from anywhere on the network. MapR
stores data with Lockless Storage Services™, a sharded system that eliminates contention and overhead from data transport and retrieval.
Automatic, transparent client-side compression reduces network overhead and reduces footprint on disk, while direct block device I/O provides
throughput at hardware speed with no additional overhead. As an additional performance boost, with MapR Realtime Hadoop™, you can read
files while they are still being written.

MapR gives you ways to tune the performance of your cluster. Using mirrors, you can load-balance reads on highly-accessed data to alleviate
bottlenecks and improve read bandwidth to multiple users. You can run MapR Direct Access NFS™ on many nodes – all nodes in the cluster, if
desired – and load-balance reads and writes across the entire cluster. Volume topology helps you further tune performance by allowing you to
place resource-intensive Hadoop jobs and high-activity data on the fastest machines in the cluster.
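
For instance, a volume holding high-activity data can be placed on the fastest machines by assigning it a narrower topology. The command below is only a sketch; the volume name and topology path are examples, and volume move in the API Reference describes the actual parameters.

    # Move a hot volume to a topology containing only the fastest nodes (example values)
    maprcli volume move -name hotdata -topology /fast-rack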

For more information:

Read about Provisioning for Performance

Get Started
Now that you know a bit about how the features of MapR Distribution for Apache Hadoop work, take a quick tour to see for yourself how they can
work for you:

If you would like to give MapR a try on a single machine, see MapR in Minutes: CentOS or RHEL - Ubuntu
To explore cluster installation scenarios, see Planning the Deployment
For more about provisioning, see Provisioning Applications
For more about data policy, see Working with Data
Quick Start - MapR Virtual Machine
The MapR Virtual Machine is a fully-functional single-node Hadoop cluster capable of running MapReduce programs and working with
applications like Hive, Pig, and HBase. You can try the MapR Virtual Machine on nearly any 64-bit computer by downloading the free VMware
Player.

The MapR Virtual Machine desktop contains the following icons:

MapR Control System - navigates to the graphical control system for managing the cluster
MapR User Guide - navigates to the MapR online documentation
MapR NFS - navigates to the NFS-mounted cluster storage layer

Ready for a tour? The following documents will help you get started:

Installing the MapR Virtual Machine
A Tour of the MapR Virtual Machine
Working with Snapshots, Mirrors, and Schedules
Getting Started with Hive
Getting Started with Pig
Getting Started with HBase

Installing the MapR Virtual Machine


The MapR Virtual Machine runs on VMware Player, a free desktop application that lets you run a virtual machine on a Windows or Linux PC. You
can download VMware Player from the VMware web site. To install the VMware Player, see the VMware documentation.

Use of VMware Player is subject to the VMware Player end user license terms, and VMware provides no support for VMware Player. For self-help
resources, see the VMware Player FAQ.

Requirements

The MapR Virtual Machine requires at least 20 GB free hard disk space and 2 GB of RAM on the host system. You will see higher performance
with more RAM and more free hard disk space.

To run the MapR Virtual Machine, the host system must have one of the following 64-bit x86 architectures:

A 1.3 GHz or faster AMD CPU with segment-limit support in long mode
A 1.3 GHz or faster Intel CPU with VT-x support

If you have an Intel CPU with VT-x support, you must verify that VT-x support is enabled in the host system BIOS. The BIOS settings that must be
enabled for VT-x support vary depending on the system vendor. See the
VMware knowledge base article at http://kb.vmware.com/kb/1003944 for information about how to determine if VT-x support is enabled.
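
As a quick, informal check on a Linux host, you can look for the virtualization flags in /proc/cpuinfo; a non-zero count means the CPU advertises Intel VT-x (vmx) or AMD-V (svm), but the BIOS setting itself still has to be verified as described in the VMware article.

    # Counts the CPU flags that indicate hardware virtualization support
    egrep -c '(vmx|svm)' /proc/cpuinfo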

Installing and Running the MapR Virtual Machine

1. Download the archive file vm_pathvm_filename
2. Extract the archive to your home directory, or another directory of your choosing.
3. Run the VMware player.
4. Click Open a Virtual Machine and navigate to the directory where you extracted the archive.
5. The first time the virtual machine starts, you must accept the agreement and choose whether to enable the MapR Dial Home service.
6. At the login screen, click MapR Technologies and enter the password mapr (all lowercase).
7. Once the virtual machine is fully started, you can proceed with the tour.

A Tour of the MapR Virtual Machine


In this tutorial, you'll get familiar with the MapR Control System dashboard, learn how to get data into the cluster (and organized), and run some
MapReduce jobs on Hadoop. You can read the following sections in order or browse them as you explore on your own:

The Dashboard
Working with Volumes
Exploring NFS
Running a MapReduce Job

Once you feel comfortable working with the MapR Virtual Machine, you can move on to more advanced topics:

Working with Snapshots, Mirrors, and Schedules
Getting Started with Hive
Getting Started with Pig
Getting Started with HBase

The Dashboard
The dashboard, the main screen in the MapR Control System, shows the health of the cluster at a glance. To get to the dashboard, click the
MapR Control System link on the desktop of the MapR Virtual Machine and log on with the username root and the password mapr. If it is your
first time using the MapR Control System, you will need to accept the terms of the license agreement to proceed.

Parts of the dashboard:

To the left, the navigation pane lets you navigate to other views that display more detailed information about nodes in the cluster,
volumes in the MapR Storage Services layer, NFS settings, Alarms, and System Settings.
In the center, the main dashboard view displays the nodes in a "heat map" that uses color to indicate node health--since there is only one
node in the MapR Virtual Machine cluster, there is a single green square.
To the right, information about cluster usage is displayed.

Try clicking the Health button at the top right of the heat map. You will see a number of different kinds of information that can be displayed in the
heat map.
Try clicking the green square representing the node. You will see more detailed information about the status of the node.
By the way, the browser is pre-configured with the following bookmarks, which you will find useful as you gain experience with Hadoop,
MapReduce, and the MapR Control System:

MapR Control System
JobTracker Status
TaskTracker Status
HBase Master
CLDB Status

Don't worry if you aren't sure what those are yet.

Working with Volumes


MapR provides volumes as a way to organize data into groups, so that you can manage your data and apply policy all at once instead of file by
file. Think of a volume as being similar to a huge hard drive---it can be mounted or unmounted, belong to a specific department or user, and have
permissions set as a whole or on any directory or file within. In this section, you will create a volume that you can use for later parts of the tutorial.

Create a volume:

1. In the Navigation pane, click Volumes in the MapR-FS group.


2. Click the New Volume button to display the New Volume dialog.
3. For the Volume Type, select Standard Volume.
4. Type the name MyVolume in the Volume Name field.
5. Type the path /myvolume in the Mount Path field.
6. Select /default-rack in the Topology field.
7. Scroll to the bottom and click OK to create the volume.

Notice that the mount path and the volume name do not have to match. The volume name is a permanent identifier for the volume, and the mount
path determines the file path by which the volume is accessed. The topology determines the racks available for storing the volume and its
replicas.
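
If you prefer the command line, volumes can also be created with the maprcli tool. The following is a minimal sketch, assuming the MapR software is installed under /opt/mapr and you run it as a user with cluster administration permission; the names match the ones used in this tutorial:

# Create a standard volume named MyVolume, mounted at /myvolume, in the default rack
/opt/mapr/bin/maprcli volume create -name MyVolume -path /myvolume -topology /default-rack

# Confirm that the volume exists
/opt/mapr/bin/maprcli volume info -name MyVolume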

In the next step, you'll see how to get data into the cluster with NFS.

Exploring NFS
With MapR, you can mount the cluster via NFS, and browse it as if it were a filesystem. Try clicking the MapR NFS icon on the MapR Virtual
Machine desktop.
When you navigate to mapr > my.cluster.com you can see the volume myvolume that you created in the previous example.

Try copying some files to the volume; a good place to start is with the files constitution.txt and sample-table.txt, which are attached to this
page. Both are text files, which will be useful when running the Word Count example later.

To download them, select Attachments from the Tools menu to the top right of this document (the one you are reading now) and then
click the links for those two files.
Once they are downloaded, you can add them to the cluster.
Since you'll be using them as input to MapReduce jobs in a few minutes, create a directory called in inside the volume myvolume and drag
the files there. If you do not have a volume mounted at /myvolume on the cluster, use the instructions in Working with Volumes above to
create it.

By the way, if you want to verify that you are really copying the files into the Hadoop cluster, you can open a terminal on the MapR Virtual
Machine (select Applications > Accessories > Terminal) and type hadoop fs -ls /myvolume/in to see that the files are there.
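
Put together, the copy-and-verify sequence looks roughly like the sketch below. It assumes the cluster is mounted via NFS at /mapr/my.cluster.com (the mount point used elsewhere in this document) and that the two attachments were saved to the home directory; adjust the paths to wherever you saved the files:

# Create the input directory inside the volume through the NFS mount
mkdir -p /mapr/my.cluster.com/myvolume/in

# Copy the downloaded text files into the cluster
cp ~/constitution.txt ~/sample-table.txt /mapr/my.cluster.com/myvolume/in/

# Verify through Hadoop that the files are in the cluster
hadoop fs -ls /myvolume/in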

The Terminal
When you run MapReduce jobs, and when you use Hive, Pig, or HBase, you'll be working with the Linux terminal. Open a terminal window by
selecting Applications > Accessories > Terminal.
Running a MapReduce Job
In this section, you will run the well-known Word Count MapReduce example. You'll need one or more text files (like the ones you copied to the
cluster in the previous section). The Word Count program reads files from an input directory, counts the words, and writes the results of the job to
files in an output directory. For this exercise we will use /myvolume/in for the input, and /myvolume/out for the output. The input directory
must exist and must contain the input files before running the job; the output directory must not exist, as the Word Count example creates it.

Try MapReduce

1. On the MapR Virtual Machine, open a terminal (select Applications > Accessories > Terminal)
2. Copy a couple of text files into the cluster. If you are not sure how, see the previous section. Create the directory /myvolume/in and put
the files there.
3. Type the following line to run the Word Count job:

hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount /myvolume/in /myvolume/out

4. Look in the newly-created /myvolume/out for a file called part-r-00000 containing the results.
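
To view the results from the terminal, you can list the output directory and print the part file with the hadoop command line (a small sketch using the paths from the steps above):

# List the output directory created by the job
hadoop fs -ls /myvolume/out

# Print the word counts
hadoop fs -cat /myvolume/out/part-r-00000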

That's it! If you're ready, you can try some more advanced exercises:

Working with Snapshots, Mirrors, and Schedules


Getting Started with Hive
Getting Started with Pig
Getting Started with HBase

Working with Snapshots, Mirrors, and Schedules


Snapshots, mirrors, and schedules help you protect your data from user error, make backup copies, and in larger clusters provide load balancing
for highly-accessed data. These features are available under the M5 license.

If you are working with an M5 virtual machine, you can use this section to get acquainted with snapshots, mirrors, and schedules.
If you are working with the M3 virtual machine, you should proceed to the sections Getting Started with Hive, Getting Started with Pig, and
Getting Started with HBase.

Taking Snapshots
A snapshot is a point-in-time image of a volume that protects data against user error. Although other strategies such as replication and mirroring
provide good protection, they cannot protect against accidental file deletion or corruption. You can create a snapshot of a volume manually before
embarking on risky jobs or operations, or set a snapshot schedule on the volume to ensure that you can always roll back to specific points in time.

Try creating a snapshot manually:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Select the checkbox beside the volume MyVolume (which you created during the previous tutorial).
3. Expand the MapR Virtual Machine window or scroll the browser to the right until the New Snapshot button is visible.
4. Click New Snapshot to display the Snapshot Name dialog.
5. Type a name for the new snapshot in the Name field.
6. Click OK to create the snapshot.
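
Snapshots can also be taken from the terminal with the maprcli tool. The following is a minimal sketch, assuming the MapR software is installed under /opt/mapr and the volume MyVolume from this tutorial; the snapshot name is arbitrary:

# Take a manual snapshot of MyVolume
/opt/mapr/bin/maprcli volume snapshot create -volume MyVolume -snapshotname before-risky-job

# The new snapshot appears under the volume's hidden .snapshot directory
hadoop fs -ls /myvolume/.snapshot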

Try scheduling snapshots:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Display the Volume Properties dialog by clicking the volume name MyVolume (which you created during the previous tutorial), or by
selecting the checkbox beside MyVolume and clicking the Properties button.
3. In the Replication and Snapshot Scheduling section, choose a schedule from the Snapshot Schedule dropdown menu.
4. Click Modify Volume to save changes to the volume.

Viewing Snapshot Contents

All the snapshots of a volume are available in a directory called .snapshot at the volume's top level. For example, the snapshots of the volume
MyVolume, which is mounted at /myvolume, are available in the /myvolume/.snapshot directory. You can view the snapshots using the
hadoop fs -ls command or via NFS. If you list the contents of the top-level directory in the volume, you will not see .snapshot — but it's
there.

To view the snapshots for /myvolume on the command line, type hadoop fs -ls /myvolume/.snapshot
To view the snapshots for /myvolume in the file browser via NFS, navigate to /myvolume and use CTRL-L to specify an explicit path,
then add .snapshot to the end.

Creating Mirrors

A mirror is a full read-only copy of a volume, which you can use for backups, data transfer to another cluster, or load balancing. A mirror is itself a
type of volume; after you create a mirror volume, you can sync it with its source volume manually or set a schedule for automatic sync.

Try creating a mirror volume:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Click the New Volume button to display the New Volume dialog.
3. Select the Local Mirror Volume radio button at the top of the dialog.
4. Type my-mirror in the Mirror Name field.
5. Type MyVolume in the Source Volume Name field.
6. Type /my-mirror in the Mount Path field.
7. To schedule mirror sync, select a schedule from the Mirror Update Schedule dropdown menu.
8. Click OK to create the volume.

You can also sync a mirror manually; it works just like taking a manual snapshot. View the list of volumes, select the checkbox next to a mirror
volume, and click Start Mirroring.

Working with Schedules

The MapR Virtual Machine comes pre-loaded with a few schedules, but you can create your own as well. Once you have created a schedule, you
can use it for snapshots and mirrors on any volume. Each schedule contains one or more rules that determine when to trigger a snapshot or a
mirror sync, and how long to keep snapshot data resulting from the rule.

Try creating a schedule:


1. In the Navigation pane, expand the MapR-FS group and click the Schedules view.
2. Click New Schedule.
3. Type My Schedule in the Schedule Name field.
4. Define a schedule rule in the Schedule Rules section:
a. From the first dropdown menu, select Every 5 min
b. Use the Retain For field to specify how long the data is to be preserved. Type 1 in the box, and select hour(s) from the
dropdown menu.
5. Click Save Schedule to create the schedule.

You can use the schedule "My Schedule" to perform a snapshot or mirror operation automatically every 5 minutes. If you use "My Schedule" to
automate snapshots, they will be preserved for one hour (you will have 12 snapshots of the volume, on average).

Next Steps
If you haven't already, try the following tutorials:

Getting Started with Hive


Getting Started with Pig
Getting Started with HBase

Getting Started with Hive

Hive is a data warehouse system for Hadoop that uses a SQL-like language called HiveQL to query structured data stored in a distributed
filesystem. (For more information about Hive, see the Hive project page.)

You'll be working with Hive from the Linux shell. To use Hive, open a terminal by selecting Applications > Accessories > Terminal (see A Tour
of the MapR Virtual Machine).

In this tutorial, you'll create a Hive table on the cluster, load data from a tab-delimited text file, and run a couple of basic queries against the table.

First, make sure you have downloaded the sample table: On the page A Tour of the MapR Virtual Machine, select Tools > Attachments
and right-click on sample-table.txt to save it. We'll be loading the file from the local filesystem on the MapR Virtual Machine (not the
cluster storage layer) so save the file in the Home directory (/home/mapr).

Take a look at the source data

First, take a look at the contents of the file using the terminal:

1. Make sure you are in the Home directory where you saved sample-table.txt (type cd ~ if you are not sure).
2. Type cat sample-table.txt to display the following output.

mapr@mapr-desktop:~$ cat sample-table.txt


1320352532 1001 http://www.mapr.com/doc http://www.mapr.com 192.168.10.1
1320352533 1002 http://www.mapr.com http://www.example.com 192.168.10.10
1320352546 1001 http://www.mapr.com http://www.mapr.com/doc 192.168.10.1

Notice that the file consists of only three lines, each of which contains a row of data fields separated by the TAB character. The data in the file
represents a web log.

Create a table in Hive and load the source data:

1. Type the following command to start the Hive shell, using tab-completion to expand the <version>:

/opt/mapr/hive/hive-<version>/bin/hive

2. At the hive> prompt, type the following command to create the table:

CREATE TABLE web_log(viewTime INT, userid BIGINT, url STRING, referrer STRING, ip STRING) ROW
FORMAT DELIMITED FIELDS TERMINATED BY '\t';

3. Type the following command to load the data from sample-table.txt into the table:

LOAD DATA LOCAL INPATH '/home/mapr/sample-table.txt' INTO TABLE web_log;

Run basic queries against the table:

Try the simplest query, one that displays all the data in the table:

SELECT web_log.* FROM web_log;

This query would be inadvisable with a large table, but with the small sample table it returns very quickly.

Try a simple SELECT to extract only data that matches a desired string:

SELECT web_log.* FROM web_log WHERE web_log.url LIKE '%doc';

This query launches a MapReduce job to filter the data.
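
Hive can also run a single statement non-interactively from the Linux shell, which is handy for quick checks and scripts. A small sketch, using the same <version> placeholder as above:

# Run one HiveQL statement without opening the Hive shell
/opt/mapr/hive/hive-<version>/bin/hive -e "SELECT COUNT(1) FROM web_log;"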

Getting Started with Pig


Apache Pig is a platform for parallelized analysis of large data sets via a language called Pig Latin. (For more information about Pig, see the Pig
project page.)

You'll be working with Pig from the Linux shell. Open a terminal by selecting Applications > Accessories > Terminal (see A Tour of the MapR
Virtual Machine).

In this tutorial, we'll use Pig to run a MapReduce job that counts the words in the file /myvolume/in/constitution.txt on the cluster, and
stores the results in /myvolume/wordcount.txt.

First, make sure you have downloaded the file: On the page A Tour of the MapR Virtual Machine, select Tools > Attachments and
right-click constitution.txt to save it.
Make sure the file is loaded onto the cluster, in the directory /myvolume/in. If you are not sure how, look at the NFS tutorial on A Tour
of the MapR Virtual Machine.

Open a Pig shell and get started:

1. In the terminal, type the command pig to start the Pig shell.
2. At the grunt> prompt, type the following lines (press ENTER after each):

A = LOAD '/myvolume/in/constitution.txt' USING TextLoader() AS (words:chararray);

B = FOREACH A GENERATE FLATTEN(TOKENIZE(*));

C = GROUP B BY $0;

D = FOREACH C GENERATE group, COUNT(B);

STORE D INTO '/myvolume/wordcount.txt';

After you type the last line, Pig starts a MapReduce job to count the words in the file constitution.txt.

3. When the MapReduce job is complete, type quit to exit the Pig shell and take a look at the contents of /myvolume/wordcount.txt to see
the results.
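
Note that Pig writes the output of the STORE statement as a directory containing one or more part files, so a wildcard is the simplest way to print everything from the Linux shell (a small sketch):

# List the output directory written by the STORE statement
hadoop fs -ls /myvolume/wordcount.txt

# Print all part files (word and count pairs)
hadoop fs -cat /myvolume/wordcount.txt/part-*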

Getting Started with HBase


HBase is the Hadoop database, designed to provide random, realtime read/write access to very large tables — billions of rows and millions of
columns — on clusters of commodity hardware. HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's
Bigtable. (For more information about HBase, see the HBase project page.)

You'll be working with HBase from the Linux shell. Open a terminal by selecting Applications > Accessories > Terminal (see A Tour of the
MapR Virtual Machine).

In this tutorial, we'll create an HBase table on the cluster, enter some data, query the table, then clean up the data and exit.

HBase tables are organized by column, rather than by row. Furthermore, the columns are organized in groups called column families. When
creating an HBase table, you must define the column families before inserting any data. Column families should not be changed often, nor should
there be too many of them, so it is important to think carefully about what column families will be useful for your particular data. Each column
family, however, can contain a very large number of columns. Columns are named using the format family:qualifier.

Unlike columns in a relational database, which reserve empty space for columns with no values, HBase columns simply don't exist for rows where
they have no values. This not only saves space, but means that different rows need not have the same columns; you can use whatever columns
you need for your data on a per-row basis.

Create a table in HBase:

1. Start the HBase shell by typing the following command:

/opt/mapr/hbase/hbase-0.90.4/bin/hbase shell

2. Create a table called weblog with one column family named stats:

create 'weblog', 'stats'

3. Verify the table creation by listing everything:

list

4. Add a test value to the daily column in the stats column family for row 1:

put 'weblog', 'row1', 'stats:daily', 'test-daily-value'

5. Add a test value to the weekly column in the stats column family for row 1:

put 'weblog', 'row1', 'stats:weekly', 'test-weekly-value'

6. Add a test value to the weekly column in the stats column family for row 2:

put 'weblog', 'row2', 'stats:weekly', 'test-weekly-value'

7. Type scan 'weblog' to display the contents of the table. Sample output:

ROW COLUMN+CELL
row1 column=stats:daily, timestamp=1321296699190, value=test-daily-value

row1 column=stats:weekly, timestamp=1321296715892, value=test-weekly-value

row2 column=stats:weekly, timestamp=1321296787444, value=test-weekly-value

2 row(s) in 0.0440 seconds

8. Type get 'weblog', 'row1' to display the contents of row 1. Sample output:

COLUMN CELL
stats:daily timestamp=1321296699190, value=test-daily-value
stats:weekly timestamp=1321296715892, value=test-weekly-value
2 row(s) in 0.0330 seconds

9. Type disable 'weblog' to disable the table.


10. Type drop 'weblog' to drop the table and delete all data.
11. Type exit to exit the HBase shell.
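
The same commands can also be run non-interactively by piping them into the HBase shell, which is convenient once you move beyond experimentation. A small sketch, using the shell path shown in step 1:

# Run a single HBase shell command from the Linux shell
echo "list" | /opt/mapr/hbase/hbase-0.90.4/bin/hbase shell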

Quick Start - Single Node


You're just a few minutes away from installing and running a real single-node MapR Hadoop cluster. To get started, choose your operating
system:

Red Hat Enterprise Linux (RHEL) or CentOS


Ubuntu

Single Node - RHEL or CentOS


Use the following steps to install a simple single-node cluster with a basic set of services. These instructions bring up a single node with the
following roles:

CLDB
FileServer
JobTracker
NFS
TaskTracker
WebServer
ZooKeeper

Step 1. Requirements
64-bit Red Hat 5.4 or greater, or 64-bit CentOS 5.4 or greater
RAM: 4 GB or more
At least one free unmounted drive or partition, 500 GB or more
At least 10 GB of free space on the operating system partition
Java JDK version 1.6.0_24 (not JRE)
The root password, or sudo privileges
A Linux user chosen to have administrative privileges on the cluster
Make sure the user has a password (using sudo passwd <user> for example)

If Java is already installed, check which versions of Java are installed:

java -version

If JDK 6 is installed, the output will include a version number starting with 1.6, and then below that the text Java(TM). Example:

java version "1.6.0_24"


Java(TM) SE Runtime Environment (build 1.6.0_24-b07)

If necessary, install Sun Java JDK 6. Once Sun Java JDK 6 is installed, use update-alternatives to make sure it is the default Java:

sudo update-alternatives --config java

This procedure assumes you have free, unmounted physical partitions or hard disks for use by MapR. If you are not sure,
please read Setting Up Disks for MapR.

Create a text file /tmp/disks.txt listing disks and partitions for use by MapR. Each line lists a single disk, or partitions on a single
disk. Example:
/dev/sdb
/dev/sdc1 /dev/sdc2 /dev/sdc4
/dev/sdd

Later, when you run disksetup to format the disks, specify the disks and partitions file. Example:

disksetup -F /tmp/disks.txt
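
For example, the sample disks.txt file shown above could be created in one step with a shell here-document; the device names are only the sample ones from this page, so substitute your own unmounted disks and partitions:

# Write the list of disks and partitions for MapR to /tmp/disks.txt
cat > /tmp/disks.txt <<EOF
/dev/sdb
/dev/sdc1 /dev/sdc2 /dev/sdc4
/dev/sdd
EOF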

To get the most out of this tutorial, be sure to register for your free M3 license after installing the MapR software.

Step 2. Add the MapR Repository


1. Change to the root user (or use sudo for the following commands).
2. Create a text file called maprtech.repo in the directory /etc/yum.repos.d/ with the following contents:
[maprtech]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/v1.1.3/redhat/
enabled=1
gpgcheck=0
protect=1
To install a previous release, see the Release Notes for the correct path to use in the baseurl parameter.
3. If your connection to the Internet is through a proxy server, you must set the http_proxy environment variable before installation:

http_proxy=http://<host>:<port>
export http_proxy

If you don't have Internet connectivity, do one of the following:

Set up a local repository.


Download and install packages manually.

Step 3. Install the Software


Before installing the software, make sure you have created the file /tmp/disks.txt containing a list of disks and partitions for use by MapR.

1. Change to the root user (or use sudo) and use the following commands to install a MapR single-node cluster:

yum install mapr-single-node
/opt/mapr/server/disksetup -F /tmp/disks.txt
/etc/init.d/mapr-zookeeper start
/etc/init.d/mapr-warden start

2. Specify an administrative user for the cluster by running the following command (as root or with sudo), replacing <user> with your
Linux username:

/opt/mapr/bin/maprcli acl edit -type cluster -user <user>:fc

If you are running MapR Single Node on a laptop or a computer that you reboot frequently, you might want to use the command
chkconfig --del mapr-warden (as root or with sudo) to prevent the cluster from starting automatically when the
computer restarts. You can always start the cluster manually by running /etc/init.d/mapr-zookeeper start and
/etc/init.d/mapr-warden start (as shown above).
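
If you later want the cluster to start at boot again, the change can be reversed. A small sketch, assuming the standard init script installed by the MapR packages:

# Re-register the warden init script so the cluster starts automatically at boot
chkconfig --add mapr-warden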


Step 4. Check out the MapR Control System


After a few minutes the MapR Control System starts.

The MapR Control System gives you power at a glance over the entire cluster. With the MapR Control System, you can see immediately which
nodes are healthy; how much bandwidth, disk space, and CPU time are in use; and the status of MapReduce jobs. But there's more: the MapR
Control System lets you manage your data in ways that were previously impossible with Hadoop. During the next few exercises, you will learn a
few of the easy, powerful capabilities of MapR.

1. Open a browser and navigate to the MapR Control System on the server where you installed MapR. The URL is
https://<host>:8443. Example:

https://localhost:8443

Your computer won't have an HTTPS certificate yet, so the browser will warn you that the connection is not trustworthy. You can safely
ignore the warning this time.
2. The first time MapR starts, you must accept the agreement and choose whether to enable the MapR Dial Home service.
3. Log in using the Linux username and password you specified as the administrative user in Step 3.

Welcome to the MapR Control System, which displays information about the health and performance of the cluster. Notice the navigation pane to
the left and the larger view pane to the right. The navigation pane lets you quickly switch between views, to display different types of information.

When the MapR Control System starts, the first view displayed is the Dashboard. The MapR single node is represented as a square. The square
is not green, because the node's health is not perfect: the license has not yet been applied, so the NFS service has failed to start, resulting in an
alarm on the node. Later, you'll see how to obtain the license and mount the cluster with NFS.

Click the node to display the Node Properties view. Notice the types of information available about the node:

Alarms - timely information about any problems on the node


Machine Performance - resource usage
General Information - physical topology of the node and the times of the most recent heartbeats
MapReduce - map and reduce slot usage
Manage Node Services - services on the node
MapR-FS and Available Disks - disks used for storage by MapR
System Disks - disks available to add to MapR storage

Step 5. Get a License


1. In the navigation pane of the MapR Control System, expand the System Settings group and click MapR Licenses to display the MapR
License Management dialog.
2. Click Add Licenses via Web.
3. If the cluster is already registered, the license is applied automatically. Otherwise, click OK to register the cluster on MapR.com and
follow the instructions there.
If the cluster is not yet registered, the message "Cluster not found" appears and the browser is redirected to a registration page.
On the registration page, create an account and log in.
On the Register Cluster page, choose M3 and click Register.
When the message "Cluster Registered" appears, click Return to your MapR Cluster UI.

Step 6. Start Working with Volumes

MapR provides volumes as a way to organize data into groups, so that you can manage your data and apply policy all at once instead of file by
file. Think of a volume as being similar to a huge hard drive---it can be mounted or unmounted, belong to a specific department or user, and have
permissions set as a whole or on any directory or file within. In this section, you will create a volume that you can use for later parts of the tutorial.

Create a volume:

1. In the Navigation pane, click Volumes in the MapR-FS group.


2. Click the New Volume button to display the New Volume dialog.
3. For the Volume Type, select Standard Volume.
4. Type the name MyVolume in the Volume Name field.
5. Type the path /myvolume in the Mount Path field.
6. Select /default-rack in the Topology field.
7. Scroll to the bottom and click OK to create the volume.

Step 7. Mount the Cluster via NFS


With MapR, you can export and mount the Hadoop cluster as a read/write volume via NFS from the machine where you installed MapR, or from a
different machine.

If you are mounting from the machine where you installed MapR, replace <host> in the steps below with localhost
If you are mounting from a different machine, make sure the machine where you installed MapR is reachable over the network and
replace <host> in the steps below with the hostname of the machine where you installed MapR.

Try the following steps to see how it works:

1. Change to the root user (or use sudo for the following commands).
2. See what is exported from the machine where you installed MapR:

showmount -e <host>

3. Set up a mount point for the NFS share:

mkdir /mapr

4. Mount the cluster via NFS:

mount <host>:/mapr /mapr

Tips
If you get an error such as RPC: Program not registered it likely means that the NFS service is not running. See Setting Up NFS
for tips.
5. To see the cluster, list the /mapr directory:

# ls /mapr
my.cluster.com

6. List the cluster itself, and notice that the volume you created is there:

# ls -l /mapr/my.cluster.com
Found 4 items
drwxrwxrwx - root root 0 2011-11-22 12:44 /myvolume
drwxr-xr-x - mapr mapr 0 2011-01-03 13:50 /tmp
drwxr-xr-x - mapr mapr 0 2011-01-04 13:57 /user
drwxr-xr-x - root root 0 2010-11-25 09:41 /var

7. Try creating a directory in your new volume via NFS:

mkdir /mapr/my.cluster.com/myvolume/foo

8. List the contents of /myvolume:

hadoop fs -ls /myvolume

Notice that Hadoop can see the directory you just created with NFS. Try navigating to the cluster using the computer's file browser --- you can
drag and drop files directly to your new volume, and see them immediately in Hadoop!
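
For example, because the NFS mount behaves like an ordinary read/write filesystem, you can create a file with standard Linux tools and immediately read it back through Hadoop. A small sketch, reusing the foo directory created in step 7:

# Write a file through the NFS mount
echo "hello from NFS" > /mapr/my.cluster.com/myvolume/foo/hello.txt

# Read the same file back through the Hadoop command line
hadoop fs -cat /myvolume/foo/hello.txt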

If you are already running an NFS server, MapR will not run its own NFS gateway. In that case, you will not be able to mount the
single-node cluster via NFS, but your previous NFS exports will remain available.

Step 8. Try MapReduce


In this section, you will run the well-known Word Count MapReduce example. You'll need one or more text files. The Word Count program reads
files from an input directory, counts the words, and writes the results of the job to files in an output directory. For this exercise we will use
/myvolume/in for the input, and /myvolume/out for the output. The input directory must exist and must contain the input files before running
the job; the output directory must not exist, as the Word Count example creates it.

1. Open a terminal (select Applications > Accessories > Terminal)


2. Copy a couple of text files into the cluster, either using the file browser or the command line. Create the directory /myvolume/in and put
the files there. Example:

mkdir /mapr/my.cluster.com/myvolume/in
cp <some files> /mapr/my.cluster.com/myvolume/in

3. Type the following line to run the Word Count job:

hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount /myvolume/in /myvolume/out

4. Look in the newly-created /myvolume/out for a file called part-r-00000 containing the results.

Single Node - Ubuntu


Use the following steps to install a simple single-node cluster with a basic set of services. These instructions bring up a single node with the
following roles:

CLDB
FileServer
JobTracker
NFS
TaskTracker
WebServer
ZooKeeper

Step 1. Requirements
64-bit Ubuntu 9.04 or greater
RAM: 4 GB or more
At least one free unmounted drive or partition, 500 GB or more
At least 10 GB of free space on the operating system partition
Java JDK version 1.6.0_24 (not JRE)
The root password, or sudo privileges
A Linux user chosen to have administrative privileges on the cluster
Make sure the user has a password (using sudo passwd <user> for example)

If Java is already installed, check which versions of Java are installed:

java -version

If JDK 6 is installed, the output will include a version number starting with 1.6, and then below that the text Java(TM). Example:
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)

If necessary, install Sun Java JDK 6. Once Sun Java JDK 6 is installed, use update-alternatives to make sure it is the default Java:

sudo update-alternatives --config java

This procedure assumes you have free, unmounted physical partitions or hard disks for use by MapR. If you are not sure,
please read Setting Up Disks for MapR.

Create a text file /tmp/disks.txt listing disks and partitions for use by MapR. Each line lists a single disk, or partitions on a single
disk. Example:

/dev/sdb
/dev/sdc1 /dev/sdc2 /dev/sdc4
/dev/sdd

Later, when you run disksetup to format the disks, specify the disks and partitions file. Example:

disksetup -F /tmp/disks.txt

To get the most out of this tutorial, be sure to register for your free M3 license after installing the MapR software.

Step 2. Add the MapR Repository


1. Change to the root user (or use sudo for the following commands).
2. Add the following line to /etc/apt/sources.list:
deb http://package.mapr.com/releases/v1.1.3/ubuntu/ mapr optional
(To install a previous release, see the Release Notes for the correct path.)
3. If your connection to the Internet is through a proxy server, add the following lines to /etc/apt/apt.conf:

Acquire {
Retries "0";
HTTP {
Proxy "http://<user>:<password>@<host>:<port>";
};
};

If you don't have Internet connectivity, do one of the following:

Set up a local repository.


Download and install packages manually.

Step 3. Install the Software


Before installing the software, make sure you have created the file /tmp/disks.txt containing a list of disks and partitions for use by MapR.

1. Change to the root user (or use sudo) and use the following commands to install a MapR single-node cluster:

apt-get update
apt-get install mapr-single-node
/opt/mapr/server/disksetup -F /tmp/disks.txt
/etc/init.d/mapr-zookeeper start
/etc/init.d/mapr-warden start

2. Specify an administrative user for the cluster by running the following command (as root or with sudo), replacing <user> with your
Linux username:

/opt/mapr/bin/maprcli acl edit -type cluster -user <user>:fc

If you are running MapR Single Node on a laptop or a computer that you reboot frequently, you might want to use the command
update-rc.d -f mapr-warden remove (as root or with sudo) to prevent the cluster from starting automatically when the
computer restarts. You can always start the cluster manually by running /etc/init.d/mapr-zookeeper start and
/etc/init.d/mapr-warden start (as shown above).

Step 4. Check out the MapR Control System


After a few minutes the MapR Control System starts.

The MapR Control System gives you power at a glance over the entire cluster. With the MapR Control System, you can see immediately which
nodes are healthy; how much bandwidth, disk space, and CPU time are in use; and the status of MapReduce jobs. But there's more: the MapR
Control System lets you manage your data in ways that were previously impossible with Hadoop. During the next few exercises, you will learn a
few of the easy, powerful capabilities of MapR.

1. Open a browser and navigate to the MapR Control System on the server where you installed MapR. The URL is
https://<host>:8443. Example:

https://localhost:8443

Your computer won't have an HTTPS certificate yet, so the browser will warn you that the connection is not trustworthy. You can safely
ignore the warning this time.
2. The first time MapR starts, you must accept the agreement and choose whether to enable the MapR Dial Home service.
3. Log in using the Linux username and password you specified as the administrative user in Step 3.

Welcome to the MapR Control System, which displays information about the health and performance of the cluster. Notice the navigation pane to
the left and the larger view pane to the right. The navigation pane lets you quickly switch between views, to display different types of information.

When the MapR Control System starts, the first view displayed is the Dashboard. The MapR single node is represented as a square. The square
is not green, because the node's health is not perfect: the license has not yet been applied, so the NFS service has failed to start, resulting in an
alarm on the node. Later, you'll see how to obtain the license and mount the cluster with NFS.

Click the node to display the Node Properties view. Notice the types of information available about the node:

Alarms - timely information about any problems on the node


Machine Performance - resource usage
General Information - physical topology of the node and the times of the most recent heartbeats
MapReduce - map and reduce slot usage
Manage Node Services - services on the node
MapR-FS and Available Disks - disks used for storage by MapR
System Disks - disks available to add to MapR storage

Step 5. Get a License


1. In the navigation pane of the MapR Control System, expand the System Settings group and click MapR Licenses to display the MapR
License Management dialog.

2. Click Add Licenses via Web.
3. If the cluster is already registered, the license is applied automatically. Otherwise, click OK to register the cluster on MapR.com and
follow the instructions there.
If the cluster is not yet registered, the message "Cluster not found" appears and the browser is redirected to a registration page.
On the registration page, create an account and log in.
On the Register Cluster page, choose M3 and click Register.
When the message "Cluster Registered" appears, click Return to your MapR Cluster UI.

Step 6. Start Working with Volumes


MapR provides volumes as a way to organize data into groups, so that you can manage your data and apply policy all at once instead of file by
file. Think of a volume as being similar to a huge hard drive---it can be mounted or unmounted, belong to a specific department or user, and have
permissions set as a whole or on any directory or file within. In this section, you will create a volume that you can use for later parts of the tutorial.

Create a volume:

1. In the Navigation pane, click Volumes in the MapR-FS group.


2. Click the New Volume button to display the New Volume dialog.
3. For the Volume Type, select Standard Volume.
4. Type the name MyVolume in the Volume Name field.
5. Type the path /myvolume in the Mount Path field.
6. Select /default-rack in the Topology field.
7. Scroll to the bottom and click OK to create the volume.

Step 7. Mount the Cluster via NFS


With MapR, you can export and mount the Hadoop cluster as a read/write volume via NFS from the machine where you installed MapR, or from a
different machine.

If you are mounting from the machine where you installed MapR, replace <host> in the steps below with localhost
If you are mounting from a different machine, make sure the machine where you installed MapR is reachable over the network and
replace <host> in the steps below with the hostname of the machine where you installed MapR.

Try the following steps to see how it works:

1. Change to the root user (or use sudo for the following commands).
2. See what is exported from the machine where you installed MapR:

showmount -e <host>

3. Set up a mount point for the NFS share:

mkdir /mapr

4. Mount the cluster via NFS:

mount <host>:/mapr /mapr

Tips
If you get an error such as RPC: Program not registered it likely means that the NFS service is not running. See Setting Up NFS
for tips.
5. To see the cluster, list the /mapr directory:

# ls /mapr
my.cluster.com

6. List the cluster itself, and notice that the volume you created is there:

# ls -l /mapr/my.cluster.com
Found 4 items
drwxrwxrwx - root root 0 2011-11-22 12:44 /myvolume
drwxr-xr-x - mapr mapr 0 2011-01-03 13:50 /tmp
drwxr-xr-x - mapr mapr 0 2011-01-04 13:57 /user
drwxr-xr-x - root root 0 2010-11-25 09:41 /var
7. Try creating a directory in your new volume via NFS:

mkdir /mapr/my.cluster.com/myvolume/foo

8. List the contents of /myvolume:

hadoop fs -ls /myvolume

Notice that Hadoop can see the directory you just created with NFS. Try navigating to the cluster using the computer's file browser --- you can
drag and drop files directly to your new volume, and see them immediately in Hadoop!

If you are already running an NFS server, MapR will not run its own NFS gateway. In that case, you will not be able to mount the
single-node cluster via NFS, but your previous NFS exports will remain available.

Step 8. Try MapReduce


In this section, you will run the well-known Word Count MapReduce example. You'll need one or more text files. The Word Count program reads
files from an input directory, counts the words, and writes the results of the job to files in an output directory. For this exercise we will use
/myvolume/in for the input, and /myvolume/out for the output. The input directory must exist and must contain the input files before running
the job; the output directory must not exist, as the Word Count example creates it.

1. Open a terminal (select Applications > Accessories > Terminal)


2. Copy a couple of text files into the cluster, either using the file browser or the command line. Create the directory /myvolume/in and put
the files there. Example:

mkdir /mapr/my.cluster.com/myvolume/in
cp <some files> /mapr/my.cluster.com/myvolume/in

3. Type the following line to run the Word Count job:

hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount /myvolume/in /myvolume/out

4. Look in the newly-created /myvolume/out for a file called part-r-00000 containing the results.

Quick Start - Small Cluster


Choose the Quick Start guide that is right for your operating system:

M3 - RHEL or CentOS
M3 - Ubuntu
M5 - RHEL or CentOS
M5 - Ubuntu

M3 - RHEL or CentOS
Use the following steps to install a simple MapR cluster of up to 100 nodes with a basic set of services. To build a larger cluster, or to build a cluster
that includes additional services (such as Hive, Pig, Flume, or Oozie), see the Installation Guide. To add services to nodes on a running cluster,
see Reconfiguring a Node. To get the most out of this tutorial, and to enable NFS, be sure to register for your free M3 license after installing the
MapR software.

Setup
Follow these instructions to install a small MapR cluster (3-100 nodes) on machines that meet the following requirements:

64-bit Red Hat 5.4 or greater, or 64-bit CentOS 5.4 or greater


RAM: 4 GB or more
At least one free unmounted drive or partition, 50 GB or more
At least 10 GB of free space on the operating system partition
Java JDK version 1.6.0_24 (not JRE)
The root password, or sudo privileges
A Linux user chosen to have administrative privileges on the cluster
Make sure the user has a password (using sudo passwd <user> for example)

Each node must have a unique hostname, and keyless SSH set up to all other nodes.

This procedure assumes you have free, unmounted physical partitions or hard disks for use by MapR. If you are not sure,
please read Setting Up Disks for MapR.

Create a text file /tmp/disks.txt listing disks and partitions for use by MapR. Each line lists a single disk, or partitions on a single
disk. Example:

/dev/sdb
/dev/sdc1 /dev/sdc2 /dev/sdc4
/dev/sdd

Later, when you run disksetup to format the disks, specify the disks and partitions file. Example:

disksetup -F /tmp/disks.txt

For the steps that follow, make the following substitutions:

<user> - the chosen administrative username


<node 1>, <node 2>, <node 3>... - the IP addresses of nodes 1, 2, 3 ...
<proxy user>, <proxy password>, <host>, <port> - proxy server credentials and settings

If you are installing a MapR cluster on nodes that are not connected to the Internet, contact MapR for assistance. If you are
installing a cluster larger than 100 nodes, see the Installation Guide. In particular, CLDB nodes on large clusters should not run
any other service (see Isolating CLDB Nodes).

Deployment
1. Change to the root user (or use sudo for the following commands).
2. On all nodes, create a text file called maprtech.repo in the directory /etc/yum.repos.d/ with the following contents:
[maprtech]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/v1.1.3/redhat/
enabled=1
gpgcheck=0
protect=1
To install a previous release, see the Release Notes for the correct path to use in the baseurl parameter.
3. If your connection to the Internet is through a proxy server, you must set the http_proxy environment variable before installation:

http_proxy=http://<host>:<port>
export http_proxy

If you don't have Internet connectivity, do one of the following:

Set up a local repository.


Download and install packages manually.
4. On node 1, execute the following command:

yum install mapr-cldb mapr-fileserver mapr-jobtracker mapr-nfs mapr-tasktracker mapr-webserver mapr-zookeeper

5. On nodes 2 and 3, execute the following command:

yum install mapr-fileserver mapr-tasktracker mapr-zookeeper

6. On all other nodes (nodes 4...n), execute the following commands:

yum install mapr-fileserver mapr-tasktracker

7. On all nodes, execute the following commands:

/opt/mapr/server/configure.sh -C <node 1> -Z <node 1>,<node 2>,<node 3>
/opt/mapr/server/disksetup -F /tmp/disks.txt

8. On nodes 1, 2, and 3, execute the following command:

/etc/init.d/mapr-zookeeper start

9. On node 1, execute the following command:

/etc/init.d/mapr-warden start

Tips

If you see "WARDEN running as process <process>. Stop it" it means the warden is already running. This can happen, for
example, when you reboot the machine. Use /etc/init.d/mapr-warden stop to stop it, then start it again.
10. On node 1, give full permission to the chosen administrative user using the following command:

/opt/mapr/bin/maprcli acl edit -type cluster -user <user>:fc

Tips

The Warden can take a few minutes to start. If you see the error "Couldn't connect to the CLDB service," wait a few minutes
and try again.
11. On a machine that is connected to the cluster and to the Internet, perform the following steps to install the license:
In a browser, view the MapR Control System by navigating to the node that is running the WebServer:
https://<node 1>:8443
Your computer won't have an HTTPS certificate yet, so the browser will warn you that the connection is not trustworthy. You can
ignore the warning this time.
The first time MapR starts, you must accept the agreement and choose whether to enable the MapR Dial Home service.
Log in to the MapR Control System as the administrative user you designated earlier.
In the navigation pane of the MapR Control System, expand the System Settings group and click MapR Licenses to display the
MapR License Management dialog.
Click Add Licenses via Web.
If the cluster is already registered, the license is applied automatically. Otherwise, click OK to register the cluster on MapR.com
and follow the instructions there.
If the cluster is not yet registered, the message "Cluster not found" appears and the browser is redirected to a
registration page.
On the registration page, create an account and log in.
On the Register Cluster page, choose M3 and click Register.
When the message "Cluster Registered" appears, click Return to your MapR Cluster UI.
12. On node 1, execute the following command:

/opt/mapr/bin/maprcli node services -nodes <node 1> -nfs start

13. On all other nodes (nodes 2...n), execute the following command:

/etc/init.d/mapr-warden start

14. Log in to the MapR Control System.


15. Under the Cluster group in the left pane, click Dashboard.
16. Check the Services pane and make sure each service is running the correct number of instances:
Instances of the FileServer and TaskTracker on all nodes
3 instances of ZooKeeper
1 instance of the CLDB, JobTracker, NFS, and WebServer
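
You can also check from the command line on node 1. The following is a sketch, assuming the maprcli client installed with the MapR packages; the exact output columns vary by release:

# List the nodes in the cluster, including the services configured on each
/opt/mapr/bin/maprcli node list

# List the services running on a specific node
/opt/mapr/bin/maprcli service list -node <node 1>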
Next Steps

Start Working with Volumes

MapR provides volumes as a way to organize data into groups, so that you can manage your data and apply policy all at once instead of file by
file. Think of a volume as being similar to a huge hard drive---it can be mounted or unmounted, belong to a specific department or user, and have
permissions set as a whole or on any directory or file within. In this section, you will create a volume that you can use for later parts of the tutorial.

Create a volume:

1. In the Navigation pane, click Volumes in the MapR-FS group.


2. Click the New Volume button to display the New Volume dialog.
3. For the Volume Type, select Standard Volume.
4. Type the name MyVolume in the Volume Name field.
5. Type the path /myvolume in the Mount Path field.
6. Select /default-rack in the Topology field.
7. Scroll to the bottom and click OK to create the volume.

Mount the Cluster via NFS

With MapR, you can export and mount the Hadoop cluster as a read/write volume via NFS from the machine where you installed MapR, or from a
different machine.

If you are mounting from the machine where you installed MapR, replace <host> in the steps below with localhost
If you are mounting from a different machine, make sure the machine where you installed MapR is reachable over the network and
replace <host> in the steps below with the hostname of the machine where you installed MapR.

Try the following steps to see how it works:

1. Change to the root user (or use sudo for the following commands).
2. See what is exported from the machine where you installed MapR:

showmount -e <host>

3. Set up a mount point for the NFS share:

mkdir /mapr

4. Mount the cluster via NFS:

mount <host>:/mapr /mapr

Tips
If you get an error such as RPC: Program not registered it likely means that the NFS service is not running. See Setting Up NFS
for tips.
5. To see the cluster, list the /mapr directory:

# ls /mapr
my.cluster.com

6. List the cluster itself, and notice that the volume you created is there:

# ls -l /mapr/my.cluster.com
Found 4 items
drwxrwxrwx - root root 0 2011-11-22 12:44 /myvolume
drwxr-xr-x - mapr mapr 0 2011-01-03 13:50 /tmp
drwxr-xr-x - mapr mapr 0 2011-01-04 13:57 /user
drwxr-xr-x - root root 0 2010-11-25 09:41 /var

7. Try creating a directory in your new volume via NFS:

mkdir /mapr/my.cluster.com/myvolume/foo
8. List the contents of /myvolume:

hadoop fs -ls /myvolume

Notice that Hadoop can see the directory you just created with NFS. Try navigating to the cluster using the computer's file browser --- you can
drag and drop files directly to your new volume, and see them immediately in Hadoop!

If you are already running an NFS server, MapR will not run its own NFS gateway. In that case, you will not be able to mount the
single-node cluster via NFS, but your previous NFS exports will remain available.

Try MapReduce

In this section, you will run the well-known Word Count MapReduce example. You'll need one or more text files. The Word Count program reads
files from an input directory, counts the words, and writes the results of the job to files in an output directory. For this exercise we will use
/myvolume/in for the input, and /myvolume/out for the output. The input directory must exist and must contain the input files before running
the job; the output directory must not exist, as the Word Count example creates it.

1. Open a terminal (select Applications > Accessories > Terminal)


2. Copy a couple of text files into the cluster, either using the file browser or the command line. Create the directory /myvolume/in and put
the files there. Example:

mkdir /mapr/my.cluster.com/myvolume/in
cp <some files> /mapr/my.cluster.com/myvolume/in

3. Type the following line to run the Word Count job:

hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount /myvolume/in /myvolume/out

4. Look in the newly-created /myvolume/out for a file called part-r-00000 containing the results.

M3 - Ubuntu
Use the following steps to install a simple MapR cluster of up to 100 nodes with a basic set of services. To build a larger cluster, or to build a cluster
that includes additional services (such as Hive, Pig, Flume, or Oozie), see the Installation Guide. To add services to nodes on a running cluster,
see Reconfiguring a Node. To get the most out of this tutorial, and to enable NFS, be sure to register for your free M3 license after installing the
MapR software.

Setup
Follow these instructions to install a small MapR cluster (3-100 nodes) on machines that meet the following requirements:

64-bit Ubuntu 9.04 or greater


RAM: 4 GB or more
At least one free unmounted drive or partition, 50 GB or more
At least 10 GB of free space on the operating system partition
Java JDK version 1.6.0_24 (not JRE)
The root password, or sudo privileges
A Linux user chosen to have administrative privileges on the cluster
Make sure the user has a password (using sudo passwd <user> for example)

Each node must have a unique hostname, and keyless SSH set up to all other nodes.

This procedure assumes you have free, unmounted physical partitions or hard disks for use by MapR. If you are not sure,
please read Setting Up Disks for MapR.

Create a text file /tmp/disks.txt listing disks and partitions for use by MapR. Each line lists a single disk, or partitions on a single
disk. Example:
/dev/sdb
/dev/sdc1 /dev/sdc2 /dev/sdc4
/dev/sdd

Later, when you run disksetup to format the disks, specify the disks and partitions file. Example:

disksetup -F /tmp/disks.txt

For the steps that follow, make the following substitutions:

<user> - the chosen administrative username


<node 1>, <node 2>, <node 3>... - the IP addresses of nodes 1, 2, 3 ...
<proxy user>, <proxy password>, <host>, <port> - proxy server credentials and settings

If you are installing a MapR cluster on nodes that are not connected to the Internet, contact MapR for assistance. If you are
installing a cluster larger than 100 nodes, see the Installation Guide. In particular, CLDB nodes on large clusters should not run
any other service (see Isolating CLDB Nodes).

Deployment
1. Change to the root user (or use sudo for the following commands).
2. On all nodes, add the following line to /etc/apt/sources.list:
deb http://package.mapr.com/releases/v1.1.3/ubuntu/ mapr optional
To install a previous release, see the Release Notes for the correct path to use
If you don't have Internet connectivity, do one of the following:
Set up a local repository.
Download and install packages manually.
3. On all nodes, run the following command:

apt-get update

4. If your connection to the Internet is through a proxy server, add the following lines to /etc/apt/apt.conf:

Acquire {
Retries "0";
HTTP {
Proxy "http://<proxy user>:<proxy password>@<host>:<port>";
};
};

5. On node 1, execute the following command:

apt-get install mapr-cldb mapr-fileserver mapr-jobtracker mapr-nfs mapr-tasktracker mapr-webserver mapr-zookeeper

6. On nodes 2 and 3, execute the following command:

apt-get install mapr-fileserver mapr-tasktracker mapr-zookeeper

7. On all other nodes (nodes 4...n), execute the following commands:

apt-get install mapr-fileserver mapr-tasktracker

8. On all nodes, execute the following commands:



/opt/mapr/server/configure.sh -C <node 1> -Z <node 1>,<node 2>,<node 3>
/opt/mapr/server/disksetup -F /tmp/disks.txt

9. On nodes 1, 2, and 3, execute the following command:

/etc/init.d/mapr-zookeeper start

10. On node 1, execute the following command:

/etc/init.d/mapr-warden start

Tips

If you see "WARDEN running as process <process>. Stop it" it means the warden is already running. This can happen, for
example, when you reboot the machine. Use /etc/init.d/mapr-warden stop to stop it, then start it again.
11. On node 1, give full permission to the chosen administrative user using the following command:

/opt/mapr/bin/maprcli acl edit -type cluster -user <user>:fc

Tips

The Warden can take a few minutes to start. If you see the error "Couldn't connect to the CLDB service," wait a few minutes
and try again.
12. On a machine that is connected to the cluster and to the Internet, perform the following steps to install the license:
In a browser, view the MapR Control System by navigating to the node that is running the WebServer:
https://<node 1>:8443
Your computer won't have an HTTPS certificate yet, so the browser will warn you that the connection is not trustworthy. You can
ignore the warning this time.
The first time MapR starts, you must accept the agreement and choose whether to enable the MapR Dial Home service.
Log in to the MapR Control System as the administrative user you designated earlier.
In the navigation pane of the MapR Control System, expand the System Settings group and click MapR Licenses to display the
MapR License Management dialog.
Click Add Licenses via Web.
If the cluster is already registered, the license is applied automatically. Otherwise, click OK to register the cluster on MapR.com
and follow the instructions there.
If the cluster is not yet registered, the message "Cluster not found" appears and the browser is redirected to a
registration page.
On the registration page, create an account and log in.
On the Register Cluster page, choose M3 and click Register.
When the message "Cluster Registered" appears, click Return to your MapR Cluster UI.
13. On node 1, execute the following command:

/opt/mapr/bin/maprcli node services -nodes <node 1> -nfs start

14. On all other nodes (nodes 2...n), execute the following command:

/etc/init.d/mapr-warden start

15. Log in to the MapR Control System.


16. Under the Cluster group in the left pane, click Dashboard.
17. Check the Services pane and make sure each service is running the correct number of instances:
Instances of the FileServer and TaskTracker on all nodes
3 instances of ZooKeeper
1 instance of the CLDB, JobTracker, NFS, and WebServer

Next Steps

Start Working with Volumes


MapR provides volumes as a way to organize data into groups, so that you can manage your data and apply policy all at once instead of file by
file. Think of a volume as being similar to a huge hard drive---it can be mounted or unmounted, belong to a specific department or user, and have
permissions set as a whole or on any directory or file within. In this section, you will create a volume that you can use for later parts of the tutorial.

Create a volume:

1. In the Navigation pane, click Volumes in the MapR-FS group.


2. Click the New Volume button to display the New Volume dialog.
3. For the Volume Type, select Standard Volume.
4. Type the name MyVolume in the Volume Name field.
5. Type the path /myvolume in the Mount Path field.
6. Select /default-rack in the Topology field.
7. Scroll to the bottom and click OK to create the volume.

Mount the Cluster via NFS

With MapR, you can export and mount the Hadoop cluster as a read/write volume via NFS from the machine where you installed MapR, or from a
different machine.

If you are mounting from the machine where you installed MapR, replace <host> in the steps below with localhost
If you are mounting from a different machine, make sure the machine where you installed MapR is reachable over the network and
replace <host> in the steps below with the hostname of the machine where you installed MapR.

Try the following steps to see how it works:

1. Change to the root user (or use sudo for the following commands).
2. See what is exported from the machine where you installed MapR:

showmount -e <host>

3. Set up a mount point for the NFS share:

mkdir /mapr

4. Mount the cluster via NFS:

mount <host>:/mapr /mapr

Tips
If you get an error such as RPC: Program not registered it likely means that the NFS service is not running. See Setting Up NFS
for tips.
5. To see the cluster, list the /mapr directory:

# ls /mapr
my.cluster.com

6. List the cluster itself, and notice that the volume you created is there:

# ls -l /mapr/my.cluster.com
Found 4 items
drwxrwxrwx - root root 0 2011-11-22 12:44 /myvolume
drwxr-xr-x - mapr mapr 0 2011-01-03 13:50 /tmp
drwxr-xr-x - mapr mapr 0 2011-01-04 13:57 /user
drwxr-xr-x - root root 0 2010-11-25 09:41 /var

7. Try creating a directory in your new volume via NFS:

mkdir /mapr/my.cluster.com/myvolume/foo

8. List the contents of /myvolume:

hadoop fs -ls /myvolume


Notice that Hadoop can see the directory you just created with NFS. Try navigating to the cluster using the computer's file browser --- you can
drag and drop files directly to your new volume, and see them immediately in Hadoop!

If you are already running an NFS server, MapR will not run its own NFS gateway. In that case, you will not be able to mount the
single-node cluster via NFS, but your previous NFS exports will remain available.

Try MapReduce

In this section, you will run the well-known Word Count MapReduce example. You'll need one or more text files. The Word Count program reads
files from an input directory, counts the words, and writes the results of the job to files in an output directory. For this exercise we will use
/myvolume/in for the input, and /myvolume/out for the output. The input directory must exist and must contain the input files before running
the job; the output directory must not exist, as the Word Count example creates it.

1. Open a terminal (select Applications > Accessories > Terminal)


2. Copy a couple of text files into the cluster, either using the file browser or the command line. Create the directory /myvolume/in and put
the files there. Example:

mkdir /mapr/my.cluster.com/myvolume/in
cp <some files> /mapr/my.cluster.com/myvolume/in

3. Type the following line to run the Word Count job:

hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount /myvolume/in /myvolume/out

4. Look in the newly-created /myvolume/out for a file called part-r-00000 containing the results.
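
To spot-check the results from the command line, either the hadoop command or the NFS path works. A quick sketch:

# view the first few lines of the Word Count output
hadoop fs -cat /myvolume/out/part-r-00000 | head
# or, equivalently, through the NFS mount
head /mapr/my.cluster.com/myvolume/out/part-r-00000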

M5 - RHEL or CentOS
Use the following steps to install a simple MapR cluster up to 100 nodes with a basic set of services. To build a larger cluster, or to build a cluster
that includes additional services (such as Hive, Pig, Flume, or Oozie), see the Installation Guide. To add services to nodes on a running cluster,
see Reconfiguring a Node. To get the most out of this tutorial, and to enable NFS, be sure to register for your free M5 Trial license after installing
the MapR software.

Setup
Follow these instructions to install a small MapR cluster (3-100 nodes) on machines that meet the following requirements:

64-bit Red Hat 5.4 or greater, or 64-bit CentOS 5.4 or greater


RAM: 4 GB or more
At least one free unmounted drive or partition, 50 GB or more
At least 10 GB of free space on the operating system partition
Java JDK version 1.6.0_24 (not JRE)
The root password, or sudo privileges
A Linux user chosen to have administrative privileges on the cluster
Make sure the user has a password (using sudo passwd <user> for example)

Each node must have a unique hostname, and keyless SSH set up to all other nodes.

This procedure assumes you have free, unmounted physical partitions or hard disks for use by MapR. If you are not sure,
please read Setting Up Disks for MapR.

Create a text file /tmp/disks.txt listing disks and partitions for use by MapR. Each line lists a single disk, or partitions on a single
disk. Example:

/dev/sdb
/dev/sdc1 /dev/sdc2 /dev/sdc4
/dev/sdd

Later, when you run disksetup to format the disks, specify the disks and partitions file. Example:
disksetup -F /tmp/disks.txt

For the steps that follow, make the following substitutions:

<user> - the chosen administrative username


<node 1>, <node 2>, <node 3>... - the IP addresses of nodes 1, 2, 3 ...
<proxy user>, <proxy password>, <host>, <port> - proxy server credentials and settings

If you are installing a MapR cluster on nodes that are not connected to the Internet, contact MapR for assistance. If you are
installing a cluster larger than 100 nodes, see the Installation Guide. In particular, CLDB nodes on large clusters should not run
any other service (see Isolating CLDB Nodes).

Deployment
1. Change to the root user (or use sudo for the following commands).
2. On all nodes, create a text file called maprtech.repo in the directory /etc/yum.repos.d/ with the following contents:
[maprtech]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/v1.1.3/redhat/
enabled=1
gpgcheck=0
protect=1
To install a previous release, see the Release Notes for the correct path to use in the baseurl parameter.
If you don't have Internet connectivity, do one of the following:
Set up a local repository.
Download and install packages manually.
3. If your connection to the Internet is through a proxy server, you must set the http_proxy environment variable before installation:

http_proxy=http://<host>:<port>
export http_proxy

4. On node 1, execute the following command:

yum install mapr-cldb mapr-jobtracker mapr-nfs mapr-zookeeper mapr-tasktracker mapr-webserver

5. On nodes 2 and 3, execute the following command:

yum install mapr-cldb mapr-jobtracker mapr-nfs mapr-zookeeper mapr-tasktracker

6. On all other nodes (nodes 4...n), execute the following command:

yum install mapr-fileserver mapr-nfs mapr-tasktracker

7. On all nodes, execute the following commands:

/opt/mapr/server/configure.sh -C <node 1>,<node 2>,<node 3> -Z <node 1>,<node 2>,<node 3>


/opt/mapr/server/disksetup -F /tmp/disks.txt

8. On nodes 1, 2, and 3, execute the following command:

/etc/init.d/mapr-zookeeper start

9. On node 1, execute the following command:

/etc/init.d/mapr-warden start
Tips

If you see "WARDEN running as process <process>. Stop it" it means the warden is already running. This can happen, for
example, when you reboot the machine. Use /etc/init.d/mapr-warden stop to stop it, then start it again.
10. On node 1, give full permission to the chosen administrative user using the following command:

/opt/mapr/bin/maprcli acl edit -type cluster -user <user>:fc

Tips

The Warden can take a few minutes to start. If you see the error "Couldn't connect to the CLDB service," wait a few minutes
and try again.
11. On a machine that is connected to the cluster and to the Internet, perform the following steps to install the license:
In a browser, view the MapR Control System by navigating to the node that is running the WebServer:
https://<node 1>:8443
Your computer won't have an HTTPS certificate yet, so the browser will warn you that the connection is not trustworthy. You can
ignore the warning this time.
The first time MapR starts, you must accept the agreement and choose whether to enable the MapR Dial Home service.
Log in to the MapR Control System as the administrative user you designated earlier.
In the navigation pane of the MapR Control System, expand the System Settings group and click MapR Licenses to display the
MapR License Management dialog.
Click Add Licenses via Web.
If the cluster is already registered, the license is applied automatically. Otherwise, click OK to register the cluster on MapR.com
and follow the instructions there.
If the cluster is not yet registered, the message "Cluster not found" appears and the browser is redirected to a
registration page.
On the registration page, create an account and log in.
On the Register Cluster page, choose M5 Trial and click Register.
When the message "Cluster Registered" appears, click Return to your MapR Cluster UI.
12. On node 1, execute the following command:

/opt/mapr/bin/maprcli node services -nodes <node 1> -nfs start

13. On all other nodes (nodes 2...n), execute the following command:

/etc/init.d/mapr-warden start

14. Log in to the MapR Control System.


15. Under the Cluster group in the left pane, click Dashboard.
16. Check the Services pane and make sure each service is running the correct number of instances:
Instances of the FileServer, NFS, and TaskTracker on all nodes
3 instances of the CLDB
1 of 3 instances of the JobTracker
1 instance of the WebServer

Next Steps

Start Working with Volumes

MapR provides volumes as a way to organize data into groups, so that you can manage your data and apply policy all at once instead of file by
file. Think of a volume as being similar to a huge hard drive---it can be mounted or unmounted, belong to a specific department or user, and have
permissions set as a whole or on any directory or file within. In this section, you will create a volume that you can use for later parts of the tutorial.

Create a volume:

1. In the Navigation pane, click Volumes in the MapR-FS group.


2. Click the New Volume button to display the New Volume dialog.
3. For the Volume Type, select Standard Volume.
4. Type the name MyVolume in the Volume Name field.
5. Type the path /myvolume in the Mount Path field.
6. Select /default-rack in the Topology field.
7. Scroll to the bottom and click OK to create the volume.
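
Because policies are applied at the volume level, the new volume is also the natural place to set limits such as quotas. As a sketch (the quota value is an example, and the -quota option is assumed to accept sizes such as 10G):

# cap MyVolume at 10 GB of disk usage
/opt/mapr/bin/maprcli volume modify -name MyVolume -quota 10G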

Mount the Cluster via NFS


With MapR, you can export and mount the Hadoop cluster as a read/write volume via NFS from the machine where you installed MapR, or from a
different machine.

If you are mounting from the machine where you installed MapR, replace <host> in the steps below with localhost
If you are mounting from a different machine, make sure the machine where you installed MapR is reachable over the network and
replace <host> in the steps below with the hostname of the machine where you installed MapR.

Try the following steps to see how it works:

1. Change to the root user (or use sudo for the following commands).
2. See what is exported from the machine where you installed MapR:

showmount -e <host>

3. Set up a mount point for the NFS share:

mkdir /mapr

4. Mount the cluster via NFS:

mount <host>:/mapr /mapr

Tips
If you get an error such as RPC: Program not registered it likely means that the NFS service is not running. See Setting Up NFS
for tips.
5. To see the cluster, list the /mapr directory:

# ls /mapr
my.cluster.com

6. List the cluster itself, and notice that the volume you created is there:

# ls -l /mapr/my.cluster.com
Found 4 items
drwxrwxrwx - root root 0 2011-11-22 12:44 /myvolume
drwxr-xr-x - mapr mapr 0 2011-01-03 13:50 /tmp
drwxr-xr-x - mapr mapr 0 2011-01-04 13:57 /user
drwxr-xr-x - root root 0 2010-11-25 09:41 /var

7. Try creating a directory in your new volume via NFS:

mkdir /mapr/my.cluster.com/myvolume/foo

8. List the contents of /myvolume:

hadoop fs -ls /myvolume

Notice that Hadoop can see the directory you just created with NFS. Try navigating to the cluster using the computer's file browser --- you can
drag and drop files directly to your new volume, and see them immediately in Hadoop!

If you are already running an NFS server, MapR will not run its own NFS gateway. In that case, you will not be able to mount the
cluster via NFS, but your previous NFS exports will remain available.

Try MapReduce

In this section, you will run the well-known Word Count MapReduce example. You'll need one or more text files. The Word Count program reads
files from an input directory, counts the words, and writes the results of the job to files in an output directory. For this exercise we will use
/myvolume/in for the input, and /myvolume/out for the output. The input directory must exist and must contain the input files before running
the job; the output directory must not exist, as the Word Count example creates it.

1. Open a terminal (select Applications > Accessories > Terminal)


2. Copy a couple of text files into the cluster, either using the file browser or the command line. Create the directory /myvolume/in and put
the files there. Example:

mkdir /mapr/my.cluster.com/myvolume/in
cp <some files> /mapr/my.cluster.com/myvolume/in

3. Type the following line to run the Word Count job:

hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount /myvolume/in /myvolume/out

4. Look in the newly-created /myvolume/out for a file called part-r-00000 containing the results.
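
Because the output is also visible through the NFS mount, ordinary shell tools can post-process it. A small sketch that lists the most frequent words, assuming the usual tab-separated word/count output:

# show the ten most frequent words from the Word Count output
sort -k2 -nr /mapr/my.cluster.com/myvolume/out/part-r-00000 | head -10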

M5 - Ubuntu
Use the following steps to install a simple MapR cluster up to 100 nodes with a basic set of services. To build a larger cluster, or to build a cluster
that includes additional services (such as Hive, Pig, Flume, or Oozie), see the Installation Guide. To add services to nodes on a running cluster,
see Reconfiguring a Node. To get the most out of this tutorial, and to enable NFS, be sure to register for your free M5 Trial license after installing
the MapR software.

Setup
Follow these instructions to install a small MapR cluster (3-100 nodes) on machines that meet the following requirements:

64-bit Ubuntu 9.04 or greater


RAM: 4 GB or more
At least one free unmounted drive or partition, 50 GB or more
At least 10 GB of free space on the operating system partition
Java JDK version 1.6.0_24 (not JRE)
The root password, or sudo privileges
A Linux user chosen to have administrative privileges on the cluster
Make sure the user has a password (using sudo passwd <user> for example)

Each node must have a unique hostname, and keyless SSH set up to all other nodes.

This procedure assumes you have free, unmounted physical partitions or hard disks for use by MapR. If you are not sure,
please read Setting Up Disks for MapR.

Create a text file /tmp/disks.txt listing disks and partitions for use by MapR. Each line lists a single disk, or partitions on a single
disk. Example:

/dev/sdb
/dev/sdc1 /dev/sdc2 /dev/sdc4
/dev/sdd

Later, when you run disksetup to format the disks, specify the disks and partitions file. Example:

disksetup -F /tmp/disks.txt

For the steps that follow, make the following substitutions:

<user> - the chosen administrative username


<node 1>, <node 2>, <node 3>... - the IP addresses of nodes 1, 2, 3 ...
<proxy user>, <proxy password>, <host>, <port> - proxy server credentials and settings

If you are installing a MapR cluster on nodes that are not connected to the Internet, contact MapR for assistance. If you are
installing a cluster larger than 100 nodes, see the Installation Guide. In particular, CLDB nodes on large clusters should not run
any other service (see Isolating CLDB Nodes).
Deployment
1. Change to the root user (or use sudo for the following commands).
2. On all nodes, add the following line to /etc/apt/sources.list:
deb http://package.mapr.com/releases/v1.1.3/ubuntu/ mapr optional
To install a previous release, see the Release Notes for the correct path to use.
If you don't have Internet connectivity, do one of the following:
Set up a local repository.
Download and install packages manually.
3. On all nodes, run the following command:

apt-get update

4. If your connection to the Internet is through a proxy server, add the following lines to /etc/apt/apt.conf:

Acquire {
  Retries "0";
  HTTP {
    Proxy "http://<proxy user>:<proxy password>@<host>:<port>";
  };
};

5. On node 1, execute the following command:

apt-get install mapr-cldb mapr-jobtracker mapr-nfs mapr-zookeeper mapr-tasktracker mapr-webserver

6. On nodes 2 and 3, execute the following command:

apt-get install mapr-cldb mapr-jobtracker mapr-nfs mapr-zookeeper mapr-tasktracker

7. On all other nodes (nodes 4...n), execute the following command:

apt-get install mapr-fileserver mapr-nfs mapr-tasktracker

8. On all nodes, execute the following commands:

/opt/mapr/server/configure.sh -C <node 1>,<node 2>,<node 3> -Z <node 1>,<node 2>,<node 3>


/opt/mapr/server/disksetup -F /tmp/disks.txt

9. On nodes 1, 2, and 3, execute the following command:

/etc/init.d/mapr-zookeeper start

10. On node 1, execute the following command:

/etc/init.d/mapr-warden start

Tips

If you see "WARDEN running as process <process>. Stop it" it means the warden is already running. This can happen, for
example, when you reboot the machine. Use /etc/init.d/mapr-warden stop to stop it, then start it again.
11. On node 1, give full permission to the chosen administrative user using the following command:

/opt/mapr/bin/maprcli acl edit -type cluster -user <user>:fc


Tips

The Warden can take a few minutes to start. If you see the error "Couldn't connect to the CLDB service," wait a few minutes
and try again.
12. On a machine that is connected to the cluster and to the Internet, perform the following steps to install the license:
In a browser, view the MapR Control System by navigating to the node that is running the WebServer:
https://<node 1>:8443
Your computer won't have an HTTPS certificate yet, so the browser will warn you that the connection is not trustworthy. You can
ignore the warning this time.
The first time MapR starts, you must accept the agreement and choose whether to enable the MapR Dial Home service.
Log in to the MapR Control System as the administrative user you designated earlier.
In the navigation pane of the MapR Control System, expand the System Settings group and click MapR Licenses to display the
MapR License Management dialog.
Click Add Licenses via Web.
If the cluster is already registered, the license is applied automatically. Otherwise, click OK to register the cluster on MapR.com
and follow the instructions there.
If the cluster is not yet registered, the message "Cluster not found" appears and the browser is redirected to a
registration page.
On the registration page, create an account and log in.
On the Register Cluster page, choose M5 Trial and click Register.
When the message "Cluster Registered" appears, click Return to your MapR Cluster UI.
13. On node 1, execute the following command:

/opt/mapr/bin/maprcli node services -nodes <node 1> -nfs start

14. On all other nodes (nodes 2...n), execute the following command:

/etc/init.d/mapr-warden start

15. Log in to the MapR Control System.


16. Under the Cluster group in the left pane, click Dashboard.
17. Check the Services pane and make sure each service is running the correct number of instances:
Instances of the FileServer, NFS, and TaskTracker on all nodes
3 instances of the CLDB
1 of 3 instances of the JobTracker
1 instance of the WebServer

Next Steps

Start Working with Volumes

MapR provides volumes as a way to organize data into groups, so that you can manage your data and apply policy all at once instead of file by
file. Think of a volume as being similar to a huge hard drive---it can be mounted or unmounted, belong to a specific department or user, and have
permissions set as a whole or on any directory or file within. In this section, you will create a volume that you can use for later parts of the tutorial.

Create a volume:

1. In the Navigation pane, click Volumes in the MapR-FS group.


2. Click the New Volume button to display the New Volume dialog.
3. For the Volume Type, select Standard Volume.
4. Type the name MyVolume in the Volume Name field.
5. Type the path /myvolume in the Mount Path field.
6. Select /default-rack in the Topology field.
7. Scroll to the bottom and click OK to create the volume.

Mount the Cluster via NFS

With MapR, you can export and mount the Hadoop cluster as a read/write volume via NFS from the machine where you installed MapR, or from a
different machine.

If you are mounting from the machine where you installed MapR, replace <host> in the steps below with localhost
If you are mounting from a different machine, make sure the machine where you installed MapR is reachable over the network and
replace <host> in the steps below with the hostname of the machine where you installed MapR.

Try the following steps to see how it works:

1. Change to the root user (or use sudo for the following commands).
2. See what is exported from the machine where you installed MapR:

showmount -e <host>

3. Set up a mount point for the NFS share:

mkdir /mapr

4. Mount the cluster via NFS:

mount <host>:/mapr /mapr

Tips
If you get an error such as RPC: Program not registered it likely means that the NFS service is not running. See Setting Up NFS
for tips.
5. To see the cluster, list the /mapr directory:

# ls /mapr
my.cluster.com

6. List the cluster itself, and notice that the volume you created is there:

# ls -l /mapr/my.cluster.com
Found 4 items
drwxrwxrwx - root root 0 2011-11-22 12:44 /myvolume
drwxr-xr-x - mapr mapr 0 2011-01-03 13:50 /tmp
drwxr-xr-x - mapr mapr 0 2011-01-04 13:57 /user
drwxr-xr-x - root root 0 2010-11-25 09:41 /var

7. Try creating a directory in your new volume via NFS:

mkdir /mapr/my.cluster.com/myvolume/foo

8. List the contents of /myvolume:

hadoop fs -ls /myvolume

Notice that Hadoop can see the directory you just created with NFS. Try navigating to the cluster using the computer's file browser --- you can
drag and drop files directly to your new volume, and see them immediately in Hadoop!

If you are already running an NFS server, MapR will not run its own NFS gateway. In that case, you will not be able to mount the
cluster via NFS, but your previous NFS exports will remain available.

Try MapReduce

In this section, you will run the well-known Word Count MapReduce example. You'll need one or more text files. The Word Count program reads
files from an input directory, counts the words, and writes the results of the job to files in an output directory. For this exercise we will use
/myvolume/in for the input, and /myvolume/out for the output. The input directory must exist and must contain the input files before running
the job; the output directory must not exist, as the Word Count example creates it.

1. Open a terminal (select Applications > Accessories > Terminal)


2. Copy a couple of text files into the cluster, either using the file browser or the command line. Create the directory /myvolume/in and put
the files there. Example:

mkdir /mapr/my.cluster.com/myvolume/in
cp <some files> /mapr/my.cluster.com/myvolume/in

3. Type the following line to run the Word Count job:


hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount /myvolume/in /myvolume/out

4. Look in the newly-created /myvolume/out for a file called part-r-00000 containing the results.
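
If you prefer not to stage the input through NFS (step 2 above), the same input directory can be created with the hadoop command instead. A sketch (the file names are placeholders):

# create the input directory and copy local text files into the cluster
hadoop fs -mkdir /myvolume/in
hadoop fs -put <some files> /myvolume/in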

Our Partners
MapR works with the leaders in the Hadoop ecosystem to provide the most powerful data analysis solutions. For more information about our
partners, take a look at the following pages:

Datameer
Karmasphere

Datameer
Datameer provides the world's first business intelligence platform built natively for Hadoop. Datameer delivers powerful, self-service analytics for
the BI user through a simple spreadsheet UI, along with point-and-click data integration (ETL) and data visualization capabilities.

MapR provides a pre-packaged version of Datameer Analytics Solution ("DAS"). DAS is delivered as an RPM or Debian package.

See How to setup DAS on MapR to add the DAS package to your MapR environment.
Visit Demos for MapR to explore several demos included in the package to illustrate the usage of DAS in behavioral analytics and IT
systems management use cases.
Check out the library of video tutorials with step-by-step walk-throughs on how to use DAS, and demo videos showing various
applications.

If you have questions about using DAS, please visit the DAS documentation. For information about Datameer, please visit www.datameer.com.

Karmasphere
Karmasphere provides software products for data analysts and data professionals so they can unlock the power of Big Data in Hadoop, opening a
whole new world of possibilities to add value to the business. Karmasphere equips analysts with the ability to discover new patterns, relationships,
and drivers in any kind of data – unstructured, semi-structured or structured - that were not possible to find before.

The Karmasphere Big Data Analytics product line supports the MapR distributions, M3 and M5 Editions, and includes:

Karmasphere Analyst, which provides data analysts immediate entry to structured and unstructured data on Hadoop, through SQL and
other familiar languages, so that they can make ad-hoc queries, interact with the results, and iterate – without the aid of IT.
Karmasphere Studio, which provides developers that support analytic teams a graphical environment to analyze their MapReduce code
as they develop custom analytic algorithms and systematize the creation of meaningful datasets for analysts.

To get started with Karmasphere Analyst or Studio:

Request a 30-day trial of Karmasphere Analyst or Studio for MapR


Learn more about Karmasphere Big Data Analytics products
View videos about Karmasphere Big Data Analytics products
Access technical resources
Read documentation for Karmasphere products

If you have questions about Karmasphere please email info@karmasphere.com or visit www.karmasphere.com.

Installation Guide

Getting Started
If installing a new cluster, make sure to install the latest version of MapR software. If applying a new license to an existing MapR
cluster, make sure to upgrade to the latest version of MapR first. If you are not sure, check the contents of the file
MapRBuildVersion in the /opt/mapr directory. If the version is 1.0.0 and includes GA then you must upgrade before
applying a license. Example:

# cat /opt/mapr/MapRBuildVersion
1.0.0.10178GA-0v

For information about upgrading the cluster, see Cluster Upgrade.

To get started installing a basic cluster, take a look at the Quick Start guides:

M3 - RHEL or CentOS
M3 - Ubuntu
M5 - RHEL or CentOS
M5 - Ubuntu

To design and configure a cluster from the ground up, perform the following steps:

1. CHOOSE a cluster name, if desired.


2. PREPARE all nodes, making sure they meet the hardware, software, and configuration requirements.
3. PLAN which services to run on which nodes in the cluster.
4. INSTALL MapR Software:
On all nodes, ADD the MapR Repository.
On each node, INSTALL the planned MapR services.
On all nodes, RUN configure.sh.
On all nodes, FORMAT disks for use by MapR.
START the cluster.
SET UP node topology.
SET UP NFS for HA.
5. CONFIGURE the cluster:
SET UP the administrative user.
CHECK that the correct services are running.
SET UP authentication.
CONFIGURE cluster email settings.
CONFIGURE permissions.
SET user quotas.
CONFIGURE alarm notifications.

More Information
Once the cluster is up and running, you will find the following documents useful:

Integration with Other Tools - guides to third-party tool integration with MapR
Setting Up the Client - set up a laptop or desktop to work directly with a MapR cluster
Uninstalling MapR - completely remove MapR software
Cluster Upgrade - upgrade an entire cluster to the latest version of MapR software
Datameer Trial - set up the pre-packaged trial version of Datameer Analytics Solution

Architecture
MapR is a complete Hadoop distribution, implemented as a number of services running on individual nodes in a cluster. In a typical cluster, all or
nearly all nodes are dedicated to data processing and storage, and a smaller number of nodes run other services that provide cluster coordination
and management. The following table shows the services corresponding to roles in a MapR cluster.

CLDB Maintains the container location database (CLDB) and the MapR Distributed NameNode™. The CLDB maintains the MapR
FileServer storage (MapR-FS) and is aware of all the NFS and FileServer nodes in the cluster. The CLDB process
coordinates data storage services among MapR FileServer nodes, MapR NFS Gateways, and MapR Clients.

FileServer Runs the MapR FileServer (MapR-FS) and MapR Lockless Storage Services™.

HBaseMaster HBase master (optional). Manages the region servers that make up HBase table storage.

HRegionServer HBase region server (used with HBase master). Provides storage for an individual HBase region.
JobTracker Hadoop JobTracker. The JobTracker coordinates the execution of MapReduce jobs by assigning tasks to TaskTracker
nodes and monitoring their execution.

NFS Provides read-write MapR Direct Access NFS™ access to the cluster, with full support for concurrent read and write access.
With NFS running on multiple nodes, MapR can use virtual IP addresses to provide automatic transparent failover, ensuring
high availability (HA).

TaskTracker Hadoop TaskTracker. The process that starts and tracks MapReduce tasks on a node. The TaskTracker registers with the
JobTracker to receive task assignments, and manages the execution of tasks on a node.

WebServer Runs the MapR Control System and provides the MapR Heatmap™.

Zookeeper Enables high availability (HA) and fault tolerance for MapR clusters by providing coordination.

A process called the warden runs on all nodes to manage, monitor, and report on the other services on each node. The MapR cluster uses
ZooKeeper to coordinate services. ZooKeeper runs on an odd number of nodes (at least three, and preferably five or more) and prevents service
coordination conflicts by enforcing a rigid set of rules and conditions that determine which instance of each service is the master. The warden will
not start any services unless ZooKeeper is reachable and more than half of the configured ZooKeeper nodes are live.
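
To see whether a node's ZooKeeper is participating in a live quorum, you can query its status. A sketch, assuming the mapr-zookeeper init script accepts the qstatus argument:

# report whether this node's ZooKeeper is running as leader or follower
/etc/init.d/mapr-zookeeper qstatus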

Hadoop Compatibility
MapR provides the following packages:

Apache Hadoop 0.20.2


flume-0.9.3
hbase-0.90.2
hive-0.7.0
oozie-3.0.0
pig-0.8
sqoop-1.2.0
whirr-0.3.0

For more information, see Hadoop Compatibility in Version 1.1.

Requirements
Before setting up a MapR cluster, ensure that every node satisfies the following hardware and software requirements, and consider which MapR
license will provide the features you need.

If you are setting up a large cluster, it is a good idea to use a configuration management tool such as Puppet or Chef, or a parallel ssh tool, to
facilitate the installation of MapR packages across all the nodes in the cluster. The following sections provide details about the prerequisites for
setting up the cluster.

Node Hardware
Minimum Requirements

64-bit processor
4 GB DRAM
1 network interface
At least one free unmounted drive or partition, 100 GB or more
At least 10 GB of free space on the operating system partition
Twice as much swap space as RAM (if this is not possible, see Memory Overcommit)

Recommended

64-bit processor with 8-12 cores
32 GB DRAM or more
2 GigE network interfaces
3-12 disks of 1-3 TB each
At least 20 GB of free space on the operating system partition
32 GB swap space or more (see also: Memory Overcommit)

In practice, it is useful to have 12 or more disks per node, not only for greater total storage but also to provide a larger number of storage pools
available. If you anticipate a lot of big reduces, you will need additional network bandwidth in relation to disk I/O speeds. MapR can detect multiple
NICs with multiple IP addresses on each node and manage network throughput accordingly to maximize bandwidth. In general, the more network
bandwidth you can provide, the faster jobs will run on the cluster. When designing a cluster for heavy CPU workloads, the processor on each
node is more important than networking bandwidth and available disk space.

Disks
Set up at least three unmounted drives or partitions, separate from the operating system drives or partitions, for use by MapR-FS. For information
on setting up disks for MapR-FS, see Setting Up Disks for MapR. If you do not have disks available for MapR, or to test with a small installation,
you can use a flat file instead.

It is not necessary to set up RAID on disks used by MapR-FS. MapR uses a script called disksetup to set up storage pools. In most cases, you
should let MapR calculate storage pools using the default stripe width of two or three disks. If you anticipate a high volume of random-access I/O,
you can use the -W option with disksetup to specify larger storage pools of up to 8 disks each.
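
For example, a hedged invocation that lets disksetup build wider storage pools (run it only after configure.sh, against the disks listed in /tmp/disks.txt):

# format the listed disks into storage pools of up to 6 disks each
/opt/mapr/server/disksetup -W 6 -F /tmp/disks.txt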

You can set up RAID on the operating system partition(s) or drive(s) at installation time, to provide higher operating system performance (RAID
0), disk mirroring for failover (RAID 1), or both (RAID 10), for example. See the following instructions from the operating system websites:

CentOS
Red Hat
Ubuntu

Software
Install a compatible 64-bit operating system on all nodes. MapR currently supports the following operating systems:

64-bit CentOS 5.4 or greater


64-bit Red Hat 5.4 or greater
64-bit Ubuntu 9.04 or greater

Each node must also have the following software installed:

Java JDK version 1.6.0_24 (not JRE)

If Java is already installed, check which versions of Java are installed: java -version
If JDK 6 is installed, the output will include a version number starting with 1.6, and then below that the text Java(TM). Example:

java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)

Use update-alternatives to make sure JDK 6 is the default Java: sudo update-alternatives --config java
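
Put together, a quick per-node check might look like this sketch:

# confirm that JDK 6 is installed and is the default java
java -version                           # expect a version string beginning with 1.6
sudo update-alternatives --config java  # select the 1.6 JDK if several versions are listed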

Configuration
Each node must be configured as follows:

Unique hostname
Keyless SSH set up between all nodes
SELinux disabled
Able to perform forward and reverse host name resolution with every other node in the cluster
Administrative user - a Linux user chosen to have administrative privileges on the cluster
Make sure the user has a password (using sudo passwd <user> for example)

In VM environments like EC2, VMware, and Xen, when running Ubuntu 10.10, problems can occur due to an Ubuntu bug unless the IRQ balancer
is turned off. On all nodes, edit the file /etc/default/irqbalance and set ENABLED=0 to turn off the IRQ balancer (requires reboot to take
effect).
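
A hedged sketch of how these prerequisites might be checked and set up on a node (the user and host names are placeholders, and getenforce applies to RHEL/CentOS):

# check the node's hostname and SELinux state
hostname -f
getenforce                       # should not report Enforcing
# set up keyless SSH from this node to the other cluster nodes
ssh-keygen -t rsa                # accept the defaults
ssh-copy-id <user>@<other node>  # repeat for every other node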

NTP
To keep all cluster nodes time-synchronized, MapR requires NTP to be configured and running on every node. If server clocks in the cluster drift
out of sync, serious problems will occur with HBase and other MapR services. MapR raises a Time Skew alarm on any out-of-sync nodes. See
http://www.ntp.org/ for more information about obtaining and installing NTP. In the event that a large adjustment must be made to the time on a
particular node, you should stop ZooKeeper on the node, then adjust the time, then restart ZooKeeper.

DNS Resolution
For MapR to work properly, all nodes on the cluster must be able to communicate with each other. Each node must have a unique hostname, and
must be able to resolve all other hosts with both normal and reverse DNS name lookup.

You can use the hostname command on each node to check the hostname. Example:

$ hostname -f
swarm
If the command returns a hostname, you can use the getent command to check whether the hostname exists in the hosts database. The
getent command should return a valid IP address on the local network, associated with a fully-qualified domain name for the host. Example:

$ getent hosts `hostname`
10.250.1.53 swarm.corp.example.com

If you do not get the expected output from the hostname command or the getent command, correct the host and DNS settings on the node. A
common problem is an incorrect loopback entry (127.0.x.x), which prevents the correct IP address from being assigned to the hostname.

Pay special attention to the format of /etc/hosts. For more information, see the hosts(5) man page. Example:

127.0.0.1 localhost
10.10.5.10 mapr-hadoopn.maprtech.prv mapr-hadoopn

Users and Groups


MapR uses each node's native operating system configuration to authenticate users and groups for access to the cluster. If you are deploying a
large cluster, you should consider configuring all nodes to use LDAP or another user management system. You can use the MapR Control
System to give specific permissions to particular users and groups. For more information, see Managing Permissions. Each user can be restricted
to a specific amount of disk usage. For more information, see Managing Quotas.

All nodes in the cluster must have the same set of users and groups, with the same uid and gid numbers on all nodes:

When adding a user to a cluster node, specify the --uid option with the useradd command to guarantee that the user has the same
uid on all machines.
When adding a group to a cluster node, specify the --gid option with the groupadd command to guarantee that the group has the
same gid on all machines.
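
For instance, a sketch of creating a matching administrative user and group on every node (the names and numeric IDs below are examples only):

# run on every node so the uid and gid match across the cluster
groupadd --gid 5000 mapr
useradd --uid 5000 --gid 5000 -m -s /bin/bash mapr
passwd mapr   # give the administrative user a password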

Choose a specific user to be the administrative user for the cluster. By default, MapR gives the user root full administrative permissions. If the
nodes do not have an explicit root login (as is sometimes the case with Ubuntu, for example), you can give full permissions to the chosen
administrative user after deployment. See Cluster Configuration.

On the node where you plan to run the mapr-webserver (the MapR Control System), install Pluggable Authentication Modules (PAM). See PAM
Configuration.

Network Ports
The following table lists the network ports that must be open for use by MapR.

Service Port

SSH 22

NFS 2049

MFS server 5660

ZooKeeper 5181

CLDB web port 7221

CLDB 7222

Web UI HTTP 8080 (set by user)

Web UI HTTPS 8443 (set by user)

JobTracker 9001

NFS monitor (for HA) 9997

NFS management 9998

JobTracker web 50030

TaskTracker web 50060


HBase Master 60000

LDAP Set by user

SMTP Set by user

The MapR UI runs on Apache. By default, installation does not close port 80 (even though the MapR Control System is available over HTTPS on
port 8443). If this would present a security risk to your datacenter, you should close port 80 manually on any nodes running the MapR Control
System.
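
One hedged way to close the port on a node that uses iptables (a sketch; adapt it to your firewall tooling and make the rule persistent in your distribution's usual way):

# reject inbound connections to port 80 on this node
iptables -A INPUT -p tcp --dport 80 -j REJECT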

Licensing
Before installing MapR, consider the capabilities you will need and make sure you have obtained the corresponding license. If you need NFS,
data protection with snapshots and mirroring, or plan to set up a cluster with high availability (HA), you will need an M5 license. You can obtain
and install a license through the License Manager after installation. For more information about which features are included in each license type,
see MapR Editions.

If installing a new cluster, make sure to install the latest version of MapR software. If applying a new license to an existing MapR
cluster, make sure to upgrade to the latest version of MapR first. If you are not sure, check the contents of the file
MapRBuildVersion in the /opt/mapr directory. If the version is 1.0.0 and includes GA then you must upgrade before
applying a license. Example:

# cat /opt/mapr/MapRBuildVersion
1.0.0.10178GA-0v

For information about upgrading the cluster, see Cluster Upgrade.

PAM Configuration
MapR uses Pluggable Authentication Modules (PAM) for user authentication in the MapR Control System. Make sure PAM is installed and
configured on the node running the mapr-webserver.

There are typically several PAM modules (profiles), configurable via configuration files in the /etc/pam.d/ directory. Each standard UNIX
program normally installs its own profile. MapR can use (but does not require) its own mapr-admin PAM profile. The MapR Control System
webserver tries the following three profiles in order:

1. mapr-admin (Expects that user has created the /etc/pam.d/mapr-admin profile)


2. sudo (/etc/pam.d/sudo)
3. sshd (/etc/pam.d/sshd)

The profile configuration file (for example, /etc/pam.d/sudo) should contain an entry corresponding to the authentication scheme used by your
system. For example, if you are using local OS authentication, check for the following entry:

auth sufficient pam_unix.so # For local OS Auth

Example: Configuring PAM with mapr-admin

Although there are several viable ways to configure PAM to work with the MapR UI, we recommend using the mapr-admin profile. The following
example shows how to configure the file. If LDAP is not configured, comment out the LDAP lines.

Example /etc/pam.d/mapr-admin file


account required pam_unix.so
account sufficient pam_succeed_if.so uid < 1000 quiet
account [default=bad success=ok user_unknown=ignore] pam_ldap.so
account required pam_permit.so

auth sufficient pam_unix.so nullok_secure
auth requisite pam_succeed_if.so uid >= 1000 quiet
auth sufficient pam_ldap.so use_first_pass
auth required pam_deny.so

password sufficient pam_unix.so md5 obscure min=4 max=8 nullok try_first_pass
password sufficient pam_ldap.so
password required pam_deny.so

session required pam_limits.so
session required pam_unix.so
session optional pam_ldap.so

The following sections provide information about configuring PAM to work with LDAP or Kerberos.

The file /etc/pam.d/sudo should be modified only with care and only when absolutely necessary.

LDAP

To configure PAM with LDAP:

1. Install the appropriate PAM packages:


On Ubuntu, sudo apt-get install libpam-ldap
On Redhat/Centos, sudo yum install pam_ldap
2. Open /etc/pam.d/sudo and check for the following line:

auth sufficient pam_ldap.so # For LDAP Auth

Kerberos

To configure PAM with Kerberos:

1. Install the appropriate PAM packages:


On Redhat/Centos, sudo yum install pam_krb5
On Ubuntu, sudo apt-get install libpam-krb5
2. Open /etc/pam.d/sudo and check for the following line:

auth sufficient pam_krb5.so # For kerberos Auth

Setting Up Disks for MapR


In a production environment, or when testing performance, MapR should be configured to use physical hard drives and partitions. In some cases,
it is necessary to reinstall the operating system on a node so that the physical hard drives are available for direct use by MapR. Reinstalling the
operating system provides an unrestricted opportunity to configure the hard drives. If the installation procedure assigns hard drives to be
managed by the Linux Logical Volume manger (LVM) by default, you should explicitly remove from LVM configuration the drives you plan to use
with MapR. It is common to let LVM manage one physical drive containing the operating system partition(s) and to leave the rest unmanaged by
LVM for use with MapR.

To determine if a disk or partition is ready for use by MapR:

1. Run the command sudo lsof <partition> to determine whether any processes are already using the disk or partition.
2. There should be no output when running sudo fuser <partition>, indicating there is no process accessing the specific disk or
partition.
3. The disk or partition should not be mounted, as checked via the output of the mount command.
4. The disk or partition should not have an entry in the /etc/fstab file.
5. The disk or partition should be accessible to standard Linux tools such as mkfs. You should be able to successfully format the partition
using a command like sudo mkfs.ext3 <partition> as this is similar to the operations MapR performs during installation. If mkfs
fails to access and format the partition, then it is highly likely MapR will encounter the same problem.

Any disk or partition that passes the above testing procedure can be added to the list of disks and partitions passed to the disksetup command.
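
The testing procedure above can be strung together as a short sequence of commands. A sketch for one candidate partition (the device name is an example, and the final mkfs step is destructive):

DEV=/dev/sdc1          # example device to test
sudo lsof $DEV         # step 1: expect no output
sudo fuser $DEV        # step 2: expect no output
mount | grep $DEV      # step 3: expect no output (not mounted)
grep $DEV /etc/fstab   # step 4: expect no entry
sudo mkfs.ext3 $DEV    # step 5: destructive; only on a disk you intend to give to MapR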

To specify disks or partitions for use by MapR:

Create a text file /tmp/disks.txt listing disks and partitions for use by MapR. Each line lists a single disk, or partitions on a single
disk. Example:

/dev/sdb
/dev/sdc1 /dev/sdc2 /dev/sdc4
/dev/sdd

Later, when you run disksetup to format the disks, specify the disks and partitions file. Example:

disksetup -F /tmp/disks.txt

You should run disksetup only after running configure.sh.

To test without formatting physical disks:

If you do not have physical partitions or disks available for reformatting, you can test MapR by creating a flat file and including a path to the file in
the disk list file. You should create at least a 4GB file or larger.

The following example creates a 20 GB flat file (bs=1G specifies 1 gigabyte, multiply by count=20):

$ dd if=/dev/zero of=/root/storagefile bs=1G count=20

Using the above example, you would add the following to /tmp/disks.txt:

/root/storagefile

Working with a Logical Volume Manager

The Logical Volume Manager creates symbolic links to each logical volume's block device, from a directory path in the form: /dev/<volume
group>/<volume name>. MapR needs the actual block location, which you can find by using the ls -l command to list the symbolic links.

1. Make sure you have free, unmounted logical volumes for use by MapR:
Unmount any mounted logical volumes that can be erased and used for MapR.
Allocate any free space in an existing logical volume group to new logical volumes.
2. Make a note of the volume group and volume name of each logical volume.
3. Use ls -l with the volume group and volume name to determine the path of each logical volume's block device. Each logical volume is
a symbolic link to a logical block device from a directory path that uses the volume group and volume name: /dev/<volume
group>/<volume name>
The following example shows output that represents a volume group named mapr containing logical volumes named mapr1, mapr2,
mapr3, and mapr4:

# ls -l /dev/mapr/mapr*
lrwxrwxrwx 1 root root 22 Apr 12 21:48 /dev/mapr/mapr1 -> /dev/mapper/mapr-mapr1
lrwxrwxrwx 1 root root 22 Apr 12 21:48 /dev/mapr/mapr2 -> /dev/mapper/mapr-mapr2
lrwxrwxrwx 1 root root 22 Apr 12 21:48 /dev/mapr/mapr3 -> /dev/mapper/mapr-mapr3
lrwxrwxrwx 1 root root 22 Apr 12 21:48 /dev/mapr/mapr4 -> /dev/mapper/mapr-mapr4

4. Create a text file /tmp/disks.txt containing the paths to the block devices for the logical volumes (one path on each line). Example:

$ cat /tmp/disks.txt
/dev/mapper/mapr-mapr1
/dev/mapper/mapr-mapr2
/dev/mapper/mapr-mapr3
/dev/mapper/mapr-mapr4

5. Pass disks.txt to disksetup

# sudo /opt/mapr/server/disksetup -F /tmp/disks.txt

Planning the Deployment


Planning a MapR deployment involves determining which services to run in the cluster and where to run them. The majority of nodes are worker
nodes, which run the TaskTracker and MapR-FS services for data processing. A few nodes run control services that manage the cluster and
coordinate MapReduce jobs.

The following table provides general guidelines for the number of instances of each service to run in a cluster:

Service Package How Many

CLDB mapr-cldb 1-3

FileServer mapr-fileserver Most or all nodes

HBase Master mapr-hbase-master 1-3

HBase RegionServer mapr-hbase-regionserver Varies

JobTracker mapr-jobtracker 1-3

NFS mapr-nfs Varies

TaskTracker mapr-tasktracker Most or all nodes

WebServer mapr-webserver One or more

Zookeeper mapr-zookeeper 1, 3, 5, or a higher odd number

Sample Configurations
The following sections describe a few typical ways to deploy a MapR cluster.

Small M3 Cluster

A small M3 cluster runs most control services on only one node (except for ZooKeeper, which runs on three nodes) and data services on the
remaining nodes. The M3 license does not permit failover or high availability, and only allows one running CLDB.
Medium M3 Cluster

A medium M3 cluster runs most control services on only one node (except for ZooKeeper, which runs on three nodes on different racks) and data
services on the remaining nodes. The M3 license does not permit failover or high availability, and only allows one running CLDB.

Small M5 Cluster

A small M5 cluster of only a few nodes runs control services on three nodes and data services on the remaining nodes, providing failover and high
availability for all critical services.

Medium M5 Cluster

A medium M5 cluster runs control services on three nodes and data services on the remaining nodes, providing failover and high availability for all
critical services.
Large M5 Cluster

A large cluster (over 100 nodes) should isolate CLDB nodes from the TaskTracker and NFS nodes.

In large clusters, you should not run TaskTracker and ZooKeeper together on any nodes.

If running the TaskTracker on CLDB or ZooKeeper nodes is unavoidable, certain settings in mapred-site.xml should be configured differently
to protect critical cluster-wide services from resource shortages. In particular, the number of slots advertised by the TaskTracker should be
reduced and less memory should be reserved for MapReduce tasks. Edit
/opt/mapr/hadoop/hadoop-<version>/conf/mapred-site.xml and set the following parameters:

mapred.tasktracker.map.tasks.maximum=(CPUS > 2) ? (CPUS * 0.50) : 1
mapred.tasktracker.reduce.tasks.maximum=(CPUS > 2) ? (CPUS * 0.25) : 1
mapreduce.tasktracker.prefetch.maptasks=0.25
mapreduce.tasktracker.reserved.physicalmemory.mb.low=0.50
mapreduce.tasktracker.task.slowlaunch=true

Planning NFS
The mapr-nfs service lets you access data on a licensed MapR cluster via the NFS protocol:

M3 license: one NFS node


M5 license: multiple NFS nodes with VIPs for failover and load balancing

You can mount the MapR cluster via NFS and use standard shell scripting to read and write live data in the cluster. NFS access to cluster data
can be faster than accessing the same data with the hadoop commands. To mount the cluster via NFS from a client machine, see Setting Up the
Client.

NFS Setup Tips


Before using the MapR NFS Gateway, here are a few helpful tips:

Ensure the stock Linux NFS service is stopped, as Linux NFS and MapR NFS will conflict with each other.
Ensure portmapper is running (Example: ps a | grep portmap).
Make sure you have installed the mapr-nfs package. If you followed the Quick Start - Single Node or Quick Start - Small Cluster
instructions, then it is installed. You can check by listing the /opt/mapr/roles directory and checking for nfs in the list.
Make sure you have applied an M3 license or an M5 (paid or trial) license to the cluster. See Getting a License.
Make sure the NFS service is started (see Services).
For information about mounting the cluster via NFS, see Setting Up the Client.
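
A hedged sketch of those pre-flight checks on an NFS gateway node (Red Hat-style service names are assumed for the stock NFS service):

service nfs status                    # the stock Linux NFS service should be stopped
ps ax | grep portmap                  # the portmapper should be running
ls /opt/mapr/roles                    # "nfs" appears here if mapr-nfs is installed
/opt/mapr/bin/maprcli license list    # confirm an M3 or M5 license is applied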

NFS on an M3 Cluster
At installation time, choose one node on which to run the NFS gateway. NFS is lightweight and can be run on a node running services such as
CLDB or ZooKeeper. To add the NFS service to a running cluster, use the instructions in Adding Roles to install the mapr-nfs package on the
node where you would like to run NFS.

NFS on an M5 Cluster
At cluster installation time, plan which nodes should provide NFS access according to your anticipated traffic. For instance, if you need 5Gbps of
write throughput and 5Gbps of read throughput, here are a few ways to set up NFS:

12 NFS nodes, each of which has a single 1Gbe connection


6 NFS nodes, each of which has a dual 1Gbe connection
4 NFS nodes, each of which has a quad 1Gbe connection

You can also set up NFS on all file server nodes, so each node can NFS-mount itself and native applications can run as tasks, or on one or more
dedicated gateways outside the cluster (using round-robin DNS or behind a hardware load balancer) to allow controlled access.

You can set up virtual IP addresses (VIPs) for NFS nodes in an M5-licensed MapR cluster, for load balancing or failover. VIPs provide multiple
addresses that can be leveraged for round-robin DNS, allowing client connections to be distributed among a pool of NFS nodes. VIPs also make
high availability (HA) NFS possible; in the event an NFS node fails, data requests are satisfied by other NFS nodes in the pool. You should use a
minimum of one VIP per NFS node per NIC that clients will use to connect to the NFS server. If you have four nodes with four NICs each, with
each NIC connected to an individual IP subnet, use a minimum of 16 VIPs and direct clients to the VIPs in round-robin fashion. The VIPs should
be in the same IP subnet as the interfaces to which they will be assigned.

Here are a few tips:

Set up NFS on at least three nodes if possible.


All NFS nodes must be accessible over the network from the machines where you want to mount them.
To serve a large number of clients, set up dedicated NFS nodes and load-balance between them. If the cluster is behind a firewall, you
can provide access through the firewall via a load balancer instead of direct access to each NFS node. You can run NFS on all nodes in
the cluster, if needed.
To provide maximum bandwidth to a specific client, install the NFS service directly on the client machine. The NFS gateway on the client
manages how data is sent in or read back from the cluster, using all its network interfaces (that are on the same subnet as the cluster
nodes) to transfer data via MapR APIs, balancing operations among nodes as needed.
Use VIPs to provide High Availability (HA) and failover. See Setting Up NFS HA for more information.

To add the NFS service to a running cluster, use the instructions in Adding Roles to install the mapr-nfs package on the nodes where you would
like to run NFS.

NFS Memory Settings


The memory allocated to each MapR service is specified in the /opt/mapr/conf/warden.conf file, which MapR automatically configures
based on the physical memory available on the node. You can adjust the minimum and maximum memory used for NFS, as well as the
percentage of the heap that it tries to use, by setting the percent, max, and min parameters in the warden.conf file on each NFS node.
Example:
...
service.command.nfs.heapsize.percent=3
service.command.nfs.heapsize.max=1000
service.command.nfs.heapsize.min=64
...

The percentages need not add up to 100; in fact, you can use less than the full heap by setting the heapsize.percent parameters for all
services to add up to less than 100% of the heap size. In general, you should not need to adjust the memory settings for individual services,
unless you see specific memory-related problems occurring.

Planning Services for HA


When properly licensed and configured for HA, the MapR cluster provides automatic failover for continuity throughout the stack. Configuring a
cluster for HA involves running redundant instances of specific services, and configuring NFS properly. In HA clusters, it is advisable to have 3
nodes run CLDB and 5 run ZooKeeper. In addition, 3 Hadoop JobTrackers and/or 3 HBase Masters are appropriate depending on the purpose of
the cluster. Any node or nodes in the cluster can run the MapR WebServer. In HA clusters, it is appropriate to run more than one instance of the
WebServer with a load balancer to provide failover. NFS can be configured for HA using virtual IP addresses (VIPs). For more information, see
Setting Up NFS HA.

The following are the minimum numbers of each service required for HA:

CLDB - 2 instances
ZooKeeper - 3 instances (to maintain a quorum in case one instance fails)
HBase Master - 2 instances
JobTracker - 2 instances
NFS - 2 instances

You should run redundant instances of important services on separate racks whenever possible, to provide failover if a rack goes down. For
example, the top server in each of three racks might be a CLDB node, the next might run ZooKeeper and other control services, and the
remainder of the servers might be data processing nodes. If necessary, use a worksheet to plan the services to run on each node in each rack.

Tips:

If you are installing a large cluster (100 nodes or more), CLDB nodes should not run any other service and should not contain any cluster
data (see Isolating CLDB Nodes).
In HA clusters, it is advisable to have 3 nodes run CLDB and 5 run ZooKeeper. In addition, 3 Hadoop JobTrackers and/or 3 HBase
Masters are appropriate depending on the purpose of the cluster.

Cluster Architecture
The architecture of the cluster hardware is an important consideration when planning a deployment. Among the considerations are anticipated
data storage and network bandwidth needs, including intermediate data generated during MapReduce job execution. The type of workload is
important: consider whether the planned cluster usage will be CPU-intensive, I/O-intensive, or memory-intensive. Think about how data will be
loaded into and out of the cluster, and how much data is likely to be transmitted over the network.

Typically, the CPU is less of a bottleneck than network bandwidth and disk I/O. To the extent possible, network and disk transfer rates should be
balanced to meet the anticipated data rates using multiple NICs per node. It is not necessary to bond or trunk the NICs together; MapR is able to
take advantage of multiple NICs transparently. Each node should provide raw disks and partitions to MapR, with no RAID or logical volume
manager, as MapR takes care of formatting and data protection.

Example Architecture

The following example architecture provides specifications for a standard compute/storage node for general purposes, and two sample rack
configurations made up of the standard nodes. MapR is able to make effective use of more drives per node than standard Hadoop, so each node
should present enough face plate area to allow a large number of drives. The standard node specification allows for either 2 or 4 1Gb/s ethernet
network interfaces.

Standard Compute/Storage Node

2U Chassis
Single motherboard, dual socket
2 x 4-core + 32 GB RAM or 2 x 6-core + 48 GB RAM
12 x 2 TB 7200-RPM drives
2 or 4 network interfaces
(on-board NIC + additional NIC)
OS on single partition on one drive (remainder of drive used for storage)
Standard 50TB Rack Configuration

10 standard compute/storage nodes


(10 x 12 x 2 TB storage; 3x replication, 25% margin)
24-port 1 Gb/s rack-top switch with 2 x 10Gb/s uplink
Add second switch if each node uses 4 network interfaces

Standard 100TB Rack Configuration

20 standard nodes
(20 x 12 x 2 TB storage; 3x replication, 25% margin)
48-port 1 Gb/s rack-top switch with 4 x 10Gb/s uplink
Add second switch if each node uses 4 network interfaces

To grow the cluster, just add more nodes and racks, adding additional service instances as needed. MapR rebalances the cluster automatically.

Setting Up NFS
The mapr-nfs service lets you access data on a licensed MapR cluster via the NFS protocol:

M3 license: one NFS node


M5 license: multiple NFS nodes with VIPs for failover and load balancing

You can mount the MapR cluster via NFS and use standard shell scripting to read and write live data in the cluster. NFS access to cluster data
can be faster than accessing the same data with the hadoop commands. To mount the cluster via NFS from a client machine, see Setting Up the
Client.

NFS Setup Tips

Before using the MapR NFS Gateway, here are a few helpful tips:

Ensure the stock Linux NFS service is stopped, as Linux NFS and MapR NFS will conflict with each other.
Ensure portmapper is running (Example: ps a | grep portmap).
Make sure you have installed the mapr-nfs package. If you followed the Quick Start - Single Node or Quick Start - Small Cluster
instructions, then it is installed. You can check by listing the /opt/mapr/roles directory and checking for nfs in the list.
Make sure you have applied an M3 license or an M5 (paid or trial) license to the cluster. See Getting a License.
Make sure the NFS service is started (see Services).
For information about mounting the cluster via NFS, see Setting Up the Client.
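For example, these checks might look like the following from a shell on the NFS node (a sketch; the stock NFS service name varies by Linux distribution):

# Stop the stock Linux NFS server if it is running
# (on Red Hat or CentOS the service is typically "nfs"; on Ubuntu it is typically "nfs-kernel-server")
/etc/init.d/nfs stop
# Verify that portmapper is running
ps ax | grep portmap
# Confirm that the mapr-nfs package is installed by checking for the nfs role
ls /opt/mapr/roles | grep nfs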

NFS on an M3 Cluster
At installation time, choose one node on which to run the NFS gateway. NFS is lightweight and can be run on a node running services such as
CLDB or ZooKeeper. To add the NFS service to a running cluster, use the instructions in Adding Roles to install the mapr-nfs package on the
node where you would like to run NFS.

NFS on an M5 Cluster
At cluster installation time, plan which nodes should provide NFS access according to your anticipated traffic. For instance, if you need 5Gbps of
write throughput and 5Gbps of read throughput, here are a few ways to set up NFS:

12 NFS nodes, each of which has a single 1GbE connection
6 NFS nodes, each of which has a dual 1GbE connection
4 NFS nodes, each of which has a quad 1GbE connection

You can also set up NFS on all file server nodes, so each node can NFS-mount itself and native applications can run as tasks, or on one or more
dedicated gateways outside the cluster (using round-robin DNS or behind a hardware load balancer) to allow controlled access.

You can set up virtual IP addresses (VIPs) for NFS nodes in an M5-licensed MapR cluster, for load balancing or failover. VIPs provide multiple
addresses that can be leveraged for round-robin DNS, allowing client connections to be distributed among a pool of NFS nodes. VIPs also make
high availability (HA) NFS possible; in the event an NFS node fails, data requests are satisfied by other NFS nodes in the pool. You should use a
minimum of one VIP per NFS node per NIC that clients will use to connect to the NFS server. If you have four nodes with four NICs each, with
each NIC connected to an individual IP subnet, use a minimum of 16 VIPs and direct clients to the VIPs in round-robin fashion. The VIPs should
be in the same IP subnet as the interfaces to which they will be assigned.

Here are a few tips:

Set up NFS on at least three nodes if possible.
All NFS nodes must be accessible over the network from the machines where you want to mount them.
To serve a large number of clients, set up dedicated NFS nodes and load-balance between them. If the cluster is behind a firewall, you
can provide access through the firewall via a load balancer instead of direct access to each NFS node. You can run NFS on all nodes in
the cluster, if needed.
To provide maximum bandwidth to a specific client, install the NFS service directly on the client machine. The NFS gateway on the client
manages how data is sent in or read back from the cluster, using all its network interfaces (that are on the same subnet as the cluster
nodes) to transfer data via MapR APIs, balancing operations among nodes as needed.
Use VIPs to provide High Availability (HA) and failover. See Setting Up NFS HA for more information.

To add the NFS service to a running cluster, use the instructions in Adding Roles to install the mapr-nfs package on the nodes where you would
like to run NFS.

NFS Memory Settings

The memory allocated to each MapR service is specified in the /opt/mapr/conf/warden.conf file, which MapR automatically configures
based on the physical memory available on the node. You can adjust the minimum and maximum memory used for NFS, as well as the
percentage of the heap that it tries to use, by setting the percent, max, and min parameters in the warden.conf file on each NFS node.
Example:

...
service.command.nfs.heapsize.percent=3
service.command.nfs.heapsize.max=1000
service.command.nfs.heapsize.min=64
...

The percentages need not add up to 100; in fact, you can use less than the full heap by setting the heapsize.percent parameters for all
services to add up to less than 100% of the heap size. In general, you should not need to adjust the memory settings for individual services,
unless you see specific memory-related problems occurring.

Installing MapR

Before performing these steps, make sure all nodes meet the Requirements, and that you have planned which services to run
on each node. You will need a list of the hostnames or IP addresses of all CLDB nodes, and the hostnames or IP addresses of
all ZooKeeper nodes.

Perform the following steps, starting the installation with the control nodes running CLDB and ZooKeeper:

1. On all nodes, ADD the MapR Repository.
2. On each node, INSTALL the planned MapR services.
3. On all nodes, RUN configure.sh.
4. On all nodes, FORMAT disks for use by MapR.
5. START the cluster.
6. SET UP node topology.
7. SET UP NFS for HA.

The following sections provide details about each step.

Adding the MapR Repository


The MapR repository provides all the packages you need to install and run a MapR cluster using native tools such as yum on Red Hat or CentOS,
or apt-get on Ubuntu. Perform the following steps on every node to add the MapR Repository for MapR Version 1.1.3.

To Add the Repository on Red Hat or CentOS:

1. Change to the root user (or use sudo for the following commands).
2. Create a text file called maprtech.repo in the directory /etc/yum.repos.d/ with the following contents:
[maprtech]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/v1.1.3/redhat/
enabled=1
gpgcheck=0
protect=1
To install a previous release, see the Release Notes for the correct path to use in the baseurl parameter.
3. If your connection to the Internet is through a proxy server, you must set the http_proxy environment variable before installation:
http_proxy=http://<host>:<port>
export http_proxy
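If you prefer, the repository file described in step 2 can be created in a single shell step; the following sketch (run as root) writes the same contents shown above:

cat > /etc/yum.repos.d/maprtech.repo << 'EOF'
[maprtech]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/v1.1.3/redhat/
enabled=1
gpgcheck=0
protect=1
EOF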

If you don't have Internet connectivity, do one of the following:

Set up a local repository.
Download and install packages manually.

To Add the Repository on Ubuntu:

1. Change to the root user (or use sudo for the following commands).
2. Add the following line to /etc/apt/sources.list:
deb http://package.mapr.com/releases/v1.1.3/ubuntu/ mapr optional
(To install a previous release, see the Release Notes for the correct path.)
3. If your connection to the Internet is through a proxy server, add the following lines to /etc/apt/apt.conf:

Acquire {
Retries "0";
HTTP {
Proxy "http://<user>:<password>@<host>:<port>";
};
};
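Similarly, the repository line from step 2 can be appended from the shell (a sketch; run as root):

echo "deb http://package.mapr.com/releases/v1.1.3/ubuntu/ mapr optional" >> /etc/apt/sources.list
apt-get update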

If you don't have Internet connectivity, do one of the following:

Set up a local repository.
Download and install packages manually.

Installing MapR Services


The following sections provide instructions for installing MapR.

To Install Services on Red Hat or CentOS:

1. Change to the root user (or use sudo for the following command).
2. Use the yum command to install the services planned for the node. Examples:
To install TaskTracker and MapR-FS: yum install mapr-tasktracker mapr-fileserver
To install CLDB, JobTracker, Webserver, and ZooKeeper: yum install mapr-cldb mapr-jobtracker
mapr-webserver mapr-zookeeper

To Install Services on Ubuntu:

1. Change to the root user (or use sudo for the following commands).
2. On all nodes, issue the following command to update the Ubuntu package list:

apt-get update

3. Use the apt-get command to install the services planned for the node. Examples:
To install TaskTracker and MapR-FS: apt-get install mapr-tasktracker mapr-fileserver
To install CLDB, JobTracker, Webserver, and ZooKeeper: apt-get install mapr-cldb mapr-jobtracker
mapr-webserver mapr-zookeeper

Running configure.sh
Run the script configure.sh to create /opt/mapr/conf/mapr-clusters.conf and update the corresponding *.conf and *.xml files.
Before performing this step, make sure you have a list of the hostnames of the CLDB and ZooKeeper nodes. Optionally, you can specify the ports
for the CLDB and ZooKeeper nodes as well. If you do not specify them, the default ports are:

CLDB – 7222
ZooKeeper – 5181

The script configure.sh takes an optional cluster name and log file, and comma-separated lists of CLDB and ZooKeeper host names or IP
addresses (and optionally ports), using the following syntax:
/opt/mapr/server/configure.sh -C <host>[:<port>][,<host>[:<port>]...] -Z
<host>[:<port>][,<host>[:<port>]...] [-L <logfile>][-N <cluster name>]

Example:

/opt/mapr/server/configure.sh -C r1n1.sj.us:7222,r3n1.sj.us:7222,r5n1.sj.us:7222 -Z
r1n1.sj.us:5181,r2n1.sj.us:5181,r3n1.sj.us:5181,r4n1.sj.us:5181,r5n1.sj.us:5181 -N MyCluster

If you have not chosen a cluster name, you can run configure.sh again later to rename the cluster.

Formatting the Disks


On all nodes, use the following procedure to format disks and partitions for use by MapR.

This procedure assumes you have free, unmounted physical partitions or hard disks for use by MapR. If you are not sure,
please read Setting Up Disks for MapR.

Create a text file /tmp/disks.txt listing disks and partitions for use by MapR. Each line lists a single disk, or partitions on a single
disk. Example:

/dev/sdb
/dev/sdc1 /dev/sdc2 /dev/sdc4
/dev/sdd

Later, when you run disksetup to format the disks, specify the disks and partitions file. Example:

disksetup -F /tmp/disks.txt

The script disksetup removes all data from the specified disks. Make sure you specify the disks correctly, and that any data
you wish to keep has been backed up elsewhere. Before following this procedure, make sure you have backed up any data you
wish to keep.

1. Change to the root user (or use sudo for the following command).
2. Run disksetup, specifying the disk list file.
Example:

/opt/mapr/server/disksetup -F /tmp/disks.txt

Bringing Up the Cluster


In order to configure the administrative user and license, bring up the CLDB, MapR Control System, and ZooKeeper; once that is done, bring up
the other nodes. You will need the following information:

A list of nodes on which mapr-cldb is installed
<MCS node> - the node on which the mapr-webserver service is installed
<user> - the chosen Linux (or LDAP) user who will have administrative privileges on the cluster

To Bring Up the Cluster

1. Start ZooKeeper on all nodes where it is installed, by issuing the following command:

/etc/init.d/mapr-zookeeper start

2. On one of the CLDB nodes and the node running the mapr-webserver service, start the warden:

/etc/init.d/mapr-warden start

3. On the running CLDB node, issue the following command to give full permission to the chosen administrative user:

/opt/mapr/bin/maprcli acl edit -type cluster -user <user>:fc

4. On a machine that is connected to the cluster and to the Internet, perform the following steps to install the license:
In a browser, view the MapR Control System by navigating to the node that is running the MapR Control System:
https://<MCS node>:8443
The cluster does not yet have a trusted HTTPS certificate, so the browser will warn you that the connection is not trustworthy. You can
ignore the warning this time.
The first time MapR starts, you must accept the agreement and choose whether to enable the MapR Dial Home service.
Log in to the MapR Control System as the administrative user you designated earlier.
In the navigation pane of the MapR Control System, expand the System Settings group and click MapR Licenses to display the
MapR License Management dialog.
Click Add Licenses via Web.
If the cluster is already registered, the license is applied automatically. Otherwise, click OK to register the cluster on MapR.com
and follow the instructions there.
5. If mapr-nfs is installed on the running CLDB node, execute the following command on the running CLDB node:

/opt/mapr/bin/maprcli node services -nodes <node 1> -nfs start

6. On all other nodes, execute the following command:

/etc/init.d/mapr-warden start

7. Log in to the MapR Control System.
8. Under the Cluster group in the left pane, click Dashboard.
9. Check the Services pane and make sure each service is running the correct number of instances.

Setting up Topology
Topology tells MapR about the locations of nodes and racks in the cluster. Topology is important, because it determines where MapR places
replicated copies of data. If you define the cluster topology properly, MapR scatters replication on separate racks so that your data remains
available in the event an entire rack fails. Cluster topology is defined by specifying a topology path for each node in the cluster. The paths group
nodes by rack or switch, depending on how the physical cluster is arranged and how you want MapR to place replicated data.

Topology paths can be as simple or complex as needed to correspond to your cluster layout. In a simple cluster, each topology path might consist
of the rack only (e.g. /rack-1). In a deployment consisting of multiple large datacenters, each topology path can be much longer (e.g.
/europe/uk/london/datacenter2/room4/row22/rack5/). MapR uses topology paths to spread out replicated copies of data, placing
each copy on a separate path. By setting each path to correspond to a physical rack, you can ensure that replicated data is distributed across
racks to improve fault tolerance.

After you have defined node topology for the nodes in your cluster, you can use volume topology to place volumes on specific racks, nodes, or
groups of nodes. See Setting Volume Topology.

Setting Node Topology


You can specify a topology path for one or more nodes using the node topo command, or in the MapR Control System using the following
procedure.

To set node topology using the MapR Control System:

1. In the Navigation pane, expand the Cluster group and click the Nodes view.
2. Select the checkbox beside each node whose topology you wish to set.
3. Click the Change Topology button to display the Change Node Topology dialog.
4. Set the path in the New Path field:
To define a new path, type a topology path. Topology paths must begin with a forward slash ('/').
To use a path you have already defined, select it from the dropdown.
5. Click Move Node to set the new topology.

Setting Up NFS HA
You can easily set up a pool of NFS nodes with HA and failover using virtual IP addresses (VIPs); if one node fails, the VIP is automatically
reassigned to the next NFS node in the pool. If you do not specify a list of NFS nodes, then MapR uses any available node running the MapR
NFS service. You can add a server to the pool simply by starting the MapR NFS service on it. Before following this procedure, make sure you are
running NFS on the servers to which you plan to assign VIPs. You should install NFS on at least three nodes. If all NFS nodes are connected to
only one subnet, then adding another NFS server to the pool is as simple as starting NFS on that server; the MapR cluster automatically detects it
and adds it to the pool.

You can restrict VIP assignment to specific NFS nodes or MAC addresses by adding them to the NFS pool list manually. VIPs are not assigned to
any nodes that are not on the list, regardless of whether they are running NFS. If the cluster's NFS nodes have multiple network interface cards
(NICs) connected to different subnets, you should restrict VIP assignment to the NICs that are on the correct subnet: for each NFS server, choose
whichever MAC address is on the subnet from which the cluster will be NFS-mounted, then add it to the list. If you add a VIP that is not accessible
on the subnet, then failover will not work. You can only set up VIPs for failover between network interfaces that are in the same subnet. In large
clusters with multiple subnets, you can set up multiple groups of VIPs to provide NFS failover for the different subnets.

You can set up VIPs with the virtualip add command, or using the Add Virtual IPs dialog in the MapR Control System. The Add Virtual IPs dialog
lets you specify a range of virtual IP addresses and assign them to the pool of servers that are running the NFS service. The available servers are
displayed in the left pane in the lower half of the dialog. Servers that have been added to the NFS VIP pool are displayed in the right pane in the
lower half of the dialog.
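For example, a single range of VIPs can be added from the command line as in the following sketch (the addresses and netmask are placeholders, and the parameter names should be checked against the virtualip add reference for your release):

/opt/mapr/bin/maprcli virtualip add -virtualip 10.10.30.101 -virtualipend 10.10.30.116 -netmask 255.255.255.0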

To set up VIPs for NFS using the MapR Control System:

1. In the Navigation pane, expand the NFS HA group and click the NFS Setup view.
2. Click Start NFS to start the NFS Gateway service on nodes where it is installed.
3. Click Add VIP to display the Add Virtual IPs dialog.
4. Enter the start of the VIP range in the Starting IP field.
5. Enter the end of the VIP range in the Ending IP field. If you are assigning only one VIP, you can leave the field blank.
6. Enter the Netmask for the VIP range in the Netmask field. Example: 255.255.255.0
7. If you wish to restrict VIP assignment to specific servers or MAC addresses:
a. If each NFS node has one NIC, or if all NICs are on the same subnet, select NFS servers in the left pane.
b. If each NFS node has multiple NICs connected to different subnets, select the server rows with the correct MAC addresses in the
left pane.
8. Click Add to add the selected servers or MAC addresses to the list of servers to which the VIPs will be assigned. The servers appear in
the right pane.
9. Click OK to assign the VIPs and exit.

Cluster Configuration
After installing MapR Services and bringing up the cluster, perform the following configuration steps.

Setting Up the Administrative User


Give the administrative user full control over the cluster:

1. Log on to any cluster node as root (or use sudo for the following command).
2. Execute the following command, replacing <user> with the administrative username:
sudo /opt/mapr/bin/maprcli acl edit -type cluster -user <user>:fc

For general information about users and groups in the cluster, see Users and Groups.

Checking the Services


Use the following steps to start the MapR Control System and check that all configured services are running:

1. Start the MapR Control System: in a browser, go to the following URL, replacing <host> with the hostname of the node that is running
the WebServer: https://<host>:8443
2. Log in using the administrative username and password.
3. The first time you run the MapR Control System, you must accept the MapR Terms of Service. Click I Accept to proceed.
4. Under the Cluster group in the left pane, click Dashboard.
5. Check the Services pane and make sure each service is running the correct number of instances. For example: if you have configured 5
servers to run the CLDB service, you should see that 5 of 5 instances are running.

If one or more services have not started, wait a few minutes to see if the warden rectifies the problem. If not, you can try to start the services
manually. See Managing Services.
If too few instances of a service have been configured, check that the service is installed on all appropriate nodes. If not, you can add the service
to any nodes where it is missing. See Reconfiguring a Node.
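You can also check services from the command line, as in the following sketch (run as root or as the administrative user on a cluster node; the column and parameter names are assumptions and may differ by MapR version):

/opt/mapr/bin/maprcli node list -columns svc
/opt/mapr/bin/maprcli service list -node <hostname of a node>
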
Configuring Authentication
If you use Kerberos, LDAP, or another authentication scheme, make sure PAM is configured correctly to give MapR access. See PAM
Configuration.

Configuring Email
MapR can notify users by email when certain conditions occur. There are three ways to specify the email addresses of MapR users:

From an LDAP directory
By domain
Manually, for each user

To configure email from an LDAP directory:

1. In the MapR Control System, expand the System Settings group and click Email Addresses to display the Configure Email Addresses
dialog.
2. Select Use LDAP and enter the information about the LDAP directory into the appropriate fields.
3. Click Save to save the settings.

To configure email by domain:

1. In the MapR Control System, expand the System Settings group and click Email Addresses to display the Configure Email Addresses
dialog.
2. Select Use Company Domain and enter the domain name in the text field.
3. Click Save to save the settings.

To configure email manually for each user:

1. Create a volume for the user.


2. In the MapR Control System, expand the MapR-FS group and click User Disk Usage.
3. Click the username to display the User Properties dialog.
4. Enter the user's email address in the Email field.
5. Click Save to save the settings.

Configuring SMTP
Use the following procedure to configure the cluster to use your SMTP server to send mail:

1. In the MapR Control System, expand the System Settings group and click SMTP to display the Configure Sending Email dialog.
2. Enter the information about how MapR will send mail:
Provider: assists in filling out the fields if you use Gmail.
SMTP Server: the SMTP server to use for sending mail.
This server requires an encrypted connection (SSL): specifies an SSL connection to SMTP.
SMTP Port: the SMTP port to use for sending mail.
Full Name: the name MapR should use when sending email. Example: MapR Cluster
Email Address: the email address MapR should use when sending email.
Username: the username MapR should use when logging on to the SMTP server.
SMTP Password: the password MapR should use when logging on to the SMTP server.
3. Click Test SMTP Connection.
4. If there is a problem, check the fields to make sure the SMTP information is correct.
5. Once the SMTP connection is successful, click Save to save the settings.

Configuring Permissions
By default, users are able to log on to the MapR Control System, but do not have permission to perform any actions. You can grant specific
permissions to individual users and groups. See Managing Permissions.

Setting Quotas
Set default disk usage quotas. If needed, you can set specific quotas for individual users and groups. See Managing Quotas.

Configuring Alarm Notifications


If an alarm is raised on the cluster, MapR sends an email notification by default to the user associated with the object on which the alarm was
raised. For example, if a volume goes over its allotted quota, MapR raises an alarm and sends email to the volume creator. You can configure
MapR to send email to a custom email address in addition to or instead of the default email address, or not to send email at all, for each alarm type.
See Notifications.

Designating NICs for MapR


If you do not want MapR to use all NICs on each node, use the environment variable MAPR_SUBNETS to restrict MapR traffic to specific NICs. Set
MAPR_SUBNETS to a comma-separated list of up to four subnets in CIDR notation with no spaces. Example:

export MAPR_SUBNETS=1.2.3.4/12,5.6/24

If MAPR_SUBNETS is not set, MapR uses all NICs present on the node.

Integration with Other Tools


This section provides information about integrating the following tools with a MapR cluster:

Mahout - Environment variable settings needed to run Mahout on MapR
Ganglia - Setting up Ganglia monitoring on a MapR cluster
Nagios Integration - Generating a Nagios Object Definition file for use with a MapR cluster
Using Whirr to Install on Amazon EC2 - Using a Whirr script to start a MapR cluster on Amazon EC2
Compiling Pipes Programs - Using Hadoop Pipes on a MapR cluster
HBase - Installing and using HBase on MapR
MultiTool - Starting Cascading Multitool on a MapR cluster
Flume - Installing and using Flume on a MapR cluster
Hive - Installing and using Hive on a MapR cluster, and setting up a MySQL metastore
Pig - Installing and using Pig on a MapR cluster

Flume
Flume is a reliable, distributed service for collecting, aggregating, and moving large amounts of log data, generally delivering the data to a
distributed file system such as HDFS. For more information about Flume, see the Apache Flume Incubation Wiki.

Installing Flume
The following procedures use the operating system package managers to download and install from the MapR Repository. To install the
packages manually, refer to the Local Packages document for Red Hat or Ubuntu.

To install Flume on an Ubuntu cluster:

1. Execute the following commands as root or using sudo.


2. This procedure is to be performed on a MapR cluster. If you have not installed MapR, see the Installation Guide.
3. Update the list of available packages:

apt-get update

4. On each planned Flume node, install mapr-flume:

apt-get install mapr-flume

To install Flume on a Red Hat or CentOS cluster:

1. Execute the following commands as root or using sudo.


2. This procedure is to be performed on a MapR cluster. If you have not installed MapR, see the Installation Guide.
3. On each planned Flume node, install mapr-flume:

yum install mapr-flume

Using Flume
For information about configuring and using Flume, see the following documents:

Flume User Guide
Flume Cookbook

HBase
HBase is the Hadoop database, which provides random, realtime read/write access to very large data sets.

See Installing HBase for information about using HBase with MapR
See Setting Up Compression with HBase for information about compressing HFile storage
See Running MapReduce Jobs with HBase for information about using MapReduce with HBase
See HBase Best Practices for HBase tips and tricks

Installing HBase
Plan which nodes should run the HBase Master service, and which nodes should run the HBase RegionServer. At least one node (generally three
nodes) should run the HBase Master; for example, install the HBase Master on the ZooKeeper nodes. The HBase RegionServer can run on a few of the
remaining nodes or on all of them. When you install the HBase RegionServer on nodes that also run TaskTracker, reduce the
number of map and reduce slots to avoid oversubscribing the machine. The following procedures use the operating system package managers to
download and install from the MapR Repository. To install the packages manually, refer to the Local Packages document for Red Hat or Ubuntu.

To install HBase on an Ubuntu cluster:

1. Execute the following commands as root or using sudo.


2. This procedure is to be performed on a MapR cluster. If you have not installed MapR, see the Installation Guide.
3. Update the list of available packages:

apt-get update

4. On each planned HBase Master node, install mapr-hbase-master:

apt-get install mapr-hbase-master

5. On each planned HBase RegionServer node, install mapr-hbase-regionserver:

apt-get install mapr-hbase-regionserver

6. On all HBase nodes, run configure.sh with a list of the CLDB nodes and ZooKeeper nodes in the cluster.
7. The warden picks up the new configuration and automatically starts the new services. When it is convenient, restart the warden:

# /etc/init.d/mapr-warden stop
# /etc/init.d/mapr-warden start

To install HBase on a Red Hat or CentOS cluster:

1. Execute the following commands as root or using sudo.


2. On each planned HBase Master node, install mapr-hbase-master:

yum install mapr-hbase-master

3. On each planned HBase RegionServer node, install mapr-hbase-regionserver:

yum install mapr-hbase-regionserver

4. On all HBase nodes, run configure.sh with a list of the CLDB nodes and ZooKeeper nodes in the cluster.
5. The warden picks up the new configuration and automatically starts the new services. When it is convenient, restart the warden:
# /etc/init.d/mapr-warden stop
# /etc/init.d/mapr-warden start

Getting Started with HBase


In this tutorial, we'll create an HBase table on the cluster, enter some data, query the table, then clean up the data and exit.

HBase tables are organized by column, rather than by row. Furthermore, the columns are organized in groups called column families. When
creating an HBase table, you must define the column families before inserting any data. Column families should not be changed often, nor should
there be too many of them, so it is important to think carefully about what column families will be useful for your particular data. Each column
family, however, can contain a very large number of columns. Columns are named using the format family:qualifier.

Unlike columns in a relational database, which reserve empty space for columns with no values, HBase columns simply don't exist for rows where
they have no values. This not only saves space, but means that different rows need not have the same columns; you can use whatever columns
you need for your data on a per-row basis.

Create a table in HBase:

1. Start the HBase shell by typing the following command:

/opt/mapr/hbase/hbase-0.90.4/bin/hbase shell

2. Create a table called weblog with one column family named stats:

create 'weblog', 'stats'

3. Verify the table creation by listing everything:

list

4. Add a test value to the daily column in the stats column family for row 1:

put 'weblog', 'row1', 'stats:daily', 'test-daily-value'

5. Add a test value to the weekly column in the stats column family for row 1:

put 'weblog', 'row1', 'stats:weekly', 'test-weekly-value'

6. Add a test value to the weekly column in the stats column family for row 2:

put 'weblog', 'row2', 'stats:weekly', 'test-weekly-value'

7. Type scan 'weblog' to display the contents of the table. Sample output:
ROW COLUMN+CELL
row1 column=stats:daily, timestamp=1321296699190, value=test-daily-value

row1 column=stats:weekly, timestamp=1321296715892, value=test-weekly-value

row2 column=stats:weekly, timestamp=1321296787444, value=test-weekly-value

2 row(s) in 0.0440 seconds

8. Type get 'weblog', 'row1' to display the contents of row 1. Sample output:

COLUMN CELL
stats:daily timestamp=1321296699190, value=test-daily-value
stats:weekly timestamp=1321296715892, value=test-weekly-value
2 row(s) in 0.0330 seconds

9. Type disable 'weblog' to disable the table.


10. Type drop 'weblog' to drop the table and delete all data.
11. Type exit to exit the HBase shell.

Setting Up Compression with HBase


Using compression with HBase reduces the number of bytes transmitted over the network and stored on disk. These benefits often outweigh the
performance cost of compressing the data on every write and uncompressing it on every read.

GZip Compression

GZip compression is included with most Linux distributions, and works natively with HBase. To use GZip compression, specify it in the per-column
family compression flag while creating tables in HBase shell. Example:

create 'mytable', {NAME=>'colfam:', COMPRESSION=>'gz'}

LZO Compression

Lempel-Ziv-Oberhumer (LZO) is a lossless data compression algorithm, included in most Linux distributions, that is designed for decompression
speed.

To Set Up LZO Compression for Use with HBase:

1. Make sure HBase is installed on the nodes where you plan to run it. See Planning the Deployment and Installing MapR for more
information.
2. On each HBase node, ensure the native LZO base library is installed:
on Ubuntu: apt-get install liblzo2-dev
On Red Hat or CentOS: yum install liblzo2-devel
3. Check out the native connector library from http://svn.codespot.com/a/apache-extras.org/hadoop-gpl-compression/
For 0.20.2 check out branches/branch-0.1

svn checkout http://svn.codespot.com/a/apache-extras.org/hadoop-gpl-compression/branches/branch-0.1/

For 0.21 or 0.22 check out trunk

svn checkout http://svn.codespot.com/a/apache-extras.org/hadoop-gpl-compression/branches/trunk/

4. Set the compiler flags and build the native connector library:

$ export CFLAGS="-m64"
$ ant compile-native
$ ant jar

5. Create a directory for the native libraries (use TAB completion to fill in the <version> placeholder):

mkdir -p /opt/mapr/hbase/hbase-<version>/lib/native/Linux-amd64-64/

6. Copy the build results into the appropriate HBase directories on every HBase node. Example:

$ cp build/hadoop-gpl-compression-0.2.0-dev.jar /opt/mapr/hbase/hbase-<version>/lib
$ cp build/native/Linux-amd64-64/lib/libgplcompression.* /opt/mapr/hbase/hbase-<version>/lib/
native/Linux-amd64-64/

Once LZO is set up, you can specify it in the per-column family compression flag while creating tables in HBase shell. Example:

create 'mytable', {NAME=>'colfam:', COMPRESSION=>'lzo'}

Running MapReduce Jobs with HBase


To run MapReduce jobs with data stored in HBase, set the environment variable HADOOP_CLASSPATH to the output of the hbase classpath
command (use TAB completion to fill in the <version> placeholder):

$ export HADOOP_CLASSPATH=`/opt/mapr/hbase/hbase-<version>/bin/hbase classpath`

Note the backticks (`).

Example: Exporting a table with MapReduce


/opt/mapr/hadoop/hadoop-0.20.2# export HADOOP_CLASSPATH=`/opt/mapr/hbase/hbase-0.90.2/bin/hbase
classpath`
/opt/mapr/hadoop/hadoop-0.20.2# ./bin/hadoop jar /opt/mapr/hbase/hbase-0.90.2/hbase-0.90.2.jar export
t1
/hbase/exportt1
11/09/28 09:35:11 INFO mapreduce.Export: verisons=1, starttime=0,
endtime=9223372036854775807
11/09/28 09:35:11 INFO fs.JobTrackerWatcher: Current running JobTracker is:
lohit-ubuntu/10.250.1.91:9001
11/09/28 09:35:12 INFO mapred.JobClient: Running job: job_201109280920_0003
11/09/28 09:35:13 INFO mapred.JobClient: map 0% reduce 0%
11/09/28 09:35:19 INFO mapred.JobClient: Job complete: job_201109280920_0003
11/09/28 09:35:19 INFO mapred.JobClient: Counters: 15
11/09/28 09:35:19 INFO mapred.JobClient: Job Counters
11/09/28 09:35:19 INFO mapred.JobClient: Aggregate execution time of
mappers(ms)=3259
11/09/28 09:35:19 INFO mapred.JobClient: Total time spent by all reduces
waiting after reserving slots (ms)=0
11/09/28 09:35:19 INFO mapred.JobClient: Total time spent by all maps
waiting after reserving slots (ms)=0
11/09/28 09:35:19 INFO mapred.JobClient: Launched map tasks=1
11/09/28 09:35:19 INFO mapred.JobClient: Data-local map tasks=1
11/09/28 09:35:19 INFO mapred.JobClient: Aggregate execution time of
reducers(ms)=0
11/09/28 09:35:19 INFO mapred.JobClient: FileSystemCounters
11/09/28 09:35:19 INFO mapred.JobClient: FILE_BYTES_WRITTEN=61319
11/09/28 09:35:19 INFO mapred.JobClient: Map-Reduce Framework
11/09/28 09:35:19 INFO mapred.JobClient: Map input records=5
11/09/28 09:35:19 INFO mapred.JobClient: PHYSICAL_MEMORY_BYTES=107991040
11/09/28 09:35:19 INFO mapred.JobClient: Spilled Records=0
11/09/28 09:35:19 INFO mapred.JobClient: CPU_MILLISECONDS=780
11/09/28 09:35:19 INFO mapred.JobClient: VIRTUAL_MEMORY_BYTES=759836672
11/09/28 09:35:19 INFO mapred.JobClient: Map output records=5
11/09/28 09:35:19 INFO mapred.JobClient: SPLIT_RAW_BYTES=63
11/09/28 09:35:19 INFO mapred.JobClient: GC time elapsed (ms)=35

HBase Best Practices


The HBase write-ahead log (WAL) writes many tiny records, and compressing it would cause massive CPU load. Before using HBase,
turn off MapR compression for directories in the HBase volume (normally mounted at /hbase). Example:

hadoop mfs -setcompression off /hbase

You can check whether compression is turned off in a directory or mounted volume by using Hadoop MFS to list the file contents.
Example:

hadoop mfs -ls /hbase

The letter Z in the output indicates compression is turned on; the letter U indicates compression is turned off. See Hadoop MFS for more
information.

On any node where you plan to run both HBase and MapReduce, give more memory to the FileServer than to the RegionServer so that
the node can handle high throughput. For example, on a node with 24 GB of physical memory, it might be desirable to limit the
RegionServer to 4 GB, give 10 GB to MapR-FS, and give the remainder to TaskTracker. To change the memory allocated to each
service, edit the /opt/mapr/conf/warden.conf file. See Tuning MapReduce for more information.
For higher performance, make sure fs.mapr.mempool.size is set to 0 in hbase-site.xml to disable shared memory in the fileclient.
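The last point corresponds to the following hbase-site.xml entry (a sketch; only the property name and value are taken from the text above):

<property>
<name>fs.mapr.mempool.size</name>
<value>0</value>
</property>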

Hive
Hive is a data warehouse system for Hadoop that uses a SQL-like language called HiveQL to query structured data stored in a distributed
filesystem. For more information about Hive, see the Hive project page.

Once Hive is installed, the executable is located at: /opt/mapr/hive/hive-<version>/bin/hive


Make sure the environment variable JAVA_HOME is set correctly. Example:

# export JAVA_HOME=/usr/lib/jvm/java-6-sun

Installing Hive
The following procedures use the operating system package managers to download and install Hive from the MapR Repository. To install the
packages manually, refer to the Local Packages document for Red Hat or Ubuntu.

To install Hive on an Ubuntu cluster:

1. Execute the following commands as root or using sudo.


2. This procedure is to be performed on a MapR cluster (see the Installation Guide) or client (see Setting Up the Client).
3. Update the list of available packages:

apt-get update

4. On each planned Hive node, install mapr-hive:

apt-get install mapr-hive

To install Hive on a Red Hat or CentOS cluster:

1. Execute the following commands as root or using sudo.


2. This procedure is to be performed on a MapR cluster (see the Installation Guide) or client (see Setting Up the Client).
3. On each planned Hive node, install mapr-hive:

yum install mapr-hive

Getting Started with Hive


In this tutorial, you'll create a Hive table on the cluster, load data from a tab-delimited text file, and run a couple of basic queries against the table.

First, make sure you have downloaded the sample table: On the page A Tour of the MapR Virtual Machine, select Tools > Attachments
and right-click on sample-table.txt to save it. We'll be loading the file from the local filesystem on the MapR Virtual Machine (not the
cluster storage layer) so save the file in the Home directory (/home/mapr).

Take a look at the source data

First, take a look at the contents of the file using the terminal:

1. Make sure you are in the Home directory where you saved sample-table.txt (type cd ~ if you are not sure).
2. Type cat sample-table.txt to display the following output.

mapr@mapr-desktop:~$ cat sample-table.txt


1320352532 1001 http://www.mapr.com/doc http://www.mapr.com 192.168.10.1
1320352533 1002 http://www.mapr.com http://www.example.com 192.168.10.10
1320352546 1001 http://www.mapr.com http://www.mapr.com/doc 192.168.10.1

Notice that the file consists of only three lines, each of which contains a row of data fields separated by the TAB character. The data in the file
represents a web log.

Create a table in Hive and load the source data:

1. Type the following command to start the Hive shell, using tab-completion to expand the <version>:

/opt/mapr/hive/hive-<version>/bin/hive

2. At the hive> prompt, type the following command to create the table:

CREATE TABLE web_log(viewTime INT, userid BIGINT, url STRING, referrer STRING, ip STRING) ROW
FORMAT DELIMITED FIELDS TERMINATED BY '\t';

3. Type the following command to load the data from sample-table.txt into the table:

LOAD DATA LOCAL INPATH '/home/mapr/sample-table.txt' INTO TABLE web_log;

Run basic queries against the table:

Try the simplest query, one that displays all the data in the table:

SELECT web_log.* FROM web_log;

This query would be inadvisable with a large table, but with the small sample table it returns very quickly.

Try a simple SELECT to extract only data that matches a desired string:

SELECT web_log.* FROM web_log WHERE web_log.url LIKE '%doc';

This query launches a MapReduce job to filter the data.
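If you want to experiment further, an aggregate query over the same table also runs as a MapReduce job. For example, the following sketch counts page views per URL using only the columns defined above:

SELECT web_log.url, COUNT(1) FROM web_log GROUP BY web_log.url;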

Setting Up Hive with a MySQL Metastore


All the metadata for Hive tables and partitions is stored in the Hive Metastore (for more information, see the Hive project documentation). By default,
the Hive Metastore is in an embedded Apache Derby database in MapR-FS. Derby only allows one connection at a time; if you want multiple
concurrent Hive sessions, you can use MySQL for the Hive Metastore. You can run the MySQL Metastore on any machine that is accessible from
Hive. Use the MySQL bind-address parameter (in /etc/mysql/my.cnf) to specify the address where the Hive Metastore will be exposed. If
the Metastore is on the same machine as the Hive server, you can use localhost; otherwise, set it to the IP address of the node where you are
running the Metastore, and make sure that the Hive parameters javax.jdo.ConnectionURL and hive.metastore.uris (in
hive-site.xml, detailed below) contain the same IP address.
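For example, the relevant section of /etc/mysql/my.cnf might look like the following sketch (the address is a placeholder for the metastore node's IP):

[mysqld]
bind-address = <IP address of the metastore node>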

Prerequisites

Make sure MySQL is installed on the machine where you want to run the Metastore, and configure the MySQL bind-address (the
address used to connect to the MySQL Server) to localhost or to the IP address of the node.
The database administrator must create a database for Hive to store metastore data, and the username specified in
javax.jdo.ConnectionUser must have permissions to access it. The database can be specified using the ConnectionURL
parameter.
Download and install the driver for the MySQL JDBC connector. Example:

$ curl -L 'http://www.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.15.tar.gz/from/http://mysql.he.net/' | tar xz
$ sudo cp mysql-connector-java-5.1.15/mysql-connector-java-5.1.15-bin.jar /opt/mapr/hive/hive-<version>/lib/

Configuration

Create the files hive-site.xml and hive-env.sh in the Hive configuration directory (/opt/mapr/hive/hive-<version>/conf) with the
contents from the examples below.

You can set a specific port for Thrift URIs using the METASTORE_PORT parameter in hive-env.sh
To connect to an existing MySQL metastore, make sure the ConnectionURL parameter and the Thrift URIs (hive.metastore.uris) point to the metastore's
host and port.

Once you have the configurations set up, start the hive metastore service using the following command (use tab auto-complete to fill in the
<version>):

/opt/mapr/hive/hive-<version>/bin/hive --service metastore


You can use nohup hive --service metastore to run metastore in the background.

hive-env.sh

# Set Hive and Hadoop environment variables here. These variables can be used
# to control the execution of Hive. It should be used by admins to configure
# the Hive installation (so that users do not have to set environment variables
# or set command line parameters to get correct behavior).
#
# The hive service being invoked (CLI/HWI etc.) is available via the environment
# variable SERVICE

# Hive Client memory usage can be an issue if a large number of clients


# are running at the same time. The flags below have been useful in
# reducing memory usage:
#
# if [ "$SERVICE" = "cli" ]; then
# export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xms10m -XX:MaxHeapFreeRatio=40
-XX:MinHeapFreeRatio=15 -XX:+UseParNewGC -XX:-UseGCOverheadLimit"
# fi

# The heap size of the jvm started by the hive shell script can be controlled via:
#
# export HADOOP_HEAPSIZE=1024
#
# Larger heap size may be required when running queries over large number of files or partitions.
# By default hive shell scripts use a heap size of 256 (MB). Larger heap size would also be
# appropriate for hive server (hwi etc).

# Set HADOOP_HOME to point to a specific hadoop install directory


HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2

# Hive Configuration Directory can be controlled by:


# export HIVE_CONF_DIR=

# Folder containing extra libraries required for hive compilation/execution can be controlled by:
# export HIVE_AUX_JARS_PATH=

export METASTORE_PORT=9090

hive-site.xml
<configuration>

<property>
<name>hive.metastore.local</name>
<value>false</value>
<description>controls whether to connect to a remote metastore server or open a new metastore server
in Hive Client JVM</description>
</property>

<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>

<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>username to use against metastore database</description>
</property>

<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value><fill in with password></value>
<description>password to use against metastore database</description>
</property>

<property>
<name>hive.metastore.uris</name>
<value>thrift://localhost:9090</value>
</property>

</configuration>

Notes

Hive Scratch Directory

MapR architecture prevents moving or renaming across volume boundaries. When running an import job on data from a MapR volume, make
sure to set hive.exec.scratchdir to a directory in the same volume (where the data for the job resides). Set the parameter to the volume's
mount point (as viewed in the Volume Properties). You can set this parameter from the Hive shell. Example:

hive> set hive.exec.scratchdir=/myvolume

Mahout
Mahout is an Apache TLP project to create scalable, machine learning algorithms. For information about installing Mahout, see the Mahout Wiki.

To use Mahout with MapR, set the following environment variables:

HADOOP_HOME - tells Mahout where to find the Hadoop directory (/opt/mapr/hadoop/hadoop-0.20.2)
HADOOP_CONF_DIR - tells Mahout where to find information about the JobTracker (/opt/mapr/hadoop/hadoop-0.20.2/conf)

You can set the environment variables permanently by adding the following lines to /etc/environment:
HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2
HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-0.20.2/conf

MultiTool
The mt command is the wrapper around Cascading.Multitool, a command line tool for processing large text files and datasets (like sed and grep
on Unix). The mt command is located in the /opt/mapr/contrib/multitool/bin directory. To use mt, change to the multitool directory.
Example:

cd /opt/mapr/contrib/multitool
./bin/mt

Pig
Apache Pig is a platform for parallelized analysis of large data sets via a language called PigLatin. For more information about Pig, see the Pig
project page.

Once Pig is installed, the executable is located at: /opt/mapr/pig/pig-<version>/bin/pig

Make sure the environment variable JAVA_HOME is set correctly. Example:

# export JAVA_HOME=/usr/lib/jvm/java-6-sun

Installing Pig
The following procedures use the operating system package managers to download and install Pig from the MapR Repository. To install the
packages manually, refer to the Local Packages document for Red Hat or Ubuntu.

To install Pig on an Ubuntu cluster:

1. Execute the following commands as root or using sudo.


2. This procedure is to be performed on a MapR cluster. If you have not installed MapR, see the Installation Guide.
3. Update the list of available packages:

apt-get update

4. On each planned Pig node, install mapr-pig:

apt-get install mapr-pig

To install Pig on a Red Hat or CentOS cluster:

1. Execute the following commands as root or using sudo.


2. This procedure is to be performed on a MapR cluster. If you have not installed MapR, see the Installation Guide.
3. On each planned Pig node, install mapr-pig:

yum install mapr-pig

Getting Started with Pig


In this tutorial, we'll use Pig to run a MapReduce job that counts the words in the file /myvolume/in/constitution.txt on the cluster, and
stores the results in the file /myvolume/wordcount.txt.

First, make sure you have downloaded the file: On the page A Tour of the MapR Virtual Machine, select Tools > Attachments and
right-click constitution.txt to save it.
Make sure the file is loaded onto the cluster, in the directory /myvolume/in. If you are not sure how, look at the NFS tutorial on A Tour
of the MapR Virtual Machine.

Open a Pig shell and get started:

1. In the terminal, type the command pig to start the Pig shell.
2. At the grunt> prompt, type the following lines (press ENTER after each):

A = LOAD '/myvolume/in/constitution.txt' USING TextLoader() AS (words:chararray);

B = FOREACH A GENERATE FLATTEN(TOKENIZE(*));

C = GROUP B BY $0;

D = FOREACH C GENERATE group, COUNT(B);

STORE D INTO '/myvolume/wordcount.txt';

After you type the last line, Pig starts a MapReduce job to count the words in the file constitution.txt.

3. When the MapReduce job is complete, type quit to exit the Pig shell and take a look at the contents of the file
/myvolume/wordcount.txt to see the results.

Using Whirr to Install on Amazon EC2


Whirr simplifies the task of installing a MapR cluster on Amazon Elastic Compute Cloud (Amazon EC2). To get started, you only need an Amazon
Web Services (AWS) account. Once your account is set up, Whirr creates the instances, installs and configures the cluster according to your
specifications, and launches the MapR software.

Installing Whirr
You can install and run Whirr on any computer running a compatible operating system:

64-bit Red Hat 5.4 or greater, or 64-bit CentOS 5.4 or greater
64-bit Ubuntu 9.04 or greater

To install Whirr on CentOS or Red Hat:

As root (or using sudo), issue the following command:

yum install mapr-whirr

To install Whirr on Ubuntu:

As root (or using sudo), issue the following commands:

apt-get update
apt-get install mapr-whirr

Planning
Before setting up a MapR cluster on Amazon EC2, select a compatible AMI. MapR recommends the following:

us-west-1/ami-e1bfefa4 - an Ubuntu 10.04 AMI (username: ubuntu)

Plan the services to run on each node just as you would for any MapR cluster. For more information, see Planning the Deployment. If you are not
sure, you can use the mapr-simple schema to create a basic cluster.
Specifying Services

Once you have determined how many nodes should run which services, you will need to format this information as a list of instance templates for
use in the Whirr configuration. An instance template specifies a number of nodes and a group of services to run, in the following format:
<number> <service>[+<service>...]

In other words, an instance template defines a node type and how many of that node type to create in the Amazon EC2 cluster. The instance
template is specified in the properties file, in the parameter whirr.instance-templates (see Configuring Whirr below).
For example, consider the following configuration plan for an 8-node cluster:

CLDB, FileServer, JobTracker, WebServer, NFS, and ZooKeeper on 3 nodes
FileServer and TaskTracker on 5 nodes

To set up the above cluster on EC2, you would use the following instance template:

whirr.instance-templates=3
mapr-cldb+mapr-fileserver+mapr-jobtracker+mapr-webserver+mapr-nfs+mapr-zookeeper,5
mapr-fileserver+mapr-tasktracker

The MapR-Simple Schema

The mapr-simple schema installs the core MapR services on a single node, then installs only the FileServer and TaskTracker services on the
remaining nodes. To create a mapr-simple cluster, specify a single instance template consisting of the number of nodes and the text
mapr-simple. Example:

whirr.instance-templates=5 mapr-simple

Because the mapr-simple schema includes only one CLDB, one ZooKeeper, and one JobTracker, it is not suitable for clusters which require
High Availability (HA).

Configuring Whirr
Whirr is configured with a properties file, a text file that specifies information about the cluster to provision on Amazon EC2. The directory
/opt/mapr/whirr/whirr-0.3.0-incubating/mapr-recipes contains a few properties files (MapR recipes) to get you started.

Collect the following information, which is needed for the Whirr properties file:

EC2 access key - The access key for your Amazon account
EC2 secret key - The secret key for your Amazon account
Amazon AMI - The AMI to use for the instances
Instance templates - A list of instance templates
Amazon EC2 region - The region on which to start the cluster
Number of nodes - The number of instances to configure with each set of services
Private ssh key - The private ssh key to set up on each instance
Public ssh key - The public ssh key to set up on each instance

You can use one of the MapR recipes as a starting point for your own properties file:

If you do not need to change the instance template or cluster name, you can use the MapR recipes unmodified by setting environment
variables referenced in the file:

$ export AWS_ACCESS_KEY_ID=<EC2 access key>
$ export AWS_SECRET_ACCESS_KEY=<EC2 secret key>

To change the cluster name, the number of nodes, or other configuration settings, edit one of the MapR recipes or create a properties file
from scratch. The properties file should look like the following sample properties file, with the values above substituted for the
placeholders in angle brackets (<>).

Sample Properties File


#EC2 settings
whirr.provider=ec2
whirr.identity=<EC2 access key>
whirr.credential=<EC2 secret key>
whirr.image-id=<Amazon AMI>
whirr.hardware-id=<instance type>
whirr.location-id=<Amazon EC2 region>

#Cluster settings
whirr.instance-templates=<instance templates>
whirr.private-key-file=<private ssh key>
whirr.public-key-file=<public ssh key to setup on instances>

#MapR settings
whirr.run-url-base=http://package.mapr.com/scripts/whirr/
whirr.mapr-install-runurl=mapr/install
whirr.mapr-configure-runurl=mapr/configure

Running Whirr
You can use Whirr to launch the cluster, list the nodes, and destroy the cluster when finished.

Do not use sudo with Whirr, as the environment variables associated with your shell will not be available.

Launching the Cluster

Use the whirr launch-cluster command, specifying the properties file, to launch your cluster. Whirr configures an AMI instance on each
node with the MapR services specified, then starts the MapR software.

Example:

$ /opt/mapr/whirr/whirr-0.3.0-incubating/bin/whirr launch-cluster --config
/opt/mapr/whirr/whirr-0.3.0-incubating/mapr-recipes/mysimple.properties

When Whirr finishes, it displays a list of instances in the cluster. Example:

Started cluster of 2 instances


Cluster{instances=[Instance{roles=[mapr-simple], publicAddress=/10.250.1.126,
privateAddress=/10.250.1.192, id=us-west-1/i-501d1514}, Instance{roles=[mapr-simple],
publicAddress=/10.250.1.136, privateAddress=/10.250.1.217, id=us-west-1/i-521d1516}],
configuration={hadoop.job.ugi=root,root,
mapred.job.tracker=ec2-10-250-1-126.us-west-1.compute.amazonaws.com:9001,
hadoop.socks.server=localhost:6666, fs.s3n.awsAccessKeyId=ASDFASDFSDFASDFASDF,
fs.s3.awsSecretAccessKey=AasKLJjkLa89asdf8970as, fs.s3.awsAccessKeyId=ASDFASDFSDFASDFASDF,
hadoop.rpc.socket.factory.class.default=org.apache.hadoop.net.SocksSocketFactory,
fs.default.name=maprfs://ec2-10-250-1-126.us-west-1.compute.amazonaws.com:7222/,
fs.s3n.awsSecretAccessKey=AasKLJjkLa89asdf8970asj}}

Each listed instance includes a list of roles fulfilled by the node. The above example shows a two-node cluster created with the mapr-simple
schema; both nodes have the following roles: roles=[mapr-simple]

You can find the public IP address of each node in the publicAddress parameter.

Listing the Cluster Nodes

Use the whirr list-cluster command, specifying the properties file, to list the nodes in your cluster.

Example:
$ /opt/mapr/whirr/whirr-0.3.0-incubating/bin/whirr list-cluster --config
/opt/mapr/whirr/whirr-0.3.0-incubating/mapr-recipes/mysimple.properties

Destroying the Cluster

Use the whirr destroy-cluster command, specifying the properties file, to destroy the cluster.

Example:

$ /opt/mapr/whirr/whirr-0.3.0-incubating/bin/whirr destroy-cluster --config
/opt/mapr/whirr/whirr-0.3.0-incubating/mapr-recipes/mysimple.properties

After Installation
1. Find the line containing a handler name followed by the phrase "Web UI available at" in the Whirr output to determine the public address
of the node that is running the webserver. Example:

MapRSimpleHandler: Web UI available at https://ec2-50-18-17-126.us-west-1.compute.amazonaws.com:8443

2. Log in to the node via SSH using your instance username (for example, ubuntu or ec2-user). No password is needed.
3. Become root with one of the following commands: su - or sudo bash
4. Set the password for root by issuing the passwd root command.
5. Set up keyless SSH from the webserver node to all other nodes in the cluster.
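Step 5 can be performed with standard OpenSSH tools, as in the following sketch (run on the webserver node; the ubuntu username matches the recommended AMI, and <other node address> is a placeholder):

# Generate a key pair on the webserver node (accept the defaults)
ssh-keygen -t rsa
# Append the public key to authorized_keys on each of the other nodes
cat ~/.ssh/id_rsa.pub | ssh ubuntu@<other node address> 'cat >> ~/.ssh/authorized_keys'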

Known Issues

Installation on more than nine nodes fails with an AWSException (message size exceeded). To create a cluster larger than nine nodes, install in
batches of nine and use configure.sh to specify the correct CLDB and ZooKeeper IP addresses on each node.

Using the Cluster


View the MapR Control System by navigating to the following URL (substituting the IP address of the EC2 node running the
mapr-webserver service):
https://<IP address>:8443
To launch MapReduce jobs, either use ssh to log into any node in the EC2 cluster, or set up a MapR Client.

Setting Up the Client


MapR provides several interfaces for working with a cluster from a client computer:

MapR Control System - manage the cluster, including nodes, volumes, users, and alarms
Direct Access NFS™ - mount the cluster in a local directory
MapR client - work with MapR Hadoop directly
Mac OS X
Red Hat/CentOS
Ubuntu
Windows

MapR Control System


The MapR Control System is web-based, and works with the following browsers:

Chrome
Safari
Firefox 3.0 and above
Internet Explorer 7 and 8

To use the MapR Control System, navigate to the host that is running the WebServer in the cluster. MapR Control System access to the cluster is
typically via HTTP on port 8080 or via HTTPS on port 8443; you can specify the protocol and port in the Configure HTTP dialog. You should
disable pop-up blockers in your browser to allow MapR to open help links in new browser tabs.

Direct Access NFS™


You can mount a MapR cluster locally as a directory on a Mac or Linux computer.

Before you begin, make sure you know the hostname and directory of the NFS share you plan to mount.
Example:

usa-node01:/mapr - for mounting from the command line
nfs://usa-node01/mapr - for mounting from the Mac Finder

Make sure the client machine has the appropriate username and password to access the NFS share. For best results, the username and
password for accessing the MapR cluster should be the same username and password used to log into the client machine.

Linux
1. Make sure the NFS client is installed. Examples:
sudo yum install nfs-utils (Red Hat or CentOS)
sudo apt-get install nfs-common (Ubuntu)
2. List the NFS shares exported on the server. Example:
showmount -e usa-node01
3. Set up a mount point for an NFS share. Example:
sudo mkdir /mapr
4. Mount the cluster via NFS. Example:
sudo mount usa-node01:/mapr /mapr

You can also add an NFS mount to /etc/fstab so that it mounts automatically when your system starts up. Example:

# device mountpoint fs-type options dump fsckorder


...
usa-node01:/mapr /mapr nfs rw 0 0
...

Mac

To mount the cluster from the Finder:

1. Open the Disk Utility: go to Applications > Utilities > Disk Utility.
2. Select File > NFS Mounts.
3. Click the + at the bottom of the NFS Mounts window.
4. In the dialog that appears, enter the following information:
Remote NFS URL: The URL for the NFS mount. If you do not know the URL, use the showmount command described below.
Example: nfs://usa-node01/mapr
Mount location: The mount point where the NFS mount should appear in the local filesystem.
5. Click the triangle next to Advanced Mount Parameters.
6. Enter nolocks in the text field.
7. Click Verify.
8. Important: On the dialog that appears, click Don't Verify to skip the verification process.

The MapR cluster should now appear at the location you specified as the mount point.

To mount the cluster from the command line:

1. List the NFS shares exported on the server. Example:
showmount -e usa-node01
2. Set up a mount point for an NFS share. Example:
sudo mkdir /mapr
3. Mount the cluster via NFS. Example:
sudo mount -o nolock usa-node01:/mapr /mapr

Windows

Because of Windows directory caching, there may appear to be no .snapshot directory in each volume's root directory. To work
around the problem, force Windows to re-load the volume's root directory by updating its modification time (for example, by
creating an empty file or directory in the volume's root directory).

To mount the cluster on Windows 7 Ultimate or Windows 7 Enterprise:


1. Open Start > Control Panel > Programs.
2. Select Turn Windows features on or off.
3. Select Services for NFS.
4. Click OK.
5. Mount the cluster and map it to a drive using the Map Network Drive tool or from the command line. Example:
mount -o nolock usa-node01:/mapr z:

To mount the cluster on other Windows versions:

1. Download and install Microsoft Windows Services for Unix (SFU). You only need to install the NFS Client and the User Name Mapping.
2. Configure the user authentication in SFU to match the authentication used by the cluster (LDAP or operating system users). You can
map local Windows users to cluster Linux users, if desired.
3. Once SFU is installed and configured, mount the cluster and map it to a drive using the Map Network Drive tool or from the command
line. Example:
mount -o nolock usa-node01:/mapr z:

To map a network drive with the Map Network Drive tool:

1. Open Start > My Computer.


2.
2. Select Tools > Map Network Drive.
3. In the Map Network Drive window, choose an unused drive letter from the Drive drop-down list.
4. Specify the Folder by browsing for the MapR cluster, or by typing the hostname and directory into the text field.
5. Browse for the MapR cluster or type the name of the folder to map. This name must follow the UNC format. Alternatively, click the Browse… button
to find the correct folder by browsing available network shares.
6. Select Reconnect at login to reconnect automatically to the MapR cluster whenever you log into the computer.
7. Click Finish.

See Direct Access NFS™ for more information.

MapR Client
The MapR client lets you interact with MapR Hadoop directly. With the MapR client, you can submit Map/Reduce jobs and run hadoop fs and
hadoop mfs commands. The MapR client is compatible with the following operating systems:

CentOS 5.5 or above
Mac OS X (Intel)
Red Hat Enterprise Linux 5.5 or above
Ubuntu 9.04 or above
Windows 7

Do not install the client on a cluster node. It is intended for use on a computer that has no other MapR software installed. Do not
install other MapR software on a MapR client computer. To run MapR CLI commands, establish an ssh session to a node in the
cluster.

To configure the client, you will need the cluster name and the IP addresses and ports of the CLDB nodes on the cluster. The configuration script
configure.sh has the following syntax:

Linux —

configure.sh [-N <cluster name>] -c -C <CLDB node>[:<port>][,<CLDB node>[:<port>]...]

Windows —

server\configure.bat -c -C <CLDB node>[:<port>][,<CLDB node>[:<port>]...]

Linux Example:

/opt/mapr/server/configure.sh -N MyCluster -c -C 10.10.100.1:7222

Windows Example:

server\configure.bat -c -C 10.10.100.1:7222

Installing the MapR Client on CentOS or Red Hat


1. Change to the root user (or use sudo for the following commands).
2. Create a text file called maprtech.repo in the directory /etc/yum.repos.d/ with the following contents:

[maprtech]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/v1.2 XXX/redhat/
enabled=1
gpgcheck=0
protect=1

To install a previous release, see the Release Notes for the correct path to use in the baseurl parameter.

3. If your connection to the Internet is through a proxy server, you must set the http_proxy environment variable before installation:

http_proxy=http://<host>:<port>
export http_proxy

4. Remove any previous MapR software. You can use rpm -qa | grep mapr to get a list of installed MapR packages, then type the
packages separated by spaces after the rpm -e command. Example:

rpm -qa | grep mapr
rpm -e mapr-fileserver mapr-core

5. Install the MapR client: yum install mapr-client


6. Run configure.sh to configure the client, using the -C (uppercase) option to specify the CLDB nodes, and the -c (lowercase) option
to specify a client configuration. Example:

/opt/mapr/server/configure.sh -N MyCluster -c -C 10.10.100.1:7222
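
Once configured, you can verify that the client can reach the cluster by listing the top-level directory (a quick check, assuming the cluster is running, the default install location, and a user that exists on the cluster nodes):

/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop fs -ls /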

Installing the MapR Client on Ubuntu


1. Change to the root user (or use sudo for the following commands).
2. Add the following line to /etc/apt/sources.list:

deb http://package.mapr.com/releases/v1.2 XXX/ubuntu/ mapr optional

To install a previous release, see the Release Notes for the correct path to use.

3. If your connection to the Internet is through a proxy server, add the following lines to /etc/apt/apt.conf:

Acquire {
Retries "0";
HTTP {
Proxy "http://<user>:<password>@<host>:<port>";
};
};

4. Remove any previous MapR software. You can use dpkg --list | grep mapr to get a list of installed MapR packages, then type the
packages separated by spaces after the dpkg -r command. Example:

dpkg -l | grep mapr
dpkg -r mapr-core mapr-fileserver

5. Update your Ubuntu repositories. Example:

apt-get update

6. Install the MapR client: apt-get install mapr-client


7. Run configure.sh to configure the client, using the -C (uppercase) option to specify the CLDB nodes, and the -c (lowercase) option
to specify a client configuration. Example:

/opt/mapr/server/configure.sh -N MyCluster -c -C 10.10.100.1:7222

Installing the MapR Client on Mac OS X


1. Download the archive http://archive.mapr.com/releases/v1.1.3/mac/mapr-client-1.1.1.11051GA-1.x86_64.tar.gz into /opt:

sudo tar -C /opt -xvf XXX

2. Run configure.sh to configure the client, using the -C (uppercase) option to specify the CLDB nodes, and the -c (lowercase) option
to specify a client configuration. Example:

sudo /opt/mapr/server/configure.sh -N MyCluster -c -C 10.10.100.1:7222

Installing the MapR Client on Windows 7


1. Make sure Java is installed on the computer, and set JAVA_HOME correctly.
2. Create the directory /opt — either use Windows Explorer, or type the following at the command prompt:

mkdir c:\opt

3. Set MAPR_HOME to the c:\opt directory.


4. Navigate to MAPR_HOME:

cd %MAPR_HOME%

5. Download the correct archive into MAPR_HOME:


On a 64-bit Windows machine, download XXX from XXX
On a 32-bit Windows machine, download XXX from XXX
6. Extract XXX:

jar -xvf <zip file>

7. Run configure.bat to configure the client, using the -C (uppercase) option to specify the CLDB nodes, and the -c (lowercase) option
to specify a client configuration. Example:

server\configure.bat -c -C 10.10.100.1:7222

On the Windows client, you can run MapReduce jobs using the hadoop.bat command just as you would normally use the hadoop
command. For example, to list the contents of a directory, instead of hadoop fs -ls you would type the following:

hadoop.bat fs -ls

Uninstalling MapR
To re-purpose machines, you may wish to remove nodes and uninstall MapR software.

Removing Nodes from a Cluster


To remove nodes from a cluster: first uninstall the desired nodes, then run configure.sh on the remaining nodes. Finally, if you are using
Ganglia, restart all gmeta and gmon daemons in the cluster.

To uninstall a node:

On each node you want to uninstall, perform the following steps:

1. Change to the root user (or use sudo for the following commands).
2. Stop the Warden:
/etc/init.d/mapr-warden stop
3. If ZooKeeper is installed on the node, stop it:
/etc/init.d/mapr-zookeeper stop
4. Determine which MapR packages are installed on the node:
dpkg --list | grep mapr (Ubuntu)
rpm -qa | grep mapr (Red Hat or CentOS)
5. Remove the packages by issuing the appropriate command for the operating system, followed by the list of packages to remove. Examples:
apt-get purge mapr-core mapr-cldb mapr-fileserver (Ubuntu)
yum erase mapr-core mapr-cldb mapr-fileserver (Red Hat or CentOS)
6. If the node you have decommissioned is a CLDB node or a ZooKeeper node, then run configure.sh on all other nodes in the cluster
(see Configuring a Node).

To reconfigure the cluster:

Run the script configure.sh to create /opt/mapr/conf/mapr-clusters.conf and update the corresponding *.conf and *.xml files.
Before performing this step, make sure you have a list of the hostnames of the CLDB and ZooKeeper nodes. Optionally, you can specify the ports
for the CLDB and ZooKeeper nodes as well. If you do not specify them, the default ports are:

CLDB – 7222
ZooKeeper – 5181

The script configure.sh takes an optional cluster name and log file, and comma-separated lists of CLDB and ZooKeeper host names or IP
addresses (and optionally ports), using the following syntax:
/opt/mapr/server/configure.sh -C <host>[:<port>][,<host>[:<port>]...] -Z
<host>[:<port>][,<host>[:<port>]...] [-L <logfile>][-N <cluster name>]

Example:

/opt/mapr/server/configure.sh -C r1n1.sj.us:7222,r3n1.sj.us:7222,r5n1.sj.us:7222 -Z
r1n1.sj.us:5181,r2n1.sj.us:5181,r3n1.sj.us:5181,r4n1.sj.us:5181,r5n1.sj.us:5181 -N MyCluster

If you have not chosen a cluster name, you can run configure.sh again later to rename the cluster.

If you are using Ganglia, restart all gmeta and gmon daemons in the cluster. See Ganglia.

User Guide
This guide provides information about using MapR Distribution for Apache Hadoop, including the following topics:

Volumes - Setting up and managing data with volumes
MapReduce - Provisioning resources and running Hadoop jobs
Working with Data - Managing data protection, capacity, and performance with volumes and NFS
Managing the Cluster - Managing nodes, monitoring the cluster, and upgrading the MapR software
Users and Groups - Working with users and groups, quotas, and permissions
Troubleshooting - Diagnosing and resolving problems
Direct Access NFS™ - Mount a MapR cluster via NFS
Best Practices - Tips and tricks for performance
Programming Guide - Accessing the Storage Services Layer via Java and C++

Volumes
A volume is a logical unit that allows you to apply policies to a set of files, directories, and sub-volumes. Using volumes, you can enforce disk
usage limits, set replication levels, establish ownership and accountability, and measure the cost generated by different projects or departments.
Create a volume for each user, department, or project; you can mount volumes under other volumes to build a structure that reflects the needs
of your organization.

On a cluster with an M5 license, you can create a special type of volume called a mirror, a local or remote copy of an entire volume. Mirrors are
useful for load balancing or disaster recovery. With an M5 license, you can also create a snapshot, an image of a volume at a specific point in
time. Snapshots are useful for rollback to a known data set. You can create snapshots manually or using a schedule.

See also:

Mirrors
Snapshots
Schedules

MapR lets you control and configure volumes in a number of ways:

Replication - set the number of physical copies of the data, for robustness and performance
Topology - restrict a volume to certain physical racks or nodes (requires M5 license and m permission on the volume)
Quota - set a hard disk usage limit for a volume (requires M5 license)
Advisory Quota - receive a notification when a volume exceeds a soft disk usage quota (requires M5 license)
Ownership - set a user or group as the accounting entity for the volume
Permissions - give users or groups permission to perform specified volume operations
File Permissions - Unix-style read/write permissions on volumes

The following sections describe procedures associated with volumes:

To create a new volume, see Creating a Volume (requires cv permission on the volume)
To view a list of volumes, see Viewing a List of Volumes
To view a single volume's properties, see Viewing Volume Properties
To modify a volume, see Modifying a Volume (requires m permission on the volume)
To mount a volume, see Mounting a Volume (requires mnt permission on the volume)
To unmount a volume, see Unmounting a Volume (requires m permission on the volume)
To remove a volume, see Removing a Volume (requires d permission on the volume)
To set volume topology, see Setting Volume Topology (requires m permission on the volume)

Creating a Volume
When creating a volume, the only required parameters are the volume type (normal or mirror) and the volume name. You can set the ownership,
permissions, quotas, and other parameters at the time of volume creation, or use the Volume Properties dialog to set them later. If you plan to
schedule snapshots or mirrors, it is useful to create a schedule ahead of time; the schedule will appear in a drop-down menu in the Volume
Properties dialog.

By default, the root user and the volume creator have full control permissions on the volume. You can grant specific permissions to other users
and groups:

Code      Allowed Action
dump      Dump the volume
restore   Mirror or restore the volume
m         Modify volume properties, create and delete snapshots
d         Delete a volume
fc        Full control (admin access and permission to change volume ACL)

You can create a volume using the volume create command, or use the following procedure to create a volume using the MapR Control System.
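
From the command line, a minimal example might look like the following (a sketch; the volume name and mount path are hypothetical):

maprcli volume create -name project-data -path /projects/project-data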

To create a volume using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Click the New Volume button to display the New Volume dialog.
3. Use the Volume Type radio button at the top of the dialog to choose whether to create a standard volume, a local mirror, or a remote
mirror.
4. Type a name for the volume or source volume in the Volume Name or Mirror Name field.
5. If you are creating a mirror volume:
Type the name of the source volume in the Source Volume Name field.
If you are creating a remote mirror volume, type the name of the cluster where the source volume resides, in the Source Cluster
Name field.
6. You can set a mount path for the volume by typing a path in the Mount Path field.
7. You can specify which rack or nodes the volume will occupy by typing a path in the Topology field.
8. You can set permissions using the fields in the Ownership & Permissions section:
a. Click [ + Add Permission ] to display fields for a new permission.
b. In the left field, type either u: and a user name, or g: and a group name.
c. In the right field, select permissions to grant to the user or group.
9. You can associate a standard volume with an accountable entity and set quotas in the Usage Tracking section:
a. In the Group/User field, select User or Group from the dropdown menu and type the user or group name in the text field.
b. To set an advisory quota, select the checkbox beside Volume Advisory Quota and type a quota (in megabytes) in the text field.
c. To set a quota, select the checkbox beside Volume Quota and type a quota (in megabytes) in the text field.
10. You can set the replication factor and choose a snapshot or mirror schedule in the Replication and Snapshot section:
a. Type the desired replication factor in the Replication Factor field.
b. To disable writes when the replication factor falls below a minimum number, type the minimum replication factor in the Disable
Writes... field.
c. To schedule snapshots or mirrors, select a schedule from the Snapshot Schedule dropdown menu or the Mirror Update
Schedule dropdown menu respectively.
11. Click OK to create the volume.

Viewing a List of Volumes
You can view all volumes using the volume list command, or view them in the MapR Control System using the following procedure.

To view all volumes using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
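
The command-line equivalent is:

maprcli volume list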

Viewing Volume Properties


You can view volume properties using the volume info command, or use the following procedure to view them using the MapR Control System.

To view the properties of a volume using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Display the Volume Properties dialog by clicking the volume name, or by selecting the checkbox beside the volume name, then clicking
the Properties button.
3. After examining the volume properties, click Close to exit without saving changes to the volume.
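
From the command line, for example (the volume name is hypothetical):

maprcli volume info -name project-data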

Modifying a Volume
You can modify any attributes of an existing volume, except for the following restriction:

You cannot convert a normal volume to a mirror volume.

You can modify a volume using the volume modify command, or use the following procedure to modify a volume using the MapR Control System.

To modify a volume using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Display the Volume Properties dialog by clicking the volume name, or by selecting the checkbox beside the volume name then clicking
the Properties button.
3. Make changes to the fields. See Creating a Volume for more information about the fields.
4. After examining the volume properties, click Modify Volume to save changes to the volume.
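
Using maprcli, for example, to change a volume's replication factor (the volume name and value are illustrative):

maprcli volume modify -name project-data -replication 3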

Mounting a Volume
You can mount a volume using the volume mount command, or use the following procedure to mount a volume using the MapR Control System.

To mount a volume using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Select the checkbox beside the name of each volume you wish to mount.
3. Click the Mount button.

You can also mount or unmount a volume using the Mounted checkbox in the Volume Properties dialog. See Modifying a Volume for more
information.
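
A command-line sketch of the same operation (the volume name and mount path are hypothetical):

maprcli volume mount -name project-data -path /projects/project-data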

Unmounting a Volume
You can unmount a volume using the volume unmount command, or use the following procedure to unmount a volume using the MapR Control
System.

To unmount a volume using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Select the checkbox beside the name of each volume you wish to unmount.
3. Click the Unmount button.

You can also mount or unmount a volume using the Mounted checkbox in the Volume Properties dialog. See Modifying a Volume for more
information.
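
The corresponding command-line form, for example (the volume name is hypothetical):

maprcli volume unmount -name project-data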

Removing a Volume or Mirror


You can remove a volume using the volume remove command, or use the following procedure to remove a volume using the MapR Control
System.

To remove a volume or mirror using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Click the checkbox next to the volume you wish to remove.
3. Click the Remove button to display the Remove Volume dialog.
4. In the Remove Volume dialog, click the Remove Volume button.
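
From the command line, for example (the volume name is hypothetical):

maprcli volume remove -name project-data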

Setting Volume Topology


You can place a volume on specific racks, nodes,or groups of nodes by setting its topology to an existing node topology. For more information
about node topology, see Node Topology.

To set volume topology, choose the path that corresponds to the node topology of the rack or nodes where you would like the volume to reside.
You can set volume topology using the MapR Control System or with the volume modify command.

To set volume topology using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Display the Volume Properties dialog by clicking the volume name or by selecting the checkbox beside the volume name, then clicking
the Properties button.
3. Click Move Volume to display the Move Volume dialog.
4. Select a topology path that corresponds to the rack or nodes where you would like the volume to reside.
5. Click Move Volume to return to the Volume Properties dialog.
6. Click Modify Volume to save changes to the volume.
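
Using maprcli, for example, to place a volume on the /rack1 topology (the volume and topology names are illustrative):

maprcli volume move -name project-data -topology /rack1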

Setting Default Volume Topology


By default, new volumes are created with a topology of / (root directory). To change the default topology, use the config save command to
change the cldb.default.volume.topology configuration parameter. Example:

maprcli config save -values "{\"cldb.default.volume.topology\":\"/default-rack\"}"

After running the above command, new volumes have the volume topology /default-rack by default.

Example: Setting Up CLDB-Only Nodes


In a large cluster (100 nodes or more) create CLDB-only nodes to ensure high performance. This configuration also provides additional control
over the placement of the CLDB data, for load balancing, fault tolerance, or high availability (HA). Setting up CLDB-only nodes involves restricting
the CLDB volume to its own topology and making sure all other volumes are on a separate topology. Unless you specify a default volume
topology, new volumes have no topology when they are created, and reside at the root topology path: "/". Because both the CLDB-only path and
the non-CLDB path are children of the root topology path, new non-CLDB volumes are not guaranteed to keep off the CLDB-only nodes. To avoid
this problem, set a default volume topology. See Setting Default Volume Topology.

To set up a CLDB-only node:

1. SET UP the node as usual:


PREPARE the node, making sure it meets the requirements.
ADD the MapR Repository.
2. INSTALL only the following packages:
mapr-cldb
mapr-webserver
mapr-core
mapr-fileserver
3. RUN configure.sh.
4. FORMAT the disks.
5. START the warden:

/etc/init.d/mapr-warden start

To restrict the CLDB volume to specific nodes:

1. Move all CLDB nodes to a CLDB-only topology (e.g., /cldbonly) using the MapR Control System or the following command:
maprcli node move -serverids <CLDB nodes> -topology /cldbonly
2. Restrict the CLDB volume to the CLDB-only topology. Use the MapR Control System or the following command:
maprcli volume move -name mapr.cldb.internal -topology /cldbonly
maprcli volume move -name mapr.cldb.internal -topology /cldbonly
3. If the CLDB volume is present on nodes not in /cldbonly, increase the replication factor of mapr.cldb.internal to create enough copies in
/cldbonly using the MapR Control System or the following command:
maprcli volume modify -name mapr.cldb.internal -replication <replication factor>
4. Once the volume has sufficient copies, remove the extra replicas by reducing the replication factor to the desired value using the MapR
Control System or the command used in the previous step.

To move all other volumes to a topology separate from the CLDB-only nodes:

1. Move all non-CLDB nodes to a non-CLDB topology (e.g., /defaultRack) using the MapR Control System or the following command:
maprcli node move -serverids <all non-CLDB nodes> -topology /defaultRack
2. Restrict all existing volumes to the topology /defaultRack using the MapR Control System or the following command:
maprcli volume move -name <volume> -topology /defaultRack
All volumes except mapr.cluster.root are re-replicated to the changed topology automatically.

To prevent subsequently created volumes from encroaching on the CLDB-only nodes, set a default topology that
excludes the CLDB-only topology.

Mirrors
A mirror volume is a read-only physical copy of another volume, the source volume. Creating mirrors on the same cluster (local mirroring) is useful
for load balancing and local backup. Creating mirrors on another cluster (remote mirroring) is useful for wide distribution and disaster
preparedness. Creating a mirror is similar to creating a normal (read/write) volume, except that you must specify a source volume from which the
mirror retrieves its contents (the mirroring operation). When you mirror a volume, read requests to the source volume can be served by any of its
mirrors on the same cluster via a volume link of type mirror. A volume link is similar to a normal volume mount point, except that you can specify
whether it points to the source volume or its mirrors.

To write to (and read from) the source volume, mount the source volume normally. As long as the source volume is mounted below a
non-mirrored volume, you can read and write to the volume normally via its direct mount path. You can also use a volume link of type
writeable to write directly to the source volume regardless of its mount point.
To read from the mirrors, use the volume link create command to make a volume link (of type mirror) to the source volume. Any reads
via the volume link will be distributed among the volume's mirrors. It is not necessary to also mount the mirrors, because the volume link
handles access to the mirrors.

Any mount path that consists entirely of mirrored volumes will refer to a mirrored copy of the target volume; otherwise the mount path refers to the
specified volume itself. For example, assume a mirrored volume c mounted at /a/b/c. If the root volume is mirrored, then the mount path /
refers to a mirror of the root volume; if a in turn is mirrored, then the path /a refers to a mirror of a and so on. If all volumes preceding c in the
mount path are mirrored, then the path /a/b/c refers to one of the mirrors of c. However, if any volume in the path is not mirrored then the
source volume is selected for that volume and subsequent volumes in the path. If a is not mirrored, then although / still selects a mirror, /a refers
to the source volume a itself (because there is only one) and /a/b refers to the source volume b (because it was not accessed via a mirror). In
that case, /a/b/c refers to the source volume c.

Any mirror that is accessed via a parent mirror (all parents are mirror volumes) is implicitly mounted. For example, assume a volume a that is
mirrored to a-mirror, and a volume b that is mirrored to b-mirror-1 and b-mirror-2; a is mounted at /a, b is mounted at /a/b, and
a-mirror is mounted at /a-mirror. In this case, reads via /a-mirror/b will access one of the mirrors b-mirror-1 or b-mirror-2 without
the requirement to explicitly mount them.

At the start of a mirroring operation, a temporary snapshot of the source volume is created; the mirroring process reads from the snapshot so that
the source volume remains available for both reads and writes during mirroring. If the mirroring operation is schedule-based, the snapshot expires
according to the Retain For parameter of the schedule; snapshots created during manual mirroring remain until they are deleted manually. To
save bandwidth, the mirroring process transmits only the deltas between the source volume and the mirror; after the initial mirroring operation
(which creates a copy of the entire source volume), subsequent updates can be extremely fast.

Mirroring is extremely resilient. In the case of a network partition (some or all machines where the source volume resides cannot communicate
with machines where the mirror volume resides), the mirroring operation will periodically retry the connection, and will complete mirroring when
the network is restored.

Working with Mirrors


Choose which volumes you mirror and where you locate the mirrors, depending on what you plan to use the mirrors for. Backup mirrors for
disaster recovery can be located on physical media outside the cluster or in a remote cluster. In the event of a disaster affecting the source
cluster, you can check the time of last successful synchronization to determine how up-to-date the backups are (see Mirror Status below). Local
mirrors for load balancing can be located in specific servers or racks with especially high bandwidth, and mounted in a public directory separate
from where the source volume is mounted. You can structure several mirrors in a cascade (or chain) from the source volume, or point all mirrors
to the source volume individually, depending on whether it is more important to conserve bandwidth or to ensure that all mirrors are in sync with
each other as well as the source volume. In most cases, it is convenient to set a schedule to automate mirror synchronization, rather than using
the volume mirror start command to synchronize data manually. Mirroring completion time depends on the available bandwidth and the size of the
data being transmitted. For best performance, set the mirroring schedule according to the anticipated rate of data changes, and the available
bandwidth for mirroring.

The following sections provide information about various mirroring use cases.

Local and Remote Mirroring

Local mirroring (creating mirrors on the same cluster) is useful for load balancing, or for providing a read-only copy of a data set.

Although it is not possible to directly mount a volume from one cluster to another, you can mirror a volume to a remote cluster ( remote mirroring).
By mirroring the cluster's root volume and all other volumes in the cluster, you can create an entire mirrored cluster that keeps in sync with the
source cluster. Mount points are resolved within each cluster; any volumes that are mirrors of a source volume on another cluster are read-only,
because a source volume from another cluster cannot be resolved locally.

To transfer large amounts of data between physically distant clusters, you can use the volume dump create command to create volume
copies for transport on physical media. The volume dump create command creates backup files containing the volumes, which can be
reconstituted into mirrors at the remote cluster using the volume dump restore command. These mirrors can be reassociated with the source
volumes (using the volume modify command to specify the source for each mirror volume) for live mirroring.

Local Mirroring Example

Assume a volume containing a table of data that will be read very frequently by many clients, but updated infrequently. The data is contained in a
volume named table-data, which is to be mounted under a non-mirrored user volume belonging to jsmith. The mount path for the writeable
copy of the data is to be /home/private/users/jsmith/private-table and the public, readable mirrors of the data are to be mounted at
/public/data/table. You would set it up as follows:

1. Create as many mirror volumes as needed for the data, using the MapR Control System or the volume create command (See Creating a
Volume).
2. Mount the source volume at the desired location (in this case, /home/private/users/jsmith/private-table) using the MapR
Control System or the volume mount command.
3. Use the volume link create command to create a volume link at /public/data/table pointing to the source volume. Example:

maprcli volume link create -volume table-data -type mirror -path /public/data/table

4. Write the data to the source volume via the mount path /home/private/users/jsmith/private-table as needed.
5. When the data is ready for public consumption, use the volume mirror push command to push the data out to all the mirrors.
6. Create additional mirrors as needed and push the data to them. No additional steps are required; as soon as a mirror is created and
synchronized, it is available via the volume link.

When a user reads via the path /public/data/table, the data is served by a randomly selected mirror of the source volume. Reads are
evenly spread over all mirrors.

Remote Mirroring Example

Assume two clusters, cluster-1 and cluster-2, and a volume volume-a on cluster-1 to be mirrored to cluster-2. Create a mirror
volume on cluster-2, specifying the remote cluster and volume. You can create remote mirrors using the MapR Control System or the volume
create command:

In the MapR Control System on cluster-2, specify the following values in the New Volume dialog:
1. Select Remote Mirror Volume.
2. Enter volume-a or another name in the Volume Name field.
3. Enter volume-a in the Source Volume field.
4. Enter cluster-1 in the Source Cluster field.

Using the volume create command on cluster-2, specify the following parameters:
Specify the source volume and cluster in the format <volume>@<cluster>, provide a name for the mirror volume, and specify
a type of 1. Example:

maprcli volume create -name volume-a -source volume-a@cluster-1 -type 1

After creating the mirror volume, you can synchronize the data using volume mirror start from cluster-2 to pull data to the mirror volume on
cluster-2 from its source volume on cluster-1.

When you mount a mirror volume on a remote cluster, any mirror volumes below it are automatically mounted. For example, assume volumes a
and b on cluster-1 (mounted at /a and /a/b) are mirrored to a-mirror and b-mirror on cluster-2. When you mount the volume
a-mirror at /a-mirror on cluster-2, it contains a mount point for /b which gets mapped to the mirror of b, making it available at
/a-mirror/b. Any mirror volumes below b will be similarly mounted, and so on.
Mirror Status

You can see a list of all mirror volumes and their current status on the Mirror Volumes view (in the MapR Control System, select MapR-FS then
Mirror Volumes) or using the volume list command. You can see additional information about mirror volumes on the CLDB status page (in the
MapR Control System, select CLDB), which shows the status and last successful synchronization of all mirrors, as well as the container locations
for all volumes. You can also find container locations using the Hadoop MFS commands.

Mirroring the Root Volume

The most frequently accessed volumes in a cluster are likely to be the root volume and its immediate children. In order to load-balance reads on
these volumes, it is possible to mirror the root volume (typically mapr.cluster.root, which is mounted at /). There is a special writeable
volume link called .rw inside the root volume, to provide access to the source volume. In other words, if the root volume is mirrored:

The path / refers to one of the mirrors of the root volume
The path /.rw refers to the source (writeable) root volume

Mirror Cascades

A mirror cascade (or chain mirroring) is a series of mirrors that form a chain from a single source volume: the first mirror receives updates from
the source volume, the second mirror receives updates from the first, and so on. Mirror cascades are useful for propagating data over a distance,
then re-propagating the data locally instead of transferring the same data remotely again for each copy of the mirror.

You can create or break a mirror cascade made from existing mirror volumes by changing the source volume of each mirror in the Volume
Properties dialog.

Creating, Modifying, and Removing Mirror Volumes


On an M5-licensed cluster, you can create a mirror manually or automate the process with a schedule. You can set the topology of a mirror
volume to determine the placement of the data, if desired. The following sections describe procedures associated with mirrors:

To create a new mirror volume, see Creating a Volume (requires M5 license and cv permission)
To modify a mirror (including changing its source), see Modifying a Volume
To remove a mirror, see Removing a Volume or Mirror

You can change a mirror's source volume by changing the source volume in the Volume Properties dialog.

Starting a Mirror
To start a mirror means to pull the data from the source volume. Before starting a mirror, you must create a mirror volume and associate it with a
source volume. You should start a mirror operation shortly after creating the mirror volume, and then again each time you want to synchronize the
mirror with the source volume. You can use a schedule to automate the synchronization. If you create a mirror and synchronize it only once, it is
like a snapshot except that it uses the same amount of disk space used by the source volume at the point in time when the mirror was started.
You can start a mirror using the volume mirror start command, or use the following procedure to start mirroring using the MapR Control System.

To start mirroring using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Select the checkbox beside the name of each volume you wish to mirror.
3. Click the Start Mirroring button.
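
From the command line, for example (assuming a hypothetical mirror volume named project-data-mirror):

maprcli volume mirror start -name project-data-mirror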

Stopping a Mirror
To stop a mirror means to cancel the replication or synchronization process. Stopping a mirror does not delete or remove the mirror volume, only
stops any synchronization currently in progress. You can stop a mirror using the volume mirror stop command, or use the following procedure to
stop mirroring using the MapR Control System.

To stop mirroring using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Select the checkbox beside the name of each volume you wish to stop mirroring.
3. Click the Stop Mirroring button.
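
The command-line form is, for example (the mirror volume name is hypothetical):

maprcli volume mirror stop -name project-data-mirror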

Pushing Changes to Mirrors


To push a mirror means to start pushing data from the source volume to all its local mirrors. You can push source volume changes out to all
mirrors using the volume mirror push command, which returns after the data has been pushed.
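
For example (a sketch; the source volume name is hypothetical):

maprcli volume mirror push -name project-data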

Schedules
A schedule is a group of rules that specify recurring points in time at which certain actions occur. You can use schedules to
automate the creation of snapshots and mirrors; after you create a schedule, it appears as a choice in the scheduling menu when you are editing
the properties of a task that can be scheduled:

To apply a schedule to snapshots, see Scheduling a Snapshot.


To apply a schedule to volume mirroring, see Creating a Volume.

Schedules require the M5 license. The following sections provide information about the actions you can perform on schedules:

To create a schedule, see Creating a Schedule


To view a list of schedules, see Viewing a List of Schedules
To modify a schedule, see Modifying a Schedule
To remove a schedule, see Removing a Schedule

Creating a Schedule
You can create a schedule using the schedule create command, or use the following procedure to create a schedule using the MapR Control
System.

To create a schedule using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Schedules view.
2. Click New Schedule.
3. Type a name for the new schedule in the Schedule Name field.
4. Define one or more schedule rules in the Schedule Rules section:
a. From the first dropdown menu, select a frequency (Once, Yearly, Monthly, etc.)
b. From the next dropdown menu, select a time point within the specified frequency. For example: if you selected Monthly in the
first dropdown menu, select the day of the month in the second dropdown menu.
c. Continue with each dropdown menu, proceeding to the right, to specify the time at which the scheduled action is to occur.
d. Use the Retain For field to specify how long the data is to be preserved. For example: if the schedule is attached to a volume for
creating snapshots, the Retain For field specifies how far after creation the snapshot expiration date is set.
5. Click [ + Add Rule ] to specify additional schedule rules, as desired.
6. Click Save Schedule to create the schedule.

Viewing a List of Schedules


You can view a list of schedules using the schedule list command, or use the following procedure to view a list of schedules using the MapR
Control System.

To view a list of schedules using the MapR Control System:

In the Navigation pane, expand the MapR-FS group and click the Schedules view.
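
The command-line equivalent is:

maprcli schedule list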

Modifying a Schedule
When you modify a schedule, the new set of rules replaces any existing rules for the schedule.

You can modify a schedule using the schedule modify command, or use the following procedure to modify a schedule using the MapR Control
System.

To modify a schedule using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Schedules view.
2. Click the name of the schedule to modify.
3. Modify the schedule as desired:
a. Change the schedule name in the Schedule Name field.
b. Add, remove, or modify rules in the Schedule Rules section.
4. Click Save Schedule to save changes to the schedule.

For more information, see Creating a Schedule.

Removing a Schedule
You can remove a schedule using the schedule remove command, or use the following procedure to remove a schedule using the MapR Control
System.

To remove a schedule using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Schedules view.

2. Click the name of the schedule to remove.
3. Click Remove Schedule to display the Remove Schedule dialog.
4. Click Yes to remove the schedule.

Snapshots
A snapshot is a read-only image of a volume at a specific point in time. On an M5-licensed cluster, you can create a snapshot manually or
automate the process with a schedule. Snapshots are useful any time you need to be able to roll back to a known good data set at a specific point
in time. For example, before performing a risky operation on a volume, you can create a snapshot to enable "undo" capability for the entire
volume. A snapshot takes no time to create, and initially uses no disk space, because it stores only the incremental changes needed to roll the
volume back to the point in time when the snapshot was created.

The following sections describe procedures associated with snapshots:

To view the contents of a snapshot, see Viewing the Contents of a Snapshot


To create a snapshot, see Creating a Volume Snapshot (requires M5 license)
To view a list of snapshots, see Viewing a List of Snapshots
To remove a snapshot, see Removing a Volume Snapshot

See a video explanation of snapshots

Viewing the Contents of a Snapshot


At the top level of each volume is a directory called .snapshot containing all the snapshots for the volume. You can view the directory with
hadoop fs commands or by mounting the cluster with NFS. To prevent recursion problems, ls and hadoop fs -ls do not show the
.snapshot directory when the top-level volume directory contents are listed. You must navigate explicitly to the .snapshot directory to view
and list the snapshots for the volume.

Example:

root@node41:/opt/mapr/bin# hadoop fs -ls /myvol/.snapshot


Found 1 items
drwxrwxrwx - root root 1 2011-06-01 09:57 /myvol/.snapshot/2011-06-01.09-57-49

Creating a Volume Snapshot


You can create a snapshot manually or use a schedule to automate snapshot creation. Each snapshot has an expiration date that determines
how long the snapshot will be retained:

When you create the snapshot manually, specify an expiration date.


When you schedule snapshots, the expiration date is determined by the Retain parameter of the schedule.

For more information about scheduling snapshots, see Scheduling a Snapshot.

Creating a Snapshot Manually

You can create a snapshot using the volume snapshot create command, or use the following procedure to create a snapshot using the MapR
Control System.

To create a snapshot using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Select the checkbox beside the name of each volume for which you want a snapshot, then click the New Snapshot button to display the
Snapshot Name dialog.
3. Type a name for the new snapshot in the Name... field.
4. Click OK to create the snapshot.
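
From the command line, for example (the volume and snapshot names are hypothetical):

maprcli volume snapshot create -volume project-data -snapshotname nightly-2011-06-01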

Scheduling a Snapshot

You schedule a snapshot by associating an existing schedule with a normal (non-mirror) volume. You cannot schedule snapshots on mirror
volumes; in fact, since mirrors are read-only, creating a snapshot of a mirror would provide no benefit. You can schedule a snapshot by passing
the ID of a schedule to the volume modify command, or you can use the following procedure to choose a schedule for a volume using the MapR
Control System.

To schedule a snapshot using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Display the Volume Properties dialog by clicking the volume name, or by selecting the checkbox beside the name of the volume then
clicking the Properties button.
3. In the Replication and Snapshot Scheduling section, choose a schedule from the Snapshot Schedule dropdown menu.
4. Click Modify Volume to save changes to the volume.

For information about creating a schedule, see Schedules.
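
A command-line sketch of attaching a schedule to a volume (assuming a schedule with ID 2; use the schedule list command to look up schedule IDs, and note that the parameter name is an assumption based on the volume modify command):

maprcli volume modify -name project-data -schedule 2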

Viewing a List of Snapshots

Viewing all Snapshots

You can view snapshots for a volume with the volume snapshot list command or using the MapR Control System.

To view snapshots using the MapR Control System:

In the Navigation pane, expand the MapR-FS group and click the Snapshots view.
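
The command-line equivalent is:

maprcli volume snapshot list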

Viewing Snapshots for a Volume

You can view snapshots for a volume by passing the volume to the volume snapshot list command or using the MapR Control System.

To view snapshots using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Click the Snapshots button to display the Snapshots for Volume dialog.

Removing a Volume Snapshot


Each snapshot has an expiration date and time, when it is deleted automatically. You can remove a snapshot manually before its expiration, or
you can preserve a snapshot to prevent it from expiring.

Removing a Volume Snapshot Manually

You can remove a snapshot using the volume snapshot remove command, or use the following procedure to remove a snapshot using the MapR
Control System.

To remove a snapshot using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Snapshots view.
2. Select the checkbox beside each snapshot you wish to remove.
3. Click Remove Snapshot to display the Remove Snapshots dialog.
4. Click Yes to remove the snapshot or snapshots.

To remove a snapshot from a specific volume using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Select the checkbox beside the volume name.
3. Click Snapshots to display the Snapshots for Volume dialog.
4. Select the checkbox beside each snapshot you wish to remove.
5. Click Remove to display the Remove Snapshots dialog.
6. Click Yes to remove the snapshot or snapshots.
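
From the command line, for example (the volume and snapshot names are hypothetical):

maprcli volume snapshot remove -volume project-data -snapshotname nightly-2011-06-01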

Preserving a Volume Snapshot

You can preserve a snapshot using the volume snapshot preserve command, or use the following procedure to preserve a snapshot using the MapR
Control System.

To preserve a snapshot using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Snapshots view.
2. Select the checkbox beside each snapshot you wish to preserve.
3. Click Preserve Snapshot to preserve the snapshot or snapshots.

To preserve a snapshot from a specific volume using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Select the checkbox beside the volume name.
3. Click Snapshots to display the Snapshots for Volume dialog.
4. Select the checkbox beside each snapshot you wish to preserve.
5. Click Preserve to preserve the snapshot or snapshots.

MapReduce
If you have used Hadoop in the past to run MapReduce jobs, then running jobs on MapR Distribution for Apache Hadoop will be very familiar to
you. MapR is a full Hadoop distribution, API-compatible with all versions of Hadoop. MapR provides additional capabilities not present in any other
Hadoop distribution. This section contains information about the following topics:

Secured TaskTracker - Controlling which users are able to submit jobs to the TaskTracker
Standalone Operation - Running MapReduce jobs locally, using the local filesystem
Tuning MapReduce - Strategies for optimizing resources to meet the goals of your application

Compiling Pipes Programs


To facilitate running hadoop pipes jobs on various platforms, MapR provides hadoop pipes, util, and pipes-example sources.

When using pipes, all nodes must run the same distribution of the operating system. If you run different distributions (Red Hat
and CentOS, for example) on nodes in the same cluster, the compiled application might run on some nodes but not others.

To compile the pipes example:

1. Install libssl on all nodes.


2. Change to the /opt/mapr/hadoop/hadoop-0.20.2/src/c++/utils directory, and execute the following commands:

chmod +x configure
./configure # resolve any errors
make install

3. Change to the /opt/mapr/hadoop/hadoop-0.20.2/src/c++/pipes directory, and execute the following commands:

chmod +x configure
./configure # resolve any errors
make install

4. The APIs and libraries will be in the /opt/mapr/hadoop/hadoop-0.20.2/src/c++/install directory.


5. Compile pipes-example:

cd /opt/mapr/hadoop/hadoop-0.20.2/src/c++
g++ pipes-example/impl/wordcount-simple.cc -Iinstall/include/ -Linstall/lib/ -lhadooputils
-lhadooppipes -lssl -lpthread -o wc-simple

To run the pipes example:

1. Copy the pipes program into MapR-FS.


2. Run the hadoop pipes command:

hadoop pipes -Dhadoop.pipes.java.recordreader=true -Dhadoop.pipes.java.recordwriter=true -input
<input-dir> -output <output-dir> -program <MapR-FS path to program>

Secured TaskTracker
You can control which users are able to submit jobs to the TaskTracker. By default, the TaskTracker is secured; all TaskTracker nodes should
have the same user and group databases, and only users who are present on all TaskTracker nodes (same user ID on all nodes) can submit jobs.
You can disallow certain users (including root or other superusers) from submitting jobs, or remove user restrictions from the TaskTracker
completely. The mapred-site.xml file referenced in the procedures below is located at /opt/mapr/hadoop/hadoop-0.20.2/conf/mapred-site.xml.

To disallow root:

1. Edit mapred-site.xml and set mapred.tasktracker.task-controller.config.overwrite = false on all TaskTracker
nodes.
2. Edit taskcontroller.cfg and set min.user.id=0 on all TaskTracker nodes.
3. Restart all TaskTrackers.
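
For example, the settings in steps 1 and 2 might look like the following (a sketch using the property and parameter names referenced above):

In mapred-site.xml:

<property>
  <name>mapred.tasktracker.task-controller.config.overwrite</name>
  <value>false</value>
</property>

In taskcontroller.cfg:

min.user.id=0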

To disallow all superusers:

1. Edit mapred-site.xml and set mapred.tasktracker.task-controller.config.overwrite = false on all TaskTracker
nodes.
2. Edit taskcontroller.cfg and set min.user.id=1000 on all TaskTracker nodes.
3. Restart all TaskTrackers.

To disallow specific users:

1. Edit mapred-site.xml and set mapred.tasktracker.task-controller.config.overwrite = false on all TaskTracker
nodes.
2. Edit taskcontroller.cfg and add the parameter banned.users on all TaskTracker nodes, setting it to a comma-separated list of
usernames. Example:

banned.users=foo,bar

3. Restart all TaskTrackers.

To remove all user restrictions, and run all jobs as root:

1. Edit mapred-site.xml and set mapred.task.tracker.task-controller =
org.apache.hadoop.mapred.DefaultTaskController on all TaskTracker nodes.
2. Restart all TaskTrackers.
When you make the above setting, the tasks generated by all jobs submitted by any user will run with the same privileges as the
TaskTracker (root privileges), and will have the ability to overwrite, delete, or damage data regardless of ownership or
permissions.

Standalone Operation
You can run MapReduce jobs locally, using the local filesystem, by setting mapred.job.tracker=local in mapred-site.xml. With that
parameter set, you can use the local filesystem for both input and output, read input from MapR-FS and write output to the local filesystem, or
read input from the local filesystem and write output to MapR-FS.

Examples

Input and output on local filesystem

./bin/hadoop jar hadoop-0.20.2-dev-examples.jar grep -Dmapred.job.tracker=local
file:///opt/mapr/hadoop/hadoop-0.20.2/input file:///opt/mapr/hadoop/hadoop-0.20.2/output 'dfs[a-z.]+'

Input from MapR-FS

./bin/hadoop jar hadoop-0.20.2-dev-examples.jar grep -Dmapred.job.tracker=local input
file:///opt/mapr/hadoop/hadoop-0.20.2/output 'dfs[a-z.]+'

Output to MapR-FS

./bin/hadoop jar hadoop-0.20.2-dev-examples.jar grep -Dmapred.job.tracker=local
file:///opt/mapr/hadoop/hadoop-0.20.2/input output 'dfs[a-z.]+'

Tuning MapReduce
MapR automatically tunes the cluster for most purposes. A service called the warden determines machine resources on nodes configured to run
the TaskTracker service, and sets MapReduce parameters accordingly.

On nodes with multiple CPUs, MapR uses taskset to reserve CPUs for MapR services:

On nodes with five to eight CPUs, CPU 0 is reserved for MapR services
On nodes with nine or more CPUs, CPU 0 and CPU 1 are reserved for MapR services

In certain circumstances, you might wish to manually tune MapR to provide higher performance. For example, when running a job consisting of
unusually large tasks, it is helpful to reduce the number of slots on each TaskTracker and adjust the Java heap size. The following sections
provide MapReduce tuning tips. If you change any settings in mapred-site.xml, restart the TaskTracker.

Memory Settings

Memory for MapR Services

The memory allocated to each MapR service is specified in the /opt/mapr/conf/warden.conf file, which MapR automatically configures
based on the physical memory available on the node. For example, you can adjust the minimum and maximum memory used for the
TaskTracker, as well as the percentage of the heap that the TaskTracker tries to use, by setting the appropriate percent, max, and min
parameters in the warden.conf file:

...
service.command.tt.heapsize.percent=2
service.command.tt.heapsize.max=325
service.command.tt.heapsize.min=64
...

The percentages of memory used by the services need not add up to 100; in fact, you can use less than the full heap by setting the
heapsize.percent parameters for all services to add up to less than 100% of the heap size. In general, you should not need to adjust the
memory settings for individual services, unless you see specific memory-related problems occurring.

MapReduce Memory

The memory allocated for MapReduce tasks normally equals the total system memory minus the total memory allocated for MapR services. If
necessary, you can use the parameter mapreduce.tasktracker.reserved.physicalmemory.mb to set the maximum physical memory reserved by
MapReduce tasks, or you can set it to -1 to disable physical memory accounting and task management.

If the node runs out of memory, MapReduce tasks are killed by the OOM-killer to free memory. You can use mapred.child.oom_adj (copy it
from mapred-default.xml) to adjust the oom_adj parameter for MapReduce tasks. The possible values of oom_adj range from -17 to +15.
The higher the score, the more likely the associated process is to be killed by the OOM-killer.

Job Configuration

Map Tasks

Map tasks use memory mainly in two ways:

The MapReduce framework uses an intermediate buffer to hold serialized (key, value) pairs.
The application consumes memory to run the map function.

MapReduce framework memory is controlled by io.sort.mb. If io.sort.mb is less than the data emitted from the mapper, the task ends up
spilling data to disk. If io.sort.mb is too large, the task can run out of memory or waste allocated memory. By default, io.sort.mb is 100 MB. It
should be approximately 1.25 times the number of data bytes emitted from the mapper. If you cannot resolve memory problems by adjusting
io.sort.mb, then try rewriting the application to use less memory in its map function.
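
For example, if a map task emits roughly 160 MB of serialized (key, value) data, setting io.sort.mb to about 200 (1.25 x 160) avoids spilling without over-allocating memory. A sketch of the setting in mapred-site.xml (the value is illustrative):

<property>
  <name>io.sort.mb</name>
  <value>200</value>
</property>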

Compression

To turn off MapR compression for map outputs, set mapreduce.maprfs.use.compression=false
To turn on LZO or any other compression, set mapreduce.maprfs.use.compression=false and
mapred.compress.map.output=true

Reduce Tasks

If reduce tasks fail with an out-of-heap-space error, increase the heap space (the -Xmx option in mapred.reduce.child.java.opts) to give
more memory to the tasks. If map tasks are failing, you can also try reducing io.sort.mb (for the map task heap size, see
mapred.map.child.java.opts in mapred-site.xml).
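
For example, a job whose reduce tasks need roughly 2 GB of heap might set the following in mapred-site.xml (the value is illustrative only):

<property>
  <name>mapred.reduce.child.java.opts</name>
  <!-- illustrative: 2 GB of heap per reduce task -->
  <value>-Xmx2048m</value>
</property>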

TaskTracker Configuration

MapR sets up map and reduce slots on each TaskTracker node using formulas based on the number of CPUs present on the node. The default
formulas are stored in the following parameters in mapred-site.xml:

mapred.tasktracker.map.tasks.maximum: (CPUS > 2) ? (CPUS * 0.75) : 1 (at least one map slot, up to 0.75 times the number of CPUs)
mapred.tasktracker.reduce.tasks.maximum: (CPUS > 2) ? (CPUS * 0.50) : 1 (at least one reduce slot, up to 0.50 times the number of CPUs)

You can adjust the maximum number of map and reduce slots by editing the formula used in mapred.tasktracker.map.tasks.maximum
and mapred.tasktracker.reduce.tasks.maximum. The following variables are used in the formulas:

CPUS - number of CPUs present on the node


DISKS - number of disks present on the node
MEM - memory reserved for MapReduce tasks

Ideally, the number of map and reduce slots should be decided based on the needs of the application. Map slots should be based on how many
map tasks can fit in memory, and reduce slots should be based on the number of CPUs. If each task in a MapReduce job takes 3 GB, and each
node has 9GB reserved for MapReduce tasks, then the total number of map slots should be 3. The amount of data each map task must process
also affects how many map slots should be configured. If each map task processes 256 MB (the default chunksize in MapR), then each map task
should have 800 MB of memory. If there are 4 GB reserved for map tasks, then the number of map slots should be 4000MB/800MB, or 5 slots.
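
For example, to roughly halve the number of map slots on nodes that run unusually large tasks, you might edit the formula in mapred-site.xml
as follows (the 0.50 factor is illustrative, not a recommendation):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <!-- illustrative: at least one map slot, up to 0.50 times the number of CPUs -->
  <value>(CPUS > 2) ? (CPUS * 0.50) : 1</value>
</property>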

MapR allows the JobTracker to over-schedule tasks on TaskTracker nodes in advance of slot availability, creating a pipeline. This
optimization allows the TaskTracker to launch each map task as soon as the previous map task finishes. The number of tasks to
over-schedule should be about 25-50% of the total number of map slots. You can adjust this number with the parameter
mapreduce.tasktracker.prefetch.maptasks.

ExpressLane
MapR provides an express path for small MapReduce jobs to run when all slots are occupied by long tasks. Small jobs are only given this special
treatment when the cluster is busy, and only if they meet the criteria specified by the following parameters in mapred-site.xml:

Parameter Value Description

mapred.fairscheduler.smalljob.schedule.enable true Enables fast scheduling of small jobs inside the fair scheduler. TaskTrackers reserve an
ephemeral slot that is used for small jobs when the cluster is busy.

mapred.fairscheduler.smalljob.max.maps 10 Small job definition: maximum number of map tasks allowed in a small job.

mapred.fairscheduler.smalljob.max.reducers 10 Small job definition: maximum number of reduce tasks allowed in a small job.

mapred.fairscheduler.smalljob.max.inputsize 10737418240 Small job definition: maximum input size, in bytes, allowed for a small job. Default is 10 GB.

mapred.fairscheduler.smalljob.max.reducer.inputsize 1073741824 Small job definition: maximum estimated input size per reducer allowed in a
small job. Default is 1 GB per reducer.

mapred.cluster.ephemeral.tasks.memory.limit.mb 200 Small job definition: maximum memory, in MB, reserved for an ephemeral slot. Default is
200 MB. This value must be the same on the JobTracker and TaskTracker nodes.

MapReduce jobs that appear to fit the small job definition but are in fact larger than anticipated are killed and re-queued for normal execution.

Working with Data


This section contains information about working with data:

Copying Data from Apache Hadoop - using distcp to copy data to MapR from an Apache cluster
Data Protection - how to protect data from corruption or deletion
Direct Access NFS™ - how to mount the cluster via NFS
Volumes - using volumes to manage data
Mirrors - local or remote copies of volumes
Schedules - scheduling for snapshots and mirrors
Snapshots - point-in-time images of volumes

Copying Data from Apache Hadoop


There are three ways to copy data from an HDFS cluster to MapR:

If the HDFS cluster uses the same version of the RPC protocol that MapR uses (currently version 4), use distcp normally to copy data
with the following procedure
If you are copying very small amounts of data, use hftp
If the HDFS cluster and the MapR cluster do not use the same version of the RPC protocol, or if for some other reason the above steps
do not work, you can push data from the HDFS cluster

To copy data from HDFS to MapR using distcp:

<NameNode IP> - the IP address of the NameNode in the HDFS cluster


<NameNode Port> - the port for connecting to the NameNode in the HDFS cluster
<HDFS path> - the path to the HDFS directory from which you plan to copy data
<MapR-FS path> - the path in the MapR cluster to which you plan to copy HDFS data
<file> - a file in the HDFS path

1. From a node in the MapR cluster, try hadoop fs -ls to determine whether the MapR cluster can successfully communicate with the
HDFS cluster:

hadoop fs -ls <NameNode IP>:<NameNode port>/<HDFS path>

2. If the hadoop fs -ls command is successful, try hadoop fs -cat to determine whether the MapR cluster can read file contents
from the specified path on the HDFS cluster:

hadoop fs -cat <NameNode IP>:<NameNode port>/<HDFS path>/<file>

3. If you are able to communicate with the HDFS cluster and read file contents, use distcp to copy data from the HDFS cluster to the
MapR cluster:

hadoop distcp <NameNode IP>:<NameNode port>/<HDFS path> <MapR-FS path>
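
For example, with a hypothetical NameNode at 10.10.20.5 listening on port 8020, the command might look like the following (substitute your
own addresses and paths):

hadoop distcp 10.10.20.5:8020/user/alice/input /user/alice/input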

Using hftp

<NameNode IP> - the IP address of the NameNode in the HDFS cluster


<NameNode HTTP Port> - the HTTP port on the NameNode in the HDFS cluster
<HDFS path> - the path to the HDFS directory from which you plan to copy data
<MapR-FS path> - the path in the MapR cluster to which you plan to copy HDFS data

Use distcp over HFTP to copy files:

hadoop distcp hftp://<NameNode IP>:<NameNode HTTP Port>/<HDFS path> <MapR-FS path>
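
For example, with a hypothetical NameNode at 10.10.20.5 and the commonly used NameNode HTTP port 50070, the command might be:

hadoop distcp hftp://10.10.20.5:50070/user/alice/input /user/alice/input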

To push data from an HDFS cluster

Perform the following steps from a MapR client or node (any computer that has either mapr-core or mapr-client installed). For more
information about setting up a MapR client, see Setting Up the Client.

<input path> - the HDFS path to the source data


<output path> - the MapR-FS path to the target directory
<MapR CLDB IP> - the IP address of the master CLDB node on the MapR cluster.

1. Log in as the root user (or use sudo for the following commands).
2. Create the directory /tmp/maprfs-client/ on the Apache Hadoop JobClient node.
3. Copy the following files from a MapR client or any MapR node to the /tmp/maprfs-client/ directory:
/opt/mapr/hadoop/hadoop-0.20.2/lib/maprfs-0.1.jar,
/opt/mapr/hadoop/hadoop-0.20.2/lib/zookeeper-3.3.2.jar
/opt/mapr/hadoop/hadoop-0.20.2/lib/native/Linux-amd64-64/libMapRClient.so
4. Install the files in the correct places on the Apache Hadoop JobClient node:

cp /tmp/maprfs-client/maprfs-0.1.jar $HADOOP_HOME/lib/.
cp /tmp/maprfs-client/zookeeper-3.3.2.jar $HADOOP_HOME/lib/.
cp /tmp/maprfs-client/libMapRClient.so $HADOOP_HOME/lib/native/Linux-amd64-64/libMapRClient.so

If you are on a 32-bit client, use Linux-i386-32 in place of Linux-amd64-64 above.


5. If the JobTracker is a different node from the JobClient node, copy and install the files to the JobTracker node as well using the above
steps.
6. On the JobTracker node, set fs.maprfs.impl=com.mapr.fs.MapRFileSystem in $HADOOP_HOME/conf/core-site.xml .
7. Restart the JobTracker.
8. You can now copy data to the MapR cluster by running distcp on the JobClient node of the Apache Hadoop cluster. Example:

./bin/hadoop distcp -Dfs.maprfs.impl=com.mapr.fs.MapRFileSystem -libjars
/tmp/maprfs-client/maprfs-0.1.jar,/tmp/maprfs-client/zookeeper-3.3.2.jar -files
/tmp/maprfs-client/libMapRClient.so <input path> maprfs://<MapR CLDB IP>:7222/<output path>

Data Protection
You can use MapR to protect your data from hardware failures, accidental overwrites, and natural disasters. MapR organizes data into volumes
so that you can apply different data protection strategies to different types of data. The following scenarios describe a few common problems and
how easily and effectively MapR protects your data from loss.

Scenario: Hardware Failure


Even with the most reliable hardware, growing cluster and datacenter sizes will make frequent hardware failures a real threat to business
continuity. In a cluster with 10,000 disks on 1,000 nodes, it is reasonable to expect a disk failure more than once a day and a node failure every
few days.

Solution: Topology and Replication Factor

MapR automatically replicates data and places the copies on different nodes to safeguard against data loss in the event of hardware failure. By
default, MapR assumes that all nodes are in a single rack. You can provide MapR with information about the rack locations of all nodes by setting
topology paths. MapR interprets each topology path as a separate rack, and attempts to replicate data onto different racks to provide continuity in
case of a power failure affecting an entire rack. These replicas are maintained, copied, and made available seamlessly without user intervention.

To set up topology and replication:

1. In the MapR Control System, open the MapR-FS group and click Nodes to display the Nodes view.
2. Set up each rack with its own path. For each rack, perform the following steps:
a. Click the checkboxes next to the nodes in the rack.
b. Click the Change Topology button to display the Change Node Topology dialog.
c. In the Change Node Topology dialog, type a path to represent the rack. For example, if the cluster name is cluster1 and the
nodes are in rack 14, type /cluster1/rack14.
3. When creating volumes, choose a Replication Factor of 3 or more to provide sufficient data redundancy.

Scenario: Accidental Overwrite


Even in a cluster with data replication, important data can be overwritten or deleted accidentally. If a data set is accidentally removed, the removal
itself propagates across the replicas and the data is lost. Users or applications can corrupt data, and once the corruption spreads to the replicas
the damage is permanent.

Solution: Snapshots

With MapR, you can create a point-in-time snapshot of a volume, allowing recovery from a known good data set. You can create a manual
snapshot to enable recovery to a specific point in time, or schedule snapshots to occur regularly to maintain a recent recovery point. If data is lost,
you can restore the data using the most recent snapshot (or any snapshot you choose). Snapshots do not add a performance penalty, because
they do not involve additional data copying operations; a snapshot can be created almost instantly regardless of data size.

Example: Creating a Snapshot Manually

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Select the checkbox beside the name of the volume, then click the New Snapshot button to display the Snapshot Name dialog.
3. Type a name for the new snapshot in the Name... field.
4. Click OK to create the snapshot.
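
The same snapshot can also be created from the command line. A minimal sketch, assuming a hypothetical volume named project-data:

maprcli volume snapshot create -volume project-data -snapshotname my-snapshot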

Example: Scheduling Snapshots

This example schedules snapshots for a volume hourly and retains them for 24 hours.

To create a schedule:

1. In the Navigation pane, expand the MapR-FS group and click the Schedules view.
2. Click New Schedule.
3. In the Schedule Name field, type "Every Hour".
4. From the first dropdown menu in the Schedule Rules section, select Hourly.
5. In the Retain For field, specify 24 Hours.
6. Click Save Schedule to create the schedule.

To apply the schedule to the volume:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Display the Volume Properties dialog by clicking the volume name, or by selecting the checkbox beside the volume name then clicking
the Properties button.
3. In the Replication and Snapshot Scheduling section, choose "Every Hour."
4. Click Modify Volume to apply the changes and close the dialog.

Scenario: Disaster Recovery


A severe natural disaster can cripple an entire datacenter, leading to permanent data loss unless a disaster plan is in place.

Solution: Mirroring to Another Cluster

MapR makes it easy to protect against loss of an entire datacenter by mirroring entire volumes to a different datacenter. A mirror is a full read-only
copy of a volume that can be synced on a schedule to provide point-in-time recovery for critical data. If the volumes on the original cluster contain
a large amount of data, you can store them on physical media using the volume dump create command and transport them to the mirror cluster.
Otherwise, you can simply create mirror volumes that point to the volumes on the original cluster and copy the data over the network. The
mirroring operation conserves bandwidth by transmitting only the deltas between the source and the mirror, and by compressing the data over the
wire. In addition, MapR uses checksums and a latency-tolerant protocol to ensure success even on high-latency WANs. You can set up a
cascade of mirrors to replicate data over a distance. For instance, you can mirror data from New York to London, then use lower-cost links to
replicate the data from London to Paris and Rome.
To set up mirroring to another cluster:

1. Use the volume dump create command to create a full volume dump for each volume you want to mirror.
2. Transport the volume dump to the mirror cluster.
3. For each volume on the original cluster, set up a corresponding volume on the mirror cluster.
a. Restore the volume using the volume dump restore command.
b. In the MapR Control System, click Volumes under the MapR-FS group to display the Volumes view.
c. Click the name of the volume to display the Volume Properties dialog.
d. Set the Volume Type to Remote Mirror Volume.
e. Set the Source Volume Name to the source volume name.
f. Set the Source Cluster Name to the cluster where the source volume resides.
g. In the Replication and Mirror Scheduling section, choose a schedule to determine how often the mirror will sync.

To recover volumes from mirrors:

1. Use the volume dump create command to create a full volume dump for each mirror volume you want to restore. Example:
maprcli volume dump create -e statefile1 -dumpfile fulldump1 -name volume@cluster
2. Transport the volume dump to the rebuilt cluster.
3. For each volume on the mirror cluster, set up a corresponding volume on the rebuilt cluster.
a. Restore the volume using the volume dump restore command. Example:
maprcli volume dump restore -name volume@cluster -dumpfile fulldump1
b. Copy the files to a standard (non-mirror) volume.

Provisioning Applications
Provisioning a new application involves meeting the business goals of performance, continuity, and security while providing necessary resources
to a client, department, or project. You'll want to know how much disk space is needed, and what the priorities are in terms of performance and
reliability. Once you have gathered all the requirements, you will create a volume to manage the application data. A volume provides convenient
control over data placement, performance, protection, and policy for an entire data set.

Make sure the cluster has the storage and processing capacity for the application. You'll need to take into account the starting and predicted size
of the data, the performance and protection requirements, and the memory required to run all the processes required on each node. Here is the
information to gather before beginning:

Access - How often will the data be read and written? What is the ratio of reads to writes?

Continuity - What is the desired recovery point objective (RPO)? What is the desired recovery time objective (RTO)?

Performance - Is the data static, or will it change frequently? Is the goal data storage or data processing?

Size - How much data capacity is required to start? What is the predicted growth of the data?

The considerations in the above table will determine the best way to set up a volume for the application.

About Volumes
Volumes provide a number of ways to help you meet the performance, access, and continuity goals of an application, while managing application
data size:

Mirroring - create read-only copies of the data for highly accessed data or multi-datacenter access
Permissions - allow users and groups to perform specific actions on a volume
Quotas - monitor and manage the data size by project, department, or user
Replication - maintain multiple synchronized copies of data for high availability and failure protection
Snapshots - create a real-time point-in-time data image to enable rollback
Topology - place data on a high-performance rack or limit data to a particular set of machines

See Volumes.

Mirroring

Mirroring means creating mirror volumes, full physical read-only copies of normal volumes for fault tolerance and high performance. When you
create a mirror volume, you specify a source volume from which to copy data, and you can also specify a schedule to automate
re-synchronization of the data to keep the mirror up-to-date. After a mirror is initially copied, the synchronization process saves bandwidth and
reads on the source volume by transferring only the deltas needed to bring the mirror volume to the same state as its source volume. A mirror
volume need not be on the same cluster as its source volume; MapR can sync data to a mirror on another cluster (as long as it is reachable over the
network). When creating multiple mirrors, you can further reduce the mirroring bandwidth overhead by daisy-chaining the mirrors. That is, set the
source volume of the first mirror to the original volume, the source volume of the second mirror to the first mirror, and so on. Each mirror is a full
copy of the volume, so remember to take the number of mirrors into account when planning application data size. See Mirrors.

Permissions

MapR provides fine-grained control over which users and groups can perform specific tasks on volumes and clusters. When you create a volume,
keep in mind which users or groups should have these types of access to the volume. You may want to create a specific group to associate with a
project or department, then add users to the group so that you can apply permissions to them all at the same time. See Managing Permissions.

Quotas

You can use quotas to limit the amount of disk space an application can use. There are two types of quotas:

User/Group quotas limit the amount of disk space available to a user or group
Volume quotas limit the amount of disk space available to a volume

When the data owned by a user, group, or volume exceeds the quota, MapR prevents further writes until either the data size falls below the quota
again, or the quota is raised to accommodate the data.

Volumes, users, and groups can also be assigned advisory quotas. An advisory quota does not limit the disk space available, but raises an alarm
and sends a notification when the space used exceeds a certain point. When you set a hard quota, you can set a slightly lower advisory quota to
serve as an early warning that the data is about to reach the hard quota, at which point further writes will be prevented.
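
Quotas can also be set from the command line. The following is a sketch for a volume quota; it assumes a hypothetical volume named
project-data and that your MapR version's volume modify command accepts the quota and advisoryquota parameters with size suffixes (check
the maprcli documentation for your release):

maprcli volume modify -name project-data -quota 1T -advisoryquota 900G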

Remember that volume quotas do not take into account disk space used by sub-volumes (because volume paths are logical, not physical).

You can set a User/Group quota to manage and track the disk space used by an accounting entity (a department, project, or application):

Create a group to represent the accounting entity.
Create one or more volumes and use the group as the Accounting Entity for each.
Set a User/Group quota for the group.
Add the appropriate users to the group.

When a user writes to one of the volumes associated with the group, any data written counts against the group's quota. Any writes to volumes not
associated with the group are not counted toward the group's quota. See Managing Quotas.

Replication

When you create a volume, you can choose a replication factor to safeguard important data. MapR manages the replication automatically, raising
an alarm and notification if replication falls below the minimum level you have set. Each replica of a volume is a full copy of the volume; remember
to take that into account when planning application data size.

Snapshots

A snapshot is an instant image of a volume at a particular point in time. Snapshots take no time to create, because they only record changes to
data over time rather than the data itself. You can manually create a snapshot to enable rollback to a particular known data state, or schedule
periodic automatic snapshots to ensure a specific recovery point objective (RPO). You can use snapshots and mirrors to achieve a near-zero
recovery time objective (RTO). Snapshots store only the deltas between a volume's current state and its state when the snapshot is taken.
Initially, snapshots take no space on disk, but they can grow arbitrarily as a volume's data changes. When planning application data size, take into
account how much the data is likely to change, and how often snapshots will be taken. See Snapshots.

Topology

You can restrict a volume to a particular rack by setting its physical topology attribute. This is useful for placing an application's data on a
high-performance rack (for critical applications) or a low-performance rack (to keep it out of the way of critical applications). See Setting Volume
Topology.

Scenarios

Here are a few ways to configure the application volume based on different types of data. If the application requires more than one type of data,
you can set up multiple volumes.

Data Type - Strategy

Important Data - High replication factor; frequent snapshots to minimize RPO and RTO; mirroring in a remote cluster

Highly Accessed Data - High replication factor; mirroring for high-performance reads; topology: data placement on high-performance machines

Scratch Data - No snapshots, mirrors, or replication; topology: data placement on low-performance machines

Static Data - Mirroring and replication set by performance and availability requirements; one snapshot (to protect against accidental changes);
volume set to read-only

The following documents provide examples of different ways to provision an application to meet business goals:

Provisioning for Capacity


Provisioning for Performance

Setting Up the Application


Once you know the course of action to take based on the application's data and performance needs, you can use the MapR Control System to set
up the application.

Creating a Group and a Volume


Setting Up Mirroring
Setting Up Snapshots
Setting Up User or Group Quotas

Creating a Group and a Volume

Create a group and a volume for the application. If you already have a snapshot schedule prepared, you can apply it to the volume at creation
time. Otherwise, use the procedure in Setting Up Snapshots below, after you have created the volume.

Setting Up Mirroring

If you want the mirror to sync automatically, use the procedure in Creating a Schedule to create a schedule.
Use the procedure in Creating a Volume to create a mirror volume. Make sure to set the following fields:
Volume Type - Mirror Volume
Source Volume - the volume you created for the application
Responsible Group/User - in most cases, the same as for the source volume

Setting Up Snapshots

To set up automatic snapshots for the volume, use the procedure in Scheduling a Snapshot.

Provisioning for Capacity


You can easily provision a volume for maximum data storage capacity by setting a low replication factor, setting hard and advisory quotas, and
tracking storage use by users, groups, and volumes. You can also set permissions to limit who can write data to the volume.

The replication factor determines how many complete copies of a volume are stored in the cluster. The actual storage requirement for a volume is
the volume size multiplied by its replication factor. To maximize storage capacity, set the replication factor on the volume to 1 at the time you
create the volume.

Volume quotas and user or group quotas limit the amount of data that can be written by a user or group, or the maximum size of a specific
volume. When the data size exceeds the advisory quota, MapR raises an alarm and notification but does not prevent additional data writes. Once
the data exceeds the hard quota, no further writes are allowed for the volume, user, or group. The advisory quota is generally somewhat lower
than the hard quota, to provide advance warning that the data is in danger of exceeding the hard quota. For a high-capacity volume, the volume
quotas should be as large as possible. You can use the advisory quota to warn you when the volume is approaching its maximum size.

To use the volume capacity wisely, you can limit write access to a particular user or group. Create a new user or group on all nodes in the cluster.

In this scenario, storage capacity takes precedence over high performance and data recovery; to maximize data storage, there will be no
snapshots or mirrors set up in the cluster. A low replication factor means that the data is less effectively protected against loss in the event that
disks or nodes fail. Because of these tradeoffs, this strategy is most suitable for risk-tolerant large data sets, and should not be used for data with
stringent protection, recovery, or performance requirements.

To create a high-capacity volume:

1. Set up a user or group that will be responsible for the volume. For more information, see Users & Groups.
2. In the MapR Control System, open the MapR-FS group and click Volumes to display the Volumes view.
3. Click the New Volume button to display the New Volume dialog.
4. In the Volume Setup pane, set the volume name and mount path.
5. In the Usage Tracking pane:
a. In the Group/User section, select User or Group and enter the user or group responsible for the volume.


b. In the Quotas section, check Volume Quota and enter the maximum capacity of the volume, based on the storage capacity of
your cluster. Example: 1 TB
c. Check Volume Advisory Quota and enter a lower number than the volume quota, to serve as advance warning when the data
approaches the hard quota. Example: 900 GB
6. In the Replication & Snapshot Scheduling pane:
a. Set Replication to 1.
b. Do not select a snapshot schedule.
7. Click OK to create the volume.
8. Set the volume permissions on the volume via NFS or using hadoop fs. You can limit writes to root and the responsible user or group.
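
A sketch of step 8 using hadoop fs, assuming the volume is mounted at the hypothetical path /projects/high-capacity and that the responsible
group is named projgroup:

hadoop fs -chown root:projgroup /projects/high-capacity
hadoop fs -chmod 770 /projects/high-capacity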

See Volumes for more information.

Provisioning for Performance


You can provision a high-performance volume by creating multiple mirrors of the data and defining volume topology to control data placement:
store the data on your fastest servers (for example, servers that use SSDs instead of hard disks).

When you create mirrors of a volume, make sure your application load-balances reads across the mirrors to increase performance. Each mirror is
an actual volume, so you can control data placement and replication on each mirror independently. The most efficient way to create multiple
mirrors is to cascade them rather than creating all the mirrors from the same source volume. Create the first mirror from the original volume, then
create the second mirror using the first mirror as the source volume, and so on. You can mirror the volume within the same cluster or to another
cluster, possibly in a different datacenter.

You can set node topology paths to specify the physical locations of nodes in the cluster, and volume topology paths to limit volumes to specific
nodes or racks.

To set node topology:

Use the following steps to create a rack path representing the high-performance nodes in your cluster.

1. In the MapR Control System, open the MapR-FS group and click Nodes to display the Nodes view.
2. Click the checkboxes next to the high-performance nodes.
3. Click the Change Topology button to display the Change Node Topology dialog.
4. In the Change Node Topology dialog, type a path to represent the high-performance rack. For example, if the cluster name is cluster1
and the high-performance nodes make up rack 14, type /cluster1/rack14.

To set up the source volume:

1. In the MapR Control System, open the MapR-FS group and click Volumes to display the Volumes view.
2. Click the New Volume button to display the New Volume dialog.
3. In the Volume Setup pane, set the volume name and mount path normally.
4. Set the Topology to limit the volume to the high-performance rack. Example: /cluster1/rack14

To set up the first mirror:

1. In the MapR Control System, open the MapR-FS group and click Volumes to display the Volumes view.
2. Click the New Volume button to display the New Volume dialog.
3. In the Volume Setup pane, set the volume name and mount path normally.
4. Choose Local Mirror Volume.
5. Set the Source Volume Name to the original volume name. Example: original-volume
6. Set the Topology to a different rack from the source volume.

To set up subsequent mirrors:

1. In the MapR Control System, open the MapR-FS group and click Volumes to display the Volumes view.
2. Click the New Volume button to display the New Volume dialog.
3. In the Volume Setup pane, set the volume name and mount path normally.
4. Choose Local Mirror Volume.
5. Set the Source Volume Name to the previous mirror volume name. Example: mirror1
6. Set the Topology to a different rack from the source volume and the other mirror.

See Volumes for more information.

Managing the Cluster


MapR provides a number of tools for managing the cluster. This section describes the following topics:

Balancers - managing replication and roles


CLDB Failover - metadata continuity on M3 and M5 clusters
Cluster Upgrade - installing the latest version of MapR software on an existing cluster
Dial Home - information about the MapR Dial Home service
Disks - adding, removing, and replacing disks
Monitoring - Getting timely information about the cluster
Node Topology - specifying the physical layout of the cluster
Nodes - Viewing, installing, configuring, and moving nodes
Services - managing services on each node
Shutting Down and Restarting a Cluster - safely halting a cluster

Balancers
The disk space balancer and the replication role balancer redistribute data in the MapR storage layer to ensure maximum performance and
efficient use of space:

The disk space balancer works to ensure that the percentage of space used on all disks in the node is similar, so that no nodes are
overloaded.
The replication role balancer changes the replication roles of cluster containers so that the replication process uses network bandwidth
evenly.

To view balancer status:

Use the maprcli dump balancerinfo command to view the space used and free on each storage pool, and the active container
moves. Example:

# maprcli dump balancerinfo


usedMB  fsid                 spid                              percentage  outTransitMB  inTransitMB  capacityMB
209     5567847133641152120  01f8625ba1d15db7004e52b9570a8ff3  1           0             0            15200
209     1009596296559861611  816709672a690c96004e52b95f09b58d  1           0             0            15200

To view balancer configuration values:

Pipe the maprcli config load command through grep. Example:

# maprcli config load -json | grep balancer


"cldb.balancer.disk.max.switches.in.nodes.percentage":"10",
"cldb.balancer.disk.paused":"0",
"cldb.balancer.disk.sleep.interval.sec":"120",
"cldb.balancer.disk.threshold.percentage":"70",
"cldb.balancer.logging":"0",
"cldb.balancer.role.max.switches.in.nodes.percentage":"10",
"cldb.balancer.role.paused":"0",
"cldb.balancer.role.sleep.interval.sec":"900",
"cldb.balancer.startup.interval.sec":"1800",

To set balancer configuration values:

Use the config save command to set the appropriate values. Example:

# maprcli config save -values {"cldb.balancer.disk.max.switches.in.nodes.percentage":"20"}

By default, the balancers are turned off.

To turn on the disk space balancer, use config save to set cldb.balancer.disk.paused to 0
To turn on the replication role balancer, use config save to set cldb.balancer.role.paused to 0
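
For example, to turn on the disk space balancer, follow the config save pattern shown above:

# maprcli config save -values {"cldb.balancer.disk.paused":"0"}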

Disk Space Balancer

The disk space balancer attempts to move containers from storage pools that are over 70% full to other storage pools that have lower utilization
than average for the cluster. You can view disk usage on all nodes in the Disks view by clicking Cluster > Nodes in the Navigation pane and then
choosing Disks from the dropdown.

Disk Space Balancer Configuration Parameters

Parameter Value Description

cldb.balancer.disk.threshold.percentage 70 Threshold for moving containers out of a given storage pool, expressed as
utilization percentage.

cldb.balancer.disk.paused 0 Specifies whether the disk space balancer runs:

0 - Not paused (normal operation)


1 - Paused (does not perform any container moves)

cldb.balancer.disk.max.switches.in.nodes.percentage 10 This can be used to throttle the disk balancer. If it is set to 10, the balancer will
throttle the number of concurrent container moves to 10% of the total nodes in
the cluster (minimum 2).

Replication Role Balancer

Each container in a MapR volume is either a master container (part of the original copy of the volume) or an intermediate or tail container (one of
the replicas in the replication chain). For a volume's name container (the volume's first container), replication occurs simultaneously from the
master to the intermediate and tail containers. For other containers in the volume, replication proceeds from the master to the intermediate
container(s) until it reaches the tail. Replication occurs over the network between nodes, often in separate racks. For balanced network
bandwidth, every node must have an equal share of masters, intermediates and tails. The replication role balancer switches container roles to
achieve this balance.

Replication Role Balancer Configuration Parameters

Parameter Value Description

cldb.balancer.role.paused 0 Specifies whether the role balancer runs:

0 - Not paused (normal operation)


1 - Paused (does not perform any container replication role switches)

cldb.balancer.role.max.switches.in.nodes.percentage 10 This can be used to throttle the role balancer. If it is set to 10, the balancer will
throttle the number of concurrent role
switches to 10% of the total nodes in the cluster (minimum 2).

CLDB Failover
The CLDB automatically replicates its data to other nodes in the cluster, preserving at least two (and generally three) copies of the CLDB data. If
the CLDB process dies, it is automatically restarted on the node. All jobs and processes wait for the CLDB to return, and resume from where they
left off, with no data or job loss.

If the node itself fails, the CLDB data is still safe, and the cluster can continue normally as soon as the CLDB is started on another node. In an
M5-licensed cluster, a failed CLDB node automatically fails over to another CLDB node without user intervention and without data loss. It is
possible to recover from a failed CLDB node on an M3 cluster, but the procedure is somewhat different.

Recovering from a Failed CLDB Node on an M3 Cluster

To recover from a failed CLDB node, perform the steps listed below:

1. Restore ZooKeeper - if necessary, install ZooKeeper on an additional node.


2. Locate the CLDB data - locate the nodes where replicates of CLDB data are stored, and choose one to serve as the new CLDB node.
3. Stop the selected node - stop the node you have chosen, to prepare for installing the CLDB service.
4. Install the CLDB on the selected node - install the CLDB service on the new CLDB node.
5. Configure the selected node - run configure.sh to inform the CLDB node where the CLDB and ZooKeeper services are running.
6. Start the selected node - start the new CLDB node.
7. Restart all nodes - stop each node in the cluster, run configure.sh on it, and start it.

Restore ZooKeeper

If the CLDB node that failed was also running ZooKeeper, install ZooKeeper on another node to maintain the minimum required number of
ZooKeeper nodes.

Locate the CLDB Data


Perform the following steps on any cluster node:

1. Login as root or use sudo for the following commands.


2. Issue the dump cldbnodes command, passing the ZooKeeper connect string, to determine which nodes contain the CLDB data:

maprcli dump cldbnodes -zkconnect <ZooKeeper connect string> -json

In the output, the nodes containing CLDB data are listed in the valid parameter.

3. Choose one of the nodes from the dump cldbnodes output, and perform the procedures listed below on it to install a new CLDB.

Stop the Selected Node

Perform the following steps on the node you have selected for installation of the CLDB:

1. Change to the root user (or use sudo for the following commands).
2. Stop the Warden:
/etc/init.d/mapr-warden stop
3. If ZooKeeper is installed on the node, stop it:
/etc/init.d/mapr-zookeeper stop

Install the CLDB on the Selected Node

Perform the following steps on the node you have selected for installation of the CLDB:

1. Login as root or use sudo for the following commands.


2. Install the CLDB service on the node:
RHEL/CentOS: yum install mapr-cldb
Ubuntu: apt-get install mapr-cldb
3. Wait until the failover delay expires. If you try to start the CLDB before the failover delay expires, the following message appears:

CLDB HA check failed: not licensed, failover denied: elapsed time since last failure=<time in
minutes> minutes

Configure the Selected Node

Perform the following steps on the node you have selected for installation of the CLDB:

Run the script configure.sh to create /opt/mapr/conf/mapr-clusters.conf and update the corresponding *.conf and *.xml files.
Before performing this step, make sure you have a list of the hostnames of the CLDB and ZooKeeper nodes. Optionally, you can specify the ports
for the CLDB and ZooKeeper nodes as well. If you do not specify them, the default ports are:

CLDB – 7222
ZooKeeper – 5181

The script configure.sh takes an optional cluster name and log file, and comma-separated lists of CLDB and ZooKeeper host names or IP
addresses (and optionally ports), using the following syntax:
/opt/mapr/server/configure.sh -C <host>[:<port>][,<host>[:<port>]...] -Z
<host>[:<port>][,<host>[:<port>]...] [-L <logfile>][-N <cluster name>]

Example:

/opt/mapr/server/configure.sh -C r1n1.sj.us:7222,r3n1.sj.us:7222,r5n1.sj.us:7222 -Z
r1n1.sj.us:5181,r2n1.sj.us:5181,r3n1.sj.us:5181,r4n1.sj.us:5181,r5n1.sj.us:5181 -N MyCluster

If you have not chosen a cluster name, you can run configure.sh again later to rename the cluster.

Start the Node

Perform the following steps on the node you have selected for installation of the CLDB:

1. If ZooKeeper is installed on the node, start it:


/etc/init.d/mapr-zookeeper start
2. Start the Warden:
/etc/init.d/mapr-warden start

Restart All Nodes

On all nodes in the cluster, perform the following procedures:

Stop the node:

1. Change to the root user (or use sudo for the following commands).
2. Stop the Warden:
/etc/init.d/mapr-warden stop
3. If ZooKeeper is installed on the node, stop it:
/etc/init.d/mapr-zookeeper stop

Configure the node with the new CLDB and ZooKeeper addresses:

Run the script configure.sh to create /opt/mapr/conf/mapr-clusters.conf and update the corresponding *.conf and *.xml files.
Before performing this step, make sure you have a list of the hostnames of the CLDB and ZooKeeper nodes. Optionally, you can specify the ports
for the CLDB and ZooKeeper nodes as well. If you do not specify them, the default ports are:

CLDB – 7222
ZooKeeper – 5181

The script configure.sh takes an optional cluster name and log file, and comma-separated lists of CLDB and ZooKeeper host names or IP
addresses (and optionally ports), using the following syntax:
/opt/mapr/server/configure.sh -C <host>[:<port>][,<host>[:<port>]...] -Z
<host>[:<port>][,<host>[:<port>]...] [-L <logfile>][-N <cluster name>]

Example:

/opt/mapr/server/configure.sh -C r1n1.sj.us:7222,r3n1.sj.us:7222,r5n1.sj.us:7222 -Z
r1n1.sj.us:5181,r2n1.sj.us:5181,r3n1.sj.us:5181,r4n1.sj.us:5181,r5n1.sj.us:5181 -N MyCluster

If you have not chosen a cluster name, you can run configure.sh again later to rename the cluster.

Start the node:

1. If ZooKeeper is installed on the node, start it:


/etc/init.d/mapr-zookeeper start
2. Start the Warden:
/etc/init.d/mapr-warden start

Cluster Upgrade
The following sections provide information about upgrading the cluster:

Rolling Upgrade provides information about automatically applying MapR software upgrades to a cluster.
Manual Upgrade provides information about stopping all nodes, installing updated packages manually, and restarting all nodes.

Manual Upgrade
Upgrading the MapR cluster manually entails stopping all nodes, installing updated packages, and restarting the nodes. Here are a few tips:

Make sure to add the correct repository directory for the version of MapR software you wish to install.
Work on all nodes at the same time; that is, stop the warden on all nodes before proceeding to the next step, and so on.
Use the procedure corresponding to the operating system on your cluster.

When performing a manual upgrade to MapR 1.2, it is necessary to run configure.sh on any nodes that are running the
HBase region server or HBase master.

If you are upgrading a node that is running the NFS service:


Use the mount command to determine whether the node has mounted MapR-FS through its own physical IP
(not a VIP). If so, unmount any such mount points before beginning the upgrade process.
If you are upgrading a node that is running the HBase RegionServer and the warden stop command does not return
after five minutes, kill the HBase RegionServer process manually:
1. Determine the process ID of the HBase RegionServer:

cat /opt/mapr/logs/hbase-root-regionserver.pid

2. Kill the HBase RegionServer using the following command, substituting the process ID from the previous step
for the placeholder <PID>:

kill -9 <PID>

CentOS and Red Hat

Perform all three procedures:

Upgrading the cluster - installing the new versions of the packages


Setting the version - manually updating the software configuration to reflect the correct version
Updating the configuration - switching to the new versions of the configuration files, and preserving any custom settings

To upgrade the cluster:

On each node, perform the following steps:

1. Change to the root user or use sudo for the following commands.
2. Make sure the MapR software is correctly installed and configured.
3. Add the MapR yum repository for the latest version of MapR software, removing any old versions. For more information, see Adding the
MapR Repository.
4. Stop the warden:

/etc/init.d/mapr-warden stop

5. If ZooKeeper is installed on the node, stop it:

/etc/init.d/mapr-zookeeper stop

6. Upgrade the MapR packages with the following command:

yum upgrade "mapr*"

7. If ZooKeeper is installed on the node, start it:

/etc/init.d/mapr-zookeeper start

8. Start the warden:

/etc/init.d/mapr-warden start

To update the configuration files:

1. If you have made any changes to mapred-site.xml or core-site.xml in the /opt/mapr/hadoop/hadoop-<version>/conf


directory, then make the same changes to the same files in the /opt/mapr/hadoop/hadoop-<version>/conf.new directory.
2. Rename /opt/mapr/hadoop/hadoop-<version>/conf to /opt/mapr/hadoop/hadoop-<version>/conf.old to deactivate it.
3. Rename /opt/mapr/hadoop/hadoop-<version>/conf.new to /opt/mapr/hadoop/hadoop-<version>/conf to activate it as
the configuration directory.

Ubuntu

Perform all three procedures:

Upgrading the cluster - installing the new versions of the packages


Setting the version - manually updating the software configuration to reflect the correct version
Updating the configuration - switching to the new versions of the configuration files, and preserving any custom settings

To upgrade the cluster:

The easiest way to upgrade all MapR packages on Ubuntu is to temporarily move the normal sources.list and replace it with a special
sources.list that specifies only the MapR software repository, then use apt-get upgrade to upgrade all packages from the special
sources.list file. On each node, perform the following steps:

1. Change to the root user or use sudo for the following commands.
2. Make sure the MapR software is correctly installed and configured.
3. Rename the normal /etc/apt/sources.list to prevent apt-get upgrade from reading packages from other repositories:

mv /etc/apt/sources.list /etc/apt/sources.list.orig

4. Create a new file /etc/apt/sources.list and add the MapR apt-get repository for the latest version of MapR software, removing
any old versions. For more information, see Adding the MapR Repository.
5. Stop the warden:

/etc/init.d/mapr-warden stop

6. If ZooKeeper is installed on the node, stop it:

/etc/init.d/mapr-zookeeper stop

7. Clear the APT cache:

apt-get clean

8. Update the list of available packages:

apt-get update

9. Upgrade all MapR packages:

apt-get upgrade

10. Rename the special /etc/apt/sources.list and restore the original:

mv /etc/apt/sources.list /etc/apt/sources.list.mapr
mv /etc/apt/sources.list.orig /etc/apt/sources.list

11. If ZooKeeper is installed on the node, start it:

/etc/init.d/mapr-zookeeper start

12. Start the warden:

/etc/init.d/mapr-warden start

Setting the Version


After completing the upgrade on all nodes, use the following steps on any node to set the correct version:

1. Check the software version by looking at the contents of the MapRBuildVersion file. Example:

$ cat /opt/mapr/MapRBuildVersion

2. Set the version accordingly using the config save command. Example:
maprcli config save -values {mapr.targetversion:"1.1.3.11542GA-1"}

Updating the Configuration

1. If you have made any changes to mapred-site.xml or core-site.xml in the /opt/mapr/hadoop/hadoop-<version>/conf


directory, then make the same changes to the same files in the /opt/mapr/hadoop/hadoop-<version>/conf.new directory.
2. Rename /opt/mapr/hadoop/hadoop-<version>/conf to /opt/mapr/hadoop/hadoop-<version>/conf.old to deactivate it.
3. Rename /opt/mapr/hadoop/hadoop-<version>/conf.new to /opt/mapr/hadoop/hadoop-<version>/conf to activate it as
the configuration directory.

Rolling Upgrade
A rolling upgrade installs the latest version of MapR software on all nodes in the cluster. You perform a rolling upgrade by running the script
rollingupgrade.sh from a node in the cluster. The script upgrades all packages on each node, logging output to the rolling upgrade log (
/opt/mapr/logs/rollingupgrade.log). You must specify either a directory containing packages (using the -p option) or a version to fetch
from the MapR repository (using the -v option). Here are a few tips:

If you specify a local directory with the -p option, you must either ensure that the same directory containing the packages exists on all the
nodes, or use the -x option to copy packages out to each node via SCP automatically (requires the -s option). If you use the -x option,
the upgrade process copies the packages from the directory specified with -p into the same directory path on each node.
In a multi-cluster setting, use -c to specify which cluster to upgrade. If -c is not specified, the default cluster is upgraded.
When specifying the version with the -v parameter, use the format x.y.z to specify the major, minor, and revision numbers of the target
version. Example: 1.1.3
The package rpmrebuild (Red Hat) or dpkg-repack (Ubuntu) enables automatic rollback if the upgrade fails. You can also
accomplish this, without installing the package, by running rollingupgrade.sh with the -n option.

There are two ways to perform a rolling upgrade:

Via SSH - If keyless SSH is set up between all nodes, use the -s option to automatically upgrade all nodes without user intervention.
Node by node - If SSH is not available, the script prepares the cluster for upgrade and guides the user through upgrading each node. In a
node-by-node installation, you must individually run the commands to upgrade each node when instructed by the rollingupgrade.sh
script.

Before upgrading the cluster, make sure no MapReduce jobs are running. If the cluster is NFS-mounted, unmount any such
mount points before beginning the upgrade process.

Requirements

On the computer from which you will be starting the upgrade, perform the following steps:

1. Change to the root user (or use sudo for the following commands).
2. If you are starting the upgrade from a computer that is not a MapR client or a MapR cluster node, you must add the MapR repository (
CentOS or Red Hat, or Ubuntu) and install mapr-core:
CentOS or Red Hat: yum install mapr-core
Ubuntu: apt-get install mapr-core
Run configure.sh, using -C to specify the cluster CLDB nodes and -Z to specify the cluster ZooKeeper nodes. Example:

/opt/mapr/server/configure.sh -C 10.10.100.1,10.10.100.2,10.10.100.3 -Z
10.10.100.1,10.10.100.2,10.10.100.3

3. To enable a fully automatic rolling upgrade, ensure passwordless SSH is enabled to all nodes for the user root, from the computer on
which the upgrade will be started.

On all nodes and the computer from which you will be starting the upgrade, perform the following steps:

1. Change to the root user (or use sudo for the following commands).
2. Add the MapR software repository (CentOS or Red Hat, or Ubuntu).
3. Install rolling upgrade scripts:
CentOS or Red Hat: yum install mapr-upgrade
Ubuntu: apt-get install mapr-upgrade
4. Install the following packages to enable automatic rollback:
Ubuntu: apt-get install dpkg-repack
CentOS or Red Hat: yum install rpmrebuild
5. If you are planning to upgrade from downloaded packages instead of the repository, prepare a directory containing the packages to
upgrade using the -p option with rollingupgrade.sh. This directory should reside at the same absolute path on each node. For the
download location, see Local Packages - Ubuntu or Local Packages - Red Hat.

Upgrading the Cluster via SSH

On the node from which you will be starting the upgrade, issue the rollingupgrade.sh command as root (or with sudo) to upgrade the
cluster:

If you have prepared a directory of packages to upgrade, issue the following command, substituting the path to the directory for the
<directory> placeholder:

/opt/upgrade-mapr/rollingupgrade.sh -s -p <directory>

If you are upgrading from the MapR software repository, issue the following command, substituting the version (x.y.z) for the <version>
placeholder:

/opt/upgrade-mapr/rollingupgrade.sh -s -v <version>

Upgrading the Cluster Node by Node

On the node from which you will be starting the upgrade, use the rollingupgrade.sh command as root (or with sudo) to upgrade the cluster:

1. Start the upgrade:


If you have prepared a directory of packages to upgrade, issue the following command, substituting the path to the directory for
the <directory> placeholder:

/opt/upgrade-mapr/rollingupgrade.sh -p <directory>

If you are upgrading from the MapR software repository, issue the following command, substituting the version (x.y.z) for the
<version> placeholder:

/opt/upgrade-mapr/rollingupgrade.sh -v <version>

2. When prompted, run singlenodeupgrade.sh on all nodes other than the master CLDB node, following the on-screen instructions.
3. When prompted, run singlenodeupgrade.sh on the master CLDB node, following the on-screen instructions.

Dial Home
MapR provides a service called Dial Home, which automatically collects certain metrics about the cluster for use by support engineers and to help
us improve and evolve our product. When you first install MapR, you are presented with the option to enable or disable Dial Home. We
recommend enabling it. You can enable or disable Dial Home later, using the dialhome enable and dialhome disable commands respectively.

Disks
MapR-FS groups disks into storage pools, usually made up of two or three disks. When adding disks to MapR-FS, it is a good idea to add at least
two or three at a time so that MapR can create properly-sized storage pools. When you remove a disk from MapR-FS, any other disks in the
storage pool are also removed from MapR-FS automatically; the disk you removed, as well as the others in its storage pool, are then available to
be added to MapR-FS again. You can either replace the disk and re-add it along with the other disks that were in the storage pool, or just re-add
the other disks if you do not plan to replace the disk you removed.

The following sections provide procedures for working with disks:

Adding Disks - adding disks for use by MapR-FS


Removing Disks - removing disks from use by MapR-FS
Handling Disk Failure - replacing a disk in case of failure
Tolerating Slow Disks - increasing the disk timeout to handle slow disks

Adding Disks
You can add one or more available disks to MapR-FS using the disk add command or the MapR Control System. In both cases, MapR
automatically takes care of formatting the disks and creating storage pools.

To add disks using the MapR Control System:

1. Add physical disks to the node or nodes according to the correct hardware procedure.
2. In the Navigation pane, expand the Cluster group and click the Nodes view.
3. Click the name of the node on which you wish to add disks.
4. In the MapR-FS and Available Disks pane, select the checkboxes beside the disks you wish to add.
5. Click Add Disks to MapR-FS to add the disks. Properly-sized storage pools are allocated automatically.

Removing Disks
You can remove one or more disks from MapR-FS using the disk remove command or the MapR Control System. When you remove disks from
MapR-FS, any other disks in the same storage pool are also removed from MapR-FS and become available (not in use, and eligible to be
re-added to MapR-FS).

If you are removing and replacing failed disks, you can install the replacements, then re-add the replacement disks and the other disks
that were in the same storage pool(s) as the failed disks.
If you are removing disks but not replacing them, you can just re-add the other disks that were in the same storage pool(s) as the failed
disks.

To remove disks using the MapR Control System:

1. In the Navigation pane, expand the Cluster group and click the Nodes view.
2. Click the name of the node from which you wish to remove disks.
3. In the MapR-FS and Available Disks pane, select the checkboxes beside the disks you wish to remove.
4. Click Remove Disks from MapR-FS to remove the disks from MapR-FS.
5. Wait several minutes while the removal process completes. After you remove the disks, any other disks in the same storage pools are
taken offline and marked as available (not in use by MapR).
6. Remove the physical disks from the node or nodes according to the correct hardware procedure.

Handling Disk Failure


When disks fail, MapR raises an alarm and identifies which disks on which nodes have failed.

If a disk failure alarm is raised, check the Failure Reason field in /opt/mapr/logs/faileddisk.log to determine the reason for failure.
There are two failure cases that may not require disk replacement:

Failure Reason: Timeout - try increasing the mfs.io.disk.timeout in mfs.conf.


Failure Reason: Disk GUID mismatch - if a node has restarted, the drive labels (sda, etc.) can be reassigned by the operating system,
and will no longer match the entries in disktab. Edit disktab according to the instructions in the log to repair the problem.

To replace disks using the MapR command-line interface:

1. On the node with the failed disk(s), look at the Disk entries in /opt/mapr/logs/faileddisk.log to determine which disk or disks
have failed.
2. Look at the Failure Reason entries to determine whether the disk(s) should be replaced.
3. Use the disk remove command to remove the disk(s). Use the following syntax, substituting the hostname or IP address for <host>
and a list of disks for <disks>:

maprcli disk remove -host <host> -disks <disks>

4. Wait several minutes while the removal process completes. After you remove the disks, any other disks in the same storage pools are
taken offline and marked as available (not in use by MapR). Keep a list of these disks so that you can re-add them at the same time as
you add the replacement disks.
5. Replace the failed disks on the node or nodes according to the correct hardware procedure.
6. Use the disk add command to add the replacement disk(s) and the others that were in the same storage pool(s). Use the following
syntax, substituting the hostname or IP address for <host> and a list of disks for <disks>:

maprcli disk add -host <host> -disks <disks>

Properly-sized storage pools are allocated automatically.


To replace disks using the MapR Control System:

1. Identify the failed disk or disks:


In the Navigation pane, expand the Cluster group and click the Nodes view.
Click the name of the node on which you wish to replace disks, and look in the MapR-FS and Available Disks pane.
2. Remove the failed disk or disks from MapR-FS:
In the MapR-FS and Available Disks pane, select the checkboxes beside the failed disks.
Click Remove Disks from MapR-FS to remove the disks from MapR-FS.
Wait several minutes while the removal process completes. After you remove the disks, any other disks in the same storage
pools are taken offline and marked as available (not in use by MapR).
3. Replace the failed disks on the node or nodes according to the correct hardware procedure.
4. Add the replacement and available disks to MapR-FS:
In the Navigation pane, expand the Cluster group and click the Nodes view.
Click the name of the node on which you replaced the disks.
In the MapR-FS and Available Disks pane, select the checkboxes beside the disks you wish to add.
Click Add Disks to MapR-FS to add the disks. Properly-sized storage pools are allocated automatically.

Tolerating Slow Disks


The parameter mfs.io.disk.timeout in mfs.conf determines how long MapR waits for a disk to respond before assuming it has failed. If
healthy disks are too slow, and are erroneously marked as failed, you can increase the value of this parameter.
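For example, to give disks up to two minutes to respond, you could raise the timeout as follows. This is a minimal sketch that assumes mfs.conf uses a key=value format; the value shown is illustrative, so check mfs.conf itself for the current setting and unit before changing it:

mfs.io.disk.timeout=120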

Monitoring
This section provides information about monitoring the cluster:

Alarms and Notifications


Ganglia
Nagios Integration
Service Metrics

Alarms and Notifications


On a cluster with an M5 license, MapR raises alarms and sends notifications to alert you to information about the cluster:

Cluster health, including disk failures


Volumes that are under-replicated or over quota
Services not running

You can see any currently raised alarms in the Alarms view of the MapR Control System, or using the alarm list command. For a list of all alarms,
see Troubleshooting Alarms.
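For example, you can list the currently raised alarms from the command line (a minimal sketch; see the alarm list documentation for filtering and column options):

/opt/mapr/bin/maprcli alarm list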

To view cluster alarms using the MapR Control System:

1. In the Navigation pane, expand the Cluster group and click the Dashboard view.
2. All alarms for the cluster and its nodes and volumes are displayed in the Alarms pane.

To view node alarms using the MapR Control System:

In the Navigation pane, expand the Alarms group and click the Node Alarms view.

You can also view node alarms in the Node Properties view, the NFS Alarm Status view, and the Alarms pane of the Dashboard view.

To view volume alarms using the MapR Control System:

In the Navigation pane, expand the Alarms group and click the Volume Alarms view.

You can also view node alarms in the Alarms pane of the Dashboard view.

Notifications

When an alarm is raised, MapR can send an email notification to either or both of the following addresses:

The owner of the cluster, node, volume, or entity for which the alarm was raised (standard notification)
A custom email address for the named alarm.

You can set up alarm notifications using the alarm config save command or from the Alarms view in the MapR Control System.

To set up alarm notifications using the MapR Control System:


1. In the Navigation pane, expand the Alarms group and click the Alarm Notifications view.
2. Display the Configure Alarm Subscriptions dialog by clicking Alarm Notifications.
3. For each Alarm:
To send notifications to the owner of the cluster, node, volume, or entity: select the Standard Notification checkbox.
To send notifications to an additional email address, type an email address in the Additional Email Address field.
4. Click Save to save the configuration changes.

Ganglia
Ganglia is a scalable distributed system monitoring tool that allows remote viewing of live or historical statistics for a cluster. The Ganglia system
consists of the following components:

A PHP-based web front end


Ganglia monitoring daemon (gmond): a multi-threaded monitoring daemon
Ganglia meta daemon (gmetad): a multi-threaded aggregation daemon
A few small utility programs

The daemon gmetad aggregates metrics from the gmond instances, storing them in a database. The front end pulls metrics from the database
and graphs them. You can aggregate data from multiple clusters by setting up a separate gmetad for each, and then a master gmetad to
aggregate data from the others. If you configure Ganglia to monitor multiple clusters, remember to use a separate port for each cluster.
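For example, a second cluster could report on its own port. The snippet below is a hedged sketch of the relevant gmond.conf and gmetad.conf stanzas, not taken from this document; the host name and port number are illustrative:

# gmond.conf on the second cluster's nodes
udp_send_channel {
  host = ganglia-master.example.com
  port = 8650
}
udp_recv_channel {
  port = 8650
}

# gmetad.conf on the aggregating host
data_source "cluster-two" ganglia-master.example.com:8650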

MapR with Ganglia

The CLDB reports metrics about its own load, as well as cluster-wide metrics such as CPU and memory utilization, the number of active
FileServer nodes, the number of volumes created, etc. For a complete list of metrics, see below.

MapRGangliaContext collects and sends CLDB metrics, FileServer metrics, and cluster-wide metrics to gmond or gmetad, depending on the
configuration. On the Ganglia front end, these metrics are displayed separately for each FileServer by hostname. The Ganglia monitoring
daemon only needs to be installed on CLDB nodes to collect all the metrics required for monitoring a MapR cluster. To monitor other services
such as HBase and MapReduce, install gmond on the nodes running those services and configure it as you normally would.

The Ganglia properties for the cldb and fileserver contexts are configured in the file
$INSTALL_DIR/conf/hadoop-metrics.properties. Any changes to this file require a CLDB restart.

Installing Ganglia

To install Ganglia on Ubuntu:

1. On each CLDB node, install ganglia-monitor: sudo apt-get install ganglia-monitor


2. On the machine where you plan to run the Gmeta daemon, install gmetad: sudo apt-get install gmetad
3. On the machine where you plan to run the Ganglia front end, install ganglia-webfrontend: sudo apt-get install
ganglia-webfrontend

To install Ganglia on Red Hat:

1. Download the following RPM packages for Ganglia version 3.1 or later:
ganglia-gmond
ganglia-gmetad
ganglia-web
2. On each CLDB node, install ganglia-gmond: rpm -ivh <ganglia-gmond>
3. On the machine where you plan to run the Ganglia meta daemon, install gmetad: rpm -ivh <ganglia-gmetad>
4. On the machine where you plan to run the Ganglia front end, install the web front end: rpm -ivh <ganglia-web>

For more details about Ganglia configuration and installation, see the Ganglia documentation.

To start sending CLDB metrics to Ganglia:

1. Make sure the CLDB is configured to send metrics to Ganglia (see Service Metrics).
2. As root (or using sudo), run the following commands:

maprcli config save -values '{"cldb.ganglia.cldb.metrics":"1"}'


maprcli config save -values '{"cldb.ganglia.fileserver.metrics":"1"}'

To stop sending CLDB metrics to Ganglia:

As root (or using sudo), run the following commands:


maprcli config save -values '{"cldb.ganglia.cldb.metrics":"0"}'
maprcli config save -values '{"cldb.ganglia.fileserver.metrics":"0"}'

Metrics Collected

CLDB metrics:

Number of FileServers
Number of Volumes
Number of Containers
Cluster Disk Space Used GB
Cluster Disk Space Available GB
Cluster Disk Capacity GB
Cluster Memory Capacity MB
Cluster Memory Used MB
Cluster Cpu Busy %
Cluster Cpu Total
Number of FS Container Failure Reports
Number of Client Container Failure Reports
Number of FS RW Container Reports
Number of Active Container Reports
Number of FS Volume Reports
Number of FS Register
Number of container lookups
Number of container assign
Number of container corrupt reports
Number of rpc failed
Number of rpc received

FileServer metrics:

FS Disk Used GB
FS Disk Available GB
Cpu Busy %
Memory Total MB
Memory Used MB
Memory Free MB
Network Bytes Received
Network Bytes Sent

Monitoring Tools
MapR works with the following third-party monitoring tools:

Ganglia
Nagios

Nagios Integration
Nagios is an open-source cluster monitoring tool. MapR can generate a Nagios Object Definition File that describes the nodes in the cluster and
the services running on each. You can generate the file using the MapR Control System or the nagios generate command, then save the file in
the proper location in your Nagios environment.
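For example, from the command line you could generate the file and redirect it into your Nagios configuration directory. The output path below is illustrative and depends on how your Nagios installation is laid out:

/opt/mapr/bin/maprcli nagios generate > /etc/nagios/objects/mapr.cfg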

MapR recommends Nagios version 3.3.1 and version 1.4.15 of the plugins.

To generate a Nagios file using the MapR Control System:

1. In the Navigation pane, click Nagios.


2. Copy and paste the output, and save as the appropriate Object Definition File in your Nagios environment.

For more information, see the Nagios documentation.

Service Metrics
MapR services produce metrics that can be written to an output file or consumed by Ganglia. The file metrics output is directed by the
hadoop-metrics.properties files.
By default, the CLDB and FileServer metrics are sent via unicast to the Ganglia gmon server running on localhost. To send the metrics directly to
a Gmeta server, change the cldb.servers property to the hostname of the Gmeta server. To send the metrics to a multicast channel, change
the cldb.servers property to the IP address of the multicast channel.
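For example, to report directly to a gmetad host rather than the local gmond, you could change the cldb section of hadoop-metrics.properties as follows; the host name is illustrative, and the port shown matches the default used in the example later in this section:

cldb.servers=gmetad-host.example.com:8649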

To configure metrics for a service:

1. Edit the appropriate hadoop-metrics.properties file on all CLDB nodes, depending on the service:
For MapR-specific services, edit /opt/mapr/conf/hadoop-metrics.properties
For standard Hadoop services, edit /opt/mapr/hadoop/hadoop-<version>/conf/hadoop-metrics.properties
2. In the sections specific to the service:
Un-comment the lines pertaining to the context to which you wish the service to send metrics.
Comment out the lines pertaining to other contexts.


3. Restart the service.

To enable metrics:

1. As root (or using sudo), run the following commands:

maprcli config save -values '{"cldb.ganglia.cldb.metrics":"1"}'


maprcli config save -values '{"cldb.ganglia.fileserver.metrics":"1"}'

To disable metrics:

1. As root (or using sudo), run the following commands:

maprcli config save -values '{"cldb.ganglia.cldb.metrics":"0"}'


maprcli config save -values '{"cldb.ganglia.fileserver.metrics":"0"}'

Example

In the following example, CLDB service metrics will be sent to the Ganglia context:

#CLDB metrics config - Pick one out of null,file or ganglia.


#Uncomment all properties in null, file or ganglia context, to send cldb metrics to that context

# Configuration of the "cldb" context for null


#cldb.class=org.apache.hadoop.metrics.spi.NullContextWithUpdateThread
#cldb.period=10

# Configuration of the "cldb" context for file


#cldb.class=org.apache.hadoop.metrics.file.FileContext
#cldb.period=60
#cldb.fileName=/tmp/cldbmetrics.log

# Configuration of the "cldb" context for ganglia


cldb.class=com.mapr.fs.cldb.counters.MapRGangliaContext31
cldb.period=10
cldb.servers=localhost:8649
cldb.spoof=1

Nodes
This section provides information about managing nodes in the cluster:

Viewing a List of Nodes - displaying all the nodes recognized by the MapR cluster
Adding a Node - installing a new node on the cluster (requires fc or a permission)
Managing Services - starting or stopping services on a node (requires ss, fc, or a permission)
Reformatting a Node - reformatting a node's disks
Removing a Node - removing a node temporarily for maintenance (requires fc or a permission)
Decommissioning a Node - permanently uninstalling a node (requires fc or a permission)
Reconfiguring a Node - installing, upgrading, or removing hardware or software, or changing roles
Renaming a Node - changing a node's hostname

Viewing a List of Nodes


You can view all nodes using the node list command, or view them in the MapR Control System using the following procedure.

To view all nodes using the MapR Control System:

In the Navigation pane, expand the Cluster group and click the Nodes view.
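From the command line, the same information is available through node list. For example, to show each node's ID, health, hostname, and services (the column names mirror those used in the examples later in this guide):

/opt/mapr/bin/maprcli node list -columns id,h,hn,svc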

Adding a Node

To Add Nodes to a Cluster


1. PREPARE all nodes, making sure they meet the hardware, software, and configuration requirements.
2. PLAN which services to run on the new nodes.
3. INSTALL MapR Software:
On all new nodes, ADD the MapR Repository.
On each new node, INSTALL the planned MapR services.
On all new nodes, RUN configure.sh.
On all new nodes, FORMAT disks for use by MapR.
If you have made any changes to configuration files such as warden.conf or mapred-site.xml, copy these configuration
changes from another node in the cluster.
4. Start each node:
On any new nodes that have ZooKeeper installed, start it:

/etc/init.d/mapr-zookeeper start

On all new nodes, start the warden:

/etc/init.d/mapr-warden start

5. If any of the new nodes are CLDB or ZooKeeper nodes (or both):
RUN configure.sh on all new and existing nodes in the cluster, specifying all CLDB and ZooKeeper nodes.
6. SET UP node topology for the new nodes.
7. On any new nodes running NFS, SET UP NFS for HA.

Managing Services
You can manage node services using the node services command, or in the MapR Control System using the following procedure.

To manage node services using the MapR Control System:

1. In the Navigation pane, expand the Cluster group and click the Nodes view.
2. Select the checkbox beside the node or nodes on which you wish to manage services.
3. Click the Manage Services button to display the Manage Node Services dialog.
4. For each service you wish to start or stop, select the appropriate option from the corresponding drop-down menu.
5. Click Change Node to start and stop the services according to your selections.

You can also display the Manage Node Services dialog by clicking Manage Services in the Node Properties view.

Reformatting a Node
1. Change to the root user (or use sudo for the following commands).
2. Stop the Warden:
/etc/init.d/mapr-warden stop
3. Remove the disktab file:
rm /opt/mapr/conf/disktab
4. Create a text file /tmp/disks.txt that lists all the disks and partitions to format for use by MapR. See Setting Up Disks for MapR.
5. Use disksetup to re-format the disks:
disksetup -F /tmp/disks.txt
6. Start the Warden:
/etc/init.d/mapr-warden start

Removing a Node
You can remove a node using the node remove command, or in the MapR Control System using the following procedure. Removing a node
detaches the node from the cluster, but does not remove the MapR software from the cluster.

To remove a node using the MapR Control System:

1. In the Navigation pane, expand the Cluster group and click the NFS Nodes view.
2. Select the checkbox beside the node or nodes you wish to remove.
3. Click Manage Services and stop all services on the node.
4. Wait 5 minutes. The Remove button becomes active.
5. Click the Remove button to display the Remove Node dialog.
6. Click Remove Node to remove the node.
If you are using Ganglia, restart all gmeta and gmon daemons in the cluster. See Ganglia.

You can also remove a node by clicking Remove Node in the Node Properties view.

Decommissioning a Node
Use the following procedures to remove a node and uninstall the MapR software. This procedure detaches the node from the cluster and removes
the MapR packages, log files, and configuration files, but does not format the disks.

Before Decommissioning a Node


Make sure any data on the node is replicated and any needed services are running elsewhere. For example, if
decommissioning the node would result in too few instances of the CLDB, start CLDB on another node beforehand; if you are
decommissioning a ZooKeeper node, make sure you have enough ZooKeeper instances to meet a quorum after the node is
removed. See Planning the Deployment for recommendations.

To decommission a node permanently:

1. Change to the root user (or use sudo for the following commands).
2. Stop the Warden:
/etc/init.d/mapr-warden stop
3. If ZooKeeper is installed on the node, stop it:
/etc/init.d/mapr-zookeeper stop
4. Determine which MapR packages are installed on the node:
dpkg --list | grep mapr (Ubuntu)
rpm -qa | grep mapr (Red Hat or CentOS)
5. Remove the packages by issuing the appropriate command for the operating system, followed by the list of services. Examples:
apt-get purge mapr-core mapr-cldb mapr-fileserver (Ubuntu)
yum erase mapr-core mapr-cldb mapr-fileserver (Red Hat or CentOS)
6. If the node you have decommissioned is a CLDB node or a ZooKeeper node, then run configure.sh on all other nodes in the cluster
(see Configuring a Node).

If you are using Ganglia, restart all gmeta and gmon daemons in the cluster. See Ganglia.

Reconfiguring a Node
You can add, upgrade, or remove services on a node to perform a manual software upgrade or to change the roles a node serves. The procedure
involves the following steps:

Stopping the Node
Formatting the Disks (optional)
Installing or Removing Software or Hardware
Configuring the Node
Starting the Node

This procedure is designed to make changes to existing MapR software on a machine that has already been set up as a MapR cluster node. If
you need to install software for the first time on a machine to create a new node, please see Adding a Node instead.

Stopping a Node

1. Change to the root user (or use sudo for the following commands).
2. Stop the Warden:
/etc/init.d/mapr-warden stop
3. If ZooKeeper is installed on the node, stop it:
/etc/init.d/mapr-zookeeper stop

Installing or Removing Software or Hardware

Before installing or removing software or hardware, stop the node using the procedure described in Stopping the Node.

Once the node is stopped, you can add, upgrade, or remove software or hardware. After adding or removing services, restart the warden so that it
can re-optimize memory allocation among all the services on the node. This does not need to happen immediately; you can restart the warden at a
time when the cluster is not busy.

To add or remove individual MapR packages, use the standard package management commands for your Linux distribution:

apt-get (Ubuntu)
yum (Red Hat or CentOS)

For information about the packages to install, see Planning the Deployment.

To add roles to an existing node:

1. Install the packages corresponding to the new roles using apt-get or yum.
2. Run configure.sh with a list of the CLDB nodes and ZooKeeper nodes in the cluster.
3. If you have added the CLDB or ZooKeeper role, run configure.sh on all nodes in the cluster.
4. The warden picks up the new configuration and automatically starts the new services. If you are adding the CLDB role, restart the
warden:

# /etc/init.d/mapr-warden stop
# /etc/init.d/mapr-warden start

If you are not adding the CLDB role, you can wait until a convenient time to restart the warden.

To remove roles from an existing node:

1. Purge the packages corresponding to the roles using apt-get or yum.


2. Run configure.sh with a list of the CLDB nodes and ZooKeeper nodes in the cluster.
3. If you have removed the CLDB or ZooKeeper role, run configure.sh on all nodes in the cluster.

The warden picks up the new configuration automatically. When it is convenient, restart the warden:

# /etc/init.d/mapr-warden stop
# /etc/init.d/mapr-warden start

Setting Up a Node

Formatting the Disks

The script disksetup removes all data from the specified disks. Make sure you specify the disks correctly, and that any data
you wish to keep has been backed up elsewhere before you follow this procedure.

1. Change to the root user (or use sudo for the following command).
2. Run disksetup, specifying the disk list file.
Example:

/opt/mapr/server/disksetup -F /tmp/disks.txt

Configuring the Node

Run the script configure.sh to create /opt/mapr/conf/mapr-clusters.conf and update the corresponding *.conf and *.xml files.
Before performing this step, make sure you have a list of the hostnames of the CLDB and ZooKeeper nodes. Optionally, you can specify the ports
for the CLDB and ZooKeeper nodes as well. If you do not specify them, the default ports are:

CLDB – 7222
ZooKeeper – 5181

The script configure.sh takes an optional cluster name and log file, and comma-separated lists of CLDB and ZooKeeper host names or IP
addresses (and optionally ports), using the following syntax:
/opt/mapr/server/configure.sh -C <host>[:<port>][,<host>[:<port>]...] -Z
<host>[:<port>][,<host>[:<port>]...] [-L <logfile>][-N <cluster name>]
Example:

/opt/mapr/server/configure.sh -C r1n1.sj.us:7222,r3n1.sj.us:7222,r5n1.sj.us:7222 -Z
r1n1.sj.us:5181,r2n1.sj.us:5181,r3n1.sj.us:5181,r4n1.sj.us:5181,r5n1.sj.us:5181 -N MyCluster

If you have not chosen a cluster name, you can run configure.sh again later to rename the cluster.

Starting the Node

1. If ZooKeeper is installed on the node, start it:


/etc/init.d/mapr-zookeeper start
2. Start the Warden:
/etc/init.d/mapr-warden start

Renaming a Node

To rename a node:

1. Stop the warden on the node. Example:

/etc/init.d/mapr-warden stop

2. If the node is a ZooKeeper node, stop ZooKeeper on the node. Example:

/etc/init.d/mapr-zookeeper stop

3. Rename the host:


On Red Hat or CentOS, edit the HOSTNAME parameter in the file /etc/sysconfig/network and restart the xinetd service or
reboot the node.
On Ubuntu, change the old hostname to the new hostname in the /etc/hostname and /etc/hosts files.
4. If the node is a ZooKeeper node or a CLDB node, run configure.sh with a list of CLDB and ZooKeeper nodes. See configure.sh.
5. If the node is a ZooKeeper node, start ZooKeeper on the node. Example:

/etc/init.d/mapr-zookeeper start

6. Start the warden on the node. Example:

/etc/init.d/mapr-warden start

Adding Nodes to a Cluster

To Add Nodes to a Cluster

1. PREPARE all nodes, making sure they meet the hardware, software, and configuration requirements.
2. PLAN which services to run on the new nodes.
3. INSTALL MapR Software:
On all new nodes, ADD the MapR Repository.
On each new node, INSTALL the planned MapR services.
On all new nodes, RUN configure.sh.
On all new nodes, FORMAT disks for use by MapR.
If you have made any changes to configuration files such as warden.conf or mapred-site.xml, copy these configuration
changes from another node in the cluster.
4. Start each node:
On any new nodes that have ZooKeeper installed, start it:

/etc/init.d/mapr-zookeeper start

On all new nodes, start the warden:

/etc/init.d/mapr-warden start

5. If any of the new nodes are CLDB or ZooKeeper nodes (or both):
RUN configure.sh on all new and existing nodes in the cluster, specifying all CLDB and ZooKeeper nodes.
6. SET UP node topology for the new nodes.
7. On any new nodes running NFS, SET UP NFS for HA.

Adding Roles

To add roles to an existing node:

1. Install the packages corresponding to the new roles using apt-get or yum.
2. Run configure.sh with a list of the CLDB nodes and ZooKeeper nodes in the cluster.
3. If you have added the CLDB or ZooKeeper role, run configure.sh on all nodes in the cluster.
4. The warden picks up the new configuration and automatically starts the new services. If you are adding the CLDB role, restart the
warden:

# /etc/init.d/mapr-warden stop
# /etc/init.d/mapr-warden start

If you are not adding the CLDB role, you can wait until a convenient time to restart the warden.

Node Topology
Topology tells MapR about the locations of nodes and racks in the cluster. Topology is important, because it determines where MapR places
replicated copies of data. If you define the cluster topology properly, MapR scatters replication on separate racks so that your data remains
available in the event an entire rack fails. Cluster topology is defined by specifying a topology path for each node in the cluster. The paths group
nodes by rack or switch, depending on how the physical cluster is arranged and how you want MapR to place replicated data.

Topology paths can be as simple or complex as needed to correspond to your cluster layout. In a simple cluster, each topology path might consist
of the rack only (e.g. /rack-1). In a deployment consisting of multiple large datacenters, each topology path can be much longer (e.g.
/europe/uk/london/datacenter2/room4/row22/rack5/). MapR uses topology paths to spread out replicated copies of data, placing
each copy on a separate path. By setting each path to correspond to a physical rack, you can ensure that replicated data is distributed across
racks to improve fault tolerance.

After you have defined node topology for the nodes in your cluster, you can use volume topology to place volumes on specific racks, nodes, or
groups of nodes. See Setting Volume Topology.

Setting Node Topology

You can specify a topology path for one or more nodes using the node topo command, or in the MapR Control System using the following
procedure.

To set node topology using the MapR Control System:

1. In the Navigation pane, expand the Cluster group and click the Nodes view.
2. Select the checkbox beside each node whose topology you wish to set.
3. Click the Change Topology button to display the Change Node Topology dialog.
4. Set the path in the New Path field:
To define a new path, type a topology path. Topology paths must begin with a forward slash ('/').
To use a path you have already defined, select it from the dropdown.
5. Click Move Node to set the new topology.

Removing Roles

To remove roles from an existing node:

1. Purge the packages corresponding to the roles using apt-get or yum.


2. Run configure.sh with a list of the CLDB nodes and ZooKeeper nodes in the cluster.
3. If you have removed the CLDB or ZooKeeper role, run configure.sh on all nodes in the cluster.

The warden picks up the new configuration automatically. When it is convenient, restart the warden:

# /etc/init.d/mapr-warden stop
# /etc/init.d/mapr-warden start
Services

Viewing Services on the Cluster


You can view the services running on the cluster using the dashboard info command, or using the MapR Control System. In the MapR Control
System, the running services on the cluster are displayed in the Services pane of the Dashboard.

To view the running services on the cluster using the MapR Control System:

1. Log on to the MapR Control System.


2. In the Navigation pane, expand the Cluster pane and click Dashboard.

Viewing Services on a Node


You can view the services running on a node using the service list command, or using the MapR Control System. In the MapR Control System, the
running services on a node are displayed in the Node Properties View.

To view the running services on a node using the MapR Control System:

1. Log on to the MapR Control System.


2. In the Navigation pane, expand the Cluster pane and click Nodes.
3. Click the hostname of the node you would like to view. The services are displayed in the Manage Node Services pane.

Starting Services
You can start services using the node services command, or using the MapR Control System.

To start specific services on a node using the MapR Control System:

1. Log on to the MapR Control System.


2. In the Navigation pane, expand the Cluster pane and click Nodes.
3. Click the hostname of the node you would like to view. The services are displayed in the Manage Node Services pane.
4. Click the checkbox next to each service you would like to start, and click Start Service.

Stopping Services
You can stop services using the node services command, or using the MapR Control System.

To stop specific services on a node using the MapR Control System:

1. Log on to the MapR Control System.


2. In the Navigation pane, expand the Cluster pane and click Nodes.
3. Click the hostname of the node you would like to view. The services are displayed in the Manage Node Services pane.
4. Click the checkbox next to each service you would like to stop, and click Stop Service.

Adding Services
Services determine which roles a node fulfills. You can view a list of the roles configured for a given node by listing the /opt/mapr/roles
directory on the node. To add roles to a node, you must install the corresponding services.

To add roles to an existing node:

1. Install the packages corresponding to the new roles using apt-get or yum.
2. Run configure.sh with a list of the CLDB nodes and ZooKeeper nodes in the cluster.
3. If you have added the CLDB or ZooKeeper role, run configure.sh on all nodes in the cluster.
4. The warden picks up the new configuration and automatically starts the new services. If you are adding the CLDB role, restart the
warden:

# /etc/init.d/mapr-warden stop
# /etc/init.d/mapr-warden start

If you are not adding the CLDB role, you can wait until a convenient time to restart the warden.
Shutting Down and Restarting a Cluster
To safely shut down and restart an entire cluster, preserving all data and full replication, you must follow a specific sequence that stops writes so
that the cluster does not shut down in the middle of an operation:

1. Shut down the NFS service everywhere it is running.


2. Shut down the CLDB nodes.
3. Shut down all remaining nodes.

This procedure ensures that on restart the data is replicated and synchronized, so that there is no single point of failure for any data.

To shut down the cluster:

1. Change to the root user (or use sudo for the following commands).
2. Before shutting down the cluster, you will need a list of NFS nodes, CLDB nodes, and all remaining nodes. Once the CLDB is shut down,
you cannot retrieve a list of nodes; it is important to obtain this information at the beginning of the process. Use the node list
command as follows:
Determine which nodes are running the NFS gateway. Example:

/opt/mapr/bin/maprcli node list -filter "[rp==/*]and[svc==nfs]" -columns id,h,hn,svc, rp


id service hostname
health ip
6475182753920016590 fileserver,tasktracker,nfs,hoststats
node-252.cluster.us 0 10.10.50.252
8077173244974255917 tasktracker,cldb,fileserver,nfs,hoststats
node-253.cluster.us 0 10.10.50.253
5323478955232132984 webserver,cldb,fileserver,nfs,hoststats,jobtracker
node-254.cluster.us 0 10.10.50.254

Determine which nodes are running the CLDB. Example:

/opt/mapr/bin/maprcli node list -filter "[rp==/*]and[svc==cldb]" -columns id,h,hn,svc, rp

List all non-CLDB nodes. Example:

/opt/mapr/bin/maprcli node list -filter "[rp==/*]and[svc!=cldb]" -columns id,h,hn,svc, rp

3. Shut down all NFS instances. Example:

/opt/mapr/bin/maprcli node services -nfs stop -nodes node-252.cluster.us,node-253.cluster.us,node-254.cluster.us

4. SSH into each CLDB node and stop the warden. Example:

/etc/init.d/mapr-warden stop

5. SSH into each of the remaining nodes and stop the warden. Example:

/etc/init.d/mapr-warden stop

6. If desired, you can shut down the nodes using the Linux halt command.

To start up the cluster:

1. If the cluster nodes are not running, start them.


2. Change to the root user (or use sudo for the following commands).
3. Start the ZooKeeper on nodes where it is installed. Example:

/etc/init.d/mapr-zookeeper start

4. On all nodes, start the warden. Example:

/etc/init.d/mapr-warden start

5. Over a period of time (depending on the cluster size and other factors) the cluster comes up automatically.

Users and Groups


MapR uses each node's native operating system configuration to authenticate users and groups for access to the cluster. If you are deploying a
large cluster, you should consider configuring all nodes to use LDAP or another user management system. You can use the MapR Control
System to give specific permissions to particular users and groups. For more information, see Managing Permissions. Each user can be restricted
to a specific amount of disk usage. For more information, see Managing Quotas.

All nodes in the cluster must have the same set of users and groups, with the same uid and gid numbers on all nodes:

When adding a user to a cluster node, specify the --uid option with the useradd command to guarantee that the user has the same
uid on all machines.
When adding a group to a cluster node, specify the --gid option with the groupadd command to guarantee that the group has the
same gid on all machines.
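For example, to create a matching group and user on a node (the names and numeric IDs below are illustrative; run the same commands, with the same IDs, on every node in the cluster):

groupadd --gid 5000 mapr-users
useradd --uid 5000 --gid 5000 jsmith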

Choose a specific user to be the administrative user for the cluster. By default, MapR gives the user root full administrative permissions. If the
nodes do not have an explicit root login (as is sometimes the case with Ubuntu, for example), you can give full permissions to the chosen
administrative user after deployment. See Cluster Configuration.

On the node where you plan to run the mapr-webserver (the MapR Control System), install Pluggable Authentication Modules (PAM). See PAM
Configuration.

To create a volume for a user or group:

1. In the Volumes view, click New Volume.


2. In the New Volume dialog, set the volume attributes:
In Volume Setup, type a volume name. Make sure the Volume Type is set to Normal Volume.
In Ownership & Permissions, set the volume owner and specify the users and groups who can perform actions on the volume.
In Usage Tracking, set the accountable group or user, and set a quota or advisory quota if needed.
In Replication & Snapshot Scheduling, set the replication factor and choose a snapshot schedule.
3. Click OK to save the settings.

See Volumes for more information. You can also create a volume using the volume create command.

You can see users and groups that own volumes in the User Disk Usage view or using the entity list command.

Managing Permissions
MapR manages permissions using two mechanisms:

Cluster and volume permissions use access control lists (ACLs), which specify actions particular users are allowed to perform on a
certain cluster or volume
MapR-FS permissions control access to directories and files in a manner similar to Linux file permissions. To manage permissions, you
must have fc permissions.

Cluster and Volume Permissions


Cluster and volume permissions use ACLs, which you can edit using the MapR Control System or the acl commands.

Cluster Permissions

The following table lists the actions a user can perform on a cluster, and the corresponding codes used in the cluster ACL.

login - Log in to the MapR Control System, use the API and command-line interface, read access on cluster and volumes (includes cv)

ss - Start/stop services

cv - Create volumes

a - Admin access (includes all permissions except fc)

fc - Full control: administrative access and permission to change the cluster ACL (includes a)

Setting Cluster Permissions

You can modify cluster permissions using the acl edit and acl set commands, or using the MapR Control System.
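As a hedged sketch of the command-line route (verify the parameter names against the acl edit documentation before relying on them), granting a user full control over the cluster might look like this, where the user name is illustrative:

/opt/mapr/bin/maprcli acl edit -type cluster -user jsmith:fc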

To add cluster permissions using the MapR Control System:

1. Expand the System Settings group and click Permissions to display the Edit Permissions dialog.
2. Click [ + Add Permission ] to add a new row. Each row lets you assign permissions to a single user or group.
3. Type the name of the user or group in the empty text field:
If you are adding permissions for a user, type u:<user>, replacing <user> with the username.
If you are adding permissions for a group, type g:<group>, replacing <group> with the group name.
4. Click the Open Arrow ( ) to expand the Permissions dropdown.
5. Select the permissions you wish to grant to the user or group.
6. Click OK to save the changes.

To remove cluster permissions using the MapR Control System:

1. Expand the System Settings group and click Permissions to display the Edit Permissions dialog.
2. Remove the desired permissions:
To remove all permissions for a user or group, click the delete button ( ) next to the corresponding row.
To change the permissions for a user or group, click the Open Arrow ( ) to expand the Permissions dropdown, then unselect the
permissions you wish to revoke.
3. Click OK to save the changes.

Volume Permissions

The following table lists the actions a user can perform on a volume, and the corresponding codes used in the volume ACL.

Code Allowed Action

dump Dump the volume

restore Mirror or restore the volume

m Modify volume properties, create and delete snapshots

d Delete a volume

fc Full control (admin access and permission to change volume ACL)

To mount or unmount volumes under a directory, the user must have read/write permissions on the directory (see MapR-FS Permissions).

You can set volume permissions using the acl edit and acl set commands, or using the MapR Control System.

To add volume permissions using the MapR Control System:

1. Expand the MapR-FS group and click Volumes.


To create a new volume and set permissions, click New Volume to display the New Volume dialog.
To edit permissions on an existing volume, click the volume name to display the Volume Properties dialog.
2. In the Permissions section, click [ + Add Permission ] to add a new row. Each row lets you assign permissions to a single user or
group.
3. Type the name of the user or group in the empty text field:
If you are adding permissions for a user, type u:<user>, replacing <user> with the username.
If you are adding permissions for a group, type g:<group>, replacing <group> with the group name.
4. Click the Open Arrow ( ) to expand the Permissions dropdown.
5. Select the permissions you wish to grant to the user or group.
6. Click OK to save the changes.

To remove volume permissions using the MapR Control System:

1. Expand the MapR-FS group and click Volumes.


2. Click the volume name to display the Volume Properties dialog.
3. Remove the desired permissions:
To remove all permissions for a user or group, click the delete button ( ) next to the corresponding row.
To change the permissions for a user or group, click the Open Arrow ( ) to expand the Permissions dropdown, then unselect the
permissions you wish to revoke.
4. Click OK to save the changes.

MapR-FS Permissions
MapR-FS permissions are similar to the POSIX permissions model. Each file and directory is associated with a user (the owner) and a group. You
can set read, write, and execute permissions separately for:

The owner of the file or directory


Members of the group associated with the file or directory
All other users.

The permissions for a file or directory are called its mode. The mode of a file or directory can be expressed in two ways:

Text - a string that indicates the presence of the read (r), write (w), and execute (x) permission or their absence (-) for the owner, group,
and other users respectively. Example:
rwxr-xr-x
Octal - three octal digits (for the owner, group, and other users), that use individual bits to represent the three permissions. Example:
755

Both rwxr-xr-x and 755 represent the same mode: the owner has all permissions, and the group and other users have read and execute
permissions only.

Text Modes

String modes are constructed from the characters in the following table.

Text Description

u The file's owner.

g The group associated with the file or directory.

o Other users (users that are not the owner, and not in the group).

a All (owner, group and others).

= Assigns the permissions. Example: "a=rw" sets read and write permissions and disables execution for all.

- Removes a specific permission. Example: "a-x" revokes execution permission from all users without changing read and write
permissions.

+ Adds a specific permission. Example: "a+x" grants execution permission to all users without changing read and write permissions.

r Read permission

w Write permission

x Execute permission

Octal Modes

To construct each octal digit, add together the values for the permissions you wish to grant:

Read: 4
Write: 2
Execute: 1
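For example, working out the mode 755: the owner digit is 4 + 2 + 1 = 7 (read, write, and execute), while the group and other digits are each 4 + 1 = 5 (read and execute), which matches the text mode rwxr-xr-x shown above.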

Syntax
You can change the modes of directories and files in the MapR storage using either the hadoop fs command with the -chmod option, or using
the chmod command via NFS. The syntax for both commands is similar:

hadoop fs -chmod [-R] <MODE>[,<MODE>]... | <OCTALMODE> <URI> [<URI> ...]

chmod [-R] <MODE>[,<MODE>]... | <OCTALMODE> <URI> [<URI> ...]

Parameters and Options

Parameter/Option Description

-R If specified, this option applies the new mode recursively throughout the directory structure.

MODE A string that specifies a mode.

OCTALMODE A three-digit octal number that specifies the new mode for the file or directory.

URI A relative or absolute path to the file or directory for which to change the mode.

Examples

The following examples are all equivalent:

chmod 755 script.sh

chmod u=rwx,g=rx,o=rx script.sh

chmod u=rwx,go=rx script.sh
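The same modes can be set through the hadoop fs interface; for example (the paths are illustrative):

hadoop fs -chmod 755 /user/jsmith/script.sh

hadoop fs -chmod -R u=rwx,go=rx /user/jsmith/scripts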

Managing Quotas
Quotas limit the disk space used by a volume or an entity (user or group) on an M5-licensed cluster, by specifying the amount of disk space the
volume or entity is allowed to use:

A volume quota limits the space used by a volume.


A user/group quota limits the space used by all volumes owned by a user or group.

Quotas are expressed as an integer value plus a single letter to represent the unit:

B - bytes
K - kilobytes
M - megabytes
G - gigabytes
T - terabytes
P - petabytes

Example: 500G specifies a 500 gigabyte quota.

If a volume or entity exceeds its quota, further disk writes are prevented and a corresponding alarm is raised:

AE_ALARM_AEQUOTA_EXCEEDED - an entity exceeded its quota


VOLUME_ALARM_QUOTA_EXCEEDED - a volume exceeded its quota

A quota that prevents writes above a certain threshold is also called a hard quota. In addition to the hard quota, you can also set an advisory
quota for a user, group, or volume. An advisory quota does not enforce disk usage limits, but raises an alarm when it is exceeded:

AE_ALARM_AEADVISORY_QUOTA_EXCEEDED - an entity exceeded its advisory quota


VOLUME_ALARM_ADVISORY_QUOTA_EXCEEDED - a volume exceeded its advisory quota

In most cases, it is useful to set the advisory quota somewhat lower than the hard quota, to give advance warning that disk usage is approaching
the allowed limit.

To manage quotas, you must have a or fc permissions.

Quota Defaults

You can set hard quota and advisory quota defaults for users and groups. When a user or group is created, the default quota and advisory quota
apply unless overridden by specific quotas.

Setting Volume Quotas and Advisory Quotas


You can set a volume quota using the volume modify command, or use the following procedure to set a volume quota using the MapR Control
System.
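As a hedged sketch of the command-line route (check the parameter names against the volume modify documentation), setting the quotas used as examples in the procedure below might look like this, with an illustrative volume name:

/opt/mapr/bin/maprcli volume modify -name projects -quota 500G -advisoryquota 250G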

To set a volume quota using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Display the Volume Properties dialog by clicking the volume name, or by selecting the checkbox beside the volume name then clicking
the Properties button.
3. In the Usage Tracking section, select the Volume Quota checkbox and type a quota (value and unit) in the field. Example: 500G
4. To set the advisory quota, select the Volume Advisory Quota checkbox and type a quota (value and unit) in the field. Example: 250G
5. After setting the quota, click Modify Volume to save your changes to the volume.

Setting User/Group Quotas and Advisory Quotas


You can set a user/group quota using the entity modify command, or use the following procedure to set a user/group quota using the MapR
Control System.

To set a user or group quota using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the User Disk Usage view.
2. Select the checkbox beside the user or group name for which you wish to set a quota, then click the Edit Properties button to display the
User Properties dialog.
3. In the Usage Tracking section, select the User/Group Quota checkbox and type a quota (value and unit) in the field. Example: 500G
4. To set the advisory quota, select the User/Group Advisory Quota checkbox and type a quota (value and unit) in the field. Example:
250G
5. After setting the quota, click OK to save your changes to the entity.

Setting Quota Defaults


You can set an entity quota using the entity modify command, or use the following procedure to set an entity quota using the MapR Control
System.

To set quota defaults using the MapR Control System:

1. In the Navigation pane, expand the System Settings group.


2. Click the Quota Defaults view to display the Configure Quota Defaults dialog.
3. To set the user quota default, select the Default User Total Quota checkbox in the User Quota Defaults section, then type a quota
(value and unit) in the field.
4. To set the user advisory quota default, select the Default User Advisory Quota checkbox in the User Quota Defaults section, then type
a quota (value and unit) in the field.
5. To set the group quota default, select the Default Group Total Quota checkbox in the Group Quota Defaults section, then type a quota
(value and unit) in the field.
6. To set the group advisory quota default, select the Default Group Advisory Quota checkbox in the Group Quota Defaults section, then
type a quota (value and unit) in the field.
7. After setting the quota defaults, click Save to save your changes.

Troubleshooting
This section provides information about troubleshooting cluster problems:

Disaster Recovery
Out of Memory Troubleshooting
Troubleshooting Alarms

Disaster Recovery
It is a good idea to set up an automatic backup of the CLDB volume at regular intervals; in the event that all CLDB nodes fail, you can restore the
CLDB from a backup. If you have more than one MapR cluster, you can back up the CLDB volume for each cluster onto the other clusters;
otherwise, you can save the CLDB locally to external media such as a USB drive.

To back up a CLDB volume from a remote cluster:

1. Set up a cron job on the remote cluster to save the container information to a file by running the following command:
/opt/mapr/bin/maprcli dump cldbnodes -zkconnect <IP:port of ZooKeeper leader> > <path to file>
2. Set up a cron job to copy the container information file to a volume on the local cluster.
3. Create a mirror volume on the local cluster, choosing the volume mapr.cldb.internal from the remote cluster as the source volume.
Set the mirror sync schedule so that it will run at the same time as the cron job.

To back up a CLDB volume locally:


1. Set up a cron job to save the container information to a file on external media by running the following command:
/opt/mapr/bin/maprcli dump cldbnodes -zkconnect <IP:port of ZooKeeper leader> > <path to file>
2. Set up a cron job to create a dump file of the local volume mapr.cldb.internal on external media. Example:
/opt/mapr/bin/maprcli volume dump create -name mapr.cldb.internal -dumpfile <path_to_file>
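The cron jobs in the procedures above can be ordinary crontab entries. For example, to dump the container information nightly at 2:30 a.m. (the ZooKeeper address and output path are illustrative):

30 2 * * * /opt/mapr/bin/maprcli dump cldbnodes -zkconnect 10.10.50.253:5181 > /backup/cldbnodes.dump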

For information about restoring from a backup of the CLDB, contact MapR Support.

Out of Memory Troubleshooting


When the aggregated memory used by MapReduce tasks exceeds the memory reserve on a TaskTracker node, tasks can fail or be killed. MapR
attempts to prevent out-of-memory exceptions by killing MapReduce tasks when memory becomes scarce. If you allocate too little Java heap for
the expected memory requirements of your tasks, an exception can occur. The following steps can help configure MapR to avoid these problems:

If a particular job encounters out-of-memory conditions, the simplest way to solve the problem might be to reduce the memory footprint of
the map and reduce functions, and to ensure that the partitioner distributes map output to reducers evenly.

If it is not possible to reduce the memory footprint of the application, try increasing the Java heap size (-Xmx) in the client-side
MapReduce configuration.
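For example, the task heap can be raised from the client side through the mapred.child.java.opts property, either per job or in the client's mapred-site.xml; the 2 GB value below is illustrative:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>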

If many jobs encounter out-of-memory conditions, or if jobs tend to fail on specific nodes, it may be that those nodes are advertising too
many TaskTracker slots. In this case, the cluster administrator should reduce the number of slots on the affected nodes.

To reduce the number of slots on a node:

1. Stop the TaskTracker service on the node:

$ sudo maprcli node services -nodes <node name> -tasktracker stop

2. Edit the file /opt/mapr/hadoop/hadoop-<version>/conf/mapred-site.xml (see the example fragment after this procedure):
Reduce the number of map slots by lowering mapred.tasktracker.map.tasks.maximum
Reduce the number of reduce slots by lowering mapred.tasktracker.reduce.tasks.maximum
3. Start the TaskTracker on the node:

$ sudo maprcli node services -nodes <node name> -tasktracker start
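For step 2 above, the relevant mapred-site.xml properties look like the following; the values are illustrative and should be chosen to match the node's memory and workload:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>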

Troubleshooting Alarms

User/Group Alarms
User/group alarms indicate problems with user or group quotas. The following tables describe the MapR user/group alarms.

Entity Advisory Quota Alarm

UI Column User Advisory Quota Alarm

Logged As AE_ALARM_AEADVISORY_QUOTA_EXCEEDED

Meaning A user or group has exceeded its advisory quota. See Managing Quotas for more information about user/group quotas.

Resolution No immediate action is required. To avoid exceeding the hard quota, clear space on volumes created by the user or group, or
stop further data writes to those volumes.

Entity Quota Alarm

UI Column User Quota Alarm

Logged As AE_ALARM_AEQUOTA_EXCEEDED

Meaning A user or group has exceeded its quota. Further writes by the user or group will fail. See Managing Quotas for more information
about user/group quotas.

Resolution Free some space on the volumes created by the user or group, or increase the user or group quota.
Cluster Alarms
Cluster alarms indicate problems that affect the cluster as a whole. The following tables describe the MapR cluster alarms.

Blacklist Alarm

UI Column Blacklist Alarm

Logged As CLUSTER_ALARM_BLACKLIST_TTS

Meaning The JobTracker has blacklisted a TaskTracker node because tasks on the node have failed too many times.

Resolution To determine which node or nodes have been blacklisted, see the JobTracker status page (click JobTracker in the Navigation
Pane). The JobTracker status page provides links to the TaskTracker log for each node; look at the log for the blacklisted node or
nodes to determine why tasks are failing on the node.

License Near Expiration

UI Column License Near Expiration Alarm

Logged As CLUSTER_ALARM_LICENSE_NEAR_EXPIRATION

Meaning The M5 license associated with the cluster is within 30 days of expiration.

Resolution Renew the M5 license.

License Expired

UI Column License Expiration Alarm

Logged As CLUSTER_ALARM_LICENSE_EXPIRED

Meaning The M5 license associated with the cluster has expired. M5 features have been disabled.

Resolution Renew the M5 license.

Cluster Almost Full

UI Column Cluster Almost Full

Logged As CLUSTER_ALARM_CLUSTER_ALMOST_FULL

Meaning The cluster storage is almost full. The percentage of storage used before this alarm is triggered is 90% by default, and is
controlled by the configuration parameter cldb.cluster.almost.full.percentage.

Resolution Reduce the amount of data stored in the cluster. If the cluster storage is less than 90% full, check the
cldb.cluster.almost.full.percentage parameter via the config load command, and adjust it if necessary via the
config save command.

Cluster Full

UI Column Cluster Full

Logged As CLUSTER_ALARM_CLUSTER_FULL

Meaning The cluster storage is full. MapReduce operations have been halted.

Resolution Free up some space on the cluster.

Maximum Licensed Nodes Exceeded alarm

UI Column Licensed Nodes Exceeded Alarm

Logged As CLUSTER_ALARM_LICENSE_MAXNODES_EXCEEDED
Meaning The cluster has exceeded the number of nodes specified in the license.

Resolution Remove some nodes, or upgrade the license to accommodate the added nodes.

Upgrade in Progress

UI Column Software Installation & Upgrades

Logged As CLUSTER_ALARM_UPGRADE_IN_PROGRESS

Meaning A rolling upgrade of the cluster is in progress.

Resolution No action is required. Performance may be affected during the upgrade, but the cluster should still function normally. After the
upgrade is complete, the alarm is cleared.

VIP Assignment Failure

UI Column VIP Assignment Alarm

Logged As CLUSTER_ALARM_UNASSIGNED_VIRTUAL_IPS

Meaning MapR was unable to assign a VIP to any NFS servers.

Resolution Check the VIP configuration, and make sure at least one of the NFS servers in the VIP pool is up and running. See Configuring
NFS for HA. Also check the log file /opt/mapr/logs/nfsmon.log for additional information.

Node Alarms
Node alarms indicate problems in individual nodes. The following tables describe the MapR node alarms.

CLDB Service Alarm

UI Column CLDB Alarm

Logged As NODE_ALARM_SERVICE_CLDB_DOWN

Meaning The CLDB service on the node has stopped running.

Resolution Go to the Manage Services pane of the Node Properties View to check whether the CLDB service is running. The warden will try
three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to
restart the service. The interval can be configured using the parameter services.retryinterval.time.sec in warden.conf.
If the warden successfully restarts the CLDB service, the alarm is cleared. If the warden is unable to restart the CLDB service, it
may be necessary to contact technical support.

Core Present Alarm

UI Column Core files present

Logged As NODE_ALARM_CORE_PRESENT

Meaning A service on the node has crashed and created a core dump file. When all core files are removed, the alarm is cleared.

Resolution Contact technical support.

Debug Logging Active

UI Column Excess Logs Alarm

Logged As NODE_ALARM_DEBUG_LOGGING

Meaning Debug logging is enabled on the node.


Resolution Debug logging generates enormous amounts of data, and can fill up disk space. If debug logging is not absolutely necessary, turn
it off: either use the Manage Services pane in the Node Properties view or the setloglevel command. If it is absolutely necessary,
make sure that the logs in /opt/mapr/logs are not in danger of filling the entire disk.

Disk Failure

UI Column Disk Failure Alarm

Logged As NODE_ALARM_DISK_FAILURE

Meaning A disk has failed on the node.

Resolution Check the disk health log (/opt/mapr/logs/faileddisk.log) to determine which disk failed and view any SMART data provided by the
disk.

FileServer Service Alarm

UI Column FileServer Alarm

Logged As NODE_ALARM_SERVICE_FILESERVER_DOWN

Meaning The FileServer service on the node has stopped running.

Resolution Go to the Manage Services pane of the Node Properties View to check whether the FileServer service is running. The warden will
try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to
restart the service. The interval can be configured using the parameter services.retryinterval.time.sec in warden.conf.
If the warden successfully restarts the FileServer service, the alarm is cleared. If the warden is unable to restart the FileServer
service, it may be necessary to contact technical support.

HBMaster Service Alarm

UI Column HBase Master Alarm

Logged As NODE_ALARM_SERVICE_HBMASTER_DOWN

Meaning The HBMaster service on the node has stopped running.

Resolution Go to the Manage Services pane of the Node Properties View to check whether the HBMaster service is running. The warden will
try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to
restart the service. The interval can be configured using the parameter services.retryinterval.time.sec in warden.conf.
If the warden successfully restarts the HBMaster service, the alarm is cleared. If the warden is unable to restart the HBMaster
service, it may be necessary to contact technical support.

HBRegion Service Alarm

UI Column HBase RegionServer Alarm

Logged As NODE_ALARM_SERVICE_HBREGION_DOWN

Meaning The HBRegion service on the node has stopped running.

Resolution Go to the Manage Services pane of the Node Properties View to check whether the HBRegion service is running. The warden will
try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to
restart the service. The interval can be configured using the parameter services.retryinterval.time.sec in warden.conf.
If the warden successfully restarts the HBRegion service, the alarm is cleared. If the warden is unable to restart the HBRegion
service, it may be necessary to contact technical support.

Hoststats Alarm

UI Column Hoststats process down

Logged As NODE_ALARM_HOSTSTATS_DOWN
Meaning The Hoststats service on the node has stopped running.

Resolution Go to the Manage Services pane of the Node Properties View to check whether the Hoststats service is running. The warden will
try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to
restart the service. The interval can be configured using the parameter services.retryinterval.time.sec in warden.conf.
If the warden successfully restarts the service, the alarm is cleared. If the warden is unable to restart the service, it may be
necessary to contact technical support.

Installation Directory Full Alarm

UI Column Installation Directory full

Logged As NODE_ALARM_OPT_MAPR_FULL

Meaning The partition /opt/mapr on the node is running out of space (95% full).

Resolution Free up some space in /opt/mapr on the node.

JobTracker Service Alarm

UI Column JobTracker Alarm

Logged As NODE_ALARM_SERVICE_JT_DOWN

Meaning The JobTracker service on the node has stopped running.

Resolution Go to the Manage Services pane of the Node Properties View to check whether the JobTracker service is running. The warden
will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three
times to restart the service. The interval can be configured using the parameter services.retryinterval.time.sec in
warden.conf. If the warden successfully restarts the JobTracker service, the alarm is cleared. If the warden is unable to restart the
JobTracker service, it may be necessary to contact technical support.

MapR-FS High Memory Alarm

UI Column High FileServer Memory Alarm

Logged As NODE_ALARM_HIGH_MFS_MEMORY

Meaning Memory consumed by the fileserver service on the node is high.

Resolution Log on as root to the node for which the alarm is raised, and restart the Warden:
/etc/init.d/mapr-warden restart

NFS Service Alarm

UI Column NFS Alarm

Logged As NODE_ALARM_SERVICE_NFS_DOWN

Meaning The NFS service on the node has stopped running.

Resolution Go to the Manage Services pane of the Node Properties View to check whether the NFS service is running. The warden will try
three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to
restart the service. The interval can be configured using the parameter services.retryinterval.time.sec in warden.conf.
If the warden successfully restarts the NFS service, the alarm is cleared. If the warden is unable to restart the NFS service, it may
be necessary to contact technical support.

PAM Misconfigured Alarm

UI Column PAM Alarm

Logged As NODE_ALARM_PAM_MISCONFIGURED

Meaning The PAM authentication on the node is configured incorrectly.

Resolution See PAM Configuration.


Root Partition Full Alarm

UI Column Root partition full

Logged As NODE_ALARM_ROOT_PARTITION_FULL

Meaning The root partition ('/') on the node is running out of space (99% full).

Resolution Free up some space in the root partition of the node.

TaskTracker Service Alarm

UI Column TaskTracker Alarm

Logged As NODE_ALARM_SERVICE_TT_DOWN

Meaning The TaskTracker service on the node has stopped running.

Resolution Go to the Manage Services pane of the Node Properties View to check whether the TaskTracker service is running. The warden
will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three
times to restart the service. The interval can be configured using the parameter services.retryinterval.time.sec in
warden.conf. If the warden successfully restarts the TaskTracker service, the alarm is cleared. If the warden is unable to restart
the TaskTracker service, it may be necessary to contact technical support.

TaskTracker Local Directory Full Alarm

UI Column TaskTracker Local Directory Full Alarm

Logged As NODE_ALARM_TT_LOCALDIR_FULL

Meaning The local directory used by the TaskTracker on the specified node(s) is full, and the TaskTracker cannot operate as a result.

Resolution Delete or move data from the local disks, or add storage to the specified node(s), and try the jobs again.

Time Skew Alarm

UI Column Time Skew Alarm

Logged As NODE_ALARM_TIME_SKEW

Meaning The clock on the node is out of sync with the master CLDB by more than 20 seconds.

Resolution Use NTP to synchronize the time on all the nodes in the cluster.
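For example, on a Red Hat or CentOS node you might check the clock and start the NTP daemon as follows (a sketch; package and service names vary by distribution, and pool.ntp.org stands in for your site's NTP server):

# Check the current offset against an NTP server (query only, does not set the clock)
ntpdate -q pool.ntp.org
# Install and start the NTP daemon so the clock stays in sync
sudo yum install ntp
sudo service ntpd start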

Version Alarm

UI Column Version Alarm

Logged As NODE_ALARM_VERSION_MISMATCH

Meaning One or more services on the node are running an unexpected version.

Resolution Stop the node, restore the correct version of any services you have modified, and restart the node. See Managing Nodes.

WebServer Service Alarm

UI Column WebServer Alarm

Logged As NODE_ALARM_SERVICE_WEBSERVER_DOWN

Meaning The WebServer service on the node has stopped running.


Resolution Go to the Manage Services pane of the Node Properties View to check whether the WebServer service is running. The warden
will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three
times to restart the service. The interval can be configured using the parameter services.retryinterval.time.sec in
warden.conf. If the warden successfully restarts the WebServer service, the alarm is cleared. If the warden is unable to restart the
WebServer service, it may be necessary to contact technical support.

Volume Alarms
Volume alarms indicate problems in individual volumes. The following tables describe the MapR volume alarms.

Data Unavailable

UI Column Data Alarm

Logged As VOLUME_ALARM_DATA_UNAVAILABLE

Meaning This is a potentially very serious alarm that may indicate data loss. Some of the data on the volume cannot be located. This alarm
indicates that enough nodes have failed to bring the replication factor of part or all of the volume to zero. For example, if the
volume is stored on a single node and has a replication factor of one, the Data Unavailable alarm is raised if that node fails
or is taken out of service unexpectedly. If a volume is replicated properly (and therefore is stored on multiple nodes), the Data
Unavailable alarm can indicate that a significant number of nodes are down.

Resolution Investigate any nodes that have failed or are out of service.

You can see which nodes have failed by looking at the Cluster Node Heatmap pane of the Dashboard.
Check the cluster(s) for any snapshots or mirrors that can be used to re-create the volume. You can see snapshots and
mirrors in the MapR-FS view.

Data Under-Replicated

UI Column Replication Alarm

Logged As VOLUME_ALARM_DATA_UNDER_REPLICATED

Meaning The volume replication factor is lower than the minimum replication factor set in Volume Properties. This can be caused by failing
disks or nodes, or the cluster may be running out of storage space.

Resolution Investigate any nodes that are failing. You can see which nodes have failed by looking at the Cluster Node Heatmap pane of the
Dashboard. Determine whether it is necessary to add disks or nodes to the cluster. This alarm is generally raised when the nodes
that store the volumes or replicas have not sent a heartbeat for five minutes. To prevent re-replication during normal maintenance
procedures, MapR waits a specified interval (by default, one hour) before considering the node dead and re-replicating its data.
You can control this interval by setting the cldb.fs.mark.rereplicate.sec parameter using the config save command.
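For example, the following command (using the config save syntax shown later in this guide) sets the interval back to the one-hour default; the value is in seconds:

maprcli config save -values '{"cldb.fs.mark.rereplicate.sec":"3600"}'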

Mirror Failure

UI Column Mirror Alarm

Logged As VOLUME_ALARM_MIRROR_FAILURE

Meaning A mirror operation failed.

Resolution Make sure the CLDB is running on both the source cluster and the destination cluster. Look at the CLDB log
(/opt/mapr/logs/cldb.log) and the MapR-FS log (/opt/mapr/logs/mfs.log) on both clusters for more information. If the attempted
mirror operation was between two clusters, make sure that both clusters are reachable over the network. Make sure the source
volume is available and reachable from the cluster that is performing the mirror operation.

No Nodes in Topology

UI Column No Nodes in Vol Topo

Logged As VOLUME_ALARM_NO_NODES_IN_TOPOLOGY

Meaning The path specified in the volume's topology no longer corresponds to a physical topology that contains any nodes, either due to
node failures or changes to node topology settings. While this alarm is raised, MapR places data for the volume on nodes outside
the volume's topology to prevent write failures.

Resolution Add nodes to the specified volume topology, either by moving existing nodes or adding nodes to the cluster. See Node Topology.

Snapshot Failure

UI Column Snapshot Alarm

Logged As VOLUME_ALARM_SNAPSHOT_FAILURE

Meaning A snapshot operation failed.

Resolution Make sure the CLDB is running. Look at the CLDB log (/opt/mapr/logs/cldb.log) and the MapR-FS log (/opt/mapr/logs/mfs.log) on
both clusters for more information. If the attempted snapshot was a scheduled snapshot that was running in the background, try a
manual snapshot.

Topology Almost Full

UI Column Vol Topo Almost Full

Logged As VOLUME_ALARM_TOPOLOGY_ALMOST_FULL

Meaning The nodes in the specified topology are running out of storage space.

Resolution Move volumes to another topology, enlarge the specified topology by adding more nodes, or add disks to the nodes in the
specified topology.

Topology Full Alarm

UI Column Vol Topo Full

Logged As VOLUME_ALARM_TOPOLOGY_FULL

Meaning The nodes in the specified topology have run out of storage space.

Resolution Move volumes to another topology, enlarge the specified topology by adding more nodes, or add disks to the nodes in the
specified topology.

Volume Advisory Quota Alarm

UI Column Vol Advisory Quota Alarm

Logged As VOLUME_ALARM_ADVISORY_QUOTA_EXCEEDED

Meaning A volume has exceeded its advisory quota.

Resolution No immediate action is required. To avoid exceeding the hard quota, clear space on the volume or stop further data writes.

Volume Quota Alarm

UI Column Vol Quota Alarm

Logged As VOLUME_ALARM_QUOTA_EXCEEDED

Meaning A volume has exceeded its quota. Further writes to the volume will fail.

Resolution Free some space on the volume or increase the volume hard quota.
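For example, you might raise the hard quota with volume modify; the volume name and size below are placeholders, and the size-suffix format shown here is an assumption:

# Raise the hard quota on a volume (example name and size)
maprcli volume modify -name myvolume -quota 20G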

Direct Access NFS™


Unlike other Hadoop distributions, which only allow cluster data import or export as a batch operation, MapR lets you mount the cluster itself via
NFS so that your applications can read and write data directly. MapR allows direct file modification and multiple concurrent reads and writes via
POSIX semantics. With an NFS-mounted cluster, you can read and write data directly with standard tools, applications, and scripts. For example,
you could run a MapReduce job that outputs to a CSV file, then import the CSV file directly into SQL via NFS.

MapR exports each cluster as the directory /mapr/<cluster name> (for example, /mapr/default). If you create a mount point with the local
path /mapr, then Hadoop FS paths and NFS paths to the cluster will be the same. This makes it easy to work on the same files via NFS and
Hadoop. In a multi-cluster setting, the clusters share a single namespace, and you can see them all by mounting the top-level /mapr directory.
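As a quick illustration (my.cluster.com is an example cluster name), the same directory can be reached either way once the cluster is mounted at /mapr:

# Via NFS, using standard Linux tools
ls /mapr/my.cluster.com/user
# Via Hadoop, using the same path relative to the cluster root
hadoop fs -ls /user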

Mounting the Cluster


Before you begin, make sure you know the hostname and directory of the NFS share you plan to mount.
Example:

usa-node01:/mapr - for mounting from the command line


nfs://usa-node01/mapr - for mounting from the Mac Finder

Make sure the client machine has the appropriate username and password to access the NFS share. For best results, the username and
password for accessing the MapR cluster should be the same username and password used to log into the client machine.

Linux
1. Make sure the NFS client is installed. Examples:
sudo yum install nfs-utils (Red Hat or CentOS)
sudo apt-get install nfs-common (Ubuntu)
2. List the NFS shares exported on the server. Example:
showmount -e usa-node01
3. Set up a mount point for an NFS share. Example:
sudo mkdir /mapr
4. Mount the cluster via NFS. Example:
sudo mount usa-node01:/mapr /mapr

You can also add an NFS mount to /etc/fstab so that it mounts automatically when your system starts up. Example:

# device            mountpoint   fs-type   options   dump   fsckorder
...
usa-node01:/mapr    /mapr        nfs       rw        0      0
...

Mac

To mount the cluster from the Finder:

1. Open the Disk Utility: go to Applications > Utilities > Disk Utility.
2. Select File > NFS Mounts.
3. Click the + at the bottom of the NFS Mounts window.
4. In the dialog that appears, enter the following information:
Remote NFS URL: The URL for the NFS mount. If you do not know the URL, use the showmount command described below.
Example: nfs://usa-node01/mapr
Mount location: The mount point where the NFS mount should appear in the local filesystem.
5. Click the triangle next to Advanced Mount Parameters.
6. Enter nolocks in the text field.
7. Click Verify.
8. Important: On the dialog that appears, click Don't Verify to skip the verification process.

The MapR cluster should now appear at the location you specified as the mount point.

To mount the cluster from the command line:

1. List the NFS shares exported on the server. Example:


showmount -e usa-node01
2. Set up a mount point for an NFS share. Example:
sudo mkdir /mapr
3. Mount the cluster via NFS. Example:
sudo mount -o nolock usa-node01:/mapr /mapr

Windows

Because of Windows directory caching, there may appear to be no .snapshot directory in each volume's root directory. To work
around the problem, force Windows to re-load the volume's root directory by updating its modification time (for example, by
creating an empty file or directory in the volume's root directory).

To mount the cluster on Windows 7 Ultimate or Windows 7 Enterprise:


1. Open Start > Control Panel > Programs.
2. Select Turn Windows features on or off.
3. Select Services for NFS.
4. Click OK.
5. Mount the cluster and map it to a drive using the Map Network Drive tool or from the command line. Example:
mount -o nolock usa-node01:/mapr z:

To mount the cluster on other Windows versions:

1. Download and install Microsoft Windows Services for Unix (SFU). You only need to install the NFS Client and the User Name Mapping.
2. Configure the user authentication in SFU to match the authentication used by the cluster (LDAP or operating system users). You can
map local Windows users to cluster Linux users, if desired.
3. Once SFU is installed and configured, mount the cluster and map it to a drive using the Map Network Drive tool or from the command
line. Example:
mount -o nolock usa-node01:/mapr z:

To map a network drive with the Map Network Drive tool:

1. Open Start > My Computer.


2. Select Tools > Map Network Drive.
3. In the Map Network Drive window, choose an unused drive letter from the Drive drop-down list.
4. Specify the Folder by browsing for the MapR cluster, or by typing the hostname and directory into the text field.
5. Browse for the MapR cluster or type the name of the folder to map. This name must follow the UNC format. Alternatively, click the Browse… button
to find the correct folder by browsing available network shares.
6. Select Reconnect at login to reconnect automatically to the MapR cluster whenever you log into the computer.
7. Click Finish.

Setting Compression and Chunk Size


Each directory in MapR storage contains a hidden file called .dfs_attributes that controls compression and chunk size. To change these
attributes, change the corresponding values in the file.

Valid values:

Compression: true or false


Chunk size (in bytes): a multiple of 65536 (64 KB) or zero (no chunks). Example: 131072

You can also set compression and chunksize using the hadoop mfs command.
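For example (the directory is a placeholder; -setcompression and -ls appear later in this guide, while the -setchunksize option name is an assumption):

# Turn compression off for a directory
hadoop mfs -setcompression off /user/myapp/data
# Set the chunk size to 256 MB (a multiple of 65536); option name assumed
hadoop mfs -setchunksize 268435456 /user/myapp/data
# List the directory to confirm the new attributes
hadoop mfs -ls /user/myapp/data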

By default, MapR does not compress files whose filename extensions indicate they are already compressed. The default list of filename extensions
is as follows:

bz2
gz
tgz
tbz2
zip
z
Z
mp3
jpg
jpeg
mpg
mpeg
avi
gif
png

The list of filename extensions not to compress is stored as comma-separated values in the mapr.fs.nocompression configuration parameter,
and can be modified with the config save command. Example:

maprcli config save -values '{"mapr.fs.nocompression":
"bz2,gz,tgz,tbz2,zip,z,Z,mp3,jpg,jpeg,mpg,mpeg,avi,gif,png"}'

The list can be viewed with the config load command. Example:

maprcli config load -keys mapr.fs.nocompression

Best Practices

File Balancing
MapR distributes volumes to balance files across the cluster. Each volume has a name container that is restricted to one storage pool. The
greater the number of volumes, the more evenly MapR can distribute files. For best results, the number of volumes should be greater than the
total number of storage pools in the cluster. To accommodate a very large number of files, you can use disksetup with the -W option when
installing or re-formatting nodes, to create storage pools larger than the default of three disks each.

Disk Setup
It is not necessary to set up RAID on disks used by MapR-FS. MapR uses a script called disksetup to set up storage pools. In most cases, you
should let MapR calculate storage pools using the default stripe width of two or three disks. If you anticipate a high volume of random-access I/O,
you can use the -W option with disksetup to specify larger storage pools of up to 8 disks each.
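A minimal sketch of formatting disks with a wider stripe; the device names and the disk list path are examples and should be adjusted for your nodes:

# List the disks to give to MapR-FS, one per line
cat > /tmp/disks.txt <<EOF
/dev/sdb
/dev/sdc
/dev/sdd
/dev/sde
/dev/sdf
EOF
# Format the disks into storage pools of up to 5 disks each (run as root)
/opt/mapr/server/disksetup -W 5 -F /tmp/disks.txt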
Setting Up NFS
The mapr-nfs service lets you access data on a licensed MapR cluster via the NFS protocol:

M3 license: one NFS node


M5 license: multiple NFS nodes with VIPs for failover and load balancing

You can mount the MapR cluster via NFS and use standard shell scripting to read and write live data in the cluster. NFS access to cluster data
can be faster than accessing the same data with the hadoop commands. To mount the cluster via NFS from a client machine, see Setting Up the
Client.

NFS Setup Tips


Before using the MapR NFS Gateway, keep the following tips in mind (example commands for these checks appear after the list):

Ensure the stock Linux NFS service is stopped, as Linux NFS and MapR NFS will conflict with each other.
Ensure portmapper is running (Example: ps a | grep portmap).
Make sure you have installed the mapr-nfs package. If you followed the Quick Start - Single Node or Quick Start - Small Cluster
instructions, then it is installed. You can check by listing the /opt/mapr/roles directory and checking for nfs in the list.
Make sure you have applied an M3 license or an M5 (paid or trial) license to the cluster. See Getting a License.
Make sure the NFS service is started (see Services).
For information about mounting the cluster via NFS, see Setting Up the Client.
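The following commands run through these checks on a Red Hat or CentOS node (a sketch; on Ubuntu the stock NFS server service is nfs-kernel-server):

# Stop the stock Linux NFS server so it does not conflict with MapR NFS
sudo service nfs stop
# Confirm the portmapper is running
ps a | grep portmap
# Confirm the mapr-nfs package is installed on this node
ls /opt/mapr/roles | grep nfs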

NFS on an M3 Cluster
At installation time, choose one node on which to run the NFS gateway. NFS is lightweight and can be run on a node running services such as
CLDB or ZooKeeper. To add the NFS service to a running cluster, use the instructions in Adding Roles to install the mapr-nfs package on the
node where you would like to run NFS.

NFS on an M5 Cluster
At cluster installation time, plan which nodes should provide NFS access according to your anticipated traffic. For instance, if you need 5Gbps of
write throughput and 5Gbps of read throughput, here are a few ways to set up NFS:

12 NFS nodes, each of which has a single 1GbE connection
6 NFS nodes, each of which has a dual 1GbE connection
4 NFS nodes, each of which has a quad 1GbE connection

You can also set up NFS on all file server nodes, so each node can NFS-mount itself and native applications can run as tasks, or on one or more
dedicated gateways outside the cluster (using round-robin DNS or behind a hardware load balancer) to allow controlled access.

You can set up virtual IP addresses (VIPs) for NFS nodes in an M5-licensed MapR cluster, for load balancing or failover. VIPs provide multiple
addresses that can be leveraged for round-robin DNS, allowing client connections to be distributed among a pool of NFS nodes. VIPs also make
high availability (HA) NFS possible; in the event an NFS node fails, data requests are satisfied by other NFS nodes in the pool. You should use a
minimum of one VIP per NFS node per NIC that clients will use to connect to the NFS server. If you have four nodes with four NICs each, with
each NIC connected to an individual IP subnet, use a minimum of 16 VIPs and direct clients to the VIPs in round-robin fashion. The VIPs should
be in the same IP subnet as the interfaces to which they will be assigned.

Here are a few tips:

Set up NFS on at least three nodes if possible.


All NFS nodes must be accessible over the network from the machines where you want to mount them.
To serve a large number of clients, set up dedicated NFS nodes and load-balance between them. If the cluster is behind a firewall, you
can provide access through the firewall via a load balancer instead of direct access to each NFS node. You can run NFS on all nodes in
the cluster, if needed.
To provide maximum bandwidth to a specific client, install the NFS service directly on the client machine. The NFS gateway on the client
manages how data is sent in or read back from the cluster, using all its network interfaces (that are on the same subnet as the cluster
nodes) to transfer data via MapR APIs, balancing operations among nodes as needed.
Use VIPs to provide High Availability (HA) and failover. See Setting Up NFS HA for more information.

To add the NFS service to a running cluster, use the instructions in Adding Roles to install the mapr-nfs package on the nodes where you would
like to run NFS.

NFS Memory Settings


The memory allocated to each MapR service is specified in the /opt/mapr/conf/warden.conf file, which MapR automatically configures
based on the physical memory available on the node. You can adjust the minimum and maximum memory used for NFS, as well as the
percentage of the heap that it tries to use, by setting the percent, max, and min parameters in the warden.conf file on each NFS node.
Example:
...
service.command.nfs.heapsize.percent=3
service.command.nfs.heapsize.max=1000
service.command.nfs.heapsize.min=64
...

The percentages need not add up to 100; in fact, you can use less than the full heap by setting the heapsize.percent parameters for all
services to add up to less than 100% of the heap size. In general, you should not need to adjust the memory settings for individual services,
unless you see specific memory-related problems occurring.

NIC Configuration
For high performance clusters, use more than one network interface card (NIC) per node. MapR can detect multiple IP addresses on each node
and load-balance throughput automatically.

Isolating CLDB Nodes


In a large cluster (100 nodes or more) create CLDB-only nodes to ensure high performance. This configuration also provides additional control
over the placement of the CLDB data, for load balancing, fault tolerance, or high availability (HA). Setting up CLDB-only nodes involves restricting
the CLDB volume to its own topology and making sure all other volumes are on a separate topology. Unless you specify a default volume
topology, new volumes have no topology when they are created, and reside at the root topology path: "/". Because both the CLDB-only path and
the non-CLDB path are children of the root topology path, new non-CLDB volumes are not guaranteed to keep off the CLDB-only nodes. To avoid
this problem, set a default volume topology. See Setting Default Volume Topology.

To set up a CLDB-only node (an example command sequence is sketched after these steps):

1. SET UP the node as usual:


PREPARE the node, making sure it meets the requirements.
ADD the MapR Repository.
2. INSTALL only the following packages:
mapr-cldb
mapr-webserver
mapr-core
mapr-fileserver
3. RUN configure.sh.
4. FORMAT the disks.
5. START the warden:

/etc/init.d/mapr-warden start
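On a Red Hat or CentOS node, the steps above might look like the following sketch; the CLDB and ZooKeeper hostnames and the disk list path are placeholders:

# Install only the packages listed above
sudo yum install mapr-cldb mapr-webserver mapr-core mapr-fileserver
# Point the node at the cluster's CLDB and ZooKeeper nodes
sudo /opt/mapr/server/configure.sh -C cldb1,cldb2,cldb3 -Z zk1,zk2,zk3
# Format the disks listed in the disk list file
sudo /opt/mapr/server/disksetup -F /tmp/disks.txt
# Start the warden
sudo /etc/init.d/mapr-warden start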

To restrict the CLDB volume to specific nodes:

1. Move all CLDB nodes to a CLDB-only topology (e.g., /cldbonly) using the MapR Control System or the following command:
maprcli node move -serverids <CLDB nodes> -topology /cldbonly
2. Restrict the CLDB volume to the CLDB-only topology. Use the MapR Control System or the following command:
maprcli volume move -name mapr.cldb.internal -topology /cldbonly
3. If the CLDB volume is present on nodes not in /cldbonly, increase the replication factor of mapr.cldb.internal to create enough copies in
/cldbonly using the MapR Control System or the following command:
maprcli volume modify -name mapr.cldb.internal -replication <replication factor>
4. Once the volume has sufficient copies, remove the extra replicas by reducing the replication factor to the desired value using the MapR
Control System or the command used in the previous step.

To move all other volumes to a topology separate from the CLDB-only nodes:

1. Move all non-CLDB nodes to a non-CLDB topology (e.g., /defaultRack) using the MapR Control System or the following command:
maprcli node move -serverids <all non-CLDB nodes> -topology /defaultRack
2. Restrict all existing volumes to the topology /defaultRack using the MapR Control System or the following command:
maprcli volume move -name <volume> -topology /defaultRack
All volumes except mapr.cluster.root are re-replicated to the changed topology automatically.

To prevent subsequently created volumes from encroaching on the CLDB-only nodes, set a default topology that
excludes the CLDB-only topology.
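One way to do this is to point the default volume topology at the non-CLDB topology; this is a sketch, and the parameter name cldb.default.volume.topology is an assumption here, so see Setting Default Volume Topology for the documented procedure:

# Assumed parameter name; verify against Setting Default Volume Topology
maprcli config save -values '{"cldb.default.volume.topology":"/defaultRack"}'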
Isolating ZooKeeper Nodes
For large clusters (100 nodes or more), isolate the ZooKeeper on nodes that do not perform any other function, so that the ZooKeeper does not
compete for resources with other processes. Installing a ZooKeeper-only node is similar to any typical node installation, but with a specific subset
of packages. Importantly, do not install the FileServer package, so that MapR does not use the ZooKeeper-only node for data storage.

To set up a ZooKeeper-only node:

1. SET UP the node as usual:


PREPARE the node, making sure it meets the requirements.
ADD the MapR Repository.
2. INSTALL only the following packages:
mapr-zookeeper
mapr-zk-internal
mapr-core
3. RUN configure.sh.
4. FORMAT the disks.
5. START ZooKeeper (as root or using sudo):

/etc/init.d/mapr-zookeeper start

Do not start the warden.

Setting Up RAID on the Operating System Partition


You can set up RAID on the operating system partition(s) or drive(s) at installation time, to provide higher operating system performance (RAID
0), disk mirroring for failover (RAID 1), or both (RAID 10), for example. See the following instructions from the operating system websites:

CentOS
Red Hat
Ubuntu

Tuning MapReduce
MapR automatically tunes the cluster for most purposes. A service called the warden determines machine resources on nodes configured to run
the TaskTracker service, and sets MapReduce parameters accordingly.

On nodes with multiple CPUs, MapR uses taskset to reserve CPUs for MapR services:

On nodes with five to eight CPUs, CPU 0 is reserved for MapR services
On nodes with nine or more CPUs, CPU 0 and CPU 1 are reserved for MapR services

In certain circumstances, you might wish to manually tune MapR to provide higher performance. For example, when running a job consisting of
unusually large tasks, it is helpful to reduce the number of slots on each TaskTracker and adjust the Java heap size. The following sections
provide MapReduce tuning tips. If you change any settings in mapred-site.xml, restart the TaskTracker.

Memory Settings

Memory for MapR Services

The memory allocated to each MapR service is specified in the /opt/mapr/conf/warden.conf file, which MapR automatically configures
based on the physical memory available on the node. For example, you can adjust the minimum and maximum memory used for the
TaskTracker, as well as the percentage of the heap that the TaskTracker tries to use, by setting the appropriate percent, max, and min
parameters in the warden.conf file:

...
service.command.tt.heapsize.percent=2
service.command.tt.heapsize.max=325
service.command.tt.heapsize.min=64
...
The percentages of memory used by the services need not add up to 100; in fact, you can use less than the full heap by setting the
heapsize.percent parameters for all services to add up to less than 100% of the heap size. In general, you should not need to adjust the
memory settings for individual services, unless you see specific memory-related problems occurring.

MapReduce Memory

The memory allocated for MapReduce tasks normally equals the total system memory minus the total memory allocated for MapR services. If
necessary, you can use the parameter mapreduce.tasktracker.reserved.physicalmemory.mb to set the maximum physical memory reserved by
MapReduce tasks, or you can set it to -1 to disable physical memory accounting and task management.

If the node runs out of memory, MapReduce tasks are killed by the OOM killer to free memory. You can use mapred.child.oom_adj (copied
from mapred-default.xml) to adjust the oom_adj parameter for MapReduce tasks. The possible values of oom_adj range from -17 to +15.
The higher the score, the more likely the associated process is to be killed by the OOM killer.

Job Configuration

Map Tasks

Map tasks use memory mainly in two ways:

The MapReduce framework uses an intermediate buffer to hold serialized (key, value) pairs.
The application consumes memory to run the map function.

MapReduce framework memory is controlled by io.sort.mb. If io.sort.mb is less than the amount of data emitted from the mapper, the task ends up
spilling data to disk. If io.sort.mb is too large, the task can run out of memory or waste allocated memory. By default, io.sort.mb is 100 MB. It
should be approximately 1.25 times the number of data bytes emitted from the mapper. If you cannot resolve memory problems by adjusting
io.sort.mb, then try rewriting the application to use less memory in its map function.
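For example, a job whose mappers each emit roughly 120 MB of data would want io.sort.mb of about 150 MB. For a job whose driver implements Tool, this can be passed on the command line (the jar, class, and paths below are placeholders):

hadoop jar myjob.jar MyJob -Dio.sort.mb=150 /input /output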

Compression

To turn off MapR compression for map outputs, set mapreduce.maprfs.use.compression=false


To turn on LZO or any other compression, set mapreduce.maprfs.use.compression=false and
mapred.compress.map.output=true

Reduce Tasks

If tasks fail because of an out-of-heap-space error, increase the heap space (the -Xmx option in mapred.reduce.child.java.opts for reduce
tasks, or in mapred.map.child.java.opts for map tasks, both in mapred-site.xml) to give more memory to the tasks. If map tasks are failing,
you can also try reducing io.sort.mb.

TaskTracker Configuration
MapR sets up map and reduce slots on each TaskTracker node using formulas based on the number of CPUs present on the node. The default
formulas are stored in the following parameters in mapred-site.xml:

mapred.tasktracker.map.tasks.maximum: (CPUS > 2) ? (CPUS * 0.75) : 1 (At least one map slot, up to 0.75 times the number of
CPUs)
mapred.tasktracker.reduce.tasks.maximum: (CPUS > 2) ? (CPUS * 0.50) : 1 (At least one reduce slot, up to 0.50 times the
number of CPUs)

You can adjust the maximum number of map and reduce slots by editing the formula used in mapred.tasktracker.map.tasks.maximum
and mapred.tasktracker.reduce.tasks.maximum. The following variables are used in the formulas:

CPUS - number of CPUs present on the node


DISKS - number of disks present on the node
MEM - memory reserved for MapReduce tasks

Ideally, the number of map and reduce slots should be decided based on the needs of the application. Map slots should be based on how many
map tasks can fit in memory, and reduce slots should be based on the number of CPUs. If each task in a MapReduce job takes 3 GB, and each
node has 9GB reserved for MapReduce tasks, then the total number of map slots should be 3. The amount of data each map task must process
also affects how many map slots should be configured. If each map task processes 256 MB (the default chunksize in MapR), then each map task
should have 800 MB of memory. If there are 4 GB reserved for map tasks, then the number of map slots should be 4000MB/800MB, or 5 slots.

MapR allows the JobTracker to over-schedule tasks on TaskTracker nodes in advance of the availability of slots, creating a pipeline. This
optimization allows the TaskTracker to launch each map task as soon as the previous running map task finishes. The number of tasks to
over-schedule should be about 25-50% of the total number of map slots. You can adjust this number with the parameter
mapreduce.tasktracker.prefetch.maptasks.

Troubleshooting Out-of-Memory Errors


When the aggregated memory used by MapReduce tasks exceeds the memory reserve on a TaskTracker node, tasks can fail or be killed. MapR
attempts to prevent out-of-memory exceptions by killing MapReduce tasks when memory becomes scarce. If you allocate too little Java heap for
the expected memory requirements of your tasks, an exception can occur. The following steps can help configure MapR to avoid these problems:

If a particular job encounters out-of-memory conditions, the simplest way to solve the problem might be to reduce the memory footprint of
the map and reduce functions, and to ensure that the partitioner distributes map output to reducers evenly.

If it is not possible to reduce the memory footprint of the application, try increasing the Java heap size (-Xmx) in the client-side
MapReduce configuration.

If many jobs encounter out-of-memory conditions, or if jobs tend to fail on specific nodes, it may be that those nodes are advertising too
many TaskTracker slots. In this case, the cluster administrator should reduce the number of slots on the affected nodes.

To reduce the number of slots on a node:

1. Stop the TaskTracker service on the node:

$ sudo maprcli node services -nodes <node name> -tasktracker stop

2. Edit the file /opt/mapr/hadoop/hadoop-<version>/conf/mapred-site.xml:


Reduce the number of map slots by lowering mapred.tasktracker.map.tasks.maximum
Reduce the number of reduce slots by lowering mapred.tasktracker.reduce.tasks.maximum
3. Start the TaskTracker on the node:

$ sudo maprcli node services -nodes <node name> -tasktracker start

ExpressLane
MapR provides an express path for small MapReduce jobs to run when all slots are occupied by long tasks. Small jobs are only given this special
treatment when the cluster is busy, and only if they meet the criteria specified by the following parameters in mapred-site.xml:

Parameter Value Description

mapred.fairscheduler.smalljob.schedule.enable true Enable small job fast scheduling inside the fair scheduler. TaskTrackers
reserve an ephemeral slot that is used for small jobs when the
cluster is busy.

mapred.fairscheduler.smalljob.max.maps 10 Small job definition. Max number of maps allowed in small job.

mapred.fairscheduler.smalljob.max.reducers 10 Small job definition. Max number of reducers allowed in small job.

mapred.fairscheduler.smalljob.max.inputsize 10737418240 Small job definition. Max input size in bytes allowed for a small job.
Default is 10GB.

mapred.fairscheduler.smalljob.max.reducer.inputsize 1073741824 Small job definition. Max estimated input size for a reducer allowed in
small job. Default is 1GB per reducer.

mapred.cluster.ephemeral.tasks.memory.limit.mb 200 Small job definition. Maximum memory in MB reserved for an ephemeral
slot. Default is 200 MB. This value must be the same on the JobTracker
and TaskTracker nodes.

MapReduce jobs that appear to fit the small job definition but are in fact larger than anticipated are killed and re-queued for normal execution.

HBase
The HBase write-ahead log (WAL) writes many tiny records, and compressing it would cause massive CPU load. Before using HBase,
turn off MapR compression for directories in the HBase volume (normally mounted at /hbase). Example:

hadoop mfs -setcompression off /hbase

You can check whether compression is turned off in a directory or mounted volume by using Hadoop MFS to list the file contents. Example:

hadoop mfs -ls /hbase

The letter Z in the output indicates compression is turned on; the letter U indicates compression is turned off. See Hadoop MFS for more
information.

On any node where you plan to run both HBase and MapReduce, give more memory to the FileServer than to the RegionServer so that the node
can handle high throughput. For example, on a node with 24 GB of physical memory, it might be desirable to limit the RegionServer to 4 GB, give
10 GB to MapR-FS, and give the remainder to TaskTracker. To change the memory allocated to each service, edit the
/opt/mapr/conf/warden.conf file. See Tuning MapReduce for more information.
For higher performance, make sure fs.mapr.mempool.size is set to 0 in hbase-site.xml to disable shared memory in the fileclient.

Programming Guide

Working with MapR-FS


The file mapr-clusters.conf contains a list of the clusters that can be used by your application. The first cluster in the list is treated as the
default cluster.

In core-site.xml the fs.default.name parameter determines the default filesystem used by your application. Normally, this should be set
to one of the following values:

maprfs:/// - resolves to the default cluster in mapr-clusters.conf


maprfs:///mapr/<cluster name>/ or /mapr/<cluster name>/ - resolves to the specified cluster
maprfs://<IP address>:<port> - resolves to the cluster by the IP address and port of its master CLDB node
maprfs://<IP address> - resolves to the cluster by the IP address of its master CLDB node

In general, the first two options (maprfs:/// and maprfs:///mapr/<cluster name>/) provide the most flexibility, because they are not tied
to an IP address and will continue to function even if the IP address of the master CLDB changes (during failover, for example).
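For example, with a default cluster named my.cluster.com (an example cluster name), the following listings all refer to the same directory:

hadoop fs -ls maprfs:///user
hadoop fs -ls maprfs:///mapr/my.cluster.com/user
hadoop fs -ls /mapr/my.cluster.com/user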

Using Java to Interface with MapR-FS


In your Java application, you will use a Configuration object to interface with MapR-FS. When you run your Java application, add the Hadoop
configuration directory /opt/mapr/hadoop/hadoop-<version>/conf to the Java classpath. When you instantiate a Configuration
object, it is created with default values drawn from configuration files in that directory.

Sample Code

The following sample code shows how to interface with MapR-FS using Java. The example creates a directory, writes a file, then reads the
contents of the file.

Compiling the sample code requires only the Hadoop core JAR:

javac -cp /opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-0.20.2-dev-core.jar MapRTest.java

Running the sample code uses the following library path:

java -Djava.library.path=/opt/mapr/lib -cp .:\
/opt/mapr/hadoop/hadoop-0.20.2/conf:\
/opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-0.20.2-dev-core.jar:\
/opt/mapr/hadoop/hadoop-0.20.2/lib/maprfs-0.1.jar:\
/opt/mapr/hadoop/hadoop-0.20.2/lib/commons-logging-1.0.4.jar:\
/opt/mapr/hadoop/hadoop-0.20.2/lib/zookeeper-3.3.2.jar \
MapRTest /test

Sample Code

/* Copyright (c) 2009 & onwards. MapR Tech, Inc., All rights reserved */

//package com.mapr.fs;

import java.net.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.conf.*;

/**
 * Assumes mapr installed in /opt/mapr
 *
 * compilation needs only hadoop jars:
 * javac -cp /opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-0.20.2-dev-core.jar MapRTest.java
 *
 * Run:
 * java -Djava.library.path=/opt/mapr/lib -cp
 * /opt/mapr/hadoop/hadoop-0.20.2/conf:/opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-0.20.2-dev-core.jar:/opt/mapr/hadoop/hado
 * MapRTest /test
 */
public class MapRTest
{
    public static void main(String args[]) throws Exception {
        byte buf[] = new byte[65*1024];
        int ac = 0;
        if (args.length != 1) {
            System.out.println("usage: MapRTest pathname");
            return;
        }

        // maprfs:/// -> uses the first entry in /opt/mapr/conf/mapr-clusters.conf
        // To specify any other IP address or hostname of the CLDB, use one of:
        //   maprfs://10.10.100.12:7222/
        //   maprfs://10.10.100.12/
        //   maprfs:///mapr/my.cluster.com/
        //   /mapr/my.cluster.com/
        // String uri = "maprfs:///";

        String dirname = args[ac++];

        Configuration conf = new Configuration();

        // FileSystem fs = FileSystem.get(URI.create(uri), conf); // if wanting to use a different cluster
        FileSystem fs = FileSystem.get(conf);

        Path dirpath = new Path(dirname + "/dir");
        Path wfilepath = new Path(dirname + "/file.w");
        // Path rfilepath = new Path(dirname + "/file.r");
        Path rfilepath = wfilepath;

        // try mkdir
        boolean res = fs.mkdirs(dirpath);
        if (!res) {
            System.out.println("mkdir failed, path: " + dirpath);
            return;
        }

        System.out.println("mkdir( " + dirpath + ") went ok, now writing file");

        // create wfile
        FSDataOutputStream ostr = fs.create(wfilepath,
                true,                 // overwrite
                512,                  // buffersize
                (short) 1,            // replication
                (long)(64*1024*1024)  // chunksize
                );
        ostr.write(buf);
        ostr.close();

        System.out.println("write( " + wfilepath + ") went ok");

        // read rfile
        System.out.println("reading file: " + rfilepath);
        FSDataInputStream istr = fs.open(rfilepath);
        int bb = istr.readInt();
        istr.close();
        System.out.println("Read ok");
    }
}
Using C++ to Interface with MapR-FS
Apache Hadoop comes with a library (libhdfs) that provides a C++ API for operations on any filesystem (maprfs, localfs, hdfs and others).
MapR's distribution of libhdfs includes an additional custom library (libMapRClient) that provides the same API for operations on MapR-FS.
MapR recommends libMapRClient for operations exclusive to MapR-FS; you should only use libhdfs if you need to perform operations on
other filesystems.

The APIs are defined in the /opt/mapr/hadoop/hadoop-0.20.2/src/c++/libhdfs/hdfs.h header file, which includes documentation for
each API. Three sample programs are included in the same directory: hdfs_test.c, hdfs_write.c, and hdfs_read.c.

There are two ways to link your program with libMapRClient:

Since libMapRClient.so is a common library for both MapReduce and the filesystem, it has a dependency on libjvm. You can
operate with libMapRClient by linking your program with libjvm (Look at run1.sh for an example). This is the recommended
approach.
If you do not have a JVM on your system, you can pass the option -Wl,--allow-shlib-undefined to gcc to ignore the undefined
JNI symbols while linking your program (look at run2.sh for an example).

Finally, before running your program, some environment variables need to be set depending on what option is chosen. For examples, look at
run1.sh and run2.sh.

Do not overwrite the libMapRClient.so on a MapR client or node! The MapR version of libMapRClient.so is necessary
for the cluster to perform correctly.

run1.sh

#!/bin/bash

#Ensure JAVA_HOME is defined


if [ "${JAVA_HOME}" = "" ] ; then
echo "JAVA_HOME not defined"
exit 1
fi

#Setup environment
export HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2/
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/opt/mapr/lib:${JAVA_HOME}/jre/lib/amd64/server/
GCC_OPTS="-I. -I${HADOOP_HOME}/src/c++/libhdfs -I${JAVA_HOME}/include
-I${JAVA_HOME}/include/linux -L${HADOOP_HOME}/c++/lib
-L${JAVA_HOME}/jre/lib/amd64/server/ -L/opt/mapr/lib -lMapRClient
-ljvm"

#Compile
gcc ${GCC_OPTS} ${HADOOP_HOME}/src/c++/libhdfs/hdfs_test.c -o hdfs_test
gcc ${GCC_OPTS} ${HADOOP_HOME}/src/c++/libhdfs/hdfs_read.c -o hdfs_read
gcc ${GCC_OPTS} ${HADOOP_HOME}/src/c++/libhdfs/hdfs_write.c -o hdfs_write

#Run tests
./hdfs_test -m

run2.sh
#!/bin/bash

#Setup environment
export HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2/
GCC_OPTS="-Wl,--allow-shlib-undefined -I.
-I${HADOOP_HOME}/src/c++/libhdfs -L/opt/mapr/lib -lMapRClient"
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/opt/mapr/lib

#Compile and Link


gcc ${GCC_OPTS} ${HADOOP_HOME}/src/c++/libhdfs/hdfs_test.c -o hdfs_test
gcc ${GCC_OPTS} ${HADOOP_HOME}/src/c++/libhdfs/hdfs_read.c -o hdfs_read
gcc ${GCC_OPTS} ${HADOOP_HOME}/src/c++/libhdfs/hdfs_write.c -o hdfs_write

#Run tests
./hdfs_test -m

Reference Guide
This guide contains reference information:

Release Notes - Known issues and new features, by release


MapR Control System Reference - User interface reference guide
API Reference - Information about the command-line interface and the REST API
Scripts and Commands - Tool and utility reference
Configuration Files - Information about MapR settings
Glossary - Essential terms and definitions

Release Notes
This section contains Release Notes for all releases of MapR Distribution for Apache Hadoop:

Version 1.2 Release Notes - November 18, 2011


Version 1.1.3 Release Notes - September 29, 2011
Version 1.1.2 Release Notes - September 7, 2011
Version 1.1.1 Release Notes - August 22, 2011
Version 1.1 Release Notes - July 28, 2011
Version 1.0 Release Notes - June 29, 2011
Beta Release Notes - April 1, 2011
Alpha Release Notes - February 15, 2011

Repository Paths
Version 1.1.3
http://package.mapr.com/releases/v1.1.3/redhat/ (Red Hat or CentOS)
http://package.mapr.com/releases/v1.1.3/ubuntu/ (Ubuntu)
Version 1.1.2 - Internal maintenance release
Version 1.1.1
http://package.mapr.com/releases/v1.1.1/mac/ (Mac client)
http://package.mapr.com/releases/v1.1.1/redhat/ (Red Hat or CentOS)
http://package.mapr.com/releases/v1.1.1/ubuntu/ (Ubuntu)
Version 1.1.0
http://package.mapr.com/releases/v1.1.0-sp0/mac/ (Mac client)
http://package.mapr.com/releases/v1.1.0-sp0/redhat/ (Red Hat or CentOS)
http://package.mapr.com/releases/v1.1.0-sp0/ubuntu/ (Ubuntu)
Version 1.0.0
http://package.mapr.com/releases/v1.0.0-sp0/redhat/ (Red Hat or CentOS)
http://package.mapr.com/releases/v1.0.0-sp0/ubuntu/ (Ubuntu)

Version 1.2 Release Notes

General Information
MapR provides the following packages:

Apache Hadoop 0.20.2


flume-0.9.3
hbase-0.90.2
hive-0.7.0
oozie-3.0.0
pig-0.8
sqoop-1.2.0
whirr-0.3.0

New in This Release

Dial Home

Dial Home is a feature that collects information about the cluster for MapR support and engineering. You can opt in or out of the Dial Home
feature when you first install or upgrade MapR. To change the Dial Home status of your cluster at any time, see the Dial Home commands.

Rolling Upgrade

The script rollingupgrade.sh performs a software upgrade of an entire MapR cluster. See Cluster Upgrade for details.

Improvements to the Support Tools

The script mapr-support-collect.sh has been enhanced to generate and gather support output from specified cluster nodes into a single
output file via MapR-FS. To support this feature, mapr-support-collect.sh has a new option:

-O, --online Specifies a space-separated list of nodes from which to gather support output.

There is now a "mini-dump" option for both support-dump.sh and support-collect.sh to limit the size of the support output. When the -m or
--mini-dump option is specified along with a size, support-dump.sh collects only a head and tail, each limited to the specified size, from any log
file that is larger than twice the specified size. The total size of the output is therefore limited to approximately 2 * size * number of logs. The size
can be specified in bytes, or using the following suffixes:

b - blocks (512 bytes)


k - kilobytes (1024 bytes)
m - megabytes (1024 kilobytes)

-m, --mini-dump For any log file greater than 2 * <size>, collects only a head and tail, each of the specified size. The <size> may include a
suffix specifying units:

b - blocks (512 bytes)


k - kilobytes (1024 bytes)
m - megabytes (1024 kilobytes)

MapR Virtual Machine

The MapR Virtual Machine is a fully-functional single-node Hadoop cluster capable of running MapReduce programs and working with
applications like Hive, Pig, and HBase. You can try the MapR Virtual Machine on nearly any 64-bit computer by downloading the free VMware
Player.

Windows 7 Client

A Windows 7 version of the MapR Client is now available. The MapR client lets you interact with MapR Hadoop directly. With the MapR client, you
can submit Map/Reduce jobs and run hadoop fs and hadoop mfs commands.

Resolved Issues
(Issue 4307) Snapshot create fails with error EEXIST

Known Issues

(Issue 5590)

When mapr-core is upgraded from 1.1.3 to 1.2, /opt/mapr/bin/versions.sh is updated to contain hbase-version 0.90.4, even if
HBase has not been upgraded. This can create problems for any process that uses versions.sh to determine the correct version of HBase.
After upgrading to Version 1.2, check that the version of HBase specified in versions.sh is correct.

(Issue 5489) HBase nodes require configure.sh during rolling upgrade

When performing an upgrade to MapR 1.2, the HBase package is upgraded from version 0.90.2 to 0.90.4, and it is necessary to run
configure.sh on any nodes that are running the HBase region server or HBase master.

(Issue 4269) Bulk operations

The MapR Control System provides both a checkbox and a Select All link for selecting all alarms, nodes, snapshots, or volumes matching a filter,
even if there are too many results to display on a single screen. However, the following operations can only be performed on individually selected
results, or results selected using the Select Visible link at the bottom of the MapR Control System screen:

Volumes - Edit Volumes


Volumes - Remove Volumes
Volumes - New Snapshot
Volumes - Unmount
Mirror Volumes - Edit Volumes
Mirror Volumes - Remove Volumes
Mirror Volumes - Unmount
User Disk Usage - Edit
Snapshots - Remove
Snapshots - Preserve
Node Alarms - Change Topology
Nodes - Change Topology
Volume Alarms - Edit
Volume Alarms - Unmount
Volume Alarms - Remove
User/Group Alarms - Edit

In order to perform these operations on a large number of alarms, nodes, snapshots, or volumes, it is necessary to select each screenful of
results using Select Visible and perform the operation before selecting the next screenful of results.

(Issue 3122) Mirroring with fsck-repaired volume

If a source or mirror volume is repaired with fsck then the source and mirror volumes can go out of sync. It is necessary to perform a full mirror
operation with volume mirror start -full true to bring them back in sync. Similarly, when creating a dump file from a volume that has
been repaired with fsck, use -full true on the volume dump create command.

Hadoop Compatibility in Version 1.2


MapR provides the following packages:

Apache Hadoop 0.20.2


flume-0.9.4
hbase-0.90.4
hive-0.7.1
mahout-0.5
oozie-3.0.0
pig-0.9.0
sqoop-1.3.0
whirr-0.3.0

MapR HBase Patches

In the /opt/mapr/hbase/hbase-0.90.4/mapr-hbase-patches directory, MapR provides the following patches for HBase:
0000-hbase-with-mapr.patch
0001-HBASE-4196-0.90.4.patch
0002-HBASE-4144-0.90.4.patch
0003-HBASE-4148-0.90.4.patch
0004-HBASE-4159-0.90.4.patch
0005-HBASE-4168-0.90.4.patch
0006-HBASE-4196-0.90.4.patch
0007-HBASE-4095-0.90.4.patch
0008-HBASE-4222-0.90.4.patch
0009-HBASE-4270-0.90.4.patch
0010-HBASE-4238-0.90.4.patch
0011-HBASE-4387-0.90.4.patch
0012-HBASE-4295-0.90.4.patch
0013-HBASE-4563-0.90.4.patch
0014-HBASE-4570-0.90.4.patch
0015-HBASE-4562-0.90.4.patch

MapR Pig Patches

In the /opt/mapr/pig/pig-0.9.0/mapr-pig-patches directory, MapR provides the following patches for Pig:

0000-pig-mapr-compat.patch
0001-remove-hardcoded-hdfs-refs.patch
0002-pigmix2.patch
0003-pig-hbase-compatibility.patch

MapR Mahout Patches

In the /opt/mapr/mahout/mahout-0.5/mapr-mahout-patches directory, MapR provides the following patches for Mahout:

0000-mahout-mapr-compat.patch

MapR Hive Patches

In the /opt/mapr/hive/hive-0.7.1/mapr-hive-patches directory, MapR provides the following patches for Hive:

0000-symlink-support-in-hive-binary.patch
0001-remove-unnecessary-fsscheme-check.patch
0002-remove-unnecessary-fsscheme-check-1.patch

MapR Flume Patches

In the /opt/mapr/flume/flume-0.9.4/mapr-flume-patches directory, MapR provides the following patches for Flume:

0000-flume-mapr-compat.patch

MapR Sqoop Patches

In the /opt/mapr/sqoop/sqoop-1.3.0/mapr-sqoop-patches directory, MapR provides the following patches for Sqoop:

0000-setting-hadoop-hbase-versions-to-mapr-shipped-versions.patch

MapR Oozie Patches


In the /opt/mapr/oozie/oozie-3.0.0/mapr-oozie-patches directory, MapR provides the following patches for Oozie:

0000-oozie-with-mapr.patch
0001-OOZIE-022-3.0.0.patch
0002-OOZIE-139-3.0.0.patch

HBase Common Patches

MapR 1.2 includes the following Apache HBase patches that are not included in the Apache HBase base version 0.90.4:

[HBASE-4169] FSUtils LeaseRecovery for non HDFS FileSystems.


[HBASE-4168] A client continues to try and connect to a powered down regionserver
[HBASE-4196] TableRecordReader may skip first row of region
[HBASE-4144] RS does not abort if the initialization of RS fails
[HBASE-4148] HFileOutputFormat doesn't fill in TIMERANGE_KEY metadata
[HBASE-4159] HBaseServer - IPC Reader threads are not daemons
[HBASE-4095] Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
[HBASE-4270] IOE ignored during flush-on-close causes dataloss
[HBASE-4238] CatalogJanitor can clear a daughter that split before processing its parent
[HBASE-4387] Error while syncing: DFSOutputStream is closed
[HBASE-4295] rowcounter does not return the correct number of rows in certain circumstances
[HBASE-4563] When error occurs in this.parent.close(false) of split, the split region cannot write or read
[HBASE-4570] Fix a race condition that could cause inconsistent results from scans during concurrent writes.
[HBASE-4562] When split doing offlineParentInMeta encounters error, it\'ll cause data loss
[HBASE-4222] Make HLog more resilient to write pipeline failures

Oozie Common Patches

MapR 1.2 includes the following Apache Oozie patches that are not included in the Apache Oozie base version 3.0.0:
[GH-0022] Add Hive action
[GH-0139] Add Sqoop action

Hadoop Common Patches

MapR 1.2 includes the following Apache Hadoop patches that are not included in the Apache Hadoop base version 0.20.2:

[HADOOP-1722] Make streaming to handle non-utf8 byte array


[HADOOP-1849] IPC server max queue size should be configurable
[HADOOP-2141] speculative execution start up condition based on completion time
[HADOOP-2366] Space in the value for dfs.data.dir can cause great problems
[HADOOP-2721] Use job control for tasks (and therefore for pipes and streaming)
[HADOOP-2838] Add HADOOP_LIBRARY_PATH config setting so Hadoop will include external directories for jni
[HADOOP-3327] Shuffling fetchers waited too long between map output fetch re-tries
[HADOOP-3659] Patch to allow hadoop native to compile on Mac OS X
[HADOOP-4012] Providing splitting support for bzip2 compressed files
[HADOOP-4041] IsolationRunner does not work as documented
[HADOOP-4490] Map and Reduce tasks should run as the user who submitted the job
[HADOOP-4655] FileSystem.CACHE should be ref-counted
[HADOOP-4656] Add a user to groups mapping service
[HADOOP-4675] Current Ganglia metrics implementation is incompatible with Ganglia 3.1
[HADOOP-4829] Allow FileSystem shutdown hook to be disabled
[HADOOP-4842] Streaming combiner should allow command, not just JavaClass
[HADOOP-4930] Implement setuid executable for Linux to assist in launching tasks as job owners
[HADOOP-4933] ConcurrentModificationException in JobHistory.java
[HADOOP-5170] Set max map/reduce tasks on a per-job basis, either per-node or cluster-wide
[HADOOP-5175] Option to prohibit jars unpacking
[HADOOP-5203] TT's version build is too restrictive
[HADOOP-5396] Queue ACLs should be refreshed without requiring a restart of the job tracker
[HADOOP-5419] Provide a way for users to find out what operations they can do on which M/R queues
[HADOOP-5420] Support killing of process groups in LinuxTaskController binary
[HADOOP-5442] The job history display needs to be paged
[HADOOP-5450] Add support for application-specific typecodes to typed bytes
[HADOOP-5469] Exposing Hadoop metrics via HTTP
[HADOOP-5476] calling new SequenceFile.Reader(...) leaves an InputStream open, if the given sequence file is broken
[HADOOP-5488] HADOOP-2721 doesn't clean up descendant processes of a jvm that exits cleanly after running a task successfully
[HADOOP-5528] Binary partitioner
[HADOOP-5582] Hadoop Vaidya throws number format exception due to changes in the job history counters string format (escaped compact
representation).
[HADOOP-5592] Hadoop Streaming - GzipCodec
[HADOOP-5613] change S3Exception to checked exception
[HADOOP-5643] Ability to blacklist tasktracker
[HADOOP-5656] Counter for S3N Read Bytes does not work
[HADOOP-5675] DistCp should not launch a job if it is not necessary
[HADOOP-5733] Add map/reduce slot capacity and lost map/reduce slot capacity to JobTracker metrics
[HADOOP-5737] UGI checks in testcases are broken
[HADOOP-5738] Split waiting tasks field in JobTracker metrics to individual tasks
[HADOOP-5745] Allow setting the default value of maxRunningJobs for all pools
[HADOOP-5784] The length of the heartbeat cycle should be configurable.
[HADOOP-5801] JobTracker should refresh the hosts list upon recovery
[HADOOP-5805] problem using top level s3 buckets as input/output directories
[HADOOP-5861] s3n files are not getting split by default
[HADOOP-5879] GzipCodec should read compression level etc from configuration
[HADOOP-5913] Allow administrators to be able to start and stop queues
[HADOOP-5958] Use JDK 1.6 File APIs in DF.java wherever possible
[HADOOP-5976] create script to provide classpath for external tools
[HADOOP-5980] LD_LIBRARY_PATH not passed to tasks spawned off by LinuxTaskController
[HADOOP-5981] HADOOP-2838 doesnt work as expected
[HADOOP-6132] RPC client opens an extra connection for VersionedProtocol
[HADOOP-6133] ReflectionUtils performance regression
[HADOOP-6148] Implement a pure Java CRC32 calculator
[HADOOP-6161] Add get/setEnum to Configuration
[HADOOP-6166] Improve PureJavaCrc32
[HADOOP-6184] Provide a configuration dump in json format.
[HADOOP-6227] Configuration does not lock parameters marked final if they have no value.
[HADOOP-6234] Permission configuration files should use octal and symbolic
[HADOOP-6254] s3n fails with SocketTimeoutException
[HADOOP-6269] Missing synchronization for defaultResources in Configuration.addResource
[HADOOP-6279] Add JVM memory usage to JvmMetrics
[HADOOP-6284] Any hadoop commands crashing jvm (SIGBUS) when /tmp (tmpfs) is full
[HADOOP-6299] Use JAAS LoginContext for our login
[HADOOP-6312] Configuration sends too much data to log4j
[HADOOP-6337] Update FilterInitializer class to be more visible and take a conf for further development
[HADOOP-6343] Stack trace of any runtime exceptions should be recorded in the server logs.
[HADOOP-6400] Log errors getting Unix UGI
[HADOOP-6408] Add a /conf servlet to dump running configuration
[HADOOP-6419] Change RPC layer to support SASL based mutual authentication
[HADOOP-6433] Add AsyncDiskService that is used in both hdfs and mapreduce
[HADOOP-6441] Prevent remote CSS attacks in Hostname and UTF-7.
[HADOOP-6453] Hadoop wrapper script shouldn't ignore an existing JAVA_LIBRARY_PATH
[HADOOP-6471] StringBuffer -> StringBuilder - conversion of references as necessary
[HADOOP-6496] HttpServer sends wrong content-type for CSS files (and others)
[HADOOP-6510] doAs for proxy user
[HADOOP-6521] FsPermission:SetUMask not updated to use new-style umask setting.
[HADOOP-6534] LocalDirAllocator should use whitespace trimming configuration getters
[HADOOP-6543] Allow authentication-enabled RPC clients to connect to authentication-disabled RPC servers
[HADOOP-6558] archive does not work with distcp -update
[HADOOP-6568] Authorization for default servlets
[HADOOP-6569] FsShell#cat should avoid calling unecessary getFileStatus before opening a file to read
[HADOOP-6572] RPC responses may be out-of-order with respect to SASL
[HADOOP-6577] IPC server response buffer reset threshold should be configurable
[HADOOP-6578] Configuration should trim whitespace around a lot of value types
[HADOOP-6599] Split RPC metrics into summary and detailed metrics
[HADOOP-6609] Deadlock in DFSClient#getBlockLocations even with the security disabled
[HADOOP-6613] RPC server should check for version mismatch first
[HADOOP-6627] "Bad Connection to FS" message in FSShell should print message from the exception
[HADOOP-6631] FileUtil.fullyDelete() should continue to delete other files despite failure at any level.
[HADOOP-6634] AccessControlList uses full-principal names to verify acls causing queue-acls to fail
[HADOOP-6637] Benchmark overhead of RPC session establishment
[HADOOP-6640] FileSystem.get() does RPC retries within a static synchronized block
[HADOOP-6644] util.Shell getGROUPS_FOR_USER_COMMAND method name - should use common naming convention
[HADOOP-6649] login object in UGI should be inside the subject
[HADOOP-6652] ShellBasedUnixGroupsMapping shouldn't have a cache
[HADOOP-6653] NullPointerException in setupSaslConnection when browsing directories
[HADOOP-6663] BlockDecompressorStream get EOF exception when decompressing the file compressed from empty file
[HADOOP-6667] RPC.waitForProxy should retry through NoRouteToHostException
[HADOOP-6669] zlib.compress.level ignored for DefaultCodec initialization
[HADOOP-6670] UserGroupInformation doesn't support use in hash tables
[HADOOP-6674] Performance Improvement in Secure RPC
[HADOOP-6687] user object in the subject in UGI should be reused in case of a relogin.
[HADOOP-6701] Incorrect exit codes for "dfs -chown", "dfs -chgrp"
[HADOOP-6706] Relogin behavior for RPC clients could be improved
[HADOOP-6710] Symbolic umask for file creation is not consistent with posix
[HADOOP-6714] FsShell 'hadoop fs -text' does not support compression codecs
[HADOOP-6718] Client does not close connection when an exception happens during SASL negotiation
[HADOOP-6722] NetUtils.connect should check that it hasn't connected a socket to itself
[HADOOP-6723] unchecked exceptions thrown in IPC Connection orphan clients
[HADOOP-6724] IPC doesn't properly handle IOEs thrown by socket factory
[HADOOP-6745] adding some java doc to Server.RpcMetrics, UGI
[HADOOP-6757] NullPointerException for hadoop clients launched from streaming tasks
[HADOOP-6760] WebServer shouldn't increase port number in case of negative port setting caused by Jetty's race
[HADOOP-6762] exception while doing RPC I/O closes channel
[HADOOP-6776] UserGroupInformation.createProxyUser's javadoc is broken
[HADOOP-6813] Add a new newInstance method in FileSystem that takes a "user" as argument
[HADOOP-6815] refreshSuperUserGroupsConfiguration should use server side configuration for the refresh
[HADOOP-6818] Provide a JNI-based implementation of GroupMappingServiceProvider
[HADOOP-6832] Provide a web server plugin that uses a static user for the web UI
[HADOOP-6833] IPC leaks call parameters when exceptions thrown
[HADOOP-6859] Introduce additional statistics to FileSystem
[HADOOP-6864] Provide a JNI-based implementation of ShellBasedUnixGroupsNetgroupMapping (implementation of
GroupMappingServiceProvider)
[HADOOP-6881] The efficient comparators aren't always used except for BytesWritable and Text
[HADOOP-6899] RawLocalFileSystem#setWorkingDir() does not work for relative names
[HADOOP-6907] Rpc client doesn't use the per-connection conf to figure out server's Kerberos principal
[HADOOP-6925] BZip2Codec incorrectly implements read()
[HADOOP-6928] Fix BooleanWritable comparator in 0.20
[HADOOP-6943] The GroupMappingServiceProvider interface should be public
[HADOOP-6950] Suggest that HADOOP_CLASSPATH should be preserved in hadoop-env.sh.template
[HADOOP-6995] Allow wildcards to be used in ProxyUsers configurations
[HADOOP-7082] Configuration.writeXML should not hold lock while outputting
[HADOOP-7101] UserGroupInformation.getCurrentUser() fails when called from non-Hadoop JAAS context
[HADOOP-7104] Remove unnecessary DNS reverse lookups from RPC layer
[HADOOP-7110] Implement chmod with JNI
[HADOOP-7114] FsShell should dump all exceptions at DEBUG level
[HADOOP-7115] Add a cache for getpwuid_r and getpwgid_r calls
[HADOOP-7118] NPE in Configuration.writeXml
[HADOOP-7122] Timed out shell commands leak Timer threads
[HADOOP-7156] getpwuid_r is not thread-safe on RHEL6
[HADOOP-7172] SecureIO should not check owner on non-secure clusters that have no native support
[HADOOP-7173] Remove unused fstat() call from NativeIO
[HADOOP-7183] WritableComparator.get should not cache comparator objects
[HADOOP-7184] Remove deprecated local.cache.size from core-default.xml

MapReduce Patches

MapR 1.2 includes the following Apache MapReduce patches that are not included in the Apache Hadoop base version 0.20.2:

[MAPREDUCE-112] Reduce Input Records and Reduce Output Records counters are not being set when using the new Mapreduce reducer API
[MAPREDUCE-118] Job.getJobID() will always return null
[MAPREDUCE-144] TaskMemoryManager should log process-tree's status while killing tasks.
[MAPREDUCE-181] Secure job submission
[MAPREDUCE-211] Provide a node health check script and run it periodically to check the node health status
[MAPREDUCE-220] Collecting cpu and memory usage for MapReduce tasks
[MAPREDUCE-270] TaskTracker could send an out-of-band heartbeat when the last running map/reduce completes
[MAPREDUCE-277] Job history counters should be avaible on the UI.
[MAPREDUCE-339] JobTracker should give preference to failed tasks over virgin tasks so as to terminate the job ASAP if it is eventually going to
fail.
[MAPREDUCE-364] Change org.apache.hadoop.examples.MultiFileWordCount to use new mapreduce api.
[MAPREDUCE-369] Change org.apache.hadoop.mapred.lib.MultipleInputs to use new api.
[MAPREDUCE-370] Change org.apache.hadoop.mapred.lib.MultipleOutputs to use new api.
[MAPREDUCE-415] JobControl Job does always has an unassigned name
[MAPREDUCE-416] Move the completed jobs' history files to a DONE subdirectory inside the configured history directory
[MAPREDUCE-461] Enable ServicePlugins for the JobTracker
[MAPREDUCE-463] The job setup and cleanup tasks should be optional
[MAPREDUCE-467] Collect information about number of tasks succeeded / total per time unit for a tasktracker.
[MAPREDUCE-476] extend DistributedCache to work locally (LocalJobRunner)
[MAPREDUCE-478] separate jvm param for mapper and reducer
[MAPREDUCE-516] Fix the 'cluster drain' problem in the Capacity Scheduler wrt High RAM Jobs
[MAPREDUCE-517] The capacity-scheduler should assign multiple tasks per heartbeat
[MAPREDUCE-521] After JobTracker restart Capacity Schduler does not schedules pending tasks from already running tasks.
[MAPREDUCE-532] Allow admins of the Capacity Scheduler to set a hard-limit on the capacity of a queue
[MAPREDUCE-551] Add preemption to the fair scheduler
[MAPREDUCE-572] If #link is missing from uri format of -cacheArchive then streaming does not throw error.
[MAPREDUCE-655] Change KeyValueLineRecordReader and KeyValueTextInputFormat to use new api.
[MAPREDUCE-676] Existing diagnostic rules fail for MAP ONLY jobs
[MAPREDUCE-679] XML-based metrics as JSP servlet for JobTracker
[MAPREDUCE-680] Reuse of Writable objects is improperly handled by MRUnit
[MAPREDUCE-682] Reserved tasktrackers should be removed when a node is globally blacklisted
[MAPREDUCE-693] Conf files not moved to "done" subdirectory after JT restart
[MAPREDUCE-698] Per-pool task limits for the fair scheduler
[MAPREDUCE-706] Support for FIFO pools in the fair scheduler
[MAPREDUCE-707] Provide a jobconf property for explicitly assigning a job to a pool
[MAPREDUCE-709] node health check script does not display the correct message on timeout
[MAPREDUCE-714] JobConf.findContainingJar unescapes unnecessarily on Linux
[MAPREDUCE-716] org.apache.hadoop.mapred.lib.db.DBInputformat not working with oracle
[MAPREDUCE-722] More slots are getting reserved for HiRAM job tasks then required
[MAPREDUCE-732] node health check script should not log "UNHEALTHY" status for every heartbeat in INFO mode
[MAPREDUCE-734] java.util.ConcurrentModificationException observed in unreserving slots for HiRam Jobs
[MAPREDUCE-739] Allow relative paths to be created inside archives.
[MAPREDUCE-740] Provide summary information per job once a job is finished.
[MAPREDUCE-744] Support in DistributedCache to share cache files with other users after HADOOP-4493
[MAPREDUCE-754] NPE in expiry thread when a TT is lost
[MAPREDUCE-764] TypedBytesInput's readRaw() does not preserve custom type codes
[MAPREDUCE-768] Configuration information should generate dump in a standard format.
[MAPREDUCE-771] Setup and cleanup tasks remain in UNASSIGNED state for a long time on tasktrackers with long running high RAM tasks
[MAPREDUCE-782] Use PureJavaCrc32 in mapreduce spills
[MAPREDUCE-787] -files, -archives should honor user given symlink path
[MAPREDUCE-809] Job summary logs show status of completed jobs as RUNNING
[MAPREDUCE-814] Move completed Job history files to HDFS
[MAPREDUCE-817] Add a cache for retired jobs with minimal job info and provide a way to access history file url
[MAPREDUCE-825] JobClient completion poll interval of 5s causes slow tests in local mode
[MAPREDUCE-840] DBInputFormat leaves open transaction
[MAPREDUCE-842] Per-job local data on the TaskTracker node should have right access-control
[MAPREDUCE-856] Localized files from DistributedCache should have right access-control
[MAPREDUCE-871] Job/Task local files have incorrect group ownership set by LinuxTaskController binary
[MAPREDUCE-875] Make DBRecordReader execute queries lazily
[MAPREDUCE-885] More efficient SQL queries for DBInputFormat
[MAPREDUCE-890] After HADOOP-4491, the user who started mapred system is not able to run job.
[MAPREDUCE-896] Users can set non-writable permissions on temporary files for TT and can abuse disk usage.
[MAPREDUCE-899] When using LinuxTaskController, localized files may become accessible to unintended users if permissions are
misconfigured.
[MAPREDUCE-927] Cleanup of task-logs should happen in TaskTracker instead of the Child
[MAPREDUCE-947] OutputCommitter should have an abortJob method
[MAPREDUCE-964] Inaccurate values in jobSummary logs
[MAPREDUCE-967] TaskTracker does not need to fully unjar job jars
[MAPREDUCE-968] NPE in distcp encountered when placing _logs directory on S3FileSystem
[MAPREDUCE-971] distcp does not always remove distcp.tmp.dir
[MAPREDUCE-1028] Cleanup tasks are scheduled using high memory configuration, leaving tasks in unassigned state.
[MAPREDUCE-1030] Reduce tasks are getting starved in capacity scheduler
[MAPREDUCE-1048] Show total slot usage in cluster summary on jobtracker webui
[MAPREDUCE-1059] distcp can generate uneven map task assignments
[MAPREDUCE-1083] Use the user-to-groups mapping service in the JobTracker
[MAPREDUCE-1085] For tasks, "ulimit -v -1" is being run when user doesn't specify mapred.child.ulimit
[MAPREDUCE-1086] hadoop commands in streaming tasks are trying to write to tasktracker's log
[MAPREDUCE-1088] JobHistory files should have narrower 0600 perms
[MAPREDUCE-1089] Fair Scheduler preemption triggers NPE when tasks are scheduled but not running
[MAPREDUCE-1090] Modify log statement in Tasktracker log related to memory monitoring to include attempt id.
[MAPREDUCE-1098] Incorrect synchronization in DistributedCache causes TaskTrackers to freeze up during localization of Cache for tasks.
[MAPREDUCE-1100] User's task-logs filling up local disks on the TaskTrackers
[MAPREDUCE-1103] Additional JobTracker metrics
[MAPREDUCE-1105] CapacityScheduler: It should be possible to set queue hard-limit beyond it's actual capacity
[MAPREDUCE-1118] Capacity Scheduler scheduling information is hard to read / should be tabular format
[MAPREDUCE-1131] Using profilers other than hprof can cause JobClient to report job failure
[MAPREDUCE-1140] Per cache-file refcount can become negative when tasks release distributed-cache files
[MAPREDUCE-1143] runningMapTasks counter is not properly decremented in case of failed Tasks.
[MAPREDUCE-1155] Streaming tests swallow exceptions
[MAPREDUCE-1158] running_maps is not decremented when the tasks of a job is killed/failed
[MAPREDUCE-1160] Two log statements at INFO level fill up jobtracker logs
[MAPREDUCE-1171] Lots of fetch failures
[MAPREDUCE-1178] MultipleInputs fails with ClassCastException
[MAPREDUCE-1185] URL to JT webconsole for running job and job history should be the same
[MAPREDUCE-1186] While localizing a DistributedCache file, TT sets permissions recursively on the whole base-dir
[MAPREDUCE-1196] MAPREDUCE-947 incompatibly changed FileOutputCommitter
[MAPREDUCE-1198] Alternatively schedule different types of tasks in fair share scheduler
[MAPREDUCE-1213] TaskTrackers restart is very slow because it deletes distributed cache directory synchronously
[MAPREDUCE-1219] JobTracker Metrics causes undue load on JobTracker
[MAPREDUCE-1221] Kill tasks on a node if the free physical memory on that machine falls below a configured threshold
[MAPREDUCE-1231] Distcp is very slow
[MAPREDUCE-1250] Refactor job token to use a common token interface
[MAPREDUCE-1258] Fair scheduler event log not logging job info
[MAPREDUCE-1285] DistCp cannot handle -delete if destination is local filesystem
[MAPREDUCE-1288] DistributedCache localizes only once per cache URI
[MAPREDUCE-1293] AutoInputFormat doesn't work with non-default FileSystems
[MAPREDUCE-1302] TrackerDistributedCacheManager can delete file asynchronously
[MAPREDUCE-1304] Add counters for task time spent in GC
[MAPREDUCE-1307] Introduce the concept of Job Permissions
[MAPREDUCE-1313] NPE in FieldFormatter if escape character is set and field is null
[MAPREDUCE-1316] JobTracker holds stale references to retired jobs via unreported tasks
[MAPREDUCE-1342] Potential JT deadlock in faulty TT tracking
[MAPREDUCE-1354] Incremental enhancements to the JobTracker for better scalability
[MAPREDUCE-1372] ConcurrentModificationException in JobInProgress
[MAPREDUCE-1378] Args in job details links on jobhistory.jsp are not URL encoded
[MAPREDUCE-1382] MRAsyncDiscService should tolerate missing local.dir
[MAPREDUCE-1397] NullPointerException observed during task failures
[MAPREDUCE-1398] TaskLauncher remains stuck on tasks waiting for free nodes even if task is killed.
[MAPREDUCE-1399] The archive command shows a null error message
[MAPREDUCE-1403] Save file-sizes of each of the artifacts in DistributedCache in the JobConf
[MAPREDUCE-1421] LinuxTaskController tests failing on trunk after the commit of MAPREDUCE-1385
[MAPREDUCE-1422] Changing permissions of files/dirs under job-work-dir may be needed sothat cleaning up of job-dir in all
mapred-local-directories succeeds always
[MAPREDUCE-1423] Improve performance of CombineFileInputFormat when multiple pools are configured
[MAPREDUCE-1425] archive throws OutOfMemoryError
[MAPREDUCE-1435] symlinks in cwd of the task are not handled properly after MAPREDUCE-896
[MAPREDUCE-1436] Deadlock in preemption code in fair scheduler
[MAPREDUCE-1440] MapReduce should use the short form of the user names
[MAPREDUCE-1441] Configuration of directory lists should trim whitespace
[MAPREDUCE-1442] StackOverflowError when JobHistory parses a really long line
[MAPREDUCE-1443] DBInputFormat can leak connections
[MAPREDUCE-1454] The servlets should quote server generated strings sent in the response
[MAPREDUCE-1455] Authorization for servlets
[MAPREDUCE-1457] For secure job execution, couple of more UserGroupInformation.doAs needs to be added
[MAPREDUCE-1464] In JobTokenIdentifier change method getUsername to getUser which returns UGI
[MAPREDUCE-1466] FileInputFormat should save #input-files in JobConf
[MAPREDUCE-1476] committer.needsTaskCommit should not be called for a task cleanup attempt
[MAPREDUCE-1480] CombineFileRecordReader does not properly initialize child RecordReader
[MAPREDUCE-1493] Authorization for job-history pages
[MAPREDUCE-1503] Push HADOOP-6551 into MapReduce
[MAPREDUCE-1505] Cluster class should create the rpc client only when needed
[MAPREDUCE-1521] Protection against incorrectly configured reduces
[MAPREDUCE-1522] FileInputFormat may change the file system of an input path
[MAPREDUCE-1526] Cache the job related information while submitting the job , this would avoid many RPC calls to JobTracker.
[MAPREDUCE-1533] Reduce or remove usage of String.format() usage in CapacityTaskScheduler.updateQSIObjects and
Counters.makeEscapedString()
[MAPREDUCE-1538] TrackerDistributedCacheManager can fail because the number of subdirectories reaches system limit
[MAPREDUCE-1543] Log messages of JobACLsManager should use security logging of HADOOP-6586
[MAPREDUCE-1545] Add 'first-task-launched' to job-summary
[MAPREDUCE-1550] UGI.doAs should not be used for getting the history file of jobs
[MAPREDUCE-1563] Task diagnostic info would get missed sometimes.
[MAPREDUCE-1570] Shuffle stage - Key and Group Comparators
[MAPREDUCE-1607] Task controller may not set permissions for a task cleanup attempt's log directory
[MAPREDUCE-1609] TaskTracker.localizeJob should not set permissions on job log directory recursively
[MAPREDUCE-1611] Refresh nodes and refresh queues doesnt work with service authorization enabled
[MAPREDUCE-1612] job conf file is not accessible from job history web page
[MAPREDUCE-1621] Streaming's TextOutputReader.getLastOutput throws NPE if it has never read any output
[MAPREDUCE-1635] ResourceEstimator does not work after MAPREDUCE-842
[MAPREDUCE-1641] Job submission should fail if same uri is added for mapred.cache.files and mapred.cache.archives
[MAPREDUCE-1656] JobStory should provide queue info.
[MAPREDUCE-1657] After task logs directory is deleted, tasklog servlet displays wrong error message about job ACLs
[MAPREDUCE-1664] Job Acls affect Queue Acls
[MAPREDUCE-1680] Add a metrics to track the number of heartbeats processed
[MAPREDUCE-1682] Tasks should not be scheduled after tip is killed/failed.
[MAPREDUCE-1683] Remove JNI calls from ClusterStatus cstr
[MAPREDUCE-1699] JobHistory shouldn't be disabled for any reason
[MAPREDUCE-1707] TaskRunner can get NPE in getting ugi from TaskTracker
[MAPREDUCE-1716] Truncate logs of finished tasks to prevent node thrash due to excessive logging
[MAPREDUCE-1733] Authentication between pipes processes and java counterparts.
[MAPREDUCE-1734] Un-deprecate the old MapReduce API in the 0.20 branch
[MAPREDUCE-1744] DistributedCache creates its own FileSytem instance when adding a file/archive to the path
[MAPREDUCE-1754] Replace mapred.persmissions.supergroup with an acl : mapreduce.cluster.administrators
[MAPREDUCE-1759] Exception message for unauthorized user doing killJob, killTask, setJobPriority needs to be improved
[MAPREDUCE-1778] CompletedJobStatusStore initialization should fail if {mapred.job.tracker.persist.jobstatus.dir} is unwritable
[MAPREDUCE-1784] IFile should check for null compressor
[MAPREDUCE-1785] Add streaming config option for not emitting the key
[MAPREDUCE-1832] Support for file sizes less than 1MB in DFSIO benchmark.
[MAPREDUCE-1845] FairScheduler.tasksToPeempt() can return negative number
[MAPREDUCE-1850] Include job submit host information (name and ip) in jobconf and jobdetails display
[MAPREDUCE-1853] MultipleOutputs does not cache TaskAttemptContext
[MAPREDUCE-1868] Add read timeout on userlog pull
[MAPREDUCE-1872] Re-think (user|queue) limits on (tasks|jobs) in the CapacityScheduler
[MAPREDUCE-1887] MRAsyncDiskService does not properly absolutize volume root paths
[MAPREDUCE-1900] MapReduce daemons should close FileSystems that are not needed anymore
[MAPREDUCE-1914] TrackerDistributedCacheManager never cleans its input directories
[MAPREDUCE-1938] Ability for having user's classes take precedence over the system classes for tasks' classpath
[MAPREDUCE-1960] Limit the size of jobconf.
[MAPREDUCE-1961] ConcurrentModificationException when shutting down Gridmix
[MAPREDUCE-1985] java.lang.ArrayIndexOutOfBoundsException in analysejobhistory.jsp of jobs with 0 maps
[MAPREDUCE-2023] TestDFSIO read test may not read specified bytes.
[MAPREDUCE-2082] Race condition in writing the jobtoken password file when launching pipes jobs
[MAPREDUCE-2096] Secure local filesystem IO from symlink vulnerabilities
[MAPREDUCE-2103] task-controller shouldn't require o-r permissions
[MAPREDUCE-2157] safely handle InterruptedException and interrupted status in MR code
[MAPREDUCE-2178] Race condition in LinuxTaskController permissions handling
[MAPREDUCE-2219] JT should not try to remove mapred.system.dir during startup
[MAPREDUCE-2234] If Localizer can't create task log directory, it should fail on the spot
[MAPREDUCE-2235] JobTracker "over-synchronization" makes it hang up in certain cases
[MAPREDUCE-2242] LinuxTaskController doesn't properly escape environment variables
[MAPREDUCE-2253] Servlets should specify content type
[MAPREDUCE-2256] FairScheduler fairshare preemption from multiple pools may preempt all tasks from one pool causing that pool to go below
fairshare.
[MAPREDUCE-2289] Permissions race can make getStagingDir fail on local filesystem
[MAPREDUCE-2321] TT should fail to start on secure cluster when SecureIO isn't available
[MAPREDUCE-2323] Add metrics to the fair scheduler
[MAPREDUCE-2328] memory-related configurations missing from mapred-default.xml
[MAPREDUCE-2332] Improve error messages when MR dirs on local FS have bad ownership
[MAPREDUCE-2351] mapred.job.tracker.history.completed.location should support an arbitrary filesystem URI
[MAPREDUCE-2353] Make the MR changes to reflect the API changes in SecureIO library
[MAPREDUCE-2356] A task succeeded even though there were errors on all attempts.
[MAPREDUCE-2364] Shouldn't hold lock on rjob while localizing resources.
[MAPREDUCE-2366] TaskTracker can't retrieve stdout and stderr from web UI
[MAPREDUCE-2371] TaskLogsTruncater does not need to check log ownership when running as Child
[MAPREDUCE-2372] TaskLogAppender mechanism shouldn't be set in log4j.properties
[MAPREDUCE-2373] When tasks exit with a nonzero exit status, task runner should log the stderr as well as stdout
[MAPREDUCE-2374] Should not use PrintWriter to write taskjvm.sh
[MAPREDUCE-2377] task-controller fails to parse configuration if it doesn't end in \n
[MAPREDUCE-2379] Distributed cache sizing configurations are missing from mapred-default.xml

Version 1.1.3 Release Notes

General Information
MapR provides the following packages:

Apache Hadoop 0.20.2
flume-0.9.3
hbase-0.90.2
hive-0.7.0
oozie-3.0.0
pig-0.8
sqoop-1.2.0
whirr-0.3.0

New in This Release


This is a maintenance release with no new features.

Resolved Issues
(Issue 4307) Snapshot create fails with error EEXIST

Known Issues

(Issue 4269) Bulk Operations

The MapR Control System provides both a checkbox and a Select All link for selecting all alarms, nodes, snapshots, or volumes matching a filter,
even if there are too many results to display on a single screen. However, the following operations can only be performed on individually selected
results, or results selected using the Select Visible link at the bottom of the MapR Control System screen:

Volumes - Edit Volumes
Volumes - Remove Volumes
Volumes - New Snapshot
Volumes - Unmount
Mirror Volumes - Edit Volumes
Mirror Volumes - Remove Volumes
Mirror Volumes - Unmount
User Disk Usage - Edit
Snapshots - Remove
Snapshots - Preserve
Node Alarms - Change Topology
Nodes - Change Topology
Volume Alarms - Edit
Volume Alarms - Unmount
Volume Alarms - Remove
User/Group Alarms - Edit

In order to perform these operations on a large number of alarms, nodes, snapshots, or volumes, it is necessary to select each screenful of
results using Select Visible and perform the operation before selecting the next screenful of results.

(Issue 3122) Mirroring with fsck-repaired volume

If a source or mirror volume is repaired with fsck then the source and mirror volumes can go out of sync. It is necessary to perform a full mirror
operation with volume mirror start -full true to bring them back in sync. Similarly, when creating a dump file from a volume that has
been repaired with fsck, use -full true on the volume dump create command.
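
As a rough sketch, assuming the maprcli command-line interface and placeholder volume names (mirror01, src01, and the dump file path below are hypothetical), the commands might look like the following; the -name and -dumpfile parameters are assumptions based on typical maprcli usage and should be checked against the maprcli reference for your release:

# Force a full (non-incremental) mirror sync after an fsck repair
maprcli volume mirror start -name mirror01 -full true

# Create a full dump of a volume that has been repaired with fsck
maprcli volume dump create -name src01 -dumpfile /tmp/src01.dump -full true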

Version 1.1.2 Release Notes

General Information
MapR provides the following packages:

Apache Hadoop 0.20.2
flume-0.9.3
hbase-0.90.2
hive-0.7.0
oozie-3.0.0
pig-0.8
sqoop-1.2.0
whirr-0.3.0

New in This Release


This is a maintenance release with no new features.

Resolved Issues
(Issue 4037) Starting Newly Added Services (Documentation fix)
(Issue 4024) Hadoop Copy Commands Do Not Handle Broken Symbolic Links
(Issue 4018)(HDFS-1768) fs -put crash that depends on source file name
(Issue 3524) Apache Port 80 Open (Documentation fix)
(Issue 3488) Ubuntu IRQ Balancer Issue on Virtual Machines (Documentation fix)
(Issue 3244) Volume Mirror Issue
(Issue 3028) Changing the Time on a ZooKeeper Node (Documentation fix)

Known Issues

(Issue 4307) Snapshot create fails with error EEXIST

The EEXIST error indicates an attempt to create a new snapshot with the same name as an existing snapshot, but can occur in the following
cases as well:

If the node with the snapshot's name container fails during snapshot creation, the failed snapshot remains until it is removed by the CLDB
after 30 minutes.
If snapshot creation fails after reserving the name, then the name exists but the snapshot does not.
If the response to a successful snapshot creation is delayed by a network glitch and the operation is retried as a result, EEXIST correctly indicates that the snapshot already exists, even though it does not appear to.

In any of the above cases, either retry the snapshot with a different name, or delete the existing (or failed) snapshot and create it again.
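
As a sketch of the second option, assuming the maprcli command-line interface and placeholder names (vol01 and snap01 below are hypothetical), the existing or failed snapshot can be removed and recreated roughly as follows:

# Delete the existing (or failed) snapshot, then create it again
maprcli volume snapshot remove -volume vol01 -snapshotname snap01
maprcli volume snapshot create -volume vol01 -snapshotname snap01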

(Issue 4269) Bulk Operations


The MapR Control System provides both a checkbox and a Select All link for selecting all alarms, nodes, snapshots, or volumes matching a filter,
even if there are too many results to display on a single screen. However, the following operations can only be performed on individually selected
results, or results selected using the Select Visible link at the bottom of the MapR Control System screen:

Volumes - Edit Volumes
Volumes - Remove Volumes
Volumes - New Snapshot
Volumes - Unmount
Mirror Volumes - Edit Volumes
Mirror Volumes - Remove Volumes
Mirror Volumes - Unmount
User Disk Usage - Edit
Snapshots - Remove
Snapshots - Preserve
Node Alarms - Change Topology
Nodes - Change Topology
Volume Alarms - Edit
Volume Alarms - Unmount
Volume Alarms - Remove
User/Group Alarms - Edit

In order to perform these operations on a large number of alarms, nodes, snapshots, or volumes, it is necessary to select each screenful of
results using Select Visible and perform the operation before selecting the next screenful of results.

(Issue 3122) Mirroring with fsck-repaired volume

If a source or mirror volume is repaired with fsck then the source and mirror volumes can go out of sync. It is necessary to perform a full mirror
operation with volume mirror start -full true to bring them back in sync. Similarly, when creating a dump file from a volume that has
been repaired with fsck, use -full true on the volume dump create command.

Version 1.1.1 Release Notes

General Information
MapR provides the following packages:

Apache Hadoop 0.20.2
flume-0.9.3
hbase-0.90.2
hive-0.7.0
oozie-3.0.0
pig-0.8
sqoop-1.2.0
whirr-0.3.0

New in This Release

EMC License support

Packages built for EMC have "EMC" in the MapRBuildVersion (example: 1.1.0.10806EMC-1).

HBase LeaseRecovery

FSUtils LeaseRecovery is supported in HBase trunk and HBase 0.90.2. To run a different version of HBase with MapR, apply the following patches and compile HBase:

https://issues.apache.org/jira/secure/attachment/12489782/4169-v5.txt
https://issues.apache.org/jira/secure/attachment/12489818/4169-correction.txt
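
The following is a minimal sketch of applying these patches to an HBase source tree and rebuilding; the source path, patch level, and build command are assumptions and may need to be adjusted for your HBase version:

# Apply the LeaseRecovery patches inside the HBase source tree (path is a placeholder)
cd /path/to/hbase-src
wget https://issues.apache.org/jira/secure/attachment/12489782/4169-v5.txt
wget https://issues.apache.org/jira/secure/attachment/12489818/4169-correction.txt
patch -p0 < 4169-v5.txt          # adjust the patch level (-p0/-p1) if the patch does not apply
patch -p0 < 4169-correction.txt
mvn clean package -DskipTests    # or the build tool appropriate to your HBase version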

Resolved Issues
(Issue 4792) Synchronization in reading JobTracker address
(Issue 4910) File attributes sometimes not updated correctly in client cache
(HBase Jira 4169) FSUtils LeaseRecovery for non HDFS FileSystems
(Issue 4905) FSUtils LeaseRecovery for MapR

Known Issues
(Issue 4307) Snapshot create fails with error EEXIST

The EEXIST error indicates an attempt to create a new snapshot with the same name as an existing snapshot, but can occur in the following
cases as well:

If the node with the snapshot's name container fails during snapshot creation, the failed snapshot remains until it is removed by the CLDB
after 30 minutes.
If snapshot creation fails after reserving the name, then the name exists but the snapshot does not.
If the response to a successful snapshot creation is delayed by a network glitch and the operation is retried as a result, EEXIST correctly indicates that the snapshot already exists, even though it does not appear to.

In any of the above cases, either retry the snapshot with a different name, or delete the existing (or failed) snapshot and create it again.

(Issue 4269) Bulk Operations

The MapR Control System provides both a checkbox and a Select All link for selecting all alarms, nodes, snapshots, or volumes matching a filter,
even if there are too many results to display on a single screen. However, the following operations can only be performed on individually selected
results, or results selected using the Select Visible link at the bottom of the MapR Control System screen:

Volumes - Edit Volumes
Volumes - Remove Volumes
Volumes - New Snapshot
Volumes - Unmount
Mirror Volumes - Edit Volumes
Mirror Volumes - Remove Volumes
Mirror Volumes - Unmount
User Disk Usage - Edit
Snapshots - Remove
Snapshots - Preserve
Node Alarms - Change Topology
Nodes - Change Topology
Volume Alarms - Edit
Volume Alarms - Unmount
Volume Alarms - Remove
User/Group Alarms - Edit

In order to perform these operations on a large number of alarms, nodes, snapshots, or volumes, it is necessary to select each screenful of
results using Select Visible and perform the operation before selecting the next screenful of results.

(Issue 4037) Starting Newly Added Services

After you install new services on a node, you can start them in two ways:

Use the MapR Control System, the API, or the command-line interface to start the services individually
Restart the warden to stop and start all services on the node

If you start the services individually, the node's memory will not be reconfigured to account for the newly installed services. This can cause
memory paging, slowing or stopping the node. However, stopping and restarting the warden can take the node out of service.

For best results, install additional services on a node at a time when the cluster is not busy. If that is not possible, restart the warden as soon as it is practical after installing the new services.
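
If you choose the warden restart, a minimal sketch on a single node is shown below; the init script path assumes the standard MapR package layout:

# Restart the warden so the node's memory is reconfigured for the newly installed services
/etc/init.d/mapr-warden stop
/etc/init.d/mapr-warden start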

(Issue 4024) Hadoop Copy Commands Do Not Handle Broken Symbolic Links

The hadoop fs -copyToLocal and hadoop fs -copyFromLocal commands attempt to resolve symbolic links in the source data set, to
create physical copies of the files referred to by the links. If a broken symbolic link is encountered by either command, the copy operation fails at
that point.
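
To avoid the failure, broken symbolic links can be located in the source tree before copying, for example with GNU find (the path below is a placeholder):

# List broken symbolic links under the source directory
find /path/to/source -xtype l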

(Issue 4018)(HDFS-1768) fs -put crash that depends on source file name

Copying a file using the hadoop fs command generates a warning or exception if a corresponding checksum file .*.crc exists. If this error
occurs, delete all local checksum files and try again. See http://www.mail-archive.com/hdfs-issues@hadoop.apache.org/msg15824.html
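
For example, the stale local checksum files can be found and removed with a command along the following lines (the path is a placeholder; drop the -delete flag first to review what matches):

# Remove Hadoop-style local checksum files (.<filename>.crc) before retrying the copy
find /path/to/local/source -name '.*.crc' -type f -delete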

(Issue 3524) Apache Port 80 Open

The MapR UI runs on Apache. By default, installation does not close port 80 (even though the MapR Control System is available over HTTPS on
port 8443). If this would present a security risk to your datacenter, you should close port 80 manually on any nodes running the MapR Control
System.
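
One way to close the port, sketched below with iptables, is to drop inbound TCP traffic on port 80; reconfiguring or disabling the Apache listener is an equally valid alternative, depending on your environment:

# Block inbound HTTP on port 80 (run as root on each node running the MapR Control System)
iptables -A INPUT -p tcp --dport 80 -j DROP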

(Issue 3488) Ubuntu IRQ Balancer Issue on Virtual Machines


In VM environments like EC2, VMWare, and Xen, when running Ubuntu 10.10, problems can occur due to an Ubuntu bug unless the IRQ
balancer is turned off. On all nodes, edit the file /etc/default/irqbalance and set ENABLED=0 to turn off the IRQ balancer (requires reboot
to take effect).
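
A minimal sketch of the edit, applied with sed on each node (back up the file first if you prefer):

# Disable the IRQ balancer on Ubuntu 10.10 virtual machines, then reboot for the change to take effect
sed -i 's/^ENABLED=.*/ENABLED=0/' /etc/default/irqbalance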

(Issue 3244) Volume Mirror Issue

If a volume dump restore command is interrupted before completion (killed by the user, node fails, etc.) then the volume remains in the "Mirroring
in Progress" state. Before retrying the volume dump restore operation, you must issue the volume mirror stop command explicitly.
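
For example, assuming the maprcli command-line interface and placeholder names (vol01 and its dump file below are hypothetical), the recovery sequence might look like this; the -name and -dumpfile parameters are assumptions based on typical maprcli usage:

# Clear the "Mirroring in Progress" state, then retry the restore
maprcli volume mirror stop -name vol01
maprcli volume dump restore -name vol01 -dumpfile /tmp/vol01.dump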

(Issue 3122) Mirroring with fsck-repaired volume

If a source or mirror volume is repaired with fsck then the source and mirror volumes can go out of sync. It is necessary to perform a full mirror
operation with volume mirror start -full true to bring them back in sync. Similarly, when creating a dump file from a volume that has
been repaired with fsck, use -full true on the volume dump create command.

(Issue 3028) Changing the Time on a ZooKeeper Node

To avoid cluster downtime, use the following steps to set the time on any node running ZooKeeper:

1. Use the MapR Dashboard to check that all configured ZooKeeper services on the cluster are running. Start any non-running ZooKeeper
instances.
2. Stop ZooKeeper on the node:
/etc/init.d/mapr-zookeeper stop
3. Change the time on the node or sync the time to NTP.
4. Start ZooKeeper on the node:
/etc/init.d/mapr-zookeeper start
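
As a compact sketch of steps 2 through 4 on a single ZooKeeper node (the NTP server shown is a placeholder, and ntpdate is only one way to sync the clock):

/etc/init.d/mapr-zookeeper stop
ntpdate pool.ntp.org             # or set the clock manually with date
/etc/init.d/mapr-zookeeper start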

Version 1.1 Release Notes

General Information
MapR provides the following packages:

Apache Hadoop 0.20.2
flume-0.9.3
hbase-0.90.2
hive-0.7.0
oozie-3.0.0
pig-0.8
sqoop-1.2.0
whirr-0.3.0

New in This Release

Mac OS Client

A Mac OS client is now available. For more information, see Installing the MapR Client on Mac OS X.

Resolved Issues
(Issue 4415) Select and Kill Controls in JobTracker UI
(Issue 2809) Add Release Note for NFS Dependencies

Known Issues

(Issue 4307) Snapshot create fails with error EEXIST

The EEXIST error indicates an attempt to create a new snapshot with the same name as an existing snapshot, but can occur in the following
cases as well:

If the node with the snapshot's name container fails during snapshot creation, the failed snapshot remains until it is removed by the CLDB
after 30 minutes.
If snapshot creation fails after reserving the name, then the name exists but the snapshot does not.
If the response to a successful snapshot creation is delayed by a network glitch and the operation is retried as a result, EEXIST correctly indicates that the snapshot already exists, even though it does not appear to.

In any of the above cases, either retry the snapshot with a different name, or delete the existing (or failed) snapshot and create it again.
(Issue 4269) Bulk Operations

The MapR Control System provides both a checkbox and a Select All link for selecting all alarms, nodes, snapshots, or volumes matching a filter,
even if there are too many results to display on a single screen. However, the following operations can only be performed on individually selected
results, or results selected using the Select Visible link at the bottom of the MapR Control System screen:

Volumes - Edit Volumes
Volumes - Remove Volumes
Volumes - New Snapshot
Volumes - Unmount
Mirror Volumes - Edit Volumes
Mirror Volumes - Remove Volumes
Mirror Volumes - Unmount
User Disk Usage - Edit
Snapshots - Remove
Snapshots - Preserve
Node Alarms - Change Topology
Nodes - Change Topology
Volume Alarms - Edit
Volume Alarms - Unmount
Volume Alarms - Remove
User/Group Alarms - Edit

In order to perform these operations on a large number of alarms, nodes, snapshots, or volumes, it is necessary to select each screenful of
results using Select Visible and perform the operation before selecting the next screenful of results.

(Issue 4037) Starting Newly Added Services

After you install new services on a node, you can start them in two ways:

Use the MapR Control System, the API, or the command-line interface to start the services individually
Restart the warden to stop and start all services on the node

If you start the services individually, the node's memory will not be reconfigured to account for the newly installed services. This can cause
memory paging, slowing or stopping the node. However, stopping and restarting the warden can take the node out of service.

For best results, install additional services on a node at a time when the cluster is not busy. If that is not possible, restart the warden as soon as it is practical after installing the new services.

(Issue 4024) Hadoop Copy Commands Do Not Handle Broken Symbolic Links

The hadoop fs -copyToLocal and hadoop fs -copyFromLocal commands attempt to resolve symbolic links in the source data set, to
create physical copies of the files referred to by the links. If a broken symbolic link is encountered by either command, the copy operation fails at
that point.

(Issue 4018)(HDFS-1768) fs -put crash that depends on source file name

Copying a file using the hadoop fs command generates a warning or exception if a corresponding checksum file .*.crc exists. If this error
occurs, delete all local checksum files and try again. See http://www.mail-archive.com/hdfs-issues@hadoop.apache.org/msg15824.html

(Issue 3524) Apache Port 80 Open

The MapR UI runs on Apache. By default, installation does not close port 80 (even though the MapR Control System is available over HTTPS on
port 8443). If this would present a security risk to your datacenter, you should close port 80 manually on any nodes running the MapR Control
System.

(Issue 3488) Ubuntu IRQ Balancer Issue on Virtual Machines

In VM environments like EC2, VMWare, and Xen, when running Ubuntu 10.10, problems can occur due to an Ubuntu bug unless the IRQ
balancer is turned off. On all nodes, edit the file /etc/default/irqbalance and set ENABLED=0 to turn off the IRQ balancer (requires reboot
to take effect).

(Issue 3244) Volume Mirror Issue

If a volume dump restore command is interrupted before completion (killed by the user, node fails, etc.) then the volume remains in the "Mirroring
in Progress" state. Before retrying the volume dump restore operation, you must issue the volume mirror stop command explicitly.

(Issue 3122) Mirroring with fsck-repaired volume

If a source or mirror volume is repaired with fsck then the source and mirror volumes can go out of sync. It is necessary to perform a full mirror
operation with volume mirror start -full true to bring them back in sync. Similarly, when creating a dump file from a volume that has
been repaired with fsck, use -full true on the volume dump create command.

(Issue 3028) Changing the Time on a ZooKeeper Node

To avoid cluster downtime, use the following steps to set the time on any node running ZooKeeper:

1. Use the MapR Dashboard to check that all configured ZooKeeper services on the cluster are running. Start any non-running ZooKeeper
instances.
2. Stop ZooKeeper on the node:
/etc/init.d/mapr-zookeeper stop
3. Change the time on the node or sync the time to NTP.
4. Start ZooKeeper on the node:
/etc/init.d/mapr-zookeeper start

Hadoop Compatibility in Version 1.1


MapR provides the following packages:

Apache Hadoop 0.20.2
flume-0.9.3
hbase-0.90.2
hive-0.7.0
oozie-3.0.0
pig-0.8
sqoop-1.2.0
whirr-0.3.0

Hadoop Common Patches

MapR 1.1 includes the following Apache Hadoop patches that are not included in the Apache Hadoop base version 0.20.2:

[HADOOP-1722] Make streaming to handle non-utf8 byte array
[HADOOP-1849] IPC server max queue size should be configurable
[HADOOP-2141] speculative execution start up condition based on completion time
[HADOOP-2366] Space in the value for dfs.data.dir can cause great problems
[HADOOP-2721] Use job control for tasks (and therefore for pipes and streaming)
[HADOOP-2838] Add HADOOP_LIBRARY_PATH config setting so Hadoop will include external directories for jni
[HADOOP-3327] Shuffling fetchers waited too long between map output fetch re-tries
[HADOOP-3659] Patch to allow hadoop native to compile on Mac OS X
[HADOOP-4012] Providing splitting support for bzip2 compressed files
[HADOOP-4041] IsolationRunner does not work as documented
[HADOOP-4490] Map and Reduce tasks should run as the user who submitted the job
[HADOOP-4655] FileSystem.CACHE should be ref-counted
[HADOOP-4656] Add a user to groups mapping service
[HADOOP-4675] Current Ganglia metrics implementation is incompatible with Ganglia 3.1
[HADOOP-4829] Allow FileSystem shutdown hook to be disabled
[HADOOP-4842] Streaming combiner should allow command, not just JavaClass
[HADOOP-4930] Implement setuid executable for Linux to assist in launching tasks as job owners
[HADOOP-4933] ConcurrentModificationException in JobHistory.java
[HADOOP-5170] Set max map/reduce tasks on a per-job basis, either per-node or cluster-wide
[HADOOP-5175] Option to prohibit jars unpacking
[HADOOP-5203] TT's version build is too restrictive
[HADOOP-5396] Queue ACLs should be refreshed without requiring a restart of the job tracker
[HADOOP-5419] Provide a way for users to find out what operations they can do on which M/R queues
[HADOOP-5420] Support killing of process groups in LinuxTaskController binary
[HADOOP-5442] The job history display needs to be paged
[HADOOP-5450] Add support for application-specific typecodes to typed bytes
[HADOOP-5469] Exposing Hadoop metrics via HTTP
[HADOOP-5476] calling new SequenceFile.Reader(...) leaves an InputStream open, if the given sequence file is broken
[HADOOP-5488] HADOOP-2721 doesn't clean up descendant processes of a jvm that exits cleanly after running a task successfully
[HADOOP-5528] Binary partitioner
[HADOOP-5582] Hadoop Vaidya throws number format exception due to changes in the job history counters string format (escaped compact
representation).
[HADOOP-5592] Hadoop Streaming - GzipCodec
[HADOOP-5613] change S3Exception to checked exception
[HADOOP-5643] Ability to blacklist tasktracker
[HADOOP-5656] Counter for S3N Read Bytes does not work
[HADOOP-5675] DistCp should not launch a job if it is not necessary
[HADOOP-5733] Add map/reduce slot capacity and lost map/reduce slot capacity to JobTracker metrics
[HADOOP-5737] UGI checks in testcases are broken
[HADOOP-5738] Split waiting tasks field in JobTracker metrics to individual tasks
[HADOOP-5745] Allow setting the default value of maxRunningJobs for all pools
[HADOOP-5784] The length of the heartbeat cycle should be configurable.
[HADOOP-5801] JobTracker should refresh the hosts list upon recovery
[HADOOP-5805] problem using top level s3 buckets as input/output directories
[HADOOP-5861] s3n files are not getting split by default
[HADOOP-5879] GzipCodec should read compression level etc from configuration
[HADOOP-5913] Allow administrators to be able to start and stop queues
[HADOOP-5958] Use JDK 1.6 File APIs in DF.java wherever possible
[HADOOP-5976] create script to provide classpath for external tools
[HADOOP-5980] LD_LIBRARY_PATH not passed to tasks spawned off by LinuxTaskController
[HADOOP-5981] HADOOP-2838 doesnt work as expected
[HADOOP-6132] RPC client opens an extra connection for VersionedProtocol
[HADOOP-6133] ReflectionUtils performance regression
[HADOOP-6148] Implement a pure Java CRC32 calculator
[HADOOP-6161] Add get/setEnum to Configuration
[HADOOP-6166] Improve PureJavaCrc32
[HADOOP-6184] Provide a configuration dump in json format.
[HADOOP-6227] Configuration does not lock parameters marked final if they have no value.
[HADOOP-6234] Permission configuration files should use octal and symbolic
[HADOOP-6254] s3n fails with SocketTimeoutException
[HADOOP-6269] Missing synchronization for defaultResources in Configuration.addResource
[HADOOP-6279] Add JVM memory usage to JvmMetrics
[HADOOP-6284] Any hadoop commands crashing jvm (SIGBUS) when /tmp (tmpfs) is full
[HADOOP-6299] Use JAAS LoginContext for our login
[HADOOP-6312] Configuration sends too much data to log4j
[HADOOP-6337] Update FilterInitializer class to be more visible and take a conf for further development
[HADOOP-6343] Stack trace of any runtime exceptions should be recorded in the server logs.
[HADOOP-6400] Log errors getting Unix UGI
[HADOOP-6408] Add a /conf servlet to dump running configuration
[HADOOP-6419] Change RPC layer to support SASL based mutual authentication
[HADOOP-6433] Add AsyncDiskService that is used in both hdfs and mapreduce
[HADOOP-6441] Prevent remote CSS attacks in Hostname and UTF-7.
[HADOOP-6453] Hadoop wrapper script shouldn't ignore an existing JAVA_LIBRARY_PATH
[HADOOP-6471] StringBuffer -> StringBuilder - conversion of references as necessary
[HADOOP-6496] HttpServer sends wrong content-type for CSS files (and others)
[HADOOP-6510] doAs for proxy user
[HADOOP-6521] FsPermission:SetUMask not updated to use new-style umask setting.
[HADOOP-6534] LocalDirAllocator should use whitespace trimming configuration getters
[HADOOP-6543] Allow authentication-enabled RPC clients to connect to authentication-disabled RPC servers
[HADOOP-6558] archive does not work with distcp -update
[HADOOP-6568] Authorization for default servlets
[HADOOP-6569] FsShell#cat should avoid calling unecessary getFileStatus before opening a file to read
[HADOOP-6572] RPC responses may be out-of-order with respect to SASL
[HADOOP-6577] IPC server response buffer reset threshold should be configurable
[HADOOP-6578] Configuration should trim whitespace around a lot of value types
[HADOOP-6599] Split RPC metrics into summary and detailed metrics
[HADOOP-6609] Deadlock in DFSClient#getBlockLocations even with the security disabled
[HADOOP-6613] RPC server should check for version mismatch first
[HADOOP-6627] "Bad Connection to FS" message in FSShell should print message from the exception
[HADOOP-6631] FileUtil.fullyDelete() should continue to delete other files despite failure at any level.
[HADOOP-6634] AccessControlList uses full-principal names to verify acls causing queue-acls to fail
[HADOOP-6637] Benchmark overhead of RPC session establishment
[HADOOP-6640] FileSystem.get() does RPC retries within a static synchronized block
[HADOOP-6644] util.Shell getGROUPS_FOR_USER_COMMAND method name - should use common naming convention
[HADOOP-6649] login object in UGI should be inside the subject
[HADOOP-6652] ShellBasedUnixGroupsMapping shouldn't have a cache
[HADOOP-6653] NullPointerException in setupSaslConnection when browsing directories
[HADOOP-6663] BlockDecompressorStream get EOF exception when decompressing the file compressed from empty file
[HADOOP-6667] RPC.waitForProxy should retry through NoRouteToHostException
[HADOOP-6669] zlib.compress.level ignored for DefaultCodec initialization
[HADOOP-6670] UserGroupInformation doesn't support use in hash tables
[HADOOP-6674] Performance Improvement in Secure RPC
[HADOOP-6687] user object in the subject in UGI should be reused in case of a relogin.
[HADOOP-6701] Incorrect exit codes for "dfs -chown", "dfs -chgrp"
[HADOOP-6706] Relogin behavior for RPC clients could be improved
[HADOOP-6710] Symbolic umask for file creation is not consistent with posix
[HADOOP-6714] FsShell 'hadoop fs -text' does not support compression codecs
[HADOOP-6718] Client does not close connection when an exception happens during SASL negotiation
[HADOOP-6722] NetUtils.connect should check that it hasn't connected a socket to itself
[HADOOP-6723] unchecked exceptions thrown in IPC Connection orphan clients
[HADOOP-6724] IPC doesn't properly handle IOEs thrown by socket factory
[HADOOP-6745] adding some java doc to Server.RpcMetrics, UGI
[HADOOP-6757] NullPointerException for hadoop clients launched from streaming tasks
[HADOOP-6760] WebServer shouldn't increase port number in case of negative port setting caused by Jetty's race
[HADOOP-6762] exception while doing RPC I/O closes channel
[HADOOP-6776] UserGroupInformation.createProxyUser's javadoc is broken
[HADOOP-6813] Add a new newInstance method in FileSystem that takes a "user" as argument
[HADOOP-6815] refreshSuperUserGroupsConfiguration should use server side configuration for the refresh
[HADOOP-6818] Provide a JNI-based implementation of GroupMappingServiceProvider
[HADOOP-6832] Provide a web server plugin that uses a static user for the web UI
[HADOOP-6833] IPC leaks call parameters when exceptions thrown
[HADOOP-6859] Introduce additional statistics to FileSystem
[HADOOP-6864] Provide a JNI-based implementation of ShellBasedUnixGroupsNetgroupMapping (implementation of
GroupMappingServiceProvider)
[HADOOP-6881] The efficient comparators aren't always used except for BytesWritable and Text
[HADOOP-6899] RawLocalFileSystem#setWorkingDir() does not work for relative names
[HADOOP-6907] Rpc client doesn't use the per-connection conf to figure out server's Kerberos principal
[HADOOP-6925] BZip2Codec incorrectly implements read()
[HADOOP-6928] Fix BooleanWritable comparator in 0.20
[HADOOP-6943] The GroupMappingServiceProvider interface should be public
[HADOOP-6950] Suggest that HADOOP_CLASSPATH should be preserved in hadoop-env.sh.template
[HADOOP-6995] Allow wildcards to be used in ProxyUsers configurations
[HADOOP-7082] Configuration.writeXML should not hold lock while outputting
[HADOOP-7101] UserGroupInformation.getCurrentUser() fails when called from non-Hadoop JAAS context
[HADOOP-7104] Remove unnecessary DNS reverse lookups from RPC layer
[HADOOP-7110] Implement chmod with JNI
[HADOOP-7114] FsShell should dump all exceptions at DEBUG level
[HADOOP-7115] Add a cache for getpwuid_r and getpwgid_r calls
[HADOOP-7118] NPE in Configuration.writeXml
[HADOOP-7122] Timed out shell commands leak Timer threads
[HADOOP-7156] getpwuid_r is not thread-safe on RHEL6
[HADOOP-7172] SecureIO should not check owner on non-secure clusters that have no native support
[HADOOP-7173] Remove unused fstat() call from NativeIO
[HADOOP-7183] WritableComparator.get should not cache comparator objects
[HADOOP-7184] Remove deprecated local.cache.size from core-default.xml

MapReduce Patches

MapR 1.1 includes the following Apache MapReduce patches that are not included in the Apache Hadoop base version 0.20.2:

[MAPREDUCE-112] Reduce Input Records and Reduce Output Records counters are not being set when using the new Mapreduce reducer API
[MAPREDUCE-118] Job.getJobID() will always return null
[MAPREDUCE-144] TaskMemoryManager should log process-tree's status while killing tasks.
[MAPREDUCE-181] Secure job submission
[MAPREDUCE-211] Provide a node health check script and run it periodically to check the node health status
[MAPREDUCE-220] Collecting cpu and memory usage for MapReduce tasks
[MAPREDUCE-270] TaskTracker could send an out-of-band heartbeat when the last running map/reduce completes
[MAPREDUCE-277] Job history counters should be avaible on the UI.
[MAPREDUCE-339] JobTracker should give preference to failed tasks over virgin tasks so as to terminate the job ASAP if it is eventually going to
fail.
[MAPREDUCE-364] Change org.apache.hadoop.examples.MultiFileWordCount to use new mapreduce api.
[MAPREDUCE-369] Change org.apache.hadoop.mapred.lib.MultipleInputs to use new api.
[MAPREDUCE-370] Change org.apache.hadoop.mapred.lib.MultipleOutputs to use new api.
[MAPREDUCE-415] JobControl Job does always has an unassigned name
[MAPREDUCE-416] Move the completed jobs' history files to a DONE subdirectory inside the configured history directory
[MAPREDUCE-461] Enable ServicePlugins for the JobTracker
[MAPREDUCE-463] The job setup and cleanup tasks should be optional
[MAPREDUCE-467] Collect information about number of tasks succeeded / total per time unit for a tasktracker.
[MAPREDUCE-476] extend DistributedCache to work locally (LocalJobRunner)
[MAPREDUCE-478] separate jvm param for mapper and reducer
[MAPREDUCE-516] Fix the 'cluster drain' problem in the Capacity Scheduler wrt High RAM Jobs
[MAPREDUCE-517] The capacity-scheduler should assign multiple tasks per heartbeat
[MAPREDUCE-521] After JobTracker restart Capacity Schduler does not schedules pending tasks from already running tasks.
[MAPREDUCE-532] Allow admins of the Capacity Scheduler to set a hard-limit on the capacity of a queue
[MAPREDUCE-551] Add preemption to the fair scheduler
[MAPREDUCE-572] If #link is missing from uri format of -cacheArchive then streaming does not throw error.
[MAPREDUCE-655] Change KeyValueLineRecordReader and KeyValueTextInputFormat to use new api.
[MAPREDUCE-676] Existing diagnostic rules fail for MAP ONLY jobs
[MAPREDUCE-679] XML-based metrics as JSP servlet for JobTracker
[MAPREDUCE-680] Reuse of Writable objects is improperly handled by MRUnit
[MAPREDUCE-682] Reserved tasktrackers should be removed when a node is globally blacklisted
[MAPREDUCE-693] Conf files not moved to "done" subdirectory after JT restart
[MAPREDUCE-698] Per-pool task limits for the fair scheduler
[MAPREDUCE-706] Support for FIFO pools in the fair scheduler
[MAPREDUCE-707] Provide a jobconf property for explicitly assigning a job to a pool
[MAPREDUCE-709] node health check script does not display the correct message on timeout
[MAPREDUCE-714] JobConf.findContainingJar unescapes unnecessarily on Linux
[MAPREDUCE-716] org.apache.hadoop.mapred.lib.db.DBInputformat not working with oracle
[MAPREDUCE-722] More slots are getting reserved for HiRAM job tasks then required
[MAPREDUCE-732] node health check script should not log "UNHEALTHY" status for every heartbeat in INFO mode
[MAPREDUCE-734] java.util.ConcurrentModificationException observed in unreserving slots for HiRam Jobs
[MAPREDUCE-739] Allow relative paths to be created inside archives.
[MAPREDUCE-740] Provide summary information per job once a job is finished.
[MAPREDUCE-744] Support in DistributedCache to share cache files with other users after HADOOP-4493
[MAPREDUCE-754] NPE in expiry thread when a TT is lost
[MAPREDUCE-764] TypedBytesInput's readRaw() does not preserve custom type codes
[MAPREDUCE-768] Configuration information should generate dump in a standard format.
[MAPREDUCE-771] Setup and cleanup tasks remain in UNASSIGNED state for a long time on tasktrackers with long running high RAM tasks
[MAPREDUCE-782] Use PureJavaCrc32 in mapreduce spills
[MAPREDUCE-787] -files, -archives should honor user given symlink path
[MAPREDUCE-809] Job summary logs show status of completed jobs as RUNNING
[MAPREDUCE-814] Move completed Job history files to HDFS
[MAPREDUCE-817] Add a cache for retired jobs with minimal job info and provide a way to access history file url
[MAPREDUCE-825] JobClient completion poll interval of 5s causes slow tests in local mode
[MAPREDUCE-840] DBInputFormat leaves open transaction
[MAPREDUCE-842] Per-job local data on the TaskTracker node should have right access-control
[MAPREDUCE-856] Localized files from DistributedCache should have right access-control
[MAPREDUCE-871] Job/Task local files have incorrect group ownership set by LinuxTaskController binary
[MAPREDUCE-875] Make DBRecordReader execute queries lazily
[MAPREDUCE-885] More efficient SQL queries for DBInputFormat
[MAPREDUCE-890] After HADOOP-4491, the user who started mapred system is not able to run job.
[MAPREDUCE-896] Users can set non-writable permissions on temporary files for TT and can abuse disk usage.
[MAPREDUCE-899] When using LinuxTaskController, localized files may become accessible to unintended users if permissions are
misconfigured.
[MAPREDUCE-927] Cleanup of task-logs should happen in TaskTracker instead of the Child
[MAPREDUCE-947] OutputCommitter should have an abortJob method
[MAPREDUCE-964] Inaccurate values in jobSummary logs
[MAPREDUCE-967] TaskTracker does not need to fully unjar job jars
[MAPREDUCE-968] NPE in distcp encountered when placing _logs directory on S3FileSystem
[MAPREDUCE-971] distcp does not always remove distcp.tmp.dir
[MAPREDUCE-1028] Cleanup tasks are scheduled using high memory configuration, leaving tasks in unassigned state.
[MAPREDUCE-1030] Reduce tasks are getting starved in capacity scheduler
[MAPREDUCE-1048] Show total slot usage in cluster summary on jobtracker webui
[MAPREDUCE-1059] distcp can generate uneven map task assignments
[MAPREDUCE-1083] Use the user-to-groups mapping service in the JobTracker
[MAPREDUCE-1085] For tasks, "ulimit -v -1" is being run when user doesn't specify mapred.child.ulimit
[MAPREDUCE-1086] hadoop commands in streaming tasks are trying to write to tasktracker's log
[MAPREDUCE-1088] JobHistory files should have narrower 0600 perms
[MAPREDUCE-1089] Fair Scheduler preemption triggers NPE when tasks are scheduled but not running
[MAPREDUCE-1090] Modify log statement in Tasktracker log related to memory monitoring to include attempt id.
[MAPREDUCE-1098] Incorrect synchronization in DistributedCache causes TaskTrackers to freeze up during localization of Cache for tasks.
[MAPREDUCE-1100] User's task-logs filling up local disks on the TaskTrackers
[MAPREDUCE-1103] Additional JobTracker metrics
[MAPREDUCE-1105] CapacityScheduler: It should be possible to set queue hard-limit beyond it's actual capacity
[MAPREDUCE-1118] Capacity Scheduler scheduling information is hard to read / should be tabular format
[MAPREDUCE-1131] Using profilers other than hprof can cause JobClient to report job failure
[MAPREDUCE-1140] Per cache-file refcount can become negative when tasks release distributed-cache files
[MAPREDUCE-1143] runningMapTasks counter is not properly decremented in case of failed Tasks.
[MAPREDUCE-1155] Streaming tests swallow exceptions
[MAPREDUCE-1158] running_maps is not decremented when the tasks of a job is killed/failed
[MAPREDUCE-1160] Two log statements at INFO level fill up jobtracker logs
[MAPREDUCE-1171] Lots of fetch failures
[MAPREDUCE-1178] MultipleInputs fails with ClassCastException
[MAPREDUCE-1185] URL to JT webconsole for running job and job history should be the same
[MAPREDUCE-1186] While localizing a DistributedCache file, TT sets permissions recursively on the whole base-dir
[MAPREDUCE-1196] MAPREDUCE-947 incompatibly changed FileOutputCommitter
[MAPREDUCE-1198] Alternatively schedule different types of tasks in fair share scheduler
[MAPREDUCE-1213] TaskTrackers restart is very slow because it deletes distributed cache directory synchronously
[MAPREDUCE-1219] JobTracker Metrics causes undue load on JobTracker
[MAPREDUCE-1221] Kill tasks on a node if the free physical memory on that machine falls below a configured threshold
[MAPREDUCE-1231] Distcp is very slow
[MAPREDUCE-1250] Refactor job token to use a common token interface
[MAPREDUCE-1258] Fair scheduler event log not logging job info
[MAPREDUCE-1285] DistCp cannot handle -delete if destination is local filesystem
[MAPREDUCE-1288] DistributedCache localizes only once per cache URI
[MAPREDUCE-1293] AutoInputFormat doesn't work with non-default FileSystems
[MAPREDUCE-1302] TrackerDistributedCacheManager can delete file asynchronously
[MAPREDUCE-1304] Add counters for task time spent in GC
[MAPREDUCE-1307] Introduce the concept of Job Permissions
[MAPREDUCE-1313] NPE in FieldFormatter if escape character is set and field is null
[MAPREDUCE-1316] JobTracker holds stale references to retired jobs via unreported tasks
[MAPREDUCE-1342] Potential JT deadlock in faulty TT tracking
[MAPREDUCE-1354] Incremental enhancements to the JobTracker for better scalability
[MAPREDUCE-1372] ConcurrentModificationException in JobInProgress
[MAPREDUCE-1378] Args in job details links on jobhistory.jsp are not URL encoded
[MAPREDUCE-1382] MRAsyncDiscService should tolerate missing local.dir
[MAPREDUCE-1397] NullPointerException observed during task failures
[MAPREDUCE-1398] TaskLauncher remains stuck on tasks waiting for free nodes even if task is killed.
[MAPREDUCE-1399] The archive command shows a null error message
[MAPREDUCE-1403] Save file-sizes of each of the artifacts in DistributedCache in the JobConf
[MAPREDUCE-1421] LinuxTaskController tests failing on trunk after the commit of MAPREDUCE-1385
[MAPREDUCE-1422] Changing permissions of files/dirs under job-work-dir may be needed sothat cleaning up of job-dir in all
mapred-local-directories succeeds always
[MAPREDUCE-1423] Improve performance of CombineFileInputFormat when multiple pools are configured
[MAPREDUCE-1425] archive throws OutOfMemoryError
[MAPREDUCE-1435] symlinks in cwd of the task are not handled properly after MAPREDUCE-896
[MAPREDUCE-1436] Deadlock in preemption code in fair scheduler
[MAPREDUCE-1440] MapReduce should use the short form of the user names
[MAPREDUCE-1441] Configuration of directory lists should trim whitespace
[MAPREDUCE-1442] StackOverflowError when JobHistory parses a really long line
[MAPREDUCE-1443] DBInputFormat can leak connections
[MAPREDUCE-1454] The servlets should quote server generated strings sent in the response
[MAPREDUCE-1455] Authorization for servlets
[MAPREDUCE-1457] For secure job execution, couple of more UserGroupInformation.doAs needs to be added
[MAPREDUCE-1464] In JobTokenIdentifier change method getUsername to getUser which returns UGI
[MAPREDUCE-1466] FileInputFormat should save #input-files in JobConf
[MAPREDUCE-1476] committer.needsTaskCommit should not be called for a task cleanup attempt
[MAPREDUCE-1480] CombineFileRecordReader does not properly initialize child RecordReader
[MAPREDUCE-1493] Authorization for job-history pages
[MAPREDUCE-1503] Push HADOOP-6551 into MapReduce
[MAPREDUCE-1505] Cluster class should create the rpc client only when needed
[MAPREDUCE-1521] Protection against incorrectly configured reduces
[MAPREDUCE-1522] FileInputFormat may change the file system of an input path
[MAPREDUCE-1526] Cache the job related information while submitting the job , this would avoid many RPC calls to JobTracker.
[MAPREDUCE-1533] Reduce or remove usage of String.format() usage in CapacityTaskScheduler.updateQSIObjects and
Counters.makeEscapedString()
[MAPREDUCE-1538] TrackerDistributedCacheManager can fail because the number of subdirectories reaches system limit
[MAPREDUCE-1543] Log messages of JobACLsManager should use security logging of HADOOP-6586
[MAPREDUCE-1545] Add 'first-task-launched' to job-summary
[MAPREDUCE-1550] UGI.doAs should not be used for getting the history file of jobs
[MAPREDUCE-1563] Task diagnostic info would get missed sometimes.
[MAPREDUCE-1570] Shuffle stage - Key and Group Comparators
[MAPREDUCE-1607] Task controller may not set permissions for a task cleanup attempt's log directory
[MAPREDUCE-1609] TaskTracker.localizeJob should not set permissions on job log directory recursively
[MAPREDUCE-1611] Refresh nodes and refresh queues doesnt work with service authorization enabled
[MAPREDUCE-1612] job conf file is not accessible from job history web page
[MAPREDUCE-1621] Streaming's TextOutputReader.getLastOutput throws NPE if it has never read any output
[MAPREDUCE-1635] ResourceEstimator does not work after MAPREDUCE-842
[MAPREDUCE-1641] Job submission should fail if same uri is added for mapred.cache.files and mapred.cache.archives
[MAPREDUCE-1656] JobStory should provide queue info.
[MAPREDUCE-1657] After task logs directory is deleted, tasklog servlet displays wrong error message about job ACLs
[MAPREDUCE-1664] Job Acls affect Queue Acls
[MAPREDUCE-1680] Add a metrics to track the number of heartbeats processed
[MAPREDUCE-1682] Tasks should not be scheduled after tip is killed/failed.
[MAPREDUCE-1683] Remove JNI calls from ClusterStatus cstr
[MAPREDUCE-1699] JobHistory shouldn't be disabled for any reason
[MAPREDUCE-1707] TaskRunner can get NPE in getting ugi from TaskTracker
[MAPREDUCE-1716] Truncate logs of finished tasks to prevent node thrash due to excessive logging
[MAPREDUCE-1733] Authentication between pipes processes and java counterparts.
[MAPREDUCE-1734] Un-deprecate the old MapReduce API in the 0.20 branch
[MAPREDUCE-1744] DistributedCache creates its own FileSytem instance when adding a file/archive to the path
[MAPREDUCE-1754] Replace mapred.persmissions.supergroup with an acl : mapreduce.cluster.administrators
[MAPREDUCE-1759] Exception message for unauthorized user doing killJob, killTask, setJobPriority needs to be improved
[MAPREDUCE-1778] CompletedJobStatusStore initialization should fail if {mapred.job.tracker.persist.jobstatus.dir} is unwritable
[MAPREDUCE-1784] IFile should check for null compressor
[MAPREDUCE-1785] Add streaming config option for not emitting the key
[MAPREDUCE-1832] Support for file sizes less than 1MB in DFSIO benchmark.
[MAPREDUCE-1845] FairScheduler.tasksToPeempt() can return negative number
[MAPREDUCE-1850] Include job submit host information (name and ip) in jobconf and jobdetails display
[MAPREDUCE-1853] MultipleOutputs does not cache TaskAttemptContext
[MAPREDUCE-1868] Add read timeout on userlog pull
[MAPREDUCE-1872] Re-think (user|queue) limits on (tasks|jobs) in the CapacityScheduler
[MAPREDUCE-1887] MRAsyncDiskService does not properly absolutize volume root paths
[MAPREDUCE-1900] MapReduce daemons should close FileSystems that are not needed anymore
[MAPREDUCE-1914] TrackerDistributedCacheManager never cleans its input directories
[MAPREDUCE-1938] Ability for having user's classes take precedence over the system classes for tasks' classpath
[MAPREDUCE-1960] Limit the size of jobconf.
[MAPREDUCE-1961] ConcurrentModificationException when shutting down Gridmix
[MAPREDUCE-1985] java.lang.ArrayIndexOutOfBoundsException in analysejobhistory.jsp of jobs with 0 maps
[MAPREDUCE-2023] TestDFSIO read test may not read specified bytes.
[MAPREDUCE-2082] Race condition in writing the jobtoken password file when launching pipes jobs
[MAPREDUCE-2096] Secure local filesystem IO from symlink vulnerabilities
[MAPREDUCE-2103] task-controller shouldn't require o-r permissions
[MAPREDUCE-2157] safely handle InterruptedException and interrupted status in MR code
[MAPREDUCE-2178] Race condition in LinuxTaskController permissions handling
[MAPREDUCE-2219] JT should not try to remove mapred.system.dir during startup
[MAPREDUCE-2234] If Localizer can't create task log directory, it should fail on the spot
[MAPREDUCE-2235] JobTracker "over-synchronization" makes it hang up in certain cases
[MAPREDUCE-2242] LinuxTaskController doesn't properly escape environment variables
[MAPREDUCE-2253] Servlets should specify content type
[MAPREDUCE-2256] FairScheduler fairshare preemption from multiple pools may preempt all tasks from one pool causing that pool to go below
fairshare.
[MAPREDUCE-2289] Permissions race can make getStagingDir fail on local filesystem
[MAPREDUCE-2321] TT should fail to start on secure cluster when SecureIO isn't available
[MAPREDUCE-2323] Add metrics to the fair scheduler
[MAPREDUCE-2328] memory-related configurations missing from mapred-default.xml
[MAPREDUCE-2332] Improve error messages when MR dirs on local FS have bad ownership
[MAPREDUCE-2351] mapred.job.tracker.history.completed.location should support an arbitrary filesystem URI
[MAPREDUCE-2353] Make the MR changes to reflect the API changes in SecureIO library
[MAPREDUCE-2356] A task succeeded even though there were errors on all attempts.
[MAPREDUCE-2364] Shouldn't hold lock on rjob while localizing resources.
[MAPREDUCE-2366] TaskTracker can't retrieve stdout and stderr from web UI
[MAPREDUCE-2371] TaskLogsTruncater does not need to check log ownership when running as Child
[MAPREDUCE-2372] TaskLogAppender mechanism shouldn't be set in log4j.properties
[MAPREDUCE-2373] When tasks exit with a nonzero exit status, task runner should log the stderr as well as stdout
[MAPREDUCE-2374] Should not use PrintWriter to write taskjvm.sh
[MAPREDUCE-2377] task-controller fails to parse configuration if it doesn't end in \n
[MAPREDUCE-2379] Distributed cache sizing configurations are missing from mapred-default.xml

Version 1.0 Release Notes

General Information
MapR provides the following packages:

Apache Hadoop 0.20.2


flume-0.9.3
hbase-0.90.2
hive-0.7.0
oozie-3.0.0
pig-0.8
sqoop-1.2.0
whirr-0.3.0

MapR GA Version 1.0 Documentation

New in This Release

Rolling Upgrade

The script rollingupgrade.sh upgrades a MapR cluster to a specified version of the MapR software, or to a specific set of MapR packages,
either via SSH or node by node. This makes it easy to upgrade a MapR cluster with a minimum of downtime.

32-Bit Client

The MapR Client can now be installed on both 64-bit and 32-bit computers. See MapR Client.

Core File Removal

In the event of a core dump on a node, MapR writes the core file to the /opt/cores directory. If disk space on the node is nearly full, MapR
automatically reclaims space by deleting core files. To prevent a specific core file from being deleted, rename the file to start with a period (.).
Example:

mv mfs.core.2127.node12 .mfs.core.2127.node12

Resolved Issues
Removing Nodes
(Issue 4068) Upgrading Red Hat
(Issue 3984) HBase Upgrade
(Issue 3965) Volume Dump Restore Failure
(Issue 3890) Sqoop Requires HBase
(Issue 3560) Intermittent Scheduled Mirror Failure
(Issue 2949) NFS Mounting Issue on Ubuntu
(Issue 2815) File Cleanup is Slow

Known Issues

(Issue 4415) Select and Kill Controls in JobTracker UI

The Select and Kill controls in the JobTracker UI appear when the webinterface.private.actions parameter in mapred-site.xml is set
to true. In MapR clusters upgraded from the beta version of the software, the parameter must be added manually for the controls to appear.

To enable the Select and Kill controls in the JobTracker UI, copy the following lines from
/opt/mapr/hadoop/hadoop-0.20.2/conf.new/mapred-site.xml to
/opt/mapr/hadoop/hadoop-0.20.2/conf/mapred-site.xml:

<!-- JobTracker's Web Interface Configuration -->
<property>
  <name>webinterface.private.actions</name>
  <value>true</value>
  <description>If set to true, jobs can be killed from JT's web interface.
    Enable this option if the interfaces are only reachable by
    those who have the right authorization.
  </description>
</property>

(Issue 4307) Snapshot create fails with error EEXIST

The EEXIST error indicates an attempt to create a new snapshot with the same name as an existing snapshot, but can occur in the following
cases as well:

If the node with the snapshot's name container fails during snapshot creation, the failed snapshot remains until it is removed by the CLDB
after 30 minutes.
If snapshot creation fails after reserving the name, then the name exists but the snapshot does not.
If the response to a successful snapshot is delayed by a network glitch and the snapshot operation is retried as a result, EEXIST is
returned correctly, because the first request did create the snapshot even though it appears not to exist.

In any of the above cases, either retry the snapshot with a different name, or delete the existing (or failed) snapshot and create it again.

(Issue 4269) Bulk Operations

The MapR Control System provides both a checkbox and a Select All link for selecting all alarms, nodes, snapshots, or volumes matching a filter,
even if there are too many results to display on a single screen. However, the following operations can only be performed on individually selected
results, or results selected using the Select Visible link at the bottom of the MapR Control System screen:

Volumes - Edit Volumes
Volumes - Remove Volumes
Volumes - New Snapshot
Volumes - Unmount
Mirror Volumes - Edit Volumes
Mirror Volumes - Remove Volumes
Mirror Volumes - Unmount
User Disk Usage - Edit
Snapshots - Remove
Snapshots - Preserve
Node Alarms - Change Topology
Nodes - Change Topology
Volume Alarms - Edit
Volume Alarms - Unmount
Volume Alarms - Remove
User/Group Alarms - Edit

To perform these operations on a large number of alarms, nodes, snapshots, or volumes, select each screenful of results using Select Visible
and perform the operation before selecting the next screenful of results.

(Issue 4037) Starting Newly Added Services

After you install new services on a node, you can start them in two ways:

Use the MapR Control System, the API, or the command-line interface to start the services individually
Restart the warden to stop and start all services on the node

If you start the services individually, the node's memory will not be reconfigured to account for the newly installed services. This can cause
memory paging, slowing or stopping the node. However, stopping and restarting the warden can take the node out of service.

For best results, install additional services on a node at a time when the cluster is not busy. If that is not possible, make sure to restart the
warden as soon as it is practical to do so after installing the new services.
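For example, the warden restart might be performed as follows (a minimal sketch using the init script shown elsewhere in these notes; run as
root or with sudo):

# Restart the warden so the node's memory is reconfigured for the newly installed services
/etc/init.d/mapr-warden stop
/etc/init.d/mapr-warden start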

(Issue 4024) Hadoop Copy Commands Do Not Handle Broken Symbolic Links

The hadoop fs -copyToLocal and hadoop fs -copyFromLocal commands attempt to resolve symbolic links in the source data set, to
create physical copies of the files referred to by the links. If a broken symbolic link is encountered by either command, the copy operation fails at
that point.
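To spot broken symbolic links in a source directory before copying, a check along the following lines can help (a sketch assuming GNU find;
the directory path is only an example):

# List broken symbolic links under the example source directory /data/source
find /data/source -xtype l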

(Issue 4018)(HDFS-1768) fs -put crash that depends on source file name

Copying a file using the hadoop fs command generates a warning or exception if a corresponding checksum file .*.crc exists. If this error
occurs, delete all local checksum files and try again. See http://www.mail-archive.com/hdfs-issues@hadoop.apache.org/msg15824.html
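One way to remove the leftover checksum files before retrying (a sketch; the local directory is only an example):

# Delete stale hidden CRC checksum files under the example local directory /data/upload
find /data/upload -name '.*.crc' -type f -delete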

(Issue 3524) Apache Port 80 Open

The MapR UI runs on Apache. By default, installation does not close port 80 (even though the MapR Control System is available over HTTPS on
port 8443). If this would present a security risk to your datacenter, you should close port 80 manually on any nodes running the MapR Control
System.
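One possible way to block the port (a sketch assuming iptables is the firewall in use; adapt the rule to your site's firewall tooling and make it
persistent according to your distribution's conventions):

# Reject inbound connections to port 80 on this node
iptables -I INPUT -p tcp --dport 80 -j REJECT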

(Issue 3488) Ubuntu IRQ Balancer Issue on Virtual Machines

In VM environments such as EC2, VMware, and Xen, running Ubuntu 10.10 can cause problems due to an Ubuntu bug unless the IRQ
balancer is turned off. On all nodes, edit the file /etc/default/irqbalance and set ENABLED=0 to turn off the IRQ balancer (a reboot is
required for the change to take effect).
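The edit can be scripted as follows (a sketch; run as root on each node, then reboot):

# Disable the IRQ balancer in its default configuration file
sed -i 's/^ENABLED=.*/ENABLED=0/' /etc/default/irqbalance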

(Issue 3244) Volume Mirror Issue

If a volume dump restore command is interrupted before completion (killed by the user, node fails, etc.) then the volume remains in the "Mirroring
in Progress" state. Before retrying the volume dump restore operation, you must issue the volume mirror stop command explicitly.

(Issue 3122) Mirroring with fsck-repaired volume

If a source or mirror volume is repaired with fsck then the source and mirror volumes can go out of sync. It is necessary to perform a full mirror
operation with volume mirror start -full true to bring them back in sync. Similarly, when creating a dump file from a volume that has
been repaired with fsck, use -full true on the volume dump create command.

(Issue 3028) Changing the Time on a ZooKeeper Node

To avoid cluster downtime, use the following steps to set the time on any node running ZooKeeper (a consolidated example appears after the
steps):

1. Use the MapR Dashboard to check that all configured ZooKeeper services on the cluster are running. Start any non-running ZooKeeper
instances.
2. Stop ZooKeeper on the node:
/etc/init.d/mapr-zookeeper stop
3. Change the time on the node or sync the time to NTP.
4. Start ZooKeeper on the node:
/etc/init.d/mapr-zookeeper start
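Consolidated on a single ZooKeeper node, the steps might look like this (a sketch; ntpdate and the NTP server name are assumptions, so
substitute your site's time source):

# Stop ZooKeeper before adjusting the clock
/etc/init.d/mapr-zookeeper stop
# Sync the node's clock to an NTP server (example server name)
ntpdate pool.ntp.org
# Start ZooKeeper again
/etc/init.d/mapr-zookeeper start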

(Issue 2809) NFS Dependencies

If you are installing the MapR NFS service on a node that cannot connect to the standard apt-get or yum repositories, install the following
packages by hand (an example appears after the list):

CentOS:
iputils
portmap
glibc-common-2.5-49.el5_5.7
Red Hat:
rpcbind
iputils
Ubuntu:
nfs-common
iputils-arping
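For example, with the package files already downloaded to the node, the installation might look like this (a sketch; the filenames are
placeholders for the downloaded packages):

# CentOS: install previously downloaded .rpm files
rpm -ivh iputils-*.rpm portmap-*.rpm glibc-common-2.5-49.el5_5.7*.rpm
# Red Hat: install previously downloaded .rpm files
rpm -ivh rpcbind-*.rpm iputils-*.rpm
# Ubuntu: install previously downloaded .deb files
dpkg -i nfs-common_*.deb iputils-arping_*.deb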

Hadoop Compatibility in Version 1.0


MapR provides the following packages:

Apache Hadoop 0.20.2


flume-0.9.3
hbase-0.90.2
hive-0.7.0
oozie-3.0.0
pig-0.8
sqoop-1.2.0
whirr-0.3.0

Hadoop Common Patches

MapR 1.0 includes the following Apache Hadoop issues that are not included in the Apache Hadoop base version 0.20.2:

[HADOOP-1722] Make streaming to handle non-utf8 byte array
[HADOOP-1849] IPC server max queue size should be configurable
[HADOOP-2141] speculative execution start up condition based on completion time
[HADOOP-2366] Space in the value for dfs.data.dir can cause great problems
[HADOOP-2721] Use job control for tasks (and therefore for pipes and streaming)
[HADOOP-2838] Add HADOOP_LIBRARY_PATH config setting so Hadoop will include external directories for jni
[HADOOP-3327] Shuffling fetchers waited too long between map output fetch re-tries
[HADOOP-3659] Patch to allow hadoop native to compile on Mac OS X
[HADOOP-4012] Providing splitting support for bzip2 compressed files
[HADOOP-4041] IsolationRunner does not work as documented
[HADOOP-4490] Map and Reduce tasks should run as the user who submitted the job
[HADOOP-4655] FileSystem.CACHE should be ref-counted
[HADOOP-4656] Add a user to groups mapping service
[HADOOP-4675] Current Ganglia metrics implementation is incompatible with Ganglia 3.1
[HADOOP-4829] Allow FileSystem shutdown hook to be disabled
[HADOOP-4842] Streaming combiner should allow command, not just JavaClass
[HADOOP-4930] Implement setuid executable for Linux to assist in launching tasks as job owners
[HADOOP-4933] ConcurrentModificationException in JobHistory.java
[HADOOP-5170] Set max map/reduce tasks on a per-job basis, either per-node or cluster-wide
[HADOOP-5175] Option to prohibit jars unpacking
[HADOOP-5203] TT's version build is too restrictive
[HADOOP-5396] Queue ACLs should be refreshed without requiring a restart of the job tracker
[HADOOP-5419] Provide a way for users to find out what operations they can do on which M/R queues
[HADOOP-5420] Support killing of process groups in LinuxTaskController binary
[HADOOP-5442] The job history display needs to be paged
[HADOOP-5450] Add support for application-specific typecodes to typed bytes
[HADOOP-5469] Exposing Hadoop metrics via HTTP
[HADOOP-5476] calling new SequenceFile.Reader(...) leaves an InputStream open, if the given sequence file is broken
[HADOOP-5488] HADOOP-2721 doesn't clean up descendant processes of a jvm that exits cleanly after running a task successfully
[HADOOP-5528] Binary partitioner
[HADOOP-5582] Hadoop Vaidya throws number format exception due to changes in the job history counters string format (escaped compact
representation).
[HADOOP-5592] Hadoop Streaming - GzipCodec
[HADOOP-5613] change S3Exception to checked exception
[HADOOP-5643] Ability to blacklist tasktracker
[HADOOP-5656] Counter for S3N Read Bytes does not work
[HADOOP-5675] DistCp should not launch a job if it is not necessary
[HADOOP-5733] Add map/reduce slot capacity and lost map/reduce slot capacity to JobTracker metrics
[HADOOP-5737] UGI checks in testcases are broken
[HADOOP-5738] Split waiting tasks field in JobTracker metrics to individual tasks
[HADOOP-5745] Allow setting the default value of maxRunningJobs for all pools
[HADOOP-5784] The length of the heartbeat cycle should be configurable.
[HADOOP-5801] JobTracker should refresh the hosts list upon recovery
[HADOOP-5805] problem using top level s3 buckets as input/output directories
[HADOOP-5861] s3n files are not getting split by default
[HADOOP-5879] GzipCodec should read compression level etc from configuration
[HADOOP-5913] Allow administrators to be able to start and stop queues
[HADOOP-5958] Use JDK 1.6 File APIs in DF.java wherever possible
[HADOOP-5976] create script to provide classpath for external tools
[HADOOP-5980] LD_LIBRARY_PATH not passed to tasks spawned off by LinuxTaskController
[HADOOP-5981] HADOOP-2838 doesnt work as expected
[HADOOP-6132] RPC client opens an extra connection for VersionedProtocol
[HADOOP-6133] ReflectionUtils performance regression
[HADOOP-6148] Implement a pure Java CRC32 calculator
[HADOOP-6161] Add get/setEnum to Configuration
[HADOOP-6166] Improve PureJavaCrc32
[HADOOP-6184] Provide a configuration dump in json format.
[HADOOP-6227] Configuration does not lock parameters marked final if they have no value.
[HADOOP-6234] Permission configuration files should use octal and symbolic
[HADOOP-6254] s3n fails with SocketTimeoutException
[HADOOP-6269] Missing synchronization for defaultResources in Configuration.addResource
[HADOOP-6279] Add JVM memory usage to JvmMetrics
[HADOOP-6284] Any hadoop commands crashing jvm (SIGBUS) when /tmp (tmpfs) is full
[HADOOP-6299] Use JAAS LoginContext for our login
[HADOOP-6312] Configuration sends too much data to log4j
[HADOOP-6337] Update FilterInitializer class to be more visible and take a conf for further development
[HADOOP-6343] Stack trace of any runtime exceptions should be recorded in the server logs.
[HADOOP-6400] Log errors getting Unix UGI
[HADOOP-6408] Add a /conf servlet to dump running configuration
[HADOOP-6419] Change RPC layer to support SASL based mutual authentication
[HADOOP-6433] Add AsyncDiskService that is used in both hdfs and mapreduce
[HADOOP-6441] Prevent remote CSS attacks in Hostname and UTF-7.
[HADOOP-6453] Hadoop wrapper script shouldn't ignore an existing JAVA_LIBRARY_PATH
[HADOOP-6471] StringBuffer -> StringBuilder - conversion of references as necessary
[HADOOP-6496] HttpServer sends wrong content-type for CSS files (and others)
[HADOOP-6510] doAs for proxy user
[HADOOP-6521] FsPermission:SetUMask not updated to use new-style umask setting.
[HADOOP-6534] LocalDirAllocator should use whitespace trimming configuration getters
[HADOOP-6543] Allow authentication-enabled RPC clients to connect to authentication-disabled RPC servers
[HADOOP-6558] archive does not work with distcp -update
[HADOOP-6568] Authorization for default servlets
[HADOOP-6569] FsShell#cat should avoid calling unecessary getFileStatus before opening a file to read
[HADOOP-6572] RPC responses may be out-of-order with respect to SASL
[HADOOP-6577] IPC server response buffer reset threshold should be configurable
[HADOOP-6578] Configuration should trim whitespace around a lot of value types
[HADOOP-6599] Split RPC metrics into summary and detailed metrics
[HADOOP-6609] Deadlock in DFSClient#getBlockLocations even with the security disabled
[HADOOP-6613] RPC server should check for version mismatch first
[HADOOP-6627] "Bad Connection to FS" message in FSShell should print message from the exception
[HADOOP-6631] FileUtil.fullyDelete() should continue to delete other files despite failure at any level.
[HADOOP-6634] AccessControlList uses full-principal names to verify acls causing queue-acls to fail
[HADOOP-6637] Benchmark overhead of RPC session establishment
[HADOOP-6640] FileSystem.get() does RPC retries within a static synchronized block
[HADOOP-6644] util.Shell getGROUPS_FOR_USER_COMMAND method name - should use common naming convention
[HADOOP-6649] login object in UGI should be inside the subject
[HADOOP-6652] ShellBasedUnixGroupsMapping shouldn't have a cache
[HADOOP-6653] NullPointerException in setupSaslConnection when browsing directories
[HADOOP-6663] BlockDecompressorStream get EOF exception when decompressing the file compressed from empty file
[HADOOP-6667] RPC.waitForProxy should retry through NoRouteToHostException
[HADOOP-6669] zlib.compress.level ignored for DefaultCodec initialization
[HADOOP-6670] UserGroupInformation doesn't support use in hash tables
[HADOOP-6674] Performance Improvement in Secure RPC
[HADOOP-6687] user object in the subject in UGI should be reused in case of a relogin.
[HADOOP-6701] Incorrect exit codes for "dfs -chown", "dfs -chgrp"
[HADOOP-6706] Relogin behavior for RPC clients could be improved
[HADOOP-6710] Symbolic umask for file creation is not consistent with posix
[HADOOP-6714] FsShell 'hadoop fs -text' does not support compression codecs
[HADOOP-6718] Client does not close connection when an exception happens during SASL negotiation
[HADOOP-6722] NetUtils.connect should check that it hasn't connected a socket to itself
[HADOOP-6723] unchecked exceptions thrown in IPC Connection orphan clients
[HADOOP-6724] IPC doesn't properly handle IOEs thrown by socket factory
[HADOOP-6745] adding some java doc to Server.RpcMetrics, UGI
[HADOOP-6757] NullPointerException for hadoop clients launched from streaming tasks
[HADOOP-6760] WebServer shouldn't increase port number in case of negative port setting caused by Jetty's race
[HADOOP-6762] exception while doing RPC I/O closes channel
[HADOOP-6776] UserGroupInformation.createProxyUser's javadoc is broken
[HADOOP-6813] Add a new newInstance method in FileSystem that takes a "user" as argument
[HADOOP-6815] refreshSuperUserGroupsConfiguration should use server side configuration for the refresh
[HADOOP-6818] Provide a JNI-based implementation of GroupMappingServiceProvider
[HADOOP-6832] Provide a web server plugin that uses a static user for the web UI
[HADOOP-6833] IPC leaks call parameters when exceptions thrown
[HADOOP-6859] Introduce additional statistics to FileSystem
[HADOOP-6864] Provide a JNI-based implementation of ShellBasedUnixGroupsNetgroupMapping (implementation of
GroupMappingServiceProvider)
[HADOOP-6881] The efficient comparators aren't always used except for BytesWritable and Text
[HADOOP-6899] RawLocalFileSystem#setWorkingDir() does not work for relative names
[HADOOP-6907] Rpc client doesn't use the per-connection conf to figure out server's Kerberos principal
[HADOOP-6925] BZip2Codec incorrectly implements read()
[HADOOP-6928] Fix BooleanWritable comparator in 0.20
[HADOOP-6943] The GroupMappingServiceProvider interface should be public
[HADOOP-6950] Suggest that HADOOP_CLASSPATH should be preserved in hadoop-env.sh.template
[HADOOP-6995] Allow wildcards to be used in ProxyUsers configurations
[HADOOP-7082] Configuration.writeXML should not hold lock while outputting
[HADOOP-7101] UserGroupInformation.getCurrentUser() fails when called from non-Hadoop JAAS context
[HADOOP-7104] Remove unnecessary DNS reverse lookups from RPC layer
[HADOOP-7110] Implement chmod with JNI
[HADOOP-7114] FsShell should dump all exceptions at DEBUG level
[HADOOP-7115] Add a cache for getpwuid_r and getpwgid_r calls
[HADOOP-7118] NPE in Configuration.writeXml
[HADOOP-7122] Timed out shell commands leak Timer threads
[HADOOP-7156] getpwuid_r is not thread-safe on RHEL6
[HADOOP-7172] SecureIO should not check owner on non-secure clusters that have no native support
[HADOOP-7173] Remove unused fstat() call from NativeIO
[HADOOP-7183] WritableComparator.get should not cache comparator objects
[HADOOP-7184] Remove deprecated local.cache.size from core-default.xml

MapReduce Patches

MapR 1.0 includes the following Apache MapReduce issues that are not included in the Apache Hadoop base version 0.20.2:

[MAPREDUCE-112] Reduce Input Records and Reduce Output Records counters are not being set when using the new Mapreduce reducer API
[MAPREDUCE-118] Job.getJobID() will always return null
[MAPREDUCE-144] TaskMemoryManager should log process-tree's status while killing tasks.
[MAPREDUCE-181] Secure job submission
[MAPREDUCE-211] Provide a node health check script and run it periodically to check the node health status
[MAPREDUCE-220] Collecting cpu and memory usage for MapReduce tasks
[MAPREDUCE-270] TaskTracker could send an out-of-band heartbeat when the last running map/reduce completes
[MAPREDUCE-277] Job history counters should be avaible on the UI.
[MAPREDUCE-339] JobTracker should give preference to failed tasks over virgin tasks so as to terminate the job ASAP if it is eventually going to
fail.
[MAPREDUCE-364] Change org.apache.hadoop.examples.MultiFileWordCount to use new mapreduce api.
[MAPREDUCE-369] Change org.apache.hadoop.mapred.lib.MultipleInputs to use new api.
[MAPREDUCE-370] Change org.apache.hadoop.mapred.lib.MultipleOutputs to use new api.
[MAPREDUCE-415] JobControl Job does always has an unassigned name
[MAPREDUCE-416] Move the completed jobs' history files to a DONE subdirectory inside the configured history directory
[MAPREDUCE-461] Enable ServicePlugins for the JobTracker
[MAPREDUCE-463] The job setup and cleanup tasks should be optional
[MAPREDUCE-467] Collect information about number of tasks succeeded / total per time unit for a tasktracker.
[MAPREDUCE-476] extend DistributedCache to work locally (LocalJobRunner)
[MAPREDUCE-478] separate jvm param for mapper and reducer
[MAPREDUCE-516] Fix the 'cluster drain' problem in the Capacity Scheduler wrt High RAM Jobs
[MAPREDUCE-517] The capacity-scheduler should assign multiple tasks per heartbeat
[MAPREDUCE-521] After JobTracker restart Capacity Schduler does not schedules pending tasks from already running tasks.
[MAPREDUCE-532] Allow admins of the Capacity Scheduler to set a hard-limit on the capacity of a queue
[MAPREDUCE-551] Add preemption to the fair scheduler
[MAPREDUCE-572] If #link is missing from uri format of -cacheArchive then streaming does not throw error.
[MAPREDUCE-655] Change KeyValueLineRecordReader and KeyValueTextInputFormat to use new api.
[MAPREDUCE-676] Existing diagnostic rules fail for MAP ONLY jobs
[MAPREDUCE-679] XML-based metrics as JSP servlet for JobTracker
[MAPREDUCE-680] Reuse of Writable objects is improperly handled by MRUnit
[MAPREDUCE-682] Reserved tasktrackers should be removed when a node is globally blacklisted
[MAPREDUCE-693] Conf files not moved to "done" subdirectory after JT restart
[MAPREDUCE-698] Per-pool task limits for the fair scheduler
[MAPREDUCE-706] Support for FIFO pools in the fair scheduler
[MAPREDUCE-707] Provide a jobconf property for explicitly assigning a job to a pool
[MAPREDUCE-709] node health check script does not display the correct message on timeout
[MAPREDUCE-714] JobConf.findContainingJar unescapes unnecessarily on Linux
[MAPREDUCE-716] org.apache.hadoop.mapred.lib.db.DBInputformat not working with oracle
[MAPREDUCE-722] More slots are getting reserved for HiRAM job tasks then required
[MAPREDUCE-732] node health check script should not log "UNHEALTHY" status for every heartbeat in INFO mode
[MAPREDUCE-734] java.util.ConcurrentModificationException observed in unreserving slots for HiRam Jobs
[MAPREDUCE-739] Allow relative paths to be created inside archives.
[MAPREDUCE-740] Provide summary information per job once a job is finished.
[MAPREDUCE-744] Support in DistributedCache to share cache files with other users after HADOOP-4493
[MAPREDUCE-754] NPE in expiry thread when a TT is lost
[MAPREDUCE-764] TypedBytesInput's readRaw() does not preserve custom type codes
[MAPREDUCE-768] Configuration information should generate dump in a standard format.
[MAPREDUCE-771] Setup and cleanup tasks remain in UNASSIGNED state for a long time on tasktrackers with long running high RAM tasks
[MAPREDUCE-782] Use PureJavaCrc32 in mapreduce spills
[MAPREDUCE-787] -files, -archives should honor user given symlink path
[MAPREDUCE-809] Job summary logs show status of completed jobs as RUNNING
[MAPREDUCE-814] Move completed Job history files to HDFS
[MAPREDUCE-817] Add a cache for retired jobs with minimal job info and provide a way to access history file url
[MAPREDUCE-825] JobClient completion poll interval of 5s causes slow tests in local mode
[MAPREDUCE-840] DBInputFormat leaves open transaction
[MAPREDUCE-842] Per-job local data on the TaskTracker node should have right access-control
[MAPREDUCE-856] Localized files from DistributedCache should have right access-control
[MAPREDUCE-871] Job/Task local files have incorrect group ownership set by LinuxTaskController binary
[MAPREDUCE-875] Make DBRecordReader execute queries lazily
[MAPREDUCE-885] More efficient SQL queries for DBInputFormat
[MAPREDUCE-890] After HADOOP-4491, the user who started mapred system is not able to run job.
[MAPREDUCE-896] Users can set non-writable permissions on temporary files for TT and can abuse disk usage.
[MAPREDUCE-899] When using LinuxTaskController, localized files may become accessible to unintended users if permissions are
misconfigured.
[MAPREDUCE-927] Cleanup of task-logs should happen in TaskTracker instead of the Child
[MAPREDUCE-947] OutputCommitter should have an abortJob method
[MAPREDUCE-964] Inaccurate values in jobSummary logs
[MAPREDUCE-967] TaskTracker does not need to fully unjar job jars
[MAPREDUCE-968] NPE in distcp encountered when placing _logs directory on S3FileSystem
[MAPREDUCE-971] distcp does not always remove distcp.tmp.dir
[MAPREDUCE-1028] Cleanup tasks are scheduled using high memory configuration, leaving tasks in unassigned state.
[MAPREDUCE-1030] Reduce tasks are getting starved in capacity scheduler
[MAPREDUCE-1048] Show total slot usage in cluster summary on jobtracker webui
[MAPREDUCE-1059] distcp can generate uneven map task assignments
[MAPREDUCE-1083] Use the user-to-groups mapping service in the JobTracker
[MAPREDUCE-1085] For tasks, "ulimit -v -1" is being run when user doesn't specify mapred.child.ulimit
[MAPREDUCE-1086] hadoop commands in streaming tasks are trying to write to tasktracker's log
[MAPREDUCE-1088] JobHistory files should have narrower 0600 perms
[MAPREDUCE-1089] Fair Scheduler preemption triggers NPE when tasks are scheduled but not running
[MAPREDUCE-1090] Modify log statement in Tasktracker log related to memory monitoring to include attempt id.
[MAPREDUCE-1098] Incorrect synchronization in DistributedCache causes TaskTrackers to freeze up during localization of Cache for tasks.
[MAPREDUCE-1100] User's task-logs filling up local disks on the TaskTrackers
[MAPREDUCE-1103] Additional JobTracker metrics
[MAPREDUCE-1105] CapacityScheduler: It should be possible to set queue hard-limit beyond it's actual capacity
[MAPREDUCE-1118] Capacity Scheduler scheduling information is hard to read / should be tabular format
[MAPREDUCE-1131] Using profilers other than hprof can cause JobClient to report job failure
[MAPREDUCE-1140] Per cache-file refcount can become negative when tasks release distributed-cache files
[MAPREDUCE-1143] runningMapTasks counter is not properly decremented in case of failed Tasks.
[MAPREDUCE-1155] Streaming tests swallow exceptions
[MAPREDUCE-1158] running_maps is not decremented when the tasks of a job is killed/failed
[MAPREDUCE-1160] Two log statements at INFO level fill up jobtracker logs
[MAPREDUCE-1171] Lots of fetch failures
[MAPREDUCE-1178] MultipleInputs fails with ClassCastException
[MAPREDUCE-1185] URL to JT webconsole for running job and job history should be the same
[MAPREDUCE-1186] While localizing a DistributedCache file, TT sets permissions recursively on the whole base-dir
[MAPREDUCE-1196] MAPREDUCE-947 incompatibly changed FileOutputCommitter
[MAPREDUCE-1198] Alternatively schedule different types of tasks in fair share scheduler
[MAPREDUCE-1213] TaskTrackers restart is very slow because it deletes distributed cache directory synchronously
[MAPREDUCE-1219] JobTracker Metrics causes undue load on JobTracker
[MAPREDUCE-1221] Kill tasks on a node if the free physical memory on that machine falls below a configured threshold
[MAPREDUCE-1231] Distcp is very slow
[MAPREDUCE-1250] Refactor job token to use a common token interface
[MAPREDUCE-1258] Fair scheduler event log not logging job info
[MAPREDUCE-1285] DistCp cannot handle -delete if destination is local filesystem
[MAPREDUCE-1288] DistributedCache localizes only once per cache URI
[MAPREDUCE-1293] AutoInputFormat doesn't work with non-default FileSystems
[MAPREDUCE-1302] TrackerDistributedCacheManager can delete file asynchronously
[MAPREDUCE-1304] Add counters for task time spent in GC
[MAPREDUCE-1307] Introduce the concept of Job Permissions
[MAPREDUCE-1313] NPE in FieldFormatter if escape character is set and field is null
[MAPREDUCE-1316] JobTracker holds stale references to retired jobs via unreported tasks
[MAPREDUCE-1342] Potential JT deadlock in faulty TT tracking
[MAPREDUCE-1354] Incremental enhancements to the JobTracker for better scalability
[MAPREDUCE-1372] ConcurrentModificationException in JobInProgress
[MAPREDUCE-1378] Args in job details links on jobhistory.jsp are not URL encoded
[MAPREDUCE-1382] MRAsyncDiscService should tolerate missing local.dir
[MAPREDUCE-1397] NullPointerException observed during task failures
[MAPREDUCE-1398] TaskLauncher remains stuck on tasks waiting for free nodes even if task is killed.
[MAPREDUCE-1399] The archive command shows a null error message
[MAPREDUCE-1403] Save file-sizes of each of the artifacts in DistributedCache in the JobConf
[MAPREDUCE-1421] LinuxTaskController tests failing on trunk after the commit of MAPREDUCE-1385
[MAPREDUCE-1422] Changing permissions of files/dirs under job-work-dir may be needed sothat cleaning up of job-dir in all
mapred-local-directories succeeds always
[MAPREDUCE-1423] Improve performance of CombineFileInputFormat when multiple pools are configured
[MAPREDUCE-1425] archive throws OutOfMemoryError
[MAPREDUCE-1435] symlinks in cwd of the task are not handled properly after MAPREDUCE-896
[MAPREDUCE-1436] Deadlock in preemption code in fair scheduler
[MAPREDUCE-1440] MapReduce should use the short form of the user names
[MAPREDUCE-1441] Configuration of directory lists should trim whitespace
[MAPREDUCE-1442] StackOverflowError when JobHistory parses a really long line
[MAPREDUCE-1443] DBInputFormat can leak connections
[MAPREDUCE-1454] The servlets should quote server generated strings sent in the response
[MAPREDUCE-1455] Authorization for servlets
[MAPREDUCE-1457] For secure job execution, couple of more UserGroupInformation.doAs needs to be added
[MAPREDUCE-1464] In JobTokenIdentifier change method getUsername to getUser which returns UGI
[MAPREDUCE-1466] FileInputFormat should save #input-files in JobConf
[MAPREDUCE-1476] committer.needsTaskCommit should not be called for a task cleanup attempt
[MAPREDUCE-1480] CombineFileRecordReader does not properly initialize child RecordReader
[MAPREDUCE-1493] Authorization for job-history pages
[MAPREDUCE-1503] Push HADOOP-6551 into MapReduce
[MAPREDUCE-1505] Cluster class should create the rpc client only when needed
[MAPREDUCE-1521] Protection against incorrectly configured reduces
[MAPREDUCE-1522] FileInputFormat may change the file system of an input path
[MAPREDUCE-1526] Cache the job related information while submitting the job , this would avoid many RPC calls to JobTracker.
[MAPREDUCE-1533] Reduce or remove usage of String.format() usage in CapacityTaskScheduler.updateQSIObjects and
Counters.makeEscapedString()
[MAPREDUCE-1538] TrackerDistributedCacheManager can fail because the number of subdirectories reaches system limit
[MAPREDUCE-1543] Log messages of JobACLsManager should use security logging of HADOOP-6586
[MAPREDUCE-1545] Add 'first-task-launched' to job-summary
[MAPREDUCE-1550] UGI.doAs should not be used for getting the history file of jobs
[MAPREDUCE-1563] Task diagnostic info would get missed sometimes.
[MAPREDUCE-1570] Shuffle stage - Key and Group Comparators
[MAPREDUCE-1607] Task controller may not set permissions for a task cleanup attempt's log directory
[MAPREDUCE-1609] TaskTracker.localizeJob should not set permissions on job log directory recursively
[MAPREDUCE-1611] Refresh nodes and refresh queues doesnt work with service authorization enabled
[MAPREDUCE-1612] job conf file is not accessible from job history web page
[MAPREDUCE-1621] Streaming's TextOutputReader.getLastOutput throws NPE if it has never read any output
[MAPREDUCE-1635] ResourceEstimator does not work after MAPREDUCE-842
[MAPREDUCE-1641] Job submission should fail if same uri is added for mapred.cache.files and mapred.cache.archives
[MAPREDUCE-1656] JobStory should provide queue info.
[MAPREDUCE-1657] After task logs directory is deleted, tasklog servlet displays wrong error message about job ACLs
[MAPREDUCE-1664] Job Acls affect Queue Acls
[MAPREDUCE-1680] Add a metrics to track the number of heartbeats processed
[MAPREDUCE-1682] Tasks should not be scheduled after tip is killed/failed.
[MAPREDUCE-1683] Remove JNI calls from ClusterStatus cstr
[MAPREDUCE-1699] JobHistory shouldn't be disabled for any reason
[MAPREDUCE-1707] TaskRunner can get NPE in getting ugi from TaskTracker
[MAPREDUCE-1716] Truncate logs of finished tasks to prevent node thrash due to excessive logging
[MAPREDUCE-1733] Authentication between pipes processes and java counterparts.
[MAPREDUCE-1734] Un-deprecate the old MapReduce API in the 0.20 branch
[MAPREDUCE-1744] DistributedCache creates its own FileSytem instance when adding a file/archive to the path
[MAPREDUCE-1754] Replace mapred.persmissions.supergroup with an acl : mapreduce.cluster.administrators
[MAPREDUCE-1759] Exception message for unauthorized user doing killJob, killTask, setJobPriority needs to be improved
[MAPREDUCE-1778] CompletedJobStatusStore initialization should fail if {mapred.job.tracker.persist.jobstatus.dir} is unwritable
[MAPREDUCE-1784] IFile should check for null compressor
[MAPREDUCE-1785] Add streaming config option for not emitting the key
[MAPREDUCE-1832] Support for file sizes less than 1MB in DFSIO benchmark.
[MAPREDUCE-1845] FairScheduler.tasksToPeempt() can return negative number
[MAPREDUCE-1850] Include job submit host information (name and ip) in jobconf and jobdetails display
[MAPREDUCE-1853] MultipleOutputs does not cache TaskAttemptContext
[MAPREDUCE-1868] Add read timeout on userlog pull
[MAPREDUCE-1872] Re-think (user|queue) limits on (tasks|jobs) in the CapacityScheduler
[MAPREDUCE-1887] MRAsyncDiskService does not properly absolutize volume root paths
[MAPREDUCE-1900] MapReduce daemons should close FileSystems that are not needed anymore
[MAPREDUCE-1914] TrackerDistributedCacheManager never cleans its input directories
[MAPREDUCE-1938] Ability for having user's classes take precedence over the system classes for tasks' classpath
[MAPREDUCE-1960] Limit the size of jobconf.
[MAPREDUCE-1961] ConcurrentModificationException when shutting down Gridmix
[MAPREDUCE-1985] java.lang.ArrayIndexOutOfBoundsException in analysejobhistory.jsp of jobs with 0 maps
[MAPREDUCE-2023] TestDFSIO read test may not read specified bytes.
[MAPREDUCE-2082] Race condition in writing the jobtoken password file when launching pipes jobs
[MAPREDUCE-2096] Secure local filesystem IO from symlink vulnerabilities
[MAPREDUCE-2103] task-controller shouldn't require o-r permissions
[MAPREDUCE-2157] safely handle InterruptedException and interrupted status in MR code
[MAPREDUCE-2178] Race condition in LinuxTaskController permissions handling
[MAPREDUCE-2219] JT should not try to remove mapred.system.dir during startup
[MAPREDUCE-2234] If Localizer can't create task log directory, it should fail on the spot
[MAPREDUCE-2235] JobTracker "over-synchronization" makes it hang up in certain cases
[MAPREDUCE-2242] LinuxTaskController doesn't properly escape environment variables
[MAPREDUCE-2253] Servlets should specify content type
[MAPREDUCE-2256] FairScheduler fairshare preemption from multiple pools may preempt all tasks from one pool causing that pool to go below
fairshare.
[MAPREDUCE-2289] Permissions race can make getStagingDir fail on local filesystem
[MAPREDUCE-2321] TT should fail to start on secure cluster when SecureIO isn't available
[MAPREDUCE-2323] Add metrics to the fair scheduler
[MAPREDUCE-2328] memory-related configurations missing from mapred-default.xml
[MAPREDUCE-2332] Improve error messages when MR dirs on local FS have bad ownership
[MAPREDUCE-2351] mapred.job.tracker.history.completed.location should support an arbitrary filesystem URI
[MAPREDUCE-2353] Make the MR changes to reflect the API changes in SecureIO library
[MAPREDUCE-2356] A task succeeded even though there were errors on all attempts.
[MAPREDUCE-2364] Shouldn't hold lock on rjob while localizing resources.
[MAPREDUCE-2366] TaskTracker can't retrieve stdout and stderr from web UI
[MAPREDUCE-2371] TaskLogsTruncater does not need to check log ownership when running as Child
[MAPREDUCE-2372] TaskLogAppender mechanism shouldn't be set in log4j.properties
[MAPREDUCE-2373] When tasks exit with a nonzero exit status, task runner should log the stderr as well as stdout
[MAPREDUCE-2374] Should not use PrintWriter to write taskjvm.sh
[MAPREDUCE-2377] task-controller fails to parse configuration if it doesn't end in \n
[MAPREDUCE-2379] Distributed cache sizing configurations are missing from mapred-default.xml

Beta Release Notes

General Information

New in This Release

Services Down Alarm Removed

The Services Down Alarm (NODE_ALARM_MISC_DOWN) has been removed.

Hoststats Service Down Alarm Added

The Hoststats Service Down Alarm (NODE_ALARM_SERVICE_HOSTSTATS_DOWN) has been added. This alarm indicates that the Hoststats
service on the indicated node is not running.

Installation Directory Full Alarm Added

The Installation Directory Full Alarm (NODE_ALARM_OPT_MAPR_FULL) has been added. This alarm indicates that the /opt/mapr directory on
the indicated node is approaching capacity.

Root Partition Full Alarm Added

The Root Partition Full Alarm (NODE_ALARM_ROOT_PARTITION_FULL) has been added. This alarm indicates that the / directory on the
indicated node is approaching capacity.

Cores Present Alarm Added

The Cores Present Alarm (NODE_ALARM_CORE_PRESENT) has been added. This alarm indicates that a service on the indicated node has
crashed, leaving a core dump file.

Global fsck scan

Global fsck automatically scans the entire MapR cluster for errors. If an error is found, contact MapR Support for assistance.

Volume Mirrors

A volume mirror is a full read-only copy of a volume that can be synced on a schedule to provide point-in-time recovery for critical data, or for
higher-performance read concurrency. Creating a mirror requires the mir permission. See Managing Volumes.

Resolved Issues
(Issue 3724) Default Settings Must Be Changed
(Issue 3620) Can't Run MapReduce Jobs as Non-Root User
(Issue 2434) Mirroring Disabled in Alpha
(Issue 2282) fsck Not Present in Alpha

Known Issues
Removing Nodes

The MapR Beta release may experience problems when nodes are removed from the cluster. The problems are likely to be seen as
inconsistencies in the GUI and can be corrected by stopping and restarting the CLDB process. This behavior will be corrected in the GA release.

(Issue 4068) Upgrading Red Hat

When upgrading MapR packages on nodes that run Red Hat, you should only upgrade packages if they appear on the following list:

mapr-core
mapr-flume-internal
mapr-hbase-internal
mapr-hive-internal
mapr-oozie-internal
mapr-pig-internal
mapr-sqoop-internal
mapr-zk-internal

Other installed packages should not be upgraded. If you accidentally upgrade other packages, you can restore the node to proper operation by
forcing a reinstall of the latest versions of the packages using the following steps (a consolidated example appears after the steps):

1. Log in as root (or use sudo for the following steps).
2. Stop the warden: /etc/init.d/mapr-warden stop
3. If Zookeeper is installed and running, stop it: /etc/init.d/mapr-zookeeper stop
4. Force reinstall of the packages by running yum reinstall with a list of packages to be installed. Example: yum reinstall
mapr-core mapr-zk-internal
5. If ZooKeeper is installed on the node, start it: /etc/init.d/mapr-zookeeper start
6. Start the warden: /etc/init.d/mapr-warden start
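Put together, the recovery procedure might look like this (a sketch; extend the yum reinstall package list to cover whatever was accidentally
upgraded on the node):

# Stop MapR services before reinstalling packages
/etc/init.d/mapr-warden stop
/etc/init.d/mapr-zookeeper stop    # only if ZooKeeper is installed on this node
# Force reinstall of the affected MapR packages
yum reinstall mapr-core mapr-zk-internal
# Restart services
/etc/init.d/mapr-zookeeper start   # only if ZooKeeper is installed on this node
/etc/init.d/mapr-warden start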

(Issue 4037) Starting Newly Added Services

After you install new services on a node, you can start them in two ways:

Use the MapR Control System, the API, or the command-line interface to start the services individually
Restart the warden to stop and start all services on the node

If you start the services individually, the node's memory will not be reconfigured to account for the newly installed services. This can
cause memory paging, slowing or stopping the node. However, stopping and restarting the warden can take the node out of service.

For best results, install additional services on a node at a time when the cluster is not busy. If that is not possible, make sure to restart the
warden as soon as it is practical to do so after installing the new services.

(Issue 4024) Hadoop Copy Commands Do Not Handle Broken Symbolic Links

The hadoop fs -copyToLocal and hadoop fs -copyFromLocal commands attempt to resolve symbolic links in the source data set, to
create physical copies of the files referred to by the links. If a broken symbolic link is encountered by either command, the copy operation fails at
that point.

(Issue 4018)(HDFS-1768) fs -put crash that depends on source file name

Copying a file using the hadoop fs command generates a warning or exception if a corresponding checksum file .*.crc exists. If this error
occurs, delete all local checksum files and try again. See http://www.mail-archive.com/hdfs-issues@hadoop.apache.org/msg15824.html

(Issue 3965) Volume Dump Restore Failure

The volume dump restore command can fail with error 22 (EINVAL) if nodes containing the volume dump are restarted during the restore
operation. To fix the problem, run the command again after the nodes have restarted.

(Issue 3984) HBase Upgrade

If you are using HBase and upgrading during the MapR beta, please contact MapR Support for assistance.

(Issue 3890) Sqoop Requires HBase

The Sqoop package requires HBase, but the package dependency is not set. If you install Sqoop, you must also explicitly install HBase.

(Issue 3817) Increasing File Handle Limits Requires Restarting PAM Session Management

If you're upgrading from the Apache distribution of Hadoop on Ubuntu 10.x, it is not sufficient to modify /etc/security/limits.conf to
increase the file handle limits for the new users. You must also modify your PAM configuration by adding the following line to
/etc/pam.d/common-session and then restarting the services:

session required pam_limits.so
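For example (a sketch; run as root, then start a new login session, or restart the relevant login services, so the PAM change takes effect):

# Enable pam_limits for common sessions so the limits.conf settings are applied
echo 'session required pam_limits.so' >> /etc/pam.d/common-session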

(Issue 3560) Intermittent Scheduled Mirror Failure

Under certain conditions, a scheduled mirror ends prematurely. To work around the issue, restart mirroring manually. This issue will be corrected
in a post-beta code release.

(Issue 3524) Apache Port 80 Open

The MapR UI runs on Apache. By default, installation does not close port 80 (even though the MapR Control System is available over HTTPS on
port 8443). If this would present a security risk to your datacenter, you should close port 80 manually on any nodes running the MapR Control
System.

(Issue 3488) Ubuntu IRQ Balancer Issue on Virtual Machines

In VM environments such as EC2, VMware, and Xen, running Ubuntu 10.10 can cause problems due to an Ubuntu bug unless the IRQ
balancer is turned off. On all nodes, edit the file /etc/default/irqbalance and set ENABLED=0 to turn off the IRQ balancer (a reboot is
required for the change to take effect).

(Issue 3244) Volume Mirror Issue

If a volume dump restore command is interrupted before completion (killed by the user, node fails, etc.) then the volume remains in the "Mirroring
in Progress" state. Before retrying the volume dump restore operation, you must issue the volume mirror stop command explicitly.

(Issue 3122) Mirroring with fsck-repaired volume

If a source or mirror volume is repaired with fsck then the source and mirror volumes can go out of sync. It is necessary to perform a full mirror
operation with volume mirror start -full true to bring them back in sync. If a mirror operation is not feasible (due to bandwidth
constraints, for example), then you should restore the mirror volume from a full dump file. When creating a dump file from a volume that has been
repaired with fsck, use the volume dump create command without specifying -s to create a full volume dump.

(Issue 3028) Changing the Time on a ZooKeeper Node

To avoid cluster downtime, use the following steps to set the time on any node running ZooKeeper:

1. Use the MapR Dashboard to check that all configured ZooKeeper services on the cluster are running. Start any non-running ZooKeeper
instances.
2. Stop ZooKeeper on the node:
/etc/init.d/mapr-zookeeper stop
3. Change the time on the node or sync the time to NTP.
4. Start ZooKeeper on the node:
/etc/init.d/mapr-zookeeper start

(Issue 2949) NFS Mounting Issue on Ubuntu

When mounting a cluster via NFS, you must include the vers=3 option, which specifies NFS protocol version 3.
If no version is specified, NFS uses the highest version supported by the kernel and mount command, which in most cases is version
4 is not yet supported by MapR-FS NFS.
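For example, mounting the cluster with protocol version 3 might look like this (a sketch; the NFS node hostname, export path, and mount point
are placeholders):

# Mount the cluster via NFSv3 (hostname and paths are examples only)
mount -o vers=3 nfsnode1:/mapr /mapr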

(Issue 2815) File Cleanup is Slow

After a MapReduce job is completed, cleanup of files and directories associated with the tasks can take a long time and tie up the TaskTracker
node. If this happens on multiple nodes, it can cause a temporary cluster outage. In that case, check the JobTracker View and make sure all
TaskTrackers are back online before submitting additional jobs.

(Issue 2809) NFS Dependencies

If you are installing the MapR NFS service on a node that cannot connect to the standard apt-get or yum repositories, you should install the
following packages by hand:

CentOS:
iputils
portmap
glibc-common-2.5-49.el5_5.7
Red Hat:
rpcbind
iputils
Ubuntu:
nfs-common
iputils-arping

(Issue 2495) NTP Requirement

To keep all cluster nodes time-synchronized, MapR requires NTP to be configured and running on every node. If server clocks in the cluster drift
out of sync, serious problems will occur with HBase and other MapR services. MapR raises a Time Skew alarm on any out-of-sync nodes. See
http://www.ntp.org/ for more information about obtaining and installing NTP. In the event that a large adjustment must be made to the time on a
particular node, you should stop ZooKeeper on the node, then adjust the time, then restart ZooKeeper.
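For example, a minimal sequence for making a large time correction on a node that runs ZooKeeper (the NTP server name is a placeholder):

# Stop ZooKeeper, step the clock from an NTP server, then restart ZooKeeper
/etc/init.d/mapr-zookeeper stop
ntpdate pool.ntp.org
/etc/init.d/mapr-zookeeper start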

Alpha Release Notes

New in This Release


As this is the first release, there are no added or changed features.

Resolved Issues
As this is the first release, there are no issues resolved or carried over from a previous release.

Known Issues

(Issue 2495) NTP Requirement

To keep all cluster nodes time-synchronized, MapR requires NTP to be configured and running on every node. If server clocks in the cluster drift
out of sync, serious problems will occur with HBase and other MapR services. MapR raises a Time Skew alarm on any out-of-sync nodes. See
http://www.ntp.org/ for more information about obtaining and installing NTP. In the event that a large adjustment must be made to the time on a
particular node, you should stop ZooKeeper on the node, then adjust the time, then restart ZooKeeper.

(Issue 2434) Mirroring Disabled in Alpha

Volume Mirroring is intentionally disabled in the MapR Alpha Release. User interface elements and API commands related to mirroring are
non-functional.

(Issue 2282) fsck Not Present in Alpha

MapR cluster fsck is not present in the Alpha release.

MapR Control System Reference


The MapR Control System main screen consists of a navigation pane to the left and a view to the right. Dialogs appear over the main screen to
perform certain actions.

The Navigation pane to the left lets you choose which view to display on the right.
The main view groups are:

Cluster - information about the nodes in the cluster


MapR-FS - information about volumes, snapshots and schedules
NFS HA - NFS nodes and virtual IP addresses
Alarms - node and volume alarms
System Settings - configuration of alarm notifications, quotas, users, groups, SMTP, and HTTP

Some other views are separate from the main navigation tree:

CLDB View - information about the container location database


HBase View - information about HBase on the cluster
JobTracker View - information about the JobTracker
Nagios - generates a Nagios script
Terminal View - an ssh terminal for logging in to the cluster

Views
Views display information about the system. As you open views, tabs along the top let you switch between them quickly.

Clicking any column name in a view sorts the data in ascending or descending order by that column.

Most views contain the following controls:

a Filter toolbar that lets you filter the data in the view, so you can quickly find the information you want
an info symbol ( ) that you can click for help

Some views contain collapsible panes that provide different types of detailed information. Each collapsible pane has a control at the top left that
expands and collapses the pane. The control changes to show the state of the pane:

- pane is collapsed; click to expand


- pane is expanded; click to collapse

Views that contain many results provide the following controls:

(First) - navigates to the first screenful of results

(Previous) - navigates to the previous screenful of results

(Next) - navigates to the next screenful of results

(Last) - navigates to the last screenful of results

(Refresh) - refreshes the list of results

The Filter Toolbar


The Filter toolbar lets you build search expressions to provide sophisticated filtering capabilities for locating specific data on views that display a
large number of nodes. Expressions are implicitly connected by the AND operator; search results must satisfy the criteria specified in all
expressions.

There are three controls in the Filter toolbar:

The close control ( ) removes the expression.


The Add button adds a new expression.
The Filter Help button displays brief help about the Filter toolbar.

Expressions

Each expression specifies a semantic statement that consists of a field, an operator, and a value.

The first dropdown menu specifies the field to match.


The second dropdown menu specifies the type of match to perform.
The text field specifies a value to match or exclude in the field. You can use a wildcard to substitute for any part of the string.

Cluster
The Cluster view group provides the following views:

Dashboard - a summary of information about cluster health, activity, and usage


Nodes - information about nodes in the cluster
Node Heatmap - a summary of the health of nodes in the cluster

Dashboard
The Dashboard displays a summary of information about the cluster in five panes:

Cluster Heat Map - the alarms and health for each node, by rack
Alarms - a summary of alarms for the cluster
Cluster Utilization - CPU, Memory, and Disk Space usage
Services - the number of instances of each service
Volumes - the number of available, under-replicated, and unavailable volumes
MapReduce Jobs - the number of running and queued jobs, running tasks, and blacklisted nodes

Links in each pane provide shortcuts to more detailed information. The following sections provide information about each pane.

Cluster Heat Map

The Cluster Heat Map pane displays the health of the nodes in the cluster, by rack. Each node appears as a colored square to show its health at
a glance.

The Show Legend/Hide Legend link above the heatmap shows or hides a key to the color-coded display.

The drop-down menu at the top right of the pane lets you filter the results to show the following criteria:

Health
(green): healthy; all services up, MapR-FS and all disks OK, and normal heartbeat
(orange): degraded; one or more services down, or no heartbeat for over 1 minute
(red): critical; MapR-FS Inactive/Dead/Replicate, or no heartbeat for over 5 minutes
(gray): maintenance
(purple): upgrade in process
CPU Utilization
(green): below 50%; (orange): 50% - 80%; (red): over 80%
Memory Utilization
(green): below 50%; (orange): 50% - 80%; (red): over 80%
Disk Space Utilization
(green): below 50%; (orange): 50% - 80%; (red): over 80% or all disks dead
Disk Failure(s) - status of the NODE_ALARM_DISK_FAILURE alarm
(red): raised; (green): cleared
Excessive Logging - status of the NODE_ALARM_DEBUG_LOGGING alarm
(red): raised; (green): cleared
Software Installation & Upgrades - status of the NODE_ALARM_VERSION_MISMATCH alarm
(red): raised; (green): cleared
Time Skew - status of the NODE_ALARM_TIME_SKEW alarm
(red): raised; (green): cleared
CLDB Service Down - status of the NODE_ALARM_SERVICE_CLDB_DOWN alarm
(red): raised; (green): cleared
FileServer Service Down - status of the NODE_ALARM_SERVICE_FILESERVER_DOWN alarm
(red): raised; (green): cleared
JobTracker Service Down - status of the NODE_ALARM_SERVICE_JT_DOWN alarm
(red): raised; (green): cleared
TaskTracker Service Down - status of the NODE_ALARM_SERVICE_TT_DOWN alarm
(red): raised; (green): cleared
HBase Master Service Down - status of the NODE_ALARM_SERVICE_HBMASTER_DOWN alarm
(red): raised; (green): cleared
HBase Regionserver Service Down - status of the NODE_ALARM_SERVICE_HBREGION_DOWN alarm
(red): raised; (green): cleared
NFS Service Down - status of the NODE_ALARM_SERVICE_NFS_DOWN alarm
(red): raised; (green): cleared
WebServer Service Down - status of the NODE_ALARM_SERVICE_WEBSERVER_DOWN alarm
(red): raised; (green): cleared
Hoststats Service Down - status of the NODE_ALARM_SERVICE_HOSTSTATS_DOWN alarm
(red): raised; (green): cleared
Root Partition Full - status of the NODE_ALARM_ROOT_PARTITION_FULL alarm
(red): raised; (green): cleared
Installation Directory Full - status of the NODE_ALARM_OPT_MAPR_FULL alarm
(red): raised; (green): cleared
Cores Present - status of the NODE_ALARM_CORE_PRESENT alarm
(red): raised; (green): cleared

Clicking a rack name navigates to the Nodes view, which provides more detailed information about the nodes in the rack.

Clicking a colored square navigates to the Node Properties View, which provides detailed information about the node.

Alarms

The Alarms pane displays the following information about alarms on the system:

Alarm - a list of alarms raised on the cluster


Last Raised - the most recent time each alarm state changed
Summary - how many nodes or volumes have raised each alarm

Clicking any column name sorts data in ascending or descending order by that column.

Cluster Utilization

The Cluster Utilization pane displays a summary of the total usage of the following resources:

CPU
Memory
Disk Space

For each resource type, the pane displays the percentage of cluster resources used, the amount used, and the total amount present in the
system.
Services

The Services pane shows information about the services running on the cluster. For each service, the pane displays the following information:

Actv - the number of running instances of the service


Stby - the number of instances of the service that are configured and standing by to provide failover.
Stop - the number of instances of the service that have been intentionally stopped.
Fail - the number of instances of the service that have failed, indicated by a corresponding Service Down alarm
Total - the total number of instances of the service configured on the cluster

Clicking a service navigates to the Services view.

Volumes

The Volumes pane displays the total number of volumes, and the number of volumes that are mounted and unmounted. For each category, the
Volumes pane displays the number, percent of the total, and total size.

Clicking mounted or unmounted navigates to the Volumes view.

MapReduce Jobs

The MapReduce Jobs pane shows information about MapReduce jobs:

Running Jobs - the number of MapReduce jobs currently running


Queued Jobs - the number of MapReduce jobs queued to run
Running Tasks - the number of MapReduce tasks currently running
Blacklisted Nodes - the number of nodes that have been eliminated from the MapReduce pool

Nodes
The Nodes view displays the nodes in the cluster, by rack. The Nodes view contains two panes: the Topology pane and the Nodes pane. The
Topology pane shows the racks in the cluster. Selecting a rack displays that rack's nodes in the Nodes pane to the right. Selecting Cluster
displays all the nodes in the cluster.

Clicking any column name sorts data in ascending or descending order by that column.

Selecting the checkboxes beside one or more nodes makes the following buttons available:

Manage Services - displays the Manage Node Services dialog, which lets you start and stop services on the node.
Remove - displays the Remove Node dialog, which lets you remove the node.
Change Topology - displays the Change Node Topology dialog, which lets you change the topology path for a node.

Selecting the checkbox beside a single node makes the following button available:

Properties - navigates to the Node Properties View, which displays detailed information about a single node.

The dropdown menu at the top left specifies the type of information to display:

Overview - general information about each node


Services - services running on each node
Machine Performance - information about memory, CPU, I/O and RPC performance on each node
Disks - information about disk usage, failed disks, and the MapR-FS heartbeat from each node
MapReduce - information about the JobTracker heartbeat and TaskTracker slots on each node
NFS Nodes - the IP addresses and Virtual IPs assigned to each NFS node
Alarm Status - the status of alarms on each node

Clicking a node's Hostname navigates to the Node Properties View, which provides detailed information about the node.

Selecting the Filter checkbox displays the Filter toolbar, which provides additional data filtering options.

Overview

The Overview displays the following general information about nodes in the cluster:

Hlth - each node's health: healthy, degraded, or critical


Hostname - the hostname of each node
Phys IP(s) - the IP address or addresses associated with each node
FS HB - time since each node's last heartbeat to the CLDB
JT HB - time since each node's last heartbeat to the JobTracker
Physical Topology - the rack path to each node

Services

The Services view displays the following information about nodes in the cluster:

Hlth - each node's health: healthy, degraded, or critical


Hostname - the hostname of each node
Services - a list of the services running on each node
Physical Topology - each node's physical topology

Machine Performance

The Machine Performance view displays the following information about nodes in the cluster:

Hlth - each node's health: healthy, degraded, or critical


Hostname - the hostname of each node
Memory - the percentage of memory used and the total memory
# CPUs - the number of CPUs present on each node
% CPU Idle - the percentage of CPU usage on each node
Bytes Received - the network input
Bytes Sent - the network output
# RPCs - the number of RPC calls
RPC In Bytes - the RPC input, in bytes
RPC Out Bytes - the RPC output, in bytes
# Disk Reads - the number of RPC disk reads
# Disk Writes - the number of RPC disk writes
Disk Read Bytes - the number of bytes read from disk
Disk Write Bytes - the number of bytes written to disk
# Disks - the number of disks present

Disks

The Disks view displays the following information about nodes in the cluster:

Hlth - each node's health: healthy, degraded, or critical


Hostname - the hostname of each node
# bad Disks - the number of failed disks on each node
Usage - the amount of disk used and total disk capacity, in gigabytes

MapReduce

The MapReduce view displays the following information about nodes in the cluster:

Hlth - each node's health: healthy, degraded, or critical


Hostname - the hostname of each node
JT HB - the time since each node's most recent JobTracker heartbeat
TT Map Slots - the number of map slots on each node
TT Map Slots Used - the number of map slots in use on each node
TT Reduce Slots - the number of reduce slots on each node
TT Reduce Slots Used - the number of reduce slots in use on each node

NFS Nodes

The NFS Nodes view displays the following information about nodes in the cluster:

Hlth - each node's health: healthy, degraded, or critical


Hostname - the hostname of each node
Phys IP(s) - the IP address or addresses associated with each node
VIP(s) - the virtual IP address or addresses assigned to each node

Alarm Status

The Alarm Status view displays the following information about nodes in the cluster:

Hlth - each node's health: healthy, degraded, or critical


Hostname - the hostname of each node
Version Alarm - whether the NODE_ALARM_VERSION_MISMATCH alarm is raised
Excess Logs Alarm - whether the NODE_ALARM_DEBUG_LOGGING alarm is raised
Disk Failure Alarm - whether the NODE_ALARM_DISK_FAILURE alarm is raised
Time Skew Alarm - whether the NODE_ALARM_TIME_SKEW alarm is raised
Root Partition Alarm - whether the NODE_ALARM_ROOT_PARTITION_FULL alarm is raised
Installation Directory Alarm - whether the NODE_ALARM_OPT_MAPR_FULL alarm is raised
Core Present Alarm - whether the NODE_ALARM_CORE_PRESENT alarm is raised
CLDB Alarm - whether the NODE_ALARM_SERVICE_CLDB_DOWN alarm is raised
FileServer Alarm - whether the NODE_ALARM_SERVICE_FILESERVER_DOWN alarm is raised
JobTracker Alarm - whether the NODE_ALARM_SERVICE_JT_DOWN alarm is raised
TaskTracker Alarm - whether the NODE_ALARM_SERVICE_TT_DOWN alarm is raised
HBase Master Alarm - whether the NODE_ALARM_SERVICE_HBMASTER_DOWN alarm is raised
HBase Region Alarm - whether the NODE_ALARM_SERVICE_HBREGION_DOWN alarm is raised
NFS Gateway Alarm - whether the NODE_ALARM_SERVICE_NFS_DOWN alarm is raised
WebServer Alarm - whether the NODE_ALARM_SERVICE_WEBSERVER_DOWN alarm is raised

Node Properties View


The Node Properties view displays detailed information about a single node in seven collapsible panes:

Alarms
Machine Performance
General Information
MapReduce
Manage Node Services
MapR-FS and Available Disks
System Disks

Buttons:

Remove Node - displays the Remove Node dialog

Alarms

The Alarms pane displays a list of alarms that have been raised on the system, and the following information about each alarm:

Alarm - the alarm name


Last Raised - the most recent time when the alarm was raised
Summary - a description of the alarm

Machine Performance

The Activity Since Last Heartbeat pane displays the following information about the node's performance and resource usage since it last reported
to the CLDB:

Memory Used - the amount of memory in use on the node


Disk Used - the amount of disk space used on the node
CPU - The number of CPUs and the percentage of CPU used on the node
Network I/O - the input and output to the node per second
RPC I/O - the number of RPC calls on the node and the amount of RPC input and output
Disk I/O - the amount of data read to and written from the disk
# Operations - the number of disk reads and writes

General Information

The General Information pane displays the following general information about the node:

FS HB - the amount of time since the node performed a heartbeat to the CLDB
JT HB - the amount of time since the node performed a heartbeat to the JobTracker
Physical Topology - the rack path to the node

MapReduce

The MapReduce pane displays the number of map and reduce slots used, and the total number of map and reduce slots on the node.

MapR-FS and Available Disks

The MapR-FS and Available Disks pane displays the disks on the node, and the following information about each disk:

Mnt - whether the disk is mounted or unmounted


Disk - the disk name
File System - the file system on the disk
Used - the percentage used and total size of the disk

Clicking the checkbox next to a disk lets you select the disk for addition or removal.
Buttons:

Add Disks to MapR-FS - with one or more disks selected, adds the disks to the MapR-FS storage
Remove Disks from MapR-FS - with one or more disks selected, removes the disks from the MapR-FS storage

System Disks

The System Disks pane displays information about disks present and mounted on the node:

Mnt - whether the disk is mounted


Device - the device name of the disk
File System - the file system
Used - the percentage used and total capacity

Manage Node Services

The Manage Node Services pane displays the status of each service on the node:

Service - the name of each service


State:
0 - NOT_CONFIGURED: the package for the service is not installed and/or the service is not configured (configure.sh has not run)
2 - RUNNING: the service is installed, has been started by the warden, and is currently executing
3 - STOPPED: the service is installed and configure.sh has run, but the service is currently not executing
Log Path - the path where each service stores its logs
Buttons:

Start Service - starts the selected services


Stop Service - stops the selected services
Log Settings - displays the Trace Activity dialog

You can also start and stop services in the Manage Node Services dialog, by clicking Manage Services in the Nodes view.

Trace Activity

The Trace Activity dialog lets you set the log level of a specific service on a particular node.

The Log Level dropdown specifies the logging threshold for messages.

Buttons:

OK - save changes and exit


Close - exit without saving changes

Remove Node
The Remove Node dialog lets you remove the specified node.
The Remove Node dialog contains a radio button that lets you choose how to remove the node:

Shut down all services and then remove - shut down services before removing the node
Remove immediately (-force) - remove the node without shutting down services

Buttons:

Remove Node - removes the node


Cancel - returns to the Node Properties View without removing the node

Manage Node Services


The Manage Node Services dialog lets you start and stop services on the node.

The Service Changes section contains a dropdown menu for each service:

No change - leave the service running if it is running, or stopped if it is stopped


Start - start the service
Stop - stop the service

Buttons:

Change Node - start and stop the selected services as specified by the dropdown menus
Cancel - returns to the Node Properties View without starting or stopping any services

You can also start and stop services in the Manage Node Services pane of the Node Properties view.

Change Node Topology


The Change Node Topology dialog lets you change the rack or switch path for one or more nodes.

The Change Node Topology dialog consists of two panes:

Node(s) to move shows the node or nodes specified in the Nodes view.
New Path contains the following fields:
Path to Change - rack path or switch path
New Path - the new node topology path

The Change Node Topology dialog contains the following buttons:

Move Node - changes the node topology


Close - returns to the Nodes view without changing the node topology

Node Heatmap
The Node Heatmap view displays information about each node, by rack.

The dropdown menu above the heatmap lets you choose the type of information to display. See Cluster Heat Map.

Selecting the Filter checkbox displays the Filter toolbar, which provides additional data filtering options.
MapR-FS
The MapR-FS group provides the following views:

Volumes - information about volumes in the cluster


Mirror Volumes - information about mirrors
User Disk Usage - cluster disk usage
Snapshots - information about volume snapshots
Schedules - information about schedules

Volumes
The Volumes view displays the following information about volumes in the cluster:

Mnt - whether the volume is mounted ( )


Vol Name - the name of the volume
Mount Path - the path where the volume is mounted
Creator - the user or group that owns the volume
Quota - the volume quota
Vol Size - the size of the volume
Snap Size - the size of the volume snapshot
Total Size - the size of the volume and all its snapshots
Replication Factor - the number of copies of the volume
Physical Topology - the rack path to the volume

Clicking any column name sorts data in ascending or descending order by that column.

The Show Unmounted checkbox specifies whether to show unmounted volumes:

selected - show both mounted and unmounted volumes


unselected - show mounted volumes only

The Show System checkbox specifies whether to show system volumes:

selected - show both system and user volumes


unselected - show user volumes only

Selecting the Filter checkbox displays the Filter toolbar, which provides additional data filtering options.

Clicking New Volume displays the New Volume dialog.

Selecting one or more checkboxes next to volumes enables the following buttons:

Remove - displays the Remove Volume dialog


Properties - displays the Volume Properties dialog (becomes Edit X Volumes if more than one checkbox is selected)
Snapshots - displays the Snapshots for Volume dialog
New Snapshot - displays the Snapshot Name dialog

New Volume
The New Volume dialog lets you create a new volume.

For mirror volumes, the Replication & Snapshot Scheduling section is replaced with a section called Replication & Mirror Scheduling.

The Volume Setup section specifies basic information about the volume using the following fields:

Volume Type - a standard volume, or a local or remote mirror volume


Volume Name (required) - a name for the new volume
Mount Path - a path on which to mount the volume
Mounted - whether the volume is mounted at creation
Topology - the new volume's rack topology
Read-only - if checked, prevents writes to the volume

The Ownership & Permissions section lets you grant specific permissions on the volume to certain users or groups:

User/Group field - the user or group to which permissions are to be granted (one user or group per row)
Permissions field - the permissions to grant to the user or group (see the Permissions table below)
Delete button ( ) - deletes the current row
[ + Add Permission ] - adds a new row

Volume Permissions

Code Allowed Action

dump Dump the volume

restore Mirror or restore the volume

m Modify volume properties, create and delete snapshots

d Delete a volume

fc Full control (admin access and permission to change volume ACL)
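These permission codes can also be granted from the command line. The following is a sketch that assumes the maprcli acl commands; the user and volume names are placeholders:

# Grant a user modify and full-control permission on a volume, then verify
maprcli acl edit -type volume -name projects -user jsmith:m,fc
maprcli acl show -type volume -name projects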

The Usage Tracking section sets the accountable entity and quotas for the volume using the following fields:

Group/User - the group/user that is accountable for the volume


Quotas - the volume quotas:
Volume Advisory Quota - if selected, the advisory quota for the volume as an integer plus a single letter to represent the unit
Volume Quota - if selected, the quota for the volume as an integer plus a single letter to represent the unit

The Replication & Snapshot Scheduling section (normal volumes) contains the following fields:

Replication - the desired replication factor for the volume


Minimum Replication - the minimum replication factor for the volume
Snapshot Schedule - determines when snapshots will be automatically created; select an existing schedule from the pop-up menu

The Replication & Mirror Scheduling section (mirror volumes) contains the following fields:

Replication Factor - the desired replication factor for the volume


Actual Replication - what percent of the volume data is replicated once (1x), twice (2x), and so on, respectively
Mirror Update Schedule - determines when mirrors will be automatically updated; select an existing schedule from the pop-up menu
Last Mirror Operation - the status of the most recent mirror operation.

Buttons:

Save - creates the new volume


Close - exits without creating the volume

Remove Volume

The Remove Volume dialog prompts you for confirmation before removing the specified volume or volumes.

Buttons:
Remove Volume - removes the volume or volumes
Cancel - exits without removing the volume or volumes

Volume Properties

The Volume Properties dialog lets you view and edit volume properties.

For mirror volumes, the Replication & Snapshot Scheduling section is replaced with a section called Replication & Mirror Scheduling.
For information about the fields in the Volume Properties dialog, see New Volume.

Snapshots for Volume

The Snapshots for Volume dialog displays the following information about snapshots for the specified volume:

Snapshot Name - the name of the snapshot


Disk Used - the disk space occupied by the snapshot
Created - the date and time the snapshot was created
Expires - the snapshot expiration date and time

Buttons:

New Snapshot - displays the Snapshot Name dialog.


Remove - when the checkboxes beside one or more snapshots are selected, displays the Remove Snapshots dialog
Preserve - when the checkboxes beside one or more snapshots are selected, prevents the snapshots from expiring
Close - closes the dialog

Snapshot Name

The Snapshot Name dialog lets you specify the name for a new snapshot you are creating.
The Snapshot Name dialog creates a new snapshot with the name specified in the following field:

Name For New Snapshot(s) - the new snapshot name

Buttons:

OK - creates a snapshot with the specified name


Cancel - exits without creating a snapshot

Remove Snapshots

The Remove Snapshots dialog prompts you for confirmation before removing the specified snapshot or snapshots.

Buttons

Yes - removes the snapshot or snapshots


No - exits without removing the snapshot or snapshots

Mirror Volumes
The Mirror Volumes view displays information about mirror volumes in the cluster:

Mnt - whether the volume is mounted


Vol Name - the name of the volume
Src Vol - the source volume
Src Clu - the source cluster
Orig Vol - the originating volume for the data being mirrored
Orig Clu - the originating cluster for the data being mirrored
Last Mirrored - the time at which mirroring was most recently completed
- status of the last mirroring operation
% Done - progress of the mirroring operation
Error(s) - any errors that occurred during the last mirroring operation

User Disk Usage


The User Disk Usage view displays information about disk usage by cluster users:

Name - the username


Disk Usage - the total disk space used by the user
# Vols - the number of volumes
Hard Quota - the user's quota
Advisory Quota - the user's advisory quota
Email - the user's email address

Snapshots
The Snapshots view displays the following information about volume snapshots in the cluster:

Snapshot Name - the name of the snapshot


Volume Name - the name of the source volume for the snapshot
Disk Space used - the disk space occupied by the snapshot
Created - the creation date and time of the snapshot
Expires - the expiration date and time of the snapshot

Clicking any column name sorts data in ascending or descending order by that column.

Selecting the Filter checkbox displays the Filter toolbar, which provides additional data filtering options.

Buttons:

Remove Snapshot - when the checkboxes beside one or more snapshots are selected, displays the Remove Snapshots dialog
Preserve Snapshot - when the checkboxes beside one or more snapshots are selected, prevents the snapshots from expiring

Schedules
The Schedules view lets you view and edit schedules, which can then be attached to events to create occurrences. A schedule is a named
group of rules that describe one or more points of time in the future at which an action can be specified to take place.
The left pane of the Schedules view lists the following information about the existing schedules:

Schedule Name - the name of the schedule; clicking a name displays the schedule details in the right pane for editing
In Use - indicates whether the schedule is in use ( ), or attached to an action

The right pane provides the following tools for creating or editing schedules:

Schedule Name - the name of the schedule


Schedule Rules - specifies schedule rules with the following components:
A dropdown that specifies frequency (Once, Yearly, Monthly, Weekly, Daily, Hourly, Every X minutes)
Dropdowns that specify the time within the selected frequency
Retain For - the time for which the scheduled snapshot or mirror data is to be retained after creation
[ +Add Rule ] - adds another rule to the schedule

Navigating away from a schedule with unsaved changes displays the Save Schedule dialog.

Buttons:

New Schedule - starts editing a new schedule


Remove Schedule - displays the Remove Schedule dialog
Save Schedule - saves changes to the current schedule
Cancel - cancels changes to the current schedule

Remove Schedule

The Remove Schedule dialog prompts you for confirmation before removing the specified schedule.

Buttons

Yes - removes the schedule


No - exits without removing the schedule

NFS HA
The NFS view group provides the following views:

NFS Setup - information about NFS nodes in the cluster


VIP Assignments - information about virtual IP addresses (VIPs) in the cluster
NFS Nodes - information about NFS nodes in the cluster

NFS Setup
The NFS Setup view displays information about NFS nodes in the cluster and any VIPs assigned to them:

Starting VIP - the starting IP of the VIP range


Ending VIP - the ending IP of the VIP range
Node Name(s) - the names of the NFS nodes
IP Address(es) - the IP addresses of the NFS nodes
MAC Address(es) - the MAC addresses associated with the IP addresses

Buttons:
Start NFS - displays the Manage Node Services dialog
Add VIP - displays the Add Virtual IPs dialog
Edit - when one or more checkboxes are selected, edits the specified VIP ranges
Remove - when one or more checkboxes are selected, removes the specified VIP ranges
Unconfigured Nodes - displays nodes not running the NFS service (in the Nodes view)
VIP Assignments - displays the VIP Assignments view

VIP Assignments
The VIP Assignments view displays VIP assignments beside the nodes to which they are assigned:

Virtual IP Address - each VIP in the range


Node Name - the node to which the VIP is assigned
IP Address - the IP address of the node
MAC Address - the MAC address associated with the IP address

Buttons:

Start NFS - displays the Manage Node Services dialog


Add VIP - displays the Add Virtual IPs dialog
Unconfigured Nodes - displays nodes not running the NFS service (in the Nodes view)

NFS Nodes
The NFS Nodes view displays information about nodes running the NFS service:

Hlth - the health of the node


Hostname - the hostname of the node
Phys IP(s) - physical IP addresses associated with the node
VIP(s) - virtual IP addresses associated with the node

Buttons:

Properties - when one or more nodes are selected, navigates to the Node Properties View
Manage Services - navigates to the Manage Node Services dialog, which lets you start and stop services on the node
Remove - navigates to the Remove Node dialog, which lets you remove the node
Change Topology - navigates to the Change Node Topology dialog, which lets you change the rack or switch path for a node

Alarms
The Alarms view group provides the following views:

Node Alarms - information about node alarms in the cluster


Volume Alarms - information about volume alarms in the cluster
User/Group Alarms - information about users or groups that have exceeded quotas
Alarm Notifications - configure where notifications are sent when alarms are raised

Node Alarms
The Node Alarms view displays information about node alarms in the cluster.

Hlth - a color indicating the status of each node (see Cluster Heat Map)
Hostname - the hostname of the node
Version Alarm - last occurrence of the NODE_ALARM_VERSION_MISMATCH alarm
Excess Logs Alarm - last occurrence of the NODE_ALARM_DEBUG_LOGGING alarm
Disk Failure Alarm - last occurrence of the NODE_ALARM_DISK_FAILURE alarm
Time Skew Alarm - last occurrence of the NODE_ALARM_TIME_SKEW alarm
Root Partition Alarm - last occurrence of the NODE_ALARM_ROOT_PARTITION_FULL alarm
Installation Directory Alarm - last occurrence of the NODE_ALARM_OPT_MAPR_FULL alarm
Core Present Alarm - last occurrence of the NODE_ALARM_CORE_PRESENT alarm
CLDB Alarm - last occurrence of the NODE_ALARM_SERVICE_CLDB_DOWN alarm
FileServer Alarm - last occurrence of the NODE_ALARM_SERVICE_FILESERVER_DOWN alarm
JobTracker Alarm - last occurrence of the NODE_ALARM_SERVICE_JT_DOWN alarm
TaskTracker Alarm - last occurrence of the NODE_ALARM_SERVICE_TT_DOWN alarm
HBase Master Alarm - last occurrence of the NODE_ALARM_SERVICE_HBMASTER_DOWN alarm
HBase Regionserver Alarm - last occurrence of the NODE_ALARM_SERVICE_HBREGION_DOWN alarm
NFS Gateway Alarm - last occurrence of the NODE_ALARM_SERVICE_NFS_DOWN alarm
WebServer Alarm - last occurrence of the NODE_ALARM_SERVICE_WEBSERVER_DOWN alarm
Hoststats Alarm - last occurrence of the NODE_ALARM_SERVICE_HOSTSTATS_DOWN alarm

See Troubleshooting Alarms.

Clicking any column name sorts data in ascending or descending order by that column.

The left pane of the Node Alarms view displays the following information about the cluster:

Topology - the rack topology of the cluster

Selecting the Filter checkbox displays the Filter toolbar, which provides additional data filtering options.

Clicking a node's Hostname navigates to the Node Properties View, which provides detailed information about the node.

Buttons:

Properties - navigates to the Node Properties View


Remove - navigates to the Remove Node dialog, which lets you remove the node
Manage Services - navigates to the Manage Node Services dialog, which lets you start and stop services on the node
Change Topology - navigates to the Change Node Topology dialog, which lets you change the rack or switch path for a node

Volume Alarms
The Volume Alarms view displays information about volume alarms in the cluster:

Mnt - whether the volume is mounted


Vol Name - the name of the volume
Snapshot Alarm - last Snapshot Failed alarm
Mirror Alarm - last Mirror Failed alarm
Replication Alarm - last Data Under-Replicated alarm
Data Alarm - last Data Unavailable alarm
Vol Advisory Quota Alarm - last Volume Advisory Quota Exceeded alarm
Vol Quota Alarm - last Volume Quota Exceeded alarm

Clicking any column name sorts data in ascending or descending order by that column. Clicking a volume name displays the Volume Properties
dialog.

Selecting the Show Unmounted checkbox shows unmounted volumes as well as mounted volumes.

Selecting the Filter checkbox displays the Filter toolbar, which provides additional data filtering options.

Buttons:

New Volume displays the New Volume Dialog.


Properties - if the checkboxes beside one or more volumes are selected, displays the Volume Properties dialog
Mount (Unmount) - if an unmounted volume is selected, mounts it; if a mounted volume is selected, unmounts it
Remove - if the checkboxes beside one or more volumes are selected, displays the Remove Volume dialog
Start Mirroring - if a mirror volume is selected, starts the mirror sync process
Snapshots - if the checkboxes beside one or more volumes are selected, displays the Snapshots for Volume dialog
New Snapshot - if the checkboxes beside one or more volumes are selected, displays the Snapshot Name dialog

User/Group Alarms
The User/Group Alarms view displays information about user and group quota alarms in the cluster:

Name - the name of the user or group


User Advisory Quota Alarm - the last Advisory Quota Exceeded alarm
User Quota Alarm - the last Quota Exceeded alarm

Buttons:

Edit Properties

Alarm Notifications
The Configure Global Alarm Notifications dialog lets you specify where email notifications are sent when alarms are raised.

Fields:

Alarm Name - select the alarm to configure


Standard Notification - send notification to the default for the alarm type (the cluster administrator or volume creator, for example)
Additional Email Address - specify an additional custom email address to receive notifications for the alarm type

Buttons:

Save - save changes and exit


Close - exit without saving changes

System Settings
The System Settings view group provides the following views:

Email Addresses - specify MapR user email addresses


Permissions - give permissions to users
Quota Defaults - settings for default quotas in the cluster
SMTP - settings for sending email from MapR
HTTP - settings for accessing the MapR Control System via a browser
MapR Licenses - MapR license settings

Email Addresses
The Configure Email Addresses dialog lets you specify whether MapR gets user email addresses from an LDAP directory, or uses a company
domain:

Use Company Domain - specify a domain to append after each username to determine each user's email address
Use LDAP - obtain each user's email address from an LDAP server

Buttons:

Save - save changes and exit


Close - exit without saving changes

Permissions
The Edit Permissions dialog lets you grant specific cluster permissions to particular users and groups.

User/Group field - the user or group to which permissions are to be granted (one user or group per row)
Permissions field - the permissions to grant to the user or group (see the Permissions table below)
Delete button ( ) - deletes the current row
[ + Add Permission ] - adds a new row

Cluster Permissions

Code Allowed Action Includes

login Log in to the MapR Control System, use the API and command-line interface, read access on cluster and volumes cv

ss Start/stop services

cv Create volumes

a Admin access All permissions except fc

fc Full control (administrative access and permission to change the cluster ACL) a
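As with volume permissions, these codes can also be granted from the command line. The following is a sketch that assumes the maprcli acl commands; the username is a placeholder:

# Grant a user login and create-volume permission on the cluster, then verify
maprcli acl edit -type cluster -user jsmith:login,cv
maprcli acl show -type cluster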

Buttons:

OK - save changes and exit


Close - exit without saving changes

Quota Defaults
The Configure Quota Defaults dialog lets you set the default quotas that apply to users and groups.
The User Quota Defaults section contains the following fields:

Default User Advisory Quota - if selected, sets the advisory quota that applies to all users without an explicit advisory quota.
Default User Total Quota - if selected, sets the total quota that applies to all users without an explicit total quota.

The Group Quota Defaults section contains the following fields:

Default Group Advisory Quota - if selected, sets the advisory quota that applies to all groups without an explicit advisory quota.
Default Group Total Quota - if selected, sets the total quota that applies to all groups without an explicit total quota.

Buttons:

Save - saves the settings


Close - exits without saving the settings

SMTP
The Configure Sending Email dialog lets you configure the email account from which the MapR cluster sends alerts and other notifications.

The Configure Sending Email (SMTP) dialog contains the following fields:

Provider - selects Gmail or another email provider; if you select Gmail, the other fields are partially populated to help you with the
configuration
SMTP Server - the SMTP server to use when sending email
The server requires an encrypted connection (SSL) - use SSL when connecting to the SMTP server
SMTP Port - the port to use on the SMTP server
Full Name - the name used in the From field when the cluster sends an alert email
Email Address - the email address used in the From field when the cluster sends an alert email
Username - the username used to log on to the email account the cluster uses to send email
SMTP Password - the password to use when sending email

Buttons:

Save - saves the settings


Close - exits without saving the settings

HTTP
The Configure HTTP dialog lets you configure access to the MapR Control System via HTTP and HTTPS.

The sections in the Configure HTTP dialog let you enable HTTP and HTTPS access, and set the session timeout, respectively:

Enable HTTP Access - if selected, configure HTTP access with the following field:
HTTP Port - the port on which to connect to the MapR Control System via HTTP
Enable HTTPS Access - if selected, configure HTTPS access with the following fields:
HTTPS Port - the port on which to connect to the MapR Control System via HTTPS
HTTPS Keystore Path - a path to the HTTPS keystore
HTTPS Keystore Password - a password to access the HTTPS keystore
HTTPS Key Password - a password to access the HTTPS key
Session Timeout - the number of seconds before an idle session times out.

Buttons:

Save - saves the settings


Close - exits without saving the settings

MapR Licenses
The MapR License Management dialog lets you add and activate licenses for the cluster, and displays the Cluster ID and the following information
about existing licenses:

Name - the name of each license


Issued - the date each license was issued
Expires - the expiration date of each license
Nodes - the nodes to which each license applies

If installing a new cluster, make sure to install the latest version of MapR software. If applying a new license to an existing MapR
cluster, make sure to upgrade to the latest version of MapR first. If you are not sure, check the contents of the file
MapRBuildVersion in the /opt/mapr directory. If the version is 1.0.0 and includes GA, then you must upgrade before
applying a license. Example:

# cat /opt/mapr/MapRBuildVersion
1.0.0.10178GA-0v

For information about upgrading the cluster, see Cluster Upgrade.

Fields:

Cluster ID - the unique identifier needed for licensing the cluster

Buttons:

Add Licenses via Web - navigates to the MapR licensing form online
Add License via Upload - alternate licensing mechanism: upload via browser
Add License via Copy/Paste - alternate licensing mechanism: paste license key
Apply Licenses - validates the licenses and applies them to the cluster
Close - closes the dialog.

Other Views
In addition to the MapR Control System views, there are views that display detailed information about the system:

CLDB View - information about the container location database


HBase View - information about HBase on the cluster
JobTracker View - information about the JobTracker
Nagios - generates a Nagios script
Terminal View - an ssh terminal for logging in to the cluster

With the exception of the MapR Launchpad, the above views include the following buttons:

- Refresh Button (refreshes the view)

- Popout Button (opens the view in a new browser window)

CLDB View
The CLDB view provides information about the Container Location Database (CLDB). To display the CLDB view, open the MapR Control System
and click CLDB in the navigation pane.

The following table describes the fields on the CLDB view:

Field Description

CLDB Mode

CLDB BuildVersion

CLDB Status

Cluster Capacity

Cluster Used

Cluster Available

Active FileServers A list of FileServers, and the following information about each:

ServerID (Hex) -
ServerID -
HostPort -
HostName -
Network Location -
Capacity (MB) -
Used (MB) -
Available (MB) -
Last Heartbeat (s) -
State -
Type -

Volumes A list of volumes, and the following information about each:

Volume Name -
Mount Point -
Mounted -
ReadOnly -
Volume ID -
Volume Topology -
Quota -
Advisory Quota -
Used -
LogicalUsed -
Root Container ID -
Replication -
Guaranteed Replication -
Num Containers -

Accounting Entities A list of users and groups, and the following information about each:

AE Name -
AE Type -
AE Quota -
AE Advisory Quota -
AE Used -
Mirrors A list of mirrors, and the following information about each:

Mirror Volume Name -


Mirror ID -
Mirror NextID -
Mirror Status -
Last Successful Mirror Time -
Mirror SrcVolume -
Mirror SrcRootContainerID -
Mirror SrcClusterName -
Mirror SrcSnapshot -
Mirror DataGenerator Volume -

Snapshots A list of snapshots, and the following information about each:

Snapshot ID -
RW Volume ID -
Snapshot Name -
Root Container ID -
Snapshot Size -
Snapshot InProgress -

Containers A list of containers, and the following information about each:

Container ID -
Volume ID -
Latest Epoch -
SizeMB -
Container Master Location -
Container Locations -
Inactive Locations -
Unused Locations -
Replication Type -

Snapshot Containers A list of snapshot containers, and the following information about each:

Snapshot Container ID - unique ID of the container


Snapshot ID - ID of the snapshot corresponding to the container
RW Container ID - corresponding source container ID
Latest Epoch -
SizeMB - container size, in MB
Container Master Location - location of the container's master replica
Container Locations -
Inactive Locations -

HBase View
The HBase View provides information about HBase on the cluster.

Field Description

Local Logs A link to the HBase Local Logs View

Thread Dump A link to the HBase Thread Dump View

Log Level A link to the HBase Log Level View, a form for getting/setting the log level

Master Attributes A list of attributes, and the following information about each:

Attribute Name -
Value -
Description -

Catalog Tables A list of tables, and the following information about each:

Table -
Description -
User Tables

Region Servers A list of region servers in the cluster, and the following information about each:

Address -
Start Code -
Load -
Total -

HBase Local Logs View

The HBase Local Logs view displays a list of the local HBase logs. Clicking a log name displays the contents of the log. Each log name can be
copied and pasted into the HBase Log Level View to get or set the current log level.

HBase Log Level View

The HBase Log Level View is a form for getting and setting log levels that determine which information gets logged. The Log field accepts a log
name (which can be copied from the HBase Local Logs View and pasted). The Level field takes any of the following valid log levels:

ALL
TRACE
DEBUG
INFO
WARN
ERROR
OFF

HBase Thread Dump View

The HBase Thread Dump View displays a dump of the HBase thread.

Example:
Process Thread Dump: 40 active threads Thread 318 (1962516546@qtp-879081272-3): State: RUNNABLE
Blocked count: 8 Waited count: 32 Stack: sun.management.ThreadImpl.getThreadInfo0(Native Method)
sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:149)
org.apache.hadoop.http.HttpServer$StackServlet.doGet(HttpServer.java:695)
javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:826)
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
org.mortbay.jetty.Server.handle(Server.java:326)
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) Thread 50
(perfnode51.perf.lab:60000-CatalogJanitor): State: TIMED_WAITING Blocked count: 1081 Waited count:
1350 Stack: java.lang.Object.wait(Native Method)
org.apache.hadoop.hbase.util.Sleeper.sleep(Sleeper.java:91)
org.apache.hadoop.hbase.Chore.run(Chore.java:74) Thread 49 (perfnode51.perf.lab:60000-BalancerChore):
State: TIMED_WAITING Blocked count: 0 Waited count: 270 Stack:
java.lang.Object.wait(Native Method) org.apache.hadoop.hbase.util.Sleeper.sleep(Sleeper.java:91)
org.apache.hadoop.hbase.Chore.run(Chore.java:74) Thread 48
(MASTER_OPEN_REGION-perfnode51.perf.lab:60000-1): State: WAITING Blocked
count: 2 Waited count: 3 Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6d1cf4e5 Stack:
sun.misc.Unsafe.park(Native Method) java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)

JobTracker View

Field Description

State

Started

Version

Compiled

Identifier

Cluster Summary The heapsize, and the following information about the cluster:

Running Map Tasks -


Running Reduce Tasks -
Total Submissions -
Nodes -
Occupied Map Slots -
Occupied Reduce Slots -
Reserved Map Slots -
Reserved Reduce Slots -
Map Task Capacity -
Reduce Task Capacity -
Avg. Tasks/Node -
Blacklisted Nodes -
Excluded Nodes -
MapTask Prefetch Capacity -
Scheduling Information A list of queues, and the following information about each:

Queue name -
State -
Scheduling Information -

Filter A field for filtering results by Job ID, Priority, User, or Name

Running Jobs A list of running MapReduce jobs, and the following information about each:

JobId -
Priority -
User -
Name -
Start Time -
Map % Complete -
Current Map Slots -
Failed MapAttempts -
MapAttempt Time Avg/Max -
Cumulative Map CPU -
Current Map PMem -
Reduce % Complete -
Current Reduce Slots -
Failed ReduceAttempts -
ReduceAttempt Time Avg/Max
Cumulative Reduce CPU -
Current Reduce PMem -

Completed Jobs A list of completed MapReduce jobs, and the following information about each:

JobId -
Priority -
User -
Name -
Start Time -
Total Time -
Maps Launched -
Map Total -
Failed MapAttempts -
MapAttempt Time Avg/Max -
Cumulative Map CPU -
Reducers Launched -
Reduce Total -
Failed ReduceAttempts -
ReduceAttempt Time Avg/Max -
Cumulative Reduce CPU -
Cumulative Reduce PMem -
Vaidya Reports -

Retired Jobs A list of retired MapReduce jobs, and the following information about each:

JobId -
Priority -
User -
Name -
State -
Start Time -
Finish Time -
Map % Complete -
Reduce % Complete -
Job Scheduling Information -
Diagnostic Info -

Local Logs A link to the local logs

JobTracker Configuration A link to a page containing Hadoop JobTracker configuration values

JobTracker Configuration View

Field Description
fs.automatic.close TRUE

fs.checkpoint.dir ${hadoop.tmp.dir}/dfs/namesecondary

fs.checkpoint.edits.dir ${fs.checkpoint.dir}

fs.checkpoint.period 3600

fs.checkpoint.size 67108864

fs.default.name maprfs:///

fs.file.impl org.apache.hadoop.fs.LocalFileSystem

fs.ftp.impl org.apache.hadoop.fs.ftp.FTPFileSystem

fs.har.impl org.apache.hadoop.fs.HarFileSystem

fs.har.impl.disable.cache TRUE

fs.hdfs.impl org.apache.hadoop.hdfs.DistributedFileSystem

fs.hftp.impl org.apache.hadoop.hdfs.HftpFileSystem

fs.hsftp.impl org.apache.hadoop.hdfs.HsftpFileSystem

fs.kfs.impl org.apache.hadoop.fs.kfs.KosmosFileSystem

fs.maprfs.impl com.mapr.fs.MapRFileSystem

fs.ramfs.impl org.apache.hadoop.fs.InMemoryFileSystem

fs.s3.block.size 67108864

fs.s3.buffer.dir ${hadoop.tmp.dir}/s3

fs.s3.impl org.apache.hadoop.fs.s3.S3FileSystem

fs.s3.maxRetries 4

fs.s3.sleepTimeSeconds 10

fs.s3n.block.size 67108864

fs.s3n.impl org.apache.hadoop.fs.s3native.NativeS3FileSystem

fs.trash.interval 0

hadoop.job.history.location file:////opt/mapr/hadoop/hadoop-0.20.2/bin/../logs/history

hadoop.logfile.count 10

hadoop.logfile.size 10000000

hadoop.native.lib TRUE

hadoop.proxyuser.root.groups root

hadoop.proxyuser.root.hosts

hadoop.rpc.socket.factory.class.default org.apache.hadoop.net.StandardSocketFactory

hadoop.security.authentication simple

hadoop.security.authorization FALSE

hadoop.security.group.mapping org.apache.hadoop.security.ShellBasedUnixGroupsMapping

hadoop.tmp.dir /tmp/hadoop-${user.name}

hadoop.util.hash.type murmur

io.bytes.per.checksum 512

io.compression.codecs org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.

io.file.buffer.size 8192
io.map.index.skip 0

io.mapfile.bloom.error.rate 0.005

io.mapfile.bloom.size 1048576

io.seqfile.compress.blocksize 1000000

io.seqfile.lazydecompress TRUE

io.seqfile.sorter.recordlimit 1000000

io.serializations org.apache.hadoop.io.serializer.WritableSerialization

io.skip.checksum.errors FALSE

io.sort.factor 256

io.sort.record.percent 0.17

io.sort.spill.percent 0.99

ipc.client.connect.max.retries 10

ipc.client.connection.maxidletime 10000

ipc.client.idlethreshold 4000

ipc.client.kill.max 10

ipc.client.tcpnodelay FALSE

ipc.server.listen.queue.size 128

ipc.server.tcpnodelay FALSE

job.end.retry.attempts 0

job.end.retry.interval 30000

jobclient.completion.poll.interval 5000

jobclient.output.filter FAILED

jobclient.progress.monitor.poll.interval 1000

keep.failed.task.files FALSE

local.cache.size 10737418240

map.sort.class org.apache.hadoop.util.QuickSort

mapr.localoutput.dir output

mapr.localspill.dir spill

mapr.localvolumes.path /var/mapr/local

mapred.acls.enabled FALSE

mapred.child.oom_adj 10

mapred.child.renice 10

mapred.child.taskset TRUE

mapred.child.tmp ./tmp

mapred.cluster.ephemeral.tasks.memory.limit.mb 200

mapred.compress.map.output FALSE

mapred.fairscheduler.allocation.file conf/pools.xml

mapred.fairscheduler.assignmultiple TRUE

mapred.fairscheduler.eventlog.enabled FALSE
mapred.fairscheduler.smalljob.max.inputsize 10737418240

mapred.fairscheduler.smalljob.max.maps 10

mapred.fairscheduler.smalljob.max.reducer.inputsize 1073741824

mapred.fairscheduler.smalljob.max.reducers 10

mapred.fairscheduler.smalljob.schedule.enable TRUE

mapred.healthChecker.interval 60000

mapred.healthChecker.script.timeout 600000

mapred.inmem.merge.threshold 1000

mapred.job.queue.name default

mapred.job.reduce.input.buffer.percent 0

mapred.job.reuse.jvm.num.tasks -1

mapred.job.shuffle.input.buffer.percent 0.7

mapred.job.shuffle.merge.percent 0.66

mapred.job.tracker ted-desk.perf.lab:9001

mapred.job.tracker.handler.count 10

mapred.job.tracker.history.completed.location /var/mapr/cluster/mapred/jobTracker/history/done

mapred.job.tracker.http.address 0.0.0.0:50030

mapred.job.tracker.persist.jobstatus.active FALSE

mapred.job.tracker.persist.jobstatus.dir /var/mapr/cluster/mapred/jobTracker/jobsInfo

mapred.job.tracker.persist.jobstatus.hours 0

mapred.jobtracker.completeuserjobs.maximum 100

mapred.jobtracker.instrumentation org.apache.hadoop.mapred.JobTrackerMetricsInst

mapred.jobtracker.job.history.block.size 3145728

mapred.jobtracker.jobhistory.lru.cache.size 5

mapred.jobtracker.maxtasks.per.job -1

mapred.jobtracker.port 9001

mapred.jobtracker.restart.recover TRUE

mapred.jobtracker.retiredjobs.cache.size 1000

mapred.jobtracker.taskScheduler org.apache.hadoop.mapred.FairScheduler

mapred.line.input.format.linespermap 1

mapred.local.dir ${hadoop.tmp.dir}/mapred/local

mapred.local.dir.minspacekill 0

mapred.local.dir.minspacestart 0

mapred.map.child.java.opts -XX:ErrorFile=/opt/cores/mapreduce_java_error%p.log

mapred.map.max.attempts 4

mapred.map.output.compression.codec org.apache.hadoop.io.compress.DefaultCodec

mapred.map.tasks 2

mapred.map.tasks.speculative.execution FALSE

mapred.max.maps.per.node -1
mapred.max.reduces.per.node -1

mapred.max.tracker.blacklists 4

mapred.max.tracker.failures 4

mapred.merge.recordsBeforeProgress 10000

mapred.min.split.size 0

mapred.output.compress FALSE

mapred.output.compression.codec org.apache.hadoop.io.compress.DefaultCodec

mapred.output.compression.type RECORD

mapred.queue.names default

mapred.reduce.child.java.opts -XX:ErrorFile=/opt/cores/mapreduce_java_error%p.log

mapred.reduce.copy.backoff 300

mapred.reduce.max.attempts 4

mapred.reduce.parallel.copies 12

mapred.reduce.slowstart.completed.maps 0.95

mapred.reduce.tasks 1

mapred.reduce.tasks.speculative.execution FALSE

mapred.running.map.limit -1

mapred.running.reduce.limit -1

mapred.skip.attempts.to.start.skipping 2

mapred.skip.map.auto.incr.proc.count TRUE

mapred.skip.map.max.skip.records 0

mapred.skip.reduce.auto.incr.proc.count TRUE

mapred.skip.reduce.max.skip.groups 0

mapred.submit.replication 10

mapred.system.dir /var/mapr/cluster/mapred/jobTracker/system

mapred.task.cache.levels 2

mapred.task.profile FALSE

mapred.task.profile.maps 0-2

mapred.task.profile.reduces 0-2

mapred.task.timeout 600000

mapred.task.tracker.http.address 0.0.0.0:50060

mapred.task.tracker.report.address 127.0.0.1:0

mapred.task.tracker.task-controller org.apache.hadoop.mapred.DefaultTaskController

mapred.tasktracker.dns.interface default

mapred.tasktracker.dns.nameserver default

mapred.tasktracker.ephemeral.tasks.maximum 1

mapred.tasktracker.ephemeral.tasks.timeout 10000

mapred.tasktracker.ephemeral.tasks.ulimit 4294967296

mapred.tasktracker.expiry.interval 600000
mapred.tasktracker.indexcache.mb 10

mapred.tasktracker.instrumentation org.apache.hadoop.mapred.TaskTrackerMetricsInst

mapred.tasktracker.map.tasks.maximum (CPUS > 2) ? (CPUS * 0.75) : 1

mapred.tasktracker.reduce.tasks.maximum (CPUS > 2) ? (CPUS * 0.50): 1

mapred.tasktracker.taskmemorymanager.monitoring-interval 5000

mapred.tasktracker.tasks.sleeptime-before-sigkill 5000

mapred.temp.dir ${hadoop.tmp.dir}/mapred/temp

mapred.userlog.limit.kb 0

mapred.userlog.retain.hours 24

mapreduce.heartbeat.10 300

mapreduce.heartbeat.100 1000

mapreduce.heartbeat.1000 10000

mapreduce.heartbeat.10000 100000

mapreduce.job.acl-view-job

mapreduce.job.complete.cancel.delegation.tokens TRUE

mapreduce.job.split.metainfo.maxsize 10000000

mapreduce.jobtracker.recovery.dir /var/mapr/cluster/mapred/jobTracker/recovery

mapreduce.jobtracker.recovery.maxtime 120

mapreduce.jobtracker.staging.root.dir /var/mapr/cluster/mapred/jobTracker/staging

mapreduce.maprfs.use.compression TRUE

mapreduce.reduce.input.limit -1

mapreduce.tasktracker.outofband.heartbeat FALSE

mapreduce.tasktracker.prefetch.maptasks 1

mapreduce.use.fastreduce FALSE

mapreduce.use.maprfs TRUE

tasktracker.http.threads 2

topology.node.switch.mapping.impl org.apache.hadoop.net.ScriptBasedMapping

topology.script.number.args 100

webinterface.private.actions FALSE

MapR Launchpad
The MapR Launchpad showcases big data solutions from vendors who work with MapR.

Nagios View
The Nagios view displays a dialog containing a Nagios configuration script.
Example:

############# Commands #############

define command {
command_name check_fileserver_proc
command_line $USER1$/check_tcp -p 5660
}

define command {
command_name check_cldb_proc
command_line $USER1$/check_tcp -p 7222
}

define command {
command_name check_jobtracker_proc
command_line $USER1$/check_tcp -p 50030
}

define command {
command_name check_tasktracker_proc
command_line $USER1$/check_tcp -p 50060
}

define command {
command_name check_nfs_proc
command_line $USER1$/check_tcp -p 2049
}

define command {
command_name check_hbmaster_proc
command_line $USER1$/check_tcp -p 60000
}

define command {
command_name check_hbregionserver_proc
command_line $USER1$/check_tcp -p 60020
}

define command {
command_name check_webserver_proc
command_line $USER1$/check_tcp -p 8443
}
################# HOST: perfnode51.perf.lab ###############

define host {
use linux-server
host_name perfnode51.perf.lab
address 10.10.30.51
check_command check-host-alive
}

################# HOST: perfnode52.perf.lab ###############

define host {
use linux-server
host_name perfnode52.perf.lab
address 10.10.30.52
check_command check-host-alive
}

################# HOST: perfnode53.perf.lab ###############

define host {
use linux-server
host_name perfnode53.perf.lab
address 10.10.30.53
check_command check-host-alive
}

################# HOST: perfnode54.perf.lab ###############

define host {
use linux-server
host_name perfnode54.perf.lab
address 10.10.30.54
check_command check-host-alive
}

################# HOST: perfnode55.perf.lab ###############

define host {
use linux-server
host_name perfnode55.perf.lab
address 10.10.30.55
check_command check-host-alive
}

################# HOST: perfnode56.perf.lab ###############

define host {
use linux-server
host_name perfnode56.perf.lab
address 10.10.30.56
check_command check-host-alive
}

Terminal View
The ssh terminal provides access to the command line.

API Reference

Overview
This guide provides information about the MapR command API. Most commands can be run from the command-line interface (CLI) or by making
REST requests programmatically or in a browser. To run CLI commands, use a client machine or an ssh connection to any node in the cluster.
To use the REST interface, make HTTP requests to a node that is running the WebServer service.

Each command reference page includes the command syntax, a table that describes the parameters, and examples of command usage. In each
parameter table, required parameters are in bold text. For output commands, the reference pages include tables that describe the output fields.
Values that do not apply to particular combinations are marked NA.

REST API Syntax


MapR REST calls use the following format:

https://<host>:<port>/rest/<command>[/<subcommand>...]?<parameters>

Construct the <parameters> list from the required and optional parameters, in the format <parameter>=<value> separated by the
ampersand (&) character. Example:

https://r1n1.qa.sj.ca.us:8443/rest/volume/mount?name=test-volume&path=/test
Values in REST API calls must be URL-encoded. For readability, the values in this document are presented using the actual characters, rather
than the URL-encoded versions.

Authentication

To make REST calls using curl or wget, provide the username and password.

Curl Syntax

curl -k -u <username>:<password> https://<host>:<port>/rest/<command>...

Wget Syntax

wget --no-check-certificate --user <username> --password <password> https://<host>:<port>/rest/<command>...
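
For example, assuming the placeholder host r1n1.sj.us used throughout this guide, with the WebServer listening on port 8443 and a cluster user named mapr (both names are illustrative), the following calls retrieve the dashboard summary:

curl -k -u mapr:<password> https://r1n1.sj.us:8443/rest/dashboard/info

wget --no-check-certificate --user mapr --password <password> https://r1n1.sj.us:8443/rest/dashboard/info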

Command-Line Interface (CLI) Syntax


The MapR CLI commands are documented using the following conventions:

[Square brackets] indicate an optional parameter


<Angle brackets> indicate a value to enter

The following syntax example shows that the volume mount command requires the -name parameter, for which you must enter a list of
volumes, and all other parameters are optional:

maprcli volume mount
[ -cluster <cluster> ]
-name <volume list>
[ -path <path list> ]

For clarity, the syntax examples show each parameter on a separate line; in practical usage, the command and all parameters and options are
typed on a single line. Example:

maprcli volume mount -name test-volume -path /test

Common Parameters
The following parameters are available for many commands in both the REST and command-line contexts.

Parameter Description

cluster The cluster on which to run the command. If this parameter is omitted, the command is run on the same cluster where it is issued.
In multi-cluster contexts, you can use this parameter to specify a different cluster on which to run the command.

zkconnect A ZooKeeper connect string, which specifies a list of the hosts running ZooKeeper, and the port to use on each, in the format:
'<host>[:<port>][,<host>[:<port>]...]' Default: 'localhost:5181' In most cases the ZooKeeper connect string
can be omitted, but it is useful in certain cases when the CLDB is not running.
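
For example (the cluster name my.cluster.com and the ZooKeeper host names below are placeholders), the following commands run dashboard info on a specific cluster and list nodes using an explicit ZooKeeper connect string:

maprcli dashboard info -cluster my.cluster.com
maprcli node list -zkconnect 'r1n1:5181,r1n2:5181,r1n3:5181'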

Common Options
The following options are available for most commands in the command-line context.

Option Description

-noheader When displaying tabular output from a command, omits the header row.

-long Shows the entire value. This is useful when the command response contains complex information. When -long is omitted, complex
information is displayed as an ellipsis (...).

-json Displays command output in JSON format. When -json is omitted, the command output is displayed in tabular format.
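
For example, the following commands list the nodes in the cluster without a header row, and as JSON output suitable for scripting:

maprcli node list -noheader
maprcli node list -json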

Filters
Some MapR CLI commands use filters, which let you specify large numbers of nodes or volumes by matching specified values in specified fields
rather than by typing each name explicitly.

Filters use the following format:

[<field><operator>"<value>"]<and|or>[<field><operator>"<value>"] ...

field Field on which to filter. The field depends on the command with which the filter is used.

operator An operator for that field:

== - Exact match
!= - Does not match
> - Greater than
< - Less than
>= - Greater than or equal to
<= - Less than or equal to

value Value on which to filter. Wildcards (using *) are allowed for operators == and !=. There is a special value all that matches all
values.

You can use the wildcard (*) for partial matches. For example, you can display all volumes whose owner is root and whose name begins with
test as follows:

maprcli volume list -filter [n=="test*"]and[on=="root"]
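
Filter expressions can also be joined with or. For example, the following command (illustrative only) lists volumes whose names begin with either test or dev:

maprcli volume list -filter [n=="test*"]or[n=="dev*"]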

Response
The commands return responses in JSON or in a tabular format. When you run commands from the command line, the response is returned in
tabular format unless you specify JSON using the -json option; when you run commands through the REST interface, the response is returned in
JSON.

Success

On a successful call, each command returns the error code zero (OK) and any data requested. When JSON output is specified, the data is
returned as an array of records along with the status code and the total number of records. In the tabular format, the data is returned as a
sequence of rows, each of which contains the fields in the record separated by tabs.

JSON
{
"status":"OK",
"total":<number of records>,
"data":[
{
<record>
}
...
]
}

Tabular
status
0

Or
<heading> <heading> <heading> ...
<field> <field> <field> ...
...

Error

When an error occurs, the command returns the error code and descriptive message.
JSON
{
"status":"ERROR",
"errors":[
{
"id":<error code>,
"desc":"<command>: <error message>"
}
]
}

Tabular
ERROR (<error code>) - <command>: <error message>

acl
The acl commands let you work with access control lists (ACLs):

acl edit - modifies a specific user's access to a cluster or volume
acl set - modifies the ACL for a cluster or volume
acl show - displays the ACL associated with a cluster or volume

In order to use the acl edit command, you must have full control (fc) permission on the cluster or volume for which you are running the
command.

Specifying Permissions

Specify permissions for a user or group with a string that lists the permissions for that user or group. To specify permissions for multiple users or
groups, use a string for each, separated by spaces. The format is as follows:

Users - <user>:<action>[,<action>...][ <user>:<action>[,<action...]]


Groups - <group>:<action>[,<action>...][ <group>:<action>[,<action...]]
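
For example, to grant the users jsmith and rjones cluster permissions, and to give the group admins full control (the user and group names here are placeholders; the permission codes are listed in the tables below):

maprcli acl edit -type cluster -user jsmith:login,cv rjones:a
maprcli acl edit -type cluster -group admins:fc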

The following tables list the permission codes used by the acl commands.

Cluster Permission Codes

Code Allowed Action Includes

login Log in to the MapR Control System, use the API and command-line interface, read access on cluster and volumes cv

ss Start/stop services

cv Create volumes

a Admin access All permissions except fc

fc Full control (administrative access and permission to change the cluster ACL) a

Volume Permission Codes

Code Allowed Action

dump Dump the volume

restore Mirror or restore the volume

m Modify volume properties, create and delete snapshots

d Delete a volume

fc Full control (admin access and permission to change volume ACL)

acl edit
The acl edit command grants one or more specific volume or cluster permissions to a user. To use the acl edit command, you must have
full control (fc) permissions on the volume or cluster for which you are running the command.

The permissions are specified as a comma-separated list of permission codes. See acl. You must specify either a user or a group. When the
type is volume, a volume name must be specified using the name parameter.

Syntax

CLI
maprcli acl edit
[ -cluster <cluster name> ]
[ -group <group> ]
[ -name <name> ]
-type cluster|volume
[ -user <user> ]

REST
http[s]://<host:port>/rest/acl/edit?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

group Groups and allowed actions for each group. See acl. Format: <group>:<action>[,<action>...][
<group>:<action>[,<action...]]

name The object name.

type The object type (cluster or volume).

user Users and allowed actions for each user. See acl. Format: <user>:<action>[,<action>...][
<user>:<action>[,<action...]]

Examples

Give the user jsmith dump, restore, and delete permissions for "test-volume":

CLI
maprcli acl edit -type volume -name test-volume -user jsmith:dump,restore,d

acl set

The acl set command specifies the entire ACL for a cluster or volume. Any previous permissions are overwritten by the new values, and any
permissions omitted are removed. To use the acl set command, you must have full control (fc) permissions on the volume or cluster for which
you are running the command.

The permissions are specified as a comma-separated list of permission codes. See acl. You must specify either a user or a group. When the
type is volume, a volume name must be specified using the name parameter.

The acl set command removes any previous ACL values. If you wish to preserve some of the permissions, you should either
use the acl edit command instead of acl set, or use acl show to list the values before overwriting them.
Syntax

CLI
maprcli acl set
[ -cluster <cluster name> ]
[ -group <group> ]
[ -name <name> ]
-type cluster|volume
[ -user <user> ]

REST
http[s]://<host:port>/rest/acl/edit?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

group Groups and allowed actions for each group. See acl. Format: <group>:<action>[,<action>...][
<group>:<action>[,<action...]]

name The object name.

type The object type (cluster or volume).

user Users and allowed actions for each user. See acl. Format: <user>:<action>[,<action>...][
<user>:<action>[,<action...]]

Examples

Give the user root full control of the cluster, and remove all permissions for all other users:

CLI
maprcli acl set -type cluster -cluster default -user
root:fc

Usage Example

# maprcli acl show -type cluster


Principal Allowed actions
User root [login, ss, cv, a, fc]
User lfedotov [login, ss, cv, a, fc]
User mapr [login, ss, cv, a, fc]

# maprcli acl set -type cluster -cluster default -user root:fc


# maprcli acl show -type cluster
Principal Allowed actions
User root [login, ss, cv, a, fc]

Notice that the specified permissions have overwritten the existing ACL.

Give multiple users specific permissions for a certain volume, and remove all permissions for all other users:

CLI
maprcli acl set -type volume -name test-volume -user jsmith:dump,restore,m rjones:fc
acl show

Displays the ACL associated with an object (cluster or a volume). An ACL contains the list of users who can perform specific actions.

Syntax

CLI
maprcli acl show
[ -cluster <cluster> ]
[ -group <group> ]
[ -name <name> ]
[ -output long|short|terse ]
[ -perm ]
-type cluster|volume
[ -user <user> ]

REST None

Parameters

Parameter Description

cluster The name of the cluster on which to run the command

group The group for which to display permissions

name The cluster or volume name

output The output format:

long
short
terse

perm When this option is specified, acl show displays the permissions available for the object type specified in the type parameter.

type Cluster or volume.

user The user for which to display permissions

Output

The actions that each user or group is allowed to perform on the cluster or the specified volume. For information about each allowed action, see
acl.

Principal Allowed actions


User root [r, ss, cv, a, fc]
Group root [r, ss, cv, a, fc]
All users [r]

Examples

Show the ACL for "test-volume":

CLI
maprcli acl show -type volume -name test-volume
Show the permissions that can be set on a cluster:

CLI
maprcli acl show -type cluster -perm

alarm
The alarm commands perform functions related to system alarms:

alarm clear - clears one or more alarms
alarm clearall - clears all alarms
alarm config load - displays the email addresses to which alarm notifications are to be sent
alarm config save - saves changes to the email addresses to which alarm notifications are to be sent
alarm list - displays alarms on the cluster
alarm names - displays all alarm names
alarm raise - raises a specified alarm

Alarm Notification Fields


The following fields specify the configuration of alarm notifications.

Field Description

alarm The named alarm.

individual Specifies whether individual alarm notifications are sent to the default email address for the alarm type.

0 - do not send notifications to the default email address for the alarm type
1 - send notifications to the default email address for the alarm type

email A custom email address for notifications about this alarm type. If specified, alarm notifications are sent to this email address,
regardless of whether they are sent to the default email address

Alarm Types
See Troubleshooting Alarms.

Alarm History
To see a history of alarms that have been raised, look at the file /opt/mapr/logs/cldb.log on the master CLDB node. Example:

grep ALARM /opt/mapr/logs/cldb.log

alarm clear

Clears one or more alarms. Permissions required: fc or a

Syntax

CLI
maprcli alarm clear
-alarm <alarm>
[ -cluster <cluster> ]
[ -entity <host, volume, user, or group name> ]

REST
http[s]://<host>:<port>/rest/alarm/clear?<parameters>

Parameters
Parameter Description

alarm The named alarm to clear. See Alarm Types.

cluster The cluster on which to run the command.

entity The entity on which to clear the alarm.

Examples

Clear a specific alarm:

CLI
maprcli alarm clear -alarm NODE_ALARM_DEBUG_LOGGING

REST
https://r1n1.sj.us:8443/rest/alarm/clear?alarm=NODE_ALARM_DEBUG_LOGGING

alarm clearall

Clears all alarms. Permissions required: fc or a

Syntax

CLI
maprcli alarm clearall
[ -cluster <cluster> ]

REST
http[s]://<host>:<port>/rest/alarm/clearall?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

Examples

Clear all alarms:

CLI
maprcli alarm clearall

REST
https://r1n1.sj.us:8443/rest/alarm/clearall

alarm config load


Displays the configuration of alarm notifications. Permissions required: fc or a

Syntax

CLI
maprcli alarm config load
[ -cluster <cluster> ]

REST
http[s]://<host>:<port>/rest/alarm/config/load

Parameters

Parameter Description

cluster The cluster on which to run the command.

Output

A list of configuration values for alarm notifications.

Output Fields

See Alarm Notification Fields.

Sample output

alarm individual email


CLUSTER_ALARM_BLACKLIST_TTS 1
CLUSTER_ALARM_UPGRADE_IN_PROGRESS 1
CLUSTER_ALARM_UNASSIGNED_VIRTUAL_IPS 1
VOLUME_ALARM_SNAPSHOT_FAILURE 1
VOLUME_ALARM_MIRROR_FAILURE 1
VOLUME_ALARM_DATA_UNDER_REPLICATED 1
VOLUME_ALARM_DATA_UNAVAILABLE 1
VOLUME_ALARM_ADVISORY_QUOTA_EXCEEDED 1
VOLUME_ALARM_QUOTA_EXCEEDED 1
NODE_ALARM_CORE_PRESENT 1
NODE_ALARM_DEBUG_LOGGING 1
NODE_ALARM_DISK_FAILURE 1
NODE_ALARM_OPT_MAPR_FULL 1
NODE_ALARM_VERSION_MISMATCH 1
NODE_ALARM_TIME_SKEW 1
NODE_ALARM_SERVICE_CLDB_DOWN 1
NODE_ALARM_SERVICE_FILESERVER_DOWN 1
NODE_ALARM_SERVICE_JT_DOWN 1
NODE_ALARM_SERVICE_TT_DOWN 1
NODE_ALARM_SERVICE_HBMASTER_DOWN 1
NODE_ALARM_SERVICE_HBREGION_DOWN 1
NODE_ALARM_SERVICE_NFS_DOWN 1
NODE_ALARM_SERVICE_WEBSERVER_DOWN 1
NODE_ALARM_SERVICE_HOSTSTATS_DOWN 1
NODE_ALARM_ROOT_PARTITION_FULL 1
AE_ALARM_AEADVISORY_QUOTA_EXCEEDED 1
AE_ALARM_AEQUOTA_EXCEEDED 1
Examples

Display the alarm notification configuration:

CLI
maprcli alarm config load

REST
https://r1n1.sj.us:8443/rest/alarm/config/load

alarm config save

Sets notification preferences for alarms. Permissions required: fc or a

Alarm notifications can be sent to the default email address and a specific email address for each named alarm. If individual is set to 1 for a
specific alarm, then notifications for that alarm are sent to the default email address for the alarm type. If a custom email address is provided,
notifications are sent there regardless of whether they are also sent to the default email address.

Syntax

CLI
maprcli alarm config save
[ -cluster <cluster> ]
-values <values>

REST
http[s]://<host>:<port>/rest/alarm/config/save?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

values A comma-separated list of configuration values for one or more alarms, in the following format:

<alarm>,<individual>,<email>
See Alarm Notification Fields.

Examples

Send alert emails for the AE_ALARM_AEQUOTA_EXCEEDED alarm to the default email address and a custom email address:

CLI
maprcli alarm config save -values "AE_ALARM_AEQUOTA_EXCEEDED,1,test@example.com"

REST
https://r1n1.sj.us:8443/rest/alarm/config/save?values=AE_ALARM_AEQUOTA_EXCEEDED,1,test@example.com

alarm list
Lists alarms in the system. Permissions required: fc or a

You can list all alarms, alarms by type (Cluster, Node or Volume), or alarms on a particular node or volume. To retrieve a count of all alarm types,
pass 1 in the summary parameter. You can specify the alarms to return by filtering on type and entity. Use start and limit to retrieve only a
specified window of data.

Syntax

CLI
maprcli alarm list
[ -alarm <alarm ID> ]
[ -cluster <cluster> ]
[ -entity <host or volume> ]
[ -limit <limit> ]
[ -output (terse|verbose) ]
[ -start <offset> ]
[ -summary (0|1) ]
[ -type <alarm type> ]

REST
http[s]://<host>:<port>/rest/alarm/list?<parameters>

Parameters

Parameter Description

alarm The alarm type to return. See Alarm Types.

cluster The cluster on which to list alarms.

entity The name of the cluster, node, volume, user, or group to check for alarms.

limit The number of records to retrieve. Default: 2147483647

output Whether the output should be terse or verbose.

start The list offset at which to start.

summary Specifies the type of data to return:

1 = count by alarm type
0 = list of alarms

type The entity type:

cluster
node
volume
ae

Output

Information about one or more named alarms on the cluster, or for a specified node, volume, user, or group.

Output Fields

Field Description
alarm state State of the alarm:

0 = Clear
1 = Raised

description A description of the condition that raised the alarm

entity The name of the volume, node, user, or group.

alarm name The name of the alarm.

alarm statechange time The date and time the alarm was most recently raised.

Sample Output

alarm state description entity


alarm name alarm statechange time
1 Volume desired replication is 1, current replication is 0
mapr.qa-node173.qa.prv.local.logs VOLUME_ALARM_DATA_UNDER_REPLICATED 1296707707872
1 Volume data unavailable
mapr.qa-node173.qa.prv.local.logs VOLUME_ALARM_DATA_UNAVAILABLE 1296707707871
1 Volume desired replication is 1, current replication is 0
mapr.qa-node235.qa.prv.local.mapred VOLUME_ALARM_DATA_UNDER_REPLICATED 1296708283355
1 Volume data unavailable
mapr.qa-node235.qa.prv.local.mapred VOLUME_ALARM_DATA_UNAVAILABLE 1296708283099
1 Volume desired replication is 1, current replication is 0
mapr.qa-node175.qa.prv.local.logs VOLUME_ALARM_DATA_UNDER_REPLICATED 1296706343256

Examples

List a summary of all alarms

CLI
maprcli alarm list -summary 1

REST
https://r1n1.sj.us:8443/rest/alarm/list?summary=1

List cluster alarms

CLI
maprcli alarm list -type 0

REST
https://r1n1.sj.us:8443/rest/alarm/list?type=0

alarm names

Displays a list of alarm names. Permissions required: fc or a

Syntax

CLI
maprcli alarm names
REST
http[s]://<host>:<port>/rest/alarm/names

Examples

Display all alarm names:

CLI
maprcli alarm names

REST
https://r1n1.sj.us:8443/rest/alarm/names

alarm raise

Raises a specified alarm or alarms. Permissions required: fc or a

Syntax

CLI
maprcli alarm raise
-alarm <alarm>
[ -cluster <cluster> ]
[ -description <description> ]
[ -entity <cluster, entity, host, node, or volume> ]

REST
http[s]://<host>:<port>/rest/alarm/raise?<parameters>

Parameters

Parameter Description

alarm The alarm type to raise. See Alarm Types.

cluster The cluster on which to run the command.

description A brief description.

entity The entity on which to raise alarms.

Examples

Raise a specific alarm:

CLI
maprcli alarm raise -alarm NODE_ALARM_DEBUG_LOGGING

REST
https://r1n1.sj.us:8443/rest/alarm/raise?alarm=NODE_ALARM_DEBUG_LOGGING
config
The config commands let you work with configuration values for the MapR cluster:

config load displays the values
config save makes changes to the stored values

Configuration Fields

Field Value Description

cldb.balancer.disk.max.switches.in.nodes.percentage 10

cldb.balancer.disk.paused 0

cldb.balancer.disk.sleep.interval.sec 2 * 60

cldb.balancer.disk.threshold.percentage 70

cldb.balancer.logging 0

cldb.balancer.role.max.switches.in.nodes.percentage 10

cldb.balancer.role.paused 0

cldb.balancer.role.sleep.interval.sec 15 * 60

cldb.balancer.startup.interval.sec 30 * 60

cldb.cluster.almost.full.percentage 90 The percentage at which the CLUSTER_ALARM_CLUSTER_ALMOST_FULL alarm is triggered.

cldb.container.alloc.selector.algo 0

cldb.container.assign.buffer.sizemb 1 * 1024

cldb.container.create.diskfull.threshold 80

cldb.container.sizemb 16 * 1024

cldb.default.chunk.sizemb 256

cldb.default.volume.topology The default topology for new volumes.

cldb.dialhome.metrics.rotation.period 365

cldb.fileserver.activityreport.interval.hb.multiplier 3

cldb.fileserver.containerreport.interval.hb.multiplier 1800

cldb.fileserver.heartbeat.interval.sec 1

cldb.force.master.for.container.minutes 1

cldb.fs.mark.inactive.sec 5 * 60

cldb.fs.mark.rereplicate.sec 60 * 60 The number of seconds a node can fail to heartbeat before it is considered dead. Once a node is considered dead, the CLDB re-replicates any data contained on the node.

cldb.fs.workallocator.num.volume.workunits 20

cldb.fs.workallocator.num.workunits 80

cldb.ganglia.cldb.metrics 0

cldb.ganglia.fileserver.metrics 0

cldb.heartbeat.monitor.sleep.interval.sec 60

cldb.log.fileserver.timeskew.interval.mins 60
cldb.max.parallel.resyncs.star 2

cldb.min.containerid 1

cldb.min.fileservers 1 The minimum number of fileservers.

cldb.min.snap.containerid 1

cldb.min.snapid 1

cldb.replication.process.num.containers 60

cldb.replication.sleep.interval.sec 15

cldb.replication.tablescan.interval.sec 2 * 60

cldb.restart.wait.time.sec 180

cldb.snapshots.inprogress.cleanup.minutes 30

cldb.topology.almost.full.percentage 90

cldb.volume.default.replication The default replication for the CLDB volumes.

cldb.volume.epoch

cldb.volumes.default.min.replication 2

cldb.volumes.default.replication 3

mapr.domainname The domain name MapR uses to get operating system users and groups (in domain mode).

mapr.entityquerysource Sets MapR to get user information from LDAP (LDAP mode) or from the operating system domain (domain mode):

ldap
domain

mapr.eula.user

mapr.eula.time

mapr.fs.nocompression "bz2,gz,tgz,tbz2,zip,z,Z,mp3,jpg,jpeg,mpg,mpeg,avi,gif,png"

mapr.fs.permissions.supergroup The super group of the MapR-FS layer.

mapr.fs.permissions.superuser The super user of the MapR-FS layer.

mapr.ldap.attribute.group The LDAP server group attribute.

mapr.ldap.attribute.groupmembers The LDAP server groupmembers attribute.

mapr.ldap.attribute.mail The LDAP server mail attribute.

mapr.ldap.attribute.uid The LDAP server uid attribute.

mapr.ldap.basedn The LDAP server Base DN.

mapr.ldap.binddn The LDAP server Bind DN.

mapr.ldap.port The port MapR is to use on the LDAP server.

mapr.ldap.server The LDAP server MapR uses to get users and groups (in LDAP mode).

mapr.ldap.sslrequired Specifies whether the LDAP server requires SSL:

0 == no
1 == yes

mapr.license.exipry.notificationdays 30
mapr.quota.group.advisorydefault The default group advisory quota; see Managing Quotas.

mapr.quota.group.default The default group quota; see Managing Quotas.

mapr.quota.user.advisorydefault The default user advisory quota; see Managing Quotas.

mapr.quota.user.default The default user quota; see Managing Quotas.

mapr.smtp.port The port MapR uses on the SMTP server (0-65535).

mapr.smtp.sender.email The reply-to email address MapR uses when sending notifications.

mapr.smtp.sender.fullname The full name MapR uses in the Sender field when sending notifications.

mapr.smtp.sender.password The password MapR uses to log in to the SMTP server when sending notifications.

mapr.smtp.sender.username The username MapR uses to log in to the SMTP server when sending notifications.

mapr.smtp.server The SMTP server that MapR uses to send notifications.

mapr.smtp.sslrequired Specifies whether SSL is required when sending email:

0 == no
1 == yes

mapr.targetversion

mapr.webui.http.port The port MapR uses for the MapR Control System over HTTP (0-65535); if 0 is specified, disables HTTP access.

mapr.webui.https.certpath The HTTPS certificate path.

mapr.webui.https.keypath The HTTPS key path.

mapr.webui.https.port The port MapR uses for the MapR Control System over HTTPS (0-65535); if 0 is specified, disables HTTPS access.

mapr.webui.timeout The number of seconds the MapR Control System allows to elapse before timing out.

mapreduce.cluster.permissions.supergroup The super group of the MapReduce layer.

mapreduce.cluster.permissions.superuser The super user of the MapReduce layer.

config load

Displays information about the cluster configuration. You can use the keys parameter to specify which information to display.

Syntax

CLI
maprcli config load
[ -cluster <cluster> ]
-keys <keys>

REST
http[s]://<host>:<port>/rest/config/load?<parameters>
Parameters

Parameter Description

cluster The cluster for which to display values.

keys The fields for which to display values; see the Configuration Fields table

Output

Information about the cluster configuration. See the Configuration Fields table.

Sample Output

{
"status":"OK",
"total":1,
"data":[
{
"mapr.webui.http.port":"8080",
"mapr.fs.permissions.superuser":"root",
"mapr.smtp.port":"25",
"mapr.fs.permissions.supergroup":"supergroup"
}
]
}

Examples

Display several keys:

CLI
maprcli config load -keys mapr.webui.http.port,mapr.webui.https.port,mapr.webui.https.keystorepath,mapr.webui.h

REST
https://r1n1.sj.us:8443/rest/config/load?keys=mapr.webui.http.port,mapr.webui.https.port,mapr.webui.https.keyst
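
As a shorter, complete illustration, the following call displays just the SMTP port and the MapR-FS superuser (two of the fields shown in the sample output above):

CLI
maprcli config load -keys mapr.smtp.port,mapr.fs.permissions.superuser -json

REST
https://r1n1.sj.us:8443/rest/config/load?keys=mapr.smtp.port,mapr.fs.permissions.superuser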

config save

Saves configuration information, specified as key/value pairs. Permissions required: fc or a.

See the Configuration Fields table.

Syntax

CLI
maprcli config save
[ -cluster <cluster> ]
-values <values>
REST
http[s]://<host>:<port>/rest/config/save?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

values A JSON object containing configuration fields; see the Configuration Fields table.

Examples

Configure MapR SMTP settings:

CLI
maprcli config save -values '{"mapr.smtp.provider":"gmail","mapr.smtp.server":"smtp.gmail.com","mapr.smtp.sslre
Cd","mapr.smtp.sender.email":"xxx@gmail.com","mapr.smtp.sender.username":"xxx@gmail.com","mapr.smtp.sender.pass

REST
https://r1n1.sj.us:8443/rest/config/save?values={"mapr.smtp.provider":"gmail","mapr.smtp.server":"smtp.gmail.co
Cd","mapr.smtp.sender.email":"xxx@gmail.com","mapr.smtp.sender.username":"xxx@gmail.com","mapr.smtp.sender.pass

dashboard
The dashboard info command displays a summary of information about the cluster.

dashboard info

Displays a summary of information about the cluster. For best results, use the -json option when running dashboard info from the command
line.

Syntax

CLI
maprcli dashboard info
[ -cluster <cluster> ]
[ -zkconnect <ZooKeeper connect string> ]

REST
http[s]://<host>:<port>/rest/dashboard/info?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

zkconnect ZooKeeper Connect String


Output

A summary of information about the services, volumes, mapreduce jobs, health, and utilization of the cluster.

Output Fields

Field Description

services The number of total and active instances of the following services:

CLDB
File server
Job tracker
Task tracker
HB master
HB region server

volumes The number and size (in GB) of volumes that are:

Available
Under-replicated
Unavailable

mapreduce The following mapreduce information:

Queue time
Running jobs
Queued jobs
Running tasks
Blacklisted jobs

maintenance The following information about system health:

Failed disk nodes
Cluster alarms
Node alarms
Versions

utilization The following utilization information:

CPU
Memory
Disk space

Sample Output

{
"status":"OK",
"total":1,
"data":[
{
"volumes":{
"available":{
"total":3,
"size":0
},
"underReplicated":{
"total":0,
"size":0
},
"unavailable":{
"total":1,
"size":0
}
},
"utilization":{
"cpu":{
"total":0,
"active":0
},
"memory":{
"total":0,
"active":0
},
"diskSpace":{
"total":1,
"active":0
}
},
"maintenance":{
"failedDiskNodes":0,
"clusterAlarms":0,
"nodeAlarms":0,
"versions":1
},
"services":{
"cldb":{
"total":"1",
"active":0
},
"fileserver":{
"total":0,
"active":0
},
"jobtracker":{
"total":"1",
"active":0
},
"nfs":{
"total":"1",
"active":0
},
"hbmaster":{
"total":"1",
"active":0
},
"hbregionserver":{
"total":0,
"active":0
},
"tasktracker":{
"total":0,
"active":0
}
}
}
]
}

Examples

Display dashboard information:

CLI
maprcli dashboard info -json

REST
https://r1n1.sj.us:8443/rest/dashboard/info

dialhome
The dialhome commands let you change the Dial Home status of your cluster:

dialhome ackdial - acknowledges a successful Dial Home transmission.
dialhome disable - disables Dial Home.
dialhome enable - enables Dial Home.
dialhome lastdialed - displays the last Dial Home transmission.
dialhome metrics - displays the metrics collected by Dial Home.
dialhome status - displays the current Dial Home status.

dialhome ackdial

Acknowledges the most recent Dial Home on the cluster. Permissions required: fc or a

Syntax

CLI
maprcli dialhome ackdial
[ -forDay <date> ]

REST
http[s]://<host>:<port>/rest/dialhome/ackdial[?parameters]

Parameters

Parameter Description

forDay Date for which the recorded metrics were successfully dialed home. Accepted values: UTC timestamp or a UTC date in
MM/DD/YY format. Default: yesterday

Examples

Acknowledge Dial Home:


CLI
maprcli dialhome ackdial

REST
https://r1n1.sj.us:8443/rest/dialhome/ackdial

dialhome disable

Disables Dial Home on the cluster. Permissions required: fc or a

Syntax

CLI
maprcli dialhome disable
[ -cluster <cluster> ]

REST
http[s]://<host>:<port>/rest/dialhome/disable

Parameters

Parameter Description

cluster The cluster on which to run the command.

Output

A success or failure message.

Sample output

$ maprcli dialhome disable


Successfully disabled dialhome

$ maprcli dialhome status


Dial home status is: disabled

Examples

Disable Dial Home:

CLI
maprcli dialhome disable

REST
https://r1n1.sj.us:8443/rest/dialhome/disable

dialhome enable
Enables Dial Home on the cluster. Permissions required: fc or a

Syntax

CLI
maprcli dialhome enable
[ -cluster <cluster> ]

REST
http[s]://<host>:<port>/rest/dialhome/enable

Parameters

Parameter Description

cluster The cluster on which to run the command.

Output

A success or failure message.

Sample output

pconrad@s1-r1-sanjose-ca-us:~$ maprcli dialhome enable


Successfully enabled dialhome

pconrad@s1-r1-sanjose-ca-us:~$ maprcli dialhome status


Dial home status is: enabled

Examples

Enable Dial Home:

CLI
maprcli dialhome enable

REST
https://r1n1.sj.us:8443/rest/dialhome/enable

dialhome lastdialed

Displays the date of the last successful Dial Home call. Permissions required: fc or a

Syntax

CLI
maprcli dialhome lastdialed
[ -cluster <cluster> ]
REST
http[s]://<host>:<port>/rest/dialhome/lastdialed

Parameters

Parameter Description

cluster The cluster on which to run the command.

Output

The date of the last successful Dial Home call.

Sample output

$ maprcli dialhome lastdialed


date
1322438400000

Examples

Show the date of the most recent Dial Home:

CLI
maprcli dialhome lastdialed

REST
https://r1n1.sj.us:8443/rest/dialhome/lastdialed

dialhome metrics

Returns a compressed metrics object. Permissions required: fc or a

Syntax

CLI
maprcli dialhome metrics
[ -cluster <cluster> ]
[ -forDay <date> ]

REST
http[s]://<host>:<port>/rest/dialhome/metrics

Parameters

Parameter Description
cluster The cluster on which to run the command.

forDay Date for which the recorded metrics were successfully dialed home. Accepted values: UTC timestamp or a UTC date in
MM/DD/YY format. Default: yesterday

Output

Sample output

$ maprcli dialhome metrics


metrics
[B@48067064

Examples

Show the Dial Home metrics:

CLI
maprcli dialhome metrics

REST
https://r1n1.sj.us:8443/rest/dialhome/metrics

dialhome status

Displays the Dial Home status. Permissions required: fc or a

Syntax

CLI
maprcli dialhome status
[ -cluster <cluster> ]

REST
http[s]://<host>:<port>/rest/dialhome/status

Parameters

Parameter Description

cluster The cluster on which to run the command.

Output

The current Dial Home status.


Sample output

$ maprcli dialhome status


enabled
1

Examples

Display the Dial Home status:

CLI
maprcli dialhome status

REST
https://r1n1.sj.us:8443/rest/dialhome/status

disk
The disk commands let you work with disks:

disk add adds a disk to a node
disk list lists disks
disk listall lists all disks
disk remove removes a disk from a node

Disk Fields

The following table shows the fields displayed in the output of the disk list and disk listall commands. You can choose which fields (columns) to
display and sort in ascending or descending order by any single field.

Field Description

hn Hostname of node which owns this disk/partition.

n Name of the disk or partition.

st Disk status:

0 = Good
1 = Bad disk

pst Disk power status:

0 = Active/idle (normal operation)
1 = Standby (low power mode)
2 = Sleeping (lowest power mode, drive is completely shut down)

mt Disk mount status

0 = unmounted
1 = mounted

fs File system type

mn Model number

sn Serial number

fw Firmware version
ven Vendor name

dst Total disk space, in MB

dsu Disk space used, in MB

dsa Disk space available, in MB

err Disk error message, in English. Note that this message is not translated. Only sent if st == 1.

ft Disk failure time, MapR disks only. Only sent if st == 1.

disk add

Adds one or more disks to the specified node. Permissions required: fc or a

Syntax

CLI
maprcli disk add
[ -cluster ]
-disks <disk names>
-host <host>

REST
http[s]://<host>:<port>/rest/disk/add?<parameters>

Parameters

Parameter Description

cluster The cluster on which to add disks.

disks A comma-separated list of disk names. Examples:

["disk"]
["disk","disk","disk"...]

host The hostname or IP address of the machine on which to add the disk.

Output

Output Fields

Field Description

ip The IP address of the machine that owns the disk(s).

disk The name of a disk or partition. Example: "sca" or "sca/sca1"

all The string all, meaning all unmounted disks for this node.

Examples
Add a disk:

CLI
maprcli disk add -disks ["/dev/sda1"] -host 10.250.1.79

REST
https://r1n1.sj.us:8443/rest/disk/add?disks=["/dev/sda1"]

disk list

Lists the disks on a node.

Syntax

CLI
maprcli disk list
-host <host>
[ -output terse|verbose ]
[ -system 1|0 ]

REST
http[s]://<host>:<port>/rest/disk/list?<parameters>

Parameters

Parameter Description

host The node on which to list the disks.

output Whether the output should be terse or verbose.

system Show only operating system disks:

0 - shows only MapR-FS disks
1 - shows only operating system disks
Not specified - shows both MapR-FS and operating system disks

Output

Information about the specified disks. See the Disk Fields table.

Examples

List disks on a host:

CLI
maprcli disk list -host 10.10.100.22

REST
https://r1n1.sj.us:8443/rest/disk/list?host=10.10.100.22
disk listall

Lists all disks

Syntax

CLI
maprcli disk listall
[ -cluster <cluster> ]
[ -columns <columns>]
[ -filter <filter>]
[ -limit <limit>]
[ -output terse|verbose ]
[ -start <offset>]

REST
http[s]://<host>:<port>/rest/disk/listall?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

columns A comma-separated list of fields to return in the query. See the Disk Fields table.

filter A filter specifying disks to display. See Filters for more information.

limit The number of rows to return, beginning at start. Default: 0

output Always the string terse.

start The offset from the starting row according to sort. Default: 0

Output

Information about all disks. See the Disk Fields table.

Examples

List all disks:

CLI
maprcli disk listall

REST
https://r1n1.sj.us:8443/rest/disk/listall

disk remove

Removes a disk from MapR-FS. Permissions required: fc or a


The disk remove command does not remove a disk containing unreplicated data unless forced. To force disk removal, specify -force with the
value 1.

Only use the -force 1 option if you are sure that you do not need the data on the disk. This option removes the disk without
regard to replication factor or other data protection mechanisms, and may result in permanent data loss.

Syntax

CLI
maprcli disk remove
[ -cluster <cluster> ]
-disks <disk names>
[ -force 0|1 ]
-host <host>

REST
http[s]://<host>:<port>/rest/disk/remove?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

disks A list of disks in the form:

["disk"] or ["disk","disk","disk"...] or []

force Whether to force

0 (default) - do not remove the disk or disks if there is unreplicated data on the disk
1 - remove the disk or disks regardless of data loss or other consequences

host The hostname or ip address of the node from which to remove the disk.

Output

Output Fields

Field Description

disk The name of a disk or partition. Example: sca or sca/sca1

all The string all, meaning all unmounted disks attached to the node.

disks A comma-separated list of disks which have non-replicated volumes. Examples: "sca" or "sca/sca1,scb"

Examples

Remove a disk:
CLI
maprcli disk remove -disks ["sda1"]

REST
https://r1n1.sj.us:8443/rest/disk/remove?disks=["sda1"]

entity
The entity commands let you work with entities (users and groups):

entity exists checks whether a specified user or group exists
entity info shows information about a specified user or group
entity list lists users and groups in the cluster
entity modify edits information about a specified user or group

entity exists

Tests whether an entity exists. Returns an error if the entity is invalid, or nothing if the entity exists.

Syntax

CLI
maprcli entity exists
-name <entity name>
[ -type <type> ]

REST
http[s]://<host>:<port>/rest/entity/exists?<parameters>

Parameters

Parameter Description

name The entity name.

type The entity type

0 = User
1 = Group
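
Examples

Check whether the user 'root' exists (a user name chosen for illustration):

CLI
maprcli entity exists -name root -type 0

REST
https://r1n1.sj.us:8443/rest/entity/exists?name=root&type=0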

entity info
Displays information about an entity.

Syntax

CLI
maprcli entity info
[ -cluster <cluster> ]
-name <entity name>
[ -output terse|verbose ]
-type <type>
REST
http[s]://<host>:<port>/rest/entity/info?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

name The entity name.

output Whether to display terse or verbose output.

type The entity type

Output

DiskUsage EntityQuota EntityType EntityName VolumeCount EntityAdvisoryquota EntityId


864415 0 0 root 208 0 0

Output Fields

Field Description

DiskUsage Disk space used by the user or group

EntityQuota The user or group quota

EntityType The entity type

EntityName The entity name

VolumeCount The number of volumes associated with the user or group

EntityAdvisoryquota The user or group advisory quota

EntityId The ID of the user or group

Examples

Display information for the user 'root':

CLI
maprcli entity info -type 0 -name root

REST
https://r1n1.sj.us:8443/rest/entity/info?type=0&name=root

entity list
Syntax

CLI
maprcli entity list
[ -alarmedentities true|false ]
[ -cluster <cluster> ]
[ -columns <columns> ]
[ -filter <filter> ]
[ -limit <rows> ]
[ -output terse|verbose ]
[ -start <start> ]

REST
http[s]://<host>:<port>/rest/entity/list?<parameters>

Parameters

Parameter Description

alarmedentities Specifies whether to list only entities that have exceeded a quota or advisory quota.

cluster The cluster on which to run the command.

columns A comma-separated list of fields to return in the query. See the Fields table below.

filter A filter specifying entities to display. See Filters for more information.

limit The number of rows to return, beginning at start. Default: 0

output Specifies whether output should be terse or verbose.

start The offset from the starting row according to sort. Default: 0

Output

Information about the users and groups.

Fields

Field Description

EntityType Entity type

0 = User
1 = Group

EntityName User or Group name

EntityId User or Group id

EntityQuota Quota, in MB. 0 = no quota.

EntityAdvisoryquota Advisory quota, in MB. 0 = no advisory quota.

VolumeCount The number of volumes this entity owns.

DiskUsage Disk space used for all entity's volumes, in MB.


Sample Output

DiskUsage EntityQuota EntityType EntityName VolumeCount EntityAdvisoryquota EntityId


5859220 0 0 root 209 0 0

Examples

List all entities:

CLI
maprcli entity list

REST
https://r1n1.sj.us:8443/rest/entity/list

entity modify
Modifies a user or group quota or email address. Permissions required: fc or a

Syntax

CLI
maprcli entity modify
[ -advisoryquota <advisory quota> ]
[ -cluster <cluster> ]
[ -email <email>]
-name <entityname>
[ -quota <quota> ]
-type <type>

REST
http[s]://<host>:<port>/rest/entity/modify?<parameters>

Parameters

Parameter Description

advisoryquota The advisory quota.

cluster The cluster on which to run the command.

email Email address.

name The entity name.

quota The quota for the entity.

type The entity type:

0 = User
1 = Group

Examples
Modify the email address for the user 'root':

CLI
maprcli entity modify -name root -type 0 -email test@example.com

REST
https://r1n1.sj.us:8443/rest/entity/modify?name=root&type=0&email=test@example.com

license
The license commands let you work with MapR licenses:

license add - adds a license
license addcrl - adds a certificate revocation list (CRL)
license apps - displays the features included in the current license
license list - lists licenses on the cluster
license listcrl - lists CRLs
license remove - removes a license
license showid - displays the cluster ID

license add

Adds a license. Permissions required: fc or a

The license can be specified either by passing the license string itself to license add, or by specifying a file containing the license string.

Syntax

CLI
maprcli license add
[ -cluster <cluster> ]
[ -is_file true|false ]
-license <license>

REST
http[s]://<host>:<port>/rest/license/add?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

is_file Specifies whether the license parameter refers to a file. If false, the license parameter contains a long license string.

license The license to add to the cluster. If -is_file is true, license specifies the filename of a license file. Otherwise, license
contains the license string itself.

Examples

Adding a License from a File

Assuming a file /tmp/license.txt containing a license string, the following command adds the license to the cluster.
CLI
maprcli license add -is_file true -license /tmp/license.txt

license addcrl

Adds a certificate revocation list (CRL). Permissions required: fc or a

Syntax

CLI
maprcli license addcrl
[ -cluster <cluster> ]
-crl <crl>
[ -is_file true|false ]

REST
http[s]://<host>:<port>/rest/license/addcrl?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

crl The CRL to add to the cluster. If is_file is true, crl specifies the filename of a CRL file. Otherwise, crl contains the CRL string itself.

is_file Specifies whether the license is contained in a file.

license apps

Displays the features authorized for the current license. Permissions required: fc or a

Syntax

CLI
maprcli license apps
[ -cluster <cluster> ]

REST
http[s]://<host>:<port>/rest/license/apps?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.
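
Examples

Display the features authorized for the current license (using the same placeholder host as the other examples in this guide):

CLI
maprcli license apps

REST
https://r1n1.sj.us:8443/rest/license/apps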

license list
Lists licenses on the cluster. Permissions required: fc or a

Syntax

CLI
maprcli license list
[ -cluster <cluster> ]

REST
http[s]://<host>:<port>/rest/license/list?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

license listcrl

Lists certificate revocation lists (CRLs) on the cluster. Permissions required: fc or a

Syntax

CLI
maprcli license listcrl
[ -cluster <cluster> ]

REST
http[s]://<host>:<port>/rest/license/listcrl?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

license remove

Removes a license. Permissions required: fc or a

Syntax

CLI
maprcli license remove
[ -cluster <cluster> ]
-license_id <license>
REST
http[s]://<host>:<port>/rest/license/remove?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

license_id The license to remove.

license showid

Displays the cluster ID for use when creating a new license. Permissions required: fc or a

Syntax

CLI
maprcli license showid
[ -cluster <cluster> ]

REST
http[s]://<host>:<port>/rest/license/showid?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.
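
Examples

Display the cluster ID, for example before requesting a new license (the host shown is the placeholder used throughout this guide):

CLI
maprcli license showid

REST
https://r1n1.sj.us:8443/rest/license/showid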

nagios
The nagios generate command generates a topology script for Nagios

nagios generate

Generates a Nagios Object Definition file that describes the cluster nodes and the services running on each.

Syntax

CLI
maprcli nagios generate
[ -cluster <cluster> ]

REST
http[s]://<host>:<port>/rest/nagios/generate?<parameters>
Parameters

Parameter Description

cluster The cluster on which to run the command.

Output

Sample Output
############# Commands #############

define command {
command_name check_fileserver_proc
command_line $USER1$/check_tcp -p 5660
}

define command {
command_name check_cldb_proc
command_line $USER1$/check_tcp -p 7222
}

define command {
command_name check_jobtracker_proc
command_line $USER1$/check_tcp -p 50030
}

define command {
command_name check_tasktracker_proc
command_line $USER1$/check_tcp -p 50060
}

define command {
command_name check_nfs_proc
command_line $USER1$/check_tcp -p 2049
}

define command {
command_name check_hbmaster_proc
command_line $USER1$/check_tcp -p 60000
}

define command {
command_name check_hbregionserver_proc
command_line $USER1$/check_tcp -p 60020
}

define command {
command_name check_webserver_proc
command_line $USER1$/check_tcp -p 8443
}

################# HOST: host1 ###############

define host {
use linux-server
host_name host1
address 192.168.1.1
check_command check-host-alive
}

################# HOST: host2 ###############

define host {
use linux-server
host_name host2
address 192.168.1.2
check_command check-host-alive
}

Examples

Generate a Nagios configuration, specifying the cluster name:


CLI
maprcli nagios generate -cluster cluster-1

REST
https://host1:8443/rest/nagios/generate?cluster=cluster-1

Generate a Nagios configuration and save it to the file nagios.conf:

CLI
maprcli nagios generate > nagios.conf

nfsmgmt
The nfsmgmt refreshexports command refreshes the NFS exports on the specified host and/or port.

nfsmgmt refreshexports

Refreshes the NFS exports. Permissions required: fc or a

Syntax

CLI
maprcli nfsmgmt refreshexports
[ -nfshost <host> ]
[ -nfsport <port> ]

REST
http[s]://<host>:<port>/rest/nfsmgmt/refreshexports?<parameters>

Parameters

Parameter Description

nfshost The host on which to refresh NFS exports.

nfsport The port to use.
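
Examples

Refresh the NFS exports on a specific NFS gateway node (the host address below is a placeholder):

CLI
maprcli nfsmgmt refreshexports -nfshost 10.10.100.30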

node
The node commands let you work with nodes in the cluster:

node heatmap
node list
node path
node remove
node services
node topo

node heatmap

Displays a heatmap for the specified nodes.


Syntax

CLI
maprcli node heatmap
[ -cluster <cluster> ]
[ -filter <filter> ]
[ -view <view> ]

REST
http[s]://<host>:<port>/rest/node/heatmap?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

filter A filter specifying nodes to display. See Filters for more information.

view Name of the heatmap view to show:

status = Node status:


0 = Healthy
1 = Needs attention
2 = Degraded
3 = Maintenance
4 = Critical
cpu = CPU utilization, as a percent from 0-100.
memory = Memory utilization, as a percent from 0-100.
diskspace = MapR-FS disk space utilization, as a percent from 0-100.
DISK_FAILURE = Status of the DISK_FAILURE alarm. 0 if clear, 1 if raised.
SERVICE_NOT_RUNNING = Status of the SERVICE_NOT_RUNNING alarm. 0 if clear, 1 if raised.
CONFIG_NOT_SYNCED = Status of the CONFIG_NOT_SYNCED alarm. 0 if clear, 1 if raised.

Output

The output is a JSON object that maps each rack topology to the heatmap values of the nodes in that rack:

{
status:"OK",
data: [{
"{{rackTopology}}" : {
"{{nodeName}}" : {{heatmapValue}},
"{{nodeName}}" : {{heatmapValue}},
"{{nodeName}}" : {{heatmapValue}},
...
},
"{{rackTopology}}" : {
"{{nodeName}}" : {{heatmapValue}},
"{{nodeName}}" : {{heatmapValue}},
"{{nodeName}}" : {{heatmapValue}},
...
},
...
}]
}
Output Fields

Field Description

rackTopology The topology for a particular rack.

nodeName The name of the node in question.

heatmapValue The value of the metric specified in the view parameter for this node, as an integer.

Examples

Display a heatmap for the default rack:

CLI
maprcli node heatmap

REST
https://r1n1.sj.us:8443/rest/node/heatmap

Display memory usage for the default rack:

CLI
maprcli node heatmap -view memory

REST
https://r1n1.sj.us:8443/rest/node/heatmap?view=memory

node list

Lists nodes in the cluster.

Syntax

CLI
maprcli node list
[ -alarmednodes 1 ]
[ -cluster <cluster> ]
[ -columns <columns>]
[ -filter <filter> ]
[ -limit <limit> ]
[ -nfsnodes 1 ]
[ -output terse|verbose ]
[ -start <offset> ]
[ -zkconnect <ZooKeeper Connect String> ]

REST
http[s]://<host>:<port>/rest/node/list?<parameters>

Parameters

Parameter Description

alarmednodes If set to 1, displays only nodes with raised alarms. Cannot be used if nfsnodes is set.
cluster The cluster on which to run the command.

columns A comma-separated list of fields to return in the query. See the Fields table below.

filter A filter specifying nodes to display. See Filters for more information.

limit The number of rows to return, beginning at start. Default: 0

nfsnodes If set to 1, displays only nodes running NFS. Cannot be used if alarmednodes is set.

output Specifies whether the output should be terse or verbose.

start The offset from the starting row according to sort. Default: 0

zkconnect ZooKeeper Connect String

Output

Information about the nodes. See the Fields table below.

Sample Output

bytesSent dreads davail TimeSkewAlarm servicesHoststatsDownAlarm


ServiceHBMasterDownNotRunningAlarm ServiceNFSDownNotRunningAlarm ttmapUsed DiskFailureAlarm mused
id mtotal cpus utilization rpcout ttReduceSlots
ServiceFileserverDownNotRunningAlarm ServiceCLDBDownNotRunningAlarm dtotal jt-heartbeat
ttReduceUsed dwriteK ServiceTTDownNotRunningAlarm ServiceJTDownNotRunningAlarm ttmapSlots dused
uptime hostname health disks faileddisks fs-heartbeat rpcin ip
dreadK dwrites ServiceWebserverDownNotRunningAlarm rpcs LogLevelAlarm
ServiceHBRegionDownNotRunningAlarm bytesReceived service topo(rack) MapRfs disks
ServiceMiscDownNotRunningAlarm VersionMismatchAlarm
8300 0 269 0 0 0
0 75 0 4058 6394230189818826805 7749 4
3 141 50 0 0
286 2 10 32 0 0
100 16 Thu Jan 15 16:58:57 PST 1970 whatsup 0 1 0 0
51 10.250.1.48 0 2 0 0 0 0
8236 /third/rack/whatsup 1 0 0

Fields

Field Description

bytesReceived Bytes received by the node since the last CLDB heartbeat.

bytesSent Bytes sent by the node since the last CLDB heartbeat.

corePresentAlarm Cores Present Alarm (NODE_ALARM_CORE_PRESENT):

0 = Clear
1 = Raised

cpus The total number of CPUs on the node.

davail Disk space available on the node.


DiskFailureAlarm Failed Disks alarm (DISK_FAILURE):

0 = Clear
1 = Raised

disks Total number of disks on the node.

dreadK Disk Kbytes read since the last heartbeat.

dreads Disk read operations since the last heartbeat.

dtotal Total disk space on the node.

dused Disk space used on the node.

dwriteK Disk Kbytes written since the last heartbeat.

dwrites Disk write ops since the last heartbeat.

faileddisks Number of failed MapR-FS disks on the node.

failedDisksAlarm Disk Failure Alarm (NODE_ALARM_DISK_FAILURE)

0 = Clear
1 = Raised

fs-heartbeat Time since the last heartbeat to the CLDB, in seconds.

health Overall node health, calculated from various alarm states:

0 = Healthy
1 = Needs attention
2 = Degraded
3 = Maintenance
4 = Critical

hostname The host name.

id The node ID.

ip A list of IP addresses associated with the node.

jt-heartbeat Time since the last heartbeat to the JobTracker, in seconds.

logLevelAlarm Excessive Logging Alarm (NODE_ALARM_DEBUG_LOGGING):

0 = Clear
1 = Raised

MapRfs disks The number of disks on the node in use by MapR-FS.

mtotal Total memory, in GB.

mused Memory used, in GB.

optMapRFullAlarm Installation Directory Full Alarm (NODE_ALARM_OPT_MAPR_FULL):

0 = Clear
1 = Raised

rootPartitionFullAlarm Root Partition Full Alarm (NODE_ALARM_ROOT_PARTITION_FULL):

0 = Clear
1 = Raised

rpcin RPC bytes received since the last heartbeat.

rpcout RPC bytes sent since the last heartbeat.

rpcs Number of RPCs since the last heartbeat.


service A comma-separated list of services running on the node:

cldb - CLDB
fileserver - MapR-FS
jobtracker - JobTracker
tasktracker - TaskTracker
hbmaster - HBase Master
hbregionserver - HBase RegionServer
nfs - NFS Gateway
Example: "cldb,fileserver,nfs"

ServiceCLDBDownAlarm CLDB Service Down Alarm (NODE_ALARM_SERVICE_CLDB_DOWN)

0 = Clear
1 = Raised

ServiceFileserverDownNotRunningAlarm Fileserver Service Down Alarm (NODE_ALARM_SERVICE_FILESERVER_DOWN)

0 = Clear
1 = Raised

serviceHBMasterDownAlarm HBase Master Service Down Alarm (NODE_ALARM_SERVICE_HBMASTER_DOWN)

0 = Clear
1 = Raised

serviceHBRegionDownAlarm HBase RegionServer Service Down Alarm (NODE_ALARM_SERVICE_HBREGION_DOWN)

0 = Clear
1 = Raised

servicesHoststatsDownAlarm Hoststats Service Down Alarm (NODE_ALARM_SERVICE_HOSTSTATS_DOWN)

0 = Clear
1 = Raised

serviceJTDownAlarm Jobtracker Service Down Alarm (NODE_ALARM_SERVICE_JT_DOWN)

0 = Clear
1 = Raised

ServiceMiscDownNotRunningAlarm Miscellaneous Service Down Alarm:
0 = Clear
1 = Raised

serviceNFSDownAlarm NFS Service Down Alarm (NODE_ALARM_SERVICE_NFS_DOWN):

0 = Clear
1 = Raised

serviceTTDownAlarm Tasktracker Service Down Alarm (NODE_ALARM_SERVICE_TT_DOWN):

0 = Clear
1 = Raised

servicesWebserverDownAlarm Webserver Service Down Alarm (NODE_ALARM_SERVICE_WEBSERVER_DOWN)

0 = Clear
1 = Raised

timeSkewAlarm Time Skew alarm (NODE_ALARM_TIME_SKEW):

0 = Clear
1 = Raised
topo(rack) The rack path.

ttmapSlots TaskTracker map slots.

ttmapUsed TaskTracker map slots used.

ttReduceSlots TaskTracker reduce slots.

ttReduceUsed TaskTracker reduce slots used.

uptime The number of seconds the machine has been up since the last restart.

utilization CPU use percentage since the last heartbeat.

versionMismatchAlarm Software Version Mismatch Alarm (NODE_ALARM_VERSION_MISMATCH):

0 = Clear
1 = Raised

Examples

List all nodes:

CLI
maprcli node list

REST
https://r1n1.sj.us:8443/rest/node/list
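
Display only the hostname, health, and IP addresses of nodes running NFS (the field names come from the Fields table above; the REST form assumes the documented parameter names, and the host is illustrative):

CLI
maprcli node list -nfsnodes 1 -columns hostname,health,ip

REST
https://r1n1.sj.us:8443/rest/node/list?nfsnodes=1&columns=hostname,health,ip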

node move

Moves one or more nodes to a different topology. Permissions required: fc or a

Syntax

CLI
maprcli node move
[ -cluster <cluster> ]
-serverids <server IDs>
-topology <topology>

REST
http[s]://<host>:<port>/rest/node/move?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

serverids The server IDs of the nodes to move.

topology The new topology.
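
Examples

Move a node into the topology /data/rack2 (the server ID and topology path below are illustrative; use node list to look up real server IDs):

CLI
maprcli node move -serverids 6394230189818826805 -topology /data/rack2

REST
https://r1n1.sj.us:8443/rest/node/move?serverids=6394230189818826805&topology=/data/rack2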

node path
Changes the path of the specified node or nodes. Permissions required: fc or a

Syntax

CLI
maprcli node path
[ -cluster <cluster> ]
[ -filter <filter> ]
[ -nodes <node names> ]
-path <path>
[ -which switch|rack|both ]
[ -zkconnect <ZooKeeper Connect String> ]

REST
http[s]://<host>:<port>/rest/node/path?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

filter A filter specifying the nodes whose paths to change. See Filters for more information.

nodes A list of node names, separated by spaces.

path The path to change.

which Which path to change: switch, rack or both. Default: rack

zkconnect ZooKeeper Connect String.
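
Examples

A sketch only, assuming the path parameter supplies the new rack path for the listed nodes (the node names and path are illustrative):

CLI
maprcli node path -nodes "node-a node-b" -path /data/rack2 -which rack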

node remove
Remove one or more server nodes from the system. Permissions required: fc or a

Syntax

CLI
maprcli node remove
[ -filter <filter> ]
[ -force true|false ]
[ -nodes <node names> ]
[ -zkconnect <ZooKeeper Connect String> ]

REST
http[s]://<host>:<port>/rest/node/remove?<parameters>

Parameters

Parameter Description

filter A filter specifying the nodes to remove. See Filters for more information.

force Forces the service stop operations. Default: false

nodes A list of node names, separated by spaces.


zkconnect ZooKeeper Connect String. Example: 'host:port,host:port,host:port,...'. Default: localhost:5181
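
Examples

Remove the node named node-14 (the node name is illustrative):

CLI
maprcli node remove -nodes node-14

REST
https://r1n1.sj.us:8443/rest/node/remove?nodes=node-14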

node services

Starts, stops, restarts, suspends, or resumes services on one or more server nodes. Permissions required: ss, fc or a

The same set of services applies to all specified nodes; to manipulate different groups of services differently, send multiple requests.

Note: the suspend and resume actions have not yet been implemented.

Syntax

CLI
maprcli node services
[ -action restart|resume|start|stop|suspend ]
[ -cldb restart|resume|start|stop|suspend ]
[ -cluster <cluster> ]
[ -fileserver restart|resume|start|stop|suspend ]
[ -filter <filter> ]
[ -hbmaster restart|resume|start|stop|suspend ]
[ -hbregionserver restart|resume|start|stop|suspend ]
[ -jobtracker restart|resume|start|stop|suspend ]
[ -name <service> ]
[ -nfs restart|resume|start|stop|suspend ]
[ -nodes <node names> ]
[ -tasktracker restart|resume|start|stop|suspend ]
[ -zkconnect <ZooKeeper Connect String> ]

REST
http[s]://<host>:<port>/rest/node/services?<parameters>

Parameters

When used together, the action and name parameters specify an action to perform on a service. To start the JobTracker, for example, you can
either specify start for the action and jobtracker for the name, or simply pass start as the value of the jobtracker parameter.

Parameter Description

action An action to perform on a service specified in the name parameter: restart, resume, start, stop, or suspend

cldb Starts or stops the cldb service. Values: restart, resume, start, stop, or suspend

cluster The cluster on which to run the command.

fileserver Starts or stops the fileserver service. Values: restart, resume, start, stop, or suspend

filter A filter specifying nodes on which to start or stop services. See Filters for more information.

hbmaster Starts or stops the hbmaster service. Values: restart, resume, start, stop, or suspend

hbregionserver Starts or stops the hbregionserver service. Values: restart, resume, start, stop, or suspend

jobtracker Starts or stops the jobtracker service. Values: restart, resume, start, stop, or suspend

name A service on which to perform an action specified by the action parameter.

nfs Starts or stops the nfs service. Values: restart, resume, start, stop, or suspend

nodes A list of node names, separated by spaces.

tasktracker Starts or stops the tasktracker service. Values: restart, resume, start, stop, or suspend

zkconnect ZooKeeper Connect String
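
Examples

Restart the TaskTracker on two nodes (the node names are illustrative). Either form below works, per the action and name parameters described above:

CLI
maprcli node services -action restart -name tasktracker -nodes "node-a node-b"

CLI
maprcli node services -tasktracker restart -nodes "node-a node-b"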


node topo

Lists cluster topology information.

Lists only internal nodes (switches, racks, and so on), not leaf nodes (server nodes).

Syntax

CLI
maprcli node topo
[ -cluster <cluster> ]
[ -path <path> ]

REST
http[s]://<host>:<port>/rest/node/topo?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

path The path on which to list node topology.

Output

Node topology information.

Sample output

{
status:"OK",
total:recordCount,
data: [
{
path:'path',
status:[errorChildCount,OKChildCount,configChildCount],
},
...additional structures above for each topology node...
]
}

Output Fields

Field Description

path The physical topology path to the node.

errorChildCount The number of descendants of the node which have overall status 0.

OKChildCount The number of descendants of the node which have overall status 1.

configChildCount The number of descendants of the node which have overall status 2.
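
Examples

List the topology under the path /data (the path is illustrative):

CLI
maprcli node topo -path /data

REST
https://r1n1.sj.us:8443/rest/node/topo?path=/data
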
schedule
The schedule commands let you work with schedules:

schedule create creates a schedule


schedule list lists schedules
schedule modify modifies the name or rules of a schedule by ID
schedule remove removes a schedule by ID

A schedule is a JSON object that specifies a single or recurring time for volume snapshot creation or mirror syncing. For a schedule to be useful,
it must be associated with at least one volume. See volume create and volume modify.

Schedule Fields

The schedule object contains the following fields:

Field Value

id The ID of the schedule.

name The name of the schedule.

inuse Indicates whether the schedule is associated with an action.

rules An array of JSON objects specifying how often the scheduled action occurs. See Rule Fields below.

Rule Fields

The following table shows the fields to use when creating a rules object.

Field Values

frequency How often to perform the action:

once - Once
yearly - Yearly
monthly - Monthly
weekly - Weekly
daily - Daily
hourly - Hourly
semihourly - Every 30 minutes
quarterhourly - Every 15 minutes
fiveminutes - Every 5 minutes
minute - Every minute

retain How long to retain the data resulting from the action. For example, if the schedule creates a snapshot, the retain field sets the
snapshot's expiration. The retain field consists of an integer and one of the following units of time:

mi - minutes
h - hours
d - days
w - weeks
m - months
y - years

time The time of day to perform the action, in 24-hour format: HH

date The date on which to perform the action:

For single occurrences, specify month, day and year: MM/DD/YYYY


For yearly occurrences, specify the month and day: MM/DD
For monthly occurrences, specify the day: DD
Daily and hourly occurrences do not require the date field.

Example
The following example JSON shows a schedule called "snapshot," with three rules.

{
"id":8,
"name":"snapshot",
"inuse":0,
"rules":[
{
"frequency":"monthly",
"date":"8",
"time":14,
"retain":"1m"
},
{
"frequency":"weekly",
"date":"sat",
"time":14,
"retain":"2w"
},
{
"frequency":"hourly",
"retain":"1d"
}
]
}

schedule create

Creates a schedule. Permissions required: fc or a

A schedule can be associated with a volume to automate mirror syncing and snapshot creation. See volume create and volume modify.

Syntax

CLI
maprcli schedule create
[ -cluster <cluster> ]
-schedule <JSON>

REST
http[s]://<host>:<port>/rest/schedule/create?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

schedule A JSON object describing the schedule. See Schedule Objects for more information.

Examples

Scheduling a Single Occurrence

CLI
maprcli schedule create -schedule '{"name":"Schedule-1","rules":[{"frequency":"once","retain":"1w","time":13,"d
REST
https://r1n1.sj.us:8443/rest/schedule/create?schedule={"name":"Schedule-1","rules":[{"frequency":"once","retain

A Schedule with Several Rules

CLI
maprcli schedule create -schedule '{"name":"Schedule-1","rules":[{"frequency":"weekly","date":"sun","time":7,"r

REST
https://r1n1.sj.us:8443/rest/schedule/create?schedule={"name":"Schedule-1","rules":[{"frequency":"weekly","date

schedule list
Lists the schedules on the cluster.

Syntax

CLI
maprcli schedule list
[ -cluster <cluster> ]
[ -output terse|verbose ]

REST
http[s]://<host>:<port>/rest/schedule/list?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

output Specifies whether the output should be terse or verbose.

Output

A list of all schedules on the cluster. See Schedule Objects for more information.

Examples

List schedules:

CLI
maprcli schedule list

REST
https://r1n1.sj.us:8443/rest/schedule/list

schedule modify
Modifies an existing schedule, specified by ID. Permissions required: fc or a

To find a schedule's ID:

1. Use the schedule list command to list the schedules.


2. Select the schedule to modify
3. Pass the selected schedule's ID in the -id parameter to the schedule modify command.

Syntax

CLI
maprcli schedule modify
[ -cluster <cluster> ]
-id <schedule ID>
[ -name <schedule name> ]
[ -rules <JSON> ]

REST
http[s]://<host>:<port>/rest/schedule/modify?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

id The ID of the schedule to modify.

name The new name of the schedule.

rules A JSON object describing the rules for the schedule. If specified, replaces the entire existing rules object in the schedule. For
information about the fields to use in the JSON object, see Rule Fields.

Examples

Modify a schedule

CLI
maprcli schedule modify -id 0 -name Newname -rules '[{"frequency":"weekly","date":"sun","time":7,"retain":"2w"}

REST
https://r1n1.sj.us:8443/rest/schedule/modify?id=0&name=Newname&rules=[{"frequency":"weekly","date":"sun","time"

schedule remove

Removes a schedule.

A schedule can only be removed if it is not associated with any volumes. See volume modify.

Syntax

CLI
maprcli schedule remove
[ -cluster <cluster> ]
-id <schedule ID>
REST
http[s]://<host>:<port>/rest/schedule/remove?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

id The ID of the schedule to remove.

Examples

Remove schedule with ID 0:

CLI
maprcli schedule remove -id 0

REST
https://r1n1.sj.us:8443/rest/schedule/remove?id=0

service list

Lists all services on the specified node, along with the state and log path for each service.

Syntax

CLI
maprcli service list
-node <node name>

REST
http[s]://<host>:<port>/rest/service/list?<parameters>

Parameters

Parameter Description

node The node on which to list the services.

Output

Information about services on the specified node. For each service, the status is reported numerically:

0 - NOT_CONFIGURED: the package for the service is not installed and/or the service is not configured (configure.sh has not run)
2 - RUNNING: the service is installed, has been started by the warden, and is currently executing
3 - STOPPED: the service is installed and configure.sh has run, but the service is currently not executing
5 - STAND_BY: the service is installed and is in standby mode, waiting to take over in case of failure of another instance (mainly used for
JobTracker warm standby)
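
Examples

List the services on a particular node (the node name is illustrative):

CLI
maprcli service list -node r1n1.sj.us

REST
https://r1n1.sj.us:8443/rest/service/list?node=r1n1.sj.us
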
setloglevel
The setloglevel commands set the log level on individual services:

setloglevel cldb - Sets the log level for the CLDB.


setloglevel hbmaster - Sets the log level for the HB Master.
setloglevel hbregionserver - Sets the log level for the HBase RegionServer.
setloglevel jobtracker - Sets the log level for the JobTracker.
setloglevel fileserver - Sets the log level for the FileServer.
setloglevel nfs - Sets the log level for NFS.
setloglevel tasktracker - Sets the log level for the TaskTracker.

setloglevel cldb

Sets the log level on the CLDB service. Permissions required: fc or a

Syntax

CLI
maprcli setloglevel cldb
-classname <class>
-loglevel DEBUG|ERROR|FATAL|INFO|TRACE|WARN
-node <node>

REST
http[s]://<host>:<port>/rest/setloglevel/cldb?<parameters>

Parameters

Parameter Description

classname The name of the class for which to set the log level.

loglevel The log level to set:

DEBUG
ERROR
FATAL
INFO
TRACE
WARN

node The node on which to set the log level.
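
Examples

Set a CLDB class to DEBUG logging on one node. The class name below is purely illustrative, not a documented class; substitute a class name taken from your CLDB log, and use your own node name:

CLI
maprcli setloglevel cldb -classname com.mapr.ExampleClass -loglevel DEBUG -node r1n1.sj.us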

setloglevel fileserver

Sets the log level on the FileServer service. Permissions required: fc or a

Syntax
CLI
maprcli setloglevel fileserver
-classname <class>
-loglevel DEBUG|ERROR|FATAL|INFO|TRACE|WARN
-node <node>

REST
http[s]://<host>:<port>/rest/setloglevel/fileserver?<parameters>

Parameters

Parameter Description

classname The name of the class for which to set the log level.

loglevel The log level to set:

DEBUG
ERROR
FATAL
INFO
TRACE
WARN

node The node on which to set the log level.

setloglevel hbmaster

Sets the log level on the HB Master service. Permissions required: fc or a

Syntax

CLI
maprcli setloglevel hbmaster
-classname <class>
-loglevel DEBUG|ERROR|FATAL|INFO|TRACE|WARN
-node <node>

REST
http[s]://<host>:<port>/rest/setloglevel/hbmaster?<parameters>

Parameters

Parameter Description

classname The name of the class for which to set the log level.

loglevel The log level to set:

DEBUG
ERROR
FATAL
INFO
TRACE
WARN
node The node on which to set the log level.

setloglevel hbregionserver

Sets the log level on the HB RegionServer service. Permissions required: fc or a

Syntax

CLI
maprcli setloglevel hbregionserver
-classname <class>
-loglevel DEBUG|ERROR|FATAL|INFO|TRACE|WARN
-node <node>

REST
http[s]://<host>:<port>/rest/setloglevel/hbregionserver?<parameters>

Parameters

Parameter Description

classname The name of the class for which to set the log level.

loglevel The log level to set:

DEBUG
ERROR
FATAL
INFO
TRACE
WARN

node The node on which to set the log level.

setloglevel jobtracker

Sets the log level on the JobTracker service. Permissions required: fc or a

Syntax

CLI
maprcli setloglevel jobtracker
-classname <class>
-loglevel DEBUG|ERROR|FATAL|INFO|TRACE|WARN
-node <node>

REST
http[s]://<host>:<port>/rest/setloglevel/jobtracker?<parameters>

Parameters
Parameter Description

classname The name of the class for which to set the log level.

loglevel The log level to set:

DEBUG
ERROR
FATAL
INFO
TRACE
WARN

node The node on which to set the log level.

setloglevel nfs

Sets the log level on the NFS service. Permissions required: fc or a

Syntax

CLI
maprcli setloglevel nfs
-classname <class>
-loglevel DEBUG|ERROR|FATAL|INFO|TRACE|WARN
-node <node>

REST
http[s]://<host>:<port>/rest/setloglevel/nfs?<parameters>

Parameters

Parameter Description

classname The name of the class for which to set the log level.

loglevel The log level to set:

DEBUG
ERROR
FATAL
INFO
TRACE
WARN

node The node on which to set the log level.

setloglevel tasktracker

Sets the log level on the TaskTracker service. Permissions required: fc or a

Syntax
CLI
maprcli setloglevel tasktracker
-classname <class>
-loglevel DEBUG|ERROR|FATAL|INFO|TRACE|WARN
-node <node>

REST
http[s]://<host>:<port>/rest/setloglevel/tasktracker?<parameters>

Parameters

Parameter Description

classname The name of the class for which to set the log level.

loglevel The log level to set:

DEBUG
ERROR
FATAL
INFO
TRACE
WARN

node The node on which to set the log level.

trace
The trace commands let you view and modify the trace buffer, and the trace levels for the system modules. The valid trace levels are:

DEBUG
INFO
ERROR
WARN
FATAL

The following pages provide information about the trace commands:

trace dump
trace info
trace print
trace reset
trace resize
trace setlevel
trace setmode

trace dump

Dumps the contents of the trace buffer into the MapR-FS log.

Syntax

CLI
maprcli trace dump
[ -host <host> ]
[ -port <port> ]

REST None.

Parameters
Parameter Description

host The IP address of the node from which to dump the trace buffer. Default: localhost

port The port to use when dumping the trace buffer. Default: 5660

Examples

Dump the trace buffer to the MapR-FS log:

CLI
maprcli trace dump

trace info

Displays the trace level of each module.

Syntax

CLI
maprcli trace info
[ -host <host> ]
[ -port <port> ]

REST None.

Parameters

Parameter Description

host The IP address of the node on which to display the trace level of each module. Default: localhost

port The port to use. Default: 5660

Output

A list of all modules and their trace levels.

Sample Output
RPC Client Initialize
**Trace is in DEFAULT mode.
**Allowed Trace Levels are:
FATAL
ERROR
WARN
INFO
DEBUG
**Trace buffer size: 2097152
**Modules and levels:
Global : INFO
RPC : ERROR
MessageQueue : ERROR
CacheMgr : INFO
IOMgr : INFO
Transaction : ERROR
Log : INFO
Cleaner : ERROR
Allocator : ERROR
BTreeMgr : ERROR
BTree : ERROR
BTreeDelete : ERROR
BTreeOwnership : INFO
MapServerFile : ERROR
MapServerDir : INFO
Container : INFO
Snapshot : INFO
Util : ERROR
Replication : INFO
PunchHole : ERROR
KvStore : ERROR
Truncate : ERROR
Orphanage : INFO
FileServer : INFO
Defer : ERROR
ServerCommand : INFO
NFSD : INFO
Cidcache : ERROR
Client : ERROR
Fidcache : ERROR
Fidmap : ERROR
Inode : ERROR
JniCommon : ERROR
Shmem : ERROR
Table : ERROR
Fctest : ERROR
DONE

Examples

Display trace info:

CLI
maprcli trace info

trace print

Manually dumps the trace buffer to stdout.

Syntax
CLI
maprcli trace print
[ -host <host> ]
[ -port <port> ]
-size <size>

REST None.

Parameters

Parameter Description

host The IP address of the node from which to dump the trace buffer to stdout. Default: localhost

port The port to use. Default: 5660

size The number of kilobytes of the trace buffer to print. Maximum: 64

Output

The most recent <size> kilobytes of the trace buffer.

-----------------------------------------------------
2010-10-04 13:59:31,0000 Program: mfs on Host: fakehost IP: 0.0.0.0, Port: 0, PID: 0

-----------------------------------------------------
DONE

Examples

Display the trace buffer:

CLI
maprcli trace print

trace reset

Resets the in-memory trace buffer.

Syntax

CLI
maprcli trace reset
[ -host <host> ]
[ -port <port> ]

REST None.

Parameters
Parameter Description

host The IP address of the node on which to reset the trace parameters. Default: localhost

port The port to use. Default: 5660

Examples

Reset trace parameters:

CLI
maprcli trace reset

trace resize

Resizes the trace buffer.

Syntax

CLI
maprcli trace resize
[ -host <host> ]
[ -port <port> ]
-size <size>

REST None.

Parameters

Parameter Description

host The IP address of the node on which to resize the trace buffer. Default: localhost

port The port to use. Default: 5660

size The size of the trace buffer, in kilobytes. Default: 2097152 Minimum: 1

Examples

Resize the trace buffer to 1000 KB:

CLI
maprcli trace resize -size 1000

trace setlevel

Sets the trace level on one or more modules.


Syntax

CLI
maprcli trace setlevel
[ -host <host> ]
-level <trace level>
-module <module name>
[ -port <port> ]

REST None.

Parameters

Parameter Description

host The node on which to set the trace level. Default: localhost

module The module on which to set the trace level. If set to all, sets the trace level on all modules.

level The new trace level. If set to default, sets the trace level to its default.

port The port to use. Default: 5660

Examples

Set the trace level of the log module to INFO:

CLI
maprcli trace setlevel -module log -level info

Set the trace levels of all modules to their defaults:

CLI
maprcli trace setlevel -module all -level default

trace setmode

Sets the trace mode. There are two modes:

Default
Continuous

In default mode, all trace messages are saved in a memory buffer. If there is an error, the buffer is dumped to stdout. In continuous mode, every
allowed trace message is dumped to stdout in real time.

Syntax
CLI
maprcli trace setmode
[ -host <host> ]
-mode default|continuous
[ -port <port> ]

REST None.

Parameters

Parameter Description

host The IP address of the host on which to set the trace mode.

mode The trace mode.

port The port to use.

Examples

Set the trace mode to continuous:

CLI
maprcli trace setmode -mode continuous

urls

The urls command displays the status page URL for the specified service.

Syntax

CLI
maprcli urls
[ -cluster <cluster> ]
-name <service name>
[ -zkconnect <zookeeper connect string> ]

REST
http[s]://<host>:<port>/rest/urls/<name>

Parameters

Parameter Description

cluster The cluster on which to run the command.


name The name of the service for which to get the status page:

cldb
jobtracker
tasktracker

zkconnect ZooKeeper Connect String

Examples

Display the URL of the status page for the CLDB service:

CLI
maprcli urls -name cldb

REST
https://r1n1.sj.us:8443/rest/maprcli/urls/cldb

virtualip
The virtualip commands let you work with virtual IP addresses for NFS nodes:

virtualip add adds a ranges of virtual IP addresses


virtualip list lists virtual IP addresses
virtualip remove removes a range of virtual IP addresses

Virtual IP Fields

Field Description

macaddress The MAC address of the virtual IP.

netmask The netmask of the virtual IP.

virtualip The virtual IP, or the start of the virtual IP range.

virtualipend The virtual IP range end.

virtualip add
Adds a virtual IP address. Permissions required: fc or a

Syntax

CLI
maprcli virtualip add
[ -cluster <cluster> ]
[ -macaddress <MAC address> ]
-netmask <netmask>
-virtualip <virtualip>
[ -virtualipend <virtual IP range end> ]

REST
http[s]://<host>:<port>/rest/virtualip/add?<parameters>

Parameters
Parameter Description

cluster The cluster on which to run the command.

macaddress The MAC address of the virtual IP.

netmask The netmask of the virtual IP.

virtualip The virtual IP.

virtualipend The virtual IP range end.
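
Examples

Add a range of ten virtual IP addresses (the addresses and netmask are illustrative):

CLI
maprcli virtualip add -virtualip 10.10.10.10 -virtualipend 10.10.10.19 -netmask 255.255.255.0

REST
https://r1n1.sj.us:8443/rest/virtualip/add?virtualip=10.10.10.10&virtualipend=10.10.10.19&netmask=255.255.255.0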

virtualip edit

Edits a virtual IP (VIP) range. Permissions required: fc or a

Syntax

CLI
maprcli virtualip edit
[ -cluster <cluster> ]
[ -macs <mac address(es)> ]
-netmask <netmask>
-virtualip <virtualip>
[ -virtualipend <range end> ]

REST
http[s]://<host>:<port>/rest/virtualip/edit?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

macs The MAC address or addresses to associate with the VIP or VIP range.

netmask The netmask for the VIP or VIP range.

virtualip The start of the VIP range, or the VIP if only one VIP is used.

virtualipend The end of the VIP range if more than one VIP is used.
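
Examples

Change the netmask for an existing VIP range (the addresses and netmask are illustrative):

CLI
maprcli virtualip edit -virtualip 10.10.10.10 -virtualipend 10.10.10.19 -netmask 255.255.0.0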

virtualip list

Lists the virtual IP addresses in the cluster.

Syntax
CLI
maprcli virtualip list
[ -cluster <cluster> ]
[ -columns <columns> ]
[ -filter <filter> ]
[ -limit <limit> ]
[ -nfsmacs <NFS macs> ]
[ -output <output> ]
[ -range <range> ]
[ -start <start> ]

REST
http[s]://<host>:<port>/rest/virtualip/list?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

columns The columns to display.

filter A filter specifying VIPs to list. See Filters for more information.

limit The number of records to return.

nfsmacs The MAC addresses of servers running NFS.

output Whether the output should be terse or verbose.

range The VIP range.

start The index of the first record to return.
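
Examples

List all virtual IP addresses in the cluster:

CLI
maprcli virtualip list

REST
https://r1n1.sj.us:8443/rest/virtualip/list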

virtualip remove

Removes a virtual IP (VIP) or a VIP range. Permissions required: fc or a

Syntax

CLI
maprcli virtualip remove
[ -cluster <cluster> ]
-virtualip <virtual IP>
[ -virtualipend <Virtual IP Range End> ]

REST
http[s]://<host>:<port>/rest/virtualip/remove?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

virtualip The virtual IP or the start of the VIP range to remove.


virtualipend The end of the VIP range to remove.
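
Examples

Remove a VIP range (the addresses are illustrative):

CLI
maprcli virtualip remove -virtualip 10.10.10.10 -virtualipend 10.10.10.19

REST
https://r1n1.sj.us:8443/rest/virtualip/remove?virtualip=10.10.10.10&virtualipend=10.10.10.19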

volume
The volume commands let you work with volumes, snapshots and mirrors:

volume create creates a volume


volume dump create creates a volume dump
volume dump restore restores a volume from a volume dump
volume info displays information about a volume
volume link create creates a symbolic link
volume link remove removes a symbolic link
volume list lists volumes in the cluster
volume mirror push pushes a volume's changes to its local mirrors
volume mirror start starts mirroring a volume
volume mirror stop stops mirroring a volume
volume modify modifies a volume
volume mount mounts a volume
volume move moves a volume
volume remove removes a volume
volume rename renames a volume
volume snapshot create creates a volume snapshot
volume snapshot list lists volume snapshots
volume snapshot preserve prevents a volume snapshot from expiring
volume snapshot remove removes a volume snapshot
volume unmount unmounts a volume

volume create

Creates a volume. Permissions required: cv, fc, or a

Syntax

CLI
maprcli volume create
[ -advisoryquota <advisory quota> ]
[ -ae <accounting entity> ]
[ -aetype <accounting entity type> ]
[ -canBackup <list of users and groups> ]
[ -canDeleteAcl <list of users and groups> ]
[ -canDeleteVolume <list of users and groups> ]
[ -canEditconfig <list of users and groups> ]
[ -canMirror <list of users and groups> ]
[ -canMount <list of users and groups> ]
[ -canSnapshot <list of users and groups> ]
[ -canViewConfig <list of users and groups> ]
[ -cluster <cluster> ]
[ -groupperm <group:allowMask list> ]
[ -localvolumehost <localvolumehost> ]
[ -localvolumeport <localvolumeport> ]
[ -minreplication <minimum replication factor> ]
[ -mount 0|1 ]
-name <volume name>
[ -path <mount path> ]
[ -quota <quota> ]
[ -readonly <read-only status> ]
[ -replication <replication factor> ]
[ -rereplicationtimeoutsec <seconds> ]
[ -rootdirperms <root directory permissions> ]
[ -schedule <ID> ]
[ -source <source> ]
[ -topology <topology> ]
[ -type 0|1 ]
[ -user <user:allowMask list> ]
REST
http[s]://<host>:<port>/rest/volume/create?<parameters>

Parameters

Parameter Description

advisoryquota The advisory quota for the volume as integer plus unit. Example: quota=500G; Units: B, K, M, G, T, P

ae The accounting entity that owns the volume.

aetype The type of accounting entity:

0=user
1=group

canBackup The list of users and groups who can back up the volume.

canDeleteAcl The list of users and groups who can delete the volume access control list (ACL)

canDeleteVolume The list of users and groups who can delete the volume.

canEditconfig The list of users and groups who can edit the volume properties.

canMirror The list of users and groups who can mirror the volume.

canMount The list of users and groups who can mount the volume.

canSnapshot The list of users and groups who can create a snapshot of the volume.

canViewConfig The list of users and groups who can view the volume properties.

cluster The cluster on which to create the volume.

groupperm List of permissions in the format group:allowMask

localvolumehost The local volume host.

localvolumeport The local volume port. Default: 5660

minreplication The minimum replication level. Default: 0

mount Specifies whether the volume is mounted at creation time.

name The name of the volume to create.

path The path at which to mount the volume.

quota The quota for the volume as integer plus unit. Example: quota=500G; Units: B, K, M, G, T, P

readonly Specifies whether the volume is read-only:

0 = read/write
1 = read-only

replication The desired replication level. Default: 0

rereplicationtimeoutsec The re-replication timeout, in seconds.

rootdirperms Permissions on the volume root directory.

schedule The ID of a schedule. If a schedule ID is provided, then the volume will automatically create snapshots (normal
volume) or sync with its source volume (mirror volume) on the specified schedule.

source For mirror volumes, the source volume to mirror, in the format <source volume>@<cluster> (Required when
creating a mirror volume).

topology The rack path to the volume.


type The type of volume to create:

0 - standard volume
1 - mirror volume

user List of permissions in the format user:allowMask.

Examples

Create the volume "test-volume" mounted at "/test/test-volume":

CLI
maprcli volume create -name test-volume -path /test/test-volume

REST
https://r1n1.sj.us:8443/rest/volume/create?name=test-volume&path=/test/test-volume

Create Volume with a Quota and an Advisory Quota

This example creates a volume with the following parameters:

advisoryquota: 100M
name: volumename
path: /volumepath
quota: 500M
replication: 3
schedule: 2
topology: /East Coast
type: 0

CLI
maprcli volume create -name volumename -path /volumepath -advisoryquota 100M -quota 500M -replication 3 -schedu

REST
https://r1n1.sj.us:8443/rest/volume/create?advisoryquota=100M&name=volumename&path=/volumepath&quota=500M&repli

Create the mirror volume "test-volume.mirror" from source volume "test-volume" and mount at "/test/test-volume-mirror":

CLI
maprcli volume create -name test-volume.mirror -source test-volume@src-cluster-name -path /test/test-volume-mir

REST
https://r1n1.sj.us:8443/rest/volume/create?name=test-volume.mirror&source=test-volume@src-cluster-name&path=/tes

volume dump create

The volume dump create command creates a volume dump file containing data from a volume for distribution or restoration.

You can use volume dump create to create two types of files:

full dump files containing all data in a volume


incremental dump files that contain changes to a volume between two points in time

A full dump file is useful for restoring a volume from scratch. An incremental dump file contains the changes necessary to take an existing (or
restored) volume from one point in time to another. Along with the dump file, a full or incremental dump operation can produce a state file
(specified by the -e parameter) that contains a table of the version number of every container in the volume at the time the dump file was created.
This represents the end point of the dump file, which is used as the start point of the next incremental dump. The main difference between
creating a full dump and creating an incremental dump is whether the -s parameter is specified; if -s is not specified, the volume dump create command
includes all volume data and creates a full dump file. If you create a full dump followed by a series of incremental dumps, the result is a sequence
of dump files and their accompanying state files:

dumpfile1 statefile1

dumpfile2 statefile2

dumpfile3 statefile3

...

You can restore the volume from scratch, using the volume dump restore command with each dump file.

Syntax

CLI
maprcli volume dump create
[ -cluster <cluster> ]
-dumpfile <dump file>
[-e <end state file> ]
-name volumename
[ -o ]
[ -s <start state file> ]

REST None.

Parameters

Parameter Description

cluster The cluster on which to run the command.

dumpfile The name of the dump file (ignored if -o is used).

e The name of the state file to create for the end point of the dump.

name A volume name.

o This option dumps the volume to stdout instead of to a file.

s The start point for an incremental dump.

Examples

Create a full dump:

CLI
maprcli volume dump create -e statefile1 -dumpfile fulldump1 -name volume

Create an incremental dump:

CLI
maprcli volume dump create -s statefile1 -e statefile2 -name volume -dumpfile incrdump1

volume dump restore


The volume dump restore command restores or updates a volume from a dump file. Permissions required: dump, fc, or a

There are two ways to use volume dump restore:

With a full dump file, volume dump restore recreates a volume from scratch from volume data stored in the dump file.
With an incremental dump file, volume dump restore updates a volume using incremental changes stored in the dump file.

The volume that results from a volume dump restore operation is a mirror volume whose source is the volume from which the dump was
created. After the operation, this volume can perform mirroring from the source volume.

When you are updating a volume from an incremental dump file, you must specify an existing volume and an incremental dump file. To restore
from a sequence of previous dump files, first restore from the volume's full dump file, then apply each subsequent incremental dump file in
order.

A restored volume may contain mount points that represent volumes that were mounted under the original source volume from which the dump
was created. In the restored volume, these mount points have no meaning and do not provide access to any volumes that were mounted under
the source volume. If the source volume still exists, then the mount points in the restored volume will work if the restored volume is associated
with the source volume as a mirror.

Syntax

CLI
maprcli volume dump restore
[ -cluster <cluster> ]
-dumpfile dumpfilename
[ -i ]
[ -n ]
-name <volume name>

REST None.

Parameters

Parameter Description

cluster The cluster on which to run the command.

dumpfile The name of the dumpfile (ignored if -i is used).

i This option reads the dump file from stdin.

n This option creates a new volume if it doesn't exist.

name A volume name, in the form volumename

Examples

Restore a volume from a full dump file:

CLI
maprcli volume dump restore -name volume -dumpfile fulldump1

Apply an incremental dump file to a volume:

CLI
maprcli volume dump restore -name volume -dumpfile incrdump1
volume fixmountpath

Corrects the mount path of a volume. Permissions required: fc or a

The CLDB maintains information about the mount path of every volume. If a directory in a volume's path is renamed (by a hadoop fs command,
for example), the information in the CLDB becomes out of date. The volume fixmountpath command does a reverse path walk from the volume
and corrects the mount path information in the CLDB.

Syntax

CLI
maprcli volume fixmountpath
-name <name>

REST
http[s]://<host>:<port>/rest/volume/fixmountpath?<parameters>

Parameters

Parameter Description

name The volume name.

Examples

Fix the mount path of volume v1:

CLI
maprcli volume fixmountpath -name v1

REST
https://r1n1.sj.us:8443/rest/volume/fixmountpath?name=v1

volume info

Displays information about the specified volume.

Syntax

CLI
maprcli volume info
[ -cluster <cluster> ]
[ -name <volume name> ]
[ -output terse|verbose ]
[ -path <path> ]

REST
http[s]://<host>:<port>/rest/volume/info?<parameters>

Parameters

You must specify either name or path.


Parameter Description

cluster The cluster on which to run the command.

name The volume for which to retrieve information.

output Whether the output should be terse or verbose.

path The mount path of the volume for which to retrieve information.
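
Examples

Display information about the volume "test-volume":

CLI
maprcli volume info -name test-volume

REST
https://r1n1.sj.us:8443/rest/volume/info?name=test-volume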

volume link create

Creates a link to a volume. Permissions required: fc or a

Syntax

CLI
maprcli volume link create
-path <path>
-type <type>
-volume <volume>

REST
http[s]://<host>:<port>/rest/volume/link/create?<parameters>

Parameters

Parameter Description

path The path parameter specifies the link path and other information, using the following syntax:

/link/[maprfs::][volume::]<volume type>::<volume name>

link - the link path


maprfs - a keyword to indicate a special MapR-FS link
volume - a keyword to indicate a link to a volume
volume type - writeable or mirror
volume name - the name of the volume
Example:
/abc/maprfs::mirror::abc

type The volume type: writeable or mirror.

volume The volume name.

Examples

Create a link to v1 at the path /v1.mirror:

CLI
maprcli volume link create -volume v1 -type mirror -path /v1.mirror
REST
https://r1n1.sj.us:8443/rest/volume/link/create?path=/v1.mirror&type=mirror&volume=v1

volume link remove

Removes the specified symbolic link. Permissions required: fc or a

Syntax

CLI
maprcli volume link remove
-path <path>

REST
http[s]://<host>:<port>/rest/volume/link/remove?<parameters>

Parameters

Parameter Description

path The symbolic link to remove. The path parameter specifies the link path and other information about the symbolic link, using the
following syntax:
/link/[maprfs::][volume::]<volume type>::<volume name>

link - the symbolic link path


maprfs - a keyword to indicate a special MapR-FS link
volume - a keyword to indicate a link to a volume
volume type - writeable or mirror
volume name - the name of the volume
Example:
/abc/maprfs::mirror::abc

Examples

Remove the link /abc:

CLI
maprcli volume link remove -path /abc/maprfs::mirror::abc

REST
https://r1n1.sj.us:8443/rest/volume/link/remove?path=/abc/maprfs::mirror::abc

volume list

Lists information about volumes specified by name, path, or filter.

Syntax
CLI
maprcli volume list
[ -alarmedvolumes 1 ]
[ -cluster <cluster> ]
[ -columns <columns> ]
[ -filter <filter> ]
[ -limit <limit> ]
[ -nodes <nodes> ]
[ -output terse | verbose ]
[ -start <offset> ]

REST
http[s]://<host>:<port>/rest/volume/list?<parameters>

Parameters

Parameter Description

alarmedvolumes Specifies whether to list alarmed volumes only.

cluster The cluster on which to run the command.

columns A comma-separated list of fields to return in the query. See the Fields table below.

filter A filter specifying volumes to list. See Filters for more information.

limit The number of rows to return, beginning at start. Default: 0

nodes A list of nodes. If specified, volume list only lists volumes on the specified nodes.

output Specifies whether the output should be terse or verbose.

start The offset from the starting row according to sort. Default: 0

Field Description

volumeid Unique volume ID.

volumetype Volume type:

0 = normal volume
1 = mirror volume

volumename Unique volume name.

mountdir Unique volume path (may be null if the volume is unmounted).

mounted Volume mount status:

0 = unmounted
1 = mounted

rackpath Rack path.

creator Username of the volume creator.

aename Accountable entity name.


aetype Accountable entity type:

0=user
1=group

uacl Users ACL (comma-separated list of user names).

gacl Group ACL (comma-separated list of group names).

quota Quota, in MB; 0 = no quota.

advisoryquota Advisory quota, in MB; 0 = no advisory quota.

used Disk space used, in MB, not including snapshots.

snapshotused Disk space used for all snapshots, in MB.

totalused Total space used for volume and snapshots, in MB.

readonly Read-only status:

0 = read/write
1 = read-only

numreplicas Desired replication factor (number of replications).

minreplicas Minimum replication factor (number of replications)

actualreplication The actual current replication factor by percentage of the volume, as a zero-based array of integers from 0 to
100. For each position in the array, the value is the percentage of the volume that is replicated index number
of times. Example: arf=[5,10,85] means that 5% is not replicated, 10% is replicated once, 85% is
replicated twice.

schedulename The name of the schedule associated with the volume.

scheduleid The ID of the schedule associated with the volume.

mirrorSrcVolumeId Source volume ID (mirror volumes only).

mirrorSrcVolume Source volume name (mirror volumes only).

mirrorSrcCluster The cluster where the source volume resides (mirror volumes only).

lastSuccessfulMirrorTime Last successful Mirror Time, milliseconds since 1970 (mirror volumes only).

mirrorstatus Mirror Status (mirror volumes only):

0 = success
1 = running
2 = error

mirror-percent-complete Percent completion of last/current mirror (mirror volumes only).

snapshotcount Snapshot count.

SnapshotFailureAlarm Status of SNAPSHOT_FAILURE alarm:

0 = Clear
1 = Raised

AdvisoryQuotaExceededAlarm Status of VOLUME_ALARM_ADVISORY_QUOTA_EXCEEDED alarm:

0 = Clear
1 = Raised

QuotaExceededAlarm Status of VOLUME_QUOTA_EXCEEDED alarm:

0 = Clear
1 = Raised
MirrorFailureAlarm Status of MIRROR_FAILURE alarm:

0 = Clear
1 = Raised

DataUnderReplicatedAlarm Status of DATA_UNDER_REPLICATED alarm:

0 = Clear
1 = Raised

DataUnavailableAlarm Status of DATA_UNAVAILABLE alarm:

0 = Clear
1 = Raised

Output

Information about the specified volumes.

mirrorstatus QuotaExceededAlarm numreplicas schedulename DataUnavailableAlarm volumeid rackpath


volumename used volumetype SnapshotFailureAlarm mirrorDataSrcVolumeId
advisoryquota aetype creator snapshotcount quota mountdir
scheduleid snapshotused MirrorFailureAlarm AdvisoryQuotaExceededAlarm minreplicas
mirrorDataSrcCluster actualreplication aename mirrorSrcVolumeId mirrorId mirrorSrcCluster
lastSuccessfulMirrorTime nextMirrorId mirrorDataSrcVolume mirrorSrcVolume mounted logicalUsed
readonly totalused DataUnderReplicatedAlarm mirror-percent-complete
0 0 3 every15min 0 362 /
ATS-Run-2011-01-31-160018 864299 0 0 0
0 0 root 3 0 /ATS-Run-2011-01-31-160018 4
1816201 0 0 1 ...
root 0 0 0 0
1 2110620 0 2680500 0 0
0 0 3 0 12 /
mapr.cluster.internal 0 0 0 0
0 0 root 0 0 / var/mapr/cluster 0
0 0 0 1 ...
root 0 0 0 0
1 0 0 0 0 0
0 0 3 0 11 /
mapr.cluster.root 1 0 0 0
0 0 root 0 0 / 0
0 0 0 1 ...
root 0 0 0 0
1 1 0 1 0 0
0 0 10 0 21 /
mapr.jobtracker.volume 1 0 0 0
0 0 root 0 0 / var/mapr/cluster/mapred/jobTracker 0
0 0 0 1 ...
root 0 0 0 0
1 1 0 1 0 0
0 0 3 0 1 /
mapr.kvstore.table 0 0 0 0
0 0 root 0 0 0
0 0 0 1 ...
root 0 0 0 0
0 0 0 0 0 0

Output Fields
See the Fields table above.
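
Examples

List only the name, mount path, and space used for each volume (the columns values come from the Fields table above):

CLI
maprcli volume list -columns volumename,mountdir,used

REST
https://r1n1.sj.us:8443/rest/volume/list?columns=volumename,mountdir,used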

volume mirror push

Pushes the changes in a volume to all of its mirror volumes in the same cluster, and waits for each mirroring operation to complete. Use this
command when you need to push recent changes.

Syntax

CLI
maprcli volume mirror push
[ -cluster <cluster> ]
-name <volume name>
[ -verbose true|false ]

REST None.

Parameters

Parameter Description

cluster The cluster on which to run the command.

name The volume to push.

verbose Specifies whether the command output should be verbose. Default: true

Output

Sample Output

Starting mirroring of volume mirror1


Mirroring complete for volume mirror1
Successfully completed mirror push to all local mirrors of volume volume1

Examples

Push changes from the volume "volume1" to its local mirror volumes:

CLI
maprcli volume mirror push -name volume1 -cluster
mycluster

volume mirror start

Starts mirroring on the specified volume from its source volume. License required: M5 Permissions required: fc or a

When a mirror is started, the mirror volume is synchronized from a hidden internal snapshot so that the mirroring process is not affected by any
concurrent changes to the source volume. The volume mirror start command does not wait for mirror completion, but returns immediately.
The changes to the mirror volume occur atomically at the end of the mirroring process; deltas transmitted from the source volume do not appear
until mirroring is complete.

To provide rollback capability for the mirror volume, the mirroring process creates a snapshot of the mirror volume before starting the mirror, with
the following naming format: <volume>.mirrorsnap.<date>.<time>.

Normally, the mirroring operation transfers only deltas from the last successful mirror. Under certain conditions (mirroring a volume repaired by
fsck, for example), the source and mirror volumes can become out of sync. In such cases, it is impossible to transfer deltas, because the state is
not the same for both volumes. Use the -full option to force the mirroring operation to transfer all data to bring the volumes back in sync.

Syntax

CLI
maprcli volume mirror start
[ -cluster <cluster> ]
[ -full true|false ]
-name <volume name>

REST
http[s]://<host>:<port>/rest/volume/mirror/start?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

full Specifies whether to perform a full copy of all data. If false, only the deltas are copied.

name The volume for which to start the mirror.

Output

Sample Output

messages
Started mirror operation for volumes 'test-mirror'

Examples

Start mirroring the mirror volume "test-mirror":

CLI
maprcli volume mirror start -name
test-mirror

volume mirror stop

Stops mirroring on the specified volume. License required: M5 Permissions required: fc or a

The volume mirror stop command lets you stop mirroring (for example, during a network outage). You can use the volume mirror start
command to resume mirroring.

Syntax

CLI
maprcli volume mirror stop
[ -cluster <cluster> ]
-name <volume name>
REST
http[s]://<host>:<port>/rest/volume/mirror/stop?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

name The volume for which to stop the mirror.

Output

Sample Output

messages
Stopped mirror operation for volumes 'test-mirror'

Examples

Stop mirroring the mirror volume "test-mirror":

CLI
maprcli volume mirror stop -name
test-mirror

volume modify

Modifies an existing volume. Permissions required: m, fc, or a

An error occurs if the name or path refers to a non-existent volume, or cannot be resolved.

Syntax
CLI
maprcli volume modify
[ -advisoryquota <advisory quota> ]
[ -ae <accounting entity> ]
[ -aetype <aetype> ]
[ -canBackup <list of users and groups> ]
[ -canDeleteAcl <list of users and groups> ]
[ -canDeleteVolume <list of users and groups> ]
[ -canEditconfig <list of users and groups> ]
[ -canMirror <list of users and groups> ]
[ -canMount <list of users and groups> ]
[ -canSnapshot <list of users and groups> ]
[ -canViewConfig <list of users and groups> ]
[ -cluster <cluster> ]
[ -groupperm <list of group:allowMask> ]
[ -minreplication <minimum replication> ]
-name <volume name>
[ -quota <quota> ]
[ -readonly <readonly> ]
[ -replication <replication> ]
[ -schedule <schedule ID> ]
[ -source <source> ]
[ -topology <topology> ]
[ -userperm <list of user:allowMask> ]

REST http[s]://<host>:<port>/rest/volume/modify?<parameters>

Parameters

Parameter Description

advisoryquota The advisory quota for the volume as integer plus unit. Example: quota=500G; Units: B, K, M, G, T, P

ae The accounting entity that owns the volume.

aetype The type of accounting entity:

0=user
1=group

canBackup The list of users and groups who can back up the volume.

canDeleteAcl The list of users and groups who can delete the volume access control list (ACL).

canDeleteVolume The list of users and groups who can delete the volume.

canEditconfig The list of users and groups who can edit the volume properties.

canMirror The list of users and groups who can mirror the volume.

canMount The list of users and groups who can mount the volume.

canSnapshot The list of users and groups who can create a snapshot of the volume.

canViewConfig The list of users and groups who can view the volume properties.

cluster The cluster on which to run the command.

groupperm A list of permissions in the format group:allowMask

minreplication The minimum replication level. Default: 0

name The name of the volume to modify.

quota The quota for the volume as integer plus unit. Example: quota=500G; Units: B, K, M, G, T, P
readonly Specifies whether the volume is read-only.

0 = read/write
1 = read-only

replication The desired replication level. Default: 0

schedule A schedule ID. If a schedule ID is provided, then the volume will automatically create snapshots (normal volume) or sync
with its source volume (mirror volume) on the specified schedule.

source (Mirror volumes only) The source volume from which a mirror volume receives updates, specified in the format
<volume>@<cluster>.

topology The rack path to the volume.

userperm List of permissions in the format user:allowMask.

Examples

Change the source volume of the mirror "test-mirror":

CLI
maprcli volume modify -name test-mirror -source volume-2@my-cluster

REST
https://r1n1.sj.us:8443/rest/volume/modify?name=test-mirror&source=volume-2@my-cluster

volume mount

Mounts one or more specified volumes. Permissions required: mnt, fc, or a

Syntax

CLI
maprcli volume mount
[ -cluster <cluster> ]
-name <volume list>
[ -path <path list> ]

REST
http[s]://<host>:<port>/rest/volume/mount?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

name The name of the volume to mount.

path The path at which to mount the volume.


Examples

Mount the volume "test-volume" at the path "/test":

CLI
maprcli volume mount -name test-volume -path /test

REST
https://r1n1.sj.us:8443/rest/volume/mount?name=test-volume&path=/test

volume move

Moves the specified volume or mirror to a different topology. Permissions required: m, fc, or a

Syntax

CLI
maprcli volume move
[ -cluster <cluster> ]
-name <volume name>
-topology <path>

REST
http[s]://<host>:<port>/rest/volume/move?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

name The volume name.

topology The new rack path to the volume.
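
Examples

Move the volume "test-volume" to the topology /data/rack2 (the topology path is illustrative):

CLI
maprcli volume move -name test-volume -topology /data/rack2

REST
https://r1n1.sj.us:8443/rest/volume/move?name=test-volume&topology=/data/rack2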

volume remove

Removes the specified volume or mirror. Permissions required: d, fc, or a

Syntax

CLI
maprcli volume remove
[ -cluster <cluster> ]
[ -force ]
-name <volume name>
REST
http[s]://<host>:<port>/rest/volume/remove?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

force Forces the removal of the volume, even if it would otherwise be prevented.

name The volume name.
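
Examples

Remove the volume "test-volume":

CLI
maprcli volume remove -name test-volume

REST
https://r1n1.sj.us:8443/rest/volume/remove?name=test-volume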

volume rename

Renames the specified volume or mirror. Permissions required: m, fc, or a

Syntax

CLI
maprcli volume rename
[ -cluster <cluster> ]
-name <volume name>
-newname <new volume name>

REST
http[s]://<host>:<port>/rest/volume/rename?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

name The volume name.

newname The new volume name.
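
Examples

Rename the volume "test-volume" to "test-volume-old" (the names are illustrative):

CLI
maprcli volume rename -name test-volume -newname test-volume-old

REST
https://r1n1.sj.us:8443/rest/volume/rename?name=test-volume&newname=test-volume-old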

volume snapshot create

Creates a snapshot of the specified volume, using the specified snapshot name. License required: M5 Permissions required: snap, fc, or a

Syntax
CLI
maprcli volume snapshot create
[ -cluster <cluster> ]
-snapshotname <snapshot>
-volume <volume>

REST
http[s]://<host>:<port>/rest/volume/snapshot/create?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

snapshotname The name of the snapshot to create.

volume The volume for which to create a snapshot.

Examples

Create a snapshot called "test-snapshot" for volume "test-volume":

CLI
maprcli volume snapshot create -snapshotname test-snapshot -volume test-volume

REST
https://r1n1.sj.us:8443/rest/volume/snapshot/create?volume=test-volume&snapshotname=test-snapshot

volume snapshot list

Displays info about a set of snapshots. You can specify the snapshots by volumes or paths, or by specifying a filter to select volumes with certain
characteristics.

Syntax

CLI
maprcli volume snapshot list
[ -cluster <cluster> ]
[ -columns <fields> ]
( -filter <filter> | -path <volume path list> | -volume <volume list> )
[ -limit <rows> ]
[ -output terse|verbose ]
[ -start <offset> ]

REST
http[s]://<host>:<port>/rest/volume/snapshot/list?<parameters>

Parameters

Specify exactly one of the following parameters: volume, path, or filter.


Parameter Description

cluster The cluster on which to run the command.

columns A comma-separated list of fields to return in the query. See the Fields table below. Default: none

filter A filter specifying snapshots to list. See Filters for more information.

limit The number of rows to return, beginning at start. Default: 0

output Specifies whether the output should be terse or verbose. Default: verbose

path A comma-separated list of volume paths for which to list snapshots.

start The offset from the starting row according to sort. Default: 0

volume A comma-separated list of volumes for which to list snapshots.

Fields

The following table lists the fields used in the sort and columns parameters, and returned as output.

Field Description

snapshotid Unique snapshot ID.

snapshotname Snapshot name.

volumeid ID of the volume associated with the snapshot.

volumename Name of the volume associated with the snapshot.

volumepath Path to the volume associated with the snapshot.

ownername Owner (user or group) associated with the volume.

ownertype Owner type for the owner of the volume:

0=user
1=group

dsu Disk space used for the snapshot, in MB.

creationtime Snapshot creation time, milliseconds since 1970

expirytime Snapshot expiration time, milliseconds since 1970; 0 = never expires.

Output

The specified columns about the specified snapshots.

Sample Output
creationtime ownername snapshotid snapshotname expirytime
diskspaceused volumeid volumename ownertype volumepath
1296788400768 dummy 363 ATS-Run-2011-01-31-160018.2011-02-03.19-00-00 1296792000001
1063191 362 ATS-Run-2011-01-31-160018 1 /dummy
1296789308786 dummy 364 ATS-Run-2011-01-31-160018.2011-02-03.19-15-02 1296792902057
753010 362 ATS-Run-2011-01-31-160018 1 /dummy
1296790200677 dummy 365 ATS-Run-2011-01-31-160018.2011-02-03.19-30-00 1296793800001 0
362 ATS-Run-2011-01-31-160018 1 /dummy
dummy 1 14 test-volume-2 /dummy 102
test-volume-2.2010-11-07.10:00:00 0 1289152800001 1289239200001

Output Fields

See the Fields table above.

Examples

List all snapshots:

CLI
maprcli volume snapshot list

REST
https://r1n1.sj.us:8443/rest/volume/snapshot/list

volume snapshot preserve

Preserves one or more snapshots from expiration. Specify the snapshots by volumes, paths, filter, or IDs. License required: M5 Permissions
required: snap, fc, or a

Syntax

CLI
maprcli volume snapshot preserve
[ -cluster <cluster> ]
( -filter <filter> | -path <volume path list> | -snapshots <snapshot list> | -volume <volume
list> )

REST
http[s]://<host>:<port>/rest/volume/snapshot/preserve?<parameters>

Parameters

Specify exactly one of the following parameters: volume, path, filter, or snapshots.

Parameter Description

cluster The cluster on which to run the command.

filter A filter specifying snapshots to preserve. See Filters for more information.
path A comma-separated list of paths for which to preserve snapshots.

snapshots A comma-separated list of snapshot IDs to preserve.

volume A comma-separated list of volumes for which to preserve snapshots.

Examples

Preserve two snapshots by ID:

First, use volume snapshot list to get the IDs of the snapshots you wish to preserve. Example:

# maprcli volume snapshot list


creationtime ownername snapshotid snapshotname expirytime
diskspaceused volumeid volumename ownertype volumepath
1296788400768 dummy 363 ATS-Run-2011-01-31-160018.2011-02-03.19-00-00 1296792000001
1063191 362 ATS-Run-2011-01-31-160018 1 /dummy
1296789308786 dummy 364 ATS-Run-2011-01-31-160018.2011-02-03.19-15-02 1296792902057
753010 362 ATS-Run-2011-01-31-160018 1 /dummy
1296790200677 dummy 365 ATS-Run-2011-01-31-160018.2011-02-03.19-30-00 1296793800001 0
362 ATS-Run-2011-01-31-160018 1 /dummy
dummy 1 14 test-volume-2 /dummy 102
test-volume-2.2010-11-07.10:00:00 0 1289152800001 1289239200001

Use the IDs in the volume snapshot preserve command. For example, to preserve the first two snapshots in the above list, run the
commands as follows:

CLI
maprcli volume snapshot preserve -snapshots 363,364

REST
https://r1n1.sj.us:8443/rest/volume/snapshot/preserve?snapshots=363,364
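
Preserve all snapshots of a volume by name. This is a sketch; the volume name test-volume is a placeholder:

CLI
maprcli volume snapshot preserve -volume test-volume

REST
https://r1n1.sj.us:8443/rest/volume/snapshot/preserve?volume=test-volume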

volume snapshot remove

Removes one or more snapshots. License required: M5. Permissions required: snap, fc, or a

Syntax

CLI
maprcli volume snapshot remove
[ -cluster <cluster> ]
( -snapshotname <snapshot name> | -snapshots <snapshots> | -volume <volume name>
)

REST
http[s]://<host>:<port>/rest/volume/snapshot/remove?<parameters>

Parameters

Specify exactly one of the following parameters: snapshotname, snapshots, or volume.

Parameter Description

cluster The cluster on which to run the command.


snapshotname The name of the snapshot to remove.

snapshots A comma-separated list of IDs of snapshots to remove.

volume The name of the volume from which to remove the snapshot.

Examples

Remove the snapshot "test-snapshot":

CLI
maprcli volume snapshot remove -snapshotname test-snapshot

REST
https://10.250.1.79:8443/rest/volume/snapshot/remove?snapshotname=test-snapshot

Remove two snapshots by ID:

First, use volume snapshot list to get the IDs of the snapshots you wish to remove. Example:

# maprcli volume snapshot list


creationtime ownername snapshotid snapshotname expirytime
diskspaceused volumeid volumename ownertype volumepath
1296788400768 dummy 363 ATS-Run-2011-01-31-160018.2011-02-03.19-00-00 1296792000001
1063191 362 ATS-Run-2011-01-31-160018 1 /dummy
1296789308786 dummy 364 ATS-Run-2011-01-31-160018.2011-02-03.19-15-02 1296792902057
753010 362 ATS-Run-2011-01-31-160018 1 /dummy
1296790200677 dummy 365 ATS-Run-2011-01-31-160018.2011-02-03.19-30-00 1296793800001 0
362 ATS-Run-2011-01-31-160018 1 /dummy
dummy 1 14 test-volume-2 /dummy 102
test-volume-2.2010-11-07.10:00:00 0 1289152800001 1289239200001

Use the IDs in the volume snapshot remove command. For example, to remove the first two snapshots in the above list, run the commands
as follows:

CLI
maprcli volume snapshot remove -snapshots 363,364

REST
https://r1n1.sj.us:8443/rest/volume/snapshot/remove?snapshots=363,364

volume unmount

Unmounts one or more mounted volumes. Permissions required: mnt, fc, or a

Syntax

CLI
maprcli volume unmount
[ -cluster <cluster> ]
[ -force 1 ]
-name <volume name>

REST
http[s]://<host>:<port>/rest/volume/unmount?<parameters>
Parameters

Parameter Description

cluster The cluster on which to run the command.

force Specifies whether to force the volume to unmount.

name The name of the volume to unmount.

Examples

Unmount the volume "test-volume":

CLI
maprcli volume unmount -name test-volume

REST
https://r1n1.sj.us:8443/rest/volume/unmount?name=test-volume
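
Force-unmount the volume "test-volume" (a sketch using the -force parameter documented above):

CLI
maprcli volume unmount -name test-volume -force 1

REST
https://r1n1.sj.us:8443/rest/volume/unmount?name=test-volume&force=1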

Scripts and Commands


This section contains information about the following scripts and commands:

configure.sh - configures a node or client to work with the cluster
disksetup - sets up disks for use by MapR storage
fsck - checks and repairs inconsistencies in MapR-FS storage pools
Hadoop MFS - enhanced hadoop fs command
mapr-support-collect.sh - collects information for use by MapR Support
rollingupgrade.sh - upgrades software on a MapR cluster

configure.sh

Sets up a MapR cluster or client, creates /opt/mapr/conf/mapr-clusters.conf, and updates the corresponding *.conf and *.xml files.

The normal use of configure.sh is to set up a MapR cluster, or to set up a MapR client for communication with one or more clusters.

To set up a cluster, run configure.sh on all nodes specifying the cluster's CLDB and ZooKeeper nodes, and a cluster name if desired.
If setting up a cluster on virtual machines, use the --isvm parameter.
To set up a client, run configure.sh on the client machine, specifying the CLDB and ZooKeeper nodes of the cluster or clusters. On a
client, use both the -c and -C parameters.
To change services (other than the CLDB and ZooKeeper) running on a node, run configure.sh with the -R option. If you change the
location or number of CLDB or ZooKeeper services in a cluster, run configure.sh and specify the new lineup of CLDB and ZooKeeper
nodes.
To rename a cluster, run configure.sh on all nodes with the -N option (and the -R option, if you have previously specified the CLDB
and ZooKeeper nodes).

On a Windows client, the script configure.sh is named configure.bat but otherwise works in a similar way.

Syntax
/opt/mapr/server/configure.sh
-C <host>[:<port>][,<host>[:<port>]...]
-Z <host>[:<port>][,<host>[:<port>]...]
[ -c ]
[ --isvm ]
[ -J <CLDB JMX port> ]
[ -L <log file> ]
[ -N <cluster name> ]
[ -R ]

Parameters

Parameter Description

-C A list of the CLDB nodes in the cluster.

-Z A list of the ZooKeeper nodes in the cluster. The -Z option is required unless -c (lowercase) is specified.

--isvm Specifies virtual machine setup. Required when configure.sh is run on a virtual machine.

-c Specifies client setup. See Setting Up the Client.

-J Specifies the JMX port for the CLDB. Default: 7220

-L Specifies a log file. If not specified, configure.sh logs errors to /opt/mapr/logs/configure.log .

-N Specifies the cluster name, to prevent ambiguity in multiple-cluster environments.

-R After initial node configuration, specifies that configure.sh should use the previously configured ZooKeeper and CLDB nodes.
When -R is specified, the CLDB credentials are read from
mapr-clusters.conf and the ZooKeeper credentials are read from warden.conf. The -R option is useful for making
changes to the services configured on a node without changing the CLDB and ZooKeeper nodes, or for renaming a cluster. The
-C and -Z parameters are not required when -R is specified.

If the CLDB and ZooKeeper information cannot be read from those files, configure.sh reports an error.

Examples

Add a node (not CLDB or ZooKeeper) to a cluster that is running the CLDB and ZooKeeper on three nodes:

On the new node, run the following command:


/opt/mapr/server/configure.sh -C 10.10.100.1,10.10.100.2,10.10.100.3 -Z 10.10.100.1,10.10.100.2,10.10.100.3

Configure a client to work with MyCluster, which has one CLDB at 10.10.100.1:

On a Linux client, run the following command:


/opt/mapr/server/configure.sh -N MyCluster -c -C 10.10.100.1:7222

On a Windows 7 client, run the following command:


/opt/mapr/server/configure.bat -N MyCluster -c -C 10.10.100.1:7222

Rename the cluster to Cluster1 without changing the specified CLDB and ZooKeeper nodes:

On all nodes, run the following command:


/opt/mapr/server/configure.sh -N Cluster1 -R
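
Change the services running on a node without changing the CLDB and ZooKeeper nodes (a sketch; run this after adding or removing service packages on the node):

On the node, run the following command:

/opt/mapr/server/configure.sh -R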

disksetup
Formats specified disks for use by MapR storage.

For information about when and how to use disksetup, see Setting Up Disks for MapR.

To specify disks:

Create a text file /tmp/disks.txt listing disks and partitions for use by MapR. Each line lists a single disk, or partitions on a single
disk. Example:

/dev/sdb
/dev/sdc1 /dev/sdc2 /dev/sdc4
/dev/sdd

Later, when you run disksetup to format the disks, specify the disks and partitions file. Example:

disksetup -F /tmp/disks.txt

To test without formatting physical disks:

If you do not have physical partitions or disks available for reformatting, you can test MapR by creating a flat file and including a path to the file in
the disk list file. Create a file of at least 4 GB.

The following example creates a 20 GB flat file (bs=1G specifies a block size of 1 gigabyte; count=20 writes 20 blocks):

$ dd if=/dev/zero of=/root/storagefile bs=1G count=20

Using the above example, you would add the following to /tmp/disks.txt:

/root/storagefile
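
For example, the following sketch appends the flat file to the disk list and then formats it (assuming the file and disk list shown above):

echo /root/storagefile >> /tmp/disks.txt
/opt/mapr/server/disksetup -F /tmp/disks.txt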

Syntax

/opt/mapr/server/disksetup
<disk list file>
[-F]
[-G]
[-M]
[-W <stripe_width>]

Parameters

Parameter Description

-F Forces formatting of all specified disks. If not specified, disksetup does not re-format disks that have already been formatted for
MapR.

-G Generates disktab contents from input disk list, but does not format disks. This option is useful if disk names change after a
reboot, or if the disktab file is damaged.

-M Uses the maximum available number of disks per storage pool.

-W Specifies the number of disks per storage pool.

Examples
Set up disks specified in the file /tmp/disks.txt:

/opt/mapr/server/disksetup -F /tmp/disks.txt

fsck
Filesystem check (fsck) is used to find and fix inconsistencies in the filesystem.

Fsck of MapR-FS can be run in two ways:

Local Fsck (fsck): To check/fix inconsistencies in a storage pool


Global Fsck (gfsck): To check/fix inconsistencies in the distributed filesystem

Every storage pool has its own log that journals updates to that storage pool. All operations on a storage pool are performed transactionally:
they are journaled to the log before they are applied to the storage pool metadata. If a storage pool is not shut down cleanly, its metadata can
become inconsistent. To make the metadata consistent again, the log is replayed the next time the storage pool is loaded.

Local fsck

The local fsck utility checks and fixes inconsistencies in the local filesystem on a storage pool. Before performing any check or repair, it replays
the log to recover all data. It then walks the storage pool in question, verifies all MFS metadata (and data correctness, if specified on the
command line), and reports all potentially lost or corrupt containers, directories, tables, files, filelets, and blocks in the storage pool.

Local fsck attempts to visit every allocated block in the storage pool by walking all the B-trees. It consolidates all blocks on the storage pool
and recovers any blocks that are part of a corrupted or unconnected metadata chain.

Local fsck is typically used on an offline storage pool after a node failure, disk failure, or mfs process crash, or simply to verify the consistency
of data when a software bug is suspected.

Local fsck runs in two modes:

Verification mode - fsck only reports problems and does not attempt to fix or modify any data on disk. Verification mode can be run on an
offline storage pool at any time, and if it reports no errors the storage pool can be brought online without any risk of data loss.
Repair mode - fsck attempts to restore a bad storage pool. If local fsck is run in repair mode on a storage pool, some volumes might need a
global fsck (gfsck) after the storage pool is brought online. There is potential for loss of data in this case.

The local fsck can be explicitly run by the administrator.

Syntax

/opt/mapr/server/fsck
[ <device-paths> | -n <sp name> ]
[ -l <log filename> ]
[ -p <MFS port> ]
[ -h ]
[ -j ]
[ -m <memory in MB> ]
[ -d ]
[ -r ]

Parameters

Parameter Description

-d Performs a CRC check on data blocks. By default, fsck does not validate the CRC of user data pages. Enabling this check can take
quite a long time to finish.

<device-paths> Paths to the disks that make up the storage pool.

-n Storage pool name. This option works only if all the disks are in disktab. Otherwise, individually specify all the disks that make up
the storage pool, using the <device-paths> parameter.
-l The log filename; default /tmp/fsck.<pid>.out

-p The MFS port; default 5660

-h Help

-j Skip log replay. Should be set only when log recovery fails. Log recovery can fail if the damaged blocks of a disk belong to the
log, or if log recovery finds CRC errors in the metadata blocks. Using this parameter typically leads to greater data loss.

-m Cache size for blocks (MB)


-r Run in repair mode. USE WITH CAUTION AS IT CAN LEAD TO LOSS OF DATA.

Examples

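The following is a minimal sketch, assuming a storage pool named sp1 that is listed in disktab (the name is a placeholder).

Verify the storage pool without modifying any data (verification mode, the default):

/opt/mapr/server/fsck -n sp1

Repair the storage pool (use with caution; see the -r parameter above):

/opt/mapr/server/fsck -n sp1 -r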

Hadoop MFS
The hadoop mfs command performs operations on directories in the cluster. The main purposes of hadoop mfs are to display directory
information and contents, to create symbolic links, and to set compression and chunk size on a directory.

hadoop mfs
[ -ln <target> <symlink> ]
[ -ls <path> ]
[ -lsd <path> ]
[ -lsr <path> ]
[ -lss <path> ]
[ -setcompression on|off <dir> ]
[ -setchunksize <size> ]
[ -help <command> ]

Options

The normal command syntax is to specify a single option from the following table, along with its corresponding arguments. If compression and
chunk size are not set explicitly for a given directory, the values are inherited from the parent directory.

Option Description

-ln Creates a symbolic link <symlink> that points to the target path <target>, similar to the standard Linux ln -s command.

-ls Lists files in the directory specified by <path>. The hadoop mfs -ls command corresponds to the standard hadoop fs
-ls command, but provides the following additional information:

Blocks used for each file


Server where each block resides

-lsd Lists files in the directory specified by <path>, and also provides information about the specified directory itself:

Whether compression is enabled for the directory (indicated by Z)


The configured chunk size (in bytes) for the directory.

-lsr Lists files in the directory and subdirectories specified by <path>, recursively. The hadoop mfs -lsr command
corresponds to the standard hadoop fs -lsr command, but provides the following additional information:

Blocks used for each file


Server where each block resides

-lss <path> Lists files in the directory specified by <path>, with an additional column that displays the number of disk blocks per file.
Disk blocks are 8192 bytes.

-setcompression Turns compression on or off on the specified directory.


-setchunksize Sets the chunk size in bytes for the specified directory. The <size> parameter must be a multiple of 65536.

-help Displays help for the hadoop mfs command.
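
The following sketches show typical invocations; the directory /user/joe/data and the link /user/joe/datalink are placeholders. Turn compression on and set a 256 MB chunk size (268435456 bytes, a multiple of 65536) on a directory, then create a symbolic link to it:

hadoop mfs -setcompression on /user/joe/data
hadoop mfs -setchunksize 268435456 /user/joe/data
hadoop mfs -ln /user/joe/data /user/joe/datalink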

Output

When used with -ls, -lsd, -lsr, or -lss, hadoop mfs displays information about files and directories. For each file or directory hadoop mfs
displays a line of basic information followed by lines listing the chunks that make up the file, in the following format:

{mode} {compression} {replication} {owner} {group} {size} {date} {chunk size} {name}
{chunk} {fid} {host} [{host}...]
{chunk} {fid} {host} [{host}...]
...

Volume links are displayed as follows:

{mode} {compression} {replication} {owner} {group} {size} {date} {chunk size} {name}
{chunk} {target volume name} {writability} {fid} -> {fid} [{host}...]

For volume links, the first fid is the chunk that stores the volume link itself; the fid after the arrow (->) is the first chunk in the target volume.

The following table describes the values:

mode A text string indicating the read, write, and execute permissions for the owner, group, and other permissions. See also
Managing Permissions.

compression
U - directory is not compressed
Z - directory is compressed

replication The replication factor of the file (directories display a dash instead)

owner The owner of the file or directory

group The group of the file or directory

size The size of the file or directory

date The date the file or directory was last modified

chunk size The chunk size of the file or directory

name The name of the file or directory

chunk The chunk number. The first chunk is a primary chunk labeled "p", a 64K chunk containing the root of the file. Subsequent
chunks are numbered in order.

fid The chunk's file ID, which consists of three parts:

The ID of the container where the file is stored


The inode of the file within the container
An internal version number

host The host on which the chunk resides. When several hosts are listed, the first host is the first copy of the chunk and
subsequent hosts are replicas.

target volume name The name of the volume pointed to by a volume link.

writability Displays whether the volume is writable.

mapr-support-collect.sh

Collects information about a cluster's recent activity, to help MapR Support diagnose problems.

The "mini-dump" option limits the size of the support output. When the -m or --mini-dump option is specified along with a size,
support-dump.sh collects only a head and tail, each limited to the specified size, from any log file that is larger than twice the specified size.
The total size of the output is therefore limited to approximately 2 * size * number of logs. The size can be specified in bytes, or using the following
suffixes:
b - bytes
k - kilobytes (1024 bytes)
m - megabytes (1024 kilobytes)
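
For example, the following sketch collects a support dump named mysupport-output while truncating each large log to a 5-megabyte head and tail (the output name is a placeholder):

/opt/mapr/support/tools/mapr-support-collect.sh -n mysupport-output -m 5m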

Syntax

/opt/mapr/support/tools/mapr-support-collect.sh
[ -h|--hosts <host file> ]
[ -H|--host <host entry> ]
[ -Q|--no-cldb ]
[ -n|--name <name> ]
[ -d|--output-dir <path> ]
[ -l|--no-logs ]
[ -s|--no-statistics ]
[ -c|--no-conf ]
[ -i|--no-sysinfo ]
[ -x|--exclude-cluster ]
[ -u|--user <user> ]
[ -m|--mini-dump <size> ]
[ -O|--online ]
[ -p|--par <par> ]
[ -t|--dump-timeout <dump timeout> ]
[ -T|--scp-timeout <SCP timeout> ]
[ -C|--cluster-timeout <cluster timeout> ]
[ -y|--yes ]
[ -S|--scp-port <SCP port> ]
[ --collect-cores ]
[ --move-cores ]
[ --port <port> ]
[ -?|--help ]

Parameters

Parameter Description

-h or --hosts A file containing a list of hosts. Each line contains one host entry, in the format [user@]host[:port]

-H or --host One or more hosts in the format [user@]host[:port]

-Q or --no-cldb If specified, the command does not query the CLDB for list of nodes

-n or --name Specifies the name of the output file. If not specified, the default is a date-named file in the format
YYYY-MM-DD-hh-mm-ss.tar

-d or --output-dir The absolute path to the output directory. If not specified, the default is /opt/mapr/support/collect/

-l or --no-logs If specified, the command output does not include log files

-s or --no-statistics If specified, the command output does not include statistics

-c or --no-conf If specified, the command output does not include configurations

-i or --no-sysinfo If specified, the command output does not include system information

-x or --exclude-cluster If specified, the command does not collect cluster diagnostics

-u or --user The username for ssh connections

-m or --mini-dump For any log file greater than 2 * <size>, collects only a head and tail, each of the specified size. The <size> may have a
<size> suffix specifying units:

b - blocks (512 bytes)
k - kilobytes (1024 bytes)
m - megabytes (1024 kilobytes)

-O or --online Specifies a space-separated list of nodes from which to gather support output.
-p or --par The maximum number of nodes from which support dumps will be gathered concurrently (default: 10)

-t or --dump-timeout The timeout for execution of the mapr-support-dump command on a node (default: 120 seconds; 0 = no limit)

-T or --scp-timeout The timeout for copying support dump output from a remote node to the local file system (default: 120 seconds;
0 = no limit)

-C or --cluster-timeout The timeout for collection of cluster diagnostics (default: 300 seconds; 0 = no limit)

-y or --yes If specified, the command does not require acknowledgement of the number of nodes that will be affected

-S or --scp-port The local port to which remote nodes will establish an SCP session

--collect-cores If specified, the command collects cores of running mfs processes from all nodes (off by default)

--move-cores If specified, the command moves mfs and nfs cores out of /opt/cores on all nodes (off by default)

--port The port number used by FileServer (default: 5660)

-? or --help Displays usage help text

Examples

Collect support information and dump it to the file /opt/mapr/support/collect/mysupport-output.tar:

/opt/mapr/support/tools/mapr-support-collect.sh -n mysupport-output

rollingupgrade.sh

Upgrades a MapR cluster to a specified version of the MapR software, or to a specific set of MapR packages.

By default, any node on which upgrade fails is rolled back to the previous version. To disable rollback, use the -n option. To force installation
regardless of the existing version on each node, use the -r option.
For more information about using rollingupgrade.sh, see Cluster Upgrade.

Syntax

/opt/upgrade-mapr/rollingupgrade.sh
[-c <cluster name>]
[-d]
[-h]
[-i <identity file>]
[-n]
[-p <directory>]
[-r]
[-s]
[-u <username>]
[-v <version>]
[-x]

Parameters

Parameter Description

-c Cluster name.

-d If specified, performs a dry run without upgrading the cluster.


-h Displays help text.

-i Specifies an identity file for SSH. See the SSH man page.

-n Specifies that the node should not be rolled back to the previous version if upgrade fails.

-p Specifies a directory containing the upgrade packages.

-r Specifies reinstallation of packages even on nodes that are already at the target version.

-s Specifies that the upgrade should use SSH to connect to the nodes.

-u A username for SSH.

-v The target upgrade version, using the format x.y.z to specify the major, minor, and revision numbers. Example: 1.2.0

-x Specifies that packages should be copied to nodes via SCP.
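
For example, the following sketch performs a dry run of an SSH-based upgrade to version 1.2.0 using packages in a local directory (the cluster name and package directory are placeholders):

/opt/upgrade-mapr/rollingupgrade.sh -c MyCluster -s -v 1.2.0 -p /root/mapr-packages -d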

Configuration Files
This guide contains reference information about the following configuration files:

core-site.xml - Specifies the default filesystem


disktab - Lists the disks in use by MapR-FS
hadoop-metrics.properties - Specifies where to output service metric reports
mapr-clusters.conf - Configuration file that specifies clusters
mapred-default.xml - Additional MapReduce settings
mapred-site.xml - Core MapReduce settings
mfs.conf - MapR-FS settings
taskcontroller.cfg - Configuration settings for the TaskController
warden.conf - Configuration settings for the warden

core-site.xml
The file /opt/mapr/hadoop/hadoop-<version>/conf/core-site.xml specifies the default filesystem.

core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<!--
Replace 'maprfs' by 'hdfs' to use HDFS.
Replace localhost by an ip address for namenode/cldb.
-->

<configuration>

<property>
<name>fs.default.name</name>
<value>maprfs:///</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>

</configuration>
disktab
On each node, the file /opt/mapr/conf/disktab lists all of the physical drives and partitions that have been added to MapR-FS. The
disktab file is created by disksetup and automatically updated when disks are added or removed (either using the MapR Control System, or
with the disk add and disk remove commands).

Sample disktab file

# MapR Disks Mon Nov 28 11:46:16 2011

/dev/sdb 47E4CCDA-3536-E767-CD18-0CB7E4D34E00
/dev/sdc 7B6A3E66-6AF0-AF60-AE39-01B8E4D34E00
/dev/sdd 27A59ED3-DFD4-C692-68F8-04B8E4D34E00
/dev/sde F0BB5FB1-F2AC-CC01-275B-08B8E4D34E00
/dev/sdf 678FCF40-926F-0D04-49AC-0BB8E4D34E00
/dev/sdg 46823852-E45B-A7ED-8417-02B9E4D34E00
/dev/sdh 60A99B96-4CEE-7C46-A749-05B9E4D34E00
/dev/sdi 66533D4D-49F9-3CC4-0DF9-08B9E4D34E00
/dev/sdj 44CA818A-9320-6BBB-3751-0CB9E4D34E00
/dev/sdk 587E658F-EC8B-A3DF-4D74-00BAE4D34E00
/dev/sdl 11384F8D-1DA2-E0F3-E6E5-03BAE4D34E00

hadoop-metrics.properties
The hadoop-metrics.properties files direct MapR where to output service metric reports: to an output file (FileContext) or to Ganglia 3.1
(MapRGangliaContext31). A third context, NullContext, disables metrics. To direct metrics to an output file, comment out the lines
pertaining to Ganglia and the NullContext for the chosen service; to direct metrics to Ganglia, comment out the lines pertaining to the metrics
file and the NullContext. See Service Metrics.

There are two hadoop-metrics.properties files:

/opt/mapr/hadoop/hadoop-<version>/conf/hadoop-metrics.properties specifies output for standard Hadoop services


/opt/mapr/conf/hadoop-metrics.properties specifies output for MapR-specific services

The following table describes the parameters for each service in the hadoop-metrics.properties files.

Parameter Example Values Description

<service>.class
Example values: org.apache.hadoop.metrics.spi.NullContextWithUpdateThread, org.apache.hadoop.metrics.file.FileContext,
com.mapr.fs.cldb.counters.MapRGangliaContext31
The class that implements the interface responsible for sending the service metrics to the appropriate handler. When implementing a class
that sends metrics to Ganglia, set this property to the class name.

<service>.period
Example values: 10, 60
The interval between two exports of service metrics data to the appropriate interface. This is independent of how often the metrics are
updated in the framework.

<service>.fileName
Example value: /tmp/cldbmetrics.log
The path to the file where service metrics are exported when the cldb.class property is set to FileContext.

<service>.servers
Example value: localhost:8649
The location of the gmon or gmeta that is aggregating metrics for this instance of the service, when the cldb.class property is set to
GangliaContext.

<service>.spoof
Example value: 1
Specifies whether the metrics being sent out from the server should be spoofed as coming from another server. All of our fileserver metrics
are also on the CLDB, but to make it appear to end users as if these properties were emitted by the fileserver host, we spoof the metrics to
Ganglia using this property. Currently used only for the FileServer service.

Examples
The hadoop-metrics.properties files are organized into sections for each service that provides metrics. Each section is divided into
subsections for the three contexts.

/opt/mapr/hadoop/hadoop-<version>/conf/hadoop-metrics.properties
# Configuration of the "dfs" context for null
dfs.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "dfs" context for file


#dfs.class=org.apache.hadoop.metrics.file.FileContext
#dfs.period=10
#dfs.fileName=/tmp/dfsmetrics.log

# Configuration of the "dfs" context for ganglia


# Pick one: Ganglia 3.0 (former) or Ganglia 3.1 (latter)
# dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
# dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
# dfs.period=10
# dfs.servers=localhost:8649

# Configuration of the "mapred" context for null


mapred.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "mapred" context for file


#mapred.class=org.apache.hadoop.metrics.file.FileContext
#mapred.period=10
#mapred.fileName=/tmp/mrmetrics.log

# Configuration of the "mapred" context for ganglia


# Pick one: Ganglia 3.0 (former) or Ganglia 3.1 (latter)
# mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
# mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
# mapred.period=10
# mapred.servers=localhost:8649

# Configuration of the "jvm" context for null


#jvm.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "jvm" context for file


#jvm.class=org.apache.hadoop.metrics.file.FileContext
#jvm.period=10
#jvm.fileName=/tmp/jvmmetrics.log

# Configuration of the "jvm" context for ganglia


# jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
# jvm.period=10
# jvm.servers=localhost:8649

# Configuration of the "ugi" context for null


ugi.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "fairscheduler" context for null


#fairscheduler.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "fairscheduler" context for file


#fairscheduler.class=org.apache.hadoop.metrics.file.FileContext
#fairscheduler.period=10
#fairscheduler.fileName=/tmp/fairschedulermetrics.log

# Configuration of the "fairscheduler" context for ganglia


# fairscheduler.class=org.apache.hadoop.metrics.ganglia.GangliaContext
# fairscheduler.period=10
# fairscheduler.servers=localhost:8649
#

/opt/mapr/conf/hadoop-metrics.properties
#######################################################################################################################
# hadoop-metrics.properties
#######################################################################################################################
#CLDB metrics config - Pick one out of null,file or ganglia.
#Uncomment all properties in null, file or ganglia context, to send cldb metrics to that context

# Configuration of the "cldb" context for null


#cldb.class=org.apache.hadoop.metrics.spi.NullContextWithUpdateThread
#cldb.period=10

# Configuration of the "cldb" context for file


#cldb.class=org.apache.hadoop.metrics.file.FileContext
#cldb.period=60
#cldb.fileName=/tmp/cldbmetrics.log

# Configuration of the "cldb" context for ganglia


cldb.class=com.mapr.fs.cldb.counters.MapRGangliaContext31
cldb.period=10
cldb.servers=localhost:8649
cldb.spoof=1

#FileServer metrics config - Pick one out of null,file or ganglia.


#Uncomment all properties in null, file or ganglia context, to send fileserver metrics to that context

# Configuration of the "fileserver" context for null


#fileserver.class=org.apache.hadoop.metrics.spi.NullContextWithUpdateThread
#fileserver.period=10

# Configuration of the "fileserver" context for file


#fileserver.class=org.apache.hadoop.metrics.file.FileContext
#fileserver.period=60
#fileserver.fileName=/tmp/fsmetrics.log

# Configuration of the "fileserver" context for ganglia


fileserver.class=com.mapr.fs.cldb.counters.MapRGangliaContext31
fileserver.period=37
fileserver.servers=localhost:8649
fileserver.spoof=1

################################################################################################################

mapr-clusters.conf
The configuration file /opt/mapr/conf/mapr-clusters.conf specifies the CLDB nodes for one or more clusters that can be reached from
the node or client on which it is installed.

Format:

clustername1 <CLDB> <CLDB> <CLDB>


[ clustername2 <CLDB> <CLDB> <CLDB> ]
[ ... ]

The <CLDB> string format is one of the following:

host,ip:port - Host, IP, and port (uses DNS to resolve hostnames, or provided IP if DNS is down)
host:port - Hostname and port (uses DNS to resolve host, specifies port)
ip:port - IP and port (avoids using DNS to resolve hosts, specifies port)
host - Hostname only (default, uses DNS to resolve host, uses default port)
ip - IP only (avoids using DNS to resolve hosts, uses default port)
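
For example, a node or client that can reach two clusters might have a mapr-clusters.conf like the following sketch (the cluster names, hostnames, and IP addresses are placeholders; 7222 is the default CLDB port):

cluster1 nodeA:7222 nodeB:7222 nodeC:7222
cluster2 10.10.80.1:7222 10.10.80.2:7222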

mapred-default.xml
The configuration file mapred-default.xml provides defaults that can be overridden using mapred-site.xml, and is located in the Hadoop
core JAR file (/opt/mapr/hadoop/hadoop-<version>/lib/hadoop-<version>-dev-core.jar).

Do not modify mapred-default.xml directly. Instead, copy parameters into mapred-site.xml and modify them there. If
mapred-site.xml does not already exist, create it.

The format for a parameter in both mapred-default.xml and mapred-site.xml is:

<property>
<name>io.sort.spill.percent</name>
<value>0.99</value>
<description>The soft limit in either the buffer or record collection
buffers. Once reached, a thread will begin to spill the contents to disk
in the background. Note that this does not imply any chunking of data to
the spill. A value less than 0.5 is not recommended.</description>
</property>

The <name> element contains the parameter name, the <value> element contains the parameter value, and the optional <description>
element contains the parameter description. You can create XML for any parameter from the table below, using the example above as a guide.

Parameter Value Description

hadoop.job.history.location If job tracker is sta


known place on lo
default, it is in the
History files are m
mapred.jobtracke
MapRFs JobTrack

hadoop.job.history.user.location User can specify a


job. If nothing is s
The files are store
stop logging by gi

hadoop.rpc.socket.factory.class.JobSubmissionProtocol SocketFactory to
(JobTracker). If nu
hadoop.rpc.socke

io.map.index.skip 0 Number of index e


default. Setting th
opening large ma

io.sort.factor 256 The number of str


determines the nu

io.sort.mb 100 Buffer used to hol


map outputs. Sett
average input to m
io.sort.mb should

io.sort.record.percent 0.17 The percentage o


boundaries. Let th
number of records
block is equal to (

io.sort.spill.percent 0.99 The soft limit in ei


reached, a thread
background. Note
the spill. A value l

job.end.notification.url http://localhost:8080/jobstatus.php?jobId=$jobId&amp;jobStatus=$jobStatus Indicates url which


end status of job.
$jobId and $jobSt
replaced by their r

job.end.retry.attempts 0 Indicates how ma


notification URL

job.end.retry.interval 30000 Indicates time in m


jobclient.completion.poll.interval 5000 The interval (in m
the JobTracker fo
this to a lower val
system. Adjusting
client-server traffic

jobclient.output.filter FAILED The filter for contr


the console of the
KILLED, FAILED,

jobclient.progress.monitor.poll.interval 1000 The interval (in m


status to the cons
want to set this to
single node syste
to unwanted clien

map.sort.class org.apache.hadoop.util.QuickSort The default sort c

mapr.localoutput.dir output The path for local

mapr.localspill.dir spill The path for local

mapr.localvolumes.path /var/mapr/local The path for local

mapred.acls.enabled false Specifies whether


users for doing va
disabled by defau
by JobTracker an
users for queue o
job in the queue a
(See mapreduce.j
mapreduce.job.ac
via the console an

mapred.child.env User added enviro


processes. Examp
foo 2) B=$B:c Thi

mapred.child.java.opts Java opts for the t


symbol, if present
current TaskID. A
For example, to e
taskid in /tmp and
pass a 'value' of: -
-Xloggc:/tmp/@ta
mapred.child.ulim
memory of the ch

mapred.child.oom_adj 10 Increase the OOM


allow increasing th

mapred.child.renice 10 Nice value to run


favorable) to 19 (l
priority. (valid valu

mapred.child.taskset true Run the job in a ta


No taskset 5-8 CP
infrastructrue proc
reserved for infras

mapred.child.tmp ./tmp To set the value o


value is an absolu
prepended with ta
executed with opt
tmp dir'. Pipes and
TMPDIR='the abs

mapred.child.ulimit The maximum virt


the Map-Reduce f
Mapper/Reducer
Hadoop Streamin
cluster admins co
mechanisms. Not
equal to the -Xmx
mapred.cluster.map.memory.mb -1 The size, in terms
Map-Reduce fram
multiple slots for a
mapred.job.map.m
mapred.cluster.m
the feature. The v
off.

mapred.cluster.max.map.memory.mb -1 The maximum siz


task launched by
scheduler. A job c
via mapred.job.ma
mapred.cluster.m
the feature. The v
off.

mapred.cluster.max.reduce.memory.mb -1 The maximum siz


task launched by
scheduler. A job c
via mapred.job.re
mapred.cluster.m
the feature. The v
off.

mapred.cluster.reduce.memory.mb -1 The size, in terms


Map-Reduce fram
multiple slots for a
mapred.job.reduc
mapred.cluster.m
the feature. The v
off.

mapred.compress.map.output false Should the output


across the networ

mapred.healthChecker.interval 60000 Frequency of the

mapred.healthChecker.script.args List of arguments


when it is being la

mapred.healthChecker.script.path Absolute path to t


health monitoring
not. If the value of
the location config
is not started.

mapred.healthChecker.script.timeout 600000 Time after node h


considered that th

mapred.hosts.exclude Names a file that


excluded by the jo
excluded.

mapred.hosts Names a file that


the jobtracker. If t

mapred.inmem.merge.threshold 1000 The threshold, in


merge process. W
we initiate the in-m
less than 0 indica
instead depend on
trigger the merge.

mapred.job.map.memory.mb -1 The size, in terms


job. A job can ask
up to the next mu
upto the limit spec
the scheduler sup
this feature is turn
also turned off (-1

mapred.job.map.memory.physical.mb Maximum physica


exceeded task att
mapred.job.queue.name default Queue to which a
queues defined in
ACL setup for the
job to the queue.
system is configur
submitting jobs to

mapred.job.reduce.input.buffer.percent 0.0 The percentage o


to retain map outp
concluded, any re
less than this thre

mapred.job.reduce.memory.mb -1 The size, in terms


the job. A job can
rounded up to the
mapred.cluster.re
mapred.cluster.m
the feature. The v
iff mapred.cluster.

mapred.job.reduce.memory.physical.mb Maximum physica


is exceeded task

mapred.job.reuse.jvm.num.tasks -1 How many tasks t

mapred.job.shuffle.input.buffer.percent 0.70 The percentage o


heap size to storin

mapred.job.shuffle.merge.percent 0.66 The usage thresh


initiated, expresse
allocated to storin
mapred.job.shuffle

mapred.job.tracker.handler.count 10 The number of se


roughly 4% of the

mapred.job.tracker.history.completed.location /var/mapr/cluster/mapred/jobTracker/history/done The completed job


well-known locatio
$<hadoop.job.hist

mapred.job.tracker.http.address 0.0.0.0:50030 The job tracker ht


on. If the port is 0

mapred.job.tracker.persist.jobstatus.active false Indicates if persis

mapred.job.tracker.persist.jobstatus.dir /var/mapr/cluster/mapred/jobTracker/jobsInfo The directory whe


system to be avai
between jobtracke

mapred.job.tracker.persist.jobstatus.hours 0 The number of ho


The job status info
memory queue an
value the job statu

mapred.job.tracker localhost:9001 jobTracker addres


or maprfs:///mapr/
'san_jose_cluster1

mapred.jobtracker.completeuserjobs.maximum 100 The maximum nu


before delegating

mapred.jobtracker.instrumentation org.apache.hadoop.mapred.JobTrackerMetricsInst Expert: The instru


JobTracker.

mapred.jobtracker.job.history.block.size 3145728 The block size of


job history, its imp
possible. Note tha
value is set to 3 M

mapred.jobtracker.jobhistory.lru.cache.size 5 The number of job


loaded when they
on LRU.

mapred.jobtracker.maxtasks.per.job -1 The maximum nu


indicates that ther
mapred.jobtracker.plugins Comma-separate

mapred.jobtracker.port 9001 Port on which Job

mapred.jobtracker.restart.recover true "true" to enable (jo

mapred.jobtracker.retiredjobs.cache.size 1000 The number of ret

mapred.jobtracker.taskScheduler.maxRunningTasksPerJob The maximum nu


preempted. No lim

mapred.jobtracker.taskScheduler org.apache.hadoop.mapred.JobQueueTaskScheduler The class respons

mapred.line.input.format.linespermap 1 Number of lines p

mapred.local.dir.minspacekill 0 If the space in ma


tasks until all the c
Also, to save the r
them, to clean up
go with the ones t

mapred.local.dir.minspacestart 0 If the space in ma


more tasks. Value

mapred.local.dir $<hadoop.tmp.dir>/mapred/local The local directory


files. May be a co
devices in order to
are ignored.

mapred.map.child.env User added enviro


processes. Examp
foo 2) B=$B:c Thi

mapred.map.child.java.opts -XX:ErrorFile=/opt/cores/hadoop/java_error%p.log Java opts for the m


be interpolated: @
other occurrences
enable verbose gc
and to set the hea
-Xmx1024m -verb
configuration varia
used to control the
processes. MapR
memory reserved
given more memo
task = (Total Mem
(#mapslots + 1.3*

mapred.map.child.ulimit The maximum virt


the Map-Reduce f
Mapper/Reducer
Hadoop Streamin
cluster admins co
mechanisms. Not
greater than or eq
might not start.

mapred.map.max.attempts 4 Expert: The maxim


words, framework
number of times b

mapred.map.output.compression.codec org.apache.hadoop.io.compress.DefaultCodec If the map outputs


compressed?

mapred.map.tasks.speculative.execution true If true, then multip


executed in parall

mapred.map.tasks 2 The default numb


mapred.job.tracke

mapred.max.tracker.blacklists 4 The number of bla


which the task tra
tracker will be give
become a healthy

mapred.max.tracker.failures 4 The number of tas


which new tasks o
mapred.merge.recordsBeforeProgress 10000 The number of rec
progress notificati

mapred.min.split.size 0 The minimum size


that some file form
priority over this s

mapred.output.compress false Should the job ou

mapred.output.compression.codec org.apache.hadoop.io.compress.DefaultCodec If the job outputs a


compressed?

mapred.output.compression.type RECORD If the job outputs a


should they be co
or BLOCK.

mapred.queue.default.state RUNNING This values define


be either "STOPP
at runtime.

mapred.queue.names default Comma separated


Jobs are added to
scheduling proper
property for a que
name specified in
to all schedulers a
mapred.queue.$Q
mapred.queue.de
configured in this
scheduler being u
mapred.jobtracke
JobQueueTaskSc
the default configu
that the scheduler

mapred.reduce.child.env

mapred.reduce.child.java.opts -XX:ErrorFile=/opt/cores/hadoop/java_error%p.log Java opts for the r


determined by me
Reduce task is giv
memory for a redu
mapreduce) * (1.3

mapred.reduce.child.ulimit

mapred.reduce.copy.backoff 300 The maximum am


fetching one map

mapred.reduce.max.attempts 4 Expert: The maxim


other words, fram
many number of t

mapred.reduce.parallel.copies 12 The default numb


copy(shuffle) phas

mapred.reduce.slowstart.completed.maps 0.95 Fraction of the nu


complete before r

mapred.reduce.tasks.speculative.execution true If true, then multip


executed in parall

mapred.reduce.tasks 1 The default numb


of the cluster's red
can still be execut
mapred.job.tracke

mapred.skip.attempts.to.start.skipping 2 The number of Ta


kicked off. When s
range of records w
So that on failures
the bad records. O
mapred.skip.map.auto.incr.proc.count true The flag which if s
SkipBadRecords.
incremented by M
value must be set
records asynchron
streaming. In such
counter on their o

mapred.skip.map.max.skip.records 0 The number of ac


record PER bad r
record as well. To
records off, set th
down the skipped
all attempts get ex
Long.MAX_VALU
narrow down. Wh
skipped are accep

mapred.skip.out.dir If no value is spec


output directory a
records by giving

mapred.skip.reduce.auto.incr.proc.count true The flag which if s


SkipBadRecords.
is incremented by
This value must b
records asynchron
streaming. In such
counter on their o

mapred.skip.reduce.max.skip.groups 0 The number of ac


PER bad group in
as well. To turn th
off, set the value t
skipped range by
attempts get exha
Long.MAX_VALU
narrow down. Wh
skipped are accep

mapred.submit.replication 10 The replication lev


the square root of

mapred.system.dir /var/mapr/cluster/mapred/jobTracker/system The shared direct

mapred.task.cache.levels 2 This is the max le


2, the tasks cache

mapred.task.profile.maps 0-2 To set the ranges


to be set to true fo

mapred.task.profile.reduces 0-2 To set the ranges


has to be set to tr

mapred.task.profile false To set whether th


some of the tasks
log directory. The

mapred.task.timeout 600000 The number of mi


neither reads an i
string.

mapred.task.tracker.http.address 0.0.0.0:50060 The task tracker h


the server will sta

mapred.task.tracker.report.address 127.0.0.1:0 The interface and


is only connected
EXPERT ONLY. S
have the loopback

mapred.task.tracker.task-controller org.apache.hadoop.mapred.DefaultTaskController TaskController wh


execution

mapred.tasktracker.dns.interface default The name of the N


should report its IP
mapred.tasktracker.dns.nameserver default The host name or
TaskTracker shou
JobTracker for co

mapred.tasktracker.expiry.interval 600000 Expert: The time-i


is declared 'lost' if

mapred.tasktracker.indexcache.mb 10 The maximum me


cache that is used

mapred.tasktracker.instrumentation org.apache.hadoop.mapred.TaskTrackerMetricsInst Expert: The instru


TaskTracker.

mapred.tasktracker.map.tasks.maximum (CPUS > 2) ? (CPUS * 0.75) : 1 The maximum nu


simultaneously by

mapred.tasktracker.memory_calculator_plugin Name of the class


information on the
org.apache.hadoo
null, the tasktrack
platform. Currentl

mapred.tasktracker.reduce.tasks.maximum (CPUS > 2) ? (CPUS * 0.50): 1 The maximum nu


simultaneously by

mapred.tasktracker.taskmemorymanager.monitoring-interval 5000 The interval, in mi


between two cycle
only if tasks' mem
mapred.tasktracke

mapred.tasktracker.tasks.sleeptime-before-sigkill 5000 The time, in millis


SIGKILL to a proc

mapred.temp.dir $<hadoop.tmp.dir>/mapred/temp A shared directory

mapred.user.jobconf.limit 5242880 The maximum allo


to 5 MB

mapred.userlog.limit.kb 0 The maximum siz


cap.

mapred.userlog.retain.hours 24 The maximum tim


retained after the

mapreduce.heartbeat.10 300 heartbeat in millis


nodes)

mapreduce.heartbeat.100 1000 heartbeat in millis


Scales linearly be

mapreduce.heartbeat.1000 10000 heartbeat in millis


Scales linearly be

mapreduce.heartbeat.10000 100000 heartbeat in millis


nodes). Scales lin
mapreduce.job.acl-modify-job job specific acces
used if authorizati
configuration prop
the list of users an
on the job. For sp
use is "user1,user
users/groups to m
none. This configu
with respect to thi
operations: o killin
task of this job o s
operations are als
"acl-administer-job
caller should have
queue-level ACL o
configuration, job-
administrators con
and queue admin
submitted to confi
mapred.queue.qu
mapred-queue-ac
a job. By default,
started the cluster
administrators can

mapreduce.job.acl-view-job job specific acces


if authorization is
configuration prop
the list of users an
the job. For specif
is "user1,user2 gr
users/groups to m
none. This configu
and at present on
sensitive informat
task-level counter
displayed on the T
the JobTracker's w
is still accessible b
JobProfile, list of j
configuration, job-
administrators con
and queue admin
submitted to confi
mapred.queue.qu
mapred-queue-ac
By default, nobod
the cluster, cluste
perform view oper

mapreduce.job.complete.cancel.delegation.tokens true if false - do not un


because same tok

mapreduce.job.split.metainfo.maxsize 10000000 The maximum pe


JobTracker won't
the configured val

mapreduce.jobtracker.recovery.dir /var/mapr/cluster/mapred/jobTracker/recovery Recovery Director

mapreduce.jobtracker.recovery.job.initialization.maxtime Maximum time in


before starting rec
mapreduce.jobtra

mapreduce.jobtracker.recovery.maxtime 480 Maximum time in


mode. JobTracke
tasktrackers. On l
mapreduce.jobtra

mapreduce.jobtracker.staging.root.dir /var/mapr/cluster/mapred/jobTracker/staging The root of the sta


should be the dire
(usually /user)

mapreduce.maprfs.use.checksum true Deprecated; chec

mapreduce.maprfs.use.compression true If true, then mapre


mapreduce.reduce.input.limit -1 The limit on the in
size of the reduce
of -1 means that t

mapreduce.task.classpath.user.precedence false Set to true if user

mapreduce.tasktracker.group Expert: Group to w


LinuxTaskControl
mapreduce.tasktr
task-controller bin

mapreduce.tasktracker.heapbased.memory.management false Expert only: If adm


too many tasks us
max java heap siz
tasktracker based
tasks. See
mapred.map.child

mapreduce.tasktracker.jvm.idle.time 10000 If jvm is idle for m


(milliseconds) tas

mapreduce.tasktracker.outofband.heartbeat false Expert: Set this to


heartbeat on task

mapreduce.tasktracker.prefetch.maptasks 1.0 How many map ta


tasktracker. To be
means number of
tasktracker.

mapreduce.tasktracker.reserved.physicalmemory.mb Maximum phyisca


mapreduce tasks.
maximum memor
tasktracker should
mapreduce tasks.
based on services
mapreduce.tasktr
disable physical m

mapreduce.tasktracker.volume.healthcheck.interval 60000 How often tasktra


$<mapr.localvolum

mapreduce.use.fastreduce false Expert only . Redu

mapreduce.use.maprfs true If true, then mapre


may be executed

keep.failed.task.files false Should the files fo


on jobs that are fa
also prevents the
directory as they a

keep.task.files.pattern .*_m_123456_0 Keep all files from


regular expression

tasktracker.http.threads 2 The number of wo


used for map outp

mapred-site.xml
The file /opt/mapr/hadoop/hadoop-<version>/conf/mapred-site.xml specifies MapReduce formulas and parameters.

Each parameter in the local configuration file overrides the corresponding parameter in the cluster-wide configuration unless the cluster-wide copy
of the parameter includes <final>true</final>. In general, only job-specific parameters should be set in the local copy of
mapred-site.xml.
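
For example, a cluster-wide copy of mapred-site.xml can pin a value so that local copies and job code cannot override it. The following sketch uses io.sort.factor (with its default value from mapred-default.xml) purely as an illustration:

<property>
  <name>io.sort.factor</name>
  <value>256</value>
  <final>true</final>
</property>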

There are three parts to mapred-site.xml:

JobTracker configuration
TaskTracker configuration
Job configuration

Jobtracker Configuration
Should be changed by the administrator. When changing any parameters in this section, a JobTracker restart is required.

Parameter Value Description

mapred.job.tracker maprfs:/// JobTracker address (ip:port), or a URI such as maprfs:/// for the default cluster or
maprfs:///mapr/san_jose_cluster1 to connect to the 'san_jose_cluster1' cluster.
Replace localhost with one or more IP addresses for the JobTracker.

mapred.jobtracker.port 9001 Port on which JobTracker listens. Read by JobTracker to start RPC Server.

mapreduce.tasktracker.outofband.heartbeat false Expert: Set this to true to let the tasktracker send an out-of-band heartbeat on
task-completion for better latency.

webinterface.private.actions If set to true, jobs can be killed from JT's web interface.
Enable this option if the interfaces are only reachable by
those who have the right authorization.

Jobtracker Directories

When changing any parameters in this section, a JobTracker restart is required.

Volume path = mapred.system.dir/../

Parameter Value Description

mapred.system.dir /var/mapr/cluster/mapred/jobTracker/system The shared directory where MapReduce


stores control files.

mapred.job.tracker.persist.jobstatus.dir /var/mapr/cluster/mapred/jobTracker/jobsInfo The directory where the job status


information is persisted in a file system to be
available after it drops off the memory queue
and between jobtracker restarts.

mapreduce.jobtracker.staging.root.dir /var/mapr/cluster/mapred/jobTracker/staging The root of the staging area for users' job
files In practice, this should be the directory
where users' home directories are located
(usually /user)

mapreduce.job.split.metainfo.maxsize 10000000 The maximum permissible size of the split


metainfo file. The JobTracker won't attempt
to read split metainfo files bigger than the
configured value. No limits if set to -1.

mapred.jobtracker.retiredjobs.cache.size 1000 The number of retired job status to keep in


the cache.

mapred.job.tracker.history.completed.location /var/mapr/cluster/mapred/jobTracker/history/done The completed job history files are stored at


this single well known location. If nothing is
specified, the files are stored at
${hadoop.job.history.location}/done in local
filesystem.

hadoop.job.history.location If job tracker is static the history files are


stored in this single well known place on
local filesystem. If No value is set here, by
default, it is in the local file system at
${hadoop.log.dir}/history. History files are
moved to
mapred.jobtracker.history.completed.location
which is on MapRFs JobTracker volume.

mapred.jobtracker.jobhistory.lru.cache.size 5 The number of job history files loaded in


memory. The jobs are loaded when they are
first accessed. The cache is cleared based
on LRU.

JobTracker Recovery

When changing any parameters in this section, a JobTracker restart is required.

Parameter Value Description


mapreduce.jobtracker.recovery.dir /var/mapr/cluster/mapred/jobTracker/recovery Recovery Directory. Stores list of known
TaskTrackers.

mapreduce.jobtracker.recovery.maxtime 120 Maximum time in seconds JobTracker should stay in


recovery mode.

mapred.jobtracker.restart.recover true "true" to enable (job) recovery upon restart, "false" to


start afresh

Enable Fair Scheduler

When changing any parameters in this section, a JobTracker restart is required.

Parameter Value Description

mapred.fairscheduler.allocation.file conf/pools.xml

mapred.jobtracker.taskScheduler org.apache.hadoop.mapred.FairScheduler

mapred.fairscheduler.assignmultiple true

mapred.fairscheduler.eventlog.enabled false Enable scheduler logging in


${HADOOP_LOG_DIR}/fairscheduler/

mapred.fairscheduler.smalljob.schedule.enable true Enable small job fast scheduling inside fair


scheduler. TaskTrackers should reserve a
slot called ephemeral slot which is used for
smalljob if cluster is busy.

mapred.fairscheduler.smalljob.max.maps 10 Small job definition. Max number of maps


allowed in small job.

mapred.fairscheduler.smalljob.max.reducers 10 Small job definition. Max number of


reducers allowed in small job.

mapred.fairscheduler.smalljob.max.inputsize 10737418240 Small job definition. Max input size in bytes


allowed for a small job. Default is 10GB.

mapred.fairscheduler.smalljob.max.reducer.inputsize 1073741824 Small job definition. Max estimated input


size for a reducer allowed in small job.
Default is 1GB per reducer.

mapred.cluster.ephemeral.tasks.memory.limit.mb 200 Small job definition. Max memory in mbytes


reserved for an ephemeral slot. Default is
200mb. This value must be the same on
JobTracker and TaskTracker nodes.
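
As a sketch, the corresponding entries in mapred-site.xml for the first few parameters above might look like the following (values taken from the table; adjust them for your environment):

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <name>mapred.fairscheduler.allocation.file</name>
  <value>conf/pools.xml</value>
</property>
<property>
  <name>mapred.fairscheduler.assignmultiple</name>
  <value>true</value>
</property>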

TaskTracker Configuration
When changing any parameters in this section, a TaskTracker restart is required.

Should be changed by admin

Parameter Value Description

mapred.tasktracker.map.tasks.maximum (CPUS > 2) ? (CPUS * 0.75) : 1 The maximum number of map tasks that will be run
simultaneously by a task tracker.

mapreduce.tasktracker.prefetch.maptasks 1.0 How many map tasks should be scheduled in-advance on a


tasktracker. To be given in % of map slots. Default is 1.0 which
means number of tasks overscheduled = total map slots on TT.

mapred.tasktracker.reduce.tasks.maximum (CPUS > 2) ? (CPUS * 0.50) : 1 The maximum number of reduce tasks that will be run
simultaneously by a task tracker.

mapred.tasktracker.ephemeral.tasks.maximum 1 Reserved slot for small job scheduling

mapred.tasktracker.ephemeral.tasks.timeout 10000 Maximum time in ms a task is allowed to occupy ephemeral slot

mapred.tasktracker.ephemeral.tasks.ulimit 4294967296 Ulimit (bytes) on all tasks scheduled on an ephemeral slot


mapreduce.tasktracker.reserved.physicalmemory.mb Maximum physical memory the tasktracker should reserve for
mapreduce tasks.
If tasks use more than the limit, the task using the most memory
will be killed.
Expert only: Set this value only if the tasktracker should use a
certain amount of memory for mapreduce tasks. In the MapR
distribution, the warden figures this number based on the services
configured on a node.
Setting mapreduce.tasktracker.reserved.physicalmemory.mb to -1
disables physical memory accounting and task management.

mapreduce.tasktracker.heapbased.memory.management false Expert only: If admin wants to prevent swapping by not launching
too many tasks
use this option. Task's memory usage is based on max java heap
size (-Xmx).
By default -Xmx will be computed by tasktracker based on slots
and memory reserved for mapreduce tasks.
See mapred.map.child.java.opts/mapred.reduce.child.java.opts.

mapreduce.tasktracker.jvm.idle.time 10000 If jvm is idle for more than mapreduce.tasktracker.jvm.idle.time


(milliseconds)
tasktracker will kill it.

Job Configuration
Set these values on the node from which you plan to submit jobs, before submitting the jobs. If you are using the Hadoop examples, you
can set these parameters from the command line. Example:

hadoop jar hadoop-examples.jar terasort -Dmapred.map.child.java.opts="-Xmx1000m"

When you submit a job, the JobClient creates job.xml by reading parameters from the following files in the following order:

1. mapred-default.xml
2. The local mapred-site.xml - overrides identical parameters in mapred-default.xml
3. Any settings in the job code itself - overrides identical parameters in mapred-site.xml

Parameter Value Description

keep.failed.task.files false Should the files for failed tasks be kept. This should only be used on jobs that are failing, because the storage is never reclaimed. It also prevents the map outputs from being erased from the reduce directory as they are consumed.

mapred.job.reuse.jvm.num.tasks -1 How many tasks to run per JVM. If set to -1, there is no limit.

mapred.map.tasks.speculative.execution true If true, then multiple instances of some map tasks may be executed in parallel.

mapred.reduce.tasks.speculative.execution true If true, then multiple instances of some reduce tasks may be executed in parallel.

mapred.job.map.memory.physical.mb  Maximum physical memory limit for a map task of this job. If the limit is exceeded, the task attempt is FAILED.

mapred.job.reduce.memory.physical.mb  Maximum physical memory limit for a reduce task of this job. If the limit is exceeded, the task attempt is FAILED.

mapreduce.task.classpath.user.precedence false Set to true if the user wants to set a different classpath.

mapred.max.maps.per.node -1 Per-node limit on running map tasks for the job. A value of -1 signifies no limit.

mapred.max.reduces.per.node -1 Per-node limit on running reduce tasks for the job. A value of -1 signifies no limit.

mapred.running.map.limit -1 Cluster-wide limit on running map tasks for the job. A value of -1 signifies no limit.

mapred.running.reduce.limit -1 Cluster-wide limit on running reduce tasks for the job. A value of -1 signifies no limit.

mapred.reduce.child.java.opts -XX:ErrorFile=/opt/cores/mapreduce_java_error%p.log Java opts for the reduce tasks. In MapR, the default heap size (-Xmx) is determined by the memory reserved for MapReduce at the TaskTracker; a reduce task is given more memory than a map task. Default memory for a reduce task = (total memory reserved for MapReduce) * (2 * #reduceslots / (#mapslots + 2 * #reduceslots)). A worked example follows this table.

mapred.reduce.child.ulimit

io.sort.mb  Buffer used to hold map outputs in memory before writing final map outputs. Setting this value very low may cause spills. If left empty, the value defaults to 50% of the map task's heap size. If the average input to a map is MapIn bytes, a typical value for io.sort.mb is 1.25 times MapIn bytes.

io.sort.factor 256 The number of streams to merge at once while sorting files. This determines the number of open file handles.

io.sort.record.percent 0.17 The percentage of io.sort.mb dedicated to tracking record boundaries. If this value is r and io.sort.mb is x, the maximum number of records collected before the collection thread must block is (r * x) / 4.

mapred.reduce.slowstart.completed.maps 0.95 Fraction of the number of maps in the job that should be complete before reduces are scheduled for the job.

mapreduce.reduce.input.limit -1 The limit on the input size of the reduce. If the estimated input size of the reduce is greater than this value, the job is failed. A value of -1 means that there is no limit set.

mapred.reduce.parallel.copies 12 The default number of parallel transfers run by reduce during the copy (shuffle) phase.
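
As a worked example of the default reduce-task memory formula above (the numbers are hypothetical): on a TaskTracker with 8 map slots, 4 reduce slots, and 4096 MB reserved for MapReduce, the formula as written gives 4096 * (2 * 4 / (8 + 2 * 4)) = 4096 * 0.5 = 2048 MB. Substitute your cluster's actual slot counts and reserved memory to estimate the default for your nodes.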

Oozie

Parameter Value Description

hadoop.proxyuser.root.hosts * Comma-separated list of IP addresses or hostnames of the hosts running the Oozie server.

hadoop.proxyuser.mapr.groups mapr,staff Comma-separated list of groups whose users the mapr user is allowed to impersonate.

hadoop.proxyuser.root.groups root Comma-separated list of groups whose users the root user is allowed to impersonate.
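
These proxy-user properties are typically set in Hadoop's core-site.xml. The fragment below is only an illustrative sketch that mirrors the values in the table above; the property elements go inside the file's existing configuration element.

<!-- core-site.xml fragment: illustrative, values mirror the table above -->
<property>
  <name>hadoop.proxyuser.root.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.root.groups</name>
  <value>root</value>
</property>
<property>
  <name>hadoop.proxyuser.mapr.groups</name>
  <value>mapr,staff</value>
</property>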

mfs.conf
The configuration file /opt/mapr/conf/mfs.conf specifies the following parameters about the MapR-FS server on each node:
Parameter Value Description

mfs.server.ip 192.168.10.10 IP address of the FileServer

mfs.server.port 5660 Port used for communication with the server

mfs.cache.lru.sizes inode:6:log:6:meta:10:dir:40:small:15 LRU cache configuration

mfs.on.virtual.machine 0 Specifies whether MapR-FS is running on a virtual machine

mfs.io.disk.timeout 60 Timeout, in seconds, after which a disk is considered failed and taken offline.

mfs.conf

mfs.server.ip=192.168.10.10
mfs.server.port=5660
mfs.cache.lru.sizes=inode:6:log:6:meta:10:dir:40:small:15
mfs.on.virtual.machine=0
mfs.io.disk.timeout=60

taskcontroller.cfg
The file /opt/mapr/hadoop/hadoop-<version>/conf/taskcontroller.cfg specifies TaskTracker configuration parameters. The
parameters should be set the same on all TaskTracker nodes. See also Secured TaskTracker.

Parameter Value Description

mapred.local.dir /tmp/mapr-hadoop/mapred/local The local MapReduce directory.

hadoop.log.dir /opt/mapr/hadoop/hadoop-0.20.2/bin/../logs The Hadoop log directory.

mapreduce.tasktracker.group root The group that is allowed to submit jobs.

min.user.id -1 The minimum user ID for submitting jobs. Set to 0 to disallow root from submitting jobs; set to 1000 to disallow all superusers from submitting jobs.

banned.users (not present by default) Add this parameter with a comma-separated list of usernames to ban certain users from submitting jobs.
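
As an illustrative sketch, a taskcontroller.cfg using the default values from the table above would look something like the following. This assumes the same key=value layout used by the other configuration files shown on this page; banned.users is omitted because it is not present by default.

mapred.local.dir=/tmp/mapr-hadoop/mapred/local
hadoop.log.dir=/opt/mapr/hadoop/hadoop-0.20.2/bin/../logs
mapreduce.tasktracker.group=root
min.user.id=-1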

warden.conf
The file /opt/mapr/conf/warden.conf controls parameters related to MapR services and the warden. Most of the parameters are not
intended to be edited directly by users. The following table shows the parameters of interest:

Parameter Value Description

service.command.jt.heapsize.percent 10 The percentage of heap space reserved for the JobTracker.

service.command.jt.heapsize.max 5000 The maximum heap space that can be used by the JobTracker.

service.command.jt.heapsize.min 256 The minimum heap space for use by the JobTracker.

service.command.tt.heapsize.percent 2 The percentage of heap space reserved for the TaskTracker.

service.command.tt.heapsize.max 325 The maximum heap space that can be used by the TaskTracker.

service.command.tt.heapsize.min 64 The minimum heap space for use by the TaskTracker.

service.command.hbmaster.heapsize.percent 4 The percentage of heap space reserved for the HBase Master.

service.command.hbmaster.heapsize.max 512 The maximum heap space that can be used by the HBase Master.

service.command.hbmaster.heapsize.min 64 The minimum heap space for use by the HBase Master.

service.command.hbregion.heapsize.percent 25 The percentage of heap space reserved for the HBase Region Server.

service.command.hbregion.heapsize.max 4000 The maximum heap space that can be used by the HBase Region Server.

service.command.hbregion.heapsize.min 1000 The minimum heap space for use by the HBase Region Server.

service.command.cldb.heapsize.percent 8 The percentage of heap space reserved for the CLDB.

service.command.cldb.heapsize.max 4000 The maximum heap space that can be used by the CLDB.

service.command.cldb.heapsize.min 256 The minimum heap space for use by the CLDB.

service.command.mfs.heapsize.percent 20 The percentage of heap space reserved for the MapR-FS FileServer.

service.command.mfs.heapsize.min 512 The minimum heap space for use by the MapR-FS FileServer.

service.command.webserver.heapsize.percent 3 The percentage of heap space reserved for the MapR Control System.

service.command.webserver.heapsize.max 750 The maximum heap space that can be used by the MapR Control System.

service.command.webserver.heapsize.min 512 The minimum heap space for use by the MapR Control System.

service.command.os.heapsize.percent 3 The percentage of heap space reserved for the operating system.

service.command.os.heapsize.max 750 The maximum heap space that can be used by the operating system.

service.command.os.heapsize.min 256 The minimum heap space for use by the operating system.

service.nice.value -10 The nice priority under which all services will run.

zookeeper.servers 10.250.1.61:5181 The list of ZooKeeper servers.

services.retries 3 The number of times the Warden tries to restart a service that fails.

services.retryinterval.time.sec 1800 The number of seconds after which the warden will again attempt several
times to start a failed service. The number of attempts after each interval is
specified by the parameter services.retries.

cldb.port 7222 The port for communicating with the CLDB.

mfs.port 5660 The port for communicating with the FileServer.

hbmaster.port 60000 The port for communicating with the HBase Master.

hoststats.port 5660 The port for communicating with the HostStats service.

jt.port 9001 The port for communicating with the JobTracker.

kvstore.port 5660 The port for communicating with the Key/Value Store.

mapr.home.dir /opt/mapr The directory where MapR is installed.

warden.conf

services=webserver:all:cldb;jobtracker:1:cldb;tasktracker:all:jobtracker;nfs:all:cldb;kvstore:all;cldb:all:kvstore;host
service.command.jt.start=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh start jobtracker
service.command.tt.start=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh start tasktracker
service.command.hbmaster.start=/opt/mapr/hbase/hbase-0.90.2/bin/hbase-daemon.sh start master
service.command.hbregion.start=/opt/mapr/hbase/hbase-0.90.2/bin/hbase-daemon.sh start regionserver
service.command.cldb.start=/etc/init.d/mapr-cldb start
service.command.kvstore.start=/etc/init.d/mapr-mfs start
service.command.mfs.start=/etc/init.d/mapr-mfs start
service.command.nfs.start=/etc/init.d/mapr-nfsserver start
service.command.hoststats.start=/etc/init.d/mapr-hoststats start
service.command.webserver.start=/opt/mapr/adminuiapp/webserver start
service.command.jt.stop=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh stop jobtracker
service.command.tt.stop=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh stop tasktracker
service.command.hbmaster.stop=/opt/mapr/hbase/hbase-0.90.2/bin/hbase-daemon.sh stop master
service.command.hbregion.stop=/opt/mapr/hbase/hbase-0.90.2/bin/hbase-daemon.sh stop regionserver
service.command.cldb.stop=/etc/init.d/mapr-cldb stop
service.command.kvstore.stop=/etc/init.d/mapr-mfs stop
service.command.mfs.stop=/etc/init.d/mapr-mfs stop
service.command.nfs.stop=/etc/init.d/mapr-nfsserver stop
service.command.hoststats.stop=/etc/init.d/mapr-hoststats stop
service.command.webserver.stop=/opt/mapr/adminuiapp/webserver stop
service.command.jt.type=BACKGROUND
service.command.tt.type=BACKGROUND
service.command.hbmaster.type=BACKGROUND
service.command.hbregion.type=BACKGROUND
service.command.cldb.type=BACKGROUND
service.command.kvstore.type=BACKGROUND
service.command.mfs.type=BACKGROUND
service.command.nfs.type=BACKGROUND
service.command.hoststats.type=BACKGROUND
service.command.webserver.type=BACKGROUND
service.command.jt.monitor=org.apache.hadoop.mapred.JobTracker
service.command.tt.monitor=org.apache.hadoop.mapred.TaskTracker
service.command.hbmaster.monitor=org.apache.hadoop.hbase.master.HMaster start
service.command.hbregion.monitor=org.apache.hadoop.hbase.regionserver.HRegionServer start
service.command.cldb.monitor=com.mapr.fs.cldb.CLDB
service.command.kvstore.monitor=server/mfs
service.command.mfs.monitor=server/mfs
service.command.nfs.monitor=server/nfsserver
service.command.jt.monitorcommand=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh status jobtracker
service.command.tt.monitorcommand=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh status tasktracker
service.command.hbmaster.monitorcommand=/opt/mapr/hbase/hbase-0.90.2/bin/hbase-daemon.sh status master
service.command.hbregion.monitorcommand=/opt/mapr/hbase/hbase-0.90.2/bin/hbase-daemon.sh status regionserver
service.command.cldb.monitorcommand=/etc/init.d/mapr-cldb status
service.command.kvstore.monitorcommand=/etc/init.d/mapr-mfs status
service.command.mfs.monitorcommand=/etc/init.d/mapr-mfs status
service.command.nfs.monitorcommand=/etc/init.d/mapr-nfsserver status
service.command.hoststats.monitorcommand=/etc/init.d/mapr-hoststats status
service.command.webserver.monitorcommand=/opt/mapr/adminuiapp/webserver status
service.command.jt.heapsize.percent=10
service.command.jt.heapsize.max=5000
service.command.jt.heapsize.min=256
service.command.tt.heapsize.percent=2
service.command.tt.heapsize.max=325
service.command.tt.heapsize.min=64
service.command.hbmaster.heapsize.percent=4
service.command.hbmaster.heapsize.max=512
service.command.hbmaster.heapsize.min=64
service.command.hbregion.heapsize.percent=25
service.command.hbregion.heapsize.max=4000
service.command.hbregion.heapsize.min=1000
service.command.cldb.heapsize.percent=8
service.command.cldb.heapsize.max=4000
service.command.cldb.heapsize.min=256
service.command.mfs.heapsize.percent=20
service.command.mfs.heapsize.min=512
service.command.webserver.heapsize.percent=3
service.command.webserver.heapsize.max=750
service.command.webserver.heapsize.min=512
service.command.os.heapsize.percent=3
service.command.os.heapsize.max=750
service.command.os.heapsize.min=256
service.nice.value=-10
zookeeper.servers=10.250.1.61:5181
nodes.mincount=1
services.retries=3
cldb.port=7222
mfs.port=5660
hbmaster.port=60000
hoststats.port=5660
jt.port=9001
kvstore.port=5660
mapr.home.dir=/opt/mapr

Glossary
Term Definition

.dfs_attributes A special file in every directory, for controlling the compression and chunk size used for the directory and its subdirectories.

.rw A special mount point in the root-level volume (or read-only mirror) that points to the writable original copy of the volume.

.snapshot A special directory in the top level of each volume, containing all the snapshots for that volume.

access control list A list of permissions attached to an object. An access control list (ACL) specifies users or system processes that can perform specific actions on an object.

accounting entity A clearly defined economic unit that is accounted for separately.

ACL See access control list.

advisory quota An advisory disk capacity limit that can be set for a volume, user, or group. When disk usage exceeds the advisory quota, an alert is sent.

AE See accounting entity.

bitmask A binary number in which each bit controls a single toggle.

CLDB See container location database.

container The unit of sharded storage in a MapR cluster.

container location database A service, running on one or more MapR nodes, that maintains the locations of services, containers, and other cluster information.

desired replication factor The number of copies of a volume, not including the original, that should be maintained by the MapR cluster for normal operation.

dump file A file containing data from a volume for distribution or restoration. There are two types of dump files: full dump files, which contain all the data in a volume, and incremental dump files, which contain the changes to a volume between two points in time.

entity A user or group. Users and groups can represent accounting entities.

full dump file See dump file.

HBase A distributed storage system, designed to scale to a very large size, for managing massive amounts of structured data.

heartbeat A signal sent by each FileServer and NFS node every second to provide information to the CLDB about the node's health and resource usage.

incremental dump file See dump file.

JobTracker The process responsible for submitting and tracking MapReduce jobs. The JobTracker sends individual tasks to TaskTrackers on nodes in the cluster.

MapR-FS The NFS-mountable, distributed, high-performance MapR data storage system.

minimum replication factor The minimum number of copies of a volume, not including the original, that should be maintained by the MapR cluster for normal operation. When the replication factor falls below this minimum, writes to the volume are disabled.

mirror A read-only physical copy of a volume.

name container A container that holds a volume's namespace information and file chunk locations, and the first 64 KB of each file in the volume.

Network File System A protocol that allows a user on a client computer to access files over a network as though they were stored locally.

NFS See Network File System.

node An individual server (physical or virtual machine) in a cluster.

quota A disk capacity limit that can be set for a volume, user, or group. When disk usage exceeds the quota, no more data can be written.

recovery point objective The maximum allowable data loss, expressed as a point in time. If the recovery point objective is 2 hours, then the maximum acceptable amount of data loss is 2 hours of work.

recovery time objective The maximum allowable time to recover after data loss. If the recovery time objective is 5 hours, then it must be possible to restore data up to the recovery point objective within 5 hours. See also recovery point objective.

replication factor The number of copies of a volume, not including the original.

RPO See recovery point objective.

RTO See recovery time objective.

schedule A group of rules that specify recurring points in time at which certain actions occur.

snapshot A read-only logical image of a volume at a specific point in time.

storage pool A unit of storage made up of one or more disks. By default, MapR storage pools contain two or three disks. For high-volume reads and writes, you can create larger storage pools when initially formatting storage during cluster creation.

stripe width The number of disks in a storage pool.

super group The group that has administrative access to the MapR cluster.

super user The user that has administrative access to the MapR cluster.

TaskTracker The process that starts and tracks MapReduce tasks on a node. The TaskTracker receives task assignments from the JobTracker and reports the results of each task back to the JobTracker on completion.

volume A tree of files, directories, and other volumes, grouped for the purpose of applying a policy or set of policies to all of them at once.

warden A MapR process that coordinates the starting and stopping of configured services on a node.

ZooKeeper A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
