7 Essential Hadoop Tools

In This Chapter:

* The Hive SQL-like query tool is explained using two examples.
* The Sqoop RDBMS tool is used to import and export data from MySQL to/from HDFS.
* The Flume streaming data transport utility is configured to capture weblog data into HDFS.
* The Oozie workflow manager is used to run basic and complex Hadoop workflows.
* The distributed HBase database is used to store and access data on a Hadoop cluster.

The Hadoop ecosystem offers many tools to help with data input, high-level processing, workflow management, and creation of huge databases. Each tool is managed as a separate Apache Software Foundation project, but is designed to operate with the core Hadoop services, including HDFS, YARN, and MapReduce. Background on each tool is provided in this chapter, along with a start-to-finish example.

Using Apache Pig

Apache Pig is a high-level language that enables programmers to write complex MapReduce transformations using a simple scripting language. Pig Latin (the actual language) defines a set of transformations on a data set such as aggregate, join, and sort. Pig is often used for extract, transform, and load (ETL) data pipelines, quick research on raw data, and iterative data processing.

Apache Pig has several usage modes. The first is a local mode in which all processing is done on the local machine. The non-local (cluster) modes are MapReduce and Tez. These modes execute the job on the cluster using either the MapReduce engine or the optimized Tez engine. (Tez, which is Hindi for "speed," optimizes multistep Hadoop jobs such as those found in many Pig queries.) There are also interactive and batch modes available; they enable Pig applications to be developed locally in interactive modes, using small amounts of data, and then run at scale on the cluster in a production mode. The modes are summarized in Table 7.1.

Table 7.1 Apache Pig Usage Modes

                   Local Mode   Tez Local Mode   MapReduce Mode   Tez Mode
Interactive Mode   Yes          Experimental     Yes              Yes
Batch Mode         Yes          Experimental     Yes              Yes

Pig Example Walk-Through

For this example, the following software environment is assumed. Other environments should work in a similar fashion.

* OS: Linux
* Platform: RHEL 6.6
* Hortonworks HDP 2.2 with Hadoop version: 2.6
* Pig version: 0.14.0

If you are using the pseudo-distributed installation from Chapter 2, "Installation Recipes," instructions for installing Pig are provided in that chapter. More information on installing Pig by hand can be found on the Pig website: http://pig.apache.org/#Getting+Started. Apache Pig is also installed as part of the Hortonworks HDP Sandbox.

In this simple example, Pig is used to extract user names from the /etc/passwd file. A full description of the Pig Latin language is beyond the scope of this introduction, but more information about Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html. The following example assumes the user is hdfs, but any valid user with access to HDFS can run the example.

To begin the example, copy the passwd file to a working directory for local Pig operation:

$ cp /etc/passwd .
Next, copy the data file into HDFS for Hadoop MapReduce operation:

$ hdfs dfs -put passwd passwd

You can confirm the file is in HDFS by entering the following command:

$ hdfs dfs -ls passwd
-rw-r--r--   2 hdfs hdfs       2526 2015-03-17 11:08 passwd

In the following example of local Pig operation, all processing is done on the local machine (Hadoop is not needed). First, the interactive command line is started:

$ pig -x local

If Pig starts correctly, you will see a grunt> prompt. You may also see a bunch of INFO messages, which you can ignore. Next, enter the following commands to load the passwd file and then grab the user name and dump it to the terminal. Note that Pig commands must end with a semicolon (;).

grunt> A = load 'passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> dump B;

The processing will start and a list of user names will be printed to the screen. To exit the interactive session, enter the command quit.

grunt> quit

To use Hadoop MapReduce, start Pig as follows (or just enter pig):

$ pig -x mapreduce

The same sequence of commands can be entered at the grunt> prompt. You may wish to change the $0 argument to pull out other items in the passwd file. In the case of this simple script, you will notice that the MapReduce version takes much longer. Also, because we are running this application under Hadoop, make sure the file is placed in HDFS.

If you are using the Hortonworks HDP distribution with Tez installed, the Tez engine can be used as follows:

$ pig -x tez

Pig can also be run from a script. An example script (id.pig) is available from the example code download (see Appendix A, "Book Webpage and Code Download"). This script, which is repeated here, is designed to do the same things as the interactive version:

A = load 'passwd' using PigStorage(':');  -- load the passwd file
B = foreach A generate $0 as id;  -- extract the user IDs
dump B;
store B into 'id.out';  -- write the results to a directory named id.out

Comments are delineated by /* */ and -- at the end of a line. The script will create a directory called id.out for the results. First, ensure that the id.out directory is not in your local directory, and then start Pig with the script on the command line:

$ /bin/rm -r id.out/
$ pig -x local id.pig

If the script worked correctly, you should see at least one data file with the results inside the id.out directory. To run the MapReduce version, use the same procedure; the only difference is that now all reading and writing takes place in HDFS.

$ pig id.pig

If Apache Tez is installed, you can run the example script using the -x tez option:

$ pig -x tez id.pig

You can learn more about writing Pig scripts at http://pig.apache.org/docs/r0.14.0/start.html.
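As a further illustration of the aggregation capability mentioned at the start of this section, the same passwd data can be grouped and counted. The short session below is only a sketch and is not part of the book's example code; it assumes the standard seven-field /etc/passwd layout, where $6 is the login shell.

grunt> A = load 'passwd' using PigStorage(':');
grunt> B = group A by $6;                       -- group records by login shell
grunt> C = foreach B generate group, COUNT(A);  -- count users per shell
grunt> dump C;

The dump should print one (shell, count) pair for each distinct login shell found in the file.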
Using Apache Hive

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, ad hoc queries, and the analysis of large data sets using a SQL-like language called HiveQL. Hive is considered the de facto standard for interactive SQL queries over petabytes of data using Hadoop and offers the following features:

* Tools to enable easy data extraction, transformation, and loading (ETL)
* A mechanism to impose structure on a variety of data formats
* Access to files stored either directly in HDFS or in other data storage systems such as HBase
* Query execution via MapReduce and Tez (optimized MapReduce)

Hive provides users who are already familiar with SQL the capability to query the data on Hadoop clusters. At the same time, Hive makes it possible for programmers who are familiar with the MapReduce framework to add their custom mappers and reducers to Hive queries. Hive queries can also be dramatically accelerated using the Apache Tez framework under YARN in Hadoop version 2.

Hive Example Walk-Through

For this example, the following software environment is assumed. Other environments should work in a similar fashion.

* OS: Linux
* Platform: RHEL 6.6
* Hortonworks HDP 2.2 with Hadoop version: 2.6
* Hive version: 0.14.0

If you are using the pseudo-distributed installation from Chapter 2, instructions for installing Hive are provided in that chapter. More information on installation can be found on the Hive website: http://hive.apache.org. Hive is also installed as part of the Hortonworks HDP Sandbox. Although the following example assumes the user is hdfs, any valid user with access to HDFS can run the example.

To start Hive, simply enter the hive command. If Hive starts correctly, you should get a hive> prompt.

$ hive
(some messages may show up here)
hive>

As a simple test, create and drop a table. Note that Hive commands must end with a semicolon (;).

hive> CREATE TABLE pokes (foo INT, bar STRING);
OK
Time taken: 1.705 seconds
hive> SHOW TABLES;
OK
pokes
Time taken: 0.174 seconds, Fetched: 1 row(s)
hive> DROP TABLE pokes;
OK
Time taken: 4.038 seconds

A more detailed example can be developed using a web server log file to summarize message types. First, create a table using the following command:

hive> CREATE TABLE logs(t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
OK
Time taken: 0.129 seconds

Next, load the data—in this case, from the sample.log file. This file is available from the example code download (see Appendix A). Note that the file is found in the local directory and not in HDFS.

hive> LOAD DATA LOCAL INPATH 'sample.log' OVERWRITE INTO TABLE logs;
Loading data to table default.logs
Table default.logs stats: [numFiles=1, numRows=0, totalSize=99271, rawDataSize=0]
OK
Time taken: 0.953 seconds

Finally, apply the select step to the file. Note that this invokes a Hadoop MapReduce operation. The results appear at the end of the output (e.g., totals for the message types DEBUG, ERROR, and so on).

hive> SELECT t4 AS sev, COUNT(*) AS cnt FROM logs WHERE t4 LIKE '[%' GROUP BY t4;
Query ID = hdfs_20150327130000_de1a265-a5d7-4ed8-...
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1427397392757_0001, Tracking URL = http://norbert:8088/proxy/application_1427397392757_0001/
Kill Command = /opt/hadoop-2.6.0/bin/hadoop job -kill job_1427397392757_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-03-27 13:00:17,399 Stage-1 map = 0%, reduce = 0%
2015-03-27 13:00:26,100 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.4 sec
2015-03-27 13:00:34,979 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.07 sec
MapReduce Total cumulative CPU time: 4 seconds 70 msec
Ended Job = job_1427397392757_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1  Cumulative CPU: 4.07 sec  HDFS Read: 106384  HDFS Write: 63  SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 70 msec
OK
[DEBUG]  434
[ERROR]  3
[FATAL]  1
[INFO]   96
[TRACE]  16
[WARN]   4
Time taken: 32.624 seconds, Fetched: 6 row(s)

To exit Hive, simply type exit;:

hive> exit;

A More Advanced Hive Example

A more advanced usage case from the Hive documentation involves the movie rating data files obtained from the GroupLens Research (http://grouplens.org/datasets/movielens) webpage. The data are taken from the MovieLens website (http://movielens.org). The files contain various numbers of movie reviews, starting at 100,000 and going up to 20 million entries. The data files and queries used in the following example are available from the book website (see Appendix A).

In this example, 100,000 records will be transformed from userid, movieid, rating, unixtime to userid, movieid, rating, and weekday using Apache Hive and a Python program (i.e., the UNIX time notation will be transformed to the day of the week). The first step is to download and extract the data:

$ wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
$ unzip ml-100k.zip
$ cd ml-100k

Before we use Hive, we will create a short Python program called weekday_mapper.py with the following contents:

import sys
import datetime

for line in sys.stdin:
  line = line.strip()
  userid, movieid, rating, unixtime = line.split('\t')
  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
  print '\t'.join([userid, movieid, rating, str(weekday)])
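Before wiring the script into Hive, it can be sanity-checked from the shell. This quick test is not part of the book's example code; it simply assumes u.data is in the current directory and that the python command invokes Python 2:

$ head -4 u.data | python weekday_mapper.py

Each output line should contain four tab-separated fields, with the original UNIX timestamp replaced by an integer weekday (1 through 7).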
Next, start Hive and create the data table (u_data) by entering the following at the hive> prompt:

hive> CREATE TABLE u_data (
        userid INT,
        movieid INT,
        rating INT,
        unixtime STRING)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE;

Load the movie data into the table with the following command:

hive> LOAD DATA LOCAL INPATH './u.data' OVERWRITE INTO TABLE u_data;

The number of rows in the table can be reported by entering the following command:

hive> SELECT COUNT(*) FROM u_data;

This command will start a single MapReduce job and should finish with the following lines:

MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1  Cumulative CPU: 2.26 sec  HDFS Read: 1979380  HDFS Write: 7  SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 260 msec
OK
100000
Time taken: 28.366 seconds, Fetched: 1 row(s)

Now that the table data are loaded, use the following command to make the new table (u_data_new):

hive> CREATE TABLE u_data_new (
        userid INT,
        movieid INT,
        rating INT,
        weekday INT)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t';

The next command adds the weekday_mapper.py to Hive resources:

hive> add FILE weekday_mapper.py;

Once weekday_mapper.py is successfully loaded, we can enter the transformation query:

hive> INSERT OVERWRITE TABLE u_data_new
      SELECT
        TRANSFORM (userid, movieid, rating, unixtime)
        USING 'python weekday_mapper.py'
        AS (userid, movieid, rating, weekday)
      FROM u_data;

If the transformation was successful, the following final portion of the output should be displayed:

Table default.u_data_new stats: [numFiles=1, numRows=100000, totalSize=1179173, rawDataSize=1079173]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Cumulative CPU: 3.44 sec  HDFS Read: 1979380  HDFS Write: 1179256  SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 440 msec
OK
Time taken: 24.06 seconds

The final query will sort and group the reviews by weekday:

hive> SELECT weekday, COUNT(*) FROM u_data_new GROUP BY weekday;

Final output for the review counts by weekday should look like the following:

Stage-Stage-1: Map: 1  Reduce: 1  Cumulative CPU: 2.39 sec  HDFS Read: 1179386  HDFS Write: 56  SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 390 msec
OK
1       13278
2       14816
3       15426
4       13774
5       17964
6       12318
7       12424
Time taken: 22.645 seconds, Fetched: 7 row(s)

As shown previously, you can remove the tables used in this example with the DROP TABLE command. In this case, we are also using the -e command-line option. Note that queries can be loaded from files using the -f option as well.

$ hive -e 'drop table u_data_new'
$ hive -e 'drop table u_data'
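As a brief illustration of the -f option mentioned above, the same cleanup could be placed in a file and run non-interactively. The file name here (cleanup.hql) is hypothetical and not part of the book's example code:

$ echo "DROP TABLE IF EXISTS u_data_new; DROP TABLE IF EXISTS u_data;" > cleanup.hql
$ hive -f cleanup.hql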
Using Apache Sqoop to Acquire Relational Data

Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) into the Hadoop Distributed File System (HDFS), transform the data in Hadoop, and then export the data back into an RDBMS. Sqoop can be used with any Java Database Connectivity (JDBC)-compliant database and has been tested on Microsoft SQL Server, PostgreSQL, MySQL, and Oracle. In version 1 of Sqoop, data were accessed using connectors written for specific databases. Version 2 (in beta) does not support connectors or version 1 data transfer from a RDBMS directly to Hive or HBase, or data transfer from Hive or HBase to your RDBMS. Instead, version 2 offers more generalized ways to accomplish these tasks.

The remainder of this section provides a brief overview of how Sqoop works with Hadoop. In addition, a basic Sqoop example walk-through is demonstrated. To fully explore Sqoop, more information can be found by consulting the Sqoop project website: http://sqoop.apache.org.

Apache Sqoop Import and Export Methods

Figure 7.1 describes the Sqoop data import (to HDFS) process. The data import is done in two steps. In the first step, shown in the figure, Sqoop examines the database to gather the necessary metadata for the data to be imported. The second step is a map-only (no reduce step) Hadoop job that Sqoop submits to the cluster. This job does the actual data transfer using the metadata captured in the previous step. Note that each node doing the import must have access to the database.

The imported data are saved in an HDFS directory. Sqoop will use the database name for the directory, or the user can specify any alternative directory where the files should be populated. By default, these files contain comma-delimited fields, with new lines separating different records. You can easily override the format in which data are copied over by explicitly specifying the field separator and record terminator characters. Once placed in HDFS, the data are ready for processing.

Figure 7.1 Two-step Apache Sqoop data import method (Adapted from Apache Sqoop Documentation)

Data export from the cluster works in a similar two-step fashion, as shown in Figure 7.2. As in the import process, the first step is to examine the database for metadata. The export step again uses a map-only Hadoop job to write the data to the database. Sqoop divides the input data set into splits, then uses individual map tasks to push the splits to the database. Again, this process assumes the map tasks have access to the database.

Figure 7.2 Two-step Sqoop data export method (Adapted from Apache Sqoop Documentation)

Apache Sqoop Version Changes

Sqoop Version 1 uses specialized connectors to access external systems. These connectors are often optimized for various RDBMSs or for systems that do not support JDBC. Connectors are plug-in components based on Sqoop's extension framework and can be added to any existing Sqoop installation. Once a connector is installed, Sqoop can use it to efficiently transfer data between Hadoop and the external store supported by the connector. By default, Sqoop version 1 includes connectors for popular databases such as MySQL, PostgreSQL, Oracle, SQL Server, and DB2. It also supports direct transfer to and from the RDBMS to HBase or Hive.

In contrast, to streamline the Sqoop input methods, Sqoop version 2 no longer supports specialized connectors or direct import into HBase or Hive. All imports and exports are done through the JDBC interface. Table 7.2 summarizes the changes from version 1 to version 2. Due to these changes, any new development should be done with Sqoop version 2.

Table 7.2 Apache Sqoop Version Comparison

Feature: Connectors for all major RDBMSs
  Sqoop Version 1: Supported.
  Sqoop Version 2: Not supported. Use the generic JDBC connector.

Feature: Kerberos security integration
  Sqoop Version 1: Supported.
  Sqoop Version 2: Not supported.

Feature: Data transfer from RDBMS to Hive or HBase
  Sqoop Version 1: Supported.
  Sqoop Version 2: Not supported. First import data from RDBMS into HDFS, then load data into Hive or HBase manually.

Feature: Data transfer from Hive or HBase to RDBMS
  Sqoop Version 1: Not supported. First export data from Hive or HBase into HDFS, and then use Sqoop for export.
  Sqoop Version 2: Not supported. First export data from Hive or HBase into HDFS, and then use Sqoop for export.
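As noted earlier, the default comma-delimited import format can be overridden on the command line. The following sketch only shows where those options fit; the connect string, table, and paths are placeholders rather than part of this chapter's walk-through, while --fields-terminated-by and --lines-terminated-by are standard Sqoop 1 import arguments:

$ sqoop import --connect jdbc:mysql://dbhost/dbname --username dbuser -P \
    --table mytable --target-dir /user/hdfs/mytable-import \
    --fields-terminated-by '|' --lines-terminated-by '\n' -m 1

Here the field separator for the files written to HDFS is changed to a vertical bar and each record remains on its own line.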
Sqoop Example Walk-Through

The following simple example illustrates use of Sqoop. It can be used as a foundation from which to explore the other capabilities offered by Apache Sqoop. The following steps will be performed:

1. Download Sqoop.
2. Download and load sample MySQL data.
3. Add Sqoop user permissions for the local machine and cluster.
4. Import data from MySQL to HDFS.
5. Export data from HDFS to MySQL.

For this example, the following software environment is assumed. Other environments should work in a similar fashion.

* OS: Linux
* Platform: RHEL 6.6
* Hortonworks HDP 2.2 with Hadoop version: 2.6
* Sqoop version: 1.4.5

A working installation of MySQL is also required; more information on Sqoop itself can be found on the Sqoop website. The Sqoop node acts as an entry point for all connecting Sqoop clients. Because the Sqoop node is a Hadoop MapReduce client, it requires a Hadoop installation and access to HDFS. To install Sqoop using the HDP distribution RPM files, simply enter:

# yum install sqoop sqoop-metastore

Step 1: Download and Load Sample MySQL Data

For this example, we will use the world example database from the MySQL site (http://dev.mysql.com/doc/world-setup/en/index.html). This database has three tables:

* Country: information about countries of the world
* City: information about some of the cities in those countries
* CountryLanguage: languages spoken in each country

To get the database, use wget to download and then extract the file:

$ wget http://downloads.mysql.com/docs/world_innodb.sql.gz
$ gunzip world_innodb.sql.gz

Next, log into MySQL (assumes you have privileges to create a database) and import the desired database by following these steps:

$ mysql -u root -p
mysql> CREATE DATABASE world;
mysql> USE world;
mysql> SOURCE world_innodb.sql;
mysql> SHOW TABLES;
+-----------------+
| Tables_in_world |
+-----------------+
| City            |
| Country         |
| CountryLanguage |
+-----------------+
3 rows in set (0.01 sec)

The following MySQL commands will let you see the table details (output omitted for clarity):

mysql> SHOW CREATE TABLE Country;
mysql> SHOW CREATE TABLE City;
mysql> SHOW CREATE TABLE CountryLanguage;

Step 2: Add Sqoop User Permissions for the Local Machine and Cluster

In MySQL, add the following privileges for user sqoop. Note that you must use both the local host name and the cluster subnet for Sqoop to work properly. Also, for the purposes of this example, the sqoop password is "sqoop."

mysql> GRANT ALL PRIVILEGES ON world.* TO 'sqoop'@'limulus' IDENTIFIED BY 'sqoop';
mysql> GRANT ALL PRIVILEGES ON world.* TO 'sqoop'@'10.0.0.%' IDENTIFIED BY 'sqoop';
mysql> quit

Next, log in as sqoop to test the permissions:

$ mysql -u sqoop -p
mysql> USE world;
mysql> SHOW TABLES;
+-----------------+
| Tables_in_world |
+-----------------+
| City            |
| Country         |
| CountryLanguage |
+-----------------+
3 rows in set (0.01 sec)

mysql> quit

Step 3: Import Data Using Sqoop

As a test, we can use Sqoop to list databases in MySQL. The results appear after the warnings at the end of the output. Note the use of the local host name (limulus) in the JDBC statement.

$ sqoop list-databases --connect jdbc:mysql://limulus/world --username sqoop --password sqoop
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
14/08/18 14:38:55 INFO sqoop.Sqoop: Running Sqoop version: 1.4.4.2.1.2.1-471
14/08/18 14:38:55 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
14/08/18 14:38:55 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
test
world

In a similar fashion, you can use Sqoop to connect to MySQL and list the tables in the world database:

$ sqoop list-tables --connect jdbc:mysql://limulus/world --username sqoop --password sqoop
14/08/18 14:39:43 INFO sqoop.Sqoop: Running Sqoop version: 1.4.4.2.1.2.1-471
14/08/18 14:39:43 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
14/08/18 14:39:43 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
City
Country
CountryLanguage

To import data, we need to make a directory in HDFS:

$ hdfs dfs -mkdir sqoop-mysql-import

The following command imports the Country table into HDFS. The option --table signifies the table to import, --target-dir is the directory created previously, and -m 1 tells Sqoop to use one map task to import the data.

$ sqoop import --connect jdbc:mysql://limulus/world --username sqoop --password sqoop --table Country -m 1 --target-dir /user/hdfs/sqoop-mysql-import/country
...
14/08/18 16:47:15 INFO mapreduce.ImportJobBase: Transferred 30.752 KB
14/08/18 16:47:15 INFO mapreduce.ImportJobBase: Retrieved 239 records.

The import can be confirmed by examining HDFS:

$ hdfs dfs -ls sqoop-mysql-import/country
Found 2 items
-rw-r--r--   2 hdfs hdfs          0 2014-08-18 16:47 sqoop-mysql-import/country/_SUCCESS
-rw-r--r--   2 hdfs hdfs      31490 2014-08-18 16:47 sqoop-mysql-import/country/part-m-00000

The file can be viewed using the hdfs dfs -cat command:

$ hdfs dfs -cat sqoop-mysql-import/country/part-m-00000
ABW,Aruba,North America,Caribbean,193.0,null,103000,78.4,828.0,793.0,Aruba,Nonmetropolitan Territory of The Netherlands,Beatrix,129,AW
ZWE,Zimbabwe,Africa,Eastern Africa,390757.0,1980,11669000,37.8,5951.0,8670.0,Zimbabwe,Republic,Robert G. Mugabe,4068,ZW

To make the Sqoop command more convenient, you can create an options file and use it on the command line. Such a file enables you to avoid having to rewrite the same options. For example, a file called world-options.txt with the following contents will include the import command and the --connect, --username, and --password options:

import
--connect
jdbc:mysql://limulus/world
--username
sqoop
--password
sqoop

The same import command can be performed with the following shorter line:

$ sqoop --options-file world-options.txt --table City -m 1 --target-dir /user/hdfs/sqoop-mysql-import/city

It is also possible to include an SQL query in the import step. For example, suppose we want just cities in Canada:

SELECT ID,Name from City WHERE CountryCode='CAN'

In such a case, we can include the --query option in the Sqoop import request. The --query option also needs a variable called $CONDITIONS, which will be explained next. In the following query example, a single mapper task is designated with the -m 1 option:

$ sqoop --options-file world-options.txt -m 1 --target-dir /user/hdfs/sqoop-mysql-import/canada-city --query "SELECT ID,Name from City WHERE CountryCode='CAN' AND \$CONDITIONS"

Inspecting the results confirms that only cities from Canada have been imported:

$ hdfs dfs -cat sqoop-mysql-import/canada-city/part-m-00000
1810,Montréal
1811,Calgary
1812,Toronto
1856,Sudbury
1857,Kelowna
1858,Barrie

Since there was only one mapper process, only one copy of the query needed to be run on the database.
The results are also reported in a single file (part-m-00000). Multiple mappers can be used to run the query in parallel by combining the -m option with the --split-by option, which tells Sqoop which column to use when dividing the query across map tasks. If the values of that column (often the primary key) are not uniformly distributed, it may be necessary to split on a different column. The following example illustrates the use of the --split-by option. First, remove the results of the previous query:

$ hdfs dfs -rm -r -skipTrash sqoop-mysql-import/canada-city

Next, run the query using four mappers (-m 4), where we split by the ID number (--split-by ID):

$ sqoop --options-file world-options.txt -m 4 --target-dir /user/hdfs/sqoop-mysql-import/canada-city --query "SELECT ID,Name from City WHERE CountryCode='CAN' AND \$CONDITIONS" --split-by ID

If we look at the number of results files, we find four files corresponding to the four mappers we requested in the command:

$ hdfs dfs -ls sqoop-mysql-import/canada-city
Found 5 items
-rw-r--r--   2 hdfs hdfs          0 2014-08-18 21:31 sqoop-mysql-import/canada-city/_SUCCESS
-rw-r--r--   2 hdfs hdfs        175 2014-08-18 21:31 sqoop-mysql-import/canada-city/part-m-00000
-rw-r--r--   2 hdfs hdfs        153 2014-08-18 21:31 sqoop-mysql-import/canada-city/part-m-00001
-rw-r--r--   2 hdfs hdfs        186 2014-08-18 21:31 sqoop-mysql-import/canada-city/part-m-00002
-rw-r--r--   2 hdfs hdfs        182 2014-08-18 21:31 sqoop-mysql-import/canada-city/part-m-00003

Step 4: Export Data from HDFS to MySQL

Sqoop can also be used to export data from HDFS. The first step is to create tables for the exported data. There are actually two tables needed for each exported table. The first table holds the exported data (CityExport), and the second is used for staging the exported data (CityExportStaging). Enter the following MySQL commands to create these tables:

mysql> CREATE TABLE `CityExport` (
         `ID` int(11) NOT NULL AUTO_INCREMENT,
         `Name` char(35) NOT NULL DEFAULT '',
         `CountryCode` char(3) NOT NULL DEFAULT '',
         `District` char(20) NOT NULL DEFAULT '',
         `Population` int(11) NOT NULL DEFAULT '0',
         PRIMARY KEY (`ID`));

mysql> CREATE TABLE `CityExportStaging` (
         `ID` int(11) NOT NULL AUTO_INCREMENT,
         `Name` char(35) NOT NULL DEFAULT '',
         `CountryCode` char(3) NOT NULL DEFAULT '',
         `District` char(20) NOT NULL DEFAULT '',
         `Population` int(11) NOT NULL DEFAULT '0',
         PRIMARY KEY (`ID`));

Next, create a cities-export-options.txt file similar to the world-options.txt file created previously, but use the export command instead of the import command. The following command will export the cities data we previously imported back into MySQL:

$ sqoop --options-file cities-export-options.txt --table CityExport --staging-table CityExportStaging --clear-staging-table --export-dir /user/hdfs/sqoop-mysql-import/city

Finally, to make sure everything worked correctly, check the table in MySQL to see if the cities are in the table:

mysql> SELECT * FROM CityExport LIMIT 10;
+----+----------------+-------------+---------------+------------+
| ID | Name           | CountryCode | District      | Population |
+----+----------------+-------------+---------------+------------+
|  1 | Kabul          | AFG         | Kabol         |    1780000 |
|  2 | Qandahar       | AFG         | Qandahar      |     237500 |
|  3 | Herat          | AFG         | Herat         |     186800 |
|  4 | Mazar-e-Sharif | AFG         | Balkh         |     127800 |
|  5 | Amsterdam      | NLD         | Noord-Holland |     731200 |
|  6 | Rotterdam      | NLD         | Zuid-Holland  |     593321 |
|  7 | Haag           | NLD         | Zuid-Holland  |     440900 |
|  8 | Utrecht        | NLD         | Utrecht       |     234323 |
|  9 | Eindhoven      | NLD         | Noord-Brabant |     201843 |
| 10 | Tilburg        | NLD         | Noord-Brabant |     193238 |
+----+----------------+-------------+---------------+------------+
10 rows in set (0.00 sec)

Some Handy Cleanup Commands

If you are not especially familiar with MySQL, the following commands may be helpful to clean up the examples.
To remove the table in MySQL, enter the following command:

mysql> DROP TABLE `CityExportStaging`;

To remove the data in a table, enter this command:

mysql> DELETE FROM CityExportStaging;

To clean up imported files, enter this command:

$ hdfs dfs -rm -r -skipTrash sqoop-mysql-import/{country,city,canada-city}

Using Apache Flume to Acquire Data Streams

Apache Flume is an independent agent designed to collect, transport, and store data into HDFS. Often the data transport involves a number of Flume agents that may traverse a series of machines and locations. Flume is often used for log files, social media-generated data, email messages, and just about any continuous data source.

As shown in Figure 7.3, a Flume agent is composed of three components:

* Source. The source component receives data and sends it to a channel. It can send the data to more than one channel. The input data can be from a real-time source (e.g., a weblog) or another Flume agent.
* Channel. A channel is a data queue that forwards the source data to the sink destination. It can be thought of as a buffer that manages input (source) and output (sink) flow rates.
* Sink. The sink delivers data to a destination such as HDFS, a local file, or another Flume agent.

A Flume agent must have all three of these components defined, and an agent can have several sources, channels, and sinks. Channel data are buffered in memory but may be optionally stored on disk to prevent data loss in the event of a network failure.

Figure 7.3 Flume agent with source, channel, and sink (Adapted from Apache Flume Documentation)

As shown in Figure 7.4, Flume agents may be placed in a pipeline, possibly to traverse several machines or domains. This configuration is normally used when data are collected on one machine (e.g., a web server) and sent to another machine that has access to HDFS.

Figure 7.4 Pipeline created by connecting Flume agents (Adapted from Apache Flume Documentation)

Another common configuration is the consolidation network shown in Figure 7.5, in which several agents feed a single agent that writes the data to HDFS. Flume has additional features, such as plug-ins, that are not described in depth here.

Figure 7.5 A Flume consolidation network (Adapted from Apache Flume Documentation)

Flume Example Walk-Through

Follow these steps to walk through a Flume example.

Step 1: Download and Install Apache Flume

For this example, the following software environment is assumed. Other environments should work in a similar fashion.

* OS: Linux
* Platform: RHEL 6.6
* Hortonworks HDP 2.2 with Hadoop version: 2.6
* Flume version: 1.5.2

If Flume is not installed and you are using the Hortonworks HDP repository, you can add Flume with the following command:

# yum install flume flume-agent

In addition, for the simple example, telnet will be needed:

# yum install telnet

The following examples will also require some configuration files. See Appendix A for download instructions.

Step 2: Simple Test

A simple test of Flume can be done on a single machine. To start the Flume agent, enter the flume-ng command shown here. This command uses the simple-example.conf file to configure the agent.

$ flume-ng agent --conf conf --conf-file simple-example.conf --name simple_agent -Dflume.root.logger=INFO,console

In another terminal window, use telnet to contact the agent:

$ telnet localhost 44444
Trying ::1 ...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
testing  123
OK

If Flume is working correctly, the window where the Flume agent was started will show the testing message entered in the telnet window:

14/08/14 16:20:58 INFO sink.LoggerSink: Event: { headers:{} body: 74 65 73 74 69 6E 67 20 20 31 32 33 0D    testing  123. }
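The simple-example.conf file itself comes with the book's example downloads and is not listed in the text. A minimal configuration along the same lines would wire a netcat source listening on port 44444 to a logger sink through a memory channel. The property names below are standard Flume settings, but the exact file you download may differ:

simple_agent.sources = netcat_source
simple_agent.channels = mem_channel
simple_agent.sinks = log_sink

# netcat source: listens on localhost:44444 for lines of text
simple_agent.sources.netcat_source.type = netcat
simple_agent.sources.netcat_source.bind = localhost
simple_agent.sources.netcat_source.port = 44444

# in-memory channel buffering events between source and sink
simple_agent.channels.mem_channel.type = memory

# logger sink: writes events to the agent's console log
simple_agent.sinks.log_sink.type = logger

# wire the components together
simple_agent.sources.netcat_source.channels = mem_channel
simple_agent.sinks.log_sink.channel = mem_channel

Note that the agent name (simple_agent) must match the --name argument given to flume-ng.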
Weblog Example

In this example, the weblogs from the local machine (Ambari output) will be placed into HDFS using Flume. This example is easily modified to use other weblogs from different machines. Two files are needed to configure Flume. (See the sidebar and Appendix A for file downloading instructions.)

* web-server-target-agent.conf—the target Flume agent that writes the data to HDFS
* web-server-source-agent.conf—the source Flume agent that captures the weblog data

The weblog is also mirrored on the local file system by the agent that writes to HDFS. To run the example, create the directory as root:

# mkdir /var/log/flume-hdfs
# chown hdfs:hadoop /var/log/flume-hdfs/

Next, as user hdfs, make a Flume data directory in HDFS:

$ hdfs dfs -mkdir /user/hdfs/flume-channel/

Now that you have created the data directories, you can start the Flume target agent (execute as user hdfs):

$ flume-ng agent -c conf -f web-server-target-agent.conf -n collector

This agent writes the data into HDFS and should be started before the source agent, which reads the weblogs. Note that with the HDP distribution, Flume can be started as a service when the system boots (e.g., service flume-agent start).

In this example, the source agent is started as root and will feed the weblog data to the target agent. The source agent can be run on another machine if desired.

# flume-ng agent -c conf -f web-server-source-agent.conf -n source_agent

To see if Flume is working correctly, check the local log by using the tail command. Also confirm that the flume-ng agents are not reporting any errors (the file name will vary).

$ tail -f /var/log/flume-hdfs/1430164482581-1

The contents of the local log should be identical to the data written into HDFS. You can inspect the HDFS file with the command shown below (again, the file name will vary). Note that while Flume is running, the most recent file may have the extension .tmp appended to it. The .tmp indicates that the file is still being written by Flume. The target agent can be configured to roll the file (and start another .tmp file) by setting some or all of the rollCount, rollSize, rollInterval, idleTimeout, and batchSize options in the configuration file.

$ hdfs dfs -tail flume-channel/apache_access_combined/150427/FlumeData

Both files should contain the same data. For instance, the preceding example had the following data in both files:

10.0.0.1 - - [27/Apr/2015:16:04:21 -0400] "GET /ambarinagios/nagios/nagios_alerts.php?q1=alerts&alert_type=all HTTP/1.1" 200 30801 "-" "Java/1.7.0_65"
10.0.0.1 - - [27/Apr/2015:16:04:25 -0400] "POST /cgi-bin/rrd.py HTTP/1.1" 200 784 "-" "Java/1.7.0_65"
10.0.0.1 - - [27/Apr/2015:16:04:25 -0400] "POST /cgi-bin/rrd.py HTTP/1.1" 200 508 "-" "Java/1.7.0_65"

You can modify both the target and source files to suit your system.

Flume Configuration Files

A complete explanation of Flume configuration is beyond the scope of this chapter. The Flume website has additional information on Flume configuration: http://flume.apache.org/FlumeUserGuide.html#configuration. The configurations used previously also have links to help explain the settings. Some of the important settings used in the preceding example follow.

In web-server-source-agent.conf, the following lines set the source.
Note that the weblog is acquired by using the tail command to read the log file.

source_agent.sources = apache_server
source_agent.sources.apache_server.type = exec
source_agent.sources.apache_server.command = tail -f /etc/httpd/logs/access_log

Further down in the file, the sink is defined. The source_agent.sinks.avro_sink.hostname setting is used to assign the machine that will write to HDFS. The port number is also set in the target configuration file.

source_agent.sinks = avro_sink
source_agent.sinks.avro_sink.type = avro
source_agent.sinks.avro_sink.channel = memoryChannel
source_agent.sinks.avro_sink.hostname = 192.168.93.24
source_agent.sinks.avro_sink.port = 4545

The HDFS settings are placed in the web-server-target-agent.conf file. Note the path that was used in the previous example and the data specification.

collector.sinks.HadoopOut.type = hdfs
collector.sinks.HadoopOut.channel = mc2
collector.sinks.HadoopOut.hdfs.path = /user/hdfs/flume-channel/%{log_type}/%y%m%d
collector.sinks.HadoopOut.hdfs.fileType = DataStream

The target file also defines the port and two channels (mc1 and mc2). One of these channels writes the data to the local file system, and the other writes to HDFS. The relevant lines are shown here:

collector.sources.AvroIn.port = 4545
collector.sources.AvroIn.channels = mc1 mc2
collector.sinks.LocalOut.sink.directory = /var/log/flume-hdfs
collector.sinks.LocalOut.channel = mc1

The HDFS file rollover settings create a new file when a threshold is exceeded. In this example, the threshold is defined to allow any file size and to write a new file after 10,000 events or 600 seconds.

collector.sinks.HadoopOut.hdfs.rollSize = 0
collector.sinks.HadoopOut.hdfs.rollCount = 10000
collector.sinks.HadoopOut.hdfs.rollInterval = 600

A full discussion of Flume can be found on the Flume website.

Manage Hadoop Workflows with Apache Oozie

Oozie is a workflow director system designed to run and manage multiple related Apache Hadoop jobs. For instance, complete data input and analysis may require several discrete Hadoop jobs to be run as a workflow in which the output of one job serves as the input for a successive job. Oozie provides a way to connect and control such Hadoop jobs on the cluster.

Oozie workflow jobs are represented as directed acyclic graphs (DAGs) of actions. (DAGs are basically graphs that cannot have directed loops.) Three types of Oozie jobs are permitted:

* Workflow—a specified sequence of Hadoop jobs with outcome-based decision points and control dependency; progress from one action to another cannot happen until the first action is complete.
* Coordinator—a scheduled workflow job that can run at various time intervals or when data become available.
* Bundle—a higher-level Oozie abstraction that will batch a set of coordinator jobs.

Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (e.g., MapReduce, Streaming MapReduce, Pig, Hive, and Sqoop) as well as system-specific jobs (e.g., Java programs and shell scripts). Oozie also provides a CLI and a web UI for monitoring jobs.

Figure 7.6 depicts a simple Oozie workflow. In this case, Oozie runs a basic MapReduce operation. If the application was successful, the job ends; if an error occurred, the job is killed.

Oozie workflow definitions are written in hPDL (an XML Process Definition Language). Such workflows contain several types of nodes (a skeletal example showing how they fit together appears after this list):

* Control flow nodes define the beginning and the end of a workflow. They include start, end, and optional fail nodes.
* Action nodes are where the actual processing tasks are defined. When an action node finishes, the remote systems notify Oozie and the next node in the workflow is executed. Action nodes can also include HDFS commands.
* Fork/join nodes enable parallel execution of tasks in the workflow. The fork node enables two or more tasks to run at the same time. A join node represents a rendezvous point that must wait until all forked tasks complete.
* Control flow nodes enable decisions to be made about the previous task. Control decisions are based on the results of the previous action (e.g., file size or file existence). Decision nodes are essentially switch-case statements that use JSP EL (Java Server Pages—Expression Language) expressions that evaluate to either true or false.
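The skeleton below shows how these node types appear in an hPDL file. It is a simplified sketch for orientation only—the element names follow the Oozie workflow schema, but this is not the exact workflow.xml shipped with the Oozie examples:

<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="mr-node"/>
  <action name="mr-node">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- MapReduce job configuration properties go here -->
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>MapReduce action failed</message>
  </kill>
  <end name="end"/>
</workflow-app>

The start node hands control to the action node; the ok and error elements route a successful run to the end node and a failed run to the kill node.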
Figure 7.7 depicts a more complex workflow that uses all of these node types. More information on Oozie can be found at http://oozie.apache.org/docs/4.1.0/index.html.

Figure 7.6 A simple Oozie workflow.xml DAG workflow (Adapted from Apache Oozie Documentation)

Figure 7.7 A more complex Oozie DAG workflow (Adapted from Apache Oozie Documentation)

Oozie Example Walk-Through

For this example, the following software environment is assumed. Other environments should work in a similar fashion.

* OS: Linux
* Platform: CentOS 6.6
* Hortonworks HDP 2.2 with Hadoop version: 2.6
* Oozie version: 4.1.0

If you are using the pseudo-distributed installation from Chapter 2 or want to install Oozie by hand, see the installation instructions on the Oozie website: http://oozie.apache.org. Oozie is also installed as part of the Hortonworks HDP Sandbox.

Step 1: Download Oozie Examples

The Oozie examples used in this section can be found on the book website (see Appendix A). They are also available as part of the oozie-client.noarch RPM in the Hortonworks HDP 2.x packages. For HDP 2.1, the following command can be used to extract the files into the working directory used for the demo:

$ tar xvzf /usr/share/doc/oozie-4.0.0.2.1.2.1/oozie-examples.tar.gz

For HDP 2.2, the following command will extract the files:

$ tar xvzf /usr/hdp/2.2.4.2-2/oozie/doc/oozie-examples.tar.gz

Once extracted, rename the examples directory to oozie-examples so that you will not confuse it with the other examples:

$ mv examples oozie-examples

The examples must also be placed in HDFS. Enter the following command to move the example files into HDFS:

$ hdfs dfs -put oozie-examples/ oozie-examples

The Oozie share library is installed in HDFS under /user/oozie/share/lib by the Ambari installation of HDP 2.x.

Note
In HDP 2.2+, some additional version-tagged directories may appear below this path. If you installed and built Oozie by hand, then make sure /user/oozie exists in HDFS and put the oozie-sharelib files in this directory as user oozie and group hadoop.

The example applications are found under the oozie-examples/apps directory, one directory per example. Each directory contains at least workflow.xml and job.properties files. Other files needed for each example are also in its directory. The inputs for all examples are in the oozie-examples/input-data directory. The examples will create output under the oozie-examples/output-data directory in HDFS.

Step 2: Run the Simple MapReduce Example

Move to the simple MapReduce example directory:

$ cd oozie-examples/apps/map-reduce/

This directory contains two files and a lib directory. The files are:

* The job.properties file, which defines parameters (e.g., path names, ports) for a job. This file may change per job.
* The workflow.xml file, which provides the actual workflow for the job. In this case, it is a simple MapReduce (pass/fail) job. This file usually stays the same between jobs.

The job.properties file included in the examples requires a few edits to work properly. Using a text editor, change the following lines by adding the host name of the NameNode and ResourceManager (indicated by jobTracker in the file):
This file usually stays the same between jobs. i e Jes requires a few edits to work The job. ties file included in the examples requires fe oon 9. Using ctent editor change the following lines by ang the how name a the NameNode and ResourceManager (indicated by jobTracker in the file). 158 Chapter 7 Essential Hadoop Tools nameNode=hdaf jobtracke: //localhost:: 8020 ocalhost :8032 r): to the following (note the port change for jobTracker): nameNode=hdfs : / /_HOSTNAME_: 8020 jobTracker=_HOSTNAME_:8050 For example, for the cluster created with Ambari in Chapter 2, the lines were changed to nameNode=hdfs:: //limulus :8020 JobTracker=1imulus: 8050 The examplesRoot variable mu: st also be changed to oozie-examples, reflecting the change made previously: examplesRoot=oozie-examples These changes must be done for tl he all the job.properties files in the Oozie examples that you choose to run. The DAG for the simple MapReduce example is shown in Figure 7.6. The workflow. xml file describes these simple steps and has the following workflow nodes: action name="mr-node"> * operty> hadoop .proxyuser .cozie.groupsc/name> evalve>* Ir you are using Ambari, make this change (or add the lines) in the Servicas/HOFS/ Config window and restart Hadoop. Otherwise, make the change by hand and rastart all the Hadoop daemons. This setting is required because Oozie needs to impersonate other users to run jobs. The group property can be set to a specific user group or to a wild card. This setting allows the account that runs the Oozie server to run as part of the user's grou. a To avoid having to provide the -oozie option with the Oozie URL every ume you run the oozie command, set the 00ZIE_URL environment variable as follows (using vour Oozie server host name in place of “limulus”): export OOZIE_URL=*http: //Limlus:11000/oozie* at Oozie commands without specifying the -oo2ie You can now run all subseque! learn about a particular job's URL option. For instance, using the job ID, you can progress by issuing the following command: € job -info 0000001-150626174853048-cozie-oozi essed) is shown in the following sting ay be complete by the time you issue the I be indicated in che listing The resulting output (line length compr Because this job is just a simple test, 1 ; “info command, If it is not complete, Hts progress 8! 2 1D + ooov001-150424174853088-conie-90e a" Workflow Name ; map reduce wt Nappa/ap ceare o/user/hdts/exane) *pP Path : hdfe://1imulus: 6021 Status : SUCCEEDED Ran 20 Ger : hdfs 160 Chapter 7 Essential Hadoop Too!s Ckeated + 2015-04-29 20:52 GH Started + 2015-04-29 20:52 GHT Last Modified : 2015-04-29 20:53 Gum Ended 2015-04-29 20:53 GMT CoordAction ID: Actions Status Ext ID Ext Status Err Code 1D 0000001-150424174853048-oozie ; ~oori-We:start: OK ox (0000001~-150424174853048-oozie eos Near noge OK job_1429912013449_0006 succEEDzp 0000001-150424174853048-oozie ~o0zi-Weend OK The various steps shown in the output can be related directly to the work£lo: mentioned previously. Note that the MapReduce job number is provided. This will also be listed in the ResourceManager web user interface. The application is located in HDFS under the oozie-examples/output-data/map-reduce dire: WwW. xm1 job output CtOry, Step 3: Run the Oozie Demo Application A more sophisticated example can be found in the demo directory ( epps/deno). This workflow includes MapReduce, Pig, as fork, join, decision, actior 0ozie-examples and file systera tasks as well n, start, stop, kill, and end nodes. 
Move to the demo directory and edit the job.properties file as described previously. Entering the following command runs the workflow (assuming the OOZIE_URL environment variable has been set):

$ oozie job -run -config job.properties

The progress of this workflow can be followed with the Oozie web console, which runs on the Oozie server (the same host and port used for OOZIE_URL). The main console window, which lists submitted jobs, is shown in Figure 7.8; clicking on a job opens the job information window shown in Figure 7.9.

Figure 7.8 Oozie main console window

Figure 7.9 Oozie workflow information window

The job progression results, similar to those printed by the Oozie command line, are shown in the Actions window at the bottom of the job information window. Oozie also generates a graphical representation of the workflow DAG; if the job is still running, the graph indicates the steps that have been completed thus far. The DAG for the demo example is shown in Figure 7.10 (the actual image was split to fit the page). Comparing this information with the workflow.xml file for the demo example can provide further insight into how Oozie operates.

Figure 7.10 Oozie-generated workflow DAG for the demo example, as it appears on the screen

A Short Summary of Oozie Job Commands

The following summary lists some of the more commonly encountered Oozie commands. See the latest documentation at http://oozie.apache.org for more information. (Note that the examples assume OOZIE_URL is defined.)

* Run a workflow job (returns _OOZIE_JOB_ID_): $ oozie job -run -config JOB_PROPERTIES
* Start a submitted job: $ oozie job -start _OOZIE_JOB_ID_
* Check a job's status: $ oozie job -info _OOZIE_JOB_ID_
* Suspend a workflow: $ oozie job -suspend _OOZIE_JOB_ID_
* Resume a workflow: $ oozie job -resume _OOZIE_JOB_ID_
* Rerun a workflow: $ oozie job -rerun _OOZIE_JOB_ID_ -config JOB_PROPERTIES
* Kill a job: $ oozie job -kill _OOZIE_JOB_ID_
* View server logs: $ oozie job -logs _OOZIE_JOB_ID_

Full logs are available at /var/log/oozie on the Oozie server.

Using Apache HBase

Apache HBase is an open source, distributed, versioned, nonrelational database modeled after Google's Bigtable (http://research.google.com/archive/bigtable.html). Like Bigtable, HBase leverages the distributed data storage provided by the underlying distributed file systems spread across commodity servers. Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
Some of the more important features include the following capabilities:

* Linear and modular scalability
* Strictly consistent reads and writes
* Automatic and configurable sharding of tables
* Automatic failover support between RegionServers
* Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables
* Easy-to-use Java API for client access

A table in HBase is similar to other databases, having rows and columns. Columns in HBase are grouped into column families. For example, consider a table of daily stock prices. There may be a column family called "price" that has four members—price:open, price:close, price:low, and price:high. A column does not need to be part of a family. For instance, the stock table may have a column named "volume" indicating how many shares were traded. All column family members are stored together in the physical file system.

Specific HBase cell values are identified by a row key, a column (column family and column), and a version (timestamp). It is possible to have many versions of data within an HBase cell. A version is specified as a timestamp and is created each time data are written to a cell. Almost anything can serve as a row key, from strings to binary representations of longs to serialized data structures. Rows are lexicographically sorted, with the lowest order appearing first in a table. The empty byte array denotes both the start and the end of a table's namespace. All table accesses are via the table row key, which is considered its primary key. More information on HBase can be found on the HBase website: http://hbase.apache.org.

HBase Example Walk-Through

For this example, the following software environment is assumed. Other environments should work in a similar fashion.

* OS: Linux
* Platform: CentOS 6.6
* Hortonworks HDP 2.2 with Hadoop version: 2.6
* HBase version: 0.98.4

If you are using the pseudo-distributed installation from Chapter 2 or want to install HBase by hand, see the installation instructions on the HBase website: http://hbase.apache.org. HBase is also installed as part of the Hortonworks HDP Sandbox.

The following example illustrates a small subset of HBase commands. Consult the HBase website for more background. HBase provides a shell for interactive use. To enter the shell, type the following as a user:

$ hbase shell
hbase(main):001:0>

To exit the shell, type exit.

Various commands can be conveniently entered from the shell prompt. For instance, the status command provides the system status:

hbase(main):001:0> status
4 servers, 0 dead, 1.0000 average load

Additional arguments can be added to the status command, including 'simple', 'summary', or 'detailed'. The single quotes are needed for proper operation.
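To connect the shell back to the stock-price discussion above, the following session is a generic sketch; the table name, row key, and values are made up for illustration and are not part of the book's example code:

hbase(main):002:0> create 'teststock', 'price', 'volume'
hbase(main):003:0> put 'teststock', '20150501', 'price:open', '101.5'
hbase(main):004:0> put 'teststock', '20150501', 'price:close', '103.2'
hbase(main):005:0> get 'teststock', '20150501'
hbase(main):006:0> scan 'teststock'
hbase(main):007:0> disable 'teststock'
hbase(main):008:0> drop 'teststock'

Here create defines the table with two column families, put writes individual cell values keyed by the row key (a date string in this sketch), get and scan read them back, and a table must be disabled before it can be dropped.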
