Big Data - CH04


Chapter 4: Getting Data into Hadoop

In This Chapter

• The data lake concept is presented as a new data processing paradigm.


• Basic methods for importing CSV data into HDFS and Hive tables are presented.
• Additional methods for using Spark to import data into Hive tables or directly for a Spark job are presented.
• Apache Sqoop is introduced as a tool for exporting and importing relational data into and out of HDFS.
• Apache Flume is introduced as a tool for transporting and capturing streaming data (e.g., web logs) into
HDFS.
• Apache Oozie is introduced as a workflow manager for Hadoop ingestion jobs.
• The Apache Falcon project is described as a framework for data governance (organization) on Hadoop
clusters.

No matter what kind of data needs processing, there is often a tool for importing such data into, or exporting
such data from, the Hadoop Distributed File System (HDFS). Once stored in HDFS, the data may be processed
by any number of tools available in the Hadoop ecosystem.

This chapter begins with the concept of the Hadoop data lake and then follows with a general overview of each
of the main tools for data ingestion into Hadoop—Spark, Sqoop, and Flume—along with some specific usage
examples. Workflow tools such as Oozie and Falcon are presented as tools that aid in managing the ingestion
process.

Companion materials to Practical Data Science with Hadoop and Spark by Ofer Mendelevitch, Casey Stella, and Douglas Eadline.
Copyright © 2017 Pearson Education, Inc. PowerPoints copyright © 2019; version 1.0, prepared January 2019.
Chapter Outline

• Hadoop as a Data Lake


• The Hadoop Distributed File System (HDFS)
• Direct File Transfer to Hadoop HDFS
• Importing Data from Files into Hive Tables
• Import CSV Files into Hive Tables
• Importing Data into Hive Tables Using Spark
• Import CSV Files into Hive Using Spark
• Import a JSON File into Hive Using Spark
• Using Apache Sqoop to Acquire Relational Data
• Data Import and Export with Sqoop
• Apache Sqoop Version Changes
• Using Sqoop V2: A Basic Example
• Using Apache Flume to Acquire Data Streams
• Using Flume: A Web Log Example Overview
• Manage Hadoop Work and Data Flows with Apache Oozie
• Apache Falcon
• What’s Next in Data Ingestion?
• Summary

Figure 4.1 The data warehouse versus the Hadoop data lake.

Figure 4.2 Two-step Apache Sqoop data import method.

Figure 4.3 Two-step Sqoop data export method.

Figure 4.4 Flume Agent with Source, Channel, and Sink.

Figure 4.5 Pipeline created by connecting Flume agents.

Figure 4.6 A Flume consolidation network.

Figure 4.7 A simple Oozie DAG workflow.

Figure 4.8 A more complex Oozie DAG workflow.

Figure 4.9 A simple Apache Falcon workflow.

Chapter Discussion

In this chapter we provide various ways to get data into HDFS for Hadoop/Spark processing. This is not an
exhaustive list, but it represents many of the common ways to get various types of data into HDFS. In general, the
data take three forms in this chapter: CSV files, database tables, and log (text) files.

Hadoop as a Data Lake


Explain the concept of the data lake that supports Big Data (Volume, Velocity, Variability) and how Hadoop
represents a new paradigm for processing data. Emphasize the distinction between schema-on-read and
schema-on-write: because of the volume, velocity, and variability of the data, Hadoop determines how data are used at run time.

The Hadoop Distributed File System (HDFS)


HDFS is the backbone of Hadoop (although it is not strictly required; other parallel file systems can be used).
Highlight the differences with HDFS: it is designed for streaming access (large block size), there are no random reads
or writes, and data are sliced (transparently) across servers when files are written to HDFS.
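One way to show students the block slicing and replication interactively is the hdfs fsck command (the file path here is hypothetical; any file already stored in HDFS will do):
hdfs fsck /user/hdfs/names/names.csv -files -blocks -locations
The report lists the number of blocks, the replication factor, and the DataNodes that hold each block replica.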

Direct File Transfer to Hadoop HDFS


The easiest way to get data into HDFS is to "just copy it"; however, native copy commands do not work with
HDFS because it is a separate file system. The HDFS "put" and "get" commands need to be used. These and
all other HDFS operations must be done with the "hdfs dfs ..." commands. See the appendix.

Another subtle point to mention: when processing data, Hadoop tools (Pig, Hive) and Spark do not really
care whether the data are in one big file or many smaller files. Often the input argument is a directory, and all files in
that directory are used as input. A related point that confuses students is that there will be one output file for each
mapper or reducer used.
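A minimal sketch of the basic commands (the paths and file names here are hypothetical):
hdfs dfs -mkdir -p /user/hdfs/mydata
hdfs dfs -put localfile.csv /user/hdfs/mydata
hdfs dfs -ls /user/hdfs/mydata
hdfs dfs -get /user/hdfs/mydata/localfile.csv copy-of-localfile.csv
The -put command copies a local file into HDFS, -ls lists the HDFS directory (for job output this is where the per-mapper/reducer part-* files appear), and -get copies a file back out of HDFS.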

Importing Data from Files into Hive Tables


Explain "Hive is SQL for Hadoop" and provides SQL like query language at scale (huge amounts of data spread
over many systems) Also point out Hive has two types of tables, internal and external. The example, Import CSV
Files into Hive Tables, is explained here, the same steps are provided in the Worked Examples below.
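A compact way to show the distinction on a slide (the table names here are hypothetical; the full steps appear in Example 1 below) is that an external table only registers a schema over an existing HDFS directory, while an internal (managed) table is owned by Hive, so dropping it also deletes the data:
CREATE EXTERNAL TABLE IF NOT EXISTS demo_ext(id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hdfs/demo';   -- data stay in place if this table is dropped
CREATE TABLE IF NOT EXISTS demo_int(id INT, name STRING)
STORED AS ORC;                -- Hive manages (and will delete) these data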


Importing Data into Hive Tables Using Spark


Spark can also be used for importing data. We provide two examples, Import CSV Files into Hive Using Spark
and Import a JSON File into Hive Using Spark, to give students a feel for the types of operations
needed for data import. It is assumed that students can modify some of these methods to suit their own needs.
We also use PySpark for the examples and have found that students are usually most comfortable using Python.
These examples are in the Worked Examples section below.

Using Apache Sqoop to Acquire Relational Data


Often it is desirable to import data directly from an existing database, as opposed to dumping the
database table to a CSV text file and then importing it. Point out that Sqoop is actually a MapReduce process
without the Reduce phase. It can use the power of the cluster to rapidly move JDBC-accessible database tables into
HDFS. Mention that the database structure is lost and the data are stored essentially as text files. (Sqoop V2 does not
support direct import into Hive tables.)

The Sqoop example in this chapter is also outlined below. Students may find this a good basis for developing
their own projects.
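As a quick preview of the command shape (all of the names here are placeholders; the full worked example with the MySQL world database is in Example 3 below):
sqoop import --connect jdbc:mysql://dbhost/dbname --username someuser -P --table SomeTable --target-dir /user/hdfs/sqoop-import/sometable -m 4
The -P option prompts for the database password, and -m sets the number of parallel map tasks used for the import.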

Using Apache Flume to Acquire Data Streams


Flume is good for moving "real-time" (constantly generated) data into HDFS. Mention that this is valuable for any
type of log file data collection. Emphasize that it is not a streaming interface where data are constantly being
analyzed. Flume will collect and write to a data file, then stop, close the file, and start a new file depending on
user-provided time or size limits.
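The roll behavior is set in the agent configuration file for the HDFS sink. A hypothetical fragment (the agent and sink names are placeholders; the property names follow the Flume HDFS sink documentation):
# close the current file and start a new one every 300 seconds or 64 MB, whichever comes first
agent.sinks.hdfs_sink.type = hdfs
agent.sinks.hdfs_sink.hdfs.path = /user/hdfs/flume-channel/
agent.sinks.hdfs_sink.hdfs.rollInterval = 300
agent.sinks.hdfs_sink.hdfs.rollSize = 67108864
agent.sinks.hdfs_sink.hdfs.rollCount = 0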

Manage Hadoop Work and Data Flows with Apache Oozie


We don't provide an Oozie example. The concept behind Oozie is that many analytics jobs proceed in stages
and are composed of many sub-steps using different tools. This aspect of analytics is often new to many
students, and the assemblage of workflows is an advanced concept. Important point: some students ask what the
difference is between Hadoop YARN (the resource scheduler) and Oozie (and other similar tools). YARN
schedules and manages cluster resources for individual jobs, while Oozie describes how multiple applications work together as a workflow for a specific job.
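For students who want to see what a workflow definition looks like, here is a minimal hypothetical workflow.xml sketch with a single Sqoop import action (the ${jobTracker} and ${nameNode} values would be supplied in a job.properties file; the connection string and paths are placeholders):
<workflow-app xmlns="uri:oozie:workflow:0.4" name="ingest-wf">
  <start to="sqoop-import"/>
  <action name="sqoop-import">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <command>import --connect jdbc:mysql://dbhost/world --table Country --target-dir /user/hdfs/oozie-import/country -m 1</command>
    </sqoop>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Ingestion step failed</message>
  </kill>
  <end name="end"/>
</workflow-app>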


Apache Falcon, NiFi, Atlas


Mention that data lifecycle management is a big topic in analytics. Retention and data movement tasks can
become complicated and also need to deal with compliance issues. Tools like Falcon, NiFi, and Atlas are designed
to help with these issues.

Worked Examples

Example 1: Commands for loading CSV data into a Hive table.


Location: ~/practical-data-science-with-hadoop-and-spark-master/ch04/Hive-Import-CSV
Note: Assumes the path is for user "hdfs"; change all references from /user/hdfs to your own path
in HDFS.
A. Create the names directory in HDFS
hdfs dfs -mkdir names
B. Move names.csv to HDFS
hdfs dfs -put names.csv names
C. Hive commands to create the external Hive table
CREATE EXTERNAL TABLE IF NOT EXISTS Names_text(
EmployeeID INT, FirstName STRING, Title STRING, State STRING, Laptop STRING)
COMMENT 'Employee Names'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hdfs/names';
D. Check to see if names are there:
SELECT * FROM Names_text LIMIT 5;

E. Now create the internal Hive table
CREATE TABLE IF NOT EXISTS Names(
EmployeeID INT, FirstName STRING, Title STRING, State STRING, Laptop STRING)
COMMENT 'Employee Names'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC;
F. Copy data from external table to internal table
INSERT OVERWRITE TABLE Names SELECT * FROM Names_text;
G. Check to see if names are there:
SELECT * FROM Names LIMIT 5;
H. Create partitioned table
CREATE TABLE IF NOT EXISTS Names_part(
EmployeeID INT, FirstName STRING, Title STRING, Laptop STRING)
COMMENT 'Employee names partitioned by state'
PARTITIONED BY (State STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC;
I. Select names from "PA" and place in table.
INSERT INTO TABLE Names_part PARTITION(state='PA')
SELECT EmployeeID, FirstName, Title, Laptop FROM Names_text WHERE state='PA';

J. Create table using Parquet format
CREATE TABLE IF NOT EXISTS Names_Parquet(
EmployeeID INT, FirstName STRING, Title STRING, State STRING, Laptop STRING)
COMMENT 'Employee Names'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS PARQUET;
K. Add data
INSERT OVERWRITE TABLE Names_Parquet SELECT * FROM Names_text;
L. Save external table in Parquet format
CREATE EXTERNAL TABLE IF NOT EXISTS Names_Parquet_Ext(
EmployeeID INT, FirstName STRING, Title STRING, State STRING, Laptop STRING)
COMMENT 'Employee Names'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS PARQUET
LOCATION '/user/hdfs/names-parquet';
M. Add data
INSERT OVERWRITE TABLE Names_Parquet_Ext SELECT * FROM Names_text;

Example 2: Commands to enter CSV/JSON data into Spark using PySpark
Location: ~/practical-data-science-with-hadoop-and-spark-master/ch04/Spark-Import-CSV
Assumes the path is for user "hdfs"; change all references from /home/hdfs to your own path.
A. Start PySpark and import needed modules.
$ pyspark
>>> from pyspark.sql import SQLContext
>>> from pyspark.sql.types import *
>>> from pyspark.sql import Row
B. Create the SQL context
>>> sqlContext = SQLContext(sc)
C. Read in the CSV data
>>> csv_data = sc.textFile("file:///home/hdfs/Spark-Import-CSV/names.csv")
D. Confirm the RDD type
>>> type(csv_data)
<class 'pyspark.rdd.RDD'>
E. Peek at the RDD data
>>> csv_data.take(5)
[u'EmployeeID,FirstName,Title,State,Laptop', u'10,Andrew,Manager,DE,PC', u'11,Arun,Manager,NJ,PC',
u'12,Harish,Sales,NJ,MAC', u'13,Robert,Manager,PA,MAC']
F. Split on comma
>>> csv_data = csv_data.map(lambda p: p.split(","))

G. Remove the header
>>> header=csv_data.first()
>>> csv_data = csv_data.filter(lambda p:p != header)
H. Place the RDD into a Spark DataFrame
>>> df_csv = csv_data.map(lambda p: Row(EmployeeID = int(p[0]), FirstName = p[1], Title=p[2], State=p[3],
Laptop=p[4])).toDF()
I. Show the DataFrame format
>>> df_csv
DataFrame[EmployeeID: bigint, FirstName: string, Laptop: string, State: string, Title: string]
J. Show the first 5 rows of the DataFrame
>>> df_csv.show(5)
 
+----------+---------+------+-----+--------+
|EmployeeID|FirstName|Laptop|State| Title|
+----------+---------+------+-----+--------+
| 10| Andrew| PC| DE| Manager|
| 11| Arun| PC| NJ| Manager|
| 12| Harish| MAC| NJ| Sales|
| 13| Robert| MAC| PA| Manager|
| 14| Laura| MAC| PA|Engineer|
+----------+---------+------+-----+--------+
only showing top 5 rows

K. Print the DataFrame schema
>>> df_csv.printSchema()
root
|-- EmployeeID: long (nullable = true)
|-- FirstName: string (nullable = true)
|-- Laptop: string (nullable = true)
|-- State: string (nullable = true)
|-- Title: string (nullable = true)
L. Create and write a Hive table in PySpark. Similar to using Hive; just call Hive commands:
>>> from pyspark.sql import HiveContext
>>> sqlContext = HiveContext(sc)
 
>>> sqlContext.sql("CREATE TABLE IF NOT EXISTS MoreNames_text(EmployeeID INT, FirstName STRING, Title
STRING, State STRING, Preference STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES
TERMINATED BY '\n' ")
 
>>> sqlContext.sql("CREATE TABLE IF NOT EXISTS MoreNames(EmployeeID INT, FirstName STRING, Title
STRING, State STRING, Preference STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES
TERMINATED BY '\n' STORED AS ORC")
 
>>> sqlContext.sql("LOAD DATA LOCAL INPATH 'MoreNames.txt' INTO TABLE MoreNames_text")
 
>>> sqlContext.sql("INSERT OVERWRITE TABLE MoreNames SELECT * FROM MoreNames_text")
 
>>> result = sqlContext.sql("FROM MoreNames SELECT EmployeeID, FirstName, State")

M. Read a Hive table from Spark. Only read EmployeeID and Laptop
>>> sqlContext = HiveContext(sc)
>>> df_hive = sqlContext.sql("SELECT EmployeeID, Laptop FROM names")
 
>>> df_hive.show(5)
 
+----------+------+
|EmployeeID|Laptop|
+----------+------+
| 10| PC|
| 11| PC|
| 12| MAC|
| 13| MAC|
| 14| MAC|
+----------+------+
only showing top 5 rows
 
>>> df_hive.printSchema()
root
|-- EmployeeID: integer (nullable = true)
|-- Laptop: string (nullable = true)
N. Read from JSON. Please note that Spark expects each line to be a separate JSON object, so it will
fail if you try to load a pretty-printed JSON file.
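In JSON Lines form, each record in names.json would look something like the following (a hypothetical line matching the fields shown in the output below):
{"EmployeeID": 10, "FirstName": "Andrew", "Title": "Manager", "State": "DE", "Laptop": "PC"}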
>>> from pyspark.sql import SQLContext
>>> sqlContext = SQLContext(sc)
>>> df_json = sqlContext.read.json("file:///home/hdfs/Spark-Import-CSV/names.json")
>>> df_json.show()
+----------+---------+------+-----+--------+
|EmployeeID|FirstName|Laptop|State| Title|
+----------+---------+------+-----+--------+
| 10| Andrew| PC| DE| Manager|
| 11| Arun| PC| NJ| Manager|
| 12| Harish| MAC| NJ| Sales|
| 13| Robert| MAC| PA| Manager|
| 14| Laura| MAC| PA|Engineer|
| 15| Anju| PC| PA| CEO|
| 16| Aarathi| PC| NJ| Manager|
| 17| Parvathy| MAC| DE|Engineer|
| 18| Gopika| MAC| DE| Admin|
| 19| Steven| MAC| PA|Engineer|
| 20| Michael| PC| PA| CFO|
| 21| Gokul| PC| PA| Admin|
| 23| Janet| PC| DE| Sales|
| 22| Anne| PC| PA| Admin|
| 24| Hari| PC| NJ| Admin|
| 25| Sanker| MAC| NJ| Admin|
| 26| Margaret| PC| PA| Tech|
| 27| Nirmal| MAC| PA| Tech|
| 28| jinju| MAC| PA|Engineer|
| 29| Nancy| PC| NJ| Admin|
+----------+---------+------+-----+--------+
 

>>> df_json
DataFrame[EmployeeID: bigint, FirstName: string, Laptop: string, State: string, Title: string]
 
>>> df_json.printSchema()
 
root
|-- EmployeeID: long (nullable = true)
|-- FirstName: string (nullable = true)
|-- Laptop: string (nullable = true)
|-- State: string (nullable = true)
|-- Title: string (nullable = true)

Example 3: Using Apache Sqoop
Location: ~/practical-data-science-with-hadoop-and-spark-master/ch04/Sqoop
Reference:
http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html
In this example we will extract a database table from MySQL to HDFS and then back to MySQL.
A: Download and Load Sample MySQL Data
Assumes MySQL is installed and working on the head node of the cluster (this is the easiest
method). If you are using a VM, you can install MySQL in the VM if it is not installed already.
We will use the world database from MySQL. Ref:
http://dev.mysql.com/doc/world-setup/en/index.html. Get the database:
wget http://downloads.mysql.com/docs/world_innodb.sql.gz
Load the world database into MySQL: log in as the MySQL root user, create the database, and import
the data. NOTE: This step can take a long time (10-20 minutes) in some cases.
mysql -u root -p
mysql> CREATE DATABASE world;
mysql> USE world;
mysql> SOURCE world_innodb.sql;
mysql> SHOW TABLES;
+-----------------+
| Tables_in_world |
+-----------------+
| City |
| Country |
| CountryLanguage |
+-----------------+
3 rows in set (0.01 sec)
Use these commands to see table details:
mysql> SHOW CREATE TABLE Country;
mysql> SHOW CREATE TABLE City;
mysql> SHOW CREATE TABLE CountryLanguage;
B: Add Sqoop User Permissions for Local Machine and Cluster (still in MySQL).
This step ensures that all the Sqoop processes on the nodes in the cluster can connect to the
MySQL server. The "__HOSTNAME__" item must be filled in with the host name of the machine
running MySQL. The "10.0.0.%" entry is the cluster network; this needs to be changed for your local
system. The "%" is a wildcard, so any node that has "10.0.0." as the first three octets of its IP
address will be allowed to log in to MySQL. If you are using a VM, this setting is not needed. We are
setting the user name to "sqoop" with the password "sqoop."
mysql> GRANT ALL PRIVILEGES ON world.* To 'sqoop'@'__HOSTNAME__' IDENTIFIED BY 'sqoop';
mysql> GRANT ALL PRIVILEGES ON world.* To 'sqoop'@'10.0.0.%' IDENTIFIED BY 'sqoop';
mysql> GRANT ALL PRIVILEGES ON world.* To 'sqoop'@'localhost' IDENTIFIED BY 'sqoop';
mysql> quit
Log in as sqoop to test access (the password should be "sqoop"):
mysql -u sqoop -p
mysql> USE world;
mysql> SHOW TABLES;

+-----------------+
| Tables_in_world |
+-----------------+
| City |
| Country |
| CountryLanguage |
+-----------------+
3 rows in set (0.01 sec)
 
mysql> quit

C: Import Data Using Sqoop


As a test, use Sqoop to list the databases (you may see warnings; anything that is non-fatal is
usually okay and can be ignored):
sqoop list-databases --connect jdbc:mysql://limulus/world --username sqoop --password sqoop
 
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
14/08/18 14:38:55 INFO sqoop.Sqoop: Running Sqoop version: 1.4.4.2.1.2.1-471
14/08/18 14:38:55 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using
-P instead.
14/08/18 14:38:55 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
test
world

Now use Sqoop to list the tables:
sqoop list-tables --connect jdbc:mysql://limulus/world --username sqoop --password sqoop
 
...
14/08/18 14:39:43 INFO sqoop.Sqoop: Running Sqoop version: 1.4.4.2.1.2.1-471
14/08/18 14:39:43 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using
-P instead.
14/08/18 14:39:43 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
City
Country
CountryLanguage
Make a directory for the data in HDFS:
hdfs dfs -mkdir sqoop-mysql-import
Do the import; -m is the number of map tasks:
sqoop import --connect jdbc:mysql://limulus/world --driver com.mysql.jdbc.Driver --username sqoop --password sqoop
--table Country -m 1 --target-dir /user/deadline/sqoop-mysql-import/country
...
14/08/18 16:47:15 INFO mapreduce.ImportJobBase: Transferred 30.752 KB in 12.7348 seconds
(2.4148 KB/sec)
14/08/18 16:47:15 INFO mapreduce.ImportJobBase: Retrieved 239 records.

Check to see if data showed up in HDFS
hdfs dfs -ls sqoop-mysql-import/country
Found 2 items
-rw-r--r-- 2 deadline hdfs 0 2017-12-12 12:14 sqoop-mysql-import/country/_SUCCESS
-rw-r--r-- 2 deadline hdfs 31490 2017-12-12 12:14 sqoop-mysql-import/country/part-m-00000
Check the actual data file
hdfs dfs -cat sqoop-mysql-import/country/part-m-00000
 
ABW,Aruba,North America,Caribbean,193.0,null,103000,78.4,828.0,793.0,Aruba,Nonmetropolitan
Territory of The Netherlands,Beatrix,129,AW
...
ZWE,Zimbabwe,Africa,Eastern Africa,390757.0,1980,11669000,37.8,5951.0,8670.0,Zimbabwe,
Republic,Robert G. Mugabe,4068,ZW
To clean things up on the command line, we can use an options file. This file avoids rewriting the
same options over and over again.

For example, edit the world-import-options.txt file to include the following (vi world-import-options.txt):
import
--connect
jdbc:mysql://limulus/world
--driver
com.mysql.jdbc.Driver
--username
sqoop
--password
sqoop
 
If we import the City table using the options file, the command line is shorter:
sqoop --options-file world-import-options.txt --table City -m 1 --target-dir /user/deadline/sqoop-mysql-import/city
Sqoop allows you to include a SQL query in the import step. We will do this again using a single
mapper ("-m 1"). The $CONDITIONS token is required in the WHERE clause of the query, although it only
comes into play when more than one mapper is used (see below). In this example, we are pulling only the cities from Canada.
sqoop --options-file world-import-options.txt -m 1 --target-dir /user/deadline/sqoop-mysql-import/canada-city --query
"SELECT ID,Name from City WHERE CountryCode='CAN' AND \$CONDITIONS"

Check the resulting file.
hdfs dfs -cat sqoop-mysql-import/canada-city/part-m-00000
1810,Montréal
1811,Calgary
1812,Toronto
...
1856,Sudbury
1857,Kelowna
1858,Barrie
Using Multiple Mappers to Import with Sqoop
The $CONDITIONS placeholder is needed by each mapper. If you want to import the
results of a query in parallel, then each map task will need to execute a copy of the query, with
results partitioned by bounding conditions inferred by Sqoop. Your query must include the
placeholder token $CONDITIONS, which each Sqoop process will replace with a unique condition
expression based on the "--split-by" option. You will need to select the splitting column with the
--split-by option if your primary key is not uniformly distributed. In this case we are splitting by the
ID column.
Remove the previous import. Also use the "-skipTrash" option so the files are not moved to the
HDFS .Trash directory.
hdfs dfs -rm -r -skipTrash sqoop-mysql-import/canada-city
Now start the import with 4 mappers.
sqoop --options-file world-import-options.txt -m 4 --target-dir /user/deadline/sqoop-mysql-import/canada-city --query
"SELECT ID,Name from City WHERE CountryCode='CAN' AND \$CONDITIONS" --split-by ID

Check the result; there are four result files, one for each mapper:
hdfs dfs -ls sqoop-mysql-import/canada-city
Found 5 items
-rw-r--r-- 2 deadline hdfs 0 2017-12-12 12:23 sqoop-mysql-import/canada-city/_SUCCESS
-rw-r--r-- 2 deadline hdfs 175 2017-12-12 12:23 sqoop-mysql-import/canada-city/part-m-00000
-rw-r--r-- 2 deadline hdfs 153 2017-12-12 12:23 sqoop-mysql-import/canada-city/part-m-00001
-rw-r--r-- 2 deadline hdfs 186 2017-12-12 12:23 sqoop-mysql-import/canada-city/part-m-00002
-rw-r--r-- 2 deadline hdfs 182 2017-12-12 12:23 sqoop-mysql-import/canada-city/part-m-00003
 
D: Export Data from HDFS to MySQL
We need to create a new table for the exported data in MySQL:
mysql> USE world;
 
mysql> CREATE TABLE `CityExport` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`Name` char(35) NOT NULL DEFAULT '',
`CountryCode` char(3) NOT NULL DEFAULT '',
`District` char(20) NOT NULL DEFAULT '',
`Population` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`ID`));

Do the export; note the use of a different options file:
sqoop --options-file cities-export-options.txt --table CityExport -m 4 --export-dir
/user/deadline/sqoop-mysql-import/city
Finally, check the table in MySQL
mysql> select * from CityExport limit 10;
+----+----------------+-------------+---------------+------------+
| ID | Name | CountryCode | District | Population |
+----+----------------+-------------+---------------+------------+
| 1 | Kabul | AFG | Kabol | 1780000 |
| 2 | Qandahar | AFG | Qandahar | 237500 |
| 3 | Herat | AFG | Herat | 186800 |
| 4 | Mazar-e-Sharif | AFG | Balkh | 127800 |
| 5 | Amsterdam | NLD | Noord-Holland | 731200 |
| 6 | Rotterdam | NLD | Zuid-Holland | 593321 |
| 7 | Haag | NLD | Zuid-Holland | 440900 |
| 8 | Utrecht | NLD | Utrecht | 234323 |
| 9 | Eindhoven | NLD | Noord-Brabant | 201843 |
| 10 | Tilburg | NLD | Noord-Brabant | 193238 |
+----+----------------+-------------+---------------+------------+
10 rows in set (0.00 sec)

Some Handy Clean-up Commands


1. Remove the export table from MySQL
mysql> DROP TABLE `CityExport`;
2. Clean up the imported files in HDFS (the shell expands each name in the braces):
hdfs dfs -rm -r -skipTrash sqoop-mysql-import/{country,city,canada-city}

Example 4: Using Apache Flume
Location: ~/practical-data-science-with-hadoop-and-spark-master/ch04/Flume
Reference:
https://flume.apache.org/FlumeUserGuide.html
NOTE: These examples are best done outside of a multi-user environment
 
A: Simple Flume Test
In this example, start a Flume agent defined by the file in conf/simple-example.conf
flume-ng agent --conf conf --conf-file simple-example.conf --name simple_agent -Dflume.root.logger=INFO,console
In another window, connect to the agent (assumes telnet is installed):
telnet localhost 44444
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Now type "testing 1 2 3" in the current window.
testing 1 2 3
OK
Switch back to the Flume agent window. Near the bottom (there may be lots of information on the
screen; just ignore it), you can see that the input was picked up by Flume:
14/08/14 16:20:58 INFO sink.LoggerSink: Event: { headers:{} body: 74 65 73 74 69 6E 67 20 20 31 20 32 20 33 0D
testing 1 2 3. }

B: Web Log Example
This example is designed to record the web logs from the local machine and place them into HDFS
using Flume. For simplicity, this example is done on one machine: the first Flume agent (the source
agent) reads the input from the web log and sends it to the target agent, which writes to HDFS. The
example also assumes you have an Apache web server running on the host (if not, see below).
There are two configuration files needed.
web-server-target-agent.conf - the target flume agent that writes the data to HDFS
web-server-source-agent.conf - the source flume agent that captures the web log data
For your installation, the source agent needs to know the IP address of the target agent. This is
set near the bottom of the web-server-source-agent.conf file. Change this line to reflect the IP address
of the local machine if you are using the one-machine example described here. In the example
file, the line looks like:
source_agent.sinks.avro_sink.hostname = 192.168.93.23
If the source agent is run on another machine (i.e., a web server) that transmits to the Hadoop
cluster, the source agent needs to be installed and started on the web server, and the IP address
of the Hadoop cluster node that accepts Flume data needs to be written to the web-server-source-
agent.conf file on the line above.
As root, make a local log directory to echo the web log:
mkdir /var/log/flume-hdfs
chown hdfs:hadoop /var/log/flume-hdfs/
Next, as user hdfs, make a Flume data directory in HDFS:
hdfs dfs -mkdir /user/hdfs/flume-channel/
Start the Flume target agent as user hdfs (it writes the data to HDFS). This target agent should be started
before the source agent. Note: with Hortonworks HDP, Flume agents can be started as a service
when the system boots.
As user hdfs, start the Flume target agent (the one writing to HDFS):
flume-ng agent -c conf -f web-server-target-agent.conf -n collector
As root, start the source agent to feed the target agent:
flume-ng agent -c conf -f web-server-source-agent.conf -n source_agent
Check to see whether Flume is working (assuming no errors from the flume-ng agents): open a web browser
and view a file from your local machine to create some traffic (or just copy some data to the local
log). Then watch the local log copy (the actual filename will vary):
tail -f /var/log/flume-hdfs/1408126765449-1
Inspect the data in HDFS (the actual filename will vary):
hdfs dfs -tail flume-channel/apache_access_combined/140815/FlumeData.1408126868576
If you don't have a web server running, change this line in the source agent configuration:
source_agent.sources.apache_server.command = tail -f /etc/httpd/logs/access_log
to something like:
source_agent.sources.apache_server.command = tail -f /tmp/access_log
Create the file /tmp/access_log, add some data (anything will work), and start the Flume agents.
Then do the following:
echo "test test 123" >>/tmp/access_log
The text should appear in HDFS.

Chapter Summary

In this chapter

• The Hadoop data lake concept was presented as a new model for data processing.
• Various methods for making data available to several Hadoop tools were outlined. The examples
included copying files directly to HDFS, importing CSV files into Apache Hive and Spark, and
importing JSON files into Hive with Spark.
• Apache Sqoop was presented as a tool for moving relational data into and out of HDFS.
• Apache Flume was presented as a tool for capturing and transporting continuous data, such as
web logs, into HDFS.
• The Apache Oozie workflow manager was described as a tool for creating and scheduling
Hadoop workflows.
• The Apache Falcon tool enables a high-level framework for data governance (end-to-end
management) by keeping Hadoop data and tasks organized and defined as pipelines.
• New tools like Apache NiFi and Atlas were mentioned as options for governance and data flow on
a Hadoop cluster.

Exercises

1. Give an example of data flows that would be difficult to fit into a traditional data warehouse.
Twitter feeds, human genome data, IoT sensors, etc.

2. Name two features of HDFS that are different from most file systems.
No random reads or writes; slices and replicates blocks; not POSIX compliant.

3. What is the difference between an external Hive table and an internal Hive table?
See page 59

4. Suggested exercise: Create a new partitioned Hive table that uses the type of laptop instead of the
state.

5. What is the basic difference between a Spark RDD and a DataFrame?


See page 62.

6. Suggested exercise: have students use a different CSV file and reproduce (and modify) the
steps in the PySpark examples.

7. Suggested exercise: Use Sqoop query for another country code instead of Canada.

8. Have students take the Sqoop-imported data in HDFS and put it into Hive tables using Hive or
Spark.

9. What are a Flume source, a channel, and a sink?

10. Why would someone use Oozie instead of running the programs by hand?
