Data Ingest
H. EL GHAZI

Learning Objectives

Skill: Data Ingest
• Load data into and out of HDFS using the Hadoop File System commands (Technology: HDFS Command Line)
• Import data from a MySQL database into HDFS using Sqoop (Technology: Sqoop)
• Export data to a MySQL database from HDFS using Sqoop (Technology: Sqoop)
• Change the delimiter and file format of data during import using Sqoop (Technology: Sqoop)
• Ingest real-time and near-real-time streaming data into HDFS (Technology: Flume or Kafka or Spark Streaming)
• Process streaming data as it is loaded onto the cluster (Technology: Flume or Kafka or Spark Streaming)
HDFS
• Scalable and economical data storage, processing, and analysis
10/13/2021
• In Spark
  hdfs://nnhost:port/file…
• Other programs
  • Java API
    • Used by Hadoop tools such as MapReduce, Impala, Hue, Sqoop, Flume
  • RESTful interface
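The RESTful interface referred to above is most likely WebHDFS, the standard HTTP REST API to HDFS. A minimal sketch, assuming a NameNode reachable at the placeholder hostname nnhost (the default NameNode HTTP port is 9870 in Hadoop 3, 50070 in Hadoop 2, and the path /user/me/grades.txt is reused from the examples in this deck):

```shell
# List the contents of /user/me via WebHDFS (the NameNode answers with JSON)
curl -i "http://nnhost:9870/webhdfs/v1/user/me?op=LISTSTATUS"

# Read a file; the NameNode redirects (-L follows it) to a DataNode
# that actually serves the file bytes
curl -i -L "http://nnhost:9870/webhdfs/v1/user/me/grades.txt?op=OPEN"
```

These commands require a running HDFS cluster, so they are illustrative rather than directly runnable here.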
HDFS Command Line Examples (1)
• Copy the file localfname.txt from local disk to the user’s home directory in HDFS
  hdfs dfs -put localfname.txt targetfname.txt
  • This will copy the file to /user/username/targetfname.txt
• Get a directory listing of the user’s home directory in HDFS
  hdfs dfs -ls
• Get a directory listing of the HDFS root directory
  hdfs dfs -ls /

HDFS Command Line Examples (2)
• Display the contents of the HDFS file /user/me/grades.txt
  hdfs dfs -cat /user/me/grades.txt
• Copy that file to the local disk, named as notes.txt
  hdfs dfs -get /user/me/grades.txt notes.txt
• Create a directory called input under the user’s home directory
  hdfs dfs -mkdir input
HDFS Command Line Examples (3)
• Delete a file
  hdfs dfs -rm input_old/file1
• Delete a set of files using a wildcard
  hdfs dfs -rm input_old/*

The Hue HDFS File Browser
• The File Browser in Hue lets you view and manage your HDFS directories and files
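Deleting a whole directory, rather than individual files, requires the recursive flag. A sketch reusing the input_old directory from the example above (these commands need a running HDFS cluster):

```shell
# Delete the directory input_old and everything beneath it
hdfs dfs -rm -r input_old

# Bypass the HDFS trash (where enabled) so space is reclaimed immediately
hdfs dfs -rm -r -skipTrash input_old
```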
Importing an Entire Database with Sqoop
• The import-all-tables tool imports an entire database
  • Stored as comma-delimited files
  • Default base location is your HDFS home directory
  • Data will be in subdirectories corresponding to the name of each table

sqoop import-all-tables \
  --connect jdbc:mysql://dbhost/retails \
  --username dbuser --password pw

Importing a Single Table with Sqoop
• The import tool imports a single table
• This example imports the product table
  • It stores the data in HDFS as comma-delimited fields

sqoop import --table product \
  --connect jdbc:mysql://dbhost/retails \
  --username dbuser --password pw
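The learning objectives also call for changing the delimiter and file format during import, and for exporting data from HDFS back to MySQL. A sketch reusing the dbhost/retails connection from the examples above; the table name product_reports and the directory /loudacre/product_reports are hypothetical, and the commands require Sqoop plus a reachable database:

```shell
# Import the product table with tab-delimited fields instead of commas
sqoop import --table product \
  --connect jdbc:mysql://dbhost/retails \
  --username dbuser --password pw \
  --fields-terminated-by '\t'

# Import the product table as Parquet files instead of delimited text
sqoop import --table product \
  --connect jdbc:mysql://dbhost/retails \
  --username dbuser --password pw \
  --as-parquetfile

# Export HDFS data into an existing MySQL table (hypothetical names)
sqoop export --table product_reports \
  --connect jdbc:mysql://dbhost/retails \
  --username dbuser --password pw \
  --export-dir /loudacre/product_reports
```

Note that sqoop export requires the target table to already exist in the database.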
Flume
• Now widely used for collection of any streaming event data
• Supports aggregating data from many sources into HDFS
• Benefits of Flume
  • Horizontally scalable
  • Extensible
  • Reliable
Example flume.conf:

agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /var/flume/incoming
# Connects source and channel
agent1.sources.src1.channels = ch1
agent1.channels.ch1.type = memory
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /loudacre/logdata
# Connects sink and channel
agent1.sinks.sink1.channel = ch1

Starting the agent:

flume-ng agent \
  --conf /etc/flume-ng/conf \
  --conf-file /path/to/flume.conf \
  --name agent1 \
  -Dflume.root.logger=INFO,console
Messages
• Messages in Kafka are variable-size byte arrays
  • Represent arbitrary user-defined content
  • Use any format your application requires
  • Common formats include free-form text, JSON, and Avro
• There is no explicit limit on message size
  • Optimal performance at a few KB per message
  • Practical limit of 1MB per message
• Kafka retains all messages for a defined time period and/or total size

Topics
• There is no explicit limit on the number of topics
  • However, Kafka works better with a few large topics than many small ones
• A topic can be created explicitly or simply by publishing to the topic
  • This behavior is configurable
  • We recommend that administrators disable auto-creation of topics to avoid accidental creation of large numbers of topics
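Since auto-creation of topics is best disabled, topics would then be created explicitly with the kafka-topics tool. A sketch, reusing the placeholder ZooKeeper hosts and device_status topic that appear later in this deck (older Kafka releases address the cluster via --zookeeper, as here; newer releases use --bootstrap-server instead):

```shell
# Explicitly create the topic device_status with 3 partitions,
# each partition replicated to 2 brokers
kafka-topics --create \
  --zookeeper zkhost1:2181,zkhost2:2181,zkhost3:2181 \
  --topic device_status \
  --partitions 3 \
  --replication-factor 2
```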
Producers
• Producers publish messages to Kafka topics
• They communicate with Kafka, not a consumer
• Kafka persists messages to disk on receipt

Consumers
• A consumer reads messages that were published to Kafka topics
• They communicate with Kafka, not any producer
• Consumer actions do not affect other consumers
  • For example, having one consumer display the messages in a topic as they are published does not change what is consumed by other consumers
• They can come and go without impact on the cluster or other consumers
Controller
Displaying Topics from the Command Line
• Use the --list option to list all topics

kafka-topics --list \
  --zookeeper zkhost1:2181,zkhost2:2181,zkhost3:2181

• Use the --help option to list all kafka-topics options

kafka-topics --help

Running a Producer from the Command Line
• You can run a producer using the kafka-console-producer tool
• Specify one or more brokers in the --broker-list option
  • Each broker consists of a hostname, a colon, and a port number
  • If specifying multiple brokers, separate them with commas
• You must also provide the name of the topic

kafka-console-producer \
  --broker-list brokerhost1:9092,brokerhost2:9092 \
  --topic device_status
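The matching tool for reading a topic is kafka-console-consumer, which prints each consumed message to standard output. A sketch reusing the placeholder broker and topic from the producer example above (recent Kafka releases take --bootstrap-server here; very old releases used a --zookeeper option instead):

```shell
# Read the device_status topic from the beginning and print every message
kafka-console-consumer \
  --bootstrap-server brokerhost1:9092 \
  --topic device_status \
  --from-beginning
```

Running this in one terminal while typing lines into kafka-console-producer in another is a quick way to verify that a topic is flowing end to end.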
Spark Streaming
Homework