Data Lake 1
What is HIVE?
• A system for managing and querying structured data built on top of
Hadoop
• Uses Map-Reduce for execution
• HDFS for storage
• Extensible to other Data Repositories
• SQL:
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);
Hive QL – Join in Map Reduce
(Figure: the join above compiled into a Map-Reduce job. Map tasks emit the join key userid with tagged values, pageid from page_view and age from user; the shuffle groups rows by userid; Reduce tasks combine the matching rows into (pageid, age) pairs.)
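Hive can show how it compiles such a join into map and reduce stages; a minimal sketch using EXPLAIN on the query above (the plan details vary with the Hive version and optimizer settings):

-- Prints Hive's execution plan for the join without running the query
EXPLAIN
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);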
Hive
• Developed at Facebook
• Used for the majority of Facebook jobs
• “Relational database” built on Hadoop
• Maintains list of table schemas
• SQL-like query language (HiveQL)
• Can call Hadoop Streaming scripts from HiveQL
• Supports table partitioning, clustering, complex data
types, some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;
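The same DDL can also use the clustering and complex data types mentioned earlier. A minimal sketch, where the table name page_views_bucketed, the added properties column, the bucket count, and the partition values are chosen only for illustration:

CREATE TABLE page_views_bucketed(viewTime INT, userid BIGINT,
        page_url STRING, referrer_url STRING,
        properties MAP<STRING, STRING> COMMENT 'Arbitrary key/value attributes')
    PARTITIONED BY(dt STRING, country STRING)
    CLUSTERED BY(userid) INTO 32 BUCKETS   -- bucket rows by userid within each partition
    STORED AS SEQUENCEFILE;

-- Filtering on the partition columns lets Hive read only the matching partitions
SELECT page_url, COUNT(*)
FROM page_views_bucketed
WHERE dt = '2008-06-08' AND country = 'US'
GROUP BY page_url;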
Why go for Hive when Pig is there?
Features                          Hive                Pig
Language                          SQL-like            PigLatin
Schemas/Types                     Yes (explicit)      Yes (implicit)
Partitions                        Yes                 No
Server                            Optional (Thrift)   No
User Defined Functions (UDF)      Yes (Java)          Yes (Java)
Custom Serializer/Deserializer    Yes                 Yes
DFS Direct Access                 Yes (implicit)      Yes (explicit)
Join/Order/Sort                   Yes                 Yes
Shell                             Yes                 Yes
Streaming                         Yes                 Yes
Web Interface                     Yes                 No
JDBC/ODBC                         Yes (limited)       No
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_views.userid, page_views.date)
USING 'map_script.py'
AS dt, uid
FROM page_views
CLUSTER BY dt;
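The mapped output can also be fed through a reduce-side script and written into another table. A sketch assuming a hypothetical reduce_script.py and an existing result table page_view_counts:

FROM (
    -- Map phase: map_script.py reshapes each row into (dt, uid);
    -- CLUSTER BY dt sends all rows with the same dt to the same reducer
    SELECT TRANSFORM(page_views.userid, page_views.date)
    USING 'map_script.py'
    AS dt, uid
    FROM page_views
    CLUSTER BY dt) map_output
INSERT OVERWRITE TABLE page_view_counts
    -- Reduce phase: reduce_script.py aggregates the rows for each dt into (dt, cnt)
    SELECT TRANSFORM(map_output.dt, map_output.uid)
    USING 'reduce_script.py'
    AS dt, cnt;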
Exchanging data with databases using Sqoop
Sqoop is a tool for transferring data between Hadoop and external structured data stores such as RDBMSs and data warehouses. It supports Oracle, IBM, Microsoft, MySQL, Teradata, etc., as well as NoSQL DBMSs
It uses a connector-based architecture for connectivity to external systems via JDBC or native plugins
Sqoop uses MapReduce to run its tasks on Hadoop:
Sqoop tasks are launched using the sqoop command
MapReduce map tasks perform the actual data transfer
The data source provides the schema, and Sqoop generates and executes the corresponding SQL statements
Typical workflow using Sqoop
The Sqoop import tool
With Sqoop, you can import data from a relational database system into HDFS:
The input to the import process is a database table
Sqoop reads the table row by row into HDFS; the output of the import process is a set of files containing a copy of the imported table
The import is performed in parallel by several map tasks, so the output is split across multiple files
Sqoop can store the imported data as text files (with commas or tabs separating the fields of each record) or as binary files containing serialized record data
Example: Importing data from a table in a database into HDFS
sqoop import \
  --connect jdbc:mysql://localhost/nyse \
  --table StockPrices \
  --target-dir /data/stockprice/ \
  --as-textfile
o the connect string points to the local MySQL database nyse
o the database table is StockPrices
o the data will be imported into the folder /data/stockprice/
o the default number of MapReduce map tasks for each Sqoop command is four, so the result of this import will occupy four files in HDFS
o the --as-textfile argument instructs Sqoop to import the data as plain text
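The import can be narrowed and tuned with additional flags. A sketch reusing the nyse database and StockPrices table from the example above; the column names trade_date and stock_symbol are assumptions made only for illustration:

# --where limits the import to matching rows,
# --split-by picks the column used to divide the work among map tasks,
# and -m 8 raises the number of map tasks from the default of four.
sqoop import \
  --connect jdbc:mysql://localhost/nyse \
  --table StockPrices \
  --where "trade_date >= '2014-01-01'" \
  --split-by stock_symbol \
  -m 8 \
  --target-dir /data/stockprice_2014/ \
  --as-textfile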
The Sqoop export tool
Sqoop’s export process will read a set of delimited text files
from HDFS in parallel, parse them into records, and insert
them as new rows in a target database table.
The Sqoop export tool runs in three modes:
Insert Mode: the records being exported are inserted into the table using SQL INSERT statements
Update Mode: an UPDATE SQL statement is executed for existing rows, and an INSERT is used for new rows
Call Mode: a stored procedure is invoked for each record
Example: Exporting HDFS data to a database table
sqoop export \
  --connect jdbc:mysql://localhost/mylogs \
  --table LogData \
  --export-dir /data/logfiles/ \
  --input-fields-terminated-by "\t"
o the table LogData must already exist in the MySQL database mylogs
o the column values are determined by the delimiter used in the files, which is a tab in this example
o all data in the files stored in the /data/logfiles/ directory of HDFS will be exported
o Sqoop will perform this job using four mappers by default, but you can change this number with the -m argument
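Update mode, mentioned above, is selected with additional flags. A sketch assuming the LogData table has a key column named id (an illustrative assumption):

# --update-key names the column(s) used to match existing rows;
# --update-mode allowinsert also inserts rows that do not match
# (use updateonly to skip non-matching rows instead).
sqoop export \
  --connect jdbc:mysql://localhost/mylogs \
  --table LogData \
  --export-dir /data/logfiles/ \
  --input-fields-terminated-by "\t" \
  --update-key id \
  --update-mode allowinsert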
Streaming live data into Hadoop using
Flume
Flume is an Apache open-source system for efficiently
collecting, aggregating, and transporting large amounts of
log data from many different sources into HDFS.
Examples of possible data sources include
System and network log files
Emails
Website clickstream and Web traffic logs
Twitter feeds and other social media sources, etc.
Flume uses a producer-consumer model for handling
events, transmitted over a Channel, where Source is the
producer and Sink is the consumer of the events.
Flume Workflow
Producer-consumer model
The Sink can be one of the following destinations:
HDFS: stores data in files
HBase: stores data in key/value pairs
Event Serializer: converts the event data into a custom format and writes it to an output stream
A Flume process can consist of one Agent with a single Source and Sink, or of multiple Agents that aggregate data from multiple Sources and/or send events to multiple Sinks for further processing
A Channel drains asynchronously into a Sink (the Source does not have to wait for the event to be stored in its final destination); because the Source and the Sink are decoupled, this improves performance
Using Flume
To use Flume, you start an Agent.
Each Agent has a configuration file that defines its Sources,
Channels and Sinks.
Example: Starting a Flume Agent
flume-ng agent -n my_agent -c conf -f myagent.conf
Example: Agent configuration file for
streaming Web server log files into HDFS
my_agent.sources = webserver
my_agent.channels = memoryChannel
my_agent.sinks = mycluster

my_agent.sources.webserver.type = exec
my_agent.sources.webserver.command = tail -F /var/log/hadoop/hdfs/hdfs-audit.log
my_agent.sources.webserver.batchSize = 1
my_agent.sources.webserver.channels = memoryChannel

my_agent.channels.memoryChannel.type = memory
my_agent.channels.memoryChannel.capacity = 10000

my_agent.sinks.mycluster.type = hdfs
my_agent.sinks.mycluster.channel = memoryChannel
my_agent.sinks.mycluster.hdfs.path = hdfs://127.0.0.1:8020/hdfsaudit/
Example (cont.)
The name of the Flume Agent is my_agent; the names of the Sink, Source, and Channel are arbitrary
This Agent has one Source named webserver of type exec, which means it executes a command (in this case, tail -F on the HDFS audit log file)
The Agent has one Sink named mycluster, which writes the events to sequence files in a specified HDFS folder, hdfs://127.0.0.1:8020/hdfsaudit/
The Agent has one Channel named memoryChannel of type memory, which buffers the events in memory with a capacity of 10,000 events
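The HDFS Sink writes SequenceFiles by default; a few additional properties in the same configuration file control the output format and when files are rolled. A sketch with illustrative values:

# Write plain text events instead of the default SequenceFile format
my_agent.sinks.mycluster.hdfs.fileType = DataStream
my_agent.sinks.mycluster.hdfs.writeFormat = Text
# Roll (close) the current file every 10 minutes or after ~128 MB, whichever comes first
my_agent.sinks.mycluster.hdfs.rollInterval = 600
my_agent.sinks.mycluster.hdfs.rollSize = 134217728
my_agent.sinks.mycluster.hdfs.rollCount = 0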
Data Lake
• A repository for analyzing large quantities of disparate sources of data in its native format
• One architectural platform to house all types of data:
  o Machine-generated data (ex: IoT, logs)
  o Human-generated data (ex: tweets, e-mail)
  o Traditional operational data (ex: sales, inventory)
(Figure: a Data Lake Repository fed by Web, Sensor, Log, Social, and Images sources)
Objectives of a Data Lake
Reduce up-front effort by ingesting data in any format without requiring a schema initially
Store large volumes of multi-structured data in its native format
Defer work to 'schematize' until after value & requirements are known
Achieve agility faster than a traditional data warehouse can
Speed up decision-making ability
Storage for additional types of data which were historically difficult to obtain
Strategy: Data Lake as a Staging Area for DW
• Reduce storage needs in the data warehouse
• Practical use for data stored in the data lake
1 Utilize the data lake as a landing area for the DW staging area, instead of the relational database
(Figure: CRM, Corporate Data, Social Media Data, and Devices & Sensors feed the Data Lake Store, which acts as the Raw Data: Staging Area (1) for the Data Warehouse used by Analytics & reporting tools)
Strategy: Data Lake for Active Archiving
• Data archival, with query ability available when needed
1 Archival process based on data retention policy
2 Federated query to access current & historical data
(Figure: CRM, Corporate Data, Social Media Data, and Devices & Sensors feed the Data Lake Store (Raw Data: Staging Area), which also holds the Active Archive (1); Analytics & reporting tools reach current & historical data through a federated query (2) across the Data Warehouse and the archive)
Iterative Data Lake Pattern
1 Ingest and store data in its native format: acquire data with cost-effective storage
2 Analyze in place to determine the value of the data ("schema on read"): analyze data on a case-by-case basis with scalable parallel processing ability
3 For data of value: integrate with the data warehouse ("schema on write"), or use data virtualization; deliver data once requirements are fully known
Data Lake Implementation
(Figure: Data Lake zones, including a Transient/Temp Zone and an Analytics Sandbox)
Data classification attributes:
• Business Impact / Criticality: High (HBI), Medium (MBI), Low (LBI), etc.
• Confidentiality classification: Public information, Internal use only, Supplier/partner confidential, Personally identifiable information (PII), Sensitive – financial, Sensitive – intellectual property, etc.
• Owner / Steward / SME
Ways to Get Started with a Data Lake