

Apache HIVE

What is HIVE?
• A system for managing and querying structured data built on top of
Hadoop
• Uses Map-Reduce for execution
• HDFS for storage
• Extensible to other Data Repositories

• Key Building Principles:


• SQL on structured data as a familiar data warehousing tool
• Extensibility (Pluggable map/reduce scripts in the language of your choice, Rich
and User Defined data types, User Defined Functions)
• Interoperability (Extensible framework to support different file and data
formats)
What HIVE Is Not
• Not designed for OLTP
• Does not offer real-time queries
HIVE Architecture
Hive/Hadoop Usage @ Facebook
• Types of Applications:
• Summarization
• Eg: Daily/Weekly aggregations of impression/click counts
• Complex measures of user engagement
• Ad hoc Analysis
• Eg: how many group admins broken down by state/country
• Data Mining (Assembling training data)
• Eg: User Engagement as a function of user attributes
• Spam Detection
• Anomalous patterns for Site Integrity
• Application API usage patterns
• Ad Optimization
• Too many to count.
Hive Query Language
• Basic SQL
• CREATE TABLE sample (foo INT, bar STRING) PARTITIONED BY (ds STRING);
• SHOW TABLES '.*s';
• DESCRIBE sample;
• ALTER TABLE sample ADD COLUMNS (new_col INT);
• DROP TABLE sample;
• Extensibility
• Pluggable Map-reduce scripts
• Pluggable User Defined Functions
• Pluggable User Defined Types
• Pluggable SerDes to read different kinds of Data Formats
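The extensibility hooks are registered from HiveQL itself. As a minimal sketch (the jar path and Java class name are illustrative assumptions, not from the source), a user defined function can be added and then used like a built-in:

-- register a custom UDF packaged in a jar (class name is hypothetical)
ADD JAR /tmp/my_udfs.jar;
CREATE TEMPORARY FUNCTION normalize_url AS 'com.example.hive.udf.NormalizeUrl';

SELECT normalize_url(bar), COUNT(*)
FROM sample
GROUP BY normalize_url(bar);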
Hive QL – Join

page_view                         user                       pv_users
pageid  userid  time              userid  age  gender        pageid  age
1       111     9:08:01      X    111     25   female    =   1       25
2       111     9:08:13           222     32   male          2       25
1       222     9:08:14                                      1       32

• SQL:
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);
Hive QL – Join in Map Reduce

Map: each mapper emits key = userid; the value is tagged with its source table (<1, pageid> from page_view, <2, age> from user)

page_view                            map output
pageid  userid  time                 key   value
1       111     9:08:01              111   <1,1>
2       111     9:08:13              111   <1,2>
1       222     9:08:14              222   <1,1>

user                                 map output
userid  age  gender                  key   value
111     25   female                  111   <2,25>
222     32   male                    222   <2,32>

Shuffle & Sort: map output is grouped by key

key   value                          key   value
111   <1,1>                          222   <1,1>
111   <1,2>                          222   <2,32>
111   <2,25>

Reduce: rows sharing the same key are joined, producing pv_users

pageid  age                          pageid  age
1       25                           1       32
2       25
Hive
• Developed at Facebook
• Used for majority of Facebook jobs
• “Relational database” built on Hadoop
• Maintains list of table schemas
• SQL-like query language (HiveQL)
• Can call Hadoop Streaming scripts from HiveQL
• Supports table partitioning, clustering, complex data
types, some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

• Partitioning breaks the table into separate files for each (dt, country) pair
  Ex: /hive/page_view/dt=2008-06-08,country=USA
      /hive/page_view/dt=2008-06-08,country=CA
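Data usually lands in a specific partition either by loading a file or by inserting query results; a brief sketch (the source path and the stg_page_views staging table are hypothetical, and the partition values just mirror the example above):

-- load a file directly into one (dt, country) partition
LOAD DATA INPATH '/staging/page_views_2008-06-08_USA'
INTO TABLE page_views PARTITION (dt='2008-06-08', country='USA');

-- or populate a partition from another table
INSERT OVERWRITE TABLE page_views PARTITION (dt='2008-06-08', country='USA')
SELECT viewTime, userid, page_url, referrer_url, ip
FROM stg_page_views;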
A Simple Query
• Find all page views coming from xyz.com during March 2018:
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2018-03-01'
AND page_views.date <= '2018-03-31'
AND page_views.referrer_url like '%xyz.com';

• Hive reads only the matching date partitions (2018-03-01,* through 2018-03-31,*) instead of scanning the entire table
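One way to confirm that pruning happens is to prepend EXPLAIN DEPENDENCY (a standard HiveQL statement that lists the input tables and partitions of a query) to the same query; this is only a verification sketch, not part of the original slide:

EXPLAIN DEPENDENCY
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2018-03-01'
  AND page_views.date <= '2018-03-31'
  AND page_views.referrer_url LIKE '%xyz.com';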
Aggregation and Joins
• Count users who visited each page by gender:
SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2018-03-03'
GROUP BY pv.page_url, u.gender;

• Sample output: one row per (page_url, gender) pair with its distinct user count
Why go for Hive when Pig is there?
Feature                           Hive               Pig
Language                          SQL-like           PigLatin
Schemas/Types                     Yes (explicit)     Yes (implicit)
Partitions                        Yes                No
Server                            Optional (Thrift)  No
User Defined Functions (UDF)      Yes (Java)         Yes (Java)
Custom Serializer/Deserializer    Yes                Yes
DFS Direct Access                 Yes (implicit)     Yes (explicit)
Join/Order/Sort                   Yes                Yes
Shell                             Yes                Yes
Streaming                         Yes                Yes
Web Interface                     Yes                No
JDBC/ODBC                         Yes (limited)      No
Using a Hadoop Streaming Mapper Script
FROM page_views
SELECT TRANSFORM(page_views.userid, page_views.date)
USING 'map_script.py'
AS dt, uid
CLUSTER BY dt;
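Hive pipes the selected columns to the script as tab-separated fields on stdin, one row per line, and reads tab-separated fields back from stdout. A minimal sketch of what map_script.py might look like (the actual script is not shown in the source; the transformation here, swapping the columns to match AS dt, uid, is an assumption):

#!/usr/bin/env python
# map_script.py (hypothetical): read (userid, date) rows, emit (dt, uid) rows
import sys

for line in sys.stdin:
    userid, date = line.rstrip('\n').split('\t')
    # emit the date first, then the userid, to match AS dt, uid
    print(date + '\t' + userid)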
Exchanging data with databases using
Sqoop
 Sqoop is a tool for transferring data between Hadoop and
external structured data stores like RDBMS and data
warehouses. It supports Oracle, IBM, Microsoft, MySQL,
Teradata, etc., as well as NoSQL databases
 It uses a connector-based architecture for connectivity to
external systems via JDBC or native plugins
 Sqoop uses MapReduce to run tasks on Hadoop
 The Sqoop tasks are executed using the sqoop command
MapReduce map tasks execute the command
The data source provides the schema, and Sqoop generates and executes the corresponding SQL statements
Typical workflow using Sqoop
The Sqoop import tool
 With Sqoop, you can import data from a relational
database system into HDFS:
The input to the import process is a database table
Sqoop will read the table row by row into HDFS. The output
of this import process is a set of files containing a copy of the
imported table
The import process is performed in parallel by multiple map tasks. For this reason, the output will be in multiple files
 Sqoop can store the imported data as delimited text files (with commas or tabs separating each field) or as binary files containing serialized record data
Example: Importing data from a table in
a database into HDFS
sqoop import \
  --connect jdbc:mysql://localhost/nyse \
  --table StockPrices \
  --target-dir /data/stockprice/ \
  --as-textfile
o the connect string points to the local MySQL database nyse
o the database table is StockPrices
o the data will be imported into the folder /data/stockprice/ in HDFS
o the default number of MapReduce map tasks for executing each Sqoop command is four, so the result of this import will be split across four files in HDFS
o the --as-textfile argument instructs Sqoop to import the data as plain text
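A few other commonly used import options (real Sqoop flags, but the column names below are hypothetical): -m changes the number of parallel map tasks, --split-by chooses the column used to divide the work among them, and --where filters the rows being imported:

sqoop import \
  --connect jdbc:mysql://localhost/nyse \
  --table StockPrices \
  --target-dir /data/stockprice_ibm/ \
  --where "symbol = 'IBM'" \
  --split-by trade_date \
  -m 2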
The Sqoop export tool
 Sqoop’s export process will read a set of delimited text files
from HDFS in parallel, parse them into records, and insert
them as new rows in a target database table.
 The Sqoop export tool runs in three modes:
Insert mode: the records being exported are inserted into the table using SQL INSERT statements
Update mode: an UPDATE SQL statement is executed for existing rows, and an INSERT is used for new rows
Call mode: a stored procedure is invoked for each record
Example: Exporting HDFS data to a database table
sqoop export \
  --connect jdbc:mysql://localhost/mylogs \
  --table LogData \
  --export-dir /data/logfiles/ \
  --input-fields-terminated-by "\t"
o The table LogData needs to already exist in the MySQL database mylogs
o The column values are determined by the delimiter used in the files,
which is a tab in this example
o All data in the files stored in the /data/logfiles/ directory of HDFS
will be exported
o Sqoop will perform this job using four mappers by default, but you can
change this number with the -m argument
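For the update mode described earlier, Sqoop needs an --update-key naming the column(s) that identify an existing row; with --update-mode allowinsert it updates matching rows and inserts the rest. A sketch (the id key column is a hypothetical addition to the LogData table):

sqoop export \
  --connect jdbc:mysql://localhost/mylogs \
  --table LogData \
  --export-dir /data/logfiles/ \
  --input-fields-terminated-by "\t" \
  --update-key id \
  --update-mode allowinsert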
Streaming live data into Hadoop using
Flume
 Flume is an Apache open-source system for efficiently
collecting, aggregating, and transporting large amounts of
log data from many different sources into HDFS.
 Examples of possible data sources include
System and network log files
Emails
Website clickstream and Web traffic logs
Twitter feeds and other social media sources, etc.
Flume uses a producer-consumer model for handling
events, transmitted over a Channel, where Source is the
producer and Sink is the consumer of the events.
Flume Workflow
Producer-consumer model
 The Sink can be one of the following destinations
HDFS: stores data in files
HBase: stores data in key/value pairs
Event Serializer: converts the event data into a custom format and
writes to an output stream
 A Flume process can consist of one Agent with a single Source and
Sink or multiple Agents that aggregate data from multiple Sources
and/or output events to multiple Sinks for further processing
 A Channel drains asynchronously into a Sink (Source does not have
to wait for storing the event in its final destination) because the
Source and the Sink are decoupled, which improves the
performance
Using Flume
 To use Flume, you start an Agent.
 Each Agent has a configuration file that defines its Sources,
Channels and Sinks.
Example: Starting a Flume Agent
flume-ng agent -n my_agent -c conf -f myagent.conf
Example: Agent configuration file for
streaming Web server log files into HDFS
my_agent.sources = webserver
my_agent.channels = memoryChannel
my_agent.sinks = mycluster

my_agent.sources.webserver.type = exec
my_agent.sources.webserver.command = tail -F /var/log/hadoop/hdfs/hdfs-audit.log
my_agent.sources.webserver.batchSize = 1
my_agent.sources.webserver.channels = memoryChannel

my_agent.channels.memoryChannel.type = memory
my_agent.channels.memoryChannel.capacity = 10000

my_agent.sinks.mycluster.type = hdfs
my_agent.sinks.mycluster.channel = memoryChannel
my_agent.sinks.mycluster.hdfs.path = hdfs://127.0.0.1:8020/hdfsaudit/
Example (cont.)
 The name of the Flume Agent is my_agent and the names of
the Sink, Source, and Channel are arbitrary
 This Agent has one Source named webserver of type exec, which means it executes a command (in this case, tail -F on the HDFS audit log file)
 The Agent has one Sink named mycluster, which sends the
events to a sequence file in a specified folder in HDFS at
hdfs://127.0.0.1:8020/hdfsaudit/
 The Agent has one Channel named memoryChannel,
configured with a memory type to store the events in
memory with a capacity of 10,000
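The HDFS sink also has properties that control the output format and when files are rolled. These are standard Flume HDFS sink settings, but the values below are only an illustrative sketch, not part of the original configuration:

# write plain text instead of the default SequenceFile format
my_agent.sinks.mycluster.hdfs.fileType = DataStream
my_agent.sinks.mycluster.hdfs.writeFormat = Text
# roll a new file every 10 minutes instead of by size or event count
my_agent.sinks.mycluster.hdfs.rollInterval = 600
my_agent.sinks.mycluster.hdfs.rollSize = 0
my_agent.sinks.mycluster.hdfs.rollCount = 0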
Data Lake

Data Lake Repository
• A repository for analyzing large quantities of disparate sources of data in its native format
• One architectural platform to house all types of data:
  o Machine-generated data (ex: IoT, logs)
  o Human-generated data (ex: tweets, e-mail)
  o Traditional operational data (ex: sales, inventory)
(Diagram: Web, Sensor, Log, Social, and Images sources feeding the Data Lake Repository)
Objectives of a Data Lake
 Reduce up-front effort by ingesting data in any format without requiring a schema initially
 Make acquiring new data easy, so it can be available for data science & analysis quickly
 Store large volumes of multi-structured data in its native format
Objectives of a Data Lake
 Defer work to 'schematize' until after value & requirements are known
 Achieve agility faster than a traditional data warehouse can
 Speed up decision-making ability
 Storage for additional types of data which were historically difficult to obtain
Strategy: Data Lake as a Staging Area for DW
• Reduce storage needs in the data warehouse
• Practical use for data stored in the data lake
1 Utilize the data lake as the landing area for the DW staging area, instead of the relational database
(Diagram: CRM, Corporate Data, Social Media Data, and Devices & Sensors land as raw data in the Data Lake Store (staging area), which feeds the Data Warehouse and then Analytics & reporting tools)
Strategy: Data Lake for Active Archiving
• Data archival, with query ability available when needed
1 Archival process based on data retention policy
2 Federated query to access current & historical data
(Diagram: CRM, Corporate Data, Social Media Data, and Devices & Sensors land as raw data in the Data Lake Store (staging area); the Data Warehouse offloads older data to an Active Archive in the data lake, and federated queries from the Analytics & reporting tools reach both)
Iterative Data Lake Pattern
1. Ingest and store data indefinitely in its native format
   Acquire data with cost-effective storage
2. Analyze in place to determine the value of the data ("schema on read")
   Analyze data on a case-by-case basis with scalable parallel processing ability
3. For data of value: integrate with the data warehouse ("schema on write"), or use data virtualization
   Deliver data once requirements are known
Data Lake Implementation
 A data lake is a conceptual idea. It can be implemented with one or more technologies.
 HDFS (Hadoop Distributed File System) is a very common option for data lake storage. However, Hadoop is not a requirement for a data lake, and a data lake may also span more than one Hadoop cluster.
 NoSQL databases are also very common.
 Object stores (like Amazon S3 or Azure Blob Storage) can also be used.
Coexistence of Data Lake & Data Warehouse
 Data Lake values: Agility, Flexibility, Rapid Delivery, Exploration
 Enterprise Data Warehouse values: Governance, Reliability, Standardization, Security
 Data acquisition: less effort in the data lake, more effort in the data warehouse
 Data retrieval: more effort in the data lake, less effort in the data warehouse
Zones in a Data Lake
 Transient/Temp Zone
 Raw Data/Staging Zone
 Curated Data Zone
 Analytics Sandbox
All zones are underpinned by: Metadata | Security | Governance | Information Management


Raw Data Zone
 The raw data zone is immutable
 History is retained to accommodate future, unknown needs
 Staging may be a distinct area on its own
 Supports any type of data:
  o Streaming
  o Batch
Transient Zone
 Useful when data quality or validity checks are necessary before data can be landed in the Raw Zone
 All landing zones are considered the "kitchen area", with highly limited access:
  o Transient Zone
  o Raw Data Zone
  o Staging Area
Curated Data Zone
 Cleansed, organized data for data delivery:
  o Data consumption
  o Federated queries
  o Provides data to other systems
 Most self-service data access occurs from the Curated Data Zone
 Standard governance & security in the Curated Data Zone
Analytics Sandbox
 Data science and
exploratory activities
 Minimal governance of the
Analytics Sandbox
 Valuable efforts are
“promoted” from Analytics
Sandbox to the Curated
Data Zone or to the data
warehouse
Sandbox Solutions: Develop
Objective:
1 Utilize a sandbox area in the data lake for data preparation
2 Execution of R scripts from a local workstation for exploratory data science & advanced analytics scenarios
(Diagram: CRM, Corporate Data, Social Media Data, Devices & Sensors, and Flat Files land in the Data Lake (Raw Data, Curated Data, Sandbox) and the Data Warehouse; the Data Scientist / Analyst works in the Sandbox)
Sandbox Solutions: Operationalize
Objective:
1 The trained model is promoted to run in a production server environment
2 Sandbox use is discontinued once the solution is promoted
3 Execution of R scripts from the server for operationalized data science & advanced analytics scenarios
(Diagram: CRM, Corporate Data, Social Media Data, Devices & Sensors, and Flat Files land in the Data Lake (Raw Data, Curated Data, Sandbox) and the Data Warehouse; the R scripts now run on an R Server instead of in the Sandbox)
Organizing the Data Lake
Plan the structure based on optimal data retrieval. The organization pattern should be self-documenting.
Organization is frequently based upon:
  o Subject area
  o Time partitioning
  o Security boundaries
  o Downstream app/purpose
Metadata capabilities of your technology will have a *big* impact on how you choose to handle organization.
-- The objective is to avoid a chaotic data swamp --
Organizing the Data Lake: Example 1
Raw Data Zone hierarchy: Subject Area > Data Source > Object > Date Loaded > File(s)
  Ex: Sales > Salesforce > CustomerContacts > 2016 > 12 > 20 > file(s)
Curated Data Zone hierarchy: Purpose > Type > Snapshot Date > File(s)
  Ex: Sales Trending Analysis > Summarized > 2016_12_01 > SalesTrend.
Pros: Subject area at top level, organization-wide; partitioned by time
Cons: No obvious security or organizational boundaries
Organizing the Data Lake: Example 2
Raw Data Zone hierarchy: Organizational Unit > Subject Area > Data Source > Object > Date Loaded > File(s)
  Ex: East Division > Sales > Salesforce > CustomerContacts > 2016 > 12 > file(s)
Curated Data Zone hierarchy: Organizational Unit > Purpose > Type > Snapshot Date > File(s)
  Ex: East Division > Sales Trending Analysis > Summarized > 2016_12_01 > file(s)
Pros: Security at the organizational level; partitioned by time
Cons: Potentially siloed data, duplicated data
Organizing the Data Lake
Other options which affect organization and/or metadata:
  o Data Retention Policy: temporary data, permanent data, applicable period (ex: project lifetime), etc.
  o Probability of Data Access: recent/current data, historical data, etc.
  o Business Impact / Criticality: High (HBI), Medium (MBI), Low (LBI), etc.
  o Confidential Classification: public information, internal use only, supplier/partner confidential, personally identifiable information (PII), sensitive – financial, sensitive – intellectual property, etc.
  o Owner / Steward / SME
Ways to Get Started with a Data Lake

1. Data lake as staging area for DW


2. Offload archived data from DW back to data lake
3. Ingest a new type of data to allow time for longer-term planning
