Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Tanel Poder
Who we are
• Enterprise database & performance background (Oracle focused)
• “All enterprise data can just be a query away”
• Gluent Data Platform

• Supports all major Hadoop distributions, on-premises or in the cloud
• Consolidates data into a centralized location in open data formats
• Transparent Data Virtualization provides simple data sharing across the enterprise
Enterprise Applications run on Enterprise Databases
[Diagram: enterprise databases alongside Big Data and IoT data sources]
… but traditional databases don’t cut it anymore!
Gluent Data Virtualization
What is Apache Hive?
• An open source big data warehouse system on Hadoop
• Metadata + table structure + data access
• SQL layer over HDFS, cloud storage (HiveQL)

• Cost-based optimizer, indexing, partitions, etc.
• Used for:
• Access to huge datasets
• Parse large text files, log files, JSON (schema-on-read)
• Binary, columnar storage (schema-on-write)
• Very large queries (running for hours, days)
• Enterprise Data Warehouse offload
• Integration with Business Intelligence tools (fast, interactive queries)
• Insert, Update, Delete, Merge data in Hadoop
More on these later!
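The schema-on-read idea in the list above can be sketched in a few lines of Python. This is an illustration only, not Hive code: the sample log lines, the `SCHEMA` mapping, and the `read_with_schema` helper are all made up for the example. The raw text stays untouched, and column names and types are applied only when the data is read.

```python
import json

# Hypothetical raw log lines, stored as plain text with no enforced structure.
RAW_LINES = [
    '{"user": "ann", "bytes": "512", "path": "/home"}',
    '{"user": "bob", "bytes": "2048"}',
]

# Hypothetical schema, applied at read time: column name -> type cast.
SCHEMA = {"user": str, "bytes": int}

def read_with_schema(lines, schema):
    """Parse each raw line and project it onto the schema while reading."""
    for line in lines:
        record = json.loads(line)
        yield {
            col: (cast(record[col]) if col in record else None)
            for col, cast in schema.items()
        }

rows = list(read_with_schema(RAW_LINES, SCHEMA))
print(rows)
```

Fields that are not in the schema (here `path`) are simply ignored at read time, and a different schema could be applied to the same files later.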
Apache Hive - a brief history
• 2007: Facebook created the first SQL abstraction layer for writing MapReduce Java code to access data in Hadoop, called Hive
• 2008: Apache Hive incubating project created
• 2010: Apache Hive first release (v0.3)
• 2013: Hortonworks announces the Stinger initiative - promising 100x faster Hive (https://hortonworks.com/blog/100x-faster-hive/)
• 2013: Hive on Tez released via Hortonworks Data Platform 2.0
• 2016: Hive LLAP included in Apache Hive 2.0
• 2016: Hive LLAP included in Azure HDInsight
Hive data processing engines
• MapReduce
• Original data processing framework for Hadoop
• Map: filtering, sorting, etc.
• Reduce: aggregate (sum, count, etc.)
• Each Map + Reduce intermediate result is written to disk (I/O intensive)
• Apache Tez
• Built on top of YARN
• Dataflow graph - processing steps defined before the job begins
• Low latency, high throughput
• Intermediate results transferred via memory
Source: https://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/9
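The Map and Reduce roles described above can be sketched in plain Python. This is a toy, single-process illustration and not Hadoop code: the `map_phase` and `reduce_phase` names and the sample rows are assumptions made for the example. The `intermediate` list stands in for the result that MapReduce would write to disk between the two phases.

```python
from collections import defaultdict

# Hypothetical sample rows: (city, amount).
ROWS = [("Austin", 10), ("Boston", 5), ("Austin", 7), ("Boston", -1), ("Denver", 3)]

def map_phase(rows):
    """Map: filter out non-positive amounts, emit (key, value) pairs sorted by key."""
    pairs = [(city, amount) for city, amount in rows if amount > 0]
    return sorted(pairs)

def reduce_phase(pairs):
    """Reduce: aggregate the values for each key (here, a sum)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# In MapReduce this intermediate result is written to disk before the reduce
# step reads it back; with Tez it can be handed over in memory.
intermediate = map_phase(ROWS)
totals = reduce_phase(intermediate)
print(totals)
```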
Apache Tez
• YARN-based framework for data processing applications in Hadoop
• Used by Apache Hive, Apache Pig, and others
• Can execute a complex Directed Acyclic Graph (DAG) when processing data
• Any given SQL query can be expressed as a single job
• Unlike in MapReduce, data is not physically stored between tasks
• Data processing defined as a “graph”
• Vertices - the processing of data (where the query logic resides)
• Edges - movement of data in-between processing (task routing/scheduling)
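The vertex and edge model above can be sketched with Python's standard `graphlib` module. This is only an illustration of a DAG with a valid execution order, not the Tez API: the vertex names and the `run_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical query plan. Each key is a vertex (a processing step); each value
# is the set of upstream vertices whose output it consumes (the edges).
dag = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
    "sort": {"aggregate"},
}

def run_order(graph):
    """Return one valid execution order: every vertex comes after its inputs."""
    return list(TopologicalSorter(graph).static_order())

order = run_order(dag)
print(order)
```

Because the whole graph is known before execution starts, the five steps form a single job rather than a chain of separate jobs.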
Hive performance optimizations
• Query vectorization
• Process rows in "blocks" of 1024 containing vectors of column values
• Improves operations of scans, aggregations, filters, and joins
• Partitioning
• Reduce the amount of data read to improve I/O
• Each partition becomes a directory
• Bucketing
• Similar to hash subpartitioning
• Each bucket becomes a file
• Best for high cardinality columns
• ORC file format
• Columnar data compression
• Built-in "storage indexes"
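The partitioning and bucketing bullets above can be sketched as follows. This is a simplified model, not Hive's implementation: the `/warehouse` path, the table and column names, the bucket count, and the file-name pattern are illustrative, and `zlib.crc32` is only a stand-in for Hive's own hash function.

```python
import zlib

NUM_BUCKETS = 4  # hypothetical bucket count

def partition_dir(table, sale_date):
    """Each partition value maps to its own directory under the table."""
    return f"/warehouse/{table}/sale_date={sale_date}"

def bucket_file(customer_id):
    """Hash the bucketing column and take it modulo the bucket count.
    crc32 is a stand-in here; Hive uses its own hash function."""
    bucket = zlib.crc32(str(customer_id).encode()) % NUM_BUCKETS
    return f"{bucket:06d}_0"

def prune(partitions, wanted_date):
    """Partition pruning: only directories matching the predicate are read."""
    return [p for p in partitions if p.endswith(f"sale_date={wanted_date}")]

dirs = [partition_dir("sales", d) for d in ("2017-01-01", "2017-01-02", "2017-01-03")]
pruned = prune(dirs, "2017-01-02")
print(pruned)              # one directory instead of three
print(bucket_file(12345))  # the same key always lands in the same bucket file
```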
Hive on Tez is great, but what is missing?
Fast, sub-second query response time!
Introducing Hive LLAP
(Now called Hive Interactive Query)
• “Live Long and Process” or “Low Latency Analytical Processing”
• Not an execution engine (like Tez); LLAP simply enhances the Hive execution model
• Built for fast query response time against smaller data volumes
• Allows concurrent execution of analytical workloads
• Intelligent memory caching for quick startup and data sharing
• Caches most active data in RAM
• Shared cache across clients
• Persistent server used to instantly execute queries
• LLAP daemons are “always on”
• Data passed to execution as it becomes ready
LLAP Daemons: HiveServer2 Interactive
LLAP architecture
[Diagram: Hive 2 with LLAP: Architecture Overview. ODBC / JDBC SQL queries reach HiveServer2 (query endpoint) and its query coordinators. Persistent LLAP daemons run in the YARN cluster, each with query executors and an in-memory cache (shared across all users). Deep storage: HDFS and compatible (S3, WASB, Isilon).]
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Source: https://www.slideshare.net/Hadoop_Summit/an-apache-hive-based-data-warehouse-80225129
Hive data processing
[Diagram comparing MapReduce, Tez, and Tez with LLAP. MapReduce: write resultset to disk after each operation. Tez with LLAP: data cached in-memory & shared across clients.]
Source: https://hortonworks.com/blog/announcing-apache-hive-2-1-25x-faster-queries-much/
Query performance - Tez vs Tez + LLAP
Source: https://hortonworks.com/blog/announcing-apache-hive-2-1-25x-faster-queries-much/
LLAP features
• Persistent columnar caching
• Data, metadata, and indexes are cached in-memory
• Clients share cached data for faster processing (less I/O, less CPU)
• Query fragments
• Higher priority queries can pre-empt other queries
• Fragmenting allows lower priority queries to continue, even if pre-empted
• Smarter map joins
• Build the hash table once and cache it in-memory for sharing with other processes
• Hybrid execution
• Hive queries can run in LLAP, Tez, or a hybrid of both
• Multi-threaded processing
• Data reads, decoding, and processing executed on separate threads
• Dynamic runtime filtering
• Bloom filter automatically built to eliminate rows that cannot match
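The dynamic runtime filtering bullet above can be sketched with a small Bloom filter. This is a minimal illustration of the technique, not LLAP's implementation: the `BloomFilter` class, its sizing, and the sample tables are assumptions made for the example. A Bloom filter can return false positives but never false negatives, so it is safe for discarding rows that cannot match before the real join runs.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

# Build the filter from the small side of the join (hypothetical keys).
dim_keys = {3, 17, 42}
bloom = BloomFilter()
for key in dim_keys:
    bloom.add(key)

# Probe while scanning the large side: rows that cannot match are dropped early.
fact_rows = [(1, "a"), (3, "b"), (42, "c"), (99, "d")]
candidates = [row for row in fact_rows if bloom.might_contain(row[0])]

# The exact join check still runs, because the filter may let false positives through.
joined = [row for row in candidates if row[0] in dim_keys]
print(joined)
```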
Caching efficiently - LLAP’s tricks
LLAP’s cache is decentralized, columnar, automatic, additive, packed, and layered
• There is no centralized store of “what’s cached and where” - the cache side-steps the block metadata size concerns.
• The cache does not contain any dead columns. If you run TPC-H with LLAP, you’ll notice it never caches billions of values in L_COMMENT.
• Admins don’t need to run “cache table” or new partitions as they are created. Data updates are detected as well.
• When a new column or partition is used, the cache adds to itself incrementally - unlike immutable caches.
• Caches data with intact dictionary and RLE encodings, to reduce footprint.
• Caches ORC indexes which trigger skips too - a scan for city = ‘San Francisco’ allows city = ‘Los Angeles’ to use cached index data to skip.
Source: https://www.slideshare.net/t3rmin4t0r/llap-locality-is-dead/13
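The index-driven skipping in the city example above can be sketched with per-row-group minimum and maximum values. This is a simplified model, not the ORC reader: the `RowGroup` class, the `scan` helper, and the sample data are hypothetical. The same small index is consulted for both predicates, which is why caching it helps a second query that filters on a different city.

```python
from dataclasses import dataclass

@dataclass
class RowGroup:
    min_city: str   # index entry: smallest value in the group
    max_city: str   # index entry: largest value in the group
    rows: list      # the actual column values (expensive to read)

def scan(row_groups, city):
    """Return matching rows and how many row groups were skipped via the index."""
    matches, skipped = [], 0
    for group in row_groups:
        # If the value falls outside [min, max], the group cannot contain it.
        if city < group.min_city or city > group.max_city:
            skipped += 1
            continue
        matches.extend(value for value in group.rows if value == city)
    return matches, skipped

groups = [
    RowGroup("Austin", "Boston", ["Austin", "Boston", "Austin"]),
    RowGroup("Los Angeles", "Miami", ["Los Angeles", "Miami"]),
    RowGroup("Portland", "Seattle", ["Portland", "San Francisco", "Seattle"]),
]

sf_rows, sf_skipped = scan(groups, "San Francisco")
la_rows, la_skipped = scan(groups, "Los Angeles")
print(sf_rows, sf_skipped)
print(la_rows, la_skipped)
```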
LLAP demo
Hive ACID - transactional operations in Hadoop
• Transactions on data stored in HDFS (no longer just INSERT!)
• Uses base files and delta files where insert, update, and delete operations are recorded
• Useful for:
• Slowly changing dimensions
• Data corrections
• Bulk updates
• Streaming ingest of data
• MERGE support now available
• Note: Hive transactions are not OLTP!

CREATE TABLE customers (
  name string,
  address string,
  city string,
  state string
) CLUSTERED BY (name) INTO 10 BUCKETS
STORED AS ORC
TBLPROPERTIES('transactional'='true');  -- Enable transactions
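The base-plus-delta idea above can be sketched as a merge at read time. This is a conceptual model only: the in-memory dictionaries and tuples stand in for files, and the `read_table` helper and its record layout are assumptions for the example, not Hive's actual file format or row identifiers.

```python
def read_table(base, deltas):
    """Apply delta files, in order, on top of the base file at read time."""
    rows = dict(base)  # copy so the base "file" itself is never modified
    for delta in deltas:
        for op, key, value in delta:
            if op in ("insert", "update"):
                rows[key] = value
            elif op == "delete":
                rows.pop(key, None)
    return rows

# Hypothetical base file: customer name -> city.
base = {"alice": "Austin", "bob": "Boston"}

# Hypothetical delta files, one per transaction, recorded after the base was written.
deltas = [
    [("update", "alice", "Denver"), ("insert", "carol", "Miami")],
    [("delete", "bob", None)],
]

current = read_table(base, deltas)
print(current)
```

Readers see the merged result while the base stays unchanged, which is how updates and deletes can be layered onto files that are written once.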
Hive roadmap
Source: https://hortonworks.com/apache/hive/#section_3
Azure HDInsight
Hive LLAP in the cloud
Microsoft Azure Hadoop Stack
Source: https://f.ch9.ms/public/MLDS2016/OptimizingApacheHivePerformanceHDInsight.pptx
Hadoop in the cloud
• Easy deployment
• Elasticity - expand or shrink resources as needed
• Launch transient services for “large” or temporary data processing
• Managed storage
• Never run out of space!
• Hardware maintenance is handled by the cloud provider
Hive LLAP performance on HDInsight
[Bar chart: Total Time (s), axis from 0 to 3000, for LLAP cached (Text), LLAP uncached (Text), LLAP cached (ORC), LLAP uncached (ORC), Presto (ORC), and Spark (Parquet). Bar values as extracted (order relative to the legend not preserved): 1478, 1878, 1061, 2216, 2416, 1503.]
Source: https://azure.microsoft.com/en-us/blog/hdinsight-interactive-query-performance-benchmarks-and-integration-with-power-bi-direct-query/
Gluent’s transparent data virtualization
[Diagram: Before: Application on Database. After: Application on a Database with much smaller footprint & cost, plus On Demand Compute, On Demand Data Access, “No-ETL” Data Sync, new analytic tools, and additional data sources. No existing app code changes!]
Gluent and Hive with LLAP
• Query performance is key for Gluent’s transparent data virtualization
Gluent + LLAP Demo
Thank you!
info@gluent.com
gluent.com
@gluent
