
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud

Tanel Poder
Who we are
• Enterprise database & performance background (Oracle focused)
• “All enterprise data can just be a query away”

• Gluent Data Platform


• Supports all major Hadoop distributions, on-premises or in the cloud
• Consolidates data into a centralized location in open data formats
• Transparent Data Virtualization provides simple data sharing across the enterprise

Enterprise Applications run on Enterprise Databases

[Diagram: enterprise databases facing Big Data and IoT workloads]

… but traditional databases don’t cut it anymore!


Gluent Data Virtualization

What is Apache Hive?
• An open source big data warehouse system on Hadoop
• Metadata + table structure + data access

• SQL layer over HDFS, cloud storage (HiveQL)


• Cost based optimizer, indexing, partitions, etc

• Used for:
• Access to huge datasets
• Parse large text files, log files, JSON (schema-on-read)
• Binary, columnar storage (schema-on-write)
• Very large queries (running for hours, days)
• Enterprise Data Warehouse offload
• Integration with Business Intelligence tools (fast, interactive queries)
• Insert, Update, Delete, Merge data in Hadoop
More on these later!
Apache Hive - a brief history
• 2007: Facebook created Hive, the first SQL abstraction layer for writing MapReduce Java code to access data in Hadoop
• 2008: Apache Hive incubating project created
• 2010: Apache Hive first release (v0.3)
• 2013: Hortonworks announces the Stinger initiative, promising 100x faster Hive (https://hortonworks.com/blog/100x-faster-hive/)
• 2013: Hive on Tez released via Hortonworks Data Platform 2.0
• 2016: Hive LLAP included in Apache Hive 2.0
• 2016: Hive LLAP included in Azure HDInsight

Hive data processing engines
• MapReduce
• Original data processing framework for Hadoop
• Map: filtering, sorting, etc.
• Reduce: aggregate (sum, count, etc.)
• Each Map + Reduce intermediate result is written to disk (I/O intensive)

• Apache Tez
• Built on top of YARN
• Dataflow graph - processing steps defined before the job begins
• Low latency, high throughput
• Intermediate results transferred via memory

Source: https://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/9
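The Map and Reduce roles listed above can be illustrated with a minimal, single-process Python sketch. It is not Hadoop code: the function names and the sample lines are hypothetical, and a real MapReduce job would write the intermediate groups to disk between the two phases rather than keep them in memory.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Group mapped values by key and sort by key (the step between Map and Reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(groups):
    """Reduce: aggregate the values of each key (here, a sum)."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical input standing in for lines of a text file.
LINES = ["hive on tez", "hive llap", "tez"]
word_counts = reduce_phase(shuffle(map_phase(LINES)))
print(word_counts)
```

Chaining several such Map + Reduce pairs is where the disk I/O cost mentioned above accumulates, since each pair hands its result to the next one through storage.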

Apache Tez
• YARN based framework for data processing applications in Hadoop
• Used by Apache Hive, Apache Pig, and others

• Can execute a complex Directed Acyclic Graph (DAG) when processing data
• Any given SQL query can be expressed as a single job
• Data is not physically stored in between tasks as in MapReduce

• Data processing defined as a “graph”
• Vertices - the processing of data (where the query logic resides)
• Edges - movement of data in-between processing (task routing/scheduling)
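The vertex-and-edge model can be illustrated with a small Python sketch. This is not the Tez API: the vertex functions, the `EDGES` map, and the sample rows are hypothetical, and the standard-library `graphlib` module (Python 3.9+) stands in for a real scheduler. Each vertex holds a piece of query logic, each edge hands the upstream result to the next vertex in memory, and the whole query runs as one job.

```python
from graphlib import TopologicalSorter

# Hypothetical in-memory rows standing in for a table scan.
ORDERS = [
    {"city": "Austin", "amount": 10},
    {"city": "Boston", "amount": 5},
    {"city": "Austin", "amount": 7},
]

def scan(_inputs):
    """Vertex: read the source rows."""
    return list(ORDERS)

def filter_rows(inputs):
    """Vertex: keep rows with amount > 5."""
    (rows,) = inputs
    return [row for row in rows if row["amount"] > 5]

def aggregate(inputs):
    """Vertex: sum the amount per city."""
    (rows,) = inputs
    totals = {}
    for row in rows:
        totals[row["city"]] = totals.get(row["city"], 0) + row["amount"]
    return totals

# Vertices hold the query logic; edges name each vertex's upstream vertices.
VERTICES = {"scan": scan, "filter": filter_rows, "aggregate": aggregate}
EDGES = {"scan": [], "filter": ["scan"], "aggregate": ["filter"]}

def run_dag(vertices, edges):
    """Run every vertex once, in dependency order, passing results in memory."""
    results = {}
    for name in TopologicalSorter(edges).static_order():
        inputs = [results[upstream] for upstream in edges[name]]
        results[name] = vertices[name](inputs)
    return results

result = run_dag(VERTICES, EDGES)["aggregate"]
print(result)
```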

Hive performance optimizations
• Query vectorization
• Process rows in "blocks" of 1024 containing vectors of column values
• Improves operations of scans, aggregations, filters, and joins

• Partitioning
• Reduce the amount of data read to improve I/O
• Each partition becomes a directory

• Bucketing
• Similar to hash subpartitioning
• Each bucket becomes a file
• Best for high cardinality columns

• ORC file format


• Columnar data compression
• Built-in "storage indexes"
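The partitioning and bucketing bullets above can be sketched in Python. The `column=value` directory naming follows Hive's partition layout; everything else is an assumption made for illustration: the table name, the bucket file name, the bucket count, and the use of `zlib.crc32` as a stand-in for Hive's own hash function.

```python
import zlib

NUM_BUCKETS = 4  # hypothetical bucket count

def partition_dir(table, partition_col, value):
    """Each partition becomes a directory named column=value."""
    return f"{table}/{partition_col}={value}"

def bucket_id(key, num_buckets=NUM_BUCKETS):
    """Each row goes to bucket hash(key) mod N.

    crc32 is a deterministic stand-in; Hive uses its own hash function.
    """
    return zlib.crc32(key.encode("utf-8")) % num_buckets

def file_path(table, partition_col, partition_value, bucket_key):
    """Each bucket becomes a file inside the partition directory.

    The bucket_NNNNN file name is hypothetical, not Hive's naming scheme.
    """
    directory = partition_dir(table, partition_col, partition_value)
    return f"{directory}/bucket_{bucket_id(bucket_key):05d}"

def prune_partitions(partition_values, wanted):
    """Partition pruning: only directories matching the predicate are read."""
    return [value for value in partition_values if value in wanted]

print(file_path("sales", "year", "2017", "customer-42"))
print(prune_partitions(["2015", "2016", "2017"], {"2017"}))
```

A high cardinality bucketing column spreads rows evenly across the N files, which is why the slide recommends it; a column with only a few distinct values would leave most buckets empty.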

Hive on Tez is great, but what is missing?

Fast, sub-second query response time!

Introducing Hive LLAP
(Now called Hive Interactive Query)

Introducing Hive LLAP
• “Live Long and Process” or “Low Latency Analytical Processing”
• Not an execution engine (like Tez); LLAP simply enhances the Hive execution model
• Built for fast query response time against smaller data volumes
• Allows concurrent execution of analytical workloads

• Intelligent memory caching for quick startup and data sharing


• Caches most active data in RAM
• Shared cache across clients

• Persistent server used to instantly execute queries
• LLAP daemons are “always on”
• Data passed to execution as it becomes ready

LLAP Daemons: HiveServer2 Interactive

LLAP architecture

[Diagram: Hive 2 with LLAP: Architecture Overview. ODBC / JDBC SQL queries arrive at HiveServer2 (query endpoint), which hands them to query coordinators. Inside the YARN cluster, persistent LLAP daemons each run query executors and hold an in-memory cache shared across all users. Deep storage is HDFS or an HDFS-compatible store (S3, WASB, Isilon).]

Source: https://www.slideshare.net/Hadoop_Summit/an-apache-hive-based-data-warehouse-80225129
Hive data processing

[Diagram comparing MapReduce, Tez, and Tez with LLAP. MapReduce: write resultset to disk after each operation. Tez with LLAP: data cached in-memory & shared across clients.]

Source: https://hortonworks.com/blog/announcing-apache-hive-2-1-25x-faster-queries-much/
Query performance - Tez vs Tez + LLAP

Source: https://hortonworks.com/blog/announcing-apache-hive-2-1-25x-faster-queries-much/
LLAP features
• Persistent columnar caching
• Data, metadata, and indexes are cached in-memory
• Clients share cached data for faster processing (less I/O, less CPU)
• Query fragments
• Higher priority queries can pre-empt other queries
• Fragmenting allows lower priority queries to continue, even if pre-empted
• Smarter map joins
• Build the hash table once and cache it in-memory for sharing with other processes
• Hybrid execution
• Hive queries can run in LLAP, Tez, or a hybrid of both
• Multi-threaded processing
• Data reads, decoding, and processing executed on separate threads
• Dynamic runtime filtering
• Bloom filter automatically built to eliminate rows that cannot match
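The dynamic runtime filtering bullet can be illustrated with a minimal Bloom filter in Python. This is a sketch of the general technique, not Hive's implementation: the class, its sizing parameters, the SHA-256 based hashing, and the sample tables are all assumptions. The filter is built from the join keys of the small table, then used to discard rows of the large table that cannot match before the join runs.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

# Hypothetical small (dimension) and large (fact) sides of a join.
dim_keys = ["CA", "NY"]
fact_rows = [("CA", 1), ("TX", 2), ("NY", 3), ("WA", 4)]

bloom = BloomFilter()
for key in dim_keys:
    bloom.add(key)

# Runtime filter: drop fact rows whose key cannot be in the small table.
candidates = [row for row in fact_rows if bloom.might_contain(row[0])]

# The join itself still checks exact keys, so false positives are harmless.
key_set = set(dim_keys)
joined = [row for row in candidates if row[0] in key_set]
print(joined)
```

Because the filter can only err on the side of keeping a row, the exact join check afterwards guarantees a correct result; the filter only reduces how many rows reach it.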

Caching efficiently - LLAP’s tricks
LLAP’s cache is decentralized, columnar, automatic, additive, packed, and layered

• Decentralized: There is no centralized store of “what’s cached and where” - the cache side-steps the block metadata size concerns.
• Columnar: The cache does not contain any dead columns. If you run TPC-H with LLAP, you’ll notice it never caches billions of values in L_COMMENT.
• Automatic: Admins don’t need to run “cache table”, even for new partitions as they are created. Data updates are detected as well.
• Additive: When a new column or partition is used, the cache adds to itself incrementally - unlike immutable caches.
• Packed: Caches data with intact dictionary and RLE encodings, to reduce footprint.
• Layered: Caches ORC indexes which trigger skips too - a scan for city = ‘San Francisco’ allows a later scan for city = ‘Los Angeles’ to use cached index data to skip.

Source: https://www.slideshare.net/t3rmin4t0r/llap-locality-is-dead/13
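The index-skipping behaviour in the San Francisco / Los Angeles example can be sketched in Python. This is a simplified model, not LLAP or ORC code: the row groups, the `index_cache` dictionary, and the function names are hypothetical, and the min/max values are computed on first use here, whereas a real ORC file stores them in its index when the file is written.

```python
# Hypothetical row groups of a "city" column.
ROW_GROUPS = [
    ["Austin", "Boston", "Chicago"],
    ["Los Angeles", "Miami", "New York"],
    ["San Diego", "San Francisco", "Seattle"],
]

index_cache = {}  # row group number -> (min, max), shared by all scans
index_loads = 0   # how many times index data had to be fetched

def load_index(group_no):
    """Return the (min, max) index entry, fetching it only on a cache miss."""
    global index_loads
    if group_no not in index_cache:
        index_loads += 1
        values = ROW_GROUPS[group_no]
        index_cache[group_no] = (min(values), max(values))
    return index_cache[group_no]

def scan_equals(city):
    """Scan for city = <value>, skipping row groups whose range excludes it."""
    matches = []
    groups_read = 0
    for group_no, values in enumerate(ROW_GROUPS):
        low, high = load_index(group_no)
        if not (low <= city <= high):
            continue  # index says no row in this group can match
        groups_read += 1
        matches.extend(value for value in values if value == city)
    return matches, groups_read

first = scan_equals("San Francisco")   # populates the index cache
loads_after_first = index_loads
second = scan_equals("Los Angeles")    # reuses the cached index entries
loads_after_second = index_loads
print(first, second, loads_after_first, loads_after_second)
```

The second scan reads one row group and fetches no index data at all, which is the layered effect the slide describes: the index cached for one predicate serves a different predicate on the same column.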
LLAP demo

Hive ACID - transactional operations in Hadoop
• Transactions on data stored in HDFS (no longer just INSERT!)
• Uses base files and delta files where insert, update, and delete operations are recorded
• Useful for
• Slowly changing dimensions
• Data corrections
• Bulk updates
• Streaming ingest of data
• MERGE support now available
• Enable transactions with the 'transactional'='true' table property:

CREATE TABLE customers (
  name string,
  address string,
  city string,
  state string
) clustered by (name) into 10 buckets
STORED AS ORC
TBLPROPERTIES('transactional'='true');

• Note: Hive transactions are not OLTP!
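The base-and-delta idea can be illustrated with a simplified Python model. It is a sketch under stated assumptions, not Hive's on-disk format: the row ids, the tuple layout of a delta entry, and the `compact` helper are hypothetical. A reader sees the base rows with every delta applied in order, and compaction folds the deltas into a new base.

```python
def apply_deltas(base, deltas):
    """Return the rows a reader sees: base rows with deltas applied, oldest first."""
    rows = dict(base)
    for delta in deltas:
        for operation, row_id, values in delta:
            if operation == "delete":
                rows.pop(row_id, None)
            else:  # "insert" or "update" both set the current row version
                rows[row_id] = values
    return rows

def compact(base, deltas):
    """Fold all deltas into a new base and start with an empty delta list."""
    return apply_deltas(base, deltas), []

# Hypothetical base file contents, keyed by row id.
base = {
    1: {"name": "Ann", "city": "Austin"},
    2: {"name": "Bob", "city": "Boston"},
}

# Each transaction records its changes in its own delta.
deltas = [
    [("update", 1, {"name": "Ann", "city": "Dallas"})],
    [("delete", 2, None), ("insert", 3, {"name": "Cy", "city": "Miami"})],
]

current = apply_deltas(base, deltas)
new_base, new_deltas = compact(base, deltas)
print(current)
```

Merging deltas on every read is part of why this design suits bulk changes and slowly changing dimensions rather than high-rate single-row OLTP traffic.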

Hive roadmap

Source: https://hortonworks.com/apache/hive/#section_3
Azure HDInsight
Hive LLAP in the cloud

Microsoft Azure Hadoop Stack

Source: https://f.ch9.ms/public/MLDS2016/OptimizingApacheHivePerformanceHDInsight.pptx
Hadoop in the cloud
• Easy deployment

• Elasticity - expand or shrink resources as needed


• Launch transient services for “large” or temporary data processing

• Managed storage
• Never run out of space!

• Hardware maintenance is handled by the cloud provider

Hive LLAP performance on HDInsight
[Bar chart: Total Time (s), axis from 0 to 3000, comparing LLAP cached (Text), LLAP uncached (Text), LLAP cached (ORC), LLAP uncached (ORC), Presto (ORC), and Spark (Parquet). Bar labels as extracted: 1478, 1878, 1061, 2216, 2416, 1503.]

Source: https://azure.microsoft.com/en-us/blog/hdinsight-interactive-query-performance-benchmarks-and-integration-with-power-bi-direct-query/

Gluent’s transparent data virtualization

[Diagram: before and after. Before: an application on a database. After: the same application, with no existing app code changes, on a database with a much smaller footprint & cost, backed by on-demand compute, on-demand data access, and “No-ETL” data sync, and opened up to new analytic tools and additional data sources.]
Gluent and Hive with LLAP
• Query performance is key for Gluent’s transparent data virtualization

Gluent + LLAP Demo

Thank you!
info@gluent.com
gluent.com
@gluent
