Speed Up Your Queries With Hive LLAP Engine On Hadoop or in The Cloud
Tanel Poder
Who we are
• Enterprise database & performance background (Oracle focused)
• “All enterprise data can just be a query away”
Enterprise Applications run on Enterprise Databases
What is Apache Hive?
• An open source big data warehouse system on Hadoop
• Metadata + table structure + data access
• Used for:
• Access to huge datasets
• Parse large text files, log files, JSON (schema-on-read)
• Binary, columnar storage (schema-on-write)
• Very large queries (running for hours, days)
• Enterprise Data Warehouse offload
• Integration with Business Intelligence tools (fast, interactive queries)
• Insert, Update, Delete, Merge data in Hadoop
More on these later!
Apache Hive - a brief history
• 2007: Facebook created Hive, the first SQL abstraction layer for writing MapReduce Java code to access data in Hadoop
• 2010: Apache Hive first release (v0.3)
• 2013: Hive on Tez released via Hortonworks Data Platform 2.0
• 2016: Hive LLAP included in Azure HDInsight
Hive data processing engines
• MapReduce
• Original data processing framework for Hadoop
• Map: filtering, sorting, etc.
• Reduce: aggregate (sum, count, etc.)
• Each Map + Reduce intermediate result is written to disk (I/O intensive)
• Apache Tez
• Built on top of YARN
• Dataflow graph - processing steps defined before the job begins
• Low latency, high throughput
• Intermediate results transferred via memory
Source: https://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/9
Apache Tez
• YARN-based framework for data processing applications in Hadoop
• Used by Apache Hive, Apache Pig, and others
• Can execute a complex Directed Acyclic Graph (DAG) when processing data
• Any given SQL query can be expressed as a single job
• Unlike in MapReduce, data is not physically stored between tasks
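A minimal sketch of choosing the engine for a session and looking at the resulting plan. The `web_logs` table and its columns are hypothetical; `hive.execution.engine` is a standard Hive setting, though the accepted values and the default depend on the Hive version and distribution.

```sql
-- Run this session's queries on Tez instead of MapReduce.
SET hive.execution.engine=tez;

-- EXPLAIN prints the query plan; with Tez the stage is shown as
-- a set of vertices (Map, Reducer) connected by edges, i.e. the DAG.
EXPLAIN
SELECT status, COUNT(*) AS hits
FROM web_logs            -- hypothetical table
GROUP BY status;
```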
Hive performance optimizations
• Query vectorization
• Process rows in batches ("blocks") of 1024, each holding vectors of column values
• Speeds up scans, aggregations, filters, and joins
• Partitioning
• Reduce the amount of data read to improve I/O
• Each partition becomes a directory
• Bucketing
• Similar to hash subpartitioning
• Each bucket becomes a file
• Best for high cardinality columns
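The three optimizations above can be sketched in HiveQL as follows. The `sales` table, its columns, and the bucket count are hypothetical and chosen only for illustration; whether vectorization is already on by default depends on the Hive version.

```sql
-- Vectorized execution: process batches of rows instead of one row at a time.
SET hive.vectorized.execution.enabled=true;

-- Hypothetical table: one directory per sale_date partition,
-- 16 bucket files per partition, hashed on customer_id.
CREATE TABLE sales (
  customer_id BIGINT,
  amount      DECIMAL(10,2)
)
PARTITIONED BY (sale_date STRING)
CLUSTERED BY (customer_id) INTO 16 BUCKETS
STORED AS ORC;

-- Filtering on the partition column lets Hive read only the
-- matching partition directory instead of the whole table.
SELECT customer_id, SUM(amount) AS total
FROM sales
WHERE sale_date = '2018-01-15'
GROUP BY customer_id;
```

Partition on a low-cardinality column that queries filter on (here a date), and bucket on a high-cardinality column such as a customer key.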
Hive on Tez is great, but what is missing?
Now called Hive Interactive Query
Introducing Hive LLAP
• “Live Long and Process” or “Low Latency Analytical Processing”
• Not an execution engine (like Tez); LLAP enhances the Hive execution model
• Built for fast query response time against smaller data volumes
• Allows concurrent execution of analytical workloads
LLAP Daemons: HiveServer2 Interactive
LLAP architecture
• Storage layer: HDFS and compatible stores (S3, WASB, Isilon)
Source: https://www.slideshare.net/Hadoop_Summit/an-apache-hive-based-data-warehouse-80225129
Hive data processing
Source: https://hortonworks.com/blog/announcing-apache-hive-2-1-25x-faster-queries-much/
Query performance - Tez vs Tez + LLAP
Source: https://hortonworks.com/blog/announcing-apache-hive-2-1-25x-faster-queries-much/
LLAP features
• Persistent columnar caching
• Data, metadata, and indexes are cached in-memory
• Clients share cached data for faster processing (less I/O, less CPU)
• Query fragments
• Higher priority queries can pre-empt other queries
• Fragmenting allows lower priority queries to continue, even if pre-empted
• Smarter map joins
• Build the hash table once and cache it in-memory for sharing with other processes
• Hybrid execution
• Hive queries can run in LLAP, Tez, or a hybrid of both
• Multi-threaded processing
• Data reads, decoding, and processing executed on separate threads
• Dynamic runtime filtering
• Bloom filter automatically built to eliminate rows that cannot match
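A hedged sketch of session settings that relate to the features above. These property names exist in Hive 2.x and later, but defaults and permitted values vary by version and distribution, and on managed clusters they are often fixed on the server side rather than per session.

```sql
-- Send query fragments to the LLAP daemons rather than fresh containers.
SET hive.execution.mode=llap;

-- Which parts of a query may run in LLAP: none, map, all, auto, or only.
-- "all" falls back to Tez containers where needed (hybrid execution).
SET hive.llap.execution.mode=all;

-- Use the LLAP I/O layer, which provides the in-memory columnar cache.
SET hive.llap.io.enabled=true;

-- Dynamic runtime filtering: build a Bloom filter from the small side
-- of a join and use it to skip non-matching rows on the large side.
SET hive.tez.dynamic.semijoin.reduction=true;
```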
Caching efficiently - LLAP’s tricks
LLAP’s cache is decentralized, columnar, automatic, additive, packed, and layered
Source: https://www.slideshare.net/t3rmin4t0r/llap-locality-is-dead/13
LLAP demo
Hive ACID - transactional operations in Hadoop
• Transactions on data stored in HDFS (no longer just INSERT!)
• Uses base files and delta files where insert, update, and delete operations are recorded
• Useful for
• Slowly changing dimensions
• Data corrections
• Bulk updates
• Streaming ingest of data

CREATE TABLE customers (
  name string,
  address string,
  city string,
  state string
) clustered by (name) into 10 buckets
STORED AS ORC
TBLPROPERTIES('transactional'='true');
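A minimal sketch of the transactional operations against the `customers` table defined above. The row values and the `customer_updates` staging table are hypothetical. It assumes ACID is enabled on the cluster (for example `hive.support.concurrency=true` and the `DbTxnManager` transaction manager) and a Hive version that supports MERGE (2.2 or later).

```sql
-- Data correction: the change is recorded in a delta file.
UPDATE customers
SET city = 'Austin', state = 'TX'
WHERE name = 'Jane Doe';

-- Remove a record.
DELETE FROM customers
WHERE name = 'John Smith';

-- Bulk update / slowly changing dimension: apply a staging table
-- (hypothetical, same columns as customers) in one statement.
MERGE INTO customers AS t
USING customer_updates AS s
ON t.name = s.name
WHEN MATCHED THEN
  UPDATE SET address = s.address, city = s.city, state = s.state
WHEN NOT MATCHED THEN
  INSERT VALUES (s.name, s.address, s.city, s.state);
```

The bucketing column (`name`) is used only for matching here, since Hive does not allow updating the column a table is bucketed on.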
Hive roadmap
Source: https://hortonworks.com/apache/hive/#section_3
Azure HDInsight
Hive LLAP in the cloud
Microsoft Azure Hadoop Stack
Source: https://f.ch9.ms/public/MLDS2016/OptimizingApacheHivePerformanceHDInsight.pptx
Hadoop in the cloud
• Easy deployment
• Managed storage
• Never run out of space!
Hive LLAP performance on HDInsight
[Bar chart: Total Time (s), axis 0 to 3000, comparing LLAP cached (Text), LLAP uncached (Text), LLAP cached (ORC), LLAP uncached (ORC), Presto (ORC), and Spark (Parquet). Bar labels: 1478, 1878, 1061, 2216, 2416, 1503.]
Source: https://azure.microsoft.com/en-us/blog/hdinsight-interactive-query-performance-benchmarks-and-integration-with-power-bi-direct-query/
Gluent’s transparent data virtualization
• New analytic tools
• No existing app code changes!
Gluent + LLAP Demo
Thank you!
info@gluent.com
gluent.com
@gluent