Hive Performance With Different File Formats
Performance Metrics of Hive Queries
Table of Contents
1 Introduction
  1.1 Overview
  1.2 Introduction to the Apache Avro file format
  1.3 Introduction to the Apache Parquet file format
2 Case Study
  2.1 Objective
  2.2 Extracting the data from Teradata
  2.3 Loading the data extracted from Sqoop into Hive tables
  2.4 Conclusion
1 INTRODUCTION
1.1 Overview
The purpose of this document is to discuss the performance trade-offs of Hive queries using different file formats and to suggest the best file format for Hive storage.
1.3 Introduction to the Apache Parquet file format
Parquet is built to support very efficient compression and encoding schemes. Parquet allows compression schemes to be specified on a per-column level, and it is future-proofed to allow adding more encodings as they are invented and implemented. Parquet is built to be used by anyone: the Hadoop ecosystem is rich with data processing frameworks, and an efficient, well-implemented columnar storage substrate should be useful to all of them without the cost of extensive and difficult-to-set-up dependencies.
2 CASE STUDY
2.1 Objective
Load the data from a Teradata table, approximately 40 GB in size, into Hive tables in order to compare the query execution times of the different file formats.
2.2 Extracting the data from Teradata
The data was extracted from the Teradata table using Sqoop. Since Sqoop does not support importing the data in Parquet format directly, the data extracted by Sqoop must afterwards be converted into Parquet format; a sketch of such an import is shown below.
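A minimal sketch of the Sqoop import into Avro format, as a point of reference: the host, database, credentials, table name, target directory, and mapper count are hypothetical placeholders, and the exact Teradata connector and driver options may differ per environment.
sqoop import \
  --connect jdbc:teradata://<teradata_host>/DATABASE=<database> \
  --driver com.teradata.jdbc.TeraDriver \
  --username <user> --password <password> \
  --table <source_table> \
  --as-avrodatafile \
  --target-dir hdfs:///path/to/avro/data \
  --num-mappers 8
The --as-avrodatafile option makes Sqoop write Avro container files (part-m-*.avro), which is the starting point for the Parquet conversion described next.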
Two approaches to convert the extracted data into Parquet format are listed below:
ii. Convert the Avro files extracted by Sqoop into Parquet using a custom MapReduce job, as described in the following steps:
1) Use the source code below to create a jar that converts an Avro data file into a Parquet data file.
Main Java code
File name: Avro2Parquet.java
package com.cloudera.science.avro2parquet;
import java.io.InputStream;
import org.apache.avro.Schema;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import parquet.avro.AvroParquetOutputFormat;
import parquet.avro.AvroSchemaConverter;
import parquet.hadoop.metadata.CompressionCodecName;
public class Avro2Parquet extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // Arguments: <avro schema file> <avro input path> <parquet output path>
    Path schemaPath = new Path(args[0]);
    Path inputPath = new Path(args[1]);
    Path outputPath = new Path(args[2]);

    Job job = Job.getInstance(getConf());
    job.setJarByClass(Avro2Parquet.class);
    Configuration conf = job.getConfiguration();

    // Read the Avro schema from HDFS
    FileSystem fs = FileSystem.get(conf);
    InputStream in = fs.open(schemaPath);
    Schema avroSchema = new Schema.Parser().parse(in);

    // Map-only job: read Avro records and write them back out as Parquet
    FileInputFormat.setInputPaths(job, inputPath);
    job.setInputFormatClass(AvroKeyInputFormat.class);
    job.setMapperClass(Avro2ParquetMapper.class);
    job.setNumReduceTasks(0);

    job.setOutputFormatClass(AvroParquetOutputFormat.class);
    AvroParquetOutputFormat.setOutputPath(job, outputPath);
    AvroParquetOutputFormat.setSchema(job, avroSchema);
    AvroParquetOutputFormat.setCompression(job, CompressionCodecName.SNAPPY);

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new Avro2Parquet(), args);
    System.exit(exitCode);
  }
}
File name: Avro2ParquetMapper.java

package com.cloudera.science.avro2parquet;

import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Mapper;

public class Avro2ParquetMapper
    extends Mapper<AvroKey<GenericRecord>, NullWritable, Void, GenericRecord> {

  @Override
  protected void map(AvroKey<GenericRecord> key, NullWritable value,
      Context context) throws IOException, InterruptedException {
    // Identity map: emit each Avro record unchanged; the job's
    // AvroParquetOutputFormat writes it out in Parquet format.
    context.write(null, key.datum());
  }
}
3) Use the jar to convert the data from Avro format into Parquet format:
hadoop jar <avro2parquet jar file> \
com.cloudera.science.avro2parquet.Avro2Parquet \
<and generic options to the JVM> \
hdfs:///path/to/avro/schema.avsc \
hdfs:///path/to/avro/data \
hdfs:///output/path
The stats below indicate the file size comparison across the different file storage types.
As can be seen, the file size decreases drastically as we move from the text file to its Snappy-compressed form and from the Avro file to compressed Avro, and in the case of Parquet, even without any compression, the file size comes down by 85% of the original size.
2.3 Loading the data extracted from Sqoop into Hive tables
1) Loading the text file into a Hive table
This is a straightforward load of the data into a Hive table; a sketch is shown below.
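A minimal sketch of such a table, assuming hypothetical column names and Sqoop's default comma-delimited text output; the real DDL mirrors the Teradata source table, which this document does not show:
-- Hypothetical columns for illustration only
CREATE EXTERNAL TABLE text_table (
  id INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '<hdfs_path>';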
2) Loading the Avro file into a Hive table
CREATE EXTERNAL TABLE avro_table
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '<hdfs_path>'
TBLPROPERTIES (
'avro.schema.url'='<hdfs_path>/avro_schema.avsc');
As can be seen above, to create an Avro Hive table it is necessary to specify the Avro schema. Defining the Avro schema manually can be difficult, as it requires a thorough knowledge of JSON; hence we can follow the steps below to extract the schema from the Avro data file itself instead of defining it manually for each of the tables.
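A minimal sketch of this extraction using Avro's avro-tools utility; the jar version and part-file name are placeholders, and this assumes the Sqoop output is a standard Avro container file:
# Dump the schema embedded in one of the Avro part files to a local .avsc file
hadoop jar avro-tools-1.7.7.jar getschema \
    hdfs:///path/to/avro/data/part-m-00000.avro > avro_schema.avsc

# Put the schema file on HDFS so that avro.schema.url can reference it
hadoop fs -put avro_schema.avsc <hdfs_path>/avro_schema.avsc
Because every Avro container file embeds its own schema, this avoids hand-writing the JSON schema for each table.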
3) Loading the Parquet data into a Hive table
Load the converted Parquet data into a newly created Parquet Hive table; a sketch of the table DDL is shown below.
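The Parquet DDL itself does not appear in this section; a minimal sketch, assuming the same hypothetical columns as above and the STORED AS PARQUET shorthand available from Hive 0.13 onwards:
-- Hypothetical columns for illustration only
CREATE EXTERNAL TABLE parquet_table (
  id INT,
  name STRING
)
STORED AS PARQUET
LOCATION '<hdfs_path>';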
The stats below indicate the query response time with the different file storage types.
2.4 Conclusion
Comparing all three formats, Parquet storage compresses the data file to a great extent and its query response time is also much faster than that of the other two formats; hence the Parquet format looks to be the undisputed winner in this scenario.
Since the Parquet format is columnar, it might not work as efficiently as in the above use case when entire rows need to be accessed.
Revision History

Date       Version   Description   Author
27-Nov-14  1.0       Created       Mohammed Danesh Guard