Unit 3

What is HIVE

Hive is a data warehouse system used to analyze structured data. It is
built on top of Hadoop.

Hive was initially developed by Facebook; later, the Apache Software
Foundation took it up and developed it further as an open-source project
under the name Apache Hive. It is used by many companies. For example,
Amazon uses it in Amazon Elastic MapReduce.

Hive provides the functionality of reading, writing, and managing large
datasets residing in distributed storage. It runs SQL-like queries,
written in HQL (Hive Query Language), which are internally converted to
MapReduce jobs.
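As a sketch of this translation, consider a simple HiveQL aggregation (the table and column names here are hypothetical, for illustration only):

```sql
-- Hypothetical table employees(department STRING, salary DOUBLE).
-- Hive compiles this GROUP BY into a MapReduce job behind the scenes:
-- the map phase emits (department, salary) pairs and the reduce phase
-- computes the average per department.
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;
```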

Using Hive, we can avoid the traditional approach of writing complex
MapReduce programs. Hive supports Data Definition Language (DDL), Data
Manipulation Language (DML), and User-Defined Functions (UDFs).
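For illustration, a minimal sketch of Hive DDL and DML (the table schema and HDFS path are hypothetical):

```sql
-- DDL: define a table over comma-delimited text files (hypothetical schema).
CREATE TABLE IF NOT EXISTS employees (
  id     INT,
  name   STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- DML: load a file from HDFS into the table (hypothetical path).
LOAD DATA INPATH '/user/hive/input/employees.csv' INTO TABLE employees;
```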

Hive is not
 A relational database
 A design for OnLine Transaction Processing (OLTP)
 A language for real-time queries and row-level updates

Advantages of Hive

 Keeps queries running fast
 Writing a Hive query takes far less time than writing the equivalent
MapReduce code
 HiveQL is a declarative language like SQL
 Provides structure on a variety of data formats
 Multiple users can query the data with HiveQL
 Queries, including joins, are very easy to write in Hive
 Simple to learn and use

Disadvantages of Hive
 Useful only when the data is structured
 Debugging code is very difficult
 Complicated operations are hard to express

Features of Hive
 It stores the schema in a database and the processed data in HDFS.
 It is designed for OLAP.
 It provides SQL type language for querying called HiveQL or
HQL.
 It is familiar, fast, scalable, and extensible.

Architecture of Hive

Unit Name Operation

User Interface Hive is data warehouse infrastructure software that
enables interaction between the user and HDFS. The user interfaces that
Hive supports are the Hive Web UI, the Hive command line, and Hive HD
Insight (on Windows Server).

Meta Store Hive chooses respective database servers to store the schema
or metadata of tables, databases, columns in a table, their data types,
and the HDFS mapping.

HiveQL Process Engine HiveQL is similar to SQL and queries the schema
information in the Metastore. It is one of the replacements for the
traditional approach to MapReduce programs: instead of writing a
MapReduce program in Java, we can write a query and have it processed as
a MapReduce job.

Execution Engine The Execution Engine is the conjunction of the HiveQL
Process Engine and MapReduce. It processes the query and generates the
same results as MapReduce, using the MapReduce flavor.

HDFS or HBASE The Hadoop Distributed File System or HBase is the data
storage technique used to store the data in the file system.

Working of Hive

The following table describes how Hive interacts with the Hadoop framework:

Step No. Operation


1 Execute Query
The Hive interface, such as the Command Line or Web UI, sends the query to the
Driver (any database driver such as JDBC, ODBC, etc.) for execution.

2 Get Plan
The driver takes the help of the query compiler, which parses the query to
check the syntax and build the query plan.

3 Get Metadata
The compiler sends a metadata request to the Metastore (any database).

4 Send Metadata
The Metastore sends the metadata as a response to the compiler.

5 Send Plan
The compiler checks the requirements and resends the plan to the driver. Up to
this point, the parsing and compiling of the query is complete.

6 Execute Plan
The driver sends the execution plan to the execution engine.

7 Execute Job
Internally, the execution of the job is a MapReduce job. The execution engine
sends the job to the JobTracker (which resides on the Name Node); the
JobTracker assigns the job to TaskTrackers (which reside on Data Nodes). Here,
the query runs as a MapReduce job.

8 Fetch Result
The execution engine receives the results from the Data Nodes.

9 Send Results
The execution engine sends those resultant values to the driver.

10 Send Results
The driver sends the results to the Hive interfaces.

Hive Data Types
Hive data types are categorized into numeric types, string types,
miscellaneous types, and complex types. A list of Hive data types is given below.

Integer Types
Type Size Range

TINYINT 1-byte signed integer -128 to 127

SMALLINT 2-byte signed integer -32,768 to 32,767

INT 4-byte signed integer -2,147,483,648 to 2,147,483,647

BIGINT 8-byte signed integer -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807

Decimal Type
Type Size Description

FLOAT 4-byte Single precision floating point number

DOUBLE 8-byte Double precision floating point number

String Types
String-type data can be specified using single quotes (' ') or double quotes (" ").
Hive provides two character data types, VARCHAR and CHAR, and follows C-style
escape characters. The following table depicts the character data types:

Data Type Length

VARCHAR 1 to 65535

CHAR 1 to 255

Timestamp
Hive supports the traditional UNIX timestamp with optional nanosecond
precision, using the java.sql.Timestamp format "yyyy-mm-dd hh:mm:ss.fffffffff".
Dates
DATE values describe a year/month/day in the form YYYY-MM-DD.
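As a small sketch (the table and values are hypothetical), DATE and TIMESTAMP columns can be declared and populated with standard literals:

```sql
-- Hypothetical table with date/time columns.
CREATE TABLE events (
  event_name STRING,
  event_date DATE,       -- literal form: DATE 'YYYY-MM-DD'
  created_at TIMESTAMP   -- literal form: TIMESTAMP 'yyyy-mm-dd hh:mm:ss.fffffffff'
);

INSERT INTO events VALUES
  ('launch', DATE '2023-05-01', TIMESTAMP '2023-05-01 10:15:30.123');
```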

Null Value
Missing values are represented by the special value NULL.

Complex Type
Type Description Example

Struct Similar to a C struct or an object; fields are accessed using "dot" notation. struct('James','Roy')

Map A collection of key-value tuples; fields are accessed using array notation. map('first','James','last','Roy')

Array A collection of values of the same type, indexable using zero-based integers. array('James','Roy')
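As a hedged sketch (the table and field names are hypothetical), the three complex types can be declared and accessed like this:

```sql
-- Hypothetical table using the three complex types.
CREATE TABLE people (
  name   STRUCT<first: STRING, last: STRING>,  -- accessed with dot notation
  phones MAP<STRING, STRING>,                  -- accessed with ['key'] notation
  skills ARRAY<STRING>                         -- accessed with [index], zero-based
);

-- Accessing the fields:
SELECT name.first,      -- struct field via dot notation
       phones['home'],  -- map value via its key
       skills[0]        -- first array element (zero-based)
FROM people;
```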


Differences Between Pig and Hive


Pig                                            Hive

Operates on the client side of a cluster.      Operates on the server side of a cluster.

Procedural data-flow language.                 Declarative SQL-like language.

Pig is used for programming.                   Hive is used for creating reports.

Mainly used by researchers and programmers.    Used by data analysts.

Used for handling structured and semi-structured data.    Used for handling structured data only.

Scripts end with the .pig extension.           Hive supports all extensions.

Supports the Avro file format.                 Does not support the Avro file format.

Does not have a dedicated metadata database.   Defines tables beforehand using an exact variation of a dedicated SQL-DDL language.
