
Deepshikha Agrawal Pushp
B.Sc. (IT), MBA (IT)
Certifications: Hadoop, Spark, Scala, Python, Tableau, ML
(Assistant Professor, JLBS)
Hive - A Warehousing Solution
Over a Map-Reduce Framework
Agenda

Why Hive?
What is Hive?
Hive Architecture
HiveQL
Data Analysts with Hadoop
Hive Key Principles
Challenges that Data Analysts faced

Data Explosion
- TBs of data are generated every day
Solution – HDFS to store the data, and the Hadoop Map-Reduce framework to
parallelize its processing.

What is the catch?

- Hadoop Map-Reduce is Java-intensive
- Thinking in the Map-Reduce paradigm can get tricky
What is Hive?
Apache Hive is a data warehouse system built to work on Hadoop. It is used for
querying and managing large datasets residing in distributed storage.
Before becoming an open-source Apache project, Hive originated at Facebook.
It provides a mechanism to project structure onto the data in Hadoop and to
query that data using a SQL-like language called HiveQL (HQL).
What is HQL?
Hive defines a simple SQL-like query language for querying and managing large
datasets, called HiveQL (HQL). It is easy to use if you are familiar with SQL.
Hive also allows programmers who are familiar with the MapReduce framework to
plug in custom mappers and reducers for more sophisticated analysis.
To install Hive, set the required properties in hive-site.xml and add the Hive
path in the .bashrc file, as sketched below.
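A minimal sketch of these settings; the install path and the property value below are assumptions and vary by environment:

    # In ~/.bashrc (assuming Hive is unpacked at /usr/local/hive):
    export HIVE_HOME=/usr/local/hive
    export PATH=$PATH:$HIVE_HOME/bin

    <!-- In $HIVE_HOME/conf/hive-site.xml: warehouse directory in HDFS -->
    <property>
      <name>hive.metastore.warehouse.dir</name>
      <value>/user/hive/warehouse</value>
    </property>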
What is Hive?

Apache Hive is an open-source data warehouse system built on top of Hadoop, used for
querying and analyzing large datasets stored in Hadoop files.
Initially, you had to write complex Map-Reduce jobs, but now, with the help of Hive,
you merely need to submit SQL-like queries. Hive is mainly targeted towards users who
are comfortable with SQL. Hive uses a language called HiveQL (HQL), which is similar to
SQL. HiveQL automatically translates SQL-like queries into MapReduce jobs.
Hive abstracts the complexity of Hadoop; the main thing to notice is that there is no
need to learn Java for Hive.
Hive generally runs on your workstation and converts your SQL query into a series of
jobs for execution on a Hadoop cluster. Apache Hive organizes data into tables, which
provides a means of attaching structure to data stored in HDFS.
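For instance, an analysis that would otherwise need a hand-written MapReduce job reduces to a single HiveQL statement (the employee table and its columns here are hypothetical):

    SELECT department, COUNT(*) AS headcount
    FROM employee
    GROUP BY department;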
History of Apache Hive

The Data Infrastructure Team at Facebook developed Hive, and it is one of the
technologies used to address Facebook's requirements. It is very popular with
users internally at Facebook, where it runs thousands of jobs on the cluster for
hundreds of users, across a wide variety of applications.
The Hive-Hadoop cluster at Facebook stores more than 2 PB of raw data and
regularly loads 15 TB of new data daily.
Hive is now used and developed by a number of companies such as Amazon, IBM,
Yahoo, Netflix, the Financial Industry Regulatory Authority (FINRA), and many others.
Why Apache Hive?

Let us now discuss the need for Hive.

Facebook faced a lot of challenges before implementing Apache Hive: the size of
the data being generated exploded, making it very difficult to handle, and the
traditional RDBMS could not cope with the load. As a result, Facebook looked for
better options and initially tried MapReduce. But MapReduce was difficult to
program, and most analysts knew SQL rather than Java, making it an impractical
solution. Apache Hive allowed them to overcome these challenges.
With Apache Hive, they are now able to do the following:
Schema flexibility and evolution
Tables can be partitioned and bucketed (see the sketch after this list)
Apache Hive tables are defined directly in HDFS
JDBC/ODBC drivers are available
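A minimal sketch of partitioning and bucketing in HiveQL; the sales table and its columns are hypothetical:

    CREATE TABLE sales (id INT, amount DOUBLE)
    PARTITIONED BY (sale_date STRING)
    CLUSTERED BY (id) INTO 4 BUCKETS;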
Apache Hive saves developers from writing complex Hadoop MapReduce jobs for
ad-hoc requirements: it provides summarization, analysis, and querying of data.
Hive is fast, scalable, and highly extensible. Since HiveQL is similar to SQL,
it becomes very easy for SQL developers to learn and write Hive queries.
Hive reduces the complexity of MapReduce by providing an interface where the user
can submit SQL queries. So business analysts can now work with Big Data using
Apache Hive and generate insights. It also provides file access to various data
stores such as HDFS and HBase. The most important feature of Apache Hive is that
to learn Hive we don't have to learn Java.
Hive Architecture

After the introduction to Apache Hive, we now discuss the major components of the
Hive architecture. The Apache Hive components are:
Metastore – Stores the metadata for each of the tables, such as their schema and
location, including the partition metadata. This helps the driver track the
progress of the various data sets distributed over the cluster. The metadata is
stored in a traditional RDBMS, and it is highly crucial, since it lets the driver
keep track of the data. A backup server regularly replicates the metadata so that
it can be retrieved in case of data loss.
Driver – Acts like a controller that receives the HiveQL statements. The driver
starts the execution of a statement by creating sessions, and it monitors the life
cycle and progress of the execution. It stores the necessary metadata generated
during the execution of a HiveQL statement and also acts as a collection point for
the data or query results obtained after the Reduce operation.
Compiler – Compiles the HiveQL query, converting it into an execution plan. The
plan contains the tasks, and the steps needed by MapReduce, to produce the output
the query asks for.
Executor – Once compilation and optimization are complete, the executor executes
the tasks, taking care of pipelining them.
CLI, UI, and Thrift Server – The CLI (command-line interface) provides a user
interface for an external user to interact with Hive. The Thrift server allows
external clients to interact with Hive over a network, similar to the JDBC or
ODBC protocols.
Features of Apache Hive

There are many features of Apache Hive. Let's discuss them one by one:
Hive provides data summarization, querying, and analysis in a much easier manner.
Hive supports external tables, which make it possible to process data without
actually storing it in HDFS.
Apache Hive fits the low-level interface requirement of Hadoop perfectly.
It supports partitioning of data at the table level to improve performance.
Hive has a rule-based optimizer for optimizing logical plans.
It is scalable, familiar, and extensible.
Using HiveQL doesn't require any programming-language knowledge; basic SQL is enough.
We can easily process structured data in Hadoop using Hive.
Querying in Hive is very simple, as it is similar to SQL.
We can also run ad-hoc queries for data analysis using Hive.
Limitations of Apache Hive

Hive has the following limitations:
Apache Hive does not offer real-time queries or row-level updates.
It does not provide low latency for interactive data browsing;
latency for Apache Hive queries is generally very high.
It is not good for online transaction processing.
What Hive is not
Some misconceptions occur about Hive, so let's clarify:
It is not a relational database.
It is not designed for OnLine Transaction Processing (OLTP).
It is not a language for real-time queries and row-level updates.
Apache Hive Data Types

Hive data types are used for specifying the column/field types in Hive tables.
Hive data types are classified into two major categories, primitive and complex;
let's discuss them one by one:
a. Primitive Data Types in Hive
Primitive data types are divided into 4 groups, which are as follows:
Numeric Data Types
Date/Time Data Types
String Data Types
Miscellaneous Data Types
i. Numeric Data Types
The Hive numeric data types are classified into two groups:
Integral Data Types
Floating-Point Data Types
* Integral Data Types
Integral Hive data types are as follows:
TINYINT (1-byte (8-bit) signed integer, from -128 to 127)
SMALLINT (2-byte (16-bit) signed integer, from -32,768 to 32,767)
INT (4-byte (32-bit) signed integer, from -2,147,483,648 to 2,147,483,647)
BIGINT (8-byte (64-bit) signed integer, from -9,223,372,036,854,775,808 to
9,223,372,036,854,775,807)
* Floating-Point Data Types
Floating-point Hive data types are as follows:
FLOAT (4-byte (32-bit) single-precision floating-point number)
DOUBLE (8-byte (64-bit) double-precision floating-point number)
DECIMAL (arbitrary-precision signed decimal number)
ii. Date/Time Data Types
The second category of Apache Hive primitive data types is the Date/Time data
types. The following Hive data types come into this category:
TIMESTAMP (timestamp with nanosecond precision)
DATE (date)
INTERVAL
iii. String Data Types
String data types are the third category of Hive data types. The following data
types come into this category:
STRING (Unbounded variable-length character string)
VARCHAR (Variable-length character string)
CHAR (Fixed-length character string)
iv. Miscellaneous Data Types
The miscellaneous category has two Hive data types:
BOOLEAN (true/false value)
BINARY (byte array)
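A minimal sketch of a table declaration using several primitive types; the table and column names are hypothetical:

    CREATE TABLE web_log (
      user_id     BIGINT,     -- integral
      status      SMALLINT,   -- integral
      bytes_sent  DOUBLE,     -- floating-point
      url         STRING,     -- string
      is_secure   BOOLEAN,    -- miscellaneous
      event_time  TIMESTAMP   -- date/time
    );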

b. Complex Data Types in Hive


The complex category contains the following Hive data types:
ARRAY
MAP
STRUCT
UNION
i. ARRAY
An ordered collection of fields. The fields must all be of the same type.
Syntax: ARRAY<data_type>
E.g. array(1, 2)
ii. MAP
An unordered collection of key-value pairs. Keys must be primitives; values may be any type.
For a particular map, the keys must be the same type, and the values must be the same type.
Syntax: MAP<primitive_type, data_type>
E.g. map('a', 1, 'b', 2)
iii. STRUCT
A collection of named fields. The fields may be of different types.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
E.g. struct('a', 1, 1.0), named_struct('col1', 'a', 'col2', 1, 'col3', 1.0)
iv. UNION
A value that may be one of a number of defined data types. The value is tagged
with an integer (zero-indexed) representing its data type in the union.
Syntax: UNIONTYPE<data_type, data_type, …>
E.g. create_union(1, 'a', 63)
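A minimal sketch putting the complex types together; the contacts table and its fields are hypothetical:

    CREATE TABLE contacts (
      name    STRING,
      phones  ARRAY<STRING>,
      emails  MAP<STRING, STRING>,
      address STRUCT<street: STRING, city: STRING, zip: INT>
    );

    -- Index into the array, look up a map key, and dereference a struct field
    SELECT name, phones[0], emails['work'], address.city FROM contacts;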
What are Hive Operators?

Apache Hive provides various built-in operators for data operations to be
implemented on the tables present inside the Apache Hive warehouse.
Hive operators perform comparisons, arithmetic, and other operations on operands,
returning a specific value as per the logic applied.
Types of Hive Built-in Operators
Relational Operators
Arithmetic Operators
Logical Operators
String Operators
Operators on Complex Types
1. Hive Relational Operators

These operators compare two operands and generate a TRUE or FALSE value. The list
below describes the relational operators available in Hive:

A = B (all primitive types): TRUE if expression A is equal to expression B; otherwise FALSE.
A != B (all primitive types): TRUE if expression A is not equal to expression B; otherwise FALSE.
A < B (all primitive types): TRUE if expression A is less than expression B; otherwise FALSE.
A <= B (all primitive types): TRUE if expression A is less than or equal to expression B; otherwise FALSE.
A > B (all primitive types): TRUE if expression A is greater than expression B; otherwise FALSE.
A >= B (all primitive types): TRUE if expression A is greater than or equal to expression B; otherwise FALSE.
A IS NULL (all types): TRUE if expression A evaluates to NULL; otherwise FALSE.
A IS NOT NULL (all types): FALSE if expression A evaluates to NULL; otherwise TRUE.
A LIKE B (strings): TRUE if string pattern A matches B; otherwise FALSE.
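A brief hedged example applying a few of these operators; the employee table and its columns are hypothetical:

    SELECT name
    FROM employee
    WHERE salary >= 40000
      AND name LIKE 'A%'
      AND manager_id IS NOT NULL;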
2. Hive Arithmetic Operators
Arithmetic operators in Hive support various arithmetic operations on the operands.
All return number types. If any of the operands is NULL, then the result is also NULL.

A + B (all number types): gives the result of adding A and B.
A - B (all number types): gives the result of subtracting B from A.
A * B (all number types): gives the result of multiplying A and B.
A / B (all number types): gives the result of dividing A by B.
A % B (all number types): gives the remainder resulting from dividing A by B.
A & B (all number types): gives the result of bitwise AND of A and B.
A | B (all number types): gives the result of bitwise OR of A and B.
A ^ B (all number types): gives the result of bitwise XOR of A and B.
~A (all number types): gives the result of bitwise NOT of A.
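A brief hedged sketch of the arithmetic and bitwise operators; the orders table and its columns are hypothetical:

    SELECT price * quantity AS total,
           quantity % 2     AS parity,
           flags & 4        AS masked
    FROM orders;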
3. Hive Logical Operators
Logical operators in Hive provide support for creating logical expressions. All
return boolean TRUE, FALSE, or NULL depending upon the boolean values of the
operands. Here NULL behaves as an "unknown" flag, so if the result depends on the
state of an unknown, the result itself is unknown.

A AND B (boolean): TRUE if both A and B are TRUE; NULL if A or B is NULL; otherwise FALSE.
A OR B (boolean): TRUE if either A or B (or both) is TRUE; NULL if A or B is NULL and neither is TRUE; otherwise FALSE.
NOT A (boolean): TRUE if A is FALSE; NULL if A is NULL; otherwise FALSE.
!A (boolean): same as NOT A.
4. Hive String Operators

A || B (strings): concatenates the operands – shorthand for concat(A, B).
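A brief hedged sketch combining logical and string operators; the customer table is hypothetical, and the || operator requires Hive 2.2 or later (concat() is the portable alternative):

    SELECT first_name || ' ' || last_name AS full_name
    FROM customer
    WHERE is_active AND NOT (country = 'US');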
Internal and External Tables in Hive

The concept of a table in Hive is very similar to a table in a relational database.
Each table is associated with a directory, configured in
${HIVE_HOME}/conf/hive-site.xml; by default it is under /user/hive/warehouse in
HDFS. For example, /user/hive/warehouse/employee is created by Hive in HDFS for
the employee table, and all the data in the table is kept in that directory. Such
Hive tables are referred to as internal, or managed, tables.
When the data is already in HDFS, an external Hive table can be created to
describe it. It is called EXTERNAL because the data in the external table is
specified by the LOCATION property instead of the default warehouse directory.
When keeping data in internal tables, Hive fully manages the life cycle of the
table and the data: the data is removed once the internal table is dropped. If an
external table is dropped, the table metadata is deleted but the data is kept.
Most of the time, an external table is preferred, to avoid deleting data along
with tables by mistake.
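A hedged sketch contrasting the two kinds of table; the names and HDFS path are hypothetical:

    -- Managed (internal) table: data lives under the warehouse directory
    CREATE TABLE employee_internal (id INT, name STRING);

    -- External table: data stays at the given location even if the table is dropped
    CREATE EXTERNAL TABLE employee_external (id INT, name STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/employee';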
Hive Commands:
Data Definition Language (DDL)
DDL statements are used to build and modify the tables and other objects in the
database.
Examples:
CREATE, DROP, TRUNCATE, ALTER, SHOW, DESCRIBE statements.
• To create a new database in Hive, use CREATE DATABASE <database name>.
• To list out the databases in the Hive warehouse, enter the command SHOW DATABASES.
• The command to use a database is USE <database name>.
A short sketch of these commands follows.
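A minimal sketch; the retail database name anticipates the later example:

    CREATE DATABASE retail;
    SHOW DATABASES;
    USE retail;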
Copy the input data to HDFS from the local file system by using the hadoop fs
-copyFromLocal command.
When we create a table in Hive, it is created in the default location of the Hive
warehouse, "/user/hive/warehouse". After creation of the table we can move the
data from HDFS into the Hive table.

The following command (sketched below) creates a table within the location
"/user/hive/warehouse/retail.db".

Note: retail.db is the database created in the Hive warehouse.

DESCRIBE provides information about the schema of the table.
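A hedged sketch of the table creation and schema inspection; the txnrecords table and its columns are hypothetical:

    CREATE TABLE txnrecords (
      txn_id   INT,
      txn_date STRING,
      amount   DOUBLE,
      category STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    DESCRIBE txnrecords;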
Data Manipulation Language (DML)
DML statements are used to retrieve, store, modify, delete, insert, and update
data in the database.
Examples:
LOAD, INSERT statements.
Syntax:
LOAD DATA [LOCAL] INPATH '<file path>' INTO TABLE <table name>;

The LOAD operation is used to move data into the corresponding Hive table. If the
keyword LOCAL is specified, the load command takes a local file system path; if
LOCAL is not specified, we have to use the HDFS path of the file.
An example of the LOAD DATA LOCAL command follows.
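A hedged example, reusing the hypothetical txnrecords table from above (the local path is illustrative):

    LOAD DATA LOCAL INPATH '/home/user/txns.csv'
    INTO TABLE txnrecords;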

After loading the data into the Hive table, we can apply Data Manipulation
Statements or aggregate functions to retrieve the data, as below.
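A hedged aggregate query over the same hypothetical table:

    SELECT category, SUM(amount) AS total_spent
    FROM txnrecords
    GROUP BY category;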
Word Count Example

Map input – key: byte offset, value: line of text, e.g.
0: The cat sat on the mat
22: The aardvark sat on the sofa
Map output – key: word, value: count (1 per occurrence)
Reduce output – key: word, value: sum of counts
WordCount Example In MapReduce
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in each input line
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts for each word
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}
WordCount Example By Hive

CREATE TABLE wordcount (token STRING);

LOAD DATA LOCAL INPATH 'wordcount/input'
OVERWRITE INTO TABLE wordcount;

SELECT token, count(*) FROM wordcount GROUP BY token;
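The query above assumes one token per line of input. For arbitrary lines of text, a hedged sketch that splits each line into words first (assuming a lines table holding raw text):

    CREATE TABLE lines (line STRING);

    -- explode(split(...)) turns each line into one row per word
    SELECT word, count(*) AS cnt
    FROM lines
    LATERAL VIEW explode(split(line, ' ')) t AS word
    GROUP BY word;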
