Deepshikha Agrawal Pushp B.Sc. (IT), MBA (IT) Certification-Hadoop, Spark, Scala, Python, Tableau, ML (Assistant Professor JLBS)
Why Hive?
What is Hive?
Hive Architecture
HiveQL
Data Analysts with Hadoop
Hive - A Warehousing Solution
Over a Map-Reduce Framework
Hive Key Principles
Challenges that Data Analysts Faced
Data Explosion
- TBs of data generated every day
Solution – HDFS to store the data, and the Hadoop MapReduce framework to parallelize the processing of that data.
Apache Hive is an open source data warehouse system built on top of Hadoop used for
querying and analyzing large datasets stored in Hadoop files.
Initially, users had to write complex MapReduce jobs, but now, with the help of Hive, they merely need to submit SQL queries. Hive is mainly targeted at users who are comfortable with SQL. Hive uses a language called HiveQL (HQL), which is similar to SQL. HiveQL automatically translates SQL-like queries into MapReduce jobs.
Hive abstracts the complexity of Hadoop. The main thing to notice is that there is no need to learn Java for Hive.
Hive generally runs on your workstation and converts your SQL query into a series of jobs for execution on a Hadoop cluster. Apache Hive organizes data into tables, which provides a means of attaching structure to data stored in HDFS.
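For instance, a query that would otherwise require a hand-written MapReduce job can be expressed in a single HiveQL statement (a sketch; the employee table and its columns are illustrative):

```sql
-- Count employees per department; Hive compiles this into MapReduce jobs.
SELECT dept, COUNT(*) AS num_employees
FROM employee
GROUP BY dept;
```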
History of Apache Hive
The Data Infrastructure Team at Facebook developed Hive. Apache Hive is one of the technologies used to address Facebook's requirements, and it is very popular with users internally. It is used to run thousands of jobs on the cluster, with hundreds of users, for a wide variety of applications.
The Apache Hive-Hadoop cluster at Facebook stores more than 2 PB of raw data and regularly loads 15 TB of new data daily.
Hive is now used and developed by a number of other companies, including Amazon, IBM, Yahoo, Netflix, and the Financial Industry Regulatory Authority (FINRA).
Hive Architecture
After this introduction to Apache Hive, we now discuss the major components of the Hive architecture. The Apache Hive components are:
Metastore – Stores the metadata for each table, such as its schema and location, including partition metadata. This helps the driver keep track of the various data sets distributed over the cluster. The metadata is stored in a traditional RDBMS. Because this metadata is highly crucial, a backup server regularly replicates it so that it can be retrieved in case of data loss.
Driver – Acts as a controller that receives the HiveQL statements. The driver starts execution of a statement by creating sessions, and it monitors the life cycle and progress of the execution. It stores the necessary metadata generated during the execution of a HiveQL statement, and it also acts as the collection point for the data or query result obtained after the Reduce operation.
Compiler – Performs the compilation of the HiveQL query, converting it into an execution plan. The plan contains the tasks, and the steps MapReduce needs to perform, to produce the output translated from the query.
Executor – Once compilation and optimization complete, the executor executes the tasks.
Executor takes care of pipelining the tasks.
CLI, UI, and Thrift Server – CLI (command-line interface) provides a user interface for an
external user to interact with Hive. Thrift server in Hive allows external clients to interact with
Hive over a network, similar to the JDBC or ODBC protocols.
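The execution plan produced by the compiler can be inspected with the EXPLAIN statement (a sketch; the employee table is illustrative):

```sql
-- Shows the stages (including MapReduce tasks) Hive plans for this query.
EXPLAIN
SELECT dept, COUNT(*) FROM employee GROUP BY dept;
```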
Features of Apache Hive
Apache Hive has many features. Let's discuss them one by one:
Hive provides data summarization, query, and analysis in a much easier manner.
Hive supports external tables, which make it possible to process data without actually storing it in
HDFS.
Apache Hive fits the low-level interface requirement of Hadoop perfectly.
It also supports partitioning of data at the level of tables to improve performance.
Hive has a rule based optimizer for optimizing logical plans.
It is scalable, familiar, and extensible.
Using HiveQL doesn’t require knowledge of a programming language; basic SQL is enough.
We can easily process structured data in Hadoop using Hive.
Querying in Hive is very simple, as it is similar to SQL.
We can also run ad-hoc queries for data analysis using Hive.
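As a sketch of the table-level partitioning feature mentioned above (table and column names are illustrative):

```sql
-- Each distinct value of the partition column gets its own HDFS subdirectory,
-- so queries that filter on country read only the relevant partitions.
CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (country STRING);
```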
Hive Data Types
Hive Data types are used for specifying the column/field type in Hive tables.
Hive data types are mainly classified into 5 major categories; let’s discuss them one by one:
a. Primitive Data Types in Hive
Primitive data types are further divided into 4 types, which are as follows:
Numeric Data Type
Date/Time Data Type
String Data Type
Miscellaneous Data Type
i. Numeric Data Type
The Hive numeric data types are further classified into two types:
Integral Data Types
Floating Data Types
* Integral Data Types
Integral Hive data types are as follows-
TINYINT (1-byte (8-bit) signed integer, from -128 to 127)
SMALLINT (2-byte (16-bit) signed integer, from -32,768 to 32,767)
INT (4-byte (32-bit) signed integer, from -2,147,483,648 to 2,147,483,647)
BIGINT (8-byte (64-bit) signed integer, from -9,223,372,036,854,775,808 to
9,223,372,036,854,775,807)
* Floating Data Types
Floating Hive data types are as follows-
FLOAT (4-byte (32-bit) single-precision floating-point number)
DOUBLE (8-byte (64-bit) double-precision floating-point number)
DECIMAL (Arbitrary-precision signed decimal number)
ii. Date/Time Data Type
The second category of Apache Hive primitive data types is the Date/Time data
types. The following Hive data types come into this category:
TIMESTAMP (Timestamp with nanosecond precision)
DATE (date)
INTERVAL
iii. String Data Type
String data types are the third category under Hive data types. Below are the data types that
come into this-
STRING (Unbounded variable-length character string)
VARCHAR (Variable-length character string)
CHAR (Fixed-length character string)
iv. Miscellaneous Data Type
The miscellaneous category includes two Hive data types:
BOOLEAN (True/false value)
BINARY (Byte array)
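The primitive types above can be combined in a table definition, for example (a sketch; the table and columns are illustrative):

```sql
CREATE TABLE person (
  id        BIGINT,
  name      STRING,
  age       TINYINT,
  salary    DECIMAL(10,2),
  is_active BOOLEAN,
  joined    DATE,
  updated   TIMESTAMP
);
```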
These operators compare two operands and return TRUE or FALSE. The table below describes
the relational operators available in Hive:

Operator      | Operand types       | Description
A = B         | All primitive types | TRUE if expression A is equal to expression B; otherwise FALSE.
A != B        | All primitive types | TRUE if expression A is not equal to expression B; otherwise FALSE.
A < B         | All primitive types | TRUE if expression A is less than expression B; otherwise FALSE.
A <= B        | All primitive types | TRUE if expression A is less than or equal to expression B; otherwise FALSE.
A > B         | All primitive types | TRUE if expression A is greater than expression B; otherwise FALSE.
A >= B        | All primitive types | TRUE if expression A is greater than or equal to expression B; otherwise FALSE.
A IS NULL     | All types           | TRUE if expression A evaluates to NULL; otherwise FALSE.
A IS NOT NULL | All types           | FALSE if expression A evaluates to NULL; otherwise TRUE.
A LIKE B      | Strings             | TRUE if string A matches the SQL simple pattern B; otherwise FALSE.
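A sketch of these relational operators in a WHERE clause (table and columns are illustrative):

```sql
SELECT name
FROM employee
WHERE salary >= 50000
  AND dept != 'HR'
  AND manager IS NOT NULL
  AND name LIKE 'A%';   -- names starting with 'A'
```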
Hive Arithmetic Operators
Arithmetic operators in Hive support various arithmetic operations on the operands. All of them
return number types. If any of the operands is NULL, then the result is also NULL.

Operator | Operand types    | Description
A + B    | All number types | Gives the result of adding A and B.
A - B    | All number types | Gives the result of subtracting B from A.
A * B    | All number types | Gives the result of multiplying A and B.
A / B    | All number types | Gives the result of dividing A by B.
A % B    | All number types | Gives the remainder resulting from dividing A by B.
A & B    | All number types | Gives the result of bitwise AND of A and B.
A | B    | All number types | Gives the result of bitwise OR of A and B.
A ^ B    | All number types | Gives the result of bitwise XOR of A and B.
~A       | All number types | Gives the result of bitwise NOT of A.
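A sketch of the arithmetic and bitwise operators in use (column names are illustrative):

```sql
SELECT salary + bonus AS total_pay,
       salary * 0.1   AS raise,
       salary % 1000  AS remainder,
       flags & 4      AS bit_test
FROM employee;
```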
Hive Logical Operators
Logical operators in Hive provide support for creating logical expressions. All of them return
boolean TRUE, FALSE, or NULL, depending on the boolean values of the operands. Here NULL
behaves as an "unknown" flag, so if the result depends on the state of an unknown, the result
itself is unknown.

Operator | Operand types | Description
A AND B  | Boolean       | TRUE if both A and B are TRUE; otherwise FALSE. NULL if A or B is NULL.
A OR B   | Boolean       | TRUE if either A or B, or both, are TRUE; FALSE OR NULL is NULL; otherwise FALSE.
NOT A    | Boolean       | TRUE if A is FALSE; NULL if A is NULL; otherwise FALSE.
!A       | Boolean       | Same as NOT A.
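A sketch of these logical operators (table and columns are illustrative):

```sql
SELECT name
FROM employee
WHERE (dept = 'Sales' OR dept = 'Marketing')
  AND NOT is_contractor;
```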
Hive String Operators
Operator | Operand types | Description
A || B   | Strings       | Concatenates the operands – shorthand for concat(A, B).
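For example (illustrative columns; both expressions produce the same concatenated string):

```sql
SELECT first_name || ' ' || last_name        AS full_name,
       concat(first_name, ' ', last_name)    AS full_name2
FROM employee;
```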
Internal and External Tables in Hive
The concept of a table in Hive is very similar to a table in a relational database. Each table is
associated with a directory in HDFS, configured in ${HIVE_HOME}/conf/hive-site.xml. By default,
this is /user/hive/warehouse in HDFS. For example, /user/hive/warehouse/employee is created by
Hive in HDFS for the employee table. All the data in the table is kept in that directory. Such
Hive tables are referred to as internal or managed tables.
When data already exists in HDFS, an external Hive table can be created to describe it. It is
called EXTERNAL because the location of the data is specified in the LOCATION property instead
of the default warehouse directory. With internal tables, Hive fully manages the life cycle of the
table and its data: the data is removed once the internal table is dropped. If an external table is
dropped, the table metadata is deleted but the data is kept. Most of the time, an external table is
preferred, to avoid deleting data along with the table by mistake.
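An external table over existing HDFS data can be sketched as follows (the path and schema are illustrative):

```sql
CREATE EXTERNAL TABLE web_logs (ip STRING, url STRING, ts TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs/web';
-- DROP TABLE web_logs removes only the metadata; the files under
-- /data/logs/web remain in HDFS.
```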
Hive Commands:
Data Definition Language (DDL)
DDL statements are used to build and modify tables and other objects in the
database.
Examples:
CREATE, DROP, TRUNCATE, ALTER, SHOW, DESCRIBE statements.
• To create a new database in Hive:
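A minimal sketch (the database name is illustrative):

```sql
CREATE DATABASE IF NOT EXISTS retail;
SHOW DATABASES;
DESCRIBE DATABASE retail;
```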
The LOAD operation is used to move data into the corresponding Hive table. If the
keyword LOCAL is specified, the load command takes a local file system path; if
LOCAL is not specified, the HDFS path of the file must be used.
After loading the data into the Hive table, we can apply Data Manipulation
statements or aggregate functions to retrieve the data.
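The LOAD operation can be sketched as follows (the paths and table name are illustrative):

```sql
-- With LOCAL: copies the file from the local file system.
LOAD DATA LOCAL INPATH '/home/user/sales.txt' INTO TABLE sales;
-- Without LOCAL: moves the file from its HDFS location.
LOAD DATA INPATH '/data/incoming/sales.txt' INTO TABLE sales;
```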
Word Count Example
Key: offset of the line in the file
Value: the text of the line

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one); // emit (word, 1) for each token
        }
    }
}

// Driver configuration (Reduce, the matching summing reducer, is not shown here):
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.waitForCompletion(true);
WordCount Example By Hive
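The same word count can be expressed in HiveQL (a sketch; the docs table and input path are illustrative):

```sql
CREATE TABLE docs (line STRING);
LOAD DATA INPATH '/user/hive/input' OVERWRITE INTO TABLE docs;

-- split() breaks each line into words; explode() emits one row per word.
SELECT word, COUNT(1) AS count
FROM (SELECT explode(split(line, '\\s+')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
```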