
T.Y. B.TECH COMPUTER ENGINEERING ACADEMIC YEAR – 2022-23


BIG DATA INFRASTRUCTURE - EXPERIMENT 4
NAME – BHARGAVEE KALPAK CHAVARKAR SAP ID – 60004200124 DIVISION/BATCH – B/B1

AIM:
Study and Implement Hive Query Language (HQL) in the Hadoop Ecosystem

THEORY:
Hive is an open-source data warehouse software that provides a SQL-like interface for
querying and analyzing large datasets stored in distributed storage systems, such as Hadoop
Distributed File System (HDFS). It was initially developed by Facebook and later donated to
the Apache Software Foundation, where it is now part of the Apache Hive project.
Hive is designed to make it easy for data analysts, data scientists, and other users who are
familiar with SQL to query and analyze large datasets stored in HDFS without having to write
complex MapReduce or other distributed computing jobs. Hive provides a higher-level
abstraction over Hadoop, allowing users to express data queries using SQL-like syntax, which
is then translated into MapReduce or other Hadoop jobs for execution on the Hadoop
cluster.
Some key features of Hive include:
1. SQL-like interface: Hive provides a familiar SQL-like interface for querying and
analyzing large datasets, making it easy for users who are familiar with SQL to work
with big data stored in HDFS.
2. Schema and metadata management: Hive allows users to define schemas and
metadata for their data, including tables, partitions, and views, which are stored in a
metastore. This enables users to define and manage data structures in a more
structured and organized way.
3. Extensibility: Hive provides a pluggable architecture that allows users to extend its
functionality with custom user-defined functions (UDFs), user-defined aggregates
(UDAs), and user-defined data types (UDTs). This enables users to add custom logic
and operations to their Hive queries.
4. Optimization: Hive includes query optimization features, such as predicate pushdown,
partition pruning, and cost-based optimization, which help to improve query
performance and reduce query execution times.
5. Integration with Hadoop ecosystem: Hive integrates with other Hadoop ecosystem
tools, such as HDFS, Hadoop MapReduce, Apache Spark, and others, allowing users
to leverage the capabilities of these tools in their data processing pipelines.
6. Support for batch and interactive queries: Hive supports both batch processing and
interactive queries, making it suitable for a wide range of use cases, from large-scale
data processing to ad-hoc data exploration and analysis.
HQL stands for Hive Query Language. It is the SQL-like query language used for querying
and analyzing data stored in Hive, and it is the primary language used to interact with
Hive and perform various data querying and manipulation operations.
HQL provides a familiar SQL-like syntax that allows users to express queries in a declarative
manner. It supports many standard SQL features, such as SELECT, FROM, WHERE, GROUP BY,
JOIN, ORDER BY, and others, which make it easy for users who are familiar with SQL to
transition to Hive and work with big data stored in HDFS. However, it's important to note
that while HQL shares similarities with SQL, it also has some differences and limitations due
to its distributed nature and integration with Hadoop.
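For instance, a typical HQL query over a hypothetical sales table stored in Hive reads
almost exactly like standard SQL:

    -- Total sales per region for 2022, highest first (table and column names are illustrative)
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    WHERE sale_year = 2022
    GROUP BY region
    ORDER BY total_sales DESC;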
Some key features of HQL include:
1. Data definition and manipulation: HQL allows users to define and manage tables,
partitions, and views in Hive, and perform various data manipulation operations such
as inserting, updating, and deleting data.
2. Hive-specific features: HQL provides Hive-specific features, such as support for
custom user-defined functions (UDFs), user-defined aggregates (UDAs), and user-
defined data types (UDTs), which allow users to add custom logic and operations to
their Hive queries.
3. Support for complex data types: HQL supports complex data types such as arrays,
maps, and structs, which allows users to work with structured, semi-structured, and
unstructured data stored in Hive.
4. Dynamic partitioning: HQL supports dynamic partitioning, which allows users to
partition data on-the-fly based on specified criteria during data insertion or loading,
without pre-defining partitions (see the sketch after this list).
5. Query optimization: HQL queries benefit from Hive's optimization features, such as
predicate pushdown and partition pruning, which help to improve query performance
and reduce query execution times.
6. Integration with Hadoop ecosystem: HQL integrates with other Hadoop ecosystem
tools, such as HDFS, Hadoop MapReduce, Apache Spark, and others, allowing users
to leverage the capabilities of these tools in their Hive queries.
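As a brief illustration of features 3 and 4 above, the sketch below (all table and column
names are hypothetical) defines a table with complex types and loads it using dynamic
partitioning:

    -- Hypothetical table using ARRAY, MAP, and STRUCT columns, partitioned by year
    CREATE TABLE employee_details (
      name    STRING,
      skills  ARRAY<STRING>,
      phones  MAP<STRING, STRING>,
      address STRUCT<city: STRING, pin: STRING>
    )
    PARTITIONED BY (join_year INT);

    -- Enable dynamic partitioning so Hive creates join_year partitions on the fly
    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;

    -- employee_staging is an assumed source table with matching columns
    INSERT INTO TABLE employee_details PARTITION (join_year)
    SELECT name, skills, phones, address, join_year
    FROM employee_staging;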

COMMAND & OUTPUT:


1. Show databases already existing in Hive
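A command of this form lists the databases registered in the Hive metastore:

    SHOW DATABASES;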

2. Creating a new database
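For example (the database name student_db is illustrative):

    CREATE DATABASE IF NOT EXISTS student_db;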


3. Using the database for executing queries
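For example, switching to the database assumed above:

    USE student_db;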

4. Creating a table in Hive
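A possible table definition is sketched below; the table and column names are assumptions.
It is declared as a transactional ORC table so that the UPDATE and DELETE steps later in
this experiment can work (this also requires ACID support to be enabled in the Hive
configuration, and on Hive versions before 3.0 a CLUSTERED BY ... INTO n BUCKETS clause
as well):

    -- Illustrative student table; ORC + transactional enables ACID operations later
    CREATE TABLE student (
      sap_id INT,
      name   STRING,
      branch STRING
    )
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');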

5. Inserting values in the table
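For example, with the illustrative schema above (the row values are made up):

    INSERT INTO TABLE student VALUES
      (1, 'Asha',  'COMPS'),
      (2, 'Rohan', 'IT'),
      (3, 'Meera', 'COMPS');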


6. To display contents of a table
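For example:

    SELECT * FROM student;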

7. Order By Clause (Ordering the result in descending order; by default, ascending order)
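A representative query, ordering on the illustrative sap_id column:

    -- DESC gives descending order; omitting it gives the default ascending order
    SELECT * FROM student ORDER BY sap_id DESC;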

8. Group By Clause (Grouping the result w.r.t branch here)
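For example, counting students per branch:

    SELECT branch, COUNT(*) AS student_count
    FROM student
    GROUP BY branch;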


9. Update the value in table
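A representative command; note that UPDATE works only on transactional (ACID) tables such
as the one sketched in step 4, and the values here are illustrative:

    UPDATE student SET branch = 'IT' WHERE sap_id = 3;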

10. Delete the entry from the table
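Similarly, assuming a transactional table:

    DELETE FROM student WHERE sap_id = 2;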

11. Rename a table using the ALTER statement
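For example (the new table name is illustrative):

    ALTER TABLE student RENAME TO student_info;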


12. Check whether table is renamed
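Listing the tables in the current database confirms the new name:

    SHOW TABLES;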

13. Add new column in the table using ALTER statement
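For example, adding an illustrative cgpa column with a VARCHAR type so that the next step
can change it to STRING:

    ALTER TABLE student_info ADD COLUMNS (cgpa VARCHAR(4));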

14. Changing Column Name with ALTER statement and changing the datatype from
varchar to String.
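A representative command; CHANGE takes the old column name, the new column name, and the
new data type (the names are illustrative):

    ALTER TABLE student_info CHANGE cgpa student_cgpa STRING;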

15. Check whether the CGPA values are updated.
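For example, inspecting the schema and the column values (names as assumed above):

    DESCRIBE student_info;
    SELECT name, student_cgpa FROM student_info;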


16. Operators in HIVE (Arithmetic, Relational, Logical)
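A single illustrative query can exercise all three operator families (column names as
assumed in the earlier steps):

    -- Arithmetic (+), relational (>=), and logical (AND) operators
    SELECT name,
           CAST(student_cgpa AS DOUBLE) + 0.5  AS adjusted_cgpa,
           CAST(student_cgpa AS DOUBLE) >= 8.0 AS is_high_scorer
    FROM student_info
    WHERE branch = 'COMPS' AND CAST(student_cgpa AS DOUBLE) > 7.0;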

17. Joins in HIVE (representative queries for all four join types appear after the list below)


1. Inner join in Hive
2. Left Outer Join in Hive

3. Right Outer Join in Hive

4. Full Outer Join in Hive
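Representative queries for the four join types, assuming hypothetical customers(id, name,
address) and orders(order_id, customer_id, amount) tables:

    -- Inner join: only customers that have matching orders
    SELECT c.name, o.amount
    FROM customers c JOIN orders o ON c.id = o.customer_id;

    -- Left outer join: all customers, NULLs where there is no matching order
    SELECT c.name, o.amount
    FROM customers c LEFT OUTER JOIN orders o ON c.id = o.customer_id;

    -- Right outer join: all orders, NULLs where there is no matching customer
    SELECT c.name, o.amount
    FROM customers c RIGHT OUTER JOIN orders o ON c.id = o.customer_id;

    -- Full outer join: all rows from both tables
    SELECT c.name, o.amount
    FROM customers c FULL OUTER JOIN orders o ON c.id = o.customer_id;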


18. Views in HIVE (representative commands for these steps appear after the list below)
1. Creating a View on customers table for address ABC.

2. Displaying the results of above view

3. Creating a view on the complete customers table instead of one single row (condition based)

4. Displaying the results of above view

5. Dropping the view created
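Representative commands for the five view steps above, reusing the hypothetical customers
table from the join examples:

    -- 1. View restricted to customers whose address is ABC
    CREATE VIEW customers_abc AS
    SELECT * FROM customers WHERE address = 'ABC';

    -- 2. Query the view like an ordinary table
    SELECT * FROM customers_abc;

    -- 3. Condition-based view over the complete customers table (condition is illustrative)
    CREATE VIEW customers_filtered AS
    SELECT * FROM customers WHERE id > 100;

    -- 4. Display its results
    SELECT * FROM customers_filtered;

    -- 5. Drop a view once it is no longer needed
    DROP VIEW customers_abc;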

CONCLUSION:
In this experiment, we have successfully studied and implemented Hive Query Language
(HQL) in the Hadoop Ecosystem. We started by learning the basics of Hive, a data
warehousing tool that allows large datasets stored in Hadoop Distributed File System
(HDFS) to be queried using SQL-like syntax. We explored the architecture of Hive and its
components, including the Hive Metastore, Hive Query Language Processor, and Hive Server.

We then created a sample dataset in HDFS and used Hive to perform a variety of data
analysis tasks, including filtering, sorting, and aggregating data. We also learned how to join
data from multiple tables and how to create and modify tables in Hive. Additionally, we
explored advanced topics such as partitioning data, optimizing query performance, and
integrating Hive with other Hadoop Ecosystem tools.

Through this experiment, we have gained a deeper understanding of Hive Query Language
(HQL) and its capabilities within the Hadoop Ecosystem. We have also gained hands-on
experience with using Hive to analyze large datasets and perform complex queries on HDFS
data.

Overall, this experiment has provided valuable insights into the power and flexibility of Hive
and its role in the Hadoop Ecosystem. It has equipped us with the knowledge and skills to
continue exploring Hive and other Hadoop Ecosystem tools for managing and analyzing large
datasets, and to leverage these technologies for solving real-world data challenges.
