
BIG DATA

CHAPTER-5
Introduction to HBase
Introduction

In the early days, data was small in volume and structured.

Data could easily be stored in a relational database.

After the evolution of the internet, huge amounts of structured and semi-structured data were generated.

Storing and processing this data using an RDBMS became a problem.
HBase
• Open source software.

• Non-relational.

• Distributed, column-oriented database.

• It runs on top of HDFS.


Different from RDBMS
• Not a SQL database

• Not relational

• No joins

• No query language

• Not a drop-in replacement for an RDBMS.
Features
• Linear scalability

• Automatic and configurable sharding of tables

• Automatic failover support

• Strictly consistent reads and writes

• Provides real-time random read/write access to data stored in HDFS

• Integrates well with Hadoop MapReduce

• Easy Java API for client access

• Import of large amounts of data

• Backup option

• Bloom filters for real-time queries
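To make the Bloom filter feature listed above more concrete, here is a minimal concept sketch in Python. It is not HBase's implementation: the class name, the bit-array size, the number of hashes, and the row keys are all assumptions chosen for illustration. It only shows the core idea, which is that a lookup can answer "definitely not present" without reading the data itself.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Sizes are illustrative assumptions, not HBase defaults.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means the key was never added; True means "possibly added".
        return all(self.bits[pos] for pos in self._positions(key))


# Hypothetical row keys stored in one file.
store_file_filter = BloomFilter()
for row_key in ("row-001", "row-002", "row-003"):
    store_file_filter.add(row_key)
```

A reader would call `might_contain(row_key)` first and skip the file whenever the answer is `False`, which is what makes random reads cheaper.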
Limitation of HBase
• Recovery takes a very long time if the HMaster goes down, and it takes a long time to activate another node when the first node goes down.
• Cross-data operations and join operations are very difficult to perform in HBase. Joins can be implemented with MapReduce, but they take a lot of time to design and develop.
• Data must be converted to a new format when migrating from RDBMS external sources to HBase servers.
• Querying is challenging in HBase. Writing queries to retrieve information from the database requires integration with a SQL layer such as Apache Phoenix.
• Developing security controls to grant access to users takes an enormous amount of time.
Pig
• Pig Latin: a not-so-foreign language for data processing.
Introduction

• Apache Pig works on top of MapReduce.

• It is a tool/platform used to analyze large data sets.

• We can perform all data manipulation operations in Hadoop using Apache Pig.

• Pig provides a high-level language known as Pig Latin.

• Pig Latin lets users develop their own functions for reading, writing, and processing data.
Pig Architecture

[Figure: Pig architecture. Pig parses the statements of a Pig Latin script, then compiles, optimizes, and plans them into MapReduce jobs that run on top of HDFS.]
Features of PIG

• Rich in operators: It provides a large number of operators.

• Ease of programming: Pig Latin is similar to SQL.

• UDFs: Pig provides the facility to create user-defined functions.

• Handles all kinds of data: Apache Pig can analyze all kinds of data, structured and unstructured.
Work flow of PIG.

• Programmers write scripts in the Pig Latin language.

• All scripts are internally converted to MapReduce tasks.

• Apache Pig contains a component called the Pig Engine that accepts Pig Latin scripts as input and converts them to MapReduce jobs.
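The conversion described above can be pictured with a small pure-Python simulation of the map, shuffle, and reduce phases for a word count, the kind of job a grouping and counting script would end up as. This is only a sketch of the MapReduce pattern: the function names and sample lines are assumptions, and it is not the code that the Pig Engine actually generates.

```python
from collections import defaultdict


def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def shuffle_phase(pairs):
    # Shuffle: group all values that share the same key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    # Reduce: collapse each key's values into a single count.
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical input lines.
counts = word_count(["pig runs on hadoop", "hive runs on hadoop"])
```

A script author only describes the grouping and counting; the three phases are what the engine has to produce on their behalf.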
Install PIG.
Can an elephant fly?

Can Hadoop be used more efficiently?

Let's see…
Ideas

Let's have a Red Bull for wings…

Not a great idea.
Shrink
[Figure: before and after]
Genetic change
[Figure: before and after]
Behind the scenes…?

Facebook initially developed Hive.
What is Hive?

• Hive is a data warehouse infrastructure built on top of Hadoop.

• Supports analysis of large datasets stored in Hadoop.

• Provides a SQL-like query language called HiveQL.

• To provide quick query responses, it provides indexing.


Architecture of hive
Working of hive
Install HIVE

Java SE - Downloads | Oracle Technology Network | Oracle



Hadoop 2.4.1
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
Install Hadoop
Install Hive
javax.jdo.PersistenceManagerFactoryClass = org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL = jdbc:derby://hadoop1:1527/metastore_db;create=true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine
Hive - Data Types

1. Column Types

2. Literals

3. Null Values

4. Complex Types
Column Types

Column types are used as the column data types of Hive. They are as follows:

1. Integral Types

2. String Types

3. Timestamp

4. Dates

5. Decimals

6. Union Types
Literals

1. Floating Point Types

Floating point types are numbers with decimal points. Generally, this type of data is represented by the DOUBLE data type.

2. Decimal Type

Decimal type data is a floating point value with a higher range than the DOUBLE data type. The range of the decimal type is approximately -10^-308 to 10^308.
Complex Types
Create Database
Drop Database
Create Table
Example
Syntax of example
Load Data Statement
Example
Alter Table Statement
Change Statement
Drop Table Statement
Example
Partition
Renaming Partition
Dropping a Partition
Operator
There are four types of operators in Hive:

1. Relational Operators

2. Arithmetic Operators

3. Logical Operators

4. Complex Operators
Built in Function
Hive provides a set of built-in functions. They look quite similar to SQL functions, except for their usage.
Creating a View
Dropping a View
Creating an Index
Dropping an Index
Select Query
Order By
Group By
Join Table
Join
Left Outer Join
Right Outer Join
Full Outer Join
Physical Layout of hive

• Warehouse directory in HDFS.

• /user/hive/warehouse

• Tables: subdirectories of the warehouse directory.

• Partitions: subdirectories of the corresponding table directory.
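The layout above can be sketched as a pair of path-building helpers. The `column=value` naming of partition directories follows Hive's usual convention; the table name `employee` and the partition column `year` are hypothetical, and the helpers are illustrative rather than part of any Hive API.

```python
import posixpath

WAREHOUSE_DIR = "/user/hive/warehouse"


def table_path(table, warehouse=WAREHOUSE_DIR):
    # A table is a subdirectory of the warehouse directory.
    return posixpath.join(warehouse, table)


def partition_path(table, partition, warehouse=WAREHOUSE_DIR):
    # A partition is a subdirectory of its table directory, one level per
    # partition column, named "column=value".
    levels = [f"{column}={value}" for column, value in partition]
    return posixpath.join(table_path(table, warehouse), *levels)


# Hypothetical table partitioned by year.
employee_dir = table_path("employee")
employee_2012_dir = partition_path("employee", [("year", "2012")])
```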
Encapsulation

• HiveQL queries are converted to MapReduce code by the Hive engine.

• The Hive engine translates all queries into a directed acyclic graph of MapReduce jobs.

• These MapReduce jobs are sent to Hadoop for execution.
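As a rough illustration of what a directed acyclic graph of jobs implies, the sketch below orders a small hypothetical plan so that every job runs only after the jobs it depends on. The job names and dependencies are invented for the example, and the standard-library `graphlib` module (Python 3.9 or later) stands in for whatever scheduling the Hive engine really performs.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each job maps to the set of jobs it depends on.
job_dependencies = {
    "join_a_b": set(),
    "join_c_d": set(),
    "join_results": {"join_a_b", "join_c_d"},
    "order_by": {"join_results"},
}


def execution_order(dependencies):
    """Return one valid order in which the jobs could be submitted."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(job_dependencies)
```

Because the graph is acyclic, such an order always exists; a cycle would make `static_order()` raise `graphlib.CycleError`.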


Dependencies

• The /user/hive directory is created as soon as a Hive session is started for the first time.

• The /user/hive/warehouse directory must be accessible by everyone.

• hadoop dfs -chmod -R 1777 /user/hive/warehouse

• It is recommended to set the sticky bit if supported.
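The mode 1777 used in the chmod command above packs both requirements into one number: the leading 1 is the sticky bit and 777 grants read, write, and execute to everyone. The short Python check below decodes it with the standard `stat` module; the variable names are just for illustration.

```python
import stat

mode = 0o1777  # octal mode passed to chmod

# The leading 1 is the sticky bit.
has_sticky_bit = bool(mode & stat.S_ISVTX)

# The trailing 777 gives read, write and execute to owner, group and others.
everyone_has_full_access = (mode & 0o777) == 0o777

# How a directory with this mode appears in a long listing.
listing = stat.filemode(stat.S_IFDIR | mode)
```

The trailing `t` in the listing string is how the sticky bit shows up when the execute bit for others is also set.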


Hive Command line Interface

• The Hive CLI can be invoked with the hive command.


• % hive
Hive SQL script
Hive QL

• HiveQL is similar to the SQL query language.

• DML (Data Manipulation Language)
• SELECT

• DDL (Data Definition Language)
• SHOW TABLES
• CREATE TABLE
• ALTER TABLE
• DROP TABLE
Play with hive
Loading delimited data
Normal Table VS External Table

Normal Table | External Table

Normal tables are created under the warehouse directory. | External tables read directly from HDFS files.

Normal tables are directly visible through HDFS directory browsing. | External tables are not visible in the warehouse directory.

On dropping a normal table, both the source data and the table metadata are deleted. | On dropping an external table, only the metadata is deleted.
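The difference in drop behaviour can be modelled with a toy metastore and a set of HDFS paths. Everything here is an assumption made for illustration (the dictionary layout, the table names, the locations); it mimics the rule in the comparison above and does not call any Hive API.

```python
def drop_table(metastore, hdfs_paths, name):
    """Drop a table from a toy metastore; delete data only for normal tables."""
    table = metastore.pop(name)  # the metadata is always removed
    if not table["external"]:
        # Normal table: the underlying data is removed as well.
        hdfs_paths.discard(table["location"])
    return table


# Hypothetical tables: one normal, one external.
metastore = {
    "sales": {"external": False, "location": "/user/hive/warehouse/sales"},
    "logs": {"external": True, "location": "/data/logs"},
}
hdfs_paths = {"/user/hive/warehouse/sales", "/data/logs"}

drop_table(metastore, hdfs_paths, "sales")
drop_table(metastore, hdfs_paths, "logs")
```

After both drops the metastore is empty, but the external table's data directory is still present.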
Joins

• HiveQL supports joins only on equality expressions. Complex Boolean expressions and inequality conditions are not supported.

• More than two tables can be joined.

• The number of MapReduce jobs generated for a join depends on the columns being used.

• If the same column is used for all the tables, then n = 1.

• Otherwise n > 1.
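The rule above can be written as a one-line check. This is only a restatement of the slide's rule in Python, with a hypothetical helper name and hypothetical column names; it does not predict the exact number of jobs Hive would generate when the columns differ.

```python
def joins_in_single_job(join_columns):
    """True when every join condition uses the same column (n = 1)."""
    return len(set(join_columns)) == 1


# Hypothetical three-table joins, one column name per join condition.
same_column = joins_in_single_job(["id", "id"])
different_columns = joins_in_single_job(["id", "dept_id"])
```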
Data type and format
Primitive data types

• Numeric (Integral): TINYINT, SMALLINT, INTEGER, BIGINT

• Numeric (Floating): FLOAT, DOUBLE

• Date/Time: TIMESTAMP, DATE, INTERVAL

• String: STRING, VARCHAR, CHAR

• Miscellaneous: BOOLEAN, BINARY
Data format in hive

Apache Hive data file formats:

• TEXTFILE
• SEQUENCEFILE
• RCFILE
• ORC
• PARQUET
• AVRO
PIG VS HIVE
PIG | HIVE

Scripting language | SQL-like language

Comparatively fewer lines of code than MapReduce. | Comparatively fewer lines of code than MapReduce and Pig.

No partitions | Supports partitions

Pig is mainly used for programming. | Hive is mainly used by data analysts.

Pig supports Avro. | Hive also supports Avro, through the Avro SerDe.


www.paruluniversity.ac.in
