
Unit- 4

Topics
• Introduction to NoSQL Databases

• Introduction to Hive
Relational Databases
• A relational database refers to a database that stores data in a structured format, using rows and
columns. This makes it easy to locate and access specific values within the database.
• "Relation" is sometimes used to refer to a table in a relational database, but more commonly refers to
the relationship between the different elements of a row.

• The relation is defined in a "schema", which is the logical definition of a table.


• The data is said to be structured.
• The data usually consists of simple types like integers, strings, floats etc.
Relational Database Management System (RDBMS)
• Software that manages a collection or database of tables.

• Is designed to support multi-user access.

• Transactional processing - designed for random access of table elements for update purposes,
as opposed to batch processing.
Features of Relational Databases
• Records are organized into tables

• Rows of tables are identified by unique keys

• Data spans multiple tables, which are linked by join operations

• Transactions are ACID-compliant

Structured Query Language - SQL
• Structured Query Language (SQL) is a standard database language which is
used to create, maintain and retrieve the data in relational databases.

• Much more compact and expressive than programs written in standard
programming languages such as C++, Java etc., but only for tabular data stores.

Note: SQL has been found to be a very effective language for relational databases,
hence it is so closely associated with RDBMS systems. But there is no rule
which states that SQL has to be used for an RDBMS.
Not Only SQL (NoSQL) Databases - History
• RDBMSs were found unsuitable for handling the unstructured data
generated by the proliferation of the internet.
• Unstructured data includes: web pages, images, audio
clips, videos, documents (pdf, csv, text).
• There is a need to mine the data, hence a need to store
and manipulate the data in an efficient and organized
manner.
• Difficult to scale RDBMS on clusters.
• All the above gave rise to NoSQL databases around the
year 2000.
• Origins in Google's BigTable and Amazon's SimpleDB.

• Note: The "SQL" in the name "NoSQL" does not imply that this
category of databases does not or cannot use SQL as the query
language.
Non-relational data storage systems

What is NoSQL?
• No fixed table schema
• No joins
• No multi-document transactions
• Relaxes one or more ACID properties

NoSQL Database Types
• Key-Value Store
• Document Store
• Column Store
• Graph Databases
Key-Value Pair Store
• Key is unique.

• Value can be anything, including a document, an image etc.

• The DBMS typically does not know anything about the contents of the "value".

• But the database might allow storage of metadata about the values.

• Application: online shopping information (user, user preferences).
Document Store
• Pairs each key with a complex data structure known as a document.
• Documents can contain many different key-value pairs, key-array pairs
or even nested documents.
• Support for embedded documents.
• Consumes more space compared to its counterparts.
• MongoDB is an example of this type.
• A collection contains many documents.
• Each document can contain diverse and heterogeneous fields.
https://beginnersbook.com/2017/09/mapping-relational-databases-to-mongodb/
Graph Stores
• Used to store information about networks of data, such as social
networking connections.
• Graph stores include Neo4j.
• Not well suited for all classes of problems.
• Best suited for connected data.
Wide Column Stores
• Store columns of data together instead of rows.
• Cassandra and HBase are optimized for queries over large datasets.
• Excellent for lookups on a single field.
• Lookups on other fields are not supported.
• Columns are not fixed.
Types of NoSQL
• Key-value data store: Riak, Redis, Membase
• Column-oriented data store: Cassandra, HBase, HyperTable
• Document data store: MongoDB, CouchDB, RavenDB
• Graph data store: InfiniteGraph, Neo4j, AllegroGraph
NoSQL Vendors

Company | Product | Most widely used by
Amazon | DynamoDB | LinkedIn, Mozilla
Facebook | Cassandra | Netflix, Twitter, eBay
Google | BigTable | Adobe Photoshop


NoSQL Characteristics

Advantages of NoSQL
• Cheap, easy to implement

• Easy to distribute

• Can easily scale up and down

• Relaxes the data consistency requirement

• Doesn't require a pre-defined schema

• Data can be replicated to multiple nodes and can be partitioned
BASE Properties
Basically Available
• Description: The system guarantees availability, meaning that the database will always
respond to any request (either by returning the requested data or an error). This is in
contrast to ACID systems that might sacrifice availability to ensure strict consistency.
• Example: If a NoSQL database server is overwhelmed with requests, it might still respond
with a message indicating that it is overloaded, rather than being completely unresponsive.

Soft State
• Description: The state of the system can change over time, even without input,
due to the eventual consistency model. This means that the system doesn't
have to be in a consistent state at all times.
• Example: In a distributed system, data might be replicated across multiple
nodes. The data in these nodes may be inconsistent temporarily due to the
delay in replication.

Eventual Consistency
• Description: The system will become consistent at some point in the future,
assuming there are no new updates. While the system allows for temporary
inconsistencies, it guarantees that, given enough time, all nodes will
converge to the same state.
• Example: In a distributed database, when a user updates their profile
information, the change may not be immediately visible on all servers.
However, eventually, all servers will reflect the updated information.
BASE Properties
• Basic Availability: The database appears to work most of the time (even if some nodes fail, or
packets are dropped). This has to do with the "AP" of CAP.
• Soft-state: State changes even without input (to provide eventual consistency).
• Eventual consistency: Stores exhibit consistency at some later point.
Soft-state and eventual consistency both have to do with the "C" in CAP.

BASE is a relaxed form of the ACID properties. NoSQL databases strive to satisfy the
BASE properties.

The BASE model is a flexible alternative (as is found acceptable with customer
shopping data) to the ACID model for databases that don't require strict adherence
to a relational model (as is required for banking data).
NoSQL Pros and Cons
Pros
• Handles the diverse kinds of data generated by the proliferation of the internet. Flexible.
• Designed to scale.
• Easier to maintain.

Cons
• Not mature.
• Does not provide the same level of guarantees (ACID properties) as RDBMS systems.
• Not transactional.
• Less secure.
• Not designed for typical business intelligence applications.
SQL Vs. NoSQL
SQL | NoSQL
Relational database | Non-relational, distributed database
Relational model | Model-less approach
Pre-defined schema | Dynamic schema for unstructured data
Table-based databases | Document-based, graph-based, wide column store or key-value pair databases
Vertically scalable (by increasing system resources) | Horizontally scalable (by creating a cluster of commodity machines)
Uses SQL | Uses UnQL (Unstructured Query Language)
Not preferred for large datasets | Largely preferred for large datasets
Not a best fit for hierarchical data | Best fit for hierarchical storage as it follows the key-value pair style of storing data, similar to JSON (JavaScript Object Notation)
Emphasis on ACID properties | Follows Brewer's CAP theorem
Excellent support from vendors | Relies heavily on community support
Supports complex querying and data-keeping needs | Does not have good support for complex querying
Can be configured for strong consistency | Few support strong consistency (e.g., MongoDB); a few others can be configured for eventual consistency (e.g., Cassandra)
Examples: Oracle, DB2, MySQL, MS SQL, PostgreSQL, etc. | Examples: MongoDB, HBase, Cassandra, Redis, Neo4j, CouchDB, Couchbase, Riak, etc.
NewSQL
The goal is to provide the scalability and flexibility of NoSQL databases and the consistency of SQL databases.

Characteristics of NewSQL:
• SQL interface for application interaction
• ACID support for transactions
• An architecture that provides higher per-node performance vis-à-vis traditional RDBMS solutions
• Scale-out, shared-nothing architecture
• Non-locking concurrency control mechanism so that real-time reads will not conflict with writes
SQL Vs. NoSQL Vs. NewSQL
Feature | SQL | NoSQL | NewSQL
Adherence to ACID properties | Yes | No | Yes
OLTP/OLAP | Yes | No | Yes
Schema rigidity | Yes | No | Maybe
Adherence to data model | Adheres to the relational model | - | -
Data format flexibility | No | Yes | Maybe
Scalability | Scale up (vertical scaling) | Scale out (horizontal scaling) | Scale out
Distributed computing | Yes | Yes | Yes
Community support | Huge | Growing | Slowly growing
Introduction to Hive
History of Hive

Hive 0.10
• Batch
• Read-only data
• HiveQL
• MR

Hive 0.13
• Interactive
• Read-only data
• Substantial SQL
• MR, Tez

Hive 0.14
• Transactions with ACID semantics
• Cost-based optimizer
• SQL temporary tables
• MR, Tez, Spark

Enterprise SQL at Hadoop Scale


Hive - a Data Warehousing Tool
When to use Hive
Meta Store in Hive (Metastore)
• The Metastore stores the information about the tables, partitions, and the
columns within the tables.
• There are three ways of configuring the Metastore:
• Embedded Mode
• Local Mode
• Remote Mode
Embedded Metastore

• In this mode, the Metastore service runs in the same JVM as the Hive service and contains an embedded Derby
database instance backed by local disk. This mode requires the least configuration but supports only one session at a
time. It is therefore not suited for production.
Local Metastore

In this mode, the Metastore service runs in the same JVM as the Hive service, but the Metastore
database runs in a separate process.

Remote Metastore

In this mode, the Metastore service runs in its own JVM. This brings better manageability and security because the
database tier can be completely firewalled off, and the clients no longer need the database credentials. In this mode, the
Metastore service communicates with the database over JDBC. Hadoop ecosystem software can communicate with
Hive using the Thrift service.
Database: namespaces that separate tables and other data units.
SQL | HiveQL
Insert values row by row | Insertion of bulk data (not a single row at a time)
UPDATE command is used | UPDATE command cannot be used
DELETE command is used to delete a row or column | DELETE cannot be used


Hive Query Language (HiveQL)
• It is HiveQL and not HQL.
• Based on SQL.
• Does not strictly follow the full SQL-92 standard.
• HiveQL offers extensions not in SQL including multitable inserts.
• Limited support for various SQL operations such as subqueries.
• Internally, a compiler translates HiveQL statements into a directed
acyclic graph of MapReduce, Tez, or Spark jobs, which are executed
on a distributed cluster.
Hive Query Language (HiveQL)
1. Create and manage tables and partitions.
2. Support various relational, arithmetic, and logical operators.
3. Evaluate functions.

4. Download the contents of a table to a local directory, or the result of queries to an HDFS directory.
5. A large number of functions are defined in Hive, categorized as mathematical, statistical, string, date, conditional,
aggregate and so on.
We can retrieve the list in the Hive shell with:
hive> SHOW FUNCTIONS;
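
Individual functions can be inspected in the same way; a hedged sketch ("upper" is just one built-in chosen for illustration):

hive> DESCRIBE FUNCTION upper;
hive> DESCRIBE FUNCTION EXTENDED upper;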
Data Definition Language
• Build and modify the tables & other objects in the database
• Create/Drop/Alter Database
• Create/Drop/Truncate Table
• Alter Table/Partition/Column
• Create/Drop/Alter View
• Create/Drop/Alter Index
• Show
• Describe
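
A hedged sketch of these DDL statements in HiveQL (the database, table, view and column names are hypothetical, chosen only for illustration):

CREATE DATABASE IF NOT EXISTS demo_db;
CREATE TABLE IF NOT EXISTS demo_db.emp (id INT, name STRING);
ALTER TABLE demo_db.emp ADD COLUMNS (dept STRING);
CREATE VIEW IF NOT EXISTS demo_db.emp_names AS SELECT name FROM demo_db.emp;
SHOW TABLES IN demo_db;
DESCRIBE demo_db.emp;
DROP VIEW IF EXISTS demo_db.emp_names;
DROP TABLE IF EXISTS demo_db.emp;
DROP DATABASE IF EXISTS demo_db;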
Data Manipulation Language
• To retrieve
• Store
• Modify
• Delete
• Update data in a database
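
A minimal sketch of typical Hive DML, reusing the student table and .tsv path that appear later in this unit (student_copy is a hypothetical target table; UPDATE/DELETE are restricted, as noted in the SQL vs. HiveQL comparison above):

LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv' INTO TABLE student;
INSERT OVERWRITE TABLE student_copy SELECT rollno, name, gpa FROM student;
SELECT name FROM student WHERE gpa > 3.5;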
Hive Data Types
Numeric Data Types
TINYINT    1-byte signed integer
SMALLINT   2-byte signed integer
INT        4-byte signed integer
BIGINT     8-byte signed integer
FLOAT      4-byte single-precision floating-point number
DOUBLE     8-byte double-precision floating-point number

String Types
STRING
VARCHAR    Only available starting with Hive 0.12.0
CHAR       Only available starting with Hive 0.13.0
Strings can be expressed in either single quotes (') or double quotes (")
Miscellaneous Types
BOOLEAN
BINARY     Only available starting with Hive 0.8.0
• Collection Data Types
• STRUCT
• Similar to ‘C’ struct. Fields are accessed using dot notation.
• E.g.: struct('John', 'Doe')
• MAP
• A collection of key - value pairs. Fields are accessed using [] notation.
• E.g.: map('first', 'John', 'last', 'Doe')
• ARRAY
• Ordered sequence of same types. Fields are accessed using array index.
• E.g.: array('John', 'Doe')
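
A hedged sketch of a table that uses all three collection types (the table name, fields and delimiters are hypothetical; the query shows the dot, bracket and index access notation described above):

CREATE TABLE IF NOT EXISTS person (
  name   STRUCT<first:STRING, last:STRING>,
  phones MAP<STRING, STRING>,
  emails ARRAY<STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';

SELECT name.first, phones['home'], emails[0] FROM person;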
Hive File Format
• Text File
• The default file format is the text file.
• CSV, TSV, JSON or XML
• Sequence File
• Sequence files are flat files that store binary key-value pairs.
• Include compression
• RCFile (Record Columnar File)
• RCFile stores the data in a column-oriented manner, which ensures that
aggregation operations are not expensive.
Hive File Formats
1. Text File
• Data is stored in lines, with each line being a record. Each line is terminated by a newline character (\n).
• This is also the default format, equivalent to creating a table with the clause STORED AS TEXTFILE.

Create table textfile_table (column_specs) stored as textfile;

2. Sequence File
• Sequence files are Hadoop flat files which store values in binary key-value pairs.
• Can be specified using the "STORED AS SEQUENCEFILE" clause during table creation.
• Files have a flat file structure consisting of binary key-value pairs.

Create table sequencefile_table (column_specs) stored as sequencefile;

3. RCFile (Record Columnar File)
• RCFile is a row-columnar file format. This is another Hive file format which offers high row-level
compression rates.
• It first partitions rows horizontally into row splits and then vertically partitions each row split in a
columnar way.
• It stores the metadata of a row split as the key part of a record, and all the data of a row split as the value
part.
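
By analogy with the two statements above, an RCFile-backed table can presumably be declared with the same pattern:

Create table rcfile_table (column_specs) stored as rcfile;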
Database
• To create a database named "STUDENTS" with a comment and database properties:

CREATE DATABASE IF NOT EXISTS STUDENTS
COMMENT 'STUDENT Details'
WITH DBPROPERTIES ('creator' = 'JOHN');

• To describe a database:
DESCRIBE DATABASE STUDENTS;

• To drop a database:
DROP DATABASE STUDENTS;
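
A couple of related statements, shown as a hedged sketch (the DBPROPERTIES key here is only illustrative):

SHOW DATABASES;
USE STUDENTS;
ALTER DATABASE STUDENTS SET DBPROPERTIES ('edited-by' = 'JOHN');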


Internal versus External Tables
Internal Table (Managed Table)
• Table data is stored in the Hive-managed HDFS warehouse.
• Dropping the table deletes both the table metadata and the data.
• This is the default for CREATE TABLE.
• A data file is referred to by one table only.

External Table (Self-Managed Table)
• Table data is not managed by Hive and is stored outside the warehouse.
• Dropping the table deletes the metadata but not the data.
• The "EXTERNAL" keyword is used and a location needs to be specified.
• One data file can be referenced by any number of tables; external tables reference it by location.
To create a managed table named 'STUDENT':

CREATE TABLE IF NOT EXISTS student (rollno INT, name STRING, gpa FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

To create an external table named 'EXT_STUDENT':

CREATE EXTERNAL TABLE IF NOT EXISTS ext_student (rollno INT, name STRING, gpa FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/STUDENT_INFO';

To load data into the table from a file named student.tsv:

LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv'
OVERWRITE INTO TABLE ext_student;

To retrieve the student details from the "EXT_STUDENT" table:

SELECT * FROM ext_student;


Partitioning - Preliminaries
Big Table:

Employee Name | Employee ID | Country
Alok Nath | 36554 | India
Arun Thomas | 36553 | India
Geeta Rao | 36555 | India
Susan Phillips | 71222 | UK
John Chambers | 71225 | UK

Break the table into smaller parts based on a key ("Country" in this case) and store the parts in
separate units (which can be files).

If data is required from only one part (say India), access will be faster.

Breaking into too many small parts causes degradation of performance, e.g. partitioning by Employee ID.
Hashing - Preliminaries
Big Table:

Employee Name | Employee ID | Country
Alok Nath | 36554 | India
Arun Thomas | 36553 | India
Geeta Rao | 36555 | India
Susan Phillips | 71222 | UK
John Chambers | 71225 | UK
Liam Neeson | 80162 | Ireland
Milo O'Shea | 80233 | Ireland

We require to "partition" by Employee ID, but do not want one partition per key since that leads to too
many partitions. Why "partition" by Employee ID but want a small number of partitions? It could be for
joining two tables by Employee ID (see the example in Part 3).

Solution:
- Map each key to a number, e.g. (empid modulo 2).
- 36554 mod 2 = 0; 36553 mod 2 = 1.
- Even EmpIDs map to 0, odd EmpIDs map to 1.
- Partition by this number: two partitions are generated, one with odd empids, the other with even empids.
- Generating a partition number by applying a function to a key is called hashing.
- Partitioning based on hashing is called hashPartitioning in Part 3.
Partitions
• Partitions split a larger dataset into more meaningful chunks.
• Partitioning improves I/O performance.
• Hive provides two kinds of partitions:
• Static Partition
• Dynamic Partition
Static Partitions

• Static partitioning can be done on columns whose values are known at compile time.
• To create a static partition based on the "gpa" column:
CREATE TABLE IF NOT EXISTS static_part_student (rollno INT, name STRING)
PARTITIONED BY (gpa FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

• To load data into the partitioned table from another table:

INSERT OVERWRITE TABLE static_part_student PARTITION (gpa = 4.0)
SELECT rollno, name FROM ext_student WHERE gpa = 4.0;
Dynamic Partition

To create a dynamic partition - the column's values are known only at execution time:

CREATE TABLE IF NOT EXISTS dynamic_part_student (rollno INT, name STRING)
PARTITIONED BY (gpa FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

• To load data into a dynamic partition table from another table:

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

Note: The dynamic partition strict mode requires at least one static partition column. To turn this off,
set hive.exec.dynamic.partition.mode=nonstrict.

INSERT OVERWRITE TABLE dynamic_part_student
PARTITION (gpa)
SELECT rollno, name, gpa FROM ext_student;
Hive Partitions and Buckets
• Hive partitions are generated using keys, which may or may not be column values stored in the table.

• Hive buckets are "partitions" generated by hashing column values.

Hive Partitions
Example: partition by the date the data was created, and further by country. Note: the date is not part of the table data.

CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);

LOAD DATA LOCAL INPATH 'input/hive/partitions/file1'
INTO TABLE logs
PARTITION (dt='2001-01-01', country='GB');

• Separate folders / directories are created per partition.

• Partition values may or may not be part of the table data.
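
The partitions created for a table can be listed from the shell; under the warehouse directory, each partition typically appears as a nested sub-directory (the layout shown in the comment is only illustrative):

hive> SHOW PARTITIONS logs;
-- Expected on-disk layout (illustrative): .../warehouse/logs/dt=2001-01-01/country=GB/file1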
Hive Buckets
• Buckets are specified using a column name (the column on which to hash) and the number of buckets
(into which the column entries should be hashed).

CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;

• Here Hive can use id modulo 4 to hash rows into buckets.
• To create a bucketed table having 3 buckets:

CREATE TABLE IF NOT EXISTS student_bucket (rollno INT, name STRING, grade FLOAT)
CLUSTERED BY (grade) INTO 3 BUCKETS;

• To load data into the bucketed table:

FROM STUDENT INSERT OVERWRITE TABLE student_bucket
SELECT rollno, name, grade;

• To display the content of the first bucket:

SELECT DISTINCT grade FROM student_bucket TABLESAMPLE (BUCKET 1 OUT OF 3 ON grade);
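
Note (hedged): on older Hive versions, the bucketed load above only distributes rows into the declared buckets if bucketing enforcement is switched on before the insert; newer versions enforce it by default.

SET hive.enforce.bucketing = true;

FROM STUDENT INSERT OVERWRITE TABLE student_bucket
SELECT rollno, name, grade;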


Hive supports aggregation functions like avg, count, etc.

To use the average and count aggregation functions:

SELECT avg(gpa) FROM STUDENT;
SELECT count(*) FROM STUDENT;

To use the GROUP BY and HAVING clauses:

SELECT rollno, name, gpa
FROM STUDENT
GROUP BY rollno, name, gpa
HAVING gpa > 4.0;
SerDe
• SerDe stands for Serialization and Deserialization.
• A serialization function converts complex in-memory data (for example a Java class object or Hive
Table) into a string to be stored on disk.
• The string is usually in compressed binary format for space savings.
• A deserialization function converts the string back into the complex data structure in-memory.
• Since a string is a “uniform sequential” or serial data form as opposed to a complex data
structure, this conversion is known as serialization.
• Hive can use SerDe functions to read / write its data from HDFS efficiently.
• Custom SerDe can be used.
• Note: RCFile storage uses SerDe to compress column data.
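
As a hedged sketch, a SerDe is typically named in the table's ROW FORMAT clause. OpenCSVSerde ships with recent Hive distributions; the table name and location below are hypothetical, and this SerDe treats all columns as strings:

CREATE EXTERNAL TABLE IF NOT EXISTS csv_student (rollno STRING, name STRING, gpa STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/STUDENT_CSV';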
User Defined Functions
User Defined Functions (UDFs) allow customization of Hive queries.

1. Create a Java class for the User Defined Function. The class must extend the UDF abstract class.
2. The class must have one or more evaluate() methods. Put in your desired logic.
3. Compile the Java file.
4. Package your Java class into a JAR file.
5. Go to the Hive CLI and add your JAR.
6. CREATE TEMPORARY FUNCTION in Hive, pointing to your Java class.
7. Use it in Hive SQL!

import org.apache.hadoop.hive.ql.exec.UDF;

public final class MyUpperCase extends UDF {
    public String evaluate(final String word) {
        return word.toUpperCase();
    }
}

hive> ADD JAR UpperCase.jar;
hive> CREATE TEMPORARY FUNCTION toUpperCase AS 'MyUpperCase';

hive> SELECT toUpperCase(name) FROM STUDENT;

Note: The syntax of the Hive commands above is not meant to be complete
and is for illustration purposes only.
End of Unit 5
