BDT UNIT 4
Topics
• Introduction to NoSQL Databases
• Introduction to Hive
Relational Databases
• A relational database stores data in a structured format, using rows and columns. This makes it easy to locate and access specific values within the database.
• “Relation” is sometimes used to refer to a table in a relational database, but it more commonly refers to the relationship between the different elements of a row.
Note: SQL has proved to be a very effective language for relational databases,
which is why it is so closely associated with RDBMSs. But there is no rule
which states that SQL has to be used with an RDBMS.
Not Only SQL (NoSQL) Databases: History
• RDBMSs were found unsuitable for handling the unstructured data
generated by the proliferation of the internet.
• Unstructured data includes: web pages, images, audio
clips, videos, documents (pdf, csv, text).
• There is a need to mine the data, hence a need to store
and manipulate the data in an efficient and organized
manner.
• RDBMSs are difficult to scale on clusters.
• All of the above gave rise to NoSQL databases in the 2000s.
• Origins in Google’s BigTable and Amazon’s SimpleDB.
• Note: The “SQL” in the name “NoSQL” does not imply that this
category of databases does not or cannot use SQL as the query
language.
What is NoSQL?
• Non-relational data storage systems
• No fixed table schema
• No joins
• No multi-document transactions
• Document Store.
• Column Store.
• Graph databases
Key-Value Pair Store
• Key is unique.
• Easy to distribute.
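As a rough illustration of why a key-value store is easy to distribute, the following Java sketch routes each unique key to one of several in-memory "nodes" by hashing the key. The class name `TinyKeyValueStore`, the node count, and the sample keys are assumptions made for this example; it is not the API of any real NoSQL product.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy store: each "node" is just an in-memory map.
final class TinyKeyValueStore {
    private final List<Map<String, String>> nodes = new ArrayList<>();

    TinyKeyValueStore(int nodeCount) {
        for (int i = 0; i < nodeCount; i++) {
            nodes.add(new HashMap<>());
        }
    }

    // The key alone decides which node owns a pair, so no node needs
    // to know about the data held by the others.
    int nodeFor(String key) {
        return Math.floorMod(key.hashCode(), nodes.size());
    }

    // Keys are unique: putting an existing key overwrites its value.
    void put(String key, String value) {
        nodes.get(nodeFor(key)).put(key, value);
    }

    // Returns null when the key is absent.
    String get(String key) {
        return nodes.get(nodeFor(key)).get(key);
    }

    // Total number of pairs across all nodes.
    int size() {
        int total = 0;
        for (Map<String, String> node : nodes) {
            total += node.size();
        }
        return total;
    }
}
```

For instance, after `put("user:1", "Alok")` followed by `put("user:1", "Arun")`, a `get("user:1")` returns the second value and the store still holds a single pair.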
Soft State
Description: The state of the system can change over time, even without input
due to the eventual consistency model. This means that the system doesn't
have to be in a consistent state at all times.
Example: In a distributed system, data might be replicated across multiple
nodes. The data in these nodes may be inconsistent temporarily due to the
delay in replication.
Eventual Consistency
Description: The system will become consistent at some point in the future,
assuming there are no new updates. While the system allows for temporary
inconsistencies, it guarantees that, given enough time, all nodes will
converge to the same state.
Example: In a distributed database, when a user updates their profile
information, the change may not be immediately visible on all servers.
However, eventually, all servers will reflect the updated information.
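The profile-update example can be sketched in a few lines of Java. This is a simplified model under stated assumptions: the class `EventuallyConsistentProfile`, its method names, and the explicit `replicateTo` step are all hypothetical, and real systems replicate asynchronously over a network rather than on request.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: one primary copy and replicas that lag behind it.
final class EventuallyConsistentProfile {
    private String primary;
    private final List<String> replicas = new ArrayList<>();

    EventuallyConsistentProfile(String initial, int replicaCount) {
        primary = initial;
        for (int i = 0; i < replicaCount; i++) {
            replicas.add(initial);
        }
    }

    // A write updates the primary only; replicas are not touched yet.
    void update(String value) {
        primary = value;
    }

    // A read served by a replica may return stale data.
    String readFromReplica(int index) {
        return replicas.get(index);
    }

    // One replication step: copy the primary's value to a single replica.
    // The replica's state changes here without any new user input (soft state).
    void replicateTo(int index) {
        replicas.set(index, primary);
    }

    // True once every replica matches the primary.
    boolean isConsistent() {
        for (String replica : replicas) {
            if (!replica.equals(primary)) {
                return false;
            }
        }
        return true;
    }
}
```

Between `update` and the last `replicateTo` call the replicas disagree with the primary; once every replica has been refreshed and no new update arrives, `isConsistent()` returns true.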
BASE Properties
• Basic Availability: The database appears to work most of the time (even if some nodes fail, or
packets are dropped). This has to do with the “AP” of CAP.
• Soft-state: State changes even without input (to provide eventual consistency).
• Eventual consistency: Stores exhibit consistency at some later point.
Both soft state and eventual consistency have to do with the “C” in CAP.
BASE is a relaxed form of the CAP properties. NoSQL databases strive to satisfy the
BASE properties.
The BASE model is a flexible alternative (as is found acceptable with customer
shopping data) to the ACID model for databases that don't require strict adherence
to a relational model (as is required for banking data).
NoSQL Pros and Cons
Pros
• Handles the diverse kinds of data generated by the proliferation of the internet. Flexible.
• Designed to scale.
• Easier to maintain.
Cons
• Not mature.
• Do not provide the same level of guarantees (ACID properties) as RDBMS systems.
• Not transactional.
• Less secure.
Hive 0.10
• Batch
• Read-only data
• HiveQL
• MR
Hive 0.13
• Interactive
• Read-only data
• Substantial SQL
• MR, Tez
Hive 0.14
• Transactions with ACID semantics
• Cost-based optimizer
• SQL temporary tables
• MR, Tez, Spark
Embedded Metastore
In this mode, the Metastore service runs in the same JVM as the Hive service and contains an embedded Derby
database instance backed by local disk. This mode requires the least configuration but supports only one session at a
time, so it is not suited for production.
Local Metastore
In this mode, the Metastore service runs in the same JVM as the Hive service, but the Metastore
database runs in a separate process.
Remote Metastore
In this mode, the Metastore service runs in its own JVM. This brings better manageability and security because the
database tier can be completely firewalled off, and the clients no longer need the database credentials. The
Metastore service communicates with the database over JDBC. Other Hadoop ecosystem software can communicate with
Hive using the Thrift service.
Database: Namespaces that separate tables and other data units
• SQL: Inserts values row by row.
• HiveQL: Inserts data in bulk (not a single row at a time).
4. Download the contents of a table to a local directory, or write the results of queries to an HDFS directory.
5. Hive defines a large number of functions, categorized as mathematical, statistical, string, date, conditional,
aggregate, and so on.
We can retrieve the list in the Hive shell with:
hive> SHOW FUNCTIONS;
Data Definition Language
• Build and modify the tables & other objects in the database
• Create/Drop/Alter Database
• Create/Drop/Truncate Table
• Alter Table/Partition/Column
• Create/Drop/Alter View
• Create/Drop/Alter Index
• Show
• Describe
Data Manipulation Language
• Used to retrieve, store, modify, delete, and update data in the database
Hive Data Types
Numeric Data Type
TINYINT: 1-byte signed integer
SMALLINT: 2-byte signed integer
INT: 4-byte signed integer
BIGINT: 8-byte signed integer
FLOAT: 4-byte single-precision floating-point number
DOUBLE: 8-byte double-precision floating-point number
String Types
STRING
VARCHAR Only available starting with Hive 0.12.0
CHAR Only available starting with Hive 0.13.0
String literals can be expressed with either single quotes (') or double quotes (").
Miscellaneous Types
BOOLEAN
BINARY Only available starting with Hive
• Collection Data Types
• STRUCT
• Similar to ‘C’ struct. Fields are accessed using dot notation.
• E.g.: struct('John', 'Doe')
• MAP
• A collection of key-value pairs. Values are accessed by key using [] notation.
• E.g.: map('first', 'John', 'last', 'Doe')
• ARRAY
• Ordered sequence of elements of the same type. Elements are accessed using an array index.
• E.g.: array('John', 'Doe')
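The three access styles can be mirrored in plain Java, which may help readers coming from a programming background. This is only an analogy and not Hive syntax: the class `CollectionTypesAnalogy`, the nested `Name` class, and the method names are assumptions made for this sketch.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical Java analogues of the Hive collection types.
final class CollectionTypesAnalogy {
    // STRUCT analogue: a fixed set of named fields, read with dot notation.
    static final class Name {
        final String first;
        final String last;

        Name(String first, String last) {
            this.first = first;
            this.last = last;
        }
    }

    static Name structExample() {
        return new Name("John", "Doe");
    }

    // MAP analogue: values are looked up by key.
    static Map<String, String> mapExample() {
        Map<String, String> name = new LinkedHashMap<>();
        name.put("first", "John");
        name.put("last", "Doe");
        return name;
    }

    // ARRAY analogue: same-typed elements are looked up by position.
    static List<String> arrayExample() {
        return Arrays.asList("John", "Doe");
    }
}
```

Here `structExample().last`, `mapExample().get("last")`, and `arrayExample().get(1)` all yield the string Doe, each through a different access style.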
Hive File format
• Text File
• The default file format is text file.
• CSV, TSV, JSON, or XML
• Sequence File
• Sequence files are flat files that store binary key-value pairs.
• Includes compression support.
• RCFile (Record Columnar File)
• RCFile stores data in a column-oriented manner, which keeps aggregation operations inexpensive.
Hive File Formats
1. Text File
• Data is stored in lines, with each line being a record. Each line is terminated by a newline character (\n).
• This is also the default format, equivalent to creating a table with the clause STORED AS TEXTFILE.
Breaking the data into too many small partitions degrades performance, e.g. partitioning by employeeID.
Hashing - Preliminaries
Employee Name     Employee ID   Country
Alok Nath         36554         India
Arun Thomas       36553         India
Geeta Rao         36555         India
Susan Phillips    71222         UK
John Chambers     71225         UK
Liam Neeson       80162         Ireland
Milo O’Shea       80233         Ireland
We need to “partition” by empid, but do not want one partition per key, since that leads to too many partitions.
Why “partition” by empid (but with a small number of partitions)?
It could be for joining two tables by empid (see example in Part 3).
Big Table Solution:
- map each key to a number, e.g. (empid modulo 2).
- 36554 mod 2 = 0
- 36553 mod 2 = 1
- even EmpID mod 2 = 0
- odd EmpID mod 2 = 1
- partition by the above number.
- two partitions are generated for the above numbers, one with odd empid, the other with even empid.
- Generating a partition number using a function on a key is called hashing.
Partitioning based on hashing is called hashPartitioning in Part 3.
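The modulo scheme above can be tried out on the employee table with a short Java sketch. The class `HashPartitionSketch` and its method names are assumptions made for illustration; the only logic taken from the text is "partition number = empid modulo 2".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper that groups employee IDs by a hashed partition number.
final class HashPartitionSketch {
    // The "hash" from the text: partition number = empid modulo numPartitions.
    static int partitionFor(int empId, int numPartitions) {
        return Math.floorMod(empId, numPartitions);
    }

    // Groups the given IDs by partition number, keeping their input order.
    static Map<Integer, List<Integer>> partition(int[] empIds, int numPartitions) {
        Map<Integer, List<Integer>> result = new TreeMap<>();
        for (int id : empIds) {
            int p = partitionFor(id, numPartitions);
            result.computeIfAbsent(p, k -> new ArrayList<>()).add(id);
        }
        return result;
    }
}
```

Calling `HashPartitionSketch.partition` with the seven employee IDs from the table and 2 partitions places 36554, 71222, and 80162 in partition 0 and the remaining four IDs in partition 1, so seven keys collapse into two partitions instead of seven.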
Partitions
• Partitions split the larger dataset into more meaningful chunks.
• Partitioning improves I/O performance, because a query can read only the partitions it needs.
• Hive provides two kinds of partitions:
• Static Partition
• Dynamic Partition.
Static Partitions
• Static partitioning can be done on columns whose values are known at compile time.
• Example: create a static partition based on the “gpa” column.
CREATE TABLE IF NOT EXISTS static_part_student (rollno INT, name STRING)
PARTITIONED BY (gpa FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
Dynamic Partitions
• Dynamic partitioning is done on columns whose values are known only at execution time.
Note: The dynamic partition strict mode requires at least one static partition column. To turn this off,
set hive.exec.dynamic.partition.mode=nonstrict
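Hive stores each partition under a directory named column=value. The Java sketch below imitates, in memory, what a dynamic-partition load does: the partition name is not written into the statement but derived from each row's own "gpa" value at run time. It is not Hive code; the class `DynamicPartitionSketch`, the `Student` row type, and the sample rows are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of a dynamic-partition load.
final class DynamicPartitionSketch {
    // Hypothetical row shape matching (rollno, name, gpa).
    static final class Student {
        final int rollno;
        final String name;
        final double gpa;

        Student(int rollno, String name, double gpa) {
            this.rollno = rollno;
            this.name = name;
            this.gpa = gpa;
        }
    }

    // The partition name comes from the data itself, so it is only
    // known when the rows are actually processed.
    static String partitionDir(Student s) {
        return "gpa=" + s.gpa;
    }

    // Routes every row to the partition named after its gpa value.
    static Map<String, List<Student>> load(List<Student> rows) {
        Map<String, List<Student>> partitions = new TreeMap<>();
        for (Student s : rows) {
            partitions.computeIfAbsent(partitionDir(s), k -> new ArrayList<>()).add(s);
        }
        return partitions;
    }
}
```

With a static partition, by contrast, the value (for example gpa 4.0) would be written into the load statement itself, and every loaded row would go to that one partition.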
Buckets: a bucketed table definition names the column on which to hash and the number of buckets into
which column entries should be hashed.
1. Create a Java class for the User Defined Function. The class must extend the UDF base class.
2. The class must have one or more evaluate() methods. Put in your desired logic.
import org.apache.hadoop.hive.ql.exec.UDF;

// Upper-cases its string argument.
public final class MyUpperCase extends UDF {
    // Hive calls evaluate() once per row.
    public String evaluate(final String word) {
        return word.toUpperCase();
    }
}
• Use it in Hive SQL!
Note: The syntax of the Hive commands above is not meant to be complete
and is for illustration purposes only.
End of Unit 4