UNIT III NOSQL DATABASES

NoSQL – CAP Theorem – Sharding – Document based – MongoDB Operations: Insert, Update, Delete, Query, Indexing, Application, Replication, Sharding – Cassandra: Data Model, Key Space, Table Operations, CRUD Operations, CQL Types – HIVE: Data Types, Database Operations, Partitioning – HiveQL – OrientDB Graph Database – OrientDB Features.

CAP theorem

It is very important to understand the limitations of NoSQL databases. NoSQL cannot provide consistency and high availability together. This was first expressed by Eric Brewer in the CAP theorem.

The CAP theorem (Eric Brewer's theorem) states that we can achieve at most two out of three guarantees for a database: Consistency, Availability and Partition Tolerance.

Here Consistency means that all nodes in the network see the same data at the same time.

Availability is a guarantee that every request receives a response about whether it was successful or failed. However, it does not guarantee that a read request returns the most recent write. The more users a system can cater to, the better its availability.

Partition Tolerance is a guarantee that the system continues to operate despite arbitrary message loss or failure of part of the system. In other words, even if there is a network outage in the data center and some of the computers are unreachable, the system still continues to perform.

What Is Database Sharding?

Sharding is a method for distributing a single dataset across multiple databases, which can then be stored on multiple machines. This allows larger datasets to be split into smaller chunks and stored on multiple data nodes, increasing the total storage capacity of the system.

What is the difference between sharding and partitioning?

Sharding and partitioning are both about breaking up a large data set into smaller subsets. The difference is that sharding implies the data is spread across multiple computers, while partitioning does not. Partitioning is about grouping subsets of data within a single database instance.

What are the types of sharding?

Sharding Architectures

 Key Based Sharding. This technique is also known as hash-based sharding.
 Horizontal or Range Based Sharding. In this method, we split the data based on ranges of a given value inherent in each entity.
 Vertical Sharding.
 Directory-Based Sharding.
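For illustration, in MongoDB the shard key determines how documents are distributed across shards. The following is a minimal sketch (the database, collection and field names are hypothetical, and a running sharded cluster is assumed):

sh.shardCollection("mydb.users", { userId: "hashed" })   // key (hash) based sharding
sh.shardCollection("mydb.orders", { orderDate: 1 })      // range based sharding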


NoSQL

A NoSQL database is a non-SQL or non-relational database. It provides a mechanism for the storage and retrieval of data other than the tabular relations model used in relational databases. A NoSQL database doesn't use tables for storing data. It is generally used to store big data and data for real-time web applications.

Databases can be divided into 3 types:

1. RDBMS (Relational Database Management System)
2. OLAP (Online Analytical Processing)
3. NoSQL (recently developed database)

Advantages of NoSQL

o It supports query language.
o It provides fast performance.
o It provides horizontal scalability.

What is MongoDB?

MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling.

MongoDB is a document-oriented database. It is an open source product, developed and supported by a company named 10gen.

"MongoDB is a scalable, open source, high performance, document-oriented database." - 10gen

MongoDB was designed to work with commodity servers. Now it is used by companies of all sizes, across all industries.

MongoDB Advantages

o MongoDB is schemaless. It is a document database in which one collection holds different documents.
o The number of fields, the content and the size of documents can differ from one document to another.
o The structure of a single object is clear in MongoDB.
o There are no complex joins in MongoDB.
o MongoDB provides the facility of deep query because it supports powerful dynamic queries on documents.
o It is very easy to scale.
o It uses internal memory for storing working sets, and this is the reason for its fast access.

Distinctive features of MongoDB

o Easy to use
o Lightweight
o Extremely fast compared to RDBMS

Where MongoDB should be used

o Big and complex data
o Mobile and social infrastructure
o Content management and delivery
o User data management
o Data hub

MongoDB Create Database

There is no create database command in MongoDB. Actually, MongoDB does not provide any command to create a database.

How and when to create a database

If there is no existing database, the following command is used to create a new database.

Syntax:

use DATABASE_NAME

Here we are going to create a database "javatpointdb":

>use javatpointdb

To check the currently selected database, use the command db:

>db

To check the database list, use the command show dbs:

>show dbs

Insert at least one document into it to display the database.

MongoDB insert documents

In MongoDB, the db.collection.insert() method is used to add or insert new documents into a collection in your database.

>db.movie.insert({"name":"javatpoint"})

MongoDB Drop Database

The dropDatabase command is used to drop a database. It also deletes the associated data files. It operates on the current database.

Syntax:

db.dropDatabase()

This syntax will delete the selected database. In case you have not selected any database, it will delete the default "test" database.

If you want to delete the database "javatpointdb", use the dropDatabase() command as follows:

>db.dropDatabase()

MongoDB Create Collection

In MongoDB, db.createCollection(name, options) is used to create a collection. But usually you don't need to create a collection: MongoDB creates collections automatically when you insert some documents. This will be explained later. First see how to create a collection:

Syntax:

db.createCollection(name, options)

Name: is a string type; specifies the name of the collection to be created.

Options: is a document type; specifies the memory size and indexing of the collection. It is an optional parameter.

To check the created collection, use the command "show collections".

>show collections
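For example (an illustrative sketch; the collection name and sizes are hypothetical), a capped collection with a fixed size can be created by passing an options document:

>db.createCollection("mycollection", { capped: true, size: 6142800, max: 10000 })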


How does MongoDB create collections automatically

MongoDB creates collections automatically when you insert some documents. For example, insert a document named seomount into a collection named SSSIT. The operation will create the collection if the collection does not currently exist.

>db.SSSIT.insert({"name" : "seomount"})
>show collections
SSSIT

MongoDB update documents

In MongoDB, the update() method is used to update or modify the existing documents of a collection.

Syntax:

db.COLLECTION_NAME.update(SELECTION_CRITERIA, UPDATED_DATA)

Example

Consider an example which has a collection named javatpoint. Insert the following document into the collection:

db.javatpoint.insert(
{
course: "java",
details: {
duration: "6 months",
Trainer: "Sonoo jaiswal"
},
Batch: [ { size: "Small", qty: 15 }, { size: "Medium", qty: 25 } ],
category: "Programming language"
}
)

Update the existing course "java" into "android":

>db.javatpoint.update({'course':'java'},{$set:{'course':'android'}})

MongoDB insert multiple documents

If you want to insert multiple documents into a collection, you have to pass an array of documents to the db.collection.insert() method.

Create an array of documents

Define a variable named Allcourses that holds an array of documents to insert.

var Allcourses =
[
{
Course: "Java",
details: { Duration: "6 months", Trainer: "Sonoo Jaiswal" },
Batch: [ { size: "Medium", qty: 25 } ],
category: "Programming Language"
},
{
Course: ".Net",
details: { Duration: "6 months", Trainer: "Prashant Verma" },
Batch: [ { size: "Small", qty: 5 }, { size: "Medium", qty: 10 } ],
category: "Programming Language"
},
{
Course: "Web Designing",
details: { Duration: "3 months", Trainer: "Rashmi Desai" },
Batch: [ { size: "Small", qty: 5 }, { size: "Large", qty: 10 } ],
category: "Programming Language"
}
];

Insert the documents

Pass this Allcourses array to the db.collection.insert() method to perform a bulk insert.

> db.javatpoint.insert( Allcourses );

MongoDB Delete documents

In MongoDB, the db.collection.remove() method is used to delete documents from a collection. The remove() method works on two parameters:

1. Deletion criteria: with the use of its syntax you can remove the documents from the collection.

2. JustOne: it removes only one document when set to true or 1.


Syntax:

db.collection_name.remove(DELETION_CRITERIA)

Remove all documents

If you want to remove all documents from a collection, pass an empty query document {} to the remove() method. The remove() method does not remove the indexes.

db.javatpoint.remove({})
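If only one of the matching documents should be removed, the justOne parameter can be set to 1 (an illustrative sketch using the collection from the earlier examples):

db.javatpoint.remove({'category':'Programming Language'}, 1)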
Indexing in MongoDB

MongoDB uses indexing in order to make query processing more efficient. If there is no indexing, then MongoDB must scan every document in the collection and retrieve only those documents that match the query. Indexes are special data structures that store some information related to the documents, so that it becomes easy for MongoDB to find the right data file. The indexes are ordered by the value of the field specified in the index.

Creating an Index

MongoDB provides a method called createIndex() that allows the user to create an index.

Syntax: db.COLLECTION_NAME.createIndex({KEY:1})

Example

db.mycol.createIndex({"age":1})
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 1,
"numIndexesAfter" : 2,
"ok" : 1
}

In order to drop an index, MongoDB provides the dropIndex() method.

Syntax

db.NAME_OF_COLLECTION.dropIndex({KEY:1})
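For example (a sketch dropping the index created above):

db.mycol.dropIndex({"age":1})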
The dropIndex() method can only delete one index at a time. In order to delete (or drop) multiple indexes from the collection, MongoDB provides the dropIndexes() method, which takes multiple indexes as its parameters.

Syntax:

db.NAME_OF_COLLECTION.dropIndexes({KEY1:1, KEY2:1})

Applications of MongoDB

These are some important features of MongoDB:

1. Support for ad hoc queries: In MongoDB, you can search by field or range query, and it also supports regular expression searches.

2. Indexing: You can index any field in a document.

3. Replication: MongoDB supports master-slave replication. A master can perform reads and writes, and a slave copies data from the master and can only be used for reads or backup (not writes).

4. Duplication of data: MongoDB can run over multiple servers. The data is duplicated to keep the system up and also keep it running in case of hardware failure.

5. Load balancing: It has an automatic load balancing configuration because data is placed in shards.

6. Supports map-reduce and aggregation tools.

7. Uses JavaScript instead of procedures.

8. It is a schema-less database written in C++.

9. Provides high performance.

10. Stores files of any size easily without complicating your stack.

11. Easy to administer in the case of failures.

12. It also supports:

o JSON data model with dynamic schemas
o Auto-sharding for horizontal scalability


o Built-in replication for high availability

Nowadays many companies are using MongoDB to create new types of applications and to improve performance and availability.

MongoDB Replication Methods

The MongoDB replication methods are used to replicate members to the replica sets.

rs.add(host, arbiterOnly)

The add method adds a member to the specified replica set. We are required to connect to the primary of the replica set to use this method. The connection to the shell will be terminated if the method triggers an election for primary, for example if we try to add a new member with a higher priority than the primary. An error will be reflected by the mongo shell even if the operation succeeds.

Example:

In the following example we will add a new secondary member with the default vote.

rs.add( { host: "mongodbd4.example.net:27017" } )

MongoDB Sharding Commands

Sharding is a method to distribute the data across different machines. Sharding can be used by MongoDB to support deployments with very large data sets and high-throughput operations.

MongoDB sh.addShard(<url>) command

A shard replica set is added to a sharded cluster using this command. If we add it among the shards of a cluster, it affects the balance of chunks: it starts transferring chunks to balance the cluster.

It adds a shard by specifying the name of the replica set and the hostname of at least one member of the replica set:

<replica_set>/<hostname><:port>,<hostname><:port>, ...

Syntax:

sh.addShard("<replica_set>/<hostname><:port>")

Example:

sh.addShard("repl0/mongodb3.example.net:27327")
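After shards have been added, sharding is typically enabled for a database and then for a collection. The following is a minimal sketch (the database and collection names reuse the earlier examples, and a running sharded cluster is assumed):

sh.enableSharding("javatpointdb")
sh.shardCollection("javatpointdb.movie", { name: "hashed" })
sh.status()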


Cassandra

What is Cassandra?

Apache Cassandra is a highly scalable, high-performance, distributed NoSQL database. Cassandra is designed to handle huge amounts of data across many commodity servers, providing high availability without a single point of failure.

Cassandra is a NoSQL database

A NoSQL database is a non-relational database. It is also called Not Only SQL. It is a database that provides a mechanism to store and retrieve data other than the tabular relations used in relational databases. These databases are schema-free, support easy replication, have a simple API, are eventually consistent, and can handle huge amounts of data.

Important Points of Cassandra

o Cassandra is a column-oriented database.
o Cassandra is scalable, consistent, and fault-tolerant.
o Cassandra was created at Facebook. It is totally different from relational database management systems.
o Cassandra is being used by some of the biggest companies like Facebook, Twitter, Cisco, Rackspace, eBay, Netflix, and more.

Cassandra Data Model

The data model in Cassandra is totally different from what we normally see in an RDBMS. Let's see how Cassandra stores its data.

Cluster

A Cassandra database is distributed over several machines that operate together. The outermost container is known as the Cluster, which contains different nodes. Every node contains a replica, and in case of a failure, the replica takes charge. Cassandra arranges the nodes in a cluster in a ring format and assigns data to them.

Keyspace

Keyspace is the outermost container for data in Cassandra. Following are the basic attributes of a Keyspace in Cassandra:

o Replication factor: It specifies the number of machines in the cluster that will receive copies of the same data.

o Replica placement strategy: It is a strategy which specifies how to place replicas in the ring. There are three types of strategies:

1) Simple strategy (rack-unaware strategy)

2) Old network topology strategy (rack-aware strategy)

3) Network topology strategy (datacenter-shared strategy)

Cassandra Create Keyspace

Cassandra Query Language (CQL) facilitates developers to communicate with Cassandra. The syntax of Cassandra Query Language is very similar to SQL.

What is a Keyspace?

A keyspace is an object that is used to hold column families and user defined types. A keyspace is like an RDBMS database: it contains column families, indexes, user defined types, data center awareness, the strategy used in the keyspace, the replication factor, etc.

In Cassandra, the "CREATE KEYSPACE" command is used to create a keyspace.

Syntax:

CREATE KEYSPACE <identifier> WITH <properties>

Different components of a Cassandra Keyspace

Strategy: There are two types of strategy declaration in Cassandra syntax:

o Simple Strategy: Simple strategy is used in the case of one data center. In this strategy, the first replica is placed on the selected node and the remaining nodes are placed in clockwise direction in the ring without considering rack or node location.

o Network Topology Strategy: This strategy is used in the case of more than one data center. In this strategy, you have to provide the replication factor for each data center separately.

Replication Factor: The replication factor is the number of replicas of data placed on different nodes. A replication factor of more than two is good to attain no single point of failure, so 3 is a good replication factor.

Example:

Let's take an example to create a keyspace named "javatpoint".

CREATE KEYSPACE javatpoint


WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 3};

The keyspace is created now.

Using a Keyspace

To use the created keyspace, you have to use the USE command.

Syntax:

USE <identifier>

Cassandra Alter Keyspace

The "ALTER KEYSPACE" command is used to alter the replication factor, strategy name and durable writes properties of a created keyspace in Cassandra.

Syntax:

ALTER KEYSPACE <identifier> WITH <properties>
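For example (a sketch using the keyspace created above), the replication factor could be changed like this:

ALTER KEYSPACE javatpoint
WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 1};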
Cassandra Drop Keyspace

In Cassandra, the "DROP KEYSPACE" command is used to drop a keyspace with all its data, column families, user defined types and indexes from Cassandra.

Syntax:

DROP keyspace KeyspaceName;

Cassandra Create Table

In Cassandra, the CREATE TABLE command is used to create a table. Here, a column family is used to store data, just like a table in an RDBMS.

So, you can say that the CREATE TABLE command is used to create a column family in Cassandra.

Syntax:

CREATE TABLE tablename(
column1 name datatype PRIMARY KEY,
column2 name data type,
column3 name data type
)

There are two types of primary keys:

Single primary key: Use the following syntax for a single primary key.

Primary key (ColumnName)

Compound primary key: Use the following syntax for a compound primary key.

Primary key(ColumnName1, ColumnName2 . . .)

Example:

Let's take an example to demonstrate the CREATE TABLE command. Here, we are using the already created keyspace "javatpoint".

CREATE TABLE student(
student_id int PRIMARY KEY,
student_name text,
student_city text,
student_fees varint,
student_phone varint
);
SELECT * FROM student;

Cassandra Alter Table

The ALTER TABLE command is used to alter a table after creating it. You can use the ALTER command to perform two types of operations:

o Add a column
o Drop a column

Syntax:

ALTER (TABLE | COLUMNFAMILY) <tablename> <instruction>

Adding a Column

You can add a column to a table by using the ALTER command. While adding a column, you have to make sure that the column name is not conflicting with the existing column names and that the table is not defined with the compact storage option.

Syntax:

ALTER TABLE table name
ADD new column datatype;


After using the following command:

ALTER TABLE student
ADD student_email text;

A new column is added. You can check it by using the SELECT command.

Dropping a Column

You can also drop an existing column from a table by using the ALTER command. You should check that the table is not defined with the compact storage option before dropping a column from a table.

Syntax:

ALTER table name
DROP column name;

Example:

After using the following command:

ALTER TABLE student
DROP student_email;

Now you can see that the column named "student_email" is dropped. If you want to drop multiple columns, separate the column names by ",".

Cassandra DROP Table

The DROP TABLE command is used to drop a table.

Syntax:

DROP TABLE <tablename>

Example:

After using the following command:

DROP TABLE student;

The table named "student" is dropped now. You can use the DESCRIBE command to verify whether the table is deleted or not. Since the student table has been deleted, you will not find it in the column families list.

Cassandra Truncate Table

The TRUNCATE command is used to truncate a table. If you truncate a table, all the rows of the table are deleted permanently.

Syntax:

TRUNCATE <tablename>

Cassandra Batch

In Cassandra, BATCH is used to execute multiple modification statements (insert, update, delete) simultaneously. It is very useful when you have to update some columns as well as delete some of the existing data.

Syntax:

BEGIN BATCH
<insert-stmt>/ <update-stmt>/ <delete-stmt>
APPLY BATCH
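For example (an illustrative sketch reusing the student table from above), several statements can be applied together:

BEGIN BATCH
INSERT INTO student (student_id, student_name, student_city) VALUES (6, 'Ravi', 'Chennai');
UPDATE student SET student_fees = 5000 WHERE student_id = 2;
DELETE student_phone FROM student WHERE student_id = 3;
APPLY BATCH;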
Use of WHERE Clause

The WHERE clause is used with the SELECT command to specify exactly which rows we have to fetch.

Syntax:

SELECT FROM <table name> WHERE <condition>;
SELECT * FROM student WHERE student_id=2;

Cassandra Update Data

The UPDATE command is used to update data in a Cassandra table. If you see no result after updating the data, it means the data was successfully updated; otherwise an error will be returned. While updating data in a Cassandra table, the following keywords are commonly used:

o Where: The WHERE clause is used to select the row that you want to update.

o Set: The SET clause is used to set the value.

o Must: It is used to include all the columns composing the primary key.

Syntax:

UPDATE <tablename>
SET <column name> = <new value>
<column name> = <value>....
WHERE <condition>
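For example (a sketch using the student table from the earlier examples):

UPDATE student
SET student_fees = 9000
WHERE student_id = 2;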


Cassandra DELETE Data

The DELETE command is used to delete data from a Cassandra table. You can delete the complete table or a selected row by using this command.

Syntax:

DELETE FROM <identifier> WHERE <condition>;

Delete an entire row

To delete the entire row with student_id "3", use the following command:

DELETE FROM student WHERE student_id=3;

Delete a specific column name

Example:

Delete the student_fees where student_id is 4.

DELETE student_fees FROM student WHERE student_id=4;

HAVING Clause in SQL

The HAVING clause places a condition on the groups defined by the GROUP BY clause in the SELECT statement.

This SQL clause is implemented after the 'GROUP BY' clause in the 'SELECT' statement.

This clause is used in SQL because we cannot use the WHERE clause with SQL aggregate functions. Both WHERE and HAVING clauses are used for filtering the records in SQL queries.

Syntax of HAVING clause in SQL

SELECT column_Name1, column_Name2, ....., column_NameN aggregate_function_name(column_Name) FROM Table_Name GROUP BY column_Name HAVING condition;

SELECT SUM(Emp_Salary), Emp_City FROM Employee GROUP BY Emp_City;

The following query uses the HAVING clause in SQL:

SELECT SUM(Emp_Salary), Emp_City FROM Employee GROUP BY Emp_City
HAVING SUM(Emp_Salary)>12000;

MIN Function with HAVING Clause:

If you want to show each department and the minimum salary in each department, you have to write the following query:

SELECT MIN(Emp_Salary), Emp_Dept FROM Employee GROUP BY Emp_Dept;

MAX Function with HAVING Clause:

SELECT MAX(Emp_Salary), Emp_Dept FROM Employee GROUP BY Emp_Dept;

AVERAGE Function:

SELECT AVG(Emp_Salary), Emp_Dept FROM Employee GROUP BY Emp_Dept;

SQL ORDER BY Clause

o Whenever we want to sort the records based on the columns stored in the tables of the SQL database, we use the ORDER BY clause in SQL.

o The ORDER BY clause in SQL helps us sort the records based on a specific column of a table. This means that all the values stored in the column on which we are applying ORDER BY will be sorted, and the corresponding column values will be displayed in that order.

Syntax to sort the records in ascending order:

SELECT ColumnName1,...,ColumnNameN FROM TableName ORDER BY ColumnName ASC;

Syntax to sort the records in descending order:

SELECT ColumnName1,...,ColumnNameN FROM TableName ORDER BY ColumnName DESC;

Syntax to sort the records in ascending order without using the ASC keyword:

SELECT ColumnName1,...,ColumnNameN FROM TableName ORDER BY ColumnName;
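For example (an illustrative query against the Employee table used above), employees can be sorted from highest to lowest salary:

SELECT Emp_City, Emp_Salary FROM Employee ORDER BY Emp_Salary DESC;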
Cassandra vs. MongoDB

1) Cassandra is a high-performance distributed database system. MongoDB is a cross-platform document-oriented database system.

2) Cassandra is written in Java. MongoDB is written in C++.

3) Cassandra stores data in tabular form, like the SQL format. MongoDB stores data in JSON format.

4) Cassandra is licensed under Apache. MongoDB is licensed under AGPL, with drivers under Apache.

5) Cassandra is mainly designed to handle large amounts of data across many commodity servers. MongoDB is designed to deal with JSON-like documents and make access by applications easier and faster.

6) Cassandra provides high availability with no single point of failure. MongoDB is easy to administer in the case of failure.

What is HIVE?

Hive is a data warehouse system which is used to analyze structured data. It is built on top of Hadoop. It was developed by Facebook.

Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage. It runs SQL-like queries called HQL (Hive Query Language) which get internally converted to MapReduce jobs.

Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and User Defined Functions (UDF).

Features of Hive

o Hive is fast and scalable.
o It provides SQL-like queries (i.e., HQL) that are implicitly transformed to MapReduce or Spark jobs.
o It is capable of analyzing large datasets stored in HDFS.
o It uses indexing to accelerate queries.
o It can operate on compressed data stored in the Hadoop ecosystem.
o It supports user-defined functions (UDFs) through which the user can provide custom functionality.

HIVE Data Types

Hive data types are categorized into numeric types, string types, misc types, and complex types. A list of Hive data types is given below.

Integer Types

Type      Size                    Range
TINYINT   1-byte signed integer   -128 to 127
SMALLINT  2-byte signed integer   -32,768 to 32,767
INT       4-byte signed integer   -2,147,483,648 to 2,147,483,647
BIGINT    8-byte signed integer   -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807

Hive Decimal Type

Type     Size     Description
FLOAT    4-byte   Single precision floating point number
DOUBLE   8-byte   Double precision floating point number

Date/Time Types

TIMESTAMP

o It supports the traditional UNIX timestamp with optional nanosecond precision.
o As an integer numeric type, it is interpreted as a UNIX timestamp in seconds.
o As a floating point numeric type, it is interpreted as a UNIX timestamp in seconds with decimal precision.
o As a string, it follows the java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" (9 decimal places of precision).

DATES

The Date value is used to specify a particular year, month and day, in the form YYYY-MM-DD. However, it doesn't provide the time of the day. The range of the Date type lies between 0000-01-01 and 9999-12-31.

String Types

STRING

The string is a sequence of characters. Its values can be enclosed within single quotes (') or double quotes (").

VARCHAR

The varchar is a variable-length type whose range lies between 1 and 65535, which specifies the maximum number of characters allowed in the character string.

CHAR

The char is a fixed-length type whose maximum length is fixed at 255.


Complex Type

o Struct: It is similar to a C struct or an object where fields are accessed using the "dot" notation. Example: struct('James','Roy')
o Map: It contains key-value tuples where the fields are accessed using array notation. Example: map('first','James','last','Roy')
o Array: It is a collection of values of a similar type that are indexable using zero-based integers. Example: array('James','Roy')
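As a sketch (the table and column names are hypothetical), a table combining these complex types could be declared as follows:

hive> create table employee_profile (
  name string,
  address struct<city:string, pin:int>,
  skills array<string>,
  phone map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '$'
map keys terminated by '#';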

Hive - Create Database

In Hive, the database is considered as a catalog or namespace of tables. So, we can maintain multiple tables within a database, where a unique name is assigned to each table. Hive also provides a default database named default.

o Initially, we check the default database provided by Hive. To check the list of existing databases, use the following command:

hive> show databases;

o Let's create a new database by using the following command:

hive> create database demo;

Hive - Drop Database

In this section, we will see how to drop an existing database.

o Drop the database by using the following command:

hive> drop database demo;

Hive - Create Table

In Hive, we can create a table by using conventions similar to SQL. It supports a wide range of flexibility regarding where the data files for tables are stored. It provides two types of table:

o Internal table
o External table

Internal Table

The internal tables are also called managed tables, as the lifecycle of their data is controlled by Hive. By default, these tables are stored in a subdirectory under the directory defined by hive.metastore.warehouse.dir (i.e. /user/hive/warehouse). The internal tables are not flexible enough to share with other tools like Pig. If we try to drop an internal table, Hive deletes both the table schema and the data.

o Let's create an internal table by using the following command:

hive> create table demo.employee (Id int, Name string , Salary float)
row format delimited
fields terminated by ',' ;

Let's see the metadata of the created table by using the following command:

hive> describe demo.employee

External Table

The external table allows us to create and access a table and its data externally. The external keyword is used to specify the external table, whereas the location keyword is used to determine the location of the loaded data.

As the table is external, the data is not present in the Hive directory. Therefore, if we try to drop the table, the metadata of the table will be deleted, but the data still exists.

Let's create an external table using the following command:

hive> create external table emplist (Id int, Name string , Salary float)
row format delimited
fields terminated by ','
location '/HiveDirectory';

We can use the following command to retrieve the data:

select * from emplist;

Hive - Load Data

Once the internal table has been created, the next step is to load the data into it. So, in Hive, we can easily load data from any file into the database.

o Let's load the data of the file into the database by using the following command:


load data local inpath '/home/codegyani/hive/emp_details' into table demo.employee;

Hive - Drop Table

Hive facilitates us to drop a table by using the SQL drop table command. Let's follow the steps below to drop the table from the database.

o Let's check the list of existing databases by using the following command:

hive> show databases;
hive> use demo;
hive> show tables;
hive> drop table new_employee;

Hive - Alter Table

In Hive, we can perform modifications in the existing table like changing the table name, column names, comments, and table properties. It provides SQL-like commands to alter the table.

Rename a Table

If we want to change the name of an existing table, we can rename that table by using the following signature:

Alter table old_table_name rename to new_table_name;

o Now, change the name of the table by using the following command:

Alter table emp rename to employee_data;

Adding a column

In Hive, we can add one or more columns to an existing table by using the following signature:

Alter table table_name add columns(column_name datatype);

o Now, add a new column to the table by using the following command:

Alter table employee_data add columns (age int);

Change Column

In Hive, we can rename a column, and change its type and position. Here, we are changing the name of the column by using the following signature:

Alter table table_name change old_column_name new_column_name datatype;

o Now, change the name of the column by using the following command:

Alter table employee_data change name first_name string;

Delete or Replace Column

Hive allows us to delete one or more columns by replacing them with new columns. Thus, we cannot drop a column directly.

o Let's see the existing schema of the table.

o Now, drop a column from the table.

alter table employee_data replace columns( id string, first_name string, age int);

Partitioning in Hive

Partitioning in Hive means dividing the table into some parts based on the values of a particular column like date, course, city or country. The advantage of partitioning is that since the data is stored in slices, the query response time becomes faster.

Partitioning in Hive can be executed in two ways:

o Static partitioning
o Dynamic partitioning

Static Partitioning

In static or manual partitioning, it is required to pass the values of partitioned columns manually while loading the data into the table. Hence, the data file doesn't contain the partitioned columns.

Example of Static Partitioning

o First, select the database in which we want to create a table.

hive> use test;

o Create the table and provide the partitioned columns by using the following command:


hive> create table student (id int, name string, age int, institute string)
partitioned by (course string)
row format delimited
fields terminated by ',';
hive> describe student;

o Load the data into the table and pass the values of partition columns with it by using the following command:

hive> load data local inpath '/home/codegyani/hive/student_details1' into table student partition(course= "java");

Here, we are partitioning the students of an institute based on courses.

o Load the data of another file into the same table and pass the values of partition columns with it by using the following command:

hive> load data local inpath '/home/codegyani/hive/student_details2' into table student partition(course= "hadoop");

hive> select * from student;

o Now, try to retrieve the data based on partitioned columns by using the following command:

hive> select * from student where course="java";

Dynamic Partitioning

In dynamic partitioning, the values of partitioned columns exist within the table. So, it is not required to pass the values of partitioned columns manually.

o First, select the database in which we want to create a table.

hive> use show;

o Enable the dynamic partition by using the following commands:

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;

o Create a dummy table to store the data.

hive> create table stud_demo(id int, name string, age int, institute string, course string)
row format delimited
fields terminated by ',';

o Now, load the data into the table.

hive> load data local inpath '/home/codegyani/hive/student_details' into table stud_demo;

o Create a partition table by using the following command:

hive> create table student_part (id int, name string, age int, institute string)
partitioned by (course string)
row format delimited
fields terminated by ',';

o Now, insert the data of the dummy table into the partition table.

hive> insert into student_part
partition(course)
select id, name, age, institute, course
from stud_demo;
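The partitions created by this insert can then be inspected or queried, for example:

hive> show partitions student_part;
hive> select * from student_part where course= "hadoop";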

OrientDB Graph Database

What is a Graph?

A graph is a pictorial representation of objects which are connected by some pairs of links. A graph contains two elements: nodes (vertices) and relationships (edges).

What is a Graph database?

A graph database is a database which is used to model the data in the form of a graph. It stores any kind of data using:

o Nodes
o Relationships
o Properties

Nodes: Nodes are the records/data in graph databases. Data is stored as properties, and properties are simple name/value pairs.

Relationships: Relationships are used to connect nodes. They specify how the nodes are related.

o Relationships always have a direction.


o Relationships always have a type.
o Relationships form patterns of data.

Properties: Properties are named data values.

Popular Graph Databases

Neo4j is the most popular graph database. Other graph databases are:

o Oracle NoSQL Database
o OrientDB
o HyperGraphDB
o GraphBase
o InfiniteGraph
o AllegroGraph, etc.

Graph Database vs. RDBMS

Differences between a graph database and an RDBMS:

1. In a graph database, data is stored in graphs. In an RDBMS, data is stored in tables.

2. In a graph database there are nodes. In an RDBMS, there are rows.

3. In a graph database there are properties and their values. In an RDBMS, there are columns and data.

4. In a graph database the connected nodes are defined by relationships. In an RDBMS, constraints are used instead.

5. In a graph database traversal is used instead of join. In an RDBMS, join is used instead of traversal.

MongoDB vs OrientDB

MongoDB and OrientDB contain many common features, but the engines are fundamentally different. MongoDB is a pure document database and OrientDB is a hybrid document database with a graph engine.

Relationships - MongoDB: uses RDBMS-style JOINs to create relationships between entities; this has a high runtime cost and does not scale as the database grows. OrientDB: embeds and connects documents like a relational database; it uses direct, super-fast links taken from the graph database world.

Fetch Plan - MongoDB: costly JOIN operations. OrientDB: easily returns the complete graph with interconnected documents.

Transactions - MongoDB: doesn't support ACID transactions, but it supports atomic operations. OrientDB: supports ACID transactions as well as atomic operations.

Query language - MongoDB: has its own language based on JSON. OrientDB: query language is built on SQL.

Indexes - MongoDB: uses the B-Tree algorithm for all indexes. OrientDB: supports three different indexing algorithms so that the user can achieve the best performance.

Storage engine - MongoDB: uses a memory mapping technique. OrientDB: uses the storage engines named LOCAL and PLOCAL.

The following comparison maps the relational model, the document model, and the OrientDB document model:

Relational Model    Document Model     OrientDB Document Model
Table               Collection         Class or Cluster
Row                 Document           Document
Column              Key/value pair     Document field
Relationship        Not available      Link

The SQL Reference of the OrientDB database provides several commands to create, alter, and drop databases.

Create database

The following statement is the basic syntax of the Create Database command.

CREATE DATABASE <database-url> [<user> <password> <storage-type> [<db-type>]]


Following are the details about the options in the above syntax.

<database-url> − Defines the URL of the database. The URL contains two parts: one is <mode> and the second one is <path>.

<mode> − Defines the mode, i.e. local mode or remote mode.

<path> − Defines the path to the database.

<user> − Defines the user you want to connect to the database.

<password> − Defines the password for connecting to the database.

<storage-type> − Defines the storage type. You can choose between PLOCAL and MEMORY.

Example

You can use the following command to create a local database named demo.

orientdb> CREATE DATABASE PLOCAL:/opt/orientdb/databases/demo

If the database is successfully created, you will get the following output.

Database created successfully.
Current database is: plocal:/opt/orientdb/databases/demo
orientdb {db = demo}>

Alter database

The following statement is the basic syntax of the Alter Database command.

ALTER DATABASE <attribute-name> <attribute-value>

Where <attribute-name> defines the attribute that you want to modify and <attribute-value> defines the value you want to set for that attribute.

orientdb> ALTER DATABASE custom strictSQL = false

If the command is executed successfully, you will get the following output.

Database updated successfully

Connect to a database

The following statement is the basic syntax of the Connect command.

CONNECT <database-url> <user> <password>

Following are the details about the options in the above syntax.

<database-url> − Defines the URL of the database. The URL contains two parts: one is <mode> and the second one is <path>.

<mode> − Defines the mode, i.e. local mode or remote mode.

<path> − Defines the path to the database.

<user> − Defines the user you want to connect to the database.

<password> − Defines the password for connecting to the database.

Example

We have already created a database named 'demo' in the previous chapters. In this example, we will connect to it using the user admin.

You can use the following command to connect to the demo database.

orientdb> CONNECT PLOCAL:/opt/orientdb/databases/demo admin admin

If it is successfully connected, you will get the following output −

Connecting to database [plocal:/opt/orientdb/databases/demo] with user 'admin'…OK
orientdb {db = demo}>

List databases

The following statement is the basic syntax of the List Databases command.

LIST DATABASES

Drop database

The following statement is the basic syntax of the Drop Database command.

DROP DATABASE [<database-name> <server-username> <server-user-password>]

Following are the details about the options in the above syntax.

<database-name> − Name of the database you want to drop.

<server-username> − Username of the database user who has the privilege to drop a database.

<server-user-password> − Password of the particular user.

In this example, we will use the same database named 'demo' that we created in an earlier chapter. You can use the following command to drop the database demo.

orientdb {db = demo}> DROP DATABASE

If this command is successfully executed, you will get the following output.

Database 'demo' deleted successfully

INSERT RECORD

The following statement is the basic syntax of the Insert Record command.

INSERT INTO [class:]<class>|cluster:<cluster>|index:<index>
[(<field>[,]*) VALUES (<expression>[,]*)[,]*]|
[SET <field> = <expression>|<sub-command>[,]*]|
[CONTENT {<JSON>}]
[RETURN <expression>]
[FROM <query>]

Following are the details about the options in the above syntax.

SET − Defines each field along with its value.

CONTENT − Defines JSON data to set field values. This is optional.


RETURN − Defines the expression to return instead of the number of records inserted. The most common use cases are −

 @rid − Returns the Record ID of the new record.

 @this − Returns the entire new record.

FROM − Where you want to insert the record, or a result set.

The following command inserts the first record into the Customer table.

INSERT INTO Customer (id, name, age) VALUES (01,'satish', 25)

The following command inserts the second record into the Customer table.

INSERT INTO Customer SET id = 02, name = 'krishna', age = 26

The following command inserts the next two records into the Customer table.

INSERT INTO Customer (id, name, age) VALUES (04,'javeed', 21), (05,'raja', 29)
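A record can also be inserted by supplying a JSON document with the CONTENT keyword, as listed in the syntax above (an illustrative sketch; the values are hypothetical):

INSERT INTO Customer CONTENT {"id": 3, "name": "rajesh", "age": 31}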
SELECT COMMAND

The following statement is the basic syntax of the SELECT command.

SELECT [ <Projections> ] [ FROM <Target> [ LET <Assignment>* ] ]
[ WHERE <Condition>* ]
[ GROUP BY <Field>* ]
[ ORDER BY <Fields>* [ ASC|DESC ] * ]
[ UNWIND <Field>* ]
[ SKIP <SkipRecords> ]
[ LIMIT <MaxRecords> ]
[ FETCHPLAN <FetchPlan> ]
[ TIMEOUT <Timeout> [ <STRATEGY> ] ]
[ LOCK default|record ]
[ PARALLEL ]
[ NOCACHE ]

Following are the details about the options in the above syntax.

<Projections> − Indicates the data you want to extract from the query as the result record set.

FROM − Indicates the object to query. This can be a class, a cluster, a single Record ID, or a set of Record IDs. You can specify all of these objects as the target.

WHERE − Specifies the condition to filter the result-set.

LET − Indicates the context variables which are used in projections, conditions or sub-queries.

GROUP BY − Indicates the field to group the records by.

ORDER BY − Indicates the field to arrange the records in order.

UNWIND − Designates the field on which to unwind the collection of records.

SKIP − Defines the number of records you want to skip from the start of the result-set.

LIMIT − Indicates the maximum number of records in the result-set.

FETCHPLAN − Specifies the strategy defining how you want to fetch results.

TIMEOUT − Defines the maximum time in milliseconds for the query.

LOCK − Defines the locking strategy. DEFAULT and RECORD are the available lock strategies.

PARALLEL − Executes the query against 'x' concurrent threads.

NOCACHE − Defines whether you want to use the cache or not.

Example

Method 1 − You can use the following queries to select records from the Customer table.

orientdb {db = demo}> SELECT FROM Customer
orientdb {db = demo}> SELECT FROM Customer WHERE name LIKE 'k%'
orientdb {db = demo}> SELECT FROM Customer WHERE name.left(1) = 'k'
orientdb {db = demo}> SELECT id, name.toUpperCase() FROM Customer
orientdb {db = demo}> SELECT FROM Customer WHERE age in [25,29]
orientdb {db = demo}> SELECT FROM Customer WHERE ANY() LIKE '%sh%'
orientdb {db = demo}> SELECT FROM Customer ORDER BY age DESC

UPDATE QUERY

The Update Record command is used to modify the value of a particular record. SET is the basic keyword to update a particular field value.

The following statement is the basic syntax of the Update command.

UPDATE <class>|cluster:<cluster>|<recordID>
[SET|INCREMENT|ADD|REMOVE|PUT <field-name> = <field-value>[,]*] |[CONTENT|MERGE <JSON>]
[UPSERT]
[RETURN <returning> [<returning-expression>]]
[WHERE <conditions>]
[LOCK default|record]
[LIMIT <max-records>] [TIMEOUT <timeout>]

Following are the details about the options in the above syntax.

SET − Defines the field to update.

INCREMENT − Increments the specified field value by the given value.

ADD − Adds a new item to the collection fields.

REMOVE − Removes an item from the collection field.

PUT − Puts an entry into a map field.

CONTENT − Replaces the record content with the JSON document content.

MERGE − Merges the record content with a JSON document.


LOCK − Specifies how to lock the records between load and update. We have two options: Default and Record.

UPSERT − Updates a record if it exists or inserts a new record if it doesn't. It helps in executing a single query in place of executing two queries.

RETURN − Specifies an expression to return instead of the number of records.

LIMIT − Defines the maximum number of records to update.

TIMEOUT − Defines the time you want to allow the update to run before it times out.

Try the following query to update the age of the customer 'Raja'.

orientdb {db = demo}> UPDATE Customer SET age = 28 WHERE name = 'Raja'

Truncate

The Truncate Record command is used to delete the values of a particular record.

The following statement is the basic syntax of the Truncate command.

TRUNCATE RECORD <rid>*

Where <rid>* indicates the Record ID to truncate. You can use multiple Record IDs separated by commas to truncate multiple records. It returns the number of records truncated.

Try the following query to truncate the record having Record ID #11:4.

orientdb {db = demo}> TRUNCATE RECORD #11:4

DELETE

The Delete Record command is used to delete one or more records completely from the database.

The following statement is the basic syntax of the Delete command.

DELETE FROM <Class>|cluster:<cluster>|index:<index>
[LOCK <default|record>]
[RETURN <returning>]
[WHERE <Condition>*]
[LIMIT <MaxRecords>]
[TIMEOUT <timeout>]

Following are the details about the options in the above syntax.

LOCK − Specifies how to lock the records between load and delete. We have two options: Default and Record.

RETURN − Specifies an expression to return instead of the number of records.

LIMIT − Defines the maximum number of records to delete.

TIMEOUT − Defines the time you want to allow the delete to run before it times out.

Note − Don't use DELETE to remove vertices or edges, because it affects the integrity of the graph.

Try the following query to delete the record having id = 4.

orientdb {db = demo}> DELETE FROM Customer WHERE id = 4

OrientDB Features

OrientDB is a multi-model NoSQL database providing more functionality and flexibility, while being powerful enough to replace your operational DBMS.

SPEED

OrientDB was engineered from the ground up with performance as a key specification. It's fast on both read and write operations, and stores up to 120,000 records per second.

 No more joins: relationships are physical links to the records.
 Better RAM use.
 Traverses parts of or entire trees and graphs of records in milliseconds.
 Traversing speed is not affected by the database size.

ENTERPRISE

 Incremental backups
 Unmatched security
 24x7 support
 Query Profiler
 Distributed Clustering configuration
 Metrics Recording
 Live Monitor with configurable alerts

With a master-slave architecture, the master often becomes the bottleneck. With OrientDB, throughput is not limited by a single server; global throughput is the sum of the throughput of all the servers.

 Multi-Master + Sharded architecture
 Elastic Linear Scalability
 Restore the database content using the WAL
 OrientDB Community is free for commercial use.
 Comes with an Apache 2 Open Source License.
 Eliminates the need for multiple products and multiple licenses.
