
Module - 3

NoSQL Big Data Management


Features of Distributed Computing

1. Fault Tolerance and reliability

2. Flexibility

3. Sharding

4. Speed

5. Scalability

6. Resource Sharing

7. Performance
Demerits of Distributed Computing

1. Difficulty troubleshooting a large network infrastructure

2. Additional software requirements

3. Security risks for data and resources


Hadoop or NoSQL? What is the difference?

• Analytical vs Operational

• Volume (Petabytes vs Terabytes) vs Velocity

• Batch vs Interactive
SQL (RDBMS) vs NoSQL
NoSQL: flexible data models and multiple schemas; the data is considered semi-structured.
What is NoSQL?
• A NoSQL database is a non-relational data management system that does not require a fixed schema.
• NoSQL is a new class of databases.
• It offers a flexible database model used for Big Data and real-time web apps.
• NoSQL database systems include a wide range of database technologies that can store structured, semi-structured, unstructured and polymorphic data.

Big Data solutions:
• Require a scalable distributed computing model with a shared-nothing architecture.
• One solution is to store Big Data in HDFS files. Accesses to HDFS data are sequential.
NoSQL databases have the following properties:

• Support for Multiple Data Models (Schema-free)

• Simple Application Programming Interface (API)

• Higher scalability

• Distributed computing and cost-effectiveness


Types of NoSQL Databases
NoSQL data stores and their characteristic features:

Apache HBase: Open-source, non-relational data store written in Java; a column-family based NoSQL data store providing BigTable-like capabilities, scalability and strong consistency.

MongoDB: Master-slave distribution model; document-oriented data store with JSON-like documents and dynamic schemas; open-source, scalable, non-relational NoSQL database; used at the backend by websites such as Craigslist, eBay and Foursquare.

Apache Cassandra: Decentralized, peer-to-peer distribution model; open-source, scalable, non-relational, column-family based NoSQL data store; fault-tolerant with tuneable consistency; used by Facebook and Instagram.

Apache CouchDB: An Apache project that is also widely used as a database for the web. CouchDB is a document store. It uses the JSON data exchange format to store its documents, JavaScript for indexing, combining and transforming documents, and HTTP APIs.

Oracle NoSQL: A step towards a NoSQL data store; distributed key-value data store; provides transactional semantics for data manipulation, horizontal scalability, and simple administration and monitoring.
CAP Theorem
It states that it is impossible for a distributed data store to offer more than two out of three guarantees:
• Consistency
• Availability
• Partition Tolerance

When a network partition occurs, the database has two choices:
• AP: The database must answer, even though that answer may be old or wrong data.
• CP: The database should not answer unless it has received the latest copy of the data.
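
The AP and CP choices above can be illustrated with a toy replica. This is a minimal sketch, not how any particular database implements it; the names `Replica`, `UnavailableError` and `demo` are hypothetical and exist only for this illustration.

```python
class UnavailableError(Exception):
    """Raised when a CP replica refuses to answer during a partition."""


class Replica:
    """Toy replica (hypothetical) that is either AP or CP."""

    def __init__(self, mode):
        if mode not in ("AP", "CP"):
            raise ValueError("mode must be 'AP' or 'CP'")
        self.mode = mode
        self.value = None
        self.partitioned = False

    def replicate(self, value):
        # Updates from the rest of the cluster only arrive while connected.
        if not self.partitioned:
            self.value = value

    def read(self):
        # CP: refuse to answer when the latest copy cannot be confirmed.
        if self.partitioned and self.mode == "CP":
            raise UnavailableError("cannot confirm the latest copy of the data")
        # AP: always answer, even if the local copy may be stale.
        return self.value


def demo(mode):
    replica = Replica(mode)
    replica.replicate("v1")     # arrives before the partition
    replica.partitioned = True  # the network partition starts
    replica.replicate("v2")     # this update never reaches the replica
    try:
        return replica.read()
    except UnavailableError:
        return "no answer"


print(demo("AP"))  # stale value "v1"
print(demo("CP"))  # "no answer"
```

The AP replica stays available but returns the old value, while the CP replica gives up availability to avoid returning stale data.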
Schema-Less Models
Advantages of schema-less:
• Speed for whole-document requests

• Ability to store data in any format, including documents with missing fields

• Most technologies (e.g. Cassandra, Hadoop, MongoDB) allow rapid and easy scaling of servers (sharding/clustering).

• Some technologies allow indexing, but at that point the design is no longer truly schema-less. You can instead have a nearly schema-less design with one primary key (say, a document ID) and a few required fields (like a timestamp), and still allow nearly anything else to be loaded in.

• A developer can build their own objects (schema) easily and change them on the fly (think Agile) without engaging a DBA.
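
The "nearly schema-less" design described above can be sketched with a plain dictionary standing in for the data store. The field names `document_id` and `timestamp` and the function `insert` are assumptions made for this example, not part of any real product's API.

```python
# Hypothetical required fields: one primary key plus a timestamp.
REQUIRED_FIELDS = {"document_id", "timestamp"}


def insert(collection, document):
    """Store a document, enforcing only the required fields."""
    missing = REQUIRED_FIELDS - set(document)
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    collection[document["document_id"]] = document


collection = {}

# Two documents with completely different optional fields.
insert(collection, {"document_id": 1, "timestamp": "2024-01-01T00:00:00Z",
                    "title": "Invoice"})
insert(collection, {"document_id": 2, "timestamp": "2024-01-02T00:00:00Z",
                    "tags": ["a", "b"], "likes": 3})

print(len(collection))  # 2
```

Only the primary key and the timestamp are checked. Every other field can vary from document to document and can change without a schema migration.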
Increasing Flexibility for Data Manipulation

BASE- Principles (Properties)

Basically Available: Rather than enforcing immediate consistency, BASE-modelled NoSQL databases ensure availability of data by spreading and replicating it across the nodes of the database cluster.

Soft State: Due to the lack of immediate consistency, data values may change over time.

Eventual Consistency: The system becomes consistent some time after the application's input. The data is replicated to different nodes and eventually reaches a consistent state, but consistency is not guaranteed at the transaction level.
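
Eventual consistency can be pictured with a small simulation in which a write is applied to one node at once and reaches the other nodes later. This is only a sketch under simplified assumptions (one writer, a single replication queue); the class `Cluster` and its methods are hypothetical.

```python
from collections import deque


class Cluster:
    """Toy cluster: writes land on node 0 and replicate lazily."""

    def __init__(self, node_count):
        self.nodes = [{} for _ in range(node_count)]
        self.pending = deque()  # queued (node_index, key, value) updates

    def write(self, key, value):
        self.nodes[0][key] = value  # applied immediately on one node
        for index in range(1, len(self.nodes)):
            self.pending.append((index, key, value))

    def read(self, node_index, key):
        return self.nodes[node_index].get(key)

    def replicate_step(self):
        """Deliver one queued update, if any."""
        if self.pending:
            index, key, value = self.pending.popleft()
            self.nodes[index][key] = value

    def is_consistent(self, key):
        return len({node.get(key) for node in self.nodes}) == 1


cluster = Cluster(3)
cluster.write("likes", 100)
print(cluster.read(0, "likes"), cluster.read(2, "likes"))  # 100 None
print(cluster.is_consistent("likes"))                      # False

cluster.replicate_step()
cluster.replicate_step()
print(cluster.is_consistent("likes"))                      # True
```

Between the write and the last replication step, different nodes return different answers (soft state). Once all queued updates are delivered, every node agrees.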
Key-Value Store:
The data store is characterized by high performance, scalability and flexibility.
A simple string called a key maps to a large data string or BLOB (Binary Large Object).

A key-value store uses a primary key to access the values. Therefore, the store can be easily scaled up for very large data.

The key is flexible and can be represented in many formats:
(i) Artificially generated strings created from a hash of a value,
(ii) Logical path names to images or files,
(iii) REST web-service calls (request-response cycles),
(iv) SQL queries.
The key-value store lets a client read and write values using a key as follows:

• Get (key): returns the value associated with the key.

• Put (key, value): associates the value with the key, and updates the value if the key is already present.

• Multi-get (key1, key2, …, keyN): returns the list of values associated with the list of keys.

• Delete (key): removes a key and its value from the data store.
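
The four operations above map naturally onto an in-memory dictionary. The following is a minimal sketch, assuming a single process and no persistence; the class name `KeyValueStore` and the example key are hypothetical.

```python
class KeyValueStore:
    """In-memory sketch of the Get / Put / Multi-get / Delete interface."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        """Return the value for key, or None if the key is absent."""
        return self._data.get(key)

    def put(self, key, value):
        """Associate value with key, overwriting any existing value."""
        self._data[key] = value

    def multi_get(self, *keys):
        """Return the values for the given keys, in the same order."""
        return [self._data.get(key) for key in keys]

    def delete(self, key):
        """Remove key and its value; do nothing if the key is absent."""
        self._data.pop(key, None)


store = KeyValueStore()
store.put("images/logo.png", b"\x89PNG")  # a logical path name used as key
store.put("user:1", "Alice")
store.put("user:1", "Alicia")             # put on an existing key updates it
print(store.get("user:1"))
print(store.multi_get("user:1", "user:2"))
store.delete("user:1")
print(store.get("user:1"))
```

Every access goes through the key. There is no way to ask for "all values containing Alice" without scanning the whole store.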
Limitations of the key-value store architectural pattern are:
• No indexes are maintained on values, so a subset of values is not searchable.
• A key-value store does not provide traditional database capabilities.
• Maintaining unique values as keys may become more difficult as the volume of data increases.
• Queries cannot be performed on individual values. There is no clause like the 'where' clause of a relational database to filter a result set.

Typical uses of a key-value store are:
(i) Image store,
(ii) Document or file store,
(iii) Lookup table,
(iv) Query-cache.
Document Store:

• A document store holds unstructured data.
• Storage is similar to an object store.
• Querying is easy. For example, section number, sub-section number, figure captions and table headings can be used to retrieve document partitions.
• Data is stored in nested hierarchies.
• JSON-format data model
• XML document object model (DOM)
• Machine-readable data as one BLOB

Typical uses of a document store are: (i) office documents, (ii) inventory store, (iii) forms data, (iv) document exchange and (v) document search.
Document Collection:
A collection can be used in many ways for managing a large document store.
Three uses of a document collection are:
1. Grouping documents together, similar to a directory structure in a file system.
2. Navigating through document hierarchies, logically grouping similar documents, and storing business rules such as permissions, indexes and triggers.
3. Containing other collections.
MONGODB DATABASE

• MongoDB is an open-source document database and a leading NoSQL database. MongoDB is written in C++.
• MongoDB is a cross-platform, document-oriented database that provides high performance, high availability, and easy scalability.
• MongoDB works on the concepts of collection and document.
A document is a set of key-value pairs. Documents have a dynamic schema.

Dynamic schema means that documents in the same collection do not need to have the same set of fields or structure, and common fields in a collection's documents may hold different types of data.

{
  _id: ObjectId("7df78ad8902c"),
  title: 'MongoDB Overview',
  description: 'MongoDB is no sql database',
  by: 'tutorials point',
  url: 'http://www.tutorialspoint.com',
  tags: ['mongodb', 'database', 'NoSQL'],
  likes: 100,
  comments: [
    {
      user: 'user1',
      message: 'My first comment',
      dateCreated: new Date(2011,1,20,2,15),
      like: 0
    },
    {
      user: 'user2',
      message: 'My second comments',
      dateCreated: new Date(2011,1,25,7,45),
      like: 5
    }
  ]
}

_id is a 12-byte value, written in hexadecimal, that identifies the document. Of these 12 bytes, the first 4 bytes hold the current timestamp, the next 3 bytes the machine ID, the next 2 bytes the process ID of the MongoDB server, and the remaining 3 bytes a simple incremental value.
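
The byte layout described above can be unpacked with a few lines of Python. This is a sketch that follows the 4 + 3 + 2 + 3 split exactly as stated in these notes; the function name `split_object_id` and the sample ID are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timezone


def split_object_id(hex_id):
    """Split a 24-hex-digit (12-byte) _id into the four parts listed above."""
    raw = bytes.fromhex(hex_id)
    if len(raw) != 12:
        raise ValueError("an _id must be exactly 12 bytes (24 hex digits)")
    seconds = int.from_bytes(raw[0:4], "big")
    return {
        # Bytes 0-3: creation time in seconds since the Unix epoch.
        "timestamp": datetime.fromtimestamp(seconds, tz=timezone.utc),
        # Bytes 4-6: machine ID.
        "machine_id": raw[4:7].hex(),
        # Bytes 7-8: process ID.
        "process_id": int.from_bytes(raw[7:9], "big"),
        # Bytes 9-11: incremental counter.
        "counter": int.from_bytes(raw[9:12], "big"),
    }


# Hypothetical sample value used only for illustration.
parts = split_object_id("507f1f77bcf86cd799439011")
for name, value in parts.items():
    print(name, value)
```

Because the timestamp occupies the leading bytes, sorting documents by _id roughly sorts them by creation time.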
Replication:
• Replication is the process of synchronizing data across multiple servers.
• Replication provides redundancy and increases data availability by keeping multiple copies of data on different database servers.

Command         Description
rs.initiate()   Initiates a new replica set
rs.conf()       Checks the replica set configuration
rs.status()     Checks the status of a replica set
rs.add()        Adds members to a replica set

Replica Set Features


• A cluster of N nodes
• Any one node can be primary
• All write operations go to the primary
• Automatic failover
• Automatic recovery
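
The replica set features listed above can be mimicked in a toy model. This is a sketch under strong simplifications: real replica sets elect a primary by voting, whereas here the first surviving member is promoted. The class `ReplicaSet` and its methods are hypothetical and are not the MongoDB API.

```python
class ReplicaSet:
    """Toy replica set: one primary, writes copied to live secondaries."""

    def __init__(self, names):
        self.data = {name: [] for name in names}
        self.up = {name: True for name in names}
        self.primary = names[0]

    def write(self, record):
        # All write operations go to the primary first.
        self.data[self.primary].append(record)
        # The write is then copied to every secondary that is up.
        for name in self.data:
            if name != self.primary and self.up[name]:
                self.data[name].append(record)

    def fail(self, name):
        self.up[name] = False
        if name == self.primary:
            # Automatic failover (simplified): promote the first live member.
            survivors = [n for n in self.data if self.up[n]]
            if not survivors:
                raise RuntimeError("no member left to become primary")
            self.primary = survivors[0]

    def recover(self, name):
        # Automatic recovery (simplified): copy the primary's data back.
        self.data[name] = list(self.data[self.primary])
        self.up[name] = True


rs = ReplicaSet(["node-a", "node-b", "node-c"])
rs.write("order-1")
rs.fail("node-a")      # the primary goes down
print(rs.primary)      # node-b takes over
rs.write("order-2")
rs.recover("node-a")   # node-a rejoins as a secondary and catches up
print(rs.data["node-a"])
```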
Auto-Sharding:
• Sharding is the process of storing data records across multiple machines. It is MongoDB's approach to meeting the demands of data growth.

• MongoDB uses sharding to support deployments with very large data sets and high-throughput operations.

• The query router (mongos in MongoDB) provides an interface between client applications and the sharded cluster.

• MongoDB uses the shard key to distribute the collection's documents across shards.
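
The role of a shard key can be sketched with a simple hash-based placement rule. This is an illustration only: it assumes a fixed number of shards and a modulo on a hash, which is not MongoDB's actual chunk-based mechanism. The functions `shard_for` and `distribute` and the field `user_id` are hypothetical.

```python
import hashlib


def shard_for(shard_key_value, shard_count):
    """Map a shard key value to a shard index using a stable hash."""
    digest = hashlib.md5(str(shard_key_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def distribute(documents, shard_key, shard_count):
    """Place each document on the shard chosen by its shard key value."""
    shards = [[] for _ in range(shard_count)]
    for document in documents:
        shards[shard_for(document[shard_key], shard_count)].append(document)
    return shards


documents = [{"user_id": n, "likes": n * 10} for n in range(12)]
shards = distribute(documents, "user_id", 3)
for index, shard in enumerate(shards):
    print(index, [doc["user_id"] for doc in shard])
```

Since the placement depends only on the shard key value, a router can compute which shard holds a document without asking every shard.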
CASSANDRA DATABASES

• Cassandra is a column-oriented database.
• Cassandra is scalable, consistent, and fault-tolerant.
• Cassandra's distribution design is based on Amazon's Dynamo and its data model on Google's Bigtable.
• Cassandra was created at Facebook. It is very different from relational database management systems.
• Cassandra follows a Dynamo-style replication model with no single point of failure, but adds a more powerful "column family" data model.
• Cassandra is used by some of the biggest companies, such as Facebook, Twitter, Cisco, Rackspace, eBay, Netflix, and more.
Components of Cassandra

Node: Place where data is stored for processing
Data Center: Collection of many related nodes
Cluster: Collection of many data centers
Commit log: Used for crash recovery; each write operation is written to the commit log
Mem-table: Memory-resident data structure; after data is written to the commit log, it is written to the mem-table temporarily
SSTable: When the mem-table reaches a certain threshold, data is flushed into an SSTable disk file
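
The write path through commit log, mem-table and SSTable can be sketched as follows. This is a simplified in-memory model, not Cassandra's implementation: the flush threshold counts keys rather than bytes, SSTables are plain sorted dictionaries, and the `read` method is an added assumption to make the example complete. The class name `WritePathNode` is hypothetical.

```python
class WritePathNode:
    """Toy model of one node's write path."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []  # append-only record used for crash recovery
        self.memtable = {}    # memory-resident structure for recent writes
        self.sstables = []    # flushed, immutable "disk files"
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log the write first
        self.memtable[key] = value            # 2. then update the mem-table
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. flush at the threshold

    def flush(self):
        # Write the mem-table out as a sorted table and start a fresh one.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        # Assumption: check the mem-table first, then newest SSTable first.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            if key in sstable:
                return sstable[key]
        return None


node = WritePathNode(memtable_limit=3)
for key, value in [("b", 2), ("a", 1), ("c", 3), ("d", 4)]:
    node.write(key, value)

print(len(node.commit_log))  # every write was logged
print(node.sstables)         # first three keys were flushed, sorted
print(node.memtable)         # the latest write is still in memory
```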
Cassandra Query Language (CQL)
Data Replication in Cassandra

In Cassandra, one or more of the nodes in a cluster act as replicas for a given piece of data. If it is detected that some of the nodes responded with an out-of-date value, Cassandra will return the most recent value to the client.
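
Choosing the most recent value among replica responses can be sketched by comparing write timestamps. This is a minimal illustration, assuming each replica returns its value together with the time it was written; the function `read_latest` and the replica names are hypothetical.

```python
def read_latest(responses):
    """Pick the newest value and list the replicas that were out of date.

    responses maps a replica name to a (value, write_timestamp) pair.
    """
    latest_value, latest_timestamp = max(responses.values(),
                                         key=lambda pair: pair[1])
    stale = sorted(name for name, (_, timestamp) in responses.items()
                   if timestamp < latest_timestamp)
    return latest_value, stale


responses = {
    "node-1": ("old address", 10),
    "node-2": ("new address", 20),
    "node-3": ("new address", 20),
}
value, stale = read_latest(responses)
print(value)  # the value with the highest write timestamp
print(stale)  # replicas that answered with an out-of-date value
```

The list of stale replicas shows which nodes would need to be brought up to date afterwards.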

Cassandra uses the Gossip Protocol in the background to allow the nodes to communicate with each other and detect any faulty nodes in the cluster.
