
Presentation on
NoSQL
“Towards the End of RDBMS?”

Database Categories
OLTP Databases
NoSQL Databases

What is RDBMS?
 RDBMS: Relational Database Management System.
 Relation: a relation is a two-dimensional table with the following features:
 Name
 Attributes
 Tuples
Issues with RDBMS: Scalability
 Scaling up becomes an issue when the dataset is just too big, e.g. Big Data.
 RDBMSs were not designed to be distributed.
 This leads to multi-node database solutions, known as ‘horizontal scaling’.
 Different approaches include:
 Master-slave
 Sharding
Scaling RDBMS

Master-Slave
 All writes are written to the master; all reads are performed against the replicated slave databases.
 Critical reads may be incorrect, as writes may not have been propagated down yet.
 Large data sets can pose problems, as the master needs to duplicate data to the slaves.

Sharding
 Scales well for both reads and writes.
 Not transparent; the application needs to be partition-aware.
 Can no longer have relationships or joins across partitions.
 Loss of referential integrity across shards.
What is NoSQL?
 Stands for Not Only SQL. The term was coined by Carlo Strozzi and later redefined by Eric Evans.
 A class of non-relational data storage systems.
 NoSQL systems do not require a fixed table schema, nor do they use the concept of joins.
 They relax one or more of the ACID properties (Atomicity, Consistency, Isolation, Durability), guided by the CAP theorem.
Why NoSQL?
Need of NoSQL
 Explosion of social media sites (Facebook, Twitter, Google, etc.) with large data needs, where sharding a relational database is a problem.
 Rise of cloud-based solutions such as Amazon S3 (Simple Storage Service).
 Just as development shifted to dynamically-typed languages (Ruby/Groovy), data is shifting to dynamic typing with frequent schema changes.
 Expansion of the open-source community.
 A NoSQL solution is more acceptable to a client now than it was a year ago.

Benefits of NoSQL
 Handles large volumes of structured, semi-structured, and unstructured data.
 Supports agile development, quick changes, and frequent code pushes.
 Object-oriented programming that is easy to use and flexible.
 Horizontal scaling instead of expensive hardware.
NoSQL Types

NoSQL databases are classified into four types:
• Key-value pair based
• Column based
• Document based
• Graph based
Key Value Pair Based

• Designed around the dictionary: a dictionary contains a collection of records, each having fields containing data.
• Records are stored and retrieved using a key that uniquely identifies the record and is used to quickly find the data within the database.
• Examples: Redis, Riak, Oracle NoSQL Database, etc.
• We use it for storing session information, user profiles and preferences, and shopping cart data.
• We would avoid it when we need to query data by relationships between entities.
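To make the access pattern concrete, here is a minimal in-memory sketch in Java (an illustrative stand-in, not any particular product's API): the store supports only put/get/delete by key, which is exactly why relationship queries are a poor fit.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal key-value store sketch: every operation is addressed by a unique key.
public class SessionStore {
    private final Map<String, String> store = new ConcurrentHashMap<>();

    public void put(String key, String value) { store.put(key, value); } // create/update
    public String get(String key) { return store.get(key); }             // lookup by key only
    public void delete(String key) { store.remove(key); }                // remove

    public static void main(String[] args) {
        SessionStore sessions = new SessionStore();
        sessions.put("session:42", "{\"user\":\"alice\",\"cart\":[\"sku-1\",\"sku-2\"]}");
        System.out.println(sessions.get("session:42"));
        // There is no way to ask "which sessions contain sku-1?" without scanning
        // every value -- relationship queries need a different data model.
    }
}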
Column Based

 Stores data as column families containing rows that have many columns associated with a row key. Each row can have different columns.
 Column families are groups of related data that is accessed together.
 Examples: Cassandra, HBase, Hypertable, and Amazon DynamoDB.
 We use it for content management systems, blogging platforms, and log aggregation.
 We would avoid it for systems that are in early development, where query patterns are likely to change.
Document Based

 The database stores and retrieves documents, keeping each document in the value part of a key-value store.
 Documents are self-describing, hierarchical tree data structures consisting of maps, collections, and scalar values.
 Examples: Lotus Notes, MongoDB, CouchDB, OrientDB, RavenDB.
 We use it for content management systems, blogging platforms, web analytics, real-time analytics, and e-commerce applications.
 We would avoid it for systems that need complex transactions spanning multiple operations.
Graph Based

 Stores entities, and the relationships between them, as the nodes and edges of a graph respectively. Entities have properties.
 Traversing the relationships is very fast: the relationship between nodes is not calculated at query time but is persisted as a relationship.
 Examples: Neo4j, InfiniteGraph, OrientDB, FlockDB.
 Well suited for connected data, such as social networks, spatial data, and routing information for goods and supply.
SQL Databases (ACID Property)
CAP Theorem

 According to Eric Brewer, a distributed system has three properties:
 Consistency
 Availability
 Partition tolerance
 We can have at most two of these three properties for any shared-data system.
 To scale out, we have to partition, which leaves a choice between consistency and availability. (In almost all cases, we would choose availability over consistency.)
 Everyone who wants to develop big applications builds them with CAP in mind: Google, Yahoo, Facebook, Amazon, eBay, etc.
Advantages of NoSQL
 Cheap and easy to implement (open source).
 Data is replicated to multiple nodes (therefore identical and fault-tolerant) and can be partitioned.
 When data is written, the latest version is on at least one node and is then replicated to the other nodes.
 No single point of failure.
 Easy to distribute.
 Does not require a schema.
What is not provided by NoSQL

 Joins
 Group by
 ACID transactions
 SQL
 Integration with applications that are based on SQL
Where to Use NoSQL
 NoSQL data storage systems make sense for applications that process very large semi-structured data, like log analysis, social networking feeds, and time-based data.
 To improve programmer productivity by using a database that better matches an application's needs.
 To improve data access performance via some combination of handling larger data volumes, reducing latency, and improving throughput.
NoSQL Vs. SQL Comparison
Before Selection and Implementation of NoSQL
NoSQL Landscape
Conclusion
 All the choices provided by the rise of NoSQL databases do not mean the demise of RDBMSs; relational databases remain a powerful tool.
 We are entering an era of polyglot persistence, a technique that uses different data storage technologies to handle varying data storage needs. It can apply across an enterprise or within an individual application.

Presentation on
MongoDB
“Large-Scale Document Data Management”
MongoDB Overview
New World
What is MongoDB?
Key MongoDB Features
MongoDB Features
MongoDB High Level Architecture
MongoDB Licensing
MongoDB Enterprise
 MongoDB Enterprise is the commercial edition of MongoDB that provides enterprise-grade capabilities.
 It includes advanced security features, management tools, software integrations, and certifications.
 These value-added capabilities are not included in the open-source edition of MongoDB.
MongoDB Enterprise Includes
Use Cases
Few MongoDB Clients

 MetLife uses MongoDB for “The Wall”, an innovative customer service application that provides a 360-degree, consolidated view of MetLife customers, including policy details and transactions across lines of business.
 eBay has a number of projects running on MongoDB for search suggestions, metadata storage, cloud management, and merchandizing categorization.
 MongoDB is the repository that powers MTV Networks’ next-generation CMS, which is used to manage and distribute content for all of MTV Networks’ major websites.
 MongoDB is used for back-end storage on the SourceForge front pages, project pages, and download pages for all projects.
 Craigslist uses MongoDB to archive billions of records.
Few MongoDB Clients
 ADP uses MongoDB for its high performance, scalability, and reliability, and for its ability to preserve the data manipulation capabilities of traditional relational databases.
 CNN Turk uses MongoDB for its infrastructure and content management system, including tv.cnnturk.com.
 Foursquare uses MongoDB to store venues and user ‘check-ins’ into venues, sharing the data over more than 25 machines on Amazon EC2.
 Justin.tv is the easy, fun, and fast way to share live video online. MongoDB powers Justin.tv’s internal analytics tools for virality, user retention, and general usage stats that out-of-the-box solutions can’t provide.
 Ibibo (“I build, I bond”) is a social network using MongoDB for its dashboard feeds. Each feed is represented as a single document containing an average of 1,000 entries; the site currently stores over two million of these documents in MongoDB.
Financial Services

 Risk Analytics and Reporting

 Reference Data Management

 Market Data Management

 Portfolio Management

 Order Capture

 Time Series Data


Government

 Surveillance Data Aggregation

 Crime Data Management and Analytics

 Citizen Engagement Platform

 Program Data Management

 Healthcare Record Management


Health Care

 360-Degree Patient View

 Population Management for At-Risk Demographics

 Lab Data Management and Analytics

 Mobile Apps for Doctors and Nurses

 Electronic Healthcare Records (EHR)


Media and Entertainment

 Content Management and Delivery

 User Data Management

 Digital Asset Management

 Mobile and Social Apps

 Content Archiving
Retail

 Rich Product Catalogs

 Customer Data Management

 New Services

 Digital Coupons

 Real-Time Price Optimization


Telecommunication

 Consumer Cloud

 Product Catalog

 Customer Shape Improvement

 Machine-to-Machine (M2M) Platform

 Real-Time Network Analysis and Optimization


MongoDB Tools
MongoDB Package Components

 mongod
 mongos
 mongo
 mongod.exe
 mongos.exe
 mongodump
 mongorestore
 bsondump
 mongooplog
 mongoimport
 mongoexport
 mongostat
 mongotop
 mongosniff
 mongoperf
 mongofiles
MongoDB Package Components (Tools)
MongoDB Database

 mongod is the primary daemon process for the MongoDB system. It handles data requests, manages data format, and performs background management operations.
 A database is a physical container for collections.
 Each database gets its own set of files on the file system.
 A single MongoDB server typically has multiple databases.


MongoDB Collection

 A collection is a group of MongoDB documents.
 It is the equivalent of an RDBMS table.
 A collection exists within a single database.
 Collections do not enforce a schema.
 Documents within a collection can have different fields.
 Typically, all documents in a collection have a similar or related purpose.
MongoDB Document

 A document is a set of key-value pairs.
 Documents have a dynamic schema.


MongoDB Sample Document
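For illustration, a representative document (the field names and values are invented), inserted and read back with the MongoDB Java driver; a minimal sketch assuming a local mongod on the default port:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.Arrays;

import static com.mongodb.client.model.Filters.eq;

public class SampleDocument {
    public static void main(String[] args) {
        // Assumes a local mongod listening on the default port 27017.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> posts =
                    client.getDatabase("blog").getCollection("posts");

            // No schema declared up front: the document describes itself.
            Document post = new Document("title", "Intro to NoSQL")
                    .append("author", "alice")
                    .append("tags", Arrays.asList("nosql", "mongodb"))
                    .append("comments", 12);
            posts.insertOne(post);

            // Query by any field; MongoDB adds a unique _id automatically.
            Document found = posts.find(eq("author", "alice")).first();
            System.out.println(found.toJson());
        }
    }
}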
RDBMS Terminology with MongoDB
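The commonly documented mapping between the two vocabularies is:

RDBMS            MongoDB
Database         Database
Table            Collection
Row              Document
Column           Field
Table join       Embedded documents / references
Primary key      _id field (provided automatically)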
Introduction to JSON and BSON
JSON
Presentation on
HBase
“Large-Scale Data Management”
What is HBase?

 The Hadoop database: a distributed, scalable, big data store.
 An open-source, versioned, non-relational database.
 Random, real-time read/write access to your Big Data.
 Hosts very large tables -- billions of rows by millions of columns -- atop clusters of commodity hardware.
HBase: Part of Hadoop’s Ecosystem
 HBase is built on top of HDFS.
 HBase files are internally stored in HDFS.
HBase vs. HDFS
 Both are distributed systems that scale to hundreds or thousands of nodes.
 HDFS is good for batch processing (scans over big files), but:
 Not good for record lookup
 Not good for incremental addition of small batches
 Not good for updates
HBase vs. HDFS (cont’d)

 HBase is designed to efficiently address the above points:
 Fast record lookup
 Support for record-level insertion
 Support for updates (not in place)
 HBase updates are done by creating new versions of values.
HBase vs. HDFS (cont’d)

 If your application has neither random reads nor random writes, stick to HDFS.
HBase Data Model
 HBase is based on Google’s Bigtable model.
 Data is stored as key-value pairs.
HBase Logical View
HBase: Keys and Column Families

 Each record is divided into column families (e.g., one named “contents” and one named “anchor”).
 Each row has a key.
 Each column family consists of one or more columns (e.g., a column named “apache.com”).

Key
 Byte array
 Serves as the primary key for the table
 Indexed for fast lookup

Column Family
 Has a name (string)
 Contains one or more related columns

Column
 Belongs to one column family
 Included inside the row as familyName:columnName
HBase: Keys and Column Families (cont’d)

Version Number
 A version number exists for each row
 Unique within each key
 Defaults to the system timestamp
 Data type is Long

Value (Cell)
 Byte array
Notes on Data Model

 An HBase schema consists of several tables.
 Each table consists of a set of column families.
 Columns are not part of the schema; HBase has dynamic columns.
 Because column names are encoded inside the cells, different cells can have different columns.
 (Example: a “Roles” column family can have different columns in different cells.)
Notes on Data Model (cont’d)

 The version number can be user-supplied, and does not even have to be inserted in increasing order.
 Version numbers are unique within each key.
 Tables can be very sparse; many cells are empty.
 Keys are indexed as the primary key.
HBase Physical Model

 Each column family is stored in a separate file (called an HTable).
 Key and version numbers are replicated with each column family.
 Empty cells are not stored.
 HBase maintains a multi-level index on values: <key, column family, column name, timestamp>.
Column Families

 Different sets of columns may have different properties and access patterns.
 Configurable per column family:
 Compression (none, gzip, LZO)
 Version retention policies
 Cache priority
 Column families are stored separately on disk: access one without wasting I/O on the other.
HBase Regions

 Each HTable (column family) is partitioned horizontally into regions.
 Regions are the counterpart to HDFS blocks.
HBase Architecture

 Three major components:
 The HBase master (one master)
 The HRegionServers (many region servers)
 The HBase client
HBase Components
 Region
 A subset of a table’s rows, like horizontal range partitioning
 Partitioned automatically
 RegionServer (many slaves)
 Manages data regions
 Serves data for reads and writes (using a log)
 Master
 Responsible for coordinating the slaves
 Assigns regions, detects failures
 Admin functions
Big Picture
ZooKeeper
 HBase depends on ZooKeeper.
 By default, HBase manages the ZooKeeper instance (e.g., starts and stops it).
 The HMaster and HRegionServers register themselves with ZooKeeper.
Creating a Table

// Create a table with two column families through the admin interface.
HBaseAdmin admin = new HBaseAdmin(config);

// Describe the column families (this API version uses a trailing ':').
HColumnDescriptor[] column = new HColumnDescriptor[2];
column[0] = new HColumnDescriptor("columnFamily1:");
column[1] = new HColumnDescriptor("columnFamily2:");

// Describe the table and attach the column families.
HTableDescriptor desc = new HTableDescriptor(Bytes.toBytes("MyTable"));
desc.addFamily(column[0]);
desc.addFamily(column[1]);

admin.createTable(desc);
Operations On Regions: Get()

 Given a key, return the corresponding record.
 For each value, return the highest version.
 You can control the number of versions you want.
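A sketch of the client call, in the same classic HBase API as the table-creation snippet (the config object comes from that snippet; the row and column names are taken from the example table below):

// Look up one row by key, restricted to one column, keeping up to 3 versions.
HTable table = new HTable(config, "MyTable");

Get get = new Get(Bytes.toBytes("com.apache.www"));  // row key
get.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("apache.com"));
get.setMaxVersions(3);                               // ask for up to 3 versions

Result result = table.get(get);
byte[] value = result.getValue(Bytes.toBytes("anchor"), Bytes.toBytes("apache.com"));
System.out.println(Bytes.toString(value));           // highest version first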


Operations On Regions: Scan()

 Scan() iterates over the rows of a region; results can be restricted by column family, column, timestamp, or row-key range.

Example table (two rows with versioned “anchor:” columns):

Row key            Timestamp   Column “anchor:”      Value
“com.apache.www”   t12
                   t11
                   t10         anchor:apache.com     “APACHE”
“com.cnn.www”      t9          anchor:cnnsi.com      “CNN”
                   t8          anchor:my.look.ca     “CNN.com”
                   t6
                   t5
                   t3

Get() -- select value from table where key=‘com.apache.www’ AND label=‘anchor:apache.com’:
returns “APACHE” (the highest version, t10).

Scan() -- select value from table where anchor=‘cnnsi.com’:
returns “CNN” (row “com.cnn.www”, version t9).
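A sketch of the corresponding scan, reusing the table handle from the Get() sketch above:

// Scan the "anchor" column family and report rows that link to cnnsi.com.
Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("anchor"));

ResultScanner scanner = table.getScanner(scan);
try {
    for (Result row : scanner) {
        byte[] v = row.getValue(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"));
        if (v != null) {
            System.out.println(Bytes.toString(row.getRow()) + " -> " + Bytes.toString(v));
        }
    }
} finally {
    scanner.close();  // always release the scanner
}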
Operations On Regions: Put()

 Insert a new record (with a new key), or insert a record for an existing key.
 The version number can be implicit (the timestamp) or explicit.
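A sketch in the same API vintage, reusing the table handle from the Get() sketch (the row key and values here are hypothetical):

// Implicit version: the cell is stamped with the current timestamp.
Put put = new Put(Bytes.toBytes("com.example.www"));            // hypothetical row key
put.add(Bytes.toBytes("anchor"), Bytes.toBytes("example.com"),
        Bytes.toBytes("EXAMPLE"));
table.put(put);

// Explicit version: supply the version number (timestamp) yourself.
Put versioned = new Put(Bytes.toBytes("com.example.www"));
versioned.add(Bytes.toBytes("anchor"), Bytes.toBytes("example.com"),
        42L,                                                    // explicit version
        Bytes.toBytes("EXAMPLE-v42"));
table.put(versioned);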


Operations On Regions: Delete()

 Marks table cells as deleted.
 Works at multiple levels:
 Can mark an entire column family as deleted
 Can mark all column families of a given row as deleted
• All operations are logged by the RegionServers.
• The log is flushed periodically.
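A sketch in the same API vintage (again with a hypothetical row key):

// Mark the whole "anchor" family of one row as deleted. A tombstone is written;
// the data is physically removed later, during compaction.
Delete del = new Delete(Bytes.toBytes("com.example.www"));
del.deleteFamily(Bytes.toBytes("anchor"));
table.delete(del);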
HBase: Joins

 HBase does not support joins.
 Joins can be done in the application layer, using scan() and get() operations.
Altering a Table

 Disable the table before changing the schema.
Logging Operations
HBase Deployment

 Master node
 Slave nodes
HBase vs. HDFS
HBase vs. RDBMS
When to use HBase
Thank you
Presentation on
Cassandra
“Handle massive transactional data and never go down”
History of Cassandra

Topics
 Cluster
 Architecture
 Schema
 Partitioning
 Memtables
 Compaction
 SSTables
 Replication
 Commit Log
 Gossip
 Hints
 Anti-Entropy
What is Cassandra?

A distributed database:
✓ Individual DBs (nodes)
✓ Working in a cluster
✓ Nothing is shared
Why Cassandra?

 It’s fast (low latency).
 It’s always on.
 It’s hugely scalable (high throughput).
 It is natively multi-data center, with distributed data.
 Operational simplicity: one cluster can span sites (e.g., San Francisco, New York, Stockholm).
 It has a flexible data model:
 Tables, wide rows, partitioned and distributed
 Data: blobs (documents, files, images), collections (sets, lists, maps), and UDTs (user-defined types)
 Accessed with CQL, a syntax familiar from SQL
Why Use Cassandra?

 Scaling writes and reads.
 Data originating from multiple locations.
 A scaling-out strategy for high-velocity data.
 Storing all types of data.
 Writing data anywhere and everywhere.
 Keeping business online and serving customers: no downtime.
 Retaining data for the long term.
 Delivering fast response times on voluminous data.
How Does the Database Work in Cassandra? The Basics

 It’s a cluster of nodes.
 Each Cassandra node is:
 A fully functional database
 Very fast (low latency)
 Backed by in-memory and/or persistent storage (disk or SSD)
 One machine or virtual machine
Why is Cassandra so fast?

 Low-latency nodes and a distributed workload.
 Writes:
 Durable append to a log file (fast)
 No database file read or lock needed
 Reads:
 Get data from memory first
 Optimized storage-layer I/O
 No locking or two-phase commit
Multi-Data Center Deployment

 A Cassandra cluster:
 Nodes in a peer-to-peer cluster
 No single point of failure
 Built-in data replication, so data is always available (100% uptime)
 Works across data centers
 Failure avoidance
 Example topology: data centers in San Francisco, New York, the UK, and on Amazon (cloud).
 If a data center goes offline, your data is always available from the other data centers…
 …and the cluster recovers automatically when the data center returns.
Cassandra Cluster Architecture

 Nodes form a ring, each owning a range of tokens.
 Each node is a box or VM (technically, a JVM).
 Each node has the same Cassandra database functionality.
 System/hardware failures happen and are planned for.
 Snitch: knows the topology (data center and rack).
 Gossip: shares the state of each node.
Transaction Load Balancing

 The application’s driver manages a connection pool.
 The driver has load-balancing policies that can be applied.
 Each transaction gets a different connection point in the cluster.
 Asynchronous operations are faster.
Data Partitioning

 On an insert with key=‘x’, the driver hashes the key to a token (e.g., hash(key) => token(43)), which determines the owning node; see the sketch below.
 The driver selects the coordinator node per operation via load balancing.
 Token ranges are partitioned across the nodes.
 Random distribution (Murmur3) means no hot spots.
 Each entire row lives on one node.
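An illustrative sketch of the ring idea (Cassandra uses Murmur3; the stand-in hash, token range, and node names here are invented for clarity):

import java.util.TreeMap;

// Hash the partition key to a token, then walk the ring to the first node
// whose token is >= the key's token, wrapping around at the end of the ring.
public class TokenRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();  // token -> node

    public void addNode(long token, String node) { ring.put(token, node); }

    public String ownerOf(String partitionKey) {
        long token = Math.floorMod((long) partitionKey.hashCode(), 100L); // stand-in hash
        Long owner = ring.ceilingKey(token);
        return ring.get(owner != null ? owner : ring.firstKey());         // wrap around
    }

    public static void main(String[] args) {
        TokenRing ring = new TokenRing();
        for (long t = 10; t <= 80; t += 10) ring.addNode(t, "node-" + t);
        System.out.println(ring.ownerOf("x"));  // prints the node owning key 'x'
    }
}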
Replication

 Replication factor (RF) = the number of copies of each row (RF = 3 in the diagrams).
 All replication operations happen in parallel.
 “Eventual” consistency here means micro- or milliseconds.
 There is no master or primary node.
 Each replica node acknowledges the operation.
Multi-Data Center Replication

 The same insert is replicated within each data center, with a replication factor configured per data center (e.g., RF = 3 in both).
Consistency

 Consistency Level (CL) = the number of nodes that must acknowledge an operation.
 Read and write consistency are set per operation.
 CL(write) + CL(read) > RF ➔ strong consistency.
 If needed, tune consistency for performance.
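Worked example: with RF = 3, writing at QUORUM (2 nodes) and reading at QUORUM (2 nodes) gives 2 + 2 = 4 > 3, so every read overlaps at least one replica that acknowledged the latest write. Writing and reading at ONE gives 1 + 1 = 2 ≤ 3, which is faster but may return stale data.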
Netflix Replication Experiment
Slow Node Anti-Entropy: Hints

 If a node (say, node 60) is slow enough to time out operations, or down for a short time, then:
 Another replica (say, node 80) holds the operations missed by node 60; the held operations are called hints.
 When node 60 becomes responsive again, node 80 replays the hints as operations until node 60 is synchronized.
 Read repair is a complementary anti-entropy mechanism.
What If I Need More Nodes?

 The data set size is growing.
 You need more TPS due to client application demand.
 Hardware limits are being reached.
 You are looking for lower latency.
 You are moving some tables in-memory.
Cluster Expansion: Add Nodes

 Introduce new nodes to the cluster.
 The new nodes do not yet own any token ranges (data).
 The cluster operates as normal.
Cluster Expansion: Rebalance

 Rebalance = redistribution of token ranges around the cluster.
 Data is streamed in small chunks to synchronize the cluster (see “vnodes”):
 Minimizes streaming
 Distributes the rebalance workload
 New nodes begin taking operations.
 Cleanup reclaims space on nodes dedicated to tokens they no longer own.
Write Path

INSERT INTO email_users (domain, user, username)
VALUES ('@datastax.com', 'rreffner', 'rreffner@datastax.com');

 A write is appended to the commit log (fsynched to disk) and applied to an in-memory memtable.
 Memtables are flushed to SSTables on disk or SSD during write load; SSTables are immutable.
 Compaction is a background process that:
 Merges SSTables
 Evicts tombstones
 Rebuilds indexes
Read Path

SELECT * FROM email_users
WHERE domain = '@datastax.com'
  AND user = 'rreffner@datastax.com';

 A read consults, in order: the row cache, the memtables (which can act as in-memory tables), then the SSTables on disk or SSD.
 The bloom filter, key cache, and OS cache keep reads from touching SSTables that cannot contain the key.
Data Modeling Best Practices
 Start with your data access patterns
 What does your app do?
 What are your query patterns?

 Optimize for
 Fast writes and reads at the correct consistency
 Consider the results set, order, group, filtering
 Use TTL to manage data aging
 1 Query = 1 Table (see the sketch after this list)
 Each SELECT should target a single row (1 seek)
 Range queries should traverse a single row (efficient)
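A minimal sketch of the “1 query = 1 table” pattern with the DataStax Java driver (3.x-style API; the keyspace name and its prior creation are assumptions for illustration):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class QueryFirstModel {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            // Assumes the keyspace 'demo' already exists.
            // The table is designed for exactly one query: "all users of a domain".
            // 'domain' is the partition key, so that query is a single seek.
            session.execute(
                "CREATE TABLE IF NOT EXISTS demo.email_users (" +
                "  domain text, user text, username text," +
                "  PRIMARY KEY (domain, user))");

            session.execute(
                "INSERT INTO demo.email_users (domain, user, username) " +
                "VALUES ('@datastax.com', 'rreffner', 'rreffner@datastax.com')");

            // The one query this table exists to answer:
            ResultSet rs = session.execute(
                "SELECT * FROM demo.email_users WHERE domain = '@datastax.com'");
            for (Row row : rs) {
                System.out.println(row.getString("user"));
            }
        }
    }
}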
Denormalization Is Expected
 Remember: You’re not optimizing for storage efficiency

 You are optimizing for performance

 Do this:
 Forget what you’ve learned about 3rd normal form
 Repeat after me… “slow is down”, “storage is cheap”
 Denormalize and do parallel writes!

 Don’t do this:
 Client side joins
 Reads before writes
Where to Use Cassandra?

 Use it if your application has:
 Big Data (billions of records, rows, and columns)
 Very high-velocity random reads and writes
 Flexible sparse / wide-column requirements
 A need for low latency
 Use cases:
 E-commerce inventory caches
 Time series / event data
 Feed-based activity streams
Where NOT to Use Cassandra?

 Don’t use it if your application needs:
 Secondary indexes
 Relational data
 Transactions (rollback, commit)
 Primary and financial records
 Stringent security and authorization on data
 Dynamic queries on columns
 Searching column data
Thank you!
