19sw-Ds&a - 2

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 22

Data Science and Analytics

SQL Vs NoSQL Databases

Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
Issues with scaling up
• Best way to provide ACID and rich query model is to have the dataset on a
single machine
• Limits to scaling up (or vertical scaling: make a “single” machine more
powerful) → dataset is just too big!
• Scaling out (or horizontal scaling: adding more smaller/cheaper servers) is a
better choice
• Different approaches for horizontal scaling (multi-node database):
• Master/Slave
• Sharding (partitioning)

Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
Scaling out RDBMS: Master/Slave
• Master/Slave
• All writes are written to the master

• All reads performed against the replicated slave databases

• Critical reads may be incorrect as writes may not have been propagated down

• Large datasets can pose problems as master needs to duplicate data to slaves

Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
Scaling out RDBMS: Sharding
Sharding is a method of splitting and storing a single logical dataset in multiple databases

• Sharding (Partitioning) – FOR RDBMS


• Scales well for both reads and writes
• Not transparent, application needs to be partition-aware
• Can no longer have relationships/joins across partitions
• Loss of referential integrity across shards

• Sharding (Partitioning) – FOR NoSQL


• Distributes a single logical database system across a cluster of machines
• Uses range-based partitioning to distribute documents based on a specific
shard key
• Automatically balances the data associated with each shard
• Can be turned on and off per collection (table)

Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
Other ways to scale out RDBMS
• Multi-Master replication
• INSERT only, not UPDATES/DELETES
• No JOINs, thereby reducing query time
• This involves de-normalizing data

• In-memory databases

Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
Replica Sets
• Redundancy and Failover
• Zero downtime for upgrades and maintenance

• Master-slave replication
• Strong Consistency
• Delayed Consistency
• Geospatial features

Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
How does NoSQL vary from RDBMS?
• Looser schema definition
• Applications written to deal with specific documents/ data
• Applications aware of the schema definition as opposed to the data
• Designed to handle distributed, large databases
• Trade offs:
• No strong support for ad hoc queries but designed for speed and growth of
database
• Query language through the API
• Relaxation of the ACID properties

Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
Benefits of NoSQL Over SQL
• Elastic Scaling
• RDBMS scale up –bigger load , bigger server
• NO SQL scale out –distribute data across multiple hosts seamlessly
• DBA Specialists
• RDMS require highly trained expert to monitor DB
• NoSQL require less management, automatic repair and simpler data models
• Big Data
• Huge increase in data RDMS: capacity and constraints of data volumes at its
limits
• NoSQL designed for big data

Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
Benefits of NoSQL
• Flexible data models
• Change management to schema for RDMS have to be carefully managed
• NoSQL databases more relaxed in structure of data
• Database schema changes do not have to be managed as one complicated change unit
• Application already written to address an amorphous schema
• Economics
• RDMS rely on expensive proprietary servers to manage data
• No SQL: clusters of cheap commodity servers to manage the data and
transaction volumes
• Cost per gigabyte or transaction/second for NoSQL can be lower than the cost
for a RDBMS

Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
Drawbacks of NoSQL
• Support
• RDBMS vendors provide a high level of support to clients
• Stellar reputation
• NoSQL –are open source projects with startups supporting them
• Reputation not yet established
• Maturity
• RDMS mature product: means stable and dependable
• Also means old no longer cutting edge nor interesting
• NoSQL are still implementing their basic feature set

Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
Drawbacks of NoSQL
• Administration
• RDMS administrator well defined role
• No SQL’s goal: no administrator necessary however NO SQL still requires effort to
maintain
• Lack of Expertise
• Whole workforce of trained and seasoned RDMS developers
• Still recruiting developers to the NoSQL camp
• Analytics and Business Intelligence
• RDMS designed to address this niche
• NoSQL designed to meet the needs of a Web 2.0 application -not designed for ad hoc
query of the data
• Tools are being developed to address this need

Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
SQL’s ACID to NoSQL BASE

Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
CAP Theorem
• Three main properties of a distributed system (sharing data)
• Consistency:
• all copies have same value A
• Availability: C
• reads and writes always succeed P
• Partition-tolerance:
• system properties (consistency and/or availability) hold even when
network failures prevent some machines from communicating with
others

Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
CAP Theorem
• Brewer’s CAP Theorem:
“For any system sharing data, it is “impossible” to guarantee
simultaneously all of these three properties”
• You can have at most two of these three properties for any shared-data system
• Very large systems will “partition” at some point:
• That leaves either C or A to choose from
(traditional DBMS prefers C over A and P )
• In almost all cases, you would choose A over C
(except in specific applications such as order processing)

Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
CAP Theorem

Availability

All client always have the Consistency


same view of the data
Partition
tolerance

Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
CAP Theorem
• Consistency
2 types of consistency:
1. Strong consistency – ACID
(Atomicity, Consistency, Isolation, Durability)
2. Weak consistency – BASE
(Basically Available Soft-state Eventual consistency)

Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
CAP Theorem
• A consistency model determines rules for visibility and apparent order of updates
• Example:
• Row X is replicated on nodes M and N
• Client A writes row X to node N
• Some period of time t elapses
• Client B reads row X from node M
• Does client B see the write from client A?
• Consistency is a continuum with tradeoffs
• For NOSQL, the answer would be: “maybe”
• CAP theorem states:
“strong consistency can't be achieved at the same time as
availability and partition-tolerance”
Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
CAP Theorem
• Eventual consistency
• When no updates occur for a long period of time, eventually all updates will propagate
through the system and all the nodes will be consistent
• Cloud computing
• ACID is hard to achieve, moreover, it is not always required, e.g. for blogs,
status updates, product listings, etc.

Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
CAP Theorem
Each client always can
read and write.
Availability

Consistency

Partition
tolerance

Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
CAP Theorem

Availability

Consistency

Partition A system can continue to


tolerance operate in the presence of
a network partitions

Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
Available, Partition-
Tolerant (AP) Systems
Consistent, Available (CA) Systems achieve "eventual
have trouble with partitions consistency" through
and typically deal with it with replication replication and verification

Consistent, Partition-Tolerant (CP) Systems


have trouble with availability while keeping data
consistent across partitioned nodes

Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
Q&A

Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk22
isma.farah@teacher.muet.edu.pk

You might also like