Professional Documents
Culture Documents
19sw-Ds&a - 2
19sw-Ds&a - 2
19sw-Ds&a - 2
Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
Issues with scaling up
• Best way to provide ACID and rich query model is to have the dataset on a
single machine
• Limits to scaling up (or vertical scaling: make a “single” machine more
powerful) → dataset is just too big!
• Scaling out (or horizontal scaling: adding more smaller/cheaper servers) is a
better choice
• Different approaches for horizontal scaling (multi-node database):
• Master/Slave
• Sharding (partitioning)
Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
Scaling out RDBMS: Master/Slave
• Master/Slave
• All writes are written to the master
• Critical reads may be incorrect as writes may not have been propagated down
• Large datasets can pose problems as master needs to duplicate data to slaves
Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
Scaling out RDBMS: Sharding
Sharding is a method of splitting and storing a single logical dataset in multiple databases
Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
Other ways to scale out RDBMS
• Multi-Master replication
• INSERT only, not UPDATES/DELETES
• No JOINs, thereby reducing query time
• This involves de-normalizing data
• In-memory databases
Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
Replica Sets
• Redundancy and Failover
• Zero downtime for upgrades and maintenance
• Master-slave replication
• Strong Consistency
• Delayed Consistency
• Geospatial features
Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
How does NoSQL vary from RDBMS?
• Looser schema definition
• Applications written to deal with specific documents/ data
• Applications aware of the schema definition as opposed to the data
• Designed to handle distributed, large databases
• Trade offs:
• No strong support for ad hoc queries but designed for speed and growth of
database
• Query language through the API
• Relaxation of the ACID properties
Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
Benefits of NoSQL Over SQL
• Elastic Scaling
• RDBMS scale up –bigger load , bigger server
• NO SQL scale out –distribute data across multiple hosts seamlessly
• DBA Specialists
• RDMS require highly trained expert to monitor DB
• NoSQL require less management, automatic repair and simpler data models
• Big Data
• Huge increase in data RDMS: capacity and constraints of data volumes at its
limits
• NoSQL designed for big data
Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
Benefits of NoSQL
• Flexible data models
• Change management to schema for RDMS have to be carefully managed
• NoSQL databases more relaxed in structure of data
• Database schema changes do not have to be managed as one complicated change unit
• Application already written to address an amorphous schema
• Economics
• RDMS rely on expensive proprietary servers to manage data
• No SQL: clusters of cheap commodity servers to manage the data and
transaction volumes
• Cost per gigabyte or transaction/second for NoSQL can be lower than the cost
for a RDBMS
Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
Drawbacks of NoSQL
• Support
• RDBMS vendors provide a high level of support to clients
• Stellar reputation
• NoSQL –are open source projects with startups supporting them
• Reputation not yet established
• Maturity
• RDMS mature product: means stable and dependable
• Also means old no longer cutting edge nor interesting
• NoSQL are still implementing their basic feature set
Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
Drawbacks of NoSQL
• Administration
• RDMS administrator well defined role
• No SQL’s goal: no administrator necessary however NO SQL still requires effort to
maintain
• Lack of Expertise
• Whole workforce of trained and seasoned RDMS developers
• Still recruiting developers to the NoSQL camp
• Analytics and Business Intelligence
• RDMS designed to address this niche
• NoSQL designed to meet the needs of a Web 2.0 application -not designed for ad hoc
query of the data
• Tools are being developed to address this need
Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
SQL’s ACID to NoSQL BASE
Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
CAP Theorem
• Three main properties of a distributed system (sharing data)
• Consistency:
• all copies have same value A
• Availability: C
• reads and writes always succeed P
• Partition-tolerance:
• system properties (consistency and/or availability) hold even when
network failures prevent some machines from communicating with
others
Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
CAP Theorem
• Brewer’s CAP Theorem:
“For any system sharing data, it is “impossible” to guarantee
simultaneously all of these three properties”
• You can have at most two of these three properties for any shared-data system
• Very large systems will “partition” at some point:
• That leaves either C or A to choose from
(traditional DBMS prefers C over A and P )
• In almost all cases, you would choose A over C
(except in specific applications such as order processing)
Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
CAP Theorem
Availability
Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
CAP Theorem
• Consistency
2 types of consistency:
1. Strong consistency – ACID
(Atomicity, Consistency, Isolation, Durability)
2. Weak consistency – BASE
(Basically Available Soft-state Eventual consistency)
Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
CAP Theorem
• A consistency model determines rules for visibility and apparent order of updates
• Example:
• Row X is replicated on nodes M and N
• Client A writes row X to node N
• Some period of time t elapses
• Client B reads row X from node M
• Does client B see the write from client A?
• Consistency is a continuum with tradeoffs
• For NOSQL, the answer would be: “maybe”
• CAP theorem states:
“strong consistency can't be achieved at the same time as
availability and partition-tolerance”
Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
CAP Theorem
• Eventual consistency
• When no updates occur for a long period of time, eventually all updates will propagate
through the system and all the nodes will be consistent
• Cloud computing
• ACID is hard to achieve, moreover, it is not always required, e.g. for blogs,
status updates, product listings, etc.
Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
CAP Theorem
Each client always can
read and write.
Availability
Consistency
Partition
tolerance
Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
CAP Theorem
Availability
Consistency
Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
Available, Partition-
Tolerant (AP) Systems
Consistent, Available (CA) Systems achieve "eventual
have trouble with partitions consistency" through
and typically deal with it with replication replication and verification
Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
Q&A
Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk22
isma.farah@teacher.muet.edu.pk