
NO SQL

PARUL PANDEY
Scaling Up Databases
A question I’m often asked about Heroku is: “How do you scale
the SQL database?” There’s a lot of things I can say about using
caching, sharding, and other techniques to take load off the
database. But the actual answer is: we don’t. SQL databases are
fundamentally non-scalable, and there is no magical pixie dust
that we, or anyone, can sprinkle on them to suddenly make
them scale.
Adam Wiggins, Heroku

Patterson, David; Fox, Armando (2012-07-11). Engineering Long-Lasting
Software: An Agile Approach Using SaaS and Cloud Computing, Alpha Edition (Kindle Locations
1285-1288). Strawberry Canyon LLC. Kindle Edition.

Data Management Systems: History
• In the last decades RDBMS have been successful in
solving problems related to storing, serving and
processing data.
• RDBMS are adopted for:
– Online transaction processing (OLTP),
– Online analytical processing (OLAP).
• Vendors such as Oracle, Vertica, Teradata, Microsoft and IBM offered solutions based on relational algebra and SQL.
But….
Something Changed!
• Traditionally there were transaction recording
(OLTP) and analytics (OLAP) of the recorded
data.
• Not much was done to understand:
– the reasons behind transactions,
– what factor contributed to business, and
– what factor could drive the customer’s behavior.
• Pursuing such initiatives requires working with a
large amount of varied data.
Something Changed!
• This approach was pioneered by Google, Amazon, Yahoo,
Facebook and LinkedIn.
• They work with different types of data, often semi-structured or unstructured.
• And they have to store, serve and process huge amounts of data.

Evolutions in Data Management
• As part of innovation in data management systems, several new technologies were built:
– 2003 - Google File System,
– 2004 - MapReduce,
– 2006 - BigTable,
– 2007 - Amazon Dynamo,
– 2012 - Google Compute Engine.
• Each solved different use cases and had a different set of
assumptions.
• All these mark the beginning of a different way of thinking
about data management.
Go to hell RDBMS!

HELLO, BIG DATA!

Big Data: Try { Definition }
Big Data means the data is large enough that you have
to think about it in order to gain insights from it
Or
Big Data is data that no longer fits on a single machine

“Big Data is a fundamentally different way of thinking about data and how it’s used to drive business value.”

History of NoSQL
The term NoSQL was coined by Carlo Strozzi in 1998.

NoSQL became popular in 2009, after a conference on the topic was held in Atlanta that year.

What is NoSQL?
• Stands for No-SQL or Not Only SQL?
• A class of non-relational data storage systems
• E.g. BigTable, Dynamo, PNUTS/Sherpa, ...
• Usually do not require a fixed table schema, nor do they use the concept of joins
• Distributed data storage systems
• Most NoSQL offerings relax one or more of the ACID properties (see the CAP theorem slides)
How did we get here?
• Explosion of social media sites (Facebook, Twitter) with
large data needs
• Rise of cloud-based solutions such as Amazon S3 (Simple Storage Service)
• Just as developers moved to dynamically-typed languages (Ruby/Groovy), there was a shift to dynamically-typed data with frequent schema changes
• Open-source community
NoSQL Suitable Scenarios
• NoSQL focuses on the ability to store more data rather than on tracking the relationships within that data.

• NoSQL is therefore a good fit when the nature of the data does not require relationships.

The Benefits of NoSQL
• NoSQL databases are more scalable and provide superior
performance, and their data model addresses several
issues that the relational model is not designed to address:
– Large volumes of structured, semi-structured, and
unstructured data
– Agile sprints, quick iteration, and frequent code pushes
– Object-oriented programming that is easy to use and flexible
– Efficient, scale-out architecture instead of expensive,
monolithic architecture

Advantages of NoSQL
• Scalability in RDBMS

An increase in transaction rate and an increase in availability requirements lead to a decrease in economic advantage.
• Scalability in NoSQL

NoSQL solves the problem through horizontal scalability, i.e., a cluster of commodity systems where the cluster scales as load increases. Highly useful for Big Data environments.
Advantages of NoSQL (Contd.)
Big Data
– Transaction rates have grown out of recognition over the last decade, the
volumes of data that are being stored also have increased massively.
Less Administrative Efforts
– Automatic repair, data distribution, and simpler data models lead to lower
administration and tuning requirements.

Economic Advantage
– RDBMS tends to rely on expensive proprietary servers and storage systems.
– NoSQL uses clusters of cost-effective commodity servers to manage the exploding data and transaction volumes, storing and processing more data at a much lower price point.
– The cost per gigabyte or transaction/second for NoSQL can be many times less than the cost for RDBMS.

Advantages of NoSQL (Contd.)

Flexible data models
– NoSQL key-value stores and document databases allow the application to store virtually any structure it wants in a data element.
Elastic scaling
– Scalability is the ability of a system to increase throughput.
– Done with help of addition of resources to address load increases.
– Scalability can be vertical and horizontal.
– NoSQL provides horizontal scalability.
– Horizontal scalability typically involves adding additional nodes to
serve additional load.
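
Horizontal scaling can be illustrated with a minimal sketch of hash partitioning, in which each key is assigned to one node of the cluster. The node names, the `node_for` and `distribute` helpers, and the use of MD5 with a modulo are assumptions made for illustration, not the method of any particular NoSQL product.

```python
import hashlib

def node_for(key: str, nodes: list[str]) -> str:
    """Pick the node responsible for a key by hashing the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

def distribute(keys: list[str], nodes: list[str]) -> dict[str, list[str]]:
    """Group keys by the node that would store them."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[node_for(key, nodes)].append(key)
    return placement

# Hypothetical workload: 1000 user records spread over 3, then 4 nodes.
keys = [f"user:{i}" for i in range(1000)]
three = distribute(keys, ["node-a", "node-b", "node-c"])
four = distribute(keys, ["node-a", "node-b", "node-c", "node-d"])

for node, stored in four.items():
    print(node, len(stored))
```

Adding a fourth node spreads the same keys over more machines, so each node holds a smaller share. With plain modulo hashing most keys move when the node count changes; real systems often use consistent hashing to limit that movement.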

Advantages of NoSQL (Contd.)

Goodbye DBAs
– NoSQL databases are generally designed from the ground up to
require less management.
– Automatic repair, data distribution, and simpler data models lead to
lower administration and tuning requirements.
– Indirectly and theoretically, Goodbye to DBAs.

Big Data

• Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured.
• Big Data is based on:
– Volume
• Many factors contribute to the increase in data volume.
– Velocity
• Data is streaming in at unprecedented speed and must be
dealt with in a timely manner. 
– Variety
• Data today comes in all types of formats. 

NoSQL Database Types
• Wide Column Store / Column Families:
– These were created to store and process very large amounts of
data distributed over many machines. There are still keys but
they point to multiple columns. The columns are arranged by
column family.
– Wide-column stores such as Cassandra and HBase are optimized for queries over large datasets, and store columns of data together, instead of rows.
– Examples: Hadoop/HBase, Cassandra, Cloudata (modeled on Google's BigTable)

• Key Value / Tuple Store:
– The main idea here is using a hash table where there is a unique
key and a pointer to a particular item of data. The key-value
model is the simplest and easiest to implement. But it is
inefficient when you are only interested in querying or updating
part of a value, among other disadvantages.

– They are the simplest NoSQL databases. Every single item in the database is stored as an attribute name (or "key"), together with its value.
– DynamoDB, LevelDB (Google)
• Document Store:
– The model is basically versioned documents that are collections of other key-
value collections. The semi-structured documents are stored in formats like
JSON. Document databases are essentially the next level of key-value, allowing
nested values associated with each key. Document databases support
querying more efficiently.
– Document databases pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.
– Examples: MongoDB, CouchDB
NoSQL Database Types
• Graph Databases:
– They are used to store information about networks, such as
social connections. Graph stores include Neo4J and
HyperGraphDB.
– Instead of tables of rows and columns and the rigid structure of SQL, a flexible graph model is used which, again, can scale across multiple machines. Many NoSQL databases do not provide a high-level declarative query language like SQL, to avoid overhead in processing. Rather, querying these databases is data-model specific.

– GraphBase, Trinity, BigData
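
As a rough illustration of data-model-specific querying on a graph, the sketch below stores a small social network as an adjacency list and walks it breadth-first. The people, the `graph` dictionary and the `within_hops` function are hypothetical, and real graph databases expose their own traversal APIs or query languages.

```python
from collections import deque

# Hypothetical social graph: each person maps to the people they follow.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": [],
    "erin": ["alice"],
}

def within_hops(graph, start, max_hops):
    """Return everyone reachable from start in at most max_hops edges (BFS)."""
    seen = {start}
    queue = deque([(start, 0)])
    reachable = set()
    while queue:
        person, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(person, []):
            if neighbour not in seen:
                seen.add(neighbour)
                reachable.add(neighbour)
                queue.append((neighbour, hops + 1))
    return reachable

print(sorted(within_hops(graph, "alice", 1)))  # direct connections
print(sorted(within_hops(graph, "alice", 2)))  # connections of connections
```

The query follows edges from one starting node rather than joining tables, which is the access pattern graph stores are built around.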

NoSQL Database Types (Contd.)
• Multimodel Databases:
– FatDB
• Object Databases:
– Starcounter
• Grid & Cloud Database Solutions:
– GigaSpaces

• Multidimensional Databases:
– Global

• Multivalue Databases:
– U2 and OpenInsight
• XML Databases:
– BaseX
Benefits to the IT
• Next-generation databases that are:
– Non-relational,
– Distributed,
– Open-source and
– Horizontally scalable.

Challenges in No SQL

Maturity
– RDBMS systems have been around for a long time.
– For most CIOs, the maturity of the RDBMS is reassuring.

Support
– Most NoSQL systems are open source projects.
– NoSQL companies are often new and may not offer enterprise-level support.
– Timely and competent support can be lacking.

Challenges in No SQL

Analytics and business intelligence


– NoSQL databases offer few facilities for ad-hoc query and analysis.
– Even a simple query requires significant programming expertise.
– Commonly used BI tools do not provide connectivity to NoSQL.

Complexity
– NoSQL does not support Structured Query Language, and therefore query programming needs to be performed manually.

Challenges in NoSQL (Contd.)
Reliability
– Relational databases support Atomicity, Consistency, Isolation and Durability (ACID).
– Non-relational databases often do not provide the same guarantees.

Administration
– NoSQL today requires a lot of skill to install and a lot of effort to
maintain.

Expertise
– It's far easier to find experienced RDBMS programmers or
administrators than a NoSQL expert

Advantages Of NoSQL
• 1: Elastic scaling
• For years, database administrators have relied on scale
up — buying bigger servers as database load increases —
rather than scale out — distributing the database across
multiple hosts as load increases. However, as transaction
rates and availability requirements increase, and as
databases move into the cloud or onto virtualized
environments, the economic advantages of scaling out
on commodity hardware become irresistible.
• RDBMS might not scale out easily on commodity clusters,
but the new breed of NoSQL databases are designed to
expand transparently to take advantage of new nodes,
and they're usually designed with low-cost commodity
hardware in mind.
• 2: Big data
• Just as transaction rates have grown out of recognition
over the last decade, the volumes of data that are
being stored also have increased massively. O'Reilly
has cleverly called this the "industrial revolution of
data." RDBMS capacity has been growing to match
these increases, but as with transaction rates, the
constraints of data volumes that can be practically
managed by a single RDBMS are becoming intolerable
for some enterprises. Today, the volumes of "big data"
that can be handled by NoSQL systems, such as
Hadoop, outstrip what can be handled by the biggest
RDBMS.
• 3: Goodbye DBAs (see you later?)
• Despite the many manageability improvements claimed
by RDBMS vendors over the years, high-end RDBMS
systems can be maintained only with the assistance of
expensive, highly trained DBAs. DBAs are intimately
involved in the design, installation, and ongoing tuning of
high-end RDBMS systems.
• NoSQL databases are generally designed from the ground
up to require less management:  automatic repair, data
distribution, and simpler data models lead to lower
administration and tuning requirements — in theory. In
practice, it's likely that rumors of the DBA's death have
been slightly exaggerated. Someone will always be
accountable for the performance and availability of any
mission-critical data store.
• 4: Economics
• NoSQL databases typically use clusters of
cheap commodity servers to manage the
exploding data and transaction volumes, while
RDBMS tends to rely on expensive proprietary
servers and storage systems. The result is that
the cost per gigabyte or transaction/second
for NoSQL can be many times less than the
cost for RDBMS, allowing you to store and
process more data at a much lower price
point.
• 5: Flexible data models
• Change management is a big headache for large production
RDBMS. Even minor changes to the data model of an RDBMS
have to be carefully managed and may necessitate downtime
or reduced service levels.
• NoSQL databases have far more relaxed — or even
nonexistent — data model restrictions. NoSQL Key Value
stores and document databases allow the application to store
virtually any structure it wants in a data element. Even the
more rigidly defined BigTable-based NoSQL databases
(Cassandra, HBase) typically allow new columns to be created
without too much fuss.
• The result is that application changes and database schema
changes do not have to be managed as one complicated
change unit. In theory, this will allow applications to iterate
faster, though, clearly, there can be undesirable side effects if
the application fails to manage data integrity.
Five challenges of NoSQL
• 1: Maturity
• RDBMS systems have been around for a long time.
NoSQL advocates will argue that their advancing age is
a sign of their obsolescence, but for most CIOs, the
maturity of the RDBMS is reassuring. For the most
part, RDBMS systems are stable and richly functional.
In comparison, most NoSQL alternatives are in pre-
production versions with many key features yet to be
implemented.
• Living on the technological leading edge is an exciting
prospect for many developers, but enterprises should
approach it with extreme caution.
• 2: Support
• Enterprises want the reassurance that if a key system fails,
they will be able to get timely and competent support. All
RDBMS vendors go to great lengths to provide a high level
of enterprise support.
• In contrast, most NoSQL systems are open source projects,
and although there are usually one or more firms offering
support for each NoSQL database, these companies often
are small start-ups without the global reach, support
resources, or credibility of an Oracle, Microsoft, or IBM.
• 3: Analytics and business intelligence
• NoSQL databases have evolved to meet the scaling demands of modern
Web 2.0 applications. Consequently, most of their feature set is oriented
toward the demands of these applications. However, data in an
application has value to the business that goes beyond the insert-read-
update-delete cycle of a typical Web application. Businesses mine
information in corporate databases to improve their efficiency and
competitiveness, and business intelligence (BI) is a key IT issue for all
medium to large companies.
• NoSQL databases offer few facilities for ad-hoc query and analysis. Even a
simple query requires significant programming expertise, and commonly
used BI tools do not provide connectivity to NoSQL.
• Some relief is provided by the emergence of solutions such as HIVE or PIG,
which can provide easier access to data held in Hadoop clusters and
perhaps eventually, other NoSQL databases. Quest Software has
developed a product — Toad for Cloud Databases — that can provide ad-
hoc query capabilities to a variety of NoSQL databases.
• 4: Administration
• The design goals for NoSQL may be to provide a zero-admin solution, but the current reality falls well short of that goal. NoSQL today requires a lot of skill to install and a lot of effort to maintain.
• 5: Expertise
• There are literally millions of developers throughout the world, and in every business segment, who are familiar with RDBMS concepts and programming. In contrast, almost every NoSQL developer is in a learning mode. This situation will resolve itself naturally over time, but for now, it's far easier to find experienced RDBMS programmers or administrators than a NoSQL expert.
CAP: Consistency, Availability, Partition tolerance
CAP Theorem
• The CAP theorem states that a distributed computer system
cannot guarantee all of the following three properties at the
same time:
• Consistency – once data is written, all future read requests will return that data
• Availability – the database is always available and responsive
• Partition tolerance – the system keeps working even if a network failure cuts some nodes off from the others
• Brewer’s CAP “Theorem”: You can have at most two of these three properties for any system
• Very large systems will partition at some point
• Choose one of consistency or availability
• Traditional databases choose consistency
• Most Web applications choose availability
• Except for specific parts such as order processing
CAP Theorem
Brewer’s Conjecture
“Of three properties of shared-data
systems – data Consistency, system
Availability and tolerance to network
Partitions – only two can be achieved
at any given moment in time.”

• 2000 Prof Eric Brewer, PoDC Conference Keynote


• 2002 Seth Gilbert and Nancy Lynch, ACM SIGACT News 33 (2)
CAP THEOREM
• Brewer originally described this impossibility result as forcing a
choice of “two out of the three” CAP properties, leaving three
viable design options: CP, AP and CA. All the three
combinations can be defined as:
• CA – data should be consistent between all nodes. As long as all
nodes are online, users can read/write from any node and be
sure that the data is the same on all nodes.
• CP – data is consistent between all nodes and maintains
partition tolerance by becoming unavailable when a node goes
down.
• AP - nodes remain online even if they can’t communicate with
each other and will re-sync data once the partition is resolved,
but you aren’t guaranteed that all nodes will have the same
data (either during or after the partition)
CAP Theorem
The business decision: Consistent, Partition-tolerant or Available. Pick two!
BASE
• A BASE system gives up on consistency so as to have greater availability and partition tolerance. BASE can be defined as follows:
• Basically Available indicates that the system does guarantee availability, in terms of the CAP theorem.
• Soft state indicates that the state of the system may change
over time, even without input. This is because of the
eventual consistency model.
• Eventual consistency indicates that the system will become
consistent over time, given that the system doesn’t receive
input during that time.
CAP Summary
• CA (Consistent, Available): traditional relational systems: MySQL, PostgreSQL, etc.
• AP (Available, Partition Tolerance): Voldemort, Riak, Cassandra, CouchDB, Dynamo-like systems
– AP: Requests will complete at any node, possibly violating consistency
• CP (Consistent, Partition Tolerance): HBase, MongoDB, Redis, BigTable-like systems
– CP: Requests will complete at nodes that have quorum
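
The quorum idea can be sketched in a few lines. The slides only say that CP requests complete at nodes that have quorum; the majority rule and the function names below (`majority`, `can_serve`, `quorums_overlap`) are assumptions chosen for illustration. The `read_q + write_q > n_replicas` check is the overlap condition commonly used in Dynamo-style systems.

```python
def majority(n_replicas: int) -> int:
    """Smallest number of replicas that forms a majority quorum."""
    return n_replicas // 2 + 1

def can_serve(reachable: int, n_replicas: int) -> bool:
    """A CP-style group of nodes serves requests only if it can reach a quorum."""
    return reachable >= majority(n_replicas)

def quorums_overlap(n_replicas: int, read_q: int, write_q: int) -> bool:
    """True when every read quorum shares at least one replica with every write quorum."""
    return read_q + write_q > n_replicas

# Five replicas split by a network partition into groups of 3 and 2.
big_side = can_serve(3, 5)    # the majority side keeps serving
small_side = can_serve(2, 5)  # the minority side refuses requests

print(big_side, small_side)
print(quorums_overlap(3, 2, 2), quorums_overlap(3, 1, 1))
```

During the partition only one side can hold a majority, so two conflicting writes cannot both be accepted. The cost is that the minority side becomes unavailable.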
Eventual Consistency
• When no updates occur for a long period of time, eventually all updates will propagate through the system and all the nodes will be consistent
• For a given accepted update and a given node, eventually either the update reaches the node or the node is removed from service
• Known as BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID
• Soft state: copies of a data item may be inconsistent
• Eventually consistent: copies become consistent at some later time if there are no more updates to that data item
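
A minimal sketch of eventual consistency, assuming a last-write-wins rule based on timestamps. The `Replica` class, the `sync` function and the sample data are hypothetical, and real systems may resolve conflicts differently (for example with vector clocks).

```python
class Replica:
    """One copy of the data; each key stores a (timestamp, value) pair."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value, timestamp):
        # Keep the incoming value only if it is newer than what we hold.
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

def sync(a, b):
    """Exchange entries in both directions; the newest timestamp wins."""
    for key, (ts, value) in list(a.data.items()):
        b.write(key, value, ts)
    for key, (ts, value) in list(b.data.items()):
        a.write(key, value, ts)

r1, r2 = Replica("r1"), Replica("r2")
r1.write("cart", "book", timestamp=1)
r2.write("cart", "book+pen", timestamp=2)

before = (r1.read("cart"), r2.read("cart"))  # soft state: the copies disagree
sync(r1, r2)
after = (r1.read("cart"), r2.read("cart"))   # copies agree once updates stop
print(before, after)
```

Between the two writes and the `sync` call, a reader may see either value depending on which replica answers. That window is the "soft state" described above.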


SQL vs NoSQL
• Types – SQL: one ‘logical’ database, with somewhat distinct ‘physical’ implementations. NoSQL: many different types (columnar, key/value, document, ...).
• History – SQL: 1970. NoSQL: 2000.
• Storage – SQL: table/row/column, a.k.a. file/record/field. NoSQL: it depends: records, documents (++ unstructured storage ++).
• Schema – SQL: ‘static’ schemas; structure is pre-determined. NoSQL: ‘dynamic’ or no schema (++ schema-free ++).
• Scaling – SQL: vertical. NoSQL: horizontal (++ easier, cheaper ++).
• Development model – SQL: initially proprietary; later open source. NoSQL: open source (++ agile ++).
• Transactions – SQL: consistency: ACID (++ yes ++). NoSQL: consistency: BASE (-- not always --).
• DML – SQL: ++ SQL ++. NoSQL: OO; also SQL-like (-- infancy --).
Managing Different Data Types
• NoSQL databases come in four core types, one for each type of data the database is expected to manage:
• Columnar: Extension to traditional table structures. Supports variable sets of columns (column families) and is optimized for column-wide operations (such as count, sum, and mean average).
• Key-value: A very simple structure. Sets of named keys and their value(s), typically an uninterpreted chunk of data. Sometimes that simple value may in fact be a JSON or binary document.
• Triple: A single fact represented by three elements:
– The subject you’re describing
– The name of its property or relationship to another subject
– The value: either an intrinsic value (such as an integer) or the unique ID of another subject (if it’s a relationship)
• For example, Adam likes Cheese. Adam is the subject, likes is the predicate, and Cheese is the object.
• Document: XML, JSON, text, or binary blob. Any treelike structure can be represented as an XML or JSON document, including things such as an order that includes a delivery address, billing details, and a list of products and quantities.
• Some document NoSQL databases support storing a separate list (or document) of properties about the document, too.
Columnar
• Column stores are similar at first appearance
to traditional relational DBMS.
• The concepts of rows and columns are still
there.
• You also define column families before
loading data into the database, meaning that
the structure of data must be known in
advance.
• Column stores organize data differently.
• Data is organized for fast column operations.
• Ideal for aggregate functions.
• Column stores are also sometimes referred to as Big Tables, after Google's BigTable.
• A key difference between column stores and a traditional RDBMS is that, in a column store, each record (think row in an RDBMS) doesn’t require a single value per column.
• Each one of these column families consists of several fields.
• One of these column families may have multiple “rows” in its own
right.
• Order item information, for example, has multiple rows, one for each line item. These rows will contain data such as item ID, quantity, and unit price.
• A key benefit of a column store over an RDBMS is that column stores don’t require fields to always be present and don’t require a blank padding null value like an RDBMS does.
• You can retrieve all related information using a single record ID, rather than using a complex Structured Query Language (SQL) join as in an RDBMS.
• If you know the data fields involved up front and need to quickly
retrieve related data together as a single record, then consider a
column store.
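
The points above can be sketched as a toy in-memory column-family store. Column families are declared up front, but individual columns are optional per record. The `ColumnFamilyStore` class, its methods and the sample orders are hypothetical and do not reflect the API of Cassandra, HBase or any other product.

```python
from collections import defaultdict

class ColumnFamilyStore:
    """Toy store: row_id -> column family -> {column: value}."""

    def __init__(self, families):
        self.families = set(families)  # families must be known in advance
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_id, family, column, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_id][family][column] = value

    def get_row(self, row_id):
        """Fetch everything for one record ID, with no joins involved."""
        row = self.rows.get(row_id, {})
        return {family: dict(cols) for family, cols in row.items()}

    def column_values(self, family, column):
        """Collect one column across all rows that actually have it."""
        return [
            row[family][column]
            for row in self.rows.values()
            if family in row and column in row[family]
        ]

store = ColumnFamilyStore(["customer", "totals"])
store.put("order-1", "customer", "name", "Adam")
store.put("order-1", "customer", "email", "adam@example.com")
store.put("order-1", "totals", "amount", 20)
store.put("order-2", "customer", "name", "Bo")  # no email: nothing is stored for it
store.put("order-2", "totals", "amount", 5)

print(store.get_row("order-2"))
print(sum(store.column_values("totals", "amount")))
```

The second order simply has no `email` column, with no padding null, and the aggregate over `amount` touches only that one column.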
Key‐value stores
• Key‐value stores also have a record with an ID field — the
key in key‐value stores — and a set of data. This data can
be one of the following:
✓An arbitrary piece of data that the application developer
interprets (as opposed to the database)
✓Any set of name‐value pairs (called bins)
• Key-value stores are similar to column stores in that it’s possible to store varying data structures in the same logical record set.
• Key-value stores are the simplest type of storage in the NoSQL world: you’re just storing keys for the data you provide.
• Key-value stores support integers, strings, Booleans and more complex structures for values (such as maps and lists).
• Key-value stores are optimized for speed of ingestion and retrieval. If you need very high ingest speed on a limited number of nodes and can afford to sacrifice complex ad hoc query support, then a key-value store may be for you.
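
A minimal sketch of the key-value model, assuming an in-memory dictionary as the store. The `KeyValueStore` class and the session record are hypothetical. The value is an opaque chunk of bytes that only the application interprets.

```python
import json

class KeyValueStore:
    """Toy key-value store: the database never looks inside the value."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str):
        return self._data.get(key)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)

store = KeyValueStore()

# The application chooses the encoding; here it is JSON.
session = {"user": "adam", "cart": ["cheese"]}
store.put("session:42", json.dumps(session).encode("utf-8"))

# Changing one field means reading, decoding, modifying and rewriting the whole value.
loaded = json.loads(store.get("session:42").decode("utf-8"))
loaded["cart"].append("crackers")
store.put("session:42", json.dumps(loaded).encode("utf-8"))

print(json.loads(store.get("session:42")))
```

Lookups by key are simple and fast, but the store cannot answer a question such as "which sessions contain cheese?" without the application scanning and decoding every value.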
Triple and graph stores
• The concept of triples has been around since 1998, thanks to the World Wide Web Consortium (W3C) and Sir Tim Berners-Lee.
• Every fact (or more correctly, assertion) is described as a triple of subject, predicate, and object:
– A subject is the thing you’re describing. It has a unique ID called an IRI. It may also have a type, which could be a physical object (like a person) or a concept (like a meeting).
– A predicate is the property or relationship belonging to the subject. This again is a unique IRI that is used for all subjects with this property.
– An object is the intrinsic value of a property (such as integer or Boolean, text) or another subject IRI for the target of a relationship.
• If you need to store facts, dynamically changing relationships, or
provenance information, then consider a triple store.
Examples of triples

“Adam likes Cheese” is a triple.

AdamFowler is_a Person
AdamFowler likes Cheese
Cheese is_a Foodstuff

Such triple information is conveyed with full IRI information in a format such as Turtle.
Web of interrelated facts across different ontologies.
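
The example triples can be sketched in Python as a set of (subject, predicate, object) tuples with a simple pattern-matching query. Short names stand in for full IRIs here, and the `match` helper is a hypothetical illustration rather than the query interface of any triple store.

```python
triples = {
    ("AdamFowler", "is_a", "Person"),
    ("AdamFowler", "likes", "Cheese"),
    ("Cheese", "is_a", "Foodstuff"),
}

def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the given parts; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )

# Everything stated with the is_a predicate.
typed = match(triples, predicate="is_a")

# Follow a relationship: which foodstuffs does AdamFowler like?
liked_food = [
    o for (_, _, o) in match(triples, subject="AdamFowler", predicate="likes")
    if match(triples, subject=o, predicate="is_a", obj="Foodstuff")
]

print(typed)
print(liked_food)
```

New kinds of facts can be added as further tuples without changing any schema, which is why triple stores suit dynamically changing relationships.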
Documents
• Document databases are sometimes called aggregate
databases because they tend to hold documents that
combine information in a single logical unit — an
aggregate.
• You might have a document that includes a TV episode,
series, channel, brand, and scheduling and availability
information.
• Retrieving all information from a single document is
easier with a database (no complex joins as in an RDBMS)
and is more logical for applications (less complex code).
• A document is any unstructured or tree-structured piece of information. It could be a recipe (for cheesecake, obviously), financial services trade, PowerPoint file, PDF, plain text, or JSON or XML document.
• Because of its treelike nature, an effective document store is also capable of storing simpler data structures.
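
To make the aggregate idea concrete, the sketch below stores an order, its addresses and its line items as one JSON document and reads it back with a single lookup. The field names, the `collection` dictionary and the `order_total` helper are hypothetical, and all values are made up.

```python
import json

# One aggregate: the order with its addresses and items nested inside.
order = {
    "_id": "order-1001",
    "customer": {"name": "Adam"},
    "delivery_address": {"city": "Exampleville"},
    "billing": {"method": "card"},
    "items": [
        {"product": "Cheese", "quantity": 2, "unit_price": 3.5},
        {"product": "Crackers", "quantity": 1, "unit_price": 1.25},
    ],
}

# A toy "collection": document ID -> serialized JSON document.
collection = {order["_id"]: json.dumps(order)}

def order_total(doc):
    """Sum quantity * unit_price over the nested line items."""
    return sum(item["quantity"] * item["unit_price"] for item in doc["items"])

# A single lookup returns the whole aggregate; no joins are needed.
loaded = json.loads(collection["order-1001"])
print(loaded["delivery_address"]["city"], order_total(loaded))
```

In a relational design the same data would typically be spread over order, address and order-item tables and reassembled with joins.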
Question
• What kind of database solution will you use
for an online store’s orders and the related
delivery and payment addresses and order
items ??
Search engines
• Search engines use an architecture very similar to NoSQL databases.
• Their indexes and query processing are highly distributed. Many search engines are even capable of acting as a key-value or document store in their own right.
• NoSQL databases are often used to store unstructured data and documents. The structures of this indexed data vary greatly.
Search

• Document databases are appropriate in cases where system administrators or developers frequently don’t have control of the structures.
• E.g. publishing, and the defense and intelligence realms.
• Storing many structures in a single database necessitates a way to provide a standard query mechanism over all content. Search engines are great for that purpose.
• Search technology is different from traditional query database interface technology. SQL is not a search technology; it’s a query language. Search deals with imperfect matches and relevancy scoring, whereas query deals with Boolean exact matching logic.
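
The difference between query and search can be sketched with two small functions: one returns only exact matches, the other ranks documents by how often the search terms occur. The documents, the `exact_query` and `search` functions and the term-count scoring are assumptions for illustration. Real search engines use richer ranking schemes such as TF-IDF or BM25.

```python
docs = {
    "d1": "cheesecake recipe with cream cheese",
    "d2": "financial services trade report",
    "d3": "cheese and wine pairing guide",
}

def exact_query(docs, text):
    """Query: Boolean logic, a document either equals the text or it does not."""
    return sorted(doc_id for doc_id, body in docs.items() if body == text)

def search(docs, terms):
    """Search: score each document by term occurrences and rank by relevance."""
    query_terms = terms.lower().split()
    scores = {}
    for doc_id, body in docs.items():
        words = body.lower().split()
        score = sum(words.count(term) for term in query_terms)
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by document ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

print(exact_query(docs, "cheese recipe"))  # no document matches exactly
print(search(docs, "cheese recipe"))       # partial matches, ranked
```

The exact query finds nothing because no document equals the phrase, while the search still returns the two cheese-related documents with the closer match first.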
Hybrid NoSQL databases
• Although each NoSQL database has its core audience, several can be used to manage two or more of the previously mentioned data structures.
• Hybrid databases can easily handle document and key‐value
storage needs, while also allowing fast aggregate operations
similar to how column stores work.
• Typically, this goal is achieved by using search engine term
indexes rather than tabular field indexes within a table
column in the database schema design itself.
• E.g. MarkLogic Server, OrientDB
Available NoSQL products
• Columnar: DataStax, Apache Cassandra, HBase, Apache
Accumulo, Hypertable
• Key‐value: Basho Riak, Redis, Voldemort, Aerospike,
Oracle NoSQL
• Triple/graph: Neo4j, Ontotext’s GraphDB (formerly
OWLIM), MarkLogic, OrientDB, AllegroGraph, YarcData
• Document: MongoDB, MarkLogic, CouchDB,
FoundationDB, IBM Cloudant, Couchbase
• Search engine: Apache Solr, Elasticsearch, MarkLogic
• Hybrid: OrientDB, MarkLogic, ArangoDB
Why NoSQL now?

• Trends

“Internet size”, Cluster friendly

Rapid development / Solution oriented

Polyglot Persistence

Schemaless
What kinds of NoSQL
• NoSQL solutions fall into two major areas:
– Key/Value or ‘the big hash table’.
• Amazon Dynamo
• Voldemort
• Scalaris
– Schema-less which comes in multiple flavors, column-
based, document-based or graph-based.
• Cassandra (column-based)
• CouchDB (document-based)
• Neo4J (graph-based)
• HBase (column-based)
Key/Value
Pros:
– very fast
– very scalable
– simple model
– able to distribute horizontally

Cons:
- many data structures (objects) can't be easily modeled as key
value pairs
Schema-Less
Pros:
- Schema-less data model is richer than key/value pairs
- eventual consistency
- many are distributed
- still provide excellent performance and scalability

Cons:
- typically no ACID transactions or joins
