Chap 3. NoSQL

Kanchan Doke
Asst. Professor, Dept. of Computer Engineering, B.V.C.O.E
Contents
 Introduction
 Business drivers
 NoSQL Data Architecture Pattern
 Key-Value Store
 Graph Store
 Column Family store
 Document Store

 NoSQL solution for Big Data


Kanchan Doke, Computer Dept, B.V.C.O.E.

What is RDBMS

 RDBMS: the relational database management system.
 Relation: a relation is a 2D table which has the following features:
 Name
 Attributes
 Tuples

Issues with RDBMS: Scalability

 Fixed table schemas.
 Optimized for small but frequent reads/writes.
 Hard to scale out across clusters of commodity hardware.
 Issues with scaling up when the dataset is just too big, e.g. Big Data.
 Not designed to be distributed.

What is NoSQL

 Stands for "Not Only SQL".
 "NoSQL is a set of concepts that allows the rapid and efficient processing of datasets with a focus on scalability, performance, reliability, and agility."
 Provides mechanisms for storage and retrieval of unstructured data in a distributed environment.
 Works for unpredictable, dynamic data.
 Developed to handle large amounts of data that need to be frequently accessed and processed.
2/5 marks

Need of NoSQL

 Explosion of large web and social media sites (Facebook, Twitter, Google, etc.) with large data needs.
 The system response time becomes slow when you use an RDBMS for massive volumes of data.
 Solutions:
 "Scale up": upgrade the existing hardware. This process is expensive.
 "Scale out": distribute the database load across multiple hosts whenever the load increases.
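
The "scale out" idea above can be sketched in a few lines. This is a minimal illustration, assuming simple modulo sharding on a hash of the key; the host names and keys are hypothetical, and real systems often use consistent hashing instead so that adding a host moves fewer keys.

```python
import hashlib


def host_for_key(key: str, hosts: list[str]) -> str:
    """Map a key to one host by hashing the key (simple modulo sharding)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return hosts[int(digest, 16) % len(hosts)]


# Hypothetical host names and keys, for illustration only.
hosts = ["db-host-0", "db-host-1", "db-host-2"]
keys = ["user:1", "user:2", "user:3", "user:4", "user:5"]

# Each key is always routed to the same host, so load spreads across hosts.
placement = {key: host_for_key(key, hosts) for key in keys}
for key, host in placement.items():
    print(f"{key} -> {host}")
```

When the load grows, more entries can be added to `hosts`; with plain modulo sharding this reshuffles many keys, which is the main reason this is only a sketch.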
4 Marks

CAP Theorem

A distributed system can guarantee at most two of the following three properties at the same time:
 Consistency
 All the servers in the system have the same data, so anyone using the system gets the same copy regardless of which server answers their request.
 Availability
 The system always responds to a request (even if the response is not the latest data, is not consistent across the system, or is just a message saying the system isn't working).
 Partition tolerance
 The system continues to operate as a whole even if individual servers fail or can't be reached.
10 Marks

What are the characteristics/features?

 It’s more than rows in tables
 NoSQL systems store and retrieve data in many formats: key-value stores, graph databases, column-family (Bigtable) stores, document stores, and even rows in tables.
 It’s free of joins
 NoSQL systems allow you to extract your data using simple interfaces without joins.
 It’s schema-free
 NoSQL systems allow you to drag-and-drop your data into a folder and then query it without creating an entity-relational model.
 It works on many processors
 NoSQL systems allow you to store your database on multiple processors and maintain high-speed performance.

What are the characteristics/features? (Cont.)

 It uses shared-nothing commodity computers
 Most (but not all) NoSQL systems leverage low-cost commodity processors that have separate RAM and disk.
 It supports linear scalability
 When you add more processors, you get a consistent increase in performance.
 It’s innovative
 NoSQL offers alternatives to a single way of storing, retrieving, and manipulating data.


What NoSQL is not?

 It’s not about the SQL language
 NoSQL isn’t defined as an application that uses a language other than SQL.
 SQL as well as other query languages are used with NoSQL databases.
 It’s not only open source
 Although many NoSQL systems have an open source model, commercial products use NoSQL concepts as well as open source initiatives. You can still have an innovative approach to problem solving with a commercial product.
 It’s not about cloud computing
 Many NoSQL systems reside in the cloud to take advantage of its ability to rapidly scale when the situation dictates. NoSQL systems can run in the cloud as well as in your corporate data center.


10 marks

Difference between RDBMS and NoSQL

Sr. No | RDBMS | NoSQL
1 | Fixed (static), predefined schema | Dynamic schema
2 | Vertically scalable | Horizontally scalable
3 | Table-based databases | Document-based, key-value, graph, or wide-column stores
4 | SQL (Structured Query Language) for defining and manipulating the data | No single standard query language; each store has its own query language or API
5 | SQL databases maintain ACID properties (Atomicity, Consistency, Isolation, and Durability) | NoSQL databases follow Brewer's CAP theorem / BASE properties
6 | Synchronous inserts and updates | Asynchronous inserts and updates
7 | Standard interface for executing complex queries | Support only simple transactions
8 | Typically have a single point of failure | Typically have no single point of failure
9 | Transactions written in one location | Transactions written in many locations
10 | E.g. Oracle, MS-SQL, MySQL | E.g. MongoDB, Bigtable, Cassandra, HBase, Neo4j, CouchDB, etc.
5 marks

NoSQL Business Drivers

 Volume and Velocity
 The ability to handle large datasets that arrive quickly.
 Variability
 How diverse data types don’t fit into structured tables.
 Agility
 How quickly an organization responds to business change.


NoSQL Business Drivers - Volume

 Need to query big data using clusters of commodity processors.
 Increasing the speed of a single processor was no longer an option.
 The need to scale out (also known as horizontal scaling), rather than scale up.
 This moved organizations from serial to parallel processing.
 The data problems are split into separate paths and sent to separate processors to divide and conquer the work.
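
The divide-and-conquer step above can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and worker threads stand in for the separate processors of a real cluster (in CPython, threads do not give true CPU parallelism).

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Split a list into at most `parts` contiguous chunks."""
    size = (len(data) + parts - 1) // parts  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum(data, workers=4):
    """Sum a list by sending each chunk to a separate worker."""
    if not data:
        return 0
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, chunks))  # one partial result per chunk
    return sum(partial_sums)  # combine the partial results


total = parallel_sum(list(range(1, 101)))
print(total)
```

Each worker only sees its own chunk, which is the property that lets the same pattern spread across machines that share nothing.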


NoSQL Business Drivers - Velocity

 Single-processor RDBMSs are unable to keep up with the demands of real-time inserts and online queries to the database made by public-facing websites.
 Problems faced with RDBMSs:
 RDBMSs frequently index many columns of every new row, a process which decreases system performance.
 Random bursts in web traffic slow down response for everyone.
 Tuning these systems can be costly when both high read and high write throughput are desired.

NoSQL Business Drivers - Variability

 Companies that want to capture and report on exception data struggle when attempting to use the rigid database schema structures imposed by RDBMSs.
 For example:
 If a business unit wants to capture a few custom fields for a particular customer, all customer rows within the database need to store this information even though it doesn’t apply.
 Adding new columns to an RDBMS may require the system to be shut down while ALTER TABLE commands are run.
 When a database is large, this process can impact system availability, costing time and money.

NoSQL Business Drivers - Agility

 The most complex part of building applications using RDBMSs is the process of putting data into and getting data out of the database.
 If your data has nested and repeated subgroups of data structures, you need to include an object-relational mapping layer.
 The responsibility of this layer is to generate the correct combination of INSERT, UPDATE, DELETE, and SELECT SQL statements to move object data to and from the RDBMS persistence layer.
 This process isn’t simple and is the largest barrier to rapid change when developing new applications or modifying existing ones.
10 marks

NoSQL Data Architecture Pattern

NoSQL databases are classified into four types:
• Key-value pair based
• Document based
• Column based
• Graph based


Key-Value Store
 What a key-value store is
 Benefits of using a key-value store
 How to use a key-value store in an application
 Key-value store use cases

Key-value stores
• A key-value store is a simple database that, when presented with a simple string (the key), returns an arbitrarily large BLOB of data (the value).
• A key-value store is like a dictionary.
• Word entries represent keys and definitions represent values.
• Because entries are sorted alphabetically by word, retrieval is quick.
• A key-value store is also indexed by the key.
• The key points directly to the value, resulting in rapid retrieval, regardless of the number of items in your store.

Key-value stores (Cont.)
 No need to specify a data type for the value of a key-value store
o So you can store any data type that you want in the value.
o Each value can have a different number of attributes.
 The system stores the information as a BLOB and returns the same BLOB when a GET (retrieval) request is made.
o The value can be:
 images,
 web pages,
 documents,
 videos.

Key-value stores (Cont.)
 Example: [Figure: a sample key-value table with a Key column and a Value column]

Key-value stores (Cont.)
The key in a key-value store is flexible and can be represented by many
formats:
• Logical path names to images or files
• Artificially generated strings created from a hash of the value
• REST web service calls
• SQL queries

Using a key-value store
• The best way to think about using a key-value store is to visualize a
single table with two columns.
• There are three operations performed on a key-value store:
• put
• get
• delete

Using a key-value store (Cont.)
• put($key as xs:string, $value as item()) adds a new key-value pair to the table and will update the value if this key is already present.
• get($key as xs:string) as item() returns the value for any given key, or it may return an error message if there’s no such key in the key-value store.
• delete($key as xs:string) removes a key and its value from the table, or it may return an error message if there’s no such key in the key-value store.
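
The three operations above can be sketched with an in-memory table. This is a minimal illustration, not any particular product's API: the class name `KeyValueStore` is hypothetical, and raising `KeyError` is one assumed way of "returning an error message" for a missing key.

```python
from typing import Any


class KeyValueStore:
    """A single two-column table: key -> value (the value is opaque to the store)."""

    def __init__(self) -> None:
        self._table: dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        # Adds a new pair, or updates the value if the key is already present.
        self._table[key] = value

    def get(self, key: str) -> Any:
        if key not in self._table:
            raise KeyError(f"no such key: {key}")
        return self._table[key]

    def delete(self, key: str) -> None:
        if key not in self._table:
            raise KeyError(f"no such key: {key}")
        del self._table[key]


store = KeyValueStore()
store.put("images/logo.png", b"\x89PNG...")   # hypothetical key and BLOB
store.put("images/logo.png", b"new bytes")    # same key: value is replaced
print(store.get("images/logo.png"))
store.delete("images/logo.png")
```

Note that there is no way to ask the store for "all keys whose value contains X": the interface only goes from key to value.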

Key-value store rules
A key-value store has two rules:
• Distinct keys: if you can’t uniquely identify a key-value pair, you can’t return a single result.
• No queries on values: in a relational database, you can constrain a result set using the WHERE clause. A key-value store prohibits this type of operation, as you can’t select a key-value pair using the value.
Restrictions on keys and values
 A key can be anything, as long as it’s a reasonably short string of characters.
 A value can be anything, as long as your storage system can hold it.
 This makes the structure ideal for multimedia: images, sounds, and even full-length movies.

Use cases
• Use case: Storing web pages in a key-value store
• Use case: Amazon Simple Storage Service (S3)

Use cases (Cont.)

 Storing web pages in a key-value store
 A web crawler automatically visits a website to extract and store the content of each web page.
 The words in each web page are then indexed for fast keyword search.
 The URL is the key, and the value is the web page or resource located at that key.
 Dynamic portions of a site, where pages are generated by scripts, are not stored in the key-value store.


Use cases (Cont.)

 Amazon Simple Storage Service (S3)
 S3 is a simple key-value store with some enhanced features:
 It allows an owner to attach metadata tags to an object to provide additional information about the object, for example content type, content length, cache control, and object expiration.
 It has an access control module that allows an object owner to grant rights to individuals, groups, or everyone to perform put, get, and delete operations on an object, group of objects, or bucket.


Use cases (Cont.)

 Amazon Simple Storage Service (S3)
 At the heart of S3 is the bucket.
 All objects you store are kept in buckets.
 Buckets store key/object pairs:
 The key is a string (unique within a bucket).
 The object can be any data, e.g. images, XML files, digital music.


Use cases (Cont.)

 Amazon Simple Storage Service (S3)
 To manipulate objects:
 HTTP PUT message: new objects are added to a bucket.
 HTTP GET message: objects are retrieved from a bucket.
 HTTP DELETE message: objects are removed from a bucket.
 To access an object:
 Generate a URL from the bucket/key combination.
 Example: http://testbucket.s3.amazonaws.com/gray-bucket.png
 Here testbucket is the bucket name and gray-bucket.png is the object key.
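
Generating the URL from a bucket/key pair can be sketched as below. This is a minimal illustration that only builds a string in the shape of the slide's example; the function name `object_url` is hypothetical, and it makes no network request and does not use any AWS library.

```python
from urllib.parse import quote


def object_url(bucket: str, key: str) -> str:
    """Build a URL of the form http://<bucket>.s3.amazonaws.com/<key>."""
    # quote() percent-encodes characters that are not safe in a URL path.
    return f"http://{bucket}.s3.amazonaws.com/{quote(key)}"


url = object_url("testbucket", "gray-bucket.png")
print(url)
```

An HTTP PUT, GET, or DELETE sent to such a URL would then correspond to the put, get, and delete operations of the key-value interface.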


Document Store
 Introduction
 Document collections
 Document store implementations
 Case study

Document Oriented Database

 It is similar to a key-value database.
 But a document database contains structured or semi-structured data.
 The structured or semi-structured data value is referred to as a document.
 In a key-value store, by contrast, values lack a formal structure and aren’t indexed or searchable.
 A key-value store can only return the value (a BLOB of data) associated with a given key.


Document Oriented Database (Cont.)

 Documents are gathered together in collections within the database.
 E.g. book collection, video collection, web page collection, etc.


Document Store
 Properties:
 The key may be a simple ID which is never used or seen.
 You can query any value or content within the document.
 Everything inside a document is automatically indexed when a new document is added.

Relational table:
SID     Name    Phone
16s143  Sagar   9723486
16s144  Nikita  9723456

Equivalent documents:
{
  _id: 16s143,
  Name: Sagar,
  Phone: 9723486
}
{
  _id: 16s144,
  Name: Nikita,
  Phone: 9723456
}

What is a Document DB?

{
  "name": "Phil",
  "age": 26,
  "status": "A",
  "citiesVisited": ["Chicago", "LA", "San Francisco"]
}
{
  "name": "Phil",
  "age": 26,
  "status": "A"
}
 Documents can differ in their attributes
 But still belong to the same collection
 A document can be:
 PDF
 Microsoft Word doc
 XML
 JSON file

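
A collection of documents that differ in their attributes, and that can be queried on any field, can be sketched as below. This is a minimal in-memory illustration, not MongoDB's API: the `Collection` class, its methods, and the `_id` values "p1" and "p2" are hypothetical, and no real indexing is performed.

```python
from typing import Any


class Collection:
    """Documents of varying shape, stored under an _id and searchable by content."""

    def __init__(self) -> None:
        self._docs: dict[str, dict[str, Any]] = {}

    def insert(self, doc: dict[str, Any]) -> None:
        self._docs[doc["_id"]] = doc

    def find(self, **criteria: Any) -> list[dict[str, Any]]:
        # Unlike a key-value store, we can match on any field of the value.
        return [
            doc for doc in self._docs.values()
            if all(field in doc and doc[field] == wanted
                   for field, wanted in criteria.items())
        ]


people = Collection()
people.insert({"_id": "p1", "name": "Phil", "age": 26, "status": "A",
               "citiesVisited": ["Chicago", "LA", "San Francisco"]})
people.insert({"_id": "p2", "name": "Phil", "age": 26, "status": "A"})

print(len(people.find(status="A")))
print([d["_id"] for d in people.find(citiesVisited=["Chicago", "LA", "San Francisco"])])
```

The second document simply lacks `citiesVisited`; no schema change is needed for the two to live in the same collection.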
Document Store
• Document stores can tell not only that your search item is in the
document, but also the search item’s exact location by using the
document path, a type of key, to access the leaf values of a tree
structure.
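
The document path idea can be sketched with a small helper that walks a nested document down to a leaf value. This is a minimal illustration under stated assumptions: the slash-separated path syntax and the function name `get_by_path` are hypothetical, not the syntax of any specific document store.

```python
from typing import Any


def get_by_path(doc: Any, path: str) -> Any:
    """Follow a slash-separated path through nested dicts and lists to a leaf."""
    node = doc
    for step in path.strip("/").split("/"):
        if isinstance(node, list):
            node = node[int(step)]   # list items are addressed by position
        else:
            node = node[step]        # dict entries are addressed by field name
    return node


person = {
    "name": "Phil",
    "age": 26,
    "citiesVisited": ["Chicago", "LA", "San Francisco"],
}

print(get_by_path(person, "/name"))
print(get_by_path(person, "/citiesVisited/1"))
```

The path acts as a key into the document's tree, so a search hit can be reported as a document plus the path of the matching leaf.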

Document store implementations
• A document store can come in many varieties.
• Simpler document structures are often associated with serialized objects and may use the JavaScript Object Notation (JSON) format.

Document Oriented Databases

 Examples:
 MongoDB
 CouchDB
 DocumentDB


Case study: ad server with MongoDB

 The original motivation behind MongoDB, a popular NoSQL product, was to create a service that would quickly send a banner ad to an area on a web page for millions of users at the same time.
 The primary purpose behind the ad service:
 Quickly select the most appropriate ad for a user and place it on the page in the time it takes a web page to load.


Case study: ad server with MongoDB (Cont.)

 Complex business rules to follow:
 Ad servers should be highly available and run 24/7 with no downtime.
 They need to find the most appropriate ad to send to a web page.
 Ads are selected from a database of ad promotions of paid advertisers that best match the person’s interest.
 Ad servers can’t send the same ad repeatedly.
 They should be able to send ads of a specific type (page area, animation, and so on) in a specific order.
 Finally, ad systems need accurate reporting that shows what ads were sent to which user and which ads the user found interesting enough to click on.

Case study: MongoDB (Cont.)

 MongoDB can be used in some of the following use cases:
• Content management: store web content and photos and use tools such as geolocation indexes to find items.
• Real-time operational intelligence: ad targeting, real-time sentiment analysis, customized customer-facing dashboards, and social media monitoring.
• Product data management: store and query complex and highly variable product data.
• User data management: store and query user-specific data on highly scalable web applications. Used by video games and social network applications.
• High-volume data feeds: store large amounts of real-time data into a central database for analysis, characterized by asynchronous writes to RAM.

Column family (Bigtable) stores
 Overview
 Column family basics
 Understanding column family keys
 Benefits of column family systems
 Case study

Data stores

[Figure: three kinds of data store. Key/value stores (opaque or typed) map each key to a value; document stores (non-shaped or shaped) group key/document pairs into collections; relational databases organize values into tables of rows and columns, with one key per row.]

Relational databases
 Tables (relations) consist of rows and columns.
 Columns have a type. Type information is stored once per column.
 A row contains just the values for a record (no type information).
 All rows in a table have the same columns and are homogeneous.
[Figure: a table whose columns each carry a type and whose rows each hold a key followed by one value per column]
Example rows:
"foo", "bar", 25, 35.63
"bar", "baz", 42, -673.342


Row vs. columnar relational databases

All relational databases deal with tables, rows, and columns.
But there are sub-types:
Row-oriented: they are internally organised around the handling of rows.
Columnar / column-oriented: these mainly work with columns.
Both types usually offer SQL interfaces and produce tables (with rows and columns) as their result sets.
Both types can generally solve the same queries.

Row-oriented storage

In row-oriented databases, row value data is usually stored contiguously:

row0 header | column0 value | column1 value | column2 value | column3 value
row1 header | column0 value | column1 value | column2 value | column3 value
row2 header | column0 value | column1 value | column2 value | column3 value

(the row headers contain record lengths, NULL bits etc.)


Row-oriented storage (Cont.)

Rows are stored sequentially:

Key Fname Lname State Zip Phone Age Sex
1 Bugs Bunny NY 11217 (718) 938-3235 34 M
2 Yosemite Sam CA 95389 (209) 375-6572 52 M
3 Daffy Duck NY 10013 (212) 227-1810 35 M
4 Elmer Fudd ME 04578 (207) 882-7323 43 M
5 Witch Hazel MA 01970 (978) 744-0991 57 F

 Best performance when most queries are for multiple columns of a single row.

Key Lookup in a Row-Oriented Database

Indexes on high-cardinality columns make accessing a single row very fast, e.g. when customer ABC calls customer service.

Index on Key (used for WHERE key=4):
Key RowID
1 0001B008D23A671A
2 0001B008D23A671B
3 0001B008D23A671C
4 0001B008D23A671D
5 0001B008D23A671E

Index on Phone (used for WHERE phone=‘(207) 882-7323’):
Phone RowID
(207) 882-7323 0001B008D23A671D
(209) 375-6572 0001B008D23A671B
(212) 227-1810 0001B008D23A671C
(718) 938-3235 0001B008D23A671A
(978) 744-0991 0001B008D23A671E

Table:
Key Fname Lname State Zip Phone Age Sex
1 Bugs Bunny NY 11217 (718) 938-3235 34 M
2 Yosemite Sam CA 95389 (209) 375-6572 52 M
3 Daffy Duck NY 10013 (212) 227-1810 35 M
4 Elmer Fudd ME 04578 (207) 882-7323 43 M
5 Witch Hazel MA 01970 (978) 744-0991 57 F

But indexes don’t help on analytical queries which scan many rows, e.g. “What’s the average age of males?”
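
The contrast between an indexed single-row lookup and an analytical scan can be sketched as below. This is a minimal in-memory illustration using the sample rows from the slide: Python dictionaries stand in for the indexes, list positions stand in for RowIDs, and the function names are hypothetical.

```python
rows = [
    {"key": 1, "fname": "Bugs", "lname": "Bunny", "phone": "(718) 938-3235", "age": 34, "sex": "M"},
    {"key": 2, "fname": "Yosemite", "lname": "Sam", "phone": "(209) 375-6572", "age": 52, "sex": "M"},
    {"key": 3, "fname": "Daffy", "lname": "Duck", "phone": "(212) 227-1810", "age": 35, "sex": "M"},
    {"key": 4, "fname": "Elmer", "lname": "Fudd", "phone": "(207) 882-7323", "age": 43, "sex": "M"},
    {"key": 5, "fname": "Witch", "lname": "Hazel", "phone": "(978) 744-0991", "age": 57, "sex": "F"},
]

# Indexes map a column value to the position of its row (a stand-in for RowID).
key_index = {row["key"]: pos for pos, row in enumerate(rows)}
phone_index = {row["phone"]: pos for pos, row in enumerate(rows)}


def lookup_by_key(key):
    """Single-row access: one index probe, then one row fetch."""
    return rows[key_index[key]]


def lookup_by_phone(phone):
    return rows[phone_index[phone]]


def average_age(sex):
    """Analytical query: the indexes do not help, so every row is scanned."""
    ages = [row["age"] for row in rows if row["sex"] == sex]
    return sum(ages) / len(ages)


print(lookup_by_key(4)["fname"])
print(lookup_by_phone("(207) 882-7323")["lname"])
print(average_age("M"))
```

The lookups touch one row each, while `average_age` has to visit all five rows even though it needs only two of the columns.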


Column-oriented storage
Column stores store data in column-specific files.
Simplest case: one datafile per column.
Row values for each column are stored contiguously.

[Figure: two data files, column0 and column1, each holding that column's values for rows r0 to r17 in order]


Column-oriented storage (Cont.)

Each column is stored in a separate file:

Key Fname Lname State Zip Phone Age Sex
1 Bugs Bunny NY 11217 (718) 938-3235 34 M
2 Yosemite Sam CA 95389 (209) 375-6572 52 M
3 Daffy Duck NY 10013 (212) 227-1810 35 M
4 Elmer Fudd ME 04578 (207) 882-7323 43 M
5 Witch Hazel MA 01970 (978) 744-0991 57 F

The values of a given row are at the same offset in every column file (auto-indexing).


Column-oriented storage (Cont.)
Column stores can greatly improve the performance of queries that only touch a small number of columns.
This is because they only access the data of those particular columns.
Simple math: table t has a total of 10 GB of data, with
column a: 4 GB
column b: 2 GB
column c: 3 GB
column d: 1 GB
If a query only uses column d, at most 1 GB of data will be processed by a column store.
In a row store, the full 10 GB will be processed.
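
The arithmetic above, together with the column layout it relies on, can be sketched as follows. This is a minimal illustration: the function names are hypothetical, the sizes are the ones from the slide, and the two Python lists stand in for two column files of the sample table.

```python
# Sizes of the columns of table t, in GB (from the example above).
column_sizes_gb = {"a": 4, "b": 2, "c": 3, "d": 1}


def gb_read_by_column_store(columns_used):
    """A column store reads only the files of the columns the query uses."""
    return sum(column_sizes_gb[name] for name in columns_used)


def gb_read_by_row_store(columns_used):
    """A row store reads whole rows, so every column is touched regardless."""
    return sum(column_sizes_gb.values())


print(gb_read_by_column_store(["d"]))
print(gb_read_by_row_store(["d"]))

# Column layout: one list per column, with a given row at the same offset in each.
age = [34, 52, 35, 43, 57]
sex = ["M", "M", "M", "M", "F"]

# "Average age of males" needs only these two columns.
male_ages = [a for a, s in zip(age, sex) if s == "M"]
average_male_age = sum(male_ages) / len(male_ages)
print(average_male_age)
```

Because offsets line up across columns, `zip(age, sex)` reassembles just the part of each row that the query needs.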

Column family

 Column family vs. column-oriented
 A column-family database stores a row with all its column families together.
 A column-oriented database simply stores data tables by column rather than by row.
 Column family stores use the concept of a keyspace (like a schema in the relational model).
 The keyspace contains all the column families (kind of like tables in the relational model), which contain rows and columns.


Example of Column family

 Each row can contain a different number of columns.
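
The keyspace, column family, row, and column hierarchy can be sketched with nested dictionaries. This is a minimal in-memory illustration, not the API of Bigtable, Cassandra, or HBase: the family name "students", the column names, and the helper functions are hypothetical.

```python
# keyspace -> column family -> row key -> {column name: value}
keyspace = {
    "students": {                                   # a column family
        "16s143": {"name": "Sagar", "phone": "9723486"},
        "16s144": {"name": "Nikita"},               # this row has fewer columns
    }
}


def put_column(ks, family, row_key, column, value):
    """Set one column of one row, creating the family and row if needed."""
    ks.setdefault(family, {}).setdefault(row_key, {})[column] = value


def get_column(ks, family, row_key, column, default=None):
    """Read one column of one row; absent columns simply return the default."""
    return ks.get(family, {}).get(row_key, {}).get(column, default)


put_column(keyspace, "students", "16s144", "email", "nikita@example.com")

print(get_column(keyspace, "students", "16s143", "phone"))
print(get_column(keyspace, "students", "16s144", "phone"))
print(sorted(keyspace["students"]["16s144"]))
```

Adding the `email` column to one row did not require touching any other row, which is what lets rows in the same column family differ in width.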

Benefits of Column Family Systems
 Higher Scalability
 Higher Availability
 Easy to Update


Benefits of Column Family Systems: Higher Scalability

 Bigtable-inspired column family systems are designed to scale beyond a single processor.
 As you add more data to your system, your investment will be in the new nodes added to the computing cluster.
 By keeping the interface simple, the back-end system can distribute queries over a large number of processing nodes without performing any join operations.
 With careful design of row IDs and columns, the system gets enough hints to tell where to get related data and to avoid unnecessary network traffic, which is crucial to system performance.

Benefits of Column Family Systems: Higher Availability

 By building a system that scales on distributed networks, you gain the ability to replicate data on multiple nodes in a network.
 Due to efficient communication, the cost of replication is lower.
 The lack of join operations allows you to store any portion of a column family matrix on remote computers.


Benefits of Column Family Systems: Easy to Update

Changing the phone number of the row with key 1 to (718) 852-2352 simply replaces the stored value, in a row-oriented and in a column-oriented layout alike:

Key Fname Lname State Zip Phone Age Sex
1 Bugs Bunny NY 11217 (718) 852-2352 34 M
2 Yosemite Sam CA 95389 (209) 375-6572 52 M
3 Daffy Duck NY 10013 (212) 227-1810 35 M
4 Elmer Fudd ME 04578 (207) 882-7323 43 M
5 Witch Hazel MA 01970 (978) 744-0991 57 F

Column family Limitations

 They work on distributed clusters of computers.
 They may not be appropriate for small datasets.
 You need at least five processors to justify a column family cluster, since many systems are designed to store data on three different nodes for replication.
 They don’t support standard SQL queries for real-time data access.


Case study: Storing analytical information in Bigtable

 Bigtable is used to store website usage information in Google Analytics.
 The Google Analytics service allows you to track who’s visiting your website.
 Viewing a detailed log of all the individual hits on your site would be a long process.
 Google Analytics makes it simple by summarizing the data at regular intervals (such as once a day) and creating reports that allow you to see the total number of visits and the most popular pages that were requested on any given day.


Graph Store
 Overview
 Linking external data
 Use cases
Overview
 A graph store is a system that contains a sequence of nodes and relationships that, when combined, create a graph.
 A graph store has three data fields:
 Nodes,
 Relationships,
 Properties.
 Graph stores are ideal when you have many items that are related to each other in complex ways and these relationships have properties.


Overview (Cont.)
 Graph nodes are usually representations of real-world objects, like nouns.
• People, organizations, telephone numbers, web pages, computers on a network, or even biological cells in a living organism.
 The relationships are connections between these objects.
 They are represented as arcs (lines that connect) between circles in diagrams.


Overview (Cont.)
 Graph queries are similar to traversing nodes in a graph:
o What’s the shortest path between two nodes in a graph?
o What nodes have neighboring nodes that have specific properties?
o Given any two nodes in a graph, how similar are their neighboring nodes?
o What’s the association of various points on a graph with each other?


Graph Stores
 Graph stores are difficult to scale out on multiple servers due to the close connectedness of each node in a graph.
 Data can be replicated on multiple servers to enhance read and query performance.
 But writes to multiple servers and graph queries that span multiple nodes can be complex to implement.
 Interaction methods: load, query, update, and delete.
 A graph query will return a set of nodes that are used to create a graph image on the screen to show you the relationship between your data.

A Graph Example
 You’ll often see links on a page that take you to another page.
 These links can be represented by a graph or triple.
o The current web page is the first or source node.
o The link is the arc that “points to” the second page.
o The second or destination page is the second node.
[Figure: a source web page node linked by arcs to destination web page nodes; each node has a URL property]

Linking external data

Statement is: (Book, has-author, Person123)

Statement is: (Person123, has-name, “Dan”)

When stored in a graph store, the two statements are independent and may even be stored on different systems around the world.
Link metadata
 Group ID the graph belongs to
 The date and time the node was created or last updated
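
The two statements above can be stored as independent triples and then linked at query time. This is a minimal in-memory sketch: the list-of-tuples representation and the helper names `objects` and `author_names` are hypothetical, not the API of any graph store.

```python
# Each statement is a (subject, predicate, object) triple, stored independently.
triples = [
    ("Book", "has-author", "Person123"),
    ("Person123", "has-name", "Dan"),
]


def objects(subject, predicate):
    """Return every object linked to the subject by the given predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def author_names(book):
    """Join the two statements: book -> author node -> author name."""
    return [
        name
        for person in objects(book, "has-author")
        for name in objects(person, "has-name")
    ]


print(author_names("Book"))
```

Because `Person123` is just an identifier shared by both triples, the second statement could come from a different dataset and still be linked to the first.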


Use cases for graph stores
 Link analysis is used when you want to perform searches and look for patterns and relationships in situations such as social networking, telephone, or email records.
 Rules and inference are used when you want to run queries on complex structures such as class libraries, taxonomies, and rule-based systems.
 Integrating linked data is used with large amounts of open linked data to do real-time integration and build mashups without storing data.

Link analysis
 Sometimes the best way to solve a business problem is to traverse graph data.
 As you add new contacts to your friends list, you might want to know if you have any mutual friends.
 You need to get a list of your friends, and for each one of them get a list of their friends (friends-of-friends).
 Relational database: after the initial pass of listing out your friends, the system performance drops dramatically.
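
The friends-of-friends traversal described above can be sketched over an adjacency map held in memory. This is a minimal illustration: the names in `friends` are made-up sample data, and the function names are hypothetical.

```python
# Undirected friendship graph as an adjacency map (hypothetical sample data).
friends = {
    "you": {"ann", "bob"},
    "ann": {"you", "bob", "cara"},
    "bob": {"you", "ann", "dev"},
    "cara": {"ann"},
    "dev": {"bob"},
}


def friends_of_friends(person):
    """People two hops away who are not already direct friends."""
    direct = friends.get(person, set())
    reachable = set()
    for friend in direct:                     # first pass: your friends
        reachable |= friends.get(friend, set())  # second pass: their friends
    return reachable - direct - {person}


def mutual_friends(a, b):
    """Friends that two people have in common."""
    return friends.get(a, set()) & friends.get(b, set())


print(sorted(friends_of_friends("you")))
print(sorted(mutual_friends("you", "cara")))
```

Each hop is a direct lookup of a node's neighbours, whereas a relational schema would typically need a self-join on the friendship table for every extra hop.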

Link analysis (Cont.)
• Graph stores can perform these operations much faster by using techniques that consolidate and remove unwanted nodes from memory.
• Though graph stores would clearly be much faster for link analysis tasks, they usually require enough RAM to store all the links during analysis.

Figure: a social network graph generated by the LinkedIn InMap system. Each person is represented by a circle, and a line is drawn between two people that have a relationship.


Rules and inference
 Suppose you have a website that allows anyone to post restaurant reviews.
 Would there be value in allowing you to indicate which reviewers you trust?
 You’re going out to dinner and you’re considering two restaurants. Each restaurant has positive and negative reviews.
 Can you use simple inference to help you decide which restaurant to visit?
 You could see if your friends reviewed the restaurants. But a more powerful test would be to see if any of your friends-of-friends also reviewed the restaurants.
 If you trust John and John trusts Sue, what can you infer about your ability to trust Sue’s restaurant recommendations?




NoSQL solution for big data
 What is big data problem
 Big data use cases
 Types of big data problems
 Ways that NoSQL systems handle big data problems

What is big data problem?

 Any business problem that’s so large that it can’t be easily managed using a single processor.
 Things to consider:
 Whether you need all of your data or a subset of your data to solve your problem. If you use a subset, ensure the sample you choose is a fair representation of the full dataset.
 How quickly you need your data processed.

Big data use cases

 Bulk image processing
 NASA regularly receives terabytes of incoming data from satellites.
 Medical imaging systems like CT scans and MRIs need to convert raw image data into formats that are useful to doctors and patients.
 Public web page data
 Public web pages contain news stories, RSS feeds, new product information, product reviews, and blog postings.
 Finding out which product reviews are valid is a topic for careful analysis.

Big data use cases (Cont.)

 Remote sensor data
 Devices installed on vehicles track location, speed, acceleration, and fuel consumption.
 Road sensors can warn about traffic jams in real time and suggest alternate routes.
 Sensors can track the moisture in your garden, lawn, and indoor plants to suggest a watering plan for your home.
 Event log data
 Creating logs of read-only events from web page hits (also called clickstreams), email messages sent, or login attempts.
 This helps organizations understand who’s using what resources and when systems may not be performing according to specification.

Big data use cases
110

 Mobile phone data—


 Every time users move to new locations, applications can track these
events.
 You can see when your friends are around you or when customers walk
through your retail store.
 Social media data
 Social networks such as Twitter, Facebook, and LinkedIn provide a continuous real-time data feed that can be used to see relationships and trends.
 Each site creates data feeds that you can use to look at trends in customer mood or get feedback on your own products as well as competitors’ products.



Big data use cases
111

 Game data
 Requires a backend data store that can scale quickly.
 Stores and shares the high scores of all users and the game data of each player.
 Open linked data
 Organizations can publish datasets that other systems can integrate.



Big data use cases
112

 Image and signal processing:
 Focus on efficient and reliable data transformation at scale
 Don’t need query or transaction support
 Solution: a key-value store such as Amazon S3, or a distributed file system (DFS) such as HDFS
 Event log or game data:
 Need to store data in a structure that can be queried and analysed



NoSQL solutions
113

A NoSQL solution for big data needs to:
 Scale linearly with growing data size.
 Be operationally efficient. Organizations can’t afford to hire many people to run the servers.
 Allow reports and analyses to be performed by nonprogrammers using simple tools. Not every business can afford a full-time Java programmer to write on-demand queries.
 Meet the challenges of distributed computing, including consideration of latency between systems and eventual node failures.



Analyzing big data with a shared-nothing architecture
116

a. The left panel shows a shared RAM architecture, where many CPUs access a single
shared RAM over a high-speed bus. This system is ideal for large graph traversal.
b. The middle panel shows a shared disk system, where processors have independent
RAM but share disk using a storage area network (SAN).
c. The right panel shows an architecture used in big data solutions: cache-friendly, using
low-cost commodity hardware, and a shared-nothing architecture.
Analyzing big data with a shared-nothing architecture
117

 Graph store
 Row store
 Column store
 Key-value store
 Document store
Choosing distribution models: Master-Slave versus Peer-to-Peer
118

 In a master-slave configuration, all incoming database requests (reads or writes) are sent to a single master node and redistributed from there.
 The master node is called the NameNode in Hadoop. This node keeps a database of all the other nodes in the cluster and the rules for distributing requests to each node.
 The peer-to-peer model stores all the information about the cluster on each node in the cluster. If any node crashes, the other nodes can take over and processing can continue.
Ways that NoSQL systems handle big data problems
120
 Moving queries to the data, not data to the queries
 Using hash rings to evenly distribute data on a
cluster
 Using replication to scale reads
 Letting the database distribute queries evenly to
data nodes



Ways that NoSQL systems handle big data problems
121
 Moving queries to the data, not data to the queries
 Most NoSQL systems use commodity processors that
each hold a subset of the data on their local shared-
nothing drives
 It’s more efficient to send the query to each node than
it is to transfer large datasets to a central processor.
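
A minimal sketch of the idea above, assuming a hypothetical `DataNode` class and made-up log records (neither comes from a real NoSQL product). Each node runs the query against its own local subset and returns only a small partial result, so the full dataset never travels to a central processor.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Hypothetical shared-nothing node holding its own subset of the records."""
    name: str
    records: list = field(default_factory=list)

    def run_query(self, predicate):
        # The query runs next to the data; only the matching count leaves the node.
        return sum(1 for record in self.records if predicate(record))


def count_matches(nodes, predicate):
    """Send the query to every node and combine the small partial results."""
    return sum(node.run_query(predicate) for node in nodes)


nodes = [
    DataNode("node-1", [{"status": 200}, {"status": 500}]),
    DataNode("node-2", [{"status": 500}, {"status": 404}, {"status": 500}]),
]

errors = count_matches(nodes, lambda r: r["status"] == 500)
print(errors)  # 3
```

Only one integer per node crosses the network in this sketch, regardless of how many records each node stores.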



Ways that NoSQL systems handle big data problems
122
 Using hash rings to evenly distribute data on a cluster
 Determine how to assign pieces of data to a specific processor.
 Key / hash based distribution
 Hash rings
 Challenges:
 When a new server is added
 When a server becomes unreachable



Ways that NoSQL systems handle big data problems
124
 Using hash rings to evenly distribute data on a cluster
 Key or hash based distribution

[Figure: a new server, Server 4, is added to the cluster.]

 Keys would need to be remapped and migrated to new servers.
 Also, the hash function will need to be changed from modulo 4 to modulo 5.
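
The cost of changing the hash function from modulo 4 to modulo 5 can be estimated with a small sketch. The helper names (`stable_hash`, `server_for`, `fraction_remapped`) and the generated keys are assumptions for illustration, and MD5 is used only to get a deterministic hash. A hash value keeps its server only when its remainders modulo 4 and modulo 5 coincide, so roughly four out of five keys should be expected to move.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Deterministic integer hash (MD5 is used only for illustration)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def server_for(key: str, server_count: int) -> int:
    """Key/hash-based distribution: server index = hash(key) mod server_count."""
    return stable_hash(key) % server_count


def fraction_remapped(keys, old_count: int, new_count: int) -> float:
    """Share of keys whose server changes when the cluster is resized."""
    moved = sum(
        1 for k in keys if server_for(k, old_count) != server_for(k, new_count)
    )
    return moved / len(keys)


keys = [f"key-{i}" for i in range(10_000)]
share = fraction_remapped(keys, 4, 5)
print(f"{share:.0%} of keys move when a fifth server is added")
```

This large migration is the motivation for hash rings, where adding or removing a server affects only the keys next to it on the ring.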
Ways that NoSQL systems handle big data problems
125
 Using hash rings to evenly distribute data on a cluster
 Hash rings take the leading bits of a document’s hash value and use them to determine which node the document should be assigned to.
 Servers and keys are hashed by the same hash function.
 E.g.:
 You hash 3 servers
 hash(“10.0.1.1”) = 100
 hash(“10.0.1.2”) = 400
 hash(“10.0.1.3”) = 700
 You hash 3 keys
 hash(“redis”) = 200
 hash(“charsyam”) = 450
 hash(“udemy”) = 50
Ways that NoSQL systems handle big data problems
126
 Using hash rings to evenly distribute data on a cluster
 Store a key on the server whose hash is higher than hash(key) and nearest to it.
 You hash 3 servers
 hash(“10.0.1.1”) = 100
 hash(“10.0.1.2”) = 400
 hash(“10.0.1.3”) = 700
 You hash 3 keys
 hash(“redis”) = 200
 hash(“charsyam”) = 450
 hash(“udemy”) = 50

[Figure: hash ring with servers A (100), B (400), and C (700); keys at 50, 200, and 450; hash(“web”) = 100]



Ways that NoSQL systems handle big data problems
127
 Using hash rings to evenly distribute data on a cluster
 Store a key on the server whose hash is higher than hash(key) and nearest to it.
 You hash 3 servers
 hash(“10.0.1.1”) = 100
 hash(“10.0.1.2”) = 400
 hash(“10.0.1.3”) = 700
 You hash 3 keys
 hash(“redis”) = 200
 hash(“charsyam”) = 450
 hash(“udemy”) = 50

[Figure: hash ring with servers A (100) and C (700); keys at 50, 200, and 450; hash(“web”) = 1000]
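
The placement rule from these slides (store a key on the server with the nearest higher hash) can be sketched as follows. This is an illustrative sketch, not the implementation of any particular database: the `HashRing` class is hypothetical, it takes the slide's example hash values directly instead of computing them, and the wrap-around to the first server for a hash above every server position (such as 1000) is the usual hash-ring convention, assumed here.

```python
import bisect


class HashRing:
    """Minimal hash ring that takes precomputed hash positions, as on the slide."""

    def __init__(self, server_positions: dict):
        # Sorted ring positions, plus a lookup from position back to server name.
        self._positions = sorted(server_positions.values())
        self._server_at = {pos: name for name, pos in server_positions.items()}

    def server_for(self, key_hash: int) -> str:
        # Find the first server position strictly higher than the key's hash...
        index = bisect.bisect_right(self._positions, key_hash)
        # ...wrapping around to the first server when none is higher.
        position = self._positions[index % len(self._positions)]
        return self._server_at[position]


servers = {"10.0.1.1": 100, "10.0.1.2": 400, "10.0.1.3": 700}
key_hashes = {"redis": 200, "charsyam": 450, "udemy": 50, "web": 1000}

ring = HashRing(servers)
for key, key_hash in key_hashes.items():
    print(key, "->", ring.server_for(key_hash))

# If 10.0.1.2 becomes unreachable, only the key it held moves to the next server.
smaller_ring = HashRing({"10.0.1.1": 100, "10.0.1.3": 700})
print("redis ->", smaller_ring.server_for(200))
```

With the slide's values, "redis" lands on 10.0.1.2, "charsyam" on 10.0.1.3, and "udemy" on 10.0.1.1. After 10.0.1.2 is removed, "redis" moves to 10.0.1.3 while the other keys stay where they were, which is the property that modulo-based distribution lacks.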



Ways that NoSQL systems handle big data problems
128
 Using replication to scale reads
 All incoming client requests enter from the
left.
 All reads can be directed to any node,
either a primary read/write node or a
replica node.
 All write transactions can be sent to a
central read/write node that will update
the data and then automatically send the
updates to replica nodes.
 The time between the write to the
primary and the time the update arrives
on the replica nodes determines how long
it takes for reads to return consistent
results.
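
The effect of replication lag on reads can be illustrated with a small simulation. The `ReplicatedStore` class is hypothetical, and it simplifies the slide in two ways that should be read as assumptions: reads go only to replicas (round-robin), and replication happens when `replicate()` is called explicitly rather than automatically.

```python
class ReplicatedStore:
    """Hypothetical primary with read replicas and asynchronous replication."""

    def __init__(self, replica_count: int):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._pending = []  # updates written to the primary but not yet shipped
        self._next_replica = 0

    def write(self, key, value):
        # All writes go to the single read/write node.
        self.primary[key] = value
        self._pending.append((key, value))

    def replicate(self):
        # Ship queued updates to every replica (the replication step).
        for key, value in self._pending:
            for replica in self.replicas:
                replica[key] = value
        self._pending.clear()

    def read(self, key):
        # Reads are spread round-robin across the replicas to scale read load.
        replica = self.replicas[self._next_replica]
        self._next_replica = (self._next_replica + 1) % len(self.replicas)
        return replica.get(key)


store = ReplicatedStore(replica_count=2)
store.write("score", 42)
stale = store.read("score")   # replicas have not received the update yet
store.replicate()
fresh = store.read("score")   # update has arrived, reads are consistent again
print(stale, fresh)  # None 42
```

The window between `write` and `replicate` corresponds to the time during which reads may return outdated results.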
Ways that NoSQL systems handle big data problems
129
 Letting the database distribute queries evenly to data nodes
 All incoming queries arrive at query analyzer nodes.
 These nodes then forward the queries to each data node.
 If they have matches, the documents
are returned to the query node.
 The query won’t return until all data
nodes (or a response from a replica)
have responded to the original query
request.
 If the data node is down, a query can
be redirected to a replica of the data
node.
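
The query flow above can be sketched as a simple scatter-gather loop with replica fallback. `SearchNode`, `NodeDown`, and `query_cluster` are hypothetical names, and the sketch contacts the data nodes one after another for simplicity, whereas a real query analyzer would likely contact them in parallel.

```python
class NodeDown(Exception):
    """Raised by a data node that cannot answer."""


class SearchNode:
    """Hypothetical data node holding a subset of the documents."""

    def __init__(self, documents, up=True):
        self.documents = documents
        self.up = up

    def search(self, term):
        if not self.up:
            raise NodeDown()
        return [doc for doc in self.documents if term in doc]


def query_cluster(shards, term):
    """shards is a list of (data_node, replica) pairs, one pair per data subset."""
    results = []
    for node, replica in shards:
        try:
            matches = node.search(term)
        except NodeDown:
            # Redirect the query to the replica of the unreachable node.
            matches = replica.search(term)
        results.extend(matches)
    # The combined result is returned only after every subset has answered.
    return results


shard_a = ["nosql basics", "graph stores"]
shard_b = ["nosql at scale", "row stores"]
shards = [
    (SearchNode(shard_a), SearchNode(shard_a)),
    (SearchNode(shard_b, up=False), SearchNode(shard_b)),  # primary copy is down
]
print(query_cluster(shards, "nosql"))  # ['nosql basics', 'nosql at scale']
```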
Questions?

130
