DBMS (UNIT-6) (Advances in Databases and Big Data)
Unit VI : Advances in Databases and Big Data
Structured/Unstructured Data
Structured Data
➢ Data that resides in a fixed field within a record or file is
called structured data. This includes data contained in
relational databases.
➢ Emails, by contrast, have the sender, recipient, date, time and other
fixed fields added to the unstructured data of the message content
and any attachments, which makes them semi-structured.
Structured, Semi-structured, and Unstructured data
• Structured data
– Information stored in a DB
– Strict format
– Examples: databases, data warehouses, enterprise systems (ERP)
• Semi-structured data
– Data may have a certain structure, but not all information collected has
identical structure
– Some attributes may exist in some of the entities of a particular type
but not in others
– Examples: XML, e-mail
• Unstructured data
– Very limited indication of data type
– Examples: a simple text document, analog data, GPS tracking
information, audio/video data
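To make the distinction concrete, here is a small sketch in Python (the record and field names are invented for illustration) showing the same employee information in each of the three forms:

```python
import json

# Structured: a fixed-schema row, as it would sit in a relational table.
row = ("E101", "Asha Rao", "2021-04-01", "Pune")  # (id, name, hired, city)

# Semi-structured: records share a broad shape, but some attributes exist
# in one entity and not in another (only the first record has "manager").
records = [
    {"id": "E101", "name": "Asha Rao", "manager": "E090"},
    {"id": "E102", "name": "Ravi Kumar", "skills": ["SQL", "Hadoop"]},
]
print(json.dumps(records, indent=2))

# Unstructured: free text with very limited indication of data type.
note = "Asha joined in April 2021 and now works out of the Pune office."
```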
Limitations of SQL databases
• Scalability: Users have to scale a relational
database up on powerful servers that are expensive
and difficult to manage. To scale out, a relational
database has to be distributed across multiple
servers.
▪ A NoSQL or "Not Only SQL" database provides a mechanism for
storage and retrieval of data that is modeled by means other than the
tabular relations used in relational databases.
• The data structure (e.g. key-value, graph, or document) differs from that
of an RDBMS, and therefore some operations are faster in NoSQL and
some in an RDBMS.
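As a minimal sketch (not tied to any particular product), the key-value structure can be pictured as a dictionary keyed by strings; real stores such as Redis or Amazon DynamoDB add persistence and distribution on top of this idea. The key names below are invented:

```python
# In-memory stand-in for a key-value store: one string key per object.
store = {}

def put(key, value):
    store[key] = value

def get(key):
    # A miss returns None instead of raising, like most KV-store APIs.
    return store.get(key)

put("user:101:name", b"Asha Rao")
print(get("user:101:name"))  # b'Asha Rao'
```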
What is a NoSQL database?
➢ Relational databases were not designed to cope with the
scale and the challenges that face modern applications.
2) Column-oriented databases:
➢ Data is stored in sections of columns, which offers more
flexibility and easy aggregation.
➢ Rather than storing sets of information in a heavily structured table
of columns and rows with uniform-sized fields for each record, as
is the case with relational databases, column-oriented databases
contain one extendable column of closely related data.
➢ i.e. columns are logically grouped into column families.
➢ An RDBMS stores a single row as a contiguous disk entry, so
different rows are stored in different places on disk, while
columnar databases store all the cells corresponding to a column as
a contiguous disk entry, which makes search/access over that
column faster.
➢ Facebook's Cassandra, Google's BigTable, Amazon's SimpleDB,
and HBase are examples that belong to this group.
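A minimal sketch of the layout difference, with Python lists standing in for disk placement (the data is invented):

```python
# Row-oriented layout: each record's fields sit together, so reading one
# full record is cheap, but scanning a single column touches every record.
rows = [
    ("E101", "Asha", 52000),
    ("E102", "Ravi", 61000),
]

# Column-oriented layout: all cells of a column sit together, so an
# aggregate over one column reads contiguous data.
columns = {
    "id":     ["E101", "E102"],
    "name":   ["Asha", "Ravi"],
    "salary": [52000, 61000],
}

# Aggregation over the salary column scans one contiguous list.
print(sum(columns["salary"]) / len(columns["salary"]))  # average salary
```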
3) Document databases:
• The data model consists of document collections where
individual documents can have multiple fields, without
necessarily defining a schema.
• A document store is similar to a key-value store in that
stored objects are addressed by character-string keys.
• The difference is that the values being stored, referred to
as documents, provide some structured encoding such as
XML, JSON, or BSON.
• Document databases pair each key with a complex data
structure known as a document.
• Documents can contain many different key-value pairs,
key-array pairs, or even nested documents.
• The best known and most used are MongoDB and CouchDB.
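As a hedged illustration, the sketch below uses MongoDB's Python driver (PyMongo) to store and query a nested document; the database, collection, and field names are invented for the example, and a MongoDB server is assumed to be running locally:

```python
from pymongo import MongoClient

# Connect to a locally running MongoDB server (assumption for this sketch).
client = MongoClient("mongodb://localhost:27017")
collection = client["demo_db"]["employees"]

# Each key is paired with a complex structure: nested fields and arrays
# can coexist in one document, with no schema declared in advance.
collection.insert_one({
    "_id": "E101",
    "name": "Asha Rao",
    "skills": ["SQL", "MapReduce"],          # key-array pair
    "office": {"city": "Pune", "desk": 42},  # nested document
})

# Query by a nested field.
print(collection.find_one({"office.city": "Pune"}))
```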
4) Graph databases:
• The domain model consists of vertices interconnected by
edges, which creates a rich graph structure.
• Graph stores are used to store information about
networks, such as social connections. Graph stores
include Neo4j, OrientDB and HyperGraphDB.
• Scalability concerns for such connected data are well
addressed by graph stores.
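A minimal sketch of the idea in plain Python (an adjacency list standing in for a graph store; the names are invented): breadth-first search finds the degree of separation between two people, the kind of traversal graph databases optimize.

```python
from collections import deque

# A tiny social graph: vertices are people, edges are "knows" connections.
graph = {
    "Asha": ["Ravi", "Meera"],
    "Ravi": ["Asha", "John"],
    "Meera": ["Asha"],
    "John": ["Ravi"],
}

def degrees_of_separation(start, goal):
    """Breadth-first search over the adjacency list."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        person, depth = queue.popleft()
        if person == goal:
            return depth
        for friend in graph[person]:
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, depth + 1))
    return None  # not connected

print(degrees_of_separation("Meera", "John"))  # 3
```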
✓ There are a large number of companies using NoSQL. To name a few:
➢ Google
➢ Facebook
➢ Mozilla
➢ Adobe
➢ Foursquare
➢ LinkedIn
➢ McGraw-Hill Education
➢ Vermont Public Radio
❖The Benefits of NoSQL
➢ When compared to relational databases, NoSQL databases are
more scalable and provide superior performance, and their data
model addresses several issues that the relational model is not
designed to address.
➢ They handle large volumes of structured, semi-structured, and
unstructured data.
➢ NoSQL databases also trade off "ACID" guarantees (atomicity,
consistency, isolation and durability) in exchange for scale and performance.
❖Advantages of NoSQL databases
1) NoSQL databases generally process data
faster than relational databases.
➢ NoSQL databases are highly preferred for large
data sets (i.e. for big data). HBase is an example.
Comparative Study of SQL and NoSQL
Types
➢ SQL: One type (SQL database) with minor variations.
➢ NoSQL: Many different types, including key-value stores, document
databases, wide-column stores, and graph databases.
Development History
➢ SQL: Developed in the 1970s to deal with the first wave of data
storage applications.
➢ NoSQL: Developed in the 2000s to deal with limitations of SQL
databases, particularly concerning scale, replication and
unstructured data storage.
Examples
➢ SQL: MySQL, Postgres, Oracle Database.
➢ NoSQL: MongoDB, Cassandra, HBase, Neo4j.
Data Storage Model
➢ SQL: Individual records (e.g., "employees") are stored as rows in
tables, with each column storing a specific piece of data about that
record (e.g., "manager," "date hired," etc.), much like a spreadsheet.
Separate data types are stored in separate tables, and then joined
together when more complex queries are executed. For example,
"offices" might be stored in one table, and "employees" in another.
When a user wants to find the work address of an employee, the
database engine joins the "employee" and "office" tables together to
get all the information necessary.
➢ NoSQL: Varies based on database type. For example, key-value
stores function similarly to SQL databases, but have only two
columns ("key" and "value"), with more complex information
sometimes stored within the "value" columns. Document databases
do away with the table-and-row model altogether, storing all
relevant data together in a single "document" in JSON, XML, or
another format, which can nest values hierarchically.
Schemas
➢ SQL: Structure and data types are fixed in advance. To store
information about a new data item, the entire database must be
altered, during which time the database must be taken offline.
➢ NoSQL: Typically dynamic. Records can add new information on
the fly, and, unlike SQL table rows, dissimilar data can be stored
together as necessary.
Development Model
➢ SQL: Mix of open-source (e.g., Postgres, MySQL) and
closed-source (e.g., Oracle Database).
➢ NoSQL: Open-source.
Consistency
➢ SQL: Can be configured for strong consistency.
➢ NoSQL: Depends on product. Some provide strong consistency
(e.g., MongoDB).
Advantages of NoSQL
1: Elastic scaling
2: Big data
Today, enormous volumes of "big data" can be handled by NoSQL
systems, such as Hadoop.
3: Economics
NoSQL databases typically use clusters of cheap
commodity servers to manage the exploding data and
transaction volumes, while RDBMS tends to rely on
expensive proprietary servers and storage systems.
The result is that the cost per gigabyte or
transaction/second for NoSQL can be many times less
than the cost for RDBMS, allowing you to store and
process more data at a much lower price point.
4: Flexible data models
➢ Minor changes to the data model of an RDBMS have to be
carefully managed.
➢ NoSQL databases have far more relaxed -- or even nonexistent -
- data model restrictions.
CAP Theorem
➢ While designing applications for a distributed
architecture, three basic requirements come into play:
➢C-Consistency
➢A-Availability
➢P-Partition tolerance
CAP Theorem (Brewer's Theorem)
➢ It is impossible for a distributed computer system to simultaneously provide all
three of the following guarantees; it can satisfy any two of them at the same
time, but not all three:
➢ C - Consistency: after an update operation, every client sees the same
data at the same time (data consistency).
➢ A - Availability: the system is always on, so the service guarantees
availability.
➢ P - Partition tolerance: the system continues to function even when the
communication among the servers is unreliable.
➢ CA - RDBMS
➢ CP - MongoDB, HBase, Redis
➢ AP - Cassandra, CouchDB, DynamoDB, Riak
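The trade-off can be illustrated with a toy sketch in Python (an illustration of the idea only, not how any real database is implemented): two replicas separated by a network partition, where a CP-style system rejects writes it cannot replicate, while an AP-style system accepts them and tolerates temporary divergence.

```python
class Replica:
    """A toy replica holding a single value."""
    def __init__(self):
        self.value = None

a, b = Replica(), Replica()
partitioned = True  # the replication link between a and b is down

def write_cp(value):
    """CP choice: preserve consistency, sacrifice availability.
    While partitioned, refuse the write rather than let replicas diverge."""
    if partitioned:
        raise RuntimeError("unavailable: cannot reach all replicas")
    a.value = b.value = value

def write_ap(replica, value):
    """AP choice: stay available, sacrifice consistency.
    Accept the write locally; replicas diverge until the partition
    heals and they reconcile (eventual consistency)."""
    replica.value = value

try:
    write_cp("x=1")
except RuntimeError as err:
    print("CP system:", err)           # write rejected during the partition

write_ap(a, "x=1")
write_ap(b, "x=2")
print("AP system:", a.value, b.value)  # diverged: x=1 vs x=2
```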
BASE
(Basically Available, Soft-State, Eventually Consistent)
➢ BASE is the looser counterpart of ACID adopted by many NoSQL
systems: the system is basically available, holds soft (possibly stale)
state, and becomes consistent eventually once updates propagate.
BIG DATA
• Big data is a term for data sets that are so large or complex that
traditional data processing applications are inadequate.
Challenges include analysis, capture, data curation,
search, sharing, storage, transfer, visualization, querying,
updating and information privacy.
3 V’s of Big Data
1-Scale (Volume)
➢ Volume refers to the amount of data; the volume of
data stored in enterprise repositories has grown from
megabytes and gigabytes to petabytes.
2-Complexity (Variety)
➢ Various formats, types, and
structures
➢ Text, numerical, images,
audio, video, sequences, time
series, social media data,
multi-dim arrays, etc…
➢ Static data vs. streaming data
➢ A single application can be
generating/collecting many
types of data
3-Speed (Velocity)
➢ Data is being generated fast and needs to be processed fast.
➢ The term 'velocity' refers to the speed of generation of data. How fast the
data is generated and processed to meet demands determines the real
potential of the data.
➢ Online Data Analytics.
➢ Late decisions ➔ missing opportunities
• Examples
– E-Promotions: Based on your current location, your purchase
history, what you like ➔ send promotions
5 V’s of Big data:
➢ Velocity, Volume, Value, Variety, and Veracity.
1)Velocity-Velocity refers to the speed at which vast amounts of data
are being generated, collected and analyzed. Every day the number of
emails, Twitter messages, photos, video clips, etc. increases at lightning
speed around the world. Every second of every day, data is
increasing. Not only must it be analyzed, but the speed of
transmission and access to the data must also remain instantaneous
to allow for real-time access to websites, credit card verification and
instant messaging. Big data technology now allows us to analyze the
data while it is being generated, without ever putting it into databases.
• 2)Volume-Volume refers to the sheer amount of data being
generated and stored (the "Scale" dimension described earlier).
• 3)Value-When we talk about value, we're referring to the worth
of the data being extracted. Having endless amounts of data is
one thing, but unless it can be turned into value it is
useless. While there is a clear link between data and insights,
this does not always mean there is value in big data.
• 4)Variety-Variety is defined as the different types of data we can
now use. Much of today's data is unstructured; in fact, around 80%
of all the world's data fits into this category, including photos, video
sequences, social media updates, etc. New and innovative big
data technology now allows structured and unstructured
data to be harvested, stored, and used simultaneously.
• 5)Veracity-Veracity is the quality or trustworthiness of the
data. Just how accurate is all this data? For example, think
about all the Twitter posts with hashtags, abbreviations, typos,
etc., and the reliability and accuracy of all that content.
Pillars of Big Data
• Big Table: relational, tabular format
• Big Text: all kinds of unstructured data,
natural language, semantic data
• Big Metadata: data about data, taxonomies,
glossaries, concepts
• Big Graphs: object connections, semantic
discovery, degrees of separation, etc.
Infrastructure Requirements in Big Data
1) Data Acquisition in Big Data
- Data lives in a distributed environment, so the infrastructure must
support carrying very high volumes of data.
- NoSQL databases are often used for acquisition in big data systems.
Benefits of Big Data Processing
➢ The ability to process 'Big Data' brings in multiple benefits, such
as:
• Businesses can utilize outside intelligence while taking
decisions.
Access to social data from search engines and sites
like Facebook and Twitter is enabling organizations to fine-tune
their business strategies.
• Early identification of risks to the product/services, if any.
References
• https://www.xsnet.com/blog/bid/205405/the-v-s-of-big-data-velocity-volume-value-variety-and-veracity
➢ Hadoop is an open-source framework that allows one
to store and process big data in a distributed
environment across clusters of computers using simple
programming models.
Hadoop
➢ Apache Hadoop is a framework that allows for the
distributed processing of large data sets across clusters of
commodity computers using a simple programming model.
Hadoop Architecture
The Hadoop framework includes the following four modules:
➢ Hadoop Common: Java libraries and utilities required by the other
Hadoop modules
➢ Hadoop YARN: a framework for job scheduling and cluster resource
management
➢ Hadoop Distributed File System (HDFS): a distributed file system
providing high-throughput access to application data
➢ Hadoop MapReduce: a YARN-based system for parallel processing
of large data sets
Hadoop Components
Hadoop consists of two core components
➢ The Hadoop Distributed File System (HDFS)
➢ MapReduce Software Framework
Who uses Hadoop?
➢Amazon/A9
➢Facebook
➢Google
➢New York Times
➢Yahoo!
➢…. many more
Core Hadoop Concepts
Applications are written in high-level code
➢ Developers do not worry about network programming,
temporal (time-based) dependencies, etc.
➢ If a node fails, the master will detect that failure and re-assign
the work to a different node on the system.
Advantages of Hadoop
➢ The Hadoop framework allows the user to quickly write and test
distributed systems. It is efficient, and it automatically distributes
the data and work across the machines and, in turn, utilizes the
underlying parallelism of the CPU cores.
HDFS
• The Hadoop Distributed File System (HDFS) was developed using
a distributed file system design. It runs on commodity hardware.
Unlike some other distributed systems, HDFS is highly fault-tolerant
and designed to use low-cost hardware.
Basic Features: HDFS
• Highly fault-tolerant
– Can handle disk crashes, machine crashes, etc...
• High throughput
• Suitable for applications with large data sets
• Can be built out of commodity hardware
• Based on Google's Filesystem GFS
• It is suitable for distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS.
• The built-in servers of the namenode and datanodes help users
easily check the status of the cluster.
• HDFS provides file permissions and authentication.
Fault tolerance
[Figure: HDFS architecture. The Namenode holds metadata (name, replicas, e.g. /home/foo/data, 6, ...); clients issue metadata ops to the Namenode and block ops (read/write) directly to Datanodes, which replicate blocks among themselves.]
HDFS Daemons
➢ Filesystem cluster is managed by three types of processes
➢ Namenode
➢ manages the File System's namespace/meta-data/file blocks
➢ Runs on 1 machine to several machines
➢ Datanode
➢ Stores and retrieves data blocks
➢ Reports to Namenode
➢ Runs on many machines
➢ Secondary Namenode
➢ Performs housekeeping work so the Namenode doesn't have to
➢ Requires hardware similar to the Namenode machine
➢ Not used for high availability – not a backup for the Namenode
Namenode
Master/slave architecture
➢ HDFS cluster consists of a single Namenode, a master server
that manages the file system namespace and regulates access to
files by clients.
Functions of a NameNode:
➢ Regulates client’s access to files.
➢ It also executes file system operations such as renaming, closing,
and opening files and directories.
➢ Manages the File System Namespace
➢ Maps a file name to a set of blocks
➢ Maps a block to the DataNodes where it resides (both mappings
are sketched below)
➢ Cluster Configuration Management
➢ Replication Engine for Blocks
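A toy sketch of those two mappings in Python (the paths, block IDs and Datanode names are invented; a real Namenode keeps this metadata in memory and persists changes to its transaction log):

```python
# file name -> set of blocks
file_to_blocks = {"/home/foo/data": ["blk_1", "blk_2"]}

# block -> DataNodes where its replicas reside (3 replicas each)
block_to_datanodes = {
    "blk_1": ["dn3", "dn7", "dn9"],
    "blk_2": ["dn1", "dn3", "dn8"],
}

def locate(path):
    """Resolve a file to the DataNodes holding each of its blocks."""
    return [(blk, block_to_datanodes[blk]) for blk in file_to_blocks[path]]

print(locate("/home/foo/data"))
```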
Datanode
➢ There are a number of DataNodes, usually one per node
in a cluster.
➢ These nodes manage the data storage of their system.
➢ Datanodes perform read-write operations on the file
systems, as per client request.
➢ They also perform operations such as block creation,
deletion, and replication according to the instructions of
the namenode.
➢ The DataNodes manage storage attached to the nodes
that they run on.
➢ HDFS exposes a file system namespace and allows user
data to be stored in files.
➢ A file is split into one or more blocks, and the set of blocks
is stored in DataNodes.
➢ DataNodes: serves read, write requests, performs block
creation, deletion, and replication upon instruction from
Namenode.
Block
➢ Generally, user data is stored in the files of
HDFS. A file in the file system is divided into
one or more segments (blocks) and stored on individual
data nodes.
➢ Block Report
➢Periodically sends a report of all existing blocks to the
NameNode
➢ A Block Server
➢Stores data in the local file system
➢Stores metadata of a block
➢Serves data and metadata to Clients
Filesystem Metadata
➢ The HDFS namespace is stored by the Namenode.
➢ A Transaction Log
➢ Records file creations, file deletions, etc.
Block Placement
➢Current Strategy
➢One replica on local node
➢Second replica on a remote rack
➢Third replica on same remote rack
➢Additional replicas are randomly placed
➢Clients read from nearest replicas
Heartbeats
➢ DataNodes send heartbeats to the NameNode
➢ Once every 3 seconds
Secondary NameNode
➢ It is responsible for performing periodic checkpoints, so
if the Namenode fails it can be restarted from a snapshot
image stored by the Secondary Namenode's checkpoints.
➢ Copies the FsImage and Transaction Log from the Namenode
to a temporary directory.
➢ Merges the FsImage and Transaction Log into a new
FsImage in the temporary directory.
➢ Uploads the new FsImage to the NameNode.
➢ The Transaction Log on the NameNode is then purged.
Goals of HDFS
➢ Fault detection and recovery: Since HDFS includes a
large amount of commodity hardware, failure of
components is frequent. Therefore HDFS should have
mechanisms for quick, automatic fault detection and
recovery.
HDFS is Good for...
Storing large files
➢ Terabytes, Petabytes, etc...
➢ 100MB or more per file
Streaming data
➢ Write once and read-many times patterns
➢ Optimized for streaming reads rather than random reads
“Cheap” Commodity Hardware
➢ No need for super-computers, use less reliable commodity
hardware
User Interface
➢ Commands for the HDFS user:
➢hadoop dfs -mkdir /foodir
➢hadoop dfs -cat /foodir/myfile.txt
➢hadoop dfs -rm /foodir/myfile.txt
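A few more everyday commands, shown as a sketch (current Hadoop releases prefer the hadoop fs or hdfs dfs form over the older hadoop dfs used above; the paths below are placeholders):
➢hadoop fs -put localfile.txt /foodir
➢hadoop fs -get /foodir/myfile.txt localcopy.txt
➢hadoop fs -ls /foodir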
➢ Web Interface
➢http://host:port/dfshealth.jsp
MapReduce
➢ MapReduce is a framework with which we can write applications
to process huge amounts of data, in parallel, on large clusters of
commodity hardware in a reliable manner.
➢ Map takes a set of data and converts it into another set of data,
where individual elements are broken down into tuples (key/value
pairs).
➢ Map stage : The map or mapper’s job is to process the input data.
Generally the input data is in the form of file or directory and is
stored in the Hadoop file system (HDFS). The input file is passed
to the mapper function line by line. The mapper processes the data
and creates several small chunks of data.
➢ The framework manages all the details of data-passing,
such as issuing tasks, verifying task completion, and
copying data around the cluster between the nodes.
MapReduce - What?
➢ MapReduce is a programming model for efficient
distributed computing
➢ It works like a Unix pipeline
➢ Input | Map | Shuffle & Sort | Reduce | Output
➢ Efficiency from
➢ Streaming through data
➢ Pipelining
➢ A good fit for a lot of applications
➢ Log processing
➢ Web index building
Word Count Example
➢Mapper
➢Input: value: lines of text of input
➢Output: key: word, value: 1
➢Reducer
➢Input: key: word, value: set of counts
➢Output: key: word, value: sum
➢Launching program
➢Defines this job
➢Submits job to cluster
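A minimal runnable sketch of this job in Python (a local simulation in the spirit of Hadoop Streaming; the function names and the use of sorted() for the shuffle & sort stage are our own simplifications, not Hadoop's actual API):

```python
import sys
from itertools import groupby

def mapper(lines):
    """Map: emit a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    """Reduce: sum the counts for each word. Pairs must arrive sorted
    by key, as they do after MapReduce's shuffle & sort stage."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local pipeline: Input | Map | Shuffle & Sort | Reduce | Output
    mapped = sorted(mapper(sys.stdin))  # sorted() stands in for shuffle & sort
    for word, total in reducer(mapped):
        print(word, total)
```

Run locally as `cat input.txt | python wordcount.py`; on a cluster the map and reduce roles run as separate tasks and the framework itself performs the shuffle & sort between them.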
Word Count Dataflow
[Figure: word-count dataflow. An input file of words (size: TByte) containing "Cat", "Bat", "Dog" and other words is divided into splits; each split passes through map and combine, the shuffle routes each key to a reducer, and the reducers write the output partitions part0, part1 and part2.]
References
• http://www.hadoopadmin.co.in/hadoop-administrator/mapreduce/
• https://www.tutorialspoint.com/hadoop/index.htm
END