
Unit - VI:

Advances in Databases and Big Data

Department of Computer Engineering

BRACT’S, Vishwakarma Institute of Information Technology, Pune-48


(An Autonomous Institute affiliated to Savitribai Phule Pune University)
(NBA and NAAC accredited, ISO 9001:2015 certified)

Unit VI : Advances in Databases and Big Data

➢ Introduction to NoSQL, Structured versus Unstructured data


➢ Different NoSQL Data Models
➢ NoSQL using MongoDB, CAP theorem and BASE Properties
➢ Comparative study of SQL and NoSQL
➢ Introduction to Big Data, HADOOP- Building blocks of Hadoop
➢ Components of Hadoop- HDFS, MapReduce, HBASE, HIVE

2
Structured/Unstructured
Data

3
Structured Data
➢ Data that resides in a fixed field within a record or file is
called structured data. This includes data contained in
relational databases.

➢ Structured data first depends on creating a data model – a


model of the types of business data that will be recorded and
how they will be stored, processed and accessed.

➢ This includes defining what fields of data will be stored and


how that data will be stored: data type (numeric, alphabetic,
name, date, address) and any restrictions on the data input
(number of characters; restricted to certain terms such as Mr.,
Ms. or Dr.; M or F).

➢ Structured data has the advantage of being easily entered,


stored, queried and analyzed.
4
➢ Structured data is often managed using Structured
Query Language (SQL) – a programming language
created for managing and querying data in relational
database management systems. SQL was originally
developed by IBM in the early 1970s and later
commercialized by Relational Software, Inc. (now
Oracle Corporation).

➢ Unstructured data is all those things that can't be so


readily classified : photos and graphic images,
videos, streaming instrument data, webpages, pdf
files, PowerPoint presentations, emails, blog entries,
wikis and word processing documents, books,
journals, documents, metadata, health records, audio,
analog data, images, files, body of an e-mail
message.
5
➢ Semi-Structured data is a cross between the two. It is a type
of structured data, but lacks the strict data model structure.
With semi-structured data, tags or other types of markers are
used to identify certain elements within the data, but the data
doesn’t have a strict structure.

➢ For example, word processing software now can include


metadata showing the author's name and the date created, with
the bulk of the document just being unstructured text.

➢ Emails have the sender, recipient, date, time and other fixed
fields added to the unstructured data of the email message
content and any attachments.

➢ Photos or other graphics can be tagged with metadata such as


the creator, date, location and keywords, making it possible to
organize and locate graphics. XML and other markup
languages are often used to manage semi-structured data.
6
Structured Data
➢data is organized in semantic chunks (entities)
➢similar entities are grouped together (relations
or classes)
➢entities in the same group have the same
descriptions (attributes)
➢descriptions for all entities in a group (schema)
➢have the same defined format
➢have a predefined length
➢are all present
➢and follow the same order
7
Semi-Structured Data
➢data is available electronically in
➢database systems
➢file systems, e.g., bibliographic data, Web data
➢semi-structured data
➢similar entities are grouped together
➢entities in same group may not have same
attributes
➢order of attributes not necessarily important
➢not all attributes may be required
➢size of same attributes in a group may differ
➢type of same attributes in a group may differ 8
Unstructured Data
-data can be of any type
-not necessarily following any format or
sequence
-does not follow any rules
-is not predictable
-examples include
text
video
sound
images

9
Structured, Semi-structured, and Unstructured data
• Structured data
– Information stored in a DB
– Strict format
– Example: Databases, Data Warehouses, Enterprise systems (ERP)

• Semi-structured data
– Data may have certain structure but not all information collected has
identical structure
– Some attributes may exist in some of the entities of a particular type
but not in others
– Example: XML, E-Mail

• Unstructured data
– Very limited indication of data type
• Example: a simple text document, analog data, GPS tracking
information, audio/video data.

10
Limitations for SQL database
• Scalability: Relational databases have to be scaled up on
powerful servers that are expensive and difficult to manage.
To scale a relational database further, it has to be
distributed across multiple
servers.

• Complexity: In SQL databases, data has to fit into


tables. If your data does not fit naturally into tables,
you need to design a database structure that becomes
complex and, again, difficult to
manage. 11
Overview

➢ Relational Database Management Systems that use SQL are Schema –


Oriented i.e. the structure of the data should be known in advance ensuring
that the data adheres to the schema.
➢ Examples of such predefined schema based applications that use SQL
include Payroll Management System, Order Processing, and Flight
Reservations.
➢ It is not possible for SQL to process unpredictable and unstructured
information. Big Data applications, however, demand an occurrence-
oriented database that is highly flexible and operates on a schema-less
data model.
➢ SQL databases are vertically scalable – this means that they can only be
scaled by enhancing the horsepower of the underlying hardware,
thereby making it a costly deal for processing large batches of data.
➢ IT enterprises need to increase the RAM, SSD, CPU, etc., on a single server
in order to manage the increasing load on the RDBMS.
➢ With increasing size of the database or increasing numbers of users,
Relational Database Management Systems using SQL suffer from serious
performance bottlenecks – making real-time unstructured data processing a
hard row to hoe.
➢ With Relational Database Management Systems, built-in clustering is
difficult due to the ACID properties of transactions.
12
NOSQL

13
▪ A NoSQL or Not Only SQL database provides a mechanism for
storage and retrieval of data that is modeled other than the tabular
relations used in relational databases.

▪ NoSQL = “Not Only SQL”


Not every data management/analysis problem
is best solved exclusively using a traditional DBMS

• The data structure (e.g. key-value, graph, or document) differs from the
RDBMS, and therefore some operations are faster in NoSQL and some
in RDBMS.

• Motivations for this approach include simplicity of design, horizontal


scaling and finer control over availability.

14
What is NoSQL database
➢ NoSQL databases arose because relational databases were not
designed to cope with the scale and the challenges that
face modern applications.

➢ The basic quality of NoSQL databases is that they may not require


fixed table schemas, usually avoid join operations,
and typically scale horizontally.

➢ NoSQL includes a wide variety of different database


technologies that were developed in response to a rise
in the volume of data stored about users, objects and
products, the frequency with which this data is accessed,
and performance and processing needs.
15
❑ Four main types of NoSQL databases/Data Models
1)Key / Value databases:
• the model is reduced to a simple hash table which consists of key /
value pairs.
• It is often easily distributed across multiple servers. As the name
implies, a key-value store is a system that stores values indexed
for retrieval by keys.
• These systems can hold structured or unstructured data.
• The most famous products of this group include Redis, Dynamo,
Riak, and BerkeleyDB.
• Key-value stores are the simplest NoSQL databases. Every single
item in the database is stored as an attribute name (or "key"),
together with its value.

16
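As a small illustration, the sketch below stores and fetches values by key using the Python redis client. This is only a sketch: it assumes a Redis server is running locally on the default port, and the key names and values are made up for the example.

# pip install redis  (illustrative sketch; assumes a Redis server on localhost:6379)
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Every item is simply a key with an opaque value; there is no schema.
r.set("user:1001:name", "Asha")
r.set("user:1001:city", "Pune")

print(r.get("user:1001:name"))   # -> Asha
print(r.get("user:1001:city"))   # -> Pune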
2)Column-oriented databases:
➢ The data are stored in sections of columns which offers more
flexibility and easy aggregation.
➢ Rather than store sets of information in a heavily structured table
of columns and rows with uniform sized fields for each record, as
is the case with relational databases, column-oriented databases
contain one extendable column of closely related data.
➢ i.e. columns are logically grouped into column families.
➢ An RDBMS stores a single row as a continuous disk entry.
➢ Different rows are stored in different places on disk, while
columnar databases store all the cells corresponding to a column as
a continuous disk entry, which makes search/access faster.
➢ Facebook's Cassandra, Google's BigTable, Amazon's SimpleDB,
and HBase are examples belonging to this group.

17
3)Document databases:
• The data model consists of document collections where
individual documents can have multiple fields, without
necessarily defining a schema.
• A document store is similar to a key-value store in that
stored objects are accessed by character-string keys.
• The difference is that the values being stored, referred
to as documents, use encodings such as XML,
JSON, or BSON.
• Document databases pair each key with a complex data
structure known as a document.
• Documents can contain many different key-value pairs,
or key-array pairs, or even nested documents.
• The best known and used are MongoDB and CouchDB.
18
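A minimal document-store sketch using PyMongo is shown below. It assumes a MongoDB server on localhost:27017; the database, collection and field names are invented for illustration.

# pip install pymongo  (illustrative sketch; assumes MongoDB on localhost:27017)
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["testdb"]

# Documents in the same collection need not share the same fields (no fixed schema).
db.people.insert_one({"FirstName": "XYZ", "Spouse": [{"Name": "Kiran"}]})
db.people.insert_one({"FirstName": "ABC", "Children": [{"Name": "Rihit", "Age": 8}]})

# Query by a nested field using dot notation.
print(db.people.find_one({"Children.Name": "Rihit"}))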
4)Graph databases:
• The domain model consists of vertices interconnected by
edges which creates a rich graph structure.
• Graph stores are used to store information about
networks, such as social connections. Graph stores
include Neo4J, OrientDB and HyperGraphDB.
• Scalability concerns are perfectly addressed by graph stores.

19
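A small graph-store sketch with the official Neo4j Python driver is given below. Assumptions: a Neo4j server reachable at bolt://localhost:7687 and illustrative credentials; the Person nodes and FRIEND_OF relationship are made up for the example.

# pip install neo4j  (illustrative sketch; assumes a local Neo4j server)
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Vertices (nodes) connected by edges (relationships) form the graph.
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:FRIEND_OF]->(b)",
        a="Asha", b="Ravi",
    )
    for record in session.run(
        "MATCH (a:Person)-[:FRIEND_OF]->(b:Person) RETURN a.name AS a, b.name AS b"
    ):
        print(record["a"], "->", record["b"])

driver.close()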
• Categories of NoSQL database

Category: Document Oriented – data is stored as documents.
  Description: an example format may be like FirstName="XYZ",
  Address="St. Xavier's Road", Spouse=[{Name:"Kiran"}],
  Children=[{Name:"Rihit", Age:8}]
  Databases: MongoDB, CouchDB, RethinkDB, RavenDB etc.

Category: XML database – data is stored in XML format.
  Databases: BaseX, eXist, MarkLogic Server etc.

Category: Graph databases – data is stored as a collection of nodes,
  where nodes are analogous to objects in a programming language.
  Nodes are connected using edges.
  Databases: Allegro, Neo4J, OrientDB, Virtuoso.
20
Category: Key-value store – in this category of NoSQL database, a user
  can store data in a schema-less way. A key may be strings, hashes,
  lists, sets or sorted sets, and values are stored against these keys.
  Databases: Dynamo, FoundationDB, MemcacheDB, Redis, Riak etc.

Category: Column store – a column is a key-value pair, where the key is
  an identifier and the value stores values related to the key (identifier).
  Databases: Accumulo, Cassandra, HBase etc.

21
✓A large number of companies use
NoSQL. To name a few:
➢ Google
➢ Facebook
➢ Mozilla
➢ Adobe
➢ Foursquare
➢ LinkedIn
➢ McGraw-Hill Education
➢ Vermont Public Radio

22
❖The Benefits of NoSQL
➢ When compared to relational databases, NoSQL databases are
more scalable and provide superior performance, and their data
model addresses several issues that the relational model is not
designed to address.
➢ Large volumes of structured, semi-structured, and unstructured
data.

➢ Object-oriented programming that is easy to use and flexible .

➢ Efficient, scale-out architecture.

23
➢ NoSQL databases also trade off the “ACID” properties (atomicity,
consistency, isolation and durability).

➢ No schema required: Data can be inserted in a NoSQL


database without first defining a rigid database schema.
This provides immense application flexibility.

➢ Auto elasticity: NoSQL automatically spreads your


data onto multiple servers without requiring
application assistance. Servers can be added or
removed from the data layer automatically.

24
❖Advantages of NoSQL database
1) NoSQL databases generally process data
faster than relational databases.

2) NoSQL databases are also often faster because


their data models are simpler.

3) Major NoSQL systems are flexible enough to


better enable developers to use the
applications in ways that meet their needs.
25
❖ SQL vs NoSQL: High-Level Differences
➢ SQL databases are primarily called Relational
Databases (RDBMS), whereas NoSQL databases are
primarily called non-relational or distributed databases.

➢ SQL databases are table based databases whereas NoSQL


databases are document based, key-value pairs, graph
databases or wide-column stores.

➢ This means that SQL databases represent data in the form of


tables consisting of rows of data, whereas
NoSQL databases are collections of key-value pairs,
documents, graphs or wide-column stores,
which do not have standard schema definitions.
26
➢ SQL databases have predefined schema whereas NoSQL
databases have dynamic schema for unstructured data.

➢ SQL databases are vertically scalable whereas the NoSQL


databases are horizontally scalable.

➢ SQL databases use SQL (Structured Query Language) for


defining and manipulating the data, which is very
powerful.

➢ In NoSQL databases, queries are focused on collections of


documents. This is sometimes called UnQL
(Unstructured Query Language). The syntax of
UnQL varies from database to database.
27
➢ SQL database examples: MySql, Oracle, Postgres, Sqlite
and MS-SQL.

➢ NoSQL database examples: MongoDB, BigTable, Redis,


RavenDb, Cassandra, Hbase, Neo4j and CouchDb

➢ For the type of data to be stored: SQL databases are not the


best fit for hierarchical data storage, but NoSQL
databases fit hierarchical data better, since they
follow the key-value pair way of storing data, similar to
JSON data.

28
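For instance, the hierarchical record sketched below (field names taken from the earlier document-store example; values are illustrative) maps directly onto a single JSON-like document, whereas a relational schema would normally split it into separate person, spouse and children tables joined by foreign keys.

# One nested document holds the whole hierarchy (a sketch; values are illustrative).
employee = {
    "FirstName": "XYZ",
    "Address": "St. Xavier's Road",
    "Spouse": [{"Name": "Kiran"}],
    "Children": [{"Name": "Rihit", "Age": 8}],
}
# Direct access to a nested value, with no joins involved.
print(employee["Children"][0]["Name"])   # -> Rihit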
➢NoSQL database are highly preferred for large
data set (i.e for big data). HBase is an example.

➢For scalability: In most typical situations, SQL


databases are vertically scalable. You can
manage increasing load by increasing the CPU,
RAM, etc, on a single server.

➢On the other hand, NoSQL databases are


horizontally scalable. You can just add a few more
servers easily to your NoSQL database
infrastructure to handle the large traffic.
29
➢ For properties: SQL databases emphasize ACID
properties (Atomicity, Consistency, Isolation and Durability),
whereas NoSQL databases follow Brewer's CAP
theorem (Consistency, Availability and Partition tolerance).

➢ For DB types: On a high level, we can classify SQL databases


as either open-source or closed-source from commercial
vendors. NoSQL databases can be classified, on the basis of the
way of storing data, as graph databases, key-value store
databases, document store databases, column store databases
and XML databases.

30
Comparative Study of
SQL and NOSQL

31
Types
  SQL: One type (SQL database) with minor variations
  NoSQL: Many different types including key-value stores, document
  databases, wide-column stores, and graph databases

Development History
  SQL: Developed in 1970s to deal with the first wave of data storage
  applications
  NoSQL: Developed in 2000s to deal with limitations of SQL databases,
  particularly concerning scale, replication and unstructured data storage

Examples
  SQL: MySQL, Postgres, Oracle Database
  NoSQL: MongoDB, Cassandra, HBase, Neo4j

32
Data Storage Model
  SQL: Individual records (e.g., "employees") are stored as rows in
  tables, with each column storing a specific piece of data about that
  record (e.g., "manager," "date hired," etc.), much like a spreadsheet.
  Separate data types are stored in separate tables, and then joined
  together when more complex queries are executed. For example,
  "offices" might be stored in one table, and "employees" in another.
  When a user wants to find the work address of an employee, the
  database engine joins the "employee" and "office" tables together to
  get all the information necessary.
  NoSQL: Varies based on database type. For example, key-value stores
  function similarly to SQL databases, but have only two columns
  ("key" and "value"), with more complex information sometimes
  stored within the "value" columns. Document databases do away with
  the table-and-row model altogether, storing all relevant data together
  in a single "document" in JSON, XML, or another format, which can
  nest values hierarchically.

33
Schemas
  SQL: Structure and data types are fixed in advance. To store
  information about a new data item, the entire database must be
  altered, during which time the database must be taken offline.
  NoSQL: Typically dynamic. Records can add new information on the
  fly, and unlike SQL table rows, dissimilar data can be stored together
  as necessary.

Scaling
  SQL: Vertically, meaning a single server must be made increasingly
  powerful in order to deal with increased demand. It is possible to
  spread SQL databases over many servers, but significant additional
  engineering is generally required.
  NoSQL: Horizontally, meaning that to add capacity, a database
  administrator can simply add more commodity servers or cloud
  instances. The database automatically spreads data across servers as
  necessary.

34
Development Model
  SQL: Mix of open-source (e.g., Postgres, MySQL) and closed-source
  (e.g., Oracle Database)
  NoSQL: Open-source

Consistency
  SQL: Can be configured for strong consistency
  NoSQL: Depends on product. Some provide strong consistency
  (e.g., MongoDB)

35
Advantages of NoSQL
1: Elastic scaling
2: Big data
Today's volumes of "big data" can be handled by NoSQL
systems, such as Hadoop.

3: Economics
NoSQL databases typically use clusters of cheap
commodity servers to manage the exploding data and
transaction volumes, while RDBMS tends to rely on
expensive proprietary servers and storage systems.
The result is that the cost per gigabyte or
transaction/second for NoSQL can be many times less
than the cost for RDBMS, allowing you to store and
process more data at a much lower price point.

36
4 Flexible data models
➢ Minor changes to the data model of an RDBMS have to be
carefully managed.
➢ NoSQL databases have far more relaxed -- or even nonexistent -
- data model restrictions.

37
CAP Theorem
➢While designing applications for a distributed
architecture, three basic requirements have to be
considered:

➢C-Consistency
➢A-Availability
➢P-Partition tolerance

38
CAP Theorem (Brewer’s Theorem)
➢ it is impossible for a distributed computer system to simultaneously provide all
three of the following guarantees:
➢ Consistency: all nodes see the same data at the same time

➢ Availability: Node failures do not prevent other survivors from continuing to


operate (a guarantee that every request receives a response about whether it
succeeded or failed)

➢ Partition tolerance: the system continues to operate despite arbitrary


partitioning due to network failures (e.g., message loss)

➢ A distributed system can satisfy any two of these guarantees at the same time but
not all three.
➢C-Consistency: after an update operation, every
client should see the same data (data consistency).
➢A-Availability: the system should always be on,
so service availability is guaranteed.
➢P-Partition tolerance: the system continues to
function even when the communication among the
servers is unreliable.

➢CA-RDBMS
➢CP-MongoDB,HBase,Redis
➢AP-Cassandra,CouchDB,DynamoDB,Riak
41
BASE
(Basically Available, Soft-State, Eventually Consistent)

• Basic Availability: the database fulfils requests, even with partial


consistency; the database appears to work most of the time.
• Soft State: the data store may not be in a write-consistent state;
different replicas on different shards do not have to be
mutually consistent at all times.
• Eventual Consistency: at some point in the future, data will
converge to a consistent state; delayed consistency, as opposed to the
immediate consistency of the ACID properties.
• A BASE model normally focuses on availability, as availability is
important for scaling.
• But it does not guarantee data consistency.
References
• http://bigdata.black/infrastructure/storage/sql-nosql-differences/

43
BIG DATA

“Big Data” is data whose scale, diversity, and


complexity require new architecture, techniques,
algorithms, and analytics to manage it and extract
value and hidden knowledge from it…

• Big data is a term for data sets that are so large or complex that
traditional data processing applications are inadequate.
Challenges include analysis, capture, data curation,
search, sharing, storage, transfer, visualization, querying,
updating and information privacy.

44
3 V’s of Big Data
1-Scale (Volume)
➢Volume refers to the amount of data; the volume of
data stored in enterprise repositories has
grown from megabytes and gigabytes to
petabytes.

➢Data volume is increasing exponentially

45
2-Complexity (Variety)
➢ Various formats, types, and
structures
➢ Text, numerical, images,
audio, video, sequences, time
series, social media data,
multi-dim arrays, etc…
➢ Static data vs. streaming data
➢ A single application can be
generating/collecting many
types of data

To extract knowledge ➔ all these types of


data need to be linked together

46
3-Speed (Velocity)
➢ Data is being generated fast and need to be processed fast.
➢ The term 'velocity' refers to the speed of generation of data. How fast the
data is generated and processed to meet the demands, determines real
potential in the data.
➢ Online Data Analytics.
➢ Late decisions ➔ missing opportunities
• Examples
– E-Promotions: Based on your current location, your purchase
history, what you like ➔ send promotions

– Healthcare monitoring: sensors monitoring your activities and body


➔ any abnormal measurements require immediate reaction

47
5 V’s of Big data:
➢ Velocity, Volume, Value, Variety, and Veracity.
1)Velocity-velocity refers to the speed at which vast amounts of data
are being generated, collected and analyzed. Every day the number of
emails, twitter messages, photos, video clips, etc. increases at lightning
speeds around the world. Every second of every day data is
increasing. Not only must it be analyzed, but the speed of
transmission, and access to the data must also remain instantaneous
to allow for real-time access to website, credit card verification and
instant messaging. Big data technology allows us now to analyze the
data while it is being generated, without ever putting it into databases.

2)Volume-Volume refers to the incredible amounts of data generated


each second from social media, cell phones, cars, credit cards, M2M
sensors, photographs, video, etc. The vast amounts of data have
become so large in fact that we can no longer store and analyze data
using traditional database technology.

48
• 3)Value-When we talk about value, we’re referring to the worth
of the data being extracted. Having endless amounts of data is
one thing, but unless it can be turned into value it is
useless. While there is a clear link between data and insights,
this does not always mean there is value in Big Data .
• 4)Variety-Variety is defined as the different types of data we can
now use. Today’s data is unstructured. In fact, 80% of all the
world’s data fits into this category, including photos, video
sequences, social media updates, etc. New and innovative big
data technology is now allowing structured and unstructured
data to be harvested, stored, and used simultaneously.
• 5)Veracity-Veracity is the quality or trustworthiness of the
data. Just how accurate is all this data? For example, think
about all the Twitter posts with hash tags, abbreviations, typos,
etc., and the reliability and accuracy of all that content.

49
Pillars of Big Data
• Big table: relational, tabular format
• Big Text: all kinds of unstructured data,
natural language, semantic data.
• Big metadata: data about data, taxonomies,
glossary, concepts
• Big Graphs: Object connections, semantic
discovery, degree of separation etc.

50
Infrastructure requirements in Big data
1)Data Acquisition in Big data
-data will be in a distributed environment; the infrastructure must
support carrying high volumes of data.
-NoSQL databases are often used in Big data.

2)Data organization in Big data:


-organizing means data integration
-requires good infrastructure so that processing and manipulating
data in the original storage location can be done easily.
-Hadoop-handles large volume of data and keeps data on the
original data storage cluster.
-HDFS used to store web logs.
-MapReduce on cluster 51
3)Data analysis in Big data:
➢ the infrastructure must be able to integrate analysis on
the combination of Big data and traditional enterprise
data.
➢ the infrastructure required for analyzing big data must
be able to support deeper analytics such as statistical
analysis and data mining on variety of data stored in
systems.

52
Benefits of Big Data Processing
➢ Ability to process 'Big Data' brings in multiple benefits, such
as-
• Businesses can utilize outside intelligence while taking
decisions
Access to social data from search engines and sites
like Facebook and Twitter is enabling organizations to fine-tune
their business strategies.

• Improved customer service


Traditional customer feedback systems are getting
replaced by new systems designed with 'Big Data' technologies.
In these new systems, Big Data and natural language processing
technologies are being used to read and evaluate consumer
responses.

53
• Early identification of risk to the product/services, if
any .

• Better operational efficiency


'Big Data' technologies can be used for creating staging
area or landing zone for new data before identifying what data
should be moved to the data warehouse.

54
References
• https://www.xsnet.com/blog/bid/205405/the-v-s-of-big-data-velocity-volume-value-variety-and-veracity

55
56
➢ Hadoop is an open-source framework that allows one
to store and process big data in a distributed
environment across clusters of computers using simple
programming models.

➢ It is designed to scale up from single servers to


thousands of machines, each offering local
computation and storage.

➢ Hadoop was created by Doug Cutting and Mike


Cafarella in 2005.
➢ Cutting, who was working at Yahoo! at the time,
named it after his son's toy elephant.
57
Hadoop history

➢ Hadoop is based on work done by Google in the early 2000s


➢ Specifically, on papers describing the Google File System
(GFS) published in 2003, and MapReduce published in 2004.

➢ This work takes a radical new approach to the problem of


distributed computing
➢ Meets all the requirements – reliability, scalability, etc.

➢ Core concept: distribute the data as it is initially stored in the


system
➢ Individual nodes can work on data local to those nodes
➢ No data transfer over the network is required for initial
Processing.
58
Hadoop

➢ Hadoop is the practical implementation in Java of Google’s


MapReduce model.

➢ Hadoop is open-source, developed by Yahoo!, now distributed by


the Apache Software Foundation.

➢ A software “industry” has grown up around Hadoop

➢ Hadoop is now supplemented by a range of Cloud software


projects, such as Pig, Hive and Zookeeper, etc. Most of these are
also distributed by Apache.

59
Hadoop

➢ Hadoop is a MapReduce software platform.


➢ Provides a framework for running MapReduce applications.
➢ This framework understands and assigns work to the nodes
in a cluster.
➢ Handles the mapping and reduc(e)ing logistics.

➢ Currently takes custom functionality written in Java or Python.

➢ Can use an open-source Eclipse plug‐in to interface with


Hadoop.

60
➢ Apache Hadoop is a framework that allows for the
distributed processing of large data sets across clusters of
commodity computers using a simple programming model.

➢ It is designed to scale up from single servers to thousands of


machines, each providing computation and storage.

➢ Rather than rely on hardware to deliver high-availability, the


framework itself is designed to detect and handle failures at
the application layer, thus delivering a highly-available
service on top of a cluster of computers.

61
Hadoop Architecture
Hadoop framework includes following four modules:

1. Hadoop Common: These are Java libraries and utilities required by


other Hadoop modules. These libraries provide filesystem and OS
level abstractions and contain the necessary Java files and scripts
required to start Hadoop.

2. Hadoop YARN(Yet Another Resource Negotiator): This is a


framework for job scheduling and cluster resource management.

3. Hadoop Distributed File System (HDFS): A distributed file system


that provides high-throughput access to application data.

4. Hadoop MapReduce: This is YARN-based system for parallel


processing of large data sets.

62
Hadoop Components
Hadoop consists of two core components
➢ The Hadoop Distributed File System (HDFS)
➢ MapReduce Software Framework

There are many other projects based around core Hadoop


➢ Often referred to as the ‘Hadoop Ecosystem’
➢ Since 2012, the term "Hadoop" often refers not just to the base
modules mentioned above but also to the collection of additional
software packages that can be installed on top of or alongside Hadoop,
such as Apache Pig, Apache Hive, Apache HBase, Apache Spark ,
Flume, Oozie, Sqoop, etc

➢ A set of machines running HDFS and MapReduce is


known as a Hadoop Cluster
➢ Individual machines are known as nodes
➢ More nodes = better performance! 63
Hadoop: Big Picture

64
Who uses Hadoop?
➢Amazon/A9
➢Facebook
➢Google
➢New York Times
➢yahoo!
➢…. many more

65
66
67
Core Hadoop Concepts
Applications are written in high-level code
➢ Developers do not worry about network programming,
temporal(time-based) dependencies etc.

Nodes talk to each other as little as possible


➢ Developers should not write code which communicates between
nodes
➢ ‘Shared nothing’ architecture

Data is spread among machines in advance


➢ Computation happens where the data is stored, wherever
possible
➢ Data is replicated multiple times on the system for increased
availability and reliability.
68
Fault Tolerance

➢ If a node fails, the master will detect that failure and re-assign
the work to a different node on the system.

➢ Restarting a task does not require communication with nodes


working on other portions of the data.

➢ If a failed node restarts, it is automatically added back to the


system and assigned new tasks.

69
Advantages of Hadoop
➢ Hadoop framework allows the user to quickly write and test
distributed systems. It is efficient, and it automatically distributes the
data and work across the machines and, in turn, utilizes the
underlying parallelism of the CPU cores.

➢ Hadoop does not rely on hardware to provide fault-tolerance and


high availability (FTHA), rather Hadoop library itself has been
designed to detect and handle failures at the application layer.

➢ Servers can be added or removed from the cluster dynamically


and Hadoop continues to operate without interruption.

➢ Another big advantage of Hadoop is that apart from being open


source, it is compatible with all platforms since it is Java based.

70
HDFS

71
• Hadoop File System was developed using distributed
file system design. It runs on commodity hardware. Unlike
other distributed systems, HDFS is highly fault-tolerant and
designed using low-cost hardware.

• HDFS holds very large amounts of data and provides easier


access. To store such huge data, the files are stored across
multiple machines.

• These files are stored in redundant fashion to rescue the


system from possible data losses in case of failure. HDFS
also makes applications available to parallel processing.

72
HDFS

▪ Hadoop makes use of HDFS for data storage - the file


system that spans all the nodes in a Hadoop cluster.

▪ It links together the file systems on many local nodes to


make them into one big file system.

▪ HDFS assumes nodes will fail, so it achieves reliability


by replicating data across multiple nodes.

73
Basic Features: HDFS
• Highly fault-tolerant
– Can handle disk crashes, machine crashes, etc...
• High throughput
• Suitable for applications with large data sets
• Can be built out of commodity hardware
• Based on Google's Filesystem GFS
• It is suitable for the distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS.
• The built-in servers of namenode and datanode help users to
easily check the status of cluster.
• HDFS provides file permissions and authentication.

74
Fault tolerance

• A HDFS instance may consist of thousands of


server machines, each storing part of the file
system’s data.

• Since we have a huge number of components and


each component has a non-trivial probability of
failure, there is always some component
that is non-functional.

• Detection of faults and quick, automatic recovery


from them is a core architectural goal of HDFS.
75
HDFS Architecture

[HDFS architecture diagram: clients send metadata operations (file names and
replica counts, e.g. /home/foo/data, replication 6) to the Namenode, and
read/write block operations to Datanodes; blocks are replicated across
Datanodes placed on different racks (Rack 1, Rack 2).]

76
HDFS Daemons
➢ Filesystem cluster is managed by three types of processes
➢ Namenode
➢ manages the File System's namespace/meta-data/file blocks
➢ Runs on 1 machine to several machines
➢ Datanode
➢ Stores and retrieves data blocks
➢ Reports to Namenode
➢ Runs on many machines
➢ Secondary Namenode
➢ Performs housekeeping work so the Namenode doesn’t have to
➢ Requires similar hardware as the Namenode machine
➢ Not used for high-availability – not a backup for Namenode
77
Namenode
Master/slave architecture
➢ HDFS cluster consists of a single Namenode, a master server
that manages the file system namespace and regulates access to
files by clients.
Functions of a NameNode:-
➢ Regulates client’s access to files.
➢ It also executes file system operations such as renaming, closing,
and opening files and directories.
➢ Manages File System Namespace
➢ Maps a file name to a set of blocks
➢ Maps a block to the DataNodes where it resides
➢ Cluster Configuration Management
➢ Replication Engine for Blocks
78
Datanode
➢ There are a number of DataNodes usually one per node
in a cluster.
➢ These nodes manage the data storage of their system.
➢ Datanodes perform read-write operations on the file
systems, as per client request.
➢ They also perform operations such as block creation,
deletion, and replication according to the instructions of
the namenode.

79
➢ The DataNodes manage storage attached to the nodes
that they run on.
➢ HDFS exposes a file system namespace and allows user
data to be stored in files.
➢ A file is split into one or more blocks and set of blocks
are stored in DataNodes.
➢ DataNodes: serves read, write requests, performs block
creation, deletion, and replication upon instruction from
Namenode.

80
Block
➢ Generally the user data is stored in the files of
HDFS. The file in a file system will be divided into
one or more segments and/or stored in individual
data nodes.

➢ These file segments are called as blocks. In other


words, the minimum amount of data that HDFS can
read or write is called a Block.

➢ The default block size is 64MB, but it can be


changed as needed in the HDFS
configuration.
81
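A hedged sketch of how the block size and replication factor might be set in hdfs-site.xml follows; property names and defaults differ between Hadoop versions (older releases use dfs.block.size, for example), so the values below are purely illustrative.

<!-- hdfs-site.xml (illustrative values only) -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>  <!-- 128 MB block size -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>          <!-- default replication factor per file -->
  </property>
</configuration>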
➢ When a DataNode starts up, it generates a list of all its HDFS
blocks and sends this report to the Namenode: the Blockreport.

➢ Block Report
➢Periodically sends a report of all existing blocks to the
NameNode

➢ A Block Server
➢Stores data in the local file system
➢Stores metadata of a block
➢Serves data and metadata to Clients

➢ Facilitates Pipelining of Data


➢Forwards data to other specified DataNodes
82
File system Namespace

➢Hierarchical file system with directories and


files
➢Create, remove, move, rename etc.
➢Namenode maintains the file system
➢Any meta information changes to the file system
recorded by the Namenode.
➢An application can specify the number of
replicas of the file needed: replication factor of
the file. This information is stored in the
Namenode.
83
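For example, the replication factor of an existing file can be changed from the HDFS shell (the path is illustrative; -w waits until the re-replication completes):

hdfs dfs -setrep -w 3 /foodir/myfile.txt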
Data Replication
➢ HDFS is designed to store very large files across
machines in a large cluster.
➢ Each file is a sequence of blocks.
➢ Blocks are replicated for fault tolerance.
➢ Block size and replicas are configurable per file.
➢ The Namenode receives a Heartbeat and a BlockReport
from each DataNode in the cluster.
➢ BlockReport contains all the blocks on a Datanode.

84
Filesystem Metadata
➢ The HDFS namespace is stored by Namenode.

➢ Namenode uses a transaction log called the EditLog to


record every change that occurs to the filesystem meta
data.
➢ For example, creating a new file.
➢ Change replication factor of a file
➢ EditLog is stored in the Namenode’s local filesystem

➢ Entire filesystem namespace including mapping of blocks


to files and file system properties is stored in a file
FsImage. Stored in Namenode’s local filesystem.
85
➢ Types of metadata
➢List of files
➢List of Blocks for each file
➢List of DataNodes for each block
➢File attributes, e.g. creation time, replication factor

➢ A Transaction Log
➢Records file creations, file deletions etc

86
Block Placement
➢Current Strategy
➢One replica on local node
➢Second replica on a remote rack
➢Third replica on same remote rack
➢Additional replicas are randomly placed
➢Clients read from nearest replicas
Heartbeats
➢ DataNodes send heartbeats to the NameNode
➢Once every 3 seconds

➢ NameNode uses heartbeats to detect DataNode


failure 87
Replication Engine
➢NameNode detects DataNode failures
➢Chooses new DataNodes for new replicas
➢Balances disk usage
➢Balances communication traffic to DataNodes

88
Secondary NameNode
➢ It is responsible for performing periodic checkpoints. So,
if the Namenode fails, it can be replaced with a snapshot
image stored by the Secondary Namenode checkpoints.
➢ Copies FsImage and Transaction Log from the Namenode
to a temporary directory.
➢ Merges FsImage and Transaction Log into a new
FsImage in the temporary directory.
➢ Uploads the new FsImage to the NameNode
➢The Transaction Log on the NameNode is then purged.

89
Goals of HDFS
➢ Fault detection and recovery : Since HDFS includes a
large number of commodity hardware, failure of
components is frequent. Therefore HDFS should have
mechanisms for quick and automatic fault detection and
recovery.

➢ Huge datasets : HDFS should have hundreds of nodes per


cluster to manage the applications having huge datasets.

➢ Hardware at data : A requested task can be done


efficiently, when the computation takes place near the data.
Especially where huge datasets are involved, it reduces the
network traffic and increases the throughput.

90
HDFS is Good for...
Storing large files
➢ Terabytes, Petabytes, etc...
➢ 100MB or more per file
Streaming data
➢ Write once and read-many times patterns
➢ Optimized for streaming reads rather than random reads
“Cheap” Commodity Hardware
➢ No need for super-computers, use less reliable commodity
hardware

91
User Interface
➢ Commands for HDFS users:
➢hadoop dfs -mkdir /foodir
➢hadoop dfs -cat /foodir/myfile.txt
➢hadoop dfs -rm /foodir/myfile.txt

➢ Commands for HDFS Administrator


➢hadoop dfsadmin -report
➢hadoop dfsadmin -decommission datanodename

➢ Web Interface
➢http://host:port/dfshealth.jsp
92
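In recent Hadoop releases the same operations are normally issued through the hdfs dfs command; a few common file operations are sketched below (the paths and file names are illustrative):
➢hdfs dfs -mkdir /foodir
➢hdfs dfs -put localfile.txt /foodir/
➢hdfs dfs -ls /foodir
➢hdfs dfs -cat /foodir/localfile.txt
➢hdfs dfs -get /foodir/localfile.txt ./localcopy.txt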
MapReduce

93
➢ MapReduce is a framework using which we can write applications
to process huge amounts of data, in parallel, on large clusters of
commodity hardware in a reliable manner.

➢ MapReduce is a processing technique and a program model for


distributed computing based on java. The MapReduce algorithm
contains two important tasks, namely Map and Reduce.

➢ Map takes a set of data and converts it into another set of data,
where individual elements are broken down into tuples (key/value
pairs).

➢ Secondly, reduce task, which takes the output from a map as an


input and combines those data tuples into a smaller set of tuples.
As the sequence of the name MapReduce implies, the reduce task
is always performed after the map job. 94
➢ MapReduce program executes in three stages, namely map stage,
shuffle stage, and reduce stage.

➢ Map stage : The map or mapper’s job is to process the input data.
Generally the input data is in the form of file or directory and is
stored in the Hadoop file system (HDFS). The input file is passed
to the mapper function line by line. The mapper processes the data
and creates several small chunks of data.

➢ Reduce stage : This stage is the combination of the Shuffle stage


and the Reduce stage. The Reducer’s job is to process the data
that comes from the mapper. After processing, it produces a new
set of output, which will be stored in the HDFS.

➢ During a MapReduce job, Hadoop sends the Map and Reduce


tasks to the appropriate servers in the cluster.

95
➢The framework manages all the details of data-
passing such as issuing tasks, verifying task
completion, and copying data around the cluster
between the nodes.

➢Most of the computing takes place on nodes with


data on local disks that reduces the network
traffic.

➢After completion of the given tasks, the cluster


collects and reduces the data to form an
appropriate result, and sends it back to the
Hadoop server.
96
MapReduce - Dataflow

97
MapReduce - What?
➢ MapReduce is a programming model for efficient
distributed computing
➢ It works like a Unix pipeline
➢ Input | Map | Shuffle & Sort | Reduce | Output
➢ Efficiency from
➢ Streaming through data
➢ Pipelining
➢ A good fit for a lot of applications
➢ Log processing
➢ Web index building

98
Word Count Example
➢Mapper
➢Input: value: lines of text of input
➢Output: key: word, value: 1

➢Reducer
➢Input: key: word, value: set of counts
➢Output: key: word, value: sum

➢Launching program
➢Defines this job
➢Submits job to cluster
99
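A minimal Hadoop Streaming version of this word count, written in Python, is sketched below (the file names are illustrative and the streaming jar path depends on the installation). The mapper emits word<TAB>1 for each word; because the framework sorts by key between map and reduce, the reducer sees all counts for a word consecutively and sums them.

#!/usr/bin/env python3
# mapper.py -- read lines from stdin, emit "word<TAB>1" for every word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

#!/usr/bin/env python3
# reducer.py -- input arrives sorted by key, so all counts for a word are adjacent
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))

Locally this can be tested as a Unix pipeline that mirrors the Input | Map | Shuffle & Sort | Reduce | Output flow:
cat input.txt | python3 mapper.py | sort | python3 reducer.py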
Word Count Dataflow

100
MapReduce

[MapReduce dataflow diagram: a terabyte-scale input ("Cat", "Bat", "Dog" and
other words) is divided into splits; each split is processed by a map task and a
local combine step, and reduce tasks then produce the output partitions part0,
part1 and part2.]
101
References
• http://www.hadoopadmin.co.in/hadoop-
administrator/mapreduce/
• https://www.tutorialspoint.com/hadoop/inde
x.htm

102
END

103
