
Big Data Analytics & Technologies
CT047-3-M

Overview of Big Data Technologies - Hadoop
Topic & Structure of The Lesson

• The lesson covers:
• Overview of Big Data Technologies
– Hadoop-HDFS
– Hadoop-MapReduce
– NoSQL

CT047-3-M-BDAT - Big Data Analytics & Technologies Overview of Big Data Technologies Slide <2> of 9
Learning Outcomes

• At the end of this topic, you should be able to:
• Demonstrate the theories involved in big data technologies
• Critically evaluate and present technology choices to solve real-world big data and data science problems

Key Terms You Must Be Able To Use

• If you have mastered this topic, you should be able to use the following terms correctly in your assignments and exams:

• Hadoop MapReduce
• Key-value
• Document-based
• NOSQL
• RDBMS

What Technology Do We Have For Big Data?

Hadoop for Big Data

• Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
• It is an open-source data management framework with scale-out storage and distributed processing.

Source: https://hadoop.apache.org/
Hadoop Creation History

Hadoop: Assumptions
• It is written with large clusters of computers in mind and is
built around the following assumptions:
• Hardware will fail.
• Processing will be run in batches. Thus there is an emphasis
on high throughput as opposed to low latency.
• Applications that run on HDFS have large data sets. A typical
file in HDFS is gigabytes to terabytes in size.
• It should provide high aggregate data bandwidth and scale to
hundreds of nodes in a single cluster. It should support tens of
millions of files in a single instance.
• Applications need a write-once-read-many access model.
• Moving Computation is Cheaper than Moving Data.
• Portability is important.
Hadoop Features
• Scalable: Can reliably store and process petabytes.

• Cost effective: Distributes the data and processing across clusters of commonly available computers (in the thousands).

• Efficient: By distributing the data, it can process it in parallel on the nodes where the data is located.

• Flexible: Can easily absorb new data sources and tap into different types of data (structured and unstructured).

• Reliable: Automatically maintains multiple copies of data and automatically redeploys computing tasks on failure.

Apache Hadoop – key components
• Hadoop Common: Common utilities
• (Storage Component) Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access
– Many other data storage approaches are also in use
– e.g., Apache Cassandra, Apache HBase, Apache Accumulo (NSA-contributed)
• (Scheduling) Hadoop YARN: A framework for job scheduling and cluster resource management
• (Processing) Hadoop MapReduce (MR2): A YARN-based system for parallel processing of large data sets
– Other execution engines are increasingly in use, e.g., Spark
• Note:
– All of these key components are OSS under the Apache 2.0 license

David A. Wheeler
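The map → shuffle → reduce flow behind MapReduce can be sketched as a toy in-process imitation in plain Python (a minimal sketch of the programming model only, not Hadoop itself; real Hadoop Streaming jobs use similarly shaped mapper and reducer scripts reading stdin):

```python
from collections import defaultdict

def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle/sort: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts emitted for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big clusters", "data moves to computation"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

In a real cluster the map and reduce phases run on many nodes in parallel and the shuffle moves data over the network; the logic per phase is the same.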
RDBMS vs. Hadoop

Source: Hadoop: The Definitive Guide


Apache Spark

• A newer general framework that solves many of the shortcomings of MapReduce
• It is capable of leveraging the Hadoop ecosystem, e.g. HDFS, YARN, HBase, …
• Has many other workflows, e.g. join, filter, flatMap, distinct, groupByKey, reduceByKey, sortByKey, collect, count, first…
– (around 30 efficient distributed operations)
• In-memory caching of data (for iterative, graph, and machine learning algorithms, etc.)
• Native Scala, Java, Python, and R support
• Supports interactive shells for exploratory data analysis
• The Spark API is extremely simple to use
• Developed at AMPLab, UC Berkeley; now by Databricks.com
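The chained-operation style of the workflows listed above can be illustrated with a toy pure-Python class (a sketch that mimics the shape of a few RDD operations for teaching purposes; it is NOT the real PySpark API and does nothing in parallel):

```python
from functools import reduce
from collections import defaultdict

class MiniRDD:
    """Toy stand-in mimicking a few Spark-style RDD operations (not real PySpark)."""
    def __init__(self, data):
        self.data = list(data)
    def map(self, f):
        return MiniRDD(f(x) for x in self.data)
    def filter(self, f):
        return MiniRDD(x for x in self.data if f(x))
    def reduceByKey(self, f):
        # Group (key, value) pairs by key, then fold each group's values with f
        groups = defaultdict(list)
        for k, v in self.data:
            groups[k].append(v)
        return MiniRDD((k, reduce(f, vs)) for k, vs in groups.items())
    def collect(self):
        return self.data
    def count(self):
        return len(self.data)

words = MiniRDD(["spark", "hdfs", "spark", "yarn"])
pairs = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(dict(pairs.collect()))  # {'spark': 2, 'hdfs': 1, 'yarn': 1}
```

In real Spark the same chain (`map(...).reduceByKey(...)`) is distributed across the cluster, and nothing executes until an action such as `collect()` or `count()` is called.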


NOSQL

• The Name:
– Stands for Not Only SQL
– The term NOSQL was introduced by Carlo Strozzi in 1998 to name his file-based database
– It was re-introduced by Eric Evans when an event was organized to discuss open-source distributed databases
– Eric states that "… but the whole point of seeking alternatives is that you need to solve a problem that relational databases are a bad fit for. …"

Key features (advantages)
– non-relational
– don't require a schema
– data are replicated to multiple nodes (so, identical & fault-tolerant) and can be partitioned:
• down nodes are easily replaced
• no single point of failure
– horizontally scalable
– cheap, easy to implement (open-source)
– massive write performance
– fast key-value access

Disadvantages

– Don't fully support relational features
• no join, group by, order by operations (except within partitions)
• no referential integrity constraints across partitions
– No declarative query language (e.g., SQL) → more programming
– No easy integration with other applications that support SQL

Who is using them?

NOSQL categories

1. Key-value
• Example: DynamoDB, Voldemort, Scalaris
2. Document-based
• Example: MongoDB, CouchDB
3. Column-based
• Example: BigTable, Cassandra, HBase
4. Graph-based
• Example: Neo4j, InfoGrid
• "No-schema" is a common characteristic of most NOSQL storage systems
• They provide "flexible" data types
Key-value

• Focus on scaling to huge amounts of data
• Designed to handle massive load
• Based on Amazon's Dynamo paper
• Data model: (global) collection of key-value pairs
• Dynamo ring partitioning and replication
• Example (DynamoDB):
– items having one or more attributes (name, value)
– an attribute can be single-valued or multi-valued, like a set
– items are combined into a table
Key-value

• Basic API access:
– get(key): extract the value given a key
– put(key, value): create or update the value given its key
– delete(key): remove the key and its associated value
– execute(key, operation, parameters): invoke an operation on the value (given its key), which is a special data structure (e.g. List, Set, Map, etc.)
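The four calls above can be sketched as a minimal in-memory store in Python (a teaching toy, not the client API of any real key-value database; the class and method names are hypothetical):

```python
class KVStore:
    """Toy in-memory key-value store illustrating the basic get/put/delete/execute API."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        # Extract the value given a key (None if absent)
        return self._data.get(key)

    def put(self, key, value):
        # Create or update the value for a key
        self._data[key] = value

    def delete(self, key):
        # Remove the key and its associated value
        self._data.pop(key, None)

    def execute(self, key, operation, *parameters):
        # Invoke an operation on a stored structured value, e.g. append to a List
        getattr(self._data[key], operation)(*parameters)

store = KVStore()
store.put("user:1", {"name": "Ann"})
store.put("cart:1", [])                     # value is a special data structure (List)
store.execute("cart:1", "append", "book")
print(store.get("cart:1"))  # ['book']
store.delete("user:1")
print(store.get("user:1"))  # None
```

Real systems add the distribution features from the previous slide (ring partitioning, replication) underneath this same simple interface.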

Key-value

Key-value

Pros:
– very fast
– very scalable (horizontally distributed to nodes based on key)
– simple data model
– eventual consistency
– fault tolerance

Cons:
– can't model more complex data structures such as objects
Document-based

• Can model more complex objects
• Inspired by Lotus Notes
• Data model: collection of documents
• Document: JSON (JavaScript Object Notation, a data model of key-value pairs that supports objects, records, structs, lists, arrays, maps, dates, and Booleans, with nesting), XML, and other semi-structured formats

Document-based

• Example: (MongoDB) document
– { Name: "Jaroslav",
    Address: "Malostranske nám. 25, 118 00 Praha 1",
    Grandchildren: { Claire: "7", Barbara: "6", Magda: "3", Kirsten: "1", Otis: "3", Richard: "1" },
    Phones: [ "123-456-7890", "234-567-8963" ]
  }
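A document like this round-trips through JSON without any fixed schema, which is the point of the document model. A plain-Python sketch using the standard `json` module (no MongoDB server involved; field values are illustrative):

```python
import json

# A nested document shaped like the MongoDB example above, built as a Python dict
doc = {
    "Name": "Jaroslav",
    "Address": "Malostranske nam. 25, 118 00 Praha 1",
    "Grandchildren": {"Claire": "7", "Barbara": "6", "Magda": "3"},
    "Phones": ["123-456-7890", "234-567-8963"],
}

text = json.dumps(doc)       # serialize the whole nested structure for storage
restored = json.loads(text)  # deserialize; nesting and arrays survive intact

print(restored["Grandchildren"]["Claire"])  # 7
print(restored["Phones"][0])                # 123-456-7890
```

A document store persists and indexes such serialized documents and lets you query into their nested fields.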

Document-based

Column-based
• Based on Google's BigTable paper
• Like column-oriented relational databases (store data in column order)
• Tables are similar to RDBMS tables, but handle semi-structured data
• Data model:
– Collection of column families
– Column family = (key, value), where value = set of related columns (standard, super)
– Indexed by row key, column key, and timestamp
• Allows key-value pairs to be stored (and retrieved by key) in a massively parallel system
– Storing principle: big hashed distributed tables
– Properties: partitioning (horizontal and/or vertical), high availability, etc., completely transparent to the application
• Better described as: extensible records
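The data model above — cells indexed by row key, column family, column, and timestamp — can be sketched with nested dictionaries (a toy illustration of the BigTable-style model, not HBase's or Cassandra's actual API):

```python
from collections import defaultdict

class ColumnStore:
    """Toy BigTable-style store: cells indexed by row key, column family, column, timestamp."""
    def __init__(self):
        # row key -> column family -> column -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, column, value, ts):
        self._rows[row][family].setdefault(column, {})[ts] = value

    def get(self, row, family, column):
        # Return the most recent version of a cell (latest timestamp wins)
        versions = self._rows[row][family].get(column, {})
        return versions[max(versions)] if versions else None

    def get_family(self, row, family):
        # All columns of one family live together, so they are fetched together
        return {c: self.get(row, family, c) for c in self._rows[row][family]}

store = ColumnStore()
store.put("row1", "info", "name", "Ada", ts=1)
store.put("row1", "info", "name", "Ada L.", ts=2)   # newer version of the same cell
store.put("row1", "info", "city", "London", ts=1)
print(store.get("row1", "info", "name"))  # Ada L.
print(store.get_family("row1", "info"))   # {'name': 'Ada L.', 'city': 'London'}
```

Note how a row only stores the columns it actually has, which is how these systems stay efficient despite very sparse tables.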
Column-based
• One column family can have a variable number of columns
• Cells within a column family are sorted "physically"
• Very sparse: most cells have null values
• Comparison: RDBMS vs. column-based NOSQL
– Query on multiple tables
• RDBMS: must fetch data from several places on disk and glue it together
• Column-based NOSQL: only fetch the column families of those columns that are required by a query (all columns in a column family are stored together on disk, so multiple rows can be retrieved in one read operation → data locality)

Column-based

Graph-based

• Focus on modeling the structure of data (interconnectivity)
• Scales to the complexity of the data
• Inspired by mathematical graph theory (G = (V, E))
• Data model:
– (Property graph) nodes and edges
• Nodes may have properties (including ID)
• Edges may have labels or roles
– Key-value pairs on both
• Interfaces and query languages vary
• Single-step vs. path expressions vs. full recursion
• Example:
– Neo4j, FlockDB, Pregel, InfoGrid …
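The property-graph model above — nodes and edges that both carry key-value properties, plus single-step traversal — can be sketched in a few lines of Python (a teaching toy, not any real graph database's API; the node IDs and labels are made up):

```python
class PropertyGraph:
    """Minimal property graph: nodes and edges both carry key-value properties."""
    def __init__(self):
        self.nodes = {}   # node id -> properties (key-value pairs)
        self.edges = []   # (source id, label, destination id, properties)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, src, label, dst, **props):
        self.edges.append((src, label, dst, props))

    def neighbors(self, node_id, label=None):
        # Single-step traversal, optionally filtered by edge label
        return [d for s, l, d, _ in self.edges
                if s == node_id and (label is None or l == label)]

g = PropertyGraph()
g.add_node("alice", age=30)
g.add_node("bob")
g.add_node("acme")
g.add_edge("alice", "KNOWS", "bob", since=2019)   # edge with label and property
g.add_edge("alice", "WORKS_AT", "acme")
print(g.neighbors("alice", "KNOWS"))  # ['bob']
```

Path expressions and full recursion would chain such single-step traversals; real systems like Neo4j expose this through query languages such as Cypher.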

Apache Hive

• "Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy."
Hive is not

• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level updates

Features of Hive

• It stores schemas in a database and processed data in HDFS.
• It is designed for OLAP.
• It provides an SQL-type language for querying, called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.

SQL-on-Hadoop

• Enables the use of SQL commands in Hadoop for accessing and processing big data.
• The Hive data warehouse is one of the earliest applications made to integrate SQL with Hadoop.
• Some other examples of such applications are Apache Drill, Apache Spark, H-SQL, BigSQL, and Tez.
Quick Review Questions

• What technology do we have for big data?
• Explain the difference between NoSQL and relational databases.
• Explain the categories of NOSQL.

Summary of Main Teaching Points

• Hadoop for Big Data
• Key features of NoSQL
• NOSQL categories
• SQL-on-Hadoop

Question and Answer Session

What we will cover next

• Hadoop – HDFS and MapReduce
– Hadoop Framework
– HDFS file system
– Hadoop MapReduce
– Hadoop Streaming
