
Big Data Analytics & Technologies
CT047-3-M

Overview of Big Data Technologies - Hadoop
Topic & Structure of The Lesson

• The lesson covers:
• Overview of Big Data Technologies
– Hadoop-HDFS
– Hadoop-MapReduce
– NoSQL

CT047-3-M-BDAT - Big Data Analytics & Technologies Overview of Big Data Technologies Slide <2> of 9
Learning Outcomes

• At the end of this topic, you should be able to:
• Demonstrate the theories involved in big data technologies
• Critically evaluate and present technology choices to solve real-world big data and data science problems

Key Terms You Must Be Able To Use

• If you have mastered this topic, you should be able to use the following terms correctly in your assignments and exams:

• Hadoop MapReduce
• Key-value
• Document-based
• NOSQL
• RDBMS

What Technology Do We Have For Big Data?

Hadoop for Big Data

• Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
• It is an open-source data management framework with scale-out storage and distributed processing.

Source: https://hadoop.apache.org/
Hadoop Creation History

Hadoop: Assumptions
• It is written with large clusters of computers in mind and is
built around the following assumptions:
• Hardware will fail.
• Processing will be run in batches. Thus there is an emphasis
on high throughput as opposed to low latency.
• Applications that run on HDFS have large data sets. A typical
file in HDFS is gigabytes to terabytes in size.
• It should provide high aggregate data bandwidth and scale to
hundreds of nodes in a single cluster. It should support tens of
millions of files in a single instance.
• Applications need a write-once-read-many access model.
• Moving Computation is Cheaper than Moving Data.
• Portability is important.
Hadoop Features
• Scalable: Can reliably store and process petabytes.

• Cost effective: Distributes the data and processing across clusters of commonly available computers (in the thousands).

• Efficient: By distributing the data, it can process it in parallel on the nodes where the data is located.

• Flexible: Can easily absorb new data sources and tap into different types of data (structured and unstructured).

• Reliable: Automatically maintains multiple copies of data and automatically redeploys computing tasks on failure.

Apache Hadoop – key components
• Hadoop Common: Common utilities
• (Storage Component) Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access
– Many other data storage approaches are also in use
– e.g., Apache Cassandra, Apache HBase, Apache Accumulo (NSA-contributed)
• (Scheduling) Hadoop YARN: A framework for job scheduling and cluster resource management
• (Processing) Hadoop MapReduce (MR2): A YARN-based system for parallel processing of large data sets
– Other execution engines are increasingly in use, e.g., Spark
• Note:
– All of these key components are OSS under the Apache 2.0 license

David A. Wheeler
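The map → shuffle → reduce flow behind MapReduce can be sketched as a toy in-process imitation in plain Python (a minimal sketch of the programming model only, not Hadoop itself; real Hadoop Streaming jobs use similarly shaped mapper and reducer scripts reading stdin):

```python
from collections import defaultdict

def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle/sort: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts emitted for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big clusters", "data moves to computation"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

In a real cluster the map and reduce phases run on many nodes in parallel and the shuffle moves data over the network; the logic per phase is the same.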
RDBMS vs. Hadoop

Source: Hadoop: The Definitive Guide


Apache Spark

• A newer general framework that solves many of the shortcomings of MapReduce
• It is capable of leveraging the Hadoop ecosystem, e.g. HDFS, YARN, HBase, …
• Has many other workflows, e.g. join, filter, flatMap, distinct, groupByKey, reduceByKey, sortByKey, collect, count, first…
– (around 30 efficient distributed operations)
• In-memory caching of data (for iterative, graph, and machine learning algorithms, etc.)
• Native Scala, Java, Python, and R support
• Supports interactive shells for exploratory data analysis
• The Spark API is extremely simple to use
• Developed at AMPLab, UC Berkeley; now by Databricks.com
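The chained-operation style of the workflows listed above can be illustrated with a toy pure-Python class (a sketch that mimics the shape of a few RDD operations for teaching purposes; it is NOT the real PySpark API and does nothing in parallel):

```python
from functools import reduce
from collections import defaultdict

class MiniRDD:
    """Toy stand-in mimicking a few Spark-style RDD operations (not real PySpark)."""
    def __init__(self, data):
        self.data = list(data)
    def map(self, f):
        return MiniRDD(f(x) for x in self.data)
    def filter(self, f):
        return MiniRDD(x for x in self.data if f(x))
    def reduceByKey(self, f):
        # Group (key, value) pairs by key, then fold each group's values with f
        groups = defaultdict(list)
        for k, v in self.data:
            groups[k].append(v)
        return MiniRDD((k, reduce(f, vs)) for k, vs in groups.items())
    def collect(self):
        return self.data
    def count(self):
        return len(self.data)

words = MiniRDD(["spark", "hdfs", "spark", "yarn"])
pairs = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(dict(pairs.collect()))  # {'spark': 2, 'hdfs': 1, 'yarn': 1}
```

In real Spark the same chain (`map(...).reduceByKey(...)`) is distributed across the cluster, and nothing executes until an action such as `collect()` or `count()` is called.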


NOSQL

• The Name:
– Stands for Not Only SQL
– The term NOSQL was introduced by Carlo Strozzi in 1998 to name his file-based database
– It was re-introduced by Eric Evans when an event was organized to discuss open-source distributed databases
– Eric states that "… but the whole point of seeking alternatives is that you need to solve a problem that relational databases are a bad fit for. …"

Key features (advantages)
– non-relational
– don't require a schema
– data are replicated to multiple nodes (so, identical & fault-tolerant) and can be partitioned:
• down nodes are easily replaced
• no single point of failure
– horizontally scalable
– cheap, easy to implement (open-source)
– massive write performance
– fast key-value access

Disadvantages

– Don't fully support relational features
• no join, group by, order by operations (except within partitions)
• no referential integrity constraints across partitions
– No declarative query language (e.g., SQL) → more programming
– No easy integration with other applications that support SQL

Who is using them?

NOSQL categories

1. Key-value
• Example: DynamoDB, Voldemort, Scalaris
2. Document-based
• Example: MongoDB, CouchDB
3. Column-based
• Example: BigTable, Cassandra, HBase
4. Graph-based
• Example: Neo4j, InfoGrid
• "No-schema" is a common characteristic of most NOSQL storage systems
• They provide "flexible" data types
Key-value

• Focus on scaling to huge amounts of data
• Designed to handle massive load
• Based on Amazon's Dynamo paper
• Data model: (global) collection of key-value pairs
• Dynamo ring partitioning and replication
• Example (DynamoDB):
– items having one or more attributes (name, value)
– an attribute can be single-valued or multi-valued, like a set
– items are combined into a table
Key-value

• Basic API access:
– get(key): extract the value given a key
– put(key, value): create or update the value given its key
– delete(key): remove the key and its associated value
– execute(key, operation, parameters): invoke an operation on the value (given its key), which is a special data structure (e.g. List, Set, Map, etc.)
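The four calls above can be sketched as a minimal in-memory store in Python (a teaching toy, not the client API of any real key-value database; the class and method names are hypothetical):

```python
class KVStore:
    """Toy in-memory key-value store illustrating the basic get/put/delete/execute API."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        # Extract the value given a key (None if absent)
        return self._data.get(key)

    def put(self, key, value):
        # Create or update the value for a key
        self._data[key] = value

    def delete(self, key):
        # Remove the key and its associated value
        self._data.pop(key, None)

    def execute(self, key, operation, *parameters):
        # Invoke an operation on a stored structured value, e.g. append to a List
        getattr(self._data[key], operation)(*parameters)

store = KVStore()
store.put("user:1", {"name": "Ann"})
store.put("cart:1", [])                     # value is a special data structure (List)
store.execute("cart:1", "append", "book")
print(store.get("cart:1"))  # ['book']
store.delete("user:1")
print(store.get("user:1"))  # None
```

Real systems add the distribution features from the previous slide (ring partitioning, replication) underneath this same simple interface.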

Key-value

Key-value

Pros:
– very fast
– very scalable (horizontally distributed to nodes based on key)
– simple data model
– eventual consistency
– fault tolerance

Cons:
– can't model more complex data structures such as objects
Document-based

• Can model more complex objects
• Inspired by Lotus Notes
• Data model: collection of documents
• Document: JSON (JavaScript Object Notation, a data model of key-value pairs that supports objects, records, structs, lists, arrays, maps, dates, and Booleans, with nesting), XML, and other semi-structured formats

Document-based

• Example: (MongoDB) document
– { Name: "Jaroslav",
    Address: "Malostranske nám. 25, 118 00 Praha 1",
    Grandchildren: { Claire: "7", Barbara: "6", Magda: "3", Kirsten: "1", Otis: "3", Richard: "1" },
    Phones: [ "123-456-7890", "234-567-8963" ]
  }
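A document like this round-trips through JSON without any fixed schema, which is the point of the document model. A plain-Python sketch using the standard `json` module (no MongoDB server involved; field values are illustrative):

```python
import json

# A nested document shaped like the MongoDB example above, built as a Python dict
doc = {
    "Name": "Jaroslav",
    "Address": "Malostranske nam. 25, 118 00 Praha 1",
    "Grandchildren": {"Claire": "7", "Barbara": "6", "Magda": "3"},
    "Phones": ["123-456-7890", "234-567-8963"],
}

text = json.dumps(doc)       # serialize the whole nested structure for storage
restored = json.loads(text)  # deserialize; nesting and arrays survive intact

print(restored["Grandchildren"]["Claire"])  # 7
print(restored["Phones"][0])                # 123-456-7890
```

A document store persists and indexes such serialized documents and lets you query into their nested fields.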

Document-based

Column-based
• Based on Google's BigTable paper
• Like column-oriented relational databases (store data in column order)
• Tables are similar to RDBMS tables, but handle semi-structured data
• Data model:
– Collection of column families
– Column family = (key, value), where value = set of related columns (standard, super)
– Indexed by row key, column key, and timestamp
• Allows key-value pairs to be stored (and retrieved by key) in a massively parallel system
– Storing principle: big hashed distributed tables
– Properties: partitioning (horizontal and/or vertical), high availability, etc., completely transparent to the application
• Better described as: extensible records
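The data model above — cells indexed by row key, column family, column, and timestamp — can be sketched with nested dictionaries (a toy illustration of the BigTable-style model, not HBase's or Cassandra's actual API):

```python
from collections import defaultdict

class ColumnStore:
    """Toy BigTable-style store: cells indexed by row key, column family, column, timestamp."""
    def __init__(self):
        # row key -> column family -> column -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, column, value, ts):
        self._rows[row][family].setdefault(column, {})[ts] = value

    def get(self, row, family, column):
        # Return the most recent version of a cell (latest timestamp wins)
        versions = self._rows[row][family].get(column, {})
        return versions[max(versions)] if versions else None

    def get_family(self, row, family):
        # All columns of one family live together, so they are fetched together
        return {c: self.get(row, family, c) for c in self._rows[row][family]}

store = ColumnStore()
store.put("row1", "info", "name", "Ada", ts=1)
store.put("row1", "info", "name", "Ada L.", ts=2)   # newer version of the same cell
store.put("row1", "info", "city", "London", ts=1)
print(store.get("row1", "info", "name"))  # Ada L.
print(store.get_family("row1", "info"))   # {'name': 'Ada L.', 'city': 'London'}
```

Note how a row only stores the columns it actually has, which is how these systems stay efficient despite very sparse tables.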
Column-based
• One column family can have a variable number of columns
• Cells within a column family are sorted "physically"
• Very sparse: most cells have null values
• Comparison: RDBMS vs. column-based NOSQL
– Query on multiple tables
• RDBMS: must fetch data from several places on disk and glue it together
• Column-based NOSQL: only fetch the column families of those columns that are required by a query (all columns in a column family are stored together on disk, so multiple rows can be retrieved in one read operation → data locality)

Column-based

Graph-based

• Focus on modeling the structure of data (interconnectivity)
• Scales to the complexity of the data
• Inspired by mathematical graph theory (G = (V, E))
• Data model:
– (Property graph) nodes and edges
• Nodes may have properties (including ID)
• Edges may have labels or roles
– Key-value pairs on both
• Interfaces and query languages vary
• Single-step vs. path expressions vs. full recursion
• Example:
– Neo4j, FlockDB, Pregel, InfoGrid …
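The property-graph model above — nodes and edges that both carry key-value properties, plus single-step traversal — can be sketched in a few lines of Python (a teaching toy, not any real graph database's API; the node IDs and labels are made up):

```python
class PropertyGraph:
    """Minimal property graph: nodes and edges both carry key-value properties."""
    def __init__(self):
        self.nodes = {}   # node id -> properties (key-value pairs)
        self.edges = []   # (source id, label, destination id, properties)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, src, label, dst, **props):
        self.edges.append((src, label, dst, props))

    def neighbors(self, node_id, label=None):
        # Single-step traversal, optionally filtered by edge label
        return [d for s, l, d, _ in self.edges
                if s == node_id and (label is None or l == label)]

g = PropertyGraph()
g.add_node("alice", age=30)
g.add_node("bob")
g.add_node("acme")
g.add_edge("alice", "KNOWS", "bob", since=2019)   # edge with label and property
g.add_edge("alice", "WORKS_AT", "acme")
print(g.neighbors("alice", "KNOWS"))  # ['bob']
```

Path expressions and full recursion would chain such single-step traversals; real systems like Neo4j expose this through query languages such as Cypher.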

Apache Hive

• "Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy."
Hive is not

• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level updates

Features of Hive

• It stores schemas in a database and processed data in HDFS.
• It is designed for OLAP.
• It provides an SQL-type language for querying, called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.

SQL-on-Hadoop

• Enables the use of SQL commands in Hadoop for accessing and processing big data.
• The Hive data warehouse is one of the earliest applications made to integrate SQL with Hadoop.
• Some other examples of such applications are Apache Drill, Apache Spark, H-SQL, BigSQL, and Tez.
Quick Review Questions

• What technology do we have for big data?
• Explain the difference between NoSQL and relational databases.
• Explain the categories of NOSQL.

Summary of Main Teaching Points

• Hadoop for Big Data
• Key features of NoSQL
• NOSQL categories
• SQL-on-Hadoop

Question and Answer Session

What we will cover next

• Hadoop – HDFS and MapReduce
– Hadoop Framework
– HDFS file system
– Hadoop MapReduce
– Hadoop Streaming
