5-Overview of Big Data Technologies - Hadoop
Technologies
CT047-3-M
CT047-3-M-BDAT - Big Data Analytics & Technologies Overview of Big Data Technologies Slide <2> of 9
Learning Outcomes
Key Terms You Must Be Able To Use
• If you have mastered this topic, you should be able to use the
following terms correctly in your assignments and exams:
• Hadoop MapReduce
• Key-value
• Document-based
• NOSQL
• RDBMS
What Technology Do We Have for Big Data?
Hadoop for Big Data
Source: https://hadoop.apache.org/
Hadoop Creation History
Hadoop: Assumptions
• Hadoop is written with large clusters of computers in mind and is
built around the following assumptions:
• Hardware will fail.
• Processing will be run in batches. Thus there is an emphasis
on high throughput as opposed to low latency.
• Applications that run on HDFS have large data sets. A typical
file in HDFS is gigabytes to terabytes in size.
• It should provide high aggregate data bandwidth and scale to
hundreds of nodes in a single cluster. It should support tens of
millions of files in a single instance.
• Applications need a write-once-read-many access model.
• Moving Computation is Cheaper than Moving Data.
• Portability is important.
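The "large data sets" assumption can be made concrete with a little arithmetic. The sketch below is illustrative only: the 128 MiB block size and the replication factor of 3 are assumed values, not figures from these slides, and the function name hdfs_footprint is hypothetical.

```python
# Assumed values for illustration; a real cluster may be configured differently.
BLOCK_SIZE = 128 * 1024**2   # assumed block size: 128 MiB
REPLICATION = 3              # assumed number of replicas per block


def hdfs_footprint(file_size_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (blocks, block replicas, raw bytes stored) for one file."""
    # Integer ceiling division: a partially filled last block still counts as a block.
    blocks = -(-file_size_bytes // block_size)
    return blocks, blocks * replication, file_size_bytes * replication


one_tib = 1024**4
blocks, replicas, raw_bytes = hdfs_footprint(one_tib)
print(f"{blocks} blocks, {replicas} block replicas, {raw_bytes // 1024**4} TiB raw")
```

Under these assumptions a single 1 TiB file is split into thousands of blocks spread over many nodes, which is why it is cheaper to send the computation to the blocks than to pull the blocks to the computation.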
Hadoop Features
• Scalable: Can reliably store and process petabytes.
• Flexible: Can easily access new data sources and tap into
different types of data (structured and unstructured).
Apache Hadoop – key components
• Hadoop Common: The common utilities that support the other Hadoop modules
• (Storage Component) Hadoop Distributed File System (HDFS): A
distributed file system that provides high-throughput access to application data
– Many other data storage approaches are also in use
– E.g., Apache Cassandra, Apache HBase, Apache Accumulo (NSA-contributed)
• (Scheduling) Hadoop YARN: A framework for job scheduling and
cluster resource management.
• (Processing) Hadoop MapReduce (MR2): A YARN-based system for
parallel processing of large data sets
– Other execution engines increasingly in use, e.g., Spark
• Note:
– All of these key components are OSS under Apache 2.0 license
David A. Wheeler
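The MapReduce processing model listed above can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce steps, not the Hadoop API; all function names are hypothetical, and a real job would run the map and reduce tasks in parallel across cluster nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data needs big clusters", "data moves slowly"])
print(counts)
```

Because each map call sees only one line and each reduce call sees only one key, both steps can be distributed without coordination between tasks.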
RDBMS vs. Hadoop
NOSQL
• The Name:
– Stands for Not Only SQL
– The term NOSQL was introduced by Carlo Strozzi
in 1998 to name his file-based database
– It was reintroduced by Eric Evans in 2009 when an
event was organized to discuss open-source
distributed databases
– Eric states that “… but the whole point of seeking
alternatives is that you need to solve a problem
that relational databases are a bad fit for. …”
Key features (advantages)
– non-relational
– don’t require a schema
– data are replicated to multiple
nodes (so, identical & fault-tolerant)
and can be partitioned:
• down nodes easily replaced
• no single point of failure
– horizontally scalable
– cheap, easy to implement
(open-source)
– massive write performance
– fast key-value access
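The replication point above can be sketched as follows. This is a toy in-memory model, not the behaviour of any particular NOSQL product: the class ReplicatedStore, its node names, and the choice of two replicas are all assumptions made for illustration, and writes ignore node failures to keep the sketch short.

```python
import hashlib


class ReplicatedStore:
    """Toy store that writes each key to several nodes (hypothetical design)."""

    def __init__(self, node_names, replicas=2):
        self.order = sorted(node_names)
        self.nodes = {name: {} for name in self.order}
        self.replicas = replicas
        self.down = set()  # names of nodes currently marked as failed

    def owners(self, key):
        """Pick `replicas` consecutive nodes, starting from a hash of the key."""
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        start = int(digest, 16) % len(self.order)
        return [self.order[(start + i) % len(self.order)] for i in range(self.replicas)]

    def put(self, key, value):
        # Simplification: every owner receives the write, even if marked down.
        for name in self.owners(key):
            self.nodes[name][key] = value

    def get(self, key):
        # Read from the first owner that is still up.
        for name in self.owners(key):
            if name not in self.down:
                return self.nodes[name][key]
        raise RuntimeError("all replicas for this key are down")


store = ReplicatedStore(["node-a", "node-b", "node-c"], replicas=2)
store.put("user:1", "Ada")
store.down.add(store.owners("user:1")[0])  # simulate losing one replica
print(store.get("user:1"))                 # still served by the second replica
```

Losing one node does not lose the key, which is the sense in which replicated data has no single point of failure.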
Disadvantages
Who is using them?
NOSQL categories
1. Key-value
• Example: DynamoDB, Voldemort, Scalaris
2. Document-based
• Example: MongoDB, CouchDB
3. Column-based
• Example: BigTable, Cassandra, HBase
4. Graph-based
• Example: Neo4j, InfoGrid
• “No-schema” is a common characteristic
of most NOSQL storage systems
• Provide “flexible” data types
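The "no-schema" and "flexible data types" points can be illustrated with a small document-style collection. This is a plain-Python sketch, not the API of MongoDB or CouchDB; the sample records and the find helper are hypothetical.

```python
# Two documents in the same collection with different fields and nested values.
# No schema is declared anywhere; each document carries its own structure.
collection = [
    {"id": 1, "name": "Ada", "email": "ada@example.com"},
    {"id": 2, "name": "Lin", "tags": ["admin", "ops"], "address": {"city": "Penang"}},
]


def find(docs, **criteria):
    """Return documents whose top-level fields equal every given criterion."""
    return [doc for doc in docs if all(doc.get(k) == v for k, v in criteria.items())]


print(find(collection, name="Lin"))
# A field that only some documents have is simply absent from the others.
print([doc.get("email") for doc in collection])
```

In a relational table both rows would need the same columns; here a missing field is not an error, which is what makes adding new kinds of data cheap.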
Key-value
Pros:
– very fast
– very scalable (horizontally distributed to nodes based on
key)
– simple data model
– eventual consistency
– fault-tolerance
Cons:
– Can’t model more complex data structures such
as objects
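The "horizontally distributed to nodes based on key" point can be sketched as a sharded store. This is a minimal in-memory illustration under assumed names (ShardedKeyValueStore, four shards, CRC32 as the hash); real key-value systems use more elaborate partitioning schemes.

```python
import zlib


class ShardedKeyValueStore:
    """Toy key-value store that routes each key to one shard by hashing it."""

    def __init__(self, num_shards=4):
        # Each dict stands in for one node of the cluster.
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # A stable hash of the key alone decides where the value lives.
        return zlib.crc32(key.encode("utf-8")) % len(self.shards)

    def put(self, key, value):
        self.shards[self._shard_for(key)][key] = value

    def get(self, key, default=None):
        return self.shards[self._shard_for(key)].get(key, default)


store = ShardedKeyValueStore()
for i in range(100):
    # Values are opaque bytes: the store cannot look inside them.
    store.put(f"user:{i}", f"profile-{i}".encode("utf-8"))

print(store.get("user:42"))
print([len(shard) for shard in store.shards])
```

Lookups by key touch exactly one shard, which is where the speed comes from; the cost is that the store knows nothing about the structure of the values, matching the limitation listed under Cons.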
Document-based
Column-based
• Based on Google’s BigTable paper
• Like column-oriented relational databases (store data in column order)
• Tables similar to an RDBMS, but handle semi-structured data
• Data model:
– Collection of Column Families
– Column family = (key, value) where value = set of related columns (standard, super)
– indexed by row key, column key and timestamp
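The data model above (row key, column key, timestamp) can be sketched with nested dictionaries. This is an illustrative in-memory structure only, not the storage format of BigTable, Cassandra or HBase; the class and method names are hypothetical.

```python
class ColumnFamilyTable:
    """Toy table: row key -> column family -> column -> {timestamp: value}."""

    def __init__(self, families):
        self.families = set(families)  # families are declared up front
        self.rows = {}

    def put(self, row_key, family, column, value, timestamp):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Columns inside a family are created on demand, one cell per timestamp.
        versions = (
            self.rows.setdefault(row_key, {})
            .setdefault(family, {})
            .setdefault(column, {})
        )
        versions[timestamp] = value

    def get(self, row_key, family, column, timestamp=None):
        versions = self.rows[row_key][family][column]
        if timestamp is None:
            timestamp = max(versions)  # default to the newest version
        return versions[timestamp]


table = ColumnFamilyTable(["info"])
table.put("user1", "info", "email", "old@example.com", timestamp=1)
table.put("user1", "info", "email", "new@example.com", timestamp=2)
print(table.get("user1", "info", "email"))               # newest version
print(table.get("user1", "info", "email", timestamp=1))  # older version
```

Each cell is addressed by the full triple, so an update adds a new timestamped version rather than overwriting the old one.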
• One column family can have variable
numbers of columns
• Cells within a column family are sorted “physically”
• Very sparse, most cells have null values
• Comparison: RDBMS vs column-based NOSQL
– Query on multiple tables
• RDBMS: must fetch data from several places on disk and
glue together
• Column-based NOSQL: only fetch column families of those
columns that are required by a query (all columns in a
column family are stored together on the disk, so multiple
rows can be retrieved in one read operation; this is data locality)
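The comparison above can be sketched by keeping one store per column family and reading only the family a query needs. The layout, the sample values and the scan_family helper are assumptions for illustration, not the on-disk format of any real system.

```python
# One store per column family: family -> {row key: {column: value}}.
# Rows are sparse: a row appears in a family only if it has cells there.
families = {
    "profile": {
        "u1": {"name": "Ada"},
        "u2": {"name": "Lin", "city": "Penang"},
    },
    "activity": {
        "u1": {"last_login": "day-5"},
    },
}


def scan_family(family, start_row, end_row):
    """Read rows in [start_row, end_row) from a single column family."""
    rows = families[family]
    # Only the requested family is touched; row keys are visited in sorted order.
    return {key: rows[key] for key in sorted(rows) if start_row <= key < end_row}


print(scan_family("profile", "u1", "u3"))   # several rows, one family
print(scan_family("activity", "u1", "u3"))  # u2 has no cells here, so it is absent
```

A query that needs only profile columns never reads the activity store, and empty cells cost nothing because they are not stored at all.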
Graph-based
Apache Hive
Hive is not
• A relational database
• A design for Online Transaction
Processing (OLTP)
• A language for real-time queries and
row-level updates
Features of Hive
SQL-on-Hadoop
Summary of Main Teaching Points
Question and Answer Session
What we will cover next