MapReduce Debates and Schema-Free
Woohyun Kim
The creator of open source “Coord”
(http://www.coordguru.com)
2010-03-03
Store
• SAN
• HDFS
• HBase, Voldemort, MongoDB, Cassandra
Process
• SQL
• MapReduce
• Pig
• Hive, CloudBase
• HadoopDB
Analyze
• OLAP
• Visualization
• Reporting
Retrieve
• SQL
• RESTful
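[Figure: ETL. A tree of weighted sums (∑) over the inputs Open100_write_cnt, Answer_cnt, Question_cnt, confidence, and popularity, with intermediate nodes "amount" and "quality". Edge weights shown: 0.5, 0.5, 0.3, 0.1, 0.7, 0.3, 0.6.]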
• Buddy: pt_buddy.csv, cal_buddy_cnt.cpp
• Buddy * Count: pt_count.csv, att_visit_count.cpp
• Buddy/Count * PowerBlogger: pt_power_blog1.csv, att_is_powerblogger.cpp
• Buddy/Count/PowerBlogger * Comment: pt_comment1.csv, att_commenting.cpp
• Blogger
• Speed
• The seek time of physical storage is not keeping pace with improvements in network
speeds
“New Relations”
• Integration
• Today’s data processing tasks increasingly have to access and combine data from
many different non-relational sources, often over a network
Hadoop Revolution
[Figure: structured data addressed by Row, Column Family, and Timestamp]
Alternatives
• Map-side Join
• Mapper-only job to avoid sort and to reduce data movement across the
network
• Semi-Join
• Shrink the data size through a semi-join (by preprocessing)
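The map-side join named above can be sketched in a few lines. This is a minimal illustration in plain Python rather than Hadoop code; it assumes the smaller input fits in memory on every mapper, and the function names and record layouts (`load_small_table`, `map_side_join`, ratings as `(user_id, movie_id, rating)`) are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

Rating = Tuple[int, int, float]  # (user_id, movie_id, rating), assumed layout


def load_small_table(rows: Iterable[Tuple[int, str]]) -> Dict[int, str]:
    """Build the in-memory lookup each mapper would hold (movie_id -> title)."""
    return dict(rows)


def map_side_join(
    ratings: Iterable[Rating], movies_by_id: Dict[int, str]
) -> Iterator[Tuple[int, int, str, float]]:
    """Mapper-only join: stream the large input and probe the in-memory table.

    No sort or reduce phase is involved, so no records are shuffled
    across the network.
    """
    for user_id, movie_id, rating in ratings:
        title = movies_by_id.get(movie_id)
        if title is not None:  # inner join: drop ratings with no matching movie
            yield (user_id, movie_id, title, rating)


movies = load_small_table([(1, "A"), (2, "B")])
joined = list(map_side_join([(10, 1, 5.0), (11, 3, 2.0), (12, 2, 4.0)], movies))
```

The design trade-off is memory for network traffic: every mapper holds a full copy of the small table, which is only reasonable when that table is small.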
Semi-Join
• Extract – the unique IDs referenced in a larger input source (A)
• Mapper: extract Movie IDs from Ratings records
• Reducer: accumulate all unique Movie IDs
• Filter – the other larger input source (B) with the referenced unique IDs
• Mapper: keep only the referenced Movie IDs from the full Movie dataset
• Join – the larger input source (A) with the filtered dataset
• Mapper: do a map-side join
• Ratings records & the filtered Movie dataset
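The three phases above can be sketched as ordinary Python functions. This is an illustrative sketch only, not Hadoop code: each function stands in for one job, and the names (`extract_ids`, `filter_movies`, `join_ratings`) and record layouts are assumptions.

```python
from typing import Dict, Iterable, List, Set, Tuple

Rating = Tuple[int, int, float]  # (user_id, movie_id, rating), assumed layout
Movie = Tuple[int, str]          # (movie_id, title), assumed layout


def extract_ids(ratings: Iterable[Rating]) -> Set[int]:
    """Extract phase: the mapper emits movie IDs, the reducer keeps unique ones."""
    emitted = (movie_id for _, movie_id, _ in ratings)  # mapper output
    return set(emitted)                                 # reducer output


def filter_movies(movies: Iterable[Movie], referenced: Set[int]) -> Dict[int, str]:
    """Filter phase: keep only the movies that some rating refers to."""
    return {movie_id: title for movie_id, title in movies if movie_id in referenced}


def join_ratings(
    ratings: Iterable[Rating], filtered: Dict[int, str]
) -> List[Tuple[int, int, str, float]]:
    """Join phase: map-side join of ratings against the shrunken movie table."""
    return [(u, m, filtered[m], r) for u, m, r in ratings if m in filtered]


ratings = [(10, 1, 5.0), (11, 1, 3.0), (12, 2, 4.0)]
movies = [(1, "A"), (2, "B"), (3, "C")]

referenced = extract_ids(ratings)
filtered = filter_movies(movies, referenced)
result = join_ratings(ratings, filtered)
```

In this example movie 3 is never rated, so the filter phase drops it before the join and the table shipped to the mappers shrinks from three rows to two.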
MapReduce Debates
• Missing most of the features that are routinely included in current DBMSs
• MapReduce provides only a sliver of the functionality found in modern DBMSs
• Bulk loader – transform input data in files into a desired format and load it into a DBMS
• Indexing – hash or B-Tree indexes
• Updates – change the data in the database
• Transactions – support parallel update and recovery from failures during update
• Integrity constraints – help keep garbage out of the database
• Referential integrity – again, help keep garbage out of the database
• Views – so the schema can change without having to rewrite the application program
• Incompatible with all of the tools DBMS users have come to depend on
• MapReduce cannot use the tools available in a modern SQL DBMS, and has none of
its own
• Report writers (Crystal Reports)
• Prepare reports for human visualization
• Business intelligence tools (Business Objects or Cognos)
• Enable ad-hoc querying of large data warehouses
• Data mining tools (Oracle Data Mining or IBM DB2 Intelligent Miner)
• Allow a user to discover structure in large data sets
• Replication tools (Golden Gate)
• Allow a user to replicate data from one DBMS to another
• Database design tools (Embarcadero)
• Assist the user in constructing a database
Vertica+Hadoop
Oracle+Hadoop
HadoopDB Details
HadoopDB Architecture
Connection parameters
- database location
- driver class
- credentials
Metadata
- dataset
- replica locations
- data partitioning
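The two groups of fields listed above could be modeled as plain records. The following is only an illustrative sketch: the class names, field names, the helper `nodes_for_partition`, and every example value are assumptions made for this example, not HadoopDB's actual storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class ConnectionParams:
    """Connection parameters for one node-local database (hypothetical layout)."""
    database_location: str  # for example, a JDBC-style URL
    driver_class: str
    username: str           # credentials
    password: str


@dataclass
class DatasetMetadata:
    """Metadata about one dataset held in the cluster (hypothetical layout)."""
    dataset: str
    partition_key: str  # data partitioning property
    # partition id -> nodes that hold a replica of that partition
    replica_locations: Dict[int, List[str]] = field(default_factory=dict)


def nodes_for_partition(meta: DatasetMetadata, partition_id: int) -> List[str]:
    """Return the nodes that can serve a partition (empty if unknown)."""
    return list(meta.replica_locations.get(partition_id, []))


# Example values are made up for illustration.
conn = ConnectionParams(
    database_location="jdbc:postgresql://node1:5432/ratings",
    driver_class="org.postgresql.Driver",
    username="hadoop",
    password="secret",
)
meta = DatasetMetadata(
    dataset="ratings",
    partition_key="movie_id",
    replica_locations={0: ["node1", "node2"], 1: ["node2", "node3"]},
)
```

A scheduler with this information could send the work for partition 1 to either `node2` or `node3`, which is the point of recording replica locations alongside the partitioning.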
RDBMS + MapReduce
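[Figure: positioning of hybrid systems between MapReduce (scalability, fault tolerance, flexibility) and RDBMS (performance, efficiency), with interface labels "SQL or Script" and "MapReduce". Systems shown: Greenplum, Aster Data; Pig, Hive, CloudBase; HadoopDB.]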
Why Non-Relational?
“New Relations”
• Data warehousing RDBMSs provide horizontal scaling of complex joins and queries
• Most of them are read-only or read-mostly
• Integration
• Today’s data processing tasks increasingly have to access and combine data from
many different non-relational sources, often over a network
• Design Issues
• ACID
• Atomicity
• Consistency
• Isolation
• Durability
• BASE
• Basically Available
• Soft-state
• Eventual Consistency
• Immutable
• No need to update or delete data; only insert new versions of it
• Tracking history
• Lock-free
• Atomicity is based on just a key
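The insert-only, versioned model described above can be sketched as follows. This is a single-process illustration under stated assumptions: the class `VersionedStore` and its methods are hypothetical names, version numbers are assumed to start at 0, and the sketch does not attempt to show the lock-free or distributed aspects.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple


class VersionedStore:
    """Append-only key/value store: writes never overwrite, they add a version."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[Any]] = defaultdict(list)

    def insert(self, key: str, value: Any) -> int:
        """Append a new version for `key` and return its version number."""
        versions = self._versions[key]
        versions.append(value)
        return len(versions) - 1

    def get(self, key: str, version: Optional[int] = None) -> Any:
        """Return the latest value, or a specific version if one is given."""
        versions = self._versions.get(key)
        if not versions:
            raise KeyError(key)
        if version is None:
            return versions[-1]
        if not 0 <= version < len(versions):
            raise KeyError((key, version))
        return versions[version]

    def history(self, key: str) -> List[Tuple[int, Any]]:
        """Return every (version, value) pair ever written for `key`."""
        return list(enumerate(self._versions.get(key, [])))


store = VersionedStore()
v0 = store.insert("user:1", {"name": "kim"})
v1 = store.insert("user:1", {"name": "kim", "city": "Seoul"})
```

Because every operation touches exactly one key, the unit of atomicity is the key itself; earlier versions stay readable, which is what makes history tracking possible.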
Non-Relational Databases
Trend
[Figure: chart labeled "Google (Jan.)", vertical axis from 0 to 2500, comparing Voldemort, Scalaris, Cassandra, CouchDB, MongoDB, ScaleDB, Drizzle, VoltDB, Tokyo, Riak, HBase, Hypertable, Bigtable, SimpleDB, Redis, and MySQL Cluster]
On-going classification by Woohyun Kim
• Document Stores
• Store indexed documents (with multiple indexes)
• Do not support locking, synchronous replication, or ACID transactions
• Instead of ACID, support BASE for much higher performance and scalability
• Provide some simple query mechanisms
• Relational Databases
• Store, index, and query tuples
• Some new RDBMSs provide horizontal scaling
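The document-store bullets above (indexed documents, multiple indexes, simple queries) can be illustrated with a small in-memory sketch. It is not modeled on any particular product: `DocumentStore`, `put`, and `find` are hypothetical names, indexed field values are assumed to be hashable, and there is deliberately no locking or transaction support.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Set


class DocumentStore:
    """Schema-free documents with one equality index per chosen field."""

    def __init__(self, indexed_fields: Iterable[str]) -> None:
        self._docs: Dict[str, Dict[str, Any]] = {}
        self._indexes: Dict[str, Dict[Any, Set[str]]] = {
            name: defaultdict(set) for name in indexed_fields
        }

    def put(self, doc_id: str, doc: Dict[str, Any]) -> None:
        """Store or replace a document and keep every index in step."""
        old = self._docs.get(doc_id)
        if old is not None:  # drop stale index entries for the old version
            for name, index in self._indexes.items():
                if name in old:
                    index[old[name]].discard(doc_id)
        self._docs[doc_id] = dict(doc)
        for name, index in self._indexes.items():
            if name in doc:  # documents may omit any field
                index[doc[name]].add(doc_id)

    def find(self, field_name: str, value: Any) -> List[Dict[str, Any]]:
        """Simple query: documents whose indexed field equals `value`."""
        ids = self._indexes[field_name].get(value, set())
        return [self._docs[doc_id] for doc_id in sorted(ids)]


store = DocumentStore(indexed_fields=["author", "tag"])
store.put("d1", {"author": "kim", "tag": "nosql"})
store.put("d2", {"author": "kim", "tag": "hadoop", "pages": 40})
```

Documents need not share the same fields (only `d2` has `pages`), yet both are reachable through either index, for example `store.find("author", "kim")`.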
Thank you.
[Figure: an App issuing write, read, and take operations; positions labeled 0 and 2^m-1]