A Study of MapReduce

MINISTRY OF EDUCATION AND TRAINING
UNIVERSITY OF SCIENCE, HO CHI MINH CITY
FACULTY OF INFORMATION TECHNOLOGY

A Study of MapReduce
Report for the course Advanced Database Systems

Supervisor: Dr. Nguyễn Trần Minh Thư

Group 07:
1. 19C11015 - Đỗ Huy Gia Cát
2. 21C12003 - Đào Thanh Danh
3. 21C11026 - Nguyễn Thành Thái
CONTENTS

• Overview of MapReduce
o Motivation
o History
o Applications
• Defining MapReduce
o How does MapReduce work?
o Example
• Extending MapReduce
• Conclusion

What is MapReduce

• Motivation: the real-world problem
• History of MapReduce

What is MapReduce

• What MapReduce solves
• Programs are automatically parallelized and executed on a large cluster of machines
• How MapReduce relates to database management systems: competing or complementary paradigms?

What is MapReduce

• Use cases: how Google applies MapReduce
o Distributed grep
o Distributed sort
o Web link-graph reversal
o Term-vector per host
o Web access log stats
o Inverted index construction
o Document clustering
o Machine learning
o Statistical machine translation
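
As one illustration of the use cases above, distributed grep fits the model with very little code: the map function emits a line when it matches a pattern, and the reduce function passes the matches through unchanged. The sketch below runs in a single Python process standing in for the cluster. The function names and the sample log lines are assumptions made for illustration, not Google's implementation.

```python
import re
from collections import defaultdict

def grep_map(filename, line, pattern):
    """Map: emit (filename, line) only if the line matches the pattern."""
    if re.search(pattern, line):
        yield filename, line

def grep_reduce(filename, lines):
    """Reduce: identity; pass the matching lines through unchanged."""
    return filename, list(lines)

def distributed_grep(files, pattern):
    """Single-process stand-in for the cluster: map, group by key, reduce."""
    grouped = defaultdict(list)
    for filename, lines in files.items():
        for line in lines:
            for key, value in grep_map(filename, line, pattern):
                grouped[key].append(value)
    return dict(grep_reduce(k, v) for k, v in grouped.items())

# Hypothetical sample input: file name -> lines
logs = {
    "a.log": ["GET /index 200", "GET /missing 404"],
    "b.log": ["POST /login 200"],
}
matches = distributed_grep(logs, r"\b404\b")
```

Files with no matching line never produce an intermediate key, so they do not appear in the result at all.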

How MapReduce Works

• Defining MapReduce
o Data is modeled as key-value pairs
o map: takes an input key/value pair and emits intermediate key/value pairs
o reduce: takes an intermediate key together with the set of its values, {value}, and emits an output key/value pair
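
The input and output shapes of the two functions can be written down as type signatures. The sketch below is one possible rendering in Python type hints. The alias names `Mapper` and `Reducer`, the two example functions, and the "year,temperature" record format are hypothetical and chosen only to make the shapes concrete.

```python
from typing import Callable, Iterable, Tuple, TypeVar

K1 = TypeVar("K1")  # input key
V1 = TypeVar("V1")  # input value
K2 = TypeVar("K2")  # intermediate key
V2 = TypeVar("V2")  # intermediate value
V3 = TypeVar("V3")  # output value

# map: (input key, input value) -> intermediate (key, value) pairs
Mapper = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]
# reduce: (intermediate key, {values}) -> output (key, value)
Reducer = Callable[[K2, Iterable[V2]], Tuple[K2, V3]]

def parse_reading(offset: int, line: str) -> Iterable[Tuple[str, int]]:
    """A Mapper: (line offset, "year,temp") -> (year, temp)."""
    year, temp = line.split(",")
    yield year, int(temp)

def max_reading(year: str, temps: Iterable[int]) -> Tuple[str, int]:
    """A Reducer: (year, {temps}) -> (year, highest temp)."""
    return year, max(temps)

# Hypothetical records, just to exercise the two signatures
intermediate = list(parse_reading(0, "1950,22"))
output = max_reading("1950", [22, 31, 18])
```

Note that the map function sees one record at a time, while the reduce function sees every value collected for one key.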

How MapReduce Works

• Input splits: the input is divided into fixed-size pieces (one map task per piece), each read as key-value pairs
• Mapping: each split is passed to the mapping function
• Shuffling: intermediate records that share a key are consolidated
• Reducing: the values grouped under each key are aggregated into a single output value
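
The four stages above can be sketched as four small functions, one per stage. This is a single-process illustration under stated assumptions, not how a real cluster schedules work: the function names, the split size, and the sales records are all hypothetical.

```python
from collections import defaultdict
from itertools import islice

def input_splits(records, split_size):
    """Input splits: cut the input into fixed-size pieces."""
    it = iter(records)
    while chunk := list(islice(it, split_size)):
        yield chunk

def mapping(split, mapper):
    """Mapping: apply the map function to every record of one split."""
    return [pair for record in split for pair in mapper(record)]

def shuffling(mapped_splits):
    """Shuffling: bring together all values that share a key."""
    grouped = defaultdict(list)
    for pairs in mapped_splits:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reducing(grouped, reducer):
    """Reducing: collapse each key's values into a single output value."""
    return {key: reducer(key, values) for key, values in grouped.items()}

# Hypothetical job: total sales per city
sales = [("Hanoi", 5), ("Hue", 2), ("Hanoi", 7), ("Hue", 1), ("Da Nang", 4)]
splits = list(input_splits(sales, split_size=2))
mapped = [mapping(s, lambda rec: [rec]) for s in splits]
totals = reducing(shuffling(mapped), lambda city, amounts: sum(amounts))
```

On a cluster, each call to `mapping` could run on a different machine, and shuffling is the step that moves data across the network.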

Example: Word Count Problem
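
A minimal sketch of word count in the map/reduce style, assuming whitespace tokenization and a single Python process standing in for the cluster. The function names and the three sample documents are illustrative assumptions.

```python
from collections import defaultdict

def wc_map(doc_name, text):
    """Map: emit (word, 1) for every word in the document."""
    for word in text.lower().split():
        yield word, 1

def wc_reduce(word, counts):
    """Reduce: sum all the 1s emitted for one word."""
    return word, sum(counts)

def word_count(documents):
    """Map every document, group the pairs by word, then reduce each group."""
    grouped = defaultdict(list)
    for name, text in documents.items():
        for word, one in wc_map(name, text):
            grouped[word].append(one)
    return dict(wc_reduce(w, c) for w, c in grouped.items())

# Hypothetical input documents
docs = {"d1": "deer bear river", "d2": "car car river", "d3": "deer car bear"}
counts = word_count(docs)
```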

Extending MapReduce

• MapReduce trades flexibility in structuring a computation for a model that parallelizes the computation over a cluster, so the computation is constrained:
o Within a map task, you can only operate on a single aggregate
o Within a reduce task, you can only operate on a single key
• These constraints call for different approaches to structuring a computation

Multi-stage approach

• As a computation becomes more complex, it is better to break the map-reduce job into smaller steps

• Advantages:
o Easier to write and maintain
o Steps can be reused
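
A two-stage job might look like the sketch below: the first stage totals quantities per product and month, and the second stage reads that output to pick each product's best month. The helper `run_stage`, the record layout, and the sample orders are hypothetical. The same `run_stage` helper serves both stages, which is the reusability the slide refers to.

```python
from collections import defaultdict

def run_stage(records, mapper, reducer):
    """One map-reduce step: map each record, group by key, reduce each group."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return [reducer(key, values) for key, values in grouped.items()]

# Stage 1: total quantity per (product, month)
def map_order(order):
    product, month, qty = order
    yield (product, month), qty

def sum_quantities(key, quantities):
    return key, sum(quantities)

# Stage 2: best month per product, reading stage 1 output as its input
def map_monthly_total(record):
    (product, month), total = record
    yield product, (total, month)

def pick_best_month(product, totals):
    total, month = max(totals)
    return product, month, total

orders = [  # hypothetical (product, month, quantity) records
    ("tea", "2024-01", 3), ("tea", "2024-01", 4),
    ("tea", "2024-02", 5), ("coffee", "2024-02", 2),
]
monthly = run_stage(orders, map_order, sum_quantities)
best = run_stage(monthly, map_monthly_total, pick_best_month)
```

Each stage still obeys the single-key constraint on reducers. The second question needs data from several stage-1 keys, which is why it gets its own stage with a different key.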

Incremental MapReduce approach

• Suitable for data that is updated constantly
• Recomputes only the affected part of the data instead of restarting from scratch
• Requires persisting the current results and combining them with the new data
• Map stages are easy to make incremental, while reduce stages are more complex
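
A minimal sketch of an incremental update, assuming the reduce step is a sum so that a saved total can simply be added to the total of the new records. The page-visit record format, the function names, and the saved counts are hypothetical.

```python
from collections import defaultdict

def map_visit(record):
    """Map: one (page, user) visit record -> (page, 1)."""
    page, _user = record
    yield page, 1

def incremental_update(saved_counts, new_records):
    """Map only the new records, then merge them into the saved reduce output."""
    fresh = defaultdict(list)
    for record in new_records:
        for key, value in map_visit(record):
            fresh[key].append(value)
    updated = dict(saved_counts)        # persisted result of earlier runs
    for key, values in fresh.items():   # only keys touched by new data
        updated[key] = updated.get(key, 0) + sum(values)
    return updated

saved = {"/home": 10, "/about": 3}  # hypothetical persisted result
batch = [("/home", "u1"), ("/pricing", "u2"), ("/home", "u3")]
refreshed = incremental_update(saved, batch)
```

The map side is easy because each new record is mapped independently of the old ones. The reduce side is only this simple when old and new partial results can be merged. Otherwise the reduce has to be re-run over all values for every affected key.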

Conclusion
• MapReduce allows computations to be parallelized over a cluster, at the cost of high latency.
• A map task reads data from an aggregate and boils it down to relevant key-value pairs. It reads only a single record at a time and can therefore be parallelized.
• A reduce task takes the many values that map tasks emitted for a single key and summarizes them into a single output. Reduce tasks are parallelized by key.
• Reducers whose output has the same form as their input can be combined into pipelines, which improves parallelism and reduces the amount of data to be transferred.
• Map-reduce operations can be composed into multi-stage pipelines (map -> reduce -> map -> reduce ...).
• The result of a map-reduce computation can be stored as a materialized view, which can then be updated through incremental map-reduce operations that recompute only what changed.
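
The point about combining reducers can be made concrete with a small sketch: when a reducer's output has the same shape as its input, it can first run locally on each node's map output, so fewer pairs cross the network. The node inputs and function names below are hypothetical, and the "network" is just a count of pairs.

```python
from collections import Counter

def map_words(text):
    """Map: emit (word, 1) for every word held on one node."""
    return [(word, 1) for word in text.split()]

def combine(pairs):
    """Combine: run the summing reducer locally on one node's map output."""
    local = Counter()
    for word, n in pairs:
        local[word] += n
    return list(local.items())

def reduce_all(pair_lists):
    """Final reduce: sum the (possibly pre-combined) counts from all nodes."""
    total = Counter()
    for pairs in pair_lists:
        for word, n in pairs:
            total[word] += n
    return dict(total)

node_inputs = ["a b a a", "b b a"]  # hypothetical: one string per node
raw = [map_words(t) for t in node_inputs]
combined = [combine(p) for p in raw]

sent_without = sum(len(p) for p in raw)    # pairs shipped with no combiner
sent_with = sum(len(p) for p in combined)  # pairs shipped after combining
```

Both paths give the same final counts. Only the number of intermediate pairs that have to be transferred differs.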
