Welcome to Scribd!

Unit 1 Lecture 3

Uploaded by

0% found this document useful (0 votes)

0 views12 pages

MapReduce is a programming model used to process large datasets in a distributed computing environment. It works by breaking a job into map and reduce tasks that are executed in parallel across clusters. The map tasks output key-value pairs that are shuffled and sorted before being input to the reduce tasks. Examples where MapReduce is well-suited include counting word frequencies, calculating total page sizes by host, and performing joins between large datasets. Refinements like combiners and backup tasks improve performance.

Original Description:

Copyright

Available Formats

PPTX, PDF, TXT or read online from Scribd

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Report this Document

Copyright:

Available Formats

Download as PPTX, PDF, TXT or read online from Scribd

Flag for inappropriate content

Download as pptx, pdf, or txt

0% found this document useful (0 votes)

0 views12 pages

Unit 1 Lecture 3

Uploaded by

Anirudh Prakash

Copyright:

Available Formats

Download as PPTX, PDF, TXT or read online from Scribd

Flag for inappropriate content

Download as pptx, pdf, or txt

Jump to Page

You are on page 1of 12

Search inside document

Map-Reduce and

the New Software Stack

Map-Reduce: A diagram

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 2

Map-Reduce: In Parallel

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 3

Data Flow
◼ Input and final output are stored on a distributed file system
(FS):
▪ Scheduler tries to schedule map tasks “close” to physical
storage location of input data

◼ Intermediate results are stored on local FS of Map and

Reduce workers

◼ Output is often input to another

MapReduce task

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 4

Refinements: Backup Tasks
◼ Problem
▪ Slow workers significantly lengthen the job completion
time:
▪ Other jobs on the machine
▪ Bad disks
▪ Weird things
◼ Solution
▪ Near end of phase, spawn backup copies of tasks
▪ Whichever one finishes first “wins”
◼ Effect
▪ Dramatically shortens job completion time

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 5

Refinement: Combiners
Often a Map task will produce many pairs of the form (k,v1), (k,v2), … for the same key k
▪ E.g., popular words in the word count example
▪ The goal of the combiner is to decrease the size of the data sent and can save network
time by pre-aggregating values in the mapper:
• combine(k, list(v1)) 🡪 v2
▪ Combiner functions like a reducer
▪ Works only if reduce function is commutative
and associative

▪ Since it functions as “semi -reducer” ,it has the same interface as reducer and often are
the same class. The combiner() executes on each machine that performs a map task

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 6

Refinement: Combiners
◼ Back to our word counting example:

▪ Combiner combines the values of all keys of a single mapper (single machine) :

▪ Much less data needs to be copied and shuffled!

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 7

Shuffle &Sort

Shuffle phase in Hadoop transfers the map output from Mapper to a Reducer
in MapReduce.

Sort phase in MapReduce covers the merging and sorting of map outputs.

Data from the mapper are grouped by the key, split among reducers and
sorted by the key.

Every reducer obtains all values associated with the same key.

Shuffle and sort phase in Hadoop occur simultaneously and are done by the
MapReduce framework
Example suited for map reduce:
Host size
◼ Suppose we have a large web corpus
◼ Look at the metadata file
▪ Lines of the form: (URL, size, date, …)
◼ For each host, find the total number of bytes
▪ That is, the sum of the page sizes for all URLs from that
particular host

◼ Other examples:
▪ Link analysis and graph processing
▪ Machine Learning algorithms

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 9

Example: Language Model
◼ Statistical machine translation:
▪ Need to count number of times every 5-word
sequence occurs in a large corpus of documents

◼ Very easy with MapReduce:

▪ Map:
▪ Extract (5-word sequence, count) from document
▪ Reduce:
▪ Combine the counts

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 10

Example: Join By Map-Reduce
◼ Compute the natural join R(A,B) ⋈ S(B,C)
◼ R and S are each stored in files
◼ Tuples are pairs (a,b) or (b,c)

A B B C A C
a1 b1 b2 c1 a3 c1

a2 b1 b2 c2 a3 c2

a3 b2 b3 c3 a4 c3

a4 b3

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 11

Map-Reduce Join
◼ Use a hash function h from B-values to 1...k
◼ A Map process turns:
▪ Each input tuple R(a,b) into key-value pair (b,(a,R))
▪ Each input tuple S(b,c) into (b,(c,S))

◼ Map processes send each key-value pair with

key b to Reduce process h(b)
▪ Hadoop does this automatically; just tell it what k is.
◼ Each Reduce process matches all the pairs (b,
(a,R)) with all (b,(c,S)) and outputs (a,b,c).
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 12

Job
Document4 pages
Job
Joseph Chamorro
No ratings yet
MapReduce and The New Software Stack
Document33 pages
MapReduce and The New Software Stack
varsha1504
No ratings yet
Map-Reduce and The New Software Stack: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman
Document49 pages
Map-Reduce and The New Software Stack: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman
Bunny Honey
No ratings yet
Map-Reduce and The New Software Stack: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman
Document48 pages
Map-Reduce and The New Software Stack: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman
varsha1504
No ratings yet
Map Reduce Architecture: Adapted From Lectures by
Document37 pages
Map Reduce Architecture: Adapted From Lectures by
Anandh Kumar
No ratings yet
Map Reduce
Document14 pages
Map Reduce
Amanda Drew
No ratings yet
Map Reduce PDF
Document29 pages
Map Reduce PDF
Mahalakshmi G
No ratings yet
Chapter Five Hadoop Mapreduce & HDFS
Document44 pages
Chapter Five Hadoop Mapreduce & HDFS
Binyam Bekele Moges
No ratings yet
Map Reduce Design and Execution Framework Part 1
Document19 pages
Map Reduce Design and Execution Framework Part 1
l200908
No ratings yet
Lecture 1 - Map Reduce
Document31 pages
Lecture 1 - Map Reduce
poornank05
No ratings yet
Map Reduce and Hadoop
Document39 pages
Map Reduce and Hadoop
Arifa Khadri
No ratings yet
Directed Acyclic Graph Scheduler
Document29 pages
Directed Acyclic Graph Scheduler
ArXlan Xahir
No ratings yet
Join Algorithms Using MapReduce
Document22 pages
Join Algorithms Using MapReduce
Marwa Hussien
No ratings yet
Mapreduce
Document51 pages
Mapreduce
roshanak attar
No ratings yet
MR Databases
Document52 pages
MR Databases
raj9523493430
No ratings yet
Cloudera Certification Dump 410 Anil PDF
Document49 pages
Cloudera Certification Dump 410 Anil PDF
arunshan
No ratings yet
MapReduce-Final
Document92 pages
MapReduce-Final
rsumaira80
No ratings yet
Mapreduce Types and Formats
Document65 pages
Mapreduce Types and Formats
Tejaswini Karmakonda
No ratings yet
MapReduce Introduction
Document34 pages
MapReduce Introduction
WarunikaRanaweera
No ratings yet
Map Reduce Algorithm
Document4 pages
Map Reduce Algorithm
Leela Rallapudi
No ratings yet
3a - MapReduce Data Flow Scheduling Combiner Partitioner PDF
Document22 pages
3a - MapReduce Data Flow Scheduling Combiner Partitioner PDF
23522020 Danendra Athallariq Harya P
No ratings yet
Module 1 Algorithm For Massive Datasets
Document59 pages
Module 1 Algorithm For Massive Datasets
Aleesha K B
No ratings yet
Map Reduce Design and EXECUTION FRAMEWORK
Document21 pages
Map Reduce Design and EXECUTION FRAMEWORK
l200908
No ratings yet
Introduction To MapReduce
Document26 pages
Introduction To MapReduce
fab vif
No ratings yet
Unit 3 - Big Data Technologies
Document42 pages
Unit 3 - Big Data Technologies
prakash N
No ratings yet
Introduction To Map Reduce
Document50 pages
Introduction To Map Reduce
KhAn Zainab
No ratings yet
Map Reduce Programming
Document21 pages
Map Reduce Programming
Sachin Pukale
No ratings yet
CS246: Mining Massive Datasets Jure Leskovec,: Stanford University
Document53 pages
CS246: Mining Massive Datasets Jure Leskovec,: Stanford University
Sven De Bruyn
No ratings yet
GMT SUCOGG Lecture5 Dietmar
Document32 pages
GMT SUCOGG Lecture5 Dietmar
Yehezkiel Halauwet
No ratings yet
Unit 2 Topic 4 Map Reduce
Document43 pages
Unit 2 Topic 4 Map Reduce
teotia.harsh22
No ratings yet
DB-20 - Bascom D+L
Document40 pages
DB-20 - Bascom D+L
Auperu Dauru
No ratings yet
(Download PDF) Internet and World Wide Web How To Program 5th Edition Paul Deitel Test Bank Full Chapter
Document28 pages
(Download PDF) Internet and World Wide Web How To Program 5th Edition Paul Deitel Test Bank Full Chapter
emerabornet
100% (8)
" 10 Binary: There Are Types of People in The World
Document30 pages
" 10 Binary: There Are Types of People in The World
Tamer Nassar
No ratings yet
(How To Plot 1-D, 2-D and 3-D in Matlab) : Department: Chemical Engineering
Document18 pages
(How To Plot 1-D, 2-D and 3-D in Matlab) : Department: Chemical Engineering
Aram Nasih Muhammad
No ratings yet
Hadoop Interview Questions Faq
Document14 pages
Hadoop Interview Questions Faq
mihirhota
No ratings yet
Map Reduce-LO2
Document62 pages
Map Reduce-LO2
Kai Enezhu
No ratings yet
MapReduce Its Applications For Course
Document36 pages
MapReduce Its Applications For Course
Informatique Info
No ratings yet
BDA Module 3
Document66 pages
BDA Module 3
yashchheda2002
No ratings yet
Open-Source Solution For Huge Data Sets: Hadoop Committer - Apache Software Foundation 11/23/2008
Document56 pages
Open-Source Solution For Huge Data Sets: Hadoop Committer - Apache Software Foundation 11/23/2008
h2a Chandu
No ratings yet
Unit III EBDP 2022 PDF
Document77 pages
Unit III EBDP 2022 PDF
Raju Jacob Raj
No ratings yet
Introduction To MapReduce
Document43 pages
Introduction To MapReduce
Ashwin Ajmera
No ratings yet
2011 Hyunjung Park VLDB Map-Reduce, RAMP, Big Data
Document4 pages
2011 Hyunjung Park VLDB Map-Reduce, RAMP, Big Data
Sunil Thalia
No ratings yet
Unit 2 - From Hadoop Streaming PDF
Document20 pages
Unit 2 - From Hadoop Streaming PDF
Gopal Agarwal
No ratings yet
Graficos R
Document136 pages
Graficos R
Gilson Sanchez Chia
No ratings yet
Map Reduce Merge
Document32 pages
Map Reduce Merge
warwithin
No ratings yet
Big Data Infrastructure: Week 2: Mapreduce Algorithm Design (2/2)
Document55 pages
Big Data Infrastructure: Week 2: Mapreduce Algorithm Design (2/2)
Lindsey Hoffman
No ratings yet
Map Reduce Programming
Document21 pages
Map Reduce Programming
Nyerere Oyaro
No ratings yet
Mapper Reducer
Document10 pages
Mapper Reducer
BigData Professional
No ratings yet
Paper Dvi
Document7 pages
Paper Dvi
jefferyleclerc
No ratings yet
Parallel Programming, Mapreduce Model: Unit Ii
Document47 pages
Parallel Programming, Mapreduce Model: Unit Ii
Darpan Paloda
No ratings yet
Using R For Data Analysis and Graphing in An Introductory Physics Laboratory
Document10 pages
Using R For Data Analysis and Graphing in An Introductory Physics Laboratory
Brandon Mcguire
No ratings yet
Parallel Collections: A Free Lunch?
Document5 pages
Parallel Collections: A Free Lunch?
Journal of Computer Science and Engineering
No ratings yet
Data Warehousing & Analytics On Hadoop
Document28 pages
Data Warehousing & Analytics On Hadoop
Dheepika
No ratings yet
T05 MapReduce
Document20 pages
T05 MapReduce
abdulazizbinyabtemp
No ratings yet
18mcs35e U4
Document7 pages
18mcs35e U4
jefferyleclerc
No ratings yet
Bda Unit I Lecture8 2
Document4 pages
Bda Unit I Lecture8 2
Anju2
No ratings yet
Unit III EBDP 2022
Document77 pages
Unit III EBDP 2022
Raju Jacob Raj
No ratings yet
4a Resilient Distributed Datasets Etc PDF
Document46 pages
4a Resilient Distributed Datasets Etc PDF
23522020 Danendra Athallariq Harya P
No ratings yet
C++ Programming For Beginners
From Everand
C++ Programming For Beginners
carlos vereau
No ratings yet
Guidebook to R Graphics Using Microsoft Windows
From Everand
Guidebook to R Graphics Using Microsoft Windows
Kunio Takezawa
Rating: 3 out of 5 stars
3/5 (1)
Powerb BI Lan Manual
Document29 pages
Powerb BI Lan Manual
Brijesh C.G
No ratings yet
1Z0-061 Exam Dumps With PDF and VCE Download (1-15) PDF
Document13 pages
1Z0-061 Exam Dumps With PDF and VCE Download (1-15) PDF
Mangesh Abnave
100% (1)
Chapter 6 - Exercises
Document5 pages
Chapter 6 - Exercises
Hsu Let Yee Hnin
No ratings yet
Excel Import From Workbook Files
Document13 pages
Excel Import From Workbook Files
geko1
No ratings yet
Fndload For XML
Document3 pages
Fndload For XML
Chandana Rachapudi
No ratings yet
Oracle Private Cloud Appliance Backup Guide: Oracle White Paper - July 2019
Document14 pages
Oracle Private Cloud Appliance Backup Guide: Oracle White Paper - July 2019
Mohamed Maher Alghareeb
No ratings yet
Assignment 1 SQL: Cairo University Faculty of Computers and Information Information Systems Department
Document3 pages
Assignment 1 SQL: Cairo University Faculty of Computers and Information Information Systems Department
madhav
No ratings yet
ORA-01555 - Snapshot Too Old
Document10 pages
ORA-01555 - Snapshot Too Old
nareshm121
No ratings yet
D22057GC10 - sgBACKUP Y RECOVERY PDF
Document704 pages
D22057GC10 - sgBACKUP Y RECOVERY PDF
Vladi Mir
No ratings yet
Module 1: Data Preparation Using Tableau Prep: Demo Document II
Document9 pages
Module 1: Data Preparation Using Tableau Prep: Demo Document II
செல்வம் கருணாநிதி
No ratings yet
Programme Project Report (PPR) For Master of Library Science (MLI B)
Document16 pages
Programme Project Report (PPR) For Master of Library Science (MLI B)
shishirkant
No ratings yet
CS - Practical File Prince
Document30 pages
CS - Practical File Prince
princejerry8888
No ratings yet
Dbms Lab # 3: SQL Aggregate & Scalar Functions
Document20 pages
Dbms Lab # 3: SQL Aggregate & Scalar Functions
saqib idrees
No ratings yet
r18 - Big Data Analytics - Cse (DS)
Document1 page
r18 - Big Data Analytics - Cse (DS)
aarthi dev
No ratings yet
AG2425 Spatial Databases
Document36 pages
AG2425 Spatial Databases
srini
No ratings yet
COMP 430 Syllabus
Document5 pages
COMP 430 Syllabus
james
No ratings yet
Python Magazine Survey
Document7 pages
Python Magazine Survey
Sai Manasa Ghatti
No ratings yet
Script To Calculate Your Sites Google SERP Position
Document3 pages
Script To Calculate Your Sites Google SERP Position
Dragos Florin Lucian
No ratings yet
Qwetu Analytical Test
Document16 pages
Qwetu Analytical Test
Indresh Singh Saluja
No ratings yet
Hive - Self Learning Notes
Document69 pages
Hive - Self Learning Notes
Karvendhan Balasubramanian
No ratings yet
CSE311L Week - 4
Document5 pages
CSE311L Week - 4
RK Rashik
No ratings yet
How To Control How Many Cursors Are Open in An Session: Q UE ST IO N #
Document26 pages
How To Control How Many Cursors Are Open in An Session: Q UE ST IO N #
Pavithra S
No ratings yet
Information Integrity, Privacy and Security
Document14 pages
Information Integrity, Privacy and Security
Shafeeu Muhammad
No ratings yet
Databases December 2015 Sample Examination Paper: Answer ALL Questions. Clearly Cross Out Surplus Answers
Document6 pages
Databases December 2015 Sample Examination Paper: Answer ALL Questions. Clearly Cross Out Surplus Answers
Bright Harrison Mwale
No ratings yet
SP Sys
Document13 pages
SP Sys
Srihari Meka
No ratings yet
Lab Manual 03 PDF
Document10 pages
Lab Manual 03 PDF
Shahid Zikria
No ratings yet
Guidelines For Search - Podcast Hints
Document11 pages
Guidelines For Search - Podcast Hints
khucphuongnam2005
No ratings yet
Assignment - Big Data Management
Document2 pages
Assignment - Big Data Management
Shwetank Pandey
No ratings yet
Output
Document6 pages
Output
elchongo
No ratings yet
Hibernate Tutorial
Document149 pages
Hibernate Tutorial
Nishitha Nishi
No ratings yet