
21ITE06 / Big Data Analytics

III Year / VI Semester


MODULE I INTRODUCTION TO BIG
DATA AND HADOOP FRAMEWORK
Introduction to Big Data: Types of Digital Data-Characteristics of Data – Evolution of
Big Data - Definition of Big Data – Challenges with Big Data – 3Vs of Big Data – Non-
Definitional traits of Big Data – Business Intelligence vs. Big Data - Data warehouse
and Hadoop environment – Coexistence. Big Data Analytics: Classification of analytics
– Data Science - Terminologies in Big Data – CAP Theorem – BASE Concept. NoSQL:
Types of Databases – Advantages – NewSQL - SQL vs. NoSQL vs. NewSQL. Introduction
to Hadoop: Features – Advantages - Versions – Overview of Hadoop Ecosystems –
Hadoop distributions – Hadoop vs. SQL – RDBMS vs. Hadoop - Hadoop Components –
Architecture - HDFS – Map Reduce: Mapper – Reducer - Combiner -Partitioner –
Searching – Sorting – Compression. Hadoop 2 (YARN): Architecture – Interacting with
Hadoop Ecosystems.
Objectives

• To understand the need of Big Data, challenges, and different analytical architectures
• Installation and understanding of Hadoop Architecture and its ecosystems
Course Outcome
1. Understand the evolution of big data, its characteristics, and the challenges it poses to traditional business intelligence.
2. Distinguish between big data analysis and analytics in optimizing business decisions.
3. Make use of appropriate components for processing, scheduling, and knowledge extraction from large volumes in the distributed Hadoop Ecosystem.
Unit I - INTRODUCTION TO BIG
DATA
Evolution of Big data - Best Practices for Big data Analytics -
Big data characteristics - Validating - The Promotion of the
Value of Big Data - Big Data Use Cases- Characteristics of
Big Data Applications - Perception and Quantification of
Value -Understanding Big Data Storage - A General
Overview of High-Performance Architecture - HDFS -
MapReduce and YARN - Map Reduce Programming Model.
DATA

The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
BIG DATA

Big Data is also data, but with a huge size. Big Data is a term used to describe a collection of data that is huge in size and yet growing exponentially with time.

“Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it.
BIG DATA
Units of Memory:
Byte
Kilobyte (KB)
Megabyte (MB)
Gigabyte (GB)
Terabyte (TB)
Petabyte (PB)
Exabyte (EB)
Zettabyte (ZB)
Yottabyte (YB)
BIG DATA - Sources
BIG DATA - Sources

• Primary sources of Big Data
  • Social data:
    • Likes
    • Tweets & Retweets
    • Comments
    • Video uploads and general media
BIG DATA - Sources

• Primary sources of Big Data
  • Machine data:
    • Industrial equipment
    • Sensors installed in machinery
    • Web logs that track user behavior
    • Sensors such as medical devices, smart meters, road cameras, satellites, games
BIG DATA - Sources

• Primary sources of Big Data
  • Transactional data:
    • Invoices
    • Payment orders
    • Storage records
    • Delivery receipts
BIG DATA – Data Structures
BIG DATA – Data Structures

• Structured data: Data containing a defined data type, format, and structure.
BIG DATA – Data Structures

• Semi-structured data: Information that does not reside in a relational database but has some organizational properties that make it easier to analyze.
  • Example: XML data
BIG DATA – Data Structures

• Quasi-structured data: Textual data with erratic data formats that can be formatted with effort, software tools, and time. An example of quasi-structured data is the data about which webpages a user visited and in what order.
BIG DATA – Data Structures

• Quasi-structured data:
BIG DATA – Data Structures

• Unstructured data: Data that has no inherent structure, which may include text documents, PDFs, images, and video.
BIG DATA – Data Structures

• A clickstream that can be parsed and mined by data scientists to discover usage patterns and uncover relationships among clicks and areas of interest on a website or group of sites.
Types of Data Repositories, from
an Analyst Perspective

Data Repository: Spreadsheets and data marts
Characteristics:
• Spreadsheets and low-volume databases for recordkeeping
• Analyst depends on data extracts
Types of Data Repositories, from
an Analyst Perspective
Data Repository: Data Warehouses
Characteristics:
• Centralized data containers in a purpose-built space
• Supports BI and reporting, but restricts robust analyses
• Analyst dependent on IT and DBAs for data access and schema changes
• Analysts must spend significant time to get aggregated and disaggregated data extracts from multiple sources
Types of Data Repositories, from
an Analyst Perspective
Data Repository: Analytic Sandbox (workspaces)
Characteristics:
• Data assets gathered from multiple sources and technologies for analysis
• Enables flexible, high-performance analysis in a nonproduction environment; can leverage in-database processing
• Reduces costs and risks associated with data replication into “shadow” file systems
• “Analyst owned” rather than “DBA owned”
State of the Practice in Analytics
Business Driver – Examples
• Optimize business operations – Sales, pricing, profitability, efficiency
• Identify business risk – Customer churn, fraud, default
• Predict new business opportunities – Upsell, cross-sell, best new customer prospects
• Comply with laws or regulatory requirements – Anti-Money Laundering, Fair Lending, Basel II-III, Sarbanes-Oxley (SOX)
BI Versus Data Science
BI Versus Data Science

• BI systems make it easy to answer questions related to:
  • Quarter-to-date revenue
  • Progress toward quarterly targets
  • How much of a given product was sold in a prior quarter or year
BI Versus Data Science

• Data Science tends to use disaggregated data in a more forward-looking, exploratory way, focusing on analyzing the present and enabling informed decisions about the future.
BI Versus Data Science

• BI problems tend to require highly structured data organized in rows and columns for accurate reporting.
• Data Science projects tend to use many types of data sources, including large or unconventional datasets.
Current Analytical Architecture

• Most organizations still have data warehouses that provide excellent support for traditional reporting and simple data analysis activities, but unfortunately have a more difficult time supporting more robust analyses.
Current Analytical Architecture
Current Analytical Architecture

• For data sources to be loaded into the data warehouse, data needs to be well understood, structured, and normalized with the appropriate data type definitions.
Current Analytical Architecture

• Although this kind of centralization enables security, backup, and failover of highly critical data, it also means that data typically must go through significant preprocessing and checkpoints before it can enter this sort of controlled environment.
Current Analytical Architecture

• As a result of this level of control on the EDW, additional local systems may emerge in the form of departmental warehouses and local data marts that business users create to accommodate their need for flexible analysis.
Current Analytical Architecture

• Once in the data warehouse, data is read by additional applications across the enterprise for BI and reporting purposes.
• These are high-priority operational processes getting critical data feeds from the data warehouses and repositories.
Current Analytical Architecture

• Analysts create data extracts from the EDW to analyze data offline in R or other local analytical tools.
Current Analytical Architecture

• Because new data sources slowly accumulate in the EDW due to the rigorous validation and data structuring process, data is slow to move into the EDW, and the data schema is slow to change.
Current Analytical Architecture

• Departmental data warehouses may have been originally designed for a specific purpose and set of business needs, some of which may be forced into existing schemas to enable BI and the creation of OLAP cubes for analysis and reporting.
Drivers of Big Data
Drivers of Big Data

• The data now comes from multiple sources, such as these:
  • Medical information, such as genomic sequencing and diagnostic imaging
  • Photos and video footage uploaded to the World Wide Web
Drivers of Big Data
• The data now comes from multiple sources, such as these:
  • Video surveillance, such as the thousands of video cameras spread across a city
  • Mobile devices, which provide geospatial location data of the users, as well as metadata about text messages, phone calls, and application usage on smart phones
Drivers of Big Data
• The data now comes from multiple sources, such as these:
  • Smart devices, which provide sensor-based collection of information from smart electric grids, smart buildings, and many other public and industry infrastructures
  • Nontraditional IT devices, including the use of radio-frequency identification (RFID) readers, GPS navigation systems, and seismic processing
Emerging Big Data Ecosystem and
a New Approach to Analytics
Emerging Big Data Ecosystem and
a New Approach to Analytics
• Data devices
  • “Sensornets” gather data from multiple locations and continuously generate new data about this data.
  • A video game provider captures data about the skill and levels attained by the player.
Emerging Big Data Ecosystem and
a New Approach to Analytics
• Data devices
  • As a consequence, the game provider can fine-tune the difficulty of the game, suggest other related games that would most likely interest the user, and offer additional equipment and enhancements for the character based on the user’s age, gender, and interests.
Emerging Big Data Ecosystem and
a New Approach to Analytics
• Data collectors
  • Retail stores tracking the path a customer takes through their store while pushing a shopping cart with an RFID chip, so they can gauge which products get the most foot traffic using geospatial data collected from the RFID chips.
Emerging Big Data Ecosystem and
a New Approach to Analytics
• Data aggregators
  • Organizations compile data from the devices and usage patterns collected by government agencies, retail stores, and websites.
  • In turn, they can choose to transform and package the data as products to sell to list brokers, who may want to generate marketing lists of people who may be good targets for specific ad campaigns.
Emerging Big Data Ecosystem and
a New Approach to Analytics
• Data users and buyers
  • These groups directly benefit from the data collected and aggregated by others within the data value chain.
Characteristics of Big Data
• Volume: This refers to data that is tremendously large.
• Variety: The data is coming from different sources in various formats.
• Velocity: The speed of data accumulation also plays a role in determining whether the data is categorized as big data or normal data.
Characteristics of Big Data

• Value: Deals with mechanisms to derive the correct meaning from data.
• Veracity: Trustworthiness and quality of data.
VALIDATING THE HYPE: ORGANIZATIONAL
FITNESS
• There are a number of factors that need to be considered before making a decision regarding adopting big data technology.
• Even if big data is feasible within the organization, it does not necessarily mean that it is reasonable.
• A sample framework can be used for determining a score for each of these factors, ranging from 0 (lowest level) to 4 (highest level).
VALIDATING THE HYPE: ORGANIZATIONAL
FITNESS

• Feasibility:
  • Is the enterprise aligned in a way that allows for new and emerging technologies to be brought into the organization, tested out, and assessed without overbearing organizational constraints?
  • If not, what steps can be taken to create an environment that is suited to the introduction and assessment of innovative technologies?
VALIDATING THE HYPE: ORGANIZATIONAL
FITNESS

• Reasonability:
  • When evaluating the feasibility of adopting big data technologies, have you considered whether your organization faces business challenges whose resource requirements exceed the capability of the existing or planned environment?
VALIDATING THE HYPE: ORGANIZATIONAL
FITNESS

• Reasonability:
  • If not currently, do you anticipate that the environment will change in the near-, medium-, or long-term to be more data-centric and require augmentation of the resources necessary for analysis and reporting?
VALIDATING THE HYPE: ORGANIZATIONAL
FITNESS

• Value:
  • Is there an expectation that the resulting quantifiable value that can be enabled as a result of big data warrants the resource and effort investment in development and productionalization of the technology?
VALIDATING THE HYPE: ORGANIZATIONAL
FITNESS

• Integrability:
  • What steps need to be taken to evaluate the means by which big data can be integrated as part of the enterprise?
VALIDATING THE HYPE: ORGANIZATIONAL
FITNESS

• Sustainability:
  • The costs associated with maintenance, configuration, skills maintenance, and adjustments to the level of agility in development may not be sustainable within the organization.
  • How would you plan to fund continued management and maintenance of a big data environment?
Quantifying Organizational Readiness

Feasibility
0: Evaluation of new technology is not officially sanctioned
1: Organization tests new technologies in reaction to market pressure
2: Organization evaluates and tests new technologies after market evidence of successful use
3: Organization is open to evaluation of new technology; adoption of technology on an ad hoc basis based on convincing business justifications
4: Organization encourages evaluation and testing of new technology; clear decision process for adoption or rejection; organization supports allocation of time to innovation
Quantifying Organizational Readiness
Reasonability
0: Organization’s resource requirements for near-, mid-, and long-terms are satisfactorily met
1: Organization’s resource requirements for near- and mid-terms are satisfactorily met; unclear as to whether long-term needs are met
2: Organization’s resource requirements for near-term are satisfactorily met; unclear as to whether mid- and long-term needs are met
3: Business challenges are expected to have resource requirements in the mid- and long-terms that will exceed the capability of the existing and planned environment
4: Business challenges have resource requirements that clearly exceed the capability of the existing and planned environment; organization’s go-forward business model is highly information-centric
Quantifying Organizational Readiness

Value
0: Investment in hardware resources, software tools, skills training, and ongoing management and maintenance exceeds the expected quantifiable value
1: The expected quantifiable value is evenly balanced by the investment in hardware resources, software tools, skills training, and ongoing management and maintenance
2: Selected instances of perceived value may suggest a positive return on investment
3: Expectations for some quantifiable value for investing in limited aspects of the technology
4: The expected quantifiable value widely exceeds the investment in hardware resources, software tools, skills training, and ongoing management and maintenance
Quantifying Organizational Readiness

Integrability
0: Significant impediments to incorporating any nontraditional technology into the environment
1: Willingness to invest effort in determining ways to integrate technology, with some successes
2: New technologies can be integrated into the environment within limitations and with some level of effort
3: Clear processes exist for migrating or integrating new technologies, but they require dedicated resources and level of effort
4: No constraints or impediments to fully integrating technology into the operational environment
Quantifying Organizational Readiness

Sustainability
0: No plan in place for acquiring funding for ongoing management and maintenance costs; no plan for managing skills inventory
1: Continued funding for maintenance and engagement is given on an ad hoc basis; sustainability is at risk on a continuous basis
2: Need for year-by-year business justifications for continued funding
3: Business justifications ensure continued funding and investments in skills
4: Program management office effective in absorbing and remunerating management and maintenance costs; program for continuous skills enhancement and training
The Promotion of the Value of Big Data
 A thoughtful approach must differentiate
between hype and reality, and one way to do
this is to review the difference between what is
being said about big data and what is being
done with big data.
The Promotion of the Value of Big Data
 A scan of existing content on the “value of big
data” sheds interesting light on what is being
promoted as the expected result of big data
analytics and, more interestingly, how familiar
those expectations sound.
The Promotion of the Value of Big Data
• The Center for Economics and Business Research (CEBR) speaks to the cumulative value of:
  • Optimized consumer spending as a result of improved targeted customer marketing
  • Improvements to research and analytics within the manufacturing sectors to lead to new product development
The Promotion of the Value of Big Data
• The Center for Economics and Business Research (CEBR) speaks to the cumulative value of:
  • Improvements in strategizing and business planning leading to innovation and new start-up companies
The Promotion of the Value of Big Data
• The Center for Economics and Business Research (CEBR) speaks to the cumulative value of:
  • Predictive analytics for improving supply chain management to optimize stock management, replenishment, and forecasting
  • Improving the scope and accuracy of fraud detection
The Promotion of the Value of Big Data
• These are the benefits promoted by business intelligence and data warehouse tools vendors and system integrators for the past 15-20 years, namely:
  • Better targeted customer marketing, improved product analytics, improved business planning, improved supply chain management, and improved analysis for fraud, waste, and abuse
Big Data Use Cases

• A scan of the list allows us to group most of those applications into these categories:
  • Business intelligence, querying, reporting, searching
  • Improved performance for common data management operations
Big Data Use Cases

• A scan of the list allows us to group most of those applications into these categories:
  • Non-database applications
  • Data mining and analytical applications
Big Data Use Cases

• The big data application can be further abstracted into more fundamental categories:
  • Counting – Filtering and aggregation
  • Scanning – Sorting, transformation, and searching
  • Modeling – Analysis and prediction
  • Storing – Large datasets, rapid access
Characteristics of Big Data
Applications
• The big data approach is mostly suited to addressing or solving business problems that are subject to one or more of the following criteria:
  • Data throttling, computation-restricted throttling, large data volumes, significant data variety, and benefits from data parallelization
Characteristics of Big Data
Applications
Application: Energy network monitoring and optimization
Characteristics: Data throttling; computation throttling; large data volumes
Sample Data Sources: Sensor data from smart meters and network components

Application: Credit fraud detection
Characteristics: Data throttling; computation throttling; large data volumes; parallelization; data variety
Sample Data Sources: Point-of-sale data; customer profiles; transaction histories; predictive models
Characteristics of Big Data
Applications
Application: Data profiling
Characteristics: Large data volumes; parallelization
Sample Data Sources: Sources selected for downstream repurposing

Application: Clustering and customer segmentation
Characteristics: Data throttling; computation throttling; large data volumes; parallelization; data variety
Sample Data Sources: Customer profiles; transaction histories; enhancement datasets
Characteristics of Big Data
Applications
Application: Recommendation engines
Characteristics: Data throttling; computation throttling; large data volumes; parallelization; data variety
Sample Data Sources: Customer profiles; transaction histories; enhancement datasets; social network data
Characteristics of Big Data
Applications
Application: Price modeling
Characteristics: Data throttling; computation throttling; large data volumes; parallelization
Sample Data Sources: Point-of-sale data; customer profiles; transaction histories; predictive models
Perception and Quantification of
Value
• Big data significantly contributes to adding value to the organization by:
  • Increasing revenues
  • Lowering costs (operating costs)
  • Increasing productivity
  • Reducing risk
Understanding Big Data Storage

• Most, if not all, big data applications achieve their performance and scalability through deployment on a collection of storage and computing resources bound together within a runtime environment.
Understanding Big Data Storage

• The ability to design, develop, and implement a big data application is directly dependent on an awareness of the architecture of the underlying computing platform (both hardware and software).
Understanding Big Data Storage -
Resource
• Processing capability
  • CPU, processor, or node
  • Modern processing nodes often incorporate multiple cores, which are individual CPUs that share the node’s memory and are managed and scheduled together, allowing multiple tasks to be run simultaneously (multithreading).
Understanding Big Data Storage -
Resource
• Memory, which holds the data that the processing node is currently working on. Most single-node machines have a limit to the amount of memory.
Understanding Big Data Storage -
Resource
• Storage, the place where datasets are loaded, and from which the data is loaded into memory to be processed.
Understanding Big Data Storage -
Resource
• Network, which provides the “pipes” through which datasets are exchanged between different processing and storage nodes.
Understanding Big Data Storage -
Resource
• Single-node computers are limited in their capacity; they cannot easily accommodate massive amounts of data.
Understanding Big Data Storage -
Resource
• High-performance platforms are composed of collections of computers in which the massive amounts of data and requirements for processing can be distributed among a pool of resources.
A General View of High
Performance Architecture
• Multiple nodes are connected together via a variety of network topologies.
• The general architecture distinguishes the management of computing resources from the management of the data across the network of storage nodes.
A General View of High
Performance Architecture
A General View of High
Performance Architecture
• A master job manager oversees the pool of processing nodes, assigns tasks, and monitors the activity.
• A storage manager oversees the data storage pool and distributes datasets across the collection of storage resources.
A General View of High
Performance Architecture
• To get a better understanding of the layering and interactions within a big data platform, we will examine the Apache Hadoop software stack.
HDFS
• HDFS attempts to enable the storage of large files, and does this by distributing the data among a pool of data nodes.
• A single name node runs in a cluster, associated with one or more data nodes, and provides the management of a typical hierarchical file organization and namespace.
HDFS
• The name node effectively coordinates the interaction with the distributed data nodes.
• The name node maintains metadata about each file. That metadata includes an enumeration of the managed files, properties of the files, and the file system, as well as the mapping of blocks to files at the data nodes.
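
A rough sketch in Python of the kind of bookkeeping the name node performs (illustrative only; the dictionary layout and names such as block_locations are hypothetical, not HDFS internals):

# Toy model of name node metadata: files -> blocks, blocks -> data nodes holding replicas.
namenode_metadata = {
    "files": {
        "/logs/clickstream.txt": {
            "size_bytes": 400 * 1024 * 1024,   # file properties
            "blocks": ["blk_0001", "blk_0002", "blk_0003", "blk_0004"],
        }
    },
    "block_locations": {                        # which data nodes hold each block
        "blk_0001": ["datanode-1", "datanode-3", "datanode-7"],
        "blk_0002": ["datanode-2", "datanode-5", "datanode-6"],
        "blk_0003": ["datanode-1", "datanode-4", "datanode-8"],
        "blk_0004": ["datanode-3", "datanode-5", "datanode-9"],
    },
}

# The name node can then answer: which data nodes hold the blocks of this file?
for block in namenode_metadata["files"]["/logs/clickstream.txt"]["blocks"]:
    print(block, "->", namenode_metadata["block_locations"][block])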
HDFS
HDFS
• The data node itself does not manage any information about the logical HDFS file.
• Rather, it treats each data block as a separate file and shares the critical information with the name node.
HDFS
• Once a file is created, as data is written to the file, it is actually cached in a temporary file.
HDFS
• When the amount of data in that temporary file is enough to fill a block in an HDFS file, the name node is alerted to transition that temporary file into a block that is committed to a permanent data node, which is also then incorporated into the file management scheme.
HDFS
• HDFS provides a level of fault tolerance through data replication.
• An application can specify the degree of replication (i.e., the number of copies made) when a file is created.
HDFS
• HDFS provides performance through distribution of data and fault tolerance through replication.
• The result is a level of robustness for reliable massive file storage.
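
A minimal sketch of the idea in Python, assuming a toy cluster of named data nodes and a caller-chosen replication factor; real HDFS placement is rack-aware and more sophisticated than this round-robin scheme:

from itertools import cycle

def place_blocks(num_blocks, data_nodes, replication=3):
    """Toy block placement: spread each block's replicas across distinct nodes."""
    placement = {}
    node_cycle = cycle(data_nodes)
    for b in range(num_blocks):
        replicas = set()
        while len(replicas) < min(replication, len(data_nodes)):
            replicas.add(next(node_cycle))
        placement[f"blk_{b:04d}"] = sorted(replicas)
    return placement

# A 512 MB file with 128 MB blocks -> 4 blocks, each stored on 3 of the 5 nodes.
nodes = [f"datanode-{i}" for i in range(1, 6)]
for block, holders in place_blocks(4, nodes, replication=3).items():
    print(block, holders)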
HDFS - Key tasks for failure
management
• Monitoring:
  • There is continuous “heartbeat” communication from the data nodes to the name node.
  • If a data node’s heartbeat is not heard by the name node, the data node is considered to have failed and is no longer available.
  • A replica is employed to replace the failed node, and a change is made to the replication scheme.
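
A toy sketch of heartbeat-based failure detection; the timeout value and function names are illustrative assumptions, not the actual HDFS settings:

import time

HEARTBEAT_TIMEOUT = 30.0  # seconds without a heartbeat before a node is treated as dead (illustrative)

# Last time each data node was heard from; datanode-2 has been silent for two minutes.
last_heartbeat = {"datanode-1": time.time(), "datanode-2": time.time() - 120}

def dead_nodes(last_seen, now=None, timeout=HEARTBEAT_TIMEOUT):
    """Return data nodes whose last heartbeat is older than the timeout."""
    now = now if now is not None else time.time()
    return [node for node, seen in last_seen.items() if now - seen > timeout]

# Any node reported here would have its blocks re-replicated from surviving replicas.
print(dead_nodes(last_heartbeat))   # -> ['datanode-2']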
HDFS - Key tasks for failure
management
• Rebalancing:
  • A process of automatically migrating blocks of data from one data node to another when there is free space, or when there is an increased demand for the data and moving it may improve performance.
HDFS - Key tasks for failure
management
• Managing integrity:
  • HDFS uses checksums, which are effectively “digital signatures” associated with the actual data stored in a file, that can be used to verify that the data stored corresponds to the data shared or received.
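
A small sketch of the checksum principle using Python's hashlib; HDFS itself records per-block CRC-style checksums rather than SHA-256, so this only illustrates the verification idea:

import hashlib

def checksum(data: bytes) -> str:
    """Digest of the stored bytes (SHA-256 here is illustrative, not HDFS's actual scheme)."""
    return hashlib.sha256(data).hexdigest()

stored_block = b"row1,row2,row3"
recorded = checksum(stored_block)          # computed when the block is written

# Later, on read, recompute and compare to detect corruption.
received_block = b"row1,rowX,row3"         # pretend a bit flipped on disk or in transit
print("intact:", checksum(stored_block) == recorded)     # True
print("corrupt:", checksum(received_block) == recorded)  # False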
HDFS - Key tasks for failure
management
• Metadata replication:
  • The metadata files are also subject to failure, and HDFS can be configured to maintain replicas of the corresponding metadata files to protect against corruption.
HDFS - Benefits
• Decreasing the cost of specialty large-scale storage systems
• Providing the ability to rely on commodity components
• Enabling the ability to deploy using cloud-based services
• Reducing system management costs
MapReduce

• MapReduce originally combined both job management and the programming model for execution.
• The MapReduce execution environment employs a master/slave execution model.
MapReduce

• One master node (called the JobTracker) manages a pool of slave computing resources (called TaskTrackers) that are called upon to do the actual work.
MapReduce – JobTracker
• Responsibilities:
  • Managing the TaskTrackers
  • Monitoring their accessibility and availability
  • Job management - scheduling tasks, tracking the progress of assigned tasks, reacting to identified failures, and ensuring fault tolerance of the execution
MapReduce – TaskTracker

• Responsibilities:
  • Wait for a task assignment
  • Initiate and execute the requested task
  • Provide status back to the JobTracker on a periodic basis
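
A toy, single-process sketch of this master/slave interaction; the class names mirror the JobTracker/TaskTracker roles, but nothing here is Hadoop's actual implementation:

from collections import deque

class JobTracker:
    """Toy master: hands out tasks and records their completion."""
    def __init__(self, tasks):
        self.pending = deque(tasks)
        self.completed = []

    def assign_task(self):
        return self.pending.popleft() if self.pending else None

    def report_status(self, task, status):
        if status == "done":
            self.completed.append(task)
        else:                      # a failed task is re-queued for fault tolerance
            self.pending.append(task)

class TaskTracker:
    """Toy slave: asks for an assignment, executes it, and reports back."""
    def __init__(self, name):
        self.name = name

    def run(self, tracker):
        task = tracker.assign_task()
        while task is not None:
            tracker.report_status(task, "done")
            task = tracker.assign_task()

jt = JobTracker(tasks=[f"map-{i}" for i in range(4)] + ["reduce-0"])
TaskTracker("tt-1").run(jt)
print(jt.completed)   # ['map-0', 'map-1', 'map-2', 'map-3', 'reduce-0']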
MapReduce

• Limitations
  • Applications that demand data movement will rapidly become bogged down by network latency issues.
  • Not all applications are easily mapped to the MapReduce model.
MapReduce

• Limitations
  • The allocation of processing nodes within the cluster is fixed through allocation of certain nodes as “map slots” versus “reduce slots.”
  • While one phase is running, the nodes assigned to the other phase are largely unused, resulting in processor underutilization.
YARN

• YARN – Yet Another Resource Negotiator
• Overall resource management has been centralized, while management of resources at each node is now performed by a local NodeManager.
YARN

• YARN introduces the concept of an ApplicationMaster, associated with each application, that directly negotiates with the central ResourceManager for resources while taking over the responsibility for monitoring progress and tracking status.
YARN
• This allows applications to be better aware of the data allocation across the topology of the resources within a cluster.
• It allows for improved colocation of compute and data resources, reducing data motion and, consequently, reducing delays associated with data access latencies.
MapReduce Programming Model

• MapReduce is used to develop applications that read, analyze, transform, and share massive amounts of data.
• Application development in MapReduce is a combination of the familiar procedural/imperative approaches used by Java or C++ programmers.
MapReduce Programming Model

• Operations:
  • Map, which describes the computation or analysis applied to a set of input key/value pairs to produce a set of intermediate key/value pairs.
  • Reduce, in which the set of values associated with the intermediate key/value pairs output by the Map operation are combined to provide the results.
MapReduce Programming Model
• Processes huge amounts of data in a parallel, reliable, and efficient way in cluster environments.
• Uses a divide-and-conquer technique to process large amounts of data.
• It divides the input task into smaller, manageable sub-tasks and executes them in parallel.
MapReduce Programming Model

Steps:
• Map function
• Shuffle function
• Reduce function
MapReduce Programming Model

Map function
• It takes input tasks and divides them into smaller sub-tasks.
• Sub-steps:
  • Splitting - takes the input dataset from the source and divides it into smaller sub-datasets.
  • Mapping - takes those smaller sub-datasets and performs the required action or computation on each sub-dataset.
MapReduce Programming Model

Map function
• The output of this Map function is a set of key and value pairs, as <Key, Value>.
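
A minimal word-count Map step sketched in Python (illustrative only, not Hadoop's Java API): Splitting breaks the input into lines, and Mapping emits a <word, 1> pair for every word:

def split_input(text):
    """Splitting: divide the input dataset into smaller sub-datasets (here, lines)."""
    return text.splitlines()

def map_function(line):
    """Mapping: emit a <Key, Value> pair for every word in one sub-dataset."""
    return [(word.lower(), 1) for word in line.split()]

text = "Deer Bear River\nCar Car River\nDeer Car Bear"
mapped = [pair for line in split_input(text) for pair in map_function(line)]
print(mapped)   # [('deer', 1), ('bear', 1), ('river', 1), ('car', 1), ...]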
MapReduce Programming Model
Shuffle function
• Sub-steps:
  • Merging - combines all key-value pairs that have the same keys.
  • Sorting - takes the input from the Merging step and sorts all key-value pairs by their keys.
• The Shuffle function returns a list of <Key, List<Value>> sorted pairs to the next step.
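
Continuing the word-count sketch, the Shuffle step merges pairs that share a key and sorts the groups, yielding <Key, List<Value>> pairs:

from collections import defaultdict

# Output of the Map step above (repeated here so the snippet runs on its own).
mapped = [('deer', 1), ('bear', 1), ('river', 1), ('car', 1), ('car', 1),
          ('river', 1), ('deer', 1), ('car', 1), ('bear', 1)]

def shuffle_function(mapped_pairs):
    """Merging: group values that share a key; Sorting: order the groups by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())

print(shuffle_function(mapped))
# [('bear', [1, 1]), ('car', [1, 1, 1]), ('deer', [1, 1]), ('river', [1, 1])]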
MapReduce Programming Model

• Reduce function:
  • Takes the list of <Key, List<Value>> sorted pairs from the Shuffle function and performs the reduce operation.
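
Finally, the Reduce step combines the values for each key; applied to the shuffled word-count pairs it produces the final counts (again just an in-memory model of the idea, not a distributed Hadoop job):

def reduce_function(key, values):
    """Reduce: combine all values associated with one key (here, sum the counts)."""
    return key, sum(values)

shuffled = [('bear', [1, 1]), ('car', [1, 1, 1]), ('deer', [1, 1]), ('river', [1, 1])]
results = [reduce_function(key, values) for key, values in shuffled]
print(results)   # [('bear', 2), ('car', 3), ('deer', 2), ('river', 2)]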