CS8091 BDA Unit 1
Big Data is a term used to describe a collection of data that is
huge in size and yet growing exponentially with time.
Structured data:
Data containing a defined data type, format, and
structure.
BIG DATA – Data Structures
Semi-structured data:
Semi-structured data is information that does not
reside in a relational database but that has some
organizational properties that make it easier to
analyze.
Example: XML Data
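To illustrate the XML example, here is a minimal Python sketch that reads a small XML document into records. The document, its tag names, and the `parse_customers` helper are all made up for illustration. The point is that the tags give the data organizational properties even though the two records do not share a fixed set of fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: both records are tagged, but they carry different fields,
# so there is no fixed relational schema.
XML_DOC = """
<customers>
  <customer id="1">
    <name>Asha</name>
    <email>asha@example.com</email>
  </customer>
  <customer id="2">
    <name>Ravi</name>
    <phone>555-0100</phone>
  </customer>
</customers>
"""

def parse_customers(xml_text):
    """Turn each <customer> element into a dict of whatever fields it carries."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for customer in root.findall("customer"):
        record = {"id": customer.get("id")}
        for child in customer:
            record[child.tag] = child.text
        records.append(record)
    return records

customers = parse_customers(XML_DOC)
```

Each record keeps only the fields that were present in its element, which is what makes the data easier to analyze than free text but looser than a database table.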
Quasi-structured data:
Textual data with erratic formats that can be
formatted only with effort, software tools, and
time. An example of quasi-structured data is
clickstream data: a record of which webpages a
user visited and in what order.
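As a rough sketch of the "effort and software tools" involved, the Python below turns a list of visited URLs into ordered, structured rows. The URLs and the `structure_clicks` helper are hypothetical. Only standard-library URL parsing is used.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical clickstream: URLs in the order one user visited them.
CLICKSTREAM = [
    "https://www.example.com/",
    "https://www.example.com/search?q=big+data",
    "https://www.example.com/products/item?id=42&ref=search",
]

def structure_clicks(urls):
    """Extract visit order, host, path, and query parameters from raw URLs."""
    rows = []
    for order, url in enumerate(urls, start=1):
        parts = urlparse(url)
        rows.append({
            "order": order,
            "host": parts.netloc,
            "path": parts.path,
            # parse_qs returns a list per key; keep the first value only.
            "params": {k: v[0] for k, v in parse_qs(parts.query).items()},
        })
    return rows

rows = structure_clicks(CLICKSTREAM)
```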
Unstructured data:
Data that has no inherent structure, which may
include text documents, PDFs, images, and video.
exploratory way,
Feasibility:
Is the enterprise aligned in a way that allows for new and
emerging technologies to be brought into the organization,
tested out, and assessed without overbearing bureaucracy?
If not, what steps can be taken to create an environment
that is suited to the introduction and assessment of
innovative technologies?
VALIDATING THE HYPE: ORGANIZATIONAL
FITNESS
Reasonability:
When evaluating the feasibility of adopting big
data technologies, have you considered whether
your organization faces business challenges whose
resource requirements exceed the capability of the
existing or planned environment?
If not currently, do you anticipate that the
environment will change in the near, medium, or
long term to be more data-centric and require
augmentation of the resources necessary for
analysis and reporting?
Value:
Is the quantifiable value expected from big data
enough to warrant the resources and effort
invested in developing the technology and
putting it into production?
Integrability:
What steps need to be taken to evaluate the means
by which big data can be integrated as part of the
enterprise?
Sustainability:
The costs associated with maintenance, configuration,
skills maintenance, and adjustments to the level of
agility in development may not be sustainable within
the organization.
How would you plan to fund continued management
and maintenance of a big data environment?
Quantifying Organizational Readiness
Increasing productivity
Reducing risk
Understanding Big Data Storage
Responsibilities:
wait for a task assignment
Limitations
Applications that demand data movement will
rapidly become bogged down by network latency
issues.
Not all applications are easily mapped to the
MapReduce model.
MapReduce
Limitations
The allocation of processing nodes within the
cluster is fixed through allocation of certain nodes
as “map slots” versus “reduce slots.”
When the computation is weighted toward one phase,
the nodes assigned to the other phase are largely
unused, resulting in processor underutilization.
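A small back-of-the-envelope sketch in Python makes the underutilization concrete. The slot counts are hypothetical and `slot_utilization` is an illustrative helper, not part of any Hadoop API. It assumes the simplest case, where only one phase has work to do at a given moment.

```python
def slot_utilization(map_slots, reduce_slots, phase):
    """Fraction of all slots that are busy while only one phase has work."""
    total = map_slots + reduce_slots
    busy = map_slots if phase == "map" else reduce_slots
    return busy / total

# Hypothetical cluster: 60 slots fixed as map slots, 40 fixed as reduce slots.
map_heavy = slot_utilization(60, 40, "map")        # reduce slots sit idle
reduce_heavy = slot_utilization(60, 40, "reduce")  # map slots sit idle
```

Under these assumptions, a map-heavy stage leaves 40% of the slots idle and a reduce-heavy stage leaves 60% idle, because a slot fixed to one phase cannot be lent to the other.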
YARN
Operations:
Map, which describes the computation or analysis
applied to a set of input key/value pairs to produce a
set of intermediate key/value pairs.
Reduce, in which the values associated with each
intermediate key output by the Map operation
are combined to provide the results.
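The two operations can be sketched in plain Python as follows. This is an illustration of the idea only, not the Hadoop API: the temperature readings, `map_reading`, and `reduce_max` are hypothetical, and the grouping loop stands in for what a MapReduce framework does between the two operations.

```python
from collections import defaultdict

def map_reading(station, reading):
    """Map: one input (station, reading) pair -> intermediate (year, temperature) pairs."""
    year, temperature = reading
    return [(year, temperature)]

def reduce_max(year, temperatures):
    """Reduce: combine all values that share one intermediate key."""
    return (year, max(temperatures))

# Hypothetical input key/value pairs: station id -> (year, temperature).
readings = [
    ("s1", (2019, 31)),
    ("s2", (2019, 35)),
    ("s1", (2020, 29)),
    ("s2", (2020, 33)),
]

# Group the intermediate pairs by key (the framework normally does this).
intermediate = defaultdict(list)
for station, reading in readings:
    for year, temperature in map_reading(station, reading):
        intermediate[year].append(temperature)

results = dict(reduce_max(year, temps) for year, temps in intermediate.items())
```

Here the input key (the station) differs from the intermediate key (the year), which shows that Map is free to re-key the data for the Reduce step.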
MapReduce Programming Model
Processes huge amounts of data in a parallel, reliable, and
efficient way in cluster environments.
Uses the divide-and-conquer technique to process large
amounts of data.
It divides the input task into smaller, manageable sub-tasks
and executes them in parallel.
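The divide-and-conquer idea can be sketched on a single machine in Python. This is an assumption-laden toy, not a cluster implementation: `split` and `parallel_sum` are hypothetical helpers, and a thread pool stands in for the worker nodes. In CPython, threads show the structure of the technique rather than a real speedup for this kind of work.

```python
from concurrent.futures import ThreadPoolExecutor

def split(items, size):
    """Divide: cut the input into chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def parallel_sum(numbers, chunk_size=3, workers=2):
    """Sum each chunk as an independent sub-task, then combine the partial sums."""
    chunks = split(numbers, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)

total = parallel_sum(list(range(1, 11)))
```

Each chunk is processed without reference to the others, which is the property that lets the sub-tasks run at the same time.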
Steps:
Map function
Shuffle function
Reduce function
Map function
It takes input tasks and divides them into smaller sub-tasks.
Sub steps:
Map function
The output of the Map function is a set of key-value pairs
of the form <Key, Value>.
Shuffle function
Sub steps:
Reduce Function:
Takes the sorted list of <Key, List<Value>> pairs from the
Shuffle function and performs the reduce operation.
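The three steps can be chained in a short Python word-count sketch. It runs in one process and is meant only to show the shape of the data at each step. The function names and input lines are hypothetical, and the shuffle is assumed to sort by key and group the values, which matches the sorted <Key, List<Value>> pairs described above.

```python
from itertools import groupby
from operator import itemgetter

def map_function(line):
    """Map: emit a <word, 1> pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_function(pairs):
    """Shuffle: sort pairs by key and merge them into <Key, List<Value>>."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]

def reduce_function(key, values):
    """Reduce: collapse the list of values for one key into a single result."""
    return (key, sum(values))

lines = ["big data is big", "data is growing"]  # hypothetical input
mapped = [pair for line in lines for pair in map_function(line)]
shuffled = shuffle_function(mapped)
counts = [reduce_function(key, values) for key, values in shuffled]
```

After the shuffle, `big` is paired with `[1, 1]`, so the reduce step only has to add the values it is handed.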