

BIG DATA ANALYTICS

Hadoop Architecture
HADOOP
 Hadoop is an open-source framework and ecosystem for processing,
storing, and analyzing large and complex datasets across clusters
of commodity hardware.

 It was conceived and implemented on the basis of the Google File System and the MapReduce programming paradigm.

 Hadoop enables organizations to harness the power of distributed computing to handle massive amounts of data efficiently.

 The core components of the Hadoop ecosystem include the Hadoop Distributed File System (HDFS) and the MapReduce programming model.
HADOOP'S BASIC DATA FLOW
A basic data flow of the Hadoop system can be divided into four phases:
1. Capture Big Data:
The sources can be extensive: structured, semi-structured, and unstructured data, streaming and real-time data sources, sensors, devices, machine-captured data, and many others.
For data capture and storage, the Hadoop ecosystem provides different data integrators such as Flume, Sqoop, Storm, and so on, depending on the type of data.

2. Process and Structure:
We cleanse, filter, and transform the data using a MapReduce-based framework or other frameworks that support distributed programming in the Hadoop ecosystem.
The frameworks currently available include MapReduce, Hive, Pig, Spark, and so on.
3. Distribute Results:
 The processed data can be used by the BI and analytics system or the
big data analytics system for performing analysis or visualization.
4. Feedback and Retain:
 The data analyzed can be fed back to Hadoop and used for
improvements and audits.
THE HADOOP ECOSYSTEM
HDFS
 The Hadoop Distributed File System (HDFS) is a distributed file storage
system designed to store and manage very large datasets across
clusters of machines.
Key features:
 Distributed Storage: HDFS distributes data across multiple
machines in a cluster, allowing for efficient storage and retrieval of
large datasets. This distribution also provides fault tolerance, as data is
replicated across nodes to ensure that data remains available even if
a node fails.
DISTRIBUTED STORAGE
BLOCKS:
 Data in HDFS is divided into fixed-size blocks (typically 128 MB or 256 MB).
 These blocks are distributed across the cluster and can be stored on different machines.
 This block structure enables parallel processing and efficient data management (a short sketch follows below).
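To make the block structure concrete, here is a minimal, hedged sketch (not from the original slides) that lists the blocks of one HDFS file and the machines that hold them, using the standard org.apache.hadoop.fs API; the file path is hypothetical and the cluster settings are assumed to be available on the classpath (core-site.xml/hdfs-site.xml).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up cluster config from the classpath
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt");           // hypothetical HDFS path
        FileStatus status = fs.getFileStatus(file);
        // Ask the NameNode which blocks make up the file and where their replicas live.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " hosts=" + String.join(",", b.getHosts()));
        }
        fs.close();
    }
}
```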
BLOCKS IN HDFS ARCHITECTURE
MASTER-SLAVE ARCHITECTURE:

 HDFS follows a master-slave architecture.
 The master node is called the NameNode, which stores metadata about the file system hierarchy and the location of data blocks.
 DataNodes are the slave nodes that store the actual data blocks.
HDFS DATA BLOCK
DATA NODE FAILURE
 Replication:
 HDFS replicates data blocks across multiple nodes in the cluster.
 The default replication factor is usually set to three, which means that
each data block is stored on three different nodes.
 This replication provides fault tolerance: if a node fails, the data can
still be accessed from other replicas.
REPLICATION FACTOR
 Data Integrity:
 HDFS ensures data integrity through checksums.
 Each data block has a checksum associated with it, and the client
verifies the checksum when reading data.
 If a checksum mismatch occurs, the system knows that data
corruption has occurred.

 High Throughput:
 HDFS is optimized for high-throughput data access rather than low-latency access.
 It is well-suited for batch processing workloads, such as those commonly
found in big data analytics.

 Data Locality:
 HDFS tries to place computation close to the data.
 This means that when processing data, tasks are scheduled on nodes
where the data resides, reducing the need for network transfers.
LAYERS OF HDFS
 The Hadoop Distributed File System (HDFS) is a distributed file system
designed to run on commodity hardware.
 HDFS has two main layers:
NAMESPACE
 Consists of directories, files, and blocks.
 It maintains the filesystem tree and the metadata for all the files and directories in the tree.
 Metadata contains things like the owners of files, permission bits, block locations, size, etc.
 The NameNode manages the filesystem namespace.
BLOCK STORAGE SERVICE
Which has two parts:
 Block Management (performed in the NameNode):
 Maintains the block address table, which maps HDFS blocks to the nodes where they are located.
 Provides DataNode cluster membership by handling registrations and periodic heartbeats.
 Processes block reports and maintains the location of blocks.
 Supports block-related operations such as create, delete, modify, and get block location.
 Manages replica placement, block replication for under-replicated blocks, and deletion of blocks that are over-replicated.
 Storage:
 Provided by DataNodes, which store blocks on the local file system and allow read/write access.

The HDFS namespace itself is managed as an in-memory file system by the NameNode (see below).
HEARTBEAT
 In Hadoop, the NameNode and the DataNodes communicate using Heartbeats.
 A Heartbeat is the signal sent by a DataNode to the NameNode at regular intervals to indicate its presence, i.e. to indicate that it is alive.
 The default heartbeat interval is 3 seconds.
 If a DataNode does not send a heartbeat to the NameNode within ten minutes, the NameNode considers that DataNode to be out of service and the block replicas hosted by that DataNode to be unavailable.
 The NameNode then schedules the creation of new replicas of those blocks on other DataNodes.
 A Heartbeat received from a DataNode also carries information such as the total storage capacity, the fraction of storage in use, and the number of data transfers currently in progress.
 The NameNode uses these statistics for its block allocation and load-balancing decisions.
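As a hedged illustration of how these timings are configured in practice (assuming the standard hdfs-site.xml property names dfs.heartbeat.interval and dfs.namenode.heartbeat.recheck-interval, which are not shown on the slides), the sketch below reads the two properties and computes the commonly cited dead-node timeout of roughly ten minutes.

```java
import org.apache.hadoop.conf.Configuration;

// Hedged sketch: derives the timeout after which the NameNode marks a DataNode dead.
// The formula is an assumed, widely documented rule of thumb, not slide content.
public class HeartbeatTimeout {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        long heartbeatIntervalSec = conf.getLong("dfs.heartbeat.interval", 3);                    // 3-second heartbeats
        long recheckIntervalMs = conf.getLong("dfs.namenode.heartbeat.recheck-interval", 300_000); // 5-minute recheck
        long timeoutMs = 2 * recheckIntervalMs + 10 * heartbeatIntervalSec * 1000;
        System.out.println("DataNode considered dead after ~" + (timeoutMs / 1000) + " seconds");
    }
}
```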
BLOCK REPORT
 A block report is a report that contains information about the data
blocks stored on a DataNode.
 Each DataNode in a Hadoop cluster periodically sends a block
report to the NameNode, which contains information such as the
block ID, the length of the block, and the location of the block.
IN-MEMORY FILE SYSTEM
 The NameNode stores all of this information in its RAM.
 This is beneficial because reading and writing in RAM is faster than on a hard disk.
 Hence, HDFS can handle thousands of transactions per second, even with the additional overhead of managing data across hundreds of nodes.
 However, RAM is volatile in nature; during a power failure, the data stored in RAM is destroyed.
BACK-UP FUNCTIONALITIES
 Hence, the name node maintains a snapshot of its in-memory namespace information on its local hard disk, stored in a file called the FsImage.
 The FsImage contains the entire file system namespace, including the mapping of blocks to files and file system properties.
EDITLOGS
 The NameNode uses a transaction log called the EditLog to
persistently record every change that occurs to file system
metadata.
 For example, creating a new file in HDFS causes the NameNode to
insert a record into the EditLog indicating this.
 Similarly, changing the replication factor of a file causes a new record
to be inserted into the EditLog.
 The NameNode uses a file in its local host OS file system to store the
EditLog.
 The NameNode manages the entire HDFS file system metadata (i.e., owners of files, file permissions, number of blocks, block locations, size, etc.) and maintains it in main memory.
 Clients' first contact point is the NameNode for file metadata; they then perform the actual file I/O directly with the DataNodes.
 If something goes wrong with the NameNode, whatever metadata was in main memory would be lost permanently.
 So during startup the NameNode uses the FsImage: the fsimage file is loaded into memory when the NameNode starts.

 It is interesting to note that the NameNode never really uses these files on disk during runtime, except when it starts.
 Say you make many changes during runtime, like creating directories and files or putting data into HDFS; this information is recorded directly in the EditLog.
 The edits/editlog file (also called the transaction log, stored on disk) records all changes that occur to the file system metadata in memory during runtime.
 The process that takes the last fsimage file, applies all the changes found in the editlog file, and produces a new, up-to-date fsimage file is called the checkpointing process.
CHECK-POINT
 When the NameNode starts up, it reads the FsImage and EditLog
from disk, applies all the transactions from the EditLog to the in-
memory representation of the FsImage, and flushes out this new
version into a new FsImage on disk.
 This process is called a checkpoint.

 Checkpointing involves merging the fsimage with the latest edit log and creating a new fsimage, so that the NameNode possesses the latest metadata of the HDFS namespace.
 The NameNode itself performs the checkpointing process only at server start-up.
WRITING TO HDFS
 When a client or application wants to write a file to HDFS, it reaches
out to the name node with details of the file.

 The name node responds with details based on the actual size of the
file, block, and replication configuration.

 These details from the name node contain the number of blocks of the
file, the replication factor, and data nodes where each block will be
stored.

 Based on information received from the name node, the client or application splits the file into multiple blocks and starts sending them to the first data node.

 Normally, the first replica is written to the data node creating the file, to
improve the write performance because of the write affinity.
The client talks to the name node for metadata to specify where to place
the data blocks.
 Block A is transferred to data node 1 along with details of the two
other data nodes where this block needs to be stored. (see in metadata
A(1,2,3))

 When it receives Block A from the client (assuming a replication factor of 3), data node 1 copies the same block to the second data node (in this case, data node 2 of the same rack).
 This involves a block transfer via the rack switch because both of these data nodes are in the same rack.
 When it receives Block A from data node 1, data node 2 copies the same block to the third data node (in this case, data node 3 of another rack).
 This involves a block transfer via an out-of-rack switch along with a rack switch because these two data nodes are in separate racks.
The client sends data blocks to identified data nodes.
 When all the instructed data nodes receive a block, each one sends
a write confirmation to the name node.
 Finally, the first data node in the flow sends the confirmation of the
Block A write to the client (after all the data nodes send confirmation
to the name node).

 The whole process is repeated for each block of the file, and data transfer happens in parallel for faster writing of the blocks.
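As a concrete illustration of the client side of this write path, here is a minimal, hedged sketch using the standard org.apache.hadoop.fs.FileSystem API; the destination path and contents are hypothetical, and the cluster settings are assumed to be on the classpath. The block splitting and replication pipeline described above happen inside the returned output stream.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical destination path; the client only asks the name node for metadata,
        // while the stream below sends the block data directly to data nodes.
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"))) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```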
COMMUNICATION PROTOCOLS
 All communication from clients to the name node, clients to data
nodes, data nodes to the name node, and name node to the data
nodes happens over Transmission Control Protocol/Internet Protocol
(TCP/IP).

 The data nodes communicate with the name node using the data node
protocol with its own TCP port number (configurable).

 The client communicates with the name node using the client protocol
with its own TCP port number (configurable).

 By design, the name node does not initiate a remote procedure call
(RPC); it only responds to the RPC requests coming from either data
nodes or clients.
READING FROM HDFS
 To read a file from HDFS, the client or application reaches out to the name node with the name of the file and its location. The name node responds with the number of blocks of the file and the data nodes where each block is stored.

The client talks to the name node to get metadata about the file it wants to read.
 Now the client or application reaches out to the data nodes directly
(without involving the name node for actual data transfer—data
blocks don’t pass through the name node) to read the blocks of the files
in parallel, based on information received from the name node.
 When the client or application receives all the blocks of the file, it
combines these blocks into the form of the original file.
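A minimal, hedged sketch of the client side of this read path using the standard FileSystem API; the path is hypothetical, and the per-block fetching from the data nodes happens inside the returned stream.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical path; block data is streamed directly from the data nodes.
        try (FSDataInputStream in = fs.open(new Path("/user/demo/hello.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```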
CHECKSUM FOR DATA BLOCKS
 When writing blocks of a file, the HDFS client computes the checksum of each block of the file and stores these checksums in a separate, hidden file in the same HDFS file system namespace.
 Later, while reading the blocks, the client references these checksums to verify that the blocks have not been corrupted.
 (Corruption can happen because of faults in a storage device, network transmission faults, or bugs in the program.)
 When the client realizes that a block is corrupt, it reaches out to another data node that has a replica of the corrupt block, to get another copy of the block.
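For illustration, a hedged sketch using FileSystem.getFileChecksum to compare whole-file checksums of two hypothetical files; note that the per-block checksum verification described above is performed automatically by the HDFS client on every read.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical paths: comparing the checksums of a file and its copy.
        FileChecksum a = fs.getFileChecksum(new Path("/user/demo/hello.txt"));
        FileChecksum b = fs.getFileChecksum(new Path("/user/demo/hello-copy.txt"));
        // This API exposes a whole-file checksum (it may be null on some file systems);
        // corrupted-block detection on read is handled transparently by the client.
        System.out.println("checksums match: " + (a != null && a.equals(b)));
        fs.close();
    }
}
```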
HANDLING FAILURES

 On cluster startup, the name node enters into a special state called
safe mode.
 During this time, the name node receives a heartbeat signal
(implying that the data node is active and functioning properly) and a
block-report from each data node (containing a list of all blocks on that
specific data node) in the cluster.
THE NAME NODE UPDATES ITS METADATA BASED ON
INFORMATION IT RECEIVES FROM THE DATA NODES.
HANDLING A DATA NODE FAILURE TRANSPARENTLY.
IN TERMS OF STORAGE, WHAT DOES A NAME NODE
CONTAIN AND WHAT DO DATA NODES CONTAIN?
 HDFS stores and maintains file system metadata and application data
separately.

 The name node (the master of HDFS) holds the entire metadata, called the namespace (a hierarchy of files and directories), in physical memory for quicker responses to client requests. This is called the fsimage. Changes are recorded in a transactional file called the edit log.
 The name node simultaneously responds to multiple client requests (it is multithreaded) and provides the information the client needs to connect to data nodes to write or read data.
 Data nodes (the slaves of HDFS) contain application data in a partitioned manner for parallel writes and reads.
WHAT IS THE DEFAULT DATA BLOCK
PLACEMENT POLICY?
 By default, three copies, or replicas, of each block are placed, per the
default block placement policy mentioned next.

 The objective is a properly load-balanced, fast-access, fault-tolerant file system:

 The first replica is written to the data node creating the file.

 The second replica is written to another data node within the same
rack.

 The third replica is written to a data node in a different rack.


WHAT IS THE REPLICATION PIPELINE? WHAT IS ITS
SIGNIFICANCE?
 Data nodes maintain a pipeline for data transfer. Having said that, data
node 1 does not need to wait for a complete block to arrive before it can
start transferring it to data node 2 in the flow. In fact, the data transfer
from the client to data node 1 for a given block happens in smaller
chunks of 4KB. When data node 1 receives the first 4KB chunk from the
client, it stores this chunk in its local repository and immediately starts
transferring it to data node 2 in the flow. Likewise, when data node 2
receives the first 4KB chunk from data node 1, it stores this chunk in its
local repository and immediately starts transferring it to data node 3,
and so on. This way, all the data nodes in the flow (except the last one)
receive data from the previous data node and, at the same time, transfer
it to the next data node in the flow, to improve the write performance by
avoiding a wait at each stage.
WHAT IS THE DATA BLOCK REPLICATION FACTOR?

 An application or a job can specify the number of replicas of a file that HDFS should maintain. The number of copies or replicas of each block of a file is called the replication factor of that file. The replication factor is configurable and can be changed at the cluster level or for each file when it is created, or even later for a stored file.
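A minimal, hedged sketch of changing the replication factor of an already stored (hypothetical) file through the FileSystem API; the cluster-wide default would instead come from the dfs.replication property.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChangeReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical file; requests a new replication factor for a file that already exists.
        boolean accepted = fs.setReplication(new Path("/user/demo/hello.txt"), (short) 2);
        System.out.println("replication change accepted: " + accepted);
        fs.close();
    }
}
```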
WHAT IS BLOCK SIZE, AND HOW IS IT
CONTROLLED?

 When a client writes a file to a data node, it splits the file into
multiple chunks, called blocks.

 This data partitioning helps in parallel data writes and reads.

 Block size is controlled by the dfs.blocksize configuration property in the hdfs-site.xml file and applies to files that are created without a block size specification.
 When creating a file, the client can also specify a block size to override the cluster-wide configuration (a hedged sketch follows below).
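A hedged sketch of the per-file override mentioned above, using the FileSystem.create overload that accepts a block size; the path, replication factor, buffer size, and 256 MB block size are illustrative values, not cluster defaults.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CustomBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/big-blocks.dat");   // hypothetical path
        long blockSize = 256L * 1024 * 1024;                  // 256 MB instead of the cluster default
        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out = fs.create(file, true, 4096, (short) 3, blockSize)) {
            out.writeBytes("payload goes here\n");
        }
        System.out.println("cluster default block size: " + fs.getDefaultBlockSize(file));
        fs.close();
    }
}
```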
WHAT IS A CHECKPOINT, AND WHO PERFORMS
THIS OPERATION?

 The process of generating a new fsimage by merging transactional records from the edit log into the current fsimage is called a checkpoint.
 The secondary name node periodically performs a checkpoint by downloading the fsimage and the edit log file from the name node and then uploading the new fsimage back to the name node.
 The name node performs a checkpoint upon restart (not periodically; only on name node start-up).
HOW DOES A NAME NODE ENSURE THAT ALL THE
DATA NODES ARE FUNCTIONING PROPERLY?

• Each data node in the cluster periodically sends heartbeat signals and a block-report to the name node.
• Receipt of a heartbeat signal implies that the data node is active and functioning properly.
• A block-report from a data node contains a list of all blocks on that specific data node.
HOW DOES A CLIENT ENSURE THAT THE DATA IT RECEIVES
WHILE READING IS NOT CORRUPTED?
 When writing blocks of a file, an HDFS client computes the checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS file system namespace.
 Later, while reading the blocks, the client references these checksums to verify that the blocks were not corrupted (corruption might happen because of faults in a storage device, network transmission faults, or bugs in the program).
 When the client realizes that a block is corrupted, it reaches out to another data node that has a replica of the corrupted block, to get another copy of the block.
IS THERE A WAY TO RECOVER AN ACCIDENTALLY
DELETED FILE FROM HDFS?

 By default, no—but you can change this default behavior.

 You can enable the Trash feature of HDFS using two configuration
properties: fs.trash.interval and fs.trash.checkpoint.interval in
the core-site.xml configuration file.

 After enabling it, if you delete a file, it gets moved to the Trash
folder and stays there, per the settings.

 If you happen to recover the file from there before it gets deleted, you
are good; otherwise, you will lose the file.
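For illustration only, a hedged sketch of the two properties in Java form; in practice they are set in core-site.xml on the cluster, and the interval values (in minutes) below are example choices, not defaults.

```java
import org.apache.hadoop.conf.Configuration;

// Hedged sketch: programmatic equivalent of the core-site.xml Trash settings above.
public class TrashSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setLong("fs.trash.interval", 1440);            // keep deleted files for 24 hours (minutes)
        conf.setLong("fs.trash.checkpoint.interval", 60);   // create trash checkpoints hourly (minutes)
        System.out.println("trash retention (minutes): " + conf.getLong("fs.trash.interval", 0));
    }
}
```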
MAPREDUCE: SIMPLIFIED DATA PROCESSING
OF LARGE DATA ACROSS THE CLUSTERS
MAPREDUCE
 MapReduce is a programming model and processing framework for
processing and generating large datasets in parallel.

 It divides tasks into smaller subtasks and distributes them across nodes in the cluster.
 MapReduce is used for batch processing and has been fundamental in early big data analytics.
 There are two primary tasks in MapReduce: map and reduce.

PHASES
The two distinct phases of MapReduce are:
1) Map Phase (Splitting & Mapping):
 In the Map phase, we split the input dataset into small chunks.
 Map tasks process these chunks in parallel.
 The tasks are assigned to Mappers, each of which processes a unit block of data to produce a sorted list of (key, value) pairs.
 This list, which is the output of the mapper, is passed to the next phase.

2) Reduce Phase (Shuffling & Reducing):
 Reducers process the intermediate data from the mappers into smaller tuples, which reduces the tasks and leads to the final output of the framework.

MapReduce can be implemented using various programming languages such as Java, Hive, Pig, Scala, and Python.
MAPREDUCE DAEMON

 The complete execution process (execution of both Map and Reduce tasks) is controlled by two types of entities:
 Job Tracker (resides on the NameNode)
 Task Tracker (resides on the DataNodes)
 For every job submitted for execution in the system, there is one JobTracker that resides on the NameNode, and there are multiple TaskTrackers which reside on the DataNodes.
JOBTRACKER: MASTER PROCESS
 Each cluster can have a single JobTracker.
 The JobTracker creates and runs the job on the NameNode. Whenever the client submits a job to the JobTracker, it divides the job into tasks, assigns the tasks to the worker nodes (task scheduling), and tracks their progress and fault tolerance.
 The JobTracker carries out the communication between the client and the TaskTracker by making use of Remote Procedure Calls (RPC).
 RPC can be considered a language that processes use to communicate with each other.
 The JobTracker keeps track of all the jobs and the associated tasks in main memory.
TASK TRACKER
 Within a cluster there can be multiple TaskTrackers.
 Each DataNode has a single TaskTracker running on it.
 It is the responsibility of the TaskTracker to execute all the tasks assigned by the JobTracker.
 Within each TaskTracker there are a number of Map and Reduce slots, referred to as task slots.
 The number of Map and Reduce slots determines how many Map and Reduce tasks can be executed simultaneously.
 The TaskTracker is pre-configured with a number of slots indicating the number of tasks it can accept.
 When the JobTracker tries to schedule a task, it looks for an empty slot in the TaskTracker running on the same server that hosts the DataNode where the data for that task resides.
 If none is found, it looks for a machine in the same rack.
 The TaskTracker sends a heartbeat signal to the JobTracker every 3 seconds; if the JobTracker does not receive this signal, it considers that TaskTracker to be dead.
MAPPING PHASE

 This is the first phase of the program. There are two steps in this phase:
splitting and mapping.

 A dataset is split into equal units called chunks (input splits) in the
splitting step.

 Hadoop provides a RecordReader that uses TextInputFormat to transform input splits into key-value pairs.
 The key-value pairs are then used as inputs in the mapping step.
 This is the only data format that a mapper can read or understand.
 In the mapping step, the mapper contains the coding logic that processes the key-value pairs and produces intermediate key-value pairs (a sketch follows below).
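A minimal, hedged word-count mapper sketch of this "coding logic", using the standard org.apache.hadoop.mapreduce API; the whitespace tokenization is an illustrative choice.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count mapper: turns each input record into intermediate (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit intermediate key-value pair
            }
        }
    }
}
```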
SHUFFLING PHASE
 This is the second phase that takes place after the completion of the Mapping
phase.

 It consists of two main steps: sorting and merging.

 In the sorting step, the key-value pairs are sorted by their keys (since different mappers may have output the same key).

 Merging ensures that key-value pairs are combined.

 The shuffling phase facilitates the removal of duplicate values and the
grouping of values.

 Different values with similar keys are grouped.

 The output of this phase will be keys and values, just like in the Mapping phase.
REDUCER PHASE
 In the reducer phase, the output of the shuffling phase is used as the input.
 The reducer reduces a set of intermediate values which share a key to a smaller set of values.
 The reducer processes this input further to reduce the intermediate values into smaller values.
 It provides a summary of the entire dataset.
 The output from this phase is stored in HDFS (a reducer sketch follows below).
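A matching, hedged word-count reducer sketch: it sums the grouped intermediate counts for each key and writes the final (word, total) pairs, which end up in HDFS.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Word-count reducer: collapses the grouped intermediate values for each key
// into a single, smaller value (the total count).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));   // final output written to HDFS
    }
}
```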


HOW DOES THE HADOOP MAPREDUCE ALGORITHM WORK?
 The input data to process using the MapReduce task is stored in input
files that reside on HDFS.
 The input format defines the input specification and how the input
files are split and read.
 The input split logically represents the data to be processed by an
individual Mapper.
 The record reader communicates with the input split and converts
the data into key-value pairs suitable for reading by the mapper (k,
v).
 The mapper class processes input records from RecordReader and
generates intermediate key-value pairs (k’, v’). Conditional logic is
applied to ‘n’ number of data blocks present across various data
nodes.
 The combiner is a mini reducer; there is one combiner for every mapper. It is used to optimize the performance of MapReduce jobs.
 The partitioner decides how outputs from the combiner are sent to
the reducers.
 The output of the partitioner is shuffled and sorted.

 All the duplicate values are removed, and different values are
grouped based on similar keys.
 This output is fed as input to the reducer.
 The reducer combines all the intermediate values for each intermediate key into lists called tuples.
 The record writer writes these output key-value pairs from the
reducer to the output files. The output data is stored on the HDFS.
WORKFLOW
EXAMPLE

MapReduce word count process (figure)
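To tie the earlier mapper and reducer sketches together, here is a hedged driver sketch for the word count job; the input and output HDFS paths are hypothetical, and using the reducer as a combiner is an illustrative optimization.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver that wires the mapper and reducer sketched earlier into one MapReduce job.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);   // combiner acts as a "mini reducer"
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // hypothetical HDFS paths
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```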


THE HADOOP ECOSYSTEM
 The Hadoop ecosystem architecture is made up of four main
components: data storage, data processing, data access, and
data management.
YARN: YET ANOTHER RESOURCE
NEGOTIATOR
 Hadoop YARN (Yet Another Resource Negotiator) is the Hadoop ecosystem component that provides resource management.
 YARN is called the operating system of Hadoop, as it is responsible for managing and monitoring workloads.
 It allows multiple data processing engines, such as real-time streaming and batch processing, to handle data stored on a single platform.
 It specifies how jobs should be run and managed on Hadoop clusters.
 It allows users to submit applications to different machines within the cluster, with each task of a job running in an allocated bundle of resources on a machine, called a container.
YARN ARCHITECTURE
RESOURCE MANAGER (RM)
 It is the master daemon of YARN. The RM manages the global assignment of resources (CPU and memory) among all the applications.
 The Resource Manager has two main components:
  Scheduler
  Application Manager

 Scheduler
  The scheduler is responsible for allocating resources to the running applications.
  The scheduler is a pure scheduler: it performs no monitoring or tracking of the applications, and it does not even guarantee the restarting of failed tasks.

Application Manager
  It manages the running Application Masters in the cluster, i.e., it is responsible for starting Application Masters and for monitoring and restarting them on different nodes in case of failures.
NODE MANAGER (NM)

 It is the slave daemon of YARN.
 The NM is responsible for containers, monitoring their resource usage and reporting the same to the ResourceManager.
 It manages the user processes on that machine.
 The YARN NodeManager also tracks the health of the node on which it is running. The design also allows plugging long-running auxiliary services into the NM; these are application-specific services, specified as part of the configuration and loaded by the NM during startup.
 Shuffle is a typical auxiliary service provided by the NMs for MapReduce applications on YARN.
3. APPLICATION MASTER (AM)

 One application master runs per application.

 It negotiates resources from the resource manager and works with the
node manager.

 It manages the application life cycle.
 The AM acquires containers from the RM’s Scheduler before contacting the corresponding NMs to start the application’s individual tasks.
DATA ACCESS TOOLS
 Hive
 Apache Hive is an open-source data warehouse system for querying and analyzing large datasets stored in Hadoop files.
 Hive performs three main functions: data summarization, query, and analysis.
 Hive uses a language called HiveQL (HQL), which is similar to SQL.
 HiveQL automatically translates SQL-like queries into MapReduce jobs which execute on Hadoop (a hedged client sketch follows below).
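For illustration only, a minimal, hedged sketch of submitting a HiveQL query from Java over JDBC; the HiveServer2 URL, the empty credentials, and the words table are hypothetical, and the Hive JDBC driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Submits a HiveQL query through JDBC; Hive compiles it into MapReduce (or Tez/Spark) jobs.
public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical HiveServer2 endpoint, credentials, and table name.
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT word, COUNT(*) FROM words GROUP BY word")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```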
HIVE ARCHITECTURE
PIG

 Apache Pig is a high-level language platform for analyzing and querying huge datasets that are stored in HDFS.
 Pig, as a component of the Hadoop ecosystem, uses the Pig Latin language.
 It is very similar to SQL. It loads the data, applies the required filters, and dumps the data in the required format.
 For program execution, Pig requires a Java runtime environment.
