
The Big Data Technology: Hadoop

3.1 Introduction to Hadoop –


Hadoop is an open-source framework for distributed storage and processing of large
datasets across clusters of computers using simple programming models. It's designed to
scale up from single servers to thousands of machines, each offering local computation and
storage. Hadoop consists of four main modules:

1. Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data. It stores data across multiple machines in a fault-tolerant manner.
2. YARN (Yet Another Resource Negotiator): YARN is the resource
management layer of Hadoop. It manages resources in the cluster and
schedules tasks to be executed on the cluster's nodes.
3. MapReduce: A programming model and processing engine for distributed
computing on large data sets. It is designed to efficiently process large
volumes of data in parallel.
4. Common Utilities (Hadoop Common): A set of shared libraries and utilities that support the other Hadoop modules and assist in managing and operating Hadoop clusters.
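
To make these modules concrete, here is a minimal client-side sketch in Java: Hadoop Common supplies the Configuration object, and the HDFS FileSystem API is used to list a directory. It assumes the hadoop-client library is on the classpath and that the configuration points at a running cluster; the path /user/demo is only a placeholder.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsListing {
        public static void main(String[] args) throws Exception {
            // Hadoop Common supplies the Configuration object; it reads
            // core-site.xml / hdfs-site.xml from the classpath if present.
            Configuration conf = new Configuration();

            // The FileSystem handle talks to HDFS (the Name Node for metadata).
            FileSystem fs = FileSystem.get(conf);

            // List the contents of a directory; /user/demo is a placeholder path.
            for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
            fs.close();
        }
    }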

Hadoop is widely used in industries like finance, healthcare, advertising, social media,
and more for tasks like log processing, recommendation systems, data warehousing,
and large-scale data analytics. Its scalability, fault tolerance, and cost-effectiveness
make it a popular choice for handling big data.

3.1.1 Features of Hadoop –


Hadoop, an open-source framework developed by the Apache Software Foundation, is
designed to process and store large volumes of data across distributed computing clusters.
One of its key features is its ability to handle big data efficiently through parallel processing,
fault tolerance, and scalability. Here are some key features of Hadoop:

1. Distributed Storage (HDFS - Hadoop Distributed File System): Hadoop splits large files into smaller blocks and distributes them across a cluster of machines. This allows for parallel processing and fault tolerance, since each block is replicated across multiple nodes (a short sketch of this appears after this list).
2. Distributed Processing (MapReduce): Hadoop utilizes the MapReduce
programming model to process large datasets in parallel across a distributed
cluster. MapReduce divides the processing into two phases: the map phase,
where data is filtered and sorted, and the reduce phase, where the processed
data is aggregated and summarized.
3. Scalability: Hadoop is highly scalable, allowing organizations to easily add
more nodes to their cluster to handle increasing amounts of data. This
horizontal scalability enables Hadoop to accommodate growing datasets
without significant changes to the underlying architecture.

4. Fault Tolerance: Hadoop's distributed nature ensures fault tolerance by replicating data across multiple nodes. If a node fails, the data can be retrieved from other nodes that contain replicas, ensuring uninterrupted processing and data availability.
5. Cost-Effectiveness: Hadoop runs on commodity hardware, making it a cost-
effective solution for storing and processing large datasets compared to
traditional proprietary systems.
6. Flexibility: Hadoop is not limited to a specific type of data or application. It
can handle structured, semi-structured, and unstructured data, making it
suitable for a wide range of use cases, including batch processing, real-time
processing, data warehousing, and analytics.
7. Ecosystem: Hadoop has a rich ecosystem of tools and technologies that
complement its core components, including Apache Hive for data
warehousing, Apache Pig for data analysis, Apache Spark for in-memory
processing, Apache HBase for real-time NoSQL database capabilities, and
many others.
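
As an illustration of feature 1, the sketch below writes a small file to HDFS with an explicit replication factor and block size. This is a minimal example under the same assumptions as before (hadoop-client on the classpath, a reachable cluster); the path, the replication factor of 3, and the 128 MB block size are illustrative choices, not requirements.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteWithReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/sample.txt"); // illustrative path
            short replication = 3;                 // each block stored on 3 Data Nodes
            long blockSize = 128L * 1024 * 1024;   // 128 MB blocks (a common default)

            // create(path, overwrite, bufferSize, replication, blockSize)
            try (FSDataOutputStream out =
                     fs.create(file, true, 4096, replication, blockSize)) {
                out.writeBytes("Hadoop splits large files into blocks and replicates them.\n");
            }
            System.out.println("Replication of " + file + ": "
                    + fs.getFileStatus(file).getReplication());
            fs.close();
        }
    }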

3.1.2 Advantages of Hadoop (Why Hadoop) –


Hadoop, an open-source framework for distributed storage and processing of large
datasets across clusters of computers, offers several advantages:

1. Scalability: Hadoop is highly scalable, allowing organizations to easily scale their data storage and processing capabilities by adding more nodes to the cluster. This scalability makes it suitable for handling massive amounts of data, ranging from terabytes to petabytes and beyond.
2. Cost-Effectiveness: Hadoop runs on commodity hardware, which is generally
less expensive than specialized hardware. This makes it cost-effective for
organizations to store and process large volumes of data without investing
heavily in proprietary hardware solutions.
3. Fault Tolerance: Hadoop is designed to handle hardware failures gracefully. It
replicates data across multiple nodes in the cluster, ensuring that even if a
node fails, data remains accessible and processing can continue
uninterrupted.
4. Flexibility: Hadoop can process various types of data, including structured,
semi-structured, and unstructured data. This flexibility allows organizations to
derive insights from diverse data sources such as text documents, images,
videos, log files, social media data, and more.
5. Parallel Processing: Hadoop distributes data processing tasks across multiple
nodes in the cluster, enabling parallel processing. This parallelism significantly
reduces the time required to process large datasets compared to traditional
sequential processing.

Overall, Hadoop is a powerful framework that enables organizations to store, process, and analyze large datasets efficiently and cost-effectively, making it an indispensable tool for big data analytics.

3.1.3 RDBMS vs Hadoop –

An RDBMS and Hadoop address different kinds of workloads:

1. Data types: An RDBMS works best with structured data that fits a predefined schema (schema-on-write), whereas Hadoop handles structured, semi-structured, and unstructured data (schema-on-read).
2. Scaling: An RDBMS typically scales vertically by moving to a bigger server, while Hadoop scales horizontally by adding commodity nodes to the cluster.
3. Processing: An RDBMS is optimized for interactive, transactional (OLTP) queries on moderate data volumes; Hadoop is optimized for high-throughput batch processing and analytics over very large datasets.
4. Hardware and cost: An RDBMS often requires high-end servers and licensed software, while Hadoop runs on inexpensive commodity hardware as open-source software.
5. Access: An RDBMS is queried with SQL; Hadoop is programmed through MapReduce and higher-level tools such as Hive and Pig.

3.2 Hadoop Overview –


Hadoop is a powerful open-source framework for distributed storage and processing
of large datasets across clusters of computers using simple programming models. It's
designed to scale from single servers to thousands of machines, each offering local
computation and storage. Here's a brief overview of its key components and
concepts:

1. Hadoop Distributed File System (HDFS): This is the primary storage system
used by Hadoop. It stores data across multiple machines in a distributed
manner, providing high throughput access to application data.
2. MapReduce: MapReduce is a programming model for processing and generating large datasets in parallel across a distributed cluster. It comprises two main functions: the map function processes a key-value pair to generate intermediate key-value pairs, while the reduce function aggregates the intermediate values associated with the same intermediate key (see the word-count sketch after this list).
3. YARN (Yet Another Resource Negotiator): YARN is the resource
management layer of Hadoop. It manages resources in the cluster and
schedules user applications. YARN separates resource management and job
scheduling/monitoring functions, enabling multiple processing engines to run
on top of it.

4. Hadoop Common: This module contains libraries and utilities used by other
Hadoop modules.
5. Hadoop Ecosystem: Hadoop has a vast ecosystem of tools and projects built
around it to extend its capabilities. Some popular ones include:
• Hive: Data warehousing infrastructure built on top of Hadoop for querying and managing large datasets.
• Pig: High-level platform for creating MapReduce programs used for processing and analyzing large datasets.
• HBase: Distributed, scalable, big data store that provides real-time read/write access to large datasets.
• Spark: An open-source, distributed computing system that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
6. Security: Hadoop provides security features such as Kerberos authentication,
Access Control Lists (ACLs), and encryption to ensure data privacy and
integrity.
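
To make the map and reduce functions in item 2 concrete, here is a sketch of the classic word-count example written against the org.apache.hadoop.mapreduce API: the map function emits an intermediate (word, 1) pair for every token, and the reduce function sums the counts for each word. It is a minimal illustration rather than a tuned production job.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map phase: each input line is tokenized and emitted as (word, 1).
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);   // intermediate key-value pair
                }
            }
        }

        // Reduce phase: all values for the same word are summed.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);     // e.g. ("hadoop", 42)
            }
        }
    }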

Hadoop's distributed nature, fault tolerance, and scalability make it ideal for
processing and analyzing large volumes of data across clusters of commodity
hardware. It's widely used in various industries for tasks like log processing, data
warehousing, scientific computing, and more.

3.3 Use Case of Hadoop –


Hadoop is a powerful framework for distributed storage and processing of large
datasets across clusters of computers using simple programming models. Here's a
common use case for Hadoop:

Big Data Analytics: One of the primary use cases for Hadoop is in big data analytics.
Many organizations generate and collect massive amounts of data from various
sources such as social media, sensors, logs, and transaction records. Analyzing this
data using traditional databases or single machines can be slow and expensive.
Hadoop enables organizations to store, process, and analyze large volumes of data
efficiently and cost-effectively.

For example, a retail company might use Hadoop to analyze customer purchase
patterns to optimize inventory management and marketing strategies. They can
process large datasets containing information about customer transactions,
demographics, and online behavior to identify trends and patterns that can inform
business decisions.

By leveraging Hadoop's distributed processing capabilities, organizations can perform complex analytics tasks such as data mining, machine learning, predictive modeling, and sentiment analysis on massive datasets in parallel, leading to insights that drive innovation, improve operational efficiency, and enhance decision-making.

3.4 HDFS (Hadoop Distributed File System) –


HDFS stands for Hadoop Distributed File System. It's a distributed file system designed to
run on commodity hardware. HDFS is the primary storage system used by Hadoop
applications. It provides high-throughput access to application data and is suitable for
applications with large data sets. HDFS has a master/slave architecture where the master is
called the Name Node and the slaves are called Data Nodes. The Name Node manages the
file system namespace and regulates access to files by clients. The Data Nodes manage
storage attached to the nodes that they run on. HDFS is fault-tolerant and designed to be
highly scalable, making it well-suited for big data processing tasks.

Fig. HDFS architecture (Name Node and Data Nodes)
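
As an illustration of the master/slave architecture described above, the following sketch asks the Name Node for a file's metadata and prints the Data Nodes that hold each block replica. It assumes the hadoop-client library and a reachable cluster; the path is a placeholder.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsBlockLocations {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/sample.txt"); // illustrative path
            FileStatus status = fs.getFileStatus(file);    // metadata from the Name Node

            // Each entry is one block of the file plus the Data Nodes holding its replicas.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }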

3.5 Processing Data with Hadoop –

Processing data with Hadoop involves utilizing the Hadoop ecosystem, which is a set
of open-source tools and frameworks designed for distributed storage and
processing of large datasets across clusters of computers. Here's a general overview
of the steps involved:

1. Setting up a Hadoop Cluster: First, you need to set up a Hadoop cluster consisting of multiple nodes. A Hadoop cluster typically includes master nodes (Name Node, Resource Manager) and worker nodes (Data Nodes, Node Managers).
2. Storing Data: Data is typically stored in the Hadoop Distributed File System
(HDFS), which is designed to store large datasets across distributed clusters.
You can upload data to HDFS using various methods like Hadoop command
line tools or APIs.
3. Writing MapReduce Jobs: MapReduce is a programming model used for processing and generating large datasets. You write MapReduce jobs to process data in parallel across the nodes in the Hadoop cluster. A MapReduce job consists of two main functions: a map function for processing input data and emitting intermediate key-value pairs, and a reduce function for aggregating and processing those intermediate key-value pairs (a minimal driver for submitting such a job is sketched after this list).
4. Running MapReduce Jobs: Once you have written MapReduce jobs, you
submit them to the Hadoop cluster for execution. Hadoop takes care of
distributing the tasks across the nodes in the cluster, processing the data in
parallel, and handling failures.
5. Using Higher-Level Abstractions: While MapReduce is a powerful paradigm,
writing MapReduce jobs directly can be complex. Therefore, many higher-level
abstractions and frameworks have been built on top of Hadoop to simplify
data processing tasks. Examples include Apache Pig, Apache Hive, Apache
Spark, etc. These frameworks provide easier-to-use interfaces and higher-level
abstractions for data processing tasks.
6. Data Analysis and Visualization: Once the data processing is done, you can
perform various data analysis tasks on the processed data. This may involve
using tools like Apache Spark for in-memory data processing, Apache Hive for
SQL-like queries, or Apache HBase for real-time querying of Hadoop data. You
can also visualize the results using tools like Apache Zeppelin, Tableau, or
Power BI.
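
Tying steps 2-4 together, the sketch below shows how a driver class might configure and submit the word-count job from section 3.2, reading its input from an HDFS directory and writing results back to HDFS. The input and output paths are placeholders, and the mapper and reducer classes refer to the illustrative WordCount sketch given earlier.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");

            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCount.TokenizerMapper.class);
            job.setCombinerClass(WordCount.IntSumReducer.class); // local pre-aggregation
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Step 2: the input already lives in HDFS; paths are placeholders.
            FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
            FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

            // Step 4: submit the job; YARN schedules the map and reduce tasks.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The driver would typically be packaged into a jar and launched with the hadoop jar command; YARN then allocates resources for the map and reduce tasks across the cluster.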

Overall, Hadoop provides a scalable and cost-effective solution for processing large
volumes of data by distributing the processing workload across multiple nodes in a
cluster.
