BDA Unit 3
The Big Data Technology: Hadoop
Hadoop is widely used in industries such as finance, healthcare, advertising, and social
media for tasks like log processing, recommendation systems, data warehousing,
and large-scale data analytics. Its scalability, fault tolerance, and cost-effectiveness
make it a popular choice for handling big data. Its main components and features are:
1. Hadoop Distributed File System (HDFS): This is the primary storage system
used by Hadoop. It stores data across multiple machines in a distributed
manner, providing high throughput access to application data.
2. MapReduce: MapReduce is a programming model for processing and
generating large datasets in parallel across a distributed cluster. It comprises
two main functions: map and reduce. The map function processes a key-value pair
to generate intermediate key-value pairs, while the reduce function aggregates
the intermediate values associated with the same intermediate key (a minimal
word-count sketch in Java appears after this list).
3. YARN (Yet Another Resource Negotiator): YARN is the resource
management layer of Hadoop. It manages resources in the cluster and
schedules user applications. YARN separates resource management and job
scheduling/monitoring functions, enabling multiple processing engines to run
on top of it.
4. Hadoop Common: This module contains libraries and utilities used by other
Hadoop modules.
5. Hadoop Ecosystem: Hadoop has a vast ecosystem of tools and projects built
around it to extend its capabilities. Some popular ones include:
Hive: Data warehousing infrastructure built on top of Hadoop for
querying and managing large datasets.
Pig: High-level platform for creating MapReduce programs used for
processing and analyzing large datasets.
HBase: Distributed, scalable, big data store that provides real-time
read/write access to large datasets.
Spark: An open-source, distributed computing system that provides an
interface for programming entire clusters with implicit data parallelism
and fault tolerance.
6. Security: Hadoop provides security features such as Kerberos authentication,
Access Control Lists (ACLs), and encryption to ensure data privacy and
integrity.
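To make the map and reduce functions above concrete, here is a minimal word-count
sketch written against the standard Hadoop MapReduce Java API; the class names
(WordCountMapper, WordCountReducer) are illustrative, not part of Hadoop itself.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: emit (word, 1) for every word in an input line.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // intermediate key-value pair
            }
        }
    }

    // Reduce: sum all counts that share the same intermediate key (word).
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

The mapper emits one (word, 1) pair per token; the framework then groups all values
by key and passes each group to the reducer, which sums the counts.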
Hadoop's distributed nature, fault tolerance, and scalability make it ideal for
processing and analyzing large volumes of data across clusters of commodity
hardware. It's widely used in various industries for tasks like log processing, data
warehousing, scientific computing, and more.
Big Data Analytics: One of the primary use cases for Hadoop is in big data analytics.
Many organizations generate and collect massive amounts of data from various
sources such as social media, sensors, logs, and transaction records. Analyzing this
data using traditional databases or single machines can be slow and expensive.
Hadoop enables organizations to store, process, and analyze large volumes of data
efficiently and cost-effectively.
For example, a retail company might use Hadoop to analyze customer purchase
patterns to optimize inventory management and marketing strategies. It can
process large datasets containing information about customer transactions,
demographics, and online behavior to identify trends and patterns that can inform
business decisions.
Fig. HDFS
Processing data with Hadoop involves using the Hadoop ecosystem, which is a set
of open-source tools and frameworks designed for distributed storage and
processing of large datasets across clusters of computers. Here's a general overview
of the steps involved:
1. Setting Up the Cluster: A Hadoop cluster is set up with master nodes
(Name Node, Resource Manager) and worker nodes (Data Nodes, Node
Managers).
2. Storing Data: Data is typically stored in the Hadoop Distributed File System
(HDFS), which is designed to store large datasets across distributed clusters.
You can upload data to HDFS using Hadoop command-line tools or APIs (a short
sketch using the Java FileSystem API appears after this list).
3. Writing MapReduce Jobs: MapReduce is a programming model used for
processing and generating large datasets. You write MapReduce jobs to
process data in parallel across the nodes in the Hadoop cluster. A MapReduce
job consists of two main functions: a map function that processes input data
and emits intermediate key-value pairs, and a reduce function that aggregates
and processes those intermediate key-value pairs (the word-count mapper and
reducer sketched earlier are one example).
4. Running MapReduce Jobs: Once you have written a MapReduce job, you
submit it to the Hadoop cluster for execution (see the driver sketch after this
list). Hadoop takes care of distributing the tasks across the nodes in the cluster,
processing the data in parallel, and handling failures.
5. Using Higher-Level Abstractions: While MapReduce is a powerful paradigm,
writing MapReduce jobs directly can be complex. Many higher-level frameworks
have therefore been built on top of Hadoop to simplify data processing,
including Apache Pig, Apache Hive, and Apache Spark, which provide
easier-to-use interfaces and higher-level abstractions (a short Spark example
follows this list).
6. Data Analysis and Visualization: Once the data processing is done, you can
perform various data analysis tasks on the processed data. This may involve
using tools like Apache Spark for in-memory data processing, Apache Hive for
SQL-like queries, or Apache HBase for real-time querying of Hadoop data. You
can also visualize the results using tools like Apache Zeppelin, Tableau, or
Power BI.
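For step 2, data can be copied into HDFS with the hadoop fs -put command or
programmatically. Below is a minimal sketch using the HDFS FileSystem Java API;
the NameNode address and file paths are placeholders, and in a real cluster the
address would normally be read from core-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsUpload {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder NameNode address; normally picked up from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);
            // Copy a local file into HDFS (paths are illustrative).
            fs.copyFromLocalFile(new Path("/tmp/transactions.csv"),
                                 new Path("/data/raw/transactions.csv"));
            fs.close();
        }
    }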
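For steps 3 and 4, a small driver class configures the job and submits it to the
cluster, after which YARN schedules the map and reduce tasks. This sketch reuses
the illustrative WordCountMapper and WordCountReducer classes from the earlier
sketch and takes the HDFS input and output paths from the command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            // args[0] = HDFS input path, args[1] = HDFS output path
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // Submit the job and wait for it to finish.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The driver would typically be packaged into a JAR and launched with something like
hadoop jar wordcount.jar WordCountDriver <input path> <output path>.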
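As an illustration of the higher-level abstractions in step 5, the same word count
takes only a few lines with Spark's Java API (Spark 2.x or later). This is a sketch;
the input and output paths are placeholders.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("word count");
            JavaSparkContext sc = new JavaSparkContext(conf);
            // Read lines from HDFS (placeholder path), split into words,
            // count each word, and write the result back to HDFS.
            JavaRDD<String> lines = sc.textFile("hdfs:///data/raw/transactions.csv");
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);
            counts.saveAsTextFile("hdfs:///data/out/wordcount");
            sc.stop();
        }
    }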
Overall, Hadoop provides a scalable and cost-effective solution for processing large
volumes of data by distributing the processing workload across multiple nodes in a
cluster.