
SAVITRIBAI PHULE PUNE UNIVERSITY

A
MINI-PROJECT REPORT
ON
“Case study to process data-driven Digital Marketing OR Health care
systems with Hadoop Ecosystem components”

Submitted to the Department of Computer Engineering, SITS, Narhe, Pune, in fulfilment
of the requirements for the

THIRD YEAR OF COMPUTER ENGINEERING

By

Name: Patil Rohit Yuvraj Roll No.:3202031

DEPARTMENT OF COMPUTER
ENGINEERING

SINHGAD INSTITUTE OF TECHNOLOGY & SCIENCE, NARHE,


PUNE (2023-2024)

SINHGAD TECHNICAL EDUCATION SOCIETY’S


SINHGAD INSTITUTE OF TECHNOLOGY & SCIENCE
NARHE, Pune – 411 041

DEPARTMENT OF COMPUTER ENGINEERING

CERTIFICATE

This is to certify that the MINI-PROJECT II work entitled “Case study to process data-driven Digital
Marketing OR Health care systems with Hadoop Ecosystem components” was successfully carried out
by

Name of student: Patil Rohit Yuvraj

in fulfilment of the undergraduate degree course in Third Year Computer Engineering, in the Academic
Year 2023-2024, as prescribed by Savitribai Phule Pune University.

Ms. Sucheta S. Navale          Dr. Geeta S. Navale
Guide                          Head of Department

Dr. S. D. Markande
Principal

SINHGAD INSTITUTE OF TECHNOLOGY AND SCIENCE, NARHE – 411041


CONTENTS

Sr. No.   Title               Page No.
1         Problem Statement   1
2         Introduction        2
3         Architecture        4
4         Conclusion          7

1. Problem Statement

“Case study to process data-driven Digital Marketing OR Health care systems with
Hadoop Ecosystem components as shown.”

● HDFS: Hadoop Distributed File System.


● YARN: Yet Another Resource Negotiator.
● MapReduce: Programming based Data Processing.
● Spark: In-Memory data processing.
● PIG, HIVE: Query based processing of data services.
● HBase: NoSQL Database (Provides real-time reads and writes).
● Mahout, Spark MLlib: machine learning algorithm libraries (provide analytical
tools).
● Solr, Lucene: Searching and indexing.

2. Introduction

In today's information age, organizations are hungry for ways to use big data to gain insights and
make smart decisions. This is especially true in digital marketing and healthcare, where data-driven
approaches are transforming how things are done. This case study dives into how the Hadoop ecosystem,
a powerful suite of tools, can be used to process and analyse data in these fields, leading to better marketing
strategies and improved healthcare management.

Digital marketing has become essential for business growth. With the explosion of online channels
and the ever-growing amount of customer data, companies have a unique opportunity to understand their
target audience better, personalize marketing efforts, and boost engagement. However, analysing and
processing this massive data in real-time requires powerful and adaptable technologies.

Healthcare systems generate a tremendous amount of data – from electronic health records and
medical imaging to data from wearable devices and clinical trials. This data has the potential to
revolutionize patient care, optimize healthcare operations, and accelerate medical research. However,
traditional databases often struggle with the complexity and sheer volume of healthcare data, making it
necessary to adopt more advanced data processing tools.

The Hadoop ecosystem, a collection of tools like HDFS, YARN, MapReduce, Spark, Pig, Hive,
HBase, Mahout, Spark MLLib, Solr, and Lucene, offers a comprehensive framework for processing and
analysing massive datasets. These tools provide functionalities for distributed storage, resource
management, data processing, real-time analytics, query-based services, NoSQL database management,
machine learning algorithms, and efficient searching and indexing.

3. Architecture

Hadoop Ecosystem:

[Figure: Hadoop ecosystem architecture diagram]
Hadoop Ecosystem Components
The Hadoop ecosystem is a collection of open-source software projects that facilitate storing and
processing large datasets in a distributed computing environment. Here's a breakdown of the key
components:

1. HDFS (Hadoop Distributed File System):


HDFS stands as a pivotal component within the Hadoop ecosystem, tasked with the
storage of vast data sets, be they structured or unstructured, across multiple nodes, while
simultaneously maintaining metadata in log files.
Comprising two core elements—Name Node and Data Node—HDFS operates in a
distributed environment utilizing commodity hardware for data storage. The Name Node, housing
metadata, demands fewer resources compared to Data Nodes, which store actual data, thus
contributing to the cost-effectiveness of Hadoop.
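
The Name Node / Data Node split described above can be illustrated with a small sketch. This is plain Python mimicking the idea, not the real HDFS API; the node names and file size are made up for illustration, while the 128 MB block size and replication factor of 3 are HDFS defaults.

```python
# Illustrative sketch: a Name Node tracks which Data Nodes hold each block;
# the Data Nodes (here just strings) would store the actual bytes.
from itertools import cycle

BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB
REPLICATION = 3                 # HDFS default replication factor

def place_blocks(file_size, data_nodes):
    """Split a file into blocks and assign each block to REPLICATION nodes."""
    num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    nodes = cycle(data_nodes)
    metadata = {}
    for block_id in range(num_blocks):
        metadata[block_id] = [next(nodes) for _ in range(REPLICATION)]
    return metadata

# A 300 MB file needs 3 blocks (two full blocks plus one partial block)
meta = place_blocks(300 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
print(len(meta))   # 3
print(meta[0])     # ['dn1', 'dn2', 'dn3']
```

The metadata dictionary here plays the role of the Name Node's log: small, cheap to hold, and separate from the bulk data on the Data Nodes.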

2. YARN (Yet Another Resource Negotiator):


Yet Another Resource Negotiator, or YARN, plays a crucial role in managing resources across
clusters within the Hadoop system, handling tasks such as scheduling and resource allocation.
Comprised of three primary components—Resource Manager, Node Manager, and
Application Manager—YARN ensures efficient allocation of resources for applications while
coordinating with Node Managers for resource allocation per machine. The Application Manager
serves as an intermediary, facilitating negotiation between the Resource Manager and Node
Manager as per system requirements.
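
The scheduling role described above can be sketched as a toy allocator. This is a conceptual illustration, not the real YARN API: the application names, node names, and memory figures are invented, and real YARN schedulers (capacity, fair) are far more sophisticated.

```python
# Toy sketch of Resource Manager behaviour: grant container requests from the
# per-node capacity that Node Managers report.
def allocate(requests, node_capacity):
    """Greedily place container requests (app, memory_gb) onto nodes."""
    placements = []
    for app, mem in requests:
        for node, free in node_capacity.items():
            if free >= mem:
                node_capacity[node] -= mem   # reserve capacity on this node
                placements.append((app, node))
                break
        else:
            placements.append((app, None))   # no node has room; request waits
    return placements

capacity = {"node1": 8, "node2": 4}
result = allocate([("appA", 6), ("appB", 4), ("appC", 4)], capacity)
print(result)  # [('appA', 'node1'), ('appB', 'node2'), ('appC', None)]
```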

3. MapReduce:
MapReduce enables the implementation of distributed and parallel algorithms, facilitating
the processing logic necessary for transforming extensive data sets into manageable ones.
This framework employs two core functions—Map() and Reduce()—to execute its tasks.
Map() is responsible for sorting, filtering, and organizing data into groups, generating key-value
pair results for subsequent processing by the Reduce() method. Reduce() summarizes the mapped
data, aggregating it into smaller sets of tuples.
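
The Map-shuffle-Reduce flow just described can be shown with the classic word-count example, written here as a pure-Python sketch rather than actual Hadoop MapReduce code:

```python
# Map() emits key-value pairs, a shuffle groups them by key, and Reduce()
# aggregates each group into a smaller set of tuples.
from collections import defaultdict

def map_fn(line):
    return [(word, 1) for word in line.lower().split()]

def reduce_fn(key, values):
    return (key, sum(values))

lines = ["big data tools", "big data insights"]

# Map phase: every input record produces intermediate key-value pairs
intermediate = [pair for line in lines for pair in map_fn(line)]

# Shuffle phase: group all values belonging to the same key
groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)

# Reduce phase: summarize each group
result = dict(reduce_fn(k, v) for k, v in groups.items())
print(result)  # {'big': 2, 'data': 2, 'tools': 1, 'insights': 1}
```

In a real cluster the map and reduce calls run in parallel on many nodes, with the shuffle moving data between them over the network.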

4. Spark:
As a platform handling resource-intensive tasks like batch processing, interactive or
iterative real-time processing, and graph visualization, Apache Spark boasts enhanced speed
through in-memory resource consumption.
Suited for real-time data processing, Apache Spark complements Hadoop, which excels in
structured data or batch processing scenarios. Many companies utilize both interchangeably to
leverage their respective strengths.
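
Spark's in-memory model chains transformations over a distributed dataset and only returns results on an action such as `collect()`. The toy class below mimics that style in plain Python; it is not the PySpark API, and the readings are made-up numbers.

```python
# A toy RDD-like wrapper: transformations (map, filter) produce new in-memory
# datasets; the collect() action returns the final results.
class ToyRDD:
    def __init__(self, data):
        self.data = list(data)          # dataset held in memory

    def map(self, fn):                  # transformation
        return ToyRDD(fn(x) for x in self.data)

    def filter(self, fn):               # transformation
        return ToyRDD(x for x in self.data if fn(x))

    def collect(self):                  # action
        return self.data

readings = ToyRDD([3, 8, 15, 4])
out = readings.map(lambda x: x * 2).filter(lambda x: x > 10).collect()
print(out)  # [16, 30]
```

In real Spark the data would be partitioned across the cluster and the transformations evaluated lazily, which is where the speed advantage over disk-based MapReduce comes from.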

5. Pig & Hive:
Developed by Yahoo, PIG operates on the Pig Latin language, a query-based language
akin to SQL, and serves as a platform for structuring data flow, processing, and analyzing large
data sets.
PIG executes commands while managing MapReduce activities in the background, storing
results in HDFS upon processing. The Pig Latin language, designed specifically for this
framework, operates on Pig Runtime, akin to Java running on JVM, facilitating ease of
programming and optimization within the Hadoop ecosystem.

Utilizing SQL methodology and interface, HIVE facilitates the reading and writing of
extensive data sets, employing HQL (Hive Query Language) for queries.
Highly scalable, HIVE supports both real-time and batch processing, accommodating all
SQL data types for streamlined query processing. Its components include JDBC Drivers and the
HIVE Command Line, facilitating data storage permissions, connection establishment, and query
processing.
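
The kind of HQL query Hive runs can be sketched alongside the plain-Python computation it corresponds to. The table name, column names, and rows below are invented for illustration; in Hive the query would be compiled into MapReduce or Spark jobs over data in HDFS.

```python
# A typical HQL query shape (hypothetical table), followed by the equivalent
# group-and-count done directly in Python.
hql = """
SELECT channel, COUNT(*) AS clicks
FROM ad_events
WHERE clicked = true
GROUP BY channel
"""

ad_events = [
    {"channel": "email",  "clicked": True},
    {"channel": "social", "clicked": True},
    {"channel": "email",  "clicked": True},
    {"channel": "search", "clicked": False},
]

clicks = {}
for row in ad_events:
    if row["clicked"]:                  # WHERE clicked = true
        clicks[row["channel"]] = clicks.get(row["channel"], 0) + 1
print(clicks)  # {'email': 2, 'social': 1}
```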

6. HBase:
Functioning as a NoSQL database, Apache HBase accommodates various data types and
effectively handles Hadoop Database requirements, offering capabilities akin to Google's
Bigtable.
Well-suited for scenarios demanding swift processing of requests within large databases,
HBase provides a fault-tolerant way of storing sparse data, facilitating efficient search and retrieval
operations.
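
HBase's data model is essentially a sparse map: row key, then column family, then column, down to a value. The nested-dict sketch below mimics that layout; it is not the real HBase client API, and the row key and column names are illustrative.

```python
# Minimal sketch of the HBase data model: row key -> column family ->
# column -> value, stored sparsely (absent cells simply do not exist).
table = {}

def put(row_key, family, column, value):
    table.setdefault(row_key, {}).setdefault(family, {})[column] = value

def get(row_key, family, column):
    return table.get(row_key, {}).get(family, {}).get(column)

put("patient:001", "vitals", "heart_rate", 72)
put("patient:001", "info",   "admitted",   "2024-01-15")
print(get("patient:001", "vitals", "heart_rate"))  # 72
print(get("patient:999", "vitals", "heart_rate"))  # None (sparse: no cell)
```

Row-key lookups like this are why HBase suits real-time reads and writes, in contrast to the batch scans of MapReduce.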

7. Mahout & Spark MLLib:


Mahout introduces machine learnability to systems or applications, enabling self-
improvement based on patterns, user/environmental interactions, or algorithms.
Offering various libraries and functionalities such as collaborative filtering, clustering, and
classification, Mahout allows the invocation of algorithms as needed, empowering applications
with machine learning capabilities.
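
Collaborative filtering, one of the techniques listed above, can be sketched from scratch: recommend items that similar users liked. This toy version (invented users and items, overlap-based similarity) only illustrates the idea; Mahout and Spark MLlib provide scalable implementations.

```python
# Toy user-based collaborative filtering: find the peer with the most items
# in common, then recommend that peer's items the user has not seen.
ratings = {
    "u1": {"itemA", "itemB"},
    "u2": {"itemA", "itemB", "itemC"},
    "u3": {"itemD"},
}

def recommend(user):
    seen = ratings[user]
    peers = sorted(
        (u for u in ratings if u != user),
        key=lambda u: -len(ratings[u] & seen),   # most overlap first
    )
    for peer in peers:
        new_items = ratings[peer] - seen
        if new_items:
            return sorted(new_items)
    return []

print(recommend("u1"))  # ['itemC']
```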

8. Solr & Lucene: Powerful tools for efficient indexing and searching of marketing data. Solr builds
upon Lucene, a popular open-source search engine library. By implementing Solr and Lucene, an
organization can rapidly search through massive datasets of marketing data to retrieve specific
customer information or campaign details, enabling faster decision-making and campaign
optimization.
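
Lucene's core data structure is the inverted index: each term maps to the set of documents containing it. The sketch below builds one over a few made-up marketing documents; it is a conceptual illustration, not the Lucene or Solr API.

```python
# Build an inverted index (term -> doc ids), then answer AND queries by
# intersecting the posting sets.
docs = {
    1: "summer sale email campaign",
    2: "winter sale banner",
    3: "email newsletter",
}

index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def search(*terms):
    """Return ids of documents containing all query terms."""
    postings = [index.get(t, set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(search("sale"))           # [1, 2]
print(search("sale", "email"))  # [1]
```

Real Lucene indexes add scoring, tokenization, and compressed on-disk postings, but the lookup structure is the same idea.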

Use Cases in Healthcare
Some specific use cases of how these Hadoop ecosystem components can be applied in healthcare
systems:

a. Predictive Analytics for Disease Diagnosis


By leveraging machine learning algorithms provided by Mahout and Spark MLLib, healthcare
organizations can analyse patient data to predict and diagnose diseases more accurately. For example, by
processing electronic health records (EHRs) and medical imaging data, predictive models can be built to
identify patterns and risk factors associated with various diseases.
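
To make the predictive idea concrete, here is a deliberately simplified sketch with synthetic data: a weighted risk score over a couple of EHR-style features. The features, weights, and threshold are all made up for illustration and have no clinical meaning; a real model would be trained with Mahout or Spark MLlib on large datasets.

```python
# Toy risk model: weighted sum of illustrative risk factors, then a threshold.
def risk_score(age, bmi, smoker):
    """Weights here are invented purely for demonstration."""
    return 0.02 * age + 0.03 * bmi + (0.5 if smoker else 0.0)

def high_risk(patient):
    return risk_score(**patient) > 1.5

patients = [
    {"age": 35, "bmi": 22, "smoker": False},
    {"age": 60, "bmi": 31, "smoker": True},
]
flags = [high_risk(p) for p in patients]
print(flags)  # [False, True]
```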

b. Real-time Patient Monitoring


HBase can be utilized to store and process real-time patient data, such as vital signs, medication
adherence, and sensor data from wearable devices. This data can be continuously monitored and analysed
to detect anomalies or changes in patient health status, enabling timely interventions and personalized
care.
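
The anomaly-detection step in such monitoring can be sketched with a rolling baseline: flag readings that drift far from the recent average. The window size, tolerance, and heart-rate values below are illustrative only; production systems would stream data through HBase and more robust statistical models.

```python
# Flag vital-sign readings that deviate from the average of a sliding window.
from collections import deque

def monitor(stream, window=5, tolerance=20):
    recent = deque(maxlen=window)
    alerts = []
    for t, value in enumerate(stream):
        if len(recent) == recent.maxlen:
            baseline = sum(recent) / len(recent)
            if abs(value - baseline) > tolerance:
                alerts.append((t, value))   # anomaly: trigger intervention
        recent.append(value)
    return alerts

heart_rate = [72, 75, 71, 74, 73, 72, 120, 74]
print(monitor(heart_rate))  # [(6, 120)]
```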

c. Population Health Management


Hadoop ecosystem components enable healthcare providers to aggregate and analyse population-
level data to identify trends, assess health outcomes, and develop targeted interventions for improving
public health. MapReduce and Spark can be used to process large datasets, while tools like Hive facilitate
querying and analysis of structured healthcare data.

d. Drug Discovery and Development


In pharmaceutical research, Hadoop ecosystem components can accelerate the drug discovery
process by analysing large datasets of chemical compounds, genetic information, and clinical trial data.
Spark's in-memory processing capabilities are particularly useful for running complex algorithms and
simulations to identify potential drug candidates.

4. Conclusion

This study shows how the Hadoop ecosystem transforms digital marketing and healthcare with big
data. Components like HDFS, Spark, PIG, and HIVE enable real-time data processing and advanced
analytics. This empowers businesses to personalize marketing, optimize strategies, and improve healthcare
systems. Integration with Solr and Lucene enhances search and indexing. By leveraging Hadoop,
organizations can unlock insights, make data-driven decisions, and drive innovation.
