Pushpa Singh · Asha Rani Mishra · Payal Garg
Editors

Data Analytics and Machine Learning
Navigating the Big Data Landscape
Studies in Big Data
Volume 145
Series Editor
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
The series “Studies in Big Data” (SBD) publishes new developments and advances in
the various areas of Big Data, quickly and with high quality. The intent is to cover the
theory, research, development, and applications of Big Data, as embedded in the fields
of engineering, computer science, physics, economics, and the life sciences. The books of
the series refer to the analysis and understanding of large, complex, and/or distributed
data sets generated from recent digital sources, coming from sensors or other physical
instruments as well as simulations, crowdsourcing, social networks, or other internet
transactions such as emails or video click streams. The series contains
monographs, lecture notes, and edited volumes in Big Data spanning the areas of
computational intelligence, including neural networks, evolutionary computation,
soft computing, fuzzy systems, artificial intelligence, data mining, modern
statistics, operations research, and self-organizing systems. Of particular
value to both the contributors and the readership are the short publication timeframe
and the worldwide distribution, which enable both wide and rapid dissemination of
research output.
The books of this series are reviewed in a single blind peer review process.
Indexed by SCOPUS, EI Compendex, SCIMAGO and zbMATH.
All books published in the series are submitted for consideration in Web of Science.
Pushpa Singh · Asha Rani Mishra · Payal Garg
Editors
Payal Garg
GL Bajaj Institute of Technology & Management
Greater Noida, Uttar Pradesh, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2024
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface
Intended Audience
Students and Academics: Students pursuing degrees in fields like data science,
computer science, business analytics, or related disciplines, as well as academics
conducting research in these areas, form a significant primary audience for books on
Data Analytics, Big Data, and Machine Learning.
Data Analysts and Data Scientists: These professionals are directly involved in
working with data, analyzing it, and deriving insights. They seek books that provide
in-depth knowledge, practical techniques, and advanced concepts related to data
analytics, big data, and machine learning.
Business and Data Professionals: Managers, executives, and decision-makers
who are responsible for making data-driven decisions in their organizations often
have a primary interest in understanding how Data Analytics, Big Data, and Machine
Learning can be leveraged to gain a competitive advantage.
This book has sixteen chapters covering big data analytics, machine learning, and
deep learning. The first two chapters provide an introductory discussion of Data
Analytics, Big Data, Machine Learning, and the life cycle of Data Analytics.
Chapters 3 and 4 explore the building of predictive models and their application
in the field of agriculture. Further, Chapter 5 comprises a brief assessment of
stream architecture and the analysis of big data. Chapter 6 leverages a data
analytics and deep learning framework for image super-resolution techniques, and
Chapter 7 enhances the potential of data analytics and time series analysis for
price prediction. Since “R” is a powerful statistical programming language widely
used for statistical analysis, data visualization, and machine learning, Chapter 8
emphasizes “Practical Implementation of Machine Learning Techniques and Data
Analytics using R”. Deep learning models excel in feature learning, enabling the
automatic extraction of valuable information from huge data sets; hence, Chapter 9
presents deep learning techniques in big data analytics. Chapter 10 deals with how
organizations and their professionals must meticulously put effort towards building
data ethically and ensuring its privacy. Chapters 11 and 12 present modern,
real-world applications of data analytics, machine learning, and big data, with
instances from projects, case studies, and real-world scenarios that create positive
and negative impacts on individuals and society. Taking one step further, Chapter 13
unlocks insights by exploring data analytics and AI tool performance across
industries. Lung nodule segmentation using machine learning and deep learning is
discussed in Chapter 14, which highlights the importance of deep learning in the
healthcare industry to support health analytics. Chapter 15 describes the
convergence of data analytics, big data, and machine learning: applications,
challenges, and future directions. Finally, Chapter 16 discusses how the integration
of Data Analytics, Machine Learning, and Big Data can transform any business.
Youddha Beer Singh, Aditya Dev Mishra, Mayank Dixit, and Atul Srivastava
Abstract Data has become the main driver behind innovation, decision-making,
and the transformation of many sectors and societies in the modern period. The dynamic
trinity of Data Analytics, Big Data, and Machine Learning is thoroughly introduced
in this chapter, which also reveals their profound significance, intricate relation-
ships, and transformational abilities. Data analytics is the fundamental layer of
data processing: data must be carefully examined, cleaned, transformed, and modelled
in order to reveal patterns, trends, and insightful information. Big data sparks a
data-driven revolution: in our highly linked world, data is produced in enormous
volume, variety, velocity, and veracity. The third pillar, machine learning, uses
data-driven algorithms to enable automated prediction and decision-making. This
chapter explores the key methods and equipment needed to fully utilise the power
of data analytics, and also discusses the technologies used in big data management,
processing, and insight extraction. A foundation is set for a thorough investigation of
these interconnected realms in the chapters that follow. Data analytics,
big data, and machine learning are not distinct ideas; rather, they are woven into the
fabric of modern innovation and technology. This chapter serves as the beginning
of this captivating journey, providing a solid understanding of and insight into the
enormous possibilities of data-driven insights and wise decision-making.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 1
P. Singh et al. (eds.), Data Analytics and Machine Learning, Studies in Big Data 145,
https://doi.org/10.1007/978-981-97-0448-4_1
2 Y. B. Singh et al.
1 Introduction
Data has become the foundation of modern society in the era of information, changing
the way we work, live, and engage with the outside world. Three crucial domains—
Data Analytics, Big Data, and Machine Learning—are at the centre of this revolu-
tionary environment, which has emerged from the convergence of data, technology,
and creativity. These interconnected sectors provide insights, forecasts, and solu-
tions that cut across industries, from healthcare and banking to transportation and
entertainment, and together they constitute the foundation of data-driven decision-making. Comprehending their complexities and realising their potential is not only necessary to remain competitive in today's fast-paced world; it also opens the door to groundbreaking innovation and the advancement of society.
As the fundamental layer, data analytics enables us to convert unprocessed data
into meaningful insights. It uncovers secrets by methodically examining and inter-
preting patterns, shedding light on the way ahead. It propels optimisation, informs
tactics, and directs us towards more intelligent decisions. Our solution to the ever-
increasing volumes, velocities, types, and complexity of data is Big Data, a revolu-
tionary paradigm change. It makes data management, archiving, and analysis possible
on a scale never before possible. Big Data technology has made it possible for busi-
nesses to access the vast amount of information that is hidden within this torrent of
data. The pinnacle of data science, machine learning, is what we’ve been searching
for—intelligent, automated decision-making. These algorithms have revolutionised
our ability to recognise patterns, make predictions, and even replicate human cogni-
tive capabilities. They are inspired by the ability of humans to learn, adapt, and
evolve. Machine learning is now the basis for many cutting-edge applications, such
as customised healthcare and driverless cars. By using data to produce useful insights
and make wise decisions, data analytics transforms industries. Data analytics is
the process of identifying patterns, trends, and correlations in large datasets using
advanced algorithms and tools. This helps organisations anticipate market trends,
analyse customer behaviour, and optimise their operations. The application of data
analytics enables organisations across industries to generate value, drive growth, and
maintain competitiveness in today’s data-driven world. Benefits include improved
operational efficiency, better strategic planning, fostering innovation, and the ability
to provide personalised experiences.
In addition to being a theoretical voyage, this investigation into the trio of data
analytics, big data, and machine learning also serves as a useful manual for navi-
gating the always-changing data landscape. Deeper exploration of these fields reveals
opportunities to spur innovation, advance progress, and improve people’s quality of
life both individually and as a society. We set off on an educational journey in the
ensuing chapters, learning about the ideas, practices, and real-world applications of
these domains of transformation. The following pages hold the potential to provide
a glimpse into a future in which data is king and the ability to glean knowledge from
it opens up a limitless array of opportunities.
The following factors make the current study significant:
Introduction to Data Analytics, Big Data, and Machine Learning 3
• The purpose of this study is to help IT professionals and researchers choose the
best big data tools and methods for efficient data analytics.
• It is intended to give young researchers insightful information so they can make
wise decisions and significant contributions to the scientific community.
• The outcomes of the study will act as a guide for the development of methods and
resources that combine cognitive computing with big data.
The following is a summary of this study’s main contributions:
• A thorough and in-depth evaluation of prior state-of-the-art studies on Machine
Learning (ML) approaches in Big Data Analytics (BDA).
• A succinct summary of the key characteristics of the compared machine learning
(ML) and big data analytics (BDA) approaches.
• A brief discussion of the challenges and future directions in data analytics and
big data with ML.
The remaining sections are arranged as follows: an overview of data analytics
is presented in Sect. 2. Section 3 presents a detailed discussion of big data, whereas
Sect. 4 describes how machine learning algorithms are used in big data analytics.
Section 5 covers challenges and future directions in data analytics, big data, and
machine learning. In Sect. 6, we finally bring the study to a close.
2 Data Analytics
Data analytics has become a transformational force in the information era, showing
the way to efficient, innovative, and well-informed decision-making. Our ever-
growing digital environment creates an ocean of data, and being able to use this
wealth of knowledge has become essential for success on both an individual and
organisational level. By transforming raw data into useful insights, data analytics, the
methodical analysis and interpretation of data, enables us to successfully navigate this
enormous ocean of information. Fundamentally, data analytics is a dynamic process
that makes use of a range of methods and instruments to examine, purify, organise,
and model data. Data analysts find patterns, trends, and correlations through these
methodical activities that are frequently invisible to the unaided eye [1]. This trans-
lates into the corporate world as strategy optimisation, operational improvement,
and opportunity discovery. Data analytics is not constrained to particular applications
or industries. Its reach extends across a variety of sectors, including marketing,
sports, healthcare, and finance. Data analytics is essential for many tasks, including
seeing patterns in consumer behaviour, streamlining healthcare services, forecasting
changes in the financial markets, and even refining sports tactics.
The emergence of sophisticated technology and increased processing capacity has
opened up new possibilities for data analytics. Data analytics can now do predictive
Gather information directly from the source first. Then, work with and improve the
data to make sure it works with your downstream systems, converting it to the format
of your choice. The produced data should be kept in a data lake or data warehouse
so that it can be used for reporting and analysis or as a long-term archive [3]. Make
use of analytics tools to look through the data and draw conclusions. Data analytics
processes are shown in Fig. 1.
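The capture, process, store, and analyse stages described above can be sketched in miniature in pure Python. This is a conceptual sketch only: the record fields and function names are invented for illustration, and a real pipeline would use services such as a message queue for capture and a data warehouse for storage.

```python
# Conceptual sketch of the four pipeline stages: capture, process, store, analyse.
# All names here are illustrative, not taken from any real system.

def capture():
    """Stage 1: gather raw records directly from the source."""
    return [
        {"user": "a", "amount": "10.5"},
        {"user": "b", "amount": "3.0"},
        {"user": "a", "amount": "7.25"},
    ]

def process(records):
    """Stage 2: convert records to the format downstream systems expect."""
    return [{"user": r["user"], "amount": float(r["amount"])} for r in records]

def store(records, warehouse):
    """Stage 3: persist processed records (here, an in-memory 'warehouse')."""
    warehouse.extend(records)

def analyse(warehouse):
    """Stage 4: derive an insight, e.g. total spend per user."""
    totals = {}
    for r in warehouse:
        totals[r["user"]] = totals.get(r["user"], 0.0) + r["amount"]
    return totals

warehouse = []
store(process(capture()), warehouse)
print(analyse(warehouse))  # {'a': 17.75, 'b': 3.0}
```

The point of the sketch is the separation of stages: each can be scaled or replaced independently, which is exactly what the managed services discussed below provide.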
Data Capturing: You have a few choices for data capturing, depending on where
your data comes from:
• Data Migration Tools: To move data from one cloud platform or on-premises
to another, use data migration tools. For this reason, Google Cloud provides a
Storage Transfer Service.
• API Integration: Use APIs to retrieve data from outside SaaS providers and transfer
it to your data warehouse. A data transfer service is offered by Google Cloud's
serverless data warehouse BigQuery to facilitate the easy import of data from
sources such as Teradata, Amazon S3, Redshift, YouTube, and Google Ads.
• Real-time Data Streaming: Use the Pub/Sub service to get data in real-time from
your applications. Set up a data source to send event messages to Pub/Sub so that
subscribers can process them and respond accordingly.
• IoT Device Integration: Google Cloud IoT Core, which supports the MQTT
protocol for IoT devices, allows your IoT devices to broadcast real-time data.
IoT data can also be sent to Pub/Sub for additional processing.
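The publish/subscribe pattern that underlies the Pub/Sub service can be illustrated with a small in-memory sketch. This stands in for, and greatly simplifies, the real client library; the class name and message fields are invented for illustration.

```python
# In-memory sketch of the publish/subscribe pattern that services such as
# Cloud Pub/Sub implement at scale. Subscribers register callbacks on a
# topic; publishing an event message invokes every registered callback.

class Topic:
    def __init__(self, name):
        self.name = name
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, message):
        for callback in self.subscribers:
            callback(message)

clicks = Topic("clicks")
received = []
clicks.subscribe(received.append)           # e.g. a stream processor
clicks.subscribe(lambda m: None)            # e.g. an archival consumer
clicks.publish({"page": "/home", "ts": 1})
clicks.publish({"page": "/cart", "ts": 2})
print(len(received))  # 2
```

The design point is decoupling: the data source publishes once, and any number of subscribers process the event independently and at their own pace.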
Processing Data: The critical next step after data ingestion is data enrichment, or
processing, to get the data ready for the systems that come after. Three main features
in Google Cloud make this procedure easier:
Data visualisation: There are many different data visualisation tools available,
and most of them include a BigQuery connector so you can quickly generate charts with
the tool of your choice. A few tools that Google Cloud offers are worth a look. In
addition to connecting to BigQuery, Data Studio is free and offers quick data
visualisation through connections to numerous other services; charts and dashboards
can be shared very easily, especially if you have experience with Google Drive.
Looker is an enterprise platform for embedded analytics, data applications, and
business intelligence [4].
3 Big Data
With global data expected to rise exponentially to 180 ZB by 2025, data will
play a pivotal role in propelling twenty-first-century growth and transformation,
forming a new “digital universe” that will alter markets and businesses [4]. The
“Big Data” age has begun with this flood of digital data that comes from several
complex sources [5]. Large datasets that are too large for traditional software tools
to handle, store, organise, and analyse are referred to as big data [6]. The range of
heterogeneity and complexity displayed by these datasets goes beyond their mere
quantity. They include structured, semi-structured, and unstructured data, as well as
operational, transactional, sales, marketing, and a variety of other data types. Big
data also includes data in a variety of types, such as text, audio, video, photos, and
more. Interestingly, the category of unstructured data is growing more quickly than
structured data and is making up almost 90% of all data [7]. As such, it is critical
to investigate new processing capacities in order to derive data-driven insights that
facilitate improved decision-making.
Doug Laney's idea of big data, referenced in Refs. [7–9], is frequently described
by the three Vs: volume, velocity, and variety. However, a number of
studies [8] have extended this idea to include five essential qualities (5Vs): volume,
velocity, variety, value, and veracity, as shown in Fig. 2. As technology advances,
data storage capacity, data transfer rates, and system capabilities change, so does the
notion of big data [9]. The first “V” stands for volume and represents the exponential
growth in data size over time [5], with electronic medical records (EMRs) being a
major source of data for the healthcare sector [9]. The second “V” stands for velocity,
which describes the rate at which information is created and gathered in a variety of
businesses.
From Fig. 2, it is clear that big data is often characterised by the five Vs:
volume, velocity, variety, value, and veracity.
Volume: Volume is the total amount of data, and it has significantly increased as a
result of the widespread use of sensors, Internet of Things (IoT) devices, linked smart-
phones, and ICTs (information and communication technologies), including artificial
intelligence (AI). With data generation exceeding Moore’s law, this data explosion
has produced enormous datasets that go beyond conventional measurements and
introduce terms like exabytes, zettabytes, and yottabytes.
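These units can be made concrete with a quick calculation. Decimal prefixes (powers of 1000) are assumed here, and the helper name is illustrative; the 180 ZB projection cited above is expressed in bytes for comparison.

```python
# Decimal data-size units, each 1000x the previous one.

UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def to_bytes(value, unit):
    """Convert a value in the given unit to bytes (decimal, powers of 1000)."""
    return value * 1000 ** UNITS.index(unit)

print(to_bytes(180, "ZB"))   # 180 * 10**21 bytes
print(to_bytes(1, "YB") == 1000 * to_bytes(1, "ZB"))  # True
```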
Introduction to Data Analytics, Big Data, and Machine Learning 7
Velocity: What sets big data apart is the rapid creation of data from linked devices
and the internet, delivered to businesses in real time. Businesses can benefit
greatly from this rapid inflow of data, since it gives them the ability to move quickly,
become more agile, and obtain a competitive advantage. While some businesses have
previously harnessed big data for customer recommendations, today’s enterprises are
leveraging big data analytics to not only analyse but also act upon data in real time.
Variety: The period of Web 3.0 is characterised by diversity in data creation
sources and formats; the growth of social media and the internet has produced a
wide range of data types. These include text messages, status updates,
images, and videos posted on social media sites like Facebook and Twitter, SMS
messages, GPS signals from mobile devices, client interactions in online banking
and retail, contact centre voice data, and more. The constant streams of data from
mobile devices that record the location and activity of people are among the many
important sources of big data that are relatively new. In addition, a variety of online
sources provide data via social media interactions, click-streams, and logs.
Value: The application of big data can provide insightful information, and the data
analysis process can benefit businesses, organisations, communities, and consumers
in significant ways.
Veracity: Data accuracy and dependability are referred to as data veracity. In
cases when there are discrepancies or errors in the data collection process, veracity
measures the degree of uncertainty and dependability surrounding the information.
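As a rough illustration, three of the five Vs can be estimated mechanically from a sample of records; value and veracity generally require domain judgement, so they are omitted. All names and records in this sketch are invented for illustration.

```python
import json

# Crude profile of a record sample along three of the five Vs.
# Volume: total serialised size; Velocity: records per second over a
# capture window; Variety: number of distinct record shapes (key sets).

def profile(records, window_seconds):
    volume_bytes = sum(len(json.dumps(r)) for r in records)
    velocity = len(records) / window_seconds
    variety = len({frozenset(r) for r in records})
    return {"volume_bytes": volume_bytes, "velocity": velocity, "variety": variety}

sample = [
    {"user": "a", "amount": 10},       # transactional record
    {"user": "b", "amount": 3},
    {"text": "great product!"},        # unstructured text record
]
p = profile(sample, window_seconds=2)
print(p["velocity"], p["variety"])  # 1.5 2
```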
Big data provides businesses with a plethora of opportunities to boost effi-
ciency and competitiveness. It includes the ongoing collection of data as well as
the necessary technologies for data management, storage, collection, and analysis.
This paradigm change has altered fundamental aspects of organisations as well as
management. Big data is an essential tool that helps businesses discover new informa-
tion, provide value, and spur innovation in markets, procedures, and goods. Because
of this, data has become a highly valued resource, emphasising to business executives
the significance of adopting a data-driven strategy [10]. Businesses have accumu-
lated data for many years, but the current trend is more towards active data analysis
8 Y. B. Singh et al.
4 Machine Learning
Algorithms for machine learning (ML) have become popular for modelling, visual-
ising, and analysing large datasets. With machine learning (ML), machines can learn
from data, extrapolate their findings to unknown information, and forecast outcomes.
Various literature attests to the effectiveness of ML algorithms in a variety of applica-
tion domains. Based on the literature that is now accessible, machine learning can be
divided into four main classes: reinforcement learning, supervised learning, unsu-
pervised learning, and semi-supervised learning. Numerous open-source machine
learning methods are available for a range of applications, including ranking, dimensionality
reduction, clustering, regression, and classification. Singular-Value Decomposition
(SVD), Principal Component Analysis (PCA), Radial Basis Function Neural
Network (RBF-NN), K-Nearest Neighbours (KNN), Hidden Markov Model (HMM),
Decision Tree (DT), Naive Bayes (NB), Tensor Auto-Encoder (TAE), and Ensemble
Learning (EL) are a few notable examples [12–16].
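Of the algorithms listed, KNN is simple enough to sketch in a few lines. The following is a minimal pure-Python k-nearest-neighbours classifier, for illustration only; the training points are invented.

```python
import math

# Minimal k-nearest-neighbours classifier: predict the majority label
# among the k training points closest to the query (Euclidean distance).

def knn_predict(train, query, k=3):
    """train: list of (features, label) pairs; query: a feature tuple."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

train = [((0, 0), "low"), ((0, 1), "low"), ((1, 0), "low"),
         ((5, 5), "high"), ((5, 6), "high"), ((6, 5), "high")]
print(knn_predict(train, (0.5, 0.5)))  # low
print(knn_predict(train, (5.5, 5.5)))  # high
```

Real deployments on large datasets replace the linear scan with an index structure, but the prediction rule is the same.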
Machine learning is essential to big data and data analytics because it provides
strong tools and methods for deriving valuable insights from enormous and intricate
datasets. The following are some important ways that big data and data analytics
benefit from machine learning:
In conclusion, machine learning enhances big data and data analytics by offering
complex algorithms and methods for finding patterns, automating processes, and
generating predictions, all of which lead to better decision-making.
Data Analytics provides an initial understanding of the data, points out important
variables, and aids in formulating the questions that require investigation. This
knowledge is crucial for defining tasks and formulating problems in the larger
context of data analysis.
Big Data: Large-Scale Data Management and Processing: Big Data comes into
play to solve the problem of handling the enormous volume, high velocity, wide variety,
and veracity of data, while Data Analytics sheds light on the possibilities of data.
These data avalanches often exceed the processing power of conventional
data management systems. To meet this challenge head-on, big data technologies
like Hadoop, Spark, and NoSQL databases have evolved. They provide the tools
and infrastructure required to handle, store, and process data on a never-before-
seen scale. Big Data processing outcomes, which are frequently aggregated or pre-
processed data, interact with data analytics when they are used as advanced analytics
inputs. Moreover, businesses can benefit from data sources they may have previously
overlooked thanks to the convergence of big data and data analytics. The interaction
improves the ability to make decisions based on data across a wider range.
Machine Learning: Intelligence Automation: While Big Data manages massive
amounts of data and Data Analytics offers insights, Machine Learning elevates the
practice of data-driven decision-making by automating intelligence. Without explicit
programming, machine learning techniques allow systems to learn from data, adjust
to changing circumstances, and make predictions or judgements. Machine Learning is
frequently the final stage in the interaction. It makes use of Big Data’s data processing
power and the insights gleaned from Data Analytics to create prediction models, iden-
tify trends, and provide wise solutions. Machine learning depends on the knowledge
generated and controlled by the first two components to perform tasks like automating
image recognition, detecting fraud, and forecasting client preferences. The key to
bringing the data to life is machine learning, which offers automation and predictive
capability that manual analysis would not be able to provide [17–20].
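Fraud detection, mentioned above, can be hinted at with the simplest possible scheme: flag transactions that deviate far from the mean of past amounts. This toy z-score rule is only a sketch, not a production fraud model, and the data is invented.

```python
import statistics

# Toy anomaly rule for fraud screening: flag any amount more than
# `threshold` standard deviations away from the mean of past amounts.

def flag_anomalies(history, candidates, threshold=3.0):
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    return [x for x in candidates if abs(x - mean) > threshold * stdev]

history = [20, 22, 19, 21, 20, 23, 18, 21]  # typical past amounts
print(flag_anomalies(history, [21, 25, 500]))  # [500]
```

Real fraud systems learn far richer patterns (merchant, time, location, sequence), but the principle is the same: a model of "normal" learned from data, applied automatically to each new event.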
Within the data science and analytics ecosystem, the interaction between Data
Analytics, Big Data, and Machine Learning is synergistic. Organisations can fully
utilise data when Data Analytics lays the foundation, Big Data supplies the required
infrastructure, and Machine Learning automates intelligence. This convergence
provides a route to innovation, efficiency, and competitiveness across multiple indus-
tries and is at the core of contemporary data-driven decision-making. A thorough
understanding of this interaction is necessary for anyone looking to maximise the
potential of data. The promise of data-driven insights and wise decision-making
is realised when these three domains work harmoniously together. The current
study analysed earlier research on big data analytics and machine learning in
data analytics. Measuring the association between big data analytics keywords and
machine learning terms was the goal. Research articles commonly use data analytics,
big data analytics, and machine learning, as seen in Fig. 5.
From Fig. 5, it is clear that there is a strong correlation between the keywords
used by various data analytics experts and the combination of data, data analytics,
big data, big data analytics and machine learning.
Fig. 5 Most trending keywords in data analytics, big data, and machine learning
• Data quality and cleaning: ensuring the precision and dependability of data
sources, and cleaning and preparing data to remove errors and inconsistencies.
• Data security and privacy: preserving data integrity, protecting private
information from breaches, and conforming to privacy laws.
• Data integration: combining information from various formats and sources to
produce a single dataset for analysis.
• Scalability: The ability to manage massive data volumes and make sure data
analytics procedures can expand as data quantities increase.
• Real-time data processing involves data analysis and action in real-time to enable
prompt decision-making and response.
• Complex Data Types: Handling multimedia, text, and other unstructured and semi-
structured data.
• Data Visualisation and Exploration: Producing insightful visualisations and
efficiently examining data to draw conclusions.
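Several of these challenges (integration, quality, and cleaning) can be illustrated at toy scale. The schemas and field names below are invented for the example.

```python
# Toy data-cleaning pass addressing three of the challenges above:
# integrate records from two differently-shaped sources, drop rows with
# missing values, and de-duplicate on a key field.

def integrate(source_a, source_b):
    """Map both source formats onto one schema: {'id', 'email'}."""
    unified = [{"id": r["id"], "email": r["email"]} for r in source_a]
    unified += [{"id": r["user_id"], "email": r["mail"]} for r in source_b]
    return unified

def clean(records):
    """Drop incomplete rows, then keep the first record per id."""
    seen, out = set(), []
    for r in records:
        if r["email"] is None:
            continue              # data-quality rule: no missing email
        if r["id"] in seen:
            continue              # de-duplication on the key
        seen.add(r["id"])
        out.append(r)
    return out

a = [{"id": 1, "email": "x@example.com"}, {"id": 2, "email": None}]
b = [{"user_id": 1, "mail": "x@example.com"}, {"user_id": 3, "mail": "y@example.com"}]
result = clean(integrate(a, b))
print([r["id"] for r in result])  # [1, 3]
```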
Organizations must overcome these challenges if they want to use data analytics
efficiently and get insightful knowledge from their data.
• Lack of Knowledge and Awareness: Big data projects may fail because many
businesses don't have a basic understanding of the technology or the advantages it
might offer. It is not uncommon to allocate time and resources inefficiently to new
technology such as big data. Employees are often unaware of the true usefulness of
big data and are therefore reluctant to embrace new procedures, which can
seriously impair business operations.
• Data Quality Management: A major obstacle to data integration is the variety
of data sources (call centres, website logs, social media) that produce data in
different formats, which makes integration difficult. Furthermore, gathering big
data with 100% accuracy is a difficult task. It is imperative to ensure that only
trustworthy data is gathered, as inaccurate or redundant information can make
the data useless for your company. Developing a well-organised big data model is
necessary to improve data quality. Extensive data comparison is also necessary to
find and combine duplicate records and increase the big data model's correctness
and dependability.
• Expensive: Big data project implementation is frequently very expensive for
businesses. If you choose an on-premises solution, you will have to pay developers
and administrators in addition to spending money on new hardware. Even if a lot
of frameworks are open source, there are still costs associated with setup, mainte-
nance, configuration, and project development. On the other hand, a cloud-based
solution necessitates hiring qualified staff for product development and paying
for cloud services. The costs of both solutions are high. Businesses can think
about an on-premises solution for increased security or a cloud-based solution
for scalability when attempting to strike a compromise between flexibility and
security. Some businesses utilise hybrid solutions, keeping sensitive data on-site
while processing it using cloud computing power—a financially sensible option
for some businesses.
• Security Vulnerabilities: Putting big data solutions into practice may leave your
network vulnerable to security flaws. Regrettably, it can be foolish for businesses
to ignore security when they first start big data projects. Although big data tech-
nology is always developing, some security elements are still missing. Prioritising
and improving security measures is crucial for big data ventures.
• Scalability: Big data's fundamental quality is its ongoing expansion over time,
which is both a major benefit and a challenge. Although many businesses try to
remedy this by increasing processing and storage capacity, budgetary restrictions
make it difficult to scale without experiencing performance degradation. A
structured architectural foundation is necessary to overcome this difficulty.
Scalability is guaranteed by a strong architecture, which also minimises
For most organisations, big data is very important since it makes it possible to
efficiently gather and analyse the vital information needed to make well-informed
decisions. Still, there are a number of issues that need to be resolved. Putting in
place a solid architectural framework offers a sound starting point for methodically
addressing these problems.
Big data processing and data analysis present a number of difficulties for machine
learning.
• Quantity and Quality of Data: Machine learning model performance is highly
dependent on the quality and quantity of data. Results might be skewed by noisy or
incomplete data, and processing huge datasets can be computationally demanding.
Mitigation: Using strong data preparation methods and making sure the datasets
are diverse and of good quality improves the accuracy of the model [21].
• Computing Capabilities: Processing large amounts of data in big data settings
requires a significant amount of processing power. Complex machine learning
models might be difficult to train and implement due to resource constraints. Big
data’s computational hurdles can be lessened with the use of distributed computing
frameworks like Apache Spark and cloud computing solutions [21].
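The split-combine pattern that frameworks such as Apache Spark generalise can be sketched in plain Python: each partition of the data is processed independently (the map phase) and the partial results are merged (the reduce phase). This is a toy stand-in using only the standard library, not Spark's actual API:

```python
from functools import reduce

def map_phase(partitions):
    # Each "worker" counts words in its own partition independently.
    return [dict((w, part.split().count(w)) for w in set(part.split()))
            for part in partitions]

def reduce_phase(partials):
    # Merge the per-partition counts into one global word count.
    def merge(acc, counts):
        for word, n in counts.items():
            acc[word] = acc.get(word, 0) + n
        return acc
    return reduce(merge, partials, {})

partitions = ["big data big", "data analytics", "big analytics"]
word_counts = reduce_phase(map_phase(partitions))
```

In Spark the partitions would live on different machines, so the same pattern scales to datasets far beyond one computer's memory.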
• Algorithm Selection: With so many machine learning algorithms available, it can be difficult to select the best one for a particular task, and an inappropriate choice can lead to less-than-ideal performance. Carrying out exhaustive model selection trials and understanding the characteristics of the various algorithms facilitates well-informed decisions [21].
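A model selection trial of the kind suggested above can be sketched with k-fold cross-validation. The candidate "models" and the data here are toy constant predictors (the training mean versus the training median), chosen only to keep the example self-contained:

```python
from statistics import mean, median

def kfold_indices(n, k):
    # Yield (train, test) index lists for k roughly equal folds.
    size = n // k
    for i in range(k):
        test = list(range(i * size, (i + 1) * size if i < k - 1 else n))
        train = [j for j in range(n) if j not in test]
        yield train, test

def cv_error(predict_fn, y, k=3):
    # Average held-out absolute error over a "train on folds, test on holdout" trial.
    errors = []
    for train, test in kfold_indices(len(y), k):
        pred = predict_fn([y[j] for j in train])
        errors.append(mean(abs(pred - y[j]) for j in test))
    return mean(errors)

y = [2.0, 2.1, 1.9, 2.0, 5.0, 2.2]           # toy target values
candidates = {"mean": mean, "median": median}  # two candidate "models"
scores = {name: cv_error(fn, y) for name, fn in candidates.items()}
best = min(scores, key=scores.get)             # pick the lowest-error candidate
```

The candidate with the lowest average held-out error is selected; the same loop works unchanged with real learners.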
• Instantaneous Processing: In order to make timely decisions, many applications
need to process data in real time. In certain situations, traditional machine learning
models might not be the best fit. To mitigate the issues related to time-sensitive
applications, online learning techniques and real-time processing-optimised
models can be used [22].
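Online learning updates the model one observation at a time instead of re-training on the full batch. The simplest instance is an incrementally updated mean, shown below as a sketch of the idea rather than a production streaming system:

```python
class RunningMean:
    """Updates its estimate in O(1) per observation - no batch re-scan."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n  # incremental mean formula
        return self.mean

stream = [10.0, 12.0, 11.0, 13.0]  # toy data arriving one point at a time
rm = RunningMean()
for x in stream:
    rm.update(x)
```

Each update costs constant time and memory, which is what makes online methods attractive for time-sensitive applications.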
• Explainability and Interpretability: Interpretability is frequently a problem with machine learning models, particularly sophisticated ones such as deep neural networks. It is critical to understand the reasoning behind model decisions, especially in sensitive domains. To improve comprehension, interpretable models should be created, simpler algorithms should be used whenever possible, and model explanation techniques should be included [22].
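One concrete sense in which simpler algorithms aid interpretability: a linear model exposes one weight per feature that can be read directly. Below, a one-feature least-squares fit on made-up data recovers a slope of 3, which a stakeholder can read as "one unit of x adds about three units of y":

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = a*x + b (one feature, closed form).
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# Toy data: y is exactly 3*x + 1, so the learned weight is interpretable.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [4.0, 7.0, 10.0, 13.0]
weight, bias = fit_line(xs, ys)
```

A deep network fitted to the same data would offer no such directly readable coefficient.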
Introduction to Data Analytics, Big Data, and Machine Learning 15
In data analysis, big data, and machine learning, future directions include developing interpretability, addressing ethical issues, investigating new applications in various fields, and improving algorithms for better predictive performance on dynamic and complicated datasets. The limitations of this work can likewise be read as directions for future research.
• AI Integration: As machine learning and data analytics become more integrated
with artificial intelligence (AI), more sophisticated and self-governing data-driven
decision-making processes will be possible.
• Ethical AI: Due to legal constraints and public expectations, there will be a growing emphasis on ethical AI and on mitigating bias in machine learning algorithms.
• Explainable AI: Clear and understandable AI models are becoming more and
more necessary, especially in fields like finance and healthcare where it’s critical
to comprehend the decisions made by the algorithms.
• Edge and IoT Analytics: As IoT devices, edge computing, and 5G technologies
proliferate, real-time processing at the network’s edge will become more important
in data analytics, facilitating quick insights and decision-making.
• Quantum Computing: As this technology matures, it will provide new avenues for tackling previously intractable data analytics and machine learning problems.
• Data Security and Privacy: As the globe becomes more networked and rules
become stronger, there will be a greater emphasis on data security and privacy.
• Experiential Analytics: Businesses will utilise data analytics to personalise marketing campaigns, products, and services, ultimately improving consumer experiences.
• Automated Machine Learning (AutoML): As AutoML platforms and tech-
nologies proliferate, machine learning will become more widely available and
accessible to a wider range of users, democratizing the field.
Any company that wants to embrace big data must have board members who are knowledgeable about its foundations. Businesses can help close this knowledge gap by providing workshops and training sessions for staff members, making sure they understand the benefits of big data. While monitoring staff development is a good strategy, care is needed so that it does not have a detrimental effect on productivity.
8 Conclusion
This chapter has covered a basic grasp of the interrelated fields of Data Analytics, Big Data, and Machine Learning. It draws attention to their increasing significance across a variety of industries and their capacity to change how decisions are made. For anyone wishing to venture into the realm of data-driven insights and innovation, the discussion of essential ideas, terminology, and practical implementations acts as a springboard. Because these fields are dynamic and ever-evolving, bringing a multitude of opportunities and problems, continued learning and adaptation are needed to remain at the vanguard of this data-driven era. The exploration of Data Analytics, Big Data, and Machine Learning is expected to yield significant benefits as we explore uncharted territories of knowledge and seize the opportunity to shape a future replete with data.
References
1. Sivarajah, U., Kamal, M.M., Irani, Z., Weerakkody, V.: Critical analysis of big data challenges
and analytical methods. J. Bus. Res. 70, 263–286 (2017)
2. Lavalle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N.: Big data, analytics and
the path from insights to value. MIT Sloan Manag. Rev. 52(2), 3–22 (2010)
3. Jaseena, K.U., David, J.M.: Issues, challenges, and solutions: big data mining. CS & IT-CSCP
4(13), 131–140 (2014)
4. Sun, Z.H., Sun, L.Z., Strang, K.: Big data analytics services for enhancing business intelligence.
J. Comput. Inf. Syst. 58(2), 162–169 (2018)
5. Debortoli, S., Muller, O., vom Brocke, J.: Comparing business intelligence and big data skills.
Bus. Inf. Syst. Eng. 6(5), 289–300 (2014)
6. Sarkar, B.K.: Big data for secure healthcare system: a conceptual design. Complex Intell. Syst.
3(2), 133–151 (2017)
7. Zakir, J., Seymour, T., Berg, K.: Big data analytics. Issues Inf. Syst. 16(2), 81–90 (2015)
8. Raja, R., Mukherjee, I., Sarkar, B.K.: A systematic review of healthcare big data. Sci. Program.
2020, 5471849 (2020)
9. Tsai, C.W., Lai, C.F., Chao, H.C., Vasilakos, A.V.: Big data analytics: a survey. J. Big Data
2(1), 21 (2015)
10. Website. https://www.news.microsoft.com/europe/2016/04/20/go-bigger-with-big-data/sm.
0008u654e19yueh0qs514ckroeww1/XmqRHQB1Gcmde4yb.97. Accessed 15 June 2017
11. McAfee, A., Brynjolfsson, E.: Big data: the management revolution. Harv. Bus. Rev. 90(10)
60–66, 68, 128 (2012)
12. Chen, M., Hao, Y.X., Hwang, K., Wang, L., Wang, L.: Disease prediction by machine learning
over big data from healthcare communities. IEEE Access 5, 8869–8879 (2017)
13. Zuo, R.G., Xiong, Y.H.: Big data analytics of identifying geochemical anomalies supported by
machine learning methods. Nat. Resour. Res. 27(1), 5–13 (2018)
14. Zhang, C.T., Zhang, H.X., Qiao, J.P., Yuan, D.F., Zhang, M.G.: Deep transfer learning for intel-
ligent cellular traffic prediction based on cross-domain big data. IEEE J. Sel. Areas Commun.
37(6), 1389–1401 (2019)
15. Triantafyllidou, D., Nousi, P., Tefas, A.: Fast deep convolutional face detection in the wild
exploiting hard sample mining. Big Data Res. 11, 65–76 (2018)
16. Singh, Y.B., Mishra, A.D., Nand, P.: Use of machine learning in the area of image analysis and
processing. In: 2018 International Conference on Advances in Computing, Communication
Control and Networking (ICACCCN), pp. 117–120. IEEE (2018)
17. Singh, Y.B.: Designing an efficient algorithm for recognition of human emotions through
speech. PhD diss., Bennett University (2022)
18. Nallaperuma, D., Nawaratne, R., Bandaragoda, T., Adikari, A., Nguyen, S., Kempitiya, T., De
Silva, D., Alahakoon, D., Pothuhera, D.: Online incremental machine learning platform for
big data-driven smart traffic management. IEEE Trans. Intell. Transp. Syst. 20(12), 4679–4690
(2019)
19. Xian, G.M.: Parallel machine learning algorithm using fine-grained-mode spark on a mesos
big data cloud computing software framework for mobile robotic intelligent fault recognition.
IEEE Access 8, 131885–131900 (2020)
20. Li, M.Y., Liu, Z.Q., Shi, X.H., Jin, H.: ATCS: Auto-tuning configurations of big data
frameworks based on generative adversarial nets. IEEE Access 8, 50485–50496 (2020)
21. Mishra, A.D., Singh, Y.B.: Big data analytics for security and privacy challenges. 2016 Interna-
tional Conference on Computing, Communication and Automation (ICCCA), Greater Noida,
India, pp. 50–53. (2016). https://doi.org/10.1109/CCAA.2016.7813688
22. The 2 types of data strategies every company needs. In: Harvard Business Review, 01 May
2017. https://hbr.org/2017/05/whats-your-data-strategy. Accessed 18 June 2017
Fundamentals of Data Analytics
and Lifecycle
Abstract This chapter gives a brief overview of the fundamentals and lifecycle of data analytics. It surveys data analytics systems, a foundation of the present stage of technology, and details widely used tools such as Power BI and Tableau for developing data analytics systems. Traditional analysis differs from big data analysis in the volume and variety of data processed. To meet these requirements, several stages are needed to organize the activities involved in the acquisition, processing, reuse, and analysis of the given data. The data analysis lifecycle helps to manage and organize the tasks connected to big data research and analysis. The evolution of data analytics through big data analytics, SQL analytics, and business analytics is explained. Furthermore, the chapter outlines the future of data analytics by leveraging its fundamental lifecycle and elucidates various data analytics tools.
1 Introduction
In the field of Data Science, Data Analytics is the key component used to analyze data and bring out information that solves issues [1] in problem-solving across different domains and industries [2]. Before moving ahead, we should examine the words data and analytics, from which the term data analytics is formed, as shown in Fig. 1.
Data analytics is the process of examining, cleaning, transforming, and inter-
preting data to discover valuable insights, patterns, and trends that can inform
decision-making [3]. It plays a crucial role in a wide range of fields, including busi-
ness, science, healthcare, and more. Whenever data analytics is discussed, we hear
R. Sharma
Department of Computer Science and Engineering, Ajay Kumar Garg Engineering College,
Ghaziabad, India
P. Garg (B)
Department of Computer Science and Engineering, GL Bajaj Institute of Technology and
Management, Greater Noida, India
e-mail: payalgarg.cs@gmail.com
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 19
P. Singh et al. (eds.), Data Analytics and Machine Learning, Studies in Big Data 145,
https://doi.org/10.1007/978-981-97-0448-4_2
about data analysis, and the two terms are often used interchangeably, although data analytics is really about the techniques and tools used to perform data analysis. In other words, data analysis is a subset of data analytics that covers data cleansing, examination, transformation, and modeling to arrive at conclusions.
Over the years, analytics has helped many projects by providing answers to questions such as the following [4]:
• What is going to happen next?
• Why did it happen?
• How did it happen?
Analytics is the process of drawing inferences with the help of methods and systems; in other words, it is the process of turning data into information.
Traditional data analytics refers to the conventional methods and techniques used to analyze and interpret data to extract insights and support decision-making [5]. In traditional data analytics, Excel, tables, charts, graphs, hypothesis testing, and basic statistical measures are used for analysis. Dashboards built this way are static in nature and cannot adapt to changes in the business.
We discussed how data analysis and data analytics are used interchangeably and what data analysis covers. Data analytics itself is broadly divided into the following three categories:
1. Descriptive analytics—describes what has happened at a given point in time.
2. Predictive analytics—determines the likelihood of future outcomes.
3. Prescriptive analytics—suggests actions to accomplish the desired conclusions.
As the name suggests, descriptive analytics describes the data in a way that can be understood easily. Most scenarios cover past, present, or historical data, and the summaries it produces can later inform views of future outcomes. In descriptive analytics, statistical methods such as percentage, sum, and average are used. The key characteristics of descriptive analytics are shown in Fig. 2. Examples include sales reports, financial statements, and inventory analysis.
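The three descriptive measures just named (percentage, sum, and average) can be computed directly; the monthly sales figures below are hypothetical:

```python
from statistics import mean

# Hypothetical monthly sales figures for a descriptive summary.
sales = [1200, 1500, 900, 1800]

total = sum(sales)                                  # sum
average = mean(sales)                               # average
share = [round(100 * s / total, 1) for s in sales]  # percentage of total
```

A sales report of the kind mentioned above is essentially a table of such summaries.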
Predictive analytics [6] is a form of probability analysis that helps predict future proceedings and supports forthcoming decision-making. It is becoming important for business organizations that want to gain ground in a competitive environment by forecasting future trends and using those forecasts to make data-driven decisions. The key characteristics of this type of analytics are shown in Fig. 3. Examples include sales forecasting, credit risk assessment, predictive maintenance, and demand forecasting.
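A minimal predictive-analytics sketch in this spirit is a moving-average forecast: predict next month's demand from the mean of the last few observations. The demand series is invented for illustration, and real forecasting would use richer models:

```python
def moving_average_forecast(history, window=3):
    """Forecast the next value as the mean of the last `window` points."""
    recent = history[-window:]
    return sum(recent) / len(recent)

# Hypothetical monthly demand; forecast next month from the last 3 months.
demand = [100, 110, 105, 120, 125, 130]
next_month = moving_average_forecast(demand)
```

Even this crude model captures the core idea of predictive analytics: using past data to estimate a future outcome.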
Prescriptive analytics is the type of analytics that suggests actions for decision-making processes. It combines descriptive and predictive analytics to produce recommendations. Its key characteristics are depicted in Fig. 4. Example domains include healthcare, marketing, financial services, and transportation and logistics. Figure 5 summarizes the types of data analytics, with a description and examples for each.
Data can be any facts (binary, text), figures, audio, or video used for analysis. No valuable information can be obtained until analytics is performed on data [7]. In day-to-day life, the general public is heavily dependent on devices. For example, people use maps to reach a place, and those maps use GPS navigation to find the shortest route to a particular point. This is only possible when analytics is performed on data describing the city's landmarks and the roads connecting them. While carrying out such analytics, data can be classified into 3 types [8]:
1. Categorical data
2. Numerical data
3. Ordinal data
Categorical data, also called nominal or qualitative data, has no natural order and no numerical value associated with it. It is used to group observations into sets or classes. Some examples include:
• Marital Status: includes “single”, “married”, etc.
• Colors: include “red”, “green”, etc.
• Gender: includes “female”, “male”, etc.
• Education: includes “high level”, “bachelors”, etc.
Table 1 shows an example of categorical data, listing IP addresses and the class each belongs to. The two classes, IpV4 and IpV6, cannot be used directly for classification, so IpV4 is encoded as class 0 and IpV6 as class 1.
Numerical data, also called quantitative data, consists of numbers that can be measured and used in mathematical calculation and analysis. Examples of numerical data include discrete, continuous, interval, and ratio data.
The components of the architecture described above are specific to each case and vary with the organization's needs, the variety and volume of its data, and its analytical requirements. Organizations may expand the architecture over time as their needs grow [9].
Traditional projects and data analytics projects are different: projects involving data analytics require much more inspection [9]. The data analytics lifecycle therefore comprises a number of steps or stages that organizations and professionals follow to extract valuable outcomes and knowledge from the given data [10]. It encloses the whole procedure of collecting, cleaning, analyzing, and interpreting data to make decisions [11]. Although specific stages and their names can vary, the key steps of the data analytics lifecycle are as follows (shown in Fig. 9):
1. Problem definition
2. Data collection
3. Data cleaning
4. Data exploration
5. Data transformation
6. Visualization and Reporting
Problem definition is the most important aspect of every process, and in the data analysis process it is a crucial first step. A well-defined problem statement guides the analysis and confirms that you are answering the right questions. The major points to keep in mind while defining the problem in data analytics are shown in Fig. 10.
The significance of this step is paramount, as it lays the foundation for the entire analytical process. Defining the problem helps ensure that the data analysis effort is focused, relevant, and valuable. Several reasons make this step crucial in data analysis, such as clarity and precision, scope and boundaries, and goal alignment.
Data collection is an essential step in the process of data analytics: it gathers and obtains data from different sources for use in analysis. Efficient data collection is crucial to ensure that the data we use for analysis is reliable, accurate, and relevant to the problem. Some of the key aspects to consider in data collection are shown in Fig. 11. Data collection is also a fundamental concern for data analysts, and its significance lies in its role as the starting point: it provides the raw material that analysts use to generate insights.
Data cleaning includes identifying and correcting inconsistencies or errors in the data to make sure the data is accurate and complete and thus ready for analysis [7]. It is an iterative process that can require multiple passes; this step ensures that the analysis is based on reliable, high-quality data, which yields more accurate and meaningful insights [12]. The various tasks involved are given below:
1. Handling Missing Data
Depending on the context, rows with missing values are removed or the missing entries are otherwise handled.
2. Outlier Detection and Treatment
Outliers are data points that deviate from the typical patterns in the dataset; we can identify and remove them, or transform them to fall within an acceptable range.
3. Data Type Conversion
Here we make sure that the correct data type is assigned to each column. Sometimes data types are incorrect or inconsistently formatted, so we convert them to the format required for analysis.
4. Duplicate Data
Here we make sure there is no redundant or duplicate data; if there is, we remove the duplicate rows to avoid double counting in the analysis.
5. Text Cleaning
If the data includes text, we clean and preprocess it. For example, special characters may need to be removed, or the text converted to lowercase.
6. Data Transformation
Data transformation includes converting units, aggregating the data, and creating new variables from existing ones.
7. Addressing Inconsistent Date and Time Formats
Dates and times can be stored in various formats, so we standardize them for consistency and analysis.
8. Domain-Specific Cleaning
Cleaning can also depend on the specific domain and the data sources involved; financial data and healthcare data, for example, may require domain-specific cleaning.
9. Handling Inconsistent Data Entry
Here we handle data entry errors such as typos and inconsistent formatting.
10. Data Versioning and Documentation
Here we keep track of data changes and document the cleaning process to maintain data integrity and transparency.
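Several of the tasks above (handling missing data, type conversion, text cleaning, and de-duplication) can be combined into one small cleaning pass. This is a toy sketch over a list of dictionaries with invented fields, not a general-purpose tool:

```python
def clean(rows):
    """Toy cleaning pass: drop rows with missing values, fix types,
    normalize text, and de-duplicate (order preserved)."""
    seen = set()
    out = []
    for row in rows:
        if any(v is None or v == "" for v in row.values()):
            continue                                 # drop rows with missing data
        row = dict(row,
                   amount=float(row["amount"]),      # type conversion
                   city=row["city"].strip().lower()) # text cleaning
        key = tuple(sorted(row.items()))
        if key in seen:
            continue                                 # remove duplicate rows
        seen.add(key)
        out.append(row)
    return out

raw = [
    {"city": " Delhi ", "amount": "10.5"},
    {"city": "Delhi", "amount": "10.5"},   # duplicate once cleaned
    {"city": "", "amount": "7.0"},         # missing city -> dropped
]
cleaned = clean(raw)
```

Real pipelines typically express the same steps with a dataframe library, but the logic is the same.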
Data cleaning, also known as data scrubbing, is a further step in the data analysis process. Its significance lies in the fact that the quality of the analysis and the reliability of the insights derived from the data depend heavily on the cleanliness and integrity of the data. Key reasons why data cleaning is essential for data analysts include accuracy of analysis, data integrity, consistency, improved model performance, enhanced data quality, missing-data handling, effective visualization, reduced bias, saved time and resources, improved decision-making, and better collaboration. In conclusion, it ensures that the data used for analysis is accurate, reliable, and free from errors, ultimately leading to more robust and trustworthy insights.
Data transformation makes data more suitable for analysis by converting, structuring, and cleaning it, which helps ensure the data is in the right format and of the right quality and makes it easier to extract patterns and useful insights. It is a necessary step because real-world data is often messy and heterogeneous; the quality and effectiveness of an analysis depend on how the data is transformed and prepared. Various operations are used in data transformation, some of which are explained in Fig. 13.
Data transformation helps to normalize data, making it comparable and consistent. It can be used to address skewed distributions, making the data more symmetrical and meeting the assumptions of certain statistical models, thereby improving model performance. In summary, data transformation prepares the data so that it is more suitable for various analytical techniques.
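A common transformation of the normalizing kind described above is min-max scaling, which maps each value into the range [0, 1]; the ages below are invented for illustration:

```python
def min_max_scale(values):
    """Rescale values to [0, 1]; a common transformation before modeling."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, 30, 40, 60]   # toy feature with an arbitrary range
scaled = min_max_scale(ages)
```

After scaling, features measured on very different ranges become directly comparable.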
Visualization and reporting are critical components of data analytics: they help analysts and stakeholders make sense of data, identify trends, draw insights, and make data-driven decisions [14]. An overview of visualization and reporting is given in Table 2.
Visualization and reporting provide valuable tools for communicating insights and findings to both technical and non-technical audiences [16]. Visualization transforms complex data sets into understandable, interpretable visuals, which makes it easier for stakeholders to grasp insights, while reporting allows the creation of a narrative around the data that highlights key findings and trends.
Table 2 (continued)
S.No Name Description
5 Best practices When creating visualizations and reports,
consider best practices such as:
• Choosing the right chart type for the data
• Keeping visuals simple and uncluttered
• Labeling axes and data points clearly
• Providing context and explanations
• Ensuring that the design is user-friendly
• Consistently updating dashboards and
reports as new data becomes available
5 Conclusion
In conclusion, data analytics is a powerful approach for extracting meaningful insights from data sets, providing valuable information for decision-making and problem-solving [17]. Its fundamentals and lifecycle play an important role in ensuring the success of analytical initiatives. Data analytics is essential for businesses and organizations seeking a competitive edge, as it enables informed decision-making by uncovering patterns, trends, and correlations within large datasets. A well-executed data analytics process can lead to improved efficiency, better customer insights, and a competitive advantage in today's data-driven landscape.
References
1. Kumar, M., Tiwari, S., Chauhan, S.S.: Importance of big data mining: (tools, techniques). J.
Big Data Technol. Bus. Anal. 1(2), 32–36 (2022)
2. Singh, P., Singh, N., Luxmi, P.R., Saxena, A.: Artificial intelligence for smart data storage in
cloud-based IoT. In: Transforming Management with AI, Big-Data, and IoT, pp. 1–15. Springer
International Publishing, Cham (2022)
3. Abdul-Jabbar, S., Farhan, A.: Data analytics and techniques: a review. ARO-Sci. J. Koya Univ.
10, 45–55 (2022). https://doi.org/10.14500/aro.10975
4. Erl, T., Khattak, W., Buhler, P.: Big Data Fundamentals: Concepts, Drivers & Techniques.
Pearson. Part of The Pearson Service Technology Series from Thomas Erl (2016)
5. Sharda, R., Asamoah, D., Ponna, N.: Business analytics: research and teaching perspectives.
In: Proceedings of the International Conference on Information Technology Interfaces, ITI,
pp. 19–27 (2013). https://doi.org/10.2498/iti.2013.0589
6. Lepenioti, K., Bousdekis, A., Apostolou, D., Mentzas, G.: Prescriptive analytics: literature
review and research challenges. Int. J. Inf. Manag. 50, 57–70 (2020). https://doi.org/10.1016/
j.ijinfomgt.2019.04.003
7. Kumar, M., Tiwari, S., Chauhan, S.S.: A review: importance of big data in healthcare and its
key features. J. Innov. Data Sci. Big Data 1(2), 1–7 (2022)
8. Durgesh, S.: A narrative review on types of data and scales of measurement: an initial step in the
statistical analysis of medical data. Cancer Res. Stat. Treat. 6(2), 279–283 (2023, April–June).
https://doi.org/10.4103/crst.crst_1_23
9. Sivarajah, U., Mustafa Kamal, M., Irani, Z., Weerakkody, V.: Critical analysis of big data
challenges and analytical methods. J. Bus. Res. 70, 263–286 (2017). https://doi.org/10.1016/j.
jbusres.2016.08.001
10. Rahul, K., Banyal, R.K.: Data life cycle management in big data analytics. In: Inter-
national Conference on Smart Sustainable Intelligent Computing and Applications Under
ICITETM2020 (2020). Elsevier
11. Watson, H., Rivard, E.: The analytics life cycle a deep dive into the analytics life cycle. 26,
5–14 (2022)
12. Ridzuan, F., Zainon, W.M.N.: A review on data cleansing methods for big data. Procedia
Comput. Sci. 161, 731–738 (2019). https://doi.org/10.1016/j.procs.2019.11.177
13. Idreos, S., Papaemmanouil, O., Chaudhuri, S.: Overview of data exploration techniques. In:
Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data,
pp. 277–281 (2015). https://doi.org/10.1145/2723372.2731084
14. Roden, S., Nucciarelli, A., Li, F., Graham, G.: Big data and the transformation of operations
models: a framework and a new research agenda. Prod. Plan. Control 28(11–12), 929–944
(2017). https://doi.org/10.1080/09537287.2017.1336792
15. Maheshwari, A.K.: Data Analytics Made Accessible (2015)
16. Abdul-Jabbar, S.S., Farhan, A.K.: Data analytics and techniques: a review. ARO-Sci. J. Koya
Univ. (2022)
17. Manisha, R.G.: Data modeling and data analytics lifecycle. Int. J. Adv. Res. Sci., Commun.
Technol. (IJARSCT) 5(2) (2021). https://doi.org/10.48175/568
Building Predictive Models with Machine
Learning
1 Introduction
The ability to derive actionable insights from complicated datasets has become essen-
tial in a variety of sectors in the era of abundant data. A key component of this effort
is predictive modeling, which is enabled by machine learning and holds the potential
to predict future results, trends, and patterns with previously unheard-of accuracy.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 39
P. Singh et al. (eds.), Data Analytics and Machine Learning, Studies in Big Data 145,
https://doi.org/10.1007/978-981-97-0448-4_3
40 R. Gupta et al.
This chapter takes the reader on a voyage through the complex field of applying
machine learning to create predictive models, where algorithmic science and data
science creativity collide. Predictive modeling with machine learning is a dynamic
and powerful approach that leverages computational algorithms to analyze historical
data and make predictions about future outcomes. At its core, predictive modeling
aims to uncover patterns, relationships, and trends within data, enabling the develop-
ment of models that can generalize well to unseen data and provide accurate forecasts.
The process begins with data collection, where relevant information is gathered and
organized for analysis. This data typically comprises variables or features that may
influence the outcome being predicted. Machine learning algorithms, ranging from
traditional statistical methods to sophisticated neural networks, are then applied to
this data to learn patterns and relationships. The model is trained by exposing it to a
subset of the data for which the outcomes are already known, allowing the algorithm
to adjust its parameters to minimize the difference between predicted and actual
outcomes. Once trained, the predictive model undergoes evaluation using a separate
set of data not used during training. This assessment helps gauge the model’s ability to
generalize to new, unseen data accurately. Iterative refinement is common, involving
adjustments to model parameters or the selection of different algorithms to improve
predictive performance. The success of predictive modeling lies in its ability to trans-
form raw data into actionable insights, aiding decision-making processes in various
fields. Applications span diverse domains, including finance, healthcare, marketing,
and beyond. Understanding the intricacies of machine learning algorithms, feature
engineering, and model evaluation is crucial for practitioners seeking to harness the
full potential of predictive modeling in extracting meaningful information from data.
As technology advances, predictive modeling continues to evolve, offering innova-
tive solutions to complex problems and contributing significantly to the data-driven
decision-making landscape.
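The train-then-evaluate loop described above can be sketched end to end with a deterministic holdout split and the simplest possible "model", one that memorises the training mean. The outcome values are invented for illustration:

```python
def split(data, test_ratio=0.25):
    # Deterministic holdout split; real work would shuffle first.
    cut = int(len(data) * (1 - test_ratio))
    return data[:cut], data[cut:]

def train_mean(train_y):
    # "Training" the baseline model: memorise the mean outcome.
    return sum(train_y) / len(train_y)

def mean_abs_error(prediction, test_y):
    # Gauge generalisation on data the model never saw during training.
    return sum(abs(prediction - y) for y in test_y) / len(test_y)

outcomes = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 11.0, 12.0]
train_y, test_y = split(outcomes)
model = train_mean(train_y)             # fit on the training subset
error = mean_abs_error(model, test_y)   # evaluate on the held-out set
```

Swapping in a real learner changes only the train and predict steps; the split-fit-evaluate structure stays the same.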
This chapter will help both novices and seasoned practitioners understand the
intricacies of predictive modeling by demystifying them. We’ll explore the princi-
ples of feature engineering, model selection, and data preparation to provide readers
with a solid basis for building useful and accurate prediction models. We’ll go into
the nuances of machine learning algorithms, covering everything from traditional
approaches to state-of-the-art deep learning strategies, and talk about when and how
to use them successfully. Predictive modeling, however, is a comprehensive process
that involves more than just data and algorithms. We’ll stress the importance of ethical
factors in the era of data-driven decision-making, such as justice, transparency, and
privacy. We’ll work through the difficulties that come with developing predictive
models, such as managing imbalanced datasets and preventing overfitting. Further-
more, we will provide readers with useful information on how to analyze model
outputs—a crucial ability for insights that can be put into practice.
2 Literature Review
3 Machine Learning
Data has emerged as one of the most valuable resources in the current digital era.
Every day, both individuals and organizations produce and gather enormous volumes
of data, which can be related to anything from social media posts and sensor readings
to financial transactions and customer interactions. Machine learning appears as a
transformative force amidst this data deluge, allowing computers to autonomously
learn from this data and extract meaningful insights. It serves as the cornerstone of
artificial intelligence, fostering innovation in a wide range of fields.
A wide range of ideas and methods are included in machine learning, such as:
Supervised learning: It involves training models on labeled data, meaning each training example is paired with the intended output. As a result, models are able to learn how input features correspond to output labels.
Unsupervised Learning: This type of learning works with data that is not labeled.
Without explicit guidance, the goal is to reduce dimensionality, group similar data
points, and find hidden patterns.
Reinforcement learning: It is a paradigm in which agents pick up knowledge
by interacting with their surroundings. Agents are able to learn the best strategies
because they are rewarded or penalized according to their actions.
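To make the contrast between these paradigms concrete, the following minimal sketch (with an invented toy dataset) places a supervised classifier and an unsupervised clustering model side by side using scikit-learn:

```python
# Supervised vs. unsupervised learning on a tiny invented dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y = np.array([0, 0, 1, 1])  # labels available -> supervised setting

# Supervised: learn the mapping from features X to labels y.
clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.2, 1.9]]))  # a point near the class-0 examples

# Unsupervised: no labels; group similar points instead.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```

The supervised model needs the target column `y`; the clustering model discovers the two groups from `X` alone.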
Algorithms:
There are numerous machine learning algorithms available, each with a specific
purpose in mind. Neural networks, decision trees, support vector machines, and
deep learning models such as recurrent neural networks (RNNs) and convolutional
neural networks (CNNs) are a few examples.
4 Predictive Models
Predictive models are essentially enabled by machine learning to fully utilize the
potential of historical data. It improves the accuracy and efficiency of data-driven
predictions and decisions made by individuals and organizations by automating,
adapting, and scaling the predictive modeling process. Predictive modeling and
machine learning work well together to promote innovation and enhance decision-
making in a variety of fields. Numerous predictive models based on machine learning
are employed by various industries. Several applications of these include fore-
casting sales, predicting stock prices, detecting fraud, predicting patient outcomes,
recommending systems, and predicting network faults, among many others.
Key elements of data science and machine learning are predictive models. These
are computational or mathematical models that forecast future events or results based
on patterns and data from the past. These models use historical data’s relationships
and patterns to make well-informed forecasts and decisions. A more thorough description of predictive models follows:
Data as the Foundation: The basis of predictive models is data. These models are
trained on historical data, which comprises details about observations, actions, and
events from the past. Prediction accuracy is heavily dependent on the relevance
and quality of the data.
Learning from Data: To make predictions based on past data, predictive models
use mathematical techniques or algorithms. In order to find patterns, relationships,
and correlations, the model examines the input data (features) and the associated
known outcomes (target variables) during the training phase.
Feature Selection and Engineering: Proper selection and engineering of the
appropriate features (variables) from the data are crucial components of predictive
modeling. Feature engineering is the process of altering, expanding, or adding new
features in order to increase the predictive accuracy of the model.
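As an illustration, feature engineering on a hypothetical housing table might derive new variables from existing ones; the column names below are invented, not taken from any dataset in this chapter:

```python
# Feature-engineering sketch with pandas; columns are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "sqft": [1400, 2000, 850],
    "bedrooms": [3, 4, 2],
    "sale_date": pd.to_datetime(["2021-03-01", "2021-07-15", "2021-11-30"]),
})

# Derive new features from existing ones.
df["sqft_per_bedroom"] = df["sqft"] / df["bedrooms"]
df["sale_month"] = df["sale_date"].dt.month
print(df[["sqft_per_bedroom", "sale_month"]])
```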
Model Building: Based on the problem at hand, a specific predictive model is
selected after the data has been prepared and features have been chosen. Neural
networks, support vector machines, decision trees, linear regression, and other
algorithms are frequently used in predictive modeling. Each algorithm has its
strengths and weaknesses, and the choice depends on the nature of the problem
and the data.
Model Training: The historical data is used to train the model. In this stage, the
model modifies its internal parameters in order to reduce the discrepancy between
44 R. Gupta et al.
the training data’s actual results and its predictions. The aim is to make a model
that represents the fundamental connections in the data.
Predictions: The predictive model is prepared to make predictions on fresh,
untested data following training. The model receives features as inputs and outputs
forecasts or predictions. To arrive at these predictions, the model generalizes from
the patterns it discovered during training.
Evaluation: It is essential to compare the predictive model’s predictions to known
outcomes in a different test dataset in order to gauge the predictive model’s perfor-
mance. Accuracy, mean squared error (MSE), area under the ROC curve (AUC),
and other metrics are frequently used in evaluations. Evaluation is a useful tool
for assessing the model’s performance and suitability for the intended accuracy
requirements.
Deployment: Predictive models can be used in real-world situations after they
show a sufficient level of accuracy in practical applications. Depending on the
particular use case, this could be a component of an integrated system, an API, or
a software application.
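The lifecycle described above, from training through prediction and evaluation, can be sketched end to end on synthetic data; all names and parameters here are illustrative:

```python
# End-to-end predictive modeling sketch on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Historical data (synthetic stand-in), split into train and test sets.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Train, predict on unseen data, then evaluate.
model = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_train, y_train)
preds = model.predict(X_test)
print(f"test accuracy: {accuracy_score(y_test, preds):.2f}")
```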
Numerous industries use predictive models, including marketing (customer
segmentation), healthcare (disease diagnosis), finance (credit scoring), and many
more. They are useful tools for using past data to predict future trends or events,
optimize workflow, and make well-informed decisions. It’s crucial to remember that
predictive models are not perfect and must be continuously updated and monitored
as new data becomes available to retain their relevance and accuracy. Figure 1 shows
the prediction model.
6 Ethical Considerations
Fairness, bias, transparency, and privacy are just a few of the ethical issues that
machine learning has brought to light. It is critical to address these issues in order
to guarantee ethical and responsible predictive modeling procedures.
Here are some common machine learning models used for various types of
predictions:
1. Linear Regression: This method is used to forecast a continuous target variable.
For example, calculating a house’s price depends on its size in square footage
and number of bedrooms.
2. Logistic Regression: This technique is used for binary classification, such as
predicting whether or not an email is spam.
3. Decision Trees: These adaptable models are applied to tasks involving both
regression and classification. They are frequently employed in situations such
as illness classification based on symptoms or customer attrition prediction.
4. Random Forest: An ensemble model that enhances accuracy by combining
several decision trees. Applications such as image classification and credit
scoring make extensive use of it.
5. Support vector machines (SVM): Applied to classification tasks like financial
transaction fraud detection or sentiment analysis in natural language processing.
6. K-Nearest Neighbors (KNN): This technique finds the most similar data points
in the training set to generate predictions for classification and regression.
7. Naive Bayes: This algorithm is frequently applied to text classification tasks,
such as sentiment analysis in social media posts or spam detection.
8. Neural Networks: Deep learning models are applied to a range of tasks,
such as autonomous driving (Deep Reinforcement Learning), natural language
processing (Recurrent Neural Networks, or RNNs), and image recognition
(Convolutional Neural Networks, or CNNs).
9. Gradient Boosting Machines (GBM): Ensemble models that create a powerful
predictive model by combining weak learners. They work well in situations
such as credit risk assessment.
10. XGBoost: A well-liked gradient boosting algorithm with a reputation for being
scalable and highly effective. It is widely used in predictive modeling
competitions and industry applications.
11. Time Series Models: Models specific to time series forecasting, such as LSTM
(Long Short-Term Memory) or ARIMA (Autoregressive Integrated Moving
Average), used for tasks like predicting stock prices or product demand.
12. Principal Component Analysis (PCA): Enhances predictive models through
feature engineering and dimensionality reduction.
13. Clustering Algorithms: Data can be clustered using models such as DBSCAN
or K-Means, which can aid in anomaly detection or customer segmentation.
14. Reinforcement learning: This technique is used to optimize resource alloca-
tion, play games, and control autonomous robots in dynamic environments by
anticipating actions and rewards.
These are but a handful of the numerous machine learning models that are out
there. The forecasting goal and the type of data determine which model is best.
Machine learning experts choose the best model and optimize it to get the best
results for a particular issue.
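One common way to choose among candidate models is to compare their cross-validated scores. The sketch below, with synthetic data and an arbitrary shortlist of models, illustrates the idea:

```python
# Comparing candidate models with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
}
# Mean cross-validated accuracy per candidate; highest wins.
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "->", best)
```

In practice the shortlist would be driven by the prediction type (classification or regression) and problem complexity, as noted above.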
There are ten important steps needed to create an effective machine learning
predictive model. Figure 2 shows the step-by-step process of building a predictive
model.
9 Data Collection
Gathering historical data that is pertinent to the issue you are trying to solve is the first
step in the process. Typically, this data comprises the associated target variable (the desired
outcome) and features (input factors). For instance, if your goal is to forecast the price of real
estate, you may include features such as square footage, location, and number of bedrooms
in your data, with the sale price serving as the target variable.
10 Data Preprocessing
Raw data frequently requires preparation and cleansing. This entails managing outliers,
handling missing values, and using methods like one-hot encoding to transform category
data into numerical form. Preparing the data ensures that it is ready for analysis.
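A minimal preprocessing sketch, assuming a hypothetical table with a missing value and a categorical column, might look like this:

```python
# Preprocessing sketch: imputation plus one-hot encoding.
# The toy data and column names are invented for illustration.
import pandas as pd

df = pd.DataFrame({
    "sqft": [1400.0, None, 850.0],
    "city": ["Delhi", "Mumbai", "Delhi"],
})

# Handle the missing value, then one-hot encode the categorical column.
df["sqft"] = df["sqft"].fillna(df["sqft"].median())
df = pd.get_dummies(df, columns=["city"])
print(df.columns.tolist())
```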
12 Data Splitting
The training dataset and the testing dataset are the two or more subsets into which the
dataset is normally separated. The predictive model is trained on the training dataset, and
its performance is assessed on the testing dataset. For hyperparameter adjustment, another
validation dataset might be employed in some circumstances.
13 Model Selection
Your choice of predictive modeling algorithm depends on the type of data and the challenge
you have. Neural networks, support vector machines, decision trees, random forests, and
linear regression are examples of common algorithms. The type of prediction (classification
or regression) and problem complexity are two important considerations when selecting an
algorithm.
14 Model Training
In this stage, the selected model is trained to make predictions using the training dataset. The
algorithm minimizes the discrepancy between its predictions and the actual results in the
training data by learning from the patterns in the data and modifying its internal parameters.
15 Hyperparameter Tuning
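Hyperparameter tuning searches over candidate settings, such as tree depth or learning rate, and keeps the combination that scores best on held-out data. A minimal sketch with scikit-learn's GridSearchCV, where the parameter grid and data are invented for illustration:

```python
# Grid search over a decision tree's hyperparameters.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=6, random_state=1)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid={"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]},
    cv=5,  # each setting is scored by 5-fold cross-validation
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
```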
16 Model Evaluation
The testing dataset is used to assess the model after it has been trained and tuned.
Metrics such as prediction accuracy, precision, recall, F1 score, and mean squared
error are used to assess how effectively the model predicts the real results.
17 Model Deployment
The model can be used to predict fresh, unseen data in a real-world setting if it satisfies the
required accuracy standards. Depending on the use case, this can be accomplished using
software programs, APIs, or integrated systems.
To guarantee that predictive models continue to function accurately when new data becomes
available, continuous monitoring is necessary. In order for models to adjust to evolving
patterns or trends in the data, they might require regular updates or retraining.
This case study explores the world of Long Short-Term Memory (LSTM) models and
EEG data to overcome this problem. EEG data is used, which provides a wealth
of information on brain activity. LSTM models, which are skilled at processing
sequential data, are used as analytical tools. The main goal of this case study is
explained in the introduction, which is to develop and apply prediction models for
the early detection of cognitive problems utilizing LSTM and EEG data. It also
emphasizes how important it is to evaluate these models carefully and investigate their
usefulness in various healthcare contexts. The introduction essentially summarizes
the case study in the framework of a pressing healthcare issue and outlines the goals
and approach for dealing with this complicated problem.
20 Data Preparation
Several crucial procedures must be taken in order to prepare the data for an LSTM
model that uses EEG data to predict cognitive problems. Given that we acquired
our data from Kaggle, the following is a general description of the data preparation
procedure:
22 Data Preprocessing
Use feature engineering to extract pertinent information from EEG data, if necessary.
This may entail:
24 Label Encoding
25 Data Splitting
The dataset should be divided into three sets for training, validation, and testing.
Typical split ratios are 70% for training, 15% for validation, and 15% for testing;
given an initial 85%–15% train–test split, a portion of the training set can be
designated for validation.
Create a format for the preprocessed data that is appropriate for LSTM input. To
do this, make a 3D array with the following dimensions: samples, time_steps, and
features.
samples: The total number of EEG samples in the training, validation, and testing
sets.
time_steps: The number of time steps in a single EEG segment.
features: The number of features per time step, such as gender, age, and brain
wave features. This would normally be 3 in our case.
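Arranging the data into this 3D shape can be sketched with NumPy; the dimensions below are illustrative, not taken from the actual dataset:

```python
# Reshaping flat preprocessed data into the (samples, time_steps, features)
# 3D array expected by an LSTM. Dimensions are illustrative.
import numpy as np

n_samples, time_steps, n_features = 120, 14, 3
flat = np.random.rand(n_samples * time_steps, n_features)

X = flat.reshape(n_samples, time_steps, n_features)
print(X.shape)  # (120, 14, 3)
```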
27 Data Normalization
As needed, normalize the data within each feature dimension. Different normal-
ization methods may be needed for brain wave data than for age and gender. To
guarantee consistency, use the same normalization parameters on both the training
and testing datasets.
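A sketch of this fit-on-train, apply-everywhere normalization pattern, with invented values standing in for brain wave power and age:

```python
# Normalization sketch: fit scaling parameters on training data only,
# then reuse the SAME parameters on the test data for consistency.
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[0.1, 25], [0.4, 60], [0.3, 41]])  # e.g. [wave power, age]
test = np.array([[0.2, 33]])

scaler = StandardScaler().fit(train)       # parameters come from training data
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)       # same parameters reused on test data
print(train_scaled.mean(axis=0).round(6))  # each column centered near zero
```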
28 Shuffling (Optional)
Depending on the properties of our dataset, decide if randomizing the training data
is appropriate. Due to temporal relationships, shuffling may not be appropriate
for brain wave data, but it is possible for age and gender data.
If we wish to expand the dataset or add variability to the EEG signals, think about
using data augmentation techniques for the brain wave data. Time shifts, amplitude
changes, and the introduction of artificial noise are examples of augmentation tech-
niques. We can utilize an LSTM model that predicts cognitive problems based on
EEG data, age, and gender if we follow these procedures to properly prepare and
structure our dataset, including the data splitting procedure. Due to the thorough data
preparation, our model will always receive consistent, well-structured input and will
be able to make precise predictions based on the attributes that are given.
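The augmentation techniques mentioned above can be sketched on a synthetic one-dimensional signal as follows; the shift, scale, and noise parameters are illustrative:

```python
# EEG-style augmentation sketch: time shift, amplitude scaling, noise.
import numpy as np

rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 4 * np.pi, 64))  # synthetic stand-in signal

shifted = np.roll(signal, 5)                        # time shift
scaled = signal * rng.uniform(0.8, 1.2)             # amplitude change
noisy = signal + rng.normal(0, 0.05, signal.shape)  # artificial noise

augmented = np.stack([shifted, scaled, noisy])
print(augmented.shape)  # (3, 64)
```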
Let’s explain the LSTM-based model architecture used for the prediction of cognitive
disorders. Figure 3 explains the neural network architecture.
1. Input Layer
Our data enters the system through the input layer. It receives EEG data
sequences in this model. Each sequence represents a 14-time step window of
EEG readings, with one feature (perhaps an individual EEG measurement or
characteristic) present at each time step. Consider this layer to be the neural
network’s entry point for our data.
2. Dense Layer 1
With 64 neurons (units), this layer is fully connected: every neuron receives
input from every output of the preceding layer. Rectified Linear Unit (ReLU) is
the activation function applied here. By mapping negative values to zero and
passing positive values unmodified, ReLU adds nonlinearity to the model and aids
the network's learning of intricate data patterns.
3. Bidirectional LSTM Layer 1
Long Short-Term Memory (LSTM) is a subclass of recurrent neural networks
(RNNs). We have a bidirectional LSTM with 256 units in this layer. By processing
the input sequence both forward and backward, the "bidirectional" wrapper
captures temporal interdependence in both directions: to comprehend the context
of each measurement within the series, it takes into account both past and future
EEG measurements.
4. Dropout Layer 1
Dropout is a regularization strategy. During each training iteration, this
layer randomly discards 30% of the outputs from the preceding layer's neurons.
This adds noise and encourages more robust learning, which helps minimize
overfitting: the model is pushed to pick up patterns that do not depend on the
existence of any particular neuron.
5. Bidirectional LSTM Layer 2
Like the initial LSTM layer, this layer is bidirectional, but with 128 units. It
continues extracting temporal patterns from the EEG data. Because it learns from
both past and future contexts, its bidirectional nature makes the model better
suited to handling sequential data.
6. Dropout Layer 2
The second LSTM layer is followed by a dropout layer with a 30% dropout
rate. It improves the model’s capacity to generalize in the same way as the
preceding dropout layer.
7. Flatten Layer
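The layers described above can be sketched in Keras as follows. Note that the output head (a sigmoid Dense layer after the Flatten layer) is an assumption for binary classification, since its description does not appear here:

```python
# Sketch of the described architecture in Keras; the output head
# (Dense sigmoid after Flatten) is assumed, not taken from the text.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(14, 1)),               # 14 time steps, 1 feature
    layers.Dense(64, activation="relu"),      # Dense layer 1 (per time step)
    layers.Bidirectional(layers.LSTM(256, return_sequences=True)),
    layers.Dropout(0.3),                      # dropout layer 1
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
    layers.Dropout(0.3),                      # dropout layer 2
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),    # assumed binary output head
])
model.summary()
```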
There are several crucial processes involved in training a machine learning model,
including our LSTM-based model for predicting cognitive diseases. An outline of
the training procedure is given below:
1. Optimizer and Callbacks Setup:
● opt_adam = keras.optimizers.Adam(learning_rate=0.001): The Adam
optimizer is configured with a learning rate of 0.001 in this line. To reduce
the prediction error, the optimizer controls how the model's internal parameters
(weights) are changed during training.
● es = EarlyStopping(monitor='val_loss', mode='min', verbose=1,
patience=10): Early stopping is a training strategy used to avoid overfitting. It
keeps track of the validation loss (the model's performance on unseen data)
and halts training if the loss doesn't decrease for 10 consecutive epochs. This
helps prevent overtraining, which can result in overfitting.
● mc = ModelCheckpoint(save_to + "Model_name", monitor="val_
accuracy", mode="max", verbose=1, save_best_only=True): Every
time the validation accuracy improves, the model's weights are checkpointed
and saved to the file "Model_name". This guarantees that the best-performing
model iteration is preserved.
● lr_schedule = tf.keras.callbacks.LearningRateScheduler(lambda epoch:
0.001 * np.exp(-epoch / 10.)): The learning rate is dynamically adjusted
during training using learning rate scheduling. In this instance, the learning rate
decays over time; a lower learning rate in later epochs may aid the model's
convergence.
2. Model Compilation
● model.compile(optimizer=opt_adam, loss=['binary_crossentropy'],
metrics=['accuracy']): The model is compiled in this line, which configures
how it will learn during training.
● optimizer=opt_adam: It identifies Adam as the optimizer to use when
updating the model's weights.
● loss=['binary_crossentropy']: It employs the binary cross-entropy loss func-
tion, which, in a binary classification task, measures the discrepancy between
the model's predictions and the actual labels.
● metrics=['accuracy']: The model's accuracy on the training data is tracked
during training.
3. Model Training:
● history = model.fit(x_train, y_train, batch_size=20, epochs=epoch,
validation_data=(x_test, y_test), callbacks=[es, mc, lr_schedule]): This line
starts the actual training process.
● x_train and y_train are the training data (EEG features and labels).
● batch_size=20: The data is processed in batches of 20 samples at a time when
updating the model's weights.
● epochs=epoch: The model is trained for the specified number of epochs
(typically many more than 2) to learn from the data effectively.
● validation_data=(x_test, y_test): Validation data is used to evaluate how well
the model is generalizing to unseen data.
● callbacks=[es, mc, lr_schedule]: These callbacks are applied during training,
helping to control the training process and save the best model.
4. Model Loading
● saved_model = load_model(save_to + "Model_name"): After training, the
code loads the best-performing model (by validation accuracy) from the saved
checkpoint. This model is ready for making predictions on new data.
5. Return Values
● return model, history: The function returns both the trained model (model)
and the training history (history). The training history contains loss and
accuracy information for both the training and validation data across epochs.
The LSTM model’s performance is assessed using a different dataset than the one it
was trained on in order to predict cognitive disorders. Here is how we might test our
model predictions for cognitive disorders:
The LSTM model that we previously trained should be loaded first. After training, this model
ought to have been retained so that it could be used to make predictions.
32 Make Predictions
On the testing dataset, make predictions using the loaded model. The model will provide
predictions for each sample when we feed it the EEG data from the testing dataset.
34 Evaluation Metrics
Utilize a variety of evaluation metrics to rate the model's effectiveness. Typical
metrics for binary classification tasks include accuracy, precision, recall, and the
F1 score.
These metrics offer information on how well the model is doing in terms of
correctly classifying both cognitive and non-cognitive disorders. A critical step in
assessing the LSTM model’s performance and guaranteeing its dependability for
diagnosing cognitive diseases based on EEG data is testing it on a different dataset.
It helps establish whether the model can be effectively applied in real-world situations
and how well it generalizes to new data.
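These metrics can be computed with scikit-learn; the predicted and true labels below are invented for illustration:

```python
# Typical binary-classification metrics on hypothetical labels.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical ground truth
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # hypothetical model output

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
```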
There are some issues and challenges that may be encountered with LSTM-based
predictive models for cognitive disorder prediction using EEG data.
1. Data Quality and Accessibility: Ensuring the quality and accessibility of
diverse EEG datasets can be a significant hurdle. Obtaining representative and
comprehensive data is essential for model accuracy.
35.1 Conclusion
References
1. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, New York (2006)
2. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016)
3. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001). https://doi.org/10.1023/A:1010933404324
4. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016). https://doi.org/10.1145/2939672.2939785
5. Chen, M., Hao, Y., Hwang, K.: Disease prediction by machine learning over big data from healthcare communities. IEEE Access 5, 8869–8879 (2017). https://doi.org/10.1109/ACCESS.2017.2694446
6. James, G., Witten, D., Hastie, T., Tibshirani, R.: An Introduction to Statistical Learning. Springer (2013)
7. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer (2017)
8. Caruana, R., Niculescu-Mizil, A.: An empirical comparison of supervised learning algorithms. In: Proceedings of the 23rd International Conference on Machine Learning (2006). https://doi.org/10.1145/1143844.1143865
9. Chen, J., Song, L.: A review of interpretability of complex systems and its applications in healthcare. IEEE Access 6, 29926–29953 (2018)
10. Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013). https://doi.org/10.1109/TPAMI.2013.50
11. Lima, M.S.M., Delen, D.: Predicting and explaining corruption across countries: a machine learning approach. Gov. Inf. Q. 37(1), 101407 (2020). https://doi.org/10.1016/j.giq.2019.101407
12. Kaur, H., Kumari, V.: Predictive modeling and analytics for diabetes using a machine learning approach. Appl. Comput. Inform. (2018). https://doi.org/10.1016/j.aci.2018.12.004
13. Cuttingedgeauthor, G.H., Progressmaker, I.J.: Machine learning innovations for predictive modeling. Front. Artif. Intell. 5, 87 (2022)
14. Pioneer, K.L., Visionary, M.N.: Ethical considerations in machine learning-driven predictive modeling. J. Responsible AI 7(1), 45–62 (2023)
15. Expert, P., Guru, Q.: Machine learning in predictive modeling: a state-of-the-art review. Expert Syst. Appl. 98, 1–15 (2022)
16. Lanier, P., Rodriguez, M., Verbiest, S., Bryant, K., Guan, T., Zolotor, A.: Preventing infant maltreatment with predictive analytics: applying ethical principles to evidence-based child welfare policy. J. Fam. Violence 35(1), 1–13 (2020). https://doi.org/10.1007/s10896-019-00074-y
17. Patel, N.J., Jhaveri, R.H.: Detecting packet dropping nodes using machine learning techniques in mobile ad-hoc network: a survey. In: 2015 International Conference on Signal Processing and Communication Engineering Systems, pp. 468–472. IEEE (2015). https://doi.org/10.1109/SPACES.2015.7058308
18. Moujahid, A., Tantaoui, M.E., Hina, M.D., Soukane, A., Ortalda, A., ElKhadimi, A., Ramdane-Cherif, A.: Machine learning techniques in ADAS: a review. In: 2018 International Conference on Advances in Computing and Communication Engineering (ICACCE), pp. 235–242. IEEE (2018). https://doi.org/10.1109/ICACCE.2018.8441758
19. Yang, H., Xie, X., Kadoch, M.: Machine learning techniques and a case study for intelligent wireless networks. IEEE Netw. 34(3), 208–215 (2020). https://doi.org/10.1109/MNET.001.1900351
20. Johnston, S.S., Morton, J.M., Kalsekar, I., Ammann, E.M., Hsiao, C.W., Reps, J.: Using machine learning applied to real-world healthcare data for predictive analytics: an applied example in bariatric surgery. Value Health 22(5), 580–586 (2019). https://doi.org/10.1016/j.jval.2019.01.011
21. Lorenzo, A.J., Rickard, M., Braga, L.H., Guo, Y., Oliveria, J.P.: Predictive analytics and modeling employing machine learning technology: the next step in data sharing, analysis, and individualized counseling explored with a large, prospective prenatal hydronephrosis database. Urology 123, 204–209 (2019). https://doi.org/10.1016/j.urology.2018.05.041
22. Winn, J., Bishop, C.M., Diethe, T., Guiver, J., Zaykov, J.: Model-Based Machine Learning. http://www.mbmlbook.com
23. Singh, P., Singh, N., Singh, K.K., Singh, A.: Diagnosing of disease using machine learning. In: Machine Learning and the Internet of Medical Things in Healthcare, pp. 89–111. Academic Press (2021)
Predictive Algorithms for Smart
Agriculture
R. Sharma (B)
Department of Information Technology, Ajay Kumar Garg Engineering College, Ghaziabad, India
e-mail: drrashmisharma20@gmail.com
C. Pawar
Department of Electronics, Netaji Subhash University of Technology, Delhi, India
P. Sharma
Department of Mechanical Engineering, Motilal Nehru National Institute of Technology,
Prayagraj, India
A. Malik
Department of Mechanical Engineering, Axis Institute of Technology & Management, Kanpur,
India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 61
P. Singh et al. (eds.), Data Analytics and Machine Learning, Studies in Big Data 145,
https://doi.org/10.1007/978-981-97-0448-4_4
62 R. Sharma et al.
1 Introduction
The IoT deals with communication between different devices, whether within the
same network or across different networks, and also between devices and the cloud.
The IoT handles different categories of information depending on the time-sensitivity
of the data. Nowadays IoT is used in manufacturing, transportation, home automation,
utility organizations, agriculture, and so on. Some benefits of IoT are reduced costs,
real-time asset visibility, improved operational efficiency, quick decision-making
through detailed data insights, and predictive real-time insights. IoT devices
are dynamic: their self-adaptive nature adds scalability, intelligence, manageability,
analytical power, and the ability to connect anywhere, anytime, with anything. The
main components of any IoT system are the device/sensor/set of sensors, connectivity,
data processing, and the user interface. The sensors collect and send the data generated
whenever some environmental change occurs.
Networks (RNNs), Hybrid Models, and AutoML. The algorithm chosen is deter-
mined by the individual task, the type of data, and the necessary efficiency attributes.
In practice, researchers in data science and machine learning professionals frequently
test numerous algorithms to see which one performs the best for a specific task.
Nowadays, machine learning algorithms are used in almost all applications, such as
activity recognition, email spam filtering, forecasting of weather, sales, production,
and the stock market, fraud detection in credit cards and bank accounts, image and
speech identification and classification, medical diagnosis and surgery, NLP, precision
agriculture, smart cities, smart parking, and autonomous driving vehicles, to state a few.
[Figure: smart agriculture application areas, including Water Management (drip irrigation, quality of water) and Livestock Management (animal welfare, livestock production)]
Weeds typically develop and spread extensively across vast portions of the field very
quickly due to their productive cultivation of seeds and extended lifespan. This results
in competing with crops for resources like space, sunlight, nutrients, and water avail-
ability. Weeds often emerge earlier than crops, a circumstance that negatively impacts
crop growth [12]. Mechanical procedures for weed control are either challenging to
conduct or useless if done incorrectly, so the most common procedure is the spreading
of herbicides. However, the use of huge amounts of herbicides proves to be expensive
and harmful to the environment. Prolonged usage of herbicides increases the likelihood
that weeds may become more resistant, necessitating more labor-intensive and
costly weed control. Considerable progress has been made in recent years regarding
the use of smart agriculture to distinguish between weeds and crops. Remote or prox-
imal sensing utilizing sensors mounted on satellites, aerial and ground vehicles, and
unmanned vehicles (both ground (UGV) and aerial (UAV)) can be used to achieve
this differentiation. Converting the data collected by drones into useful knowledge
remains a difficult issue [13]. Instead of spraying the entire field, ML algorithms in conjunction
with imaging technology or non-imaging spectroscopy can enable real-time target
weed distinction and localization, allowing for precision pesticide administration to
targeted zones [14, 15].
Analysis of a variety of plant parts, such as leaves, stems, fruits, flowers, roots, and
seeds, is used for the identification and categorization of various varieties of plants
[16, 17]. The commonly used method is leaf-based plant recognition, which examines
the color, shape, and texture of individual leaves [18]. The remote monitoring of crop
attributes has made it easier to classify crops, and this has made it more common to
utilize satellites and aerial vehicles for this purpose. Computerized crop detection and
categorization are a result of advances in computer software and image processing
hardware when paired with machine learning.
Crop quality is influenced by climatic and soil conditions, cultivation techniques, and
crop features. Better prices are often paid for superior agricultural products, which
increases farmers’ profits. The most common indicators of maturity used for reaping
are the quality of the fruit, flesh firmness, soluble solids concentration, and
pigmentation of the skin [19]. Crop quality is also directly related to wasted food,
which is another issue facing contemporary farming because a crop that doesn’t
meet specifications for shape, color, or size may be thrown away. As discussed in
the previous subsection, using ML algorithms in conjunction with computer vision
techniques yields the desired target. For physiological variable extraction, ML regres-
sion techniques using neural networks (NNs) and random forests (RFs) are exam-
ined. Transfer learning has been used to train several cutting-edge convolutional
neural network (CNN) architectures with region proposals to recognize seeds effi-
ciently. When it comes to measuring quality, CNN performs better than manual and
conventional approaches [20].
In remote sensing for water quality, a Bayesian optimization function
and a linear kernel are used. A feed-forward DNN with five hidden layers and a 0.01
learning rate was also created. The model was trained using a Bayesian regularized
back propagation technique [27]. It is not possible to detect every characteristic
related to water quality, including nutrient concentrations and microorganisms/
pathogens, using hyperspectral information collected by drones [27, 28].
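The feed-forward network described above can be sketched as follows. This is a minimal NumPy illustration, not the model of [27]: it keeps the five hidden layers and the 0.01 learning rate from the text, but plain gradient-descent backpropagation stands in for the Bayesian regularized variant, and the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Architecture from the text: five hidden layers and a 0.01 learning rate.
sizes = [4, 8, 8, 8, 8, 8, 1]          # 4 inputs -> five hidden layers -> 1 output
W = [rng.normal(0, 0.5, (a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(n) for n in sizes[1:]]
lr = 0.01

def forward(x):
    acts = [x]
    for i, (w, bias) in enumerate(zip(W, b)):
        z = acts[-1] @ w + bias
        acts.append(np.tanh(z) if i < len(W) - 1 else z)  # linear output layer
    return acts

def train_step(x, y):
    acts = forward(x)
    loss = float(((acts[-1] - y) ** 2).mean())
    delta = (acts[-1] - y) / len(x)                       # gradient at the output
    for i in reversed(range(len(W))):
        grad_w = acts[i].T @ delta
        grad_b = delta.sum(axis=0)
        if i > 0:
            delta = (delta @ W[i].T) * (1 - acts[i] ** 2)  # backprop through tanh
        W[i] -= lr * grad_w
        b[i] -= lr * grad_b
    return loss

# Synthetic stand-in for a water-quality index derived from four spectral bands.
X = rng.normal(size=(64, 4))
Y = X.mean(axis=1, keepdims=True)
losses = [train_step(X, Y) for _ in range(300)]
print(round(losses[0], 3), round(losses[-1], 3))  # loss drops as training proceeds
```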
Soil or land deterioration results from the excessive usage of fertilizers or from
natural causes. Crop rotation needs to be balanced to prevent soil erosion and
maintain healthy soil [29]. Texture, organic matter, and nutrient content are some
of the soil qualities that need to be monitored. Sensors for soil mapping and remote
sensing, which employ machine learning approaches, can be used to study the spatial
variability of the soil.
The crop to be harvested is chosen based on the characteristics of the soil, which
are influenced by the climate and topography of the area. Accurately predicting
the soil’s characteristics is a crucial step, as it helps in determining “crop selection,
land preparation, seed selection, crop yield, and fertilizer selection.” Forecasting soil
properties primarily includes predicting soil nutrients, surface humidity of soil, and
weather patterns throughout the crop’s life. Crop development is dependent on the
nutrients present in a given soil. Soil nutrient monitoring is primarily done with
electric and electromagnetic sensors [30]. Farmers select the right crop for the region
based on the nutrients in the soil.
Managing livestock involves taking care of their diet, growth, and general health.
In these activities, machine learning is used to analyze the eating, chewing, and
moving behaviors of the animals (such as standing, moving, drinking, and feeding
habits). Based on these estimates and assessments, farmers may adjust the animals’
diets and living conditions to improve behavior, health, and weight gain. This increases
the production’s economic viability [34]. Livestock management includes both the
production of livestock and the welfare of the animals; in precision livestock farming,
real-time health monitoring of the animals is considered, including early detection
of warning signals and improved productivity. Such a decision support system and
real-time livestock monitoring allow quality policies about living conditions, diet,
immunizations, and other matters to be put into practice [35].
Animal welfare includes disease analysis in animals, chewing habit monitoring, and
living environment analysis that might disclose physiological issues. An overview
of algorithms used for livestock monitoring, including SVM, RF, and Adaboost
algorithm, was provided by Riaboff, L. et al. [36]. Consumption patterns can be
continuously monitored with cameras and a variety of machine learning techniques,
including random forest (RF), support vector machine (SVM), k-nearest neighbors (k-
NN), and adaptive boosting (AdaBoost). To ensure precise characteristic classification, several
features extracted from the sensor signals were ranked based on their signifi-
cance for grazing, ruminating, and non-eating behaviors [37]. When comparing
classifiers, several performance parameters were considered as functions of the method
applied, the sensor’s location, and the amount of information used.
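A simple version of such significance ranking can be sketched as follows. This is not the procedure of [36, 37]; it ranks invented accelerometer-derived features with a Fisher-style score (variance of the class means divided by the mean within-class variance), a common stand-in for feature-importance ranking.

```python
# Illustrative sketch: rank hypothetical accelerometer features by how well they
# separate behaviour classes, using score = between-class / within-class variance.
from statistics import mean, pvariance

def fisher_score(samples_by_class):
    """samples_by_class: {behaviour label: [values of one feature]}."""
    class_means = [mean(v) for v in samples_by_class.values()]
    between = pvariance(class_means)
    within = mean(pvariance(v) for v in samples_by_class.values())
    return between / within

# Toy per-feature readings for three behaviours (all values invented).
features = {
    "jaw_movement_rate": {"grazing": [8, 9, 8], "ruminating": [5, 5, 6], "other": [1, 1, 2]},
    "head_pitch_noise":  {"grazing": [3, 7, 5], "ruminating": [4, 6, 5], "other": [5, 3, 4]},
}
ranked = sorted(features, key=lambda f: fisher_score(features[f]), reverse=True)
print(ranked)  # jaw_movement_rate separates the classes far better
```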
Complete automation, ongoing monitoring, and management of animal care are the
objectives of the precision livestock farming (PLF) approach. With the use of modern
PLF technology (cameras, microphones, sensors, and the internet), the farmers will
know which particular animals need their help to solve an issue [38].
Predictive Algorithms for Smart Agriculture 71
For the effective use of fertilizer, lime, and other nutrients in the soil, the findings
of soil testing are crucial. Designing a fertilization program can be strengthened
by combining data from soil tests with information on the nutrients accessible to
different products. In addition to individual preferences, geographical soil and crop
conditions can impact the choice of an appropriate test. Parameters like the cation
exchange capacity (CEC), pH, nitrogen (N), phosphorus (P), potassium (K), calcium
(Ca), magnesium (Mg), and their base saturation percentages are often included in
conventional tests.
Specific micronutrients, toxic substances, salinity, nitrite, sulfate, organic
material (OM), and certain other elements can also be examined for in specific labs.
The amount of sand, silt, and clay in the soil, its degree of compaction, its level of
moisture, as well as other physical and mechanical characteristics all have an impact
on the environment in which crops thrive. Precise evaluations of macronutrients,
namely nitrogen, phosphorus, and potassium (NPK), present in soil are essential for
effective agricultural productivity. This includes site-specific cultivation, in which the
rates of fertilizer nutrient treatment are modified geographically based on local needs.
Optical diffuse reflectance sensing enables the quick, non-destructive assessment of
soil properties [39], including the feasible range of nutrient levels.
The capacity to measure analyte concentration directly over a wide range of
sensitivity makes electrolytic sensing, which is based on ion-selective field-effect
transistors, a beneficial method for real-time evaluation. It is also portable, simple,
and responsive. Many crops need a certain alkalinity level in the soil. A pH sensor
takes a reading of the pH of the soil and transmits the information to a server so that
users may view it and add chemicals to keep the alkalinity near the ideal range for
particular crops. The operation of the ground’s moisture detector is comparable to
that of the soil pH sensor. Following data collection, the data is sent to the server,
which then uses the information to determine what action to take. For example, the
web server may decide to utilize spray pumps to moisten the soil or control the
greenhouse’s temperature to ensure that the soil has the right amount of humidity
[40].
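The server-side decision step described above can be sketched as simple threshold logic. The pH range, moisture threshold, and action names here are hypothetical, not taken from [40].

```python
# Minimal sketch of the server's decision step; thresholds and action names
# are hypothetical, not from a specific system.

def decide_actions(reading, ph_range=(6.0, 7.0), moisture_min=30.0):
    """Map one node reading (pH, % volumetric moisture) to actuator commands."""
    actions = []
    low, high = ph_range
    if reading["ph"] < low:
        actions.append("dose_lime")          # raise pH toward the crop's range
    elif reading["ph"] > high:
        actions.append("dose_sulphur")       # lower pH
    if reading["moisture"] < moisture_min:
        actions.append("run_spray_pump")     # irrigate until moisture recovers
    return actions

print(decide_actions({"ph": 5.4, "moisture": 22.0}))
# ['dose_lime', 'run_spray_pump']
```

In a deployed system the same rule table would live behind the web server that receives the sensor uploads.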
A vital part of the agricultural system is water. Since water serves as a significant
source of vitamins and minerals, the amount of water in a particular area affects
agricultural productivity as well. For more accurate farming, the effects of the soil
and water mixture in an agricultural field are measured more precisely, and the water
content should be examined as well. Each time the water in the container is topped off,
which should happen every four to six weeks, or earlier if half of the water has
evaporated, a premium water-soluble fertilizer should be applied. Use a weak solution
that is only one-quarter as strong as the rate suggested on the nutrient bottle.
NPK fertilizer is a blend that includes the three main elements required
for strong plant development. These three nutrients, nitrogen, phosphorus, and
potassium, also referred to as NPK, are required for all plant growth and for a
plant’s proper development. Phosphorus promotes the growth and development of
roots and flowers [41]. A plant also needs potassium, often known as potash. Note that
although plants grown in high nitrogen fertilizers may grow faster, they may also
become weaker and more vulnerable to insect and disease contamination.
This layer is made up of GPS-enabled IoT devices, like cellphones and sensor
nodes, that are used to create different kinds of maps. IoTs for smart agriculture
include those used in greenhouses, outdoor farming, photovoltaic farms, and solar
insecticidal lamps [43], among others. IoT devices are being
changed and integrated at different levels of agriculture to accomplish these two
goals. Ensuring the distribution and production reliability of the nutrition solution
is the main goal. Enhancing consumption control [44], which minimizes solution
losses and keeps prices low, is the second goal. There will be a significant reduction
in both the environmental and economic effects. In the realm of sustainable agri-
culture leveraging green IoT, farmers employ advanced digital control systems like
Supervisory Control and Data Acquisition (SCADA) to fulfill the requirements of
agricultural management and process control.
For every part of equipment in a greenhouse, we recommend the sensor and meter
nodes incorporate IoTs in the following ways:
• The dripper flow rates, anticipated pressures, and the regions to be watered are
all taken into consideration by the IoT devices for the water pumping system.
• Water meters that offer up-to-date information on water storage.
• IoT devices tailored for each piece of filtering equipment, considering the drippers
and the physical properties of water [45].
• Fertilizer meters with real-time updates and injectors for fertilizers, such as NPK
fertilizers.
• IoT devices to adjust electrical conductivity and pH to the right value for nutrition
solutions.
• Tiny solar panels with IoT sensors to regulate temperature and moisture levels.
A recent paper dealing with IoT considers unsupervised and supervised algorithms
that deal only with crop prediction [46, 47]. It relies on Arduino Uno hardware;
if power fluctuations occur, problems arise in the NodeMCU.
Nowadays, new technologies along with websites are being used for the e-trading of
crops and agricultural implements [48, 49], showing how hassle-free selling and
buying can take place, limited mainly by network issues in rural places.
See Table 4.
The task of monitoring soil temperature to regulate the optimal soil temperature for
suitable crops falls on the DHT 11 sensor (Fig. 5). The soil moisture sensor offers the
facility to compute the moisture content of the soil to regulate the quantity of water
present in the soil as well as the water required for the crops. Nitrogen, phosphorus,
and potassium are the nutrients that crops need the most and are usually considered
the most significant. We therefore use the NPK sensor for N,
P, and K analysis. After that, we compute the NPK value data required for specific
crops, allowing us to estimate the crops that are suitable for the soil.
Because we could modify NPK based on the required crops, this would help
streamline the agricultural process. Since the results of all these studies must be
presented to the user on a screen, we use an organic light-emitting diode (OLED)
screen to display the soil content analysis. The ESP module provides Wi-Fi network
connectivity [46] and is used to regulate the data flow to and from the server.
The heat map (Fig. 7) and the plot graph (Fig. 8) reveal that the KNN algorithm
is the best for predicting the appropriate crop depending on the type of soil
and for suggesting the fertilizer for a specific crop grown.
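A from-scratch k-NN classifier makes the reported crop-recommendation step concrete. The training rows (N, P, K, soil moisture %) and crop labels below are invented for illustration; a real system would train on the sensor dataset behind Figs. 7 and 8.

```python
from collections import Counter

# Invented training rows: (N, P, K, soil moisture %) -> recommended crop.
train = [
    ((90, 40, 40, 60), "rice"),
    ((85, 45, 38, 65), "rice"),
    ((20, 60, 20, 30), "chickpea"),
    ((25, 65, 22, 28), "chickpea"),
    ((60, 55, 45, 45), "maize"),
    ((65, 50, 48, 42), "maize"),
]

def predict(sample, k=3):
    """Majority vote among the k training rows nearest in squared distance."""
    dist = lambda row: sum((a - s) ** 2 for a, s in zip(row[0], sample))
    nearest = sorted(train, key=dist)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(predict((88, 42, 39, 62)))  # 'rice'
```

In practice the features would be standardized first, since NPK values and moisture percentages sit on different scales.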
This approach is appropriate for crop production since it will match the soil with the
right crop depending on several variables, such as soil moisture content, NPK value,
ideal irrigation, and real-time in-field crop monitoring. Smart farming based on
machine learning predictive algorithms and the Internet of Things enables the
establishment of a system that can track the agricultural sector and automate
irrigation using sensors (light, humidity, temperature, soil moisture, etc.).
Farmers can monitor their farms remotely through their mobile phones, which is more
productive and convenient. Internet of Things (IoT) and ML-based smart farming
programs have the potential to offer innovative solutions not only for conventional
and large-scale farming operations but also for other emerging or established
agricultural trends, such as organic farming, family farming, and other highly
promising forms of farming.
References
1. Ali, I., Greifeneder, F., Stamenkovic, J., Neumann, M., Notarnicola, C.: Review of machine
learning approaches for biomass and soil moisture retrievals from remote sensing data. Remote
Sens. 7, 15841 (2015)
2. Vieira, S., Lopez Pinaya, W.H., Mechelli, A. : Introduction to Machine Learning, Mechelli,
A., Vieira, S.B.T.-M.L. (eds.), Chapter 1, pp. 1–20. Academic Press, Cambridge, MA, USA,
(2020). ISBN 978-0-12-815739-8.
3. Domingos, P.: A few useful things to know about machine learning. Commun. ACM 55,
78–87 (2012)
4. Lopez-Arevalo, I., Aldana-Bobadilla, E., Molina-Villegas, A., Galeana-Zapién, H., Muñiz-
Sanchez, V., Gausin-Valle, S.: A memory efficient encoding method for processing mixed-type
data on machine learning. Entropy 22, 1391 (2020)
5. Yvoz, S., Petit, S., Biju-Duval, L., Cordeau, S.: A framework to type crop management strate-
gies within a production situation to improve the comprehension of weed communities. Eur. J.
Agron. 115, 126009 (2020)
6. Van Klompenburg, T., Kassahun, A., Catal, C.: Crop yield prediction using machine learning:
A systematic literature review. Comput. Electron. Agric. 177, 105709 (2020)
7. Khaki, S., Wang, L.: Crop yield prediction using deep neural networks. Front. Plant Sci. 10,
621 (2019)
8. Harvey, C.A., Rakotobe, Z.L., Rao, N.S., Dave, R., Razafimahatratra, H., Rabarijohn, R.H.,
Rajaofara, H., MacKinnon, J.L. Extreme vulnerability of smallholder farmers to agricultural
risks and climate change in Madagascar. Philos. Trans. R. Soc. B Biol. Sci. 369 (2014)
9. Isleib, J.: Signs and symptoms of plant disease: Is it fungal, viral or bacterial? Avail-
able online: https://www.canr.msu.edu/news/signs_and_symptoms_of_plant_disease_is_it_f
ungal_viral_or_bacterial. Accessed 19 Mar 2021
10. Zhang, J., Rao, Y., Man, C., Jiang, Z., Li, S.: Identification of cucumber leaf diseases using
deep learning and small sample size for agricultural Internet of Things. Int. J. Distrib. Sens.
Netw. 17, 1–13 (2021)
11. Anagnostis, A., Tagarakis, A.C., Asiminari, G., Papageorgiou, E., Kateris, D., Moshou, D.,
Bochtis, D.: A deep learning approach for anthracnose infected trees classification in walnut
orchards. Comput. Electron. Agric. 182, 105998 (2021)
12. Gao, J., Liao, W., Nuyttens, D., Lootens, P., Vangeyte, J., Pižurica, A., He, Y., Pieters, J.G.:
Fusion of pixel and object-based features for weed mapping using unmanned aerial vehicle
imagery. Int. J. Appl. Earth Obs. Geoinf. 67, 43–53 (2018)
13. Islam, N., Rashid, M.M., Wibowo, S., Xu, C.-Y., Morshed, A., Wasimi, S.A., Moore, S.,
Rahman, S.M.: Early weed detection using image processing and machine learning techniques
in an Australian chilli farm. Agriculture 11, 387 (2021)
14. Slaughter, D.C., Giles, D.K., Downey, D.: Autonomous robotic weed control systems: A review.
Comput. Electron. Agric. 61, 63–78 (2008)
15. Zhang, L., Li, R., Li, Z., Meng, Y., Liang, J., Fu, L., Jin, X., Li, S.: A quadratic traversal
algorithm of shortest weeding path planning for agricultural mobile robots in cornfield. J.
Robot. 2021, 6633139 (2021)
16. Bonnet, P., Joly, A., Goëau, H., Champ, J., Vignau, C., Molino, J.-F., Barthélémy, D., Boujemaa,
N.: Plant identification: Man vs. machine. Multimed. Tools Appl. 75, 1647–1665 (2016)
17. Seeland, M., Rzanny, M., Alaqraa, N., Wäldchen, J., Mäder, P.: Plant species classification
using flower images—A comparative study of local feature representations. PLoS ONE 12,
e0170629 (2017)
18. Zhang, S., Huang, W., Huang, Y., Zhang, C.: Plant species recognition methods using leaf
image: Overview. Neurocomputing 408, 246–272 (2020)
19. Papageorgiou, E.I., Aggelopoulou, K., Gemtos, T.A., Nanos, G.D.: Development and evaluation
of a fuzzy inference system and a neuro-fuzzy inference system for grading apple quality. Appl.
Artif. Intell. 32, 253–280 (2018)
20. Genze, N., Bharti, R., Grieb, M., Schultheiss, S.J., Grimm, D.G.: Accurate machine
learning-based germination detection, prediction and quality assessment of three grain crops. Plant
Methods 16, 157 (2020)
21. El Bilali, A., Taleb, A., Brouziyne, Y.: Groundwater quality forecasting using machine learning
algorithms for irrigation purposes. Agric. Water Manag. 245, 106625 (2021)
22. Neupane, J., Guo, W.: Agronomic basis and strategies for precision water management: a
review. Agronomy 9, 87 (2019)
23. Hochmuth, G.: Drip Irrigation in a Guide to the Manufacture, Performance, and Potential
of Plastics in Agriculture, M. D. Orzolek, pp. 1–197, Elsevier, Amsterdam, The Netherlands
(2017)
24. Janani, M., Jebakumar, R.: A study on smart irrigation using machine learning. Cell Cellular
Life Sci. J. 4(2), 1–8 (2019)
25. Torres-Sanchez, R., Navarro-Hellin, H., Guillamon-Frutos, A., San-Segundo, R., Ruiz-Abellón,
M.C., Domingo-Miguel, R.: A decision support system for irrigation management: Analysis
and implementation of different learning techniques. Water 12(2), 548 (2020)
26. Goldstein, A., Fink, L., Meitin, A., Bohadana, S., Lutenberg, O., Ravid, G.: Applying machine
learning on sensor data for irrigation recommendations: Revealing the agronomist’s tacit
knowledge. Precis. Agric. 19, 421–444 (2018)
27. Sagan, V., Peterson, K.T., Maimaitijiang, M., Sidike, P., Sloan, J., Greeling, B.A., Maalouf, S.,
Adams, C.: Monitoring inland water quality using remote sensing: Potential and limitations of
spectral indices, bio-optical simulations, machine learning, and cloud computing. Earth Sci.
Rev. 205, 103187 (2020)
28. Sharma, A., Jain, A., Gupta, P., Chowdary, V.: Machine learning applications for precision
agriculture: A comprehensive review. IEEE Access, 9, 4843–4873 (2021).
29. Chasek, P., Safriel, U., Shikongo, S., Fuhrman, V.F.: Operationalizing Zero Net Land Degra-
dation: The next stage in international efforts to combat desertification. J. Arid Environ. 112,
5–13 (2015)
30. Adamchuk, V.I., Hummel, J.W., Morgan, M.T., Upadhyaya, S.K.: On-the-go soil sensors for
precision agriculture. Comput. Electron. Agricult. 44(1), 71–91 (2004)
31. Gaitán, C.F.: Machine learning applications for agricultural impacts under extreme events.
In: Climate Extremes and their Implications for Impact and Risk Assessment, pp. 119–138.
Elsevier, Amsterdam, The Netherlands (2020).
32. Mohammadi, K., Shamshirband, S., Motamedi, S., Petković, D., Hashim, R., Gocic, M.:
Extreme learning machine based prediction of daily dew point temperature. Comput. Electron.
Agricult. 117, 214–225 (2015).
33. Diez-Sierra, J., Jesus, M.D.: Long-term rainfall prediction using atmospheric synoptic patterns
in semi-arid climates with statistical and machine learning methods. J. Hydrol. 586, 124789
(2020).
34. Berckmans, D.: General introduction to precision livestock farming. Anim. Front. 7(1), 6–11
(2017)
35. Salina, A.B., Hassan, L., Saharee, A.A., Jajere, S.M., Stevenson, M.A., Ghazali, K.: Assessment
of knowledge, attitude, and practice on livestock traceability among cattle farmers and cattle
traders in peninsular Malaysia and its impact on disease control. Trop. Anim. Health Prod. 53,
15 (2020)
36. Riaboff, L., Poggi, S., Madouasse, A., Couvreur, S., Aubin, S., Bédère, N., Goumand, E.,
Chauvin, A., Plantier, G.: Development of a methodological framework for a robust prediction
of the main behaviours of dairy cows using a combination of machine learning algorithms on
accelerometer data. Comput. Electron. Agric. 169, 105179 (2020)
37. Mansbridge, N., Mitsch, J., Bollard, N., Ellis, K., Miguel-Pacheco, G., Dottorini, T., Kaler, J.:
Feature selection and comparison of machine learning algorithms in classification of grazing
and rumination behaviour in sheep. Sensors 18, 3532 (2018)
38. Berckmans, D., Guarino, M.: From the Editors: Precision livestock farming for the global
livestock sector. Anim. Front. 7(1), 4–5 (2017)
39. Stewart, J., Stewart, R., Kennedy, S.: Internet of things—Propagation modeling for precision
agriculture applications. In: 2017 Wireless Telecommunications Symposium (WTS), pp. 1–8.
IEEE (2017)
40. Venkatesan, R., Tamilvanan, A.: A sustainable agricultural system using IoT. In: International
Conference on Communication and Signal Processing (ICCSP) (2017)
41. Lavric, A., Petrariu, A.I., Popa, V.: Long range SigFox communication protocol scalability
analysis under large-scale, high-density conditions. IEEE Access 7, 35816–35825 (2019)
42. IoT for All: IoT Applications in Agriculture, https://www.iotforall.com/iot-applications-in-agr
iculture/ (2018, January)
43. Mohanraj, R., Rajkumar, M.: IoT-Based smart agriculture monitoring system using raspberry
Pi. Int. J. Pure Appl. Math. 119(12), 1745–1756 (2018)
44. Moussa, F.: IoT-Based smart irrigation system for agriculture. J. Sens. Actuator Net. 8(4), 1–15
(2019)
45. Panchal, H., Mane, P.: IoT-Based monitoring system for smart agriculture. Int. J. Adv. Res.
Comput. Sci. 11(2), 107–111 (2020)
46. Mane, P.: IoT-Based smart agriculture: applications and challenges. Int. J. Adv. Res. Comput.
Sci. 11(1), 1–6 (2020)
47. Singh, P., Singh, M.K., Singh, N., Chakraverti, A.: IoT and AI-based intelligent agriculture
framework for crop prediction. Int. J. Sens. Wireless Commun. Control 13(3), 145–154 (2023)
48. Sharma, D.R., Mishra, V., Srivastava, S.: Enhancing crop yields through IoT-enabled precision
agriculture. In: 2023 International Conference on Disruptive Technologies (ICDT), pp. 279–
283. Greater Noida, India (2023). https://doi.org/10.1109/ICDT57929.2023.10151422
49. Gomathy, C.K., Geetha, V.: Several merchants using electronic-podium for cultivation. J.
Pharmaceutical Neg. Res., 7217–7229 (2023)
Stream Data Model and Architecture
Abstract In the recent era, big data streams have had a significant impact, owing to
the many applications that continuously generate large amounts of data at high
velocity. Because of the inherently dynamic features of big data, it is hard to apply
existing working models directly to big data streams. The solution to this limitation
is data streaming. A modern data streaming architecture allows taking in, operating
on, and analyzing high volumes of high-speed data from a collection of sources in
real time, to build more reactive and intelligent customer experiences. It can be
designed as a stack of five logical layers: Source, Stream Ingestion, Stream Storage,
Stream Processing, and Destination. This chapter comprises a brief assessment of
the stream analysis of big data, taking a thorough and organized look at the trends
in the technologies and tools used in the field of big data streaming, along with their
comparisons.
We also cover issues like scalability, privacy, and load balancing, together with
their existing solutions. The DGIM algorithm, which is used to count the number of
ones in a window, the FCM clustering algorithm, and others are also reviewed in
this chapter.
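As a preview of the DGIM algorithm mentioned above, the sketch below follows the standard textbook formulation rather than an implementation from this chapter: buckets of power-of-two sizes (at most two per size) summarize the window, and the count of ones is estimated from the bucket sizes with at most roughly 50% error.

```python
class DGIM:
    """Approximate count of 1s in the last `window` bits using O(log^2 N) space."""

    def __init__(self, window):
        self.window = window
        self.t = 0
        self.buckets = []  # newest first; each bucket = [time of most recent 1, size]

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket once it slides entirely out of the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.window:
            self.buckets.pop()
        if bit:
            self.buckets.insert(0, [self.t, 1])
            # Keep at most two buckets per size: merge the two oldest of any triple.
            i = 0
            while i + 2 < len(self.buckets):
                if self.buckets[i][1] == self.buckets[i + 2][1]:
                    self.buckets[i + 1][1] *= 2   # merged bucket keeps the newer timestamp
                    del self.buckets[i + 2]
                else:
                    i += 1

    def count_ones(self):
        """Count every bucket except the oldest fully; assume half of the oldest is in-window."""
        if not self.buckets:
            return 0
        return sum(s for _, s in self.buckets[:-1]) + (self.buckets[-1][1] + 1) // 2


import random
random.seed(7)
bits = [random.randint(0, 1) for _ in range(1000)]
d = DGIM(128)
for b in bits:
    d.add(b)
exact = sum(bits[-128:])
print(d.count_ones(), exact)  # the estimate stays within 50% of the exact count
```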
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 81
P. Singh et al. (eds.), Data Analytics and Machine Learning, Studies in Big Data 145,
https://doi.org/10.1007/978-981-97-0448-4_5
82 S. Anjum et al.
1 Introduction
In the recent era, big data streams have had a significant impact, owing to the many
applications that continuously generate large amounts of data at high velocity. For
various existing data mining methods, it is hard to directly apply techniques and
tools to big data streams because of the inherently dynamic features of big data.
The solution to this constraint is data streaming, also known as stream processing
or event streaming. Before examining streaming data architecture, it is necessary
to know what streaming data actually means. It is not a very specialized kind of
thing; rather, it is a general term for data that is created at very high speed, in
enormous volumes, and in a continuous manner. In real life, there are many examples
of data streaming around us, with use cases in every industry: real-time retail
inventory management, social media feeds, multiplayer games, ride-sharing apps, etc.
From these examples, it can be observed that a stream data source captures events
in real time. The data stream may have semi-structured or unstructured form,
usually key-value pairs in JSON or Extensible Markup Language (XML).
Batch processing refers to processing a high volume of data in batches within a
definite time duration, operating on the whole dataset at once. When data is
collected over time and similar data is batched together, batch processing is used.
But debugging batch processing is difficult, as it requires a dedicated expert to fix
errors, and it is highly expensive.
Stream processing refers to the immediate processing of a stream of data that is
produced continuously, performing the analysis in real time. It is used on data of
unknown and infinite size, in a continuous manner, and it is fast. The challenge is
to cope with the high speed and huge amount of data.
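The contrast can be made concrete with the same statistic computed both ways: batch processing needs the whole dataset before it can run, while stream processing updates constant-size state as each record arrives. The readings below are invented.

```python
# Contrast sketch: a mean computed batch-style over the whole dataset at once,
# versus stream-style one record at a time.

def batch_mean(values):
    return sum(values) / len(values)        # needs all data before it can run

class StreamingMean:
    """Processes each arriving value immediately; O(1) state, no stored history."""
    def __init__(self):
        self.n, self.mean = 0, 0.0
    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n   # incremental running mean
        return self.mean

readings = [4.0, 8.0, 6.0, 2.0]
s = StreamingMean()
for r in readings:
    s.update(r)
print(batch_mean(readings), s.mean)  # both 5.0
```

The streaming version never revisits old records, which is exactly what makes it viable on unbounded streams.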
Figures 1a and b represent the ways of processing in batch and data stream systems.
Data streaming generally refers to sending, receiving, and then processing information
as a stream of data, in place of batches of discrete data. It involves six steps,
which are shown in Fig. 2 and described as follows [1]:
Step 1: Data Production
In this first step, data is generated and sent from different sources such as IoT
devices, social media platforms, or web applications. The data format may differ,
for example JSON or CSV, and the data may be characterized in different ways,
such as unstructured or structured, and static or dynamic, etc.
• Structured data has a specific format and length. It is easy to store and
analyze with a high level of organization. A basic structure is the relational
database table. It is difficult to scale but robust in nature.
• Semi-structured data is irregular in nature: it may be incomplete and may have
a rapidly changing or unpredictable structure, and it does not conform to an
explicit or fixed schema. Basic structures include XML or RDF (Resource
Description Framework). It is easier to scale than structured data.
• Unstructured data does not have any particular structure. Basic structures
include binary and character data. It is the most scalable.
• Static data is fixed data that remains the same after it is collected.
• Dynamic data changes continuously after it is recorded; the main objective is
to maintain its integrity.
There are many methods and protocols, such as the HTTP protocol or MQTT with
push and pull methods, which are used by data producers to send data to the
consumers.
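A minimal sketch of this producer-to-consumer hand-off: one semi-structured event serialized as JSON key-value pairs by the producer and parsed back by the consumer. The field names are hypothetical.

```python
# Sketch of the producer/consumer hand-off: one semi-structured stream event
# as JSON key-value pairs (field names are invented for illustration).
import json

event = {
    "device_id": "soil-node-07",
    "ts": "2024-01-15T09:30:00Z",
    "readings": {"moisture_pct": 27.5, "temp_c": 21.3},
}

payload = json.dumps(event)            # what the producer pushes (HTTP/MQTT body)
received = json.loads(payload)         # what the consumer parses on arrival
print(received["readings"]["moisture_pct"])  # 27.5
```

The same round trip applies whether the transport is an HTTP POST or an MQTT publish; only the delivery mechanism changes.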
Here, descriptive analytics tells us what has already occurred; predictive
analytics tells us what could occur; and, finally, prescriptive analytics tells us
what should occur in the future.
Data analysts can use several platforms and tools for data access and
querying, such as Power BI, Python notebooks, SQL, etc. They can also
generate different outputs based on the analysis, such as alerts, charts,
dashboards, maps, etc.
Step 5: Data Reporting
The analyzed data must be summarized and reported in order to make sense of it,
and this is done by reporting tools. Many formats exist for report generation,
and many channels are in use to present and share the data with stakeholders,
such as emails, slideshows, report documents, webinars, etc.
Data reporting can also use several indicators and metrics to measure and
monitor performance against goals, objectives, KPIs, etc.
Step 6: Data Visualization and Decision Making
In the final step of data streaming, data is visualized and acted upon by the
decision makers, such as customers or managers. Several types and styles of
data visualization can be used to explore and understand the data; any type,
such as charts, graphs, maps, etc., can be used by the decision maker. Other
features and tools, such as drill-downs, filters, and sorts, can also be used to
interact with the visualizations.
After reviewing the visualizations, decision makers can make timely decisions
on the basis of the insights. Some examples of decision making are
enhancing customer experience, improving products, optimizing processes, etc.
Now, after knowing the basic functionality of data streaming, let us take a look
at the primary components of modern stream processing infrastructure.
Combining all these technologies by constructing a real-world application with good
scalability is a big deal. Here [2], the latest event stream processing methodologies
are made clearly understandable through summaries, definitions, frameworks,
architectures, and textual use cases of data flow architectures. In addition, the
authors discuss uniting event stream processing with sentiment analysis and events
in textual form to improve a reference model.
In [3], we find that complex event processing (CEP) detects situations of interest
by evaluating queries over event streams. When CEP is used for network-based
applications, dividing query evaluation among the event sources can provide
performance optimization. Instead of collecting events at one central place for
query evaluation, subqueries are placed at network nodes to decrease the overhead
of data transmission.
To overcome the limitations of existing models, INEv graphs are used. They introduce
fine-grained routing of the partial outcomes of subqueries as an extra degree
of freedom in query evaluation. The basic structure of INEv, used for In-Network
Evaluation for Event Stream Processing, is shown in Fig. 3.
Figure 4 shows the various fields of modern systems in which data streams are
leveraged by a streaming data architecture.
Fig. 3 In-network evaluation for event stream processing: INEv
Fig. 4 Streaming data architecture: power area of data streams in modern system
The basic design and operation of a streaming data architecture generally depend
on your objectives and requirements. On this basis, there are two common modern
streaming data architecture patterns, named Lambda and Kappa.
• Lambda: This is a hybrid architecture that mixes traditional batch processing with
real-time processing, making it easier to deal with two types of data, i.e. historical
data and real-time data streams. This combination gives the capability to handle huge
amounts of data while still providing sufficient speed for data in motion.
This flexibility comes at a cost in terms of latency and maintenance requirements,
as the two paths are joined in an extra serving layer for the greatest accuracy,
fault tolerance, and scalability.
• Kappa: In contrast to the Lambda architecture, the Kappa architecture concentrates
only on real-time processing, treating historical data and live data uniformly as
streams. In the absence of a batch processing system, the Kappa architecture is less
costly, less complex and more consistent. Processed data is saved in a storage
system that can be queried both in batches and as streams. This technique demands
high performance, idempotency and reliability.
Nowadays many organizations use streaming data analytics, not only because of the
rising demand for processing real-time data, but also because of the many benefits
organizations can gain from a streaming data architecture. Some are listed below:
88 S. Anjum et al.
• Ease of Scalability
• Pattern Detection
• Enabling Modern Real-time Data Solutions
• Improved Customer Experience
2 Literature Review
Real-time data stream processing for industrial fault detection is well summarized
in [4]. The main focus is on data stream analysis for industrial applications:
identifying industrial needs and, from them, the requirements for designing a
potential Data Stream Management System. Recognizing industrial needs and challenges
helps identify improvements in this area. A monitoring system based on a Data Stream
Management System was proposed to implement the given suggestions. The proposed
monitoring system benefits from combining various fault-detection methods, such as
analytical, data-driven and knowledge-based methods.
Another data processing approach, which performs online processing of events, is
called Event Stream Processing (ESP). Researchers have explored how best to employ
such platforms within particular use cases. From a commercial point of view,
decision makers ask how best to utilize these events with minimal delay in order
to derive insight in real time, to mine textual events and to support decisions.
This requires a mix of batch processing, machine learning and stream processing
technologies, which are usually optimized separately. However, combining all these
technologies into a real-world application with good scalability is a major
undertaking. In [2], the latest event stream processing methodologies are made
clearly understandable through a summary of data flow architectures, definitions,
frameworks and architecture, together with textual processing and sentiment analysis
of textual events, leading toward a reference model.
A general overview of real-time big data analytics covers its present architecture,
the available methods of data stream processing, and system architectures [5, 23].
The conventional approach to evaluating enormous data sets is unsuitable for
real-time analysis; for that reason, analyzing streaming big data remains a decisive
matter for many uses and applications. In big data and real-time analytics it is
vital to process data close to where it arrives, with fast response and sound
decision making, necessitating the development of new models for high-speed,
low-latency real-time processing.
One important consideration is securing the real-time stream. As with other network
security, stream security can be built on the pillars of the Confidentiality,
Integrity & Availability (CIA) model [6]. However, most practical implementations
focus only on the first two aspects, confidentiality and integrity, by means of
techniques such as encryption and signatures. An access control mechanism has been
introduced for streams that adds extra security metadata to the stream. This
metadata can allow or deny access to stream elements and also protect the privacy
of the data. The work is illustrated using the Apache Storm streaming engine.
An analysis of present big data software models across a variety of application
domains offers results that support researchers in future work. It identified
recurring motivations for adopting big data software architectures, for example:
improving efficiency, improving real-time data processing, reducing development
costs, supporting the analytics process, and enabling novel services, together
with shared work [7]. Business constraints differ for every application area;
thus, targeting a big data software application at a particular area requires
tailoring the common reference models into area-specific reference models. The
study evaluates big data software architectures across distinct use cases from
different application domains, along with their consequences, and discusses
recognized challenges and possible enhancements.
The phrase "big data" is used for complex data that is hard to process. It is
characterized by the 6 Vs: value, variability, variety, velocity, veracity and
volume. Many applications produce massive data that also grows quickly in a short
time. This fast-arriving data must be handled with the various approaches that
exist in the field of big data solutions. Open-source technologies such as Apache
Kafka and NoSQL databases have been proposed to build stream architectures for big
data velocity [8, 10]. Interest has grown in analyzing big data stream processing
(big data in motion) rather than big data batch processing (big data at rest).
Issues such as consistency, fault tolerance, scalability, integration,
heterogeneity, timeliness, load balancing, high throughput and privacy need more
research attention; even after much work on these issues, load balancing, privacy
and scalability remain the main areas of focus.
The data integration layer allows geospatial subscriptions using the GeoMQTT
protocol. It supports target-specific data integration while preserving the ability
to gather data from IoT devices, owing to GeoMQTT's efficient resource utilization.
The authors used current stream processing methods, with Apache Storm as the
central tool of their model and Apache Kafka between the GeoMQTT broker and the
Apache Storm message processing system. Their design can be used to build
applications for many use cases that deploy and evaluate distributed stream
processing methods and algorithms operating on spatio-temporal data streams
originating from IoT devices [11].
The introduction of a 7-layered architecture, and its comparison with a 1-layered
architecture, became important because until that point no general architecture for
data streaming analysis was both scalable and flexible [12]. Data extraction, data
transformation, data filtering and data aggregation are performed in the first six
layers of the architecture; the seventh and final layer carries the analytic
models. This 7-layered architecture consists of microservices and publish-subscribe
software. Several studies have shown that this setup can ensure a solution with
low coupling and high cohesion, which increases scalability and maintainability;
communication between the layers is asynchronous. Practical experience in the
field of financial and e-commerce applications shows that this 7-layered
architecture is helpful for a large number of business use cases.
A data stream model named Aurora manages data streams for monitoring applications
[13], and it differs from traditional business data processing. Such software must
process and respond to frequent inputs coming from numerous, diverse sources such
as sensors rather than from human operators; this fact requires rethinking the
elementary architecture of a DBMS for this application area. The authors therefore
present Aurora, a new DBMS, provide an overview of its architecture, and then
describe in detail a set of stream-oriented operators.
Table 1 summarizes the important findings of various studies in the field of data
streaming.
Following [14], Fig. 5 shows all the components of the stream data model. A brief
explanation of Fig. 5 follows:
• In a DSMS there is a stream processor, a data-management system organized at a
high level. Any number of streams can enter the system, and these streams need not
have uniform arrival rates. Streams may be archived in a large archival store, but
we assume that queries cannot be answered from this archival store. There is
therefore also a working store, in which parts of streams or their summaries may
be placed, and this working store is used for answering queries. It may be disk or
main memory, depending on the speed needed to process queries; in any case its
capacity is sufficiently limited that it cannot store all of the data from all of
the streams. Examples of stream sources are sensor data, image data, and internet
and web traffic.
• Stream Queries—One way to query streams is to place standing queries inside the
processor. These queries are, in effect, permanently executing, and they produce
output at appropriate times. The other kind of query is ad hoc. To support a wide
range of ad-hoc queries, a common approach is to store a sliding window of each
stream in the working store.
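The sliding-window working store for ad-hoc queries can be sketched as follows; the window length and the average query below are illustrative assumptions, not details from the text:

```python
from collections import deque

class WorkingStore:
    """Keeps a sliding window of the last n elements of each stream,
    so ad-hoc queries can be answered without touching the archival store."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.windows = {}   # stream name -> deque of recent elements

    def append(self, stream, element):
        w = self.windows.setdefault(stream, deque(maxlen=self.window_size))
        w.append(element)   # the oldest element is evicted automatically

    def query_avg(self, stream):
        """Ad-hoc query: average over the stored window."""
        w = self.windows.get(stream)
        return sum(w) / len(w) if w else None

store = WorkingStore(window_size=3)
for reading in [10, 20, 30, 40]:        # e.g. a sensor stream
    store.append("sensor-1", reading)
print(store.query_avg("sensor-1"))      # 30.0 (average of 20, 30, 40)
```

The bounded `deque` is the point: the working store never grows with the stream, matching the limited-capacity constraint described above.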
First we take a bit array of m bits as an empty Bloom filter and set all of these
bits to zero.
• For a given input, we need K hash functions to calculate the hashes.
• Indices are calculated using these hash functions: when we want to add an item y
to the filter, the bits at the K indices f1(y), f2(y), …, fK(y) are set.
Example 5.1: Suppose we want to enter the word "throw" into the filter and we have
three hash functions to use. Initially, a bit array of length 10 is produced, with
all of its bits set to 0. First we calculate the hashes with the following functions:
f1("throw") % 10 = 1
f2("throw") % 10 = 4
f3("throw") % 10 = 7
Note that these outputs are chosen arbitrarily, for explanation purposes only.
Now we set the bits at indices 1, 4 & 7 to 1.
If we now want to enter the word "catch", we calculate the hashes in the same manner:
f1("catch") % 10 = 3
f2("catch") % 10 = 5
f3("catch") % 10 = 4
Set the bits at indices 3, 5 & 4 to 1.
• To check whether the word "throw" is present in the filter, we reverse the same
process: calculate the respective hashes using f1, f2 & f3 and check whether all
of those indices in the bit array are set to 1.
• If all of these bits are set to 1, we can say that "throw" is "probably present".
• If any bit at these indices is 0, then "throw" is "definitely not present".
Why did we say "probably present", and where does this uncertainty come from?
Consider an example.
Example 5.2: Suppose we want to check whether the word "bell" is present. We
calculate the hashes using f1, f2 & f3:
f1("bell") % 10 = 1
f2("bell") % 10 = 3
f3("bell") % 10 = 7
• Looking at the bit array, the bits at these resulting indices are set to 1, yet
we know that the word "bell" was never added to the filter: the bits at indices
1 & 7 were set when we added the word "throw", and bit 3 was set when we added the
word "catch". This is a false positive.
• By controlling the size of the Bloom filter, we can control the probability of
getting a false positive.
• The false-positive probability P falls as the bit array grows, and achieving a
lower target probability also requires more hash functions (the optimal number is
k = (m/n) ln 2). For n inserted elements and a target false-positive probability
P, the required number of bits is

m = −n ln P / (ln 2)²  (2)
• It is worth noting that Bloom filters never generate a false negative: if an
element (say, a username) actually exists in the set, the filter will never report
that it does not exist.
• It is not possible to delete elements from a Bloom filter, because clearing the
bits (generated by the k hash functions) at the indices of a single element may
also delete other elements.
• For example, deleting the word "throw" in the example above by clearing the bits
at indices 1, 4 & 7 would also end up deleting the word "catch", because the bit
at index 4 would become 0 and the Bloom filter would then claim that "catch" is
not present.
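Examples 5.1 and 5.2 can be reproduced with a small Bloom filter implementation. A sketch follows; the salted-MD5 hashing is an implementation choice of this sketch, not prescribed by the text, so the indices it produces will not match the illustrative values above (which were stated to be arbitrary):

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m = m                 # number of bits in the array
        self.k = k                 # number of hash functions
        self.bits = [0] * m

    def _indices(self, item):
        # Derive k hash functions f1..fk by salting one base hash.
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = 1     # set the bit at each of the k indices

    def __contains__(self, item):
        # All k bits set -> "probably present"; any 0 -> "definitely not present".
        return all(self.bits[idx] for idx in self._indices(item))

bf = BloomFilter(m=1000, k=3)
bf.add("throw")
bf.add("catch")
print("throw" in bf)   # True -- a Bloom filter has no false negatives
print("catch" in bf)   # True
# "bell" was never added; with m = 1000 a false positive is unlikely here,
# but, as Example 5.2 shows, it can never be ruled out entirely.
```

Note that after two insertions with three hash functions, at most six bits are set; shrinking m toward 10, as in the worked example, makes collisions (and hence false positives) far more likely.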
Another kind of processing is needed to count distinct elements in a stream, where
the stream elements are assumed to be chosen from some universal set. One would
like to know approximately how many unique elements have appeared in the stream,
counting either from the beginning of the stream or from some known point in the
past. Here we describe the Flajolet-Martin (FM) algorithm for counting the distinct
(unique) elements in a stream.
Example 5.3: Determine the number of distinct elements in the stream
1, 1, 1, 4, 2, 3, 2, 3, 3, 2, 3, 1, 1 using the Flajolet-Martin algorithm, with
the hash function f(x) = (6x + 1) mod 5.
For the first element: f(1) = (6 · 1 + 1) mod 5 = 7 mod 5 = 2.
Similarly, calculating the hash function for the remaining input stream:
f(1) = 2 f(1) = 2 f(4) = 0 f(2) = 3
f(3) = 4 f(2) = 3 f(3) = 4 f(3) = 4
f(2) = 3 f(3) = 4 f(1) = 2 f(1) = 2
Step 3: Trailing zeros. Now write the count of trailing zeros in the binary
representation of each hash value. If R is the maximum count of trailing zeros
observed, the estimated number of distinct elements is 2^R.
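A minimal sketch of the Flajolet-Martin estimate follows. The stream and the hash function f(x) = (6x + 1) mod 5 are reconstructed from the hash values above, and treating a hash value of 0 as contributing nothing is one common convention (an assumption of this sketch):

```python
def trailing_zeros(n):
    """Count trailing zeros in the binary representation of n (n > 0)."""
    count = 0
    while n % 2 == 0:
        n //= 2
        count += 1
    return count

def fm_estimate(stream, h):
    """Flajolet-Martin: estimate the distinct count as 2**R, where R is the
    maximum number of trailing zeros over all (non-zero) hash values."""
    r = 0
    for x in stream:
        v = h(x)
        if v > 0:                       # convention: skip hash value 0
            r = max(r, trailing_zeros(v))
    return 2 ** r

stream = [1, 1, 1, 4, 2, 3, 2, 3, 3, 2, 3, 1, 1]
h = lambda x: (6 * x + 1) % 5           # hash function of Example 5.3
print(fm_estimate(stream, h))           # 4 -- matches the true distinct count
```

Here the largest trailing-zero count is R = 2 (from f(4) = 100 in binary via f(3) = 4), so the estimate is 2² = 4, and the stream indeed contains four distinct values {1, 2, 3, 4}.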
i. The right end of a bucket must always be a 1. (If the right end is a 0, that 0
is not part of the bucket.) For example, 1001011 can form a bucket of size 4: it
contains four 1s and ends with a 1 on its right end.
ii. Every bucket must contain at least one 1; otherwise no bucket can be formed.
iii. All bucket sizes must be powers of 2.
iv. Moving to the left, bucket sizes cannot decrease (they are in non-decreasing
order toward the left).
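The bucket rules above come from the DGIM algorithm (named in the conclusion of this chapter) for approximately counting the 1s in the last N bits of a stream. A simplified sketch follows, assuming the standard merge policy of keeping at most two buckets of any size:

```python
class DGIM:
    """Simplified DGIM: approximate count of 1s in the last N bits.

    Buckets are (end_timestamp, size) pairs, oldest first; sizes are
    powers of two, with at most two buckets of any given size.
    """

    def __init__(self, window):
        self.window = window
        self.t = 0
        self.buckets = []                        # oldest first

    def add(self, bit):
        self.t += 1
        # Expire buckets that have slid entirely out of the window.
        self.buckets = [b for b in self.buckets
                        if b[0] > self.t - self.window]
        if bit == 1:
            self.buckets.append((self.t, 1))
            self._merge()

    def _merge(self):
        # While any size occurs three times, merge the two oldest of that size.
        merged = True
        while merged:
            merged = False
            sizes = [s for (_, s) in self.buckets]
            for s in set(sizes):
                if sizes.count(s) == 3:
                    i = sizes.index(s)           # oldest bucket of size s
                    a, b = self.buckets[i], self.buckets[i + 1]
                    self.buckets[i:i + 2] = [(max(a[0], b[0]), 2 * s)]
                    merged = True
                    break

    def estimate(self):
        # Sum all bucket sizes, counting only half of the oldest bucket.
        if not self.buckets:
            return 0
        total = sum(s for (_, s) in self.buckets)
        return total - self.buckets[0][1] + self.buckets[0][1] // 2

dgim = DGIM(window=16)
for _ in range(20):
    dgim.add(1)
print(dgim.estimate())   # 16 for this all-ones stream (exact count is also 16)
```

Only half of the oldest bucket is counted because that bucket may straddle the window boundary; this is what bounds DGIM's error at roughly 50% while using only O(log² N) memory.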
In the vast field of big data, it is essential to handle big data streams
effectively. In this section, taking big data stream processing as a basis, we
elaborate on several data processing frameworks, all of which are free and
open-source software [15].
In this section, we compare and explain different stream processing engines (SPEs),
selecting six of them. These six SPEs are:
i. Apache—Spark
ii. Apache—Flink
iii. Apache—Storm
iv. Apache—Heron
v. Apache—Samza
vi. Amazon—Kinesis.
Before describing these, note that their predecessor, Apache Hadoop, is included
for historical reasons.
• Hadoop is known as the first framework to appear for processing huge datasets
using the MapReduce programming model. It is scalable by nature: it can run on a
single machine in a single cluster, or extend and run on multiple machines across
several clusters. Furthermore, Hadoop takes advantage of distributed storage to
improve performance by shipping the code that is to process the data to the data,
rather than moving the data itself. Hadoop also provides high availability and
high throughput. However, it can have efficiency problems when handling small
files.
• Above all, the main limitation of Hadoop is that it does not support real-time
stream processing. To address this limitation, Apache Spark came to light. Spark
is a framework for batch processing and data streaming that also supports
distributed processing. Spark was intended to respond to three big problems of
Hadoop:
i. Efficient support for iterative algorithms that make a number of passes over
the data (which Hadoop handles poorly).
ii. Support for real-time streaming and interactive queries.
iii. In place of MapReduce, Apache Spark uses RDDs (Resilient Distributed
Datasets), which provide fault tolerance and parallel processing.
• Two years later, Apache Flink and Apache Storm appeared. Flink can perform both
batch processing and streaming of data, and it can process streams with precise
ordering requirements. Storm and Flink are comparable frameworks, with the
following characteristics:
i. Storm supports stream processing only.
ii. Both Storm and Flink perform low-latency stream processing.
iii. Flink's API is high level and rich in functionality.
iv. For fault tolerance, Flink uses a snapshot algorithm, whereas Storm uses
record-level acknowledgements.
v. Storm's limitations are its low scalability and the complexity of debugging
and managing it.
• Apache Heron then came after Storm, as its successor.
• Apache Samza provides event-based applications, real-time processing, and ETL
(Extract, Transform, Load) capabilities. It provides numerous APIs and has a model
like Hadoop's, but in place of MapReduce it has the Samza API, and it uses Kafka
instead of the Hadoop Distributed File System.
• Amazon Kinesis is the only framework in this section that does not belong to the
Apache Software Foundation. Kinesis is in reality a set of four frameworks rather
than a single data stream framework. Kinesis can easily be integrated with Flink.
All of these are summarized in Table 2.
Table 2 Data processing frameworks [15]

| S. No | Framework | Inventor | Incubation year | Processing | Delivery of events | Latency | Throughput | Scalability | Fault tolerance |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Apache—Hadoop | Apache Software Foundation | N.A. | Batch | N.A. | High | High | High | Replication in the HDFS |
| 2 | Apache—Spark | University of California | 2013 | Micro batch—batch and stream | Exactly once | Low | High | High | RDD—Resilient Distributed Dataset |
| 3 | Apache—Flink | Apache Software Foundation | 2014 | Batch and stream | Exactly once | Low | High | High | Incremental checkpointing (with the use of markers) |
| 4 | Apache—Storm | Backtype | 2013 | Stream | At least once | Low | High | High | Record-level acknowledgements |
| 5 | Apache—Heron | Twitter | 2017 | Stream | At most once, at least once, exactly once | Low | High | High | High fault tolerance |
| 6 | Apache—Samza | LinkedIn | 2013 | Batch and stream | At least once | Low | High | High | Host affinity & incremental checkpointing |
| 7 | Amazon—Kinesis | Amazon | N.A. | Batch and stream | At least once | Low | High | High | High fault tolerance |
Several challenges for machine learning on streaming data are discussed in [17].
Overcoming them will help in:
i. Exploring the relationships between recent AI developments (e.g., RNNs,
reinforcement learning) and adaptive stream mining algorithms;
ii. Characterizing and detecting drifts when immediately labeled data is absent;
iii. Developing adaptive learning techniques that can cope with verification
latency;
iv. Incorporating preprocessing techniques that can transform the raw data in a
continuous manner.
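Point (ii), detecting drift without immediately labeled data, can be illustrated with a deliberately simple unlabeled check on the input distribution: compare a frozen reference window against the most recent window of values. The window size and threshold below are arbitrary choices for this sketch, not values from [17]:

```python
from collections import deque

class MeanShiftDetector:
    """Flags drift when the mean of the most recent window moves away from
    the mean of a frozen reference window by more than `threshold`."""

    def __init__(self, window=50, threshold=0.5):
        self.reference = []                # first `window` values, then frozen
        self.recent = deque(maxlen=window)
        self.window = window
        self.threshold = threshold

    def update(self, x):
        if len(self.reference) < self.window:
            self.reference.append(x)       # still building the reference
            return False
        self.recent.append(x)
        if len(self.recent) < self.window:
            return False
        ref_mean = sum(self.reference) / self.window
        cur_mean = sum(self.recent) / self.window
        return abs(cur_mean - ref_mean) > self.threshold

detector = MeanShiftDetector(window=50, threshold=0.5)
stream = [0.0] * 100 + [1.0] * 100       # abrupt drift halfway through
drift_at = next(i for i, x in enumerate(stream) if detector.update(x))
print(drift_at)   # 125: the recent-window mean first exceeds the threshold here
```

Real detectors (e.g., ADWIN or DDM, as surveyed in [17]) use statistically grounded tests rather than a fixed mean threshold, but the structure is the same: monitor a summary of the stream and raise a signal when it departs from the reference behavior.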
This section briefly explains the challenges in streaming data processing and
various solutions for overcoming them.
Many case studies have brought to light the challenges faced when processing
streaming data. Some notable ones are:
i. Unbounded Memory Requirements for High Data Volume
The main aim of processing streaming data is to manage incoming data that is
produced at very high velocity, in huge volume, and continuously in real time,
from a very large number of sources. As there is no finite end to a continuously
produced data stream, the data processing infrastructure must likewise be designed
for unbounded memory requirements.
ii. Architecture Complexity and Infrastructure Monitoring
Data stream processing systems are frequently distributed and must handle a large
number of parallel connections and data sources, which can be hard to operate and
to monitor for issues that may arise, particularly at scale.
iii. Coping with the Dynamic Nature of Streaming Data
Because of the dynamic nature of streaming data, stream processing systems have
to be adaptive in order to handle concept drift (which renders some data
processing methods inappropriate) and to operate with restricted memory and time.
iv. Data Streams Query Processing
Though data stream processing faces key challenges, there are ways to overcome
them, including:
i. Using the proper mixture of on-premises and cloud-based resources and services
[22].
ii. Choosing the right tools.
iii. Setting up consistent infrastructure for monitoring data processing and
integration, and improving efficiency with data skipping and operator pipelining.
5 Conclusion
This chapter concludes that in today's era there is an unstoppable flow of data,
which may be unstructured, semi-structured or structured, and which can be produced
by any source: transactional systems, social media feeds, IoT devices or other
real-time applications. The existence of such continuously produced data requires
processing, analysis, reporting, and so on. Past batch processing systems cannot
handle it because of their finite-data orientation; this is where streaming data
processing empowers us to cope with data streams. Given its importance in the data
stream model, the query processor should be powerful, able to retrieve data at any
scale and properly manage the storage system. A DSMS does not work in a single
pass; it comprises step-by-step processing: Data Production, Data Ingestion, Data
Processing, Streaming Data Analytics, Data Reporting, Data Visualization and
Decision Making. The importance of, and requirement for, more advanced event
stream processing models is evident.
The popularity of Apache Hadoop and its limitations in some areas resulted in the
invention of Apache—Spark, Apache—Flink, Apache—Storm, Apache—Heron, Apache—Samza
and Amazon—Kinesis. These are only a few; hybrid forms can be many. A study of the
research literature finds that load balancing, privacy and scalability issues
still need further effort, and significant research attention should also be given
to the preprocessing stage of big data streams. This chapter also sheds light on
overcoming the challenges in data streaming: with the proper mix of approaches,
data architecture and resources, one can readily take advantage of real-time data
analytics.
Much research covers methodologies for filtering a data stream, counting the
distinct (unique) elements in a data stream, and counting the 1s in a data stream;
for these, Bloom filtering, the Flajolet-Martin (FM) algorithm and DGIM already
exist. In real-time data analysis, however, many more features remain to be
extracted, and these can be the focus of future researchers. This may help enlarge
the application area of data streaming.
References
1. Eberendu, A.: Unstructured data: an overview of the data of Big Data. Int. J. Emerg. Trends
Technol. Comput. Sci. 38(1), 46–50 (2016). https://doi.org/10.14445/22312803/IJCTT-V38
P109
2. Bennawy, M., El-Kafrawy, P.: Contextual data stream processing overview, architecture, and
frameworks survey. Egypt. J. Lang. Eng. 9(1) (2022). https://ejle.journals.ekb.eg/article_2
15974_5885cfe81bca06c7f5d3cd08bff6de38.pdf
3. Akili, S., Matthias, P., Weidlich, M.: INEv: in-network evaluation for event stream processing.
Proc. ACM on Manag. Data. 1(1), 1–26 (2023). https://doi.org/10.1145/3588955
4. Alzghoul, A.: Monitoring big data streams using data stream management systems: industrial
needs, challenges, and improvements. Adv. Oper. Res. 2023(2596069) (2023). https://doi.org/
10.1155/2023/2596069
5. Hassan, A., Hassan, T.: Real-time big data analytics for data stream challenges: an overview.
EJCOMPUTE. 2(4) (2022). https://doi.org/10.24018/compute.2022.2.4.62
6. Nambiar, S., Kalambur, S., Sitaram, D.: Modeling access control on streaming data in apache
storm. (CoCoNet’19). Proc. Comput. Sci. 171, 2734–2739 (2020). https://doi.org/10.1016/j.
procs.2020.04.297
7. Avci, C., Tekinerdogan, B., Athanasiadis, I.: Software architectures for big data: a systematic
literature review. Big Data Anal. 5(5) (2020). https://doi.org/10.1186/s41044-020-00045-1
8. Hamami, F., Dahlan, I.: The implementation of stream architecture for handling big data
velocity in social media. J. Phys. Conf. Ser. 1641(012021) (2020). https://doi.org/10.1088/
1742-6596/1641/1/012021
9. Kenda, K., Kazic, B., Novak, E., Mladenić, D.: Streaming data fusion for the internet of things.
Sensors 2019. 19(8), 1955 (2019). https://doi.org/10.3390/s19081955
10. Kolajo, T., Daramola, D., Adebiyi, A.: Big data stream analysis: a systematic literature review.
J. Big Data. 6(47) (2019). https://doi.org/10.1186/s40537-019-0210-7
11. Laska, M., Herle, S., Klamma, R., Blankenbach, J.: A scalable architecture for real-time stream
processing of spatiotemporal IoT stream data—performance analysis on the example of map
matching. ISPRS Int. J. Geo-Inf. 7(7), 238 (2018). https://doi.org/10.3390/ijgi7070238
12. Hoque, S., Miranskyy, A.: Architecture for Analysis of Streaming Data, Conference: IEEE
International Conference on Cloud Engineering (IC2E) (2018). https://doi.org/10.1109/IC2E.
2018.00053
13. Abadi, D., Çetintemel, U., et al.: Aurora: a new model and architecture for data stream management.
VLDB J. 12(2), 120–139 (2003). https://doi.org/10.1007/s00778-003-0095-z
14. Leskovec, J., Rajaraman, A., Ullman, J.: Mining of Massive Datasets. Cambridge
University Press, Cambridge (2010)
15. Almeida, A., Brás, S., Sargento, S., Pinto, F.: Time series big data: a survey on data stream
frameworks, analysis and algorithms. J Big Data 10(1), 83 (2023). https://doi.org/10.1186/s40
537-023-00760-1
16. Sujatha, C., Joseph, G.: A survey on streaming data analytics: research issues, algorithms,
evaluation metrics, and platforms. In: Proceedings of International Conference on Big Data,
Machine Learning and Applications, pp. 101–118 (2021). https://doi.org/10.1007/978-981-33-
4788-5_9
17. Gomes, H., Bifet, A.: Machine learning for streaming data: state of the art, challenges, and
opportunities. ACM SIGKDD Explor. Newsl. 21(2), 6–22 (2019). https://doi.org/10.1145/337
3464.3373470
18. Aguilar-Ruiz, J., Bifet, A., Gama, J.: Data stream analytics. Analytics 2(2), 346–349 (2023).
https://doi.org/10.3390/analytics2020019
19. Rashid, M., Hamid, M., Parah, S.: Analysis of streaming data using big data and hybrid
machine learning approach. In: Handbook of Multimedia Information Security: Techniques
and Applications, pp. 629–643 (2019). https://doi.org/10.1007/978-3-030-15887-3_30
20. Samosir, J., Santiago, M., Haghighi, P.: An evaluation of data stream processing systems for
data driven applications. Proc. Comput. Sci. 80, 439–449 (2016). https://doi.org/10.1016/j.
procs.2016.05.322
21. Geisler, S.: Data stream management systems. In: Data Exchange, Integration, and Streams.
Computer Science. Corpus ID: 12168848. 5, 275–304 (2013). https://doi.org/10.4230/DFU.
Vol5.10452.275
22. Singh, P., Singh, N., Luxmi, P.R., Saxena, A.: Artificial intelligence for smart data storage
in cloud-based IoT. In: Transforming Management with AI, Big-Data, and IoT, 1–15 (2022).
https://doi.org/10.1007/978-3-030-86749-2_1
23. Abdullah, D., Mohammed, R.: Real-time big data analytics perspective on applications, frame-
works and challenges. 7th International Conference on Contemporary Information Technology
and Mathematics (ICCITM). IEEE. 21575180 (2021). https://doi.org/10.1109/ICCITM53167.
2021.9677849
24. Mohamed, N., Al-Jaroodi, J.: Real-time big data analytics: applications and challenges. Inter-
national Conference on High Performance Computing & Simulation (HPCS). IEEE. 14614775
(2014). https://doi.org/10.1109/HPCSim.2014.6903700
25. Deshai, N., Sekhar, B.: A study on big data processing frameworks: spark and storm. In: Smart
Intelligent Computing and Applications, 415–424 (2020). https://doi.org/10.1007/978-981-32-
9690-9_43
Leveraging Data Analytics and a Deep
Learning Framework for Advancements
in Image Super-Resolution Techniques:
From Classic Interpolation
to Cutting-Edge Approaches
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 105
P. Singh et al. (eds.), Data Analytics and Machine Learning, Studies in Big Data 145,
https://doi.org/10.1007/978-981-97-0448-4_6
106 S. R. Mishra et al.
1 Introduction
Image SR, an essential task in the field of computer vision, plays a crucial role in
enhancing the resolution and quality of low-resolution images. The ability to recover
high-resolution details from low-resolution inputs has significant implications in
various applications, including medical imaging, surveillance, remote sensing, and
more [1]. As the demand for higher quality visual content continues to grow, the
development of advanced image super-resolution techniques has become a vibrant
research area. In this chapter, we delve into the remarkable advancements in image
super-resolution, tracing its evolution from classical interpolation methods to cutting-
edge deep learning approaches. Our exploration begins with an overview of the
fundamental concepts and importance of image super-resolution in diverse domains
[2].
often produced blurred and visually unappealing results, leading to the introduc-
tion of more sophisticated interpolation techniques such as Lanczos and spline-
based methods. While these techniques improved image quality to some extent, they
struggled to handle significant upscaling factors and suffered from artifacts [4].
The advent of deep learning has revolutionized the field of image super-resolution.
Convolutional Neural Networks (CNNs) emerged as a powerful tool for learning
complex mappings between low-resolution and high-resolution images. The chapter
discusses the pioneering CNN-based models, including the Super-Resolution Convo-
lutional Neural Network (SRCNN), the Very Deep Super-Resolution Network
(VDSR), and the Enhanced Deep Super-Resolution Network (EDSR). These archi-
tectures leverage deep layers to extract hierarchical features from images, leading to
impressive results in terms of visual quality and computational efficiency [5].
loss in guiding the training process. GANs have proven to be particularly effective
in generating photo-realistic details, demonstrating the potential to revolutionize
high-resolution imaging [6].
Despite the impressive results achieved with deep learning-based approaches, image
super-resolution still faces several challenges. One of the prominent challenges is
managing artifacts that can arise during the super-resolution process [7].
Additionally, improving perceptual quality and ensuring that the enhanced images
are visually appealing is critical. This chapter delves into the techniques used to
overcome these challenges, including residual learning, perceptual loss functions,
and data augmentation strategies to enrich the training data.
4. End-to-End Learning:
– Deep learning frameworks facilitate end-to-end learning, allowing the model
to directly map LR images to HR outputs. This avoids the need for handcrafted
feature engineering and enables the model to learn complex relationships.
5. Attention Mechanisms:
– Attention mechanisms, integrated into deep learning architectures, enable
models to focus on relevant parts of the image during the SR process. This
improves the overall efficiency and performance of the model.
6. Large-Scale Parallelization:
– Deep learning frameworks support parallel processing, enabling the training
of large and complex models on powerful hardware, which is essential for
achieving state-of-the-art results in image super-resolution.
N-N interpolation is the simplest technique: each pixel in the high-resolution
image is assigned the value of the nearest pixel in the low-resolution image. This
method is fast but often leads to blocky artifacts. Using nearest-neighbor interpolation,
a small low-resolution image can be upscaled to an 8×8 image by assigning each pixel
in the HR image the value of its nearest neighbor from the low-resolution image [9].
The resulting 8×8 image after applying nearest-neighbor interpolation is shown in Fig. 2.
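The nearest-neighbor rule just described can be sketched in a few lines of NumPy (the 4×4 input values below are made up for illustration; for an integer factor, repeating rows and columns is equivalent to copying the nearest LR pixel):

```python
import numpy as np

def nearest_neighbor_upscale(lr, factor):
    """Upscale a 2-D image by an integer factor using nearest-neighbor
    interpolation: each HR pixel copies its nearest LR pixel."""
    return np.repeat(np.repeat(lr, factor, axis=0), factor, axis=1)

lr = np.arange(16).reshape(4, 4)        # a tiny 4x4 "low-resolution" image
hr = nearest_neighbor_upscale(lr, 2)    # 8x8 result with 2x2 blocky patches
```

The blocky artifacts mentioned above are visible in the output: every LR pixel becomes a constant 2×2 patch.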
2.2 Datasets
In the field of image SR, researchers use various datasets to train and evaluate
their network models. In a review of various articles, 11 datasets were identified
as commonly used for these purposes (Ref: Table 1).
T91 Dataset: The T91 dataset contains 91 images. It comprises diverse content
such as cars, flowers, fruits, and human faces. Algorithms like SRCNN, FSRCNN,
VDSR, DRCN, GLRL, DRDN, and FGLRL utilized T91 as their training
dataset.
Berkeley Segmentation Dataset 200 (BSDS200): Due to the limited number
of images in T91, researchers supplemented their training by including BSDS200,
which consists of 200 images showcasing animals, buildings, food, landscapes,
people, and plants. Algorithms like VDSR, DRRN, GLRL, DRDN, and FGLRL
included it in their training data.

The initial phase, patch extraction, involved capturing information from the
bicubic-interpolated image. This image information was then channeled into the
subsequent stage, non-linear mapping. Within this stage, the high-dimensional
features underwent a transformation to correspond with other high-dimensional
features, effecting a comprehensive mapping process [12]. Finally, the output of
the last layer of the non-linear mapping phase underwent a convolutional process
to accomplish the reconstruction of the high-resolution (HR) image. This final stage
synthesized the refined features into the desired HR image, completing the SRCNN's
intricate process.
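The three-stage pipeline can be sketched as a stack of convolutions in NumPy. This is a toy illustration only: the filter counts and kernel sizes below are stand-ins, not SRCNN's trained weights, and the `conv2d` helper is a hypothetical name.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d(x, kernels, relu=True):
    """'Same'-padded 2-D convolution of a (C, H, W) input with
    (N, C, k, k) kernels, optionally followed by ReLU."""
    n, c, k, _ = kernels.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    win = sliding_window_view(xp, (k, k), axis=(1, 2))   # (C, H, W, k, k)
    out = np.einsum('chwij,ncij->nhw', win, kernels)
    return np.maximum(out, 0.0) if relu else out

rng = np.random.default_rng(1)
bicubic = rng.random((1, 8, 8))                   # stand-in for the interpolated image
f1 = rng.standard_normal((16, 1, 5, 5)) * 0.1     # stage 1: patch extraction
f2 = rng.standard_normal((16, 16, 1, 1)) * 0.1    # stage 2: non-linear mapping
f3 = rng.standard_normal((1, 16, 3, 3)) * 0.1     # stage 3: reconstruction (no ReLU)
hr = conv2d(conv2d(conv2d(bicubic, f1), f2), f3, relu=False)
```

With random weights the output is meaningless, but the shapes and data flow mirror the patch extraction, mapping, and reconstruction stages described above.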
SRCNN and sparse coding-based methods share similar fundamental operations in
their image super-resolution processes. However, a notable distinction arises in their
approach. While SRCNN empowers optimization of filters through an end-to-end
mapping process, sparse coding-based methods restrict such optimization to specific
operations. Furthermore, SRCNN boasts an advantageous flexibility: it permits the
utilization of diverse filter sizes within the non-linear mapping step, enhancing the
information integration process [13]. This adaptability contrasts with sparse coding-
based methods, which lack such flexibility. As a result of these disparities, SRCNN
achieves a higher PSNR (Peak Signal-to-Noise Ratio) value compared to sparse
coding-based methods, indicating its superior performance in image super-resolution
tasks (Ref: Algorithm 1).
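PSNR, the metric used for the comparison above, can be computed as follows (a minimal sketch; the two 8×8 arrays are synthetic test data, not results from the chapter):

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between a reference HR image and a
    super-resolved estimate; higher values mean a closer reconstruction."""
    mse = np.mean((reference.astype(np.float64) - estimate.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')
    return 10.0 * np.log10(peak ** 2 / mse)

ref = np.full((8, 8), 100.0)
noisy = ref + 10.0    # constant error of 10 gray levels -> MSE = 100
print(round(psnr(ref, noisy), 2))   # 10*log10(255^2 / 100) = 28.13
```

A perfect reconstruction has infinite PSNR; typical SR methods on benchmark datasets score roughly in the 25 to 40 dB range, which is why a higher PSNR for SRCNN indicates superior reconstruction.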
114 S. R. Mishra et al.
The architecture illustrated in Fig. 5 was conceptualized [28] to address the challenge
encountered in SRCNN, where an increasing number of mapping layers was imper-
ative for enhanced model performance. Deep learning SR innovatively introduced
the concept of residual learning, a mechanism that bridged the gap between input
and output within the final feature mapping layer. Residual learning was achieved by
integrating the output features from the ultimate layer with the interpolated features.
Given the strong correlation between low-level and high-level features, this skip
connection facilitated the fusion of low-level layer attributes with high-level features,
subsequently elevating model performance. This strategy proved particularly effec-
tive in mitigating the vanishing gradients issue that emerges when the model’s layer
count grows. The incorporation of residual learning in Deep learning SR offered
dual advantages compared to SRCNN. Firstly, it expedited convergence due to the
substantial correlation between LR and HR images. As a result, Deep learning SR
accomplished quicker convergence, slashing running times by an impressive 93.9%
when compared to the original SRCNN model. Secondly, Deep learning SR yielded
superior PSNR values in comparison to SRCNN, affirming its prowess in image
enhancement tasks [15].
Deep-recursive CNN for SR, introduced as the pioneer algorithm to employ a recur-
sive approach for image super-resolution, brought a novel perspective to the field,
as illustrated in Fig. 6. It comprised three principal components: the embedding, infer-
ence, and reconstruction nets. The embedding net's role was to extract relevant features
from the interpolated image. These extracted features then traversed the inference net,
notable for its unique characteristic of sharing weights across all filters. Within the
inference net, the outputs of intermediate convolutional layers and the interpolated
features underwent convolution before their summation generated a high-resolution
(HR) image. The distinctive advantage of DRCN lay in its capacity to address the
challenge encountered in SRCNN, where achieving superior performance necessi-
tated a high number of mapping layers. By embracing a recursive strategy, Deep-
recursive CNN harnessed shared weights. Furthermore, the amalgamation of interme-
diate outputs from the inference net brought substantial enhancement to the model’s
performance. Incorporating residual learning principles into the network further
improved its performance.
The previous section examined various algorithms that primarily relied on stacking
convolutional layers sequentially. However, this approach resulted in increased
runtime and memory complexity. To address this concern, a dual-branch image super-
resolution algorithm, named Dual-Branch CNN (DBCN), was introduced.
Here, the features of the images generated by various algorithms were carefully
observed. A comparison between SRCNN and the bicubic interpolation method
revealed that images produced through interpolation appeared blurry, lacking clear
details in contrast to the sharpness achieved by SRCNN. Comparing the outputs
of FSRCNN with those of SRCNN, there appeared to be minimal discrepancy.
However, FSRCNN exhibited superior processing speed [19]. The incorpo-
ration of residual learning within VDSR substantially improved image texture,
surpassing that achieved by SRCNN. Models benefiting from enhanced learning
through residual mechanisms displayed notable enhancement in image texture.
DRCN, which harnessed both recursive and residual learning, yielded images with
more defined edges and patterns, markedly crisper than the slightly blurred edges
produced by SRCNN. CRN further improved upon this aspect, delivering even
sharper edges than DRCN. On the other hand, GLRL generated significantly clearer
images compared to DRCN, albeit with a somewhat compromised texture.
Images generated by CRN exhibited superior texture compared to DRCN, while
SRDenseNet managed to reconstruct images with improved texture patterns, effec-
tively mitigating distortions that proved challenging for DRCN, VDSR, and SRCNN
to overcome. Noteworthy improvements were observed in images produced by
DBCN, showcasing a superior restoration of collar texture without introducing addi-
tional artifacts. This achievement translated to a more visually appealing outcome
than what was observed with CRN. DBCN demonstrated an enhanced capacity to
restore edges and textures, surpassing the capabilities of SRCNN in this domain.
Figures 8, 9, 10, and 11 provide a comprehensive summary of the quantitative
outcomes achieved by the respective algorithms developed by the authors (Table 2).
Diverse approaches have been taken by numerous researchers to enhance the perfor-
mance of image super-resolution models. Table 2 shows different key features of
network design strategies among the various designs discovered above. At its core, the
linear network was the foundational design, depicted in Fig. 12. This design concept
drew inspiration from the residual neural network (ResNet), widely utilized for
object recognition in images. The linear network technique was employed by models
such as SRCNN, FSRCNN, and ESPCN. Although these three models employed a
similar design approach, there were differences in their internal architectures and up-
sampling methods. For instance, SRCNN exclusively consisted of feature extraction,
non-linear mapping, and reconstruction stages.
6 Conclusion
References
1. Dong, C., Loy, C.C., He, K., Tang, X.: Learning a deep convolutional network for image
super-resolution. In: European Conference on Computer Vision, pp. 184–199. Springer, Cham,
Switzerland (2014)
2. Dong, C., Loy, C.C., Tang, X.: Accelerating the super-resolution convolutional neural network.
In: European Conference on Computer Vision, pp. 391–407. Springer, Cham, Switzerland
(2016)
3. Kim, J., Lee, J.K., Lee, K.M.: Accurate image super-resolution using very deep convolu-
tional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 1646–1654. Las Vegas, NV, USA, 27–30 June 2016
4. Lim, B., Son, S., Kim, H., Nah, S., Lee, K.M.: Enhanced deep residual networks for single
image super-resolution. In: Proceedings of the 2017 IEEE Conference on Computer Vision and
Pattern Recognition Workshops (CVPRW), pp. 1132–1140. Honolulu, HI, USA, 21–26 July
2017
5. Chu, J., Zhang, J., Lu, W., Huang, X.: A novel multiconnected convolutional network for
super-resolution. IEEE Signal Process. Lett. 25, 946–950 (2018)
6. Lan, R., Sun, L., Liu, Z., Lu, H., Su, Z., Pang, C., Luo, X.: Cascading and enhanced residual
networks for accurate single-image super-resolution. IEEE Trans. Cybern. 51, 115–125 (2021)
7. Kim, J., Lee, J.K., Lee, K.M.: Deeply-recursive convolutional network for image super-
resolution. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 1637–1645. Las Vegas, NV, USA, 27–30 June 2016
8. Hou, J., Si, Y., Li, L.: Image super-resolution reconstruction method based on global and local
residual learning. In: Proceedings of the 2019 IEEE 4th International Conference on Image,
Vision and Computing (ICIVC), pp. 341–348. Xiamen, China, 5–7 July 2019
9. Gao, X., Zhang, L., Mou, X.: Single image super-resolution using dual-branch convolutional
neural network. IEEE Access 7, 15767–15778 (2019)
10. Ren, S., Jain, D.K., Guo, K., Xu, T., Chi, T.: Towards efficient medical lesion image super-
resolution based on deep residual networks. Signal Process. Image Communication.
11. Zhao, X., Zhang, Y., Zhang, T., Zou, X.: Channel splitting network for single MR image
super-resolution. IEEE Trans. Image Process. 28, 5649–5662 (2019)
12. Rasti, P., Uiboupin, T., Escalera, S., Anbarjafari, G.: Convolutional neural network super reso-
lution for face recognition in surveillance monitoring. In: Articulated Motion and Deformable
Objects, pp. 175–184. Springer, Cham, Switzerland (2016)
13. Deshmukh, A.B., Rani, N.U.: Face video super resolution using deep convolutional neural
network. In: Proceedings of the 2019 5th International Conference on Computing, Commu-
nication, Control and Automation (ICCUBEA), pp. 1–6. Pune, India, 19–21 September
2019
14. Shen, Z., Xu, Y., Lu, G.: CNN-based high-resolution fingerprint image enhancement for pore
detection and matching. In: Proceedings of the 2019 IEEE Symposium Series on Computational
Intelligence (SSCI), pp. 426–432. Xiamen, China, 6–9 December 2019
15. Chatterjee, P., Milanfar, P.: Clustering-based denoising with locally learned dictionaries. IEEE
Trans. Image Process. 18(7), 1438–1451 (2009)
16. Xu, X.L., Li, W., Ling.: Low Resolution face recognition in surveillance systems. J. Comp.
Commun. 02, 70–77 (2014). https://doi.org/10.4236/jcc.2014.22013
17. Li, Y., Qi, F., Wan, Y.: Improvements on bicubic image interpolation. In: 2019 IEEE 4th
Advanced Information Technology, Electronic and Automation Control Conference (IAEAC).
Vol. 1. IEEE (2019)
18. Kim, T., Park, S.I., Shin, S.Y.: Rhythmic-motion synthesis based on motion-beat analysis.
ACM Trans. Graph. 22(3), 392–401 (2003)
19. Xu, Z., et al.: Evaluating the capability of satellite hyperspectral imager, the ZY1–02D, for
topsoil nitrogen content estimation and mapping of farm lands in black soil area, China.
Remote Sens. 14(4), 1008 (2022)
20. Mishra, S.R., et al.: Real time human action recognition using triggered frame extraction and
a typical CNN heuristic. Pattern Recogn. Lett. 135, 329–336 (2020)
21. Mishra, S.R., et al.: PSO based combined kernel learning framework for recognition of first-
person activity in a video. Evol. Intell. 14, 273–279 (2021)
Applying Data Analytics and Time Series
Forecasting for Thorough Ethereum
Price Prediction
Asha Rani Mishra, Rajat Kumar Rathore, and Sansar Singh Chauhan
Abstract Finance has been combined with technology to introduce newer advances
and facilities in the domain. One such technological advance is cryptocurrency, which
works on blockchain technology and has proved to be a new topic of research
for computer science. However, these currencies are volatile in nature, and their
forecasting can be challenging, as there are dozens of cryptocurrencies in use
all around the world. This chapter uses a time series-based forecasting model
for the prediction of the future price of Ethereum, since it handles both logistic
growth and piece-wise linearity of data. The model does not depend on the
seasonality contained in past or historical data, making it suitable
for real use cases after seasonal fitting using a Naïve model, time series analysis, and
the Facebook Prophet module (FBProphet). The FBProphet model achieves better accuracy
than the other models. This chapter aims to build a better statistical model
with Exploratory Data Analysis (EDA) on the basis of several trends from 2016
to 2020. Analysis carried out in the chapter can help in understanding various trends
related to Ethereum price prediction.
1 Introduction
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 127
P. Singh et al. (eds.), Data Analytics and Machine Learning, Studies in Big Data 145,
https://doi.org/10.1007/978-981-97-0448-4_7
128 A. R. Mishra et al.
2 Related Work
Hitam et al. in [1] focus on six different types of cryptocurrency coins available
in the market by collecting historical data of 5 years. They trained four different
models on the dataset and checked the performance of each classifier: SVM,
boosted NN, ANN, and deep learning, which gave
accuracies of 95.5%, 81.2%, 79.4%, and 61.9%, respectively. Velankar et al.
in [2] use two different approaches for predicting prices: GLM/random forest and a
Bayesian regression model. The Bayesian regression approach divides the data into
180 s, 360 s, and 720 s intervals, uses k-means clustering to narrow down effec-
tive clusters, and then calculates the corresponding weights from the data.
In the GLM/random forest model, the authors distribute
data into 30, 60, and 120 min time series datasets. The issue is addressed by Chen
et al. in [3] by splitting the forecasting sample into intervals of five minutes with
a large sample size and daily intervals with a small sample size. Lazo et al. in [4]
build two decision trees based on datasets of two currencies, Bitcoin and Ripple,
where the week-1 decision tree model gives the best decisions for selling the coins 1 week
after purchase and the week-2 decision tree model gives the best investment advice on
the coins giving the highest gains. Derbentsev et al. in [5] perform a comparative
predictive analysis of cryptocurrency prices using machine learning ensemble algorithms.
3 Research Methodology
Data analytics along with time series modeling can be used in predicting the price of
Ethereum by utilizing the information about trends and patterns in the past. It helps
to find correlated features for prediction models. Using sentiment analysis based
on social trends can help to identify factors which can influence the price. Devel-
oping precise forecasting models can benefit from the examination of long-term,
seasonal, and cyclical trends. Finding pertinent features for models that forecast
time series is facilitated by data analytics. The prediction capacity of the model
can be increased by combining sentiment analysis features, on-chain analytics, and
technical indicators. In summary, the combination of data analytics and time series fore-
casting is powerful for Ethereum price prediction. It enables investors and analysts to
make more informed decisions by leveraging historical data, market sentiment, and
advanced modeling techniques. However, it is essential to acknowledge the inherent
uncertainty in cryptocurrency markets and continuously refine models to adapt to
dynamic conditions.
It is difficult to predict or comment on cryptocurrencies because of their price
volatility and visibly dynamic fluctuations. Existing work has identified different pros
and cons of time series algorithms like ARIMA and LSTM. ARIMA, for instance, cannot
handle seasonality in the data or the dependency between data points.
This existing problem can be reduced or eliminated by comparing a few machine
learning techniques for analyzing market movements. This chapter presents a
methodology that is able to predict prices of the cryptocurrency Ethereum
by using machine learning algorithms in a hybrid fashion. The data is
smoothed, enhanced, and prepared before the Facebook Prophet model is finally
applied to it. Facebook Prophet handles the drawbacks of algorithms previously used
for such predictions, such as dynamic behavior, seasonality, and holidays [16].
Figure 1 depicts the work-flow used in the chapter. Owing to the non-stationarity
and noise present in the data, smoothing and feature engineering were performed
during the process. At each step, graph visualization was required to analyze the
results and decide the next steps.
(i). Fetching of raw data, which can be done from third-party APIs, web scraping,
etc.
(ii). After data cleaning on the raw data, exploratory data analysis (EDA) is
conducted to understand the behavior of the data.
(iii). On this data, a naive (baseline) model, an autoregressive
model, or a moving-average model is implemented.
(iv). After data cleaning, the next step is feature engineering to check whether the
data is stationary. For this, a statistical test is done using a line plot and the
Augmented Dickey–Fuller (adfuller) test. It is essential to check the stationarity
of data in time series analysis, since it strongly affects the interpretation of the data.
Numerous statistical models are built on the assumption that there is no dependence
between the various points used for prediction. To fit a stationary model to the time
series data to be analyzed, one should check for stationarity and remove
the trend/seasonality effect from the data. The statistical properties must remain constant
over time. This does not mean that all data points should be the same; rather, the data's
general behavior should be consistent. Time graphs that are constant at a strictly
visual level are considered stationary. Stationarity also means consistency of the mean
and variance with respect to time.
In the data preprocessing step, since the time series-based Facebook Prophet model is
used, the date feature, which arrives as an object type, must carry a timestamp. Therefore,
convert it first to date/time format before sorting the data by date. Exploratory Data
Analysis (EDA) must be done on the data sample, as shown in Fig. 2, because the ultimate
purpose is to forecast the final selling price of Ethereum using the 'Close' feature shown
in Fig. 3.
The mean or average closing price can be found using the mean function, and the values
should be plotted by date on a weekly and yearly basis, as shown in Figs. 4
and 5, respectively.
The graph in Fig. 6 shows the trend of prices over yearly, weekly, and monthly
time periods using the mean function. In Fig. 7, the average weekly closing price is
analyzed by taking the mean of the 'Close' values per week.
Mean closing price per day is analyzed and plotted in Fig. 8. In the same manner,
the average closing prices on a quarterly basis are analyzed and plotted as shown in
Fig. 9.
The trend that closing prices follow on weekdays and weekends was also analyzed
and plotted in Fig. 10, which shows minor differences between the two graphs.
Using this data, a baseline or Naïve model is used for prediction, as shown in
Fig. 11. In a Naïve model, every data point depends on the previous data point,
as shown in Fig. 12.
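The Naïve baseline can be sketched with pandas: each point is predicted by the previous observation, i.e. the forecast is the series shifted by one step. The closing prices below are hypothetical, not from the chapter's dataset.

```python
import pandas as pd

# Naive (baseline) forecast: y_hat[t] = y[t-1].
close = pd.Series([210.0, 215.5, 214.0, 220.2, 225.9],
                  index=pd.date_range('2020-01-01', periods=5))
naive_forecast = close.shift(1)                  # first value becomes NaN
error = (close - naive_forecast).abs().mean()    # MAE of the baseline
```

Any serious model should beat this baseline error; it is the natural reference point for the FBProphet results later in the chapter.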
The next steps included determining whether or not the data had seasonality. Season-
ality is the existence of fluctuations or changes that happen frequently, such as once
a week, once a month, or once every three months in data. Seasonality is the peri-
odic, repeating, often regular, and predictable trends in the levels of a time series that
may be attributed to a variety of factors, including weather, vacations, and holidays.
The seasonality of the curve is removed by applying rolling or moving average of a
window period of 7 on the data. Mean and Standard Deviation are shown in Fig. 13.
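The rolling statistics described above can be sketched with pandas (the price series below is synthetic; a roughly constant rolling mean and standard deviation would suggest stationarity):

```python
import numpy as np
import pandas as pd

# Rolling mean and standard deviation with a 7-day window, as used to
# smooth out weekly seasonality in the closing-price series.
rng = np.random.default_rng(7)
close = pd.Series(200 + rng.standard_normal(30).cumsum(),
                  index=pd.date_range('2020-01-01', periods=30))
rolling_mean = close.rolling(window=7).mean()   # first 6 values are NaN
rolling_std = close.rolling(window=7).std()
```

Plotting `close`, `rolling_mean`, and `rolling_std` together reproduces the kind of visual stationarity check shown in Fig. 13.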
The blue line, which now overlaps the green curve in the graph in Fig. 13,
represents the mean values, while the orange line reflects the given series itself.
This leads to the conclusion that the rolling mean is not constant and undergoes
temporal variation. The series must therefore be de-seasonalized into a stable,
stationary state.
‘adfuller’ is easily imported from the statsmodels package and applied to the
‘Close’ data. This results in a p-value of 0.0002154535155876224.
The null hypothesis is rejected, as the calculated p-value is less than 0.05, so the data
is considered stationary. The log transformation is often used to reduce the skewness
of a measurement variable, using the log functions as shown in Fig. 14.
The data are smoothed using the moving average, a technique used frequently in the
financial market. The impact of the rolling window and log transformations is shown in
Fig. 15.
Fig. 14 Removal of seasonality factor
Figure 16 shows that the null hypothesis is rejected and, as a result, the data comes out
to be stationary. The time series is roughly stationary with a constant interval. Shift is
used to apply differencing to remove the remaining seasonal tendency, as seen in Fig. 17.
Other seasonal adjustment results have been shown in Fig. 18.
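Differencing with `shift` works as follows (a minimal pandas sketch with made-up values; subtracting a one-step shift removes trend, while a seven-step shift removes weekly seasonality):

```python
import pandas as pd

# Differencing via shift: each value is compared with an earlier one.
close = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0, 18.0, 17.0, 21.0])
first_diff = close - close.shift(1)       # trend removal (lag 1)
seasonal_diff = close - close.shift(7)    # weekly seasonal adjustment (lag 7)
```

The first `lag` values of each differenced series are NaN and are dropped before re-running the stationarity test.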
As a result, it may be inferred that the Dicky Fuller Test has an essential value
less than 1%.
The algorithm used here for prediction is FBProphet. The FBProphet machine learning
algorithm employs a decomposable time series model that consists of three
key components: trend, seasonality, and holidays. They are combined in Eq. 1:

y(t) = g(t) + s(t) + h(t) + εt (1)
g(t): a piece-wise linear or logistic growth curve that models non-periodic
variations in the time series.
s(t): periodic changes (e.g., weekly or yearly seasonality).
h(t): effects of holidays and irregular schedules.
εt: an error term accounting for any unusual changes not accommodated by
the model.
The ‘Fbprophet’ library provides the Prophet model specifically. It handles irregular
time intervals and irregular holidays, as well as circumstances where there is noise
or there are outliers in the data. ‘Fbprophet’ is a module that helps forecast time
series data following non-linear patterns, since the data has seasonality on a yearly,
weekly, and daily scale along with the effects of holidays. The results are best when
the model is trained on past data spanning several seasons and on time series with
considerable seasonal influences. The data must be prepared in accordance with the
Prophet model documentation prior to fitting, ensuring that all data complies with
its conventions: the output feature is named ‘y’ and the date column ‘ds’. The model
is fitted at daily frequency over a ‘500-day’ span.
‘yhat’ gives the actual forecast, while ‘yhat_upper’ and ‘yhat_lower’ give the upper-
bound and lower-bound predictions, respectively, as shown in Fig. 19. To plot this
forecast, the Fbprophet library’s built-in plotting functionality is used, as shown in
Fig. 20.
The black dots in the curve represent the actual values or prices, whereas the
blue line shows the prediction curve. The light blue band depicts the trend using data
on weekly, annual, and monthly scales, as shown in Fig. 21.
The forecast model must now be cross-validated to calculate the forecast error.
Actual values and projected values are compared to compute this error.
In order to evaluate the FBProphet model, four types of error measures can be used,
i.e., Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Root Relative
Squared Error (RRSE), and Mean Absolute Percentage Error (MAPE), shown in
Table 1.
Here, ẑk = predicted value for the kth sample,
zk = actual value of z,
z̄ = average value of z, and
N = total number of test samples.
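The four error measures can be implemented directly from their definitions (a minimal NumPy sketch; the actual/predicted arrays below are made-up test values, not the chapter's results):

```python
import numpy as np

def mae(z, zhat):
    """Mean Absolute Error."""
    return np.mean(np.abs(z - zhat))

def rmse(z, zhat):
    """Root Mean Square Error."""
    return np.sqrt(np.mean((z - zhat) ** 2))

def rrse(z, zhat):
    """Root Relative Squared Error: error relative to predicting the mean."""
    return np.sqrt(np.sum((z - zhat) ** 2) / np.sum((z - z.mean()) ** 2))

def mape(z, zhat):
    """Mean Absolute Percentage Error (requires z != 0)."""
    return np.mean(np.abs((z - zhat) / z)) * 100.0

z = np.array([100.0, 200.0, 300.0, 400.0])      # actual values
zhat = np.array([110.0, 190.0, 310.0, 390.0])   # predicted values
```

MAE and RMSE are in price units, MAPE is a percentage, and RRSE is dimensionless (below 1 means the model beats the trivial mean predictor).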
The FBProphet algorithm used for trend analysis is a full-fledged and reliable
algorithm, which gave an accuracy of approximately 94.5-96.6%.
From the experimental results it has been observed that the RMSE falls in
the range 0-100, at about 5.56%. The majority of the values lie between 0 and
80, which indicates that the model has about a 4.44% error.
Predictive techniques and models are usually created from old data, but a reliable
present-day predictive model is difficult to build based only on previous data. In
earlier work, logistic regression achieved a 66% accuracy score and linear discriminant
analysis 65.3%; among other previous models on the prices of coins like BTC and LTC, a
multi-linear regression model gives an R² score of 44% for LTC and 59% accuracy for
BTC. Most problems are solved by building models on historical data, but in the case
of cryptocurrencies, future results cannot be predicted from a historical data model
alone. There may be seasonality or other problems in prior data that affect a model's
ability to accurately predict patterns. Under cross-validation, the FBProphet model
used here was able to achieve around 97% accuracy in forecasting the future Ethereum
price. Even when seasonal data was present, the overall gap between predicted and
actual values was small compared to other models. Further, to improve the model's
accuracy and make it reliable on present data, a suggestion tool for other external
factors that may affect Ethereum market prices, such as social media, tweets, and
trading volume, might be added.
References
1. Hitam, N.A., Ismail, A.R.: Comparative performance of machine learning algorithms for
cryptocurrency forecasting. Indones. J. Electr. Eng. Comput. Sci. 11, 1121–1128 (2018).
https://www.ije.ir/article_122162.html
2. Velankar, S., Valecha, S., Maji, S.: Bitcoin price prediction using machine learning. In: 2018
20th International Conference on Advanced Communication Technology (ICACT), pp. 144–
147. IEEE (2018)
3. Chen, Z., Li, C., Sun, W.: Bitcoin price prediction using machine learning: an approach to
sample dimension engineering. J. Comput. Appl. Math. 365, 112395 (2019). https://www.
sciencedirect.com/science/article/abs/pii/S037704271930398X
4. Lazo, J.G.L., Medina, G.H.H., Guevara, A.V., Talavera, A., Otero, A.N., Cordova E.A.: Support
system to investment management in cryptocurrencies. In: Proceedings of the 2019 7th Inter-
national Engineering, Sciences and Technology Conference, IESTEC, pp. 376–381. Panama
(9–11 October 2019)
5. Derbentsev, V., Babenko, V., Khrustalev, K., Obruch, H., Khrustalova, S.: Comparative perfor-
mance of machine learning ensemble algorithms for forecasting cryptocurrency prices. Int. J.
Eng. Trans. A Basics. 34, 140–148 (2021)
6. Yiying, W., Yeze, Z.: Cryptocurrency price analysis with artificial intelligence. In: 2019
5th International Conference on Information Management (ICIM), pp. 97–101. IEEE (2019,
March). https://doi.org/10.1109/INFOMAN.2019.8714700
7. Livieris, I.E., Pintelas, E., Stavroyiannis, S., Pintelas, P.: Ensemble deep learning models for
forecasting cryptocurrency time-series. Algorithms 13(5), 121 (2020). https://doi.org/10.3390/
a13050121
8. Basak, S., Kar, S., Saha, S., Khaidem, L., Dey, S.R.: Predicting the direction of stock market
prices using tree-based classifiers. North Am. J. Econ. Finance 47, 552–567 (2019). https://
doi.org/10.1016/j.najef.2018.06.013
9. Poongodi, M., Sharma, A., Vijayakumar, V., Bhardwaj, V., Sharma, A. P., Iqbal, R., Kumar,
R: Prediction of the price of Ethereum blockchain cryptocurrency in an industrial finance
system. Comput. Electr. Eng. 81, 106527 (2020). https://doi.org/10.1016/j.compeleceng.2019.
106527
10. Azeez, A.O., Anuoluwapo, O.A., Lukumon, O.O., Sururah, A.B., Kudirat, O.J.: Performance
evaluation of deep learning and boosted trees for cryptocurrency closing price prediction.
Expert Syst. Appl. 213, Part C, 119233, ISSN 0957-4174 (2023)
11. Aggarwal, A., Gupta, I., Garg, N., & Goel, A.: Deep learning approach to determine the
impact of socio economic factors on bitcoin price prediction. In: 2019 Twelfth International
Conference on Contemporary Computing (IC3), pp. 1–5. IEEE (2019, August). https://doi.org/
10.1109/IC3.2019.8844928
12. Phaladisailoed, T., Numnonda, T.: Machine learning models comparison for bitcoin price
prediction. In: 2018 10th International Conference on Information Technology and Electrical
Engineering (ICITEE), pp. 506–511. IEEE (2018). https://doi.org/10.1109/ICITEED.2018.853
4911
13. Carbó, J.M., Gorjón, S.: Application of machine learning models and interpretability techniques
to identify the determinants of the price of bitcoin (2022)
14. Pierdzioch, C., Risse, M., Rohloff, S.: A quantile-boosting approach to forecasting gold returns.
North Am. J. Econ. Finance 35, 38–55 (2016). https://doi.org/10.1016/j.najef.2015.10.015
15. Sadorsky, P.: Predicting gold and silver price direction using tree-based classifiers. J. risk financ.
manag. 14(5), 198 (2021). https://doi.org/10.3390/jrfm14050198
16. Mishra, A.R., Pippal, S.K., Chopra, S.: Time series based pattern prediction using FbProphet
algorithm for COVID-19. J. East China Univ. Sci. Technol. 65(4), 559–570 (2022)
17. Samin-Al-Wasee, M., Kundu, P.S., Mahzabeen, I., Tamim, T., Alam, G.R.: Time-Series Fore-
casting of Ethereum Price Using Long Short-Term Memory (LSTM) Networks. In: 2022 Inter-
national Conference on Engineering and Emerging Technologies (ICEET), pp. 1–6. IEEE
(2022, October). https://doi.org/10.1109/ICEET56468.2022.10007377
18. Sharma, P., Pramila, R.M.: Price prediction of Ethereum using time series and deep learning
techniques. In: Proceedings of Emerging Trends and Technologies on Intelligent Systems:
ETTIS 2022, pp. 401–413. Singapore: Springer Nature Singapore (2020). https://doi.org/10.
1007/978-981-19-4182-5_32
Practical Implementation of Machine
Learning Techniques and Data Analytics
Using R
Abstract In this digital era, all E-commerce activities are based on modern
recommendation systems, in which a company analyses the buying patterns of its
customers to optimize its sales strategies. This mainly involves focusing more
on valuable customers, identified by the amount of purchases a customer makes,
rather than the traditional way of recommending a product. In modern recommen-
dation systems, different parameters are synthesized to design efficient recom-
mendation systems. In this paper, the data of 325 customers who made certain
purchases from a website is considered, with naive parameters such as age, job type,
education, metro city, time since signing up with the company, and purchase history.
The E-commerce business model's profit making depends primarily on choice-based
recommendation systems. Hence, this paper uses a predictive model based on a
machine learning linear regression algorithm. The study is done using a popular
statistical tool, R programming. The R tool is explored, and its utility for
recommendation-system design and for finding insights from data is demonstrated
through various plots. The results are formulated and presented in a formal and
structured way using the R tool. During this study it has been observed that the
R tool has the potential to be one of the leading tools for research and business analytics.
N. Chandela
Computer Science and Engineering, Krishna Engineering College,
Uttar Pradesh, Ghaziabad, India
e-mail: falsenehachandela99@gmail.com
K. K. Raghuwanshi (B)
Computer Science Department, Ramanujan College, Delhi University, New Delhi, India
e-mail: kamlesh@ramanujan.du.ac.in
H. Tyagi
University School of Automation and Robotics, GGSIPU, New Delhi, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 147
P. Singh et al. (eds.), Data Analytics and Machine Learning, Studies in Big Data 145,
https://doi.org/10.1007/978-981-97-0448-4_8
1 Introduction
R programming
Data science has taken over the whole world today. Every sector of study and industry
has been impacted as individuals increasingly recognise the usefulness of the massive
amounts of data being generated. However, in order to extract value from those data,
one must be skilled in data science abilities [4]. The R programming language has
emerged as the de facto data science programming language. The adaptability, power,
sophistication, and expressiveness of R have made it an indispensable tool for data
scientists worldwide [5].
Features of R programming
1. R’s syntax is quite similar to S’s, making it easy for S-PLUS users to transfer over.
While R’s syntax is essentially identical to S’s, R’s semantics, while outwardly
similar to S’s, are significantly different. In reality, when it comes to how R
operates under the hood, R is far closer to the Scheme language than it is to the
original S language [6].
2. R now operates on nearly every standard computing platform and operating
system. Because it is open source, anyone can modify it to run on whatever
platform they like. R has been claimed to run on current tablets, smartphones,
PDAs, and game consoles [7].
3. R shares a great feature with many famous open source projects: regular releases.
Nowadays, there is a big annual release, usually in October, in which substantial
new features are included and made available to the public. Smaller-scale bugfix
releases will be produced as needed throughout the year. The frequent releases and
regular release cycle show active software development and ensure that defects
are resolved in a timely way. Of course, while the core developers maintain the
primary source tree for R, many individuals from all around the world contribute
new features, bug fixes, or both.
4. R also offers extensive graphical features, which set it apart from many
other statistical tools (even today). R’s capacity to generate “publication qual-
ity” graphics has existed since its inception and has generally outperformed
competing tools. That tendency continues today, with many more visualisa-
tion packages available than ever before. R’s base graphics framework gives
you complete control over almost every component of a plot or graph. Other
more recent graphics tools, such as lattice and ggplot2, enable elaborate and
sophisticated visualisations of high-dimensional data [8].
5. R has kept the original S idea of providing a language that is suitable for both
interactive work and incorporates a sophisticated programming language for
developing new tools. This allows the user to gradually progress from being a
user who applies current tools to data to becoming a developer who creates new
tools.
6. Finally, one of the pleasures of using R is not the language itself, but rather the
active and dynamic user community. A language is successful in many ways
because it provides a platform for many people to build new things. R is that
platform, and thousands of people from all over the world have banded together
to contribute to R, create packages, and help each other utilise R for a wide range
of applications. For almost a decade, the R-help and R-devel mailing lists have
been very active, and there is also a lot of activity on sites like Stack Overflow
[9].
Free Software
Over many other statistical tools R has a significant advantage that it is free in
the sense of free software (and also free in the sense of free beer). The R Foundation
owns the primary source code for R, which is released under the GNU General Public
Licence version 2.0.
According to the Free Software Foundation, using free software grants you the
four freedoms listed below.
1. The ability to run the programme for any reason (freedom 0).
2. The ability to learn how the programme works and tailor it to your specific
requirements (freedom 1). Access to the source code is required for this [10].
3. The ability to redistribute copies in order to assist a neighbour (freedom 2).
4. The ability to develop the programme and make your innovations available to
the public so that the entire community benefits (freedom 3) [11, 12].
Limitations of R
1. There is no such thing as a flawless programming language or statistical analysis
system. R has a variety of disadvantages. To begin, R is built on nearly 50-year-
old technology, dating back to the original S system developed at Bell Labs.
Initially, there was little built-in support for dynamic or 3-D graphics (but things
have substantially changed since the “old days”) [13–15].
2. One “limitation” of R at a higher level is that its usefulness is dependent on
consumer demand and (voluntary) user contributions. If no one wants to implement
your preferred approach, it is your responsibility to do it (or pay someone to
do so). The R system’s capabilities largely mirror the interests of the R user
community. As the community has grown in size over the last ten years, so have
the capabilities. When I first began using R, there was very limited capability for
the physical sciences (physics, astronomy, and so on). However, some of those
communities have now embraced R, and we are seeing more code created for
these types of applications [9, 16].
Linear Regression
To fill in the gaps, a linear regression approach might be utilised. As a refresher,
this is the linear regression formula:
Y = C + BX (1)
We all learnt the straight line equation in high school. The dependent variable is Y,
the slope is B, and the intercept is C. Traditionally, the formula for linear regression
is stated as:
h = θ0 + θ1X (2)
‘h’ is the hypothesis or projected value, X is the input feature, and the coefficients
are theta0 and theta1.
We will utilise the other ratings of the same movie as the input X in this recom-
mendation system and predict the missing values. The bias term theta0 will be
avoided.
h = θX (3)
Theta1 is initialised at random and refined over iterations, just like in the linear
regression technique.
We will train the algorithm with known values, much like in linear regression.
Consider a movie’s known ratings. Then, using the formula above, forecast those
known ratings [17, 18]. After predicting the ratings values, we compare them to the
original ratings to determine the error term. The error for one rating is shown below.
(θ^(j))^T x^(i) − y^(i,j) (4)
Similarly, we must determine the error for each rating. Before going any
further, I'd like to introduce the notation that will be used throughout this paper.
n_u = no. of users.
n_m = no. of movies.
r(i,j) = 1 if user j has rated movie i.
y^(i,j) = rating given by user j to movie i (defined only if r(i,j) = 1).
Here’s the formula for the total cost function, which will show the difference
between the expected and original ratings.
(1/2) Σ_{i:r(i,j)=1} ((θ^(j))^T x^(i) − y^(i,j))² + (λ/2) Σ_{k=1}^{n} (θ_k^(j))² (5)
The first term of this expression is the squared error term; we use the square to
avoid any negative values. The factor of 1/2 is used for convenience, and the error
is computed only where r(i, j) = 1, i.e., where the user actually provided a rating
[19].
The second term in the equation above is the regularisation term. It can be used
to address any overfitting or underfitting issue [13].
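The cost in Eq. (5) can be minimised by gradient descent. Below is a minimal R sketch on synthetic ratings; every name, size, and hyper-parameter here is an illustrative assumption, not the chapter's code or data.

```r
# Sketch: minimising the cost of Eq. (5) by gradient descent on synthetic
# ratings. All names and hyper-parameters are illustrative assumptions.
set.seed(1)
n_m <- 5; n_u <- 4                          # movies, users
x          <- matrix(runif(n_m), ncol = 1)  # one feature per movie
theta_true <- matrix(runif(n_u), ncol = 1)  # "true" per-user parameters
y <- x %*% t(theta_true) + matrix(rnorm(n_m * n_u, sd = 0.01), n_m, n_u)
r <- matrix(1, n_m, n_u)                    # r(i,j) = 1: all ratings observed
theta  <- matrix(0, n_u, 1)                 # parameters to learn
lambda <- 0.1; alpha <- 0.05
for (iter in 1:500) {
  pred  <- x %*% t(theta)                   # h = theta * x, Eq. (3)
  err   <- (pred - y) * r                   # error only where r(i,j) = 1
  grad  <- t(err) %*% x + lambda * theta    # gradient of Eq. (5)
  theta <- theta - alpha * grad             # iterative refinement
}
cost <- sum(((x %*% t(theta) - y) * r)^2) / 2 + lambda / 2 * sum(theta^2)
```

Here every rating is observed; with missing ratings, r would zero out the corresponding error terms exactly as in Eq. (5).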
summary(Data).
Using the summary command, we can check the mean, median, mode and
missing values for each variable. In this case, we have 13 missing observations
(NAs) for the variable age. Hence, before going ahead, we need to treat the missing
values first [20] (Fig. 3).
• Histogram to see how the data is skewed
hist(Data$Age).
We specifically check the data distribution to decide by which value we can
replace the missing observations.
In this case, since the data is somewhat normally distributed, we use mean to
replace the missing values (Fig. 4).
• Replacing the NA values for variable Age with mean 39
Data$Age[is.na(Data$Age)] = 39.
• Check if the missing values are replaced from the variable Age
summary(Data).
Here, we can see that the missing values (NAs) are replaced by the mean value of
39 (Fig. 5).
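The same imputation pattern can be reproduced on a toy data frame. This is a sketch with made-up values, not the chapter's 325-customer dataset; only the column name Age mirrors the text.

```r
# Toy data frame with missing ages (illustrative values only)
Data <- data.frame(Age = c(25, NA, 41, 39, NA, 52))
summary(Data$Age)                          # reports 2 NAs
m <- round(mean(Data$Age, na.rm = TRUE))   # mean of the observed ages
Data$Age[is.na(Data$Age)] <- m             # replace NAs with the mean
summary(Data$Age)                          # no NAs remain
```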
Since we have handled the missing values, let’s have a look at the data
head(Data).
After handling the missing values, we can see that there are categorical variables
such as Marital status, metro city, education which we need to convert in dummy
variables.
STEP 2: Creating New Variables
As seen in the data, four of our variables are categorical, which we need to create
as dummy variables first.
• Data$Job.type_employed <- as.numeric(Data$Job.Type == "Employed")
• Data$Job.type_retired <- as.numeric(Data$Job.Type == "Retired")
• Data$Job.type_unemplyed <- as.numeric(Data$Job.Type == "Unemployed")
• Data$Married_y <- as.numeric(Data$Marital.Status == "Yes")
• Data$Education_secondary <- as.numeric(Data$Education == "Secondry")
• Data$Education_gra <- as.numeric(Data$Education == "Graduate")
• Data$Metro_y <- as.numeric(Data$Metro.City == "Yes")
The commands above are used to create dummy variables [20]. You need to
create n − 1 dummy variables. For example, we have a categorical variable, Gender,
which has two levels: Male and Female. So you create 2 − 1 = 1 dummy variable,
where 2 is the number of levels. The second level is taken care of by the
intercept of the regression line.
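The n − 1 rule can be sketched on toy data (the Gender column below is hypothetical, not part of the chapter's dataset):

```r
# Sketch of n-1 dummy coding for a two-level factor (toy data)
Data <- data.frame(Gender = c("Male", "Female", "Female", "Male"))
Data$Gender_male <- as.numeric(Data$Gender == "Male")  # 2 levels -> 1 dummy
# model.matrix() applies the same n-1 coding automatically; the base level
# is absorbed by the intercept column:
mm <- model.matrix(~ Gender, data = Data)
colnames(mm)   # "(Intercept)" and one dummy column
```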
#Checking the dummy variables
• head(Data)
Here, the dummy variables have been created (Figs. 6 and 7).
• Removing the categorical columns (2, 3, 4, 5)
final_data <- Data[, -c(2,3,4,5)].
• let’s check our final data
head(final_data).
Fig. 14 A histogram
In this scatter plot, we can see a curvilinear relationship between the independent
variable Age and the dependent variable Purchase made.
We can see that, the age group up till 30 has a medium purchase, from age 30 to
55 the purchase is maximum and again the purchase lowers.
Sign.in.days vs Purchase made
• scatterplot(final_data$Signed.in.since.Days., final_data$Purchase.made)
We can see a positive linear relationship between the variable signed in since and
the variable purchase made (Fig. 17).
We can see a pattern that the old customers make a higher purchase.
STEP 5: Regression Analysis
Since we are done with the EDA, let's check the correlation (Fig. 18).
• cor(final_data)
Since not all the variables are below the VIF threshold of 5, we need to correct the
model; let's remove the Education_secondry variable first
• final_data2 <- lm(Purchase.made ~ Age + Signed.in.since.Days. +
Married_y + Job.type_retired + Job.type_unemplyed
+ Education_gra + Metro_y,data = final_data)
• vif(final_data2)
Since the VIF value is less than 5 for all the variables, we can consider all the
variables (Fig. 20).
Graduation was highly collinear with the other variables; let's verify once again
using the step function
step(final_data1).
Basically, the summary reveals all possible stepwise removals of one term from
your full model and compares the extracted AIC values, listing them in ascending
order, since a smaller AIC value is more likely to resemble the true model
(Fig. 21).
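The stepwise AIC comparison can be sketched with base R's step(). This is an illustration on mtcars, a stand-in dataset, since the chapter's customer data is not reproduced here.

```r
# Backward elimination by AIC with base R's step(); mtcars stands in
# for the chapter's customer data.
full    <- lm(mpg ~ wt + hp + drat + qsec, data = mtcars)
reduced <- step(full, direction = "backward", trace = 0)
# step() only removes a term when doing so lowers the AIC, so the
# selected model's AIC never exceeds the full model's:
AIC(reduced) <= AIC(full)
```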
• final_data3 <- lm(Purchase.made ~ Age + Signed.in.since.Days. +
Married_y + Job.type_retired + Job.type_unemplyed
+ Education_gra + Metro_y,data = final_data)
• summary(final_data3)
These two variables have a P value which is greater than 0.05, hence we need to
remove these variables before the final analysis (Fig. 22).
• final_data4 <- lm(Purchase.made ~ Signed.in.since.Days. +
Fig. 23 P values
<= 13,500), ]
Let’s re-run the model on this filtered data
• mod2 <- lm(Purchase.made ~ Signed.in.since.Days. + Married_y + Education_gra + Metro_y + Job.type_unemplyed, data = final_data_new)
• summary(mod2)
The P-value is greater than the cut-off of 0.05; hence we need to remove the
variable (Fig. 26).
Final linear equation
• mod2 <- lm(Purchase.made ~ Signed.in.since.Days. + Married_y + Education_gra + Metro_y, data = final_data_new)
• summary(mod2)
All the variables are significant (Figs. 27 and 28).
Now analysing the residual plot.
Fig. 28 Results
• par(mfrow = c(2,2))
• plot(mod2)
New residual plot with model fit
Autocorrelation
• durbinWatsonTest(mod2)
In the Durbin-Watson test, the D-W statistic ranges from 0 to 4, and a value close
to 2 indicates no autocorrelation in the residuals (Fig. 29).
In this case, we have a value which is less than 2.
Normality of errors
• hist(residuals(mod2))
According to the assumption of normal distribution of residuals, the histogram
shows the errors are normally distributed (Fig. 30).
Homoscedasticity
• plot(final_data_new$Purchase.made, residuals(mod2))
The scatter plot shows that the residuals are spread fairly evenly across the range of the purchase variable, i.e., roughly homoscedastic (Fig. 31).
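These three residual checks can be sketched together in base R. As an illustration, the Durbin-Watson statistic is computed by hand rather than with car::durbinWatsonTest, and mtcars stands in for the chapter's data.

```r
# Residual diagnostics on a stand-in model (mtcars, not the chapter's data)
fit <- lm(mpg ~ wt + hp, data = mtcars)
e   <- residuals(fit)
# Durbin-Watson statistic: sum((e_t - e_{t-1})^2) / sum(e_t^2);
# it ranges from 0 to 4, and values near 2 suggest no autocorrelation
dw <- sum(diff(e)^2) / sum(e^2)
hist(e)                  # normality of errors
plot(fitted(fit), e)     # spread should be even: homoscedasticity
dw
```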
Checking Cook's distance
• library(predictmeans)
• cooksd = CookD(mod2)
Fig. 30 Histogram
Fig. 31 Relationship
between residuals and
purchase feature
Ideally, observations with a high Cook's distance should be removed from the data
and the model refitted (Figs. 32 and 33).
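The remove-and-refit step can be sketched with base R's cooks.distance() (the chapter itself uses predictmeans::CookD); the 4/n cut-off is a common rule of thumb, not the chapter's, and mtcars again stands in for the real data.

```r
# Sketch: flag and drop high-influence observations, then re-model
fit  <- lm(mpg ~ wt + hp, data = mtcars)      # stand-in model
cd   <- cooks.distance(fit)
flag <- which(cd > 4 / nrow(mtcars))          # rule-of-thumb cut-off
clean <- if (length(flag)) mtcars[-flag, ] else mtcars
fit2  <- lm(mpg ~ wt + hp, data = clean)      # refit without them
```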
STEP 8: Predicting the New Values
Predicting the values in new Data
• Data2 <-read.csv(“MyData.csv”)
• predict.lm(mod2,Data2)
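The scoring step can be illustrated end to end with a toy model and an in-memory data frame standing in for mod2 and MyData.csv:

```r
# Sketch of scoring new records with predict.lm(); toy stand-ins only
fit   <- lm(mpg ~ wt + hp, data = mtcars)
Data2 <- data.frame(wt = c(2.5, 3.2), hp = c(100, 150))
preds <- predict.lm(fit, Data2)   # one predicted value per new row
```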
3 Conclusion
Every corner of our world is producing some type of data. This data contains hidden
insights that are useful for human well-being. Hence, data analysis has brought a
revolution to the digital world. Many tools are available for this task, such as Jupyter
Notebook and R. This paper presents a description of R programming for data
analysis; its features, benefits, and limitations are clearly explained. A dataset is
used as an example to show the benefits and capabilities of R programming for
data preprocessing, analysis, outlier detection, correlation, prediction, classification,
and regression. The results are shown using histograms, box plots, and scatter plots.
References
1. Roy, D., Dutta, M.: A systematic review and research perspective on recommender systems. J
Big Data 9, 59 (2022). https://doi.org/10.1186/s40537-022-00592-5
2. Bochkarev, V., Solovyev, V., Wichmann, S.: Universals versus historical contingencies in lexical
evolution. J. R. Soc. Interface 11, 20140841 (2014). https://doi.org/10.1098/rsif.2014.0841
3. Tippmann, S.: Programming tools: Adventures with R. Nature 517, 109–110 (2015). https://
doi.org/10.1038/517109a
4. Gazoni, R.: A semiotic analysis of programming languages. J. Comp. Commun. 6, 91–101
(2018). https://doi.org/10.4236/jcc.2018.63007
5. TIOBE Index: The R Programming Language. See https://www.tiobe.com/tiobe-index/.
Accessed 10 Sept 2023
6. Gipp, B., Beel, J., Hentschel, C.: Scienstein: A Research Paper Recommender System (2009).
7. Fayyaz, Z., Ebrahimian, M., Nawara, D., Ibrahim, A., Kashef, R.: Recommendation systems:
Algorithms, challenges, metrics, and business opportunities. Appl. Sci. 10(21), 7748 (2020).
https://doi.org/10.3390/app10217748
8. R: The R Project for Statistical Computing (r-project.org)
9. German, M., Adams, B., Hassan, A.E.: The evolution of the R software ecosystem. In: 2013
17th European Conference on Software Maintenance and Reengineering, pp. 243–252 (2013,
March). https://doi.org/10.1109/CSMR.2013.33. ISSN: 1534–5351
10. Analytics Vidhya | Learn everything about AI, Data Science and Data Engineering
11. Gorakala, S.K., Usuelli. M.: Building a Recommendation System with R. Packt Publishing
(2015)
12. Ge, X., Liu, J., Qi, Q., Chen, Z.: A new prediction approach based on linear regression for
collaborative filtering. 2011 Eighth International Conference on Fuzzy Systems and Knowledge
Discovery (FSKD). Shanghai, China, pp. 2586–2590 (2011). https://doi.org/10.1109/FSKD.
2011.6020007
13. Furtado, F., Singh, A.: Movie recommendation system using machine learning. International
Journal of Research in Industrial Engineering 9(1), 84–98 (2020). https://doi.org/10.22105/
riej.2020.226178.1128
14. Job recommendation system using machine learning and natural language processing (dbs.ie)
15. Jayalakshmi, S., Ganesh, N., Čep, R., Senthil, M.J.: Movie recommender systems: Concepts,
methods, challenges, and future directions. Sensors (Basel) 22(13), 4904 (2022). https://doi.
org/10.3390/s22134904
16. RJ-2021–108.pdf (r-project.org)
17. Jhalani, T., Kant, V., Dwivedi, P. A Linear Regression Approach to Multi-criteria Recommender
System, 9714, pp. 235–243 (2016). https://doi.org/10.1007/978-3-319-40973-3_23
18. Morandat, F., Hill, B., Osvald, L., Vitek, J.: Evaluating the design of the R language. In: Noble,
J. (ed.), ECOOP 2012—Object-Oriented Programming, Lecture Notes in Computer Science,
pp. 104–131. Springer, Berlin, Heidelberg (2012). ISBN 978-3-642-31057-7. https://doi.org/
10.1007/978-3-642-31057-7_6
19. Jain, G., Mishra, N., Sharma, S.: CRLRM: Category based Recommendation using Linear
Regression Model, pp. 17–20 (2013). https://doi.org/10.1109/ICACC.2013.11.
20. How to code a recommendation system in R—Ander Fernández (anderfernandez.com)
Deep Learning Techniques in Big Data
Analytics
Abstract The emergence of the digital age has ushered in an unprecedented era of
data production and collection, creating big data models. In this context, a valuable
technique to address complex issues originating from big data analytics is deep
learning, which is a subgroup of machine learning. The aim of this chapter is to give
a thorough assessment of deep learning methods and how they are implemented in
big data analytics. Beginning with an introduction to the fundamental tenets of deep
learning, including neural networks and deep neural architectures, the mechanisms
by which deep models can automatically learn and represent complex patterns from
raw data are explored. It examines various aspects of deep learning applications of big
data analysis. It shows how deep learning models excel in feature learning, enabling
the automatic extraction of valuable information from huge data sets. Finally, the
chapter describes emerging trends in deep learning and big data analysis, providing
a glimpse into the future of this dynamic field. It draws attention to the pivotal role
that deep learning techniques have played in transforming the big data analytics
environment and emphasizes the ongoing significance of research and innovation in
this quickly developing discipline.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 171
P. Singh et al. (eds.), Data Analytics and Machine Learning, Studies in Big Data 145,
https://doi.org/10.1007/978-981-97-0448-4_9
172 A. K. Badhan et al.
1 Introduction
Deep Learning
Use cases
enormous volume of unstructured data, including photos, text, audio, and video, that
cannot be handled effectively by traditional machine learning methods.
Figure 2 discusses the major sectors that currently use deep learning. [4] have
shown the usage of IoT-based irrigation systems that provide precision agriculture
by controlling the period of irrigation and saving water. Similarly, in the healthcare
sector, [5] proposed different deep learning techniques applied in healthcare, with a
main focus on computer vision, natural language processing, reinforcement
learning, and generalized methods. [6] highlights the transformative impact
of machine learning within the manufacturing Industry 4.0 paradigm. Apart from this,
[7] addressed the critical role of financial institutions in ensuring economic stability
and sustainability through effective credit risk mitigation, with a main focus on
developing a model to classify potential borrowers as good or bad credit risks.
Deep learning is excellent for feature extraction, which enables the automatic
identification of relevant information from raw data, particularly for complex tasks in
the context of huge volumes of data. Deep learning models such as the Convolutional
Neural Network (CNN) and the Recurrent Neural Network (RNN) have brought
a strong revolution in the fields of speech recognition, image recognition, and
natural language processing, making them indispensable tools in various
applications.
A synergy with immense potential exists when technologies like big data analytics
and deep learning are combined with each other. The former provides the necessary
infrastructure to process large data sets, while deep learning algorithms unlock
actionable insights into that data by automatically identifying complex patterns.
Using deep learning capabilities within big data analytics, organizations can extract
nuanced, context-rich information from their data, enabling accurate predictions,
improved decision-making, and the development of innovative solutions. Essentially,
the combination of big data analytics with deep learning techniques gives companies
and researchers powerful tools to navigate the complexities of the digital
age. This integration not only drives operational efficiency and business growth but
also paves the way for breakthrough discoveries, making it the cornerstone of today's
data operations.
2 Literature Review
Deep learning methods are one of the promising fields of research in the automated
extraction of complex data representations (features) at high levels of abstraction.
With the help of these algorithms, data is learned and represented in a layered,
hierarchical manner, using lower level (less abstract) characteristics used to define
higher level (more abstract) features. Artificial Intelligence imitates the deep, multi-
layered learning process used by the human brain’s primary sensory regions of the
neocortex to automatically extract characteristics and abstractions from underlying
input. This is what motivates deep learning methods and hierarchical learning design.
A fundamental idea underlying deep learning approaches is distributed
representations of the data, which allow for a robust representation of each example
and stronger generalization by making a vast number of potential configurations
of the input data's abstract features viable. The number of possible configurations
grows with the number of extracted abstract features. Considering that the observed
data were produced by the interactions of numerous known and unknown factors,
it is likely that new combinations of the learned factors and forms can be used to
explain additional (unseen) data patterns when they are obtained through certain
configurations of learned factors [8].
Deep learning approaches that make use of deep neural networks have gained
prominence as high-performance computing resources have proliferated. When
working with unstructured data, deep learning methods attain greater power and
flexibility since they can process a huge number of features. The data is passed
through many layers of a deep learning model; each layer has its own ability to
gradually extract features and transmit the information to the following layer. The
earliest layer extracts low-level characteristics, which are then merged by succeeding
layers to provide an extensive representation. Even while deep learning is constantly
improving, there is still a variety of issues that need to be addressed. Deep learning
can be used to make robots smarter, sometimes even smarter than people, even if its
exact mechanism is still a mystery. To make mobile applications smarter and more
intelligent, the goal is to develop models that operate on mobile devices. There also
needs to be a commitment to making deep learning serve humanity and sustain our
world as a better place to live [9].
Deep learning methods are being more widely used in image segmentation. These
methods, which started with the development of various algorithms in deep learning,
have given rise to a wide variety of new kinds of picture segmentation algorithms.
Previous research has demonstrated the promise of deep learning-based methodolo-
gies. More recent studies that compare various techniques based on their reported
performance encompass more methodologies [10].
The processing of vast amounts of data is restricted in numerous ways by standard
data processing techniques. For excellent accuracy and efficiency while handling data
in real time, deep learning and machine learning-based methods must be developed
for big data analytics to swiftly analyze data. Recent research has merged
a range of deep learning methods with hybrid learning and training procedures.
Since the majority of these strategies are scenario-specific and focused on vector
space, they perform poorly in more general scenarios and when learning features
from big data. Handling the enormous amount of data in accordance with the
requirements of the organization is crucial, because alphanumeric data is expanding
exponentially in a variety of forms. Technology-based businesses like Microsoft, and
companies like Yahoo, Amazon, etc., have exabyte-sized or even bigger amounts
of data on hand. Because of the widespread usage of social online media platforms,
their users generate tremendous volumes of data. However, standard techniques are
incapable of handling this amount of data. Because of this, a variety of businesses have
created big data analytics-based products for experimentation, simulation,
data analysis, monitoring, and a variety of other business purposes [11].
Deep learning uses supervised and unsupervised methods for learning multi-level
representations and features in hierarchical structures for the goals of classification
and pattern recognition. Big data collection has been made possible by recent
advancements in communication and sensor network technology. Big data has excellent
prospects for an extensive range of industries, such as e-commerce, smart medicine,
etc., but it also poses difficult problems for data mining and information processing
because of its characteristics of huge volume, variety, velocity, and veracity. Deep
learning has become increasingly important in big data over the past few years. In
comparison to more traditional shallow machine learning methods such as support
vector machines and Naive Bayes, deep learning methods can more efficiently
combine low-level input to extract high-level features and absorb hierarchical
representations from large amounts of data [12].
Deep learning uses Artificial Neural Networks that are modeled after the neurons
found in the human brain. Layers make up this structure, and the adjective "deep"
refers to the use of multiple layers. The term "deep" originally referred to a
very small number of layers, but because deep learning is used to solve complicated
problems, the number of layers has increased to hundreds or even more. Many
companies related to image processing, healthcare, transportation, and agriculture
have found great success using deep learning. Deep learning is becoming
more and more popular as a result of the availability of training datasets, such as
ImageNet, which contains thousands of photos and allows for the best possible use of
an increasing amount of data. Second, low-priced GPUs are increasingly frequently
used to train on datasets and can take advantage of different cloud services. Massive
corporations like Facebook, Amazon Inc., Google Inc., and Microsoft use deep
learning methods on a daily basis to evaluate enormous amounts of data [13].
A general review of popular and challenging urban big data fusion based on
deep learning methods is presented. First, several elements of urban big data are
evaluated. Then, a few typical data fusion techniques, which may broadly be split into
three groups, are briefly presented together with spatial-temporal data. Then, three
categories of existing multi-modal urban big data fusion techniques grounded in deep
learning (DL-based output fusion, DL-based input fusion, and DL-based double-stage
fusion) are separated out and described separately. Finally, the challenges and
some suggestions for studying urban big data are given based on the behaviors and
characteristics of urban big data [14].
In recent years, deep learning models have excelled at speech recognition and
computer vision. The first and foremost benefit of using deep learning techniques
is the ability to evaluate an enormous amount of data, i.e., Big Data. This is vital
for organizations like social networks that need to collect a lot of data, and this
advantage makes deep learning a powerful technique for Big Data. Incredibly valuable
information that is hidden in a Big Data set can be retrieved using deep learning;
such data can be seen in social networks and the contemporary stock market. Using
deep learning techniques, one can extract complex data representations at a high
level of abstraction, in a manner that makes it possible to specify higher-level features
using lower-level characteristics. The techniques of deep learning can be used to
distinguish between distinct sources of data variance (such as lighting, object shapes,
and object materials in a picture). The primary sensory areas of the human brain's
neocortex are where the concept of hierarchical learning in deep learning originates
[15].
Overall summary (Table 1).
3 Methodology
Big data analytics uses a variety of deep learning approaches to extract valuable
insights from vast and complicated data sets. Some of the different approaches are:
1. Convolutional Neural Networks (CNNs): The convolution neural network
employs a unique method called convolution: a mathematical operation
applied between two functions, resulting in a third function that illustrates how
one function's shape is influenced or modified by the other.
Phung and Rhee [22] propose convolutional neural networks, which distinguish
themselves from other pattern-recognition algorithms by integrating both feature
extraction and classification. Fig. 3 illustrates a straightforward
schematic of a basic CNN, comprising five distinct layers:
• Initial layer, i.e., input,
• Second layer, i.e., convolution,
• Third layer, i.e., pooling,
• Fourth layer, i.e., fully connected, and
• Fifth layer, i.e., output.
These layers fall into two divisions: feature extraction and classification. The
former, i.e., feature extraction, encompasses the first three layers, while
classification involves the remaining two. The input layer sets a defined size for
input pictures, which may be adjusted by resizing if needed. The convolution layer
then subjects the picture to multiple learned kernels with shared weights. The next,
third, layer,
Table 1 Tabular view of the complete literature review

• A Systematic Review on Machine Learning Approaches for Cardiovascular Disease Prediction Using Medical Big Data [16] (J. Azmi et al., 2022). Methodology/Technique: Machine Learning. Dataset: Medical Big Data. Performance Metrics: Accuracy, Sensitivity, Specificity. Results: comprehensive review of machine learning for cardiovascular disease prediction. Discussion/Conclusion: key findings on ML approaches for cardiovascular disease prediction.

• Big Data Analysis of the Internet of Things in the Digital Twins of Smart City Based on Deep Learning [17] (X. Li, H. Liu et al., 2022). Methodology/Technique: Deep Learning, IoT data analysis. Dataset: Smart City IoT data. Performance Metrics: IoT data analysis metrics. Results: significant insights into IoT data analysis in smart cities. Discussion/Conclusion: deep learning's role in IoT data analysis in smart cities.

• A Novel Diabetes Healthcare Disease Prediction Framework Using Machine Learning Techniques [18] (R. Krishnamoorthi et al., 2022). Methodology/Technique: Machine Learning. Dataset: Diabetes. Performance Metric: Accuracy. Results: developed a novel … Discussion/Conclusion: machine learning is …
Deep Learning Techniques in Big Data Analytics
i.e., pooling, follows, reducing picture size while preserving essential data. The
feature maps are the results obtained from feature extraction. In the next phase,
i.e., classification, the fully connected layers amalgamate the extracted features,
and then the output layer, with one neuron per object category,
produces the classification result. The pattern implemented in most CNN
architectures consists of the following stages [23]:
• Input Processing (IN): the initial input undergoes a convolution operation
(CONV) followed by pooling (POOL).
• CONV: the convolution layer.
• POOL: the pooling layer that follows it.
• Matrix Multiplication (M): the result of the convolution and pooling is
multiplied by a matrix M.
• Fully Connected Layer (FC): the outcome of the previous step goes through
another convolution (CONV) with fully connected processing (FC).
• Matrix Multiplication (N): the result is then multiplied by another matrix N.
• Output (OUT): the final output is obtained.
Convolutional networks are used in big data analytics for various applications,
including object detection, image recognition, and multidimensional dataset
analysis.
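The convolution and pooling steps described above can be sketched directly. The following is a minimal single-channel NumPy illustration, not the architecture of any cited work; the 6×6 image, the 3×3 edge kernel, and the function names are illustrative assumptions:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid (no-padding) 2-D convolution of a single-channel image."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Slide the kernel over the image and take the weighted sum
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling: shrinks each spatial dimension by `size`."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = fmap[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "picture"
kernel = np.array([[1.0, 0.0, -1.0]] * 3)          # simple vertical-edge kernel
features = conv2d(image, kernel)                   # 4x4 feature map
pooled = max_pool2d(features)                      # 2x2 map after pooling
```

Stacking several such convolution and pooling stages, then flattening into fully connected layers, yields the IN, CONV, POOL, FC, OUT pipeline described above.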
2. Recurrent Neural Networks (RNNs): These networks are well suited to data that
is sequential in nature and are used to capture temporal dependencies. They
process sequential data by incorporating information from previous steps: the
output of the prior stage is fed to the current stage as input. Recurrent networks
learn from the training input like other networks but differ in their memory,
which stores information from earlier calculations and allows previous inputs
to influence the current input and output. The diagrammatic view of the
recurrent neural network is presented as follows:
The authors of [5] propose a recurrent neural network for enhancing audio-visual
speech recognition (AVSR) accuracy in noisy environments. The RNN model
exhibits a loop structure within its hidden unit, as illustrated in Fig. 4. It comprises
an input layer, denoted "I", a hidden layer "H", and an output layer, denoted "O".
The RNN unfolds the loop, essentially replicating the same structure multiple
times; in this configuration, the state H of each iteration serves as an input to the
subsequent iteration. Denoting the input, hidden, and output layers at time t as
I_t, H_t, and O_t, the output can be calculated as follows [24]:
a_t = b_1 + W H_{t−1} + U I_t

H_t = σ(a_t)    (2)

O_t = b_2 + V H_t
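The unrolled recurrence above can be sketched as a single step function. This is a minimal NumPy illustration of the equations, with biases b_1, b_2 and randomly initialized weights W, U, V as placeholder assumptions rather than trained values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rnn_step(I_t, H_prev, W, U, V, b1, b2):
    """One unrolled RNN step: new hidden state from previous state plus input."""
    a_t = b1 + W @ H_prev + U @ I_t   # pre-activation a_t
    H_t = sigmoid(a_t)                # hidden state H_t = sigma(a_t)
    O_t = b2 + V @ H_t                # output layer O_t
    return H_t, O_t

rng = np.random.default_rng(0)
d_in, d_h, d_out = 4, 3, 2
W = rng.normal(size=(d_h, d_h))
U = rng.normal(size=(d_h, d_in))
V = rng.normal(size=(d_out, d_h))
b1, b2 = np.zeros(d_h), np.zeros(d_out)

H = np.zeros(d_h)                       # initial hidden state
for I_t in rng.normal(size=(5, d_in)):  # a toy sequence of 5 input vectors
    H, O = rnn_step(I_t, H, W, U, V, b1, b2)
```

Because H is carried from one step to the next, information from earlier inputs influences every later output, which is exactly the memory property described above.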
The authors of [25] employ the same structure with deep learning techniques, i.e.,
LSTM and bidirectional LSTM, for multistep COVID-19 infection hotspot
prediction in Indian states. The LSTM network model computes the hidden output
state h_t based on the following key components [26]:
i. Input Gate (i_t): It determines what information from the input vector x_t must
be saved in the cell state C_t. The expression is given as:

i_t = σ(x_t U^i + h_{t−1} W^i)    (3)

ii. Forget Gate (f_t): It decides what information from the previous cell state
C_{t−1} should be discarded or kept at the current time step. The expression is
as follows:

f_t = σ(x_t U^f + h_{t−1} W^f)    (4)

iii. Output Gate (o_t): It determines what part of the cell state should be output
as the hidden state h_t for the most recent time step. The expression is as follows:

o_t = σ(x_t U^o + h_{t−1} W^o)    (5)

iv. Intermediate Cell Gate (C̃_t): It computes the candidate update to the cell
state. The expression is as follows:

C̃_t = tanh(x_t U^c + h_{t−1} W^c)    (6)

v. Cell State (C_t): It combines the retained portion of the previous cell state
with the gated candidate update:

C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t    (7)

vi. Hidden State (h_t): It is defined as the output of the LSTM for the most
recent time step, which depends on the cell state and the output gate. The general
expression is provided as follows:

h_t = tanh(C_t) ∗ o_t    (8)
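Equations (3) through (8) can be combined into one step function. The sketch below is a minimal NumPy illustration of those gate equations; bias terms (omitted in the text) are left out, and the dimensions and random weights are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, U, W):
    """One LSTM step following Eqs. (3)-(8); U, W hold per-gate weight matrices."""
    i_t = sigmoid(x_t @ U["i"] + h_prev @ W["i"])      # input gate,   Eq. (3)
    f_t = sigmoid(x_t @ U["f"] + h_prev @ W["f"])      # forget gate,  Eq. (4)
    o_t = sigmoid(x_t @ U["o"] + h_prev @ W["o"])      # output gate,  Eq. (5)
    C_tilde = np.tanh(x_t @ U["c"] + h_prev @ W["c"])  # candidate,    Eq. (6)
    C_t = f_t * C_prev + i_t * C_tilde                 # cell state,   Eq. (7)
    h_t = np.tanh(C_t) * o_t                           # hidden state, Eq. (8)
    return h_t, C_t

rng = np.random.default_rng(1)
d_x, d_h = 4, 3
U = {g: rng.normal(size=(d_x, d_h)) for g in "ifoc"}  # input-to-gate weights
W = {g: rng.normal(size=(d_h, d_h)) for g in "ifoc"}  # hidden-to-gate weights

h, C = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(6, d_x)):   # a toy sequence of 6 time steps
    h, C = lstm_step(x_t, h, C, U, W)
```

The forget gate in Eq. (7) is what lets the cell state carry information across many steps, which is why LSTMs handle long-range dependencies better than plain RNNs.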
a. Encoding: It takes the input data I and produces a compressed representation
z in the latent space. The expression is as follows:

z = Encoder(I)    (9)

b. Decoding: It reconstructs the input from the latent representation, producing
the recovered output Î:

Î = Decoder(z)    (10)

c. Loss Function: It measures the discrepancy between the input I and the
recovered output Î. Mean squared error is the loss function frequently used for
continuous data, and binary cross entropy for binary data. The loss function, the
mean squared error (MSE), and the binary cross entropy expressions are given as:

Loss = LossFunction(I, Î)

MSE = (1/n) Σ_{i=1}^{n} (I_i − Î_i)²

Binary Cross Entropy = −(1/n) Σ_{i=1}^{n} [I_i log Î_i + (1 − I_i) log(1 − Î_i)]    (11)
d. Training Objective: The main focus during training is to minimize the loss by
adjusting the parameters, i.e., the weights and biases of both the encoder and the
decoder:

minimize over Encoder, Decoder: Loss(I, Î)    (12)
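The encoding, decoding, loss, and training-objective steps above can be sketched end-to-end. The following assumes a purely linear encoder and decoder trained by plain gradient descent on the MSE loss; all names, dimensions, and the learning rate are illustrative assumptions, not the chapter's exact model:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 200, 8, 3                  # samples, input dim, latent dim
I = rng.normal(size=(n, d))          # toy data matrix

W_enc = rng.normal(scale=0.1, size=(d, k))   # encoder weights (assumed linear)
W_dec = rng.normal(scale=0.1, size=(k, d))   # decoder weights (assumed linear)

def mse(I, I_hat):
    """Mean squared reconstruction error, as in Eq. (11)."""
    return float(np.mean((I - I_hat) ** 2))

lr = 0.05
loss_before = mse(I, I @ W_enc @ W_dec)
for _ in range(500):                 # minimize the loss, Eq. (12)'s objective
    z = I @ W_enc                    # encoding: z = Encoder(I), Eq. (9)
    I_hat = z @ W_dec                # decoding: reconstruction of I, Eq. (10)
    err = (I_hat - I) / n            # averaged reconstruction error
    grad_dec = z.T @ err             # gradient of the loss w.r.t. W_dec
    grad_enc = I.T @ (err @ W_dec.T) # gradient of the loss w.r.t. W_enc
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
loss_after = mse(I, I @ W_enc @ W_dec)
```

Since k < d, the network is forced to learn a compressed representation, which is the dimensionality-reduction property of auto-encoders noted in Table 2.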
min_G max_D V(D, G) = E_{x∼pdata(x)}[log D(x)] + E_{z∼pz(z)}[log(1 − D(G(z)))]    (13)
where:
• G: the generator;
• D: the discriminator;
• x: real data samples;
• z: noise samples;
• pdata: the real data distribution;
• pz: the noise distribution.
b. Discriminator: It operates like a vigilant authority, tasked with pinpointing
irregularities in the samples generated by the generator and accurately
categorizing them as either genuine or fabricated. It follows a basically
supervised approach: it is trained on real data and provides feedback to the
generator. Whereas the generator minimizes the objective, the discriminator's
loss function is maximized [30].
The interplay between the generator and discriminator persists until a state of
refinement is reached in which the generator triumphs, producing fake data that
the discriminator can no longer reliably discern from real data.
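The value function V(D, G) of Eq. (13) can be estimated by Monte-Carlo sampling from pdata and pz. In the sketch below the discriminator and generator are fixed toy functions, purely illustrative assumptions rather than trained networks:

```python
import numpy as np

rng = np.random.default_rng(3)

def D(x):
    """Toy discriminator: probability that x is real (a fixed sigmoid score)."""
    return 1.0 / (1.0 + np.exp(-x))

def G(z):
    """Toy generator: maps noise to a sample (a fixed affine map, illustrative)."""
    return 0.5 * z - 1.0

x_real = rng.normal(loc=2.0, size=10_000)   # samples from pdata
z_noise = rng.normal(size=10_000)           # samples from pz

# Monte-Carlo estimate of V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]
V = np.mean(np.log(D(x_real))) + np.mean(np.log(1.0 - D(G(z_noise))))
```

During training the discriminator would adjust its parameters to push V up while the generator pushes it down, which is the minimax interplay described above.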
Fig. 8 The structure of multi-semantic fusion model for generating high resolution images
prediction models and resolving specific problems related to huge and intricate
datasets.
4 Discussion
Table 2 presents the overall description of deep learning techniques along with the
applications.
Although the deep learning approach provides a number of benefits for
managing massive data analysis, several restrictions must be taken into
consideration. One significant issue is its voracious appetite for computational
resources, particularly when very deep neural networks are trained on enormous
datasets. The sheer size of big data can make tasks time-consuming and calls for
an efficient computing architecture.
Another drawback is the interpretability of deep learning models, which frequently
function as intricate black boxes, making it challenging to comprehend the context
of their predictions. The requirement for labeled training data can also pose
challenges in situations where such data is difficult or expensive to acquire.
When applying deep learning techniques to large-scale data analysis, concerns
about ethical issues, data security, and potentially biased conclusions remain
crucial. These drawbacks emphasize the value of a cautious strategy, continuous
research, and ethical care when utilizing deep learning to extract knowledge from
enormous data sets.
Table 2 Tabular view of Deep Learning Techniques, Description, Applications, and Benefits

• Convolutional Neural Networks (CNNs) [22]. Description: convolution layers, well suited for image and video analysis, are used by CNNs to extract spatial characteristics. Applications: health care; autonomous vehicles; surveillance. Benefits: feature extraction; high accuracy in image-related tasks.

• Recurrent Neural Networks (RNNs) [5]. Description: RNNs employ feedback connections to capture temporal relationships and are best for sequential data. Applications: natural language processing; speech recognition; finance. Benefits: sequential data analysis; time-series forecasting; text generation.

• Long Short-Term Memory (LSTM) [25]. Description: an improved version of recurrent networks (RNNs) for recognizing enduring relationships in sequential data. Applications: predictive text typing; speech-to-text; anomaly detection. Benefits: better gradient flow, which solves the vanishing-gradient problem and works well with sequential data.

• Auto-encoders [27]. Description: an unsupervised model for feature learning and dimensionality reduction. Applications: image denoising; recommender systems; anomaly detection. Benefits: decreased dimensionality; feature learning; data denoising.

• Generative Adversarial Networks (GANs) [30]. Description: encompasses two models, a generator and a discriminator, to produce realistic data. Applications: generating images; transferring styles; enhancing data. Benefits: data augmentation; realistic data production; better image quality.

• Ensemble Learning [33]. Description: combines predictions from multiple models for greater precision. Applications: regression; classification; anomaly detection. Benefits: increased resilience; decreased overfitting; improved predictive accuracy.
6 Conclusion
To sum up, deep learning techniques have become a revolutionary force in big data
analytics, providing never-before-seen capacity to extract valuable insights from
enormous and intricate datasets. The ability of deep learning models such as neural
networks, convolutional networks, and auto-encoders to automatically learn
intricate patterns and representations from data has proven invaluable in diverse
domains within big data analytics. In applications ranging from natural language
processing and anomaly detection to picture and speech recognition, these
7 Future Scope
Future prospects for “Deep Learning Techniques for Big Data Analytics” are
extremely bright thanks to major developments in a number of important fields.
Priority should be given to creating more scalable and effective deep learning archi-
tectures that can manage even bigger data volumes through algorithm optimization
and the use of distributed computing. Research developing interpretable models
and decision-explanation approaches is necessary in order to address the “black-
box” character of deep learning models, especially in industries like healthcare and
finance where openness is critical. Transfer learning approaches advancements can
address data scarcity challenges, improving these models’ generalization capabili-
ties. Holistic solutions can be found by investigating hybrid models that integrate
deep learning with conventional machine learning methods and fusion techniques.
As profound as these models are, ethical issues, especially the reduction of prejudice and bias, are crucial.
The future of deep learning techniques in big data analytics lies in a multidisci-
plinary approach, with ongoing collaboration between researchers, industry practi-
tioners, and policymakers to address challenges and unlock new potentials for these
powerful technologies.
References
1. Khaturia, D., Saxena, A., Basha, S.M., Iyengar, N.C.S., Caytiles, R.D.: A comparative study
on airline recommendation system using sentimental analysis on customer tweets. Int J Adv
Sci Technol 111, 107–114 (2018). https://doi.org/10.14257/ijast.2018.111.10
2. Hu, X., Liu, J.: Research on e-commerce visual marketing analysis based on internet big data.
J. Phys. Conf. Ser. 1865,(2021). https://doi.org/10.1088/1742-6596/1865/4/042094
3. Jha, B.K., Sivasankari, G.G., Venugopal, K.R.: Fraud detection and prevention by using big data
analytics. In: Proceedings of the 4th International Conference on Computing Methodologies and
Communication ICCMC 2020, pp. 267–274 (2020). https://doi.org/10.1109/ICCMC48092.2020.ICCMC-00050
4. Aruul Mozhi Varman, S., Baskaran, A.R., Aravindh, S., Prabhu, E.: Deep learning and IoT for
smart agriculture Using WSN. 2017 IEEE Int Conf Comput Intell Comput Res ICCIC 2017,
1–6 (2018). https://doi.org/10.1109/ICCIC.2017.8524140
5. Esteva, A., Robicquet, A., Ramsundar, B., Kuleshov, V., DePristo, M., Chou, K., Cui, C.,
Corrado, G., Thrun, S., Dean, J.: A guide to deep learning in healthcare. Nat. Med. 25, 24–29
(2019). https://doi.org/10.1038/s41591-018-0316-z
6. Rai, R., Tiwari, M.K., Ivanov, D., Dolgui, A.: Machine learning in manufacturing and industry
4.0 applications. Int. J. Prod. Res. 59, 4773–4778 (2021). https://doi.org/10.1080/00207543.
2021.1956675
7. Becha, M., Dridi, O., Riabi, O., Benmessaoud, Y.: Use of machine learning techniques in
financial forecasting. In: Proc 2020 Int Multi-Conference Organ Knowl Adv Technol OCTA
2020 (2020). https://doi.org/10.1109/OCTA49274.2020.9151854
8. Furht, B., Villanustre, F.: Big Data Technologies and Applications. Springer, Cham (2016)
9. Sonde, V.M., Shirpurkar, P.P., Giripunje, M.S., Ashtankar, P.P.: Experimental and dimensional
analysis approach for human energy required in wood chipping process. In: International
Conference on Advanced Machine Learning Technologies and Applications, pp. 683–691
(2020)
10. Ghosh, S., Das, N., Das, I., Maulik, U.: Understanding deep learning techniques for image
segmentation. ACM Comput. Surv. 52 (2019). https://doi.org/10.1145/3329784
11. Jan, B., Farman, H., Khan, M., Imran, M., Islam, I.U., Ahmad, A., Ali, S., Jeon, G.: Deep
learning in big data analytics: A comparative study. Comput. Electr. Eng. 75,
275–287 (2019). https://doi.org/10.1016/j.compeleceng.2017.12.009
12. Zhang, Q., Yang, L.T., Chen, Z., Li, P.: A survey on deep learning for big data. Inf. Fusion 42,
146–157 (2018)
13. Ghaderi, Z., Khotanlou, H.: Weakly supervised pairwise Frank-Wolfe algorithm to recognize
a sequence of human actions in RGB-D videos. Signal, Image Video Process 13, 1619–1627
(2019)
14. Liu, J., Li, T., Xie, P., Du, S., Teng, F., Yang, X.: Urban big data fusion based on deep learning:
An overview. Inf Fusion 53, 123–133 (2020)
15. Sohangir, S., Wang, D., Pomeranets, A., Khoshgoftaar, T.M.: Big data: Deep learning for
financial sentiment analysis. J Big Data 5, 1–25 (2018)
16. Azmi, J., Arif, M., Nafis, M.T., Alam, M.A., Tanweer, S., Wang, G.: A systematic review on
machine learning approaches for cardiovascular disease prediction using medical big data. Med
Eng & Phys 105, 103825 (2022)
17. Li, X., Liu, H., Wang, W., Zheng, Y., Lv, H., Lv, Z.: Big data analysis of the internet of things
in the digital twins of smart city based on deep learning. Futur. Gener. Comput. Syst. 128,
167–177 (2022)
18. Krishnamoorthi, R., Joshi, S., Almarzouki, H.Z., Shukla, P.K., Rizwan, A., Kalpana, C., Tiwari,
B., others: A novel diabetes healthcare disease prediction framework using machine learning
techniques. J. Healthc. Eng. 2022, 1–10 (2022)
19. Gandomi, A.H., Chen, F., Abualigah, L.: Machine learning technologies for big data analytics.
Electronics 11, 421 (2022)
20. Mathew, A., Amudha, P., Sivakumari, S.: Deep learning techniques: An overview. Adv Mach
Learn Technol Appl Proc AMLTA 2020, 599–608 (2021)
21. Ghosh, S., Das, N., Das, I., Maulik, U.: Understanding deep learning techniques for image
segmentation. ACM Comput. Surv. 52, 1–35 (2019)
22. Phung, V.H., Rhee, E.J.: A High-accuracy model average ensemble of convolutional neural
networks for classification of cloud image patches on small datasets. Appl. Sci. 9 (2019).
https://doi.org/10.3390/app9214500
23. Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., Liu, T., Wang, X., Wang, G., Cai,
J., Chen, T.: Recent advances in convolutional neural networks. Pattern Recognit 77, 354–377
(2018). https://doi.org/10.1016/j.patcog.2017.10.013
24. Kasongo, S.M.: A deep learning technique for intrusion detection system using a recurrent
neural networks based framework. Comput. Commun. 199, 113–125 (2023). https://doi.org/10.1016/j.comcom.2022.12.010
25. Chandra, R., Jain, A., Chauhan, D.S.: Deep learning via LSTM models for COVID-19 infection
forecasting in India. PLoS ONE 17, 1–28 (2022). https://doi.org/10.1371/journal.pone.0262708
26. Zhang, H., Wang, L., Shi, W.: Seismic control of adaptive variable stiffness intelligent structures
using fuzzy control strategy combined with LSTM. J Build Eng 78, 107549 (2023). https://doi.
org/10.1016/j.jobe.2023.107549
27. Bank, D., Koenigstein, N., Giryes, R.: Autoencoders. Machine Learning for Data Science
Handbook: Data Mining and Knowledge Discovery Handbook, pp. 353–374. Springer, Cham
(2023)
28. Chen, S., Guo, W.: Auto-encoders in deep learning—A review with new perspectives.
Mathematics 11, 1777 (2023)
29. Chen, S., Guo, W.: Auto-encoders in deep learning—A review with new perspectives.
Mathematics 11, 1–54 (2023). https://doi.org/10.3390/math11081777
30. Kumar, S., Dhawan, S.: A detailed study on generative adversarial networks. Proc 5th Int Conf
Commun Electron Syst ICCES 2020, pp. 641–645 (2020). https://doi.org/10.1109/ICCES4
8766.2020.09137883
31. Huang, P., Liu, Y., Fu, C., Zhao, L.: Multi-Semantic fusion generative adversarial network for
text-to-image generation. In: 2023 IEEE 8th Int Conf Big Data Anal ICBDA 2023, pp. 159–164.
(2023). https://doi.org/10.1109/ICBDA57405.2023.10104850
32. Wu, P., Guo, H., Buckland, R.: A transfer learning approach for network intrusion detection.
In: 2019 4th IEEE Int Conf Big Data Anal ICBDA 2019, pp. 281–285 (2019). https://doi.org/
10.1109/ICBDA.2019.8713213
33. Mung, P.S.: Effective analytics on healthcare big data using ensemble learning. In: IEEE
Conf. Comput. Appl. ICCA 2020, pp. 1–4 (2020). https://doi.org/10.1109/ICCA49400.2020.9022853
Data Privacy and Ethics in Data
Analytics
Rajasegar R. S.
IT Industry, Cyber Security, County Louth, Ireland
e-mail: rajasegarrs@outlook.com
Gouthaman P. (B) · Nallarasan V.
Department of Networking and Communications, SRM Institute of Science and Technology,
Kattankulathur, Chennai, India
e-mail: gouthamp@srmist.edu.in
Nallarasan V.
e-mail: nallarav@srmist.edu.in
Vijayakumar Ponnusamy
Department of Electronics and Communications, SRM Institute of Science and Technology,
Kattankulathur, Chennai, India
e-mail: vijayakp@srmist.edu.in
Arivazhagan N.
Department of Computational Intelligence, SRM Institute of Science and Technology,
Kattankulathur, Chennai, India
e-mail: arivazhn@srmist.edu.in
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 195
P. Singh et al. (eds.), Data Analytics and Machine Learning, Studies in Big Data 145,
https://doi.org/10.1007/978-981-97-0448-4_10
196 Rajasegar R. S. et al.
technology safety, legal frameworks and ethical awareness can be infused into their
work culture when their employees are dealing with data in various projects in the
future.
Data privacy is both a public and a private phenomenon, and its processes have
consequences for the individual as well as the community. This view prevents
privacy from being treated as a purely technological matter and brings out the
various factors linked with it. The right to privacy is understood not only as
integral to an individual's freedom but is also viewed as the capability to hide
particular information for malpractice. The article [1] delves deeper into this
contradiction of data privacy and analyses the different terms and procedures that
have arisen in this field of study.
It is important to understand ethics and privacy [2]: professional ethics are a code
of conduct that governs the way members of various domains work with each other
and with other stakeholders. The author then focuses on the ethics of human
behaviour and on the fundamental objectives of an organization as a clear portrayal
of its professionalism. Privacy is referred to as the right to be left alone, meaning
there must be no intrusion upon seclusion and no public disclosure of private facts
or of false information.
Recently, the usage of digital data has grown exponentially in information
technology, where huge volumes of data are collected, processed and used. Data
Science is a multidisciplinary field that derives knowledge from datasets
comprising both structured and unstructured data, and huge datasets can be
analysed to gain useful information. Data science is also considered a branch of
statistics, since it utilizes various concepts from that domain, and it must be
understood that, to avoid errors during data analysis, it is vital to have data and
models in valid form [3].
In this decade, various industries are exploring data-driven approaches and data
analytics with machine learning, treating them as among the most innovative
computing technologies. The major benefit is predictive analytics, which assists in
predicting sensitive features, future performance, risk and necessary functions
linked with specific communities or individuals on the basis of their huge sets of
behavioural and usage data. The article [4] focuses on the significant ethical and
data-protection implications of predictive analytics when it is used to predict
sensitive details of single individuals, or to handle those individuals differently,
on the basis of data collected from many other unrelated individuals.
Mostly in European countries, there is a trend towards applications of learning
analytics, shaped by the recently implemented European General Data Protection
Regulation (GDPR). In addition, universities in Finland are currently working
towards implementing learning analytics across the nation, with several
multidisciplinary projects conducted in this domain. The article [5] provides a
study in which the
students were questioned on ethical concerns in the gathering and use of data for
learning analytics. The outcome showed that the students were positive about the
possibilities of learning analytics; however, they were also concerned about safety
and how their personal data was utilized.
The present research on differential privacy, whose applications have grown into
several areas in recent years, elucidates an inherent trade-off between a dataset's
privacy and its usefulness for analytics. This trade-off constrains budding
applications of differential privacy that aim to shield the privacy of datasets while
keeping analytics enabled. The author of [6], on the contrary, portrays how to use
differential privacy to extract the necessary analytics from the original dataset in
an accurate manner; using the proposed technique, it is shown that differential
privacy can support both robust privacy and accurate data analytics.
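A standard way to realize this privacy/accuracy trade-off is the Laplace mechanism, which answers a numeric query with calibrated noise. The sketch below is a generic illustration, not the technique proposed in [6]; the record set, the predicate, and the function names are assumptions:

```python
import math
import random

def dp_count(records, predicate, epsilon, rng):
    """epsilon-differentially-private count via the Laplace mechanism.

    A count query has sensitivity 1, so Laplace noise of scale 1/epsilon
    suffices; smaller epsilon means stronger privacy but noisier answers.
    """
    true_count = sum(1 for r in records if predicate(r))
    # Inverse-CDF sampling of Laplace(0, 1/epsilon) noise
    u = rng.random() - 0.5
    noise = -math.copysign(1.0, u) * (1.0 / epsilon) * math.log(1 - 2 * abs(u))
    return true_count + noise

rng = random.Random(42)
ages = [23, 35, 41, 29, 52, 61, 33, 47, 38, 55]
# Private answer to "how many records have age > 40?" (true answer: 5)
noisy = dp_count(ages, lambda a: a > 40, epsilon=1.0, rng=rng)
```

The analyst receives a useful approximate count, while any single individual's presence or absence changes the answer's distribution only slightly, which is the trade-off the paragraph describes.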
In this digital era, big data management throughout the data lifecycle is a huge
challenge for government organizations. Despite immense attention to this
ecosystem, proper big data management remains a difficult task. The article [7]
addresses this issue by suggesting a data lifecycle outline for data-driven
governments; the authors identified nearly 70 data lifecycles and analysed them to
recommend a data lifecycle framework.
During the past two decades, many open platforms, namely social networks along
with mobile devices, have contributed to data collection, and the volume of such
data has grown in due course into big data. At its origin, big data did not focus on
the sensitivity of structured and unstructured data. However, it has become
extremely essential to incorporate security and privacy so that the risk of sharing
personal information is curtailed. The primary benefit of the proposed Secure
MapReduce model [8] is to encourage knowledge mining through data sharing.
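The idea of curtailing personal-information exposure while still enabling knowledge mining can be illustrated with a toy map-reduce pass that pseudonymizes identifiers before aggregation. This is a generic sketch, not the Secure MapReduce model of [8]; the salt, the record format, and the function names are assumptions:

```python
import hashlib
from collections import defaultdict

SALT = b"site-secret"  # illustrative salt; a real deployment would manage keys properly

def map_phase(records):
    """Map step: pseudonymize the user identifier, emit (hashed_id, amount) pairs."""
    for user_id, amount in records:
        hashed = hashlib.sha256(SALT + user_id.encode()).hexdigest()[:12]
        yield hashed, amount

def reduce_phase(pairs):
    """Reduce step: aggregate amounts per pseudonymous key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

records = [("alice", 10.0), ("bob", 5.0), ("alice", 7.5)]
shared = reduce_phase(map_phase(records))  # aggregate with no raw identifiers
```

The shared output still supports mining (per-user totals, frequency analysis) while the raw identifiers never leave the map phase, which is the data-sharing benefit the paragraph highlights.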
In this digital era, advertising through social media has gained immense attention
within the advertising world owing to data-driven targeting techniques. Social
media advertising promises plenty of yield per money invested, since the
technology can reach an extremely specialized community. The study [9] outlines
how advertising for clinical studies through social media leads to intense societal
risks: it is extremely hard to differentiate well-intentioned promises from
unfavourable social media advertisements. To counter this, it is vital to follow
research-ethics guidelines and to improve the regulation of big data and inferential
analytics. It is concluded that social media advertising may not be appropriate as a
recruitment tool for clinical studies as long as the processing of social media usage
data, and the training of predictive models by data analytics and artificial
intelligence organizations, are not properly regulated.
Different technologies and advancements are discussed [10] on which
organizations are embarking to achieve successful digital transformation. It is
stated that the Internet of Things (IoT) acts as a significant source of data growth,
and that the advent of cloud storage, with cloud computing, is an imperative
evolution within the hardware/software ecosystem. The study discusses Artificial
Intelligence with game-changing data analytics and emerging technologies,
namely distributed ledger technology, intelligent character recognition and
Blockchain. In addition, it utilizes natural language processing, a linguistics
discipline focused on understanding and replicating human language and speech
patterns.
The rise of digital policing, shaped by new data-analytics practices, has hugely
impacted the general public's privacy rights and the associated civil liberties
within the criminal process. In this decade, plenty of intervention from the media,
academic studies and regulatory bodies has scrutinized policing data-analytics
practices in different nations. Technological advancements, namely policing
hotspots, live facial recognition and data extraction from mobile devices, are
considered contentious for various reasons. To counter these, different measures
have been adopted, such as police data ethics boards, augmented with soft
regulation through codes of practice and recommendations from different
investigative studies. With regard to police data analytics [11], it is explained how
themes of algorithmic justice can be distinguished in the framework of the United
Kingdom and where this leads, primarily regarding privacy rights within the
criminal process.
The immense potential and support of Artificial Intelligence for tactical
organizational decision-making is still in its developing stages. The findings of the
article [12] are detailed in a conceptual model which first elucidates the ways in
which Artificial Intelligence can enable human decision-making under
uncertainty, and then categorizes the challenges, pre-conditions and consequences
that need to be worked out. It is clear that human responsibility surges, although
the skills necessary to utilize the technology differ from those for other machines,
which showcases the significance of education.
Recently, Building-to-Grid (B2G) has been trending, and digitalization has been
contributing to it in an extremely significant way. The work [13] covers emerging
technologies, namely 5G, Big Data, Blockchain, Artificial Intelligence and IoT,
and the crucial challenges of applying these emerging technologies in the B2G
ecosystem. Furthermore, the study suggests imminent research directions for the
Building-to-Grid ecosystem, particularly ecosystem modelling and simulation,
B2G's part in smart cities, the organization of the B2G ecosystem and various
other rising technologies in B2G.
The rise of ChatGPT, an Artificial Intelligence interface that converses with
people and responds using natural language processing and machine learning
techniques, is a recent trend. The article [14] examines the effect of this
application on data science and outlines the possible benefits and drawbacks of
using ChatGPT in data science. In addition, the article elucidates the means by
which ChatGPT can enable data scientists to systematize different tasks in their
activities, such as cleaning data, pre-processing, training models and investigating
outcomes. Furthermore, the author focuses on the implications of interpreting the
output obtained from ChatGPT, which in turn may raise concerns for
decision-making in data science applications.
In recent times, due to the implementation of the GDPR and CCPA, websites have
begun to ask users for consent through cookie banners. These banners let users
state their preference about which cookies are to be allowed. While requesting
consent prior to the storage of personal information is to be appreciated from the
standpoint of user privacy, research has shown that most websites do not always
honour users' choices. The article [15] investigates whether websites utilize more
tenacious and sophisticated means of tracking in order to follow users who state
that they do not accept cookies. Certain forms of tracking, namely ID
synchronization, browser fingerprinting and so on, are examined; furthermore,
when users declare that they reject all cookies, user tracking becomes more
intense.
The study [16] focuses on offering fundamental knowledge and understanding of
certain significant principles of data protection law by elucidating some key
concepts. In particular, the study highlights that, for initiating data processing,
there is an option to choose among six different legal grounds. Although the
treatment is general and does not provide comprehensive information,
"actionable knowledge" is offered: the reader is enabled to work out and
implement the data protection principles in data science applications, which in
turn lets them be utilized in a socially responsible way.
The processes of data collection and utilization have changed during the past two
decades [17]. Earlier, gathering data involved raising requests with the IT
helpdesk, waiting for a week or two, and then spending much effort projecting the
data into the necessary format. Currently, by contrast, almost everyone has access
to the required data and can perform their own analysis on fast computers installed
with powerful analytics tools. That said, it is imperative that data be collected
appropriately, screened, transformed and analysed with legitimate techniques so
that it provides relevant information, which in turn serves as intelligence for
taking precise decisions that make a business successful. To reiterate, data
analytics is the process of collecting, processing and analysing data to identify
necessary information, to make recommendations and to enable problem-solving
and decision-making.
In this decade, smart cities are emerging as a technological reality that will soon dominate the day-to-day lives of people in both developed and developing nations. With respect to big data in smart cities, privacy and security are identified as major concerns owing to the sensitivity of the data involved in healthcare, cyber security, e-governance, mobile banking and many other domains. This dimension recapitulates recent advances in addressing big data privacy and security issues in digital cities and then highlights potential research directions in this under-explored area. The use of IoT (Internet of Things) devices in smart cities, despite its benefits, has led to security issues arising from applications. In addition, further privacy issues in digital cities stem from the security concerns surrounding the Internet of Things, Big Data and Information and Communication Technologies.
200 Rajasegar R. S. et al.
In this proposed work, the importance of creating a culture of Data Ethics will be discussed through three key areas:
• Data lifecycle
• Challenges in data privacy
• Proposed solution to the identified key challenge
We shall start to build a culture of Data Ethics by understanding the Data Lifecycle.
The data lifecycle, also known as the data management lifecycle, refers to the stages
through which data goes from its creation or acquisition to its eventual retirement or
disposal. This concept is crucial in data management and governance to ensure that
data is effectively and securely managed throughout its entire existence. The data
lifecycle typically consists of several key stages as shown in Fig. 1:
The first stage, Data Creation/Acquisition, is where data is generated, collected or acquired from various sources, such as systems, sensors, users, applications or external databases. The initial generation or import of data into an organization's systems takes place at this stage.
The second stage of the data lifecycle involves Data Ingestion and Storage. Data Ingestion: after data is created or acquired, it must be ingested into data storage systems. This can involve data transformation, validation and indexing to make it ready for storage and processing. Data Storage: data is stored in databases, data warehouses or other storage solutions. This stage involves decisions about the type of storage, data organization and access control.
The third stage, Data Usage, involves three key areas:
1. Data processing/analysis
2. Data presentation/visualization
3. Data sharing/distribution
Firstly, Data Processing involves the storage, processing and analysis of data for several purposes, such as business intelligence, reporting and the training of machine learning models. This phase encompasses mining insights as well as deriving significant detail from the data. Secondly, Data Visualization is where end users are provided with processed data through reports and other visualization tools so that they can understand it easily and act upon it. Finally, Data Distribution involves sharing data within an organization or with external stakeholders, while ensuring that the appropriate users or systems receive access to the data with guarantees of security and privacy.
Data Retention and Archiving is the fourth stage, where organizations must determine how long data should be retained based on regulatory requirements and business needs. Archived data is usually kept in long-term storage solutions; depending on business and regulatory compliance requirements, it may reside in the organization's on-premises storage or in cloud storage.
The fifth stage, Data Destruction, involves:
1. Data backup and disaster recovery
2. Data governance and security
3. Data deletion/retirement
4. Data audit and compliance
5. Data discovery and metadata management
Firstly, Data Backup and Disaster Recovery is where data is backed up on a regular basis to avoid data loss when systems fail or disasters occur; this is one of the imperative features of data management for ensuring data resilience. Secondly, Data Governance and Security applies governance practices throughout the data lifecycle to assure data quality, regulatory compliance and data security, which requires appropriate policies, access controls and monitoring. Thirdly, Data Deletion refers to safely deleting or retiring data that is no longer needed or has become obsolete, which helps maintain data privacy and comply with regulations such as the GDPR. Next, Data Audit and Compliance involves conducting audits regularly to ensure that data management practices are aligned with the organization's policies and objectives. Finally, Data Discovery and Metadata Management involves metadata, which provides information about the data being maintained, and systematizes it so that data assets can be discovered, identified and utilized effectively.
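The pairing of deletion with an audit trail can be sketched as follows. The in-memory store, field names and log format here are hypothetical; a real system would write the audit log to tamper-evident storage.

```python
import hashlib
import json
from datetime import datetime, timezone

class DataStore:
    """Toy in-memory store illustrating deletion plus an audit trail."""

    def __init__(self):
        self.records: dict[str, dict] = {}
        self.audit_log: list[dict] = []

    def delete(self, record_id: str, reason: str) -> None:
        record = self.records.pop(record_id)
        # Keep only a hash of the deleted payload, so the audit entry
        # proves *what* was deleted without retaining the data itself.
        digest = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()
        ).hexdigest()
        self.audit_log.append({
            "event": "deletion",
            "record_id": record_id,
            "payload_sha256": digest,
            "reason": reason,
            "at": datetime.now(timezone.utc).isoformat(),
        })
```

Hashing rather than copying the deleted payload is the key design choice: the audit trail remains verifiable without itself becoming a privacy liability.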
The particular phases and processes of the data lifecycle may differ according to an organization's size, sector and the way it manages its data. Effective data lifecycle management is imperative for achieving data quality, security and compliance, thereby enabling organizations to realize value from their data assets while mitigating risks such as data misuse or loss.
The challenges encountered in data privacy are discussed in this section. We will identify one common problem and propose a solution for it in the following section. The block diagram below shows the most common data privacy challenges.
Implementing data privacy can be a tough task, especially in this digital era, with data volumes increasing day by day and privacy issues alongside them. Significant challenges arise when applying data privacy, notably regulatory compliance and consent management. To begin with, regulatory compliance requires organizations to obey the guidelines of frameworks such as the CCPA and the GDPR, which is a considerable challenge; failure to comply can leave them paying heavy penalties. Next, consent management concerns the importance of obtaining users' consent for processing their data, and flexibility must be maintained in this regard. In conclusion, applying data privacy requires various legal, technical and organizational aspects to be considered, and it is imperative that organizations maintain their customer and other stakeholder information appropriately.
It is extremely significant to manage user consent, and offering users control over the data produced by their IoT devices is challenging. Users need sufficient knowledge of why data is being gathered and how it is utilized, so that they can decide whether to opt out. Next, IoT ecosystem complexity means that a great many devices are connected within one ecosystem, which makes the data flows even more complex; properly understanding and managing the data privacy consequences in such an ecosystem is a humongous task. To sum up, the ever-growing number of interconnected IoT devices raises data privacy concerns and demands adequate measures in planning and implementing the frameworks necessary to safeguard each individual's data while reaping the benefits of this technology.
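A minimal purpose-based consent ledger along the lines discussed above might look like the sketch below. All names are illustrative; the important behaviour is default-deny: with no recorded consent, processing for that purpose is refused.

```python
from datetime import datetime, timezone

class ConsentRegistry:
    """Illustrative purpose-based consent ledger with opt-out support."""

    def __init__(self):
        # (user_id, purpose) -> latest decision for that purpose
        self._decisions: dict[tuple[str, str], dict] = {}

    def record(self, user_id: str, purpose: str, granted: bool) -> None:
        """Store the user's latest decision; opting out overwrites a grant."""
        self._decisions[(user_id, purpose)] = {
            "granted": granted,
            "at": datetime.now(timezone.utc).isoformat(),
        }

    def is_allowed(self, user_id: str, purpose: str) -> bool:
        """Default-deny: no recorded consent means no processing."""
        decision = self._decisions.get((user_id, purpose))
        return bool(decision and decision["granted"])
```

Recording consent per purpose (rather than as a single yes/no) is what makes it possible to honour the granular choices that cookie banners and IoT dashboards present to users.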
and management, then build robust data governance frameworks and ensure that privacy and security considerations are built into their data consolidation activities. Furthermore, the data privacy regulations issued by regulatory bodies must be complied with and given due significance when gathering and consolidating data.
A harmful data setting, characterized by poor data quality, immoral data handling and inappropriate data management, leads to various substantial data privacy concerns. To begin with, Data Accuracy: data of poor quality can lead to imprecise and inadequate information, resulting in privacy risks when organizations use such data to make decisions, particularly where personal information is involved. Next, Consent and Transparency concerns the environment in which data management is unethical or opaque: obtaining informed consent for data processing becomes difficult, since individuals cannot clearly comprehend how their data is being used. To overcome these concerns, it is imperative to improve the organizational culture so that employees follow ethical data management practices and comply with data privacy regulations. Moreover, it may require reassessing the way data is gathered and used, and applying stringent procedures for data protection and compliance.
Using and safeguarding data privacy in a world of incessant data growth poses various challenges. To start with, Data Breach: huge datasets carry greater risks of data breaches; the more data an organization holds, the more malicious attackers seek to exploit or steal it. Next, Data Classification: it is hard to classify and label sensitive information when managing colossal volumes of data, which makes implementing specific security controls a challenging activity. Countering these concerns requires a blend of data governance, vigorous security measures and compliance with data privacy regulations. It is crucial that organizations apply wide-ranging data management policies with suitable data classification, access controls and encryption, so as to safeguard confidential information and preserve data privacy amid such data growth.
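The pairing of data classification with access controls described above can be sketched as follows. The sensitivity labels and field names are assumptions for illustration, not taken from any particular standard.

```python
# Sensitivity levels ordered from least to most restricted (illustrative).
LEVELS = ["public", "internal", "confidential", "restricted"]

def classify(record: dict) -> str:
    """Assign a coarse sensitivity label based on the fields present."""
    fields = set(record)
    if fields & {"ssn", "health_data", "payment_card"}:
        return "restricted"
    if fields & {"email", "phone", "address"}:
        return "confidential"
    if "employee_id" in fields:
        return "internal"
    return "public"

def can_access(clearance: str, record: dict) -> bool:
    """Grant access only when the clearance meets the record's label."""
    return LEVELS.index(clearance) >= LEVELS.index(classify(record))
```

In practice, classification like this is what makes the downstream controls (encryption at rest, masked display, restricted export) enforceable at scale, because each control can key off the label rather than inspecting raw data.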
privacy regulations. In this regard, gaining sufficient knowledge of, and following, these diverse guidelines can be challenging and time-consuming. Next, Legal Expertise: decoding and implementing the legal language of data privacy regulations generally requires the guidance of legal experts, which can be expensive for most organizations. Countering these concerns requires a detailed and systematized procedure for data privacy management; organizations must practise a proactive approach to complying with data privacy regulations, maintain data governance and consider expert legal opinion. Furthermore, staying up to date with amendments to the privacy requirements issued by regulatory bodies is essential for maintaining long-term compliance.
When discussing data privacy, the risk of data breaches and cyberattacks appears inevitable. To start with, Data Security involves protecting an organization's sensitive data from cyber threats, which demands great effort from the organization's security team. Next come technological advancements in cyber threats: attacks are becoming more sophisticated by the day and are extremely challenging to defend against. The basics of cyber security, ethics and law [4] portray different issues of the domain; in particular, ethical hacking and cyber war are elucidated. In addition, that work provides suggestions and suitable practices for cyber security professionals in different application areas. It further details the significance of renewed efforts to promote responsible state behaviour, which may require greater involvement of the private sector and civil society, both of which raise the stakes in cyber space. To overcome these challenges, it is vital that organizations take countermeasures such as risk assessments, employee awareness training and the deployment of robust security technologies, in addition to complying with data privacy regulations.
The analysis in this section yields one key outcome: Consent, Transparency and Consent Management from the layman's point of view must be implemented by organizations in line with the guidelines of regulatory bodies.
5 Proposed Solution
The proposed solution for the identified challenge, Consent, Transparency and Consent Management from the layman's point of view, is discussed in this section. The use case for this research work is a user browsing the internet to access information from websites, and consent and transparency from that user's perspective. Figure 3 compares the data lifecycle with a high-level view of real-time, end-to-end data usage.
The data source is where the actor starts to generate data. In our use case, the actor is a lay user who searches for information on an internet website and accesses it. The actor launches an internet browser (e.g., Google Chrome), starts typing search keywords into the search engine's web page and accesses the search results. Another scenario in the same use case is the same user launching social media applications (e.g., Facebook, Instagram) from a personal device (e.g., a mobile phone or laptop). Data creation begins from that point in time. The generated data is then stored, used, archived or destroyed at the organization level, under the supervision of regulatory bodies through ethics, compliance, regulations, policies, standards and laws.
Once the user opens the website from which they intend to access information, the site starts to collect data from the user through a fundamental component of web browsing and online interaction: cookie technology. Cookies are small pieces of data stored on your computer or device by the websites you visit. They are used primarily to track and maintain information about a user's online presence, activities, preferences and interactions. Their major purposes are listed below:
• User tracking—These cookies let websites track a user's behaviour, for instance their login status, the products in their shopping basket or the pages they recently visited. This allows the site to customize the user's experience and provide appropriate content.
• Authentication cookies—These are used for authentication purposes: when a user logs on to a web portal, they are issued a session ID, which identifies subsequent interactions and confirms the user's identity so that the portal does not repeatedly ask for login credentials.
• Remembering preferences—Cookies store user preferences, such as language, layout and notifications, to provide each user with a personalized experience.
• Targeted advertising—Cookies are used extensively in online advertising: advertising organizations use them to track user interests and then display advertisements specific to what the user has been browsing recently.
• Analytics—Website owners use these cookies to gather data on how users interact with their portals, and then use that information to enhance the website's performance and personalize the user experience.
• Session management—These cookies are imperative for managing user sessions, as they keep track of users as they move through the website, providing a seamless experience.
There are four different types of cookies:
• Session cookies: temporary session-management cookies that are deleted from your device when you close your web browser.
• Persistent cookies: cookies that remain on your device for a set period, or until you manually delete them; they are often used for remembering preferences and authenticating users.
• First-party cookies: cookies set by the website you are currently visiting, commonly used for session management and user preferences.
• Third-party cookies: cookies set by domains other than the one you are visiting, used for cross-site tracking such as analytics and ad publishing.
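The session/persistent distinction above can be demonstrated with Python's standard `http.cookies` module, which builds `Set-Cookie` headers. The cookie names and values are made up for the example.

```python
from http.cookies import SimpleCookie

cookies = SimpleCookie()

# Session cookie: no Max-Age/Expires, so the browser drops it on exit.
# HttpOnly and Secure limit exposure to scripts and plain HTTP.
cookies["session_id"] = "abc123"
cookies["session_id"]["httponly"] = True
cookies["session_id"]["secure"] = True

# Persistent cookie: Max-Age keeps it on the device for 30 days.
cookies["lang_pref"] = "en"
cookies["lang_pref"]["max-age"] = 30 * 24 * 3600
cookies["lang_pref"]["samesite"] = "Lax"

# Emit the Set-Cookie header lines a server would send.
for morsel in cookies.values():
    print(morsel.OutputString())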
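The session/persistent distinction above can be demonstrated with Python's standard `http.cookies` module, which builds `Set-Cookie` headers. The cookie names and values are made up for the example.

```python
from http.cookies import SimpleCookie

cookies = SimpleCookie()

# Session cookie: no Max-Age/Expires, so the browser drops it on exit.
# HttpOnly and Secure limit exposure to scripts and plain HTTP.
cookies["session_id"] = "abc123"
cookies["session_id"]["httponly"] = True
cookies["session_id"]["secure"] = True

# Persistent cookie: Max-Age keeps it on the device for 30 days.
cookies["lang_pref"] = "en"
cookies["lang_pref"]["max-age"] = 30 * 24 * 3600
cookies["lang_pref"]["samesite"] = "Lax"

# Emit the Set-Cookie header lines a server would send.
for morsel in cookies.values():
    print(morsel.OutputString())
```

Whether a cookie is first- or third-party is not a property of the header itself but of the domain that sets it relative to the page being visited.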
Consent, Transparency and Consent Management from the layman's point of view need to be implemented by organizations in line with the guidelines of regulatory bodies. It is vital to understand that while cookies have benefits, they may also raise privacy issues: there is a high possibility of a user's online behaviour being tracked, which may be unacceptable. To counter these issues, web browsers now offer options for managing cookies that allow the user to block them or delete them completely. In addition, regulations such as the California Consumer Privacy Act (CCPA) and the General Data Protection Regulation (GDPR) set out guidelines on how websites must obtain users' consent for tracking.
We propose that the best way to protect an individual's data is through Privacy-Enhancing Technologies (PETs), a wide set of tools, techniques and practices designed to protect and enhance individual privacy and data security in the current digital era. These technologies aim to give individuals more control over their personal information, reduce the risks associated with sharing data and mitigate the potential for surveillance and misuse of personal data. A few key aspects and examples of PETs are:
• Data Minimization: limits the data collected to only what is needed to fulfil the stated purpose.
• Masking: renders an individual's data unreadable when displayed or printed.
• Pseudonymization: replaces identifiable data with a consistent, reversible substitute value.
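These three PETs can be sketched in a few lines of Python. The helper names are illustrative; the pseudonymizer uses a simple lookup table to achieve the consistent, reversible mapping described above (a production system would protect that table as carefully as the raw data).

```python
import secrets

def minimize(record: dict, needed: set[str]) -> dict:
    """Data minimization: keep only the fields required for the purpose."""
    return {k: v for k, v in record.items() if k in needed}

def mask_email(email: str) -> str:
    """Masking: make the value unreadable when displayed or printed."""
    local, _, domain = email.partition("@")
    return local[0] + "***@" + domain

class Pseudonymizer:
    """Pseudonymization via a lookup table: the same input always maps
    to the same token (consistent), and the table allows an authorized
    holder to reverse the mapping."""

    def __init__(self):
        self._forward: dict[str, str] = {}
        self._reverse: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        if value not in self._forward:
            token = secrets.token_hex(8)  # random, unlinkable token
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def reidentify(self, token: str) -> str:
        return self._reverse[token]
```

Consistency is what keeps pseudonymized data analytically useful (the same person can be linked across records), while reversibility is what distinguishes pseudonymization from full anonymization.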
Privacy laws and regulations, that is, legal frameworks such as the European Union's General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), are not themselves technologies, but they play a highly significant role in shaping privacy practices by imposing rules and obligations on data handlers (organizations).
Regulatory bodies should act on the countermeasures proposed in this research work, so that organizations comply with the regulatory bodies' action points and take them into account when handling data ethically. One possible means of raising this awareness is to encourage organizations to run educational workshops, meetings and seminars on handling data in an ethical manner.
The benefit of this research work lies in its use of privacy-enhancing technologies (PETs), which comprise a wide range of tools, techniques and practices designed to protect and enhance an individual's privacy and data security. These technologies give users more control over their personal information, lower the risks associated with sharing data and mitigate the potential for surveillance and misuse of personal data. In addition, this study proposes embedding PET security controls into regulations, giving organizations greater focus on individuals' privacy.
The word "data" has begun to revolutionize the current digital era. In the rapidly evolving landscape of the digital age, data has emerged as the lifeblood of our interconnected world, shaping industries, driving innovation and fundamentally altering the way we live and work. As we stand on the threshold of the twenty-first century, the power of data has reached unprecedented levels, promising to revolutionize every aspect of our lives.
Data-driven decision-making is no longer a choice but a necessity. Businesses,
governments, healthcare systems and individuals are harnessing the power of data to
gain insights, make informed choices and drive progress. From advanced analytics
and artificial intelligence to the Internet of Things, the possibilities are boundless.
Forbes describes the future of data in comparison to the oil market: "Data is the new oil – and that's a good thing". One trending technological advancement is autonomous vehicles, although they are still in the development stage. Their advantages are widely publicized, namely safer roads and reduced rush-hour congestion; the greatest benefit, however, is lowering the greenhouse gases emitted by automobiles. Recent research by a team at Poznan University predicts that autonomous vehicles may ultimately reduce greenhouse gases by 50%. Accomplishing this requires humongous amounts of data, petabytes of it, feeding a data lake from which the advanced machine learning behind autonomous self-driving will be achieved.
On the other hand, these contemporary platforms will generate terabytes of data per vehicle per week. It is this "new oil" that is being accumulated, many extra bytes of data per year. Experts raise genuine concerns about how the tech giants use our confidential information; however, there are innumerable ways in which these data can enhance the way we live in this world [24].
The Economist likewise describes data, no longer oil, as the most precious resource of the future: "The world's most valuable resource is no longer oil, but data". In this digital world, tech giants such as Alphabet, Apple, Microsoft and Amazon deal in data in an unstoppable manner, where once oil was the resource in question. Their successful exploitation of user data has, on the whole, benefitted consumers. Few users would want to do without Google's search engine or Amazon's one-day delivery, and these firms do not raise concerns when the usual antitrust tests are applied. Notably, many of the services provided by these organizations are free; users instead pay by providing more of their data [25].
As we move deeper into the future, the potential of data is limitless. However, its
power must be harnessed responsibly, with a commitment to ethics and privacy. The
fusion of technology and data offers us an exciting future, full of opportunities for
progress and a more connected, efficient and informed world. Embracing the power
of data is not just a choice; it is a transformative journey that will define our future.
References
1. Bhageshpur K: Data Is the New Oil—And That’s A Good Thing. Forbes Technology Council
2. Bibri, S.E., Alexandre, A., Sharifi, A., Krogstie, J.: Environmentally sustainable smart cities
and their converging AI, IoT, and big data technologies and solutions: an integrated approach
to an extensive literature review. Energy Inform. 6(9), 32 (2023)
3. Christen, M., Gordijn, B., Loi, M.: The ethics of cybersecurity. In: International Library of
Ethics, Law and Technology, pp. 1–8. Springer Science and Business Media B.V (2020)
4. Gellert, R.: Data protection law and responsible data science. In: Data Science for Entrepreneur-
ship. pp. 413–439. Springer, Cham (2023)
5. Gomathi, L., Mishra, A.K., Tyagi, A.K.: Industry 5.0 for healthcare 5.0: Opportunities, chal-
lenges and future research possibilities. In: 7th International Conference on Trends in Elec-
tronics and Informatics, ICOEI 2023—Proceedings, pp. 204–213. Institute of Electrical and
Electronics Engineers Inc. (2023)
6. Grace, J.: Exploring algorithmic justice for policing data analytics in the United Kingdom. In:
Privacy, Technology, and the Criminal Process, pp. 18–38. Taylor and Francis (2023)
7. Hassani, H., Silva, E.S.: The role of ChatGPT in data science: how AI-assisted conversational interfaces are revolutionizing the field. Big Data Cogn. Comput. 7 (2023). https://doi.org/10.3390/bdcc7020062
8. Jain, P., Gyanchandani, M., Khare, N.: Enhanced secured map reduce layer for big data privacy
and security. J. Big Data 6 (2019). https://doi.org/10.1186/s40537-019-0193-4
9. Jiang, R., Bouridane, A., Li, C.T., Crookes, D., Boussakta, S., Hao, F., Edirisinghe, E.A.: Big
Data Privacy and Security in Smart Cities. Springer, Cham (2022)
10. Kaufmann, U.H., Tan, A.B.C.: Why data analytics is important? In: Data Analytics for
Organisational Development. pp. 1–20. John Wiley & Sons (2021)
11. Ma, Z., Clausen, A., Lin, Y., Jørgensen, B.N.: An overview of digitalization for the building-to-grid ecosystem. Energy Inform. 4(Suppl. 2), Article 36 (2021). https://doi.org/10.1186/s42162-021-00156-6
12. Mühlhoff, R., Willem, T.: Social media advertising for clinical studies: Ethical and data protec-
tion implications of online targeting. Big Data Soc. 10 (2023). https://doi.org/10.1177/205395
17231156127
13. Mühlhoff, R.: Predictive privacy: Towards an applied ethics of data analytics. Ethics Inf.
Technol. 23, 675–690 (2021). https://doi.org/10.1007/s10676-021-09606-x
14. Myers, N.E., Kogan, G.: Emerging AI and data analytics tooling and disciplines. In: Self-service
data analytics and governance for managers, pp. 25–49. John Wiley & Sons, Inc (2021)
15. Nevaranta, M., Lempinen, K., Erkki, K.: Students’ perceptions about data safety and ethics in
learning analytics (2020).
16. O’Regan, G.: Ethics and privacy. In: Concise Guide to Software Engineering. Springer, Cham
(2022)
17. O’Regan, G.: Introduction to data science. In: Mathematical Foundations of Software
Engineering, pp. 385–398. Springer, Cham (2023)
18. Papadogiannakis, E., Papadopoulos, P., Kourtellis, N., Markatos, E.P.: User tracking in the
post-cookie era: How websites bypass gdpr consent to track users. In: WWW ’21: Proceedings
of the Web Conference 2021, pp. 2130–2141. Creative Commons Attribution 4.0 International
(2021)
19. Shah, S.I.H., Peristeras, V., Magnisalis, I.: DaLiF: A data lifecycle framework for data-driven
governments. J. Big Data 8 (2021). https://doi.org/10.1186/s40537-021-00481-3
20. Shukla, S., George, J.P., Tiwari, K., Varghese Kureethara, J.: Data privacy. In: Data Ethics and
Challenges, pp. 17–39. Springer, Singapore (2022)
21. Subramanian, R.: Have the cake and eat it too: Differential privacy enables privacy and precise
analytics. https://doi.org/10.21203/rs.3.rs-1847248/v1 (2022)
22. The world’s most valuable resource is no longer oil, but data, https://www.economist.com/lea
ders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data
23. Trunk, A., Birkel, H., Hartmann, E.: On the current state of combining human and artificial intelligence for strategic organizational decision making. Bus. Res. 13, 875 (2020). https://doi.org/10.1007/s40685-020-00133-x
24. Wylde, V., Rawindaran, N., Lawrence, J., Balasubramanian, R., Prakash, E., Jayal, A., Khan,
I., Hewage, C., Platts, J.: Cybersecurity, data privacy and blockchain: A review. SN Computer
Sci. 3 (2022). https://doi.org/10.1007/s42979-022-01020-4
25. Zambas, M., Illarionova, A., Christou, N., Dionysiou, I.: Exploring user attitude towards
personal data privacy and data privacy economy. In: Proceedings of the Second International
Conference on Innovations in Computing Research (ICR’23), pp. 237–244. Springer, Cham
(2023)
Modern Real-World Applications Using
Data Analytics and Machine Learning
Abstract Modern technology has given rise to powerful techniques such as machine learning, big data, and data analytics that are revolutionising how businesses function and make decisions. Business and marketing, healthcare, finance, manufacturing and supply chains, transportation and logistics, and energy utilisation are only a few of the disciplines whose practical applications are summarised in this chapter. Precision medicine has evolved greatly via genetic data analysis, and big data analysis of electronic health records (EHRs) allows for better patient treatment. AI and data analytics have a significant impact on risk assessment and fraud detection in the financial sector. Analytical methods in manufacturing and supply chain optimisation are highlighted in research articles. Significant progress has been made in lowering operating expenses and equipment downtime thanks to machine learning-driven predictive maintenance. In transportation, data analytics underpins real-time monitoring and route optimisation, while advancements in safety and dependability include machine learning-powered autonomous cars and predictive maintenance strategies. Grid management and energy usage have been optimised in the energy industry via the use of big data and data analytics: equipment breakdowns may be predicted and energy production efficiency increased using machine learning. Customised learning
Vijayakumar Ponnusamy
Department of Electronics and Communications, SRM Institute of Science and Technology,
Kattankulathur, Chennai, India
e-mail: vijayakp@srmist.edu.in
Nallarasan V. · Gouthaman P. (B)
Department of Networking and Communications, SRM Institute of Science and Technology,
Kattankulathur, Chennai, India
e-mail: gouthamp@srmist.edu.in
Nallarasan V.
e-mail: nallarav@srmist.edu.in
Rajasegar R. S.
IT Industry, Cyber Security, Country Louth, Ireland
Arivazhagan N.
Department of Computational Intelligence, SRM Institute of Science and Technology,
Kattankulathur, Chennai, India
e-mail: arivazhn@srmist.edu.in
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 215
P. Singh et al. (eds.), Data Analytics and Machine Learning, Studies in Big Data 145,
https://doi.org/10.1007/978-981-97-0448-4_11
216 Vijayakumar Ponnusamy et al.
and evaluation are enabled via the use of data analytics. Content distribution and student engagement are aided by machine learning algorithms, such as recommendation systems. To summarise, data analytics, big data, and machine learning have broad and extensive uses in a variety of fields. These applications have revolutionised decision-making processes, increased productivity, and stimulated creativity in the contemporary world. These innovations will be very important in determining how different fields develop in the future.
1 Introduction
A time when an enormous amount of data is being generated has led, through innovation and technology, to the emergence of groundbreaking technologies such as big data, machine learning and data analytics. These technologies have become integral to industries, revolutionising the way organisations perceive and utilise data. They play a role in decision-making processes, influencing everything from corporate boardrooms and hospital wards to financial markets and factory floors. Data analytics, big data and machine learning have proven to be tools that drive progress, enhance operations and unveil insights hidden within vast amounts of information.
The main objective of this research is to showcase how these technologies are used in practice across industries. They can boost productivity, support informed decision-making, and drive significant innovation across sectors such as business and marketing, healthcare, finance, manufacturing and supply-chain management, logistics and transportation, energy and utilities, and education. By delving into the applications of big data, machine learning, and data analytics in these industries, their undeniable impact on the future of global business becomes clear [1].
Business and Marketing: Customer Segmentation and Personalisation: Organisations employ data analytics and machine learning to partition their customer base and tailor their marketing to individual customers; Netflix, for example, provides personalised content recommendations to its users based on their viewing patterns via machine learning algorithms. Market Trend Analysis: Big data analysis enables organisations to swiftly monitor and react to market trends; retailers such as Amazon use big data to monitor consumer behaviour, optimise inventory, and expedite product delivery. Demand Forecasting and Pricing: Machine learning algorithms are employed in dynamic pricing strategies, while big data assists demand forecasting; online merchants adjust product prices in response to market conditions, and airlines set ticket prices dynamically to maximise revenue [2].
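As an illustration of the demand-forecasting idea above, the following sketch fits a linear demand curve to invented price and sales figures (all numbers are hypothetical) and forecasts demand at a new price point:

```python
# Hypothetical illustration: fit a linear demand curve (units sold vs. price)
# by ordinary least squares and forecast demand at a candidate price.
# All prices and sales figures are invented.

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

prices = [10, 12, 14, 16, 18]        # past price points
demand = [200, 180, 160, 140, 120]   # units sold at each price

a, b = fit_line(prices, demand)
forecast = a + b * 15                # expected demand if the price is set to 15
```

A real dynamic-pricing system would, of course, use far richer features and models; the point here is only that historical data yields a demand estimate for each candidate price.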
Healthcare: Disease Prediction and Prevention: Algorithms based on machine learning and data analytics are used to identify at-risk individuals and forecast disease outbreaks; by analysing massive amounts of patient data with machine learning, it is possible to predict conditions and limit the spread of disease. Big data
Modern Real-World Applications Using Data Analytics and Machine … 217
analysis facilitates precision medicine and genomics research by enabling the development of individualised treatment strategies based on genetic profiles. Electronic Health Records (EHRs): Through the application of data analytics, EHRs can enhance patient care, determine the efficacy of treatments, and streamline hospital operations.
Data analytics [3] and machine learning are vital for financial risk assessment, fraud detection, investment, and credit assessment. Real-time analytics over large data sets enables financial organisations to execute trading decisions quickly through algorithmic trading, and financial specialists can use big data to assess social-media and news sentiment when making financial judgements.

In manufacturing and the supply chain, machine learning algorithms applied to predictive maintenance reduce downtime and maintenance costs, data analytics helps maintain product quality throughout the production process, and big data analysis optimises inventory management by maintaining product availability while reducing carrying costs.

In logistics and transportation, route optimisation using data analytics can save fuel and speed up deliveries, while machine learning algorithms direct autonomous vehicles and help assure traffic safety using real-time data; machine learning in transportation systems can also forecast maintenance needs and improve infrastructure and vehicle safety [4].

In energy and utilities, big data and data analytics help optimise electricity and energy utilisation, helping both utilities and consumers save energy. Smart-grid technology optimises energy distribution and management by analysing data, reducing costs and energy loss, and energy companies use predictive maintenance, forecasting equipment failures with machine learning in order to avoid them.

In education, customised training is tailored to each student's requirements and abilities: by adjusting learning-session length and subject material, data analytics can meet individual students' demands. Machine learning can save instructors time and give students immediate feedback in grading and evaluation, and recommendation systems give users tailored suggestions based on their preferences, interests, and historical activity, improving students' education by recommending suitable instructional content.

These examples demonstrate how big data, machine learning, and data analytics can boost creativity, efficiency, and data-driven decision-making across numerous sectors, and how these technologies will change the way organisations operate in a data-rich world [5].
Machine learning (ML) algorithms learn patterns and variations in data to predict future outcomes with a certain degree of inherent uncertainty. ML algorithms are broadly classified into four types:
1. Supervised learning
2. Unsupervised learning
3. Semi-supervised learning
4. Reinforcement learning
Supervised algorithms are also known as predictive algorithms: using a mapping built from previously seen inputs and their outputs, they can categorise new inputs or forecast their outcomes. In supervised algorithms, information about inputs and their corresponding outputs directs the learning process; learning occurs as the inputs are mapped to the outputs. The final ML model is refined by repeated exposure to the training data, and the resulting supervised algorithm is then evaluated against chosen criteria and tested on real-world data.
Supervised algorithms may be broken down into two categories. Regression methods produce a continuous numeric output value; forecasting the next day’s temperature, for instance, is a regression task. Algorithms whose outputs are classes or categories are called classification algorithms; determining whether the next day will be “sunny,” “overcast,” “rainy,” or “cloudy” is a classification task. Linear regression, SVM, regression trees, and logistic regression are among the state-of-the-art supervised algorithms.
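The regression/classification contrast can be sketched on toy weather data (all values invented): a least-squares line predicts a numeric temperature, while a simple nearest-neighbour vote assigns a categorical sky label.

```python
# Toy contrast between the two supervised categories; all weather values
# are invented. Regression yields a number, classification yields a label.

# Regression: least-squares line predicting tomorrow's temperature (numeric).
days = [1, 2, 3, 4, 5]
temps = [20.0, 21.0, 22.0, 23.0, 24.0]
n = len(days)
mx, my = sum(days) / n, sum(temps) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(days, temps))
         / sum((x - mx) ** 2 for x in days))
intercept = my - slope * mx
next_temp = intercept + slope * 6          # numeric output -> regression

# Classification: k-nearest-neighbour vote mapping humidity to a sky label.
def knn_classify(train, query, k=3):
    ranked = sorted(train, key=lambda pair: abs(pair[0] - query))
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)

labelled = [(90, "rainy"), (85, "rainy"), (30, "sunny"), (25, "sunny"), (40, "sunny")]
next_sky = knn_classify(labelled, 88)      # categorical output -> classification
```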
Learning in unsupervised algorithms happens without labelled data: the training examples are presented to the algorithm without target or output values. The algorithm learns the underlying patterns and similarities in the data to discover veiled knowledge, a process commonly referred to as knowledge discovery. Unsupervised algorithms generally fall under the following categories:

• Clustering algorithms: knowledge discovery happens by uncovering the inherent similarities among the training data.
• Association algorithms: these extract rules that can describe large classes of data.

Common unsupervised algorithms include k-means clustering and fuzzy c-means.
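A minimal one-dimensional k-means sketch on invented sensor readings shows how clustering uncovers structure without any labels:

```python
# Minimal 1-D k-means sketch (k = 2) on invented sensor readings: the algorithm
# receives no labels, yet uncovers the two groups inherent in the data.

def kmeans_1d(points, centres, iters=10):
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centre.
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # Update step: move each centre to the mean of its cluster.
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters

points = [1.0, 1.2, 0.8, 9.8, 10.0, 10.2]
centres, clusters = kmeans_1d(points, centres=[0.0, 5.0])
```

On this data the centres settle near 1.0 and 10.0, splitting the readings into the two groups a human would see at a glance.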
In reinforcement learning, learning takes place by having software agents work out an ideal behaviour in a given environment so as to yield maximum performance. The agents are iteratively rewarded via reinforcement feedback signals; this signal is the central factor guiding the agent to adapt to, or learn, the environment and decide on its next step. The outcome of learning is an optimal policy that maximises the agent’s performance. Reinforcement learning is especially common in the development of robots for specified tasks. Well-known reinforcement learning algorithms include Q-learning and deep Q-networks.
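A toy tabular Q-learning sketch (environment, rewards, and hyperparameters all invented) illustrates the reward signal guiding an agent toward an optimal policy:

```python
# Toy tabular Q-learning sketch: an agent on a four-cell line learns to walk
# right towards a reward in the last cell. Environment, rewards, and
# hyperparameters are invented for illustration.
import random

random.seed(0)
n_states = 4
actions = [-1, +1]                          # move left / move right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2       # learning rate, discount, exploration

for _ in range(500):                        # training episodes
    s = 0
    while s != n_states - 1:
        if random.random() < epsilon:       # explore occasionally
            a = random.choice(actions)
        else:                               # otherwise act greedily
            a = max(actions, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), n_states - 1)
        reward = 1.0 if s_next == n_states - 1 else 0.0
        # Reinforcement feedback signal: temporal-difference update.
        Q[(s, a)] += alpha * (reward + gamma * max(Q[(s_next, b)] for b in actions)
                              - Q[(s, a)])
        s = s_next

# The learned greedy policy moves right (+1) from every non-terminal state.
policy = {s: max(actions, key=lambda act: Q[(s, act)]) for s in range(n_states - 1)}
```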
Deep learning (DL) algorithms are a subset of ML. They attempt to replicate human learning, constructing algorithmic frameworks rooted in human cognitive processes. DL algorithms are well suited to learning specific kinds of knowledge from the domain in which they are applied: they uncover hidden knowledge and representations through multiple processing levels, delivering high-level learning by building on lower-level features.
Information extraction (IE) extracts entities from documents, matches predefined templates to raw data, gathers information intended for specific people, and enables users to get the most out of predefined templates. In recent times, the quantity of textual information has increased exponentially, mostly in the form of unstructured data. People have migrated to online platforms for everyday tasks such as reading, exchanging information through social networks, and consulting physicians via Android applications. Information growth is driven largely by the use of computers across all disciplines, and information is widely accessed as plain text, code-mixed data, and acronyms.
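A small sketch of template-based extraction, using an invented sentence: regular expressions act as predefined templates that pull dates and e-mail addresses out of unstructured text.

```python
# Template-style information extraction on an invented sentence: regular
# expressions act as predefined templates pulling entities out of raw text.
import re

text = ("Dr. Rao saw the patient on 2023-07-14 and sent the report "
        "to clinic@example.org for follow-up.")

dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)          # date entities
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)   # e-mail entities
```

Production IE systems add named-entity recognisers and learned models, but the template idea is the same: a pattern describes the entity, and matches become structured data.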
Everyday lives and attitudes are going to be affected by the long-awaited Internet of Things (IoT): a globally linked network of things such as embedded systems, mechanical devices, and sensors. These devices are assigned IP addresses and connected to one another so that they can send and receive packets across a network. The objects can be connected wirelessly or by wire, although wireless links are preferred and widely used for their adaptability. The aim is to accomplish the desired tasks with the least possible human intervention, and a collection of such objects can engage in cooperative behaviour. IoT devices often exhibit constraints in computing capability, cost, power consumption, bitrate, range, processing capacity, storage capacity, battery life, and operator counts, and IoT networks must link a diverse range of devices.
The Internet of Things architecture has five tiers (Fig. 1). These layers are as follows:
• Perception layer: This layer comprises the physical objects, such as RFID tags, actuators, and sensors. Its main job is gathering the desired data and transforming it into information; actuators, for instance, act on control signals.
• Transmission layer: The primary role of this layer is to accept control signals from the middleware layer and to transmit the data collected by the perception layer up to the middleware layer via networking technologies.
• Middleware layer: Decisions are formulated based on findings derived from analysis of the data gathered at the transmission layer.
• Application layer: IoT applications reside in this fourth layer; it provides end users with capabilities based on the processed data.
• Business layer: This fifth layer manages the overall IoT system, building services and business models on the output of the application layer.
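The layered flow described above can be sketched as a chain of hypothetical functions; all function names, fields, and thresholds here are invented for illustration.

```python
# Hypothetical sketch of data flowing up the layered IoT stack described above;
# all function names, fields, and thresholds are invented for illustration.

def perception_layer():
    """Gather raw data from a (simulated) temperature sensor."""
    return {"sensor_id": "t-01", "raw_celsius": 31.6}

def transmission_layer(packet):
    """Carry the packet upward (a real stack would use e.g. MQTT or 6LoWPAN)."""
    return dict(packet, transport="simulated")

def middleware_layer(packet):
    """Analyse the transmitted data and formulate a decision."""
    packet["overheated"] = packet["raw_celsius"] > 30.0
    return packet

def application_layer(packet):
    """Expose an end-user capability based on the processed data."""
    return "cooling ON" if packet["overheated"] else "cooling OFF"

decision = application_layer(middleware_layer(transmission_layer(perception_layer())))
```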
In numerous industries, big data, machine learning, and data analytics have the potential to significantly increase output, innovation, and data-driven decision-making; the practical implementations above serve as evidence for this claim. As technology advances, organisations are expected to modify their operations to accommodate an ever more data-rich environment. Data analytics consists of examining, filtering, and transforming unprocessed data in order to discern significant patterns, conclusions, and insights, drawing on an extensive array of computational, statistical, and mathematical techniques. Its essential elements: descriptive analytics answers the question “What occurred?”, operating on historical data and generating reports from its analysis, while diagnostic analytics seeks to determine “Why did it occur?” through an in-depth analysis of historical data [6].
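On invented monthly sales figures, the two questions can be sketched side by side: descriptive analytics reports what occurred, and a simple segment drill-down hints at why.

```python
# Invented monthly sales figures: descriptive analytics reports what occurred;
# a simple drill-down by segment hints at why it occurred.
from statistics import mean

sales = {"Jan": 120, "Feb": 118, "Mar": 64, "Apr": 121}

# Descriptive: "What occurred?" - summarise the historical record.
average = mean(sales.values())
worst_month = min(sales, key=sales.get)

# Diagnostic: "Why did it occur?" - drill into the anomalous month by region
# (regional numbers likewise invented).
march_by_region = {"north": 60, "south": 4}
cause = min(march_by_region, key=march_by_region.get)
```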
The problem statements for data analytics, big data, and machine learning address the inherent issues of each field [11]. In data analytics, organisations must organise and analyse complicated data to extract insights; because data is always changing, how can analytics remain accurate and relevant? Integration with several data sources involves combining diverse sources for a complete analysis, and validation and verification, though difficult, are what ensure that data findings are accurate.
Given the exponential growth of data, what are the cost-effective solutions to
store data in the big data industry? Given the rate of data creation, processing speed
is necessary for large-scale real-time or near-real-time processing. Decentralisation
concerns how to decentralise data without sacrificing availability or integrity since
centralised data storage systems can become bottlenecks or single points of failure
in big data. Data quality concerns how to maintain or improve data quality in the
face of an abundance of data from various sources.
Training data collection in machine learning involves gathering sufficient and representative data to build strong models.
• Overfitting and Generalisation: How can models be taught to work on fresh data as well as on the training set [12]?
• Explainability and Transparency: Given their “black-box” nature, how can many
machine learning algorithms, particularly deep learning, make their conclusions
more visible and explainable?
• Best Model Selection: With so many alternatives, how can a task’s best model be
chosen?
• Bias and Fairness: How can machine learning models be built and taught to be
bias-free and objective?
These problem statements illustrate field-specific issues. Problem-solving fuels
these professions’ progress. When one issue is solved, another arises, showing how
technology is changing and assimilating into society.
1.5 Opportunities
Businesses may boost consumer satisfaction and profits with data-driven decisions; trend prediction and recommendation are the aims of predictive and prescriptive analytics. Tailoring goods, services, and information to each user’s interests increases engagement, and machine learning can automate difficult jobs such as customer-care chatbots and autonomous cars. New revenue streams arrive: knowledge can be swiftly commercialised or used to create new goods and services, and data analysis can save a great deal of money by identifying waste, redundancy, and inefficiencies. Healthcare diagnosis, treatment, and care may be improved using machine learning and sophisticated analytics; in finance, applications include fraud detection, algorithmic trading, and robo-advisors. Smart cities can improve traffic flow, public safety, and energy management via data and ML; machine learning may accelerate materials-science and health research; and, as social benefits, conservation, climate change, and public health may be addressed using data and machine learning. Consumer applications such as real-time content suggestions and industrial settings such as predictive maintenance may both benefit from real-time analytics.
Data analytics, machine learning, and big data will continue to grow, creating new possibilities and difficulties. As companies adopt these tools and the technology advances more quickly, the opportunities for data-driven, effective solutions increase [13].
The main research areas determined by hierarchical clustering, as represented by the author of this publication, were machine learning approaches, text mining, event extraction, recommendation systems, automated journalism, online comment analysis, and exploratory data analysis. Possible research directions include improving paywall systems, looking into recommendation systems, putting cutting-edge automated-journalism solutions to the test, and developing models to improve personalisation and interactivity features.
Figure 2 shows the input data received from the patient and its pre-processing. After feature extraction and selection, a model is trained to predict the patient’s disease.
The authors [18] examine Type II diabetes, its global prevalence, its effects, and its early detection. They stress that hybrid meta-heuristic machine learning with big-data feature selection can support early and thorough diagnosis, test hybrid models against baseline and classic machine learning models, and discuss the importance of big-data feature selection for Type II diabetes. Problems include class imbalance, overfitting, and data quality. They explain the advantages of hybrid meta-heuristic machine learning and big-data feature selection for early Type II diabetes detection; alternative research directions involve finding other hybrid combinations or employing different genetic data.
The authors [19] discuss the difficulties of projecting nonlinear systems and the significance of anticipating changes in water contaminated by mining, which has environmental and public-health repercussions. The study’s objective is to show how machine learning techniques can be used to predict mining-induced fluctuations in water data. The techniques utilised in the research were Support Vector Machines, Random Forest, and Neural Networks (for identifying nonlinear patterns), with Linear Regression establishing the baseline. To assess and contrast the performance of the various models, pertinent forecasting metrics are applied, including mean absolute error, mean squared error, and the R² score, and graphical representations, such as comparisons between predicted and actual values, enhance intuitive comprehension of model performance. The summary of the key findings highlights the success, or lack thereof, of machine learning in predicting water data impacted by mining; alternatives include expanding the data set, examining other machine learning models, or incorporating domain-specific data.
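The forecasting metrics named above can be computed from scratch; the predicted and actual water-quality values below are invented for illustration.

```python
# The three metrics computed from scratch on invented predicted-vs-actual
# water-quality readings (e.g. pH); names and values are illustrative only.

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r2(actual, predicted):
    """Coefficient of determination (R^2)."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [7.0, 6.8, 7.4, 6.9]        # measured values
predicted = [7.1, 6.7, 7.3, 7.0]     # model output
```

MAE penalises all errors linearly, MSE punishes large errors more, and R² reports the fraction of variance explained, which is why studies typically report all three.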
Predictive maintenance (PdM) methodologies have become widely utilised by organisations to oversee the health status of industrial equipment since the advent of Industry 4.0 intelligent systems and machine learning (ML) within artificial intelligence (AI). Owing to the emergence of Industry 4.0 and advancements in information technology, computerised control, and communication networks, the accumulation of vast quantities of data concerning the operational and process conditions of diverse equipment has become feasible. Using this information, automated problem detection and diagnosis aim to reduce downtime, maximise component utilisation, and extend the remaining useful life of components. Recent developments cover ML techniques frequently implemented in PdM for smart manufacturing in Industry 4.0; the methods are classified by the ML algorithms employed, the ML category, the instruments and apparatus used, the data-collection device, the quantity and nature of the data, and the principal contributions made by the researchers [20].
Figure 3 represents forecasting with ML algorithms. The first step collects the historical data, the second step trains an ML model for prediction, and validation finally yields the forecasting results.
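These three steps can be sketched with a simple moving-average forecaster on invented demand data: collect history, "train" (here, merely choose a window), and validate on held-out points.

```python
# Sketch of the three steps on invented demand data: collect history, "train"
# a simple moving-average forecaster, then validate it on held-out points.

def moving_average_forecast(history, window=3):
    """Predict the next value as the mean of the last `window` observations."""
    return sum(history[-window:]) / window

series = [100, 102, 101, 103, 105, 104, 106, 108]   # step 1: historical data

split = 5                                           # first 5 points are "training"
errors = []
for t in range(split, len(series)):                 # steps 2-3: walk-forward validation
    prediction = moving_average_forecast(series[:t])
    errors.append(abs(series[t] - prediction))
validation_mae = sum(errors) / len(errors)          # mean absolute error on hold-out
```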
Smart grids have growing significance for energy management, sustainability, and the continuing worldwide energy revolution, which highlights the importance of big data analytics and the rapid expansion of data that smart-grid technologies produce. Security and privacy concerns include preventing unauthorised access and data manipulation and protecting user privacy; data quality is guaranteed by assuring consistency, accuracy, and timeliness; and it may be difficult to integrate numerous obsolete formats, systems, and data sources [14].
The author [15] presents a comprehensive examination of the prominence of big data in the tourism industry and its present-day relevance; the objective of the literature review is to summarise the existing body of knowledge concerning big data analytics in tourism. Big data pertaining to tourism can be broadly classified into three main categories: user-generated content (UGC), including website text and image content; device-generated data, including GPS, mobile roaming, and Bluetooth data; and operation-generated data, including transaction data from visited websites, online searches, and online reservations. Big data insights also inform the implementation of AR and VR in the tourism industry.
Numerous business disciplines, including operations, supply chain, marketing, and accounting, have effectively implemented big data analytics. The significance of big data analytics within the supply chain is increasing due to recent advancements in machine learning and computer architecture. Given the increasing prevalence of big data analytics in supply chains, this study provides a comprehensive evaluation of previous research on the subject; one example is evaluating product performance with the aid of data analytics to inform decisions regarding the introduction, cessation, or alteration of products [16].
The authors [10] identify five perspectives through which BDA applications in healthcare can be examined, including the management of hospitals, the treatment of specific medical conditions, the interaction with stakeholders in the healthcare ecosystem, and the delivery of healthcare services through the use of technology. Nevertheless, certain constraints do exist.
The field of data analytics presents various prospects, and machine learning and big data are implemented in numerous industries. The following areas are experiencing rapid and significant developments.

• Expansion of Data Sources: The proliferation of the Internet of Things (IoT), ubiquitous technology, and embedded systems will further propel the exponential accumulation of data, making real-time analytics and deeper insights possible across sectors.
• Quantum Computing: As it grows more capable and fast, quantum computing might change machine learning and large-scale data processing.
• The merging of AR and VR: Using AR and VR will make data visualisation
more immersive, helping data scientists and companies understand complex data
patterns.
• Automated Machine Learning (AutoML): automates the pipeline’s most difficult
phases to make machine learning more accessible to non-experts. It also speeds
up model selection.
• Explainable AI (XAI) develops visible, intelligible models and is growing in popu-
larity. Criminal justice, economics, and healthcare decision-making are greatly
impacted by this.
• Federated Learning and Data Privacy: As data privacy concerns grow, feder-
ated learning allows algorithm training across several devices while protecting
localised data. Patient anonymity is crucial in healthcare, thus this may change
the industry.
• The Integration of Predictive and Tailored Medicine: As medical technology and
genetics advance, treatment regimens will include patient health data. In addition,
predictive analytics may inform users of prospective health issues.
• COVID-19 has made supply chain optimisation more important. Machine
learning, big data, decision-making automation, real-time monitoring, and predic-
tive analytics for demand forecasting help protect supply chains from interrup-
tions.
• Sustainability: Integrating data analytics and machine learning into climate-change, energy-efficiency, and environmental strategies can support sustainable practices.
• Neural Symbolic Integration: The combination of symbolic reasoning and neural
networks (logic-based AI) enables the development of AI models that exhibit
adaptability and interpretability akin to both symbolic and neural models.
• Ethics and Bias in AI: In the future, there will be greater emphasis on regulations
and standards that guarantee the objectivity, morality, and fairness of machine
learning models.
Fig. 5 Applications of data analytics, big data, and machine learning system model
Smart grids have growing relevance for sustainability, energy efficiency, and the global energy transition, and the exponential growth of the data they produce makes big data analytics essential. Among smart-grid uses of big data, historical data, weather predictions, and other sources may be used to predict electricity usage over time, while integrating renewable energy and stabilising the system requires forecasting solar and wind power production.
The rise of data analytics in healthcare emphasises the need to use data in decision-making. Data analytics aids medical diagnosis, treatment, patient monitoring, and hospital management: EHR mining is employed to optimise hospital management, patient health, and service delivery, while predictive analytics forecasts patient admissions, disease outbreaks, and illness progression. Telemedicine and remote monitoring employ wearable technology to monitor patients around the clock and give data-driven virtual health consultations. Genomic data analysis predicts health outcomes, assesses sickness propensities, and customises therapies, and medical image analysis is improving CT, MRI, and X-ray interpretation with AI and machine learning.
Examples of Data Analytics in Medical Imaging: Medical imaging plays a crucial role in contemporary healthcare, as shown by the substantial number of imaging procedures prescribed by physicians in the United States, around 600 million annually. Nevertheless, evaluating and preserving these images incurs significant costs in both time and money: radiologists must meticulously analyse each image individually, while hospitals are obligated to retain the images for several years.
The use of big data analytics in the healthcare sector enables algorithms to effectively evaluate vast quantities of images, numbering in the hundreds of thousands. Identifying distinct patterns within the pixels and converting them into numerical data facilitates the physician’s diagnostic process. Moreover, radiologists could be relieved of visually examining every image, shifting their focus to analysing the results generated by the algorithms, which can take in and process more images than a human could examine in a lifetime. Big data analytics thus has the ability to bring about a transformative impact on medical imaging and generate notable efficiencies within the healthcare system.
Privacy means protecting patient data in healthcare, and patients must understand data collection and usage in order to give informed consent. Algorithmic bias concerns technological biases in data analytics that may cause unfair or discriminatory health effects. Data ownership asks who owns health data: patients, providers, or other parties. Accountability and transparency demand responsibility for the judgements of healthcare-analytics algorithms. The benefits include more individualised therapy, cheaper drugs, better patient outcomes, and predictive capability; the challenges include data integration across platforms, real-time analysis, data quality, and ethical problems.
• Source: Find out where mining-impacted water data originated from (government
databases, mines).
• Water quality parameters: Turbidity, metal content, and pH were measured.
• Data Preprocessing: Discuss normalisation and missing value resolution to
prepare data for machine learning.
The machine learning methods utilised in the research include Support Vector Machines, Random Forest, and Neural Networks for nonlinear patterns, with Linear Regression as the baseline.
This section discusses identifying and engineering important features from raw data to increase forecasting accuracy.
Training and Validation: The data was split into training, validation, and test sets, and model hyperparameters were tuned, possibly using cross-validation.
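The split-and-tune step can be sketched as a small k-fold cross-validation loop choosing between two hypothetical candidate "models" (constant predictors) on invented data:

```python
# Sketch of the split-and-tune step: k-fold cross-validation choosing between
# two hypothetical candidate "models" (constant predictors). Data invented.

def kfold_indices(n, k):
    """Yield (train, test) index lists for k contiguous folds."""
    fold = n // k
    for i in range(k):
        test = list(range(i * fold, (i + 1) * fold))
        train = [j for j in range(n) if j not in test]
        yield train, test

data = [2.0, 2.1, 1.9, 2.2, 2.0, 1.8]

def cv_error(fit_predict, k=3):
    """Mean absolute error of a candidate model across k folds."""
    errors = []
    for train, test in kfold_indices(len(data), k):
        prediction = fit_predict([data[j] for j in train])  # "fit" on train fold
        errors.extend(abs(data[j] - prediction) for j in test)
    return sum(errors) / len(errors)

def mean_model(xs):   # candidate 1: predict the training mean
    return sum(xs) / len(xs)

def zero_model(xs):   # candidate 2: always predict zero
    return 0.0

best = min([mean_model, zero_model], key=cv_error)  # model selection
```

Real pipelines cross-validate over hyperparameter grids rather than two toy candidates, but the mechanism of scoring each candidate on held-out folds is the same.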
3.3.4 Communication
The models have yielded insights that warrant a summary, including a determination of the significant factors that impact water quality at mining sites.
The confluence of Big Data, Machine Learning, and Data Analytics in the modern era
has fundamentally altered the competitive landscape for a broad variety of business
sectors and organisations. The exponential growth in both the amount and diversity
of the data has given rise to possibilities and problems that were unimaginable in the
past. Data analytics may help companies make sense of the massive volumes of data
at their disposal, which can then be used to guide decision-making and anticipate
outcomes. Methodologies that are driven by data have formed the foundation for
a great number of successful enterprises and projects, including those that aim to
improve user experiences and operational optimisation. Traditional data processing
technology is being pushed to its limits by Big Data due to the volume, velocity,
and variety of the data it contains. The increasing significance of data has been a
driving force behind the development of innovative technologies for the processing,
examination, and storage of data. This enhancement not only boosts efficiency and
scalability but also encourages a culture of real-time, fast decision-making that can be
adapted to changing circumstances. Machine learning is a subfield of artificial intel-
ligence that makes use of algorithms to recognise patterns, forecast outcomes, and
speed up decision-making processes. Applications as diverse as advanced medical
diagnostics and autonomous cars have been made possible as a result of its versatility
and predictive capacity. These applications include natural language processing and
recommendation systems. However, along with these developments come a number of problems that must be overcome. Security, data privacy, and ethical conundrums are key topics of conversation at the moment. In order to win over the public and minimise unforeseen repercussions, machine learning models need to be fair, transparent, and understandable. Additionally, specialists in these many sectors are required to consistently improve their skills and adapt their practices. Collaboration between data scientists, domain specialists, ethicists, and policymakers is very necessary to ensure that these technologies are used in an ethical manner.
of information, as well as its interpretation and use, as a direct consequence of the
convergence of data analytics, big data, and machine learning. Their intertwined
progression augurs well for a future that is rich in innovation, efficiency, and expan-
sion. To fully realise their revolutionary potential in a way that is both inclusive and
environmentally sustainable, however, a strategy that is well-balanced and accords
equal weight to social responsibility and technical innovation will be required.
References
1. Elizabeth, F., Sérgio, M., Paulo, C.: Data science, machine learning and big data in digital journalism: A survey of state-of-the-art, challenges and opportunities. Expert Syst. Appl. 221 (2023)
2. Myers, N.E., Kogan, G.: Emerging AI and data analytics tooling and disciplines. In: Self-service
data analytics and governance for managers. pp. 25–49. John Wiley & Sons, Inc. (2021)
3. Jones, W., Thomas, G.E., Thomas, S.H.: Data analytics in healthcare: A review of current
trends and ethical issues. J. Healthc. Inf. Manag. 34(2), 19–25 (2020)
4. Li, X., Li, P., Yang, Y.: Credit scoring with machine learning for online peer-to-peer lending:
A categorization framework and review. IEEE Access 8, 38892–38908 (2020)
5. Lu, C., Luo, X., Tian, X., Zhang, Q.: Machine learning for predictive maintenance in smart
grids: A comprehensive survey. IEEE Trans. Industr. Inf. 16(1), 648–657 (2020)
6. Smith, A., White, B., Johnson, R.: Data analytics in market segmentation and trend analysis:
A case study of the retail industry. Int. J. Data Sci. Anal. 12(4), 487–503 (2021)
7. Veeramachaneni, K., Li, C., Soh, L.: Recommender systems for large-scale content on mobile
devices. IEEE Internet Comput. 18(3), 14–22 (2014)
8. Wang, H., Lu, X., Yang, L., Wang, Z., Wang, J.: A review on applications of data mining
techniques in the credit industry. Expert Syst. Appl. 129, 67–79 (2019)
9. Wu, X., Xia, H., Hao, J., Zhao, J.L.: Big data analytics in logistics and supply chain manage-
ment: Certain investigations for research and applications. J. King Saud Univ.-Comp. Info. Sci.
29(4) (2018)
Modern Real-World Applications Using Data Analytics and Machine … 235
10. Zhang, G., Wang, D.: A review on predictive maintenance of production systems. IEEE Access
7, 182450–182470 (2019)
11. Mühlhoff, R.: Predictive privacy: Towards an applied ethics of data analytics. Ethics Inf.
Technol. 23, 675–690 (2021)
12. Jain, P., Gyanchandani, M., Khare, N.: Enhanced secured map reduce layer for big data privacy
and security. J Big Data. 6, (2019).
13. Kaufmann, U.H., Tan, A.B.C.: Why data analytics is important? In: Data Analytics for
Organisational Development. pp. 1–20. John Wiley & Sons (2021)
14. Jiang, R., Bouridane, A., Li, C.-T., Crookes, D., Boussakta, S., Hao, F., Edirisinghe, E.A.: Big
Data Privacy and Security in Smart Cities. Springer, Cham (2022)
15. Chen, Y., Wang, D., Xie, L.: Big data analytics in tourism: A literature review. Tour. Manage.
68, 301–323 (2019)
16. Gupta, V., Jain, V., Jain, S.: Big data analytics in supply chain management: A comprehensive
overview. J. Enterp. Inf. Manag. 34(4), 1121–1152 (2021)
17. Azmi, J., Arif, M., Nafis, M.T., Alam, M.A., Tanweer, S., Wang, G.: A systematic review
on machine learning approaches for cardiovascular disease prediction using medical big data.
Med. Eng. Phys. 105 (2022)
18. Fatemeh, N., Yufei, Y., Norm, A.: An examination of the hybrid meta-heuristic machine learning
algorithms for early diagnosis of type II diabetes using big data feature selection. Healthcare
Anal. 4 (2023)
19. Kagiso, S.M., Christian, W.: Application of machine learning algorithms for nonlinear system
forecasting through analytics—A case study with mining influenced water data. Water Res.
Indus. 29 (2023)
20. Choi, J., Kim, D., Kim, J.: Machine learning in predictive maintenance: A review. Sustainability
12(7), 2750 (2020)
21. Alam, F., Reaz, M.B.I., Ali, M.A.M.: Big data in smart grid: A comprehensive review and
trends. IEEE Access 7, 35877–35906 (2019)
Real-World Applications of Data
Analytics, Big Data, and Machine
Learning
P. S. Chaudhary (B)
Department of Data Science, Worcester Polytechnic Institute, Worcester, MA, USA
e-mail: pchaudhary@wpi.edu
M. R. Khurana · M. Ayalasomayajula
Department of Materials Science and Engineering, Cornell University, Ithaca, NY, USA
e-mail: mrk263@cornell.edu
M. Ayalasomayajula
e-mail: ma2258@cornell.edu
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 237
P. Singh et al. (eds.), Data Analytics and Machine Learning, Studies in Big Data 145,
https://doi.org/10.1007/978-981-97-0448-4_12
1 Introduction
In an age characterized by the ubiquitous presence of data, our daily lives are intri-
cately interwoven with an abundance of digital sources, including but not limited
to digital watches, mobile phones, laptops, manufacturing, finance, healthcare, and
many more domains. The digital landscape unfolds as an ever-expanding tapestry of
information, as data recording remains ceaseless and pervasive [1, 2].
This data holds the potential to accelerate the development of intelligent applica-
tions across various domains. Central to this revolution driven by data is AI, where
data analytics, ML, and DL stand out as essential catalysts for transformation [3, 4].
The three primary forms of data analytics used in decision-making are descrip-
tive, predictive, and prescriptive analytics. Descriptive analytics involves histor-
ical data analysis to reveal trends and patterns, often displayed using visual repre-
sentations. Predictive analytics employs statistical modeling, along with ML and
DL algorithms, to evaluate potential future scenarios. Prescriptive analytics lever-
ages statistical methods to identify the best course of action, factoring in multiple
scenarios and their repercussions, saving time and money for businesses, making
optimal decisions, and impacting various sectors [5]. Predictive methods can be segmented into supervised, unsupervised, semi-supervised, and reinforcement learning approaches [6–8]. The efficacy, accuracy, and precision of AI applications are inherently linked to the intrinsic qualities and attributes of the data fed to
selected models. AI employs various techniques, such as regression, classification,
feature selection, natural language processing (NLP), large language models (LLM),
dimensionality reduction, clustering, reinforcement learning, and computer vision to
develop applications. Selection of an ML or DL algorithm, fine-tuning it, and contin-
uously learning from data to output as per objectives for a specific application in a
given domain presents a formidable challenge because of the attributes and nature
of the data [9–11].
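The three forms of analytics described above can be made concrete with a toy example. The sketch below uses invented monthly sales figures and an assumed profit/holding-cost model, so it is illustrative only: descriptive analytics summarises what happened, predictive analytics extrapolates a simple linear trend, and prescriptive analytics picks the action that maximises an objective.

```python
# Toy illustration of descriptive, predictive, and prescriptive analytics.
# The sales figures and cost model are invented for demonstration.

sales = [100, 110, 120, 130, 140, 150]  # monthly units sold (historical)

# Descriptive: summarise what happened.
average = sum(sales) / len(sales)

# Predictive: fit a simple linear trend and forecast the next month.
n = len(sales)
xs = range(n)
x_mean = sum(xs) / n
slope = (sum((x - x_mean) * (y - average) for x, y in zip(xs, sales))
         / sum((x - x_mean) ** 2 for x in xs))
intercept = average - slope * x_mean
forecast = intercept + slope * n  # month index n = the next month

# Prescriptive: choose the stock level that maximises expected profit
# under the assumed profit and holding-cost parameters.
def expected_profit(stock, demand, unit_profit=5, holding_cost=2):
    sold = min(stock, demand)
    return unit_profit * sold - holding_cost * max(stock - demand, 0)

best_stock = max(range(100, 201, 10),
                 key=lambda s: expected_profit(s, forecast))

print(average, forecast, best_stock)
```

The same division of labour holds at scale: dashboards report the average, a trained model replaces the hand-fitted trend, and an optimiser replaces the brute-force search over stock levels.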
ML and DL enable applications to learn and evolve from past data and continuously improve with new data, without explicit rule-based programming, making them a cornerstone of modern technology. Various industries are undergoing
transformation and have embraced the power of data through AI technologies [12].
In today’s data-driven economy, the significance of Data Analytics, Big Data,
and Machine Learning cannot be overstated. These technologies collectively act as
the backbone of a transformative paradigm, unlocking unprecedented insights and
opportunities across various industries. Data Analytics allows businesses to decipher
patterns, trends, and correlations within vast datasets, informing strategic decisions
and improving operational efficiency. Big Data, characterized by the management
and analysis of massive and diverse datasets, provides a wealth of information that
traditional approaches can’t handle. This abundance of data, when harnessed effec-
tively, fuels innovation and drives competitive advantages. ML and DL methods
empower us to learn from data, enabling predictive capabilities, and personalized
user experiences. These technologies improve decision-making processes and pave
the way for innovations that redefine how businesses operate, ensuring they stay agile
• Rigidity: The predefined structure of structured data can be limiting. It may not
accommodate data that doesn’t fit neatly into the established schema.
• Scalability: Scaling structured databases to accommodate large volumes of data
can be costly and complex.
Unstructured Data: Unstructured data is characterized by a lack of organization
and a format that doesn’t conform to traditional databases. It can take various forms,
including text, audio, video, and more [14]. Unstructured data encompasses vast
volumes of text documents, social media posts, audio recordings, images, and video
files. This type of data often contains valuable insights but is challenging to analyze
due to its unorganized nature. Some of the advantages of using unstructured data are:
• Rich Content: Unstructured data holds a wealth of untapped information,
but unlocking its value requires advanced methods such as NLP, and image
recognition.
• Flexibility: Unstructured data doesn’t require a predefined structure, making it
suitable for rapidly evolving data sources.
• Big Data Insights: It allows organizations to utilize big data to find hidden patterns.
Some of the disadvantages of using unstructured data are:
• Analysis Complexity: Analyzing unstructured data can be challenging. It may
require advanced techniques such as natural language processing or image
recognition.
• Storage and Processing Costs: Managing and processing unstructured data can be
costly, especially as data volumes increase.
• Data Privacy: Unstructured data often contains sensitive information, raising
privacy and security concerns.
Semi-structured Data: Semi-structured data occupies a middle ground between
structured and unstructured data. It possesses some level of structure but does not
adhere to rigid schemas found in structured data [14]. Common instances of semi-
structured data include JSON and XML files, emails, and NoSQL databases. While
they may have defined attributes, the data may not be uniformly structured. Some of
the advantages of using semi-structured data are:
• Flexibility with Structure: Semi-structured data offers a balance between struc-
tured and unstructured data. It allows for some structure while accommodating
varying data formats.
• Adaptability: It is suitable for dynamic data sources and scenarios where schemas
may evolve over time.
Some of the disadvantages of using semi-structured data are:
• Complexity: Analyzing semi-structured data can be more complex than structured
data, as it may not adhere to a uniform schema.
• Tool Dependency: Effectively working with semi-structured data often requires
specialized tools and software.
• Data Quality: Maintaining data quality can be a challenge, as it does not always
conform to a rigid schema.
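The contrast between the data types above can be shown in a few lines. The sketch below, with invented customer records and field names chosen purely for illustration, parses semi-structured JSON whose attributes are defined but not uniformly present, and normalises it into the flat, uniform table a structured store would require:

```python
import json

# Semi-structured input: each record has defined attributes, but not every
# record carries every attribute (invented data for illustration).
raw = '''
[{"id": 1, "name": "Asha", "email": "asha@example.com"},
 {"id": 2, "name": "Ravi"},
 {"id": 3, "name": "Meera", "email": "meera@example.com", "phone": "555-0101"}]
'''

records = json.loads(raw)

# Discover the union of all keys, then flatten into a uniform table,
# filling missing attributes with None (the rigid schema a structured
# database would demand up front).
columns = sorted({key for rec in records for key in rec})
table = [[rec.get(col) for col in columns] for rec in records]

print(columns)   # ['email', 'id', 'name', 'phone']
print(table[1])  # [None, 2, 'Ravi', None]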
Understanding these three data types is crucial when embarking on data anal-
ysis projects. Based on the data, analysts and data scientists can choose appro-
priate techniques and tools to extract valuable insights and make informed decisions.
Subsequently, various categories of AI algorithms are discussed.
In the present era dominated by a surfeit of data, the seamless integration of Data
Analytics, Big Data, and Machine Learning (ML) emerges as a linchpin for driving
innovation and steering industries toward data-driven excellence.
Big Data, characterized by its colossal volume, velocity, and variety, stands as a
pivotal asset in this data-centric landscape [13]. It encapsulates extensive datasets
sourced from diverse origins, providing a reservoir of information crucial for robust
analysis and insights generation. Big Data’s significance lies in its ability to handle
and process vast datasets that traditional data processing systems find overwhelming.
The technology associated with Big Data facilitates storage, retrieval, and analysis
of massive datasets, allowing organizations to extract valuable insights and patterns
that might otherwise remain hidden.
Data Analytics, comprising descriptive, predictive, and prescriptive analytics,
emerges as the next layer in this technological triad. Descriptive analytics unveils
historical patterns and trends, offering a retrospective view of data. Predictive
analytics utilizes data mining, statistical modeling, and ML algorithms to foresee
future possibilities, while prescriptive analytics guides decision-making by identi-
fying optimal courses of action. The synergy between Big Data and Data Analytics
becomes evident as the latter relies on the expansive datasets provided by Big Data
to extract meaningful insights [5].
Machine Learning adds the cognitive dimension to this integrated framework
of predictive analytics mentioned earlier. Categorized into supervised, unsuper-
vised, semi-supervised, and reinforcement learning, ML algorithms facilitate iterative learning from data, as discussed in detail in the section “Categories of Learning:
Exploring the Dimensions of Supervised, Unsupervised, and Reinforcement Algo-
rithms”. This process enables systems to evolve, improving their ability to make
predictions, detect anomalies, and automate decision-making. The synergy between
Machine Learning, Big Data, and Data Analytics becomes a dynamic force, espe-
cially potent in scenarios where traditional rule-based programming falls short [6–8].
The subsequent case study will further elucidate this symbiotic relationship through
real-world examples and explore how their synergistic integration propels industries
toward transformative outcomes.
Semi-structured data, found in XML or JSON formats, offers a bridge between the
two, making it ideal for semi-supervised learning [1, 6]. This synergy between data
types and learning categories empowers developers to tailor their ML and DL models
to the specifics of their data, enabling more accurate and insightful results in various
domains. Understanding these connections is essential in utilizing the full spectrum
of possibilities in the era of big data analytics and AI.
Understanding the various categories of ML and DL is crucial for harnessing
their capabilities effectively. This section delves into the distinctions between key
categories: supervised learning, unsupervised learning, semi-supervised learning,
and reinforcement learning as shown in Fig. 1.
Supervised Learning: Supervised learning is a foundational machine learning
paradigm driven by labeled data, enabling algorithms to learn the mapping from
inputs to responses with precision. This approach utilizes historical data and labels to
forecast events, beginning with the training of datasets to develop inferred functions.
These functions then predict output values when presented with new input data. The
process involves comparing the predicted and expected results to identify errors and
subsequently refining the model for improved accuracy and performance [11, 15,
16]. Some of the supervised algorithms are listed below:
• Linear Regression
• Logistic Regression
• Decision Trees
• Random Forest
• Support Vector Machines (SVM)
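The train, predict, compare, and refine loop described above can be sketched with a single-neuron perceptron, the simplest error-driven supervised learner. It is not one of the algorithms listed, and the linearly separable 2-D points below are invented, so this is a minimal illustration of the workflow rather than a reference implementation:

```python
# Minimal perceptron: learn weights from labeled data by repeatedly
# predicting, comparing against the expected label, and correcting errors.
# Invented 2-D points; the true rule is: label 1 if x + y > 1, else 0.
data = [((0.0, 0.0), 0), ((0.2, 0.3), 0), ((1.0, 0.9), 1),
        ((0.8, 1.1), 1), ((0.1, 0.4), 0), ((1.2, 0.7), 1)]

w = [0.0, 0.0]  # weights, refined during training
b = 0.0         # bias term
lr = 0.1        # learning rate

def predict(x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

for _ in range(50):  # training epochs over the labeled dataset
    for x, expected in data:
        error = expected - predict(x)  # compare prediction with the label
        if error:                      # refine the model on each mistake
            w[0] += lr * error * x[0]
            w[1] += lr * error * x[1]
            b += lr * error

accuracy = sum(predict(x) == y for x, y in data) / len(data)
print(accuracy)
```

The listed algorithms (linear and logistic regression, decision trees, random forests, SVMs) follow the same contract of fitting to labeled examples and predicting on new inputs; they differ in how the inferred function is represented and refined.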
Supervised learning is one of the foundational pillars of ML, where algorithms are
guided by labeled data to make predictions or classifications based on historical
patterns. In this paradigm, the model learns from input data paired with corresponding
output labels or responses, allowing it to establish a relationship or mapping between
the two. Supervised learning is akin to having a teacher or supervisor who provides
the algorithm with a clear roadmap or feedback, enabling it to generalize from known
examples and make informed decisions on unseen data. This approach is particularly
valuable for tasks, where predicting outcomes or classifying data is crucial, such
as image recognition, speech analysis, and medical diagnoses. The precision and
interpretability of supervised learning models make them indispensable tools for
various real-world applications, and understanding the principles underlying this
category is essential for harnessing their power in the data-driven world. The key
learning algorithms for both regression and classification tasks are described below.
The key distinction between classification and regression lies in their predictive
nature: classification predicts distinct class labels, whereas regression focuses on
estimating continuous quantities.
Regression: In the realm of supervised learning, regression algorithms offer a
straightforward approach to predicting output values by minimizing errors based on
input data consisting of specific features [18]. These algorithms primarily handle
continuous response variables and are instrumental in various applications. Regres-
sion analysis encompasses a range of machine learning methods that enable the
prediction of a continuous response or output variable based on one or more input
variables or parameters [16]. Regression models have found extensive use in diverse
fields, such as real estate, banking, insurance, finance, manufacturing, time series
forecasting, and more. The following sections provide a brief overview of some
prominent types of regression algorithms.
Linear Regression: Within the domain of regression analysis, we explore several
powerful algorithms that excel in predicting continuous outcomes based on input
variables. Simple Linear Regression (SLR), Multiple Linear Regression (MLR), and
Polynomial Regression are key algorithms under Regression.
• Simple Linear Regression (SLR): SLR is a fundamental statistical method to
analyze the relationship between two variables: a dependent variable (response)
and an independent variable (predictor) [16]. In SLR, we seek to model the
linear relationship between these variables, allowing us to make predictions, infer
patterns, and understand how changes in the predictor affect the response. SLR
essentially helps us find the best-fitting line (the regression line) that minimizes
the sum of squared differences between the observed values of the response and
the values predicted by the model. This fitted line represents the linear relationship
between the variables and is used for making predictions.
• Multiple Linear Regression (MLR): MLR extends SLR by incorporating multiple predictor variables, modeling a dependent variable (response) as a function of two or more independent variables (predictors) [16]. MLR is a fundamental statistical method used for prediction, hypothesis testing, and understanding how multiple variables collectively influence an
outcome. In MLR, the model equation accounts for multiple independent vari-
ables. MLR allows us to understand how each independent variable influences the
dependent variable while controlling for the effects of the others or keeping other
variables constant. By estimating the coefficients, we quantify the impact of each
independent variable on the response. The goal of MLR is to find the best-fitting
linear relationship that minimizes the sum of squared differences between the
observed values of response and the values predicted by the model.
• Polynomial Regression: Polynomial Regression is an extension of SLR/MLR
that allows us to model relationships between a dependent variable (response) and
one or more independent variables (predictors) when the relationship is nonlinear
[18]. In situations where a linear model doesn’t capture the underlying relationship
in the data, polynomial regression offers a more flexible approach. The equation
for polynomial regression involves using polynomial terms to model the nonlinear
relationship between variables. The challenge in using polynomial regression is
selecting the appropriate degree of the polynomial, as higher degrees may lead
to overfitting. Careful model evaluation and validation are essential to ensure the
chosen polynomial degree accurately represents the data.
• LASSO Regression and Ridge Regression: Polynomial Regression is a powerful and flexible method for modeling data where nonlinear relationships
exist. However, it faces various challenges, especially in real-world scenarios
where data can be noisy, and overfitting is a concern. In such cases, regularization
methods like LASSO (Least Absolute Shrinkage and Selection Operator) and
Ridge Regression are a better fit.
• LASSO (Least Absolute Shrinkage and Selection Operator): LASSO is a
regression technique aimed to address overfitting, multicollinearity, and the
problem of too many irrelevant features. The core idea behind LASSO is to add a
regularization term to the linear regression model. This regularization term penal-
izes (L1 penalty) the absolute values of the coefficients, effectively shrinking some
of them to zero. In other words, LASSO performs feature selection by setting
coefficients associated with irrelevant or redundant features to zero [19, 20]. By
reducing the number of predictors, LASSO simplifies the model while retaining
the essential relationships between variables.
• Ridge Regression: Like LASSO, Ridge Regression introduces a regularization term, but it takes
a different approach. Instead of using the absolute values of coefficients, Ridge
Regression penalizes the squares of coefficients. This results in a more gradual
shrinkage of coefficients. While it doesn’t perform feature selection like LASSO,
it effectively mitigates the impact of multicollinearity, which is common in real-
life datasets. By limiting the magnitude of coefficients, Ridge Regression ensures
that no single predictor has an excessive influence on the model [21].
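To make the least-squares and shrinkage ideas above concrete, the sketch below fits simple linear regression in closed form on an invented dataset, then applies a ridge-style L2 penalty to the slope for the centred one-feature case. It is illustrative only: the full Ridge/LASSO formulations work on coefficient vectors, and LASSO has no closed form like this.

```python
# Closed-form simple linear regression, then a ridge-penalised slope.
# Invented data: y is exactly 2*x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Sums of squares / cross-products about the means.
sxx = sum((x - x_mean) ** 2 for x in xs)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))

# Ordinary least squares: minimises the sum of squared residuals.
slope_ols = sxy / sxx                        # 2.0
intercept_ols = y_mean - slope_ols * x_mean  # 1.0

# Ridge (L2) on the slope in this centred one-feature case:
# minimising sum((y - b*x)^2) + lam * b^2 gives b = sxy / (sxx + lam).
lam = 5.0
slope_ridge = sxy / (sxx + lam)

print(slope_ols, intercept_ols)  # 2.0 1.0
print(slope_ridge < slope_ols)   # True: the penalty shrinks the slope
```

The penalty trades a little bias for lower variance: as `lam` grows the slope is pulled toward zero, which is exactly the behaviour that tames multicollinearity and overfitting in the multi-feature setting.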
Classification: Classification algorithms play a pivotal role in machine learning,
organizing data into meaningful categories. They come in two primary forms: binary
classification and multiclass classification. Binary classification assigns data points
to two distinct classes, such as spam detection or sentiment analysis. Multiclass clas-
sification extends to multiple categories, often used in image recognition or document
categorization. These algorithms automate decision-making based on data, enabling
a wide range of applications. Their adaptability to complex real-world scenarios
makes classification algorithms invaluable in the field of artificial intelligence and
data analysis. Now, we will delve into some of the key classification algorithms
prevalent in the field.
CNNs include VGGNet and ResNet, each adding improvements or modifications to existing models. CNNs excel in various computer vision tasks like image
classification, object detection, and facial recognition, thanks to their ability to
automatically learn and adapt features from data. These networks are also essen-
tial in modern applications, including self-driving cars, medical image analysis,
and more.
In Sect. 5.1, we discussed a selection of key classification algorithms. However,
it’s essential to acknowledge that there are numerous other classification algorithms
available. Due to the extensive range of classification techniques, we’ve focused on
highlighting a few pivotal ones in this overview.
Reinforcement Learning (RL) [37] is a dynamic and expanding field within ML,
emphasizing interactions between intelligent agents and their environments. Unlike
other paradigms, RL involves autonomous decision-making and learning from
continuous feedback in the form of rewards or penalties. RL enables agents to learn
optimal sequences of actions for maximizing cumulative rewards. It mimics how
humans and animals adapt to their surroundings by interacting with the environment
and adjusting their strategies based on feedback.
RL’s versatility allows its application in various domains. It has proven invaluable
in robotics, teaching machines complex tasks like walking, flying, and grasping
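The reward-driven loop described here can be sketched with tabular Q-learning on a toy environment. Everything below is invented for illustration (a five-cell corridor where the agent starts at cell 0 and is rewarded only on reaching cell 4); real robotics applications use far richer state spaces and function approximation.

```python
import random

random.seed(0)  # reproducible exploration

N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]                        # move left or right
Q = [[0.0, 0.0] for _ in range(N_STATES)]  # action-value table
alpha, gamma, eps = 0.5, 0.9, 0.2          # learning rate, discount, exploration

for episode in range(500):
    s = 0
    while s != GOAL:
        # epsilon-greedy: mostly exploit the current estimates, sometimes explore
        if random.random() < eps:
            a = random.randrange(2)
        else:
            a = Q[s].index(max(Q[s]))
        s2 = min(max(s + ACTIONS[a], 0), N_STATES - 1)
        r = 1.0 if s2 == GOAL else 0.0     # feedback only at the goal
        # Q-learning update: adjust the estimate toward reward + discounted future
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

# Greedy policy after learning (action index 1 = move right).
policy = [Q[s].index(max(Q[s])) for s in range(GOAL)]
print(policy)
```

Through nothing but trial, error, and delayed reward, the agent learns to prefer the action that moves it toward the goal in every state, which is the "optimal sequence of actions" behaviour the paragraph describes.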
6.1 Manufacturing
6.3 Healthcare
6.4 Finance
The finance sector has witnessed significant transformations due to the incorporation
of machine learning algorithms and AI technologies. In this sub-section, we delve
into how these technologies are revolutionizing the financial industry. Some of the
examples of using AI algorithms in this domain are:
Algorithmic Trading: Machine learning models are employed for algorithmic
trading, where they analyze historical data and market trends to make real-time
trading decisions. These algorithms have the capability to make split-second trades
and respond to market changes more effectively than human traders. As a result, they
increase trading efficiency, reduce errors, and optimize portfolio performance [49].
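As a deliberately simplified illustration of rule-based algorithmic trading (a naive moving-average crossover on an invented price series, far simpler than the ML models the text describes), a signal generator might look like:

```python
# Naive moving-average crossover on an invented price series.
prices = [10, 10.2, 10.1, 10.4, 10.8, 11.0, 10.9, 10.5, 10.2, 10.0, 9.8, 9.9]

def sma(series, window, i):
    """Simple moving average of the `window` prices ending at index i."""
    return sum(series[i - window + 1 : i + 1]) / window

signals = []
short_w, long_w = 3, 5
for i in range(long_w - 1, len(prices)):
    short = sma(prices, short_w, i)
    long_ = sma(prices, long_w, i)
    # Buy while short-term momentum exceeds the long-term trend.
    signals.append("buy" if short > long_ else "sell")

print(signals)
```

Production systems replace this fixed rule with learned models, but the structure is the same: transform market data into features, emit a decision per tick, and act on it faster than a human could.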
Credit Scoring and Risk Assessment: Financial institutions leverage machine
learning algorithms for credit scoring and risk assessment. These algorithms evaluate
a borrower’s creditworthiness by analyzing various factors, such as credit history,
income, and debt. They provide more accurate risk assessments, enabling lenders to
make informed lending decisions and offer loans to a wider range of customers [50].
Fraud Detection: Machine learning plays a pivotal role in fraud detection. It
continually monitors financial transactions for anomalies and unusual patterns. For
instance, it can identify potentially fraudulent credit card transactions in real time
and send alerts to both customers and financial institutions. This not only safeguards
individuals from unauthorized transactions but also helps financial organizations
minimize financial losses due to fraud [51].
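A minimal statistical version of such anomaly monitoring (a z-score threshold on invented transaction amounts, much simpler than production fraud models) can be sketched as:

```python
import statistics

# Invented transaction amounts; the last one is an obvious outlier.
amounts = [42.0, 37.5, 51.0, 45.2, 39.9, 48.3, 41.1, 44.6, 40.7, 980.0]

mean = statistics.mean(amounts)
stdev = statistics.pstdev(amounts)  # population standard deviation

# Flag transactions whose z-score exceeds a chosen threshold.
THRESHOLD = 2.0
flagged = [amt for amt in amounts if abs(amt - mean) / stdev > THRESHOLD]

print(flagged)  # [980.0]
```

Real fraud systems extend the same idea with many features (merchant, location, time of day) and learned models, streaming each transaction through the detector and alerting in real time.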
Some of the opportunities of using AI algorithms in this domain are:
• Automated trading and portfolio optimization.
• Enhanced risk management.
• Improved fraud detection and prevention.
• Efficient customer service with chatbots.
Some of the challenges of using AI algorithms in this domain are:
• Data privacy and security in handling financial data.
• Regulatory compliance and risk associated with automated trading.
• Developing robust fraud detection models.
Some of the benefits of using AI algorithms in this domain are:
• Enhanced trading efficiency and returns.
6.5 Agriculture
The agriculture sector has seen remarkable transformations through the integration of
machine learning and AI. This sub-section delves into the applications, opportunities,
challenges, and benefits of AI in agriculture. Some of the examples of using AI
algorithms in this domain are:
Crop Disease Detection: Machine learning models are employed to detect early
signs of crop diseases by analyzing images of leaves or plants. For example, deep learning is used to diagnose plant diseases, helping farmers take preventive measures and reduce crop loss [52].
Soil Health Assessment: AI-driven systems assess soil quality based on various parameters, such as pH levels and nutrient content. This information aids in optimizing fertilizer application and irrigation to enhance crop yields while minimizing
environmental impact [53].
Precision Irrigation: Machine learning algorithms process weather data, soil
moisture levels, and crop requirements to optimize irrigation systems. This results
in water conservation, reduced operational costs, and increased crop productivity.
Companies like CropX provide such solutions [54].
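In its simplest rule-based form (a toy decision rule with invented thresholds and units, not the method of CropX or any vendor), combining soil moisture with a rain forecast might look like:

```python
# Toy irrigation decision: thresholds and inputs are invented for illustration.
def irrigation_minutes(soil_moisture_pct, rain_forecast_mm, crop_need_pct=35):
    """Return irrigation run time; skip watering if rain is expected."""
    if rain_forecast_mm >= 5:  # meaningful rain coming: do nothing
        return 0
    deficit = max(crop_need_pct - soil_moisture_pct, 0)
    return deficit * 2         # assume 2 minutes of irrigation per % deficit

print(irrigation_minutes(20, 0))  # dry soil, no rain: water
print(irrigation_minutes(20, 8))  # dry soil, but rain expected: skip
print(irrigation_minutes(40, 0))  # already moist enough: skip
```

Deployed systems learn these thresholds and run times from sensor and weather data rather than hard-coding them, but the decision structure (sense, forecast, act) is the same.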
Some of the opportunities of using AI algorithms in this domain are:
• Increased crop yields and quality.
• Enhanced pest and disease management.
• Improved resource management.
• Sustainable agriculture practices.
Some of the challenges of using AI algorithms in this domain are:
• Access to technology in remote areas.
• Data security and privacy concerns.
• Initial investment costs.
• Adaptation to local conditions.
Some of the benefits of using AI algorithms in this domain are:
• Greater food production to meet growing demands.
• Reduction in chemical and water usage.
• Enhanced sustainability and resource conservation.
In this section, we explore promising research directions that emanate from the
evolution and growing importance of data analytics, big data, and machine learning
in various domains. These directions encapsulate the future landscape of AI and its
potential impact.
Explainable AI (XAI): As AI systems become increasingly integrated into
everyday life, the demand for transparency and trustworthiness is paramount. XAI
has emerged as a prominent research direction in recent times. XAI aims to develop
AI systems that provide interpretable, human-understandable justifications for their
decisions and predictions [55]. Researchers focus on creating models and tech-
niques that can reveal the inner workings of complex AI systems, enabling users
to comprehend their outputs, thus fostering trust and adoption in critical domains
like healthcare, finance, and autonomous vehicles.
Ethical AI and Bias Mitigation: The rapid integration of AI technologies raises
concerns about fairness, ethics, and bias. Research in ethical AI and bias mitigation
is imperative [56]. The goal is to develop algorithms, guidelines, and best practices to
ensure AI systems do not reinforce or introduce harmful biases. Scholars are actively
investigating techniques for detecting, measuring, and mitigating bias in AI models
and data, addressing challenges in domains ranging from hiring to criminal justice.
Federated Learning: Privacy and data security are fundamental concerns in the
digital age. Federated Learning, an emerging paradigm, is garnering attention. It
enables AI models to be trained across decentralized devices or servers, ensuring that
sensitive data remains on the user’s device and only model updates are shared [57].
This approach shows promise in healthcare, where patient data privacy is paramount,
and in other industries with stringent data protection regulations.
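The core of federated averaging can be sketched in a few lines. This is an idealised single-round simulation with invented client datasets, where the "model" is just a sample mean; real systems add local gradient training, client sampling, secure aggregation, and communication layers. The key property is visible even here: clients share only parameters, never raw data.

```python
# Federated averaging sketch: each client fits a local model on private
# data and shares only the parameter; the server averages the parameters.
# Client datasets are invented; raw data never leaves the "device".

client_data = {
    "device_a": [2.0, 4.0, 6.0],
    "device_b": [8.0, 10.0],
    "device_c": [1.0, 3.0],
}

def local_update(data):
    """Local training step: here the model is simply the sample mean."""
    return sum(data) / len(data)

# Each client computes its update locally...
updates = {name: local_update(data) for name, data in client_data.items()}

# ...and the server aggregates, weighting by local dataset size (FedAvg).
total = sum(len(d) for d in client_data.values())
global_model = sum(len(client_data[name]) * upd
                   for name, upd in updates.items()) / total

print(updates)
print(global_model)  # equals the mean of the pooled data, 34/7
```

Weighting by dataset size makes the aggregate mathematically equivalent to training on the pooled data for this simple model, which is why the approach appeals to privacy-sensitive settings like healthcare.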
Interdisciplinary AI: AI is increasingly intersecting with various other scien-
tific domains, creating new opportunities. Interdisciplinary AI research explores
the confluence of AI with fields like biology, chemistry, and material science.
8 Conclusion
In conclusion, this chapter has taken us on a journey through the remarkable landscape
of data analytics, big data, and machine learning, illustrating their pivotal roles in
the digital era. The profound impact of these technologies across various domains
is highlighted, emphasizing their transformative potential and influence on our data-
driven future.
The chapter commenced by acknowledging the data revolution, characterized
by the deluge of big data from diverse sources, including manufacturing, banking,
social media, e-commerce, and healthcare records. Subsequently, the different types
of data were elucidated to highlight the importance of understanding the data type in
deciding the AI algorithm for analysis. The fundamentals of various AI algorithms
are then described based on the data type which needs to be analyzed. These sections
elaborate on the importance of determining the appropriate combination of the data
and algorithm. Furthermore, the application of AI in various industries is discussed.
For each application, the opportunities, challenges, and benefits are described. This
highlights the importance of using AI for these applications while providing the
improvements needed to achieve higher accuracy and efficiency. Lastly, key areas of focus for future research, based on current gaps and challenges, are highlighted.
The impact of these technologies transcends boundaries, fundamentally reshaping
various industries. As we reflect on the chapter’s content, data analytics, big data, and
ML are redefining the way we interact with and harness information. The potential for
innovation, efficiency, and informed decision-making is limitless, and their influence
will continue to expand, leaving no domain untouched.
In a world where data is hailed as the new currency, these technologies are indis-
pensable tools for tackling complex challenges, driving innovation, and forging the
path toward a data-driven future. The insights and knowledge derived from this
chapter will serve as a compass, guiding us through the intricate terrain of data
analytics, big data, and machine learning, as we explore, adapt, and exploit their full
potential in the ever-evolving digital age.
References
1. Sarker, I.H.: Machine learning: algorithms, real-world applications and research directions. SN Comput. Sci. 2, 160 (2021). https://doi.org/10.1007/s42979-021-00592-x
2. Cao, L.: Data science: a comprehensive overview. ACM Comput. Surv. (CSUR) 50(3), 43
(2017). https://doi.org/10.1145/3076253
3. Sarker, I.H.: AI-driven cybersecurity: an overview, security intelligence modeling and research
directions. SN Comput. Sci. (2021). https://doi.org/10.1007/s42979-021-00557-0
4. Sarker, I.H.: Deep cybersecurity: a comprehensive overview from neural network and deep
learning perspective. SN Comput. Sci. (2021)
5. Lepenioti, K., Bousdekis, A., Apostolou, D., Mentzas, G.: Prescriptive analytics: literature
review and research challenges. Int. J. Inf. Manage. 50, 57–70 (2020)
6. Mohammed, M., Khan, M.B., Bashier Mohammed, B.E.: Machine Learning: Algorithms and
Applications. CRC Press (2016)
7. Aiken, E., Bellue, S., Karlan, D., et al.: Machine learning and phone data can improve targeting
of humanitarian aid. Nature 603, 864–870 (2022). https://doi.org/10.1038/s41586-022-04484-9
8. Sureja, N., Mehta, K., Shah, V., Patel, G.: Machine learning in wearable healthcare devices.
In: Joshi, N., Kushvaha, V., Madhushri, P. (eds.) Machine Learning for Advanced Functional
Materials. Springer, Singapore (2023). https://doi.org/10.1007/978-981-99-0393-1_13
9. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques.
Morgan Kaufmann (2005)
10. Andina, D., Voulodimos, A., Doulamis, N., Doulamis, A., Protopapadakis, E.: Deep learning
for computer vision: a brief review. Comput. Intell. Neurosci. 2018, 7068349 (2018). https://doi.org/10.1155/2018/7068349
11. Lalor, J., Wu, H., Yu, H.: Improving Machine Learning Ability with Fine-Tuning (2017)
12. Ślusarczyk, B.: Industry 4.0: are we ready? Polish J. Manag. Stud. 17 (2018)
13. Mostajabi, F., Safaei, A.A., Sahafi, A.: A systematic review of data models for the big data
problem. IEEE Access 9, 128889–128904 (2021). https://doi.org/10.1109/ACCESS.2021.3112880
14. Praveen, S., Chandra, U.: Influence of structured, semi- structured, unstructured data on various
data models. Int. J. Sci. Eng. Res. 8, 67–69 (2020)
15. Saravanan, R., Sujatha, P.: A state of art techniques on machine learning algorithms: a perspec-
tive of supervised learning approaches in data classification. In 2018 Second international
conference on intelligent computing and control systems (ICICCS), pp. 945–949. IEEE (2018,
June)
262 P. S. Chaudhary et al.
16. Han, J., Pei, J., Kamber, M.: Data mining: concepts and techniques. Elsevier, Amsterdam
(2011)
17. Kaelbling, L.P., Littman, M.L., Moore, A.W.: Reinforcement learning: a survey. Journal of
Artificial Intelligence Research 4, 237–285 (1996)
18. Xuanxuan, Z.: Multivariate linear regression analysis on online image study for IoT. Cogn.
Syst. Res. 52, 312–316 (2018)
19. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc.: Ser. B
(Methodol.) 58(1), 267–288 (1996)
20. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. Journal of the
Royal Statistical Society: Series B (Statistical Methodology) 67(2), 301–320 (2005)
21. Hoerl, A.E., Kennard, R.W.: Ridge regression: Biased estimation for nonorthogonal problems.
Technometrics 12(1), 55–67 (1970)
22. Agresti, A.: An Introduction to Categorical Data Analysis. John Wiley & Sons (2018)
23. Altman, N.S.: An introduction to Kernel and nearest-neighbor nonparametric regression. Am.
Stat. 46(3), 175–185 (1992)
24. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
25. Schölkopf, B., Smola, A.J., Müller, K.-R.: Nonlinear component analysis as a Kernel eigenvalue
problem. Neural Comput. 10(5), 1299–1319 (1997)
26. Breiman, L.: Classification and Regression Trees. Routledge (2017)
27. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
28. Dietterich, T.G.: Ensemble methods in machine learning. In: Multiple classifier systems, pp. 1–
15. Springer (2000).
29. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document
recognition. Proc. IEEE 86(11), 2278–2324 (1998)
30. Nanga, S., Bawah, A., Acquaye, B., Billa, M., Baeta, F., Odai, N., Obeng, S., Nsiah, A.:
Review of dimension reduction methods. Journal of Data Analysis and Information Processing
9, 189–231 (2021). https://doi.org/10.4236/jdaip.2021.93013
31. Berisha, V., Krantsevich, C., Hahn, P.R., et al.: Digital medicine and the curse of dimensionality.
npj Digit. Med. 4, 153 (2021). https://doi.org/10.1038/s41746-021-00521-5
32. Pearson, K.: On lines and planes of closest fit to systems of points in space. Phil. Mag. 2(11),
559–572 (1901)
33. Laurens van der Maaten's homepage. https://lvdmaaten.github.io/tsne/ (n.d.)
34. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Comput. Surv. (CSUR)
31(3), 264–323 (1999)
35. Arthur, D., Vassilvitskii, S.: K-means++: The advantages of careful seeding. In Proceedings of
the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 1027–1035 (2007)
36. Kaufman, L., Rousseeuw, P.J.: Finding GROUPS in Data: An Introduction to Cluster Analysis.
John Wiley & Sons (1990)
37. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. The MIT Press (2018)
38. Watkins, C., Dayan, P.: Technical note: Q-Learning. Mach. Learn. 8, 279–292 (1992). https://doi.org/10.1007/BF00992698
39. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., … & Hassabis,
D.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533
(2015)
40. Paolanti, M., Romeo, L., Felicetti, A., Mancini, A., Frontoni, E., Loncarski, J.: “Machine
Learning approach for Predictive Maintenance in Industry 4.0,” 2018 14th IEEE/ASME Inter-
national Conference on Mechatronic and Embedded Systems and Applications (MESA), Oulu,
Finland, pp. 1–6 (2018). https://doi.org/10.1109/MESA.2018.8449150
41. Suzuki, Y., Iwashita, S., Sato, T., Yonemichi, H., Moki, H., Moriya, T.: “Machine Learning
Approaches for Process Optimization,” 2018 International Symposium on Semiconductor
Manufacturing (ISSM), Tokyo, Japan, 2018, pp. 1–4. https://doi.org/10.1109/ISSM.2018.8651142
42. Peres, R.S., Barata, J., Leitao, P., Garcia, G.: Multistage quality control using machine learning
in the automotive industry. IEEE Access 7, 79908–79916 (2019). https://doi.org/10.1109/ACCESS.2019.2923405
43. Verhoef, P.C., Neslin, S.A., Vroomen, B.: Multichannel customer management: understanding
the research-shopper phenomenon. Int. J. Res. Mark. 24(2), 129–148 (2007)
44. Fader, P.S., Hardie, B.G.S.: Customer-base valuation in a contractual setting: the perils of
ignoring heterogeneity. Mark. Sci. 24(1), 66–79 (2005)
45. Lewis, K., Reiley, D.H.: Online ads and offline sales: measuring the effects of retail advertising
via a controlled experiment on Yahoo. Econ. J. 124(576), 419–443 (2014)
46. Esteva, A., Kuprel, B., Novoa, R.A., Ko, J., Swetter, S.M., Blau, H.M., Thrun, S.:
Dermatologist-level classification of skin cancer with deep neural networks. Nature 542(7639),
115–118 (2017)
47. Stokes, J.M., Yang, K., Swanson, K., Jin, W., Cubillos-Ruiz, A., Donghia, N.M., … & Church,
G.M.: A deep learning approach to antibiotic discovery. Cell. 180(4), 688–702 (2020)
48. Schork, N.J.: Artificial intelligence and personalized medicine. Precision Medicine in Cancer
Therapy, 265–283 (2019)
49. Sen, J., Sen, R., Dutta, A.: Introductory chapter: machine learning in finance-emerging trends
and challenges. Algorithms, Models and Applications, 1 (2021)
50. Romanyuk, K.: Game theoretic approach for applying artificial intelligence in the credit
industry. In 2018 Fifth HCT Information Technology Trends (ITT), pp. 1–6. IEEE (2018,
November)
51. Amarasinghe, T., Aponso, A., Krishnarajah, N.: Critical analysis of machine learning based
approaches for fraud detection in financial transactions. In Proceedings of the 2018 International
Conference on Machine Learning Technologies, pp. 12–17 (2018, May)
52. Mohanty, S.P., Hughes, D.P., Salathé, M.: Using deep learning for image-based plant disease
detection. Front. Plant Sci. 7, 1419 (2016)
53. Minasny, B., McBratney, A.B., Malone, B.P.: Digital soil assessment. In Digital Soil
Assessments and Beyond, pp. 1–24. Springer (2016)
54. Matese, A., Toscano, P., Di Gennaro, S.F., Genesio, L., Vaccari, F.P., Primicerio, J., … &
Zaldei, A.: Intercomparison of UAV, aircraft and satellite remote sensing platforms for precision
agriculture. Remote Sensing, 7(3), 2971–2990 (2015)
55. Doshi-Velez, F., Kim, B.: Towards a rigorous science of interpretable machine learning. arXiv
preprint arXiv:1702.08608 (2017)
56. Diakopoulos, N.: Accountability in algorithmic decision making: a framework and key
questions. Data Discrim. Collect. Essays 2(2017), 10 (2016)
57. McMahan, H.B., Ramage, D., Talwar, K. Zhang, L., Zhu, M.: Communication-efficient learning
of deep networks from decentralized data. arXiv preprint arXiv:1602.05629 (2017)
58. Biesialska, M., Biesialska, K., Costa-Jussa, M.R.: Continual lifelong learning in natural
language processing: A survey. arXiv preprint arXiv:2012.09823 (2020)
59. Conneau, A., et al.: Unsupervised Cross-lingual Representation Learning at Scale. arXiv:1911.02116 (2019)
Unlocking Insights: Exploring Data
Analytics and AI Tool Performance
Across Industries
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 265
P. Singh et al. (eds.), Data Analytics and Machine Learning, Studies in Big Data 145,
https://doi.org/10.1007/978-981-97-0448-4_13
266 H. Mohapatra and S. R. Mishra
1 Introduction
D3.js, and Plotly. Some popular data preparation tools that work well with
ChatGPT include: Alteryx, Data Wrangler, OpenRefine, Pandas, and DuckDB. Some
popular machine learning tools that work well with ChatGPT include: scikit-learn,
TensorFlow, PyTorch, Apache Spark MLlib, and H2O.ai. Some popular statistical
analysis tools that work well with ChatGPT include: SPSS, SAS, R, Python, and
Julia. In addition to these specific tools, ChatGPT can also be used with a variety of
other data analytics tools, including: Data warehouses, Data lakes, Data streaming
platforms, Data governance tools, and Data integration tools.
ChatGPT, a large language model, can be effectively integrated with various data
analytics tools to enhance data analysis processes and extract valuable insights.
Its natural language processing capabilities enable it to understand and interpret
data from diverse sources, making it a versatile tool for data exploration, summa-
rization, and pattern recognition. ChatGPT can seamlessly collaborate with data
visualization tools to generate insightful charts, graphs, and dashboards, while also
working with data preparation tools to clean, format, and transform data for analysis.
Additionally, ChatGPT can be employed to augment machine learning models by
generating training data, tuning hyper-parameters, and interpreting results. Statis-
tical analysis tools can also benefit from ChatGPT’s ability to summarize statistical
findings, generate reports, and perform hypothesis tests. By leveraging ChatGPT’s
strengths in conjunction with these data analytics tools, data analysts can streamline
their workflows, gain deeper understanding of data, and make informed decisions.
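As a minimal sketch of the data-preparation role described above, a pandas cleaning step might look like the following. The column names and toy values here are invented for illustration, not taken from any dataset in this chapter:

```python
import pandas as pd

# Hypothetical raw survey data showing the kinds of issues a preparation
# step would fix: inconsistent casing, stray whitespace, missing values.
raw = pd.DataFrame({
    "sector": [" Health ", "finance", "HEALTH", None],
    "score": ["0.85", "0.82", None, "0.78"],
})

clean = (
    raw.assign(
        sector=raw["sector"].str.strip().str.lower(),        # normalize text
        score=pd.to_numeric(raw["score"], errors="coerce"),   # coerce numerics
    )
    .dropna()                  # drop rows with any missing field
    .reset_index(drop=True)
)

print(clean)
```

Rows with an unparseable score or a missing sector are dropped, leaving a tidy numeric table ready for analysis.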
The paper’s structure is outlined below. Section 2 provides an in-depth exami-
nation of the related literature. In Sect. 3, we delve into our exploration using the
AI tool. Technical domains and specific applications are thoroughly discussed in
Sect. 4, while Sect. 5 similarly addresses business and administrative sectors along
with pertinent applications. Section 6 provides a concise overview of observations
and performance evaluations with behavioral analysis of the AI tool followed by the
conclusion and references.
2 Related Work
it generated incorrect results, erroneous formulas, and similar inaccuracies [10].
Firing an appropriate query can also make AI Tool useful in software development
and modeling. AI Tool has demonstrated its capability to generate useful code snippets
that, on occasion, are correct and successfully accomplish the desired task specified
by the user. Moreover, AI Tool exhibits familiarity with various textual modeling
languages, including domain-specific languages (DSLs). Notably, the example of
Graph-GPT showcases the potential for language designers to instruct AI Tool on
the desired structure of a modeling language, resulting in the generation of code frag-
ments within that language. In the case of Graph-GPT, it employs a clever approach of
requesting a JSON-encoded representation of the graph, which can then be rendered
into a diagram. The possibilities for leveraging generative AI in modeling are vast,
and the exact ways in which it will transform the business and practices of modeling
in the future remain uncertain [11]. The uses of AI Tool can also be found in scientific
abstract writing [12] and in the health sector [13] too. Deep Learning has emerged as
a powerful tool for addressing the challenges posed by Big Data Analytics. Its ability
to extract complex patterns from massive volumes of data makes it suitable for tasks
such as semantic indexing, data tagging, and fast information retrieval [14]. In the
field of academic writing, AI Tool can be used in a revolutionary way [15]. Because
AI Tool is new, many researchers are exploring it in several ways; one such method
is chatting with AI Tool, in which the authors have shared their experience [16].
Although AI Tool is powerful and applicable in many settings, there are several
instances where faulty references can be found in its output [17]. Since its launch
in 2022, AI Tool, a query-oriented language generation tool, has garnered significant
attention. Although the initial excitement may have waned, the impact of AI Tool
has sparked lasting structural changes. Notably, academic journals have published
papers with AI Tool listed as an author, while certain educational institutions have
opted to prohibit its use due to concerns about potential misuse. Criticisms of AI
Tool have primarily revolved around its inaccuracies, often labeling it as a “bullshit
generator.” Additionally, some have highlighted the undesirable consequences that
arise from its utilization, such as the potential to undermine creativity.
However, we contend that there is an unaddressed issue at hand—the funda-
mental ideas and politics that drive the development of these tools and facilitate their
uncritical adoption [18]. Businesses have long relied on analytics, but the focus is
shifting toward artificial intelligence (AI) capabilities. Many AI systems are built
upon statistical and analytical foundations. By leveraging their existing analytical
expertise, companies can gain a competitive edge in their AI endeavors [19].
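The Graph-GPT pattern mentioned in the survey above, requesting a JSON-encoded representation of a graph and then rendering it into a diagram, can be sketched roughly as follows. The JSON schema and node names are invented for illustration and are not the tool's actual format:

```python
import json

# Hypothetical JSON-encoded graph of the kind such a tool might return.
graph_json = """
{
  "nodes": ["Customer", "Order", "Invoice"],
  "edges": [["Customer", "Order"], ["Order", "Invoice"]]
}
"""

def to_dot(graph: dict) -> str:
    """Render the parsed graph as Graphviz DOT text, ready for diagramming."""
    lines = ["digraph G {"]
    lines += [f'  "{n}";' for n in graph["nodes"]]
    lines += [f'  "{s}" -> "{t}";' for s, t in graph["edges"]]
    lines.append("}")
    return "\n".join(lines)

dot = to_dot(json.loads(graph_json))
print(dot)
```

The resulting DOT text can be fed to any Graphviz renderer to produce the diagram.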
The scope of AI tools is vast and extends across diverse sectors, revolutionizing the
way we approach challenges and opportunities. From healthcare and finance to manu-
facturing and entertainment, AI’s transformative capabilities have left an indelible
mark. In healthcare, AI aids in diagnosing diseases and personalizing treatment
plans, while in finance, it enhances fraud detection and market analysis. Industries
like manufacturing benefit from AI-driven automation for improved efficiency, and
the entertainment sector leverages AI to create immersive experiences. The ability
of AI tools to analyze vast datasets, recognize patterns, and make informed deci-
sions transcends boundaries, making them an indispensable asset in shaping the
present and future of countless sectors [20]. This study focuses on two sector
classifications: technical sectors and business-administrative sectors. Within each
category, five individual sectors were considered for response analysis.
Technical Sectors
• Medical and Health Care Sector
• Software Development
• Smart Agriculture
• Logistics and Supply
• Smart City Designing
Business Administrative Sectors
• Education and Academic
• Crime Monitoring
• Administrative Accounting
• Entertainment Industry
• Culture and Value Promotion
AI tools have a transformative role in the medical and health sector. They serve as
conversational interfaces for quick access to medical information, including research
papers, clinical guidelines, and drug details. Patient education is enhanced through
personalized health insights and answers to queries [4]. AI aids in symptom assess-
ment and initial guidance for seeking medical help. In telemedicine, AI integrates
for remote patient monitoring and virtual consultations, ensuring data gathering and
medication reminders. Analyzing electronic health records, AI organizes patient data
for efficient decision-making [5]. It supports mental health by offering coping strate-
gies and resources. Medical education benefits from AI simulations and feedback. In
research, AI extracts information, aids in data analysis, and facilitates patient recruit-
ment for clinical trials. The generated response from AI Tool has been cross verified
with the opinions of experts in the medical and health sector. For the evaluation process,
we have communicated with 34 doctors and 16 healthcare staff members. Figure 2a
illustrates the performance evaluation based on the parameters that are considered
in Table 2.
Fig. 2 Performance evaluation in several technical sectors by data analytics through AI tools
branching strategies, resolving merge conflicts, and recommending best practices for
collaboration and code management. AI Tool can offer guidance on setting up devel-
opment environments, configuring tools, and resolving environment-related issues.
It can assist developers in getting started with specific frameworks or platforms. The
generated response from AI Tool has been cross verified with opinions of developers
and testers. For the evaluation process we have communicated with 30 software
developers and 21 testing engineers. Figure 2b presents the performance evaluation
based on the parameters that are considered in Table 2.
While AI Tool aids logistics and supply chain management, human expertise remains
vital. AI serves as a virtual assistant, handling customer queries and providing real-
time support. It integrates with systems for personalized order tracking. AI analyzes
data for inventory, demand, and production, offering supply chain optimization
suggestions. However, human oversight is essential for complex situations and crit-
ical decisions based on AI recommendations. AI enhances efficiency but should be
employed in tandem with human judgment. AI Tool streamlines supplier manage-
ment, handling routine inquiries and supplier performance insights. It identifies new
suppliers and aids communication. AI analyzes supply chain data, predicting demand
changes and evaluating external risks. It suggests contingency plans, enhancing
decision-making. AI acts as a training tool, simulating scenarios for risk-free prac-
tice. It shares knowledge on best practices and emerging trends. AI enhances supplier
collaboration, risk assessment, and skill development in supply chain management.
The generated response from AI Tool has been cross verified with the opinions of
business owners and the logistics departments of various suppliers. For the evaluation
process we have communicated with 13 business owners and 9 logistics suppliers. Figure 2d
presents the performance evaluation based on the parameters that are considered
in Table 2. This integration empowers logistics professionals to extract invaluable
insights, forecast demand, optimize routes, and mitigate potential disruptions in real
time. ChatGPT’s ability to comprehend intricate supply chain terminology facilitates
personalized recommendations, enabling agile decision-making and swift problem-
solving. By harnessing data analytics through ChatGPT, organizations can streamline
operations, minimize costs, enhance customer satisfaction, and create resilient supply
chains capable of adapting to dynamic market conditions, ultimately redefining the
benchmarks for efficiency and competitiveness in the logistics industry.
AI Tool enhances smart city planning with data-driven insights and citizen engage-
ment. It aids decision-making, creating livable urban environments. AI acts as a
virtual assistant, gathering citizen input and feedback.
It analyzes intricate urban data for trends and visualizations, aiding informed city
planning. Through data analytics powered by ChatGPT, cities can evolve into adap-
tive, efficient, and citizen-centric environments, fostering innovation and resilience
for the urban landscape of the future. AI Tool contributes to urban planning by
suggesting designs based on population, transport, green spaces, and energy. It opti-
mizes land use and connectivity while integrating sustainability. AI analyzes real-
time traffic data, optimizing flow and reducing congestion. It recommends traffic
management systems and efficient routes. AI aids in energy management strategies
for smart cities. AI Tool enhances urban planning by analyzing energy patterns,
suggesting efficient technologies, and integrating renewables. It optimizes energy
distribution, monitors air quality, and manages waste. AI aids in emergency plan-
ning, analyzing data for disaster preparedness and resource allocation. It fosters
stakeholder collaboration, acting as a knowledge base. AI integrates sustainability
and green initiatives, promoting eco-friendly tech and recycling programs. It evalu-
ates policy impact on smart cities, simulating effects on transportation, energy, and
services to inform decisions. The generated response from AI Tool has been cross
verified with the opinions of city planners and civil engineers. For the evaluation
process we have communicated with 4 city planners and 7 civil engineers. Figure 2e
presents the performance evaluation based on the parameters that are considered in
Table 2.
Fig. 3 Performance evaluation in several administrative sectors by data analytics through AI tools
The generated response from AI Tool has been cross verified with the opinions of 34
clerical staff of the administrative section at KIIT University. Figure 3c presents the
performance evaluation based
on the parameters that are considered in Table 2.
AI Tool empowers the entertainment industry with innovative content creation and
interactive experiences. It generates scripts, dialogues, and characters, fostering
creativity. AI enhances storytelling, character development, and narrative explo-
ration. It crafts interactive virtual characters, enabling immersive experiences across
various platforms. AI Tool transforms entertainment with personalized recommen-
dations, interactive storytelling, and virtual characters. It analyzes preferences for
suggestions, creating immersive experiences. AI shapes narratives, responds to
choices, and offers personalized storylines. It crafts virtual assistants for celebs or
characters, deepening audience connections. In video games, AI enhances dialogues
and character depth. It engages fans on social media, maintaining interactive pres-
ence. AI elevates live events with real-time engagement and interactive elements. It
generates synthetic voices for various roles. AI sparks fan engagement by simulating
conversations and discussing fictional worlds, fostering creativity and community.
The generated response from AI Tool has been cross verified with opinions of content
creators on YouTube. For the evaluation process we have communicated with 34
content creators on YouTube. Figure 3d presents the performance evaluation based
on the parameters that are considered in Table 2.
In the realm of entertainment, the amalgamation of data analytics and ChatGPT
heralds a paradigm shift in understanding audience preferences, content trends,
and optimizing creative strategies. Moreover, data analytics powered by ChatGPT
assists in deciphering trends, forecasting market demands, and optimizing content
creation, leading to more tailored, diverse, and engaging entertainment experiences
for audiences globally. This fusion reshapes the entertainment landscape, fostering
innovation and enriching the connection between creators and consumers in an
ever-evolving industry.
and morals. It simulates cultural exhibitions, providing historical context and inter-
activity. AI preserves heritage by organizing cultural information digitally. It fosters
intercultural dialogue, artistic appreciation, and ethical reflection, enriching global
connections and cultural understanding. AI Tool aids cultural tourism, suggesting
landmarks and events. It fosters discussions on values and decisions, enhancing
values-based choices. AI supports social impact campaigns for diversity and inclu-
sivity, interacting with users to raise awareness and encourage actions. It promotes
cultural engagement, values-based decisions, and positive impact, contributing to a
more informed and culturally aware society. The generated response from AI Tool
has been cross verified with opinions of eleven humanities and social science profes-
sors of KIIT University. Figure 3e presents the performance evaluation based on the
parameters that are considered in Table 2.
AI Tool has become a buzzword, and in the last two sections we have witnessed its
implications for various sectors of society. Although it is a powerful tool for the
modern digital world, it has both positive and negative effects. While AI Tool and
similar language models offer many benefits, there are also potential negative
impacts on society; the primary ones are listed below. To mitigate these negative
impacts, responsible development, transparent practices, ongoing research, and
regulation are important. Ethical guidelines, bias detection and correction
mechanisms, and user education on the limitations and potential biases of AI
systems can help address these concerns and promote the responsible use of AI Tool
and similar technologies.
• Misinformation and propaganda: AI Tool can generate text based on the input
it receives, including false or misleading information. If used irresponsibly or
without proper oversight, it can contribute to the spread of misinformation,
conspiracy theories, or propaganda, which can undermine trust, create confusion,
and harm society.
• Bias: AI Tool learns from the data it’s trained on, which can include biases present
in the training data. If the training data contains biases related to race, gender,
or other sensitive topics, AI Tool may inadvertently perpetuate and amplify these
biases in its generated responses, leading to unfair or discriminatory outcomes.
• Accuracy and reliability: AI Tool's outputs are not independently verified or
fact-checked. This lack of accountability raises concerns
about the accuracy and reliability of the information it provides, potentially leading
users to make decisions based on flawed or misleading advice.
• Ethical exploitation: AI Tool can be exploited for unethical purposes, such as
generating malicious content, engaging in harmful behaviors, or deceiving indi-
viduals. This raises concerns about privacy, security, and the potential for abuse
by malicious actors.
• Social engineering: AI Tool can be used to manipulate individuals by generating
persuasive or deceptive messages. Malicious actors could exploit this technology
to deceive or exploit unsuspecting users for personal gain or malicious purposes.
• Psychological effects: interacting with AI systems like AI Tool may have
psychological effects on individuals. For some users, relying heavily on
AI-generated responses for emotional support or guidance could lead to a sense of
detachment, isolation, or a reduction in critical thinking skills.
• Privacy and security: AI Tool interacts with users and collects data during
conversations, raising privacy concerns. Depending on the use and storage of this
data, there is a risk of misuse, unauthorized access, or breaches that compromise
user privacy and security.
• Technological divide: the widespread adoption and benefits of AI Tool may not be
accessible to all individuals due to factors such as cost, infrastructure
limitations, or digital literacy. This can contribute to a technological divide,
exacerbating existing inequalities in society.
Analyzing the performance of AI Tool's responses can help assess its effectiveness
and identify areas for improvement. By employing the approaches described here, one
can gain insight into the performance of AI Tool, understand its strengths and
limitations, and make informed decisions to improve its overall performance and
user satisfaction. Figure 4 illustrates the performance analysis based on the
collected dataset; we analyzed the data using the pandas library in a Python
environment.
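As a rough sketch of this analysis step, the per-domain averaging behind such a comparison might look as follows in pandas. The ratings table and all values are hypothetical, not the study's actual data:

```python
import pandas as pd

# Hypothetical per-response ratings in the shape the evaluation describes:
# one row per rated response, one column per metric.
ratings = pd.DataFrame({
    "domain":         ["health", "health", "entertainment", "entertainment"],
    "accuracy":       [0.90, 0.88, 0.75, 0.70],
    "relevance":      [0.88, 0.85, 0.72, 0.74],
    "coherence":      [0.80, 0.78, 0.77, 0.76],
    "grammaticality": [0.90, 0.89, 0.88, 0.87],
    "fluency":        [0.86, 0.85, 0.84, 0.83],
})

# Average each metric per domain: the aggregation behind a
# per-domain comparison plot.
per_domain = ratings.groupby("domain").mean(numeric_only=True)
print(per_domain.round(3))
```

Each cell of `per_domain` is then one point on a metric line in a figure like the one discussed here.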
The proposed metric includes the following parameters: accuracy (A), rele-
vance (R), coherence (C), grammaticality (G), and fluency (F). This metric can be
used to assess how well the generated responses align with the desired outcomes
and user expectations. We have compared the human ratings with system-generated
responses to gauge the model’s performance and identify areas for refinement. The
graph shows the performance of different AI tool responses for different domains.
The red line represents the accuracy of the responses, while the blue line repre-
sents the relevance of the responses. The green line represents the coherence of the
responses, the orange line represents the grammaticality of the responses, and the
purple line represents the fluency of the responses.
The graph shows that the accuracy and relevance of AI tool responses vary
depending on the domain. For example, AI tool responses are more accurate and
relevant in the health domain than in the entertainment domain. This is likely because
there is more structured data available in the health domain, which makes it easier for
AI tools to learn and generate accurate and relevant responses. The graph also shows
that the coherence, grammaticality, and fluency of AI tool responses are generally
good across all domains. However, there is some variation, with the best scores in the
academic domain and the worst scores in the crime domain. This is likely because
the academic domain requires more complex and nuanced language, while the crime
domain requires more factual and objective language.
Overall, the graph shows that AI tool responses are generally performing well
across a variety of domains. However, there is still room for improvement, particularly
in terms of accuracy and relevance in some domains.
Each metric is assigned a score ranging from 0 to 1, where a higher score indicates
better performance. These scores represent the average assessment of AI Tool’s
responses based on the evaluation process conducted. The mathematical expression
to calculate the overall score (OS) is represented in Eq. (1). Table 2 illustrates the
performance evaluation of AI Tool responses using Eq. (1).
OS = (A + R + C + G + F) / 5 (1)
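The overall score in Eq. (1) is simply the arithmetic mean of the five metric scores. A one-line implementation, with purely illustrative input values (not taken from Table 2):

```python
def overall_score(a, r, c, g, f):
    """Overall score OS = (A + R + C + G + F) / 5, per Eq. (1)."""
    return (a + r + c + g + f) / 5

# Illustrative values only:
print(round(overall_score(1.0, 0.8, 0.6, 0.9, 0.7), 2))  # 0.8
```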
The complexity of AI Tool responses has been calculated using two parameters: 'L',
the average sentence length in words, and 'W', the average word length in characters.
The complexity 'C' is given by Eq. (2).
C = L × W (2)
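Eq. (2) can be computed directly from a response's text. The sketch below uses a naive split on periods as the sentence boundary, which is a simplifying assumption for illustration:

```python
def complexity(text: str) -> float:
    """C = L * W, per Eq. (2): L is the average sentence length in words,
    W is the average word length in characters. Splitting sentences on '.'
    is a simplification for this sketch."""
    sentences = [s for s in text.split(".") if s.strip()]
    words = text.replace(".", " ").split()
    L = len(words) / len(sentences)              # avg words per sentence
    W = sum(len(w) for w in words) / len(words)  # avg chars per word
    return L * W

print(complexity("AI tools help analysts. They summarize data quickly."))  # 21.5
```

Here L = 8 words / 2 sentences = 4 and W = 43 characters / 8 words = 5.375, giving C = 21.5.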
Table 2 shows the overall performance evaluation of the AI tool for all responses
from all considered sectors. The table shows the metric accuracy, relevance, coher-
ence, grammaticality, and fluency scores for all responses from all considered sectors.
The accuracy score measures how well the AI tool’s responses match the ground truth.
The relevance score measures how well the AI tool’s responses are relevant to the
query. The coherence score measures how well the AI tool’s responses are structured
and easy to understand. The grammaticality score measures how well the AI tool’s
responses follow the rules of grammar. The fluency score measures how natural and
easy to read the AI tool’s responses are. The overall performance of the AI tool
is good, with an average score of 0.85. The accuracy score is particularly high, at
0.85. This means that the AI tool’s responses are generally accurate and match the
ground truth. The relevance score is also good, at 0.82. This means that the AI tool’s
responses are generally relevant to the query. The coherence, grammaticality, and
fluency scores are 0.78, 0.88, and 0.85, respectively, with coherence the lowest of the five metrics.
This means that the AI tool’s responses could be improved in terms of structure,
grammar, and naturalness. However, overall, the AI tool is performing well and is
generating responses that are accurate, relevant, coherent, grammatical, and fluent.
These findings can be used to refine AI tools, enhance their performance, and align them
with user expectations while upholding ethical standards in the responses provided. In the
sub-sections below we analyze the performance of the AI tool on different types of queries
from the user end. Table 4 illustrates a comparison among responses based on the nature
of the queries.
8 Conclusion
References
1. Elavarasan, R.M., Pugazhendhi, R., Irfan, M., Mihet-Popa, L., Khan, I.A., Campana, P.E.:
State-of-the-art sustainable approaches for deeper decarbonization in Europe—an endowment
to climate neutral vision. Renew. Sustain. Energy Rev. 159 (2022)
2. Davenport, T.H.: From analytics to artificial intelligence. J. Bus. Anal. 1(2), 73–80 (2018).
https://doi.org/10.1080/2573234X.2018.1543535
3. Sánchez-Ruiz, L.M., Moll-López, S., Nuñez-Pérez, A., Moraño-Fernández, J.A., Vega-Fleitas,
E.: ChatGPT challenges blended learning methodologies in engineering education: a case study
in mathematics. Appl. Sci. 13(10), 6039 (2023). https://doi.org/10.3390/app13106039
4. Javaid, M., Haleem, A., Singh, R.P.: Chatgpt for healthcare services: an emerging stage for an
innovative perspective. Bench Council Trans. Benchmarks, Stand. Eval. 3(1), 100105 (2023)
5. Frederico, G.F.: Chatgpt in supply chains: initial evidence of applications and potential research
agenda. Logistics. 7(2) (2023)
6. Mohapatra, H.: Socio-technical challenges in the implementation of smart city. International
Conference on Innovation and Intelligence for Informatics, Computing, and Technologies
(3ICT), pp. 57–62 (2021)
7. Zhang, X., Shah, J., Han, M.: Chatgpt for fast learning of positive energy district (ped): a trial
testing and comparison with expert discussion results. Buildings. 13(6) (2023)
8. Gao, Y., Tong, W., Wu, E.Q., Chen, W., Zhu, G.Y., Wang, F.Y.: Chat with chatgpt on interactive
engines for intelligent driving. IEEE Trans. Intell. Veh. 8(3), 2034–2036 (2023)
9. Du, H., Teng, S., Chen, H., Ma, J., Wang, X., Gou, C., Li, B., Ma, S., Miao, Q., Na, X.,
Ye, P., Zhang, H., Luo, G., Wang, F.Y.: Chat with chatgpt on intelligent vehicles: an ieee tiv
perspective. IEEE Trans. Intell. Veh. 8(3), 2020–2026 (2023)
10. Prieto, S.A., Mengiste, E.T., García de Soto, B.: Investigating the use of chatgpt for the
scheduling of construction projects. Buildings, 13(4) (2023)
11. Shoufan, A.: Exploring students’ perceptions of chatgpt: thematic analysis and follow-up
survey. IEEE Access, 11, 38805–38818 (2023)
12. Lo, C.K.: What is the impact of chatgpt on education? a rapid review of the literature. Educ.
Sci. 13(4) (2023)
13. Castellanos-Gomez, A.: Good practices for scientific article writing with chatgpt and other
artificial intelligence language models. Nanomanufacturing. 3(2), 135–138 (2023)
14. Rahman, M.M., Watanobe, Y.: Chatgpt for education and research: opportunities, threats, and
strategies. Appl. Sci. 13(9) (2023)
15. Wang, F.Y., Yang, J., Wang, X., Li, J., Han, Q.L.: Chat with chatgpt on industry 5.0: learning and
decision-making for intelligent industries. IEEE/CAA J. Autom. Sin. 10(4), 831–834 (2023)
16. Abdullah, M., Madain, A., Jararweh, Y.: Chatgpt: fundamentals, applications and social
impacts. Ninth International Conference on Social Networks Analysis, Management and Secu-
rity (SNAMS), pp. 1–8 (2022); Sharma, P., Dash, B.: Impact of big data analytics and chatgpt
on cybersecurity. 4th International Conference on Computing and Communication Systems
(I3CS), pp. 1–6 (2023)
17. Feng, Y., Poralla, P., Dash, S., Li, K., Desai, V., Qiu, M.: The impact of chatgpt on streaming
media: a crowdsourced and data-driven analysis using twitter and reddit. In 2023 IEEE 9th Intl
Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High
Performance and Smart Computing, (HPSC) and IEEE Intl Conference on Intelligent Data and
Security (IDS), pp. 222–227 (2023)
18. Grbic, D.V., Dujlovic, I.: Social engineering with chatgpt. In 2023 22nd International
Symposium INFOTEH-JAHORINA (INFOTEH), pp. 1–5 (2023)
19. Guo, C., Lu, Y., Dou, Y., Wang, F.Y.: Can chatgpt boost artistic creation: the need of imaginative
intelligence for parallel art. IEEE/CAA J. Autom. Sin. 10(4), 835–838 (2023)
20. Kshetri, N.: Chatgpt in developing economies. IT Prof. 25(2), 16–19 (2023)
21. Mohapatra, H.: Performance Evaluation of AI Bot Responses (2023)
Lung Nodule Segmentation Using
Machine Learning and Deep Learning
Techniques
Abstract Global lung cancer mortality is growing. This supports early cancer
screenings. CT lung nodule segmentation is complicated and affects medical
research, surgical planning, and diagnostic decision support. All are complex issues
with important applications. Machines and humans alike struggle to segment non-solitary
nodules with uncertain boundaries; solitary nodules, which have distinct limits, are easier
to delineate. Several researchers have proposed CT-based lung evaluation algorithms,
driven by growing imaging datasets and the need to swiftly and precisely define normal
and diseased lung lobes. Multi-process lung segmentation methods with manual empirical
parameter modifications are common. Accurate segmentation of lung slices and nodules
using ML and DL is an essential first step for cancer detection, enabling cancer to be
detected at various stages. Deep learning techniques have improved healthcare image
analysis. A few deep learning approaches, such as ResNet-50/101, VGG16, autoencoders,
U-Net with modifications, and graph convolutional networks, have been used to classify
lung nodules, COVID-19, and pneumonia. This chapter includes a summary of
datasets that are open to the public and are the primary resources utilized by scholars
working in this area. A direct look into the field of diagnosing lung disorders is what
we hope to achieve with the information provided in this chapter.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 289
P. Singh et al. (eds.), Data Analytics and Machine Learning, Studies in Big Data 145,
https://doi.org/10.1007/978-981-97-0448-4_14
290 S. Chauhan et al.
1 Introduction
Lung cancer is the second most frequent kind of cancer in the world. Rapid division of
malignant cells, which can invade neighboring structures and spread to other organs,
is the hallmark of cancer [1]. On August 1, the World Health Organization observes
World Lung Cancer Day, an annual event designed to inform and inspire people all
over the world to take action against the disease by increasing funding for research
and spreading awareness. The most common way that cancer kills people is through
the spread of metastases throughout the body. In 2023, the American Cancer Society
anticipates there will be a total of 238,340 new cases of lung cancer in the US (117,550
in men and 120,790 in women) shown in Fig. 1. There were almost 127,070 fatalities
attributable to lung cancer (67,160 in males and 59,910 in women) [2]. According
to the findings of a study conducted in India, the presence of adenocarcinoma as a
subtype of lung cancer was significantly associated with the occurrence of indoor air
pollution including factors such as exposure to second-hand smoking and the type
of fuel used in food preparation [3]. An investigation conducted in India found that
the mean and median ages of lung cancer patients (N = 1301) were 58.6
(SD = 10.8) and 60 (interquartile range: 51–65), respectively
[4]. The extremely high rate of tuberculosis (TB) infection in India contributes to an
increased risk of obtaining a false-positive result from LDCT testing [5]. 5.9% of all
new cases of cancer are due to lung cancer, and 8.1% of all deaths from cancer are
attributable to lung cancer [6]. Several of these nations, including Indonesia, Hong
Kong, India, Singapore, Malaysia, and the Philippines, have yet to establish a lung
cancer screening program that is supported by their respective governments at the
present time. Delays in cancer diagnosis and treatment were also observed during the
2019 coronavirus disease (COVID-19) pandemic due to hospital and clinic closures,
disruptions in work and health insurance, and a general climate of anxiety about
being exposed to the virus [7].
Fig. 1 Statistics of lung cancer and deaths of males and females according to American Cancer
Society in the US
Lung cancer which is exacerbated by risk factors like cigarette smoking, air pollu-
tion, and exposure to carcinogenic substances in job environments is the biggest cause
of death from cancer worldwide [8]. The five-year survival rate for individuals
diagnosed with advanced pulmonary cancer is approximately 16%. However, for
patients diagnosed in the very earliest stages of the disease who receive good
treatment, the 5-year survival rate can be increased by a factor of 4–5 [5]. In
comparison to other types of imaging such as chest radiology, sputum
cytology, magnetic resonance imaging (MRI) scans, and positron emission tomog-
raphy, the utilization of CT scan as a diagnostic instrument for the prompt diag-
nosis and recognition of lung cancer is advantageous due to its affordability, rapid
visualization time and widespread accessibility [9, 10]. Machine learning and deep
learning methods for segmenting lung nodules are presented in this chapter. Since
identifying lobes also has vital uses in medical research, disease assessment, and
therapy planning, techniques for doing so were naturally incorporated into lung
nodule segmentation (Fig. 2).
Image segmentation is the process of dividing a digital image into many segments for
the purpose of using the resulting information for efficient and accurate item recogni-
tion. Its purpose is to extract useful data from subsets sharing common characteristics,
such as pixel intensity values and colour data [11], as shown in Fig. 3. Segmenting an
image yields either a set of segments that together cover the entire image, or a set of
contours extracted from it. Each pixel in a given space represents a single quality, such as
hue, saturation, or texture [12]. Medical image segmentation analyses images using
computer vision, segmenting 2D or 3D images of human organs, soft tissues, and
diseased regions, as shown in Fig. 2. It compares neighbouring pixels’ similarities or
differences. This technology allows clinicians to perform subjective or even statistical
analysis of lesions and other regions of concern, which improves the accuracy and
dependability of medical diagnoses.
∪_{x=1}^{N} R_x = I, R_x ∩ R_y = ∅, ∀ x ≠ y, x, y ∈ {1, …, N} (1)
where Rx denotes a region whose pixels all satisfy the chosen relevance criterion (and
likewise for Ry); x and y are used to differentiate between the different regions, and
N is the number of regions after division, a positive integer of at least 2 [13].
The procedure of segmenting medical images can be broken down into the
following stages:
Step 1: Acquire a database of medical imaging data. A common practice in the
field of machine learning for image processing is to split the dataset into a training
set, a validation set, and a test set. The training set is used to teach the network
how to make predictions, the test set is used to confirm the model’s correctness,
while the validation set is used to adjust the model’s hyperparameters.
Step 2: Pre-process and augment the images; a standard procedure is to randomly
rotate and resize the input images to increase the size of the data collection.
Step 3: If a nodule is present, the medical image must be segmented using the
proper medical image segmentation method before the segmented images can be
exported.
Step 4: The effectiveness of the estimates is evaluated. Detection, segmentation, and
classification are the three sub-operations of medical image segmentation verification
that require effective performance indicators and validation.
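The dataset split in Step 1 can be sketched as follows. This is a minimal sketch: the 80/10/10 ratio, the fixed seed, and the file names are illustrative assumptions, not values prescribed by the chapter.

```python
# Shuffle a list of image identifiers and split it into train/validation/test
# sets, as described in Step 1 above.
import random

def split_dataset(items, train=0.8, val=0.1, seed=42):
    items = list(items)
    random.Random(seed).shuffle(items)  # reproducible shuffle
    n = len(items)
    n_train = int(n * train)
    n_val = int(n * val)
    return (items[:n_train],                  # teaches the network
            items[n_train:n_train + n_val],   # tunes hyperparameters
            items[n_train + n_val:])          # confirms correctness

train_set, val_set, test_set = split_dataset(
    [f"scan_{i:03d}.dcm" for i in range(100)])
print(len(train_set), len(val_set), len(test_set))  # 80 10 10
```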
Since they don’t necessitate hours of human labour, Artificial Intelligence (AI) algo-
rithms are ideally suited to acquire measurements and segmentations from medical
images. Because the segmentation findings for a given input image will be consistent
regardless of the observer, the automatic analysis also gets rid of inter- and intra-
observer disparities. In recent years, deep learning-based technologies in particular
have claimed some significant successes [13]. Each pixel or voxel in an image can
be labelled or given a contour as an output from a segmentation method.
i. Contour-Based segmentation
ii. Voxel-Based Segmentation
iii. Region-Based segmentation
i. A contour-based approach:
It looks for the boundary between the structure and its environment. This is like
how a radiology specialist separates an organ from other parts of the body by
drawing a line between the two. Snakes, the active contour model, is a classic
contour-based segmentation method. This approach begins with a contour near
the item to be segmented and iteratively deforms it toward the boundary.
ii. A voxel-based approach:
It works like classification methods since the algorithm checks each voxel
to see if it belongs to the structure to segment. Segments are produced once
the algorithm addresses each image voxel. The thresholding technique is an
example of a straightforward voxel-based segmentation method. This approach
compares voxel intensity to a threshold. The algorithm will mark voxels above
the value as part of the structure and below the value as “surroundings” (or vice
versa for hypo-intense structures).
iii. Region-based segmentation:
Region-based segmentation can proceed in two ways. One approach segments a
structure in one image (or group of images) and then transfers the segmentation to
a second image by means of a transformation, a precise description of how to map
the first image onto the second; applying this transformation to the first segmentation
yields the segmentation of the second image. The other approach grows regions by
iteratively adding pixels that are similar to and connected to a seed pixel. For areas
with consistent grayscale, similarity measures such as grayscale deltas are employed,
and connectivity is used to keep distinct areas of the image separate.
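As a rough illustration, the voxel-based thresholding (ii) and region-growing (iii) approaches can be sketched in pure Python on a small 2D grid. Real CT data is 3D, and the `threshold` and `delta` values below are made-up parameters, not values from the chapter.

```python
# Toy sketches of two segmentation strategies described above.
from collections import deque

def threshold_segment(image, threshold):
    """Voxel-based: label every pixel above the threshold as structure (1),
    everything else as surroundings (0)."""
    return [[1 if v > threshold else 0 for v in row] for row in image]

def region_grow(image, seed, delta):
    """Region-based: grow from the seed, absorbing 4-connected neighbours
    whose grayscale differs from the seed value by at most delta."""
    rows, cols = len(image), len(image[0])
    sr, sc = seed
    region, queue = {seed}, deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in region
                    and abs(image[nr][nc] - image[sr][sc]) <= delta):
                region.add((nr, nc))
                queue.append((nr, nc))
    return region

slice_2d = [
    [100, 102,  10],
    [101,  99,  10],
    [ 10,  10,  10],
]
print(threshold_segment(slice_2d, threshold=50))
# [[1, 1, 0], [1, 1, 0], [0, 0, 0]]
print(sorted(region_grow(slice_2d, seed=(0, 0), delta=5)))
# [(0, 0), (0, 1), (1, 0), (1, 1)]
```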
Digital CT images enable the segmentation of lungs and lobes, detecting anatomic
boundaries and aberrant lung tissue based on pathological processes and diseases
[11]. Hand-crafted feature-based and deep learning-based lung segmentation algo-
rithms were proposed by researchers. Deep neural network-based approaches can
automatically learn representative features without empirical parameter adjustments,
unlike region-expanding, active contour models, and morphological-based models
[12].
4 Literature Work
Scopus is a database that indexes articles from journals, conferences, and books in
the fields of computer science and engineering. This is especially helpful for the
advanced and smart system literature on lung nodule segmentation. To that end, we
used the Scopus database to gather the necessary research papers for our investi-
gation. We settled on the search criteria in order to facilitate the downloading of
the study papers: “Segmentation” AND “CTimages” AND “Nodule” AND “detec-
tion” OR “CNN” OR “DNN” OR “AI” OR “deep learning” OR “machine learn-
ing” OR “supervised learning” There are Boolean functions “AND” and “OR” have
been used as operator to combine search terms and narrow results. It guarantees that
keywords will be used to retrieve academic literature. These criteria are applied to
“Article Title” and “Article Keywords” during the first download. The initial round
of searches yielded a total of 5,547 usable research papers.
Fig. 5 This figure is showing how the articles were chosen for the analysis, as well as how many
(N) papers were included in each step
4.2 Findings
This part is divided even further into three subparts: statistics broken down by year
and journal, information broken down by theme, and information breaking down
the relationship between ideas. The research begins with a breakdown, categorized
by year, of the number of papers published, the publications included, and the lung
nodule segmentation. The second component of the research shows that certain terms
appear often in both the titles and authors’ keywords of the articles, which may be
used to infer the overall topic of the research. The report provides examples of the
various domains in use in subsection three.
Figure 6 shows how the systematic review’s research papers were spread out
over time. A total of 152 research papers were studied for the literature survey. We then
used the papers’ keywords and titles to determine their overarching themes. As can
be seen in Fig. 7, the aforementioned 152 articles span 36 publications and 10 distinct
biomedical image-processing capabilities. Although we are considering 122 studies
for the systematic literature review, we have identified 35 as being of a generic
character because they do not directly address reconstruction but rather use detection
and classification techniques. Therefore, only about 50 studies were included
in our meta-analysis. Studies with titles like “computed tomography,” “Convolu-
tional neural network,” “Deep learning,” “Segmentation,” “Detection,” “Nodule,”
“Classification,” “CT images,” or “DNN” share certain commonalities.
The quality of the papers and conferences was evaluated using several criteria. The
number of times a journal is cited, the size of its readership, the prestige of its
reviewers, its impact factor, and the calibre of its editorial board are only a few of
the factors taken into account in this evaluation. The chosen journals for this review
of Lung nodule segmentation are summarized in Table 1.
Application fields of different journals: identification, segmentation, classification,
diagnosis, detection, image enhancement, reconstruction, prediction, image analysis,
feature extraction, quantification, optimization, simulation, and determination
Table 1 List of top 10 journals out of 36 in the field of lung nodule segmentation

S. No | Name of journal | Published by | Impact factor
1 | IEEE Transactions on Neural Networks and Learning Systems | Institute of Electrical and Electronics Engineers Inc | 14.255
2 | Medical Image Analysis | Elsevier B.V | 13.828
3 | IEEE Transactions on Medical Imaging | Institute of Electrical and Electronics Engineers Inc | 10.6
4 | Artificial Intelligence Review | Springer Science and Business Media B.V | 9.588
5 | Computerized Medical Imaging and Graphics | Elsevier Ltd | 7.422
6 | Diagnostic and Interventional Radiology | Turkish Society of Radiology | 7.242
7 | Computer Methods and Programs in Biomedicine | Elsevier Ireland Ltd | 7.027
8 | IEEE Journal of Biomedical and Health Informatics | Institute of Electrical and Electronics Engineers Inc | 7.021
9 | Computers in Biology and Medicine | Elsevier Ltd | 6.698
10 | Cancers | MDPI | 6.639
5 Algorithms Used
neural networks (IRCNNs) have also been studied with the hope of better classifying
lung nodules as benign or malignant [24].
Low-dose computed tomography (LDCT) has been proven to be 20% more effective
than X-rays at reducing lung cancer-specific mortality [25]. Underscoring its potential
for lung cancer identification, the United States Preventive Services Task Force
(USPSTF) recommends annual LDCT screening for those at risk of lung illness [26,
27]. Building on prior research, this chapter accurately locates lung nodules using
LDCT images. On apparent diffusion coefficient (ADC) MRI, five deep-learning
networks were evaluated: multiple
resolution residually connected network (MRRN) that is regularized in training with
deep supervision implemented into the last convolutional block (MRRN-DS), Unet,
Unet++, ResUnet, and fast panoptic segmentation (FPSnet) and FPSnet-SL for
high accuracy [23]. DenseNet improves CNN model training and propagation of
features by reducing vanishing-gradient. The work employs a faster R-CNN model
for detection and a DenseNet model for feature map extraction [28]. Inception-
Resnet’s self-attention mechanism helps improve convolutional neural network clas-
sification performance by performing standard classification and identifying chest
radiograph disorders via the classifier for auxiliary COVID-19 diagnosis at the
medical level [29]. ULD CT scans were rebuilt using FBP, ASIR-V, and DLIR.
Image noise was assessed using three-dimensional lung tissue segmentation. The
application of a deep learning–based nodule evaluation system by radiation oncolo-
gists facilitated the detection and quantification of nodules, as well as the identifica-
tion of imaging characteristics associated with malignancy. The study employed the
Bland–Altman method and repeated-measures evaluation of variance to assess and
evaluate the images obtained from ultralow-dose computed tomography (ULD CT)
and contrast-enhanced computed tomography (CECT) [30]. Inspired by U-Net and
residual learning, ResNet50 is a classification model that improves VGG19 when
used to segment lung nodules. Analysis of Deep learning/CNN-based lung nodule
segmentation/detection is shown in Table 2. To retain spatial information, VGG19 keeps
the 7 × 7 convolutional layer and utilizes the maximum pooling layer for down-
sampling. ResNet-50 learns additional characteristics with more layers. Lung nodules
are small and carry significant spatial information, so ResNet-50 is upgraded to create
3D ResNet-50, a classification network for lung nodule identification [31].
6 Evaluation Metrics
Table 2 (continued) Analysis of deep learning/CNN-based lung nodule segmentation/detection:

4. 2022, Xing, Haiqun et al. [35] — “A deep learning-based post-processing method for automated pulmonary lobe and airway trees segmentation using chest CT images in PET/CT”. Data: gathered under the Declaration of Helsinki (2013 version) with the permission of the Peking Union Medical College Hospital’s Ethics Committee. Method: DenseVNet-based CNN supplemented by additional processing methods. Results: overall Dice coefficient = 0.972, Hausdorff distance = 12.025 mm, Jaccard coefficient = 0.948.

5. 2022, Zhou, Wen et al. [36] — “Deep learning-based pulmonary tuberculosis automated detection on chest radiography: large-scale independent testing”. Data: the imaging archive (TCIA) collection Pediatric-CT-SEG. Method: V-Net auto-segmentation-modified FCN 3D V-Net. Results: median Dice similarity coefficient (DSC) of 0.52 for the duodenum, 0.74 for the pancreas, 0.92 for the stomach, and 0.96 for the heart.
Fig. 8 Confusion matrix: predicted positive (1) comprises TP and FP; predicted negative (0) comprises FN and TN
the evaluation is as accurate as possible [37]. Every single forecast is built from each
and every one of the aforementioned positive (P) and negative (N) examples. P
is made up of true positives (TP) and false negatives (FN), while N is made up of
true negatives (TN) and false positives (FP), as shown in Fig. 8.
These metrics are calculated as:
• When a prediction-target mask pair’s IoU score is higher than a certain threshold,
a true positive is noticed. This threshold can vary from system to system.
• When a false positive occurs, it means that a predicted object mask does not have
an accompanying ground truth object mask.
• A ground truth object mask that does not have an associated anticipated object
mask is said to have a false negative.
The accuracy score is calculated by dividing the number of correct predictions (both
positive and negative) by the total number of predictions; it is also called the Rand
index, given in Eq. (2). Precision relates to how close measurements are to one
another, while accuracy relates to how close they are to known or actual values;
Eq. (3) gives the precision. Precision measures how accurately pixels identified as
lung nodules truly are nodule pixels. Recall divides the number of correctly identified
nodule pixels by the number of all actual nodule pixels; basically, it is a measure of
how completely your model recovers the actual positives that occur.
Accuracy = (TP + TN) / (TP + TN + FP + FN) (2)

Precision = TP / (TP + FP) (3)
were present in the data set. The term “sensitivity” is another name for it that is
sometimes used in Eq. (4).
TPR / Sensitivity / Recall = TP / (TP + FN) (4)
F1 Score (Dice Coefficient) is needed when you want to seek a balance between
Precision and Recall in Eq. (5). It evaluates a model’s capacity for prediction by
focusing on how well it performs inside individual classes rather than evaluating the
model as a whole, as accuracy does.
F1 Score = 2 × (Precision × Recall) / (Precision + Recall) (5)
Specificity: The specificity of the model is the degree to which it correctly identifies
real negatives, while the sensitivity of the model is the degree to which it correctly
identifies true positives, as in Eq. (6). The specificity of a model can be evaluated based
on how well it recognizes the various kinds of backgrounds that might be seen in an
image. Specificity ranges that are quite near to one are standard and to be expected
due to the substantial proportion of pixels that have been tagged as background in
comparison to the ROI. Therefore, specificity is an appropriate sanity check for a
segmentation model, but it is less appropriate for measuring its actual segmentation
performance.
Specificity = TN / (TN + FP) (6)
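The pixel-level metrics of Eqs. (2)–(6) follow directly from the four confusion-matrix counts; a minimal Python sketch, with made-up counts for illustration:

```python
# Compute the metrics of Eqs. (2)-(6) from TP, TN, FP, FN counts.

def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # Eq. (2), Rand index
    precision = tp / (tp + fp)                          # Eq. (3)
    recall = tp / (tp + fn)                             # Eq. (4), TPR/sensitivity
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (5), Dice coefficient
    specificity = tn / (tn + fp)                        # Eq. (6)
    return accuracy, precision, recall, f1, specificity

# Invented counts: 80 nodule pixels found, 900 background pixels correct,
# 20 false alarms, 10 missed nodule pixels.
acc, prec, rec, f1, spec = metrics(tp=80, tn=900, fp=20, fn=10)
print(round(acc, 3), round(prec, 3), round(rec, 3),
      round(f1, 3), round(spec, 3))
```

Note how specificity stays high (900/920) even with 20 false alarms, illustrating the point above about the large proportion of background pixels.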
Intersection over Union, or IoU for short, is a statistic that can be used to quantify
the percentage of overlap between the target mask and the output of our prediction.
The Jaccard index is another name for this metric. It is closely related to the Dice
coefficient, which is typically used as a loss function during training. In a nutshell,
the IoU metric divides the number of pixels common to the target and prediction
masks by the total number of pixels present in either mask. The IoU score is
computed independently for each class, and then averaged across all classes to get
an overall mean IoU score for the semantic segmentation prediction, as in Eq. (7).
Greater overlap between predicted and actual boxes increases the IoU score, while
low overlap results in a low IoU score; the score is 1 when the predicted box exactly
matches the ground-truth box and 0 when they do not overlap at all. IoU is used in
computer vision applications such as self-driving cars, surveillance, and medical
imaging, as shown in Fig. 9 [38].
IoU = |Target ∩ Prediction| / |Target ∪ Prediction| (7)
DSC = 2|M ∩ N| / (|M| + |N|) (9)

PPV = |M ∩ N| / |N| (11)
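The mask-overlap metrics of Eqs. (7), (9), and (11) can be sketched with masks represented as sets of pixel coordinates; the example masks below are invented, and M denotes the target (ground truth) mask and N the prediction, following the notation above.

```python
# Set-based sketches of IoU (Eq. 7), Dice (Eq. 9), and PPV (Eq. 11).

def iou(target, pred):
    """Eq. (7): common pixels over pixels present in either mask."""
    return len(target & pred) / len(target | pred)

def dsc(m, n):
    """Eq. (9): twice the overlap over the summed mask sizes."""
    return 2 * len(m & n) / (len(m) + len(n))

def ppv(m, n):
    """Eq. (11): fraction of predicted pixels that are truly in the target."""
    return len(m & n) / len(n)

target = {(0, 0), (0, 1), (1, 0), (1, 1)}   # 4 ground-truth pixels
pred   = {(0, 1), (1, 1), (1, 2)}           # 3 predicted pixels, 2 correct
print(iou(target, pred))  # -> 0.4
print(round(dsc(target, pred), 3), round(ppv(target, pred), 3))
```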
In lung nodule segmentation, the regularly utilized datasets, together with their
respective contributions, comprise a wide variety of helpful resources. It is important
to remember that some datasets provide only 2D slices, while others provide the
whole scan. The majority of approaches that made use of supervised training
either used k-fold cross-validation or prepared training and test splits with an
80/20 ratio, as listed in Table 3 [41].
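The k-fold protocol mentioned above can be sketched as index generation in pure Python. Assigning folds by stride, as below, is one simple illustrative choice; libraries typically shuffle the indices first.

```python
# Generate (train, test) index splits for k-fold cross-validation:
# each fold serves once as the test split while the remaining folds train.

def k_fold_indices(n_samples, k):
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

for train_idx, test_idx in k_fold_indices(n_samples=10, k=5):
    print(len(train_idx), len(test_idx))  # 8 2 on each of the 5 iterations
```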
8 Examples
8.1 Model 1
ResNet 50: The ResNet-50 model is a convolutional neural network (CNN) character-
ized by its depth of 50 layers. ResNet-50 is constructed upon a deep residual learning
framework, which facilitates the training of exceedingly deep networks with a total
of 50 layers, which are organized into five distinct blocks. Each block is comprised of
a collection of residual blocks. The inclusion of residual blocks facilitates the reten-
tion of relevant information from preceding layers, hence enhancing the network’s
capacity to acquire more effective representations of the input data. There are several
steps common to every technique; first, the dataset is fetched and divided into three
parts: train, test, and validation sets. ResNet-50 is widely used in medical image
processing tasks because it reduces the vanishing gradient problem, preserves the
input, and avoids loss of information [52]. The steps followed while using ResNet-50
are shown in Fig. 12.
Table 3 (continued)

8. Finding and measuring lungs in CT data — Annotation: lung annotations — Description: Kaggle lung segmentation challenge — Link: “Finding and measuring lungs in CT data | Kaggle” [49]
9. ELCAP — Annotation: cancer — Description: research teams from the International Early Lung Cancer Action Program (ELCAP) and Vision and Image Analysis (VIA) — Link: http://www.via.cornell.edu/lungdb.html [50]
10. Lung-PET-CT-Dx — Annotation: lung cancer — Description: lung cancer patient DICOM scans and PET scans — Link: https://veet.via.cornell.edu/lungdb.html [51]
8.2 Model 2
8.3 Model 3
In this chapter we have included three models ResNet50, ResNet101, and VGG16.
After applying these models, it is clear that the accuracy and loss are different for
Fig. 12 (a) The residual block, where f(x) is the mapping function with rectified linear
unit (ReLU) activation; (b) the ResNet50 architecture with all the convolution layers
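The skip connection in Fig. 12a can be illustrated numerically: the block output is ReLU(f(x)) + x, so the input is carried through unchanged and the layer only has to learn the residual. The elementwise "weight layer" below is a made-up stand-in for real convolutions, used only to make the sketch runnable.

```python
# Toy numeric version of a residual block: output = ReLU(f(x)) + x.

def relu(v):
    return max(0.0, v)

def residual_block(x, weight=0.5):
    fx = [relu(weight * v) for v in x]     # f(x): weight layer + ReLU
    return [a + b for a, b in zip(fx, x)]  # skip connection adds x back

print(residual_block([2.0, -4.0, 6.0]))  # [3.0, -4.0, 9.0]
```

Even where ReLU zeroes out f(x) (the -4.0 entry), the identity path preserves the input, which is what mitigates the vanishing-gradient problem in deep networks.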
every architecture. The results can be improved by tuning parameters such as the
learning rate, optimizer, and activation functions, and by increasing the number of
epochs in the training procedure. Here, we compare the three models on accuracy
and loss values. For this purpose, we used a learning rate of 0.00001, loss =
‘Categorical Cross Entropy’, optimizer = Opti, verbose = 1, and 30 epochs, as
shown in Table 4.
In this chapter we have included the pictorial representation of accuracy and loss
changes according to number of epochs over the Chest CT image dataset for the
segmentation analysis shown in Fig. 15.
310 S. Chauhan et al.
[Figure: ResNet101 pipeline: input, zero padding, 7×7 convolution, max pooling, then residual-block (RB) stages {1×1-128K, 3×3-128K, 1×1-512K} ×4, {1×1-256K, 3×3-256K, 1×1-1024K} ×23, and {1×1-512K, 3×3-512K, 1×1-2048K} ×3 with skip connections, followed by average pooling, flattening, a fully connected layer, and a SoftMax output.]
Fig. 13 (a) shows the residual block for ResNet101 with rectified linear unit (ReLU) activation. (b) shows the detailed ResNet101 architecture with all the residual blocks
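The skip connection in these residual blocks computes y = ReLU(f(x) + x), so each block only has to learn the residual f(x). A toy forward pass can make this concrete; the linear map below is a stand-in for the learned weight layers, not part of any real network:

```python
def relu(v):
    # Element-wise rectified linear unit.
    return [x if x > 0.0 else 0.0 for x in v]

def residual_block(x, f):
    # y = ReLU(f(x) + x): the skip connection adds the input back
    # before the final activation, so f only models the residual.
    return relu([fx + xi for fx, xi in zip(f(x), x)])

# Stand-in for the learned 1x1/3x3/1x1 weight layers (illustration only).
double = lambda v: [2.0 * x for x in v]
y = residual_block([1.0, -3.0, 0.5], double)
# y == [3.0, 0.0, 1.5]: f(x) + x = [3.0, -9.0, 1.5], then ReLU clamps -9.0 to 0
```

Because the identity path is always available, very deep stacks such as ResNet101's twenty-three-block stage can be trained without the gradient vanishing through the mapping layers.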
[Figure: VGG16 pipeline: input, followed by five blocks of 3×3 convolutions (thirteen 3×3 convolution layers in total) interleaved with max-pooling layers, then three dense layers producing the segmented output.]
Fig. 14 This depicts the VGG16 detailed architecture with all the convolutional blocks
Table 4 Comparison of ResNet50, ResNet101, and VGG16 for the segmentation task

Model | Accuracy | Loss decrement | No. of epochs
ResNet50 | 0.803 | 3.6057 | 30
ResNet101 | 0.820 | 0.5605 | 30
VGG16 | 0.832 | 0.5284 | 30
After applying ResNet50, we achieved an accuracy that could be increased by raising the number of epochs and changing the hyper-parameters. Here, we show it for the purpose of understanding, along with the training loss, validation loss, and confusion matrix, as depicted in Figs. 16, 17 and 18.
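For reading a confusion matrix of this kind, the overall accuracy is simply the trace divided by the total count. A small helper illustrates the computation; the matrix values below are invented for illustration and are not our experimental results:

```python
def accuracy_from_confusion(cm):
    # Overall accuracy = trace / total for a square confusion matrix
    # (rows: true class, columns: predicted class).
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

# Hypothetical 3-class confusion matrix for a CT-slice classifier.
cm = [[50, 3, 2],
      [4, 45, 6],
      [1, 5, 44]]
acc = accuracy_from_confusion(cm)  # 139 correct of 160 -> 0.86875
```

Per-class precision and recall can be read off the same matrix by normalizing its columns and rows, respectively.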
Fig. 16 (a) and (b) show the accuracy and loss changes with the number of epochs for the ResNet50 architecture
Fig. 17 (a) and (b) depict the accuracy and loss changes with the number of epochs for the ResNet101 architecture
References
1. Ghoshal, S., Rigney, G., Cheng, D., et al.: Institutional surgical response and associated
volume trends throughout the COVID-19 pandemic and postvaccination recovery period. 5(8),
e2227443 (2022). https://doi.org/10.1001/jamanetworkopen.2022.27443
2. Chen, R., Aschmann, H.E., Chen, Y.H., et al.: Racial and ethnic disparities in estimated excess
mortality from external causes in the US, March to December 2020. 182(7), 776–778 (2022).
https://doi.org/10.1001/jamainternmed.2022.1461
3. Das, A., Krishnamurthy, A., Ramshankar, V., Sagar, T.G., Swaminathan, R.: The increasing
challenge of never smokers with adenocarcinoma lung: need to look beyond tobacco exposure.
Indian J. Cancer 54, 172–177 (2017)
4. Kaur, H., Sehgal, I.S., Bal, A., et al.: Evolving epidemiology of lung cancer in India: reducing
non-small cell lung cancer-not otherwise specified and quantifying tobacco smoke exposure
are the key. Indian J. Cancer 54, 285–290 (2017)
5. Prasad, K.T., Basher, R., Garg, M., et al.: Utility of LDCT in lung cancer screening in a TB endemic region. ClinicalTrials.gov (2023). https://clinicaltrials.gov/ct2/show/NCT03909620
6. Lam, D.C.L., Liam, C.K., Andarini, S., Park, S., et al.: Lung cancer screening in Asia: an expert
consensus report. J. Thor. Oncol. 18, 1303–1322 (2023). ISSN 1556-0864. https://doi.org/10.
1016/j.jtho.2023.06.014
7. Yabroff, K.R., Wu, X.C., Negoita, S., et al.: Association of the COVID-19 pandemic with
patterns of statewide cancer services. J. Natl. Cancer Inst. 114(6), 907–909 (2022)
8. Chen, G.B., Fu, Z., Zhang, T.F., Shen, Y., Wang, Y., Shi, W., Fei, J.: Robot-assisted puncture
positioning methods under CT navigation. J. Xi’an Jiao Tong Univ. 53(85–92), 99 (2019)
9. Mansoor, A., Bagci, U., Foster, B., Xu, Z., Papadakis, G.Z., Folio, L.R., et al.: Segmentation
and image analysis of abnormal lungs at ct: current approaches, challenges, and future trends.
Radiographics. 35(4), 1056–1076 (2015 Jul–Aug)
10. Kim, S.S., Seo, J.B., Lee, H.Y., Nevrekar, D.V., Forssen, A.V., Crapo, J.D., et al.: Chronic
obstructive pulmonary disease: lobe-based visual assessment of volumetric CT by Using stan-
dard images--comparison with quantitative CT and pulmonary function test in the COPDGene
study. Radiology. 266(2), 626–635 (2013 Feb)
11. Doel, T., Gavaghan, D.J., Grau, V.: Review of automatic pulmonary lobe segmentation methods
from CT. Comput. Med. Imag. Graph. 40, 13–29 (2015 Mar)
12. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature. 521(7553), 436–444 (2015).
[PubMed: 26017442]
13. Liu, X., Song, L., Liu, S., Zhang, Y.: A review of deep-learning-based medical image
segmentation methods. Sustainability, 1224 (2021). https://doi.org/10.3390/su13031224
14. Korfiatis, P., Kazantzi, A., Kalogeropoulou, C., Petsas, T., Costaridou, L.: Optimizing lung
volume segmentation by texture classification. ITAB Corfu. Greece 2010, 1–4 (2010)
15. Somasundaram, E., Deaton, J., Kaufman, R., Brady, S.: Fully automated tissue classifier for
contrast-enhanced CT scans of adult and paediatric patients. Phys. Med. Biol. 63(13), 135009
(2018)
16. Eid Alazemi, F., Jehangir, B., Imran, M., Song, O.Y., Karamat, T.: An efficient model for lungs
nodule classification using supervised learning technique. J. Healthc. Eng. (2023). https://doi.
org/10.1155/2023/8262741
17. Raoof, S.S., Jabbar, M.A., Fathima, S.A.: Lung cancer prediction using machine learning: a
comprehensive approach. 2nd International Conference on Innovative Mechanisms for Industry
Applications (ICIMIA), Bangalore, India, pp. 108–115 (2020). https://doi.org/10.1109/ICIMIA
48430.2020.9074947
18. Tarhini, H., Mohamad, R., Rammal, A., Ayache, M.: Lung segmentation followed by machine
learning & deep learning techniques for COVID-19 detection in lung CT images. Sixth Interna-
tional Conference on Advances in Biomedical Engineering (ICABME), Werdanyeh, Lebanon,
pp. 222–227 (2021). https://doi.org/10.1109/ICABME53305.2021.9604872
19. Nazir, I., ul Haq, I., AlQahtani, S.A., Jadoon, M.M., Dahshan, M.: Machine learning-based
lung cancer detection using multiview image registration and fusion. J. Sens. 2023, 19. Article
ID 6683438 (2023). https://doi.org/10.1155/2023/6683438
20. Nageswaran, S., Arunkumar, G., Bisht, A.K., Mewada, S., Kumar, J.N.V.R.S., Jawarneh,
M., Asenso, E.: Lung cancer classification and prediction using machine learning and image
processing. Biomed. Res. Int. 2022, 1755460 (2022). https://doi.org/10.1155/2022/1755460.
PMID: 36046454; PMCID: PMC9424001
21. Wang, S.-H., Govindaraj, V.V., G´orriz, J.M., Zhang, X., Zhang, Y.-D.: COVID-19 classification
by FGCNet with deep feature fusion from graph convolutional network and convolutional neural
network. Inf. Fusion. 67, 208–229 (2021)
22. Nishio, M., Sugiyama, O., Yakami, M., Ueno, S., Kubo, T., Kuroda, T., Togashi, K.: Computer-
aided diagnosis of lung nodule classification between benign nodule, primary lung cancer, and
metastatic lung cancer at different image size using deep convolutional neural network with
transfer learning. PLoS One. 13(7) (2018)
23. Chaunzwa, T.L., Hosny, A., Xu, Y., Shafer, A., Diao, N., Lanuti, M., Christiani, D.C., Mak,
R.H., Aerts, H.J.: Deep learning classification of lung cancer histology using CT images. Sci.
Rep. 11(1), 1–12 (2021)
24. Afag, S.: Classification of lung nodules using improved residual convolutional neural network.
J. Computat. Sci. Intellig. Technol. 1(1), 15–21 (2020)
25. Zhang, C., Sun, X., Dang, K., Li, K., Guo, X.W., Chang, J., Yu, Z.Q., Huang, F.Y., Wu, Y.S.,
Liang, Z., et al.: Toward an expert level of lung cancer detection and classification using a deep
convolutional neural network. Oncol. 24(9), 1159–1165 (2019)
26. Nasrullah, N., Sang, J., Alam, M.S., Mateen, M., Cai, B., Hu, H.: Automated lung nodule
detection and classification using deep learning combined with multiple strategies. Sensors.
19(17), 3722 (2019)
27. Ali, I., Hart, G.R., Gunabushanam, G., Liang, Y., Muhammad, W., Nartowt, B., Kane, M.,
Ma, X., Deng, J.: Lung nodule detection via deep reinforcement learning. Front. Oncol. 8, 108
(2018)
28. Simeth, J., et al.: Deep learning-based dominant index lesion segmentation for MR-guided
radiation therapy of prostate cancer. Med. Phys. 50(8), 4854–4870 (2023). https://doi.org/10.
1002/mp.16320
29. Zhang, Y., et al.: Lung nodule detectability of artificial intelligence-assisted CT image reading
in lung cancer screening. Curr. Med. Imag. 18(3), 327–334 (2022). https://doi.org/10.2174/157
3405617666210806125953
30. Chen, Y., Lin, Y., Xu, X., Ding, J., Li, C., Zeng, Y., Liu, W., Xie, W., Huang, J.: Classification
of lungs infected COVID-19 images based on inception-ResNet. Comput. Methods Programs
Biomed. 225, 107053 (2022 Oct). https://doi.org/10.1016/j.cmpb.2022.107053. Epub (2022).
PMID: 35964421; PMCID: PMC9339166
31. Jiang, B., et al.: Deep learning reconstruction shows better lung nodule detection for ultra-low-
dose chest CT. Radiology. 303(1), 202–212 (2022). https://doi.org/10.1148/radiol.210551
32. Xie, R.L., Wang, Y., Zhao, Y.N., et al.: Lung nodule pre-diagnosis and insertion path planning
for chest CT images. BMC Med. Imag. 23, 22 (2023). https://doi.org/10.1186/s12880-023-009
73-z
33. Wang, G., Luo, X., Gu, R., Yang, S., Qu, Y., Zhai, S., Zhao, Q., Li, K., Zhang, S.: PyMIC:
a deep learning toolkit for annotation-efficient medical image segmentation. ArXiv (2022).
https://doi.org/10.1016/j.cmpb.2023.107398
34. Nguyen, P., Rathod, A., Chapman, D., Prathapan, S., Menon, S., Morris, M., Yesha, Y.: Active
semi-supervised learning via Bayesian experimental design for lung cancer classification using
low dose computed tomography scans. Appl. Sci. 13, 3752 (2023). https://doi.org/10.3390/app
13063752
35. Xing, H., Zhang, X., et al.: A deep learning-based post-processing method for automated
pulmonary lobe and airway trees segmentation using chest CT images in PET/CT. Quant.
Imag. Med. Surg. 12(10) (2022). https://qims.amegroups.org/article/view/99741
36. Zhou, W. et al.: Deep learning-based pulmonary tuberculosis automated detection on chest
radiography: large-scale independent testing. Quant. Imag. Med. Surg. 12(4), 2344–2355
(2022). https://doi.org/10.21037/qims-21-676
37. Fang, D., Jiang, H., Chen, W., Qin, Z., Shi, J., Zhang, J.: Pulmonary nodule detection on lung parenchyma images using hyber-deep algorithm. Heliyon 9(7), e17599 (2023). https://doi.org/10.1016/j.heliyon.2023.e17599. PMID: 37449096; PMCID: PMC10336504
38. Lei, Y., Tian Shan, Y.H., Zhang, J., Wang, G., Kalra, M.K.: Shape and margin-aware lung
nodule classification in low-dose CT images via soft activation mapping. Med. Image Anal.
60, pp. 1–13 (2020)
39. Anguita, D., Ghelardoni, L., Ghio, A., Oneto, L., Ridella, S.: The ‘K’in K-fold cross validation.
In: 20th European Symposium on Artificial Neural Networks, Computational Intelligence and
Machine Learning (ESANN), pp. 441–446 (2012)
40. Wu, Z., Zhou, Q., Wang, F.: Coarse-to-fine lung nodule segmentation in CT images with image
enhancement and dual-branch network. IEEE Access. 9, pp. 7255–7262 (2021). https://doi.
org/10.1109/ACCESS.2021.3049379
41. Karwoski, R.A., Bartholmai, R., Zavaletta, V.A., Holmes, D., Robb, R.A.: Processing of CT
images for analysis of diffuse lung disease in the lung tissue research consortium. In: Medical
Imaging. Physiology, Function, and Structure from Medical Images. SPIE. 6916, pp. 614–691
(2008)
42. Armato, S.G., 3rd., McLennan, G., Bidaut, L., McNitt- Gray, M.F., Meyer, C.R., Reeves, A.P.:
The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI):
a completed reference database of lung nodules on CT scans. Med. Phys. 38(2), 915–931 (2011)
43. LOLA11 Grand Challenge. Lobe and lung analysis. (LOLA11). Available at https://lola11.
grand-challenge.org/ (2011) [cited 30 Jan 2022]
44. Simpson, A.L., Antonelli, M., Bakas, S., Bilello, M., Farahani, K., Van Ginneken, B., et al.:
A large annotated medical image dataset for the development and evaluation of segmentation
algorithms; 2019. arXiv preprint. arXiv:1902.09063 (2019)
45. VESSEL12 Grand Challenge. Vessel segmentation in the lung 2012 (vessel12). Available at
https://vessel12.grand-challenge.org (2012) [cited 30 Jan 2022]
46. MedSeg. COVID-19 CT segmentation dataset. Available at http://medicalsegmentation.com/
covid19/ (2020) [cited 30 Jan 2022]
47. Kaggle Competition. Data science bowl 2017 (DSB). [Online]. Available at www.kaggle.com/
c/data-science-bowl-2017 (2017)
48. Kaggle Competition, Finding and measuring lungs in CT data. [Online]. Available at https://
www.kaggle.com/kmader/finding-lungs-in-ct-data (2017). Accessed 30 Jan 2022
49. Henschke, C.I., McCauley, D.I., Yankelevitz, D.F., Naidich, D.P., McGuinness, G., Miettinen,
O.S., Libby, D., Pasmantier, M., Koizumi, J., Altorki, N., et al.: Early lung cancer action project:
a summary of the findings on baseline screening. Oncologist 6(2), 147–152 (2001)
50. Li, P., Wang, S., Li, T., Lu, J., Huang Fu, Y., Wang, D.: A large-scale CT and PET/CT dataset
for lung cancer diagnosis. The Cancer Imag. Arch. (2020)
51. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pp. 770–778
(2016). https://doi.org/10.1109/CVPR.2016.90
52. Xu, Z., Sun, K., Mao, J.: Research on ResNet101 network chemical reagent label image classification based on transfer learning. 2020 IEEE 2nd International Conference on Civil Aviation Safety and Information Technology (ICCASIT), Weihai, China, pp. 354–358 (2020). https://doi.org/10.1109/ICCASIT50869.2020.9368658
53. Qassim, H., Verma, A., Feinzimer, D.: Compressed residual-VGG16 CNN model for big data places image recognition. 2018 IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, pp. 169–175 (2018). https://doi.org/10.1109/CCWC.2018.8301729
54. Gaur, P., Malaviya, V., Gupta, A., Bhatia, G., Pachori, R.B., Sharma, D.: COVID-19 disease
identification from chest CT images using empirical wavelet transformation and transfer
learning. Biomed. Signal Process. Control. 71 (2022), 103076
Convergence of Data Analytics, Big Data,
and Machine Learning: Applications,
Challenges, and Future Direction
Abstract The fusion of Data Analytics, Big Data, and Machine Learning has
become a powerful force in the always-changing world of data-driven decision-
making. This chapter offers a brief overview of their practical uses, illuminating how
these technologies are reshaping markets and driving creativity. The cornerstone,
data analytics, is studied first, emphasizing its capacity to extract useful insights from
a variety of sources. To demonstrate how Data Analytics enables organizations to
optimize processes, improve consumer experiences, and manage risks through data-
driven decision-making, real-world examples from industries including e-commerce,
finance, and healthcare are shown. Next, Big Data takes center stage to demonstrate its
ability to handle enormous amounts of data. We examine its uses in industries ranging
from urban planning to agriculture, showing how it facilitates better decision-making
through data-driven insights. The third element of the equation, machine learning,
emerges as a crucial enabler of automation and intelligence. We highlight its use in
customization, fraud detection, and healthcare diagnostics through fascinating real-
world examples, highlighting its disruptive potential. The synergistic potential of
these technologies, notably in predictive modeling and pattern recognition, is high-
lighted in the chapter’s conclusion. It also discusses the ethical issues surrounding
the use of data and the proper application of AI, urging businesses to proceed in the
data-driven world with caution and foresight. This chapter provides readers with a
concise yet thorough overview of the influential trio of Big Data, Machine Learning,
and Data Analytics, encouraging further investigation of their potential to reshape
industries and spur innovation in the real world.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
P. Singh et al. (eds.), Data Analytics and Machine Learning, Studies in Big Data 145,
https://doi.org/10.1007/978-981-97-0448-4_15
318 A. Bhattacherjee and A. K. Badhan
1 Introduction
In an era defined by the digital revolution, data has emerged as the lifeblood of inno-
vation and progress. The convergence of technology and data science has ushered
in a new age where organizations harness the power of Data Analytics, Big Data,
and Machine Learning to decipher complex challenges, unlock hidden insights, and
redefine the way they operate. This book chapter embarks on an exploration of the
myriad real-world applications where these three transformative technologies intersect, illuminating the profound impact they have on diverse industries and domains.
From revolutionizing healthcare with predictive diagnostics to optimizing supply
chains through the analysis of massive data sets, and from enhancing the accuracy
of financial decisions to empowering machines with the ability to learn and adapt
autonomously, this chapter delves into the tangible ways in which Data Analytics, Big
Data, and Machine Learning shape the world around us. We will journey through the
realms of commerce, healthcare, finance, transportation, and beyond, discovering
the remarkable stories of organizations and individuals who have harnessed these
technologies to revolutionize their fields and pioneer groundbreaking solutions.
The convergence of Big data, Machine Learning, and Data Analytics has sparked
a new wave of innovation and change in the rapidly changing field of informa-
tion and technology. This chapter explores the practical uses of these cutting-edge
technologies to transform markets, resolve challenging issues, and realize unreal-
ized potential. This investigation is a vital resource for anyone looking to compre-
hend, utilize, and capitalize on the dynamic trifecta of Data Analytics, Big Data, and
Machine Learning in the search for practical insights and long-term advancement
as the digital era continues to push the envelope of what is feasible. We set out on
a trip through the concrete, significant applications that propel real-world change
and help organizations traverse the opportunities and difficulties of the data-driven
future, from marketing to finance and healthcare, among other areas.
As we delve into the exciting world of Data Analytics, Big Data, and Machine
Learning, we aim to provide both a comprehensive understanding of these technolo-
gies and real-world inspiration for those seeking to leverage data’s transformative
potential. We invite you to embark on this enlightening voyage, discovering the inno-
vation, challenges, and limitless opportunities that lie at the intersection of data and
technology.
Many applications currently use "smart technology," the incorporation of sensors and networked infrastructures. Every service or product has the potential to be improved by the application of smart technology, which has been incorporated into products for decades and is widely used. For instance, in the 1980s, students
at Carnegie Mellon University used sensors attached to a vending machine that was
connected to the internet to track the number of soft drinks served. To supply goods
and perform services more effectively, the product suppliers were able to keep track
of the number of products in each vending machine. Such instances of the use of smart
technology are numerous and are constantly expanding in diversity and complexity.
Many industries have seen radical change thanks to Data Analytics, Big Data, and
Machine Learning. They enable companies in industries like banking, healthcare,
and retail to streamline operations, make data-driven choices, and improve customer
experiences. These technologies also enable predictive maintenance, route optimiza-
tion, and talent management, and they are crucial in a variety of other fields, such as
energy, transportation, agriculture, and human resources. Additionally, by offering
data for social trend tracking, player performance analysis, and targeted advertising,
they have revolutionized social sciences, sports analytics, and marketing. Fig. 1 represents the major fields of applications under data analysis. Furthermore, data analytics
improves efficiency and sustainability in communications, supply chain manage-
ment, and environmental conservation. These tools help with policy-making and
tailored learning in both government and education. While the media, insurance, and
real estate sectors use data analytics for risk assessment, content suggestion, and
property appraisal, the pharmaceutical industry uses them for medication discovery
and safety monitoring. These technologies are used in banking and finance for regu-
latory compliance and credit scoring, and they are also utilized by smart cities for
urban planning. Data analytics is used by nonprofits to evaluate the effectiveness of
their programs and engage donors.
Big Data
Machine Learning
Data Analytics
Fig. 1 Top real-world applications of Data Analysis, Big Data and Machine Learning
Simply said, data analysis is the act of turning the obtained data into useful information. Different approaches, including modeling, are used to identify trends and connections and, ultimately, to draw conclusions that support the decision-making process. Data can be analyzed using the following four major methods:
Descriptive: Planning the data analysis in advance gives descriptive analysis several benefits. The most evident is that the investigators are not left unsure of what to do with all of the data they now have on their computers; a plan also speeds up the analysis, and if a computer program is employed, the analysis commands can even be written before data collection is finished. Descriptive analytics can be used for many aspects of a business's routine daily operations: it provides the foundation for reports on inventories (shown in Fig. 2), multiple workflows, sales numbers, and income information.
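A minimal descriptive summary of the kind such reports are built on can be produced with Python's standard statistics module; the monthly sales figures below are invented purely for illustration:

```python
import statistics

# Hypothetical monthly sales figures of the kind a descriptive
# inventory or sales report summarizes.
sales = [120, 135, 128, 150, 160, 142, 138, 155, 149, 162, 158, 171]

summary = {
    "count": len(sales),
    "mean": round(statistics.mean(sales), 2),
    "median": statistics.median(sales),
    "stdev": round(statistics.stdev(sales), 2),
    "min": min(sales),
    "max": max(sales),
}
# mean 147.33, median 149.5, range 120-171
```

Descriptive analytics stops here, at summarizing what happened; the remaining three methods go further, toward explanation and prediction.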
Exploratory: Data scientists use a method known as “exploratory data analysis,”
or EDA, to analyze, study, and summarize the key properties of various types of
data sets. These methods usually make use of data visualization techniques. EDA
helps data scientists choose the most efficient approach to change data sources to get
the results they need, making it easier for them to spot trends, spot anomalies, test
a hypothesis, or confirm assumptions. Numerous industries, including professional sports, history, healthcare, marketing, the hospitality sector, retail, fraud detection, auditing, geography, space exploration, and the food business, use exploratory data analytics.

[Figure: the four data-analysis methods with representative applications: descriptive (inventories, multiple workflows, sales numbers, income information); exploratory (professional sports, history, healthcare, marketing, the hospitality sector, retail, fraud detection, auditing, geography, space exploration, the food business); inferential (dental anatomy, banking, transport, education, communications, health services); anticipatory (risk modeling, quality assurance, product propensity, predictive maintenance, customer segmentation).]
Inferential: By considering a selection of data from the original data set, inferential data analysis draws conclusions and forecasts about large amounts of data; it reaches those conclusions using probability. "Inferential data analysis" is the technique of "inferring" insights from a sample of data. The fields of banking, healthcare, education, insurance, and transportation are among its most widespread applications. Inferential data analysis applies wherever information is taken from a group of subjects and then used to draw conclusions about a larger group. Although data sets can grow huge and contain numerous variables, inferential data analysis does not require complex equations. If you were to ask a sample of 100 persons whether or not they recovered from a very serious health condition, and 85 said yes and 15 said no, the results would indicate that 85% of the sample had recovered from that particular health issue. Based on those figures, one could infer that 85% of the general population recovers while 15% does not.
Anticipatory: Also known as predictive data analytics, this method serves numerous important applications through the utilization of AI-based libraries. To facilitate the reuse of business-logic capabilities on the chosen datasets, it integrates an analytic service builder. Risk modeling, quality assurance, product propensity, predictive maintenance, and customer segmentation are a few use cases and applications for anticipatory data analytics. In addition to implementing various statistical and analytics methods and providing an environment for creating customized services based on a set of available analytical characteristics, it also provides the ability to run bespoke queries on the readily available datasets.
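As a toy instance of the predictive-maintenance use case, a least-squares trend fitted to sensor readings can anticipate when an alarm threshold will be crossed. The vibration readings and the 5.0 mm/s threshold below are hypothetical:

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = a + b*x with a single predictor.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# Hypothetical machine-vibration readings (mm/s) over six weeks.
weeks = [1, 2, 3, 4, 5, 6]
vibration = [2.0, 2.3, 2.5, 2.9, 3.1, 3.4]
a, b = fit_line(weeks, vibration)

# Extrapolate to the week the 5.0 mm/s alarm threshold is crossed,
# i.e. schedule maintenance before then.
alarm_week = (5.0 - a) / b  # ≈ 11.7
```

Production predictive-maintenance systems replace the straight line with richer models, but the anticipatory logic is the same: project the fitted trend forward and act before the predicted failure point.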
[Figure: classifications of Big Data and types of Machine Learning, with example applications: customer segmentation, optimized marketing, house price prediction, and text classification.]
These real-world success stories are more than just teaching points; they serve as
motivation for companies to change, grow, and prosper in a time when information
is the new currency. This chapter serves as a beacon of guidance for readers at a
time when it is usual to feel overwhelmed by information, offering perspectives
that go beyond theoretical frameworks. It allows them to see directly the real-world
effects of Big Data, Machine Learning, and data analytics, which inspires wonder
and excitement about the seemingly limitless opportunities that lie ahead with the
critical factors shown in Fig. 5.
Let this chapter serve as a source of inspiration for scholars, practitioners, and
enthusiasts alike as we set out on our intellectual journey. It is a call to action, asking
the reader to actively engage in the data revolution, in which data is the driving force
behind hitherto unheard-of breakthroughs, and knowledge is power.
3 Literature Review
In their seminal work [3], Manyika and colleagues discussed the transformative
impact of Big Data on various industries, including healthcare, finance, and retail.
This work serves as a foundational exploration of the cross-industry impact of Big
Data. It highlights the potential for data-driven decision-making, improved oper-
ational efficiency, and enhanced customer experiences in healthcare, finance, and
retail. Their findings have informed subsequent research and practice in these sectors,
further emphasizing the transformative power of Big Data analytics.
In healthcare, they emphasize the substantial potential of Big Data. They discuss
the use of large datasets to improve patient care, enhance clinical outcomes, and
optimize hospital operations. Big Data analytics enables healthcare providers to make
data-driven decisions, personalize treatments, and detect early signs of diseases. The
result is improved patient outcomes and cost savings within the healthcare sector.
In finance, the authors highlight the role of Big Data in the financial industry. They discuss
how financial institutions utilize vast amounts of data to better understand customer
behavior, mitigate risks, and detect fraudulent activities. Through advanced analytics
and machine learning, banks and financial services companies can enhance security,
streamline operations, and offer more personalized financial products to customers.
In retail, they underscore the impact of Big Data on consumer insights and supply
chain management. They explain how retailers leverage data analytics to gain a deeper
understanding of consumer preferences, shopping behaviors, and market trends. This
information helps retailers make informed decisions regarding inventory manage-
ment, pricing strategies, and marketing campaigns, ultimately leading to improved
customer experiences and increased profitability.
In the past ten years, management academics in the field of information systems
(IS) have become increasingly interested in the function that Big Data (BD) plays in
promoting business revenue, operations, and customer support. Building Big Data
Analytics (BDA) capabilities is a priority for many established businesses as well as
brand-new ones. The goal is to produce actionable insights from a large variety of reli-
able data so that people or organizations may make and communicate informed deci-
sions. Organizations now need to find ways to leverage the data across smaller areas
of day-to-day management due to the rapid rise in processing capacity accessible to
analysts. These routine management tasks, once merely assisted by BDA, have now developed into full-fledged management domains with linkages to traditional management theories.
When working with high-dimensional data, deep learning architectures such as MLPs and recurrent LSTMs tend to demonstrate performance increases from deeper architectures and larger networks. Computer vision, image processing (image classification and segmentation), and ML classification are just a few of the disciplines where DL applications are being deployed, and they are attracting much more interest and adoption [4]. The era of the Industry 5.0 revolution is one in which enormous
amounts of data are being exchanged digitally. Despite the need to analyze and
interpret data, machine learning is succeeding in several fields, including intelligent
control, decision-making, speech recognition, natural language processing, computer
graphics, and computer vision. Deep Learning and Machine Learning Techniques
have lately gained widespread recognition and adoption by several real-time engi-
neering applications due to their outstanding performance. Designing automated and
intelligent programs that can manage data in fields like health, cyber-security, and
intelligent transportation systems requires knowledge of machine learning [5].
Every population in the globe now has access to digital technology, which makes an unprecedentedly large amount of data available. There are numerous benefits to being able to process these enormous volumes of data in real time using Machine Learning (ML) algorithms and Big Data Analytics (BDA) tools. However, the abundance of free BDA tools, platforms, and data mining tools makes it difficult to choose the best one for the job, according to Nti. Moving forward, it is necessary to address the concerns that have been identified, namely incomplete and diverse data sources and noisy and erroneous data that impair the performance of data analytics, to allow the easy management of Big Data. Therefore, to minimize human labor, big data analytics designers must effectively automate data pre-processing steps such as data cleaning, sampling, and compression [6].
Arguably the most popular software platform of the twenty-first century is the Internet of Things (IoT): the intelligent technologies and services that now run much of the globe. It is the global interconnection of intelligent gadgets, such as actuators, sensors, RFID tags, and mobile phones, dispersed throughout the monitored area, which allows them to interact with one another and carry out particular duties autonomously.
Figure 6 represents such applications and demonstrates how, thanks to developments in Internet of Things (IoT) services, applications that were only a pipe dream a few decades ago are now possible. IoT systems are predicted to capture 3.9–11.1 trillion dollars in the US economy by the year 2025. In fact, by the year 2020 there were expected to be close to 50 billion gadgets connected to the web, and their quality will only grow with age [7].
A machine learning algorithm (MLA) is a strategy or tool that aids Big Data Analytics (BDA) of applications. It can be used to examine the sizable volume of data produced by a program in order to make optimal and effective use of the information.
Fig. 6 Combined applications of Data Analysis, Big Data, and Machine Learning based on IoT. The figure groups real-world IoT applications into: healthcare and medical diagnosis; e-commerce, retail, and manufacturing; financial services; cybersecurity; environmental sustainability; and supply chain optimization
Convergence of Data Analytics, Big Data, and Machine Learning … 327
Machine learning techniques are considered for locating relevant information and data for industry applications; they fall under the category of Big Data Analytics (BDA) services. BDA can be utilized to detect deception, manage risk, identify the reason for a failure, recognize a consumer based on purchase detail records, and more. Supervised, semi-supervised, and unsupervised methods are used in modern machine learning. ML is used in decision-making processes, model design, identification of results, data analysis, and forecasting in any large data analytics-oriented program. While in a typical system the application is given the rules and data to produce results, in the machine learning paradigm the program is given the data and results and learns the rules. ML falls under the domains of reinforcement learning, unsupervised learning, and supervised learning. The task of acquiring a function that maps from an input to an output is known as supervised learning [8].
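The supervised-learning idea above, learning a mapping from inputs to outputs out of labeled examples, can be sketched in a few lines of plain Python. The data and the 1-nearest-neighbour rule here are invented purely for illustration; the chapter does not prescribe any particular algorithm or library:

```python
# Minimal illustration of supervised learning: use labeled examples to
# predict the label of a new input (1-nearest-neighbour rule).
# The data are invented purely for illustration.

def predict_1nn(train_X, train_y, x):
    """Label a new point with the label of its closest training point."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    best = min(range(len(train_X)), key=lambda i: dist(train_X[i], x))
    return train_y[best]

# Labeled examples: (transaction amount, hour of day) -> "normal"/"fraud"
X = [(20, 14), (35, 10), (15, 9), (900, 3), (750, 2)]
y = ["normal", "normal", "normal", "fraud", "fraud"]

print(predict_1nn(X, y, (25, 11)))   # "normal" - close to normal purchases
print(predict_1nn(X, y, (800, 4)))   # "fraud"  - close to fraudulent ones
```

Production systems replace the toy distance rule with trained models, but the contract is the same: labeled pairs in, a learned input-to-output mapping out.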
In the life sciences, machine learning is progressively gaining traction as a viable computational and analytical tool for the integrated study of vast, diverse, unstructured datasets at Big Data scale. While data creation costs are no longer an important issue for genome-wide research, the processing power needed to analyze terabytes or even petabytes of data is becoming a constraint. The three main challenges of Big Data, which are similar to those faced by every scientific field generating Big Data, are scalable infrastructure for parallel processing, plans for handling massive data sets, and smart data analytics. The environmental science community is looking for innovative solutions to these problems: a system for unified Big Data processing built upon the foundation of powerful clusters of computers. One such holistic platform is the open-source Apache Hadoop environment, which consists of the MapReduce programming model, the Hadoop Distributed File System (HDFS), Hadoop operating instructions, and several tools for storing various types of structured, semi-structured, and unstructured datasets; it can serve as a Big Data repository that combines open datasets and genome libraries with automated workflows for preprocessing, transforming, and querying extremely large-volume datasets [9].
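The MapReduce programming model mentioned above can be illustrated with a self-contained, single-process Python sketch (no Hadoop cluster required; the phases below are what Hadoop distributes and parallelizes across machines):

```python
from collections import defaultdict

# Single-process sketch of the MapReduce model used by Hadoop:
# map emits key-value pairs, shuffle groups them by key, and
# reduce aggregates each group. Here: the classic word count.

def map_phase(records):
    for line in records:
        for word in line.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data needs big tools", "data tools for big data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["big"])   # 3
print(counts["data"])  # 3
```

On a real cluster, HDFS splits the input across nodes, many mappers and reducers run in parallel, and the shuffle moves intermediate pairs over the network; the programmer still writes only the map and reduce functions.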
Over the last ten years, several Big Data-related issues have been resolved with the help of machine learning algorithms. Currently, unsupervised, supervised, and semi-supervised machine learning (ML) approaches are available in a variety of forms. Comparably, several methods, including classification, preprocessing, association analysis, random forests, support vector machines, decision trees, etc., are available to address a variety of issues, including machine translation, data imbalance, advances in robotics, etc. These days, to handle issues such as forecasting and modeling in various applications, we need to know a few fundamental details about machine learning approaches, for instance in big data industries for e-healthcare (information created via digitally linked intelligent gadgets) as well as other sectors (e-commerce, agriculture, defense, etc.). Because of this, the majority of researchers are perplexed and hesitant to choose which approach or measure to employ in the appropriate applications [10].
Big Data has a major effect on organizations in the Industry 4.0 (fourth industrial revolution) period because advances in systems, individuals, and information technology have altered the factors that determine a firm's ability to innovate and remain competitive. Researchers and industry professionals have created
a lot of hype around big data because big data analytics can yield useful insights
and encourage creative company policies that can revolutionize local, national, and
global economies. In this view, data science is the set of essential ideas that support the extraction of expertise and knowledge from data. The methods
and tools employed aid in the analysis of vital data, assisting firms in gaining insight
into their surroundings and making accurate choices. Big data analytics are being
used in every industry (agriculture, physical well-being, power and buildings, finance
and protection, games, nutrition, and public transportation) and global economy due
to the massive increase in data that has resulted from the Internet of Things (the constant growth of connected devices, sensors, and smartphones). It is globally recognized that the increasing availability of data is an established pattern, and data analysis procedures yield significant insights from the available data [11].
Predictive analytics is the application of Business Intelligence (BI) and Big Data. Whenever your company gathers enormous amounts of fresh data, what measures do you take? Modern enterprise apps collect a great deal of new data on consumers and markets, from social media monitoring, real-time mobile applications, the cloud, and device metrics. One way to make the most of all that data, obtain actionable novel insights, and beat rivals is through predictive analytics. Enterprises employ predictive analytics in several manners, ranging from data mining and predictive marketing to utilizing machine learning (ML) and artificial intelligence (AI) algorithms to enhance operational efficiency and detect novel statistical trends. The need for managerial experts has surged due to big data, to the extent that companies such as Tech AG, Oracle, Microsoft, IBM, SAP, EMC, HP, Dell, and Alienware have invested over $15 billion in software companies that specialize in managing data and analytics [12].
Innovations related to Big Data and machine learning (ML) have an opportunity to
affect various aspects of Environmental and Water Management (EWM). Big Data is
becoming more and more common in many EWM fields, including weather predic-
tion, disaster preparedness, intelligent water and electricity administration systems,
and remote sensing. This is due in part to rapid advancements in high-resolution
image methods for remote sensing, smart computer technology, and social media.
Big Data opens up new possibilities for information-driven discoveries in EWM, yet
it also necessitates new kinds of analytics, data processing, storage, and retrieval.
Machine Learning (ML) is a subfield of Artificial Intelligence (AI) that generally
refers to computer systems with data-driven learning capabilities. If ML and data analytics are correctly linked, they could help unleash the potential of Big Data. The
EWM literature has already published a significant amount of Big Data and machine
learning applications over the past decade [13].
The expanding notion of “Big Data” demands significant progress in the realm of data science. It provides data with quantifiability in multiple ways that enable data research. Machine learning techniques have had a significant impact on society in a wide range of applications. The creation of innovative, dynamic, and intelligent programs that can manage data in fields like health, cyber-security, and intelligent transportation systems requires knowledge of machine learning [5].
In addition to addressing the inherent obstacles and predicting future trends in this quickly developing subject, the proposed work intends to explore the intersection of Data Analytics, Big Data, and Machine Learning, clarifying their convergence and its consequences for a variety of applications. The chapter gives a thorough summary of how these three areas work together and emphasizes how their combined efforts can completely transform how decisions are made in a variety of businesses. Having noticed some real-time use cases, we present these examples here.
• Financial Fraud Detection: For instance, banks use Big Data to analyze massive amounts of data in real time to identify fraudulent transactions. Atypical spending habits or other unusual patterns trigger alerts for additional investigation, preventing financial losses [19].
• Climate Modeling: As an illustration, climate scientists examine enormous
volumes of meteorological data, such as temperature, precipitation, and atmo-
spheric variables, using Big Data. As a result, they can produce precise climate
models that forecast long-term weather patterns and climatic changes.
• Smart City Infrastructure: As an illustration, Big Data technologies are used by
cities to effectively manage their urban infrastructure. For example, data analytics
and sensors improve public safety, save energy costs, and optimize traffic flow
[20].
• Virtual Personal Assistants: For instance, Alexa from Amazon and Siri from
Apple both utilize machine learning to comprehend user commands, adjust to
each user’s unique preferences, and eventually give more precise and customized
responses [21].
• Image and Speech Recognition: For instance, Google Photos uses machine
learning algorithms to automatically identify and classify photos. Similarly,
machine learning powers speech recognition technology seen in virtual assistants
and customer support apps.
• Credit Scoring in Finance: For instance, to more precisely evaluate credit
risk, financial firms employ machine learning. These algorithms are capable of
producing more complex and equitable credit ratings by examining a wide range
of variables, such as payment patterns and expenditure patterns.
• Medical Diagnosis and Imaging: As an illustration, medical imaging uses
machine learning to diagnose diseases like cancer. To help medical personnel
make earlier and more accurate diagnoses, algorithms examine medical images
to find anomalies [22].
• Autonomous Vehicles: For instance, machine learning algorithms are used by
Tesla and other companies to enable autonomous driving. These algorithms aid
in the development of self-driving automobiles by analyzing data from cameras
and sensors in real time to make decisions about vehicle operation.
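Several of the use cases above, such as fraud detection and credit scoring, hinge on flagging atypical patterns in data streams. A minimal sketch of that idea follows, using an invented transaction history and a simple z-score rule; real systems use far richer features and trained models:

```python
import statistics

# Flag transactions whose amount deviates strongly from the customer's
# typical spending. A simple z-score rule on invented data; real fraud
# systems use many features and learned models instead of one threshold.
def flag_anomalies(amounts, threshold=2.5):
    mean = statistics.mean(amounts)
    stdev = statistics.pstdev(amounts)
    # If all amounts are identical (stdev == 0), nothing is anomalous.
    return [a for a in amounts if stdev and abs(a - mean) / stdev > threshold]

history = [25, 30, 22, 28, 27, 31, 24, 26, 29, 1200]  # one obvious outlier
print(flag_anomalies(history))  # [1200]
```

The same pattern, "score each observation against a baseline and alert past a threshold," underlies the real-time alerting described in the fraud-detection example.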
b. Implication of Big Data Analytics
• Increased Decision Accuracy: By analyzing enormous volumes of both struc-
tured and unstructured data, Big Data analytics helps businesses to generate
forecasts and judgments that are more accurate.
• Innovation and New Business Models: Big Data creates opportunities for
innovation, allowing for the development of new goods, services, and business
models based on insights from huge databases.
• Improved Customer Insights: By learning more about the behavior and pref-
erences of their customers, businesses may develop marketing tactics that are
more precisely focused.
Future Scope
• Edge Computing: Decentralized processing at the network’s edge, which
lowers latency and permits real-time data analysis, is the way Big Data will
be processed in the future.
• Data Security and Privacy: As Big Data grows, more attention will be paid to creating strong security and privacy protocols to safeguard sensitive data.
c. Implication of Machine Learning
• Automated Decision-Making: By using patterns and insights from data,
machine learning enables automated decision-making, which eliminates the
need for human intervention in repetitive operations.
• Personalized Experiences: Machine learning algorithms provide tailored
suggestions, offerings, and exchanges, augmenting user experiences across
several platforms.
• Advanced Problem Solving: Machine learning (ML) makes it possible to
create models that can solve a wide range of complicated issues, from financial
forecasts to medical diagnostics.
Future Scope
• Explainable AI: Improving machine learning models’ interpretability is essen-
tial to their broad adoption, particularly in industries where openness and
accountability are critical.
• AI in Creativity: AI-driven innovation has a bright future thanks to the
incorporation of machine learning in creative industries like art and content
production.
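The "Personalized Experiences" implication above can be made concrete with a tiny co-occurrence recommender. The purchase histories below are invented for illustration; production recommenders use collaborative filtering or learned embeddings at far larger scale:

```python
from collections import Counter

# Suggest items that co-occur with a user's purchases in other users'
# histories. Invented data; a stand-in for real recommender systems.
def recommend(user_items, all_histories, k=2):
    scores = Counter()
    for history in all_histories:
        if user_items & history:            # shares at least one item
            for item in history - user_items:
                scores[item] += 1           # count co-occurrences
    return [item for item, _ in scores.most_common(k)]

histories = [
    {"laptop", "mouse", "keyboard"},
    {"laptop", "mouse", "monitor"},
    {"phone", "case"},
]
print(recommend({"laptop"}, histories))  # "mouse" ranks first (seen twice)
```

Even this crude rule illustrates the mechanism behind tailored suggestions: behavior observed across many users drives what each individual user is shown.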
6 Conclusions
In conclusion, the fields of Big Data, Machine Learning, and Data Analysis offer
an amazing array of opportunities with practical uses that are transforming entire
sectors and improving our quality of life. This chapter has provided an in-depth
exploration of the various industries in which these technologies are having a signif-
icant impact, ranging from marketing and transportation to healthcare and banking.
We’ve seen how data-driven insights have transformed decision-making processes,
increased productivity, and produced creative answers to difficult problems. Future
developments will make the integration of these technologies even more common and
essential. Data analysis, big data, and machine learning have the potential to trans-
form many industries and provide major benefits to society. Their potential keeps
opening up new avenues for exploration. To address privacy concerns and minimize
potential biases, it is crucial to approach these technologies with a strong sense of responsibility, ensuring ethical and secure data use. The exploration of the practical applications of big data, machine learning, and data analysis is a continuous process. With ramifications that go well beyond the pages of this chapter, it is clear
that these tools will become more and more important as we continue to explore
this frontier in our quest for knowledge, growth, and solutions to problems. While
constantly being aware of the significant influence that data-driven technologies are
having on our world, we must continue to be imaginative, flexible, and open to the
revolutionary possibilities of these technologies.
References
1. Nigam, K., McCallum, A., Mitchell, T.: Semi-supervised text classification using EM. Semi-Superv. Learn. 32–55 (2013). https://doi.org/10.7551/mitpress/9780262033589.003.0003
2. Abe, N., Verma, N., Apte, C., Schroko, R.: Cross channel optimized marketing by reinforcement learning. In: Proc. Tenth ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD 2004), pp. 767–772 (2004). https://doi.org/10.1145/1014052.1016912
3. Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., Hung Byers, A.: Big
data: The next frontier for innovation, competition and productivity. McKinsey Glob. Inst. 156
(2011)
4. Khan, M.A., Karim, M.R., Kim, Y.: A two-stage big data analytics framework with real world
applications using spark machine learning and long short-term memory network. Symmetry
(Basel) 10 (2018). https://doi.org/10.3390/sym10100485
5. Kushwaha, A.K., Kar, A.K., Dwivedi, Y.K.: Applications of big data in emerging management
disciplines: a literature review using text mining. Int. J. Inf. Manag. Data Insights 1, 100017
(2021). https://doi.org/10.1016/j.jjimei.2021.100017
6. Nti, I.K., Quarcoo, J.A., Aning, J., Fosu, G.K.: A mini-review of machine learning in big data
analytics: applications, challenges, and prospects. Big Data Min. Anal. 5, 81–97 (2022). https://
doi.org/10.26599/BDMA.2021.9020028
7. Choo, K.K.R., Dehghantanha, A.: Handbook of Big Data Privacy (2020)
8. Rahul, K., Banyal, R.K., Goswami, P., Kumar, V.: Machine Learning Algorithms for Big Data
Analytics. Springer, Singapore (2021)
9. Ma, C., Zhang, H.H., Wang, X.: Machine learning for big data analytics in plants. Trends Plant
Sci. 19, 798–808 (2014). https://doi.org/10.1016/j.tplants.2014.08.004
10. Tyagi, A.K., Rekha, G.: Machine learning with big data, 1011–1020 (2019)
11. Vassakis, K., Petrakis, E., Kopanakis, I.: Big data analytics: applications, prospects and chal-
lenges. Lect. Notes Data Eng. Commun. Technol. 10, 3–20 (2018). https://doi.org/10.1007/
978-3-319-67925-9_1
12. Ongsulee, P., Chotchaung, V., Bamrungsi, E., Rodcheewit, T.: Big data, predictive analytics and machine learning. In: 2018 Int. Conf. ICT and Knowledge Engineering, pp. 37–42 (2019). https://doi.org/10.1109/ICTKE.2018.8612393
13. Sun, A.Y., Scanlon, B.R.: How can big data and machine learning benefit environment and
water management: a survey of methods, applications, and future directions. Environ. Res.
Lett. 14 (2019). https://doi.org/10.1088/1748-9326/ab1b7d
14. Son, L.H., Tripathy, H.K., Acharya, B.R., Kumar, R., Chatterjee, J.M.: Machine learning on big
data: a developmental approach on societal applications. Stud. Big Data 43, 143–165 (2019).
https://doi.org/10.1007/978-981-13-0550-4_7
15. Praful Bharadiya, J.: A comparative study of business intelligence and artificial intelligence
with big data analytics. Am. J. Artif. Intell. (2023). https://doi.org/10.11648/j.ajai.20230701.14
16. Koosha, S., Amini, M.: A review of machine learning and deep learning applications. World
Inf. Technol. Eng. J. 7, 3897–3904 (2023). https://doi.org/10.1109/ICCUBEA.2018.8697857
17. Chen, W.H., Lin, Y.C., Bag, A., Chen, C.L.: Influence factors of small and medium-sized
enterprises and micro-enterprises in the cross-border e-commerce platforms. J. Theor. Appl.
Electron. Commer. Res. 18, 416–440 (2023). https://doi.org/10.3390/jtaer18010022
18. De la Cruz, O., Holmes, S.: The duality diagram in data analysis: examples of modern applications. Ann. Appl. Stat. 5(4), 2266–2277 (2011). https://www.jstor.org/stable/23069329
19. Dhone, M.B.: Big data analytics for fraud detection in financial transactions. Maya. 38, 31–41 (2023). https://doi.org/10.5281/zenodo.7922883
20. Chluski, A., Ziora, L.: The role of big data solutions in the management of organizations. Review of selected practical examples. Procedia Comput. Sci. 65, 1006–1012 (2015). https://doi.org/10.1016/j.procs.2015.09.059
21. Manojkumar, P.K., Patil, A., Shinde, S., Patra, S., Patil, S.: AI-based virtual assistant using
python: a systematic review. Int. J. Res. Appl. Sci. Eng. Technol. 11, 814–818 (2023). https://
doi.org/10.22214/ijraset.2023.49519
22. Paleyes, A., Urma, R.G., Lawrence, N.D.: Challenges in deploying machine learning: a survey of case studies. ACM Comput. Surv. 55 (2022). https://doi.org/10.1145/3533378
Business Transformation Using Big Data
Analytics and Machine Learning
Abstract Artificial intelligence (AI), big data, and business analytics are among the most commonly used cognitive tools in today's ecosystems, and they have garnered a lot of attention for their ability to influence organizational decision-making. With the use of these technologies, firms are able to produce valuable data and obtain answers that will improve their performance and provide them with a competitive advantage. A customer relationship management (CRM) or enterprise resource planning (ERP) business system, for example, can be integrated with AI solutions through the AI business platform paradigm. In addition to providing pattern analysis, big data analytics (BDA) enables automatic forecasting of future events. BDA may revolutionize organizations and create new commercial prospects using AI. The goal is to highlight the preventive aspects of using AI and ML in conjunction with big data analytics (BDA) to pursue digital platforms for business model innovation and dynamics. Additionally, a thorough assessment of the literature is provided with an emphasis on the necessity of business transformation, the function of BDA, and the role of AI. One particular case study, Big Mart sales forecasting, is discussed, compared, and analyzed in the context of business transformation. The chapter discusses the possible obstacles to firms implementing AI and BDA, and offers firms a roadmap for utilizing AI and BDA to generate commercial value.
P. Majumdar
Department of Computer Science and Engineering, Techno College of Engineering,
Maheshkhola, Agartala, Tripura 799004, India
e-mail: er.parijata@gmail.com
S. Mitra (B)
Department of Computer Science and Engineering, Tripura Institute of Technology, Narsingarh,
Agartala, Tripura 799009, India
e-mail: mail.smitra@gmail.com
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 335
P. Singh et al. (eds.), Data Analytics and Machine Learning, Studies in Big Data 145,
https://doi.org/10.1007/978-981-97-0448-4_16
1 Introduction
Business Analytics (BA) is used to transform data into insights to improve company
decisions [1]. A few methods for extracting perception from data are data manage-
ment, data visualization, predictive modeling, data mining, prediction, simulation,
and optimization. The term BA describes the knowledge, tools, and procedures used
in iteratively exploring and analyzing historical business performance in order to
provide insights and inform future strategy [2]. Data-driven businesses actively look
for ways to leverage their data to their advantage and regard it as an asset. The effec-
tiveness of BA depends on high-quality data, skilled analysts who are familiar with
the market, technology, and a dedication to use data to extract knowledge that guides
company choices [3]. Prior to conducting any data analysis, BA begins with a number of fundamental procedures: ascertaining the analysis's commercial purpose, choosing the approach for analysis, obtaining business data from the various sources and systems needed to support the analysis, and combining and cleansing the data in a data mart or data warehouse. Tactical decision-making in reaction to unanticipated occurrences is also supported by BA [4]. Various forms of business analytics consist of
prescriptive analytics, which suggests actions to take in the future based on past
performance; predictive analytics uses trend data to gauge the chance of future
events, while descriptive analytics monitors key performance indicators (KPIs) to
assess a company’s present situation [5]. To enable real-time responses, artificial
intelligence (AI) is widely utilized to automate decision-making. AI is capable of
inquiring, generating and testing hypotheses, and autonomously generating judgments based on sophisticated analytics applied to large numbers of datasets [6]. In the field of AI, computers learn from data using appropriate algorithms, which enables them to extract hidden patterns or correlations in data without being explicitly programmed to solve a specific problem. Large data quantities
that may be generated, processed, and increasingly employed by digital tools and
information systems for generating descriptive, prescriptive, and predictive analyses are referred to as big data analytics (BDA) [7]. The standard concept of big data rests on three Vs that best describe it: volume, velocity, and variety [8]. Furthermore, other dimensions of Big Data that top solution providers
later defined and included are Veracity, Variability, and Value Proposition [9]. The
growing availability of structured data, the capacity to handle unstructured data,
improvements in computer power, and expanded data storage capabilities are the
main drivers of this capability. A continuous flow of processed data inputs into AI
platforms and applications is made possible by the BDA value chain. Addition-
ally, BDA supplies the AI platform with necessary inputs that allow it to process
large amounts of data quickly, efficiently, and from different data structures [10].
Industry 4.0 is powered by BDA, AI, the cloud, and IoT. Personalized services, chatbots, AI and BDA platforms, and AI integration to boost company performance and agility are just a few of the multifarious consumer and corporate use cases that AI and BDA,
together with IoT, cloud, 5G, and cybersecurity, are enabling. The proliferation of
real-time data sources, including call records, location data, and customer purchase
patterns makes BDA possible. Key technologies like computer vision, context-aware
computing, and conversational platforms are the primary enablers of AI. To augment
the AI’s inherent knowledge, machine learning (ML) and/or deep learning (DL)
methods are employed. Systems that adjust their behavior based on contextual data,
such as location, temperature, light, humidity, and hand movements, are referred to as
context-aware computer systems. Conversational platforms use a range of technologies, such as natural language processing (NLP), speech recognition, ML, and contextual awareness, to facilitate human-like interactions. An enterprise resource planning (ERP) business system and a customer relationship management (CRM) system, for example, can be integrated with AI solutions through the AI business platform paradigm. The two main software programs that companies are willing to
use to automate crucial business processes are CRM and ERP [6]. While CRM helps
organizations manage how customers interact with them, ERP helps firms run more
efficiently by linking their operational and financial processes to a central database.
Software called customer relationship management (CRM) keeps track of every inter-
action of the customer with the company. CRM elements were initially created for
sales departments. Having a common database for all financial and operational data
is one of the key advantages of an ERP system. Improved automation of business
operations, more individualized communications, and providing customers the most
helpful answers to their concerns are made possible by combining generative AI with
CRM [6]. The goal of AI in CRM is to enable it to manage intelligent recommen-
dations and analysis about a prospect or customer based on all the data the system
has gathered about them. AI’s superior analytics, forecasting, automation, personal-
ization, and optimization capabilities can improve ERP systems. To understand the
role of BDA and AI in BA, as well as the necessity of business transformation, a
thorough literature review is included. A particular case study, Big Mart sales forecasting, is discussed, compared, and analyzed in the context of business transformation. To provide enterprises with a roadmap for utilizing BDA and AI for commercial value, significant obstacles to implementing these technologies in their operations are also explored, along with potential solutions.
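As a flavor of the kind of predictive modeling behind sales-forecasting case studies such as Big Mart, here is a deliberately simplified sketch: an ordinary least-squares fit of sales against a single feature, on invented toy data (the actual case study's data and models are not reproduced here):

```python
# Toy sales forecaster: ordinary least squares fit of weekly sales
# against an item's visibility score. Data are invented; the Big Mart
# case study uses richer features and more capable models.

def fit_ols(xs, ys):
    """Return (slope, intercept) minimizing squared prediction error."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

visibility = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [120.0, 150.0, 180.0, 210.0, 240.0]   # perfectly linear toy data
slope, intercept = fit_ols(visibility, sales)
print(slope, intercept)            # 30.0 90.0
print(slope * 6.0 + intercept)     # forecast for visibility 6: 270.0
```

The business value comes from the same loop at scale: fit a model to historical sales, then use it to forecast demand for planning, pricing, and stocking decisions.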
2 Related Works
With the aid of BDA, businesses may regain control over data and utilize it to find novel business prospects. This can help businesses make quicker and more shrewd business
decisions and increase productivity, profits, and customer satisfaction [11]. BDA is
the process of addressing dynamic client requests and maintaining a competitive edge
by using statistical and analytical techniques on large data sets. Similar to human
capital and capital resources, big data is an equally important resource for enhancing
financial and social aspects [12]. Businesses are facing immense pressure to adopt
BDA in order to stay competitive. However, BDA’s ability to realize a company’s
strategic business value, which might give them a competitive edge, is ultimately
what will determine its level of success. Businesses find it difficult to determine the
true worth of big data and how investing in it might yield real commercial benefits.
The combination of newly generated business knowledge and its actual application
in business can be used to understand the value of the data [13]. Integrating AI into sustainable business models (SBMs) aims to achieve sustainable production and consumption using scientific and technological capacities; this application of AI in SBMs is emphasized in [14]. Many firms have already switched to AI for sustainable growth and development in an unmanageable climate [15]. According to Giuffrida et al.'s [16] review of the literature on logistics optimization techniques, smart
or sustainable logistics frequently employ machine learning and hybrid techniques.
Loureiro and Nascimento [17] state that augmented reality (AR), virtual reality (VR), IoT, AI, the circular economy, and BDA have become important themes in tourism
research. These findings increased our understanding of the potential future effects of
technology on sustainable tourism. All corporate organizations need to have “Tech-
nological Intelligence” [18]. As a result, businesses must assess the breadth and depth
of AI and BDA applications and pinpoint any areas that appear more susceptible to
disruptions. A wide range of industries have become open to AI applications in the
past ten years. Among these applications are supply chain management, dentistry,
medicine, diagnosis and treatment, modeling for pandemic response and forecasting,
commercial banking and stock market forecasts, power quality assurance, AI, ML,
deep reinforcement learning (DRL) for smart cities, and business operations. These
days, stock price index changes are predicted using ML algorithms [19]. We can make
use of a random forest model to assess risk for an excavation system, as demonstrated
in [20]. Natural language processing (NLP) is used by Amazon to analyze user expe-
rience and customer feedback, and by Twitter to filter out extremist languages from
messages [21]. Additionally, sentiment analysis using text mining, topic modeling, etc., relies more and more on NLP [22]. AI robots are agents designed to behave responsibly; Robot Sophia is the most exemplary example of a social humanoid. Some examples of AI-based conversational chatbots built from big data language models are Google's LaMDA, Meta's BlenderBot, and OpenAI's GPT-3 [23]. Massive volumes of data that are too big to handle with conventional data
management techniques are referred to as “big data”. Bendre and Thool [24] state that, as BDA requires significant data storage and processing power expenses, Return on Investment (ROI) is another concern. On the other hand, the decreasing costs
associated with data collection have prompted the broad use of BDA across several
businesses. BDA has many applications. The availability of software and statistical
techniques to address big data concerns like class imbalance and high dimension-
ality is the boon of using data mining in health informatics [25]. BDA has a huge
potential for use in the medical field. Among the enormous volumes of health data
it contains are gene expression, data sequencing, electronic health records, doctor
notes, prescriptions, data from biological sensors, and data from online social media
[26]. BDA can improve policy execution by obtaining knowledge and insights from
existing data [27]. BDA provides assistance with police monitoring, crime graphs,
recording of crimes, tracking of terror threats, and defense [28]. The primary goals
of government are to promote public talent through enterprise collaborations and to seek benefits through the use of IoT, crowdsourcing, and data sources. BDA applications in the banking sector are improving security and changing services [29]. Some
possible applications for BDA include a client-focused business, improved security
management and services, and cross-selling with more adaptability. Additionally,
BDA aids in comprehending the requirements of both clients and staff so as to offer
enhanced methods for quality and service management [30]. In [31] it is stated
that in the Big Data era, intelligence is identified as a developing field. Its actions
are focused on decision making, it aims to enhance enterprise competitiveness and
economic sectors, and it is a practice that is both morally and legally acceptable.
Lastly, intelligence would be a major factor in how well a business performed in
terms of creativity. According to [13], AI has the capacity to completely transform twenty-first-century society. Preparing a better AI society has become
popular due to increased public and scientific interest in the ethical frameworks,
regulations, incentives, and values needed for a society to reap the benefits of AI
while mitigating its perils. AI is slowly making its way from research and develop-
ment facilities into the corporate sector. Elite businesses and millions of industries worldwide are harnessing the combined power of AI and applied AI (AAI). To increase customer satisfaction, most industries use ML algorithms to detect fraud within milliseconds. To satisfy business needs, there has been a noticeable increase in the development of ML tools, business platforms, and application-based tools [32]. Research on AI in marketing is reviewed in [33], which demonstrates how AI is applied to the creation of marketing strategies and plans as well as to product,
pricing, place, and promotion management. In [34], examples are provided to show
how BDA can improve the effectiveness of an organization. They also list a number
of research areas that are expanding quickly, such as text mining, evolutionary algo-
rithms, and risk management for customers and finances. In [35], the advantages and
difficulties of BDA in businesses are assessed. They discovered through their theme
and content research that BDA is an important factor in strategic decision-making.
The paper also mentions that as data collection costs come down, BDA usage is
accelerating. Furthermore, BDA is often applied to effective supply chain manage-
ment. Neural networks have become important to predict company bankruptcy and
credit risk to enhance profitability, as demonstrated in [36]. In [37] it is discovered by
bibliometric research that AI and ML are currently used in several financial domains.
Their analysis brought attention to a growing trend in applications of AI and ML. In
[38] a pattern is discovered in the knowledge-based systems’ subject shift. The Latent
Dirichlet Allocation (LDA) topic modeling is utilized to predict future trends and
340 P. Majumdar and S. Mitra
profile the hotspots of KnoSys. The main study fields that they highlighted are fuzzy,
ML, data mining, decision making, expert systems, and optimization. The results
additionally demonstrate that the communities inside KnoSys are becoming more
interested in computational intelligence and that building useful systems through
the application of knowledge and precise forecasting models is a top priority. AI
is a broad technological tool that now includes six sub-domains, namely ML, deep
learning, robotics, fuzzy logic, natural language processing, and expert systems. Each
sub-domain can be given special attention by future academics, who can then inves-
tigate its applicability in other domain knowledge [39]. Similar to this, BDA is an
all-encompassing method for handling, processing, and evaluating big data [32].
Therefore, a wide range of methods, including data mining, multimedia analytics, and
cognitive modeling, are included in BDA. Some companies use big data models and
technologies based on cloud computing. For example, GoogleFS is a distributed file system for applications that generate big data [40]. There are numerous instances of BDA being successfully used in the real world to transform businesses. There are abundant opportunities for BDA adoption in the retail sector. Businesses that have been using BDA
for their marketing and sales campaign for the past five years have reported a 15–20%
return on investment (ROI). Retail businesses use BDA to enhance various aspects
of their business, including supply chain, marketing, vending, and store manage-
ment [41]. Currently, the retail industry’s stakeholders can use BDA to maximize
profits and prevent or lessen the migration of customers from physical retail stores
to online retailers. For instance, in the Walmart retail industry, tools such as Hadoop,
cluster sales, clickstream, online data, and social media data are used in conjunction
with online, social media, and predictive analytics capabilities as well as trend, data
visualization, and market basket analysis. In the subsequent section, a particular case study, namely Big Mart sales forecasting, is discussed and analyzed in the context of business transformation leveraging different ML algorithms, whose performance is compared using different performance metrics. Value creation
processes involve discovery, forecasting, tracking, personalization, and optimization.
Gained benefits include higher sales, lower expenses, higher customer satisfaction,
and better performance. Amazon is another online retailer, where real-time BDA,
models, and S3 and Dynamo data warehouse capabilities are used. Customization,
forecasting, and ML are the methods utilized to create value. Improved client loyalty
and experience are among the benefits [42].
In this case study, we look at how product sales from a specific outlet are estimated
using a two-level method, and we also look at how different ML algorithms may be
utilized for predictive learning according to predictive performance indicators. Data
exploration, data translation, and feature engineering are crucial tasks for accurately
projecting results. Table 1 provides a description of the dataset. There are 8523
distinct data points in the dataset. The dataset is available online [43].
3.2 Methodology
The dataset is wrangled and prepared for training. The dataset is cleaned (where
duplicates are removed, errors are corrected, missing values are imputed, and normal-
ization is done). The impacts of the specific order in which we gathered and/or
otherwise prepared the data are then eliminated when the data is randomized. After
that, additional exploratory research is carried out and data visualization is used
to help identify pertinent correlations between variables or class imbalances (bias
alert). Following that, training (70%) and testing (30%) datasets are created. During data exploration, the dataset yields valuable information: data from the accessible sources are compared against information derived from hypotheses. In their raw form, some values are not appropriate and must be translated, for example converting an outlet's establishment year into the age of that specific outlet. In the dataset, there are 10
unique outlets and 1559 unique products. There are sixteen distinct values in Item
type. There are some misspellings “low fat” instead of “Low Fat (LF)” and “regular”
instead of “Regular (RL)”. For data cleaning, the mean and mode are used to substitute missing numerical attributes, which also reduces the correlation between the reconstructed attributes. Certain peculiarities in the dataset were found during the data exploration phase; to build a suitable model, all anomalies discovered in the dataset are resolved during this phase, so that all products become comparably likely to be sold. All inconsistencies in categorical attributes are resolved by replacing each value with the appropriate one. It was found that the unique ID in the item identification attribute starts with either DR, FD, or NC. As a result, we create Item Type New and assign each item to one of three categories: foods, drinks, or non-consumables. Because non-consumable items cannot have a fat content, a third item fat content category, “none”, is added. From the actual dataset, we chose only 9 features, viz. ItemWeight, ItemFatContent, ItemVisibility, ItemType, ItemMRP, OutletEstablishmentYear, OutletLocationType, and OutletType.
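The wrangling steps described above can be sketched in a few lines of pandas. The mini-frame below is illustrative, not the actual Big Mart data: the column names follow the chapter's feature list, and the values are invented.

```python
import pandas as pd
import numpy as np

# Hypothetical mini-frame mirroring the Big Mart schema described in the text.
df = pd.DataFrame({
    "ItemIdentifier": ["FDA15", "DRC01", "NCD19", "FDX07"],
    "ItemWeight":     [9.3, np.nan, 17.5, 19.2],
    "ItemFatContent": ["low fat", "Regular", "Low Fat", "regular"],
    "ItemMRP":        [249.8, 48.3, 141.6, 182.1],
})

# 1. Impute missing numerical attributes with the mean.
df["ItemWeight"] = df["ItemWeight"].fillna(df["ItemWeight"].mean())

# 2. Resolve inconsistent category spellings ("low fat" vs "Low Fat", etc.).
df["ItemFatContent"] = df["ItemFatContent"].replace(
    {"low fat": "Low Fat", "LF": "Low Fat", "regular": "Regular"})

# 3. Derive Item Type New from the identifier prefix (DR / FD / NC).
prefix_map = {"DR": "Drinks", "FD": "Foods", "NC": "Non-Consumables"}
df["ItemTypeNew"] = df["ItemIdentifier"].str[:2].map(prefix_map)

# 4. Non-consumables cannot have a fat content: use the third category "None".
df.loc[df["ItemTypeNew"] == "Non-Consumables", "ItemFatContent"] = "None"

print(df[["ItemFatContent", "ItemTypeNew", "ItemWeight"]])
```

The same four steps scale unchanged to the full 8523-row dataset.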
The next step is model building, where different predictive models are used to forecast sales. The prediction models are discussed below:
Decision Tree
Decision tree (DT) functions by building a structure like a tree that illustrates the
connections between a dataset’s attributes and the desired variable. Choosing the
optimal attribute to act as the decision tree’s root node is the first step. The dataset is
divided into subsets according to the values of the specified attribute once the root
node has been picked. The procedure is then repeated for each child node, this time
choosing the best attribute to split on from the other attributes. A leaf node is formed
when one of the halting criteria is satisfied [44].
Support Vector Regression
Finding the hyperplane that covers the maximum number of points is the goal of support vector regression (SVR). Rather than minimizing the difference between the real and predicted values, SVR aims to fit the best line within a threshold value. That particular
value represents the distance between the boundary line and the hyperplane. Support
vectors are the nearest data points on either side of the hyperplane [44].
Random Forest
In Random Forest (RF), bootstrap aggregation, also known as bagging, is used to solve regression problems by combining many decision trees. Instead of depending on an individual decision tree to reach a conclusion, multiple decision trees are integrated. RF applies the bootstrap approach by randomly selecting rows and features to build a large number of decision trees. The more trees there are in the forest, the more accurate the model and the less likely it is to overfit. In bagging, samples are drawn from the training data at random with replacement, and each tree is trained on its own bootstrap sample [44].
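As a sketch of the model-building step, the snippet below trains the three regressors described above with the chapter's 70/30 split. The data is synthetic and the resulting scores are illustrative, not the Big Mart results.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the prepared features (the real dataset has 8523 rows);
# the target is a smooth function of the inputs plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 4))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] ** 2 + rng.normal(0, 0.1, 500)

# 70% training / 30% testing split, as in the methodology above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

models = {
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "SVR": SVR(epsilon=0.1),  # epsilon sets the width of the error-tolerant tube
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R^2 on held-out data
    print(f"{name}: R^2 = {scores[name]:.3f}")
```

Swapping in the cleaned Big Mart features is a matter of replacing `X` and `y`; the fit/score loop is unchanged.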
Fig. 1 a The sample output of Extreme Gradient Boosting algorithm. b The sample output of
Decision Tree. c The sample output of Support Vector Regression. d The sample output of Random
Forest for Big Mart Sales Forecasting
The EVS, which explains the distribution of the errors in a given dataset, is expressed in Eq. (1):

$$\text{explained variance}(y, \hat{y}) = 1 - \frac{\operatorname{Var}(y - \hat{y})}{\operatorname{Var}(y)} \qquad (1)$$

Here, the variances of the actual values and of the prediction errors are denoted by Var(y) and Var(y − ŷ), respectively. Scores close to 1.0 indicate that the error variance is small relative to the variance of the data, which is highly desirable.
When evaluating the efficacy of a regression model, MAE is calculated as the average absolute difference between the predicted and actual values, expressed in Eq. (2):

$$\text{MAE} = \frac{\sum_{i=1}^{n} |y_i - x_i|}{n} \qquad (2)$$

where n is the total number of data points, y_i is the predicted value, and x_i is the true value.
MSE computes the average of the squared differences between the estimated and actual values, expressed in Eq. (3):

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (3)$$

where y_i is the predicted value, ŷ_i is the true value, and n is the total number of data points.
RMSD describes the sample standard deviation of the differences between the predicted and observed values, expressed in Eq. (4):

$$\text{RMSD} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \hat{x}_i)^2}{N}} \qquad (4)$$
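These four metrics can be computed directly from their definitions. The snippet below does so on small illustrative vectors (not the chapter's actual results) and checks the hand-rolled EVS, MAE, and MSE against scikit-learn's implementations.

```python
import numpy as np
from sklearn.metrics import (explained_variance_score,
                             mean_absolute_error, mean_squared_error)

# Small illustrative actual/predicted vectors.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

# Eq. (1): EVS = 1 - Var(errors) / Var(actual values)
evs = 1.0 - np.var(y_true - y_pred) / np.var(y_true)
# Eq. (2): MAE = mean absolute difference
mae = np.mean(np.abs(y_true - y_pred))
# Eq. (3): MSE = mean squared difference
mse = np.mean((y_true - y_pred) ** 2)
# Eq. (4): RMSD = square root of the MSE
rmsd = np.sqrt(mse)

# The hand-rolled values agree with scikit-learn's implementations.
assert np.isclose(evs, explained_variance_score(y_true, y_pred))
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
print(f"EVS={evs:.4f}  MAE={mae:.4f}  MSE={mse:.4f}  RMSD={rmsd:.4f}")
```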
There are numerous approaches and methods for deriving knowledge from unpro-
cessed big data. Most of the time, data scientists use ML to create high-performance
software agents that can learn from the data, and statistics to evaluate various
knowledge-related hypotheses. For extracting and utilizing knowledge, scientists
utilize data cleaning, data transformation, and data visualization. Some of the targets
of data mining and knowledge extraction process are classification to predict the
class of a new item, clustering of datasets into distinct categories, identifying asso-
ciation or relation between two or more variables in a large dataset, summarizing
the properties of datasets, and identifying events, observations or items that do not
follow a specific pattern. Data mining experts utilize several tools and techniques
such as Decision Tree, Bayesian methods, linear regression, and k-means clustering
and also parameterize them to suit specific business requirements.
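As a minimal illustration of the clustering target mentioned above, the sketch below partitions toy 2-D points into two distinct categories with k-means; the points are invented purely for demonstration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points forming two obvious groups.
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.2], [8.1, 7.9], [7.8, 8.0]])

# k-means clusters the dataset into k categories; k (n_clusters) is one of
# the parameters practitioners tune to suit the business requirement.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
print(labels)  # first three points share one label, last three the other
```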
This section outlines the potential challenges and probable solutions for adopting
BDA and AI in business. Researchers can focus on each of AI’s subdomains sepa-
rately and investigate how it can be applied to different fields of expertise. Similarly, big data handling, processing, and management are addressed holistically through BDA [36].
Subsequent studies could be devoted to understanding the value that a particular
BDA tool provides to industry, government, management, and policymakers. The AI
and BDA applications in different industries may be compromised concerning user
privacy and data security [32]. Nonetheless, in cases involving these kinds of issues,
current corporation law is unable to deliver justice [12]. Hence, in order to preserve ethical norms and maintain the benefits that AI and BDA provide across a range of industries, corporate governance may need to create legal frameworks. It has
been noted that recorded data are frequently erroneous, noisy, or incomplete. This
presents a significant obstacle to the data analysis process. Thus, in order to apply
analytics, extensive data cleaning and preparation are necessary. Another problem
is the data’s ongoing exponential growth, which makes it challenging for businesses
to verify its reliability. When it comes to BDA, the veracity problem is thought to
be the most challenging one, surpassing even volume, velocity, and variety. Busi-
nesses are urging their clients to participate in surveys, evaluations, and feedback
since it might aid in the development of new products. To remove noise and irreg-
ularities from the data, it is therefore imperative to preprocess the data. Because
the organization receives heterogeneous data from several sources, it is very chal-
lenging to integrate and run various analytics in order to produce actionable business
insights quickly. As a result, it is critical for a company to build up a strong data
management system that handles a variety of data, continuously investigates data
upon request, and generates business insights that business decision makers can use.
Another significant obstacle when creating data assets is data security. It is crucial for
businesses to plan ahead for any kind of data breach, develop a mechanism to iden-
tify them in real time to reduce the negative effects, and create extremely secure and
reliable data management systems. Organizational leadership and strategy are the
main management obstacles that can impede the effective implementation of BDA
[39]. A company’s active leadership must be in line with its strategic objectives in
order to succeed. A significant challenge for businesses is determining and assessing
the business value of BDA. While determining the ROI from BDA and dissecting
the relationship between BDA and business outcomes, organizations should exer-
cise caution. This necessitates appropriately mapping data, analytics, and business
processes to the optimal business outcomes and assessing their impact on achieving
those outcomes [46]. BDA is an effective instrument for turning a company around
and generating strategic business value. To find new data-driven business opportuni-
ties, firms must invest in BDA in addition to hiring qualified analysts and
adopting a strategic positioning plan.
This will lead to a number of issues because they won’t be informed about discus-
sions regarding the most recent requirements. They will either be unable to articulate
their thoughts or they will later suggest changes.
• Unrealistic timeframes
Business analysts might encounter a challenging scenario where deadlines are an
issue. Then pressure is generated, which could interfere with their work. If so, be
aware of how to handle the situation while preserving the caliber of the work.
• Technical proficiency
It’s a misconception that business analysts don’t need technical skills. On the contrary, most of them excel at coding, are adept at maintaining business procedures, and have a talent for technically fulfilling the requirements.
• Professionalism
One of the most neglected, undervalued, and underpaid groups in the IT industry
is the business analyst. They often act as a liaison between the technical and business
aspects of a project. They are the ones who support the project from start to finish
and who contribute to the creation of the project plan.
• Disagreement between users
Business analysts may occasionally find themselves in a position where they are unable to comprehend the user’s complaint. This occurs during the product launch
phase and may manifest as unkind comments. When a team makes a new strategy
suggestion that is relevant to the current business process, there may even be conflict
between stakeholders and business analysts.
A structured framework for decision-making and authority over data and data-related
topics is provided by data governance. Businesses that attempt to manage their
contemporary ecosystems may encounter numerous obstacles. Large, monolithic
systems that held the majority of the world’s mission-critical data have become less
common. These days, businesses add CRM software, digital marketing automation,
e-commerce platforms, customer support tools, and other features to their massive
ERP systems. Ad hoc data sets of smaller size are also typical; these usually take the form of small home-grown databases or Excel workbooks. Thus, the volume of data is increasing day by day. Controlling structured data is not too difficult.
Determining the characteristics of the data and identifying records that don’t live up
to expectations is a pretty straightforward proposition. For unstructured data, this
is not the case. Companies are facing a massive amount of unstructured data in the
form of social media, audio, video, and online reviews. Such unstructured data is
frequently moving and disorganized. Unstructured data has a wide range of quality
attributes. Every possible point of failure for data assets should be taken into consid-
eration by governance. This covers hazards such as noise, incorrect data defaults,
and loss of sensor signal. Data is growing in volume, velocity, and variety. B2B and
consumer interactions are now conducted digitally. IoT devices provide precise loca-
tion, temperature, time, and many other attribute data. Data in the form of audio and
video are more crucial than ever. According to research by Gartner, professionals look
for information for half of their workday and take eighteen minutes on average to trace a document. The issue is becoming worse as data volumes increase, under-
scoring the pressing need for rules-based workflows and automation to speed up data
quality and governance tasks. In addition to assisting businesses in creating a cohe-
sive and comprehensive view of their own internal data, data governance provides the
framework for enhancing data with demographic information, geographic context,
and other elements that enhance the value of the conclusions and choices drawn from
the data. Crucial factors are business value, security, and compliance. Each of these,
such as hiding personally identifiable information (PII), may have an impact on the
requirements for data integrity, storage, and access. Organizations can increase busi-
ness value by addressing these essential requirements and optimizing the contextual
richness of their data through the use of data governance frameworks. If governance
initiatives are not developed with people, processes, and technology in mind, they
will not have much of an impact. Frameworks for data governance need to be compre-
hensive, built on cross-functional cooperation, common language, and a shared set
of metrics and standards. Transparency across the board alone may not be sufficient: technologies for monitoring data quality cannot operate in isolation from governance.
Big data analytics are now used in banking and Securities for trade visibility, customer
data transformation, enterprise credit risk reporting, tick analytics, card fraud detec-
tion, archiving of audit trails, social analytics for trading, IT operations analytics, and
IT policy compliance analytics. Big data analytics are used to solve Industry-specific
Big Data Challenges like collecting, analyzing, and utilizing consumer insights and
understanding patterns of real-time, media content usage. Big data analytics are
used in health sectors where certain hospitals are utilizing patient data gathered from
a mobile app, spanning millions of interactions, to enable physicians to practice
evidence-based medicine instead of requiring a battery of lab tests. While a
battery of tests has the potential to be effective, it is typically inefficient and costly.
One university has created visual data that enables quicker identification and effective
analysis of healthcare information, used in tracking the spread of chronic disease.
This has been done by using free public health data and Google Maps. Higher educa-
tion makes extensive use of big data. A learning management system can be implemented using big data to keep track of a variety of things, including the time a
student spends on various system pages, when they log on, and their overall progress
over time. Big Data has a wide range of uses in public services, such as environ-
mental protection, energy exploration, financial market analysis, fraud detection,
and health-related research. Big data analytics can be utilized to optimize staffing using information from nearby events, shopping trends, and other sources, to decrease instances of fraud, and to perform prompt inventory analysis. Marketing makes exten-
sive use of big data in order to gain a deeper understanding of customer behavior and
preferences. Among the applications of big data in marketing is customer segmentation, where customers are divided into groups based on their behavior and preferences, using big data technologies that analyze customer data. This enables
marketers to design campaigns that are more focused and successful. By analyzing
consumer data, big data technologies can make tailored offers and recommenda-
tions. By utilizing big data technologies, it is possible to forecast future trends and
behaviors by analyzing customer behavior. This can help to increase the efficacy of
marketing campaigns and provide guidance for marketing strategies.
7 Conclusion
References
1. Elgendy, N., Elragal, A.: Big data analytics: a literature review. In: Advances in Data Mining.
Applications and Theoretical Aspects: 14th Industrial Conference, St. Petersburg, Russia 14:
214–227 (2014)
2. Russom, P.: Big data analytics. TDWI Best Pract. Rep. Fourth Quart 19(4), 1–34 (2011)
3. Zakir, J., Seymour, T., Berg, K.: Big data analytics. Issues Inf. Syst. 16(2) (2015)
4. Power, D.J., Heavin, C., McDermott, J., Daly, M.: Defining business analytics: an empirical
approach. J. Bus. Anal. 1(1), 40–53 (2018)
5. Delen, D., Ram, S.: Research challenges and opportunities in business analytics. J. Bus. Anal.
1(1), 2–12 (2018)
6. Goundar, S., Nayyar, A., Maharaj, M., Ratnam, K., Prasad, S.: How artificial intelligence is
transforming the ERP systems. Enterp. Syst. Technol. Converg.: Res. Pract. 85 (2021)
7. Chatterjee, S., Rana, N.P., Tamilmani, K., Sharma, A.: The effect of AI-based CRM on organi-
zation performance and competitive advantage: an empirical analysis in the B2B context. Ind.
Mark. Manag. 97, 205–219 (2021)
8. Cavanillas, J.M., Curry, E., Wahlster, W.: The big data value opportunity. In: New Horizons
for a Data-Driven Economy: A Roadmap for Usage and Exploitation of Big Data in Europe,
3–11 (2016) https://doi.org/10.1007/978-3-319-21569-3_1
9. Zillner, S., Bisset, D., Milano, M., Curry, E., Hahn, T., Lafrenz, R., et al.: Strategic research,
innovation and deployment agenda—AI, data and robotics partnership, p. 3. BDVA, euRobotics,
ELLIS, EurAI and CLAIRE, Brussels (2020)
10. Curry, E.: The big data value chain: definitions, concepts, and theoretical approaches. In: New
Horizons for a Data-Driven Economy: A Roadmap for Usage and Exploitation of Big Data in
Europe, 29–37 (2016). https://doi.org/10.1007/978-3-319-21569-3_3
11. InGRAM: 6 big data use cases in retail (2017) [online]. Available at: https://imaginenext.ingrammicro.com/data-center/6-big-data-use-cases-in-retail. Accessed 7 November 2019
12. Sheikh, R.A., Goje, N.S.: Role of big data analytics in business transformation. Internet Things
Bus. Transform.: Dev. Eng. Bus. Strat. Ind. 5, 231–259 (2021)
13. Kumar, R.: A framework for assessing the business value of information technology
infrastructures. J. Manag. Inf. Syst. 21(2), 11–32 (2004)
14. Di Vaio, A., Palladino, R., Hassan, R., Escobar, O.: Artificial intelligence and business models
in the sustainable development goals perspective: a systematic literature review. J. Bus. Res.
121, 283–314 (2020)
15. Hu, F., Liu, W., Tsai, S.B., Gao, J., Bin, N., Chen, Q.: An empirical study on visualizing
the intellectual structure and hotspots of big data research from a sustainable perspective.
Sustainability 10, 667 (2018)
16. Giuffrida, N., Fajardo-Calderin, J., Masegosa, A.D., Werner, F., Steudter, M., Pilla, F.: Opti-
mization and machine learning applied to last-mile logistics: a review. Sustainability 14, 5329
(2022)
17. Loureiro, S.M.C., Nascimento, J.: Shaping a view on the influence of technologies on
sustainable tourism. Sustainability 13, 12691 (2021)
18. Chen, H., Chiang, R.H., Storey, V.C.: Business intelligence and analytics: from big data to big
impact. MIS Q. 1, 1165–1188 (2012)
19. Thayyib, P.V., Mamilla, R., Khan, M., Fatima, H., Asim, M., Anwar, I., Shamsudheen, M.K.,
Khan, M.A.: State-of-the-art of artificial intelligence and big data analytics reviews in five
different domains: a bibliometric summary. Sustainability 15(5), 4026 (2023)
20. Lin, S.S., Shen, S.L., Zhou, A., Xu, Y.S.: Risk assessment and management of excavation
system based on fuzzy set theory and machine learning methods. Autom. Constr. 122, 103490
(2021)
21. Mukherjee, S., Bala, P.K.: Detecting sarcasm in customer tweets: an NLP based approach. Ind.
Manag. Data Syst. 117(6), 1109–1126 (2017)
22. Mantyla, M.V., Graziotin, D., Kuutila, M.: The evolution of sentiment analysis—a review of
research topics, venues, and top cited papers. Comput. Sci. Rev. 27, 16–32 (2018)
23. O’Leary, D.E.: Massive data language models and conversational artificial intelligence:
emerging issues. Intell. Syst. Account. Financ. Manag. 29, 182–198 (2022)
24. Bendre, M.R., Thool, V.R.: Analytics, challenges and applications in big data environment: A
survey. J. Manag. Anal. 3, 206–239 (2016)
25. dos Santos, B.S., Steiner, M.T.A., Fenerich, A.T., Lima, R.H.P.: Data mining and machine
learning techniques applied to public health problems: a bibliometric analysis from 2009 to
2018. Comput. Ind. Eng. 138, 106120 (2019)
26. Iaksch, J., Fernandes, E., Borsato, M.: Digitalization and big data in smart farming—a review.
J. Manag. Anal. 8, 333–349 (2021)
27. Kim, G.H., Trimi, S., Chung, J.H.: Big-data applications in the government sector. Commun.
ACM 57, 78–85 (2014)
28. Rajagopalan, M., Vellaipandiyan, S.: Big data framework for national e-governance plan.
In: Proceedings of the 2013 Eleventh International Conference on ICT and Knowledge
Engineering, Bangkok, Thailand. 1–5 (2013)
29. Ravi, V., Kamaruddin, S.: Big data analytics enabled smart financial services: opportunities
and challenges. In: Proceedings of the International Conference on Big Data Analytics, 15–39
(2017)
30. Fang, B., Zhang, P.: Big data in finance. In: Big Data Concepts, Theories, and Applications,
391–412 (2016)
31. Lopez-Robles, J.R., Otegi-Olaso, J.R., Gomez, I.P., Cobo, M.J.: 30 years of intelligence models
in management and business: a bibliometric review. Int. J. Inf. Manag. 48, 22–38 (2019)
32. Wamba, S.F., Bawack, R.E., Guthrie, C., Queiroz, M.M., Carillo, K.D.A.: Are we preparing
for a good AI society? A bibliometric review and research agenda. Technol. Forecast. Soc.
Chang. 164, 120482 (2021)
33. Mishra, S., Tripathi, A.R.: Literature review on business prototypes for digital platform. J.
Innov. Entrep. 9, 23 (2020). https://doi.org/10.1186/s13731-020-00126-4
34. Verma, S., Sharma, R., Deb, S., Maitra, D.: Artificial intelligence in marketing: systematic
review and future research direction. Int. J. Inf. Manag. Data Insights 1, 100002 (2021)
35. Batistic, S., van der Laken, P.: History, evolution and future of big data and analytics: a biblio-
metric analysis of its relationship to performance in organizations. Br. J. Manag. 30, 229–251
(2019)
36. Khanra, S., Dhir, A., Mantymaki, M.: Big data analytics and enterprises: a bibliometric
synthesis of the literature. Enterp. Inf. Syst. 14, 737–768 (2020)
37. Linnenluecke, M.K., Marrone, M., Singh, A.K.: Conducting systematic literature reviews and
bibliometric analyses. Aust. J. Manag. 45, 175–194 (2020)
38. Erevelles, S., Fukawa, N., Swayne, L.: Big data consumer analytics and the transformation of
marketing. J. Bus. Res. 69, 897–904 (2016)
39. Siemens, G.: Learning analytics: the emergence of a discipline. Am. Behav. Sci. 57, 1380–1400
(2013)
40. Nicolae, B., Moise, D., Antoniu, G., Bouge, L., Dorier, M.: BlobSeer: bringing high throughput
under heavy concurrency to Hadoop Map-Reduce applications. In: Proceedings of the 2010
IEEE International Symposium on Parallel & Distributed Processing (IPDPS), Atlanta, GA,
USA, 1–11 (2010)
41. Ding, Y., Jin, M., Li, S., Feng, D.: Smart logistics based on the internet of things technology:
an overview. Int. J. Logist. Res. Appl. 24, 323–345 (2021)
42. Hewage, T., Halgamuge, M., Syed, A., Ekici, G.: Review: Big data techniques of Google,
Amazon, Facebook and Twitter. J. Commun. 13(2), 94–100 (2018)
43. Kuila, A.: Big data sales prediction (2023) [online]. Available at: https://www.kaggle.com/dat
asets/akashdeepkuila/big-mart-sales. Accessed 15 November 2023
44. Majumdar, P., Bhattacharya, D., Mitra, S.: Prediction of evapotranspiration and soil moisture
in different rice growth stages through improved salp swarm based feature optimization and
ensembled machine learning algorithm. Theor. Appl. Climatol., 1–25 (2023)
45. Majumdar, P., Bhattacharya, D., Mitra, S., Solgi, R., Oliva, D., Bhusan, B.: Demand prediction
of rice growth stage-wise irrigation water requirement and fertilizer using Bayesian genetic
algorithm and random forest for yield enhancement. Paddy Water Environ. 21(2), 275–293
(2023)
46. Wang, Y., Kung, L., Byrd, T.A.: Big data analytics: understanding its capabilities and potential
benefits for healthcare organizations. Technol. Forecast. Soc. Change J. 126, 3–13 (2018)