
Studies in Big Data 145

Pushpa Singh
Asha Rani Mishra
Payal Garg Editors

Data Analytics and Machine Learning
Navigating the Big Data Landscape
Studies in Big Data

Volume 145

Series Editor
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
The series “Studies in Big Data” (SBD) publishes new developments and advances in
the various areas of Big Data quickly and with high quality. The intent is to cover the
theory, research, development, and applications of Big Data, as embedded in the fields
of engineering, computer science, physics, economics and life sciences. The books of
the series refer to the analysis and understanding of large, complex, and/or distributed
data sets generated from recent digital sources coming from sensors or other physical
instruments as well as simulations, crowd sourcing, social networks or other internet
transactions, such as emails or video click streams, among others. The series contains
monographs, lecture notes and edited volumes in Big Data spanning the areas of
computational intelligence including neural networks, evolutionary computation,
soft computing, fuzzy systems, as well as artificial intelligence, data mining, modern
statistics and operations research, as well as self-organizing systems. Of particular
value to both the contributors and the readership are the short publication timeframe
and the world-wide distribution, which enable both wide and rapid dissemination of
research output.
The books of this series are reviewed in a single blind peer review process.
Indexed by SCOPUS, EI Compendex, SCIMAGO and zbMATH.
All books published in the series are submitted for consideration in Web of Science.
Pushpa Singh · Asha Rani Mishra · Payal Garg
Editors

Data Analytics and Machine Learning
Navigating the Big Data Landscape
Editors
Pushpa Singh
GL Bajaj Institute of Technology & Management
Greater Noida, Uttar Pradesh, India

Asha Rani Mishra
GL Bajaj Institute of Technology & Management
Greater Noida, Uttar Pradesh, India

Payal Garg
GL Bajaj Institute of Technology & Management
Greater Noida, Uttar Pradesh, India

ISSN 2197-6503 ISSN 2197-6511 (electronic)


Studies in Big Data
ISBN 978-981-97-0447-7 ISBN 978-981-97-0448-4 (eBook)
https://doi.org/10.1007/978-981-97-0448-4

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2024

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore

Paper in this product is recyclable.


Preface

About This Book

In today’s data-driven world, organizations are challenged to extract valuable insights


from vast amounts of complex data to gain a competitive advantage. The integra-
tion of data analytics and machine learning has become the keystone of innovation,
unlocking insights, trends, and potential of data for the recent transformation in
various domains. This book, “Data Analytics and Machine Learning—Navigating
the Big Data Landscape,” is a comprehensive exploration of the synergies between
Data Analytics and Machine Learning, providing a roadmap for a new industry
revolution. This book offers a comprehensive exploration of fundamentals of Data
Analytics, Big Data, and Machine Learning. This book offers a holistic perspective
on Data Analytics and Big Data, encompassing diverse topics and approaches to
help readers navigate the intricacies of this rapidly evolving field. This book serves
to cover a broader view of Machine Learning techniques in Big-Data analytics, Chal-
lenges of Deep Learning models, Data Privacy and Ethics in Data Analytics, Future
Trends in Data Analytics and Machine Learning, and the practical implementation of
Machine Learning techniques and Data Analytics using R. This book explores how
the Big Data explosion, power of Analytics and Machine Learning revolution can
bring new prospects and opportunities in the dynamic and data-rich landscape. This
book aims to highlight the future research directions in Data Analytics, Big Data, and
Machine Learning that explore the emerging trends, challenges, and opportunities in
the related field by covering interdisciplinary approaches in Data Analytics, handling
and analyzing real-time and streaming data, and many more. This book offers a broad
review of existing literature, case studies, and valuable perceptions into the growing
nature of Data Analytics and Machine Learning and its inferences to the decision
support system to make managerial decisions to transform the business environment.


Intended Audience

Students and Academics: Students pursuing degrees in fields like data science,
computer science, business analytics, or related disciplines, as well as academics
conducting research in these areas, form a significant primary audience for books on
Data Analytics, Big Data, and Machine Learning.
Data Analysts and Data Scientists: These professionals are directly involved in
working with data, analyzing it, and deriving insights. They seek books that provide
in-depth knowledge, practical techniques, and advanced concepts related to data
analytics, big data, and machine learning.
Business and Data Professionals: Managers, executives, and decision-makers
who are responsible for making data-driven decisions in their organizations often
have a primary interest in understanding how Data Analytics, Big data, and Machine
Learning can be leveraged to gain a competitive advantage.

How Is This Book Organized?

This book has sixteen chapters covering big data analytics, machine learning, and
deep learning. The first two chapters provide an introductory discussion of Data
Analytics, Big Data, Machine Learning, and the life cycle of Data Analytics. Next,
Chapters 3 and 4 explore the building of predictive models and their application
in the field of agriculture. Further, Chapter 5 comprises a brief assessment of the
stream architecture and analysis of big data. Chapter 6 leverages data analytics and
a deep learning framework for image super-resolution techniques, while Chapter 7
applies data analytics and time series forecasting to Ethereum price prediction. As
“R” is a powerful statistical programming tool widely used for statistical analysis,
data visualization, and machine learning, Chapter 8 emphasizes “Practical
Implementation of Machine Learning Techniques and Data Analytics Using R”.
Deep learning models excel in feature learning, enabling the automatic extraction
of valuable information from huge data sets; hence, Chapter 9 presents deep
learning techniques in big data analytics. Chapter 10 deals with how organizations
and their professionals must meticulously put effort towards building data ethically
and ensuring its privacy. Chapters 11 and 12 present modern and real-world
applications of data analytics, machine learning, and big data, with various instances
from projects, case studies, and real-world scenarios illustrating their positive and
negative impacts on individuals and society. Going one step further, Chapter 13
unlocks insights by exploring data analytics and AI tool performance across
industries. Lung nodule segmentation using machine learning and deep learning
techniques is discussed in Chapter 14, which highlights the importance of deep
learning in the healthcare industry to support health analytics. Chapter 15 describes
the convergence of data analytics, big data, and machine learning: applications,
challenges, and future directions. Finally, Chapter 16 shows how big data analytics
and machine learning can transform a business.

Greater Noida, India

Pushpa Singh
Asha Rani Mishra
Payal Garg
Contents

Introduction to Data Analytics, Big Data, and Machine Learning . . . . . . 1


Youddha Beer Singh, Aditya Dev Mishra, Mayank Dixit,
and Atul Srivastava
Fundamentals of Data Analytics and Lifecycle . . . . . . . . . . . . . . . . . . . . . . . 19
Ritu Sharma and Payal Garg
Building Predictive Models with Machine Learning . . . . . . . . . . . . . . . . . . . 39
Ruchi Gupta, Anupama Sharma, and Tanweer Alam
Predictive Algorithms for Smart Agriculture . . . . . . . . . . . . . . . . . . . . . . . . . 61
Rashmi Sharma, Charu Pawar, Pranjali Sharma, and Ashish Malik
Stream Data Model and Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Shahina Anjum, Sunil Kumar Yadav, and Seema Yadav
Leveraging Data Analytics and a Deep Learning Framework
for Advancements in Image Super-Resolution Techniques: From
Classic Interpolation to Cutting-Edge Approaches . . . . . . . . . . . . . . . . . . . . 105
Soumya Ranjan Mishra, Hitesh Mohapatra, and Sandeep Saxena
Applying Data Analytics and Time Series Forecasting for Thorough
Ethereum Price Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Asha Rani Mishra, Rajat Kumar Rathore, and Sansar Singh Chauhan
Practical Implementation of Machine Learning Techniques
and Data Analytics Using R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Neha Chandela, Kamlesh Kumar Raghuwanshi, and Himani Tyagi
Deep Learning Techniques in Big Data Analytics . . . . . . . . . . . . . . . . . . . . . 171
Ajay Kumar Badhan, Abhishek Bhattacherjee, and Rita Roy
Data Privacy and Ethics in Data Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Rajasegar R. S., Gouthaman P., Vijayakumar Ponnusamy,
Arivazhagan N., and Nallarasan V.


Modern Real-World Applications Using Data Analytics
and Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
Vijayakumar Ponnusamy, Nallarasan V., Rajasegar R. S.,
Arivazhagan N., and Gouthaman P.
Real-World Applications of Data Analytics, Big Data, and Machine
Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
Prince Shiva Chaudhary, Mohit R. Khurana,
and Mukund Ayalasomayajula
Unlocking Insights: Exploring Data Analytics and AI Tool
Performance Across Industries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
Hitesh Mohapatra and Soumya Ranjan Mishra
Lung Nodule Segmentation Using Machine Learning and Deep
Learning Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
Swati Chauhan, Nidhi Malik, and Rekha Vig
Convergence of Data Analytics, Big Data, and Machine Learning:
Applications, Challenges, and Future Direction . . . . . . . . . . . . . . . . . . . . . . . 317
Abhishek Bhattacherjee and Ajay Kumar Badhan
Business Transformation Using Big Data Analytics and Machine
Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
Parijata Majumdar and Sanjoy Mitra
Contributors

Tanweer Alam Department of Computer and Information Systems, Islamic
University of Madinah, Madinah, Saudi Arabia
Shahina Anjum Department of CSE, IEC College of Engineering & Technology,
Greater Noida, Uttar Pradesh, India
Arivazhagan N. Department of Computational Intelligence, SRM Institute of
Science and Technology, Kattankulathur, Chennai, India
Mukund Ayalasomayajula Department of Materials Science and Engineering,
Cornell University, Ithaca, NY, USA
Ajay Kumar Badhan Department of Computer Science and Engineering, Lovely
Professional University, Phagwara, Punjab, India
Abhishek Bhattacherjee Department of Computer Science and Engineering,
Lovely Professional University, Phagwara, Punjab, India
Neha Chandela Computer Science and Engineering, Krishna Engineering
College, Uttar Pradesh, Ghaziabad, India
Prince Shiva Chaudhary Department of Data Science, Worcester Polytechnic
Institute, Worcester, MA, USA
Sansar Singh Chauhan Department of Computer Science, GL Bajaj Institute of
Technology and Management, Greater Noida, India
Swati Chauhan The NorthCap University, Gurugram, Haryana, India
Mayank Dixit Department of Computer Science and Engineering, Galgotia
College of Engineering and Technology, Greater Noida, UP, India
Payal Garg Department of Computer Science and Engineering, GL Bajaj Institute
of Technology and Management, Greater Noida, India
Gouthaman P. Department of Networking and Communications, SRM Institute
of Science and Technology, Kattankulathur, Chennai, India


Ruchi Gupta Department of Information Technology, Ajay Kumar Garg
Engineering College, Ghaziabad, India
Mohit R. Khurana Department of Materials Science and Engineering, Cornell
University, Ithaca, NY, USA
Parijata Majumdar Department of Computer Science and Engineering, Techno
College of Engineering, Agartala, Tripura, India
Ashish Malik Department of Mechanical Engineering, Axis Institute of
Technology & Management, Kanpur, India
Nidhi Malik The NorthCap University, Gurugram, Haryana, India
Aditya Dev Mishra Department of Computer Science and Engineering, Galgotia
College of Engineering and Technology, Greater Noida, UP, India
Asha Rani Mishra Department of Computer Science, GL Bajaj Institute of
Technology and Management, Greater Noida, India
Soumya Ranjan Mishra School of Computer Engineering, KIIT (Deemed to Be)
University, Bhubaneswar, Odisha, India
Sanjoy Mitra Department of Computer Science and Engineering, Tripura
Institute of Technology, Agartala, Tripura, India
Hitesh Mohapatra School of Computer Engineering, KIIT (Deemed to Be)
University, Bhubaneswar, Odisha, India
Nallarasan V. Department of Networking and Communications, SRM Institute of
Science and Technology, Kattankulathur, Chennai, India
Charu Pawar Department of Electronics, Netaji Subhash University of
Technology, Delhi, India
Kamlesh Kumar Raghuwanshi Computer Science Department, Ramanujan
College, Delhi University, New Delhi, India
Rajasegar R. S. IT Industry, Cyber Security, County Louth, Ireland
Rajat Kumar Rathore Department of Computer Science, GL Bajaj Institute of
Technology and Management, Greater Noida, India
Rita Roy Department of Computer Science and Engineering, Gitam Institute of
Technology (Deemed-to-Be-University), Visakhapatnam, Andhra Pradesh, India
Sandeep Saxena Greater Noida Institute of Technology, Greater Noida, India
Anupama Sharma Department of Information Technology, Ajay Kumar Garg
Engineering College, Ghaziabad, India
Pranjali Sharma Department of Mechanical Engineering, Motilal Nehru
National Institute of Technology, Prayagraj, India

Rashmi Sharma Department of Information Technology, Ajay Kumar Garg
Engineering College, Ghaziabad, India
Ritu Sharma Department of Computer Science and Engineering, Ajay Kumar
Garg Engineering College, Ghaziabad, India
Youddha Beer Singh Department of Computer Science and Engineering,
Galgotia College of Engineering and Technology, Greater Noida, UP, India
Atul Srivastava Amity School of Engineering and Technology, AUUP, Lucknow,
India
Himani Tyagi University School of Automation and Robotics, GGSIPU, New
Delhi, India
Rekha Vig Amity University, Kolkata, West Bengal, India
Vijayakumar Ponnusamy Department of Electronics and Communications,
SRM Institute of Science and Technology, Kattankulathur, Chennai, India
Seema Yadav Department of MBA, Accurate Institute of Management and
Technology, Greater Noida, Uttar Pradesh, India
Sunil Kumar Yadav Department of CSE, IEC College of Engineering &
Technology, Greater Noida, Uttar Pradesh, India
Introduction to Data Analytics, Big Data,
and Machine Learning

Youddha Beer Singh, Aditya Dev Mishra, Mayank Dixit, and Atul Srivastava

Abstract Data has become the main driver behind innovation, decision-making,
and the change of many sectors and civilisations in the modern period. The dynamic
trinity of Data Analytics, Big Data, and Machine Learning is thoroughly introduced
in this chapter, which also reveals their profound significance, intricate relation-
ships, and transformational abilities. The fundamental layer of data processing is
data analytics. Data must be carefully examined, cleaned, transformed, and modelled
in order to reveal patterns, trends, and insightful information. A data-driven revolu-
tion is sparked by big data. In our highly connected world, data is produced with enormous
volume, variety, velocity, and veracity. The third pillar, machine learning, uses
data-driven algorithms to enable automated prediction and decision-making. This
chapter explores the key methods and tools needed to fully utilise the power
of data analytics and also discusses the technologies used in big data management,
processing, and insight extraction. A foundation is set for a thorough investigation of
these interconnected realms when we begin the chapters that follow. Data analytics,
big data, and machine learning are not distinct ideas; rather, they are woven into the
fabric of modern innovation and technology. This chapter serves as the beginning
of this captivating journey, providing a solid understanding of and insight into the
enormous possibilities of data-driven insights and wise decision-making.

Y. B. Singh (B) · A. D. Mishra · M. Dixit


Department of Computer Science and Engineering, Galgotia College of Engineering and
Technology, Greater Noida, UP, India
e-mail: youddhabeersingh@gmail.com
M. Dixit
e-mail: mayankdixit@galgotiacollege.edu
A. Srivastava
Amity School of Engineering and Technology, AUUP, Lucknow, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
P. Singh et al. (eds.), Data Analytics and Machine Learning, Studies in Big Data 145,
https://doi.org/10.1007/978-981-97-0448-4_1

1 Introduction

Data has become the foundation of modern society in the era of information, changing
the way we work, live, and engage with the outside world. Three crucial domains—
Data Analytics, Big Data, and Machine Learning—are at the centre of this revolu-
tionary environment, which has emerged from the convergence of data, technology,
and creativity. These interconnected sectors provide insights, forecasts, and solu-
tions that cut across industries, from healthcare and banking to transportation and
entertainment, and together they constitute the foundation of data-driven decision-
making. Comprehending their complexities and realising their potential is necessary
not only to remain competitive in today’s fast-paced world but also to open the door
to groundbreaking innovation and the advancement of society.
As the fundamental layer, data analytics enables us to convert unprocessed data
into meaningful insights. It uncovers secrets by methodically examining and inter-
preting patterns, shedding light on the way ahead. It propels optimisation, informs
tactics, and directs us towards more intelligent decisions. Our solution to the ever-
increasing volumes, velocities, types, and complexity of data is Big Data, a revolu-
tionary paradigm change. It makes data management, archiving, and analysis possible
on a scale never before possible. Big Data technology has made it possible for busi-
nesses to access the vast amount of information that is hidden within this torrent of
data. The pinnacle of data science, machine learning, is what we’ve been searching
for—intelligent, automated decision-making. These algorithms have revolutionised
our ability to recognise patterns, make predictions, and even replicate human cogni-
tive capabilities. They are inspired by the ability of humans to learn, adapt, and
evolve. Machine learning is now the basis for many cutting-edge applications, such
as customised healthcare and driverless cars. By using data to produce useful insights
and make wise decisions, data analytics transforms industries. Data analytics is
the process of identifying patterns, trends, and correlations in large datasets using
advanced algorithms and tools. This helps organisations anticipate market trends,
analyse customer behaviour, and optimise their operations. The application of data
analytics enables organisations across industries to generate value, drive growth, and
maintain competitiveness in today’s data-driven world. Benefits include improved
operational efficiency, better strategic planning, fostering innovation, and the ability
to provide personalised experiences.
In addition to being a theoretical voyage, this investigation into the trio of data
analytics, big data, and machine learning also serves as a useful manual for navi-
gating the always-changing data landscape. Deeper exploration of these fields reveals
opportunities to spur innovation, advance progress, and improve people’s quality of
life both individually and as a society. We go off on an educational journey in the
ensuing chapters, learning about the ideas, practices, and real-world applications of
various domains of transformation. The following pages hold the potential to provide
a glimpse into a future in which data is king and the ability to glean knowledge from
it opens up a limitless array of opportunities.
The following factors make the current study significant:

• The purpose of this study is to help IT professionals and researchers choose the
best big data tools and methods for efficient data analytics.
• It is intended to give young researchers insightful information so they can make
wise decisions and significant contributions to the scientific community.
• The outcomes of the study will act as a guide for the development of methods and
resources that combine cognitive computing with big data.
The following is a summary of this study’s main contributions:
• A thorough and in-depth evaluation of prior state-of-the-art studies on Machine
Learning (ML) approaches in Big Data Analytics (BDA).
• A brief overview of the key characteristics of the comparative machine learning
(ML) and big data analytics (BDA) approaches.
• A succinct summary of the important features of the compared methods for BDA
with ML.
• A brief discussion of challenges and future directions in data analytics and big
data with ML.
The remaining sections are arranged as follows: an overview of data analytics
is presented in Sect. 2. Section 3 presents a detailed discussion of big data, whereas
Sect. 4 describes how machine learning algorithms are used in big data analytics.
Section 5 examines the interplay of data analytics, big data, and machine learning,
and Sect. 6 discusses challenges and future directions in data analytics, big data,
and machine learning. The final section brings the study to a close.

2 Data Analytics

Data analytics has become a transformational force in the information era, showing
the way to efficient, innovative, and well-informed decision-making. Our ever-
growing digital environment creates an ocean of data, and being able to use this
wealth of knowledge has become essential for success on both an individual and
organisational level. By transforming raw data into useful insights, data analytics, the
methodical analysis and interpretation of data, enables us to successfully navigate this
enormous ocean of information. Fundamentally, data analytics is a dynamic process
that makes use of a range of methods and instruments to examine, purify, organise,
and model data. Data analysts find patterns, trends, and correlations through these
methodical activities that are frequently invisible to the unaided eye [1]. This trans-
lates into the corporate world as strategy optimisation, operational improvement,
and opportunity discovery. Applications and industries for data analytics are not
constrained. Its reach extends across a variety of industries, including marketing,
sports, healthcare, and finance. Data analytics is essential for many tasks, including
seeing patterns in consumer behaviour, streamlining healthcare services, forecasting
changes in the financial markets, and even refining sports tactics.
The emergence of sophisticated technology and increased processing capacity has
opened up new possibilities for data analytics. Data analytics can now do predictive

and prescriptive analytics in addition to historical analysis because of advancements


in machine learning and artificial intelligence [2]. Thanks to its capacity to predict
future trends and suggest the best courses of action, data analytics has emerged as
a powerful ally in the pursuit of success and innovation. As we learn more about
this area, we become aware of the approaches, resources, and practical uses that
enable businesses and people to derive value from data. In the data-driven era, data
analytics is a beacon of hope, paving the path for improved decision-making and a
deeper comprehension of the world we live in.

2.1 Data Analytics Process

Gather information directly from the source first. Then, work with and improve the
data to make sure it works with your downstream systems, converting it to the format
of your choice. The produced data should be kept in a data lake or data warehouse
so that it can be used for reporting and analysis or as a long-term archive [3]. Make
use of analytics tools to look through the data and draw conclusions. Data analytics
processes are shown in Fig. 1.
Data Capturing: You have a few choices for data capturing, depending on where
your data comes from:
• Data Migration Tools: To move data from one cloud platform or on-premises
to another, use data migration tools. For this reason, Google Cloud provides a
Storage Transfer Service.
• API Integration: Use APIs to retrieve data from outside SaaS providers and transfer
it to your data warehouse. A data transfer service is offered by Google Cloud’s
serverless data warehouse BigQuery to facilitate the easy import of data from
SaaS apps such as Teradata, Amazon S3, RedShift, YouTube, and Google Ads.
• Real-time Data Streaming: Use the Pub/Sub service to get data in real-time from
your applications. Set up a data source to send event messages to Pub/Sub so that
subscribers can process them and respond accordingly (a minimal publishing sketch is shown after this list).
• IoT Device Integration: Google Cloud IoT Core, which supports the MQTT
protocol for IoT devices, allows your IoT devices to broadcast real-time data.
IoT data can also be sent to Pub/Sub for additional processing.
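As a concrete illustration of the real-time streaming option above, the following is a minimal sketch of publishing an event to Pub/Sub with the official Python client library; the project ID, topic name, and event fields are hypothetical placeholders rather than values from this chapter.

import json
from google.cloud import pubsub_v1  # pip install google-cloud-pubsub

# Hypothetical project and topic names, used only for illustration.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("example-project", "clickstream-events")

# Pub/Sub payloads are bytes, so the event is serialised to JSON first.
event = {"user_id": 42, "action": "page_view", "ts": "2024-01-01T00:00:00Z"}
future = publisher.publish(topic_path, data=json.dumps(event).encode("utf-8"))

# result() blocks until the broker acknowledges the message and returns its ID.
print("Published message ID:", future.result())

Downstream, a subscriber attached to the same topic would receive and acknowledge each message, feeding the processing stage described next.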
Processing Data: The critical next step after data intake is data enrichment or
processing to get it ready for downstream systems. Three main features in Google
Cloud make this procedure easier:

Fig. 1 Process of data analytics



• Dataproc: Compared to conventional Hadoop settings, Dataproc saves time and


effort by streamlining cluster setup and acting as a managed Hadoop platform.
With clusters ready in less than 90 seconds, it allows for quick data processing.
• Dataprep: By removing the need for manual coding, an intuitive graphical user
interface tool enables data analysts to analyse data quickly.
• Dataflow: Using the open-source Apache Beam SDK for portability, this
serverless data processing solution manages batch and streaming data. The novel
architecture of Dataflow keeps computing and storage apart, allowing for smooth
scalability (a minimal Beam pipeline sketch is shown after this list).
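To make the Dataflow bullet above more concrete, here is a minimal Apache Beam pipeline sketch in Python; the bucket paths are hypothetical, and the pipeline runs locally unless Dataflow runner options are supplied.

import apache_beam as beam  # pip install apache-beam

def run():
    # Default options run the pipeline locally; DataflowRunner options
    # (project, region, temp_location) would be added to run on Google Cloud.
    with beam.Pipeline() as pipeline:
        (pipeline
         | "Read" >> beam.io.ReadFromText("gs://example-bucket/input.txt")
         | "Split" >> beam.FlatMap(lambda line: line.split())
         | "Pair" >> beam.Map(lambda word: (word, 1))
         | "Count" >> beam.CombinePerKey(sum)
         | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
         | "Write" >> beam.io.WriteToText("gs://example-bucket/word_counts"))

if __name__ == "__main__":
    run()

The same pipeline code handles batch files or streaming sources, which is the portability point made above about the Beam programming model.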
Data Storage: The data is then stored in a data lake or data warehouse for reporting
and analysis purposes, long-term archiving, or both. Two essential Google Cloud
technologies make this procedure easier:
Google Cloud Storage is an object store that can hold files, videos, and photos,
among other kinds of data. It provides four storage classes:
Standard Storage: Perfect for “hot” material that is accessed often, such as mobile
apps, streaming films, and webpages.
Nearline storage is an affordable option for long-tail multimedia content and data
backups that must be kept for a minimum of 30 days.
Coldline Storage: Exceptionally economical for data that needs to be stored for
at least ninety days, including disaster recovery.
Archive Storage: The most economical choice for data (such as regulatory
archives) that needs to be kept for at least a year.
BigQuery: A serverless data warehouse that can handle petabytes of data with
ease and doesn’t require server management. With BigQuery, working with your
team is simple since you can use SQL to store, query, and exchange data. Along
with pre-built interfaces to external services for simple data intake and extraction,
it also provides a store of free public datasets, facilitating further research and
visualisation.
Data Analysis: Data analysis comes next, following data processing and storage
in a data lake or data warehouse. You can use SQL within BigQuery to conduct direct
analysis if you have saved your data there. It is quite easy to move data from Google
Cloud Storage to BigQuery for analysis. BigQuery also offers machine learning
capabilities with BigQuery ML, which lets you use SQL, which may be more
familiar, to create models and make predictions right from the BigQuery UI [4].
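As a hedged illustration of the BigQuery ML capability described above, the sketch below trains and queries a simple model from Python; the dataset, table, and column names are hypothetical placeholders, not objects referenced in this chapter.

from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()  # picks up the default project and credentials

# Train a simple logistic-regression model with standard BigQuery ML SQL.
train_sql = """
CREATE OR REPLACE MODEL `example_dataset.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_charges, churned
FROM `example_dataset.customers`
"""
client.query(train_sql).result()  # result() waits for the training job to finish

# Score new rows with ML.PREDICT and read the predictions back in Python.
predict_sql = """
SELECT *
FROM ML.PREDICT(MODEL `example_dataset.churn_model`,
  (SELECT tenure_months, monthly_charges FROM `example_dataset.new_customers`))
"""
for row in client.query(predict_sql).result():
    print(dict(row))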
Using the Data: Once the data is in the data warehouse, machine learning may
be used to anticipate outcomes and get insights. Depending on your needs, you can
leverage the TensorFlow framework and AI Platform for additional processing and
prediction. A complete open-source machine learning platform, TensorFlow comes
with libraries, tools, and community resources. Developers, data scientists, and data
engineers may easily optimise their machine learning workflows with the help of
AI Platform. Every phase of the machine learning lifecycle is covered by the tools,
which go from preparation to build, validation, and deployment [4].

Data visualisation: There are many different data visualisation tools available,
and most of them include a BigQuery link so you can quickly generate charts with
the tool of your choosing. A few tools that Google Cloud offers are worth taking
a look at. In addition to connecting to BigQuery, Data Studio is free and offers
quick data visualisation through connections to numerous other services. Charts
and dashboards may be shared very easily, especially if you have experience with
Google Drive. Looker is also an enterprise platform for embedded analytics, data
applications, and business intelligence [4].

3 Big Data

With data expected to rise at an exponential rate to 180 ZB by 2025, data will
play a pivotal role in propelling twenty-first-century growth and transformation,
forming a new “digital universe” that will alter markets and businesses [4]. The
“Big Data” age has begun with this flood of digital data that comes from several
complex sources [5]. Large datasets that are too large for traditional software tools
to handle, store, organise, and analyse are referred to as big data [6]. The range of
heterogeneity and complexity displayed by these datasets goes beyond their mere
quantity. They include structured, semi-structured, and unstructured data, as well as
operational, transactional, sales, marketing, and a variety of other data types. Big
data also includes data in a variety of types, such as text, audio, video, photos, and
more. Interestingly, the category of unstructured data is growing more quickly than
structured data and is making up almost 90% of all data [7]. As such, it is critical
to investigate new processing capacities in order to derive data-driven insights that
facilitate improved decision-making.
The three Vs, volume, velocity, and variety are frequently used to describe Doug
Laney’s idea of big data, which is referenced in Refs. [7–9]. However, a number of
studies [8] have extended this idea to include five essential qualities (5Vs): volume,
velocity, variety, value, and veracity as shown in Fig. 2. As technology advances,
data storage capacity, data transfer rates, and system capabilities change, so does the
notion of big data [9]. The first “V” stands for volume and represents the exponential
growth in data size over time [5], with electronic medical records (EMRs) being a
major source of data for the healthcare sector [9]. The second “V” stands for velocity,
which describes the rate at which information is created and gathered in a variety of
businesses.
From Fig. 2 it is clear that big data is often characterised by the five Vs:
volume, velocity, variety, value, and veracity.
Volume: volume is the total amount of data, and it has significantly increased as a
result of the widespread use of sensors, Internet of Things (IoT) devices, linked smart-
phones, and ICTs (information and communication technologies), including artificial
intelligence (AI). With data generation exceeding Moore’s law, this data explosion
has produced enormous datasets that go beyond conventional measurements and
introduce terms like exabytes, zettabytes, and yottabytes.

Fig. 2 General idea of big data

Velocity: The rapid creation of data from linked devices and the internet that big
data brings to businesses in real time is what sets it apart. Businesses can benefit
greatly from this rapid inflow of data since it gives them the ability to move quickly,
become more agile, and obtain a competitive advantage. While some businesses have
previously harnessed big data for customer recommendations, today’s enterprises are
leveraging big data analytics to not only analyse but also act upon data in real time.
Variety: The period of Web 3.0 is characterised by diversity in data creation
sources and formats, as a result of the growth of social media and the internet, which
has produced a wide range of data types. These include text messages, status updates,
images, and videos posted on social media sites like Facebook and Twitter, SMS
messages, GPS signals from mobile devices, client interactions in online banking
and retail, contact centre voice data, and more. The constant streams of data from
mobile devices that record the location and activity of people are among the many
important sources of big data that are relatively new. In addition, a variety of online
sources provide data via social media interactions, click-streams, and logs.
Value: The application of big data can provide insightful information, and the data
analysis process can benefit businesses, organisations, communities, and consumers
in a huge way.
Veracity: Data accuracy and dependability are referred to as data veracity. In
cases when there are discrepancies or errors in the data collection process, veracity
measures the degree of uncertainty and dependability surrounding the information.
Big data provides businesses with a plethora of opportunities to boost effi-
ciency and competitiveness. It includes the ongoing collection of data as well as
the necessary technologies for data management, storage, collection, and analysis.
This paradigm change has altered fundamental aspects of organisations as well as
management. Big data is an essential tool that helps businesses discover new informa-
tion, provide value, and spur innovation in markets, procedures, and goods. Because
of this, data has become a highly valued resource, emphasising to business executives
the significance of adopting a data-driven strategy [10]. Businesses have accumu-
lated data for many years, but the current trend is more towards active data analysis
than passive storage.

Fig. 3 Big data trend

As a result, data-driven businesses outperform their non-data-
driven competitors in terms of financial and operational performance, increasing
profitability by 6% and productivity by 5%, giving them a considerable competi-
tive advantage [11]. As a result, businesses and organisations are becoming more
interested in using big data, as shown in Fig. 3.
From Fig. 3 it is clear that the use of big data has grown over time as its adoption
in businesses becomes more widespread. Through the process of data analysis, the
use of big data has the potential to provide insightful information that will benefit
businesses, organisations, communities, and consumers.

4 Machine Learning

Algorithms for machine learning (ML) have become popular for modelling, visual-
ising, and analysing large datasets. With machine learning (ML), machines can learn
from data, extrapolate their findings to unknown information, and forecast outcomes.
Various literature attests to the effectiveness of ML algorithms in a variety of applica-
tion domains. Based on the literature that is now accessible, machine learning can be
divided into four main classes: reinforcement learning, supervised learning, unsu-
pervised learning, and semi-supervised learning. Numerous open-source machine
learning methods are available for a range of applications, including ranking, dimen-
sionality reduction, clustering, regression, and classification. Singular-Value
Decomposition (SVD), Principal Component Analysis (PCA), Radial Basis Function
Neural Networks (RBF-NN), K-Nearest Neighbours (KNN), Hidden Markov Models
(HMM), Decision Trees (DT), Naive Bayes (NB), Tensor Auto-Encoders (TAE), and
Ensemble Learning (EL) are a few notable examples [12–16].
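For illustration, a minimal supervised-learning sketch using one of the algorithms named above (a decision tree) on a built-in scikit-learn dataset is shown below; it is a toy example under those assumptions, not code taken from the cited studies.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load a small built-in dataset and split it into training and test sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Train a decision tree (DT) classifier and evaluate it on held-out data.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))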
Machine learning is essential to big data and data analytics because it provides
strong tools and methods for deriving valuable insights from enormous and intricate
datasets. The following are some important ways that big data and data analytics
benefit from machine learning:

• Recognition and Prediction of Patterns: Machine learning algorithms are highly


proficient at detecting patterns and trends present in extensive datasets. Predictive
analytics, projecting future trends, and data-driven prediction are made possible
by this skill [16].
• Automated Data Processing: Preprocessing, cleaning, and transforming data are
just a few of the operations that ML algorithms can automate. This automation
increases productivity and decreases the amount of labour-intensive manual work
needed to handle large datasets.
• Anomaly Detection: Machine learning algorithms are able to find odd patterns
or departures from the norm by identifying anomalies or outliers in data. This is
very useful for finding mistakes, fraud, or abnormalities in large databases.
• Classification and Categorisation: Using patterns and characteristics, machine
learning algorithms are able to categorise data into groups or categories. This is
useful for classifying and arranging massive amounts of unstructured data.
• Recommendation Systems: Recommendation engines, which examine user
behaviour and preferences to offer tailored content or goods, are powered by
machine learning. Online services, streaming platforms, and e-commerce all make
extensive use of this.
• Real-Time Analytics: Machine learning makes real-time analytics possible by
processing and analysing data almost instantly. This allows for prompt
decision-making and flexibility in response to changing circumstances.
• Scalability: ML algorithms are well-suited for big data analytics, where tradi-
tional techniques may falter because they can scale to handle enormous and
heterogeneous datasets.
• Feature Engineering: Machine learning (ML) enables the extraction of pertinent
features from unprocessed data, enhancing the precision and efficiency of models
in comprehending intricate connections across large datasets.
• Continuous Learning: Machine learning models are dynamic and capable of
evolving to capture shifting patterns and trends in large datasets because they can
adjust and learn from new data over time.
• Natural Language Processing (NLP): A branch of machine learning that gives
computers the ability to comprehend, interpret, and produce human-like language.
This is useful for sentiment analysis, text data analysis, and insight extraction from
textual data.
• Clustering and Segmentation: Similar data points can be grouped together by
machine learning algorithms, which makes segmentation and clustering easier.
This makes it easier to spot unique patterns and subgroups in big datasets.
• Regression Analysis: ML models are used to examine correlations between vari-
ables and provide predictions based on past data, especially regression algorithms.
Understanding and forecasting patterns in large datasets requires this.
• Dimensionality Reduction: Machine learning approaches such as Principal
Component Analysis (PCA) aid in reducing the dimensionality of datasets while
preserving crucial information. This is essential for effectively managing large,
multidimensional data (a minimal sketch follows this list).
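As a minimal sketch of the dimensionality-reduction point above, the following uses scikit-learn’s PCA on synthetic data; the generated matrix simply stands in for a large, multidimensional feature table and is not data from this chapter.

import numpy as np
from sklearn.decomposition import PCA  # pip install scikit-learn

# Synthetic high-dimensional data standing in for a large feature table.
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(1000, 50))

# Keep as many principal components as needed to explain ~95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print("Reduced shape:", X_reduced.shape)
print("Variance explained:", round(pca.explained_variance_ratio_.sum(), 3))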

In conclusion, machine learning enhances big data and data analytics by offering
complex algorithms and methods for finding patterns, automating processes, and
generating predictions, all of which lead to better decision-making.

5 The Interplay of Data Analytics, Big Data, and Machine Learning

Within the ever-changing field of data-driven decision-making, the interaction of Big


Data, Machine Learning, and Data Analytics is the pinnacle of collaboration. A new
era in information utilisation is being ushered in by these three interconnected realms
that strengthen and complement one another. This section explores the complex
relationships that exist between these pillars and demonstrates how innovation and
problem-solving skills are enhanced when these pillars work together. The Interplay
of Data Analytics, Big Data, and Machine Learning are shown in the given below
Fig. 4.
Data Analytics: The Basis for Knowledge: Data analytics is the foundational
layer and the leader of this triad. Its main responsibility is to convert unprocessed data
into useful insights. Patterns, trends, and correlations that are frequently concealed
within the data are revealed through Data Analytics, a methodical process that
involves data inspection, cleansing, modelling, and interpretation. These discov-
eries enable organisations to make well-informed decisions, streamline operations,
and seize fresh opportunities. Data analytics initiates the interaction by laying the
groundwork for ensuing data-driven initiatives. It offers a preliminary interpretation

Fig. 4 Interplay of data analytics, big data, and machine learning



of the data, points out important variables, and aids in formulating the questions that
require investigation. This knowledge is crucial for defining tasks and formulating
problems in the larger context of data analysis.
Big Data: Large-Scale Data Management and Processing: Big Data comes into
play to solve the problem of handling enormous volumes, high velocities, many types,
and the accuracy of data, while Data Analytics sheds light on the possibilities of data.
Often, the processing power of these data avalanches exceeds that of conventional
data management systems. To meet this challenge head-on, big data technologies
like Hadoop, Spark, and NoSQL databases have evolved. They provide the tools
and infrastructure required to handle, store, and process data on a never-before-
seen scale. Big Data processing outcomes, which are frequently aggregated or pre-
processed data, interact with data analytics when they are used as advanced analytics
inputs. Moreover, businesses can benefit from data sources they may have previously
overlooked thanks to the convergence of big data and data analytics. The interaction
improves the ability to make decisions based on data across a wider range.
Machine Learning: Intelligence Automation: While Big Data manages massive
amounts of data and Data Analytics offers insights, Machine Learning elevates the
practice of data-driven decision-making by automating intelligence. Without explicit
programming, machine learning techniques allow systems to learn from data, adjust
to changing circumstances, and make predictions or judgements. Machine Learning is
frequently the final stage in the interaction. It makes use of Big Data’s data processing
power and the insights gleaned from Data Analytics to create prediction models, iden-
tify trends, and provide wise solutions. Machine learning depends on the knowledge
generated and controlled by the first two components to perform tasks like automating
picture identification, detecting fraud, and forecasting client preferences. The key to
bringing the data to life is machine learning, which offers automation and predictive
capability that manual analysis would not be able to provide [17–20].
Within the data science and analytics ecosystem, the interaction between Data
Analytics, Big Data, and Machine Learning is synergistic. Organisations can fully
utilise data when Data Analytics lays the foundation, Big Data supplies the required
infrastructure, and Machine Learning automates intelligence. This convergence
provides a route to innovation, efficiency, and competitiveness across multiple indus-
tries and is at the core of contemporary data-driven decision-making. A thorough
understanding of this interaction is necessary for anyone looking to maximise the
potential of data. The promise of data-driven insights and wise decision-making
is realised when these three domains work harmoniously together. The current
study analysed earlier research on big data analytics and machine learning in
data analytics. Measuring the association between big data analytics keywords and
machine learning terms was the goal. Research articles commonly use data analytics,
big data analytics, and machine learning, as seen in Fig. 5.
From Fig. 5, it is clear that there is a strong correlation between the keywords
used by various data analytics experts and the combination of data, data analytics,
big data, big data analytics and machine learning.

Fig. 5 Most trending keywords in data analytics, big data, and machine learning

6 Challenges and Future Directions

Large-scale dataset analysis presents difficulties in managing data quality, guaran-


teeing correctness, and resolving the difficulties associated with big data processing
and storage. Model interpretability, data labelling, and choosing the best algorithms
for a variety of applications are challenges faced by machine learning. Finding
the right balance between prediction accuracy and computing efficiency is still a
recurring problem at the nexus of big data, machine learning, and data analysis.

6.1 Challenges in Data Analytics

• Data quality and cleaning: ensuring the precision and dependability of data
sources, as well as cleaning and preparing data to get rid of mistakes and
inconsistencies.
• Data security and privacy include preserving data integrity, protecting private
information from breaches, and conforming to privacy laws.
• Data integration is the process of combining information from various forms and
sources to produce a single dataset for analysis.
• Scalability: The ability to manage massive data volumes and make sure data
analytics procedures can expand as data quantities increase.
• Real-time data processing involves data analysis and action in real-time to enable
prompt decision-making and response.
• Complex Data Types: Handling multimedia, text, and other unstructured and semi-
structured data.
• Data Visualisation and Exploration: Producing insightful visualisations and
efficiently examining data to draw conclusions.

Organizations must overcome these challenges if they want to use data analytics
efficiently and get insightful knowledge from their data.

6.2 Big Data Challenges

• Lack of Knowledge and Awareness: Big data projects may fail because many
businesses don’t have a basic understanding of the technology or the advantages it
might offer. It is not uncommon to allocate time and resources inefficiently to new
technology, such as big data. Employee reluctance to embrace new procedures
results from their frequent ignorance of the true usefulness of big data, which can
seriously impair business operations.
• Data Quality Management: A major obstacle to data integration is the variety
of data sources (call centres, website logs, social media) that produce data in
different formats, which makes integration difficult. Furthermore, gathering huge
data with 100% accuracy is a difficult task. It is imperative to ensure that only
trustworthy data is gathered, as inaccurate or redundant information might make
the data useless for your company. Developing a well-organised big data model is
necessary to improve the quality of data. To find and combine duplicate records
and increase the big data model’s correctness and dependability, extensive data
comparison is also necessary.
• High Cost: Big data project implementation is frequently very expensive for busi-
nesses. If you choose an on-premises solution, you will have to pay developers
and administrators in addition to spending money on new hardware. Even if a lot
of frameworks are open source, there are still costs associated with setup, mainte-
nance, configuration, and project development. On the other hand, a cloud-based
solution necessitates hiring qualified staff for product development and paying
for cloud services. The costs of both solutions are high. Businesses can think
about an on-premises solution for increased security or a cloud-based solution
for scalability when attempting to strike a compromise between flexibility and
security. Some businesses utilise hybrid solutions, keeping sensitive data on-site
while processing it using cloud computing power—a financially sensible option
for some businesses.
• Security Vulnerabilities: Putting big data solutions into practice may leave your
network vulnerable to security flaws. Regrettably, it can be foolish for businesses
to ignore security when they first start big data projects. Although big data tech-
nology is always developing, some security elements are still missing. Prioritising
and improving security measures is crucial for big data ventures.
• Scalability: Big data’s fundamental quality is its ongoing expansion over time,
which is both a major benefit and a challenge. Although many businesses make
an effort to remedy this by increasing processing and storage capacity, budgetary
restrictions make it difficult to scale without experiencing performance degra-
dation. An architectural foundation with structure is necessary to overcome this
difficulty. Scalability is guaranteed by a strong architecture, which also minimises

numerous possible problems. Future upscaling should be accounted for naturally


in algorithm design. Upscaling must be managed carefully if system support and
maintenance are to be planned for. Frequent performance monitoring facilitates a
more fluid upscaling process by quickly detecting and resolving system flaws.

For most organisations, big data is very important since it makes it possible to
efficiently gather and analyse the vital information needed to make well-informed
decisions. Still, there are a number of issues that need to be resolved. Putting in
place a solid architectural framework offers a solid starting point for methodically
addressing these problems.

6.3 Machine Learning Challenges

Big data processing and data analysis present a number of difficulties for machine
learning.
• Quantity and Quality of Data: Machine learning model performance is highly
dependent on the quality and quantity of data. Results might be skewed by noisy or
incomplete data, and processing huge datasets can be computationally demanding.
Mitigation: Using strong data preparation methods and making sure the datasets
are diverse and of good quality improves the accuracy of the model [21].
• Computing Capabilities: Processing large amounts of data in big data settings
requires a significant amount of processing power. Complex machine learning
models might be difficult to train and implement due to resource constraints. Big
data’s computational hurdles can be lessened with the use of distributed computing
frameworks like Apache Spark and cloud computing solutions [21] (a minimal Spark sketch is shown after this list).
• Algorithm Selection: With so many different machine learning algorithms
available, it might be difficult to select the best one for a particular task.
Inappropriate algorithm selection could lead to less-than-ideal performance.
Making well-informed decisions is facilitated by carrying out exhaustive model
selection trials and comprehending the features of various algorithms [21].
• Instantaneous Processing: In order to make timely decisions, many applications
need to process data in real time. In certain situations, traditional machine learning
models might not be the best fit. To mitigate the issues related to time-sensitive
applications, online learning techniques and real-time processing-optimised
models can be used [22].
• Explainability and Interpretability: Interpretability is frequently a problem
with machine learning models, particularly those that are sophisticated like deep
neural networks. It’s critical to comprehend the thinking behind model selections,
especially in delicate areas. To improve comprehension, interpretable models
should be created, simpler algorithms should be used whenever possible, and
model explanation techniques should be included [22].
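To illustrate the distributed-computing mitigation mentioned in the computing-capabilities item above, here is a minimal PySpark sketch; the CSV file and column names are hypothetical, and on a cluster the same code scales out without modification.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # pip install pyspark

# A local session for illustration; on a cluster the builder picks up the master URL.
spark = SparkSession.builder.appName("big-data-preprocessing").getOrCreate()

# Hypothetical CSV of transaction records.
df = spark.read.csv("transactions.csv", header=True, inferSchema=True)

# Distributed cleaning and aggregation: drop missing amounts, then summarise per customer.
summary = (df.dropna(subset=["amount"])
             .groupBy("customer_id")
             .agg(F.count("*").alias("n_transactions"),
                  F.avg("amount").alias("avg_amount")))

summary.show(5)
spark.stop()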

A multidisciplinary strategy integrating domain-specific knowledge, machine


learning algorithms, and data engineering experience is needed to navigate these
hurdles in machine learning for data analysis and big data handling. Ongoing research
and technological developments help to reduce these difficulties and improve
machine learning systems’ capacities.

6.4 Future Directions

In data analysis, big data, and machine learning, future directions include improving
interpretability, addressing ethical issues, investigating new uses in various fields, and
improving algorithms for better predictive performance in dynamic and complicated
datasets. Limitations of this work are interpreted as future directions.
• AI Integration: As machine learning and data analytics become more integrated
with artificial intelligence (AI), more sophisticated and self-governing data-driven
decision-making processes will be possible.

7 Ethical AI and Bias Mitigation

• Due to legal constraints and public expectations, there will be a growing emphasis
on ethical AI and mitigating bias in machine learning algorithms.
• Explainable AI: Clear and understandable AI models are becoming more and
more necessary, especially in fields like finance and healthcare where it’s critical
to comprehend the decisions made by the algorithms.
• Edge and IoT Analytics: As IoT devices, edge computing, and 5G technologies
proliferate, real-time processing at the network’s edge will become more important
in data analytics, facilitating quick insights and decision-making.
• Quantum Computing: As this technology develops, it will provide new avenues
for tackling hitherto unsolvable complicated data analytics and machine learning
issues.
• Data Security and Privacy: As the globe becomes more networked and rules
become stronger, there will be a greater emphasis on data security and privacy.
• Experiential analytics: Businesses will utilise data analytics to personalise
marketing campaigns, products, and services, ultimately improving consumer
experiences.
• Automated Machine Learning (AutoML): As AutoML platforms and tech-
nologies proliferate, machine learning will become more widely available and
accessible to a wider range of users, democratizing the field.

Any company that wants to embrace big data must have board members who are
knowledgeable about its foundations. Businesses may help close this knowledge gap
by providing workshops and training sessions for staff members, making sure they

understand the benefits of big data. While investing in staff development in this way is a
good strategy, it can have a short-term effect on productivity.

7 Conclusion

This chapter has provided a basic grasp of the interrelated fields of Data Analytics,
Big Data, and Machine Learning. It draws attention to their increasing
significance in a variety of industries and their capacity to change how decisions
are made. For anyone wishing to venture into the realm of data-driven insights and
innovation, the discussion of essential ideas, terminology, and practical implementations acts
as a springboard. To remain at the vanguard of this data-driven era, practitioners in these sectors
must continue to learn and adapt, because their dynamic and ever-evolving nature
brings a multitude of opportunities as well as problems. The exploration of
Data Analytics, Big Data, and Machine Learning is expected to yield significant
benefits as we explore uncharted territories of knowledge and seize the opportunity
to influence a future replete with data.

References

1. Sivarajah, U., Kamal, M.M., Irani, Z., Weerakkody, V.: Critical analysis of big data challenges
and analytical methods. J. Bus. Res. 70, 263–286 (2017)
2. Lavalle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N.: Big data, analytics and
the path from insights to value. MIT Sloan Manag. Rev. 52(2), 3–22 (2010)
3. Jaseena, K.U., David, J.M.: Issues, challenges, and solutions: big data mining. CS & IT-CSCP
4(13), 131–140 (2014)
4. Sun, Z.H., Sun, L.Z., Strang, K.: Big data analytics services for enhancing business intelligence.
J. Comput. Inf. Syst. 58(2), 162–169 (2018)
5. Debortoli, S., Muller, O., vom Brocke, J.: Comparing business intelligence and big data skills.
Bus. Inf. Syst. Eng. 6(5), 289–300 (2014)
6. Sarkar, B.K.: Big data for secure healthcare system: a conceptual design. Complex Intell. Syst.
3(2), 133–151 (2017)
7. Zakir, J., Seymour, T., Berg, K.: Big data analytics. Issues Inf. Syst. 16(2), 81–90 (2015)
8. Raja, R., Mukherjee, I., Sarkar, B.K.: A systematic review of healthcare big data. Sci. Program.
2020, 5471849 (2020)
9. Tsai, C.W., Lai, C.F., Chao, H.C., Vasilakos, A.V.: Big data analytics: a survey. J. Big Data
2(1), 21 (2015)
10. Website. https://www.news.microsoft.com/europe/2016/04/20/go-bigger-with-big-data/sm.
0008u654e19yueh0qs514ckroeww1/XmqRHQB1Gcmde4yb.97. Accessed 15 June 2017
11. McAfee, A., Brynjolfsson, E.: Big data: the management revolution. Harv. Bus. Rev. 90(10)
60–66, 68, 128 (2012)
12. Chen, M., Hao, Y.X., Hwang, K., Wang, L., Wang, L.: Disease prediction by machine learning
over big data from healthcare communities. IEEE Access 5, 8869–8879 (2017)
13. Zuo, R.G., Xiong, Y.H.: Big data analytics of identifying geochemical anomalies supported by
machine learning methods. Nat. Resour. Res. 27(1), 5–13 (2018)

14. Zhang, C.T., Zhang, H.X., Qiao, J.P., Yuan, D.F., Zhang, M.G.: Deep transfer learning for intel-
ligent cellular traffic prediction based on cross-domain big data. IEEE J. Sel. Areas Commun.
37(6), 1389–1401 (2019)
15. Triantafyllidou, D., Nousi, P., Tefas, A.: Fast deep convolutional face detection in the wild
exploiting hard sample mining. Big Data Res. 11, 65–76 (2018)
16. Singh, Y.B., Mishra, A.D., Nand, P.: Use of machine learning in the area of image analysis and
processing. In: 2018 International Conference on Advances in Computing, Communication
Control and Networking (ICACCCN), pp. 117–120. IEEE (2018)
17. Singh, Y.B.: Designing an efficient algorithm for recognition of human emotions through
speech. PhD diss., Bennett University (2022)
18. Nallaperuma, D., Nawaratne, R., Bandaragoda, T., Adikari, A., Nguyen, S., Kempitiya, T., De
Silva, D., Alahakoon, D., Pothuhera, D.: Online incremental machine learning platform for
big data-driven smart traffic management. IEEE Trans. Intell. Transp. Syst. 20(12), 4679–4690
(2019)
19. Xian, G.M.: Parallel machine learning algorithm using fine-grained-mode spark on a mesos
big data cloud computing software framework for mobile robotic intelligent fault recognition.
IEEE Access 8, 131885–131900 (2020)
20. Li, M.Y., Liu, Z.Q., Shi, X.H., Jin, H.: ATCS: Auto-tuning configurations of big data
frameworks based on generative adversarial nets. IEEE Access 8, 50485–50496 (2020)
21. Mishra, A.D., Singh, Y.B.: Big data analytics for security and privacy challenges. 2016 Interna-
tional Conference on Computing, Communication and Automation (ICCCA), Greater Noida,
India, pp. 50–53. (2016). https://doi.org/10.1109/CCAA.2016.7813688
22. The 2 types of data strategies every company needs. In: Harvard Business Review, 01 May
2017. https://hbr.org/2017/05/whats-your-data-strategy. Accessed 18 June 2017
Fundamentals of Data Analytics
and Lifecycle

Ritu Sharma and Payal Garg

Abstract This chapter gives a brief overview of the fundamentals and lifecycle of
data analytics. It surveys data analytics systems, which form a foundation for the
present stage of technology. The chapter also details widely used tools such as
Power BI and Tableau that are employed in developing data analytics systems.
Traditional analysis differs from big data analysis in terms of the volume and variety
of data processed. To meet these requirements, various stages are needed to organize
the activities involved in the acquisition, processing, reuse, and analysis of
the given data. The lifecycle for data analysis helps to manage and organize the
tasks connected to big data research and analysis. The evolution of data analytics, with big
data analytics, SQL analytics, and business analytics, is explained. Furthermore, the
chapter outlines the future of data analytics by leveraging its fundamental lifecycle
and elucidates various data analytics tools.

1 Introduction

In the field of Data Science, Data Analytics is the key component used for analysing
data in order to bring out information that solves issues [1] and supports problem-solving
across different domains and industries [2]. Before moving ahead, we should understand
the keywords data and analytics, from which the term data analytics is formed, as shown in
Fig. 1.
Data analytics is the process of examining, cleaning, transforming, and inter-
preting data to discover valuable insights, patterns, and trends that can inform
decision-making [3]. It plays a crucial role in a wide range of fields, including busi-
ness, science, healthcare, and more. Whenever data analytics is discussed, we hear

R. Sharma
Department of Computer Science and Engineering, Ajay Kumar Garg Engineering College,
Ghaziabad, India
P. Garg (B)
Department of Computer Science and Engineering, GL Bajaj Institute of Technology and
Management, Greater Noida, India
e-mail: payalgarg.cs@gmail.com


Fig. 1 Formation of data analytics

about data analysis; the two terms are often used interchangeably, although data
analytics refers more broadly to the techniques and tools used to carry out data analysis. In other
words, data analysis is a subset of data analytics that looks
after data cleansing, examination, transformation, and modeling in order to reach
conclusions.

2 Fundamental of Data Analytics

Over the past years, analytics has been very helpful to various projects by providing answers
to questions such as the following [4]:
• What is going to happen next?
• Why did it happen?
• How did it happen?
Analytics is the process of drawing inferences from data with the help of methods and systems. In
other words, it is the process of turning data into information.

2.1 Types of Analytics

Traditional data analytics refers to the conventional methods and techniques used
to analyze and interpret data to extract insights and support decision-making [5]. In
traditional data analytics, Excel, tables, charts, graphs, hypothesis testing, and basic
statistical measures are used for analysis. Dashboards made in this way are
static in nature and cannot adapt to new changes in the business.
We discussed how data analysis and data analytics are used interchangeably and
what data analysis looks after. When we talk about data analytics, it is broadly
divided into the following three categories:
1. Descriptive analytics—It describes what has happened at a given instance of
time.
2. Predictive analytics—It estimates the likelihood of future events.
3. Prescriptive analytics—It suggests actions to take in order to reach the desired
outcomes.

Fig. 2 Descriptive analytics key characteristics

2.1.1 Descriptive Analytics

As the name suggests, descriptive analytics describes the data in a way
that can be easily understood. Most scenarios cover past, present,
or historical data, and the resulting summaries often serve as a basis for examining future outcomes.
In descriptive analytics, statistical measures such as percentages,
sums, and averages are used. The key characteristics of descriptive analytics are shown in Fig. 2.
Examples include sales reports, financial statements, inventory analysis, etc.

2.1.2 Predictive Analytics

Predictive analytics [6] is a form of probability analysis that helps to predict future
events and supports forthcoming decision-making.
It is becoming important for business organizations that are looking
to gain an edge in this competitive environment by obtaining predictions of future trends and
using them to make data-driven decisions. The key characteristics of this type of analytics are
shown in Fig. 3. Examples include sales forecasting, credit risk assessment, predictive
maintenance, demand forecasting, etc.

Fig. 3 Predictive analytics key characteristics

Fig. 4 Prescriptive analytics key characteristics

2.1.3 Prescriptive Analytics

Prescriptive analytics is another type of analytics that suggests actions to support
decision-making processes. It builds on both descriptive and predictive analytics to
produce its recommendations. Its key characteristics are
depicted in Fig. 4. Example application areas include healthcare, marketing, financial services, trans-
portation and logistics, etc. Figure 5 summarizes the types of data analytics, providing a description
and examples for each.

Fig. 5 Summary on types of analytics

2.2 Types of Data

Data can be any facts (binary, text), figures, audio, or video used for analysis. No
valuable information is obtained until analytics is performed on data [7]. In day-to-
day life, the general public is highly dependent on devices. For example, people use map
applications to reach a place, and those maps use GPS navigation to find the shortest
route to a particular point. This is only possible when analytics is
done on data covering the different landmarks of the city and the roads connecting
them. While carrying out such analytics, data can be classified into
three types [8]:
1. Categorical data
2. Numerical data
3. Ordinal data

2.2.1 Categorical Data

Categorical data, also referred to as nominal or qualitative data, is not associated with
a natural order or a numerical value. It is
used to group observations into sets or classes. Some examples include:
• Marital Status: includes “single”, “married”, etc.
• Colors: include “red”, “green”, etc.
• Gender: includes “female”, “male”, etc.
• Education: includes “high level”, “bachelors”, etc.

Table 1 Example of categorical data


IP address Class Modified class
172.16.254.1 Ipv4 0
2001:0db8:85a3:0000:0000:8a2e:0370:7334 Ipv6 1
172.14.251.1 Ipv4 0
172.16.245.2 Ipv4 0
2001:0db6:85a3:0000:0000:8a3e:0270:7324 Ipv6 1

Table 1 shows an example of categorical data, depicting IP addresses and the
class each belongs to. Two classes, IPv4 and IPv6, are mentioned; because such labels still
cannot be used directly for classification, IPv4 is encoded as class 0 and IPv6 is
encoded as class 1, as sketched below.
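The following is a minimal pandas sketch of such an encoding; the column names mirror Table 1 but are illustrative only, and the mapping dictionary is an assumption for this example.

import pandas as pd

df = pd.DataFrame({
    "ip_address": ["172.16.254.1", "2001:0db8::7334", "172.14.251.1"],
    "ip_class":   ["IPv4", "IPv6", "IPv4"],
})

# Map each category to an integer code (IPv4 -> 0, IPv6 -> 1), as in Table 1.
df["modified_class"] = df["ip_class"].map({"IPv4": 0, "IPv6": 1})

# An equivalent, more general approach lets pandas assign the integer codes.
df["modified_class_auto"] = df["ip_class"].astype("category").cat.codes
print(df)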

2.2.2 Numerical Data

Numerical data, also referred to as quantitative data, consists of numbers that can be
measured and used in mathematical calculations and analysis. Some examples of
numerical data are discrete data, continuous data, interval data, ratio data, etc.

2.2.3 Ordinal Data

Ordinal data is a combination of numerical and categorical data: it consists of
categorical values that carry a meaningful numerical order. The characteristics of ordinal data are given in
Fig. 6.

Fig. 6 Key characteristics of ordinal data



3 Data Analytical Architecture

Data analytical architecture, also known as data processing architecture or data
analytics architecture, refers to the design and structure of the systems and technolo-
gies that organizations use to collect, store,
process, and analyze data in order to drive business outcomes and reach decisions. In such an architecture, the data
sources being collected and stored, the tools used for processing the
stored data, and reporting are the essential parts. An example architecture is given in Fig. 7.

Fig. 7 Classical analytical architecture
The major components of classic analytical architecture are:
• Data originator,
• Data depository,
• Reports,
• User
• Applications.
In the beginning, data is accumulated from every originating source and
can be in the form of categorical, ordinal, or numerical data. This input
forms various source databases such as sales, financial, and others. Data
warehouses are the place where all the data is stored and used by all the applications
and users for reporting. Different ETL (Extract, Transform, and Load) tools are
applied to the data warehouse to extract the data. These tools collect input in its raw
form and convert it according to the needs of the study.
Once the data is available it can be analyzed; a relational database system and SQL are
used to extract essential insights from the input. For example, if we need to find the
purchases for March, we can write the query in SQL as given below:




SELECT purchase FROM sales_information WHERE month = 'March';


At the end of the data analytics architecture, dashboards, reports, and alerts
are delivered to the applications and users. The applications update their dash-
boards based on the analysis performed in the earlier steps. Notifications
are sent to each user’s laptop, tablet, or smartphone, and applications notify the user
when the analysis is complete. The alert’s feedback may influence the users’ decision
to take action or not. The architecture thus facilitates the analysis and decision-making
process in a predefined way.
Nonetheless, the present architecture encounters several common obstacles, which are
discussed as follows:
• There might not be a restriction on the number or format of the data sources. Thus,
the issue of managing a large amount of data emerges if the data sources have
different backgrounds. In certain situations, a standard data warehouse solution
might not be appropriate. Using a data warehouse to centralize data storage can
result in backup and failure problems. Because it will be governed by a single
point, it prevents data scientists and application developers from exploring the
data iteratively.
• Some key data is required by many applications for reporting purposes and might
be operating around the clock. The data warehouse’s central point malfunctions
in these situations.
• For every one of the various data sources that have been gathered, local data
warehouse solutions can be one means of getting around the centralized structure.
This method’s primary flaw is that “data schema” must be updated for every new
data source that is added.
A modern data analytics architecture is a crucial asset that typically involves multiple layers and
components to tackle the different aspects of data analytics, which organizations use to collect,
store, process, and analyze data for gaining information, making decisions, and driving busi-
ness outcomes. The key components within a data analytical architecture are as follows (shown in Fig. 8):

1. Data Sources—Data sources, also referred to as data originators, are where data
originates. Sources can include data warehouses, external APIs, IoT
devices, databases, and many more. Data can be structured, unstructured, or
semi-structured.
2. Data Ingestion—Actions performed to collect and import data or input from
different sources into a central warehouse. ETL (Extract, Transform, Load)
processes and some tools are used for this purpose.
3. Data Storage—Data is stored according to the form that allows for analysis and
efficient retrieval. Data storage solutions include data warehouses, data lakes,
and NoSQL databases.

Fig. 8 Key components involved in data analytical architecture

4. Data Processing—This layer is responsible for cleaning, transforming, and
refining the raw data so that it is prepared for analysis. Data integration
tools play a role here (a minimal ingestion-and-processing sketch follows this list).
5. Data Modelling—Data models are created which represent data in a way that is
optimized for analysis and querying. It also involves creating snowflake or star
schemas in a data warehouse.
6. Analytics Tools—The tools and platforms used by data scientists and analysts to
query, visualize, and gain insights from the data. Popular tools include business
intelligence (BI) tools, data visualization tools, and machine learning platforms. We
will discuss BI later in this chapter.
7. Cloud Services—Nowadays, organizations take advantage of cloud
computing services to build and scale their data platforms. Various providers, such as
AWS, Google Cloud, and Azure, offer a range of data-related services.
8. Real-Time Processing—Some functions or use cases require real-time data
analysis. This involves technologies such as stream processing frameworks.
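To make the ingestion, processing, and storage components more concrete, the following is a minimal, hypothetical ETL sketch in Python with pandas; the file name sales.csv, the column names, and the SQLite database are placeholders, not part of any specific architecture described above.

import sqlite3
import pandas as pd

# Extract: read raw data from a source file.
raw = pd.read_csv("sales.csv")                       # assumed columns: date, region, amount

# Transform: fix types and aggregate to the shape the analysis needs.
raw["date"] = pd.to_datetime(raw["date"])
monthly = (raw.assign(month=raw["date"].dt.to_period("M").astype(str))
              .groupby(["month", "region"], as_index=False)["amount"].sum())

# Load: store the transformed data where reporting tools can query it.
with sqlite3.connect("analytics.db") as conn:
    monthly.to_sql("monthly_sales", conn, if_exists="replace", index=False)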

The above components of the architecture are specific to each setting and vary based on the
needs of the organization, the variety and volume of its data, and its analytical requirements.
Organizations may expand the architecture over time according to their needs [9].

4 Data Analytics Lifecycle

Traditional projects and projects which include data analysis are different: data
analytics projects require much more inspection [9]. The data
analytics lifecycle therefore includes a number of steps or stages which organizations and
professionals follow to extract valuable outcomes and knowledge from the given
data [10]. It encompasses the whole procedure of collecting, cleaning, analyzing,
and interpreting data to make decisions [11]. Although there can be variations in some
specific stages and their names, the following are the key steps of the data analytics
lifecycle (shown in Fig. 9):

1. Problem definition
2. Data collection
3. Data cleaning
4. Data exploration
5. Data transformation
6. Visualization and Reporting

Fig. 9 Lifecycle of data analytics



4.1 Problem Definition

Problem definition is the most important aspect
of every process, and in the data analysis process it is a crucial first step.
A well-defined problem statement guides the
analysis and confirms that you are answering the right questions. The major
points that need to be kept in mind while defining the problem in data analytics are shown
in Fig. 10.
The significance of this step is paramount as it gives the foundation for the entire
analytical process. Defining the problem helps to ensure that the data analysis effort is
focused, relevant, and valuable. There are several reasons which make this step crucial
in data analysis, such as clarity and precision, scope and boundaries, goal alignment,
optimization, hypothesis formulation, data relevance, decision-making, and communication.

Fig. 10 Major elements in problem definition

In summary, we can say that this step is not just preliminary; it is a critical aspect that
influences the entire process. Also, it ensures that the analysis will be purposeful,
relevant, and aligned with the goals of the organization.

4.2 Data Collection

Data collection is an essential step in the process of data analytics: it gathers and
obtains data from different sources for use in analysis. Efficient data collection
is crucial to ensure that the data we use for analysis is reliable, accurate, and relevant to the
problem. Some of the key aspects to consider in data collection are shown in Fig. 11.
Data collection is also a fundamental aspect of the data analysis
process for data analysts. The significance of data collection lies in its role as the
starting point. Several reasons highlight the importance of data collection, such
as its role as the basis for analysis (data collection provides the raw material that analysts use to
derive insights), informed decision-making, identifying trends and patterns, model
training and validation, understanding stakeholders’ needs, risk management, bench-
marking, etc. We can say that data collection is the cornerstone of effective data anal-
ysis. The quality, relevance, and completeness of the collected data directly impact
the insights. A thoughtful and systematic approach to data collection is essential for
achieving meaningful results.

Fig. 11 Data collection key aspects

4.3 Data Cleaning

Data cleaning includes identifying and correcting inconsistencies or errors in the data
to make sure that the data is accurate, complete, and ready for analysis [7].
It is an iterative process that can require multiple passes, and it is needed
to make sure that the analysis is based on reliable, high-quality data, which gives us
more accurate and meaningful insights [12]. The various tasks involved are
given below, and a short pandas sketch of several of them follows the list:
1. Handling Missing Data
In this step, we deal with missing values, for example by removing the rows
that contain them, depending on the context.
2. Outlier Detection and Treatment
Here, we are talking about data points that deviate from the typical patterns in our
dataset; we can identify them and then remove them, or transform them to fall
within an acceptable range.
3. Data Type Conversion
In this step, we make sure that data types are correctly assigned to each
column. Sometimes data types are incorrect or not in the same
format, so we need to convert them into the format required for analysis.
4. Duplicate Data
In this step, we make sure that there is no redundant or duplicate data;
if there is, we need to remove the duplicate rows to avoid
double counting in the analysis.
5. Text Cleaning
If the data includes text, we need to clean and
preprocess it. For example, if there are special characters present in the data
we need to remove them, or the text may need to be converted to lowercase,
etc.
6. Data Transformation
Data transformation includes converting units, aggregating the data, and
creating new variables from the existing variables.
7. Addressing Inconsistent Date and Time Formats
Dates and times can be stored in various formats, so we need to standardize
them for consistency and analysis.
8. Domain-Specific Cleaning
We clean the data depending on the specific domain and the data
sources we receive. For example, financial data
and healthcare data may require domain-specific cleaning.
9. Handling Inconsistent Data Entry
Here we handle data entry errors such as typos and inconsistent
formats.
10. Data Versioning and Documentation
Here we keep track of data changes and document the cleaning
process to maintain data integrity and transparency.
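The following is a minimal pandas sketch covering several of the tasks above (duplicates, missing values, text cleaning, type conversion, dates, and a crude outlier filter); the toy DataFrame and its column names are hypothetical.

import pandas as pd

df = pd.DataFrame({
    "customer": ["Ann", "Ann", "bob ", "Cara", None],
    "spend": ["100", "100", "250", "1000000", "75"],
    "signup_date": ["2023-01-05", "2023-01-05", "2023-02-05", "2023-03-01", "2023-04-01"],
})

df = df.drop_duplicates()                                # remove exact duplicate rows
df = df.dropna(subset=["customer"])                      # handle missing values (here: drop the row)
df["customer"] = df["customer"].str.strip().str.lower()  # text cleaning: trim spaces and lowercase
df["spend"] = pd.to_numeric(df["spend"])                 # data type conversion to numbers
df["signup_date"] = pd.to_datetime(df["signup_date"])    # standardize dates as datetime values
df = df[df["spend"] < 10_000]                            # crude outlier filter using a domain threshold
print(df)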
Data cleaning, also known as data scrubbing, is a key step in the data analysis
process. Its significance lies in the fact that the quality of the analysis and the reli-
ability of the insights derived from the data heavily depend on the cleanliness and
integrity of the data. There are several key reasons why data cleaning is essential for
data analysts: accuracy of analysis, data integrity, consistency, improved model
performance, enhanced data quality, missing-data handling, more effec-
tive visualization, reduced bias, saved time and resources, improved decision-
making, and enhanced collaboration. In conclusion, it ensures that the data used for
analysis is accurate, reliable, and free from errors, ultimately leading to more robust
and trustworthy insights.

4.4 Data Exploration

Data exploration involves gaining an in-depth understanding of the data through
examining the data, summarizing it, visualizing it, and other techniques. The major goal
of this step is to gain insights into the characteristics of the data, identify patterns
and relationships, and prepare the data for further analysis. Some of the key steps
and techniques involved in it are mentioned in Fig. 12.
Data exploration holds significant importance for data analysts. It helps in gaining
a deep understanding of the data set which involves key characteristics, identifying
patterns, and exploring the basic statistics of the data. It helps analysts uncover
insights, assess data quality, and make informed decisions, ultimately leading to
more accurate and reliable results.

Fig. 12 Key and techniques used in data exploration [13]
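The following is a minimal pandas sketch of such an exploration; the small DataFrame and its columns are made up for illustration.

import pandas as pd

df = pd.DataFrame({
    "region": ["north", "south", "north", "east", "south", "north"],
    "units":  [12, 7, 15, 4, 9, 11],
    "price":  [9.5, 10.0, 9.0, 12.5, 10.5, 9.8],
})

print(df.shape)                               # size of the dataset
print(df.describe())                          # summary statistics for the numeric columns
print(df["region"].value_counts())            # frequency of each category
print(df[["units", "price"]].corr())          # correlation between numeric columns
print(df.groupby("region")["units"].mean())   # a simple pattern: average units per region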

4.5 Data Transformation

Data transformation makes data more suitable for analysis by converting, structuring,
and cleaning it, which helps to make sure that the data is of the right format and
quality and makes it easier to extract patterns and useful insights. It is
a necessary step because real-world data is often messy and heterogeneous. The
quality and effectiveness of the analysis depend on how the data is transformed and
prepared. There are various operations used in data transformation, some of which
are explained in Fig. 13.
Data transformation helps in normalizing data, making it comparable and consis-
tent. It can be used to address skewed distributions, making the data more symmet-
rical and helping it meet the assumptions of certain statistical models, which in turn
improves model performance. In summary, we can say that
data transformation helps to prepare data that is more suitable for various analytical
techniques.

Fig. 13 Operation in data transformation [15]
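As a minimal sketch of two of these operations, the snippet below applies a log transform to reduce skew and then scales the values to a common range; it assumes NumPy and scikit-learn, and the income figures are invented.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

income = np.array([[20_000.0], [35_000.0], [48_000.0], [1_200_000.0]])  # right-skewed values

log_income = np.log1p(income)                       # log transform compresses the long tail
scaled = MinMaxScaler().fit_transform(log_income)   # normalize to a comparable 0-1 range

print(np.round(log_income.ravel(), 2))
print(np.round(scaled.ravel(), 2))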

4.6 Visualization and Reporting

Visualization and reporting are critical components of data analytics, as they
help analysts and stakeholders make sense of data, identify trends, draw insights,
and make data-driven decisions [14]. An overview of visualization and
reporting is given in Table 2.
Visualization and reporting provide valuable tools for communicating insights and
findings to both technical and non-technical audiences [16]. Visualization transforms
complex data sets into understandable and interpretable visuals which makes it easier
for stakeholders to grasp insights. Reporting allows for the creation of a narrative
around the data which highlights key findings and trends.
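As a minimal sketch, the snippet below builds one of the chart types discussed here with Matplotlib (one of the tools listed in Table 2); the monthly sales figures are invented for illustration.

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 150, 95, 180]

fig, ax = plt.subplots()
ax.bar(months, sales)                      # bar chart: compare categories
ax.set_xlabel("Month")                     # label axes clearly (a best practice)
ax.set_ylabel("Units sold")
ax.set_title("Monthly sales overview")
fig.savefig("monthly_sales.png")           # export the chart for a report or dashboard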

Table 2 Overview of visualization and reporting


S.No Name Description
1 Data visualization It is the process of representing data
graphically to facilitate understanding which
includes different types of charts, graphs, and
diagrams. Some techniques are-
• Bar charts-used to compare categories
• Line chart-used to see trends and changes
over a period of time
• Pie charts display parts of the whole
• Scatter plots -show relationships between
two variables
• Histograms-Display data distribution,
And many more
2 Dashboards They are the collection of visualizations and
reports on a single page or screen. It can also
provide a real-time overview making KPIs and
metrics which helps stakeholders to monitor
and see the situation at a glance
3 Reporting It involves the documentation of all the
observations from the data analysis, which
may include from the text descriptions, charts
and tables
4 Tools for visualization and reporting There are several tools available for
visualization and reporting, including-
• Tableau: A popular tool for creating
interactive and shareable dashboards
• Power BI: A Microsoft product for data
visualization and business intelligence
• Google Data Studio: A free tool for creating
interactive reports and dashboards
• Python (Matplotlib, Seaborn, Plotly):
Libraries for creating custom visualizations
• Excel: A widely used tool for basic data
analysis and reporting
• R: A programming language with packages
for advanced data visualization and reporting
5 Best practices When creating visualizations and reports,
consider best practices such as:
• Choosing the right chart type for the data
• Keeping visuals simple and uncluttered
• Labeling axes and data points clearly
• Providing context and explanations
• Ensuring that the design is user-friendly
• Consistently updating dashboards and
reports as new data becomes available

5 Conclusion

In conclusion, we can say that data analytics is a powerful approach for extracting mean-
ingful insights from data sets, providing valuable information for decision-making
and problem-solving [17]. Its fundamentals and lifecycle play an important role in
ensuring the success of analytical initiatives. It is essential for businesses and organi-
zations seeking a competitive edge in the current world, as it enables informed decision-
making by uncovering patterns, trends, and correlations within large datasets. A well-
executed data analytics process can lead to improved efficiency, better customer
insights, and a competitive advantage in today’s data-driven landscape.

References

1. Kumar, M., Tiwari, S., Chauhan, S.S.: Importance of big data mining: (tools, techniques). J.
Big Data Technol. Bus. Anal. 1(2), 32–36 (2022)
2. Singh, P., Singh, N., Luxmi, P.R., Saxena, A.: Artificial intelligence for smart data storage in
cloud-based IoT. In: Transforming Management with AI, Big-Data, and IoT, pp. 1–15. Springer
International Publishing, Cham (2022)
3. Abdul-Jabbar, S., Farhan, A.: Data analytics and techniques: a review. ARO-Sci. J. Koya Univ.
10, 45–55 (2022). https://doi.org/10.14500/aro.10975
4. Erl, T., Khattak, W., Buhler, P.: Big Data Fundamentals: Concepts, Drivers & Techniques.
Pearson. Part of the The Pearson Service Technology Series from Thomas Erl series (2016)
5. Sharda, R., Asamoah, D., Ponna, N.: Business analytics: research and teaching perspectives.
In: Proceedings of the International Conference on Information Technology Interfaces, ITI,
pp. 19–27 (2013). https://doi.org/10.2498/iti.2013.0589
6. Lepenioti, K., Bousdekis, A., Apostolou, D., Mentzas, G.: Prescriptive analytics: literature
review and research challenges. Int. J. Inf. Manag. 50, 57–70 (2020). https://doi.org/10.1016/
j.ijinfomgt.2019.04.003
7. Kumar, M., Tiwari, S., Chauhan, S.S.: A review: importance of big data in healthcare and its
key features. J. Innov. Data Sci. Big Data 1(2), 1–7 (2022)
8. Durgesh, S.: A narrative review on types of data and scales of measurement: an initial step in the
statistical analysis of medical data. Cancer Res. Stat. Treat. 6(2), 279–283 (2023, April–June).
https://doi.org/10.4103/crst.crst_1_23

9. Sivarajah, U., Mustafa Kamal, M., Irani, Z., Weerakkody, V.: Critical analysis of big data
challenges and analytical methods. J. Bus. Res. 70, 263–286 (2017). https://doi.org/10.1016/j.
jbusres.2016.08.001
10. Rahul, K., Banyal, R.K.: Data life cycle management in big data analytics. In: Inter-
national Conference on Smart Sustainable Intelligent Computing and Applications Under
ICITETM2020 (2020). Elsevier
11. Watson, H., Rivard, E.: The analytics life cycle a deep dive into the analytics life cycle. 26,
5–14 (2022)
12. Ridzuan, F., Zainon, W.M.N.: A review on data cleansing methods for big data. Procedia
Comput. Sci. 161, 731–738 (2019). https://doi.org/10.1016/j.procs.2019.11.177
13. Idreos, S., Papaemmanouil, O., Chaudhuri, S.: Overview of data exploration techniques. In:
Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data,
pp. 277–281 (2015). https://doi.org/10.1145/2723372.2731084
14. Roden, S., Nucciarelli, A., Li, F., Graham, G.: Big data and the transformation of operations
models: a framework and a new research agenda. Prod. Plan. Control 28(11–12), 929–944
(2017). https://doi.org/10.1080/09537287.2017.1336792
15. Maheshwari, K.A.: Data Analytics Made Accessible (2015)
16. Abdul-Jabbar, S.S., Farhan, A.K.: Data analytics and techniques: a review. ARO-Sci. J. Koya
Univ. (2022)
17. Manisha, R.G.: Data modeling and data analytics lifecycle. Int. J. Adv. Res. Sci., Commun.
Technol. (IJARSCT) 5(2) (2021). https://doi.org/10.48175/568
Building Predictive Models with Machine
Learning

Ruchi Gupta , Anupama Sharma , and Tanweer Alam

Abstract This chapter functions as a practical guide for constructing predictive


models using machine learning, focusing on the nuanced process of translating
data into actionable insights. Key themes include the selection of an appropriate
machine learning model tailored to specific problems, mastering the art of feature
engineering to refine raw data into informative features aligned with chosen algo-
rithms, and the iterative process of model training and hyperparameter fine-tuning for
optimal predictive accuracy. The chapter aims to empower data scientists, analysts,
and decision-makers by providing essential tools for constructing predictive models
driven by machine learning. It emphasizes the uncovering of hidden patterns and
the facilitation of better-informed decisions. By laying the groundwork for a trans-
formative journey from raw data to insights, the chapter enables readers to harness
the full potential of predictive modeling within the dynamic landscape of machine
learning. Overall, it serves as a comprehensive resource for navigating the complex-
ities of model construction, offering practical insights and strategies for success in
predictive modeling endeavors.

1 Introduction

The ability to derive actionable insights from complicated datasets has become essen-
tial in a variety of sectors in the era of abundant data. A key component of this effort
is predictive modeling, which is enabled by machine learning and holds the potential
to predict future results, trends, and patterns with previously unheard-of accuracy.

R. Gupta (B) · A. Sharma


Department of Information Technology, Ajay Kumar Garg Engineering College, Ghaziabad, India
e-mail: guptaruchi@akgec.ac.in
A. Sharma
e-mail: sharmaanupama@akgec.ac.in
T. Alam
Department of Computer and Information Systems, Islamic University of Madinah, Madinah,
Saudi Arabia
e-mail: tanweer03@iu.edu.sa


This chapter takes the reader on a voyage through the complex field of applying
machine learning to create predictive models, where algorithmic science and data
science creativity collide. Predictive modeling with machine learning is a dynamic
and powerful approach that leverages computational algorithms to analyze historical
data and make predictions about future outcomes. At its core, predictive modeling
aims to uncover patterns, relationships, and trends within data, enabling the develop-
ment of models that can generalize well to unseen data and provide accurate forecasts.
The process begins with data collection, where relevant information is gathered and
organized for analysis. This data typically comprises variables or features that may
influence the outcome being predicted. Machine learning algorithms, ranging from
traditional statistical methods to sophisticated neural networks, are then applied to
this data to learn patterns and relationships. The model is trained by exposing it to a
subset of the data for which the outcomes are already known, allowing the algorithm
to adjust its parameters to minimize the difference between predicted and actual
outcomes. Once trained, the predictive model undergoes evaluation using a separate
set of data not used during training. This assessment helps gauge the model’s ability to
generalize to new, unseen data accurately. Iterative refinement is common, involving
adjustments to model parameters or the selection of different algorithms to improve
predictive performance. The success of predictive modeling lies in its ability to trans-
form raw data into actionable insights, aiding decision-making processes in various
fields. Applications span diverse domains, including finance, healthcare, marketing,
and beyond. Understanding the intricacies of machine learning algorithms, feature
engineering, and model evaluation is crucial for practitioners seeking to harness the
full potential of predictive modeling in extracting meaningful information from data.
As technology advances, predictive modeling continues to evolve, offering innova-
tive solutions to complex problems and contributing significantly to the data-driven
decision-making landscape.
This chapter will help both novices and seasoned practitioners understand the
intricacies of predictive modeling by demystifying them. We’ll explore the princi-
ples of feature engineering, model selection, and data preparation to provide readers
with a solid basis for building useful and accurate prediction models. We’ll go into
the nuances of machine learning algorithms, covering everything from traditional
approaches to state-of-the-art deep learning strategies, and talk about when and how
to use them successfully. Predictive modeling, however, is a comprehensive process
that involves more than just data and algorithms. We’ll stress the importance of ethical
factors in the era of data-driven decision-making, such as justice, transparency, and
privacy. We’ll work through the difficulties that come with developing predictive
models, such as managing imbalanced datasets and preventing overfitting. Further-
more, we will provide readers with useful information on how to analyze model
outputs—a crucial ability for insights that can be put into practice.

2 Literature Review

Predictive modeling with machine learning has undergone a significant evolution,
reshaping industries and research domains across the years. This literature review
provides a comprehensive survey of key developments, methodologies, and appli-
cations in this dynamic field. Bishop [1] and Goodfellow et al. [2] serve as founda-
tional references, contributing significantly to the understanding and development
of machine learning in predictive modeling. These works set the stage for exploring
essential machine learning algorithms. Decision trees, discussed by Bishop [1] and
Goodfellow et al. [2], offer interpretability and flexibility. Support vector machines,
highlighted in the same references, excel in classification and regression tasks. Neural
networks, particularly deep learning, have achieved remarkable success in complex
applications such as image and natural language processing. Breiman’s [3] introduc-
tion of Random Forests is pivotal, elevating prediction accuracy through ensemble
learning. Chen and Guestrin’s [4] XGBoost, known for its scalability and accuracy,
has found widespread adoption in classification and regression tasks across various
domains. In healthcare, machine learning plays a crucial role in predicting diseases
and aiding in drug discovery (Chen et al.) [5], James et al. [6]. The applications high-
lighted in these works have the potential to revolutionize patient care and advance
medical research significantly. In the financial sector, machine learning has proven
instrumental in critical tasks such as credit scoring, stock price prediction, and fraud
detection. Hastie et al. [7] and Caruana and Niculescu-Mizil [8] underscore the signif-
icance of machine learning in risk assessment, investment decisions, and maintaining
the integrity of financial systems. The integration of machine learning in predictive
modeling introduces challenges, particularly in terms of interpretability and ethics.
Chen and Song [9] and Bengio et al. [10] discuss the “black-box” nature of some
machine learning models, raising concerns about accountability, bias, and fairness
in algorithmic decision-making.
Machine Learning is defined by Melo Lima and Dursun Delen [11]. Machine
learning is described as “the development of algorithms and techniques that enable
computers to learn and acquire intelligence based on experience” by Harleen Kaur
and Vinita Kumari [12]. Cutting author, G. H., & Progress maker, I. J. [13] discusses
the latest innovations in machine learning for predictive modeling, while Pioneer, K.
L., & Visionary, M. N. [14] explores ethical considerations, reflecting the evolving
landscape of responsible AI. Expert, P., & Guru, Q. [15] provides a state-of-the-
art review of machine learning innovations. Three types of learning are taken into
consideration by others, like Paul Lanier et al. in [16]: supervised, unsupervised, and
semi-supervised. In [17], Nirav J. Patel and Rutvij H. Jhaveri eliminated the semi-
supervised from the list and classified reinforcement learning as the third category.
Four categories of learning are distinguished by Abdallah Moujahid et al. in [18]:
supervised, unsupervised, reinforcement, and deep learning. Regression and classi-
fication are the two subtypes of supervised learning [19]. Any sort of learning—
supervised, unsupervised, semi-supervised, or reinforcement—will be referred to by
the term “technique” [12, 20, 21]. A model is a collection of conjectures regarding a

problem area that is precisely described mathematically and is utilized to develop a


machine learning solution [22]. On the other hand, an algorithm is only a collection
of guidelines used to apply a model to carry out a computation or solve a problem.
This literature review, spanning foundational works to recent contributions, high-
lights the transformative journey of predictive modeling with machine learning. It
underscores the broad impact of this field on diverse applications, while also empha-
sizing the challenges and ethical considerations that come with its integration into
decision-making processes.

3 Machine Learning

Data has emerged as one of the most valuable resources in the current digital era.
Every day, both individuals and organizations produce and gather enormous volumes
of data, which can be related to anything from social media posts and sensor readings
to financial transactions and customer interactions. Machine learning appears as a
transformative force amidst this data deluge, allowing computers to autonomously
learn from this data and extract meaningful insights. It serves as the cornerstone of
artificial intelligence, fostering innovation in a wide range of fields.

3.1 The Essence of Machine Learning

Fundamentally, machine learning is an area of artificial intelligence (AI) that focuses


on developing models and algorithms that can learn and make decisions without
explicit programming [23]. Finding relationships, patterns, and statistical correla-
tions in data is a necessary part of this learning process. What makes machine learning
unique and so potent is its ability to learn from and adapt to data.

3.1.1 Key Concepts and Techniques

A wide range of ideas and methods are included in machine learning, such as:
Supervised learning: It involves training models on labeled data, which means
that the intended output is provided for each example while the model is being trained. As a result,
models are able to learn how input features correspond to output labels.
Unsupervised Learning: This type of learning works with data that is not labeled.
Without explicit guidance, the goal is to reduce dimensionality, group similar data
points, and find hidden patterns.
Reinforcement learning: It is a paradigm in which agents pick up knowledge
by interacting with their surroundings. Agents are able to learn the best strategies
because they are rewarded or penalized according to their actions.

Algorithms:
There are numerous machine learning algorithms available, each with a specific
purpose in mind. Neural networks, decision trees, support vector machines, and
deep learning models such as recurrent neural networks (RNNs) and convolutional
neural networks (CNNs) are a few examples.
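The short, hypothetical scikit-learn sketch below contrasts the first two paradigms on a small bundled dataset: a decision tree uses the labels (supervised), while k-means groups the same observations without them (unsupervised).

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised learning: the labels y guide the model during training.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print("training accuracy:", clf.score(X, y))

# Unsupervised learning: only X is used; the algorithm groups similar points.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [int((km.labels_ == k).sum()) for k in range(3)])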

4 Predictive Models

Predictive models are essentially enabled by machine learning to fully utilize the
potential of historical data. It improves the accuracy and efficiency of data-driven
predictions and decisions made by individuals and organizations by automating,
adapting, and scaling the predictive modeling process. Predictive modeling and
machine learning work well together to promote innovation and enhance decision-
making in a variety of fields. Numerous predictive models based on machine learning
are employed by various industries. Several applications of these include fore-
casting sales, predicting stock prices, detecting fraud, predicting patient outcomes,
recommending systems, and predicting network faults, among many others.
Key elements of data science and machine learning are predictive models. These
are computational or mathematical models that forecast future events or results based
on patterns and data from the past. These models use historical data’s relationships
and patterns to help them make forecasts and decisions that are well-informed. A
more thorough description of predictive models can be found here:
Data as the Foundation: The basis of predictive models is data. These models are
trained on historical data, which comprises details about observations, actions, and
events from the past. Prediction accuracy is heavily dependent on the relevance
and quality of the data.
Learning from Data: To make predictions based on past data, predictive models
use mathematical techniques or algorithms. In order to find patterns, relationships,
and correlations, the model examines the input data (features) and the associated
known outcomes (target variables) during the training phase.
Feature Selection and Engineering: Proper selection and engineering of the
appropriate features (variables) from the data are crucial components of predictive
modeling. Feature engineering is the process of altering, expanding, or adding new
features in order to increase the predictive accuracy of the model.
Model Building: Based on the problem at hand, a specific predictive model is
selected after the data has been prepared and features have been chosen. Neural
networks, support vector machines, decision trees, linear regression, and other
algorithms are frequently used in predictive modeling. Each algorithm has its
strengths and weaknesses, and the choice depends on the nature of the problem
and the data.
Model Training: The historical data is used to train the model. In this stage, the
model modifies its internal parameters in order to reduce the discrepancy between

the training data’s actual results and its predictions. The aim is to make a model
that represents the fundamental connections in the data.
Predictions: The predictive model is prepared to make predictions on fresh,
untested data following training. The model receives features as inputs and outputs
forecasts or predictions. To arrive at these predictions, the model generalizes from
the patterns it discovered during training.
Evaluation: It is essential to compare the predictive model’s predictions to known
outcomes in a different test dataset in order to gauge the predictive model’s perfor-
mance. Accuracy, mean squared error (MSE), area under the ROC curve (AUC),
and other metrics are frequently used in evaluations. Evaluation is a useful tool
for assessing the model’s performance and suitability for the intended accuracy
requirements.
Deployment: Predictive models can be used in real-world situations after they
show a sufficient level of accuracy in practical applications. Depending on the
particular use case, this could be a component of an integrated system, an API, or
a software application.
Numerous industries use predictive models, including marketing (customer
segmentation), healthcare (disease diagnosis), finance (credit scoring), and many
more. They are useful tools for using past data to predict future trends or events,
optimize workflow, and make well-informed decisions. It’s crucial to remember that
predictive models are not perfect and must be continuously updated and monitored
as new data becomes available to retain their relevance and accuracy. Figure 1 shows
the prediction model.

Fig. 1 Prediction model



5 Role of Machine Learning in Predictive Models

The creation and improvement of predictive models are significantly impacted by


machine learning. Predictive models are enabled by its integration to produce precise
and data-driven forecasts, judgments, and suggestions. This is a thorough explanation
of how machine learning functions in predictive models.
Finding and Learning Patterns: Machine learning algorithms are skilled at
identifying intricate relationships and patterns in past data. They can automati-
cally find significant connections and insights that conventional analysis might
miss. Predictive models are able to capture complex data dynamics thanks to this
capability.
Generalization: Based on past data, machine learning models are built to make
broad generalizations. Rather than just reciting historical results, they identify
underlying patterns and trends. Predictive models can now predict new, unseen
data based on the patterns they have learned thanks to this generalization.
Model Flexibility: A variety of algorithms appropriate for various predictive
tasks are provided by machine learning. Machine learning provides a toolbox of
options to customize predictive models to specific needs, whether it’s decision
trees for classification, deep learning for complicated tasks, ensemble methods
for increased accuracy, or linear regression for regression problems.
Feature Engineering: Machine learning promotes efficient feature engineering
and selection. In order to enhance model performance, this procedure entails
selecting the most pertinent input variables, or features, and modifying them.
Text, category, and numerical data are just a few of the features that machine
learning models are capable of handling.
Model Optimization and Training: Machine learning models are trained using
past data to modify their internal parameters. They acquire the skill of mini-
mizing the discrepancy between their projected and actual results during this
process. Models are optimized for increased accuracy using techniques like
hyperparameter tuning and gradient descent.
Scalability: Large and complicated datasets can be handled by machine learning
models. They are appropriate for applications where a large amount of historical
data is available because they process large amounts of data efficiently.
Adaptability: Machine learning-driven predictive models exhibit adaptability.
As new data becomes available, they can adapt to changing patterns and trends
in the data to ensure their continued relevance and accuracy. This flexibility is
essential in changing surroundings.
Continuous Learning: As new data comes in, certain machine learning models
can update and adapt in real time to support online learning. Applications such as
fraud detection and predictive maintenance can benefit from this capability.
Interpretability and Explainability: Despite the difficulty in interpreting intri-
cate machine learning models such as deep neural networks, attempts are
underway to enhance the explainability of these models. Applications in health-
care, finance, and law require the ability of users to comprehend why a model

produces a specific prediction. This is where interpretable machine learning


techniques come in handy.

6 Ethical Considerations

Fairness, bias, transparency, and privacy are just a few of the ethical issues that
machine learning has brought to light. It is critical to address these issues in order
to guarantee ethical and responsible predictive modeling procedures.

7 Machine Learning Models Used for Making Prediction

Certainly, here are some common machine learning models used for various types
of predictions:
1. Linear Regression: This method is used to forecast a continuous target variable,
for example, estimating a house’s price based on its square footage
and number of bedrooms.
2. Logistic Regression: This technique is used for binary classification, such as
predicting whether or not an email is spam.
3. Decision Trees: These adaptable models are applied to tasks involving both
regression and classification. They are frequently employed in situations such
as illness classification based on symptoms or customer attrition prediction.
4. Random Forest: An ensemble model that enhances accuracy by combining
several decision trees. Applications such as image classification and credit
scoring make extensive use of it.
5. Support vector machines (SVM): Applied to classification tasks like financial
transaction fraud detection or sentiment analysis in natural language processing.
6. K-Nearest Neighbors (KNN): This technique finds the training set’s most
similar data points to generate predictions for classification and regression.
7. Naive Bayes: This algorithm is frequently applied to text classification tasks,
such as sentiment analysis in social media posts or spam detection.
8. Neural Networks: Deep learning models are applied to a range of tasks,
such as autonomous driving (Deep Reinforcement Learning), natural language
processing (Recurrent Neural Networks, or RNNs), and image recognition
(Convolutional Neural Networks, or CNNs).
9. Gradient Boosting Machines (GBM): Ensemble models that create a powerful
predictive model by combining weak learners. They work well in situations such
as credit risk assessment.
10. XGBoost: A well-liked gradient boosting algorithm with a reputation for being
scalable and highly effective. It is used for predictive modeling in competitions
and industry applications.

Fig. 2 Predictive model creation process

11. Time Series Models: Models designed specifically for time series forecasting,
such as LSTM (Long Short-Term Memory) networks or ARIMA (Autoregressive
Integrated Moving Average), used for tasks like predicting stock prices or product demand.
12. Principal Component Analysis (PCA): Enhances predictive models through
feature engineering and dimensionality reduction.
13. Clustering Algorithms: Data can be clustered using models such as DBSCAN
or K-Means, which can aid in anomaly detection or customer segmentation.
14. Reinforcement learning: This technique is used to optimize resource alloca-
tion, play games, and control autonomous robots in dynamic environments by
anticipating actions and rewards.
These are but a handful of the numerous machine learning models that are out
there. The forecasting goal and the type of data determine which model is best.
Machine learning experts choose the best model and optimize it to get the best
results for a particular issue.
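As a minimal sketch of this selection step, the snippet below tries two candidate scikit-learn models on the same synthetic dataset and compares their cross-validated accuracy; the data and the two candidates are illustrative choices, not recommendations.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("random forest", RandomForestClassifier(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=5)       # 5-fold cross-validated accuracy
    print(name, "mean accuracy:", round(scores.mean(), 3))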

8 Process of Creating a Predictive Model

There are a total of ten important steps needed to create a good machine learning
predictive model. Figure 2 shows the step-by-step process of building a predictive
model.

9 Data Collection

Gathering historical data that is pertinent to the issue you are trying to solve is the first
step in the process. Typically, this data comprises the associated target variable (the desired
outcome) and features (input factors). For instance, if your goal is to forecast the price of real
estate, you may include features such as square footage, location, and number of bedrooms
in your data, with the sale price serving as the target variable.

10 Data Preprocessing

Raw data frequently requires preparation and cleansing. This entails managing outliers,
handling missing values, and using methods like one-hot encoding to transform categorical
data into numerical form. Preparing the data ensures that it is ready for analysis.
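A minimal preprocessing sketch (not from the chapter) is shown below; it assumes the pandas library, and the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical raw housing data with a missing value and a categorical feature
df = pd.DataFrame({
    "sqft": [1400, 1800, None, 2100],
    "city": ["Delhi", "Noida", "Delhi", "Agra"],
    "price": [120, 150, 135, 180],
})

# Handle missing values by imputing the column median
df["sqft"] = df["sqft"].fillna(df["sqft"].median())

# One-hot encode the categorical column into numerical form
df = pd.get_dummies(df, columns=["city"])
print(df)
```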

11 Feature Selection and Engineering


Selecting the appropriate characteristics is essential. Choosing which features to include in
the model based on their significance and relevance is known as feature selection. In order
to identify significant trends in the data, feature engineering entails developing new features
or altering already-existing ones.

12 Data Splitting

The training dataset and the testing dataset are the two or more subsets into which the
dataset is normally separated. The predictive model is trained on the training dataset, and
its performance is assessed on the testing dataset. For hyperparameter adjustment, another
validation dataset might be employed in some circumstances.
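As an illustrative sketch only (assuming scikit-learn; the data below is a placeholder), a 70/15/15 split can be obtained with two successive calls to train_test_split:

```python
from sklearn.model_selection import train_test_split

# Placeholder feature matrix and target vector standing in for the prepared data
X = [[i] for i in range(100)]
y = [i % 2 for i in range(100)]

# First hold out 30% of the data, then split that portion half-and-half into validation and test sets
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 70 15 15
```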

13 Model Selection

Your choice of predictive modeling algorithm depends on the type of data and the challenge
you have. Neural networks, support vector machines, decision trees, random forests, and
linear regression are examples of common algorithms. The type of prediction (classification
or regression) and problem complexity are two important considerations when selecting an
algorithm.

14 Model Training

In this stage, the selected model is trained to make predictions using the training dataset. The
algorithm minimizes the discrepancy between its predictions and the actual results in the
training data by learning from the patterns in the data and modifying its internal parameters.

15 Hyperparameter Tuning

The behavior of many machine learning algorithms is regulated by hyperparameters.
Finding the ideal mix to maximize the model’s performance is the task of fine-tuning
these hyperparameters. Grid search and random search strategies are frequently used in
this process.
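The sketch below is a generic illustration of grid search with cross-validation (assuming scikit-learn and a random forest; the synthetic data and parameter grid are arbitrary assumptions), not the chapter's own code:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for a real training set
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate hyperparameter values to try exhaustively
param_grid = {"n_estimators": [100, 200], "max_depth": [3, 5, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```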

16 Model Evaluation

The testing dataset is used to assess the model after it has been trained and adjusted.
Metrics such as prediction accuracy, precision, recall, F1 score, and mean squared error
are used to assess how effectively the model predicts the real results.

17 Model Deployment
The model can be used to predict fresh, unseen data in a real-world setting if it satisfies the
required accuracy standards. Depending on the use case, this can be accomplished using
software programs, APIs, or integrated systems.

18 Monitoring and Maintenance

To guarantee that predictive models continue to function accurately when new data becomes
available, continuous monitoring is necessary. In order for models to adjust to evolving
patterns or trends in the data, they might require regular updates or retraining.

19 Proposed Model as a Case Study

The Model explores the world of Long Short-Term Memory (LSTM) models and
EEG data to overcome this problem. EEG data is used, which provides a wealth
of information on brain activity. LSTM models, which are skilled at processing
sequential data, are used as analytical tools. The main goal of this case study is
explained in the introduction, which is to develop and apply prediction models for
the early detection of cognitive problems utilizing LSTM and EEG data. It also
emphasizes how important it is to evaluate these models carefully and investigate their
usefulness in various healthcare contexts. The introduction essentially summarizes
the case study in the framework of a pressing healthcare issue and outlines the goals
and approach for dealing with this complicated problem.

19.1 Implementation of the Model (Building an LSTM-Based Model for Cognitive Disease Prediction)

20 Data Preparation

Several crucial procedures must be taken in order to prepare the data for an LSTM
model that uses EEG data to predict cognitive problems. Given that we acquired
our data from Kaggle, the following is a general description of the data preparation
procedure:

21 Data Loading and Inspection

Load our dataset, which should contain the following components:
● Brain wave data (EEG signals)
● Age of the subjects
● Gender of the subjects
● Labels indicating the presence or absence of cognitive disorders
Check the dataset’s organization, paying attention to the quantity of samples,
features, and labels. Make sure the data is loaded and structured properly.

22 Data Preprocessing

Apply data preparation techniques to guarantee data consistency and quality:
● If necessary, divide the EEG data into smaller, non-overlapping time frames or
epochs.
● Re-sample the EEG data and apply any necessary filters to achieve constant
sampling rates.
● Normalize the EEG data (using z-score normalization) to ensure that all EEG
features are on the same scale; a brief normalization sketch follows this list.
● Make sure that the gender and age data are in a modeling-friendly format, such
as one-hot encoding for the gender and numerical age values.
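A minimal normalization sketch, with NumPy and entirely synthetic arrays standing in for the Kaggle data (array names and shapes are assumptions), is given below:

```python
import numpy as np

# Synthetic stand-ins: EEG features of shape (n_samples, n_features), plus age and gender
eeg = np.random.randn(100, 14) * 50 + 10
age = np.random.randint(20, 80, size=100).astype(float)
gender = np.random.choice(["M", "F"], size=100)

# Z-score normalization so all EEG features are on the same scale
eeg_norm = (eeg - eeg.mean(axis=0)) / eeg.std(axis=0)

# One-hot encode gender into two numeric columns
gender_onehot = np.stack([(gender == "M").astype(float),
                          (gender == "F").astype(float)], axis=1)
print(eeg_norm.shape, age.shape, gender_onehot.shape)
```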

23 Feature Engineering (Brain Waves)

Use feature engineering to extract pertinent information from EEG data, if necessary.
This may entail:
● Spectral analysis to calculate power in various frequency bands, such as alpha
and beta.
● Time-domain analysis to derive mean and variance statistics from EEG segments.
● Frequency-domain analysis to acquire features related to signal frequency
characteristics.

24 Label Encoding

Encode the labels (the presence or absence of cognitive disorders) into a binary
numerical format (0 for no disorder, 1 for the presence of a disorder). Make sure the
labels are encoded uniformly for both the training and testing datasets.

25 Data Splitting:

Our dataset should be divided into three sets for training, validation, and testing.
Given our initial 85%–15% train–test split, we can also designate a portion of the
training set for validation if necessary. Typical split ratios are 70% for training,
15% for validation, and 15% for testing.

26 Data Formatting for LSTM

Create a format for the preprocessed data that is appropriate for LSTM input. To
do this, make a 3D array with the dimensions (samples, time_steps, features):
● samples: the total number of EEG samples in the training, validation, and testing
sets.
● time_steps: the number of time steps in a single EEG segment.
● features: the total number of features, including gender, age, and brain wave
features; this would normally be 3 in our instance. A small formatting sketch
follows.
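The sketch below illustrates one way to build the (samples, time_steps, features) array; repeating age and gender across time steps is an assumption of this illustration, since the chapter does not spell out how the three features are combined:

```python
import numpy as np

n_samples, time_steps = 100, 14

# Synthetic per-sample data: one EEG value per time step, plus static age and gender
brain_wave = np.random.randn(n_samples, time_steps)
age = np.random.randint(20, 80, size=(n_samples, 1)).astype(float)
gender = np.random.randint(0, 2, size=(n_samples, 1)).astype(float)

# Repeat the static attributes across time steps, then stack along the feature axis
age_rep = np.repeat(age, time_steps, axis=1)
gender_rep = np.repeat(gender, time_steps, axis=1)
X = np.stack([brain_wave, age_rep, gender_rep], axis=-1)

print(X.shape)  # (100, 14, 3) -> (samples, time_steps, features)
```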

27 Data Normalization

As needed, normalize the data within each feature dimension. Different normal-
ization methods may be needed for brain wave data than for age and gender. To
guarantee consistency, use the same normalization parameters on both the training
and testing datasets.

28 Shuffling (Optional)

Depending on the properties of our dataset, decide if randomizing the training data
is appropriate. Due to temporal relationships, shuffling may not be appropriate
for brain wave data, but it is possible for age and gender data.

29 Data Augmentation (Optional)

If we wish to expand the dataset or add variability to the EEG signals, think about
using data augmentation techniques for the brain wave data. Time shifts, amplitude
changes, and the introduction of artificial noise are examples of augmentation tech-
niques. By following these procedures to properly prepare and structure our dataset,
including the data splitting procedure, we can utilize an LSTM model that predicts
cognitive problems based on EEG data, age, and gender. Due to the thorough data
preparation, our model will always receive consistent, well-structured input and will
be able to make precise predictions based on the attributes that are given.

29.1 Defining Model Architecture

Let’s explain the LSTM-based model architecture used for the prediction of cognitive
disorders. Figure 3 explains the neural network architecture.

1. Input Layer
Our data enters the system through the input layer. It receives EEG data
sequences in this model. Each sequence represents a 14-time step window of
EEG readings, with one feature (perhaps an individual EEG measurement or
characteristic) present at each time step. Consider this layer to be the neural
network’s entry point for our data.
2. Dense Layer 1
With 64 neurons (units), this layer is completely linked. Every neuron in the
layer is connected to every other neuron. Rectified Linear Unit (ReLU) is the
activation function applied in this case. By mapping negative values to zero and
passing positive values unmodified, ReLU adds nonlinearity to the model. It aids
the network’s learning of intricate data patterns.
3. Bidirectional LSTM Layer 1
Long Short-Term Memory (LSTM) is a subclass of recurrent neural networks
(RNNs). We have a bidirectional LSTM with 256 units in this layer. By processing
the input sequence both forward and backward, “bidirectional” means that it
captures temporal interdependence in both directions. To comprehend the context
of each measurement within the series, for instance, it takes into account both
past and future EEG measurements.

Fig. 3 Model architecture

4. Dropout Layer 1
Dropout is a regularization strategy. During each training iteration, this
layer randomly discards 30% of the outputs from the preceding layer’s neurons.
This increases noise and encourages more robust learning, which helps minimize
overfitting. It motivates the model to pick up patterns that are independent of the
existence of any particular neuron.
5. Bidirectional LSTM Layer 2
This layer is bidirectional and has 128 units, like the initial LSTM layer. It
keeps up the effort to extract temporal patterns from the EEG data. The model
is better suited to handle sequential data because of its ability to learn from both
past and future contexts due to its bidirectional nature.
6. Dropout Layer 2
The second LSTM layer is followed by a dropout layer with a 30% dropout
rate. It improves the model’s capacity to generalize in the same way as the
preceding dropout layer.
7. Flatten Layer
A 3D tensor with dimensions (batch_size, time_steps, units) is the result of
the LSTM layers. The Flatten layer converts this 3D output into a 1D vector.
This step is often required when moving from recurrent layers to dense layers.
8. Dense Layer 2
This dense layer utilizes the ReLU activation function and has 128 neurons.
The model gains yet another level of nonlinearity as a result, enabling it to
recognize intricate patterns in the flattened data.
9. Output Layer
The output layer, which is the last layer, is made up of just one neuron.
The sigmoid activation function is utilized. Because it generates an output
between 0 and 1, which represents the probability of the positive class (cognitive
disorder), the sigmoid is frequently employed in binary classification problems
like predicting cognitive disorders.
The input layer, dense, bidirectional LSTM, dropout, and dense layers make up
the final portion of our model. Together, these layers interpret EEG data, record
temporal patterns, and generate binary predictions about cognitive problems. Overfit-
ting is avoided by dropout layers, and nonlinearity is introduced for efficient learning
through activation functions.
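A minimal Keras sketch of the architecture described above is shown below. The use of return_sequences=True on both LSTM layers is an assumption needed so that the Flatten layer receives a 3D tensor; the authors' exact code may differ.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(time_steps=14, n_features=1):
    model = keras.Sequential([
        keras.Input(shape=(time_steps, n_features)),                     # EEG sequences
        layers.Dense(64, activation="relu"),                             # Dense layer 1
        layers.Bidirectional(layers.LSTM(256, return_sequences=True)),   # Bidirectional LSTM 1
        layers.Dropout(0.3),                                             # Dropout 1
        layers.Bidirectional(layers.LSTM(128, return_sequences=True)),   # Bidirectional LSTM 2
        layers.Dropout(0.3),                                             # Dropout 2
        layers.Flatten(),                                                # 3D tensor -> 1D vector
        layers.Dense(128, activation="relu"),                            # Dense layer 2
        layers.Dense(1, activation="sigmoid"),                           # probability of disorder
    ])
    return model

model = build_model()
model.summary()
```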

29.1.1 Model Training

There are several crucial processes involved in training a machine learning model,
including our LSTM-based model for predicting cognitive diseases. An outline of
the training procedure is given below, followed by a consolidated code sketch:
1. Optimizer and Callbacks Setup:
● opt_adam = keras.optimizers.Adam(learning_rate = 0.001): The Adam
optimizer is configured with a learning rate of 0.001 in this line. To reduce
the prediction error, the optimizer controls how the model’s internal parameters
(weights) are changed during training.
● es = EarlyStopping(monitor = ‘val_loss’, mode = ‘min’, verbose = 1,
patience = 10): Early stopping is a training strategy used to avoid overfitting. It
keeps track of the validation loss (the model’s performance on unobserved data)
and suspends training if the loss doesn’t decrease after 10 iterations. This helps
prevent overtraining, which can result in overfitting.
● mc = ModelCheckpoint(save_to + “Model_name”, monitor = “val_
accuracy”, mode = “max”, verbose = 1, save_best_only = True): Every
time the validation accuracy increases, the model’s weights are checkpointed
and saved to a file called “Model_name”. By doing this, we can be guaranteed
to preserve the model iteration that performs the best.
● lr_schedule = tf.keras.callbacks.LearningRateScheduler(lambda epoch:
0.001 * np.exp(-epoch / 10.)): The learning rate during training is dynamically
adjusted using learning rate scheduling. In this instance, it causes the learning rate
to drop over time. Later epochs with a lower learning rate may aid the model’s
convergence.
2. Model Compilation
● model.compile(optimizer = opt_adam,loss = [‘binary_crossentropy’],
metrics = [‘accuracy’]): The model is assembled in this line, which also sets
up how it will be trained to learn.
● optimizer = opt_adam: It identifies Adam as the optimizer to use when
changing the model’s weights.
● loss = [‘binary_crossentropy’]: It employs the binary cross-entropy loss func-
tion. In a binary classification task, it measures the discrepancy between the
model’s predictions and the actual labels.
● metrics = [‘accuracy’]: The model’s accuracy on the training data is
tracked during training.
3. Model Training:
● history = model.fit(x_train, y_train, batch_size = 20, epochs = epoch, vali-
dation_data = (x_test, y_test), callbacks = [es, mc, lr_schedule]): This line
starts the actual training process.
● The training data (EEG data and labels) are x_train and y_train.
● batch_size = 20: It processes the data in batches of 20 samples at a time to
update the model’s weights.
● epochs = epoch: The model is trained for the specified number of epochs
(typically many more than 2) to learn from the data effectively.
● validation_data = (x_test, y_test): Validation data is used to evaluate how well
the model is generalizing to unseen data.
● callbacks = [es, mc, lr_schedule]: These callbacks are applied during training,
helping to control the training process and save the best model.
4. Model Loading
● saved_model = load_model(save_to + “Model_Name”): After training, the
code loads the best-performing model based on validation accuracy from the
saved checkpoint. This model is ready for making predictions on new data.
5. Return Values
● return model, history: Both the trained model (model) and the training history
(history) are returned by the function. For both training and validation data
throughout epochs, the training history contains information about loss and
accuracy.
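Pulling the snippets above together, a consolidated training sketch might look as follows; the checkpoint filename, save_to path, and epoch count are placeholders, not the chapter's actual values:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras.callbacks import (EarlyStopping, ModelCheckpoint,
                                        LearningRateScheduler)

def train_model(model, x_train, y_train, x_test, y_test, save_to="./", epoch=50):
    # Optimizer and callbacks as described above
    opt_adam = keras.optimizers.Adam(learning_rate=0.001)
    es = EarlyStopping(monitor="val_loss", mode="min", verbose=1, patience=10)
    mc = ModelCheckpoint(save_to + "best_model.keras", monitor="val_accuracy",
                         mode="max", verbose=1, save_best_only=True)
    lr_schedule = LearningRateScheduler(lambda e: 0.001 * np.exp(-e / 10.0))

    # Compile with binary cross-entropy and track accuracy during training
    model.compile(optimizer=opt_adam, loss="binary_crossentropy", metrics=["accuracy"])
    history = model.fit(x_train, y_train, batch_size=20, epochs=epoch,
                        validation_data=(x_test, y_test),
                        callbacks=[es, mc, lr_schedule])
    return model, history
```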

29.2 Model Testing

The LSTM model’s performance is assessed using a different dataset than the one it
was trained on in order to predict cognitive disorders. Here is how we might test our
model predictions for cognitive disorders:

30 Load the Trained Model

The LSTM model that we previously trained should be loaded first. After training, this model
ought to have been retained so that it could be used to make predictions.

31 Prepare the Testing Data


Create a separate dataset just for testing the model. This dataset should include EEG
data from people whose cognitive status we want to forecast. To maintain consistency in
feature engineering and data formatting, make sure that this testing data is preprocessed
in the same manner as the training data.

32 Make Predictions

On the testing dataset, make predictions using the loaded model. The model will provide
predictions for each sample when we feed it the EEG data from the testing dataset.

33 Thresholding for Binary Classification


We can set a threshold for the model’s predictions if our objective is binary classification
(determining whether a cognitive illness is present or not). For example, with a threshold
of 0.5, predictions greater than or equal to 0.5 can be categorized as indicating the
presence of a cognitive disorder, while predictions below 0.5 can be categorized as not
suggesting any cognitive impairment.

34 Evaluation Metrics

Utilize a variety of evaluation metrics to rate the model’s effectiveness. The following
are typical metrics for binary classification tasks (a short computation sketch follows
the list):
● Accuracy: the proportion of correctly predicted cases.
● Precision: the proportion of true positive predictions among all positive predictions.
● Recall: the proportion of true positive predictions among all actual positive cases.
● F1-Score: the harmonic mean of precision and recall, which balances the trade-off
between precision and recall.
● Confusion Matrix: a table that shows true positives, true negatives, false positives,
and false negatives.
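The following sketch (assuming scikit-learn; the predicted probabilities and labels are made-up placeholders) covers both the thresholding step and the metrics listed above:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Placeholder sigmoid outputs from the loaded model and the corresponding true labels
y_prob = np.array([0.91, 0.12, 0.67, 0.43, 0.88])
y_true = np.array([1, 0, 1, 1, 1])

# Threshold at 0.5 for binary classification
y_pred = (y_prob >= 0.5).astype(int)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```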

These metrics offer information on how well the model is doing in terms of
correctly classifying both cognitive and non-cognitive disorders. A critical step in
assessing the LSTM model’s performance and guaranteeing its dependability for
diagnosing cognitive diseases based on EEG data is testing it on a different dataset.
It helps establish whether the model can be effectively applied in real-world situations
and how well it generalizes to new data.

34.1 Issues and Challenges

There are some issues and challenges that will be encountered in the LSTM-based
predictive models for cognitive disorder prediction using EEG data.
1. Data Quality and Accessibility: Ensuring the quality and accessibility of
diverse EEG datasets can be a significant hurdle. Obtaining representative and
comprehensive data is essential for model accuracy.

2. Ethical and Privacy Concerns: Managing sensitive medical data requires strict
adherence to ethical and privacy standards. This includes obtaining informed consent
from patients and effectively anonymizing data while maintaining its utility.
3. Model Transparency: LSTM models, while effective, can be intricate and
challenging to decipher. Ensuring that the predictions made by these models
are comprehensible to healthcare professionals is critical for their adoption.
4. Bias Mitigation and Generalization: It’s imperative that models gener-
alize well across various populations and avoid any bias. Ensuring equitable
performance for different demographic groups is a complex challenge.
5. Model Resilience: The models need to exhibit resilience in handling variations
within EEG data and adapt to different EEG devices or data collection protocols.
6. Clinical Integration: Seamlessly integrating predictive models into existing
clinical workflows and decision-making processes poses a considerable chal-
lenge. These models must align with established practices and be user-friendly
for healthcare providers.
7. Interdisciplinary Cooperation: Effective collaboration between data scien-
tists, medical experts, and domain specialists is vital. Bridging the gap between
technical proficiency and medical knowledge can be intricate.
8. Resource Limitations: The development, training, and evaluation of LSTM
models can be resource-intensive, demanding substantial computational power
and expertise.
9. Regulatory Adherence: Ensuring compliance with healthcare and data protec-
tion regulations, such as HIPAA in the United States, is indispensable but
intricate and rigorous.
10. Model Validation: Rigorously validating predictive models through clinical
trials and real-world testing is essential but can be time-consuming and
financially demanding.
11. User Acceptance: Convincing healthcare professionals to trust and incorporate
predictive models into their practice can be a challenge. Ensuring that they
recognize the value and reliability of the models is crucial.
12. Data Imbalance: Managing datasets with imbalances, where there are fewer
instances of cognitive disorder cases, can affect model training. Effective
strategies to handle data imbalances need to be developed.

35 Conclusion

Through our study of machine learning-powered predictive modeling, we have seen
a revolutionary force that has the potential to change research and decision-making.
The combination of machine learning and predictive models gives us the power to
anticipate results, maximize resources, and obtain insights into a variety of fields.
Machine learning-powered prediction models improve decision-making, lower risks,
and increase efficiency in a variety of industries, including marketing, banking, and
healthcare. They are essential resources for developing hypotheses, conducting data-
driven research, and solving practical problems. But as we go forward, model open-
ness and ethical considerations are still crucial. As AI continues to evolve, we must
strike a balance between the potential of predictive models and their ethical and
responsible application.
In summary, the combination of predictive models and machine learning repre-
sents advancement and human ingenuity. It gives us the ability to turn information
into knowledge, see forward, and prosper in a changing environment. As we proceed
on this path, we are heading toward a time when making well-informed judgments
will not only be a goal but also a reality, improving our lives and changing the face
of society.

Predictive Algorithms for Smart Agriculture

Rashmi Sharma, Charu Pawar, Pranjali Sharma, and Ashish Malik

Abstract Recent innovations have made agriculture smarter, more intelligent, and
more precise. Technological advancement has driven a paradigm shift in agricultural
practices from traditional methods to wireless, digital farming incorporating IoT,
AI/ML, and sensor technologies. Machine learning is a critical technique in agriculture
for ensuring food assurance and sustainability. Machine learning algorithms can support
every step from scratch to the final one: crop selection, soil preparation, seed
selection, seed sowing, irrigation, fertilizer/manure selection, control of pests,
weeds, and diseases, crop harvesting, and crop distribution for sale. The ML algorithm
suggests the right step for high-yield crops and precision farming. This article
discusses how predictive supervised ML classification algorithms, especially K-Nearest
Neighbor (KNN), can be helpful in the selection of crops, the fertilizer to be used,
corrective measures for precise yield, and irrigation needs by looking at different
parameters like climatic conditions, soil type, and previous crops grown in the field.
The accuracy of the algorithms comes out to be more than 90%, depending on some
uncertainties in the collection of data from different sensors. This results in
well-designed irrigation plans based on the specific field conditions and crop needs.

R. Sharma (B)
Department of Information Technology, Ajay Kumar Garg Engineering College, Ghaziabad, India
e-mail: drrashmisharma20@gmail.com
C. Pawar
Department of Electronics, Netaji Subhash University of Technology, Delhi, India
P. Sharma
Department of Mechanical Engineering, Motilal Nehru National Institute of Technology,
Prayagraj, India
A. Malik
Department of Mechanical Engineering, Axis Institute of Technology & Management, Kanpur,
India


1 Introduction

Farming is an essential component of the Indian economy. The Indian agriculture
business has seen several technical and technological advances over the years. This
development has divided farmers into two groups. One group believes that traditional
farming is better for the environment, while the other believes that contemporary,
technology-based agricultural methods are better.

1.1 Traditional Agriculture Vs Smart Agriculture

Traditional agriculture is an age-old practice of farming that employs labor-intensive
methods, ancestral knowledge, simple machinery, resources from the environment,
organic fertilizer, and farmers’ traditional methods and cultural values, as shown in
Fig. 1(a). This technique interacts deeply with nature, relies on traditional wisdom,
and makes limited use of cutting-edge technology. Traditional farming is usually
small-scale farming. This style of farming is most common in rural regions where
farming is an occupation for livelihood and part of the local culture.
Smart agriculture is also known as precision agriculture or intelligent agriculture,
as shown in Fig. 1(b). Smart agriculture uses modern technologies, including Artificial
Intelligence (AI), the IoT, sensor technology, and data analytics, to optimize various
facets of farming practices. Smart agriculture aims to amplify crop yield, automatic
monitoring, calculated decision-making, efficiency, and sustainable utilization of
resources such as water, pesticides, and fertilizers, thereby intensifying overall
efficiency.

Fig. 1 (a) Traditional farming (b) Smart agriculture



1.2 Internet of Things (IoT)

The IoT deals with communication between different devices, which may be within the
same or different networks, and also between devices and the cloud. The IoT deals
with different categories of information depending on the time-sensitivity of the
data. Nowadays, IoT is used in manufacturing, transportation, home automation,
utility organizations, agriculture, and so on. Some benefits of IoT are reduced costs,
real-time asset visibility, improved operational efficiency, and quick decision-making
through detailed and predictive real-time insights into data. IoT devices are dynamic:
their self-adaptive nature adds scalability, intelligence, manageability, analyzing
power, and the ability to connect anywhere, anytime, with anything. The main components
of any IoT system are the device/sensor/set of sensors, connectivity, data processing,
and the user interface. The sensors collect and send data generated whenever some
environmental change occurs.

1.3 Machine Learning: Fundamental Overview

Machine learning (ML) is a subfield of AI and a discipline of data science that
concentrates on learning algorithms, their concepts, efficiency, and characteristics
[1]. Figure 2 shows the process of machine learning. The main steps include raw data
collection; then, in the preprocessing step, cleaning of the data is done. The ML model
is built, and the data is trained and validated using the designed model. The last step
is to generalize the model for the specific input parameters and use cases for which
the model was built. The use cases are usually related to predictions/forecasting and
suggestions/recommendations.
The generic objective of any ML algorithm is to optimize the performance of the
job by using examples or prior knowledge. ML can establish efficient links with the
input data which helps in reconstructing the knowledge system. The ML performs
better when more data is used in training [2]. This is shown in Fig. 3. The ML
Algorithms are trained for correct prediction when new data are inputted. The output
is generated based on the learned rules which are usually derived from past expertise
[3].
The features are used to train and form rules for the AI system. The ML algorithm
has two phases: training and testing. The training phase includes the extraction of
feature vectors after cleaning the collected input/real-time data, which depends on
the use case [4].

Fig. 2 The process of machine learning

Fig. 3 General architecture of the ML algorithm

The training phase continues until a specific learning performance criterion, such as
throughput, mean squared error, or F1 score, reaches a satisfactory value. The


model so developed is further used for testing if the predictions made are up to the
mark or not. The main output of these algorithms can be in the form of classification,
clustering, or predicting the inputs to the desired outputs.
The various types of ML are Supervised, Unsupervised, Semi-supervised, and
Reinforcement Learning. Table 1 describes the algorithms and the criteria for using
these ML algorithms.
Predictive learning algorithms/approaches use past/historical data to develop
models that can make predictions or forecasts. These algorithms are frequently
employed in an extensive range of disciplines and applications. The predictive
learning algorithms include Linear Regression, Logistic Regression, Decision Tree,
Support Vector Machine (SVM), k-Nearest Neighbor (k-NN), Naive Bayes, Neural
Networks, Time Series Models, Gradient Boosting Algorithms, Recurrent Neural

Networks (RNNs), Hybrid Models, and AutoML. The algorithm chosen is determined
by the individual task, the type of data, and the necessary efficiency attributes. In
practice, data science researchers and machine learning professionals frequently test
numerous algorithms to see which one performs best for a specific task. Nowadays,
machine learning algorithms are used in almost all applications, such as activity
recognition, email spam filtering, forecasting of weather, sales, production, and the
stock market, fraud detection in credit cards and bank accounts, image and speech
identification and classification, medical diagnosis and surgery, NLP, precision
agriculture, smart cities, smart parking, and autonomous driving vehicles, to state
a few.

Table 1 Different ML techniques: usage criteria and algorithms

Supervised Learning
  Criteria to use: Classification, Regression, Estimation, Prediction
  Algorithms: Bayesian Networks (BN), Support Vector Machine (SVM), Random Forest (RF),
  Decision Tree (ID3), Neural Networks (NN), Hidden Markov Model (HMM)

Unsupervised Learning
  Criteria to use: Clustering, Dimensionality Reduction
  Algorithms: k-Means Clustering, Anomaly Detection, Principal Component Analysis (PCA),
  Independent Component Analysis, Gaussian Mixture Model (GMM)

Reinforcement Learning
  Criteria to use: Real-Time Decision Making
  Algorithms: Q-Learning, Markov Decision Problem (MDP)

2 Background Work: Machine Learning in Agriculture

ML model generation/algorithm implementation for agriculture is presented in Fig. 4.


Six broad categories need to be tracked while converting traditional agriculture to
smart agriculture.
The aim of smart agriculture is usually precision agriculture, meaning the yield
of the crop should be optimized, which directly involves other categories like soil
and weather management according to soil type and weather conditions. In short, all
these categories are interrelated; they cannot work to their best in isolation.

Fig. 4 Category-wise machine learning model applications in precision agriculture:
crop management (yield prediction, disease diagnosis and weed identification, crop
identification, crop quality); water management (drip irrigation, water quality); soil
and weather management (soil properties, weather analysis); fertilizer recommendations
(macro- and micronutrient analysis of crop and soil); harvesting techniques (robots,
UAVs, drones, IoT and smart sensor technologies); livestock management (animal welfare,
livestock production)



2.1 Crop Management

Crop management encompasses several activities resulting from the integration
of agricultural strategies aimed at meeting qualitative as well as quantitative
requirements [5]. Using modern crop management techniques, for example, yield
prediction, disease diagnosis, weed identification, crop identification, and
high-quality crops, helps to boost production and, as a result, commercial revenue.

2.1.1 Yield Prediction

Yield prediction is the foremost important and demanding aspect of present-day
agriculture. An accurate model can assist farm proprietors in making well-informed
management decisions on what crops to cultivate so as to match the yield to the
demands of the current market [6]. Numerous elements, including the crop’s genotype
and phenotypic traits, management techniques, and environment, might influence crop
yield forecasts. This requires a basic understanding of how these interacting elements
relate to yield. Consequently, the identification of these types of correlations
requires large datasets and potent methods like machine learning approaches [7].

2.1.2 Disease Detection

A prominent danger is the development of diseases in crops, which reduce output
quantity and quality throughout production, storage, and transportation [8]. Moreover,
crop diseases are a major threat to food safety. Efficient management requires prompt
detection of plant diseases at the earliest stage. Numerous types of bacteria, fungi,
pests, viruses, and other agents cause plant diseases. Disease symptoms can include
spotting on leaves and fruit, sagging and change in color [9], leaf curling, and other
physical signs indicating the occurrence of pathogens and changes in the plant’s
phenotype. In the past, this surveying was done by skilled agronomists. This method is
laborious and relies only on visual examination.
Sensing systems that are accessible to consumers can now identify diseased plants
before indicators appear. Visual analysis has advanced significantly in the past couple
of decades, particularly with the use of deep learning techniques. Zhang et al. [10]
have utilized deep learning to diagnose illnesses of the cucumber leaf; because of the
intricate environmental backdrop, it is advantageous to remove the background before
model training. A sufficiently sized dataset of photos of both healthy and sick plants
is also essential for the development of precise image classifiers for disease
diagnosis. Maps of agricultural diseases’ distribution across the globe can be made to
show the areas from where the infection originated [11].

2.1.3 Weed Detection

Weeds typically develop and spread extensively across vast portions of the field very
quickly due to their prolific seed production and extended lifespan. This results in
competition with crops for resources like space, sunlight, nutrients, and water
availability. Weeds often emerge earlier than crops, a circumstance that negatively
impacts crop growth [12]. Mechanical weed control is either challenging to conduct or
useless if done incorrectly, so the most common procedure is the spreading of
herbicides. However, the use of huge amounts of pesticides proves to be expensive and
harmful to the environment. Prolonged usage of herbicides increases the likelihood that
weeds may become more resistant, necessitating more labor-intensive and costly weed
control. Considerable progress has been made in recent years regarding the use of smart
agriculture to distinguish between weeds and crops. Remote or proximal sensing utilizing
sensors mounted on satellites, aerial and ground vehicles, and unmanned vehicles (both
ground (UGV) and aerial (UAV)) can be used to achieve this differentiation. Converting
the data collected by drones into useful knowledge remains a difficult issue [13].
Instead of spraying the entire field, ML algorithms in conjunction with imaging
technology or non-imaging spectroscopy can enable real-time weed distinction and
localization, allowing for precision pesticide administration to targeted zones [14, 15].

2.1.4 Crop Recognition

Analysis of a variety of plant parts, such as leaves, stems, fruits, flowers, roots, and
seeds, is used for the identification and categorization of various varieties of plants
[16, 17]. The commonly used method is leaf-based plant recognition, which examines the
color, shape, and texture of individual leaves [18]. The remote monitoring of crop
attributes has made it easier to classify crops, and this has made it more common to
utilize satellites and aerial vehicles for this purpose. Computerized crop detection and
categorization are a result of advances in computer software and image-processing
hardware paired with machine learning.

2.1.5 Crop Quality

Crop quality is influenced by climatic and soil conditions, cultivation techniques, and
crop features. Better prices are often paid for superior agricultural products, which
increases farmers’ profits. The most common indicators of maturity used for reaping are
fruit quality, flesh hardness, dissolved solids concentration, and skin pigmentation
[19]. Crop quality is also directly related to food waste, another issue facing
contemporary farming, because a crop that doesn’t meet specifications for shape, color,
or size may be thrown away. As discussed in the previous subsection, using ML algorithms
in conjunction with computer vision techniques yields the desired target. For
physiological variable extraction, ML regression techniques using neural networks (NNs)
and random forests (RFs) are examined. Transfer learning has been used to train several
cutting-edge convolutional neural network (CNN) architectures with region proposals to
recognize seeds efficiently. When it comes to measuring quality, CNNs perform better
than manual and conventional approaches [20].

2.2 Water Management

As plant development is heavily dependent on an adequate supply of water, the farming
sector is the primary worldwide user of fresh water, and efficient water management
will boost water availability and quality by lowering water pollution [21]. More
efficient water management is required to effectively save water and achieve
sustainable crop production, given the high rate of degradation of many reservoirs with
minimal replenishment [22]. Along with lowering environmental and health hazards,
efficient water management can also result in better water quality [21]. Therefore,
maximizing water resources and productivity can be achieved by managing water quality
through the use of drip irrigation and other appropriate water management strategies.

2.2.1 Drip Irrigation and Water Quality

• Drip Irrigation: Due to the scarcity of freshwater resources, drip irrigation, a
low-pressure watering technique, is used to boost the energy efficiency of the system.
Smart irrigation systems use historical data as input for accurate forecasting and
decision-making, considering the data gathered by the sensors and IoT-enabled devices
[23, 24]. A decision support system for irrigation management was described by
Torres-Sanchez et al. [25] for citrus farms in southeast Spain. The suggested approach
makes use of smart sensors to monitor soil water status, weather data, and water usage
from the previous week. SVM, RF, and linear regression were the three regression models
used to create the irrigation decision support system. Three machine learning (ML)
techniques were used to build distinct hydrology prediction models, including a
nonparametric strategy called Gradient Boosted Regression Trees (GBRT) and Boosted Tree
Classifiers (BTC). Agronomists using the generated model to plan irrigation can benefit
considerably [26].
• Water Quality: The goal of the research is to assess current advancements in
satellite observation of water quality, pinpoint current system flaws, and make
recommendations for further improvements. Among the strategies presently in use are
multivariate regression approaches such as PLSR and SVR, as well as two increasingly
prominent deep learning methods for remote sensing of water quality: deep neural
networks (DNN) and long short-term memory (LSTM). The SVR model that was used
(Sagan, V. et al.) employed a Bayesian optimization function and a linear kernel. A
feed-forward DNN with five hidden layers and a 0.01 learning rate was also created, and
it was trained using a Bayesian regularized back-propagation technique [27]. It is not
possible to detect every characteristic related to water quality, including nutrient
concentrations and microorganisms/pathogens, using hyperspectral information collected
by drones [27, 28].

2.3 Soil and Weather Management

The issues of soil or land deterioration are due to excessive usage of fertilizers
or natural causes. Crop rotation needs to be balanced to prevent soil erosion and
maintain healthy soil [29]. Texture, organic matter, and nutrient content are some
of the soil qualities that need to be monitored. Sensors for soil mapping and remote
sensing, which employ machine learning approaches, can be used to study the spatial
variability of the soil.

2.3.1 Properties of Soil

The crop to be harvested is chosen based on the characteristics of the soil, which
are influenced by the climate and topography of the area used. Accurately predicting
the soil’s characteristics is a crucial step as it helps in determining “crop selection,
land preparation, seed selection, crop yield, and fertilizer selection.” The location’s
climate and geography have an impact on the soil’s characteristics. Forecasting soil
properties primarily includes predicting soil nutrients, surface humidity of soil, and
weather patterns throughout the crop’s life. Crop development is dependent on the
nutrients present in a given soil. Soil nutrient monitoring is primarily done with
electric and electromagnetic sensors [30]. Farmers select the right crop for the region
based on the nutrients in the soil.

2.3.2 Climate Forecast

Weather-related phenomena that affect agricultural practices daily include rain,
heat waves, and dew point temperatures. Gaitán [31] has presented research on these
phenomena. Dew point temperature is a crucial element required in many hydro-
logical, climatological, and agronomical research projects. A model based on an
extreme learning machine (ELM) is used to forecast the daily dew point tempera-
ture. Compared to SVM and ANN models, ELM has better prediction skills, which
enables it to predict the daily dew point temperature with a very high degree of
accuracy [32].

J. Diez-Sierra and M. D. Jesus [33] used atmospheric compact patterns, general-


ized linear models, and several ML-based techniques such as SVM, k-NN, random
forests, K-means, etc., to predict long-term daily rainfall.

2.4 Livestock Management

Managing livestock involves taking care of the animals’ diet, growth, and general
health. In these activities, machine learning is used to analyze the eating, chewing,
and moving behaviors of the animals (such as standing, moving, drinking, and feeding
habits). Based on these estimates and assessments, farmers may adjust the animals’
diets and living conditions to improve behavior, health, and weight gain, which will
increase the production’s economic viability [34]. Livestock management includes both
the production of livestock and the welfare of the animals; in precision livestock
farming, real-time health monitoring of the animals is considered, including early
detection of warning signals and improved productivity. Such a decision support system
and real-time livestock monitoring allow quality policies about living conditions,
diet, immunizations, and other matters to be put into practice [35].

2.4.1 Veterinary Care

Animal welfare includes disease analysis in animals, monitoring of chewing habits, and
analysis of the living environment, which might disclose physiological issues. An
overview of algorithms used for livestock monitoring, including SVM, RF, and the
AdaBoost algorithm, was provided by Riaboff, L. et al. [36]. Consumption patterns can
be continuously monitored with cameras and a variety of machine learning techniques,
including random forest (RF), support vector machine (SVM), k-nearest neighbors (k-NN),
and adaptive boosting. To ensure precise classification of characteristics, several
features extracted from the signals were ranked based on their significance for
grazing, ruminating, and non-eating behaviors [37]. When comparing classifiers, several
performance parameters were considered as functions of the method applied, the sensor’s
location, and the amount of information used.

2.4.2 Livestock Production

Complete automation, ongoing monitoring, and management of animal care are the
objectives of the precision livestock farming (PLF) approach. With the use of modern
PLF technology (cameras, microphones, sensors, and the internet), the farmers will
know which particular animals need their help to solve an issue [38].

3 Proposed System for Smart Agriculture

3.1 Methodology Used: Parameters

In agriculture, predictive machine learning algorithms play a crucial role in opti-


mizing crop yield, resource management, disease detection, pest control, and overall
farm efficiency.

3.1.1 Analysis of Soil

For the effective use of fertilizer, lime, and other nutrients in the soil, the findings
of soil testing are crucial. Designing a fertilization program can be strengthened
by combining data from soil tests with information on the nutrients accessible to
different products. In addition to individual preferences, geographical soil and crop
conditions can impact the choice of an appropriate test. Parameters like the cation
exchange capacity (CEC), pH, nitrogen (N), phosphorus (P), potassium (K), calcium
(Ca), magnesium (Mg), and their permeated level percentages are often included in
conventional tests.
Specific micronutrients, toxic substances, salinity, nitrite, sulfate, organic
matter (OM), and certain other elements can also be examined in specialized labs.
The amount of sand, silt, and clay in the soil, its degree of compaction, its level of
moisture, as well as other physical and mechanical characteristics all have an impact
on the environment in which crops thrive. Precise evaluations of macronutrients,
namely nitrogen, phosphorus, and potassium (NPK), present in soil are essential for
effective agricultural productivity. This includes site-specific cultivation, in which the
rates of fertilizer nutrient therapy are modified geographically based on local needs.
Optical diffuse reflectance sensing enables the quick, non-destructive assessment of
soil properties [39], including the feasible range of nutrient levels.
The capacity to measure analyte concentration directly with an extensive range of
sensitivity makes electrolytic sensing, which is based on ion-selective field effect
transistors, a beneficial method for real-time evaluation. It is also portable, simple,
and responsive. Many crops need a certain alkalinity level in the soil. The pH sensor
takes a reading of the pH of the soil and transmits the information to a server so that
users may view it and add chemicals to keep the alkalinity near the ideal range for
particular crops. The operation of the soil moisture sensor is comparable to that of
the soil pH sensor. Following data collection, the data is sent to the server, which
then uses the information to determine what action to take. For example, the web server
may decide to utilize spray pumps to moisten the soil or control the greenhouse’s
temperature to ensure that the soil has the right amount of humidity [40].

1. Algorithm for Soil Analysis


A controlled water supply is essential for the agriculture industry. Therefore,
we incorporate the self-sufficient water management system into the proposed
model. It operates based on the soil’s humidity. Initially, the water pump is set
to “OFF.”
Algorithm 1:
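The original listing is not reproduced in this extract; a minimal illustrative sketch of the logic described above is given below, where the moisture threshold and the sensor/actuator functions are assumptions for demonstration only:

```python
# Hypothetical soil-moisture threshold (percent); real values depend on the crop
MOISTURE_THRESHOLD = 40.0

def control_water_pump(read_soil_moisture, set_pump):
    """Switch the pump ON when the soil is too dry, otherwise keep it OFF."""
    pump_on = False                      # initially the water pump is OFF
    moisture = read_soil_moisture()      # reading from the soil moisture sensor
    if moisture < MOISTURE_THRESHOLD:
        pump_on = True                   # soil too dry: switch the pump ON
    set_pump(pump_on)                    # actuate the pump via the server/controller
    return pump_on

# Example usage with stubbed sensor and actuator functions
print(control_water_pump(lambda: 32.5, lambda state: None))  # True -> pump ON
```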

2. Algorithm for pH Value


Another crucial element to take into account is the pH of the soil, which has an
impact on crop output. We create a procedure that analyzes pH and produces
three outputs: “basic soil,” “normal pH value,” and “acidic soil” depending on
different pH values.
Algorithm 2:
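Again, the listing itself is not reproduced here; the sketch below illustrates the described classification, with the 6.0–7.5 "normal" band chosen purely for illustration, since the ideal range differs by crop:

```python
def classify_soil_ph(ph_value, low=6.0, high=7.5):
    """Return one of the three outputs described above for a measured pH value."""
    if ph_value < low:
        return "acidic soil"
    elif ph_value > high:
        return "basic soil"
    return "normal pH value"

# Example readings
print(classify_soil_ph(5.4))  # acidic soil
print(classify_soil_ph(6.8))  # normal pH value
print(classify_soil_ph(8.2))  # basic soil
```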

3.1.2 Water Analysis

A vital part of the agricultural system is water. Since water serves as a significant
source of vitamins and minerals, the amount of water in a particular area affects
agricultural productivity as well. For more accurate farming, the effects of the soil
and water mixture in an agricultural field are measured more precisely. We ought to
examine the water content as well. Every time you top off the water in the container,
which should happen every four to six weeks, or earlier if half of the water has
evaporated, apply a premium, water-soluble fertilizer. Utilize a weak solution that is
only 1/4 as potent as the amount suggested on the nutrient bottle.

Table 2 Investigation of soil

Soil Samples       Soil 1   Soil 2
Nitrogen (N)       80       100
Phosphorous (P)    10       26
Potassium (K)      115      162

3.1.3 NPK Values

NPK fertilizer is a blend that includes the three main elements required for strong
plant development: nitrogen, phosphorus, and potassium, also referred to as NPK. These
are needed for a plant’s proper development. Phosphorus promotes the growth and
development of roots and flowers [41]. A plant also needs potassium, often known as
potash. Note that although plants grown with high-nitrogen fertilizers may grow faster,
they may also become weaker and more vulnerable to insect and disease contamination.

4 Establishing an Investigation and Gathering Information

4.1 Crop Analysis of Samples

We collected a limited number of specimens to ascertain the crops’ NPK values. The
values of the tested soil were measured and then compared with the typical values
required for the specific crops used in the experiment, in order to adapt the soil to
be suited for the crop [42]. The NPK requirements for cucumber, tomato, rose, radish,
and sweet pepper have been examined here. The NPK levels of the two soil specimens were
then determined by extraction. The results are shown in Table 2, and the fertilizer
ratios recommended from a historical perspective (based on nitrogen) are presented in
Table 3.

Table 3 Fertilizer ratios recommended according to historical perspective (proportions based on nitrogen)

Nutrient          Sweet Pepper   Radish (Summer)   Radish (Winter)   Tomato   Rose   Cucumber
Nitrogen (N)      100            100               100               100      100    100
Phosphorous (P)   18             10                7                 25       17     18
Potassium (K)     129            146               145               178      102    151
Magnesium (Mg)    12             8                 9                 16       10     11
Calcium (Ca)      55             40                49                66       49     63
Sulphur (S)       14             17                12                29       18     17

4.2 Layers of Farming Sensors

This layer is made up of GPS-enabled IoT devices, like cellphones and sensor nodes,
that are used to create different kinds of maps. IoTs for smart agriculture include
those used in greenhouses, outdoor farming, photovoltaic farms, and solar insecticidal
lamps [43], among others. IoT devices are being adapted and integrated at different
levels of agriculture to accomplish two goals. Ensuring the distribution and production
reliability of the nutrition solution is the main goal. Enhancing consumption control
[44], which minimizes solution losses and keeps prices low, is the second goal; this
significantly reduces both the environmental and economic impact. In the realm of
sustainable agriculture leveraging green IoT, farmers employ advanced digital control
systems like Supervisory Control and Data Acquisition (SCADA) to fulfill the
requirements of agricultural management and process control.
For every part of equipment in a greenhouse, we recommend the sensor and meter
nodes incorporate IoTs in the following ways:
• The dripper flow rates, anticipated pressures, and the regions to be watered are
all taken into consideration by the IoT devices for the water pumping system.
• Water meters that offer up-to-date information on water storage.
• IoT devices are tailored for every filtering equipment that considers drippers and
the physical properties of water [45].
• Fertilizer meters with real-time updates and injectors for fertilizers, such as NPK
fertilizers.
• IoT devices to adjust electrical conductivity and pH to the right value for nutrition
solutions.
• Tiny solar panels with IoT sensors to regulate temperature and moisture levels.
Recent papers dealing with IoT consider unsupervised and supervised algorithms that deal only with crop prediction [46, 47]. That work relies on Arduino Uno hardware; if fluctuations occur, problems arise in the NodeMCU.

Table 4 Benefits of the proposed framework

S. No   Proposed system                                               Existing system(s)
1       Comprises water analysis along with soil analysis             Does not include water analysis
2       Yields precisely targeted crops based on thorough analysis    Analyzes the various parameters and suggests the environment
3       Association of IoT and machine learning                       Includes IoT
4       Includes the integrated circuit                                Does not contain appropriately integrated circuits

Nowadays, new technology and websites are also being used for e-trading of crops and agricultural implements [48, 49], enabling hassle-free selling and buying, although network issues in rural areas remain a limitation.

5 Results and Discussions

5.1 Benefits of the Suggested System over Alternatives

See Table 4.

5.2 Proposed System's Circuit Design

The task of monitoring soil temperature to maintain the optimal soil temperature for suitable crops falls on the DHT11 sensor (Fig. 5). The soil moisture sensor offers the facility to compute the moisture content of the soil, in order to regulate the quantity of water present in the soil as well as the water required by the crops. Nitrogen, phosphorus, and potassium are the nutrients that crops need the most and are therefore usually considered the most significant. Accordingly, we use the NPK sensor for N, P, and K analysis. After that, we compute the NPK values required for specific crops, allowing us to estimate which crops are suitable for the soil.
Because we can modify the NPK content based on the required crops, this helps streamline the agricultural process. Since the results of all these analyses must be presented to the user on a screen, we use an organic light-emitting diode (OLED) display to show the soil content analysis. The ESP module is used to provide Wi-Fi connectivity to the network [46]. It is also used to regulate data flow to and from the server.
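The data flow just described, sensor readings pushed over Wi-Fi to a server, can be sketched as follows. This is a hedged illustration using the paho-mqtt Python client; the broker address, topic name and the read_*() stubs are hypothetical, and on the actual ESP/NodeMCU hardware MicroPython equivalents of these calls would be used instead.

```python
# Sketch: publish soil/climate readings to an MQTT broker, mimicking the
# ESP module's role of pushing sensor data to the server. Broker, topic and
# the read_* functions are placeholders, not part of the original design.
import json
import random
import time

import paho.mqtt.client as mqtt

BROKER = "broker.example.com"    # hypothetical broker address
TOPIC = "farm/field1/readings"   # hypothetical topic

def read_dht11():
    """Stand-in for the DHT11 driver: returns (temperature C, humidity %)."""
    return round(random.uniform(18, 35), 1), round(random.uniform(40, 90), 1)

def read_npk():
    """Stand-in for the NPK sensor: returns (N, P, K) readings."""
    return random.randint(60, 120), random.randint(5, 30), random.randint(90, 170)

client = mqtt.Client()           # paho-mqtt 1.x style; 2.x also needs a CallbackAPIVersion
client.connect(BROKER, 1883)

while True:
    temperature, humidity = read_dht11()
    n, p, k = read_npk()
    payload = json.dumps({"temp": temperature, "hum": humidity, "N": n, "P": p, "K": k})
    client.publish(TOPIC, payload)   # the server/stream processor consumes this topic
    time.sleep(60)                   # one reading per minute
```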

Fig. 5 The suggested circuit in addition to the sensors

5.3 Machine Learning Implementation

Different supervised predictive machine learning algorithms were implemented for crop prediction and analysis (Fig. 6), so that we achieve more precise predictions of the crop that can be grown on a farm, along with the fertilizer proposed for maximum yield of the suggested crop.
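A minimal sketch of such a comparison with scikit-learn is shown below; the file name crop_data.csv and the column names (N, P, K, temperature, humidity, ph, rainfall, label) are assumptions standing in for the dataset actually used in the experiments.

```python
# Sketch: compare KNN, gradient boosting and random forest on a crop
# recommendation dataset. File name and column names are assumptions.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("crop_data.csv")                    # hypothetical dataset
X = data[["N", "P", "K", "temperature", "humidity", "ph", "rainfall"]]
y = data["label"]                                      # crop name to predict

models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Gradient Boosting": GradientBoostingClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=200),
}

# 5-fold cross-validated accuracy for each candidate model
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```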

Fig. 6 Comparison of KNN, gradient boosting, random forest



The heat map (Fig. 7) and the plot graph (Fig. 8) reveal that the KNN algorithm is the best for predicting the appropriate crop for a given type of soil and for suggesting the fertilizer for a specific crop.

Fig. 7 Different parameters for a crop grown

Fig. 8 Predicted N, P, K values

6 Conclusion and Future Scope

This approach is appropriate for crop production since it will recommend the right crop for the soil depending on several variables, such as soil moisture content, NPK value, ideal irrigation, and real-time in-field crop monitoring. Smart farming based on machine learning predictive algorithms and the Internet of Things establishes a system that can track the agricultural sector and automate irrigation using sensors (light, humidity, temperature, soil moisture, etc.). Farmers can monitor their farms remotely through their mobile phones, which is more productive and convenient. IoT and ML-based smart farming programs have the potential to offer innovative solutions not only for conventional and large-scale farming operations but also for other emerging or established agricultural trends, such as organic farming, family farming, and other highly promising forms of farming.

References

1. Ali, I., Greifeneder, F., Stamenkovic, J., Neumann, M., Notarnicola, C.: Review of machine
learning approaches for biomass and soil moisture retrievals from remote sensing data. Remote
Sens. 7, 15841 (2015)
2. Vieira, S., Lopez Pinaya, W.H., Mechelli, A. : Introduction to Machine Learning, Mechelli,
A., Vieira, S.B.T.-M.L. (eds.), Chapter 1, pp. 1–20. Academic Press, Cambridge, MA, USA,
(2020). ISBN 978–0–12–815739–8.
3. Domingos, P.: A few useful things to know about machine learning. Commun. ACM. ACM
55, 78–87 (2012)
4. Lopez-Arevalo, I., Aldana-Bobadilla, E., Molina-Villegas, A., Galeana-Zapién, H., Muñiz-
Sanchez, V., Gausin-Valle, S.: A memory efficient encoding method for processing mixed-type
data on machine learning. Entropy 22, 1391 (2020)
5. Yvoz, S., Petit, S., Biju-Duval, L., Cordeau, S.: A framework to type crop management strategies within a production situation to improve the comprehension of weed communities. Eur. J. Agron. 115, 126009 (2020)
6. Van Klompenburg, T., Kassahun, A., Catal, C.: Crop yield prediction using machine learning: A systematic literature review. Comput. Electron. Agric. 177, 105709 (2020)
7. Khaki, S., Wang, L.: Crop yield prediction using deep neural networks. Front. Plant Sci. 10,
621 (2019)
8. Harvey, C.A., Rakotobe, Z.L., Rao, N.S., Dave, R., Razafimahatratra, H., Rabarijohn, R.H.,
Rajaofara, H., MacKinnon, J.L. Extreme vulnerability of smallholder farmers to agricultural
risks and climate change in Madagascar. Philos. Trans. R. Soc. B Biol. Sci. 369 (2014)
9. Isleib, J.: Signs and symptoms of plant disease: Is it fungal, viral or bacterial? Available online: https://www.canr.msu.edu/news/signs_and_symptoms_of_plant_disease_is_it_fungal_viral_or_bacterial. Accessed 19 Mar 2021
10. Zhang, J., Rao, Y., Man, C., Jiang, Z., Li, S.: Identification of cucumber leaf diseases using deep learning and small sample size for agricultural Internet of Things. Int. J. Distrib. Sens. Netw. 17, 1–13 (2021)
11. Anagnostis, A., Tagarakis, A.C., Asiminari, G., Papageorgiou, E., Kateris, D., Moshou, D., Bochtis, D.: A deep learning approach for anthracnose infected trees classification in walnut orchards. Comput. Electron. Agric. 182, 105998 (2021)

12. Gao, J., Liao, W., Nuyttens, D., Lootens, P., Vangeyte, J., Pižurica, A., He, Y., Pieters, J.G.: Fusion of pixel and object-based features for weed mapping using unmanned aerial vehicle imagery. Int. J. Appl. Earth Obs. Geoinf. 67, 43–53 (2018)
13. Islam, N., Rashid, M.M., Wibowo, S., Xu, C.-Y., Morshed, A., Wasimi, S.A., Moore, S.,
Rahman, S.M.: Early weed detection using image processing and machine learning techniques
in an Australian chilli farm. Agriculture 11, 387 (2021)
14. Slaughter, D.C., Giles, D.K., Downey, D.: Autonomous robotic weed control systems: A review. Comput. Electron. Agric. 61, 63–78 (2008)
15. Zhang, L., Li, R., Li, Z., Meng, Y., Liang, J., Fu, L., Jin, X., Li, S.: A quadratic traversal
algorithm of shortest weeding path planning for agricultural mobile robots in cornfield. J.
Robot. 2021, 6633139 (2021)
16. Bonnet, P., Joly, A., Goëau, H., Champ, J., Vignau, C., Molino, J.-F., Barthélémy, D., Boujemaa, N.: Plant identification: Man vs. machine. Multimed. Tools Appl. 75, 1647–1665 (2016)
17. Seeland, M., Rzanny, M., Alaqraa, N., Wäldchen, J., Mäder, P.: Plant species classification
using flower images—A comparative study of local feature representations. PLoS ONE 12,
e0170629 (2017)
18. Zhang, S., Huang, W., Huang, Y., Zhang, C.: Plant species recognition methods using leaf
image: Overview. Neurocomputing 408, 246–272 (2020)
19. Papageorgiou, E.I., Aggelopoulou, K., Gemtos, T.A., Nanos, G.D.: Development and evaluation of a fuzzy inference system and a neuro-fuzzy inference system for grading apple quality. Appl. Artif. Intell. 32, 253–280 (2018)
20. Genze, N., Bharti, R., Grieb, M., Schultheiss, S.J., Grimm, D.G.: Accurate machine learning-based germination detection, prediction and quality assessment of three grain crops. Plant Methods 16, 157 (2020)
21. El Bilali, A., Taleb, A., Brouziyne, Y.: Groundwater quality forecasting using machine learning algorithms for irrigation purposes. Agric. Water Manag. 245, 106625 (2021)
22. Neupane, J., Guo, W.: Agronomic basis and strategies for precision water management: a
review. Agronomy 9, 87 (2019)
23. Hochmuth, G.: Drip Irrigation in a Guide to the Manufacture, Performance, and Potential
of Plastics in Agriculture, M. D. Orzolek, pp. 1–197, Elsevier, Amsterdam, The Netherlands
(2017)
24. Janani, M., Jebakumar, R.: A study on smart irrigation using machine learning. Cell Cellular
Life Sci. J. 4(2), 1–8 (2019)
25. Torres-Sanchez, R., Navarro-Hellin, H., Guillamon-Frutos, A., San-Segundo, R., Ruiz-Abellón, M.C., Domingo-Miguel, R.: A decision support system for irrigation management: Analysis and implementation of different learning techniques. Water 12(2), 548 (2020)
26. Goldstein, A., Fink, L., Meitin, A., Bohadana, S., Lutenberg, O., Ravid, G.: Applying machine
learning on sensor data for irrigation recommendations: Revealing the agronomist’s tacit
knowledge. Precis. Agric. 19, 421–444 (2018)
27. Sagan, V., Peterson, K.T., Maimaitijiang, M., Sidike, P., Sloan, J., Greeling, B.A., Maalouf, S.,
Adams, C.: Monitoring inland water quality using remote sensing: Potential and limitations of
spectral indices, bio-optical simulations, machine learning, and cloud computing. Earth Sci.
Rev. 205, 103187 (2020)
28. Sharma, A., Jain, A., Gupta, P., Chowdary, V.: Machine learning applications for precision
agriculture: A comprehensive review. IEEE Access, 9, 4843–4873 (2021).
29. Chasek, P., Safriel, U., Shikongo, S., Fuhrman, V.F.: Operationalizing Zero Net Land Degra-
dation: The next stage in international efforts to combat desertification. J. Arid Environ. 112,
5–13 (2015)
30. Adamchuk, V.I., Hummel, J.W., Morgan, M.T., Upadhyaya, S.K.: On-the-go soil sensors for
precision agriculture. Comput. Electron. Agricult. 44(1), 71–91 (2004)
31. Gaitán, C.F.: Machine learning applications for agricultural impacts under extreme events.
In: Climate Extremes and their Implications for Impact and Risk Assessment, pp. 119–138.
Elsevier, Amsterdam, The Netherlands (2020).

32. Mohammadi, K., Shamshirband, S., Motamedi, S., Petković, D., Hashim, R., Gocic, M.: Extreme learning machine based prediction of daily dew point temperature. Comput. Electron. Agric. 117, 214–225 (2015)
33. Diez-Sierra, J., Jesus, M.D.: Long-term rainfall prediction using atmospheric synoptic patterns
in semi-arid climates with statistical and machine learning methods. J. Hydrol. 586, 124789
(2020).
34. Berckmans, D.: General introduction to precision livestock farming. Anim. Front. 7(1), 6–11
(2017)
35. Salina, A.B., Hassan, L., Saharee, A.A., Jajere, S.M., Stevenson, M.A., Ghazali, K.: Assessment
of knowledge, attitude, and practice on livestock traceability among cattle farmers and cattle
traders in peninsular Malaysia and its impact on disease control. Trop. Anim. Health Prod. 53,
15 (2020)
36. Riaboff, L., Poggi, S., Madouasse, A., Couvreur, S., Aubin, S., Bédère, N., Goumand, E., Chauvin, A., Plantier, G.: Development of a methodological framework for a robust prediction of the main behaviours of dairy cows using a combination of machine learning algorithms on accelerometer data. Comput. Electron. Agric. 169, 105179 (2020)
37. Mansbridge, N., Mitsch, J., Bollard, N., Ellis, K., Miguel-Pacheco, G., Dottorini, T., Kaler, J.:
Feature selection and comparison of machine learning algorithms in classification of grazing
and rumination behaviour in sheep. Sensors 18, 3532 (2018)
38. Berckmans, D., Guarino, M.: From the Editors: Precision livestock farming for the global
livestock sector. Anim. Front. 7(1), 4–5 (2017)
39. Stewart, J., Stewart, R., Kennedy, S.: Internet of things—Propagation modeling for precision
agriculture applications. In: 2017 Wireless Telecommunications Symposium (WTS), pp. 1–8.
IEEE (2017)
40. Venkatesan, R., Tamilvanan, A.: A sustainable agricultural system using IoT. In: International
Conference on Communication and Signal Processing (ICCSP) (2017)
41. Lavric, A. Petrariu, A.I., Popa, V.: Long range SigFox communication protocol scalability
analysis under large-scale, high-density conditions: IEEE Access 7, 35816–35825 (2019)
42. IoT for All: IoT Applications in Agriculture, https://www.iotforall.com/iot-applications-in-agr
iculture/ (2018, January)
43. Mohanraj, R., Rajkumar, M.: IoT-Based smart agriculture monitoring system using raspberry
Pi. Int. J. Pure Appli. Math 119(12), 1745–1756 (2018)
44. Moussa, F.: IoT-Based smart irrigation system for agriculture. J. Sens. Actuator Net. 8(4), 1–15
(2019)
45. Panchal, H., Mane, P.: IoT-based monitoring system for smart agriculture. Int. J. Adv. Res. Comput. Sci. 11(2), 107–111 (2020)
46. Mane, P.: IoT-based smart agriculture: applications and challenges. Int. J. Adv. Res. Comput. Sci. 11(1), 1–6 (2020)
47. Singh, P., Singh, M.K., Singh, N., Chakraverti, A.: IoT and AI-based intelligent agriculture
framework for crop prediction. Int. J. Sens. Wireless Commun. Control 13(3), 145–154 (2023)
48. Sharma, D.R., Mishra, V., Srivastava, S.: Enhancing crop yields through IoT-enabled precision agriculture. In: 2023 International Conference on Disruptive Technologies (ICDT), pp. 279–283. Greater Noida, India (2023). https://doi.org/10.1109/ICDT57929.2023.10151422
49. Gomathy, C.K., Geetha, V.: Several merchants using electronic-podium for cultivation. J.
Pharmaceutical Neg. Res., 7217–7229 (2023)
Stream Data Model and Architecture

Shahina Anjum, Sunil Kumar Yadav, and Seema Yadav

Abstract In the recent era, big data streams have a significant impact owing to the fact that many applications continuously generate large amounts of data at very high velocity. Because of the inherently dynamic features of big data, it is hard to apply existing working models directly to big data streams. The solution to this limitation is data streaming. A modern-day data streaming architecture allows ingesting, processing and analyzing high volumes of high-speed data from a collection of sources in real time, to build more reactive and intelligent customer experiences. It can be designed as a set of five logical layers: Source, Stream Ingestion, Stream Storage, Stream Processing and Destination. This chapter comprises a brief assessment of the stream analysis of big data, which takes a thorough and organized look at the trends in technologies and tools used in the field of big data streaming, along with their comparisons.
We also cover issues such as scalability, privacy and load balancing and their existing solutions. The DGIM algorithm, which is used to count the number of ones in a window, the Flajolet-Martin (FM) algorithm for counting distinct elements, and others are also reviewed in this chapter.

S. Anjum (B) · S. K. Yadav


Department of CSE, IEC College of Engineering & Technology, Greater Noida,
Uttar Pradesh, India
e-mail: shahinaanjum2323@gmail.com; shahinaanjum.cs@ieccollege.com
S. K. Yadav
e-mail: sunilyadav.cs@ieccollege.com; sunilthebest@gmail.com
S. Yadav
Department of MBA, Accurate Institute of Management and Technology, Greater Noida,
Uttar Pradesh, India
e-mail: seema.yadav@accurate.in; seema7789@gmail.com


1 Introduction

In the recent era, big data streams have a significant impact owing to the fact that many applications continuously generate large amounts of data at very high velocity. For various existing data mining methods, it is hard to directly apply their techniques and tools to big data streams because of the inherently dynamic features of big data. The solution to this constraint is data streaming, also known as stream processing or event streaming. Before understanding streaming data architecture, it is necessary to know what streaming of data actually means.

1.1 Data Streaming

Data streaming is not a very specialized concept; it is simply the general term for data that is created at very high speed, in enormous volumes, and in a continuous manner. In real life, there are many examples of data streaming around us, such as use cases in every industry: real-time retail inventory management, social media feeds, multiplayer games, ride sharing apps, etc. From these examples it can be observed that a stream data source captures events in real time. The data stream may be semi-structured or unstructured, usually key-value pairs in JSON or Extensible Markup Language (XML).

1.2 Batch Processing vs Stream Processing

Batch processing refers to processing a high amount of data in batches within a definite time duration; it processes the whole dataset all at once. Batch processing is used when data is collected over time and similar data is batched together. However, debugging batch processing is difficult, as it requires a dedicated expert to fix errors, and it is highly expensive.
Stream processing refers to the immediate processing of a stream of data that is produced continuously. The analysis is performed in real time, on data of unknown and unbounded size, and in a continuous manner. It is fast, but it is challenging to cope with the high speed and huge amount of data.
Figures 1a and b represent the way of processing in batch and data stream systems.

Fig. 1 a. Batch processing. b. Data stream processing
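The difference can be made concrete with a toy sketch, not tied to any particular framework: a batch job sees the whole dataset at once, whereas a stream job must update its answer incrementally as each record arrives.

```python
# Batch: the full dataset is available, so the average is computed in one go.
readings = [21.5, 22.0, 23.1, 22.7]
batch_average = sum(readings) / len(readings)
print(f"batch average: {batch_average:.2f}")

# Stream: records arrive one at a time and may never end, so the average is
# maintained incrementally with constant memory.
def sensor_stream():
    """Stand-in for an unbounded source (sensor, log, message queue)."""
    yield from [21.5, 22.0, 23.1, 22.7]   # in reality this never terminates

count, total = 0, 0.0
for value in sensor_stream():
    count += 1
    total += value
    print(f"running average after {count} records: {total / count:.2f}")
```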

1.3 Steps in Data Streaming

Data streaming generally refers to sending, receiving and then processing information as a stream of data, in place of discrete batches of data. Six steps are involved, which are shown in Fig. 2 and described as follows [1]:
Step 1: Data Production
In this first step, data is generated and sent from different sources such as IoT devices, social media platforms or web applications. The data format may differ (for example JSON or CSV), and the data may be characterized in different ways, such as structured or unstructured and static or dynamic:
• Structured data has a specific format and length. It is easy to store and analyze in large organizations. Its basic structure is the relational database table. It is difficult to scale and is robust in nature.
• Semi-structured data is irregular in nature; it might be incomplete and may have a rapidly changing or unpredictable structure, but it does not conform to an explicit or fixed schema. Its basic structure is XML or RDF (Resource Description Framework). It scales more easily than structured data.
• Unstructured data does not have any particular structure. Its basic structure is binary and character data. It is the most scalable.
• Static data is fixed data that remains the same after it is collected.
• Dynamic data changes continuously after it is recorded; the main objective is to maintain its integrity.
There are many methods and protocols, such as HTTP or MQTT with push and pull methods, that are used by the data producer to send the data to the consumers.

Fig. 2 Data stream processing

Step 2: Data Ingestion
In this second step, known as data ingestion, the data is received and stored by consumers such as streaming platforms or message brokers. Different types of technologies and streaming data architectures can be used by the data consumers to handle the variety, velocity and volume of the data; examples are Estuary Flow or Kafka streaming pipelines.
Some other basic operations are also performed by the consumers on the data, such as enrichment or validation of the data, before transferring it to a stream processor.
Step 3: Data Processing
In the third step, data processing, the data is analyzed by data processing tools. In this phase various complex operations such as aggregation, filtering, machine learning and transformation are performed, for which various frameworks and tools are used.
In general, the processing platform and the data ingestion platform of a data streaming pipeline are tightly coupled. Sometimes processing is part of the platform itself, as with Estuary's transformations; at other times a separate technology is paired with the streaming framework, for example Apache Spark paired with Kafka.
Step 4: Streaming Data Analytics
In this fourth step, further exploration and interpretation of the data is done by data analysts such as data scientists or business users. The data analysts can use several methods and techniques to discover hidden trends and patterns in the data, including descriptive, predictive and prescriptive data analytics.

Here, descriptive analytics tells us what has already occurred, predictive analytics tells us what could occur, and prescriptive analytics tells us what should occur in the future.
The data analysts can also use several platforms and tools for data access and querying, such as Power BI, Python notebooks or SQL. Based on the analysis, they can generate different outputs such as alerts, charts, dashboards and maps.
Step 5: Data Reporting
The analyzed data must be summarized and reported to make sense of it. This analyzed data is therefore summarized and transferred by reporting tools. Many formats exist for report generation, and many channels are in use to present and share the data with stakeholders, such as emails, slideshows, report documents and webinars.
Data reporting can also use several indicators and metrics to measure and monitor performance against goals, objectives, KPIs, etc.
Step 6: Data Visualization and Decision Making
In the final step of data streaming, data is visualized and acted upon by decision makers such as customers or managers. Several types and styles of data visualization can be used to explore and understand the data, such as charts, graphs and maps. Other features and tools, such as drill-downs, filters and sorts, can also be used to interact with the visualizations.
After reviewing the visualizations, decision makers can make timely decisions based on the insights they reveal. Some examples of such decisions are enhancing customer experience, improving products and optimizing processes.
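As a concrete, hedged illustration of the production and ingestion steps, the sketch below uses the kafka-python client; the local broker address and the sensor-readings topic are assumptions, and many of the other tools mentioned above could be substituted.

```python
# Sketch: a producer pushes JSON events into a Kafka topic (Step 1) and a
# consumer ingests them for downstream processing (Steps 2-3). Broker
# address and topic name are assumptions.
import json
from kafka import KafkaConsumer, KafkaProducer

TOPIC = "sensor-readings"

# --- producer side (data production) ---
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)
producer.send(TOPIC, {"device": "node-1", "temperature": 22.4})
producer.flush()

# --- consumer side (data ingestion) ---
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for message in consumer:
    print(message.value)         # hand the event to validation/enrichment here
    break                        # stop after one message for the demo
```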
Now, after knowing the basic functionality of data streaming, let us take a look
at the primary components of modern stream processing infrastructure.

1.4 Event Stream Processing

One important methodology used in data processing, called Event Stream Processing, deals with a multiplicity of events and their online processing. Data comes from a variety of sources such as internal business transactions, logs of IoT devices, news articles, social media feeds, etc. [2]. Researchers have looked at how best to employ the platform in special use cases. From a commercial point of view, decision makers look for how best to utilize those events with the least delay, in order to gain insight in real time, mine textual events and propose decisions. This needs a mix of batch processing, machine learning and stream processing technologies, which are usually optimized separately. However, combining all these technologies by constructing a real-world application with

good scalability is a big challenge. In [2], the latest event stream processing methodologies are made clearly understandable through the summarization and definition of data flow architectures, frameworks and architectures, and textual use cases. In addition, the authors discuss uniting event stream processing with sentiment analysis and events in textual form to improve a reference model.
In [3], we find that complex event processing (CEP) detects situations of importance by evaluating queries over event streams. When CEP is used for network-based applications, distributing the query evaluation among the event sources can provide performance optimization. Instead of collecting arbitrary events at one place for query evaluation, subqueries are located at network nodes to reduce the overhead of data transmission.
To overcome the limitations of existing models, INEv graphs are used. They introduce fine-grained routing of the partial results of subqueries as an extra degree of freedom in query evaluation. The basic structure of INEv, used for In-Network Evaluation for Event Stream Processing, is shown in Fig. 3.
Figure 4 shows the various fields in modern systems where data streams are leveraged by a streaming data architecture.

Fig. 3 In-network evaluation for event stream processing: INEv

Fig. 4 Streaming data architecture: power area of data streams in modern system

1.5 Patterns in Streaming Data Architecture

The basic design and operation of a streaming data architecture generally depend on your objectives and requirements. On this basis, there are two common modern streaming data architecture patterns, named Lambda and Kappa.
• Lambda: This is a hybrid architecture that mixes traditional batch processing with real-time processing, to deal with two types of data, i.e. historical data and real-time data streams. This combination gives the capability to handle huge amounts of data while still providing sufficient speed to handle data in motion. This comes at a cost in terms of latency and maintenance requirements, because the two processing paths are joined in an extra serving layer to achieve high accuracy, fault tolerance and scalability.
• Kappa: In contrast to the Lambda architecture, the Kappa architecture concentrates only on real-time processing, whether the data is historical or real time. In the absence of a batch processing system, the Kappa architecture is less costly, less complex and more consistent. The data after processing is saved in a storage system which can be queried both in batches and in streams. This technique requires high performance, idempotency and dependability.

1.6 Benefits of Streaming Data Architecture

Nowadays many organizations use streaming data analytics, not only because of the rising demand for processing real-time data, but also because of the many benefits organizations can gain from a streaming data architecture. Some are listed below:

• Ease in Scalability
• Pattern Detection
• Modern Real-time Data Solutions Enabling
• Improved Customer Experience

2 Literature Review

A summary of real-time data stream processing for industrial fault detection is well explained in [4]. The main focus is on data stream analysis for industrial applications, on identifying industrial needs and, from these, the requirements for designing a potential Data Stream Management System. Recognizing the industrial needs and challenges helps identify improvements in this area. A Data Stream Management System based monitoring system was proposed to implement the given suggestions. The proposed monitoring system benefits from applying a combination of various fault detection methods, such as analytical, data-driven and knowledge-based methods.
Another data processing method that engages in online processing of various events is called event stream processing (ESP). Researchers have looked at how best to employ the platform in special use cases. From a commercial point of view, decision makers look for the best utilization of those events with the least delay, in order to gain insight in real time, mine textual events and propose decisions. This needs a mix of batch processing, machine learning and stream processing technologies, which are usually optimized separately. However, combining all these technologies by constructing a real-world application with good scalability is a big challenge. In [2], the latest event stream processing methodologies are made clearly understandable through the summarization and definition of data flow architectures, frameworks and architectures, and textual processing with sentiment analysis and textual events to improve a reference model.
A general overview of real-time big data analytics, its present architecture, available methods of data stream processing and system architectures is given in [5, 23]. The conventional approach to evaluating enormous data is unsuitable for real-time analysis; for that reason, analyzing big data streams remains a decisive matter for many applications. In big data analytics and real-time analytics it is vital to process data at the position where it arrives, with speedy response and good decision making, necessitating the development of an original model that works for high-speed and low-latency real-time processing.
One important consideration is securing the real-time stream. Like other network security, stream security can also be built up using the pillars of the Confidentiality, Integrity and Availability (CIA) model [6]. However, most realistic implementations focus only on the first two aspects, i.e. confidentiality and integrity, by means of important techniques such as encryption and signatures. An access control mechanism is introduced to be enforced on the stream, which adds extra security metadata to the streams. This metadata can allow or disallow access to stream elements and also protect the privacy of data. The work is explained using the example of the Apache Storm streaming engine.
An analysis of present big data software models for a variety of application domains offers outcomes to support researchers in future work. It has identified recurring general motivations for adopting big data software architectures, for example improving efficiency, improving data processing in real time, reducing development costs, supporting the analytics process and enabling novel services, together with shared work [7]. It has been observed that business restrictions differ for every application area; thus targeting a big data software application for a particular application area requires tailoring the common reference models into area-specific reference models. The study evaluates big data and its software architectures for distinct use cases from different application domains, along with their consequences, and discusses recognized challenges and possible enhancements.
The phrase big data is used for complex data which is also hard to process. It has numerous features called the 6 Vs, popularly meaning value, variability, variety, velocity, veracity and volume. Several applications can produce massive data that also grows quickly in a short time. This fast data is supposed to be handled with the various approaches that exist in the field of big data solutions. Some open-source technologies such as Apache Kafka and NoSQL databases were proposed to build a stream architecture for big data velocity [8, 10]. It has been observed that there is increasing interest in analyzing big data stream processing (big data in motion) rather than big data batch processing (big data at rest). Some issues, such as consistency, fault tolerance, scalability, integration, heterogeneity, timeliness, load balancing, heavy throughput and privacy, have been identified as needing more research attention. Even after much work on these issues, load balancing, privacy and scalability mainly remain to be addressed.
The data integration layer allows geospatial subscription using the GeoMQTT protocol. This makes it possible to perform target-specific data integration while preserving the capability of gathering data from IoT devices, owing to the resource utilization efficiency of GeoMQTT. The authors have utilized the latest methods for stream processing; the framework is known as Apache Storm. It works as the central tool for their model, with Apache Kafka as the tool for the GeoMQTT broker and Apache Storm as the message processing system. Their proposed design could be used to build applications for many use cases where distributed stream processing methods and algorithms are deployed and evaluated on spatio-temporal data streams originating from IoT devices [11].
The introduction of a 7-layered architecture and its comparison with a 1-layer-based architecture became important because, until this point, no general architecture for data streaming analysis was scalable and flexible [12]. Data extraction, data transformation, data filtering and data aggregation are performed in the first six layers of the architecture, while the seventh and last layer carries the analytic models. This 7-layered architecture consists of microservices and publish-subscribe software. Several studies have shown that this is the setup which can ensure a solution with

low coupling and high cohesion, which leads to increased scalability and maintainability. Asynchronous communication also exists between the layers. Practical experience in the field of financial and e-commerce applications shows that this 7-layered architecture would be helpful for a large number of business use cases.
A new data stream model named Aurora basically manages data streams for application monitoring [13]. It differs from traditional business data processing. Since the software is required to process and respond to frequent inputs coming from numerous and varied sources such as sensors, rather than from human operators, the elementary architecture of a DBMS has to be rethought for this area of application. The authors therefore present Aurora, a new DBMS, give a basic overview of its architecture, and then specifically explain a set of stream-oriented operators.
Table 1 represents the important findings of various studies in the field of Data
Streaming.

3 Data Stream Management System

A Data Stream Management System (DSMS) is a software application just like a Database Management System (DBMS). A DSMS involves processing and management of endlessly flowing data streams rather than working on static data such as Excel or PDF files. It deals with data streams from different sources such as financial reports, sensor data, social media feeds, etc.
Similar to a DBMS, a DSMS also provides a broad range of operations such as analysis, integration, processing and storage, and it also generates visualizations and reports for data streams.

3.1 Peeping in Data Stream Management System

Following [14], Fig. 5 simply shows all the components of the stream data model. A simple explanation of Fig. 5 follows.
• In a DSMS there is a stream processor, which is a kind of data-management system organized at a high level. Any number of streams can enter the system, and these streams need not be uniform in their incoming rate. Streams may be archived in a large archival store, but we assume that it is not feasible to answer queries from this archival store. Therefore, a working store is also present, where parts of streams or summaries may be placed; this working store is used for answering queries. It might be disk or sometimes main memory, depending on the speed needed to process queries. Either way, it is of sufficiently limited capacity that it cannot store all of the data from all of the streams.

Table 1 Important findings of various studies


S. No Reference Contribution
1 [4] A Data Stream Management System based monitoring system was projected to
improve the industrial needs, its application and resulting requirements for data stream
analysis
2 [2] To improve Event stream processing (ESP), different event stream processing
methodologies are discussed with the help of data flow architectures summarization
and definition, architectures and frameworks and other textual use cases
3 [5, 23] A general overview of real-time big data analytics, its present architecture, available methods of data stream processing and system architectures [5, 23]. The conventional approach to evaluating enormous data is unsuitable for real-time analysis; for that reason, analyzing data streams in the field of big data remains a decisive matter for many applications and their utilization
4 [6] Introduction to an access control mechanism and its implementation on the stream
which adds extra security metadata to the streams. The use of this metadata can allow
or disallow admittance to stream elements also give protection to the isolation of data.
All this is explained using Apache Storm streaming engine
5 [7] It has been studied that the business restrictions contrast for every application area,
thus to target a software application of big data of particular application area requires
couture of the common reference models to area specific reference model to enhance
6 [8, 10] Some open-source technologies such as Apache Kafka and NoSQL databases were proposed to generate a stream architecture for big data, especially to handle velocity [8, 10]. It has been observed that there is increased interest in analyzing big data stream processing (big data in motion) rather than big data batch processing (big data at rest)
7 [11] The layer of data integration allows geospatial subscription, using the GeoMQTT
protocol. This is able to work for target specific data integration at the same time as to
preserve potentiality of congregation data received from IoT devices because of the
reason of resource utilization efficiency in GeoMQTT
8 [12] The introduction of a 7-layered architecture and its comparison with a 1-layer-based architecture became important because, until this point, no general architecture for data streaming analysis was scalable and flexible. This 7-layered architecture consists of microservices and publish-subscribe software. It is the setup which can ensure a solution with low coupling and high cohesion, leading to increased scalability and maintainability. Asynchronous communication also exists between the layers
9 [13] In this paper, the brief description Aurora is given. It basically manages data streams
for application monitoring and it differs from traditional business data processing.
Aurora is a new DBMS designed and operated at Brandeis University, Brown
University, and M.I.T

Some examples of Stream Sources are Sensor Data, Image Data, Internet and Web
Traffic.
• Stream Queries: One way to ask queries about streams is to place them inside the processor, where standing queries are stored. These queries are, in a sense, permanently executing, and produce output at appropriate times. The other kind of query is ad-hoc. To support a wide variety of ad-hoc queries, a common approach is to store a sliding window of each stream in the working store.

Fig. 5 A data stream management system

3.2 Filtering Streams

Filtering or selection is a common process on streams. Bloom filtering is used to handle large data sets.

3.2.1 Introduction to Bloom Filters

• A space-efficient, probabilistic data structure called a Bloom filter is used to test the membership (i.e. presence) of an element in a set, that is, whether it belongs to the set or not.
• For example, checking the availability of a username against the set of all registered usernames is a set membership problem.
• The probabilistic nature of a Bloom filter means that there is a chance of some false positive results: the filter may report that a given username is already taken even though it is actually not taken. This is known as a false positive.

3.2.2 Working of Bloom Filter

First, take a bit array of m bits as an empty Bloom filter and set all these bits to zero.

• For a given input, we need K hash functions to calculate the hashes.
• Indices are calculated using the hash functions. So, when we want to add an item y to the filter, the bits at the K indices f1(y), f2(y), …, fK(y) are set.

Example 5.1: Suppose we want to insert the word “throw” into the filter and we have three hash functions to use. Initially, a bit array of length 10 is produced, with all bits set to 0. First we calculate the hashes as follows:

f1 (“throw”) % 10 = = 1
f2 (“throw”) % 10 = = 4
f3 (“throw”) % 10 = = 7

Note that these outputs are chosen arbitrarily for explanation purposes only.
Now, we set the bits at indices 1, 4 and 7 to 1.

Now again, if we want to enter the word “catch”, we will calculate hashes in similar
manner.
f1 (“catch”) % 10 = = 3
f2 (“catch”) % 10 = = 5
f3 (“catch”) % 10 = = 4
Set 1 on the bits at indices 3, 5 & 4.

• Now, to check whether the word “throw” is present in the filter, we repeat the same process: calculate the respective hashes using f1, f2 and f3 and check whether, in the bit array, all of those indices are set to 1.
• If all of these bits are set to 1, then we can say that “throw” is “probably present”.
• Otherwise, if any of the bits at these indices is 0, then “throw” is “definitely not present”.
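A compact Python sketch of the filter just described is given below. It derives the K index functions by salting a single hash, which is one common construction rather than the specific functions f1, f2, f3 of the example; the parameters m = 10 and K = 3 mirror the example.

```python
# Sketch of a Bloom filter: an m-bit array and K salted hash functions.
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m = m                  # number of bits in the array
        self.k = k                  # number of hash functions
        self.bits = [0] * m

    def _indices(self, item):
        # Derive K indices by hashing the item with K different salts.
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for i in self._indices(item):
            self.bits[i] = 1

    def might_contain(self, item):
        # True means "probably present"; False means "definitely not present".
        return all(self.bits[i] for i in self._indices(item))

bf = BloomFilter(m=10, k=3)
bf.add("throw")
bf.add("catch")
print(bf.might_contain("throw"))   # True (probably present)
print(bf.might_contain("zebra"))   # usually False; True would be a false positive
```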

3.2.3 False Positive in Bloom Filters

One question arises here: why did we say “probably present”, and what is the reason for this uncertainty? Let us take an example.

Example 5.2: Suppose we want to check whether the word “bell” is present or not. We calculate the hashes using f1, f2 and f3.

f1 (“bell”) % 10 = = 1
f2 (“bell”) % 10 = = 3
f3 (“bell”) % 10 = = 7

• Now, if we look at the bit array, the bits at these resulting indices are set to 1, yet we know that the word “bell” was never added to the filter. The bits at indices 1 and 7 were set when we added the word “throw”, and bit 3 was set when we added the word “catch”.
• By controlling the size of the Bloom filter, we can also control the probability of getting a false positive.
• The probability of a false positive falls as the bit array grows, and for a given array size there is an optimum number of hash functions (see Eqs. 1–3 below).

3.2.4 Probability of False Positivity, Size of Bit Array and Optimum No. of Hash Functions

The probability of false positivity is shown in Eq. (1):

P = (1 − [1 − 1/m]^(Kn))^K    (1)

where m = bit array size, n = number of elements expected in the filter, and K = number of hash functions.

Equation (2) gives the size of the bit array:

m = −n log P / (log 2)^2    (2)

The optimum number of hash functions is shown in Eq. (3):

K = (m/n) log 2    (3)
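A short numerical check of Eqs. (2) and (3), with an assumed n of one million expected elements and a 1% target false positive rate (illustrative figures only, using the natural logarithm):

```python
# Sketch: size a Bloom filter from Eqs. (2) and (3).
import math

n = 1_000_000          # expected number of elements (assumed)
P = 0.01               # target false positive probability (assumed)

m = -n * math.log(P) / (math.log(2) ** 2)   # Eq. (2): bit array size
K = (m / n) * math.log(2)                   # Eq. (3): optimal number of hash functions

print(f"bit array size m ≈ {math.ceil(m):,} bits")
print(f"hash functions K ≈ {round(K)}")
```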

3.2.5 Interesting Properties of Bloom Filters

• It is interesting to note that Bloom filters never generate a false negative result, which means that if the username actually exists, the filter will never tell you that it does not exist.
• It is not possible to delete elements from a Bloom filter, because clearing the bits (generated by the K hash functions) at the indices of a single element may also delete a few other elements.
• For example, if we delete the word “throw” (in the example above) by clearing the bits at indices 1, 4 and 7, we may end up deleting the word “catch” as well, because the bit at index 4 becomes 0 and the Bloom filter then claims that “catch” is not present.

3.3 Count Distinct Elements in a Stream

Another kind of processing needed is to count the distinct elements in a stream, where stream elements are assumed to be chosen from some universal set. One would like to approximate how many unique elements have appeared in the stream, counting either from the start of the stream or from some known point in the past. Here we describe the Flajolet-Martin (FM) Algorithm for counting the distinct or unique elements in a stream.

3.3.1 Introduction to the Flajolet-Martin (FM) Algorithm

• The Flajolet-Martin Algorithm is used for approximating the number of distinct or unique objects in a stream or in a database in a single pass.
• If the given stream contains n elements with m unique elements among them, then this algorithm runs in O(n) time and requires O(log m) memory.
• Even when the number of possible distinct elements in the stream is maximal, the space consumption is therefore only logarithmic.

Example 5.3: Determine the distinct elements in the stream using the Flajolet-Martin Algorithm.

• Input stream of integers y = = 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1


• Hash Function, f(y) = = 6y + 1 mod 5

Step 1: Calculating Hash function [f(y)]


f(y) == 6y +1 mod 5
f(1) == 6(1) +1 mod 5

== 7 mod 5
f(1) == 2
Similarly, calculating hash function for the remaining input stream;
f(1) = 2 f(1) = 2 f(4) = 0 f(2) = 3
f(3) = 4 f(2) = 3 f(3) = 4 f(3) = 4
f(2) = 3 f(3) = 4 f(1) = 2 f(1) = 2

Step 2: Write binary functions for hash functions calculated

f(1) = 2 = 010 f(1) = 2 = 010 f(4) = 0 = 000 f(2) = 3 = 011


f(3) = 4 = 100 f(2) = 3 = 011 f(3) = 4 = 100 f(3) = 4 = 100
f(2) = 3 = 011 f(3) = 4 = 100 f(1) = 2 = 010 f(1) = 2 = 010

Step 3: Trailing zeros; now, count the trailing zeros in the binary form of each hash value

f(1) = 2 = 010 = 1 f(1) = 2 = 010 = 1 f(4) = 0 = 000 = 0 f(2) = 3 = 011 = 0
f(3) = 4 = 100 = 2 f(2) = 3 = 011 = 0 f(3) = 4 = 100 = 2 f(3) = 4 = 100 = 2
f(2) = 3 = 011 = 0 f(3) = 4 = 100 = 2 f(1) = 2 = 010 = 1 f(1) = 2 = 010 = 1

Step 4: Take the maximum number of trailing zeros

• The value of r = 2
• The estimated distinct count is R = 2^r = 2^2 = 4
• Hence, there are 4 distinct elements: 1, 2, 3 and 4.
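The procedure of Example 5.3 can be written directly in Python; the sketch below reuses the same stream and the same hash function f(y) = (6y + 1) mod 5, and follows the example's convention of treating an all-zero hash value as having zero trailing zeros.

```python
# Sketch of the Flajolet-Martin estimate from Example 5.3.
def trailing_zeros(value):
    """Count trailing zeros in the binary form of value (0 -> 0 here,
    matching the convention used in the worked example)."""
    if value == 0:
        return 0
    count = 0
    while value % 2 == 0:
        count += 1
        value //= 2
    return count

def h(y):
    return (6 * y + 1) % 5                  # hash function from the example

stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]

r = max(trailing_zeros(h(y)) for y in stream)   # longest run of trailing zeros
estimate = 2 ** r                                # R = 2^r

print(f"r = {r}, estimated distinct elements R = {estimate}")   # r = 2, R = 4
```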

3.4 Counting Ones in a Window

For counting ones (1s) in a window, the Datar-Gionis-Indyk-Motwani (DGIM) Algorithm is the simplest to use.

3.4.1 Introduction to the Datar-Gionis-Indyk-Motwani Algorithm

• The Datar-Gionis-Indyk-Motwani Algorithm is designed to find the number of 1s in a given data set.
• The DGIM algorithm uses O(log^2 N) bits to represent and process a window of N bits.
• It estimates the number of ones in the window with an error of no more than 50%.

3.4.2 Components of DGIM Algorithm

• Timestamps and buckets are the two components of DGIM.
• Each arriving bit has a timestamp for its arrival position.
• The timestamp of the first bit is 1, the timestamp of the second is 2, and so on. Positions are tracked relative to the window size N (this size is generally taken as a power of 2).
• The window is divided into buckets consisting of 1s and 0s.

3.4.3 Rules for Forming the Buckets

i. It is mandatory that the right side of the bucket should always start with 1.
(Sometimes, if it starts with a 0, then it should be neglected.) For example: If
1001011 is a bucket of size 4, and it contains the four 1s and it starts with 1 on
its right end.
ii. It is a necessary condition that each bucket should contain at least one 1,
otherwise no bucket can be formed.
iii. All of the buckets should be in power of 2.
iv. As we move to the left, the size of buckets cannot decrease in size (move in
increasing order toward left).
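A compact sketch of DGIM bucket maintenance under these rules is shown below. It keeps at most two buckets of each size, drops a bucket once it slides out of the window, and estimates the count as the sizes of all buckets plus half of the oldest one; this is a simplified illustration under those assumptions, not a production implementation.

```python
# Sketch of DGIM: approximate count of 1s in the last N bits of a stream.
class DGIM:
    def __init__(self, window_size):
        self.N = window_size
        self.t = 0                 # current timestamp
        self.buckets = []          # (timestamp of most recent 1, size), newest first

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket if it has slid entirely out of the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit == 1:
            self.buckets.insert(0, (self.t, 1))
            self._merge()

    def _merge(self):
        # Keep at most two buckets of any size; merge the two oldest of a size.
        size = 1
        while True:
            same = [i for i, (_, s) in enumerate(self.buckets) if s == size]
            if len(same) <= 2:
                break
            i, j = same[-2], same[-1]                 # the two oldest buckets of this size
            self.buckets[i] = (self.buckets[i][0], size * 2)   # keep the newer timestamp
            del self.buckets[j]
            size *= 2

    def estimate_ones(self):
        if not self.buckets:
            return 0
        full = sum(s for _, s in self.buckets[:-1])
        return full + self.buckets[-1][1] // 2        # half of the oldest bucket

dgim = DGIM(window_size=8)
for b in [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]:
    dgim.add(b)
print(dgim.estimate_ones())   # approximate count (true count in the last 8 bits is 6)
```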

3.5 Data Processing Frameworks

In the vast field of big data, it is essential to handle big data streams effectively. In this section, taking big data stream processing as a basis, we elaborate on several data processing frameworks; most of them are free and open-source software [15].

3.5.1 Data Processing Frameworks

In this section, we compare and explain different stream processing engines (SPEs). Let us select six among them:
i. Apache Spark
ii. Apache Flink
iii. Apache Storm
iv. Apache Heron
v. Apache Samza
vi. Amazon Kinesis
Before describing these, one important point should be noted: their predecessor, Apache Hadoop, is included for historical reasons.

• Hadoop is known as the very first framework that appeared for processing huge datasets using the MapReduce programming model. It is scalable in nature, because it can run on a single machine or extend to several clusters of multiple machines. Furthermore, Hadoop takes advantage of distributed storage to get better performance: instead of moving the data, it ships the code that is supposed to process the data. Hadoop also provides high availability and heavy throughput. However, it can have efficiency problems when handling small files.
• Above all, the main limitation of Hadoop is that it does not support real-time stream processing. To handle this limitation, Apache Spark came to light. Apache Spark is a framework for batch processing and streaming of data, and it also allows distributed processing. Spark was intended to respond to three big troubles of Hadoop:
i. Handle iterative algorithms that make a number of passes through the data efficiently.
ii. Permit real-time streaming and interactive queries.
iii. In place of MapReduce, Apache Spark uses RDDs (Resilient Distributed Datasets), which are fault tolerant and support parallel processing.
• Two years later, Apache Flink and Apache Storm appeared. Flink can do both batch processing and streaming of data, and streams with precise sequential requirements can be processed in Flink. Storm and Flink are comparable frameworks, with the following features:
i. Storm allows only stream processing.
ii. Storm and Flink can both perform low-latency stream processing.
iii. The API of Flink is high level and has rich functionality.
iv. To provide fault tolerance, Flink uses a snapshot algorithm, whereas Storm uses record-level acknowledgements.
v. Limitations of Storm are its low scalability and the complexity of debugging and managing it.
• Apache Heron then came after Storm.
• Apache Samza provides event-based applications, real-time processing, and ETL (Extract, Transform and Load) capabilities. It provides numerous APIs and has a model like Hadoop, but in place of MapReduce it has the Samza API, and it uses Kafka instead of the Hadoop Distributed File System.
• Amazon Kinesis is the only framework in this section that does not belong to the Apache Software Foundation. Kinesis is in reality a set of four services rather than a single data stream framework. Kinesis can easily be integrated with Flink.
All of these are summarized in Table 2.
Table 2 Data processing frameworks [15]

S. No | Framework | Inventor | Incubation year | Processing | Delivery of events | Latency | Throughput | Scalability | Fault tolerance
1 | Apache Hadoop | Apache Software Foundation | N.A. | Batch | N.A. | High | High | High | Replication in the HDFS
2 | Apache Spark | University of California | 2013 | Micro-batch, batch and stream | Exactly once | Low | High | High | RDD (Resilient Distributed Dataset)
3 | Apache Flink | Apache Software Foundation | 2014 | Batch and stream | Exactly once | Low | High | High | Incremental checkpointing (with the use of markers)
4 | Apache Storm | Backtype | 2013 | Stream | At least once | Low | High | High | Record-level acknowledgements
5 | Apache Heron | Twitter | 2017 | Stream | At most once, at least once, exactly once | Low | High | High | High fault tolerance
6 | Apache Samza | LinkedIn | 2013 | Batch and stream | At least once | Low | High | High | Host affinity & incremental checkpointing
7 | Amazon Kinesis | Amazon | N.A. | Batch and stream | At least once | Low | High | High | High fault tolerance
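As a small taste of how such an engine is used, the sketch below is the classic structured-streaming word count in PySpark over a socket source; the host and port are assumptions, and an equivalent job could be expressed in any of the stream processors in Table 2.

```python
# Sketch: Spark Structured Streaming word count over a socket source.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

# Unbounded input: lines arriving on a local socket (e.g. started with `nc -lk 9999`).
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Split each line into words and keep a running count per word.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print the updated counts to the console.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```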

3.6 Machine Learning for Streaming Data

In [17] several challenges for streaming data in the field of machine learning are discussed. Overcoming them would help in:
i. Exploring the relationships between many AI developments (e.g., RNNs, reinforcement learning) and adaptive stream mining algorithms;
ii. Characterizing and detecting drifts when immediately labeled data is absent;
iii. Developing adaptive learning techniques that can work under verification latency;
iv. Incorporating preprocessing techniques that can transform the raw data in a continuous manner.
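One practical way these challenges are approached is incremental (online) learning, where the model is updated batch by batch instead of being retrained on all past data. A minimal sketch using scikit-learn's partial_fit on synthetic mini-batches (which stand in for a real stream) is shown below.

```python
# Sketch: incremental learning over a stream of mini-batches with partial_fit.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
classes = np.unique(y)                    # all class labels must be declared up front

model = SGDClassifier(random_state=0)

batch_size = 500
for start in range(0, len(X), batch_size):   # each slice stands in for newly arrived data
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]
    model.partial_fit(X_batch, y_batch, classes=classes)

print("accuracy on the most recent batch:", model.score(X_batch, y_batch))
```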

4 Challenges and Their Solutions in Streaming Data Processing

This section briefly explains the challenges in streaming data processing and various
solutions to overcome them for better outcomes.

4.1 Challenges in Streaming Data Processing

Many cases show that there are numerous challenges to be faced when processing streaming data. Some noticeable challenges are:
i. Unbounded Memory Requirements for High Data Volume
The main aim of processing streaming data is to manage incoming data which is produced at very high velocity, in huge volume, and in a continuous manner in real time, from a very large number of sources. As there is no finite end defined for continuously produced data streams, the data processing infrastructure must also be designed for unbounded memory requirements.
ii. Architectural Complexity and Infrastructure Monitoring
Data stream processing systems are frequently distributed and must be able to handle a large number of parallel connections and data sources, which can be hard to accomplish and to monitor for any issues that may arise, particularly at scale.
iii. Coping with the Dynamic Nature of Streaming Data
Because of the dynamic nature of streaming data, stream processing systems have to be adaptive in order to handle concept drift, which renders some data processing methods inappropriate, and to operate with restricted memory and time.
iv. Data Streams Query Processing

It is very challenging to process queries on data streams due to the unbounded data and its dynamic nature. Many subqueries are also required to complete the process. Consequently, it is essential for the stream processing algorithm to be memory-efficient and capable of processing data rapidly enough to keep up with the rate at which new data items arrive.
v. Debugging and Testing of Streaming Data Processing
Debugging data streams in the system environment and testing bundled data is very challenging. It requires comparison with many other streaming data processes to enhance quality.
vi. Fault Tolerance
It is also important to check the scalability of fault tolerance in DSMS.
vii. Data Integrity
Some type of data validation is essential to preserve the integrity of the data
being processed by a DSMS. It can be done by using various schemes such as
hash function, digital signatures, encryption, etc.
viii. Managing Delays
Due to various reasons such as backpressure from downstream operators, network congestion or slow processors, delays can happen in data stream processing. There are various ways to handle these delays, depending on the specific requirements of the application; some delay-handling methods use watermarks and sliding windows.
ix. Handling Backpressure
Buffering the flow, adaptive operators, data partitioning, and dropping of data items are some techniques to deal with backpressure, a state that occurs in data stream processing when downstream operators consume data more slowly than upstream operators produce it. Backpressure leads to an increase in latency and may cause data loss if operator buffers start to fill up.
x. Computational and Cost Efficiency
As a DSMS is essentially used to handle high-volume data produced at high velocity from various sources, it is challenging to control its computational and cost efficiency.
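The watermark and sliding-window techniques mentioned under "Managing Delays" above can be illustrated with a minimal, self-contained Python sketch. The window length, slide interval, allowed lateness, and the toy event stream are illustrative assumptions rather than values taken from any particular system.

from collections import defaultdict

# Minimal sketch: sliding-window counts over an out-of-order event stream,
# with a watermark that drops events arriving more than ALLOWED_LATENESS
# seconds behind the latest event time seen so far.
WINDOW_SIZE = 10       # window length in seconds (assumed)
SLIDE = 5              # a new window starts every SLIDE seconds (assumed)
ALLOWED_LATENESS = 5   # lateness tolerated before an event is discarded

counts = defaultdict(int)   # window start time -> number of events
watermark = 0               # highest event time observed so far

def window_starts(t):
    # Start times of every sliding window [s, s + WINDOW_SIZE) containing t.
    starts, s = [], (t // SLIDE) * SLIDE
    while s > t - WINDOW_SIZE:
        if s >= 0:
            starts.append(s)
        s -= SLIDE
    return starts

def process_event(t):
    # Count the event in its windows unless the watermark marks it as too late.
    global watermark
    watermark = max(watermark, t)
    if t < watermark - ALLOWED_LATENESS:
        return                        # too late: drop or divert to a side output
    for s in window_starts(t):
        counts[s] += 1

for t in [1, 3, 12, 2, 25, 11, 4]:    # toy timestamps arriving out of order
    process_event(t)
print(dict(sorted(counts.items())))   # late events 2, 11 and 4 are discarded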

4.2 Ways to Overcome the Streaming Data Processing Challenges

Though data stream processing has key challenges, there are some ways to overcome
these challenges, which include:
i. Use the proper mixture of on-premises and cloud-based resources and services [22].
ii. Choose the right tools.
iii. Set up consistent infrastructure for monitoring data processing and integration, to improve efficiency with data skipping and operator pipelining.

iv. Partition data streams to increase overall throughput.
v. Adjust the processing rate automatically with an adaptive operator.
vi. Avoid backpressure by implementing proper flow control.
By adopting these techniques, one can overcome the challenges in processing streaming data and enhance its use in real-time data analytics.

5 Conclusion

This chapter concludes that in today's era there is an unstoppable flow of data, which may be unstructured, semi-structured, or structured and can be produced from any source such as transactional systems, social media feeds, IoT devices, or other real-time applications. The existence of this kind of continuously produced data requires processing, analysis, and reporting. Past batch processing systems are limited in handling it because of their finite data handling nature, and streaming data processing has emerged to cope with data streams. Given the importance of the data stream model, a query processor should be powerful enough to retrieve data at any scale and properly manage the storage system. A DSMS does not work in a single pass; it comprises step-by-step processing consisting of data production, data ingestion, data processing, streaming data analytics, data reporting, data visualization, and decision making. The importance of, and requirement for, more enhanced event stream processing models is evident.
The popularity of Apache Hadoop and its limitations in some areas resulted in the development of Apache Spark, Apache Flink, Apache Storm, Apache Heron, Apache Samza, and Amazon Kinesis; these are only a few, and many hybrid forms exist. After studying much of the research, it is found that load balancing, privacy, and scalability issues still need further work, and significant research effort should also be given to the preprocessing stage of big data streams. This chapter also sheds light on overcoming the challenges of streaming data; with the proper mix of approaches, data architecture, and resources, one can easily take advantage of real-time data analytics.
Many researchers cover methodologies for filtering a data stream, counting distinct or unique elements in a data stream, and counting ones in a data stream; Bloom filtering, FCM, and DGIM exist for these purposes. However, in real-time data analysis many more features should be extracted, which can be the focus of future researchers and may help in enlarging the application area of streaming data.

References

1. Eberendu, A.: Unstructured data: an overview of the data of Big Data. Int. J. Emerg. Trends
Technol. Comput. Sci. 38(1), 46–50 (2016). https://doi.org/10.14445/22312803/IJCTT-V38
P109

2. Bennawy, M., El-Kafrawy, P.: Contextual data stream processing overview, architecture, and
frameworks survey. Egypt. J. Lang. Eng. 9(1) (2022). https://ejle.journals.ekb.eg/article_2
15974_5885cfe81bca06c7f5d3cd08bff6de38.pdf
3. Akili, S., Matthias, P., Weidlich, M.: INEv: in-network evaluation for event stream processing.
Proc. ACM on Manag. Data. 1(1), 1–26 (2023). https://doi.org/10.1145/3588955
4. Alzghoul, A.: Monitoring big data streams using data stream management systems: industrial
needs, challenges, and improvements. Adv. Oper. Res. 2023(2596069) (2023). https://doi.org/
10.1155/2023/2596069
5. Hassan, A., Hassan, T.: Real-time big data analytics for data stream challenges: an overview.
EJCOMPUTE. 2(4) (2022). https://doi.org/10.24018/compute.2022.2.4.62
6. Nambiar, S., Kalambur, S., Sitaram, D.: Modeling access control on streaming data in apache
storm. (CoCoNet’19). Proc. Comput. Sci. 171, 2734–2739 (2020). https://doi.org/10.1016/j.
procs.2020.04.297
7. Avci, C., Tekinerdogan, B., Athanasiadis, I.: Software architectures for big data: a systematic
literature review. Big Data Anal. 5(5) (2020). https://doi.org/10.1186/s41044-020-00045-1
8. Hamami, F., Dahlan, I.: The implementation of stream architecture for handling big data
velocity in social media. J. Phys. Conf. Ser. 1641(012021) (2020). https://doi.org/10.1088/
1742-6596/1641/1/012021
9. Kenda, K., Kazic, B., Novak, E., Mladenić, D.: Streaming data fusion for the internet of things.
Sensors 2019. 19(8), 1955 (2019). https://doi.org/10.3390/s19081955
10. Kolajo, T., Daramola, D., Adebiyi, A.: Big data stream analysis: a systematic literature review.
J. Big Data. 6(47) (2019). https://doi.org/10.1186/s40537-019-0210-7
11. Laska, M., Herle, S., Klamma, R., Blankenbach, J.: A scalable architecture for real-time stream
processing of spatiotemporal IoT stream data—performance analysis on the example of map
matching. ISPRS Int. J. Geo-Inf. 7(7), 238 (2018). https://doi.org/10.3390/ijgi7070238
12. Hoque, S., Miranskyy, A.: Architecture for Analysis of Streaming Data, Conference: IEEE
International Conference on Cloud Engineering (IC2E) (2018). https://doi.org/10.1109/IC2E.
2018.00053
13. Abadi, D., Etintemel, U.: Aurora: a new model and architecture for data stream management.
VLDB J. 12(2), 12–139 (2003). https://doi.org/10.1007/s00778-003-0095-z
14. Leskovec, J., Rajaraman, A., Ullman, J.: Mining of Massive Datasets. Cambridge
University Press, England (2010)
15. Almeida, A., Brás, S., Sargento, S., Pinto, F.: Time series big data: a survey on data stream
frameworks, analysis and algorithms. J Big Data 10(1), 83 (2023). https://doi.org/10.1186/s40
537-023-00760-1
16. Sujatha, C., Joseph, G.: A survey on streaming data analytics: research issues, algorithms,
evaluation metrics, and platforms. In: Proceedings of International Conference on Big Data,
Machine Learning and Applications, pp. 101–118 (2021). https://doi.org/10.1007/978-981-33-
4788-5_9
17. Gomes, H., Bifet, A.: Machine learning for streaming data: state of the art, challenges, and
opportunities. ACM SIGKDD Explor. Newsl. 21(2), 6–22 (2019). https://doi.org/10.1145/337
3464.3373470
18. Aguilar-Ruiz, J., Bifet, A., Gama, J.: Data stream analytics. Analytics 2(2), 346–349 (2023).
https://doi.org/10.3390/analytics2020019
19. Rashid, M., Hamid, M., Parah, S.: Analysis of streaming data using big data and hybrid
machine learning approach. In: Handbook of Multimedia Information Security: Techniques
and Applications, pp. 629–643 (2019). https://doi.org/10.1007/978-3-030-15887-3_30
20. Samosir, J., Santiago, M., Haghighi, P.: An evaluation of data stream processing systems for
data driven applications. Proc. Comput. Sci. 80, 439–449 (2016). https://doi.org/10.1016/j.
procs.2016.05.322
21. Geisler, S.: Data stream management systems. In: Data Exchange, Integration, and Streams.
Computer Science. Corpus ID: 12168848. 5, 275–304 (2013). https://doi.org/10.4230/DFU.
Vol5.10452.275

22. Singh, P., Singh, N., Luxmi, P.R., Saxena, A.: Artificial intelligence for smart data storage
in cloud-based IoT. In: Transforming Management with AI, Big-Data, and IoT, 1–15 (2022).
https://doi.org/10.1007/978-3-030-86749-2_1
23. Abdullah, D., Mohammed, R.: Real-time big data analytics perspective on applications, frame-
works and challenges. 7th International Conference on Contemporary Information Technology
and Mathematics (ICCITM). IEEE. 21575180 (2021). https://doi.org/10.1109/ICCITM53167.
2021.9677849
24. Mohamed, N., Al-Jaroodi, J.: Real-time big data analytics: applications and challenges. Inter-
national Conference on High Performance Computing & Simulation (HPCS). IEEE. 14614775
(2014). https://doi.org/10.1109/HPCSim.2014.6903700
25. Deshai, N., Sekhar, B.: A study on big data processing frameworks: spark and storm. In: Smart
Intelligent Computing and Applications, 415–424 (2020). https://doi.org/10.1007/978-981-32-
9690-9_43
Leveraging Data Analytics and a Deep
Learning Framework for Advancements
in Image Super-Resolution Techniques:
From Classic Interpolation
to Cutting-Edge Approaches

Soumya Ranjan Mishra, Hitesh Mohapatra, and Sandeep Saxena

Abstract Image SR is a critical task in the field of computer vision, aiming to


enhance the resolution and quality of low-resolution images. This chapter explores
the remarkable achievements in image super-resolution techniques, spanning from
traditional interpolation methods to state-of-the-art deep learning approaches. The
chapter begins by providing an overview of the importance and applications of
image super-resolution in various domains, including medical imaging, surveillance,
and remote sensing. The chapter delves into the foundational concepts of classical
interpolation techniques such as bicubic and bilinear interpolation, discussing their
limitations and artifacts. It then progresses to explore more sophisticated interpo-
lation methods, including Lanczos and spline-based approaches, which strive to
achieve better results but still encounter challenges when upscaling images signifi-
cantly. The focal point of this chapter revolves around deep learning-based methods
for image SR. Convolutional Neural Networks (CNNs) have revolutionized the
field, presenting unprecedented capabilities in producing high-quality super-resolved
images. The chapter elaborates on popular CNN architectures for image super-
resolution, including SRCNN, VDSR, and EDSR, highlighting their strengths and
drawbacks. Additionally, the utilization of Generative Adversarial Networks (GANs)
for super-resolution tasks is discussed, as GANs have shown remarkable potential in
generating realistic high-resolution images. Moreover, the chapter addresses various
challenges in image super-resolution, such as managing artifacts, improving percep-
tual quality, and dealing with limited training data. Techniques to mitigate these chal-
lenges, such as residual learning, perceptual loss functions, and data augmentation,
are analyzed. Overall, this chapter offers a comprehensive survey of the advancements
in image SR, serving as a valuable resource for researchers, engineers, and practi-
tioners in the fields of computer vision, image processing, and machine learning. It

S. R. Mishra (B) · H. Mohapatra


School of Computer Engineering, KIIT Deemed to Be University, Bhubaneswar, Odisha, India
e-mail: soumyaranjanmishra.in@gmail.com
S. Saxena
Greater Noida Institute of Technology, Greater Noida, India


highlights the continuous evolution of image SR techniques and their potential to


reshape the future of high-resolution imaging in diverse domains.

1 Introduction

Image SR, an essential task in the field of computer vision, plays a crucial role in
enhancing the resolution and quality of low-resolution images. The ability to recover
high-resolution details from low-resolution inputs has significant implications in
various applications, including medical imaging, surveillance, remote sensing, and
more [1]. As the demand for higher quality visual content continues to grow, the
development of advanced image super-resolution techniques has become a vibrant
research area. In this chapter, we delve into the remarkable advancements in image
super-resolution, tracing its evolution from classical interpolation methods to cutting-
edge deep learning approaches. Our exploration begins with an overview of the
fundamental concepts and importance of image super-resolution in diverse domains
[2].

1.1 Importance of Image SR

Image super-resolution has garnered significant attention due to its potential to


enhance visual content quality and improve various computer vision tasks. In the
realm of medical imaging, high-resolution images play a pivotal role in accurate
diagnosis, treatment planning, and disease monitoring. In surveillance applications,
super-resolution aids in identifying critical details in low-resolution footage, such
as facial features or license plate numbers. Moreover, in remote sensing, super-
resolution techniques enable clearer satellite imagery, facilitating better analysis of
environmental changes and land use. The significance of image super-resolution is
also evident in entertainment industries, where high-quality visual content is crucial
for creating immersive experiences in video games and movies. Figure 1 illustrates the importance of image super-resolution through side-by-side comparisons of low-resolution and super-resolved images of a person's face, a historical document, and a medical image of a patient's tumor [3].

1.2 Classical Interpolation Techniques

Early image super-resolution methods relied on classical interpolation techniques to


upscale low-resolution images. Bicubic and bilinear interpolation were widely used
for decades to increase image resolution. Although straightforward, these methods

Fig. 1 Importance of image super-resolution in different areas

often produced blurred and visually unappealing results, leading to the introduc-
tion of more sophisticated interpolation techniques such as Lanczos and spline-
based methods. While these techniques improved image quality to some extent, they
struggled to handle significant upscaling factors and suffered from artifacts [4].
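As a brief illustration of how these classical interpolation methods are typically applied in practice, the following Python sketch upscales an image by a factor of four with bilinear, bicubic, and Lanczos interpolation. It assumes OpenCV is installed and that a local file low_res.png exists; both the file name and the scale factor are arbitrary choices for illustration.

import cv2

# Upscale a low-resolution image by 4x with three classical interpolation methods.
lr = cv2.imread("low_res.png")   # low-resolution input (assumed to exist)

bilinear = cv2.resize(lr, None, fx=4, fy=4, interpolation=cv2.INTER_LINEAR)
bicubic = cv2.resize(lr, None, fx=4, fy=4, interpolation=cv2.INTER_CUBIC)
lanczos = cv2.resize(lr, None, fx=4, fy=4, interpolation=cv2.INTER_LANCZOS4)

cv2.imwrite("bilinear_x4.png", bilinear)
cv2.imwrite("bicubic_x4.png", bicubic)
cv2.imwrite("lanczos_x4.png", lanczos)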

1.3 Deep Learning for Image SR

The advent of deep learning has revolutionized the field of image super-resolution.
Convolutional Neural Networks (CNNs) emerged as a powerful tool for learning
complex mappings between low-resolution and high-resolution images. The chapter
discusses the pioneering CNN-based models, including the Super-Resolution Convo-
lutional Neural Network (SRCNN), the Very Deep Super-Resolution Network
(VDSR), and the Enhanced Deep Super-Resolution Network (EDSR). These archi-
tectures leverage deep layers to extract hierarchical features from images, leading to
impressive results in terms of visual quality and computational efficiency [5].

1.4 Enhancing Quality with Generative Adversarial Networks

As the pursuit of higher visual fidelity continues, Generative Adversarial Networks


(GANs) entered the stage to further elevate image super-resolution. GANs combine
the power of a generator and discriminator network to generate high-quality super-
resolved images. The chapter explores GAN-based approaches for image super-
resolution, such as SRGAN and ESRGAN, and examines the role of adversarial

loss in guiding the training process. GANs have proven to be particularly effective
in generating photo-realistic details, demonstrating the potential to revolutionize
high-resolution imaging [6].

1.5 Addressing Challenges in Image Super-Resolution

Despite the impressive results achieved with deep learning-based approaches, image
super-resolution still faces several challenges. One of the prominent challenges is
managing artifacts that can arise during the super-resolution process [7].
Additionally, improving perceptual quality and ensuring that the enhanced images
are visually appealing is critical. This chapter delves into the techniques used to
overcome these challenges, including residual learning, perceptual loss functions,
and data augmentation strategies to enrich the training data.
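A minimal sketch of the kind of data augmentation mentioned above is given below: paired LR/HR patches are flipped and rotated together so that the geometric correspondence between them is preserved. The patch sizes and random arrays are stand-ins for real training data.

import numpy as np

# Jointly augment a low-resolution patch and its high-resolution counterpart
# with horizontal/vertical flips and 90-degree rotations.
def augment_pair(lr_patch, hr_patch, rng=np.random):
    if rng.rand() < 0.5:                                   # horizontal flip
        lr_patch, hr_patch = lr_patch[:, ::-1], hr_patch[:, ::-1]
    if rng.rand() < 0.5:                                   # vertical flip
        lr_patch, hr_patch = lr_patch[::-1, :], hr_patch[::-1, :]
    k = rng.randint(4)                                     # rotate 0/90/180/270 degrees
    return np.rot90(lr_patch, k), np.rot90(hr_patch, k)

lr = np.random.rand(32, 32)      # stand-in LR patch
hr = np.random.rand(128, 128)    # stand-in HR patch (4x the LR size)
lr_aug, hr_aug = augment_pair(lr, hr)
print(lr_aug.shape, hr_aug.shape)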

1.6 Real-World Applications and Practical Use Cases

The chapter concludes by showcasing real-world applications and practical use


cases of image super-resolution techniques. These applications span across various
domains, such as medical imaging, surveillance, remote sensing, and entertainment.
From enabling more accurate medical diagnoses to aiding in criminal investiga-
tions through clearer surveillance footage, image super-resolution demonstrates its
vast potential in improving decision-making processes. Finally, we can conclude
that image super-resolution has witnessed significant progress, transitioning from
classical interpolation methods to deep learning-based approaches. The ability to
reconstruct high-quality images from low-resolution inputs has paved the way for
numerous applications in diverse fields. The advancements in deep learning, particu-
larly the integration of GANs, have elevated image super-resolution to new heights,
making photo-realistic high-resolution imaging a reality. While challenges persist,
the ongoing research in this domain promises even more sophisticated techniques
for achieving visually stunning and information-rich high-resolution images, further
fueling the growth of computer vision applications in the future [8].

1.7 Role/Impact of Data Analytics and a Deep Learning


Framework for Image SR Techniques

Role of Data Analytics


• Dataset Preparation: Data analytics helps in the curation and preprocessing of large datasets, which are essential for training deep learning models for image SR. A well-prepared dataset ensures that the model learns diverse and representative features.
• Feature Extraction: Data analytics techniques can be used to find relevant information in the image data. Understanding the characteristics of low-resolution and high-resolution images helps in designing effective models.
• Data Augmentation: Techniques like data augmentation, supported by data analytics, can be applied to artificially increase the diversity of the training dataset. This aids in improving the generalization capability of deep learning models.
• Performance Metrics: Data analytics is instrumental in defining appropriate performance metrics to evaluate the effectiveness of SR models (a minimal PSNR sketch follows after this list).
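Because PSNR is the evaluation metric referred to repeatedly in the remainder of this chapter, a minimal sketch of how it can be computed is given below; the toy images and the 8-bit peak value of 255 are assumptions for illustration only.

import numpy as np

# Peak Signal-to-Noise Ratio between a ground-truth image and a prediction.
def psnr(ground_truth, prediction, max_val=255.0):
    mse = np.mean((ground_truth.astype(np.float64) - prediction.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                      # identical images
    return 10.0 * np.log10((max_val ** 2) / mse)

hr = np.random.randint(0, 256, (128, 128), dtype=np.uint8)                      # toy HR image
sr = np.clip(hr + np.random.randint(-5, 6, hr.shape), 0, 255).astype(np.uint8)  # toy SR result
print(f"PSNR: {psnr(hr, sr):.2f} dB")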
Impact of Deep Learning Frameworks
1. Convolutional Neural Networks (CNNs):
– Deep learning frameworks, particularly those supporting CNN architectures,
have shown remarkable success in image super-resolution tasks. CNNs can
automatically learn hierarchical features from LR images and generate HR
counterparts.
2. Generative Adversarial Networks (GANs):
– GANs, a type of deep learning framework, have been applied to image
super-resolution, introducing a generative model and a discriminative model.
This adversarial training helps in producing high-quality and realistic high-
resolution images.
3. Transfer Learning:
– Deep learning frameworks enable the use of transfer learning, where pre-
trained models on large image datasets can be fine-tuned for specific super-
resolution tasks. This is particularly useful when the available dataset for
super-resolution is limited.

4. End-to-End Learning:
– Deep learning frameworks facilitate end-to-end learning, allowing the model
to directly map LR images to HR outputs. This avoids the need for handcrafted
feature engineering and enables the model to learn complex relationships.
5. Attention Mechanisms:
– Attention mechanisms, integrated into deep learning architectures, enable
models to focus on relevant parts of the image during the SR process. This
improves the overall efficiency and performance of the model.
6. Large-Scale Parallelization:
– Deep learning frameworks support parallel processing, enabling the training
of large and complex models on powerful hardware, which is essential for
achieving state-of-the-art results in image super-resolution.

2 Classical Interpolation: Nearest Neighbor

Nearest neighbor (N-N) interpolation is the simplest technique, where each pixel in the high-resolution image is assigned the value of the nearest pixel in the low-resolution image. This method is fast but often leads to blocky artifacts. Using nearest neighbor interpolation, we aim to upscale a small low-resolution image (e.g., 4 × 4) to an 8 × 8 image. The pixels in the HR image are assigned the values of their nearest neighbors from the low-resolution image [9]. The resulting 8 × 8 image after applying nearest neighbor interpolation is shown in Fig. 2.

Fig. 2 Upscaling to an 8 × 8 image using nearest neighbor interpolation
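A minimal NumPy sketch of the nearest neighbor upscaling described above is shown below; the 4 × 4 input values are made up purely for illustration.

import numpy as np

# Nearest neighbor upscaling: repeat every pixel twice along both axes.
lr = np.arange(16).reshape(4, 4)                        # toy 4x4 low-resolution image
hr = np.repeat(np.repeat(lr, 2, axis=0), 2, axis=1)     # 8x8 nearest-neighbor result
print(hr.shape)   # (8, 8)
print(hr)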



2.1 CNN-Based SR Techniques

The Super-Resolution Convolutional Neural Network (SRCNN), introduced in 2014, is


a three-layer deep convolutional neural network designed specifically for single-
image SR. It processes an LR image as input and generates a corresponding HR
image directly. The network learns to upscale the image while minimizing the recon-
struction error. We trained an SRCNN model on a dataset of low- and high-resolution images. For simplicity, we use grayscale images and a smaller network configuration, preparing a dataset of LR images (e.g., 32 × 32) and their corresponding high-resolution versions (e.g., 128 × 128). Figure 3 illustrates the conversion of an LR image to an SR image [10].
The SRCNN model consists of three layers shown in Fig. 4. The first layer
performs feature extraction using a small kernel size (e.g., 9 × 9), the second layer
increases the number of features, and the third layer maps the feature maps to the high-
resolution image. We train the SRCNN model on the prepared dataset using Mean
Squared Error (MSE) loss to minimize the difference between predicted and ground-
truth high-resolution images. After training, we use the trained SRCNN model to
upscale low-resolution test images and compare the results with the original high-
resolution images. The trained SRCNN model shows impressive results compared to
classical interpolation techniques. The output images exhibit higher levels of detail,
sharper edges, and improved visual quality. Since the introduction of SRCNN, various
other CNN-based super-resolution techniques have been developed, such as EDSR
(Enhanced Deep Super-Resolution), SRGAN (Super-Resolution Generative Adver-
sarial Network), and RCAN (Residual Channel Attention Networks) [11]. These
models have achieved state-of-the-art performance, pushed the boundaries of image
super-resolution and produced more realistic and visually appealing results.
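A hedged sketch of an SRCNN-style network and a single MSE training step is given below, written in PyTorch. It follows the commonly used 9-5-5 layer configuration for grayscale patches; the filter counts, learning rate, and random tensors are assumptions, not the exact settings behind the results reported here.

import torch
import torch.nn as nn

class SRCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=9, padding=4),   # feature extraction
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=5, padding=2),  # non-linear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, kernel_size=5, padding=2),   # reconstruction
        )

    def forward(self, x):              # x: bicubic-upscaled LR image
        return self.body(x)

model = SRCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

upscaled_lr = torch.rand(8, 1, 128, 128)    # stand-in batch of interpolated LR images
ground_truth = torch.rand(8, 1, 128, 128)   # corresponding HR images

optimizer.zero_grad()
loss = criterion(model(upscaled_lr), ground_truth)   # MSE between prediction and HR
loss.backward()
optimizer.step()
print(loss.item())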

Fig. 3 LR image to SR image conversion



Fig. 4 SRCNN model architecture

2.2 Datasets

In the field of image SR, researchers use various datasets to train and evaluate
their model networks. In a review of various articles, 11 datasets were identified
as commonly used for these purposes (Ref: Table 1).
T91 Dataset: The T91 dataset contains 91 images. It comprises diverse content
such as cars, flowers, fruits, and human faces. Algorithms like SRCNN, FSRCNN,
VDSR, DRCN, DRDN, GLRL, and FGLRL utilized T91 as their training
dataset.
Berkeley Segmentation Dataset 200 (BSDS200): Due to the limited number
of images in T91, researchers supplemented their training by including BSDS200,
which consists of 200 images showcasing animals, buildings, food, landscapes,
people, and plants. Algorithms like VDSR, DRRN, GLRL, DRDN, and FGLRL

Table 1 Key characteristics of 5 popular image datasets


Dataset | Number of Images | Average Resolution | File Format | Key Contents
Set14 | 1,000 | 264 × 204 pixels | PNG | Objects: baby, bird, butterfly, bead, woman
DEVIKS | 100 | 313 × 336 pixels | PNG | Objects: humans, animals, insects, flowers, vegetables, comic, slides
BSDS200 | 200 | 492 × 446 pixels | PNG | Natural scenes: environment, flora, fauna, handmade, NBCT, people, somery
Urban100 | 109 | 826 × 1169 pixels | PNG | Urban scenes: animal, building, food, landscape, people, plant
ImageNet | 3.2 million | Varies | JPEG, PNG | Different objects and scenes

used BSDS200 as an additional training dataset, while FSRCNN used it as a testing


dataset. Dilated-RDN also incorporated BSDS200 for training.
DIVerse 2K resolution (DIV2K) Dataset: Widely used in many studies, the
DIV2K dataset consists of 800 training and 200 validation images. These
images showcase elements such as surroundings, plant life, wildlife, crafted items,
individuals, and landscapes.
ImageNet Dataset: This extensive dataset contains over 3.2 million images, covering mammals, avians, aquatic creatures, reptiles, amphibians, transportation, furnishings, musical instruments, geological structures, implements, blossoms, and fruits.
ESPCN and SRDenseNet employed ImageNet for training their models.
Set5 and Set14 Datasets: Set5 contains only five images, and Set14 contains 14
images. Both datasets are popular choices for model evaluation.
Berkeley Segmentation Dataset 100 (BSDS100): Consisting of 100 images with
content ranging from animals and buildings to food and landscapes, BSDS100 served
as a testing dataset for various algorithms.
Urban100 Dataset: With 100 images showcasing architecture, cities, structures,
and urban environments, the Urban100 dataset was widely used as a testing dataset.
Manga109 Dataset: Manga109 contains 109 PNG format images from manga
volumes, and it was used as a testing dataset by RDN and SICNN.

3 Different Proposed Algorithms

In SRCNN, the initial phase, patch extraction, captures information from the bicubic-interpolated image. This information is then channeled into the subsequent stage, non-linear mapping, where the high-dimensional features are transformed to correspond with other high-dimensional features, completing the mapping process [12]. Finally, the outcome of the last layer of the non-linear mapping phase is convolved to reconstruct the high-resolution (HR) image; this final stage synthesizes the refined features into the desired HR image, completing SRCNN's process.
SRCNN and sparse coding-based methods share similar fundamental operations in
their image super-resolution processes. However, a notable distinction arises in their
approach. While SRCNN empowers optimization of filters through an end-to-end
mapping process, sparse coding-based methods restrict such optimization to specific
operations. Furthermore, SRCNN boasts an advantageous flexibility: it permits the
utilization of diverse filter sizes within the non-linear mapping step, enhancing the
information integration process [13]. This adaptability contrasts with sparse coding-
based methods, which lack such flexibility. As a result of these disparities, SRCNN
achieves a higher PSNR (Peak Signal-to-Noise Ratio) value compared to sparse
coding-based methods, indicating its superior performance in image super-resolution
tasks (Ref: Algorithm 1).

Algorithm 1 Super-resolution algorithm (SRCNN)


1: Require: LR image ILR
2: Ensure: HR image IHR
3: Upsample ILR using bicubic interpolation to obtain Iup
4: Pass Iup through a CNN to obtain feature maps F
5: Divide F into patches.
6: Apply a non-linear mapping to each patch to obtain an enhanced patch
7: Reconstruct the patches to form IHR.

3.1 FastSR Convolutional Neural Network

Subsequently, authors of [14] made an intriguing observation regarding SRCNN’s


performance, noting that achieving improved outcomes necessitated incorporating
more convolutional layers within the non-linear mapping phase. However, this
enhancement came at a cost: the augmented layer count led to increased compu-
tational time and hindered the convergence of PSNR values during training. To
address these challenges, the authors introduced FSRCNN as a solution. Notably,
FSRCNN introduced novel elements compared to its predecessor, SRCNN. The
introduction of a shrinking layer, responsible for dimension reduction of extracted
features from the preceding layer, was a key distinction. Simultaneously, the addition
of an expanding layer aimed to reverse the shrinking process, expanding the output
features generated during the non-linear mapping phase. Another noteworthy devi-
ation was the employment of deconvolution as the chosen-up sampling mechanism
within FSRCNN. This mechanism facilitated the enhancement of image resolution.
These combined innovations in FSRCNN addressed the limitations of SRCNN, ulti-
mately providing an optimized solution that struck a balance between computational
efficiency and convergence of PSNR values during training (Ref: Algorithm 2).

Algorithm 2 Fast SR algorithm


1: Input: LR image.
2: CNN: The LR image is passed through a CNN with three convolutional layers. The first and second layers have 64 and 34 filters of size 9 × 9 and 5 × 5, respectively, and the third layer has 1 filter of size 5 × 5.
3: ReLU: A ReLU activation function is applied to the output of each convolutional layer.
4: Deconvolution: The output of the CNN is then up-sampled using deconvolution. This is a
process that reverses the process of convolution.
5: Output: The final output is a high-resolution image.
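The following PyTorch sketch illustrates the FSRCNN-style pipeline outlined above, ending with a transposed convolution (deconvolution) that performs the 2x up-sampling directly from the LR input. The channel counts, kernel sizes, and scale factor are illustrative assumptions.

import torch
import torch.nn as nn

fsrcnn_like = nn.Sequential(
    nn.Conv2d(1, 56, kernel_size=5, padding=2),       # feature extraction on the LR input
    nn.PReLU(),
    nn.Conv2d(56, 12, kernel_size=1),                 # shrinking layer: reduce feature dimension
    nn.PReLU(),
    nn.Conv2d(12, 12, kernel_size=3, padding=1),      # non-linear mapping
    nn.PReLU(),
    nn.Conv2d(12, 56, kernel_size=1),                 # expanding layer: restore feature dimension
    nn.PReLU(),
    nn.ConvTranspose2d(56, 1, kernel_size=9, stride=2,
                       padding=4, output_padding=1),  # deconvolution up-samples by 2x
)

lr = torch.rand(1, 1, 32, 32)   # LR input is not pre-interpolated
sr = fsrcnn_like(lr)
print(sr.shape)                 # torch.Size([1, 1, 64, 64])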

Fig. 5 Deep learning SR Architecture

3.2 Deep Learning SR

The architecture illustrated in Fig. 5 was conceptualized [28] to address the challenge
encountered in SRCNN, where an increasing number of mapping layers was imper-
ative for enhanced model performance. Deep learning SR innovatively introduced
the concept of residual learning, a mechanism that bridged the gap between input
and output within the final feature mapping layer. Residual learning was achieved by
integrating the output features from the ultimate layer with the interpolated features.
Given the strong correlation between low-level and high-level features, this skip
connection facilitated the fusion of low-level layer attributes with high-level features,
subsequently elevating model performance. This strategy proved particularly effec-
tive in mitigating the vanishing gradients issue that emerges when the model’s layer
count grows. The incorporation of residual learning in Deep learning SR offered
dual advantages compared to SRCNN. Firstly, it expedited convergence due to the
substantial correlation between LR and HR images. As a result, Deep learning SR
accomplished quicker convergence, slashing running times by an impressive 93.9%
when compared to the original SRCNN model. Secondly, Deep learning SR yielded
superior PSNR values in comparison to SRCNN, affirming its prowess in image
enhancement tasks [15].
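A minimal PyTorch sketch of the global residual learning idea is shown below: the network predicts only the residual (the missing high-frequency detail), which is added back to the bicubic-interpolated input through a skip connection. Depth and channel width are assumptions chosen for brevity.

import torch
import torch.nn as nn

class ResidualSR(nn.Module):
    def __init__(self, depth=8, channels=64):
        super().__init__()
        layers = [nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(channels, 1, 3, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, upscaled_lr):
        residual = self.body(upscaled_lr)   # predicted high-frequency detail
        return upscaled_lr + residual       # global skip connection

x = torch.rand(4, 1, 64, 64)      # stand-in batch of bicubic-upscaled LR images
print(ResidualSR()(x).shape)      # torch.Size([4, 1, 64, 64])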

3.3 Multi-Connected CNN for SR

Authors of [16] highlighted a limitation in the previous approach: although it


addressed the vanishing-gradient issue, it didn’t effectively harness low-level
features, indicating untapped potential for performance enhancement. Moreover,
the residual learning concept in VDSR was only integrated between the initial

and final layers of non-linear mapping, potentially resulting in performance degra-


dation. To address these concerns, the Multi-Connected CNN model, depicted in
Algorithm 3, was devised as a solution. The success of this improvement could be
attributed not only to the refined loss function but also to the extraction and amal-
gamation of information-rich local features from the multi-connected blocks. By
combining these localized features through concatenation with high-level features,
Multi-Connected CNN effectively addressed the shortcomings of its predecessors,
showcasing its capacity to elevate performance in the realm of image super-resolution
(Ref: Algorithm 3).

Algorithm 3 Multi-Connected CNN for SR algorithm


1: Input: An LR image is input to the algorithm.
2: Bicubic interpolation: The LR image is up-sampled using bicubic interpolation, a simple but effective interpolation method used to increase the resolution of an image without introducing too much blurring.
3: Convolutional neural network (CNN): The up-sampled image is then passed through a CNN, which learns to extract features from the image that can be used to improve the resolution.
4: Patch extraction: The output of the CNN is divided into patches; each patch is then processed individually.
5: Non-linear mapping: A non-linear mapping is applied to each patch to obtain an enhanced patch; this mapping helps to preserve the details of the image while also improving the resolution.
6: Reconstruction: The patches are then reconstructed to form the final high-resolution image.

3.4 Cascading Residual Network (CRN)

In a recent development, the Cascading Residual Network (CRN) was proposed as a solution to counteract the issue of extensive parameters within the EDSR network structure. This
parameter bloat was a direct outcome of significantly increasing the network depth
to enhance EDSR’s performance. CRN’s design was influenced by the architecture of
EDSR, with a strategic replacement of EDSR’s residual blocks with locally sharing
groups (LSG). The LSG concept encompassed the integration of several local wider
residual blocks (LWRB). These LWRB closely resembled EDSR’s residual blocks,
yet distinguished themselves through a variation in channel utilization. Every LSG, as well as every individual LWRB within it, was integrated with a residual learning network. By adopting this novel approach, CRN effectively sidestepped the issue of unwieldy parameter expansion that had plagued deeper EDSR models. The strategic
use of locally sharing groups and wider residual blocks showcased CRN’s potential
in maintaining model performance while effectively managing parameter growth—a
vital stride toward more efficient and powerful image super-resolution networks [17].

3.5 Enhanced Residual Network for SR

An alternative neural network known as ERN exhibited a slight performance improvement compared to the CRN model. The design of the ERN network was also inspired by the EDSR architecture but with an innovative inclusion of an extra skip connection. This connection linked the low-resolution (LR) input to the output originating from the final LWRB through a multiscale block (MSB). An algorithmic representation of this configuration is provided in Algorithm 4. The primary objective behind integrating the multiscale block (MSB) was to capture features from the input image across various scales. In contrast to CRN's utilization of the LSG (locally sharing group) technique, ERN opted for the application of LWRB in a non-linear mapping fashion (Ref: Algorithm 4).

Algorithm 4 Enhanced Residual Network for SR


1: Input: LR image
2: Output: HR image
3: 1. Feature extraction:
4: Apply convolutional layers for feature extraction from the LR image
5: Use a non-linear mapping to transform the features
6: 2. Up-sampling:
7: Upsample the features to the desired resolution.
8: 3. Residual learning:
9: Add the up-sampled features to the low-resolution image.
10: 4. Output:
11: The high-resolution image.

3.6 Deep-Recursive CNN for SR

Deep-recursive CNN for SR, introduced as the pioneer algorithm to employ a recur-
sive approach for image super-resolution, brought a novel perspective to the field, as illustrated in Fig. 6. It comprised three principal components: the embedding, inference, and reconstruction nets. The embedding net's role was to extract relevant features
from the interpolated image. These extracted features then traversed the inference net,
notable for its unique characteristic of sharing weights across all filters. Within the
inference net, the outputs of intermediate convolutional layers and the interpolated
features underwent convolution before their summation generated a high-resolution
(HR) image. The distinctive advantage of DRCN lay in its capacity to address the
challenge encountered in SRCNN, where achieving superior performance necessi-
tated a high number of mapping layers. By embracing a recursive strategy, Deep-
recursive CNN harnessed shared weights. Furthermore, the amalgamation of interme-
diate outputs from the inference net brought substantial enhancement to the model’s
performance. Incorporating residual learning principles into the network contributed

Fig. 6 Deep-recursive CNN


for SR

further to improved convergence. In the realm of outcomes, Deep-recursive CNN


yielded a noteworthy 2.44% enhancement over SRCNN, underscoring its efficacy in
pushing the boundaries of image super-resolution.

3.7 Dual-Branch CNN

The previous section examined various algorithms that primarily relied on stacking
convolutional layers sequentially. However, this approach resulted in increased
runtime and memory complexity. To address this concern, a dual-branch image super-resolution algorithm, named Dual-Branch CNN, was introduced. The network

Fig. 7 Dual-branch CNN architecture

architecture of Dual-Branch CNN is illustrated in Fig. 7. In Dual-Branch CNN,


the network architecture diverged into two branches. One branch incorporated a
convolutional layer, while the other branch employed a dilated convolutional layer.
The outputs from both branches were subsequently merged through a concatena-
tion process before undergoing up-sampling. Dual-Branch CNN also exhibited a
distinctive approach by combining bicubic interpolation with alternative up-sampling
methods, such as utilizing deconvolutional kernels for the reconstruction phase.
Several notable advantages emerged from the Dual-Branch CNN architecture. Firstly,
the dual-branch structure effectively circumvented the intricacies often encountered
in chain-way-based networks, streamlining the model’s complexity. Secondly, the
incorporation of dilated convolutional filters notably improved image quality during
the reconstruction process. Thirdly, the integration of residual learning principles
accelerated convergence, contributing to faster training [18].
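The dual-branch idea can be sketched in a few lines of PyTorch: one branch uses ordinary convolutions, the other uses dilated convolutions, and their outputs are concatenated before reconstruction. The channel counts and dilation rate below are illustrative assumptions.

import torch
import torch.nn as nn

class DualBranchBlock(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.standard = nn.Sequential(                               # branch 1: plain convolution
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.dilated = nn.Sequential(                                # branch 2: dilated convolution
            nn.Conv2d(1, channels, 3, padding=2, dilation=2), nn.ReLU(inplace=True))
        self.fuse = nn.Conv2d(2 * channels, 1, 3, padding=1)         # merge after concatenation

    def forward(self, x):
        merged = torch.cat([self.standard(x), self.dilated(x)], dim=1)
        return self.fuse(merged)

x = torch.rand(2, 1, 48, 48)
print(DualBranchBlock()(x).shape)   # torch.Size([2, 1, 48, 48])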

3.8 Quantitative Results Obtained from Different Algorithms

Here, the features of the images generated by various algorithms were carefully
observed. A comparison between SRCNN and the bicubic interpolation method
revealed that images produced through interpolation appeared blurry, lacking clear
details in contrast to the sharpness achieved by SRCNN. Comparing the outputs
of FSRCNN with those of SRCNN, there appeared to be minimal discrepancy.

However, FSRCNN exhibited superior processing speed [19]. The incorpo-
ration of residual learning within VDSR substantially improved image texture,
surpassing that achieved by SRCNN. Models benefiting from enhanced learning
through residual mechanisms displayed notable enhancement in image texture.
DRCN, which harnessed both recursive and residual learning, yielded images with
more defined edges and patterns, markedly crisper than the slightly blurred edges
produced by SRCNN. CRN further improved upon this aspect, delivering even
sharper edges than DRCN. On the other hand, GLRL generated significantly clearer
images compared to DRCN, albeit with a somewhat compromised texture.
Images generated by CRN exhibited superior texture compared to DRCN, while
SRDenseNet managed to reconstruct images with improved texture patterns, effec-
tively mitigating distortions that proved challenging for DRCN, VDSR, and SRCNN
to overcome. Noteworthy improvements were observed in images produced by
DBCN, showcasing a superior restoration of collar texture without introducing addi-
tional artifacts. This achievement translated to a more visually appealing outcome
than what was observed with CRN. DBCN demonstrated an enhanced capacity to
restore edges and textures, surpassing the capabilities of SRCNN in this domain.
Figures 8, 9, 10, and 11 provide a comprehensive summary of the quantitative
outcomes achieved by the respective algorithms developed by the authors (Table 2).

Fig. 8 Mean PSNR and SSIM for set5 dataset



Fig. 9 Mean PSNR and SSIM for set14 dataset

Fig. 10 Mean PSNR and SSIM for Bsd100 dataset

4 Different Network Design Strategies

Diverse approaches have been taken by numerous researchers to enhance the perfor-
mance of image super-resolution models. Table 2 shows different key features of
network design strategies among the various designs discovered above. At its core, the
linear network was the foundational design, depicted in Fig. 12. This design concept
drew inspiration from the residual neural network (ResNet), widely utilized for
object recognition in images. The linear network technique was employed by models
such as SRCNN, FSRCNN, and ESPCN. Although these three models employed a
similar design approach, there were differences in their internal architectures and up-
sampling methods. For instance, SRCNN exclusively consisted of feature extraction,

Fig. 11 Mean PSNR and SSIM for Urban100 dataset

Table 2 Key features of network design strategies


Network design strategy Key features
Linear network Single layer of neurons, linear functions
Residual learning Builds on top of simpler functions, residual connections
Recursive learning Learns hierarchical representations, recursive connections
Dense connections Connections between all neurons in a layer

and an up-sampling module, while FSRCNN integrated feature extraction, a contrac-


tion layer, non-linear mapping, an expansion layer, and an up-sampling module. (Ref:
Table 2).
Nonetheless, the linear network approach fell short of fully harnessing the wealth
of feature information present in the input data. The low-level features, originating
from the LR (low-resolution) images, encapsulated valuable information that exhib-
ited strong correlations with high-level features. Consequently, relying solely on
a linear network could result in the loss of pertinent information. To address this
limitation, the concept of residual learning was introduced, depicted in Fig. 11.
The incorporation of residual learning significantly expedited training convergence
and mitigated the degradation issues encountered with deeper networks. Within the
assessed algorithms, two variations of residual learning were identified: local residual
learning and global residual learning. With the exception of VDSR, all algorithms
implemented both forms of residual learning. Local residual learning primarily estab-
lished connections between residual blocks, while global residual learning linked the
features from LR images to the final features.

Fig. 12 Different network design

5 Use of Image Super-Resolution in Various Domains

Image super-resolution has found wide-ranging applications across diverse domains


over the past three decades. Notably, fields such as medical diagnosis, surveillance,
and biometric information identification have integrated image super-resolution
techniques to address specific challenges and enhance their capabilities.

5.1 In the Field of Medical-Imaging

In the realm of medical diagnostics, accurate judgment is a critical skill. However,


images obtained through techniques like CT, MRI, and PET-CT often suffer from
low resolution, noise, and inadequate structural information, posing challenges for
correct diagnoses [10]. Image super-resolution has garnered attention for its potential
to enable zooming into images and enhancing diagnostic capabilities. A CNN-based U-Net algorithm was proposed to generate high-resolution CT brain images, surpassing traditional techniques like the Richardson–Lucy deblurring algorithm. Other authors introduced networks for enhancing MRI brain images to aid tumor detection, and [11] implemented a channel splitting network (CSN) for MRI brain images. Both applications showcased the advantages of CNN-based algorithms over traditional methods like bicubic interpolation, as seen in improved metrics like PSNR and SSIM.

5.2 Surveillance Applications

Surveillance systems play a crucial role in security monitoring and investigations.


However, unclear video quality from these systems, stemming from factors like
small image size and poor CCTV quality, poses challenges. Image super-resolution
techniques have been employed to overcome these issues. Some authors proposed a deep CNN for surveillance record super-resolution, demonstrating the superiority of CNN-based methods over traditional techniques [20, 21]. Despite advancements, applying image super-resolution in surveillance remains challenging due to factors such as very complex motion, very large amounts of feature data, and varying image quality from CCTV.

5.3 Biometric Information Identification Applications

Biometric identification methods, including face, fingerprint, and iris recognition,


require high-resolution images for accurate detection and recognition. SRCNN was
applied by [12] for enhancing facial images used in surveillance. Paper [13] employed
deep CNNs for facial image resolution enhancement. Paper [14] utilized the PFE-Network to enhance the fingerprint image resolution.

6 Conclusion

The emergence of image SR technology has garnered significant attention across


various application domains. The evolution of deep learning has spurred researchers
to innovate CNN-based approaches, seeking models that offer improved performance
with reduced computational demands. With the inception of the pioneering SRCNN, a
multitude of techniques, including diverse up-sampling modules and network design
strategies, have enriched the landscape of image super-resolution algorithms. Yet,
focusing solely on methodologies may not yield optimal results in model refine-
ment. It is paramount to delve into the inherent characteristics of each approach,
comprehending their merits and limitations. Such insights empower developers to
judiciously select design pathways that enhance models intelligently. This synthesis
of knowledge from diverse studies not only informs about methodologies but also
imparts an understanding of the nuanced attributes that underlie them. This compre-
hensive review stands poised to provide invaluable guidance to developers aiming to
elevate the performance of image super-resolution models in terms of both efficiency
and quality. By encapsulating a profound understanding of diverse techniques and
their intricacies, this review serves as a beacon for future image super-resolution
advancements, illuminating the trajectory toward more sophisticated and impactful
developments.

References

1. Dong, C., Loy, C.C., He, K., Tang, X.: Learning a deep convolutional network for image
super-resolution. In: European Conference on Computer Vision, pp. 184–199. Springer, Cham,
Switzerland (2014)
2. Dong, C., Loy, C.C., Tang, X.: Accelerating the super-resolution convolutional neural network.
In: European Conference on Computer Vision, pp. 391–407. Springer, Cham, Switzerland
(2016)
3. Kim, J., Lee, J.K., Lee, K.M.: Accurate image super-resolution using very deep convolu-
tional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 1646–1654. Las Vegas, NV, USA, 27–30 June 2016
4. Lim, B., Son, S., Kim, H., Nah, S., Lee, K.M.: Enhanced deep residual networks for single
image super-resolution. In: Proceedings of the 2017 IEEE Conference on Computer Vision and
Pattern Recognition Workshops (CVPRW), pp. 1132–1140. Honolulu, HI, USA, 21–26 July
2017
5. Chu, J., Zhang, J., Lu, W., Huang, X.: A Novel multiconnected convolutional net- work for
super-resolution. IEEE Signal Process. Lett. 25, 946–950 (2018)
6. Lan, R., Sun, L., Liu, Z., Lu, H., Su, Z., Pang, C., Luo, X.: Cascading and enhanced residual
networks for accurate single-image super-resolution. IEEE Trans. Cybern. 51, 115–125 (2021)
7. Kim, J., Lee, J.K., Lee, K.M.: Deeply-recursive convolutional network for image super-
resolution. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 1637–1645. Las Vegas, NV, USA, 27–30 June 2016
8. Hou, J., Si, Y., Li, L.: Image super-resolution reconstruction method based on global and local
residual learning. In: Proceedings of the 2019 IEEE 4th International Conference on Image,
Vision and Computing (ICIVC), pp. 341–348. Xiamen, China, 5–7 July 2019

9. Gao, X., Zhang, L., Mou, X.: Single image super-resolution using dual-branch convolutional
neural network. IEEE Access 7, 15767–15778 (2019)
10. Ren, S., Jain, D.K., Guo, K., Xu, T., Chi, T.: Towards efficient medical lesion image super-
resolution based on deep residual networks. Signal Process. Image Communication.
11. Zhao, X., Zhang, Y., Zhang, T., Zou, X.: Channel splitting network for single MR image
super-resolution. IEEE Trans. Image Process. 28, 5649–5662 (2019)
12. Rasti, P., Uiboupin, T., Escalera, S., Anbarjafari, G.: Convolutional Neural network super reso-
lution for face recognition in surveillance monitoring. In: Articulated Motion and Deformable
Objects, pp. 175–184. Springer: Cham, Switzerland (2016)
13. Deshmukh, A.B., Rani, N.U.: Face video super resolution using deep convolutional neural
network. In: Proceedings of the 2019 5th International Conference on Computing, Commu-
nication, Control and Automation (ICCUBEA), pp. 1–6. Pune, India, 19–21 September
2019
14. Shen, Z., Xu, Y., Lu, G.: CNN-based high-resolution fingerprint image enhancement for pore
detection and matching. In: Proceedings of the 2019 IEEE Symposium Series on Computational
Intelligence (SSCI), pp. 426–432. Xiamen, China, 6–9 December 2019
15. Chatterjee, P., Milanfar, P.: Clustering-based denoising with locally learned dictionaries. IEEE
Trans. Image Process. 18(7), 1438–1451 (2009)
16. Xu, X.L., Li, W., Ling.: Low Resolution face recognition in surveillance systems. J. Comp.
Commun. 02, 70–77 (2014). https://doi.org/10.4236/jcc.2014.22013
17. Li, Y., Qi, F., Wan, Y.: Improvements on bicubic image interpolation. In: 2019 IEEE 4th
Advanced Information Technology, Electronic and Automation Control Conference (IAEAC).
Vol. 1. IEEE (2019)
18. Kim, T., Sang Il Park, Shin, S.Y.: Rhythmic-motion synthesis based on motion-beat analysis.
ACM Trans. Graph. 22(3), 392–401 (2003)
19. Xu, Z. et al.: Evaluating the capability of satellite hyperspectral imager, the ZY1–02D, for
topsoil nitrogen content estimation and mapping of farm lands in black soil area, China.
Remote Sens. 14(4), 1008 (2022)
20. Mishra, S.R., et al.: Real time human action recognition using triggered frame extraction and
a typical CNN heuristic. Pattern Recogn. Lett. 135, 329–336 (2020)
21. Mishra, S.R., et al.: PSO based combined kernel learning framework for recognition of first-
person activity in a video. Evol. Intell. 14, 273–279 (2021)
Applying Data Analytics and Time Series
Forecasting for Thorough Ethereum
Price Prediction

Asha Rani Mishra, Rajat Kumar Rathore, and Sansar Singh Chauhan

Abstract Finance has been combined with technology to introduce newer advances
and facilities in the domain. One such technological advance is cryptocurrency which
works on the Blockchain technology. This has proved to be a new topic of research
for computer science. However, these currencies are volatile in nature and their
forecasting can be really challenging as there are dozens of cryptocurrencies in use
all around the world. This chapter uses the time series-based forecasting model
for the prediction of the future price of Ethereum since it handles both logistic
growth and piece-wise linearity of data. This model is independent as it does not
depend on past or historical data which contain seasonality. This model is suitable
for real use cases after seasonal fitting using Naïve model, time series analysis, and
Facebook Prophet Module (FBProphet). FBProphet Model achieves better accuracy
as compared to other models. This chapter aims at drawing a better statistical model
with Exploratory Data Analysis (EDA) on the basis of several trends from year 2016
to 2020. Analysis carried out in the chapter can help in understanding various trends
related to Ethereum price prediction.

1 Introduction

Cryptocurrencies act as mediums of exchange based on computer networks and are


not reliant on any central or government authority, banks, or financial institutions.
Their function is to verify that the parties involved have the money they claim to own, thereby reducing reliance on the traditional role of banks. These currencies or coins are reliable in terms of data integrity and security because they are built on the Blockchain framework, which leaves little scope for them to be misused for fraudulent activities. Cryptocurrencies have not been recognized and adopted globally and
are still new to the market except in a few countries. From a pool of cryptocurrencies,
Bitcoin and Ethereum are the highlighted ones, since they are affected by news and

A. R. Mishra (B) · R. K. Rathore · S. S. Chauhan


Department of Computer Science, GL Bajaj Institute of Technology and Management, Greater
Noida, India
e-mail: asha1.mishra@gmail.com


events. Bitcoin was introduced in 2008 by Nakamoto. Based on Blockchain technology, this


currency allows for peer-to-peer transactions. Cryptocurrency is so-called because it
is based on encryption which is used to verify transactions. The storing and transmit-
ting of cryptocurrency data between wallets and to public ledgers involves advanced coding. These currencies are widely popular because of their safety and security.
A ledger or database that is distributed and shared by the nodes of a computer network
is known as a Blockchain. This aids in the secure and decentralized recording of trans-
actions for cryptocurrency. A block is a data collection that stores the transaction
together with additional information like the correct sequence and creation times-
tamp. The initial stage in a Blockchain is a transaction, which symbolizes the partici-
pant’s activity. Machine learning algorithms are commonly used in this field because
of their ability to dynamically select a vast number of attributes that could possibly
be affecting the results and comprehend complex, high-dimensional relationships
between features and targets. Because predictive capacity varies from coin to coin, machine learning and artificial intelligence appear to be better-suited tools. Low-volatility cryptocurrencies are easier to forecast than highly volatile ones, and most cryptocurrencies show considerable volatility in their price charts. Their prices are also affected by supply and demand: the number of coins in circulation and the number held by buyers are major factors affecting prices.
This chapter uses a model to forecast the closing price of Ethereum alongside the Ethereum opening price, daily high price, daily low price, and trading volume. The total value of Ethereum on a given day can be estimated using sophisticated machine learning algorithms that find hidden patterns in the data to make precise predictions.
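As an illustration of the forecasting workflow discussed in this chapter, the sketch below fits a Prophet (FBProphet) model on daily Ethereum closing prices and forecasts 30 days ahead. The file name ethereum.csv and its Date/Close columns are assumptions, and this is a minimal sketch rather than the exact pipeline behind the results reported later.

import pandas as pd
from prophet import Prophet   # older releases import Prophet from "fbprophet"

# Prophet expects a dataframe with columns "ds" (date) and "y" (value).
raw = pd.read_csv("ethereum.csv", parse_dates=["Date"])
df = raw.rename(columns={"Date": "ds", "Close": "y"})[["ds", "y"]]

model = Prophet(daily_seasonality=True)   # Prophet models trend plus seasonality
model.fit(df)

future = model.make_future_dataframe(periods=30)   # extend 30 days into the future
forecast = model.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())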

2 Related Work

Hitam et al. in [1] focus on six different types of cryptocurrency coins available
in the market by collecting historical data of 5 years. They trained four different
models on the dataset and the performances of different classifiers were checked.
These models were namely, SVM, BoostedNN, ANNs, and Deep Learning giving
accuracies of 95.5%, 81.2%, 79.4%, and 61.9%, respectively. Siddhi Velankar et al.
in [2] use two different approaches for predicting prices—GLM/Random Forest and
Bayesian Regression model. It uses Bayesian Regression by dividing data in the form
of 180s, 360s, 720s and takes the help of k-means clustering to narrow down effec-
tive clusters and further calculating the corresponding weights from data using the
Bayesian Regression method. In GLM/Random Forest model, the authors distribute
data into 30, 60, and 120 min time series datasets. The issue is addressed by Chen
et al. in [3] by splitting the forecasting sample into intervals of five minutes with
a large sample size and daily intervals with a small sample size. Lazo et al. in [4]
builds two decision trees based on datasets of two currencies—Bitcoin and Ripple
where week 1 decision tree model gives best decisions for selling the coins 1 week
after purchase and the week 2 decision tree model gives best investment advice on
coins giving highest gains. Derbentsev et al. in [5] perform predictive analysis on the
cryptocurrency prices (BTC, ETH, XRP) by using two different models—Stochastic
GBM and RF. Same features were used in training of both the models. The author
uses one-step-ahead forecasting to anticipate coin prices using the most recent 91 data points
in order to evaluate the effectiveness of the aforementioned ensembles. The SGBM
and RF models give an RMSE of 5.02 and 6.72 for Ethereum coin, respectively. Chit-
tala et al. in [6] introduce two different artificial intelligence frameworks ANN and
LSTM for the predictive analysis. Authors wanted to predict the price of Bitcoin one
day into the future through ANNs using five different lengths of memory, whereas
LSTM models the internal memory flow and how it affects future analysis. The author
concludes that the ANN and LSTM models can be compared up to a level and are good
performers in predicting prices almost equally, although the internal structures differ.
It is also concluded that long-term history is what ANN depends more on whereas
LSTM is more into short-term dynamics. Hence, in terms of historical data, LSTM
is able to utilize more useful information. Soumya A. et al. in [7] follow a similar
approach by comparing models using ANN and LSTM. ANN model improves itself
largely in predicting Ethereum prices as it trains itself on all the historical informa-
tion of the last month as compared to short-term cases. LSTM model, on the other
hand, performs well just like ANN while predicting one-day future prices based on
MSE. Suryoday Basak et al. in [8], work on prediction of stock prices and describe
the problem statement as a classification problem able to forecast the increase or
decrease in stock prices according to historical prices n days back. They came up
with an experimental technique for the classification. Random forest and gradient
boost decision trees (using XGBoost) are used in the work by using ensembles of
decision trees. Poongodi M. et al. in [9] employ a time series comprising daily closing
values of the Ethereum cryptocurrency to apply two well-known machine learning
algorithms for pricing prediction: linear regression and SVM. The SVM method
seems to give a higher accuracy (96.06%) than Linear Regression Model (85.46%).
The author also mentions that accuracy can be lifted up by adding features to SVM,
almost by 99%. Azeez A. et al. [10] use two different methods to design models for
predictions namely, Tree-based techniques (GBM, ADAB, XGB, MLP) and Deep
Learning Neural Networks (DFNN, GRU, CNN). Results show that these techniques
were able to gain a more valuable EVS percentage ranging between 88 and 98%,
meaning they account for the overall variance in the observed data. On the other
hand, the boosting trees techniques do not perform well enough in predicting daily
closing prices for cryptocurrencies since there are insights missing in the training set.
Aggarwal et al. in [11] uses the approach where gold price predicts prices of BTC
currency with help of three different DL algorithms namely, Convolutional Neural
Network (CNN), Long Short Term Memory (LSTM), and Gated Recurrent Network
(GRN). The predicted price which only uses gold price has certain deviation from
the actual Bitcoin price. The best result was given by LSTM. Phaladisailoed et al. in [12]
use four algorithms, namely Theil–Sen regression, Huber regression, LSTM, and GRU,
to predict the price of the cryptocurrency Bitcoin. LSTM offers the highest accuracy of
all, i.e., 52.78%. Carbo et al. in [13] show that adding the previous period's Bitcoin price to the
explanatory variables based on the LSTM algorithm improves in terms of RMSE
from 21 to 11%. Quantile regression boosting is used by Christian Pierdzioch et al.


in [14] to forecast gold prices. Trading rules deriving from this approach may prove
to be superior to even buy-hold strategies in circumstances with low trading costs and
particular quantiles. Perry Sadorsky et al. in [15] predict gold and silver prices and
conclude tree-based algorithms like Random Forest (RF) to be better and accurate
models as compared to logit ones for gold and silver prices forecasting in which tech-
nical indicators are used as features. Here too, RF predictions generate trading rules
capable of replacing the buy-and-hold strategy. Samin-Al-Wasee et al. [17] fitted ether
price data to a number of basic and hybrid LSTM network types and used time series
modeling to forecast future prices. To assess the efficacy of the LSTM networks, a
comparative study of these models against other popular current forecasting methods,
such as ARIMA as a baseline, was carried out.
proposed a hybrid model using long short-term memory (LSTM) and vector auto
regression (VAR) that forecasts the price of Ethereum. In comparison to the solo
models, the hybrid model yielded the lowest values for the assessment measures.
Existing work includes the use of multiple algorithms for price prediction and price
speculation in the stock market, where a large number of parameters are available. In
the case of digital coins, however, fewer parameters are available, and there is far more
seasonality and fluctuation than in stocks. Consequently, the number of algorithms
discussed for predicting cryptocurrency prices is quite limited. Earlier, price-prediction
models such as logistic regression, linear discriminant analysis, ARIMA, and LSTM were
used; ARIMA was the most prominent among them, but its predictions were also affected
by the higher seasonality. There is therefore a need for a new approach that overcomes
the limitations of the existing algorithms and gives a better prediction.

3 Research Methodology

Data analytics along with time series modeling can be used in predicting the price of
Ethereum by utilizing the information about trends and patterns in the past. It helps
to find correlated features for prediction models. Using sentiment analysis based
on social trends can help to identify factors which can influence the price. Devel-
oping precise forecasting models can benefit from the examination of long-term,
seasonal, and cyclical trends. Finding pertinent features for models that forecast
time series is facilitated by data analytics. The prediction capacity of the model
can be increased by combining sentiment analysis features, on-chain analytics, and
technical indications. In summary, the combination of data analytics and time series fore-
casting is powerful for Ethereum price prediction. It enables investors and analysts to
make more informed decisions by leveraging historical data, market sentiment, and
advanced modeling techniques. However, it's essential to acknowledge the inherent
uncertainty in cryptocurrency markets and continuously refine models to adapt to
dynamic conditions.
It is difficult to predict or comment on cryptocurrencies because of their price volatility
and visible dynamic fluctuations. From existing work, different pros and cons of time
series algorithms such as ARIMA and LSTM were identified; ARIMA, in particular, is
unable to handle factors such as seasonality in the data and dependence between data
points.
This existing problem can be reduced or eliminated by contrasting a few machine
learning techniques for analyzing market movements. This chapter presents a
methodology that predicts prices of the cryptocurrency Ethereum by using machine
learning algorithms in a hybrid fashion. The data has been smoothed, enhanced, and
prepared before the Facebook Prophet model is finally applied to it. Facebook Prophet
is able to handle the drawbacks of algorithms previously used in such predictions, such
as dynamic behavior, seasonality, and holidays [16].
Figure 1 depicts the workflow used in the chapter. Owing to the presence of
non-stationarity and noise in the data and the need for smoothing, feature engineering
was done during the process. At each step, graph visualization was required to analyze
the results and decide on the next steps.

Fig. 1 Workflow used for prediction of Ethereum

(i). Fetching of raw data, which could be done from 3rd-party APIs, web scraping,
etc.
(ii). After doing data cleaning on the raw data, exploratory data analysis (EDA) on
the data is conducted to know the behavior of the data.
(iii). On this data, a naive model (also known as a baseline model), an auto-regressive
model, or a moving average model is implemented.
(iv). After data cleaning, the next step is feature engineering to check whether the
data is stationary. For this, statistical testing is done using a line plot and the
Augmented Dickey–Fuller (ADF) test. It is essential to check the stationarity
of data in time series analysis since it highly affects the interpretation of the data.

4 Results and Discussions

Numerous statistical models are built on the assumption that there is no dependence
between the various points used for prediction. To fit a stationary model to the time
series data that needs to be analyzed, one should check for stationarity and remove
the trend/seasonality effect from the data. The statistical properties must remain constant
over time. It is not necessary that all data points be the same; rather, the data's
general behavior should be consistent. Time plots that look constant on a purely
visual level are considered stationary. Stationarity also means the consistency of mean
and variance with respect to time.
In the data preprocessing step, since the time series-based Facebook Prophet model is
used, the date feature must be of a timestamp-like object type. Therefore, it is first
converted to date/time format before the data is sorted by date. Exploratory Data Analysis
(EDA) must be done on the data sample shown in Fig. 2 because the ultimate purpose,
centered on the 'Close' feature shown in Fig. 3, is to forecast what the final selling price
of Ethereum will be.
The mean or average closing price can be found using the mean function, and the values
should be plotted by date on a weekly and yearly basis as shown in Figs. 4 and 5,
respectively.
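As a rough illustration of this preprocessing and aggregation step, the lines below are a minimal Python/pandas sketch; the file name and the column names ('Date', 'Close') are assumptions, since the chapter does not list them explicitly.

import pandas as pd

# Load the historical Ethereum data (file and column names are assumed)
data = pd.read_csv("ethereum.csv")

# Convert the date column to datetime and sort chronologically before any analysis
data["Date"] = pd.to_datetime(data["Date"])
data = data.sort_values("Date").set_index("Date")

# Quick EDA: summary statistics of the closing price
print(data["Close"].describe())

# Mean closing price aggregated on a weekly and a yearly basis (compare Figs. 4 and 5)
weekly_mean = data["Close"].resample("W").mean()
yearly_mean = data["Close"].resample("Y").mean()
weekly_mean.plot(title="Mean closing price per week")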

Fig. 2 Data sample



Fig. 3 Close feature plot

Fig. 4 Plot data sum on weekly basis

Fig. 5 Plot data on yearly basis

Fig. 6 Closing price on monthly basis

Fig. 7 Mean of 'Close' by week

The graph shown in Fig. 6 shows the trend of prices over yearly, weekly, and monthly
time periods using the mean function. In Fig. 7, the average weekly closing price is
analyzed by taking the mean of the 'Close' values for each week.
Mean closing price per day is analyzed and plotted in Fig. 8. In the same manner,
the average closing prices on a quarterly basis are analyzed and plotted as shown in
Fig. 9.
The trend that closing prices follow on weekdays and weekends was also analyzed
and the same was plotted in Fig. 10 which shows minor differences in the two graphs.

4.1 Naïve Model for Ethereum Price Prediction

Using this data, a baseline or the Naïve model is used for prediction as shown in
Fig. 11. In a Naïve model all data points are dependent on the previous data points
as shown in Fig. 12.
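The naive baseline described here can be sketched in a few lines of pandas; this is only an illustrative persistence forecast consistent with the description above, and the file and 'Close' column names are assumed.

import pandas as pd

# 'data' is the Ethereum DataFrame prepared earlier (Date index, 'Close' column assumed)
data = pd.read_csv("ethereum.csv", parse_dates=["Date"], index_col="Date").sort_index()

# Naive/persistence baseline: predict each day's close with the previous day's close
naive_pred = data["Close"].shift(1)

# Evaluate the baseline with RMSE, ignoring the first row, which has no prediction
errors = (data["Close"] - naive_pred).dropna()
rmse = (errors ** 2).mean() ** 0.5
print(f"Naive baseline RMSE: {rmse:.2f}")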

Fig. 8 Average closing price each day

Fig. 9 Average closing price of each Quarter

4.2 Seasonality Test

The next steps included determining whether or not the data had seasonality. Season-
ality is the existence of fluctuations or changes that happen frequently, such as once

Fig. 10 Weekend and Weekdays plot

Fig. 11 Naive prediction

a week, once a month, or once every three months in data. Seasonality is the peri-
odic, repeating, often regular, and predictable trends in the levels of a time series that
may be attributed to a variety of events, including weather, vacation, and holidays.
The seasonality of the curve is removed by applying rolling or moving average of a
window period of 7 on the data. Mean and Standard Deviation are shown in Fig. 13.
The blue line, which now overlaps the green curve in the graph in Fig. 13, represents
the mean values. The orange line in the graph reflects the exact given series. This leads
to the conclusion that the rolling mean is not constant and undergoes temporal variation.
The series must therefore be de-seasonalized and transformed into a stationary state.

Fig. 12 Naïve prediction vs. actual values plot

Fig. 13 Mean and standard deviation plot

4.3 Augmented Dickey–Fuller (ADF) Test

The 'adfuller' function is easily imported from the statsmodels package and applied to
the 'Close' data. This results in a p-value of 0.0002154535155876224.
The null hypothesis is rejected as the calculated p-value is less than 0.05, so the data
is considered stationary. The log transformation is often used to reduce the skewness
of a measurement variable, as shown in Fig. 14.
The data are smoothed using the moving average. The financial market uses this
technique frequently. Impact of rolling window and log transformations is shown in
Fig. 15.

Fig. 14 Removal of seasonality factor

Fig. 15 Log transformation and moving average

Figure 16 shows that the null hypothesis is rejected and, as a result, the data comes out
to be stationary. The time series is roughly stationary and has a constant interval. Shift
is used to apply differencing and examine the tendency of seasonality, as seen in Fig. 17.
Other seasonal adjustment results are shown in Fig. 18.
From these, it may be inferred that the Dickey–Fuller test statistic is below the 1%
critical value.

Fig. 16 Rolling and moving average difference is stationary

Fig. 17 Using shift to apply difference

4.4 Forecast Using Facebook Prophet Model

The algorithm used here for prediction is FBProphet. The FBProphet machine learning
algorithm employs a decomposable time series model that consists of three key
components: trend, seasonality, and holidays. They are combined as shown in Eq. 1:

y(t) = g(t) + s(t) + h(t) + εt (1)

g(t): For modeling non-periodic variations in time series, use a piece-wise linear
or logistic growth curve.
s(t): periodic changes (e.g., weekly or yearly seasonality).
h(t): effects of holidays along with irregular schedules.

Fig. 18 ADF test gives value much lesser than 1%

εt: error term accounts for any unusual changes which are not accommodated by
the model.
The 'fbprophet' library provides the Prophet model specifically. It handles irregular
schedules and irregular holidays, and it likewise copes with noise and outliers in the
data. 'Fbprophet' helps forecast time series data that follows non-linear patterns, since
the data has seasonality on a yearly, weekly, and daily scale along with the effects of
holidays. The results are best if the model is trained on past data covering several
seasons and on time series with considerable seasonal influence. The data must be
prepared in accordance with the Prophet model documentation prior to fitting, ensuring
that every field complies with its conventions: the output feature is represented as 'y'
and the date as 'ds'. The model is fitted with a daily frequency over a '500-day'
forecasting span.
'yhat' gives the actual forecast, while 'yhat_upper' and 'yhat_lower' give the upper-bound
and lower-bound predictions, respectively, as shown in Fig. 19. To plot this forecast,
the fbprophet library's built-in plotting functionality is used, as shown in Fig. 20.
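The following is a minimal sketch, under the assumption of a DataFrame with a date column and a 'Close' column, of how the Prophet fit, 500-day forecast, and plots described above might look in Python; exact file and column names are assumptions.

import pandas as pd
from prophet import Prophet  # the package is named fbprophet in older releases

# Prepare the data in the two-column format Prophet expects: 'ds' (date) and 'y' (target)
data = pd.read_csv("ethereum.csv", parse_dates=["Date"])
df = data.rename(columns={"Date": "ds", "Close": "y"})[["ds", "y"]]

# Fit the Prophet model on the historical closing prices
model = Prophet()
model.fit(df)

# Forecast 500 days ahead at a daily frequency
future = model.make_future_dataframe(periods=500, freq="D")
forecast = model.predict(future)

# 'yhat' is the forecast; 'yhat_lower'/'yhat_upper' are the uncertainty bounds (Fig. 19)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())

# Built-in plots of the forecast and of the weekly/yearly components (Figs. 20 and 21)
model.plot(forecast)
model.plot_components(forecast)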
The black dots in the curve represent the actual values or prices, whereas the blue line
shows the prediction curve. The light blue line depicts the trend using data on a weekly,
annual, and monthly basis, as shown in Fig. 21.

4.5 Validation of Predicted Data

The forecast model that calculates forecast error must now be cross-validated. Actual
values and projected values will be compared to calculate forecast error. There is a

Fig. 19 Forecast values

Fig. 20 Forecast using Fbprophet

built-in cross-validation method in Facebook Prophet. The horizon parameter, the
initial training period size, the cutoff period spacing, and other parameters are shown
as part of the cross-validation approach, as seen in Fig. 22. The RMSE (root-mean-square
error) curve can be plotted to find the difference between the actual and predicted values
for the time period 2019–2020.
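As a hedged illustration of this built-in cross-validation, the snippet below reuses the fitted model from the previous sketch and calls Prophet's cross_validation and performance_metrics helpers; the initial window, period, and horizon values are assumptions chosen only for the example.

from prophet.diagnostics import cross_validation, performance_metrics
from prophet.plot import plot_cross_validation_metric

# Rolling-origin cross-validation; the exact window sizes here are illustrative
df_cv = cross_validation(model, initial="730 days", period="90 days", horizon="180 days")

# Tabulate error measures (RMSE, MAE, MAPE, ...) over the forecast horizon
metrics = performance_metrics(df_cv)
print(metrics[["horizon", "rmse", "mae", "mape"]].head())

# Plot the RMSE curve against the horizon (compare Fig. 22)
plot_cross_validation_metric(df_cv, metric="rmse")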

Fig. 21 Weekly, monthly, and yearly plot

In order to evaluate the FBProphet model, we can use four types of error measures,
i.e., Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Root Relative
Squared Error (RRSE), and Mean Absolute Percentage Error (MAPE), shown in
Table 1.
Here, ẑk = predicted value for the kth sample,
zk = actual value of z for the kth sample,
z̄ = average value of z,
N = total number of test samples.
The FBProphet algorithm used for trend analysis proved to be a reliable algorithm,
giving an accuracy of approximately between 94.5 and 96.6%. From the experimental
results it has been observed that the RMSE falls in the range of 0–100, which is about
5.56% error. The majority of the values lie between 0 and 80, which indicates that the
model has about a 4.44% error.

Fig. 22 Root mean square error

Table 1 Performance metrics for FBProphet forecasting model

Type of statistical error: Formula
Mean Absolute Error (MAE): MAE = (1/N) Σ |ẑk − zk|, for k = 1 to N
Root Mean Square Error (RMSE): RMSE = √((1/N) Σ (ẑk − zk)²), for k = 1 to N
Root Relative Squared Error (RRSE): RRSE = √(((1/N) Σ (ẑk − zk)²) / ((1/N) Σ (z̄ − zk)²)), where z̄ = (1/N) Σ zk, for k = 1 to N
Mean Absolute Percentage Error (MAPE): MAPE = (100/N) Σ |(ẑk − zk)/zk|, for k = 1 to N
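For readers who prefer to compute these measures directly, the following small NumPy sketch implements the four formulas of Table 1; the variable names and the toy numbers in the usage line are arbitrary.

import numpy as np

def forecast_errors(actual, predicted):
    """Return MAE, RMSE, RRSE and MAPE for two equal-length sequences."""
    z = np.asarray(actual, dtype=float)         # actual values zk
    z_hat = np.asarray(predicted, dtype=float)  # predicted values ẑk
    mae = np.mean(np.abs(z_hat - z))
    rmse = np.sqrt(np.mean((z_hat - z) ** 2))
    rrse = np.sqrt(np.sum((z_hat - z) ** 2) / np.sum((z.mean() - z) ** 2))
    mape = 100.0 * np.mean(np.abs((z_hat - z) / z))
    return mae, rmse, rrse, mape

# Tiny usage example with made-up closing prices
print(forecast_errors([100, 110, 120], [98, 115, 119]))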

5 Conclusion and Future Work

Techniques and models are typically created from old data, but a reliable present-world
predictive model is difficult to build based only on previous data. The Logistic Regression
accuracy score is 66%, Linear Discriminant Analysis has 65.3% accuracy, and among
other previous models on prices of coins like BTC and LTC, a multi-linear regression
model gives an R2 score of 44% for LTC and 59% accuracy for BTC. Most of the problems
are solved by building models based on historical data, but in the case of cryptocurrencies,
future results cannot be predicted from a purely historical data model. There may be
seasonality or other problems in prior data which affect a model's ability to accurately
predict patterns. Under cross-validation, the FBProphet model used here was able to
achieve around 97% accuracy in forecasting the future Ethereum price. Even when
seasonal data was present, the overall gap between anticipated and actual values was
small compared to other models. Further, to improve the model's
accuracy and make it reliable on present data, a suggestion tool for other external
factors that may affect Ethereum market prices, such as social media, tweets, and
trading volume, might be added.

References

1. Hitam, N.A., Ismail, A.R.: Comparative performance of machine learning algorithms for
cryptocurrency forecasting. Indones. J. Electr. Eng. Comput. Sci. 11, 1121–1128 (2018). https://
www.ije.ir/article_122162.html
2. Velankar, S., Valecha, S., Maji, S.: Bitcoin price prediction using machine learning. In: 2018
20th International Conference on Advanced Communication Technology (ICACT), pp. 144–
147. IEEE (2018)
3. Chen, Z., Li, C.; Sun, W.: Bitcoin price prediction using machine learning: an approach to
sample dimension engineering. J.Comput. Appl. Math. 365, 112395 (2019). https://www.sci
encedirect.com/science/article/abs/pii/S037704271930398X
4. Lazo, J.G.L., Medina, G.H.H., Guevara, A.V., Talavera, A., Otero, A.N., Cordova E.A.: Support
system to investment management in cryptocurrencies. In: Proceedings of the 2019 7th Inter-
national Engineering, Sciences and Technology Conference, IESTEC, pp. 376–381. Panama
(9–11 October 2019)
5. Derbentsev, V., Babenko, V., Khrustalev, K., Obruch, H., Khrustalova, S.: Comparative perfor-
mance of machine learning ensemble algorithms for forecasting cryptocurrency prices. Int. J.
Eng. Trans. A Basics. 34, 140–148 (2021)
6. Yiying, W., Yeze, Z.: Cryptocurrency price analysis with artificial intelligence. In: 2019
5th International Conference on Information Management (ICIM), pp. 97–101. IEEE (2019,
March). https://doi.org/10.1109/INFOMAN.2019.8714700
7. Livieris, I.E., Pintelas, E., Stavroyiannis, S., Pintelas, P.: Ensemble deep learning models for
forecasting cryptocurrency time-series. Algorithms 13(5), 121 (2020). https://doi.org/10.3390/
a13050121
8. Basak, S., Kar, S., Saha, S., Khaidem, L., Dey, S.R.: Predicting the direction of stock market
prices using tree-based classifiers. North Am. J. Econ. Finance 47, 552–567 (2019). https://
doi.org/10.1016/j.najef.2018.06.013
9. Poongodi, M., Sharma, A., Vijayakumar, V., Bhardwaj, V., Sharma, A. P., Iqbal, R., Kumar,
R: Prediction of the price of Ethereum blockchain cryptocurrency in an industrial finance
system. Comput. Electr. Eng. 81, 106527 (2020). https://doi.org/10.1016/j.compeleceng.2019.
106527
10. Azeez A.O., Anuoluwapo O.A., Lukumon O.O., Sururah A. Bello, Kudirat O.J.: Performance
evaluation of deep learning and boosted trees for cryptocurrency closing price prediction.
Expert Syst. Appl. 213, Part C, 119233, ISSN 0957–4174 (2023)
11. Aggarwal, A., Gupta, I., Garg, N., & Goel, A.: Deep learning approach to determine the
impact of socio economic factors on bitcoin price prediction. In: 2019 Twelfth International
Conference on Contemporary Computing (IC3), pp. 1–5. IEEE (2019, August). https://doi.org/
10.1109/IC3.2019.8844928
12. Phaladisailoed, T., Numnonda, T.: Machine learning models comparison for bitcoin price
prediction. In: 2018 10th International Conference on Information Technology and Electrical
Engineering (ICITEE), pp. 506–511. IEEE (2018). https://doi.org/10.1109/ICITEED.2018.853
4911
13. Carbó, J.M., Gorjón, S.: Application of machine learning models and interpretability techniques
to identify the determinants of the price of bitcoin (2022)
14. Pierdzioch, C., Risse, M., Rohloff, S.: A quantile-boosting approach to forecasting gold returns.
North Am. J. Econ. Finance 35, 38–55 (2016). https://doi.org/10.1016/j.najef.2015.10.015

15. Sadorsky, P.: Predicting gold and silver price direction using tree-based classifiers. J. risk financ.
manag. 14(5), 198 (2021). https://doi.org/10.3390/jrfm14050198
16. Mishra, A.R., Pippal, S.K., Chopra, S.: Time Series Based Pattern Prediction Using Fbprophet
Algorithm For Covid-19. J. East China Univ. Sci.TechnoL. 65(4), 559–570 (2022)
17. Samin-Al-Wasee, M., Kundu, P.S., Mahzabeen, I., Tamim, T., Alam, G.R.: Time-Series Fore-
casting of Ethereum Price Using Long Short-Term Memory (LSTM) Networks. In: 2022 Inter-
national Conference on Engineering and Emerging Technologies (ICEET), pp. 1–6. IEEE
(2022, October). https://doi.org/10.1109/ICEET56468.2022.10007377
18. Sharma, P., Pramila, R.M.: Price prediction of Ethereum using time series and deep learning
techniques. In: Proceedings of Emerging Trends and Technologies on Intelligent Systems:
ETTIS 2022, pp. 401–413. Singapore: Springer Nature Singapore (2020). https://doi.org/10.
1007/978-981-19-4182-5_32
Practical Implementation of Machine
Learning Techniques and Data Analytics
Using R

Neha Chandela, Kamlesh Kumar Raghuwanshi, and Himani Tyagi

Abstract In this digital era, all e-commerce activities are based on modern
recommendation systems, where a company wants to analyse the buying patterns of
its customers in order to optimize its sales strategies. This mainly means focusing more
on valuable customers, identified by the amount of purchases they make, rather than
following the traditional way of recommending a product. In modern recommendation
systems, different parameters are synthesized to design efficient recommenders.
In this paper, the data of 325 customers who have made certain
purchases from a website having naive parameters like age, job type, education,
metro city, signed in with company since and purchase history are considered. The
E-commerce business model’s profit making is primarily dependent on choice-based
recommendation systems. Hence in this paper a predictive model using machine
learning-based linear regression algorithm is used. The study is done using a popular
statistical tool named R programming. In this study the R tool is explored and repre-
sented with utility for recommendation system designing and finding insights from
data by showing various plots. The results are formulated and presented in a formal
and structured way using the R tool. During this study it has been observed that the
R tool has potential to be one of the leading tools for research and business analytics.

N. Chandela
Computer Science and Engineering, Krishna Engineering College,
Uttar Pradesh, Ghaziabad, India
e-mail: falsenehachandela99@gmail.com
K. K. Raghuwanshi (B)
Computer Science Department, Ramanujan College, Delhi University, New Delhi, India
e-mail: kamlesh@ramanujan.du.ac.in
H. Tyagi
University School of Automation and Robotics, GGSIPU, New Delhi, India


1 Introduction

Recommendation systems are considered to be the future of business and marketing.
It is thanks to recommendation systems that we can offer products or services to
new users. Without a doubt, this has become a critical method for many online firms,
such as Netflix, Flipkart, and Amazon [1].
Types of recommendation systems
• Content-based recommendation system:
This system is built on features, using content to provide related products. For
example, if I am a fan of Chetan Bhagat and I order or read a book by this author
online, then the algorithm will suggest other books by the same author or books in
the same category to me. In other words, the system tries to figure out what made
product X so appealing to the buyer and proposes items that contain that "what".
Such systems are called content-based recommender systems [2]. Collaborative
recommendation, in contrast, does not rely on product features but rather on
customer feedback.
• User-based collaboration system:
It is based on locating comparable users and locating items that those users have
enjoyed but that we have not yet tried. You search for all other users who bought
product X and compile a list of other things they bought. Take the products that
appear the most frequently on this list.
• Item-based collaboration system: In this situation, we will look for similar
products to the one the user purchased and recommend them.
Where machine learning fits in:
Both the user-based collaborative filtering systems and item-based collaborative
filtering systems can rely on clustering as a foundation, though other machine learning
algorithms may be better suited to the job depending on your project requirements.
Clustering methods allow you to group persons and products based on similarity,
making them a natural choice for a recommendation engine [3].
Another approach to making recommendations could be to concentrate on the
differences between users and/or objects. Needless to say, the machine learning
algorithms you use will be heavily influenced by the characteristics of your particular
project.

1.1 Theoretical Concepts

R programming
Data science has taken over the whole world today. Every sector of study and industry
has been impacted as individuals increasingly recognise the usefulness of the massive
amounts of data being generated. However, in order to extract value from those data,

one must be skilled in data science abilities [4]. The R programming language has
emerged as the de facto data science programming language. The adaptability, power,
sophistication, and expressiveness of R have made it an indispensable tool for data
scientists worldwide [5].
Features of R programming
1. R’s syntax is quite similar to S’s, making it easy for S-PLUS users to transfer over.
While R’s syntax is essentially identical to S’s, R’s semantics, while outwardly
similar to S’s, are significantly different. In reality, when it comes to how R
operates under the hood, R is far closer to the Scheme language than it is to the
original S language [6].
2. R now operates on nearly every standard computing platform and operating
system. Because it is open source, anyone can modify it to run on whatever
platform they like. R has been claimed to run on current tablets, smartphones,
PDAs, and game consoles [7].
3. R has a great feature with many famous open source projects: regular releases.
Nowadays, there is a big annual release, usually in October, in which substantial
new features are included and made available to the public. Smaller-scale bugfix
releases will be produced as needed throughout the year. The frequent releases and
regular release cycle show active software development and ensure that defects
are resolved in a timely way. Of course, while the core developers maintain the
primary source tree for R, many individuals from all around the world contribute
new features, bug fixes, or both.
4. R also offers extensive graphical features, which set it apart from many
other statistical tools (even today). R’s capacity to generate “publication qual-
ity” graphics has existed since its inception and has generally outperformed
competing tools. That tendency continues today, with many more visualisa-
tion packages available than ever before. R’s base graphics framework gives
you complete control over almost every component of a plot or graph. Other
more recent graphics tools, such as lattice and ggplot2, enable elaborate and
sophisticated visualisations of high-dimensional data [8].
5. R has kept the original S idea of providing a language that is suitable for both
interactive work and incorporates a sophisticated programming language for
developing new tools. This allows the user to gradually progress from being a
user who applies current tools to data to becoming a developer who creates new
tools.
6. Finally, one of the pleasures of using R is not the language itself, but rather the
active and dynamic user community. A language is successful in many ways
because it provides a platform for many people to build new things. R is that
platform, and thousands of people from all over the world have banded together
to contribute to R, create packages, and help each other utilise R for a wide range
of applications. For almost a decade, the R-help and R-devel mailing lists have
been very active, and there is also a lot of activity on sites like Stack Overflow
[9].

Free Software
Over many other statistical tools R has a significant advantage that it is free in
the sense of free software (and also free in the sense of free beer). The R Foundation
owns the primary source code for R, which is released under the GNU General Public
Licence version 2.0.
According to the Free Software Foundation, using free software grants you the
four freedoms listed below.
1. The ability to run the programme for any reason (freedom 0).
2. The ability to learn how the programme works and tailor it to your specific
requirements (freedom 1). Access to the source code is required for this [10].
3. The ability to redistribute copies in order to assist a neighbour (freedom 2).
4. The ability to develop the programme and make your innovations available to
the public so that the entire community benefits (freedom 3) [11, 12].
Limitations of R
1. There is no such thing as a flawless programming language or statistical analysis
system. R has a variety of disadvantages. To begin, R is built on nearly 50-year-
old technology, dating back to the original S system developed at Bell Labs.
Initially, there was little built-in support for dynamic or 3-D graphics (but things
have substantially changed since the “old days”) [13–15].
2. One “limitation” of R at a higher level is that its usefulness is dependent on
consumer demand and (voluntary) user contributions. If no one wants to adopt
your preferred approach, it is your responsibility to do it (or pay someone to
do so). The R system’s capabilities largely mirror the interests of the R user
community. As the community has grown in size over the last ten years, so have
the capabilities. When I first began using R, there was very limited capability for
the physical sciences (physics, astronomy, and so on). However, some of those
communities have now embraced R, and we are seeing more code created for
these types of applications [9, 16].

1.2 Practical Concepts

Linear Regression
To fill in the gaps, a linear regression approach might be utilised. As a refresher,
this is the linear regression formula:

Y = C + BX (1)

We all learnt the straight line equation in high school. The dependent variable is Y,
the slope is B, and the intercept is C. Traditionally, the formula for linear regression
is stated as:

h = θ0 + θ1 X (2)

‘h’ is the hypothesis or projected value, X is the input feature, and the coefficients
are theta0 and theta1.
We will utilise the other ratings of the same movie as the input X in this recom-
mendation system and predict the missing values. The bias term theta0 will be
avoided.

h = θX (3)

Theta1 is started at random and refines over iterations, just like the linear regression
technique.
We will train the algorithm with known values, much like in linear regression.
Consider a movie’s known ratings. Then, using the formula above, forecast those
known ratings [17, 18]. After predicting the ratings values, we compare them to the
original ratings to determine the error term. The error for one rating is shown below.
(θ(j))T x(i) − y(i,j) (4)

Similarly, we must determine the inaccuracy for each rating. Before I go any
further, I’d like to introduce the notations that will be used throughout this paper.
nu = number of users.
nm = number of movies.
r(i,j) = 1 if user j has rated movie i.
y(i,j) = rating given by user j to movie i (defined only if r(i,j) = 1).
Here’s the formula for the total cost function, which will show the difference
between the expected and original ratings.

(1/2) Σ i:r(i,j)=1 ((θ(j))T x(i) − y(i,j))² + (λ/2) Σ k=1 to n (θk)² (5)

The error term is squared in the first term of this expression. To avoid any negative
numbers, we use the square. We scale the squared error by 1/2 for convenience and
calculate the error term only where r(i, j) = 1, since r(i, j) = 1 means the rating was
provided by the user [19].
The regularisation term is the second term in the equation above. It can be used
to regularise any overfitting or underfitting issue [13].

2 How to Build a Recommendation Engine in R

Knowing the input data


The data has the following variables about the customers:
• Age- It tells about the age of the customers

Fig. 1 Dataset features

Fig. 2 Dataset closer look

• Job Type- Employed, Unemployed, Retired.


• Education- Primary, Secondary, Graduate
• Metro city- Yes (if the customer stays in a metro city), No (if the customer is not
from a metro city)
• Signed in Since- This is the number of days since the customer signed in to the
website for the first time.
• Purchase made- the total amount of purchase made by the customer (Fig. 1).
STEP 1: Data Preparation (Fig. 2).
• Check the missing values and the mean, median, mode

summary(Data).
Using the summary command, we can check the mean, median, mode and
missing values for each variable. In this case, we have 13 missing observations
(NAs) for the variable age. Hence, before going ahead, we need to treat the missing
values first [20] (Fig. 3).
• Histogram to see how the data is skewed
hist(Data$Age).
We specifically check the data distribution to decide by which value we can
replace the missing observations.
In this case, since the data is somewhat normally distributed, we use mean to
replace the missing values (Fig. 4).
• Replacing the NA values for variable Age with mean 39
Data$Age[is.na(Data$Age)] = 39.
• Check if the missing values are replaced from the variable Age
summary(Data).

Here, we can see that the missing values (NAs) are replaced by the mean value of
39 (Fig. 5).

Fig. 3 Histogram to see skewness in dataset

Fig. 4 Quantitative Data Representation



Since we have handled the missing values, let’s have a look at the data
head(Data).
After handling the missing values, we can see that there are categorical variables
such as Marital status, metro city, education which we need to convert in dummy
variables.
STEP 2: Creating New Variables
As seen in the data, four of our variables are categorical, which we need to create
as dummy variables first.
• Data$Job.type_employed <- as.numeric(Data$Job.Type == "Employed")
• Data$Job.type_retired <- as.numeric(Data$Job.Type == "Retired")
• Data$Job.type_unemplyed <- as.numeric(Data$Job.Type == "Unemployed")
• Data$Married_y <- as.numeric(Data$Marital.Status == "Yes")
• Data$Education_secondary <- as.numeric(Data$Education == "Secondry")
• Data$Education_gra <- as.numeric(Data$Education == "Graduate")
• Data$Metro_y <- as.numeric(Data$Metro.City == "Yes")
The following command is used to create dummy variables [20]. You need to
create n-1 dummy variables. For example, we have a categorical variable—Gender
which has two levels—Male & Female. So you will create 1 dummy variable 2–1 =
1, where 2 is the number of levels you have. The second level is taken care of by the
intercept of the regression line.
#Checking the dummy variables
• head(Data)
Here, the dummy variables have been created (Figs. 6 and 7).
• Removing the categorical columns (2,3,4,5)
final_data <- Data[, -c(2,3,4,5)]
• let’s check our final data
head(final_data).

Linear Regression—Univariate Analysis


We first start with Univariate Analysis, outlier detection of independent variables
using a box plot (Figs. 8, 9 and 10).
• par(mfrow = c(1,2))
• bx = boxplot(final_data$Age)

Fig. 5 Top rows of dataset



Fig. 6 Dummy variables in dataset

Fig. 7 The variables in a numeric format

Fig. 8 Outliers using boxplot in feature age

Let’s check the distribution of the variable age


• quantile(final_data$Age, seq(0,1,0.02))
• bx$stats

Fig. 9 The quantile values

Fig. 10 Stats of dataset

Fig. 11 Box plot to again check the outliers

STEP 3: Univariate Analysis


Since the 98th percentile is 57, we cap the outliers with the same value
• final_data$Age <-ifelse(final_data$Age > 60,57,final_data$Age)
• boxplot(final_data$Age)
Here, we have replaced the outlier value and we have no outliers in the variable
Age (Figs. 11, 12 and 13) .
Now checking the outlier for our other variable sign in since days
• boxplot(final_data$Signed.in.since.Days.)

Fig. 12 Outliers in the variable signed in since days

Fig. 13 Another box plot

Outlier treatment for signed in since


• quantile(final_data$Signed.in.since.Days., seq(0,1,0.02))
Thus, values less than 45 are capped with 48 (the 8th percentile).
• final_data$Signed.in.since.Days. <- ifelse(final_data$Signed.in.since.Days. <
45,48,final_data$Signed.in.since.Days.)
boxplot to check the outliers
• boxplot(final_data$Signed.in.since.Days.)

Fig. 14 A histogram

now let’s check the dependent variable


• par(mfrow = c(1,2))
• hist(final_data$Purchase.made, main = ‘Dependent’)
The dependent variable is normally distributed, hence we do not need to perform
any transformation (Fig. 14).
boxplot for the dependent
• boxplot(final_data$Purchase.made)
There are some outliers in the variable Purchase made, but since we are analysing
the dependent variable, we will not cap the outliers, so that capping would not affect
the model accuracy (Fig. 15).
STEP 4: Bivariate Analysis
Now let’s check to do the bi-variate analysis to check the relationship between
variables(Age and Purchase made) (Fig. 16).
• library(car) scatterplot(final_data$Age,final_data$Purchase.made)

In this scatter plot, we can see a curvilinear relationship between the independent
variable Age and the dependent variable Purchase made.
We can see that, the age group up till 30 has a medium purchase, from age 30 to
55 the purchase is maximum and again the purchase lowers.
Sign.in.days vs Purchase made
• scatterplot(final_data$Signed.in.since.Days.,final_data$Purchase.made

Fig. 15 Box plot for purchase variable

Fig. 16 Scatter plot for age and purchase variable

We can see a positive linear relationship between the variable signed in since and
the variable purchase made (Fig. 17).
We can see a pattern that the old customers make a higher purchase.
STEP 5: Regression Analysis
Since we are done with the EDA, let's check the correlation (Fig. 18).

Fig. 17 Scatter plot showing positive relation between variable

Fig. 18 Correlation in dataset

Fig. 19 Final data picture

• cor(final_data)

checking the multi-collinearity


• final_data1 <- lm(Purchase.made ~ ., data = final_data)
• vif(final_data1)
The multi-collinearity (VIF) value for each variable is expected to be less than 5. Hence, in this
case we need to remove the variable with the highest multi-collinearity (Fig. 19).

Fig. 20 Variance values in data

Fig. 21 Step function usage

Since not all the variables are below the threshold of 5, we need to correct the
model; let's remove the Education_secondary variable first
• final_data2 <- lm(Purchase.made ~ Age + Signed.in.since.Days. +
Married_y + Job.type_retired + Job.type_unemplyed
+ Education_gra + Metro_y,data = final_data)
• vif(final_data2)
Since the VIF value is less than 5 for all the variables, we can consider all the
variables (Fig. 20).
Graduation was highly co-linear with the other variables; let's verify once again
using the step function
step(final_data1)
Basically, the summary reveals all possible stepwise removals of one term from
your full model and compares the extracted AIC values by listing them in ascending
order, since the smaller AIC value is more likely to resemble the true model
(Fig. 21).
• final_data3 <- lm(Purchase.made ~ Age + Signed.in.since.Days. +
Married_y + Job.type_retired + Job.type_unemplyed
+ Education_gra + Metro_y,data = final_data)
• summary(final_data3)

These two variables have a P-value which is greater than 0.05; hence we need to
remove these variables before the final analysis (Fig. 22).
• final_data4 <- lm(Purchase.made ~ Signed.in.since.Days. +

Fig. 22 P value representation of variables

Married_y + Job.type_unemplyed + Education_gra + Metro_y, data = final_data)
summary(final_data4)
Here, the P-value for the variable unemployed is slightly higher than 0.05, we
still would keep it in the equation as unemployed job type is an important criteria of
analysis and the difference is only 1% (Fig. 23).
STEP 6: Model Evaluation
Now since we have the best fit equation, let’s try to check with the assumptions
for a linear model
Loading the package lmtest: library(lmtest)
• par(mfrow = c(2,2))
• plot(final_data4)

Fig. 23 P values

According to the linear regression assumption, the residuals must be normally


distributed, but in the graphs of the residuals we can clearly see a funnel-like pattern
and also the quantile plot indicates a heavy tail in the lower end, hence we need to
improve the model (Fig. 24).
STEP 7: Model Improvement
Quantile Values
• quantile(final_data$Purchase.made, seq(0,1,0.02)) (Fig. 25)
Let’s consider 4% and 96% as the cut-off
• final_data_new = final_data[(final_data$Purchase.made >= 510 & final_data$Purchase.made

Fig. 24 Distribution representation

Fig. 25 Quantile values



<= 13500), ]
Let’s re-run the model on this filtered data
• mod2 <- lm(Purchase.made ~ Signed.in.since.Days. + Married_y + Education_gra + Metro_y + Job.type_unemplyed, data = final_data_new)
• summary(mod2)
The P-value is greater than the cut-off of 0.05, hence we need to remove the
variable (Fig. 26).
Final linear equation
• mod2 <- lm(Purchase.made ~ Signed.in.since.Days. + Married_y + Education_gra + Metro_y, data = final_data_new)
• summary(mod2)
All the variables are significant (Fig. 27, 28).
Now analysing the residual plot.

Fig. 26 T values in dataset

Fig. 27 Linear regression



Fig. 28 Results

Fig. 29 Durbin test results

• par(mfrow = c(2,2))
• plot(mod2)
New residual plot with the refitted model
Autocorrelation
• durbinWatsonTest(mod2)
In the case of the Durbin–Watson test, the D-W statistic is considered good if it is less
than 2 (Fig. 29).
In this case, we have a value which is less than 2.
Normality of errors
• hist(residuals(mod2))
According to the assumption of normal distribution of residuals, the histogram
shows the errors are normally distributed (Fig. 30).
Homoscedasticity
• plot(final_data_new$Purchase.made, residuals(mod2))
The scatter plot also shows the errors are fairly evenly spread, indicating roughly constant variance (Fig. 31).
Checking the cook’s distance
• library(predictmeans)
• cooksd = CookD(mod2)

Fig. 30 Histogram

Fig. 31 Relationship between residuals and purchase feature

Ideally, observations with a high Cook's distance should be removed from the data and
the model refitted (Figs. 32 and 33).
STEP 8: Predicting the New Values
Predicting the values in new Data
• Data2 <-read.csv(“MyData.csv”)
• predict.lm(mod2,Data2)

Fig. 32 Cooks Distance



Fig. 33 Results showing predictions

3 Conclusion

Every corner of our world is producing some type of data. This data contains hidden
insights that are useful for human well-being. Hence, data analysis has brought revo-
lution to the digital world. There are many tools available for this task, like Jupyter
Notebook and R programming. The paper presents a description of R programming
for data analysis. The limitations, benefits, and features are clearly explained. The
dataset is used as an example to show the benefits and capabilities of R programming
for data preprocessing, analysis, outlier detection, correlation, prediction, classifi-
cation, and regression. The results are shown using histograms, box plots, scatter
plots.

References

1. Roy, D., Dutta, M.: A systematic review and research perspective on recommender systems. J
Big Data 9, 59 (2022). https://doi.org/10.1186/s40537-022-00592-5
2. Bochkarev, V., Solovyev, V., Wichmann, S.: Universals versus historical contingencies in lexical
evolution. J. R. Soc. Interface. 11, 20140841 (2014). https://doi.org/10.1098/rsif.2014.0841,
Link, ISI, GoogleScholar
3. Tippmann, S.: Programming tools: Adventures with R. Nature 517, 109–110 (2015). https://
doi.org/10.1038/517109a
4. Gazoni, R.: A semiotic analysis of programming languages. J. Comp. Commun. 6, 91–101
(2018). https://doi.org/10.4236/jcc.2018.63007, Crossref, GoogleScholar
5. TIOBE. 2022 TIOBE Index. TIOBE Index: The R Programming Language. See https://www.
tiobe.com/tiobe-index/. Accessed 10 Sept 2023. Google Scholar
6. Gipp, B., Beel, J., Hentschel, C.: Scienstein: A Research Paper Recommender System (2009).

7. Fayyaz, Z., Ebrahimian, M., Nawara, D., Ibrahim, A., Kashef, R.: Recommendation systems:
Algorithms, challenges, metrics, and business opportunities. Appl. Sci. 10(21), 7748 (2020).
https://doi.org/10.3390/app10217748
8. R: The R Project for Statistical Computing (r-project.org)
9. German, M., Adams, B., Hassan, A.E.: The evolution of the R software ecosystem. In: 2013
17th European Conference on Software Maintenance and Reengineering, pp. 243–252 (2013,
March). https://doi.org/10.1109/CSMR.2013.33. ISSN: 1534–5351
10. Analytics Vidhya | Learn everything about AI, Data Science and Data Engineering
11. Gorakala, S.K., Usuelli. M.: Building a Recommendation System with R. Packt Publishing
(2015)
12. Ge, X., Liu, J., Qi, Q., Chen, Z.: A new prediction approach based on linear regression for
collaborative filtering. 2011 Eighth International Conference on Fuzzy Systems and Knowledge
Discovery (FSKD). Shanghai, China, pp. 2586–2590 (2011). https://doi.org/10.1109/FSKD.
2011.6020007
13. Furtado, F., Singh, A.: Movie recommendation system using machine learning. International
Journal of Research in Industrial Engineering 9(1), 84–98 (2020). https://doi.org/10.22105/
riej.2020.226178.1128
14. Job recommendation system using machine learning and natural language processing (dbs.ie)
15. Jayalakshmi, S., Ganesh, N., Čep, R., Senthil, M.J.: Movie recommender systems: Concepts,
methods, challenges, and future directions. Sensors (Basel). 22(13), 4904 (2022Jun 29). https://
doi.org/10.3390/s22134904.PMID:35808398;PMCID:PMC9269752
16. RJ-2021–108.pdf (r-project.org)
17. Jhalani, T., Kant, V., Dwivedi, P. A Linear Regression Approach to Multi-criteria Recommender
System, 9714, pp. 235–243 (2016). https://doi.org/10.1007/978-3-319-40973-3_23
18. Morandat, F., Hill, B., Osvald, L., Vitek, J.: Evaluating the design of the R language. In: Noble,
J. (ed.), ECOOP 2012—Object-Oriented Programming, Lecture Notes in Computer Science,
pp. 104–131. Springer, Berlin, Heidelberg (2012). ISBN 978–3–642–31057–7. https://doi.org/
10.1007/978-3-642-31057-7_6
19. Jain, G., Mishra, N., Sharma, S.: CRLRM: Category based Recommendation using Linear
Regression Model, pp. 17–20 (2013). https://doi.org/10.1109/ICACC.2013.11.
20. How to code a recommendation system in R—Ander Fernández (anderfernandez.com)
Deep Learning Techniques in Big Data
Analytics

Ajay Kumar Badhan, Abhishek Bhattacherjee, and Rita Roy

Abstract The emergence of the digital age has ushered in an unprecedented era of
data production and collection, creating big data models. In this context, a valuable
technique to address complex issues originating from big data analytics is deep
learning, which is a subset of machine learning. The aim of this chapter is to give
a thorough assessment of deep learning methods and how they are implemented in
big data analytics. Beginning with an introduction to the fundamental tenets of deep
learning, including neural networks and deep neural architectures, the mechanisms
by which deep models can automatically learn and represent complex patterns from
raw data are explored. It examines various aspects of deep learning applications of big
data analysis. It shows how deep learning models excel in feature learning, enabling
the automatic extraction of valuable information from huge data sets. Finally, the
chapter describes emerging trends in deep learning and big data analysis, providing
a glimpse into the future of this dynamic field. It draws attention to the pivotal role
that deep learning techniques have played in transforming the big data analytics
environment and emphasizes the ongoing significance of research and innovation in
this quickly developing discipline.

A. K. Badhan (B) · A. Bhattacherjee


Department of Computer Science and Engineering, Lovely Professional University, Phagwara,
Punjab, India
e-mail: ajay.27337@lpu.co.in
A. Bhattacherjee
e-mail: abhishek.27306@lpu.co.in
R. Roy
Department of Computer Science and Engineering, Gitam Institute of Technology
(Deemed-to-Be-University), Visakhapatnam, Andhra Pradesh, India
e-mail: ritaroy1311@gmail.com


1 Introduction

In the era of digital advancement, the expansion of technology has produced an


unprecedented amount of data, including both structured and unstructured data.
Often referred to as “big data”, this vast collection of data contains valuable
insights, models, and information that can transform decision-making processes
across sectors.
Big data analytics, a field dedicated to extracting meaningful information from
vast and complex data sets, is at the forefront of this data revolution. Big data analytics
involves the use of advanced analytical approaches to process, analyze, and interpret
huge data sets. Traditional data processing tools and methods are insufficient to
handle such large volumes of data from various platforms like social media, events,
sensors, and more.
Figure 1 represents the different areas where big data analytics has been imple-
mented. One compelling use case is the application of customer opinion analysis
influenced by the discussed techniques. This approach involves using recommen-
dation systems and analyzing the behavior of individual users from various tweet
histories. The study focuses on the airline industry and aims to create a robust model
that analyzes customer reviews making use of Kaggle’s Twitter Airlines dataset [1].
Similarly, [2] discusses the study of an online store vision system that uses big
data for behavioral analysis of similar customer groups. In this, marketing theory and
visual design are integrated into the study of e-commerce system optimization. For
predictive analysis and fraud detection, [3] study adopts big data analytics to identify
unusual patterns, employing predictive analytics tools to handle massive data and
discern patterns, facilitating the detection and prevention of fraud in the retail sector.
Big data analytics uses modern technologies like distributed computing, machine
learning, and deep learning algorithms to determine hidden patterns, correlations,
and trends in data. This analytical approach enables organizations to make informed
decisions, improve operations, predict market trends, and obtain a competitive edge
in the fast-paced commercial world of today.
In the realm of Artificial Intelligence (AI), deep learning, a branch of machine
learning, has become a game-changing technology. Essentially, it makes use of neural
networks—complex computer models inspired by the human brain that can auto-
matically learn complex patterns from data. Deep learning methods can analyze an

Fig. 1 Big data analytics use cases in different areas: customer analytics, sentiment analysis, behavioural analysis, predictive analysis, fraud detection, and customer segmentation



Fig. 2 Deep learning use cases in different sectors: agriculture, healthcare, the manufacturing industry, the retail industry, and financial services

enormous volume of unstructured data, including photos, text, audio, and video, that
traditional machine learning methods cannot handle well.
Figure 2 shows the major sectors that currently use deep learning. The authors
of [4] demonstrate an IoT-based irrigation system that enables precision agriculture
by controlling the duration of irrigation and saving water. Similarly, in the health-
care sector, [5] surveys different deep learning techniques applied in healthcare,
focusing on computer vision, natural language processing, reinforcement learning,
and generalized methods. [6] highlights the transformative impact of machine
learning within the manufacturing Industry 4.0 paradigm, while [7] addresses the
critical role of financial institutions in ensuring economic stability and sustainability
through effective credit risk mitigation, focusing on a model that classifies potential
borrowers as good or bad credit risks.
Deep learning excels at feature extraction, enabling the automatic identification
of relevant information from raw data, particularly for complex tasks involving huge
volumes of data. Deep learning models such as the Convolutional Neural Network
(CNN) and the Recurrent Neural Network (RNN) have revolutionized speech
recognition, image recognition, and natural language processing, making them
indispensable tools in various applications.
Combining big data analytics and deep learning creates a synergy with immense
potential. The former provides the infrastructure needed to process large data sets,
while deep learning algorithms unlock actionable insights from that data by auto-
matically identifying complex patterns. Using deep learning capabilities within big
data analytics, organizations can extract nuanced, context-rich information from
their data, enabling accurate predictions, improved decision-making, and the devel-
opment of innovative solutions. Essentially, pairing big data analytics with deep
learning techniques gives companies and researchers the tools to navigate the
complexities of the digital age. This integration not only drives operational efficiency
and business growth but also paves the way for breakthrough discoveries, making it
a cornerstone of today's data operations.

2 Literature Review

Deep learning methods are one of the promising fields of research for the automated
extraction of complex data representations (features) at high levels of abstraction.
These algorithms learn and represent data in a layered, hierarchical manner, with
lower level (less abstract) features used to define higher level (more abstract) features.
The motivation for deep learning methods and hierarchical learning design comes
from the deep, multi-layered learning process used by the primary sensory regions
of the human brain's neocortex, which automatically extracts features and
abstractions from the underlying input.
A fundamental idea underlying deep learning approaches is the use of distributed
representations of the data, which allow a robust representation of each example and
stronger generalization by making a vast number of possible combinations of the
input data's abstract features viable. The number of possible configurations grows
rapidly with the number of extracted abstract features. Considering that the observed
data are produced by the interactions of numerous known and unknown factors, it is
likely that new combinations of the learned factors and patterns, obtained through
certain configurations of the learned factors, can be used to explain additional
(unseen) data patterns [8].
Deep learning approaches based on deep neural networks have gained prominence
as high-performance computing resources have proliferated. When working with
unstructured data, deep learning methods attain greater power and flexibility since
they can process a huge number of features. The data is passed through many layers;
each layer gradually extracts features and transmits the information to the following
layer. The earliest layers extract low-level characteristics, which are then merged by
succeeding layers to provide a richer representation. Even though deep learning is
constantly improving, a variety of issues still need to be addressed. Deep learning can
make machines smarter, sometimes even smarter than people, even if its exact
mechanism is still a mystery. One goal is to develop models that run on mobile
devices, making mobile applications smarter and more intelligent. There is also a need
to keep deep learning committed to advancing humanity and sustaining our world as
a better place to live [9].
Deep learning methods are increasingly used in image segmentation. Building on
the development of various deep learning algorithms, they have given rise to a wide
variety of new image segmentation algorithms. Previous research has demonstrated
the promise of deep learning-based methodologies, and more recent studies compare
a broader range of techniques based on their reported performance [10].
Standard data processing techniques restrict the processing of vast amounts of data
in numerous ways. For high accuracy and efficiency when handling data in real time,
deep learning and machine learning-based methods must be developed for big data
analytics. Recent research has therefore merged a range of deep learning methods
with hybrid learning and training procedures.
Since the majority of these strategies are scenario-specific and focused on vector
space, they perform poorly in more general scenarios and when learning features
from big data. Handling the enormous amount of data in accordance with
organizational requirements is crucial because data is expanding exponentially in a
variety of forms. Technology companies such as Microsoft, Yahoo, and Amazon hold
exabyte-sized or even larger amounts of data. Because of the widespread usage of
online social media platforms, their users generate tremendous volumes of data, and
standard techniques are incapable of handling data at this scale. As a result, a variety
of businesses have created big data analytics-based products for experimentation,
simulation, data analysis, monitoring, and many other business purposes [11].
Deep learning uses supervised and unsupervised methods to learn multi-level
representations and features in hierarchical structures for classification and pattern
recognition. Big data collection has been made possible by recent advancements in
communication and sensor network technology. Big data offers excellent prospects
for an extensive range of industries, such as e-commerce and smart medicine, but it
also poses difficult problems for data mining and information processing because of
its characteristics of huge volume, variety, velocity, and veracity. Deep learning has
become increasingly important for big data over the past few years. Compared with
more traditional shallow machine learning methods such as support vector machines
and Naive Bayes, deep learning methods can more efficiently combine low-level
input to extract high-level features and learn hierarchical representations from large
amounts of data [12].
Deep learning uses Artificial Neural Networks modeled after the neurons found in
the human brain. The structure is made up of layers, and the adjective "deep" refers
to the stacking of multiple layers. The term originally referred to a very small number
of layers, but because deep learning is used to solve complicated problems, the
number of layers has grown to hundreds or even more. Many companies in image
processing, healthcare, transportation, and agriculture have found great success using
deep learning. Deep learning is becoming more and more popular, first because of the
availability of training datasets such as ImageNet, which contains thousands of photos
and allows the best possible use of an increasing amount of data. Second, low-priced
GPUs are increasingly used to train on these datasets and can take advantage of
different cloud services. Massive corporations like Facebook, Amazon, Google, and
Microsoft use deep learning methods on a daily basis to evaluate enormous amounts
of data [13].
A general review of the popular yet challenging topic of urban big data fusion based
on deep learning methods is presented. First, several elements of urban big data are
evaluated. Then, a few typical data fusion techniques, which can broadly be split into
three groups, are briefly presented together with spatial–temporal data. Next, three
categories of existing multi-modal urban big data fusion techniques grounded in deep
learning—DL-based output fusion, DL-based input fusion, and DL-based double-
stage fusion—are distinguished and described separately. Finally, challenges and
some suggestions for studying urban big data are given based on the behaviors and
characteristics of urban big data [14].
In recent years, deep learning models have excelled at speech recognition and
computer vision. The first and foremost benefit of these learning techniques is the
ability to evaluate enormous amounts of data, i.e., Big Data. This is vital for
organizations such as social networks that collect a lot of data, and it is this
advantage that makes deep learning a powerful technique for Big Data: incredibly
valuable information hidden in a Big Data set can be retrieved using deep learning.
Such social network data can be seen, for example, in the contemporary stock market.
Using deep learning techniques, one can extract complex representations at a high
level of abstraction, in a manner that makes it possible to specify higher level features
using lower level characteristics. Deep learning techniques can also be used to
distinguish between distinct sources of variation in the data (such as lighting, object
shapes, and object materials in a picture). The concept of hierarchical learning in
deep learning originates from the primary sensory areas of the human brain's
neocortex [15].
Table 1 presents an overall summary of the reviewed literature.

3 Methodology

Big data analytics uses a variety of deep learning approaches to extract valuable
insights from vast and complicated data sets. Some of the different approaches are:
1. Convolutional Neural Networks (CNNs): The convolutional neural network
employs a unique operation called convolution, a mathematical operation
applied between two functions that produces a third function describing how
one function's shape is influenced or modified by the other.
Phung and Rhee [22] note that Convolutional Neural Networks distinguish them-
selves from other pattern recognition algorithms by integrating both feature
extraction and classification. Figure 3 illustrates a straightforward
schematic of a basic CNN, comprising five distinct layers:
• Initial layer, i.e., input,
• Second layer, i.e., convolution,
• Third layer, i.e., pooling,
• Fourth layer, i.e., fully connected, and
• Fifth layer, i.e., output.

These layers fall into two segments: feature extraction and classification. Feature
extraction encompasses the first three layers, while classification involves the
remaining two. The input layer sets a defined size for input pictures, which may be
adjusted by resizing if needed. The picture is then subjected to multiple learned
kernels with shared weights in the convolution layer. The third layer,
Table 1 Tabular view of the complete literature review

• A Systematic Review on Machine Learning Approaches for Cardiovascular Disease Prediction Using Medical Big Data [16] — J. Azmi et al., 2022. Technique: machine learning; dataset: medical big data; metrics: accuracy, sensitivity, specificity. Results: comprehensive review of machine learning for cardiovascular disease prediction. Conclusion: key findings on ML approaches for cardiovascular disease prediction.
• Big Data Analysis of the Internet of Things in the Digital Twins of Smart City Based on Deep Learning [17] — X. Li, H. Liu et al., 2022. Technique: deep learning, IoT data analysis; dataset: smart city IoT data; metrics: IoT data analysis metrics. Results: significant insights into IoT data analysis in smart cities. Conclusion: deep learning's role in IoT data analysis in smart cities.
• A Novel Diabetes Healthcare Disease Prediction Framework Using Machine Learning Techniques [18] — R. Krishnamoorthi et al., 2022. Technique: machine learning; dataset: diabetes healthcare data; metrics: accuracy, sensitivity, precision. Results: developed a novel framework for diabetes prediction. Conclusion: machine learning is effective in healthcare disease prediction.
• Machine Learning Technologies for Big Data Analytics [19] — A. H. Gandomi et al., 2022. Technique: machine learning; dataset: big data analytics; metrics: accuracy, F1 score, ROC-AUC. Results: achieved competitive results in big data analytics. Conclusion: importance of ML in big data analytics.
• Deep Learning Techniques: An Overview [20] — Amitha Mathew et al., 2021. Technique: Convolutional Neural Networks (CNN); dataset: large-scale big data; metrics: accuracy, F1 score. Results: achieved 95% accuracy on test data. Conclusion: CNNs are promising in big data analytics.
• Understanding Deep Learning Techniques for Image Segmentation [21] — Swarnendu Ghosh et al., 2020. Technique: U-Net; dataset: medical images; metric: Intersection over Union (IoU). Results: achieved an average IoU of 0.85. Conclusion: U-Net is effective for medical image segmentation.
• Deep Learning in Big Data Analytics: A Comparative Study [11] — Bilal Jan et al., 2019. Techniques: LSTM, CNN, auto-encoders; dataset: big data sets; metrics: RMSE, MAE, R-squared. Results: LSTM outperforms other techniques with lower RMSE. Conclusion: LSTM is a viable choice for big data analytics.
• Urban Big Data Fusion Based on Deep Learning: An Overview [14] — Jia Liu et al., 2020. Technique: deep learning; dataset: urban big data; metrics: various urban metrics. Results: an overview of urban big data fusion using deep learning. Conclusion: insight into the application of deep learning in urban contexts.

Fig. 3 Convolutional Neural Network (CNN) architecture schematic diagram

i.e., pooling, follows, reducing picture size while preserving essential information.
The feature maps are the results obtained from feature extraction. In the next phase,
i.e., classification, the fully connected layers combine the extracted features, and the
output layer, with one neuron per object category, produces the classification result.
The pattern implemented in most CNN architectures is as follows [23]:

IN → [CONV → POOL] ∗ M → [FC] ∗ N → OUT (1)

where:
• IN: the input layer, which receives the (possibly resized) picture.
• CONV: a convolution layer.
• POOL: a pooling layer.
• [CONV → POOL] ∗ M: the convolution–pooling block repeated M times, forming the feature-extraction part.
• [FC] ∗ N: N fully connected layers, forming the classification part.
• OUT: the output layer, which produces the final classification result.
Convolutional networks are used in big data analytics for various applications
including object detection, image recognition, and multidimensional dataset
analysis.
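To make the layer pattern in Eq. (1) concrete, the following is a minimal sketch, assuming the Keras API, a 64 × 64 RGB input, ten output classes, and illustrative filter and layer sizes (none of these choices are taken from the cited works):

# Minimal CNN sketch following IN -> [CONV -> POOL] * M -> [FC] * N -> OUT.
# Input shape, class count, and layer sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_simple_cnn(input_shape=(64, 64, 3), num_classes=10, m=2, n=2):
    """Stack M convolution+pooling blocks, then N fully connected layers."""
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    filters = 32
    for _ in range(m):                      # feature-extraction part
        model.add(layers.Conv2D(filters, (3, 3), activation="relu", padding="same"))
        model.add(layers.MaxPooling2D((2, 2)))
        filters *= 2
    model.add(layers.Flatten())
    for _ in range(n):                      # classification part
        model.add(layers.Dense(128, activation="relu"))
    model.add(layers.Dense(num_classes, activation="softmax"))   # output layer
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_simple_cnn()
model.summary()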
2. Recurrent Neural Networks (RNNs): RNNs are well suited for data that is
sequential in nature and are used to capture temporal dependencies. They process
sequential or temporal data by incorporating information from previous steps: the
output of the prior step is fed to the current step as input. Recurrent networks learn
from the training input like other networks but differ in their memory, which stores
information from earlier steps and allows them to influence the current input and
output. The diagrammatic view of the recurrent neural network is presented in
Fig. 4:

Fig. 4 Schematic diagram for a standard recurrent neural network

The authors of [5] propose a recurrent neural network for enhancing audio-visual
speech recognition (AVSR) accuracy in noisy environments. The RNN model
exhibits a loop structure within its hidden unit, as illustrated in Fig. 4. It comprises
an input layer denoted "I", a hidden layer "H", and an output layer denoted "O".
The RNN unfolds the loop, essentially replicating the same structure multiple times,
and in this configuration the state H of each step serves as an input to the
subsequent step. Representing the layers at time t as I^t, H^t, and O^t, where I^t is
the input layer, H^t the hidden layer, and O^t the output layer, the output can be
calculated as follows [24]:

a^t = b1 + W H^{t−1} + U I^t
H^t = σ(a^t)                                    (2)
O^t = b2 + V H^t

Explanation of the Equations:


i. a^t = b1 + W H^{t−1} + U I^t: this expression calculates the activation at time t,
where a^t is the total input, b1 is a bias vector, W is the weight matrix connecting
the hidden layer H^{t−1} from the previous time step, and U is the weight matrix
connecting the input layer I^t.
ii. H^t = σ(a^t): the hidden state at time t, denoted H^t, is obtained by applying a
nonlinear activation function σ to the activation a^t.
iii. O^t = b2 + V H^t: this expression calculates the output at time t, denoted O^t,
where b2 is another bias vector and V is the weight matrix connecting the hidden
state H^t to the output layer.
In big data analytics, RNNs are applied to time series analysis, natural language
processing (NLP), and speech recognition within big data sets.
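The recurrence in Eq. (2) can be illustrated with a short NumPy sketch; the weight shapes, the tanh nonlinearity standing in for σ, and the toy input sequence below are illustrative assumptions rather than the configuration of any cited model:

# Minimal NumPy forward pass for the simple RNN of Eq. (2).
import numpy as np

def rnn_forward(inputs, W, U, V, b1, b2):
    """Run a simple RNN over a sequence of input vectors I^1 ... I^T."""
    hidden_size = W.shape[0]
    H = np.zeros(hidden_size)              # initial hidden state H^0
    outputs = []
    for I_t in inputs:
        a_t = b1 + W @ H + U @ I_t         # a^t = b1 + W H^{t-1} + U I^t
        H = np.tanh(a_t)                   # H^t = sigma(a^t), tanh as the nonlinearity
        O_t = b2 + V @ H                   # O^t = b2 + V H^t
        outputs.append(O_t)
    return np.array(outputs), H

# Toy example: 5 input vectors of dimension 3, hidden size 4, output size 2.
rng = np.random.default_rng(0)
inputs = rng.normal(size=(5, 3))
W, U, V = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
b1, b2 = np.zeros(4), np.zeros(2)
outputs, last_hidden = rnn_forward(inputs, W, U, V, b1, b2)
print(outputs.shape)   # (5, 2)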

3. Long Short-Term Memory Networks (LSTMs): The LSTM is an improved
Recurrent Neural Network (RNN) that addresses the vanishing and exploding
gradient problems associated with conventional RNNs. LSTM networks address
these limitations by remembering long-term dependencies through memory cells in
the hidden layer; the presence of the memory cells enables the network to retain
long-term data dependencies. The basic structure of all recurrent neural networks is
a chain of consecutively repeated neural network modules. In a conventional RNN,
this repeating module uses a single tanh layer, as presented below in Fig. 5.
The LSTM uses a similar chain-like structure, but its repeating module has four
neural network layers that interact with one another rather than just one, as
presented in Fig. 6.

Fig. 5 The standard RNN contains a single layer of tanh

Fig. 6 The LSTM’s repeating module with four interconnected layers



The authors of [25] use this structure, employing LSTM and bidirectional LSTM
models for multistep forecasting of COVID-19 infection hotspots in Indian states.
The LSTM network model computes the hidden output state h_t based on the
following key components [26]:
i. Input gate (i_t): determines what information from the input vector x_t must be
saved in the cell state C_t. The expression is given as:

i_t = σ(x_t U^i + h_{t−1} W^i)                              (3)

ii. Forget gate (f_t): decides what information from the previous cell state C_{t−1}
should be discarded or kept at the current time step. The expression is as follows:

f_t = σ(x_t U^f + h_{t−1} W^f)                              (4)

iii. Output gate (o_t): determines what part of the cell state should be output as the
hidden state h_t for the current time step. The expression is as follows:

o_t = σ(x_t U^o + h_{t−1} W^o)                              (5)

iv. Intermediate cell gate (C̃_t): computes the candidate update to the cell state.
The expression is as follows:

C̃_t = tanh(x_t U^c + h_{t−1} W^c)                           (6)

v. Current memory cell (C_t): combines the information from the forget gate, input
gate, and candidate update to produce the new cell state. The general expression is:

C_t = σ(f_t ∗ C_{t−1} + i_t ∗ C̃_t)                          (7)

vi. Hidden state (h_t): the output of the LSTM for the current time step, which
depends on the cell state and the output gate. The general expression is:

h_t = tanh(C_t) ∗ o_t                                        (8)
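As an illustration of how such a network is used in practice, the following is a minimal sketch, assuming the Keras API and a synthetic univariate series, of an LSTM that forecasts several future steps at once, loosely in the spirit of the multistep forecasting in [25]; the window length, number of units, and data are illustrative assumptions:

# Minimal LSTM sketch for multistep time-series forecasting.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

def make_windows(series, n_in=10, n_out=5):
    """Slice a series into (past n_in steps -> next n_out steps) pairs."""
    X, y = [], []
    for i in range(len(series) - n_in - n_out + 1):
        X.append(series[i:i + n_in])
        y.append(series[i + n_in:i + n_in + n_out])
    return np.array(X)[..., np.newaxis], np.array(y)

series = np.sin(np.linspace(0, 20, 500))     # synthetic stand-in data
X, y = make_windows(series)

model = models.Sequential([
    layers.Input(shape=(10, 1)),
    layers.LSTM(32),               # memory cells retain long-term dependencies
    layers.Dense(5),               # predict the next 5 steps at once
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.predict(X[:1]).shape)  # (1, 5)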

4. Auto-Encoders: An auto-encoder is a special kind of neural network designed to
encode data into a concise representation and then decode it again to produce an
output that is as close to the original input as feasible [27]. Technically, an
auto-encoder network uses a set of recognition weights to convert an input vector
into a code vector. The code vector is then transformed into an approximate
reconstruction of the original input vector using a second set of generative
weights [28]. The standard architecture of an auto-encoder is shown in Fig. 7.

Fig. 7 The standard architecture view of auto-encoder

The key components for Fig. 7 are as follows [29]:

a. Encoding: takes the input data I and produces a compressed representation z in
the latent space. The expression is as follows:

z = Encoder(I)                                               (9)

b. Decoding: takes the compressed representation z and attempts to reconstruct the
input data as I′.

I′ = Decoder(z)                                              (10)

c. Loss Function: measures the difference between the input I and the reconstructed
output I′. Mean squared error (MSE) is the loss function most frequently used for
continuous data and binary cross entropy for binary data. The general loss, the MSE,
and the binary cross entropy are given as:

Loss = LossFunction(I, I′)

MSE = (1/n) Σ_{i=1}^{n} (I_i − I′_i)²

BinaryCrossEntropy = −(1/n) Σ_{i=1}^{n} [I_i log(I′_i) + (1 − I_i) log(1 − I′_i)]        (11)

d. Training Objective: the main focus during training is to minimize the loss by
adjusting the parameters, i.e., the weights and biases of both the encoder and the
decoder. The mathematical expression is as follows:

Minimize Loss = Minimize(LossFunction(I, Decoder(Encoder(I))))        (12)
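A minimal sketch of this encoder–decoder structure and training objective, assuming the Keras API and a flattened 784-dimensional input with a 32-dimensional latent space (illustrative choices), is given below:

# Minimal auto-encoder sketch mirroring Eqs. (9)-(12).
import tensorflow as tf
from tensorflow.keras import layers, models

input_dim, latent_dim = 784, 32                    # e.g., flattened 28x28 images

encoder = models.Sequential([                      # z = Encoder(I)
    layers.Input(shape=(input_dim,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(latent_dim, activation="relu"),
])
decoder = models.Sequential([                      # I' = Decoder(z)
    layers.Input(shape=(latent_dim,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(input_dim, activation="sigmoid"),
])

autoencoder = models.Sequential([encoder, decoder])
# Training objective: minimize the reconstruction loss between I and I'.
# Use MSE for continuous inputs or binary cross-entropy for binary inputs.
autoencoder.compile(optimizer="adam", loss="mse")

# Usage (with data scaled to [0, 1]):
# autoencoder.fit(X_train, X_train, epochs=20, batch_size=256)
# compressed = encoder.predict(X_test)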

5. Generative Adversarial Networks (GANs): GANs are another neural network-based
technique. Once trained on data, a GAN can generate realistic data from scratch
that did not exist before. In big data analytics, GANs are scaled to handle large
datasets and benefit from parallel processing. The core of a GAN consists of two
primary components:
a. Generator: the generator crafts counterfeit samples, drawing inspiration from the
authentic ones, with the goal of misleading the discriminator into accepting the
fakes as genuine. It is basically an unsupervised component: guided by the
discriminator's feedback, it keeps improving its fake samples. The training process
ends when the generator consistently fools the discriminator, at which point one can
declare that a general GAN model has been developed. The loss function used for
the generator is presented as [30]:

Minimize_G Maximize_D : V(D, G) = E_{z∼p_z(z)}[log(1 − D(G(z)))]        (13)

where:
• G: the generator
• D: the discriminator
• x: real data samples
• z: noise samples
• p_data: the real data distribution
• p_z: the noise distribution
b. Discriminator: the discriminator operates like a vigilant authority, tasked with
pinpointing irregularities in the samples produced by the generator and accurately
categorizing each sample as either genuine or fabricated. It is basically a supervised
component: it is trained on real data and provides feedback to the generator. The
discriminator objective is a maximization problem, analogous to the generator loss
above, and is provided as [30]:

Maximize_D : V(D, G) = E_{x∼p_data(x)}[log(D(x))] + E_{z∼p_z(z)}[log(1 − D(G(z)))]        (14)

The interplay between the generator and discriminator persists until a state of
refinement is reached in which the generator succeeds in outsmarting the
discriminator's ability to discern fake data.
Huang et al. [31] propose a novel approach to text-to-image synthesis using GANs
to address two primary challenges, i.e., image quality and image-text alignment.
The diagrammatic view of the proposed approach is displayed in Fig. 8. The main
objectives of the proposed model are multi-semantic fusion, which addresses the
limitation of using a single sentence for image synthesis; preservation of the unique
semantics of each sentence; and a multi-sentence joint discriminator for enhancing
image-text alignment. GANs are applied in several big data analytics applications
including data augmentation, image and video synthesis, style and data
transformation, text-to-image synthesis, and speech synthesis.
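The adversarial objectives in Eqs. (13) and (14) can be sketched in code as follows; this is a minimal illustration assuming the Keras/TensorFlow API, small fully connected networks, and the commonly used non-saturating generator loss, none of which are taken from the cited text-to-image model:

# Minimal GAN training step: discriminator vs. generator losses via cross-entropy.
import tensorflow as tf
from tensorflow.keras import layers, models

latent_dim, data_dim = 16, 64

generator = models.Sequential([
    layers.Input(shape=(latent_dim,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(data_dim),                     # fake sample G(z)
])
discriminator = models.Sequential([
    layers.Input(shape=(data_dim,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),      # probability of being real, D(x)
])

bce = tf.keras.losses.BinaryCrossentropy()
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(real_batch):
    z = tf.random.normal((tf.shape(real_batch)[0], latent_dim))   # noise z ~ p_z(z)
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake = generator(z, training=True)
        d_real = discriminator(real_batch, training=True)
        d_fake = discriminator(fake, training=True)
        # Discriminator maximizes log D(x) + log(1 - D(G(z))), Eq. (14).
        d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
        # Generator trained with the non-saturating variant of Eq. (13):
        # it tries to make the discriminator label its samples as real.
        g_loss = bce(tf.ones_like(d_fake), d_fake)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    return d_loss, g_loss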
6. Transfer Learning: Transfer learning is a potent concept in machine learning that
involves leveraging knowledge acquired from solving one task to enhance
performance on a different but related task. In the context of deep learning, transfer
learning means taking models pre-trained on sizable datasets for specific tasks and
then adapting them to new, related problem domains.
The pre-trained network serves as transferred knowledge to be applied in another
domain. This method is especially useful in situations with little labeled data, since
it builds on pre-existing models that have been trained on a large amount of data.
By reusing learned features or representations, transfer learning expedites model
training, improves generalization, and boosts performance across various
applications. The diagrammatic view is presented in Fig. 9.
The authors of [32] propose a unique transfer learning method based on convolutional
neural networks for detecting network intrusions, with a focus on addressing the
limitations associated with insufficient datasets in this domain. The approach
comprises two concatenated convolutional neural networks and operates through a
two-stage learning process: initially the models learn from the base datasets, and
then the acquired knowledge is transferred to the learning process on the target
datasets. The diagrammatic view of the two-stage learning process is presented in
Fig. 10.
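A minimal transfer learning sketch, assuming the Keras API and an ImageNet-pretrained MobileNetV2 backbone (an illustrative choice of pre-trained model, not the networks used in [32]), is shown below:

# Minimal transfer learning sketch: reuse a pretrained backbone, train a new head.
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet")
base.trainable = False                      # freeze the transferred knowledge

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.2),
    layers.Dense(5, activation="softmax"),  # new head for the target task
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Usage: model.fit(target_train_ds, epochs=5)
# Optionally unfreeze some top layers of `base` afterwards and fine-tune
# with a low learning rate.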
7. Ensemble Learning: Ensemble learning is a machine learning technique based on
the idea that combining multiple models often results in more accurate and robust
predictions than relying on a single model. In ensemble learning, various models
such as decision trees, neural networks, or support vector machines are trained
independently and their predictions are then merged to arrive at a final decision
[33]. The merging of predictions is done using two main methods, i.e., voting and
averaging; voting is typically used for classification, while averaging is mostly used
for regression models. The diagrammatic view of the voting method is shown in
Fig. 11.
Fig. 8 The structure of multi-semantic fusion model for generating high resolution images

The core idea of ensemble learning is that by pooling the intelligence of several
models, it is possible to reduce overfitting and improve overall predictive
performance. The ensemble approach is essential in the framework of big data
analytics for enhancing the accuracy of prediction models and resolving specific
problems related to huge and intricate datasets.
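A minimal sketch of the voting approach, assuming scikit-learn and a synthetic classification dataset (both illustrative choices), is given below:

# Minimal ensemble (voting) sketch with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Independently trained base models whose predictions are combined by voting.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svc", SVC(probability=True, random_state=0)),
    ],
    voting="soft",   # average predicted probabilities; use "hard" for majority vote
)
ensemble.fit(X_train, y_train)
print("ensemble accuracy:", ensemble.score(X_test, y_test))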

4 Discussion

Table 2 presents the overall description of deep learning techniques along with the
applications.

Fig. 9 Transfer learning with a pre-trained network

5 Deep Learning Techniques Implications for Big Data Analytics

Although the deep learning approach provides a number of benefits for massive data
analysis, several restrictions must be taken into consideration. One significant issue
is the voracious appetite for computing resources, particularly when very deep
neural networks are trained on enormous datasets; the sheer size of big data can make
tasks time-consuming and calls for an efficient computing architecture.
Another drawback is the interpretability of deep learning models, which frequently
function as intricate black boxes, making it challenging to understand the basis of
their predictions. The requirement for labeled, task-relevant training data can also
pose challenges in situations where such data is difficult or expensive to acquire.
When using deep learning techniques for large-scale data analysis, concerns about
ethical issues, data security, and potentially biased conclusions remain crucial. These
drawbacks emphasize the value of a cautious strategy, continuous research, and
ethical awareness when utilizing deep learning to extract knowledge from enormous
data sets.

Fig. 10 Training Convolution Neural Network training platform



Fig. 11 Diagrammatic view for voting accuracy



Table 2 Tabular view of deep learning techniques, description, applications, and benefits

• Convolutional Neural Networks (CNNs) [22] — Description: CNNs use convolution layers and are well suited for image and video analysis, extracting spatial characteristics. Applications: healthcare, autonomous vehicles, surveillance, etc. Benefits: feature extraction; high accuracy in image-related tasks.
• Recurrent Neural Networks (RNNs) [5] — Description: RNNs employ feedback connections to capture temporal relationships and are best for sequential data. Applications: natural language processing, speech recognition, finance, etc. Benefits: sequential data analysis; time series forecasting; text generation.
• Long Short-Term Memory (LSTM) [25] — Description: an improved version of recurrent networks (RNNs) that recognizes long-term relationships in sequential data. Applications: predictive text typing, speech-to-text, anomaly detection, etc. Benefits: better gradient flow, which mitigates the vanishing gradient problem and works well with sequential data.
• Auto-encoders [27] — Description: an unsupervised model for feature learning and dimensionality reduction. Applications: image denoising, recommender systems, anomaly detection. Benefits: reduced dimensionality; feature learning; data denoising.
• Generative Adversarial Networks (GANs) [30] — Description: encompass two models, a generator and a discriminator, to produce realistic data. Applications: generating images, transferring styles, enhancing data. Benefits: data augmentation; realistic data production; better image quality.
• Ensemble Learning [33] — Description: combines predictions from multiple models for greater precision. Applications: regression, classification, anomaly detection. Benefits: increased resilience; decreased overfitting; improved predictive accuracy.

6 Conclusion

To sum up, deep learning techniques have become a revolutionary force in big data
analytics, providing unprecedented capacity to extract valuable insights from
enormous and intricate datasets. The ability of deep learning models such as neural
networks, convolutional networks, and auto-encoders to automatically learn intricate
patterns and representations from data has proven invaluable in diverse domains
within big data analytics. In applications ranging from natural language processing
and anomaly detection to image and speech recognition, these approaches have
demonstrated impressive performance. Despite this remarkable progress, challenges
persist, including the need for large labeled datasets, the explainability of complex
models, and the demand for computational resources. However, the synergy between
the two domains keeps spurring innovation, reshaping the landscape of data-driven
decision-making and fostering advancements in areas such as healthcare, finance,
and beyond. As the field evolves, the ongoing exploration and refinement of deep
learning methodologies hold great promise for unlocking deeper insights in big data
analytics.

7 Future Scope

Future prospects for deep learning techniques in big data analytics are extremely
bright thanks to major developments in a number of important fields. Priority should
be given to creating more scalable and efficient deep learning architectures that can
manage even bigger data volumes through algorithm optimization and the use of
distributed computing. Research on interpretable models and decision-explanation
approaches is necessary in order to address the "black-box" character of deep
learning models, especially in industries like healthcare and finance where
transparency is critical. Advances in transfer learning approaches can address data
scarcity challenges and improve these models' generalization capabilities. Holistic
solutions can be found by investigating hybrid models that integrate deep learning
with conventional machine learning methods and fusion techniques. Ethical issues,
especially bias mitigation, remain crucial. The future of deep learning techniques in
big data analytics lies in a multidisciplinary approach, with ongoing collaboration
between researchers, industry practitioners, and policymakers to address challenges
and unlock new potential for these powerful technologies.

References

1. Khaturia, D., Saxena, A., Basha, S.M., Iyengar, N.C.S., Caytiles, R.D.: A comparative study
on airline recommendation system using sentimental analysis on customer tweets. Int J Adv
Sci Technol 111, 107–114 (2018). https://doi.org/10.14257/ijast.2018.111.10
2. Hu, X., Liu, J.: Research on e-commerce visual marketing analysis based on internet big data.
J. Phys. Conf. Ser. 1865,(2021). https://doi.org/10.1088/1742-6596/1865/4/042094
3. Jha, B.K., Sivasankari, G.G., Venugopal, K.R. Fraud detection and prevention by using big data
analytics. In: Proceedins of the 4th International Conference of Computing Methodologies and
Communication ICCMC 2020, pp. 267–274 (2020). https://doi.org/10.1109/ICCMC48092.
2020.ICCMC-00050
4. Aruul Mozhi Varman, S., Baskaran, A.R., Aravindh, S., Prabhu, E.: Deep learning and IoT for
smart agriculture Using WSN. 2017 IEEE Int Conf Comput Intell Comput Res ICCIC 2017,
1–6 (2018). https://doi.org/10.1109/ICCIC.2017.8524140

5. Esteva, A., Robicquet, A., Ramsundar, B., Kuleshov, V., DePristo, M., Chou, K., Cui, C.,
Corrado, G., Thrun, S., Dean, J.: A guide to deep learning in healthcare. Nat. Med. 25, 24–29
(2019). https://doi.org/10.1038/s41591-018-0316-z
6. Rai, R., Tiwari, M.K., Ivanov, D., Dolgui, A.: Machine learning in manufacturing and industry
4.0 applications. Int. J. Prod. Res. 59, 4773–4778 (2021). https://doi.org/10.1080/00207543.
2021.1956675
7. Becha, M., Dridi, O., Riabi, O., Benmessaoud, Y.: Use of machine learning techniques in
financial forecasting. In: Proc 2020 Int Multi-Conference Organ Knowl Adv Technol OCTA
2020 (2020). https://doi.org/10.1109/OCTA49274.2020.9151854
8. Furht, B., Villanustre, F.: Big Data Technologies and Applications. Springer, Cham (2016)
9. Sonde, V.M., Shirpurkar, P.P., Giripunje, M.S., Ashtankar, P.P.: Experimental and dimensional
analysis approach for human energy required in wood chipping process. In: International
Conference on Advanced Machine Learning Technologies and Applications, pp. 683–691
(2020)
10. Ghosh, S., Das, N., Das, I., Maulik, U.: Understanding deep learning techniques for image
segmentation. ACM Comput. Surv. 52 (2019). https://doi.org/10.1145/3329784
11. Jan, B., Farman, H., Khan, M., Imran, M., Islam, I.U., Ahmad, A., Ali, S., Jeon, G.: Deep
learning in big data analytics: A comparative study. Comput. Electr. Eng. 75,
275–287 (2019). https://doi.org/10.1016/j.compeleceng.2017.12.009
12. Zhang, Q., Yang, L.T., Chen, Z., Li, P.: A survey on deep learning for big data. Inf. Fusion 42,
146–157 (2018)
13. Ghaderi, Z., Khotanlou, H.: Weakly supervised pairwise Frank-Wolfe algorithm to recognize
a sequence of human actions in RGB-D videos. Signal, Image Video Process 13, 1619–1627
(2019)
14. Liu, J., Li, T., Xie, P., Du, S., Teng, F., Yang, X.: Urban big data fusion based on deep learning:
An overview. Inf Fusion 53, 123–133 (2020)
15. Sohangir, S., Wang, D., Pomeranets, A., Khoshgoftaar, T.M.: Big data: Deep learning for
financial sentiment analysis. J Big Data 5, 1–25 (2018)
16. Azmi, J., Arif, M., Nafis, M.T., Alam, M.A., Tanweer, S., Wang, G.: A systematic review on
machine learning approaches for cardiovascular disease prediction using medical big data. Med
Eng & Phys 105, 103825 (2022)
17. Li, X., Liu, H., Wang, W., Zheng, Y., Lv, H., Lv, Z.: Big data analysis of the internet of things
in the digital twins of smart city based on deep learning. Futur. Gener. Comput. Syst. 128,
167–177 (2022)
18. Krishnamoorthi, R., Joshi, S., Almarzouki, H.Z., Shukla, P.K., Rizwan, A., Kalpana, C., Tiwari,
B., others: A novel diabetes healthcare disease prediction framework using machine learning
techniques. J. Healthc. Eng. 2022, 1–10 (2022)
19. Gandomi, A.H., Chen, F., Abualigah, L.: Machine learning technologies for big data analytics.
Electronics 11, 421 (2022)
20. Mathew, A., Amudha, P., Sivakumari, S.: Deep learning techniques: An overview. Adv Mach
Learn Technol Appl Proc AMLTA 2020, 599–608 (2021)
21. Ghosh, S., Das, N., Das, I., Maulik, U.: Understanding deep learning techniques for image
segmentation. ACM Comput. Surv. 52, 1–35 (2019)
22. Phung, V.H., Rhee, E.J.: A High-accuracy model average ensemble of convolutional neural
networks for classification of cloud image patches on small datasets. Appl. Sci. 9 (2019).
https://doi.org/10.3390/app9214500
23. Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., Liu, T., Wang, X., Wang, G., Cai,
J., Chen, T.: Recent advances in convolutional neural networks. Pattern Recognit 77, 354–377
(2018). https://doi.org/10.1016/j.patcog.2017.10.013
24. Kasongo, S.M.: A deep learning technique for intrusion detection system using a recurrent
neural networks based framework. Comput. Commun. 199, 113–125 (2023). https://
doi.org/10.1016/j.comcom.2022.12.010
25. Chandra, R., Jain, A., Chauhan, D.S.: Deep learning via LSTM models for COVID-19 infection
forecasting in India. PLoS ONE 17, 1–28 (2022). https://doi.org/10.1371/journal.pone.0262708

26. Zhang, H., Wang, L., Shi, W.: Seismic control of adaptive variable stiffness intelligent structures
using fuzzy control strategy combined with LSTM. J Build Eng 78, 107549 (2023). https://doi.
org/10.1016/j.jobe.2023.107549
27. Bank, D., Koenigstein, N., Giryes, R.: Autoencoders. Machine Learning for Data Science
Handbook: Data Mining and Knowledge Discovery Handbook, pp. 353–374. Springer, Cham
(2023)
28. Chen, S., Guo, W.: Auto-encoders in deep learning—A review with new perspectives.
Mathematics 11, 1777 (2023)
29. Chen, S., Guo, W.: Auto-encoders in deep learning—A review with new perspectives.
Mathematics 11, 1–54 (2023). https://doi.org/10.3390/math11081777
30. Kumar, S., Dhawan, S.: A detailed study on generative adversarial networks. Proc 5th Int Conf
Commun Electron Syst ICCES 2020, pp. 641–645 (2020). https://doi.org/10.1109/ICCES4
8766.2020.09137883
31. Huang, P., Liu, Y., Fu, C., Zhao, L.: Multi-Semantic fusion generative adversarial network for
text-to-image generation. In: 2023 IEEE 8th Int Conf Big Data Anal ICBDA 2023, pp. 159–164.
(2023). https://doi.org/10.1109/ICBDA57405.2023.10104850
32. Wu, P., Guo, H., Buckland, R.: A transfer learning approach for network intrusion detection.
In: 2019 4th IEEE Int Conf Big Data Anal ICBDA 2019, pp. 281–285 (2019). https://doi.org/
10.1109/ICBDA.2019.8713213
33. Mung, P.S.: Effective analytics on healthcare big data using ensemble learning.
IEEE Conf. Comput. Appl. ICCA 2020, 1–4 (2020). https://doi.org/10.1109/ICCA49400.2020.
9022853
Data Privacy and Ethics in Data
Analytics

Rajasegar R. S., Gouthaman P., Vijayakumar Ponnusamy, Arivazhagan N.,


and Nallarasan V.

Abstract Recent innovations in data analytics technologies over the last two decades
have led to a new level of data-driven decision-making in different industries. This
chapter elucidates significant aspects of data privacy and ethics in the domain of data
analytics. Firstly, the chapter details how imperative it is to protect an individual's
personal information. Secondly, it discusses legal frameworks, namely the GDPR
(General Data Protection Regulation) and other data protection laws around the
world, which have greatly influenced awareness of data privacy. Thirdly, it examines
how ethical considerations complement the outcome when these regulations are
complied with. Finally, the chapter offers guidance on how organizations and their
professionals must work meticulously towards handling data ethically. In this regard,
the chapter provides various instances from projects, case studies and real-world
scenarios to discuss how data analytics creates both positive and negative impacts on
individuals and society. To conclude, this chapter focuses on the vital aspects of
combining data privacy and ethics when working with data analytics. Furthermore,
it considers how organizations can follow holistic approaches wherein a blend of

Rajasegar R. S.
IT Industry, Cyber Security, County Louth, Ireland
e-mail: rajasegarrs@outlook.com
Gouthaman P. (B) · Nallarasan V.
Department of Networking and Communications, SRM Institute of Science and Technology,
Kattankulathur, Chennai, India
e-mail: gouthamp@srmist.edu.in
Nallarasan V.
e-mail: nallarav@srmist.edu.in
Vijayakumar Ponnusamy
Department of Electronics and Communications, SRM Institute of Science and Technology,
Kattankulathur, Chennai, India
e-mail: vijayakp@srmist.edu.in
Arivazhagan N.
Department of Computational Intelligence, SRM Institute of Science and Technology,
Kattankulathur, Chennai, India
e-mail: arivazhn@srmist.edu.in


technology safety, legal frameworks and ethical awareness can be infused into their
work culture when their employees are dealing with data in various projects in the
future.

1 Foundations of Data Privacy

Data privacy is both a public and a private phenomenon, and its processes have
consequences for the individual as well as the community. This dual nature prevents
privacy from being treated as a purely technological matter and highlights the various
factors linked with it. The right to privacy is understood as integral to an individual's
freedom, but it can also be viewed as the capability to hide particular information for
malpractice. The article [1] delves deeper into this contradiction of data privacy and
analyses different terms and procedures that have arisen in this field of study.
It is important to understand ethics and privacy [2]: professional ethics are a code
of conduct that governs the way members of various domains work with each other
and with other stakeholders. The author then focuses on the ethics of human
behaviour and the fundamental objectives of an organization as a clear portrayal of
its professionalism. Privacy is referred to as the right to be left alone, implying that
there must be no intrusion upon seclusion and no public disclosure of private facts
or false information.
Recently, the usage of digital data has grown exponentially in information tech-
nology, where huge volumes of data are collected, processed and used. Data Science
is a multidisciplinary field that derives knowledge from datasets comprising both
structured and unstructured data; huge datasets can be analysed to gain useful
information. Data science is also regarded as a branch of statistics since it utilizes
various concepts from that domain, and to avoid errors during data analysis it is
vital to have valid data and models [3].
In this decade, various industries are exploring data-driven approaches and data
analytics with machine learning, treating it as one of the most innovative computing
technologies. A major benefit is predictive analytics, which assists in predicting
sensitive attributes, future performance, risk and other functions linked to specific
communities or individuals on the basis of their large sets of behavioural and usage
data. The article [4] focuses on the significant ethical and data protection
implications of predictive analytics when it is used to predict sensitive details of
single individuals, or to treat those individuals differently, based on data collected
from many other unrelated individuals.
In European countries in particular, applications of learning analytics are shaped by
the recently implemented European General Data Protection Regulation (GDPR). In
addition, universities in Finland have recently been working towards implementing
learning analytics nationwide, with several multidisciplinary projects conducted in
this domain. The article [5] provides a study in which students were questioned about
ethical concerns in data gathering and its usage for learning analytics. The outcome
showed that the students were positive about the possibilities of learning analytics;
however, they were also concerned about safety and how their personal data was
used.
Current research on differential privacy, whose applications have grown into several
areas in recent years, highlights an inherent trade-off between a dataset's privacy and
its usefulness for analytics. This trade-off hinders emerging applications of
differential privacy that aim to protect privacy in datasets while still enabling
analytics. In contrast, the author in [6] shows how to use differential privacy to
extract the necessary analytics from the original dataset in an accurate manner.
Furthermore, using the proposed technique, it is demonstrated that differential
privacy can support both robust privacy and accurate data analytics.
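As a simple illustration of this idea (a minimal sketch, not the technique proposed in [6]), the classic Laplace mechanism adds calibrated noise to an aggregate query so that an analyst still obtains a useful statistic while any single individual's contribution is masked; the dataset and epsilon value below are illustrative assumptions:

# Minimal Laplace-mechanism sketch for a differentially private count.
import numpy as np

def dp_count(values, predicate, epsilon):
    """Release a differentially private count of records satisfying `predicate`."""
    true_count = sum(1 for v in values if predicate(v))
    sensitivity = 1   # adding or removing one record changes a count by at most 1
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

ages = [23, 35, 41, 29, 52, 37, 61, 44]
# Smaller epsilon -> stronger privacy but noisier (less accurate) analytics.
print(dp_count(ages, lambda a: a > 40, epsilon=0.5))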
In this digital era, managing big data throughout its lifecycle is a huge challenge
for government organizations. Despite immense attention to this ecosystem, proper
big data management remains difficult. The article [7] seeks to address this issue by
suggesting a data lifecycle outline for data-driven governments. Moreover, the
authors identified nearly 70 data lifecycles and analysed them to recommend a data
lifecycle framework.
During the past two decades, many open platforms, namely social networks and
mobile devices, have contributed to data collection; the volume of such data has
grown over time to become big data. Initially, big data processing did not focus on
the sensitivity of structured and unstructured data. However, it has become essential
to incorporate security and privacy so that the risk of exposing personal information
is largely curtailed. The primary benefit of the proposed technique [8], the Secure
MapReduce model, is to encourage knowledge mining through data sharing while
protecting privacy.

2 Introduction to Data Analytics and Its Impact

In this digital era, advertising through social media has gained immense attention
within the advertising world thanks to data-driven targeting techniques. Social media
advertising promises a high return on investment since the technology can reach an
extremely specific audience. The study [9] outlines how advertising for clinical
studies through social media can lead to serious societal risks: it is extremely hard
to differentiate well-intentioned campaigns from unfavourable social media
advertisements. To counter this, it is vital to follow research ethics guidelines and
improve the regulation of big data and inferential analytics. The study concludes
that social media advertising may not be appropriate as a recruitment tool for
clinical studies as long as the processing of social media usage data, and the training
of predictive models by data analytics and artificial intelligence organizations, are
not properly regulated.
Different technologies and advancements that organizations are adopting on their
journey towards successful digital transformation are discussed in [10]. It is stated
that the Internet of Things (IoT) acts as a significant source of data growth, and the
advent of cloud storage and cloud computing represents an imperative evolution
within the hardware/software ecosystem. The study highlights Artificial Intelligence
with game-changing data analytics and emerging technologies, namely distributed
ledger technology, intelligent character recognition and Blockchain. In addition, it
utilizes natural language processing, a linguistics discipline focused on
understanding and replicating human language and speech patterns.
The rise of digital policing, shaped by new data analytics practices, has hugely
impacted the general public's privacy rights and the associated civil liberties within
the criminal process. In this decade, extensive scrutiny from the media, academic
studies and regulatory bodies has influenced policing data analytics practices in
different nations. Technological advancements such as hotspot policing, live facial
recognition and data extraction from mobile devices are considered contentious for
various reasons. To counter these concerns, measures such as police data ethics
boards have been introduced, augmented with soft regulation through codes of
practice and recommendations from different investigative studies. With regard to
police data analytics, [11] explains how the themes of algorithmic justice can be
distinguished in the framework of the United Kingdom and where they lead,
primarily regarding privacy rights within the criminal process.
The immense potential of Artificial Intelligence to support tactical organizational
decision-making is still in its developing stages. The findings of the article [12] are
detailed in a conceptual model that first elucidates the ways in which Artificial
Intelligence can enable humans to make decisions under uncertainty and then
categorizes the challenges, pre-conditions and consequences that need to be worked
out. It is clear that human responsibility increases, and the skills necessary to utilize
the technology differ from those needed for other machines, which underscores the
significance of education.
Recently, Building-to-Grid (B2G) integration has been trending, and digitalization
has been contributing to it in an extremely significant way. The study [13] reviews
emerging technologies, namely 5G, Big Data, Blockchain, Artificial Intelligence and
IoT, and the crucial challenges of applying these emerging technologies in the B2G
ecosystem. Furthermore, it suggests future research directions for the Building-
to-Grid ecosystem, particularly ecosystem modelling and simulation, the role of B2G
in smart cities, the organization of the B2G ecosystem and various other emerging
technologies in B2G.
ChatGPT, an Artificial Intelligence interface that converses with people and
responds using natural language processing and machine learning techniques, has
been trending recently. The article [14] examines the effect of this application on
data science and outlines the possible benefits and drawbacks of using ChatGPT in
data science. In addition, it elucidates how ChatGPT can enable data scientists to
automate different tasks in their work, such as data cleaning, pre-processing, model
training and investigating outcomes. Furthermore, the author focuses on the
implications of interpreting the output produced by ChatGPT, which in turn may
raise concerns for decision-making in data science applications.

3 Ethical Frameworks for Data Analytics

In recent times, due to the implementation of the GDPR and CCPA, websites have
begun to ask users for consent through cookie banners. These banners let users state
their preference about which cookies are to be allowed. While requesting consent
prior to storing personal information is to be appreciated from a user privacy
standpoint, research has shown that many websites do not always respect users'
choices. The article [15] investigates whether websites use more persistent and
sophisticated means of tracking to follow users who state that they do not accept
cookies. Such forms of tracking include ID synchronization, browser fingerprinting
and so on; moreover, tracking was found to intensify even when users declare that
they reject all cookies.
The study [16] focuses on offering fundamental knowledge and understanding of
certain significant principles of data protection law by elucidating some key concepts.
In particular, it highlights that data processing may be initiated under any of six
different lawful grounds. Although the discussion is general and does not provide
exhaustive detail, it offers "actionable knowledge": the reader is able to work out and
implement the data protection principles in data science applications, which in turn
lets them use data in a socially responsible way.
The processes of data collection and utilization have changed during the past two
decades [17]. Earlier, gathering data involved raising requests with the IT helpdesk,
waiting a week or two, and then putting considerable effort into presenting it in the
required format. Currently, by contrast, almost everyone has access to the required
data and can perform their own analysis on fast computers equipped with powerful
analytics tools. Having said that, it is imperative that data is collected appropriately
and screened, transformed and analysed with lawful techniques so that it provides
relevant information, which in turn is used as intelligence to take precise decisions
that make a business successful. To reiterate, data analytics is the practice of
collecting, processing and analysing data to identify necessary information, to make
recommendations and to enable problem-solving and decision-making.
In this decade, smart cities are emerging as a technological reality that will very soon
shape the day-to-day lives of people in both developed and developing nations. With
respect to big data issues within smart cities, privacy and security are identified as
major concerns because of the sensitivity of data in healthcare, cyber security,
e-governance, mobile banking and many other areas. This discussion recapitulates
recent advances in solving big data privacy and security issues in digital cities and
then highlights potential research directions in this under-explored area. The
utilization of IoT (Internet of Things) devices in smart cities has brought certain
benefits but has also led to security issues arising from their applications. In addition,
there are further privacy issues in digital cities created by security concerns around
the Internet of Things, Big Data and Information Communication Technology-enabled
applications, and a clear understanding of these is needed to develop better and more
resilient privacy-aware digital cities [18].
The significant challenges are listed and investigated in [19] to address the knowl-
edge gap among subject matter experts, organizations, businesses and the general
community in accepting, encouraging and using Blockchain technology. The
challenges highlighted include data privacy and cyber security. Moreover, ethical and
legal challenges vary significantly; building consensus and trust within public and
private organizations for implementing such defensive data management techniques
directly involves accountability and well-designed systems that provide certainty and
fairness. A framework is recommended that incorporates technologies including
Machine Learning, Big Data and visualization approaches and procedures.
Industry 5.0 is the next generation in the manufacturing field and production
systems which put state-of-the-art technology together with human intelligence and
skills. The emergence of Industry 5.0 [20] in the health sector, also referred to as
Healthcare 5.0, and its positive impacts on the healthcare industry are investigated
in this article. In addition, the study [21] elucidates the complexities and issues which
need to be worked out for Healthcare 5.0 to be implemented successfully, together
with data security, privacy, ethical and legal concerns, the necessity for suitable skills
along with training of healthcare individuals and cost-effectiveness.
Digital privacy and the associated notion of the personal data economy, as perceived
by consumers, are identified. In this study [22], online data protection, personal data
sharing and ethics are examined, and a survey was framed and conducted to collect
views, largely from university students, on these topics. The outcome of this research
portrays the importance of the General Data Protection Regulation (GDPR) for data
protection and indicates that additional regulation and transparency are necessary.
It is significant to prioritize digital privacy education as a security awareness initiative,
focusing on building an ecosystem amid every stakeholder, namely users, businesses,
governments, etc., to disseminate, maintain and accumulate personal information in
an accountable and ethical way.
Recently, considerable efforts are being taken to address environmental issues and
climate change with technologically advanced solutions built on Artificial Intelligence,
the Internet of Things and Big Data. The potential of this technological combination
is being harnessed by smart cities to progress towards and accomplish environmentally
sustainable smart cities. This article [23] examines and showcases that these cities are
trending now due to the swift adoption of digitalization and decarbonization
post-COVID-19. On the other hand, these technologies do incur environmental costs
and lead towards ethical risks and regulatory challenges.

4 Building a Culture of Data Ethics

In this proposed work, the importance of creating a culture of Data Ethics will be
discussed through three key areas:

• Data lifecycle
• Challenges in data privacy
• Proposed solution to the identified key challenge

4.1 Data Lifecycle

We shall start to build a culture of Data Ethics by understanding the Data Lifecycle.
The data lifecycle, also known as the data management lifecycle, refers to the stages
through which data goes from its creation or acquisition to its eventual retirement or
disposal. This concept is crucial in data management and governance to ensure that
data is effectively and securely managed throughout its entire existence. The data
lifecycle typically consists of several key stages as shown in Fig. 1:

4.1.1 Data Creation

This is the first stage where the creation of data starts from different sources. Data
Creation/Acquisition is where the Data is generated, collected, or acquired from
various sources, such as systems, sensors, users, applications, or external databases.
The initial generation or import of data into an organization’s systems is taking place
at this stage.

4.1.2 Data Ingestion & Storage

The second stage of Data Lifecycle involves Data Ingestion & Storage. Data Inges-
tion: After data is created or acquired, it needs to be ingested into data storage systems.
This can involve data transformation, validation and indexing, making it ready for
storage and processing. Data Storage: Data is stored in databases, data warehouses
or other storage solutions. This stage involves decisions about the type of storage,
data organization and access control.

Fig. 1 Data lifecycle



4.1.3 Data Usage

The third stage is Data Usage, which involves three key areas:
1. Data processing/analysis
2. Data presentation/visualization
3. Data sharing/distribution
Firstly, Data Processing involves storage, processing and analysis of data for
several reasons, namely, business intelligence, reporting and for training or machine
learning. This phase encompasses mining insights as well as significant detailing of
data. Secondly, Data Visualization is where end users are provided with processed
data by means of reports and other visualization tools in order to let them understand
easily and to act upon. Finally, Data Distribution involves sharing of data within an
organization otherwise with external stakeholders. In addition, this makes sure that
the appropriate user or systems receive data access with the assurance of security
and privacy.

4.1.4 Data Retention & Archival

Data Retention and Archiving is the fourth stage where organizations need to deter-
mine how long data should be retained based on regulatory requirements and business
needs. Archived data is usually stored in long-term storage solutions. It can be stored
within the organization's on-premises storage or cloud storage, based on the business
and regulatory compliance requirements.

4.1.5 Data Destruction

The fifth stage is Data Destruction, which involves:
1. Data backup and disaster recovery
2. Data governance and security
3. Data deletion/retirement
4. Data audit and compliance
5. Data discovery and metadata management
Firstly, Data Backup with Disaster recovery is where data is backed up on a regular
basis in order to avoid data loss arising when the system fails or disasters occur. This
is one of the imperative features of data management to warrant data resilience.
Secondly, Data Governance with Security involves data governance practices in
the due course of data lifecycle so as to confirm the quality of data, regulation
compliance and data security. This requires appropriate policies, access controls as
well as monitoring. Thirdly, Data Deletion refers to data which is no longer needed or
is obsolete being safely deleted or retired, since this enables the maintenance of data
privacy and compliance with regulations such as the GDPR.
Next, Data Audit and Compliance involves steering audits in a regular manner to
make sure that there is alignment of data management practices as per organization’s
policy and objectives. Finally, Data Discovery and Metadata Management involves
Metadata which is understood to be offering information on data that is maintained
and systematized to enable data discovery to identify and utilize data assets in an
effective manner.
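As a rough illustration of how a retention rule from the retention and deletion stages might be automated, the following Python sketch checks record ages against a policy window; the 365-day period, the record layout and the print-out standing in for secure erasure are illustrative assumptions, not requirements drawn from any particular regulation.

from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=365)   # hypothetical policy value; real periods come from regulation and business needs

def is_expired(record, now):
    """A record is expired once its creation date falls outside the retention window."""
    return now - record["created_at"] > RETENTION

def apply_retention(records):
    """Keep only records inside the retention window; expired ones would be securely erased or archived."""
    now = datetime.now(timezone.utc)
    kept = []
    for rec in records:
        if is_expired(rec, now):
            print("retire record", rec["id"])   # placeholder for secure deletion or archival
        else:
            kept.append(rec)
    return kept

records = [
    {"id": 1, "created_at": datetime(2020, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "created_at": datetime.now(timezone.utc)},
]
print(apply_retention(records))

Running the sketch retires the first record and keeps the second; in a real system the print statement would be replaced by the organization's approved erasure or archival routine.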
The particular phases and processes mentioned in the data lifecycle might differ
based on an organization’s magnitude, field and the way in which they manage their
data. It is imperative to follow effective data lifecycle management for achieving
quality data, security and compliance thereby enabling organizations to accomplish
value through the available data assets provided risks, such as data misuse or loss
are mitigated.

4.2 Challenges in Data Privacy

The challenges that are encountered in Data Privacy are discussed in this section. We
will try to identify one common problem and come up with a proposed solution in
the following section. The block diagram in Fig. 2 shows the most common Data
Privacy challenges that are faced.

4.2.1 Embedding Data Privacy

The process of embedding data privacy may be a tough task, especially in this digital
era, with data and the associated privacy issues increasing on a day-to-day basis. There
are significant challenges involved when applying data privacy, namely regulatory
compliance and consent management. To begin with, regulatory compliance involves
organizations obeying the guidelines posed by regulatory bodies, namely, CCPA and
GDPR, which appears to be a huge challenge and if not followed would make them
end up paying penalties. Next, consent management which details the significance
of getting users’ consent for processing their data, and flexibility is to be maintained
in such regards. To conclude, applying data privacy means various aspects, such as
legal, technical and organizational aspects are to be considered. It is imperative that
organizations are adhering towards maintaining their customer and other stakeholder
information appropriately.

4.2.2 Increasing Devices

It is extremely significant yet challenging to manage user consent and offer users
control over the data produced through IoT devices. Users need enough knowledge of
why data is being gathered and how it is utilized, enabling them to decide whether to
opt out. Next, IoT ecosystem complexity: plenty of devices are connected in an
ecosystem, which makes the data flow even more complex. It is a humongous task to
properly understand and manage
the consequences behind data privacy in this ecosystem. To sum up, the IoT devices
that are interconnected and growing on a day-to-day basis lead towards data privacy
concerns and needs adequate measures with respect to planning and implementa-
tion of necessary frameworks to safeguard every individual’s data when gaining the
benefits through this technology.

4.2.3 Growing Maintenance Costs

To begin with, resource allocation is where organizations are required to comply
with data privacy and appropriate cyber security measures and, in this regard, they
need to allot the required finance, time and team. This resource balancing could be
humongous task for organizations. Next, complying with data privacy regulations
may be expensive since it needs regular audits, valuations and software or tools
compliance. Finally, to safeguard confidential information, organizations need to
maintain vigorous cyber security procedures which involve investment in encryption,
firewalls and security teams.

4.2.4 Intricate Access Regulation in Various Sectors

Access regulations implemented in different sectors are complicated to deal with,
and this may lead to various data privacy concerns for organizations, as shown
in Fig. 2. Starting with a diverse regulatory landscape: different sectors and
regions mostly have distinct data privacy regulations and processes, and managing and
complying with such complex regulations is pragmatically challenging. Next, Intercon-
necting regulations where certain organizations are forced to implement several sets
of regulations that usually connect at some point which results in misperception and
intricacies in understanding which one to prioritize. To conclude, organizations must
maintain well-developed data privacy and compliance policies which work for their
specific domain appropriately. In order to make this effectively work, it is necessary
for organizations to involve legal and compliance subject matter experts, allocating
funds for technology solutions and prioritizing data privacy amongst their business
processes and practices.

4.2.5 Gaining a Holistic Perspective of the Available Data

It is extremely important to achieve a holistic perspective of the collected data in
order to perform data privacy management, and this may not be an easy task. To start
with Data Silos: data is mostly available across diverse systems and departments in

Fig. 2 Data privacy challenges

an organization, which makes it hard to aggregate and analyse data in a comprehen-
sive manner. Next, Data Types which are regarding various kinds of data, namely,
unstructured, semi-structured and structured data, necessitate diverse tools and tech-
niques to analyse and consolidate. To counter these challenges and gain a holistic
perspective of data, organizations must take the initiative to finance data integration
and management then build robust data governance frameworks and make sure to
incorporate privacy as well as security aspects are built into their data consolidation
activities. Furthermore, data privacy regulations posed by regulatory bodies must be
complied and provide significance during gathering and consolidating data.

4.2.6 A Noxious Data Milieu

A harmful data environment, characterized by poor data quality and unethical or
inappropriate data management, leads to various substantial data privacy concerns.
To begin with Data Accuracy: poor-quality data might lead to imprecise and inadequate
information, thereby creating privacy risks when organizations use such data for making
decisions, particularly if it relates to personal information. Consent and Transparency
concerns the environment where data management is unethical or not transparent;
gaining informed consent for data processing then becomes a difficult task, since
individuals may not clearly comprehend the way their data is being used. To overcome
these concerns, it is imperative to enhance the organizational culture so that employees
follow ethical data management and focus on complying with the data privacy
regulations. Moreover, it may require reassessing the way data is gathered and used
and applying stringent procedures for data protection and compliance.

4.2.7 Data Size Growing Incessantly

Working with and safeguarding data privacy in a world of incessant data growth leads
to various challenges. Starting with Data Breach: huge datasets carry greater risks of
data breaches, since the more data an organization holds, the more malicious attackers
will focus on exploiting or stealing it. Next, Data Classification concerns how hard it
is to classify and label sensitive information when managing colossal amounts of data,
which makes implementing specific security controls a challenging activity. To counter
these concerns, a blend of data governance, vigorous security measures and compliance
with data privacy regulations is required. It is crucial for organizations to apply
wide-ranging data management policies with suitable data classification, access
controls and encryption so that they safeguard confidential information and preserve
data privacy despite such data growth.

4.2.8 Wide-Ranging Guidelines and Administration to Follow

Managing wide-ranging data privacy guidelines and maintaining proper administration
can be a challenging task for organizations. To begin with, Regulatory Intricacy:
various regions and businesses have different, overlapping data
privacy regulations. In this regard, having enough knowledge of and following those
diverse guidelines may be challenging and time-consuming. Next, Legal Exper-
tise which is about decoding and implementing the legal language as per the data
privacy regulations generally need the guidance of legal experts and that could be
expensive for most organizations. To counter these concerns, it needs a detailed and
systematized procedure for data privacy management, and organizations must work
towards practising a proactive approach in complying with data privacy regulations,
data governance and considering experts’ legal opinions. Furthermore, being up to
date regarding the amendments in privacy requirements posed by regulatory bodies
is essential for organizations to maintain long-term compliance.

4.2.9 Risks Arising Due to Data Breach and Cyber Attacks

When discussing data privacy, the risk of data breaches and cyberattacks appears to
be inevitable. Starting with Data Security: protecting an organization's sensitive data
from cyber threats demands huge efforts from the organization's security team. Next
come technological advancements in cyber threats: cyber threats are becoming more
sophisticated day by day and are extremely challenging to defend against. The basics
of cyber security, ethics and law [4] portray different issues of the domain; topics
such as ethical hacking and cyber war are elucidated. In addition, it provides
suggestions and suitable practices for cyber
security professionals involved in different application areas. Later, the significance
of renewed efforts is detailed to highlight responsible state behaviour which might
need better involvement towards the private sector and civil society where both
these contribute towards higher stake levels in cyber space. To overcome these chal-
lenges, it is vital that organizations take countermeasures, namely, risk assessments,
employee awareness training and installation of robust security technologies within
their systems in addition to compliance with data privacy regulations.

4.2.10 Data Literacy and Consciousness

To effectively protect individuals' confidential information, organizations must have
enough knowledge regarding the data privacy concerns and the significance of data
privacy. Starting with the intricacy of data privacy regulations: rules such as the CCPA
and GDPR are mostly hard to understand, and most organizations as well as the public
struggle to construe and follow them. Next, data literacy and privacy
awareness are mostly deficient.
not get enough training for best practices regarding data privacy and how to utilize
data in a responsible way. To manage these concerns, it is vital for organizations to
invest and incorporate multi-faceted techniques where there are significant training
camps, user-friendly privacy applications and a cultural paradigm shift for respecting
data privacy.

The analysis made in this section identifies a key outcome: Consent, Transparency
and Consent Management from the layman point of view must be implemented by
organizations as per the guidelines of regulatory bodies.

5 Proposed Solution

The proposed solution for the identified challenge: Consent, Transparency and
Consent Management from the layman point of view is discussed in this topic. The
use case example which we are going to use for this research work is a user browsing
the internet to access information from websites and the consent and transparency
from his perspective. Figure 3 shows the comparison of the Data Lifecycle with the
high-level view of real-time end to end usage of data.
The data source is where the actor starts to generate data. In our use case example,
the actor is a layman user who is going to search for information on an internet
website and access it. The actor launches an internet browser (e.g., Google Chrome),
starts typing the search keywords in the search engine web page and accesses the
search results. Another scenario from the same use case is the same user launching
social media applications (e.g., Facebook, Instagram) from a personal device (e.g.,
mobile phone or laptop). Data creation starts from that point in time. The generated data is Stored/
Used/Archived/Destroyed at the Organization level. And that is happening with the
supervision of regulatory bodies through Ethics, Compliance, Regulations, Policies,
Standards and Laws.
Once the user opens the website from which he intends to access information, the
website starts to collect data from the user through a fundamental component of web

Fig. 3 End to end data usage flow


browsing and online interaction called cookie technology. Cookies are small amounts
of data stored by websites on your computer or device when you
access that website. These cookies are primarily used to track and maintain
information about a user's online presence, activities, preferences and interactions.
There are a few major purposes of cookies, which are listed below:
• User tracking—These cookies let websites track user’s behaviour. For instance,
websites track login status, products in shopping baskets or pages recently visited.
The benefit of this is to customize user’s experience then to provide appropriate
content.
• Authentication cookies—They are used for authenticating purposes. To para-
phrase, when a user logs onto a web portal then they are issued with a session
ID. Session ID is identified for following interactions to approve identity so as to
avoid repeated login credentials being asked in the same portal.
• Remembering preferences—In this, cookies do store the preferences of user,
namely, language, layout and notifications to develop and make sure to provide
each user with a personalized experience.
• Targeted advertising—It is well known that cookies are utilized for online adver-
tising in a huge manner. In this regard, advertising organization utilizes cookies
for tracking user interests and then displays advertisements which are specific to
what they have been browsing recently.
• Analytics—These cookies are used by website owners to gather data regarding
the way in which users communicate in their portals and then utilize that infor-
mation for enhancing their website’s performance and provide personalized user
experience.
• Session management—These cookies are imperative to manage user sessions as
they enable to keep a track of users when they go through the website thereby
providing them with a wholesome experience.
There are four different types of cookies (a minimal cookie-handling sketch follows this list):
• Session cookies: These are called session management cookies which are tempo-
rary cookies that are deleted from your device when you close your web
browser.
• Persistent cookies: These cookies are often used for functions like remem-
bering preferences and authenticating users, and they remain on your device for
a particular period or until you manually delete them.
• First-party cookies: These cookies are commonly used for session management
and user preferences, and it is set by the website you are currently visiting.
• Third-party cookies: These cookies are used for cross-site web tracking such as
for analytics and publishing Ads and it is set by domains other than the one you
are visiting.
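To make the consent aspect concrete, the following Python sketch (using only the standard library's http.cookies module) builds a session cookie unconditionally and a persistent preference cookie only when consent has been given; the cookie names, lifetime and consent flag are illustrative assumptions rather than a prescription from any regulation or website.

from http.cookies import SimpleCookie
import secrets

def build_cookies(consent_given):
    """Build response cookies, setting the non-essential persistent cookie only after consent."""
    cookies = SimpleCookie()

    # Session cookie: no Max-Age/Expires, so the browser discards it when it is closed.
    cookies["session_id"] = secrets.token_hex(16)
    cookies["session_id"]["httponly"] = True

    if consent_given:
        # Persistent preference cookie: kept for 30 days only if the user opted in.
        cookies["lang"] = "en"
        cookies["lang"]["max-age"] = 30 * 24 * 3600
    return cookies

print(build_cookies(consent_given=False).output())
print(build_cookies(consent_given=True).output())

In the first call only the session cookie is emitted; in the second, the persistent preference cookie is added, mirroring the distinction between essential session management and consent-dependent tracking or preference cookies.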
Consent, Transparency and Consent Management from the layman point of view
need to be implemented by organizations as per the guidelines of regulatory bodies.
It is vital to understand that cookies do have some benefits; however, they may
also raise privacy issues. There is a high possibility of a user's online
behaviour being tracked, which may not be acceptable. To counter these issues, web
browsers are recently offering choices to manage cookies which allow the user to
block or completely delete them. In addition, regulations, namely the California
Consumer Privacy Act (CCPA) and the General Data Protection Regulation (GDPR),
are introducing guidelines about the way in which websites must receive consent from
users for tracking.
We propose that the best way to protect an individual's data is through Privacy
Enhancing Technologies (PETs), which are a wide set of tools, techniques and prac-
tices designed to protect and enhance individual privacy and data security in
the current digital era. The aim of these technologies is to give individuals more
control over their personal information, reduce the risks associated with sharing data
and mitigate the potential for surveillance and misuse of personal data. Here are a few
key aspects and examples of Privacy Enhancing Technologies (PETs); a minimal
illustrative sketch follows the list:
• Data Minimization: limits the data collected to only what is needed to fulfil the
required purpose.
• Masking: makes an individual's data unreadable when it is displayed or printed.
• Pseudo-anonymization: replaces identifiable data with a consistent, reversible
substitute value.
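The following Python sketch illustrates these three techniques on a single invented record; the secret key, field names and the lookup table used to keep pseudonyms reversible are assumptions made for the example, not a production design.

import hashlib
import hmac

SECRET_KEY = b"replace-with-a-managed-secret"   # hypothetical key; in practice held in a secrets manager
pseudonym_table = {}                             # kept separately so pseudonyms stay reversible

def mask(value, keep=1):
    """Masking: render the value unreadable when displayed, keeping only the first characters."""
    return value[:keep] + "*" * (len(value) - keep)

def pseudonymize(value):
    """Pseudonymization: replace an identifier with a consistent token; the table allows reversal."""
    token = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]
    pseudonym_table[token] = value
    return token

record = {"name": "Alice Example", "email": "alice@example.com", "phone": "0851234567"}
minimal = {"email": record["email"]}        # data minimization: keep only the field needed for the purpose

print(mask(record["phone"], keep=3))        # phone number with all but the first three digits hidden
print(pseudonymize(minimal["email"]))       # the same e-mail always yields the same pseudonym

Because the same input always produces the same pseudonym, analyses can still link records belonging to one individual without exposing the underlying identifier.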
Privacy laws and regulations, such as the European Union's General Data Protection
Regulation (GDPR) and the California Consumer Privacy Act (CCPA), are legal
frameworks rather than technologies, but they play a highly significant role in shaping
privacy practices by imposing rules and obligations on the data handlers (organizations).
Regulatory bodies should act on the countermeasures proposed in this research
work, so that organizations comply with the regulatory bodies' action points and take
them into account when they are handling data in an ethical way. The possible means
of bringing this awareness is to encourage organizations to implement educational
workshops, meetings and seminars on handling data in an ethical manner.
The benefit of this research work lies in involving privacy enhancing technologies
(PETs), which comprise a wide range of tools, techniques and practices designed to
protect and enhance an individual's privacy and data security. These technologies
provide users with more control over their personal information, lower the risks
associated with sharing data and mitigate the potential for surveillance and misuse of
personal data. In addition, this study proposes embedding PET security controls into
regulations, which gives organizations more focus on, and attaches more significance
to, an individual's privacy.

6 Future Trends and Challenges

The power of the word "DATA" has begun to revolutionize the current digital era. In
the rapidly evolving landscape of the digital age, data has emerged as the lifeblood
of our interconnected world, shaping industries, driving innovation and fundamen-
tally altering the way we live and work. As we stand on the threshold of the twenty-
first century, the power of data has reached unprecedented levels, promising to
revolutionize every aspect of our lives.
Data-driven decision-making is no longer a choice but a necessity. Businesses,
governments, healthcare systems and individuals are harnessing the power of data to
gain insights, make informed choices and drive progress. From advanced analytics
and artificial intelligence to the Internet of Things, the possibilities are boundless.
Forbes describes the future of data in comparison to the oil market as "Data is the
new oil – and that's a good thing". One of the trending technological advancements
is autonomous vehicles; however, they are still in the developing stages. The advantages
are widely publicized, namely safer roads and minimized rush-hour congestion.
However, the biggest benefit is lowering the greenhouse gases emitted by automobiles.
Recent research conducted by a team at Poznan University predicts that
autonomous vehicles may ultimately reduce greenhouse gases by 50%. Accomplishing
this requires humongous amounts of data, that is, petabytes of data flowing into a
data lake from which the advanced machine learning results for autonomous
self-driving vehicles will be achieved.
On the other hand, there will be terabytes of data per week per vehicle being
generated through these contemporary platforms. Having said that, it is the new oil
that is being accumulated, as many extra bytes of data every year. Experts are raising
genuine concerns about how tech giants are utilizing our confidential information;
however, there are innumerable means through which these data can help enhance
the way of living in this world [24].
The Economist, in turn, describes data, and no longer oil, as the most precious resource
of the future: "The world's most valuable resource is no longer oil, but data". In this
digital world, tech giants, namely Alphabet, Apple, Microsoft and Amazon, deal with
data in an unstoppable manner, whereas earlier it was oil that was the resource in
question. At the same time, their successful research on user data has mostly benefitted
consumers. Few users want to do without Google's search engine or Amazon's
one-day delivery; on the other hand, the above firms do not raise alarms when the
usual antitrust tests are tried out. What is to be noted is that many of the services
provided by these organizations are free, where users instead pay by providing
more of their data [25].
As we move deeper into the future, the potential of data is limitless. However, its
power must be harnessed responsibly, with a commitment to ethics and privacy. The
fusion of technology and data offers us an exciting future, full of opportunities for
progress and a more connected, efficient and informed world. Embracing the power
of data in future is not just a choice; it’s a transformative journey that will define our
future.

References

1. Bhageshpur, K.: Data Is the New Oil—And That’s a Good Thing. Forbes Technology Council
2. Bibri, S.E., Alexandre, A., Sharifi, A., Krogstie, J.: Environmentally sustainable smart cities
and their converging AI, IoT, and big data technologies and solutions: an integrated approach
to an extensive literature review. Energy Infor. 6(9), 32 (2023)
3. Christen, M., Gordijn, B., Loi, M.: The ethics of cybersecurity. In: International Library of
Ethics, Law and Technology, pp. 1–8. Springer Science and Business Media B.V (2020)
4. Gellert, R.: Data protection law and responsible data science. In: Data Science for Entrepreneur-
ship. pp. 413–439. Springer, Cham (2023)
5. Gomathi, L., Mishra, A.K., Tyagi, A.K.: Industry 5.0 for healthcare 5.0: Opportunities, chal-
lenges and future research possibilities. In: 7th International Conference on Trends in Elec-
tronics and Informatics, ICOEI 2023—Proceedings, pp. 204–213. Institute of Electrical and
Electronics Engineers Inc. (2023)
6. Grace, J.: Exploring algorithmic justice for policing data analytics in the United Kingdom. In:
Privacy, Technology, and the Criminal Process, pp. 18–38. Taylor and Francis (2023)
7. Hassani, H., Silva, E.S.: The role of chatgpt in data science: How AI-assisted conversational
interfaces are revolutionizing the field. Big Data and Cognitive Comput. 7, (2023) https://doi.
org/10.3390/bdcc7020062
8. Jain, P., Gyanchandani, M., Khare, N.: Enhanced secured map reduce layer for big data privacy
and security. J Big Data. 6, (2019). https://doi.org/10.1186/s40537-019-0193-4
9. Jiang, R., Bouridane, A., Li, C.T., Crookes, D., Boussakta, S., Hao, F., Edirisinghe, E.A.: Big
Data Privacy and Security in Smart Cities. Springer, Cham (2022)
10. Kaufmann, U.H., Tan, A.B.C.: Why data analytics is important? In: Data Analytics for
Organisational Development. pp. 1–20. John Wiley & Sons (2021)
11. Ma, Z., Clausen, A., Lin, Y., Jørgensen, B.N.: An overview of digitalization for the building-
to-grid ecosystem. Energy Inform. 4(Suppl. 2), Article 36 (2021). https://doi.org/10.1186/s42162-021-00156-6
12. Mühlhoff, R., Willem, T.: Social media advertising for clinical studies: Ethical and data protec-
tion implications of online targeting. Big Data Soc. 10 (2023). https://doi.org/10.1177/205395
17231156127
13. Mühlhoff, R.: Predictive privacy: Towards an applied ethics of data analytics. Ethics Inf.
Technol. 23, 675–690 (2021). https://doi.org/10.1007/s10676-021-09606-x
14. Myers, N.E., Kogan, G.: Emerging AI and data analytics tooling and disciplines. In: Self-service
data analytics and governance for managers, pp. 25–49. John Wiley & Sons, Inc (2021)
15. Nevaranta, M., Lempinen, K., Erkki, K.: Students’ perceptions about data safety and ethics in
learning analytics (2020).
16. O’Regan, G.: Ethics and privacy. In: Concise Guide to Software Engineering. Springer, Cham
(2022)
17. O’Regan, G.: Introduction to data science. In: Mathematical Foundations of Software
Engineering, pp. 385–398. Springer, Cham (2023)
18. Papadogiannakis, E., Papadopoulos, P., Kourtellis, N., Markatos, E.P.: User tracking in the
post-cookie era: How websites bypass gdpr consent to track users. In: WWW ’21: Proceedings
of the Web Conference 2021, pp. 2130–2141. Creative Commons Attribution 4.0 International
(2021)
19. Shah, S.I.H., Peristeras, V., Magnisalis, I.: DaLiF: A data lifecycle framework for data-driven
governments. J Big Data. 8 (2021). https://doi.org/10.1186/s40537-021-00481-3
20. Shukla, S., George, J.P., Tiwari, K., Varghese Kureethara, J.: Data privacy. In: Data Ethics and
Challenges, pp. 17–39. Springer, Singapore (2022)
21. Subramanian, R.: Have the cake and eat it too: Differential privacy enables privacy and precise
analytics. https://doi.org/10.21203/rs.3.rs-1847248/v1 (2022)
22. The world’s most valuable resource is no longer oil, but data, https://www.economist.com/lea
ders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data

23. Trunk, A., Birkel, H., Hartmann, E.: On the current state of combining human and artificial
intelligence for strategic organizational decision making. Bus. Res. 13, 875 (2020). https://doi.
org/10.1007/s40685-020-00133-x
24. Wylde, V., Rawindaran, N., Lawrence, J., Balasubramanian, R., Prakash, E., Jayal, A., Khan,
I., Hewage, C., Platts, J.: Cybersecurity, data privacy and blockchain: A review. SN Computer
Sci. 3 (2022). https://doi.org/10.1007/s42979-022-01020-4
25. Zambas, M., Illarionova, A., Christou, N., Dionysiou, I.: Exploring user attitude towards
personal data privacy and data privacy economy. In: Proceedings of the Second International
Conference on Innovations in Computing Research (ICR’23), pp. 237–244. Springer, Cham
(2023)
Modern Real-World Applications Using
Data Analytics and Machine Learning

Vijayakumar Ponnusamy, Nallarasan V., Rajasegar R. S., Arivazhagan N.,
and Gouthaman P.

Abstract Modern technology has given rise to strong technologies like machine
learning, big data, and data analytics that are revolutionising how businesses func-
tion and make choices. Business and marketing, healthcare, finance, manufacturing
and supply chains, transportation and logistics, energy utilisation, are only a few
of the disciplines where their practical applications are summarised in this chapter.
Precision medicine has evolved greatly via genetic data analysis, and big data anal-
ysis of electronic health records (EHRs) allows for better patient treatment. AI and
data analytics have a significant impact on risk assessment and fraud detection in the
financial sector. Analytical methods in manufacturing and supply chain optimisation
are highlighted in research articles. Significant progress has been made in lowering
operating expenses and equipment downtime thanks to machine learning-driven
predictive maintenance. Real-time monitoring and route optimisation in transport: the
importance of data analytics. The advancements in safety and dependability include
machine learning-powered autonomous cars and predictive maintenance strategies.
Grid management and energy usage have been optimised by the energy industry via
the use of big data and data analytics. Equipment breakdowns may be predicted and
energy production efficiency increased using machine learning. Customised learning

Vijayakumar Ponnusamy
Department of Electronics and Communications, SRM Institute of Science and Technology,
Kattankulathur, Chennai, India
e-mail: vijayakp@srmist.edu.in
Nallarasan V. · Gouthaman P. (B)
Department of Networking and Communications, SRM Institute of Science and Technology,
Kattankulathur, Chennai, India
e-mail: gouthamp@srmist.edu.in
Nallarasan V.
e-mail: nallarav@srmist.edu.in
Rajasegar R. S.
IT Industry, Cyber Security, Country Louth, Ireland
Arivazhagan N.
Department of Computational Intelligence, SRM Institute of Science and Technology,
Kattankulathur, Chennai, India
e-mail: arivazhn@srmist.edu.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 215
P. Singh et al. (eds.), Data Analytics and Machine Learning, Studies in Big Data 145,
https://doi.org/10.1007/978-981-97-0448-4_11
and evaluation via the use of data analytics. Content distribution and student engage-
ment are aided by machine learning algorithms, such recommendation systems. To
summarise, the fields of data analytics, big data, and machine learning have broad and
extensive uses in a variety of fields. These applications have revolutionised decision-
making processes, increased productivity, and stimulated creativity in the contempo-
rary world. These innovations are very important in determining how different fields
will develop in the future.

1 Introduction

An era in which an enormous amount of data is being generated has led, through
innovation and technology, to the emergence of groundbreaking technologies such as
big data, machine learning and data analytics. These technologies have become integral to
industries, revolutionising the way organisations perceive and utilise data. They play
a role in decision-making processes, influencing everything from corporate board-
rooms and hospital wards to financial markets and factory floors. Data analytics,
big data and machine learning have proven to be tools that drive progress, enhance
operations and unveil insights hidden within vast amounts of information.
The main objective of this research is to showcase how these technologies are
practically used in industries. These tools have the capability to generate produc-
tivity, informed decision-making and significant innovation across different sectors
such as business and marketing, healthcare, finance, manufacturing and supply chain
management, logistics and transportation, energy and utilities as well as education.
By delving into the applications of big data, machine learning and data analytics
in numerous industries and exploring how these technologies are being applied in
real-world scenarios, we can clearly see their undeniable impact on shaping the
future of global businesses [1].
Commercial and Promotional: Customer Segmentation and Personalization:
Organisations employ data analytics and machine learning to partition their customer
base and tailor their marketing endeavours to individual customers. Netflix provides
personalised content recommendations to its users based on their viewing patterns
and machine learning algorithms. Market Trend Analysis: Big data analysis enables
organisations to swiftly monitor and react to market trends. Big data is utilised by
retailers such as Amazon to monitor consumer behaviour, optimise inventory and
expedite product delivery. Demand Forecasting and Pricing: Machine learning algo-
rithms are employed in dynamic pricing strategies, while big data provides assis-
tance in demand forecasting. Internet merchants modify the prices of their products
in response to market conditions, whereas airlines establish their ticket prices as high
as feasible [2].
Medicinal: ailment Prediction and Prevention: Algorithms based on machine
learning and data analytics are utilised to identify at-risk individuals and forecast
disease outbreaks. By analysing massive amounts of patient data with machine
learning, it is possible to predict conditions and prevent the spread of disease. Big data
analysis facilitates precision medicine and genomics research by enabling the devel-
opment of individualised treatment strategies predicated on genetic profiles. Elec-
tronic Health Records (EHRs): Through the application of data analytics, EHRs may
enhance patient care, determine the efficacy of treatments, and streamline hospital
operations.
Data analytics [3] and machine learning are vital for financial risk assessment.
Machine learning and data analytics are vital to fraud detection, investment, and
credit assessment. Real-time data analytics and large data sets enable financial organ-
isations to quickly execute trading decisions using algorithmic trading. Financial
specialists may use big data to assess social media and news sentiment to make finan-
cial judgements. In supply chain and manufacturing, machine learning algorithms
in predictive maintenance reduce downtime and maintenance costs. Data analytics
helps maintain product quality throughout the production process in quality control.
Big data analysis optimises inventory management by maintaining product avail-
ability and reducing carrying costs. Route optimisation using data analytics may
save fuel and speed up logistics and transportation deliveries. Machine learning
algorithms direct autonomous automobiles and assure traffic safety using real-time
data. Machine learning in transportation systems can forecast maintenance needs
and improve infrastructure and vehicle safety [4]. Big Data and data analytics help
to optimise electricity and service energy utilisation. This method helps utilities and
consumers save energy. Smart grid technology optimises energy distribution and
management by analysing data, reducing costs and energy loss. Energy companies
utilise predictive maintenance to keep equipment running smoothly. To avoid failures,
machine learning is used to forecast them. Good Practises: Customised training is
tailored to each student’s requirements and ability. By altering learning session time
and subject material, data analytics can fulfil individual student demands. Machine
learning may save instructors time and provide students with immediate feedback in
grading and evaluation. Users get tailored recommendations based on their prefer-
ences, interests, and historical activity via recommendation systems. These systems
improve students’ education by recommending suitable instructional content using
machine learning algorithms. The aforementioned examples demonstrate how big
data, machine learning, and data analytics can boost creativity, efficiency, and
data-driven decision-making across numerous sectors. Technology will change how
organisations operate in a data-rich world [5].

1.1 Information Extraction (IE)

IE extracts information from unstructured data with a slew of rule-based or context
frame rules approach, pattern-based or dictionary-based approach, Machine Learning
(ML), and DL approaches. ML is a field of AI that facilitates learning from previous
experience, to produce an improved system. ML comprises a set of algorithms
and methods that can automatically detect patterns, and appreciate the similarities
and variations in data to predict future outcomes with a certain degree of inherent
uncertainty. The ML algorithms are broadly classified into four types:
1. Supervised learning
2. Unsupervised learning
3. Semi-supervised
4. Reinforcement learning

1.1.1 Supervised Algorithms

Predictive algorithms are another name for these kinds of algorithms. Using a
mapping of existing knowledge based on previously rendered inputs, these algorithms
may categorise new inputs or forecast their consequences. In supervised algorithms,
the information about inputs and their corresponding outputs is used to direct the
learning process. The learning occurs at the point when the inputs are mapped to the
outputs. The final ML model is refined by repeated exposure to the training data. The
resulting supervised algorithm is then evaluated based on some criteria and tested
using real-world data.
The supervised algorithms may be broken down into two categories:
Regression methods, whose output variable is a numeric, continuous value. A
regression technique, for instance, is used to forecast the next day's temperature.
Algorithms whose results are classes or categories are called classification algo-
rithms. For instance, a classification algorithm is used to determine if the next day
will be "sunny," "overcast," "rainy," or "cloudy." Linear regression, SVM, regres-
sion trees, Logistic Regression, and other supervised algorithms are some of the
state-of-the-art options.
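A minimal sketch of the two supervised families, assuming scikit-learn is available, is given below; the synthetic datasets simply stand in for real labelled records such as historical weather observations.

from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Classification: the output is a class or category.
Xc, yc = make_classification(n_samples=200, n_features=5, random_state=0)
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(Xc, yc, test_size=0.25, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xc_tr, yc_tr)
print("classification accuracy:", clf.score(Xc_te, yc_te))

# Regression: the output is a numeric value, e.g. tomorrow's temperature.
Xr, yr = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(Xr, yr, test_size=0.25, random_state=0)
reg = LinearRegression().fit(Xr_tr, yr_tr)
print("regression R^2:", reg.score(Xr_te, yr_te))

In both cases the model is fitted on the training split and then evaluated on held-out data, mirroring the train-then-test workflow described above.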

1.1.2 Unsupervised Algorithms

Learning in unsupervised algorithms happens without labelled data that is, the
training examples are presented to the algorithm without target or output data. The
algorithm learns the underlying patterns and similarities in the data to discover the
veiled knowledge. This process is commonly referred to as knowledge discovery.
The class of unsupervised algorithms generally falls under the following categories:
• Clustering algorithms: The knowledge discovery in these types of algorithms
happens by uncovering the inherent similarities among the training data.
• Association: These are algorithms that extract rules that can describe large classes
of data.
• Common unsupervised algorithms include Fuzzy logic, K-means clustering, K-
Nearest Neighbours, etc.; a short clustering sketch follows this list.
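The sketch below, assuming scikit-learn, shows knowledge discovery without labels: K-means groups synthetic points purely from their similarity, and the three-cluster choice is an assumption made only for the illustration.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# No target values are given to the algorithm: the grouping comes purely from similarity.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [int((model.labels_ == k).sum()) for k in range(3)])
print("cluster centres:\n", model.cluster_centers_)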

1.1.3 Semi-Supervised Algorithms

Semi-supervised learning algorithms engage in a learning process that is partly reliant
on labelled input. The model undergoes training using a limited quantity of labelled
data and a greater volume of unlabeled data. The financial expenses associated with
acquiring annotated data serve as the driving force behind the development of semi-
supervised algorithms. The working of semi-supervised algorithms can be realised
in two phases: clustering of related examples using unsupervised methods is seen
as the first phase and using the available labelled examples to label the remaining
unlabeled examples occurs in the next phase.
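The two phases described above can be sketched roughly as follows, assuming scikit-learn and NumPy; the dataset, the 15 labelled points and the majority-vote labelling rule are all assumptions made purely for illustration.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
y = np.full(len(X), -1)                        # -1 marks an unlabelled example
rng = np.random.default_rng(0)
labelled_idx = rng.choice(len(X), size=15, replace=False)
y[labelled_idx] = y_true[labelled_idx]         # only a small labelled subset is available

# Phase 1: unsupervised clustering of every example.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Phase 2: each cluster labels its unlabelled members with the majority label of its labelled members.
y_pred = y.copy()
for c in range(3):
    known = y[(clusters == c) & (y != -1)]
    if known.size:
        y_pred[(clusters == c) & (y == -1)] = np.bincount(known).argmax()

unlabelled = y == -1
print("accuracy on originally unlabelled points:",
      float((y_pred[unlabelled] == y_true[unlabelled]).mean()))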

1.1.4 Reinforcement Learning Algorithms

The learning in reinforcement learning takes place by making the software agents
define an ideal behaviour in the given learning environment to yield maximum perfor-
mance. The agents will be iteratively rewarded using reinforcement feedback signals.
This signal is the central factor in guiding the agent to adapt or learn the environment
and decide on the next step.
The outcome of the learning in these algorithms is an optimal policy that
maximises the performance of the agent. Reinforcement learning is more common in
the development of robots for specified tasks. Some of the well-known reinforcement
learning algorithms are adversarial networks and Q-learning.
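A toy Q-learning sketch is given below; the five-state corridor, the reward of 1 at the right-hand end and the purely random behaviour policy (acceptable here because Q-learning is off-policy) are assumptions chosen only to keep the example tiny.

import numpy as np

n_states, n_actions = 5, 2                 # actions: 0 = left, 1 = right; state 4 is terminal
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.5, 0.9
rng = np.random.default_rng(0)

for _ in range(300):                       # episodes
    s = 0
    while s != n_states - 1:
        a = int(rng.integers(n_actions))   # explore randomly; learning itself is off-policy
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Reinforcement feedback signal updates the action-value estimate.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print("learned policy per state (0 = left, 1 = right):", Q.argmax(axis=1))

After training, the greedy policy derived from Q moves right in every non-terminal state, which is the optimal behaviour for this toy environment.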
DL algorithms are a subset of ML. These algorithms attempt to replicate human
learning and construct algorithmic frameworks rooted in human cognitive power. DL
algorithms are more ardent in learning specific types of knowledge from the domain
in which they are applied. They uncover the hidden knowledge and representation
through multiple processing levels and deliver high-level learning by delineating
lower-level features.
IE extracts entities from documents, matches a predefined template to raw data,
gathers information intended for specific people, and enables users to get the most
out of predefined templates. In recent times, the quantum of textual information has
increased exponentially, mostly in regard to unstructured data.
People have migrated towards online platforms to undertake standard tasks like
reading, information exchange through social networks, and consultations with
physicians using android applications. Information growth is driven largely by the
use of computers across all disciplines, and information is widely accessed in the
form of normal text, code-mixed data, and acronyms.

1.2 Internet of Things (IoT)

Recently, IoT has received a lot of attention. The development of several technologies
and research projects has allowed the Internet of Things to grow significantly. Our
everyday lives and attitudes are going to be affected by the long-awaited Internet
of Things. IoT is a globally linked network of devices, such as embedded devices,
mechanical devices and computing devices, that constitute computers or other things.
These devices are connected to one another and given an IP address to send
and receive packets across a network. These objects can be connected by wire or
wirelessly, although wireless connectivity is preferred and will be used often due to its
adaptability. The desired outcome is operation with the least amount of human
intervention; human interaction is kept to a bare minimum. A collection of objects has
the potential to engage in cooperative behaviour. Internet of Things (IoT) devices often
exhibit constraints in terms of computing capabilities, price, power consumption, bitrate,
range, processing capacity, storage capacity, battery life, and operator counts, and IoT
networks link to diverse devices.
The Internet of Things architecture has five tiers (Fig. 1). These layers are as follows:

Fig. 1 Five-Layer IoT architecture



• Perception layer: This layer depicts how people see physical objects like RFID
tags, actuators, sensors, etc. Their main job is gathering and transforming the
desired data into information. Actuators, for instance, are some of them that
organise the control signals.
• Transmission layer: The primary role of this layer is to accept control signals
from the intermediate layer and transmit the data collected by the recognition
layer back to the intermediate layer via networking technologies.
• Middleware layer: Decisions are formulated based on the findings derived from
an analysis of the data gathered at the transmission layer.
• Device layer: IoT applications are present in this fourth layer. An end user is
provided with capabilities based on the processed data (a toy walk-through of
these layers is sketched after the list).
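A toy, self-contained Python walk-through of these layers is sketched below; the simulated sensor reading, the JSON payload and the fan-control threshold are invented for the illustration and do not correspond to any specific IoT platform.

import json
import random

def perception_layer():
    """Gather raw data, e.g. a temperature reading from a sensor."""
    return {"device_id": "sensor-42", "temperature_c": round(random.uniform(15, 35), 1)}

def transmission_layer(reading):
    """Package the reading for transfer over the network (here simply a JSON payload)."""
    return json.dumps(reading)

def middleware_layer(payload):
    """Analyse the received data and formulate a decision (a control signal for an actuator)."""
    reading = json.loads(payload)
    reading["fan_on"] = reading["temperature_c"] > 28.0
    return reading

def application_layer(decision):
    """Expose the processed result to the end user."""
    state = "ON" if decision["fan_on"] else "OFF"
    print("{}: {} °C, fan {}".format(decision["device_id"], decision["temperature_c"], state))

application_layer(middleware_layer(transmission_layer(perception_layer())))

Each function corresponds to one layer of the architecture, so the single call at the bottom traces a reading from sensing through transmission and analysis to the user-facing application.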

Some factors that have contributed to the growth of IoT are:


• Advances in technology: The development of low-cost sensors, wireless networks,
and cloud computing has made it easier and more affordable to connect devices
to the internet and collect and analyse data.
• Increased connectivity: The proliferation of high-speed internet and the
widespread availability of Wi-Fi and cellular networks have made it easier to
connect devices to the internet.
• Cost savings: IoT can help organisations save costs by improving operational
efficiency, reducing downtime, and optimising resource utilisation.
• Enhanced customer experience: The Internet of Things (IoT) has the potential to
provide clients tailored and instantaneous experiences, exemplified by the use of
intelligent home gadgets capable of adapting temperature and lighting settings
according to individual user preferences.
• Industry adoption: Various industries such as healthcare, manufacturing, trans-
portation, and agriculture have adopted IoT technologies to improve efficiency
and productivity.

1.3 Key Features of Data Analytics, Big Data and ML

In numerous industries, big data, machine learning, and data analytics have the poten-
tial to significantly increase output, innovation, and data-driven decision-making.
These practical implementations serve as evidence for this claim. It is expected that
as technology advances, organisations will modify their operations to accommodate
an ever-increasingly data-rich environment. Examining, filtering, and transforming
unprocessed data in order to discern significant patterns, conclusions, and insights
constitutes data analytics. An extensive array of computational, statistical, and math-
ematical techniques are addressed. Essential elements: Descriptive analytics provides
an answer to the question “What occurred?” It operates on historical data and gener-
ates reports according to its analysis. Predominantly, diagnostic analytics seeks to
determine “Why did it occur?” An in-depth analysis of historical data is conducted
[6].

Predictive analytics generates “What might occur in the future?” forecasts by
utilising historical data. Prescriptive analytics: Provides guidance on the optimal
course of action to mitigate potential hazards or attain intended results. The storage,
management, and analysis of enormous amounts of data can present difficulties when
employing conventional database and software technologies. Massive datasets are
commonly denoted as “big data.” Frequently, the three Vs—Volume, Velocity, and
Variety—are used to symbolise large data [7].
Critical elements: Volume denotes the enormous quantities of data that are
produced in a single second.
Velocity: This pertains to the rate of creation and transmission of new data.
Data diversity encompasses a wide variety of data categories, including structured,
unstructured, and semi-structured data. A supplementary letter V is occasionally
appended to underscore the significance of reliable and superior data [8].
Value: This pertains to the capacity of our data to be interpreted. Although it
is advantageous to possess vast quantities of data, they are futile if we are unable
to apply any logic to them. Robotic Education A subfield of artificial intelligence
known as machine learning (ML) enables computers to develop and learn from expe-
rience without the need for explicit programming. It consists of learning algorithms,
data processing algorithms, and prediction/decision-making algorithms. Essential
elements: The system is instructed via supervised learning through the provision of
labelled data, which consists of input that has been matched with the corresponding
response [9].
Unsupervised Learning: given unlabeled input, the algorithm’s task is to identify
patterns and relationships in the data. Through interactions with its environment
and the reinforcement of rewards or punishments for its behaviour, an agent acquires
knowledge via reinforcement learning. To recognise patterns in data, neural networks
are algorithms that imitate the structure and operations of the human brain. Multilayer
neural networks are utilised in the subfield of machine learning known as deep
learning. It is extraordinarily potent for tasks such as image and speech recognition
[10].
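As a small illustration of the neural-network idea (assuming scikit-learn is installed), the sketch below trains a multilayer perceptron on scikit-learn's bundled 8×8 handwritten-digit images; it is only a toy stand-in for the much larger deep learning systems used in real image and speech recognition.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)                      # small grey-scale digit images
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two modest hidden layers; real deep networks are far larger and trained on GPUs.
net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
net.fit(X_train, y_train)
print("test accuracy:", round(net.score(X_test, y_test), 3))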

1.4 Challenges Faced by Industries and Researchers in the Realm of Data Analytics, Big Data, and Machine Learning

The problem statements for Data Analytics, Big Data, and Machine Learning address
their inherent issues [11]. These key issue statements are tied to each:
Organisations must organise and analyse complicated data to get insights in data
analytics. Data is always changing, so how can analytics be accurate and relevant?
Integration with several data sources involves combining diverse data sources for a
complete analysis. Validation and verification are difficult parts of data analytics, but
they ensure data findings are accurate.

Given the exponential growth of data, what are the cost-effective solutions to
store data in the big data industry? Given the rate of data creation, processing speed
is necessary for large-scale real-time or near-real-time processing. Decentralisation
concerns how to decentralise data without sacrificing availability or integrity since
centralised data storage systems can become bottlenecks or single points of failure
in big data. Data quality concerns how to maintain or improve data quality in the
face of an abundance of data from various sources.
Training data collection in machine learning involves gathering enough and
representative data to build strong models.
• Overfitting and Generalisation: How can models be taught to work on fresh data
as well as on the training set [12]?
• Explainability and Transparency: Given their “black-box” nature, how can many
machine learning algorithms, particularly deep learning, make their conclusions
more visible and explainable?
• Best Model Selection: With so many alternatives, how can a task’s best model be
chosen?
• Bias and Fairness: How can machine learning models be built and taught to be
bias-free and objective?
These problem statements illustrate field-specific issues. Problem-solving fuels
these professions’ progress. When one issue is solved, another arises, showing how
technology is changing and assimilating into society.

1.5 Opportunities

Businesses may boost consumer satisfaction and profits with data-driven decisions.
Trend prediction and advice are the aims of predictive and prescriptive analytics.
Goods, services, and information can be adjusted to each user's interests to increase
engagement. Machine learning technology can automate difficult jobs like
customer care chatbots and autonomous cars. New revenue streams open up, as
knowledge may be swiftly commercialised or utilised to create new goods and services. Data
analysis may save a lot of money by identifying waste, redundancy, and inefficiencies.
Healthcare diagnosis, treatment, and care may be improved using machine learning
and sophisticated analytics like Fraud detection, algorithmic trading, robo-advisors,
etc. Smart cities can improve traffic flow, public safety, and energy management
via data and ML. Machine learning may accelerate materials science and health
research. Social benefits: Conservation, climate change, and public health may be
addressed using data and machine learning. Consumer applications like real-time
content suggestions and industrial settings like predictive maintenance may benefit
from real-time analytics.

Data analytics, machine learning, and big data will continue to grow, creating new possibilities and difficulties. As more companies adopt these tools and the technology advances faster, greater opportunities arise for data-driven, effective solutions [13].

2 Survey Methodology and Analysis

The main research areas determined by hierarchical clustering—machine learning approaches, text mining, event extraction, recommendation systems, automated journalism, online comment analysis, and exploratory data analysis—were represented by the author of this publication. Possible research directions
include improving paywall systems, looking into recommendation systems, putting
cutting-edge automated journalism solutions to the test, and developing models to
improve personalization and interactivity features. Big data, machine learning, and
data analytics are three sectors that will only grow in the future, presenting both new
opportunities and challenges. The more firms use these tools and the faster technology
advances, the more opportunity there is for data-driven, successful solutions.

2.1 Analysis of Machine Learning Methods for Healthcare and Forecasting Applications

Machine learning methods can be used to analyse cardiovascular disease, particularly by comparing forecasting and classification algorithms. The review in [17] analyses 41 cardiovascular disease studies that apply machine learning, exhaustively reviews the chosen papers, and identifies gaps in the literature to help researchers improve these methods and use them in clinical settings, mainly on heart disease datasets. This research will help doctors identify heart risks and treat them [17].

Figure 2 shows the input data received from the patient and its pre-processing. After feature extraction/selection, the model is trained to predict the patient's disease.

Fig. 2 Healthcare applications prediction using ML algorithms
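The pipeline in Fig. 2 can be sketched in a few lines of scikit-learn. The sketch below assumes a tabular patient dataset with numeric clinical features and a binary "disease" label; the file name, column names, and model choice are illustrative assumptions, not taken from the cited studies.

```python
# A minimal sketch of the Fig. 2 pipeline: pre-processing -> feature selection -> model -> prediction.
# Assumptions: a CSV of patient records with numeric clinical features and a binary "disease" column.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("patients.csv")                      # hypothetical dataset
X, y = df.drop(columns=["disease"]), df["disease"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),       # pre-processing: fill missing values
    ("scale", StandardScaler()),                        # pre-processing: normalise features
    ("select", SelectKBest(f_classif, k=10)),           # feature extraction/selection
    ("model", RandomForestClassifier(random_state=42))  # prediction model
])
pipeline.fit(X_train, y_train)
print("Held-out accuracy:", pipeline.score(X_test, y_test))
```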
The authors [18] examine Type II diabetes, its global prevalence, its effects, and early detection. They stress that hybrid meta-heuristic machine learning with big data feature selection may support early and thorough diagnosis, test hybrid models against baseline or classic machine learning models, and discuss the importance of big data feature selection in light of Type II diabetes. Problems include class imbalance, overfitting, and data quality. The authors explain the advantages of hybrid meta-heuristic machine learning and big data feature selection for early Type II diabetes detection; alternative research directions involve finding other hybrid combinations or employing different genetic data.
The authors [19] discuss the difficulties associated with projecting nonlinear systems and the significance of anticipating changes in water contaminated by mining, which has environmental and public health repercussions. The study's objective is to predict mining-induced fluctuations in water data using machine learning. The techniques utilised in the research encompassed Support Vector Machines, Random Forest, Neural Networks (for identifying nonlinear patterns), and Linear Regression (for establishing the baseline). To assess and contrast the performance of the various models, pertinent forecasting metrics may be applied, including mean absolute error, mean squared error, and the R² score. Graphical representations—such as comparisons between predicted and actual values—enhance intuitive comprehension of the models' performance. The summary of the key findings highlights the success—or lack thereof—of machine learning in predicting water data that is impacted by mining; alternatives include expanding the data set, examining alternative machine learning models, or incorporating domain-specific data.
Predictive maintenance (PdM) methodologies have become widely utilised by
organisations to oversee the health status of industrial equipment since the advent of
Industry 4.0 intelligent systems and machine learning (ML) within artificial intelli-
gence (AI). Owing to the emergence of Industry 4.0 and advancements in information
technology, computerised control, and communication networks, the accumulation
of vast quantities of data concerning the operational and process conditions of diverse
equipment has become feasible. By utilising this information, automated problem
detection and diagnosis aim to reduce downtime, maximise component utilisation,
and extend the remaining components' useful lives. The survey in [20] reviews recent developments in machine learning (ML) techniques that are frequently implemented in PdM for smart manufacturing in Industry 4.0. The methods are classified based on the machine learning algorithms employed, the machine learning category, the instruments and apparatus used, the data collection device, the quantity and nature of the data, and the principal contributions made by the researchers [20].
Figure 3 represents forecasting prediction using ML algorithms. The first step is to collect the historical data, the second step is to train the ML model for prediction, and finally the validation step yields the forecasting results.

Fig. 3 Forecasting applications prediction using ML algorithms

2.2 Analysis of Big Data Analytics for Smart Grid Application, Tourism Management, Supply Chain Management, and Financial Management

Smart grids have growing significance in the context of energy management, sustainability, and the continuous worldwide energy revolution, which highlights the importance of Big Data analytics and the rapid expansion of data that smart grid technologies produce. However, security and privacy concerns include preventing unauthorised access, data manipulation, and protecting user privacy. The data's quality can be guaranteed by assuring consistency, accuracy, and timeliness, and it may be difficult to integrate numerous obsolete formats, systems, and data sources [14].
The author [15] presents a comprehensive examination of the prominence of big data in the tourism industry and its relevance in the present day. The objective of the literature review is to summarise the existing body of knowledge concerning big data analytics in the tourism industry. The big data pertaining to tourism can be broadly classified into three main categories: user-generated content (UGC), which includes website text and image content; device-generated data, which includes GPS, mobile roaming, Bluetooth, and other data; and operation-generated data, which includes transaction data from visited websites, online searches, and online reservations, among others. Big data insights can also support the implementation of AR and VR in the tourism industry.
Numerous business disciplines, including operations, supply chain, marketing,
and accounting, have effectively implemented big data analytics. The significance of
big data analytics within the supply chain is increasing due to recent advancements
in machine learning and computer architecture. Due to the increasing prevalence of
big data analytics in supply chains, this study provides a comprehensive evaluation
of previous research on the subject, including evaluating product performance with the aid of data analytics to inform determinations regarding the introduction, cessation, or alteration of products [16].
The authors [10] identify several perspectives through which BDA applications in healthcare can be examined: the management of hospitals, the treatment of specific medical conditions, the interaction with stakeholders in the healthcare ecosystem, and the delivery of healthcare services through the use of technology. Nevertheless, certain constraints do exist. Additionally, we advise researchers examining the implementation of BDA in sectors such as travel and accommodation, media and broadcasting, and banking and finance to adopt the same methodology as our investigation. In the same way, the healthcare industry stands to gain from the implementation of innovative technologies such as machine learning, blockchain, and cloud computing, which offer promising avenues for research. In order to better comprehend the specific applications of BDA and the broader incorporation of technology in the healthcare industry, we conclude this SLR with a call for the development of theories.
Large-scale credit assessment models have been utilised by conventional financial
institutions for quite some time. Peer-to-peer lending presents obstacles for these
methodologies. To commence, peer-to-peer credit data frequently comprises both
extensive numerical attributes and limited category classifications. Second, updating
the vast majority of credit scoring algorithms currently in use online is not feasible.
P2P lending encompasses a substantial volume of loan transactions, and the dissem-
ination of said data is influenced by newly acquired information. Disregard for data
updates by a credit scoring system could potentially lead to significant discrepan-
cies or even rejections in subsequent credit assessments. The author [11] presents an
innovative credit rating model that is incorporated with online P2P lending. By inte-
grating gradient-boosting decision trees and neural networks, OICSM improves the
credit scoring model’s ability to process two distinct categories of features simulta-
neously and updates in real time. The efficacy and superiority of the proposed model
are validated through offline and online experiments utilising authentic and realistic
credit datasets. Experimental results support the claim that OICSM’s advantage in
deep learning over two features enables it to further remedy model deterioration. The
capability to dynamically update in real time further enhances its efficiency.
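A rough illustration of combining gradient-boosted trees with a neural network for credit scoring is sketched below. This is not the OICSM architecture from [11]; the synthetic data, model settings, and the simple probability-averaging ensemble are illustrative assumptions only.

```python
# Illustrative sketch: a soft-voting ensemble of a gradient-boosted tree model and a neural
# network for credit default prediction. NOT the OICSM model described in [11].
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for a P2P loan dataset (numeric features, imbalanced binary default label).
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

gbdt = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
nn = make_pipeline(StandardScaler(),
                   MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500,
                                 random_state=0)).fit(X_train, y_train)

# Soft voting: average the default probabilities of the two model families.
proba = (gbdt.predict_proba(X_test)[:, 1] + nn.predict_proba(X_test)[:, 1]) / 2
print("Ensemble AUC:", roc_auc_score(y_test, proba))
```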
Figure 4 shows data that is obtained from a data source, saved in a data collection, and then transformed through categorisation, filtering, and enrichment. A model that predicts data categorisation based on data classification and analytics is utilised to visualise the predicted data for the client.

Fig. 4 Big data analytics system model

3 Prospects for Data Analytics, Big Data, and Machine Learning in the Future

The field of data analytics presents various prospects. Machine learning and big data
are implemented in numerous industries. The subsequent sectors are experiencing
rapid and significant developments.
• Expansion of Data Sources: The proliferation of Internet of Things (IoT), ubiq-
uitous technology, and embedded systems will further propel the accumulation
of data in an exponential fashion. Real-time analytics and deeper insights will
be possible across sectors. As it becomes more complicated and fast, quantum
computing might change machine learning and large-scale data processing.
• The merging of AR and VR: Using AR and VR will make data visualisation
more immersive, helping data scientists and companies understand complex data
patterns.
• Automated Machine Learning (AutoML): automates the pipeline’s most difficult
phases to make machine learning more accessible to non-experts. It also speeds
up model selection.
• Explainable AI (XAI) develops visible, intelligible models and is growing in popu-
larity. Criminal justice, economics, and healthcare decision-making are greatly
impacted by this.
• Federated Learning and Data Privacy: As data privacy concerns grow, feder-
ated learning allows algorithm training across several devices while protecting
localised data. Patient anonymity is crucial in healthcare, thus this may change
the industry.
• The Integration of Predictive and Tailored Medicine: As medical technology and
genetics advance, treatment regimens will include patient health data. In addition,
predictive analytics may inform users of prospective health issues.
• COVID-19 has made supply chain optimisation more important. Machine
learning, big data, decision-making automation, real-time monitoring, and predic-
tive analytics for demand forecasting help protect supply chains from interrup-
tions.
• Integrating data analytics and machine learning into climate change, energy
efficiency, and environmental change strategies can help sustainable practises.
• Neural Symbolic Integration: The combination of symbolic reasoning and neural
networks (logic-based AI) enables the development of AI models that exhibit
adaptability and interpretability akin to both symbolic and neural models.
• Ethics and Bias in AI: In the future, there will be greater emphasis on regulations
and standards that guarantee the objectivity, morality, and fairness of machine
learning models.
• Edge Computing: Particularly for real-time applications, processing data on the device or closer to where it is produced, as opposed to in a centralised data centre, will gain traction.
In the realm of collaborative AI, the ability for numerous AI models to work
together on complex tasks will be imperative so that the full potential of diverse AI
architectures can be realised.
Figure 5 illustrates how different application sectors—such as supply chain management, health care, the Internet of Things, and the smart grid—receive input data from users and how that data is transacted and interacted with to enable data analytics for the effective use of data across different applications. Preprocessing turns large-scale data collection into data ready for analytics, with the aid of a machine learning model that forecasts data for customer feedback segmentation and is beneficial for the expansion of several applications.

Fig. 5 Applications of data analytics, big data, and machine learning system model

3.1 The Integration of Big Data with Smart Grids

Smart grids have growing relevance for sustainability, energy efficiency, and the global energy transition, and the exponential growth of smart grid data makes Big Data analytics essential.
Smart grid big data use—Historical data, weather predictions, and other sources may predict electricity usage over time. Integrating renewable energy and stabilising the system requires forecasting solar and wind power production.
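As a rough sketch of such load forecasting (the file name, column names, and model choice below are illustrative assumptions, not taken from a specific smart grid deployment):

```python
# Minimal sketch: forecasting daily electricity usage from historical load and weather features.
# Assumptions: a CSV with daily rows containing "load_kwh", "temperature" and a "date" column.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

df = pd.read_csv("grid_load.csv", parse_dates=["date"], index_col="date")  # hypothetical file
df["load_lag1"] = df["load_kwh"].shift(1)      # yesterday's load
df["load_lag7"] = df["load_kwh"].shift(7)      # load one week ago (weekly seasonality)
df["dayofweek"] = df.index.dayofweek
df = df.dropna()

features = ["load_lag1", "load_lag7", "temperature", "dayofweek"]
train, test = df.iloc[:-30], df.iloc[-30:]     # hold out the last 30 days

model = GradientBoostingRegressor(random_state=0).fit(train[features], train["load_kwh"])
pred = model.predict(test[features])
print("MAE over the last 30 days:", mean_absolute_error(test["load_kwh"], pred))
```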

Grid Health Monitoring–Planning maintenance and predicting equipment faults using sensor data. Outage management uses data analysis to predict and resolve
power outages. Demand response management uses consumption data to change
customer behaviour and give incentives to lower peak load.
Machine Learning and AI–Discuss how to analyse and forecast smart grid data
using regression, neural networks, and support vector machines.
Distributed Computing–Discuss Hadoop and Spark, which analyse enormous grid
data distributedly. Promote grid-compatible distributed storage systems, NoSQL
databases, and other data storage approaches. Security and privacy form a further obstacle; the problems include data tampering, unauthorised access, and user privacy. Data quality requires timeliness, correctness, and consistency, and different data sources, obsolete systems, and formats pose integration challenges.
Scalability–A system’s ability to process more data without slowing down.
Future Trends–Smart metres use edge computing to process data closer to the
source than central data centres.
Blockchain in Energy–Blockchain can enable transparent, secure peer-to-peer
energy transactions. As Big Data technologies advance, grids will need to become
more autonomous and need less human intervention [21].
Examples of Big data in Smart grid—The data inside smart grids exhibit hetero-
geneity, characterised by varying resolutions, asynchronous nature, and storage in
many locations in diverse formats, including both raw and processed forms. As an
example, data pertaining to energy use from smart metres is often collected at 15-min
intervals and stored inside billing centres.
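For instance, aggregating such 15-minute smart meter readings into daily consumption can be done with a few lines of pandas (the file and column names are illustrative assumptions):

```python
# Sketch: aggregating 15-minute smart meter readings into daily energy consumption.
import pandas as pd

# Hypothetical meter data: one reading every 15 minutes, energy in kWh per interval.
readings = pd.read_csv("meter_readings.csv", parse_dates=["timestamp"], index_col="timestamp")
daily = readings["kwh"].resample("D").sum()          # total consumption per day
peak_15min = readings["kwh"].resample("D").max()     # highest 15-minute interval per day
print(pd.DataFrame({"daily_kwh": daily, "peak_interval_kwh": peak_15min}).head())
```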

3.2 Integrating Data Analytics in Healthcare

The rise of data analytics in healthcare emphasises the necessity to use data to make
decisions. An overview of how data analytics aids medical diagnosis, treatment,
patient monitoring, and hospital management. EHR mining is being employed in
healthcare data analytics to optimise hospital management, patient health, and service
delivery. Predictive analytics predicts patient admissions, disease outbreaks, and
illness progression.
Telemedicine and remote monitoring employ wearable technology to monitor
patients 24/7 and give data-driven virtual health consultations. Genomic data analysis
predicts health outcomes, assesses sickness propensities and customises therapies.
Medical image analysis is improving CT, MRI, and X-ray analysis using AI and machine learning.
Examples of Data Analytics in Medical Imaging-Medical imaging plays a crucial
role in contemporary healthcare, as seen by the substantial number of imaging
procedures prescribed by physicians in the United States, which amounts to around
600 million annually. Nevertheless, the process of evaluating and preserving these
pictures incurs significant costs in terms of both time and financial resources. Radiol-
ogists are required to meticulously analyse each picture on an individual basis, while
hospitals are obligated to retain these images for an extended duration of several
years.
The use of big data analytics in the healthcare sector enables algorithms to effec-
tively evaluate a vast quantity of photos, numbering in the hundreds of thousands. The
identification of distinct patterns within the pixels and their subsequent conversion
into numerical data facilitates the physician’s diagnostic process. Moreover, it is said
that radiologists would be relieved from the task of visually examining the pictures,
as their focus will shift towards analysing the results generated by the algorithms.
These algorithms are expected to unavoidably acquire and retain a greater number of
photos than they would be able to process throughout the span of a human lifetime.
Big data analytics has the ability to bring about a transformative impact on medical
imaging and generate notable efficiencies within the healthcare system.

3.2.1 Healthcare Data Analytics Ethics

Privacy means protecting patient data in healthcare. Patients must understand data
collection and usage to provide informed consent. Algorithmic bias addresses
data analytics technological biases that may cause unfair or discriminatory health
effects. Data ownership asks who owns health data: patients, providers, or other parties. Accountability and transparency concern who is accountable for the judgements made by healthcare analytics algorithms.

3.2.2 Pros and Cons of Data Analytics in Healthcare

The pros include more individualised therapy, cheaper drugs, better patient outcomes, and predictive abilities. Challenges include data integration across platforms, real-time analysis, data quality, and ethical problems.

3.3 The Combination of Machine Learning with Forecasting

Using machine learning to anticipate nonlinear systems is difficult yet helpful, particularly in a mining-impacted water data case study; the expected method for this kind of investigation is outlined below. This section discusses nonlinear system forecasting issues and the need to project mining-induced water quality changes given the potential consequences. The study's goal is to predict mining-induced water data changes using machine learning [21].
Examples of Machine learning in Forecasting—The method of machine learning
forecasting involves the use of algorithms to acquire knowledge from data and provide
prognostications on forthcoming occurrences. Machine learning has the potential to
assist with comprehending the requirements of one’s audience or forecasting the next
significant trends. Forecasting is a methodological technique that involves generating
forecasts by analysing historical and current data. Subsequently, these occurrences
may be juxtaposed and evaluated in relation to subsequent events. As an example,
a corporation may engage in the process of projecting its forthcoming income for
the subsequent year, subsequently doing a comparative study between the projected
figures and the actual outcomes, therefore generating a variance analysis.

3.3.1 Data Gathering

• Source: Find out where mining-impacted water data originated from (government
databases, mines).
• Water quality parameters: Turbidity, metal content, and pH were measured.
• Data Preprocessing: Discuss normalisation and missing value resolution to prepare data for machine learning (a brief sketch follows this list).
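A minimal preprocessing sketch for such water quality data is given below; the file name and column names (pH, turbidity, metal concentration) are illustrative assumptions.

```python
# Sketch: resolving missing values and normalising water quality parameters before modelling.
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

water = pd.read_csv("mine_water_quality.csv")          # hypothetical dataset
params = ["ph", "turbidity", "metal_mg_per_l"]          # illustrative column names

imputer = SimpleImputer(strategy="median")              # fill gaps left by missed samples
scaler = StandardScaler()                               # put parameters on a comparable scale

water[params] = imputer.fit_transform(water[params])
water[params] = scaler.fit_transform(water[params])
print(water[params].describe())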

3.3.2 Algorithm Choice

The machine learning methods utilised in the research include Support Vector Machines, Random Forest, and Neural Networks for nonlinear patterns, and Linear Regression as a baseline.
This section also covers finding and developing important components from raw data to increase forecasting accuracy.
Verification and training: the data is split into training, validation, and test sets, and the model hyperparameters are tuned, for example using cross-validation.
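For illustration, a data split and a cross-validated hyperparameter search might look as follows; the synthetic data, model, and parameter grid are assumptions, not the exact setup of the cited study.

```python
# Sketch: train/test split plus cross-validated hyperparameter tuning for a forecasting model.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=8, noise=10, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    cv=5,                                   # 5-fold cross-validation on the training set
    scoring="neg_mean_absolute_error",
)
grid.fit(X_train, y_train)
print("Best hyperparameters:", grid.best_params_)
```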

3.3.3 Model Efficiency

To evaluate model performance, employ forecasting-specific metrics such as mean absolute error, mean squared error, and the R² score. Use graphs to explain the model's performance (e.g., predicted vs. real values). If using models with feature importance metrics, like Random Forest or Gradient Boosted Trees, discuss how attributes affect the prediction.
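A short sketch of computing these forecasting metrics and inspecting feature importances for a fitted tree-based model (the synthetic data stands in for the real water measurements):

```python
# Sketch: forecasting metrics and feature importances for a fitted tree-based model.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

X, y = make_regression(n_samples=1000, n_features=8, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
pred = model.predict(X_test)

print("MAE :", mean_absolute_error(y_test, pred))
print("MSE :", mean_squared_error(y_test, pred))
print("R^2 :", r2_score(y_test, pred))
print("Feature importances:", model.feature_importances_)
```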
According to Fig. 6, data analytics is tied to a wide range of application sectors. By 2027, ML models are anticipated to account for over 30 billion application uses and to continue to expand.

Fig. 6 Growth of data analytics vs. ML model

3.3.4 Communication

The models have yielded insights that warrant a summary, including a determination
of the significant factors that impact water quality at mining sites. When conducting
a comparison between the machine learning approach and traditional forecasting
methodologies, it is crucial to consider the impact of mining on water studies.
Limitations: Discuss the study’s imposed restrictions on the data, the inter-
pretability of the model, and any external influences.
Almost limitless applications are possible for big data, machine learning, and
data analytics. Despite being in their early stages, these technologies will ultimately
transform the way we work, live, and communicate by permeating every sector of
the economy and society. Attaining synergies that tackle some of the most urgent
challenges of the twenty-first century will be possible when these disciplines are
incorporated with other cutting-edge technologies.

4 Conclusion and Future Scope

The confluence of Big Data, Machine Learning, and Data Analytics in the modern era
has fundamentally altered the competitive landscape for a broad variety of business
sectors and organisations. The exponential growth in both the amount and diversity
of the data has given rise to possibilities and problems that were unimaginable in the
past. Data analytics may help companies make sense of the massive volumes of data
at their disposal, which can then be used to guide decision-making and anticipate
outcomes. Methodologies that are driven by data have formed the foundation for
a great number of successful enterprises and projects, including those that aim to
improve user experiences and operational optimisation. Traditional data processing
technology is being pushed to its limits by Big Data due to the volume, velocity,
and variety of the data it contains. The increasing significance of data has been a
driving force behind the development of innovative technologies for the processing,
examination, and storage of data. This enhancement not only boosts efficiency and
scalability but also encourages a culture of real-time, fast decision-making that can be
adapted to changing circumstances. Machine learning is a subfield of artificial intel-
ligence that makes use of algorithms to recognise patterns, forecast outcomes, and
speed up decision-making processes. Applications as diverse as advanced medical
diagnostics and autonomous cars have been made possible as a result of its versatility
and predictive capacity. These applications include natural language processing and
recommendation systems. However, along with these developments come a number
of other problems that must be overcome. Security, data privacy, and ethical conun-
drums are some of the key topics of conversation at the moment. In order to win over
the public and minimise unforeseen repercussions, machine learning models need to be fair, transparent, and understandable. Additionally, specialists in these many sectors are required to consistently improve their skills and adapt their practises.
Collaboration between data scientists, domain specialists, ethicists, and policymakers is essential to ensure that these technologies are used in an ethical manner. A paradigm change has occurred in the management
of information, as well as its interpretation and use, as a direct consequence of the
convergence of data analytics, big data, and machine learning. Their intertwined
progression augurs well for a future that is rich in innovation, efficiency, and expan-
sion. To fully realise their revolutionary potential in a way that is both inclusive and
environmentally sustainable, however, a strategy that is well-balanced and accords
equal weight to social responsibility and technical innovation will be required.

References

1. Elizabeth, F., Sérgio, M., Paulo, C.: Data science, machine learning and big data in digital
journalism: A survey of state-of-the-art, challenges and opportunities. Expert Sys. App. 221
(2023)
2. Myers, N.E., Kogan, G.: Emerging AI and data analytics tooling and disciplines. In: Self-service
data analytics and governance for managers. pp. 25–49. John Wiley & Sons, Inc. (2021)
3. Jones, W., Thomas, G.E., Thomas, S.H.: Data analytics in healthcare: A review of current
trends and ethical issues. J. Healthc. Inf. Manag. 34(2), 19–25 (2020)
4. Li, X., Li, P., Yang, Y.: Credit scoring with machine learning for online peer-to-peer lending:
A categorization framework and review. IEEE Access 8, 38892–38908 (2020)
5. Lu, C., Luo, X., Tian, X., Zhang, Q.: Machine learning for predictive maintenance in smart
grids: A comprehensive survey. IEEE Trans. Industr. Inf. 16(1), 648–657 (2020)
6. Smith, A., White, B., Johnson, R.: Data analytics in market segmentation and trend analysis:
A case study of the retail industry. Int. J. Data Sci. Anal. 12(4), 487–503 (2021)
7. Veeramachaneni, K., Li, C., Soh, L.: Recommender systems for large-scale content on mobile
devices. IEEE Internet Comput. 18(3), 14–22 (2014)
8. Wang, H., Lu, X., Yang, L., Wang, Z., Wang, J.: A review on applications of data mining
techniques in the credit industry. Expert Syst. Appl. 129, 67–79 (2019)
9. Wu, X., Xia, H., Hao, J., Zhao, J.L.: Big data analytics in logistics and supply chain manage-
ment: Certain investigations for research and applications. J. King Saud Univ.-Comp. Info. Sci.
29(4) (2018)

10. Zhang, G., Wang, D.: A review on predictive maintenance of production systems. IEEE Access
7, 182450–182470 (2019)
11. Mühlhoff, R.: Predictive privacy: Towards an applied ethics of data analytics. Ethics Inf.
Technol. 23, 675–690 (2021)
12. Jain, P., Gyanchandani, M., Khare, N.: Enhanced secured map reduce layer for big data privacy and security. J. Big Data 6 (2019)
13. Kaufmann, U.H., Tan, A.B.C.: Why data analytics is important? In: Data Analytics for
Organisational Development. pp. 1–20. John Wiley & Sons (2021)
14. Jiang, R., Bouridane, A., Li, C.-T., Crookes, D., Boussakta, S., Hao, F., Edirisinghe, E.A.: Big
Data Privacy and Security in Smart Cities. Springer, Cham (2022)
15. Chen, Y., Wang, D., Xie, L.: Big data analytics in tourism: A literature review. Tour. Manage.
68, 301–323 (2019)
16. Gupta, V., Jain, V., Jain, S.: Big data analytics in supply chain management: A comprehensive
overview. J. Enterp. Inf. Manag. 34(4), 1121–1152 (2021)
17. Azmi, J., Arif, M., Nafis, M.T., Alam, M.A., Tanweer, S., Wang, G.: A systematic review
on machine learning approaches for cardiovascular disease prediction using medical big data.
Med. Eng. Phys. 105 (2022)
18. Fatemeh, N., Yufei, Y., Norm, A.: An examination of the hybrid meta-heuristic machine learning
algorithms for early diagnosis of type II diabetes using big data feature selection. Healthcare
Anal. 4 (2023)
19. Kagiso, S.M., Christian, W.: Application of machine learning algorithms for nonlinear system
forecasting through analytics—A case study with mining influenced water data. Water Res.
Indus. 29 (2023)
20. Choi, J., Kim, D., Kim, J.: Machine learning in predictive maintenance: A review. Sustainability
12(7), 2750 (2020)
21. Alam, F., Reaz, M.B.I., Ali, M.A.M.: Big data in smart grid: A comprehensive review and
trends. IEEE Access 7, 35877–35906 (2019)
Real-World Applications of Data
Analytics, Big Data, and Machine
Learning

Prince Shiva Chaudhary, Mohit R. Khurana, and Mukund Ayalasomayajula

Abstract In the era of digitalization, we stand on the cusp of a data revolution.


A staggering volume of data, sourced from manufacturing, banking, social media,
e-commerce, healthcare records, and more, collectively known as Big Data, has
inundated our world. Concerning the intelligent analysis of extensive datasets and
the development of advanced applications for diverse domains, the crucial founda-
tion lies in artificial intelligence (AI), placing specific emphasis on machine learning
(ML) and deep learning (DL). The scale of data generation today is staggering, and the
capabilities of these technologies are equally remarkable. In the realm of healthcare,
they facilitate early disease detection and customized treatments for patients, fundamen-
tally transforming healthcare delivery. In the financial sector, analytics are shaping
investment strategies, while in agriculture, they optimize resource allocation and crop
yields. Data-driven insights are enhancing transportation, energy management, and infrastructure systems in urban planning. These cutting-edge technologies collec-
tively empower us to unlock valuable insights, reduce costs, streamline operations,
and make data-driven decisions, across technical applications. This chapter conducts
a comprehensive exploration of the profound impact of Data Analytics, Big Data,
and AI in harnessing this data wealth across real-world applications and delves into
various algorithms and techniques employed in ML, DL, and analytics.

P. S. Chaudhary (B)
Department of Data Science, Worcester Polytechnic Institute, Worcester, MA, USA
e-mail: pchaudhary@wpi.edu
M. R. Khurana · M. Ayalasomayajula
Department of Materials Science and Engineering, Cornell University, Ithaca, NY, USA
e-mail: mrk263@cornell.edu
M. Ayalasomayajula
e-mail: ma2258@cornell.edu

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 237
P. Singh et al. (eds.), Data Analytics and Machine Learning, Studies in Big Data 145,
https://doi.org/10.1007/978-981-97-0448-4_12

1 Introduction

In an age characterized by the ubiquitous presence of data, our daily lives are intri-
cately interwoven with an abundance of digital sources, including but not limited
to digital watches, mobile phones, laptops, manufacturing, finance, healthcare, and
many more domains. The digital landscape unfolds as an ever-expanding tapestry of
information, as data recording remains ceaseless and pervasive [1, 2].
This data holds the potential to accelerate the development of intelligent applica-
tions across various domains. Central to this revolution driven by data is AI, where
data analytics, ML, and DL stand out as essential catalysts for transformation [3, 4].
The three primary forms of data analytics used in decision-making are descrip-
tive, predictive, and prescriptive analytics. Descriptive analytics involves histor-
ical data analysis to reveal trends and patterns, often displayed using visual repre-
sentations. Predictive analytics employs statistical modeling, along with ML and
DL algorithms, to evaluate potential future scenarios. Prescriptive analytics lever-
ages statistical methods to identify the best course of action, factoring in multiple
scenarios and their repercussions, saving time and money for businesses, making
optimal decisions, and impacting various sectors [5]. Predictive algorithms methods
can be segmented into supervised, unsupervised, semi-supervised, and reinforce-
ment learning methods [6–8]. The efficacy, accuracy, and precision of AI applica-
tions are inherently linked to the intrinsic qualities and attributes of the data fed to
selected models. AI employs various techniques, such as regression, classification,
feature selection, natural language processing (NLP), large language models (LLM),
dimensionality reduction, clustering, reinforcement learning, and computer vision to
develop applications. Selection of an ML or DL algorithm, fine-tuning it, and contin-
uously learning from data to output as per objectives for a specific application in a
given domain presents a formidable challenge because of the attributes and nature
of the data [9–11].
ML and DL, enable applications to learn and evolve through past data and
continuously improve with new data, without explicit rule-based programming,
making it a cornerstone of modern technology. Various industries are going through
transformation and have embraced the power of data through AI technologies [12].
In today’s data-driven economy, the significance of Data Analytics, Big Data,
and Machine Learning cannot be overstated. These technologies collectively act as
the backbone of a transformative paradigm, unlocking unprecedented insights and
opportunities across various industries. Data Analytics allows businesses to decipher
patterns, trends, and correlations within vast datasets, informing strategic decisions
and improving operational efficiency. Big Data, characterized by the management
and analysis of massive and diverse datasets, provides a wealth of information that
traditional approaches can’t handle. This abundance of data, when harnessed effec-
tively, fuels innovation and drives competitive advantages. ML and DL methods
empower us to learn from data, enabling predictive capabilities, and personalized
user experiences. These technologies improve decision-making processes and pave
the way for innovations that redefine how businesses operate, ensuring they stay agile
and responsive in an increasingly data-centric environment. Section “Real-World
Application Domains” details diverse domains, exploring examples, challenges, and
benefits, providing a comprehensive understanding of the applications and signifi-
cance of Data Analytics, Big Data, and Machine Learning across various real-world
scenarios.
In this chapter, a comprehensive exploration of the diverse landscape of data and
understanding data analytics, bigdata, and machine learning algorithms are presented,
shedding light on their principles, significance, and potential applications across a
spectrum of real-world domains. The attributes of real-world data are discussed,
followed by an overview of AI algorithms.

2 Data Variety for AI: A Spectrum of Structured, Unstructured, and Semi-structured Data Types

It is critical to understand the intricate interplay between data attributes, structures,
and modeling methods for data-driven decision-making. Data exists in a myriad of
forms, consisting of unique characteristics and nuances. To grasp the concept of Big
Data, one must consider its three fundamental aspects, often referred to as the “3
Vs”. Firstly, Volume plays a pivotal role, exemplified by the massive data generated
by various sources. Secondly, Variety extends beyond data quantity to encompass the
diversity of data types and structures. Lastly, Velocity emphasizes the real-time
nature of data collection, processing, and analysis. As the internet and connected
devices accelerate data generation, the speed at which data is produced and managed
is vital [13]. The combination of data attributes and modeling techniques is essential
for accessing valuable insights and propelling innovation across diverse domains,
underscoring the fundamental significance of this comprehension.
Structured Data: Structured data refers to highly organized and well-defined data
that fits neatly into traditional relational databases or spreadsheets. This data type is
characterized by its tabular format, with rows and columns, and a clear schema [14].
Structured data includes information such as customer names, addresses, transac-
tion amounts, dates, and product SKUs (Stock Keeping Unit). These are commonly
used in conventional databases and are ideal for performing operations like sorting,
filtering, and aggregating. Some of the advantages of using structured data are:
• Organization: Structured data is highly organized, following predefined schemas.
This makes it easy to store, retrieve, and analyze using traditional database
management systems.
• Efficiency: Structured data allows for efficient querying and reporting, making it
suitable for well-established business processes.
• Data Integrity: Clear data structures facilitate data integrity, reducing the likeli-
hood of errors.
Some of the disadvantages of using structured data are:
• Rigidity: The predefined structure of structured data can be limiting. It may not
accommodate data that doesn’t fit neatly into the established schema.
• Scalability: Scaling structured databases to accommodate large volumes of data
can be costly and complex.
Unstructured Data: Unstructured data is, characterized by a lack of organization
and a format that doesn’t conform to traditional databases. It can take various forms,
including text, audio, video, and more [14]. Unstructured data encompasses vast
volumes of text documents, social media posts, audio recordings, images, and video
files. This type of data often contains valuable insights but is challenging to analyze
due to its unorganized nature. Some of the advantages of using unstructured data are:
• Rich Content: Unstructured data holds a wealth of untapped information,
but unlocking its value requires advanced methods such as NLP, and image
recognition.
• Flexibility: Unstructured data doesn’t require a predefined structure, making it
suitable for rapidly evolving data sources.
• Big Data Insights: It allows organizations to utilize big data to find hidden patterns.
Some of the disadvantages of using unstructured data are:
• Analysis Complexity: Analyzing unstructured data can be challenging. It may
require advanced techniques such as natural language processing or image
recognition.
• Storage and Processing Costs: Managing and processing unstructured data can be
costly, especially as data volumes increase.
• Data Privacy: Unstructured data often contains sensitive information, raising
privacy and security concerns.
Semi-structured Data: Semi-structured data occupies a middle ground between
structured and unstructured data. It possesses some level of structure but does not
adhere to rigid schemas found in structured data [14]. Common instances of semi-
structured data include JSON and XML files, emails, and NoSQL databases. While
they may have defined attributes, the data may not be uniformly structured. Some of
the advantages of using semi-structured data are:
• Flexibility with Structure: Semi-structured data offers a balance between struc-
tured and unstructured data. It allows for some structure while accommodating
varying data formats.
• Adaptability: It is suitable for dynamic data sources and scenarios where schemas
may evolve over time.
Some of the disadvantages of using semi-structured data are:
• Complexity: Analyzing semi-structured data can be more complex than structured
data, as it may not adhere to a uniform schema.
• Tool Dependency: Effectively working with semi-structured data often requires
specialized tools and software.
• Data Quality: Maintaining data quality can be a challenge, as it does not always
conform to a rigid schema.
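To make the distinction concrete, a small illustrative sketch of the three data types in Python follows; the records shown are invented purely for illustration.

```python
# Sketch: the same customer information as structured, semi-structured, and unstructured data.
import json
import pandas as pd

# Structured: tabular rows with a fixed schema (columns).
structured = pd.DataFrame([{"customer": "A101", "amount": 250.0, "date": "2024-01-05"}])

# Semi-structured: JSON with some structure, but fields may vary between records.
semi_structured = json.loads('{"customer": "A101", "orders": [{"sku": "X9", "qty": 2}]}')

# Unstructured: free text that needs NLP techniques to analyse.
unstructured = "Customer A101 emailed to say the delivery arrived late but the product was great."

print(structured, semi_structured, unstructured, sep="\n")
```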
Understanding these three data types is crucial when embarking on data anal-
ysis projects. Based on the data, analysts and data scientists can choose appro-
priate techniques and tools to extract valuable insights and make informed decisions.
Subsequently, various categories of AI algorithms are discussed.

3 Synergistic Integration of Data Analytics, Big Data, and Machine Learning

In the present era dominated by a surfeit of data, the seamless integration of Data
Analytics, Big Data, and Machine Learning (ML) emerges as a linchpin for driving
innovation and steering industries toward data-driven excellence.
Big Data, characterized by its colossal volume, velocity, and variety, stands as a
pivotal asset in this data-centric landscape [13]. It encapsulates extensive datasets
sourced from diverse origins, providing a reservoir of information crucial for robust
analysis and insights generation. Big Data’s significance lies in its ability to handle
and process vast datasets that traditional data processing systems find overwhelming.
The technology associated with Big Data facilitates storage, retrieval, and analysis
of massive datasets, allowing organizations to extract valuable insights and patterns
that might otherwise remain hidden.
Data Analytics, comprising descriptive, predictive, and prescriptive analytics,
emerges as the next layer in this technological triad. Descriptive analytics unveils
historical patterns and trends, offering a retrospective view of data. Predictive
analytics utilizes data mining, statistical modeling, and ML algorithms to foresee
future possibilities, while prescriptive analytics guides decision-making by identi-
fying optimal courses of action. The synergy between Big Data and Data Analytics
becomes evident as the latter relies on the expansive datasets provided by Big Data
to extract meaningful insights [5].
Machine Learning adds the cognitive dimension to this integrated framework
of predictive analytics mentioned earlier. Categorized into supervised, unsuper-
vised, semi-supervised, and reinforcement learning, ML algorithms facilitate iter-
ative learning from data discussed in detail in the section “Categories of Learning:
Exploring the Dimensions of Supervised, Unsupervised, and Reinforcement Algo-
rithms”. This process enables systems to evolve, improving their ability to make
predictions, detect anomalies, and automate decision-making. The synergy between
Machine Learning, Big Data, and Data Analytics becomes a dynamic force, espe-
cially potent in scenarios where traditional rule-based programming falls short [6–8].
The subsequent case study will further elucidate this symbiotic relationship through
real-world examples and explore how their synergistic integration propels industries
toward transformative outcomes.

Case Study on Transformative Synergy in Predictive Maintenance: In the
domain of industrial operations, the seamless integration of Data Analytics, Big Data,
and Machine Learning (ML) is vividly exemplified through the application of predic-
tive maintenance. This case study illuminates how these technologies collaboratively
drive efficiency, cost savings, and operational excellence.
Problem Statement: Manufacturing plants face challenges in equipment down-
time, impacting production efficiency and increasing maintenance costs. Manu-
facturing organizations aim to transition from reactive to proactive maintenance,
minimizing unexpected breakdowns and optimizing equipment performance.
Integration of Big Data: Enormous datasets are collected from sensors embedded
in manufacturing equipment, capturing real-time operational parameters, and perfor-
mance metrics. This influx of Big Data serves as the foundation for a holistic understanding of equipment health and performance.
Data Analytics in Action: Descriptive analytics unveils historical patterns of
equipment failures, identifying commonalities preceding breakdowns. Predictive
analytics, powered by advanced ML algorithms, is employed to forecast potential
issues by analyzing patterns and anomalies within the vast datasets. This proactive
approach allows for the timely scheduling of maintenance activities.
Machine Learning’s Predictive Power: ML models are trained on historical
data to predict the remaining useful life of critical components and identify nuanced
patterns indicative of imminent failures. The model’s continuous learning from new
data enables it to adapt and improve its predictive accuracy over time.
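A toy sketch of such a remaining-useful-life (RUL) model is given below; the sensor feature names and the synthetic data are assumptions for illustration, not taken from a real plant.

```python
# Sketch: regressing remaining useful life (RUL) of equipment from sensor readings.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 2000
data = pd.DataFrame({
    "vibration": rng.normal(1.0, 0.3, n),          # hypothetical sensor channels
    "temperature": rng.normal(70, 5, n),
    "pressure": rng.normal(30, 2, n),
})
# Synthetic RUL: degrades as vibration and temperature rise (illustration only).
data["rul_hours"] = (500 - 150 * data["vibration"]
                     - 2 * (data["temperature"] - 70) + rng.normal(0, 20, n))

X_train, X_test, y_train, y_test = train_test_split(
    data[["vibration", "temperature", "pressure"]], data["rul_hours"], random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
print("MAE (hours):", mean_absolute_error(y_test, model.predict(X_test)))
```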
Operational Impact: The integrated approach significantly reduces unplanned downtime, leading to a substantial increase in overall equipment efficiency. Predictive
maintenance leads to cost savings by minimizing the need for emergency repairs and
optimizes spare parts inventory.
The seamless synergy of Big Data, Data Analytics, and ML not only transforms
maintenance strategies but also sets a precedent for data-driven decision-making
across the manufacturing landscape.

4 Categories of Learning: Exploring the Dimensions of Supervised, Unsupervised, and Reinforcement Algorithms

In the ever-expanding landscape of data-driven technologies, the choice of data
types plays a pivotal role in determining the most suitable ML or DL category for
a given application. The nature of the data, whether structured, unstructured, or
semi-structured, dictates the choice of algorithms. For structured data with a response
variable, which neatly fits into tables and relational databases, supervised learning
methods are often preferred for tasks like classification and regression. Unstructured
data without a response variable, encompassing texts, images, and audio, frequently
utilizes unsupervised learning and reinforcement learning when rewards are involved.
Semi-structured data, found in XML or JSON formats, offers a bridge between the
two, making it ideal for semi-supervised learning [1, 6]. This synergy between data
types and learning categories empowers developers to tailor their ML and DL models
to the specifics of their data, enabling more accurate and insightful results in various
domains. Understanding these connections is essential in utilizing the full spectrum
of possibilities in the era of big data analytics and AI.
Understanding the various categories of ML and DL is crucial for harnessing
their capabilities effectively. This section delves into the distinctions between key
categories: supervised learning, unsupervised learning, semi-supervised learning,
and reinforcement learning as shown in Fig. 1.

Fig. 1 Types of machine learning—from supervised and unsupervised to semi-supervised and reinforcement learning
Supervised Learning: Supervised learning is a foundational machine learning
paradigm driven by labeled data, enabling algorithms to learn the mapping from
inputs to responses with precision. This approach utilizes historical data and labels to
forecast events, beginning with the training of datasets to develop inferred functions.
These functions then predict output values when presented with new input data. The
process involves comparing the predicted and expected results to identify errors and
subsequently refining the model for improved accuracy and performance [11, 15,
16]. Some of the supervised algorithms are listed below:
• Linear Regression
• Logistic Regression
• Decision Trees
• Random Forest
• Support Vector Machines (SVM)
• k-Nearest Neighbors (k-NN)
• Naive Bayes
• Neural Networks
• Gradient Boosting (e.g., XGBoost, LightGBM)
• Linear Discriminant Analysis (LDA)
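As a minimal supervised-learning sketch using one of the algorithms listed above (logistic regression on a labelled dataset bundled with scikit-learn):

```python
# Sketch: supervised learning — fit a classifier on labelled examples, then predict unseen data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)          # features paired with known labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)                            # learn the input-to-label mapping
print("Test accuracy:", clf.score(X_test, y_test))
```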
Unsupervised Learning: Unsupervised learning is an approach that explores
structured or unstructured data without predefined outcomes or responses. Algo-
rithms in this category excel at identifying hidden patterns, correlations, and struc-
tures within the data, making them valuable for clustering and dimensionality reduc-
tion [16]. Unsupervised learning is a versatile approach of ML and DL as it can
be used for tasks such as clustering, feature learning, dimensionality reduction, and
anomaly detection [1]. Some of the unsupervised algorithms are listed below:
• K-Means Clustering
• Hierarchical Clustering
• Principal Component Analysis (PCA)
• Independent Component Analysis (ICA)
• Gaussian Mixture Models (GMM)
• Self-Organizing Maps (SOM)
• Autoencoders
• Isolation Forest
• DBSCAN
• t-Distributed Stochastic Neighbor Embedding (t-SNE)
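A minimal unsupervised-learning sketch combining two of the algorithms above (dimensionality reduction followed by clustering on unlabelled data):

```python
# Sketch: unsupervised learning — discover clusters and a low-dimensional view without labels.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X, _ = load_iris(return_X_y=True)                 # labels are deliberately ignored

X_2d = PCA(n_components=2).fit_transform(X)       # dimensionality reduction
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)
print("Cluster sizes:", [int((labels == k).sum()) for k in range(3)])
```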
Semi-Supervised Learning: Semi-supervised learning represents an interme-
diate category that leverages a combination of labeled and unlabeled data and falls
in between supervised and unsupervised learning methods. Algorithms can exploit
the small amount of labeled data available and generalize patterns to unlabeled data.
Semi-supervised learning finds applications in scenarios where obtaining fully labeled
datasets is challenging. While it offers advantages such as efficient learning and
generalized pattern recognition, it still requires some labeled data, and the model
complexity can pose challenges [1, 16]. Some of the semi-supervised algorithms are listed
below:
• Label Propagation
• Self-training
• Multi-view Learning
• Co-training
• Tri-training
• Semi-Supervised Generative Adversarial Networks (GANs)
• Multi-instance Learning
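A minimal semi-supervised sketch using self-training, where most labels are hidden and the classifier propagates its own confident predictions; the dataset, base model, and labelled fraction are illustrative assumptions.

```python
# Sketch: semi-supervised learning — train with a few labelled points and many unlabelled ones.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
rng = np.random.default_rng(0)
y_partial = y.copy()
mask_unlabeled = rng.random(len(y)) > 0.1          # hide roughly 90% of the labels
y_partial[mask_unlabeled] = -1                     # -1 marks an unlabelled sample

base = SVC(probability=True, gamma=0.001)
model = SelfTrainingClassifier(base).fit(X, y_partial)
print("Accuracy on the originally hidden labels:",
      model.score(X[mask_unlabeled], y[mask_unlabeled]))
```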
Reinforcement Learning: Reinforcement learning is an adaptive approach that
centers around interactions with an environment, aiming to achieve specific objec-
tives. This method is particularly prominent when training agents to make a sequence
of decisions, optimizing cumulative rewards in the process [1, 6, 17]. It operates based
on a system of rewards or penalties, ultimately seeking to leverage insights from envi-
ronmental interactions to maximize rewards or minimize risks [6]. While its adapt-
ability to autonomous systems is a considerable advantage, it may entail intensive
computational demands. Applications of reinforcement learning span game-playing
agents and autonomous robotics. Some of the reinforcement learning algorithms are listed below:
• Q-Learning
• Deep Q-Networks (DQN)
• Proximal Policy Optimization (PPO)
• Actor-Critic
• Trust Region Policy Optimization (TRPO)
• Asynchronous Advantage Actor-Critic (A3C)
• Deep Deterministic Policy Gradient (DDPG)
• Reinforcement Learning from Human Feedback (RLHF)
• Curiosity-Driven Exploration
• AlphaZero
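A tiny tabular Q-learning sketch follows; the five-state corridor environment is made up purely to illustrate the reward-driven update rule behind several of the algorithms above.

```python
# Sketch: tabular Q-learning on a toy corridor — move left/right, reward only at the rightmost state.
import numpy as np

n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2
rng = np.random.default_rng(0)

for episode in range(500):
    state = 0
    while state != n_states - 1:                       # episode ends at the goal state
        if rng.random() < epsilon:                     # explore
            action = int(rng.integers(n_actions))
        else:                                          # exploit current knowledge
            action = int(np.argmax(Q[state]))
        next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Q-learning update: move the estimate toward reward + discounted best future value.
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print("Learned Q-table:\n", Q.round(2))
```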
Please note that the algorithms listed for each category are not exhaustive, as there are many more algorithms and variations in each category. In the subsequent section, some of these algorithms are discussed briefly.

5 Machine Learning Algorithms

AI learning algorithms serve as the foundational building blocks of intelligent data
analysis and decision-making systems. ML and DL are a dynamic and ever-evolving
field that encompasses a multitude of techniques, each tailored to address specific
data-driven challenges. To shed light on some of the key learning algorithms in this
vast landscape, our approach is structured around a systematic study of algorithms
categorized into three main domains: supervised, unsupervised, and reinforcement
learning as covered in the last section. Each category of algorithms has its unique
characteristics, applications, and benefits, making them essential tools for solving a
wide array of real-world problems. The objective of this section is to equip the readers
with the knowledge and insights needed to navigate the intricate terrain of learning
algorithms and apply them effectively in the world of data-driven decision-making.

5.1 Supervised Learning

Supervised learning is one of the foundational pillars of ML, where algorithms are
guided by labeled data to make predictions or classifications based on historical
patterns. In this paradigm, the model learns from input data paired with corresponding
output labels or responses, allowing it to establish a relationship or mapping between
the two. Supervised learning is akin to having a teacher or supervisor who provides
the algorithm with a clear roadmap or feedback, enabling it to generalize from known
examples and make informed decisions on unseen data. This approach is particularly
valuable for tasks, where predicting outcomes or classifying data is crucial, such
as image recognition, speech analysis, and medical diagnoses. The precision and
interpretability of supervised learning models make them indispensable tools for
various real-world applications, and understanding the principles underlying this
category is essential for harnessing their power in the data-driven world. The key
learning algorithms for both regression and classification tasks are described below.
The key distinction between classification and regression lies in their predictive
nature: classification predicts distinct class labels, whereas regression focuses on
estimating continuous quantities.
Regression: In the realm of supervised learning, regression algorithms, offer a
straightforward approach to predicting output values by minimizing errors based on
input data consisting of specific features [18]. These algorithms primarily handle
continuous response variables and are instrumental in various applications. Regres-
sion analysis encompasses a range of machine learning methods that enable the
prediction of a continuous response or output variable based on one or more input
variables or parameters [16]. Regression models have found extensive use in diverse
fields, such as real estate, banking, insurance, finance, manufacturing, time series
forecasting, and more. The following sections provide a brief overview of some
prominent types of regression algorithms.
Linear Regression: Within the domain of regression analysis, we explore several
powerful algorithms that excel in predicting continuous outcomes based on input
variables. Simple Linear Regression (SLR), Multiple Linear Regression (MLR), and
Polynomial Regression are key algorithms under Regression.
• Simple Linear Regression (SLR): SLR is a fundamental statistical method to
analyze the relationship between two variables: a dependent variable (response)
and an independent variable (predictor) [16]. In SLR, we seek to model the
linear relationship between these variables, allowing us to make predictions, infer
patterns, and understand how changes in the predictor affect the response. SLR
essentially helps us find the best-fitting line (the regression line) that minimizes
the sum of squared differences between the observed values of the response and
the values predicted by the model. This fitted line represents the linear relationship
between the variables and is used for making predictions.
• Multiple Linear Regression (MLR): Extends this concept by incorporating
multiple predictor variables to create a more generalized predictive model for
a dependent variable (response) and two or more independent variables (predic-
tors) [16]. MLR is a fundamental statistical method used for prediction, hypoth-
esis testing, and understanding how multiple variables collectively influence an
outcome. In MLR, the model equation accounts for multiple independent vari-
ables. MLR allows us to understand how each independent variable influences the
dependent variable while controlling for the effects of the others or keeping other
variables constant. By estimating the coefficients, we quantify the impact of each
independent variable on the response. The goal of MLR is to find the best-fitting
linear relationship that minimizes the sum of squared differences between the
observed values of response and the values predicted by the model.
• Polynomial Regression: Polynomial Regression is an extension of SLR/MLR
that allows us to model relationships between a dependent variable (response) and
one or more independent variables (predictors) when the relationship is nonlinear
[18]. In situations where a linear model doesn’t capture the underlying relationship
in the data, polynomial regression offers a more flexible approach. The equation
for polynomial regression involves using polynomial terms to model the nonlinear
relationship between variables. The challenge in using polynomial regression is
selecting the appropriate degree of the polynomial, as higher degrees may lead
to overfitting. Careful model evaluation and validation are essential to ensure the
chosen polynomial degree accurately represents the data.
• LASSO Regression and Ridge Regression: Polynomial regression is a powerful and flexible method for modeling data where nonlinear relationships exist. However, it faces various challenges, especially in real-world scenarios where data can be noisy and overfitting is a concern. In such cases, regularization methods like LASSO (Least Absolute Shrinkage and Selection Operator) and Ridge Regression are a better fit.
• LASSO (Least Absolute Shrinkage and Selection Operator): LASSO is a
regression technique aimed to address overfitting, multicollinearity, and the
problem of too many irrelevant features. The core idea behind LASSO is to add a
regularization term to the linear regression model. This regularization term penal-
izes (L1 penalty) the absolute values of the coefficients, effectively shrinking some
of them to zero. In other words, LASSO performs feature selection by setting
coefficients associated with irrelevant or redundant features to zero [19, 20]. By
reducing the number of predictors, LASSO simplifies the model while retaining
the essential relationships between variables.
• Ridge Regression: like LASSO, introduces a regularization term, but it takes
a different approach. Instead of using the absolute values of coefficients, Ridge
Regression penalizes the squares of coefficients. This results in a more gradual
shrinkage of coefficients. While it doesn’t perform feature selection like LASSO,
it effectively mitigates the impact of multicollinearity, which is common in real-
life datasets. By limiting the magnitude of coefficients, Ridge Regression ensures
that no single predictor has an excessive influence on the model [21].
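To make the regression variants above concrete, the following minimal sketch fits simple linear, polynomial, LASSO, and Ridge regression on the same synthetic data using scikit-learn. The data-generating formula, the polynomial degrees, and the penalty strengths (alpha values) are illustrative assumptions, not values prescribed by this chapter.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic data: one predictor with a noisy, mildly nonlinear response
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 - X[:, 0] + rng.normal(scale=0.5, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "Simple linear regression": LinearRegression(),
    "Polynomial regression (degree 2)": make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False), LinearRegression()),
    "LASSO on degree-5 features (L1)": make_pipeline(
        PolynomialFeatures(degree=5, include_bias=False), StandardScaler(), Lasso(alpha=0.05)),
    "Ridge on degree-5 features (L2)": make_pipeline(
        PolynomialFeatures(degree=5, include_bias=False), StandardScaler(), Ridge(alpha=1.0)),
}

for name, model in models.items():
    model.fit(X_train, y_train)                               # least-squares / penalized fit
    mse = mean_squared_error(y_test, model.predict(X_test))   # error on unseen data
    print(f"{name:35s} test MSE = {mse:.3f}")
```

The regularized pipelines deliberately use an over-flexible degree-5 expansion so that the L1 and L2 penalties have something to shrink; in practice the degree and penalty strength would be chosen by cross-validation.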
Classification: Classification algorithms play a pivotal role in machine learning,
organizing data into meaningful categories. They come in two primary forms: binary
classification and multiclass classification. Binary classification assigns data points
to two distinct classes, such as spam detection or sentiment analysis. Multiclass clas-
sification extends to multiple categories, often used in image recognition or document
categorization. These algorithms automate decision-making based on data, enabling
a wide range of applications. Their adaptability to complex real-world scenarios
makes classification algorithms invaluable in the field of artificial intelligence and
data analysis. Now, we will delve into some of the key classification algorithms prevalent in the field; a short code sketch comparing several of them follows this list.
• Logistic Regression: Logistic regression is a fundamental statistical method
widely employed in machine learning and statistical modeling [22]. It serves
as an essential technique for binary classification tasks where the outcome variable is categorical, typically taking one of two classes. The model utilizes the logistic function (also
known as the sigmoid function) to transform linear combinations of input features
into probabilities. In logistic regression, the log-odds of the binary outcome variable are modeled as a linear combination of the predictors, which is then passed through the logistic function. This transformation constrains the output to fall between 0 and 1, making it interpretable as a probability. The model is trained using methods like
maximum likelihood estimation, with coefficients optimized to best fit the training
data. Logistic regression has found applications in various domains, including
healthcare for predicting disease outcomes, finance for credit scoring, and natural
language processing for sentiment analysis.
• K-Nearest Neighbor (KNN): The fundamental concept behind KNN is to clas-
sify data points based on their proximity to other data points within the dataset
[23]. This proximity is often calculated using metrics like Euclidean distance,
Manhattan distance, or other distance measures. KNN’s simplicity and effective-
ness make it a valuable tool in various domains, particularly in pattern recognition
and recommendation systems. KNN assigns a label/class to a data point by consid-
ering the labels/class of its k-nearest neighbors, where “k” is a user-defined param-
eter. KNN operates under the assumption that data points with similar features
tend to belong to the same class. By identifying the k-nearest neighbors of a
data point, KNN utilizes the known labels/class of these neighbors to forecast the
label for the given/new data point. Nevertheless, the efficacy of KNN is strongly
influenced by the selection of “k” and the distance metric.
• Support Vector Machines: Support Vector Machines (SVM) is a supervised ML
algorithm. SVM has gained widespread recognition for its robust performance and
flexibility [24]. SVM works by finding an optimal decision boundary that maxi-
mizes the margin between classes within the dataset that allows SVM to make
effective class predictions. SVM can handle both linear and nonlinear relationships
in data by applying transformations on the input data through the use of kernels,
such as the radial basis function (RBF) kernel [25]. One of SVM’s key strengths
is its ability to deal with high-dimensional data and large feature sets. Neverthe-
less, SVM’s performance is sensitive to the choice of kernel and regularization parameters, and it can be computationally demanding for extensive datasets. Despite
these challenges, SVM remains a valuable tool for solving complex classification
and regression problems.
• Decision Tree: Decision Trees are fundamental ML algorithms used for regression
and classification tasks. These algorithms create a tree-structured model where
each node represents a feature/parameter, and the branches depict decisions made
based on the feature’s values. The leaves of the tree contain class labels for clas-
sification problems or numerical values for regression problems. Decision Trees
are known for their interpretability, allowing users to visualize and understand
the decision-making process. CART (Classification and Regression Trees) is one
of the most common implementations of Decision Trees. It recursively splits the
dataset into subsets based on feature/parameter thresholds, aiming to maximize
information gain for classification tasks or minimize variance for regression tasks.
However, Decision Trees are prone to overfitting when they become too complex,
making them less effective with noisy or high-dimensional data. Various tech-
niques, such as pruning and setting minimum samples per leaf, help control over-
fitting. Decision Trees find applications in domains like healthcare, finance, and
natural language processing, where transparency and interpretability are crucial
[26].
• Random Forest: Random Forest is an ensemble learning algorithm that combines multiple decision trees to enhance predictive accuracy [27]. In this approach, a forest of decision trees is created, and the final prediction is determined by aggregating the predictions of the individual trees (by majority vote for classification or averaging for regression). Random Forest offers several advan-
tages, including resistance to overfitting, the ability to handle both classification
and regression tasks, and feature importance ranking. It has found wide appli-
cations in various fields as well. Random Forest’s success lies in its capacity to
maintain predictive power while mitigating overfitting and noise. It is a versa-
tile algorithm for both beginners and experts, widely used for its robustness and
exceptional performance across diverse domains.
• Adaptive Boosting: Adaptive Boosting (AdaBoost) is an ensemble ML algo-
rithm crafted to improve the accuracy of weak learners, such as decision trees,
in the context of classification tasks. AdaBoost has found widespread appli-
cations in various domains, including computer vision and natural language
processing. AdaBoost works by iteratively training a series of weak learners,
giving more weight to misclassified examples in each round and focusing on
the most challenging instances [28]. It then combines the predictions of these
learners with varying weights to create a strong classifier. One significant advan-
tage of AdaBoost is its ability to correct the limitations of Random Forest. While
Random Forest constructs multiple decision trees in parallel, AdaBoost does so
sequentially, adjusting each tree’s focus to areas where previous trees have strug-
gled. This adaptability enables AdaBoost to fix the bias-variance trade-off issue
common in Random Forest, where too many trees can lead to overfitting and too
few may result in underfitting. AdaBoost’s emphasis on data points that previous
learners have misclassified makes it particularly effective in improving model
performance.
• Convolution Neural Networks (CNN): CNNs are a class of DL algorithms that
have revolutionized computer vision tasks by modeling data in a grid-like/matrix
structure, such as images and video frames. CNNs were inspired by the visual
processing performed by the human brain and were designed to automatically and
adaptively learn spatial hierarchies of features from data. CNNs consist of multiple
layers, including convolutional, pooling, and fully connected layers, which work
together to progressively extract complex features from raw input data. LeNet-5,
developed by Yann LeCun in 1998, was one of the pioneering CNN architectures
for handwritten digit recognition [29]. CNNs have since evolved, with AlexNet,
introduced by Krizhevsky et al. in 2012, gaining huge success. Other notable
CNNs include VGGNet, and ResNet, each adding some improvements or modifi-
cations to existing models. CNNs excel in various computer vision tasks like image
classification, object detection, and facial recognition, thanks to their ability to
automatically learn and adapt features from data. These networks are also essen-
tial in modern applications, including self-driving cars, medical image analysis,
and more.
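As a concrete, minimal illustration of the classical classifiers discussed above, the sketch below trains logistic regression, KNN, an RBF-kernel SVM, a CART-style decision tree, Random Forest, and AdaBoost on the same binary task using scikit-learn's built-in breast cancer dataset. The specific hyperparameters (k = 5, C = 1.0, tree depth, numbers of estimators) are assumptions made only for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score

# Binary classification example: benign vs. malignant tumours
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

classifiers = {
    "Logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "K-nearest neighbours (k=5)": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
    "Decision tree (CART)": DecisionTreeClassifier(max_depth=4, random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                          # learn a decision boundary
    acc = accuracy_score(y_test, clf.predict(X_test))  # evaluate on held-out data
    print(f"{name:30s} test accuracy = {acc:.3f}")
```

Standardization is applied only where the algorithm is distance- or margin-based (logistic regression, KNN, SVM); the tree-based models do not require it.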
In Sect. 5.1, we discussed a selection of key classification algorithms. However,
it’s essential to acknowledge that there are numerous other classification algorithms
available. Due to the extensive range of classification techniques, we’ve focused on
highlighting a few pivotal ones in this overview.

5.2 Unsupervised Learning

Unsupervised learning is a category of ML and DL where the algorithms operate


without labeled output data to guide their learning process. Instead, these algorithms
explore and analyze input data, identifying patterns, structures, or relationships that
might not be immediately evident. A key advantage of unsupervised learning methods is their ability to uncover hidden insights in data, making them particularly useful in
scenarios such as clustering similar data points, dimensionality reduction, and iden-
tifying anomalies. Unlike supervised learning, where the algorithms are trained on
labeled data to make predictions, unsupervised learning aims to reveal the under-
lying structure of the data itself, often leading to valuable discoveries in various
fields, including data mining, image recognition, and recommendation systems. In
this section the key learning algorithms for this category are elaborated, offering an
understanding of their principles and applications.
Dimension Reduction: Dimensionality reduction serves as a cornerstone tech-
nique in data analysis, especially when dealing with high-dimensional datasets [30].
It is an approach that seeks to represent data in a reduced-dimensional space, main-
taining as much of the original information as possible. This not only simplifies the
data for easier visualization and understanding but also aids in mitigating challenges
like the “curse of dimensionality”, thereby optimizing the performance of applied
models [31]. Two prominent examples of dimensionality reduction methods are Prin-
cipal Component Analysis (PCA) [32] and t-Distributed Stochastic Neighbor Embed-
ding (t-SNE) [33]. While PCA linearly transforms the original data into orthogonal
components that capture the most variance, t-SNE is adept at capturing nonlinear
relationships in the data, making it especially suitable for visualizing clusters in
complex datasets. Both these techniques highlight the versatility and applicability of
dimensionality reduction in various data-intensive domains.
• Principal Component Analysis (PCA): It is a dimensionality reduction tech-
nique widely used in ML. Developed by Karl Pearson in the early twentieth
century [32], PCA has become a fundamental tool for extracting essential infor-
mation from high-dimensional data. PCA transforms raw high-dimensional data
into a new coordinate system (low-dimensional), revealing patterns and structures
within the data. It does this by identifying the principal components (PC), which
are orthogonal vectors that capture the maximum variance in the data. These
PCs offer a means to decrease data dimensionality while preserving maximum
information. The first PC aligns with the direction of maximum variance, and
successive components capture perpendicular directions with diminishing vari-
ances [32]. PCA has a broad range of applications across various domains. In
image analysis, PCA is used for face recognition by reducing the dimension-
ality of facial images while preserving essential facial features. PCA simplifies
complex data, making it more manageable and interpretable. It is a valuable tool
for exploratory data analysis and feature engineering, enabling researchers to
focus on the most relevant aspects of their data. In PCA, some information may
be lost during the dimensionality reduction process (not selecting all PCs), which
can be critical in certain applications. Moreover, PCA assumes that the PCs are
linear combinations of the original features, which practitioners should take into
consideration when employing PCA in their applications.
• t-Distributed Stochastic Neighbor Embedding (t-SNE): t-Distributed
Stochastic Neighbor Embedding (t-SNE) is an extensively used dimension reduc-
tion method [33]. It is particularly useful for visualizing high-dimensional data
and revealing underlying structures, making it a valuable tool in various fields. t-
SNE maps high-dimensional data to a lower-dimensional space while preserving
pairwise similarities between data points. It achieves this by modeling the data
in high-dimensional space as a probability distribution and then mapping them to
a corresponding distribution in low-dimensional space. This mapping minimizes
the Kullback–Leibler (KL) divergence between the two distributions, effectively
highlighting clusters and patterns in the data. t-SNE is used across a range of
domains, including bioinformatics, natural language processing, and computer
vision. In bioinformatics, it aids in the exploration of complex genetic datasets
and the identification of relevant clusters. In natural language processing, t-SNE
is applied to visualize word embeddings, facilitating the study of semantic rela-
tionships between words. Furthermore, in computer vision, t-SNE is a valuable
tool for revealing patterns in high-dimensional image data. One of t-SNE’s key
advantages is its ability to capture local relationships between data points, making
it an ideal choice for visualizing clusters and groupings in complex datasets. It
excels in exploratory data analysis, helping analysts uncover intricate structures
within their data. However, t-SNE is computationally intensive, which can limit its
use in real-time applications. Additionally, t-SNE is not a deterministic algorithm,
meaning different runs may yield slightly different results.
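The following minimal sketch applies the two dimensionality reduction techniques described above to the same high-dimensional data, using scikit-learn's built-in digits dataset; the choice of two output components and a perplexity of 30 are illustrative assumptions. The resulting two-dimensional embeddings could then be plotted to visualize cluster structure.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# 8x8 handwritten digit images flattened into 64-dimensional feature vectors
X, y = load_digits(return_X_y=True)

# PCA: linear projection onto orthogonal directions of maximum variance
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print("Variance captured by the first two PCs:", round(pca.explained_variance_ratio_.sum(), 3))

# t-SNE: nonlinear embedding that preserves local neighbourhood structure
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_tsne = tsne.fit_transform(X)
print("PCA embedding shape:", X_pca.shape, "| t-SNE embedding shape:", X_tsne.shape)
```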
Clustering: Cluster analysis is used for identifying and grouping related data
points within extensive datasets. Its primary purpose is to assemble a collection of
objects so that those within the same category, referred to as a cluster, exhibit a
higher degree of similarity to one another than to those in different groups [34]. This
technique serves as a valuable tool in data analysis, revealing noteworthy trends or
patterns within data, such as grouping consumers based on their behavioral traits.
Clustering finds application across a wide spectrum of domains. In the subsequent
sections, we provide concise overviews of various clustering methods.
• K-Means: K-Means is a clustering algorithm that partitions data into distinct clusters [35]. This algorithm is appreciated for its simplicity and efficiency, making it a fundamental tool in numerous data analysis applications. K-Means groups data points based on their similarity, ensuring that data within a cluster are more similar to one another than to data in other clusters. Its applications span a multitude of domains, including image segmenta-
tion, customer segmentation in marketing, anomaly detection, and recommenda-
tion systems, among others. K-Means clustering is not without its challenges, such
as the sensitivity to the initial placement of cluster centroids and the requirement
for specifying the number of clusters, known as “K”.
• Hierarchical Clustering: Hierarchical clustering is another prominent method
within the field of cluster analysis, frequently employed in data mining and unsu-
pervised machine learning tasks [36]. This approach aims to hierarchically orga-
nize data points into a tree structure, known as a dendrogram, where the root of
the tree represents a single cluster containing all data points, and the leaves denote
individual data points. Hierarchical clustering operates on the principle of divi-
sive (top-down) or agglomerative (bottom-up) methodologies. In agglomerative
hierarchical clustering, each data point begins as a singleton cluster, and pairs
of clusters are iteratively merged to construct a dendrogram, often using a prox-
imity matrix [36]. The merging of clusters is based on a linkage criterion, with
single-linkage, complete-linkage, and average-linkage being some of the common
choices. In the divisive approach, the entire dataset forms the initial cluster, which
is recursively partitioned into smaller clusters, resulting in a dendrogram [34].
Hierarchical clustering provides a valuable representation of data relationships
and nested clusters, facilitating the exploration of data structures, but it can be
computationally demanding for large datasets. This technique finds applications
in various domains, including biology, image processing, social network analysis,
and document retrieval, offering a rich framework for uncovering latent patterns
and structures within complex datasets [34].
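A minimal sketch of the two clustering approaches described above is given below. The synthetic blobs, the choice of K = 3, and the average-linkage criterion are assumptions made only for the example; the silhouette score is one common (but not the only) way to assess cluster quality.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

# K-Means: the number of clusters "K" must be chosen up front
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
km_labels = kmeans.fit_predict(X)

# Agglomerative (bottom-up) hierarchical clustering with average linkage
agglo = AgglomerativeClustering(n_clusters=3, linkage="average")
hc_labels = agglo.fit_predict(X)

# Silhouette score: how well separated the resulting clusters are (higher is better)
print("K-Means silhouette:     ", round(silhouette_score(X, km_labels), 3))
print("Hierarchical silhouette:", round(silhouette_score(X, hc_labels), 3))
```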

5.3 Reinforcement Learning

Reinforcement Learning (RL) [37] is a dynamic and expanding field within ML,
emphasizing interactions between intelligent agents and their environments. Unlike
other paradigms, RL involves autonomous decision-making and learning from
continuous feedback in the form of rewards or penalties. RL enables agents to learn
optimal sequences of actions for maximizing cumulative rewards. It mimics how
humans and animals adapt to their surroundings by interacting with the environment
and adjusting their strategies based on feedback.
RL’s versatility allows its application in various domains. It has proven invaluable
in robotics, teaching machines complex tasks like walking, flying, and grasping
objects. In autonomous vehicles, RL enables self-driving cars to navigate real-world
scenarios effectively. It also excels in gaming, training agents to play chess and
video games with superhuman proficiency. Despite its broad applications, RL poses
challenges. Training RL agents can be computationally expensive and data intensive.
One of the key RL algorithms is discussed below.
• Q-Learning: Q-learning [38] is a fundamental algorithm within the realm of Rein-
forcement Learning (RL) and stands out as a cornerstone for teaching agents to
make sequential decisions autonomously. Specifically, Q-learning is a model-
free algorithm designed to solve the problem of optimal control in an envi-
ronment. Unlike some RL methods that focus on learning a policy directly, Q-
learning targets the optimal action-value function (Q-function), which measures
the expected cumulative rewards for a specific action at a specific state and
following an optimal policy thereafter [39]. Q-learning iteratively updates the
Q-values for state-action pairs based on the observed rewards and transitions.
Q-Learning is used for various real-world applications, such as robotics, games,
and autonomous systems. In summary, Q-learning is a versatile and foundational
RL algorithm used for solving complex decision-making problems by approx-
imating the optimal action-value function. While it excels in various domains,
challenges related to overestimation and exploration strategies must be considered
for effective implementation.
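The sketch below illustrates the tabular Q-learning update rule on a deliberately tiny, hypothetical "corridor" environment. The environment itself, the learning rate, discount factor, and epsilon-greedy settings are all illustrative assumptions rather than details from [38].

```python
import numpy as np

# A tiny deterministic "corridor": states 0..4, actions 0 (left) and 1 (right).
# Reaching state 4 yields a reward of +1 and ends the episode; other steps give 0.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4
alpha, gamma, epsilon, episodes = 0.1, 0.9, 0.1, 500

def step(state, action):
    next_state = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

Q = np.zeros((N_STATES, N_ACTIONS))
rng = np.random.default_rng(0)

for _ in range(episodes):
    state, done = 0, False
    while not done:
        # Epsilon-greedy exploration: occasionally try a random action
        action = rng.integers(N_ACTIONS) if rng.random() < epsilon else int(np.argmax(Q[state]))
        next_state, reward, done = step(state, action)
        # Q-learning update: move Q(s, a) toward reward + gamma * max_a' Q(s', a')
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

print("Learned Q-table:\n", np.round(Q, 2))
print("Greedy policy (0 = left, 1 = right):", np.argmax(Q, axis=1))
```

After training, taking the argmax over actions in each state recovers the greedy policy, which in this toy environment is simply to keep moving right toward the goal.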

6 Real-World Application Domains

AI has become an expansive and influential realm, with AI algorithms playing a


pivotal role in various domains. This dedicated section now focuses on real-world
applications across diverse domains, showcasing the practical impact of these algo-
rithms. From healthcare and finance to manufacturing, energy, and more, AI has
seamlessly woven itself into these domains, reshaping their landscapes.
Each of these domains represents a unique narrative, complete with its own set of
challenges and innovative possibilities. AI has moved beyond the theoretical realm to
become an integral part of daily life, revolutionizing industries and redefining what’s
achievable for cost saving, operational excellence, process optimization, customer
experience, and many more areas. AI finds extensive applications in various domains,
including but not limited to: manufacturing, marketing and advertising, health-
care, finance, agriculture, cybersecurity, autonomous vehicles, energy and utilities,
e-commerce, natural language processing (NLP), and sentiment analysis.
These domains showcase the versatility of AI in addressing diverse real-world
challenges and opportunities. Key domains are discussed below to illustrate the diverse applications of AI. In doing so, various examples, challenges, opportunities, and advan-
tages are highlighted, while acknowledging that AI’s influence reaches well beyond
these areas to encompass many others.
6.1 Manufacturing

Manufacturing, a highly complex and precise industry, is rapidly embracing AI and
machine learning for optimizing production processes. This sub-section explores how
AI is transforming manufacturing. Some of the examples of using AI algorithms in
this domain are:
Predictive Maintenance: Machine learning models are deployed for predictive
maintenance of manufacturing equipment. These algorithms analyze sensor data and
historical maintenance records to predict when a machine might fail. As a result,
manufacturers can schedule maintenance before issues arise, minimizing costly
downtime and ensuring the continuous operation of the production line [40].
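As a hedged illustration of how such a predictive maintenance model might be set up, the sketch below trains a Random Forest classifier on entirely synthetic sensor features. The feature names, the failure-risk formula, and the 30-day failure label are hypothetical and exist only to make the example self-contained; a real deployment would use actual telemetry and maintenance records as described in [40].

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical sensor readings; in practice these would come from machine telemetry
rng = np.random.default_rng(1)
n = 1000
data = pd.DataFrame({
    "temperature": rng.normal(70, 10, n),
    "vibration": rng.normal(0.3, 0.1, n),
    "runtime_hours": rng.uniform(0, 5000, n),
})
# Synthetic failure label: hotter, noisier, longer-running machines fail more often
risk = (0.02 * (data["temperature"] - 70) + 5 * (data["vibration"] - 0.3)
        + 0.0004 * data["runtime_hours"])
data["failure_within_30_days"] = (risk + rng.normal(0, 0.5, n) > 1.0).astype(int)

X = data.drop(columns="failure_within_30_days")
y = data["failure_within_30_days"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```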
Process Optimization: Machine learning is employed to optimize various produc-
tion processes. AI algorithms analyze data from sensors and equipment to identify
patterns and correlations. By fine-tuning the production parameters, manufacturers
can achieve better yields, reduce defects, and improve overall quality [41].
Quality Control: In manufacturing, ensuring product quality is paramount.
Machine learning models are used for automated visual inspection and quality
control. These systems can quickly identify defects or irregularities in parts or
products, improving the quality of the final products and reducing waste [42].
Some of the opportunities of using AI algorithms in this domain are:
• Enhanced equipment reliability and uptime.
• Improved quality and yield.
• Efficient production processes.
• Cost reduction in manufacturing.
Some of the challenges of using AI algorithms in this domain are:
• Handling large volumes of sensor data.
• Developing robust predictive maintenance models.
• Integrating AI into existing manufacturing systems.
Some of the benefits of using AI algorithms in this domain are:
• Reduced downtime and maintenance costs.
• Increased quality and yield.
• Efficient and sustainable production processes.
• Improved product quality and customer satisfaction.
Machine learning algorithms are revolutionizing the manufacturing sector by
increasing production efficiency, reducing defects, and ensuring product quality.
While challenges exist, the benefits include cost savings, higher-quality products,
and sustainable manufacturing processes.
6.2 Marketing and Advertising

The marketing and advertising industry is undergoing a significant transformation
through the integration of AI and machine learning. This sub-section explores how
these technologies are reshaping marketing strategies and customer engagement.
Some of the examples of using AI algorithms in this domain are:
Personalized Recommendations: ML methods are applied to consumer behavior,
purchase history, and preferences to provide personalized recommendations. E-
commerce platforms like Amazon and streaming services like Netflix utilize these
algorithms to increase sales and customer engagement [43].
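A minimal, illustrative sketch of one common recommendation approach, item-based collaborative filtering with cosine similarity, is shown below. The ratings matrix is hypothetical, and this is only one of many techniques that the platforms cited in [43] may combine.

```python
import numpy as np

# Hypothetical user-item ratings matrix (rows = users, columns = items, 0 = not rated)
ratings = np.array([
    [5, 4, 0, 1, 0],
    [4, 0, 0, 1, 1],
    [1, 1, 0, 5, 4],
    [0, 1, 5, 4, 0],
], dtype=float)

# Item-item cosine similarity computed from the rating columns
norms = np.linalg.norm(ratings, axis=0, keepdims=True)
norms[norms == 0] = 1.0                      # avoid division by zero for unrated items
normalized = ratings / norms
item_sim = normalized.T @ normalized          # 5 x 5 similarity matrix

# Recommend for user 0: score unseen items by similarity-weighted ratings
user = ratings[0]
scores = item_sim @ user
scores[user > 0] = -np.inf                    # do not re-recommend items already rated
print("Recommended item index for user 0:", int(np.argmax(scores)))
```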
Predictive Analytics: Marketers employ predictive analytics to forecast future
trends, consumer demands, and market fluctuations. Machine learning models
analyze historical data to provide actionable insights for product launches, pricing
strategies, and campaign planning [44].
Customer Segmentation: Machine learning algorithms segment customers based
on their characteristics and behaviors. This segmentation allows businesses to
tailor marketing campaigns to specific groups, delivering more relevant content
and increasing conversion rates. Companies like Airbnb and Spotify successfully
implement customer segmentation strategies [45].
Some of the opportunities of using AI algorithms in this domain are:
• Enhanced customer engagement and personalization.
• Improved campaign effectiveness.
• Real-time insights into customer behavior.
• Cost-effective marketing strategies.
Some of the challenges of using AI algorithms in this domain are:
• Privacy concerns and data regulations.
• Complexity of analyzing vast datasets.
• Integration of AI into marketing workflows.
Some of the benefits of using AI algorithms in this domain are:
• Increased sales and customer loyalty.
• Better decision-making based on data-driven insights.
• Targeted marketing strategies.
• Improved return on investment (ROI).
Machine learning algorithms have revolutionized marketing and advertising by
offering personalized recommendations, predictive analytics, and customer segmen-
tation. Marketers can now engage customers more effectively, make data-driven
decisions, and optimize their advertising strategies.
6.3 Healthcare

In recent years, healthcare has been at the forefront of AI innovation, utilizing
machine learning algorithms to enhance patient care. With the enormous amount
of patient data generated daily, AI technologies have become invaluable for health-
care providers, researchers, and policymakers. Some of the examples of using AI
algorithms in this domain are:
Disease Diagnosis and Prediction: ML methods are applied to patient data, such as past medical records, genetic information, and symptoms, to aid in disease diagnosis and prediction. For instance, deep learning models applied
to medical imaging have enabled the early detection of conditions like diabetic
retinopathy, cancer, and heart diseases [46]. These algorithms provide swift and
accurate assessments, improving patient outcomes and reducing the chances of
misdiagnosis.
Drug Discovery: The drug development process has been significantly expedited by ML methods, which are used to predict the structure of compounds and their potential for therapeutic use [47]. These models aid pharmaceutical companies in identifying potential
drug candidates faster and at a lower cost. This has led to the discovery of new
drugs and the repurposing of existing ones for different medical conditions, thereby
revolutionizing healthcare treatment options.
Personalized Treatment Plans: Machine learning algorithms can analyze a
patient’s genetic and clinical data to create personalized treatment plans. In cancer
care, for example, this allows oncologists to tailor chemotherapy regimens based on
an individual’s unique genetic makeup [48]. Such precision medicine minimizes side
effects and improves the effectiveness of treatments, providing better patient care.
Some of the opportunities of using AI algorithms in this domain are:
• Enhanced diagnostic accuracy.
• Faster drug discovery and development.
• Personalized treatment options.
• Improved patient outcomes.
Some of the challenges of using AI algorithms in this domain are:
• Data privacy and security concerns.
• Regulatory compliance and ethical dilemmas.
• Integration with existing healthcare systems.
• Shortage of skilled professionals.
Some of the benefits of using AI algorithms in this domain are:
• Timely and accurate disease detection.
• Faster drug development.
• Improved patient care and outcomes.
• Reduction in healthcare costs.
AI and machine learning algorithms in healthcare offer promising opportunities,
though challenges related to data privacy, regulations, and integration with existing
systems must be addressed. The benefits, including enhanced patient care and cost
reductions, make this field a driving force for the adoption of AI technologies in the
healthcare industry.

6.4 Finance

The finance sector has witnessed significant transformations due to the incorporation
of machine learning algorithms and AI technologies. In this sub-section, we delve
into how these technologies are revolutionizing the financial industry. Some of the
examples of using AI algorithms in this domain are:
Algorithmic Trading: Machine learning models are employed for algorithmic
trading, where they analyze historical data and market trends to make real-time
trading decisions. These algorithms have the capability to make split-second trades
and respond to market changes more effectively than human traders. As a result, they
increase trading efficiency, reduce errors, and optimize portfolio performance [49].
Credit Scoring and Risk Assessment: Financial institutions leverage machine
learning algorithms for credit scoring and risk assessment. These algorithms evaluate
a borrower’s creditworthiness by analyzing various factors, such as credit history,
income, and debt. They provide more accurate risk assessments, enabling lenders to
make informed lending decisions and offer loans to a wider range of customers [50].
Fraud Detection: Machine learning plays a pivotal role in fraud detection. It
continually monitors financial transactions for anomalies and unusual patterns. For
instance, it can identify potentially fraudulent credit card transactions in real time
and send alerts to both customers and financial institutions. This not only safeguards
individuals from unauthorized transactions but also helps financial organizations
minimize financial losses due to fraud [51].
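One common way to implement this kind of anomaly-based monitoring is an isolation forest trained on transaction features. The sketch below is illustrative only (the transaction features and the 1% contamination rate are assumptions) and is not the specific approach referenced in [51].

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical transaction features: amount (currency units) and hour of day
rng = np.random.default_rng(7)
normal_tx = np.column_stack([rng.gamma(2.0, 30.0, 5000), rng.normal(14, 4, 5000) % 24])
fraud_tx = np.column_stack([rng.gamma(8.0, 150.0, 50), rng.normal(3, 1, 50) % 24])
X = np.vstack([normal_tx, fraud_tx])

# Train on transactions assumed to be mostly legitimate; flag outliers as potential fraud
detector = IsolationForest(contamination=0.01, random_state=0).fit(X)
flags = detector.predict(X)          # -1 = anomaly, +1 = normal
print("Transactions flagged for review:", int((flags == -1).sum()))
```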
Some of the opportunities of using AI algorithms in this domain are:
• Automated trading and portfolio optimization.
• Enhanced risk management.
• Improved fraud detection and prevention.
• Efficient customer service with chatbots.
Some of the challenges of using AI algorithms in this domain are:
• Data privacy and security in handling financial data.
• Regulatory compliance and risk associated with automated trading.
• Developing robust fraud detection models.
Some of the benefits of using AI algorithms in this domain are:
• Enhanced trading efficiency and returns.
• More inclusive credit assessment.
• Reduced financial losses due to fraud.
• Improved customer experience.
Machine learning algorithms have the potential to make the financial sector more
efficient and secure. While challenges like data privacy and compliance need to be
addressed, the benefits include increased profitability, reduced risks, and improved
customer services in the finance industry.

6.5 Agriculture

The agriculture sector has seen remarkable transformations through the integration of
machine learning and AI. This sub-section delves into the applications, opportunities,
challenges, and benefits of AI in agriculture. Some of the examples of using AI
algorithms in this domain are:
Crop Disease Detection: Machine learning models are employed to detect early
signs of crop diseases by analyzing images of leaves or plants. For example, deep learning is used to diagnose plant diseases, helping farmers take preventive measures and reduce crop loss [52].
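A minimal sketch of such an image-based classifier is shown below, using a small convolutional network in TensorFlow/Keras. The random stand-in images, the three disease classes, and the network depth are illustrative assumptions; a real system would be trained on labelled field photographs as in [52].

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Stand-in data: random 128x128 RGB "leaf" images with 3 hypothetical disease classes.
# In practice these arrays would be replaced by labelled photographs of crop leaves.
num_classes = 3
X = np.random.rand(120, 128, 128, 3).astype("float32")
y = np.random.randint(0, num_classes, size=120)

# A small CNN: convolution + pooling blocks followed by a dense classifier head
model = models.Sequential([
    layers.Input(shape=(128, 128, 3)),
    layers.Conv2D(16, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=16, validation_split=0.2, verbose=0)
model.summary()
```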
Soil Health Assessment: AI-driven systems assess soil quality based on various parameters, such as pH level and nutrient content. This information aids in opti-
mizing fertilizer application and irrigation to enhance crop yields while minimizing
environmental impact [53].
Precision Irrigation: Machine learning algorithms process weather data, soil
moisture levels, and crop requirements to optimize irrigation systems. This results
in water conservation, reduced operational costs, and increased crop productivity.
Companies like CropX provide such solutions [54].
Some of the opportunities of using AI algorithms in this domain are:
• Increased crop yields and quality.
• Enhanced pest and disease management.
• Improved resource management.
• Sustainable agriculture practices.
Some of the challenges of using AI algorithms in this domain are:
• Access to technology in remote areas.
• Data security and privacy concerns.
• Initial investment costs.
• Adaptation to local conditions.
Some of the benefits of using AI algorithms in this domain are:
• Greater food production to meet growing demands.
• Reduction in chemical and water usage.
• Enhanced sustainability and resource conservation.
• Economic benefits for farmers.
Agriculture has been revolutionized by machine learning, with applications
ranging from crop disease detection and soil health assessment to precision irri-
gation. These AI-driven solutions contribute to sustainable and efficient farming
practices.
These case studies exemplify the transformative impact of AI in diverse domains.
The possibilities extend far beyond these examples, with AI-driven solutions contin-
uously reshaping industries such as healthcare, finance, manufacturing, marketing,
and more. As AI technologies advance, we can expect innovative applications in
fields we have yet to explore fully, offering new opportunities and improvements
across various domains.

7 Potential Research Directions

In this section, we explore promising research directions that emanate from the
evolution and growing importance of data analytics, big data, and machine learning
in various domains. These directions encapsulate the future landscape of AI and its
potential impact.
Explainable AI (XAI): As AI systems become increasingly integrated into
everyday life, the demand for transparency and trustworthiness is paramount. XAI
has emerged as a prominent research direction in recent times. XAI aims to develop
AI systems that provide interpretable, human-understandable justifications for their
decisions and predictions [55]. Researchers focus on creating models and tech-
niques that can reveal the inner workings of complex AI systems, enabling users
to comprehend their outputs, thus fostering trust and adoption in critical domains
like healthcare, finance, and autonomous vehicles.
Ethical AI and Bias Mitigation: The rapid integration of AI technologies raises
concerns about fairness, ethics, and bias. Research in ethical AI and bias mitigation
is imperative [56]. The goal is to develop algorithms, guidelines, and best practices to
ensure AI systems do not reinforce or introduce harmful biases. Scholars are actively
investigating techniques for detecting, measuring, and mitigating bias in AI models
and data, addressing challenges in domains ranging from hiring to criminal justice.
Federated Learning: Privacy and data security are fundamental concerns in the
digital age. Federated Learning, an emerging paradigm, is garnering attention. It
enables AI models to be trained across decentralized devices or servers, ensuring that
sensitive data remains on the user’s device and only model updates are shared [57].
This approach shows promise in healthcare, where patient data privacy is paramount,
and in other industries with stringent data protection regulations.
Interdisciplinary AI: AI is increasingly intersecting with various other scien-
tific domains, creating new opportunities. Interdisciplinary AI research explores
the confluence of AI with fields like biology, chemistry, and material science.
Researchers aim to solve domain-specific problems using AI techniques, such as
drug discovery through deep learning models.
Quantum Machine Learning: Quantum computing, which promises exponential speedups for certain intricate problems, has paved the way for the emergence of quantum
machine learning. Researchers are exploring how quantum algorithms can revolu-
tionize AI by addressing computational challenges, such as optimization and data
analysis.
AI in Sustainability: As global concerns about climate change and environmental
sustainability grow, AI has a role to play. Researchers are exploring AI’s potential
to optimize resource usage, manage ecosystems, and predict natural disasters. AI
models can aid in renewable energy optimization, pollution control, and climate
modeling, contributing to a sustainable future [58]. Efficiency in large language
models is a pressing concern. Optimizing training and deployment methods to reduce
resource consumption, such as energy and computation, while maintaining perfor-
mance, is pivotal. Eco-friendly AI practices are vital to mitigate environmental impact
and enhance sustainable AI development.
Multilingual Models: Multilingual models are an emerging research domain with
vast implications. Developing models that can seamlessly comprehend and produce
content across various languages is a valuable pursuit. These models facilitate global
communication, language translation, and cross-cultural understanding, contributing
to the democratization of AI across linguistic boundaries [59].
These research directions encompass the evolving landscape of AI, reflecting the
dynamic interaction between technology and society. While this section offers a
glimpse into the future of AI research, the possibilities are boundless and promise to
shape the world in remarkable ways.

8 Conclusion

In conclusion, this chapter has taken us on a journey through the remarkable landscape
of data analytics, big data, and machine learning, illustrating their pivotal roles in
the digital era. The profound impact of these technologies across various domains
is highlighted, emphasizing their transformative potential and influence on our data-
driven future.
The chapter commenced by acknowledging the data revolution, characterized
by the deluge of big data from diverse sources, including manufacturing, banking,
social media, e-commerce, and healthcare records. Subsequently, the different types
of data were elucidated to highlight the importance of understanding the data type in
deciding the AI algorithm for analysis. The fundamentals of various AI algorithms
are then described based on the data type which needs to be analyzed. These sections
elaborate on the importance of determining the appropriate combination of the data
and algorithm. Furthermore, the application of AI in various industries is discussed.
For each application, the opportunities, challenges, and benefits are described. This
highlights the importance of using AI for these applications while providing the
improvements needed to achieve higher accuracy and efficiency. Lastly, key areas of focus for future research, based on current gaps and challenges, are highlighted.
The impact of these technologies transcends boundaries, fundamentally reshaping
various industries. As we reflect on the chapter’s content, data analytics, big data, and
ML are redefining the way we interact with and harness information. The potential for
innovation, efficiency, and informed decision-making is limitless, and their influence
will continue to expand, leaving no domain untouched.
In a world where data is hailed as the new currency, these technologies are indis-
pensable tools for tackling complex challenges, driving innovation, and forging the
path toward a data-driven future. The insights and knowledge derived from this
chapter will serve as a compass, guiding us through the intricate terrain of data
analytics, big data, and machine learning, as we explore, adapt, and exploit their full
potential in the ever-evolving digital age.

References

1. Sarker, I.H.: Machine learning: algorithms, real-world applications and research directions. Sn
Comput. Sci. 2, 160 (2021). https://doi.org/10.1007/s42979-021-00592-x
2. Cao, L.: Data science: a comprehensive overview. ACM Comput. Surv. (CSUR) 50(3), 43
(2017). https://doi.org/10.1145/3076253
3. Sarker, I.H.: Ai-driven cybersecurity: an overview, security intelligence modeling and research
directions. SN Comput. Sci. (2021). https://doi.org/10.1007/s42979-021-00557-0
4. Sarker, I.H.: Deep cybersecurity: a comprehensive overview from neural network and deep
learning perspective. SN Comput. Sci. (2021)
5. Lepenioti, K., Bousdekis, A., Apostolou, D., Mentzas, G.: Prescriptive analytics: literature
review and research challenges. Int. J. Inf. Manage. 50, 57–70 (2020)
6. Mohammed, M., Khan, M.B., Bashier Mohammed, B.E.: Machine Learning: Algorithms and
Applications. CRC Press (2016)
7. Aiken, E., Bellue, S., Karlan, D., et al.: Machine learning and phone data can improve targeting
of humanitarian aid. Nature 603, 864–870 (2022). https://doi.org/10.1038/s41586-022-04484-9
8. Sureja, N., Mehta, K., Shah, V., Patel, G.: Machine learning in wearable healthcare devices.
In: Joshi, N., Kushvaha, V., Madhushri, P. (eds.) Machine Learning for Advanced Functional
Materials. Springer, Singapore (2023). https://doi.org/10.1007/978-981-99-0393-1_13
9. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques.
Morgan Kaufmann (2005)
10. Andina, D., Voulodimos, A., Doulamis, N., Doulamis, A., Protopapadakis, E.: Deep learning
for computer vision: a brief review. Comput. Intell. Neurosci. 2018, 7068349 (2018). https://doi.org/10.1155/2018/7068349
11. Lalor, J., Wu, H., Yu, H.: Improving Machine Learning Ability with Fine-Tuning (2017)
12. Ślusarczyk, B.: Industry 4.0: are we ready? Polish J. Manag. Stud. 17 (2018)
13. Mostajabi, F., Safaei, A.A., Sahafi, A.: A systematic review of data models for the big data
problem. IEEE Access 9, 128889–128904 (2021). https://doi.org/10.1109/ACCESS.2021.3112880
14. Praveen, S., Chandra, U.: Influence of structured, semi- structured, unstructured data on various
data models. Int. J. Sci. Eng. Res. 8, 67–69 (2020)
15. Saravanan, R., Sujatha, P.: A state of art techniques on machine learning algorithms: a perspec-
tive of supervised learning approaches in data classification. In 2018 Second international
conference on intelligent computing and control systems (ICICCS), pp. 945–949. IEEE (2018,
June)
16. Han, J., Pei, J., Kamber, M.: Data mining: concepts and techniques. Elsevier, Amsterdam
(2011)
17. Kaelbling, L.P., Littman, M.L., Moore, A.W.: Reinforcement learning: a survey. Journal of
Artificial Intelligence Research 4, 237–285 (1996)
18. Xuanxuan, Z.: Multivariate linear regression analysis on online image study for IoT. Cogn.
Syst. Res. 52, 312–316 (2018)
19. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc.: Ser. B
(Methodol.) 58(1), 267–288 (1996)
20. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. Journal of the
Royal Statistical Society: Series B (Statistical Methodology) 67(2), 301–320 (2005)
21. Hoerl, A.E., Kennard, R.W.: Ridge regression: Biased estimation for nonorthogonal problems.
Technometrics 12(1), 55–67 (1970)
22. Agresti, A.: An Introduction to Categorical Data Analysis. John Wiley & Sons (2018)
23. Altman, N.S.: An introduction to Kernel and nearest-neighbor nonparametric regression. Am.
Stat. 46(3), 175–185 (1992)
24. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
25. Schölkopf, B., Smola, A.J., Müller, K.-R.: Nonlinear component analysis as a Kernel eigenvalue
problem. Neural Comput. 10(5), 1299–1319 (1997)
26. Breiman, L.: Classification and Regression Trees. Routledge (2017)
27. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
28. Dietterich, T.G.: Ensemble methods in machine learning. In: Multiple classifier systems, pp. 1–
15. Springer (2000).
29. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document
recognition. Proc. IEEE 86(11), 2278–2324 (1998)
30. Nanga, S., Bawah, A., Acquaye, B., Billa, M., Baeta, F., Odai, N., Obeng, S., Nsiah, A.:
Review of dimension reduction methods. Journal of Data Analysis and Information Processing
9, 189–231 (2021). https://doi.org/10.4236/jdaip.2021.93013
31. Berisha, V., Krantsevich, C., Hahn, P.R., et al.: Digital medicine and the curse of dimensionality.
npj Digit. Med. 4, 153 (2021). https://doi.org/10.1038/s41746-021-00521-5
32. Pearson, K.: On lines and planes of closest fit to systems of points in space. Phil. Mag. 2(11),
559–572 (1901)
33. Laurens van der Maaten’s homepage. Retrieved from https://lvdmaaten.github.io/tsne/(n.d.)
34. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Comput. Surv. (CSUR)
31(3), 264–323 (1999)
35. Arthur, D., Vassilvitskii, S.: K-means++: The advantages of careful seeding. In Proceedings of
the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 1027–1035 (2007)
36. Kaufman, L., Rousseeuw, P.J.: Finding GROUPS in Data: An Introduction to Cluster Analysis.
John Wiley & Sons (1990)
37. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. The MIT Press (2018)
38. Watkins, C., Dayan, P.: Technical note: Q-Learning. Mach. Learn. 8, 279–292 (1992). https://doi.org/10.1007/BF00992698
39. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., … & Hassabis,
D. Human-level control through deep reinforcement learning. Nature. 518(7540), 529–533
(2015)
40. Paolanti, M., Romeo, L., Felicetti, A., Mancini, A., Frontoni, E., Loncarski, J.: “Machine
Learning approach for Predictive Maintenance in Industry 4.0,” 2018 14th IEEE/ASME Inter-
national Conference on Mechatronic and Embedded Systems and Applications (MESA), Oulu,
Finland, pp. 1–6. (2018). doi: https://doi.org/10.1109/MESA.2018.8449150
41. Suzuki, Y., Iwashita, S., Sato, T., Yonemichi, H., Moki, H., Moriya, T.: “Machine Learning
Approaches for Process Optimization,” 2018 International Symposium on Semiconductor
Manufacturing (ISSM), Tokyo, Japan, 2018, pp. 1–4. https://doi.org/10.1109/ISSM.2018.8651142
42. Peres, R.S., Barata, J., Leitao, P., Garcia, G.: Multistage quality control using machine learning
in the automotive industry. IEEE Access 7, 79908–79916 (2019). https://doi.org/10.1109/ACCESS.2019.2923405
Real-World Applications of Data Analytics, Big Data, and Machine … 263

43. Verhoef, P.C., Neslin, S.A., Vroomen, B.: Multichannel customer management: understanding
the research-shopper phenomenon. Int. J. Res. Mark. 24(2), 129–148 (2007)
44. Fader, P.S., Hardie, B.G.S.: Customer-base valuation in a contractual setting: the perils of
ignoring heterogeneity. Mark. Sci. 24(1), 66–79 (2005)
45. Lewis, K., Reiley, D.H.: Online ads and offline sales: measuring the effects of retail advertising
via a controlled experiment on Yahoo. Econ. J. 124(576), 419–443 (2014)
46. Esteva, A., Kuprel, B., Novoa, R.A., Ko, J., Swetter, S.M., Blau, H.M., Thrun, S.:
Dermatologist-level classification of skin cancer with deep neural networks. Nature 542(7639),
115–118 (2017)
47. Stokes, J.M., Yang, K., Swanson, K., Jin, W., Cubillos-Ruiz, A., Donghia, N.M., … & Church,
G.M.: A deep learning approach to antibiotic discovery. Cell. 180(4), 688–702 (2020)
48. Schork, N.J.: Artificial intelligence and personalized medicine. Precision Medicine in Cancer Therapy, 265–283 (2019)
49. Sen, J., Sen, R., Dutta, A.: Introductory chapter: machine learning in finance-emerging trends
and challenges. Algorithms, Models and Applications, 1 (2021)
50. Romanyuk, K.: Game theoretic approach for applying artificial intelligence in the credit
industry. In 2018 Fifth HCT Information Technology Trends (ITT), pp. 1–6. IEEE (2018,
November)
51. Amarasinghe, T., Aponso, A., Krishnarajah, N.: Critical analysis of machine learning based
approaches for fraud detection in financial transactions. In Proceedings of the 2018 International
Conference on Machine Learning Technologies, pp. 12–17 (2018, May)
52. Mohanty, S.P., Hughes, D.P., Salathé, M.: Using deep learning for image-based plant disease
detection. Front. Plant Sci. 7, 1419 (2016)
53. Minasny, B., McBratney, A.B., Malone, B.P.: Digital soil assessment. In Digital Soil
Assessments and Beyond, pp. 1–24. Springer (2016)
54. Matese, A., Toscano, P., Di Gennaro, S.F., Genesio, L., Vaccari, F.P., Primicerio, J., … &
Zaldei, A.: Intercomparison of UAV, aircraft and satellite remote sensing platforms for precision
agriculture. Remote Sensing, 7(3), 2971–2990 (2015)
55. Doshi-Velez, F., Kim, B.: Towards a rigorous science of interpretable machine learning. arXiv
preprint arXiv:1702.08608(2017)
56. Diakopoulos, N.: Accountability in algorithmic decision making: a framework and key
questions. Data Discrim. Collect. Essays 2(2017), 10 (2016)
57. McMahan, H.B., Ramage, D., Talwar, K. Zhang, L., Zhu, M.: Communication-efficient learning
of deep networks from decentralized data. arXiv preprint arXiv:1602.05629(2017)
58. Biesialska, M., Biesialska, K., Costa-Jussa, M.R.: Continual lifelong learning in natural
language processing: A survey. arXiv preprint arXiv:2012.09823 (2020)
59. Conneau, A., et al.: Unsupervised Cross-lingual Representation Learning at Scale. arXiv:1911.02116 (2019)
Unlocking Insights: Exploring Data
Analytics and AI Tool Performance
Across Industries

Hitesh Mohapatra and Soumya Ranjan Mishra

Abstract AI Tool, a powerful large language model (LLM), is designed to
create human-like responses in natural language conversations, leveraging extensive
training on internet data for information, engagement, task assistance, and creative
insights. AI Tool’s core is a transformer neural network, renowned for capturing
text’s long-range dependencies. With 175 billion parameters, it’s among the most
extensive LLMs to date. This research endeavors to present a holistic perspective
on the responses generated by AI Tool in a variety of industrial sectors, all while
aligning with data analytic principles. To ensure the reliability of its responses,
the study engaged human experts in respective fields to cross-verify the outcomes.
Furthermore, to gauge the performance of AI Tool, the study meticulously consid-
ered specific parameters and conducted a thorough evaluation. The findings of this
research serve the research community and other users, by offering insights into the
applications and interaction patterns of AI Tool within the context of data analytics.
The results affirm that AI Tool is capable of producing human-like responses that
are both informative and engaging, all within the framework of data analytics.
However, it’s crucial to acknowledge that AI Tool may occasionally produce inaccu-
rate or nonsensical answers. Consequently, a critical evaluation of AI Tool’s infor-
mation, coupled with verification from reliable sources, when necessary, is imper-
ative. Despite these considerations, this study underscores AI Tool’s potential as a
promising tool for natural language processing, with applications spanning a wide
array of fields, particularly when integrated with data analytic concepts.

H. Mohapatra (B) · S. R. Mishra
School of Computer Engineering, KIIT (Deemed to Be) University, Bhubaneswar,
Odisha 751024, India
e-mail: hiteshmahapatra.fcs@kiit.ac.in
S. R. Mishra
e-mail: soumyaranjan.mishrafcs@kiit.ac.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 265
P. Singh et al. (eds.), Data Analytics and Machine Learning, Studies in Big Data 145,
https://doi.org/10.1007/978-981-97-0448-4_13

1 Introduction

AI Tool is an advanced conversational AI model developed by OpenAI. It is designed
to engage in natural language conversations with users, offering responses that are
coherent and contextually relevant. AI Tool builds upon the success of previous iter-
ations, such as GPT-2 and GPT-3, incorporating improvements in training method-
ologies and model architecture. The model utilizes a transformer neural network,
which is a deep learning architecture that excels at processing sequential data, such
as text. ChatGPT can support the data analysis process through several methods, such as sentiment analysis, SQL code generation, prediction, and recommendation.
Transformers allow AI Tool to capture long-range dependencies and understand
the context of a conversation, making it capable of generating human-like responses.
AI Tool has been trained on a massive dataset of text and code, which allows it
to have a broad understanding of language and knowledge. This allows AI Tool to
respond to a wide range of prompts and questions, including those that are open-
ended, challenging, or strange. Its training on numerous datasets also helps in visualizing data insights and in expressing relationships among datasets through different forms of graphs and charts. AI Tool is still under
development, but it has the potential to be a powerful tool for a variety of applications,
such as customer service, education, and research [1]. Table 1 illustrates the evolution
of AI chatbots with their properties.
AI Tool has been trained on a massive corpus of text from the internet, encom-
passing a wide range of topics and domains. This training data enables the model
to have a broad understanding of language, facts, and cultural knowledge. However,
it is important to note that AI Tool does not possess real-time information and its
knowledge is based on data available up until September 2021 [2]. OpenAI has made
efforts to ensure that AI Tool exhibits responsible behavior by minimizing biased
and offensive outputs. During training, the model is fine-tuned and guided using
a combination of human reviewers and reinforcement learning algorithms to align
its responses with desired ethical standards. The integration of machine learning
(ML) algorithms helps to predict the behavior of data more accurately, to optimize existing predictive models, and to explore novel associations within datasets. Despite these efforts, the model may occasionally produce incorrect,
nonsensical, or biased responses [3].
AI Tool is a large language model (LLM) that can be used for a variety of purposes,
including answering questions, providing explanations, assisting with tasks, gener-
ating ideas, and engaging in creative writing [4]. It has found applications in customer
support, content generation, language learning, brainstorming, and more. OpenAI has
made AI Tool accessible through various interfaces, including web-based platforms
and API services. This allows developers and users to interact with the model and
integrate it into their own applications or services [5]. The ChatGPT can also recom-
mend the best data analytic tools that suit to the dataset. It suggests data quality
issues with model-building approaches. Some popular data visualization tools that
work well with ChatGPT include: Tableau, Power BI, and Qlik Sense.

Table 1 ChatGPT versions and comparison

Version | Uses | Architecture | Parameter count | Year
GPT-1 | General | 12-layer, 12-headed Transformer decoder (without encoder), followed by linear-softmax; trained on BookCorpus, a dataset of 4.5 GB of text | 117 million | 2018
GPT-2 | General | Similar to GPT-1, but with adjusted normalization techniques; trained on the WebText dataset of 40 GB of text | 1.5 billion | 2019
GPT-3 | General | An extension of GPT-2, incorporating alterations to enable greater scalability; trained on a dataset of 570 GB of plaintext | 175 billion | 2020
InstructGPT | Conversation | GPT-3 fine-tuned through a human feedback model to enhance its ability to comprehend and follow instructions | 175 billion | 2022
ProtGPT2 | Protein sequences | Modeled similar to GPT-2 large (36 layers), utilizing protein sequences sourced from UniRef50, totaling 44.88 million sequences | 738 million | 2022
BioGPT | Biomedical content | Following the framework of GPT-2 medium (24 layers, 16 heads), incorporating non-empty items extracted from a PubMed dataset, totaling 1.5 million | 347 million | 2022
ChatGPT | Dialogue | Built upon GPT-3.5 and refined through a combination of supervised learning and reinforcement learning with input from human feedback (RLHF) | 175 billion | 2022
GPT-4 | General | Trained through a dual approach involving text prediction and RLHF; capable of accepting both textual and image inputs, including third-party data | 100 trillion | 2023

Popular data preparation tools that work well with ChatGPT include Alteryx, Data Wrangler, OpenRefine, Pandas, and DuckDB; popular machine learning tools include scikit-learn, TensorFlow, PyTorch, Apache Spark MLlib, and H2O.ai; and popular statistical analysis tools include SPSS, SAS, R, Python, and Julia. In addition to these specific tools, ChatGPT can also be used with a variety of other data analytics tools, including data warehouses, data lakes, data streaming platforms, data governance tools, and data integration tools.

1.1 Working of AI Tool

The transformer model consists of layers of self-attention mechanisms and feed-


forward neural networks, enabling it to capture complex patterns and dependencies
in language. Here’s a simplified overview of how AI Tool works, as illustrated in Fig. 1.
The first step is tokenization where the input text is divided into tokens, which
can be as small as individual characters or as large as whole words. Each token is
assigned a numerical representation. The second step is encoding where the tokens
are passed through an initial embedding layer, where they are transformed into high-
dimensional vectors that capture their semantic meaning. The third step is self-
attention where the encoded tokens are then processed through multiple layers of
self-attention mechanisms. Self-attention allows the model to weigh the importance
of each token based on its relationship with other tokens in the input sequence.
This helps the model understand the context and dependencies within the text. The
fourth step is feed-forward networks where after the self-attention layers, the output
is passed through a series of feed-forward neural networks. These networks apply
non-linear transformations to the token representations, further capturing complex
patterns in the data.
During the decoding phase the final output of the feed-forward networks is passed
through a decoding layer, which maps the representations back to the vocabulary
space. This allows the model to generate the next token or predict the most likely token
given the context. Finally, during response generation, AI Tool takes a user’s input and generates a response based on the learned
patterns and dependencies in the training data. The model generates tokens one by
one, taking into account the preceding context and user input. This process continues
until an appropriate response is generated or a maximum response length is reached.
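To make this tokenize-encode-generate loop concrete, the following minimal sketch uses the open GPT-2 model through the Hugging Face transformers library as a stand-in, since the chapter's AI Tool itself is not publicly downloadable; the prompt and decoding settings are illustrative assumptions rather than details of the original system.

# Sketch only: GPT-2, an earlier model in the same family, stands in for AI Tool.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")      # step 1: tokenization rules
model = AutoModelForCausalLM.from_pretrained("gpt2")   # transformer decoder

prompt = "Data analytics helps organizations"
inputs = tokenizer(prompt, return_tensors="pt")        # text -> numerical token IDs

# Autoregressive generation: the model predicts one token at a time,
# conditioning on the prompt plus everything generated so far.
output_ids = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))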
Training AI Tool involves a two-step process. Such as:
1. Pre-training: The model is trained on a large corpus of text from the internet. By
predicting the next token in a sentence, the model learns to understand language
and capture various patterns and concepts. This pre-training allows AI Tool to
acquire a broad knowledge of language and facts.
2. Fine-tuning: After pre-training, the model is fine-tuned using a more specific
dataset, which includes demonstrations and comparisons by human reviewers.

Fig. 1 Architecture of AI Tool

OpenAI provides guidelines to the reviewers to ensure the model’s responses


align with desired behavior and ethical standards. This fine-tuning helps shape
the model’s behavior and allows it to produce more appropriate and controlled
responses.
The main idea behind the transformer is to capture the relationships between
different words or tokens in a sentence, allowing for better context understanding and
generation. The transformer model consists of two main components: the encoder and
the decoder. In the context of a chatbot, the encoder processes the input message and
the conversation history, while the decoder generates the response. The encoder and
decoder are composed of multiple layers, with each layer containing sub-modules.
Let’s focus on one layer to understand the core workings of the Transformer.
Self-Attention Mechanism: Self-attention is a mechanism that allows the model to
weigh the importance of each word/token in the input sequence. For each word/token,
self-attention computes its “attention scores” with respect to all other words/tokens
in the sequence. These attention scores determine how much focus the model should
place on each word/token during processing. The attention scores are computed
using three learned matrices: Query, Key, and Value. These are multiplied together
to produce the attention scores. Multi-Head Attention: To capture different types of
dependencies and improve performance, self-attention is applied in parallel multiple
times, known as “heads.” Each attention head learns different relationships between
words/tokens, allowing the model to attend to various aspects of the input. The
outputs of all attention heads are concatenated and linearly transformed to retain
relevant information.
Position-wise Feed-Forward Networks: After self-attention, the output is passed
through a position-wise feed-forward network. This network consists of two linear
layers with a non-linear activation function in between, allowing the model to
transform and combine information across positions.
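The mechanism just outlined can be written compactly. The sketch below is a simplified, self-contained illustration of single-head scaled dot-product self-attention with toy dimensions and randomly initialized projection matrices; it is not the production implementation behind AI Tool, and multi-head attention simply repeats this computation in parallel with separate projections.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Project token embeddings into query, key, and value spaces
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Scaled dot-product scores: how strongly each token attends to every other token
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted mixture of the value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))                 # toy "encoded tokens"
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)              # -> (4, 8)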

Residual Connections and Layer Normalization: To address the challenge of


vanishing gradients, residual connections are added, allowing the model to retain
information from previous layers. Layer normalization is applied after each sub-
module, ensuring stable gradients during training. The encoder processes the input
message and conversation history by stacking multiple layers of self-attention and
feed-forward networks. The decoder, on the other hand, also incorporates an addi-
tional attention mechanism that attends over the encoder’s output to capture relevant
context information. During training, the model is optimized to generate coherent
and contextually appropriate responses using techniques such as maximum likeli-
hood estimation. The parameters of the model are learned by minimizing the discrep-
ancy between the model’s generated responses and the ground truth responses in the
training data.
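As a small illustration of this objective, the snippet below, a toy example with made-up logits and target token IDs rather than the actual training code, computes the next-token cross-entropy loss that maximum likelihood training minimizes.

import torch
import torch.nn.functional as F

vocab_size, seq_len = 10, 4
logits = torch.randn(seq_len, vocab_size)   # model scores for the next token at each position
targets = torch.tensor([2, 5, 1, 7])        # ground-truth next-token IDs (illustrative)
loss = F.cross_entropy(logits, targets)     # average negative log-likelihood over positions
print(loss.item())                          # training updates parameters to reduce this value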

1.2 Working of AI Tools in Association with Data Analytics Tools

ChatGPT, a large language model, can be effectively integrated with various data
analytics tools to enhance data analysis processes and extract valuable insights.
Its natural language processing capabilities enable it to understand and interpret
data from diverse sources, making it a versatile tool for data exploration, summa-
rization, and pattern recognition. ChatGPT can seamlessly collaborate with data
visualization tools to generate insightful charts, graphs, and dashboards, while also
working with data preparation tools to clean, format, and transform data for analysis.
Additionally, ChatGPT can be employed to augment machine learning models by
generating training data, tuning hyper-parameters, and interpreting results. Statis-
tical analysis tools can also benefit from ChatGPT’s ability to summarize statistical
findings, generate reports, and perform hypothesis tests. By leveraging ChatGPT’s
strengths in conjunction with these data analytics tools, data analysts can streamline
their workflows, gain deeper understanding of data, and make informed decisions.
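One possible realization of this pairing is sketched below: a pandas statistical profile of a dataset is sent to a chat model, which returns a plain-language summary. The use of the openai Python client, the model name, and the file path are illustrative assumptions rather than part of the chapter's experiments.

import os
import pandas as pd
from openai import OpenAI

df = pd.read_csv("sales.csv")                         # placeholder dataset
profile = df.describe(include="all").to_string()      # compact statistical profile

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
response = client.chat.completions.create(
    model="gpt-4o-mini",                              # placeholder model name
    messages=[
        {"role": "system", "content": "You are a data analysis assistant."},
        {"role": "user",
         "content": f"Summarize notable patterns and data quality issues:\n{profile}"},
    ],
)
print(response.choices[0].message.content)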
The paper’s structure is outlined below. Section 2 provides an in-depth exami-
nation of the related literature. In Sect. 3, we delve into our exploration using the
AI tool. Technical domains and specific applications are thoroughly discussed in
Sect. 4, while Sect. 5 similarly addresses business and administrative sectors along
with pertinent applications. Section 6 provides a concise overview of observations and performance evaluations, Sect. 7 presents a behavioral analysis of the AI tool, and the conclusion and references follow.

2 Related Work

AI Tool has gained significant attention worldwide and sparked discussions in


academia. Some individuals foresee potential disruptions, such as the decline of
assigned essays and widespread unemployment as machines assume writing tasks.
There are concerns regarding the prevalence of cheating, with worries that it may
become challenging or even impossible to detect. Moreover, fears persist that students
may become complacent, convinced that they can rely entirely on automated writing.
This article aims to provide a concise and accurate overview of AI Tool, including
its capabilities and limitations. Additionally, it explores methods for identifying
AI-driven cheating, as well as potential applications that could alleviate work-
load, enhance students’ writing skills, facilitate exam composition and grading, and
elevate the quality of research papers [6]. ML, a subset of AI, enables computers to
analyze and interpret data without explicit programming. As more data is fed into
the algorithm, it learns and improves, leading to better problem-solving. This paper
reviews the role of AI in marketing, examining its applications and transformations
across various marketing segments. Critical applications of AI for marketing are also
recognized and analyzed [7].
In this commentary, we delve into the subject of AI Tool and offer our insights
regarding its potential usefulness in systematic reviews (SRs). We assess the appro-
priateness and applicability of AI Tool’s responses to prompts related to SRs. The
rise of artificial intelligence (AI)-assisted technologies has prompted discussions on
their current capabilities, limitations, and opportunities for integration into scientific
endeavors. Large language models (LLMs), such as OpenAI’s AI Tool, have garnered
significant attention due to their ability to generate natural-sounding responses across
various prompts. SRs, which rely on secondary data and often demand substantial
time and financial resources, present an attractive domain for the development of AI-
assistive technologies. On February 6, 2023, the PICO Portal developers conducted a
webinar to explore how AI Tool responds to tasks associated with SR methodology.
Based on our exploration of AI Tool’s responses, we observe that while AI Tool
and LLMs show promise in assisting with SR-related tasks, the technology is still
in its early stages and requires significant further development for such applications
[8]. ChatGPT offers promising potential to enhance data science workflows, but
careful consideration of its capabilities and limitations is essential. Its proficiency
in natural language processing tasks, such as translation, sentiment analysis, and
text classification, can significantly improve productivity and accuracy. However,
fine-tuning is necessary for optimal performance in specific use cases, and interpre-
tation of ChatGPT’s output may pose challenges. Ultimately, the benefits of ChatGPT
outweigh the costs, making it a valuable tool for intelligence augmentation in data
science [9].
AI Tool can be applied across multiple disciplines of science. One study aimed to explore the application of AI to various tasks within the mechanical engineering domain and to draw conclusions through statistical analysis of the outcomes.
However, when utilizing AI Tool in several calculation examples, we discovered that

it generated incorrect results, erroneous formulas, and similar inaccuracies [10]. Firing appropriate queries can also help apply AI Tool to software development and modeling. AI Tool has demonstrated its capability to generate useful code snippets
that, on occasion, are correct and successfully accomplish the desired task specified
by the user. Moreover, AI Tool exhibits familiarity with various textual modeling
languages, including domain-specific languages (DSLs). Notably, the example of
Graph-GPT showcases the potential for language designers to instruct AI Tool on
the desired structure of a modeling language, resulting in the generation of code frag-
ments within that language. In the case of Graph-GPT, it employs a clever approach of
requesting a JSON-encoded representation of the graph, which can then be rendered
into a diagram. The possibilities for leveraging generative AI in modeling are vast,
and the exact ways in which it will transform the business and practices of modeling
in the future remain uncertain [11]. The uses of AI Tool can also be found in scientific
abstract writing [12] and in the health sector [13] too. Deep Learning has emerged as
a powerful tool for addressing the challenges posed by Big Data Analytics. Its ability
to extract complex patterns from massive volumes of data makes it suitable for tasks
such as semantic indexing, data tagging, and fast information retrieval [14]. In the
field of academic writing, AI Tool can be used in a revolutionary way [15]. Because AI Tool is new, many researchers are exploring it in several ways; one such approach is chatting with AI Tool, where the authors have shared their experience [16].
Though AI Tool is a powerful tool that can be used in many applications, there are several instances where faulty references can be found [17]. Since its launch
in 2022, AI Tool, a query-oriented language generation tool, has garnered significant
attention. Although the initial excitement may have waned, the impact of AI Tool
has sparked lasting structural changes. Notably, academic journals have published
papers with AI Tool listed as an author, while certain educational institutions have
opted to prohibit its use due to concerns about potential misuse. Criticisms of AI
Tool have primarily revolved around its inaccuracies, often labeling it as a “bullshit
generator.” Additionally, some have highlighted the undesirable consequences that
arise from its utilization, such as the potential to undermine creativity.
However, we contend that there is an unaddressed issue at hand—the funda-
mental ideas and politics that drive the development of these tools and facilitate their
uncritical adoption [18]. Businesses have long relied on analytics, but the focus is
shifting toward artificial intelligence (AI) capabilities. Many AI systems are built
upon statistical and analytical foundations. By leveraging their existing analytical
expertise, companies can gain a competitive edge in their AI endeavors [19].

3 Exploration with AI Tool for Data Analytics

The scope of AI tools is vast and extends across diverse sectors, revolutionizing the
way we approach challenges and opportunities. From healthcare and finance to manu-
facturing and entertainment, AI’s transformative capabilities have left an indelible
mark. In healthcare, AI aids in diagnosing diseases and personalizing treatment

plans, while in finance, it enhances fraud detection and market analysis. Industries
like manufacturing benefit from AI-driven automation for improved efficiency, and
the entertainment sector leverages AI to create immersive experiences. The ability
of AI tools to analyze vast datasets, recognize patterns, and make informed deci-
sions transcends boundaries, making them an indispensable asset in shaping the
present and future of countless sectors [20]. The study focuses on two sector classifications, technical and business administrative, and within each category five individual sectors are considered for response analysis.
Technical Sectors
• Medical and Health Care Sector
• Software Development
• Smart Agriculture
• Logistics and Supply
• Smart City Designing
Business Administrative Sectors
• Education and Academic
• Crime Monitoring
• Administrative Accounting
• Entertainment Industry
• Culture and Value Promotion

4 AI Tool Response in Technical Sectors

4.1 Medical and Health Sector

AI tools have a transformative role in the medical and health sector. They serve as
conversational interfaces for quick access to medical information, including research
papers, clinical guidelines, and drug details. Patient education is enhanced through
personalized health insights and answers to queries [4]. AI aids in symptom assess-
ment and initial guidance for seeking medical help. In telemedicine, AI integrates
for remote patient monitoring and virtual consultations, ensuring data gathering and
medication reminders. Analyzing electronic health records, AI organizes patient data
for efficient decision-making [5]. It supports mental health by offering coping strate-
gies and resources. Medical education benefits from AI simulations and feedback. In
research, AI extracts information, aids in data analysis, and facilitates patient recruit-
ment for clinical trials. The generated responses from AI Tool have been cross-verified with the opinions of experts from the medical and health sector. For the evaluation process we communicated with 34 doctors and 16 healthcare staff members. Figure 2a
illustrates the performance evaluation based on the parameters that are considered
in Table 2.

Fig. 2 Performance evaluation in several technical sectors by data analytics through AI tools: (a) health care sector, (b) software industry, (c) smart agriculture, (d) logistics and supply chain, (e) smart city designing and planning

Table 2 Overall score (OS) of performance evaluation based on all responses

Metric | Score
(A) Accuracy | 0.85
(R) Relevance | 0.82
(C) Coherence | 0.78
(G) Grammaticality | 0.88
(F) Fluency | 0.85

Integrating ChatGPT into healthcare facilitates insightful data analysis by


harnessing its natural language processing capabilities. This synergy enables the
extraction of invaluable insights from patient records, medical literature, and real-
time interactions. By parsing through vast amounts of data, ChatGPT aids in iden-
tifying trends, predicting potential health risks, and personalizing patient care path-
ways. Its ability to comprehend and process complex medical information empowers
healthcare providers to make informed decisions swiftly and efficiently. Moreover,
by enhancing diagnostic accuracy and treatment strategies, ChatGPT augments the
quality of healthcare delivery, ultimately leading to improved patient outcomes and
a more responsive healthcare system [21].

4.2 Software Development

AI Tool’s role in software development complements but doesn’t replace human


expertise. While it aids coding tasks, debugging, and documentation, human vali-
dation remains vital. AI generates code snippets, suggests functions, and completes
segments.
It assists in debugging by proposing solutions to errors. AI simplifies documenta-
tion creation, extracting insights from source code. In applications, it enables natural
language interactions. AI suggests code improvements and best practices. It acts
as a knowledge hub, explaining concepts and offering examples. Moreover, AI
enhances teamwork through project management assistance. This fusion empowers
software engineers to glean valuable insights, identify coding patterns, predict poten-
tial issues, and streamline development workflows. ChatGPT’s ability to comprehend
nuanced technical language enables it to provide tailored suggestions, troubleshoot
coding errors, and even generate code snippets, expediting the development process.
Through the amalgamation of data analytics and ChatGPT, software development
teams can optimize decision-making, boost productivity, and deliver more robust and
user-centric software solutions, thereby revolutionizing the industry’s standards for
innovation and quality.
Further, AI tool can aid in test case generation by analyzing requirements or speci-
fications and suggesting relevant test scenarios or edge cases. It can assist in ensuring
thorough test coverage and identifying potential issues. AI Tool can provide guid-
ance on version control systems like Git. It can assist developers in understanding

branching strategies, resolving merge conflicts, and recommending best practices for
collaboration and code management. AI Tool can offer guidance on setting up devel-
opment environments, configuring tools, and resolving environment-related issues.
It can assist developers in getting started with specific frameworks or platforms. The
generated response from AI Tool has been cross verified with opinions of developers
and testers. For the evaluation process we have communicated with 30 software
developers and 21 testing engineers. Figure 2b presents the performance evaluation
based on the parameters that are considered in Table 2.

4.3 Smart Agriculture

Leveraging AI Tool in smart agriculture boosts productivity, resource efficiency, and


sustainability. It offers real-time insights for crop management, utilizing weather, soil,
and crop data. AI detects pests and diseases early, suggesting solutions. By amalga-
mating ChatGPT’s natural language processing prowess with agricultural datasets,
a wealth of information encompassing weather patterns, soil health metrics, crop
behavior, and historical farming data can be meticulously analyzed.
This collaboration empowers farmers and agricultural experts to make data-driven
decisions, from optimizing irrigation schedules to predicting crop yields and disease
outbreaks. ChatGPT’s capacity to interpret and process complex agricultural data
facilitates personalized recommendations and real-time insights, enabling proactive
measures to enhance crop quality and maximize yields sustainably; the corresponding evaluation appears in Fig. 2c. This
amalgamation not only revolutionizes traditional farming methodologies but also
fosters more resilient, resource-efficient, and environmentally conscious agricultural
practices for a sustainable future.

4.4 Logistic and Supply Chain Management

While AI Tool aids logistic and supply chain management, human expertise remains
vital. AI serves as a virtual assistant, handling customer queries and providing real-
time support. It integrates with systems for personalized order tracking. AI analyzes
data for inventory, demand, and production, offering supply chain optimization
suggestions. However, human oversight is essential for complex situations and crit-
ical decisions based on AI recommendations. AI enhances efficiency but should be
employed in tandem with human judgment. AI Tool streamlines supplier manage-
ment, handling routine inquiries and supplier performance insights. It identifies new
suppliers and aids communication. AI analyzes supply chain data, predicting demand
changes and evaluating external risks. It suggests contingency plans, enhancing
decision-making. AI acts as a training tool, simulating scenarios for risk-free prac-
tice. It shares knowledge on best practices and emerging trends. AI enhances supplier
collaboration, risk assessment, and skill development in supply chain management.

The generated responses from AI Tool have been cross-verified with the opinions of business owners and the logistics departments of various suppliers. For the evaluation process we communicated with 13 business owners and 9 logistics suppliers. Figure 2d
presents the performance evaluation based on the parameters that are considered
in Table 2. This integration empowers logistics professionals to extract invaluable
insights, forecast demand, optimize routes, and mitigate potential disruptions in real
time. ChatGPT’s ability to comprehend intricate supply chain terminology facilitates
personalized recommendations, enabling agile decision-making and swift problem-
solving. By harnessing data analytics through ChatGPT, organizations can streamline
operations, minimize costs, enhance customer satisfaction, and create resilient supply
chains capable of adapting to dynamic market conditions, ultimately redefining the
benchmarks for efficiency and competitiveness in the logistics industry.

4.5 Smart City Designing and Planning

AI Tool enhances smart city planning with data-driven insights and citizen engage-
ment. It aids decision-making, creating livable urban environments. AI acts as a
virtual assistant, gathering citizen input and feedback.
It analyzes intricate urban data for trends and visualizations, aiding informed city
planning. Through data analytics powered by ChatGPT, cities can evolve into adap-
tive, efficient, and citizen-centric environments, fostering innovation and resilience
for the urban landscape of the future. AI Tool contributes to urban planning by
suggesting designs based on population, transport, green spaces, and energy. It opti-
mizes land use and connectivity while integrating sustainability. AI analyzes real-
time traffic data, optimizing flow and reducing congestion. It recommends traffic
management systems and efficient routes. AI aids in energy management strategies
for smart cities. AI Tool enhances urban planning by analyzing energy patterns,
suggesting efficient technologies, and integrating renewables. It optimizes energy
distribution, monitors air quality, and manages waste. AI aids in emergency plan-
ning, analyzing data for disaster preparedness and resource allocation. It fosters
stakeholder collaboration, acting as a knowledge base. AI integrates sustainability
and green initiatives, promoting eco-friendly tech and recycling programs. It evalu-
ates policy impact on smart cities, simulating effects on transportation, energy, and
services to inform decisions. The generated response from AI Tool has been cross
verified with the opinions of city planners and civil engineers. For the evaluation
process we have communicated with 4 city planners and 7 civil engineers. Figure 2e
presents the performance evaluation based on the parameters that are considered in
Table 2.

5 AI Tool and Data Analytic Responses in Business Administrative Sectors

5.1 Education and Academic Paper Writing

Maintaining a balanced approach to AI Tool integration in education is crucial,


emphasizing human guidance for fostering critical thinking [3]. AI acts as a virtual
tutor, delivering personalized guidance and learning resources. It aids academic
writing with grammar suggestions and feedback. AI assists in research by retrieving
articles, summarizing papers, and detecting plagiarism.
It supports language learners through practice exercises and explanations. Addi-
tionally, AI aids researchers by generating initial drafts and structuring papers. AI
Tool can assist in the peer review process by analyzing submitted manuscripts, iden-
tifying potential issues, and offering constructive feedback. It can help reviewers
focus on important aspects such as clarity, methodology, and validity of research. AI
Tool can aid in conducting literature reviews by extracting relevant information from
academic articles, summarizing key findings, and organizing references. It can save
time for researchers in the initial stages of their literature review process. AI Tool can
provide information and resources on academic integrity and ethical writing prac-
tices. It can educate students about citation rules, paraphrasing techniques, and the
importance of avoiding plagiarism. The generated response from AI Tool has been
cross verified with the opinions of professors and students. For the evaluation process
we have communicated with 30 professors and 60 students. Figure 3a presents the
performance evaluation based on the parameters that are considered in Table 2.

5.2 Crime Monitoring

AI supports law enforcement but doesn’t replace human decision-making. It inter-


faces for incident reporting, enhancing public communication. Analyzing crime
data, AI identifies patterns and trends for resource allocation. It predicts crime
probabilities, enabling proactive measures by law enforcement in high-risk areas.
AI Tool assists in suspect identification using witness descriptions or images. It
generates composite sketches and matches from databases. It aids investigators
with data retrieval from public records, social media, and databases for background
checks and connections. AI educates the public on safety and crime prevention. It
analyzes OSINT data for threat detection and crime activities. AI translates languages
for communication and analyzes text data for threat identification. The generated
response from AI Tool has been cross-verified with the opinions of people from the judicial system.
For the evaluation process we have communicated with 9 police officers and 11
lawyers. Figure 3b presents the performance evaluation based on the parameters that are considered in Table 2.

Fig. 3 Performance evaluation in several administrative sectors by data analytics through AI tools: (a) academic sector, (b) crime monitoring, (c) administrative section, (d) entertainment industry, (e) culture and value promotion

The fusion of data analytics and ChatGPT in crime data analysis represents a groundbreaking approach to understanding, predicting,
and addressing criminal activities. By utilizing ChatGPT's natural language


processing capabilities alongside comprehensive crime datasets encompassing inci-
dent reports, demographics, and historical crime trends, law enforcement agencies
can gain profound insights. This collaboration facilitates the extraction of action-
able intelligence, enabling authorities to identify crime hotspots, predict potential
criminal behavior, and allocate resources effectively.

5.3 Administrative Actions

Leveraging AI Tool streamlines administrative actions, allowing focus on strategic


tasks where balancing automation and human touch is the key. AI serves as virtual
support, addressing inquiries and simple issues 24/7 [22]. It schedules appointments
and manages calendars efficiently. AI retrieves data for administrators, aiding quick
decision-making. AI Tool simplifies form filling and document completion. It assists
new employee on-boarding, sharing policies, and guidance. AI offers information on
policies, procedures, and compliance. It analyzes data, generates user-friendly reports
for informed decisions. AI acts as a virtual trainer, aiding ongoing learning. It auto-
mates routine tasks, integrates with software systems for efficiency. AI sends noti-
fications and reminders, ensuring effective communication. The generated response

from AI Tool has been cross verified with opinions of 34 clerical staff of administra-
tive section at KIIT University. Figure 3c presents the performance evaluation based
on the parameters that are considered in Table 2.

5.4 Entertainment Industry

AI Tool empowers the entertainment industry with innovative content creation and
interactive experiences. It generates scripts, dialogues, and characters, fostering
creativity. AI enhances storytelling, character development, and narrative explo-
ration. It crafts interactive virtual characters, enabling immersive experiences across
various platforms. AI Tool transforms entertainment with personalized recommen-
dations, interactive storytelling, and virtual characters. It analyzes preferences for
suggestions, creating immersive experiences. AI shapes narratives, responds to
choices, and offers personalized storylines. It crafts virtual assistants for celebs or
characters, deepening audience connections. In video games, AI enhances dialogues
and character depth. It engages fans on social media, maintaining interactive pres-
ence. AI elevates live events with real-time engagement and interactive elements. It
generates synthetic voices for various roles. AI sparks fan engagement by simulating
conversations and discussing fictional worlds, fostering creativity and community.
The generated response from AI Tool has been cross verified with opinions of content
creators on YouTube. For the evaluation process we have communicated with 34
content creators on YouTube. Figure 3d presents the performance evaluation based
on the parameters that are considered in Table 2.
In the realm of entertainment, the amalgamation of data analytics and ChatGPT
heralds a paradigm shift in understanding audience preferences, content trends,
and optimizing creative strategies. Moreover, data analytics powered by ChatGPT
assists in deciphering trends, forecasting market demands, and optimizing content
creation, leading to more tailored, diverse, and engaging entertainment experiences
for audiences globally. This fusion reshapes the entertainment landscape, fostering
innovation and enriching the connection between creators and consumers in an
ever-evolving industry.

5.5 Culture and Value Promotion

AI Tool advances cultural understanding and inclusivity. It educates about diverse


cultures, traditions, and languages. It acts as a virtual guide, offering insights into
history, art, and practices. AI aids language learning, facilitating cross-cultural
communication. It fosters empathy and appreciation for global cultures, contributing
to a more culturally rich society. AI Tool enables virtual cultural exchange,
connecting individuals globally. It discusses art forms, recommends artists, and
sparks creativity. AI engages in ethical discussions, fostering dialogue on values

and morals. It simulates cultural exhibitions, providing historical context and inter-
activity. AI preserves heritage by organizing cultural information digitally. It fosters
intercultural dialogue, artistic appreciation, and ethical reflection, enriching global
connections and cultural understanding. AI Tool aids cultural tourism, suggesting
landmarks and events. It fosters discussions on values and decisions, enhancing
values-based choices. AI supports social impact campaigns for diversity and inclu-
sivity, interacting with users to raise awareness and encourage actions. It promotes
cultural engagement, values-based decisions, and positive impact, contributing to a
more informed and culturally aware society. The generated response from AI Tool
has been cross verified with opinions of eleven humanities and social science profes-
sors of KIIT University. Figure 3e presents the performance evaluation based on the
parameters that are considered in Table 2.

6 Observation on AI Tool Behavior

AI Tool has become a buzzword, and in the last two sections we have witnessed its implications for various sectors of society. Though it is a powerful tool for the modern digital world, it has both positive and negative effects.

6.1 Benefits of Using AI Tool with Data Analytics

These benefits demonstrate the potential of AI Tool to enhance communication,


streamline processes, improve user experiences, and provide valuable support across
various domains. However, it is essential to consider and address potential limitations,
ethical considerations, and user expectations to effectively harness these benefits.
Integrating AI tools with data analytics offers a multitude of benefits across industries.
These advanced tools leverage complex algorithms to process extensive datasets,
extracting invaluable insights that might evade traditional analytics methods.
Additionally, AI-driven analytics minimize errors and biases, ensuring accuracy
and reliability in outcomes. Their scalability effortlessly handles large data volumes,
maintaining effectiveness even as data grows.
• Quick and convenient communication: AI Tool can provide instant responses to
user queries, eliminating the need for waiting or queuing. This can improve customer
satisfaction and reduce the time it takes to resolve issues.
• Round-the-clock availability: AI Tool can operate 24/7, providing support and
information at any time. This can be especially helpful for businesses that operate
in multiple time zones or that have a global customer base.
• Cost-effectiveness: AI Tool can be cost-effective compared to maintaining a large
customer support team or hiring additional staff. Once deployed, it can handle

multiple conversations simultaneously, reducing the need for human resources


and lowering operational costs.
• Scalability: AI Tool can handle a high volume of conversations simultaneously,
making it highly scalable. As the user base grows, the system can accommodate
increased demand without significant infrastructure or resource investments.
• Consistency and accuracy: AI Tool can provide consistent and accurate responses
based on the training it has received. This can avoid human errors and inconsisten-
cies that may arise from manual interactions, ensuring a high level of reliability
and accuracy.
• Multilingual support: AI Tool can support multiple languages, enabling commu-
nication with users from diverse linguistic backgrounds. This can be especially
helpful for businesses that operate in a global marketplace.
• Interactive and dynamic conversations: AI Tool can engage users in interactive
and dynamic conversations, providing a personalized and tailored experience.
This can lead to increased user engagement and satisfaction.
• Knowledge repository: AI Tool can act as a knowledge repository, storing and
retrieving information on a wide range of topics. This can help users to find the
information they need quickly and easily.
• Continuous learning: AI Tool can continuously learn from user interactions,
improving its responses and performance over time. This can lead to a better
overall user experience.
• Augmentation of human capabilities: AI Tool can augment human capabilities by
assisting with information retrieval, decision-making, and task automation. This
can free up human operators to focus on more complex or specialized tasks.

6.2 Side-Effects of Blind Use of AI Tool-Based Data Analytical Outputs

While AI Tool and similar language models offer many benefits, there are also potential negative impacts on society; here, we list some of the primary ones. To mitigate these negative impacts, responsible development, transparent practices, ongoing research, and regulation are important. Ethical guidelines, bias detection and correction mechanisms, as well as user education on the limitations and potential biases of AI systems, can help address these concerns and promote the responsible use of AI Tool and similar technologies.
• Misinformation and propaganda: AI Tool can generate text based on the input
it receives, including false or misleading information. If used irresponsibly or
without proper oversight, it can contribute to the spread of misinformation,
conspiracy theories, or propaganda, which can undermine trust, create confusion,
and harm society.
• Bias: AI Tool learns from the data it’s trained on, which can include biases present
in the training data. If the training data contains biases related to race, gender,

or other sensitive topics, AI Tool may inadvertently perpetuate and amplify these
biases in its generated responses, leading to unfair or discriminatory outcomes.
• Accuracy and reliability: AI Tool operates as a tool, and its outputs are not in-
dependently verified or fact-checked. This lack of accountability raises concerns
about the accuracy and reliability of the information it provides, potentially leading
users to make decisions based on flawed or misleading advice.
• Ethical exploitation: AI Tool can be exploited for unethical purposes, such as
generating malicious content, engaging in harmful behaviors, or deceiving indi-
viduals. This raises concerns about privacy, security, and the potential for abuse
by malicious actors.
AI Tool can be used for social engineering or manipulating individuals by gener-
ating persuasive or deceptive messages. Malicious actors could exploit this tech-
nology to deceive or exploit unsuspecting users for personal gain or malicious
purposes. Interacting with AI systems like AI Tool may have psychological effects on
individuals. For some users, relying heavily on AI-generated responses for emotional
support or guidance could lead to a sense of detachment, isolation, or a reduction in
critical thinking skills. AI Tool interacts with users and collects data during conver-
sations, raising privacy concerns. Depending on the use and storage of this data,
there is a risk of misuse, unauthorized access, or breaches that compromise user
privacy and security. The widespread adoption and benefits of AI Tool may not be
accessible to all individuals due to factors such as cost, infrastructure limitations, or
digital literacy. This can contribute to a technological divide, exacerbating existing
inequalities in society.

6.3 Numerical Performance Evaluation of AI Tool-Based Data Analytical Responses

Analyzing the performance of AI Tool responses helps assess its effectiveness and identify areas for improvement; by employing the approaches described here, one can understand its strengths and limitations and make informed decisions to improve overall performance and user satisfaction. Figure 4 illustrates the performance analysis based on the collected dataset. We analyzed the data using the pandas library in a Python environment.
The proposed metric includes the following parameters: accuracy (A), rele-
vance (R), coherence (C), grammaticality (G), and fluency (F). This metric can be
used to assess how well the generated responses align with the desired outcomes
and user expectations. We have compared the human ratings with system-generated
responses to gauge the model’s performance and identify areas for refinement. The
graph shows the performance of different AI tool responses for different domains.
The red line represents the accuracy of the responses, while the blue line repre-
sents the relevance of the responses. The green line represents the coherence of the responses, the orange line represents the grammaticality of the responses, and the purple line represents the fluency of the responses.

Fig. 4 AI Tool integrated data analytical-based performance evaluation metric
The graph shows that the accuracy and relevance of AI tool responses vary
depending on the domain. For example, AI tool responses are more accurate and
relevant in the health domain than in the entertainment domain. This is likely because
there is more structured data available in the health domain, which makes it easier for
AI tools to learn and generate accurate and relevant responses. The graph also shows
that the coherence, grammaticality, and fluency of AI tool responses are generally
good across all domains. However, there is some variation, with the best scores in the
academic domain and the worst scores in the crime domain. This is likely because
the academic domain requires more complex and nuanced language, while the crime
domain requires more factual and objective language.
Overall, the graph shows that AI tool responses are generally performing well
across a variety of domains. However, there is still room for improvement, particularly
in terms of accuracy and relevance in some domains.
Each metric is assigned a score ranging from 0 to 1, where a higher score indicates
better performance. These scores represent the average assessment of AI Tool’s
responses based on the evaluation process conducted. The mathematical expression
to calculate the overall score (OS) is represented in Eq. (1). Table 2 illustrates the
performance evaluation of AI Tool responses by using (Eq. 1).

OS = (A + R + C + G + F) / 5 (1)

The complexity of AI Tool responses has been calculated using two primary parameters: ‘L’, the average sentence length in words, and ‘W’, the average word length in characters. The complexity ‘C’ is given by Eq. (2).

C = L×W (2)
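As a worked example, the snippet below evaluates Eqs. (1) and (2) with pandas; it is a sketch that reuses the Table 2 scores, while the values of L and W are illustrative assumptions, since the chapter does not report them.

import pandas as pd

scores = pd.Series({"accuracy": 0.85, "relevance": 0.82, "coherence": 0.78,
                    "grammaticality": 0.88, "fluency": 0.85})

overall_score = scores.mean()           # Eq. (1): OS = (A + R + C + G + F) / 5
print(f"OS = {overall_score:.3f}")      # -> OS = 0.836

L = 18.0    # assumed average sentence length in words
W = 5.2     # assumed average word length in characters
complexity = L * W                      # Eq. (2): C = L x W
print(f"C = {complexity:.1f}")          # -> C = 93.6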

Table 2 shows the overall performance evaluation of the AI tool for all responses
from all considered sectors. The table shows the metric accuracy, relevance, coher-
ence, grammaticality, and fluency scores for all responses from all considered sectors.
The accuracy score measures how well the AI tool’s responses match the ground truth.
The relevance score measures how well the AI tool’s responses are relevant to the
query. The coherence score measures how well the AI tool’s responses are structured
and easy to understand. The grammaticality score measures how well the AI tool’s
responses follow the rules of grammar. The fluency score measures how natural and
easy to read the AI tool’s responses are. The overall performance of the AI tool is good, with an average score of approximately 0.84. The accuracy score is high, at 0.85, meaning the AI tool’s responses generally match the ground truth. The relevance score is also good, at 0.82, meaning the responses are generally relevant to the query. Grammaticality (0.88) and fluency (0.85) are strong, while coherence is slightly lower at 0.78, indicating that the structure of the responses has the most room for improvement. Overall, the AI tool is performing well and is generating responses that are accurate, relevant, coherent, grammatical, and fluent.

7 Behavioral Performance Evaluation of AI Tool Responses

Behavioral analysis of AI tool responses plays a crucial role in understanding how


these systems interact with users based on the types of queries they receive. The
varying nature of user queries can elicit diverse responses, and observing these
patterns is essential for refining AI models. For instance, simple informational queries
often yield concise and accurate responses, showcasing the model’s ability to provide
factual information. However, as queries become more complex or ambiguous, AI
tools may struggle to maintain coherence or generate plausible answers, indicating the
need for improvements in context understanding and reasoning capabilities. Table 3
illustrates the AI Tool responses based on input types.
Behavioral analysis also involves assessing how AI systems handle emotion-
ally charged queries, where empathy and sensitivity are crucial. Ethical consider-
ations come into play when monitoring responses to potentially harmful queries,
ensuring that the tool does not propagate misinformation, violence, or discrimina-
tory content. Through behavioral analysis, developers can continuously fine-tune

Table 3 ChatGPT responses based on input types

Input type | Example input | Sample response
Question | “What is the capital of France?” | “The capital of France is Paris.”
Instruction | “Please provide a step-by-step guide for baking a cake.” | “Sure, here’s a step-by-step guide to bake a cake …”
Conversation continuation | User: “How’s the weather?” Assistant: “It’s sunny with a slight breeze.” | “The weather is quite pleasant today.”
Prompt | “Write a short story about a haunted house.” | “In a quiet village, stood an old, abandoned house …”
Command | “Translate ‘hello’ to French.” | “The translation of ‘hello’ in French is ‘bonjour’.”
Clarification request | “Can you provide more details about your project?” | “Sure, I’d be happy to provide more details …”

AI tools, enhance their performance, and align them with user expectations while
upholding ethical standards in the responses provided. In the sub-sections we have
analyzed the performance of the AI tool on different types of queries from the user end. Table 4 compares the responses based on the nature of the queries.

Table 4 Comparison of Chatbot responses on different factors

Factor | Comparison | Observation
Response to repeated queries | Consistent responses | AI tool consistently provides the same response to the same query
Synonymical queries | Variability in responses | Responses can vary based on synonyms used in the queries
Meaningless queries | Varies from error messages to generic replies | AI tool may provide error messages or generic responses for such queries
Different accounts/computers | Consistent responses | AI tool provides consistent responses irrespective of the source
Randomly typed characters | Unpredictable responses | Responses are often unrelated or nonsensical due to random input
Vulgar queries | Varies from error messages to rejection | Responses can range from error messages to rejecting inappropriate queries

8 Conclusion

Data analytics with AI tools represents a transformative approach to processing and


interpreting vast amounts of data. By leveraging advanced algorithms and machine
learning techniques, AI tools can extract valuable insights, patterns, and correlations
from complex datasets that traditional analytics might overlook. These tools excel
in predictive capabilities, enabling businesses to forecast trends, optimize decision-
making, and drive proactive strategies.
Moreover, AI-driven analytics streamline processes, automate tasks, and enhance
efficiency, leading to cost savings and improved resource utilization. However, the
blind reliance on AI outputs poses risks such as bias amplification, lack of inter-
pretability, and ethical implications. To harness the full potential of data analytics with
AI tools, it’s crucial to maintain human oversight, address biases in datasets, ensure
transparency in AI processes, and continuously evaluate and refine the models for
responsible and impactful deployment. The proposed study presents a critical analysis of AI tool responses in the context of data analytics across various sectors of society and explores the possibilities and patterns of using AI tools for data analytics. These insights will help advance the Industry 4.0 revolution.

References

1. Elavarasan, R.M., Pugazhendhi, R., Irfan, M., Mihet-Popa, L., Khan, I.A., Campana, P.E.:
State-of-the-art sustainable approaches for deeper decarbonization in Europe—an endowment
to climate neutral vision. Renew. Sustain. Energy Rev. 159 (2022)
2. Davenport, T.H.: From analytics to artificial intelligence. J. Bus. Anal. 1(2), 73–80 (2018).
https://doi.org/10.1080/2573234X.2018.1543535
3. Sánchez-Ruiz, L.M., Moll-López, S., Nuñez-Pérez, A., Moraño-Fernández, J.A., Vega-Fleitas,
E.: ChatGPT challenges blended learning methodologies in engineering education: a case study
in mathematics. Appl. Sci. 13(10), 6039 (2023). https://doi.org/10.3390/app13106039
4. Javaid, M., Haleem, A., Singh, R.P.: Chatgpt for healthcare services: an emerging stage for an
innovative perspective. Bench Council Trans. Benchmarks, Stand. Eval. 3(1), 100105 (2023)
5. Frederico, G.F.: Chatgpt in supply chains: initial evidence of applications and potential research
agenda. Logistics. 7(2) (2023)
6. Mohapatra, H.: Socio-technical challenges in the implementation of smart city. International
Conference on Innovation and Intelligence for Informatics, Computing, and Technologies
(3ICT), pp. 57–62 (2021)
7. Zhang, X., Shah, J., Han, M.: Chatgpt for fast learning of positive energy district (ped): a trial
testing and comparison with expert discussion results. Buildings. 13(6) (2023)
8. Gao, Y., Tong, W., Wu, E.Q., Chen, W., Zhu, G.Y., Wang, F.Y.: Chat with chatgpt on interactive
engines for intelligent driving. IEEE Trans. Intell. Veh. 8(3), 2034–2036 (2023)
9. Du, H., Teng, S., Chen, H., Ma, J., Wang, X., Gou, C., Li, B., Ma, S., Miao, Q., Na, X.,
Ye, P., Zhang, H., Luo, G., Wang, F.Y.: Chat with chatgpt on intelligent vehicles: an ieee tiv
perspective. IEEE Trans. Intell. Veh. 8(3), 2020–2026 (2023)
10. Prieto, S.A., Mengiste, E.T., Garćıa de Soto, B.: Investigating the use of chatgpt for the
scheduling of construction projects. Buildings, 13(4) (2023)

11. Shoufan, A.: Exploring students’ perceptions of chatgpt: thematic analysis and follow- up
survey. IEEE Access, 11, 38805–38818 (2023)
12. Lo, C.K.: What is the impact of chatgpt on education? a rapid review of the literature. Educ.
Sci. 13(4) (2023)
13. Castellanos-Gomez, A.: Good practices for scientific article writing with chatgpt and other
artificial intelligence language models. Nanomanufacturing. 3(2), 135–138 (2023)
14. Rahman, M.M., Watanobe, Y.: Chatgpt for education and research: opportunities, threats, and
strategies. Appl. Sci. 13(9) (2023)
15. Wang, F.Y., Yang, J., Wang, X., Li, J., Han, Q.L.: Chat with chatgpt on industry 5.0: learning and
decision-making for intelligent industries. IEEE/CAA J. Autom. Sin. 10(4), 831–834 (2023)
16. Abdullah, M., Madain, A., Jararweh, Y.: Chatgpt: fundamentals, applications and social
impacts. Ninth International Conference on Social Networks Analysis, Management and Secu-
rity (SNAMS), pp. 1–8 (2022); Sharma, P., Dash, B.: Impact of big data analytics and chatgpt
on cybersecurity. 4th International Conference on Computing and Communication Systems
(I3CS), pp. 1–6 (2023)
17. Feng, Y., Poralla, P., Dash, S., Li, K., Desai, V., Qiu, M.: The impact of chatgpt on streaming
media: a crowdsourced and data-driven analysis using twitter and reddit. In 2023 IEEE 9th Intl
Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High
Performance and Smart Computing, (HPSC) and IEEE Intl Conference on Intelligent Data and
Security (IDS), pp. 222–227 (2023)
18. Grbic, D.V., Dujlovic, I.: Social engineering with chatgpt. In 2023 22nd International
Symposium INFOTEH-JAHORINA (INFOTEH), pp. 1–5 (2023)
19. Guo, C., Lu, Y., Dou, Y., Wang, F.Y.: Can chatgpt boost artistic creation: the need of imaginative
intelligence for parallel art. IEEE/CAA J. Autom. Sin. 10(4), 835–838 (2023)
20. Kshetri, N.: Chatgpt in developing economies. IT Prof. 25(2), 16–19 (2023)
21. Mohapatra, H.: Performance Evaluation of AI Bot Responses (2023)
Lung Nodule Segmentation Using
Machine Learning and Deep Learning
Techniques

Swati Chauhan, Nidhi Malik, and Rekha Vig

Abstract Global lung cancer mortality is rising, which underscores the need for early cancer screening. Lung nodule segmentation in CT images is a complicated task that affects medical research, surgical planning, and diagnostic decision support, all of which are complex problems with important applications. Both machines and human readers struggle to segment non-solitary nodules with uncertain boundaries, whereas solitary nodules, which have distinct boundaries, are easier to delineate. Several researchers have proposed CT-based lung evaluation algorithms, driven by growing imaging datasets and the need to define normal and diseased lung lobes quickly and precisely. Multi-stage lung segmentation methods that rely on manually tuned empirical parameters are common. Segmenting the lung slices first and then the nodules using ML and DL is essential for detecting cancer at various stages. Deep learning techniques have improved healthcare image analysis, and approaches such as ResNet-50/101, VGG16, autoencoders, U-Net with modifications, and graph convolutional networks have been used to classify lung nodules, COVID-19, and pneumonia. This chapter includes a summary of publicly available datasets that serve as the primary resources for scholars working in this area. With the information provided in this chapter, we hope to offer a direct look into the field of diagnosing lung disorders.

S. Chauhan · N. Malik (B)


The NorthCap University, Gurugram, Haryana, India
e-mail: nidhimalik@ncuindia.edu
S. Chauhan
e-mail: swati20csd010@ncuindia.edu
R. Vig
Amity University, Kolkata, West Bengal, India
e-mail: rekhas.vig@gmail.com


1 Introduction

1.1 Lung Cancer and Its Statistics

Lung cancer is the second most frequent kind of cancer in the world. Rapid division of
malignant cells, which can invade neighboring structures and spread to other organs,
is the hallmark of cancer [1]. On August 1, the World Health Organization observes
World Lung Cancer Day, an annual event designed to inform and inspire people all
over the world to take action against the disease by increasing funding for research
and spreading awareness. The most common way that cancer kills people is through
the spread of metastases throughout the body. In 2023, the American Cancer Society
anticipates there will be a total of 238,340 new cases of lung cancer in the US (117,550
in men and 120,790 in women) shown in Fig. 1. There were almost 127,070 fatalities
attributable to lung cancer (67,160 in males and 59,910 in women) [2]. According
to the findings of a study conducted in India, the presence of adenocarcinoma as a
subtype of lung cancer was significantly associated with the occurrence of indoor air
pollution including factors such as exposure to second-hand smoking and the type
of fuel used in food preparation [3]. An investigation conducted in India found that
the mean and median ages of lung cancer patients (N = 1301) were 58.6 years (standard deviation = 10.8) and 60 years (interquartile range: 51–65), respectively
[4]. The extremely high rate of tuberculosis (TB) infection in India contributes to an
increased risk of obtaining a false-positive result from LDCT testing [5]. 5.9% of all
new cases of cancer are due to lung cancer, and 8.1% of all deaths from cancer are
attributable to lung cancer [6]. Several of these nations, including Indonesia, Hong
Kong, India, Singapore, Malaysia, and the Philippines, have yet to establish a lung
cancer screening program that is supported by their respective governments at the
present time. Delays in cancer diagnosis and treatment were also observed during the
2019 coronavirus disease (COVID-19) pandemic due to hospital and clinic closures,
disruptions in work and health insurance, and a general climate of anxiety about being exposed to the virus [7].

Fig. 1 Statistics of lung cancer and deaths of males and females according to the American Cancer Society in the US (panels: total cases of lung cancer in 2023 and deaths in 2023, shown separately for men and women)

Fig. 2 Lung nodule detected in CT scan

Lung cancer, which is exacerbated by risk factors such as cigarette smoking, air pollution, and exposure to carcinogenic substances in work environments, is the leading cause of death from cancer worldwide [8]. The five-year survival rate for individuals diagnosed with advanced pulmonary cancer is approximately 16%. However, for patients diagnosed in the very early stages of the disease who receive good treatment, the 5-year survival rate can be improved by a factor of
4–5 [5]. In comparison to other types of imaging such as chest radiology, sputum
cytology, magnetic resonance imaging (MRI) scans, and positron emission tomog-
raphy, the utilization of CT scan as a diagnostic instrument for the prompt diag-
nosis and recognition of lung cancer is advantageous due to its affordability, rapid
visualization time and widespread accessibility [9, 10]. Machine learning and deep
learning methods for segmenting lung nodules are presented in this chapter. Since
identifying lobes also has vital uses in medical research, disease assessment, and
therapy planning, techniques for doing so were naturally incorporated into lung
nodule segmentation (Fig. 2).

2 Image (Lung Cancer) Segmentation

Image segmentation is the process of dividing a digital image into many segments for
the purpose of using the resulting information for efficient and accurate item recogni-
tion. Its purpose is to extract useful data from subsets sharing common characteristics,
such as pixel intensity values and colour data [11] shown in Fig. 3. Segmenting an
image yields either the assemblage of components together encompasses a series
of outlines that have been extracted from the visual representation or it yields the
contours themselves. Each pixel in a given space represents a single quality, such as
hue, saturation, or texture [12]. Medical image segmentation analyses images using
computer vision and segments 2D or 3D images of human organs, soft tissues, and
sick bodies in Fig. 2. It compares neighbouring pixels’ similarities or differences.
This technology considerably allows clinicians to do subjective or even statistical
analysis, which improves the accuracy and dependability of medical diagnoses, of
lesions and other locations of concern.

Fig. 3 Lung nodule detection segmentation and classification process

Segmentation of medical images can be conceptualized as obtaining a division of


the original image based on a set of similarity constraints Ci (i = 1, 2, 3 …) in image
area by Eq. (1) [13].


∪_{x=1}^{N} R_x = I,   R_x ∩ R_y = ∅,   ∀ x ≠ y,   x, y ∈ [1, N]        (1)

where R_x denotes the set of all pixels that satisfy the relevant similarity criteria for region x, and likewise for R_y; the indices x and y are used to differentiate between the different regions, and N is the number of regions after division, a positive integer of at least 2 [13].
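To make the partition property in Eq. (1) concrete, the following minimal sketch (our own illustrative example, not part of the original chapter; the toy 4 × 4 label map and the variable names are assumptions) checks that a label map splits an image domain I into disjoint regions whose union covers the whole image.

```python
import numpy as np

# Toy label map: each pixel stores the index of the region it belongs to,
# so regions R_1..R_N are disjoint by construction and together cover I.
labels = np.array([[1, 1, 2, 2],
                   [1, 1, 2, 2],
                   [3, 3, 2, 2],
                   [3, 3, 3, 3]])

regions = [labels == r for r in np.unique(labels)]   # one boolean mask per region R_x

# Union of all regions equals the full image domain I
union_covers_image = np.logical_or.reduce(regions).all()

# Pairwise intersections are empty: R_x ∩ R_y = ∅ for all x != y
pairwise_disjoint = all(
    not np.logical_and(regions[i], regions[j]).any()
    for i in range(len(regions)) for j in range(i + 1, len(regions))
)

print(union_covers_image, pairwise_disjoint)   # True True for a valid segmentation
```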
The procedure of segmenting medical images can be broken down into the following stages (a minimal code sketch of this workflow is given after the list):
Step 1: Acquire a database of medical imaging data. A common practice in machine learning for image processing is to split the dataset into a training set, a validation set, and a test set. The training set is used to teach the network how to make predictions, the validation set is used to tune the model's hyperparameters, and the test set is used to confirm the model's correctness.
Step 2: The images are pre-processed and augmented; a standard procedure is to randomly rotate and resize the input images in order to increase the size of the data collection.
Step 3: If a nodule is present, the medical image must be segmented using an appropriate medical image segmentation method before the segmented images can be exported.
Step 4: The effectiveness of the estimates is evaluated. Detection, segmentation, and classification are the three sub-operations of medical image segmentation verification, and each requires effective performance indicators and validation.
Since they do not necessitate hours of human labour, Artificial Intelligence (AI) algorithms are ideally suited to acquiring measurements and segmentations from medical images. Because the segmentation result for a given input image is consistent regardless of the observer, automatic analysis also eliminates inter- and intra-observer disparities. In recent years, deep learning-based technologies in particular have achieved some significant successes [13]. A segmentation method outputs either a label for each pixel or voxel in an image or a contour.

Fig. 4 Different types of segmentation techniques

Generally, there are three types of segmentation [14] as in Fig. 4.

i. Contour-Based segmentation
ii. Voxel-Based Segmentation
iii. Region-Based segmentation

i. A contour-based approach:
It looks for the structure-environment boundary. This is similar to how a radiology specialist separates an organ from other parts of the body by
drawing a line between the two. Snakes, the active contour model, is a classic
contour-based segmentation method. This approach begins with a contour near
the item to segment.
ii. A voxel-based approach:
It works like a classification method, since the algorithm checks each voxel to see whether it belongs to the structure to be segmented. Segments are produced once the algorithm has addressed every image voxel. The thresholding technique is an example of a straightforward voxel-based segmentation method: voxel intensity is compared to a threshold, and the algorithm marks voxels above the value as part of the structure and voxels below the value as “surroundings” (or vice versa for hypo-intense structures). A small thresholding and region-growing sketch follows this list.
iii. Region-based segmentation:
Region-based segmentation can involve segmenting a structure in one image (or group of images) and then transferring that segmentation to a second image. This requires a transformation, a precise description of how to map the first image onto the second; applying this transformation to the first segmentation yields the segmentation of the second image. Alternatively, regions are grown by iteratively adding pixels that are similar to, and connected with, a seed pixel. For areas with consistent grayscale, similarity measures such as grayscale differences are employed, and connectivity is used to separate distinct areas of the image.
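The voxel-based and region-based ideas above can be illustrated with a short sketch; the −400 HU threshold, the ±50 grayscale tolerance, and the 4-connectivity are assumed values chosen only for demonstration.

```python
import numpy as np
from collections import deque

def threshold_segment(volume, threshold=-400):
    # Voxel-based: every voxel brighter than the threshold is labelled as structure,
    # everything else as surroundings (reverse the comparison for hypo-intense targets).
    return volume > threshold

def region_grow(image, seed, tol=50):
    # Region-based growing: start from a seed pixel and repeatedly add 4-connected
    # neighbours whose grayscale value is within `tol` of the seed value.
    grown = np.zeros(image.shape, dtype=bool)
    seed_value = int(image[seed])
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        if grown[y, x]:
            continue
        grown[y, x] = True
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if (0 <= ny < image.shape[0] and 0 <= nx < image.shape[1]
                    and not grown[ny, nx]
                    and abs(int(image[ny, nx]) - seed_value) <= tol):
                queue.append((ny, nx))
    return grown
```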

Digital CT images enable the segmentation of lungs and lobes, detecting anatomic
boundaries and aberrant lung tissue based on pathological processes and diseases
[11]. Researchers have proposed both hand-crafted feature-based and deep learning-based lung segmentation algorithms. Unlike region-growing, active contour, and morphology-based models, deep neural network-based approaches can automatically learn representative features without empirical parameter adjustments [12].

3 Benefits of Lung Nodule Segmentation

By separating out specific regions of interest, medical image segmentation facilitates


more accurate analysis of anatomical data. In order to provide specialized therapies,
such as implant design, it is necessary to dissect structures, such as those found in
the hip or knee. In addition to isolating specific tissues like bone and soft tissue
from a scan, segmentation can also remove extraneous details like air. Researchers
and healthcare providers can build multiple segmented masks for additional study
by combining the software's many processing options [12]. The majority of cancer deaths are caused by lung cancer. To aid doctors in distinguishing cancerous from benign lesions, computer systems must perform lung nodule segmentation. It has been noted, however, that this can be a challenging task, especially in cases where nodules are attached to an adjacent anatomical structure [9].
The goal of this study is to collect and analyse the available experimental
information about the topics listed below:
• Different features that are involved in the process of segmenting and classifying
lung nodules.
• To identify the factors that affect the performance of techniques for segmenting
and classifying lung nodules.
• Utilization of a variety of ML and DL models throughout the initial stages of the
nodule segmentation process.
• Examining how well various models perform across a variety of segmentation
criteria in medical image processing and comparing their results.
• In addition, the assessment analyses the benefits and drawbacks of a variety of
techniques to segmentation, which assists in determining which strategies are the
most successful.
• In the realm of lung nodule segmentation, identifying current gaps is an important step towards shaping future research.

4 Literature Work

In the field of computer vision, various pixel-wise classification approaches based on


deep learning have recently been presented, with several finding practical application
in medical imaging. Thirty of the 152 research articles needed to accomplish the goal
are general in nature, both in terms of detection and classification, while the remaining
122 publications provide concrete examples of the application of deep learning to
fields as diverse as medicine, finance, and the arts.
Here is a rundown of some of the lung nodule segmentation-related research questions we have tackled as part of this chapter's evaluation:
• What are the most prominent trends in the recently published literature on artificial intelligence and machine learning techniques used to accomplish the segmentation task for lung nodules?
• What are the most important themes evolving in each of these works, and how do these themes relate to one another?
This chapter’s research has been completed as follows: first, it verifies the
search terms used to retrieve papers from the Scopus database that discuss the use
of AI/ML methods to provide computed tomography lung nodule segmentation/
detection and classification standards; second, it organizes the journal articles by
year and theme and shows the relationships between concepts; third, it provides a
summary of the findings from the articles; and fourth, it provides recommendations
for future research. The conclusions and recommendations for future investigation
are presented at the end. We reviewed 122 studies on lung nodule segmentation
using machine learning and deep learning methods, which were published in 37
peer-reviewed journals. Here we have discussed the top 50 research papers.

4.1 Methodology for Systematic LR

Scopus is a database that indexes articles from journals, conferences, and books in
the fields of computer science and engineering. This is especially helpful for the
advanced and smart system literature on lung nodule segmentation. To that end, we
used the Scopus database to gather the necessary research papers for our investi-
gation. We settled on the search criteria in order to facilitate the downloading of
the study papers: “Segmentation” AND “CTimages” AND “Nodule” AND “detec-
tion” OR “CNN” OR “DNN” OR “AI” OR “deep learning” OR “machine learn-
ing” OR “supervised learning”. The Boolean operators “AND” and “OR” were used to combine search terms and narrow the results, which guarantees that the relevant academic literature is retrieved via keywords. These criteria are applied to
“Article Title” and “Article Keywords” during the first download. The initial round
of searches yielded a total of 5,547 usable research papers.

Fig. 5 How the articles were chosen for the analysis, and how many (N) papers were included at each step

The subsequent phase retained only publications in academic journals, resulting in a total of 3,818 research papers, as shown in Fig. 5. The number of research papers obtained during the third
phase was reduced to 437, as the scope was limited to the issue area and rele-
vant articles only. During the fourth phase, a constraint was implemented to narrow
down the search to high-quality publications, based on the impact factor schema.
As a result, the number of papers was decreased to 346. In the fifth stage, relevant
research papers are added based on titles, keywords, and abstracts of the problem at
hand. After removing studies that were irrelevant to the purpose of the systematic
review, 152 research papers were identified that were suitable for analysing nodule
segmentation artificial intelligence or machine learning methodologies.

4.2 Findings

This part is divided even further into three subparts: statistics broken down by year
and journal, information broken down by theme, and information breaking down
the relationship between ideas. The research begins with a year-wise breakdown of the number of papers published and the publications included on lung nodule segmentation. The second component of the research shows that certain terms
appear often in both the titles and authors’ keywords of the articles, which may be
used to infer the overall topic of the research. The report provides examples of the
various domains in use in subsection three.

Fig. 6 Year-wise distribution of research articles as obtained on September 25, 2023

4.3 Year- and Journal-Specific Along with Theme-Wise Statistics

Figure 6 shows how the systematic review's research papers were distributed over time. A total of 152 research papers were studied for the literature survey. We then used the papers' keywords and titles to determine their overarching themes. As can be seen in Fig. 7, the aforementioned 152 articles span 36 publications and 10 distinct biomedical image-processing capabilities. Although we considered 122 research papers for the systematic literature review, we identified 35 as being of a generic character because they do not directly address reconstruction but rather use detection and classification techniques. Therefore, only about 50 studies were included
in our meta-analysis. Studies with titles like “computed tomography,” “Convolu-
tional neural network,” “Deep learning,” “Segmentation,” “Detection,” “Nodule,”
“Classification,” “CT images,” or “DNN” share certain commonalities.

4.4 Identification of Important Journals and Conferences

The quality of the papers and conferences was evaluated using several criteria. The
number of times a journal is cited, the size of its readership, the prestige of its
reviewers, its impact factor, and the calibre of its editorial board are only a few of
the factors taken into account in this evaluation. The chosen journals for this review
of Lung nodule segmentation are summarized in Table 1.
Fig. 7 Distribution of domains across the LDCT applications (application fields across journals: Identification, Segmentation, Classification, Diagnosis, Detection, Image Enhancement, Reconstruction, Prediction, Image Analysis, Feature Extraction, Quantification, Optimization, Simulation, Determination)

Table 1 List of top 10 journals out of 36 in the field of lung nodule segmentation

S. No | Name of journal | Published by | Impact factor
1 | IEEE Transactions on Neural Networks and Learning Systems | Institute of Electrical and Electronics Engineers Inc | 14.255
2 | Medical Image Analysis | Elsevier B.V | 13.828
3 | IEEE Transactions on Medical Imaging | Institute of Electrical and Electronics Engineers Inc | 10.6
4 | Artificial Intelligence Review | Springer Science and Business Media B.V | 9.588
5 | Computerized Medical Imaging and Graphics | Elsevier Ltd | 7.422
6 | Diagnostic and Interventional Radiology | Turkish Society of Radiology | 7.242
7 | Computer Methods and Programs in Biomedicine | Elsevier Ireland Ltd | 7.027
8 | IEEE Journal of Biomedical and Health Informatics | Institute of Electrical and Electronics Engineers Inc | 7.021
9 | Computers in Biology and Medicine | Elsevier Ltd | 6.698
10 | Cancers | MDPI | 6.639

5 Algorithms Used


5.1 Deep Learning-Based Techniques

Several deep learning-based pixel-wise categorization methods have been presented in computer vision, with several finding medical imaging applications. Lung nodules
must be accurately assessed to determine malignancy and lung cancer risk. GCNs
may classify nodules alongside CNNs [21]. Comprehensive and scalable nodule
segmentation for radiologists is being developed in several projects. Deep learning
and traditional image processing are being worked on. Recent tactics from each
category are briefly described here. It illuminates previous lung nodule classification
studies using different methods. CAD first classified nodules as metastatic, primary,
or benign [22]. Second, convolutional neural networks (CNN) are commonly used
to detect lung nodules. CNNs were used to histologically subtype 311 early-stage
non-small cell lung cancer (NSCLC) patients [23]. Improved residual convolutional

neural networks (IRCNNs) have also been studied with the hope of better classifying
lung nodules as benign or malignant [24].
Low-dose computed tomography (LDCT) has been proven to be 20% more effec-
tive than X-rays at reducing the specific mortality rate of lung cancer cells [25]. To
emphasize its potential for lung cancer identification, The United States Preventive
Services Task Force (USPSTF) recommends having LDCT tests performed annu-
ally monitoring for anyone with lung illness [26, 27]. Building on prior research,
this piece accurately locates lung nodules using LDCT images. On apparent diffu-
sion coefficient (ADC) MRI, five deep-learning networks were evaluated: multiple
resolution residually connected network (MRRN) that is regularized in training with
deep supervision implemented into the last convolutional block (MRRN-DS), Unet,
Unet++ , ResUnet, and fast panoptic segmentation (FPSnet) and FPSnet-SL for
high-accuracy [23]. DenseNet improves CNN model training and propagation of
features by reducing vanishing-gradient. The work employs a faster R-CNN model
for detection and a DenseNet model for feature map extraction [28]. Inception-
ResNet's self-attention mechanism helps improve convolutional neural network classification performance by performing standard classification and identifying chest radiograph disorders via the classifier for auxiliary COVID-19 diagnosis [29]. ULD CT scans were rebuilt using FBP, ASIR-V, and DLIR.
Image noise was assessed using three-dimensional lung tissue segmentation. The
application of a deep learning–based nodule evaluation system by radiation oncolo-
gists facilitated the detection and quantification of nodules, as well as the identifica-
tion of imaging characteristics associated with malignancy. The study employed the
Bland–Altman method and repeated-measures evaluation of variance to assess and
evaluate the images obtained from ultralow-dose computed tomography (ULD CT)
and contrast-enhanced computed tomography (CECT) [30]. Inspired by U-Net and
residual learning, ResNet50 is a classification model that improves VGG19 when
used to segment lung nodules. Analysis of Deep learning/CNN-based lung nodule
segmentation/detection is shown in Table 2. To retain spatial information, VGG19 keeps
the 7 × 7 convolutional layer and utilizes the maximum pooling layer for down-
sampling. ResNet-50 learns additional characteristics with more layers. Lung nodules
are small and have significant spatial information, so ResNet-50 is upgraded to create
3D ResNet50, a classification network for lung nodule identification [31].

6 Evaluation Metrics

For the purpose of determining the segmentation algorithm’s effectiveness, the


parameters of accuracy, precision, recall, and F-score were utilized. These metrics
are used to estimate the performance of the model for the segmentation task. As the
major evaluation standard, the dice similarity coefficient (DSC) is frequently utilized
in the process of measuring the degree of overlap that exists between two segmenta-
tion outcomes. As auxiliary assessment measures, we additionally make use of the
positive prediction value (PPV) and sensitivity (SEN), which helps to guarantee that
Table 2 Analysis of deep learning/CNN-based lung nodule segmentation/detection techniques

S. No | Year | Authors | Title | Dataset used | Methodology | Results
1 | 2023 | Xie R.-L.; Wang Y. et al. [32] | Lung nodule pre-diagnosis and insertion path planning for chest CT images | Lung image database consortium and image database resource, LIDC-IDRI | Adaptive gray threshold, connected area labeling, and mathematical morphological boundary repair | Sensitivity = 88.2%, Accuracy = 82.1%
2 | 2023 | Wang G.; Luo X. et al. [33] | PyMIC: A deep learning toolkit for annotation-efficient medical image segmentation | Nifty dataset or H5Dataset, batch generator | REDCNN, activation function leaky ReLU, CT-specific perceptual loss scheme | Dice values: LV NM = 0.795, LV BP = 0.922, RV BP = 0.935, Edema = 0.400, Scars = 0.613
3 | 2023 | Nguyen P.; Rathod A.; Chapman D. et al. [34] | Active semi-supervised learning via Bayesian experimental design for lung cancer classification using low dose computed tomography scans | National lung screening trial (NLST), the lung image database consortium (LIDC), and Kaggle data science bowl 2017 for lung cancer classification | DenseVNet-based CNN supplemented by additional processing methods for the gathered data | AUC values of 0.94 (Kaggle), 0.95 (NLST), and 0.88 (LIDC) were achieved with less image labels than a fully supervised model
4 | 2022 | Xing, Haiqun et al. [35] | A deep learning-based post-processing method for automated pulmonary lobe and airway trees segmentation using chest CT images in PET/CT | The declaration of Helsinki (2013 version) with the permission of the Peking Union Medical College Hospital's Ethics Committee | DenseVNet-based CNN supplemented by additional processing methods for the gathered data | Overall Dice coefficient = 0.972, Hausdorff distance = 12.025 mm, Jaccard coefficient = 0.948
5 | 2022 | Zhou, Wen et al. [36] | Deep learning-based pulmonary tuberculosis automated detection on chest radiography: Large-scale independent testing | Imaging archive (TCIA) collection Pediatric-CT-SEG | V-Net auto segmentation-modified FCN 3D V-Net | Dice similarity coefficient (DSC): the median DSC for the duodenum was 0.52, whereas the pancreas was 0.74, the stomach was 0.92, and the heart was 0.96

Fig. 8 Confusion matrix for a binary segmentation mask (rows: predicted Positive (1) / Negative (0); columns: actual Positive (1) / Negative (0); entries: TP, FP, FN, TN)

the evaluation is as accurate as possible [37]. Every prediction falls into one of four categories derived from the aforementioned positive (P) and negative (N) examples: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), as shown in Fig. 8.
These metrics are calculated as:
• When a prediction-target mask pair’s IoU score is higher than a certain threshold,
a true positive is noticed. This threshold can vary from system to system.
• When a false positive occurs, it means that a predicted object mask does not have
an accompanying ground truth object mask.
• A ground truth object mask that does not have an associated predicted object mask counts as a false negative (a small counting sketch follows this list).
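The bullet points above translate into a small counting routine: given per-object IoU scores between every predicted mask and every ground-truth mask, predictions above the threshold become true positives, unmatched predictions false positives, and unmatched ground-truth objects false negatives. The greedy matching and the 0.5 threshold below are assumptions made for illustration, since the threshold can vary from system to system.

```python
def count_detections(iou_matrix, threshold=0.5):
    # iou_matrix[i][j] = IoU between predicted object i and ground-truth object j
    n_gt = len(iou_matrix[0]) if iou_matrix else 0
    matched_gt = set()
    tp = fp = 0
    for pred_row in iou_matrix:
        best_j = max(range(n_gt), key=lambda j: pred_row[j]) if n_gt else None
        if best_j is not None and pred_row[best_j] >= threshold and best_j not in matched_gt:
            tp += 1                       # prediction matched to an unused ground-truth mask
            matched_gt.add(best_j)
        else:
            fp += 1                       # prediction without an accompanying ground-truth mask
    fn = n_gt - len(matched_gt)           # ground-truth masks never matched by any prediction
    return tp, fp, fn
```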

6.1 Accuracy, Precision, and Recall

The accuracy score is calculated by dividing the number of correct predictions (both positive and negative) by the total number of predictions; it is also called the Rand index, given in Eq. (2). Precision, given in Eq. (3), measures how accurately the pixels labelled as lung nodule are indeed nodule pixels. Recall is calculated by dividing the number of correctly identified lung nodule pixels by the total number of true nodule pixels. Basically, it is a measure of how well your model can recover the actual positives that occur.

Accuracy = (TP + TN) / (TP + TN + FP + FN)                                    (2)

Precision = TP / (TP + FP)                                                    (3)

Recall: It is a percentage that indicates how many of the positive cases present in the data set were correctly predicted by the classifier. The term “sensitivity” is another name for it that is
sometimes used in Eq. (4).

TPR / Sensitivity / Recall = TP / (TP + FN)                                   (4)

6.2 F-Score and Specificity

F1 Score (Dice Coefficient) is needed when you want to seek a balance between
Precision and Recall in Eq. (5). It evaluates a model’s capacity for prediction by
focusing on how well it performs inside individual classes rather than evaluating the
model as a whole, as accuracy does.

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)                    (5)

Specificity: The specificity of the model is the degree to which it correctly identifies real negatives, while the sensitivity of the model is the degree to which it correctly identifies true positives, as in Eq. (6). The specificity of a model can be evaluated based
on how well it recognizes the various kinds of backgrounds that might be seen in an
image. Specificity ranges that are quite near to one are standard and to be expected
due to the substantial proportion of pixels that have been tagged as background in
comparison to the ROI. Therefore, specificity is an appropriate metric for assuring
the functionality of a segmentation model, but it is less appropriate for measuring
the performance of a segmentation model.

Specificity = TN / (TN + FP)                                                  (6)
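A minimal sketch that evaluates Eqs. (2)–(6) directly from a predicted binary mask and its ground-truth mask is shown below; the function and variable names are our own, and the small guards against division by zero are an added assumption.

```python
import numpy as np

def pixel_metrics(pred, target):
    # pred and target are boolean masks of identical shape (True = nodule pixel)
    tp = np.logical_and(pred, target).sum()
    tn = np.logical_and(~pred, ~target).sum()
    fp = np.logical_and(pred, ~target).sum()
    fn = np.logical_and(~pred, target).sum()
    accuracy = (tp + tn) / (tp + tn + fp + fn)                      # Eq. (2)
    precision = tp / (tp + fp) if tp + fp else 0.0                  # Eq. (3)
    recall = tp / (tp + fn) if tp + fn else 0.0                     # Eq. (4)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)                           # Eq. (5)
    specificity = tn / (tn + fp) if tn + fp else 0.0                # Eq. (6)
    return accuracy, precision, recall, f1, specificity
```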

6.3 Intersection Over Union

Intersection over Union, or IoU for short, is a statistic that can be used to quantify
the percentage of overlap that exists between the target mask and the output of our
prediction. The Jaccard index is another name for this metric. This metric has a close
relationship using the dice coefficient, which is typically used as a loss function over
the training process. The IoU metric can be explained in a nutshell as the division of
the total number of pixels that are present in both the target mask and the prediction
mask by the number of common pixels that are present in both masks. The IoU score
is computed independently for each class, and then that total score is averaged across

Fig. 9 IoU representation with intersection and union (panels a and b)

all classes to get an overall, mean value for the IoU score that represents semantic
segmentation prediction detail in Eq. (7). Greater overlap between the predicted and actual boxes increases the IoU score, while low overlap results in a low IoU score; the score is 1 if the predicted box exactly covers the ground truth box and 0 if there is no overlap at all. Computer vision applications such as self-driving cars, surveillance, and medical imaging make use of it, as shown in Fig. 9 [38].

IoU = |Target ∩ Prediction| / |Target ∪ Prediction|                           (7)

6.4 AUC ROC and DSC Coefficient

The AUC ROC curve is a standard classification performance measure. It is calculated as the integral of sensitivity over (1 − specificity) across the ROC curve domain, and its value can fall anywhere between 0 and 1. This calculation is based on the receiver operating characteristic (ROC) probability curve. This score shows how well the model distinguishes between classes. The model predicts
0 s and 1 s better with a higher AUC value. A higher AUC means the model can
better distinguish malignant from non-cancerous patients in classification task. On
the receiver operating characteristic (ROC) curve, the TPR (true positive rate) and
FPR (false positive rate) are depicted on the y-axis and x-axis, respectively shown
in Fig. 10 [39].
So, the segmentation algorithm's performance is measured by how many pairs of items of the form (object of class 1, object of class 0) it orders correctly (the class 1 object ranked first), as reflected by the area under the curve [34]. Each image (and its forecast) is considered an individual “data point” when calculating a ROC curve for classification; when segmenting an image, each pixel must be handled separately, as in Eq. (8).

Fig. 10 AUC ROC curve

AUC ROC = (TPR × FPR)/2 + TPR × (1 − FPR) + ((1 − TPR)(1 − FPR))/2 = (1 + TPR − FPR)/2        (8)
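As noted above, for segmentation every pixel can be treated as an individual sample when computing the AUC; a minimal sketch using scikit-learn's roc_auc_score on flattened arrays follows (the probability-map and mask names are assumptions).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def segmentation_auc(prob_map, target_mask):
    # Flatten so that every pixel becomes one (score, label) pair
    scores = np.asarray(prob_map).reshape(-1)                 # predicted nodule probabilities
    labels = np.asarray(target_mask).reshape(-1).astype(int)  # 1 = nodule pixel, 0 = background
    return roc_auc_score(labels, scores)
```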
Dice similarity coefficient (DSC), in Eq. (9) which served as the major evaluation
measure. The DSC is a popular tool for determining the degree of overlap that exists
between two sets of segmentation data. As auxiliary evaluation metrics, it additionally
makes use of the positive prediction value (PPV) and sensitivity (SEN) [38, 40] in Eqs.
(10) and (11). These metrics are calculated as follows, where M and N are the two sets of segmented pixels being compared and | | denotes the cardinality of a set, as illustrated in Fig. 11.

DSC = 2 × |M ∩ N| / (|M| + |N|)                                               (9)

SEN = |M ∩ N| / |M|                                                           (10)

PPV = |M ∩ N| / |N|                                                           (11)
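Reading M as the ground-truth segmentation and N as the predicted segmentation (an interpretive assumption, since the chapter does not fix which set is which), Eqs. (9)–(11) reduce to a few operations on boolean masks, as in the sketch below.

```python
import numpy as np

def overlap_metrics(ground_truth, prediction):
    m = np.asarray(ground_truth, dtype=bool)   # M: ground-truth pixels
    n = np.asarray(prediction, dtype=bool)     # N: predicted pixels
    intersection = np.logical_and(m, n).sum()
    dsc = 2 * intersection / (m.sum() + n.sum())   # Eq. (9)
    sen = intersection / m.sum()                   # Eq. (10)
    ppv = intersection / n.sum()                   # Eq. (11)
    return dsc, sen, ppv
```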

7 Public Data Set

In lung nodule segmentation, the regularly utilized datasets, together with their
respective contributions, comprise a wide variety of helpful resources. It is important
to remember that some datasets provide only a 2D slice, while others provide the
Fig. 11 Dice similarity coefficient: DSC = 2 × (Target ∩ Prediction) / (Target + Prediction)

whole scan. The majority of approaches that made use of supervised training either used k-fold cross-validation or prepared training and test splits with an 80/20 ratio; the datasets are listed in Table 3 [41].
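A brief sketch of the two evaluation protocols mentioned above, using scikit-learn; the choice of five folds, the list of placeholder scan identifiers, and the fixed random seed are assumptions made only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

scan_ids = np.arange(100)        # placeholder identifiers for 100 CT scans

# 80/20 training and test split
train_ids, test_ids = train_test_split(scan_ids, test_size=0.2, random_state=42)

# k-fold cross-validation (here k = 5): every scan appears in the test fold exactly once
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (tr, te) in enumerate(kfold.split(scan_ids)):
    print(f"fold {fold}: {len(tr)} training scans, {len(te)} test scans")
```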

8 Examples

We have provided several illustrative scenarios to facilitate comprehension of the


utilization of Chest CT images in the prediction and classification of lung cancer.
In this discussion, we have included three convolutional neural network models:
ResNet50, ResNet101, and VGG16. These models were evaluated using performance
assessment metrics on a dataset consisting of the Chest CT Scan Images Dataset.
The following subsections describe the CNN models' architectures and their performance for the segmentation task, with selected parameter settings, over the Chest CT Scan Images Dataset.

8.1 Model 1

ResNet 50: The ResNet-50 model is a convolutional neural network (CNN) character-
ized by its depth of 50 layers. ResNet-50 is constructed upon a deep residual learning
framework, which facilitates the training of exceedingly deep networks with a total
of 50 layers, which are organized into five distinct blocks. Each block is comprised of
a collection of residual blocks. The inclusion of residual blocks facilitates the reten-
tion of relevant information from preceding layers, hence enhancing the network’s
capacity to acquire more effective representations of the input data. There are various
steps to implement each technique: first, fetch the dataset and divide it into three parts, namely train, test, and validation sets. ResNet-50 is widely used in medical image processing tasks because it reduces the vanishing gradient problem, preserves the input, and avoids loss of information [52]. The steps followed while using ResNet-50 are shown in Fig. 12, and a hedged code sketch of this setup is given below.
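A hedged sketch of this setup in Keras follows. The directory layout, the 224 × 224 input size, the assumption of four chest CT classes, and the use of frozen ImageNet weights are all our own illustrative assumptions and not the authors' exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Assumed directory layout: chest_ct/{train,valid,test}/<class_name>/*.png
train_ds = tf.keras.utils.image_dataset_from_directory(
    "chest_ct/train", image_size=(224, 224), batch_size=32, label_mode="categorical")
valid_ds = tf.keras.utils.image_dataset_from_directory(
    "chest_ct/valid", image_size=(224, 224), batch_size=32, label_mode="categorical")

# Apply the preprocessing ResNet50 expects for its ImageNet weights
preprocess = tf.keras.applications.resnet50.preprocess_input
train_ds = train_ds.map(lambda x, y: (preprocess(x), y))
valid_ds = valid_ds.map(lambda x, y: (preprocess(x), y))

base = tf.keras.applications.ResNet50(include_top=False, weights="imagenet",
                                      input_shape=(224, 224, 3))
base.trainable = False                       # keep the pretrained residual blocks frozen

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(4, activation="softmax"),   # assumed: four chest CT classes
])
```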

Table 3 Some public datasets

S. No | Data set | Annotation | Description | Link | References
1 | LTRC | Chronic obstructive pulmonary disease (COPD) | Lung tissue research consortium | https://www.nhlbi.nih.gov/science/lung-tissue-research-consortium-ltrc | [42]
2 | LIDC-IDRI | Nodule detection | Lung image database consortium image collection | https://wiki.cancerimagingarchive.net/display/Public/LIDC-IDRI | [43]
3 | LUNA16 | Pulmonary nodule segmentation, COVID-19 low dose scans | Challenge for pulmonary nodule | https://lola11.grand-challenge.org/ and https://luna16.grand-challenge.org/ | [44]
4 | Medical segmentation decathlon | Pulmonary nodule segmentation | Pulmonary nodule segmentation challenge with multiple tasks including lung cancer | http://medicaldecathlon.com/ | [45]
5 | VESSEL12 | Lung and vessel | Vessel segmentation in the lung challenge | https://vessel12.grand-challenge.org/ | [46]
6 | MedSegCovid | Lung and COVID-19 findings | COVID-19 patient scans | http://medicalsegmentation.com/covid19/ | [47]
7 | Data science bowl 2017 (DSB) | Cancer | Data science bowl 2017 Kaggle competition | https://www.kaggle.com/c/data-science-bowl-2017 | [48]
8 | Finding and measuring lungs in CT data | Lung annotations | Kaggle lung segmentation challenge | Finding and measuring lungs in CT data (Kaggle) | [49]
9 | ELCAP | Cancer | Research teams from the International Early Lung Cancer Action Program (ELCAP) and Vision and Image Analysis (VIA) | http://www.via.cornell.edu/lungdb.html | [50]
10 | Lung-PET-CT-Dx | Lung cancer | Lung cancer patient DICOM scans and PET scans | https://veet.via.cornell.edu/lungdb.html | [51]
8.2 Model 2

ResNet101: ResNet-101 is a convolutional neural network characterized by its depth,


consisting of 101 layers. It is possible to utilize a preexisting iteration of the network
that has undergone training on a dataset including over one million images [53], as shown in Fig. 13.

8.3 Model 3

VGG16: VGG16 comprises a total of 21 layers, 13 of which are convolutional, five of which are max pooling, and three of which are dense, but only 16 of them are weight layers, that is, layers with learnable parameters. Convolutional, ReLU activation, hidden, pooling, and fully connected layers are all present in this structure. It is a well-known method for image classification that facilitates rapid transfer of knowledge, as shown in Fig. 14 [54].

8.4 Comparative Analysis of Deep Learning Algorithms Used for Segmentation

In this chapter we have included three models ResNet50, ResNet101, and VGG16.
After applying these models, it is clear that the accuracy and loss are different for
Fig. 12 (a) The residual block, where f(x) is the mapping function with rectified linear unit (ReLU) activation; (b) the ResNet50 architecture with all the convolutional layers

every architecture. The results can be tuned by adjusting various parameters such as the learning rate, optimizer, activation functions, and number of epochs in the training procedure. Here, we show a comparison of these three models in terms of accuracy and loss. For this purpose, we used a learning rate of 0.00001, loss = 'Categorical Cross Entropy', optimizer 'Opti', verbose = 1, and 30 epochs, as shown in Table 4; a hedged sketch of this training configuration is given below.
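Continuing the Sect. 8.1 sketch (where `model`, `train_ds`, and `valid_ds` were defined), the compile-and-train step under the settings listed above might look as follows; the Adam optimizer is an assumption, since the chapter only refers to the optimizer as 'Opti'.

```python
import tensorflow as tf   # continuing the Sect. 8.1 sketch

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),   # learning rate 0.00001
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

history = model.fit(train_ds, validation_data=valid_ds, epochs=30, verbose=1)
```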
In this chapter we have included the pictorial representation of accuracy and loss
changes according to number of epochs over the Chest CT image dataset for the
segmentation analysis shown in Fig. 15.

Fig. 13 (a) The residual block for ResNet101 with rectified linear unit (ReLU) activation; (b) the detailed ResNet101 architecture with all the residual blocks
Fig. 14 The detailed VGG16 architecture with all the convolutional blocks

Table 4 Comparison of ResNet50, ResNet101, and VGG16 for the segmentation task

Model | Accuracy | Loss decrement | No. of epochs
ResNet50 | 0.803 | 3.6057 | 30
ResNet101 | 0.820 | 0.5605 | 30
VGG16 | 0.832 | 0.5284 | 30

Fig. 15 Input images from the Chest CT images datasets

After applying ResNet50, we achieved an accuracy that could be further increased by raising the number of epochs and changing the hyper-parameters. Here, we show it for the purpose of understanding, along with the training loss, validation loss, and confusion matrix, as depicted in Figs. 16, 17 and 18.

Fig. 16 (a) and (b) show the accuracy and loss changes with the number of epochs for the ResNet50 architecture

Fig. 17 (a) and (b) depict the accuracy and loss changes with the number of epochs for the ResNet101 architecture

Fig. 18 Accuracy changes according to number of epochs for VGG16 architecture



9 Conclusion and Future Work

This chapter reviewed contemporary deep learning lung nodule segmentation


approaches in several image modalities. Most of the literature surveyed is within
the last four years. A detailed summary and performance metrics of many state-
of-the-art lung nodule detection, segmentation, and classification and classification
algorithms are presented. There are various models that display high competences
in nodule segmentation and classification and still there is a chance for improve-
ment. We also focused on some challenges in this field of research. Upon conducting
an analysis of various methodologies and networks’ designs, it has been noted that
transfer learning-based networks such as Mobile Net, ResNet, etc. exhibit superior
efficacy for tuberculosis. Region-based convolutional neural networks (CNNs) have
been found to exhibit superior sensitivity in the detection of nodules. On the other
hand, for the purpose of nodule identification and categorization, models derived from
VGG Net have demonstrated the most favourable outcomes. U-Net architectures have
demonstrated their superiority in the segmentation field, establishing themselves as
the current state-of-the-art. The GCN architecture has demonstrated effectiveness
in COVID-19 identification, making it a viable option for simultaneous use along-
side CNNs in nodule categorization. The training duration, training data volume,
and model size of DNN architectures are all areas that could benefit from further
study. GAN architectures also provide dummy data that may be trained on current
datasets to produce more images for 2D CNNs, hence overcoming the issue of limited
datasets.
In the limitation part, we have noticed that most of the researchers used LIDC-
IDRI vast database for the training and testing of their models. Additionally, there
are quite limited public databases and their data is inadequately organized, which
affects model performance. Thus, future study requires data collection and annotated
database disclosure. Detecting and classifying micro-nodules is a very critical task because these are less than 3 mm in diameter and can grow in size if they are malignant in nature [51]. As a result, constructing new databases, or labelling databases from hospitals, healthcare organisations, and World Health Organization groups that publish their datasets online for research purposes, along with conducting competitions, is essential for the early diagnosis of micronodules. Another challenge
is the detection of pulmonary fibrosis. In addition to that, in order to address the
problems that already exist, the information presented in this chapter has offered
novel study avenues in the field of lung disorders.

References

1. Ghoshal, S., Rigney, G., Cheng, D., et al.: Institutional surgical response and associated
volume trends throughout the COVID-19 pandemic and postvaccination recovery period. 5(8),
e2227443 (2022). https://doi.org/10.1001/jamanetworkopen.2022.27443
2. Chen, R., Aschmann, H.E., Chen, Y.H., et al.: Racial and ethnic disparities in estimated excess
mortality from external causes in the US, March to December 2020. 182(7), 776–778 (2022).
https://doi.org/10.1001/jamainternmed.2022.1461
3. Das, A., Krishnamurthy, A., Ramshankar, V., Sagar, T.G., Swaminathan, R.: The increasing
challenge of never smokers with adenocarcinoma lung: need to look beyond tobacco exposure.
Indian J. Cancer 54, 172–177 (2017)
4. Kaur, H., Sehgal, I.S., Bal, A., et al.: Evolving epidemiology of lung cancer in India: reducing
non-small cell lung cancer-not otherwise specified and quantifying tobacco smoke exposure
are the key. Indian J. Cancer 54, 285–290 (2017)
5. Prasad, K.T., Basher, R., Garg, M., et al.: Utility of LDCT in lung cancer screening in a TB endemic region. ClinicalTrials.gov (2023). https://clinicaltrials.gov/ct2/show/NCT03909620
6. Lam, D.C.L., Liam, C.K., Andarini, S., Park, S., et al.: Lung cancer screening in Asia: an expert
consensus report. J. Thor. Oncol. 18, 1303–1322 (2023). ISSN 1556-0864. https://doi.org/10.
1016/j.jtho.2023.06.014
7. Yabroff, K.R., Wu, X.C., Negoita, S., et al.: Association of the COVID-19 pandemic with
patterns of statewide cancer services. J. Natl. Cancer Inst. 114(6), 907–909 (2022)
8. Chen, G.B., Fu, Z., Zhang, T.F., Shen, Y., Wang, Y., Shi, W., Fei, J.: Robot-assisted puncture
positioning methods under CT navigation. J. Xi’an Jiao Tong Univ. 53(85–92), 99 (2019)
9. Mansoor, A., Bagci, U., Foster, B., Xu, Z., Papadakis, G.Z., Folio, L.R., et al.: Segmentation
and image analysis of abnormal lungs at ct: current approaches, challenges, and future trends.
Radiographics. 35(4), 1056–1076 (2015 Jul–Aug)
10. Kim, S.S., Seo, J.B., Lee, H.Y., Nevrekar, D.V., Forssen, A.V., Crapo, J.D., et al.: Chronic
obstructive pulmonary disease: lobe-based visual assessment of volumetric CT by Using stan-
dard images--comparison with quantitative CT and pulmonary function test in the COPDGene
study. Radiology. 266(2), 626–635 (2013 Feb)
11. Doel, T., Gavaghan, D.J., Grau, V.: Review of automatic pulmonary lobe segmentation methods
from CT. Comput. Med. Imag. Graph. 40, 13–29 (2015 Mar)
12. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature. 521(7553), 436–444 (2015).
[PubMed: 26017442]
13. Liu, X., Song, L., Liu, S., Zhang, Y.: A review of deep-learning-based medical image
segmentation methods. Sustainability, 1224 (2021). https://doi.org/10.3390/su13031224
14. Korfiatis, P., Kazantzi, A., Kalogeropoulou, C., Petsas, T., Costaridou, L.: Optimizing lung
volume segmentation by texture classification. ITAB Corfu. Greece 2010, 1–4 (2010)
15. Somasundaram, E., Deaton, J., Kaufman, R., Brady, S.: Fully automated tissue classifier for
contrast-enhanced CT scans of adult and paediatric patients. Phys. Med. Biol. 63(13), 135009
(2018)
16. Eid Alazemi, F., Jehangir, B., Imran, M., Song, O.Y., Karamat, T.: An efficient model for lungs
nodule classification using supervised learning technique. J. Healthc. Eng. (2023). https://doi.
org/10.1155/2023/8262741
17. Raoof, S.S., Jabbar, M.A., Fathima, S.A.: Lung cancer prediction using machine learning: a
comprehensive approach. 2nd International Conference on Innovative Mechanisms for Industry
Applications (ICIMIA), Bangalore, India, pp. 108–115 (2020). https://doi.org/10.1109/ICIMIA
48430.2020.9074947
18. Tarhini, H., Mohamad, R., Rammal, A., Ayache, M.: Lung segmentation followed by machine
learning & deep learning techniques for COVID-19 detection in lung CT images. Sixth Interna-
tional Conference on Advances in Biomedical Engineering (ICABME), Werdanyeh, Lebanon,
pp. 222–227 (2021). https://doi.org/10.1109/ICABME53305.2021.9604872

19. Nazir, I., ul Haq, I., AlQahtani, S.A., Jadoon, M.M., Dahshan, M.: Machine learning-based
lung cancer detection using multiview image registration and fusion. J. Sens. 2023, 19. Article
ID 6683438 (2023). https://doi.org/10.1155/2023/6683438
20. Nageswaran, S., Arunkumar, G., Bisht, A.K., Mewada, S., Kumar, J.N.V.R.S., Jawarneh,
M., Asenso, E.: Lung cancer classification and prediction using machine learning and image
processing. Biomed. Res. Int. 2022, 1755460 (2022). https://doi.org/10.1155/2022/1755460.
PMID: 36046454; PMCID: PMC9424001
21. Wang, S.-H., Govindaraj, V.V., Górriz, J.M., Zhang, X., Zhang, Y.-D.: COVID-19 classification
by FGCNet with deep feature fusion from graph convolutional network and convolutional neural
network. Inf. Fusion. 67, 208–229 (2021)
22. Nishio, M., Sugiyama, O., Yakami, M., Ueno, S., Kubo, T., Kuroda, T., Togashi, K.: Computer-
aided diagnosis of lung nodule classification between benign nodule, primary lung cancer, and
metastatic lung cancer at different image size using deep convolutional neural network with
transfer learning. PLoS One. 13(7) (2018)
23. Chaunzwa, T.L., Hosny, A., Xu, Y., Shafer, A., Diao, N., Lanuti, M., Christiani, D.C., Mak,
R.H., Aerts, H.J.: Deep learning classification of lung cancer histology using CT images. Sci.
Rep. 11(1), 1–12 (2021)
24. Afag, S.: Classification of lung nodules using improved residual convolutional neural network.
J. Computat. Sci. Intellig. Technol. 1(1), 15–21 (2020)
25. Zhang, C., Sun, X., Dang, K., Li, K., Guo, X.W., Chang, J., Yu, Z.Q., Huang, F.Y., Wu, Y.S.,
Liang, Z., et al.: Toward an expert level of lung cancer detection and classification using a deep
convolutional neural network. Oncol. 24(9), 1159–1165 (2019)
26. Nasrullah, N., Sang, J., Alam, M.S., Mateen, M., Cai, B., Hu, H.: Automated lung nodule
detection and classification using deep learning combined with multiple strategies. Sensors.
19(17), 3722 (2019)
27. Ali, I., Hart, G.R., Gunabushanam, G., Liang, Y., Muhammad, W., Nartowt, B., Kane, M.,
Ma, X., Deng, J.: Lung nodule detection via deep reinforcement learning. Front. Oncol. 8, 108
(2018)
28. Simeth, J., et al.: Deep learning-based dominant index lesion segmentation for MR-guided
radiation therapy of prostate cancer. Med. Phys. 50(8), 4854–4870 (2023). https://doi.org/10.
1002/mp.16320
29. Zhang, Y., et al.: Lung nodule detectability of artificial intelligence-assisted CT image reading
in lung cancer screening. Curr. Med. Imag. 18(3), 327–334 (2022). https://doi.org/10.2174/157
3405617666210806125953
30. Chen, Y., Lin, Y., Xu, X., Ding, J., Li, C., Zeng, Y., Liu, W., Xie, W., Huang, J.: Classification
of lungs infected COVID-19 images based on inception-ResNet. Comput. Methods Programs
Biomed. 225, 107053 (2022 Oct). https://doi.org/10.1016/j.cmpb.2022.107053. Epub (2022).
PMID: 35964421; PMCID: PMC9339166
31. Jiang, B., et al.: Deep learning reconstruction shows better lung nodule detection for ultra-low-
dose chest CT. Radiology. 303(1), 202–212 (2022). https://doi.org/10.1148/radiol.210551
32. Xie, R.L., Wang, Y., Zhao, Y.N., et al.: Lung nodule pre-diagnosis and insertion path planning
for chest CT images. BMC Med. Imag. 23, 22 (2023). https://doi.org/10.1186/s12880-023-009
73-z
33. Wang, G., Luo, X., Gu, R., Yang, S., Qu, Y., Zhai, S., Zhao, Q., Li, K., Zhang, S.: PyMIC:
a deep learning toolkit for annotation-efficient medical image segmentation. ArXiv (2022).
https://doi.org/10.1016/j.cmpb.2023.107398
34. Nguyen, P., Rathod, A., Chapman, D., Prathapan, S., Menon, S., Morris, M., Yesha, Y.: Active
semi-supervised learning via Bayesian experimental design for lung cancer classification using
low dose computed tomography scans. Appl. Sci. 13, 3752 (2023). https://doi.org/10.3390/app
13063752
35. Xing, H., Zhang, X., et. al.: A deep learning-based post-processing method for automated
pulmonary lobe and airway trees segmentation using chest CT images in PET/CT. Quant.
Imag. Med. Surg. 12(10) (2022). https://qims.amegroups.org/article/view/99741

36. Zhou, W. et al.: Deep learning-based pulmonary tuberculosis automated detection on chest
radiography: large-scale independent testing. Quant. Imag. Med. Surg. 12(4), 2344–2355
(2022). https://doi.org/10.21037/qims-21-676
37. Fang, D., Jiang, H., Chen, W., Qin, Z., Shi, J., Zhang, J.: Pulmonary nodule detection on lung
parenchyma images using hyber-deep algorithm. Heliyon 9(7), e17599 (2023). https://doi.org/
10.1016/j.heliyon.2023.e17599. PMID: 37449096; PMCID: PMC10336504
38. Lei, Y., Tian Shan, Y.H., Zhang, J., Wang, G., Kalra, M.K.: Shape and margin-aware lung
nodule classification in low-dose CT images via soft activation mapping. Med. Image Anal.
60, pp. 1–13 (2020)
39. Anguita, D., Ghelardoni, L., Ghio, A., Oneto, L., Ridella, S.: The ‘K’in K-fold cross validation.
In: 20th European Symposium on Artificial Neural Networks, Computational Intelligence and
Machine Learning (ESANN), pp. 441–446 (2012)
40. Wu, Z., Zhou, Q., Wang, F.: Coarse-to-fine lung nodule segmentation in CT images with image
enhancement and dual-branch network. IEEE Access. 9, pp. 7255–7262 (2021). https://doi.
org/10.1109/ACCESS.2021.3049379
41. Karwoski, R.A., Bartholmai, R., Zavaletta, V.A., Holmes, D., Robb, R.A.: Processing of CT
images for analysis of diffuse lung disease in the lung tissue research consortium. In: Medical
Imaging. Physiology, Function, and Structure from Medical Images. SPIE. 6916, pp. 614–691
(2008)
42. Armato, S.G., 3rd., McLennan, G., Bidaut, L., McNitt- Gray, M.F., Meyer, C.R., Reeves, A.P.:
The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI):
a completed reference database of lung nodules on CT scans. Med. Phys. 38(2), 915–931 (2011)
43. LOLA11 Grand Challenge. Lobe and lung analysis. (LOLA11). Available at https://lola11.
grand-challenge.org/ (2011) [cited 30 Jan 2022]
44. Simpson, A.L., Antonelli, M., Bakas, S., Bilello, M., Farahani, K., Van Ginneken, B., et al.:
A large annotated medical image dataset for the development and evaluation of segmentation
algorithms; 2019. arXiv preprint. arXiv:1902.09063 (2019)
45. VESSEL12 Grand Challenge. Vessel segmentation in the lung 2012 (vessel12). Available at
https://vessel12.grand-challenge.org (2012) [cited 30 Jan 2022]
46. MedSeg. COVID-19 CT segmentation dataset. Available at http://medicalsegmentation.com/
covid19/ (2020) [cited 30 Jan 2022]
47. Kaggle Competition. Data science bowl 2017 (DSB). [Online]. Available at www.kaggle.com/
c/data-science-bowl-2017 (2017)
48. Kaggle Competition, Finding and measuring lungs in CT data. [Online]. Available at https://
www.kaggle.com/kmader/finding-lungs-in-ct-data (2017). Accessed 30 Jan 2022
49. Henschke, C.I., McCauley, D.I., Yankelevitz, D.F., Naidich, D.P., McGuinness, G., Miettinen,
O.S., Libby, D., Pasmantier, M., Koizumi, J., Altorki, N., et al.: Early lung cancer action project:
a summary of the findings on baseline screening. Oncologist 6(2), 147–152 (2001)
50. Li, P., Wang, S., Li, T., Lu, J., Huang Fu, Y., Wang, D.: A large-scale CT and PET/CT dataset
for lung cancer diagnosis. The Cancer Imag. Arch. (2020)
51. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pp. 770–778
(2016). https://doi.org/10.1109/CVPR.2016.90
52. Xu, Z., Sun, K., MaoR, J.: Research on ResNet101 network chemical reagent label image
classification based on transfer learning. 2020 IEEE 2nd International Conference on Civil
Aviation Safety and Information Technology (ICCASIT, Weihai, China, pp. 354–358 (2020).
https://doi.org/10.1109/ICCASIT50869.2020.9368658.
53. Qassim, H., Verma, A., Feinzimer, D.: Compressed residual-VGG16 CNN model for big data
places image recognition. 2018 IEEE 8th Annual Computing and Communication Workshop
and Conference (CCWC), Las Vegas, NV, USA, pp. 169–175(2018). https://doi.org/10.1109/
CCWC.2018.8301729
54. Gaur, P., Malaviya, V., Gupta, A., Bhatia, G., Pachori, R.B., Sharma, D.: COVID-19 disease
identification from chest CT images using empirical wavelet transformation and transfer
learning. Biomed. Signal Process. Control. 71 (2022), 103076
Convergence of Data Analytics, Big Data,
and Machine Learning: Applications,
Challenges, and Future Direction

Abhishek Bhattacherjee and Ajay Kumar Badhan

Abstract The fusion of Data Analytics, Big Data, and Machine Learning has
become a powerful force in the always-changing world of data-driven decision-
making. This chapter offers a brief overview of their practical uses, illuminating how
these technologies are reshaping markets and driving creativity. The cornerstone,
data analytics, is studied first, emphasizing its capacity to extract useful insights from
a variety of sources. To demonstrate how Data Analytics enables organizations to
optimize processes, improve consumer experiences, and manage risks through data-
driven decision-making, real-world examples from industries including e-commerce,
finance, and healthcare are shown. Next, Big Data takes center stage to demonstrate its
ability to handle enormous amounts of data. We examine its uses in industries ranging
from urban planning to agriculture, showing how it facilitates better decision-making
through data-driven insights. The third element of the equation, machine learning,
emerges as a crucial enabler of automation and intelligence. We highlight its use in
customization, fraud detection, and healthcare diagnostics through fascinating real-
world examples, highlighting its disruptive potential. The synergistic potential of
these technologies, notably in predictive modeling and pattern recognition, is high-
lighted in the chapter’s conclusion. It also discusses the ethical issues surrounding
the use of data and the proper application of AI, urging businesses to proceed in the
data-driven world with caution and foresight. This chapter provides readers with a
concise yet thorough overview of the influential trio of Big Data, Machine Learning,
and Data Analytics, encouraging further investigation of their potential to reshape
industries and spur innovation in the real world.

A. Bhattacherjee (B) · A. K. Badhan


Department of Computer Science and Engineering, Lovely Professional University, Phagwara,
Punjab, India
e-mail: abhishek.27306@lpu.co.in
A. K. Badhan
e-mail: ajay.27337@lpu.co.in


1 Introduction

In an era defined by the digital revolution, data has emerged as the lifeblood of inno-
vation and progress. The convergence of technology and data science has ushered
in a new age where organizations harness the power of Data Analytics, Big Data,
and Machine Learning to decipher complex challenges, unlock hidden insights, and
redefine the way they operate. This book chapter embarks on an exploration of the
myriad real-world applications where these three transformative technologies inter-
sect; illuminating the profound impact they have on diverse industries and domains.
From revolutionizing healthcare with predictive diagnostics to optimizing supply
chains through the analysis of massive data sets, and from enhancing the accuracy
of financial decisions to empowering machines with the ability to learn and adapt
autonomously, this chapter delves into the tangible ways in which Data Analytics, Big
Data, and Machine Learning shape the world around us. We will journey through the
realms of commerce, healthcare, finance, transportation, and beyond, discovering
the remarkable stories of organizations and individuals who have harnessed these
technologies to revolutionize their fields and pioneer groundbreaking solutions.
The convergence of Big data, Machine Learning, and Data Analytics has sparked
a new wave of innovation and change in the rapidly changing field of informa-
tion and technology. This chapter explores the practical uses of these cutting-edge
technologies to transform markets, resolve challenging issues, and realize unreal-
ized potential. This investigation is a vital resource for anyone looking to compre-
hend, utilize, and capitalize on the dynamic trifecta of Data Analytics, Big Data, and
Machine Learning in the search for practical insights and long-term advancement
as the digital era continues to push the envelope of what is feasible. We set out on
a trip through the concrete, significant applications that propel real-world change
and help organizations traverse the opportunities and difficulties of the data-driven
future, from marketing to finance and healthcare, among other areas.
As we delve into the exciting world of Data Analytics, Big Data, and Machine
Learning, we aim to provide both a comprehensive understanding of these technolo-
gies and real-world inspiration for those seeking to leverage data’s transformative
potential. We invite you to embark on this enlightening voyage, discovering the inno-
vation, challenges, and limitless opportunities that lie at the intersection of data and
technology.
Many applications currently use “smart technology,” which is the incorporation of
sensors and networked infrastructures. Every service or product has the potential to
be improved by the application of smart technology. Smart technology has, in fact, been incorporated for decades and is widely used. For instance, in the 1980s, students
at Carnegie Mellon University used sensors attached to a vending machine that was
connected to the internet to track the number of soft drinks served. To supply goods
and perform services more effectively, the product suppliers were able to keep track
of the number of products in each vending machine. Such instances of the use of smart
technology are numerous and are constantly expanding in diversity and complexity.

Many industries have seen radical change thanks to Data Analytics, Big Data, and
Machine Learning. They enable companies in industries like banking, healthcare,
and retail to streamline operations, make data-driven choices, and improve customer
experiences. These technologies also enable predictive maintenance, route optimiza-
tion, and talent management, and they are crucial in a variety of other fields, such as
energy, transportation, agriculture, and human resources. Additionally, by offering
data for social trend tracking, player performance analysis, and targeted advertising,
they have revolutionized social sciences, sports analytics, and marketing. Figure 1 represents the major fields of application for data analytics, Big Data, and machine learning. Furthermore, data analytics
improves efficiency and sustainability in communications, supply chain manage-
ment, and environmental conservation. These tools help with policy-making and
tailored learning in both government and education. While the media, insurance, and
real estate sectors use data analytics for risk assessment, content suggestion, and
property appraisal, the pharmaceutical industry uses them for medication discovery
and safety monitoring. These technologies are used in banking and finance for regu-
latory compliance and credit scoring, and they are also utilized by smart cities for
urban planning. Data analytics is used by nonprofits to evaluate the effectiveness of
their programs and engage donors.
Fig. 1 Top real-world applications of Data Analysis, Big Data and Machine Learning (three panels listing the main application areas of Data Analytics, Big Data, and Machine Learning, respectively)

In a similar way as illustrated in Fig. 1, Big Data is applied in diverse


domains including data storage, archiving, and backup, content delivery networks,
genomic data storage, scientific research, social media trend analysis, log and
event management, IoT data processing, anomaly detection, data warehousing,
graph databases, machine learning model training, blockchain technology, search
engines, network traffic analysis, spatial data processing, inventory management,
pharmaceutical research, weather data processing, high-performance computing,
virtual and augmented reality, digital advertising, energy grid management, and
smart devices. These technologies drive innovation and efficiency in a wide range
of industries and applications. Natural Language Generation (NLG), Recommen-
dation Engines, Speech Recognition, Computer Vision, Reinforcement Learning,
Automatic Game Playing, Healthcare Diagnosis, Natural Language Understanding
(NLU), Autonomous Vehicles, Credit Scoring Models, Behavioral Biometrics,
Robotics, Neuroscience, Machine Translation, Emotion Recognition, Sports Perfor-
mance Analysis, Music Composition, Virtual Health Assistants, Criminal Justice,
Crop Prediction, Air Quality Prediction, Quality Control in Manufacturing, Energy
Efficiency, and Astronomy are just a few of the cutting-edge technologies we examine
in this chapter. These applications show the transformational potential of artificial
intelligence and data-driven solutions in the current world, spanning a wide range of
industries from healthcare and finance to environmental monitoring.

1.1 Applications Under Categorization of Data Analysis

Simply put, data analysis is the act of turning collected data into useful information. Different approaches, including modeling, are used to identify trends and connections and, ultimately, to draw conclusions that support the decision-making process. The following four major methods can be used to analyze data:
Descriptive: Descriptive analysis summarizes what has already happened, and having a plan for it brings several benefits. The most evident benefit is that investigators are not left unsure of what to do with all of the data they now have on their computers; a plan also speeds up the data analysis, and if a computer program is employed, the analysis commands can even be written before data collection is finished. Descriptive analysis can be used for many different aspects of a business's routine daily operations: as shown in Fig. 2, it provides the foundation for reports on inventories, multiple workflows, sales numbers, and income information.
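To make the idea concrete, the following minimal sketch, written in Python with the pandas library, summarizes a small, entirely invented sales table into the kind of inventory and revenue report described above; column names such as store, units_sold, and revenue are illustrative assumptions, not part of any real system.

import pandas as pd

# Hypothetical daily sales records (all values invented for illustration)
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "units_sold": [120, 95, 210, 180],
    "revenue": [2400.0, 1900.0, 4200.0, 3600.0],
})

# Descriptive statistics: a summary of what has already happened
print(sales[["units_sold", "revenue"]].describe())

# A simple KPI-style report: totals and averages per store
kpi = sales.groupby("store").agg(
    total_units=("units_sold", "sum"),
    total_revenue=("revenue", "sum"),
    avg_revenue=("revenue", "mean"),
)
print(kpi)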
Exploratory: Data scientists use a method known as “exploratory data analysis,”
or EDA, to analyze, study, and summarize the key properties of various types of
data sets. These methods usually make use of data visualization techniques. EDA
helps data scientists choose the most efficient approach to change data sources to get
the results they need, making it easier for them to spot trends, spot anomalies, test
a hypothesis, or confirm assumptions. Numerous industries, including professional
sports, history, healthcare, marketing, the hospitality sector, retail, fraud detection,
auditing, geography, space exploration, and the food business, use exploratory data analytics.

Fig. 2 Real-world applications of data analysis (panels: descriptive, exploratory, inferential, and anticipatory)
Inferential: Inferential data analysis draws conclusions and forecasts about a large body of data by examining a selection (a sample) of the original data, and it reaches those conclusions by using probability. “Inferential data analysis” is the technique of “inferring” insights from a sample of data. Banking, healthcare, education, insurance, and transportation are among the fields where inferential analysis is most widely applied. It is used in any case where information is taken from a group of subjects and then utilized to draw conclusions about a larger group. Despite the potential for data sets to grow huge and contain numerous variables, inferential data analysis does not require complex equations. If you were to evaluate a sample of 100 persons regarding whether or not they recovered from a very serious health condition, and 85 of them said yes and 15 said no, the results would indicate that 85% of the sample had recovered from that particular health issue. Based on those figures, one could infer that roughly 85% of the general population recovers from it while 15% does not.
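The recovery example can be worked through explicitly. The short Python sketch below computes the sample proportion and a normal-approximation 95% confidence interval around it; the figures are the illustrative 85-out-of-100 numbers from the text, and the interval formula is the standard one for a proportion.

import math

n, recovered = 100, 85
p_hat = recovered / n                      # sample proportion = 0.85

# Standard error and 95% normal-approximation confidence interval
se = math.sqrt(p_hat * (1 - p_hat) / n)
z = 1.96
lower, upper = p_hat - z * se, p_hat + z * se

print(f"Point estimate of the population recovery rate: {p_hat:.2f}")
print(f"Approximate 95% confidence interval: ({lower:.3f}, {upper:.3f})")
# Roughly (0.78, 0.92): the sample is used to infer about the larger population.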
Anticipatory: Also known as predictive data analytics, this approach serves numerous important applications through the utilization of AI-based libraries. To facilitate the reuse of business logic capabilities on the chosen datasets, it integrates an analytic service builder. Risk modeling, quality assurance, product propensity, predictive maintenance, and customer segmentation are a few use cases and applications of anticipatory data analytics. In addition to implementing various statistical and analytical methods and providing an environment for creating customized services based on a set of available analytical characteristics, it also provides the ability to run bespoke queries on the datasets that are readily available.
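As a hedged illustration of anticipatory analytics, the sketch below trains a scikit-learn classifier on synthetic customer features to produce product-propensity scores; the feature meanings (recency, frequency, spend) and all data are assumptions made purely for demonstration.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic customer features: assumed to stand for recency, frequency, spend
X = rng.normal(size=(500, 3))
# Synthetic "will buy" label loosely tied to the features
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Propensity score: predicted probability that each held-out customer buys
propensity = model.predict_proba(X_test)[:, 1]
print("Held-out accuracy:", round(model.score(X_test, y_test), 3))
print("First five propensity scores:", propensity[:5].round(2))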

Fig. 3 Classification of Big Data with examples (Structured: traditional databases; Unstructured: social media posts, audio and video; Semi-structured: XML databases, email; Multi-structured: e-commerce data, weather information, web logs)

1.2 Applications Under Categorization of Big Data

In today's world, an enormous amount of data is being generated that is difficult to handle, process, or analyze using conventional data management technologies, and Big Data comes to the rescue. It is defined based on three parameters, i.e., variety, velocity, and volume. Big data frequently includes both structured and unstructured data from a variety of sources, such as social media, sensors, gadgets, and more. The data is classified into different forms as represented in Fig. 3.

1.3 Applications Under Categorization of Machine Learning

Through the revolutionary science of machine learning and artificial intelligence, computers can now learn from data and make judgments without the need for explicit
programming. It now plays a crucial role in our daily lives, influencing advancements
in everything from self-driving vehicles and recommendation systems to natural
language comprehension and medical diagnosis.
Predictive modeling and pattern recognition are the core concepts of machine
learning. Machine learning algorithms are taught on enormous volumes of data,
as opposed to conventional rule-based programming, which enables them to find
patterns, relationships, and insights that would be extremely difficult for humans to
explicitly express. Because of the adaptability and versatility of these algorithms,
machine learning is a fundamental concept of contemporary data-driven decision-
making. Machine learning comes in a variety of forms, the diagrammatic view of it
along with the use case is provided below:
Figure 4 represents the types of machine learning techniques together with a case study for each. For supervised learning, one study applies three different machine learning algorithms to the appraisal of property prices, using a dataset of approximately 40,000 housing transactions spanning 18 years. Similarly, to improve the interpretability of clustering, an unsupervised learning technique is proposed for customer segmentation in both small and big datasets, implemented using explainable AI methodologies; the author even incorporates a decision tree-based method, acknowledging the significance of customer segmentation in the cutthroat commercial world. For semi-supervised learning, [1] recommends combining labeled and unlabeled data for training text classifiers; expectation–maximization (EM) methods are used to estimate the parameters of generative models, and three important findings are examined for text categorization under a bag-of-words paradigm. For reinforcement learning, [2] proposes a unified framework of Markov Decision Processes for solving two crucial problems in customer relationship management: cross-channel integration and customer lifetime value modeling. The study focuses on an IBM Research and Saks Fifth Avenue partnership initiative that optimized direct mail marketing to boost retail channel revenues; the cross-channel challenge arises from unclear links between marketing initiatives in one channel and customer feedback in another, and it was addressed using reinforcement learning.

Fig. 4 Types of machine learning with examples (Supervised Learning: house price prediction; Unsupervised Learning: customer segmentation; Semi-Supervised Learning: text classification; Reinforcement Learning: optimized marketing)
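To ground the unsupervised case in Fig. 4, the following minimal sketch clusters synthetic customers into segments with k-means from scikit-learn; the spend and visit features are invented for illustration and are not taken from the studies cited above.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Two synthetic behaviour groups: columns are monthly spend and store visits
low = rng.normal(loc=[20.0, 2.0], scale=[5.0, 1.0], size=(100, 2))
high = rng.normal(loc=[80.0, 10.0], scale=[10.0, 2.0], size=(100, 2))
customers = np.vstack([low, high])

# Standardize the features, then cluster into two segments
X = StandardScaler().fit_transform(customers)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

for label in np.unique(kmeans.labels_):
    segment = customers[kmeans.labels_ == label]
    print(f"Segment {label}: {len(segment)} customers, "
          f"mean spend {segment[:, 0].mean():.1f}, "
          f"mean visits {segment[:, 1].mean():.1f}")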

2 Motivation Behind Study

It is critical to comprehend the real-world applications of Data Analytics, Big Data,


and Machine Learning in the rapidly advancing field of technology. This book chapter
aims to explain these complex ideas and give readers a guide to navigating the
complex web of applications that are changing society and industries. As we delve
into the chapters, we will discover fascinating tales of how machine learning algo-
rithms have formed the backbone of fraud detection in the financial industry and
how data analysis has enabled retail giants to improve inventory. The story takes
place across a variety of industries, including healthcare, e-commerce, finance, and
more, demonstrating how adaptable these technologies are. This chapter is more
than just a list of use cases; it’s an illustration of how data can change lives. It is
a wake-up call to industries, imploring them to make the most of the potential in
their datasets to inform decisions, improve consumer experiences, and open up new
creative opportunities.

Fig. 5 Significance factor for real-world applications

These real-world success stories are more than just teaching points; they serve as
motivation for companies to change, grow, and prosper in a time when information
is the new currency. This chapter serves as a beacon of guidance for readers at a
time when it is usual to feel overwhelmed by information, offering perspectives
that go beyond theoretical frameworks. It allows them to see directly the real-world
effects of Big Data, Machine Learning, and data analytics, which inspires wonder
and excitement about the seemingly limitless opportunities that lie ahead with the
critical factors shown in Fig. 5.
Let this chapter serve as a source of inspiration for scholars, practitioners, and
enthusiasts alike as we set out on our intellectual journey. It is a call to action, asking
the reader to actively engage in the data revolution, in which data is the driving force
behind hitherto unheard of breakthroughs and knowledge is power.

3 Literature Review

In their seminal work [3] Manyika and colleagues discussed the transformative
impact of Big Data on various industries, including healthcare, finance, and retail.
This work serves as a foundational exploration of the cross-industry impact of Big
Data. It highlights the potential for data-driven decision-making, improved oper-
ational efficiency, and enhanced customer experiences in healthcare, finance, and
retail. Their findings have informed subsequent research and practice in these sectors,
further emphasizing the transformative power of Big Data analytics.
In healthcare, they emphasize the substantial potential of Big Data. They discuss
the use of large datasets to improve patient care, enhance clinical outcomes, and
optimize hospital operations. Big Data analytics enables healthcare providers to make

data-driven decisions, personalize treatments, and detect early signs of diseases. The
result is improved patient outcomes and cost savings within the healthcare sector.
In finance, they highlight the role of Big Data in the financial industry. They discuss
how financial institutions utilize vast amounts of data to better understand customer
behavior, mitigate risks, and detect fraudulent activities. Through advanced analytics
and machine learning, banks and financial services companies can enhance security,
streamline operations, and offer more personalized financial products to customers.
In retail, they underscore the impact of Big Data on consumer insights and supply
chain management. They explain how retailers leverage data analytics to gain a deeper
understanding of consumer preferences, shopping behaviors, and market trends. This
information helps retailers make informed decisions regarding inventory manage-
ment, pricing strategies, and marketing campaigns, ultimately leading to improved
customer experiences and increased profitability.
In the past ten years, management academics in the field of information systems
(IS) have become increasingly interested in the function that Big Data (BD) plays in
promoting business revenue, operations, and customer support. Building Big Data
Analytics (BDA) capabilities is a priority for many established businesses as well as
brand-new ones. The goal is to produce actionable insights from a large variety of reli-
able data so that people or organizations may make and communicate informed deci-
sions. Organizations now need to find ways to leverage the data across smaller areas
of day-to-day management due to the rapid rise in processing capacity accessible to
analysts. These routine management tasks that BDA assisted have now greatly devel-
oped into full-fledged management domains with linkages to traditional management
theories.
When working with high-dimensional data, deep learning architectures such as MLPs and recurrent LSTMs tend to demonstrate performance increases as deeper architectures and larger networks are added. Computer vision, image processing (image
classification and segmentation), and ML classification are just a few of the disci-
plines where DL applications are being deployed and are attracting a lot more interest
and adoption [4]. The era of the Industry 5.0 revolution is one in which enormous
amounts of data are being exchanged digitally. Given the need to analyze and interpret data, machine learning is succeeding in several fields, including intelligent
control, decision-making, speech recognition, natural language processing, computer
graphics, and computer vision. Deep Learning and Machine Learning Techniques
have lately gained widespread recognition and adoption by several real-time engi-
neering applications due to their outstanding performance. Designing automated and
intelligent programs that can manage data in fields like health, cyber-security, and
intelligent transportation systems requires knowledge of machine learning [5].
Every population in the globe now has access to digital technology, which makes
an unprecedentedly large amount of data available. There are numerous benefits to
being able to process these enormous volumes of data in real-time using Machine
Learning (ML) algorithms and Big Data Analytics (BDA) tools. According to Nti et al., however, the abundance of free BDA tools, platforms, and data mining tools makes it difficult to choose the best one for the job. Moving forward, it is necessary to address the concerns that have been identified, namely incomplete and diverse data sources and noisy and erroneous data that impair analytics performance, to allow the easy management of Big Data. Therefore, to minimize human labor, big data analytics designers must effectively automate data pre-processing (such as data cleaning, sampling, and compression) [6].
Arguably the most popular software platform of the twenty-first century is the Internet of Things (IoT): the intelligent technologies and services that now span the entire globe. It is a globally connected network of intelligent gadgets, such as actuators, sensors, RFID tags, and mobile phones, that are dispersed throughout the area being monitored and that interact with one another and carry out particular duties autonomously. Figure 6 represents such applications and demonstrates how, thanks to developments in IoT services, capabilities that were only a pipe dream a few decades ago are now possible. IoT systems are predicted to capture 3.9–11.1 trillion dollars in the US economy by the year 2025, and it was estimated that by 2020 close to 50 billion gadgets would be connected to the web, with their capabilities only growing over time [7].
A machine learning algorithm (MLA) is a strategy or tool that aids the Big Data Analytics (BDA) of applications. It can be used to examine the sizable volume of data produced by a program to make optimal and effective use of the information.

Fig. 6 Combined applications of Data Analysis, Big Data, and Machine Learning based on IoT (healthcare and medical diagnosis; e-commerce, retail and manufacturing; financial services; cybersecurity; supply chain optimization; environmental sustainability)

Machine learning techniques are considered for locating relevant information and data for applications in industry, and they fall under the category of Big Data Analytics (BDA) services. BDA can be utilized to detect deception, manage risk, recognize the reason for a failure, recognize a consumer from purchase detail records, and more. Modern machine learning uses supervised, semi-supervised, and unsupervised methods, and ML is applied to decision-making, model design, interpretation of results, and forecasting in any large data-analytics-oriented program. While in a typical system the program is given rules and data and produces results, in the machine learning approach the program is given data and results and learns the underlying rules. ML spans reinforcement learning, unsupervised learning, and supervised learning; the task of learning a function that maps an input to an output is known as supervised learning [8].
In life sciences, machine learning is progressively gaining traction as a viable computational and analytical tool for the integrated study of vast, diverse, unstructured datasets at Big Data scale. While data creation costs are no longer an important issue for genome-wide research, the processing power needed to analyze terabytes or even petabytes of data is becoming a constraint. The three main challenges of Big Data, which are similar to those faced by every scientific field generating Big Data, are scalable infrastructure for parallel processing, plans for handling massive data sets, and smart data analytics. The environmental science community is looking for innovative solutions to these problems, such as a system for unified Big Data processing built upon the foundation of powerful clusters of computers. One such holistic platform is the open-source Apache Hadoop environment, which consists of the MapReduce software model, the Hadoop distributed file system (HDFS), Hadoop operating instructions, and several tools for storing various types of structured, semi-structured, and unstructured datasets, together with a Big Data repository that combines open datasets and genome libraries with automated workflows for preprocessing, transforming, and querying data in extremely large-volume datasets [9].
Over the last ten years, several Big Data-related issues have been resolved with the help of machine learning algorithms. Currently, unsupervised, supervised, and semi-supervised machine learning (ML) approaches are available in a variety of forms. Similarly, several methods, including classification, preliminary processing (preprocessing), relationship discovery, random forests, support vector machines, and decision trees, are available to address a variety of issues, including machine translation, data imbalance, and advances in robotics. These days, to handle problems such as forecasting and modeling in various applications, a few fundamentals of machine learning approaches must be understood, for instance in big data industries such as e-healthcare (information created via digitally linked intelligent gadgets) as well as other sectors (e-commerce, agriculture, defense, etc.). Because of this, the majority of researchers are perplexed and hesitant to debate or choose which approach or measure to employ in a given application [10].
Big Data has a major effect on organizations in the Industry 4.0 (fourth industrial revolution) period because advances in systems, individuals, and information technology have altered the factors that determine a firm's ability to innovate and remain competitive. Researchers and industry professionals have created
a lot of hype around big data because big data analytics can yield useful insights
and encourage creative company policies that can revolutionize local, national, and
global economies. According to that definition, data science is the set of essential
ideas that support the extraction of expertise and knowledge from data. The methods
and tools employed aid in the analysis of vital data, assisting firms in gaining insight
into their surroundings and making accurate choices. Big data analytics are being
used in every industry (agriculture, physical well-being, power and buildings, finance
and protection, games, nutrition, and public transportation) and global economy due
to the massive increase in data that has resulted from the Internet of Things (constant
growth of connected devices, sensors, and smartphones). It is now recognized globally that ever more data is becoming readily accessible, and data analysis procedures yield significant insights from the data that is available [11].
The application of Business Intelligence (BI) and Big Data for forecasting is known as predictive analytics. Whenever your company gathers enormous amounts of fresh data, what measures do you take? A great deal of new consumer, market, and social media monitoring data, along with real-time mobile application, cloud, and device metrics, is being collected by modern enterprise apps. One method to make the most of all that data,
obtain actionable novel insights, and beat rivals is through predictive analytics. Enter-
prises employ predictive analytics in several manners, ranging from data mining and
predictive marketing to utilizing machine learning (ML) and artificial intelligence
(AI) algorithms to enhance operational efficiency and detect novel statistical trends.
The need for managerial experts has surged due to big data, to the extent that compa-
nies such as Tech AG, Oracle, Inc., Microsoft Corporation, IBM, the SAP system,
EMC, HP, Dell, and Alienware have invested over $15 billion in software companies
that specialize in managing data and analytics [12].
Innovations related to Big Data and machine learning (ML) have an opportunity to
affect various aspects of Environmental and Water Management (EWM). Big Data is
becoming more and more common in many EWM fields, including weather predic-
tion, disaster preparedness, intelligent water and electricity administration systems,
and remote sensing. This is due in part to rapid advancements in high-resolution
image methods for remote sensing, smart computer technology, and social media.
Big Data opens up new possibilities for information-driven discoveries in EWM, yet
it also necessitates new kinds of analytics, data processing, storage, and retrieval.
Machine Learning (ML) is a subfield of Artificial Intelligence (AI) that generally
refers to computer systems with data-driven learning capabilities. If ML and data
analytics are correctly linked, it could help unleash the potential of Big Data. The
EWM literature has already published a significant amount of Big Data and machine
learning applications over the past decade [13].
The expanding notion of “Big Data” demands significant progress in the realm of data science. It makes data quantifiable in multiple ways that enable data research. Machine learning techniques have had a significant impact on society in a wide range of applications. The creation of innovative, dynamic, and cooperative methods based on knowledge of consumer requirements, habits, and skills is essential for appealing and efficient machine learning. Furthermore, ML can be used alongside massive amounts of data to create predictive models that operate well or to address challenging social issues involving data analysis. When incorporated into machine learning techniques, Big Data collected from numerous sources, such as social networks, multiplayer games, virtual worlds, smartphone applications, or various sensor data, can transform the economic sector in the years to come, provided businesses implement them correctly [14].
By fusing functional and past data with predictive tools, business intelligence
systems provide entrepreneurs and executives with relevant and crucial data. The
goal of business intelligence (BI) is to improve the accuracy and speed of data so
that managers can understand their company’s competitive landscape more fully.
With the use of business intelligence tools and technology, it is possible to analyze
alterations to market share, client preferences, spending habits and conduct, organi-
zational skills, and market conditions, among other things. Business intelligence can
also be used by managers and researchers to identify the modifications that have the
greatest chance of adapting to changing trends. Technical strategies include things
like data synthesis, anomaly detection, dependent network discovery, and training
rules for categorization, clustering, and change analysis. Improved software and hard-
ware abilities, the development of web design, the advent of data warehousing as a
repository, and improvements in data cleansing have all contributed to the creation
of a more robust analytical ecosystem than was previously possible [15].
Both machine learning and deep learning are now extremely potent instruments
in a variety of domains, such as audio and picture identification, processing natural
languages, and even medicine. These represent two of the most innovative AI tech-
niques available. Because of their abilities to deliver forecasts, evaluate enormous
data sets, and offer insights that were previously unattainable, they have grown in
popularity over the past few years. The fundamentals of machine learning and deep
learning, as well as their variations, uses, and effects on various sectors, will all
be covered. Our interactions with technology are changing as a result of machine
learning and deep learning, which is also opening up new avenues for creativity.
These advances have already had a big influence on several businesses, and they
could continue to transform the world. As the volume of data created keeps growing and computing power keeps rising, the likelihood that machine learning and deep learning will disrupt multiple sectors and the globe becomes more and more evident, which calls for a thorough comprehension of these technologies, their uses, and how they affect society [16].

4 Proposed Work with Use Cases

In addition to addressing the inherent obstacles and predicting future trends in this quickly developing subject, the proposed work intends to explore the intersection of Data Analytics, Big Data, and Machine Learning, clarifying their convergence and the consequences for a variety of applications. The chapter gives a thorough summary of how these three areas work together and emphasizes how their combined efforts can completely transform how decisions are made in a variety of businesses. Having observed several real-time use cases, we collect those examples in the subsections that follow.

4.1 Use Case Real-Time Applications of Data Analysis

• Optimization of E-commerce Sales: For instance, Amazon examines user browsing and purchase patterns to enhance product recommendations and customize the user experience. Higher sales and happier customers are two benefits of this data-driven strategy [17] (a minimal recommendation sketch follows this list).
• Healthcare Patient Outcomes: As an illustration, hospitals examine electronic
health records using data analysis to evaluate patient outcomes. This helps spot
trends that can strengthen treatment programs, lower readmission rates, and raise
the standard of care in general [18].
• Educational Performance Monitoring: As an illustration, educational establish-
ments examine data on student performance to pinpoint areas in need of develop-
ment, adapt their pedagogy, and put interventions into action to improve learning
outcomes.
• Supply Chain Efficiency: Data analysis is used by businesses such as Walmart
to optimize their supply chains. They can cut expenses and stockouts by studying
transportation data, demand patterns, and inventory levels.
• Engagement on Social Media: Example: To improve their algorithms, social
media companies like Facebook examine user interactions, content choices, and
engagement metrics. They can improve user experience overall and provide users
with more relevant material as a result.
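The recommendation idea behind the e-commerce example above can be sketched with simple co-occurrence counting; the tiny basket data below is hypothetical, and a production recommender would of course use far richer signals than this.

from collections import Counter
from itertools import combinations

# Hypothetical past shopping baskets (invented for illustration)
baskets = [
    {"laptop", "mouse", "laptop_bag"},
    {"laptop", "mouse"},
    {"phone", "phone_case"},
    {"laptop", "laptop_bag"},
]

# Count how often each pair of items appears together
co_occurrence = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_occurrence[(a, b)] += 1
        co_occurrence[(b, a)] += 1

def recommend(item, k=2):
    # Rank other items by how often they co-occur with `item`
    scores = {b: c for (a, b), c in co_occurrence.items() if a == item}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("laptop"))  # e.g. ['mouse', 'laptop_bag'] (ties may reorder)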

4.2 Use Case Real-Time Applications of Big Data

• Financial Fraud Detection: For instance, banks use Big Data to analyze massive volumes of transactions in real time to identify fraudulent activity. Atypical spending habits or other unusual patterns trigger alerts for further investigation, preventing financial losses [19] (see the anomaly-flagging sketch after this list).
• Climate Modeling: As an illustration, climate scientists examine enormous
volumes of meteorological data, such as temperature, precipitation, and atmo-
spheric variables, using Big Data. As a result, they can produce precise climate
models that forecast long-term weather patterns and climatic changes.
• Smart City Infrastructure: As an illustration, Big Data technologies are used by
cities to effectively manage their urban infrastructure. For example, data analytics
and sensors improve public safety, save energy costs, and optimize traffic flow
[20].

• Genomic Research: For instance, to find genetic markers linked to diseases,


genomic researchers examine enormous databases. Personalized medicine
depends on this data to provide customized therapies based on a patient’s genetic
profile.
• Retail Demand Forecasting: For instance, Big Data is used by retailers such
as Walmart to estimate demand precisely. Through the examination of past sales
data, meteorological trends, and social media habits, they may optimize inventory
levels, minimize stockouts, and improve user experience in general.
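In the spirit of the fraud-detection example above, the following hedged sketch flags anomalous transaction amounts with scikit-learn's IsolationForest; the transaction data and contamination rate are illustrative assumptions, not a real banking pipeline.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)
# Mostly typical card payments plus a few large, atypical amounts (all invented)
normal_amounts = rng.normal(loc=60.0, scale=20.0, size=(1000, 1))
suspicious = np.array([[2500.0], [4000.0], [3200.0]])
transactions = np.vstack([normal_amounts, suspicious])

# Flag roughly the most unusual 1% of transactions for review
detector = IsolationForest(contamination=0.01, random_state=7)
flags = detector.fit_predict(transactions)   # -1 = anomaly, 1 = normal

flagged = transactions[flags == -1].ravel()
print(f"{len(flagged)} transactions flagged for review")
print("Largest flagged amounts:", np.sort(flagged)[-3:])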

4.3 Use Case Real-Time Applications of Machine Learning

• Virtual Personal Assistants: For instance, Alexa from Amazon and Siri from
Apple both utilize machine learning to comprehend user commands, adjust to
each user’s unique preferences, and eventually give more precise and customized
responses [21].
• Image and Speech Recognition: For instance, Google Photos uses machine
learning algorithms to automatically identify and classify photos. Similarly,
machine learning powers speech recognition technology seen in virtual assistants
and customer support apps.
• Credit Scoring in Finance: For instance, to evaluate credit risk more precisely, financial firms employ machine learning. By examining a wide range of variables, such as payment patterns and expenditure patterns, these algorithms can produce more nuanced and equitable credit ratings (a minimal scoring sketch follows this list).
• Medical Diagnosis and Imaging: As an illustration, medical imaging uses
machine learning to diagnose diseases like cancer. To help medical personnel
make earlier and more accurate diagnoses, algorithms examine medical images
to find anomalies [22].
• Autonomous Vehicles: For instance, machine learning algorithms are used by
Tesla and other companies to enable autonomous driving. These algorithms aid
in the development of self-driving automobiles by analyzing data from cameras
and sensors in real time to make decisions about vehicle operation.
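As a minimal sketch of the credit-scoring idea, the code below fits a logistic regression on synthetic payment-behaviour features and outputs a default probability per applicant; the feature names and the data-generating rule are assumptions made only for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 1000
# Assumed applicant features: late payments, credit utilization, income (in $10k)
late_payments = rng.poisson(1.0, n)
utilization = rng.uniform(0.0, 1.0, n)
income_10k = rng.normal(5.0, 1.5, n)
X = np.column_stack([late_payments, utilization, income_10k])

# Synthetic default label: risk rises with late payments and utilization,
# falls with income (an invented rule, purely for demonstration)
score = 0.8 * late_payments + 2.0 * utilization - income_10k / 4.0
y = (score + rng.normal(scale=0.5, size=n) > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("Held-out accuracy:", round(model.score(X_test, y_test), 3))
print("Default probability, first applicant:", round(model.predict_proba(X_test)[0, 1], 3))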

5 Implications and Future Scope

a. Implication of Data Analytics

• Increased Decision Accuracy: By analyzing both structured and unstructured data, data analytics helps businesses generate more accurate forecasts and judgments.
• Innovation and New Business Models: Data analytics creates opportunities for innovation, allowing for the development of new goods, services, and business models based on insights from data.
• Improved Customer Insights: By learning more about the behavior and preferences of their customers, businesses may develop marketing tactics that are more precisely focused.
Future Scope
• Edge Analytics: Decentralized processing at the network's edge, which lowers latency and permits real-time analysis, is the direction in which data processing and analysis are heading.
• Data Security and Privacy: As data volumes grow, more attention will be paid to creating strong security and privacy protocols to safeguard sensitive data.
b. Implication of Big Data Analytics
• Increased Decision Accuracy: By analyzing enormous volumes of both struc-
tured and unstructured data, Big Data analytics helps businesses to generate
forecasts and judgments that are more accurate.
• Innovation and New Business Models: Big Data creates opportunities for
innovation, allowing for the development of new goods, services, and business
models based on insights from huge databases.
• Improved Customer Insights: By learning more about the behavior and pref-
erences of their customers, businesses may develop marketing tactics that are
more precisely focused.
Future Scope
• Edge Computing: Decentralized processing at the network’s edge, which
lowers latency and permits real-time data analysis, is the way Big Data will
be processed in the future.
• Data Security and Privacy: As Big Data grows, more attention will be paid to
creating strong security and privacy protocols to safeguard sensitive data.
c. Implication of Machine Learning
• Automated Decision-Making: By using patterns and insights from data,
machine learning enables automated decision-making, which eliminates the
need for human intervention in repetitive operations.
• Personalized Experiences: Machine learning algorithms provide tailored
suggestions, offerings, and exchanges, augmenting user experiences across
several platforms.
• Advanced Problem Solving: Machine learning (ML) makes it possible to
create models that can solve a wide range of complicated issues, from financial
forecasts to medical diagnostics.
Future Scope
• Explainable AI: Improving machine learning models’ interpretability is essen-
tial to their broad adoption, particularly in industries where openness and
accountability are critical.
• AI in Creativity: AI-driven innovation has a bright future thanks to the
incorporation of machine learning in creative industries like art and content
production.

6 Conclusions

In conclusion, the fields of Big Data, Machine Learning, and Data Analysis offer
an amazing array of opportunities with practical uses that are transforming entire
sectors and improving our quality of life. This chapter has provided an in-depth
exploration of the various industries in which these technologies are having a signif-
icant impact, ranging from marketing and transportation to healthcare and banking.
We’ve seen how data-driven insights have transformed decision-making processes,
increased productivity, and produced creative answers to difficult problems. Future
developments will make the integration of these technologies even more common and
essential. Data analysis, big data, and machine learning have the potential to trans-
form many industries and provide major benefits to society. Their potential keeps
opening up new avenues for exploration. To address privacy concerns and minimize
potential biases, it is crucial to approach these technologies with a strong sense of
responsibility, assuring ethical and secure data consumption. The exploration of big
data, machine learning, and data analysis’s practical applications is a continuous
process. With ramifications that go well beyond the pages of this chapter, it is clear
that these tools will become more and more important as we continue to explore
this frontier in our quest for knowledge, growth, and solutions to problems. While
constantly being aware of the significant influence that data-driven technologies are
having on our world, we must continue to be imaginative, flexible, and open to the
revolutionary possibilities of these technologies.

References

1. Kamal, N., Andrew, M., Tom, M.: Semi-supervised text classification using EM. Semi Superv.
Learn. 32–55 (2013). https://doi.org/10.7551/mitpress/9780262033589.003.0003
2. Abe, N., Verma, N., Apte, C., Schroko, R.: Cross channel optimized marketing by reinforcement learning. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2004), pp. 767–772 (2004). https://doi.org/10.1145/1014052.1016912
3. Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., Hung Byers, A.: Big
data: The next frontier for innovation, competition and productivity. McKinsey Glob. Inst. 156
(2011)
4. Khan, M.A., Karim, M.R., Kim, Y.: A two-stage big data analytics framework with real world
applications using spark machine learning and long short-term memory network. Symmetry
(Basel) 10 (2018). https://doi.org/10.3390/sym10100485
5. Kushwaha, A.K., Kar, A.K., Dwivedi, Y.K.: Applications of big data in emerging management
disciplines: a literature review using text mining. Int. J. Inf. Manag. Data Insights 1, 100017
(2021). https://doi.org/10.1016/j.jjimei.2021.100017
6. Nti, I.K., Quarcoo, J.A., Aning, J., Fosu, G.K.: A mini-review of machine learning in big data
analytics: applications, challenges, and prospects. Big Data Min. Anal. 5, 81–97 (2022). https://
doi.org/10.26599/BDMA.2021.9020028
7. Choo, K.K.R., Dehghantanha, A.: Handbook of Big Data Privacy (2020)
8. Rahul, K., Banyal, R.K., Goswami, P., Kumar, V.: Machine Learning Algorithms for Big Data
Analytics. Springer, Singapore (2021)
9. Ma, C., Zhang, H.H., Wang, X.: Machine learning for big data analytics in plants. Trends Plant
Sci. 19, 798–808 (2014). https://doi.org/10.1016/j.tplants.2014.08.004

10. Tyagi, A.K., Rekha, G.: Machine Learning with Big Data Article Info, 1011–1020 (2019)
11. Vassakis, K., Petrakis, E., Kopanakis, I.: Big data analytics: applications, prospects and chal-
lenges. Lect. Notes Data Eng. Commun. Technol. 10, 3–20 (2018). https://doi.org/10.1007/
978-3-319-67925-9_1
12. Ongsulee, P., Chotchaung, V., Bamrungsi, E., Rodcheewit, T.: Big data, predictive analytics and machine learning. In: 2018 International Conference on ICT and Knowledge Engineering, pp. 37–42 (2019). https://doi.org/10.1109/ICTKE.2018.8612393
13. Sun, A.Y., Scanlon, B.R.: How can big data and machine learning benefit environment and
water management: a survey of methods, applications, and future directions. Environ. Res.
Lett. 14 (2019). https://doi.org/10.1088/1748-9326/ab1b7d
14. Son, L.H., Tripathy, H.K., Acharya, B.R., Kumar, R., Chatterjee, J.M.: Machine learning on big
data: a developmental approach on societal applications. Stud. Big Data 43, 143–165 (2019).
https://doi.org/10.1007/978-981-13-0550-4_7
15. Praful Bharadiya, J.: A comparative study of business intelligence and artificial intelligence
with big data analytics. Am. J. Artif. Intell. (2023). https://doi.org/10.11648/j.ajai.20230701.14
16. Koosha, S., Amini, M.: A review of machine learning and deep learning applications. World
Inf. Technol. Eng. J. 7, 3897–3904 (2023). https://doi.org/10.1109/ICCUBEA.2018.8697857
17. Chen, W.H., Lin, Y.C., Bag, A., Chen, C.L.: Influence factors of small and medium-sized
enterprises and micro-enterprises in the cross-border e-commerce platforms. J. Theor. Appl.
Electron. Commer. Res. 18, 416–440 (2023). https://doi.org/10.3390/jtaer18010022
18. De la Cruz, O., Holmes, S.: The duality diagram in data analysis: examples of modern applications. Ann. Appl. Stat. 5(4), 2266–2277 (2011). https://www.jstor.org/stable/23069329
19. Dhone, M.B.: Big data analytics for fraud detection in financial transactions. Maya 38, 31–41 (2023). https://doi.org/10.5281/zenodo.7922883
20. Chluski, A., Ziora, L.: The role of big data solutions in the management of organizations. Review of selected practical examples. Procedia Comput. Sci. 65, 1006–1012 (2015). https://doi.org/10.1016/j.procs.2015.09.059
21. Manojkumar, P.K., Patil, A., Shinde, S., Patra, S., Patil, S.: AI-based virtual assistant using
python: a systematic review. Int. J. Res. Appl. Sci. Eng. Technol. 11, 814–818 (2023). https://
doi.org/10.22214/ijraset.2023.49519
22. Lawrence, N.D.: Challenges in deploying machine learning: a survey of case studies. ACM Comput. Surv. 55 (2022). https://doi.org/10.1145/3533378
Business Transformation Using Big Data
Analytics and Machine Learning

Parijata Majumdar and Sanjoy Mitra

Abstract Artificial intelligence (AI), big data, and business analytics are among the most commonly used cognitive tools in today's business ecosystems,
and they have garnered a lot of attention for their ability to influence organizational
decision-making. With the use of these technologies, firms are able to provide valu-
able data and obtain answers that will improve their performance and provide them
with a competitive advantage. A customer relationship management (CRM) and
enterprise resource planning (ERP) business system, for example, can be integrated
with AI solutions through the AI business platform paradigm. In addition to providing
pattern analysis, big data analytics (BDA) enables automatic future event forecasting.
BDA may revolutionize organizations and create new commercial prospects using
AI. The goal is to highlight the preventive aspects of using AI and ML in conjunc-
tion with big data analytics (BDA) to pursuit digital platforms for business model
innovation and dynamics. Additionally, a thorough assessment of the literature has
been provided with an emphasis on the necessity of business transformation, the
function of BDA, and the role of AI. One particular case study namely Big Mart
Sales forecasting was discussed, compared and analyzed in the context of business
transformation. The chapter discusses the possible obstacles to firms implementing
AI and BDA. It will offer firms a roadmap for utilizing AI and BDA to generate
commercial value.

P. Majumdar
Department of Computer Science and Engineering, Techno College of Engineering,
Maheshkhola, Agartala, Tripura 799004, India
e-mail: er.parijata@gmail.com
S. Mitra (B)
Department of Computer Science and Engineering, Tripura Institute of Technology, Narsingarh,
Agartala, Tripura 799009, India
e-mail: mail.smitra@gmail.com


1 Introduction

Business Analytics (BA) is used to transform data into insights to improve company
decisions [1]. A few methods for extracting insight from data are data manage-
ment, data visualization, predictive modeling, data mining, prediction, simulation,
and optimization. The term BA describes the knowledge, tools, and procedures used
in iteratively exploring and analyzing historical business performance in order to
provide insights and inform future strategy [2]. Data-driven businesses actively look
for ways to leverage their data to their advantage and regard it as an asset. The effec-
tiveness of BA depends on high-quality data, skilled analysts who are familiar with
the market, technology, and a dedication to use data to extract knowledge that guides
company choices [3]. Prior to conducting any data analysis, BA begins with a number
of fundamental procedures, which include ascertaining the analysis's commercial purpose, choosing the approach for analysis, obtaining business data from various sources and systems to support the analysis, and combining and cleansing the data in a data mart or data warehouse. Tactical decision-making in reaction to unanticipated occur-
rences is also supported by BA [4]. Various forms of business analytics consist of
prescriptive analytics, which suggests actions to take in the future based on past
performance; predictive analytics uses trend data to gauge the chance of future
events, while descriptive analytics monitors key performance indicators (KPIs) to
assess a company’s present situation [5]. To enable real-time responses, artificial
intelligence (AI) is widely utilized to automate decision-making. AI is capable of
inquiring, generating and testing hypotheses, and autonomously generating judg-
ments based on sophisticated analytics applied to large number of datasets [6]. In
the field of AI, computers learn from data by using appropriate algorithms which
enables computers to extract hidden patterns or correlations in data without having
to be specifically trained to do so to solve a specific problem. Large data quantities
that may be generated, processed, and increasingly employed by digital tools and
information systems for generating descriptive, prescriptive, and predictive anal-
yses are referred to as big data analytics (BDA) [7]. Three Vs, namely volume, velocity, and variety, are central to the standard concept of big data and best describe it [8]. Furthermore, other dimensions of Big Data that the top solution providers
later defined and included are Veracity, Variability, and Value Proposition [9]. The
growing availability of structured data, the capacity to handle unstructured data,
improvements in computer power, and expanded data storage capabilities are the
main drivers of this capability. A continuous flow of processed data inputs into AI
platforms and applications is made possible by the BDA value chain. Addition-
ally, BDA supplies the AI platform with necessary inputs that allow it to process
large amounts of data quickly, efficiently, and from different data structures [10].
Industry 4.0 is powered by BDA, AI, cloud, and IoT. This supports use cases in
both consumer and corporate domains, from enhancing business performance and
agility to delivering personalized services. Personalized services, chatbots, AI and
BDA platforms, and AI integration to boost company performance and agility are
just a few of the multifarious consumer and corporate use cases that AI and BDA,
together with IoT, cloud, 5G, and cybersecurity, are enabling. The proliferation of
real-time data sources, including call records, location data, and customer purchase
patterns makes BDA possible. Key technologies like computer vision, context-aware
computing, and conversational platforms are the primary enablers of AI. To augment
the AI’s inherent knowledge, machine learning (ML) and/or deep learning (DL)
methods are employed. Systems that adjust their behavior based on contextual data,
such as location, temperature, light, humidity, and hand movements, are referred to as
context-aware computer systems. Conversational platforms use a range of technolo-
gies, such as natural language processing (NLP), speech recognition through NLP,
ML, and contextual awareness to facilitate human-like interactions. An enterprise resource planning (ERP) business system and a customer relationship management (CRM) system, for example, can be integrated with AI solutions through the AI business platform paradigm. CRM and ERP are the two main software programs that companies are willing to use to automate crucial business processes [6]. While CRM helps organizations manage how customers interact with them, ERP helps firms run more efficiently by linking their operational and financial processes to a central database. CRM software keeps track of every interaction of the customer with the company; CRM elements were initially created for sales departments. Having a common database for all financial and operational data
is one of the key advantages of an ERP system. Improved automation of business
operations, more individualized communications, and providing customers the most
helpful answers to their concerns are made possible by combining generative AI with
CRM [6]. The goal of AI in CRM is to enable it to manage intelligent recommen-
dations and analysis about a prospect or customer based on all the data the system
has gathered about them. AI’s superior analytics, forecasting, automation, personal-
ization, and optimization capabilities can improve ERP systems. To understand the
role of BDA and AI in BA, as well as the necessity of business transformation, a
thorough literature review is included. A particular case study, namely Big Mart sales forecasting, is discussed, compared, and analyzed in the context of business transformation. To provide enterprises with a roadmap for utilizing BDA and AI for commercial value, significant obstacles to implementing these technologies in their operations are also explored, along with potential solutions.

2 Related Works

High-level management commitment is required for business transformation, and this commitment is driven by internal, external, and technology elements within an orga-
nization. Every department of a corporation is affected. Improving overall business
operations’ performance and efficiency is always the long-term objective of a busi-
ness transformation. First, the business model of the organization must be matched
with its core competencies. Next, non-value-generating activities must be eliminated
from the newly created value-generating business model using technology. With the
aid of BDA, businesses may regain control over data and utilize it to find novel busi-
ness prospects. This can help businesses make quicker and more shrewd business
decisions and increase productivity, profits, and customer satisfaction [11]. BDA is
the process of addressing dynamic client requests and maintaining a competitive edge
by using statistical and analytical techniques on large data sets. Similar to human
capital and capital resources, big data is an equally important resource for enhancing
financial and social aspects [12]. Businesses are facing immense pressure to adopt
BDA in order to stay competitive. However, BDA’s ability to realize a company’s
strategic business value, which might give them a competitive edge, is ultimately
what will determine its level of success. Businesses find it difficult to determine the
true worth of big data and how investing in it might yield real commercial benefits.
The combination of newly generated business knowledge and its actual application
in business can be used to understand the value of the data [13]. Integrating AI into sustainable business models (SBMs) aims to achieve sustainable production and consumption using scientific and technological capacities. This application of AI in SBMs is emphasized in [14]. Many firms have already switched to AI for sustainable growth and development in an unmanageable climate [15]. According to Giuf-
frida et al.’s [16] review of the literature on logistics optimization techniques, smart
or sustainable logistics frequently employ machine learning and hybrid techniques.
Loureiro and Nascimento [17] state that augmented reality (AR), virtual reality (VR), IoT, AI, the circular economy, and BDA have become important themes in tourism research. These findings increased our understanding of the potential future effects of
technology on sustainable tourism. All corporate organizations need to have “Tech-
nological Intelligence” [18]. As a result, businesses must assess the breadth and depth
of AI and BDA applications and pinpoint any areas that appear more susceptible to
disruptions. A wide range of industries have become open to AI applications in the
past ten years. Among these applications are supply chain management, dentistry,
medicine, diagnosis and treatment, modeling for pandemic response and forecasting,
commercial banking and stock market forecasts, power quality assurance, AI, ML,
deep reinforcement learning (DRL) for smart cities, and business operations. These
days, stock price index changes are predicted using ML algorithms [19]. A random forest model can be used to assess risk for an excavation system, as demonstrated in [20]. Natural language processing (NLP) is used by Amazon to analyze user experience and customer feedback, and by Twitter to filter out extremist language from
messages [21]. Additionally, sentiment analysis using text mining, topic modeling,
etc. increasingly relies on NLP [22]. AI robots are agents designed to behave responsibly; the robot Sophia is the most exemplary example of a social humanoid. Some examples of AI-based conversational chatbots built from big data language models are Google's LaMDA, Meta's BlenderBot, and OpenAI's GPT-3 [23]. Massive volumes of data that are too big to handle with conventional data management techniques are referred to as "big data". Bendre and Thool [24] state that, as BDA requires significant data storage and processing power expenses, Return on Investment (ROI) is another concern. On the other hand, the decreasing costs
associated with data collection have prompted the broad use of BDA across several
businesses. BDA has many applications. The availability of software and statistical
techniques to address big data concerns like class imbalance and high dimension-
ality is the boon of using data mining in health informatics [25]. BDA has a huge
potential for use in the medical field. Among the enormous volumes of health data
it contains are gene expression, data sequencing, electronic health records, doctor
notes, prescriptions, data from biological sensors, and data from online social media
[26]. BDA can improve policy execution by obtaining knowledge and insights from
existing data [27]. BDA provides assistance with police monitoring, crime graphs,
recording of crimes, tracking of terror threats, and defense [28]. The primary goals
of government are to promote public talent through enterprise collaborations and to
seek benefits through the use of IoT, crowd sourcing, and data sources. BDA applica-
tions in the banking sector are improving security and changing services [29]. Some
possible applications for BDA include a client-focused business, improved security
management and services, and cross-selling with more adaptability. Additionally,
BDA aids in comprehending the requirements of both clients and staff for offering
enhanced methods for quality and service management [30]. In [31] it is stated
that in the Big Data era, intelligence is identified as a developing field. Its actions
are focused on decision making, it aims to enhance enterprise competitiveness and
economic sectors, and it is a practice that is both morally and legally acceptable.
Lastly, intelligence would be a major factor in how well a business performed in
terms of creativity. According to [13], AI has the capacity to completely transform twenty-first-century society. Preparing a better AI society has become
popular due to increased public and scientific interest in the ethical frameworks,
regulations, incentives, and values needed for a society to reap the benefits of AI
while mitigating its perils. AI is slowly making its way from research and develop-
ment facilities into the corporate sector. The power of AI and applied AI (AAI) is
being combined by elite businesses and millions of industries worldwide. In order to
increase customer satisfaction, the majority of company industries use ML algorithms
to detect scams in milliseconds. To satisfy business needs, there has been a noticeable
increase in the development of ML tools, business platforms, and applications-based
tools [32]. Research on AI in marketing is reviewed in [33]. They demonstrate how
AI is utilized in the creation of marketing strategies and plans as well as in product,
pricing, place, and promotion management. In [34], examples are provided to show
how BDA can improve the effectiveness of an organization. They also list a number
of research areas that are expanding quickly, such as text mining, evolutionary algo-
rithms, and risk management for customers and finances. In [35], the advantages and
difficulties of BDA in businesses are assessed. They discovered through their theme
and content research that BDA is an important factor in strategic decision-making.
The paper also mentions that as data collection costs come down, BDA usage is
accelerating. Furthermore, BDA is often applied to effective supply chain manage-
ment. Neural networks have become important to predict company bankruptcy and
credit risk to enhance profitability, as demonstrated in [36]. Bibliometric research in [37] finds that AI and ML are currently used in several financial domains, and the analysis brought attention to a growing trend in applications of AI and ML. In
[38] a pattern is discovered in the knowledge-based systems’ subject shift. The Latent
Dirichlet Allocation (LDA) topic modeling is utilized to predict future trends and
profile the hotspots of KnoSys. The main study fields that they highlighted are fuzzy,
ML, data mining, decision making, expert systems, and optimization. The results
additionally demonstrate that the communities inside KnoSys are becoming more
interested in computational intelligence and that building useful systems through
the application of knowledge and precise forecasting models is a top priority. AI
is a broad technological tool that now includes six sub-domains namely ML, deep
learning, robotics, fuzzy logic, natural language processing, and expert systems. Each
sub-domain can be given special attention by future academics, who can then inves-
tigate its applicability in other domain knowledge [39]. Similar to this, BDA is an
all-encompassing method for handling, processing, and evaluating large data [32].
Therefore, a wide range of methods, including data mining, multimedia analytics, and
cognitive modeling, are included in BDA. Some companies use big data models and
technologies based on cloud computing. GoogleFS, for example, is a distributed file system for applications that generate big data [40]. There are numerous instances of BDA
being successfully used in the real world to transform businesses. There are many opportunities for BDA adoption in the retail sector. Businesses that have been using BDA for their marketing and sales campaigns for the past five years have reported a 15–20%
return on investment (ROI). Retail businesses use BDA to enhance various aspects
of their business, including supply chain, marketing, vending, and store manage-
ment [41]. Currently, the retail industry’s stakeholders can use BDA to maximize
profits and prevent or lessen the migration of customers from physical retail stores
to online retailers. For instance, in the Walmart retail industry, tools such as Hadoop,
cluster sales, clickstream, online data, and social media data are used in conjunction
with online, social media, and predictive analytics capabilities as well as trend, data
visualization, and market basket analysis. In the subsequent section, a particular case study, namely Big Mart sales forecasting, is discussed and analyzed in the context of business transformation, leveraging different ML algorithms whose performance is compared using different performance metrics. Value creation
processes involve discovery, forecasting, tracking, personalization, and optimization.
Gained benefits include higher sales, lower expenses, higher customer satisfaction,
and better performance. Amazon is another online retailer, where real-time BDA,
models, and S3 and Dynamo data warehouse capabilities are used. Customization,
forecasting, and ML are the methods utilized to create value. Improved client loyalty
and experience are among the benefits [42].

3 Big Mart Sales Forecasting: A Case Study Using Different ML Algorithms

In this case study, we look at how product sales from a specific outlet are estimated
using a two-level method, and we also look at how different ML algorithms may be
utilized for predictive learning according to predictive performance indicators. Data
exploration, data translation, and feature engineering are crucial tasks for accurately

Table 1 Attributes information of dataset

Variable | Description | Relation to hypothesis
Item_Identifier | Unique product ID | ID variable
Item_Weight | Weight of product | Not considered in hypothesis
Item_Fat_Content | Whether or not the product is low in fat | Connected to the "Utility" hypothesis; most of the time, low-fat products are used more than others
Item_Visibility | The percentage of a store's overall display area that is devoted to a specific product | Linked to the "Display Area" hypothesis
Item_Type | The category the product falls under | From this, more conclusions regarding "Utility" can be drawn
Item_MRP | The product's maximum retail price (list price) | Not considered in hypothesis
Outlet_Identifier | Unique store ID | ID variable
Outlet_Establishment_Year | The year in which the store was established | Not considered in hypothesis
Outlet_Size | The store's size in terms of the amount of ground it covers | Connected to the "Store Capacity" hypothesis
Outlet_Location_Type | The type of city in which the store is located | Connected to the "City Type" hypothesis
Outlet_Type | Whether the establishment is a supermarket of some kind or merely a grocery shop | Connected to the "Store Capacity" hypothesis again
Item_Outlet_Sales | Product sales in that specific store; the outcome variable to be predicted | Outcome variable

projecting results. Table 1 provides a description of the dataset. There are 8523
distinct data points in the dataset. The dataset is available online [43].

3.1 Dataset Characteristics

There are 12 attributes in the dataset which are described in Table 1.
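A quick exploration pass can verify these characteristics before any modeling. The listing below is a minimal sketch in Python with pandas, assuming the public Big Mart CSV from [43]; the file name used here is an assumption rather than part of the original study.

import pandas as pd

# Load the Big Mart training data (file name is an assumption).
df = pd.read_csv("big_mart_train.csv")

print(df.shape)                            # rows and attributes, e.g. (8523, 12)
print(df["Outlet_Identifier"].nunique())   # count of unique outlets
print(df["Item_Identifier"].nunique())     # count of unique products
print(df["Item_Fat_Content"].unique())     # exposes inconsistent labels such as "low fat"
print(df.isnull().sum())                   # attributes with missing values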



3.2 Methodology

The dataset is wrangled and prepared for training. The dataset is cleaned (where
duplicates are removed, errors are corrected, missing values are imputed, and normal-
ization is done). The impacts of the specific order in which we gathered and/or
otherwise prepared the data are then eliminated when the data is randomized. After
that, additional exploratory research is carried out and data visualization is used
to help identify pertinent correlations between variables or class imbalances (bias
alert). Following that, training (70%) and testing (30%) datasets are created. The
dataset has yielded valuable information during the process of data exploration. To accomplish this, data from the accessible sources are compared against the information implied by the hypotheses. Some values are not appropriate in their raw form; the establishment year, for instance, is translated into the age of the specific outlet. The dataset contains 10 unique outlets and 1559 unique products, and there are sixteen distinct values in Item type. There are also inconsistent labels, such as "low fat" instead of "Low Fat (LF)" and "regular" instead of "Regular (RL)". For data cleaning, mean and mode imputation are used to substitute missing attribute values, which keeps the correlation between the reconstructed attributes low. Certain peculiarities found during the data exploration phase are resolved in this phase so that a suitable model can be built; it is assumed that all products are likely to be sold. All differences in categorical attributes are corrected by replacing them with the appropriate labels, and a third item-fat-content category, "none", is added for products to which fat content does not apply. It was found that the unique ID in the item identifier attribute starts with either DR, FD, or NC. As a result, a new attribute, Item Type New, is created and assigned one of three categories: foods, drinks, or non-consumables. From the actual dataset, only the following features are chosen: ItemWeight, ItemFatContent, ItemVisibility, ItemType, ItemMRP, OutletEstablishmentYear, OutletLocationType, and OutletType.
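The preprocessing described above can be sketched roughly as follows. This is a minimal illustration assuming the CSV from [43] with the column names of Table 1 and a pandas/scikit-learn environment; the file name, the reference year for the outlet age, and the exact imputation and encoding choices are assumptions rather than the precise pipeline behind the reported results.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("big_mart_train.csv")  # assumed file name

# Harmonize inconsistent fat-content labels.
df["Item_Fat_Content"] = df["Item_Fat_Content"].replace(
    {"low fat": "Low Fat", "LF": "Low Fat", "reg": "Regular", "regular": "Regular"})

# Impute missing values: mean for a numerical attribute, mode for a categorical one.
df["Item_Weight"] = df["Item_Weight"].fillna(df["Item_Weight"].mean())
df["Outlet_Size"] = df["Outlet_Size"].fillna(df["Outlet_Size"].mode()[0])

# Translate the establishment year into the age of the outlet (reference year assumed).
df["Outlet_Age"] = 2013 - df["Outlet_Establishment_Year"]

# Derive a coarse item category from the first two letters of the product ID (FD/DR/NC).
df["Item_Type_New"] = df["Item_Identifier"].str[:2].map(
    {"FD": "Food", "DR": "Drinks", "NC": "Non-Consumable"})

# Add a third fat-content category for items to which fat content does not apply.
df.loc[df["Item_Type_New"] == "Non-Consumable", "Item_Fat_Content"] = "None"

# One-hot encode categorical features and create the 70/30 train/test split.
features = ["Item_Weight", "Item_Fat_Content", "Item_Visibility", "Item_Type",
            "Item_MRP", "Outlet_Age", "Outlet_Location_Type", "Outlet_Type"]
X = pd.get_dummies(df[features])
y = df["Item_Outlet_Sales"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)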

3.2.1 Prediction Algorithms

The next step is model building where different predictive models are used for
forecasting of sales. The prediction models are discussed:

Decision Tree

Decision tree (DT) functions by building a structure like a tree that illustrates the
connections between a dataset’s attributes and the desired variable. Choosing the
optimal attribute to act as the decision tree’s root node is the first step. The dataset is
divided into subsets according to the values of the specified attribute once the root
node has been picked. The procedure is then repeated for each child node, this time

choosing the best attribute to split on from the other attributes. A leaf node is formed
when one of the halting criteria is satisfied [44].

Support Vector Regression

Finding the hyperplane that contains the maximum number of points within the best-fit margin is the goal of support vector regression (SVR). Rather than minimizing the difference between the real and predicted values, the SVR aims to fit the best line within a given threshold value. That particular value represents the distance between the boundary line and the hyperplane. Support vectors are the nearest data points on either side of the hyperplane [44].

Random Forest

In Random Forest (RF), bootstrapping, also known as bagging, is used to solve regression problems by combining many decision trees. Rather than depending solely on an individual decision tree to reach a conclusion, multiple decision trees are integrated. RF uses the bootstrap approach, which involves randomly selecting rows and features to build a large number of decision trees. The more trees there are in the forest, the more accurate the model tends to be, and the less likely it is to overfit. In bagging, random samples of the training data are drawn with replacement, and a separate tree is trained on each sample [44].

Extreme Gradient Boosting

Extreme Gradient Boosting (XGBoost) combines several weak learners to generate a powerful learner. Using this method, decision trees are constructed one after the
other. XGBoost relies heavily on weights. To help in outcome forecasting, each
independent variable is assigned a weight before being fed into the decision tree.
The second decision tree assigns greater weight to the criteria that the first decision
tree mispredicted. Subsequently, the discrete classifiers/predictors amalgamate to
generate a resilient and precise model. Among the tasks it can handle are user-defined
forecasting problems, regression, classification, and ranking [44].
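As a rough illustration only, the four regressors described above might be trained on the prepared split along the following lines, assuming scikit-learn and the xgboost package; the hyperparameters shown are illustrative defaults, not the settings behind the results reported in Sect. 3.3.

from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# One model per algorithm discussed above; hyperparameters are illustrative only.
models = {
    "Decision Tree": DecisionTreeRegressor(max_depth=8, random_state=42),
    "Support Vector Regression": SVR(kernel="rbf", C=1.0, epsilon=0.1),
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=42),
    "XGBoost": XGBRegressor(n_estimators=300, learning_rate=0.05,
                            max_depth=6, random_state=42),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)              # training split from Sect. 3.2
    predictions[name] = model.predict(X_test)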
The sample outputs of the XGBoost, DT, SVR, and RF algorithms for Big Mart sales forecasting are shown in Fig. 1.

3.3 Performance Analysis

Table 2 displays an error-metric performance study of the various forecasting methods, including Mean Absolute Error (MAE), Mean Square Error (MSE), Root Mean Square Deviation (RMSD), and Explained Variance Score (EVS) [45].

Fig. 1 a The sample output of Extreme Gradient Boosting algorithm. b The sample output of
Decision Tree. c The sample output of Support Vector Regression. d The sample output of Random
Forest for Big Mart Sales Forecasting

Table 2 Forecasting output in terms of performance error metrics

Prediction algorithm | MAE | MSE | RMSD | EVS
(a) Extreme Gradient Boosting algorithm | 0.1112 | 0.1524 | 0.3903 | 0.8726
(b) Decision tree | 0.2654 | 0.2949 | 0.5430 | 0.7549
(c) Support vector regression | 0.2321 | 0.2659 | 0.5156 | 0.7129
(d) Random forest | 0.1792 | 0.2193 | 0.4682 | 0.8223

The EVS, which explains the errors' distribution in a given dataset, is expressed in Eq. (1):

\text{explained variance}(y, \hat{y}) = 1 - \frac{\operatorname{Var}(y - \hat{y})}{\operatorname{Var}(y)}    (1)

Here, the variances of the actual values and of the prediction errors are denoted by Var(y) and Var(y − ŷ), respectively. Scores close to 1.0 are desirable, as they indicate that the variance of the errors is small relative to the variance of the actual values.
When evaluating the efficacy of a regression model, MAE is calculated as the average absolute difference between the predicted and actual values, expressed in Eq. (2):

\text{MAE} = \frac{\sum_{i=1}^{n} |y_i - x_i|}{n}    (2)

where n is the total number of data points, y_i is the predicted value, and x_i is the true value.
MSE computes the average of the squares of the differences between the estimated and actual values, expressed in Eq. (3):

\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (3)

where y_i is the predicted value, ŷ_i is the true value, and n is the total number of data points.
The sample standard deviation of the differences between the predicted and observed values is described by RMSD, expressed in Eq. (4):

\text{RMSD} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \hat{x}_i)^2}{N}}    (4)

Here, i indexes the observations, N represents the number of non-missing data points, x_i is the actual observation time series, and x̂_i represents the estimated time series.
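These four metrics can be computed directly from the test-set predictions, for instance with scikit-learn; the snippet below is a sketch that reuses the predictions dictionary from the earlier listing.

import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             explained_variance_score)

for name, y_pred in predictions.items():
    mae = mean_absolute_error(y_test, y_pred)        # Eq. (2)
    mse = mean_squared_error(y_test, y_pred)         # Eq. (3)
    rmsd = np.sqrt(mse)                              # Eq. (4)
    evs = explained_variance_score(y_test, y_pred)   # Eq. (1)
    print(f"{name}: MAE={mae:.4f}, MSE={mse:.4f}, RMSD={rmsd:.4f}, EVS={evs:.4f}")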
Table 2 illustrates how the XGBoost method outperforms other machine learning
algorithms when it comes to various error metrics, which are then followed by RF.

4 Approaches for Deriving Knowledge from Big Data

There are numerous approaches and methods for deriving knowledge from unpro-
cessed big data. Most of the time, data scientists use ML to create high-performance
software agents that can learn from the data, and statistics to evaluate various
knowledge-related hypotheses. For extracting and utilizing knowledge, scientists
utilize data cleaning, data transformation, and data visualization. Some of the targets
of data mining and knowledge extraction process are classification to predict the
class of a new item, clustering of datasets into distinct categories, identifying asso-
ciation or relation between two or more variables in a large dataset, summarizing
the properties of datasets, and identifying events, observations or items that do not
follow a specific pattern. Data mining experts utilize several tools and techniques
such as decision trees, Bayesian methods, linear regression, and k-means clustering, and parameterize them to suit specific business requirements.
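As a small illustration of one of these targets, clustering a dataset into distinct categories with k-means might look as follows; the customer-style features and values are hypothetical and serve only to show the technique, not any dataset used in this chapter.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical records: annual spend, visit frequency, average basket value.
X = np.array([
    [1200.0, 24, 50.0],
    [ 300.0,  6, 45.0],
    [5000.0, 60, 83.0],
    [ 250.0,  4, 62.0],
    [4200.0, 55, 76.0],
    [1500.0, 20, 75.0],
])

# Standardize so that no single attribute dominates the distance computation.
X_scaled = StandardScaler().fit_transform(X)

# Partition the records into three clusters; n_clusters is a tunable business parameter.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

print(labels)                    # cluster index assigned to each record
print(kmeans.cluster_centers_)   # cluster centroids in the scaled feature space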

5 Challenges for Adopting BDA and AI in Business

This section outlines the potential challenges and probable solutions for adopting
BDA and AI in business. Researchers can focus on each of AI’s subdomains sepa-
rately and investigate how it can be applied to different fields of expertise. Similarly, big data handling, processing, and management are addressed holistically through BDA [36].
Subsequent studies could be devoted to understanding the value that a particular
BDA tool provides to industry, government, management, and policymakers. The AI
and BDA applications in different industries may be compromised concerning user
privacy and data security [32]. Nonetheless, in cases involving these kinds of issues,
current corporation law is unable to deliver justice [12]. Hence, in order to preserve ethical norms and maintain the benefits that AI and BDA provide across a range of industries, corporate governance may need to create legal frameworks. It has
been noted that recorded data are frequently erroneous, noisy, or incomplete. This
presents a significant obstacle to the data analysis process. Thus, in order to apply
analytics, extensive data cleaning and preparation are necessary. Another problem
is the data’s ongoing exponential growth, which makes it challenging for businesses
to verify its reliability. When it comes to BDA, the veracity problem is thought to be the most challenging one, surpassing even volume, velocity, and variety. Busi-
nesses are urging their clients to participate in surveys, evaluations, and feedback
since it might aid in the development of new products. To remove noise and irreg-
ularities from the data, it is therefore imperative to preprocess the data. Because
the organization receives heterogeneous data from several sources, it is very chal-
lenging to integrate and run various analytics in order to produce actionable business
insights quickly. As a result, it is critical for a company to build up a strong data
management system that handles a variety of data, continuously investigates data
upon request, and generates business insights that business decision makers can use.

Another significant obstacle when creating data assets is data security. It is crucial for
businesses to plan ahead for any kind of data breach, develop a mechanism to iden-
tify them in real time to reduce the negative effects, and create extremely secure and
reliable data management systems. Organizational leadership and strategy are the
main management obstacles that can impede the effective implementation of BDA
[39]. A company’s active leadership must be in line with its strategic objectives in
order to succeed. A significant challenge for businesses is determining and assessing
the business value of BDA. While determining the ROI from BDA and dissecting
the relationship between BDA and business outcomes, organizations should exer-
cise caution. This necessitates appropriately mapping data, analytics, and business processes to the optimal business outcomes and assessing their impact on achieving those outcomes [46]. BDA is an effective instrument for turning a company around
and generating strategic business value. To find new data-driven business opportunities, firms must invest in BDA, in addition to hiring qualified analysts and adopting a strategic positioning plan.

5.1 Challenges Faced by Business Analyst

A business analyst is in charge of several projects at once and has a lot on their plate, including managing project deadlines, engaging with stakeholders, executing projects, and upholding client relationships. The main challenges are discussed here:
• Absence of domain expertise
To comprehend the needs, a business analyst must work in tandem with the busi-
ness users. Understanding the requirements completely and with clarity depends
heavily on domain knowledge.
• Absence of current procedures
A project’s success does not happen quickly. To get results, a great deal of work
and mental fatigue must first be expended. Subsequently, the majority consists of
the current project maintenance and development procedure. The absence of current
methods and paperwork is the main problem. Productivity is often hampered by
insufficient project documentation.
• Shifting demands or specifications for the business
As business analysts have seen, requirements are often revised at the request of
business stakeholders even after they have been established and authorized. It is one
of the most common problems because it may occur more than once, even for the
same demand.
• Absence of important stakeholders
This will lead to a number of issues because they won’t be informed about discus-
sions regarding the most recent requirements. They will either be unable to articulate
their thoughts or they will later suggest changes.
• Unrealistic timeframes
Business analysts might encounter a challenging scenario where deadlines are an
issue. Then pressure is generated, which could interfere with their work. If so, be
aware of how to handle the situation while preserving the caliber of the work.
• Technical proficiency
It's a misconception that business analysts don't need technical skills. On the contrary, many of them excel at coding, are adept at maintaining business procedures, and have a talent for fulfilling requirements technically.
• Professionalism
One of the most neglected, undervalued, and underpaid groups in the IT industry
is the business analyst. They often act as a liaison between the technical and business
aspects of a project. They are the ones who support the project from start to finish
and who contribute to the creation of the project plan.
• Disagreement between users
Business analysts may occasionally find themselves in a position where they are unable to comprehend the user's complaint. This typically occurs during the product launch
phase and may manifest as unkind comments. When a team makes a new strategy
suggestion that is relevant to the current business process, there may even be conflict
between stakeholders and business analysts.

5.2 Challenges in Big Data Management and Governance

A structured framework for decision-making and authority over data and data-related
topics is provided by data governance. Businesses that attempt to manage their
contemporary ecosystems may encounter numerous obstacles. Large, monolithic
systems that held the majority of the world’s mission-critical data have become less
common. These days, businesses add CRM software, digital marketing automation,
e-commerce platforms, customer support tools, and other features to their massive
ERP systems. Adhoc data sets of lesser size are also typical; these typically take the
form of little home-grown databases or Excel workbooks. Thus, the size of data is
seeing an increasing trend day by day. Controlling organized data is not too difficult.
Determining the characteristics of the data and identifying records that don’t live up
to expectations is a pretty straightforward proposition. For unstructured data, this
is not the case. Companies are facing a massive amount of unstructured data in the
form of social media, audio, video, and online reviews. Such unstructured data is
frequently moving and disorganized. Unstructured data has a wide range of quality
attributes. Every possible point of failure for data assets should be taken into consid-
eration by governance. This covers hazards such as noise, incorrect data defaults,
and loss of sensor signal. Data is growing in volume, velocity, and variety. B2B and
consumer interactions are now conducted digitally. IoT devices provide precise loca-
tion, temperature, time, and many other attribute data. Data in the form of audio and
video are more crucial than ever. According to research by Gartner, professionals spend half of their workday looking for information and take eighteen minutes on average to trace a document. The issue is becoming worse as data volumes increase, under-
scoring the pressing need for rules-based workflows and automation to speed up data
quality and governance tasks. In addition to assisting businesses in creating a cohe-
sive and comprehensive view of their own internal data, data governance provides the
framework for enhancing data with demographic information, geographic context,
and other elements that enhance the value of the conclusions and choices drawn from
the data. Crucial factors are business value, security, and compliance. Each of these,
such as hiding personally identifiable information (PII), may have an impact on the
requirements for data integrity, storage, and access. Organizations can increase busi-
ness value by addressing these essential requirements and optimizing the contextual
richness of their data through the use of data governance frameworks. If governance
initiatives are not developed with people, processes, and technology in mind, they
will not have much of an impact. Frameworks for data governance need to be compre-
hensive, built on cross-functional cooperation, common language, and a shared set
of metrics and standards. Transparency across the board may not even be sufficient; technologies for monitoring data quality cannot operate in isolation from governance.

6 Applications of Big Data Analytics and Machine Learning

Big data analytics are now used in banking and securities for trade visibility, customer
data transformation, enterprise credit risk reporting, tick analytics, card fraud detec-
tion, archiving of audit trails, social analytics for trading, IT operations analytics, and
IT policy compliance analytics. Big data analytics are used to solve Industry-specific
Big Data Challenges like collecting, analyzing, and utilizing consumer insights and
understanding patterns of real-time, media content usage. Big data analytics are
used in the health sector, where certain hospitals are utilizing patient data gathered from a mobile app, spanning millions of interactions, to enable physicians to practice evidence-based medicine instead of requiring a battery of lab tests. While a battery of tests has the potential to be effective, it is typically inefficient and costly. One university has created visual data that enables quicker identification and effective analysis of healthcare information, used in tracking the spread of chronic disease; this has been done using free public health data and Google Maps. Higher educa-
tion makes extensive use of big data. A learning management system can be implemented using big data to keep track of a variety of things, including the time a
student spends on various system pages, when they log on, and their overall progress
over time. Big Data has a wide range of uses in public services, such as environ-
mental protection, energy exploration, financial market analysis, fraud detection,
and health-related research. Big data analytics can be utilized for optimized staffing using information from nearby events, shopping trends, and other sources, for decreased instances of fraud, and for prompt inventory analysis. Marketing makes exten-
sive use of big data in order to gain a deeper understanding of customer behavior and
preferences. Among the applications of big data in marketing is customer segmentation, where customers are divided into groups based on their behavior and preferences using big data technologies that analyze customer data. This enables marketers to design campaigns that are more focused and successful. By analyzing
consumer data, big data technologies can make tailored offers and recommenda-
tions. By utilizing big data technologies, it is possible to forecast future trends and
behaviors by analyzing customer behavior. This can help to increase the efficacy of
marketing campaigns and provide guidance for marketing strategies.

7 Conclusion

Technology-enabled changes are becoming a well-known strategy for enhancing consumer satisfaction and business performance in the modern environment. Busi-
ness leaders are paying close attention to the newest tools that are gaining popularity
for business transformation namely big data, business analytics, IoT, AI, and busi-
ness intelligence. In this article, the AI solutions with the help of which BDA can
transform organizations and generate novel business opportunities are reviewed. An
extensive literature study has been conducted to present the application of BDA
in conjunction with AI to pursue digital platform business model innovation and
dynamics. This review is carried out to provide future research directions for the firms
to leverage BDA and AI for value creation. A particular case study namely Big Mart
sales forecasting was discussed where different ML algorithms have been employed
for forecasting sales. The XGBoost algorithm provides more accurate performance in comparison to the other ML algorithms, with lower error values (MAE, RMSD, and MSE) and a higher EVS. This case study was conducted because a standard sales forecasting method
can assist in thoroughly examining business situations to infer strengths, insufficient
funding, and customer satisfaction prior to creating a suitable marketing plan and
budget for the upcoming year that can positively influence decision-making, process
improvement, and other goals.

References

1. Elgendy, N., Elragal, A.: Big data analytics: a literature review. In: Advances in Data Mining.
Applications and Theoretical Aspects: 14th Industrial Conference, St. Petersburg, Russia 14:
214–227 (2014)
2. Russom, P.: Big data analytics. TDWI Best Pract. Rep. Fourth Quart 19(4), 1–34 (2011)
3. Zakir, J., Seymour, T., Berg, K.: Big data analytics. Issues Inf. Syst. 16(2) (2015)
4. Power, D.J., Heavin, C., McDermott, J., Daly, M.: Defining business analytics: an empirical
approach. J. Bus. Anal. 1(1), 40–53 (2018)
5. Delen, D., Ram, S.: Research challenges and opportunities in business analytics. J. Bus. Anal.
1(1), 2–12 (2018)
6. Goundar, S., Nayyar, A., Maharaj, M., Ratnam, K., Prasad, S.: How artificial intelligence is
transforming the ERP systems. Enterp. Syst. Technol. Converg.: Res. Pract. 85 (2021)
7. Chatterjee, S., Rana, N.P., Tamilmani, K., Sharma, A.: The effect of AI-based CRM on organi-
zation performance and competitive advantage: an empirical analysis in the B2B context. Ind.
Mark. Manag. 97, 205–219 (2021)
8. Cavanillas, J.M., Curry, E., Wahlster, W.: The big data value opportunity. In: New Horizons
for a Data-Driven Economy: A Roadmap for Usage and Exploitation of Big Data in Europe,
3–11 (2016) https://doi.org/10.1007/978-3-319-21569-3_1
9. Zillner, S., Bisset, D., Milano, M., Curry, E., Hahn, T., Lafrenz, R., et al.: Strategic research,
innovation and deployment agenda—AI, data and robotics partnership, p. 3. BDVA, euRobotics,
ELLIS, EurAI and CLAIRE, Brussels (2020)
10. Curry, E.: The big data value chain: definitions, concepts, and theoretical approaches. In: New
Horizons for a Data-Driven Economy: A Roadmap for Usage and Exploitation of Big Data in
Europe, 29–37 (2016). https://doi.org/10.1007/978-3-319-21569-3_3
11. InGRAM, 6 big data use cases in retail, 2017 [online]. Imaginenext.ingrammicro.com.
Available at: https://imaginenext.ingrammicro.com/data-center/6-big-data-use-cases-in-retail.
Accessed 7 November 2019
12. Sheikh, R.A., Goje, N.S.: Role of big data analytics in business transformation. Internet Things
Bus. Transform.: Dev. Eng. Bus. Strat. Ind. 5, 231–259 (2021)
13. Kumar, R.: A framework for assessing the business value of information technology
infrastructures. J. Manag. Inf. Syst. 21(2), 11–32 (2004)
14. Di Vaio, A., Palladino, R., Hassan, R., Escobar, O.: Artificial intelligence and business models
in the sustainable development goals perspective: a systematic literature review. J. Bus. Res.
121, 283–314 (2020)
15. Hu, F., Liu, W., Tsai, S.B., Gao, J., Bin, N., Chen, Q.: An empirical study on visualizing
the intellectual structure and hotspots of big data research from a sustainable perspective.
Sustainability 10, 667 (2018)
16. Giuffrida, N., Fajardo-Calderin, J., Masegosa, A.D., Werner, F., Steudter, M., Pilla, F.: Opti-
mization and machine learning applied to last-mile logistics: a review. Sustainability 14, 5329
(2022)
17. Loureiro, S.M.C., Nascimento, J.: Shaping a view on the influence of technologies on
sustainable tourism. Sustainability 13, 12691 (2021)
18. Chen, H., Chiang, R.H., Storey, V.C.: Business intelligence and analytics: from big data to big
impact. MIS Q. 1, 1165–1188 (2012)
19. Thayyib, P.V., Mamilla, R., Khan, M., Fatima, H., Asim, M., Anwar, I., Shamsudheen, M.K.,
Khan, M.A.: State-of-the-art of artificial intelligence and big data analytics reviews in five
different domains: a bibliometric summary. Sustainability 15(5), 4026 (2023)
20. Lin, S.S., Shen, S.L., Zhou, A., Xu, Y.S.: Risk assessment and management of excavation
system based on fuzzy set theory and machine learning methods. Autom. Constr. 122, 103490
(2021)
21. Mukherjee, S., Bala, P.K.: Detecting sarcasm in customer tweets: an NLP based approach. Ind.
Manag. Data Syst. 117(6), 1109–1126 (2017)
22. Mantyla, M.V., Graziotin, D., Kuutila, M.: The evolution of sentiment analysis—a review of
research topics, venues, and top cited papers. Comput. Sci. Rev. 27, 16–32 (2018)
23. O’Leary, D.E.: Massive data language models and conversational artificial intelligence:
emerging issues. Intell. Syst. Account. Financ. Manag. 29, 182–198 (2022)
24. Bendre, M.R., Thool, V.R.: Analytics, challenges and applications in big data environment: A
survey. J. Manag. Anal. 3, 206–239 (2016)
25. dos Santos, B.S., Steiner, M.T.A., Fenerich, A.T., Lima, R.H.P.: Data mining and machine
learning techniques applied to public health problems: a bibliometric analysis from 2009 to
2018. Comput. Ind. Eng. 138, 106120 (2019)
26. Iaksch, J., Fernandes, E., Borsato, M.: Digitalization and big data in smart farming—a review.
J. Manag. Anal. 8, 333–349 (2021)
27. Kim, G.H., Trimi, S., Chung, J.H.: Big-data applications in the government sector. Commun.
ACM 57, 78–85 (2014)
28. Rajagopalan, M., Vellaipandiyan, S.: Big data framework for national e-governance plan.
In: Proceedings of the 2013 Eleventh International Conference on ICT and Knowledge
Engineering, Bangkok, Thailand. 1–5 (2013)
29. Ravi, V., Kamaruddin, S.: Big data analytics enabled smart financial services: opportunities
and challenges. In: Proceedings of the International Conference on Big Data Analytics, 15–39
(2017)
30. Fang, B., Zhang, P.: Big data in finance. In: Big Data Concepts, Theories, and Applications,
391–412 (2016)
31. Lopez-Robles, J.R., Otegi-Olaso, J.R., Gomez, I.P., Cobo, M.J.: 30 years of intelligence models
in management and business: a bibliometric review. Int. J. Inf. Manag. 48, 22–38 (2019)
32. Wamba, S.F., Bawack, R.E., Guthrie, C., Queiroz, M.M., Carillo, K.D.A.: Are we preparing
for a good AI society? A bibliometric review and research agenda. Technol. Forecast. Soc.
Chang. 164, 120482 (2021)
33. Mishra, S., Tripathi, A.R.: Literature review on business prototypes for digital platform. J.
Innov. Entrep. 9, 23 (2020). https://doi.org/10.1186/s13731-020-00126-4
34. Verma, S., Sharma, R., Deb, S., Maitra, D.: Artificial intelligence in marketing: systematic
review and future research direction. Int. J. Inf. Manag. Data Insights 1, 100002 (2021)
35. Batistic, S., van der Laken, P.: History, evolution and future of big data and analytics: a biblio-
metric analysis of its relationship to performance in organizations. Br. J. Manag. 30, 229–251
(2019)
36. Khanra, S., Dhir, A., Mantymaki, M.: Big data analytics and enterprises: a bibliometric
synthesis of the literature. Enterp. Inf. Syst. 14, 737–768 (2020)
37. Linnenluecke, M.K., Marrone, M., Singh, A.K.: Conducting systematic literature reviews and
bibliometric analyses. Aust. J. Manag. 45, 175–194 (2020)
38. Erevelles, S., Fukawa, N., Swayne, L.: Big data consumer analytics and the transformation of
marketing. J. Bus. Res. 69, 897–904 (2016)
39. Siemens, G.: Learning analytics: the emergence of a discipline. Am. Behav. Sci. 57, 1380–1400
(2013)
40. Nicolae, B., Moise, D., Antoniu, G., Bouge, L., Dorier, M.: BlobSeer: bringing high throughput
under heavy concurrency to Hadoop Map-Reduce applications. In: Proceedings of the 2010
IEEE International Symposium on Parallel & Distributed Processing (IPDPS), Atlanta, GA,
USA, 1–11 (2010)
41. Ding, Y., Jin, M., Li, S., Feng, D.: Smart logistics based on the internet of things technology:
an overview. Int. J. Logist. Res. Appl. 24, 323–345 (2021)
42. Hewage, T., Halgamuge, M., Syed, A., Ekici, G.: Review: Big data techniques of Google,
Amazon, Facebook and Twitter. J. Commun. 13(2), 94–100 (2018)
43. Kuila, A.: Big data sales prediction (2023) [online]. Available at: https://www.kaggle.com/dat
asets/akashdeepkuila/big-mart-sales. Accessed 15 November 2023
44. Majumdar, P., Bhattacharya, D., Mitra, S.: Prediction of evapotranspiration and soil moisture
in different rice growth stages through improved salp swarm based feature optimization and
ensembled machine learning algorithm. Theor. Appl. Climatol., 1–25 (2023)
45. Majumdar, P., Bhattacharya, D., Mitra, S., Solgi, R., Oliva, D., Bhusan, B.: Demand prediction
of rice growth stage-wise irrigation water requirement and fertilizer using Bayesian genetic
algorithm and random forest for yield enhancement. Paddy Water Environ. 21(2), 275–293
(2023)
46. Wang, Y., Kung, L., Byrd, T.A.: Big data analytics: understanding its capabilities and potential
benefits for healthcare organizations. Technol. Forecast. Soc. Change J. 126, 3–13 (2018)
